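Understanding the Task, Loading the Data, and Data Analysis

1. Competition Data

Data description

The data consists of news text that has been anonymized at the character level. The articles are grouped into 14 candidate categories.

Data: the training set contains 200,000 samples; test set A contains 50,000 samples and test set B contains 50,000 samples.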

label	text
6	57 44 66 56 2 3 3 37 5 41 9 57 44 47 45 33 13 63 58 31 17 47 0 1 1 69 26 60 62 15 21 12 49 18 38 20 50 23 57 44 45 33 25 28 47 22 52 35 30 14 24 69 54 7 48 19 11 51 16 43 26 34 53 27 64 8 4 42 36 46 65 69 29 39 15 37 57 44 45 33 69 54 7 25 40 35 30 66 56 47 55 69 61 10 60 42 36 46 65 37 5 41 32 67 6 59 47 0 1 1 68

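The category names map to labels as follows:

{'科技': 0, '股票': 1, '体育': 2, '娱乐': 3, '时政': 4, '社会': 5, '教育': 6, '财经': 7, '家居': 8, '游戏': 9, '房产': 10, '时尚': 11, '彩票': 12, '星座': 13}

In English, in label order: technology, stocks, sports, entertainment, current politics, society, education, finance, home, games, real estate, fashion, lottery, horoscope.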

Previewing the data

import pandas as pd

# The file is tab-separated with two columns: label and text
train_df = pd.read_csv('./train_set.csv', sep='\t')
train_df.head(10)

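2. Evaluation Metric

Submissions are scored with the F1 score. The scikit-learn call below uses average='macro', that is, the unweighted mean of the per-class F1 scores.

$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$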

from sklearn.metrics import f1_score

y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
# Only class 0 is ever predicted correctly, so the macro average is about 0.267
f1_score(y_true, y_pred, average='macro')
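
To make the macro average concrete, here is a minimal pure-Python sketch that computes the per-class F1 from the formula and then averages the results. The helper name `macro_f1` is illustrative only; undefined precision or recall is treated as 0 here.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes seen in either list."""
    scores = []
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        # Treat undefined ratios (zero denominators) as 0
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


y_true = [0, 1, 2, 0, 1, 2]
y_pred = [0, 2, 1, 0, 0, 1]
print(round(macro_f1(y_true, y_pred), 4))  # 0.2667
```

Class 0 has precision 2/3 and recall 1, giving F1 = 0.8; classes 1 and 2 have no correct predictions, so the mean is 0.8 / 3.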

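3. Approaches

The competition data is anonymized, so operations such as Chinese word segmentation cannot be applied directly; this is the main difficulty of the task. Text is a typical form of unstructured data, so a solution generally involves two parts: feature extraction and a classification model. Some possible approaches are listed below for reference: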

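Approach 1: TF-IDF + machine-learning classifier

Extract features from the text directly with TF-IDF and feed them to a classifier. Possible classifiers include SVM, LR (logistic regression), and XGBoost.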
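
A minimal sketch of this approach with scikit-learn, run on a toy stand-in for the anonymized data. The toy texts, the n-gram range, and the choice of logistic regression are assumptions made for illustration, not settings taken from the competition.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data: space-separated character IDs, two made-up classes
texts = [
    "57 44 66 56 2 3", "57 44 47 45 33", "66 56 57 44 2",
    "900 12 49 18 38", "12 49 20 50 23", "18 38 12 900 49",
]
labels = [0, 0, 0, 1, 1, 1]

# token_pattern=r"\S+" keeps every ID as a token; the default pattern
# would drop single-character IDs such as "2" or "3".
model = make_pipeline(
    TfidfVectorizer(token_pattern=r"\S+", ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

pred = model.predict(["57 44 66", "12 49 18"])
print(pred)
```

On the real data, `train_df['text']` and `train_df['label']` would take the place of the toy lists, ideally with a held-out validation split scored by macro F1.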

Approach 2: FastText

FastText is an entry-level word-embedding method. With the FastText tool released by Facebook, a classifier can be built quickly.
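
FastText's supervised mode reads one sample per line, with the label marked by the default `__label__` prefix. The sketch below only prepares that input format; the training call is shown as a comment, and the file name and hyperparameters in it are placeholders rather than tuned values.

```python
def to_fasttext_line(label, text):
    """Format one sample as '__label__<label> <text>' (fastText's default label prefix)."""
    return "__label__{} {}".format(label, text)


samples = [(6, "57 44 66 56 2 3"), (1, "12 49 18 38")]
lines = [to_fasttext_line(label, text) for label, text in samples]
print(lines[0])  # __label__6 57 44 66 56 2 3

# After writing the lines to a file (the path below is hypothetical),
# training would look roughly like this:
#   import fasttext
#   model = fasttext.train_supervised(input='train_fasttext.txt',
#                                     lr=1.0, epoch=25, wordNgrams=2)
#   model.predict("57 44 66")
```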
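
Approach 3: Word2Vec + deep-learning classifier

Word2Vec is a more advanced word embedding; classification is then done by a deep-learning model built on top of the vectors. The network architecture can be TextCNN, TextRNN, or BiLSTM.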

Approach 4: BERT embeddings

BERT is the high-end option for word representations, with strong modeling capacity.

4. Data Analysis

Text length analysis

%matplotlib inline
import matplotlib.pyplot as plt

# Number of space-separated character IDs in each text
train_df['text_len'] = train_df['text'].apply(lambda x: len(x.split(' ')))
print(train_df['text_len'].describe())

Histogram

_ = plt.hist(train_df['text_len'], bins=200)
plt.xlabel('Text char count')
plt.title("Histogram of char count")

Class distribution

train_df['label'].value_counts().plot(kind='bar')
plt.title('News class count')
plt.xlabel("category")

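Character frequency statistics

from collections import Counter

# Join all texts into one string, then count every character ID
all_lines = ' '.join(list(train_df['text']))
word_count = Counter(all_lines.split(" "))
word_count = sorted(word_count.items(), key=lambda d: d[1], reverse=True)
print(len(word_count))   # number of distinct characters
print(word_count[0])     # most frequent character and its count
print(word_count[-1])    # least frequent character and its count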

==> 6969
==> ('3750',7482224)
==> ('3133',1)
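
Raw counts show which characters are frequent, but not how widely they are spread. One complementary check, sketched here on toy texts, is document frequency: the number of texts in which a character appears at least once. A character present in nearly every text is a plausible punctuation candidate. The helper name `document_frequency` and the toy data are illustrative assumptions.

```python
from collections import Counter


def document_frequency(texts):
    """For each token, count how many texts contain it at least once."""
    doc_freq = Counter()
    for text in texts:
        # set() ensures a token is counted once per text
        doc_freq.update(set(text.split()))
    return doc_freq


texts = ["57 44 3750 66 900", "12 3750 49 3750", "18 38 900 3750"]
doc_freq = document_frequency(texts)
for tok, n in doc_freq.most_common(2):
    print(tok, n / len(texts))
```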

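Conclusions

Texts average about 1,000 characters, so truncation may be needed.
The class distribution is imbalanced, which can seriously hurt model accuracy.
In total there are roughly 7,000 to 8,000 distinct characters (6,969 of them appear in the training set).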
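
The first two conclusions suggest two common preprocessing steps, sketched below under simple assumptions: truncation keeps only the first `max_len` tokens, and class weights follow the n_samples / (n_classes * class_count) heuristic (the same formula scikit-learn uses for class_weight='balanced'). Both helper names and the toy inputs are illustrative.

```python
from collections import Counter


def truncate(tokens, max_len):
    """Keep the first max_len tokens; shorter sequences are returned unchanged."""
    return tokens[:max_len]


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


tokens = "57 44 66 56 2 3 3 37".split()
print(truncate(tokens, 5))  # ['57', '44', '66', '56', '2']

labels = [0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
weights = balanced_class_weights(labels)
print({c: round(w, 2) for c, w in weights.items()})  # {0: 0.56, 1: 1.11, 2: 3.33}
```

Rare classes receive larger weights, so errors on them cost more during training. Keeping the head of a text is only one truncation strategy; keeping the tail or a mix of both is equally possible.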

Exercises

Assuming that characters 3750, 900, and 648 are punctuation marks, how many sentences does each news article contain on average?

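import re

# \b keeps the pattern from matching inside longer IDs such as '13750' or '6480'.
# A text that ends with one of these IDs yields one extra empty segment, so this slightly overcounts.
train_df['sent'] = train_df['text'].apply(lambda x: len(re.split(r'\b(?:3750|900|648)\b', x)))
average = train_df['sent'].mean()
print('Average number of sentences: {:.2f}'.format(average))
train_df['sent'].describe()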
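
A token-level alternative, sketched here on toy strings, avoids regular expressions altogether. It treats a run of consecutive punctuation IDs as closing at most one sentence and counts trailing text without closing punctuation as a sentence. The helper name `count_sentences` is illustrative.

```python
PUNCT = {"3750", "900", "648"}  # assumed punctuation IDs, as stated in the exercise


def count_sentences(text):
    """Count sentences by walking the tokens one at a time."""
    count, in_sentence = 0, False
    for tok in text.split():
        if tok in PUNCT:
            # Punctuation closes a sentence only if some text preceded it
            if in_sentence:
                count += 1
            in_sentence = False
        else:
            in_sentence = True
    # Text after the last punctuation mark still forms a sentence
    return count + 1 if in_sentence else count


print(count_sentences("57 44 3750 66 56 900 2 3"))   # 3
print(count_sentences("57 44 3750 13750 9001 648"))  # 2
```

On the DataFrame this would be applied as `train_df['text'].apply(count_sentences)`, followed by `.mean()` for the average.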

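Find the most frequent character in each news category.

%%time
from collections import Counter

PUNCT = {'3750', '900', '648'}

def top_char(texts):
    # Count tokens over all texts of one class, skipping the assumed punctuation IDs.
    # Filtering whole tokens avoids removing part of a longer ID such as '6480'.
    counts = Counter(tok for text in texts for tok in text.split() if tok not in PUNCT)
    return counts.most_common(1)

print(train_df.groupby('label')['text'].apply(top_char))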