Environment: Ubuntu 14.04, Gensim, jieba
First, segment the Chinese text with jieba:
python -m jieba wiki.zh.text.jian.utf-8 > cut_result.txt
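Note: by default the jieba command line joins tokens with ' / ' (a slash padded with spaces), which is why the script below splits on '/' and then strips the padding. If a plainer delimiter is preferred, jieba's CLI accepts a -d option; for example (same input file):

python -m jieba -d ' ' wiki.zh.text.jian.utf-8 > cut_result.txt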
Extract the first 30,000 documents:
head -n 30000 cut_result.txt > cut_small.txt
The processing script is as follows:
from gensim import corpora

train_data = []
with open('cut_small.txt', 'r') as f:
    for line in f:
        # python -m jieba pads the '/' delimiter with spaces, so strip
        # each token (this also drops the trailing newline)
        tokens = [w.strip() for w in line.decode('utf8').strip().split('/')]
        train_data.append([w for w in tokens if w])
dic = corpora.Dictionary(train_data)
corpus1 = [dic.doc2bow(text) for text in train_data]
# build the plain id-sequence corpus from train_data rather than
# re-reading and re-splitting the file
corpus2 = [[dic.token2id[w] for w in doc] for doc in train_data]
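With corpus1 in bag-of-words form, the dictionary and corpus can be persisted and fed to a model. A minimal sketch using gensim's standard save/serialize API (the output filenames here are illustrative):

from gensim import corpora, models

dic.save('wiki_small.dict')                            # persist the id <-> token mapping
corpora.MmCorpus.serialize('wiki_small.mm', corpus1)   # persist the BoW corpus to disk
tfidf = models.TfidfModel(corpus1)                     # e.g. train a TF-IDF transform
corpus_tfidf = tfidf[corpus1]                          # stream of TF-IDF vectors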