Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0
Approach: first, pull each user's set of search queries out of the data warehouse; then segment the queries into words and count each word's frequency; finally, output a user-by-word matrix whose values are the search counts.
Steps:
1. Query the Oracle database for each employee's set of search queries
select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;
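The query result has to end up in the text file that step 2 reads (/u01/jerry/emp_query_conten.txt, one employee per line: the id, a space, then the concatenated query text). The original post does not show how it was exported; the following is only a minimal Python sketch using cx_Oracle, with the connection string as a placeholder assumption.
# Export sketch (assumption: cx_Oracle is installed; the connection string is a placeholder).
import cx_Oracle

conn = cx_Oracle.connect('user/password@host:1521/orcl')
cur = conn.cursor()
cur.execute("select employee_id, to_char(yd_concat(q_content)) "
            "from agg_kw_daily group by employee_id")

out = open('/u01/jerry/emp_query_conten.txt', 'w')
for employee_id, q_content in cur:
    # one line per employee: id, a space, then the concatenated query text
    out.write('%s %s\n' % (employee_id, q_content))
out.close()
cur.close()
conn.close()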
2. Segment the queries and output the user-by-word matrix
# -*- coding: utf-8 -*-
from gensim import corpora
import jieba

# Each line of the input file is "employee_id <space> concatenated query text",
# so split only on the first space to keep the query text intact.
q_content = [i.strip().split(' ', 1) for i in open('/u01/jerry/emp_query_conten.txt').readlines()]

# Segment each employee's query text with jieba.
train_set = []
for i in q_content:
    train_set.append(list(jieba.cut(i[1])))

# Drop punctuation, empty tokens and stop words.
stop_words = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
train_set2 = []
for i in train_set:
    train_set2.append([j for j in i if j not in stop_words])

# Build the word dictionary (word -> id) and the bag-of-words corpus.
dic = corpora.Dictionary(train_set2)
corpus = [dic.doc2bow(text) for text in train_set2]

# Keep only the words an employee searched more than once.
corpus2 = []
for i in corpus:
    corpus2.append([j for j in i if j[1] > 1])

# Python 2 hack so unicode words can be written to plain files below.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Dump the dictionary: word id and word, one pair per line.
output = open('/u01/jerry/qw_dic', 'w')
for key, value in dic.iteritems():
    output.write(str(key) + ' ' + value + '\n')
output.close()

# Preview the triples: employee_id, word id, count.
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        print q_content[i][0], j[0], j[1]

# Write the same triples to the file that Spark will read in step 3.
output = open('/u01/jerry/emp_q_cnt', 'w')
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        output.write(str(q_content[i][0]) + ' ' + str(j[0]) + ' ' + str(j[1]) + '\n')
output.close()
3. Feed the output file emp_q_cnt into Spark MLlib to train the prediction model
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
// Load the (employee_id, word_id, count) triples written in step 2.
val data = sc.textFile("/home/cloudera/emp_q_cnt")
// The three fields are space-separated, matching the Python output above.
val ratings = data.map(_.split(' ') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) })
val rank = 10
val numIterations = 1000
val model = ALS.train(ratings, rank, numIterations, 0.01)
4. Look at the predicted value of a given word for a given user (user 10008, word id 2)
model.predict(sc.parallelize(Array((10008, 2)))).map{case Rating(user, item, rate) => ((user, item), rate)}.take(1)
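The item id in the prediction is a gensim dictionary id, not the word itself. To see which word id 2 corresponds to, look it up in the qw_dic file written in step 2; a minimal Python sketch (the helper lookup_word is introduced here for illustration only):
# Map a word id from the prediction back to the word, using the qw_dic dump
# from step 2 (each line is "word_id <space> word").
def lookup_word(word_id, dic_path='/u01/jerry/qw_dic'):
    for line in open(dic_path):
        fields = line.rstrip('\n').split(' ', 1)
        if fields[0] == str(word_id):
            return fields[1]
    return None

print lookup_word(2)  # the word behind item id 2 in the prediction above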