January 2015 – 强的部落格

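Employee badge found in dramatic fashion

Today was the first working day of the new year, and as soon as I sat down I found the employee badge I had lost. I remembered losing it somewhere in the office area, but it turned out to be behind my own chair. (This is the second time I have lost it, which cost another hundred.) I had already had it reissued, so there is no way to get a refund. I need to be more careful next time!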

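Upgrade Ubuntu with caution

Environment: Ubuntu 12.04, Ubuntu 14.04, WordPress 4.0, OpenCart 1.5, PostgreSQL 9.1, MySQL 5

Yesterday I upgraded the system to Ubuntu 14.04 and found that my existing WordPress and OpenCart installations were both completely down (cue cold sweat). Both home pages were blank, with no error message at all, so all I could do was search the web frantically. The fixes I ended up with are below.

wordpress:

The error was "Cannot select database" (PostgreSQL). I tried all sorts of fixes, and in the end the only thing that worked was downgrading WordPress to 3.4.2.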

wget https://cn.wordpress.org/wordpress-3.4.2-zh_CN.tar.gz

Redeploy it under /var/www. (Note that Apache 2 on Ubuntu 14.04 uses a different DocumentRoot than before. Edit the site configuration:

sudo vi /etc/apache2/sites-enabled/000-default.conf

and change DocumentRoot /var/www/html to DocumentRoot /var/www.)
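The manual edit above can also be scripted. The following is only a minimal sketch in Python: set_document_root is a hypothetical helper name, and it assumes an unquoted DocumentRoot path without spaces. It works on the text of the configuration file and leaves reading and writing the file (which needs root) to the caller.

```python
import re

# Matches a DocumentRoot directive at the start of a line (leading spaces or tabs allowed).
DOCROOT_RE = re.compile(r'^([ \t]*DocumentRoot[ \t]+)\S+', re.MULTILINE)

def set_document_root(conf_text, new_root):
    """Return conf_text with every DocumentRoot directive pointing at new_root."""
    return DOCROOT_RE.sub(lambda m: m.group(1) + new_root, conf_text)

# A made-up sample in the style of 000-default.conf
sample = """<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html
</VirtualHost>
"""

print(set_document_root(sample, "/var/www"))
```

The replacement is built with a function rather than a template string so that backslashes in the new path are not interpreted as regex escapes.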

Another issue is that the default WordPress theme has to be changed; otherwise you still only get a blank page.

opencart:

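Add the following line to index.php to display errors:

<?php
ini_set('display_errors', 'on');
?>

This revealed the error: Fatal error: Call to undefined function mcrypt_create_iv(), which means the mcrypt extension was not loaded.

Reinstall mcrypt and php5-mcrypt: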

sudo apt-get install mcrypt
sudo apt-get install php5-mcrypt

php -m | grep mcrypt

Enable the module:

sudo php5enmod mcrypt

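I spent nearly half a day today fixing these two applications. Solutions found online only point you in a direction; you have to dig into the problem yourself to locate and fix it. One more thing: be very careful when upgrading the system!

Recommendations based on search terms

Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0

Approach: first extract each person's set of search terms from the data warehouse, then segment those search terms into words and count how often each word occurs. Finally, output a user-by-word matrix in which each value is the number of times that user searched for that word.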
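To make the target data shape concrete, here is a minimal sketch in plain Python 3 of turning already-segmented search terms into (user, word id, count) triples, that is, the sparse form of the user-by-word matrix. build_triples and its min_count parameter are hypothetical names for illustration; gensim and jieba are deliberately left out, and the sample tokens are made up.

```python
from collections import Counter

def build_triples(user_tokens, min_count=1):
    """user_tokens: dict of user id -> list of tokens.

    Returns (vocab, triples), where vocab maps each word to an integer id
    and triples is a list of (user, word_id, count).
    """
    vocab = {}
    triples = []
    for user, tokens in user_tokens.items():
        counts = Counter(tokens)
        for word, n in sorted(counts.items()):
            if n < min_count:
                continue  # skip words this user searched for too rarely
            word_id = vocab.setdefault(word, len(vocab))
            triples.append((user, word_id, n))
    return vocab, triples

# Made-up sample data
sample = {10008: ["spark", "hadoop", "spark"], 10009: ["hadoop"]}
vocab, triples = build_triples(sample)
print(vocab)
print(triples)
```

Only non-zero cells are stored, which is what a recommender expects as input when most users never search for most words.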

Steps:
1. Query Oracle for each employee's set of search terms
select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;

2. Segment the terms and output the user-by-word matrix

# Python 2 (the print statement, reload(sys) and iteritems() below are Python 2 only)
from gensim import corpora
import jieba

# Each input line is "<employee_id> <search terms>"
q_content = [i.split(' ') for i in open('/u01/jerry/emp_query_conten.txt').readlines()]

# Segment each employee's search terms with jieba
train_set = [list(jieba.cut(i[1])) for i in q_content]

# Drop punctuation, whitespace and a few stop words
stop_words = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
train_set2 = []
for i in train_set:
    train_set2.append([j for j in i if j not in stop_words])

dic = corpora.Dictionary(train_set2)
corpus = [dic.doc2bow(text) for text in train_set2]

# Keep only words that an employee searched for more than once
corpus2 = []
for i in corpus:
    corpus2.append([j for j in i if j[1] > 1])

import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Save the word id -> word mapping
output = open('/u01/jerry/qw_dic', 'w')
for key, value in dic.iteritems():
    output.write(str(key) + ' ' + value + '\n')
output.close()

for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        print q_content[i][0], j[0], j[1]

# Save "employee_id word_id count" lines for Spark
output = open('/u01/jerry/emp_q_cnt', 'w')
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        output.write(str(q_content[i][0]) + ' ' + str(j[0]) + ' ' + str(j[1]) + '\n')
output.close()

3. Load the output file emp_q_cnt into Spark MLlib and train a prediction model

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

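val data = sc.textFile("/home/cloudera/emp_q_cnt")
// emp_q_cnt is written with single spaces between fields, so split on whitespace rather than '\t'
val ratings = data.map(_.split("\\s+") match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) })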

val rank = 10
val numIterations = 1000
val model = ALS.train(ratings, rank, numIterations, 0.01)

4. Check the predicted value for a given user and word (user 10008, word 2)

model.predict(sc.parallelize(Array((10008, 2)))).map{case Rating(user, item, rate) => ((user, item), rate)}.take(1)
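For intuition about what ALS.train and model.predict are doing, here is a toy sketch of alternating least squares in plain Python 3. It is not Spark's implementation: the rank is fixed at 1 so that each update has a closed form, the regularization is a plain constant, and als_rank1, its parameters and the sample ratings are all hypothetical.

```python
from collections import defaultdict

def als_rank1(ratings, n_iters=100, lam=1e-6):
    """Rank-1 ALS on observed (user, item, value) triples.

    Returns (user_f, item_f); the prediction for (u, i) is user_f[u] * item_f[i].
    """
    by_user = defaultdict(list)
    by_item = defaultdict(list)
    for user, item, value in ratings:
        by_user[user].append((item, value))
        by_item[item].append((user, value))

    user_f = {u: 1.0 for u in by_user}
    item_f = {i: 1.0 for i in by_item}

    for _ in range(n_iters):
        # Hold item factors fixed; each user factor is a 1-D ridge regression.
        for u, obs in by_user.items():
            num = sum(v * item_f[i] for i, v in obs)
            den = lam + sum(item_f[i] ** 2 for i, _v in obs)
            user_f[u] = num / den
        # Hold user factors fixed; update each item factor the same way.
        for i, obs in by_item.items():
            num = sum(v * user_f[u] for u, v in obs)
            den = lam + sum(user_f[u] ** 2 for u, _v in obs)
            item_f[i] = num / den
    return user_f, item_f

# Made-up counts: user 10009 has never searched for word 2.
ratings = [(10008, 1, 1.0), (10008, 2, 3.0), (10009, 1, 2.0)]
user_f, item_f = als_rank1(ratings)
print(user_f[10009] * item_f[2])  # predicted value for the unseen pair
```

User 10009 searched for word 1 twice as often as user 10008 did, so the rank-1 model extrapolates roughly twice user 10008's count for word 2.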

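Training rnnlm in Kaldi

Environment: Ubuntu 12.04, Kaldi

A discussion of deep learning applied to NLP (see http://licstar.net/archives/328 for details) introduces the concept of word vectors (in English: distributed representation, word representation, or word embedding). Mikolov's RNNLM involves training word vectors, and Kaldi includes an example implementation.

1. Switch to the Kaldi directory /u01/kaldi/tools. There was no rnnlm directory there, probably because my version was a bit old, so I checked the directory out directly: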

svn co https://svn.code.sf.net/p/kaldi/code/trunk/tools/rnnlm-hs-0.1b

2. Build it:
cd rnnlm-hs-0.1b
make
This produces the rnnlm executable.

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm
RNNLM based on WORD VECTOR estimation toolkit v 0.1b

Options:
Parameters for training:
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
-rnnlm <file>
Use <file> to save the resulting language model
-hidden <int>
Set size of hidden layer; default is 100
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed each bptt+bptt_block steps
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
-threads <int>
Use <int> threads (default 1)
-min-count <int>
This will discard words that appear less than <int> times; default is 0
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
-reject-threshold <float>
Reject nnet and reload nnet from previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training iff N retries with halving learning rate have failed (default 2)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-direct-size <int>
Set the size of hash for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)

Examples:
./rnnlm -train data.txt -valid valid.txt -rnnlm result.rnnlm -debug 2 -hidden 200

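3. Try the wsj example in Kaldi
Download a repository that contains wsj: git clone https://github.com/foundintranslation/Kaldi.git
Copy it over: cp wsj/s1 /u01/kaldi/egs/wsj/ -Rf
It turns out that the wsj data comes on DVD, which I cannot get, so this route is a dead end.

4. Download the following two files from http://www.fit.vutbr.cz/~imikolov/rnnlm/

rnnlm-0.3e

Basic examples

They contain the program and examples. Extract Basic_examples; it contains the data directory: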

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls /u01/jerry/simple-examples/data
ptb.char.test.txt ptb.char.train.txt ptb.char.valid.txt ptb.test.txt ptb.train.txt ptb.valid.txt README

Start training the word vectors:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm -train /u01/jerry/simple-examples/data/ptb.train.txt -valid /u01/jerry/simple-examples/data/ptb.valid.txt -rnnlm result.rnnlm -debug 2 -hidden 100

Vocab size: 10000
Words in train file: 929589
Starting training using file /u01/jerry/simple-examples/data/ptb.train.txt
Iteration 0 Valid Entropy 9.457519
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.40k Iteration 1 Valid Entropy 8.416857
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.18k Iteration 2 Valid Entropy 8.203366
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.98k Iteration 3 Valid Entropy 8.090350
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.25k Iteration 4 Valid Entropy 8.026399
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 5 Valid Entropy 7.979509
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.43k Iteration 6 Valid Entropy 7.949336
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 7 Valid Entropy 7.931067 Decay started
Alpha: 0.050000 ME-alpha: 0.050000 Progress: 99.11% Words/thread/sec: 28.55k Iteration 8 Valid Entropy 7.827513
Alpha: 0.025000 ME-alpha: 0.025000 Progress: 99.11% Words/thread/sec: 28.37k Iteration 9 Valid Entropy 7.759574
Alpha: 0.012500 ME-alpha: 0.012500 Progress: 99.11% Words/thread/sec: 28.45k Iteration 10 Valid Entropy 7.714383
Alpha: 0.006250 ME-alpha: 0.006250 Progress: 99.11% Words/thread/sec: 28.51k Iteration 11 Valid Entropy 7.684731
Alpha: 0.003125 ME-alpha: 0.003125 Progress: 99.11% Words/thread/sec: 28.64k Iteration 12 Valid Entropy 7.668839 Retry 1/2
Alpha: 0.001563 ME-alpha: 0.001563 Progress: 99.11% Words/thread/sec: 28.25k Iteration 13 Valid Entropy 7.668437 Retry 2/2
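The "Decay started" and "Retry" markers in this log line up with the -stop threshold (default 1.003) from the help text. The sketch below, in Python 3, parses the validation entropies and flags the iterations where the improvement falls below that threshold. Treating "relative entropy improvement" as the ratio of the previous entropy to the current one is my assumption, chosen because it matches the log above; weak_iterations is a hypothetical helper, not part of rnnlm.

```python
import re

# Captures the iteration number and the validation entropy from each log line.
ENTROPY_RE = re.compile(r'Iteration (\d+)\s+Valid Entropy ([0-9.]+)')

def weak_iterations(log_text, threshold=1.003):
    """Return the iterations whose previous/current entropy ratio is below threshold."""
    entropies = [(int(n), float(e)) for n, e in ENTROPY_RE.findall(log_text)]
    weak = []
    for (_, prev), (n, cur) in zip(entropies, entropies[1:]):
        if prev / cur < threshold:
            weak.append(n)
    return weak

# Three lines copied from the training log above
log = """Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.43k Iteration 6 Valid Entropy 7.949336
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 7 Valid Entropy 7.931067 Decay started
Alpha: 0.050000 ME-alpha: 0.050000 Progress: 99.11% Words/thread/sec: 28.55k Iteration 8 Valid Entropy 7.827513
"""

print(weak_iterations(log))
```

On this excerpt the function flags iteration 7, the same line the tool marks with "Decay started"; after that point the log shows Alpha being halved on every iteration.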

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls -l
total 8184
-rw-rw-r-- 1 jerry jerry 11358 Aug 25 15:08 LICENSE
-rw-rw-r-- 1 jerry jerry 407 Aug 25 15:08 Makefile
-rw-rw-r-- 1 jerry jerry 8325 Aug 25 15:08 README.txt
-rw-rw-r-- 1 jerry jerry 109943 Aug 25 18:05 result.rnnlm
-rw-rw-r-- 1 jerry jerry 8040020 Aug 25 18:05 result.rnnlm.nnet
-rwxrwxr-x 1 jerry jerry 142501 Aug 25 15:08 rnnlm
-rw-rw-r-- 1 jerry jerry 33936 Aug 25 15:08 rnnlm.c
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$

vi result.rnnlm

</s> 42068
the 50770
<unk> 45020
N 32481
of 24400
to 23638
a 21196
in 18000
and 17474
's 9784
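Each line of result.rnnlm appears to be a word followed by its count in the training text, with </s> standing for the end of a sentence. Here is a minimal Python 3 sketch of how such counts could be reproduced from a training file, assuming one sentence per line and one </s> per line; count_vocab is a hypothetical helper, and the ordering used in the real file is not reproduced.

```python
from collections import Counter

def count_vocab(lines):
    """Count whitespace-separated words, plus one </s> marker per line."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        counts["</s>"] += 1  # assumption: every line ends one sentence
    return counts

# Made-up sample sentences
sample = ["the cat sat", "the dog ran"]
counts = count_vocab(sample)
print(counts["the"], counts["</s>"], sum(counts.values()))
```

To run it on the real data, pass an open file object for ptb.train.txt instead of the sample list.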

Chinese topic modeling (LDA) with Gensim

Environment: Ubuntu 12.04, gensim, jieba

The Chinese corpus is the reduced edition (tar.gz, 24 MB) from http://www.sogou.com/labs/dl/c.html
jerry@hq:/u01/jerry/Reduced$ ls
C000008 C000010 C000013 C000014 C000016 C000020 C000022 C000023 C000024

Category of each folder:
C000007 Automobiles
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military

The steps are as follows:

# Python 2
import jieba, os
from gensim import corpora, models, similarities

train_set = []

# Walk the corpus directory and segment every document with jieba
walk = os.walk('/u01/jerry/Reduced')
for root, dirs, files in walk:
    for name in files:
        f = open(os.path.join(root, name), 'r')
        raw = f.read()
        f.close()
        word_list = list(jieba.cut(raw, cut_all = False))
        train_set.append(word_list)

# Bag-of-words -> TF-IDF -> LDA with 10 topics
dic = corpora.Dictionary(train_set)
corpus = [dic.doc2bow(text) for text in train_set]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word = dic, num_topics = 10)
corpus_lda = lda[corpus_tfidf]

>>> for i in range(0, 10):
...     print lda.print_topic(i)
...
0.000*康宁 + 0.000*sohu2 + 0.000*wmv + 0.000*bbn7 + 0.000*mmst + 0.000*cid + 0.000*icp + 0.000*沙尘 + 0.000*性骚扰 + 0.000*乌里韦
0.000*media + 0.000*mid + 0.000*stream + 0.000*bbn7 + 0.000*mmst + 0.000*sohu2 + 0.000*cid + 0.000*icp + 0.000*wmv + 0.000*that
0.012* + 0.000*米兰 + 0.000*老板 + 0.000*男人 + 0.000*女人 + 0.000*她 + 0.000*小说 + 0.000*病人 + 0.000*我 + 0.000*女性
0.002*& + 0.002*nbsp + 0.001*０ + 0.001*; + 0.001*西安 + 0.001*报名 + 0.001*１ + 0.001*∶ + 0.001*00 + 0.001*５
0.002*手机 + 0.002*孩子 + 0.001*球 + 0.001*国家队 + 0.001*胜 + 0.001*教练 + 0.001*; + 0.001*名单 + 0.001*阅读 + 0.001*高校
0.001*’ + 0.000* + 0.000*= + 0.000*var + 0.000*height + 0.000*width + 0.000*NewWin + 0.000*} + 0.000*{ + 0.000*+
0.003* + 0.002*比赛 + 0.002*我 + 0.002*　 + 0.001*; + 0.001*- + 0.001*， + 0.001*他 + 0.001*& + 0.001*―
0.000*航班 + 0.000*劳动合同 + 0.000*最低工资 + 0.000*农民工 + 0.000*养老保险 + 0.000*劳动者 + 0.000*用人单位 + 0.000*养老 + 0.000*上调 + 0.000*锦江
0.000*面板 + 0.000*碘 + 0.000*食物 + 0.000*维生素 + 0.000*营养 + 0.000*皮肤 + 0.000*蛋白质 + 0.000*药物 + 0.000*症状 + 0.000*体内
0.000* + 0.000*EMC + 0.000*包机 + 0.000*基金 + 0.000*陆纯初 + 0.000*南越 + 0.000*Kashya + 0.000*西沙群岛 + 0.000*Clariion + 0.000*西沙

The resulting topic model does not feel ideal; the num_topics parameter may need to be increased.
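The topics printed above also contain punctuation, full-width digits and HTML residue such as nbsp, so filtering tokens before building the dictionary might help alongside a larger num_topics. This is only a minimal sketch in Python 3 (the code in this post is Python 2): clean_tokens, CJK_WORD and the thresholds are hypothetical choices of mine, not part of gensim or jieba, and I have not rerun the model with it.

```python
import re

# A token counts as a "word" here only if it consists entirely of CJK unified ideographs.
CJK_WORD = re.compile(r'^[\u4e00-\u9fff]+$')

def clean_tokens(tokens, stop_words=frozenset(), min_len=2):
    """Drop punctuation, markup residue, short tokens and stop words."""
    kept = []
    for tok in tokens:
        tok = tok.strip()
        if len(tok) < min_len:
            continue  # single characters and empty strings
        if not CJK_WORD.match(tok):
            continue  # ASCII markup, digits, punctuation
        if tok in stop_words:
            continue
        kept.append(tok)
    return kept

# Tokens taken from the topic listing above
sample = ['&', 'nbsp', ';', '手机', '的', ' ', '比赛', 'var', '０']
print(clean_tokens(sample, stop_words={'的'}))
```

With this in place, each word_list would be passed through clean_tokens before train_set.append. Note that the filter also discards English terms such as EMC and single-character words such as 球, which may or may not be acceptable for a given corpus.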

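Taobao image storage system architecture

Link: http://wenku.baidu.com/view/c5504a08bb68a98271fefaaa.html

Notes on NoSQL databases

Link: http://sebug.net/paper/databases/nosql/Nosql.html

Installing and using RocksDB and pyrocksdb

Environment: Ubuntu 12.04, RocksDB, pyrocksdb

RocksDB is a key-value store that Facebook built as an improvement on Google's LevelDB. It is similar to memcache and redis, supports storage on RAM, flash and disk, and is said to write roughly 10 times faster than LevelDB, which sounds rather impressive; see https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks. Enough talk, let's install it and try it out.

Installation steps:

Installing rocksdb: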

sudo git clone https://github.com/facebook/rocksdb.git
cd rocksdb

vi Makefile
Change the line OPT += -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer
to OPT += -O2 -lrt -fno-omit-frame-pointer -momit-leaf-frame-pointer

Add export LD_PRELOAD=/lib/x86_64-linux-gnu/librt.so.1 to ~/.bashrc, then apply it with source ~/.bashrc

(These two steps fix the error "undefined symbol: clock_gettime".)

sudo git checkout 2.8.fb
sudo make shared_lib

cd ..
sudo chown jerry:jerry rocksdb -Rf
cd rocksdb

sudo cp librocksdb.so /usr/local/lib
sudo mkdir -p /usr/local/include/rocksdb/
sudo cp -r ./include/* /usr/local/include/

(These three steps fix the error "Fatal error: rocksdb/slice.h: No such file or directory".)

Installing pyrocksdb:
sudo pip install "Cython>=0.20"
sudo pip install git+git://github.com/stephan-hof/pyrocksdb.git@v0.2.1

Installation is now complete.
Start a Python session to try pyrocksdb:

jerry@hq:/u01/rocksdb$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rocksdb
>>> db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
>>> db.put(b"key1", b"v1")
>>> db.put(b"key2", b"v2")
>>> db.get(b"key1")
'v1'

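An evaluation of five open-source deep learning packages

Environment: Windows 7, Ubuntu 12.04, H2O, RStudio, Pylearn2, Caffe, cuda-convnet2, Octave

Java: H2O

See my article: http://blog.itpub.net/16582684/viewspace-1255976/

Pros: runs on CPU clusters, parallel and distributed; integrates with R, which makes data handling convenient
Cons: no GPU support

C++: Caffe, cuda-convnet2

See my articles: http://blog.itpub.net/16582684/viewspace-1256400/ http://blog.itpub.net/16582684/viewspace-1254584/

Caffe pros: supports both CPU and GPU; has Python and MATLAB interfaces; fast; currently gives comparatively good image classification results
Cons: no cluster support

cuda-convnet2 pros: supports multiple GPUs on a single machine
Cons: no CPU support; somewhat complicated to operate

Python: Pylearn2

See my article: http://blog.itpub.net/16582684/viewspace-1243187/

Pros: supports both CPU and GPU
Cons: no support for concurrency or clusters

Octave/Matlab: DeepLearnToolbox

See my article: http://blog.itpub.net/16582684/viewspace-1255317/

Pros: concise code that makes the algorithms easy to understand
Cons: only runs on a single CPU

Summary: each of these packages has its own suitable scenarios, and there is no single best one. Deep learning architectures are currently developing in two directions: 1. GPU clusters, and 2. mixed CPU and GPU clusters. Open-source implementations of the first already exist, while so far only one or two companies have implemented the second.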

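Introduction to Cascalog

Environment: CentOS 5.7, CDH 4.2.0

Cascalog is a DSL defined in Clojure on top of Cascading and Hadoop. Thanks to Clojure's metadata and functional programming paradigm, it is well suited to defining functions and queries.

Here is a usage walkthrough:

1. Create a project with lein
lein new cascalog_incanter

2. Change into cascalog_incanter and edit project.clj as follows: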

(defproject cascalog_incanter "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [incanter "1.5.5"]]
  :repositories [["conjars.org" "http://conjars.org/repo"]
                 ["cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos/"]]
  :profiles {
    :provided {
      :dependencies [
        ;[org.apache.hadoop/hadoop-core "1.2.1"] ; Apache Hadoop MapReduce v1
        ;[org.apache.hadoop/hadoop-core "2.0.0-mr1-cdh4.2.0"] ; CDH 4.2.0 MapReduce v1
        [org.apache.hadoop/hadoop-common "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 YARN
        [org.apache.hadoop/hadoop-mapreduce-client-core "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 MapReduce v2
      ]
    }
    :dev {
      :dependencies [
        [org.apache.hadoop/hadoop-minicluster "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0
      ]}
  }
)

3. Start the REPL
lein repl

4. Follow the examples at http://cascalog.org/articles/getting_started.html