
Environment: Ubuntu 12.04, Ubuntu 14.04, WordPress 4.0, OpenCart 1.5, PostgreSQL 9.1, MySQL 5




wget https://cn.wordpress.org/wordpress-3.4.2-zh_CN.tar.gz

Redeploy it under /var/www. (Note that on Ubuntu 14.04 the Apache2 DocumentRoot is in a different location than before.

sudo vi /etc/apache2/sites-enabled/000-default.conf

Change DocumentRoot /var/www/html to DocumentRoot /var/www.)
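The DocumentRoot change can also be scripted instead of made by hand in vi. A minimal Python 3 sketch, assuming the vhost text contains an uncommented DocumentRoot directive; the function name set_document_root is made up for illustration, and Apache still has to be reloaded after the file is written back.

```python
import re

def set_document_root(conf_text, new_root):
    """Return conf_text with every uncommented DocumentRoot pointing at new_root."""
    # Match the directive at the start of a line and keep its indentation.
    pattern = re.compile(r"^([ \t]*DocumentRoot[ \t]+)\S+", re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + new_root, conf_text)

# Sample vhost text standing in for 000-default.conf.
sample = "<VirtualHost *:80>\n\tDocumentRoot /var/www/html\n</VirtualHost>\n"
print(set_document_root(sample, "/var/www"))
```

To apply it to the real file, read /etc/apache2/sites-enabled/000-default.conf, pass the text through the function, and write the result back with root privileges.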





ini_set('display_errors', 'on');

Fatal error: Call to undefined function mcrypt_create_iv() (the cause is that mcrypt is missing)


sudo apt-get install mcrypt
sudo apt-get install php5-mcrypt

php -m | grep mcrypt


sudo php5enmod mcrypt


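Today I spent almost half a day dealing with the problems in these two applications. The fixes found online only pointed in a general direction; I had to dig into the problems myself to locate and solve them. One more thing: be careful when upgrading the system!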




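Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0

Approach: first extract each person's set of search terms from the data warehouse, then segment the search terms into words and count how often each word occurs. Finally, output a matrix of users against segmented words, in which each value is the number of searches.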
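The approach above can be sketched with the standard library alone. This is a minimal Python 3 illustration, not the code used in this post: the sample rows are made up, str.split stands in for jieba.cut, and the helper names user_token_counts and to_triples are hypothetical.

```python
from collections import Counter

def user_token_counts(rows, tokenize):
    """Map each user to a Counter of token frequencies.

    rows     -- iterable of (user_id, query_text) pairs
    tokenize -- callable turning a text into a list of tokens
    """
    counts = {}
    for user_id, text in rows:
        counts.setdefault(user_id, Counter()).update(tokenize(text))
    return counts

def to_triples(counts):
    """Flatten the per-user counters into sorted (user, token, count) triples."""
    return sorted((user, token, n)
                  for user, counter in counts.items()
                  for token, n in counter.items())

# Made-up sample rows; str.split stands in for jieba.cut here.
rows = [(10008, "spark mllib spark"), (10009, "oracle spark")]
triples = to_triples(user_token_counts(rows, str.split))
for triple in triples:
    print(*triple)
```

Each triple is one non-zero cell of the user-by-word matrix, which is the sparse form a recommender expects as input.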

1. Query each person's set of search terms from the Oracle database
select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;

2. Segment the terms and output the matrix of users against segmented words

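# -*- coding: utf-8 -*-
from gensim import corpora
import jieba

# Each input line is "<employee_id> <search terms>", separated by a space.
q_content = [line.split(' ') for line in open('/u01/jerry/emp_query_conten.txt').readlines()]

# Segment each user's search terms with jieba.
train_set = [list(jieba.cut(i[1])) for i in q_content]

# Drop punctuation, whitespace and a few very common words.
stop_words = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
train_set2 = [[j for j in i if j not in stop_words] for i in train_set]

# Map tokens to integer ids and build (token_id, count) pairs per user.
dic = corpora.Dictionary(train_set2)
corpus = [dic.doc2bow(text) for text in train_set2]

# Keep only the tokens a user searched for more than once.
corpus2 = [[j for j in i if j[1] > 1] for i in corpus]

# Write the dictionary as "<token_id> <token>" lines (tokens are unicode, so encode them).
output = open('/u01/jerry/qw_dic', 'w')
for key, value in dic.iteritems():
    output.write(str(key) + ' ' + value.encode('utf-8') + '\n')
output.close()

# Print the (user, token_id, count) triples for inspection.
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        print q_content[i][0], j[0], j[1]

# Write the same triples as "<user> <token_id> <count>" lines.
output = open('/u01/jerry/emp_q_cnt', 'w')
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        output.write(str(q_content[i][0]) + ' ' + str(j[0]) + ' ' + str(j[1]) + '\n')
output.close()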

3. Load the output file emp_q_cnt into Spark MLlib and train a prediction model

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/home/cloudera/emp_q_cnt")
// emp_q_cnt is written with single spaces between the fields, so split on a space
val ratings = data.map(_.split(' ') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) })

val rank = 10
val numIterations = 1000
val model = ALS.train(ratings, rank, numIterations, 0.01)
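The file written in step 2 puts a single space between the three fields, so whatever reads it has to split on the same separator, otherwise no line yields three fields. A small Python 3 sketch of that parsing step, useful for checking the file before handing it to Spark; parse_rating_line is a hypothetical helper and the sample lines are made up.

```python
def parse_rating_line(line, sep=" "):
    """Parse one '<user> <token_id> <count>' line into (int, int, float)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 3:
        raise ValueError("expected 3 fields, got %d: %r" % (len(fields), line))
    user, item, count = fields
    return int(user), int(item), float(count)

# Made-up lines in the same layout as emp_q_cnt.
sample = ["10008 2 3\n", "10009 7 2\n"]
ratings = [parse_rating_line(line) for line in sample]
print(ratings)
```

Raising on a wrong field count makes a separator mismatch visible on the first line instead of surfacing later as a failed pattern match inside a Spark task.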

4. Look up the predicted value for a given user and word (user 10008, word id 2)

model.predict(sc.parallelize(Array((10008, 2)))).map{case Rating(user, item, rate) => ((user, item), rate)}.take(1)
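An ALS model keeps one latent factor vector per user and one per item, and the predicted value is the dot product of the two. A toy Python 3 sketch of that last step; the factor values are made up (and use rank 3 for brevity rather than the rank 10 above), since a trained model learns them from the ratings.

```python
def predict(user_factors, item_factors, user, item):
    """Predicted value = dot product of the user's and the item's latent vectors."""
    u = user_factors[user]
    v = item_factors[item]
    return sum(a * b for a, b in zip(u, v))

# Made-up factors, for illustration only.
user_factors = {10008: [0.5, 1.0, 0.0]}
item_factors = {2: [2.0, 1.0, 4.0]}
score = predict(user_factors, item_factors, 10008, 2)
print(score)
```

A user or item that never appeared in the training data has no factor vector, which is why predictions are only available for ids seen during training.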


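Environment: Ubuntu 12.04, Kaldi

Applications of deep learning to NLP (see http://licstar.net/archives/328 for details) rely on one concept: the word vector (in English: distributed representation, word representation, or word embedding). Mikolov's RNNLM involves training word vectors, and Kaldi includes an example implementation.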
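The point of a word vector is that related words end up close to each other, which is usually measured with cosine similarity. A minimal Python 3 sketch with made-up 3-dimensional vectors; real word vectors come out of training and have tens to hundreds of dimensions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up vectors, chosen so that the first two point in a similar direction.
vectors = {
    "king":  [0.9, 0.1, 0.4],
    "queen": [0.8, 0.2, 0.5],
    "apple": [0.1, 0.9, 0.0],
}
print(cosine(vectors["king"], vectors["queen"]))
print(cosine(vectors["king"], vectors["apple"]))
```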

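1. Change to the Kaldi directory /u01/kaldi/tools. There is no rnnlm directory there, probably because the checkout is somewhat old, so download the directory directly: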

svn co https://svn.code.sf.net/p/kaldi/code/trunk/tools/rnnlm-hs-0.1b

cd rnnlm-hs-0.1b

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm
RNNLM based on WORD VECTOR estimation toolkit v 0.1b

Parameters for training:
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
-rnnlm <file>
Use <file> to save the resulting language model
-hidden <int>
Set size of hidden layer; default is 100
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed each bptt+bptt_block steps
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
-threads <int>
Use <int> threads (default 1)
-min-count <int>
This will discard words that appear less than <int> times; default is 0
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
-reject-threshold <float>
Reject nnet and reload nnet from previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training iff N retries with halving learning rate have failed (default 2)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-direct-size <int>
Set the size of hash for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)

./rnnlm -train data.txt -valid valid.txt -rnnlm result.rnnlm -debug 2 -hidden 200

3. Use the wsj example in Kaldi
Download a repository that contains wsj: git clone https://github.com/foundintranslation/Kaldi.git
Copy wsj/s1 from it: cp wsj/s1 /u01/kaldi/egs/wsj/ -Rf

4. Download two files from http://www.fit.vutbr.cz/~imikolov/rnnlm/ (one of them is Basic examples); they contain the program and the examples. Unpack Basic_examples; its data directory holds the data files.

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls /u01/jerry/simple-examples/data
ptb.char.test.txt  ptb.char.train.txt  ptb.char.valid.txt  ptb.test.txt  ptb.train.txt  ptb.valid.txt  README

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm -train /u01/jerry/simple-examples/data/ptb.train.txt -valid /u01/jerry/simple-examples/data/ptb.valid.txt -rnnlm result.rnnlm -debug 2 -hidden 100

Vocab size: 10000
Words in train file: 929589
Starting training using file /u01/jerry/simple-examples/data/ptb.train.txt
Iteration 0     Valid Entropy 9.457519
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 28.40k Iteration 1     Valid Entropy 8.416857
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 28.18k Iteration 2     Valid Entropy 8.203366
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.98k Iteration 3     Valid Entropy 8.090350
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.25k Iteration 4     Valid Entropy 8.026399
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.35k Iteration 5     Valid Entropy 7.979509
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.43k Iteration 6     Valid Entropy 7.949336
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.35k Iteration 7     Valid Entropy 7.931067  Decay started
Alpha: 0.050000  ME-alpha: 0.050000  Progress: 99.11%  Words/thread/sec: 28.55k Iteration 8     Valid Entropy 7.827513
Alpha: 0.025000  ME-alpha: 0.025000  Progress: 99.11%  Words/thread/sec: 28.37k Iteration 9     Valid Entropy 7.759574
Alpha: 0.012500  ME-alpha: 0.012500  Progress: 99.11%  Words/thread/sec: 28.45k Iteration 10    Valid Entropy 7.714383
Alpha: 0.006250  ME-alpha: 0.006250  Progress: 99.11%  Words/thread/sec: 28.51k Iteration 11    Valid Entropy 7.684731
Alpha: 0.003125  ME-alpha: 0.003125  Progress: 99.11%  Words/thread/sec: 28.64k Iteration 12    Valid Entropy 7.668839  Retry 1/2
Alpha: 0.001563  ME-alpha: 0.001563  Progress: 99.11%  Words/thread/sec: 28.25k Iteration 13    Valid Entropy 7.668437  Retry 2/2
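The annotations in the log ("Decay started", "Retry 1/2", "Retry 2/2") follow from the -stop and -retry options. The Python 3 sketch below replays the validation entropies from the log through a schedule inferred from the help text and the log itself, not taken from rnnlm.c: it assumes the improvement is measured as previous entropy divided by current entropy, and it ignores -reject-threshold, which never fires in this run.

```python
def annotate(entropies, alpha=0.1, stop=1.003, retry=2):
    """Replay a validation-entropy sequence through a halving schedule.

    Returns (events, alphas): events maps iteration -> annotation, and
    alphas[i] is the learning rate used to train iteration i + 1.
    """
    events, alphas = {}, []
    decaying, failures = False, 0
    for i in range(1, len(entropies)):
        alphas.append(alpha)
        improvement = entropies[i - 1] / entropies[i]
        if improvement < stop:
            if not decaying:
                # First weak improvement: start halving the learning rate.
                decaying = True
                events[i] = "Decay started"
            else:
                # Weak improvement while already decaying counts as a failed retry.
                failures += 1
                events[i] = "Retry %d/%d" % (failures, retry)
                if failures >= retry:
                    break
        if decaying:
            alpha /= 2
    return events, alphas

# Valid Entropy values copied from the log above (iterations 0 to 13).
valid_entropy = [9.457519, 8.416857, 8.203366, 8.090350, 8.026399, 7.979509,
                 7.949336, 7.931067, 7.827513, 7.759574, 7.714383, 7.684731,
                 7.668839, 7.668437]
events, alphas = annotate(valid_entropy)
print(events)
print(alphas[-1])
```

Under these assumptions the replay marks iterations 7, 12 and 13, and ends with the same learning rate the log prints (rounded there to 0.001563).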


jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls -l
total 8184
-rw-rw-r-- 1 jerry jerry   11358 Aug 25 15:08 LICENSE
-rw-rw-r-- 1 jerry jerry     407 Aug 25 15:08 Makefile
-rw-rw-r-- 1 jerry jerry    8325 Aug 25 15:08 README.txt
-rw-rw-r-- 1 jerry jerry  109943 Aug 25 18:05 result.rnnlm
-rw-rw-r-- 1 jerry jerry 8040020 Aug 25 18:05 result.rnnlm.nnet
-rwxrwxr-x 1 jerry jerry  142501 Aug 25 15:08 rnnlm
-rw-rw-r-- 1 jerry jerry   33936 Aug 25 15:08 rnnlm.c

vi  result.rnnlm

</s> 42068
the 50770
<unk> 45020
N 32481
of 24400
to 23638
a 21196
in 18000
and 17474
's 9784



Environment: Ubuntu 12.04, gensim, jieba

The Chinese corpus is the reduced version (tar.gz format, 24 MB) from http://www.sogou.com/labs/dl/c.html
jerry@hq:/u01/jerry/Reduced$ ls
C000008  C000010  C000013  C000014  C000016  C000020  C000022  C000023  C000024

C000007 Automobiles
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military


import jieba, os
from gensim import corpora, models, similarities

train_set = []

# Walk the corpus directory and segment every file with jieba.
walk = os.walk('/u01/jerry/Reduced')
for root, dirs, files in walk:
    for name in files:
        f = open(os.path.join(root, name), 'r')
        raw = f.read()
        f.close()
        word_list = list(jieba.cut(raw, cut_all = False))
        train_set.append(word_list)  # one token list per document

# Build the dictionary, the bag-of-words corpus, the tf-idf weights and the LDA model.
dic = corpora.Dictionary(train_set)
corpus = [dic.doc2bow(text) for text in train_set]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word = dic, num_topics = 10)
corpus_lda = lda[corpus_tfidf]
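For intuition about what a tf-idf step contributes, here is a standard-library Python 3 sketch of one common variant: raw term frequency times log2(N/df), followed by L2 normalisation. It is an illustration on made-up, already segmented documents; gensim's TfidfModel may differ in detail.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {token: weight} dict per document, L2-normalised."""
    n_docs = len(docs)
    # Document frequency: in how many documents each token occurs.
    df = Counter(token for doc in docs for token in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        w = {t: tf[t] * math.log2(n_docs / df[t]) for t in tf}
        norm = math.sqrt(sum(x * x for x in w.values()))
        # A document whose tokens all occur everywhere has zero norm.
        weighted.append({t: x / norm for t, x in w.items()} if norm else {})
    return weighted

# Made-up, already segmented documents.
docs = [["phone", "screen", "phone"], ["match", "coach"], ["phone", "match"]]
weights = tfidf(docs)
for w in weights:
    print(w)
```

In the first document the rarer token "screen" outweighs "phone" even though "phone" occurs twice, which is the effect tf-idf is meant to have.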

>>> for i in range(0, 10):
...     print lda.print_topic(i)

0.000*康宁 + 0.000*sohu2 + 0.000*wmv + 0.000*bbn7 + 0.000*mmst + 0.000*cid + 0.000*icp + 0.000*沙尘 + 0.000*性骚扰 + 0.000*乌里韦
0.000*media + 0.000*mid + 0.000*stream + 0.000*bbn7 + 0.000*mmst + 0.000*sohu2 + 0.000*cid + 0.000*icp + 0.000*wmv + 0.000*that
0.012* + 0.000*米兰 + 0.000*老板 + 0.000*男人 + 0.000*女人 + 0.000*她 + 0.000*小说 + 0.000*病人 + 0.000*我 + 0.000*女性
0.002*& + 0.002*nbsp + 0.001*0 + 0.001*; + 0.001*西安 + 0.001*报名 + 0.001*1 + 0.001*∶ + 0.001*00 + 0.001*5
0.002*手机 + 0.002*孩子 + 0.001*球 + 0.001*国家队 + 0.001*胜 + 0.001*教练 + 0.001*; + 0.001*名单 + 0.001*阅读 + 0.001*高校
0.001*’ + 0.000* + 0.000*= + 0.000*var + 0.000*height + 0.000*width + 0.000*NewWin + 0.000*} + 0.000*{ + 0.000*+
0.003*  + 0.002*比赛 + 0.002*我 + 0.002*  + 0.001*; + 0.001*- + 0.001*, + 0.001*他 + 0.001*& + 0.001*―
0.000*航班 + 0.000*劳动合同 + 0.000*最低工资 + 0.000*农民工 + 0.000*养老保险 + 0.000*劳动者 + 0.000*用人单位 + 0.000*养老 + 0.000*上调 + 0.000*锦江
0.000*面板 + 0.000*碘 + 0.000*食物 + 0.000*维生素 + 0.000*营养 + 0.000*皮肤 + 0.000*蛋白质 + 0.000*药物 + 0.000*症状 + 0.000*体内
0.000* + 0.000*EMC + 0.000*包机 + 0.000*基金 + 0.000*陆纯初 + 0.000*南越 + 0.000*Kashya + 0.000*西沙群岛 + 0.000*Clariion + 0.000*西沙

The resulting topic model does not look very good; the num_topics parameter may need to be increased.
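Several of the topics above are dominated by punctuation, digits and HTML or script residue (&, nbsp, ;, var, width), so filtering tokens before building the dictionary is another thing worth trying besides num_topics. This step was not part of the run above. A Python 3 sketch, in which the stop list and the helper names is_noise and clean are hypothetical:

```python
import unicodedata

# Hypothetical stop list; extend it with whatever residue shows up in the topics.
STOP_TOKENS = {"nbsp", "var", "width", "height"}

def is_noise(token):
    """True for blank tokens, stop-listed tokens, and tokens made only of
    punctuation, symbols, digits or separators."""
    stripped = token.strip()
    if not stripped or stripped.lower() in STOP_TOKENS:
        return True
    # Unicode categories: P* punctuation, S* symbols, N* numbers, Z* separators.
    return all(unicodedata.category(ch)[0] in "PSNZ" for ch in stripped)

def clean(tokens):
    """Keep only the tokens that are not noise."""
    return [t for t in tokens if not is_noise(t)]

print(clean(["手机", "&", "nbsp", ";", " ", "00", "比赛", "{", "width"]))
```

In the loop that fills train_set, this would mean appending clean(word_list) instead of word_list. Dropping pure numbers is a judgment call; keep them if they carry meaning in the corpus.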

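Installing and using RocksDB and pyrocksdb

Environment: Ubuntu 12.04, RocksDB, pyrocksdb

RocksDB is a key-value database that Facebook built by improving on Google's LevelDB. It is similar to memcache and redis, supports RAM, Flash and disk storage, and writes roughly 10 times faster than LevelDB, which sounds rather impressive; see https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks . Enough of that, let's install it and try it out.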



sudo git clone https://github.com/facebook/rocksdb.git
cd rocksdb

vi Makefile
Change the line OPT += -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer
to OPT += -O2 -lrt -fno-omit-frame-pointer -momit-leaf-frame-pointer

Add export LD_PRELOAD=/lib/x86_64-linux-gnu/librt.so.1 to ~/.bashrc and apply it with source ~/.bashrc

(These two steps fix the error "undefined symbol: clock_gettime".)

sudo git checkout 2.8.fb
sudo make shared_lib

cd ..
sudo chown jerry:jerry rocksdb -Rf
cd rocksdb

sudo cp librocksdb.so /usr/local/lib
sudo mkdir -p /usr/local/include/rocksdb/
sudo cp -r ./include/* /usr/local/include/

(These three steps fix the error "Fatal error: rocksdb/slice.h: No such file or directory".)

sudo pip install "Cython>=0.20"
sudo pip install git+git://github.com/stephan-hof/pyrocksdb.git@v0.2.1


jerry@hq:/u01/rocksdb$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rocksdb
>>> db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
>>> db.put(b"key1", b"v1")
>>> db.put(b"key2", b"v2")
>>> db.get(b"key1")



Environment: Windows 7, Ubuntu 12.04, H2O, RStudio, Pylearn2, Caffe, cuda-convnet2, Octave

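Java implementation: H2O

C++ implementations: Caffe, cuda-convnet2

Caffe: Pros: supports both CPU and GPU, offers Python and MATLAB interfaces, computes fairly fast, and currently gives relatively good image classification results

cuda-convnet2: Pros: supports multiple GPUs on a single machine
Cons: no CPU support, and somewhat complicated to operate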


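Octave/MATLAB implementation: DeepLearnToolbox

Pros: concise code that makes the algorithms easy to follow
Cons: only computes on a single CPU

Summary: each of the implementations above has scenarios it suits, and there is no single best one. Deep learning architectures are currently developing in two directions: 1. GPU clusters, 2. mixed CPU and GPU clusters. Open-source implementations already cover the first; so far only one or two companies have implemented the second.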


Environment: CentOS 5.7, CDH 4.2.0



1. Create a project with lein
lein new cascalog_incanter

2. Change into cascalog_incanter and edit project.clj as follows:

(defproject cascalog_incanter "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [incanter "1.5.5"]]
  :repositories [["conjars.org" "http://conjars.org/repo"]
                 ["cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos/"]]
  :profiles {
    :provided {
      :dependencies [
        ;[org.apache.hadoop/hadoop-core "1.2.1"] ; Apache Hadoop MapReduce v1
        ;[org.apache.hadoop/hadoop-core "2.0.0-mr1-cdh4.2.0"] ; CDH 4.2.0 MapReduce v1
        [org.apache.hadoop/hadoop-common "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 YARN
        [org.apache.hadoop/hadoop-mapreduce-client-core "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 MapReduce v2
      ]}
    :dev {
      :dependencies [
        [org.apache.hadoop/hadoop-minicluster "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0
      ]}})

3. Start the REPL
lein repl

4. Follow the example at http://cascalog.org/articles/getting_started.html

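Titan-hadoop: a distributed graph computing framework

Environment: CentOS 5.7, titan-0.5.0-hadoop2

titan-hadoop is a framework for distributed graph computation on Hadoop. Its predecessor is Faunus, a graph analytics engine that was later merged into the Titan project. The installation package can be downloaded from http://s3.thinkaurelius.com/downloads/titan/titan-0.5.0-hadoop2.zip . For usage, see http://s3.thinkaurelius.com/docs/titan/0.5.0/hadoop-getting-started.html .

The current release supports Hadoop 2.2.0; building it against other Hadoop versions runs into quite a few problems. I have not yet found titan-hadoop2 source code that targets other Hadoop versions.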