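Be Careful When Upgrading Ubuntu

Environment: Ubuntu 12.04, Ubuntu 14.04, WordPress 4.0, OpenCart 1.5, PostgreSQL 9.1, MySQL 5

Yesterday I upgraded the system to Ubuntu 14.04 and found that the WordPress and OpenCart installations were both completely down (cue the cold sweat). Both home pages were blank, with no error message at all, so all I could do was search the web frantically. The fixes are as follows.

WordPress:

The error was "Cannot select database" (PostgreSQL). I tried all sorts of things, and in the end the only fix was to downgrade WordPress to 3.4.2: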

wget https://cn.wordpress.org/wordpress-3.4.2-zh_CN.tar.gz

Redeploy it under /var/www. (Note that the Apache 2 DocumentRoot on Ubuntu 14.04 is in a different location than before.

sudo vi /etc/apache2/sites-enabled/000-default.conf

Change DocumentRoot /var/www/html to DocumentRoot /var/www.)
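
The same DocumentRoot change can be scripted instead of edited by hand. Below is a minimal sketch, assuming the virtual host file uses plain DocumentRoot directives; the function name set_document_root and the sample text are made up for illustration. It only transforms text, so writing the result back to 000-default.conf is left out.

```python
import re

def set_document_root(conf_text, new_root):
    """Return conf_text with every DocumentRoot directive pointing at new_root."""
    # Match the directive at the start of a line and keep its indentation.
    pattern = re.compile(r'^([ \t]*)DocumentRoot[ \t]+\S+', re.MULTILINE)
    return pattern.sub(lambda m: m.group(1) + 'DocumentRoot ' + new_root, conf_text)

# Illustrative sample, not the full Ubuntu 14.04 default file.
sample = """<VirtualHost *:80>
    ServerAdmin webmaster@localhost
    DocumentRoot /var/www/html
</VirtualHost>
"""
print(set_document_root(sample, '/var/www'))
```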

One more issue: the default WordPress theme has to be changed, otherwise you still only get a blank page.

 

OpenCart:

Add these lines to index.php to display errors:

<?php
ini_set('display_errors', 'on');
?>

The page then reports: Fatal error: Call to undefined function mcrypt_create_iv(), which means the mcrypt extension is not loaded.

Reinstall mcrypt and php5-mcrypt:

sudo apt-get install mcrypt
sudo apt-get install php5-mcrypt

php -m | grep mcrypt

Enable the module (restart Apache afterwards so that it is picked up):

sudo php5enmod mcrypt
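
To confirm that the extension is really loaded, the output of php -m can also be checked programmatically. This is a small sketch that works on the text of that output; the function name has_php_module and the sample output are illustrative assumptions, not something captured from the machine above.

```python
def has_php_module(php_m_output, name):
    """Return True if name appears as a module line in the output of `php -m`."""
    modules = set()
    for line in php_m_output.splitlines():
        line = line.strip()
        # Skip blank lines and section headers such as "[PHP Modules]".
        if line and not line.startswith('['):
            modules.add(line.lower())
    return name.lower() in modules

# Shortened, made-up sample of what `php -m` prints.
sample_output = """[PHP Modules]
Core
date
mcrypt
pgsql

[Zend Modules]
"""
print(has_php_module(sample_output, 'mcrypt'))
```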

 

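I spent nearly half a day today dealing with the problems in these two applications. The fixes found online only point you in a direction; you have to dig into the problem yourself to locate and solve it. One more thing: be careful whenever you upgrade the system!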

 

 

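Recommendations Based on Search Terms

Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0

Approach: first extract each person's set of search terms from the data warehouse, then run word segmentation on each set and count how often each word occurs. Finally, output a user-by-word matrix whose values are the search counts.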
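
Before the real pipeline, here is a library-free sketch of that user-by-word matrix in sparse (user, word id, count) form. It is only an illustration: str.split stands in for a real segmenter such as jieba.cut, the toy queries are invented, and build_user_term_counts is a hypothetical helper name.

```python
from collections import Counter

def build_user_term_counts(user_queries, tokenize, min_count=2):
    """Return (term_ids, triples); triples are (user, term_id, count) with count >= min_count."""
    term_ids = {}
    triples = []
    for user, text in user_queries:
        counts = Counter(tokenize(text))
        for term in sorted(counts):
            # Assign ids in order of first appearance.
            term_id = term_ids.setdefault(term, len(term_ids))
            if counts[term] >= min_count:
                triples.append((user, term_id, counts[term]))
    return term_ids, triples

# Toy input; str.split stands in for a real word segmenter.
queries = [(10008, 'spark hadoop spark'), (10009, 'oracle oracle spark')]
term_ids, triples = build_user_term_counts(queries, str.split)
print(triples)
```

Dropping counts below min_count mirrors the idea of keeping only words a user searched for more than once.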

Steps:
1. Query each employee's set of search terms from the Oracle database
select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;
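
For readers without the database at hand, the grouping that this query performs can be sketched in Python. yd_concat looks like a user-defined aggregate whose separator is not shown, so the comma used here is an assumption, as are the function name concat_queries and the sample rows.

```python
from collections import defaultdict

def concat_queries(rows, sep=','):
    """Group (employee_id, q_content) rows and join each employee's search terms."""
    grouped = defaultdict(list)
    for employee_id, q_content in rows:
        grouped[employee_id].append(q_content)
    return {emp: sep.join(terms) for emp, terms in grouped.items()}

# Made-up rows standing in for agg_kw_daily.
rows = [(10008, 'spark'), (10009, 'oracle'), (10008, 'hadoop')]
print(concat_queries(rows))
```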

2. Run word segmentation and output the user-by-word matrix

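# Python 2 script
import sys

from gensim import corpora
import jieba

# Python 2 only: make utf-8 the default codec so unicode tokens can be written as str
reload(sys)
sys.setdefaultencoding('utf-8')

# The code assumes each input line is: employee_id, a space, then the search terms
q_content = [i.split(' ') for i in open('/u01/jerry/emp_query_conten.txt').readlines()]
train_set = [list(jieba.cut(i[1])) for i in q_content]

# Drop punctuation, whitespace, empty tokens and two very common words
stop_words = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
train_set2 = []
for i in train_set:
    train_set2.append([j for j in i if j not in stop_words])

dic = corpora.Dictionary(train_set2)
corpus = [dic.doc2bow(text) for text in train_set2]

# Keep only words that a user searched for more than once
corpus2 = []
for i in corpus:
    corpus2.append([j for j in i if j[1] > 1])

# Save the word id -> word mapping
output = open('/u01/jerry/qw_dic', 'w')
for key, value in dic.iteritems():
    output.write(str(key) + ' ' + value + '\n')
output.close()

# Print the (employee_id, word_id, count) triples for inspection
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        print q_content[i][0], j[0], j[1]

# Save the same triples, space separated, as input for Spark
output = open('/u01/jerry/emp_q_cnt', 'w')
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        output.write(str(q_content[i][0]) + ' ' + str(j[0]) + ' ' + str(j[1]) + '\n')
output.close()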

3. Feed the output file emp_q_cnt into Spark MLlib to train a prediction model

import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating

val data = sc.textFile("/home/cloudera/emp_q_cnt")
// emp_q_cnt was written with single spaces between fields, so split on ' ' rather than a tab
val ratings = data.map(_.split(' ') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) })

val rank = 10
val numIterations = 1000
val model = ALS.train(ratings, rank, numIterations, 0.01)
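
To give a feel for what ALS.train does with the rank, the iteration count and the last argument (the regularization parameter lambda), here is a toy alternating least squares in pure Python, restricted to rank 1 so that each update has a closed form. It is a sketch for intuition only, not Spark's implementation; the function names and the ratings are invented.

```python
def als_rank1(ratings, num_iterations=20, lam=0.01):
    """Fit one factor per user and per item to (user, item, value) triples."""
    user_f = {u: 1.0 for u, _, _ in ratings}
    item_f = {i: 1.0 for _, i, _ in ratings}
    for _ in range(num_iterations):
        # Hold item factors fixed: each user factor is a 1-D ridge regression.
        for u in user_f:
            seen = [(item_f[i], r) for uu, i, r in ratings if uu == u]
            user_f[u] = sum(f * r for f, r in seen) / (lam + sum(f * f for f, _ in seen))
        # Hold user factors fixed and update the item factors the same way.
        for i in item_f:
            seen = [(user_f[u], r) for u, ii, r in ratings if ii == i]
            item_f[i] = sum(f * r for f, r in seen) / (lam + sum(f * f for f, _ in seen))
    return user_f, item_f

def predict(user_f, item_f, user, item):
    """Predicted value is the product of the two factors."""
    return user_f[user] * item_f[item]

# Toy data: user 2 searches everything twice as often as user 1; (2, 12) is unobserved.
ratings = [(1, 10, 3.0), (1, 11, 1.0), (1, 12, 2.0), (2, 10, 6.0), (2, 11, 2.0)]
user_f, item_f = als_rank1(ratings)
print(round(predict(user_f, item_f, 2, 12), 2))
```

With rank 10, as in the Scala code, each update becomes a small linear system instead of a single division, but the alternation is the same.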

4. Look up the predicted value for a given user and word (user 10008, word 2)

model.predict(sc.parallelize(Array((10008, 2)))).map{case Rating(user, item, rate) => ((user, item), rate)}.take(1)

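Training an RNNLM in Kaldi

Environment: Ubuntu 12.04, Kaldi

Deep learning applied to NLP (see this article for details: http://licstar.net/archives/328) brings up the concept of a word vector (in English: distributed representation, word representation, or word embedding). Mikolov's RNNLM involves training word vectors, and Kaldi ships an example implementation.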
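
The practical appeal of word vectors is that similar words end up with similar vectors, which is usually measured with cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real vectors would come from a trained model and have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors, for illustration only.
vectors = {
    'king': [0.9, 0.8, 0.1],
    'queen': [0.85, 0.9, 0.15],
    'apple': [0.1, 0.05, 0.95],
}
print(cosine_similarity(vectors['king'], vectors['queen']))
print(cosine_similarity(vectors['king'], vectors['apple']))
```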

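1. Switch to the Kaldi directory /u01/kaldi/tools. There was no rnnlm directory there, probably because my checkout was somewhat old, so I downloaded the directory directly: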

svn co https://svn.code.sf.net/p/kaldi/code/trunk/tools/rnnlm-hs-0.1b

2. Build it to produce the rnnlm executable
cd rnnlm-hs-0.1b
make

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm
RNNLM based on WORD VECTOR estimation toolkit v 0.1b

Options:
Parameters for training:
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
-rnnlm <file>
Use <file> to save the resulting language model
-hidden <int>
Set size of hidden layer; default is 100
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed each bptt+bptt_block steps
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
-threads <int>
Use <int> threads (default 1)
-min-count <int>
This will discard words that appear less than <int> times; default is 0
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
-reject-threshold <float>
Reject nnet and reload nnet from previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training iff N retries with halving learning rate have failed (default 2)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-direct-size <int>
Set the size of hash for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)

Examples:
./rnnlm -train data.txt -valid valid.txt -rnnlm result.rnnlm -debug 2 -hidden 200

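3. Try the wsj example in Kaldi
Download a repository that contains wsj: git clone https://github.com/foundintranslation/Kaldi.git
Copy it over: cp wsj/s1 /u01/kaldi/egs/wsj/ -Rf
It turns out the wsj data comes on DVDs, which I cannot get, so this route is a dead end.

4. From http://www.fit.vutbr.cz/~imikolov/rnnlm/ download these two files:

rnnlm-0.3e

Basic examples

They contain the program and the examples. Unpack the Basic examples archive; the data files are in its data directory: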

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls /u01/jerry/simple-examples/data
ptb.char.test.txt  ptb.char.train.txt  ptb.char.valid.txt  ptb.test.txt  ptb.train.txt  ptb.valid.txt  README

Start training the word vectors:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm -train /u01/jerry/simple-examples/data/ptb.train.txt -valid /u01/jerry/simple-examples/data/ptb.valid.txt -rnnlm result.rnnlm -debug 2 -hidden 100

Vocab size: 10000
Words in train file: 929589
Starting training using file /u01/jerry/simple-examples/data/ptb.train.txt
Iteration 0     Valid Entropy 9.457519
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 28.40k Iteration 1     Valid Entropy 8.416857
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 28.18k Iteration 2     Valid Entropy 8.203366
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.98k Iteration 3     Valid Entropy 8.090350
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.25k Iteration 4     Valid Entropy 8.026399
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.35k Iteration 5     Valid Entropy 7.979509
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.43k Iteration 6     Valid Entropy 7.949336
Alpha: 0.100000  ME-alpha: 0.100000  Progress: 99.11%  Words/thread/sec: 27.35k Iteration 7     Valid Entropy 7.931067  Decay started
Alpha: 0.050000  ME-alpha: 0.050000  Progress: 99.11%  Words/thread/sec: 28.55k Iteration 8     Valid Entropy 7.827513
Alpha: 0.025000  ME-alpha: 0.025000  Progress: 99.11%  Words/thread/sec: 28.37k Iteration 9     Valid Entropy 7.759574
Alpha: 0.012500  ME-alpha: 0.012500  Progress: 99.11%  Words/thread/sec: 28.45k Iteration 10    Valid Entropy 7.714383
Alpha: 0.006250  ME-alpha: 0.006250  Progress: 99.11%  Words/thread/sec: 28.51k Iteration 11    Valid Entropy 7.684731
Alpha: 0.003125  ME-alpha: 0.003125  Progress: 99.11%  Words/thread/sec: 28.64k Iteration 12    Valid Entropy 7.668839  Retry 1/2
Alpha: 0.001563  ME-alpha: 0.001563  Progress: 99.11%  Words/thread/sec: 28.25k Iteration 13    Valid Entropy 7.668437  Retry 2/2
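
The log can be read with two small formulas. This sketch assumes that Valid Entropy is reported in bits per word and that "relative entropy improvement" means previous entropy divided by current entropy; neither is stated in the help text, but the ratios are consistent with the log, since the first ratio below the default -stop value 1.003 lines up with "Retry 1/2". The function names are illustrative.

```python
def perplexity(entropy_bits):
    """Convert a per-word entropy in bits into perplexity."""
    return 2.0 ** entropy_bits

def improvement_ratio(previous_entropy, current_entropy):
    """A ratio above 1 means the validation entropy went down."""
    return previous_entropy / current_entropy

# Validation entropies copied from iterations 10 to 13 of the log above.
entropies = [7.714383, 7.684731, 7.668839, 7.668437]
for prev, cur in zip(entropies, entropies[1:]):
    print(round(improvement_ratio(prev, cur), 5), round(perplexity(cur), 1))
```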

 

jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls -l
total 8184
-rw-rw-r-- 1 jerry jerry   11358 Aug 25 15:08 LICENSE
-rw-rw-r-- 1 jerry jerry     407 Aug 25 15:08 Makefile
-rw-rw-r-- 1 jerry jerry    8325 Aug 25 15:08 README.txt
-rw-rw-r-- 1 jerry jerry  109943 Aug 25 18:05 result.rnnlm
-rw-rw-r-- 1 jerry jerry 8040020 Aug 25 18:05 result.rnnlm.nnet
-rwxrwxr-x 1 jerry jerry  142501 Aug 25 15:08 rnnlm
-rw-rw-r-- 1 jerry jerry   33936 Aug 25 15:08 rnnlm.c
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$

vi  result.rnnlm

</s> 42068
the 50770
<unk> 45020
N 32481
of 24400
to 23638
a 21196
in 18000
and 17474
's 9784

 

Chinese Topic Modeling (LDA) with Gensim

Environment: Ubuntu 12.04, gensim, jieba

The Chinese corpus is the reduced edition (tar.gz, 24 MB) from http://www.sogou.com/labs/dl/c.html
jerry@hq:/u01/jerry/Reduced$ ls
C000008  C000010  C000013  C000014  C000016  C000020  C000022  C000023  C000024

Category of each folder:
C000007 Automobiles
C000008 Finance
C000010 IT
C000013 Health
C000014 Sports
C000016 Travel
C000020 Education
C000022 Recruitment
C000023 Culture
C000024 Military

The steps are as follows:

import jieba, os
from gensim import corpora, models, similarities

train_set = []

# Segment every file under the corpus directory into a list of words
walk = os.walk('/u01/jerry/Reduced')
for root, dirs, files in walk:
    for name in files:
        f = open(os.path.join(root, name), 'r')
        raw = f.read()
        f.close()
        word_list = list(jieba.cut(raw, cut_all = False))
        train_set.append(word_list)

dic = corpora.Dictionary(train_set)
corpus = [dic.doc2bow(text) for text in train_set]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lda = models.LdaModel(corpus_tfidf, id2word = dic, num_topics = 10)
corpus_lda = lda[corpus_tfidf]

>>> for i in range(0, 10):
...     print lda.print_topic(i)

0.000*康宁 + 0.000*sohu2 + 0.000*wmv + 0.000*bbn7 + 0.000*mmst + 0.000*cid + 0.000*icp + 0.000*沙尘 + 0.000*性骚扰 + 0.000*乌里韦
0.000*media + 0.000*mid + 0.000*stream + 0.000*bbn7 + 0.000*mmst + 0.000*sohu2 + 0.000*cid + 0.000*icp + 0.000*wmv + 0.000*that
0.012* + 0.000*米兰 + 0.000*老板 + 0.000*男人 + 0.000*女人 + 0.000*她 + 0.000*小说 + 0.000*病人 + 0.000*我 + 0.000*女性
0.002*& + 0.002*nbsp + 0.001*0 + 0.001*; + 0.001*西安 + 0.001*报名 + 0.001*1 + 0.001*∶ + 0.001*00 + 0.001*5
0.002*手机 + 0.002*孩子 + 0.001*球 + 0.001*国家队 + 0.001*胜 + 0.001*教练 + 0.001*; + 0.001*名单 + 0.001*阅读 + 0.001*高校
0.001*’ + 0.000* + 0.000*= + 0.000*var + 0.000*height + 0.000*width + 0.000*NewWin + 0.000*} + 0.000*{ + 0.000*+
0.003*  + 0.002*比赛 + 0.002*我 + 0.002*  + 0.001*; + 0.001*- + 0.001*, + 0.001*他 + 0.001*& + 0.001*―
0.000*航班 + 0.000*劳动合同 + 0.000*最低工资 + 0.000*农民工 + 0.000*养老保险 + 0.000*劳动者 + 0.000*用人单位 + 0.000*养老 + 0.000*上调 + 0.000*锦江
0.000*面板 + 0.000*碘 + 0.000*食物 + 0.000*维生素 + 0.000*营养 + 0.000*皮肤 + 0.000*蛋白质 + 0.000*药物 + 0.000*症状 + 0.000*体内
0.000* + 0.000*EMC + 0.000*包机 + 0.000*基金 + 0.000*陆纯初 + 0.000*南越 + 0.000*Kashya + 0.000*西沙群岛 + 0.000*Clariion + 0.000*西沙

The resulting topic model does not look very good; the num_topics parameter may need to be increased.
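
Several of the topics above are dominated by punctuation, digits and HTML or JavaScript residue (nbsp, var, width and so on), so cleaning the tokens before building the dictionary may help at least as much as changing num_topics. Here is a minimal sketch of such a filter; the function names and the MARKUP_TOKENS list are assumptions to adapt, and the filter would be applied to word_list before it is appended to train_set.

```python
import unicodedata

# Markup residue seen in the topics above; extend as needed.
MARKUP_TOKENS = {'nbsp', 'var', 'height', 'width'}

def is_content_token(token):
    """Keep tokens that contain at least one letter and are not markup residue."""
    token = token.strip()
    if not token or token.lower() in MARKUP_TOKENS:
        return False
    # Unicode categories starting with 'L' are letters, which includes CJK ideographs.
    return any(unicodedata.category(ch).startswith('L') for ch in token)

def clean_tokens(tokens):
    """Return only the tokens worth feeding into the dictionary."""
    return [t for t in tokens if is_content_token(t)]

print(clean_tokens([u'手机', u'&', u'nbsp', u';', u' ', u'比赛', u'00', u'EMC']))
```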

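Installing and Using RocksDB and pyrocksdb

Environment: Ubuntu 12.04, RocksDB, pyrocksdb

RocksDB is a key-value store that Facebook built as an improved version of Google's LevelDB. It is similar to memcache and redis, supports RAM, flash and disk storage, and writes roughly 10 times faster than LevelDB, which sounds rather impressive; see https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks. Enough talk, let's install it and try it out.

Installation steps:

Installing RocksDB: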

sudo git clone https://github.com/facebook/rocksdb.git
cd rocksdb

vi Makefile
Change this line: OPT += -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer
to: OPT += -O2 -lrt -fno-omit-frame-pointer -momit-leaf-frame-pointer

Add export LD_PRELOAD=/lib/x86_64-linux-gnu/librt.so.1 to ~/.bashrc, then apply it with source ~/.bashrc

(These two steps fix the error "undefined symbol: clock_gettime")

sudo git checkout 2.8.fb
sudo make shared_lib

cd ..
sudo chown jerry:jerry rocksdb -Rf
cd rocksdb

sudo cp librocksdb.so /usr/local/lib
sudo mkdir -p /usr/local/include/rocksdb/
sudo cp -r ./include/* /usr/local/include/

(These three steps fix the error "Fatal error: rocksdb/slice.h: No such file or directory")

Installing pyrocksdb:
sudo pip install "Cython>=0.20"
sudo pip install git+git://github.com/stephan-hof/pyrocksdb.git@v0.2.1

The installation is now complete.
Start Python and try pyrocksdb:

jerry@hq:/u01/rocksdb$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rocksdb
>>> db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
>>> db.put(b"key1", b"v1")
>>> db.put(b"key2", b"v2")
>>> db.get(b"key1")
'v1'

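An Evaluation of Five Open-Source Deep Learning Packages

Environment: Windows 7, Ubuntu 12.04, H2O, RStudio, Pylearn2, Caffe, Cuda_convnet2, Octave

Java: H2O

See my article: http://blog.itpub.net/16582684/viewspace-1255976/

Pros: runs on a CPU cluster, parallel and distributed; integrates with R, which makes data handling convenient
Cons: no GPU support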

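C++: Caffe, Cuda_convnet2

See my articles: http://blog.itpub.net/16582684/viewspace-1256400/      http://blog.itpub.net/16582684/viewspace-1254584/

Caffe  Pros: supports both CPU and GPU, offers Python and MATLAB interfaces, computes fairly fast, and currently gives comparatively good image classification results
Cons: no cluster support

Cuda_convnet2:  Pros: supports multiple GPUs on a single machine
Cons: no CPU support, somewhat complicated to operate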

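Python: Pylearn2

See my article: http://blog.itpub.net/16582684/viewspace-1243187/

Pros: supports both CPU and GPU
Cons: no concurrency or cluster support

Octave/Matlab: DeepLearnToolbox

See my article: http://blog.itpub.net/16582684/viewspace-1255317/

Pros: concise code that makes the algorithms easy to understand
Cons: only supports computation on a single CPU

Summary: each of these packages has scenarios it suits, and there is no single best one. Deep learning architectures are currently moving in two directions: 1. GPU clusters, 2. mixed CPU and GPU clusters. Open-source implementations of the first already exist, while only one or two companies have implemented the second so far.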

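A Brief Introduction to Cascalog

Environment: CentOS 5.7, CDH 4.2.0

Cascalog is a DSL defined in Clojure on top of Cascading and Hadoop. Thanks to Clojure's metadata and functional programming paradigm, it is well suited to defining functions and queries.

Here is a walkthrough of a typical setup:

1. Create a project with lein
lein new cascalog_incanter

2. Change into cascalog_incanter and edit project.clj as follows: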

(defproject cascalog_incanter "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [incanter "1.5.5"]]
  :repositories [["conjars.org" "http://conjars.org/repo"]
                 ["cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos/"]]
  :profiles {
    :provided {
      :dependencies [
        ;[org.apache.hadoop/hadoop-core "1.2.1"] ; Apache Hadoop MapReduce v1
        ;[org.apache.hadoop/hadoop-core "2.0.0-mr1-cdh4.2.0"] ; CDH 4.2.0 MapReduce v1
        [org.apache.hadoop/hadoop-common "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 YARN
        [org.apache.hadoop/hadoop-mapreduce-client-core "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0 MapReduce v2
      ]
    }
    :dev {
      :dependencies [
        [org.apache.hadoop/hadoop-minicluster "2.0.0-cdh4.2.0"] ; Cloudera Hadoop 4.2.0
      ]}
  }
)

3. Start the REPL
lein repl

4. Follow the examples at http://cascalog.org/articles/getting_started.html
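
For readers who have not seen Cascalog before: a query names its output variables and lists predicates that bind and constrain them, and Cascalog then joins and filters the underlying tuples. The Python sketch below only mimics that idea on an in-memory list. The data and the function name are invented for illustration, and none of it is Cascalog API.

```python
def people_younger_than(age_rows, limit):
    """Return the names from all (name, age) rows whose age is below limit."""
    return [person for person, age in age_rows if age < limit]

# Toy (person, age) tuples, standing in for a Cascalog data source.
age_rows = [('alice', 28), ('bob', 33), ('chris', 40), ('david', 25)]
print(people_younger_than(age_rows, 30))
```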