Original site: http://kaldi.sourceforge.net/
Later moved to: http://kaldi-asr.org/ https://github.com/danpovey/kaldi-asr
A hideout for the quantified self and minimalism
sudo: luarocks: command not found
Solution:
$ sudo -s
root@hq:~# luarocks install cutorch
Installing lunit with luarocks fails with the following error:
jerry@ubuntu:~$ sudo luarocks install lunit
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/torch/rocks/master - Failed downloading https://raw.githubusercontent.com/torch/rocks/master/manifest
Warning: Failed searching manifest: Failed fetching manifest for https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master - Failed downloading https://raw.githubusercontent.com/rocks-moonscript-org/moonrocks-mirror/master/manifest
Error: No results matching query were found.
It turned out that https://raw.githubusercontent.com/torch/rocks/master/manifest was unreachable, so the only option was to switch to another server.
Method 1:
sudo luarocks install --verbose --only-server=http://rocks.moonscript.org lunit
Method 2:
jerry@ubuntu:~$ mkdir ~/.cache/luarocks/https___rocks.moonscript.org
jerry@ubuntu:~$ sudo wget https://rocks.moonscript.org/manifest-5.1 -O ~/.cache/luarocks/https___rocks.moonscript.org/manifest-5.1
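The cache directory used in method 2 looks like the server URL with `:` and `/` replaced by underscores. Below is a small Python sketch of that mapping. It assumes this naming rule holds for the installed luarocks version; the function name and the `cache_root` default are hypothetical and only mirror the path shown above.

```python
import os
import re


def manifest_cache_path(server_url, lua_version="5.1",
                        cache_root="~/.cache/luarocks"):
    """Guess where a manually downloaded manifest should be placed.

    Assumption: the cache directory name is the server URL with ':' and '/'
    replaced by '_', as the path used in method 2 suggests.
    """
    dir_name = re.sub(r"[:/]", "_", server_url)
    return os.path.join(os.path.expanduser(cache_root), dir_name,
                        "manifest-" + lua_version)


if __name__ == "__main__":
    print(manifest_cache_path("https://rocks.moonscript.org"))
```

If the printed path does not match what luarocks creates on your machine, trust the directory luarocks itself creates rather than this sketch.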
Environment: CentOS
[root@sandbox ~]# lsof -i:80
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
httpd 1118 root 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1255 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1256 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1257 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1258 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1259 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1260 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1261 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
httpd 1262 apache 4u IPv6 9178 0t0 TCP *:http (LISTEN)
Environment: Spark 1.1
When starting spark-shell, this message kept appearing:
Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
It took an afternoon of searching to work out that the following change is needed on the master node:
vi /etc/spark/conf/spark-defaults.conf
Add this entry:
spark.driver.memory=10g
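Note: spark.driver.memory sets the memory of the driver process (the spark-shell side, here on the master node), while spark.executor.memory sets the memory of each executor on the worker nodes.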
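To double-check which memory settings a configuration file actually carries, a few lines of Python are enough. This is only a sketch: the function name is hypothetical, it assumes the file uses `key value` or `key=value` lines with `#` comments, and the executor value in the sample is made up for illustration.

```python
import re


def parse_spark_defaults(text):
    """Parse spark-defaults.conf-style text into a dict of settings.

    Accepts 'key value' and 'key=value' lines; lines starting with '#'
    and blank lines are skipped.
    """
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Key, then one '=' or whitespace separator, then the value.
        match = re.match(r"([^\s=]+)\s*[=\s]\s*(.*)", line)
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


if __name__ == "__main__":
    sample = "\n".join([
        "# sample values for illustration only",
        "spark.driver.memory=10g",
        "spark.executor.memory   4g",
    ])
    conf = parse_spark_defaults(sample)
    print("driver:", conf.get("spark.driver.memory"))
    print("executor:", conf.get("spark.executor.memory"))
```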
Environment: CDH 5.2, Spark 1.1
Starting pyspark failed with the following error:
socket.gaierror: [Errno -2] Name or service not known
The cause turned out to be that /etc/hosts was missing the line
127.0.0.1 localhost localhost.localdomain
Adding that line fixed the problem.
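The same failure can be checked from Python before launching pyspark. The sketch below uses hypothetical helper names: one function lists the names defined in hosts-file text, the other reports whether a name resolves, catching the same `socket.gaierror` shown in the error above.

```python
import socket


def hosts_file_names(text):
    """Return the set of host names defined in /etc/hosts-style text."""
    names = set()
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        fields = line.split()
        if len(fields) >= 2:
            names.update(fields[1:])  # fields[0] is the IP address
    return names


def can_resolve(hostname):
    """Return True if the name resolves, False on socket.gaierror."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False


if __name__ == "__main__":
    with open("/etc/hosts") as handle:
        print("localhost in /etc/hosts:",
              "localhost" in hosts_file_names(handle.read()))
    print("localhost resolves:", can_resolve("localhost"))
```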
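Today was the first working day of the new year, and as soon as I sat down I found the employee badge I had lost earlier. I remembered losing it somewhere in the office area, but it turned out to be right behind my own chair. (This was the second time I lost it, and it cost another hundred.) The replacement had already been issued, so there was no way to get a refund. I need to be more careful next time!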
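Environment: Ubuntu 12.04, Ubuntu 14.04, WordPress 4.0, OpenCart 1.5, PostgreSQL 9.1, MySQL 5
After upgrading Ubuntu to 14.04 yesterday, I found that the previously installed WordPress and OpenCart were both completely down (cue cold sweat). Both home pages were blank, with no error message of any kind, so all I could do was search the web. The fixes were as follows.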
wordpress:
"Unable to select database" (PostgreSQL). After trying all sorts of approaches, the only thing that worked was downgrading WordPress to 3.4.2.
wget https://cn.wordpress.org/wordpress-3.4.2-zh_CN.tar.gz
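Redeploy it under /var/www. (Note that the Apache2 DocumentRoot on Ubuntu 14.04 is in a different location than before.
sudo vi /etc/apache2/sites-enabled/000-default.conf
Change DocumentRoot /var/www/html to DocumentRoot /var/www)
One more issue: the default WordPress theme has to be changed, otherwise the site still shows only a blank page.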
opencart:
Add the following to index.php so that errors are displayed:
<?php
ini_set('display_errors', 'on');
?>
Fatal error: Call to undefined function mcrypt_create_iv()
The missing function belongs to the mcrypt extension.
Reinstall mcrypt and php5-mcrypt:
sudo apt-get install mcrypt
sudo apt-get install php5-mcrypt
php -m | grep mcrypt
Enable the module:
sudo php5enmod mcrypt
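I spent almost half a day on these two applications. The fixes found online only point in a direction; you have to understand the problem in depth yourself to locate and solve it. One more thing: think twice before upgrading the system!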
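Environment: Oracle Database 11g, Gensim, jieba, Spark 1.0
Approach: first extract each person's set of search terms from the data warehouse, then segment those search terms into words and count how often each word occurs. Finally, output a user-by-term matrix whose values are the search counts.
Steps:
1. Query each person's set of search terms from the Oracle database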
select employee_id, to_char(yd_concat(q_content)) from agg_kw_daily group by employee_id;
2. Segment the text and output the user-by-term matrix
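# -*- coding: utf-8 -*-
# Python 2 script, as in the original post.
from gensim import corpora
import jieba

# Each input line holds an employee id and that employee's concatenated search terms.
q_content = [i.split(' ') for i in open('/u01/jerry/emp_query_conten.txt').readlines()]

# Segment each employee's search text with jieba.
train_set = [list(jieba.cut(i[1])) for i in q_content]

# Drop punctuation, empty tokens, and two common stop words.
stop_words = set([u',', u'_', u'-', u' ', u'.', u'', u'不', u'的'])
train_set2 = [[j for j in i if j not in stop_words] for i in train_set]

# Map tokens to integer ids and build the bag-of-words corpus.
dic = corpora.Dictionary(train_set2)
corpus = [dic.doc2bow(text) for text in train_set2]

# Keep only (term id, count) pairs with a count greater than 1.
corpus2 = [[j for j in i if j[1] > 1] for i in corpus]

# Python 2 only: make UTF-8 the default encoding so unicode terms can be written.
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

# Write the term id -> term dictionary.
output = open('/u01/jerry/qw_dic', 'w')
for key, value in dic.iteritems():
    output.write(str(key) + ' ' + value + '\n')
output.close()

# Print the (employee id, term id, count) triples.
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        print q_content[i][0], j[0], j[1]

# Write the same triples tab-separated, because the Spark step splits on '\t'.
output = open('/u01/jerry/emp_q_cnt', 'w')
for i in range(0, len(corpus2)):
    for j in corpus2[i]:
        output.write(str(q_content[i][0]) + '\t' + str(j[0]) + '\t' + str(j[1]) + '\n')
output.close()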
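For readers without gensim or jieba at hand, the counting part of step 2 can be sketched in plain Python 3. The function name, the sample tokens, and the second user id are made up for illustration, and the term ids it assigns will not match the ids that `corpora.Dictionary` produces. The filter keeps counts greater than 1, as in the script above.

```python
from collections import Counter


def build_user_term_counts(user_tokens, min_count=2):
    """Map tokens to integer ids and count them per user.

    user_tokens: list of (user_id, list_of_tokens) pairs.
    Returns (term_ids, rows), where rows holds (user_id, term_id, count)
    triples for terms that occur at least min_count times for that user.
    """
    term_ids = {}
    rows = []
    for user_id, tokens in user_tokens:
        counts = Counter(tokens)
        for term in sorted(counts):
            # Every term gets an id, even if it is filtered out below.
            term_id = term_ids.setdefault(term, len(term_ids))
            if counts[term] >= min_count:
                rows.append((user_id, term_id, counts[term]))
    return term_ids, rows


if __name__ == "__main__":
    sample = [
        ("10008", ["spark", "kaldi", "spark"]),
        ("10009", ["kaldi", "kaldi", "hive"]),
    ]
    term_ids, rows = build_user_term_counts(sample)
    for user_id, term_id, count in rows:
        print(user_id, term_id, count, sep="\t")
```

Each printed row has the same three tab-separated columns (user, term id, count) that the next step reads.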
3. Feed the output file emp_q_cnt into Spark MLlib and train a prediction model
import org.apache.spark.mllib.recommendation.ALS
import org.apache.spark.mllib.recommendation.Rating
val data = sc.textFile("/home/cloudera/emp_q_cnt")
val ratings = data.map(_.split('\t') match { case Array(user, item, rate) => Rating(user.toInt, item.toInt, rate.toDouble) })
val rank = 10
val numIterations = 1000
val model = ALS.train(ratings, rank, numIterations, 0.01)
4. Look up the predicted value for a given user and term (user 10008, term 2)
model.predict(sc.parallelize(Array((10008, 2)))).map{case Rating(user, item, rate) => ((user, item), rate)}.take(1)
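Conceptually, a matrix factorization model such as the one ALS produces scores a (user, item) pair as the dot product of two learned factor vectors. The Python sketch below shows only that arithmetic. The factor values are made up, the rank is 3 instead of the 10 used above, and the function name is hypothetical; it is not the MLlib API.

```python
def predict_rating(user_factors, item_factors):
    """Predicted rating = dot product of the two latent factor vectors."""
    if len(user_factors) != len(item_factors):
        raise ValueError("factor vectors must have the same length (the rank)")
    return sum(u * v for u, v in zip(user_factors, item_factors))


if __name__ == "__main__":
    # Made-up factors for user 10008 and term 2, rank 3.
    user_10008 = [0.5, 1.0, -0.25]
    term_2 = [2.0, 0.5, 4.0]
    print(predict_rating(user_10008, term_2))  # 1.0 + 0.5 - 1.0 = 0.5
```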
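Environment: Ubuntu 12.04, Kaldi
An article on deep learning applied to NLP (see http://licstar.net/archives/328 for details) introduces the concept of word vectors (in English: distributed representation, word representation, or word embedding). Training word vectors is part of Mikolov's RNNLM, and Kaldi includes an example implementation.
1. I switched to the Kaldi directory /u01/kaldi/tools but found no rnnlm directory there. The checkout was probably too old, so I downloaded the directory directly: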
svn co https://svn.code.sf.net/p/kaldi/code/trunk/tools/rnnlm-hs-0.1b
2. Build it:
cd rnnlm-hs-0.1b
make
This produces the rnnlm executable:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm
RNNLM based on WORD VECTOR estimation toolkit v 0.1b
Options:
Parameters for training:
-train <file>
Use text data from <file> to train the model
-valid <file>
Use text data from <file> to perform validation and control learning rate
-test <file>
Use text data from <file> to compute logprobs with an existing model
-rnnlm <file>
Use <file> to save the resulting language model
-hidden <int>
Set size of hidden layer; default is 100
-bptt <int>
Set length of BPTT unfolding; default is 3; set to 0 to disable truncation
-bptt-block <int>
Set period of BPTT unfolding; default is 10; BPTT is performed each bptt+bptt_block steps
-gen <int>
Sampling mode; number of sentences to sample, default is 0 (off); enter negative number for interactive mode
-threads <int>
Use <int> threads (default 1)
-min-count <int>
This will discard words that appear less than <int> times; default is 0
-alpha <float>
Set the starting learning rate; default is 0.1
-maxent-alpha <float>
Set the starting learning rate for maxent; default is 0.1
-reject-threshold <float>
Reject nnet and reload nnet from previous epoch if the relative entropy improvement on the validation set is below this threshold (default 0.997)
-stop <float>
Stop training when the relative entropy improvement on the validation set is below this threshold (default 1.003); see also -retry
-retry <int>
Stop training iff N retries with halving learning rate have failed (default 2)
-debug <int>
Set the debug mode (default = 2 = more info during training)
-direct-size <int>
Set the size of hash for maxent parameters, in millions (default 0 = maxent off)
-direct-order <int>
Set the order of n-gram features to be used in maxent (default 3)
-beta1 <float>
L2 regularisation parameter for RNNLM weights (default 1e-6)
-beta2 <float>
L2 regularisation parameter for maxent weights (default 1e-6)
-recompute-counts <int>
Recompute train words counts, useful for fine-tuning (default = 0 = use counts stored in the vocab file)
Examples:
./rnnlm -train data.txt -valid valid.txt -rnnlm result.rnnlm -debug 2 -hidden 200
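3. Try the wsj example in Kaldi
Download a repository that includes wsj: git clone https://github.com/foundintranslation/Kaldi.git
Copy wsj/s1 out of it: cp wsj/s1 /u01/kaldi/egs/wsj/ -Rf
The wsj recipe turned out to need its source data from DVD discs, which I could not obtain, so this route was a dead end.
4. From http://www.fit.vutbr.cz/~imikolov/rnnlm/ download
the two files that contain the program and the examples. Unpack Basic_examples; it contains the data directory: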
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls /u01/jerry/simple-examples/data
ptb.char.test.txt ptb.char.train.txt ptb.char.valid.txt ptb.test.txt ptb.train.txt ptb.valid.txt README
Start training the word vectors:
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ./rnnlm -train /u01/jerry/simple-examples/data/ptb.train.txt -valid /u01/jerry/simple-examples/data/ptb.valid.txt -rnnlm result.rnnlm -debug 2 -hidden 100
Vocab size: 10000
Words in train file: 929589
Starting training using file /u01/jerry/simple-examples/data/ptb.train.txt
Iteration 0 Valid Entropy 9.457519
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.40k Iteration 1 Valid Entropy 8.416857
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 28.18k Iteration 2 Valid Entropy 8.203366
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.98k Iteration 3 Valid Entropy 8.090350
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.25k Iteration 4 Valid Entropy 8.026399
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 5 Valid Entropy 7.979509
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.43k Iteration 6 Valid Entropy 7.949336
Alpha: 0.100000 ME-alpha: 0.100000 Progress: 99.11% Words/thread/sec: 27.35k Iteration 7 Valid Entropy 7.931067 Decay started
Alpha: 0.050000 ME-alpha: 0.050000 Progress: 99.11% Words/thread/sec: 28.55k Iteration 8 Valid Entropy 7.827513
Alpha: 0.025000 ME-alpha: 0.025000 Progress: 99.11% Words/thread/sec: 28.37k Iteration 9 Valid Entropy 7.759574
Alpha: 0.012500 ME-alpha: 0.012500 Progress: 99.11% Words/thread/sec: 28.45k Iteration 10 Valid Entropy 7.714383
Alpha: 0.006250 ME-alpha: 0.006250 Progress: 99.11% Words/thread/sec: 28.51k Iteration 11 Valid Entropy 7.684731
Alpha: 0.003125 ME-alpha: 0.003125 Progress: 99.11% Words/thread/sec: 28.64k Iteration 12 Valid Entropy 7.668839 Retry 1/2
Alpha: 0.001563 ME-alpha: 0.001563 Progress: 99.11% Words/thread/sec: 28.25k Iteration 13 Valid Entropy 7.668437 Retry 2/2
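The log above can be read against the `-stop` option (default 1.003) from the help text. The Python sketch below assumes that "relative entropy improvement" means the previous validation entropy divided by the current one; that reading is an assumption, not something taken from the tool's source. The entropy values are copied from the log, and the function name is hypothetical.

```python
def weak_improvements(entropies, threshold=1.003):
    """Return iteration numbers whose validation entropy improved too little.

    Improvement is taken as previous_entropy / current_entropy
    (an assumption about how the tool measures it).
    """
    flagged = []
    for i in range(1, len(entropies)):
        if entropies[i - 1] / entropies[i] < threshold:
            flagged.append(i)
    return flagged


# Valid Entropy for iterations 0..13, copied from the training log.
VALID_ENTROPY = [
    9.457519, 8.416857, 8.203366, 8.090350, 8.026399, 7.979509, 7.949336,
    7.931067, 7.827513, 7.759574, 7.714383, 7.684731, 7.668839, 7.668437,
]

if __name__ == "__main__":
    print(weak_improvements(VALID_ENTROPY))  # [7, 12, 13]
```

Under that assumption the flagged iterations are 7, 12 and 13, which are the lines where the log prints "Decay started", "Retry 1/2" and "Retry 2/2".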
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$ ls -l
total 8184
-rw-rw-r-- 1 jerry jerry 11358 Aug 25 15:08 LICENSE
-rw-rw-r-- 1 jerry jerry 407 Aug 25 15:08 Makefile
-rw-rw-r-- 1 jerry jerry 8325 Aug 25 15:08 README.txt
-rw-rw-r-- 1 jerry jerry 109943 Aug 25 18:05 result.rnnlm
-rw-rw-r-- 1 jerry jerry 8040020 Aug 25 18:05 result.rnnlm.nnet
-rwxrwxr-x 1 jerry jerry 142501 Aug 25 15:08 rnnlm
-rw-rw-r-- 1 jerry jerry 33936 Aug 25 15:08 rnnlm.c
jerry@hq:/u01/kaldi/tools/rnnlm-hs-0.1b$
vi result.rnnlm
</s> 42068
the 50770
<unk> 45020
N 32481
of 24400
to 23638
a 21196
in 18000
and 17474
's 9784