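Accessing DBpedia file contents with Titan-Hadoop

Environment: CentOS, Titan-0.5.0-Hadoop2

Titan-Hadoop can read RDF data in N_TRIPLES format. Download an nt-format file from DBpedia (for example: http://data.dws.informatik.uni-mannheim.de/dbpedia/2014/zh/labels_en_uris_zh.nt.bz2), decompress it so that it matches the .nt path configured below, and write an input properties file as follows: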
[cloudera@localhost titan-0.5.0-hadoop2]$ vi conf/hadoop/rdf-input.properties

# input graph parameters
titan.hadoop.input.format=com.thinkaurelius.titan.hadoop.formats.edgelist.rdf.RDFInputFormat
titan.hadoop.input.location=examples/labels_en_uris_zh.nt
titan.hadoop.input.conf.format=N_TRIPLES
titan.hadoop.input.conf.as-properties=http://www.w3.org/1999/02/22-rdf-syntax-ns#type
titan.hadoop.input.conf.use-localname=true
titan.hadoop.input.conf.literal-as-property=true

# output data parameters
titan.hadoop.output.format=com.thinkaurelius.titan.hadoop.formats.graphson.GraphSONOutputFormat
titan.hadoop.sideeffect.format=org.apache.hadoop.mapreduce.lib.output.TextOutputFormat

Query the data:

[cloudera@localhost titan-0.5.0-hadoop2]$ gremlin.sh

gremlin> g = HadoopFactory.open("conf/hadoop/rdf-input.properties")

gremlin> g.V.map()

……

17:37:12 INFO  org.apache.hadoop.mapred.LocalJobRunner  - reduce > reduce
17:37:12 INFO  org.apache.hadoop.mapred.Task  - Task 'attempt_local1370056218_0005_r_000000_0' done.
17:37:13 INFO  org.apache.hadoop.mapreduce.Job  - Job job_local1370056218_0005 completed successfully
17:37:13 INFO  org.apache.hadoop.mapreduce.Job  - Counters: 35
File System Counters
FILE: Number of bytes read=2911187173
FILE: Number of bytes written=3038059762
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
Map-Reduce Framework
Map input records=405909
Map output records=405909
Map output bytes=65118176
Map output materialized bytes=66297322
Input split bytes=268
Combine input records=405909
Combine output records=405909
Reduce input groups=405909
Reduce shuffle bytes=0
Reduce input records=405909
Reduce output records=0
Spilled Records=811818
Shuffled Maps =0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=5136
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=2091909120
com.thinkaurelius.titan.hadoop.formats.edgelist.EdgeListInputMapReduce$Counters
IN_EDGES_CREATED=0
OUT_EDGES_CREATED=0
VERTEX_PROPERTIES_CREATED=1217727
VERTICES_CREATED=405909
VERTICES_EMITTED=405909
com.thinkaurelius.titan.hadoop.mapreduce.transform.PropertyMapMap$Counters
VERTICES_PROCESSED=405909
com.thinkaurelius.titan.hadoop.mapreduce.transform.VerticesMap$Counters
EDGES_PROCESSED=0
VERTICES_PROCESSED=405909
File Input Format Counters
Bytes Read=54114517
File Output Format Counters
Bytes Written=0
==>47994559900176       {label_=[慾望], _id=[47994559900176], name=[Want], uri=[http://dbpedia.org/resource/Want]}
==>60888991522182       {label_=[无机化学命名法], _id=[60888991522182], name=[IUPAC_nomenclature_of_inorganic_chemistry], uri=[http://dbpedia.org/resource/IUPAC_nomenclature_of_inorganic_chemistry]}
==>78841791384159       {label_=[诺伊斯塔特-格莱韦], _id=[78841791384159], name=[Neustadt-Glewe], uri=[http://dbpedia.org/resource/Neustadt-Glewe]}
==>78961407639797       {label_=[打狗英國領事館文化園區], _id=[78961407639797], name=[Former_British_Consulate_at_Takao], uri=[http://dbpedia.org/resource/Former_British_Consulate_at_Takao]}
==>95522075072286       {label_=[賴琳恩], _id=[95522075072286], name=[Lene_Lai], uri=[http://dbpedia.org/resource/Lene_Lai]}
==>153451821264409      {label_=[唐古韭], _id=[153451821264409], name=[Allium_tanguticum], uri=[http://dbpedia.org/resource/Allium_tanguticum]}
==>154857715280524      {label_=[温带], _id=[154857715280524], name=[Temperate_climate], uri=[http://dbpedia.org/resource/Temperate_climate]}
==>166027168671115      {label_=[GSh-18手槍], _id=[166027168671115], name=[GSh-18], uri=[http://dbpedia.org/resource/GSh-18]}
==>166513572484984      {label_=[WMA], _id=[166513572484984], name=[WMA], uri=[http://dbpedia.org/resource/WMA]}
==>182078824443170      {label_=[保罗·纳斯], _id=[182078824443170], name=[Paul_Nurse], uri=[http://dbpedia.org/resource/Paul_Nurse]}
==>211356647821663      {label_=[克魯克斯頓 (明尼蘇達州)], _id=[211356647821663], name=[Crookston,_Minnesota], uri=[http://dbpedia.org/resource/Crookston,_Minnesota]}
==>222227245802710      {label_=[我的女友是九尾狐], _id=[222227245802710], name=[My_Girlfriend_Is_a_Nine-Tailed_Fox], uri=[http://dbpedia.org/resource/My_Girlfriend_Is_a_Nine-Tailed_Fox]}
==>229972043766751      {label_=[李天荣], _id=[229972043766751], name=[Wilson_Lee_Flores], uri=[http://dbpedia.org/resource/Wilson_Lee_Flores]}
==>247488956381743      {label_=[1,2-双(二异丙基膦)乙烷], _id=[247488956381743], name=[1,2-Bis(diisopropylphosphino)ethane], uri=[http://dbpedia.org/resource/1,2-Bis(diisopropylphosphino)ethane]}
==>264200262547493      {label_=[欽迪龍屬], _id=[264200262547493], name=[Chindesaurus], uri=[http://dbpedia.org/resource/Chindesaurus]}
==>…
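The vertex maps above suggest how each triple ends up as a vertex: the subject URI becomes uri, its local name becomes name, and the literal object becomes a property keyed by the local name of the predicate. The Python 3 sketch below illustrates that mapping for one N-Triples line. It is not Titan's implementation; the function names, the sample triple (rdfs:label predicate, @zh language tag) and the renaming of label to label_ are assumptions inferred from the output.

```python
import re

# Matches a simple N-Triples line whose object is a plain or language-tagged literal.
TRIPLE_RE = re.compile(
    r'^<([^>]+)>\s+<([^>]+)>\s+"((?:[^"\\]|\\.)*)"(?:@[\w-]+)?\s*\.\s*$'
)


def local_name(uri):
    """Return the part of a URI after the last '#' or '/'."""
    return re.split(r"[#/]", uri)[-1]


def triple_to_vertex(line):
    """Map one literal-valued triple to a dict shaped like the vertex maps above."""
    match = TRIPLE_RE.match(line.strip())
    if match is None:
        return None  # not a literal-valued triple this sketch understands
    subject, predicate, literal = match.groups()
    key = local_name(predicate)
    if key == "label":
        key = "label_"  # assumption: mirrors the key seen in the Gremlin output
    return {"uri": subject, "name": local_name(subject), key: literal}


# Hypothetical sample line, modelled on the first vertex in the output.
sample = ('<http://dbpedia.org/resource/Want> '
          '<http://www.w3.org/2000/01/rdf-schema#label> "慾望"@zh .')
vertex = triple_to_vertex(sample)
```

Here the options use-localname=true and literal-as-property=true correspond to the local_name() call and to storing the literal as a property instead of creating an edge.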

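Installing a Clojure runtime environment on Windows

Environment: Windows 7

Sometimes Clojure code has to be written on Windows, so a Clojure runtime environment needs to be installed there.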

Method 1:

1. Install the curl tool first

2. Download lein.bat from https://raw.githubusercontent.com/technomancy/leiningen/stable/bin/lein.bat

3. Set the environment variables
set HTTP_CLIENT=curl --proxy-ntlm --insecure -f -L -o
set HTTPS_PROXY=

4. Install: lein.bat self-install

5. Create a project and enter the interactive environment
lein new project_name
lein.bat repl

Method 2:

1. Download the clojure-x.x.x.jar file directly

2. Start the interactive environment
java -jar clojure-x.x.x.jar

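Fixing garbled Chinese output from sqlite3 in a DOS window

When working with sqlite3 in a DOS window, Chinese characters stored in SQLite are displayed as garbage, because the DOS window defaults to GBK encoding while SQLite usually stores text as UTF-8.

Open a DOS window, type chcp 65001 and press Enter. Note: code page 65001 is UTF-8, and 936 is GBK.

Right-click the title bar of the DOS window, choose Properties from the menu that pops up, and in the dialog change the font to Lucida Console.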
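To see why the code page matters, the following Python 3 sketch stores a Chinese string in an in-memory SQLite database, reads back its raw bytes, and decodes them both ways. The table name and the sample string are made up for illustration; the point is only that UTF-8 bytes interpreted as GBK no longer give the original text.

```python
import sqlite3

# Store a Chinese string in a fresh database (default text encoding: UTF-8).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT)")
conn.execute("INSERT INTO t VALUES (?)", ("中文",))

# CAST(... AS BLOB) returns the bytes exactly as SQLite stores them.
raw = conn.execute("SELECT CAST(name AS BLOB) FROM t").fetchone()[0]
conn.close()

# What a GBK console (code page 936) would make of those bytes.
as_gbk = raw.decode("gbk", errors="replace")
# What a UTF-8 console (code page 65001) would make of them.
as_utf8 = raw.decode("utf-8")

print(as_gbk == "中文")   # the GBK reading does not match the stored text
print(as_utf8 == "中文")  # the UTF-8 reading does
```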

 

Installing psycopg2 with Anaconda

Environment: Ubuntu 12.04, Anaconda-1.9.2-Linux-x86_64

First install binstar:

conda install binstar

Then use binstar to search for the channels that host a psycopg2 package:

binstar search -t conda psycopg2
Run 'binstar show <USER/PACKAGE>' to get more details:
Packages:
Name                      | Access    | Package Types | Summary
------------------------- | --------- | ------------- | --------------------
auto/psycopg2database     | published | conda         | http://jimmyg.org/work/code/psycopg2database/index.html
bencpeters/psycopg2       | public    | conda         | Python-PostgreSQL Database Adapter
chuongdo/psycopg2         | public    | conda         | Python-PostgreSQL Database Adapter
dan_blanchard/psycopg2    | public    | conda         | http://initd.org/psycopg/
davidbgonzalez/psycopg2   | public    | conda         | None
deric/psycopg2            | public    | conda         | None
jonrowland/psycopg2       | public    | conda         | None
kevincal/psycopg2         | published | conda         |
topper/psycopg2-windows   | public    | conda         | PostgreSQL adapter for the Python programming language
trent/psycopg2            | public    | conda         | None
Found 10 packages

This shows that one of the available channels is bencpeters/psycopg2.

Install from it as follows:

 conda install -c https://conda.binstar.org/bencpeters psycopg2
Fetching package metadata: .Error: unknown host: http://repo.continuum.io/pkgs/pro/linux-64/
.Error: unknown host: http://repo.continuum.io/pkgs/free/linux-64/
.
Solving package specifications: .
Package plan for installation in environment /home/jerry/anaconda:

The following packages will be downloaded:

package                    |            build
---------------------------|-----------------
psycopg2-2.5.3             |           py27_0         393 KB

The following packages will be linked:

package                    |            build
---------------------------|-----------------
psycopg2-2.5.3             |           py27_0   hard-link

Proceed ([y]/n)? y

Fetching packages ...
psycopg2-2.5.3-py27_0.tar.bz2 28% |#######################
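Once the download finishes, one way to confirm that the package is visible to the Anaconda Python is simply to try importing it. A minimal sketch, assuming it is run with the same interpreter that conda manages (the helper name is hypothetical; the code works on both Python 2.7 and Python 3):

```python
import importlib


def is_importable(module_name):
    """Return True if the named module can be imported by this interpreter."""
    try:
        importlib.import_module(module_name)
    except ImportError:
        return False
    return True


# True only if the conda install above succeeded for this interpreter.
print(is_importable("psycopg2"))
```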

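Enabling logging for the cron service on Ubuntu

Environment: Ubuntu 12.04

Edit the rsyslog configuration: in /etc/rsyslog.d/50-default.conf, remove the # in front of #cron.*;
Restart the rsyslog service: service rsyslog restart;
Restart the cron service: service cron restart;

Then view the cron log:

more /var/log/cron.log

Add a scheduled job: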

sudo crontab -e

50 * * * * . $HOME/.bashrc;$HOME/get_wealth.py >/dev/null 2>&1
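The five leading fields of the crontab entry are minute, hour, day of month, month and day of week, so 50 * * * * means "at minute 50 of every hour". The Python sketch below checks a timestamp against such an expression. It is a deliberately simplified matcher written for illustration: it only understands * and single numbers (no ranges, lists or steps), and the function names are hypothetical.

```python
from datetime import datetime


def field_matches(field, value):
    """A field matches if it is '*' or equals the given number."""
    return field == "*" or int(field) == value


def cron_matches(expr, when):
    """Return True if the datetime 'when' satisfies the 5-field expression."""
    minute, hour, day, month, weekday = expr.split()
    # cron counts Sunday as 0; isoweekday() counts it as 7, hence the modulo.
    return (field_matches(minute, when.minute)
            and field_matches(hour, when.hour)
            and field_matches(day, when.day)
            and field_matches(month, when.month)
            and field_matches(weekday, when.isoweekday() % 7))


fires = cron_matches("50 * * * *", datetime(2014, 12, 1, 9, 50))
skips = cron_matches("50 * * * *", datetime(2014, 12, 1, 9, 51))
```

The dates in the usage lines are arbitrary; any time with minute 50 matches.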

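Installing ImportExportTools in Thunderbird

Environment: Windows 7, Thunderbird 31.3.0

Because the mail content is needed for NLP, it has to be exported and preprocessed. For Thunderbird, this requires installing the ImportExportTools add-on.

Local installation:

Download the tool from https://addons.mozilla.org/zh-cn/thunderbird/addon/importexporttools/, then click "Add-ons" in Thunderbird to open the Add-ons Manager. There is a gear button in the upper-right corner; click it, choose "Install Add-on From File...", and install the downloaded file.

Thunderbird then offers an option to import or export in mbox/eml format.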
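For the preprocessing step, an exported mbox file can be read with the mailbox module from the Python standard library. The sketch below pulls the subject and the first text/plain body out of each message; the function name is hypothetical, and real mail will usually need more care (HTML-only messages, encoded subject headers, unknown charsets).

```python
import mailbox


def iter_plain_bodies(mbox_path):
    """Yield (subject, body) for each message that has a text/plain part."""
    # create=False: fail on a wrong path instead of creating an empty file.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            for part in msg.walk():
                if part.get_content_type() == "text/plain":
                    payload = part.get_payload(decode=True)
                    charset = part.get_content_charset() or "utf-8"
                    body = payload.decode(charset, errors="replace")
                    yield msg["subject"], body
                    break  # only the first plain-text part per message
    finally:
        box.close()
```

It would be called as, for example, for subject, body in iter_plain_bodies("Inbox.mbox"), where Inbox.mbox stands for whatever file the export produced.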

 

Vectorizing linear regression in Matlab

Environment: Ubuntu 12.04, Matlab 2013

The data is as follows:

jerry@hq:~/ml-class/mlclass-ex1$ more ex1data1.txt
6.1101,17.592
5.5277,9.1302
8.5186,13.662
7.0032,11.854
5.8598,6.8233
8.3829,11.886
7.4764,4.3483
8.5781,12
6.4862,6.5987
5.0546,3.8166
5.7107,3.2522
14.164,15.505
5.734,3.1551
8.4084,7.2258
……

 

Fit a linear regression using gradient descent; the code is as follows:

matlab -nodesktop

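data = load('ex1data1.txt');
X = data(:, 1); y = data(:, 2);    % feature column and target column
m = length(y);                     % number of training examples
X = [ones(m, 1), X];               % prepend a column of ones for the intercept
theta = zeros(size(data, 2), 1);   % initial parameters [0; 0]

for i = 1:1500                     % 1500 iterations, learning rate 0.01
htheta = X * theta;                % predictions for all examples
theta = theta - (0.01 / m * sum(repmat((htheta - y), 1, 2) .* X, 1))';
end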
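Written out, the update inside the loop is one matrix expression. The column sum of repmat(htheta - y, 1, 2) .* X has j-th entry \sum_i (h_i - y_i) X_{ij}, which is the j-th entry of X^{\top}(X\theta - y). Assuming the usual least-squares cost J (the post does not state it explicitly), the loop computes:

```latex
\begin{aligned}
h &= X\theta \\
J(\theta) &= \frac{1}{2m}\,(X\theta - y)^{\top}(X\theta - y) \\
\theta &\leftarrow \theta - \frac{\alpha}{m}\,X^{\top}(X\theta - y), \qquad \alpha = 0.01
\end{aligned}
```

In Matlab the same step could therefore also be written as theta = theta - (0.01 / m) * X' * (htheta - y).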

 

Another, more concise approach (the normal equation) is:

inv(X' * X) * X' * y
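As a cross-check of the two approaches, here is a small pure-Python sketch for the single-feature case. normal_equation expands inv(X' * X) * X' * y by hand for a 2x2 matrix, and gradient_descent repeats the loop above. The function names are hypothetical, and the sample points are simply the first five rows of ex1data1.txt shown earlier.

```python
def normal_equation(xs, ys):
    """Closed-form least squares for y = t0 + t1 * x (one feature)."""
    m = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = m * sxx - sx * sx              # determinant of X' * X
    t0 = (sxx * sy - sx * sxy) / det
    t1 = (m * sxy - sx * sy) / det
    return t0, t1


def gradient_descent(xs, ys, alpha=0.01, iterations=1500):
    """Batch gradient descent with the same defaults as the Matlab loop."""
    m = len(xs)
    t0 = t1 = 0.0
    for _ in range(iterations):
        errors = [t0 + t1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m                              # gradient w.r.t. t0
        g1 = sum(e * x for e, x in zip(errors, xs)) / m   # gradient w.r.t. t1
        t0, t1 = t0 - alpha * g0, t1 - alpha * g1
    return t0, t1


xs = [6.1101, 5.5277, 8.5186, 7.0032, 5.8598]
ys = [17.592, 9.1302, 13.662, 11.854, 6.8233]
exact = normal_equation(xs, ys)
approx = gradient_descent(xs, ys)
```

With enough iterations the gradient descent result approaches the closed-form one; how close it gets after 1500 steps depends on the data and the learning rate.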

VirtualBox UUID already exists

Today, while creating a VirtualBox file, the following error was reported:

UUID {00c11b02-c807-443d-b10d-dfffe0ae1b96} already exists.

Result Code: E_INVALIDARG (0x80070057)
Component: VirtualBox
Interface: IVirtualBox {c28be65f-1a8f-43b4-81f1-eb60cb516e66}

 

The UUID needs to be reset, as follows:

c:\Program Files\Oracle\VirtualBox>vboxmanage internalcommands sethduuid "D:\program\virtualbox\ubuntu14\Ubuntu-14.vdi"

 

HBase performance testing tools

Environment: HBase, Yahoo! Cloud Serving Benchmark (YCSB)

Test method 1:

hbase org.apache.hadoop.hbase.PerformanceEvaluation randomRead 3

 

Test method 2:

wget https://github.com/downloads/brianfrankcooper/YCSB/ycsb-0.1.4.tar.gz
tar xfvz ycsb-0.1.4.tar.gz
cd ycsb-0.1.4

Load the data

bin/ycsb load hbase -P workloads/workloada -p columnfamily=f1 -p recordcount=10000 -s -threads 10

Run the test

bin/ycsb run hbase -P workloads/workloada -p columnfamily=f1 -p recordcount=10000 -s -threads 10
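YCSB separates a load phase, which inserts the records, from a run phase, which executes the workload against them; both take the same options. The Python sketch below builds the argument list for either phase so the two invocations cannot drift apart. The function name and its defaults are hypothetical and simply mirror the values used above.

```python
def ycsb_command(phase, workload="workloads/workloada", column_family="f1",
                 record_count=10000, threads=10):
    """Build the bin/ycsb argument list for the 'load' or 'run' phase."""
    if phase not in ("load", "run"):
        raise ValueError("phase must be 'load' or 'run'")
    return ["bin/ycsb", phase, "hbase",
            "-P", workload,
            "-p", "columnfamily=%s" % column_family,
            "-p", "recordcount=%d" % record_count,
            "-s", "-threads", str(threads)]


# Print both command lines; they could also be passed to subprocess.call().
for phase in ("load", "run"):
    print(" ".join(ycsb_command(phase)))
```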