The Azkaban scheduling system

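Azkaban consists of three key components:

Relational database: a MySQL database, used mainly to store flows, permissions, task status, task schedules, and similar information.
AzkabanWebServer: gives users the features for managing flows, task schedules, permissions, and so on.
AzkabanExecutorServer: executes tasks and saves their output logs to MySQL. Several AzkabanExecutorServer instances can run at the same time; they coordinate their work by reading flow status from MySQL.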

 

Installation steps

1. Create the azkaban database and load the metadata tables (from azkaban-sql-script-2.5.0.tar.gz)

CREATE DATABASE azkaban;
GRANT ALL PRIVILEGES ON azkaban.* TO 'hq'@'%';

mysql> source create-all-sql-2.5.0.sql

2. Download and install azkaban-web-server-2.5.0.tar.gz

tar xvf azkaban-web-server-2.5.0.tar.gz

Create the SSL configuration with the following commands:
keytool -keystore keystore -alias jetty -genkey -keyalg RSA
cp keystore azkaban-web-2.5.0/

cd azkaban-web-2.5.0

Edit the configuration parameters
vi conf/azkaban.properties

default.timezone.id=Asia/Shanghai

database.type=mysql
mysql.port=3306
mysql.host=xxx.xxx.xxx.xxx
mysql.database=azkaban
mysql.user=hq
mysql.password=xxxxxx

jetty.keystore=keystore
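# SSL passwords entered when the keystore was created; keep comments on their own lines, since a trailing "#..." would become part of the property value
jetty.password=azkaban
jetty.keypassword=azkaban
jetty.truststore=keystore
jetty.trustpassword=azkaban

3. Download and install azkaban-executor-server-2.5.0.tar.gz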
tar xvf azkaban-executor-server-2.5.0.tar.gz
cd azkaban-executor-2.5.0

Edit the executor's runtime parameters
vi conf/azkaban.properties

mysql.host=xxx.xxx.xxx.xxx
mysql.database=azkaban
mysql.user=hq
mysql.password=xxxxxx
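
The web server and the executor each read their database settings from their own conf/azkaban.properties, and both must point at the same MySQL database. The sketch below compares the mysql.* keys of two such files. It is a minimal illustration, not a full .properties parser (no escapes or continuation lines); the function names and the sample host values are made up for the example.

```python
def parse_properties(text):
    """Parse simple key=value lines; blank lines and lines starting with # or ! are skipped."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!" or "=" not in line:
            continue
        key, value = line.split("=", 1)
        props[key.strip()] = value.strip()
    return props


def mysql_mismatches(web, executor):
    """Return the mysql.* keys whose values differ between the two configs."""
    keys = {k for k in list(web) + list(executor) if k.startswith("mysql.")}
    return sorted(k for k in keys if web.get(k) != executor.get(k))


if __name__ == "__main__":
    # Placeholder values, not taken from a real deployment
    web = parse_properties("mysql.host=10.0.0.1\nmysql.database=azkaban\nmysql.user=hq\n")
    executor = parse_properties("mysql.host=10.0.0.2\nmysql.database=azkaban\nmysql.user=hq\n")
    print(mysql_mismatches(web, executor))
```

In practice the two strings would come from reading azkaban-web-2.5.0/conf/azkaban.properties and azkaban-executor-2.5.0/conf/azkaban.properties; an empty result means the database settings agree.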

4. Start the web and executor services

cd azkaban-web-2.5.0
bin/azkaban-web-start.sh

cd azkaban-executor-2.5.0
bin/azkaban-executor-start.sh

Upgrading gcc to 4.9

Environment: Ubuntu 14.04, gcc 4.8

Install gcc 4.9

sudo add-apt-repository ppa:ubuntu-toolchain-r/test
sudo apt-get update
sudo apt-get install g++-4.9

Change the default gcc version

sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.9 150
sudo update-alternatives --install /usr/bin/gcc gcc /usr/bin/gcc-4.8 100
sudo update-alternatives --config gcc

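Extracting data from a PostgreSQL database backup

Environment: CentOS 6.4, PostgreSQL 9.4

Today I accidentally truncated a table, and my mind went blank for a moment. Fortunately, I still had a data backup from a few days earlier.

The steps are as follows:

Find the line in the dump where the table's data is loaded

grep -n "COPY tab_stats" alldb20150717.sql

Extract that line and the 3000 lines that follow it

grep -A3000 "COPY tab_stats" alldb20150717.sql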
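
A fixed -A3000 window can cut the data short or spill into the next table. In a plain-text pg_dump file, the rows of a COPY ... FROM stdin block are followed by a line containing only \. so the block can be cut out exactly. A minimal sketch, assuming a plain-text dump; the function name extract_copy_block is hypothetical, and if your dump writes the table name schema-qualified you would pass that form instead.

```python
def extract_copy_block(lines, table):
    """Return the COPY statement for `table`, its data rows, and the end-of-data line."""
    block = []
    inside = False
    for line in lines:
        if not inside and line.startswith("COPY %s " % table):
            inside = True
        if inside:
            block.append(line)
            if line.rstrip("\n") == "\\.":  # a line holding only backslash-dot ends the data
                break
    return block


if __name__ == "__main__":
    sample = [
        "SET search_path = public;\n",
        "COPY tab_stats (id, hits) FROM stdin;\n",
        "1\t10\n",
        "2\t20\n",
        "\\.\n",
        "COPY other_tab (id) FROM stdin;\n",
        "7\n",
        "\\.\n",
    ]
    for line in extract_copy_block(sample, "tab_stats"):
        print(line, end="")
```

For the real backup, an open file object can be passed as `lines`, and the returned block written to a file that is then fed to psql.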

 

PostgreSQL stress testing with pgbench

Environment: CentOS 6.5, PostgreSQL 9.4.2

pgbench is an OLTP stress-testing tool that simulates load with a TPC-B-like workload.

Usage:

./pgbench --help
pgbench is a benchmarking tool for PostgreSQL.

Usage:
  pgbench [OPTION]... [DBNAME]

Initialization options:
  -i, --initialize         invokes initialization mode
  -F, --fillfactor=NUM     set fill factor
  -n, --no-vacuum          do not run VACUUM after initialization
  -q, --quiet              quiet logging (one message each 5 seconds)
  -s, --scale=NUM          scaling factor
  --foreign-keys           create foreign key constraints between tables
  --index-tablespace=TABLESPACE
                           create indexes in the specified tablespace
  --tablespace=TABLESPACE  create tables in the specified tablespace
  --unlogged-tables        create tables as unlogged tables

Benchmarking options:
  -c, --client=NUM         number of concurrent database clients (default: 1)
  -C, --connect            establish new connection for each transaction
  -D, --define=VARNAME=VALUE
                           define variable for use by custom script
  -f, --file=FILENAME      read transaction script from FILENAME
  -j, --jobs=NUM           number of threads (default: 1)
  -l, --log                write transaction times to log file
  -M, --protocol=simple|extended|prepared
                           protocol for submitting queries (default: simple)
  -n, --no-vacuum          do not run VACUUM before tests
  -N, --skip-some-updates  skip updates of pgbench_tellers and pgbench_branches
  -P, --progress=NUM       show thread progress report every NUM seconds
  -r, --report-latencies   report average latency per command
  -R, --rate=NUM           target rate in transactions per second
  -s, --scale=NUM          report this scale factor in output
  -S, --select-only        perform SELECT-only transactions
  -t, --transactions=NUM   number of transactions each client runs (default: 10)
  -T, --time=NUM           duration of benchmark test in seconds
  -v, --vacuum-all         vacuum all four standard tables before tests
  --aggregate-interval=NUM aggregate data over NUM seconds
  --sampling-rate=NUM      fraction of transactions to log (e.g. 0.01 for 1%)

Common options:
  -d, --debug              print debugging output
  -h, --host=HOSTNAME      database server host or socket directory
  -p, --port=PORT          database server port number
  -U, --username=USERNAME  connect as specified database user
  -V, --version            output version information, then exit
  -?, --help               show this help, then exit

Report bugs to <pgsql-bugs@postgresql.org>.

 

First create the test database benchdb, initialize it, and then run the benchmark:

./pgbench -i benchdb

./pgbench -c 50 -t 10 -r benchdb

starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 1
query mode: simple
number of clients: 50
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 500/500
latency average: 0.000 ms
tps = 119.220347 (including connections establishing)
tps = 124.124611 (excluding connections establishing)
statement latencies in milliseconds:
0.007204 \set nbranches 1 * :scale
0.001772 \set ntellers 10 * :scale
0.001308 \set naccounts 100000 * :scale
0.001818 \setrandom aid 1 :naccounts
0.001568 \setrandom bid 1 :nbranches
0.001390 \setrandom tid 1 :ntellers
0.001702 \setrandom delta -5000 5000
1.218338 BEGIN;
1.219992 UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
0.406070 SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
257.778830 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
53.106822 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
0.349952 INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
7.491730 END;
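
With -r, the per-statement latencies show where the time goes. The sketch below parses such a report and ranks the statements by latency. The function name and the regular expression are assumptions based on the output format shown above; this is not a pgbench facility.

```python
import re

# A latency line looks like "<milliseconds> <statement text>"
LATENCY_LINE = re.compile(r"^\s*(\d+\.\d+)\s+(.*\S)\s*$")


def slowest_statements(report, top=3):
    """Return the `top` (latency_ms, statement) pairs from a pgbench -r report."""
    rows = []
    in_latencies = False
    for line in report.splitlines():
        if line.startswith("statement latencies"):
            in_latencies = True  # everything after this header is a latency line
            continue
        if in_latencies:
            m = LATENCY_LINE.match(line)
            if m:
                rows.append((float(m.group(1)), m.group(2)))
    return sorted(rows, reverse=True)[:top]


if __name__ == "__main__":
    report = """\
tps = 124.124611 (excluding connections establishing)
statement latencies in milliseconds:
1.218338 BEGIN;
257.778830 UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
53.106822 UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
7.491730 END;
"""
    for ms, stmt in slowest_statements(report):
        print("%10.3f ms  %s" % (ms, stmt))
```

In the run above, the two slowest statements are the UPDATEs on pgbench_tellers and pgbench_branches. At scale factor 1 those tables hold only 10 rows and 1 row, so 50 clients probably spend most of their time waiting on each other's row locks. The pgbench documentation recommends a scale factor at least as large as the number of clients for this reason.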

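Identifying the browser from the user agent string

Environment: Ubuntu 14.04

In advertising log analysis there is a user-agent field. It is part of the HTTP protocol, one of the request header fields, and it can be used to identify the browser type and version, the operating system and version, the browser engine, and similar information.

I first tried to pick out these fields with regular-expression matching on the string, but found that I could not make it work. I later found a website that provides an API which identifies them accurately.

The code is as follows: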

#!/usr/bin/python2.6
# coding: utf-8
# Python 2 script: look up a user agent string through the useragentstring.com API.

import urllib
import urllib2
import json
import pprint

ua = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.4 (KHTML, like Gecko) Chrome/13.0.782.41 Safari/535.4"
ua_url = urllib.quote(ua)  # URL-encode the user agent string
url = "http://www.useragentstring.com/?uas=%s&getJSON=all" % ua_url
req = urllib2.Request(url)
j = urllib2.urlopen(req).read()
j = json.loads(j)

# Print the parsed result as indented JSON
print json.dumps(j, indent=4, sort_keys=True)

# Print the same result with pprint
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(j)
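
On Python 3, where urllib2 no longer exists, the same request might be sketched as below. The function names build_lookup_url and lookup_user_agent are made up for illustration. The endpoint and its uas/getJSON parameters are taken from the script above; whether the service is still available has not been checked here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Endpoint copied from the Python 2 script above
API_URL = "http://www.useragentstring.com/?uas=%s&getJSON=all"


def build_lookup_url(ua):
    """Return the lookup URL with the user agent string URL-encoded."""
    return API_URL % quote(ua)


def lookup_user_agent(ua, timeout=10):
    """Query the service and return the decoded JSON reply (needs network access)."""
    with urlopen(build_lookup_url(ua), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds and prints the URL, so this runs without a network connection
    print(build_lookup_url("Mozilla/5.0 (Windows NT 6.1; WOW64)"))
```

Calling lookup_user_agent(ua) performs the actual request and returns a dictionary that can be printed with json.dumps or pprint as in the original script.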

 

Installing BVLC Caffe

Environment: Ubuntu 12.04, CUDA 6.0

1. Install the prerequisite software

pip install -r /u01/caffe/python/requirements.txt
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev

# gflags
wget https://github.com/schuhschuh/gflags/archive/master.zip
unzip master.zip
cd gflags-master
mkdir build && cd build
CXXFLAGS="-fPIC" cmake .. -DGFLAGS_NAMESPACE=google
make && make install

# glog
wget https://google-glog.googlecode.com/files/glog-0.3.3.tar.gz
tar zxvf glog-0.3.3.tar.gz
cd glog-0.3.3
./configure
make && make install

# lmdb
git clone git://gitorious.org/mdb/mdb.git
cd mdb/libraries/liblmdb
make && make install

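2. Configure the build file

cp Makefile.config.example Makefile.config
Edit Makefile.config (vi Makefile.config) and uncomment the following line, because the virtual machine has no GPU support: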
CPU_ONLY := 1

3. Compile; the build fails with the following errors:

jerry@hq:/u01/caffe$ make
g++ .build_release/tools/convert_imageset.o .build_release/lib/libcaffe.a -o .build_release/tools/convert_imageset.bin -fPIC -DCPU_ONLY -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIfE6UpdateEv[_ZN5caffe4BlobIfE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<float>(int, float, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_dataEv[_ZNK5caffe4BlobIfE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_diffEv[_ZNK5caffe4BlobIfE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIdE6UpdateEv[_ZN5caffe4BlobIdE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<double>(int, double, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_dataEv[_ZNK5caffe4BlobIdE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_diffEv[_ZNK5caffe4BlobIdE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::GlobalInit(int*, char***)’:
common.cpp:(.text+0x12a): undefined reference to `gflags::ParseCommandLineFlags(int*, char***, bool)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::Caffe()’:
common.cpp:(.text+0x179): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x1cb): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x22d): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::~Caffe()’:
common.cpp:(.text+0x434): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x456): undefined reference to `curandDestroyGenerator’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::DeviceQuery()’:
common.cpp:(.text+0x5f8): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x616): undefined reference to `cudaGetDeviceProperties’
common.cpp:(.text+0xd22): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::SetDevice(int)’:
common.cpp:(.text+0x1222): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x1247): undefined reference to `cudaSetDevice’
common.cpp:(.text+0x127b): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x12a9): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x12ce): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x12fc): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1330): undefined reference to `curandSetPseudoRandomGeneratorSeed’
common.cpp:(.text+0x1729): undefined reference to `cudaGetErrorString’
common.cpp:(.text+0x1882): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::set_random_seed(unsigned int)’:
common.cpp:(.text+0x1aff): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x1b2d): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1b5c): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<double>(int, double const*, double*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<int>(int, int const*, int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<unsigned int>(int, unsigned int const*, unsigned int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<float>(int, float const*, float*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::cpu_data()’:
syncedmem.cpp:(.text+0x26): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_cpu_data()’:
syncedmem.cpp:(.text+0x136): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::~SyncedMemory()’:
syncedmem.cpp:(.text+0x1c1): undefined reference to `cudaFree’
syncedmem.cpp:(.text+0x20f): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_gpu_data()’:
syncedmem.cpp:(.text+0x29a): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x2b9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x2e5): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x321): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x379): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x3c2): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x435): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::gpu_data()’:
syncedmem.cpp:(.text+0x4ca): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x4e9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x515): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x549): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x592): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x608): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x678): undefined reference to `cudaGetErrorString’
collect2: error: ld returned 1 exit status
make: *** [.build_release/tools/convert_imageset.bin] Error 1

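Many of the undefined references are GPU definitions, yet the build still fails even with the CPU-only option set.

4. Edit Makefile.config: comment out CPU_ONLY := 1 and set CUSTOM_CXX := g++-4.6

sudo apt-get install gcc-4.6 g++-4.6 gcc-4.6-multilib g++-4.6-multilib

Edit these two files
vi src/caffe/common.cpp
vi tools/caffe.cpp
and replace the gflags namespace with google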

make clean

make

make pycaffe
g++-4.6 -shared -o python/caffe/_caffe.so python/caffe/_caffe.cpp \
.build_release/lib/libcaffe.a -fPIC -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -I/usr/local/cuda/include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -lcudart -lcublas -lcurand -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas -lboost_python -lpython2.7

touch python/caffe/proto/__init__.py
protoc --proto_path=src --python_out=python src/caffe/proto/caffe_pretty_print.proto

protoc --proto_path=src --python_out=python src/caffe/proto/caffe.proto

Then run: sudo cp /u01/caffe/python/caffe/ /usr/local/lib/python2.7/dist-packages/ -Rf

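LDA preprocessing of Chinese Wikipedia data

Environment: Ubuntu 14.04, Gensim, jieba

First, segment the Chinese text into words:

python -m jieba wiki.zh.text.jian.utf-8 > cut_result.txt

Take the first 30,000 documents (one per line):

head -n 30000 cut_result.txt > cut_small.txt

The processing script is as follows: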

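# Python 2 script
from gensim import corpora

train_data = []

# Each line is one document; the jieba output separates tokens with "/"
with open('cut_small.txt', 'r') as f:
    for line in f:
        tokens = [t.strip() for t in line.decode('utf8').split('/')]
        train_data.append([t for t in tokens if t])

# Map every distinct token to an integer id
dic = corpora.Dictionary(train_data)

# corpus1: bag-of-words vectors, one list of (token_id, count) pairs per document
corpus1 = [dic.doc2bow(text) for text in train_data]

# corpus2: token ids in their original order
corpus2 = [[dic.token2id[t] for t in text] for text in train_data]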
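
The script builds two encodings of the same documents: corpus1 holds bag-of-words vectors, that is (token id, count) pairs with word order discarded, while corpus2 holds the token ids in their original order. The pure-Python sketch below mimics that difference on a toy input. It is a stand-in written for illustration, not gensim's implementation, and the ids it assigns need not match those of corpora.Dictionary.

```python
from collections import Counter


def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab


def to_bow(doc, vocab):
    """Bag of words: sorted (token_id, count) pairs, word order discarded."""
    return sorted(Counter(vocab[t] for t in doc).items())


def to_id_sequence(doc, vocab):
    """Token ids in document order, repeats kept."""
    return [vocab[t] for t in doc]


if __name__ == "__main__":
    docs = [["math", "is", "science"], ["science", "is", "science"]]
    vocab = build_vocab(docs)
    print([to_bow(d, vocab) for d in docs])
    print([to_id_sequence(d, vocab) for d in docs])
```

LDA works on the bag-of-words form; the id-sequence form is useful for samplers or models that need every token occurrence.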

 

Preprocessing Chinese and English Wikipedia data

Environment: Ubuntu 14.04, Gensim

Processing script process_wiki.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

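    # check and process input arguments
    if len(sys.argv) < 3:
        # the script has no module docstring, so print an explicit usage line
        print "Usage: python %s <wiki-dump.xml.bz2> <output-file>" % program
        sys.exit(1)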
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Download the Chinese and English Wikipedia dumps
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

Method 1:

python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

Method 2:

Wikipedia Extractor is a Wikipedia extraction tool written in Python, and it is very easy to use.

wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -cb1000M -o extracted  zhwiki-latest-pages-articles.xml.bz2
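The -b1000M option splits the output into files of 1000M each; the default is 500K.

Convert the traditional Chinese characters in wiki.zh.text to simplified Chinese: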

sudo apt-get install opencc

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

Handle non-UTF-8 characters

iconv -c -t UTF-8 < wiki.zh.text.jian > wiki.zh.text.jian.utf-8
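
iconv -c silently drops byte sequences that cannot be converted. For reference, a rough Python 3 counterpart of this step, assuming the input is meant to be UTF-8; the function name is hypothetical.

```python
def drop_invalid_utf8(data):
    """Decode bytes as UTF-8, discarding any bytes that do not form valid characters."""
    return data.decode("utf-8", errors="ignore")


if __name__ == "__main__":
    # Two stray bytes are embedded between valid UTF-8 text
    raw = "wiki".encode("utf-8") + b"\xff\xfe" + " text\n".encode("utf-8")
    print(drop_invalid_utf8(raw), end="")
```

For a large file it is usually easier to open it with open(path, encoding="utf-8", errors="ignore") and process it line by line instead of loading all bytes at once.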