June 2015 – 强的部落格

So annoyed.


ImportError: cannot import name IncompleteRead

Environment: Ubuntu 14.04, pip

Today, installing a Python package with pip failed with the following error:

ImportError: cannot import name IncompleteRead

A search showed that this is a bug in pip.

Remove the packaged pip and install a newer version:

sudo apt-get remove python-pip

sudo apt-get autoremove

wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py --no-check-certificate

sudo python get-pip.py
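
To check whether an environment is affected before reinstalling, the failing import can be probed directly. This is a minimal sketch: it assumes the error comes from pip trying to import IncompleteRead from requests.compat (the post only shows the final error line, not the traceback), and the helper name has_name is made up for illustration.

```python
import importlib


def has_name(module_name, attr):
    """Return True if `module_name` can be imported and exposes `attr`."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr)


if __name__ == "__main__":
    # Assumption: the old pip imports IncompleteRead from requests.compat.
    print(has_name("requests.compat", "IncompleteRead"))
```

If this prints False while the distribution's pip is still installed, the reinstall steps above are probably needed.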

Installing BVLC Caffe

Environment: Ubuntu 12.04, CUDA 6.0

1. Install the prerequisites

pip install -r /u01/caffe/python/requirements.txt
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev

# gflags
wget https://github.com/schuhschuh/gflags/archive/master.zip
unzip master.zip
cd gflags-master
mkdir build && cd build
CXXFLAGS="-fPIC" cmake .. -DGFLAGS_NAMESPACE=google
make && make install

# glog
wget https://google-glog.googlecode.com/files/glog-0.3.3.tar.gz
tar zxvf glog-0.3.3.tar.gz
cd glog-0.3.3
./configure
make && make install

# lmdb
git clone git://gitorious.org/mdb/mdb.git
cd mdb/libraries/liblmdb
make && make install

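2. Configure the build

cp Makefile.config.example Makefile.config
Edit Makefile.config (vi Makefile.config) and uncomment the following line, because the virtual machine has no GPU support: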
CPU_ONLY := 1
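
The same edit can be scripted instead of done by hand in vi. The sketch below assumes the setting appears in the file in the commented form "# CPU_ONLY := 1"; the function name uncomment_setting is hypothetical.

```python
import re


def uncomment_setting(config_text, name):
    """Uncomment lines of the form '# NAME := value' in Makefile-style text."""
    pattern = re.compile(
        r"^[ \t]*#[ \t]*(%s[ \t]*:=.*)$" % re.escape(name), re.MULTILINE
    )
    return pattern.sub(r"\1", config_text)


sample = "# CPU-only switch\n# CPU_ONLY := 1\nBLAS := atlas\n"
print(uncomment_setting(sample, "CPU_ONLY"))
```

Only the line that actually assigns the setting is changed; ordinary comments that merely mention it are left alone.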

3. Build. It fails with the following errors:

jerry@hq:/u01/caffe$ make
g++ .build_release/tools/convert_imageset.o .build_release/lib/libcaffe.a -o .build_release/tools/convert_imageset.bin -fPIC -DCPU_ONLY -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIfE6UpdateEv[_ZN5caffe4BlobIfE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<float>(int, float, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_dataEv[_ZNK5caffe4BlobIfE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_diffEv[_ZNK5caffe4BlobIfE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIdE6UpdateEv[_ZN5caffe4BlobIdE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<double>(int, double, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_dataEv[_ZNK5caffe4BlobIdE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_diffEv[_ZNK5caffe4BlobIdE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::GlobalInit(int*, char***)’:
common.cpp:(.text+0x12a): undefined reference to `gflags::ParseCommandLineFlags(int*, char***, bool)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::Caffe()’:
common.cpp:(.text+0x179): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x1cb): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x22d): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::~Caffe()’:
common.cpp:(.text+0x434): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x456): undefined reference to `curandDestroyGenerator’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::DeviceQuery()’:
common.cpp:(.text+0x5f8): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x616): undefined reference to `cudaGetDeviceProperties’
common.cpp:(.text+0xd22): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::SetDevice(int)’:
common.cpp:(.text+0x1222): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x1247): undefined reference to `cudaSetDevice’
common.cpp:(.text+0x127b): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x12a9): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x12ce): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x12fc): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1330): undefined reference to `curandSetPseudoRandomGeneratorSeed’
common.cpp:(.text+0x1729): undefined reference to `cudaGetErrorString’
common.cpp:(.text+0x1882): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::set_random_seed(unsigned int)’:
common.cpp:(.text+0x1aff): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x1b2d): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1b5c): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<double>(int, double const*, double*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<int>(int, int const*, int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<unsigned int>(int, unsigned int const*, unsigned int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<float>(int, float const*, float*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::cpu_data()’:
syncedmem.cpp:(.text+0x26): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_cpu_data()’:
syncedmem.cpp:(.text+0x136): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::~SyncedMemory()’:
syncedmem.cpp:(.text+0x1c1): undefined reference to `cudaFree’
syncedmem.cpp:(.text+0x20f): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_gpu_data()’:
syncedmem.cpp:(.text+0x29a): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x2b9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x2e5): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x321): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x379): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x3c2): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x435): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::gpu_data()’:
syncedmem.cpp:(.text+0x4ca): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x4e9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x515): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x549): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x592): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x608): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x678): undefined reference to `cudaGetErrorString’
collect2: error: ld returned 1 exit status
make: *** [.build_release/tools/convert_imageset.bin] Error 1

Many of the undefined references are GPU symbols, yet the build still fails to link even with the CPU-only option enabled.
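
With a log this long it helps to count the missing symbols by kind. The sketch below is only a diagnostic aid, not part of the Caffe build: the bucket names and the classify rules are my own assumptions, based on the symbol names visible in the output above.

```python
import re
from collections import Counter

# Matches the symbol in "undefined reference to `symbol'"; the closing quote
# may be ', ’ or ′ depending on how the log was copied.
UNDEFINED_RE = re.compile(r"undefined reference to `(.+?)['’′]\s*$")


def classify(symbol):
    """Roughly bucket a missing symbol by the library it probably belongs to."""
    if symbol.startswith(("cuda", "cublas", "curand")) or "caffe_gpu_" in symbol:
        return "CUDA/GPU"
    if symbol.startswith("gflags::"):
        return "gflags"
    return "other"


def summarize(log_text):
    """Count undefined references per bucket in linker output."""
    counts = Counter()
    for line in log_text.splitlines():
        match = UNDEFINED_RE.search(line)
        if match:
            counts[classify(match.group(1))] += 1
    return counts


sample = """common.cpp:(.text+0x12a): undefined reference to `gflags::ParseCommandLineFlags(int*, char***, bool)'
common.cpp:(.text+0x179): undefined reference to `cublasCreate_v2'
syncedmem.cpp:(.text+0x26): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)'
"""
print(summarize(sample))
```

On the log above this separates the CUDA/GPU symbols from the single gflags::ParseCommandLineFlags reference, which looks like a separate issue given that gflags was built with -DGFLAGS_NAMESPACE=google.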

4. Edit Makefile.config: comment out CPU_ONLY := 1 and set CUSTOM_CXX := g++-4.6

sudo apt-get install gcc-4.6 g++-4.6 gcc-4.6-multilib g++-4.6-multilib

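Edit these two files:
vi src/caffe/common.cpp
vi tools/caffe.cpp
In both, replace the gflags namespace with google (gflags was built above with -DGFLAGS_NAMESPACE=google).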
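
The post does not show the exact edits. Assuming they amount to changing the gflags:: qualifier to google:: wherever it appears, the substitution could be sketched like this (use_google_namespace is a hypothetical helper, not part of Caffe):

```python
import re


def use_google_namespace(source):
    """Rewrite 'gflags::' qualifiers to 'google::' in C++ source text."""
    # \b leaves identifiers that merely end in 'gflags' (e.g. 'my_gflags::') alone.
    return re.sub(r"\bgflags::", "google::", source)


line = "::gflags::ParseCommandLineFlags(pargc, pargv, true);"
print(use_google_namespace(line))
```

Include lines such as #include <gflags/gflags.h> are not affected, because the pattern requires the trailing "::".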

make clean

make

make pycaffe
g++-4.6 -shared -o python/caffe/_caffe.so python/caffe/_caffe.cpp \
.build_release/lib/libcaffe.a -fPIC -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -I/usr/local/cuda/include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -lcudart -lcublas -lcurand -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas -lboost_python -lpython2.7

touch python/caffe/proto/__init__.py
protoc --proto_path=src --python_out=python src/caffe/proto/caffe_pretty_print.proto

protoc --proto_path=src --python_out=python src/caffe/proto/caffe.proto

Then run: sudo cp /u01/caffe/python/caffe/ /usr/local/lib/python2.7/dist-packages/ -Rf

LDA preprocessing of Chinese Wikipedia data

Environment: Ubuntu 14.04, Gensim, jieba

First, segment the Chinese text into words:

python -m jieba wiki.zh.text.jian.utf-8 > cut_result.txt

Take the first 30,000 documents:

head -n 30000 cut_result.txt > cut_small.txt

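The processing script:

import io

from gensim import corpora

train_data = []
corpus1 = []
corpus2 = []

# Each line is one document; tokens are separated by '/' in jieba's output.
with io.open('cut_small.txt', 'r', encoding='utf8') as f:
    for line in f:
        train_data.append(line.split('/'))

# Map every distinct token to an integer id.
dic = corpora.Dictionary(train_data)

# Bag-of-words form: (token_id, count) pairs for each document.
corpus1 = [dic.doc2bow(text) for text in train_data]

# Sequence form: the token ids of each document in their original order.
with io.open('cut_small.txt', 'r', encoding='utf8') as f:
    for line in f:
        corpus2.append([dic.token2id[token] for token in line.split('/')])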
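
To make the difference between the two corpus forms concrete, here is a small pure-Python illustration that does not need Gensim. It only mimics the idea: the function names are hypothetical, and Gensim may assign token ids in a different order than this sketch does.

```python
from collections import Counter


def build_token2id(documents):
    """Assign each distinct token an integer id in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            token2id.setdefault(token, len(token2id))
    return token2id


def to_bow(doc, token2id):
    """Bag-of-words form: sorted (token_id, count) pairs."""
    return sorted(Counter(token2id[t] for t in doc).items())


def to_id_sequence(doc, token2id):
    """Sequence form: token ids in their original order."""
    return [token2id[t] for t in doc]


docs = [["a", "b", "a"], ["b", "c"]]
ids = build_token2id(docs)
print(to_bow(docs[0], ids))
print(to_id_sequence(docs[0], ids))
```

The bag-of-words form drops word order and keeps counts, which is the input an LDA model typically expects; the id sequence keeps order and repeats.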

Preprocessing the Chinese and English Wikipedia dumps

Environment: Ubuntu 14.04, Gensim

The processing script, process_wiki.py:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
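        # The module has no docstring, so print an explicit usage message.
        print("Usage: python %s <wiki-dump.xml.bz2> <output-text-file>" % program)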
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")

Download the English and Chinese Wikipedia dumps:

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2

Method 1:

python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text

Method 2:

Wikipedia Extractor is a Wikipedia extraction tool written in Python, and it is very easy to use.

wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -cb1000M -o extracted zhwiki-latest-pages-articles.xml.bz2
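The -b1000M option splits the output into files of 1000M each; the default is 500K.

Convert the Traditional Chinese characters in wiki.zh.text to Simplified Chinese: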

sudo apt-get install opencc

opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini

Drop characters that are not valid UTF-8:

iconv -c -t UTF-8 < wiki.zh.text.jian > wiki.zh.text.jian.utf-8
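
For reference, roughly the same cleanup can be done in Python when both input and output are meant to be UTF-8. This is a sketch of the idea rather than an exact equivalent of iconv -c, and the function name is made up for illustration.

```python
def drop_invalid_utf8(raw_bytes):
    """Decode bytes as UTF-8, silently discarding byte sequences that are not valid."""
    return raw_bytes.decode("utf-8", errors="ignore")


# 0xff can never appear in valid UTF-8, so it is dropped.
sample = "数学".encode("utf-8") + b"\xff" + b" ok"
print(drop_invalid_utf8(sample))
```

For a file as large as the full dump, the input should be read and written in chunks or line by line rather than loaded at once.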

Upgrading gcc on CentOS 6.3

Environment: CentOS 6.3, gcc 4.4.7, g++ 4.4.7

wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo

yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++

scl enable devtoolset-2 bash

DMLC Wormhole

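Environment: Ubuntu 14.04

I have been following the DMLC machine learning project. Its newest subproject is Wormhole, which provides reliable and scalable machine learning tools across different computing platforms (MPI, Yarn, Sungrid). It should substantially lower the barrier to installing and deploying distributed machine learning applications. It offers consistent data-flow support for all components and unified scripts to build and run them, so users can conveniently run any of DMLC's distributed components on a local cluster.

Build and install as follows: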

git clone https://github.com/dmlc/wormhole.git

cd wormhole

cp make/config.mk .

vi config.mk

Comment out HDFS and S3:

#USE_HDFS = 1

#USE_S3 = 1

Then build:

make

This produces two executables:

kmeans.dmlc xgboost.dmlc

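Oracle: wrong execution plan when filtering on a column of a joined table

Problem: the table SUMM_ADV_CONSUME is partitioned. The following query was issued:
select count(*) from dates T434858, SUMM_ADV_CONSUME T434932 where T434858.DATE_ID = T434932.DATE_ID and T434858.DATE_NAME = '20131202'

The resulting execution plan scans many partitions, whereas it should scan only one. A query with the same logic but a different filter column, date_name2:

select count(*) from dates T434858, SUMM_ADV_CONSUME T434932 where T434858.DATE_ID = T434932.DATE_ID and T434858.DATE_NAME2 = to_date('20131105', 'yyyymmdd')
scans only one partition, and its execution plan is correct. Comparing the two columns showed that date_name2 has a unique index. Creating a unique index on date_name also yields the correct execution plan.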

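Loading data from MySQL into Oracle in a data warehouse

During data warehouse ETL, data has to be brought into the DW from many different sources. Taking a MySQL source as an example, there are several ways to load it:

1. Use the ETL tool itself

Pros: fast to develop, direct table-to-table mapping
Cons: the ETL tool needs a license, and loading is slow

2. Pull the MySQL data through Oracle Gateway

Pros: fast to develop, configuration only
Cons: pulling becomes a bottleneck for large volumes, and predicate pushdown is not used

3. Export the MySQL data to NFS, then load the data files on NFS with sqlldr

Pros: fast loading, using the native export and load tools directly
Cons: tedious to configure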
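
For the third method, most of the configuration work is writing a SQL*Loader control file for each exported table. The sketch below generates a basic one for a delimited file. The file path, table name and column names in the usage example are invented, and a real load would likely need extra clauses (date formats, NULL handling) that are not shown here.

```python
def build_control_file(data_file, table, columns, delimiter=","):
    """Return SQL*Loader control-file text for a delimited data file."""
    column_list = ", ".join(columns)
    return (
        "LOAD DATA\n"
        "INFILE '%s'\n"
        "APPEND INTO TABLE %s\n"
        "FIELDS TERMINATED BY '%s'\n"
        "(%s)\n"
    ) % (data_file, table, delimiter, column_list)


# Hypothetical export file on the NFS mount and hypothetical staging table.
ctl = build_control_file("/mnt/nfs/orders.csv", "STG_ORDERS", ["ORDER_ID", "AMOUNT"])
print(ctl)
```

The generated text would be saved to a .ctl file and passed to sqlldr through its control= parameter.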

Installing the Sun JDK on CentOS

First go to the Sun Java page http://www.oracle.com/technetwork/java/javase/downloads/index.html and find the JDK download; the corresponding link is http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html . Then find the JDK for your OS, for example: http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm

To get past the "Accept License Agreement" step, send the cookie directly:

wget --no-check-certificate --no-cookies --header "Cookie: gpw_e24=http%3A%2F%2Fwww.oracle.com" "http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm"
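
The same request can be made from Python with the standard library. This is a sketch that reuses the URL and cookie value from the wget command above. Whether the server still accepts this cookie has not been verified, and the function names are my own.

```python
import shutil
import urllib.request

JDK_URL = "http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm"
LICENSE_COOKIE = "gpw_e24=http%3A%2F%2Fwww.oracle.com"


def build_request(url=JDK_URL, cookie=LICENSE_COOKIE):
    """Build a GET request that carries the license cookie; nothing is sent yet."""
    return urllib.request.Request(url, headers={"Cookie": cookie})


def download(path, url=JDK_URL):
    """Fetch the file and stream it to `path` (performs network access)."""
    with urllib.request.urlopen(build_request(url)) as response:
        with open(path, "wb") as out:
            shutil.copyfileobj(response, out)
```

Calling download("jdk-7u45-linux-x64.rpm") would then save the package in the current directory.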

Install the JDK. Because an older JDK was already installed, force the installation:
rpm -ivh --force jdk-7u45-linux-x64.rpm