So annoying.
Posted from WordPress for Android
A hideout for the quantified self and minimalism
Environment: Ubuntu 14.04, pip
Today, installing a Python package with pip failed with the following error:
ImportError: cannot import name IncompleteRead
A search shows that this is a bug in pip.
Remove the packaged pip, then download and install a newer version:
sudo apt-get remove python-pip
sudo apt-get autoremove
wget https://raw.github.com/pypa/pip/master/contrib/get-pip.py --no-check-certificate
sudo python get-pip.py
Environment: Ubuntu 12.04, CUDA 6.0
1. Install the prerequisites
pip install -r /u01/caffe/python/requirements.txt
sudo apt-get install libprotobuf-dev libleveldb-dev libsnappy-dev libopencv-dev libboost-all-dev libhdf5-serial-dev
# gflags
wget https://github.com/schuhschuh/gflags/archive/master.zip
unzip master.zip
cd gflags-master
mkdir build && cd build
CXXFLAGS="-fPIC" cmake .. -DGFLAGS_NAMESPACE=google
make && make install
# glog
wget https://google-glog.googlecode.com/files/glog-0.3.3.tar.gz
tar zxvf glog-0.3.3.tar.gz
cd glog-0.3.3
./configure
make && make install
# lmdb
git clone git://gitorious.org/mdb/mdb.git
cd mdb/libraries/liblmdb
make && make install
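2. Configure the build
cp Makefile.config.example Makefile.config
Edit Makefile.config (vi Makefile.config) and uncomment the following line, since the virtual machine has no GPU support:
CPU_ONLY := 1
3. Compile. The build fails with the following errors: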
jerry@hq:/u01/caffe$ make
g++ .build_release/tools/convert_imageset.o .build_release/lib/libcaffe.a -o .build_release/tools/convert_imageset.bin -fPIC -DCPU_ONLY -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIfE6UpdateEv[_ZN5caffe4BlobIfE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<float>(int, float, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_dataEv[_ZNK5caffe4BlobIfE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<float>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIfE9asum_diffEv[_ZNK5caffe4BlobIfE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<float>(int, float const*, float*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::Update()’:
blob.cpp:(.text._ZN5caffe4BlobIdE6UpdateEv[_ZN5caffe4BlobIdE6UpdateEv]+0x43): undefined reference to `void caffe::caffe_gpu_axpy<double>(int, double, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_data() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_dataEv[_ZNK5caffe4BlobIdE9asum_dataEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(blob.o): In function `caffe::Blob<double>::asum_diff() const’:
blob.cpp:(.text._ZNK5caffe4BlobIdE9asum_diffEv[_ZNK5caffe4BlobIdE9asum_diffEv]+0x3f): undefined reference to `void caffe::caffe_gpu_asum<double>(int, double const*, double*)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::GlobalInit(int*, char***)’:
common.cpp:(.text+0x12a): undefined reference to `gflags::ParseCommandLineFlags(int*, char***, bool)’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::Caffe()’:
common.cpp:(.text+0x179): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x1cb): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x22d): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::~Caffe()’:
common.cpp:(.text+0x434): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x456): undefined reference to `curandDestroyGenerator’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::DeviceQuery()’:
common.cpp:(.text+0x5f8): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x616): undefined reference to `cudaGetDeviceProperties’
common.cpp:(.text+0xd22): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::SetDevice(int)’:
common.cpp:(.text+0x1222): undefined reference to `cudaGetDevice’
common.cpp:(.text+0x1247): undefined reference to `cudaSetDevice’
common.cpp:(.text+0x127b): undefined reference to `cublasDestroy_v2′
common.cpp:(.text+0x12a9): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x12ce): undefined reference to `cublasCreate_v2′
common.cpp:(.text+0x12fc): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1330): undefined reference to `curandSetPseudoRandomGeneratorSeed’
common.cpp:(.text+0x1729): undefined reference to `cudaGetErrorString’
common.cpp:(.text+0x1882): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(common.o): In function `caffe::Caffe::set_random_seed(unsigned int)’:
common.cpp:(.text+0x1aff): undefined reference to `curandDestroyGenerator’
common.cpp:(.text+0x1b2d): undefined reference to `curandCreateGenerator’
common.cpp:(.text+0x1b5c): undefined reference to `curandSetPseudoRandomGeneratorSeed’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<double>(int, double const*, double*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIdEEviPKT_PS1_[_ZN5caffe10caffe_copyIdEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<int>(int, int const*, int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIiEEviPKT_PS1_[_ZN5caffe10caffe_copyIiEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<unsigned int>(int, unsigned int const*, unsigned int*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIjEEviPKT_PS1_[_ZN5caffe10caffe_copyIjEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(math_functions.o): In function `void caffe::caffe_copy<float>(int, float const*, float*)’:
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x6c): undefined reference to `cudaMemcpy’
math_functions.cpp:(.text._ZN5caffe10caffe_copyIfEEviPKT_PS1_[_ZN5caffe10caffe_copyIfEEviPKT_PS1_]+0x160): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::cpu_data()’:
syncedmem.cpp:(.text+0x26): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_cpu_data()’:
syncedmem.cpp:(.text+0x136): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::~SyncedMemory()’:
syncedmem.cpp:(.text+0x1c1): undefined reference to `cudaFree’
syncedmem.cpp:(.text+0x20f): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::mutable_gpu_data()’:
syncedmem.cpp:(.text+0x29a): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x2b9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x2e5): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x321): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x379): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x3c2): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x435): undefined reference to `cudaGetErrorString’
.build_release/lib/libcaffe.a(syncedmem.o): In function `caffe::SyncedMemory::gpu_data()’:
syncedmem.cpp:(.text+0x4ca): undefined reference to `caffe::caffe_gpu_memcpy(unsigned long, void const*, void*)’
syncedmem.cpp:(.text+0x4e9): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x515): undefined reference to `cudaMemset’
syncedmem.cpp:(.text+0x549): undefined reference to `cudaMalloc’
syncedmem.cpp:(.text+0x592): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x608): undefined reference to `cudaGetErrorString’
syncedmem.cpp:(.text+0x678): undefined reference to `cudaGetErrorString’
collect2: error: ld returned 1 exit status
make: *** [.build_release/tools/convert_imageset.bin] Error 1
4. Edit Makefile.config again: comment out CPU_ONLY := 1 and set CUSTOM_CXX := g++-4.6
sudo apt-get install gcc-4.6 g++-4.6 gcc-4.6-multilib g++-4.6-multilib
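The two Makefile.config changes in step 4 can also be scripted instead of made by hand in vi. The following is a minimal sketch, assuming the options appear as `NAME := value` lines that may be prefixed with `#`; the helper name `set_make_option` is hypothetical and not part of Caffe.

```python
import re


def set_make_option(config_text, name, value=None, enabled=True):
    """Enable or comment out a `NAME := value` line in Makefile-style text."""
    pattern = re.compile(r'^\s*#?\s*' + re.escape(name) + r'\s*:=\s*(.*)$')
    out = []
    for line in config_text.splitlines():
        match = pattern.match(line)
        if match:
            # Keep the existing value unless a new one is given.
            current = value if value is not None else match.group(1).strip()
            line = ('' if enabled else '# ') + '%s := %s' % (name, current)
        out.append(line)
    return '\n'.join(out) + '\n'


# Illustrative input only; a real Makefile.config has many more lines.
sample = "CPU_ONLY := 1\n# CUSTOM_CXX := g++\n"
step = set_make_option(sample, 'CPU_ONLY', enabled=False)
step = set_make_option(step, 'CUSTOM_CXX', value='g++-4.6')
print(step)
```

To apply it for real, read Makefile.config into a string, pass it through the two calls, and write the result back.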
Edit these two files:
vi src/caffe/common.cpp
vi tools/caffe.cpp
In both files, replace the gflags namespace with google (gflags was built above with -DGFLAGS_NAMESPACE=google).
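The namespace replacement can be done with a small script rather than by hand. This is a sketch under the assumption that the only change needed is rewriting the `gflags::` qualifier to `google::`; the function names are hypothetical and the sample line is illustrative rather than copied from the Caffe sources.

```python
import re

# Matches the `gflags::` qualifier, but not identifiers that merely end in "gflags".
GFLAGS_QUALIFIER = re.compile(r'(?<![A-Za-z0-9_])gflags::')


def use_google_namespace(source):
    """Return C++ source text with `gflags::` rewritten to `google::`."""
    return GFLAGS_QUALIFIER.sub('google::', source)


def patch_file(path):
    """Rewrite one file in place; returns True when the file changed."""
    with open(path) as f:
        original = f.read()
    patched = use_google_namespace(original)
    if patched != original:
        with open(path, 'w') as f:
            f.write(patched)
    return patched != original


snippet = ('#include <gflags/gflags.h>\n'
           '::gflags::ParseCommandLineFlags(pargc, pargv, true);\n')
print(use_google_namespace(snippet))
```

Calling `patch_file('src/caffe/common.cpp')` and `patch_file('tools/caffe.cpp')` from the Caffe root would cover the two files listed above. The include path is left untouched because it contains no `::`.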
make clean
make
make pycaffe
g++-4.6 -shared -o python/caffe/_caffe.so python/caffe/_caffe.cpp \
.build_release/lib/libcaffe.a -fPIC -DNDEBUG -O2 -I/usr/include/python2.7 -I/usr/lib/python2.7/dist-packages/numpy/core/include -I/usr/local/include -I.build_release/src -I./src -I./include -I/usr/local/cuda/include -Wall -Wno-sign-compare -L/usr/lib -L/usr/local/lib -L/usr/lib -L/usr/local/cuda/lib64 -L/usr/local/cuda/lib -lcudart -lcublas -lcurand -lglog -lgflags -lpthread -lprotobuf -lleveldb -lsnappy -llmdb -lboost_system -lhdf5_hl -lhdf5 -lopencv_core -lopencv_highgui -lopencv_imgproc -lcblas -latlas -lboost_python -lpython2.7
touch python/caffe/proto/__init__.py
protoc --proto_path=src --python_out=python src/caffe/proto/caffe_pretty_print.proto
protoc --proto_path=src --python_out=python src/caffe/proto/caffe.proto
Finally, copy the Python package into place:
sudo cp /u01/caffe/python/caffe/ /usr/local/lib/python2.7/dist-packages/ -Rf
Environment: Ubuntu 14.04, Gensim, jieba
First, segment the Chinese text into words:
python -m jieba wiki.zh.text.jian.utf-8 > cut_result.txt
Take the first 30,000 documents:
head -n 30000 cut_result.txt > cut_small.txt
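The processing script:
import io
from gensim import corpora

# Each line of cut_small.txt is one document; jieba separates tokens with '/'.
with io.open('cut_small.txt', 'r', encoding='utf8') as f:
    train_data = [[token.strip() for token in line.split('/')] for line in f]

# Map every distinct token to an integer id.
dic = corpora.Dictionary(train_data)

# corpus1: bag-of-words form, a list of (token_id, count) pairs per document.
corpus1 = [dic.doc2bow(text) for text in train_data]

# corpus2: each document as its sequence of token ids, in original order.
corpus2 = [[dic.token2id[token] for token in text] for text in train_data]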
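The script builds two different representations of the same documents, and the difference is easy to miss. The pure-Python sketch below mimics the two shapes on a toy input. It is not gensim's implementation, and the ids gensim assigns may differ from the ones here.

```python
from collections import Counter

# Two toy "documents", already tokenized.
docs = [['deep', 'learning', 'deep'], ['learning', 'notes']]

# Assign an integer id to every distinct token, in order of first appearance.
token2id = {}
for doc in docs:
    for token in doc:
        token2id.setdefault(token, len(token2id))

# Bag-of-words: sorted (token_id, count) pairs; word order is discarded.
bow = [sorted(Counter(token2id[t] for t in doc).items()) for doc in docs]

# Id sequence: one id per token occurrence; word order is kept.
seq = [[token2id[t] for t in doc] for doc in docs]

print(bow)  # [[(0, 2), (1, 1)], [(1, 1), (2, 1)]]
print(seq)  # [[0, 1, 0], [1, 2]]
```

The first shape corresponds to what doc2bow returns (the first corpus), the second to the per-token id lookup (the second corpus).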
Environment: Ubuntu 14.04, Gensim
The processing script, process_wiki.py:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import logging
import os.path
import sys

from gensim.corpora import WikiCorpus

if __name__ == '__main__':
    program = os.path.basename(sys.argv[0])
    logger = logging.getLogger(program)

    logging.basicConfig(format='%(asctime)s: %(levelname)s: %(message)s')
    logging.root.setLevel(level=logging.INFO)
    logger.info("running %s" % ' '.join(sys.argv))

    # check and process input arguments
    if len(sys.argv) < 3:
        print("Usage: python %s <wiki-dump.xml.bz2> <output-file>" % program)
        sys.exit(1)
    inp, outp = sys.argv[1:3]
    space = " "
    i = 0

    # write one article per line, tokens separated by spaces
    output = open(outp, 'w')
    wiki = WikiCorpus(inp, lemmatize=False, dictionary={})
    for text in wiki.get_texts():
        output.write(space.join(text) + "\n")
        i = i + 1
        if (i % 10000 == 0):
            logger.info("Saved " + str(i) + " articles")

    output.close()
    logger.info("Finished Saved " + str(i) + " articles")
Download the Chinese and English Wikipedia dumps:
wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
wget https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2
Method 1:
python process_wiki.py zhwiki-latest-pages-articles.xml.bz2 wiki.zh.text
Method 2:
Wikipedia Extractor is a Wikipedia extraction tool written in Python, and it is very easy to use.
wget http://medialab.di.unipi.it/Project/SemaWiki/Tools/WikiExtractor.py
python WikiExtractor.py -cb1000M -o extracted zhwiki-latest-pages-articles.xml.bz2
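The -b1000M option splits the output into files of 1000M each (the default is 500K); -c compresses the output files with bzip2.
Convert the Traditional Chinese characters in wiki.zh.text to Simplified Chinese: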
sudo apt-get install opencc
opencc -i wiki.zh.text -o wiki.zh.text.jian -c zht2zhs.ini
Strip characters that are not valid UTF-8:
iconv -c -t UTF-8 < wiki.zh.text.jian > wiki.zh.text.jian.utf-8
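The same clean-up can be done from Python when iconv is not at hand. This is a rough equivalent, assuming the input is meant to be UTF-8 and only stray invalid bytes need to be discarded; the function name is hypothetical.

```python
def drop_invalid_utf8(raw_bytes):
    """Decode bytes as UTF-8, silently discarding sequences that are not valid."""
    return raw_bytes.decode('utf-8', errors='ignore')


# b'\xe4\xb8\xad' is one valid three-byte character; the stray b'\xff'
# can never occur in valid UTF-8 and is dropped.
dirty = b'wiki \xe4\xb8\xad\xff text'
clean = drop_invalid_utf8(dirty)
print(clean.encode('utf-8'))
```

For a file as large as the Wikipedia dump, it would be better to apply this line by line to a file opened in binary mode than to read everything into memory at once.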
Environment: CentOS 6.3, gcc 4.4.7, g++ 4.4.7
Install a newer gcc/g++ through devtoolset-2 and start a shell that uses it:
wget http://people.centos.org/tru/devtools-2/devtools-2.repo -O /etc/yum.repos.d/devtools-2.repo
yum install devtoolset-2-gcc devtoolset-2-binutils devtoolset-2-gcc-c++
scl enable devtoolset-2 bash
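Environment: Ubuntu 14.04
I have been following the DMLC machine learning project. Its newest subproject is Wormhole, which provides reliable and scalable machine learning tools across different computing platforms (MPI, Yarn, Sungrid). It should greatly lower the barrier to installing and deploying distributed machine learning applications. It offers consistent data-flow support for all components and unified scripts to build and run them, so that users can conveniently run any of DMLC's distributed components on a local cluster.
Build and install as follows: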
git clone https://github.com/dmlc/wormhole.git
cd wormhole
cp make/config.mk .
vi config.mk
Comment out HDFS and S3:
#USE_HDFS = 1
#USE_S3 = 1
Then build:
make
The build produces two executables:
kmeans.dmlc xgboost.dmlc
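Problem: SUMM_ADV_CONSUME is a partitioned table. Take the following query:
select count(*) from dates T434858, SUMM_ADV_CONSUME T434932 where T434858.DATE_ID = T434932.DATE_ID and T434858.DATE_NAME = '20131202'
Its execution plan scans many partitions, when it should scan only one. The same logical query with a different filter column, date_name2:
select count(*) from dates T434858, SUMM_ADV_CONSUME T434932 where T434858.DATE_ID = T434932.DATE_ID and T434858.DATE_NAME2 = to_date('20131105', 'yyyymmdd')
scans only one partition, so its execution plan is correct. Comparing the two columns shows that date_name2 has a unique index. Creating a unique index on date_name as well also produces the correct execution plan.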
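In a data warehouse ETL process, data arrives in the DW from many different sources. Taking a MySQL source as an example, there are several ways to import it:
1. Import with the ETL tool itself
Pros: fast to develop, direct table-to-table mapping
Cons: the ETL tool needs a license, and loading is slow
2. Pull the MySQL data through Oracle gateway
Pros: fast to develop, configuration only
Cons: pulling becomes a bottleneck with large data volumes, and predicate pushdown is not used
3. Export the MySQL data over NFS, then load the data files on NFS with sqlldr
Pros: fast loading, using the native export and load tools directly
Cons: tedious to configure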
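For the third method, sqlldr needs a control file that describes the exported data file. The sketch below generates a comma-separated data file body and a minimal control file for it. The table name, column names and NFS path are hypothetical placeholders, and the real export would normally come from MySQL's own tools rather than from Python rows.

```python
import csv
import io


def rows_to_csv(rows):
    """Serialize rows to CSV text, quoting fields only where needed."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator='\n').writerows(rows)
    return buf.getvalue()


def control_file(table, columns, data_path):
    """Build a minimal SQL*Loader control file for a comma-separated data file."""
    return (
        "LOAD DATA\n"
        "INFILE '%s'\n"
        "APPEND INTO TABLE %s\n"
        "FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"'\n"
        "(%s)\n"
    ) % (data_path, table, ', '.join(columns))


# Hypothetical staging table and NFS path, for illustration only.
data = rows_to_csv([(1, 'banner, top', 3.5), (2, 'sidebar', 1.25)])
ctl = control_file('STG_ADV', ['ADV_ID', 'ADV_NAME', 'COST'], '/mnt/nfs/adv.csv')
print(data)
print(ctl)
```

Once both files are written to the NFS mount, the control file is the one handed to sqlldr through its control= parameter.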
First go to the Sun Java page at http://www.oracle.com/technetwork/java/javase/downloads/index.html and find the JDK download, which links to http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html . Then find the JDK for your OS, for example: http://download.oracle.com/otn-pub/java/jdk/7u45-b18/jdk-7u45-linux-x64.rpm
Bypass the "Accept License Agreement" step as follows:
Install the JDK. Because an older JDK was already installed, force the installation:
rpm -ivh --force jdk-7u45-linux-x64.rpm