Qiang's Blog – Page 20 – A hideout for the quantified self and minimalism

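Installing lxml on Windows 7

Environment: Windows 7, Python 2.7

I needed lxml to parse web pages, which also meant installing VCForPython27. The installation ran into a series of problems: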

pip install lxml

easy_install lxml

Both commands fail with the following error, raised during compilation:

Could not find function xmlCheckVersion in library libxml2. Is libxml2 installed ?

In the end I downloaded the prebuilt wheel directly from http://www.lfd.uci.edu/~gohlke/pythonlibs/dp2ng7en/lxml-3.6.4-cp27-cp27m-win_amd64.whl and installed it:

pip install lxml-3.6.4-cp27-cp27m-win_amd64.whl
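The wheel only installs if its tags match the interpreter, so it helps to read them off the filename. The sketch below splits a wheel filename using the standard wheel naming scheme (name, version, optional build, Python tag, ABI tag, platform tag). The function name parse_wheel_filename is made up for this illustration.

```python
def parse_wheel_filename(filename):
    """Split a wheel filename into its standard components."""
    if not filename.endswith(".whl"):
        raise ValueError("not a wheel file: %r" % filename)
    parts = filename[:-len(".whl")].split("-")
    if len(parts) == 5:
        name, version, python_tag, abi_tag, platform_tag = parts
        build = None
    elif len(parts) == 6:
        name, version, build, python_tag, abi_tag, platform_tag = parts
    else:
        raise ValueError("unexpected wheel filename: %r" % filename)
    return {
        "name": name,
        "version": version,
        "build": build,
        "python_tag": python_tag,      # cp27 = CPython 2.7
        "abi_tag": abi_tag,
        "platform_tag": platform_tag,  # win_amd64 = 64-bit Windows
    }


info = parse_wheel_filename("lxml-3.6.4-cp27-cp27m-win_amd64.whl")
print(info)
```

Here cp27 means the wheel targets CPython 2.7 and win_amd64 means 64-bit Windows, so a 32-bit Python 2.7 would need a different wheel file.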

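Packaging a Python script as an exe

Environment: Windows 7, Python 2.7

A program written as a .py file sometimes has to be shipped as an executable. PyInstaller can do the packaging.

Download PyInstaller-3.1 and run the packaging command as follows:

D:\program\PyInstaller-3.1>pyinstaller.py -F ../../qs123/s3test.py --upx-dir upx391w

This command packages the script into a single executable and compresses it with UPX.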

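Options:

-F produce a single exe file

-D, --onedir create a directory that contains the exe together with the many files it depends on (default)

-c, --console, --nowindowed use a console, no GUI window (default)

-w, --windowed, --noconsole use a window, no console

-p add a search path so that PyInstaller can find the required libraries

-i change the icon of the generated program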
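When the same packaging step is repeated for several scripts, the options above can be assembled in a small helper. This is only a sketch: build_pyinstaller_args and its parameters are hypothetical names, and the program name defaults to pyinstaller.py to match the command used in this post.

```python
def build_pyinstaller_args(script, onefile=True, windowed=False,
                           paths=(), icon=None, upx_dir=None,
                           program="pyinstaller.py"):
    """Return the PyInstaller command line as a list of arguments."""
    args = [program]
    args.append("-F" if onefile else "-D")   # one file vs. one directory
    args.append("-w" if windowed else "-c")  # window vs. console
    for path in paths:
        args.extend(["-p", path])            # extra import search paths
    if icon:
        args.extend(["-i", icon])
    if upx_dir:
        args.extend(["--upx-dir", upx_dir])
    args.append(script)
    return args


command = build_pyinstaller_args("../../qs123/s3test.py", upx_dir="upx391w")
print(" ".join(command))
```

The resulting list can be passed to subprocess.call from the PyInstaller directory. Keeping it as a list avoids quoting problems with paths that contain spaces.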

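Searching bash command history

Environment: logstash-2.4.0, elasticsearch-1.6.1, kafka 0.8

I often need to look through bash history, but the history file only holds a limited number of commands, and sometimes I need to know when a command was run. So I built a bash history search system with logstash + kafka + elasticsearch.

The configuration is as follows:

logstash.conf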

input {
  file {
    path => "/home/adadmin/.bash_history"
    add_field => {"user" => "adadmin"}
  }
}
filter {
  ruby {
    code => "event['updatetime'] = event.timestamp.time.localtime.strftime('%Y-%m-%d %H:%M:%S.%L')"
  }
}
output {
  kafka {
    bootstrap_servers => "10.121.93.50:9092,10.121.93.51:9092,10.121.93.53:9092"
    topic_id => "bash-history"
  }
}

elasticsearch:

curl -XPUT 'xxx.xxx.xxx.53:9200/_river/kafka-river/_meta' -d '
{
  "type" : "kafka",
  "kafka" : {
    "zookeeper.connect" : "xxx.xxx.xxx.50:2181,xxx.xxx.xxx.51:2181,xxx.xxx.xxx.53:2181",
    "zookeeper.connection.timeout.ms" : 10000,
    "topic" : "bash-history",
    "message.type" : "json"
  },
  "index" : {
    "index" : "kafka-index",
    "type" : "status",
    "bulk.size" : 3,
    "concurrent.requests" : 1,
    "action.type" : "index",
    "flush.interval" : "12h"
  }
}'

Start logstash:

bin/logstash -f logstash.conf

Run a few commands in a terminal. The data flows from logstash to kafka and then into Elasticsearch, where the command history can be searched directly.
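To make the data flow concrete, here is a rough Python sketch of the kind of JSON document that ends up in the pipeline for one history line. It mirrors the two fields added in logstash.conf (user and updatetime) plus the message field that the file input fills with the line itself. A real logstash event carries more fields, and the function name and sample command are made up for illustration.

```python
import json
from datetime import datetime


def make_history_event(command, user, when):
    """Build a JSON document resembling one bash-history event."""
    # Ruby's %L is milliseconds; Python has no direct equivalent,
    # so derive it from the microsecond field.
    millis = when.microsecond // 1000
    updatetime = when.strftime("%Y-%m-%d %H:%M:%S") + ".%03d" % millis
    return json.dumps(
        {"message": command, "user": user, "updatetime": updatetime},
        sort_keys=True,
    )


event = make_history_event(
    "ls -l", "adadmin", datetime(2016, 10, 1, 12, 30, 45, 123456)
)
print(event)
```

With updatetime stored as a plain string in this format, a search by time range depends on how Elasticsearch maps the field, so it is worth checking the mapping of kafka-index after the first documents arrive.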

Using Confluent's schema-registry

git clone https://github.com/confluentinc/schema-registry.git

cd schema-registry
git checkout tags/v2.0.0
mvn clean package -DskipTests

vi config/schema-registry.properties
Set kafkastore.connection.url to the ZooKeeper connection address

nohup ./bin/schema-registry-start ./config/schema-registry.properties &

Check that the schema-registry process is running
[adadmin@s11 ~]$ jps
26995 NodeManager
74580 Kafka
61079 SchemaRegistryMain
62615 Jps
126392 Worker
26843 DataNode
118141 QuorumPeerMain

#producer Note: press Enter once after each record; exit with Ctrl+C
./bin/kafka-avro-console-producer --broker-list 10.121.93.50:9092 --topic test --property value.schema='{"type":"record","name":"myrecord","fields":[{"name":"f1","type":"string"}]}'
{"f1": "value1"}
{"f1": "value2"}
{"f1": "value3"}

./bin/kafka-avro-console-consumer --broker-list 10.121.93.50:9092 --topic test-avro --from-beginning
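The console producer registers the schema automatically, but it can also be registered by hand through the registry's REST interface. The sketch below only builds the request and does not send it. The base URL http://localhost:8081 (the registry's usual default port) and the subject name test-value (topic name plus "-value") are assumptions, not values taken from this setup.

```python
import json
from urllib.request import Request

# Same Avro schema as in the console producer command above.
SCHEMA = {
    "type": "record",
    "name": "myrecord",
    "fields": [{"name": "f1", "type": "string"}],
}


def build_register_request(base_url, subject, schema):
    """Build a POST request that registers a schema under a subject."""
    # The registry expects the schema itself as a JSON-encoded string.
    body = json.dumps({"schema": json.dumps(schema)}).encode("utf-8")
    return Request(
        url="%s/subjects/%s/versions" % (base_url.rstrip("/"), subject),
        data=body,
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )


request = build_register_request("http://localhost:8081", "test-value", SCHEMA)
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen would send it. On success the registry should answer with a small JSON document containing the id of the registered schema.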

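Linux history explained

I often use the shell's built-in history command to review past commands. Each user's home directory has a .bash_history file that stores the command history, and the $HISTSIZE variable sets the maximum number of commands kept. Within a running shell the commands are held in memory. On exit, the most recent $HISTSIZE commands are saved to .bash_history. They can also be saved by hand: history -a appends the commands entered in the current session that have not been written yet, while history -w writes the whole current history list to the file, overwriting it. To have every command saved immediately, set export PROMPT_COMMAND='history -a' in .bashrc.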
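The difference between history -a and history -w is easy to mix up, so here is a deliberately simplified Python model of the three behaviours described above. It is not how bash is implemented; the function names and sample commands are invented, and histsize is assumed to be at least 1.

```python
def remember(memory, command, histsize):
    """Add a command to the in-memory list, keeping at most histsize entries."""
    return (memory + [command])[-histsize:]


def history_a(file_lines, new_commands):
    """Like `history -a`: append only the not-yet-saved commands to the file."""
    return file_lines + new_commands


def history_w(memory):
    """Like `history -w`: replace the file contents with the in-memory list."""
    return list(memory)


history_file = ["ls", "pwd"]   # what .bash_history held at login
memory = list(history_file)    # the shell loads the file at startup
new_commands = []
for cmd in ["cd /tmp", "make", "make test"]:
    memory = remember(memory, cmd, histsize=4)
    new_commands.append(cmd)

print(history_a(history_file, new_commands))  # file grows by three lines
print(history_w(memory))                      # file becomes the last four commands
```

In this model history -a keeps "ls" in the file because it only appends, while history -w drops it because the in-memory list was already trimmed to four entries.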

Problem writing to Kafka from Spark

Environment: Spark 1.6, kafka 0.8

Failed to send producer request with correlation id java.io.IOException: Connection reset by peer kafka.common.FailedToSendMessageException: Failed to send messages after 3 tries.

This error appeared while using Spark to read from and write to Kafka. I assumed it was a performance-tuning issue and tried adjusting various parameters.

On the producer side:

producerConf.put("retry.backoff.ms", "8000");
producerConf.put("message.send.max.retries", "10");
producerConf.put("topic.metadata.refresh.interval.ms", "0");
producerConf.put("fetch.message.max.bytes", "5252880");
producerConf.put("request.required.acks", "0");

On the broker side, in server.properties:

message.max.bytes=5252880
replica.fetch.max.bytes=5252880
request.timeout.ms=600000

None of these solved the problem. I later learned that the producer sends synchronously by default, so the fix came down to this one parameter:

producerConf.put("producer.type", "async");
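To illustrate what switching from synchronous to asynchronous sending changes, here is a small Python sketch of the general idea: the caller only puts messages on a queue, and a background thread sends them in batches. It is a toy model, not the Kafka producer's actual implementation, and the class name, batch_size parameter and send_batch callback are all hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel that tells the worker to finish


class AsyncBatchSender:
    """Queue messages and send them in batches from a background thread."""

    def __init__(self, send_batch, batch_size=3):
        self._send_batch = send_batch
        self._batch_size = batch_size
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send(self, message):
        # Returns immediately; the caller never waits for the network.
        self._queue.put(message)

    def close(self):
        # Flush whatever is still queued, then stop the worker.
        self._queue.put(_STOP)
        self._worker.join()

    def _run(self):
        batch = []
        while True:
            item = self._queue.get()
            if item is _STOP:
                break
            batch.append(item)
            if len(batch) >= self._batch_size:
                self._send_batch(batch)
                batch = []
        if batch:
            self._send_batch(batch)


sent = []
sender = AsyncBatchSender(sent.append, batch_size=2)
for i in range(5):
    sender.send(i)
sender.close()
print(sent)
```

The trade-off is the usual one for asynchronous sending: the caller is no longer blocked by a slow or resetting connection, but messages still sitting in the queue are lost if the process dies before they are flushed.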

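One day

Job burnout

There have been no real projects lately, so every day I just browse the web and read the news. It feels pointless, and I doubt that changing jobs would solve the problem. I will try setting myself a goal I am actually interested in.

Converting a string to a local datetime in Python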

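from datetime import datetime
from pytz import timezone

t1 = '2016-08-04T02:58:29Z'
# The trailing "Z" marks UTC, so parse the string directly into a naive
# datetime instead of going through mktime(), which assumes local time.
dt1 = datetime.strptime(t1, '%Y-%m-%dT%H:%M:%SZ')

tzchina = timezone('Asia/Chongqing')
utc = timezone('UTC')

# Attach UTC to the parsed value, then convert it to China local time.
local_dt = utc.localize(dt1).astimezone(tzchina)
print(local_dt)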
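For comparison, the same conversion can be sketched with the standard library only. This is an alternative to the pytz version, not a replacement for it, and it rests on two assumptions: Python 3 (for datetime.timezone) and a fixed UTC+8 offset for China, which would not work for zones with daylight saving time.

```python
from datetime import datetime, timedelta, timezone

# Assumption: China time is treated as a fixed UTC+8 offset.
CHINA = timezone(timedelta(hours=8), "CST")


def utc_string_to_china(text):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' UTC string into China local time."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc).astimezone(CHINA)


local = utc_string_to_china("2016-08-04T02:58:29Z")
print(local.isoformat())  # 2016-08-04T10:58:29+08:00
```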

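Hadoop NameNode startup problem

Environment: CentOS 6.3, hadoop 2.6

A CPU failure on the NameNode server made the Hadoop cluster unusable. After the reboot, starting the NameNode failed with: java.lang.OutOfMemoryError: GC overhead limit exceeded

Solution: the error comes from Java's garbage collection running out of heap. hadoop/dfs/name/current holds a large number of NameNode log files, so the NameNode JVM needs more memory. Add "-Xms30G -Xmx50G" to HADOOP_NAMENODE_OPTS in etc/hadoop/hadoop-env.sh:

export HADOOP_NAMENODE_OPTS="-Xms30G -Xmx50G -Dhadoop.security.logger=${HADOOP_SECURITY_LOGGER:-INFO,RFAS} -Dhdfs.audit.logger=${HDFS_AUDIT_LOGGER:-INFO,NullAppender} $HADOOP_NAMENODE_OPTS"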
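Before picking heap values it can help to see how much has piled up in the name directory. The sketch below counts files and bytes by filename prefix, assuming the usual Hadoop 2 naming where edit log segments start with edits_ and checkpoints start with fsimage_. The function name is made up, and the numbers are only a rough indicator, not a formula for the heap size.

```python
import os


def summarize_name_dir(current_dir):
    """Count files and total bytes per kind in a NameNode 'current' directory."""
    summary = {"edits": [0, 0], "fsimage": [0, 0], "other": [0, 0]}
    for entry in os.listdir(current_dir):
        path = os.path.join(current_dir, entry)
        if not os.path.isfile(path):
            continue
        if entry.startswith("edits_"):
            kind = "edits"
        elif entry.startswith("fsimage_"):
            kind = "fsimage"
        else:
            kind = "other"
        summary[kind][0] += 1                      # file count
        summary[kind][1] += os.path.getsize(path)  # total size in bytes
    return summary


# Path taken from the post; only report if it exists on this machine.
if os.path.isdir("hadoop/dfs/name/current"):
    print(summarize_name_dir("hadoop/dfs/name/current"))
```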