Using h2o-sparkling

Environment: CentOS 6.2

h2o-sparkling combines H2O with Spark for machine learning: it makes H2O's machine learning packages usable from inside a Spark environment.

Installation:
git clone https://github.com/0xdata/h2o-sparkling.git
cd h2o-sparkling
sbt assembly

Run the demo in local mode:
[cloudera@localhost h2o-sparkling]$ sbt -mem 500 "run --local"
[info] Loading project definition from /home/cloudera/h2o-sparkling/project
[info] Set current project to h2o-sparkling-demo (in build file:/home/cloudera/h2o-sparkling/)
[info] Running water.sparkling.demo.SparklingDemo --local
03:41:11.030 main      INFO WATER: ----- H2O started -----
03:41:11.046 main      INFO WATER: Build git branch: (unknown)
03:41:11.047 main      INFO WATER: Build git hash: (unknown)
03:41:11.047 main      INFO WATER: Build git describe: (unknown)
03:41:11.047 main      INFO WATER: Build project version: (unknown)
03:41:11.047 main      INFO WATER: Built by: '(unknown)'
03:41:11.047 main      INFO WATER: Built on: '(unknown)'
03:41:11.048 main      INFO WATER: Java availableProcessors: 1
03:41:11.077 main      INFO WATER: Java heap totalMemory: 3.87 gb
03:41:11.077 main      INFO WATER: Java heap maxMemory: 3.87 gb
03:41:11.078 main      INFO WATER: Java version: Java 1.6.0_31 (from Sun Microsystems Inc.)
03:41:11.078 main      INFO WATER: OS   version: Linux 2.6.32-220.23.1.el6.x86_64 (amd64)
03:41:11.381 main      INFO WATER: Machine physical memory: 4.83 gb
03:41:11.393 main      INFO WATER: ICE root: '/tmp/h2o-cloudera'
03:41:11.438 main      INFO WATER: Possible IP Address: eth1 (eth1), 192.168.56.101
03:41:11.439 main      INFO WATER: Possible IP Address: eth0 (eth0), 10.0.2.15
03:41:11.439 main      INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
03:41:11.669 main      WARN WATER: Multiple local IPs detected:
+                                    /192.168.56.101  /10.0.2.15
+                                  Attempting to determine correct address...
+                                  Using /10.0.2.15
03:41:11.929 main      INFO WATER: Internal communication uses port: 54322
+                                  Listening for HTTP and REST traffic on  http://10.0.2.15:54321/
03:41:12.912 main      INFO WATER: H2O cloud name: 'cloudera'
03:41:12.913 main      INFO WATER: (v(unknown)) 'cloudera' on /10.0.2.15:54321, discovery address /230.63.2.255:58943
03:41:12.913 main      INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+                                    1. Open a terminal and run 'ssh -L 55555:localhost:54321 cloudera@10.0.2.15'
+                                    2. Point your browser to http://localhost:55555
03:41:12.954 main      INFO WATER: Cloud of size 1 formed [/10.0.2.15:54321 (00:00:00.000)]
03:41:12.954 main      INFO WATER: Log dir: '/tmp/h2o-cloudera/h2ologs'
prostate
03:41:20.369 main      INFO WATER: Running demo with following configuration: DemoConf(prostate,true,RDDExtractor@file,true)
03:41:20.409 main      INFO WATER: Demo configuration: DemoConf(prostate,true,RDDExtractor@file,true)
03:41:21.830 main      INFO WATER: Data : data/prostate.csv
03:41:21.831 main      INFO WATER: Table: prostate_table
03:41:21.831 main      INFO WATER: Query: SELECT * FROM prostate_table WHERE capsule=1
03:41:21.831 main      INFO WATER: Spark: LOCAL
03:41:21.901 main      INFO WATER: Creating LOCAL Spark context.
03:41:34.616 main      INFO WATER: RDD result has: 153 rows
03:41:34.752 main      INFO WATER: Going to write RDD into /tmp/rdd_null_6.csv
03:41:36.099 FJ-0-1    INFO WATER: Parse result for rdd_data_6 (153 rows):
03:41:36.136 FJ-0-1    INFO WATER:     C1:              numeric        min(6.000000)      max(378.000000)
03:41:36.140 FJ-0-1    INFO WATER:     C2:              numeric        min(1.000000)        max(1.000000)                    constant
03:41:36.146 FJ-0-1    INFO WATER:     C3:              numeric       min(47.000000)       max(79.000000)
03:41:36.152 FJ-0-1    INFO WATER:     C4:              numeric        min(0.000000)        max(2.000000)
03:41:36.158 FJ-0-1    INFO WATER:     C5:              numeric        min(1.000000)        max(4.000000)
03:41:36.161 FJ-0-1    INFO WATER:     C6:              numeric        min(1.000000)        max(2.000000)
03:41:36.165 FJ-0-1    INFO WATER:     C7:              numeric        min(1.400000)      max(139.700000)
03:41:36.169 FJ-0-1    INFO WATER:     C8:              numeric        min(0.000000)       max(73.400000)
03:41:36.176 FJ-0-1    INFO WATER:     C9:              numeric        min(5.000000)        max(9.000000)
03:41:37.457 main      INFO WATER: Extracted frame from Spark:
03:41:37.474 main      INFO WATER: {id,capsule,age,race,dpros,dcaps,psa,vol,gleason}, 2.8 KB
+                                  Chunk starts: {0,83,}
+                                  Rows: 153
03:41:37.482 #ti-UDP-R INFO WATER: Orderly shutdown command from /10.0.2.15:54321
[success] Total time: 44 s, completed Aug 4, 2014 3:41:37 AM

Run against a local cluster:
[cloudera@localhost h2o-sparkling]$ sbt -mem 100 "run --remote"
[info] Loading project definition from /home/cloudera/h2o-sparkling/project
[info] Set current project to h2o-sparkling-demo (in build file:/home/cloudera/h2o-sparkling/)
[info] Running water.sparkling.demo.SparklingDemo --remote
03:25:42.306 main      INFO WATER: ----- H2O started -----
03:25:42.309 main      INFO WATER: Build git branch: (unknown)
03:25:42.309 main      INFO WATER: Build git hash: (unknown)
03:25:42.309 main      INFO WATER: Build git describe: (unknown)
03:25:42.309 main      INFO WATER: Build project version: (unknown)
03:25:42.309 main      INFO WATER: Built by: '(unknown)'
03:25:42.309 main      INFO WATER: Built on: '(unknown)'
03:25:42.310 main      INFO WATER: Java availableProcessors: 4
03:25:42.316 main      INFO WATER: Java heap totalMemory: 3.83 gb
03:25:42.316 main      INFO WATER: Java heap maxMemory: 3.83 gb
03:25:42.316 main      INFO WATER: Java version: Java 1.6.0_31 (from Sun Microsystems Inc.)
03:25:42.317 main      INFO WATER: OS   version: Linux 2.6.32-220.23.1.el6.x86_64 (amd64)
03:25:42.383 main      INFO WATER: Machine physical memory: 4.95 gb
03:25:42.384 main      INFO WATER: ICE root: '/tmp/h2o-cloudera'
03:25:42.389 main      INFO WATER: Possible IP Address: eth1 (eth1), 192.168.56.101
03:25:42.389 main      INFO WATER: Possible IP Address: eth0 (eth0), 10.0.2.15
03:25:42.389 main      INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
03:25:42.587 main      WARN WATER: Multiple local IPs detected:
+                                    /192.168.56.101  /10.0.2.15
+                                  Attempting to determine correct address...
+                                  Using /10.0.2.15
03:25:42.650 main      INFO WATER: Internal communication uses port: 54322
+                                  Listening for HTTP and REST traffic on  http://10.0.2.15:54321/
03:25:43.906 main      INFO WATER: H2O cloud name: 'cloudera'
03:25:43.906 main      INFO WATER: (v(unknown)) 'cloudera' on /10.0.2.15:54321, discovery address /230.63.2.255:58943
03:25:43.907 main      INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+                                    1. Open a terminal and run 'ssh -L 55555:localhost:54321 cloudera@10.0.2.15'
+                                    2. Point your browser to http://localhost:55555
03:25:43.920 main      INFO WATER: Cloud of size 1 formed [/10.0.2.15:54321 (00:00:00.000)]
03:25:43.921 main      INFO WATER: Log dir: '/tmp/h2o-cloudera/h2ologs'
prostate
03:25:46.985 main      INFO WATER: Running demo with following configuration: DemoConf(prostate,false,RDDExtractor@file,true)
03:25:46.991 main      INFO WATER: Demo configuration: DemoConf(prostate,false,RDDExtractor@file,true)
03:25:48.000 main      INFO WATER: Data : data/prostate.csv
03:25:48.000 main      INFO WATER: Table: prostate_table
03:25:48.000 main      INFO WATER: Query: SELECT * FROM prostate_table WHERE capsule=1
03:25:48.001 main      INFO WATER: Spark: REMOTE
03:25:48.024 main      INFO WATER: Creating REMOTE (spark://localhost:7077) Spark context.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:1 failed 4 times, most recent failure: TID 7 on host 192.168.56.101 failed for unknown reason
Driver stacktrace:
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
03:26:07.152 main      INFO WATER:      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
03:26:07.152 main      INFO WATER:      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
03:26:07.153 main      INFO WATER:      at scala.Option.foreach(Option.scala:236)
03:26:07.153 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
03:26:07.153 main      INFO WATER:      at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
03:26:07.153 main      INFO WATER:      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
03:26:07.155 main      INFO WATER:      at akka.actor.ActorCell.invoke(ActorCell.scala:456)
03:26:07.155 main      INFO WATER:      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
03:26:07.156 main      INFO WATER:      at akka.dispatch.Mailbox.run(Mailbox.scala:219)
03:26:07.156 main      INFO WATER:      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
03:26:07.157 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
03:26:07.158 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
03:26:07.158 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
03:26:07.162 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
03:26:07.172 #ti-UDP-R INFO WATER: Orderly shutdown command from /10.0.2.15:54321
[success] Total time: 27 s, completed Aug 4, 2014 3:26:07 PM

The remote run fails; I have not yet been able to pinpoint the cause.

Importing Oracle data into Hive with Sqoop

Environment: CentOS 6.3, hive-0.9.0-cdh4.1.2, Oracle Database 11g
1. First, copy ojdbc.jar into the /u01/cloudera/parcels/CDH/lib/sqoop/lib directory.

2. Launch command:

sqoop export --connect jdbc:oracle:thin:@xxx:1521:biprod --username sqoop_user --password sqoop_user --table OS_ZHIXIN_CHG --export-dir /tmp/zhixin_chg/20140911/20140911charge.zhixin2.log

When running the import: sqoop import --connect jdbc:oracle:thin:@m1-ite-erp-bidev01.m1:8521:biprod --username ahmt --password haddmt --table FCT_PROX_HMT --hive-import -m 1
the job failed with: ERROR tool.ImportTool: Imported Failed: Attempted to generate class with no columns!
After some digging, the problem turned out to be the --username ahmt option: the username must be uppercase, i.e. --username AHMT. Note that the table name must be uppercase as well.
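With that fix applied, the working command looks like this (the table name here was already uppercase, so only the username changes):

sqoop import --connect jdbc:oracle:thin:@m1-ite-erp-bidev01.m1:8521:biprod --username AHMT --password haddmt --table FCT_PROX_HMT --hive-import -m 1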

Installing WordPress on PostgreSQL

Environment: Ubuntu 12.04, PostgreSQL 9.1, WordPress 3.4.2

1. Install the prerequisites

sudo apt-get install apache2
sudo apt-get install postgresql-9.1
sudo apt-get install php5
sudo apt-get install php5-pgsql

2. Download WordPress and the PostgreSQL plugin:

wget -O wordpress.tar.gz http://wordpress.org/latest.tar.gz

wget https://downloads.wordpress.org/plugin/postgresql-for-wordpress.1.3.1.zip

3. Unpack everything and move it under /var/www
tar xvzf wordpress.tar.gz
unzip postgresql-for-wordpress.1.3.1.zip

sudo cp -R wordpress  /var/www
sudo chown jerry:jerry /var/www/wordpress

sudo cp -R postgresql-for-wordpress/pg4wp /var/www/wordpress/wp-content/
cp /var/www/wordpress/wp-content/pg4wp/db.php /var/www/wordpress/wp-content

4. Switch to the /var/www/wordpress directory and copy wp-config-sample.php to wp-config.php:
cp wp-config-sample.php wp-config.php
vi wp-config.php

Change these four settings to your PostgreSQL connection parameters:

define('DB_NAME', 'wordpress');

/** MySQL database username */
define('DB_USER', 'postgres');

/** MySQL database password */
define('DB_PASSWORD', 'xxxxxxx');

/** MySQL hostname */
define('DB_HOST', 'localhost:5432');

5. Open http://192.168.56.1/wordpress/ in a browser and fill in the setup parameters; WordPress is then ready to use.

Now I can have a blog of my own!

Adding screenshot sending to Pidgin

Environment: Windows 7, pidgin-2.10.9, send-screenshot-v0.8-3

A default Pidgin install cannot send screenshots, but a plugin fills the gap. Download send-screenshot-v0.8-3.exe from https://code.google.com/p/pidgin-sendscreenshot/downloads/list and install the plugin.

Then go to "Conversation" -> "More" -> "Send Screenshot"; that menu item does the job.

Running a script on another server over SSH

Environment: CentOS 5.7

1. Set up passwordless SSH login

Scenario: hosts A and B, with A connecting to B.

First, on host A:
[oracle@A ~]$ ssh-keygen -t rsa -P ''
[oracle@A ~]$ scp .ssh/id_rsa.pub oracle@B:/home/oracle

Then on host B (append with >> so any existing authorized keys are preserved):
[oracle@B ~]$ cat /home/oracle/id_rsa.pub >> ~/.ssh/authorized_keys
[oracle@B ~]$ chmod 600 ~/.ssh/authorized_keys
[oracle@B ~]$ chmod 700 ~/.ssh
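To verify, a remote command run from host A should now complete without a password prompt (hostname here is just an arbitrary test command):

[oracle@A ~]$ ssh oracle@B hostname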

2. Create the script, vi test_ssh.sh:
#!/bin/sh

cmd="
cd /home/oracle
. ~/.bash_profile
ls
python load_zhixin.py $1"

echo "$cmd"
ssh oracle@xx.xx.xx.xx "$cmd"

3. Run it like this: ./test_ssh.sh 20140917
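The same remote invocation can also be scripted in Python with the paramiko library (a minimal sketch, assuming paramiko is installed; it reuses the id_rsa key set up in step 1):

import paramiko

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
# Key-based login with the RSA key configured above; no password needed
client.connect('xx.xx.xx.xx', username='oracle')
stdin, stdout, stderr = client.exec_command(
    'cd /home/oracle && . ~/.bash_profile && python load_zhixin.py 20140917')
print(stdout.read())
client.close()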

Building distributed scheduled task dispatch with APScheduler + Gearman

APScheduler is a Python task-scheduling framework modeled on Java's Quartz, and it implements all of Quartz's features. It offers date-based, fixed-interval, and crontab-style jobs, and jobs can be persisted. It uses the sqlalchemy package to store job state in a relational database, for example:

__oracle_url = 'oracle://test1:test1@10.44.74.13:8521/biprod'
__configure = {'apscheduler.standalone': True,
               'apscheduler.jobstores.sqlalchemy_store.class': 'apscheduler.jobstores.sqlalchemy_store:SQLAlchemyJobStore',
               'apscheduler.jobstores.sqlalchemy_store.url': __oracle_url}

from apscheduler.jobstores.sqlalchemy_store import SQLAlchemyJobStore
from apscheduler.scheduler import Scheduler

scheduler = Scheduler(standalone=False)
scheduler.add_jobstore(SQLAlchemyJobStore(url=__oracle_url), 'default')
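A minimal usage sketch building on the snippet above (the job function and the 10-second interval are made up for illustration):

def ping():
    # Stand-in for real work; runs on the schedule below
    print('tick')

# Fire ping() every 10 seconds; job state is persisted via the jobstore above
scheduler.add_interval_job(ping, seconds=10)
# With standalone=False, start() returns immediately and jobs run in a
# background thread, so the surrounding process must stay alive
scheduler.start()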

For usage details, see: http://pythonhosted.org/APScheduler/index.html

Gearman is an open-source, general-purpose distributed job-dispatch framework that does no actual work itself. It hands individual jobs off to other machines or processes, achieving parallel execution and load balancing. Calling Gearman a distributed computing framework is not quite accurate: compared with Hadoop, Gearman leans toward dispatching tasks rather than executing them. Gearman acts more like the nervous system of a set of distributed processes.

There are three roles in the Gearman framework (a minimal worker sketch follows below):

  1. Client: the one who submits work. It creates a job to be executed and sends it to the Job Server.
  2. Worker: the one who actually does the work. It registers with the Job Server and pulls jobs from it.
  3. Job Server: the legendary manager. It receives jobs submitted by clients, dispatches them to the appropriate workers, and re-dispatches a job if a worker fails.

(Architecture diagram: http://img1.51cto.com/attachment/201307/152852966.png)
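Here is that Worker sketch, using the Python gearman package (an assumption on my part; the task name 'reverse' and the server address are examples):

import gearman

# Connect to the Job Server; gearmand listens on port 4730 by default
worker = gearman.GearmanWorker(['192.168.56.101:4730'])

def task_reverse(gearman_worker, gearman_job):
    # The payload arrives as a string; the return value goes back to the client
    return gearman_job.data[::-1]

worker.register_task('reverse', task_reverse)
worker.work()  # blocks, handling jobs as they arrive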

The versions available via yum or apt-get are too old, so the usual route is to download the latest release and build it by hand. Steps:
1. Install the dependencies: sudo apt-get install gcc autoconf bison flex libtool make libboost-all-dev libcurl4-openssl-dev curl libevent-dev memcached uuid-dev libpq-dev
2. Download the source: wget https://launchpad.net/gearmand/1.2/1.1.5/+download/gearmand-1.1.5.tar.gz
3. Unpack, build, and install:
tar xvzf gearmand-1.1.5.tar.gz
cd gearmand-1.1.5
./configure
make
make install
4. If running /usr/local/sbin/gearmand -d fails with "error while loading shared libraries: libgearman.so.1", run sudo ldconfig.

Start Gearman:

1. gearmand --pid-file=/var/run/gearman/gearmand.pid --daemon --log-file=/var/log/gearman-job-server/gearman.log --listen=192.168.56.101
gearmand --verbose DEBUG -d

2. Try out Gearman's functionality with the command-line tool:
Start a worker: gearman -w -f wc -- wc -l &
Run a client: gearman -f wc < /etc/passwd

gearman -w -f testgm -- python &
gearman -f testgm < test_gearman.py
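Tying the two pieces together: a client sketch that pairs with the worker sketch above, driven by APScheduler so that a Gearman job is dispatched on a schedule (a hedged sketch; the task name, payload, and one-minute interval are illustrative):

import gearman
from apscheduler.scheduler import Scheduler

client = gearman.GearmanClient(['192.168.56.101:4730'])

def submit_reverse():
    # Submit a job to the 'reverse' task registered by the worker above
    request = client.submit_job('reverse', 'hello gearman')
    print(request.result)

# standalone=True makes start() block and fire jobs in the foreground
scheduler = Scheduler(standalone=True)
scheduler.add_interval_job(submit_reverse, minutes=1)
scheduler.start()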