Importing Kafka data into Elasticsearch

Environment: Elasticsearch 1.4.4, elasticsearch-river-kafka-1.2.1-plugin, Kafka 0.8.1

Install the Kafka river plugin for Elasticsearch
./bin/plugin -install kafka-river -url https://github.com/mariamhakobyan/elasticsearch-river-kafka/releases/download/v1.2.1/elasticsearch-river-kafka-1.2.1-plugin.zip

Register the river metadata
curl -XPUT 'localhost:9200/_river/kafka-river/_meta' -d '
{
  "type" : "kafka",
  "kafka" : {
    "zookeeper.connect" : "xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181,xxx.xxx.xxx.xxx:2181",
    "zookeeper.connection.timeout.ms" : 10000,
    "topic" : "flume-topic1",
    "message.type" : "json"
  },
  "index" : {
    "index" : "kafka-index",
    "type" : "status",
    "bulk.size" : 3,
    "concurrent.requests" : 1,
    "action.type" : "index",
    "flush.interval" : "12h"
  }
}'

Restart the Elasticsearch service

Check the river metadata and status (the last command deletes the river, if you ever need to start over):
curl -XGET 'http://localhost:9200/_river/kafka-river/_search?pretty'
curl -XGET 'http://localhost:9200/_river/kafka-index/_search?pretty'
curl -XDELETE 'localhost:9200/_river/kafka-river/'

Produce JSON data to Kafka
bin/kafka-console-producer.sh --topic flume-topic1 --broker-list xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092
{"id":"123", "name":"hq"}
{"id":"123", "name":"hq"}
{"id":"123", "name":"hq"}
{"id":"123", "name":"hq"}
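
The same messages can also be produced from a script. A minimal sketch, assuming the kafka-python package is installed (the broker address is a placeholder, as above):

import json
from kafka import KafkaClient, SimpleProducer

client = KafkaClient('xxx.xxx.xxx.xxx:9092')
producer = SimpleProducer(client)
for _ in range(4):
    # one JSON document per message, matching "message.type" : "json" in the river config
    producer.send_messages('flume-topic1', json.dumps({'id': '123', 'name': 'hq'}))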

Check the indexed data
curl -XGET 'http://localhost:9200/kafka-index/_search?pretty'

Configuring Flume to load into Kafka and Elasticsearch

Environment: CentOS 6.3, Kafka 0.8.1, Flume 1.6, Elasticsearch 1.4.4

The configuration file is as follows:

[adadmin@s9 apache-flume-1.6.0-bin]$ vi conf/flume.conf

#define source, sink, channel
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = exec
a1.sources.r1.channels = c1
a1.sources.r1.command = tail -F /home/adadmin/.bash_history

# Describe the sink
# for a quick test, log events to the console:
#a1.sinks.k1.type = logger

# to load into Kafka instead:
#a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
#a1.sinks.k1.batchSize = 5
#a1.sinks.k1.brokerList = xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092,xxx.xxx.xxx.xxx:9092
#a1.sinks.k1.topic = flume_topic1

# load into Elasticsearch
a1.sinks.k1.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
a1.sinks.k1.hostNames = xxx.xxx.xxx.xxx:9300
a1.sinks.k1.clusterName = elasticsearch
a1.sinks.k1.batchSize = 100
a1.sinks.k1.indexName = logstash
a1.sinks.k1.ttl = 5
a1.sinks.k1.serializer = org.apache.flume.sink.elasticsearch.ElasticSearchLogStashEventSerializer

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1


Start the Flume agent

[adadmin@s9 apache-flume-1.6.0-bin]$ bin/flume-ng agent -c /home/adadmin/apache-flume-1.6.0-bin/conf -f /home/adadmin/apache-flume-1.6.0-bin/conf/flume.conf -n a1 -Dflume.root.logger=INFO,console


(Note: when loading into Elasticsearch, the Elasticsearch client jars must first be copied into Flume's plugin directory:

[adadmin@s9 apache-flume-1.6.0-bin]$ mkdir -p plugins.d/elasticsearch/libext
[adadmin@s9 apache-flume-1.6.0-bin]$ cp /home/adadmin/elasticsearch-1.4.4/lib/*.jar plugins.d/elasticsearch/libext

)
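
To verify that events are arriving, query the index the sink writes to. Note that ElasticSearchSink appends the current date to indexName, so documents land in daily logstash-yyyy-MM-dd indexes. A minimal check in Python, assuming Elasticsearch answers on localhost:9200:

import json
import urllib2

# wildcard matches the daily logstash-yyyy-MM-dd indexes created by the sink
resp = urllib2.urlopen('http://localhost:9200/logstash-*/_search?size=5')
hits = json.loads(resp.read())['hits']
print('events indexed: %d' % hits['total'])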

Using the CDH4 Maven Repository

Environment: CentOS

When building Hadoop-related projects with Maven, the core components must match the Hadoop version in use. Since most of what I run is the CDH distribution, the corresponding artifacts have to come from Cloudera's repository.

The solution is to add the following to pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">

  <repositories>
    <repository>
      <id>cloudera</id>
      <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
    </repository>
  </repositories>

</project>
The following table lists the project, groupId, artifactId, and version required to access each CDH4 artifact.
Project groupId artifactId version
Hadoop org.apache.hadoop hadoop-annotations 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-archives 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-assemblies 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-auth 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-client 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-common 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-datajoin 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-dist 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-distcp 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-extras 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-gridmix 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-hdfs 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-app 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-common 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-core 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-hs 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-jobclient 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-client-shuffle 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-mapreduce-examples 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-rumen 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-api 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-applications-distributedshell 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-applications-unmanaged-am-launcher 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-client 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-common 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-server-common 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-server-nodemanager 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-server-resourcemanager 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-server-tests 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-server-web-proxy 2.0.0-cdh4.2.0
  org.apache.hadoop hadoop-yarn-site 2.0.0-cdh4.2.0
Hadoop MRv1 org.apache.hadoop hadoop-core 2.0.0-mr1-cdh4.2.0
  org.apache.hadoop hadoop-examples 2.0.0-mr1-cdh4.2.0
  org.apache.hadoop hadoop-minicluster 2.0.0-mr1-cdh4.2.0
  org.apache.hadoop hadoop-streaming 2.0.0-mr1-cdh4.2.0
  org.apache.hadoop hadoop-test 2.0.0-mr1-cdh4.2.0
  org.apache.hadoop hadoop-tools 2.0.0-mr1-cdh4.2.0
Hive org.apache.hive hive-anttasks 0.10.0-cdh4.2.0
  org.apache.hive hive-builtins 0.10.0-cdh4.2.0
  org.apache.hive hive-cli 0.10.0-cdh4.2.0
  org.apache.hive hive-common 0.10.0-cdh4.2.0
  org.apache.hive hive-contrib 0.10.0-cdh4.2.0
  org.apache.hive hive-exec 0.10.0-cdh4.2.0
  org.apache.hive hive-hbase-handler 0.10.0-cdh4.2.0
  org.apache.hive hive-hwi 0.10.0-cdh4.2.0
  org.apache.hive hive-jdbc 0.10.0-cdh4.2.0
  org.apache.hive hive-metastore 0.10.0-cdh4.2.0
  org.apache.hive hive-pdk 0.10.0-cdh4.2.0
  org.apache.hive hive-serde 0.10.0-cdh4.2.0
  org.apache.hive hive-service 0.10.0-cdh4.2.0
  org.apache.hive hive-shims 0.10.0-cdh4.2.0
HBase org.apache.hbase hbase 0.94.2-cdh4.2.0
ZooKeeper org.apache.zookeeper zookeeper 3.4.5-cdh4.2.0
Sqoop org.apache.sqoop sqoop 1.4.2-cdh4.2.0
Pig org.apache.pig pig 0.10.0-cdh4.2.0
  org.apache.pig pigsmoke 0.10.0-cdh4.2.0
  org.apache.pig pigunit 0.10.0-cdh4.2.0
Flume 1.x org.apache.flume flume-ng-configuration 1.3.0-cdh4.2.0
  org.apache.flume flume-ng-core 1.3.0-cdh4.2.0
  org.apache.flume flume-ng-embedded-agent 1.3.0-cdh4.2.0
  org.apache.flume flume-ng-node 1.3.0-cdh4.2.0
  org.apache.flume flume-ng-sdk 1.3.0-cdh4.2.0
  org.apache.flume flume-ng-tests 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-channels flume-file-channel 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-channels flume-jdbc-channel 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-channels flume-recoverable-memory-channel 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-clients flume-ng-log4jappender 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-legacy-sources flume-avro-source 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-legacy-sources flume-thrift-source 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sinks flume-hdfs-sink 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sinks flume-irc-sink 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sinks flume-ng-elasticsearch-sink 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sinks flume-ng-hbase-sink 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sources flume-jms-source 1.3.0-cdh4.2.0
  org.apache.flume.flume-ng-sources flume-scribe-source 1.3.0-cdh4.2.0
Oozie org.apache.oozie oozie-client 3.3.0-cdh4.2.0
  org.apache.oozie oozie-core 3.3.0-cdh4.2.0
  org.apache.oozie oozie-examples 3.3.0-cdh4.2.0
  org.apache.oozie oozie-hadoop 2.0.0-cdh4.2.0.oozie-3.3.0-cdh4.2.0
  org.apache.oozie oozie-hadoop-distcp 2.0.0-mr1-cdh4.2.0.oozie-3.3.0-cdh4.2.0
  org.apache.oozie oozie-hadoop-test 2.0.0-mr1-cdh4.2.0.oozie-3.3.0-cdh4.2.0
  org.apache.oozie oozie-hbase 0.94.2-cdh4.2.0.oozie-3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-distcp 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-distcp-yarn 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-hive 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-oozie 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-pig 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-sqoop 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-streaming 3.3.0-cdh4.2.0
  org.apache.oozie oozie-sharelib-streaming-yarn 3.3.0-cdh4.2.0
  org.apache.oozie oozie-tools 3.3.0-cdh4.2.0
Mahout org.apache.mahout mahout-buildtools 0.7-cdh4.2.0
  org.apache.mahout mahout-core 0.7-cdh4.2.0
  org.apache.mahout mahout-examples 0.7-cdh4.2.0
  org.apache.mahout mahout-integration 0.7-cdh4.2.0
  org.apache.mahout mahout-math 0.7-cdh4.2.0
Whirr org.apache.whirr whirr-build-tools 0.8.0-cdh4.2.0
  org.apache.whirr whirr-cassandra 0.8.0-cdh4.2.0
  org.apache.whirr whirr-cdh 0.8.0-cdh4.2.0
  org.apache.whirr whirr-chef 0.8.0-cdh4.2.0
  org.apache.whirr whirr-cli 0.8.0-cdh4.2.0
  org.apache.whirr whirr-core 0.8.0-cdh4.2.0
  org.apache.whirr whirr-elasticsearch 0.8.0-cdh4.2.0
  org.apache.whirr whirr-examples 0.8.0-cdh4.2.0
  org.apache.whirr whirr-ganglia 0.8.0-cdh4.2.0
  org.apache.whirr whirr-hadoop 0.8.0-cdh4.2.0
  org.apache.whirr whirr-hama 0.8.0-cdh4.2.0
  org.apache.whirr whirr-hbase 0.8.0-cdh4.2.0
  org.apache.whirr whirr-mahout 0.8.0-cdh4.2.0
  org.apache.whirr whirr-pig 0.8.0-cdh4.2.0
  org.apache.whirr whirr-puppet 0.8.0-cdh4.2.0
  org.apache.whirr whirr-solr 0.8.0-cdh4.2.0
  org.apache.whirr whirr-yarn 0.8.0-cdh4.2.0
  org.apache.whirr whirr-zookeeper 0.8.0-cdh4.2.0
DataFu com.linkedin.datafu datafu 0.0.4-cdh4.2.0
Sqoop2 org.apache.sqoop sqoop-client 1.99.1-cdh4.2.0
  org.apache.sqoop sqoop-common 1.99.1-cdh4.2.0
  org.apache.sqoop sqoop-core 1.99.1-cdh4.2.0
  org.apache.sqoop sqoop-docs 1.99.1-cdh4.2.0
  org.apache.sqoop sqoop-spi 1.99.1-cdh4.2.0
  org.apache.sqoop.connector sqoop-connector-generic-jdbc 1.99.1-cdh4.2.0
  org.apache.sqoop.repository sqoop-repository-derby 1.99.1-cdh4.2.0
HCatalog org.apache.hcatalog hcatalog-core 0.4.0-cdh4.2.0
  org.apache.hcatalog hcatalog-pig-adapter 0.4.0-cdh4.2.0
  org.apache.hcatalog hcatalog-server-extensions 0.4.0-cdh4.2.0
  org.apache.hcatalog webhcat 0.4.0-cdh4.2.0
  org.apache.hcatalog webhcat-java-client 0.4.0-cdh4.2.0
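
With the repository declared, any artifact from the table can be referenced as a normal Maven dependency. For example, to compile against the CDH4 build of the Hadoop client libraries:

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>2.0.0-cdh4.2.0</version>
</dependency>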

The OraOop plugin for Sqoop

Environment: CentOS 5.7, CDH 4.3, OraOop 1.6

Download OraOop from http://downloads.cloudera.com/connectors/oraoop-1.6.0-cdh4.tgz and unpack it; this produces:
[oracle@xxx ~]$ ls oraoop-1.6.0
bin  conf  docs  install.sh  version.txt

Set the environment variables in ~/.bash_profile:
export SQOOP_CONF_DIR=/etc/sqoop/conf
export SQOOP_HOME=/u01/cloudera/parcels/CDH/lib/sqoop
export HADOOP_CLIENT_OPTS="-Xmx2048m $HADOOP_CLIENT_OPTS"

Then run the install script, ./install.sh, and test the installation:

[oracle@xxx ~]$ sqoop list-tables --verbose --connect jdbc:oracle:thin:@xxx:8521:biprod --username xxx --password xxx
14/09/23 18:39:49 DEBUG tool.BaseSqoopTool: Enabled debug logging.
14/09/23 18:39:49 WARN tool.BaseSqoopTool: Setting your password on the command-line is insecure. Consider using -P instead.
14/09/23 18:39:49 DEBUG util.ClassLoaderStack: Checking for existing class: com.quest.oraoop.OraOopManagerFactory
14/09/23 18:39:49 DEBUG util.ClassLoaderStack: Class is already available. Skipping jar /u01/cloudera/parcels/CDH/lib/sqoop/lib/oraoop-1.6.0.jar
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Added factory com.quest.oraoop.OraOopManagerFactory in jar /u01/cloudera/parcels/CDH/lib/sqoop/lib/oraoop-1.6.0.jar specified by /etc/sqoop/conf/managers.d/oraoop
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Loaded manager factory: com.quest.oraoop.OraOopManagerFactory
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Loaded manager factory: com.cloudera.sqoop.manager.DefaultManagerFactory
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Trying ManagerFactory: com.quest.oraoop.OraOopManagerFactory
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Trying ManagerFactory: com.cloudera.sqoop.manager.DefaultManagerFactory
14/09/23 18:39:49 DEBUG manager.DefaultManagerFactory: Trying with scheme: jdbc:oracle:thin:@xxx:8521
14/09/23 18:39:49 DEBUG manager.OracleManager$ConnCache: Instantiated new connection cache.
14/09/23 18:39:49 INFO manager.SqlManager: Using default fetchSize of 1000
14/09/23 18:39:49 DEBUG sqoop.ConnFactory: Instantiated ConnManager org.apache.sqoop.manager.OracleManager@52f6438d
14/09/23 18:39:49 DEBUG manager.OracleManager: Creating a new connection for jdbc:oracle:thin:@xxx:8521:biprod, using username: SQOOP_USER
14/09/23 18:39:49 DEBUG manager.OracleManager: No connection paramenters specified. Using regular API for making connection.
14/09/23 18:40:01 INFO manager.OracleManager: Time zone has been set to GMT
14/09/23 18:40:02 DEBUG manager.OracleManager$ConnCache: Caching released connection for jdbc:oracle:thin:@xxx:8521:biprod/SQOOP_USER
OS_ZHIXIN_CHG
T1
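
The two table names printed at the end (OS_ZHIXIN_CHG, T1) confirm the connector is working. From here, a table import follows the usual Sqoop form; a sketch, with a hypothetical target directory and mapper count (once registered, the OraOop manager factory is consulted first for Oracle JDBC URLs, as the log above shows):

[oracle@xxx ~]$ sqoop import --connect jdbc:oracle:thin:@xxx:8521:biprod --username xxx --password xxx --table OS_ZHIXIN_CHG --target-dir /user/oracle/os_zhixin_chg --num-mappers 4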

Using h2o-sparkling

Environment: CentOS 6.2

h2o-sparkling combines H2O with Spark for machine learning: it makes H2O's machine-learning algorithms available from inside a Spark environment.

Install as follows:
git clone https://github.com/0xdata/h2o-sparkling.git
cd h2o-sparkling
sbt assembly

Run the test:
[cloudera@localhost h2o-sparkling]$ sbt -mem 500 "run --local"
[info] Loading project definition from /home/cloudera/h2o-sparkling/project
[info] Set current project to h2o-sparkling-demo (in build file:/home/cloudera/h2o-sparkling/)
[info] Running water.sparkling.demo.SparklingDemo --local
03:41:11.030 main      INFO WATER: ----- H2O started -----
03:41:11.046 main      INFO WATER: Build git branch: (unknown)
03:41:11.047 main      INFO WATER: Build git hash: (unknown)
03:41:11.047 main      INFO WATER: Build git describe: (unknown)
03:41:11.047 main      INFO WATER: Build project version: (unknown)
03:41:11.047 main      INFO WATER: Built by: '(unknown)'
03:41:11.047 main      INFO WATER: Built on: '(unknown)'
03:41:11.048 main      INFO WATER: Java availableProcessors: 1
03:41:11.077 main      INFO WATER: Java heap totalMemory: 3.87 gb
03:41:11.077 main      INFO WATER: Java heap maxMemory: 3.87 gb
03:41:11.078 main      INFO WATER: Java version: Java 1.6.0_31 (from Sun Microsystems Inc.)
03:41:11.078 main      INFO WATER: OS   version: Linux 2.6.32-220.23.1.el6.x86_64 (amd64)
03:41:11.381 main      INFO WATER: Machine physical memory: 4.83 gb
03:41:11.393 main      INFO WATER: ICE root: '/tmp/h2o-cloudera'
03:41:11.438 main      INFO WATER: Possible IP Address: eth1 (eth1), 192.168.56.101
03:41:11.439 main      INFO WATER: Possible IP Address: eth0 (eth0), 10.0.2.15
03:41:11.439 main      INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
03:41:11.669 main      WARN WATER: Multiple local IPs detected:
+                                    /192.168.56.101  /10.0.2.15
+                                  Attempting to determine correct address...
+                                  Using /10.0.2.15
03:41:11.929 main      INFO WATER: Internal communication uses port: 54322
+                                  Listening for HTTP and REST traffic on  http://10.0.2.15:54321/
03:41:12.912 main      INFO WATER: H2O cloud name: 'cloudera'
03:41:12.913 main      INFO WATER: (v(unknown)) 'cloudera' on /10.0.2.15:54321, discovery address /230.63.2.255:58943
03:41:12.913 main      INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+                                    1. Open a terminal and run 'ssh -L 55555:localhost:54321 cloudera@10.0.2.15'
+                                    2. Point your browser to http://localhost:55555
03:41:12.954 main      INFO WATER: Cloud of size 1 formed [/10.0.2.15:54321 (00:00:00.000)]
03:41:12.954 main      INFO WATER: Log dir: '/tmp/h2o-cloudera/h2ologs'
prostate
03:41:20.369 main      INFO WATER: Running demo with following configuration: DemoConf(prostate,true,RDDExtractor@file,true)
03:41:20.409 main      INFO WATER: Demo configuration: DemoConf(prostate,true,RDDExtractor@file,true)
03:41:21.830 main      INFO WATER: Data : data/prostate.csv
03:41:21.831 main      INFO WATER: Table: prostate_table
03:41:21.831 main      INFO WATER: Query: SELECT * FROM prostate_table WHERE capsule=1
03:41:21.831 main      INFO WATER: Spark: LOCAL
03:41:21.901 main      INFO WATER: Creating LOCAL Spark context.
03:41:34.616 main      INFO WATER: RDD result has: 153 rows
03:41:34.752 main      INFO WATER: Going to write RDD into /tmp/rdd_null_6.csv
03:41:36.099 FJ-0-1    INFO WATER: Parse result for rdd_data_6 (153 rows):
03:41:36.136 FJ-0-1    INFO WATER:     C1:              numeric        min(6.000000)      max(378.000000)
03:41:36.140 FJ-0-1    INFO WATER:     C2:              numeric        min(1.000000)        max(1.000000)                    constant
03:41:36.146 FJ-0-1    INFO WATER:     C3:              numeric       min(47.000000)       max(79.000000)
03:41:36.152 FJ-0-1    INFO WATER:     C4:              numeric        min(0.000000)        max(2.000000)
03:41:36.158 FJ-0-1    INFO WATER:     C5:              numeric        min(1.000000)        max(4.000000)
03:41:36.161 FJ-0-1    INFO WATER:     C6:              numeric        min(1.000000)        max(2.000000)
03:41:36.165 FJ-0-1    INFO WATER:     C7:              numeric        min(1.400000)      max(139.700000)
03:41:36.169 FJ-0-1    INFO WATER:     C8:              numeric        min(0.000000)       max(73.400000)
03:41:36.176 FJ-0-1    INFO WATER:     C9:              numeric        min(5.000000)        max(9.000000)
03:41:37.457 main      INFO WATER: Extracted frame from Spark:
03:41:37.474 main      INFO WATER: {id,capsule,age,race,dpros,dcaps,psa,vol,gleason}, 2.8 KB
+                                  Chunk starts: {0,83,}
+                                  Rows: 153
03:41:37.482 #ti-UDP-R INFO WATER: Orderly shutdown command from /10.0.2.15:54321
[success] Total time: 44 s, completed Aug 4, 2014 3:41:37 AM

Run against a local cluster:
[cloudera@localhost h2o-sparkling]$ sbt -mem 100 "run --remote"
[info] Loading project definition from /home/cloudera/h2o-sparkling/project
[info] Set current project to h2o-sparkling-demo (in build file:/home/cloudera/h2o-sparkling/)
[info] Running water.sparkling.demo.SparklingDemo --remote
03:25:42.306 main      INFO WATER: ----- H2O started -----
03:25:42.309 main      INFO WATER: Build git branch: (unknown)
03:25:42.309 main      INFO WATER: Build git hash: (unknown)
03:25:42.309 main      INFO WATER: Build git describe: (unknown)
03:25:42.309 main      INFO WATER: Build project version: (unknown)
03:25:42.309 main      INFO WATER: Built by: '(unknown)'
03:25:42.309 main      INFO WATER: Built on: '(unknown)'
03:25:42.310 main      INFO WATER: Java availableProcessors: 4
03:25:42.316 main      INFO WATER: Java heap totalMemory: 3.83 gb
03:25:42.316 main      INFO WATER: Java heap maxMemory: 3.83 gb
03:25:42.316 main      INFO WATER: Java version: Java 1.6.0_31 (from Sun Microsystems Inc.)
03:25:42.317 main      INFO WATER: OS   version: Linux 2.6.32-220.23.1.el6.x86_64 (amd64)
03:25:42.383 main      INFO WATER: Machine physical memory: 4.95 gb
03:25:42.384 main      INFO WATER: ICE root: '/tmp/h2o-cloudera'
03:25:42.389 main      INFO WATER: Possible IP Address: eth1 (eth1), 192.168.56.101
03:25:42.389 main      INFO WATER: Possible IP Address: eth0 (eth0), 10.0.2.15
03:25:42.389 main      INFO WATER: Possible IP Address: lo (lo), 127.0.0.1
03:25:42.587 main      WARN WATER: Multiple local IPs detected:
+                                    /192.168.56.101  /10.0.2.15
+                                  Attempting to determine correct address...
+                                  Using /10.0.2.15
03:25:42.650 main      INFO WATER: Internal communication uses port: 54322
+                                  Listening for HTTP and REST traffic on  http://10.0.2.15:54321/
03:25:43.906 main      INFO WATER: H2O cloud name: 'cloudera'
03:25:43.906 main      INFO WATER: (v(unknown)) 'cloudera' on /10.0.2.15:54321, discovery address /230.63.2.255:58943
03:25:43.907 main      INFO WATER: If you have trouble connecting, try SSH tunneling from your local machine (e.g., via port 55555):
+                                    1. Open a terminal and run 'ssh -L 55555:localhost:54321 cloudera@10.0.2.15'
+                                    2. Point your browser to http://localhost:55555
03:25:43.920 main      INFO WATER: Cloud of size 1 formed [/10.0.2.15:54321 (00:00:00.000)]
03:25:43.921 main      INFO WATER: Log dir: '/tmp/h2o-cloudera/h2ologs'
prostate
03:25:46.985 main      INFO WATER: Running demo with following configuration: DemoConf(prostate,false,RDDExtractor@file,true)
03:25:46.991 main      INFO WATER: Demo configuration: DemoConf(prostate,false,RDDExtractor@file,true)
03:25:48.000 main      INFO WATER: Data : data/prostate.csv
03:25:48.000 main      INFO WATER: Table: prostate_table
03:25:48.000 main      INFO WATER: Query: SELECT * FROM prostate_table WHERE capsule=1
03:25:48.001 main      INFO WATER: Spark: REMOTE
03:25:48.024 main      INFO WATER: Creating REMOTE (spark://localhost:7077) Spark context.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1.0:1 failed 4 times, most recent failure: TID 7 on host 192.168.56.101 failed for unknown reason
Driver stacktrace:
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
03:26:07.151 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
03:26:07.152 main      INFO WATER:      at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
03:26:07.152 main      INFO WATER:      at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
03:26:07.152 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:633)
03:26:07.153 main      INFO WATER:      at scala.Option.foreach(Option.scala:236)
03:26:07.153 main      INFO WATER:      at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:633)
03:26:07.153 main      INFO WATER:      at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1207)
03:26:07.153 main      INFO WATER:      at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
03:26:07.155 main      INFO WATER:      at akka.actor.ActorCell.invoke(ActorCell.scala:456)
03:26:07.155 main      INFO WATER:      at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
03:26:07.156 main      INFO WATER:      at akka.dispatch.Mailbox.run(Mailbox.scala:219)
03:26:07.156 main      INFO WATER:      at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
03:26:07.157 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
03:26:07.158 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
03:26:07.158 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
03:26:07.162 main      INFO WATER:      at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
03:26:07.172 #ti-UDP-R INFO WATER: Orderly shutdown command from /10.0.2.15:54321
[success] Total time: 27 s, completed Aug 4, 2014 3:26:07 PM

The remote run fails, and I have not yet been able to pin down the cause.

Installing CXXNET

Environment: Ubuntu 14.04, CUDA 6.5

First install these four packages: the CUDA toolkit, cuda-cublas, cuda-cudart, and cuda-curand:

cuda_6.5.14_linux_64.run

cuda-cublas-6-5_6.5-14_amd64.deb
cuda-cudart-6-5_6.5-14_amd64.deb
cuda-curand-6-5_6.5-14_amd64.deb

Download location: http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1404/x86_64/

Install OpenCV

sudo apt-get install libopencv-dev


Configure the environment variables

vi ~/.bashrc

export CUDA_HOME=/usr/local/cuda-6.5
export PATH=$CUDA_HOME/bin:$PATH
export LD_LIBRARY_PATH=$CUDA_HOME/lib64:/usr/local/lib:$LD_LIBRARY_PATH
export CPLUS_INCLUDE_PATH=/usr/local/cuda/include


Get a copy of cxxnet:

git clone https://github.com/dmlc/cxxnet.git

Change into the directory: cd cxxnet

Copy a configuration template into the current directory: cp make/config.mk .

Edit config.mk and set:

USE_CUDA = 1

USE_BLAS = blas

USE_DIST_PS = 1
USE_OPENMP_ITER = 1

Edit the Makefile and change the following lines:

CFLAGS += -g -O3 -I./mshadow/  -fPIC $(MSHADOW_CFLAGS) -fopenmp -I/usr/local/cuda/include
LDFLAGS = -pthread $(MSHADOW_LDFLAGS) -L/usr/local/cuda/lib64


Finally, run the build:

./build.sh


Installing and using RocksDB and pyrocksdb

Environment: Ubuntu 12.04, RocksDB, pyrocksdb

RocksDB is a key-value database open-sourced by Facebook, built as an improvement on Google's LevelDB. Like memcache and Redis it stores key-value pairs; it supports RAM, flash, and disk storage, and its writes are reportedly around 10x faster than LevelDB's (see https://github.com/facebook/rocksdb/wiki/Performance-Benchmarks). That sounds impressive, so let's install it and try it out.

Installation steps:

Installing RocksDB:

sudo git clone https://github.com/facebook/rocksdb.git
cd rocksdb

vi Makefile
Change this line: OPT += -O2 -fno-omit-frame-pointer -momit-leaf-frame-pointer
to this:          OPT += -O2 -lrt -fno-omit-frame-pointer -momit-leaf-frame-pointer

Add export LD_PRELOAD=/lib/x86_64-linux-gnu/librt.so.1 to ~/.bashrc, then reload it with source ~/.bashrc.

(These two steps fix the error "undefined symbol: clock_gettime".)

sudo git checkout 2.8.fb
sudo make shared_lib

cd ..
sudo chown jerry:jerry rocksdb -Rf
cd rocksdb

sudo cp librocksdb.so /usr/local/lib
sudo mkdir -p /usr/local/include/rocksdb/
sudo cp -r ./include/* /usr/local/include/

(These three steps fix the error "Fatal error: rocksdb/slice.h: No such file or directory".)

Installing pyrocksdb:
sudo pip install "Cython>=0.20"
sudo pip install git+git://github.com/stephan-hof/pyrocksdb.git@v0.2.1

The installation is now complete. Start a pyrocksdb session:

jerry@hq:/u01/rocksdb$ python
Python 2.7.3 (default, Sep 26 2013, 20:03:06)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import rocksdb
>>> db = rocksdb.DB("test.db", rocksdb.Options(create_if_missing=True))
>>> db.put(b"key1", b"v1")
>>> db.put(b"key2", b"v2")
>>> db.get(b"key1")
'v1'
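
A couple of steps further with the same API, as a sketch against pyrocksdb 0.2 (run it in a fresh session, since RocksDB locks the database per process):

import rocksdb

db = rocksdb.DB('test.db', rocksdb.Options(create_if_missing=True))

# atomic multi-put: either both keys are written or neither is
batch = rocksdb.WriteBatch()
batch.put(b'key3', b'v3')
batch.put(b'key4', b'v4')
db.write(batch)

# full scan over all key/value pairs
it = db.iteritems()
it.seek_to_first()
print(list(it))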

A brief introduction to Cascalog

Environment: CentOS 5.7, CDH 4.2.0

Cascalog is a DSL for Hadoop, defined in Clojure and built on Cascading. Thanks to Clojure's metaprogramming and functional-programming paradigm, it expresses functions and queries elegantly.

Here is a walkthrough of how to use it:

1. Create a project with lein:
lein new cascalog_incanter

2. Change into cascalog_incanter and edit project.clj as shown below:

(defproject cascalog_incanter "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :dependencies [[org.clojure/clojure "1.6.0"]
                 [cascalog/cascalog-core "2.1.1"]
                 [incanter "1.5.5"]]
  :repositories [["conjars.org" "http://conjars.org/repo"]
                 ["cloudera" "https://repository.cloudera.com/artifactory/cloudera-repos/"]]
  :profiles {:provided {:dependencies [;[org.apache.hadoop/hadoop-core "1.2.1"]              ; Apache Hadoop MapReduce v1
                                       ;[org.apache.hadoop/hadoop-core "2.0.0-mr1-cdh4.2.0"] ; CDH 4.2.0 MapReduce v1
                                       [org.apache.hadoop/hadoop-common "2.0.0-cdh4.2.0"]                ; CDH 4.2.0 YARN
                                       [org.apache.hadoop/hadoop-mapreduce-client-core "2.0.0-cdh4.2.0"] ; CDH 4.2.0 MapReduce v2
                                       ]}
             :dev {:dependencies [[org.apache.hadoop/hadoop-minicluster "2.0.0-cdh4.2.0"]]}}) ; CDH 4.2.0

3. Start a REPL:
lein repl

4. For sample queries, see http://cascalog.org/articles/getting_started.html; a small standalone example follows.
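
A first query can run entirely in the REPL against an in-memory sequence, before touching HDFS. A minimal sketch (the sample data is made up):

(use 'cascalog.api)

(def people [["alice" 28] ["bob" 33] ["carol" 24]])

;; print the names of everyone under 30 to stdout
(?<- (stdout) [?name]
     (people ?name ?age)
     (< ?age 30))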