Hadoop HBase

ZhangGang, 2012.12.25

PC VS Farm

Since the Hadoop farm at CC has not yet been configured successfully, I cannot run tests with HBase there. Instead, I used the machine named hadoop01, which belongs to that farm, to select records from MySQL. I ran the same select on my PC and compared the processing times.

The table below shows the comparison:


Plot                                   PC (s)    Farm (s)    LHCb web portal (s)
Diskspace by Site                      97.39     36.93       about 16
Diskspace by Jobtype                   97.47     40.45       about 7
CPUTime by Jobtype                     90.66     40.08       about 7
CPUTime by Site (10/06/20-12/06/20)    97.69     39.54       about 9
CPUTime by Site (08/06/20-10/06/20)    86.64     32.31       about 7
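To make the comparison easier to read, the timings from the table above can be turned into PC/Farm ratios. The following is only a small sketch; the names `timings` and `speedups` are illustrative and do not come from the original script, and the web portal column is left out because it only holds approximate values.

```python
# Timings in seconds, copied from the table above: (PC, Farm).
timings = {
    "Diskspace by Site": (97.39, 36.93),
    "Diskspace by Jobtype": (97.47, 40.45),
    "CPUTime by Jobtype": (90.66, 40.08),
    "CPUTime by Site (10/06/20-12/06/20)": (97.69, 39.54),
    "CPUTime by Site (08/06/20-10/06/20)": (86.64, 32.31),
}


def speedups(table):
    """Return {plot: PC time / Farm time} for each row of the table."""
    return {plot: pc / farm for plot, (pc, farm) in table.items()}


ratios = speedups(timings)
for plot, ratio in ratios.items():
    print(f"{plot}: PC time / Farm time = {ratio:.2f}")
```

With these numbers the farm machine comes out roughly 2.3 to 2.7 times faster than the PC for every plot.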


Install and configure Hadoop HBase on my PC

The environment has not been configured successfully at CC. In addition, some parts of the references confused me, and I do not fully understand what they mean. So I tried to set up a pseudo-distributed mode on my own computer.

As far as I know from the references, we need the following services to deal with our problem:

Hadoop (HDFS and MapReduce): A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is the basic part; the other parts are built on it.

HBase: A scalable, distributed database that supports structured data storage for large tables. We will import data from MySQL into HBase in one format.

Sqoop: Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. We use it to transfer our data from MySQL to HBase.

Thrift: The Apache Thrift software framework, for scalable cross-language services development. Because we want to use Python, Thrift is needed.


Setting up a Hadoop environment is much more complicated than I expected. I met many unknown errors.

So far I have successfully installed and configured only Hadoop, HBase and Thrift. Sqoop still has some errors.


Hadoop:
1. Create a user account named hadoop.
2. Install SSH.
3. Install Java.
4. Install Hadoop and configure hadoop-env.sh.

After these steps the standalone mode works.

There are three *.xml files. In standalone mode they are empty; to get a pseudo-distributed mode, we must configure them.


core-site.xml: Hadoop core configuration items, such as I/O configuration.

hdfs-site.xml: HDFS daemon process configuration items, such as the namenode and datanode.

mapred-site.xml: MapReduce daemon process configuration items, such as the jobtracker and tasktracker.
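As an illustration of what goes into the three files, the sketch below builds them with Python. The property values are assumptions: they are the usual single-node settings for Hadoop 1.x (localhost, ports 9000 and 9001, replication 1) and are not taken from the setup described here. The helper names `build_site_xml` and `write_configs` are hypothetical.

```python
import os
import xml.etree.ElementTree as ET

# Assumed values: common single-node settings for Hadoop 1.x.
# Host name and ports are illustrative only.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def build_site_xml(properties):
    """Build a Hadoop <configuration> document from a dict of properties."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


def write_configs(conf_dir, configs=PSEUDO_DISTRIBUTED):
    """Write each configuration file into conf_dir."""
    for filename, properties in configs.items():
        path = os.path.join(conf_dir, filename)
        with open(path, "w") as handle:
            handle.write(build_site_xml(properties))


print(build_site_xml(PSEUDO_DISTRIBUTED["core-site.xml"]))
```

Calling `write_configs` with the Hadoop conf directory would overwrite the existing files, so it is safer to write to a scratch directory first and compare.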

Start Hadoop:
1. Format the HDFS.
2. Start all daemon processes.

Then Hadoop is started.
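The two start-up steps can be sketched from Python as follows. This assumes a Hadoop 1.x layout in which `bin/hadoop namenode -format` and `bin/start-all.sh` exist; the install path and the function names are hypothetical. By default the sketch only prints the commands instead of running them.

```python
import os
import subprocess


def hadoop_start_commands(hadoop_home):
    """Return the two commands in order: format HDFS, then start all daemons."""
    bin_dir = os.path.join(hadoop_home, "bin")
    return [
        [os.path.join(bin_dir, "hadoop"), "namenode", "-format"],
        [os.path.join(bin_dir, "start-all.sh")],
    ]


def start_hadoop(hadoop_home, dry_run=True):
    """Run the commands, or only print them when dry_run is True."""
    for cmd in hadoop_start_commands(hadoop_home):
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.check_call(cmd)


# Hypothetical install path; with dry_run left at True nothing is executed.
start_hadoop("/usr/local/hadoop")
```

Formatting is needed only once, before the first start, because it wipes the existing HDFS metadata.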



HBase:
1. Install Java (already installed).
2. Install HBase.
3. Configure hbase-site.xml: set hbase.rootdir.
4. Start HBase.

Use the HBase shell to create a table named test, which has two column families: zhang and gang.


Thrift (I found this part complicated):
1. Install all the tools and libraries required to build and install the Apache Thrift compiler.
2. From the top directory, run: ./configure
3. Once configure has run, run: make and make test
4. From the top directory, become superuser and run: make install

If there are no errors (I met many), Thrift is successfully installed. Then generate the Python client and move it to ~/python2.7/site-packages.


Once the Python client has been generated, we can use Python to access HBase. The next part is about a script I wrote to interact with HBase.
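A minimal sketch of how a Python program can reach HBase through the generated client is shown below. It assumes that the HBase Thrift server is running on localhost at port 9090 (its default), that the `thrift` Python package is installed, and that the generated package is importable as `hbase`. The function names are illustrative and are not taken from the script discussed in the next part.

```python
def open_hbase_client(host="localhost", port=9090):
    """Open a Thrift connection to HBase and return (client, transport).

    The imports live inside the function so that this file still loads on a
    machine where thrift or the generated hbase package is missing.
    """
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase  # generated by the Thrift compiler from Hbase.thrift

    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
    return client, transport


def list_tables(host="localhost", port=9090):
    """Return the table names known to HBase, closing the connection afterwards."""
    client, transport = open_hbase_client(host, port)
    try:
        return client.getTableNames()
    finally:
        transport.close()
```

If the installation worked, `list_tables()` should include the table named test created in the HBase shell.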

__________________________________________________________________________________________________


Script

As a test, I created a table named diracAccounting. It has two column families, 'groupby' and 'generate', and each family has one column: 'groupby:Site' and 'generate:CPUTime'.

The row key is the 'starttime' from the MySQL tables. I pushed the whole code to GitHub:

https://github.com/zhangg/LearningCode/blob/master/Program/Hadoop/HbasePy/hbaseplot.py



def put(self):
    '''put some records to hbase table'''

This function selects 'Site', 'CPUTime' and 'Starttime' from the MySQL database and puts them into the table in HBase, with 'starttime' as the row key.
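The mapping from MySQL rows to HBase cells can be sketched without a database connection. The helper `rows_to_cells` and the sample rows below are made up for illustration; the actual write through the Thrift client is left out because its exact signature depends on the HBase version.

```python
def rows_to_cells(rows):
    """Map (site, cputime, starttime) tuples to (row_key, cells) pairs.

    The row key is the start time; each cell is keyed by "family:qualifier".
    Values are converted to strings because HBase stores raw bytes.
    """
    for site, cputime, starttime in rows:
        row_key = str(starttime)
        cells = {
            "groupby:Site": str(site),
            "generate:CPUTime": str(cputime),
        }
        yield row_key, cells


# Made-up sample rows standing in for the MySQL result set.
sample = [("SiteA", 120.5, 1338508800), ("SiteB", 80.0, 1338512400)]
for row_key, cells in rows_to_cells(sample):
    print(row_key, cells)
```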

def generatePlot(self, groupbyName, generateName):
    '''use records to generate a plot'''

In this function, I scan the records and generate a plot.
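The aggregation step behind such a plot might look like the sketch below: it sums the generate column for each value of the groupby column. It assumes the scanned rows have already been turned into plain dicts keyed by "family:qualifier"; the function name `sum_by_group` and the sample data are hypothetical, and the plotting call itself is omitted.

```python
from collections import defaultdict


def sum_by_group(records, groupby_name="groupby:Site",
                 generate_name="generate:CPUTime"):
    """Sum the generate column per distinct value of the groupby column.

    records is an iterable of {column: value} dicts, one per scanned row.
    """
    totals = defaultdict(float)
    for cells in records:
        totals[cells[groupby_name]] += float(cells[generate_name])
    return dict(totals)


# Made-up rows in the shape a scan of diracAccounting could be converted to.
scanned = [
    {"groupby:Site": "SiteA", "generate:CPUTime": "1.5"},
    {"groupby:Site": "SiteA", "generate:CPUTime": "2.5"},
    {"groupby:Site": "SiteB", "generate:CPUTime": "3"},
]
totals = sum_by_group(scanned)
print(totals)
```

The resulting dict of totals per site can then be passed to whichever plotting library is used.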
