Hadoop HBase ZhangGang 2012.12.25



PC VS Farm

Because the Hadoop farm at CC has not yet been configured successfully, I could not run any tests with HBase.

Instead, I used hadoop01, a machine that belongs to that farm, to select records from MySQL, ran the same select on my PC, and compared the processing times.

The table below shows the comparison:


Plot                                   PC (s)   Farm (s)   LHCb web portal (s)
Diskspace by Site                      97.39    36.93      about 16
Diskspace by Jobtype                   97.47    40.45      about 7
CPUTime by Jobtype                     90.66    40.08      about 7
CPUTime by Site (10/06/20-12/06/20)    97.69    39.54      about 9
CPUTime by Site (08/06/20-10/06/20)    86.64    32.31      about 7
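To make the comparison easier to read, the ratio between the PC and farm timings can be computed directly from the table. This is only a minimal sketch: the names `timings` and `speedups` are illustrative and are not part of the original work.

```python
# Timings in seconds, copied from the comparison table: (PC, farm).
timings = {
    "Diskspace by Site": (97.39, 36.93),
    "Diskspace by Jobtype": (97.47, 40.45),
    "CPUTime by Jobtype": (90.66, 40.08),
    "CPUTime by Site (10/06/20-12/06/20)": (97.69, 39.54),
    "CPUTime by Site (08/06/20-10/06/20)": (86.64, 32.31),
}


def speedups(table):
    """Return {plot: PC time / farm time} for each row of the table."""
    return {plot: pc / farm for plot, (pc, farm) in table.items()}


if __name__ == "__main__":
    for plot, ratio in speedups(timings).items():
        print("%-40s farm is %.2fx faster than PC" % (plot, ratio))
```

With these numbers the farm machine is roughly 2.3 to 2.7 times faster than the PC for every plot, while the LHCb web portal remains faster than both.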


Install and configure Hadoop and HBase on my PC


Install and Configure

The environment has not yet been configured successfully at CC. Besides, some parts of the references confused me and I do not fully understand what they mean, so I tried to set up a pseudo-distributed mode on my own computer.

As far as I know from the references, we need the following services to deal with our problem:


Hadoop (HDFS and MapReduce): a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is the basic part; the other parts are built on it.

HBase: a scalable, distributed database that supports structured data storage for large tables. We will import data from MySQL into HBase in one format.


Sqoop: Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. We use it to transfer our data from MySQL to HBase.

Thrift: the Apache Thrift software framework, for scalable cross-language services development. Thrift is needed because we want to use Python.


Setting up a Hadoop environment is much more complicated than I expected. I met many unknown errors.

So far I have only successfully installed and configured Hadoop, HBase and Thrift. Sqoop still has some errors.


Hadoop:
1. Create a user account named hadoop.
2. Install SSH.
3. Install Java.
4. Install Hadoop and configure hadoop-env.sh.

After these steps, standalone mode works.

There are three *.xml files. In standalone mode they are empty; to get a pseudo-distributed mode we must configure them.


core-site.xml: Hadoop core configuration items, such as I/O settings.

hdfs-site.xml: HDFS daemon configuration items, such as the namenode and datanode.

mapred-site.xml: MapReduce daemon configuration items, such as the jobtracker and tasktracker.
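The slides do not show the values that were put into these three files. As a rough sketch, the Python script below writes them with the settings commonly used for a single-node Hadoop 1.x setup. The property values (`hdfs://localhost:9000`, replication `1`, `localhost:9001`) and the helper names `build_configuration` and `write_configs` are assumptions, not the author's actual configuration.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

# Assumed values: the usual single-node settings for Hadoop 1.x.
# Adjust host names and ports to your own installation.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def build_configuration(properties):
    """Build a Hadoop <configuration> element from a {name: value} dict."""
    root = ET.Element("configuration")
    for name, value in sorted(properties.items()):
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return root


def write_configs(conf_dir, configs=PSEUDO_DISTRIBUTED):
    """Write one XML file per entry of configs into conf_dir."""
    for filename, properties in configs.items():
        tree = ET.ElementTree(build_configuration(properties))
        tree.write(os.path.join(conf_dir, filename))


if __name__ == "__main__":
    # Write to a scratch directory so an existing conf/ is not overwritten.
    out_dir = tempfile.mkdtemp()
    write_configs(out_dir)
    print("wrote %s to %s" % (", ".join(sorted(PSEUDO_DISTRIBUTED)), out_dir))
```

Writing to a temporary directory first makes it possible to inspect the generated files before copying them into the Hadoop conf directory.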

Start Hadoop:
1. Format the HDFS.
2. Start all the daemon processes.

Hadoop is then started.


HBase:
1. Install Java (already installed).
2. Install HBase.
3. Configure hbase-site.xml: set hbase.rootdir.
4. Start HBase.

Use the HBase shell to create a table named test with two column families: zhang and gang.


Thrift (I found this part complicated):
1. Install all the required tools and libraries to build and install the Apache Thrift compiler.
2. From the top directory, run: ./configure
3. Once configure has run, run: make and make test
4. From the top directory, become superuser and run: make install

If there are no errors (I met many), Thrift is successfully installed. Then generate the Python client and move it to ~/python2.7/site-packages.


After generating the Python client, we can use Python to access HBase. The next part is about a script I wrote to interact with HBase.


Script

As a test, I created a table named diracAccounting. It has two column families, 'groupby' and 'generate', and each family has one column: 'groupby:Site' and 'generate:CPUTime'.

The row key is the 'starttime' from the MySQL tables. I pushed the whole code to GitHub:

https://github.com/zhangg/LearningCode/blob/master/Program/Hadoop/HbasePy/hbaseplot.py


def put(self):
    '''put some records to hbase table'''

Select 'Site', 'CPUTime' and 'Starttime' from the MySQL database and put them into the HBase table, using 'Starttime' as the row key.
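The mapping from one MySQL record to one HBase row can be sketched as below. This is not the code from the repository: `row_to_cells` and `put_records` are hypothetical helpers, and `write_row` stands in for whatever call the Thrift client uses to store a row, so the sketch runs without an HBase server. The sample rows are made up.

```python
def row_to_cells(site, cputime, starttime):
    """Map one MySQL record to (row key, {column: value}) for diracAccounting."""
    row_key = str(starttime)
    cells = {
        "groupby:Site": str(site),
        "generate:CPUTime": str(cputime),
    }
    return row_key, cells


def put_records(records, write_row):
    """Send every (site, cputime, starttime) record through write_row."""
    count = 0
    for site, cputime, starttime in records:
        row_key, cells = row_to_cells(site, cputime, starttime)
        write_row(row_key, cells)
        count += 1
    return count


if __name__ == "__main__":
    # Made-up sample rows standing in for the result of the MySQL select.
    sample = [("SiteA", 120.5, 1339113600), ("SiteB", 80.0, 1339117200)]
    stored = {}
    put_records(sample, stored.__setitem__)
    print(stored)
```

One consequence of this mapping is worth keeping in mind: two records that share the same start time are written to the same row, so the later one replaces the earlier values.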


def generatePlot(self, groupbyName, generateName):
    '''use records to generate a plot'''

In this function, I scan the records and generate a plot.
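The slides do not show how the scanned records are turned into plot data. One plausible approach, given here only as an assumption, is to sum the generated quantity per group value. In the sketch below, `aggregate` is a hypothetical helper and each scanned record is represented as a plain dictionary of column name to string value.

```python
from collections import defaultdict


def aggregate(records, groupby_name, generate_name):
    """Sum the generate_name column per distinct groupby_name value."""
    totals = defaultdict(float)
    for cells in records:
        # Skip rows that lack either column instead of failing.
        if groupby_name in cells and generate_name in cells:
            totals[cells[groupby_name]] += float(cells[generate_name])
    return dict(totals)


if __name__ == "__main__":
    # Made-up records standing in for the result of an HBase scan.
    scanned = [
        {"groupby:Site": "SiteA", "generate:CPUTime": "10.0"},
        {"groupby:Site": "SiteB", "generate:CPUTime": "4.5"},
        {"groupby:Site": "SiteA", "generate:CPUTime": "2.5"},
    ]
    print(aggregate(scanned, "groupby:Site", "generate:CPUTime"))
```

The resulting dictionary of totals per site can then be handed to a plotting library to draw the plot.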


end