Hadoop HBase
Zhang Gang, 2012.12.25
PC VS Farm
Because the Hadoop farm at CC has not yet been configured successfully, I could not run any tests with HBase there.
Instead, I used the machine named hadoop01, which belongs to that farm, to select records from MySQL, ran the same selects on my PC, and compared the processing times.
The table below shows the comparison:
Plot                                  PC (s)   Farm (s)   LHCb web portal (s)
Diskspace by Site                     97.39    36.93      about 16
Diskspace by Jobtype                  97.47    40.45      about 7
CPUTime by Jobtype                    90.66    40.08      about 7
CPUTime by Site (10/06/20-12/06/20)   97.69    39.54      about 9
CPUTime by Site (08/06/20-10/06/20)   86.64    32.31      about 7
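To make the PC versus farm comparison easier to read, the ratio of the two timings can be computed for each plot. The sketch below only uses the numbers from the table; the names timings and speedup are chosen here for illustration. With these numbers the farm machine comes out roughly 2.3 to 2.7 times faster than the PC.

```python
# Timings in seconds, copied from the comparison table: (PC, Farm).
timings = {
    "Diskspace by Site": (97.39, 36.93),
    "Diskspace by Jobtype": (97.47, 40.45),
    "CPUTime by Jobtype": (90.66, 40.08),
    "CPUTime by Site (10/06/20-12/06/20)": (97.69, 39.54),
    "CPUTime by Site (08/06/20-10/06/20)": (86.64, 32.31),
}


def speedup(pc_seconds, farm_seconds):
    """Return how many times faster the farm machine was than the PC."""
    return pc_seconds / farm_seconds


speedups = {plot: speedup(pc, farm) for plot, (pc, farm) in timings.items()}

for plot, ratio in speedups.items():
    print(f"{plot}: {ratio:.2f}x")
```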
Install and configure Hadoop and HBase on my PC
Install and Configure
The environment has not been configured successfully at CC. In addition, some parts of the references confused me, and I do not fully understand what they mean. So I tried to set up a pseudo-distributed mode on my own computer.
As far as I know from the references, we need the following services for our problem:
Hadoop (HDFS and MapReduce): A framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. This is the basic part; the other parts are built on it.
HBase: A scalable, distributed database that supports structured data storage for large tables. We will import data from MySQL into HBase in one format.
Sqoop: Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. We use it to transfer our data from MySQL to HBase.
Thrift: The Apache Thrift software framework, for scalable cross-language services development. Because we want to use Python, Thrift is needed.
Setting up a Hadoop environment is much more complicated than I expected. I met many unknown errors.
So far I have only installed and configured Hadoop, HBase and Thrift successfully. Sqoop still has some errors.
Hadoop:
1. Create a user account named hadoop.
2. Install SSH.
3. Install Java.
4. Install Hadoop and configure hadoop-env.sh.
After these steps, standalone mode works.
There are three *.xml files. In standalone mode they are empty; to get a pseudo-distributed mode, we must configure them.
core-site.xml: Hadoop core configuration items, such as I/O configuration.
hdfs-site.xml: HDFS daemon process configuration items, such as the namenode and datanode.
mapred-site.xml: MapReduce daemon process configuration items, such as the jobtracker and tasktracker.
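As an illustration of what goes into the three files, the sketch below renders a minimal property set for each one. The property names follow the usual Hadoop 1.x single-node setup, and the values (localhost, ports 9000 and 9001, replication 1) are assumptions; the slides do not show the settings that were actually used. The helper build_site_xml is a name introduced here.

```python
import xml.etree.ElementTree as ET

# Assumed values for a single-machine (pseudo-distributed) Hadoop 1.x setup;
# the slides do not show the actual settings.
PSEUDO_DISTRIBUTED = {
    "core-site.xml": {"fs.default.name": "hdfs://localhost:9000"},
    "hdfs-site.xml": {"dfs.replication": "1"},
    "mapred-site.xml": {"mapred.job.tracker": "localhost:9001"},
}


def build_site_xml(properties):
    """Render a dict of Hadoop properties as a <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


for filename, properties in PSEUDO_DISTRIBUTED.items():
    print(filename)
    print(build_site_xml(properties))
```

The printed XML would still have to be saved into the conf directory of the Hadoop installation by hand.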
Start Hadoop:
1. Format the HDFS.
2. Start all daemon processes.
Then Hadoop is started.
HBase:
1. Install Java (already installed).
2. Install HBase.
3. Configure hbase-site.xml: set hbase.rootdir.
4. Start HBase.
Then use the HBase shell to create a table named test with two column families: zhang and gang.
Thrift (I found this complicated):
1. Install all the required tools and libraries to build and install the Apache Thrift compiler.
2. From the top directory, run: ./configure
3. Once configure has run, run: make and make test
4. From the top directory, become superuser and run: make install
If there are no errors (I met many), Thrift is successfully installed. Then generate the Python client and move it to ~/python2.7/site-packages.
After generating the Python client, we can use Python to access HBase. The next part is about a script I wrote to interact with HBase.
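A minimal sketch of how a Python program might reach HBase through the generated client is given here. It assumes the HBase Thrift server is running on localhost at its default port 9090 and that the generated hbase package is importable; the function names open_hbase_client and table_exists are introduced here and are not taken from the author's script. The generated API can differ between HBase versions, so treat the calls as an outline.

```python
def open_hbase_client(host="localhost", port=9090):
    """Open a Thrift connection to HBase and return (client, transport).

    Imports are local so this module can be loaded without Thrift installed.
    """
    from thrift.transport import TSocket, TTransport
    from thrift.protocol import TBinaryProtocol
    from hbase import Hbase  # package generated by the Thrift compiler

    transport = TTransport.TBufferedTransport(TSocket.TSocket(host, port))
    protocol = TBinaryProtocol.TBinaryProtocol(transport)
    client = Hbase.Client(protocol)
    transport.open()
    return client, transport


def table_exists(client, table_name):
    """Check whether a table is known to HBase, using the client's table list."""
    return table_name in client.getTableNames()
```

With a running Thrift server, table_exists(client, "test") would report whether the table created in the HBase shell is visible from Python; the transport should be closed with transport.close() when done.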
Script
As a test, I created a table named diracAccounting. It has two column families, ‘groupby’ and ‘generate’, and each family has one column: ‘groupby:Site’ and ‘generate:CPUTime’.
The row key is the ‘starttime’ from the MySQL tables. The complete code is on GitHub:
https://github.com/zhangg/LearningCode/blob/master/Program/Hadoop/HbasePy/hbaseplot.py
def put(self):
    '''put some records to hbase table'''
Select ‘Site’, ‘CPUTime’ and ‘Starttime’ from the MySQL database and put them into the table in HBase, using ‘starttime’ as the row key.
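The mapping from a MySQL record to an HBase row could look like the sketch below. It is not the code from the repository: to_hbase_row, put_records and make_mutation are names introduced here, and the sample record is made up. With the generated Thrift bindings, make_mutation would typically wrap the Mutation type, and the exact mutateRow signature depends on the HBase version.

```python
def to_hbase_row(record):
    """Map one (site, cputime, starttime) record to (row_key, cells)."""
    site, cputime, starttime = record
    row_key = str(starttime)  # 'starttime' becomes the row key
    cells = {
        "groupby:Site": str(site),
        "generate:CPUTime": str(cputime),
    }
    return row_key, cells


def put_records(client, table, records, make_mutation):
    """Write records through a Thrift-style client.

    make_mutation(column, value) is a hypothetical factory that builds
    one cell mutation for the client library in use.
    """
    for record in records:
        row_key, cells = to_hbase_row(record)
        mutations = [make_mutation(col, val) for col, val in cells.items()]
        client.mutateRow(table, row_key, mutations)


# Made-up sample record: (Site, CPUTime, Starttime)
print(to_hbase_row(("SiteA", 12.5, 1276000000)))
```

One consequence of this design is that two records with the same starttime share a row key, so the later one overwrites the earlier one's cells.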
def generatePlot(self, groupbyName, generateName):
    '''use records to generate a plot'''
In this function, I scan the records and generate a plot.
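The data preparation behind such a plot can be sketched as a group-and-sum over the scanned rows. The example below assumes each scanned row has already been turned into a dict from column name to string value; sum_by_group and the sample rows are illustrative and not taken from the repository. The drawing step itself is left out.

```python
from collections import defaultdict


def sum_by_group(rows, groupby_column="groupby:Site",
                 value_column="generate:CPUTime"):
    """Sum the value column for each distinct value of the group-by column."""
    totals = defaultdict(float)
    for cells in rows:
        totals[cells[groupby_column]] += float(cells[value_column])
    return dict(totals)


# Made-up rows standing in for the result of an HBase scan.
sample_rows = [
    {"groupby:Site": "SiteA", "generate:CPUTime": "10"},
    {"groupby:Site": "SiteB", "generate:CPUTime": "5"},
    {"groupby:Site": "SiteA", "generate:CPUTime": "20"},
]

totals = sum_by_group(sample_rows)
print(totals)
```

The resulting totals per site are what a plotting library would then draw, for example as one bar per site.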
end