Hadoop Week 2 PPT
Course Topics
Week 1 Introduction to HDFS
Week 2 Setting Up Hadoop Cluster
Week 3 Map-Reduce Basics, types and formats
Week 4 PIG
Week 5 HIVE
Week 6 HBASE
Week 7 ZOOKEEPER
Week 8 SQOOP
Topics for Today
Revision
Hadoop Modes
Terminal Commands
Web UI URLs
Use Case in Healthcare
Sample Examples List in Hadoop
Running Teragen Example
Hadoop Configuration Files
Slaves & Masters
Name Node Recovery
Dump of MR jobs
Data Loading Techniques
HDFS: Hadoop Distributed File System (storage)
MapReduce (processing)
Class 1 - Revision
Let's Revise
1. What is HDFS?
2. What is the difference between a Hadoop database and a Relational Database?
3. What is a Namenode?
4. What is a Secondary Namenode?
5. Gen 1 vs. Gen 2 Hadoop.
Hadoop Modes
Hadoop can be run in one of three modes:
Standalone (or local) mode: no daemons; everything runs in a single JVM; has no DFS; suitable for running MapReduce programs during development.
Pseudo-distributed mode: Hadoop daemons run on the local machine.
Fully distributed mode: Hadoop daemons run on a cluster of machines.
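The mode is selected purely by configuration. As a sketch, using the standard Hadoop 1.x properties (the hostnames in the last row are placeholders, not fixed names):

  Mode                 fs.default.name           mapred.job.tracker
  Standalone           file:/// (default)        local (default)
  Pseudo-distributed   hdfs://localhost:8020/    localhost:8021
  Fully distributed    hdfs://namenode:8020/     jobtracker:8021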
Terminal Commands
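The command screenshots did not survive extraction; as a sketch, these are the Hadoop 1.x shell commands a slide like this typically covers (all paths are illustrative):

  hadoop fs -ls /                               # list the HDFS root directory
  hadoop fs -mkdir /user/hadoop/input           # create a directory in HDFS
  hadoop fs -put data.txt /user/hadoop/input/   # copy a local file into HDFS
  hadoop fs -cat /user/hadoop/input/data.txt    # print a file stored in HDFS
  hadoop fs -rmr /user/hadoop/input             # recursively delete (1.x syntax)
  hadoop fsck /                                 # report on HDFS health
  start-all.sh                                  # start all Hadoop daemons
  jps                                           # confirm which Java daemons are running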
Web UI URLs
NameNode status: http://localhost:50070/dfshealth.jsp
JobTracker status: http://localhost:50030/jobtracker.jsp
TaskTracker status: http://localhost:50060/tasktracker.jsp
DataBlock Scanner Report: http://localhost:50075/blockScannerReport
Sample Examples List
Running the Teragen Example
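The screenshots for this slide are missing; a minimal sketch of the commands, assuming the examples jar shipped with Hadoop 1.1.2 (the release whose documentation is linked later) and an illustrative HDFS output path:

  # with no arguments, the examples jar prints the list of bundled programs
  hadoop jar $HADOOP_HOME/hadoop-examples-1.1.2.jar
  # teragen writes 1,000,000 synthetic 100-byte rows into HDFS
  hadoop jar $HADOOP_HOME/hadoop-examples-1.1.2.jar teragen 1000000 /user/hadoop/teragen-out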
Checking the Output
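Assuming the output path used above, the result can be verified from the shell:

  hadoop fs -ls /user/hadoop/teragen-out   # expect _SUCCESS plus part-00000, part-00001, ...
  hadoop fs -du /user/hadoop/teragen-out   # total should be ~100 MB (1,000,000 rows x 100 bytes)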
Hadoop Configuration Files
Sample Cluster Configuration
Master: NameNode, JobTracker
Slave01-Slave05: DataNode, TaskTracker
Hadoop Configuration Files
Filename: Description
hadoop-env.sh: Environment variables that are used in the scripts to run Hadoop
core-site.xml: Configuration settings for Hadoop Core, such as I/O settings that are common to HDFS and MapReduce
hdfs-site.xml: Configuration settings for the HDFS daemons: the namenode, the secondary namenode, and the datanodes
mapred-site.xml: Configuration settings for the MapReduce daemons: the jobtracker and the tasktrackers
masters: A list of machines (one per line) that each run a secondary namenode
slaves: A list of machines (one per line) that each run a datanode and a tasktracker
hadoop-metrics.properties: Properties for controlling how metrics are published in Hadoop
log4j.properties: Properties for system log files, the namenode audit log, and the task log for the tasktracker child process
Configuration file for each component:
Core: core-site.xml
HDFS: hdfs-site.xml
MapReduce: mapred-site.xml
core-site.xml and hdfs-site.xml
core-site.xml: fs.default.name = hdfs://localhost:8020/
hdfs-site.xml: dfs.replication = 1
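Written out as the XML these files actually contain, a minimal pseudo-distributed sketch built from the values above:

core-site.xml:
  <configuration>
    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:8020/</value>
    </property>
  </configuration>

hdfs-site.xml:
  <configuration>
    <property>
      <name>dfs.replication</name>
      <value>1</value>
    </property>
  </configuration>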
Defining HDFS details in hdfs-site.xml
Property: dfs.data.dir
Value: /disk1/hdfs/data,/disk2/hdfs/data
Description: A list of directories where the datanode stores blocks. Each block is stored in only one of these directories.
Default: ${hadoop.tmp.dir}/dfs/data

Property: fs.checkpoint.dir
Value: /disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary
Description: A list of directories where the secondary namenode stores checkpoints. It stores a copy of the checkpoint in each directory in the list.
Default: ${hadoop.tmp.dir}/dfs/namesecondary
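The same settings as XML, a sketch reusing the table's example values (the /disk1, /disk2 paths are the table's illustrations, not requirements):

  <configuration>
    <property>
      <name>dfs.data.dir</name>
      <value>/disk1/hdfs/data,/disk2/hdfs/data</value>
    </property>
    <property>
      <name>fs.checkpoint.dir</name>
      <value>/disk1/hdfs/namesecondary,/disk2/hdfs/namesecondary</value>
    </property>
  </configuration>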
mapred-site.xml
mapred.job.tracker = localhost:8021
All Properties
1. http://hadoop.apache.org/docs/r1.1.2/core-default.html
2. http://hadoop.apache.org/docs/r1.1.2/mapred-default.html
3. http://hadoop.apache.org/docs/r1.1.2/hdfs-default.html
Defining mapred-site.xml
Property: mapred.job.tracker
Value: localhost:8021
Description: The hostname and port that the jobtracker's RPC server runs on. If set to the default value of "local", the jobtracker is run in-process on demand when you run a MapReduce job (you don't need to start the jobtracker in this case; in fact, you will get an error if you try to start it in this mode).

Property: mapred.local.dir
Value: ${hadoop.tmp.dir}/mapred/local
Description: A list of directories where MapReduce stores intermediate data for jobs. The data is cleared out when the job ends.

Property: mapred.system.dir
Value: ${hadoop.tmp.dir}/mapred/system
Description: The directory, relative to fs.default.name, where shared files are stored during a job run.

Property: mapred.tasktracker.map.tasks.maximum
Value: 2
Description: The number of map tasks that may be run on a tasktracker at any one time.

Property: mapred.tasktracker.reduce.tasks.maximum
Value: 2
Description: The number of reduce tasks that may be run on a tasktracker at any one time.
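Put together as a file, a sketch of mapred-site.xml using the values from this table:

  <configuration>
    <property>
      <name>mapred.job.tracker</name>
      <value>localhost:8021</value>
    </property>
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>
    </property>
  </configuration>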
Slaves and masters
Two files are used by the startup and shutdown commands:
masters: contains a list of hosts, one per line, that are to host secondary NameNode servers
slaves: contains a list of hosts, one per line, that are to host DataNode and TaskTracker servers
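For the sample cluster shown earlier, the two files would look like this (hostnames taken from that slide; placing the secondary namenode on the master is an assumption):

masters:
  master

slaves:
  slave01
  slave02
  slave03
  slave04
  slave05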
hadoop-env.sh
hadoop-env.sh sets the per-process runtime environment (the JVM) for each Hadoop daemon. The file also offers a way to provide custom parameters for each of the servers, and it is sourced by all of the Hadoop Core scripts provided in the conf/ directory of the installation.
Set the parameter JAVA_HOME here.
Examples of environment variables that you can specify:
export HADOOP_DATANODE_HEAPSIZE="128"
export HADOOP_TASKTRACKER_HEAPSIZE="512"
hadoop-env.sh Sample
# Set Hadoop-specific environment variables here.
# The only required environment variable is JAVA_HOME. All others are
# optional. When running a distributed configuration it is best to
# set JAVA_HOME in this file, so that it is correctly defined on
# remote nodes.
# The java implementation to use. Required.
# export JAVA_HOME=/usr/lib/j2sdk1.6-sun
# Extra Java runtime options. Empty by default.
export HADOOP_OPTS="-Djava.net.preferIPv4Stack=true ${HADOOP_OPTS}"
.. .. ..
# A string representing this instance of hadoop. $USER by default.
export HADOOP_IDENT_STRING=$USER
Namenode Recovery
1. Shut down the secondary NameNode.
2. Copy the contents of the secondary's fs.checkpoint.dir into the NameNode's dfs.name.dir.
3. Copy the contents of the secondary's fs.checkpoint.edits into the NameNode's dfs.name.edits.dir.
4. When the copy completes, start the NameNode and restart the secondary NameNode.
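A hedged shell sketch of steps 2 and 3, assuming single-directory values for each property and that the copy runs on the namenode host (every path below is an assumption; substitute your configured values, and use scp if the secondary runs on another machine):

  # step 2: fs.checkpoint.dir -> dfs.name.dir (assumed example paths)
  cp -r /disk1/hdfs/namesecondary/* /disk1/hdfs/name/
  # step 3: fs.checkpoint.edits -> dfs.name.edits.dir (assumed example paths)
  cp -r /disk1/hdfs/namesecondaryedits/* /disk1/hdfs/nameedits/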
Reporting
hadoop-metrics.properties: this file controls the reporting of metrics. The default is not to report.
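A sketch of what the file contains: the shipped default binds each context to NullContext, so nothing is reported; switching a context to FileContext writes its metrics to a local file (the period and path below are illustrative):

  # defaults: metrics are not reported
  dfs.class=org.apache.hadoop.metrics.spi.NullContext
  mapred.class=org.apache.hadoop.metrics.spi.NullContext
  jvm.class=org.apache.hadoop.metrics.spi.NullContext
  # example: report dfs metrics to a file every 10 seconds
  # dfs.class=org.apache.hadoop.metrics.file.FileContext
  # dfs.period=10
  # dfs.fileName=/tmp/dfsmetrics.log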
Dump of a MR Job
Data Loading Techniques
Using Hadoop Copy Commands
Using Flume
Using Sqoop
(Diagram: each of these techniques loads data into HDFS.)
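For the first technique, the copy commands are the ordinary HDFS shell verbs (paths illustrative):

  hadoop fs -put /local/logs/events.log /user/hadoop/raw/
  hadoop fs -copyFromLocal /local/logs/ /user/hadoop/raw/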
FLUME
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of streaming event data.
SQOOP
Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.
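As a one-line illustration of a Sqoop import (the connection string, credentials, table, and target directory are all hypothetical):

  sqoop import --connect jdbc:mysql://dbhost/sales --username dbuser \
    --table customers --target-dir /user/hadoop/customers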
http://hadoop.apache.org/
Assignment for this Week
Attempt the following assignment using the document available in the LMS under the Week 2 tab: Flume Set-up on Cloudera. Then attempt Assignment Week 2.
Refresh your Java Skills using Java for Hadoop Tutorial on LMS
Ask your doubts
Q & A..?
Thank You! See You in Class Next Week.