Hadoop operations

Marc Cluet – Lynx Consultants
How Hadoop Works

What we’ll cover

• Understand Hadoop in detail
• See how Hadoop works operationally
• Be able to start asking the right questions of your data

Lynx Consultants © 2013

Hadoop Distributions

• Cloudera CDH
• Hortonworks
• MapR


Hadoop Components

• HDFS
• HBase
• MapRed
• YARN


Hadoop Components

• HDFS
  ◦ Hadoop Distributed File System
  ◦ Everything else sits on top of it
  ◦ Keeps 3 copies of every block by default
• HBase
• MapRed
• YARN
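As a rough illustration of what three copies of every block mean for capacity planning, the sketch below estimates how many blocks a file occupies and how much raw disk it consumes. The 128 MB block size is an assumption (block size is configurable and is not stated in these slides), and the function name is made up for this example.

```python
def hdfs_raw_usage(data_bytes, replication=3, block_size=128 * 1024 * 1024):
    """Return (number_of_blocks, raw_bytes_on_disk) for a single file.

    replication=3 mirrors the default mentioned above; block_size is an
    assumed value, not something taken from the slides.
    """
    # Integer ceiling division: a partly filled last block still counts.
    blocks = (data_bytes + block_size - 1) // block_size
    # The last block is not padded, so raw usage is data size times replicas.
    return blocks, data_bytes * replication


blocks, raw = hdfs_raw_usage(1024 ** 3)  # a 1 GiB file
print(blocks, raw)
```

The practical consequence is that usable capacity is roughly one third of the raw disk in the cluster when the default replication is kept.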


Hadoop Components

• HDFS
• HBase
  ◦ Hadoop schemaless database
  ◦ Key-value store
  ◦ Sits on top of HDFS
• MapRed
• YARN
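To make "schemaless key-value store" a little more concrete, here is a toy in-memory model, not the HBase API. Each row key maps to whatever columns were written for it, so two rows need not share the same columns. The class name `ToyTable` and the `family:qualifier` column strings are illustrative assumptions.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a schemaless key-value table (not HBase itself)."""

    def __init__(self):
        # row key -> {column name -> value}
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self.rows.get(row_key, {})
        # Return a copy of the whole row, or a single cell (None if absent).
        return dict(row) if column is None else row.get(column)


table = ToyTable()
table.put("user1", "info:name", "Ada")
table.put("user2", "info:email", "bob@example.com")  # different column, no schema change
```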


Hadoop Components

• HDFS
• HBase
• MapRed
  ◦ Hadoop Map/Reduce
  ◦ Non-pluggable, archaic
  ◦ Requires HDFS for temporary storage
• YARN
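For readers new to the Map/Reduce model, the following single-process sketch shows the three logical steps (map, group by key, reduce) using word counting as the job. It only illustrates the programming model; it does not use Hadoop, and all function names are chosen for this example.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse the values of one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(pairs).items())


print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce steps run as separate tasks on many machines, which is why the intermediate data needs somewhere to live between the two.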


Hadoop Components

• HDFS
• HBase
• MapRed
• YARN
  ◦ The layer behind Hadoop Map/Reduce version 2.0
  ◦ Pluggable, you can add your own frameworks
  ◦ Faster and less memory hungry


Hadoop Components Breakdown

• All these components are split into
  ◦ client/server
  ◦ master/slave scenarios
• We will now look at the breakdown of each individual component


Hadoop Components Breakdown

• HDFS
  ◦ Master Namenode
    ▪ Keeps track of all file allocation on Datanodes
    ▪ Re-replicates data if one of the datanodes goes down
    ▪ Is rack aware
  ◦ Secondary Namenode
    ▪ Does housekeeping for the namenode (periodic checkpoints of its metadata)
    ▪ Does not necessarily need a separate server from the namenode
  ◦ Datanode
    ▪ Stores the data
    ▪ Non-RAID disks are preferable for extra I/O speed
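Rack awareness matters mostly when the namenode decides where the three copies of a block go. The sketch below follows the commonly documented default policy (first copy on the writer's node, second on a node in a different rack, third on another node in that same remote rack). It is a simplified illustration under that assumption, not the namenode's actual code; the function name and the topology layout are hypothetical.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Pick three datanodes for one block.

    topology maps a rack name to the list of datanodes in that rack;
    writer is the datanode the client is writing from.
    """
    rack_of = {node: rack for rack, nodes in topology.items() for node in nodes}
    local_rack = rack_of[writer]
    # A remote rack needs two nodes to host replicas two and three.
    remote = [rack for rack, nodes in topology.items()
              if rack != local_rack and len(nodes) >= 2]
    if not remote:
        raise ValueError("need another rack with at least two datanodes")
    second, third = rng.sample(topology[rng.choice(remote)], 2)
    return [writer, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
print(place_replicas("dn1", topology))
```

Losing a whole rack therefore never removes every copy of a block, while only one of the two extra copies has to cross racks.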



• HDFS
  ◦ How to access
    ▪ A client can connect with the hadoop client to hdfs://namenode:8020
    ▪ Supports the basic Unix-style file commands
  ◦ Configuration files
    ▪ /etc/hadoop/conf/core-site.xml
      Defines the main configuration, such as the HDFS namenode and default parameters
    ▪ /etc/hadoop/conf/hdfs-site.xml
      Defines namenode- and datanode-specific configuration, such as file locations
    ▪ /etc/hadoop/conf/slaves
      Defines the list of servers that are available in this cluster
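The Hadoop `*-site.xml` files share one simple layout: a `<configuration>` element containing `<property>` entries with a `<name>` and a `<value>`. The sketch below parses a minimal, made-up core-site.xml into a dictionary. The property `fs.defaultFS` is the Hadoop 2 name for the default filesystem (older releases used `fs.default.name`); the sample content and the helper name are assumptions for illustration, not a copy of any real cluster's file.

```python
import xml.etree.ElementTree as ET

# Hypothetical minimal core-site.xml pointing clients at the namenode.
CORE_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode:8020</value>
  </property>
</configuration>
"""


def read_hadoop_conf(xml_text):
    """Return the name -> value pairs of a Hadoop-style configuration document."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.findall("property")}


conf = read_hadoop_conf(CORE_SITE)
print(conf["fs.defaultFS"])
```

A helper like this can be handy for checking that every node in the cluster points at the same namenode.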



• HBase
  ◦ Master
    ▪ Controls the HBase cluster, knows where the data is allocated and provides a client listening socket using Thrift and/or a RESTful API
  ◦ Regionserver
    ▪ HBase node; stores part of the data in its regions, which is roughly equivalent to sharding
  ◦ Thrift / REST
    ▪ Interfaces for connecting to HBase
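The sharding comparison can be sketched as a lookup over sorted row-key ranges: each region covers a contiguous range of keys and is served by one regionserver. The class below is a toy model of that idea, assuming start keys are inclusive; `RegionMap`, the split keys and the server names are all invented for this example.

```python
import bisect


class RegionMap:
    """Toy row-key to regionserver lookup over sorted key ranges."""

    def __init__(self, split_keys, servers):
        # split_keys: sorted row keys at which a new region starts.
        # servers: one regionserver per region, so one more than split_keys.
        if len(servers) != len(split_keys) + 1:
            raise ValueError("need exactly one server per region")
        self.split_keys = list(split_keys)
        self.servers = list(servers)

    def locate(self, row_key):
        # bisect_right makes a region's start key belong to that region.
        return self.servers[bisect.bisect_right(self.split_keys, row_key)]


regions = RegionMap(["g", "p"], ["rs1", "rs2", "rs3"])
print(regions.locate("hadoop"))
```

Because the ranges are sorted, rows with nearby keys land in the same region, which is what makes range scans cheap and badly chosen keys a hotspot risk.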



• HBase
  ◦ How to access
    ▪ Through the HBase client (using Thrift)
    ▪ Through the RESTful API
  ◦ Configuration files
    ▪ /etc/hbase/conf/hbase-site.xml
      Defines all the basic configuration for accessing HBase
    ▪ /etc/hbase/conf/hbase-policy.xml
      Defines all the security (ACL) and all the HBase memory tweaks
    ▪ /etc/hbase/conf/regionservers
      Lists all the regionservers available to this cluster



• MapRed
  ◦ JobTracker
    ▪ Creates the Map/Reduce jobs
    ▪ Stores all the intermediate data
    ▪ Keeps track of all the previous results through the HistoryServer
  ◦ TaskTracker
    ▪ Executes the tasks that make up a Map/Reduce job
    ▪ Very CPU and memory intensive
    ▪ Stores intermediate results, which are then pushed to the JobTracker
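One detail behind the intermediate results is how they are divided among reduce tasks: every value for a given key has to end up at the same reducer. Hadoop's default approach is to hash the key modulo the number of reducers. The sketch below imitates that idea with CRC32 as a stand-in hash (an assumption made so the result is stable between runs; it is not the hash Hadoop uses), and the function names are illustrative.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index in a stable, repeatable way."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def split_for_reducers(pairs, num_reducers):
    """Distribute (key, value) pairs into one bucket per reducer."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


buckets = split_for_reducers([("big", 1), ("data", 1), ("big", 1)], 2)
print(buckets)
```

Both `("big", 1)` pairs always share a bucket, so a single reducer sees the complete list of values for that key.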



• MapRed
  ◦ How to access
    ▪ Through the Hadoop client
    ▪ Through any MapRed client such as Pig or Hive
    ▪ Through your own Java code
  ◦ Configuration files
    ▪ /etc/hadoop/conf/mapred-site.xml
      Defines how to contact this MapRed cluster
    ▪ /etc/hadoop/conf/mapred-queue-acls.xml
      Defines the ACL structure for accessing MapRed; normally not necessary
    ▪ /etc/hadoop/conf/slaves
      Defines the list of TaskTrackers in this cluster



• YARN
  ◦ Same master/slave structure as MapRed (Map/Reduce 2.0 lives on top of it)
  ◦ Configuration files
    ▪ /etc/hadoop/conf/yarn-site.xml
      All required configuration for YARN


Hadoop Cluster Breakdown

• Namenode Server
  ◦ HDFS Namenode
  ◦ HBase Master
• Secondary Namenode Server
  ◦ HDFS Secondary Namenode
• JobTracker Server
  ◦ MapRed JobTracker
  ◦ MapRed History Server


Hadoop Cluster Breakdown

• Datanode Server
  ◦ HDFS Datanode
  ◦ HBase RegionServer
  ◦ MapRed TaskTracker


Hadoop Hardware Requirements

• Namenode Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Secondary Namenode Server
  ◦ Almost none


Hadoop Hardware Requirements

• JobTracker Server
  ◦ Redundant power supplies
  ◦ RAID1 drives
  ◦ Enough memory (16 GB)
• Datanode Server
  ◦ Lots of cheap disk (no RAID)
  ◦ Lots of memory (32 GB)
  ◦ Lots of CPU


Hadoop Default Ports

• HDFS
  ◦ 8020: HDFS Namenode
  ◦ 50010: HDFS Datanode FS transfer
• MapRed
  ◦ No defaults
• HBase
  ◦ 60010: Master (web UI)
  ◦ 60020: Regionserver
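When a client cannot reach the cluster, a first operational check is whether these ports accept TCP connections at all. The helper below is a minimal sketch using only the Python standard library; the port numbers come from the list above, while the dictionary keys and the host names in the usage note are placeholders.

```python
import socket

# Default ports as listed above; the labels are just descriptive keys.
DEFAULT_PORTS = {
    "hdfs-namenode": 8020,
    "hdfs-datanode-transfer": 50010,
    "hbase-master": 60010,
    "hbase-regionserver": 60020,
}


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `port_open("namenode", DEFAULT_PORTS["hdfs-namenode"])` would report whether the namenode is listening, assuming a host called `namenode` resolves from where the check runs. An open port only shows that something is listening, not that the service behind it is healthy.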


Hadoop HDFS Workflow


Hadoop MapRed Workflow



Flume

• Transports streams of data from point A to point B
• Source
  ◦ Where the data is read from
• Channel
  ◦ How the data is buffered
• Sink
  ◦ Where the data is written
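The source, channel and sink roles can be mimicked in a few lines to show how an event travels through an agent. This is a conceptual sketch only: real Flume agents are configured rather than coded this way, and `run_agent` is a name invented here.

```python
from collections import deque


def run_agent(source, channel, sink):
    """Move every event from the source through the channel into the sink."""
    for event in source:
        channel.append(event)       # the source writes into the buffer
    while channel:
        sink(channel.popleft())     # the sink drains the buffer in order


delivered = []
run_agent(source=["line 1", "line 2"], channel=deque(), sink=delivered.append)
print(delivered)
```

Keeping the channel as a separate buffer is what lets the source and the sink run at different speeds.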


Flume

• Flume is fault tolerant
• Sources keep a pointer to their position
  ◦ With some exceptions, but most sources are in a known state
• Channels can be fault tolerant
  ◦ A channel written to disk can recover from where it left off
• Sinks can be redundant
  ◦ More than one sink for the same data
  ◦ Data is serialised and deduplicated using Avro



Flume

• Configuration files
  ◦ /etc/flume-ng/conf/flume.conf
    ▪ Defines the agent configuration with source, channel and sink
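To give an idea of what such an agent definition looks like, the sketch below renders a minimal flume.conf in the properties style used by the Flume user guide: one netcat source, one memory channel and one logger sink. The agent and component names (`a1`, `r1`, `c1`, `k1`), the bind address and the port are example values, not settings taken from these slides.

```python
def render_flume_conf(agent="a1", port=44444):
    """Return the text of a minimal single-source, single-sink agent config."""
    lines = [
        # Declare the components that belong to this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        # Source: where the data is read from.
        f"{agent}.sources.r1.type = netcat",
        f"{agent}.sources.r1.bind = localhost",
        f"{agent}.sources.r1.port = {port}",
        # Channel: how the data is buffered.
        f"{agent}.channels.c1.type = memory",
        # Sink: where the data is written.
        f"{agent}.sinks.k1.type = logger",
        # Wiring: a source can feed several channels, a sink reads from one.
        f"{agent}.sources.r1.channels = c1",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


print(render_flume_conf())
```

Note the plural `channels` on the source and the singular `channel` on the sink, a common source of typos in hand-written configurations.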



Hadoop Recommended Reads


Hadoop References

• Hadoop
  ◦ http://hadoop.apache.org/docs/stable/cluster_setup.html
  ◦ http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  ◦ http://pig.apache.org/docs/r0.7.0/setup.html
  ◦ http://wiki.apache.org/hadoop/NameNodeFailover
• HBase
  ◦ http://hbase.apache.org/book/book.html
• Flume
  ◦ http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html


Questions?

