Hadoop operations

Transcript
Page 1: Hadoop operations

How Hadoop Works
Marc Cluet – Lynx Consultants

Page 2: Hadoop operations

What we’ll cover

- Understand Hadoop in detail
- See how Hadoop works operationally
- Be able to start asking the right questions of your data


Page 3: Hadoop operations

Hadoop Distributions

- Cloudera CDH
- Hortonworks
- MapR


Page 4: Hadoop operations

Hadoop Components

- HDFS
- HBase
- MapRed
- YARN


Page 5: Hadoop operations

Hadoop Components

- HDFS
  - Hadoop Distributed File System
  - Everything else sits on top of it
  - Keeps 3 copies of every block by default (see the sketch after this list)
- HBase
- MapRed
- YARN
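
Purely as an illustration, that default is exposed as the dfs.replication property; a minimal sketch, assuming the Hadoop client configuration directory (/etc/hadoop/conf) is on the classpath:

    import org.apache.hadoop.conf.Configuration;

    public class ShowReplication {
        public static void main(String[] args) {
            // Configuration layers core-site.xml / hdfs-site.xml from the classpath
            // on top of the built-in defaults.
            Configuration conf = new Configuration();
            // dfs.replication controls how many copies of each block HDFS keeps;
            // the shipped default is 3.
            System.out.println("dfs.replication = " + conf.getInt("dfs.replication", 3));
        }
    }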


Page 6: Hadoop operations

Hadoop Components

- HDFS
- HBase
  - Hadoop’s schemaless database
  - Key-value store
  - Sits on top of HDFS
- MapRed
- YARN


Page 7: Hadoop operations

Hadoop Components

- HDFS
- HBase
- MapRed
  - Hadoop Map/Reduce
  - Non-pluggable, archaic
  - Requires HDFS for temp storage
- YARN


Page 8: Hadoop operations

Hadoop Components

- HDFS
- HBase
- MapRed
- YARN
  - Hadoop Map/Reduce version 2.0
  - Pluggable: you can add your own frameworks
  - Faster and less memory-hungry


Page 9: Hadoop operations

Hadoop Component Breakdown

- All these components divide into
  - client/server and
  - master/slave roles
- We will now walk through the breakdown of each individual component


Page 10: Hadoop operations

Hadoop Components Breakdown

- HDFS
  - Master Namenode
    - Keeps track of how every file’s blocks are allocated across the Datanodes
    - Rebalances data if one of the Datanodes goes down
    - Is rack-aware
  - Secondary Namenode
    - Performs housekeeping (checkpointing) for the Namenode
    - Not necessarily a separate server from the Namenode
  - Datanode
    - Stores the data
    - Best run with plain (non-RAID) disks for extra I/O throughput


Page 11: Hadoop operations

Hadoop Components Breakdown

- HDFS
  - How to access
    - Clients connect with the hadoop client to hdfs://namenode:8020 (see the sketch after this list)
    - Supports all the basic Unix file commands
  - Configuration files
    - /etc/hadoop/conf/core-site.xml
      - Defines the major configuration, such as the HDFS namenode and default parameters
    - /etc/hadoop/conf/hdfs-site.xml
      - Defines configuration specific to the namenode or datanode, such as file locations
    - /etc/hadoop/conf/slaves
      - Defines the list of servers available in this cluster
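
For illustration, a minimal sketch of the same access path from Java, assuming the client configuration under /etc/hadoop/conf is on the classpath; the namenode hostname and the /user/data path are placeholders:

    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsList {
        public static void main(String[] args) throws Exception {
            // core-site.xml / hdfs-site.xml are picked up from the classpath.
            Configuration conf = new Configuration();
            // "namenode" is a placeholder host; 8020 is the default namenode port.
            try (FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf)) {
                for (FileStatus status : fs.listStatus(new Path("/user/data"))) {
                    System.out.println(status.getPath() + "\t" + status.getLen() + " bytes");
                }
            }
        }
    }

The command-line equivalent of the listing above would be something like: hadoop fs -ls /user/data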


Page 12: Hadoop operations

Hadoop Components Breakdown

- HBase
  - Master
    - Controls the HBase cluster, knows where the data is allocated, and provides a client-facing listening socket via Thrift and/or a RESTful API
  - Regionserver
    - HBase node; stores part of the data in one or more regions, roughly equivalent to sharding
  - Thrift / REST
    - Interfaces for connecting to HBase


Page 13: Hadoop operations

Hadoop Components Breakdown

- HBase
  - How to access (see the client sketch after this list)
    - Through the HBase client (using Thrift)
    - Through the RESTful API
  - Configuration files
    - /etc/hbase/conf/hbase-site.xml
      - Defines all the basic configuration for accessing HBase
    - /etc/hbase/conf/hbase-policy.xml
      - Defines the security (ACL) settings and the HBase memory tweaks
    - /etc/hbase/conf/regionservers
      - Lists all the regionservers available to this cluster
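
The slide lists Thrift and REST; for completeness, here is a minimal sketch using HBase’s native Java client API of that era (HTable/Put/Get), assuming hbase-site.xml is on the classpath. The table name "metrics" and column family "cf" are placeholders that would need to exist:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseRoundTrip {
        public static void main(String[] args) throws Exception {
            // Reads hbase-site.xml from the classpath to locate the cluster.
            Configuration conf = HBaseConfiguration.create();
            HTable table = new HTable(conf, "metrics");   // "metrics" is a placeholder table
            try {
                Put put = new Put(Bytes.toBytes("row-1"));
                put.add(Bytes.toBytes("cf"), Bytes.toBytes("count"), Bytes.toBytes("42"));
                table.put(put);                            // write one cell

                Result result = table.get(new Get(Bytes.toBytes("row-1")));
                byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("count"));
                System.out.println("count = " + Bytes.toString(value));
            } finally {
                table.close();
            }
        }
    }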


Page 14: Hadoop operations

Hadoop Components Breakdown

- MapRed
  - JobTracker
    - Creates the Map/Reduce jobs
    - Stores all the intermediate job data
    - Keeps track of all previous results through the HistoryServer
  - TaskTracker
    - Executes the tasks belonging to a Map/Reduce job
    - Very CPU- and memory-intensive
    - Stores intermediate results, which are then pushed to the JobTracker


Page 15: Hadoop operations

Hadoop Components Breakdown

- MapRed
  - How to access
    - Through the Hadoop client
    - Through any MapRed client such as Pig or Hive
    - Through your own Java code (see the sketch after this list)
  - Configuration files
    - /etc/hadoop/conf/mapred-site.xml
      - Defines how to contact this MapRed cluster
    - /etc/hadoop/conf/mapred-queue-acls.xml
      - Defines the ACL structure for accessing MapRed; normally not needed
    - /etc/hadoop/conf/slaves
      - Defines the list of TaskTrackers in this cluster
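
For the "own Java code" route, a minimal sketch of the classic word-count job using the Hadoop 2.x (CDH4-era) mapreduce API, assuming the cluster configuration is on the classpath and the input/output paths are passed as arguments; all class names here are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // Map phase: emit (word, 1) for every token in the input split.
        public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                for (String token : value.toString().split("\\s+")) {
                    if (token.isEmpty()) continue;
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }

        // Reduce phase: sum the counts for each word.
        public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                context.write(key, new IntWritable(sum));
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads mapred-site.xml from the classpath
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenMapper.class);
            job.setCombinerClass(SumReducer.class);     // local pre-aggregation before the shuffle
            job.setReducerClass(SumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Packaged into a jar, it would be submitted through the Hadoop client with something like: hadoop jar wordcount.jar WordCount /input /output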


Page 16: Hadoop operations

Hadoop Components Breakdown

- YARN
  - Same structure as MapRed (Map/Reduce 2.0 jobs run on top of it)
  - Configuration files
    - /etc/hadoop/conf/yarn-site.xml
      - All the configuration required for YARN (see the sketch below)
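
A small sketch of reading that configuration programmatically, assuming yarn-site.xml is on the classpath; YarnConfiguration simply layers it on top of the core Hadoop defaults:

    import org.apache.hadoop.yarn.conf.YarnConfiguration;

    public class ShowYarnConfig {
        public static void main(String[] args) {
            // YarnConfiguration adds yarn-site.xml (and yarn-default.xml)
            // on top of the usual core-site.xml settings.
            YarnConfiguration conf = new YarnConfiguration();
            System.out.println("ResourceManager: "
                    + conf.get(YarnConfiguration.RM_ADDRESS, YarnConfiguration.DEFAULT_RM_ADDRESS));
            System.out.println("Scheduler class: "
                    + conf.get(YarnConfiguration.RM_SCHEDULER));
        }
    }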


Page 17: Hadoop operations

Hadoop Cluster Breakdown

- Namenode server
  - HDFS Namenode
  - HBase Master
- Secondary Namenode server
  - HDFS Secondary Namenode
- JobTracker server
  - MapRed JobTracker
  - MapRed HistoryServer


Page 18: Hadoop operations

Hadoop Cluster Breakdown

- Datanode server
  - HDFS Datanode
  - HBase RegionServer
  - MapRed TaskTracker


Page 19: Hadoop operations

Hadoop Hardware Requirements

- Namenode server
  - Redundant power supplies
  - RAID 1 drives
  - Enough memory (16 GB)
- Secondary Namenode server
  - Almost none


Page 20: Hadoop operations

Hadoop Hardware Requirements

- JobTracker server
  - Redundant power supplies
  - RAID 1 drives
  - Enough memory (16 GB)
- Datanode server
  - Lots of cheap disk (no RAID)
  - Lots of memory (32 GB)
  - Lots of CPU


Page 21: Hadoop operations

Hadoop Default Ports

- HDFS
  - 8020: HDFS Namenode
  - 50010: HDFS Datanode (data transfer)
- MapRed
  - No defaults
- HBase
  - 60010: Master
  - 60020: Regionserver
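
Operationally, a generic TCP connect is enough to verify that one of these ports is actually listening; a minimal sketch with nothing Hadoop-specific in it, where "namenode" is a placeholder hostname and 8020 the Namenode default:

    import java.net.InetSocketAddress;
    import java.net.Socket;

    public class PortCheck {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "namenode";              // placeholder hostname
            int port = args.length > 1 ? Integer.parseInt(args[1]) : 8020;     // HDFS Namenode default
            try (Socket socket = new Socket()) {
                socket.connect(new InetSocketAddress(host, port), 2000);       // 2-second timeout
                System.out.println(host + ":" + port + " is reachable");
            } catch (Exception e) {
                System.out.println(host + ":" + port + " is NOT reachable: " + e.getMessage());
            }
        }
    }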


Page 22: Hadoop operations

Hadoop HDFS Workflow


Page 23: Hadoop operations

Hadoop MapRed Workflow


Page 24: Hadoop operations

Hadoop MapRed Workflow


Page 25: Hadoop operations

Flume

- Transports streams of data from point A to point B
- Source
  - Where the data is read from
- Channel
  - How the data is buffered
- Sink
  - Where the data is written


Page 26: Hadoop operations

Flume

- Flume is fault-tolerant
- Sources keep a pointer to their position
  - With some exceptions, but most sources are in a known state
- Channels can be fault-tolerant
  - A channel written to disk can recover from where it left off
- Sinks can be redundant
  - More than one sink can receive the same data
  - Data is serialised and deduplicated using Avro


Page 27: Hadoop operations

Flume


Page 28: Hadoop operations

Flume

- Configuration files
  - /etc/flume-ng/conf/flume.conf
    - Defines the agent configuration: source, channel and sink (a client sketch follows below)
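
A minimal sketch of feeding such an agent from application code via the Flume NG client SDK, assuming the agent's source in flume.conf is an Avro source and that localhost:41414 is where it listens (both are placeholders):

    import java.nio.charset.Charset;

    import org.apache.flume.Event;
    import org.apache.flume.api.RpcClient;
    import org.apache.flume.api.RpcClientFactory;
    import org.apache.flume.event.EventBuilder;

    public class FlumeSender {
        public static void main(String[] args) throws Exception {
            // Connects to the agent's Avro source; host/port must match flume.conf.
            RpcClient client = RpcClientFactory.getDefaultInstance("localhost", 41414);
            try {
                Event event = EventBuilder.withBody("hello flume", Charset.forName("UTF-8"));
                client.append(event);   // one event; appendBatch() exists for bulk sends
            } finally {
                client.close();
            }
        }
    }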


Page 29: Hadoop operations

Flume


Page 30: Hadoop operations

Hadoop Recommended Reads


Page 31: Hadoop operations

Hadoop References

- Hadoop
  - http://hadoop.apache.org/docs/stable/cluster_setup.html
  - http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
  - http://pig.apache.org/docs/r0.7.0/setup.html
  - http://wiki.apache.org/hadoop/NameNodeFailover
- HBase
  - http://hbase.apache.org/book/book.html
- Flume
  - http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html


Page 32: Hadoop operations

Questions?


