Hadoop operations

Lynx Consultants training about Hadoop Operations

1. How Hadoop Works
    Marc Cluet, Lynx Consultants

2. What we'll cover
    Understand Hadoop in detail
    See how Hadoop works operationally
    Be able to start asking the right questions of your data
    Lynx Consultants 2013

3. Hadoop Distributions
    Cloudera CDH
    Hortonworks
    MapR

4. Hadoop Components
    HDFS
    HBase
    MapRed
    YARN

5. Hadoop Components: HDFS
    Hadoop Distributed File System
    Everything sits on top of it
    Keeps 3 copies of every block by default

6. Hadoop Components: HBase
    Hadoop schemaless database
    Key-value store
    Sits on top of HDFS

7. Hadoop Components: MapRed
    Hadoop Map/Reduce
    Non-pluggable, archaic
    Requires HDFS for temporary storage

8. Hadoop Components: YARN
    Hadoop Map/Reduce version 2.0
    Pluggable, you can add your own
    Fast and less memory hungry

9. Hadoop Component Breakdown
    All these components are split into client/server and master/slave roles
    We will now go through the breakdown of each individual component

10. Hadoop Components Breakdown: HDFS
    Master Namenode
        Keeps track of all file allocation on Datanodes
        Re-replicates data if one of the datanodes goes down
        Is rack aware
    Secondary Namenode
        Does cleanup services for the namenode
        Not necessarily two different servers
    Datanode
        Stores the data
        Non-RAID disks are preferable for extra I/O speed

11. Hadoop Components Breakdown: HDFS
    How to access
        A client can connect with the hadoop client to hdfs://namenode:8020
        Supports all basic Unix commands
    Configuration files
        /etc/hadoop/conf/core-site.xml
            Defines major configuration such as the HDFS namenode and default parameters
        /etc/hadoop/conf/hdfs-site.xml
            Defines configuration specific to the namenode or datanode, such as file locations
        /etc/hadoop/conf/slaves
            Defines the list of servers that are available in this cluster

12. Hadoop Components Breakdown: HBase
    Master
        Controls the HBase cluster, knows where the data is allocated, and provides a client listening socket using Thrift and/or a RESTful API
    Regionserver
        HBase node; stores some of the information in one of the regions, roughly equivalent to sharding
    Thrift / REST
        Interface to connect to HBase

13. Hadoop Components Breakdown: HBase
    How to access
        Through the HBase client (using Thrift)
        Through the RESTful API
    Configuration files
        /etc/hbase/conf/hbase-site.xml
            Defines all the basic configuration for accessing HBase
        /etc/hbase/conf/hbase-policy.xml
            Defines all the security (ACL) and all the HBase memory tweaks
        /etc/hbase/conf/regionservers
            Lists all the regionservers available to this cluster

14. Hadoop Components Breakdown: MapRed
    JobTracker
        Creates the Map/Reduce jobs
        Stores all the intermediate data
        Keeps track of all the previous results through the HistoryServer
    TaskTracker
        Executes tasks related to the Map/Reduce job
        Very CPU and memory intensive
        Stores intermediate results, which are then pushed to the JobTracker

15. Hadoop Components Breakdown: MapRed
    How to access
        Through the Hadoop client
        Through any MapRed client like Pig or Hive
        Your own Java code
    Configuration files
        /etc/hadoop/conf/mapred-site.xml
            Defines how to contact this MapRed cluster
        /etc/hadoop/conf/mapred-queue-acls.xml
            Defines the ACL structure for accessing MapRed; normally not necessary
        /etc/hadoop/conf/slaves
            Defines the list of TaskTrackers in this cluster

16. Hadoop Components Breakdown: YARN
    Same structure as MapRed (lives on top of it)
    Configuration files
        /etc/hadoop/conf/yarn-site.xml
            All required configuration for YARN

17. Hadoop Cluster Breakdown
    Namenode Server
        HDFS Namenode
        HBase Master
    Secondary Namenode Server
        HDFS Secondary Namenode
    JobTracker Server
        MapRed JobTracker
        MapRed History Server

18. Hadoop Cluster Breakdown
    Datanode Server
        HDFS Datanode
        HBase RegionServer
        MapRed TaskTracker

19. Hadoop Hardware Requirements
    Namenode Server
        Redundant power supplies
        RAID1 drives
        Enough memory (16 GB)
    Secondary Namenode Server
        Almost none

20. Hadoop Hardware Requirements
    JobTracker Server
        Redundant power supplies
        RAID1 drives
        Enough memory (16 GB)
    Datanode Server
        Lots of cheap disk (no RAID)
        Lots of memory (32 GB)
        Lots of CPU

21. Hadoop Default Ports
    HDFS
        8020: HDFS Namenode
        50010: HDFS Datanode FS transfer
    MapRed
        No defaults
    HBase
        60010: Master
        60020: Regionserver

22. Hadoop HDFS Workflow

23. Hadoop MapRed Workflow

24. Hadoop MapRed Workflow

25. Flume
    Transports streams of data from point A to point B
    Source
        Where the data is read from
    Channel
        How the data is buffered
    Sink
        Where the data is written

26. Flume
    Flume is fault tolerant
    Sources are pointer kept
        With some exceptions, most sources are in a known state
    Channels can be fault tolerant
        A channel written to disk can recover from where it left off
    Sinks can be redundant
        More than one sink for the same data
    Data is serialised and deduplicated using Avro

27. Flume

28. Flume
    Configuration files
        /etc/flume-ng/conf/flume.conf
            Defines the agent configuration with source, channel, sink

29. Flume

30. Hadoop Recommended Reads

31. Hadoop References
    Hadoop
        http://hadoop.apache.org/docs/stable/cluster_setup.html
        http://rc.cloudera.com/cdh/4/hadoop/hadoop-yarn/hadoop-yarn-site/ClusterSetup.html
        http://pig.apache.org/docs/r0.7.0/setup.html
        http://wiki.apache.org/hadoop/NameNodeFailover
    HBase
        http://hbase.apache.org/book/book.html
    Flume
        http://archive.cloudera.com/cdh4/cdh/4/flume-ng/FlumeUserGuide.html

32. Questions?
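
Worked example for slide 5 (HDFS keeps 3 copies of every block by default). The sketch below estimates how many blocks a file occupies and how much raw disk it consumes across the cluster. The function name `hdfs_footprint` is hypothetical, and the 128 MB block size is an assumption that is not stated in the slides; the real value depends on the cluster configuration.

```python
def hdfs_footprint(file_size_bytes, block_size_bytes=128 * 1024 * 1024, replication=3):
    """Return (block_count, raw_bytes) for a single file stored in HDFS.

    block_size_bytes is an assumed value; replication=3 is the default
    mentioned on slide 5.
    """
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    # Integer ceiling division: the last block may be only partly filled.
    block_count = (file_size_bytes + block_size_bytes - 1) // block_size_bytes
    # A partly filled block only uses the bytes it holds, so the raw
    # footprint is simply the file size times the replication factor.
    raw_bytes = file_size_bytes * replication
    return block_count, raw_bytes


one_gib = 1024 ** 3
blocks, raw = hdfs_footprint(one_gib)
print(blocks, raw // one_gib)  # 8 blocks, 3 GiB of raw disk
```

This is why the Datanode hardware on slide 20 calls for lots of cheap disk: usable capacity is roughly a third of the raw capacity at the default replication factor.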
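
Worked example for slide 21 (default ports). A minimal sketch, using only the Python standard library, that checks whether the ports listed on that slide accept TCP connections on a given host. The names `DEFAULT_PORTS`, `port_open` and `check_host` are hypothetical, and the port numbers are copied from the slide as-is; a real cluster may be configured differently.

```python
import socket

# Port numbers as listed on slide 21; treat them as assumptions for any
# particular cluster.
DEFAULT_PORTS = {
    "HDFS Namenode": 8020,
    "HDFS Datanode FS transfer": 50010,
    "HBase Master": 60010,
    "HBase Regionserver": 60020,
}


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def check_host(host, ports=DEFAULT_PORTS):
    """Map each service name to whether its port is reachable on host."""
    return {name: port_open(host, port) for name, port in ports.items()}
```

Calling `check_host("namenode")` (with a hostname of your own) returns a dictionary of service names to `True`/`False`. An open port only shows that something is listening, not that the daemon behind it is healthy.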
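
Worked example for slide 25 (Flume source, channel, sink). The following is a toy model in plain Python, not Flume code and not the Flume API: every class and function name here is hypothetical. It only illustrates the idea that a source puts events into a channel, the channel buffers them, and a sink drains them, putting an event back if the write fails.

```python
from collections import deque


class MemoryChannel:
    """Toy in-memory buffer between a source and a sink."""

    def __init__(self, capacity=100):
        self._events = deque()
        self.capacity = capacity

    def put(self, event):
        if len(self._events) >= self.capacity:
            raise OverflowError("channel is full")
        self._events.append(event)

    def take(self):
        # Returns None when the channel is empty.
        return self._events.popleft() if self._events else None

    def rollback(self, event):
        # Put an undelivered event back at the front of the buffer.
        self._events.appendleft(event)


def run_source(lines, channel):
    """Source: read lines from somewhere and hand them to the channel."""
    for line in lines:
        channel.put(line.rstrip("\n"))


def run_sink(channel, write):
    """Sink: drain the channel, returning how many events were written."""
    delivered = 0
    while True:
        event = channel.take()
        if event is None:
            return delivered
        try:
            write(event)
        except Exception:
            channel.rollback(event)  # keep the event for a later retry
            raise
        delivered += 1


channel = MemoryChannel()
run_source(["first event\n", "second event\n"], channel)
out = []
count = run_sink(channel, out.append)
print(count, out)  # 2 ['first event', 'second event']
```

A channel written to disk, as mentioned on slide 26, would persist the buffer so that events survive a restart; this in-memory version loses them.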