
Page 1

Hadoop Ecosystem Fundamentals

These training presentations were produced as part of the İstanbul Big Data Education and Research Center Project (no. TR10/16/YNY/0036), carried out under the İstanbul Development Agency's 2016 Innovative and Creative İstanbul Financial Support Program. Sole responsibility for the content lies with Bahçeşehir University; it does not reflect the views of İSTKA or the Ministry of Development.

Page 2

How to Scale for Big Data?

• Data volumes are massive

• Storing PBs of data reliably is challenging

• All kinds of failures: disk, hardware, network

• The probability of failure simply increases with the number of machines …

Page 3

Distributed processing is non-trivial

• How to assign tasks to different workers in an efficient way?

• What happens if tasks fail?

• How do workers exchange results?

• How to synchronize distributed tasks allocated to different workers?

Page 4

One popular solution: Hadoop

Hadoop Cluster at Yahoo! (Credit: Yahoo)

Page 5

Hadoop History

• Dec 2004 – Google GFS paper published

• July 2005 – Nutch uses MapReduce

• Feb 2006 – Becomes Lucene subproject

• Apr 2007 – Yahoo! on 1000-node cluster

• Jan 2008 – An Apache Top Level Project

• Jul 2008 – A 4000 node test cluster

• Sept 2008 – Hive becomes a Hadoop subproject

• Feb 2009 – The Yahoo! Search Webmap is a Hadoop application that runs on a Linux cluster of more than 10,000 cores and produces data that is now used in every Yahoo! Web search query.

• June 2009 – On June 10, 2009, Yahoo! made available the source code to the version of Hadoop it runs in production.

• In 2010 Facebook claimed to have the largest Hadoop cluster in the world, with 21 PB of storage. On July 27, 2011 they announced that the data had grown to 30 PB.

Page 6

• Hadoop is an open-source implementation based on Google's File System (GFS) and MapReduce

• Hadoop was created by Doug Cutting and Mike Cafarella in 2005

• Hadoop was donated to Apache in 2006

Page 7

Hadoop offers

• Redundant, Fault-tolerant data storage

• Parallel computation framework

• Job coordination

Programmers no longer need to worry about:

Q: Where is the file located?

Q: How to handle failures & data loss?

Q: How to divide computation?

Q: How to program for scaling?

Page 8

Typical Hadoop Cluster

[Figure: cluster topology – the nodes in each rack connect to a rack switch; rack switches connect to an aggregation switch]

• 40 nodes/rack, 1000-4000 nodes in cluster

• 1 Gbps bandwidth within a rack, 8 Gbps out of a rack

• Node specs (Yahoo terasort):

8 x 2.0 GHz cores, 8 GB RAM, 4 disks (= 4 TB?)
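A quick back-of-the-envelope check on these network numbers (a sketch: the node count and bandwidth figures come from the slide, interpreted as Gbps; the oversubscription ratio is derived from them):

```java
// Back-of-the-envelope: network oversubscription in a classic Hadoop rack.
public class RackMath {
    // Total bandwidth the rack's nodes could demand at once.
    static double aggregateDemandGbps(int nodesPerRack, double perNodeGbps) {
        return nodesPerRack * perNodeGbps;
    }

    // How badly the rack uplink is oversubscribed relative to that demand.
    static double oversubscription(double demandGbps, double uplinkGbps) {
        return demandGbps / uplinkGbps;
    }

    public static void main(String[] args) {
        double demand = aggregateDemandGbps(40, 1.0); // 40 nodes x 1 Gbps
        double ratio = oversubscription(demand, 8.0); // vs an 8 Gbps uplink
        System.out.println(demand + " Gbps demand, " + ratio + ":1 oversubscribed");
    }
}
```

With these figures the rack uplink is oversubscribed 5:1, which is why Hadoop's scheduler prefers to run a task on the node holding its data, or at least in the same rack.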

Page 9

A real world example of New York Times

• Goal: Make the entire archive of articles available online: 11 million articles, dating back to 1851

• Task: Convert 4 TB of TIFF images into PDF files

• Solution: Used Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)

• Time: ?

• Costs: ?

Page 10

A real world example of New York Times

• Goal: Make the entire archive of articles available online: 11 million articles, dating back to 1851

• Task: Convert 4 TB of TIFF images into PDF files

• Solution: Used Amazon Elastic Compute Cloud (EC2) and Simple Storage Service (S3)

• Time: < 24 hours

• Costs: $240

Page 11

The Basic Hadoop Components

• Hadoop Common - libraries and utilities

• Hadoop Distributed File System (HDFS) – a distributed file system

• MapReduce – a programming model for large scale data processing

• Hadoop YARN – a resource-management and job-scheduling platform

Page 12

Hadoop Stack

[Figure: the Hadoop stack – a computation layer on top of a storage layer]

Page 13

Applications and Frameworks

• HBase – a scalable, distributed database with support for large tables. A column-oriented database management system and key-value store, based on Google Bigtable

• Hive – a data warehouse infrastructure that provides data summarization and ad hoc querying of data using a SQL-like language called HiveQL

• Pig – a high-level data-flow language and execution framework for parallel computation; its language is called Pig Latin

• Spark – a fast and general compute engine for Hadoop data, with a wide range of applications: ETL, machine learning, stream processing, and graph analytics

Page 14

MapReduce Framework

• Typically compute and storage nodes are the same.

• MapReduce tasks and HDFS run on the same nodes

• Tasks can therefore be scheduled on nodes where the data is already present (data locality).

Page 15

Original MapReduce Framework

• Single master JobTracker

• JobTracker schedules, monitors, and re-executes failed tasks.

• One slave TaskTracker per cluster node

• TaskTracker executes tasks at the JobTracker's request.

Page 16

Original HDFS Design

• Single NameNode - a master server that manages the file system namespace and regulates access to files by clients.

• Multiple DataNodes – typically one per node in the cluster. Functions:

– Manage storage

– Serve read/write requests from clients

– Block creation, deletion, and replication based on instructions from the NameNode

Page 17

Hadoop 1.0 and Hadoop 2.0 (YARN)

Page 18

What's wrong with the original MapReduce?

• Does not scale beyond ~4,000 nodes

• No high availability (HA) for the JobTracker

• Poor resource utilization

Page 19

YARN (Yet Another Resource Negotiator)

Main idea – split the JobTracker's resource management and job scheduling/monitoring into separate daemons:

• Global ResourceManager (RM) – a highly available scheduler

• ApplicationMaster (AM) – one per application; a framework-specific entity tasked with negotiating resources from the RM

Page 20

Page 21

Resource Manager

The ResourceManager has a pure scheduler:

• responsible for allocating resources to running applications

• does no monitoring or tracking of application status; it relies on heartbeats from the NodeManagers

• offers no guarantees on restarting tasks that fail due to application or hardware failures

• runs on the master node

The scheduler performs its scheduling by using a resource container.

Page 22

NodeManager and ApplicationMaster

The NodeManager is a generalized TaskTracker:

• one per machine; runs on slave nodes

• launches containers, and kills them

• monitors their resource usage (CPU, memory, disk, network)

• reports the same to the ResourceManager

The ApplicationMaster:

• one per application

• negotiates resources with the RM

Page 23

YARN Architecture

Page 24

Resources model

An application can request resources via the ApplicationMaster with highly specific requirements such as:

• Resource-name (including hostname, rackname and possibly complex network topologies)

• Amount of memory

• CPUs (number/type of cores)

• Eventually resources like disk/network I/O, GPUs, etc.

The Scheduler responds to a resource request by granting a container that satisfies the requirements.
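The request-and-grant flow above can be sketched as a toy model. This is NOT the real YARN API — every class and method name below is made up purely for illustration — but it shows the idea: a request names a preferred host and a resource shape, and the scheduler grants a container wherever that shape fits.

```java
import java.util.*;

// Toy model of a YARN-style resource request and container grant.
// Illustrative only: not the real YARN classes.
public class ToyScheduler {
    public record Resource(int memoryMb, int vcores) {}
    public record Request(String host, Resource resource) {}
    public record Container(String host, Resource resource) {}

    // Free capacity per node, keyed by hostname.
    private final Map<String, Resource> free = new HashMap<>();

    public ToyScheduler(Map<String, Resource> capacity) {
        free.putAll(capacity);
    }

    // Grant a container on the preferred host if the request fits there,
    // otherwise on any host that fits (locality is only a preference).
    public Optional<Container> allocate(Request req) {
        List<String> candidates = new ArrayList<>();
        if (free.containsKey(req.host())) candidates.add(req.host());
        candidates.addAll(free.keySet());
        for (String h : candidates) {
            Resource avail = free.get(h);
            if (avail.memoryMb() >= req.resource().memoryMb()
                    && avail.vcores() >= req.resource().vcores()) {
                free.put(h, new Resource(avail.memoryMb() - req.resource().memoryMb(),
                                         avail.vcores() - req.resource().vcores()));
                return Optional.of(new Container(h, req.resource()));
            }
        }
        return Optional.empty(); // nothing fits: the request stays pending
    }

    public static void main(String[] args) {
        ToyScheduler s = new ToyScheduler(Map.of("node1", new Resource(4096, 4)));
        // Ask for 2048 MB / 2 cores on node1; the grant fits, so a
        // container comes back on that host.
        System.out.println(s.allocate(new Request("node1", new Resource(2048, 2))));
    }
}
```

In real YARN the ApplicationMaster sends such requests to the ResourceManager, and the granted container is then launched via the NodeManager.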

Page 25

Container

The YARN container-launch specification API is platform-agnostic and contains:

• the command line to launch the process within the container

• environment variables

• local resources necessary on the machine prior to launch, such as jars, shared objects, auxiliary data files, etc.

• security-related tokens

This design allows the ApplicationMaster to work with the NodeManager to launch containers ranging from simple shell scripts to C/Java/Python processes on Unix/Windows to full-fledged virtual machines.

Page 26

Page 27

Basic Hadoop API*

Mapper

void setup(Mapper.Context context)

Called once at the beginning of the task

void map(K key, V value, Mapper.Context context)

Called once for each key/value pair in the input split

void cleanup(Mapper.Context context)

Called once at the end of the task

Reducer/Combiner

void setup(Reducer.Context context)

Called once at the start of the task

void reduce(K key, Iterable<V> values, Reducer.Context context)

Called once for each key

void cleanup(Reducer.Context context)

Called once at the end of the task

*Note that there are two versions of the API!
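The lifecycle above can be sketched with a plain-Java word count that mimics the map → shuffle → reduce flow. This is a simulation, not the real org.apache.hadoop classes (all names here are made up), so it runs without a cluster:

```java
import java.util.*;

// Plain-Java simulation of the Mapper/Reducer lifecycle above.
// No Hadoop dependency: the "context" is just the collections we write into.
public class MiniWordCount {

    // map(): called once per input record (here, one line of text).
    static void map(String line, List<Map.Entry<String, Integer>> out) {
        for (String w : line.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) out.add(Map.entry(w, 1));
    }

    // reduce(): called once per key, with every value emitted for that key.
    static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    static Map<String, Integer> run(List<String> lines) {
        // Map phase.
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String line : lines) map(line, mapped);

        // Shuffle phase: group values by key (the framework does this for you).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (var e : mapped)
            grouped.computeIfAbsent(e.getKey(), k -> new ArrayList<>()).add(e.getValue());

        // Reduce phase.
        Map<String, Integer> result = new TreeMap<>();
        for (var e : grouped.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue()));
        return result;
    }

    public static void main(String[] args) {
        // "the" appears twice, every other word once.
        System.out.println(run(List.of("the quick fox", "the lazy dog")));
    }
}
```

In real Hadoop, setup() and cleanup() bracket the per-task loop that calls map() or reduce(), and the shuffle happens across the network between map and reduce tasks.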

Page 28

Basic Hadoop API*

Partitioner

int getPartition(K key, V value, int numPartitions)

Get the partition number given total number of partitions

Job

Represents a packaged Hadoop job for submission to cluster

Need to specify input and output paths

Need to specify input and output formats

Need to specify mapper, reducer, combiner, partitioner classes

Need to specify intermediate/final key/value classes

Need to specify number of reducers (but not mappers, why?)

Don't depend on defaults!

*Note that there are two versions of the API!
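The getPartition contract above is worth seeing concretely. Hadoop's default HashPartitioner does essentially the following: mask off the sign bit of the key's hash, then take it modulo the number of partitions.

```java
// The getPartition contract, as in Hadoop's default HashPartitioner:
// mask the sign bit so the result is non-negative, then mod by the
// number of partitions (= number of reducers).
public class HashPartitionDemo {
    static int getPartition(Object key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // The same key always maps to the same partition, so every
        // record for "hadoop" lands at the same reducer -- this is
        // what lets reduce() see all values for its key.
        System.out.println(getPartition("hadoop", 4));
        System.out.println(getPartition("hadoop", 4)); // same partition
    }
}
```

This is also why a job must specify its number of reducers (it fixes numPartitions), while the number of mappers falls out of the number of input splits.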

Page 29

Basic Cluster Components*

One of each:

Namenode (NN): master node for HDFS

Set of each per slave machine:

Datanode (DN): serves HDFS data blocks

Page 30

NameNode Metadata

Meta-data in Memory

– The entire metadata is in main memory

– No demand paging of meta-data

Types of Metadata

– List of files

– List of Blocks for each file

– List of DataNodes for each block

– File attributes, e.g., creation time, replication factor

A Transaction Log

– Records file creations, file deletions, etc.

Page 31

Namenode Responsibilities

Managing the file system namespace:

Holds file/directory structure, metadata, file-to-block mapping,

access permissions, etc.

Coordinating file operations:

Directs clients to datanodes for reads and writes

No data is moved through the namenode

Maintaining overall health:

Periodic communication with the datanodes

Block re-replication and rebalancing

Garbage collection
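The block re-replication duty can be sketched as a toy check (illustrative only: block and node names are made up, and the real namenode logic is far richer — it also picks target nodes, respects rack placement, and throttles transfers):

```java
import java.util.*;

// Toy sketch of the namenode's under-replication check.
public class ReplicationCheck {
    // blockReplicas: for each block, the set of live datanodes that
    // currently hold a replica (as learned from block reports).
    static List<String> underReplicated(Map<String, Set<String>> blockReplicas,
                                        int replicationFactor) {
        List<String> needCopies = new ArrayList<>();
        for (var e : blockReplicas.entrySet())
            if (e.getValue().size() < replicationFactor)
                needCopies.add(e.getKey());
        return needCopies;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> replicas = Map.of(
            "blk_1", Set.of("dn1", "dn2", "dn3"),
            "blk_2", Set.of("dn2"));   // two of blk_2's datanodes failed
        // With a replication factor of 3, blk_2 must be re-copied.
        System.out.println(underReplicated(replicas, 3));
    }
}
```

When a block turns up under-replicated, the namenode instructs a datanode holding a surviving replica to pipeline the block to new datanodes, as described on the DataNode slide below.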

Page 32

DataNode

A Block Server

– Stores data in the local file system (e.g. ext3)

– Stores meta-data of a block (e.g. CRC)

– Serves data and meta-data to Clients

Block Report

– Periodically sends a report of all existing blocks to the NameNode

Facilitates Pipelining of Data

– Forwards data to other specified DataNodes

Page 33

HDFS Working Flow

Adapted from (Ghemawat et al., SOSP 2003)

[Figure: an Application uses the HDFS client. The client sends (file name, block id) to the HDFS namenode, which holds the file namespace (e.g. /foo/bar → block 3df2) and replies with (block id, block location). The client then asks an HDFS datanode for (block id, byte range) and receives block data; each datanode stores blocks in its local Linux file system. The namenode also sends instructions to datanodes and receives datanode state.]

Page 34

NameNode Failure

A single point of failure

Transaction Log stored in multiple directories

– A directory on the local file system

– A directory on a remote file system (NFS/CIFS)

Need to develop a real HA solution

Page 35

Hadoop Workflow


1. Load data into HDFS

2. Develop code locally

3. Submit MapReduce job

3a. Go back to Step 2

4. Retrieve data from HDFS

Page 36

Recommended Workflow

Example:

Develop code in Eclipse on host machine

Build distribution on host machine

Check out copy of code on VM

Copy (i.e., scp) jars over to VM (in same directory structure)

Run job on VM

Iterate

Commit code on host machine and push

Pull from inside VM, verify

Avoid using the UI of the VM

Directly ssh into the VM

Page 37

Hadoop Platforms

• Platforms: Unix and Windows.

– Linux: the only supported production platform.

– Other variants of Unix, like Mac OS X: run Hadoop for development.

– Windows + Cygwin: development platform (openssh)

• Java 8

– Java 1.8.x (aka 8.0.x aka 8) is recommended for running Hadoop.

Page 38

Hadoop Installation

• Download a stable version of Hadoop: – http://hadoop.apache.org/core/releases.html

• Untar the hadoop file: – tar xvfz hadoop-0.20.2.tar.gz

• Set JAVA_HOME in hadoop/conf/hadoop-env.sh:

– Mac OS: /System/Library/Frameworks/JavaVM.framework/Versions/1.6.0/Home (or /Library/Java/Home)

– Linux: which java

• Environment variables:

– export PATH=$PATH:$HADOOP_HOME/bin

Page 39

Hadoop Modes

• Standalone (or local) mode – there are no daemons running and everything runs in a single JVM. Standalone mode is suitable for running MapReduce programs during development, since it is easy to test and debug them.

• Pseudo-distributed mode – the Hadoop daemons run on the local machine, thus simulating a cluster on a small scale.

• Fully distributed mode – the Hadoop daemons run on a cluster of machines.

Page 40

Pseudo Distributed Mode

• Create an RSA key to be used by hadoop when ssh'ing to localhost:

– ssh-keygen -t rsa -P ""

– cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys

– ssh localhost

• Configuration files:

– core-site.xml

– mapred-site.xml

– yarn-site.xml

– hdfs-site.xml

Page 41

<?xml version="1.0"?>

<!-- core-site.xml -->

<configuration>

<property>

<name>fs.defaultFS</name>

<value>hdfs://localhost/</value>

</property>

</configuration>

<?xml version="1.0"?>

<!-- mapred-site.xml -->

<configuration>

<property>

<name>mapreduce.framework.name</name>

<value>yarn</value>

</property>

</configuration>

<?xml version="1.0"?>

<!-- yarn-site.xml -->

<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>localhost</value>
  </property>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>

<?xml version="1.0"?>

<!-- hdfs-site.xml -->

<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
</configuration>

Page 42

Start Hadoop

• hdfs namenode -format

• bin/start-all.sh (or start-dfs.sh and start-yarn.sh)

• jps

• bin/stop-all.sh

• Web-based UI

– http://localhost:50070 (Namenode report)

Page 43

Basic File Commands in HDFS

• hdfs dfs -cmd <args> (the older form was hadoop dfs)

• URI: //authority/path

– authority, e.g. hdfs://localhost:9000

• Adding files

– hdfs dfs -mkdir

– hdfs dfs -put

• Retrieving files

– hdfs dfs -get

• Deleting files

– hdfs dfs -rm

• hdfs dfs -help ls

Page 44

Run WordCount

• Create an input directory in HDFS

• Run the wordcount example

– hadoop jar hadoop-examples-2.7.4.jar wordcount /user/hduser/input /user/hduser/output

• Check output directory

– hdfs dfs -ls -R /user/hduser/output

– http://localhost:50070

Page 45

Installation Tutorials

• http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/

• http://hadoop.apache.org/common/docs/r0.20.2/quickstart.html

• http://oreilly.com/other-programming/excerpts/hadoop-tdg/installing-apache-hadoop.html

• http://snap.stanford.edu/class/cs246-2011/hw_files/hadoop_install.pdf