Hadoop Ecosystem

Lior Sidi

Sep 2016

Hello! I am Lior Sidi

Big data V’s

• Volume
• Velocity
• Variety

What is Hadoop?

• Hadoop – open source implementation of MapReduce (MR)
• Performs MR jobs quickly and efficiently

Goal: generating value from large datasets that cannot be analyzed using traditional technologies

Hadoop Concepts

Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model

Challenges and solutions
• Ch1: Data access bottleneck
• Sol: Store and process data on the same node
• Ch2: Distributed programming is difficult
• Sol: Use high-level language APIs

Hadoop Timeline

• 2003 Oct – Google File System paper released
• 2004 Dec – "MapReduce: Simplified Data Processing on Large Clusters" paper released
• 2006 Oct – Hadoop 1.0 released
• 2007 Oct – Yahoo Labs creates Pig
• 2008 Oct – Cloudera, a Hadoop distributor, is founded
• 2010 Sep – Hive and Pig graduate
• 2011 Jan – ZooKeeper graduates
• 2013 Mar – YARN deployed at Yahoo
• 2014 Feb – Apache Spark becomes a top-level Apache project

Hadoop Ecosystem (diagram of layers): Data Ingestion, Data Streaming, Analysis, Processing, Workflow, Coordination, Resource Management, Storage, Data Formats, Search, Visualization, Cluster Management

Storage / HDFS

• “Hadoop Distributed File System”
• Design:
  • Write once, read many times pattern
  • Cheap (commodity) hardware
  • High-throughput data access (HDFS is not designed for low-latency access)
• Concepts:
  • Block – a file is split into 128 MB blocks (default); each block is replicated 3 times (default)
  • NameNode (master) – one per cluster – maintains the filesystem namespace and block locations (single point of failure)
  • DataNode (worker) – one per node – stores and retrieves blocks
• Functions:
  • High availability – run a second (standby) NameNode
  • Block caching – a block is normally cached in only one DataNode
  • Locality – rack-aware, follows the network topology
  • File permissions – POSIX-like (r, w, x) with owner, group, and mode for files and directories
  • Interfaces – HTTP (proxy/direct), Java API
  • Cluster balance – spread blocks evenly across the cluster
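The block and replication concepts above can be sketched in a few lines of Python. This is a toy model, not HDFS code: the rack and DataNode names are hypothetical, and the placement rule (one replica on one rack, the remaining two on another rack) is a simplified assumption modeled on rack-aware placement.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB block size, as quoted above
REPLICATION = 3                 # replication factor, as quoted above


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the byte length of each block for a file of `file_size` bytes."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])


def place_replicas(block_index, racks, replication=REPLICATION):
    """Toy rack-aware placement: one replica on a 'local' rack, the rest on another.

    `racks` maps a rack name to its DataNode names (hypothetical names).
    """
    rack_names = sorted(racks)
    local = rack_names[block_index % len(rack_names)]
    remote = rack_names[(block_index + 1) % len(rack_names)]
    first = racks[local][block_index % len(racks[local])]
    others = racks[remote][: replication - 1]
    return [first] + others


racks = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
blocks = split_into_blocks(300 * 1024 * 1024)  # a 300 MB file
placement = {i: place_replicas(i, racks) for i in range(len(blocks))}
print(len(blocks), placement[0])  # 3 ['dn1', 'dn3', 'dn4']
```

A 300 MB file yields two full 128 MB blocks plus one 44 MB block, and every block ends up on three distinct DataNodes spanning two racks.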

Figure: how a file is distributed and accessed on Hadoop HDFS. The data is split into Block 1, Block 2, and Block 3; each block is stored on several DataNodes across Rack 1 and Rack 2; the client reaches the data through the NameNode or an HDFS proxy.


Resource Management / YARN

• “Yet Another Resource Negotiator”
• Manages and schedules cluster resources
• Daemons:
  • Resource Manager – one per cluster – manages resources across the cluster
  • Node Manager – one per node – launches and monitors containers
• Container – executes an application process
• Resource requests for containers:
  • Amount of compute resources (CPU and memory)
  • Locality (node/rack)
• Lifespan: one application per user job, or long-running applications shared by users
• Scheduling:
  • Allocates resources by policy: FIFO, Capacity (per organization), or Fair
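To make the resource-request idea concrete, here is a minimal sketch of a FIFO policy that honors a locality hint. It is an illustration only, not the YARN API: the `Node` and `Request` classes, the node names, and the sizes are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    memory_mb: int
    vcores: int


@dataclass
class Request:
    app: str
    memory_mb: int
    vcores: int
    preferred_node: str = ""  # locality hint; empty means "anywhere"


def fits(node, req):
    return node.memory_mb >= req.memory_mb and node.vcores >= req.vcores


def schedule_fifo(nodes, requests):
    """Serve requests in arrival order; return (app, node name or None) pairs."""
    allocations = []
    for req in requests:
        # Try the preferred node first, then any other node with spare capacity.
        candidates = sorted(nodes, key=lambda n: n.name != req.preferred_node)
        for node in candidates:
            if fits(node, req):
                node.memory_mb -= req.memory_mb
                node.vcores -= req.vcores
                allocations.append((req.app, node.name))
                break
        else:
            allocations.append((req.app, None))  # no capacity: request stays pending
    return allocations


nodes = [Node("node1", 4096, 4), Node("node2", 2048, 2)]
requests = [
    Request("app-a", 2048, 2, preferred_node="node2"),
    Request("app-b", 2048, 2, preferred_node="node2"),
    Request("app-c", 4096, 2),
]
result = schedule_fifo(nodes, requests)
print(result)  # [('app-a', 'node2'), ('app-b', 'node1'), ('app-c', None)]
```

The second request asks for node2 but falls back to node1 because node2 is already full, which is the kind of trade-off between locality and availability a scheduler has to make.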

Figure: job scheduling on top of a Hadoop cluster. A client on the client node launches a YARN application through the ResourceManager; NodeManagers launch the containers (one master container and worker containers) and send heartbeats to the ResourceManager.


Processing / MapReduce

• Simplifies the development of large-scale, automatically parallelized, fault-tolerant data processing
• Origin – Google paper, 2004
• Batch processing
• Hadoop MR (classic MapReduce 1):
  • JobTracker – one per cluster – master process; schedules tasks on workers and monitors progress
  • TaskTracker – one per worker – executes map/reduce tasks locally and reports progress

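Processing / MapReduce: word count example

Data (three input splits):
• Split 1: Lior Ron Lior
• Split 2: Ron Ron Andrey
• Split 3: Lior Andrey Lior

Map (one "count, name" pair per word):
• Split 1: (1, Lior), (1, Ron), (1, Lior)
• Split 2: (1, Ron), (1, Ron), (1, Andrey)
• Split 3: (1, Lior), (1, Andrey), (1, Lior)

Shuffle & Sort: pairs are grouped by name

Reduce (sum the counts per name):
• 4 Lior
• 3 Ron
• 2 Andrey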
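The name-counting example can be reproduced with a short single-process sketch of the three phases. It mimics the data flow only; a real job would run the map and reduce functions as distributed tasks, and the function names here are made up for illustration.

```python
from collections import defaultdict

# The three input splits of the example
splits = ["Lior Ron Lior", "Ron Ron Andrey", "Lior Andrey Lior"]


def map_phase(split):
    """Emit one (name, 1) pair per word in the split."""
    return [(name, 1) for name in split.split()]


def shuffle(mapped):
    """Group the emitted counts by name and sort the keys."""
    groups = defaultdict(list)
    for pairs in mapped:
        for name, count in pairs:
            groups[name].append(count)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Sum the counts collected for each name."""
    return {name: sum(counts) for name, counts in groups.items()}


mapped = [map_phase(s) for s in splits]
counts = reduce_phase(shuffle(mapped))
print(counts)  # {'Andrey': 2, 'Lior': 4, 'Ron': 3}
```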

Figure: MR job scheduling on top of a Hadoop cluster. A client on the client node launches the MR program as a YARN application through the ResourceManager; NodeManagers launch one container for the JobTracker and containers for the TaskTrackers, and send heartbeats to the ResourceManager.


Storage / HBase

• Distributed column-oriented database on top of HDFS
• Real-time random read/write access to large datasets
• Region – a table is split into regions by row
• Phoenix – SQL on HBase

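Figure: HBase data model. Each row is identified by a RowKey and holds column families (Column Family 1, Column Family 2); a column family holds columns (Col 1.1, Col 1.2, Col 1.3); each column stores its data per version (Version → Data).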
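The data model (row key, column family, column, version, data) maps naturally onto nested dictionaries. The sketch below is an in-memory stand-in written for illustration; `ToyTable` and its methods are hypothetical and are not the HBase client API.

```python
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for one HBase table (not the HBase client API)."""

    def __init__(self, families):
        self.families = set(families)
        # row key -> "family:qualifier" -> {version: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, version):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"][version] = value

    def get(self, row_key, family, qualifier, version=None):
        """Return the requested version, or the newest one when version is None."""
        versions = self.rows[row_key][f"{family}:{qualifier}"]
        if not versions:
            return None
        return versions[max(versions) if version is None else version]


table = ToyTable(["cf1", "cf2"])
table.put("row1", "cf1", "col1", "old value", version=1)
table.put("row1", "cf1", "col1", "new value", version=2)
latest = table.get("row1", "cf1", "col1")
first = table.get("row1", "cf1", "col1", version=1)
print(latest, "|", first)  # new value | old value
```

Keeping every version under the same cell is what lets a reader ask either for the latest value or for the value as of an earlier version.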


Coordination / ZooKeeper

• Hadoop’s distributed coordination service
• Coordinates read/write actions on data
• Highly available, filesystem-like namespace
• Implementation:
  • Data model:
    • Tree built from znodes (up to 1 MB of data each)
    • Znode – holds data, tracks changes, and has an ACL (access control list)
  • Leader – performs writes and broadcasts the update
  • Follower – forwards write requests to the leader
• Lock service
• User groups
• Replicated mode
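One way to picture the znode tree and the lock service is a single-process toy in which each client creates a sequentially numbered znode under a lock directory and the lowest number holds the lock. Everything here is an assumption made for illustration: `ToyZooKeeper` is not the ZooKeeper client API, the paths are hypothetical, and a real lock recipe would also rely on ephemeral znodes and watches.

```python
class ToyZooKeeper:
    """Single-process stand-in for a znode tree (not the ZooKeeper client API)."""

    MAX_DATA = 1024 * 1024  # 1 MB limit per znode, as noted above

    def __init__(self):
        self.znodes = {"/": b""}
        self.counter = 0

    def create(self, path, data=b"", sequential=False):
        if len(data) > self.MAX_DATA:
            raise ValueError("znode data exceeds 1 MB")
        parent = path.rsplit("/", 1)[0] or "/"
        if parent not in self.znodes:
            raise KeyError(f"missing parent znode: {parent}")
        if sequential:
            # Append a zero-padded, monotonically increasing sequence number.
            path = f"{path}{self.counter:010d}"
            self.counter += 1
        self.znodes[path] = data
        return path

    def children(self, path):
        prefix = path.rstrip("/") + "/"
        return sorted(
            p for p in self.znodes
            if p != path and p.startswith(prefix) and "/" not in p[len(prefix):]
        )

    def delete(self, path):
        del self.znodes[path]


def holds_lock(zk, lock_dir, my_znode):
    """The client whose sequential znode sorts first holds the lock."""
    return zk.children(lock_dir)[0] == my_znode


zk = ToyZooKeeper()
zk.create("/hbase")
zk.create("/hbase/lock")
a = zk.create("/hbase/lock/client-", sequential=True)
b = zk.create("/hbase/lock/client-", sequential=True)
before = (holds_lock(zk, "/hbase/lock", a), holds_lock(zk, "/hbase/lock", b))
zk.delete(a)  # client A releases the lock
after = holds_lock(zk, "/hbase/lock", b)
print(before, after)  # (True, False) True
```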

Figure: ZooKeeper coordination example. The ZooKeeper service consists of one leader and two followers; each server holds the same znode tree (/ with /HBase and /HDFS, including LOCK znodes). Its clients in the Hadoop cluster are HDFS (NameNode, DataNodes), HBase (HMaster, Regions), and other clients.


Row Based / Avro

• Language-neutral data serialization system
• Shares data across many programming languages
• Splittable and sortable – works well with MapReduce
• Rich schema resolution – flexible schema
• Other row-based formats:
  • SequenceFile – logfile format
  • MapFile – sorted SequenceFile

Avro file structure:
• Header – identifier, metadata (schema & codec), sync marker
• Block 1 … Block N – object count, size of the serialized objects, serialized objects, sync marker

Schema:

{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "firstName", "type": "string", "order": "descending" },
    { "name": "age", "type": "int" },
    {...
  ]
}

Parquet

• Columnar storage format
• Skips unneeded columns
• Fast queries & small size
• Efficient storage of nested data

File structure:
• Header – magic number
• Block 1 … Block N – each block holds column chunks; each column chunk holds pages
• Footer – file metadata

Schema:

message Person {
  required binary name (UTF8);
  required int32 age;
  required group hobbies (LIST) {
    required binary array (UTF8);
  }
}
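The benefit of a columnar layout, skipping unneeded columns, can be shown without any Parquet library. The sketch below pivots a few hypothetical Person records (the names and ages are made-up sample data) into one list per column and then reads a single column; it illustrates the idea only and says nothing about Parquet's actual encoding.

```python
rows = [
    {"name": "Lior", "age": 30, "hobbies": ["chess"]},
    {"name": "Ron", "age": 25, "hobbies": ["running", "music"]},
]


def to_columns(records):
    """Pivot row-oriented records into one list per column (a toy 'column chunk')."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


def read_columns(columns, wanted):
    """Read only the requested columns; the others are never touched."""
    return {name: columns[name] for name in wanted}


columns = to_columns(rows)
ages = read_columns(columns, ["age"])
average_age = sum(ages["age"]) / len(ages["age"])
print(ages, average_age)  # {'age': [30, 25]} 27.5
```

A query such as the average age only needs the `age` list, so the `name` and `hobbies` values never have to be read or decoded.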


Data Integration / Sqoop

• Imports and exports structured data
• Sqoop connector: import/export from a database
• Sqoop 1 – command line
• Sqoop 2 – service
• Connectors – connect to relational databases

Figure: Sqoop import and export. The Sqoop client reads the table metadata from the database and launches an import MapReduce job or an export MapReduce job on the Hadoop cluster; the map tasks move data between the database table and HDFS.

Data Integration / Flume

• Event-based data ingestion into Hadoop
• Flume agent components:
  • Sources – spoolingDir (creates events), Avro (RPC), HTTP (requests)
  • Channel
  • Sink – Avro, HDFS, HBase, Solr (near real time)
• Reliability – uses separate transactions
• Fan out – one source, many sinks
• Scale – agent tiers aggregate multiple sources
• Sink grouping – failover and load balancing
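A rough sketch of the source, channel, and sink roles, including fan out and the transactional hand-off, might look like this. The `Channel` class and `fan_out` function are hypothetical teaching code, not Flume's API, and the two sinks are plain Python callables.

```python
from collections import deque


class Channel:
    """Buffer between a source and a sink; events leave only on a committed take."""

    def __init__(self):
        self.events = deque()

    def put(self, event):
        self.events.append(event)

    def take(self, deliver):
        """Deliver the oldest event; keep it in the channel if delivery fails."""
        if not self.events:
            return False
        event = self.events[0]
        try:
            deliver(event)
        except Exception:
            return False  # transaction rolled back: the event stays queued
        self.events.popleft()  # transaction committed
        return True


def fan_out(event, channels):
    """Replicate one source event into every channel (one source, many sinks)."""
    for channel in channels:
        channel.put(event)


def broken_sink(event):
    raise IOError("sink unavailable")


hdfs_channel, hbase_channel = Channel(), Channel()
hdfs_sink = []  # stands in for an HDFS sink
fan_out("event-1", [hdfs_channel, hbase_channel])
delivered = hdfs_channel.take(hdfs_sink.append)
failed = hbase_channel.take(broken_sink)
print(delivered, failed, len(hbase_channel.events))  # True False 1
```

Because the event is removed only after a successful delivery, a failing sink does not lose data; the event simply waits in its channel for the next attempt.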

Figure: Flume topologies.
• Agent – data flows from the source through a channel to a sink inside one Flume agent
• Fan out – one source feeds several sinks (for example HDFS and HBase)
• Scale – Tier 1 agents feed Tier 2 agents, which feed Tier 3 agents, which write to the file system
• Sink grouping – several sinks are grouped in front of the file system

Data Integration / Kafka

• Distributed publish-subscribe messaging system
• Fast, scalable, durable
• Components:
  • Topics – categories of message feeds
  • Producers – processes that publish messages to a topic
  • Consumers – processes that subscribe to a topic
  • Broker – Kafka servers in the cluster
• Distribution:
  • Leader – handles reads and writes
  • Follower – replicates the leader
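The publish-subscribe model can be sketched as an append-only log per topic plus one read offset per consumer. `ToyBroker` below is a hypothetical in-memory stand-in, not the Kafka client API, and it ignores partitions and replication entirely.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a broker (not the Kafka client API)."""

    def __init__(self):
        self.topics = defaultdict(list)  # topic -> append-only message log
        self.offsets = defaultdict(int)  # (consumer, topic) -> next offset to read

    def publish(self, topic, message):
        """Append a message to the topic and return its offset."""
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1

    def poll(self, consumer, topic):
        """Return messages the consumer has not seen yet and advance its offset."""
        log = self.topics[topic]
        start = self.offsets[(consumer, topic)]
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("topic-a", "m1")
broker.publish("topic-a", "m2")
first_poll = broker.poll("consumer-1", "topic-a")
second_poll = broker.poll("consumer-1", "topic-a")
broker.publish("topic-a", "m3")
third_poll = broker.poll("consumer-1", "topic-a")
late_consumer = broker.poll("consumer-2", "topic-a")
print(first_poll, second_poll, third_poll, late_consumer)
```

Messages stay in the log after being read, so a consumer that subscribes later can still receive everything from the beginning.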


Data Integration / Streaming

• Stream processing
• Kafka Streams – processes and analyzes data in Kafka
• Storm – real-time computation
• Spark Streaming – processes live data and can apply Spark MLlib and GraphX

Figure: streaming pipeline with Data, Flume Agent 1, Flume Agent 2, Kafka (Topic A, Topic B), Spark Streaming, Storm, and HDFS; two numbered paths (1 and 2) run through it.


Computation / Spark

• Cluster computing framework
• In-memory processing
• Languages: Scala, Java, and Python
• RDD – Resilient Distributed Dataset
  • Read-only collection spread across the cluster
  • Transformations are lazy: they are computed only when an action is called
• DAG engine – schedules many transformations as one optimized job
• Spark context
  • Parallel jobs
  • Caching
  • Broadcast variables (data/functions)
• Cluster manager for executors:
  • Local, Standalone, Mesos, YARN
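The point that transformations on an RDD are only computed when an action runs can be demonstrated with a small pure-Python stand-in. `ToyRDD` is hypothetical teaching code, not the Spark API: it records transformations as lineage and executes them in one pass when `collect` is called.

```python
class ToyRDD:
    """Lazy, read-only collection (a teaching stand-in, not the Spark API)."""

    def __init__(self, data, lineage=()):
        self._data = tuple(data)
        self._lineage = lineage  # recorded transformations, not yet executed

    def map(self, fn):
        return ToyRDD(self._data, self._lineage + (("map", fn),))

    def filter(self, fn):
        return ToyRDD(self._data, self._lineage + (("filter", fn),))

    def collect(self):
        """Action: run every recorded transformation in one pass over the data."""
        out = []
        for item in self._data:
            keep = True
            for kind, fn in self._lineage:
                if kind == "map":
                    item = fn(item)
                elif not fn(item):
                    keep = False
                    break
            if keep:
                out.append(item)
        return out


calls = []


def square(x):
    calls.append(x)  # record when the function actually runs
    return x * x


rdd = ToyRDD(range(5)).map(square).filter(lambda x: x % 2 == 0)
calls_before_action = list(calls)  # still empty: nothing has run yet
result = rdd.collect()
print(calls_before_action, result)  # [] [0, 4, 16]
```

Deferring the work until the action is what gives an engine the chance to combine several transformations into one job.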

Figure: Spark on Hadoop. The driver runs the Spark program and its SparkContext, which contains the DAG scheduler, the task scheduler, and the scheduler backend. A job is broken into stages and then into tasks, and the tasks run on the executors.

Scripting / Pig

• Data flow programming language – an abstraction over MapReduce
• Supports: user-defined functions (UDFs), streaming, nested data
• Does not support: random read/write
• Pig Latin – scripting language
  • Load, store, filter, group, join, sort, union and split, UDF, co-group
• Modes:
  • Local – small datasets
  • MR mode – runs on the cluster
• Execution – script, Grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar tools:
  • Crunch – MR pipelines in Java (no UDFs)

Query / Hive

• Components
  • MetaStore – table descriptions
  • HiveQL – SQL dialect (SQL:2003)
• Table management
  • Warehouse directory
  • External tables
• Functionality
  • Bucketing and partitioning by column
  • Supports UDFs and UDAFs (aggregate functions)
  • Insert, update, delete:
    • Saved in delta files
    • Background MR jobs
    • (Available in a transaction context)
  • Table locks (avoid drop)
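Partitioning and bucketing by column can be pictured as two small functions: a partition value becomes a subdirectory, and a bucket is chosen by hashing a column value modulo the bucket count. The sketch is illustrative only; the warehouse path, table, and column names are hypothetical, and CRC32 is used as a stable stand-in because Hive applies its own hash function.

```python
import zlib


def partition_path(warehouse, table, partition_column, value):
    """Each partition value maps to its own subdirectory of the table directory."""
    return f"{warehouse}/{table}/{partition_column}={value}"


def bucket_for(value, num_buckets):
    """Stable hash modulo bucket count (CRC32 here; Hive uses its own hash function)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


path = partition_path("/user/hive/warehouse", "logs", "dt", "2016-09-01")
buckets = [bucket_for(user_id, 4) for user_id in ("u1", "u2", "u1")]
print(path)  # /user/hive/warehouse/logs/dt=2016-09-01
```

The same value always lands in the same bucket, which is what makes bucketed joins and sampling possible.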

Query / Comparison

• Spark SQL (Shark) – usage: procedural development; speed: OK; runs in memory
• Impala – usage: BI & SQL analytics; speed: best; dedicated daemons on the DataNodes; similar tools: Presto, Drill (SQL:2011)
• Hive – usage: batch; speed: poor; MapReduce implementation; similar tools: Hive on Spark


Workflow / Oozie

• Schedules Hadoop jobs
• Job types:
  • Workflows – sequences of jobs expressed as directed acyclic graphs (DAGs)
  • Coordinator – triggers jobs by time or data availability
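A workflow expressed as a DAG can be ordered with a topological sort. The sketch below uses Python's standard `graphlib` module (Python 3.9+) and a hypothetical workflow with a fork and a join; the action names and edges are assumptions for illustration, and this is not how Oozie itself is configured.

```python
from graphlib import TopologicalSorter

# node -> set of nodes that must finish first (edges are an assumption)
workflow = {
    "sqoop": {"start"},
    "fork": {"sqoop"},
    "pig": {"fork"},
    "mr": {"fork"},
    "join": {"pig", "mr"},
    "end": {"join"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Any valid order starts with `start`, runs `pig` and `mr` somewhere between `fork` and `join`, and finishes with `end`; the two forked actions have no dependency on each other, so they could run in parallel.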

Figure: example Oozie workflow. Control-flow nodes: start, fork, join, end. Action nodes: Sqoop, Pig, MR, sub-workflow, FS (HDFS), email.


Search / Solr

• Full-text search over Hadoop
• Near-real-time indexing
• REST API
• Based on the Apache Lucene Java search library


Visualization / Hue

• Open source web interface for analyzing data with any Hadoop distribution
• Applications:
  • File browser: HDFS, HBase
  • Scheduling of jobs and workflows: Oozie
  • Job browser: YARN
  • SQL: Hive, Impala
  • Data analysis: Pig, UDF
  • Dynamic search: Solr
  • Notebooks: Spark
  • Data transfer: Sqoop 2


Cluster Management / Cloudera

• 100% open source
• The most complete and tested distribution of Hadoop
• Integrates all Hadoop projects
• Express – free, end-to-end administration
• Enterprise – extra features and support

Cluster Management / Comparison


https://talendexpert.com/cloudera-vs-honworks-vs-mapr



Figure: basic cluster configuration.
• Master nodes (three) – NameNode, Secondary NameNode, ResourceManager, Standby ResourceManager, ZooKeeper
• Other servers – Cloudera Manager, Hive GW, Sqoop GW, Spark GW, Impala State
• Worker nodes – NodeManager, DataNode, Impala Daemon on each worker


Thanks! Any questions?
