Hadoop Ecosystem Lior Sidi Sep 2016


Page 1: Hadoop Ecosystem

Hadoop Ecosystem
Lior Sidi

Sep 2016

Page 2: Hadoop Ecosystem

Hello! I am Lior Sidi

Page 3: Hadoop Ecosystem

Big data V’s

Volume

Velocity

Variety

Page 4: Hadoop Ecosystem

What is Hadoop?

• Hadoop – open source implementation of MapReduce (MR)
• Performs MR jobs quickly and efficiently

Goal: generating value from large datasets that cannot be analyzed using traditional technologies

Page 5: Hadoop Ecosystem

Hadoop Concepts

Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model

Challenges and solutions
• Ch1: Data access bottleneck
• Sol: Store and process data on the same node

• Ch2: Distributed programming is difficult
• Sol: Use high-level language APIs

Page 6: Hadoop Ecosystem

Hadoop Timeline

2003 Oct – Google File System paper released

2004 Dec – "MapReduce: Simplified Data Processing on Large Clusters" paper released

2006 Oct – Early (0.x) Hadoop releases; Hadoop 1.0 followed in Dec 2011

2007 Oct – Yahoo Labs creates Pig

2008 Oct – Cloudera, a Hadoop distributor, is founded

2010 Sep – Hive and Pig graduate to top-level Apache projects

2011 Jan – ZooKeeper graduates

2013 Mar – YARN deployed at Yahoo

2014 Feb – Apache Spark becomes a top-level Apache project

Page 7: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis, Search, Visualization, Cluster management]

Page 8: Hadoop Ecosystem

Storage

Hadoop Ecosystem

Page 9: Hadoop Ecosystem

Storage / HDFS

• “Hadoop Distributed File System”
• Design:
  • Write once, read many times pattern
  • Cheap (commodity) hardware
  • High-throughput streaming data access (not low-latency access)
• Concepts:
  • Block – a file is split into 128 MB blocks (default), each replicated 3 times (default)
  • NameNode (Master) – one per cluster – maintains the file system namespace and the block map (single point of failure)
  • DataNode (Worker) – one per node – stores and retrieves blocks
• Functions:
  • High availability – run a second (standby) NameNode
  • Block caching – by default a block is cached in only one DataNode's memory
  • Locality – rack-aware placement based on the network topology
  • File permissions – POSIX-like (r, w, x) with owner, group and mode for files and directories
  • Interfaces – HTTP (proxy/direct), Java API
  • Cluster balance – spread blocks evenly across the cluster
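The block and replication figures above can be turned into a quick back-of-the-envelope calculation. The sketch below is only an illustration: the function name `hdfs_footprint` is hypothetical, and it assumes the slide's defaults (128 MB blocks, replication 3) and that the last, partial block only occupies the bytes it actually holds.

```python
import math

BLOCK_SIZE_MB = 128   # block size quoted on the slide (assumed default)
REPLICATION = 3       # replication factor quoted on the slide (assumed default)


def hdfs_footprint(file_size_mb):
    """Estimate how a file of the given size is laid out (hypothetical helper)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return {
        "blocks": blocks,                             # logical blocks of the file
        "block_replicas": blocks * REPLICATION,       # physical copies on DataNodes
        "raw_storage_mb": file_size_mb * REPLICATION, # disk space used cluster-wide
    }


print(hdfs_footprint(1000))
```

For a 1000 MB file this gives 8 blocks (seven full ones and one partial), 24 block replicas and 3000 MB of raw storage.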

Page 10: Hadoop Ecosystem

[Figure: how a file is distributed and accessed on Hadoop HDFS. Elements: Client, HDFS proxy, NameNode, Rack 1 and Rack 2 with their DataNodes. The data is split into Block 1, Block 2 and Block 3, and copies of each block are stored on several DataNodes.]

Page 11: Hadoop Ecosystem

Resource Management

Storage

Hadoop Ecosystem

Page 12: Hadoop Ecosystem

Resource Management / YARN

• “Yet Another Resource Negotiator”
• Manages and schedules the cluster's resources
• Daemons:
  • Resource Manager – one per cluster – manages resources across the cluster
  • Node Manager – one per node – launches and monitors containers
• Container – executes an application process
• Resource requests for containers:
  • Amount of compute resources (CPU and memory)
  • Locality (node/rack)
• Lifespan: one application per user job, or long-running applications shared by users
• Scheduling:
  • Allocates resources by policy: FIFO, Capacity (per organisation), Fair
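To make the difference between the FIFO and Fair policies concrete, here is a toy sketch. It is not YARN code: the function names, the application names and the idea of counting resources in whole "containers" are all assumptions made for illustration, and real schedulers also deal with queues, memory, vcores and locality.

```python
def fifo_allocate(capacity, requests):
    """Serve applications strictly in arrival order (toy FIFO policy)."""
    allocation = {}
    for app, wanted in requests:
        granted = min(wanted, capacity)
        allocation[app] = granted
        capacity -= granted
    return allocation


def fair_allocate(capacity, requests):
    """Hand out containers one at a time, round-robin, to apps that still want more."""
    allocation = {app: 0 for app, _ in requests}
    remaining = dict(requests)
    while capacity > 0 and any(remaining.values()):
        for app in allocation:
            if capacity == 0:
                break
            if remaining[app] > 0:
                allocation[app] += 1
                remaining[app] -= 1
                capacity -= 1
    return allocation


# Hypothetical workload: (application, containers requested), in arrival order.
requests = [("etl", 8), ("report", 4), ("adhoc", 2)]
print(fifo_allocate(10, requests))  # {'etl': 8, 'report': 2, 'adhoc': 0}
print(fair_allocate(10, requests))  # {'etl': 4, 'report': 4, 'adhoc': 2}
```

With FIFO the first large job starves the small ad hoc job; with the fair policy every application makes progress.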

Page 13: Hadoop Ecosystem

[Figure: job scheduling on top of a Hadoop cluster. Elements: Client node (application, Client), Resource manager node (ResourceManager), and three node manager nodes (NodeManager), one hosting the master container and two hosting worker containers. Arrows are labelled "Launch YARN app", "launch" and "heartbeat".]

Page 14: Hadoop Ecosystem

Resource Management

Processing

Storage

Hadoop Ecosystem

Page 15: Hadoop Ecosystem

Processing / MapReduce

• Simplifies the development of large-scale, automatic, fault-tolerant data processing
• Origin – Google paper, 2004
• Batch processing
• Hadoop MR:
  • JobTracker – one per cluster – master process; schedules tasks on workers and monitors progress
  • TaskTracker – one per worker – executes map/reduce tasks locally and reports progress

Page 16: Hadoop Ecosystem

Processing / MapReduce

Data (three input splits)
Lior Ron Lior
Ron Ron Andrey
Lior Andrey Lior

Map (one Count/Name table per split)
(1, Lior), (1, Ron), (1, Lior)
(1, Lior), (1, Andrey), (1, Lior)
(1, Andrey), (1, Ron), (1, Ron)

Shuffle & Sort, then Reduce (Count, Name)
(4, Lior)
(3, Ron)
(2, Andrey)
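The name-count example on this slide can be replayed in a few lines of plain Python. This is a single-process sketch of the three phases, not Hadoop code; the function names are illustrative and the three splits are taken from the slide.

```python
from collections import defaultdict

# The three input splits shown on the slide.
splits = ["Lior Ron Lior", "Ron Ron Andrey", "Lior Andrey Lior"]


def map_phase(split):
    """Map: emit a (name, 1) pair for every name in one split."""
    return [(name, 1) for name in split.split()]


def shuffle_and_sort(mapped):
    """Shuffle & sort: group all counts by name, ordered by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for name, count in pairs:
            groups[name].append(count)
    return sorted(groups.items())


def reduce_phase(name, counts):
    """Reduce: sum the counts collected for one name."""
    return name, sum(counts)


mapped = [map_phase(split) for split in splits]
result = dict(reduce_phase(name, counts) for name, counts in shuffle_and_sort(mapped))
print(result)  # {'Andrey': 2, 'Lior': 4, 'Ron': 3}
```

On a cluster each split would be mapped on the node that stores it, and the shuffle would move the pairs over the network to the reducers.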

Page 17: Hadoop Ecosystem

[Figure: MR job scheduling on top of a Hadoop cluster. Elements: Client node (MR program, Client), Resource manager node (ResourceManager), and three node manager nodes (NodeManager), one hosting the JobTracker container and two hosting TaskTracker containers. Arrows are labelled "Launch YARN app", "launch" and "heartbeat".]

Page 18: Hadoop Ecosystem

Resource Management

Processing

Storage

Hadoop Ecosystem

Page 19: Hadoop Ecosystem

Storage / HBase

• Distributed column-oriented database on top of HDFS
• Real-time random read/write access for large datasets
• Region – a table is split by row into regions
• Phoenix – SQL on HBase

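[Figure: HBase data model. A row is addressed by its RowKey and spans Column Family 1 and Column Family 2; the families contain columns (Col 1.1, Col 1.2, Col 1.3), and each column stores versioned cells (Version, Data).]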
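The data model can be pictured as a nested map: row key, then column family and column, then version, then data. The sketch below mimics that shape in memory; `ToyTable` and its methods are hypothetical names for illustration and are not the HBase client API.

```python
class ToyTable:
    """In-memory imitation of the row key / family / column / version layout."""

    def __init__(self, families):
        self.families = set(families)   # column families are fixed up front
        self.rows = {}                  # row_key -> {(family, column): {version: data}}

    def put(self, row_key, family, column, data, version):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        cell = self.rows.setdefault(row_key, {}).setdefault((family, column), {})
        cell[version] = data            # older versions are kept next to the new one

    def get(self, row_key, family, column, version=None):
        cell = self.rows[row_key][(family, column)]
        if version is None:
            version = max(cell)         # default to the newest version
        return cell[version]


table = ToyTable(families=["cf1", "cf2"])
table.put("row-1", "cf1", "col1.1", "first value", version=1)
table.put("row-1", "cf1", "col1.1", "second value", version=2)
print(table.get("row-1", "cf1", "col1.1"))             # second value
print(table.get("row-1", "cf1", "col1.1", version=1))  # first value
```

Columns can be added freely inside a family, while the families themselves are part of the table definition.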

Page 20: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Resource Management, Processing, Coordination]

Page 21: Hadoop Ecosystem

Coordination / ZooKeeper

• Hadoop’s distributed coordination service
• Coordinates read/write actions on data
• Behaves like a highly available filesystem
• Implementation:
  • Data model:
    • Tree built from znodes (up to 1 MB of data each)
    • Znode – tracks data changes and carries an ACL (access control list)
  • Leader – performs writes and broadcasts updates
  • Follower – passes atomic requests to the leader
• Lock service
• User groups
• Replicated mode

Page 22: Hadoop Ecosystem

Coordination / ZooKeeper

[Figure: ZooKeeper coordination example. In the Hadoop cluster, HDFS (NameNode, DataNodes), HBase (HMaster, Regions) and other clients use the ZooKeeper service, which consists of one Leader and two Followers. Each server holds the same znode tree (/, /HBase, /HDFS), including LOCK znodes.]

Page 23: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination]

Page 24: Hadoop Ecosystem

Row Based / Avro

• Language-neutral data serialization system
• Shares data between programs written in many languages
• Splittable and sortable – allows easy MapReduce processing
• Rich schema resolution – flexible schema
• Other row-based formats
  • SequenceFile – logfile format
  • MapFile – sorted SequenceFile

Page 25: Hadoop Ecosystem

Row Based / Avro

File Structure
Header | Block 1 | Block 2 | ... | Block N
• Header: identifier, Metadata (Schema & codec), SyncMarker
• Block: Count objs, Size objs, Serialized objs, SyncMarker

Schema
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "firstName", "type": "string", "order": "descending" },
    { "name": "age", "type": "int" },
    {... ]
}
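Because an Avro schema is plain JSON, it can be inspected with nothing but the standard library. The sketch below parses a two-field version of the Person schema (the elided fields from the slide are left out) and checks records against it. It deliberately does not use the Avro library; `conforms` and `PYTHON_TYPES` are hypothetical helpers that cover only the `string` and `int` types.

```python
import json

# Two-field version of the slide's schema; the elided fields are omitted.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "Person",
  "fields": [
    {"name": "firstName", "type": "string", "order": "descending"},
    {"name": "age", "type": "int"}
  ]
}
"""

# Assumed mapping from the two Avro primitive types used here to Python types.
PYTHON_TYPES = {"string": str, "int": int}


def conforms(record, schema):
    """Return True if the record has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], PYTHON_TYPES[field["type"]])
        for field in fields
    )


schema = json.loads(SCHEMA_JSON)
print(conforms({"firstName": "Lior", "age": 30}, schema))    # True
print(conforms({"firstName": "Lior", "age": "30"}, schema))  # False
```

A real Avro writer performs this kind of check for every datum and stores the schema once in the file header rather than with each record.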

Page 26: Hadoop Ecosystem

Parquet

• Columnar storage format
• Skip unneeded columns
• Fast queries & small size
• Efficient nested data store

File Structure
Header | Block 1 | Block 2 | ... | Block N | Footer
• Header: Magic Number
• Block: Column chunks, each split into Pages
• Footer: File Metadata

Schema
message Person {
  required binary name (UTF8);
  required int32 age;
  required group hobbies (LIST) {
    required binary array (UTF8);
  }
}
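The benefit of "skip unneeded columns" is easiest to see by pivoting a few rows into columns. The sketch below is plain Python rather than Parquet itself; the sample people, their ages and hobbies, and the helper `to_columns` are made up for illustration.

```python
# Hypothetical rows matching the Person schema (name, age, hobbies).
rows = [
    {"name": "Lior", "age": 30, "hobbies": ["chess"]},
    {"name": "Ron", "age": 40, "hobbies": ["running", "cooking"]},
    {"name": "Andrey", "age": 50, "hobbies": []},
]


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# A query on age reads a single contiguous list and never touches
# the name or hobbies values.
ages = columns["age"]
average_age = sum(ages) / len(ages)
print(average_age)  # 40.0
```

Values of one column also tend to look alike, which is why columnar files usually compress better than row-based ones.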

Page 27: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Data Injection]

Page 28: Hadoop Ecosystem

Data Integration / Sqoop

• Import/export of structured data
• Sqoop connector:
  • Import/export from a database
• Sqoop1 – command line
• Sqoop2 – service
• Connectors – connect to relational databases (RDBs)

[Figure: Sqoop import and export. The Sqoop client reads table metadata from the database and launches an Import or Export MapReduce job on the Hadoop cluster; the job's map tasks move data between the database table and HDFS.]

Page 29: Hadoop Ecosystem

Data Integration / Flume

• Event-based data injection into Hadoop
• Flume agent components:
  • Sources – spoolingDir (creates events), Avro (RPC), HTTP (requests)
  • Channel
  • Sinks – Avro, HDFS, HBase, Solr (near real time)
• Reliability – uses separate transactions (source to channel, channel to sink)
• Fan out – one source, many sinks
• Scale – agent tiers aggregate events from multiple sources
• Sink groups – failover and load balancing

Page 30: Hadoop Ecosystem

Data Integration / Flume

[Figure: Flume topologies. A Flume agent (Source, Channel, Sink) moves data from a file system into Hadoop. Three panels: "Fan Out" (one agent delivering to several destinations such as HDFS and HBase), "Scale" (Tier 1, Tier 2 and Tier 3 Flume agents) and "Grouping" (grouped sinks).]

Page 31: Hadoop Ecosystem

Data Integration / Kafka

• Distributed publish-subscribe messaging system
• Fast, scalable, durable
• Components:
  • Topics – categories (feeds) of messages
  • Producers – processes that publish messages to a topic
  • Consumers – processes that subscribe to a topic
  • Brokers – Kafka servers in the cluster
• Distribution
  • Leader – handles reads and writes
  • Follower – replicates the leader
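The topic, producer and consumer roles can be sketched with an in-memory log. This is a toy model and not the Kafka client API: `ToyBroker`, `publish` and `poll` are hypothetical names, there is a single broker, and partitions and replication are left out.

```python
class ToyBroker:
    """Single-process stand-in for a broker holding append-only topic logs."""

    def __init__(self):
        self.topics = {}    # topic -> list of messages, in publish order
        self.offsets = {}   # (consumer, topic) -> next position to read

    def publish(self, topic, message):
        """Producer side: append a message to the topic's log."""
        self.topics.setdefault(topic, []).append(message)

    def poll(self, consumer, topic):
        """Consumer side: return the messages this consumer has not seen yet."""
        log = self.topics.get(topic, [])
        start = self.offsets.get((consumer, topic), 0)
        self.offsets[(consumer, topic)] = len(log)
        return log[start:]


broker = ToyBroker()
broker.publish("clicks", "page-1")
broker.publish("clicks", "page-2")
print(broker.poll("analytics", "clicks"))  # ['page-1', 'page-2']
broker.publish("clicks", "page-3")
print(broker.poll("analytics", "clicks"))  # ['page-3']
print(broker.poll("archiver", "clicks"))   # ['page-1', 'page-2', 'page-3']
```

Because messages stay in the log and each consumer only tracks its own offset, several consumers can read the same topic independently.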

Page 32: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Data Injection, Data Streaming]

Page 33: Hadoop Ecosystem

Data Integration / Streaming

• Stream processing
  • Kafka Streams – processes and analyzes data in Kafka
  • Storm – real-time computation
  • Spark Streaming – processes live data and can apply Spark MLlib and GraphX

[Figure: streaming example. Elements: Data, Flume Agent 1, Flume Agent 2, Kafka (Topic A, Topic B), Spark Streaming, Storm and HDFS; the arrows are numbered 1 and 2 to mark two flows.]

Page 34: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Data Injection, Data Streaming, Analysis]

Page 35: Hadoop Ecosystem

Computation / Spark

• Cluster computing framework
• In-memory processing
• Languages: Scala, Java and Python
• RDD – Resilient Distributed Dataset
  • Read-only collection spread across the cluster
  • Transformations are computed lazily, only when an action is called
• DAG engine – schedules many transformations into one optimized job
• Spark context
  • Parallel jobs
  • Caching
  • Broadcast variables (data/functions)
• Cluster manager of executors:
  • Local, Standalone, Mesos, YARN
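The lazy behaviour of RDDs (transformations are only recorded, an action triggers the work) can be imitated in a few lines. The class below is a toy written for this explanation, not the PySpark API; `ToyRDD` runs in one process and supports only `map`, `filter` and `collect`.

```python
class ToyRDD:
    """Read-only collection that records transformations and runs them on demand."""

    def __init__(self, data, transformations=()):
        self._data = tuple(data)                        # immutable source data
        self._transformations = tuple(transformations)  # recorded lineage

    def map(self, fn):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(self._data, self._transformations + (("map", fn),))

    def filter(self, fn):
        # Transformation: returns a new ToyRDD, computes nothing yet.
        return ToyRDD(self._data, self._transformations + (("filter", fn),))

    def collect(self):
        # Action: replay the recorded transformations in one pass.
        items = iter(self._data)
        for kind, fn in self._transformations:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return list(items)


calls = []


def square(x):
    calls.append(x)   # record when the function really runs
    return x * x


rdd = ToyRDD([1, 2, 3, 4]).map(square).filter(lambda x: x % 2 == 0)
print(calls)          # [] because nothing has been computed yet
print(rdd.collect())  # [4, 16]
print(calls)          # [1, 2, 3, 4]
```

Recording the lineage instead of the results is also what lets Spark recompute a lost partition and lets the DAG engine plan several transformations as one job.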

Page 36: Hadoop Ecosystem

Computation / Spark

[Figure: Spark on Hadoop. The Driver runs the Spark Program and its SparkContext. A Job is passed to the DAG Scheduler, which produces Stages; the Task Scheduler and Scheduler backend send the resulting Tasks to the Executors.]

Page 37: Hadoop Ecosystem

Scripting / Pig

• Data flow programming language – a MapReduce abstraction
• Supports: user-defined functions (UDFs), streaming, nested data
• Does not support: random reads/writes
• Pig Latin – scripting language
  • Load, Store, Filter, Group, Join, Sort, Union and Split, UDF, Co-group
• Modes
  • Local – small datasets
  • MR mode – runs on the cluster
• Execution – script, Grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar tools
  • Crunch – MR pipelines in Java (no UDF)

Page 38: Hadoop Ecosystem

Query / Hive

• Components
  • MetaStore – table descriptions
  • HiveQL – SQL dialect (SQL:2003)
• Table management
  • Managed tables in the warehouse directory
  • External tables
• Functionality
  • Bucketing and partitioning by column
  • Supports UDFs and UDAFs (aggregate functions)
  • Insert, Update, Delete:
    • Saved in delta files
    • Merged by background MR jobs
    • (Available in a transaction context)
  • Table locks (avoid dropping a table in use)
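Partitioning and bucketing both decide which file a row ends up in: the partition column selects a directory, and a hash of the bucket column selects a file within it. The sketch below illustrates the idea only. The sample rows are invented, `bucket_for` and `layout` are hypothetical helpers, and CRC32 is a stand-in for Hive's own hash function.

```python
import zlib


def bucket_for(value, num_buckets):
    """Pick a bucket from a stable hash of the value (CRC32 used as a stand-in)."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets


def layout(rows, partition_col, bucket_col, num_buckets):
    """Group rows by (partition directory, bucket number)."""
    files = {}
    for row in rows:
        partition = f"{partition_col}={row[partition_col]}"
        bucket = bucket_for(row[bucket_col], num_buckets)
        files.setdefault((partition, bucket), []).append(row)
    return files


# Hypothetical log rows: partition by date, bucket by user id.
rows = [
    {"dt": "2016-09-01", "user_id": 1, "action": "login"},
    {"dt": "2016-09-01", "user_id": 2, "action": "view"},
    {"dt": "2016-09-02", "user_id": 1, "action": "logout"},
]

for (partition, bucket), members in sorted(layout(rows, "dt", "user_id", 4).items()):
    print(partition, "bucket", bucket, len(members), "row(s)")
```

A query that filters on the partition column only has to read one directory, and rows with the same bucket key always land in the same bucket number, which helps sampling and joins.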

Page 39: Hadoop Ecosystem

Query / Comparison

               | SparkSql (shark)       | Impala                        | Hive
Usage          | Procedural development | BI & SQL analytics            | Batch
Speed          | OK                     | Best                          | Bad
Implementation | Memory                 | Dedicated daemons on DataNode | MapReduce
Similar tools  |                        | Presto, Drill (SQL:2011)      | Hive on Spark

Page 40: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis]

Page 41: Hadoop Ecosystem

Workflow / Oozie

• Schedules Hadoop jobs
• Job types:
  • Workflows – sequences of jobs described as Directed Acyclic Graphs (DAGs)
  • Coordinator – triggers jobs by time or data availability

[Figure: example Oozie workflow. Control-flow nodes: start, Fork, Join, End. Action nodes: Sqoop, Pig, Pig, MR, Sub workflow, FS (HDFS), Email.]
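A workflow is a DAG, so a valid execution order is a topological order of its nodes. The sketch below uses Python's standard `graphlib` module (Python 3.9+) on a reduced version of the example. The edges are an assumption, since the slide text does not preserve the arrows, and this is not how Oozie itself is configured (Oozie workflows are defined in XML).

```python
from graphlib import TopologicalSorter

# node -> set of nodes that must finish first (hypothetical edges).
workflow = {
    "sqoop": {"start"},
    "fork": {"sqoop"},
    "pig": {"fork"},          # the two branches after the fork
    "mr": {"fork"},           # can run in parallel
    "join": {"pig", "mr"},    # join waits for both branches
    "email": {"join"},
    "end": {"email"},
}

order = list(TopologicalSorter(workflow).static_order())
print(order)
```

Any order printed here respects the dependencies: `start` comes first, `pig` and `mr` both fall between `fork` and `join`, and `end` comes last. `TopologicalSorter` raises `CycleError` if the graph contains a cycle, which matches the requirement that a workflow be acyclic.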

Page 42: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis, Search]

Page 43: Hadoop Ecosystem

Search / Solr

• Full-text search over Hadoop
• Near real-time indexing
• REST API
• Based on the Apache Lucene Java search library

Page 44: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis, Search, Visualization]

Page 45: Hadoop Ecosystem

Visualization / Hue

• Open source Web interface for analyzing data with any Hadoop.
• Applications:
  • File Browser: HDFS, HBase
  • Scheduling of jobs and workflows: Oozie
  • Job Browser: YARN
  • SQL: Hive, Impala
  • Data analysis: Pig, UDF
  • Dynamic Search: Solr
  • Notebooks: Spark
  • Data Transfer: Sqoop 2

Page 46: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis, Search, Visualization, Cluster management]

Page 47: Hadoop Ecosystem

Cluster Management / Cloudera

• 100% open source
• The most complete and tested distribution of Hadoop
• Integrates all Hadoop projects
• Express – free, end-to-end administration
• Enterprise – extra features and support

Page 48: Hadoop Ecosystem

Cluster Management / Comparison

https://talendexpert.com/cloudera-vs-honworks-vs-mapr

Page 49: Hadoop Ecosystem

[Figure: basic cluster configuration. Node types: three Master nodes, Other Servers, and Worker nodes. Roles shown on the master and other servers: NameNode, Secondary NameNode, Resource manager, Standby Resource Manager, ZooKeeper, Cloudera Manager, Hive GW, Sqoop GW, Spark GW, Impala State. Each Worker runs NodeManager, DataNode and Impala Daemon.]

Page 50: Hadoop Ecosystem

Hadoop Ecosystem

[Ecosystem map: Storage, Data Formats, Resource Management, Processing, Coordination, Workflow, Data Injection, Data Streaming, Analysis, Search, Visualization, Cluster management]

Page 51: Hadoop Ecosystem

Thanks! Any questions?