Hadoop Ecosystem
Lior Sidi
Sep 2016
Hello! I am Lior Sidi
Big Data V's
• Volume
• Velocity
• Variety
What is Hadoop?
• Hadoop – open-source implementation of MapReduce (MR)
• Performs MR jobs quickly and efficiently
Goal: generating value from large datasets that cannot be analyzed using traditional technologies
Hadoop Concepts
Requirements
• Linear horizontal scalability
• Jobs run in isolation
• Simple programming model
Challenges and solutions
• Ch1: Data-access bottleneck
  • Sol: Store and process data on the same node
• Ch2: Distributed programming is difficult
  • Sol: Use high-level language APIs
Hadoop Timeline
• 2003 Oct – Google File System paper released
• 2004 Dec – "MapReduce: Simplified Data Processing on Large Clusters" paper released
• 2006 Oct – Hadoop 1.0 released
• 2007 Oct – Yahoo Labs creates Pig
• 2008 Oct – Cloudera, a Hadoop distributor, is founded
• 2010 Sep – Hive and Pig graduate
• 2011 Jan – ZooKeeper graduates
• 2013 Mar – YARN deployed at Yahoo
• 2014 Feb – Apache Spark becomes a top-level Apache project
Hadoop Ecosystem
[Diagram: ecosystem layers – Data Streaming, Analysis, Data Ingestion, Resource Management, Coordination, Processing, Workflow, Visualization, Cluster Management, Storage, Search, Data Formats]
Storage / HDFS
• “Hadoop Distributed File System”
• Design:
  • Write-once, read-many-times pattern
  • Cheap hardware
  • High-throughput (rather than low-latency) data access
• Concepts:
  • Block – a file is split into 128 MB blocks, replication factor 3
  • NameNode (master) – one per cluster – holds the filesystem namespace for blocks (single point of failure)
  • DataNode (worker) – one per node – stores and retrieves blocks
• Functions:
  • High availability – run a second NameNode
  • Block caching – a block is cached in only one DataNode
  • Locality – rack-sensitive, network-topology aware
  • File permissions – POSIX-like – r/w/x – owner/group/mode on files and directories
  • Interfaces – HTTP (proxy or direct), Java API
  • Cluster balance – spread blocks evenly across the cluster
[Diagram: how a file is distributed and accessed on HDFS – a file's Blocks 1–3 are each replicated on DataNodes across two racks; the NameNode tracks block locations; clients access data directly or via an HDFS proxy]
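The block-splitting and replication idea above can be sketched as a toy simulation. This is not the real HDFS API; the function names and the simplified rack-aware policy are assumptions for illustration only:

```python
# Toy model of HDFS block splitting and rack-aware replication.
# Not the real HDFS API -- names and placement policy are simplified assumptions.
BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default block size from the slide
REPLICATION = 3

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    blocks = []
    remaining = file_size
    while remaining > 0:
        blocks.append(min(block_size, remaining))
        remaining -= block_size
    return blocks

def place_replicas(block_id, datanodes):
    """Pick REPLICATION nodes: one replica on one rack, two on another
    (a crude stand-in for HDFS's rack-aware placement)."""
    by_rack = {}
    for node, rack in datanodes:
        by_rack.setdefault(rack, []).append(node)
    racks = sorted(by_rack)
    first_rack = racks[block_id % len(racks)]
    second_rack = racks[(block_id + 1) % len(racks)]
    replicas = [by_rack[first_rack][0]] + by_rack[second_rack][:2]
    return replicas[:REPLICATION]

# A 300 MB file becomes three blocks: 128 MB, 128 MB, and 44 MB.
sizes = split_into_blocks(300 * 1024 * 1024)
nodes = [("dn1", "rack1"), ("dn2", "rack1"), ("dn3", "rack2"), ("dn4", "rack2")]
placement = {i: place_replicas(i, nodes) for i in range(len(sizes))}
```

Each block ends up on three distinct DataNodes spanning two racks, matching the diagram's layout.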
Resource Management / YARN
• “Yet Another Resource Negotiator”
• Manages and schedules cluster resources
• Daemons:
  • ResourceManager – one per cluster – manages resources across the cluster
  • NodeManager – one per node – launches and monitors containers
• Container – executes an app process
• Resource requests for containers:
  • Amount of compute (CPU & memory)
  • Locality (node/rack)
  • Lifespan: one application per user job, or long-running apps shared by users
• Scheduling:
  • Resources are allocated by policy: FIFO, Capacity (per organization), or Fair
[Diagram: job scheduling on a Hadoop cluster – a client submits a YARN application to the ResourceManager, which launches an application-master container on a NodeManager; the master requests worker containers on other NodeManagers, and NodeManagers heartbeat back to the ResourceManager]
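The FIFO policy mentioned above can be sketched as a toy scheduler that grants container requests in arrival order against per-node capacity. The class and function names are invented; real YARN schedulers (FIFO, Capacity, Fair) are far more involved:

```python
# Toy FIFO container scheduler in the spirit of YARN's ResourceManager.
# Hypothetical names; a sketch, not YARN's actual scheduling logic.
from collections import deque

class Node:
    """A NodeManager's remaining capacity."""
    def __init__(self, name, cpu, mem_gb):
        self.name, self.cpu, self.mem_gb = name, cpu, mem_gb

def fifo_schedule(requests, nodes):
    """Allocate (app, cpu, mem_gb) container requests in arrival order.
    Returns {app: node_name}; requests that fit nowhere are skipped."""
    queue = deque(requests)
    allocations = {}
    while queue:
        app, cpu, mem = queue.popleft()
        for node in nodes:
            if node.cpu >= cpu and node.mem_gb >= mem:
                node.cpu -= cpu          # containers consume node capacity
                node.mem_gb -= mem
                allocations[app] = node.name
                break
    return allocations

nodes = [Node("nm1", cpu=4, mem_gb=16), Node("nm2", cpu=4, mem_gb=16)]
reqs = [("app1-master", 1, 2), ("app1-worker", 4, 8), ("app2-worker", 2, 4)]
alloc = fifo_schedule(reqs, nodes)
```

The master container lands first, then worker containers fill whichever nodes still have room, mirroring the launch arrows in the diagram.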
Processing / MapReduce
• Simplified development of large-scale, automatic, fault-tolerant data processing
• Origin – Google paper, 2004
• Batch processing
• Hadoop MR:
  • JobTracker – one per cluster – master process that schedules tasks on workers and monitors progress
  • TaskTracker – one per worker – executes map/reduce tasks locally and reports progress
Processing / MapReduce
Word-count example:
• Data (three input splits): "Lior Ron Lior", "Ron Ron Andrey", "Lior Andrey Lior"
• Map – each mapper emits a (name, 1) pair per word
• Shuffle & Sort – pairs are grouped by name
• Reduce – counts per name are summed: Lior 4, Ron 3, Andrey 2
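The word-count example above can be sketched as a minimal map / shuffle-and-sort / reduce pipeline in plain Python (a single-process simulation, not Hadoop's Java MapReduce API):

```python
# Toy single-process MapReduce word count -- a simulation of the three phases,
# not Hadoop's actual distributed runtime.
from itertools import groupby
from operator import itemgetter

def map_phase(splits):
    """Each 'mapper' emits a (word, 1) pair for every word in its input split."""
    return [(word, 1) for split in splits for word in split.split()]

def shuffle_sort(pairs):
    """Group the emitted pairs by key, as the framework does between map and reduce."""
    return groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0))

def reduce_phase(grouped):
    """Each 'reducer' sums the counts for one key."""
    return {key: sum(count for _, count in values) for key, values in grouped}

splits = ["Lior Ron Lior", "Ron Ron Andrey", "Lior Andrey Lior"]
counts = reduce_phase(shuffle_sort(map_phase(splits)))
# counts == {"Andrey": 2, "Lior": 4, "Ron": 3}
```

The result matches the slide: Lior 4, Ron 3, Andrey 2.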
[Diagram: MR job scheduling on a Hadoop cluster – a client submits an MR program to the ResourceManager, which launches a JobTracker container; the JobTracker launches TaskTracker containers on NodeManager nodes, and heartbeats report progress]
Storage / HBase
• Distributed, column-oriented database on top of HDFS
• Real-time read/write random access to large datasets
• Region – tables are split by row
• Phoenix – SQL on HBase
HBase data model:
[Diagram: a RowKey maps to column families (Column Family 1, Column Family 2); each column (Col 1.1, Col 1.2, Col 1.3) holds versioned data]
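The data model above (rowkey → column family → column → versioned values) can be sketched as a toy in-memory table. The class is invented for illustration; the real client is HBase's Java/Thrift API:

```python
# Toy sketch of HBase's data model: rowkey -> column family -> column -> {version: value}.
# Illustrative only -- not the real HBase client API.
import time

class ToyHTable:
    def __init__(self, families):
        self.families = set(families)   # column families are fixed at table creation
        self.rows = {}

    def put(self, rowkey, family, column, value, version=None):
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        if version is None:
            version = time.time_ns()    # HBase defaults versions to timestamps
        cell = (self.rows.setdefault(rowkey, {})
                         .setdefault(family, {})
                         .setdefault(column, {}))
        cell[version] = value

    def get(self, rowkey, family, column):
        """Return the value with the highest version (newest wins)."""
        versions = self.rows[rowkey][family][column]
        return versions[max(versions)]

table = ToyHTable(families=["cf1", "cf2"])
table.put("row1", "cf1", "col1.1", "old", version=1)
table.put("row1", "cf1", "col1.1", "new", version=2)
```

A read of `row1/cf1/col1.1` returns `"new"`: older versions stay stored but the latest one is served, as in the versioned cells of the diagram.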
Coordination / ZooKeeper
• Hadoop’s distributed coordination service
• Coordinates read/write actions on data
• A highly available, filesystem-like store
• Implementation:
  • Data model: a tree built from znodes (up to 1 MB of data each)
  • Znode – data, change notifications, ACL (access control list)
  • Leader – performs writes and broadcasts updates
  • Follower – passes atomic requests to the leader
  • Lock service
  • User groups
  • Replicated mode
Coordination / ZooKeeper – example
[Diagram: a ZooKeeper service (one leader plus followers), each holding a znode tree with /hbase and /hdfs paths and locks; HDFS (NameNode, DataNodes), HBase (HMaster, Regions), and other clients coordinate through it]
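The znode tree in the example can be sketched as a toy hierarchical store with ZooKeeper's 1 MB per-node data limit. A sketch only: real ZooKeeper adds watches, ACLs, sessions, and ephemeral nodes, and these class names are invented:

```python
# Toy znode tree echoing ZooKeeper's data model -- not the real client API.
ZNODE_DATA_LIMIT = 1024 * 1024  # 1 MB of data per znode, per the slide

class Znode:
    def __init__(self, data=b""):
        if len(data) > ZNODE_DATA_LIMIT:
            raise ValueError("znode data must be <= 1 MB")
        self.data = data
        self.children = {}

class ToyZooKeeper:
    def __init__(self):
        self.root = Znode()

    def create(self, path, data=b""):
        """Create a znode; parent znodes must already exist (as in ZooKeeper)."""
        parts = path.strip("/").split("/")
        node = self.root
        for part in parts[:-1]:
            node = node.children[part]
        node.children[parts[-1]] = Znode(data)

    def get(self, path):
        node = self.root
        for part in path.strip("/").split("/"):
            node = node.children[part]
        return node.data

zk = ToyZooKeeper()
zk.create("/hbase")
zk.create("/hbase/lock", b"held-by-hmaster-1")
```

A lock, as in the diagram, is simply a well-known znode that clients create and watch.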
Row Based / Avro
• Language-neutral data serialization system
• Shares data between many programming languages
• Splittable and sortable – allows easy MapReduce
• Rich schema resolution – flexible schemas
• Other row-based formats:
  • SequenceFile – logfile format
  • MapFile – sorted SequenceFile
Row Based / Avro
File structure:
• File: Header, Block 1, Block 2, …, Block N
• Header: identifier (magic), metadata (schema & codec), sync marker
• Block: object count, size, serialized objects, sync marker

Schema example:
{
  "type": "record",
  "name": "Person",
  "fields": [
    { "name": "firstName", "type": "string", "order": "descending" },
    { "name": "age", "type": "int" },
    ...
  ]
}
Parquet
• Columnar storage format
• Skips unneeded columns
• Fast queries & small size
• Efficient nested data storage

File structure:
• File: Header, Block 1, Block 2, …, Block N, Footer
• Block: column chunks; each column chunk is split into pages
• Footer: file metadata, magic number

Schema example:
message Person {
  required binary name (UTF8);
  required int32 age;
  required group hobbies (LIST) {
    required binary array (UTF8);
  }
}
Data Integration / Sqoop
• Imports/exports structured data
• A Sqoop connector imports/exports from a database
• Sqoop1 – command line
• Sqoop2 – service
• Connectors – connect to relational databases
[Diagram: the Sqoop client reads table metadata and launches import/export MapReduce jobs on the Hadoop cluster; map tasks move data between a database table and HDFS]
Data Integration / Flume
• Event-based data ingestion into Hadoop
• Flume agent components:
  • Source – spoolingDir (creates events), Avro (RPC), HTTP (requests)
  • Channel
  • Sink – Avro, HDFS, HBase, Solr (near real time)
• Reliability – uses separate transactions
• Fan-out – one source, many sinks
• Scale – agent tiers aggregate multiple sources
• Sink grouping – failover and load balancing
[Diagram: a Flume agent (source → channel → sink) fans out from one source to a file-system sink and to Hadoop (HDFS, HBase); tier-1 agents feed tier-2 and tier-3 agents for scale, with sink grouping for failover and load balancing]
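The source → channel → sink fan-out above can be sketched as a toy pipeline where every event is replicated into one channel per sink. The class names are invented; real Flume agents are wired via configuration files, not code like this:

```python
# Toy Flume-style fan-out: one source, a channel per sink, many sinks.
# Hypothetical names -- a sketch of the topology, not the Flume API.
from collections import deque

class Channel:
    """Buffers events between a source and one sink."""
    def __init__(self):
        self.buffer = deque()
    def put(self, event):
        self.buffer.append(event)
    def take(self):
        return self.buffer.popleft()

class MemorySink:
    """Stands in for an HDFS or HBase sink; just records delivered events."""
    def __init__(self, name):
        self.name, self.delivered = name, []
    def deliver(self, event):
        self.delivered.append(event)

def fan_out(events, sinks):
    """Replicate every source event to each sink through its own channel."""
    channels = {sink.name: Channel() for sink in sinks}
    for event in events:
        for sink in sinks:               # fan-out: one source, many sinks
            channels[sink.name].put(event)
    for sink in sinks:
        ch = channels[sink.name]
        while ch.buffer:
            sink.deliver(ch.take())

hdfs, hbase = MemorySink("hdfs"), MemorySink("hbase")
fan_out(["evt1", "evt2"], [hdfs, hbase])
```

Both sinks receive every event, which is exactly the replicating fan-out shown in the diagram.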
Data Integration / Kafka
• Distributed publish-subscribe messaging system
• Fast, scalable, durable
• Components:
  • Topic – a category of feed messages
  • Producer – a process that publishes messages to a topic
  • Consumer – a process that subscribes to a topic
  • Broker – a Kafka server in the cluster
• Distribution:
  • Leader – handles reads and writes
  • Follower – replicates
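The topic/producer/consumer roles can be sketched with a toy in-memory broker where each topic is an append-only log and each consumer tracks its own read offset. The class names are invented; a real client would be e.g. the Java Kafka client:

```python
# Toy publish-subscribe broker echoing Kafka's topic/producer/consumer roles.
# A sketch with invented names -- not the real Kafka protocol or client API.
class ToyBroker:
    def __init__(self):
        self.topics = {}                 # topic -> append-only log of messages

    def publish(self, topic, message):   # the producer's role
        self.topics.setdefault(topic, []).append(message)

class ToyConsumer:
    """Tracks its own offset per topic, as Kafka consumers do."""
    def __init__(self, broker):
        self.broker, self.offsets = broker, {}

    def poll(self, topic):
        offset = self.offsets.get(topic, 0)
        log = self.broker.topics.get(topic, [])
        new_messages = log[offset:]      # only messages not yet seen
        self.offsets[topic] = len(log)
        return new_messages

broker = ToyBroker()
consumer = ToyConsumer(broker)
broker.publish("topicA", "m1")
broker.publish("topicA", "m2")
first = consumer.poll("topicA")   # ["m1", "m2"]
broker.publish("topicA", "m3")
second = consumer.poll("topicA")  # ["m3"]
```

Because the broker keeps the full log and consumers keep only offsets, many independent consumers can read the same topic at their own pace.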
Data Integration / Streaming
• Stream processing
• Kafka Streams – process and analyze data in Kafka
• Storm – real-time computation
• Spark Streaming – processes live data and can apply Spark MLlib and GraphX
[Diagram: Flume agents 1 and 2 feed data into Kafka topics A and B; Spark Streaming and Storm consume the topics and write results to HDFS]
Computation / Spark
• Cluster computing framework
• In-memory processing
• Languages: Scala, Java, and Python
• RDD – Resilient Distributed Dataset
  • A read-only collection spread across the cluster
  • Transformations are computed lazily, only when an action runs
• DAG engine – schedules many transformations into one optimal job
• Spark context:
  • Parallel jobs
  • Caching
  • Broadcast variables (data/functions)
• Cluster managers for executors: local, standalone, Mesos, YARN
Computation / Spark
[Diagram: the driver (Spark program, SparkContext, DAG scheduler, task scheduler, scheduler backend) splits each job into stages and tasks, which run on executors in the Hadoop cluster]
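The "lazy transformations, eager actions" idea can be sketched with a toy RDD that only records `map`/`filter` steps until an action (`collect`) forces execution. The class is invented for illustration; real code would use pyspark's `SparkContext` and RDD API:

```python
# Toy RDD illustrating lazy transformations and eager actions.
# Invented names -- a sketch of the concept, not pyspark.
class ToyRDD:
    def __init__(self, data, pipeline=None):
        self._data = list(data)
        self._pipeline = pipeline or []      # deferred transformation steps

    def map(self, fn):
        # Transformation: record the step, compute nothing yet.
        return ToyRDD(self._data, self._pipeline + [("map", fn)])

    def filter(self, pred):
        # Also lazy: RDDs are read-only, so each call returns a new ToyRDD.
        return ToyRDD(self._data, self._pipeline + [("filter", pred)])

    def collect(self):
        # Action: run the whole recorded pipeline in one pass.
        result = self._data
        for kind, fn in self._pipeline:
            if kind == "map":
                result = [fn(x) for x in result]
            else:
                result = [x for x in result if fn(x)]
        return result

rdd = ToyRDD(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
evens_of_squares = rdd.collect()   # nothing executed until this action
```

Chaining the steps and running them together is a (very rough) stand-in for what Spark's DAG engine does when it fuses transformations into one job.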
Scripting / Pig
• Data-flow programming language – a MapReduce abstraction
• Supports: user-defined functions (UDFs), streaming, nested data
• Does not support: random read/write
• Pig Latin – scripting language
  • Load, Store, Filter, Group, Join, Sort, Union and Split, UDF, Cogroup
• Modes:
  • Local – small datasets
  • MR mode – runs on the cluster
• Execution – script, grunt (shell), embedded (Java)
• Parameter substitution – run a script with different parameters
• Similar: Crunch – MR pipelines in Java (no UDFs)
Query / Hive
• Components:
  • MetaStore – table descriptions
  • HiveQL – SQL dialect (SQL:2003)
• Table management:
  • Warehouse directory
  • External tables
• Functionality:
  • Bucketing and partitioning by column
  • Supports UDFs and UDAFs (aggregates)
  • Insert/Update/Delete:
    • Saved in delta files
    • Background MR jobs
    • (Transaction context available)
  • Table locks (to avoid drops)
Query / Comparison

                SparkSQL (Shark)        Impala                          Hive
Usage           Procedural development  BI & SQL analytics              Batch
Speed           OK                      Best                            Bad
Implementation  In memory               Dedicated daemons on DataNodes  MapReduce
Similar tools                           Presto, Drill (SQL:2011)        Hive on Spark
Workflow / Oozie
• Schedules Hadoop jobs
• Job types:
  • Workflow – a sequence of jobs expressed as a directed acyclic graph (DAG)
  • Coordinator – triggers jobs by time or data availability
[Diagram: an example workflow – start → Sqoop → fork into (Pig; MR → Pig; sub-workflow) → join → FS (HDFS) → end; control-flow nodes vs. action nodes]
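The workflow DAG above can be sketched by encoding each action's dependencies and letting a topological sort decide a valid execution order. The node names follow the slide's example; using `graphlib` as the scheduler is an assumption for illustration, not how Oozie actually runs jobs:

```python
# Toy runner for the slide's workflow DAG, using a plain topological sort.
# Node names from the slide; Oozie itself uses XML workflow definitions.
from graphlib import TopologicalSorter

# Each action maps to the set of actions it depends on.
workflow = {
    "sqoop": {"start"},
    "pig1": {"sqoop"},           # fork branch 1
    "mr": {"sqoop"},             # fork branch 2
    "pig2": {"mr"},
    "sub-workflow": {"sqoop"},   # fork branch 3
    "fs(hdfs)": {"pig1", "pig2", "sub-workflow"},  # join waits for all branches
    "end": {"fs(hdfs)"},
}

# A valid linear execution order respecting every dependency.
order = list(TopologicalSorter(workflow).static_order())

def runs_before(a, b):
    return order.index(a) < order.index(b)
```

The fork/join semantics fall out naturally: the three branch actions may appear in any relative order, but all of them come after `sqoop` and before the `fs(hdfs)` join.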
Search / Solr
• Full-text search over Hadoop
• Near-real-time indexing
• REST API
• Based on the Apache Lucene Java search library
Visualization / Hue
• Open-source web interface for analyzing data with any Hadoop distribution
• Applications:
  • File browser: HDFS, HBase
  • Job and workflow scheduling: Oozie
  • Job browser: YARN
  • SQL: Hive, Impala
  • Data analysis: Pig, UDFs
  • Dynamic search: Solr
  • Notebooks: Spark
  • Data transfer: Sqoop 2
Cluster Management / Cloudera
• 100% open source
• The most complete and tested distribution of Hadoop
• Integrates all Hadoop projects
• Express – free, end-to-end administration
• Enterprise – extra features and support
Cluster Management / Comparison
https://talendexpert.com/cloudera-vs-hortonworks-vs-mapr
Basic cluster configuration:
[Diagram: three master nodes run the NameNode, ResourceManager, standby ResourceManager, secondary NameNode, ZooKeeper, Impala StateStore, Cloudera Manager, and Hive/Sqoop/Spark gateways; each worker node runs a NodeManager, DataNode, and Impala daemon; other servers host clients]
Thanks! Any questions?