Hadoop Inside


TC Data Platform Division, GFIS Team

Eunjo Lee (이은조)

Distributed Processing System

How to process data in distributed environment

how to read/write data

how to control nodes

load balancing

Monitoring

node status

task status

Fault tolerance

error detection

process error, network error, hardware error, …

error handling

temporary error: retry (but retries risk duplication, data corruption, …)

permanent error: fail over (but to which node?)

process hang: timeout & retry (see the sketch below)

• timeout too long -> long response times

• timeout too short -> tasks killed mid-progress, an endless retry loop
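A minimal sketch of the timeout-and-retry tradeoff in plain Java (not Hadoop code; maxAttempts and timeoutMs are hypothetical knobs): a long timeout delays failure detection, while a short one abandons and re-runs work that was still progressing.

```java
import java.util.concurrent.*;

public class RetryWithTimeout {
    // Run a task, giving up on any attempt that exceeds timeoutMs and
    // retrying at most maxAttempts times.
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long timeoutMs)
            throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = pool.submit(task);
                try {
                    return future.get(timeoutMs, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // abandon this attempt and retry
                }
            }
            throw new RuntimeException("failed after " + maxAttempts + " attempts");
        } finally {
            pool.shutdownNow();
        }
    }
}
```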

Hadoop System Architecture

[Diagram: a master node runs the JobTracker and the NameNode; each of three slave nodes runs a TaskTracker and a DataNode; a separate node runs the Secondary NameNode. TaskTrackers send heartbeats to the JobTracker and DataNodes to the NameNode; data is read from and written to the DataNodes directly. Legend: Node, Process, Heart Beat, Data Read/Write.]

HDFS + MapReduce

HDFS

vs. a local filesystem (local concept – HDFS counterpart)

inode – namespace

cylinder / track – data node

blocks (bytes) – blocks (MBytes)

Features

very large files

write once, read many times

support for the usual file system operations (see the API sketch after this list)

ls, cp, mv, rm, chmod, chown, put, cat, …

no support for multiple writers or arbitrary modifications
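These shell verbs map onto the Java FileSystem API. A minimal sketch (the paths are made-up examples):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsOps {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster address from core-site.xml on the classpath.
        FileSystem fs = FileSystem.get(new Configuration());

        // ls
        for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(status.getPath() + "\t" + status.getLen());
        }
        // mv
        fs.rename(new Path("/user/demo/a.txt"), new Path("/user/demo/b.txt"));
        // rm -r
        fs.delete(new Path("/user/demo/old"), true);
        fs.close();
    }
}
```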

Block Replication & Rack Awareness

[Diagram: a file consists of blocks 1–4; each block is replicated three times across servers in two racks, so that no server holds two copies of the same block and each block has replicas in more than one rack. Legend: Server, Rack, File, Block.]

The default placement policy writes the first replica on the writer's node (or a random node), the second on a node in a different rack, and the third on another node in that second rack.

HDFS - Read

[Diagram: a Client, the NameNode, and three DataNodes. Legend: Node, Data Block, Data I/O, Operation Message.]

1. Read Request: the client asks the NameNode where the file's blocks are

2. Response: the NameNode returns the DataNodes holding each block

3. Request Data: the client contacts a DataNode holding the block directly

4. Read Data: the DataNode streams the block to the client
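A minimal read through the Java API (the path is illustrative): open() performs steps 1–2, and the copy loop performs steps 3–4.

```java
import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsRead {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // open() asks the NameNode for block locations (steps 1-2);
        // reading the stream pulls the blocks from DataNodes (steps 3-4).
        try (InputStream in = fs.open(new Path("/user/demo/input.txt"))) {
            IOUtils.copyBytes(in, System.out, 4096, false);
        }
    }
}
```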

HDFS - Write

[Diagram: a Client, the NameNode, and three DataNodes forming a replication pipeline. Legend: Node, Data Block, Data I/O, Operation Message.]

1. Write Request: the client asks the NameNode to create the file

2. Response: the NameNode returns the DataNodes to write to

3. Write Data: the client streams the block to the first DataNode

4. Write Replica: each DataNode forwards the block to the next one in the pipeline

5. Write Done: acknowledgments flow back and the write completes
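The write side, sketched the same way (path illustrative): create() performs steps 1–2, writes feed the pipeline (steps 3–4), and closing the stream completes step 5.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        // create() registers the new file with the NameNode (steps 1-2).
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/out.txt"))) {
            // Each block is pipelined through the chain of DataNodes (steps 3-4);
            // close() waits for the final acknowledgment (step 5).
            out.writeBytes("hello hdfs\n");
        }
    }
}
```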

HDFS – Write (Failure)

[Diagram: the same pipeline, but one DataNode fails mid-write. Steps 1–3 proceed as before, only one Write Replica succeeds, and Write Done is still acknowledged by the surviving nodes.]

HDFS – Write (Failure)

[Diagram: recovery after the failure.]

Write Replica: the block is re-written through the surviving DataNodes

Delete Partial Block: the failed node's incomplete copy is discarded

Replica Arrangement: the NameNode later re-replicates the block onto another DataNode to restore the replication factor

MapReduce

Definition

map: (+1) [ 1, 2, 3, 4, …, 10 ] -> [ 2, 3, 4, 5, …, 11 ]

reduce: (+) [ 2, 3, 4, 5, …, 11 ] -> 65
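The same toy example in Java streams (plain Java, not Hadoop):

```java
import java.util.stream.IntStream;

public class MapReduceToy {
    public static void main(String[] args) {
        int result = IntStream.rangeClosed(1, 10) // [1, 2, ..., 10]
                .map(x -> x + 1)                  // map (+1): [2, 3, ..., 11]
                .reduce(0, Integer::sum);         // reduce (+): 65
        System.out.println(result);
    }
}
```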

A programming model for processing data sets in Hadoop (see the WordCount sketch after this list)

projection, filter -> map task

aggregation, join -> reduce task

sort -> partitioning

Job Tracker & Task Trackers

master / slave

job = many tasks

# of map tasks = # of file splits (default: # of blocks)

# of reduce tasks = user configuration
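A sketch of the model using the canonical WordCount tasks against the org.apache.hadoop.mapreduce API: the mapper projects each input line into (word, 1) records, and the reducer aggregates the values grouped under each key.

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {
    // One map task runs per file split; map() is called once per input record.
    public static class TokenMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE); // map output record (key/value pair)
            }
        }
    }

    // reduce() receives all values for one key after shuffling & sorting.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) sum += c.get();
            context.write(key, new IntWritable(sum)); // reduce output record
        }
    }
}
```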

MapReduce

[Diagram: files on the distributed file system are divided into splits; each split's input data records feed a map task; map output records (key/value pairs) are grouped into partitions, shuffled and sorted, and each partition feeds one reduce task, whose output records (key/value pairs) are written back to the distributed file system.]


Mapper – Partitioning

double-indexed structure

[Diagram: the Output Buffer (default: 100 MB) stores raw records back to back (key value | key value | … | key value); a 1st index holds one entry per record (partition, key offset, value offset); a 2nd index holds the key offsets that the sort rearranges.]

Spill Thread

data sorting: by the 2nd index (quicksort)

spill file generation

each spill produces a data file & an index file

flush

merge sort (by key) per partition over all spill files
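The partition stored in the 1st index comes from the job's Partitioner; the default HashPartitioner is essentially the following:

```java
import org.apache.hadoop.mapreduce.Partitioner;

// One partition exists per reduce task; masking the sign bit keeps the
// modulus non-negative for keys with negative hash codes.
public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```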


Reducer – Fetching

GetMapEventsThread

listens for map completion events

MapOutputCopier

fetches data from completed mappers over HTTP

several copier threads run concurrently

Merger

key sorting (heap sort)

[Diagram: the JobTracker forwards map completion events to the reduce-side TaskTracker; its Copier threads then fetch the map outputs from each map-side TaskTracker via HTTP GET and hand them to the Reducer.]

Job Flow

[Diagram: a Client Node (MapReduce Program and Job Client), a JobTracker Node (JobTracker and job queue), a TaskTracker Node (TaskTracker, child JVM, Map/Reduce Task), and a shared file system. Legend: Node, JVM, Class, Job Queue, Method Call, I/O, Job, Task.]

1. runJob: the MapReduce program asks the JobClient to run the job

2. copy job resources: the JobClient copies the job JAR, configuration, and split information to the shared file system

3. submit job: the JobClient submits the job to the JobTracker

4. retrieve input splits: the JobTracker reads the split information from the shared file system

5. add job: the job is placed in the job queue

6. heartbeat: each TaskTracker reports its status periodically

7. assign task: the JobTracker hands back a task in the heartbeat response

8. retrieve job resources: the TaskTracker fetches the job resources from the shared file system

9. launch: the TaskTracker starts a child JVM

10. run: the child JVM runs the map or reduce task

11. read data / write result: the task reads its input from and writes its output to the file system
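A driver that exercises steps 1–3 of this flow, reusing the WordCount classes sketched earlier (Job.getInstance is the Hadoop 2 spelling; the classic API used new Job(conf, name)):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        // Copies job resources, computes splits, submits the job (steps 1-3),
        // then polls the job status until completion.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```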

Monitoring

Heart beat

TaskTracker status checking

task request / assignment

other commands (restart, shutdown, kill task, …)

Cluster Status

Job / Task Status

JobInProgress

TaskInProgress

Reporter & Metrics

Black list


Monitoring (Web UI)

[Screenshots: monitoring pages showing cluster info, job info, and task info.]

Task Scheduler

job queue

a red-black tree (java.util.TreeMap); see the sketch after this list

sorted by priority & job id (request time)

load factor

load factor = remaining tasks / task capacity

task assignment

high priority first

new task > speculative execution task > dummy split task

map task (local) > map task (non-local) > reduce task

padding

padding = MIN(total tasks × pad fraction, task capacity)

reserves room for speculative execution
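A rough sketch (hypothetical types, not the actual JobTracker classes) of how a TreeMap yields this ordering:

```java
import java.util.Comparator;
import java.util.TreeMap;

public class JobQueueSketch {
    // Hypothetical key: a lower priority value means more urgent; the job id
    // increases with submission (request) time.
    record JobKey(int priority, long jobId) {}

    public static void main(String[] args) {
        Comparator<JobKey> order = Comparator
                .comparingInt(JobKey::priority)    // priority first
                .thenComparingLong(JobKey::jobId); // then submission order
        TreeMap<JobKey, String> queue = new TreeMap<>(order); // red-black tree
        queue.put(new JobKey(1, 42L), "job_42");
        queue.put(new JobKey(0, 43L), "job_43"); // submitted later, higher priority
        System.out.println(queue.firstEntry().getValue()); // prints job_43
    }
}
```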

Error Handling

Retry

configurable (default: 4 attempts)

Timeout

configurable

Speculative Execution: a backup copy of a task is launched when both conditions hold

the task has run for at least 1 minute (current − start ≥ 1 minute)

its progress lags the average by more than 20% (average progress − progress > 20%)
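The knobs behind this slide, using the classic (pre-YARN) property names; the values shown are the defaults:

```java
import org.apache.hadoop.conf.Configuration;

public class FaultToleranceConfig {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        conf.setInt("mapred.map.max.attempts", 4);      // retry: 4 attempts
        conf.setInt("mapred.reduce.max.attempts", 4);
        conf.setLong("mapred.task.timeout", 600_000L);  // timeout: 10 minutes
        conf.setBoolean("mapred.map.tasks.speculative.execution", true);
        conf.setBoolean("mapred.reduce.tasks.speculative.execution", true);
    }
}
```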

Distributed Processing System (Recap)

How to process data in a distributed environment

how to read/write data -> the HDFS client, block replication & rack awareness

how to control nodes / load balancing -> the master/slave architecture & the job scheduler

Monitoring

node status / task status -> heartbeats, job/task status, reporter & metrics

Fault tolerance

error detection & handling -> the blacklist, timeout & retry, speculative execution

Limitations

map -> reduce network overhead (the shuffle)

iterative processing

full (or theta) join

data that is small in total size but split into many pieces

Low latency

polling & pulling add delay

job initialization is slow

optimized for throughput, not latency

job scheduling

data access

Q&A
