HDFS Hadoop Distributed File System

100062123 柯懷貿
100062139 王建鑫
101062401 彭偉慶

Outline
• Introduction
• HDFS – How it works
• Pros and Cons
• Conclusion

Introduction to HDFS

Hadoop Distributed File System

• Created by Doug Cutting
• Nutch project
• File system for the Hadoop framework
• Remote Procedure Call
• Master/Slave
• Yahoo! ran a 10,000-core Hadoop cluster in 2008

• Cloud computing
• Java
• Processing PB-level data
• Distributed computing environment
• Hadoop MapReduce
• HDFS
• HBase

• Allows files to be shared via the internet
• Write-once-read-many
• Restricting access
• Replication & fault tolerance
• Mapping between logical objects & physical objects

MapReduce

HBase

• NoSQL
• Uses several servers to store PB-level data

HDFS

• Distributed, scalable, and portable
• File replication (default: 3)
• Efficient reads

HDFS major roles
• Client (user) – reads data from and writes data to the file system
• Name node (master) – oversees and coordinates data storage; receives instructions from the client
• Data node (slave) – stores data and runs computations; receives instructions from the name node
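
The split between the three roles can be sketched in a few lines of Python. This is only an illustrative toy, not HDFS code: the class names, the `allocate` and `locate` methods, and the round-robin placement are all assumptions made for the example, and replication is left out. The point it shows is that the name node holds metadata only, while the client moves the actual bytes to and from data nodes.

```python
class DataNode:
    """Stores block contents; knows nothing about file names."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes


class NameNode:
    """Stores metadata only: which blocks make up a file and where they live."""

    def __init__(self, node_names):
        self.node_names = list(node_names)
        self.files = {}  # path -> [(block id, data node name)]

    def allocate(self, path, block_count):
        # Round-robin placement is an assumption made for this sketch.
        locations = [
            (f"{path}#{i}", self.node_names[i % len(self.node_names)])
            for i in range(block_count)
        ]
        self.files[path] = locations
        return locations

    def locate(self, path):
        return self.files[path]


class Client:
    """Asks the name node where blocks go, then talks to data nodes directly."""

    def __init__(self, name_node, data_nodes):
        self.name_node = name_node
        self.data_nodes = data_nodes  # data node name -> DataNode

    def write(self, path, chunks):
        locations = self.name_node.allocate(path, len(chunks))
        for (block_id, node), chunk in zip(locations, chunks):
            self.data_nodes[node].blocks[block_id] = chunk

    def read(self, path):
        return b"".join(
            self.data_nodes[node].blocks[block_id]
            for block_id, node in self.name_node.locate(path)
        )
```

Note that the file contents never pass through `NameNode`; it only answers "where" questions.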

Rack Awareness

HDFS fault tolerance
• Node failure – a data node or the name node is dead
• Communication failure – data cannot be sent or retrieved
• Data corruption – data is corrupted while being sent over the network or while stored on the hard disks
• Write failure – the data node that is about to be written to is dead
• Read failure – the data node that is about to be read from is dead

Detecting network failure
• Whenever data is sent, the receiver replies with an ACK
• If the ACK is not received (after several retries), the sender assumes that the host is dead or the network has failed
• A checksum is also sent along with the transmitted data, so corruption during transfer can be detected
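
The ACK-and-checksum idea above can be sketched as follows. This is a minimal illustration under stated assumptions, not how HDFS itself is implemented: `zlib.crc32` stands in for whatever checksum is really used, `MAX_RETRIES` is a made-up value for "several retries", and `channel` is a hypothetical callable that returns True when an ACK comes back.

```python
import zlib

MAX_RETRIES = 3  # hypothetical limit; the slide only says "several retries"


def make_packet(payload):
    """Sender side: attach a checksum to the payload."""
    return payload, zlib.crc32(payload)


def receive(packet):
    """Receiver side: return True (an ACK) only if the checksum matches."""
    payload, checksum = packet
    return zlib.crc32(payload) == checksum


def send_with_retries(packet, channel):
    """Resend until an ACK arrives; give up after MAX_RETRIES attempts.

    `channel` is a hypothetical callable that delivers the packet and
    returns True for an ACK, or False when no ACK came back.
    """
    for _ in range(MAX_RETRIES):
        if channel(packet):
            return True
    return False  # the sender assumes the host is dead or the network failed
```

A payload altered in transit no longer matches its checksum, so `receive` withholds the ACK and the sender retries.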

Handling write/read failure
• The client writes a block in smaller data units (usually 64 KB) called packets
• Each data node replies with an ACK for each packet to confirm that it received the packet
• If the client doesn't get ACKs from some node, that node is considered dead
• The client then adjusts the pipeline to skip that node; the block is left under-replicated until the name node repairs it
• Handling a read failure: just read from another node that holds a replica
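
The write path described above might look roughly like this. It is a simplified sketch with assumed names: the in-memory `DataNode` class and `write_block` function are hypothetical, and the client sends each packet to every node directly, whereas a real pipeline forwards packets from node to node.

```python
PACKET_SIZE = 64 * 1024  # 64 KB, the usual packet size mentioned above


class DataNode:
    """Hypothetical in-memory data node used only for this illustration."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.received = bytearray()

    def write_packet(self, packet):
        """Store the packet and return True as the ACK; a dead node never ACKs."""
        if not self.alive:
            return False
        self.received.extend(packet)
        return True


def write_block(block, pipeline):
    """Send a block packet by packet; return the nodes that ACKed every packet."""
    pipeline = list(pipeline)
    for start in range(0, len(block), PACKET_SIZE):
        packet = block[start:start + PACKET_SIZE]
        # Nodes that do not ACK are treated as dead and dropped from the pipeline.
        pipeline = [node for node in pipeline if node.write_packet(packet)]
    return pipeline
```

The returned list can be shorter than the original pipeline, which is exactly the under-replicated case that has to be repaired afterwards.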

Handling write failure (cont'd)
• The name node maintains two tables:
◦ List of blocks – blockA in dn1, dn2, dn8; blockB in dn3, dn7, dn9; …
◦ List of data nodes – dn1 has blockA, blockD; dn2 has blockE, blockG; …
• The name node checks the list of blocks to see whether any block is not properly replicated
• If so, it asks other data nodes to copy the block from data nodes that hold a replica
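
The re-replication check can be sketched as below. This is an illustrative sketch, not the name node's real algorithm: it uses only the list-of-blocks table, the function names are hypothetical, and picking the alphabetically first source and target is an arbitrary choice made to keep the example deterministic.

```python
REPLICATION = 3  # default replication factor mentioned earlier in the slides


def find_under_replicated(block_map, live_nodes):
    """Return {block: live holders} for blocks with fewer than REPLICATION live replicas."""
    result = {}
    for block, holders in block_map.items():
        alive = holders & live_nodes
        if len(alive) < REPLICATION:
            result[block] = alive
    return result


def plan_re_replication(block_map, live_nodes):
    """Return (block, source node, target node) copy instructions."""
    plan = []
    for block, alive in sorted(find_under_replicated(block_map, live_nodes).items()):
        if not alive:
            continue  # no surviving replica to copy from
        source = sorted(alive)[0]
        candidates = sorted(live_nodes - alive)
        missing = REPLICATION - len(alive)
        for target in candidates[:missing]:
            plan.append((block, source, target))
    return plan


# Example using the block names from the slide, with dn8 assumed dead.
block_map = {
    "blockA": {"dn1", "dn2", "dn8"},
    "blockB": {"dn3", "dn7", "dn9"},
}
live_nodes = {"dn1", "dn2", "dn3", "dn7", "dn9"}
print(plan_re_replication(block_map, live_nodes))
```

With dn8 gone, blockA has only two live replicas, so the plan asks one node that does not yet hold it to copy it from a surviving holder.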

Pros
• Very large files
◦ File sizes of xxx MB, GB, TB, PB, …
• Streaming data access
◦ Write-once, read-many.
◦ Efficient at reading the whole dataset.
• Commodity hardware
◦ High reliability and availability.
◦ Doesn't require expensive, highly reliable hardware.

Cons

Conclusion
• HDFS is an Apache Hadoop subproject.
• It is highly fault-tolerant and designed to be deployed on low-cost hardware.
• It offers high throughput but not low latency.
