HDFS: Hadoop Distributed File System
100062123 柯懷貿  100062139 王建鑫  101062401 彭偉慶
Outline
• Introduction
• HDFS – How it works
• Pros and Cons
• Conclusion
柯懷貿 2
Introduction to HDFS
Hadoop Distributed File System
• Created by Doug Cutting, originating from the Nutch project
• The file system of the Hadoop framework
• Built on Remote Procedure Calls (RPC)
• Master/slave architecture
• Yahoo! ran a 10,000-core Hadoop cluster in 2008
• Cloud computing platform, written in Java
• Processes PB-level data
• Distributed computing environment
• Main subprojects: Hadoop MapReduce, HDFS, HBase
• Allows files to be shared over the Internet
• Write-once, read-many
• Access restriction
• Replication & fault tolerance
• Mapping between logical objects & physical objects
MapReduce
HBase
• NoSQL database
• Uses many servers to store PB-level data
HDFS
• Distributed, scalable, and portable
• File replication (default: 3)
• Efficient reads
王建鑫 7
HDFS major roles
Client (user) – reads/writes data from/to the file system
Name node (master) – oversees and coordinates data storage; receives instructions from the client
Data node (slaves) – store data and run computations; receive instructions from the name node
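The division of labour above can be sketched as a toy read path: the client asks the name node where each block lives, then fetches the blocks from data nodes. All class and method names below are illustrative, not the real Hadoop API.

```python
# Toy sketch of the HDFS read path; names are illustrative, not the real API.
class NameNode:
    """Master: holds only metadata (file -> block locations)."""
    def __init__(self):
        # file name -> ordered list of (block_id, [data node ids])
        self.files = {}

    def add_file(self, name, blocks):
        self.files[name] = blocks

    def get_block_locations(self, name):
        return self.files[name]


class DataNode:
    """Slave: holds the actual block bytes."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes


def read_file(name_node, data_nodes, name):
    """Client: fetch each block from the first data node holding a replica."""
    data = b""
    for block_id, locations in name_node.get_block_locations(name):
        data += data_nodes[locations[0]].blocks[block_id]
    return data
```

Note that file data never flows through the name node; the client talks to data nodes directly, which is what lets HDFS scale.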
Rack Awareness
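Rack awareness shapes where replicas go. The common default policy (first replica on the writer's node, second on a different rack, third on the same rack as the second) can be sketched roughly as follows; this is a simplification of HDFS's actual placement logic.

```python
# Rough sketch of rack-aware placement for 3 replicas; a simplification
# of HDFS's default block placement policy.
def place_replicas(writer_node, rack_of, nodes):
    """writer_node: the node writing the block; rack_of: node -> rack id."""
    first = writer_node                              # replica 1: local node
    remote = [n for n in nodes if rack_of[n] != rack_of[first]]
    second = remote[0]                               # replica 2: another rack
    same_remote_rack = [n for n in nodes
                        if rack_of[n] == rack_of[second] and n != second]
    third = same_remote_rack[0]                      # replica 3: same rack as 2
    return [first, second, third]
```

Placing two replicas on one remote rack (rather than three racks) trades a little fault tolerance for less cross-rack write traffic.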
HDFS fault tolerance
Node failure – a data node or the name node is dead
Communication failure – data cannot be sent or retrieved
Data corruption – data corrupted while being sent over the network, or corrupted on the hard disks
Write failure – the data node about to be written to is dead
Read failure – the data node about to be read from is dead
Detecting network failure
Whenever data is sent, an ACK is returned by the receiver
If the ACK is not received (after several retries), the sender assumes the host is dead or the network has failed
A checksum is also sent along with the transmitted data → corrupt data can be detected in transit
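The checksum idea can be illustrated with a CRC32 attached to each transfer (HDFS actually checksums data in small fixed-size chunks; the framing here is simplified).

```python
import zlib

def make_packet(payload: bytes):
    # Sender side: attach a CRC32 checksum to the payload.
    return payload, zlib.crc32(payload)

def verify_packet(payload: bytes, checksum: int) -> bool:
    # Receiver side: recompute and compare; a mismatch means the
    # data was corrupted somewhere in transit (or on disk).
    return zlib.crc32(payload) == checksum
```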
Handling write/read failure
The client writes the block in smaller data units (usually 64KB) called packets
Each data node replies with an ACK for each packet to confirm it received the packet
If the client does not get ACKs from some nodes, those nodes are assumed dead
The client then adjusts the pipeline to skip the dead nodes; the name node later restores the replication level
Handling a read failure: simply read from another node that holds a replica
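A toy version of this pipelined write with ACK tracking (the 64 KB packet size follows the slide; the real client/data-node protocol is considerably more involved):

```python
PACKET_SIZE = 64 * 1024  # packet size from the slide (64 KB); illustrative

def write_block(block, pipeline, alive, packet_size=PACKET_SIZE):
    """Send packets down the pipeline; skip nodes whose ACKs never arrive.

    `alive` stands in for 'nodes that actually ACK'; a real client
    infers this from timeouts and retries.
    """
    packets = [block[i:i + packet_size]
               for i in range(0, len(block), packet_size)]
    for packet in packets:
        # Nodes that fail to ACK are treated as dead and dropped from
        # the pipeline; the write continues with the survivors.
        pipeline = [node for node in pipeline if node in alive]
    return pipeline, len(packets)
```

The block ends up under-replicated on the surviving nodes, which is exactly the condition the name node's bookkeeping (next slide) detects and repairs.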
Handling write failure, cont'd
The name node maintains two tables:
List of blocks – blockA in dn1, dn2, dn8; blockB in dn3, dn7, dn9; …
List of data nodes – dn1 has blockA, blockD; dn2 has blockE, blockG; …
The name node checks the list of blocks to see whether any block is not properly replicated
If so, it asks other data nodes to copy the block from the data nodes that still hold a replica
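The two tables and the re-replication check can be sketched as follows (replication factor 3 per the earlier slide; all names are illustrative):

```python
REPLICATION = 3  # HDFS default replication factor

def build_node_table(block_table):
    """Invert the 'list of blocks' table into the 'list of data nodes' table."""
    node_table = {}
    for block, nodes in block_table.items():
        for node in nodes:
            node_table.setdefault(node, []).append(block)
    return node_table

def under_replicated(block_table, target=REPLICATION):
    """Blocks the name node should ask data nodes to re-copy."""
    return [b for b, nodes in block_table.items() if len(nodes) < target]
```

A block listed on fewer than `target` data nodes is flagged, and the name node schedules copies from the remaining replicas.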
Pros
Very large files
◦ File sizes from xxx MB up to GB, TB, even PB.
Streaming data access
◦ Write-once, read-many.
◦ Efficient at reading whole datasets.
Commodity hardware
◦ High reliability and availability.
◦ Does not require expensive, highly reliable hardware.
彭偉慶 23
Cons
◦ Not designed for low-latency access (favors high throughput).
Conclusion
HDFS is an Apache Hadoop subproject.
Highly fault-tolerant and designed to be deployed on low-cost hardware.
High throughput, but not low latency.