HDFS Hadoop Distributed File System

Page 1: HDFS Hadoop  Distributed File System

HDFS: Hadoop Distributed File System

100062123 柯懷貿
100062139 王建鑫
101062401 彭偉慶

Page 2: HDFS Hadoop  Distributed File System

Outline
• Introduction
• HDFS: How it works
• Pros and Cons
• Conclusion

柯懷貿 2

Page 3: HDFS Hadoop  Distributed File System

Introduction to HDFS

Hadoop Distributed File System

柯懷貿 3

• Established by Doug Cutting
• Nutch project
• File system for the Hadoop framework
• Remote Procedure Call (RPC)
• Master/slave architecture
• Yahoo! was running a 10,000-core Hadoop cluster in 2008

• Cloud computing
• Java
• Processing petabyte-level data
• Distributed computing environment
• Hadoop MapReduce
• HDFS
• HBase

• Allows files to be shared over the Internet
• Write-once-read-many
• Restricting access
• Replication and fault tolerance
• Mapping between logical objects and physical objects

Page 4: HDFS Hadoop  Distributed File System

MapReduce

柯懷貿 4

Page 5: HDFS Hadoop  Distributed File System

HBase

柯懷貿 5

• NoSQL
• Uses several servers to store petabyte-level data

Page 6: HDFS Hadoop  Distributed File System

HDFS

柯懷貿 6

• Distributed, scalable, and portable
• File replication (default: 3)
• Efficient reads
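The replication bullet has a simple arithmetic consequence that a short sketch can make concrete. The function names below are hypothetical and chosen only for illustration; the only fact taken from the slide is the default replication factor of 3.

```python
def raw_storage_bytes(file_size_bytes, replication=3):
    """Raw cluster capacity used when every byte of a file is stored `replication` times."""
    if file_size_bytes < 0 or replication < 1:
        raise ValueError("file size must be >= 0 and replication must be >= 1")
    return file_size_bytes * replication


def tolerated_replica_losses(replication=3):
    """A block stays readable as long as at least one replica survives."""
    return replication - 1


if __name__ == "__main__":
    one_gb = 10**9
    print(raw_storage_bytes(one_gb))    # capacity consumed by a 1 GB file
    print(tolerated_replica_losses())   # replicas that can be lost per block
```

With the default factor, each file costs three times its size in raw disk space, and any single block survives the loss of two of its replicas.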

Page 7: HDFS Hadoop  Distributed File System

王建鑫 7

Page 8: HDFS Hadoop  Distributed File System

HDFS major roles

Client (user): reads data from and writes data to the file system

Name node (master): oversees and coordinates the data storage function; receives instructions from the client

Data nodes (slaves): store data and run computations; receive instructions from the name node
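The three roles above can be illustrated with a toy in-memory model. This is a minimal sketch, not Hadoop's API: the classes `DataNode`, `NameNode`, and `Client`, the round-robin placement, the 4-byte block size, and the client writing directly to every replica are all assumptions made for brevity.

```python
class DataNode:
    """Slave: stores the actual block contents."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block id -> bytes

    def store(self, block_id, data):
        self.blocks[block_id] = data

    def fetch(self, block_id):
        return self.blocks[block_id]


class NameNode:
    """Master: holds metadata only (which data nodes hold which block)."""

    def __init__(self, data_nodes, replication=3):
        self.data_nodes = data_nodes
        self.replication = replication
        self.file_blocks = {}      # path -> list of block ids
        self.block_locations = {}  # block id -> list of DataNode

    def allocate_block(self, path):
        blocks = self.file_blocks.setdefault(path, [])
        block_id = f"{path}#{len(blocks)}"
        # Toy placement (assumption): round-robin over the data nodes.
        start = len(self.block_locations)
        count = min(self.replication, len(self.data_nodes))
        targets = [self.data_nodes[(start + i) % len(self.data_nodes)]
                   for i in range(count)]
        blocks.append(block_id)
        self.block_locations[block_id] = targets
        return block_id, targets

    def locate(self, path):
        return [(b, self.block_locations[b]) for b in self.file_blocks[path]]


class Client:
    """User: asks the name node where blocks live, then talks to data nodes."""

    def __init__(self, name_node, block_size=4):
        self.name_node = name_node
        self.block_size = block_size

    def write(self, path, data):
        for offset in range(0, len(data), self.block_size):
            block_id, targets = self.name_node.allocate_block(path)
            for node in targets:
                node.store(block_id, data[offset:offset + self.block_size])

    def read(self, path):
        # Read each block from the first data node that holds it.
        return b"".join(nodes[0].fetch(block_id)
                        for block_id, nodes in self.name_node.locate(path))


if __name__ == "__main__":
    nodes = [DataNode(f"dn{i}") for i in range(1, 5)]
    client = Client(NameNode(nodes))
    client.write("/demo.txt", b"hello hdfs!")
    print(client.read("/demo.txt"))
```

The point of the split is that file data never passes through the name node: it only answers "where", while the data nodes answer "what".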

王建鑫 8

Page 9: HDFS Hadoop  Distributed File System

王建鑫 9

Page 10: HDFS Hadoop  Distributed File System

王建鑫 10

Page 11: HDFS Hadoop  Distributed File System

Rack Awareness

王建鑫 11

Page 12: HDFS Hadoop  Distributed File System

王建鑫 12

Page 13: HDFS Hadoop  Distributed File System

王建鑫 13

Page 14: HDFS Hadoop  Distributed File System

王建鑫 14

Page 15: HDFS Hadoop  Distributed File System

王建鑫 15

Page 16: HDFS Hadoop  Distributed File System

王建鑫 16

Page 17: HDFS Hadoop  Distributed File System

王建鑫 17

Page 18: HDFS Hadoop  Distributed File System

HDFS fault tolerance

Node failure: a data node or the name node is dead

Communication failure: data cannot be sent or retrieved

Data corruption: data is corrupted while being sent over the network or while stored on the hard disks

Write failure: the data node being written to is dead

Read failure: the data node being read from is dead

王建鑫 18

Page 19: HDFS Hadoop  Distributed File System

王建鑫 19

Page 20: HDFS Hadoop  Distributed File System

Detecting network failure

Whenever data is sent, the receiver replies with an ACK

If the ACK is not received (after several retries), the sender assumes that the host is dead or the network has failed

A checksum is also sent along with the transmitted data, so corruption during transfer can be detected
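The ACK-with-retries and checksum ideas on this slide can be sketched together. This is an illustration under stated assumptions, not HDFS code: `zlib.crc32` stands in for whatever checksum the real system uses, the `channel` callable is a hypothetical model of the network (it returns the frame as delivered, or `None` if it was lost), and the retry count of 3 is arbitrary.

```python
import zlib


def make_frame(payload):
    """Pair the payload with its checksum before sending."""
    return payload, zlib.crc32(payload)


def receive(frame):
    """Receiver side: ACK (True) only if the checksum matches the payload."""
    payload, checksum = frame
    return zlib.crc32(payload) == checksum


def send_with_retries(payload, channel, max_retries=3):
    """Send and wait for an ACK; give up after the initial try plus max_retries.

    Returns True if an ACK arrived, False if the sender should assume
    that the host is dead or the network has failed.
    """
    frame = make_frame(payload)
    for _ in range(1 + max_retries):
        delivered = channel(frame)
        if delivered is not None and receive(delivered):
            return True
    return False


if __name__ == "__main__":
    def reliable(frame):
        return frame

    def dead(frame):
        return None

    def corrupting(frame):
        return frame[0][:-1] + b"X", frame[1]

    print(send_with_retries(b"block data", reliable))
    print(send_with_retries(b"block data", dead))
    print(send_with_retries(b"block data", corrupting))
```

A lost frame and a corrupted frame look the same to the sender here: in both cases no valid ACK comes back, so the retry logic covers both.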

王建鑫 20

Page 21: HDFS Hadoop  Distributed File System

Handling the write/read failure

The client writes a block in smaller data units called packets (usually 64 KB)

Each data node replies with an ACK for each packet to confirm that it received the packet

If the client does not get ACKs from some node, that node is detected as dead

The client then adjusts the pipeline to skip that node (what then?)

Handling a read failure: just read from another node
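The write path described above can be sketched as a small simulation. It is a simplified model under assumptions: `ToyDataNode` and `write_block` are hypothetical names, a dead node is modelled as one that never ACKs, and the client talks to each node directly instead of forwarding packets along a chain.

```python
PACKET_SIZE = 64 * 1024  # 64 KB, the packet size quoted on the slide


def split_into_packets(block, packet_size=PACKET_SIZE):
    """Cut one block into fixed-size packets (the last one may be shorter)."""
    return [block[i:i + packet_size] for i in range(0, len(block), packet_size)]


class ToyDataNode:
    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.packets = []

    def receive(self, packet):
        """Store the packet and ACK it; a dead node never ACKs."""
        if not self.alive:
            return False
        self.packets.append(packet)
        return True


def write_block(block, pipeline, packet_size=PACKET_SIZE):
    """Write a block packet by packet, dropping nodes that fail to ACK.

    Returns the nodes that remained in the pipeline until the end.
    """
    pipeline = list(pipeline)
    for packet in split_into_packets(block, packet_size):
        acked = [node for node in pipeline if node.receive(packet)]
        if not acked:
            raise IOError("no data node acknowledged the packet")
        pipeline = acked  # skip nodes detected as dead
    return pipeline


if __name__ == "__main__":
    nodes = [ToyDataNode("dn1"), ToyDataNode("dn2", alive=False), ToyDataNode("dn3")]
    survivors = write_block(b"x" * (150 * 1024), nodes)
    print([node.name for node in survivors])
```

After such a write the block exists on fewer nodes than intended, which is the situation the name node's re-replication check has to repair.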

王建鑫 21

Page 22: HDFS Hadoop  Distributed File System

Handling the write failure (cont’d)

The name node maintains two tables:

List of blocks: blockA in dn1, dn2, dn8; blockB in dn3, dn7, dn9; …

List of data nodes: dn1 has blockA, blockD; dn2 has blockE, blockG; …

The name node checks the list of blocks to see whether any block is not properly replicated

If so, it asks other data nodes to copy the block from data nodes that hold a replica
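The two tables and the re-replication check can be sketched with plain dictionaries. This is an illustrative model only: the function names are hypothetical, the target replication of 3 and the "pick the first node in sorted order" rule are assumptions, and the block and node names reuse the slide's example.

```python
def build_node_table(block_table):
    """Derive the 'list of data nodes' table from the 'list of blocks' table."""
    node_table = {}
    for block, nodes in block_table.items():
        for node in nodes:
            node_table.setdefault(node, set()).add(block)
    return node_table


def handle_node_failure(block_table, dead_node, live_nodes, replication=3):
    """Drop a dead node from the block table and plan the copies needed.

    Returns a list of (block, source_node, target_node) copy instructions
    and updates block_table as if those copies had completed.
    """
    plan = []
    for block in sorted(block_table):
        holders = block_table[block]
        holders.discard(dead_node)
        while len(holders) < replication:
            candidates = sorted(n for n in live_nodes
                                if n != dead_node and n not in holders)
            if not holders or not candidates:
                break  # nothing to copy from, or nowhere to copy to
            source, target = sorted(holders)[0], candidates[0]
            plan.append((block, source, target))
            holders.add(target)
    return plan


if __name__ == "__main__":
    blocks = {
        "blockA": {"dn1", "dn2", "dn8"},
        "blockB": {"dn3", "dn7", "dn9"},
    }
    live = {"dn1", "dn2", "dn3", "dn7", "dn9"}
    print(handle_node_failure(blocks, "dn8", live))
    print(build_node_table(blocks))
```

Keeping both views means the name node can answer "where is this block?" and "what was lost when this node died?" without scanning everything.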

王建鑫 22

Page 23: HDFS Hadoop  Distributed File System

Pros

Very large files
◦ A file size can exceed xxx MB, GB, TB, PB, ...

Streaming data access
◦ Write-once, read-many.
◦ Efficient at reading the whole dataset.

Commodity hardware
◦ High reliability and availability.
◦ Doesn’t require expensive, highly reliable hardware.

彭偉慶 23

Page 24: HDFS Hadoop  Distributed File System

Cons

彭偉慶 24

Page 25: HDFS Hadoop  Distributed File System

Conclusion

HDFS: an Apache Hadoop subproject.

Highly fault-tolerant and designed to be deployed on low-cost hardware.

High throughput but not low latency.

彭偉慶 25