
The Google File System


Page 1: The Google File System

The Google File System

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, Google*

정학수, 최주영

Page 2: The Google File System

Outline

Introduction
Design Overview
System Interactions
Master Operation
Fault Tolerance and Diagnosis
Conclusions

Page 3: The Google File System

Introduction

GFS was designed to meet the demands of Google’s data processing needs.

Emphasis on design:
◦ Component failures
◦ Files are huge
◦ Most files are mutated by appending

Page 4: The Google File System

DESIGN OVERVIEW

Page 5: The Google File System

Assumptions

◦ The system is built from inexpensive components that often fail
◦ Files are typically 100 MB or larger
◦ Large streaming reads and small random reads
◦ Large, sequential writes that append data to files
◦ Atomicity with minimal synchronization overhead is essential
◦ High sustained bandwidth is more important than low latency

Page 6: The Google File System

Interface

Files are organized hierarchically in directories and identified by pathnames.

Operation       Function
Create          Create a file
Delete          Delete a file
Open            Open a file
Close           Close a file
Read            Read a file
Write           Write a file
Snapshot        Create a copy of a file or a directory tree
Record append   Allow multiple clients to append data to the same file
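
One way to picture this interface: a client library exposing roughly these operations. The sketch below is illustrative only (the class, method, and Handle names are hypothetical, not the actual GFS client API).

    # Hypothetical sketch of a GFS-style client interface; all names are illustrative.
    class GFSClient:
        def create(self, path: str) -> None: ...            # create a file
        def delete(self, path: str) -> None: ...            # delete a file
        def open(self, path: str) -> "Handle": ...          # open a file, returning a handle
        def close(self, handle: "Handle") -> None: ...      # close an open handle
        def read(self, handle: "Handle", offset: int, length: int) -> bytes: ...
        def write(self, handle: "Handle", offset: int, data: bytes) -> None: ...

        def snapshot(self, src_path: str, dst_path: str) -> None:
            """Create a copy of a file or a directory tree (copy-on-write)."""

        def record_append(self, handle: "Handle", data: bytes) -> int:
            """Append data atomically; GFS chooses the offset and returns it."""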

Page 7: The Google File System

Architecture

Google File System is designed for system-to-system interaction, not for user-to-system interaction.

Page 8: The Google File System


Single Master

Page 9: The Google File System

Chunk Size

Large chunk size: 64 MB

◦ Advantages
   Reduce client-master interaction
   Reduce network overhead
   Reduce the size of metadata
◦ Disadvantages
   Hot spots: many clients accessing the same file
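
One way to see why the large chunk size reduces client-master interaction: a client translates a byte offset into a chunk index and asks the master once per chunk, so a single lookup covers 64 MB of data. A small sketch of that translation (the helper name is illustrative):

    CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

    def chunk_index(byte_offset: int) -> int:
        """Translate a byte offset within a file into the chunk index that the
        client asks the master about; one lookup then covers a whole chunk."""
        return byte_offset // CHUNK_SIZE

    # Example: a 1 GB file spans only 16 chunks, so a sequential reader needs
    # at most 16 master lookups for the entire file.
    assert chunk_index(0) == 0
    assert chunk_index(200 * 1024 * 1024) == 3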

Page 10: The Google File System

Metadata

All metadata is kept in the master’s memory.
Less than 64 bytes of metadata per chunk.

Types:
◦ File and chunk namespaces
◦ File-to-chunk mapping
◦ Location of each chunk’s replicas
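
A minimal sketch of how these three metadata types might be laid out in the master's memory (the field and table names here are illustrative, not taken from the paper):

    from dataclasses import dataclass, field

    @dataclass
    class ChunkInfo:
        version: int = 0                              # chunk version number (see stale replica detection)
        replicas: list = field(default_factory=list)  # chunkserver locations of this chunk's replicas

    @dataclass
    class FileInfo:
        chunks: list = field(default_factory=list)    # ordered chunk handles making up the file

    # The master's in-memory tables: the file and chunk namespace,
    # the file-to-chunk mapping, and each chunk's replica locations.
    namespace = {}    # full pathname -> FileInfo
    chunk_table = {}  # chunk handle  -> ChunkInfo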

Page 11: The Google File System

Metadata (Cont’d)

In-memory data structures
◦ Master operations are fast
◦ Easy and efficient to scan periodically

Operation log
◦ Contains a historical record of critical metadata changes
◦ Replicated on multiple remote machines
◦ Respond to the client only after the log record has been written
◦ Recovery by replaying the operation log
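
A toy sketch of the logging discipline above, assuming a simple line-based log and hypothetical remote log writers: the record is written locally and to the remote replicas before the change is applied and the client is answered.

    import json

    class OperationLog:
        """Toy operation log: append each record locally, then to remote replicas."""
        def __init__(self, local_path, remote_replicas):
            self.local = open(local_path, "a")
            self.remote_replicas = remote_replicas   # hypothetical remote log writers

        def append(self, record: dict) -> None:
            line = json.dumps(record) + "\n"
            self.local.write(line)
            self.local.flush()                       # flush the record locally
            for replica in self.remote_replicas:     # replicate before acknowledging
                replica.write(line)

    def handle_metadata_mutation(log, apply_fn, record):
        log.append(record)   # 1. log the change locally and on remote replicas
        apply_fn(record)     # 2. apply it to the in-memory metadata
        return "ok"          # 3. only now respond to the client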

Page 12: The Google File System

Consistency Model

Consistent
◦ All clients will always see the same data, regardless of which replicas they read from

Defined
◦ Consistent, and clients will see what the mutation writes in its entirety

Inconsistent
◦ Different clients may see different data at different times

Page 13: The Google File System

SYSTEM INTERACTIONS

Page 14: The Google File System

Leases and Mutation Order

Leases
◦ Maintain a consistent mutation order across replicas while minimizing management overhead
◦ The master grants a lease to one of the replicas, which becomes the primary
◦ The primary picks a serial order for the mutations
◦ When applying mutations, all replicas follow that order
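
A minimal sketch of this ordering rule (class names illustrative): the primary assigns consecutive serial numbers to mutations, and every replica applies them strictly in serial-number order, buffering anything that arrives early.

    class PrimaryReplica:
        """Holds the lease and picks a single serial order for mutations."""
        def __init__(self):
            self.next_serial = 0

        def assign_order(self, mutation):
            serial = self.next_serial
            self.next_serial += 1
            return serial, mutation

    class Replica:
        """Applies mutations strictly in the serial order chosen by the primary."""
        def __init__(self):
            self.applied = []
            self.pending = {}
            self.expected = 0

        def apply(self, serial, mutation):
            self.pending[serial] = mutation
            while self.expected in self.pending:          # apply in order only
                self.applied.append(self.pending.pop(self.expected))
                self.expected += 1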

Page 15: The Google File System

Leases and Mutation Order (Cont’d)

Page 16: The Google File System

Data Flow

Fully utilize network bandwidth
◦ Decouple control flow and data flow

Avoid network bottlenecks and high-latency links
◦ Forward the data to the closest machine

Minimize latency
◦ Pipeline the data transfer
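
A sketch of the "forward to the closest machine" idea: data is pushed along a chain of chunkservers, each hop choosing the nearest machine that has not yet received it. The distance function here is a stand-in for whatever topology estimate is available.

    def build_forwarding_chain(source, replicas, distance):
        """Order replicas so each hop goes to the closest not-yet-visited machine."""
        chain, current, remaining = [], source, set(replicas)
        while remaining:
            nxt = min(remaining, key=lambda r: distance(current, r))
            chain.append(nxt)
            remaining.remove(nxt)
            current = nxt
        return chain

    # Toy distance: machines sharing a rack prefix are considered closer.
    def toy_distance(a, b):
        return 0 if a.split("-")[0] == b.split("-")[0] else 1

    chain = build_forwarding_chain("rackA-client", ["rackB-cs1", "rackA-cs2", "rackB-cs3"], toy_distance)
    # The chain starts with the same-rack server "rackA-cs2".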

Page 17: The Google File System

Atomic Record Appends

Record append: an atomic append operation
◦ The client specifies only the data
◦ GFS appends the data at an offset of its own choosing and returns that offset to the client
◦ Many clients can append to the same file concurrently
   Such files often serve as multiple-producer/single-consumer queues
   or contain merged results from many clients
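
A sketch of the multiple-producer/single-consumer pattern this enables, reusing the hypothetical client interface sketched earlier: producers pass only their data, GFS returns the offset it chose, and no extra locking is needed between writers.

    def produce(client, handle, record: bytes) -> int:
        # The producer supplies only the data; GFS picks the offset atomically.
        return client.record_append(handle, record)

    def consume(client, handle, offset: int, length: int) -> bytes:
        # A single consumer later reads the merged records sequentially.
        return client.read(handle, offset, length)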

Page 18: The Google File System

Snapshot

◦ Make a copy of a file or a directory tree
◦ Standard copy-on-write
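
A minimal copy-on-write sketch of the snapshot idea (the reference-count bookkeeping here is illustrative): a snapshot only bumps the reference counts of the chunks it shares, and a shared chunk is duplicated the next time something writes to it.

    refcount = {}  # chunk handle -> number of files referencing it

    def snapshot(file_chunks):
        """Snapshot a file by sharing its chunks instead of copying them."""
        for handle in file_chunks:
            refcount[handle] = refcount.get(handle, 1) + 1
        return list(file_chunks)               # the copy references the same chunks

    def write_chunk(handle, make_copy):
        """Copy-on-write: duplicate a chunk only if it is still shared."""
        if refcount.get(handle, 1) > 1:
            new_handle = make_copy(handle)
            refcount[handle] -= 1
            refcount[new_handle] = 1
            return new_handle                  # writes now go to the private copy
        return handle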

Page 19: The Google File System

MASTER OPERATION

Page 20: The Google File System

Namespace Management and Locking

Namespace
◦ A lookup table mapping full pathnames to metadata

Locking
◦ Multiple operations can be active at once; locks over regions of the namespace ensure proper serialization
◦ Allows concurrent mutations in the same directory
◦ Deadlock is prevented by acquiring locks in a consistent total order
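
A sketch of which locks a single operation would take under this scheme (paths and helper names illustrative): read locks on every ancestor directory name and a write lock on the full pathname, with locks acquired in a consistent order so concurrent operations cannot deadlock.

    def locks_needed(pathname):
        """Return (read locks on ancestors, write lock on the leaf pathname)."""
        parts = pathname.strip("/").split("/")
        ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
        write_lock = "/" + "/".join(parts)
        # Sorting is a stand-in for the consistent total order used to avoid deadlock.
        return sorted(ancestors), write_lock

    reads, write = locks_needed("/home/user/foo")
    # reads == ['/home', '/home/user'], write == '/home/user/foo'
    # Two files in the same directory share only read locks on that directory,
    # so their mutations can proceed concurrently.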

Page 21: The Google File System

Replica Placement

Maximize data reliability and availability
Maximize network bandwidth utilization
◦ Spread replicas across machines
◦ Spread chunk replicas across racks
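
A toy sketch of such a placement policy (the rack map and the one-server-per-rack choice are illustrative): drawing replicas from different racks means a single rack failure cannot take out every copy, and reads can use the bandwidth of several racks.

    def place_replicas(servers_by_rack, count=3):
        """Pick up to `count` chunkservers, at most one per rack (simplified)."""
        placement = []
        for rack, servers in servers_by_rack.items():
            if len(placement) == count:
                break
            if servers:
                placement.append(servers[0])
        return placement

    print(place_replicas({"rack-a": ["cs1", "cs2"], "rack-b": ["cs3"], "rack-c": ["cs4"]}))
    # ['cs1', 'cs3', 'cs4']  -- one replica on each of three racks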

Page 22: The Google File System

Creation, Re-replication, Rebalancing

Creation
◦ Demanded by writers

Re-replication
◦ Triggered when the number of available replicas falls below a user-specified goal

Rebalancing
◦ For better disk space and load balancing
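
Re-replication ordering can be pictured as a priority queue keyed by how far each chunk has fallen below its replication goal; the chunks missing the most replicas are cloned first. A toy sketch (the tuple layout is illustrative):

    def rereplication_queue(chunks):
        """chunks: iterable of (chunk_handle, live_replicas, replication_goal)."""
        deficits = [(goal - live, handle)
                    for handle, live, goal in chunks if live < goal]
        deficits.sort(reverse=True)          # largest replica deficit first
        return [handle for _, handle in deficits]

    # A chunk with 1 of 3 replicas is restored before one with 2 of 3.
    print(rereplication_queue([("c1", 2, 3), ("c2", 1, 3)]))   # ['c2', 'c1']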

Page 23: The Google File System

Garbage Collection

Lazy reclamation
◦ Log the deletion immediately
◦ Rename the file to a hidden name with a deletion timestamp
   Removed 3 days later
   Undelete by renaming back to the normal name

Regular scan
◦ Heartbeat messages exchanged with each chunkserver
◦ Identify orphaned chunks and erase their metadata
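
A toy sketch of lazy reclamation, using the hidden-name-plus-timestamp scheme described above (the exact naming format is illustrative):

    import time

    HIDE_PREFIX = ".deleted."
    GRACE_SECONDS = 3 * 24 * 3600            # keep hidden files for 3 days

    def hide(namespace, path, now=None):
        """Deletion: log it, then rename to a hidden name carrying a timestamp."""
        now = now or time.time()
        hidden = f"{HIDE_PREFIX}{int(now)}.{path.strip('/').replace('/', '_')}"
        namespace[hidden] = namespace.pop(path)
        return hidden                         # renaming back undeletes the file

    def sweep(namespace, now=None):
        """Regular scan: erase metadata of hidden files older than 3 days."""
        now = now or time.time()
        for name in list(namespace):
            if name.startswith(HIDE_PREFIX):
                ts = int(name.split(".")[2])
                if now - ts > GRACE_SECONDS:
                    del namespace[name]       # chunks become orphans, reclaimed later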

Page 24: The Google File System

Stale Replica Detection

The master maintains a chunk version number
◦ Used to detect stale replicas
Stale replicas are removed during regular garbage collection
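
A small sketch of the version check (table contents illustrative): the master records an authoritative version per chunk, and any replica reporting an older version missed a mutation while its chunkserver was down, so it is treated as stale.

    master_version = {"chunk-42": 7}          # authoritative version per chunk

    def is_stale(chunk_handle, reported_version):
        return reported_version < master_version[chunk_handle]

    print(is_stale("chunk-42", 6))   # True  -> reclaimed in regular garbage collection
    print(is_stale("chunk-42", 7))   # False -> up to date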

Page 25: The Google File System

FAULT TOLERANCE AND DIAGNOSIS


Page 26: The Google File System

High Availability

Fast recovery
◦ Restore state and start in seconds

Chunk replication
◦ Different replication levels for different parts of the file namespace
◦ The master clones existing replicas as chunkservers go offline or as it detects corrupted replicas through checksum verification

Page 27: The Google File System

High Availability

Master replication
◦ The operation log and checkpoints are replicated on multiple machines
◦ If the master machine or its disk fails, monitoring infrastructure outside GFS starts a new master process
◦ Shadow masters
   Provide read-only access when the primary master is down

Page 28: The Google File System

Data Integrity

Checksums
◦ Used to detect corruption
◦ One for every 64 KB block in each chunk
◦ Kept in memory and stored persistently with logging

Read
◦ The chunkserver verifies the checksum before returning data

Write
◦ Append
   Incrementally update the checksum for the last partial block
   Compute new checksums for any new blocks
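
A sketch of the read-path check, using zlib.crc32 as a stand-in checksum and the 64 KB block size from the slide: the chunkserver verifies every block overlapping the requested range before returning any data.

    import zlib

    BLOCK = 64 * 1024  # checksum granularity: 64 KB blocks within a chunk

    def block_checksums(chunk_data: bytes):
        return [zlib.crc32(chunk_data[i:i + BLOCK])
                for i in range(0, len(chunk_data), BLOCK)]

    def verified_read(chunk_data: bytes, stored, offset: int, length: int) -> bytes:
        first, last = offset // BLOCK, (offset + length - 1) // BLOCK
        for i in range(first, last + 1):
            if zlib.crc32(chunk_data[i * BLOCK:(i + 1) * BLOCK]) != stored[i]:
                raise IOError("checksum mismatch: report the error, read another replica")
        return chunk_data[offset:offset + length]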

Page 29: The Google File System

Data Integrity (Cont’d)

Write
◦ Overwrite
   Read and verify the first and last blocks, then write
   Compute and record the new checksums

During idle periods
◦ Chunkservers scan and verify inactive chunks

Page 30: The Google File System

MEASUREMENTS

Page 31: The Google File System

Micro-benchmarks

GFS cluster
◦ 1 master
◦ 2 master replicas
◦ 16 chunkservers
◦ 16 clients

Server machines are connected to one switch and client machines to the other; the two switches are connected by a 1 Gbps link.

Page 32: The Google File System


Micro-benchmarks

Figure 3: Aggregate Throughputs. Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.

Page 33: The Google File System


Real World Clusters

Table 2: Characteristics of two GFS clusters

Page 34: The Google File System

Real World Clusters

Table 3: Performance metrics for two GFS clusters

Page 35: The Google File System

Real World Clusters

In cluster B:
◦ Killed a single chunkserver containing 15,000 chunks (600 GB of data)
   All chunks were restored in 23.2 minutes, at an effective replication rate of 440 MB/s
◦ Killed two chunkservers, each with roughly 16,000 chunks (660 GB of data)
   266 chunks were left with only a single replica
   These were given higher priority and restored within 2 minutes
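
As a rough check on these figures, restoring 600 GB in 23.2 minutes works out to about 600 × 1024 MB / (23.2 × 60 s) ≈ 440 MB/s, which matches the reported effective replication rate.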

Page 36: The Google File System

Conclusions

GFS demonstrates the qualities essential for supporting large-scale processing workloads
◦ Treats component failure as the norm
◦ Optimizes for huge files
◦ Extends and relaxes the standard file system interface

Fault tolerance is provided by
◦ Constant monitoring
◦ Replicating crucial data
◦ Fast and automatic recovery
◦ Checksums to detect data corruption

Delivers high aggregate throughput