The Google File System
Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
Google*
정학수, 최주영
1
Outline
Introduction
Design Overview
System Interactions
Master Operation
Fault Tolerance and Diagnosis
Conclusions
2
Introduction
GFS was designed to meet the demands of Google’s data processing needs.
Emphasis in the design:
◦ Component failures are the norm
◦ Files are huge
◦ Most files are mutated by appending
3
DESIGN OVERVIEW
4
Assumptions
Built from many inexpensive components that often fail
Files are typically 100 MB or larger in size
Large streaming reads and small random reads
Large, sequential writes that append data to files
Atomicity with minimal synchronization overhead is essential
High sustained bandwidth is more important than low latency
5
Interface
6
Files are organized hierarchically in directories and identified by pathnames

Operation       Function
Create          Create file
Delete          Delete file
Open            Open file
Close           Close file
Read            Read file
Write           Write file
Snapshot        Create a copy of a file or a directory tree
Record append   Allow multiple clients to append data to the same file
Google File System: designed for system-to-system interaction, and not for user-to-system interaction.
7
Architecture
8
Single Master
Chunk Size
Large chunk size: 64 MB (see the sketch below)
◦ Advantages
  Reduces client-master interaction
  Reduces network overhead
  Reduces the size of metadata
◦ Disadvantages
  Hot spots: many clients accessing the same file
9
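A minimal sketch of the client-side offset arithmetic, in Python with hypothetical names (this is not GFS client code): with fixed 64 MB chunks, the client turns a file byte offset into a chunk index before asking the master, so one master lookup covers a 64 MB range of subsequent reads.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the fixed chunk size

def locate(byte_offset):
    """Translate a file byte offset into (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: a read at byte 200,000,000 falls in chunk index 2.
print(locate(200_000_000))  # (2, 65782272)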
Metadata
All metadata is kept in the master’s memory (rough size estimate below)
Less than 64 bytes of metadata per chunk
Types
◦ File and chunk namespaces
◦ File-to-chunk mapping
◦ Location of each chunk’s replicas
10
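A back-of-envelope check of why in-memory metadata is practical (my own arithmetic, not a figure from the slides): with 64 MB chunks and under 64 bytes of metadata per chunk, even a petabyte of file data needs only about a gigabyte of master memory.

CHUNK_SIZE = 64 * 1024 ** 2           # 64 MB per chunk
METADATA_PER_CHUNK = 64               # upper bound, in bytes

file_data = 1024 ** 5                 # 1 PB of file data
chunks = file_data // CHUNK_SIZE      # 16,777,216 chunks
metadata_gib = chunks * METADATA_PER_CHUNK / 1024 ** 3

print(chunks, metadata_gib)           # 16777216 1.0 -> roughly 1 GiB of metadata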
Metadata (Cont’d)
In-memory data structures
◦ Master operations are fast
◦ Periodic scans of the entire state are easy and efficient
Operation log (see the sketch below)
◦ Contains a historical record of critical metadata changes
◦ Replicated on multiple remote machines
◦ The master responds to a client only after the log record has been written
◦ Recovery by replaying the operation log
11
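A minimal write-ahead-logging sketch of the operation-log idea (hypothetical Python, not the master’s actual implementation): each metadata change is appended and flushed to the log before it is applied in memory and acknowledged, and recovery simply replays the log.

import json, os

class Master:
    def __init__(self, log_path):
        self.log_path = log_path
        self.namespace = {}                      # in-memory metadata

    def mutate(self, op, path, value=None):
        record = {"op": op, "path": path, "value": value}
        with open(self.log_path, "a") as log:
            log.write(json.dumps(record) + "\n") # append the log record...
            log.flush()
            os.fsync(log.fileno())               # ...and make it durable
        self._apply(record)                      # only then apply in memory
        return "ok"                              # and reply to the client

    def _apply(self, record):
        if record["op"] == "create":
            self.namespace[record["path"]] = record["value"]
        elif record["op"] == "delete":
            self.namespace.pop(record["path"], None)

    def recover(self):
        """Rebuild in-memory metadata by replaying the operation log."""
        self.namespace = {}
        with open(self.log_path) as log:
            for line in log:
                self._apply(json.loads(line))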
Consistency Model
Consistent
◦ All clients will always see the same data, regardless of which replica they read from
Defined
◦ Consistent, and clients will see what the mutation writes in its entirety
Inconsistent
◦ Different clients may see different data at different times
12
SYSTEM INTERACTION
13
Leases and Mutation Order
Leases (see the sketch below)
◦ Maintain a consistent mutation order across replicas while minimizing management overhead
◦ The master grants a lease to one of the replicas, which becomes the primary
◦ The primary picks a serial order for the mutations
◦ When applying mutations, all replicas follow that order
14
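A toy sketch of the serial-order idea (hypothetical Python, ignoring lease expiry and failures): the primary assigns each mutation a serial number, applies it locally, and forwards it so every secondary applies mutations in exactly the same order.

class Replica:
    def __init__(self):
        self.applied = []                  # mutations applied, in order

    def apply(self, serial, mutation):
        self.applied.append((serial, mutation))

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial = self.next_serial          # the primary picks the serial order
        self.next_serial += 1
        self.apply(serial, mutation)       # apply locally...
        for s in self.secondaries:         # ...and have the secondaries follow it
            s.apply(serial, mutation)
        return serial

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.mutate("write A")
primary.mutate("write B")
# Every replica now holds [(0, 'write A'), (1, 'write B')] in the same order.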
Leases and Mutation Order (Cont’d)
15
Data Flow
Fully utilize network bandwidth
◦ Decouple the control flow from the data flow
Avoid network bottlenecks and high-latency links
◦ Each machine forwards the data to the closest machine
Minimize latency (see the estimate below)
◦ Pipeline the data transfer
16
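The paper estimates the ideal elapsed time for pushing B bytes through a pipelined chain of R replicas as B/T + R*L, with T the per-link throughput (100 Mbps) and L the per-hop latency. A quick check of that arithmetic (hypothetical Python, my own function name):

def pipelined_transfer_time(num_bytes, replicas, link_bps=100e6, hop_latency=0.001):
    """Ideal elapsed time B/T + R*L for pipelined forwarding along a replica chain."""
    return num_bytes * 8 / link_bps + replicas * hop_latency

# 1 MB pushed to 3 replicas over 100 Mbps links with ~1 ms hops: about 83 ms.
print(pipelined_transfer_time(1_000_000, 3))   # 0.083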
Atomic Record Appends
Record append: an atomic append operation (see the sketch below)
◦ The client specifies only the data
◦ GFS appends the data at an offset of its own choosing and returns that offset to the client
◦ Many clients can append to the same file concurrently
  Such files often serve as multiple-producer/single-consumer queues
  or contain merged results from many clients
17
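A simplified single-machine sketch of the record-append contract (hypothetical Python, ignoring replication and retries): the caller supplies only the data, the file system picks the offset, padding to a fresh chunk when the record would straddle a chunk boundary, and returns that offset.

CHUNK_SIZE = 64 * 1024 * 1024

class AppendOnlyFile:
    def __init__(self):
        self.data = bytearray()

    def record_append(self, record):
        """Append the record at an offset of the file system's choosing; return it."""
        used = len(self.data) % CHUNK_SIZE
        if used + len(record) > CHUNK_SIZE:                 # record would straddle a chunk:
            self.data.extend(b"\0" * (CHUNK_SIZE - used))   # pad out the current chunk
        offset = len(self.data)
        self.data.extend(record)
        return offset                                       # the caller learns where it landed

f = AppendOnlyFile()
print(f.record_append(b"producer-1 result"))                # 0
print(f.record_append(b"producer-2 result"))                # 17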
18
Snapshot
Make a copy of a file or a directory tree
Uses standard copy-on-write techniques (see the sketch below)
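A minimal copy-on-write sketch (hypothetical Python, collapsed to a single process): taking a snapshot only bumps reference counts on the shared chunks; the first write to a shared chunk copies it, so the snapshot keeps seeing the old data.

class CowStore:
    def __init__(self):
        self.chunks = {}        # chunk id -> data
        self.refcount = {}      # chunk id -> number of files referencing it
        self.next_id = 0

    def new_chunk(self, data):
        cid, self.next_id = self.next_id, self.next_id + 1
        self.chunks[cid], self.refcount[cid] = data, 1
        return cid

    def snapshot(self, chunk_ids):
        """Snapshotting only increments reference counts; no data is copied yet."""
        for cid in chunk_ids:
            self.refcount[cid] += 1
        return list(chunk_ids)

    def write(self, chunk_ids, index, data):
        """Copy-on-write: duplicate a shared chunk before mutating it."""
        cid = chunk_ids[index]
        if self.refcount[cid] > 1:
            self.refcount[cid] -= 1
            cid = self.new_chunk(self.chunks[cid])
            chunk_ids[index] = cid
        self.chunks[cid] = data

store = CowStore()
file_chunks = [store.new_chunk(b"v1")]
snap_chunks = store.snapshot(file_chunks)
store.write(file_chunks, 0, b"v2")
print(store.chunks[snap_chunks[0]], store.chunks[file_chunks[0]])  # b'v1' b'v2'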
MASTER OPERATION
19
Namespace Management and Locking
Namespace
◦ A lookup table mapping full pathnames to metadata
Locking (see the sketch below)
◦ To keep multiple operations active and ensure proper serialization, locks are taken over regions of the namespace
◦ Allows concurrent mutations in the same directory
◦ Deadlock is prevented by acquiring locks in a consistent total order
20
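A sketch of the locking pattern described above (hypothetical Python; the real master keeps a read-write lock per namespace node, plain locks are used here for brevity): an operation locks every ancestor directory plus the leaf it mutates, and always acquires locks in one consistent total order so concurrent operations cannot deadlock.

import threading

locks = {}                        # namespace path -> lock

def lock_for(path):
    return locks.setdefault(path, threading.Lock())

def ancestors(path):
    parts = path.strip("/").split("/")
    return ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]

def mutate(path, action):
    # Acquire in a consistent total order (sorted by path) to avoid deadlock.
    needed = sorted(ancestors(path) + [path])
    acquired = []
    try:
        for p in needed:          # ancestors plus the leaf being mutated
            lock_for(p).acquire()
            acquired.append(p)
        action()
    finally:
        for p in reversed(acquired):
            lock_for(p).release()

mutate("/home/user/file", lambda: print("created /home/user/file"))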
Replica Placement
Maximize data reliability and availability
Maximize network bandwidth utilization
◦ Spread replicas across machines
◦ Spread chunk replicas across racks
21
Creation, Re-replication, Rebalancing
Creation
◦ Demanded by writers
Re-replication
◦ When the number of available replicas falls below a user-specified goal
Rebalancing
◦ For better disk-space and load balancing
22
Garbage Collection
Lazy reclamation (see the sketch below)
◦ Log the deletion immediately
◦ Rename the file to a hidden name carrying a deletion timestamp
  Removed 3 days later
  Undelete by renaming back to the normal name
Regular scan
◦ Heartbeat messages exchanged with each chunkserver
◦ Identify orphaned chunks and erase their metadata
23
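A toy sketch of lazy reclamation (hypothetical Python; the names and the in-memory namespace are mine): deleting renames the file to a hidden name that records the deletion time, and the regular scan reclaims hidden entries once the 3-day grace period has passed (until then they can be undeleted by renaming back).

import time

GRACE_PERIOD = 3 * 24 * 3600              # hidden files are kept for 3 days

namespace = {"/logs/a": b"...", "/logs/b": b"..."}

def delete(path):
    """'Delete' just renames to a hidden name carrying the deletion timestamp."""
    namespace[f"{path}.deleted.{int(time.time())}"] = namespace.pop(path)

def undelete(hidden):
    namespace[hidden.split(".deleted.")[0]] = namespace.pop(hidden)

def garbage_collect(now=None):
    """Regular scan: reclaim hidden entries whose grace period has expired."""
    now = now if now is not None else time.time()
    for name in list(namespace):
        if ".deleted." in name and now - int(name.rsplit(".", 1)[1]) > GRACE_PERIOD:
            del namespace[name]

delete("/logs/a")
garbage_collect()                                      # too soon: nothing reclaimed
garbage_collect(now=time.time() + GRACE_PERIOD + 1)    # now /logs/a is gone
print(sorted(namespace))                               # ['/logs/b']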
Stale Replica Detection
Maintain a chunk version number (see the sketch below)
◦ Detect stale replicas
Remove stale replicas during regular garbage collection
24
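A small sketch of the version-number check (hypothetical Python data, not real GFS state): the master bumps a chunk's version each time it grants a new lease, so any replica reporting an older version is stale and is left for the regular garbage-collection pass.

master_version = {"chunk-42": 7}              # master's view after the latest lease

replica_versions = {                          # versions reported via heartbeats
    "cs1": {"chunk-42": 7},
    "cs2": {"chunk-42": 6},                   # missed the last lease grant
}

def stale_replicas(chunk):
    current = master_version[chunk]
    return [cs for cs, chunks in replica_versions.items()
            if chunks.get(chunk, -1) < current]

print(stale_replicas("chunk-42"))             # ['cs2'] -> reclaimed by GC later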
FAULT TOLERANCE AND DIAGNOSIS
25
High Availability
Fast recovery
◦ Restore state and start in seconds
Chunk replication
◦ Different replication levels for different parts of the file namespace
◦ The master clones existing replicas as chunkservers go offline or when corrupted replicas are detected through checksum verification
26
High Availability (Cont’d)
Master replication
◦ The operation log and checkpoints are replicated on multiple machines
◦ If the master machine or its disk fails
  Monitoring infrastructure outside GFS starts a new master process
◦ Shadow masters
  Provide read-only access while the primary master is down
27
Data Integrity
Checksums (see the sketch below)
◦ Used to detect corruption
◦ One for every 64 KB block in each chunk
◦ Kept in memory and stored persistently with logging
Read
◦ The chunkserver verifies the checksum before returning data
Write
◦ Append
  Incrementally update the checksum for the last partial block
  Compute new checksums for any new blocks
28
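A minimal per-block checksumming sketch (hypothetical Python, with CRC32 standing in for the actual checksum): every 64 KB block of a chunk gets its own checksum, and a read verifies each block it touches before any data is returned.

import zlib

BLOCK = 64 * 1024                              # checksum granularity: 64 KB

def checksum_blocks(chunk):
    return [zlib.crc32(chunk[i:i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verified_read(chunk, checksums, offset, length):
    """Verify every 64 KB block the read touches before returning data."""
    first, last = offset // BLOCK, (offset + length - 1) // BLOCK
    for b in range(first, last + 1):
        if zlib.crc32(chunk[b * BLOCK:(b + 1) * BLOCK]) != checksums[b]:
            raise IOError(f"corrupt block {b}")   # report corruption to the caller
    return chunk[offset:offset + length]

chunk = bytes(200 * 1024)                      # a 200 KB chunk of zeros
sums = checksum_blocks(chunk)
print(len(verified_read(chunk, sums, 70_000, 10_000)))   # 10000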
Data Integrity (Cont’d)
Write
◦ Overwrite
  Read and verify the first and last blocks, then write
  Compute and record new checksums
During idle periods
◦ Chunkservers scan and verify inactive chunks
29
MEASUREMENTS
30
Micro-benchmarks
GFS cluster
◦ 1 master
◦ 2 master replicas
◦ 16 chunkservers
◦ 16 clients
Server machines are connected to one switch, client machines to the other; the two switches are connected by a 1 Gbps link.
31
32
Micro-benchmarks
Figure 3: Aggregate Throughputs. Top curves show theoretical limits imposed by our network topology. Bottom curves show measured throughputs. They have error bars that show 95% confidence intervals, which are illegible in some cases because of low variance in measurements.
33
Real World Clusters
Table 2: Characteristics of Two GFS Clusters
34
Table 3: Performance Metrics for Two GFS Clusters
Real World Clusters
Real World Clusters
In cluster B (arithmetic check below)
◦ Killed a single chunkserver containing 15,000 chunks (600 GB of data)
  All chunks were restored in 23.2 minutes
  Effective replication rate of 440 MB/s
◦ Killed two chunkservers, each with roughly 16,000 chunks (660 GB of data)
  266 chunks were left with only a single replica
  These were re-replicated at higher priority and restored within 2 minutes
35
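A quick check of the restore numbers above (my own arithmetic): 600 GB of chunk data re-replicated in 23.2 minutes comes out to roughly 430 MB/s, consistent with the ~440 MB/s effective rate cited.

data_mb = 600 * 1000                  # 600 GB of chunk data
seconds = 23.2 * 60                   # 23.2 minutes

print(round(data_mb / seconds))       # 431 MB/s, i.e. roughly the 440 MB/s reported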
Conclusions
Demonstrates qualities essential for supporting large-scale processing workloads
◦ Treats component failure as the norm
◦ Optimizes for huge files
◦ Extends and relaxes the standard file system interface
Fault tolerance is provided by
◦ Constant monitoring
◦ Replicating crucial data
◦ Fast and automatic recovery
◦ Checksums to detect data corruption
High aggregate throughput
36