The Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, lecture by Jianfeng Zhan

The Google File System (GFS) - ict.ac.cnprof.ict.ac.cn/DComputing/uploads/2013/DC_4_0_GFS_ICT.pdf · The Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

The Google File System (GFS)

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Lecture by Jianfeng Zhan

Acknowledgement

Parts of the content are from CSE 490h, Introduction to Distributed Computing, Winter 2008, University of Washington

Distributed File Systems

Tradeoffs in Distributed File Systems
  • Performance
  • Scalability
  • Reliability
  • Availability

Two Core Approaches
  • One supercomputer?
  • Many cheap computers?

Motivation

Google went the cheap commodity route…
  • Lots of data on cheap machines!

Why not use an existing file system?
  • Unique problems
  • GFS is designed for Google workloads
  • Google apps are designed for GFS

Design constraints (1/2)

Component failures are the norm
  • Large-scale cheap systems
  • Bugs, human errors, failures of memory, disk, connectors, networking, and power supplies
  • Monitoring, error detection, fault tolerance, automatic recovery

Files are huge by traditional standards
  • Multi-GB files are common
  • But there aren’t THAT many files

Design constraints (2/2)

Mutations are typically appends of new data
  • Random writes are rare
  • Once written, files are only read, and typically sequentially
  • Optimize for this!

Reads come in two kinds: large consecutive (streaming) reads and small random reads

Want high sustained bandwidth
  • Low latency is not that important

Google is co-designing apps AND file system

GFS Interface

Supports the usual commands
  • Create, delete, open, close, read, write

Snapshot
  • Copies a file or a directory tree

Record Append
  • Allows multiple concurrent appends to the same file

GFS Architecture

  • Single master
  • Multiple chunkservers

Architectural Design (1/4)

A GFS cluster
  • A single master
  • Multiple chunkservers per master
  • Accessed by multiple clients
  • Running on commodity Linux machines

A file is divided into fixed-size chunks
  • Labeled with 64-bit globally unique IDs
  • Stored at chunkservers
  • Mirrored 3-way across chunkservers

Architectural Design (2/4)

Master server
  • Maintains all metadata
      • Namespace, access control information, file-to-chunk mappings, current chunk locations
  • Controls system-wide activities
      • Chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers
  • Periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state

Architectural Design (3/4)

GFS clients
  • Consult the master for metadata
  • Access data from chunkservers
  • Do not go through VFS, since GFS does not provide the POSIX API

Architectural Design (4/4)

No file data caching at clients or chunkservers, because streaming is the common case
  • Clients: most applications stream through huge files or have working sets too large to be cached
  • Simplifies the client and the overall system by eliminating cache coherence issues (clients do cache metadata, however)
  • Chunkservers need not cache file data: chunks are stored as local files, so Linux’s buffer cache already keeps frequently accessed data in memory

Single-Master Design

Concerns from a distributed-systems standpoint
  • Single point of failure
  • Scalability bottleneck

GFS solutions:
  • Shadow masters
  • Minimize master involvement
      • Never move data through it; use it only for metadata
      • Large chunk size (64 MB)
      • Master delegates authority to primary replicas in data mutations (chunk leases)

Simple, and good enough!

Master’s responsibilities (1/2)

  • Metadata storage
  • Namespace management/locking
  • Periodic communication with chunkservers
      • Give instructions, collect state, track cluster health
  • Garbage collection

Master’s responsibilities (2/2)

Chunk creation
  • Place new replicas on chunkservers with below-average disk-space utilization
  • Limit the number of recent creations on each chunkserver
  • Spread replicas across racks

Re-replicate when the number of replicas falls below the user-specified goal

Periodic rebalancing
  • Better disk space usage
  • Load balancing
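
The three chunk-creation rules above can be sketched as a small selection function. This is only an illustration, not the actual GFS policy: the `Chunkserver` fields, the `max_recent_creations` threshold, and the tie-breaking order are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class Chunkserver:            # hypothetical record, not a GFS structure
    name: str
    rack: str
    disk_utilization: float   # fraction of disk in use, 0.0 to 1.0
    recent_creations: int     # chunks created on this server recently

def place_replicas(servers, count=3, max_recent_creations=5):
    """Pick `count` chunkservers for the replicas of a new chunk."""
    average = sum(s.disk_utilization for s in servers) / len(servers)
    # Rule 2: skip servers that absorbed too many recent creations.
    eligible = [s for s in servers if s.recent_creations < max_recent_creations]
    # Rule 1: below-average utilization first, then the emptiest disks.
    eligible.sort(key=lambda s: (s.disk_utilization >= average, s.disk_utilization))
    chosen, used_racks = [], set()
    # Rule 3: first pass takes at most one replica per rack.
    for s in eligible:
        if len(chosen) < count and s.rack not in used_racks:
            chosen.append(s)
            used_racks.add(s.rack)
    # Second pass fills remaining slots when there are fewer racks than replicas.
    for s in eligible:
        if len(chosen) < count and s not in chosen:
            chosen.append(s)
    return chosen

servers = [
    Chunkserver("cs1", "rackA", 0.30, 0),
    Chunkserver("cs2", "rackA", 0.20, 0),
    Chunkserver("cs3", "rackB", 0.50, 1),
    Chunkserver("cs4", "rackB", 0.10, 9),  # excluded: too many recent creations
    Chunkserver("cs5", "rackC", 0.90, 0),
]
placement = place_replicas(servers)
print([s.name for s in placement])  # ['cs2', 'cs3', 'cs5']
```

In this sketch rack diversity wins over disk utilization, so the nearly full `cs5` is chosen ahead of `cs1`; a real policy would have to weigh the two goals against each other.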

Chunk Size

64 MB
  • Fewer chunk location requests to the master
  • Reduced overhead to access a chunk
      • A client is likely to perform many operations on a given large chunk, so it can reduce network overhead by keeping a persistent TCP connection to the chunkserver
  • Fewer metadata entries
      • Kept in memory
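
Because chunks have a fixed size, a byte range in a file maps to chunk indices by plain integer arithmetic, which is why a large chunk size means few location lookups. The helper below is a hypothetical illustration (the name `chunk_ranges` and the tuple layout are assumptions, not part of the GFS client library).

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size given above

def chunk_ranges(offset, length):
    """Split a byte range into (chunk_index, offset_in_chunk, nbytes) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE
        within = offset % CHUNK_SIZE
        nbytes = min(CHUNK_SIZE - within, end - offset)
        pieces.append((index, within, nbytes))
        offset += nbytes
    return pieces

MB = 1024 * 1024
# A 100 MB read starting 10 MB into the file touches only chunks 0 and 1,
# so the client needs at most two chunk locations from the master.
pieces = chunk_ranges(10 * MB, 100 * MB)
print(pieces)
```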

Metadata (1/5)

Global metadata is stored on the master
  • File and chunk namespaces
  • Mapping from files to chunks
  • Locations of each chunk’s replicas

All in memory (less than 64 bytes per chunk)
  • Fast
  • Easily accessible

Any problems?

Metadata (2/5)

Master has an operation log for persistent logging of critical metadata updates
  • Persistent on local disk
  • Replicated
  • Checkpoints for faster recovery

Metadata (3/5)

Three major types
  • File and chunk namespaces
  • File-to-chunk mappings
      • These first two types are kept persistent by logging mutations to an operation log stored on the master’s local disk and replicated on remote machines
  • Locations of each chunk’s replicas
      • The master does not store chunk location information persistently
      • It asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster

Metadata (4/5)

All kept in memory
  • Fast!
  • Quick global scans
      • Garbage collection
      • Reorganizations
          • Re-replication in the presence of chunkserver failures
          • Chunk migration to balance load and disk space usage across chunkservers
  • Less than 64 bytes of metadata per 64 MB chunk
  • File names are stored compactly using prefix compression
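
The "64 bytes per 64 MB" ratio makes it easy to estimate how much master memory the chunk metadata needs. The calculation below is a rough sketch that assumes every chunk is full and counts only the per-chunk entries, ignoring namespace data and replication.

```python
CHUNK_SIZE = 64 * 2**20       # 64 MB per chunk
METADATA_PER_CHUNK = 64       # upper bound in bytes, from the slide

def master_memory_bytes(total_data_bytes):
    """Upper bound on per-chunk metadata for a given amount of file data."""
    chunks = -(-total_data_bytes // CHUNK_SIZE)   # ceiling division
    return chunks * METADATA_PER_CHUNK

one_pib = 2**50
print(master_memory_bytes(one_pib) / 2**30)  # 1.0 (GiB of metadata per PiB of data)
```

Under these assumptions the ratio is about one millionth, which is why a single master can keep all chunk metadata in memory.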

Metadata (5/5)

Chunk locations: no persistent state
  • Master polls chunkservers at startup
  • Uses HeartBeat messages to monitor servers
  • Simplicity

On-demand approach vs. coordination
  • On-demand wins when changes (failures) are frequent
  • No point in maintaining a consistent view on the master, because errors on a chunkserver may cause chunks to vanish or an operator may rename a chunkserver

Operation Logs (1/2)

Central to GFS
  • Contains a historical record of critical metadata changes
  • Not only is it the only persistent record of metadata, but it also serves as a logical timeline that defines the order of concurrent operations
  • Files and chunks, as well as their versions, are all uniquely and eternally identified by the logical times at which they were created

Operation Logs (2/2)

Metadata updates are logged
  • e.g., <old value, new value> pairs
  • Log replicated on remote machines

Take global snapshots (checkpoints) to truncate logs
  • A checkpoint is stored in a compact B-tree-like form that can be mapped directly into memory
  • Checkpoints take a while to build, but can be created while updates arrive
      • Master switches to a new log file and creates the new checkpoint in a separate thread

Recovery
  • Latest checkpoint + subsequent log files
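
The recovery rule (latest checkpoint plus subsequent log records) can be sketched with a toy in-memory model. The `MasterState` class and its method names are assumptions for illustration; the real master writes the log to disk, replicates it, and builds checkpoints in a separate thread, none of which is modeled here.

```python
import copy

class MasterState:                       # hypothetical toy model
    def __init__(self):
        self.metadata = {}               # e.g. file name -> list of chunk handles
        self.checkpoint = {}             # last snapshot of self.metadata
        self.log = []                    # records written since that snapshot

    def apply(self, key, new_value):
        # Log the <old value, new value> pair before touching in-memory state.
        self.log.append((key, self.metadata.get(key), new_value))
        self.metadata[key] = new_value

    def take_checkpoint(self):
        # Snapshot the state and switch to a fresh (empty) log.
        self.checkpoint = copy.deepcopy(self.metadata)
        self.log = []

    def recover(self):
        # Latest checkpoint + replay of the log records that follow it.
        state = copy.deepcopy(self.checkpoint)
        for key, _old, new_value in self.log:
            state[key] = new_value
        return state

master = MasterState()
master.apply("/logs/a", ["c1"])
master.take_checkpoint()
master.apply("/logs/a", ["c1", "c2"])
master.apply("/logs/b", ["c3"])
recovered = master.recover()
print(recovered == master.metadata)  # True
```

Taking a checkpoint keeps the log short, so replay after a crash only has to cover the updates made since the last snapshot.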

Mutations

Mutation = write or append
  • Must be done for all replicas

Goal: minimize master involvement

Lease mechanism:
  • Master picks one replica as primary and gives it a “lease” for mutations
  • Primary defines a serial order of mutations
  • All replicas follow this order

Data flow decoupled from control flow
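
A minimal sketch of the ordering idea: the primary assigns consecutive serial numbers and every replica applies mutations in that order, so concurrent writers cannot leave replicas disagreeing. The `Replica` and `Primary` classes are hypothetical, and the sketch models only the control flow; pushing the data to replicas beforehand (the decoupled data flow) is left out.

```python
class Replica:                              # hypothetical toy replica
    def __init__(self):
        self.data = {}                      # offset -> bytes
        self.applied = []                   # serial numbers, in applied order

    def apply(self, serial, offset, payload):
        self.data[offset] = payload
        self.applied.append(serial)

class Primary(Replica):
    """The replica that currently holds the lease for this chunk."""
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, offset, payload):
        # The primary alone picks the serial number, fixing a single order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, offset, payload)
        for replica in self.secondaries:
            replica.apply(serial, offset, payload)
        return serial

secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.mutate(0, b"from client A")
primary.mutate(0, b"from client B")
print(all(r.data == primary.data for r in secondaries))  # True
```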

Data Mutations

A write causes data to be written at an application-specified file offset.

A record append causes data (the “record”) to be appended atomically at least once even in the presence of concurrent mutations.

Atomic record append

Client specifies only the data
  • GFS appends it to the file atomically at least once
  • GFS picks the offset

In contrast, a “regular” append is merely a write at an offset that the client believes to be the current end of file.

Used heavily by Google apps
  • e.g., for files that serve as multiple-producer/single-consumer queues

Consistency Model (1/3)

A file region is consistent if all clients will always see the same data, regardless of which replicas they read from.

A region is defined after a file data mutation if it is consistent and clients will see what the mutation writes in its entirety.

Consistency Model (2/3)

  • “Consistent” = all replicas have the same value
  • “Defined” = consistent, and the region reflects the mutation in its entirety

Some properties:
  • Concurrent successful writes leave the region consistent, but possibly undefined
  • Failed writes leave the region inconsistent

Some work has moved into the applications:
  • e.g., self-validating, self-identifying records
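
One way an application might implement self-validating, self-identifying records is to prefix each record with a checksum and a unique ID, so a reader can skip garbage and drop duplicates left by retried appends. The record layout below (4-byte CRC, then `id:payload`) is purely an assumption for illustration, not a format used by Google.

```python
import zlib

def encode_record(record_id, payload):
    """Writer side: tag the payload with an ID and prepend a checksum."""
    body = f"{record_id}:".encode() + payload
    return zlib.crc32(body).to_bytes(4, "big") + body

def read_records(raw_records):
    """Reader side: skip corrupt or padding records and duplicate IDs."""
    seen, result = set(), []
    for raw in raw_records:
        checksum, body = raw[:4], raw[4:]
        if zlib.crc32(body).to_bytes(4, "big") != checksum:
            continue                      # self-validating: bad checksum
        record_id, _, payload = body.partition(b":")
        if record_id in seen:
            continue                      # self-identifying: duplicate append
        seen.add(record_id)
        result.append(payload)
    return result

first = encode_record("r1", b"hello")
second = encode_record("r2", b"world")
# The region holds a valid record, padding, a duplicate, and another record.
region = [first, b"\x00" * 16, first, second]
records = read_records(region)
print(records)  # [b'hello', b'world']
```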

Consistency Model (3/3)

Relaxed consistency
  • Concurrent changes are consistent but undefined
  • An append is atomically committed at least once
      • Occasional duplications

All changes to a chunk are applied in the same order to all replicas

Use chunk version numbers to detect replicas that missed updates
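
Detecting a replica that missed updates then reduces to a version comparison. The function below is a hypothetical sketch: the slide does not say when the version number is incremented, so the example simply assumes the master knows the latest version.

```python
def find_stale(master_version, reported):
    """Names of replicas whose reported chunk version lags the master's."""
    return sorted(name for name, version in reported.items()
                  if version < master_version)

# cs2 was unreachable during one round of mutations and still reports version 6.
reported = {"cs1": 7, "cs2": 6, "cs3": 7}
stale = find_stale(7, reported)
print(stale)  # ['cs2']
```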

System Interactions

The master grants a chunk lease to a replica

The replica holding the lease determines the order of updates to all replicas

Lease
  • 60-second timeout
  • Can be extended indefinitely
  • Extension requests are piggybacked on HeartBeat messages
  • After a lease expires, the master can grant a new lease
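
The lease rules above can be sketched as a small table kept by the master. The `LeaseTable` class is an assumption made for illustration, and time is passed in explicitly as `now` (in seconds) so the behavior is easy to follow.

```python
LEASE_SECONDS = 60  # timeout given in the slide

class LeaseTable:                         # hypothetical master-side table
    def __init__(self):
        self.leases = {}                  # chunk handle -> (primary, expiry)

    def grant(self, chunk, replica, now):
        holder = self.leases.get(chunk)
        if holder and holder[1] > now and holder[0] != replica:
            return False                  # another replica holds a live lease
        self.leases[chunk] = (replica, now + LEASE_SECONDS)
        return True

    def extend(self, chunk, replica, now):
        # Called when a heartbeat from the primary carries an extension request.
        holder = self.leases.get(chunk)
        if holder and holder[0] == replica and holder[1] > now:
            self.leases[chunk] = (replica, now + LEASE_SECONDS)
            return True
        return False

table = LeaseTable()
results = [
    table.grant("chunk-1", "cs1", now=0),     # True: lease until t=60
    table.grant("chunk-1", "cs2", now=30),    # False: cs1 still holds it
    table.extend("chunk-1", "cs1", now=50),   # True: lease now until t=110
    table.grant("chunk-1", "cs2", now=100),   # False: extension still valid
    table.grant("chunk-1", "cs2", now=111),   # True: lease expired
]
print(results)  # [True, False, True, False, True]
```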

Replica Placement

Goals:
  • Maximize data reliability and availability
  • Maximize network bandwidth utilization

Need to spread chunk replicas across machines and racks

Re-replication gives higher priority to chunks with lower replication factors (fewer live replicas)

Limited resources are spent on replication

Fault Tolerance and Diagnosis (1/2)

High availability
  • Fast recovery
      • Master and chunkservers restartable in a few seconds
  • Chunk replication
      • Default: 3 replicas
  • Shadow masters

Fault Tolerance and Diagnosis (2/2)

Data integrity
  • A chunk is divided into 64 KB blocks
  • Each block has its own checksum
  • Checksums are verified at read and write times
  • Background scans also cover rarely used data
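
Per-block checksumming can be sketched as follows. The slide does not name the checksum function, so `zlib.crc32` is used here only as a stand-in, and the helper names are assumptions.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as in the slide

def block_checksums(chunk):
    """One checksum per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE])
            for i in range(0, len(chunk), BLOCK_SIZE)]

def corrupt_blocks(chunk, stored):
    """Indices of blocks whose current checksum differs from the stored one."""
    current = block_checksums(chunk)
    return [i for i, (a, b) in enumerate(zip(current, stored)) if a != b]

chunk = bytes(200 * 1024)                 # 200 KB of zeros -> 4 blocks
stored = block_checksums(chunk)
damaged = bytearray(chunk)
damaged[70 * 1024] ^= 0xFF                # flip bits inside block 1
print(corrupt_blocks(bytes(damaged), stored))  # [1]
```

Checksumming small blocks rather than whole chunks means a read only has to verify the blocks it actually touches, and corruption is localized to a single 64 KB block.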

Deployment in Google

  • 50+ GFS clusters
  • Each with thousands of storage nodes
  • Managing petabytes of data
  • GFS sits underneath BigTable, etc.

Conclusion

GFS demonstrates how to support large-scale processing workloads on commodity hardware
  • Design to tolerate frequent component failures
  • Optimize for huge files that are mostly appended and read
  • Feel free to relax and extend the FS interface as required
  • Go for simple solutions (e.g., single master)

GFS has met Google’s storage needs… it must be good!

Thank you

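Course Evaluation

Goal: system understanding + communication skills

Describe a system in 4 pages

Use 4 figures (each half a page, drawn in Microsoft Visio) to describe the system

Describe the figures in the most concise wording possible

Work in groups of two

The submission must not be completely identical to material I have already seen

Otherwise the score will not exceed 75 points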