The Google File System (GFS) Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung


Page 1: The Google File System (GFS)

The Google File System (GFS)

Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung

Page 2: The Google File System (GFS)

Introduction

Design constraints
• Component failures are the norm
  – 1000s of components
  – Bugs, human errors, failures of memory, disk, connectors, networking, and power supplies
  – Requires monitoring, error detection, fault tolerance, and automatic recovery
• Files are huge by traditional standards
  – Multi-GB files are common
  – Billions of objects

Page 3: The Google File System (GFS)

Introduction

Design constraints
• Most modifications are appends
  – Random writes are practically nonexistent
  – Many files are written once and read sequentially
• Two types of reads
  – Large streaming reads
  – Small random reads (in the forward direction)
• Sustained bandwidth is more important than low latency
• File system APIs are open to change

Page 4: The Google File System (GFS)

Interface Design

Not POSIX compliant
Additional operations
• Snapshot
• Record append

Page 5: The Google File System (GFS)

Architectural Design

A GFS cluster
• A single (logical) master
• Multiple chunkservers per master
• Accessed by multiple clients
• Running on commodity Linux machines

A file
• Stored on chunkservers as fixed-size chunks
• Each chunk labeled with a 64-bit globally unique ID
• Each chunk 3-way replicated across chunkservers
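To illustrate the fixed-size chunk layout, a client can translate a file byte offset into a (chunk index, offset-within-chunk) pair locally, before ever contacting the master. The 64 MB constant is from the paper; the helper name below is invented for illustration.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the GFS chunk size

def locate(offset: int) -> tuple[int, int]:
    """Map a file byte offset to (chunk index, offset within chunk)."""
    return offset // CHUNK_SIZE, offset % CHUNK_SIZE

# A read at byte 150,000,000 falls in the third chunk (index 2).
print(locate(150_000_000))  # → (2, 15782272)
```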

Page 6: The Google File System (GFS)

Architectural Design (2)

[Architecture diagram: an Application talks to a GFS client; the client asks the GFS master for chunk locations, then fetches chunk data directly from GFS chunkservers, each running on top of a Linux file system.]

Page 7: The Google File System (GFS)

Architectural Design (3)

Master server
• Maintains all metadata: namespace, access control, file-to-chunk mappings, garbage collection, chunk migration

GFS clients
• Consult the master for metadata
• Access data directly from chunkservers
• Do not go through the VFS
• No caching at clients or chunkservers: client workloads are mostly streaming sequential reads and writes with little re-reading, so caching file data would do little for overall performance; chunkservers need no cache of their own because Linux already keeps frequently accessed data in its buffer cache

Page 8: The Google File System (GFS)

Single-Master Design

Simple
• The master answers only chunk-location queries
• A client typically asks for multiple chunk locations in a single request
• The master also proactively returns the locations of chunks immediately following those requested

Page 9: The Google File System (GFS)

Chunk Size

64 MB
• Fewer chunk-location requests to the master
• Reduced overhead to access a chunk
• Fewer metadata entries, kept in memory
• Potential downside: internal fragmentation

Page 10: The Google File System (GFS)

Metadata

Three major types
• File and chunk namespaces
• File-to-chunk mappings
• Locations of each chunk's replicas

Page 11: The Google File System (GFS)

Metadata

All kept in memory
• Fast! Quick global scans
  – Garbage collection
  – Reorganizations
• 64 bytes of metadata per 64 MB of data
• Prefix compression
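A back-of-envelope calculation shows why the 64-bytes-per-64-MB ratio makes an in-memory design feasible: a pebibyte of file data needs only about a gibibyte of metadata.

```python
# Metadata footprint at 64 bytes per 64 MB chunk (ratio from the deck).
CHUNK_SIZE = 64 * 1024 * 1024
META_PER_CHUNK = 64  # bytes

data = 1024 ** 5                 # 1 PiB of file data
chunks = data // CHUNK_SIZE
print(chunks)                    # → 16777216 chunks
print(chunks * META_PER_CHUNK)   # → 1073741824 bytes, i.e. 1 GiB of metadata
```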

Page 12: The Google File System (GFS)

Chunk Locations

No persistent state
• The master polls chunkservers at startup
• Heartbeat messages monitor servers
• Simplicity: the on-demand approach beats maintaining a coordinated persistent view when changes (failures) are frequent

Page 13: The Google File System (GFS)

Operation Logs

Metadata updates are logged
• e.g., <old value, new value> pairs
• The log is replicated on multiple remote machines

Global snapshots (checkpoints) truncate the logs
• Checkpoints are memory-mapped (no serialization/deserialization)
• Checkpoints can be created while updates continue to arrive

Recovery
• Latest checkpoint + subsequent log files

Page 14: The Google File System (GFS)

Consistency Model

Relaxed consistency
• Concurrent changes are consistent but undefined
• An append is atomically committed at least once
  – Occasional duplications
• All changes to a chunk are applied in the same order to all replicas
• Version numbers detect updates a replica missed while its chunkserver was down
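The version-number check can be sketched as follows: the master records the current version for each chunk, and any replica reporting an older version is stale. The data shapes and names here are invented for illustration.

```python
# Stale-replica detection via chunk version numbers.
def stale_replicas(master_version: int, replica_versions: dict[str, int]) -> list[str]:
    """Return the chunkservers whose replica version lags the master's."""
    return [server for server, v in replica_versions.items() if v < master_version]

replicas = {"cs1": 7, "cs2": 7, "cs3": 6}  # cs3 was down during an update
print(stale_replicas(7, replicas))  # → ['cs3']
```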

Page 15: The Google File System (GFS)

System Interactions

The master grants a lease on one replica of a chunk; that replica is called the primary.

The primary serializes all mutations to the chunk, and all replicas apply them in that order.

Leases
• The initial lease timeout is 60 seconds. As long as the chunk is being mutated, the primary can request extensions, which the master normally grants; extension requests and grants piggyback on the heartbeat messages between the master and the chunkservers. The master may also revoke a lease early (for example, to stop mutations on a file that has been renamed). Even if the master loses contact with the primary, it can safely grant a new lease to another replica once the old lease expires.
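The lease bookkeeping described above can be sketched as follows. The 60-second initial timeout is from the slide; the class and method names are invented, and real extensions would ride on heartbeats rather than direct calls.

```python
# Master-side lease bookkeeping sketch.
class Lease:
    TIMEOUT = 60.0  # seconds, the initial lease duration

    def __init__(self, primary: str, now: float):
        self.primary = primary
        self.expires = now + self.TIMEOUT

    def extend(self, now: float) -> None:
        """Grant an extension (in GFS, piggybacked on a heartbeat)."""
        self.expires = now + self.TIMEOUT

    def can_regrant(self, now: float) -> bool:
        """A new lease is safe only after the old one has expired."""
        return now >= self.expires

lease = Lease("cs1", now=0.0)
print(lease.can_regrant(30.0))  # → False: old lease still live
lease.extend(30.0)              # mutation in progress, lease extended to t=90
print(lease.can_regrant(70.0))  # → False
print(lease.can_regrant(90.0))  # → True
```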

Page 16: The Google File System (GFS)

Data Flow

Separation of control and data flows
• Avoids network bottlenecks
• Updates are pushed linearly among replicas
• Pipelined transfers: data is pushed sequentially along a carefully chosen chain of chunkservers
• ~13 MB/s with a 100 Mbps network
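To see why the pipelined chain helps, compare an idealized pipeline (total time roughly transfer time plus one latency per hop) with pushing the data to each replica in turn. The formulas below are a simplified model for illustration, not from the paper.

```python
# Idealized transfer-time model for pushing one chunk along a replica chain.
def pipelined(nbytes: float, throughput: float, hops: int, latency: float) -> float:
    """Pipeline: hops overlap, so latency adds once per hop."""
    return nbytes / throughput + hops * latency

def sequential(nbytes: float, throughput: float, hops: int, latency: float) -> float:
    """Naive push: the full transfer repeats for every hop."""
    return hops * (nbytes / throughput + latency)

B = 64 * 1024 * 1024   # one 64 MB chunk
T = 100e6 / 8          # 100 Mbps link, in bytes per second
print(round(pipelined(B, T, hops=3, latency=0.001), 2))   # ≈ 5.37 s
print(round(sequential(B, T, hops=3, latency=0.001), 2))  # ≈ 16.11 s
```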

Page 17: The Google File System (GFS)

Snapshot

Copy-on-write approach
• Revoke all outstanding leases on the chunks of the files being snapshotted
• New updates are logged while the snapshot is taken
• Commit the log to disk
• Apply the log to a copy of the metadata
• A chunk is not copied until the next update
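The copy-on-write idea can be sketched as follows: the snapshot initially shares chunk handles with the source file, and a chunk is duplicated only when it is next written. The data structures and the reference-count scheme here are invented for illustration.

```python
# Minimal copy-on-write sketch using reference-counted chunk handles.
files = {"/a": ["c1", "c2"]}        # file -> list of chunk handles
refcount = {"c1": 1, "c2": 1}

def snapshot(src: str, dst: str) -> None:
    """Snapshot shares chunks with the source; no data is copied yet."""
    files[dst] = list(files[src])
    for c in files[dst]:
        refcount[c] += 1

def write(path: str, idx: int, new_handle: str) -> str:
    """Before writing a shared chunk, give this file its own copy."""
    c = files[path][idx]
    if refcount[c] > 1:             # shared with a snapshot: copy first
        refcount[c] -= 1
        refcount[new_handle] = 1
        files[path][idx] = new_handle
    return files[path][idx]

snapshot("/a", "/a.snap")
print(write("/a", 0, "c1'"))  # → c1': the shared chunk was copied
print(files["/a.snap"][0])    # → c1: the snapshot keeps the original
```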

Page 18: The Google File System (GFS)

Master Operation

GFS has no per-directory data structure that lists the files in a directory, and it does not support links to files or directories.

Logically, the GFS namespace is a lookup table mapping full pathnames to metadata. With prefix compression, this table can be stored efficiently in memory.

Page 19: The Google File System (GFS)

Locking Operations

Each master operation acquires a set of locks before it runs. Typically, an operation involving /d1/d2/…/dn/leaf acquires read locks on the directory names /d1, /d1/d2, …, /d1/d2/…/dn, and either a read or a write lock on the full pathname /d1/d2/…/dn/leaf. Depending on the operation, leaf may be a file or a directory.

A nice property of this scheme is that it allows concurrent mutations in the same directory. For example, multiple files can be created concurrently in one directory: each creation takes a read lock on the directory name and a write lock on the file name. The read lock on the directory name suffices to prevent the directory from being deleted, renamed, or snapshotted, while the write locks on file names serialize the creations and ensure the same file name is not created twice.
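The lock-set computation above can be sketched directly: read locks on every ancestor pathname, then a read or write lock on the full path. Only the computation of which locks to take is shown; the lock objects themselves are omitted.

```python
# Compute the (pathname, mode) lock set for a master operation.
def locks_for(path: str, write_leaf: bool) -> list[tuple[str, str]]:
    """Read locks on all ancestor directory names, then the leaf lock."""
    parts = path.strip("/").split("/")
    prefixes = ["/" + "/".join(parts[: i + 1]) for i in range(len(parts) - 1)]
    locks = [(p, "read") for p in prefixes]
    locks.append((path, "write" if write_leaf else "read"))
    return locks

# Creating /home/user/f: read locks on the directories, write lock on the file.
print(locks_for("/home/user/f", write_leaf=True))
# → [('/home', 'read'), ('/home/user', 'read'), ('/home/user/f', 'write')]
```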

Page 20: The Google File System (GFS)

Replica Placement

Two goals: maximize data reliability and availability, and maximize network bandwidth utilization.

Chunk replicas must be spread across multiple racks. This keeps some replicas of a chunk intact and available even if an entire rack is damaged or offline, and it lets traffic for a chunk, especially reads, exploit the aggregate bandwidth of multiple racks.

Each chunk that needs re-replication is prioritized by several factors. One is how far the chunk is from its replication goal: a chunk that has lost two replicas gets higher priority than one that has lost only one.
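The prioritization rule above can be sketched as a sort by replica deficit. The replication goal of 3 matches the 3-way replication mentioned earlier; the data shapes are illustrative.

```python
# Order chunks for re-replication: most under-replicated first.
GOAL = 3  # target replica count

def rereplication_order(live_replicas: dict[str, int]) -> list[str]:
    """Sort chunk IDs by how many replicas they are missing, descending."""
    return sorted(live_replicas, key=lambda c: GOAL - live_replicas[c], reverse=True)

live = {"c1": 2, "c2": 1, "c3": 3}  # chunk -> surviving replica count
print(rereplication_order(live))    # → ['c2', 'c1', 'c3']
```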

Page 21: The Google File System (GFS)

Garbage Collection

When a file is deleted by the application, the master logs the deletion immediately, just like any other mutation. But rather than reclaiming resources right away, it renames the file to a hidden name that includes the deletion timestamp. During its regular namespace scans, the master removes any such hidden file older than three days (the interval is configurable). Until then, the file can still be read under its new, special name, and it can be undeleted by renaming it back to a normal name. Once the hidden file is removed from the namespace, the master erases its in-memory metadata, which effectively severs the file's links to all of its chunks.
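The lazy-deletion scheme can be sketched as follows. The three-day grace period is from the slide; the hidden naming convention ".deleted.<timestamp>.<name>" is invented here, not GFS's actual format.

```python
# Lazy deletion via hidden, timestamped names.
GRACE = 3 * 24 * 3600  # three days, in seconds

def hide(name: str, now: float) -> str:
    """'Delete' a file by renaming it to a hidden, timestamped name."""
    return f".deleted.{int(now)}.{name}"

def expired(hidden: str, now: float) -> bool:
    """True once the hidden file is old enough to reclaim in a scan."""
    ts = int(hidden.split(".")[2])
    return now - ts >= GRACE

h = hide("report", now=1000.0)
print(h)                               # → .deleted.1000.report
print(expired(h, 1000.0 + GRACE - 1))  # → False: still undeletable
print(expired(h, 1000.0 + GRACE))      # → True: eligible for collection
```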

Page 22: The Google File System (GFS)

Fault Tolerance and Diagnosis: Fast Recovery

Both the master and the chunkservers are designed to restore their state and restart within seconds, no matter how they were shut down.

• Chunk replication
• Master replication
  – "Shadow" masters provide read-only access to the file system while the primary master is down
  – A monitoring process outside GFS starts a new master process on another machine that holds a complete copy of the operation log

Page 23: The Google File System (GFS)

Fault Tolerance and Diagnosis: Data Integrity

Each chunk is divided into 64 KB blocks, and each block has a 32-bit checksum. Like other metadata, checksums are kept separate from user data: they are held in memory, persisted on disk, and recorded in the operation log.

During idle periods, a chunkserver scans and verifies the contents of inactive chunks. This lets corruption be detected even in chunks that are rarely read.
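The per-block checksumming can be sketched as follows: split a chunk into 64 KB blocks and keep a 32-bit checksum per block, then re-scan and compare during idle verification. CRC-32 stands in here for whatever 32-bit checksum GFS actually uses, which the deck does not specify.

```python
import zlib

BLOCK = 64 * 1024  # 64 KB blocks, per the slide

def block_checksums(chunk: bytes) -> list[int]:
    """One 32-bit CRC per 64 KB block of the chunk."""
    return [zlib.crc32(chunk[i : i + BLOCK]) for i in range(0, len(chunk), BLOCK)]

def verify(chunk: bytes, sums: list[int]) -> bool:
    """Idle-scan check: recompute and compare against stored checksums."""
    return block_checksums(chunk) == sums

data = bytes(200 * 1024)        # a 200 KB chunk → 4 blocks
sums = block_checksums(data)
print(len(sums))                # → 4
corrupted = b"\x01" + data[1:]  # flip one byte in the first block
print(verify(corrupted, sums))  # → False
```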