72
Silberschatz, Galvin and Gagne 2002 13.1 Operating System Concepts Chapter 14: Mass-Storage Systems 海海海海海海 14.1 Disk Structure 海海海海 14.2 Disk Scheduling 海海海海 14.3 Disk Management 海海海海 14.4 Swap-Space Management 海海海海海海 14.5 RAID Structure RAID 海海 14.6 Disk Attachment 海海海海 14.7 Stable-Storage Implementation 海海海海海海 14.8 Tertiary Storage Devices 海海海海海海 Operating System Issues 海海海海海 海海海 Performance Issues 海海海海海海海

Chapter 14: Mass-Storage Systems 海量存储器系统

  • Upload
    johnda

  • View
    105

  • Download
    7

Embed Size (px)

DESCRIPTION

Chapter 14: Mass-Storage Systems 海量存储器系统. 14.1 Disk Structure 磁盘结构 14.2 Disk Scheduling 磁盘调度 14.3 Disk Management 磁盘管理 14.4 Swap-Space Management 交换空间管理 14.5 RAID Structure RAID 结构 - PowerPoint PPT Presentation

Citation preview

Page 1: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.1Operating System Concepts

Chapter 14: Mass-Storage Systems海量存储器系统

14.1 Disk Structure 磁盘结构 14.2 Disk Scheduling 磁盘调度 14.3 Disk Management 磁盘管理 14.4 Swap-Space Management 交换空间管

理 14.5 RAID Structure RAID 结构 14.6 Disk Attachment 磁盘连接 14.7 Stable-Storage Implementation 稳定存储实

现 14.8 Tertiary Storage Devices 三级存储设备 Operating System Issues 有关操作系统的问题 Performance Issues 有关性能的问题

Page 2: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.2Operating System Concepts

14.1 Disk Structure 磁盘结构 Disk drives are addressed as large 1-

dimensional arrays of logical blocks, where the logical block is the smallest unit of transfer.

磁盘设备是以一种逻辑块的一维大数组的形式编址的,这里的逻辑块是传输的最小单位。

The 1-dimensional array of logical blocks is mapped into the sectors of the disk sequentially.

逻辑块的一维数组映射到磁盘上一些相连的扇区。Sector 0 is the first sector of the first track

on the outermost cylinder. 0 扇区是最外边柱面的第一个磁道的第一个扇区。Mapping proceeds in order through that

track, then the rest of the tracks in that cylinder, and then through the rest of the cylinders from outermost to innermost. 数据首先都映射到一个磁道,其余的数据映射到同一柱面的其他磁道,然后按照从外向里的顺序映射到其余的柱面。

Page 3: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.3Operating System Concepts

Disk Structure 磁盘结构 Constant linear velocity(CLV) ( 恒定线速度)

Density of bits per track is uniformThe farther a track is from the center of the

disk, the greater its length, so the more sectors it can hold.

磁头越往中心移动,转速越快,以保持数据速率不变CD-ROM 和 DVD-ROM 采用这种方法

Constant angular velocity(CAV) ( 恒定角速度)磁头转速不变为保持数据速率不变,从中心往外,数据密度由大变小硬盘等采用这种方法

Low-level formattedBlock size:512 bytes or 1024

Page 4: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.4Operating System Concepts

14.2 Disk Scheduling 磁盘调度 The operating system is responsible for using

hardware efficiently — for the disk drives, this means having a fast access time and disk bandwidth.

操作系统任务就是高效地使用硬件——对于磁盘设备,这意味着很短的访问时间和磁盘带宽。

Access time has two major components 访问时间包括两个主要部分

Seek time is the time for the disk are to move the heads to the cylinder containing the desired sector.

寻道时间是指把磁头移到所需柱面的时间。Rotational latency is the additional time

waiting for the disk to rotate the desired sector to the disk head. 旋转延迟是指将磁头旋转到磁盘上指定扇区所需的等待时间。

Page 5: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.5Operating System Concepts

Disk Scheduling (Cont.)

Minimize seek time 最小寻道时间 Seek time seek distance 寻道时间 寻道距离 Disk bandwidth is the total number of

bytes transferred, divided by the total time between the first request for service and the completion of the last transfer.

磁盘带宽,是用传输的总位数,除以第一个服务请求与最后传输完成之间的总时间。(也就是单位时间内数据传输量)

Page 6: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.6Operating System Concepts

Disk Scheduling (Cont.)

Several algorithms exist to schedule the servicing of disk I/O requests.

有几种磁盘 I/O 请求的服务调度算法 We illustrate them with a request queue (0-

199).98, 183, 37, 122, 14, 124, 65, 67

Head pointer 53

Page 7: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.7Operating System Concepts

14.2.1 FCFS Scheduling 先来先服务调度Simplest form. Illustration shows total head movement of 640 cylinders. 如下图所示,磁头总共移动了 640 个柱面的距离。

Page 8: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.8Operating System Concepts

14.2.2 SSTF Scheduling最短寻道时间优先调度

Shortest-seek-time-first(SSTF) algorithm Selects the request with the minimum seek time from

the current head position.

选择从当前磁头位置所需寻道时间最短的请求。 SSTF scheduling is a form of shortest-job-first(SJF)

scheduling;

SSTF 是 SJF 调度的一种形式; may cause starvation of some requests.

有可能引起某些请求的饥饿。 Illustration shows total head movement of 236

cylinders. 如图所示,磁头移动的总距离是 236 柱面。 It’s not optimal, 最佳磁头移动的总距离是 208 柱面

Page 9: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.9Operating System Concepts

SSTF (Cont.)

Page 10: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.10Operating System Concepts

14.2.3 SCAN Scheduling扫描调度

The disk arm starts at one end of the disk, and moves toward the other end, servicing requests until it gets to the other end of the disk, where the head movement is reversed and servicing continues.

磁头从磁盘的一端开始向另一端移动,沿途响应访问请求,直到到达了磁盘的另一端,此时磁头反向移动并继续响应服务请求。

Sometimes called the elevator algorithm.

有时也称为电梯算法。 Illustration shows total head movement of 236

cylinders.

如图所示,磁头移动的总距离是 236 柱面。

Page 11: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.11Operating System Concepts

SCAN (Cont.)

Page 12: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.12Operating System Concepts

14.2.4 C-SCAN Scheduling Circular SCAN(C-SCAN) scheduling provides a more

uniform wait time than SCAN. 提供比扫描算法更均衡的等待时间。

The head moves from one end of the disk to the other. servicing requests as it goes. When it reaches the other end, however, it immediately returns to the beginning of the disk, without servicing any requests on the return trip.

磁头从磁盘的一端向另一端移动,沿途响应请求。当它到了另一端,就立即回到磁盘的开始处,在返回的途中不响应任何请求。

Treats the cylinders as a circular list that wraps around from the last cylinder to the first one.

把所有柱面看成一个循环的序列,最后一个柱面接续第一个柱面。 磁头移动的总距离是 183 ( * 在从一边到另一边的变化过程中不接受任

何请求)柱面。

Page 13: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.13Operating System Concepts

C-SCAN (Cont.)

Page 14: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.14Operating System Concepts

14.2.5 LOOK/C-LOOK Scheduling

SCAN 和 C-SCAN 总是将磁臂在整个盘面宽度上移动,其实这样做并不实用。

SCAN 和 C-SCAN 的一种改进形式: LOOK 和 C-LOOK 。

C-LOOK: Arm only goes as far as the last request in each direction, then reverses direction immediately, without first going all the way to the end of the disk.

磁臂在每个方向上仅仅移动到最远的请求位置,然后立即反向移动,而不需要移动到磁盘的一端。

磁头移动的总距离是 153 ( * 在从一边到另一边的变化过程中不接受任何请求)柱面。

Page 15: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.15Operating System Concepts

C-LOOK (Cont.)

Page 16: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.16Operating System Concepts

14.2.6 Selecting a Disk-Scheduling Algorithm选择一种磁盘调度算法

SSTF is common and has a natural appeal SSTF 比较通用,性能一般。 SCAN and C-SCAN perform better for systems

that place a heavy load on the disk. SCAN 和 C-SCAN 在磁盘重负载的系统中性能较好。 Performance depends on the number and types

of requests. 性能依赖于请求的数量和类型。 Requests for disk service can be influenced by

the file-allocation method. 磁盘服务请求受到文件定位方式的影响。 目录和索引块的位置(最里面、最外面或中间柱面)也很重要

。 上述算法仅考虑 seek time, 而没有考虑 rotational

latency

Page 17: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.17Operating System Concepts

Selecting a Disk-Scheduling Algorithm (Cont.)

The disk-scheduling algorithm should be written as a separate module of the operating system, allowing it to be replaced with a different algorithm if necessary.

磁盘调度算法应该写成操作系统中的一个独立模块,在必要的时候允许用不同的算法来替换。

Either SSTF or LOOK is a reasonable choice for the default algorithm.

SSTF 和 LOOK 都是缺省算法的合理选择。

Page 18: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.18Operating System Concepts

磁盘 I/O 调度策略

来自不同进程的磁盘 I/O 请求构成一个随机分布的请求队列。磁盘 I/O 调度的主要目标就是减少请求队列对应的平均柱面定位时间。先进先出算法优先级算法后进先出算法短查找时间优先算法扫描 (SCAN) 算法循环扫描 (C-SCAN) 算法N步扫描 (N-step-SCAN) 算法双队列扫描 (FSCAN) 算法

Page 19: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.19Operating System Concepts

先进先出 (FIFO, First In First Out) 算法:磁盘 I/O执行顺序为磁盘 I/O 请求的先后顺序。 该算法的特点是公平性;在磁盘 I/O负载较轻且每次读写多个连

续扇区时,性能较好。 优先级算法:依据进程优先级来调整磁盘 I/O 请求的执行顺序。

该算法反映进程在系统的优先级特征,目标是系统目标的实现,而不是改进磁盘 I/O 性能。

后进先出 (LIFO, Last In First Out) 算法:后产生的磁盘 I/O 请求,先执行。 该算法是基于事务系统中顺序文件中磁盘 I/O 的局部性特征,相邻访问的位置也相邻。

它的问题在于系统负载重时,可能有进程的磁盘 I/O永远不能执行,处于饥饿状态。

Page 20: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.20Operating System Concepts

短查找时间优先 (SSTF, Shortest Service Time First) 算法:考虑磁盘 I/O 请求队列中各请求的磁头定位位置,选择从当前磁头位置出发,移动最少的磁盘 I/O 请求。 该算法的目标是使每次磁头移动时间最少。它不一定是最短平

均柱面定位时间,但比 FIFO 算法有更好的性能。 对中间的磁道有利,可能会有进程处于饥饿状态。

扫描 (SCAN) 算法:选择在磁头前进方向上从当前位置移动最少的磁盘 I/O 请求执行,没有前进方向上的请求时才改变方向。 该算法是对 SSTF 算法的改进,磁盘 I/O较好,且没有进程会

饿死。

Page 21: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.21Operating System Concepts

循环扫描 (C-SCAN) 算法:在一个方向上使用扫描算法,当到达边沿时直接移动到另一沿的第一个位置。 该算法可改进扫描算法对中间磁道的偏好。实验表明,该算法

在中负载或重负载时,磁盘 I/O 性能比扫描算法好。 N步扫描 (N-step-SCAN) 算法:把磁盘 I/O 请求队列分成长度为N的段,每次使用扫描算法处理这 N个请求。当 N=1 时,该算法退化为 FIFO 算法。 该算法的目标是改进前几种算法可能在多磁头系统中出现磁头静止在一个磁道上,导致其它进程无法及时进行磁盘 I/O 。

双队列扫描 (FSCAN) 算法:把磁盘 I/O 请求分成两个队列,交替使用扫描算法处理一个队列,新生成的磁盘 I/O 请求放入另一队列中。 该算法的目标与 N步扫描算法一致。

Page 22: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.22Operating System Concepts

14.3 Disk Management 磁盘管理

Disk initialization Low-level formatting, or physical formatting 磁盘低级格式

化,或物理格式化 Booting from disk

Boot blocks

Bad-block recovery Bad blocks

Page 23: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.23Operating System Concepts

14.3.1 Disk Format

Low-level formatting, or physical formatting — Dividing a disk into sectors that the disk controller can read and write.

低级格式化,或物理格式化——把磁盘划分成扇区,以便磁盘控制器可以进行读写。

扇区的数据结构: header + data area(usually 512 bytes) + trailer An error-correcting code(ECC) included in header ECC 在写数据时计算产生, 在读数据时校验,并有可能纠正错误

Page 24: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.24Operating System Concepts

Disk Format(Cont.)

To use a disk to hold files, the operating system still needs to record its own data structures on the disk. 为了使用磁盘保存文件,操作系统还需要在磁盘上保存它自身的数据结构。步骤:1. Partition the disk into one or more groups of

cylinders. 把磁盘划分成一组或多组柱面。 2. Logical formatting or “making a file system”.

逻辑格式化或“创建文件系统”。 使用磁盘系统有两种方法:

Raw I/O ( raw disk access 直接访问磁盘 ) Via regular file system services and I/O

Page 25: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.25Operating System Concepts

14.3.2 Boot Block 启动块

Boot block initializes system.

启动块初始化系统 The bootstrap is stored in ROM.

引导程序存储在 ROM 中 The full bootstrap program is stored in a partition

called the boot blocks. 完整的引导程序在磁盘的被称为引导块的分区上(系统盘, boot disk / system disk )

Bootstrap loader program.

引导程序装载程序。

Page 26: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.26Operating System Concepts

Fig 14.6 MS-DOS Disk Layout

Page 27: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.27Operating System Concepts

14.3.3 Bad Block 坏块

简单磁盘系统(如 IDE 接口) 通过命令方式处理坏块 如 MS-DOS 的 format 和 chkdsk 命令

复杂磁盘系统(如 SCSI 接口) 通过 sector sparing or forwarding 的方式处理坏块:

将坏块重定向到系统保留的空闲块上 空闲块在磁盘的某个用户不可见的位置,或在每个柱面的某个

位置以减少数据移动距离( seek time ) 通过 sector slipping 的方式处理坏块(整体移动):

假如 17#扇区坏,随后第一个刻有扇区为 202# 则处理方法为: 201202,200201,…,1819,1718

坏块未必一定能够被恢复

Page 28: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.28Operating System Concepts

14.4 Swap-Space Management交换空间管理

Swap-space — Virtual memory uses disk space as an extension of main memory.

交换空间——虚拟内存使用磁盘空间作为对主存的扩展。使用交换空间的主要目的:提高系统吞吐量

一个系统中,交换空间的大小可以是几兆 ~几个 Gbytes Some OS, such as Unix, allow the use of multiple

swap spaces It’s safer to overestimate than to underestimate swap

space. Waste some disk space, but does not other harm.

Page 29: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.29Operating System Concepts

Swap-Space Management交换空间管理

Swap-space Location can be carved out of the normal file system,

交换空间可以与常规的文件系统一起使用 容易实现 效率低 =〉改进: cache block location information in

main memory or, more commonly, it can be in a separate disk partition.

或者 ,更通常的情况是放在一个单独的磁盘分区里。 Use algorithm optimized for speed, not for storage

efficiency Add more swap space via repartitioning of the disk

有些操作系统(如 Solaris 2 )比较灵活,既可利用 raw partitions, 也可利用 file-system space 作为交换空间

Page 30: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.30Operating System Concepts

Swap-Space Management (Cont.)

Swap-space management 交换空间管理 4.3BSD 4.3BSD allocates swap space when process starts; holds text

segment (the program) and data segment.

4.3BSD 在进程开始时分配交换空间;保存正文段(程序)和数据段。

交换是通过在连续的磁盘区域和内存之间 copy 整个进程来实现的 Kernel uses swap maps to track swap-space use.

OS核心使用交换映像跟踪交换空间的使用情况。 Solaris 2Solaris 2 allocates swap space only when a page is forced out

of physical memory, not when the virtual memory page is first created. Solaris 2 仅在一页被交换出物理内存的时候分配交换空间,而不是在虚拟内存页最初生成的时候。

Page 31: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.31Operating System Concepts

Fig 14.7 4.3 BSD Text-Segment Swap Map

Fixed size (512KB), 最后一块除外(最后一块 1KB 为分配增量)

Page 32: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.32Operating System Concepts

Fig14.8 4.3 BSD Data-Segment Swap Map 分配较复杂,因为数据段是动态增长的 Map is fixed size. Given index i (map entry number), the block

size pointed by i is 2i x 16KB, maximum of 2MB When a process tries to grow its data segment beyond the final

allocated block in its swap area, the operating system allocates another block, twice as large as the previous one.

Index i = 0 1 2 3 4 …

Page 33: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.33Operating System Concepts

14.5 RAID Structure RAID 结构

RAID – Redundant Arrays of Inexpensive Diskmultiple disk drives provides reliability via

redundancy. Inexpensive Independent

RAID is arranged into six different levels. Mean time to failure (平均故障时间 ) Mean time to repair (平均修复时间 ) Mean time to data loss (平均数据丢失时间 )

Page 34: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.34Operating System Concepts

RAID (cont) Several improvements in disk-use techniques involve the use

of multiple disks working cooperatively. Improvement of Reliability via Redundant

通过冗余提高可靠性 Reliability Redundancy Simplest (but most expensive) approach is to duplicate every

disk, called mirroring (or shadowing) RAID schemes improve performance and improve the

reliability of the storage system by storing redundant data. Mirroring or shadowing keeps duplicate of each disk. Block interleaved parity(块交叉存取校验) uses much less

redundancy.

Page 35: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.35Operating System Concepts

RAID (cont) Improvement of Performance via Parallelism

通过并行性提高性能 With multiple disks, we can improve the transfer rate as well by

striping data (bit-level or block-level) across multiple disks

Disk data striping uses a group of disks as one storage unit. Bit-level striping:

– e.g. 4 disks, bits i and 4+i each byte go to disk i Block-level striping

– e.g. n disks, block i of a file goes to disk (i mod n)+1 Goals:

Increase the throughput of multiple small accesses by load balancing

Reduce the response time of large accesses

Page 36: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.36Operating System Concepts

Fig 14.9 RAID Levels

•P– error-correcting bits

•C– a second copy of the data

Page 37: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.37Operating System Concepts

RAID Levels RAID Level 0: disk arrays with striping at the

level of blocks, but without any redundancy RAID Level 1: disk mirroring RAID Level 2: memory-style error-correcting

code(ECC) organizationAll single-bit errors are detected by the

memory systemError-correcting schemes store two or

more extra bitsMany extra disks, so it’s not used in

practice

Page 38: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.38Operating System Concepts

RAID Levels RAID Level 3: bit-interleaved parity ( 位交叉存取校验)

organization 如果有某一个扇区坏了,可以确切地知道是哪一个;

并且可以通过另一磁盘上的相应数据计算出该坏块上的每一位是 0或是 1

所需的冗余硬盘比 RAID Level 2少 性能问题:计算和写校验数据需要时间 改进:使用 NVRAM ( non-volatile RAM) 或 Cache

RAID Level 4: block-interleaved parity ( 块交叉存取校验) organization 与 RAID Level 3 相类似 , 但在另一硬盘上保存了校验块

( not bit ) Block-level striping 每个硬盘上按块访问,并行性比较好 问题:一次写操作,需要访问磁盘 4次( 2次读老的

块 <数据块和校验块 > , 2次写新的块) , 也即:read-modify-write

Page 39: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.39Operating System Concepts

RAID Levels RAID Level 5: block-interleaved distributed

parity (块交叉分布式存取校验) 与 RAID Level 4 的区别:将数据和校验分布

在所有 N+1 个磁盘上,而不是将数据写在 N个盘上,将校验写在 1 个盘上

例如:磁盘阵列有 5个硬盘构成,则第 n个数据块的校验信息存放在第 (n mod 5)+1 个盘上,而第 n块的实际数据分布在另外 4 个磁盘上

改进:与 RAID Level 4 相比,可以避免校验盘访问过频

RAID Level 6: P+Q redundancy scheme与 RAID Level 5 的相类似,但存放更多冗余信息,以防多个磁盘失效

使用 Error-correcting code,而不是 parity,因此需要更多的冗余信息

Page 40: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.40Operating System Concepts

Fig 14.10 RAID (0 + 1) and (1 + 0)

Page 41: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.41Operating System Concepts

RAID Levels RAID Level 0+1: a combination of RAID 0

and 1RAID 0 提供性能 (performance)RAID 1 提供可靠性 (reliability)Better performance than RAID 5A set of disks are striped, and then the

stripe is mirrored to another, equivalent stripe

RAID Level 1+0: disks are mirrored in pairs, and then the resulting mirror pairs are striped Have advantages over RAID 0+1

theoretically

Page 42: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.42Operating System Concepts

RAID LevelsRAID 1+0 have advantages over RAID 0+1

theoretically, for example: If a single disk fails in RAID 0+1, the

entire stripe is inaccessible, leaving only the other stripe available

With a failure in RAID 1+0, the single disk is unavailable, but its mirrored pair is still available as are all the rest of the disks

Page 43: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.43Operating System Concepts

14.5.4 Selecting a RAID level

RAID Level 0: used in high-performance applications

RAID Level 1: used in high-reliability with fast recovery

RAID Level 0+1 and 1+0: used in high-performance and reliability with fast recovery

RAID Level 5: preferred for storing large volumes of data

Hot spare disk 热备份磁盘 : is not used for data, but is configured to be used as a replacement should any other disk fail. Allocating more than one hot spare allows more than one

failure to be repaired without human intervention

Page 44: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.44Operating System Concepts

14.5.5 Extensions

The concepts of RAID have generalized to other storage devices, including arrays of tapes and even to the broadcast of data over wireless systems

Tape-drive robots

Page 45: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.45Operating System Concepts

14.6 Disk Attachment磁盘连接

Disks may be attached one of two ways:

1.Host attached via an I/O port

-- IDE, ATA and SCSI

2. Network attached via a network connection

Page 46: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.46Operating System Concepts

Fig 14.11 Network-Attached Storage

Page 47: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.47Operating System Concepts

Fig 14.12 Storage-Area Network

Page 48: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.48Operating System Concepts

SAN

SAN is a private network using storage protocols rather than networking protocols

当前 SAN 系统普遍存在的缺陷: 协议不标准 设备的互操作性差

发展趋势: 用 IP (Gigabit Ethernet) 网络协议作为交换设备

Page 49: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.49Operating System Concepts

14.7 Stable-Storage Implementation稳定存储实现

Write-ahead log scheme requires stable storage.向前写日志系统需要稳定存储。

To implement stable storage: 为了实现稳定存储Replicate information on more than one nonvolatile

storage media with independent failure modes.

在多个非易失性存储介质上备份信息,这些介质具有不同的故障方式。

Update information in a controlled manner to ensure that we can recover the stable data after any failure during data transfer or recovery.

以一种有控制的方式更新信息,以便确保在数据传输或修复的过程中发生错误以后我们能够恢复稳定的数据。

Page 50: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.50Operating System Concepts

14.8 Tertiary Storage Structure三级存储结构

Low cost is the defining characteristic of tertiary storage.三级存储的定义特征是低成本。

Generally, tertiary storage is built using removable media通常,三级存储由可移动介质构成。

Common examples of removable media are floppy disks 、 CD-ROMs, and tapes; other types are available.

通常的可移动介质的例子是软盘、光盘和磁带;其他还有一些类型。

Page 51: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.51Operating System Concepts

14.8.1 Tertiary Storage Devices 三级存储设备

14.8.1.1 Removable Disks 可移动磁盘 Floppy disk — thin flexible disk coated with

magnetic material, enclosed in a protective plastic case.软盘——在又薄又软的盘面上涂上磁介质,装在一个用于保护的塑料套中。Most floppies hold about 1 MB; similar

technology is used for removable disks that hold more than 1 GB. 大多数软盘的容量是1MB ;类似的技术也用于可移动磁盘,其容量大于1GB 。

Removable magnetic disks can be nearly as fast as hard disks, but they are at a greater risk of damage from exposure. 可移动磁盘的速度几乎与硬盘一样快,但由于是暴露在外的,损坏的风险更大。

Page 52: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.52Operating System Concepts

Removable Disks (Cont.) A magneto-optic disk records data on a rigid platter coated with

magnetic material.

光电磁盘在一个涂有磁介质的刚性盘面上记录数据。 Laser heat is used to amplify a large, weak magnetic field to

record a bit.

激光的热量用来放大一个大面积脆弱的磁区域,来记录一位。 Laser light is also used to read data (Kerr effect).

激光的光线也用来读取数据( Kerr 效应) The magneto-optic head flies much farther from the disk

surface than a magnetic disk head, and the magnetic material is covered with a protective layer of plastic or glass; resistant to head crashes.

光电磁头距离盘面的距离比磁头远,并且磁介质上覆盖有塑料或玻璃的保护层;可抗击磁头撞击。

Page 53: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.53Operating System Concepts

Removable Disks (Cont.-1)

Optical disks do not use magnetism; they employ special materials that are altered by laser light.

光盘不使用磁介质;它们使用激光改造过的特殊材料。 The data on read-write disks can be

modified over and over. 光电磁盘在一个涂有磁介质的刚性盘面上记录数据。

Page 54: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.54Operating System Concepts

WORM Disks WORM (“Write Once, Read Many Times”) disks can be

written only once. WORM (“一次写,多次读“)盘只能被写一次。 Thin aluminum film sandwiched between two glass or

plastic platters.薄铝膜被加在两层玻璃或塑料盘中间。 To write a bit, the drive uses a laser light to burn a

small hole through the aluminum; information can be destroyed by not altered. 要写一位,驱动器用激光在铝膜上烧一个小洞;信息不能修改,只能被破坏。

Very durable and reliable. 持久可靠。 Read Only disks, such ad CD-ROM and DVD, come

from the factory with the data pre-recorded. 只读盘,比如光盘和 DVD ,都是在工厂进行了数据预存储

的。

Page 55: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.55Operating System Concepts

14.8.1.2 Tapes 磁带 Compared to a disk, a tape is less expensive and

holds more data, but random access is much slower. 与磁盘比较,磁带更便宜,并且能保存更多数据,但是随机访问非常慢。

Tape is an economical medium for purposes that do not require fast random access, e.g., backup copies of disk data, holding huge volumes of data.

如果不需要快速随机存取,磁带是一种经济的媒介,比如备份磁盘数据、保存极大量的数据

Large tape installations typically use robotic tape changers that move tapes between tape drives and storage slots in a tape library.

大型磁带装置通常使用自动磁带机,把磁带从磁带库的磁带驱动器移动到存储槽。

Page 56: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.56Operating System Concepts

Tapes (Cont.) stacker – library that holds a few tapes

栈式存储器——保存有一些磁带的库。 silo – library that holds thousands of tapes

队式存储器——保存有数以千计磁带的库。 A disk-resident file can be archived to tape for low

cost storage; the computer can stage it back into disk storage for active use.

一个磁盘驻留文件可以存到磁带上,以便降低存储成本;计算机可以为了当前的使用,把它传输回到磁盘上。

Page 57: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.57Operating System Concepts

14.8.1.3 Future Technology

Holographic storage 全息存储

Micro-electronic mechanical systems (MEMS)

Page 58: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.58Operating System Concepts

14.8.2 Operating System Jobs操作系统的工作

Major OS jobs are to manage physical devices and to present a virtual machine abstraction to applications主要的系统工作是管理物理设备,并且为应用程序提供一个虚拟机的抽象。

For hard disks, the OS provides two abstraction:

对于硬盘,操作系统提供两个抽象 Raw device – an array of data blocks.

Raw 设备——一个数据块的数组。 File system – the OS queues and schedules the

interleaved requests from several applications.

文件系统——操作系统对几个应用程序交叉的请求进行排队和调度。

Page 59: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.59Operating System Concepts

14.8.2.1 Application Interface 应用程序接口 Most OSs handle removable disks almost exactly like fixed

disks — a new cartridge is formatted and an empty file system is generated on the disk.

多数操作系统处理可移动磁盘的方式与固定磁盘几乎是一样的——格式化一个新的盘碟,同时在盘上生成一个空的文件系统。

Tapes are presented as a raw storage medium, i.e., and application does not not open a file on the tape, it opens the whole tape drive as a raw device.

磁带作为一个 raw 存储介质,也就是说,应用程序不是打开磁带上的一个文件,而是作为 raw 设备打开整个磁带。

Usually the tape drive is reserved for the exclusive use of that application.

通常磁带设备是由一个应用程序独占使用的。

Page 60: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.60Operating System Concepts

Application Interface (Cont.)

Since the OS does not provide file system services, the application must decide how to use the array of blocks.

由于操作系统在磁带上不提供文件系统服务,应用程序必须决定如何使用数据块的数组。

Since every application makes up its own rules for how to organize a tape, a tape full of data can generally only be used by the program that created it.

由于每个应用程序对于如何组织一个磁带都建立了自己的规则,一个装满数据的磁带通常只能由创建它的应用程序来使用。

The basic operations for a tape drive differ from those of a disk drive.

磁带驱动器的基本操作与磁盘驱动器是不同的。

Page 61: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.61Operating System Concepts

Tape Drives 磁带驱动器

locate positions the tape to a specific logical block, not an entire track (corresponds to seek).

定位操作指向磁带的一个特定逻辑块,而不是一个完整的磁道(与寻道操作对比)

The read position operation returns the logical block number where the tape head is.

读位置操作返回磁带头所在的逻辑块号。 The space operation enables relative motion. 间隔操作允许相关的运动。 Tape drives are “append-only” devices; updating a block in the

middle of the tape also effectively erases everything beyond that block.

磁带驱动器是“只能附加”的设备;更新磁带中间的一个块,需要先清除那个块上的所有数据。

An EOT mark is placed after a block that is written. 写完的一块后面需要加上一个 EOT标记。

Page 62: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.62Operating System Concepts

14.8.2.2 File Naming 文件命名

The issue of naming files on removable media is especially difficult when we want to write data on a removable cartridge on one computer, and then use the cartridge in another computer. 当我们想向一台计算机的一个可移动盘碟写入数据、然后在另一台计算机中使用的时候,命名文件的问题在可移动媒介上更加困难。

Contemporary OSs generally leave the name space problem unsolved for removable media, and depend on applications and users to figure out how to access and interpret the data.

现代操作系统通常没有解决可移动媒介上的命名空间问题,而是依靠应用程序和用户来指出如何访问解释数据。

Some kinds of removable media (e.g., CDs) are so well standardized that all computers use them the same way. 一些可移动介质(比如 CD )相当的标准化,所有的计算机都以同样的方式使用它们。

Page 63: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.63Operating System Concepts

14.8.2.3 Hierarchical Storage Management (HSM) 层次存储管理

A hierarchical storage system extends the storage hierarchy beyond primary memory and secondary storage to incorporate tertiary storage — usually implemented as a jukebox of tapes or removable disks. 一个层次存储系统扩展了存储层次,从主存、二级存储到一体化的三级存储——通常是一个磁带的自动播放机或者可移动磁盘。

Usually incorporate tertiary storage by extending the file system.通常通过扩展文件系统来一体化三级存储。 Small and frequently used files remain on disk. 经常使用的小文件仍然存在磁盘上。 Large, old, inactive files are archived to the jukebox. 不使用的大的旧文件存在自动播放机上。

HSM is usually found in supercomputing centers and other large installations that have enormous volumes of data. HSM 在超级计算中心和其他有庞大数据量的大设备中比较常见。

Page 64: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.64Operating System Concepts

14.8.3 Performance Issues 有关性能的问题14.8.3.1 Speed 速度 Two aspects of speed in tertiary storage are

bandwidth and latency.三级存储速度的两个方面是带宽和延迟。

Bandwidth is measured in bytes per second. 带宽用每秒字节数来衡量。

Sustained bandwidth – average data rate during a large transfer; # of bytes/transfer time.Data rate when the data stream is actually flowing.

持续的带宽——大量传输过程中的平均数据率;单位传输时间的字节数。数据流实际流动时的数据率。

Page 65: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.65Operating System Concepts

Speed Effective bandwidth – average over the entire I/O time,

including seek or locate, and cartridge switching. Drive’s overall data rate.

有效带宽——整个 I/O 时间的平均,包括寻道或者定位,以及盘碟选择。驱动器的全面数据率

Access latency – amount of time needed to locate data.

访问延迟——定位数据需要的时间。 Access time for a disk – move the arm to the selected

cylinder and wait for the rotational latency; < 35 milliseconds. 磁盘的访问时间——移动磁臂来选择柱面,并且等待旋转延迟; <35毫秒。

Access on tape requires winding the tape reels until the selected block reaches the tape head; tens or hundreds of seconds. 访问磁带需要把所选的块倒到磁带头的位置;数十甚至数百秒。

Page 66: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.66Operating System Concepts

Speed (Cont.) Generally say that random access within a tape cartridge is

about a thousand times slower than random access on disk.

一般来说,对磁带的随机访问比对磁盘的随机访问要慢一千倍。 The low cost of tertiary storage is a result of having many

cheap cartridges share a few expensive drives.

三级存储成本低,这是许多便宜的盘碟使用少量昂贵驱动器的结果。 A removable library is best devoted to the storage of

infrequently used data, because the library can only satisfy a relatively small number of I/O requests per hour.

一个可移动的库对于很少使用的数据的存储是很好的,因为库每个小时只需要满足相对很少的 I/O 请求。

Page 67: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.67Operating System Concepts

14.8.3.2 Reliability 可靠性

A fixed disk drive is likely to be more reliable than a removable disk or tape drive.固定磁盘驱动器比可移动磁盘或磁带驱动器更可靠。

An optical cartridge is likely to be more reliable than a magnetic disk or tape.光介质比磁介质的磁盘或磁带更可靠。

A head crash in a fixed hard disk generally destroys the data, whereas the failure of a tape drive or optical disk drive often leaves the data cartridge unharmed.

对于固定的硬盘,磁头撞击通常会破坏数据,然而磁带或光盘驱动器的错误通常对数据盘碟是无害的。

Page 68: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.68Operating System Concepts

14.8.3.3 Cost 成本 Main memory is much more expensive than disk storage

主存比磁盘存储要贵很多。 The cost per megabyte of hard disk storage is competitive

with magnetic tape if only one tape is used per drive. 硬盘存储的每兆字节成本与磁带不相上下,如果每个驱动器只用一条磁带。

The cheapest tape drives and the cheapest disk drives have had about the same storage capacity over the years.近年来,最便宜的磁带驱动器和最便宜的磁盘驱动器的存储容量几乎一样。

Tertiary storage gives a cost savings only when the number of cartridges is considerably larger than the number of drives. 只有当盘碟的数量远大于驱动器数量的时候,三级存储才能节约成本。

Page 69: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.69Operating System Concepts

Fig 14.13 Price per Megabyte of DRAM, From 1981 to 2000

Page 70: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.70Operating System Concepts

Fig 14.14 Price per Megabyte of Magnetic Hard Disk, From 1981 to 2000

Page 71: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.71Operating System Concepts

Fig 14.15 Price per Megabyte of a Tape Drive, From 1984-2000

Page 72: Chapter 14:  Mass-Storage Systems 海量存储器系统

Silberschatz, Galvin and Gagne 200213.72Operating System Concepts

Exercises

2, 10