William Stallings, Computer Organization and Architecture, 5th Edition
Chapter 11: CPU Structure and Function


Page 1: William Stallings  Computer Organization  and Architecture 5 th  Edition

1

William Stallings, Computer Organization and Architecture, 5th Edition

Chapter 11: CPU Structure and Function

Page 2: William Stallings  Computer Organization  and Architecture 5 th  Edition

2

Topics

Processor Organization

Register Organization

Instruction Cycle

Instruction Pipelining

The Pentium Processor

Page 3: William Stallings  Computer Organization  and Architecture 5 th  Edition

3

CPU Structure

CPU must:
Fetch instructions – read instructions from memory
Interpret instructions – decode each instruction to determine what action is required
Fetch data – read any data the instruction needs
Process data – perform the required arithmetic or logical operation on the data
Write data – write the results back to their destination

CPU needs a small internal memory to hold data and instructions temporarily

Page 4: William Stallings  Computer Organization  and Architecture 5 th  Edition

4

CPU with System Bus

Page 5: William Stallings  Computer Organization  and Architecture 5 th  Edition

5

CPU Internal Structure

Page 6: William Stallings  Computer Organization  and Architecture 5 th  Edition

6

Registers
CPU must have some working space (temporary storage), called registers
Their number and function vary between processor designs – one of the major design decisions
Registers form the top level of the memory hierarchy
Two categories:
User-visible registers
Control and status registers

Page 7: William Stallings  Computer Organization  and Architecture 5 th  Edition

7

User-Visible Registers

General-purpose registers
Data registers
Address registers
Condition code registers

Page 8: William Stallings  Computer Organization  and Architecture 5 th  Edition

8

User-Visible Registers

General-purpose registers
May be truly general purpose, or may be restricted in use
Data registers
e.g. an accumulator register
Addressing registers
Segment pointers, index registers, stack pointer

Page 9: William Stallings  Computer Organization  and Architecture 5 th  Edition

9

General or Special?

Make them general purpose
Increases flexibility and programmer options
Increases instruction size and complexity

Make them specialized
Smaller (faster) instructions
Less flexibility

The trend seems to be toward the use of specialized registers.

Page 10: William Stallings  Computer Organization  and Architecture 5 th  Edition

10

How Many GP Registers?

Usually between 8 and 32
Fewer registers mean more memory references
More registers do not noticeably reduce memory references

Page 11: William Stallings  Computer Organization  and Architecture 5 th  Edition

11

How Big?

Large enough to hold a full address
Large enough to hold values of most data types
Often possible to combine two data registers into one, e.g. for the C declarations double a; or long int a;
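A minimal C sketch of what "combining two data registers" means, assuming a 32-bit machine where a 64-bit C object is kept in a register pair (for instance EDX:EAX on IA-32); the variable names are illustrative.

    #include <stdio.h>

    int main(void) {
        /* A 64-bit C object may occupy a pair of 32-bit data registers. */
        long long a = 0x1122334455667788LL;
        unsigned int hi = (unsigned int)((unsigned long long)a >> 32); /* upper register of the pair */
        unsigned int lo = (unsigned int)(a & 0xFFFFFFFFu);             /* lower register of the pair */
        printf("hi=%08X lo=%08X\n", hi, lo);
        return 0;
    }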

Page 12: William Stallings  Computer Organization  and Architecture 5 th  Edition

12

Condition Code Registers

Condition codes are bits set by the CPU hardware as the result of operations.

Sets of individual bits, e.g. "result of last operation was zero"

At least partially visible to the user
Can be read (implicitly) by programs, e.g. "jump if zero"

Cannot (usually) be set directly by programs

Page 13: William Stallings  Computer Organization  and Architecture 5 th  Edition

13

Control & Status Registers

Program Counter (PC)

Instruction Register (IR)

Memory Address Register (MAR)

Memory Buffer Register (MBR)

Revision: what do these all do?

Page 14: William Stallings  Computer Organization  and Architecture 5 th  Edition

14

Program Status Word (PSW)

A set of bits that includes the condition codes:

Sign – sign of the last arithmetic result
Zero – set when the result is zero
Carry – set on a carry or borrow
Equal – set when a logical compare finds equality
Overflow – indicates arithmetic overflow
Interrupt enable/disable
Supervisor – indicates whether the CPU is running in supervisor or user mode

Page 15: William Stallings  Computer Organization  and Architecture 5 th  Edition

15

Program Status Word - Example

Motorola 68000's PSW: a system byte and a user byte
System byte: bit 15 = T (trace mode), bit 13 = S (supervisor status), bits 10–8 = I2 I1 I0 (interrupt mask)
User byte (condition codes): bits 4–0 = X N Z V C

Page 16: William Stallings  Computer Organization  and Architecture 5 th  Edition

16

Other Registers

May have registers pointing to:
Process control blocks (see O/S)
Interrupt vectors (see O/S)

N.B. CPU design and operating system design are closely linked

Page 17: William Stallings  Computer Organization  and Architecture 5 th  Edition

17

Example Register Organizations

Page 18: William Stallings  Computer Organization  and Architecture 5 th  Edition

18

Instruction Cycle

An instruction cycle includes the following subcycles:

Fetch cycle – fetch the next instruction
Execute cycle – execute the instruction
Interrupt cycle – if interrupts are enabled, check for and process any pending interrupt; if interrupts are disabled, go straight back to the fetch cycle

(Flowchart: START → fetch next instruction → execute instruction → check for / process interrupt → back to fetch, or HALT)

Page 19: William Stallings  Computer Organization  and Architecture 5 th  Edition

19

Indirect Addressing Cycle

May require a memory access to fetch operands
Indirect addressing requires additional memory accesses
Can be thought of as an additional instruction subcycle

Page 20: William Stallings  Computer Organization  and Architecture 5 th  Edition

20

Instruction Cycle with Indirect

Page 21: William Stallings  Computer Organization  and Architecture 5 th  Edition

21

Instruction Cycle State Diagram

Page 22: William Stallings  Computer Organization  and Architecture 5 th  Edition

22

Data Flow (Instruction Fetch)
The exact sequence of events during the instruction cycle depends on the CPU design
In general, for the fetch cycle:
The PC contains the address of the next instruction
That address is moved to the MAR and placed on the address bus
The control unit requests a memory read
The result is placed on the data bus, copied into the MBR, and then moved to the IR
Meanwhile the PC is incremented by 1
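A minimal register-transfer sketch of this fetch data flow in C; the word-addressed memory and its size are assumptions, while PC, MAR, MBR and IR follow the slides.

    /* Register-transfer view of the fetch cycle on a simplified,
       word-addressed machine (sizes are illustrative). */
    unsigned int mem[65536];
    unsigned int PC, MAR, MBR, IR;

    void fetch_cycle(void) {
        MAR = PC;         /* address of the next instruction moved to the MAR */
        MBR = mem[MAR];   /* control unit requests a memory read; result lands in the MBR */
        IR  = MBR;        /* instruction copied from the MBR to the IR */
        PC  = PC + 1;     /* meanwhile the PC is incremented */
    }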

Page 23: William Stallings  Computer Organization  and Architecture 5 th  Edition

23

Data Flow (Fetch Diagram)


Page 24: William Stallings  Computer Organization  and Architecture 5 th  Edition

24

Data Flow (Indirect Cycle)
After the fetch, the control unit examines the contents of the IR
If an operand uses indirect addressing, an indirect cycle is performed:
The rightmost N bits of the MBR (an address reference) are transferred to the MAR
The control unit requests a memory read
The result (the address of the operand) is moved into the MBR

Instruction format: opcode | address

Page 25: William Stallings  Computer Organization  and Architecture 5 th  Edition

25

Data Flow (Indirect Diagram)


Page 26: William Stallings  Computer Organization  and Architecture 5 th  Edition

26

Data Flow (Execute Cycle)

May take many forms
Depends on the instruction being executed
May include:
Memory read/write
Input/output
Register transfers
ALU operations

Page 27: William Stallings  Computer Organization  and Architecture 5 th  Edition

27

Data Flow (Interrupt Cycle)

The current PC must be saved so that the CPU can resume normal operation after the interrupt
The contents of the PC are copied to the MBR
A special memory location (e.g. the one given by the stack pointer) is loaded into the MAR
The MBR is written to memory
The PC is loaded with the address of the interrupt-handling routine
The next instruction fetched is then the first instruction of the interrupt handler
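A matching register-transfer sketch of the interrupt cycle, on the same kind of simplified machine as the fetch sketch above; the downward-growing stack and the handler_addr parameter are illustrative assumptions.

    /* Register-transfer view of entry to the interrupt cycle. */
    unsigned int mem[65536];
    unsigned int PC, MAR, MBR, SP;

    void interrupt_cycle(unsigned int handler_addr) {
        MBR = PC;            /* contents of the PC copied to the MBR */
        MAR = --SP;          /* a special memory location (the stack top) loaded into the MAR */
        mem[MAR] = MBR;      /* the MBR written to memory */
        PC = handler_addr;   /* PC loaded with the address of the interrupt handler */
        /* the next fetch cycle brings in the first instruction of the handler */
    }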

Page 28: William Stallings  Computer Organization  and Architecture 5 th  Edition

28

Data Flow (Interrupt Diagram)


Page 29: William Stallings  Computer Organization  and Architecture 5 th  Edition

29

Laundry Example (Pipelining)

Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold
Washer takes 30 minutes
Dryer takes 40 minutes
"Folder" takes 20 minutes

Page 30: William Stallings  Computer Organization  and Architecture 5 th  Edition

30

Sequential Laundry

Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

(Time chart: loads A–D run back to back from 6 PM to midnight, each taking 30 + 40 + 20 minutes.)

Page 31: William Stallings  Computer Organization  and Architecture 5 th  Edition

31

Pipelined Laundry

Pipelined laundry takes 3.5 hours for 4 loads

(Time chart: the wash, dry, and fold stages overlap, so the four loads finish in 30 + 4 × 40 + 20 = 210 minutes, from 6 PM to 9:30 PM.)

Page 32: William Stallings  Computer Organization  and Architecture 5 th  Edition

32

Pipelining Lessons (1)

Pipelining does not help the latency of a single task; it helps the throughput of the entire workload
The pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce the speedup
Time to "fill" the pipeline and time to "drain" it reduce the speedup


Page 33: William Stallings  Computer Organization  and Architecture 5 th  Edition

33

Pipelining Lessons (2)

Pipelining does not help a single task, but it improves the throughput of the whole system
The pipeline's improvement is limited by its slowest stage
Idea: multiple tasks can proceed at the same time; the ideal speedup equals the number of pipeline stages
Unbalanced pipeline stages reduce the speedup, and the initial fill time and final drain time also reduce it

Page 34: William Stallings  Computer Organization  and Architecture 5 th  Edition

34

Instruction Pipelining

Similar to an assembly line in a manufacturing plant:
Products at various stages can be worked on simultaneously
Performance is improved

First attempt: split the instruction cycle into 2 stages
Fetch
Execution

Page 35: William Stallings  Computer Organization  and Architecture 5 th  Edition

35

Prefetch

Fetch accesses main memory
Execution usually does not access main memory
So the next instruction can be fetched during execution of the current instruction
Called instruction prefetch
Ideally the instruction cycle time would be halved (if the fetch and execute durations were equal)

Page 36: William Stallings  Computer Organization  and Architecture 5 th  Edition

36

Improved Performance (1)

But performance is not doubled:
Fetch is usually shorter than execution
Any jump or branch means that the prefetched instructions are not the required instructions

e.g.,
      ADD A, B
      BEQ NEXT
      ADD B, C
NEXT: SUB C, D

Page 37: William Stallings  Computer Organization  and Architecture 5 th  Edition

37

Two Stage Instruction Pipeline

Page 38: William Stallings  Computer Organization  and Architecture 5 th  Edition

38

Improved Performance (2)

Reduce the time lost to branching by guessing

Prefetch the instruction that follows the branch instruction

If the branch is not taken, use the prefetched instruction – no time is lost
If the branch is taken, discard the prefetched instruction and fetch the new one

Page 39: William Stallings  Computer Organization  and Architecture 5 th  Edition

39

Pipelining
Add more stages to improve performance – more stages, more speedup

FI: Fetch instruction
DI: Decode instruction
CO: Calculate operands
FO: Fetch operands
EI: Execute instruction
WO: Write result

The stages are of nearly equal duration, so these operations can be overlapped
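A small C sketch of the ideal six-stage timing: with equal stage times and no branches or stalls, instruction i completes stage s in time unit i + s + 1, which reproduces the timing diagram on the next slide. The stage names follow the slides; the loop bounds are the 9-instruction example used later.

    #include <stdio.h>

    /* Ideal k-stage timing: instruction i (0-based) leaves stage s (0-based)
       at time unit i + s + 1 when there are no branches or stalls. */
    int main(void) {
        const char *stage[] = { "FI", "DI", "CO", "FO", "EI", "WO" };
        int n = 9, k = 6;
        for (int i = 0; i < n; i++) {
            printf("instr %d:", i + 1);
            for (int s = 0; s < k; s++)
                printf("  %s@%d", stage[s], i + s + 1);
            printf("\n");            /* last instruction finishes at k + n - 1 = 14 */
        }
        return 0;
    }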

Page 40: William Stallings  Computer Organization  and Architecture 5 th  Edition

40

Timing of Pipeline

Page 41: William Stallings  Computer Organization  and Architecture 5 th  Edition

41

Speedup of Pipelining (1)

9 instructions, 6 stages
w/o pipelining: __ time units
w/ pipelining: __ time units
speedup = _____

Q: 100 instructions, 6 stages, speedup = ____

Q: n instructions, k stages, speedup = ____ Can you prove it (formally)?
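A quick way to fill in the blanks above, assuming equal stage times: without pipelining, n instructions take n·k time units; with pipelining they take k + (n − 1), which matches the formula derived a few slides later.

    #include <stdio.h>

    /* Ideal pipeline speedup S = n*k / (k + n - 1). */
    static double speedup(int n, int k) {
        return (double)(n * k) / (k + n - 1);
    }

    int main(void) {
        printf("n=9,   k=6: %d vs %d time units, speedup = %.2f\n",
               9 * 6, 6 + 9 - 1, speedup(9, 6));            /* 54 vs 14, about 3.86 */
        printf("n=100, k=6: speedup = %.2f\n", speedup(100, 6));  /* about 5.71 */
        /* as n grows without bound the speedup approaches k */
        return 0;
    }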

Page 42: William Stallings  Computer Organization  and Architecture 5 th  Edition

42

Pipelining – Discussion
Not all stages are needed by every instruction
e.g., a LOAD does not need WO

Assume all stages can be performed in parallel – e.g., no memory conflicts among FI, FO, and WO (which all access memory)

To simplify the pipeline hardware, the timing is set up assuming every instruction needs all stages

Assume there are no conditional branch instructions

Page 43: William Stallings  Computer Organization  and Architecture 5 th  Edition

43

Limitation by Branching
Conditional branch instructions can invalidate several instruction prefetches

In our example (see the next slide):
Instruction 3 is a conditional branch to instruction 15
The address of the next instruction is not known until instruction 3 has executed (at time unit 7), so the pipeline must be cleared

No instruction finishes in time units 9 to 12 – a performance penalty

Page 44: William Stallings  Computer Organization  and Architecture 5 th  Edition

44

Branch in a Pipeline

Page 45: William Stallings  Computer Organization  and Architecture 5 th  Edition

45

Page 46: William Stallings  Computer Organization  and Architecture 5 th  Edition

46

Limitation by Data Dependencies
Data needed by the current instruction may depend on a previous instruction that is still in the pipeline

E.g., A ← B + C followed by D ← A + E

Page 47: William Stallings  Computer Organization  and Architecture 5 th  Edition

47

Limitation by stage overhead

Ideally, more stages mean more speedup. However:

There is more overhead in moving data between buffers
There is more overhead in the preparation and delivery functions
The pipeline hardware needs more complex circuitry

Page 48: William Stallings  Computer Organization  and Architecture 5 th  Edition

48

Pipeline Performance

Cycle time: τ = max[τi] + d = τm + d
  τm: maximum stage delay, k: number of stages, d: time delay of a latch

Time to execute n instructions with pipelining: Tk = [k + (n − 1)]τ

Time to execute n instructions without pipelining: T1 = nkτ

Speedup: Sk = T1 / Tk = nkτ / [k + (n − 1)]τ = nk / [k + (n − 1)]

Page 49: William Stallings  Computer Organization  and Architecture 5 th  Edition

49

Pipeline Performance

Speedup of a k-stage pipeline compared with no pipelining

Q: n instructions, k stages, speedup = ____

)()]([

_

111

nknk

nknk

TT

ePerformanc

ePerformancS

k

pipeliningno

pipeliningk

Page 50: William Stallings  Computer Organization  and Architecture 5 th  Edition

50

Page 51: William Stallings  Computer Organization  and Architecture 5 th  Edition

51

Example
The execution of an instruction is divided into 4 sub-processes – instruction fetch, decode, operand fetch, and execute – each handled by its own functional unit, and each taking time Δt. The resulting 4-stage pipeline is shown in the figure.

(Time–space diagram: instructions 1–4 flow through the fetch, decode, operand-fetch, and execute stages, one Δt per stage.)

Page 52: William Stallings  Computer Organization  and Architecture 5 th  Edition

52

Example

Performance analysis:
T1 = nkΔt = 4 × 4 × Δt = 16Δt
Tk = [k + (n − 1)]Δt = [4 + (4 − 1)]Δt = 7Δt
Sk = T1 / Tk = 16/7

If one instruction takes time T to execute, then ideally, once the pipeline is full, one instruction completes every T/4, giving a speedup of 4.

Page 53: William Stallings  Computer Organization  and Architecture 5 th  Edition

53

Pipeline Performance Analysis

Two main metrics measure pipeline performance: throughput and efficiency.

Throughput is the number of tasks processed (results produced) per unit time; we distinguish the maximum throughput from the actual throughput.

Page 54: William Stallings  Computer Organization  and Architecture 5 th  Edition

54

Pipeline Performance Analysis

Maximum throughput: if the sub-processes take unequal times, the throughput is determined by the processing time Δt of the slowest sub-process:
  TPmax = 1 / max{Δti}

Actual throughput: because the pipeline has a fill time at the start and a drain time at the end, it cannot flow continuously, so the actual throughput is always less than the maximum:
  TP = n / [(k + n − 1)Δt]

Page 55: William Stallings  Computer Organization  and Architecture 5 th  Edition

55

Pipeline Performance Analysis

Efficiency is the utilization of the pipeline's stages. It is usually measured as the ratio of the time–space area in which the stages are actually working to the total time–space area of all k stages:

  E = (time–space area occupied by the n tasks) / (total time–space area of the k stages)
    = nkΔt / [k(k + n − 1)Δt] = n / (k + n − 1)
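A small C check of the throughput and efficiency formulas, using the 4-instruction, 4-stage example evaluated on the following slides; taking Δt as 1 time unit is an assumption for the sake of the printout.

    #include <stdio.h>

    /* Throughput TP = n / ((k + n - 1) * dt) and efficiency E = n / (k + n - 1)
       for an ideal pipeline with equal stage time dt. */
    int main(void) {
        int n = 4, k = 4;
        double dt = 1.0;
        double TPmax = 1.0 / dt;                /* 1 / max{dt_i} */
        double TP = n / ((k + n - 1) * dt);     /* 4 / 7 per dt */
        double E  = (double)n / (k + n - 1);    /* 16*dt / 28*dt = 4/7 */
        printf("TPmax = %.3f, TP = %.3f, E = %.3f\n", TPmax, TP, E);
        return 0;
    }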

Page 56: William Stallings  Computer Organization  and Architecture 5 th  Edition

56

Pipeline Performance Analysis

Compute the throughput and efficiency of this pipeline:

(Time–space diagram: 4 instructions flow through the fetch, decode, operand-fetch, and execute stages during time units 1–7.)

Page 57: William Stallings  Computer Organization  and Architecture 5 th  Edition

57

Pipeline Performance Analysis

Maximum throughput: TPmax = 1 / max{Δti} = 1/Δt

Actual throughput: TP = n / [(k + n − 1)Δt] = 4 / (7Δt)

Efficiency: E = (area occupied by the n tasks) / (total area of the k stages) = (4 × 4 × Δt) / (4 × 7 × Δt) = 16/28 ≈ 0.57

段总的时空区个任务占用的时空区

Page 58: William Stallings  Computer Organization  and Architecture 5 th  Edition

58

Pipeline Performance Analysis

Now let the stage times be unequal: fetch Δt, decode Δt, operand fetch 3Δt, execute Δt.

(Time–space diagram: each instruction takes 6Δt in total, and the 3Δt operand-fetch stage lets a new instruction start only every 3Δt, so the 4 instructions finish at time unit 15.)

Page 59: William Stallings  Computer Organization  and Architecture 5 th  Edition

59

Pipeline Performance Analysis

Exercise: assume n instructions with t(fetch) = 1, t(decode) = 2, t(execute) = 3, and n = 1000. Write expressions for the time to complete the n instructions (1) sequentially and (2) with pipelining, and compute the pipeline speedup.

Page 60: William Stallings  Computer Organization  and Architecture 5 th  Edition

60

Pipeline Performance Analysis

Exercise: an instruction pipeline has four stages – fetch (t1), decode (t2), operand fetch (t3), execute (t4) – as shown in the figure.
1. When t1 = t2 = t3 = t4, draw the time–space diagram for 10 consecutive tasks and compute the throughput and efficiency.
2. Repeat when t1 = t2, t3 = 2t1, t4 = 4t1.

Page 61: William Stallings  Computer Organization  and Architecture 5 th  Edition

61

Dealing with Branches

Multiple streams
Prefetch branch target
Loop buffer
Branch prediction
Delayed branching

Page 62: William Stallings  Computer Organization  and Architecture 5 th  Edition

62

Multiple Streams
Have two pipelines: prefetch each branch path into a separate pipeline, then use the appropriate one

Problems
Leads to bus and register contention delays
Multiple branches (i.e., additional branches entering the pipelines before the original branch decision is made) require still further pipelines

Can nevertheless improve performance, e.g., IBM 370/168

Page 63: William Stallings  Computer Organization  and Architecture 5 th  Edition

63

Prefetch Branch Target

In addition to the instructions following the branch, the branch target is prefetched
The target is kept until the branch is executed
Used by the IBM 360/91

Page 64: William Stallings  Computer Organization  and Architecture 5 th  Edition

64

Loop Buffer (1)
A small, very fast memory maintained by the fetch (IF) stage of the pipeline
Contains the n most recently fetched instructions, in sequence
If a branch is to be taken:
The hardware checks whether the target is in the buffer
If it is, the next instruction is fetched from the buffer
Otherwise it is fetched from memory
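A hedged C sketch of the loop-buffer check in the fetch stage. The buffer size, the direct-mapped indexing, and the read_memory interface are illustrative assumptions, not the actual CRAY-1 organization mentioned on the next slide.

    #define BUF_SIZE 256                      /* holds recently fetched instructions */

    static unsigned int buf_addr[BUF_SIZE];
    static unsigned int buf_instr[BUF_SIZE];
    static int          buf_valid[BUF_SIZE];

    unsigned int read_memory(unsigned int addr);   /* assumed memory interface */

    /* Fetch the instruction at 'target': use the loop buffer on a hit,
       otherwise fall back to memory and remember the instruction. */
    unsigned int fetch(unsigned int target) {
        unsigned int slot = target % BUF_SIZE;
        if (buf_valid[slot] && buf_addr[slot] == target)
            return buf_instr[slot];                /* hit: no memory access needed */
        buf_instr[slot] = read_memory(target);     /* miss: fetch from memory */
        buf_addr[slot]  = target;
        buf_valid[slot] = 1;
        return buf_instr[slot];
    }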

Page 65: William Stallings  Computer Organization  and Architecture 5 th  Edition

65

Loop Buffer (2)

Reduces memory access time

Very good for small loops and jumps
If the buffer is big enough to contain an entire loop, the loop's instructions need to be fetched from memory only once, on the first iteration

Used by the CRAY-1

Page 66: William Stallings  Computer Organization  and Architecture 5 th  Edition

66

Loop Buffer Diagram

Page 67: William Stallings  Computer Organization  and Architecture 5 th  Edition

67

Branch Prediction (1)

Predict whether a branch will be taken
If the prediction is right: no branch penalty
If the prediction is wrong: the pipeline must be emptied and the correct instruction fetched – a branch penalty

Page 68: William Stallings  Computer Organization  and Architecture 5 th  Edition

68

Branch Prediction (2)

Prediction techniques
Static:
Predict never taken
Predict always taken
Predict by opcode
Dynamic:
Taken/not taken switch
Branch history table

Page 69: William Stallings  Computer Organization  and Architecture 5 th  Edition

69

Branch Prediction (3)
Predict never taken
Assume the jump will not happen; always fetch the next sequential instruction
Used by the 68020 and VAX 11/780
The VAX will not prefetch after a branch if a page fault would result (an O/S vs. CPU design issue)

Predict always taken
Assume the jump will happen; always fetch the target instruction
Branches are taken more than 50% of the time

Page 70: William Stallings  Computer Organization  and Architecture 5 th  Edition

70

Branch Prediction (4)

Predict by opcode
Some instructions are more likely to result in a jump than others
The decision is based on the opcode of the branch instruction
Can achieve up to 75% success

Page 71: William Stallings  Computer Organization  and Architecture 5 th  Edition

71

Branch Prediction (5)

Taken/not taken switch
One or more bits associated with each conditional branch instruction reflect the recent history of that instruction
Recording the history of each conditional branch improves prediction accuracy

Good for loops

Page 72: William Stallings  Computer Organization  and Architecture 5 th  Edition

72

A note on branch prediction
With 1 history bit (prediction / actual outcome):
1/1, 1/1, 1/1, 1/0, 0/1, 1/1, 1/1, 1/0, 0/1, 1/1, 1/1, 1/0

For a branch that closes a loop such as

For ( …. )
    xxxxx

run three times in succession, the single history bit flips on the final (not-taken) iteration, so the next entry into the loop is mispredicted as well – two mispredictions per execution of the loop.

With 2 history bits, the prediction stays "taken" after the single not-taken exit, so only the exit itself is mispredicted – one misprediction per execution of the loop.
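A minimal C sketch of the taken/not-taken switch with two history bits (a saturating counter); the state names are illustrative.

    /* Two-bit saturating counter: predict taken in the upper two states.
       One not-taken outcome at the end of a loop moves STRONG_TAKEN only to
       WEAK_TAKEN, so re-entry into the loop is still predicted correctly. */
    enum { STRONG_NOT_TAKEN = 0, WEAK_NOT_TAKEN, WEAK_TAKEN, STRONG_TAKEN };

    int predict_taken(int state) {
        return state >= WEAK_TAKEN;
    }

    int update(int state, int taken) {
        if (taken && state < STRONG_TAKEN)
            state++;                        /* move toward strongly taken */
        else if (!taken && state > STRONG_NOT_TAKEN)
            state--;                        /* move toward strongly not taken */
        return state;
    }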

Page 73: William Stallings  Computer Organization  and Architecture 5 th  Edition

73

Branch Prediction State Diagram

Page 74: William Stallings  Computer Organization  and Architecture 5 th  Edition

74

Branch Prediction Flowchart

Taken/not taken switch using two bits to record the history

Page 75: William Stallings  Computer Organization  and Architecture 5 th  Edition

75

Homework: Problem 11.5

Page 76: William Stallings  Computer Organization  and Architecture 5 th  Edition

76

Dealing With Branches

A drawback of the history-bit approach: even when the branch is predicted taken, the target instruction cannot be fetched immediately – the branch instruction must first compute the target address.

Can the target instruction be fetched as soon as the branch is predicted taken? Not with this timing. A branch history table (also called a branch target buffer) can help: it stores information about branch instructions that have already been executed.

Page 77: William Stallings  Computer Organization  and Architecture 5 th  Edition

77

Dealing With Branches

Branch history table
A small cache memory associated with the instruction fetch stage of the pipeline
Each entry has three elements:
The address of the branch instruction
The target instruction or target address
State bits recording the recent use/history of the instruction
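A hedged sketch of one branch-history-table entry with the three elements listed above; the field widths and the 2-bit state are illustrative assumptions.

    #include <stdint.h>

    /* One branch-history-table entry. */
    struct bht_entry {
        uint32_t branch_addr;   /* address of the branch instruction (the lookup tag) */
        uint32_t target_addr;   /* branch target address (or a copy of the target instruction) */
        uint8_t  state;         /* history/state bits, e.g. a 2-bit counter */
    };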

Page 78: William Stallings  Computer Organization  and Architecture 5 th  Edition

78

Predict Never Taken Strategy

Figure 11.17 contrasts the branch-history-table scheme with the predict-never-taken strategy.

With predict never taken, the fetch stage always fetches the next sequential instruction. If some logic in the processor determines that the branch is taken, the next instruction is instead fetched from the target address (and the pipeline is flushed).

Page 79: William Stallings  Computer Organization  and Architecture 5 th  Edition

79

Dealing With Branches

Page 80: William Stallings  Computer Organization  and Architecture 5 th  Edition

80

Branch History Table Strategy
The branch history table is treated as a cache. Every prefetch triggers a lookup in the table. If no matching entry is found, the next sequential address is used for the fetch. If a matching entry is found, a prediction is made based on the state of the instruction: either the next sequential address or the branch target address is passed to the selection logic.

When the branch instruction executes, the execute stage reports the outcome to the branch-history-table logic so it can update the history bits for that instruction. If the prediction was wrong, the selection logic is also redirected to fetch from the correct address. If the branch instruction is not yet in the table, it is added, using some replacement algorithm.

Page 81: William Stallings  Computer Organization  and Architecture 5 th  Edition

81

Dealing With Branches

Page 82: William Stallings  Computer Organization  and Architecture 5 th  Edition

82

Delayed Branch

Automatically rearrange the instructions in the program so that the branch instruction occurs later than actually required
Do not take the jump until you have to

Page 83: William Stallings  Computer Organization  and Architecture 5 th  Edition

83

Intel 80486 Pipelining (1)
Fetch
Instructions are fetched from the cache or from external memory
They are placed in one of two 16-byte prefetch buffers
A buffer is refilled with new data as soon as the old data has been consumed by the decoder
On average 5 instructions are fetched per load
The fetch stage operates independently of the other stages, to keep the buffers full

Page 84: William Stallings  Computer Organization  and Architecture 5 th  Edition

84

Intel 80486 Pipelining (2)
Decode stage 1
Decodes the opcode and addressing-mode information, which occupy at most the first 3 bytes of the instruction
Can direct the D2 stage to obtain the rest of the instruction

Decode stage 2
Expands each opcode into control signals for the ALU
Controls the computation of the more complex addressing modes

Page 85: William Stallings  Computer Organization  and Architecture 5 th  Edition

85

Intel 80486 Pipelining (3)
Execute
ALU operations, cache access, register update

Writeback
Updates the registers and the status flags modified during the execute stage
Results are also sent to the cache and to the bus-interface write buffers

Page 86: William Stallings  Computer Organization  and Architecture 5 th  Edition

86

80486 Instruction Pipeline

Page 87: William Stallings  Computer Organization  and Architecture 5 th  Edition

87

Pentium 4 Registers

Page 88: William Stallings  Computer Organization  and Architecture 5 th  Edition

88

EFLAGS Register

Page 89: William Stallings  Computer Organization  and Architecture 5 th  Edition

89

Control Registers

Page 90: William Stallings  Computer Organization  and Architecture 5 th  Edition

90

MMX Register Mapping

MMX uses several 64-bit data types
3-bit register address fields are used, giving 8 MMX registers
There are no MMX-specific registers
The MMX registers are aliased to the lower 64 bits (the mantissa field) of the existing floating-point registers

Page 91: William Stallings  Computer Organization  and Architecture 5 th  Edition

91

MMX Register Mapping Diagram

Page 92: William Stallings  Computer Organization  and Architecture 5 th  Edition

92

Pentium Interrupt Processing
Interrupts: generated by hardware signals, which can occur at any time during program execution
Maskable
Nonmaskable

Exceptions: generated in software, provoked by the execution of an instruction
Processor-detected
Programmed

Page 93: William Stallings  Computer Organization  and Architecture 5 th  Edition

93

Pentium Interrupt Processing
Interrupt vector table
Each interrupt type is assigned a number, which indexes the vector table
256 interrupt vectors of 32 bits each

Exceptions and interrupts are organized into 5 priority classes
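A hedged sketch of the 256-entry table of 32-bit vectors described above, written in the style of the real-mode layout where each vector holds a segment:offset pair; this is an illustrative simplification (protected-mode gate descriptors carry more information).

    #include <stdint.h>

    /* 256 interrupt vectors of 32 bits each, indexed by the interrupt number. */
    struct int_vector {
        uint16_t offset;     /* loaded into IP (or EIP) */
        uint16_t segment;    /* loaded into CS */
    };

    struct int_vector ivt[256];

    /* Dispatch: the vector for interrupt 'n' supplies the handler's CS:IP. */
    struct int_vector get_vector(uint8_t n) {
        return ivt[n];
    }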

Page 94: William Stallings  Computer Organization  and Architecture 5 th  Edition

94

Pentium Interrupt Processing (1)
Interrupt handling:

1. If the transfer involves a change of privilege level, the current stack segment register and the current extended stack pointer (ESP) register are pushed onto the stack

2. The current value of the EFLAGS register is pushed onto the stack

Page 95: William Stallings  Computer Organization  and Architecture 5 th  Edition

95

Pentium Interrupt Processing (2)
3. The interrupt (IF) and trap (TF) flags are cleared
4. The current code segment (CS) register and the current instruction pointer (IP or EIP) are pushed onto the stack
5. If the interrupt is accompanied by an error code, the error code is also pushed onto the stack
6. The contents of the interrupt vector are fetched and loaded into the CS and IP (or EIP) registers