1 Fundamentals of Computer Design

(2007.10) (2) Why have multiprocessor systems become popular nowadays? What are their advantages? What are the challenges in the design and use of a multiprocessor system?
Ans:
(1) Improving performance by making a single CPU faster has run into limits and requires a large investment in hardware design. A multiprocessor architecture instead uses existing CPUs and gains performance through parallel processing at a much lower cost, which is why it has become the mainstream approach in today's systems.
(2) The main advantage is that each processor can execute the independent instructions of a program in parallel, so the processes finish sooner and the overall performance improves.
(3) Challenges in design and use include:
 i. keeping the data consistent (coherent) across processors
 ii. higher hardware design complexity
 iii. communication among the processors
 iv. the limited memory bandwidth that all processors must share

(2007.10) (5) What is "availability"? Given the mean time to failure (MTTF) and the mean time to repair (MTTR), calculate the mean time between failure (MTBF) and the availability. How can MTTF and availability be improved? Use a RAID disk array as an example to explain why it can have better availability than a single disk.
Ans:
(1) MTBF = MTTF + MTTR
    Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
Availability measures the fraction of the time between two failures during which the system delivers normal service.
(2) A RAID disk array combines several hard disks into one system; its key idea is to store redundant copies of the data across multiple disks, so the failure of a single disk does not bring the whole system down and the lost data can be recovered. When a failure occurs, a RAID array can restore normal operation quickly, which effectively reduces the MTTR; by the definition above, reducing MTTR raises availability. Over the lifetime of the system, higher availability also corresponds to a longer effective MTTF, so a RAID disk array improves both MTTF and availability compared with a single disk.
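A small worked example of the formulas (the numbers below are assumed for illustration only, not taken from the exam): with MTTF = 1,000,000 hours and MTTR = 24 hours,
\[ \mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR} = 1{,}000{,}000 + 24 = 1{,}000{,}024 \ \text{hours} \]
\[ \mathrm{Availability} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}} = \frac{1{,}000{,}000}{1{,}000{,}024} \approx 0.999976 \]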



(2007.03) (1a) (1b) The architecture team is considering enhancing a machine by adding vector hardware to it. When a computation is run in vector mode on the vector hardware, it is 10 times faster than the normal mode of execution. We call the percentage of time that could be spent using vector mode the percentage of vectorization. (a) What percentage of vectorization is needed to achieve a speedup of 2? (b) Suppose you have measured the percentage of vectorization for programs to be 70%. The

    hardware design group says they can double the speedup of the vector hardware with a significant additional engineering investment. You wonder whether the compiler crew could increase the use of vector mode as another approach to increase performance. How much of an increase in the percentage of vectorization (relative to current usage) would you need to obtain the same performance gain as doubling vector hardware speed?

    Ans:

(a) Let x be the percentage of vectorization. Using Amdahl's law with a vector-mode speedup of 10:
Speedup = 1 / ((1 - x) + x/10) = 2  =>  (1 - x) + x/10 = 0.5  =>  1 - 0.9x = 0.5  =>  x = 0.5556 = 55.56%
(b) Doubling the speedup of the vector hardware gives a vector-mode speedup of 10 × 2 = 20, so the target overall speedup is
1 / (0.3 + 0.7/20) = 1 / 0.335 ≈ 2.985
Let x be the percentage of vectorization that gives the same gain with the original (10×) vector hardware:
1 / ((1 - x) + x/10) = 1 / (0.3 + 0.7/20)  =>  (1 - x) + x/10 = 0.335  =>  1 - 0.9x = 0.335  =>  x = 0.7389 = 73.89%
The required increase in the percentage of vectorization is therefore 73.89% - 70% = 3.89%.

(2007.03) (8) (2003.10) (5) To design an instruction set, what factors do you need to consider in determining the number of registers?
Ans:
When deciding how many registers the instruction set should provide, the main considerations are:
(1) The addressing modes to be supported, because different addressing modes name different numbers of registers in one instruction, e.g.
 i. Immediate addressing mode: ADD R1, #3 names only one register.
 ii. Indexed addressing mode: ADD R3, (R1+R2) names three registers.
(2) The tradeoff between code size and decode complexity: more registers means more bits are needed to encode each operand, which enlarges the code and makes decoding more complex. The designer has to balance these factors; this is one reason a CISC machine typically offers fewer architectural registers than a RISC machine.

    (2006.10) (5) Describe one technique for exploiting instruction level parallelism. Ans:

ILP can be exploited in hardware or in software.
(1) Hardware: pipelining. The execution of successive instructions is overlapped in a pipeline; machines that push this further implement multiple-issue pipelines such as superscalar and VLIW processors.
(2) Software: loop unrolling. The compiler replicates the loop body so that independent instructions from several iterations appear together in one block and can be scheduled to fill the latency slots, removing pipeline stalls.

    (2006. 03) (5a) (5b) (5c) (5d) The following figure shows the trend of microprocessors in terms of transistor count. The transistor count on a processor chip increases dramatically over time. This is also known as Moores Law. (Ignore the figure) (a) Based on the straight line plotted on the figure, estimate the number of transistors

    available on a chip in 2020. (b) What does bit-level parallelism mean? How was bit-level parallelism utilized to increase

    performance of microprocessors in 70s and 80s? Is bit-level parallelism still an effective way to increase performance of mainstream applications today? Why?

    (c) Explain how instruction-level parallelism (ILP) can be exploited to improve processor performance. Is ILP still an effective way to increase processor performance for mainstream applications today? Why?

    (d) What does thread-level parallelism (TLP) mean? Why is it important for processors to take advantage of TLP today? Discuss the advantages and disadvantages of TLP.

Ans:
(a) Reading the straight line on the figure: the transistor count was about 10^4 in 1975 and about 10^7 in 1995. Assuming the same exponential trend and letting 10^x be the count in 2020:
(7 - 4) / (1995 - 1975) = (x - 4) / (2020 - 1975)  =>  x = 4 + 3 × 45/20 = 10.75
so a chip in 2020 should have roughly 10^10.75 transistors.
(b) i. Bit-level parallelism means operating on more bits at once by widening the processor word size (and the bus bandwidth). In the 70s and 80s word sizes grew from 8 to 16 to 32 bits, which directly increased performance: for example, adding two 16-bit numbers on an 8-bit machine needs one operation for the lower 8 bits and another for the upper 8 bits, while a 16-bit machine does it in one operation.
 ii. It is no longer an effective way to speed up mainstream applications: word sizes (and bus widths) have reached 32/64 bits, which already covers the operands of most programs, so widening them further gives little benefit.
(c) i. Instruction-level parallelism improves processor performance by overlapping the execution of independent instructions, using pipelining, superscalar issue and VLIW.
 ii. Its effectiveness for mainstream applications is largely exhausted: structural, control and data hazards prevent the pipeline from being kept fully busy, so extracting still more ILP costs a disproportionate amount of hardware for small gains.
(d) i. Thread-level parallelism means the processor runs several threads in parallel, each with its own architectural state (instructions, data, PC, registers).
 ii. Advantages: multiple independent instruction streams raise throughput, and for multi-threaded programs exploiting TLP is more cost-effective than trying to extract still more ILP from a single thread.
Disadvantages: the burden shifts to the programmer, who must write parallel (multi-threaded) programs, and a single thread by itself does not run faster.

    (2005.10) (1a) (1b) (a) Generally speaking, CISC CPUs have more complex instructions than RISC CPUs, and

    therefore need fewer instructions to perform the same tasks. However, typically one CISC instruction, since it is more complex, takes more time to complete than a RISC instruction. Assume that a certain task needs P CISC instructions and 2P RISC instructions, and that one CISC instruction takes 8T ns to complete, and one RISC instruction takes 2T ns. Under this assumption, which one has the better performance?

    (b) Compare the pros & cons of two types of encoding schemes: fixed length and variable length.

    Ans:

(a) CPU time: CISC: P × 8T = 8PT ns; RISC: 2P × 2T = 4PT ns. The RISC CPU finishes the task in half the time, so it has the better performance.
(b) Fixed length - Pros: simple, fast decoding and uniform instruction fetch, which keeps the CPI regular and suits pipelining. Cons: less flexible addressing modes and larger code size, since every instruction occupies the full width.
Variable length - Pros: rich addressing modes and smaller code size. Cons: decoding is harder, the CPI is less uniform, and pipelining is complicated because instruction boundaries are not known until decode.

    (2004.10) (1) The benchmarks for your new RISC computer show the following instruction mix:

Instruction Type   Frequency   Clock cycle count
ALU                30%         1
Loads              30%         2
Stores             25%         2
Branches           15%         2

    Assume there is one hardware enhancement that reduces the execution time of load/store instruction to 1 cycle. But it increases the execution time of a branch instruction to 3 cycles and the system clock cycle time by 15%. If you are the system designer, will you implement this hardware enhancement? Why?

Ans:
With the enhancement the instruction mix becomes:
Instruction Type   Frequency   Clock cycle count
ALU                30%         1
Loads              30%         1
Stores             25%         1
Branches           15%         3
CPU time = IC × CPI × CCT, so
CPI_original = 0.3×1 + 0.3×2 + 0.25×2 + 0.15×2 = 1.7
CPI_enhanced = 0.3×1 + 0.3×1 + 0.25×1 + 0.15×3 = 1.3
Speedup = CPU time_original / CPU time_enhanced = (CPI_original × CCT) / (CPI_enhanced × 1.15 CCT) = 1.7 / (1.3 × 1.15) = 1.7 / 1.495 = 1.137
Since the speedup is greater than 1, I would implement the hardware enhancement.

(2008.03) (5) Describe Amdahl's Law. (2004.10) (4) What is the important design principle addressed by Amdahl's Law? (2003.03) (1) What is the architectural design principle based on Amdahl's Law? Ans:
Amdahl's Law:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
The design principle it addresses is to make the common case fast: (1) an enhancement pays off only in proportion to the fraction of time it is actually used, so (2) the designer should spend the effort on the performance bottleneck, i.e. on the part that dominates execution time.
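A small worked example of the formula (assumed numbers, for illustration only): if 40% of the execution time can be enhanced by a factor of 10, then
\[ \text{Speedup}_{\text{overall}} = \frac{1}{(1-0.4) + 0.4/10} = \frac{1}{0.64} \approx 1.56 \]
and even an infinite speedup of that 40% could never exceed 1/(1-0.4) ≈ 1.67, which is why the common case should be made fast.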

(2004.10) (8) (2003.03) (4) Why can an SMT processor utilize CPU resources more efficiently than a superscalar processor?
Ans: A superscalar processor exploits only ILP within a single thread; when that thread's instructions are dependent on one another, issue slots and functional units sit idle and the CPU stalls. Simultaneous multithreading exploits ILP and TLP at the same time: unlike fine-grained multithreading, which switches to another thread only at a clock-cycle boundary, SMT can issue independent instructions from several threads in the same cycle, filling the slots that a single thread would leave empty, so the CPU's resources are used more efficiently than in a plain superscalar processor.

    (2004.03) (2) Assume the instruction mix and the latency of various instruction types listed in Table 1. Determine the speedup obtained by applying a compiler optimization that converts 50% of multiplies into a sequence of shifts and adds with an average length of three instructions.

Instruction Type                   Frequency   Clock cycle count
ALU arithmetic (add, sub, shift)   30%         1
Loads                              25%         2
Stores                             15%         2
Branches                           15%         4
Multiply                           15%         10

Ans:
The optimization converts half of the multiplies (7.5% of the original instructions) into three 1-cycle shift/add instructions each, so the mix becomes:
Instruction Type                   Frequency (relative to the original IC)   Clock cycle count
ALU arithmetic (add, sub, shift)   (30 + 7.5×3)% = 52.5%                      1
Loads                              25%                                        2
Stores                             15%                                        2
Branches                           15%                                        4
Multiply                           7.5%                                       10
The new frequencies sum to 52.5% + 25% + 15% + 15% + 7.5% = 115%, i.e. the instruction count grows to 1.15× the original (divide by 1.15 to normalize to 100%).
Cycles_original ∝ 0.3×1 + 0.25×2 + 0.15×2 + 0.15×4 + 0.15×10 = 3.2
Cycles_improved ∝ 0.525×1 + 0.25×2 + 0.15×2 + 0.15×4 + 0.075×10 = 2.675
Speedup = ExTime_original / ExTime_improved = 3.2 / 2.675 ≈ 1.20
(Equivalently, IC grows by 1.15× while the per-instruction CPI drops from 3.2 to 2.675/1.15 ≈ 2.33; the clock cycle time is unchanged.)
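For illustration, a small C sketch (not part of the original answer) of the kind of strength reduction the compiler is assumed to perform, turning one multiply into a three-instruction shift/add sequence:

#include <assert.h>

/* x * 10 rewritten as (x << 3) + (x << 1): two shifts and one add,         */
/* i.e. three 1-cycle ALU instructions in place of one 10-cycle multiply.   */
static unsigned times10_mul(unsigned x)   { return x * 10u; }
static unsigned times10_shift(unsigned x) { return (x << 3) + (x << 1); }

int main(void) {
    for (unsigned x = 0; x < 100000u; x++)
        assert(times10_mul(x) == times10_shift(x));   /* both give the same result */
    return 0;
}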


    (2007.03) (4) (2004.03) (6) What is the problem existing in the modern superscalar processor that motivates the design of the simultaneous multithreading architecture? Ans:

A superscalar processor exploits only ILP: the hardware pipeline can issue several instructions per cycle, but when the instructions of the single running thread depend on each other the CPU stalls and its resources (issue slots, functional units) go unused. This under-utilization is the problem that motivates simultaneous multithreading (SMT), which exploits ILP and TLP together: in every cycle it can issue independent instructions from different threads, so stalls in one thread are filled with work from another and the CPU stays busy.

(2003.10) (4) What is an important design principle addressed by Amdahl's Law? Explain why it expresses the law of diminishing returns. Ans:
(1) Amdahl's Law:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
(i) An enhancement pays off only in proportion to the fraction of time it is used; (ii) the designer should therefore concentrate on the performance bottleneck, i.e. make the common case fast.
(2) It expresses the law of diminishing returns because, no matter how large Speedup_enhanced becomes, Speedup_overall is bounded by 1 / (1 - Fraction_enhanced); each additional improvement to the already-enhanced fraction therefore yields a smaller and smaller overall gain.

    returns) (2002.10) (1) Why is MIPS a wrong measure for comparing performance among computers? Ans:

MIPS = IC / (ExTime × 10^6) = IC / (IC × CPI × CCT × 10^6) = clock frequency / (CPI × 10^6)
Performance is determined by all three factors: instruction count, clocks per instruction and clock cycle time. MIPS captures only the CPI and the clock frequency, not the instruction count, and the instruction count depends on the instruction set architecture (e.g. CISC vs RISC): a CISC machine has a smaller IC but a larger CPI, a RISC machine a larger IC but a smaller CPI. MIPS therefore cannot fairly compare computers with different ISAs (and it also varies from program to program), so it is a poor measure for comparing performance among computers.

    (2002.03) (1a) (1b) (a) You just make a fortune from your summer job. You want to use this money to buy the

    fastest PC available in the market. You are choosing between machine A and machine B. Machine A runs at 800 MHZ and the reported peak performance is 500 MIPS. The clock rate of machine B is 1 GHZ and the reported peak performance is 700 MIPS. Are you able to decide which machine to buy? Why?

    (b) The benchmark for machine A shows the following instruction mix:

    Instruction Type Frequency CPI A 40% 1 B 30% 4 C 15% 2 D 15% 6

    The architectural development team is studying a hardware enhancement. According to the simulation result, the enhancement will reduce the instruction counts of instruction A, B, C and D by 10%, 20%, 50% and 5%, respectively. But it will lengthen the clock cycle time by 15%. Should they implement this new architectural feature in their next generation product?

Ans:
(a) MIPS = IC / (ExTime × 10^6) = clock rate / (CPI × 10^6). Peak MIPS and the clock rate alone are not enough to decide: execution time also depends on the instruction count, the two machines may use different ISAs, and the peak rating may never be reached on real programs. So no, this information does not tell us which machine to buy.
(b) With the enhancement the instruction mix of machine A becomes:
Instruction Type   Frequency            CPI
A                  40% × 90% = 36%      1
B                  30% × 80% = 24%      4
C                  15% × 50% = 7.5%     2
D                  15% × 95% = 14.25%   6

Cycles_original ∝ 0.4×1 + 0.3×4 + 0.15×2 + 0.15×6 = 2.8
Cycles_improved ∝ (0.36×1 + 0.24×4 + 0.075×2 + 0.1425×6) × 1.15 = 2.325 × 1.15 ≈ 2.674
Speedup = ExTime_original / ExTime_improved = 2.8 / 2.674 ≈ 1.05
The speedup is (slightly) greater than 1, so the new architectural feature is worth implementing in the next-generation product.

(2001.03) (1) Can you give one example of high-level optimization performed by an optimizing compiler? (1999.03) (1) Can you give an example of high-level optimizations in modern compiler design? Ans:

High-level compiler optimizations include:
(1) Loop interchange: exchange the nesting order of loops so that a row-major array is traversed row by row, improving spatial locality.
Before:
for (k = 0; k < 100; k = k+1)
  for (j = 0; j < 100; j = j+1)
    for (i = 0; i < 5000; i = i+1)
      x[i][j] = 2 * x[i][j];
After:
for (k = 0; k < 100; k = k+1)
  for (i = 0; i < 5000; i = i+1)
    for (j = 0; j < 100; j = j+1)
      x[i][j] = 2 * x[i][j];
(2) Loop fusion: merge loops that touch the same data, so the second use finds the data still in the cache (better temporal locality).
Before:
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    a[i][j] = 1/b[i][j] * c[i][j];
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1)
    d[i][j] = a[i][j] + c[i][j];
After:
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    a[i][j] = 1/b[i][j] * c[i][j];
    d[i][j] = a[i][j] + c[i][j];
  }
(3) Blocking: operate on submatrices whose size fits in the cache, so each block of data is reused before it is evicted (fewer capacity misses).
Before:
for (i = 0; i < N; i = i+1)
  for (j = 0; j < N; j = j+1) {
    r = 0;
    for (k = 0; k < N; k = k+1) {
      r = r + y[i][k]*z[k][j];
    };
    x[i][j] = r;
  };
After:
for (jj = 0; jj < N; jj = jj+B)
  for (kk = 0; kk < N; kk = kk+B)
    for (i = 0; i < N; i = i+1)
      for (j = jj; j < min(jj+B-1,N); j = j+1) {
        r = 0;
        for (k = kk; k < min(kk+B-1,N); k = k+1) {
          r = r + y[i][k]*z[k][j];
        };
        x[i][j] = x[i][j] + r;
      };

    (2000.10) (1) Can you explain the following terms? (1) The diminishing return in Amdahls law. (2) Convoy and chime with respect to vector processors. (3) Vector stride with respect to the execution of vector processors. Ans:

(1) Amdahl's law:
Speedup_overall = 1 / ((1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced)
(i) The overall speedup is limited by the fraction of time the enhancement can be used; (ii) as Speedup_enhanced grows, Speedup_overall approaches the bound 1 / (1 - Fraction_enhanced), so each further improvement of the enhanced part yields a smaller and smaller overall gain - this is the diminishing return in Amdahl's law.
(2) Convoy: the set of vector instructions that can begin execution together in one clock period because there are no structural hazards and no data hazards among them; a following convoy must wait until the previous one completes. Chime: the unit of time taken to execute one convoy; a vector sequence of m convoys executes in roughly m chimes, i.e. about m × n clock cycles for vector length n.
(3) Vector stride: the distance, in elements, separating the memory elements that are gathered into (or scattered from) one vector register during a vector operation. The stride value is placed in a general-purpose register; LVWS (load vector with stride) fetches the elements into a vector register and SVWS (store vector with stride) writes them back, which is how non-unit-stride vectors (e.g. matrix columns) are handled.

    (1999.10) (1) A CPU designer wants to change the pipeline design of a CPU. How will his/her action affect the 3 metrics on the right hand side of the following equation? CPU time = instruction count clocks per instruction clock cycle time. If the optimizing compiler developed for the CPU is rewritten, how will the metrics be affected? Ans:

(1) Changing the pipeline design mainly affects the clock cycle time (more, shorter stages allow a faster clock) and the CPI (a deeper pipeline suffers more stall cycles from hazards). The instruction count is fixed by the instruction set and the program - e.g. a RISC ISA gives a larger code size/IC but a lower CPI, a CISC ISA a smaller IC but a higher CPI - so it is not changed by the pipeline redesign.
(2) Rewriting the optimizing compiler affects the CPI (better scheduling of the code sequence removes stall cycles) and the instruction count (better code generation needs fewer instructions); it does not change the clock cycle time.

    (1998.10) (1) To evaluate a CPU performance, we use the following equation: Execution time = instruction count CPI clock cycle time. Can you explain how instruction set design, architecture design, hardware technology, and compiler technology affect the three components in the equation? Ans:

(1) Instruction set design affects the instruction count and the CPI: a RISC ISA gives a larger code size/IC but a lower CPI, a CISC ISA a smaller IC but a higher CPI.
(2) Architecture (organization) design affects the CPI (and the clock cycle time): the pipeline structure, memory hierarchy and degree of parallelism determine how many clocks each instruction effectively takes.
(3) Hardware technology affects the clock cycle time: a faster process lets the processor run at a higher rate, i.e. with a shorter cycle.
(4) Compiler technology affects the CPI and the instruction count: an optimizing compiler schedules the code sequence to remove stall cycles (lower CPI) and emits fewer instructions (lower IC).
In summary:
                  Instruction Count   CPI   Clock Cycle Time
Program            X                  X
Compiler           X                  X
Instruction Set    X                  X
Organization                          X     X
Technology                                  X

Instruction-Level Parallelism and Its Exploitation
(2008.10) (2) What is Tomasulo's scheme? How does it resolve WAR (write after read) and WAW (write after write) hazards?

    Ans:

The Tomasulo algorithm is a hardware algorithm developed in 1967 by Robert Tomasulo at IBM. It allows sequential instructions that would normally be stalled due to certain dependences to execute non-sequentially (in-order issue, out-of-order execution). It was first implemented in the IBM 360/91's floating-point unit. RAW hazards are avoided by delaying an instruction until its operands are available and executing it only then. WAR and WAW hazards, which arise from name dependences, are eliminated by register renaming: all destination registers are renamed, including those with a pending read or write from an earlier instruction, so that an out-of-order write cannot affect any instruction that depends on an earlier value of the operand. The algorithm differs from scoreboarding in that it uses register renaming: where a scoreboard resolves WAW and WAR hazards by stalling, register renaming lets instructions keep issuing. The Tomasulo algorithm also uses a common data bus (CDB) on which computed values are broadcast to all the reservation stations that may need them, allowing more parallel execution of instructions that would otherwise stall under scoreboarding.

    (2007.10) (4) What is branch prediction and why is it important? Draw the state diagram to illustrate the 2-bit branch prediction mechanism. How to predict the branch target? Ans:

(1) Branch prediction guesses, at instruction-fetch time, whether a branch will be taken, so the pipeline can keep fetching along the predicted path instead of stalling until the branch is resolved. Since branches are frequent, a deep or multiple-issue pipeline would otherwise lose many cycles to control hazards, so accurate prediction is essential to keeping the CPU busy.
(2) The 2-bit prediction mechanism is a saturating counter with four states: Predict Taken (strong), Predict Taken (weak), Predict Not Taken (weak), Predict Not Taken (strong). A taken branch moves the state one step toward "strong taken", a not-taken branch one step toward "strong not taken"; starting from the initial state, a prediction therefore has to be wrong twice in a row before the predicted direction flips.
(3) The branch target is predicted with a branch-target buffer (BTB): a small cache indexed by the branch PC that stores the predicted target PC, so the target address is available already during instruction fetch. (A small C sketch of the 2-bit counter follows.)
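A minimal C sketch of the 2-bit saturating counter described above (the table size and the indexing are arbitrary assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 1024                  /* assumed predictor table size */
static uint8_t counters[ENTRIES];     /* each entry 0..3, initially 0 = strong not-taken */

static bool predict(uint32_t branch_pc) {
    return counters[branch_pc % ENTRIES] >= 2;     /* 2 or 3 => predict taken */
}

static void update(uint32_t branch_pc, bool taken) {
    uint8_t *c = &counters[branch_pc % ENTRIES];
    if (taken  && *c < 3) (*c)++;     /* saturate at 3 (strong taken) */
    if (!taken && *c > 0) (*c)--;     /* saturate at 0 (strong not-taken) */
}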

    (2007.03) (5) Explain the differences between dynamic scheduling with and without speculation. Ans:

Dynamic scheduling with speculation adds hardware speculation to branch prediction: instructions after a predicted branch are fetched, issued and executed before the branch outcome is known, and their results become architecturally visible only at a commit stage that updates the register file or memory. Both variants are based on Tomasulo's algorithm:
(1) Tomasulo's algorithm with speculation - four steps: Issue; Execution; Write result (write the result to the CDB and hold it in a hardware buffer, the reorder buffer); Commit (update the register file or memory, in program order).
(2) Tomasulo's algorithm without speculation - three steps: Issue; Execution; Write result (write the result to the CDB and update the register file or memory directly).

(2007.03) (6) (2002.10) (6) Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the mis-prediction penalty is always 4 cycles, and the buffer miss penalty is always 3 cycles. Assume 90% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.
Ans:
CPI_noBTB = 1 + 0.15×2 = 1.3
CPI_BTB = 1 + 0.15×0.1×3 + 0.15×0.9×0.1×4 = 1 + 0.045 + 0.054 = 1.099
Speedup = ExTime_noBTB / ExTime_BTB = CPI_noBTB / CPI_BTB = 1.3 / 1.099 = 1.18289

    (2006.10) (3) Increasing the size of a branch-prediction buffer means that it is less likely that two branches in a program will share the same predictor. A single predictor predicting a single branch instruction is generally more accurate than is that same predictor serving more than one branch instruction. (a) List a sequence of branch taken and not taken actions to show a simple example if 1-bit

    predictor sharing that reduces mis-prediction rate. (b) List a sequence of branch taken and not taken actions that show a simple example of how

    sharing a 1-bit predictor increases mis-prediction rate. Ans:

Let b1 and b2 be two branches in the loop body, with b1 executed just before b2 in every iteration; b1p and b2p denote the predictions made for b1 and b2. All predictors start in the NT (not-taken) state.
(a) With a separate 1-bit predictor per branch:
b1p: NT T T T T T  NT NT NT NT
b1 : T  T T T T NT NT NT NT NT
b2p: NT T T T T T  NT NT NT NT
b2 : T  T T T T NT NT NT NT NT
b1 mis-prediction rate: 20%; b2 mis-prediction rate: 20%.
With b1 and b2 sharing one 1-bit predictor:
b1p: NT T T T T T  NT NT NT NT
b1 : T  T T T T NT NT NT NT NT
b2p: T  T T T T NT NT NT NT NT
b2 : T  T T T T NT NT NT NT NT
b1 mis-prediction rate: 20%; b2 mis-prediction rate: 0%.
Because b1 and b2 behave identically and b1 executes first, b1's outcome trains the shared predictor just before b2 uses it, so sharing the 1-bit predictor reduces the mis-prediction rate.
(b) With a separate 1-bit predictor per branch (b1 always taken, b2 never taken):
b1p: NT T  T  T  T  T  T  T  T  T
b1 : T  T  T  T  T  T  T  T  T  T
b2p: NT NT NT NT NT NT NT NT NT NT
b2 : NT NT NT NT NT NT NT NT NT NT
b1 mis-prediction rate: 10%; b2 mis-prediction rate: 0%.
With a shared 1-bit predictor:
b1p: NT NT NT NT NT NT NT NT NT NT
b1 : T  T  T  T  T  T  T  T  T  T
b2p: T  T  T  T  T  T  T  T  T  T
b2 : NT NT NT NT NT NT NT NT NT NT
b1 mis-prediction rate: 100%; b2 mis-prediction rate: 100%.
Because the two branches always go in opposite directions, each one retrains the shared predictor the wrong way for the other, so sharing the 1-bit predictor increases the mis-prediction rate.


    (2006.10) (4) For the following code sequence, assuming the pipeline latency listed in the following table, show how it can be scheduled in a single-issue pipeline without delays. You could transform the code and unroll the loop as many times as needed.

Foo: L.D    F0, 0(R1)
     L.D    F4, 0(R2)
     MUL.D  F0, F0, F4
     ADD.D  F2, F0, F2
     DADDUI R1, R1, #-8
     DADDUI R2, R2, #-8
     BNEZ   R1, Foo

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op           3
FP ALU op                      Store double                2
Load double                    FP ALU op                   1
Load double                    Store double                0

    Ans:


    (2006.03) (1) For the code segment shown below, use standard MIPS 5-stage pipeline (IF, ID, EX, MEM, WB) with register forwarding and no branch prediction to answer the following questions. Assume the branch is resolved in the ID stage, and ld instructions hit in the cache. The processor has only one floating point ALU, the execution latency of the floating point ALU is three cycles, and the floating point ALU is pipelined. Inst 1: LD F1, 45(R2) Inst 2: ADD.D F7, F1, F5 Inst 3: SUB.D F8, F1, F6 Inst 4: BNEZ F7, target Inst 5: ADD.D F8, F8, F5 Inst 6: target SUB.D F2, F3, F4 (a) Identify each dependency by type (data, and output dependency); list the two instructions

    involved. (b) Assume a non-taken branch. How many cycles does the code segment execute? Please show

    how the code segment scheduled in the pipeline.


    Note: Please write down all the additional architectural assumptions that your answers are based on.

Ans: (a) Data dependences: (1,2) and (1,3) through F1, (2,4) through F7, (3,5) through F8.
Output dependence: (3,5) through F8 (instructions 3 and 5 both write F8).
(b)

    LD Mem Read SD WB Read

    Cycles Instrs.

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 181 F D E M W 2 F D S E E E M W 3 F S D S S E E E M W 4 F S S D E M W 5 F D S E E E M W 6 F S D S S E E E M W

    code 18 cycles (2006.03) (3) What is a tournament branch predictor? Ans:

A tournament branch predictor combines two predictors - typically one based on local (per-branch) history and one based on global history - together with a selection mechanism that decides, for each branch, which predictor to believe. The selector is itself a small state machine (e.g. a 2-bit saturating counter per entry) updated according to which of the two predictors was correct; its transitions are labelled predictor #1 / predictor #2 (0 = incorrect, 1 = correct), so over time each branch uses whichever predictor has been more accurate for it.


    (2006.03) (4) Here is a code sequence for a two-issue superscalar that can issue a combination of one memory reference and one ALU operation, or a branch by itself. Show how the following code segment can be improved using a predicated form of LW (LWC).

    First instruction slot Second instruction slot LW R1, 40(R2) ADD R3, R4, R5

    ADD R6, R3, R7 BEQZ R10, L LW R8, 0(R10)

    L: LW R9, 0(R8) Ans:

    First instruction slot Second instruction slot LW R1, 40(R2) ADD R3, R4, R5

    Waste ADD R6, R3, R7 BEQZ R10, L LW R8, 0(R10)

    L: LW R9, 0(R8)

In this schedule the first instruction slot of the second cycle is wasted, and the loads of R8 and R9 form a chain of true data dependences behind the branch that stalls the pipeline. With a predicated load LWC (a LW that executes only when its condition register is non-zero), the load of R8 can be hoisted into the empty slot, giving:

    First instruction slot Second instruction slot LW R1, 40(R2) ADD R3, R4, R5 LWC R8, 0(R10), R10 ADD R6, R3, R7 BEQZ R10, L

    L: LW R9, 0(R8)

If R10 = 0, LWC R8, 0(R10), R10 does nothing, so the semantics of the original branch are preserved; if R10 is non-zero, R8 is loaded earlier in the otherwise wasted first instruction slot, so the true data dependence through R8 no longer stalls the pipeline.

    (2005.10) (2) For the code segment shown below, use standard MIPS 5-stage pipeline (IF, ID, EX, MEM, WB) with register forwarding and delayed branch semantics to answer the following questions (note the pipeline latency is listed in table 1):

Loop: L.D    F0, 0(R1)
      MULT.D F0, F0, F2
      L.D    F4, 0(R2)
      ADD.D  F0, F0, F4
      S.D    0(R2), F0
      SUBI   R1, R1, 8
      SUBI   R1, R2, 8
      BNEQZ  R1, Loop

Instruction producing result   Instruction using result   Latency in clock cycles
FP ALU op                      Another FP ALU op           3
FP ALU op                      Store double                2
Load double                    FP ALU op                   1
Load double                    Store double                0

    (a) Perform loop unrolling and schedule the codes without any delay. (b) Perform loop unrolling and schedule the codes on a simple two-issue, statically scheduled

    superscalar MIPS pipeline without any delay. This processor can issue two instructions per clock cycle: one of the instructions can be a load, store, branch or integer ALU operation, and the other can be any floating-point operation.

    Ans:


    (2005.03) (2) Show a software-pipelined version of this loop, which increments all the elements of an array whose starting address is in R1 by the contents of F2:

    Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop

    Be sure to include the start-up and clean-up code in your answer. (2002.10) (5) Show a software-pipelined version of the following code segment:

    Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop

    You may omit the start-up and clean-up code. Ans:

[Reference: http://developer.apple.com/hardwaredrivers/ve/software_pipelining.html, textbook pp. 248-250]
The body of one iteration is:
L.D    F0, 0(R1)
ADD.D  F4, F0, F2
S.D    F4, 0(R1)
Software pipelining interleaves these instructions from different iterations:
Cycle 1: L.D (iter. 1)
Cycle 2: ADD.D (iter. 1), L.D (iter. 2)
Cycle 3: S.D (iter. 1), ADD.D (iter. 2), L.D (iter. 3)
Cycle 4: S.D (iter. 2), ADD.D (iter. 3)
Cycle 5: S.D (iter. 3)
The resulting code:
Start-up code:
  L.D    F0, 0(R1)
  ADD.D  F4, F0, F2
  L.D    F0, -8(R1)
Loop:
  S.D    F4, 0(R1)
  ADD.D  F4, F0, F2
  L.D    F0, -16(R1)
  DADDUI R1, R1, #-8
  BNE    R1, R2, Loop
Clean-up code:
  S.D    F4, 0(R1)
  ADD.D  F4, F0, F2
  S.D    F4, -8(R1)

(2005.03) (3) Suppose we have a deeply pipelined processor, for which we implement a branch-target buffer for the conditional branches only. Assume that the mis-prediction penalty is always 5 cycles, and the buffer miss penalty is always 3 cycles. Assume 95% hit rate and 90% accuracy, and 15% branch frequency. How much faster is the processor with the branch-target buffer versus a processor that has a fixed 2-cycle branch penalty? Assume a base CPI without branch stalls of 1.
Ans:
CPI_noBTB = 1 + 0.15×2 = 1.3
CPI_BTB = 1 + 0.15×0.05×3 + 0.15×0.95×0.1×5 = 1 + 0.0225 + 0.07125 = 1.09375
Speedup = ExTime_noBTB / ExTime_BTB = CPI_noBTB / CPI_BTB = 1.3 / 1.09375 = 1.18857

    (2005.03) (7) (2004.03) (4) Use code examples to illustrate three types of pipeline hazards. Ans:

Using the MIPS 5-stage pipeline:
(1) Structural hazard: two instructions need the same hardware resource in the same cycle. With a single memory port, in the code below instruction 1 (the load) accesses data memory in cycle 4 while instruction 4 performs its instruction fetch in the same cycle; both need the one memory port.
LD    F0, 0(R2)
SUB.D F1, F2, F3
ADD.D F4, F5, F6
SUB.D F7, F8, F9
(2) Control hazard: the instructions after a branch cannot be fetched correctly until it is known whether the branch is taken or not taken. In the code below, which instruction follows BNEZ depends on the value of R1.
LOOP: LD     F0, 0(R1)
      ADD.D  F2, F0, F4
      DADDUI R1, R1, #-8
      BNEZ   R1, LOOP
(3) Data hazard:
 i.  Read after write (RAW) hazard, from a true data dependence:
     ADD.D F0, F4, F2
     SUB.D F8, F0, F6
 ii. Write after read (WAR) hazard, from an anti-dependence:
     ADD.D F4, F0, F2
     SUB.D F0, F8, F6
 iii. Write after write (WAW) hazard, from an output dependence:
     ADD.D F0, F4, F2
     SUB.D F0, F8, F6

    (2004.10) (5) What are correlating branch predictors? (2000.10) (4) Can you explain how correlated predictors, also called two-level predictors, work in branch prediction? Ans:

A correlating (two-level) branch predictor uses the behaviour of other, recently executed branches - not just the past behaviour of the branch being predicted - because whether a branch is taken is often correlated with the outcomes of the branches that ran just before it.
(1) The global history of the last m branches (taken / not taken) is kept in a shift register.
(2) An (m, n) predictor uses that m-bit global history to choose one of 2^m separate n-bit local predictors for the branch; the selected n-bit counter supplies the prediction and is the one updated with the outcome. (See Lecture 3, p. 42; a small C sketch follows.)
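A minimal C sketch of an (m, n) = (2, 2) correlating predictor (the entry count and indexing are assumptions for illustration): the outcomes of the last two branches select one of four 2-bit counters for each branch entry.

#include <stdbool.h>
#include <stdint.h>

#define ENTRIES 1024
static uint8_t counters[ENTRIES][4];   /* 4 two-bit counters per branch entry */
static unsigned global_history;        /* low 2 bits: outcomes of the last two branches */

static bool predict(uint32_t pc) {
    return counters[pc % ENTRIES][global_history & 3] >= 2;   /* 2,3 => taken */
}

static void update(uint32_t pc, bool taken) {
    uint8_t *c = &counters[pc % ENTRIES][global_history & 3];
    if (taken  && *c < 3) (*c)++;                              /* saturating update */
    if (!taken && *c > 0) (*c)--;
    global_history = ((global_history << 1) | taken) & 3;     /* shift in the outcome */
}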


    (2004.10) (6) For the following code sequence:

    L.D F6, 34(R2) L.D F2, 45(R3) MULT.D F0, F2, F4 SUB.D F8, F6, F2 DIV.D F10, F0, F6 ADD.D F6, F8, F2

    Identify instruction pairs with data, anti and output dependence. Ans:

    (1) Data dependence (RAW hazard): i. L.D F6, 34(R2)

    SUB.D F8, F6, F2 ii. L.D F6, 34(R2)

    DIV.D F10, F0, F6 iii. L.D F2, 45(R3)

    MULT.D F0, F2, F4 iv. L.D F2, 45(R3)

    SUB.D F8, F6, F2 v. L.D F2, 45(R3)

    ADD.D F6, F8, F2 vi. MULT.D F0, F2, F4

DIV.D F10, F0, F6 vii. SUB.D F8, F6, F2
ADD.D F6, F8, F2 (2) Anti dependence (WAR hazard):

    i. SUB.D F8, F6, F2 ADD.D F6, F8, F2

    ii. DIV.D F10, F0, F6 ADD.D F6, F8, F2

    (3) Output dependence (WAW hazard): i. L.D F6, 34(R2)

    ADD.D F6, F8, F2


    (2004.03) (3) Consider the following code segment within a loop body:

    If (x is even) then /* branch b1 */ a++; If(x is a multiple of 10) then /* branch b2 */ b++;

    For the following list of 10 values of x to be processed by 10 iterations of this loop: 8, 9, 10, 11, 12, 20, 29, 30, 31, 32, determine the prediction accuracy of b1 and b2 using a 2-bit predictor as illustrated in Figure 1. Assume the initial state is non-taken in the branch history table. (Ignore the figure) Ans:


Iter.   1  2  3  4  5  6  7  8  9  10
b1     NT  T NT  T NT NT  T NT  T NT
b1p    N2 N1 T2 N2 T2 T2 N2 T2 N2 T2
b2      T  T NT  T  T NT  T NT  T  T
b2p    N2 T2 T1 N1 T1 T1 N1 T1 N1 T1
Comparing each prediction with the actual outcome, the prediction accuracy is 10% for b1 and 30% for b2.

    Correlating Branch Predictor

    (2003.10) (1) For the following code sequence:

    L.D F6, 34(R2) (1)* L.D F2, 45(R3) (1)* MULT.D F0, F2, F4 (2)* SUB.D F8, F6, F2 (2)* DIV.D F10, F0, F6 (40)* ADD.D F6, F8, F2 (2)*

    * the number represents the exectuion latency of the corresponding instruction. (For example, L.D. instruction starts execution at time t and completes execution at t+1).

    (a) Identify instruction pairs with data dependence, anti-dependence and output dependence. (b) On the scoreboarding implementations, are there instructions that are stalled due to the

WAR or WAW hazards? If yes, identify the instruction pairs that cause the pipeline to stall. (c) Describe how the Tomasulo algorithm resolves WAR and WAW hazards.


    Ans: (a) Data dependence: (1,4), (1,5), (2,3), (2,4), (2,6), (3,5), (4,6)

    Anti-dependence: (4,6), (5,6) Output dependence: (1,6)

(b) Scoreboarding timing (Issue / Read operands / Execution complete / Write result):
L.D    F6, 34(R2)  (1):   1 /  2 /  3 /  4
L.D    F2, 45(R3)  (1):   5 /  6 /  7 /  8
MULT.D F0, F2, F4  (2):   6 /  9 / 11 / 12
SUB.D  F8, F6, F2  (2):   7 /  9 / 11 / 12
DIV.D  F10, F0, F6 (40):  8 / 13 / 53 / 54
ADD.D  F6, F8, F2  (2):  13 / 14 / 16 / 17
No instruction is stalled because of a WAR or WAW hazard in this sequence; the stalls above are caused by RAW hazards and by structural (resource) conflicts.

(c) The Tomasulo algorithm resolves WAR and WAW hazards by register renaming through the reservation stations, so instructions are not stalled by them; with in-order issue, out-of-order execution and out-of-order completion, only true (RAW) dependences constrain the schedule, which improves performance.

    (2003.10) (3) You are designing an embedded system where power, cost and complexity of implementations are the major design considerations. Which technique will you choose to exploit ILP, superscalar with dynamic instruction scheduling or VLIW? You need to justify your answer. Ans:

[Reference: http://www.embedded.com/story/OEG20010222S0039]
I would choose VLIW. A superscalar with dynamic instruction scheduling (Tomasulo's algorithm) needs a complex pipeline that supports in-order issue, out-of-order execution and out-of-order completion - reservation stations, renaming logic, a common data bus - which costs area, power and design effort. A VLIW relies on the compiler, at compile time rather than run time, to find independent instructions and pack them into one large, fixed-format instruction packet, so the hardware stays simple and low-power; the complexity moves into the compiling techniques (instruction scheduling, handling branches, balancing the load on the functional-unit slots). Since power, cost and implementation complexity are the dominant constraints of an embedded system, VLIW is the more suitable way to exploit ILP here.


    (2003.10) (6) Explain how loop unrolling can be used to improve pipeline scheduling. Ans:

Loop unrolling replicates the loop body several times (adjusting the index and branch bookkeeping), so that independent instructions from different iterations appear together in one straight-line block of code. The scheduler can then interleave those independent instructions to fill the latency slots of loads and FP operations, removing pipeline stalls and improving performance. (A small C sketch follows.)
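A minimal C sketch of loop unrolling (unroll factor 4; it assumes n is a multiple of 4):

/* Original loop:
 *   for (int i = 0; i < n; i++) x[i] = x[i] * s;
 * Unrolled by 4: the four statements in the body are independent of one another,
 * so the compiler/scheduler can interleave them to hide load and FP latency.   */
void scale(double *x, double s, int n) {
    for (int i = 0; i < n; i += 4) {
        x[i]     = x[i]     * s;
        x[i + 1] = x[i + 1] * s;
        x[i + 2] = x[i + 2] * s;
        x[i + 3] = x[i + 3] * s;
    }
}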

    (2003.03) (3) What is the problem that a Branch Target Buffer tries to solve? Ans:

A branch-target buffer (BTB) is a small cache accessed with the address of a branch instruction (the branch PC) that returns the predicted target address (the predicted PC). Without it, the target is not known until the instruction has at least reached the decode stage, so fetch must stall or fetch the wrong instructions; with a BTB the predicted target is available already in the instruction fetch stage, so the processor keeps fetching from the predicted path and branches do not stall the pipeline. (A small C sketch follows.)
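A minimal C sketch of a direct-mapped BTB (the entry count and indexing are assumptions for illustration):

#include <stdbool.h>
#include <stdint.h>

#define BTB_ENTRIES 512
struct btb_entry { bool valid; uint32_t branch_pc; uint32_t target_pc; };
static struct btb_entry btb[BTB_ENTRIES];

/* During fetch: returns true and sets *next_pc if the PC hits in the BTB. */
static bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];   /* word-aligned PCs */
    if (e->valid && e->branch_pc == pc) { *next_pc = e->target_pc; return true; }
    return false;
}

/* When a taken branch resolves: record (or refresh) its target. */
static void btb_update(uint32_t pc, uint32_t target) {
    struct btb_entry *e = &btb[(pc >> 2) % BTB_ENTRIES];
    e->valid = true; e->branch_pc = pc; e->target_pc = target;
}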

    (2003.03) (7) Consider the execution of the following loop on a dual-issue, dynamically scheduling processor with speculation:

    Loop: L.D F0, 0(R1) ADD.D F4, F0, F2 S.D F4, 0(R1) DADDUI R1, R1, #-8 BNE R1, R2, Loop

    Assume : (a) Both a floating-point and integer operations can be issued on every clock cycle even if they

    are dependent. (b) One integer functional unit is used for both the ALU operation and effective address

    calculations and a separate pipelined FP functional unit for each operation type. (c) A dynamic branch-prediction hardware (with perfect branch prediction), a separate

    functional unit to evaluate branch conditions and branch single issue (no delay branch). (d) The number of cycles of latency between a source instruction and an instruction

    consuming the result is one cycle for integer ALU, two cycles for loads, and three cycles for FP add.

    (e) There are four-entry ROB (reorder buffer) and unlimited reservation stations, two common data bus (CDB).

    (f) Up to two instructions of any type can commit per cycle.

    Create a table showing when each instruction issue, begin execution, write its result to the


    CDB and commit for the first two iterations of the loop.

    Table Sample: Iteration # Instruction Issue Executes Mem access Write-CDB Commit

    Note: If your solution is based on the assumptions that are not listed above, please list them clearly. Ans:

    speculation

    Iter. Instruction Issue Executes Mem access Write-CDB Commit

    1 L.D F0, 0(R1) 1 2 3 4 5 1 ADD.D F4, F0, F2 1 5 8 9 1 S.D F4, 0(R1) 2 3 9 1 DADDUI R1, R1, #-8 2 4 5 10 1 BNE R1, R2, Loop 6 7 10 2 L.D F0, 0(R1) 10 11 12 13 14 2 ADD.D F4, F0, F2 10 14 17 18 2 S.D F4, 0(R1) 11 12 18 2 DADDUI R1, R1, #-8 11 13 14 19 2 BNE R1, R2, Loop 15 16 19

    (2002.10) (3) List two architectural features that enhance the CPU performance by exploiting instruction level parallelism. Ans:

[References: http://en.wikipedia.org/wiki/Superscalar, http://en.wikipedia.org/wiki/Very_long_instruction_word]
(1) Superscalar processor: instead of relying only on a higher clock rate, a superscalar CPU raises throughput by issuing several instructions per cycle to redundant functional units (ALUs, multipliers), so the CPI can fall below 1.
With dynamic scheduling (Tomasulo's algorithm) it supports in-order issue, out-of-order execution and out-of-order completion, so it does not depend solely on the compiler to find independent instructions to issue.
(2) Very long instruction word (VLIW): the compiler, at compile time rather than run time, finds independent instructions and packs them into one large, fixed-format instruction packet that is issued as a unit. This keeps the hardware complexity low and moves the effort into the compiling techniques (instruction scheduling, handling branches).

    (2002.03) (2) Assume a five-stage integer pipeline with dedicated branch logic in the ID stage. Assume that writes to the register file occur in the 1st half of the clock cycle and reads to the register file occurs in the 2nd half. For the following code segment:

    Loop: LW R1, 0(R2) ADDI R3, R1, 100 SW 0(R2), R3 ADD R5, R5, R3 SUBI R4, R4, #4 SUBI R2, R2, #4 BNEZ R4, Loop

    (a) Show how the code sequence is scheduled on the described pipeline without register forwarding and branch prediction. Identify all the pipeline hazards and use s to indicate a stall cycle. Assume all memory accesses are cache hits.

    (b) Assume that the pipeline implements register forwarding and branch predict-taken scheme. If the initial value of R4 is 12 and the first instruction in the loop start execution at clock #1, how many cycle does it take to finish executing the entire loop?

    Note: Please write down all the additional architectural assumptions that your answers are based on. Ans:


    (a)

    Cycles

    Instrs. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

    1 F D E M W 2 F D S S E M W 3 F S S D S S E M W 4 F S S D E M W 5 F D E M W 6 F D E M W 7 F S D E M W

    (b)

    Cycles

    Instrs. 1 2 3 4 5 6 7 8 9 10 11 12

    1 F D E M W 2 F D S E M W 3 F S D E M W 4 F D E M W 5 F D E M W 6 F D E M W 7 F D E M W


Since R4 starts at 12 and is decremented by 4 each iteration, the loop body executes 3 times. From the table the first iteration completes in 12 cycles; with forwarding and predict-taken each further iteration adds its 7 instructions plus 1 stall cycle, so the whole loop takes 12 + 2×(7+1) = 28 cycles.

    (2001.03) (2) In what condition can we move an instruction from one possible path of a branch to the basic block ahead of the branch without affecting the other part of the code or causing any false exception? Ans:

An instruction can be hoisted from one path of a branch into the basic block before the branch only if doing so cannot change the behaviour of the other path: the moved instruction must not overwrite a value that the other path (or the code after the join) still needs, and it must not be able to raise an exception that the original program would not have raised (e.g. a load that may fault or a divide that may trap). If both conditions hold, executing it unconditionally before the branch affects no other part of the code and causes no false exception or extra pipeline stall.

    (2001.03) (3) Can you describe a hardware mechanism for solving the false exception problem in problem 2? Ans:

No predictor is 100% accurate, so the hardware must be able to undo the effect of instructions moved above (or executed behind) a predicted branch. Tomasulo's algorithm with speculation does this with a reorder buffer: speculatively executed instructions write their results only into the reorder buffer, and the register file and memory are updated only at the commit stage, after the branch outcome is known. If the prediction was wrong, the speculative entries are flushed before they commit, so any exception they would have raised is discarded and never becomes visible - the false-exception problem is avoided.

    (2000.10) (3) It seems that the performance of a pipelined processor increases as the number of pipeline stages increases. Can you discuss the major factors that limit the number of pipeline stages in processor design? Ans:

A deeper pipeline allows a shorter clock cycle, but several factors limit the number of stages: (1) the work must be divided into ever smaller pieces, so the latch/register overhead between stages becomes a larger fraction of each cycle; (2) pipeline hazards - structural, control and data hazards - cause stall cycles, and their penalties (branch mis-predictions, load-use delays) grow with the number of stages; (3) hardware cost and design complexity increase. Beyond a certain depth the extra stalls and overhead outweigh the faster clock, so adding more stages no longer improves performance.


    (1999.10) (2) Can you describe how the reorder buffer works? How does the reorder buffer materialize the precise interrupt? Ans:

(1) The reorder buffer is organized as a circular array:
 i. when an instruction issues it is allocated the entry at the tail, so entries appear in program order;
 ii. each entry records the instruction, its destination and (when ready) its result until the instruction can be committed;
 iii. the entry at the head is the oldest instruction and the only non-speculative one: when it completes it commits and its entry is freed.
(2) Because the reorder buffer commits instructions strictly in program order, the registers and memory always reflect a precise architectural state. When an I/O interrupt (or exception) occurs, everything up to the head of the reorder buffer has committed, and all younger entries (and the reservation stations) can simply be flushed; execution then restarts exactly at the interrupted instruction, which is what a precise interrupt requires.

    (1999.03) (3) (a) What data dependences are checked by the scoreboard mechanism during instruction

    scheduling? (b) What data dependences are checked by the Tomasulos mechanism? Ans:

(a) [Reference: http://en.wikipedia.org/wiki/Scoreboarding]
Scoreboarding has four stages: Issue, Read operands, Execution, Write result.
 i. Issue: the instruction is issued only if no other active instruction is going to write the same destination register (output dependences, i.e. WAW hazards, are checked here) and the required functional unit is free; otherwise it stalls at issue (WAW or structural hazard).
 ii. Read operands: the scoreboard lets the functional unit read its operands only when they are available, i.e. when no earlier active instruction is still going to write them - true data dependences (RAW hazards) are checked here; until then the instruction waits.
 iii. Execution: after the operands are fetched, the functional unit executes and notifies the scoreboard when the result is ready.
 iv. Write result: the write to the destination register is delayed until all earlier instructions that still need to read the old value have read their operands - anti-dependences (WAR hazards) are checked here.
(b) [Reference: http://en.wikipedia.org/wiki/Tomasulo_algorithm]
Tomasulo's algorithm (without speculation) has three stages: Issue, Execution, Write result.
 i. Issue: the instruction is issued to a free reservation station (if none is free it stalls); the available operands, or the tags of the units that will produce them, are copied into the station - this register renaming is what removes WAR and WAW hazards.
 ii. Execution: the instruction is delayed until all its operands are available, so RAW hazards (true data dependences) are enforced here.
 iii. Write result: ALU operations broadcast their results on the CDB to the registers and to the waiting reservation stations; store operations write to memory.

    (1999.03) (4) (a) If a superscalar microprocessor designer decides to use the scoreboard mechanism instead

of the Tomasulo mechanism due to hardware cost considerations, what are the designer's main reasons?

    (b) Implementing the scoreboard mechanism instead of the Tomasulos mechanism means that performance of the superscalar microprocessor is sacrificed, can you find some ways to make up loss of performance, including possibly adding more hardware?

    Ans:

(a) The Tomasulo mechanism needs reservation stations, a common data bus and register-renaming logic to eliminate WAR and WAW hazards, which costs considerably more hardware than a centralized scoreboard; if hardware cost is the overriding concern, the designer therefore picks the scoreboard mechanism.
(b) Some of the lost performance can be recovered with extra hardware elsewhere: provide more functional units (e.g. more than one load/store unit, two multipliers, an adder and a divider) so that an instruction such as a load is less likely to wait for a free unit - fewer structural hazards means fewer stalls. In addition, a register-renaming stage can be added in front of the scoreboard, so that WAW and WAR hazards no longer force the scoreboard to stall and performance improves further.


    (1998.10) (2) Can you explain why branch prediction is so important in modern microprocessor design? Ans:

Branch prediction guesses, at instruction fetch time, whether a branch will be taken, so the processor can keep fetching and issuing instructions along the predicted path instead of stalling until the branch resolves. Modern microprocessors have deep pipelines and issue several instructions per cycle, and branches are frequent, so without accurate prediction a large fraction of the cycles would be lost to control-hazard stalls; good branch prediction is therefore essential to keeping the CPU busy.


Memory Hierarchy
(2007.10) (3) Explain virtual memory and how a modern processor accesses memory space. What are the functions of the MMU and TLB here? Ans:

(1) [Reference: http://en.wikipedia.org/wiki/Virtual_memory]
 i. Virtual memory is a technique that gives every program its own large, contiguous virtual address space, while the system keeps only the active parts of programs and data in physical memory and swaps inactive pages out to disk. Programs can therefore be larger than physical memory, and physical memory is not fragmented among programs.
 ii. A modern processor accesses memory space in two steps: (i) address translation converts the virtual address into a physical address; (ii) the physical address is used to access the cache, and only on a cache miss is main memory accessed.
(2) The memory management unit (MMU) is the hardware device that maps virtual addresses to physical addresses (and enforces protection). The translation look-aside buffer (TLB) is a small cache of recent virtual-to-physical page translations: on a TLB hit the physical address is obtained immediately; on a TLB miss the page table must be consulted, which is much slower. The TLB therefore keeps address translation fast.

    (2007.10) (6) The following code multiplies two matrices B and C. (a) Compute the cache miss ratio for the code on a 16K, two-way set-associative cache with

    LRU replacement and write-allocate/write-back policy. The cache block size is 16 bytes. Assume that none of the array elements exists in the cache initially.

    (b) Which type of cache misses dominates in this case? (compulsory, conflict, or capacity). Can the cache miss ratio be reduced by rewriting the code? Briefly explain why the new code has fewer cache misses.

    int A[32][32]; /* 4KB, starting at address 0x20000 */ int B[32][32]; /* 4KB, starting at address 0x30000 */ int C[32][32]; /* 4KB, starting at address 0x40000 */ for (i = 0; i < 32; i = i+1)

    for (j = 0; j < 32; j = j+1) A [i] [j] = B [i] [j] * C [i] [j]; /* the memory access order is B, C, A */


Ans:
(a) i. Number of sets = 2^14 / (2^4 × 2) = 2^9, so the cache has 9 index bits (16-byte blocks give 4 offset bits).
 ii. The three arrays together occupy 4K × 3 = 12K < 16K, so there are no capacity misses.
 iii. The arrays start at:
Address      Tag      Index       Block offset
A: 0x20000   010000   000000000   0000
B: 0x30000   011000   000000000   0000
C: 0x40000   100000   000000000   0000
In every iteration A[i][j], B[i][j] and C[i][j] lie at the same offset from their bases, so the three accesses always map to the same 2-way set. Each element is 4 bytes and a block holds 16 bytes, so a new block is needed every 4 iterations of j. Within each group of 4 iterations there are 12 accesses (order B, C, A) and all 12 miss: the first access to each new block is a compulsory miss, and the remaining accesses are conflict misses, because three blocks keep evicting each other from the 2-way set.
Miss ratio = 32×8×12 / (32×32×3) = 100%.
(b) Conflict misses dominate. The miss ratio can be reduced by rewriting the code so that the three streams no longer collide in the same set, e.g. by first copying each 4-element group of B and C into small temporary arrays (temp1, temp2) that map to other sets and then computing A from the temporaries:
for (i = 0; i < 32; i = i+1) {
  for (j = 0; j < 32; j = j+4) {
    for (k = 0; k < 4; k = k+1) temp1[i][j+k] = B[i][j+k];
    for (l = 0; l < 4; l = l+1) temp2[i][j+l] = C[i][j+l];
    for (m = 0; m < 4; m = m+1) A[i][j+m] = temp1[i][j+m] * temp2[i][j+m];
  }
}
Because the cache is two-way set associative, each 4-iteration group now causes only 5 misses (one block each for B, C, A, temp1 and temp2), so the miss ratio drops to 32×8×5 / (32×32×3) = 41.67%.


(2007.03) (2) For the following memory system parameters: Cache: physically addressed, 32KB, direct-mapped, 32B block size. Virtual Memory: 4K page, 1 GB virtual address space. TLB: fully associative, 32 entries. 64 MB of addressable physical memory. Sketch a block diagram of how to determine whether a memory request is a cache hit. Be sure to label exactly how many bits are in each field of the TLB/cache architecture and which/how many address bits go where. You only need to show the tag and data fields of the TLB/cache architectures.
Ans:
Page offset: 4K pages = 2^12, so 12 bits.
Virtual address: 1 GB virtual space = 2^30, so 30 bits; virtual page number = 30 - 12 = 18 bits.
Physical page number: addressable physical memory / page size = 2^26 / 2^12 = 2^14, so 14 bits (the physical address is 26 bits).
Cache fields: 32B blocks give 5 block-offset bits; 32KB / 32B = 1024 blocks (direct-mapped) give 10 index bits; physical tag = 26 - 10 - 5 = 11 bits.
Block diagram (in words): the 30-bit virtual address is split into an 18-bit virtual page number and a 12-bit page offset. The 18-bit VPN is looked up in the fully associative TLB (each of the 32 entries holds a valid bit, an 18-bit virtual page number and a 14-bit physical page number); on a TLB miss the page table supplies the translation. The 14-bit physical page number is concatenated with the 12-bit page offset to form the 26-bit physical address, which is divided into an 11-bit tag, a 10-bit cache index and a 5-bit block offset. The index selects a cache block, and the stored tag is compared with the 11-bit physical tag; if they match and the block is valid, the request is a cache hit.
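A small C sketch that splits a 26-bit physical address into the tag/index/offset fields derived above (the example address is arbitrary):

#include <stdint.h>
#include <stdio.h>

#define OFFSET_BITS 5    /* 32-byte block */
#define INDEX_BITS 10    /* 32 KB / 32 B = 1024 blocks, direct-mapped */

int main(void) {
    uint32_t paddr = 0x2AB1234 & 0x3FFFFFF;   /* any 26-bit physical address */
    uint32_t offset = paddr & ((1u << OFFSET_BITS) - 1);
    uint32_t index  = (paddr >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    uint32_t tag    = paddr >> (OFFSET_BITS + INDEX_BITS);   /* remaining 11 bits */
    printf("tag=0x%x index=%u offset=%u\n",
           (unsigned)tag, (unsigned)index, (unsigned)offset);
    return 0;
}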


    (2007.03) (7) (2004.10) (7) What are the advantages of replicating data across caches for shared-memory multiprocessor processors? What kind of problems does it introduce? (2003.10) (7) In the shared-memory multiprocessor processor, what are the advantages of replicating data across caches? What problem does it introduce? Ans:

(1) Advantage: replicating shared data in each processor's cache reduces read misses and the latency and bus/interconnect traffic of accessing shared data, since each processor can read its own local copy.
(2) Problem: cache coherency. If a variable X is cached by processors A and B and one of them modifies its copy, the other processor (and memory) may keep using a stale value. The designer must therefore add a cache-coherence mechanism, such as a snooping protocol (or a directory), to keep the copies consistent.

    (2006.10) (1) (a) There are three organizations for memory (i) one word memory organization, (ii) wide

    memory organization and (iii) interleaved memory organization. Briefly describe the three organizations and compare the hardware they need.

    (b) Assume that the cache block size is 32 words, the width of organization (ii) is eight words, and the number of banks in organization (iii) is four. If sending an address is 1 clock, the main memory latency for a new address is 10 cycles, and the transfer time for one word is one cycle, what are the miss penalties for each organization?

    Ans:

    (a) [] http://en.wikipedia.org/wiki/Memory_organisation


(i) One-word-wide memory organization: the memory and the bus between memory and cache are both one word wide, so a missed block is fetched one word at a time. This needs the least hardware.
(ii) Wide memory organization: the memory and the bus to the lowest-level cache are several words wide; a multiplexer between the low-level cache and the higher-level cache/CPU narrows the data back to one word. This requires a wider bus and wider memory, i.e. more hardware.
(iii) Interleaved memory organization: the memory is divided into several one-word-wide banks that share a one-word-wide bus; consecutive words go to different banks so their accesses can be overlapped. Typically the low-order bits of the address select the bank and the higher-order bits select the location within the bank. The hardware cost is close to (i), with only the bank organization and addressing logic added.
(b) The cache block is 32 words; sending an address takes 1 cycle, the memory latency is 10 cycles per access, and transferring one word takes 1 cycle.
(i) One-word-wide: every word needs its own memory access and transfer, so the miss penalty is 1 + 32×10 + 32×1 = 353 cycles.
(ii) Wide memory (8 words wide): the 32-word block needs 32/8 = 4 accesses and 4 wide transfers, so the miss penalty is 1 + (32/8)×10 + (32/8)×1 = 45 cycles.
(iii) Interleaved (4 banks): the 4 banks are accessed in parallel but the words still cross the one-word bus one at a time, so the miss penalty is 1 + (32/4)×10 + 32×1 = 113 cycles.

    (2006.10) (6) Suppose we wished to solve a problem of the following form: for (i = 0; i


    Ans:

(a) With P processors, the N²-element computation of each outer-loop iteration is divided among the processors and takes N²/P time, and each iteration also incurs a synchronization/communication delay proportional to the number of processors, a·P. For M iterations of the outer loop the total execution time is therefore
ExTime_new = M·N²/P + a·M·P
(b) Speedup = ExTime_old / ExTime_new = M·N² / (M·N²/P + a·M·P) = N² / (N²/P + a·P)
(c) Write the speedup as S(P) = N²·P / (N² + a·P²) and differentiate with respect to P using the quotient rule:
dS/dP = [N²·(N² + a·P²) - N²·P·(2a·P)] / (N² + a·P²)² = N²·(N² - a·P²) / (N² + a·P²)²
Setting dS/dP = 0 gives a·P² = N², i.e. P = N/√a processors maximize the speedup.


    (2006.03) (2) To overlap the cache access with TLB access as shown in the figure, how do we design the cache?

    Ans:

To overlap the cache access with the TLB access (the figure shows the CPU sending the address to the cache and the TLB in parallel, with the TLB supplying the physical address PA for the comparison and main memory behind the cache), the cache is designed to be virtually indexed and physically tagged: the index comes from address bits that are unchanged by translation (the page offset), so the cache set can be read while the TLB translates the virtual page number, and the physical tag delivered by the TLB is then compared with the tags read from the set to determine the hit. If virtual-address bits above the page offset must be used for indexing, two virtual addresses that map to the same physical address can land in different cache blocks (aliasing); this is prevented with hardware anti-aliasing checks or with OS page coloring.

    (2005.10) (3) (a) Do we need a non-blocking cache for an in-order issue processor? Why? (b) What is a virtually-indexed, physically-tagged cache? What problem does it try to solve? (c) Describe one method to eliminate compulsory misses. (d) Describe one method to reduce cache miss penalty. Ans:

(a) Not really. A non-blocking cache keeps servicing other requests (hit under miss) while one or more misses are outstanding, which pays off only if the processor can keep doing useful work during the miss. An in-order issue, in-order execution processor stalls as soon as an instruction needs the missing data or the issue stage blocks, so it can rarely exploit hit-under-miss; a non-blocking cache mainly benefits out-of-order processors that continue with independent instructions.
(b) A virtually indexed, physically tagged cache is indexed with (untranslated) virtual-address bits while its tags hold physical addresses. It attacks the problem that address translation would otherwise sit on the critical path of every cache access: the cache set is read in parallel with the TLB lookup, and the physical tag from the TLB is compared afterwards, so the cache hit time is not lengthened by translation.
(c) Exploit spatial locality with a larger block size: for sequential accesses such as scanning an array, one miss brings in a whole block and the following accesses to that block no longer miss, so the number of compulsory misses drops. (Hardware or software prefetching also removes compulsory misses.)
(d) Add a victim cache between the L1 and L2 caches: it holds the cache lines most recently evicted from L1 (typically victims of conflict or capacity misses). On an L1 miss the victim cache is checked first, and a hit there returns the line without going to L2 or memory, so the average miss penalty is reduced.

    (2005.10) (5) In small bus-based multiprocessors, what is the advantage of using a write-through cache over a write-back cache? Ans:

In a small bus-based multiprocessor the bus has enough bandwidth to spare, and a write-invalidate protocol keeps the copies consistent: when a processor writes, the other cached copies are invalidated. With a write-back cache, a read miss by another processor on a block that is dirty in some cache forces that cache to supply (or first write back) the block, which complicates the protocol and lengthens the miss. With a write-through cache, memory always holds the up-to-date value, so a read miss can always be serviced directly from memory; this simpler handling and lower read-miss penalty is the advantage, at the cost of more bus traffic on writes, which a small bus-based system can afford.

    (2005.03) (1) Both machine A and B contain one-level on chip caches. The CPU clock rates and cache configurations for these two machines are shown in Table 1. The respective instruction/data cache miss rates in executing program P are also shown in Table 1. The frequency of load/store instructions in program P is 20%. On a cache miss, the CPU stalls until the whole cache block is fetched from the main memory. The memory and bus system have the following characteristics:

    1. the bus and memory support 32-byte block transfer; 2. a 32-bit synchronous bus clocked at 200 MHz, with each 32-bit transfer taking 1 bus

    clock cycle, and 1 bus clock cycle required to send an address to memory (assuming shared address and data lines);

    3. assuming there is no cycle needed between each bus operation; 4. a memory access time for the first 4 words (16 bytes) is 250 ns, each additional set of

    four words can be read in 25 ns. Assume that a bus transfer of the most recently read data and a read of the next four words can be overlapped.

Table 1
                        Machine A                           Machine B
CPU clock rate          800 MHz                             400 MHz
I-cache configuration   Direct-mapped, 32-byte block, 8K    2-way, 32-byte block, 128K
D-cache configuration   2-way, 32-byte block, 16K           4-way, 32-byte block, 256K
I-cache miss rate       5%                                  1%
D-cache miss rate       15%                                 3%
(I-cache: Instruction cache, D-cache: Data cache)

    (a) What is the data cache miss penalty (in CPU cycles) for machine A? (b) Which machine is faster in executing program P and by how much? The CPI (Cycle per

    Instruction) is 1 without cache misses for both machine A and B. Ans:

(a) One bus clock cycle = 1/200 MHz = 5 ns. A cache miss must fetch a 32-byte line, i.e. two 16-byte groups of four words. First group: 5 ns to send the address + 250 ns memory access + 4 transfers × 5 ns = 275 ns. Second group: 25 ns additional memory access + 4 × 5 ns transfer = 45 ns. Total = 275 + 45 = 320 ns.
Machine A: D-cache miss penalty = 320 ns × 800 MHz = 256 CPU cycles. (Machine B: 320 ns × 400 MHz = 128 CPU cycles.)
(b) CPU time_A = IC_A × (1 + 5% × 256 + 20% × 15% × 256) × 1/800 MHz
    CPU time_B = IC_B × (1 + 1% × 128 + 20% × 3% × 128) × 1/400 MHz
(the 20% is the load/store frequency, which determines how often the D-cache is accessed). The two machines may use different instruction sets (e.g. RISC vs CISC), so IC_A and IC_B for program P may differ; without knowing the instruction counts we cannot say definitively which machine executes program P faster.

    (2005.03) (5) A virtual addressed cache removes the TLB from the critical path. However, it could cause aliases. Describe one technique to solve the aliasing problem. Ans:

(1) Hardware anti-aliasing: the cache guarantees that every physical block has at most one copy. On a miss, the locations that an alias could occupy are checked using the physical address obtained from translation; if a block with the same physical address is already present it is invalidated (or used instead), so two virtual addresses can never keep two copies of the same physical data in the cache.
(2) Software: OS page coloring. For a direct-mapped cache of size 2^n, the operating system allocates pages such that the low n bits of the virtual and the physical address are identical; all virtual aliases of a physical address then map to the same cache set/block, so they cannot coexist in different locations.

    (2005.03) (6) Why does the on-chip cache size continue to increase in modern processors? Ans:

A larger on-chip cache reduces capacity misses and raises the hit rate, so the CPU has to access slow off-chip memory less often. Because the gap between processor speed and memory speed keeps widening, spending the growing transistor budget on larger on-chip caches improves overall performance, as long as the bigger cache does not lengthen the hit time too much.

    (2004.10) (2) Describe how the victim cache scheme improves the cache performance. Ans:

A victim cache is a small, fully associative buffer between the L1 cache and the L2 cache that holds the blocks most recently evicted from L1 (mostly victims of conflict or capacity misses). On an L1 miss the victim cache is checked; because of temporal locality an evicted block is often re-referenced soon, and a hit in the victim cache lets the block be swapped back into L1 without going to L2 or memory, so the average miss penalty drops and performance improves.

    (2004.10) (3) What is a non-blocking cache? Why is it important for a superscalar processor with dynamic instruction scheduling? Ans:

(a) A non-blocking cache supports hit under miss (and possibly hit under multiple misses): while one or more misses are outstanding, it keeps servicing other CPU requests that hit, hiding part of the miss penalty.
(b) A superscalar processor with dynamic instruction scheduling issues in order but executes and completes out of order; on a cache miss it can keep executing independent instructions, but only if the cache keeps accepting and servicing further requests. A non-blocking cache therefore prevents a single miss from stalling the whole CPU and is essential for such processors.


    (2004.03) (1) A virtual address cache is used to remove the TLB from the critical path on cache accesses. (a) To avoid flushing the cache during a context switch, one solution is to increase the width of

    the cache address tag with a process-identifier tag. Assume 64-bit virtual addresses, 8-bit process identifiers, and 32-bit physical addresses. Compare the tag size of a physical cache vs. a virtual cache for a 2-way, 64K cache with 32B blocks.

    (b) A virtual cache incurs the address aliasing problem. Explain what the address aliasing problem is? Explain how page coloring can be used to solve this problem for a direct-mapped cache.

    Ans:

(a) Block offset = log2(32) = 5 bits (32B blocks).
Index = log2(64K / (2 × 32)) = log2(2^10) = 10 bits (2-way, 64K cache).
(1) Tag size of the physical cache = 32 - (10 + 5) = 17 bits.
(2) Tag size of the virtual cache = 64 - (10 + 5) + 8 = 57 bits (the 8-bit process identifier is added to the tag).
(b) (1) The address aliasing problem: two different virtual addresses x and y can map to the same physical address. In a virtual cache they index and tag differently, so the same physical data can be cached in two different cache lines, and a write through one alias is not seen through the other.
(2) For a direct-mapped cache of size 2^n, page coloring makes the OS allocate pages so that the low n bits of the virtual address and of the physical address are identical. All virtual aliases of a physical address then map to the same cache set/block, so only one copy can be in the cache at a time and the aliasing problem disappears.

    (2003.10) (2) For the following memory system parameters: Cache: physically addressed, 32KB, 2-way, 32B block size Virtual Memory: 4K page, 1 GB virtual address space TLB: fully associative, 40 entries 64 MB of addressable physical memory Sketch a block diagram of how to determine whether a memory request is a cache hit. Be sure to label exactly how many bits are in each field of the TLB/cache architecture and which/how many address bits go where. You only need to show the tag and data fields of the TLB/cache architectures. Ans:


    (2003.03) (2) What is the problem that a TLB tries to solve? Ans:

Address translation through the page table in memory would add a memory access to every reference. The TLB is a small cache that records recent virtual-page-number to physical-page-number translations: on a TLB hit the physical page number is available immediately without consulting the page table; only on a TLB miss must the page table be walked to obtain the translation. Combined with a virtually indexed, physically tagged cache, the TLB lookup proceeds in parallel with the cache index, so address translation does not add to the cache hit time for instruction and data accesses.

(Block diagram for 2003.10 (2) above: the 30-bit virtual address = 18-bit virtual page number + 12-bit page offset; the fully associative 40-entry TLB maps the 18-bit VPN (plus a valid bit) to a 14-bit physical page number, with a miss going to the page table. The resulting 26-bit physical address = 14-bit PPN + 12-bit page offset is split, for the 32KB 2-way 32B-block cache, into a 12-bit tag, a 9-bit index and a 5-bit block offset; the index selects a set, and a match between a stored tag and the 12-bit physical tag is a cache hit.)


    (2003.03) (5) What is the aliasing problem using a virtual cache? Describe a mechanism to solve the aliasing problem. Ans:

(1) Two different virtual addresses x and y can map to the same physical address. In a virtually addressed cache they are indexed and tagged with their virtual addresses, so the same physical data may occupy two different cache lines at once; an update made through one alias is then invisible through the other. This is the aliasing problem.
(2) One mechanism is OS page coloring: for a direct-mapped cache of size 2^n, the operating system allocates physical pages so that the low n bits of the virtual and the physical address agree; every alias of a physical address then maps to the same cache set/block, so duplicate copies cannot arise.

    (2003.03) (6) What is the memory wall problem? How does pre-fetching solve this problem? Pre-fetching mechanism should be used with a blocking cache or non-block cache? Why? Ans:

(1) Processor speed has improved much faster than memory speed, so the time spent waiting for memory accesses increasingly dominates: once most cycles go to fetching instructions and data from memory, making the CPU itself faster hardly helps. This growing processor-memory gap is the memory wall problem.
(2) Pre-fetching fetches instructions or data from memory before the CPU actually asks for them, e.g. into a stream buffer: on a miss the following sequential blocks are also brought in, so when the CPU (running sequential code or scanning an array) reaches them the data is already nearby and the memory latency is hidden instead of stalling the CPU.
(3) Pre-fetching should be used with a non-blocking cache: the prefetches must overlap with the CPU's continuing cache accesses and execution. With a blocking cache, any outstanding access (including the prefetch itself) would stall the cache and the CPU, so the latency could not be hidden and prefetching would bring no benefit.

    (2002.10) (4) What is a non-blocking cache? Ans:

A non-blocking cache allows hit under miss (and possibly hit under multiple misses): while one or more misses are outstanding, the cache keeps servicing other CPU requests that hit, so part of the miss penalty is hidden. It suits processors with in-order issue, out-of-order execution and out-of-order completion, because on a cache miss such a CPU does not stall but continues with independent instructions, which in turn generate further memory requests that the cache must be able to accept.

    (2002.10) (7) Find the cache miss rate for the following code segments. Assume an 8-KB direct-mapped data cache with 16-byte blocks. It is a write-back cache that does write allocate. Both a and b are double-precision floating-point arrays with 3 rows and 100 columns for a and 101 rows and 3 columns for b (each array element is 8B long). Lets also assume they are not in the cache at the start of the program. For (i = 0; i < 3; i++)

    For (j = 0; i < 100; j++) a [i] [j] = b [j] [0] * b [j+1] [0];

Ans:
 i. Total data accesses: 3 × 100 × 3 = 900 (one write to a and two reads of b per iteration, over 3 × 100 iterations).
 ii. Cache misses:
(1) a - compulsory misses: the cache is write-allocate and a is written sequentially; each 16-byte block holds two 8-byte (double) elements, so every second element misses: 3 × 100 / 2 = 150 misses.
(2) b - compulsory misses: b[j][0] and b[j+1][0] are 3 × 8 = 24 bytes apart, so consecutive accesses fall in different blocks. During the i = 0 pass the elements b[0][0] ... b[100][0] are touched for the first time, giving 101 misses; in the later passes they are already cached.
(3) Capacity misses: the cache has 8K / 16B = 512 lines, and a and b together need only about 150 + 101 = 251 lines, so there are no capacity misses.
(4) Conflict misses: with the given placement the blocks of a and b do not collide, so there are none.
Miss rate = (150 + 101) / 900 = 27.89%.


    (2002.03) (3) (a) A virtually indexed and physically tagged cache is often used to avoid address translation

before accessing the cache. One design is to use the page offset to index the cache while sending the virtual part to be translated. The limitation is that a direct-mapped cache can be no bigger than the page size. Describe how we can use the page coloring technique to remove this limitation.

(b) Your task is to choose the block size of an 8K direct-mapped cache to optimize the memory performance for the following code sequence:

    for (j = 0; j < 5000; j++)
        for (i = 0; i < 5000; i++)
            X[i][j] = 2 * X[i][j];

    Assume that each array element is 4 bytes. The available cache block size is 16, 32, 64 and 256 bytes. Which one will you choose? Why?

    (c) Compute the cache miss ratio for the following code sequence on a 8K, two-way set-associative cache with LRU replacement and write-allocate/write-back policy. The cache block size is 16 bytes. Assume that none of the array elements exists in the cache initially.

    int a [32] [32]; /* 4KB, at address 0x2000 */ int b [32] [32]; /* 4KB, at address 0x3000 */ int c [32] [32]; /* 4KB, at address 0x4000 */

    for (i = 0; i < 32; i = i+1)

    for (j = 0; j < 32; j = j+1) a [i] [j] = b [i] [j] * c [i] [j]; /* the memory access order is b, c, a */


Ans:
(a) With page coloring, the OS guarantees that the low-order n bits of the virtual and physical addresses are identical, where 2^n is at least the cache size. All of the bits used to index the cache (index plus block offset), including those above the page offset, are then the same in the virtual and the physical address. The cache can therefore be indexed with the virtual address while still behaving exactly like a physically indexed, physically tagged cache, so a direct-mapped cache larger than the page size no longer causes a problem.

(b) Choose 16 bytes. The code walks X in column-major order while the array is stored in row-major order, so consecutive accesses are 5000 * 4 = 20000 bytes apart. There is no spatial locality: every access misses for any of the available block sizes, so a larger block only wastes bandwidth and lengthens the miss penalty. The smallest block size, 16 bytes, gives the best memory performance.

(c) i. Number of sets = cache size / (associativity * block size) = 2^13 / (2 * 2^4) = 2^8, so the cache index is 8 bits and the block offset is 4 bits.
ii. Address breakdown of the three arrays (each 4 KB):

    Array | Address | Tag   | Index | Block offset
    a     | 0x2000  | 0b010 | 0x00  | 0x0
    b     | 0x3000  | 0b011 | 0x00  | 0x0
    c     | 0x4000  | 0b100 | 0x00  | 0x0

Because the arrays are exactly 4 KB apart, a[i][j], b[i][j] and c[i][j] always have the same index but different tags, so in every iteration the three blocks compete for the same 2-way set. The access order is b, c, a: with LRU replacement, each access evicts the block that will be needed next, so every access misses. For each group of 4 consecutive j values (one 16 B block holds 4 ints) there are 12 accesses and 12 misses: 1 compulsory miss (array b) plus 11 further misses for arrays a, b, c. Miss ratio = 32 * 8 * (11 + 1) / (32 * 32 * 3) = 100%.
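A minimal sketch (array addresses taken from the code comments in the problem) that simulates the 8 KB, 2-way, LRU, 16 B-block cache and confirms the 100% miss ratio:

    #include <stdio.h>

    #define SETS 256                     /* 8 KB / (2 ways * 16 B) */
    #define WAYS 2
    #define BLOCK 16

    static long tag[SETS][WAYS];
    static int  valid[SETS][WAYS], lru[SETS];   /* lru[s]: way to evict next */
    static long misses, accesses;

    static void access_cache(long addr)
    {
        long block = addr / BLOCK;
        int  set = (int)(block % SETS);
        accesses++;
        for (int w = 0; w < WAYS; w++)
            if (valid[set][w] && tag[set][w] == block) {
                lru[set] = 1 - w;            /* hit: the other way becomes LRU */
                return;
            }
        int victim = lru[set];               /* miss: fill the LRU way */
        valid[set][victim] = 1;
        tag[set][victim] = block;
        lru[set] = 1 - victim;
        misses++;
    }

    int main(void)
    {
        long a = 0x2000, b = 0x3000, c = 0x4000;    /* from the code comments */
        for (int i = 0; i < 32; i++)
            for (int j = 0; j < 32; j++) {
                long off = (long)(i * 32 + j) * 4;  /* int elements, row-major */
                access_cache(b + off);              /* read b[i][j]  */
                access_cache(c + off);              /* read c[i][j]  */
                access_cache(a + off);              /* write a[i][j] */
            }
        printf("miss ratio = %ld/%ld\n", misses, accesses);  /* expect 3072/3072 */
        return 0;
    }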

    (1999.10) (3) Can you parallelize the following two loops? If yes, show how you do it. If no, give your reason. (a) for (i = 1; i < 100; i++)

    { a[i] = b[i] + c[i]; c[i+1] = a[i] * d[i+1]; }

    (b) for (i = 1; i


For loop (a), statement A (a[i] = b[i] + c[i]) reads c[i], which statement B (c[i+1] = a[i] * d[i+1]) writes in the previous iteration, and B reads the a[i] that A has just written. The dependence through c[i+1] is carried from one iteration to the next and, together with the dependence through a, forms a recurrence, so the iterations of loop (a) cannot be executed in parallel as written.

    (1998.10) (4) Can you discuss some software and compiler techniques that can improve the performance of a cache system? Ans:

(1) Loop interchange: exchange the nesting order of loops so that the array is traversed in the order in which it is stored in memory (row major in C). This improves spatial locality and reduces cache misses without changing the amount of work.
(2) Loop fusion: merge loops that touch the same data into a single loop, so data brought into the cache by one computation is reused by the next before it is evicted (better temporal locality, fewer misses).
(3) Blocking: instead of sweeping whole rows or columns, operate on submatrices (blocks) whose size fits in the cache, so the elements of a submatrix are reused many times before being replaced; this mainly reduces capacity misses.
(4) Software prefetching: the compiler inserts prefetch instructions well ahead of the loads that will need the data, so the (mostly compulsory) miss latency is overlapped with computation; a minimal sketch follows below.
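A minimal sketch of compiler-style software prefetching using GCC/Clang's __builtin_prefetch (the prefetch distance of 16 elements is an assumed tuning parameter, not something specified in the original answer):

    #include <stddef.h>

    #define DIST 16   /* how far ahead to prefetch (assumed) */

    /* Sum an array while prefetching ahead, so that by the time a[i+DIST]
     * is loaded it is (hopefully) already in the cache. */
    double prefetched_sum(const double *a, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + DIST < n)
                __builtin_prefetch(&a[i + DIST], 0 /* read */, 3 /* high locality */);
            s += a[i];
        }
        return s;
    }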


    Multiprocessor and Multicore

(2008.10) (1) What is a barrier, and how can it be implemented?
Ans: Using barriers is one method of providing synchronization in a multiprocessor system. When a processor executes a barrier instruction, it must stop when it reaches the point specified by the barrier. After arriving at the synchronization point, the processor notifies the barrier controller and waits. When all the participating processors have checked in with the barrier controller, they are released and begin executing their next instructions. Barrier synchronization ensures that a specified sequence of tasks is not executed out of order by different processing elements; for example, Processor #2 may be stopped at a barrier, waiting for the calculation results of Processors #1 and #3 before all of them continue.

A typical implementation of a barrier can be done with two spin locks: one to protect a counter that tallies the processes arriving at the barrier, and one to hold the processes until the last process arrives at the barrier. To implement a barrier we usually use the ability to spin on a variable until it satisfies a test; we use the notation spin(condition) to indicate this. The code below is a typical implementation, assuming that lock and unlock provide basic spin locks and total is the number of processes that must reach the barrier.

    lock(counterlock);               /* ensure update atomic          */
    if (count == 0) release = 0;     /* first arrival: reset release  */
    count = count + 1;               /* count arrivals                */
    unlock(counterlock);             /* release lock                  */
    if (count == total) {            /* all arrived                   */
        count = 0;                   /* reset counter                 */
        release = 1;                 /* release processes             */
    } else {                         /* more to come                  */
        spin(release == 1);          /* wait for arrivals             */
    }
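A minimal, self-contained sketch of the same counter/release idea written with C11 atomics instead of the spin locks assumed above (a sense-reversing variant so the barrier can be reused; this is an illustration, not the exam's required answer):

    #include <stdatomic.h>

    typedef struct {
        atomic_int count;    /* arrivals in the current episode       */
        atomic_int sense;    /* flips each time the barrier is passed */
        int total;           /* number of participating threads       */
    } barrier_t;

    void barrier_init(barrier_t *b, int total)
    {
        atomic_init(&b->count, 0);
        atomic_init(&b->sense, 0);
        b->total = total;
    }

    /* Each thread keeps a private local_sense, initially 0. */
    void barrier_wait(barrier_t *b, int *local_sense)
    {
        int my_sense = 1 - *local_sense;               /* sense for this episode     */
        if (atomic_fetch_add(&b->count, 1) == b->total - 1) {
            atomic_store(&b->count, 0);                /* last arrival resets count  */
            atomic_store(&b->sense, my_sense);         /* ...and releases the others */
        } else {
            while (atomic_load(&b->sense) != my_sense)
                ;                                      /* spin until released        */
        }
        *local_sense = my_sense;
    }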

(2008.10)(3) What is a multi-core CPU? What issues limit the performance of programs running on one?
Ans: A multi-core processor (or chip-level multiprocessor, CMP) combines two or more independent cores (normally CPUs) into a single package composed of a single integrated circuit (IC), called a die, or of several dies packaged together. A dual-core processor contains two cores, and a quad-core processor contains four cores. A multi-core microprocessor thus implements multiprocessing in a single physical package. To benefit from it, a program must be decomposed into parallel, independent pieces of work, and several issues limit the speedup a multi-core CPU can actually deliver:

Parallel overhead
    Cost of thread creation and scheduling.
Synchronization
    Excessive use of global data and contention for the same synchronization object.
Load balance
    Improper distribution of the parallel work among the cores.
Granularity
    Not enough parallel work between synchronization points to amortize the overhead.
System issues
    Memory bandwidth limits and false sharing of cache lines.

(2008.10)(6) What interconnection organizations are used to build a shared-memory multiprocessor? Compare them.

Ans: Two basic organizations are used: the centralized shared-memory multiprocessor and the distributed-memory multiprocessor connected by a multistage or point-to-point network.

Centralized shared-memory multiprocessor: all processors share a single centralized memory and are connected to it by a bus. This organization is also known as uniform memory access (UMA) or symmetric (shared-memory) multiprocessor (SMP); that is, there is a symmetric relationship to all processors and a uniform memory access time from any processor. Its drawback is scalability: the shared bus and memory become a bottleneck, so it is less attractive for large-scale processor counts.



    Configurations:

    1. Common bus 2. Multiple buses 3. Crossbar

A crossbar switch may be used to allow any processor to be connected to any memory unit simultaneously; however, a memory unit can only be connected to one processor at a time. Each processor has a communications path associated with it, and similarly each memory unit has a communications path associated with it. A set of switches connects processor communications paths to memory unit communications paths, configured such that a processor communications path is only ever connected to a single memory unit communications path and vice versa.

Multistage networks (distributed-memory multiprocessor): each memory module is physically associated with a CPU, and the nodes communicate over an interconnection network.
Advantages:
    - A cost-effective way to scale memory bandwidth.
    - Lower memory latency for accesses to the local memory.
Drawbacks:
    - Longer communication latency when data must be communicated between processors.
    - A more complex software model, since data placement and communication must be managed.


Configurations: bus, star, ring, mesh.


(2008.03)(1) What is Multithreading? Describe at least 2 approaches to Multithreading and discuss their advantages and disadvantages.
Ans: Multithreading allows multiple threads to share the functional units of one processor in an overlapping fashion.

Fine-Grained Multithreading: switches between threads on each instruction, so the execution of multiple threads is interleaved, usually in a round-robin fashion, skipping any stalled threads. The CPU must be able to switch threads every clock cycle.
Advantage: it can hide both short and long stalls, since instructions from other threads are executed when one thread stalls.
Disadvantage: it slows down the execution of an individual thread, since a thread that is ready to execute without stalls is still delayed by instructions from the other threads.

Coarse-Grained Multithreading: switches threads only on costly stalls, such as L2 cache misses.
Advantages: it relieves the need for very fast thread switching, and it does not slow down an individual thread, since instructions from other threads are issued only when that thread encounters a costly stall.
Disadvantage: it is hard to overcome throughput losses from shorter stalls because of pipeline start-up costs: since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen, and the new thread must fill the pipeline before its instructions can complete.

Simultaneous Multithreading (SMT): SMT is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. It permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. Compared with the earlier approaches, the throughput advantage of SMT comes from sharing (i.e., avoiding partitioning of) the underutilized resources among threads, since rigid resource partitioning limits the parallelism a superscalar processor can exploit.

(2008.03)(2) In Shared Memory Multiprocessors, a lock operation is used to provide atomicity for a critical section of code. The following are assembly-level instructions for this attempt at a lock and unlock.

    1. Lock:   Ld  register, location   /* copy location to register */
    2.         Cmp register, #0         /* compare with 0 */
    3.         Bnz Lock                 /* if not zero, try again */
    4.         St  location, #1         /* store 1 into location to mark it locked */
    5.         Ret                      /* return control to caller of lock */

    6. Unlock: St  location, #0         /* write 0 to location */
    7.         Ret                      /* return control to caller */


    What is the problem with this lock? How to modify it? Ans:

The problem is that the load, compare, branch, and store are separate instructions, so the read-test-write sequence is not atomic. Two processors can both load location while it is 0, both pass the test, and both store 1, so both believe they hold the lock and enter the critical section together. A possible interleaving of P1 and P2 (each row shows which line a processor executes and the value of location it observes):

    Cycle | P1                       | P2                       | location
    1     | Line 1 (Ld)              |                          | 0
    2     | Line 2 (Cmp)             |                          | 0
    3     | Line 3 (Bnz, not taken)  |                          | 0
    4     |                          | Line 1 (Ld)              | 0
    5     |                          | Line 2 (Cmp)             | 0
    6     |                          | Line 3 (Bnz, not taken)  | 0
    7     | Line 4 (St #1): P1 enters the critical section      | 1
    8     |                          | Line 4 (St #1): P2 also enters, since its Bnz already saw 0 | 1
    9     | Line 5 (Ret)             | Line 5 (Ret)             | 1

The fix is to make the read-test-write atomic, for example by replacing lines 1-4 with an atomic test-and-set instruction. Test-and-set tests a value and sets it if the value passes the test, in one indivisible operation; the lock then simply retries the test-and-set until it succeeds (see the sketch below).
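A minimal sketch of the corrected lock using an atomic test-and-set, written with C11 atomics (illustrative only; the exam answer just requires that the read-test-write be one atomic operation):

    #include <stdatomic.h>

    static atomic_flag location = ATOMIC_FLAG_INIT;    /* clear = unlocked */

    void lock(void)
    {
        /* atomic test-and-set: returns the old value and sets the flag;
         * keep retrying while the old value was already set (locked). */
        while (atomic_flag_test_and_set(&location))
            ;                                          /* spin */
    }

    void unlock(void)
    {
        atomic_flag_clear(&location);                  /* write 0 to location */
    }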


(2008.03)(3) Describe a four-state (MESI) write-back invalidation protocol for cache coherence. The MESI protocol consists of 4 states: Modified (M) or dirty, Exclusive-clean (E), Shared (S), and Invalid (I).
Ans:
Shared: one or more processors have the block cached, and the value in memory is up to date (as well as in all the caches).
Invalid (uncached): no processor has a copy of the cache block.
Modified: exactly one processor has a copy of the cache block, and it has written the block, so the memory copy is out of date. That processor is called the owner of the block.
Exclusive-clean: exactly one cache holds the block and it has not been modified, so the cache copy and the memory copy are identical; a write to the block needs no bus transaction and simply changes the state to Modified.

(2007.03) (3) Please explain why it is usually easier to pre-fetch array data structures than pointers.
(2006.10) (2) Why is it easier to design a pre-fetching scheme for array-based than pointer-based data structures?
Ans:
Array data structures are stored contiguously and are usually traversed with a regular stride, so they have good spatial locality and, more importantly, the address of a future element can be computed ahead of time (base address plus offset). A prefetcher, hardware or compiler-generated, can therefore issue the prefetch far in advance and hide the memory latency, reducing CPU stalls. Pointer-based data structures (linked lists, trees) have little spatial or temporal locality, and the address of the next node is only known after the current node has been loaded and its pointer dereferenced, so the prefetch cannot be issued early enough to hide the miss latency. Hence pre-fetching is much easier and more effective for array data structures than for pointer-based ones.
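A minimal sketch contrasting the two cases, using GCC/Clang's __builtin_prefetch (the prefetch distances are assumed illustrations):

    #include <stddef.h>

    struct node { int value; struct node *next; };

    /* Array traversal: the address of a[i+8] is known now, so it can be
     * prefetched long before it is needed. */
    long sum_array(const int *a, size_t n)
    {
        long s = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 8 < n)
                __builtin_prefetch(&a[i + 8], 0, 3);
            s += a[i];
        }
        return s;
    }

    /* List traversal: the address of the node after next is only known once
     * p->next has been loaded, so the prefetch can run at most one node ahead. */
    long sum_list(const struct node *p)
    {
        long s = 0;
        for (; p != NULL; p = p->next) {
            if (p->next)
                __builtin_prefetch(p->next->next, 0, 3);
            s += p->value;
        }
        return s;
    }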

    (2006.10) (7) System Organization Consider a computer system configuration with a CPU connected to a main memory bank, input devices and output devices through a bus, as shown in the figure below. The CPU is rated at 10 MIPS, i.e., has a peak instruction execution rate of 10 MIPS. On the average, to fetch and execute one instruction, CPU needs 40 bits from memory, 2 bits from an input device and sends 2 bits to an output device. The peak bandwidth of the input and output devices are 3 Megabytes per second each. The bandwidth of the memory bank is 10 Megabytes per second, and the bandwidth of the bus is 40 Megabytes per second. (a) What is the peak instruction execution rate of the system as configured below? (b) What unit is the bottleneck in the system? (c) Suggest several ways in which you might modify the bottleneck unit so as to improve the

    instruction execution rate.


    (d) Using only units with specifications as given in the problem statement, show a system configuration (redraw Figure 1) where the CPU is likely to be the bottleneck. Briefly justify your answer.

    Figure 1: System Configuration

    Ans:

(a) Each instruction needs 40 bits (5 bytes) from memory, so the memory bank can sustain 10 * 1024 * 1024 / (40/8) ≈ 2 * 10^6 instructions per second, i.e. about 2 MIPS. Each instruction needs 2 bits (0.25 byte) from the input device and 2 bits to the output device, so each device can sustain 3 * 1024 * 1024 / (2/8) ≈ 12 * 10^6 instructions per second, i.e. about 12 MIPS. The CPU's peak rate is 10 MIPS, and the 40 MB/s bus can carry the roughly 5.5 bytes per instruction at a higher rate than the memory can supply them, so the peak instruction execution rate of the system is limited to about 2 MIPS.

(b) CPU = 10 MIPS, memory = 2 MIPS, input and output devices = 12 MIPS each; the bottleneck is the memory bank.

(c) i. Add a cache between the CPU and memory, so that most instruction and data references are served by the cache and the memory miss penalty is paid only occasionally, improving performance. ii. Add a TLB so that translating virtual addresses to physical addresses does not require extra memory references.

(d) Connect the CPU to several memory banks and to the input and output devices over separate buses or ports, so that the aggregate memory and I/O bandwidth exceeds what the 10 MIPS CPU can consume; the program then becomes CPU bound and the CPU is the bottleneck.
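A small sketch of the arithmetic in (a) and (b); the constants come straight from the problem statement:

    #include <stdio.h>

    int main(void)
    {
        double MB = 1024.0 * 1024.0;
        double mem_bw = 10 * MB, io_bw = 3 * MB, cpu_peak = 10e6;

        double mem_rate = mem_bw / (40.0 / 8.0);   /* 40 bits/instr from memory */
        double in_rate  = io_bw  / (2.0  / 8.0);   /* 2 bits/instr from input   */
        double out_rate = io_bw  / (2.0  / 8.0);   /* 2 bits/instr to output    */

        double peak = cpu_peak;
        if (mem_rate < peak) peak = mem_rate;
        if (in_rate  < peak) peak = in_rate;
        if (out_rate < peak) peak = out_rate;

        printf("memory-limited: %.2f MIPS, I/O-limited: %.2f MIPS, system peak: %.2f MIPS\n",
               mem_rate / 1e6, in_rate / 1e6, peak / 1e6);   /* ~2.10, ~12.58, ~2.10 */
        return 0;
    }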



    (2006.03) (6) Consider a bus-based symmetric shared-memory multiprocessor system which looks like the figure below.

Here we plan to use Sun's UltraSPARC III microprocessors to build a commercial server system which mainly runs an OLTP (On-Line Transaction Processing) application. The UltraSPARC III processor has a two-level cache hierarchy. Let us analyze the following simulated results to characterize the processor performance for this OLTP workload.



    Note that the chart shows the sources for memory cycles per instruction (MCPI): true sharing, false sharing, cold (compulsory) misses, capacity misses, conflict misses, and instruction misses. The cycles per instruction (CPI) on a processor = Ideal_CPI + MCPI, where Ideal_CPI is the cycles per instruction under a perfect memory system (i.e. infinite fully-associate L1 cache, perfect pre-fetching, perfect caching, etc.) (a) Explain what false sharing and true sharing mean. Why do they increase as the processor

    count increase? (b) Assuming the Ideal_CPI for the UltraSPARC III processor is 1. Suppose the OLTP

    workload is fully parallelized and has lots of threads (e.g. 256) to run in parallel on the system. Use the figure above to estimate the CPI and the speedup for the system running the OLTP workload with 1, 2, 4, and 8 processors.

    (c) Based on your analysis, do you think the OLTP workload can scale beyond 8 processors? List all the reasons you can think of.

    Ans:

(a) A cache line (block) contains several words, and the coherence protocol tracks sharing at block granularity, so communication between processors causes coherence misses, which are classified as true sharing or false sharing misses.
True sharing: processor A writes word X of a block and processor B later reads the same word X. The invalidation and the subsequent miss are required by the cache coherency protocol, because real data is being communicated.
False sharing: processor A writes word X of a block while processor B only uses a different word Y of the same block. B's copy of the block is invalidated anyway, and B's next access to word Y misses even though no data was actually exchanged.
As the processor count increases, shared data is touched by more processors, so both the amount of real communication (true sharing) and the probability that unrelated words of the same block are used by different processors (false sharing) grow, and these coherence misses increase.

(b) From the chart, the MCPI for 1, 2, 4, and 8 processors is about 1.1, 1.5, 1.75, and 2.8 respectively. Dividing the ideal CPI of 1 among the processors, the effective CPI and speedup of the system running the OLTP workload are:

    Processors | 1               | 2             | 4                | 8
    CPI        | 1/1 + 1.1 = 2.1 | 1/2 + 1.5 = 2 | 1/4 + 1.75 = 2   | 1/8 + 2.8 = 2.925
    Speedup    | 2.1/2.1 = 1     | 2.1/2 = 1.05  | 2.1/2 = 1.05     | 2.1/2.925 = 0.72

(Speedup is computed as Speedup = ExTime_1 / ExTime_P = CPI_1 / CPI_P, since the instruction count and clock rate are unchanged.)

(c) No. As the processor count grows, the true-sharing and false-sharing misses keep increasing, so MCPI keeps rising. Already at 8 processors the speedup from (b) is only 0.72, i.e. the effective CPI is worse than with a single processor, so running the OLTP workload on more than 8 processors would only increase the memory-system overhead further; the workload does not scale beyond 8 processors on this system.

    (2005.10) (4) Compare the pros & cons of a synchronous vs. asynchronous bus. Ans:

(1) Synchronous bus:
Pros: the control logic is simple and transfers are fast, since everything is governed by the common bus clock.
Cons: (i) every device on the bus must run at the same clock rate, so fast and slow devices mix poorly; (ii) because of clock skew, a fast synchronous bus cannot be physically long.
(2) Asynchronous bus:
Pros: (i) it can accommodate devices of widely different speeds; (ii) there is no shared clock, so there is no clock-skew or bus-length limitation.
Cons: the handshaking protocol required to coordinate each transfer adds overhead (extra handshake signals and buffering), making it slower than a synchronous bus built in the same technology.

(2005.03) (4) Describe Flynn's classification of computers. Ans:

Flynn's classification:
(1) Single Instruction Single Data (SISD): a single instruction stream operates on a single data stream; this is the conventional uniprocessor.
(2) Single Instruction Multiple Data (SIMD): the same instruction is executed by multiple processing elements on different data streams; vector and array processors exploit data-level parallelism in this way.
(3) Multiple Instruction Single Data (MISD): multiple instruction streams operate on the same data stream, for example data passing through a pipeline of different functional units; no commercial machine of this type has been built.
(4) Multiple Instruction Multiple Data (MIMD): each processor fetches its own instructions and operates on its own data; multiprocessors and clusters belong to this class.

    (2004.03) (5) Describe one pre-fetching mechanism that works well for programs with sequential access pattern.


Ans: A stream buffer works well for sequential access patterns. On a cache miss, the missed block is brought into the cache and the next sequential block(s) are prefetched from memory into the stream buffer. On a later miss the stream buffer is checked first; if the requested block is there, it is moved into the cache (and the buffer fetches the next block), avoiding a full memory access. Because a sequential-access program touches consecutive blocks, most of its would-be misses are satisfied from the stream buffer, so CPU stalls for memory are greatly reduced.

    (2002.10) (2) Please describe how snooping protocol maintains cache coherency in a multiprocessor environment? Ans:

In a snooping protocol, every cache controller monitors (snoops on) the shared bus; because all processors observe every bus transaction, each cache can take the action needed to keep its copies coherent. Two policies are used to maintain cache coherency: write-invalidate and write-update.
(1) Write-invalidate protocol
a. With write-through caches:
   i. Read miss: the requesting processor fetches the block from memory into its cache; memory is always up to date.
   ii. Write miss: the writing processor broadcasts an invalidate on the bus; every other processor that snoops the invalidate and holds a copy invalidates its cache line. The write also updates memory (write-through), so memory stays up to date.
b. With write-back caches:
   i. Read miss: the request is placed on the bus; if another processor's cache holds the block dirty, that cache supplies the data, otherwise memory supplies it.
   ii. Write miss: the writing processor broadcasts an invalidate on the bus and all other copies are invalidated, so the writer's cache line becomes the only (dirty) copy. When another processor later has a read miss on that block and the copy is dirty, memory is updated as the block is written back, because with write-back caches a read miss cannot always be served correctly by memory alone.
(2) Write-update protocol
a. Read miss: the block is obtained from memory (or from another cache), as above.
b. Write: the writing processor broadcasts the new data on the bus; every cache holding a copy updates it (and memory may be updated as well), so all copies always contain the latest value.

    (2002.03) (4) (a) Why is snooping protocol more suitable for centralized shared-memory multiprocessor

    architectures than distributed architectures? (b) P1, P2, and P3 are three nodes on a distributed shared-memory multiprocessor

    architecture, which implements a directory-based cache coherency protocol. Assume that caches implement write-back policy. P1 and P2 have cache block A initially. For the following event sequence, show the sharer and state transition (shared, uncached and exclusive) of block A in the directory and the message types sent among nodes:

                        State        Sharer        Message-Type
    Initial
    P3 writes 10 to A
    P1 read A

Ans:

(a) A centralized shared-memory multiprocessor is a uniform memory access (UMA) machine: all processors reach memory over a single shared bus, so every cache can see (snoop) every request on that bus. A distributed architecture is a non-uniform memory access (NUMA) machine: processors and their local memories are connected by an interconnection network, and there is no single medium that all requests pass through. A snooping protocol, whether write-invalidate or write-update, relies on broadcasting every miss and invalidate so that all processors observe it; this is natural and cheap on a shared bus, but in a distributed machine it would require broadcasting every request across the whole network, which does not scale. Snooping therefore suits centralized shared-memory (small-scale) machines, while distributed machines use directory-based protocols.

(b)

    Event             | State     | Sharers  | Message-Type
    Initial           | Shared    | {P1, P2} |
    P3 writes 10 to A | Exclusive | {P3}     | Write miss: P3 -> Home; Invalidate: Home -> P1, P2; Data value reply: Home -> P3
    P1 reads A        | Shared    | {P1, P3} | Read miss: P1 -> Home; Fetch: Home -> P3; Data write back: P3 -> Home; Data value reply: Home -> P1

    (2001.03) (4) What is the minimum number of states that a cache coherence protocol must have? Please describe the meanings of these states.


Ans: A cache coherence protocol needs at least three states per cache block:
(1) Exclusive state: this processor's cache holds the only valid copy of the block and may write it; all copies of the block in other processors' caches have been invalidated.
(2) Shared state: the block is valid in this cache and may also have copies in other processors' caches; it may be read, but exclusive ownership must be obtained before it is written.
(3) Invalid state: the block in this cache contains no valid data; the valid copies, if any, are in memory or in other processors' caches.

    (2001.03) (5) Why do architecture designers employ cache coherence protocols that have more states than the answer you give in problem 4? Ans:

Extra states let the protocol avoid unnecessary bus traffic: splitting the exclusive case into Modified and Exclusive records whether the block differs from main memory, so a write to an Exclusive (clean) block needs no bus transaction, and adding Owned lets a cache supply a dirty block to other sharers without first writing it back to main memory. [Reference: Wikipedia, MOESI protocol]
Modified: A cache line in the modified state holds the most recent, correct copy of the data. The copy in main memory is stale (incorrect), and no other processor holds a copy.
Owned: A cache line in the owned state holds the most recent, correct copy of the data. The owned state is similar to the shared state in that other processors can hold a copy of the most recent, correct data. Unlike the shared state, however, the copy in main memory can be stale (incorrect). Only one processor can hold the data in the owned state; all other processors must hold the data in the shared state.
Exclusive: A cache line in the exclusive state holds the most recent, correct copy of the data. The copy in main memory is also the most recent, correct copy of the data. No other processor holds a copy of the data.
Shared: A cache line in the shared state holds the most recent, correct copy of the data. Other processors in the system may hold copies of the data in the shared state, as well. The copy in main memory is also the most recent, correct copy of the data, if no other processor holds it in the owned state.
Invalid: A cache line in the invalid state does not hold a valid copy of the data. Valid copies of the data can be either in main memory or in another processor's cache.


    (2000.10) (2) Advanced vector processors may incorporate a mechanism called chaining to improve performance. Can you use an example to elaborate how chaining works and how it improves vector processor performance? Ans:

Chaining is the vector equivalent of forwarding: the result elements of one vector instruction are forwarded to a dependent vector instruction as soon as they are produced, instead of waiting for the whole vector register to be written. Consider the two vector instructions (1) MULV.D V1,V2,V3 and (2) ADDV.D V4,V1,V5. Instruction (2) reads V1, which (1) is still producing. Without chaining, (2) must wait until (1) has written all of V1; with chaining, each element of V1 is forwarded to the adder as soon as the multiplier produces it, so the two instructions overlap and performance improves. Taking the multiplier latency as 7 cycles and the adder latency as 6 cycles, for a vector length of n the unchained execution takes about (7 + n) + (6 + n) cycles, while the chained execution takes only about 7 + 6 + n cycles.
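A small sketch of the cycle-count arithmetic above (the vector length of 64 is an assumed example value):

    #include <stdio.h>

    int main(void)
    {
        int mul_lat = 7, add_lat = 6, n = 64;          /* n = vector length (assumed) */

        int unchained = (mul_lat + n) + (add_lat + n); /* ADDV waits for all of V1    */
        int chained   = mul_lat + add_lat + n;         /* ADDV consumes V1 element by element */

        printf("unchained = %d cycles, chained = %d cycles\n", unchained, chained);
        /* prints: unchained = 141 cycles, chained = 77 cycles */
        return 0;
    }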

    (1999.10) (4) Can you describe when the following bus transactions are issued in the Mbus cache coherence protocol? (a) coherent read (b) coherent read and invalidate (c) invalidate Ans:

(a) Coherent read: issued by a cache on a read miss, to load the cache line.
(b) Coherent read and invalidate: issued by a write-allocate cache on a write miss, to load the cache line and at the same time invalidate the copies of the block in all other caches.
(c) Invalidate: issued by a cache on a write hit to a block that is in a shared state, to invalidate the copies in other caches without transferring any data.

[Reference] The five possible states of a data block are:
Invalid (I): Block is not present in the cache.
Clean exclusive (CE): The cached data is consistent with memory, and no other cache has it.
Owned exclusive (OE): The cached data is different from memory, and no other cache has it.


    This cache is responsible for supplying this data instead of memory when other caches request copies of this data.

    Clean shared (CS): The data has not been modified by the corresponding CPU since cached. Multiple CS copies and at most one OS copy of the same data could exist.

    Owned shared (OS): The data is different from memory. Other CS copies of the same data could exist. This cache is responsible for supplying this data instead of memory when other caches request copies of this data. (Note, this state can only be entered from the OE state.)

    The MBus transactions with which we are concerned are: Coherent Read (CR): issued by a cache on a read miss to load a cache line. Coherent Read and Invalidate (CRI): issued by a cache on a write-allocate after a write miss. Coherent Invalidate (CI): issued by a cache on a write hit to a block that is in one of the

    shared states. Block Write (WR): issued by a cache on the write-back of a cache block. Coherent Write and Invalidate (CWI): issued by an I/O processor (DMA) on a block write (a

full block at a time).

(1999.03) (6) A multiprocessor designer incorporates a write-invalidate cache coherence protocol. What does the designer need to do in order to guarantee sequential consistency? Ans:

[Reference] http://www.ece.cmu.edu/~ece548/handouts/19coher.pdf
On top of a write-invalidate protocol, guaranteeing sequential consistency requires that: (1) each processor issues its memory operations in program order; (2) a processor does not start its next memory operation until the previous one has completed, and in particular a write is not considered complete until all invalidations it causes have been performed, so no processor can still read the old value; (3) the memory operations of all processors then appear as a single interleaved order that respects each processor's program order.

(1998.10) (3) If an architecture designer implements a multiprocessor system that shares cache memory, what degree of parallel processing granularity is the machine designed for? Ans:

[Reference] http://en.wikipedia.org/wiki/Granularity
A multiprocessor whose processors share a cache is essentially a chip multiprocessor (CMP). In parallel computing, granularity is the ratio of computation to communication: coarse granularity means long stretches of computation between communication or synchronization events, and fine granularity means frequent communication relative to computation, with correspondingly higher synchronization and communication overheads. Because the processors of a CMP share a cache, the cost of communication and synchronization between them is very low, so such a machine is designed to exploit small-granularity (fine-grained) parallelism, such as fine-grained multithreading, in which threads communicate frequently.


    External Issues

(2008.10)(4) Why has power management become important, and what power management techniques are used?
(2008.03)(4) Describe 4 approaches to reduce the power consumption of a microprocessor.
ANS: Power has become a first-order design constraint: battery-powered devices such as PDAs and laptops need low power for battery life, high power density creates a thermal (cooling) issue, and overall energy consumption is an environmental issue. Representative techniques:

Clock gating: disable the clock to a circuit whenever the circuit is not being used.
Pipeline balancing: general-purpose processors contain more resources than a given program needs; by exploiting variations in IPC (instructions per clock cycle), unused resources such as issue slots and functional units can be disabled.
Power gating: cut the supply to idle blocks with sleep transistors so that their leakage power is eliminated (used, for example, in Intel's Nehalem "Power Gate" design).
Resizable caches: exploit the variability of applications' cache requirements to reduce the active cache size and eliminate energy dissipation in the unused sections, with minimal impact on performance.
Phase detection and configuration selection: detect program phases and choose a low-power configuration (cache size, active cores, voltage/frequency) for each phase. A temporal approach partitions the program's execution into fixed intervals and uses a run-time monitor to detect phases; a positional approach associates phases with code structures such as loops or subroutines.
Block (line) buffer: keep the most recently accessed cache line in a small line buffer; a hit in the buffer avoids searching the whole cache array, saving energy on each access.
Drowsy cache: a cache line can stay in a lower power mode in which its content is preserved. Pros and cons: it does not reduce leakage as much as gated-Vdd, but it has much smaller state-transition overheads.

(2008.10)(5) What is a digital signal processor (DSP)? How does a DSP differ from a general-purpose processor?
ANS: A digital signal processor (DSP or DSP micro) is a specialized microprocessor designed specifically for digital signal processing, generally in real-time computing. Digital signal processing algorithms typically require a large number of mathematical operations to be performed quickly on a set of data: signals are converted from analog to digital, manipulated digitally, and then converted back to analog form. Many DSP applications have constraints on latency; that is, for the system to work, the DSP operation must be completed within some time constraint.


The main differences between a DSP and a general-purpose processor (GPP) include:
1. Arithmetic: a DSP data path is built around fixed-point arithmetic with word widths (and guard bits) chosen for signal processing, and its central element is the multiply-accumulate (MAC) unit; a GPP relies on general integer and floating-point units.
2. Inner-loop throughput: a DSP can sustain one or more MACs per cycle, which is exactly what filter and transform kernels need, while a GPP requires several instructions for the same work.
3. Memory architecture: a DSP uses Harvard-style organizations with multiple memory banks and buses so that an instruction and its data operands can be fetched in the same cycle, whereas a GPP relies on a cache hierarchy.
4. Addressing: a DSP provides specialized addressing modes, such as circular (modulo) addressing for sample buffers and bit-reversed addressing.
5. Loop support: a DSP offers zero-overhead hardware loops for fixed-count kernels such as the FFT, while a GPP pays branch overhead or depends on branch prediction.
6. Predictability: a DSP must meet real-time deadlines, so it avoids features that make execution time hard to bound; a GPP embraces large caches, dynamic branch prediction, and out-of-order execution to improve average-case performance at the cost of predictability.
7. Programming: performance-critical DSP kernels (MAC-dominated filters such as FIR) are often written in assembly, or in C with DSP-specific intrinsics and hand tuning, while GPP software is normally written in portable C/C++ and left to the compiler.
8. Target: DSPs aim at embedded, cost- and power-sensitive signal-processing products, while GPPs aim at general applications and maximum flexibility.

    (2007.10) (1) System-on-Chip (SoC) is a recent trend in the design of computer systems. Please explain the concept of SoC and discuss on the importance of SoC from all aspects (e.g. cost, time-to-development, etc). Ans:

An SoC integrates on a single chip the processing cores (MPU, DSP), the memories (RAM, ROM, FLASH), and the peripheral logic that previously formed a board of separate ICs. Building a product around an SoC reduces cost, board area, and power, and, most importantly, shortens the time to market, which is why consumer products such as DVD and MP3 players are now designed as SoCs. As IC process technology packs ever more transistors onto one die, SoC design, together with the reuse of existing IP blocks, has become the mainstream way to build such systems.

(2004.03) (7) How do disk arrays affect the performance, availability and reliability of the storage system compared to a single-disk system? (2003.10) (8) How do disk arrays affect performance, reliability and availability? Ans:

[Reference] http://en.wikipedia.org/wiki/RAID
A RAID combines multiple disks and, depending on the level, uses mirroring, striping, and error-correcting (parity) information to improve reliability and/or I/O performance; different levels trade off performance, availability, and reliability differently.
(1) Performance vs. reliability: mirroring keeps a complete duplicate of the data on another disk, which greatly improves reliability (and can speed up reads) at the cost of capacity and write performance; striping spreads data across the disks so that accesses proceed in parallel, improving performance, but by itself it lowers reliability because the failure of any one disk loses data.
(2) Availability: when a single-disk system fails, it must be shut down until the data is restored; a disk array with error correction (fault tolerance) keeps servicing requests while a failed disk is replaced and rebuilt, so the machine has high availability.

    (1999.03) (5) (a) Can you show how a level-5 RAID places the parity data? (b) When a level-5 RAID needs to write to a file block, how does it compute the new parity? Ans:

(a) [Reference] http://web.mit.edu/rhel-doc/4/RH-DOCS/rhel-isa-zh_tw-4/s1-storage-adv.html
RAID 5 stripes data across the disks as RAID 0 does, but it also stores parity: in each stripe, one of the n disks holds the parity block computed from the other n-1 data blocks, and the parity blocks are rotated (distributed) across all the disks so that no single disk becomes a parity bottleneck.

(b) [Reference] http://docs.sun.com/app/docs/doc/806-3204/6jccb3gad?l=zh_tw&a=view
1. When the write covers only some of the disk blocks in a stripe, the array performs multiple I/Os in a read-modify-write sequence: it reads the old data block(s) and the old parity into buffers, computes the new parity from them and the new data (new parity = old parity XOR old data XOR new data), logs it, and then writes the new data and the new parity to the stripe units.
2. When the write covers all of the data blocks in a stripe, the new parity is simply the XOR of all the new data blocks; the data and the parity are then logged and written to the stripe units, with no need to read the old contents.
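A minimal sketch of the parity update in (b); the block size and data values are illustrative only:

    #include <stdio.h>
    #include <stdint.h>

    #define BLOCK 8   /* illustrative block size in bytes */

    /* new parity = old parity XOR old data XOR new data, byte by byte */
    void update_parity(const uint8_t *old_data, const uint8_t *new_data,
                       uint8_t *parity)
    {
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= old_data[i] ^ new_data[i];
    }

    int main(void)
    {
        uint8_t d_old[BLOCK]  = {1, 2, 3, 4, 5, 6, 7, 8};  /* old contents of the block  */
        uint8_t d_new[BLOCK]  = {9, 9, 9, 9, 9, 9, 9, 9};  /* data being written         */
        uint8_t parity[BLOCK] = {0};                       /* current parity of the stripe */

        /* Assume parity currently equals the XOR of all data blocks in the
         * stripe; only the changed block and the parity block need I/O. */
        update_parity(d_old, d_new, parity);
        printf("first parity byte after update: %u\n", parity[0]);
        return 0;
    }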