
Computer Architecture Lectures 1 & 2: Study Report (Second Report). 亢吉男, 2015.4.11




1、What is Computer Architecture?

The science and art of designing, selecting, and interconnecting hardware components, and designing the hardware/software interface, to create a computing system that meets functional, performance, energy-consumption, cost, and other specific goals.

2、Our purpose

Enable better systems: make computers faster, cheaper, smaller, more reliable, …
Enable new applications.
Enable better solutions to problems: software innovation builds on trends and changes in computer architecture; >50% performance improvement per year has enabled this innovation.


3、Problems are solved by electrons

The levels of transformation, from problem down to electrons:

Problem
Algorithm
Program/Language
Runtime System (VM, OS, MM)
ISA (Architecture)
Microarchitecture
Logic
Circuits
Electrons

ISA (instruction set architecture) is the interface between hardware and software: a contract that the hardware promises to satisfy.
Microarchitecture: a specific implementation of an ISA, not visible to the software.
Microprocessor: ISA + microarchitecture + circuits. "Architecture" = ISA + microarchitecture.

Implementations (uarch) can vary as long as they satisfy the specification (ISA), and the microarchitecture usually changes faster than the ISA: there are few ISAs (x86, ARM, SPARC, MIPS, Alpha) but many uarchs.


4、The Von Neumann Model/Architecture

Also called the stored program computer (instructions in memory). Two key properties:

(1) Stored program
Instructions are stored in a linear memory array;
memory is unified between instructions and data;
the interpretation of a stored value depends on the control signals.

(2) Sequential instruction processing
One instruction is processed (fetched, executed, and completed) at a time;
the program counter (instruction pointer) identifies the current instruction;
the program counter is advanced sequentially except for control transfer instructions.
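The two properties of the von Neumann model can be sketched as a toy interpreter (the ISA here is hypothetical, invented only for illustration): instructions and data share one linear memory, and a program counter advances sequentially except on a control transfer.

```python
def run(memory, pc=0):
    """Interpret memory as a toy von Neumann program: what a stored
    cell means depends on whether the control unit decodes it as an
    instruction or data."""
    acc = 0                           # a single accumulator register
    while True:
        op, arg = memory[pc]          # fetch the current instruction
        if op == "LOAD":              # acc <- memory[arg] (a data cell)
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "JUMP":            # control transfer: pc set explicitly
            pc = arg
            continue
        elif op == "HALT":
            return acc
        pc += 1                       # default: advance sequentially

# Instructions (tuples) and data (ints) live side by side in memory.
mem = [("LOAD", 4), ("ADD", 5), ("STORE", 6), ("HALT", 0), 2, 3, 0]
result = run(mem)   # → 5 (computes 2 + 3, also stored in cell 6)
```

Note how cell 4 would be meaningless as an instruction: only the control signals (here, where the program counter points) decide how a stored value is interpreted.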


[Figure: The Von Neumann Model of a Computer. MEMORY (Mem Addr Reg, Mem Data Reg), PROCESSING UNIT (ALU, TEMP), CONTROL UNIT (IP, Inst Register), INPUT, OUTPUT.]


5、The Dataflow Model (of a Computer)

Dataflow model: an instruction is fetched and executed in data-flow order, i.e., when its operands are ready. Properties:

There is no instruction pointer;
instruction ordering is specified by data-flow dependences;
each instruction specifies "who" should receive the result;
an instruction can "fire" whenever all of its operands are received;
potentially many instructions can execute at the same time.

In a dataflow machine, a program consists of dataflow nodes. A dataflow node fires (is fetched and executed) when all its inputs are ready.

By contrast, in the von Neumann model an instruction is fetched and executed in control-flow order: as specified by the instruction pointer, and sequential unless there is an explicit control-flow instruction.
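The firing rule above can be sketched in a few lines (the node and token formats are hypothetical, chosen only to make the idea concrete): a node fires when all its operands have arrived, and its result is sent directly to its consumers, with no program counter anywhere.

```python
import operator

def dataflow_run(nodes, tokens):
    """nodes: name -> (op, n_operands, [(dest_node, dest_slot), ...]);
    tokens: list of (node, slot, value) operand deliveries."""
    ops = {"+": operator.add, "-": operator.sub, "*": operator.mul}
    waiting = {name: {} for name in nodes}   # operands collected so far
    results = {}
    work = list(tokens)
    while work:
        node, slot, value = work.pop()
        waiting[node][slot] = value
        op, n, dests = nodes[node]
        if len(waiting[node]) == n:          # all operands ready: fire
            out = ops[op](*[waiting[node][i] for i in range(n)])
            for dest, dslot in dests:
                if dest == "OUT":            # final result leaves the graph
                    results[dslot] = out
                else:                        # send a token to the consumer
                    work.append((dest, dslot, out))
    return results

# (a + b) * (a - b) with a=5, b=3; ordering comes purely from data
# dependences: "add" and "sub" could fire in either order (or at once).
graph = {
    "add": ("+", 2, [("mul", 0)]),
    "sub": ("-", 2, [("mul", 1)]),
    "mul": ("*", 2, [("OUT", "result")]),
}
answer = dataflow_run(graph, [("add", 0, 5), ("add", 1, 3),
                              ("sub", 0, 5), ("sub", 1, 3)])
print(answer)   # → {'result': 16}
```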


6、The distinctions between ISA and uarch

ISA: specifies how the programmer sees instructions to be executed. With a von Neumann ISA, the programmer sees a sequential, control-flow execution order; with a dataflow ISA, the programmer sees a data-flow execution order.

Microarchitecture: how the underlying implementation actually executes instructions. The microarchitecture can execute instructions in any order as long as it obeys the semantics specified by the ISA when making the instruction results visible to software; the programmer should always see the order specified by the ISA.


All major instruction set architectures today use the von Neumann model: x86, ARM, MIPS, SPARC, Alpha, POWER.

Underneath (at the microarchitecture level), the execution model of almost all implementations (microarchitectures) is very different:

Pipelined instruction execution: Intel 80486 uarch;
multiple instructions at a time: Intel Pentium uarch;
out-of-order execution: Intel Pentium Pro uarch;
separate instruction and data caches.

But whatever happens underneath that is not consistent with the von Neumann model is not exposed to software.
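One way to make this concrete is the reorder buffer idea used by out-of-order machines: a toy sketch (not any real pipeline) in which instructions may finish in any order, yet results are retired, and so become software-visible, strictly in program order.

```python
def retire_in_order(program_order, completion_order):
    """program_order: instruction ids in ISA order.
    completion_order: the order in which the uarch finishes them.
    Returns the order in which results become visible to software."""
    done = set()
    retired = []
    for finished in completion_order:
        done.add(finished)
        # Retire only the oldest instructions whose results are ready,
        # strictly in program order; younger finished ones must wait.
        while program_order and program_order[0] in done:
            retired.append(program_order.pop(0))
    return retired

# i2 finishes first (say, a cache hit) while i0 misses, but software
# still observes i0, i1, i2 in ISA order.
order = retire_in_order(["i0", "i1", "i2"], ["i2", "i0", "i1"])
print(order)   # → ['i0', 'i1', 'i2']
```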


ISA

(1) Instructions: opcodes, addressing modes, data types; instruction types and formats; registers, condition codes.
(2) Memory: address space, addressability, alignment; virtual memory management; call, interrupt/exception handling.
(3) Access control, priority/privilege.
(4) I/O: memory-mapped vs. instruction-based.
(5) Task/thread management.
(6) Power and thermal management.
(7) Multi-threading support, multiprocessor support.


Microarchitecture

Implementation of the ISA under specific design constraints and goals; anything done in hardware without exposure to software.

(1) Pipelining
(2) In-order versus out-of-order instruction execution
(3) Memory access scheduling policy
(4) Speculative execution
(5) Superscalar processing (multiple instruction issue)
(6) Clock gating
(7) Caching: levels, size, associativity, replacement policy
(8) Prefetching
(9) Voltage/frequency scaling
(10) Error correction


7、Three examples in lecture 1

There are three examples (questions) in lecture 1. Because I mainly studied the second example, DRAM Refresh, I will say more about it and describe the others briefly.

(1) Denial of Memory Service in Multi-Core Systems
(2) DRAM Refresh
(3) DRAM Row Hammer (or DRAM Disturbance Errors)


(1) Denial of Memory Service in Multi-Core Systems

In a multi-core chip, different cores share some hardware resources; in particular, they share the DRAM memory system. Multiple applications share the DRAM controller, and DRAM controllers are designed to maximize DRAM data throughput. As a result, DRAM scheduling policies are unfair to some applications:

Row-hit first: unfairly prioritizes apps with high row-buffer locality (threads that keep accessing the same row).
Oldest-first: unfairly prioritizes memory-intensive applications.

(3) DRAM Row Hammer (or DRAM Disturbance Errors)

Repeatedly opening and closing a row enough times within a refresh interval induces disturbance errors in adjacent rows in most real DRAM chips.
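The unfair scheduling of example (1) can be sketched as a simplified row-hit-first, then oldest-first policy (in the spirit of FR-FCFS; the request format here is hypothetical): a thread that keeps hitting the open row is served ahead of another thread's older request.

```python
def pick_next(queue, open_row):
    """Pick the request the controller services next.
    Each request is (arrival_time, thread, row)."""
    # 1. Row-hit first: prefer requests to the currently open row.
    hits = [r for r in queue if r[2] == open_row]
    candidates = hits if hits else queue
    # 2. Oldest-first among the remaining candidates.
    return min(candidates, key=lambda r: r[0])

# Thread A streams through row 3; thread B's request to row 7 arrived
# earliest of all, yet with row 3 open it keeps losing to A's row hits.
queue = [(0, "B", 7), (1, "A", 3), (2, "A", 3), (3, "A", 3)]
choice = pick_next(queue, open_row=3)
print(choice)   # → (1, 'A', 3): a younger request wins over B's older one
```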


(2) DRAM Refresh

A DRAM cell consists of a capacitor and an access transistor; it stores data as charge on the capacitor. DRAM capacitor charge leaks over time, so the memory controller needs to refresh each row periodically to restore the charge, typically activating each row every 64 ms.

Downsides of refresh:
-- Energy consumption: each refresh consumes energy.
-- Performance degradation: the DRAM rank/bank is unavailable while being refreshed.
-- Predictability impact: (long) pause times during refresh.
-- Refresh rate limits DRAM capacity scaling.
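The performance downside can be estimated with a back-of-envelope sketch; the device parameters below (commands per 64 ms window, per-command blocking time) are illustrative assumptions, not figures from the lecture.

```python
# All numbers here are assumed for illustration only.
refresh_interval_ms = 64.0     # every row refreshed once per 64 ms
num_refresh_commands = 8192    # refresh commands issued per interval
t_rfc_ns = 350.0               # time the rank is blocked per command

busy_ns = num_refresh_commands * t_rfc_ns     # total blocked time
interval_ns = refresh_interval_ms * 1e6
overhead = busy_ns / interval_ns              # fraction of time lost
print(f"rank unavailable {overhead:.1%} of the time")   # → 4.5%
```

Under these assumptions the rank is unavailable several percent of the time, and the fraction grows with device capacity, which is the capacity-scaling concern noted above.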


Existing DRAM devices refresh all cells at a rate determined by the leakiest cell in the device. However, most DRAM cells can retain data for significantly longer. Therefore, many of these refreshes are unnecessary.

Solution: RAIDR (Retention-Aware Intelligent DRAM Refresh), a low-cost mechanism that can identify and skip unnecessary refreshes using knowledge of cell retention times. The key idea is to group DRAM rows into retention-time bins and apply a different refresh rate to each bin. As a result, rows containing leaky cells are refreshed as frequently as before, while most rows are refreshed less frequently. RAIDR requires no modification to DRAM and only minimal modification to the memory controller, with a modest storage overhead of 1.25 KB in the memory controller.
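The binning idea can be sketched as follows; the retention values and bin boundaries (64/128/256 ms) are illustrative assumptions in the spirit of RAIDR, not its exact configuration.

```python
def assign_bins(retention_ms):
    """Map each row to a refresh interval (ms): leaky rows keep the
    baseline 64 ms rate, most rows are refreshed less often."""
    bins = {}
    for row, t in retention_ms.items():
        if t < 128:
            bins[row] = 64       # leakiest rows: baseline rate
        elif t < 256:
            bins[row] = 128      # refreshed half as often
        else:
            bins[row] = 256      # the common case: 4x fewer refreshes
    return bins

# Made-up per-row retention times; most rows retain data far longer
# than the worst-case 64 ms assumption.
retention = {"r0": 90, "r1": 200, "r2": 5000, "r3": 12000}
print(assign_bins(retention))
# → {'r0': 64, 'r1': 128, 'r2': 256, 'r3': 256}
```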


More details about RAIDR

A retention-time profiling step determines each row's retention time ((1) in Figure 4). For each row, if the row's retention time is less than the new default refresh interval, the memory controller inserts it into the appropriate bin (2). During system operation (3), the memory controller ensures that each row is chosen as a refresh candidate every 64 ms.


Three key components: (1) retention time profiling; (2) storing rows into retention-time bins; (3) issuing refreshes to rows when necessary.

(1) Retention time profiling
The straightforward method of conducting these measurements is to write a small number of static patterns (such as "all 1s" or "all 0s"), turn off refreshes, and observe when the first bit changes. Before the row retention times for a system are collected, the memory controller performs refreshes using the baseline auto-refresh mechanism. After the row retention times for a system have been measured, the results can be saved in a file by the operating system.
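The measurement loop can be sketched as a software simulation (real profiling runs against the DRAM device itself; the row model and step size here are hypothetical): write a static pattern, disable refresh, and watch for the first bit to change.

```python
def profile_row(read_row_after, pattern=0xFF, step_ms=64, max_ms=4096):
    """read_row_after(t) models reading the row t ms after refresh was
    disabled (pattern already written). Returns the measured retention
    time in ms: the last interval at which the row still held its data."""
    t = step_ms
    while t <= max_ms:
        if read_row_after(t) != pattern:   # first bit changed
            return t - step_ms
        t += step_ms
    return max_ms                          # never failed within the test

# Hypothetical model of a row that starts losing charge after ~300 ms.
leaky = lambda t: 0xFF if t < 300 else 0xFE
print(profile_row(leaky))   # → 256
```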


(2) Storing Retention Time Bins: Bloom Filters
A Bloom filter is a structure that provides a compact way of representing set membership and can be implemented efficiently in hardware. It consists of a bit array of length m and k distinct hash functions that map each element to positions in the array. A Bloom filter can contain any number of elements; the probability of a false positive gradually increases with the number of elements inserted, but false negatives never occur. This means that rows may be refreshed more frequently than necessary, but a row is never refreshed less frequently than necessary.
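A minimal software sketch of the structure just described (parameters m and k are arbitrary here; a hardware version would use simple hardware-friendly hashes rather than SHA-256):

```python
import hashlib

class BloomFilter:
    """m-bit array with k hash functions: false positives are possible,
    false negatives are not, exactly the property RAIDR relies on."""
    def __init__(self, m=256, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, item):
        # Derive k distinct positions by salting one hash function.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item):
        # An inserted item always maps to set bits: no false negatives.
        return all(self.bits[p] for p in self._positions(item))

bins_64ms = BloomFilter()
bins_64ms.add("row-42")        # a leaky row needing the 64 ms rate
print("row-42" in bins_64ms)   # → True
```

A false positive here only means a non-leaky row is refreshed more often than needed, which is safe; a false negative would mean data loss, and the structure rules that out.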


(3) Performing Refresh Operations
Selecting a refresh candidate row: we choose all refresh intervals to be multiples of 64 ms; this is implemented with a row counter that counts through every row address sequentially.
Determining time since last refresh: determining whether a row needs to be refreshed requires determining how many 64 ms intervals have elapsed since its last refresh.
Issuing refreshes: to refresh a specific row, the memory controller simply activates that row, essentially performing a RAS-only refresh.
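These three steps can be sketched together (a RAIDR-style simplification with two made-up bins, not the mechanism's exact logic): a counter walks every row address each 64 ms period, but a row is only actually refreshed when enough 64 ms intervals have elapsed for its bin.

```python
def needs_refresh(row, interval_count, bin_64ms, bin_128ms):
    """interval_count: how many 64 ms periods have elapsed so far.
    Rows absent from both bins fall into the 256 ms default bin."""
    if row in bin_64ms:                 # leakiest rows: every interval
        return True
    if row in bin_128ms:                # every second interval
        return interval_count % 2 == 0
    return interval_count % 4 == 0      # default bin: every 256 ms

# One 64 ms period: the row counter walks all rows sequentially, and
# only rows that are due get an activate (RAS-only refresh).
due = [r for r in range(8)
       if needs_refresh(r, interval_count=2, bin_64ms={1}, bin_128ms={3})]
print(due)   # → [1, 3]
```

In RAIDR the bin lookups would be Bloom-filter queries, so a false positive merely refreshes a row one tier too often.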


Auto-refresh

An auto-refresh operation occupies all banks on the rank simultaneously (preventing the rank from servicing any requests) for a length of time tRFC, where tRFC depends on the number of rows being refreshed. Previous DRAM generations also allowed the memory controller to perform refreshes by opening rows one by one (called RAS-only refresh), but this method has been deprecated due to the additional power required to send row addresses on the bus.


8、Something new in "Requirements, Bottlenecks, and Good Fortune: Agents for Microprocessor Evolution"

When a microprocessor processes instructions, it has to do three things: 1) supply instructions to the core of the processor, where each instruction can do its job; 2) supply the data needed by each instruction; 3) perform the operations required by each instruction.

(1) Instruction Supply
The number of instructions that can be fetched at one time has grown from one to four, and shows signs of soon growing to six or eight. Three things can get in the way of fully supplying the core with instructions to process: (a) instruction cache misses, (b) fetch breaks, (c) conditional branch mispredictions.


(2) Data Supply
To supply the data needed by an instruction, one would ideally have an infinite supply of needed data, available in zero time and at reasonable cost. The best we can do is a storage hierarchy, where a small amount of data can be accessed (on-chip) in one to three cycles, a lot more data can be accessed (also on-chip) in 10 to 16 cycles, and still more data can be accessed (off-chip) in hundreds of cycles.
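The hierarchy's effect can be quantified with the standard average-memory-access-time calculation; the latencies roughly follow the ranges above, but the hit rates are assumed purely for illustration.

```python
def amat(levels):
    """Average memory access time for a list of (hit_rate, latency)
    pairs, outermost level first; the last level always hits."""
    time, reach = 0.0, 1.0
    for hit_rate, latency in levels:
        time += reach * latency      # every access reaching this level pays it
        reach *= (1.0 - hit_rate)    # fraction that misses and goes deeper
    return time

# Assumed: L1 90% hit at 2 cycles, L2 95% hit at 14 cycles,
# off-chip memory 300 cycles (always hits).
avg = amat([(0.90, 2), (0.95, 14), (1.0, 300)])
print(f"{avg:.2f} cycles on average")   # → 4.90 cycles on average
```

Even with memory hundreds of cycles away, high hit rates in the small fast levels keep the average close to the fast levels' latency, which is exactly why the hierarchy is "the best we can do".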

(3) Instruction Processing
To perform the operations required by these instructions, one needs a sufficient number of functional units to process the data as soon as it is available, and sufficient interconnections to instantly supply a result produced by one functional unit to the functional unit that needs it as a source.