EE3004 (EE3.cma) - Computer Architecture
Roger Webb
University of Surrey
http://www.ee.surrey.ac.uk/Personal/R.Webb/l3a15
also link from Teaching/Course page
Introduction - Book List
• Computer Architecture - Design & Performance, Barry Wilkinson, Prentice-Hall 1996 (nearest to the course)
• Advanced Computer Architecture, Richard Y. Kain, Prentice-Hall 1996 (good for multiprocessing + chips + memory)
• Computer Architecture, Behrooz Parhami, Oxford Univ Press 2005 (good for advanced architecture and basics)
• Computer Architecture, Dowsing & Woodhouse (good for putting the bits together)
• Microprocessors & Microcomputers - Hardware & Software, Ambrosio & Lastowski (good for DRAM, SRAM timing diagrams etc.)
• Computer Architecture & Design, Van de Goor (for basic computer architecture)
Wikipedia is as good as anything...!
Introduction - Outline Syllabus
Memory Topics
• Memory Devices
• Interfacing/Graphics
• Virtual Memory
• Caches & Hierarchies
Instruction Sets
• Properties & Characteristics
• Examples
• RISC v CISC
• Pipelining & Concurrency
Parallel Architectures
• Performance Characteristics
• SIMD (vector) processors
• MIMD (message-passing)
• Principles & Algorithms
Computer Architectures - an overview
What are computers used for?
3 ranges of product cover the majority of processor sales:
• Appliances (consumer electronics)
• Communications Equipment
• Utilities (conventional computer systems)
Consumer Electronics
This category covers a huge range of processor performance:
• Micro-controlled appliances
– washing machines, time switches, lamp dimmers
– the lower end, characterised by:
• low processing requirements
• a microprocessor replacing logic in a small package
• low power requirements
• Higher-performance applications
– mobile phones, printers, fax machines, cameras, games consoles, GPS, TV set-top boxes, video/DVD/HD recorders…
• High bandwidth - 64-bit data bus
• Low power - to avoid cooling
• Low cost - < $20 for the processor
• Small amounts of software - small cache (tight program loops)
Communications Equipment
This has become the major market - WWW, mobile comms.
• Main products containing powerful processors are:
– LAN products - bridges, routers, controllers in computers
– ATM exchanges
– satellite & cable TV routing and switching
– telephone networks (all-digital)
• The main characteristics of these devices are:
– standardised applications (IEEE, CCITT etc.) - which means competitive markets
– high-bandwidth interconnections
– wide processor buses - 32 or 64 bits
– multi-processing (either per-box, or in the distributed computing sense)
Utilities (Conventional Computer Systems)
Large-scale computing devices will, to some extent, be replaced by greater processing power on the desktop.
• But some centralised facilities are still required, especially where data storage is concerned:
– general-purpose computer servers; supercomputers
– database servers - often safer to maintain a central corporate database
– file and printer servers - again simpler to maintain
– video-on-demand servers
• These applications are characterised by huge memory requirements and:
– large operating systems
– high sustained performance over wide workload variations
– scalability - as workload increases
– 64-bit (or greater) data paths, multiprocessing, large caches
Computer System Performance
• Most manufacturers quote the performance of their processors in terms of the peak rate - MIPS (MOPS) or MFLOPS.
• Most of the applications above depend on the continuous supply of data or results - especially for video images.
• Thus the critical criterion is the sustained throughput of instructions:
– the MPEG image decompression algorithm requires 1 billion operations per second for full-quality widescreen TV
– less demanding VHS quality requires 2.7Mb per second of compressed data
– interactive simulations (games etc.) must respond to a user input within 100ms - re-computing and displaying the new image
• Important measures are:
– MIPS per dollar
– MIPS per Watt
User Interactions
Consider how we interact with our computers:
[Chart: % of CPU time spent managing interaction, 1955-2005 - rising steadily through Lights & Switches, Punched Card & Tape, Timesharing, Menus & Forms, WYSIWYG/Mice/Windows, to Virtual Reality & Cyberspace]
What does a typical CPU do?
• 70% User interface; I/O processing
• 20% Network interface; protocols
• 9% Operating system; system calls
• 1% User application
Sequential Processor Efficiency
The current state-of-the-art of large microprocessors include:
• 64-bit memory words, using interleaved memory
• Pipelined instructions
• Multiple functional units (integer, floating point, memory fetch/store)
• 5 GHz practical maximum clock speed
• Multiple processors
• Instruction set organised for simple decoding (RISC?)
However as word length increases, efficiency may drop:
• many operands are small (16 bits is enough for many VR tasks)
• many literals are small - loading 00…00101 as 64 bits is a waste
• it may be worth operating on several literals per word in parallel
Example - reducing the number of instructions
Perform a 3D transformation of a point (x,y,z) by multiplying the 4-element row vector (x,y,z,1) by a 4x4 transformation matrix A. All operands are 16 bits long.
(x' y' z' r) = (x y z 1) | a b c d |
                         | e f g h |
                         | i j k l |
                         | m n o p |

Conventionally this requires 20 loads, 16 multiplies, 12 adds and 4 stores, using 16-bit operands on a 16-bit CPU.

On a 64-bit CPU with instructions dealing with groups of four parallel 16-bit operands, as well as a modest amount of pipelining, all this can take just 7 processor cycles.
The Effect of Processor Intercommunication Latency
In a multiprocessor, and even in a uniprocessor, the delays associated with communicating and fetching data (latency) can dominate the processing times.
Consider:
[Diagram: a symmetrical multiprocessor - several CPU/memory pairs joined by an interconnection network - contrasted with a uniprocessor consisting of CPU, cache and memory]
Delays can be minimised by placing components closer together and:
• Add caches to provide local data storage
• Hide latency by multi-tasking - needs fast context switching
• Interleave streams of independent instructions - scheduling
• Run groups of independent instructions together (each ending with long latency instruction)
Memory Efficiency
Quote from the 1980s: “Memory is free”. By the 2000s the cost per bit was no longer falling so fast, and the consumer electronics market is becoming cost sensitive. There is renewed interest in compact instruction sets and data compactness - both ideas from the 1960s and 1970s.
(1977 - £3000/Mb; 1994 - £4/Mb; now - <1p/Mb)

Instruction Compactness
RISC CPUs have a simple register-based instruction encoding:
• can lead to code bloat - as can poor coding and compiler design
• compactness gets worse as the word size increases
e.g. the INMOS (1980s) transputer had a stack-based register scheme:
• needed 60% of the code of an equivalent register-based CPU
• led to smaller cache needs for instruction fetches & data
Cache Efficiency
• Designer should aim to optimise the instruction performance whilst using the smallest cache possible
• Hiding latency (using parallelism & instruction scheduling) is an effective alternative to minimising it (by using large caches)
• Instruction scheduling can initiate cache pre-fetches
• Switch to another thread if the cache is not ready to supply data for the current one
• In video and audio processing, especially, unroll the inner code loops – loop unrolling (more on that later)
Predictable Codes
In many applications (e.g. video and audio processing) much is known about the code which will be executed. Techniques which are suitable for these circumstances include:
• Partition the cache separately for code and different data structures
• The cache requirements of the inner code loops can be pre-determined, so cache usage can be optimised
• Control the amounts of a data structure which are cached
• Prevent interference between threads by careful scheduling
• Notice that a conventional cache’s contents are destroyed by a single block copy instruction
Processor Engineering Issues
• Power consumption must be minimised (to simplify on-chip and in-box cooling issues)
– use low-voltage processors (2V instead of 3.3V)
– don't over-clock the processor
– design logic carefully to avoid propagation of redundant signals
– tolerance of latency allows lower-performance (cheaper) subsystems to be used
– explicit subsystem control allows subsystems to be powered down when not in use
– eliminate redundant actions - e.g. speculative pre-fetching
– provide non-busy synchronisation to avoid the need for spin-locks
• Battery design is advancing slowly - power stored per unit weight or volume will quadruple (over NiCd) within 5-10 years
Processor Engineering Issues (cont'd)
• Speed to market is increasingly important, so processor design time is becoming critical. Consider the time for several common devices to become established:
– 70 years: telephone (0% to 60% of households)
– 40 years: cable television
– 20 years: personal computer
– 10 years: video recorders
– <10 years: web-based video
• Modularity and common processor cores provide design flexibility:
– reusable cache and CPU cores
– product-specific interfaces and co-processors
– common connection schemes
Interconnect Schemes
Wide data buses are a problem:
• They are difficult to route on printed circuit boards
• They require huge numbers of processor and memory pins (expensive to manufacture on chips and PCBs)
• Clocking must accommodate the slowest bus wire.
• Parallel back-planes add to loading and capacitance, slowing signals further and increasing power consumption
Serial chip interconnects offer 1Gbit/s performance using just a few pins and wires. Can we use a packet routing chip as a back-plane?
• Processors, memories, graphic devices, networks, slow external interfaces all joined to a central switch
Memory Devices
Regardless of the scale of the computer, the memory is similar.
Two major types:
• Static
• Dynamic
Larger memories get cheaper as production volume increases, and smaller memories get more expensive - you pay more for less!
See:
http://www.educypedia.be/computer/memoryram.htm
http://www.kingston.com/tools/umg/default.asp
http://www.ahinc.com/hhmemory.htm
Static Memories
• made from static logic elements - an array of flip-flops
• don’t lose their stored contents until clocked again
• may be driven as slowly as needed - useful for single stepping a processor
• Any location may be read or written independently
• Reading does not require a re-write afterwards
• Writing data does not require the row containing it to be pre-read
• No housekeeping actions are needed
• The address lines are usually all supplied at the same time
• Fast - 15ns was possible in Bipolar and 4-15ns in CMOS
Not used anymore – too much power for little gain in speed
HM6264 - 8K*8 static RAM organisation
[Block diagram: a 256x256 memory matrix with a row decoder (A0-A7) and a column decoder with column I/O (A8-A15); a timing pulse generator and read/write control driven by CS1, CS2, WE and OE; input data control; 8 data lines I/O0-I/O7; Vcc and Gnd]
HM6264 read cycle
Item                                        Symbol  min  max  Unit
Read Cycle Time                             tRC     100  -    ns
Address Access Time                         tAA     -    100  ns
Chip Selection to Output (CS1)              tCO1    -    100  ns
Chip Selection to Output (CS2)              tCO2    -    100  ns
Output Enable to Output Valid               tOE     -    50   ns
Chip Selection to Output in Low Z (CS1)     tLZ1    10   -    ns
Chip Selection to Output in Low Z (CS2)     tLZ2    10   -    ns
Output Enable to Output in Low Z            tOLZ    5    -    ns
Chip Deselection to Output in High Z (CS1)  tHZ1    0    35   ns
Chip Deselection to Output in High Z (CS2)  tHZ2    0    35   ns
Output Disable to Output in High Z          tOHZ    0    35   ns
Output Hold from Address Change             tOH     10   -    ns
[Timing diagram: HM6264 read cycle - Address, CS1, CS2, OE and Dout waveforms, showing tRC, tAA, tCO1/tCO2, tLZ1/tLZ2, tOE, tOLZ, tHZ1/tHZ2, tOHZ and tOH around the Data Valid window]
HM6264 write cycle
Item                                Symbol  min  max  Unit
Write Cycle Time                    tWC     100  -    ns
Chip Selection to End of Write      tCW     80   -    ns
Address Set-up Time                 tAS     0    -    ns
Address Valid to End of Write       tAW     80   -    ns
Write Pulse Width                   tWP     60   -    ns
Write Recovery Time (CS1, WE)       tWR1    5    -    ns
Write Recovery Time (CS2)           tWR2    15   -    ns
Write to Output in High Z           tWHZ    0    35   ns
Data to Write Time Overlap          tDW     40   -    ns
Data Hold from Write Time           tDH     0    -    ns
Output Enable to Output in High Z   tOHZ    0    35   ns
Output Active from End of Write     tOW     5    -    ns
[Timing diagram: HM6264 early-write cycle - Address, CS1, CS2, OE, WE, Din and Dout waveforms, showing tWC, tCW, tWR1/tWR2, tAW, tAS, tOHZ, tDW, tDH and tWP; the data is sampled by the memory on the rising edge of WE]
Dynamic Memories
• information is stored on a capacitor - which discharges with time
• only one transistor is required per cell - six for SRAM
• must be refreshed (0.1-0.01 pF needs a refresh every 2-8ms)
• memory cells are organised so that cells can be refreshed a row at a time, to minimise the time taken
• the row and column organisation lends itself to multiplexed row and column addresses - fewer pins on the chip
• RAS and CAS are used to latch the row and column addresses sequentially
• DRAM consumes high currents when switching transistors (1024 columns at a time), which can cause nasty voltage transients
HM50464 - 64K*4 dynamic RAM organisation
[Block diagram: multiplexed address inputs Ai are latched into X (row) and Y (column) address registers by the RAS and CAS clocks; X and Y decoders select cells in four 256x256 memory arrays; a refresh address counter supplies row addresses during refresh; the WE and OE clocks control the R/W switch and the input/output buffers on I/O1-4. Each dynamic memory cell is a single transistor, gated by a row-select line, storing charge on a capacitor connected to a bit line]
HM50464 read cycle
[Timing diagram: RAS falls with the row address on the Address lines, then CAS falls with the column address; WRITE stays high; valid output appears on IO while OE is low]
Read Cycle
A dynamic memory read operation proceeds as follows:
• the memory read cycle starts by setting all bit lines (columns) to a suitable sense voltage - pre-charging
• the required row address is applied and RAS (row address strobe) is asserted
• the selected row is decoded and opens its transistors (one per column); this dumps the capacitors' charge into high-gain feedback amplifiers which recharge the capacitors - RAS must remain low
• simultaneously the column address is applied and CAS set; the decoded, requested bits are gated to the output - and driven off-chip when OE is active
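As a software-visible caricature of this sequence, the sketch below toggles hypothetical RAS/CAS/OE control bits; the ports and bit assignments are invented for illustration and do not correspond to any real controller:

#include <stdint.h>

#define RAS  (1u << 0)   /* active low on a real device; 1 = asserted here */
#define CAS  (1u << 1)
#define OE   (1u << 2)

volatile uint8_t dram_ctrl;   /* illustrative control port        */
volatile uint8_t dram_addr;   /* multiplexed row/column address   */
volatile uint8_t dram_data;   /* 4-bit data appears here          */

uint8_t dram_read(uint8_t row, uint8_t col)
{
    dram_addr = row;                 /* apply row address...               */
    dram_ctrl |= RAS;                /* ...and assert RAS: row is sensed   */
    dram_addr = col;                 /* apply column address               */
    dram_ctrl |= CAS;                /* assert CAS: column is decoded      */
    dram_ctrl |= OE;                 /* gate the selected bits out         */
    uint8_t v = dram_data;           /* data valid after the access time   */
    dram_ctrl &= ~(RAS | CAS | OE);  /* end of cycle; precharge follows    */
    return v;
}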
HM50464 early write cycle
[Timing diagram: RAS falls with the row address, CAS falls with the column address; WRITE falls before CAS; valid input data is latched from IO]
Early Write Cycle
Similar to the read cycle, except that the falling edge of WRITE signals the time to latch input data.
During an “early write” cycle the WRITE line falls before CAS - this ensures that the memory device keeps its data outputs disabled (otherwise, when CAS goes low, they could output data!).
Alternatively, in a “late write” cycle the sequence is reversed and the OE line is kept high - this can be useful in common address/data bus architectures.
Refresh Cycle
For a refresh, no output is needed. A read with a valid RAS and row address pulls the data out; all we then need to do is put it back again by de-asserting RAS.
This needs to be repeated for all 256 rows (on the HM50464) every 4ms. There is an on-chip counter which can be used to generate refresh addresses.
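A quick check of the refresh budget implied by these figures (a sketch using the HM50464 numbers above):

#include <stdio.h>

int main(void)
{
    double refresh_period_ms = 4.0;   /* all rows every 4 ms (HM50464) */
    int rows = 256;
    /* One row must be refreshed every 4 ms / 256 = ~15.6 us. */
    printf("refresh one row every %.1f us\n",
           refresh_period_ms * 1000.0 / rows);
    return 0;
}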
Page Mode Access [“Fast Page Mode DRAM”] - standard DRAM
The RAS cycle time is relatively long, so optimisations have been made for common access patterns.
The row address is supplied just once and latched with RAS. Then column addresses are supplied and latched using CAS, and data is read or written using WRITE or OE. CAS and the column address can then be cycled to access bits in the same row. The cycle ends when RAS goes high again.
Care must be taken to continue to refresh the other rows of memory at the specified rate if needed.
[Timing diagram: page-mode access - a single RAS with the row address, then repeated CAS cycles, each with a new column address and a data transfer on IO]
Page Mode DRAM access - nibble and static column mode are similar
Nibble Mode
Rather than supplying the second and subsequent column addresses, they can be calculated by incrementing the initial address - the first column address is stored in a register when CAS goes low, then incremented and used on the next low CAS transition. Less common than page mode.
Static Column Mode
Column addresses are treated statically: while CAS is low, the outputs are read if OE is low as well. If the column address changes, the outputs change (after a propagation delay). The frequency of address changes can be higher as there is no need for an inactive CAS time.
Extended Data Out Mode (“EDO DRAM”)
[Timing diagram: EDO DRAM access - RAS with the row address, repeated CAS cycles with column addresses; data on IO remains valid under OE control while CAS cycles]
EDO DRAM is very similar to page-mode access, except that the data bus outputs are controlled exclusively by the OE line. CAS can therefore be taken high and low again without the data from the previous word being removed from the data bus - so data can be latched by the processor while a new column address is being latched by the memory. Overall cycle times can be shortened.
Synchronous DRAM (“SDRAM”)
Instead of asynchronous control signals, SDRAMs accept one command in each clock cycle. Different stages of the access are initiated by separate commands - initial row address, reading etc. - all pipelined, so that a read might not return a word for 2 or 3 cycles.
Bursts of accesses to sequential words within a row may be requested by issuing a burst-length command. Subsequent read or write requests then operate in units of the burst length.
[Timing diagram: simplified SDRAM burst read - commands on successive clock cycles: ACT (activate DRAM row, with bank and row address), NOPs, READ (from column number), PCHG (precharge), ACT; after a 3-cycle latency the 4-word burst D0-D3 appears on IO]
Summary - DRAMs
• a whole row of the memory array must be read
• after reading, the data must be re-written
• writing requires the data to be read first (the whole row has to be stored even if only a few bits are changed)
• cycle time is a lot slower than static RAM
• address lines are multiplexed - saves package pin count
• the fastest DRAM commonly available has an access time of ~60ns but a cycle time of 121ns
• DRAMs consume more current
• SDRAMs replace the asynchronous control mechanisms
Cycles required per word:
Memory Type      Word 1  Word 2  Word 3  Word 4
DRAM             5       5       5       5
Page-Mode DRAM   5       3       3       3
EDO DRAM         5       2       2       2
SDRAM            5       1       1       1
SRAM             2       1       1       1
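Reading the table as code, a small sketch of the total cost of a 4-word burst for each memory type:

#include <stdio.h>

int main(void)
{
    /* Cycles for words 1..4, taken from the table above. */
    struct { const char *type; int c[4]; } mem[] = {
        { "DRAM",           {5,5,5,5} },
        { "Page-Mode DRAM", {5,3,3,3} },
        { "EDO DRAM",       {5,2,2,2} },
        { "SDRAM",          {5,1,1,1} },
        { "SRAM",           {2,1,1,1} },
    };
    for (int i = 0; i < 5; i++) {
        int total = mem[i].c[0] + mem[i].c[1] + mem[i].c[2] + mem[i].c[3];
        printf("%-15s 4-word burst: %2d cycles\n", mem[i].type, total);
    }
    return 0;
}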
Memory Interfacing
Interfacing
Most processors rely on external memory
The unit of access is a word carried along the Data Bus
Ignoring caching and virtual memory, all memory belongs to a single address space.
Addresses are passed on the Address Bus
Hardware devices may respond to particular addresses - Memory Mapped devices
External memory is a collection of memory chips.
All memory devices are joined to the same data bus
Main purpose of the addressing logic is to ensure only one memory device is activated during each cycle
The Data Bus has n lines - n = 8,16,32 or 64
The Address Bus has m lines - m = 16,20,24, 32 or 64 providing 2m words of memory
The Address Bus is used at the beginning of a cycle and the Data Bus at the end
It is therefore possible to multiplex (in time) the two buses
This can create all sorts of timing complications - but the benefit of a reduced processor pin count makes it relatively common
Processor must tell memory subsystem what to do and when to do it
Can do this either synchronously or asynchronously
Synchronously:
• the processor defines the duration of a memory cycle
• provides control lines for the beginning and end of the cycle
• the most conventional approach
• the durations and relationships might be determined at boot time (available in the 1980s in the INMOS transputer)
Asynchronously:
• the processor starts the cycle; the memory signals the end of the cycle
• error recovery is needed - in case non-existent memory is accessed (Bus Error)
Synchronous memory scheme control signals:
• Memory System Active
– goes active when the processor is accessing external memory
– used to enable the address decoding logic, which provides one active chip select to a group of chips
• Read Memory
– says the processor is not driving the data bus
– the selected memory can return data on the data bus
– usually connected to the output enable (OE) of the memory
Synchronous memory scheme control signals (cont'd):
• Memory Write
– indicates the data bus contains data which the selected memory device should store
– different processors use the leading or trailing edge of the signal to latch data into memory
– processors with a data bus wider than 8 bits have a separate memory-write-byte signal for each byte of data
– memory write lines connect to the write lines of the memories
• Address Latch Enable (in multiplexed-address machines)
– tells the addressing logic when to take a copy of the address from the multiplexed bus, so the processor can use it for data later
• Memory Wait
– causes the processor to extend the memory cycle
– allows fast and slow memories to be used together without loss of speed
Address Blocks
How do we place blocks of memory within the address space of our processor?
Two methods of addressing memory:
• Byte addressing
– each byte has its own address
– good for 8-bit processors and graphics systems
– but what if memory is 16 or 32 bits wide?
• Word addressing
– only address lines which number individual words
– these select a multi-byte word
– extra byte-address bits are retained in the processor to manipulate individual bytes, or write-byte control signals are used
Often we want different blocks of memory:
• Particular addresses might be special:
– memory-mapped I/O ports
– the location executed first after a reset
– fast on-chip memory
– diagnostic or test locations
• We also want:
– SRAM and/or DRAM in one contiguous block
– memory-mapped graphics screen memory
– ROM for booting and low-level system operation
– extra locations for peripheral controller registers
• Each memory block might be built from individual memory chips:
– address and control lines wired in parallel
– data lines brought out separately to provide an n-bit word
• Fit all the blocks together in the overall address map:
– it is easier to place similar-sized blocks next to each other, so that they can be combined to produce a 2^(k+1)-word area
– jumbling blocks of various sizes complicates address decoding
– if contiguous blocks are not needed, place them at major power-of-2 boundaries - e.g. put the base of SRAM at 0, ROM half way up, and the lowest memory-mapped peripheral at 7/8ths
Address Decoding
The address decoding logic determines which memory device to enable, depending upon the address.
• If each memory area stores contiguous words in a 2^k block:
– all memory devices in that area will have k address lines
– connected (normally) to the k least-significant lines
– the remaining m-k lines are examined to see if they provide the most-significant (remaining) part of the address of each area
Three schemes are possible:
• Full decoding - unique decoding
– all m-k bits are compared with exact values to make up the full address of that block
– only one block can become active
Three schemes possible (cont'd):
• Partial decoding
– only decode some of the m-k lines, so that a number of blocks of addresses will cause a particular chip select to become active
– e.g. ignoring one line will mean the same memory device is accessible at two places in the memory map
– makes decoding simpler
• Non-unique decoding
– connect a different one of the m-k lines directly to the active-low chip select of each memory block
– a memory block can be activated by referencing that line
– no extra logic needed
– BUT two blocks can be accessed at once this way…
Address Decoding - Example
A processor has a 32-bit data bus. It also provides a separate 30-bit word addressed address bus, which is labelled A2 to A31 since it refers to memory initially using byte addressing, where it uses A0 and A1 as byte addressing bits. It is desired to connect 2 banks of SRAM (each built up from 128K*8 devices) and one bank of DRAM, built from 1M*4 devices, to this processor. The SRAM banks should start at the bottom of the address map, and the DRAM bank should be contiguous with the SRAM. Specify the address map and design the decoding logic.
Each bank of SRAMs will require 4 devices to make up the 32-bit data bus.
Each Bank of DRAMs will require 8 devices.
Address map (word addresses, 32-bit words):
0x00040000 - 0x0013FFFF : DRAM bank (8 x 1M*4 devices across 32 bits; 1M words, 20 address bits)
0x00020000 - 0x0003FFFF : SRAM bank 1 (4 x 128K*8 devices across 32 bits; 128K words, 17 address bits)
0x00000000 - 0x0001FFFF : SRAM bank 0 (4 x 128K*8 devices across 32 bits; 128K words, 17 address bits)
Address Decoding - Example (cont'd)
[Decoding logic: the CPU drives 17 address lines in parallel to all SRAM devices (8 data lines to each) and 20 address lines in parallel to all DRAM devices (4 data lines to each); the address decode logic generates CS1, CS2 and CS3]
CS1 connects to the chip select on SRAM bank 0; CS2 to SRAM bank 1; CS3 to the DRAM bank.
CS1 = /A19 · /A20 · /A21 · /A22
CS2 =  A19 · /A20 · /A21 · /A22
CS3 =  A20 + A21 + A22
(omitting all address lines A23 and above to simplify)
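The same decode expressed as a C sketch over the word-address lines (the helper and its bit numbering follow the slide's labelling, where A2 is the least-significant word-address line; the function itself is illustrative):

#include <stdint.h>
#include <stdbool.h>

/* Address line An: with word addressing, line An corresponds to
   word-address bit n-2. */
static bool line(uint32_t word_addr, int n)
{
    return (word_addr >> (n - 2)) & 1u;
}

void decode(uint32_t word_addr, bool *cs1, bool *cs2, bool *cs3)
{
    bool a19 = line(word_addr, 19), a20 = line(word_addr, 20);
    bool a21 = line(word_addr, 21), a22 = line(word_addr, 22);
    *cs1 = !a19 && !a20 && !a21 && !a22;  /* SRAM bank 0: 0x00000000-0x0001FFFF */
    *cs2 =  a19 && !a20 && !a21 && !a22;  /* SRAM bank 1: 0x00020000-0x0003FFFF */
    *cs3 =  a20 ||  a21 ||  a22;          /* DRAM bank (partial decode)         */
}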
Connecting Multiplexed Address and Data Buses
There are many multiplexing schemes but let’s choose 3 processor types and 2 memory types and look at the possible interconnections:
• Processor types, all 8-bit data and 16-bit address:
– no multiplexing (e.g. Zilog Z80)
– least-significant address bits multiplexed with the data bus (Intel 8085)
– most-significant and least-significant halves of the address bus multiplexed together
• Memory types:
– SRAM (8k *8) - no address multiplexing
– DRAM (16k*4) - with multiplexed address inputs
CPU vs Static Memory Configuration
[Diagram: CPU with a non-multiplexed address bus - A0…15 and D0…7 connect directly to an 8K*8 SRAM (A0…12, D0…7); the address decode logic drives CS]
[Diagram: CPU with LS addresses multiplexed with the data bus - A8…15 carry the MS address; AD0…7 are latched to provide the LS address to the 8K*8 SRAM, then reused as D0…7; address decode drives CS]
[Diagram: CPU with a time-multiplexed address bus - MA0…7 is latched in two stages to build A0…12 for the 8K*8 SRAM; D0…7 carries data; address decode drives CS]
CPU vs Dynamic Memory Configuration
[Diagram: CPU with a non-multiplexed address bus - A0…15 pass through an external multiplexer (MPX) to give MA0…6 to two 16K*4 DRAMs (providing D0…3 and D4…7); address decode generates RAS and CAS]
[Diagram: CPU with LS addresses multiplexed with the data bus - A8…15 and the latched AD0…7 feed the multiplexer producing MA0…6 for the two 16K*4 DRAMs; address decode generates RAS and CAS]
[Diagram: CPU with a time-multiplexed address bus - MA0…7 drives MA0…6 of the two 16K*4 DRAMs directly, matching their multiplexed row/column inputs; D0…7 carries data; address decode generates RAS and CAS]
Displays
Video Display Characteristics
• Consider a video display capable of producing 640*240 pixel monochrome, non-interlaced images at a frame rate of 50Hz:
[Diagram: displayed image of h x v pixels; add 20% for line flyback and 20% for frame flyback]
dot rate = (640*1.2)*(240*1.2)*50 Hz
= 11MHz
= 90 ns/pixel
For 1024*800 non-interlaced display:
dot rate = (1024*1.2)*(800*1.2)* 50 Hz
= 65MHz
= 15 ns/pixel
add colour with 64 levels for RGB - 18 bits per pixel
The bandwidth is now 1180Mbit/s…
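The dot-rate arithmetic generalises to a small helper (a sketch, assuming the 20% line and frame flyback factors used above):

#include <stdio.h>

/* Dot rate in Hz for a non-interlaced display with 20% line and
   20% frame flyback overhead. */
double dot_rate(int h_pixels, int v_lines, double frame_hz)
{
    return (h_pixels * 1.2) * (v_lines * 1.2) * frame_hz;
}

int main(void)
{
    printf("640x240:  %.1f MHz\n", dot_rate(640, 240, 50) / 1e6);   /* ~11 MHz */
    printf("1024x800: %.1f MHz\n", dot_rate(1024, 800, 50) / 1e6);
    return 0;
}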
Video Display Characteristics
• Problems with high bit rates:
– memory-mapping the screen display within the processor's address map couples the CPU and display tightly - they must be designed together
– for the screen to be refreshed at the rates required by video, the display must have higher priority than the processor for DMA on the memory bus - this uses much of the bandwidth
– in order to update the image, the CPU may require very fast access to the screen memory too
– the megabytes of memory needed for large screen displays are still relatively expensive - compared with the CPU etc.
Bit-Mapped Displays
• Even a 640*240 pixel display cannot easily be maintained using DMA access to the CPU's RAM - except with multiple-word access
• Increase memory bandwidth for the video display with special video DRAM
– allows a whole row of the DRAM (256 or 1024 bits) to be read in one DMA access
• Many video DRAMs may be mapped to provide a single bit of a multi-bit pixel in parallel - colour displays
Character-Based Displays
• Limited to displaying one of a small number of images in fixed positions
– typically 24 lines of 80 characters
– normally 8-bit ASCII
• The character value is used to determine the image from a look-up table
– the table is often in ROM (a RAM version allows font changes)
• For a character 9 dots wide by 14 high:
– 14 rows of pixels are generated for each row of characters
– to display a complete frame, pixels are drawn at a suitable dot rate:
dot rate = (80*9*1.2)*(24*14*1.2)*50 Hz
         = 17.28 MHz
         = 58ns/pixel
Character-Based Displays (cont'd)
• A row of 80 characters must be read for every displayed line
– giving a line rate of 20.16kHz (similar to the EGA standard)
– overall memory access rate needed ~1.6Mbytes/second (625ns/byte)
– barely supportable using DMA on small computers
– even at 4 bytes at a time (32-bit machines), still a major use of the data bus
• To avoid re-reading each line of 80 characters for the other 13 pixel rows, the characters can be stored in a circular shift register on first access and used instead of a memory access
– only need 80*24*50 accesses/sec - in bursts
– 167µs per byte - easily supported
– the whole 80 bytes can be read during flyback, before the start of a new character row, at full memory speed in one DMA burst - 80 * about 200ns, at a rate of 24*50 times a second - less than 2% of the bus bandwidth
Character-Based Displays (cont'd)
• Assuming that rows of 80 characters in the CPU's memory map are stored at 128-byte boundaries (this simplifies addressing), the CPU memory address splits as:
[n-12 bits: address of screen memory (address decode)] [5 bits: row, 0…23] [7 bits: column, 0…79]
• The address of a character position on the screen is:
[5 bits: row, 0…23] [4 bits: line number in row, 0…13] [7 bits: column, 0…79] [4 bits: dot number across character, 0…8]
with carries rippling between the dot, column and line fields; the memory address of the current bit in the shift register is formed via the look-up table.
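With rows aligned on 128-byte boundaries, the byte address of a character can be formed without a multiply (a sketch; `base` is a hypothetical screen-memory base address):

#include <stdint.h>

/* Byte address of the character at (row 0..23, column 0..79),
   given rows aligned on 128-byte boundaries: the row field becomes
   a simple 7-bit shift instead of a multiply by 80. */
uint32_t char_addr(uint32_t base, unsigned row, unsigned column)
{
    return base + (row << 7) + column;
}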
Character-Based Displays (cont'd)
• An appropriate block diagram of the display would be:
[Block diagram: screen address (r,c) → screen memory → ASCII bytes (8 bits) into an 80*8-bit FIFO → character generator ROM, (16*256)*9 bits, indexed by the character code and the 4-bit line number within the character → 9-bit row of dots into a 9-to-1-bit shift register clocked at the dot rate → video data out]
Character-Based Displays (cont'd)
• The problem with DMA fetching individual characters from display memory is its interference with the processor.
• The alternative is to use Dual-Port Memories.
Dual-Port SRAMs
• provide 2 (or more) separate data and address pathways to each memory cell
• 100% of the memory bandwidth can be used by the display without affecting the CPU
• can be expensive - ~£25 for 4kbytes - which makes megabyte displays impractical, but for a character-based display it would be OK
[Diagram: dual-port SRAM - one memory array with two independent ports, each with its own address decode (row, col from A0…An), data lines D0…Dn, and Write, CE and OE controls]
Bit-Mapped Graphics & Memory Interleaving
Bit-Mapped Displays
• Instead of using an intermediate character generator, all pixel information can be stored in screen memory at the pixel rates above.
• Even a 640*240 pixel display cannot be maintained using DMA access to the CPU's RAM - except with multiple-word access
• Increase memory bandwidth for the video display with special video DRAM
– allows a whole row of the DRAM (256 or 1024 bits) to be read in one DMA access
• Many video DRAMs may be mapped to provide a single bit of a multi-bit pixel in parallel - colour displays.
• Use of the video shift register limits the clocking frequency to 25MHz - 40ns/pixel
A Graphics Card consists of:
• GPU - Graphics Processing Unit
– a microprocessor optimised for 3D graphics rendering
– clock rate 250-850MHz with pipelining - converts 3D images of vertices and lines into a 2D pixel image
• Video BIOS - the program to operate the card, interface timings etc.
• Video Memory - can use the computer's RAM, but more often has its own VideoRAM (128MB-2GB) - often multiport VRAM, now DDR (double data rate - uses the rising and falling edges of the clock)
• RAMDAC - Random Access Memory Digital-to-Analogue Converter, driving the CRT
Using Video DRAMs
• To generate analogue signals for a colour display:
– 3 fast DAC devices are needed
– each fed from 6 or 8 bits of data
– one each for the red, green and blue video inputs
• To save storing so much data per pixel (24 bits), a Colour Look-Up Table (CLUT) device can be used:
– uses a small RAM as a look-up table
– e.g. a 256-entry table accessed by the 8-bit value stored for each pixel - the table contains 18 or 24 bits used to drive the DACs
– hence “256 colours may be displayed from a palette of 262144”
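In software terms the CLUT is just an indexed table expansion; a sketch assuming the 18-bit entries are packed as three 6-bit fields:

#include <stdint.h>

static uint32_t clut[256];   /* 256 entries, each 18 bits: 6R | 6G | 6B */

/* Expand an 8-bit pixel value into the three 6-bit DAC inputs. */
void clut_lookup(uint8_t pixel, uint8_t *r, uint8_t *g, uint8_t *b)
{
    uint32_t e = clut[pixel];
    *r = (e >> 12) & 0x3F;
    *g = (e >> 6)  & 0x3F;
    *b =  e        & 0x3F;
}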
[Diagram: CLUT - 8-bit pixel data addresses a 256-entry x 18-bit RAM (with Din from the CPU for updating the table); three 6-bit fields of Dout drive the red, green and blue DAC outputs]
Using Video DRAMs - Addressing Considerations
– If the number of bits in the shift registers is not the same as the number of displayed pixels, it is easier to ignore the extra ones - wasting memory may make addressing simpler
– If the processor's screen memory is bigger than the displayable memory, this gives a scrollable virtual window. The address splits as:
[remaining bits: screen block select (not all combinations used)] [log2 v bits: row address, 0…(v-1)] [log2 h bits: column address, 0…(h-1)]
– Even though most 32-bit processors can access individual bytes (used as pixels), this is not as efficient as accessing memory in word (32-bit) units
Addressing Considerations (cont'd)
– Sometimes it might be better NOT to arrange the displayed pixels in ascending memory address order. Three possible organisations of a 32-bit word:
• Each word defines one bit of 32 horizontally neighbouring pixels. 8 words (in 8 separate colour planes) need to be changed to completely change any pixel. Useful for adding or moving blocks of solid colour - CAD.
• Each word defines 2 pixels horizontally and 2 vertically, with all colour data. Useful for text or graphics applications where small rectangular blocks are modified - might access fewer words for changes.
• Each word defines 4 horizontally neighbouring pixels, each fully specifying its colour - the most simple and common representation.
Addressing Considerations (cont'd)
• The video memories must now be arranged so that the bits within the CPU's 32-bit words can all be read or written to their relevant locations in video memory in parallel
– this is done by making sure that the pixels stored in each neighbouring 32-bit word are stored in different memory chips - interleaving
Example
Design a 1024*512 pixel colour display capable of passing 8 bits per pixel to a CLUT. Use a video frame rate of 60Hz and video DRAMs with a shift-register maximum clocking frequency of 25MHz. Produce a solution that supports a processor with an 8-bit data bus.
Example
• 1024 pixels across the screen can be satisfied using one 1024-bit shift register (or 4 multiplexed 256-bit ones)
• The frame rate is 60Hz
• The number of lines displayed is 512
• The line rate becomes 60*512 = 30.72kHz - or 32.55µs/line
• 1024 pixels per line gives a dot rate of 30.72kHz * 1024 = 31.46MHz
• The dot time is thus 32ns - too fast for one shift register! So we will have to interleave 2 or more.
• Multiplexing the minimum of 3 shift registers would make addressing complicated; it is easier to use 4 VRAMs - each with 256 rows of 256 columns, each addressed row/column intersection containing 4 bits, interfaced by 4 pins to the processor and to 4 separate shift registers
Example - hence, for the 8-bit CPU:
CPU memory address (byte address):
[n-20 bits: screen block select] [9 bits: 512 rows, 0…511] [10 bits: 1024 columns, 0…1023]
Video address (pixel counters):
[1 bit: top/bottom multiplexer select (which VRAM pair: A+B, C+D vs E+F, G+H)] [8 bits: to the RAS address inputs] [8 bits: implicit address of bits in the cascaded shift registers] [1 bit: odd/even pixel multiplexer]
Example (cont'd)
[Block diagram: eight 256*256*4 VRAMs (A-H), organised in pairs covering the top/bottom 256 lines of the screen and the odd/even pixels (8 bits of each pixel per pair); top/bottom multiplexers and an odd/even pixel multiplexer combine the 4-bit outputs into 8 bits of all pixels (interleaved), which feed the CLUT driving the RGB outputs]
Mass Memory Concepts
Disk technology:
• essentially unchanged for 50 years
• similar for CD, DVD
• 1-12 platters, double sided, 3600-10000rpm
• circular tracks, subdivided into sectors
• recording density >3Gb/cm²
• innermost tracks not used - they cannot be used efficiently
• inner tracks are a factor of 2 shorter than outer tracks
• hence more sectors in outer tracks
• cylinder - the set of tracks with the same diameter on all recording surfaces
Access Time
• Seek time
– align the head with the cylinder containing the track with the sector inside
• Rotational latency
– time for the disk to rotate to the beginning of the sector
• Data transfer time
– time for the sector to pass under the head
Disk Capacity = surfaces x tracks/surface x sectors/track x bytes/sector
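Applying the capacity formula and the half-rotation rule (a sketch, using the Hitachi column of the table that follows):

#include <stdio.h>

int main(void)
{
    /* Capacity = surfaces x tracks/surface x sectors/track x bytes/sector */
    long long surfaces = 4, tracks = 33067, sectors = 591, bytes = 512;
    long long capacity = surfaces * tracks * sectors * bytes;

    /* Average rotational latency = time for half a rotation. */
    double rpm = 4200;
    double avg_rot_ms = 0.5 * 60000.0 / rpm;

    printf("capacity = %.1f GB\n", capacity / 1e9);          /* ~40 GB  */
    printf("avg. rotational latency = %.2f ms\n", avg_rot_ms);
    return 0;
}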
Key Attributes of Example Discs

                                   Seagate       Hitachi    IBM
Identity of disc (series)          Barracuda     DK23DA     Microdrive
Model number                       ST1181677LW   ATA-5 40   DSCM-11000
Typical application                Desktop       Laptop     Pocket device
Storage attributes
  Formatted capacity, GB           180           40         1
  Recording surfaces               24            4          2
  Cylinders                        24,247        33,067     7,167
  Sector size, B                   512           512        512
  Avg sectors/track                604           591        140
  Max recording density, Gb/cm²    2.4           5.1        2.4
Access attributes
  Min seek time, ms                1             3          1
  Max seek time, ms                17            25         19
  External data rate, MB/s         160           100        13
Physical attributes
  Diameter, inches                 3.5           2.5        1
  Platters                         12            2          1
  Rotation speed, rpm              7,200         4,200      3,600
  Weight, kg                       1.04          0.10       0.04
  Operating power, W               14.1          2.3        0.8
  Idle power, W                    10.3          0.7        0.5
Samsung launch a 1TB hard drive:
• 3 x 3.5” platters, 334GB per platter
• 7200rpm, 32MB cache
• 3Gb/s SATA interface (SATA - Serial Advanced Technology Attachment)
The highest density so far…
Disk Organization
Data bits are small regions of magnetic coating, magnetised in different directions to give 0 or 1.
Special encoding techniques maximise the storage density:
e.g. rather than letting the data bit values dictate the direction of magnetisation, we can magnetise based on the change of bit value - non-return-to-zero (NRZ) - which allows a doubling of recording capacity.
Disk Organization (cont'd)
• Each sector is preceded by a sector number and followed by a cyclic redundancy check, allowing some errors and anomalies to be corrected
• Various gaps within and separating sectors allow processing to finish
• The unit of transfer is a sector - typically 512 bytes to 2K bytes
• A sector address consists of 3 components:
– Disk address (17-31 bits) = Cylinder# (10-16 bits), Track# (1-5 bits), Sector# (6-10 bits)
– Cylinder# - positions the actuator arm
– Track# - selects the read/write head, i.e. the surface
– Sector# - compared with the sector numbers recorded as they pass
• Sectors are independent and can be arranged in any logical order
• Each sector needs some time to be processed - some sectors may pass before the disk is ready to read again, so logical sectors are not stored sequentially as physical sectors:
track i 0 16 32 48 1 17 33 49 2 18 34 50 3 19 35 51 4 20 36 52 …..
track i+1 30 46 62 15 31 47 0 16 32 48 1 17 33 49 2 18 34 50 3 19….
track i+2 60 13 29 45 61 14 30 46 62 15 31 47 0 16 32 48 1 17 33 49…..
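The staggered layouts above follow from skewing logical sectors by a fixed interleave factor; a sketch that reproduces the pattern of track i (the factor of 4 and 64 sectors/track are inferred from that row, not stated in the notes):

#include <stdio.h>

int main(void)
{
    enum { SECTORS = 64, INTERLEAVE = 4 };
    int track[SECTORS];
    /* Place logical sector l at physical slot (l * INTERLEAVE) mod SECTORS,
       bumping the start each time the sequence wraps onto itself. */
    for (int l = 0; l < SECTORS; l++) {
        int p = (l * INTERLEAVE) % SECTORS + l / (SECTORS / INTERLEAVE);
        track[p] = l;
    }
    for (int p = 0; p < SECTORS; p++)
        printf("%d ", track[p]);     /* prints 0 16 32 48 1 17 33 49 ... */
    printf("\n");
    return 0;
}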
Disk Performance
Disk Access Latency = Seek Time + Rotational Latency
• Seek time - depends on how far the head travels from the current cylinder
– mechanical motion - the arm accelerates and brakes
• Rotational latency - depends upon the rotational position
– average rotational latency = the time for half a rotation
– at 10,000 rpm this is 3ms
RAID - Redundant Array of Inexpensive (Independent) Disks
• High capacity and faster response without speciality hardware
RAID0 - multiple disks appear as a single disk, each holding part of a single item striped across many disks
RAID1 - robustness added by mirroring contents on duplicate disks - 100% redundancy
RAID2 - robustness using error-correcting codes, reducing the redundancy - Hamming codes - ~50% redundancy
RAID3 - robustness using separate parity and spare disks, reducing the redundancy to 25%
RAID4 - parity/checksum applied to sectors instead of bytes - makes heavy use of the parity disk
RAID5 - parity/checksum distributed across the disks - but 2 simultaneous disk failures can still cause data loss
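The parity mechanism behind RAID3-5 is a byte-wise XOR across the data blocks of a stripe; a minimal sketch of computing parity and rebuilding a single lost block:

#include <stdint.h>
#include <stddef.h>

enum { BLOCK = 512 };

/* Parity block = XOR of all data blocks in the stripe. */
void make_parity(uint8_t parity[BLOCK], uint8_t data[][BLOCK], size_t ndisks)
{
    for (size_t i = 0; i < BLOCK; i++) {
        uint8_t p = 0;
        for (size_t d = 0; d < ndisks; d++)
            p ^= data[d][i];
        parity[i] = p;
    }
}

/* A single failed block is the XOR of the parity and the survivors -
   which is why a second simultaneous failure loses data. */
void rebuild(uint8_t lost[BLOCK], const uint8_t parity[BLOCK],
             uint8_t survivors[][BLOCK], size_t nsurvivors)
{
    for (size_t i = 0; i < BLOCK; i++) {
        uint8_t v = parity[i];
        for (size_t d = 0; d < nsurvivors; d++)
            v ^= survivors[d][i];
        lost[i] = v;
    }
}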
RAID6 - parity/checksum distributed across the disks, plus a second checksum scheme (P+Q) distributed across different disks
Virtual Memory
In order to take advantage of the various performance and prices of different types of memory devices, it is normal for a memory hierarchy to be used:
• CPU registers - the fastest data storage medium
• cache - for increased speed of access to DRAM
• main RAM - normally DRAM for cost reasons; SRAM possible
• disc - magnetic, random access
• magnetic tape - serial access for archiving; cheap
• How and where do we find memory that is not RAM?
• How does a job maintain a consistent user image when there are many others swapping resources between memory devices?
• How can all users pretend they have access to similar memory addresses?
Paging
In a paged virtual memory system the virtual address is treated as groups of bits which correspond to the Page number and offset or displacement within the page
– often denoted as (P,D) pair.
• Page number can be looked up in a page table and concatenated with the offset to give the real address.
• There is normally a separate page table for each virtual machine, each pointing to pages in the same real memory.
• There are two methods used for page table lookup
– direct mapping
– associative mapping
Direct Mapping
• uses a page table with the same number of entries as there are pages of virtual memory
• it is thus possible to look up the entry corresponding to the virtual page number to find
– the real address of the page (if the page is currently resident in real memory)
– or the address of that page on the backing store if not
• this may not be economic for large mainframes with many users
• a large page table is expensive to keep in RAM - and may itself be paged…
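A minimal sketch of the (P,D) split and a directly mapped table lookup (the page size and table width are illustrative assumptions):

#include <stdint.h>
#include <stdbool.h>

enum { PAGE_BITS = 12 };                 /* 4K pages: D is the low 12 bits */

typedef struct {
    bool     resident;                   /* page currently in real memory? */
    uint32_t frame;                      /* real page frame if resident    */
} PageEntry;

static PageEntry page_table[1u << 20];   /* one entry per virtual page     */

/* Translate a virtual address; returns false on a page fault. */
bool translate(uint32_t vaddr, uint32_t *real)
{
    uint32_t p = vaddr >> PAGE_BITS;                 /* page number P      */
    uint32_t d = vaddr & ((1u << PAGE_BITS) - 1);    /* displacement D     */
    if (!page_table[p].resident)
        return false;                    /* fetch the page from backing store */
    *real = (page_table[p].frame << PAGE_BITS) | d;  /* concatenate        */
    return true;
}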
Content Addressable Memories
• when an ordinary memory is given an address, it returns the data word stored at that location
• a content addressable memory is instead supplied with data rather than an address
• it looks through all its storage cells to find a location which matches the pattern, and returns which cell contained the data - there may be more than one
Content Addressable Memories (cont'd)
• it is possible to perform a translation operation using a content addressable memory
• an output value is stored together with each cell used for matching
• when a match is made, the signal from the match is used to enable the register containing the output value
• care needs to be taken so that only one output becomes active at any time
Associative Mapping
• associative mapping uses a content addressable memory to find whether the page number exists in the page table
• if it does, the rest of the entry contains the real memory address of the start of the page
• if not, the page is currently in backing store and needs to be found from a directly mapped page table on disc
• the associative memory only needs to contain the same number of entries as the number of pages of real memory - much smaller than the directly mapped table
• A combination of direct and associative mapping is often used.
Paging
• Paging is viable because programs tend to consist of loops and functions which are called repeatedly from the same area of memory. Data tends to be stored in sequential areas of memory and is likely to be used frequently once brought into main memory.
• Some memory accesses will be unexpected, unrepeated and so wasteful of page resources.
• It is easy to produce a program which mis-uses virtual memory, provoking frantic paging as it accesses memory over a wide area.
• When RAM is full, the pager cannot just read virtual pages from backing store to RAM; it must first discard old ones to the backing store.
• There are a number of algorithms that can be used to decide which pages to discard:
– Random replacement - easy to implement, but takes no account of usage
– FIFO replacement - a simple cyclic queue; similar to the above
– First-In-Not-Used-First-Out - a FIFO queue enhanced with extra bits which are set when a page is accessed and reset when the entry is tested cyclically (see the sketch below)
– Least Recently Used - uses a set of counters so that accesses can be logged
– Working Set - all pages used in the last x accesses are flagged as the working set. All other pages are discarded to leave memory partially empty, ready for further paging
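First-In-Not-Used-First-Out is essentially the classic “clock” algorithm; a minimal sketch of the victim-selection loop (the frame count is an arbitrary assumption):

#include <stdbool.h>

enum { FRAMES = 64 };

static bool used[FRAMES];    /* set whenever the page in a frame is accessed */
static int  hand = 0;        /* cyclic test position                         */

/* Pick a frame to evict: skip (and clear) recently used frames. */
int choose_victim(void)
{
    for (;;) {
        if (!used[hand]) {
            int victim = hand;
            hand = (hand + 1) % FRAMES;
            return victim;
        }
        used[hand] = false;          /* reset the bit: second chance */
        hand = (hand + 1) % FRAMES;
    }
}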
Paging - general points
• Every process requires its own page table - so that it can make an independent translation to the location of the actual page
• Memory fragmentation under paging can be serious
– as pages are a set size, usage will not fill a complete page, and the last page of a set will not normally be full
– especially if the page size is large to optimise disc usage (reducing the number of head movements)
• Extra bits can be stored in the page table with the real address - e.g. a dirty bit - to determine whether the page has been written to since it was copied, and hence whether it needs to be copied back
Segmentation
• A virtual address in a segmented system is made from 2 parts:
– segment number
– displacement within the segment - (S,D) pairs
• Unlike pages, segments are not fixed length - they may be variable
• Segments store complete entities - pages allow objects to be split
• Each task has its own segment table
• The segment table contains the base address and length of each segment, so that other segments aren't corrupted
• Segmentation doesn't give rise to fragmentation in the same way: segments are of variable size, so no space within a segment is wasted.
• BUT, being of variable size, they are not very easy to fit into memory.
• Keep a sorted table of vacant blocks of memory, and combine neighbouring blocks when possible.
• We can also keep information on the “type” of a segment - read-only, executable etc. - since segments correspond to complete entities.
Segmentation & Paging
• A combination of segmentation and paging uses a triplet of virtual address fields - the segment number, the page number within the segment and the displacement within the page: (S,P,D)
• More efficient than pure paging - the use of space is more flexible
• More efficient than pure segmentation - it allows part of a segment to be swapped
• It is easy to mis-use virtual memory through a simple difference in the way that routines are coded. The two examples below perform exactly the same task, but the left-hand one generates 1,000 page faults on a machine with 1K-word pages, while the one on the right generates 1,000,000. Most languages (except Fortran) store arrays in memory with the rows laid out sequentially, the right-hand subscript varying most rapidly…
void order_rows(void)
{
    static int array[1000][1000];   /* static: 4MB is too big for the stack */
    int ii, jj;
    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[ii][jj] = 0;      /* row-major: sequential access */
        }
    }
}

void order_columns(void)
{
    static int array[1000][1000];
    int ii, jj;
    for (ii = 0; ii < 1000; ii++) {
        for (jj = 0; jj < 1000; jj++) {
            array[jj][ii] = 0;      /* column order: a new page on each access */
        }
    }
}
Memory Caches
• Most general purpose processor systems use DRAM for their bulk RAM requirements because it is cheap and more dense than SRAM
• The penalty for this is that it is slower - SRAM has a 3-4 times shorter cycle time
• To help, some SRAM can be added:
– on-chip, directly to the CPU, for use as desired - its use depends on the compiler; not always easy to use efficiently, but fast access
– caching - between DRAM and CPU. Built from small, fast SRAM, holding copies of certain parts of the main memory. The method used to decide what to cache determines the performance.
– a combination of the two - on-chip cache.
Directly mapped cache - the simplest form of memory cache.
• The real memory address is treated in three parts:
[block select: tag (t bits)] [cache index (c bits)]
• For a cache of 2^c words, the cache index section of the real memory address indicates which cache entry is able to store data from that address
• When cached, the tag (the most significant bits of the address) is stored in the cache with the data, to indicate which page it came from
• The cache will store 2^c words from 2^t pages
• In operation the tag is compared in every memory cycle:
– if the tag matches, a cache hit is achieved and the cache data is passed
– otherwise a cache miss occurs; the DRAM supplies the word, and the data and its tag are stored in the cache
t bits c bitsTag Index
Tags Data MainMemory
Cache Memory compareUse Cache orMain Memory
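A minimal software sketch of the lookup just described, assuming an 8-bit index and an invented dram_read helper:

#include <stdint.h>
#include <stdbool.h>

#define C_BITS 8                        /* cache index width: 2^8 = 256 entries */
#define CACHE_ENTRIES (1u << C_BITS)

struct line { bool valid; uint32_t tag; uint32_t data; };
static struct line cache[CACHE_ENTRIES];

extern uint32_t dram_read(uint32_t addr);   /* assumed slow main-memory access */

/* Direct-mapped lookup: the index selects the only entry that can hold
 * the address; the stored tag says which "page" of memory it came from. */
uint32_t cached_read(uint32_t addr)
{
    uint32_t index = addr & (CACHE_ENTRIES - 1);  /* low c bits */
    uint32_t tag   = addr >> C_BITS;              /* remaining t bits */
    struct line *l = &cache[index];

    if (l->valid && l->tag == tag)
        return l->data;                 /* cache hit */

    l->data  = dram_read(addr);         /* cache miss: fill from DRAM */
    l->tag   = tag;
    l->valid = true;
    return l->data;
}

Note that two addresses sharing the same index always evict each other - the weakness that set associativity (next slide) addresses.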
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 110
Memory Caches
Set Associative Caches
• A 2-way cache contains 2 cache blocks per index, each capable of storing one word and the appropriate tag
• For any memory access the two stored tags are checked
• Requires associative memory with 2 entries for each of the 2^c cache lines
• Similarly, a 4-way cache stores 4 cache entries for each cache index
[Figure: as the direct-mapped cache, but the index selects a tag/data pair in each of the two ways; both tags are compared and the appropriate cache way, or main memory, is used.]
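The same sketch extended to 2 ways; the set count, the alternating victim choice and the helper names are assumptions for illustration:

#include <stdint.h>
#include <stdbool.h>

#define SET_BITS 7
#define SETS (1u << SET_BITS)

struct way { bool valid; uint32_t tag; uint32_t data; };
static struct way cache[SETS][2];     /* 2 ways per cache index */
static bool next_victim[SETS];        /* trivial replacement: alternate ways */

extern uint32_t dram_read(uint32_t addr);

uint32_t read_2way(uint32_t addr)
{
    uint32_t index = addr & (SETS - 1);
    uint32_t tag   = addr >> SET_BITS;

    for (int w = 0; w < 2; w++)       /* hardware checks both tags in parallel */
        if (cache[index][w].valid && cache[index][w].tag == tag)
            return cache[index][w].data;          /* hit in one of the ways */

    int v = next_victim[index];       /* miss: fill one way */
    next_victim[index] = !v;
    cache[index][v] = (struct way){ true, tag, dram_read(addr) };
    return cache[index][v].data;
}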
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 111
Memory Caches
Fully Associative Caches
• A 2-way cache has two places which it must read and compare to look for a tag
• This idea is extended to the size of the cache memory:
– so that any main memory word can be cached at any location in the cache
• The cache has no index (c = 0) and contains longer tags and data
– notice that as c (the index length) decreases, t (the tag length) must increase to match
• all tags are compared on each memory access
• to be fast, all tags must be compared in parallel
The whole address is the block-select tag (t bits); there is no cache index (c = 0).
The INMOS T9000 had such a cache on chip.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 112
Memory Caches
Degree of Set Associativity
• for any chosen size of cache, there is a choice between more associativity or a larger index field width
• the optimum can depend on workload and instruction decoding - best assessed by simulation
In practice:
An 8-kbyte (2k-entry) direct-mapped cache will produce a hit rate of about 73%; a 32-kbyte one achieves 86%, and a 128-kbyte 2-way cache 89%
(all these figures depend on characteristics of the instruction set and code executed, data used, etc. - these are for the Intel 80386)
• considering the greater complexity of the 2-way cache, there doesn’t seem to be a great advantage in applying it
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 113
Memory Caches
Cache Line Size
• Possible to have cache data entries wider than a single word -
– i.e. a line size > 1
• Then a real memory access causes 2, 4 etc. words to be read
– reading performed over n-word data bus
– or from page mode DRAM, capable of transferring multiple words from same row in DRAM, by supplying extra column addresses
– extra words are stored in the cache in an extended data area
– as most code (and data access) occurs sequentially, it is likely that the next word will come in useful…
– real memory address specifies which word in the line it wants
block select tag (t bits) cache index (c bits) line address (l bits)
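A sketch of the three-field decode, assuming (hypothetically) a 4-word line (l = 2) and an 8-bit index:

#include <stdint.h>

#define L_BITS 2    /* 2^2 = 4 words per cache line */
#define C_BITS 8    /* 2^8 cache lines */

/* Which word of the line the address wants */
static inline uint32_t line_word(uint32_t a)   { return a & ((1u << L_BITS) - 1); }

/* Which cache line can hold this address */
static inline uint32_t cache_index(uint32_t a) { return (a >> L_BITS) & ((1u << C_BITS) - 1); }

/* Block-select tag stored alongside the line */
static inline uint32_t tag_bits(uint32_t a)    { return a >> (L_BITS + C_BITS); }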
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 114
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 115
Memory Caches
Writing Cached Memory
So far we have only really been concerned with reading the cache. But there is also the problem of keeping cache and main memory consistent when writing:
Unbuffered Write Through
• write the data to the relevant cache entry and update the tag, and also write the data to its location in main memory - speed determined by main memory
Buffered Write Through
• Data (and address) are written to a FIFO buffer between CPU and main memory; the CPU continues with the next access while the FIFO buffer writes to DRAM
• The CPU can continue to write at cache speeds until the FIFO is full, then slows down to DRAM speed as the FIFO empties
• If the CPU wants to read from DRAM (instead of cache) the FIFO must first be emptied to ensure we have the correct data - this can introduce a long delay
• This delay can be shortened if the FIFO has only one entry - a simple latch buffer
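A minimal sketch of the buffered behaviour, with an invented 4-entry FIFO and a hypothetical dram_write; the CPU side only stalls when the FIFO is full, and a read that must go to DRAM would first call drain_all:

#include <stdint.h>

#define DEPTH 4
struct pending { uint32_t addr, data; };
static struct pending fifo[DEPTH];
static unsigned head, tail, count;

extern void dram_write(uint32_t addr, uint32_t data);   /* slow */

/* Memory side: retire one queued write to DRAM. */
static void drain_one(void)
{
    if (count == 0) return;
    dram_write(fifo[head].addr, fifo[head].data);
    head = (head + 1) % DEPTH;
    count--;
}

/* CPU side: runs at cache speed until the FIFO fills, then at DRAM speed. */
void buffered_write(uint32_t addr, uint32_t data)
{
    while (count == DEPTH)
        drain_one();                        /* full: stall */
    fifo[tail] = (struct pending){ addr, data };
    tail = (tail + 1) % DEPTH;
    count++;
}

/* Before any read that bypasses the cache, the FIFO must first be drained. */
void drain_all(void) { while (count) drain_one(); }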
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 116
Memory Caches
[Figure: 4-Mword memory using an 8-kword direct-mapped cache with write-through writes. The micro-processor’s 32-bit data bus (D0-31) and 22-bit word-address bus (A0-21) connect through to the main DRAM memory, with tag storage and comparison, the data cache memory and control logic alongside; a 32-bit data FIFO and 22-bit address FIFO between CPU and DRAM are optional, for buffered write-through. Address split: DRAM select (8 bits), tag (9 bits), index (13 bits), byte address (2 bits).]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 117
Memory Caches
Writing Cached Memory (cont’d)
Deferred Write (Copy Back)
• data is written out to the cache only, allowing the cached entry to be different from main memory. If the cache system wants to over-write a cache index with a different tag, it looks to see if the current entry has been changed since it was copied in. If so, it writes the changed value back to main memory before reading the new data into the cache location.
• More logic is required for this operation, but the performance gain can be considerable as it allows the CPU to work at cache speed if it stays within the same block of memory. Other techniques will slow down to DRAM speed eventually.
• Adding a buffer to this allows CPU to write to cache before data is actually copied back to DRAM
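A sketch of the deferred-write logic with a dirty bit, reusing the invented direct-mapped layout from earlier:

#include <stdint.h>
#include <stdbool.h>

#define IDX_BITS 8
#define LINES (1u << IDX_BITS)

struct cline { bool valid, dirty; uint32_t tag, data; };
static struct cline cache[LINES];

extern void dram_write(uint32_t addr, uint32_t data);

void copyback_write(uint32_t addr, uint32_t data)
{
    uint32_t index = addr & (LINES - 1);
    uint32_t tag   = addr >> IDX_BITS;
    struct cline *l = &cache[index];

    /* Evicting a changed entry with a different tag: write the old
     * value back to main memory before overwriting it. */
    if (l->valid && l->dirty && l->tag != tag)
        dram_write((l->tag << IDX_BITS) | index, l->data);

    l->tag   = tag;
    l->data  = data;
    l->valid = true;
    l->dirty = true;     /* cache now differs from DRAM until copied back */
}

As long as the CPU stays within the same block of memory, every write hits the cache and DRAM is never touched.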
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 118
Memory Caches
[Figure: 4-Mword memory using an 8-kword direct-mapped cache with copy-back writes. As the write-through arrangement, but each cache entry carries a dirty bit, and a 32-bit latch between cache and DRAM buffers the data being copied back. Address split: DRAM select (8 bits), tag (9 bits), index (13 bits), byte address (2 bits).]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 119
Memory Caches
Cache Replacement Policies for non direct-mapped caches
• when the CPU accesses a location which is not already in the cache, we need to decide which existing entry to evict (sending it back to main memory if necessary)
• needs to be a quick decision
• Possible schemes are:
– Random replacement - a very simple scheme where a frequently changing binary counter is used to supply a cache set number for rejection
– First-In-First-Out - a counter is incremented every time a new entry is brought into the cache, and is used to point to the next slot to be filled
– Least Recently Used - a good strategy as it keeps often-used values in cache, but difficult to implement with a few gates in short times
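Sketches of the two cheap victim-selection schemes; the set count and function names are invented:

/* Random replacement: a free-running counter picks the victim way. */
static unsigned free_running;
unsigned victim_random(unsigned ways)
{
    return free_running++ % ways;
}

/* FIFO replacement: one counter per set points at the next slot to fill. */
#define SETS 256
static unsigned fill_ptr[SETS];
unsigned victim_fifo(unsigned set, unsigned ways)
{
    unsigned v = fill_ptr[set];
    fill_ptr[set] = (v + 1) % ways;
    return v;
}

Both reduce to a counter and a little wiring, which is why they win over LRU when the decision must be made in a fraction of a cycle.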
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 120
Memory Caches
Cache Consistency
A problem occurs when DMA is used by other devices or processors.
• The simple solution is to attach the cache to the memory and make all devices operate through it.
• This is not the best idea, as a DMA transfer will cause all cache entries to be overwritten, even though the data is unlikely to be needed again soon
• If the cache is placed on the CPU side of the DMA traffic, then the cache might not mirror the DRAM contents
Bus Watching - monitor accesses to the DRAM, and invalidate the relevant cache tag entry if that DRAM location has been updated; the cache can then be kept on the CPU side.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 121
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 122
Instruction Sets
Introduction
Instruction streams control all activity in the processor. All characteristics of the machine depend on the design of the instruction set:
– ease of programming
– code space efficiency
– performance
Look at a few different instruction sets:
– Zilog Z80
– DEC Vax-11
– Intel family
– INMOS Transputer
– Fairchild Clipper
– Berkeley RISC-I
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 123
Instruction Sets
General Requirements of an Instruction Set
There are a number of conflicting requirements on an instruction set:
• Space Efficiency - control information should be compact
– instructions form the major part of all data moved between memory and CPU
– compactness is obtained by careful design of the instruction set
• variable-length coding can be used so that frequently used instructions are encoded into fewer bits
• Code Efficiency - a task can only be translated efficiently if it is easy to pick the needed instructions from the set
– various attempts at optimising instruction sets resulted in:
• CISC - a rich set of long instructions - results in a small number of translated instructions
• RISC - very short instructions, combined at compile time to produce the same result
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 124
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Ease of Compilation - in some environments compilation is a more frequent activity than on machines where demanding executables predominate. Both want execution efficiency however.
– more time consuming to produce efficient code for CISC - more difficult to map program to wide range of complex instructions
– RISC simplifies compilation
– Ease of compilation doesn’t guarantee better code…..
– Orthogonality of the instruction set also affects code generation:
• regular structure
• no special cases
• thus all actions (add, multiply etc.) are able to work with each addressing mode (immediate, absolute, indirect, register)
• If not, the compiler may have to treat different items differently - constants, arrays and variables
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 125
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Ease of Programming
– still times when humans work directly at machine code level;
• compiler code generators
• performance optimisation
– in these cases there are advantages to regular, fixed-length instructions with few side effects and maximum orthogonality
• Backward Compatibility
– many manufacturers produce upgraded versions which allow code written for an earlier CPU to run without change
– Good for public relations - if not compatible, customers could rewrite for a competitor’s CPU instead!
– But it can make the instruction set a mess - deficiencies added to rather than replaced - 8086 - 80286 - 80386 - 80486 - Pentium
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 126
Instruction Sets
General Requirements of an Instruction Set (cont’d)
• Addressing Modes & Number of Addresses per Instruction
– Huge range of addressing modes can be provided - specifying operands from 1 bit to several 32bit words.
– These modes may themselves need to include absolute addresses, index registers, etc. of various lengths.
– Instruction sets can be designed which primarily use 0, 1, 2 or 3 operand addresses just to compound the problem.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 127
Instruction Sets
Important Instruction Set Features:
• Operand Storage in the CPU
– where are operands kept other than in memory?
• Number of operands named per instruction
– How many operands are named explicitly per instruction?
• Operand Location
– can any ALU operand be located in memory or must some or all of the operands be held in the CPU?
• Operations
– What types of operations are provided in the instruction set?
• Type and size of operands
– What is the size and type of each operand and how is it specified?
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 128
Instruction Sets
Three Classes of Machine:
• Stack-based Machines (zero-address machines)
Advantages: Simple model of expression evaluation; short instructions can give dense code
Disadvantages: The stack cannot be randomly accessed, making efficient code generation difficult; the stack can be a hardware bottleneck
• Accumulator-based Machines (one-address machines)
Advantages: Minimises internal state of the machine; short instructions
Disadvantages: Since the accumulator provides the only temporary storage, memory traffic is high
• Register-based Machines (multi-address machines)
Advantages: Most general model
Disadvantages: All operands must be named, leading to long instructions
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 129
Instruction Sets
Register Machines
• Register to Register
Advantages: Simple, fixed-length instruction encoding; simple model for code generation; most compact; instructions access operands in similar time
Disadvantages: Higher instruction count than in architectures with memory references in instructions; some short instruction codings may waste instruction space
• Register to Memory
Advantages: Data can be accessed without loading first; instruction format is easy to encode and dense
Disadvantages: Operands are not symmetric, since one operand (in the register) is destroyed; the number of registers is fixed by the instruction coding; operand fetch speed depends on location (register or memory)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 130
Instruction Sets
Register Machines (cont’d)
• Memory to Memory
Advantages: Simple (fixed length?) instruction encoding; does not waste registers for temporary storage
Disadvantages: Large variation in instruction size - especially as the number of operands is increased; large variation in operand fetch speed; memory accesses create a memory bottleneck
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 131
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 132
Instruction Sets
Addressing Modes
Register         Add R4, R3          R4 = R4 + R3                    When a value is in a register
Immediate        Add R4, #3          R4 = R4 + 3                     For constants
Indirect         Add R4, (R1)        R4 = R4 + M[R1]                 Access via a pointer
Displacement     Add R4, 100(R1)     R4 = R4 + M[100+R1]             Access local variables
Indexed          Add R3, (R1+R2)     R3 = R3 + M[R1+R2]              Array access (base + index)
Direct           Add R1, (1001)      R1 = R1 + M[1001]               Access static data
Memory Indirect  Add R1, @(R3)       R1 = R1 + M[M[R3]]              Double indirect - pointers
Autoincrement    Add R1, (R2)+       R1 = R1 + M[R2]; R2 = R2 + d    Step through arrays (d is the word length)
Autodecrement    Add R1, -(R2)       R2 = R2 - d; R1 = R1 + M[R2]    Can also be used for stacks
Scaled           Add R1, 100(R2)[R3] R1 = R1 + M[100 + R2 + R3*d]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 133
Instruction Sets
Instruction Formats
Number of addresses (operands):
4: | operation | 1st operand | 2nd operand | result | next address |
3: | operation | 1st operand | 2nd operand | result |
2: | operation | 1st operand & result | 2nd operand |
1: | operation | 2nd operand |   (an implied register holds the 1st operand & result)
0: | operation |
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 134
Instruction Sets
Example Programs and Simulations (used in simulations by Hennessy & Patterson)
gcc    the gcc compiler (written in C), compiling a large number of C source files
TeX    the TeX text formatter (written in C), formatting a set of computer manuals
SPICE  the SPICE electronic circuit simulator (written in FORTRAN), simulating a digital shift register
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 135
Instruction Sets
Simulations on Instruction Sets from Hennessy & Patterson
The following tables are extracted from 4 graphs in Hennessy & Patterson’s “Computer Architecture: A Quantitative Approach”
Use of Memory Addressing Modes (%)
Addressing Mode   TeX  Spice  gcc
Memory Indirect     1      6    1   lists
Scaled              0     16    6   arrays
Indirect           24      3   11   pointers
Immediate          43     17   39   constants
Displacement       32     55   40   local variables
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 136
Instruction Sets
Simulations on Instruction Sets (cont’d)
Number of bits needed for a Displacement Operand Value
Percentage of displacement operands using this number of bits:
bits:   0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
TeX    17  1  2  8  5 17 16  9  0  0  0  0  0  5  2 22
Spice   4  1 13  9  1  3  3  6  6  5 14 16  5 11  0 12
gcc    27  0  0  5  5 15 14  6  5  1  2  1  0  4  1 12
How local are the local variables?
< 8 bits: 71% for TeX; 37% for Spice; 79% for gcc
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 137
Instruction Sets
Simulations on Instruction Sets (cont’d)
Percentage of Operations using Immediate Operands
Operation       TeX  Spice  gcc
Loads            38     26   23
Compares         83     92   84
ALU Operations   52     49   69
The Distribution of Immediate Operand Sizes
Number of bits needed for an Immediate Value:
Program   0   4   8  12  16  20  24  28  32
TeX       3  44   3   2  16  23   2   1   0
Spice     0  12  36  16  14  10  12   0   0
gcc       1  50  21   3   2  19   0   0   1
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 138
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 139
Instruction Sets
The Zilog Z80
• 8 bit microprocessor derived from the Intel 8080
• has a small register set ( 8 bit accumulator + 6 other registers)
• Instructions are either register based or register and one memory address - single address machine
• Enhanced 8080 with relative jumps and bit manipulation
• 8080 instruction set (8bit opcodes) -
– unused gaps filled in with extra instructions.
– even more needed so some codes cause next byte to be interpreted as another set of opcodes….
• Typical of early register-based microprocessor
• Let down by lack of orthogonality - inconsistencies in the instructions, e.g.:
– can load a register from an address in a single register
– but the accumulator can only be loaded via an address in a register pair
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 140
Instruction Sets
The Zilog Z80 (cont’d)
• Separate PC, SP and 2 index registers
• Addressing modes:
– Immediate (1- or 2-byte operands)
– Relative (one-byte displacement)
– Absolute (2-byte address)
– Indexed (M[index reg + 8-bit disp])
– Register (specified in opcode itself)
– Implied (e.g. references accumulator)
– Indirect (via HL, DE or BC register pairs)
• Instruction Types
– Load & Exchange - 64 opcodes used just for register-register copying
– Block Copy
– Arithmetic, rotate & shift - mainly 8 bit; some simple 16-bit operations
– Jump, call & return - uses condition code from previous instruction
– Input & Output - single byte; block I/O
8 improvements over 8080
1) Enhanced Instruction set – index registers & instructions
2) Two sets of registers for fast context switching
3) Block Move
4) Bit manipulation
5) Built in DRAM refresh address counter
6) Single 5V power supply
7) Fewer extra support chips needed
8) Very good price…
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 141
Instruction Sets
Intel 8086 Family
• 8086 announced in 1978 - not used in a PC until 1987 (the slower 8088 from 1981)
– 16-bit processor and data paths
– 20-bit base addressing mode
• 80186 upgrade: small extensions
• 80286 - used in the PC/AT in 1984 (6 times faster than the 8088 - 20MHz)
– Memory mapping & protection added
• Support for VM through segmentation
• 4 levels of protection - to keep applications away from the OS
– 24-bit addressing (16Mb) - segment table has a 24-bit base field & 16-bit size field
(protected mode was only switchable back by processor reset until the 386!)
• 80386 - 1986 - 40MHz
– 32-bit registers and addressing (4Gb)
– Incorporates “virtual 8086” mode rather than direct hardware support
– Paging (4kbytes) and segmentation (up to 4Gb) - allows UNIX implementation
– general-purpose register usage
– Incorporates 6 parallel stages (concurrent fetch/prefetch and execute):
• Bus Interface Unit - I/O and memory
• Code Prefetch Unit
• Instruction Decode Unit
• Execution Unit
• Segment Unit - logical address to linear address translation
• Paging Unit - linear address to physical address translation
– Includes a cache for up to the 32 most recently used pages
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 142
Instruction Sets
Intel 8086 Family
• i486 - 1988 - 100MHz (not “80486” - a court ruled you “can’t trademark a number”)
– more performance:
• added caching (8kb) to the memory system
• integrated floating-point processor on board
• expanded decode and execute into 5 pipelined stages
• Pentium - 1994 - 150-750MHz (10,000 times the speed of the 8088)
– added a second pipeline to give superscalar performance
– Now separate code (8k) and data (8k) caches
– Added branch prediction, with an on-chip branch table for loops
– Pages now 4Mb as well as 4kb
– Internal paths 128 and 256 bits, external still 32 bits
– Dual-processor support added
• Pentium Pro
– Instruction decode now 3 parallel units
– Breaks up code into “micro-ops”
– Micro-ops can be executed in any order using 5 parallel execution units: 2 integer, 2 floating point and 1 memory
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 143
Instruction Sets
Intel 8086 Registers (initially 16 bit)
Data
AX  used for general arithmetic; AH and AL used in byte arithmetic
BX  general-purpose register; used as an address base register
CX  general-purpose register; used specifically in string, shift & loop instructions
DX  general-purpose register; used in multiply, divide and I/O instructions
Address
SP  Stack Pointer
BP  base register - for base-addressing mode
SI  index; string source base register
DI  index; string destination base register
Registers can be used in 32-bit mode when in 80386 mode
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 144
Instruction Sets
Intel 8086 Registers (initially 16 bit)
Segment Base Registers - shifted left 4 bits and added to the address specified in the instruction (this causes segments to overlap!!! - changed in the 80286):
CS  start address of code accesses
SS  start address of Stack Segment
ES  extra segment (for string destinations)
DS  data segment - used for all other accesses
Control Registers
IP     Instruction Pointer (LS 16 bits of PC)
Flags  6 condition code bits plus 3 processor status control bits
Addressing Modes
A wide range of addressing modes is supported. Many modes can only be accessed via specific registers, e.g.:
Register Indirect    BX, SI, DI
Base + displacement  BP, BX, SI, DI
Indexed              address is the sum of 2 registers - BX+SI, BX+DI, BP+SI, BP+DI
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 145
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 146
Instruction Sets
The DEC Vax-11
The Vax-11 family was compatible with the PDP-11 range - it had 2 separate processor modes: “Native” (VAX) and “Compatibility” modes
• The VAX had 16 32-bit general-purpose registers, including the PC, SP and a frame pointer
• All data and address paths were 32 bits wide - a 4Gb address space
• A full range of data types was directly supported by hardware - 8-, 16-, 32- and 64-bit integers, 32- and 64-bit floating point, 32-digit BCD numbers, character strings etc.
• A very full selection of addressing modes was available
• Used instructions made up from 8-bit bytes which specified:
– the operation
– the data type
– the number of operands
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 147
Instruction Sets
The DEC Vax-11
• Special opcodes FD and FF introduce even more opcodes in a second byte.
• Only the number of addresses is encoded into the opcode itself - the addresses of the operands are encoded in one or more succeeding bytes.
So the operation:
ADDL3 #1, R0, @#12345678(R2)
or “Add 1 to the longword in R0 and store the result in a memory location addressed at an offset of the number of longwords stored in R2 from the absolute address 12345678 (hex)”
is stored as 9 bytes:
– the ADDL3 opcode (193)
– a literal (immediate) constant: 1
– register mode - register 0
– index prefix - register 2
– “absolute address follows”, for indexing
– the 4-byte absolute address #12345678, stored low byte first (the VAX was little-endian)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 148
Instruction Sets
The INMOS Transputer
• The transputer is a microprocessor designed to operate with other transputers in parallel embedded systems
• The T800 was exceptionally powerful when introduced in 1986
• The T9000 - a more powerful pipelined version - followed in 1994
• The UK Government sold INMOS to EMI
• EMI decided to concentrate on music
• SGS-Thomson bought what was left
• Japanese companies used transputer technology in printers/scanners
• Then sold on to ST Microelectronics
• Now abandoned
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 149
Instruction Sets
The INMOS Transputer
• Designed for synchronised communications applications
• Suitable for coupling into a multiprocessing configuration allowing a single program to be spread over all machines to perform task co-operatively.
• Has 4kbytes internal RAM - not cache, but a section of main memory map for programmer/compiler to utilise.
• Compact instruction set
– most popular instructions in shortest opcodes - to minimise bandwidth
– operate in conjunction with a 3 word execution stack - a zero addressing strategy
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 150
Instruction Sets
The INMOS Transputer
The processor evaluates the following high-level expression:
x = a + b + c;
where x, a, b and c represent integer variables.
There is no need to specify which processor registers receive the variables: the processor is just told to load them - pushing them on the stack - and add them. When an operation is performed, the two values at the top are popped and combined, and the result is pushed back:
            ;stack contents ([ ] = empty)
load a      ;[a]
load b      ;[b a]
load c      ;[c b a]
add         ;[c+b a]
add         ;[c+b+a]
store x     ;[ ]
• this removes the need for extra bits in the instruction to specify which register is accessed, so instructions can be packed into smaller words - 80% of instructions are only 1 byte long - resulting in a tighter fit in memory and less time spent fetching instructions
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 151
Instruction Sets
The INMOS Transputer
• Has 6 registers:
– 3 make up the register stack
– a program counter (called the instruction pointer by Inmos)
– a stack pointer (called a workspace pointer by Inmos)
– and an operand register
• The stack is interfaced by the first of the 3 registers (A, B, C):
– “push”ing a value into A will cause A’s value to be pushed to B and B’s value to C
– “pop”ping a value from A will cause B’s value to be popped to A and C’s value to B
• The operand register (OR) is the focal point for instruction processing:
– the 4 upper bits of a transputer instruction contain the operation
– 16 possible operations
– the 4 lower bits contain the operand - this can be enlarged to 32 bits by using “prefix” instructions
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 152
Instruction Sets
The INMOS Transputer
• The 16 instructions include jump, call, memory load/store and add. Three of the 16 elementary instructions are used to enlarge the two 4-bit fields (opcode or operand) in conjunction with the OR, as follows:
– the “prefix” instruction adds its operand data into the OR (4 bits) and shifts the OR 4 bits to the left
– allowing numbers (up to 32 bits) to be built up in the OR
– a “negative prefix” instruction adds its operand into the OR and then inverts all the bits in the OR before shifting 4 bits to the left - this allows 2’s complement negative values to be built up, e.g.:
Mnemonic   Code  Memory
ldc #3     #4    #43
ldc #35          #2345   (coded as: pfix #3 [#2] = #23, ldc #5 [#4] = #45)
ldc #987         #292847 (coded as: pfix #9 = #29, pfix #8 = #28, ldc #7 = #47)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 153
Instruction Sets
The INMOS Transputer
Mnemonic                 Code  Memory
ldc -31 (ldc #FFFFFFE1)        #6141  (coded as: nfix #1 [#6] = #61, ldc #1 [#4] = #41)
This last example shows the advantage of the 2’s complement negative prefix: otherwise we would have to load all of the leading Fs, taking 5 additional operations….
• An additional “operate” instruction allows the OR to be treated as an extended opcode - up to 32 bits. Such instructions cannot have an operand, as the OR is used for the instruction, so they are all zero-address instructions.
• We thus have 16 one-address instructions and potentially lots of zero-address instructions.
Mnemonic   Code  Memory
add #5           #F5    (coded as: opr #5 [#F] = #F5)
ladd #16         #21F6  (coded as: pfix #1 [#2] = #21, opr #6 [#F] = #F6)
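The prefixing rule can be expressed as a short recursive encoder. This sketch uses the 4-bit codes quoted in the examples above (pfix = #2, nfix = #6, ldc = #4); the emit function itself is invented for illustration:

#include <stdio.h>
#include <stdint.h>

enum { PFIX = 0x2, NFIX = 0x6, LDC = 0x4 };  /* 4-bit opcodes from the examples */

/* Emit the pfix/nfix bytes needed to build 'v' in the operand register,
 * followed by instruction 'op' carrying the low 4 bits of v. */
void emit(uint8_t op, int32_t v)
{
    if (v >= 16)
        emit(PFIX, v >> 4);         /* build the upper nibbles first */
    else if (v < 0)
        emit(NFIX, ~v >> 4);        /* negative: invert before the shift */
    printf("#%02X ", (op << 4) | (v & 0xF));   /* one byte: opcode | operand */
}

int main(void)
{
    emit(LDC, 0x3);    /* prints #43 */
    emit(LDC, 0x35);   /* prints #23 #45 */
    emit(LDC, 0x987);  /* prints #29 #28 #47 */
    emit(LDC, -31);    /* prints #61 #41 */
    return 0;
}

Running it reproduces the byte sequences in the tables above, including the two-byte encoding of -31.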
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 154
Instruction Sets
The INMOS Transputer
• No dedicated data registers. The transputer does not have dedicated registers, but a stack of registers, which allows for an implicit selection of the registers. The net result is a smaller instruction format.
• Reduced Instruction Set design. The transputer adopts the RISC philosophy and supports a small set of instructions executed in a few cycles each.
• Multitasking supported in microcode. The actions necessary for the transputer to swap from one task to another are executed at the hardware level, freeing the system programmer of this task, and resulting in fast swap operations.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 155
Instruction Sets
The Fairchild (now Intergraph) Clipper
• Had sixteen 32-bit general-purpose registers for the user and another 16 for operating-system functions
– Separated interrupt activity and eliminated time taken to save register information during an ISR
• Tightly coupled to a Floating Point Unit
• Had 101 RISC like instructions
– 16 bits long
– made up from an 8-bit opcode and two 4-bit register fields
– some instructions can carry 4 bits of immediate data
– the 16 bit instructions could be executed in extremely fast cycles
– also had 67 macro instructions - made up from multiples of simpler instructions using a microprogramming technique - these incorporated many more complex addressing modes as well as operations which took several clock cycles
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 156
Intergraph were a leading workstation producer for CAD in transport, building and local government - products built using Intel chips.
1987 - Intergraph buys the Advanced Processor Division of Fairchild from National Semiconductor
1989-92 - Patents for the Clipper transferred to Intergraph
1996 - Intergraph find that Intel are infringing their patents on cache addressing, memory and consistency between cache and memory, write-through & copy-back modes for virtual addressing, bus snooping etc.
- Intergraph ask Intel to pay for patent rights
- Intel refuse
- Intel then cut off Intergraph from advanced information about Intel chips - without that information Intergraph could not design new products well
- Intergraph go from #1 to #5
1997 - Intergraph sue Intel - lots of legal stuff for the next 3 years - court rules Intel not licensed to use Clipper technology in the Pentium
2002 - Intel pays Intergraph $300M for a license plus $150M damages for infringement of PIC (Parallel Instruction Computing) technology - the core of the Itanium chip for high-end servers
A tale of Intel
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 157
The Federal Trade Commission cite Intel in 2 other similar cases:
1997 - Digital sue Intel, saying it copied DEC technology to make the Pentium Pro.
In retaliation Intel cut off DEC from Intel pre-release material.
Shortly after this, DEC get bought out by Compaq.
1994 - Compaq sue Packard Bell for violating patents for a Compaq chip set.
Packard Bell say the chip set was made by Intel.
Intel cut off Compaq from advanced information…..
A tale of Intel
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 158
Instruction Sets
The Fairchild (now Intergraph) Clipper
An example of a Harvard Architecture - having separate internal instruction and data buses (and associated caches)
[Figure: the integer CPU and FPU sit on separate internal instruction and data buses, each passing through its own cache/memory management unit to the off-carrier memory bus.]
The Clipper is made up from 3 chips mounted on a ceramic carrier. The Harvard architecture enables the caches to be optimised to the different characteristics of the instruction and data streams.
Microchip’s PIC microcontrollers also use a Harvard architecture.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 159
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 160
Instruction Sets
The Berkeley RISC-I Research Processor
A research project at UC Berkeley 1980-83 set out to build:
• a “pure” RISC structure
• highly suited to executing compiled high-level language programs
– procedural blocks, local & global variables
The team examined the frequency of execution of different types of instructions in various C and Pascal programs.
The RISC-I has had a strong influence on the design of the Sun SPARC architecture (the Stanford MIPS (Microprocessor without Interlocked Pipeline Stages) architecture influenced the MIPS R2000).
The RISC-I was a register based machine. The registers, data and addresses were all 32 bits wide.
Had a total of 138 registers.
All instructions, except memory LOADs and STOREs, operated on 1,2 or 3 registers.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 161
Instruction Sets
The Berkeley RISC-I Research Processor
When running, a program had available a total of 32 general-purpose registers:
• 10 (R0-R9) are global
• the remaining 22 were split into 3 groups:
– low, local and high - 6, 10 and 6 registers respectively
• When a program calls a procedure:
– the first 6 parameters are stored in the program’s low registers
– a new register window is formed
– these 6 low registers are relabelled as the high 6 in a new block of 22
– this is the register space for the new procedure while it runs
– the running procedure can keep 10 of its local variables in registers
– it can call further procedures using its own low registers
– it can nest calls to a depth of 8 (thus using all 138 registers)
– on return from a procedure the results are in its high registers and appear in the calling procedure’s low registers
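One plausible way to express the window mapping in software; the physical layout and circular wraparound are assumptions, though the arithmetic (10 globals + 8 windows sliding by 16 = 138 registers) follows from the slide:

/* Map a visible register number (0-31) to one of the 138 physical
 * registers, for a procedure at window 'cwp' (0-7).
 * Assumed layout: R0-R9 global; R10-R31 form the 22-register window,
 * with the caller's low 6 overlapping the callee's high 6, so each
 * call slides the window by 22 - 6 = 16 physical registers. */
#define GLOBALS  10
#define SLIDE    16
#define WINDOWED (8 * SLIDE)    /* 128 windowed + 10 global = 138 */

unsigned phys_reg(unsigned cwp, unsigned r)
{
    if (r < GLOBALS)
        return r;                                   /* global registers */
    return GLOBALS + (cwp * SLIDE + (r - GLOBALS)) % WINDOWED;
}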
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 162
Instruction Sets
The Berkeley RISC-I Research Processor
Process A calls process B, which calls process C:
[Figure: the register bank (registers 90-137 shown): the windows for A, B and C each comprise high, local and low groups; each window’s low registers overlap the next window’s high registers, and the global registers are shared by all three.]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 163
Instruction Sets
The Berkeley RISC-I Research Processor
RISC-I Short Immediate Instruction Format:
| Op-Code (7) | SCC (1) | DEST (5) | S1 (5) | IMM (1) | S2 (13) |
RISC-I Long Immediate Instruction Format:
| Op-Code (7) | SCC (1) | DEST (5) | IMM19 (19) |
DEST is the register number for all operations except conditional branches, when it specifies the condition
S1 is the number of the first source register, and S2 the second if bit 13 is high - a 2’s complement immediate value otherwise
SCC is a set-condition-code bit which causes the status word register to be updated
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 164
Instruction Sets
The Berkeley RISC-I Research Processor
The Op-Code (7 bits) can be one of 4 types of instruction:
• Arithmetic
– where RDEST = RS1 OP S2, and OP is a math, logical or shift operation
• Memory Access
– where LOADs take the form RDEST = MEM[RS1 + S2]
– and STOREs take the form MEM[RS1 + S2] = RDEST
• Note that RDEST is really the source register in this case
• Control Transfer
– where various branches may be made relative to the current PC (PC + IMM) or relative to RS1 using the short form (RS1 + S2)
• Miscellaneous
– all the rest. Includes “load immediate high” - uses the long immediate format to load 19 bits into the MS part of a register - can be followed by a short-format load immediate to the other 13 bits - 32 in all
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 165
Instruction Sets
RISC Principles
Not just a machine with a small set of instructions - the set must also have been optimised and minimised to improve processor performance.
Many processors in the 60s and 70s were developed with a microcode engine at the heart of the processor - easier to design (CAD and formal proof did not exist) and easy to add extra instructions, or change them.
Most CISC programs spend most of their time in a small number of instructions.
If the time taken to decode all instructions can be reduced by having fewer of them, then more time can be spent on making the less frequent instructions.
Various other features become necessary to make this work:
• One clock cycle per instruction
CISC machines typically take a variable number of cycles:
– reading in variable numbers of instruction bytes
– executing microcode
Time wasted waiting for these to complete is regained if all instructions operate in the same period.
For this to happen a number of other features are required.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 166
Instruction Sets
RISC Principles
• Hard-wired Controller, Fixed-Format Instructions
– Single-cycle operation is only possible if instructions can be decoded fast and executed straight away
– Fast (old-fashioned?) hard-wired instruction sequencers are needed - microcode can be too slow
– As designing these controllers is hard, it is even more important to have few instructions
– this can be simplified by making all instructions share a common format:
• number of bytes, positions of op-code etc.
• the smaller the better - provided that each instruction contains the needed information
– It is typical for only 10% of the logic of a RISC chip to be used for the controller function, compared with 50-60% of a CISC chip like the 68020
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 167
Instruction Sets
RISC Principles
• Larger Register Set
– It is necessary to minimise data movement to and from the processor
– The larger the number of registers, the easier this is to do
– Enables rapid supply of data to the ALU etc. as needed
– Many RISC machines have upward of 32 registers, and over 100 is not uncommon
– There are problems with saving the state of this many registers
– Some machines have “windows” of sets of registers, so that a complete set can be switched by a single reference change
• Memory Access Instructions
– This one type of instruction cannot be speeded up as much as the others
– Use indexed addressing (via a processor register) to avoid having to supply (long) absolute addresses in the instruction
– A Harvard architecture attempts to keep program instructions and data apart by having 2 data and address buses
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 168
Instruction Sets
RISC Principles
• Minimal pipelining, wide data bus
– CISC machines use pipelining to improve the delivery of instructions to the execution unit
– it is possible to read ahead in the instruction stream and so decode one instruction whilst executing the previous one and retrieving another
– Complications with jump or branch instructions can make pipelining unattractive, as they invalidate the backed-up instructions and new instructions have to ripple their way through
– RISC designers often prefer a large memory cache, so that data can be read, decoded and executed in a single cycle independent of main memory
– Regardless of pipelining, fetching program instructions fast is vital to RISC, and a wide data bus is essential to ensure this - the same goes for CISC
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 169
Instruction Sets
RISC Principles
• Compiler Effort
– A CISC compiler has to spend a lot of effort matching high-level language fragments to the many different machine instructions - even more so when the addressing modes are not orthogonal
– RISC compilers have a much easier job in that respect - fewer choices
– They do, however, build up longer sequences of their small instructions to achieve the same effect
– The main complication of compiling for RISC is that of optimising register usage
– Data must be maintained on-chip when possible - but it is difficult to assign an importance to a variable:
• a variable accessed in a loop can be used many times while one outside may be used only once - yet both appear in the code once...
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 170
Instruction Sets
Convergence of RISC and CISC
Many of the principles developed for RISC machine optimisation have been fed back into CISC machines (Intergraph and Intel…). This is tending to bring the two styles of machine back together:
• Large caches on the memory interface - reduce the effects of memory usage
• CISC machines are getting an increasing number of registers
• More orthogonal instruction sets are making compiler implementation easier
• Many of the techniques described above may be applied to the microprogram controller inside a conventional CISC machine
• This suggests that the microprogram will take on a more RISC-like form, with fixed formats and fields, applying orthogonally over the registers etc.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 171
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 172
Pipelined Parallelism in Instruction Processing
General Principles
Pipelined processing involves splitting a task into several sequential parts and processing each in parallel with separate execution units:
• for one-off tasks, little advantage, but
• for repetitive tasks, substantial gains can be made
Pipelining can be applied to many fields of computing, such as:
• large-scale multi-processor distributed processing
• arithmetic processing using vector hardware to pipe individual vector elements through a single high-speed arithmetic unit
• multi-stage arithmetic pipelines
• layered protocol processing
• as well as instruction execution within a processor
Overall, the task must be able to be broken into smaller sub-tasks which can be chained together - all sub-tasks taking the same time to execute.
Choosing the best sub-division of tasks is called load balancing.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 173
Pipelined Parallelism in Instruction Processing
General Principles
A single instruction still takes as long, and each instruction still has to be performed in the same order. Speed-up occurs when all stages are kept in operation at the same time; start-up and ending become less efficient.
[Figure: non-pipelined processing - stages 1, 2 and 3 of each task execute strictly one after another; pipelined processing - stage 1 of the next task starts as soon as stage 1 of the previous task finishes, so all stages run concurrently.]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 174
Pipelined Parallelism in Instruction Processing
General Principles
Two clocking schemes can be incorporated in pipelining:
Synchronous
Operates using a global clock which indicates when each stage of the pipeline should pass its result to the next stage.
The clock must run at the rate of the slowest possible element in the pipeline when given the most time-consuming data.
To de-couple the stages, they are separated by staging latches.
[Figure: a task flows through stage 1, latch, stage 2, latch, stage 3 to the results; a common clock drives the latches.]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 175
Pipelined Parallelism in Instruction Processing
General Principles
Asynchronous
In this case the stages of the pipeline run independently of each other. Two stages synchronise when a result has to pass from one to the other.
A little more complicated to design than synchronous, but with the benefit that stages can run in the time actually needed rather than the maximum time.
Use of a FIFO buffer instead of a latch between stages can allow queuing of results for each stage.
[Figure: as the synchronous pipeline, but each latch is controlled by ready/acknowledge handshake signals between neighbouring stages.]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 176
Pipelined Parallelism in Instruction Processing
Pipelining for Instruction Processing
Processing a stream of instructions can be performed in a pipeline.
Individual instructions can be executed in a number of distinct phases:
Fetch               Read instruction from memory
Decode instruction  Inspect instruction - how many operands, how and where will it be executed
Address generate    Calculate addresses of registers and memory locations to be accessed
Load operand        Read operands stored in memory - might read register operands or set up pathways between registers and functional units
Execute             Drive the ALU, shifter, FPU and other components
Store operand       Store the result of the previous stage
Update PC           PC must be updated for the next fetch operation
No processor would implement all of these as separate stages. The most common split is Fetch and Execute.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 177
Pipelined Parallelism in Instruction Processing
Overlapping Fetch & Execute Phases
Fetch involves memory activity (slow) and can be overlapped with Decode and Execute.
In RISC only 2 instructions access memory - LOAD and STORE - the remainder operate on registers, so for most instructions only Fetch needs the memory bus.
On starting the processor, the Fetch unit gets an instruction from memory.
At the end of the cycle the instruction just read is passed to the Execute unit.
While the Execute unit is performing the operation, Fetch is getting the next instruction (provided Execute doesn’t need to use the memory as well).
This and other contention can be resolved by:
• Extending the clashing cycle to give time for both memory accesses to take place - hesitation - requires the synchronous clock to be delayed
• Providing multi-port access to main memory (or cache) so that accesses can happen in parallel - memory interleaving may help
• Widening the data bus so that 2 instructions are fetched with each Fetch
• Using a Harvard memory architecture - separate instruction and data buses
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 178
Pipelined Parallelism in Instruction Processing
Overlapping Fetch & Execute Phases
[Figure: two-stage timing - Fetch #1, #2, #3 proceed back-to-back, with Execute #1, #2, #3 each running while the next Fetch is in progress; the three-stage version adds a Decode #1, #2, #3 stream between the Fetch and Execute streams.]
Overlapping Fetch, Decode & Execute Phases
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 179
Pipelined Parallelism in Instruction Processing
Overlapping Fetch, Decode & Execute Phases
There are benefits to extending the pipeline to more than 2 stages - even though more hardware is needed
A 3-stage pipeline splits the instruction processing into Fetch, Decode and Execute.
The Fetch stage operates as before.
The Decode stage decodes the instruction and calculates any memory addresses used in the Execute
The Execute stage controls the ALU and writes result back to a register - and can perform LOAD and STORE accesses.
The Decode stage is guaranteed not to need a memory access. Thus memory contention is no worse than in the 2 stage version.
Longer Pipelines of 5, 7 or more stages are possible and depend on the complexity of hardware and instruction set.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 180
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 181
Pipelined Parallelism in Instruction Processing
The Effect of Branch Instructions
One of the biggest problems with pipelining is the effect of a branch instruction.
A branch is Fetched as usual and the target address Decoded. The Execute stage then has the task of deciding whether or not to branch and so changing the PC.
By this time the PC has already been used at least once by the Fetch (and with a separate Decode maybe twice).
The effect of changing the PC is that all data in the pipeline following the branch must be flushed.
Branches are common in some types of program (up to 10% of instructions). So benefits of pipelining can be lost for 10% of instructions and incur reloading overhead.
A number of schemes exist to avoid this flushing:
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 182
Pipelined Parallelism in Instruction Processing
• Delayed Branching - Sun SPARC
Instead of branching as soon as a branch instruction has been decided, the branch is modified to “execute n more instructions before jumping to the instruction specified”, with n chosen to be 1 smaller than the number of stages in the pipeline. So in a 2-stage pipeline, instead of the loop:
a; b; c; a; b; c; ….. (where c is the branch instruction back to a)
in that order, the code could be stored as:
a; c; b; a; c; b; ……
In this case a is executed, then the decision is made to jump back to a, but before the jump happens b is executed.
The delayed jump at c enables b - which has already been fetched when evaluating c - to be used rather than thrown away.
Care is needed when operating instructions out of sequence, and the machine code becomes difficult to understand.
A good compiler can hide all of this, and in about 70% of cases it can be applied easily.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 183
Pipelined Parallelism in Instruction Processing
• Delayed Branching
Consider the following code fragment, running on a 3-stage pipeline:
loop: RA = RB ADD RC
      RD = RB SUB RC
      RE = RB MUL RC
      RF = RB DIV RC
      BNZ RA, loop

Cycle   Fetch   Decode   Execute
1       ADD     -        -
2       SUB     ADD      -
3       MULT    SUB      ADD
4       DIV     MULT     SUB
5       BNZ     DIV      MULT
6       next    BNZ      DIV
7       next2   next     BNZ (updates PC)
8(=1)   ADD     -        -

The pipeline has to be flushed to remove the two incorrectly fetched instructions, and the code repeats every 7 cycles.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 184
Pipelined Parallelism in Instruction Processing
• Delayed Branching
We can invoke the delayed-branching behaviour of DBNZ and re-order 2 instructions (if possible) from earlier in the loop:
loop: RA = RB ADD RC
      RD = RB SUB RC
      DBNZ RA, loop
      RE = RB MULT RC
      RF = RB DIV RC

Cycle   Fetch   Decode   Execute
1       ADD     -        -
2       SUB     ADD      -
3       DBNZ    SUB      ADD
4       MULT    DBNZ     SUB
5       DIV     MULT     DBNZ (updates PC)
6(=1)   ADD     DIV      MULT
7(=2)   SUB     ADD      DIV
8(=3)   DBNZ    SUB      ADD

The loop now executes every 5 processor cycles - no instructions are fetched and left unused.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 185
Pipelined Parallelism in Instruction Processing
• Instruction Buffers - IBM PowerPC
When a branch is found in an early stage of the pipeline, the Fetch unit can be made to start fetching both future instruction streams into separate buffers, and start decoding both, before the branch is executed. There are a number of difficulties with this:
– it imposes an extra load on instruction memory
– it requires extra hardware - duplication of decode and fetch
– it becomes difficult to exploit fully if several branches follow closely - each fork requires a separate pair of instruction buffers
– early duplicated stages cannot fetch different values to the same register, so register fetches may have to be delayed - pipeline stalling(?)
– duplicate pipeline stages must not write (memory or registers) unless a mechanism for reversing the changes is included (in case the branch is not taken)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 186
Pipelined Parallelism in Instruction Processing
• Branch Prediction - Intel Pentium
When a branch is executed, the destination address chosen can be kept in a cache. When the Fetch stage detects a branch, it can prime itself with a next-program-counter value looked up in the cached table of previous destinations for a branch at this instruction.
If the branch is made (at the execution stage) in the same direction as before, then the pipeline already contains the correct prefetched instructions and does not need to be flushed.
More complex schemes could even use a most-frequently-taken strategy to guess where the next branch from any particular instruction is likely to go, and reduce pipeline flushes still further.
[Figure: the Fetch stage’s PC searches a look-up table of (instruction address, target address) pairs; when a target address is found it is loaded as the next PC, ahead of the Decode and Execute stages.]
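A minimal sketch of such a look-up table (a branch target buffer, in later terminology); the size and names are invented:

#include <stdint.h>
#include <stdbool.h>

#define BTB_SIZE 64

struct btb_entry { uint32_t pc, target; bool valid; };
static struct btb_entry btb[BTB_SIZE];

/* Fetch stage: prime the next PC from this branch's previous destination. */
uint32_t predict_next(uint32_t pc)
{
    struct btb_entry *e = &btb[pc % BTB_SIZE];
    if (e->valid && e->pc == pc)
        return e->target;    /* seen before: prefetch down the same path */
    return pc + 4;           /* otherwise fall through sequentially */
}

/* Execute stage: record where the branch actually went. */
void btb_update(uint32_t pc, uint32_t target)
{
    btb[pc % BTB_SIZE] = (struct btb_entry){ pc, target, true };
}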
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 187
Pipelined Parallelism in Instruction Processing
• Dependence of Instructions on others which have not completed
Instructions cannot be reliably fetched while any previous branch instruction is incomplete - the PC is updated too late for the next fetch.
– A similar problem occurs with memory and registers.
– the memory case can be solved by ensuring that all memory accesses are performed atomically in a single Execute stage - get data only when needed
– but what if the memory just written contains a new instruction which has already been prefetched? (self-modifying code)
In a long pipeline, several stages may read from a particular register and several may write to the same register.
– Hazards occur when the order of access to operands is changed by the pipeline
– various methods may be used to prevent data from different stages getting confused in the pipeline.
Consider 2 sequential instructions i, j, and a 3 stage pipeline. Possible hazards are:
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 188
Pipelined Parallelism in Instruction Processing
• Read-after-write Hazards
When j tries to read a source before i writes it, j incorrectly gets the old value
– a direct consequence of pipelining conventional instructions
– occurs when a register is read very shortly after it has been updated
– the value in the register is correct
Example
R1 = R2 ADD R3
R4 = R1 MULT R5

Cycle   Fetch   Decode                     Execute          Comments
1       ADD     -                          -
2       MULT    ADD fetches R2,R3          -
3       next1   MULT fetches R1,R5         ADD stores R1    register fetch probably gets the wrong value
4       next2   next1                      MULT stores R4   wrong value calculated
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 189
Pipelined Parallelism in Instruction Processing
• Write-after-write Hazards
When j tries to write an operand before i writes it, the value left by i, rather than the value written by j, is left at the destination
– Occurs if the pipeline permits writes from more than one stage
– the value in the register is incorrect
Example
R3 = R1 ADD R2
R5 = R4 MULT -(R3)

Cycle   Fetch   Decode                                     Execute          Comments
1       ADD     -                                          -
2       MULT    ADD fetches R1,R2                          -
3       next1   MULT fetches (R3-1),R4; saves R3-1 in R3   ADD stores R3    which version of R3?
4       next2   next1                                      MULT stores R5
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 190
Pipelined Parallelism in Instruction Processing
• Write-after-read Hazards
When j tries to write to a register before it is read by i, i incorrectly gets the new value
– can only happen if the pipeline provides for early (decode-stage) writing of registers and late reading - e.g. auto-increment addressing
– the value in the register is correct
Example
A realistic example is difficult in this case, for several reasons:
• Firstly, memory accessing introduces dependencies for the data in the read case, or stalls due to bus activity in the write case
• A long pipeline with early writing and late reading of registers is rather untypical……..
• Read-after-read Hazards
These are not a hazard - multiple reads always return the same value…….
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 191
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 192
Pipelined Parallelism in Instruction Processing
• Detecting Hazards
Several techniques - normally resulting in some stage of the pipeline being stopped for a cycle - can be used to overcome these hazards.
They all depend on detecting register-usage dependencies between instructions in the pipeline.
An automated method of managing register accesses is needed.
The most common detection scheme is scoreboarding.
Scoreboarding - keeping a 1-bit tag with each register:
– clear all tags when the machine is booted
– a tag is set by the Fetch or Decode stage when an instruction is going to change a register
– when the change is complete, the tag bit is cleared
– if an instruction is decoded which wants a tagged register, the instruction is not allowed to access it until the tag is cleared
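A minimal sketch of the scoreboard tag logic (register count and function names invented):

#include <stdbool.h>

static bool pending[32];   /* one tag bit per register, cleared at boot */

/* Decode: an instruction may issue only if none of its registers are tagged. */
bool can_issue(unsigned dest, unsigned src1, unsigned src2)
{
    return !pending[dest] && !pending[src1] && !pending[src2];
}

void tag_on_issue(unsigned dest)   { pending[dest] = true;  }  /* will be changed */
void clear_on_write(unsigned dest) { pending[dest] = false; }  /* change complete */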
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 193
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding
Hazards will always be a possibility, particularly in long pipelines.
Some can be avoided by providing an alternative pathway for data from a previous cycle that has not been written back in time:
[Figure: a multiplexer on each ALU input selects between the register-file value and bypass paths from the ALU output, alongside the normal register write path.]
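In software terms, the bypass multiplexer on each operand port is a priority choice; the signal names here are invented, and the newest in-flight result wins:

#include <stdint.h>
#include <stdbool.h>

/* Select the value for source register r: prefer the ALU-stage result,
 * then the memory-stage result, else the register file. */
uint32_t operand_mux(unsigned r, const uint32_t regs[32],
                     bool ex_valid,  unsigned ex_dest,  uint32_t ex_result,
                     bool mem_valid, unsigned mem_dest, uint32_t mem_result)
{
    if (ex_valid && ex_dest == r)
        return ex_result;       /* bypass path from the ALU output */
    if (mem_valid && mem_dest == r)
        return mem_result;      /* bypass path from the memory stage */
    return regs[r];             /* normal register write path already done */
}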
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 194
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11
on a 5-stage pipeline (Fetch | Decode/Reg read | ALU Execute | Memory Access | Register Write) - no forwarding pathways:

Cycle  Fetch  Decode/regs                 ALU              Memory         Writeback
1      ADD    -                           -                -              -
2      SUB    ADD read R2,R3              -                -              -
3      AND    SUB read R5, R1(not ready)  ADD compute R1   -              -
4      -      SUB read R1(not ready)      -                ADD pass R1    -
5      -      SUB read R1(not ready)      -                -              ADD store R1
6      -      SUB read R1                 -                -              -
7      OR     AND read R1,R7              SUB compute R4   -              -
8      XOR    OR read R1,R9               AND compute R6   SUB pass R4    -
9      next1  XOR read R1,R11             OR compute R8    AND pass R6    SUB store R4
10     next2  next1                       XOR compute R10  OR pass R8     AND store R6
11     next3  next2                       next1            XOR pass R10   OR store R8
12     next4  next3                       next2            next1          XOR store R10
13     next5  next4                       next3            next2          next1
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 195
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11
on the same 5-stage pipeline - no forwarding pathways, BUT registers are read in the second half of a cycle and written in the first half:

Cycle  Fetch  Decode/regs                 ALU              Memory         Writeback
1      ADD    -                           -                -              -
2      SUB    ADD read R2,R3              -                -              -
3      AND    SUB read R5, R1(not ready)  ADD compute R1   -              -
4      -      SUB read R1(not ready)      -                ADD pass R1    -
5      -      SUB read R1                 -                -              ADD store R1
6      OR     AND read R1,R7              SUB compute R4   -              -
7      XOR    OR read R1,R9               AND compute R6   SUB pass R4    -
8      next1  XOR read R1,R11             OR compute R8    AND pass R6    SUB store R4
9      next2  next1                       XOR compute R10  OR pass R8     AND store R6
10     next3  next2                       next1            XOR pass R10   OR store R8
11     next4  next3                       next2            next1          XOR store R10
12     next5  next4                       next3            next2          next1
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 196
Pipelined Parallelism in Instruction Processing
• Avoiding Hazards - Forwarding - Example
R1 = R2 ADD R3
R4 = R1 SUB R5
R6 = R1 AND R7
R8 = R1 OR R9
R10 = R1 XOR R11
Cycle | Fetch | Decode/regs | ALU | Memory | Writeback
1 | ADD | - | - | - | -
2 | SUB | ADD read R2,R3 | - | - | -
3 | AND | SUB read R5, R1 (from ALU) | ADD compute R1 | - | -
4 | OR | AND read R1 (from ALU), R7 | SUB compute R4 | ADD pass R1 | -
5 | XOR | OR read R1 (from ALU), R9 | AND compute R6 | SUB pass R4 | ADD store R1
6 | next1 | XOR read R1, R11 | OR compute R8 | AND pass R6 | SUB store R4
7 | next2 | next1 | XOR compute R10 | OR pass R8 | AND store R6
8 | next3 | next2 | next1 | XOR pass R10 | OR store R8
9 | next4 | next3 | next2 | next1 | XOR store R10
10 | next5 | next4 | next3 | next2 | next1
In this case the forwarding prevents any pipeline stalls.
on a 5-stage pipeline (Fetch | Decode/Reg read | ALU Execute | Memory Access | Register Write) - with full forwarding
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 197
Pipelined Parallelism in Instruction Processing
• Characteristics of Memory Store Operations
Example - use the 5-stage pipeline as before in a store cycle:
R1 = R2 ADD R3
25(R1) = R1 (store in main memory)
Cycle | Fetch | Decode/regs | ALU | Memory | Writeback
1 | ADD | - | - | - | -
2 | STORE | ADD read R2,R3 | - | - | -
3 | next1 | STORE read R1 (not ready) | ADD compute R1 | - | -
4 | next2 | next1 | STORE compute R1+25 | ADD pass R1 | -
5 | stall | next2 | next1 | STORE write R1 to (R1+25) | ADD store R1
6 | next3 | - | next2 | next1 | STORE null
(R1 reaches the STORE by forwarding from the ALU; the stall at cycle 5 waits for the memory indirection)
Since STORE is an output operation, it does not create register-based hazards. It might create memory-based hazards, which may be avoided by instruction re-ordering or store-fetch avoidance techniques - see the next section
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 198
Pipelined Parallelism in Instruction Processing
• Forwarding during Memory Load Operations
Example - use the 5-stage pipeline as before in a load cycle:
R1 = 32(R6)
R4 = R1 ADD R7
R5 = R1 SUB R8
R6 = R1 AND R7
Cycle | Fetch | Decode/regs | ALU | Memory | Writeback
1 | LOAD | - | - | - | -
2 | ADD | LOAD read R6 | - | - | -
3 | SUB | ADD read R7, R1 (not ready) | LOAD R6+32 | - | -
4 | stall | ADD R1 (not ready) | - | LOAD (R6+32) | -
5 | AND | SUB read R8, R1 (from Mem) | ADD R4, R1 (from Mem) | - | LOAD store R1
6 | next1 | AND read R7, R1 | SUB R5 | ADD pass R4 | -
7 | next2 | next1 | AND R6 | SUB pass R5 | ADD store R4
8 | next3 | next2 | next1 | AND pass R6 | SUB store R5
9 | next4 | next3 | next2 | next1 | AND store R6
In this case the result of the LOAD must be forwarded to the earlier ALU stage, and the even earlier DECODE stage.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 199
Pipelined Parallelism in Instruction Processing
• Forwarding (Optimisation) Applied to Memory Operations
– Store-Fetch Forwarding - where a word stored and then loaded by another instruction further back in the pipeline can be piped directly, without the need to be passed into and out of that register or memory location: e.g.
MOV [200],AX   ;copy AX to memory
ADD BX,[200]   ;add memory to BX
transforms to:
MOV [200],AX
ADD BX,AX
– Fetch-Fetch Forwarding - where words loaded twice in successive stages may be loaded together - or once from memory to register
MOV AX,[200]   ;copy memory to AX
MOV BX,[200]   ;copy memory to BX
transforms to:
MOV AX,[200]
MOV BX,AX
– Store-Store Overwriting - where two stores to the same location leave only the second value visible
MOV [200],AX   ;copy AX to memory
MOV [200],BX   ;copy BX to memory
transforms to:
MOV [200],BX
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 200
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Instruction Re-ordering
Because hazards and data dependencies cause pipeline stalls, removing them can improve performance. Re-ordering instructions is often the simplest technique.
Consider a program to calculate sum_{n=1}^{100} n·a^n on a 3-stage pipeline:
loop: RT = RA EXP RN
      RT = RT MULT RN
      RS = RS ADD RT
      RN = RN SUB 1
      BNZ RN, loop
Cycle | Fetch | Decode | Execute
1 | EXP | - | -
2 | MULT | EXP read RA,RN | -
3 | ADD | MULT read RN, RT (not ready) | EXP store RT
4 | - | MULT read RN, RT | -
5 | SUB | ADD read RS, RT (not ready) | MULT store RT
6 | - | ADD read RS, RT | -
7 | BNZ | SUB read RN, 1 | ADD store RS
8 | - | BNZ read RN (not ready) | SUB store RN
9 | next1 | BNZ read RN | -
10 | next2 | next1 | BNZ store PC
11 | EXP | flushed | flushed
Needs 10 cycles..
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 201
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Instruction Re-ordering
Re-order the sum and decrement instructions:
loop: RT = RA EXP RN
      RT = RT MULT RN
      RN = RN SUB 1   <- these 2 swapped
      RS = RS ADD RT  <- these 2 swapped
      BNZ RN, loop
Cycle | Fetch | Decode | Execute
1 | EXP | - | -
2 | MULT | EXP read RA,RN | -
3 | SUB | MULT read RN, RT (not ready) | EXP store RT
4 | - | MULT read RN, RT | -
5 | ADD | SUB read RN, 1 | MULT store RT
6 | BNZ | ADD read RS, RT | SUB store RN
7 | next1 | BNZ read RN | ADD store RS
8 | next2 | next1 | BNZ store PC
9 | EXP | flushed | flushed
It can only be made better with forwarding - to remove the final RT dependency
Needs 8 cycles..
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 202
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 203
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
The unrolling of loops is a conventional technique for increasing performance. It works especially well in pipelined systems:
– start with a tight program loop
– re-organise the loop construct so that the loop is traversed half (or a third, quarter etc.) as many times
– re-write the code body so that it performs two (3, 4) times as much work in each loop
– optimise the new code body
In the case of pipelined execution, the code body gains from:
– more likely benefit from delayed branching
– less need to increment the loop variable
– instruction re-ordering avoids pipeline stalls
– parallelism is exposed - useful for vector and VLIW architectures
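As a language-level illustration (this C fragment is an addition, not from the slides; the following slides work the same idea in register-level code):

    /* original tight loop */
    int sum_loop(const int a[100])
    {
        int s = 0;
        for (int i = 0; i < 100; i++)
            s += a[i];
        return s;
    }

    /* unrolled by 4: a quarter of the branches, and four independent
       partial sums expose parallelism to the pipeline */
    int sum_unrolled(const int a[100])
    {
        int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        for (int i = 0; i < 100; i += 4) {
            s0 += a[i];     s1 += a[i + 1];
            s2 += a[i + 2]; s3 += a[i + 3];
        }
        return (s0 + s1) + (s2 + s3);
    }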
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 204
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate sum_{n=1}^{100} array(n) using a Harvard architecture and forwarding:
R2 = 0
R1 = 100
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop
Cycle | Fetch | Decode | ALU | Memory | Writeback
1 | LOAD | - | - | - | -
2 | SUB | LOAD read R1 | - | - | -
3 | ADD | SUB read R1, 1 | LOAD R1+0 | - | -
4 | BNZ | ADD R2, R3 (not ready) | SUB R1-1 | LOAD (R1+0) | -
5 | next1 | BNZ read R1 | ADD R2+R3, R3 (from Mem) | SUB pass R1 | LOAD store R3
6 | next2 | next1 | BNZ R1 (from ALU) | ADD pass R2 | SUB store R1
7 | next3 | next2 | next1 | BNZ pass R1 | ADD store R2
8 | next4 | next3 | next2 | next1 | BNZ store PC
9 | LOAD | - | - | - | -
The code is difficult to write in optimal form - it is too short to implement delayed branching; forwarding prevents stalling, and performing the decrement early hides some of the memory latency
800 cycles to complete all loops
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 205
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate sum_{n=1}^{100} array(n) using a Harvard architecture and forwarding:
R2 = 0
R1 = 100
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop
Unrolling the loop body:
loop: R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      R3 = LOAD array(R1)
      R1 = R1 SUB 1
      R2 = R2 ADD R3
      BNZ R1, loop
Re-label registers and re-order:
loop: R3 = LOAD array(R1)
      R4 = LOAD array-1(R1)
      R5 = LOAD array-2(R1)
      R6 = LOAD array-3(R1)
      R1 = R1 SUB 4
      DBNZ R1, loop
      R2 = R2 ADD R3
      R2 = R2 ADD R4
      R2 = R2 ADD R5
      R2 = R2 ADD R6
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 206
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate sum_{n=1}^{100} array(n) using a Harvard architecture and forwarding
The branch has been replaced with a delayed branch - it takes effect after 4 more instructions (5-stage pipeline)
Cycle | Fetch | Decode | ALU | Memory | Writeback
1 | LOAD1 | - | - | - | -
2 | LOAD2 | LOAD1 read R1 | - | - | -
3 | LOAD3 | LOAD2 read R1,1 | LOAD1 array+R1 | - | -
4 | LOAD4 | LOAD3 read R1,2 | LOAD2 array+1+R1 | LOAD1 R3 | -
5 | SUB | LOAD4 read R1,3 | LOAD3 array+2+R1 | LOAD2 R4 | LOAD1 store R3
6 | DBNZ | SUB read R1,4 | LOAD4 array+3+R1 | LOAD3 R5 | LOAD2 store R4
7 | ADD1 | DBNZ read R1 | SUB R1 | LOAD4 R6 | LOAD3 store R5
8 | ADD2 | ADD1 read R2,R3 | DBNZ R1 (from ALU) | SUB pass R1 | LOAD4 store R6
9 | ADD3 | ADD2 read R2,R4 | ADD1 R2 | DBNZ pass R1 | SUB store R1
10 | ADD4 | ADD3 read R2,R5 | ADD2 R2 (from ALU) | ADD1 pass R2 | DBNZ store PC
11 | LOAD1 | ADD4 read R2,R6 | ADD3 R2 (from ALU) | ADD2 pass R2 | ADD1 store R2
12 | LOAD2 | LOAD1 read R1 | ADD4 R2 (from ALU) | ADD3 pass R2 | ADD2 store R2
13 | LOAD3 | LOAD2 read R1,1 | LOAD1 array+R1 | ADD4 pass R2 | ADD3 store R2
14 | LOAD4 | LOAD3 read R1,2 | LOAD2 array+1+R1 | LOAD1 R3 | ADD4 store R2
15 | SUB | LOAD4 read R1,3 | LOAD3 array+2+R1 | LOAD2 R4 | LOAD1 store R3
250 cycles to complete all loops
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 207
Pipelined Parallelism in Instruction Processing
• Code Changes affecting the pipeline - Loop Unrolling
Example - Calculate sum_{n=1}^{100} array(n) using a Harvard architecture and forwarding
The original loop took 8 cycles per iteration. The unrolled version allows a delayed branch to be implemented and performs 4 iterations in 10 cycles.
Gives an improvement of a factor of 3.2
Benefits of Loop Unrolling
– Fewer instructions (multiple decrements can be performed in one operation)
– longer loop allows delayed branch to fit
– better use of pipeline - more independent operations
– disadvantage - more registers required to obtain these results
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 208
Pipelined Parallelism in Instruction Processing
• Parallelism at the Instruction Level
Conventional instruction sets rely on the encoding of register numbers, instruction type and addressing modes to reduce the volume of the instruction stream.
RISC processors optimise for lower-level encoding in a longer instruction word - this requires them to consume more instruction bits per cycle, forcing advancements like Harvard memory architectures.
RISC architectures are still sequential processing machines - pipelining and superscalar instruction grouping introduce only a limited amount of parallelism.
Parallelism can also be introduced explicitly, with parallel operations in each instruction word.
VLIW (Very Long Instruction Word) machines have instruction formats which contain different fields, each referring to a separate functional unit in the processor; this requires multi-ported access to registers etc.
The choice of parallel activities in a VLIW machine is made by the compiler, which must determine when hazards exist and how to resolve them...
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 209
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Uniprocessor limits on performance
The speed of a pipelined processor (instructions per second) is limited by:
• clock frequency (AMD 2.66 GHz) - unlikely to increase much more
• depth of pipeline. As depth increases, the work in each stage per cycle initially decreases, but the effects of register hazards, branching etc. limit further sub-division, and load balancing between stages gets increasingly difficult
So, why only initiate one instruction in each cycle?
Superpipelined processors double the clock frequency by pushing alternate instructions from a conventional instruction stream into 2 parallel pipelines. The compiler must separate instructions to run independently in the 2 streams and, where this is not possible, must add NULL operations. More than 2 pipelines could be used. The scheme is not very flexible and has been superseded by:
Superscalar processors use a conventional instruction stream, read several instructions per cycle. Decoded instructions are issued to a number of pipelines - 2 or 3 pipelines can be kept busy this way
Very Long Instruction Word (VLIW) processors use a modified instruction set - each instruction contains sub-instructions, each sent to a separate functional unit
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 210
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Superscalar Architectures
– fetch and decode more instructions than are needed to feed a single pipeline
– launch instructions down a number of parallel pipelines in each cycle
– compilers often re-order instructions to place suitable instructions in parallel - the details of the strategy used will have a huge effect on the degree of parallelism achieved
– some superscalars can perform re-ordering at run time - to take advantage of free resources
– relatively easy to expand - add another pipelined functional unit. Will run previously compiled code, but will benefit from new compiler
– provide exceptional peak performance, but extra data requirements put heavy demands on memory system and sustained performance might not be much more than 2 instructions per cycle.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 211
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Very Long Instruction Word architectures
– VLIW machines provide a number of parallel functional units
• typically 2 integer ALUs, 2 floating point units, 2 memory access units and a branch control engine
– the units are controlled from bits in a very long instruction word - this can be 150 bits or more in width
– needs fetching across a wide instruction bus - and hence wide memories and cache.
– each of the many functional units requires 2 register read ports and a register write port
– Application code must have plenty of instruction level parallelism and few control hazards - obtainable by loop unrolling
– Compiler responsible for identifying activities to be combined into a single VLIW.
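To make the word layout concrete, here is a purely illustrative C sketch (the field names and widths are assumptions, not any real machine's format), matching the unit mix listed above:

    #include <stdint.h>

    /* One ~150+ bit VLIW word: a sub-instruction slot per functional unit.
       Widths are invented for illustration only. */
    struct vliw_word {
        uint32_t int_alu[2];   /* two integer ALU operations     */
        uint32_t fp[2];        /* two floating point operations  */
        uint32_t mem[2];       /* two memory access operations   */
        uint16_t branch;       /* branch control engine          */
    };                         /* unfillable slots hold NOPs     */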
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 212
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 213
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Hazards and Instruction Issue Matters with Multiple Pipelines
For two instructions i (earlier) and j (later), there are 3 main types of hazard:
– read after write - j tries to read an operand before i writes it, so j gets the old value
– write after write - j writes a result before i, so the value left by i rather than j is left at the destination
– write after read - j writes a result before it is read by i, so i incorrectly gets the new value
In a single-pipeline machine with in-order execution, read-after-write is the only one that cannot be avoided, and it is easily solved using forwarding.
Using extra superscalar pipelines (or altering the order of instruction completion or issue) brings all three types of hazard into play:
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 214
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Read After Write Hazards
It is difficult to organise forwarding from one pipeline to another. Better is to allow each pipeline to write its result values directly to any execution unit that needs them.
• Write After Read Hazards
Consider
F0 = F1 DIV F2
F3 = F0 ADD F4
F4 = F2 SUB F6
Assume that DIV takes several cycles to execute in one floating-point pipeline. Its dependency with ADD (F0) stops ADD from being executed until DIV finishes.
BUT SUB is independent of F0 and F3, could be executed in parallel with DIV, and could finish first. If it wrote to F4 before the ADD read it, then ADD would have the wrong value.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 215
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Write After Write Hazards
Consider
F0 = F1 DIV F2
F3 = F0 MUL F4
F0 = F2 SUB F6
F3 = F0 ADD F4
On a superscalar, the DIV and SUB have independent operands (F2 is read twice but not changed). If there are 2 floating-point pipelines, each could be performed at the same time. DIV would be expected to take longer, so SUB might try to write to F0 before DIV - hence ADD might get the wrong value from F0 (MUL would be made to wait for DIV to finish, however).
We can use scoreboarding to resolve these issues.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 216
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Limits to Superscalar and VLIW Expansion
Only about 5 operations per cycle are typical - why not 50?
– Limited parallelism available
• VLIW machines depend on a stream of ready-parallelised instructions
– many parallel VLIW instructions can only be found by unrolling loops
– if a VLIW field cannot be filled in an instruction, then that functional unit will remain idle during that cycle
• superscalar machines depend on a stream of sequential instructions
– loop unrolling is also beneficial for superscalars
– Limited hardware resources
• the cost of register read/write ports scales linearly with their number, but the complexity of access increases as the square of the number
• extra register-access complexity may lead to longer cycle times
• more memory ports are needed to keep the processor supplied with data
– Code size too high
• wasted fields in VLIW instructions lead to poor code density, the need for increased memory access, and overall less benefit from the wide instruction bus
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 217
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Amdahl's Law
Gene Amdahl suggested the following law for vector processors - it is equally appropriate for VLIW and superscalar machines and all multiprocessor machines.
Any parallel code has sequential elements - at startup and shutdown, at the beginning and end of each loop, etc. To find the benefit from parallelism we need to consider how much is done sequentially and how much in parallel.
The speedup factor can be taken as:
S(n) = (execution time using one processor) / (execution time using a multiprocessor with n processors)
If the fraction of code which can not be parallelised is f and the time taken for the computation on one processor is t then the time taken to perform the computation with n processors will be:
ft + (1 - f) t / nThe speed up is therefore:
S(n) = t / ( ft + (1 - f) t / n) = n/(1 + (n - 1) f)(ignoring any overhead due to parallelism or communication between processors)
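A quick numeric check of the formula, as a C sketch (the values of n and f are chosen here purely for illustration):

    #include <stdio.h>

    /* Amdahl speedup for n processors, sequential fraction f */
    static double amdahl(double n, double f)
    {
        return n / (1.0 + (n - 1.0) * f);
    }

    int main(void)
    {
        /* f = 5%: 16 processors give only ~9.1x, and even infinitely
           many processors cannot exceed 1/f = 20x */
        printf("S(16) = %.2f\n", amdahl(16.0, 0.05));
        return 0;
    }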
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 218
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Amdahl's Law: S(∞) = 1/f
• even for an infinite number of processors, the maximum speedup is 1/f, as given above
• a small reduction in the sequential overhead can make a huge difference in throughput
[Graph: Amdahl's Law - Speedup S(n) v Number of Processors (0 to 20), one curve per sequential fraction f = 0%, 5%, 10%, 20%; the f = 0% curve is the linear ideal, and the others flatten towards 1/f]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 219
Instruction Level Parallelism
Superscalar and Very Long Instruction Word Processors
• Gustafson's Law
– A result of observation and experience.
– If you increase the size of the problem then the size (not the fraction) of the sequential part remains the same.
– eg if we have a problem that uses a number of grid points to solve a partial differential equation
• for 1000 grid points 10% of the code is sequential.
• might expect that for 10,000 grid points only 1% of the code will be sequential.
• If we expand the problem to 100,000 grid points, then only 0.1% of the problem remains sequential.
– So after Amdahl’s law things start to look better again!
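For reference, the usual algebraic statement of this observation (not written out on the slide): if s is the sequential fraction measured on the n-processor run, the scaled speedup is S(n) = s + n(1 - s) = n - (n - 1)s, so as the problem grows and s shrinks, S(n) approaches n.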
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 220
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 221
Running Programs in Parallel
Options for running programs in parallel include:
• Timesharing on a Uniprocessor - this is mainly multi-tasking to share a processor rather than combining resources for a single application. Timesharing is characterised by:
– shared memory and semaphores
– high context-switch overheads
– limited parallelism
• Multiprocessors with shared memory - clustered computing combines several processors communicating via shared memory and semaphores.
– Shared memory limits performance (even with caches) due to the delays when the operating system or user processes wait for other processes to finish with shared memory and let them have their turn.
– Four to eight processors actively communicating on a shared bus is about the limit before access delays become unacceptable
• Multiprocessors with separate communication switching devices - e.g. the INMOS transputer and Beowulf clusters.
– each element contains a packet-routing controller as well as a processor (the transputer contained both on a single chip)
– messages can be sent between any process on any processor in hardware
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 222
Running Programs in Parallel
Options for running programs in parallel include (cont'd):
• Vector Processors (native and attached)
– may just be specialised pipeline engines pumping operands through heavily-pipelined, chained, floating point units.
– or they might have enough parallel floating point units to allow vector operands to be manipulated element-wise in parallel.
– can be integrated into otherwise fast scalar processors
– or might be co-processors which attach to general-purpose processors
• Active Memory (Distributed Array Processor)
– rather than take data to the processors, it is possible to take the processors to the data, by implementing a large number of very simple processors in association with columns of bits in memory
– thus groups of processors can be programmed to work together, manipulating all the bits of stored words.
– all processors are fed the same instruction in a cycle by a master controller.
• Dataflow Architectures - an overall task is defined in terms of all the operations which need to be performed and all the operands and intermediate results needed to perform them. Some operations can be started immediately with the initial data, whilst others must wait for the results of the first ones, and so on through to the final result.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 223
Running Programs in Parallel
• Semaphores:
– lock shared resources
– problems of deadlock and starvation
• Shared memory
– the fastest way to move information between two processors is not to move it!
– rather than sender → receiver, the sender and receiver share the same memory
– use a semaphore to prevent the receiver reading until the sender has finished
– the segment is created outside normal process space - a system call maps it into the space of each requesting process
[Diagram: Proc 1, Proc 2 and Proc 3 each mapping shared Segments 1-3 into their own address spaces]
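As an illustration of this model only (the slide names no particular API; POSIX calls are used here and error handling is omitted):

    #include <fcntl.h>
    #include <semaphore.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        /* segment created outside normal process space, mapped in on request */
        int fd = shm_open("/demo_seg", O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        /* semaphore starts at 0, so the receiver blocks until the sender posts */
        sem_t *ready = sem_open("/demo_sem", O_CREAT, 0600, 0);

        if (fork() == 0) {              /* receiver */
            sem_wait(ready);            /* wait until the sender has finished */
            printf("received: %s\n", seg);
            return 0;
        }
        sprintf(seg, "hello");          /* sender writes straight into the segment */
        sem_post(ready);                /* ...then lets the receiver read it */
        wait(NULL);
        shm_unlink("/demo_seg");
        sem_unlink("/demo_sem");
        return 0;
    }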
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 224
Running Programs in Parallel
Flynn's Classification of Computer Architectures
SISD Single Instruction, Single Data machines are conventional uni-processors. They read their instructions sequentially and operate on their data operands individually. Each instruction only accesses a few operand words
SIMD Single Instruction, Multiple Data machines are typified by vector processors. Instructions are still read sequentially but this time they each perform work on operands which describe multi-word objects such as arrays and vectors. These instructions might perform vector element summation, complete matrix multiplication or the solution of a set of simultaneous equations.
MIMD Multiple Instruction, Multiple Data machines are capable of fetching many instructions at once, each of which performs operations on its own operands. The architecture here is a multiprocessor - each processor (probably a SISD or SIMD processor) performs its own computations but shares the results with the others. The multiprocessor sub-divides a large task into smaller sections which are suitable for parallel solution and permits these tasks to share earlier results
(MISD) Multiple Instruction Single Data machines are not really implementable. One might imagine an image processing engine capable of taking an image and performing several concurrent operations upon it...
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 225
Major Classifications of Parallelism
Introduction
Almost all parallel applications can be classified into one or more of the following:
• Algorithmic Parallelism - the algorithm is split into sections (eg pipelining)
• Geometric Parallelism - a static data space is split into sections (eg process an image on an array of processors)
• Processor Farming - the input data is passed to many processors (eg ray tracing, co-ordinates sent to several processors one ray at a time)
Load Balancing
There are 3 forms of load balancing:
• Static Load Balancing - the choice of which processor to use for each part of the task is made at compile time
• Semi-dynamic - the choice is made at run-time but, once started, each task must run to completion on the chosen processor - more efficient
• Fully-dynamic load balancing - tasks can be interrupted and moved between processors at will. This enables processors with different capabilities to be used to best advantage, but context switching and communication costs may outweigh the gains
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 226
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 227
Major Classifications of Parallelism
Algorithmic Parallelism
• Tasks can be split so that a stream of data can be processed in successive stages on a series of processors
• As the first stage finishes its processing, the result is passed to the second stage and the first stage accepts more input data, processes it, and so on.
• When the pipeline is full, one result is produced every cycle
• At the end of continuous operation the early stages go idle as the last results are flushed through.
• Load balancing is static - the speed of the pipeline is determined by the speed of the slowest stage.
[Diagrams: data flowing to results through (a) a linear pipeline or chain, (b) a pipeline with a parallel section, and (c) an irregular network - the general case]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 228
Major Classifications of Parallelism
Geometric Parallelism
• Some regular-patterned tasks can be processed by spreading their data across several processors and performing the same task on each section in parallel
• Many examples involve image processing - pixels mapped to an array of transputers, for example
• Many such tasks involve communication of boundary data from one portion to another - finite element calculations
• Load balancing is static - the initial partitioning of the data determines the time to process each area.
• Rectangular blocks may not be the best choice - stripes, concentric squares…
• Initial loading of the data may prove to be a serious overhead
[Diagram: a data array partitioned across a grid of connected transputers]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 229
Major Classifications of Parallelism
Geometric v Algorithmic: compute F(xi) = cos(sin(exp(xi*xi))) for x1, x2, … x6 using 4 processors
Algorithmic: a 4-stage pipeline, one processor per stage (y = x*x → e^y → sin(y) → cos(y)), each stage taking 1 time unit.
F1 is produced in 4 time units; F2 is produced at time 5; i.e. time = 4+(6-1) = 9 units, speedup = 24/9 ≈ 2.7
Geometric: each processor computes the whole of cos(sin(e^(x*x))) for one value in 4 time units - x1..x4 in the first pass, then x5 and x6 in a second pass.
i.e. time = 8 units, speedup = 24/8 = 3
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 230
Major Classifications of Parallelism
Processor Farming
• Involves sharing work out from a central controller process to several worker processes.
• The "workers" just accept packets of command data and return results.
• The "controller" splits up the tasks, sending work packets to free processors (ones that have returned a result) and collating the results
• Global data is sent to all workers at the outset.
• Processor farming is only appropriate if:
– the task can be split into many independent sections
– the amount of communication (commands + results) is small
• To minimise latency, it might be better to keep 2 (or 3) packets in circulation for each worker - buffers are needed
• Load balancing is semi-dynamic - the command packets are sent to processors which have just (or are about to) run out of work. Thus all processors are kept busy except for the closedown phase, when some finish before others.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 231
Major Classifications of Parallelism
Processor Farming (cont'd)
[Diagram: A Processor Farm Controller - the controller generates work packets and sends command packets (CPU; work) through outgoing routers and buffers to the workers; result packets (result; CPU) come back through buffers and return routers to be received and displayed, and each returned CPU number marks that worker as free (initial processor numbers seed the free list). Each section runs on a separate transputer/processor]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 232
Vector Processors
Introduction
• Vector processors extend the scalar model by incorporating vector registers within the CPU
• These registers can be operated on by special vector instructions - each performs calculations element-wise on the vector
• Vector parallelism could enable a machine to be constructed with a row of FPUs all driven in parallel. In practice a heavily pipelined single FPU is usually used. Both are classified as SIMD
• A vector instruction is similar to an unrolled loop, but:
– each computation is guaranteed independent of all others - this allows a deep pipeline (allowing the cycle time to be kept short) and removes the need to check for data hazards (within the vector)
– instruction bandwidth is considerably reduced
– there are no control hazards (eg pipeline flushes on branches) since the looping has been removed
– the memory access pattern is well-known - thus the latency of memory access can be countered by interleaved memory blocks and serial memory techniques
– overlap of ALU & FPU operations, memory accesses and address calculations is possible.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 233
Vector Processors
Types of Vector Processors
• Vector-Register machines - vector registers, held in the CPU, are loaded and stored using pipelined vector versions of the typical memory access instructions
• Memory-to-Memory Vector machines - operate on memory only. Pipelines of memory accesses and FPU instructions operate together without pre-loading the data into vector registers. (This style has been overtaken by Vector-Register machines.)
Vector-Register Machines
The main sections of a Vector-Register machine are:
• The vector functional units - a machine can have several pipelined units, usually dedicated to just one purpose so that they can be optimised.
• Vector Load/Store activities are usually carried out by a dedicated pipelined memory access unit. This unit must deliver one word per cycle (at least) so that the FPUs are not held up. If this is the case, vector fetches may be carried out whilst part of the vector is being fed to the FPU
• Scalar registers and processing engine - a conventional machine
• Instruction scheduler
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 234
Vector Processors
The Effects of Start-up Time and Initiation Rate
• Like all pipelined systems, the time taken for a vector operation is determined by the start-up time, the initiation (result delivery) rate and the number of calculations performed
• The initiation rate is usually 1 - new vector elements are supplied in every cycle
• The start-up cost is the time for one element to pass along the vector pipeline - its depth in stages. This time is increased by the time taken to fetch data operands from memory if they are not already in the vector registers - this can dominate
• The number of clock cycles per vector element is then:
cycles per result = (start-up time + n*initiation rate)/n
• The start-up time is divided amongst all of the elements and dominates for short vectors.
• The start-up time is more significant (as a fraction of the time per result) when the initiation rate drops to 1
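A worked illustration of the formula (the pipeline depth is assumed for the example): with a 12-cycle start-up and an initiation rate of 1, a 64-element vector costs (12 + 64×1)/64 ≈ 1.2 cycles per result, while a 4-element vector costs (12 + 4)/4 = 4 cycles per result - the start-up term dominates for short vectors.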
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 235
Vector Processors
Load/Store Behaviour
• The pipelined load/store unit must be able to sustain a memory access rate at least as good as the initiation rate of the FPUs, to avoid data starvation.
• This is especially important when chaining the two units
• Memory has a start-up overhead - access-time latency - similar to the pipeline start-up cost
• Once data starts to flow, how can a rate of one word/cycle be maintained?
– interleaving is usually used
Memory Interleaving
We need to attach multiple memory banks to the processor and operate them all in parallel so that the overall access rate is sufficient. Two schemes are common:
• Synchronised Banks
• Independent Banks
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 236
Vector Processors
Memory Interleaving (cont'd)
Synchronised Banks
• A single memory address is passed to all memory banks, and they all access a related word in parallel.
• Once stable, all these words are latched and are then read out sequentially across the data bus - achieving the desired rate.
• Once the latching is complete, the memories can be supplied with another address and may start to access it.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 237
Vector Processors
Memory Interleaving (cont'd)
Independent Banks
• If each bank of memory can be supplied with a separate address, we obtain more flexibility - BUT we must generate and supply much more information.
• The data latches (as in the synchronised case) may not be necessary, since all data should be available at the memory interface when required.
In both cases, we require more memory banks than the number of clock cycles taken to get information from a bank of memory.
The number of banks chosen is usually a power of 2 - to simplify addressing (but this can also be a problem - see vector strides)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 238
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 239
Vector Processors
Variable Vector Length
In practice the length of user vectors will not match the machine's - 64, 256, whatever the register size is.
A hardware vector-length register in the processor is set before each vector operation - it is also used in the load/store unit.
Programming Variable Length Vector Operations
Since the processor's vector length is fixed, operations on long user vectors must be covered by several vector instructions. This is called strip mining.
Frequently the user vector will not be a precise multiple of the machine vector length, and so one vector operation will have to compute results for a short vector - this incurs greater set-up overheads.
Consider the following:
for (j=0; j<n; j++)
    x[j] = x[j] + (a * b[j]);
For a vector processor with vectors of length MAX and a vector-length register called LEN, we need to process a number of MAX-sized chunks of x[j] and then one section which covers the remainder:
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 240
Vector Processors
Variable Vector Length (cont'd)
start = 0;
LEN = MAX;
for (k=0; k<n/MAX; k++) {
    for (j=start; j<start+MAX; j++) {
        x[j] = x[j] + (a*b[j]);
    }
    start = start + MAX;
}
LEN = n-start;
for (j=start; j<n; j++)
    x[j] = x[j] + (a*b[j]);
The j-loop in each case is implemented as three vector instructions - a Load, a multiply and an add.
The time to execute the whole program is simply:
Int(n/MAX)*(sum of start-up overheads) + (n*3*1) cycles
This equation exhibits a saw-tooth shape as n increases - the efficiency drops each time the machine vectors fill up and an extra short vector operation must be used, carrying an extra start-up overhead…
Unrolling the outer loop will be effective too...
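A worked illustration of the saw-tooth (numbers assumed): with MAX = 64, n = 256 needs exactly four full strips, but n = 260 needs those four strips plus a 4-element remainder strip which still pays a full set of start-up overheads - so efficiency dips just past every multiple of 64.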
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 241
Vector Processors
Vector Stride
Multi-dimensional arrays are stored in memory as single-dimensional vectors. In all common languages (except Fortran) row 1 is stored next to row 0, plane 1 is stored next to plane 0, etc.
Thus, accessing an individual row of a matrix involves reading contiguous memory locations; these reads are easily spread across several interleaved memory banks.
Accessing a column of a matrix - the nth element in every row, say - involves picking individual words from memory. These words are separated from each other by x words, where x is the number of elements in each row of the matrix. x is the stride of the matrix in this dimension. Each dimension has its own stride.
Once loaded, vector operations on columns can be carried out with no further reference to their original memory layout.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 242
Vector Processors
Vector Stride (cont'd)
Consider multiplying 2 rectangular matrices together. What is the memory reference pattern of a column-wise vector load?
• We step through the memory in units of our stride
What about in a memory system with j interleaved banks?
• If j is co-prime with the stride x then we visit each bank just once before re-visiting any one again (assuming that we use the LS address bits as bank selectors)
• If j has any common factors with x (especially if j is a factor of x) then the banks are visited in a pattern which favours some banks and totally omits others. Since the number of active banks is reduced, the latency of memory accesses is not hidden and the one-cycle-per-access goal is lost. This is an example of aliasing.
Does it matter whether the interleaving uses synchronised or independent banks?• Yes. In the synchronised case, the actual memory accesses must be timed correctly
since all the MS addresses are the same, and if the stride is wider than the interleaving factor, only some of the word accesses will be used anyway.
• In the independent case, the separate accesses automatically happen at the right time and to the right addresses. The load/store unit must generate the stream of addresses in advance of the data being required, and must send each to the correct bank
A critically-banked system - the interleaved banks are all used fully in a vector access
Overbanking - supplying more banks than needed - reduces the danger of aliasing
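A small simulation sketch (the bank count and strides are invented for illustration) of why a stride sharing a factor with the number of banks defeats the interleaving:

    #include <stdio.h>

    #define BANKS 8

    int main(void)
    {
        const int strides[] = { 7, 8 };   /* co-prime with 8, then a multiple of 8 */
        for (int s = 0; s < 2; s++) {
            int hits[BANKS] = { 0 };
            for (int i = 0; i < 64; i++)
                hits[(i * strides[s]) % BANKS]++;   /* LS bits select the bank */
            printf("stride %d: ", strides[s]);
            for (int b = 0; b < BANKS; b++)
                printf("%2d ", hits[b]); /* stride 7: 8 hits per bank; stride 8: all 64 in bank 0 */
            printf("\n");
        }
        return 0;
    }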
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 243
Vector Processors
Forwarding and Chaining
If a vector processor is required to perform multiple operations on the same vector, then it is pointless to save the first result before reading it back into another (or the same) functional unit.
Chaining - the vector equivalent of forwarding - allows the pipelined result output of one functional unit to be joined to the input of another.
The performance of two chained operations is far greater than that of just one, since the first operation does not have to finish before the next starts. Consider
V1 = V2 MULT V3
V4 = V1 ADD V5
The non-chained solution requires a brief stall (4 cycles) since V1 must be fully written back to the registers before it can be re-used.
In the chained case, the dependences between writes to elements of V1 and their re-reading in the ADD are compensated by the forwarding effect of the chaining - no storage is required prior to use.
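For a rough feel of the gain (pipeline depths assumed purely for illustration): with 64-element vectors, a 7-stage MULT pipeline and a 6-stage ADD pipeline, running the two operations strictly in sequence costs about (7 + 64) + (6 + 64) = 141 cycles, while chaining overlaps them for about 7 + 6 + 64 = 77 cycles.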
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 244
Multi-Core Processors
• A multi-core microprocessor is one which combines two or more independent processors into a single package, often a single IC. A dual-core device contains just two independent microprocessors.
• In general, multi-core microprocessors allow a computing device to exhibit some form of parallelism without including multiple microprocessors in separate physical packages - an arrangement known as chip-level multiprocessing, or CMP.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 245
Multi-Core Processors
Commercial examples
• IBM's POWER4, the first dual-core module processor, released in 2000.
• IBM's POWER5 dual-core chip now in production - in use in the Apple PowerMac G5.
• Sun Microsystems UltraSPARC IV, UltraSPARC IV+, UltraSPARC T1
• AMD - dual-core Opteron processors on 22 April 2005,
– dual-core Athlon 64 X2 family, on 31 May 2005.
– and the FX-60, FX-62 and FX-64 for high-performance desktops,
– and one for laptops.
• Intel's dual-core Xeon processors,
– also developing dual-core versions of its Itanium high-end CPU
– produced the Pentium D, the dual-core version of the Pentium 4.
– A newer chip, the Core Duo, is available in Apple Computer's iMac
• Motorola/Freescale has dual-core ICs based on the PowerPC e500 core, and e600 and e700 cores in development.
• Microsoft's Xbox 360 game console uses a triple core PowerPC microprocessor.
• The Cell processor, in PlayStation 3 is a 9 core design.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 246
Multi-Core Processors
Why?
• CMOS manufacturing technology continues to improve:
– BUT, beyond reducing the size of single gates, clock speed can't continue to increase
– 5km of internal interconnects in a modern processor…. the speed of light is too slow!
• Also significant heat dissipation and data synchronization problems at high rates.
• Some gain from
– Instruction Level Parallelism (ILP) - superscalar pipelining - can be used for many applications
– many applications are better suited to Thread Level Parallelism (TLP) - multiple independent CPUs
• A combination of increased available space due to refined manufacturing processes and the demand for increased TLP has led to multi-core CPUs.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 247
Multi-Core Processors
• Advantages
• The proximity of multiple CPU cores on the same die has the advantage that the cache-coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip - combining equivalent CPUs on a single die significantly improves the cache performance of multiple CPUs.
• Assuming that the die can fit into the package, physically, the multi-core CPU designs require much less Printed Circuit Board (PCB) space than multi-chip designs.
• A dual-core processor uses slightly less power than two coupled single-core processors - fewer off chip signals, shared circuitry, like the L2 cache and the interface to the main Bus.
• In terms of competing technologies for the available silicon die area, multi-core design can make use of proven CPU core library designs and produce a product with lower risk of design error than devising a new wider core design.
• Also, adding more cache suffers from diminishing returns, so it is better to use the space in other ways
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 248
Multi-Core Processors
• Disadvantages
• In addition to operating system (OS) support, adjustments to existing software can be required to maximize utilization of the computing resources provided by multi-core processors.
• The ability of multi-core processors to increase application performance depends on the use of multiple threads within applications. – eg, most current (2006) video games will run faster on a 3 GHz single-core
processor than on a 2GHz dual-core, despite the dual-core theoretically having more processing power, because they are incapable of efficiently using more than one core at a time.
• Integration of a multi-core chip drives production yields down and they are more difficult to manage thermally than lower-density single-chip designs.
• Raw processing power is not the only constraint on system performance. Two processing cores sharing the same system bus and memory bandwidth limits the real-world performance advantage. Even in theory, a dual-core system cannot achieve more than a 70% performance improvement over a single core, and in practice, will most likely achieve less
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 249
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 250
The INMOS Transputer
The Transputer
Necessary features for a message-passing microprocessor are:
• A low context-switch time
• A hardware process scheduler
• Support for a communicating-process model
• Normal microprocessor facilities.
Special Features of Transputers:
• high-performance microprocessor
• conceived as building blocks (like transistors or logic gates)
• designed for intercommunication
• CMOS devices - low power, high noise immunity
• integrated with a small supporting chip count
• provided with a hardware task scheduler - supports multi-tasking with low overhead
• capable of sub-microsecond interrupt responses - good for control applications
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 251
The INMOS Transputer
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 252
The INMOS Transputer
Transputer Performance
The fastest first-generation transputer (IMS T805-30) is capable of:
• up to 15 MIPS sustained
• up to 3 MFLOPs sustained
• up to 40 Mbytes/sec at the main memory interface
• up to 120 Mbytes/sec to the 4K byte on-chip memory
• up to 2.3 Mbytes/sec on each of 4 bi-directional Links
• 30MHz clock speed
The fastest projected second-generation transputer (IMS T9000-50):
• is 5 times faster in calculation
• and 6 times faster in communication
• 50MHz clock speed - equivalent performance to the 100MHz Intel 486
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 253
The INMOS Transputer
Low Chip Count
To run using internal RAM a T805 transputer only requires:
• a 5MHz clock
• a 5V power supply at about 150mA
• a power-on-reset or external reset input
• an incoming Link to supply boot code and sink results
Expansion possibilities
• 32K*32 SRAM (4 devices) requires 3 support chips
• 8 support devices will support 8Mbytes of DRAM with optimal timing
• extra PALs will directly implement 8-bit memory-mapped I/O ports or timing logic for conventional peripheral devices (Ethernet, SCSI, etc)
• Link adapters can be used for limited expansion to avoid memory mapping
• TRAMs (transputers plus peripherals) can be used as very high-level building blocks
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 254
The INMOS Transputer
Transputer Processes
Software running on a transputer is made up from one or more sequential processes, which run in parallel and communicate with each other periodically.
Software running on many interconnected transputers is simply a group of parallel processes - just the same as if all the code were running on a single processor.
Processes can be reasoned about individually; rules exist which allow the overall effect of parallel processes to be reasoned about too.
The benefits of breaking a task into separate processes include:
• Taking advantage of parallel hardware
• Taking advantage of parallelism on a single processor
• Breaking the task into separately-programmed sections
• Easy implementation of buffers and data-management code which runs asynchronously with the main processing sections
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 255
The INMOS Transputer
Transputer Registers
The transputer implements a stack of 3 hardware registers and is able to execute 0-address instructions. It also has a few one-address instructions which are used for memory access.
All instructions and data operands are built up in 4-bit sections using an Operand register and two special Prefix and Negative Prefix instructions.
Extra registers store the head and tail pointers of two linked lists of process workspace headers - these make up the high- and low-priority run-time process queues. The hardware scheduler takes a new process from one of these queues whenever it suspends the current process (due to time-slicing or communication).
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 256
The INMOS Transputer
Action on Context Switching
Each process runs until it communicates, is time-sliced or is pre-empted by a higher-priority process. Time-slices occur at the next descheduling point - approx every 2ms. Pre-emption can occur at any time.
At a context switch the following happens:
• The PC of the stopping process is saved in its workspace at word WSP-1
• The process pointed to by the processor's BPtr1 is changed to point to the stopping process's WSP
• On a pre-emptive context switch (only) the registers in the ALU and FPU may need saving
• The process pointed to by FPtr1 is unlinked from the process queue, has its stored PC value loaded into the processor, and starts executing
A context switch takes about 1µs. This translates to a rate of about 1,000,000 context switches per second.
[Diagram: a process workspace holding the local variables for the PROC, a pointer forming the workspace chain, and the saved program counter]
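A minimal data-structure sketch of that run queue (the names and types here are assumptions, not INMOS's):

    typedef struct ws ws;
    struct ws { ws *next; int saved_pc; }; /* workspace header: queue link + PC at WSP-1 */

    static ws *FPtr, *BPtr;                /* front and back of the run queue */

    void deschedule(ws *cur, int pc)       /* the current process stops running */
    {
        cur->saved_pc = pc;                /* save PC in the workspace */
        cur->next = NULL;
        if (BPtr) BPtr->next = cur; else FPtr = cur;
        BPtr = cur;                        /* link it at the tail */
    }

    ws *dispatch(void)                     /* the scheduler picks the next process */
    {
        ws *nxt = FPtr;
        FPtr = nxt->next;                  /* unlink from the head */
        if (!FPtr) BPtr = NULL;
        return nxt;                        /* its saved PC is loaded and it runs */
    }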
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 257
The INMOS Transputer
Joining Transputers Together
Three routing configurations are possible:
• static - nearest-neighbour communications
• any-to-any routing across static configurations
• dynamic configuration with specialised routing devices
Static Configurations
Transputers can be connected together in fixed configurations, which are characterised by:
• Number of nodes
• Valency - number of interconnecting arcs (per processor)
• Diameter - maximum number of arcs traversed from point to point
• Latency - time for a message to pass across the network
• Point-to-point bandwidth - message flow rate along a route
Structures
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 258
The Cray T3D
The T3D Network
After simulation, the T3D network was chosen to be a 3D torus (as is the T3E's). Note:
config.          | max latency | average latency
8-node ring      | 4 hops      | 2 hops
2D, 4*2 torus    | 3 hops      | 1.5 hops
3D, 2*2*2 torus  | 2 hops      | 1 hop
[Diagrams: a 4*4 2D torus; a cube redrawn as a 4*2 2D torus; a hyper-cube]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 259
Beowulf Clusters
Introduction
Mass-market competition has driven down the prices of subsystems: processors, motherboards, disks, network cards etc.
Development of publicly available software: Linux, GNU compilers, PVM and MPI libraries
PVM - Parallel Virtual Machine (allows many inter-linked machines to be combined as one parallel machine)
MPI - Message Passing Interface (similar to PVM)
High Performance Computing groups have many years of experience working with parallel algorithms.
The history of MIMD computing shows many academic groups and commercial vendors building machines based on "state-of-the-art" processors, BUT they always needed special "glue" chips or one-of-a-kind interconnection schemes.
This leads to interesting research and new ideas, but often results in one-off machines with a short life cycle.
It also leads to vendor-specific code (to use vendor-specific connections).
Beowulf uses standard parts and the Linux operating system (with MPI - or PVM)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 260
Beowulf Clusters
Introduction
The first Beowulf was built in 1994 with 16 DX4 processors and a 10Mbit/s Ethernet. The processors were too fast for a single Ethernet, and Ethernet switches were still much too expensive to use more than one. So they re-wrote the Linux Ethernet drivers and built a channel-bonded Ethernet
– network traffic was striped across 2 or more Ethernets
As 100Mb/s Ethernet and switches have become cheap there is less need for channel bonding - this can support 16 200MHz P6 processors…..
The best configuration continues to change, but this does not affect the user. With the robustness of MPI, PVM, Linux (Extreme) and the GNU compilers, programmers have the confidence that what they write today will still work on future Beowulf clusters.
In 1997 CalTech's 140-node cluster ran a problem sustaining a 10.9 Gflop/s rate
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 261
Beowulf Clusters
The Future
Beowulf clusters are not quite Massively Parallel Processors like the Cray T3D: MPPs are typically bigger and have a lower network latency, and a lot of work must be done by the programmer to balance the system.
But the cost effectiveness is such that many people are developing do-it-yourself approaches to HPC and building their own clusters, and a large number of computer companies are taking these machines very seriously and offering full clusters.
2002 - a 2096-processor Linux cluster comes in as the 5th fastest computer in the world…
2005 - 4800 2.2GHz PowerPC cluster is #5 - 42.14 TFlops
       40960 1.4GHz Itanium is #2 - 114.67 TFlops
       65536 0.7GHz PowerPC is #1 - 183.5 TFlops
       5000 Opteron (AMD - Cray) is #10 - 20 TFlops
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 262
Fastest Supercomputers - June 2006
Rank | Site | Computer | Processors | Year | Rmax (GFlop/s) | Rpeak (GFlop/s)
1 LLNL US Blue Gene – IBM 131072 2005 280600 367000
2 IBM US Blue Gene –IBM 40960 2005 91290 114688
3 LLNL US ASCI Purple IBM 12208 2006 75760 92781
4 NASA US Columbia – SGI 10160 2004 51870 60960
5 CEA, France Tera 10, Bull SA 8704 2006 42900 55705.6
6 Sandia US Thunderbird – Dell 9024 2006 38270 64972.8
7 GSIC, Japan TSUBAME - NEC/Sun 10368 2006 38180 49868.8
8 Julich, Germany Blue Gene – IBM 16384 2006 37330 45875
9 Sandia, US Red Storm - Cray Inc. 10880 2005 36190 43520
10 Earth Simulator, Japan Earth-Simulator, NEC 5120 2002 35860 40960
11 Barcelona Super Computer Centre, Spain MareNostrum – IBM 4800 2005 27910 42144
12 ASTRON/University Groningen, Netherlands Stella (Blue Gene) – IBM 12288 2005 27450 34406.4
13 Oak Ridge, US Jaguar - Cray Inc. 5200 2005 20527 24960
14 LLNL, US Thunder - Digital Corporation 4096 2004 19940 22938
15 Computational Biology Research Center, Japan Blue Protein (Blue Gene) –IBM 8192 2005 18200 22937.6
16 Ecole Polytechnique, Switzerland Blue Gene - IBM 8192 2005 18200 22937.6
17 High Energy Accelerator Research Organization, Japan KEK/BG Sakura (Blue Gene) – IBM 8192 2006 18200 22937.6
18 High Energy Accelerator Research Organization, Japan KEK/BG Momo (Blue Gene) – IBM 8192 2006 18200 22937.6
19 IBM Rochester, On Demand Deep Computing Center, US Blue Gene - IBM 8192 2006 18200 22937.6
20 ERDC MSRC, United States Cray XT3 - Cray Inc. 4096 2005 16975 21299
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 263
Shared Memory Systems
Introduction
The earliest form of co-operating processors used shared memory as the communication medium.
Shared memory involves connecting the buses of several processors together so that either:
– all memory accesses for all processors share the bus; or
– just inter-processor communication accesses share the common memory
Clearly the latter involves less contention.
Shared memory systems typically operate under the control of a single operating system, either:
• with one master processor and several slaves; or
• with all processors running separate copies of the OS, maintaining a common set of VM and process tables.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 276
Shared Memory Systems
The Shared-Memory Programming Model
Ideally a programmer wants each process to have access to a contiguous area of memory - how is unimportant.
Somewhere in the memory map will be sections of memory which are also accessible by other processes.
How do we implement this? We certainly need caches (for speed) and VM, secondary storage etc. (for flexibility)
[Diagram: several Processors, each with a local cache, sharing a Main Memory and Secondary Memory through a shared virtual address space]
Notice that cache consistency issues are introduced as soon as multiple caches are provided.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 277
Shared Memory Systems
Common Bus Structures
A timeshared common bus arrangement can provide the interconnection required:
[Diagram: several processors (P) sharing a single common bus to memory]
A common bus provides:
• contention resolution between the processors
• limited bandwidth, shared by all processors
• single-point failure modes
• cheap(ish) hardware - although speed requirements and complex wiring add to expense
• easy, but non-scalable, expansion
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 278
Shared Memory Systems
Common Bus Structures (cont'd)
Adding caches, extra buses (making a crossbar arrangement) and multiport memory can help:
[Diagram: processors (P) with private caches connected by multiple buses, in a crossbar arrangement, to multiport memories]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 279
Shared Memory Systems
Kendall Square Research KSR1
One of the most recent shared memory architectures is the Kendall Square Research KSR1, which implements the virtual memory model across multiple memories, using a layered caching scheme.
The KSR1 processors are proprietary:
• 64-bit superscalar; issues 1 integer and 2 chained FP instructions per 50 ns cycle, giving a peak performance of 20 MIPS / 40 MFLOPs
• each processor has 256 Kbytes of local instruction cache and 256 Kbytes of local data cache
• there is a 40-bit global addressing scheme
1088 (32*34) processors can be attached in the current KSR1 architecture.
Main memory comprises 32 Mbytes of DRAM per Processor Environment, connected in a hierarchical cached scheme.
If a page is not held in one of the 32 Mbyte caches it is stored on secondary memory (disc - as with any other system).
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 280
Shared Memory Systems
KSR1 Processor Interconnection
The KSR1 architecture connects the caches on each processor with a special memory controller called the Allcache Engine. Several such memory controllers can be connected.
[Diagram: a level-0 router directory linked by a ring to several cell-interconnect units; each cell contains a 32 MB main cache and a processor (P) with a 256 kB cache]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 281
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
The Allcache Engine at the lowest level (level-0) provides:
• connections to all the 32 Mbyte caches on the processor cells
• up to 32 processors in each ring
The level-0 Allcache Engine features:
• a 16-bit wide slotted ring, which synchronously passes packets between the interconnect cells (i.e. every path can carry a packet simultaneously)
• each ring carries 8 million packets per second
• each packet contains a 16-byte header and 128 bytes of data
• this gives a total throughput of 1 Gbyte per second
• each router directory contains an entry for each sub-page held in the main cache memory (below)
• requests for a sub-page are made by the cell interconnect unit, passed around the ring and satisfied by data if it is found in the other level-0 caches
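As a quick sanity check of the quoted figure (assuming the 1 Gbyte per second counts the 128-byte payloads only, not the headers):

packets_per_second = 8e6        # on one level-0 ring
payload_bytes = 128             # per packet (the 16-byte header is extra)
print(packets_per_second * payload_bytes / 1e9)   # 1.024, i.e. ~1 Gbyte/s as quoted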
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 282
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
KSR1 Higher Level Routers
In order to connect more than 32 processors, a second layer of routing is needed. This contains up to 34 Allcache router directory cells, plus the main level-1 directory which permits connection to level 2.
[Diagram: hierarchy of rings - each level-0 ring connects 32 processors; the level-1 ring connects up to 1088 processors; a level-2 ring above would be unaffordable, with minimal bandwidth per processor]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 283
Shared Memory Systems
KSR1 Processor Interconnection (cont'd)
The Level-1 Allcache Ring
The routing directories in the level-1 Allcache Engine contain copies of the entries in the lower-level tables, so that requests may be sent downwards for sub-page information as well as upwards - the level-1 table is therefore very large.
The higher-level packet pipelines carry 4 Gbytes per second of inter-cell traffic.
[Diagram: the level-1 router directory holds a copy of each level-0 Allcache router directory (ARD); the ARD directories below retain their own entries]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 284
Shared Memory Systems
KSR1 Performance
As with all multi-processor machines, maximum performance is obtained when there is no communication.
The layered KSR1 architecture does not scale linearly in bandwidth or latency as processors are added:

Relative Bandwidths
unit              bandwidth (MByte/s)   shared amongst   fraction (MByte/s)
256 kB subcache   160                   1 PE             160
32 MB cache       90                    1 PE             90
level-0 ring      1000                  32 PEs           31
level-1 ring      4000                  1088 PEs         3.7

Relative Latencies
Location            Latency (cycles)
subcache            2
cache               18
ring 0              150
ring 1              500
page fault (disc)   400,000

Copied (read-only) sub-pages reside in more than one cache and thus provide low-latency access to constant information.
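To see why the hierarchy still performs well, here is an illustrative average-latency calculation in Python; the latencies come from the table above, but the hit fractions are assumed purely for illustration and are not KSR1 data:

# Illustrative average access latency; latencies (cycles) are from the table,
# hit fractions are ASSUMED and chosen only to show the shape of the result.
latency  = {"subcache": 2, "cache": 18, "ring0": 150, "ring1": 500, "disc": 400_000}
fraction = {"subcache": 0.90, "cache": 0.08, "ring0": 0.015, "ring1": 0.0049, "disc": 0.0001}

avg = sum(fraction[level] * latency[level] for level in latency)
print(f"average latency ~ {avg:.1f} cycles")   # ~48 cycles: the rare disc faults dominate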
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 285
Shared Memory Systems
KSR1 Performance - how did it succeed?
Like most other parallel architectures, it relies on locality.
Locality justifies the workings of:
• virtual memory systems (working sets)
• caches (hit rates)
• interprocessor connection networks
Kendall Square Research claim that the locality present in massively-parallel programs can be exploited by their architecture.
1991 - 2nd commercial machine is installed in Manchester Regional Computer Centre
1994 - upgraded to 64-bit version
1998 - Kendall Square Research went out of business; patents transferred to Sun Microsystems
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 286
The Cray T3D
Introduction
The Cray T3D is the successor to several generations of conventional vector processors. The T3D has been replaced by the newer T3E, but the T3E is much the same as the T3D.
The T3E (with 512 processors) is capable of 0.4 TFlops.
The SV1ex (unveiled 7/11/00) is capable of 1.8 TFlops with 1000 processors - normally delivered as 8-32 processor machines.
[Photos: T3D, water-cooled T3E, SV1]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 287
The Cray T3D
Introduction
Like every other manufacturer, Cray would like to deliver:
• 1000+ processors with GFLOPs performance
• 10s of Gbytes/s per processor of communication bandwidth
• 100 ns interprocessor latency
…but they can't afford to - just yet.
They have tried to achieve these goals by:
• MIMD - multiple co-operating processors will beat small numbers of intercommunicating ones (even vector supercomputers)
• distributed memory
• communication at the memory-access level, keeping latency short and packet size small
• a scalable communications network
• commercial processors (DEC Alpha)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 288
The T3D Network
After simulation, the T3D network was chosen to be a 3D torus (as is the T3E's).
Note:
config.            max latency   average latency
8-node ring        4 hops        2 hops
2D, 4*2 torus      3 hops        1.5 hops
3D, 2*2*2 torus    2 hops        1 hop
The Cray T3D
[Diagrams: a 4*4 2D torus; a cube redrawn as a 4*2 2D torus; a hypercube]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 289
T3D Macro-architecture
The T3D designers have decided that the programmer's view of the architecture should include:
• globally-addressed, physically-distributed memory characteristics
• visible topological relationships between PEs
• synchronisation features visible from a high level
Their goal is led by the need to provide a slowly-changing view (to the programmer) from one hardware generation to the next.
T3D Micro-architecture
Rather than choosing to develop their own processor, Cray selected the DEC Alpha:
• 0.75 µm CMOS RISC processor core
• 64-bit bus
• 150 MHz, 150 MFLOPS, 300 MIPS (2 instructions/cycle)
• 32 integer and 32 FP registers
• 8 Kbytes instruction and 8 Kbytes data caches
• 43-bit virtual address space
The Cray T3D
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 290
Latency Hiding
The DEC Alpha has a FETCH instruction which allows memory to be loaded into the cache before it is required in an algorithm.
• This runs asynchronously with the processor.
• 16 FETCHes may be in progress at once - they are FIFO queued.
• When data is received, it is slotted into the FIFO, ready for access by the processor.
• The processor stalls if data is not available at the head of the FIFO when needed.
• Stores do not have a latency - they can proceed independently of the processor (data dependencies permitting).
Synchronisation
Barrier Synchronisation (a software sketch follows at the end of this slide)
• no process may advance beyond the barrier until all processes have arrived
• used as a break between 2 blocks of code with data dependencies
• supported in hardware - 16 special registers; bits set to 1 on barrier creation, set to 0 by each arriving process, hardware interrupt on completion
Messaging (a form of synchronisation)
• T3D exchanges 32-byte messages + 32-byte control headers
• messages are queued at the target PE, and returned to the sender PE's queue if the target queue is full
The Cray T3D
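Here is the promised sketch: a minimal software analogue of barrier synchronisation, using Python threads in place of PEs. The T3D does this in hardware; the sketch only illustrates the semantics.

import threading

barrier = threading.Barrier(4)     # stand-in for one of the T3D's hardware barrier registers

def pe(pid):
    # ... block 1: work that later blocks depend on ...
    print(f"PE {pid}: block 1 done")
    barrier.wait()                 # no PE proceeds until all four have arrived
    # ... block 2: may now safely use the other PEs' block-1 results ...
    print(f"PE {pid}: block 2 started")

threads = [threading.Thread(target=pe, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()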
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 291
Introduction
The Connection Machine family of supercomputers has developed since the first descriptions were published in 1981. Today the CM5 is one of the fastest available supercomputers.
In 1981 the philosophy of the CM founders was for a machine capable of sequential program execution, but where each instruction was spread to use lots of processors.
The CM-1 had 65,536 processors organised in a layer between two communicating planes:
The Connection Machine Family
[Diagram: a host drives a broadcast control network and a hypercube data network; between them sits a plane of 65,536 cells, each a single-bit processor (P) with 4 kbit of memory (M). Total memory = 32 Mbytes]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 292
Introduction (cont'd)
Each single-bit processor can:
• perform single-bit calculations
• transfer data to its neighbours or via the data network
• be enabled or disabled (for each operation) by the control network and its own stored data
The major lessons learnt from this machine were:
• A new programming model was needed - that of virtual processors. One "processor" could be used per data element, and a number of data elements combined onto each actual processor. The massive concurrency makes programming and compiler design clearer.
• 32 Mbytes was not enough memory (even in 1985!)
• It was too expensive for AI - but physicists wanted the raw processing power
The Connection Machine Family
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 293
The Connection Machine 2
This was an enlarged CM-1 with several enhancements. It had:
• 256 kbit DRAM per CPU
• clusters of 32 bit-serial processors augmented by a floating-point chip (2048 in total)
• parallel I/O added to the processors - 40 discs (RAID - Redundant Array of Inexpensive Disks)
• a graphics frame buffer
In addition, multiple hosts could be added to support multiple users; the plane of small processors could be partitioned.
Architectural lessons:
• Programmers used a high-level language (Fortran 90) rather than a lower-level parallel language. F90 contains array operators, which provide the parallelism directly. The term data parallel was coined for this style of computation.
• Array operators compiled into instructions sent to separate vector or bit processors.
• This SIMD programming model gives synchronisation between data elements in each instruction, but a MIMD processor engine doesn't need such constraints.
• Differences between shared (single address space) and distributed memory blur.
• The data network now carries messages which correspond to memory accesses.
• The compiler places memory and computations optimally, but statically.
• Multiple hosts are awkward compared with a single timesharing host.
The Connection Machine Family
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 294
The Connection Machine 5
This architecture is more orthogonal than the earlier ones. It just uses larger, multi-bit processors, but a similar communication architecture to the CM-1 and CM-2.
Design goals were:
• > 1 TFLOPs
• several Tbytes of memory
• > 1 Tbit/s of I/O bandwidth
The Connection Machine Family
[Diagram: hosts (H), worker (W) and I/O processors on the broadcast control network and hypercube data network; host and worker processors are identical (hosts have more memory)]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 295
The CM-5 Processor
To save on development effort, Thinking Machines used a commodity SPARC RISC processor for all the hosts and workers. RISC CPUs are optimised for workstations, so they added extra hardware and fast memory paths.
Each node has:
• 32 Mbytes of memory
• a network interface
• vector processors capable of up to 128 MFLOPS
• vector-to-memory bandwidth of 0.5 Gbytes/s
Caching doesn't really work here.
The Connection Machine Family
[Diagram: node with a SPARC CPU and cache on the main I/O bus, and 32 Mbytes of memory served by four vector processors over 64-bit, 0.5 Gbyte/s vector ports]
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 296
Fastest Supercomputers – June 2006
Rank Site Computer Processors Year Rmax (GFlops) Rpeak (GFlops)
1 LLNL US Blue Gene – IBM 131072 2005 280600 367000
2 IBM US Blue Gene –IBM 40960 2005 91290 114688
3 LLNL US ASCI Purple IBM 12208 2006 75760 92781
4 NASA US Columbia – SGI 10160 2004 51870 60960
5 CEA, France Tera 10, Bull SA 8704 2006 42900 55705.6
6 Sandia US Thunderbird – Dell 9024 2006 38270 64972.8
7 GSIC, Japan TSUBAME - NEC/Sun 10368 2006 38180 49868.8
8 Julich, Germany Blue Gene – IBM 16384 2006 37330 45875
9 Sandia, US Red Storm - Cray Inc. 10880 2005 36190 43520
10 Earth Simulator, Japan Earth-Simulator, NEC 5120 2002 35860 40960
11 Barcelona Super Computer Centre, Spain MareNostrum – IBM 4800 2005 27910 42144
12 ASTRON/University Groningen, Netherlands Stella (Blue Gene) – IBM 12288 2005 27450 34406.4
13 Oak Ridge, US Jaguar - Cray Inc. 5200 2005 20527 24960
14 LLNL, US Thunder - California Digital Corporation 4096 2004 19940 22938
15 Computational Biology Research Center, Japan Blue Protein (Blue Gene) –IBM 8192 2005 18200 22937.6
16 Ecole Polytechnique, Switzerland Blue Gene - IBM 8192 2005 18200 22937.6
17 High Energy Accelerator Research Organization, Japan KEK/BG Sakura (Blue Gene) – IBM 8192 2006 18200 22937.6
18 High Energy Accelerator Research Organization, Japan KEK/BG Momo (Blue Gene) – IBM 8192 2006 18200 22937.6
19 IBM Rochester, On Demand Deep Computing Center, US Blue Gene - IBM 8192 2006 18200 22937.6
20 ERDC MSRC, United States Cray XT3 - Cray Inc. 4096 2005 16975 21299
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 297
History of Supercomputers
1966/7: Michael Flynn's Taxonomy & Amdahl's Law
1976: Cray Research delivers 1st Cray-1 to LANL
1982: Fujitsu ships 1st VP200 vector machine, ~500 MFlops
1985: CM-1 demonstrated to DARPA
1988: Intel delivers iPSC/2 hypercubes
1990: Intel produces iPSC/860 hypercubes
1991: CM5 announced
1992: KSR1 delivered
1992: Maspar delivers its SIMD machine, the MP2
1993: Cray delivers Cray T3D
1993: IBM delivers SP1
1994: SGI Power Challenge
1996: Hitachi Parallel System
1997: SGI/Cray Origin 2000 delivered to LANL, 0.7 TFlops
1997: Intel Paragon (ASCI Red), 2.3 TFlops, to Sandia Nat Lab
1998: Cray T3E delivered to US military, 0.9 TFlops
2000: IBM (ASCI White), 7.2 TFlops, to Lawrence Livermore NL
2002: HP (ASCI Q), 7.8 TFlops, to Los Alamos Nat Lab
2002: NEC Earth Simulator, Japan, 36 TFlops
2002: 5th fastest machine in the world is a Linux cluster (2304 processors)
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 298
History of Supercomputers
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 300
The fundamentals of Computing have remained unchanged for 70 years
• During all the rapid development of computers in that time, little has changed since Turing and von Neumann.
Quantum computers are potentially different.
• They employ quantum-mechanical principles that expand the range of operations beyond those possible on a classical computer.
• The three main differences between classical and quantum computers are:
• the fundamental unit of information is a qubit
• the range of logical operations
• the process of determining the state of the computer
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 301
Qubits
Classical computers are built from bits
two states: 0 or 1
Quantum computers are built from qubits
Physical systems which possess states analogous to 0 or 1, but which can also be in states between 0 and 1
The intermediate states are known as superposition states
A qubit – in a sense – can store much more information than a bit
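A minimal numerical sketch of a qubit (assuming numpy): a pair of complex amplitudes whose squared magnitudes give the measurement probabilities. The particular superposition chosen here is arbitrary.

import numpy as np

# A qubit is a pair of complex amplitudes (a, b) with |a|^2 + |b|^2 = 1.
a, b = 1 / np.sqrt(2), 1j / np.sqrt(2)       # an equal superposition of 0 and 1
state = np.array([a, b])

probs = np.abs(state) ** 2                   # measurement probabilities for 0 and 1
print(probs)                                 # [0.5 0.5]

outcome = np.random.choice([0, 1], p=probs)  # measuring collapses to a single bit
print(outcome)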
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 302
Range of logical operations
Classical computers operate according to binary logic
Quantum logic gates take one or more qubits as input and produce one or more qubits as output.
Qubits have states corresponding to 0 and 1, so quantum logic gates can emulate classical logic gates.
With superposition states between 0 and 1 there is a great expansion in the range of quantum logic gates.
• e.g. quantum logic gates that take 0 and 1 as input and produce as output different superposition states between 0 and 1 – no classical analogue
This expanded range of quantum gates can be exploited to achieve greater information processing power in quantum computers
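A sketch of both kinds of gate, represented as 2x2 matrices acting on a qubit's amplitude vector (numpy assumed): the X gate emulates a classical NOT, while the Hadamard gate H produces a superposition and has no classical analogue.

import numpy as np

zero = np.array([1, 0])               # amplitude vector for the classical bit 0

X = np.array([[0, 1],
              [1, 0]])                # quantum NOT: emulates the classical gate
print(X @ zero)                       # [0 1], i.e. the bit value 1

H = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)  # Hadamard: no classical analogue
print(H @ zero)                       # [0.707 0.707], an equal superposition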
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 303
Determining the State of the Computer
In classical computers we can read out the state of all the bits in the computer at any time
In a Quantum computer it is in principle impossible to determine the exact state of the computer.
i.e. we can’t determine exactly which superposition state is being stored in the qubits making up the computer
We can only obtain partial information about the state of the computer
Designing algorithms is a delicate balance between exploiting the expanded range of states and logical operations, and working within the restricted readout of information.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 304
[Diagram: a single photon arrives at a beam-splitter, with detector A and detector B on the two output paths. There is an equal probability of the photon reaching A or B. What actually happens? Does the photon travel each path at random?]
[Diagram: two beam-splitters and two mirrors arranged so the two paths recombine before detector A and detector B (a Mach-Zehnder arrangement). What actually happens here? If the path lengths are the same, photons always hit A.]
A single photon travels both routes simultaneously
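The interferometer can be mimicked numerically. This sketch models each beam-splitter as a simple 2x2 unitary acting on the two path amplitudes; real beam-splitters differ by phase conventions, so this is illustrative only.

import numpy as np

# Each beam-splitter acts on the amplitudes for the two paths (towards A, towards B).
B = np.array([[1, 1],
              [1, -1]]) / np.sqrt(2)

photon = np.array([1.0, 0.0])              # photon enters on one input port

two_splitters = B @ (B @ photon)           # equal path lengths, paths recombine
print(np.abs(two_splitters) ** 2)          # [1. 0.] - every photon hits detector A

after_first = B @ photon
after_first[1] = 0.0                       # block one of the two paths
print(np.abs(B @ after_first) ** 2)        # [0.25 0.25] - A and B now equally likely
                                           # (the other half is absorbed by the block)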
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 305
Photons travel both paths simultaneously.
If we block either of the paths then A or B become equally probable
This is quantum interference and applies not just to photons but all particles and physical systems
Quantum computation is all about making this effect work for us.
In this case the photon is in a coherent superposition of being on both paths at the same time.
Any qubit can be prepared in a superposition of two logical states – a qubit can store both 0 and 1 simultaneously, and in arbitrary proportions.
Any quantum system with at least two discrete states can be used as a qubit – e.g. energy levels in an atom, photons, trapped ions, spins of atomic nuclei…..
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 306
Once the qubit is measured, however, only one of the two values it stores is detected, at random – just as the photon is detected on only one of the two paths.
Not very useful – but….
Consider a traditional 3-bit register: it can represent 8 different numbers (000 - 111), but only one at a time.
A quantum register of 3 qubits can represent all 8 numbers at the same time, in quantum superposition. The bigger the register, the more numbers we can represent at the same time.
A 250-qubit register could hold more numbers than there are atoms in the known universe – all on 250 atoms…..
But we only see one of these if we measure the register's contents.
We can now do some real quantum computation…..
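A sketch of the 3-qubit register (numpy assumed): eight amplitudes held at once, but a read-out returns just one of the eight numbers.

import numpy as np

n = 3
state = np.ones(2 ** n) / np.sqrt(2 ** n)   # uniform superposition of all 8 numbers

probs = np.abs(state) ** 2
print(probs)                                # each of 000..111 held with probability 1/8

print(np.random.choice(2 ** n, p=probs))    # reading the register returns just ONE number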
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 307
Mathematical Operations can be performed at the same time on all the numbers held in the register.
If the qubits are atoms then tuned laser pulses can affect their electronic states so that initial superpositions of numbers evolve into different superpositions.
Basically a massively parallel computation
A QC can perform a calculation on 2^L numbers in a single step; doing so would take 2^L steps (or 2^L processors) in a conventional architecture.
Only good for certain types of computation….
NOT information storage – the register can hold many states at once, but we can only see one of them.
Quantum interference allows us to obtain a single result that depends logically on all 2^L of the intermediate results.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 308
Grover’s Algorithm
Searches an unsorted list of N items in only √N steps.
Conventionally this scales as N/2 – by brute-force searching. The quantum computer can search all the entries at the same time.
BUT if the QC is merely programmed to print out the result at that point it will not be any faster than a conventional system.
Only one of the N paths would check the entry we are looking for, so the probability that measuring the computer's state gives the correct answer is only 1/N – we would need about as many repetitions as a classical search.
BUT if we leave the information in the computer, unmeasured, a further quantum operation can cause it to affect the other paths. If we repeat the operation √N times, a measurement will return information about which entry contains the desired number with a probability of 0.5. Repeating just a few more times finds the entry with a probability extremely close to 1.
Can be turned into a very useful tool for searching, minimisation, or evaluation of the mean.
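A toy implementation of Grover's algorithm (numpy assumed; the list size and marked index are arbitrary), using the sign-flip oracle and inversion about the mean described above:

import numpy as np

N, marked = 16, 11                  # toy unsorted "list" of N entries; entry 11 is the target
state = np.ones(N) / np.sqrt(N)     # uniform superposition: all entries examined at once

for _ in range(int(np.pi / 4 * np.sqrt(N))):   # ~ (pi/4)*sqrt(N) = 3 iterations here
    state[marked] *= -1               # oracle: mark the target by flipping its sign
    state = 2 * state.mean() - state  # diffusion: inversion about the mean

print(np.abs(state[marked]) ** 2)   # ~0.96 - a measurement almost surely returns entry 11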
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 309
Cryptanalysis
The biggest use of quantum computing is in cracking encrypted data.
Cracking DES (the Data Encryption Standard) requires a search among 2^56 keys.
Conventionally, even at 1 million keys per second, this takes more than 1000 years.
A QC using Grover’s algorithm could do it in less than 4 minutes.
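The arithmetic behind both claims, assuming 10^6 key tests per second in each case:

SECONDS_PER_YEAR = 3.15e7
keys = 2 ** 56                        # size of the DES key space

print(keys / 1e6 / SECONDS_PER_YEAR)  # ~2300 years for a classical brute-force search

grover_steps = keys ** 0.5            # ~2^28 steps with Grover's algorithm
print(grover_steps / 1e6 / 60)        # ~4.5 minutes at the same rate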
Factorisation is the key to the RSA encryption system.
Conventionally, the time taken to factorise a number increases exponentially with the number of digits.
The largest number ever factorised contained 129 digits.
There is no way to factorise a 1000-digit number – conventionally…..
A QC (running Shor's algorithm) could do this in a fraction of a second.
Already a big worry for data security; it may be only a matter of a few years before this becomes available.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 310
Decoherence: the obstacle to quantum computation
For a qubit to work successfully it must remain in an entangled quantum superposition of states.
As soon as we measure the state it collapses to a single value.
This happens even if we make the measurement by accident.
[Diagram: double-slit experiment - an electron source, two slits, and an interference pattern; in the second version a spin particle sits close to the left slit]
In a conventional double-slit experiment, the wave amplitudes corresponding to an electron (or photon) travelling along the two possible paths interfere. If another particle with spin is placed close to the left slit, a passing electron will flip the spin. This "accidentally" records which path the electron took, and causes the loss of the interference pattern.
EE3.cma - Computer Architecture04/21/23 EE3.cma - Computer Architecture 311
Decoherence: the obstacle to quantum computation
In reality it is very difficult to prevent qubits from interacting with the rest of the world.
The best solution (so far) is to build quantum computers with fault-tolerant designs, using error-correction procedures.
The result is that we need more qubits – between 2 and 5 times the number needed in an "ideal world".