CS6461 – Computer Architecture, Spring 2015
Morris Lancaster, Instructor. Adapted from Professor Stephen Kaisler's Notes
Lecture 12
Multicore Architectures
Moore’s Law
Moore’s Law: 2005
Single Thread Performance is Falling Off
Source: SPECint published data
Multiprocessors
• We moved from one processor in a system to multiple processors in a system
• Speedup: near-linear until interprocessor or remote memory communication overwhelms performance increase
• Reaching the single-core limit of performance
• So, as multiple processors improved performance, look for another performance boost from multiple cores
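A rough illustration of the speedup limit (numbers assumed here, not from the slides): by Amdahl's law, a program with parallel fraction f on p processors speeds up by at most 1 / ((1 - f) + f/p). With f = 0.9 and p = 8 that is 1 / (0.1 + 0.1125) ≈ 4.7, already well short of linear, and interprocessor communication overhead only pushes it lower.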
So, What’s the Story …?
• Functional units
  – Superscalar is known territory
  – Diminishing returns for adding more functional blocks
  – Alternatives like VLIW have been considered and rejected by the market
  – Single-threaded architectural performance is pegged
• Data paths
  – Increasing bandwidth between functional units in a core makes a difference
    • Such as a comprehensive 64-bit design, but then where to?
    • Is 128 bits really needed in a processor?
  – Do we know how to use it?
And, the Story ….?
• Pipeline
  – A deeper pipeline buys more instructions in processing at the expense of a larger cache miss penalty and fewer instructions per clock
  – A shallow pipeline gives better instructions per clock at the expense of the number of instructions in processing
  – Industry is converging on a middle ground of 9 to 11 stages
    • Successful RISC CPUs are in the same range
• Cache
  – Cache size buys performance at the expense of die size; it is a direct hit to manufacturing cost
  – Deep-pipeline cache miss penalties are reduced by larger caches
  – Not always the best match for shallow-pipeline cores, as cache miss penalties are not as steep
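A rough worked example of the trade-off (numbers assumed for illustration, not from the slides): with a base CPI of 1 and a cache miss on 2% of instructions, a deep pipeline with a 20-cycle miss penalty gives an effective CPI of about 1 + 0.02 × 20 = 1.4, while a shallow pipeline with a 10-cycle penalty gives about 1.2. A larger cache that halves the miss rate to 1% narrows the figures to roughly 1.2 and 1.1, which is why big caches pay off more for deep pipelines.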
Manufacturing
• Moore's Law isn't dead, more transistors for everyone!
  – But… it doesn't really mention scaling transistor power
  – Transistors are not free!
  – More functional units, deeper pipelines, and larger caches mean more transistors => real estate problems!
• Chemistry and physics at nano-scale
  – Stretching materials science
  – Voltage doesn't scale yet
  – Transistor leakage current is increasing
• As manufacturing economies and frequency increase, power consumption is increasing disproportionately
There are no process or architectural quick-fixes
Multicore Processor
• Definition: A multicore processor is a chip with multiple processors (cores). What constitutes a "core" is not well defined, so what counts as a core varies across implementations.
• For example, the Cell has a PowerPC core and 8 synergistic processing elements (SPEs); all are considered cores, although the SPEs have some limits on functionality.
Why Multicore?
• Can’t make a single core faster (Physics and noise are problems)
• Moore's Law: the same core is 2X smaller per generation
  – Need to keep adding value to maintain average selling price
  – More and more cache doesn't cut it
  – More transistors per generation
• Use all those transistors to put multiple processors (cores) on a chip
  – 2X cores per generation
  – Cores can potentially be optimized for power
• But harder to program, except for independent tasks
  – How many independent tasks are there to run at once?
Core Design Parameters - ISA
Legacy
  Pro: Compiler and software support well understood
  Con: May be inefficient for certain apps requiring higher performance to achieve end-to-end performance objectives
Custom
  Pro: Can be optimized for targeted applications
  Con: Compiler and software support may be nonexistent
RISC
  Pro: Easy microarchitecture design; easy compiler design
  Con: Code size may be large and inefficient for certain types of apps
CISC
  Pro: More instructions may allow for better optimization, smaller code size
  Con: Complex microarchitecture design to support all instructions; complex compiler design
Special Instructions
  Pro: Highly optimized code for targeted apps; instructions specific to the app's requirements
  Con: Complex design; often requires hand coding, as there is no compiler support

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine, 26(6)
Core Design Parameters - Microarchitecture
In-order
  Pro: Low to medium complexity; low power; low area, so many can be placed on a die
  Con: Low to medium single-thread performance
Out-of-order
  Pro: Very fast single-thread performance due to dynamic scheduling of instructions
  Con: High design complexity; large area; high power
SIMD
  Pro: Very efficient for highly data-parallel or vector code
  Con: Underutilized if code cannot be parallelized; not applicable for control-dominated apps
VLIW
  Pro: May issue many more instructions than out-of-order due to reduced complexity
  Con: Requires advanced compiler support; may perform poorly if the compiler cannot statically find ILP

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine, 26(6)
Memory System Design Parameters – On-Die
Caches
  Pro: Transparently provide the appearance of low-latency access to main memory; can be configured into multiple levels
  Con: No real-time performance guarantee; must use die area to store tags
Local Store
  Pro: Stores more data per die area than caches; can provide a real-time performance guarantee
  Con: Must be software controlled (with performance implications)

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine, 26(6)
Memory System Design Parameters – Coherence
Yes (coherent)
  Pro: Provides a shared-memory multiprocessor; supports all programming models
  Con: Hard to implement
No (not coherent)
  Pro: Easy to implement
  Con: Supports a limited number of programming models

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine, 26(6)
Memory System Design Parameters – Interconnect
Bus
  Pro: Easy to implement; all processors see uniform latencies to other processors and memories
  Con: Low bisection bandwidth; supports a small number of cores
Ring
  Pro: Higher bisection bandwidth than a bus; supports a larger number of processors
  Con: Non-uniform access latencies with high variance; requires routing logic
Network-on-Chip
  Pro: High bisection bandwidth; supports a large number of cores; non-uniform latencies with lower variance than a ring
  Con: Requires sophisticated routing and arbitration logic
Crossbar
  Pro: Highest bisection bandwidth; supports a large number of cores; uniform access latencies
  Con: Requires sophisticated arbitration logic; requires a large die area

Ref: Blake, G., R. G. Dreslinski, and T. Mudge. 2009. "A Survey of Multicore Processors," IEEE Signal Processing Magazine, 26(6)
Multicore: Where Processor and System Collide
• Scales performance
  – Dedicated resources for multiple simultaneous threads
  – Multiple cores will contend for memory and I/O bandwidth
    • The Northbridge is the bottleneck – it connects the cores and caches
    • Integrating the Northbridge into the chip eliminates much of the bottleneck
    • Northbridge architecture has a significant impact on performance
    • Cores, cache, and Northbridge must be balanced for optimal performance
  – Most application software doesn't need to do anything to benefit from multicore
  – Be aware that, for a processor within a given power envelope:
    • Fewer cores will clock faster than more cores
      – Single-threaded, performance-sensitive applications
    • More cores will out-perform fewer cores for:
      – Multi-threaded applications
      – Multi-tasking response times
      – Transaction processing
Basic Idea: Multicore Architectures
• Replicate multiple processor cores on a single die. The cores fit on a single processor chip utilizing one socket
Basic Idea: Cores Run in Parallel
[Figure: four cores (core 1 through core 4), each running several threads in parallel]
Simultaneous Multithreading (SMT)
• Permits multiple independent threads to execute SIMULTANEOUSLY on the SAME core
• Weaving together multiple “threads” on the same core
• Example: if one thread is waiting for a floating point operation to complete, another thread can use the integer units
Without SMT, only a single thread can run at any given time
[Figure: a single core's pipeline (BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, bus) with Thread 1 occupying the floating-point units]
Without SMT, only a single thread can run at any given time
[Figure: the same pipeline, this time with Thread 2 occupying the integer units]
SMT processor: both threads can run concurrently
[Figure: the same pipeline with Thread 1 (floating point) and Thread 2 (integer) running concurrently on one SMT core]
But: Can’t simultaneously use the same functional unit
[Figure: the same pipeline with Thread 1 and Thread 2 both contending for the integer units, marked IMPOSSIBLE. This scenario is impossible with SMT on a single core (assuming a single integer unit).]
SMT not a “true” parallel processor
• Enables better threading (e.g., up to 30%)
• OS and applications perceive each simultaneous thread as a separate "virtual processor"
• The chip has only a single copy of each resource
• Compare to multi-core: each core has its own copy of resources
Multi-core: threads can run on separate cores
[Figure: two complete cores, each with its own BTB and I-TLB, decoder, trace cache, rename/alloc, uop queues, schedulers, integer and floating-point units, L1 D-cache and D-TLB, uCode ROM, L2 cache and control, and bus; Thread 1 runs on one core and Thread 2 on the other]
Multi-core: threads can run on separate cores
[Figure: the same dual-core arrangement, this time with Thread 3 and Thread 4 on separate cores]
Combining Multi-core and SMT
• Cores can be SMT-enabled (or not)
• The different combinations:
  – Single-core, non-SMT: standard uniprocessor
  – Single-core, with SMT
  – Multi-core, non-SMT
  – Multi-core, with SMT: our fish machines
• The number of SMT threads: 2, 4, or sometimes 8 simultaneous threads
• Intel calls them “hyper-threads”
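A minimal sketch (not from the slides) of how software sees these combinations: the OS exposes one logical processor per hardware thread, i.e., physical cores times SMT threads per core, and a POSIX program can query that count with sysconf.

    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        /* Logical processors visible to the OS = physical cores x SMT threads per core */
        long n = sysconf(_SC_NPROCESSORS_ONLN);
        printf("OS sees %ld logical processors\n", n);
        return 0;
    }

On a dual-core chip with 2-way SMT this would report 4, matching the four "virtual processors" the scheduler can use.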
SMT Dual-core: all four threads can run concurrently
[Figure: two SMT cores; Threads 1 through 4 run concurrently, two threads per core]
Multicores: Relative Speedup
[Figure: relative speedup of multicore organizations. Source: William Stallings, Computer Organization and Architecture, 8th Edition]
Multicores: Speedup w/ Overhead
[Figure: multicore speedup with overhead. Source: William Stallings, Computer Organization and Architecture, 8th Edition]
Multicore Connectivity
[Figure: multicore connectivity topologies – a Ring Multicore of processor/cache (p/c) nodes linked by switches (s), a Bus Multicore with p/c nodes on a shared BUS, and a Mesh Multicore with a switch at each p/c node]
We have seen similar topologies before for multiprocessor systems
Multicore Architectures
[Figure: four cores, each with a private L1 cache; shared L2 caches; L3 caches connected through a crossbar to Memory Module 1, Memory Module 2, and I/O]
Homogeneous with shared caches and a crossbar
Multicore Architectures
Heterogeneous with caches, local store, and ring bus
[Figure: one core (2x SMT) with L1 and L2 caches plus several cores, each with its own local store, all connected by a ring bus to a memory module and I/O]
Multicore Architecture: Alternatives
IBM Cell Processor
• Joint collaboration of IBM/Sony/Toshiba
• Developed as a new/next-gen processor
  – Initially for the PlayStation 3
  – Other multimedia applications (Blu-ray, HDTV)
  – Server systems
• Cell designed for vector computations
  – Vector arithmetic faster than scalar arithmetic
• Designed for fast SIMD processing
• PowerPC Processing Element (PPE)
  – PPE register file: 32 x 128-bit vectors
  – PPE: dual-issue in-order processor
  – In-order and out-of-order computation (load instructions)
  – 1 x PPE 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
• PPE design goals
  – Maximize performance/power ratio
  – Maximize performance/area ratio
• PPE main tasks
  – Run the OS (Linux)
  – Coordinate with the SPEs
IBM Cell Processor
• Synergistic Processing Element (SPE):
  – An SPE is a self-contained vector processor (SIMD) that acts as a co-processor
  – The SPE's ISA is a cross between VMX and the PS2's Emotion Engine
  – SPE register file: 128 x 128-bit vectors
  – In-order (again to minimize circuitry and save power)
  – Statically scheduled (the compiler plays a big role)
  – No dynamic prediction hardware (relies on compiler-generated hints)
  – 8 x SPE cores (LS: 256 KB, SIMD machines)
  – Both PPE and SPEs have vector instruction capability
• PPE and SPEs @ 3.2 GHz
• External Rambus XDR memory
  – Two channels @ 3.2 GHz (400 MHz, octal data rate)
• I/O controller @ 5 GHz
IBM Cell Processor
Element Interconnect Bus (EIB):
• Connects the various on-chip elements
• Data-ring structure with the control of a bus
• 4 unidirectional rings, with 2 rings running counter to the direction of the other 2
• Worst-case maximum latency is only half the distance of the ring
• Each ring is 16 bytes wide and runs at half the core clock frequency (core clock frequency ~3.2 GHz)
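A back-of-the-envelope figure implied by these numbers (not stated on the slide): 16 bytes per cycle at half of 3.2 GHz is 16 B × 1.6 GHz ≈ 25.6 GB/s for a single transfer on one ring, before counting the multiple concurrent transfers the EIB allows.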
IBM Cell Processor: Chip Photo
IBM Cell Processor: PPE Architecture
IBM Cell Processor: SPE Architecture
IBM Cell Processor: Programming
• Instructions for the 8 SPEs are created in a different language (ISA) than for the PowerPC core
• Separate compiler for the SPE
  – Embed the SPE executable into a library
  – 'extern spe_program_handle_t <program_name>'
  – Compile the main PPU program with the library
• Thread-based model, push/pull data
  – Thread scheduling by the user
  – Five layers of parallelism:
    • Task parallelism (MPMD)
    • Data parallelism (SPMD)
    • Data-streaming parallelism (DMA double buffering)
    • Vector parallelism (SIMD – up to 16-way)
    • Pipeline parallelism (dual-pipelined SPEs)
• Need to think in terms of SIMD nature of dataflow to get maximum performance from SPUs
• SPU local store needs to perform coherent DMA access for accessing system memory
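A minimal sketch of the thread-based model, assuming the libspe2 API (spe_context_create, spe_program_load, spe_context_run) and a hypothetical embedded SPE program named spe_kernel; the slides give only the extern declaration, so the surrounding calls here are an assumption, not the course's reference code.

    #include <stdio.h>
    #include <libspe2.h>

    /* SPE executable embedded into a library by the SPE toolchain */
    extern spe_program_handle_t spe_kernel;   /* hypothetical program name */

    int main(void) {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);   /* one SPE context */
        if (ctx == NULL) { perror("spe_context_create"); return 1; }

        spe_program_load(ctx, &spe_kernel);            /* load the embedded SPE image */
        /* Run the SPE program; in practice one PPE pthread per SPE issues this call
           so that several SPEs execute concurrently. */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);

        spe_context_destroy(ctx);
        return 0;
    }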
IBM Cell Processor: Programming
IBM Cell Processor: Programming
• Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA
• For SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization
• Allocate SPE program data in system memory (shared-memory view) and have the SPE compiler automatically manage the movement of data
  – A naive compiler inserts an explicit DMA transfer for each access to shared memory
  – Optimized: employ a software cache mechanism that permits reuse of the temporary buffers in the LS
• Using the SPE linker and an embedding tool
  – Generate a PPE executable that contains the SPE binary embedded within the data section
• The PPE object is then linked, using a PPE linker
  – with the runtime libraries required for thread creation and management, to create a bound executable for the Cell BE program
AMD Athlon Barcelona
Basic Idea: Programming Multiple Cores
• Programmer:
  – Programmers must use threads or processes
  – Write parallel algorithms
• Parallel programming is harder than normal programming because it involves:
  – Additional techniques
  – Problem partitioning
  – Synchronization
  – Access control
  – …
• 90% of programmers don’t do parallel programming.
Basic Idea: Programming Multiple Cores
• Operating system interaction:
  – Most major OSes support multi-core today
  – The OS perceives each core as a separate processor
  – The OS scheduler maps threads/processes to different cores
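A minimal, Linux-specific sketch (not from the slides) of how a program can influence that mapping by pinning a thread to a particular core; pthread_attr_setaffinity_np and sched_getcpu are GNU extensions assumed here.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        (void)arg;
        printf("worker running on CPU %d\n", sched_getcpu());
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_attr_t attr;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);                                  /* request core 0 */
        pthread_attr_init(&attr);
        pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

        pthread_create(&t, &attr, worker, NULL);           /* scheduler keeps this thread on core 0 */
        pthread_join(t, NULL);
        return 0;
    }

Left unpinned, the OS scheduler is free to place (and migrate) the thread on any core.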
Multicore Programming: Shared Memory
• The shared-memory model: cores share a single memory
• Typically written using OpenMP (http://openmp.org/wp/) – see the sketch after this list
• Software constructs that allow individual processes to physically share certain portions of the same address space
  – Directives to compilers: FORTRAN, C/C++
• Seems intuitive (physical memory chips are shared by the cores)
  – Core virtualization?
• Pros
  – Easy to write
  – Communication coordination between processes is built-in
  – Allows support of both sequential and parallel processes
  – Easily scalable, up to a point
• Cons
  – Not very general, geared toward loop-level parallelism
  – Does not support asynchronous events very well
  – Not easily scalable to distributed systems
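A minimal OpenMP sketch (not from the slides) of the directive style mentioned above: one compiler directive splits a loop across the cores, which all read and write the same shared arrays.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        enum { N = 1000000 };
        static double a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* The directive splits the iterations across threads; the arrays live
           in the single shared memory, so no explicit communication is needed. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];

        printf("c[42] = %.1f, max threads = %d\n", c[42], omp_get_max_threads());
        return 0;
    }

Built with an OpenMP-aware compiler (e.g., gcc -fopenmp), the same loop uses every core with no thread management in the source.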
Multicore Programming: Message Passing
• Often written using the Message Passing Interface (MPI) – see the sketch after this list
  – an API specification that allows computers to communicate with one another
• Allows communication between processes (threads) using specific message-passing calls
• All shared data is communicated through messages
• Physical memory is not necessarily shared
• Pros
  – Allows for asynchronous events
  – Does not require the programmer to write in terms of loop-level parallelism
  – Operates on multicores AND is scalable to distributed systems
  – A more general model of programming… extremely flexible
• Cons
  – Considered extremely difficult to write
  – Difficult to incrementally increase parallelism
  – Implicitly shared data (in MPI-2.0)
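A minimal MPI sketch (not from the slides): rank 0 sends a value to rank 1, and nothing is shared except the message itself, which is why the same code scales from one multicore chip to a distributed cluster.

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);     /* explicit message to rank 1 */
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpirun -np 2, the two ranks may be two cores on one chip or two nodes on a network; the program does not change.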
Multicore Programming: Transaction Model
• Instructions are grouped into sets of transactions
• All-or-nothing model of execution and completion… atomicity (see the sketch after this list)
• Suitable for certain types of applications
  – (ATMs, bank processing, database applications)
• Scalability!
• Pros
  – Scalable to large distributed systems
  – Applicable to a wide range of consumer-oriented applications
  – Does not necessarily imply a message-passing or shared-memory interface
  – Applicable to many hardware models (assuming support for atomicity)
• Cons
  – Not obviously amenable to all problems
  – Difficult to reason about for many applications
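A minimal sketch of the all-or-nothing idea using C11 atomics (an illustration of atomicity assumed here, not a full transactional-memory system): a balance update either commits in one atomic step or aborts and retries, so no other core ever observes a half-done update.

    #include <stdatomic.h>
    #include <stdio.h>

    static _Atomic long balance = 100;

    /* Withdraw 'amount': the read-check-write either commits atomically or aborts/retries. */
    static int withdraw(long amount) {
        long old = atomic_load(&balance);
        do {
            if (old < amount)
                return 0;                                   /* abort: nothing changed */
        } while (!atomic_compare_exchange_weak(&balance, &old, old - amount));
        return 1;                                           /* commit */
    }

    int main(void) {
        printf("withdraw 60: %s, balance = %ld\n", withdraw(60) ? "ok" : "fail", (long)atomic_load(&balance));
        printf("withdraw 60: %s, balance = %ld\n", withdraw(60) ? "ok" : "fail", (long)atomic_load(&balance));
        return 0;
    }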
Old Approaches Fall Short
• Pthreads
  – An Intel webinar likens it to the assembly language of parallel programming
  – Data races are hard to analyze
  – No encapsulation or modularity
  – But evolutionary, and OK in the interim (see the sketch after this list)
• DMA with external shared memory
  – DSP programmers favor DMA
  – Explicit copying from global shared memory to the local store
  – Wastes pin bandwidth and energy
  – But evolutionary, simple, modular, and small core memory footprint
• MPI
  – The province of HPC users
  – Based on sending explicit messages between private memories
  – High overheads and a large core memory footprint
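A minimal pthreads sketch (not from the slides) of why it is compared to assembly language: threads, the shared counter, and the lock protecting it are all managed by hand, and omitting the mutex silently turns the program into a data race.

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;                                /* shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&lock);                      /* omit this pair and the increments race */
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld (expected 400000)\n", counter);
        return 0;
    }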
Multicore Programming: What's the Best Model?
• The billion $$$ question…
• No great general model… (yet)
• Hardware and software issues
• "The vast majority of programmers today don't grok concurrency, just as the vast majority of programmers 15 years ago didn't yet grok objects." (1)
• Economic and cultural forces

(1) Herb Sutter, "The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software," Dr. Dobb's Journal, March 2005. (http://www.gotw.ca/publications/concurrency-ddj.htm)
"grok" = to understand deeply (Stranger in a Strange Land, Robert Heinlein)
Shared L2 Cache: Advantages
• Constructive interference reduces the overall miss rate
• Data shared by multiple cores is not replicated at the cache level
• With proper frame replacement algorithms, the amount of shared cache dedicated to each core is dynamic
  – Threads with less locality can have more cache
• Easy inter-process communication through shared memory
• Cache coherency is confined to L1
• A dedicated L2 cache gives each core more rapid access
  – Good for threads with strong locality
• Shared L3 cache may also improve performance
Multicore Challenges
• Relies on effective exploitation of multiple-thread parallelism
• Need for parallel computing model and parallel programming model
• Aggravates the memory wall
  – Memory bandwidth
    • Way to get data out of memory banks
    • Way to get data into the multi-core processor array
  – Memory latency
  – Fragments the L3 cache
• Pins become the strangle point
  – Rate of pin growth projected to slow and flatten
  – Rate of bandwidth per pin (pair) projected to grow slowly
• Requires mechanisms for efficient inter-processor coordination
  – Synchronization
  – Mutual exclusion
  – Context switching
Multicore: Advantages
• Cache coherency circuitry can operate at a much higher clock rate than is possible if the signals have to travel off-chip.
• Signals between different CPUs travel shorter distances, so those signals degrade less.
• These higher quality signals allow more data to be sent in a given time period since individual signals can be shorter and do not need to be repeated as often.
• A dual-core processor uses slightly less power than two coupled single-core processors.
Multicore: Disadvantages
• Ability of multi-core processors to increase application performance depends on the use of multiple threads within applications.
• Most current video games will run faster on a 3 GHz single-core processor than on a 2 GHz dual-core processor (of the same core architecture).
• Two processing cores sharing the same system bus and memory bandwidth limits the real-world performance advantage.
• If a single core is close to being memory bandwidth-limited, going to dual-core might only give 30% to 70% improvement.
• If memory bandwidth is not a problem, a 90% improvement can be expected.
Multicore Issues
• How many general-purpose cores is enough?
  – Conjecture: probably no more than 16, based on experience with multiprocessor systems
• Should future systems have homogeneous or heterogeneous cores (e.g., the Cell)?
• What is the best way to connect the cores on a chip?
• Are threads or processes better for programming multicore processors?
• Will software vendors charge a separate license per core or only a single license per chip?
Additional Material
AMD Opteron Processor
[Figure: AMD Opteron processor core architecture – fetch with branch prediction, 64 KB L1 instruction cache and 64 KB L1 data cache, scan/align/decode with a fastpath and microcode engine producing µops, a 72-entry instruction control unit, integer decode & rename feeding three reservation stations with ALU/AGU pairs and a multiplier, FP decode & rename feeding a 36-entry FP scheduler with FADD, FMUL, and FMISC units, and a 44-entry load/store queue.
AGU = Address Generation Unit, ALU = Arithmetic/Logical Unit, RES = Reservation Station]
AMD Athlon 64-bit Lines
http://www.amd.com/gb-uk/Processors/ProductInformation/0,,30_118_9485_13041^13043,00.html
Evolution of AMD Athlon 64-bit Processors
[Figure: evolution of AMD Athlon 64-bit processors – San Diego (L2: 1 MB), Toledo (L2: 2 x 1 MB), Windsor (L2: 2 x 1 MB), Barcelona (L2: 2 MB)]
Single-Chip Cloud Computer (SCC)
Inside the SCC
Tilera 64
• Cores connected by mesh network
• Five physical mesh networks
  – UDN, IDN, SDN, TDN, MDN
  – Each has 32 channels
  – Packet-switched
  – Wormhole routed
  – Point-to-point
• TDN and MDN are used for handling memory traffic:
  – Separate networks improve concurrency by reducing bottlenecks
Tilera 64
• Number of tiles = 64
• On-chip distributed cache = 5 MB
• Operations at 32, 16, 8 bits = 144, 192, 384 BOPS
• On-chip interconnect bandwidth = 32 Tbps
• I/O bandwidth = 40 Gbps
• Memory bandwidth = 200 Gbps
• 3-way, 64-bit VLIW CPU
Tilera 64
• Memory requests transit via the TDN
  – Large store requests, small load requests
• Memory responses transit via the MDN
  – Large load responses, small store responses
  – Includes cache-to-cache transfers and off-chip transfers
• Directory-based cache coherence
• Directory cache at every node
• Off-chip directory controller
• Tile-to-tile requests and responses transit the TDN
• Off-chip memory requests and responses transit the MDN
Itanium Dual-Core
Intel Core Duo
• 2 mobile-optimized execution cores
  – No multi-threading
• Cache hierarchy
  – Private 32-KB L1I and L1D
  – Shared 2-MB L2 cache
  – Provides efficient data sharing between both cores
• Power reduction
  – Some states managed individually by each processor
  – Deeper and Enhanced Deeper Sleep states only for the whole die
  – Dynamic Cache Sizing feature
    • Flushes the entire cache
    • This enables Enhanced Deeper Sleep with a lower voltage that does not guarantee cache integrity
• 151 million transistors
ARM11 MPCore
• Up to 4 processors, each with its own L1 instruction and data cache
• Distributed interrupt controller
• Timer per CPU
• Watchdog
  – Warning alerts for software failures
  – Counts down from predetermined values
  – Issues warning at zero
• CPU interface
  – Interrupt acknowledgement, masking, and completion acknowledgement
• CPU
  – A single ARM11 is called an MP11
• Vector floating-point unit
  – FP co-processor
• L1 cache
• Snoop control unit
  – L1 cache coherency
ARM11 MPCore
ARM11 MPCore Interrupt Handling
• Distributed Interrupt Controller (DIC) collates interrupts from many sources
  – Masking
  – Prioritization
  – Distribution to target MP11 CPUs
  – Status tracking
  – Software interrupt generation
• Number of interrupts independent of MP11 CPU design
• Memory mapped
• Accessed by CPUs via a private interface through the SCU
• Can route interrupts to a single CPU or multiple CPUs
• Provides inter-process communication
  – A thread on one CPU can cause activity by a thread on another CPU
ARM11 MPCore: Cache Coherency
• Snoop Control Unit (SCU) resolves most shared data bottleneck issues
• L1 cache coherency based on MESI (see the sketch after this list)
• Direct data intervention
  – Copying clean entries between L1 caches without accessing external memory
  – Reduces reads after writes from L1 to L2
  – Can resolve a local L1 miss from a remote L1 rather than L2
• Duplicated tag RAMs
  – Cache tags implemented as a separate block of RAM
  – Same length as the number of lines in the cache
  – Duplicates used by the SCU to check data availability before sending coherency commands
  – Only sent to CPUs that must update their coherent data cache
• Migratory lines
  – Allows moving dirty data between CPUs without writing to L2 and reading back from external memory
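A highly simplified sketch (not from the slides, and far less detailed than the real SCU) of the MESI idea behind the L1 coherency: each line is Modified, Exclusive, Shared, or Invalid, and local writes and snooped remote reads drive the transitions.

    #include <stdio.h>

    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

    /* New state of a line in this core's L1 after the local core writes it. */
    static mesi_t on_local_write(mesi_t s) {
        (void)s;             /* S and I must first invalidate/fetch other copies, but all end up Modified */
        return MODIFIED;
    }

    /* New state after another core's read of the line is snooped. */
    static mesi_t on_remote_read(mesi_t s) {
        switch (s) {
        case MODIFIED:       /* supply the dirty data (direct data intervention)... */
        case EXCLUSIVE:      return SHARED;   /* ...then both copies are Shared */
        default:             return s;        /* SHARED stays SHARED, INVALID stays INVALID */
        }
    }

    int main(void) {
        mesi_t line = EXCLUSIVE;
        line = on_local_write(line);   /* E -> M */
        line = on_remote_read(line);   /* M -> S, data forwarded cache to cache */
        printf("final state = %d (SHARED = %d)\n", line, SHARED);
        return 0;
    }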