View
213
Download
0
Embed Size (px)
Citation preview
RAMP Gold
RAMPants{rimas,waterman,yunsup}@cs
Parallel Computing LaboratoryUniversity of California, Berkeley
A Survey of μArch Simulation Trends
• Typical ISCA 2008 papers simulated about about twice as many instructions as those in 1998. So what?
Something is Rotten in theState of California
• A median ISCA ‘08 paper’s simulations run for fewer than four OS scheduling quanta!– We run yesterday’s apps at yesteryear’s
timescales– And attempt to model N communicating cores
with O(1/N) instructions per core?!
• The problem is that simulators are too slow– Irony: since performance scales as
sqrt(complexity), simulated instructions per wall-clock second falls as processors get faster
RAMP Gold: Our Solution
• RAMP Gold is an FPGA-based, 100 MIPS manycore simulator– Only 100x slower than real-time– Economical: RTL is BSD-licensed;
commodity HW
Cost Performance(MIPS) Simulations per day
SoftwareSimulator $2,000 0.1 - 1 1
RAMP Gold $2,000 + $750 50 - 100 100
Our Target Machine
SPARC V8CORE
SPARC V8CORE
I$I$ D$D$
DRAMDRAM
Shared L2$ / InterconnectShared L2$ / Interconnect
SPARC V8CORE
SPARC V8CORE
I$I$ D$D$
SPARC V8CORE
SPARC V8CORE
I$I$ D$D$
SPARC V8CORE
SPARC V8CORE
I$I$ D$D$
…
64 cores
RAMP Gold Architecture
• Mapping the target machine directly to an FPGA is inefficient
• Solution: split timing and functionality– The timing logic decides how many target
cycles an instruction sequence should take– Simulating the functionality of an
instruction might take multiple host cycles– Target time and host time are orthogonal
Function/Timing Split Advantages
• Flexibility– Can configure target at runtime– Synthesize design once, change target
model parameters at will
• Efficient FPGA resource usage– Example 1: model a 2-cycle FPU in 10
host cycles– Example 2: model a 16MB L2$ using
only 256KB host BRAM to store tags/metadata
Host Multithreading
• How are we going to model 64 cores?
FF DD XX MM WW
FF DD XX MM WW
FF DD XX MM WW
FF DD XX MM WW
…64 pipelines
time
FF DD XX MM WW
FF DD XX MM WW
FF DD XX MM WW
time
Build 64 pipelines Time-multiplex one pipeline
• Single hardware pipeline with multiple copies of CPU state• No bypass path required• Not multithreaded target!
…
FF DD XX MM WW
FF DD XX MM WW
Cache Modeling
• The cache model maintains tag, state, protocol bits internally
• Whenever the functional model issues a memory operation, the cache model determines how many target cycles to stall
…
tagtag indexindex offsetoffset
tag, statetag, state tag, statetag, state tag, statetag, state
==== ==
hit : don’t stallmiss : stall arbitrary cycles
maxassociativity
Putting it all together
instruction cache
ifetch stage
decode stage
register access stage
memory stage
data cache
exception stage
memory controller
cache model & performance counters
Resource Utilization (XC5VLX110T)• LUTs – 14%, BRAM – 23%We can fit 3 pipelines on one FPGA!
Infrastructure
ELF to BRAM Loader
Standard ELF Loader
Frontend Test Server
ELF to DRAM Loader
SPARC V8 Verification Suite
(.S or .C)
GNU SPARC v8 Compiler/ Linker(sparc-linux-gcc, sparc-linux-as, sparc-linux-ld)
Customized Linker Script
(.lds)
RAMP Gold Systemverilog Source
Files / netlist(.sv, .v)
ELF big-endian Binaries
Modelsim SE/ Questasim 6.4a
Host dynamic simulation libraries
(.so)
GNU libbfd library(from GNU binutil)
SPARC v8 disassembler C implementationFrontend Links
Xilinx Unisim Library Systemverilog DPI interface
Simulation log/Console output
Checker
C Functional Simulator
Frontend Links
Reference Result
FPGA TargetXilinx ML505, BEE3
Frontend Links
Memory/Register dumps
Solaris/Linux Machine(Handle Syscall)
Our accomplishments this semester
Jan 2009 Last Night
Single Threaded 64 Way Host Multithreaded
0.000032GB BRAM 2GB DDR2 SDRAM
Hello World works (sometimes) ParLab Damascene CBIR App,SPLASH2 + SPEC CPU2000
No Timing Model or Introspection
Runtime Configurable Cache Model, Performance Counters
No Floating Point Hardware FPU Multiply/Add + Software Emulation
HARDware ain’t no joke
500+ Man Hours DDR2 Memory Controller Debugging
100+ Man Hours DMA Engine/Pipeline Issues
150+ Man Hours CAD Tool Issues
0 Pipeline Corner Cases
Sample Use Case: L1 D$ Tradeoffs
• Assume we have a 64-core CMP with private 16KB direct-mapped L1 D$
• In the next tech gen, we can fit either of these improved configurations in a clock cycle:– 32KB direct-mapped L1– 16KB 4-way set-associative L1
• Which should we choose?
Sample Use Case: L1 D$ Tradeoffs
• Evidently, the associative cache is superior• It took longer to make these slides than to
run these 10+ billion instruction simulations
Future Directions
• RAMP Gold closes two critical feedback loops– Expedient HW/SW co-tuning is within our
grasp– Simulations can now be run on a
thermal timescale, enabling the exploration of temperature-aware scheduling policies
• We intend to explore both avenues!
DEMO: Damascene
ImageImage
Convert Colorspace
Convert Colorspace
Textons:K-meansTextons:K-means
Texture GradientTexture Gradient
CombineCombine
Non-maxsuppression
Non-maxsuppression
Intervening ContourIntervening Contour
Generalized EigensolverGeneralized Eigensolver
Oriented Energy Combination
Oriented Energy Combination
Combine, NormalizeCombine, Normalize
ContoursContours
BgBg CgaCga CgbCgb