DESCRIPTION
SystemExplorer is a system simulation framework based upon the open-source gem5 simulation infrastructure. It includes a rich collection of hardware components such as ARM cores, interconnects, memories and memory controllers, and IO devices (Ethernet, PCIe, and other peripherals). In addition, it provides support for running fully featured operating systems such as Linux and Android, combined with pre-packaged filesystem images that contain real workloads and benchmarks for the smartphone, server, and high performance computing domains. In this talk I'll give an overview of ARM R&D's use of the SystemExplorer tool for workload-directed architectural co-design. I will focus on how we are using it in combination with the Department of Energy's co-design center proxy applications to help evaluate and enable the ARM architecture to address the power-efficiency, performance, and resilience requirements of Exascale computing. (Presented during the FastPath 2013 Workshop in Austin, TX)
1
Simulation Directed Co-Design from Smartphones to Supercomputers
Eric Van Hensbergen ARM Research & Development
Austin, TX
FastPath 2013 April 21, 2013
2
STATE
§ SHIFT
  § No longer solely rely on Process Reduction to improve performance
  § Performance/Power/Cost will increasingly become reliant on Integration
§ ARM
  § Focuses on Design & Licensing of IP Building Blocks for SoCs (= LEGOs)
  § Building Blocks effectively act as COTS-on-Silicon
  § COTS-on-Silicon encourages multiple suppliers through the ecosystem
  § It enables circuit boards to be integrated onto a single chip
  § Technology DNA is Power-Efficiency
3
FLEXIBILITY
§ Build what you want
  § Target your SoC to solve your problem
  § One size does not fit all
  § Optimize power/performance for the domain
§ Utilize common infrastructure and components
  § Leverage the SW ecosystem and portability
  § Leverage validated IP
  § Proven design flows
§ Focus on adding value to solve your problems
  § Add your application-specific IP
  § Everything else off the shelf
§ Rich IP libraries
  § Diverse and competitive IP vendors
§ Leverage the ARM ecosystem
4
MARKETS
§ Mobile: 4.6bn, 3% in 2011
§ Embedded: 2.3bn, 25% in 2011
§ Home: 0.4bn, 40% in 2011
§ Enterprise: 1.4bn, 10% in 2011
5
PROCESSORS
§ Architecture: “ARMv8”
§ Processor Hard-Macro Implementation
§ Processor Micro-Architecture: “Cortex-A57”
6
“On-Chip” INTERCONNECT
§ Architecture: “AMBA”
§ RTL Implementation: CCN-504
7
GPUs
§ Architecture: “Midgard”
§ GPU Micro-Architecture: Mali-T678
8
1000+
9
gem5
§ Architectural simulator
§ ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer”
  § Plan to continue to invest over time
  § ARMv7 support is extremely good today
  § Plans to contribute ARMv8 support when complete
§ BSD licensed
  § Good platform for collaboration
  § Base infrastructure is available, and we can share bits beyond that
10
SystemExplorer
11
OS Support in SystemExplorer
Ubuntu 12.04 (Linux kernel v3.3) Android Jellybean (Kernel v2.6.38)
§ Latest Ubuntu and Android distributions
12
SystemExplorer Application Support
[Figure: application-support matrix across SystemExplorer platforms; status legend: Done / Planned / In Process / Legacy]
§ Server Applications (single-system simulation, and multi-system simulation with simulated Ethernet):
  Webserver, Netperf, SSJ, DaCapo, HPC, IOzone, DB
§ Mobile Applications (understanding how real application workloads and operating systems stress our IP):
  AR, Angry Birds, AndEBench, ToF, BBench, Replica, JS V8 Engine, Video Playback, Wireless Display, Video Conferencing, EEMBC, SPEC2000, AppLaunch (includes kernel support), Graphics (Taiji, Egypt), CaffeineMark, WPS, Vellamo HTML5, Vellamo Metal, RLBench, UI Twiddle
§ HPC Applications: Mantevo, CESAR, ExaCT, …
13
ARM gem5 Usage Continues to Grow
§ ARM gem5 downloads exceed both x86 and Alpha
[Chart: gem5 downloads per month (0 to 300) by ISA (alpha, arm, x86), showing ARM overtaking x86, then overtaking alpha, to become #1]
14
gem5 Visualization with Streamline
15
SystemExplorer Dhrystone Correlation
16
SystemExplorer SPECint2000 Correlation
17
SystemExplorer EEMBC Correlation
18
High Performance Computing
§ High performance computing (HPC) is becoming much more pervasive.
§ Power efficiency and integration are becoming key factors in both large-scale and commercial HPC.
§ 2018-2022 DARPA/DOD/DOE visions for HPC, at roughly 50 GFLOPS/W (20 pJ/FLOP):
  § 20W Chip: Teraflop
  § 5KW Chassis: Terascale
  § 20KW Rack: Petascale
  § 20MW Data Center: Exaflop
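The chip, rack, and data-center figures are the same efficiency target expressed at different scales; a quick unit check (ordinary arithmetic, not from the slides) confirms both forms agree:

```latex
\frac{1\ \text{J}}{50 \times 10^{9}\ \text{FLOP}} = 2 \times 10^{-11}\ \text{J/FLOP} = 20\ \text{pJ/FLOP},
\qquad
20\ \text{MW} \times 50\ \tfrac{\text{GFLOPS}}{\text{W}} = 10^{18}\ \text{FLOP/s} = 1\ \text{exaflop/s}.
```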
19
Why does ARM care about HPC?
§ We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow
  § Data center networking is getting more advanced
  § Energy will forever be a concern
§ ARM's long-term vision is for ARM technology to be in all levels of compute
  § Five years ago we announced the Cortex-M (microcontroller) series
  § ARM powers many hard real-time systems (radio, automotive, etc.)
  § Mobile devices
  § Servers
  § HPC is the only place you don't find ARM technology today, and we aim to change that
20
First steps in ARM HPC: Mont-Blanc
§ Supercomputer investigation based on embedded (ARM) technology
§ Funded under FP7
  § 3-year Integrated Project (started October 2011)
  § Budget: 14.5 M€ (8.1 M€ from EC)
§ Project goals: a physical prototype based on available embedded (ARM) technology, and a design of a full next-generation system
§ Consortium includes experienced HPC developers and users
21
Mont-Blanc Roadmap
A big challenge, and a huge opportunity for Europe
• Prototypes are critical to accelerate software development
• System software stack + applications
[Chart: GFLOPS/W roadmap, 2011 to 2017: a 256-node, 250 GFLOPS, 1.7 kW prototype "built with the best of the market"; a follow-on "built with the best that is coming"; then "what is the best that we could do?"]
22
US DoE Exascale Timeline
23
Goals
§ Port co-design center proxy applications to the ARM platform and take baseline measurements
§ Also execute the HPC Challenge and FFTW benchmarks to complement the proxy applications
§ Execute the same set of workloads on gem5, with a configuration similar to an ARM hardware platform, to get an idea of how well the simulator correlates
§ Use the results as a baseline for understanding the current state of ARM for HPC, future optimizations, and sensitivity studies
§ Since the national labs aren't as interested in 32-bit, use the process to refine methodology until 64-bit hardware and/or simulator becomes available
24
Baseline Workload Characterization
[Diagram: workloads from the National Labs are packaged into an HPC disk image; characterization feeds performance projection and design sensitivity studies, shared with the co-design centers and RTL simulation]
25
High Performance Computing Challenge
§ DARPA benchmark suite established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines)
  § LINPACK – stresses peak floating point
  § PTRANS – rate of transfer of large arrays
  § GUPS – random updates of memory
  § FFT – Fast Fourier Transform
  § STREAM – measures sustainable memory bandwidth
  § DGEMM – double-precision general matrix multiply
§ Generally run across a cluster with MPI, but can run single node and single core
§ Configurable to scale to different working set sizes
§ http://icl.cs.utk.edu/hpcc
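To give a concrete flavor of these kernels, here is a minimal STREAM-triad-style loop in C. This is an illustrative sketch, not the official STREAM benchmark; the array size, single-pass timing, and lack of warm-up iterations are simplifying assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)  /* ~16M doubles per array; assumed size, should exceed the caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* STREAM "triad": a[i] = b[i] + scalar * c[i].
       Bandwidth accounting counts 3 arrays moved: 2 loads + 1 store per element. */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);
    printf("triad: %.2f GB/s\n", bytes / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```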
26
Mantevo Proxy Applications Suite
§ Developed at Sandia National Labs as an outgrowth of the Trilinos project, a collection of open-source scientific libraries, applications, and benchmarks
§ Goals:
  § Predict performance of real applications in new situations.
  § Aid computer systems design decisions.
  § Foster communication between applications, libraries, and computer systems developers.
  § Guide application and library developers in algorithm and software design choices for new systems.
  § Provide open source software to promote informed algorithm, application, and architecture decisions in the HPC community.
§ Released as open source: http://mantevo.org
27
Co-Design Center Apps
§ CESAR – Center for Exascale Simulation of Advanced Reactors
  • Thermal hydraulics: fluid codes (NEK5000)
  • Neutronics: neutronics codes (MOCFE and OpenMC)
  • Coupling and data analytics for data-intensive tasks: cian
§ ExaCT – Center for Exascale Simulation of Combustion in Turbulence
  • Exp_CNS_NoSpec: a simple stencil-based test code
  • MultiGrid_C: a multigrid-based solver for a model linear elliptic system based on a centered second-order discretization
  • vodeDriver: chemical combustion kinetics
§ ExMatEx – Materials in Extreme Environments
  • CoMD: molecular dynamics
  • LULESH: Lagrangian explicit shock hydrodynamics
  • VPFFT: crystal viscoplasticity
28
Workloads, benchmarks, & miniapps
Linpack, DGEMM, FFT, PTRANS, STREAM, GUPS, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh
29
Workloads Instruction Mix
[Chart: instruction mix for Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, and GUPS, broken down into Memory, Integer, SIMD Integer, Float, and SIMD Float]
30
gem5 Methodology
§ Boot scripts are in m5-obj/config/boot/hpc
§ Base.rcS creates a checkpoint after boot and a 60-second "rest" period. It is set up to re-read the workload script after the checkpoint, so that the workload can be configured during restore, skipping the boot period.
§ Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile.
§ I've got bundled run scripts which handle establishing the base checkpoint, and restoring the checkpoint and executing workloads in atomic mode, A15, and A15 with periodic stats enabled
§ Runs are parameterized so that a complete run can finish in a reasonable amount of time with timing-approximate simulation
§ Disk image available, optimized for A15
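For workloads that need finer control than the rcS scripts provide, gem5 also exposes pseudo-ops callable from guest code. Below is a minimal sketch in C of bracketing a region of interest, assuming the m5op.h header and helper library from gem5's util/m5 tree are available; run_workload is a hypothetical stand-in for a real benchmark kernel:

```c
#include "m5op.h"  /* from gem5's util/m5; link against the matching m5op object.
                      Outside the simulator these calls are undefined, so guard
                      real code accordingly. */

/* hypothetical stand-in for the real benchmark kernel */
static void run_workload(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0;
}

int main(void) {
    m5_checkpoint(0, 0);   /* take a checkpoint here; timing runs restore from it */
    m5_reset_stats(0, 0);  /* discard boot/warm-up statistics */
    run_workload();        /* region of interest, measured with timing models */
    m5_dump_stats(0, 0);   /* write this phase's stats to m5out/stats.txt */
    m5_exit(0);            /* terminate the simulation */
    return 0;
}
```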
31
gem5 Correlation – Simple Memory
[Chart: gem5 vs. hardware correlation, -100% to +100%, for G-HPL MFlop/s, G-FFTR MFlop/s, EP-DGEMM MFlop/s, G-PTRANS GB/s, EP-STREAM GB/s, GUPS, FFTW (SP), CG_MFLOP/s, HPCCG P1 FLOP/s, MD time, phdMesh, and CoMD 8k]
32
miniFE – finite element simulation
§ Assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements, then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm.
§ Thus the kernels that it contains are:
  § Computation of element operators (diffusion matrix, source vector)
  § Assembly (scattering element operators into sparse matrix and vector)
  § Sparse matrix-vector product (during CG solve)
  § Vector operations (level-1 BLAS: axpy, dot, norm)
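For readers unfamiliar with the solver, the CG-phase kernels above (SpMV, axpy, dot) fit together as in this illustrative sketch of un-preconditioned CG in C over a CSR matrix. This is generic textbook CG, not miniFE's actual code:

```c
#include <math.h>
#include <stdlib.h>

/* Compressed sparse row (CSR) matrix */
typedef struct { int n; const int *rowptr, *col; const double *val; } Csr;

/* y = A*x : the sparse matrix-vector product kernel */
static void spmv(const Csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}

/* level-1 BLAS style helpers */
static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}
static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}

/* Un-preconditioned CG: solve A x = b for symmetric positive-definite A,
   starting from the initial guess in x. Returns iterations used. */
int cg_solve(const Csr *A, const double *b, double *x, int maxit, double tol) {
    int n = A->n;
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    spmv(A, x, Ap);
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    int it = 0;
    while (it < maxit && sqrt(rr) > tol) {
        spmv(A, p, Ap);                 /* dominant kernel in the solve */
        double alpha = rr / dot(n, p, Ap);
        axpy(n,  alpha, p,  x);         /* x += alpha * p   */
        axpy(n, -alpha, Ap, r);         /* r -= alpha * A*p */
        double rr_new = dot(n, r, r);
        for (int i = 0; i < n; i++)     /* p = r + beta * p */
            p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
        it++;
    }
    free(r); free(p); free(Ap);
    return it;
}
```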
33
Profile: miniFE
§ Language & Runtime: reference code in C++; alternate versions for OpenMP, Cilk, Chapel, qthreads, etc.
§ Library Dependencies: none
§ SLOCCount: 2,872 lines of code
§ A15 Performance Characteristics:
  § Run Time: 217,413,167 cycles
  § Max Heap Size: 14.54 MB
  § CPI: 1.6958
  § L1D Miss Rate: 2.5%
  § L2 Miss Rate: 6.68%
  § Branch Mispredicts: 6.36%
[Pie chart: instruction mix across Int, Float, SIMD Int, SIMD Float, Memory, and Other]
34
miniFE gem5 Cache Occupancy
35
miniFE Streamline Visualization
36
Workloads: Next Steps
§ More benchmarks
  § Get big-data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?)
  § Get an ExaCT benchmark working; incorporate forthcoming ASC benchmarks
§ More variations
  § Multinode MPI, PGAS, and other runtimes
  § OpenCL variants
  § Hand-coded NEON-optimized versions of key benchmarks
§ More accuracy
  § Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks
§ Systems software sensitivity study
  § OpenMPI versus MPICH versus LAMPI on ARM
  § Operating system version (3.7 has THP)
  § armcc vs. gcc vs. gcc-dragonegg vs. clang (etc.)
§ Transition to 64-bit gem5 (and hardware) when available
§ Integrate Mont-Blanc benchmarks and runtimes
§ Roll bare-metal versions of the co-design center workloads to make them more accessible to design teams
37
Simulation Driven Challenges
§ Performance
  § When running in functional mode (atomic), performance is in MIPS; when running in cycle-approximate mode (with memory models, cache models, etc.), simulation runs in KIPS, though the longer runtimes with timing models give more representative results (see the arithmetic sketch after this list)
  § The current methodology works off atomic checkpoints followed by short timing measurements, but can be refined to better represent multi-phase workloads
§ Scale
  § gem5 is currently inherently serial; adding cores or nodes to the simulation has a multiplicative effect
  § Multi-threading the simulation model at the core, node, and cluster levels could help address this problem, but may impact the granularity of timing accuracy
§ Correlation
  § Correlating a single-core simulation is hard; correlating multi-core is extremely difficult, as is multi-node
§ Sensitivity Study State Space Explosion
  § Many knobs to turn; determining which ones to turn in combination for the best effect is an ongoing research problem
§ Visualization
  § Need better ways of visualizing performance characteristics, particularly at scale
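As an illustration of that slowdown, with assumed round-number rates of 100 MIPS (atomic) and 100 KIPS (timing), neither measured in this work, a billion-instruction workload goes from seconds to hours:

```latex
t = \frac{N}{R}: \qquad
\frac{10^{9}\ \text{inst}}{10^{8}\ \text{inst/s}} = 10\ \text{s (atomic)},
\qquad
\frac{10^{9}\ \text{inst}}{10^{5}\ \text{inst/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{h (timing)}.
```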
38
Future Work: Integration with SST
§ SST: The Structural Simulation Toolkit
  § Maintained by Sandia National Labs
  § Component-based discrete event model
  § Already uses gem5 as a component (but not well integrated with the ARM variant)
§ Potential to help us scale out simulation, as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer
39
Links
§ More info on ARM, including research papers: http://infocenter.arm.com
§ gem5: http://www.m5sim.org
§ SST: http://sst.sandia.gov
§ Mont-Blanc: http://montblanc-project.eu
§ Exascale Initiative: http://sites.google.com/a/lbl.gov/exascale-initiative/
§ Co-Design Center Proxy Apps:
  § Mantevo: http://mantevo.org
  § ExMatEx: http://exmatex.lanl.gov
  § ExaCT: http://exactcodesign.org
  § CESAR: http://cesar.mcs.anl.gov
40
QUESTIONS? Thanks!