Simulation Directed Co-Design from Smartphones to Supercomputers
Eric Van Hensbergen, ARM Research & Development, Austin, TX
FastPath 2013, April 21, 2013

DESCRIPTION

SystemExplorer is a system simulation framework based on the open-source gem5 simulation infrastructure. It includes a rich collection of hardware components such as ARM cores, interconnect, memories and memory controllers, and I/O devices (Ethernet, PCIe, and other peripherals). In addition, it supports running fully featured operating systems such as Linux and Android, combined with pre-packaged filesystem images that contain real workloads and benchmarks for smartphone, server, and high performance computing. In this talk I'll give an overview of ARM R&D's use of the SystemExplorer tool for workload-directed architectural co-design. I will focus on how we are using it in combination with the Department of Energy's co-design center proxy applications to help evaluate and enable the ARM architecture to address the power-efficiency, performance, and resilience requirements of Exascale computing. (Presented during the FastPath 2013 Workshop in Austin, TX)

Page 1: Simulation Directed Co-Design from Smartphones to Supercomputers

Simulation Directed Co-Design from Smartphones to Supercomputers

Eric Van Hensbergen ARM Research & Development

Austin, TX

FastPath 2013 April 21, 2013

Page 2: Simulation Directed Co-Design from Smartphones to Supercomputers

STATE

§ SHIFT
  § No longer solely rely on Process Reduction to improve performance
  § Performance/Power/Cost will increasingly become reliant on Integration
§ ARM
  § Focuses on Design & Licensing of IP Building Blocks for SoC’s (=LEGO’s)
  § Building Blocks effectively act as COTS-on-Silicon
  § COTS-on-Silicon encourages multi-suppliers through the eco-system
  § It enables circuit-boards to be integrated onto a single chip
  § Technology DNA is Power-Efficiency

Page 3: Simulation Directed Co-Design from Smartphones to Supercomputers

FLEXIBILITY

§ Build what you want?
  § Target your SoC to solve your problem
    § One size does not fit all
    § Optimize power/performance for the domain
  § Utilize common infrastructure and components
    § Leverage SW ecosystem and portability
    § Leverage validated IP
    § Proven design flows
  § Focus on adding value to solve your problems
    § Adding your application-specific IP
    § Everything else off the shelf
§ Rich IP libraries
  § Diverse and competitive IP vendors
  § Leverage the ARM ecosystem

Page 4: Simulation Directed Co-Design from Smartphones to Supercomputers

MARKETS

Mobile: 4.6bn, 3% (in 2011)
Embedded: 2.3bn, 25% (in 2011)
Home: 0.4bn, 40% (in 2011)
Enterprise: 1.4bn, 10% (in 2011)

Page 5: Simulation Directed Co-Design from Smartphones to Supercomputers

PROCESSORS

Architecture “ARMv8”

Processor Hard-Macro Implementation

Processor Micro-Architecture “Cortex-A57”

Page 6: Simulation Directed Co-Design from Smartphones to Supercomputers

“On-Chip” INTERCONNECT

Architecture “AMBA”

RTL Implementation CCN-504

Page 7: Simulation Directed Co-Design from Smartphones to Supercomputers

GPUs

Architecture “Midgard”

GPU Micro-Architecture Mali T-678

Page 8: Simulation Directed Co-Design from Smartphones to Supercomputers

1000+

Page 9: Simulation Directed Co-Design from Smartphones to Supercomputers

gem5

§ Architectural simulator
§ ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer”
  § Plan to continue to invest over time
  § ARMv7 support is extremely good today
  § Plans to contribute ARMv8 support when complete
§ BSD licensed
§ Good platform for collaboration
  § Base infrastructure is available and we can share bits beyond that

Page 10: Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer

Page 11: Simulation Directed Co-Design from Smartphones to Supercomputers

OS Support in SystemExplorer

Ubuntu 12.04 (Linux kernel v3.3)
Android Jellybean (Linux kernel v2.6.38)

§ Latest Ubuntu and Android distributions

Page 12: Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer Application Support

Understanding how real application workloads and operating systems stress our IP

[Figure: application support matrix. Each workload is marked Done, Planned, In Process, or Legacy; the per-workload status did not survive extraction.]

SystemExplorer Platforms: single system simulation; multi-system simulation with simulated Ethernet
Categories: Server Applications, HPC Applications, Mobile Applications, Graphics
Workloads shown: Webserver, Netperf, SSJ, DaCapo, HPC, IOzone, DB, Mantevo, CESAR, ExaCT, AR, Angry Birds, AndEBench, ToF, BBench, Replica, JS V8 Engine, Vid Playback, Wireless Disp, Video Conf, AppLaunch, Caffeine Mark, WPS, Vellamo HTML5, Vellamo Metal, RLBench, UI Twiddle, Taji, Egypt, EEMBC, SPEC2000
Note: includes kernel support

Page 13: Simulation Directed Co-Design from Smartphones to Supercomputers

ARM gem5 Usage Continues to Grow

§ ARM gem5 exceeding both X86 and Alpha

[Figure: gem5 downloads per month (axis 0 to 300) for alpha, arm, and x86. Annotations: "Overtake alpha", "Overtake x86", "ARM #1".]

Page 14: Simulation Directed Co-Design from Smartphones to Supercomputers

gem5 Visualization with Streamline

Page 15: Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer Dhrystone Correlation

Page 16: Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer SPECint2000 Correlation

Page 17: Simulation Directed Co-Design from Smartphones to Supercomputers

SystemExplorer EEMBC Correlation

Page 18: Simulation Directed Co-Design from Smartphones to Supercomputers

High Performance Computing

§ High performance computing (HPC) is becoming much more pervasive.
  § Power efficiency and integration are becoming key factors in both large-scale and commercial HPC
  § 2018-2022 DARPA/DOD/DOE Visions for HPC:
    20 W Chip: Teraflop
    5 kW Chassis: Terascale
    20 kW Rack: Petascale
    20 MW Data Center: Exaflop
    50 GFLOPS/W (20 pJ/FLOP)

[Image label: Medical/Pharma]
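The efficiency figure on this slide follows from the power budgets by simple division. The sketch below checks that arithmetic, assuming "Teraflop", "Petascale", and "Exaflop" mean 10^12, 10^15, and 10^18 FLOP/s respectively (the slide does not say whether these are peak or sustained). The chassis tier is left out because "Terascale" gives no exact rate; all names are illustrative.

```python
# Power budget per tier, taken from the slide; FLOP/s values are assumed.
TIERS = {
    "chip": (1e12, 20.0),          # teraflop at 20 W
    "rack": (1e15, 20e3),          # petaflop at 20 kW
    "data center": (1e18, 20e6),   # exaflop at 20 MW
}


def gflops_per_watt(flops, watts):
    """Sustained rate divided by power, expressed in GFLOPS/W."""
    return flops / watts / 1e9


def picojoules_per_flop(flops, watts):
    """Watts per FLOP/s is joules per FLOP; scale that to picojoules."""
    return 1e12 * watts / flops


for name, (flops, watts) in TIERS.items():
    print(f"{name}: {gflops_per_watt(flops, watts):.0f} GFLOPS/W, "
          f"{picojoules_per_flop(flops, watts):.0f} pJ/FLOP")
```

Under these assumptions every tier lands on the same target, which is why a single efficiency number can stand for the whole range from chip to data center.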

Page 19: Simulation Directed Co-Design from Smartphones to Supercomputers

Why does ARM care about HPC?

§ We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow
  § Data center networking is getting more advanced
  § Energy will forever be a concern
§ ARM’s long-term vision is for ARM technology to be in all levels of compute
  § Five years ago we announced the Cortex-M (Microcontroller) series
  § ARM powers many hard-real-time systems (Radio, Automotive, etc.)
  § Mobile devices
  § Servers
  § HPC is the only place you don’t find ARM technology today and we aim to change that

Page 20: Simulation Directed Co-Design from Smartphones to Supercomputers

First steps in ARM HPC:

§ Supercomputer investigation based on embedded (ARM) technology
§ Funded under FP7
§ 3-year IP Project (Start October 2011)
§ Budget: 14.5 M€ (8.1 M€ from EC)
§ Project goals: physical prototype based on available embedded (ARM) technology and a design of a full next-gen system
§ Consortium includes experienced HPC developers and users

Page 21: Simulation Directed Co-Design from Smartphones to Supercomputers

Mont-Blanc Roadmap

A big challenge, and a huge opportunity for Europe

• Prototypes are critical to accelerate software development
• System software stack + applications

[Figure: roadmap chart of GFLOPS/W by year, 2011 to 2017. Labels: "256 nodes, 250 GFLOPS, 1.7 kW"; "Built with the best of the market"; "Built with the best that is coming"; "What is the best that we could do?"]

(HPC Advisory Council, Malaga, September 13, 2012)

Page 22: Simulation Directed Co-Design from Smartphones to Supercomputers

US DoE Exascale Timeline

Page 23: Simulation Directed Co-Design from Smartphones to Supercomputers

Goals

§ Port co-design center proxy applications to ARM platform and take baseline measurements
§ Also execute HPC Challenge and FFTW benchmarks to complement proxy applications
§ Execute same set of workloads on gem5 with a configuration similar to an ARM hardware platform to get an idea of how well the simulator correlates
§ Use results as a baseline for understanding the current state of ARM for HPC, future optimizations and sensitivity studies
§ Since national labs aren’t as interested in 32-bit, use the process to refine methodology until 64-bit hardware and/or simulator becomes available

Page 24: Simulation Directed Co-Design from Smartphones to Supercomputers

Baseline Workload Characterization

[Figure: workflow diagram. Labels: National Labs, Co-Design Centers, Workloads, HPC Disk Image, Characterization, Performance Projection, Design Sensitivity Studies, RTL Simulation.]

Page 25: Simulation Directed Co-Design from Smartphones to Supercomputers

High Performance Computing Challenge

§ DARPA benchmark established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines)
  § LINPACK: stresses peak floating point
  § PTRANS: rate of transfer of large arrays
  § GUPS: random updates of memory
  § FFT: Fast Fourier Transform
  § STREAM: measures sustainable memory bandwidth
  § DGEMM: double precision general matrix multiply
§ Generally run across a cluster with MPI, but can run single node and single core
§ Configuration can scale to different working set sizes
§ http://icl.cs.utk.edu/hpcc
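To make two of these access patterns concrete, here is a minimal pure-Python sketch of a STREAM-style triad and a GUPS-style random update. It is not the HPCC reference code: the real RandomAccess kernel uses its own pseudo-random sequence, while this sketch substitutes Python's `random` module, and the function names are made up for illustration.

```python
import random


def stream_triad(a, b, c, scalar):
    """STREAM-style triad: a[i] = b[i] + scalar * c[i], a unit-stride sweep."""
    for i in range(len(a)):
        a[i] = b[i] + scalar * c[i]


def triad_bandwidth_gb_s(n, seconds, bytes_per_word=8):
    """Triad moves three arrays per element: two reads and one write."""
    return 3 * n * bytes_per_word / seconds / 1e9


def random_updates(table, n_updates, seed=1):
    """GUPS-style kernel: XOR pseudo-random values into scattered slots.

    The table length must be a power of two so the index can be masked.
    """
    rng = random.Random(seed)
    mask = len(table) - 1
    for _ in range(n_updates):
        value = rng.getrandbits(64)
        table[value & mask] ^= value
```

The triad streams through memory and rewards bandwidth, while the random updates defeat caches and prefetchers, which is why the two results tend to diverge sharply on the same machine.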

Page 26: Simulation Directed Co-Design from Smartphones to Supercomputers

Mantevo Proxy Applications Suite

§ Developed at Sandia National Labs as an outgrowth of the Trilinos project, which is a collection of open-source scientific libraries, applications and benchmarks
§ Goals:
  § Predict performance of real applications in new situations.
  § Aid computer systems design decisions.
  § Foster communication between applications, libraries and computer systems developers.
  § Guide application and library developers in algorithm and software design choices for new systems.
  § Provide open source software to promote informed algorithm, application and architecture decisions in the HPC community.
§ Released as open source:
  § http://mantevo.org

Page 27: Simulation Directed Co-Design from Smartphones to Supercomputers

Co-Design Center Apps

CESAR: Center for Exascale Simulation of Advanced Reactors
• Thermal Hydraulics: for the fluid codes (NEK 5000)*
• Neutronics: for the Neutronics codes (MOCFE and OpenMC)
• Coupling and Data Analytics for data intensive tasks: cian

ExaCT: Center for Exascale Simulation of Combustion in Turbulence
• Exp_CNS_NoSpec: A simple stencil-based test code
• MultiGrid_C: A multigrid-based solver for a model linear elliptic system based on a centered second-order discretization.
• vodeDriver: chemical combustion kinetics

ExMatEx: Materials in Extreme Environments
• CoMD: Molecular Dynamics
• LULESH: Lagrangian Explicit Shock Hydrodynamics
• VPFFT: Crystal viscoplasticity

Page 28: Simulation Directed Co-Design from Smartphones to Supercomputers

Workloads, benchmarks, & miniapps

[Figure panels: Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS]

Page 29: Simulation Directed Co-Design from Smartphones to Supercomputers

Workloads Instruction Mix

[Figure: instruction mix per workload. Panels: Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, GUPS. Legend: Memory, Integer, SIMD Integer, Float, SIMD Float.]

Page 30: Simulation Directed Co-Design from Smartphones to Supercomputers

gem5 Methodology

§ Boot scripts are in m5-obj/config/boot/hpc
§ Base.rcS creates a checkpoint after boot and a 60-second “rest” period. It is set up to re-read the workload script after the checkpoint, so that the workload can be configured during restore, skipping the boot period.
§ Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile.
§ I’ve got some bundled run scripts which handle establishing the base checkpoint, restoring the checkpoint, and executing workloads in atomic, A15, and A15 with periodic stats enabled
§ Runs are parameterized so that a complete run can finish in a reasonable amount of time with timing-approximate simulation
§ Disk image available, optimized for A15
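The checkpoint-then-reread flow described above can be sketched as a guest boot script. The talk does not show Base.rcS itself, so the generator below is a hypothetical reconstruction of its general shape: the helper name, the `/tmp` paths, and the `/sbin/m5` location are assumptions, while `checkpoint`, `readfile`, `writefile`, and `exit` are subcommands of gem5's `m5` guest utility.

```python
def make_base_rcs(rest_seconds=60, m5="/sbin/m5"):
    """Return the text of a hypothetical Base.rcS-style guest boot script."""
    lines = [
        "#!/bin/sh",
        "# Hypothetical sketch, not the script used in the talk.",
        f"sleep {rest_seconds}",
        f"{m5} checkpoint",
        "# Execution resumes here after a restore, so the script",
        "# fetched next can differ from one restore to another.",
        f"{m5} readfile > /tmp/workload.sh",
        "if [ -s /tmp/workload.sh ]; then",
        "    sh /tmp/workload.sh > /tmp/result.txt 2>&1",
        f"    {m5} writefile /tmp/result.txt",
        "fi",
        f"{m5} exit",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(make_base_rcs())
```

Because the workload script is fetched only after the checkpoint, one base checkpoint can serve every benchmark: each restore supplies a different script on the host side and skips the boot and rest period entirely.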

Page 31: Simulation Directed Co-Design from Smartphones to Supercomputers

gem5 Correlation – Simple Memory

[Figure: bar chart with an axis from -100.00% to 100.00% in steps of 20%. Metrics shown: G-HPL MFlop/s, G-FFTR MFlop/s, EP-DGEMM MFlop/s, G-PTRANS GB/s, EP-STREAM GB/s, GUPS, FFTW (SP), CG_MFLOP/s, HPCCG P1 FLOP/s, MD TIME, phdMesh, CoMD 8k.]
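The slide does not define how the percentages were computed. One plausible reading, assumed here, is the signed difference of the simulated result relative to the hardware measurement. A minimal sketch of that definition follows; the function names are illustrative and no values from the chart are reproduced.

```python
def percent_error(simulated, measured):
    """Signed difference of simulator versus hardware, in percent.

    Assumed definition: 100 * (simulated - measured) / measured.
    """
    if measured == 0:
        raise ValueError("hardware measurement must be non-zero")
    return 100.0 * (simulated - measured) / measured


def correlation_report(pairs):
    """Map each metric name to its percent error.

    `pairs` maps a metric name to a (simulated, measured) tuple.
    """
    return {name: percent_error(sim, hw) for name, (sim, hw) in pairs.items()}
```

With this definition the sign means different things for rate metrics such as MFlop/s, where a positive value says the simulator is optimistic, and for time metrics such as MD TIME, where a positive value says it is pessimistic.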

Page 32: Simulation Directed Co-Design from Smartphones to Supercomputers

miniFE: finite element simulation

§ It assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements. It then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm.

Thus the kernels that it contains are:
§ computation of element-operators
  § diffusion matrix, source vector
§ assembly
  § scattering element-operators into sparse matrix and vector
§ sparse matrix-vector product
  § during CG solve
§ vector operations (level-1 BLAS: axpy, dot, norm)
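The solve phase combines the last two kernel groups. As a rough illustration, the sketch below implements an un-preconditioned conjugate-gradient solver over a CSR matrix in plain Python. It is not miniFE's code (the reference version is C++), the function names are invented, and it assumes the matrix is symmetric positive definite.

```python
def spmv(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x, with A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


def dot(x, y):
    """Level-1 BLAS dot product."""
    return sum(xi * yi for xi, yi in zip(x, y))


def axpy(alpha, x, y):
    """Level-1 BLAS axpy, returning alpha * x + y as a new list."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def conjugate_gradient(row_ptr, col_idx, values, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A, no preconditioner."""
    x = [0.0] * len(b)
    r = list(b)                      # residual b - A x for the zero guess
    p = list(r)                      # first search direction
    rr = dot(r, r)
    for _ in range(max_iter):
        if rr ** 0.5 < tol:          # stop once the residual norm is small
            break
        ap = spmv(row_ptr, col_idx, values, p)
        alpha = rr / dot(p, ap)
        x = axpy(alpha, p, x)        # x = x + alpha * p
        r = axpy(-alpha, ap, r)      # r = r - alpha * A p
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)  # p = r + beta * p
        rr = rr_new
    return x
```

Each iteration costs one sparse matrix-vector product plus a handful of vector operations, so the solve is dominated by irregular, memory-bound accesses rather than floating-point throughput.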

Page 33: Simulation Directed Co-Design from Smartphones to Supercomputers

Profile: miniFE

§ Language & Runtime: reference code in C++, alternate versions for OpenMP, Cilk, Chapel, Qthreads, etc.
§ Library Dependencies: None
§ SLOCCOUNT: 2872 lines of code
§ A15 Perf Characteristics:
  § Run Time: 217,413,167 cycles
  § Max Heap Size: 14.54MB
  § CPI: 1.6958
  § L1D Miss Rate: 2.5%
  § L2 Miss Rate: 6.68%
  § Branch Mispredicts: 6.36%

[Figure: instruction mix chart. Categories: Int, Float, SIMD Int, SIMD Float, Memory, Other.]
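The cycle count and CPI on this slide imply an instruction count that the slide does not state. The short calculation below derives it, assuming both figures cover the same measured interval; the variable names are illustrative.

```python
cycles = 217_413_167   # run time in cycles, from the slide
cpi = 1.6958           # cycles per instruction, from the slide

instructions = cycles / cpi   # CPI = cycles / instructions
ipc = 1.0 / cpi               # instructions retired per cycle

print(f"about {instructions / 1e6:.1f} M instructions, IPC about {ipc:.2f}")
```

That works out to roughly 128 million instructions, a short enough region to be practical under timing simulation.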

Page 34: Simulation Directed Co-Design from Smartphones to Supercomputers

miniFE gem5 Cache Occupancy

Page 35: Simulation Directed Co-Design from Smartphones to Supercomputers

miniFE Streamline Visualization

Page 36: Simulation Directed Co-Design from Smartphones to Supercomputers

Workloads: Next Steps

§ More benchmarks
  § Get big data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?)
  § Get an ExaCT benchmark working, incorporate forthcoming ASC benchmarks
§ More variations
  § Multinode MPI, PGAS, and other runtimes
  § OpenCL variants
  § Handcode NEON optimized versions of key benchmarks
§ More accuracy
  § Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks
§ Systems software sensitivity study
  § OpenMPI versus MPICH versus LAMPI on ARM
  § Operating system version (3.7 has THP)
  § armcc vs gcc vs gcc-dragon-egg versus clang (etc.)
§ Transition to 64-bit gem5 (and hardware) when available.
§ Integrate Mont-Blanc benchmarks and runtimes
§ Roll bare-metal version of co-design center workloads to make them more accessible to design teams.

Page 37: Simulation Directed Co-Design from Smartphones to Supercomputers

Simulation Driven Challenges

§ Performance
  § In functional (atomic) mode the simulator runs at MIPS rates; in cycle-approximate mode (with memory models, cache models, etc.) it runs at KIPS rates. Longer runs with timing models, however, give more representative results.
  § Current methodology works off atomic checkpoints followed by short timing measurements, but can be refined to get better representation of multi-phase workloads
§ Scale
  § gem5 is currently inherently serial; adding cores or nodes to the simulation has a multiplicative effect
  § Multi-threading the simulation model at core, node, and cluster levels could help address this problem, but may impact granularity of timing accuracy.
§ Correlation
  § Correlating a single core simulation is hard, correlating multi-core is extremely difficult, as is multi-node.
§ Sensitivity Study State Space Explosion
  § Many knobs to turn; determining which ones to turn in combination for the best effect is an on-going research problem.
§ Visualization
  § Need better ways of visualizing performance characteristics, particularly at scale.
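A back-of-the-envelope model shows why the MIPS-versus-KIPS gap and the serial simulator matter together. The rates below are assumptions chosen only for illustration (the slide gives orders of magnitude, not figures), and the model treats the multiplicative effect as a simple product of instructions per core and core count.

```python
ATOMIC_IPS = 1e6     # assumed: about 1 MIPS in functional (atomic) mode
TIMING_IPS = 100e3   # assumed: about 100 KIPS in cycle-approximate mode


def wall_clock_hours(instructions_per_core, sim_rate_ips, cores=1):
    """Host time for a serial simulator stepping every core in turn."""
    total_instructions = instructions_per_core * cores
    return total_instructions / sim_rate_ips / 3600.0


if __name__ == "__main__":
    work = 3.6e9  # hypothetical instructions per simulated core
    for cores in (1, 4, 16):
        print(cores,
              wall_clock_hours(work, ATOMIC_IPS, cores),
              wall_clock_hours(work, TIMING_IPS, cores))
```

Under these assumed rates the same workload costs ten times more host time in timing mode, and every added core multiplies that again, which is the motivation for checkpointing past boot and keeping timing measurements short.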

Page 38: Simulation Directed Co-Design from Smartphones to Supercomputers

Future Work: Integration with SST

§ SST: The Structural Simulation Toolkit
  § Maintained by Sandia National Labs
  § Component-based Discrete Event Model
  § Already uses gem5 as a component (but not well integrated with ARM variant)
  § Potential to help us scale out simulation as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer.

Page 39: Simulation Directed Co-Design from Smartphones to Supercomputers

Links

§ More info on ARM including Research Papers
  § http://infocenter.arm.com
§ gem5 (http://www.m5sim.org)
§ SST (http://sst.sandia.gov)
§ Mont-Blanc (http://montblanc-project.eu)
§ Exascale Initiative
  § http://sites.google.com/a/lbl.gov/exascale-initiative/
§ Co-Design Center Proxy Apps
  § Mantevo (http://mantevo.org)
  § ExMatEx (http://exmatex.lanl.gov)
  § ExaCT (http://exactcodesign.org)
  § CESAR (http://cesar.mcs.anl.gov)

Page 40: Simulation Directed Co-Design from Smartphones to Supercomputers

QUESTIONS? Thanks!