DESCRIPTION
SystemExplorer is a system simulation framework based upon the open-source gem5 simulation infrastructure. It includes a rich collection of hardware components such as ARM cores, interconnects, memories and memory controllers, and IO devices (Ethernet, PCIe, and other peripherals). In addition, it provides support for running fully featured operating systems such as Linux and Android, combined with pre-packaged filesystem images that contain real workloads and benchmarks for the smartphone, server, and high performance computing domains. In this talk I'll give an overview of ARM R&D's use of the SystemExplorer tool for workload-directed architectural co-design. I will focus on how we are using it in combination with the Department of Energy's co-design center proxy applications to help evaluate and enable the ARM architecture to address the power-efficiency, performance, and resilience requirements of Exascale computing. (Presented during the FastPath 2013 Workshop in Austin, TX)
1
Simulation Directed Co-Design from Smartphones to Supercomputers
Eric Van Hensbergen ARM Research & Development
Austin, TX
FastPath 2013 April 21, 2013
2
STATE
§ SHIFT
  § No longer solely rely on Process Reduction to improve performance
  § Performance/Power/Cost will increasingly become reliant on Integration
§ ARM
  § Focuses on Design & Licensing of IP Building Blocks for SoCs (= LEGOs)
  § Building Blocks effectively act as COTS-on-Silicon
  § COTS-on-Silicon encourages multiple suppliers through the ecosystem
  § It enables circuit boards to be integrated onto a single chip
  § Technology DNA is Power-Efficiency
3
FLEXIBILITY
§ Build what you want
  § Target your SoC to solve your problem
  § One size does not fit all
  § Optimize power/performance for the domain
§ Utilize common infrastructure and components
  § Leverage the SW ecosystem and portability
  § Leverage validated IP
  § Proven design flows
§ Focus on adding value to solve your problems
  § Add your application-specific IP
  § Everything else off the shelf
§ Rich IP libraries
  § Diverse and competitive IP vendors
§ Leverage the ARM ecosystem
4
MARKETS
§ Mobile: 4.6bn, 3% in 2011
§ Embedded: 2.3bn, 25% in 2011
§ Home: 0.4bn, 40% in 2011
§ Enterprise: 1.4bn, 10% in 2011
5
PROCESSORS
§ Architecture: “ARMv8”
§ Processor Hard-Macro Implementation
§ Processor Micro-Architecture: “Cortex-A57”
6
“On-Chip” INTERCONNECT
§ Architecture: “AMBA”
§ RTL Implementation: CCN-504
7
GPUs
§ Architecture: “Midgard”
§ GPU Micro-Architecture: Mali-T678
8
1000+
9
gem5
§ Architectural simulator
§ ARM has invested significantly in ARM support for gem5 under the internal name “SystemExplorer”
  § Plan to continue to invest over time
  § ARMv7 support is extremely good today
  § Plans to contribute ARMv8 support when complete
§ BSD licensed
  § Good platform for collaboration
  § Base infrastructure is available, and we can share bits beyond that
10
SystemExplorer
11
OS Support in SystemExplorer
Ubuntu 12.04 (Linux kernel v3.3) Android Jellybean (Kernel v2.6.38)
§ Latest Ubuntu and Android distributions
12
SystemExplorer Application Support
[Figure: application-support matrix across SystemExplorer platforms; status legend: Done / Planned / In Process / Legacy]
§ Server Applications (single-system simulation, and multi-system simulation with simulated Ethernet):
  Webserver, Netperf, SSJ, DaCapo, HPC, IOzone, DB
§ Mobile Applications (understanding how real application workloads and operating systems stress our IP):
  AR, Angry Birds, AndEBench, ToF, BBench, Replica, JS V8 Engine, Video Playback, Wireless Display, Video Conferencing, EEMBC, SPEC2000, AppLaunch (includes kernel support), Graphics (Taiji, Egypt), CaffeineMark, WPS, Vellamo HTML5, Vellamo Metal, RLBench, UI Twiddle
§ HPC Applications: Mantevo, CESAR, ExaCT, …
13
ARM gem5 Usage Continues to Grow
§ ARM gem5 downloads exceed both x86 and Alpha
[Chart: gem5 downloads per month (0 to 300) by ISA (alpha, arm, x86), showing ARM overtaking x86, then overtaking alpha, to become #1]
14
gem5 Visualization with Streamline
15
SystemExplorer Dhrystone Correlation
16
SystemExplorer SPECint2000 Correlation
17
SystemExplorer EEMBC Correlation
18
High Performance Computing
§ High performance computing (HPC) is becoming much more pervasive.
§ Power efficiency and integration are becoming key factors in both large-scale and commercial HPC.
§ 2018-2022 DARPA/DOD/DOE visions for HPC, at roughly 50 GFLOPS/W (20 pJ/FLOP):
  § 20W Chip: Teraflop
  § 5KW Chassis: Terascale
  § 20KW Rack: Petascale
  § 20MW Data Center: Exaflop
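The chip, rack, and data-center figures are the same efficiency target expressed at different scales; a quick unit check (ordinary arithmetic, not from the slides) confirms both forms agree:

```latex
\frac{1\ \text{J}}{50 \times 10^{9}\ \text{FLOP}} = 2 \times 10^{-11}\ \text{J/FLOP} = 20\ \text{pJ/FLOP},
\qquad
20\ \text{MW} \times 50\ \tfrac{\text{GFLOPS}}{\text{W}} = 10^{18}\ \text{FLOP/s} = 1\ \text{exaflop/s}.
```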
19
Why does ARM care about HPC?
§ We expect the challenges HPC experiences today to be similar to the enterprise challenges of tomorrow
  § Data center networking is getting more advanced
  § Energy will forever be a concern
§ ARM's long-term vision is for ARM technology to be in all levels of compute
  § Five years ago we announced the Cortex-M (microcontroller) series
  § ARM powers many hard real-time systems (radio, automotive, etc.)
  § Mobile devices
  § Servers
  § HPC is the only place you don't find ARM technology today, and we aim to change that
20
First steps in ARM HPC: Mont-Blanc
§ Supercomputer investigation based on embedded (ARM) technology
§ Funded under FP7
  § 3-year Integrated Project (started October 2011)
  § Budget: 14.5 M€ (8.1 M€ from EC)
§ Project goals: a physical prototype based on available embedded (ARM) technology, and a design of a full next-generation system
§ Consortium includes experienced HPC developers and users
21
Mont-Blanc Roadmap
A big challenge, and a huge opportunity for Europe
• Prototypes are critical to accelerate software development
• System software stack + applications
[Chart: GFLOPS/W roadmap, 2011 to 2017: a 256-node, 250 GFLOPS, 1.7 kW prototype "built with the best of the market"; a follow-on "built with the best that is coming"; then "what is the best that we could do?"]
22
US DoE Exascale Timeline
23
Goals
§ Port co-design center proxy applications to the ARM platform and take baseline measurements
§ Also execute the HPC Challenge and FFTW benchmarks to complement the proxy applications
§ Execute the same set of workloads on gem5, with a configuration similar to an ARM hardware platform, to get an idea of how well the simulator correlates
§ Use the results as a baseline for understanding the current state of ARM for HPC, future optimizations, and sensitivity studies
§ Since the national labs aren't as interested in 32-bit, use the process to refine methodology until 64-bit hardware and/or simulator becomes available
24
Baseline Workload Characterization
[Diagram: workloads from the National Labs are packaged into an HPC disk image; characterization feeds performance projection and design sensitivity studies, shared with the co-design centers and RTL simulation]
25
High Performance Computing Challenge
§ DARPA benchmark suite established to help evaluate systems in the HPCS program (which ultimately produced the Cray Cascade and IBM PERCS machines)
  § LINPACK – stresses peak floating point
  § PTRANS – rate of transfer of large arrays
  § GUPS – random updates of memory
  § FFT – Fast Fourier Transform
  § STREAM – measures sustainable memory bandwidth
  § DGEMM – double-precision general matrix multiply
§ Generally run across a cluster with MPI, but can run single node and single core
§ Configurable to scale to different working set sizes
§ http://icl.cs.utk.edu/hpcc
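To give a concrete flavor of these kernels, here is a minimal STREAM-triad-style loop in C. This is an illustrative sketch, not the official STREAM benchmark; the array size, single-pass timing, and lack of warm-up iterations are simplifying assumptions:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)  /* ~16M doubles per array; assumed size, should exceed the caches */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    double *c = malloc(N * sizeof *c);
    const double scalar = 3.0;
    if (!a || !b || !c) return 1;

    for (long i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* STREAM "triad": a[i] = b[i] + scalar * c[i].
       Bandwidth accounting counts 3 arrays moved: 2 loads + 1 store per element. */
    for (long i = 0; i < N; i++)
        a[i] = b[i] + scalar * c[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs  = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    double bytes = 3.0 * N * sizeof(double);
    printf("triad: %.2f GB/s\n", bytes / secs / 1e9);
    free(a); free(b); free(c);
    return 0;
}
```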
26
Mantevo Proxy Applications Suite
§ Developed at Sandia National Labs as an outgrowth of the Trilinos project, a collection of open-source scientific libraries, applications, and benchmarks
§ Goals:
  § Predict performance of real applications in new situations.
  § Aid computer systems design decisions.
  § Foster communication between applications, libraries, and computer systems developers.
  § Guide application and library developers in algorithm and software design choices for new systems.
  § Provide open source software to promote informed algorithm, application, and architecture decisions in the HPC community.
§ Released as open source: http://mantevo.org
27
Co-Design Center Apps
§ CESAR – Center for Exascale Simulation of Advanced Reactors
  • Thermal hydraulics: fluid codes (NEK5000)
  • Neutronics: neutronics codes (MOCFE and OpenMC)
  • Coupling and data analytics for data-intensive tasks: cian
§ ExaCT – Center for Exascale Simulation of Combustion in Turbulence
  • Exp_CNS_NoSpec: a simple stencil-based test code
  • MultiGrid_C: a multigrid-based solver for a model linear elliptic system based on a centered second-order discretization
  • vodeDriver: chemical combustion kinetics
§ ExMatEx – Materials in Extreme Environments
  • CoMD: molecular dynamics
  • LULESH: Lagrangian explicit shock hydrodynamics
  • VPFFT: crystal viscoplasticity
28
Workloads, benchmarks, & miniapps
Linpack, DGEMM, FFT, PTRANS, STREAM, GUPS, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh
29
Workloads Instruction Mix
[Chart: instruction mix for Linpack, DGEMM, FFT, miniMD, OpenMD, Nekbone, miniFE, HPCCG, PHDmesh, PTRANS, STREAM, and GUPS, broken down into Memory, Integer, SIMD Integer, Float, and SIMD Float]
30
gem5 Methodology
§ Boot scripts are in m5-obj/config/boot/hpc
§ Base.rcS creates a checkpoint after boot and a 60-second "rest" period. It is set up to re-read the workload script after the checkpoint, so that the workload can be configured during restore, skipping the boot period.
§ Configs are self-contained in workloads; output is sent to the simulation host via m5 writefile.
§ I've got bundled run scripts which handle establishing the base checkpoint, and restoring the checkpoint and executing workloads in atomic mode, A15, and A15 with periodic stats enabled
§ Runs are parameterized so that a complete run can finish in a reasonable amount of time with timing-approximate simulation
§ Disk image available, optimized for A15
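For workloads that need finer control than the rcS scripts provide, gem5 also exposes pseudo-ops callable from guest code. Below is a minimal sketch in C of bracketing a region of interest, assuming the m5op.h header and helper library from gem5's util/m5 tree are available; run_workload is a hypothetical stand-in for a real benchmark kernel:

```c
#include "m5op.h"  /* from gem5's util/m5; link against the matching m5op object.
                      Outside the simulator these calls are undefined, so guard
                      real code accordingly. */

/* hypothetical stand-in for the real benchmark kernel */
static void run_workload(void) {
    volatile double x = 0.0;
    for (long i = 0; i < 100000000L; i++) x += 1.0;
}

int main(void) {
    m5_checkpoint(0, 0);   /* take a checkpoint here; timing runs restore from it */
    m5_reset_stats(0, 0);  /* discard boot/warm-up statistics */
    run_workload();        /* region of interest, measured with timing models */
    m5_dump_stats(0, 0);   /* write this phase's stats to m5out/stats.txt */
    m5_exit(0);            /* terminate the simulation */
    return 0;
}
```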
31
gem5 Correlation – Simple Memory
[Chart: gem5 vs. hardware correlation, -100% to +100%, for G-HPL MFlop/s, G-FFTR MFlop/s, EP-DGEMM MFlop/s, G-PTRANS GB/s, EP-STREAM GB/s, GUPS, FFTW (SP), CG_MFLOP/s, HPCCG P1 FLOP/s, MD time, phdMesh, and CoMD 8k]
32
miniFE – finite element simulation
§ Assembles a sparse linear system from the steady-state conduction equation on a brick-shaped problem domain of linear 8-node hex elements, then solves the linear system using a simple un-preconditioned conjugate-gradient algorithm.
§ Thus the kernels that it contains are:
  § Computation of element operators (diffusion matrix, source vector)
  § Assembly (scattering element operators into sparse matrix and vector)
  § Sparse matrix-vector product (during CG solve)
  § Vector operations (level-1 BLAS: axpy, dot, norm)
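For readers unfamiliar with the solver, the CG-phase kernels above (SpMV, axpy, dot) fit together as in this illustrative sketch of un-preconditioned CG in C over a CSR matrix. This is generic textbook CG, not miniFE's actual code:

```c
#include <math.h>
#include <stdlib.h>

/* Compressed sparse row (CSR) matrix */
typedef struct { int n; const int *rowptr, *col; const double *val; } Csr;

/* y = A*x : the sparse matrix-vector product kernel */
static void spmv(const Csr *A, const double *x, double *y) {
    for (int i = 0; i < A->n; i++) {
        double s = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            s += A->val[k] * x[A->col[k]];
        y[i] = s;
    }
}

/* level-1 BLAS style helpers */
static double dot(int n, const double *x, const double *y) {
    double s = 0.0;
    for (int i = 0; i < n; i++) s += x[i] * y[i];
    return s;
}
static void axpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}

/* Un-preconditioned CG: solve A x = b for symmetric positive-definite A,
   starting from the initial guess in x. Returns iterations used. */
int cg_solve(const Csr *A, const double *b, double *x, int maxit, double tol) {
    int n = A->n;
    double *r  = malloc(n * sizeof *r);
    double *p  = malloc(n * sizeof *p);
    double *Ap = malloc(n * sizeof *Ap);
    spmv(A, x, Ap);
    for (int i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
    double rr = dot(n, r, r);
    int it = 0;
    while (it < maxit && sqrt(rr) > tol) {
        spmv(A, p, Ap);                 /* dominant kernel in the solve */
        double alpha = rr / dot(n, p, Ap);
        axpy(n,  alpha, p,  x);         /* x += alpha * p   */
        axpy(n, -alpha, Ap, r);         /* r -= alpha * A*p */
        double rr_new = dot(n, r, r);
        for (int i = 0; i < n; i++)     /* p = r + beta * p */
            p[i] = r[i] + (rr_new / rr) * p[i];
        rr = rr_new;
        it++;
    }
    free(r); free(p); free(Ap);
    return it;
}
```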
33
Profile: miniFE
§ Language & Runtime: reference code in C++; alternate versions for OpenMP, Cilk, Chapel, qthreads, etc.
§ Library Dependencies: none
§ SLOCCount: 2,872 lines of code
§ A15 Performance Characteristics:
  § Run Time: 217,413,167 cycles
  § Max Heap Size: 14.54 MB
  § CPI: 1.6958
  § L1D Miss Rate: 2.5%
  § L2 Miss Rate: 6.68%
  § Branch Mispredicts: 6.36%
[Pie chart: instruction mix across Int, Float, SIMD Int, SIMD Float, Memory, and Other]
34
miniFE gem5 Cache Occupancy
35
miniFE Streamline Visualization
36
Workloads: Next Steps
§ More benchmarks
  § Get big-data analytic mini-apps and benchmarks working (Graph 500, Mantevo analytics mini-app, others?)
  § Get an ExaCT benchmark working; incorporate forthcoming ASC benchmarks
§ More variations
  § Multinode MPI, PGAS, and other runtimes
  § OpenCL variants
  § Hand-coded NEON-optimized versions of key benchmarks
§ More accuracy
  § Continue calibrating the gem5 memory system against hardware to increase accuracy of memory-bound benchmarks
§ Systems software sensitivity study
  § OpenMPI versus MPICH versus LAMPI on ARM
  § Operating system version (3.7 has THP)
  § armcc vs. gcc vs. gcc-dragonegg vs. clang (etc.)
§ Transition to 64-bit gem5 (and hardware) when available
§ Integrate Mont-Blanc benchmarks and runtimes
§ Roll bare-metal versions of the co-design center workloads to make them more accessible to design teams
37
Simulation Driven Challenges
§ Performance
  § When running in functional mode (atomic), performance is in MIPS; when running in cycle-approximate mode (with memory models, cache models, etc.), simulation runs in KIPS, though the longer runtimes with timing models give more representative results (see the arithmetic sketch after this list)
  § The current methodology works off atomic checkpoints followed by short timing measurements, but can be refined to better represent multi-phase workloads
§ Scale
  § gem5 is currently inherently serial; adding cores or nodes to the simulation has a multiplicative effect
  § Multi-threading the simulation model at the core, node, and cluster levels could help address this problem, but may impact the granularity of timing accuracy
§ Correlation
  § Correlating a single-core simulation is hard; correlating multi-core is extremely difficult, as is multi-node
§ Sensitivity Study State Space Explosion
  § Many knobs to turn; determining which ones to turn in combination for the best effect is an ongoing research problem
§ Visualization
  § Need better ways of visualizing performance characteristics, particularly at scale
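As an illustration of that slowdown, with assumed round-number rates of 100 MIPS (atomic) and 100 KIPS (timing), neither measured in this work, a billion-instruction workload goes from seconds to hours:

```latex
t = \frac{N}{R}: \qquad
\frac{10^{9}\ \text{inst}}{10^{8}\ \text{inst/s}} = 10\ \text{s (atomic)},
\qquad
\frac{10^{9}\ \text{inst}}{10^{5}\ \text{inst/s}} = 10^{4}\ \text{s} \approx 2.8\ \text{h (timing)}.
```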
38
Future Work: Integration with SST
§ SST: The Structural Simulation Toolkit
  § Maintained by Sandia National Labs
  § Component-based discrete event model
  § Already uses gem5 as a component (but not well integrated with the ARM variant)
§ Potential to help us scale out simulation, as well as integrate with other simulations (fabric, etc.) to allow for end-to-end simulation of a large-scale supercomputer
39
Links
§ More info on ARM, including research papers: http://infocenter.arm.com
§ gem5: http://www.m5sim.org
§ SST: http://sst.sandia.gov
§ Mont-Blanc: http://montblanc-project.eu
§ Exascale Initiative: http://sites.google.com/a/lbl.gov/exascale-initiative/
§ Co-Design Center Proxy Apps:
  § Mantevo: http://mantevo.org
  § ExMatEx: http://exmatex.lanl.gov
  § ExaCT: http://exactcodesign.org
  § CESAR: http://cesar.mcs.anl.gov
40
QUESTIONS? Thanks!