Fujitsu HPC and AI Processors · Title: Fujitsu HPC and AI Processors Author: FUJITSU...

Connect. Challenge. Inspire.

ISC 2017

June 20th 2017

Fujitsu HPC and AI Processors

Takumi Maruyama, Senior Director

AI Platform Business Unit

Advanced System Research & Development Unit

Agenda

K computer

Fujitsu’s latest processors

Future Fujitsu processors under development

Post K

AI processor: DLU

Summary

K Computer

10.51 PFlops (Top500, 2011/11)

38,621 GTEPS (Graph500, 2016/11)

602.7 TFLOPS (HPCG, 2016/11)

High Performance Processor

Liquid Cooling

4 Processors

Torus Network

Fujitsu Technologies in the K computer

864 racks

82,944 Compute nodes

5,184 IO nodes

High Density Rack

24 boards
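The system figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that "24 boards" means boards per rack, that "4 Processors" means processors per board, and that each compute node holds one processor. None of these readings is stated explicitly on the slide.

```python
# Figures quoted on the slide.
RACKS = 864
COMPUTE_NODES = 82_944
LINPACK_PFLOPS = 10.51   # Top500, 2011/11
HPCG_TFLOPS = 602.7      # HPCG, 2016/11

# Assumed readings of the slide labels (not stated explicitly in the source).
BOARDS_PER_RACK = 24
PROCESSORS_PER_BOARD = 4

# Compute nodes per rack, derived from the system totals.
nodes_per_rack = COMPUTE_NODES // RACKS

# The same quantity derived from the packaging labels, assuming one
# processor per compute node.
nodes_from_boards = BOARDS_PER_RACK * PROCESSORS_PER_BOARD

# Average Linpack result per compute node, in GFlops.
linpack_gflops_per_node = LINPACK_PFLOPS * 1e6 / COMPUTE_NODES

# HPCG result as a fraction of the Linpack result.
hpcg_fraction = HPCG_TFLOPS / (LINPACK_PFLOPS * 1000)

print(nodes_per_rack, nodes_from_boards)
print(f"{linpack_gflops_per_node:.1f} GFlops per node")
print(f"HPCG / Linpack = {hpcg_fraction:.1%}")
```

Under these assumptions the packaging labels and the system totals agree on the number of compute nodes per rack.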

The Latest Fujitsu Processors

Fujitsu Processor Development

Perpetual Evolution > 60 years:

Always Targeting No.1

[Figure: timeline of Fujitsu processor generations by period (2000～2003, 2004～2007, 2008～2011, 2012～2015, 2016～) and technology generation (250nm / ..., CMOS Cu)]

Product lines shown: Mainframe (GS8600, GS8800, GS8800B, GS8900), UNIX (SPARC64 generations), HPC (SPARC64 VIIIfx), Post-K

Features marked on the timeline: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, HPC-ACE, System on Chip, Hardware Barrier, Multi-core Multi-thread, Virtual Machine Architecture, Software on Chip, High-speed Interconnect, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, Error Checkers/History

Mainframe/UNIX/HPC + AI: incremental development

SPARC64™ XIfx Chip (HPC)

Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2 (256bit SIMD): Fujitsu’s ISA enhancements
• Sector Cache: Cache with SW controllability
• 24 MB L2 cache

20nm CMOS
• 3,750M transistors
• 2.2GHz

Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2 (in/out)
• Tofu2 125GB/s x 2 (in/out)

[Die photo labels: core, Assistant core, Tofu2 interface, Tofu2 controller, HMC interface, L2 cache, PCI interface, PCI controller]

Many (32+2) cores, Medium CPU GHz

SPARC64™ XII Chip (UNIX)

Architecture Features
• 12 cores x 8 threads
• SWoC (“Software on Chip”): Fujitsu’s ISA enhancements
• 32MB L3 cache
• Embedded MAC and IOC

20nm CMOS
• 25.8mm x 30.8mm
• 5,450M transistors
• 4.25GHz (up to 4.35GHz with “High Speed Mode” enabled)

Performance (peak)
• 417GIPS / 835GFlops
• 153GB/s memory throughput

[Die photo labels: Core, DDR4 interface, SERDES]

Multiple big cores, High CPU GHz
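The peak figures on the two chip slides can be related to core count and clock frequency. The sketch below divides each quoted peak by (cores x clock) to get the implied work per core per cycle. It assumes that only the 32 computing cores of the SPARC64 XIfx contribute to its peak and that the SPARC64 XII figures are quoted at the 4.35GHz "High Speed Mode" clock. The breakdown of 16 flops per cycle into SIMD lanes, FMA and two pipelines is also an assumption for illustration, since the slides do not state the number of floating-point pipelines.

```python
def per_core_per_cycle(peak_per_second: float, cores: int, clock_hz: float) -> float:
    """Peak rate divided by (cores x clock): implied work per core per cycle."""
    return peak_per_second / (cores * clock_hz)


# SPARC64 XIfx: 1.1 TFlops, 32 computing cores (assistant cores excluded), 2.2 GHz.
xifx_flops = per_core_per_cycle(1.1e12, cores=32, clock_hz=2.2e9)

# SPARC64 XII: 835 GFlops and 417 GIPS, 12 cores, assumed 4.35 GHz clock.
xii_flops = per_core_per_cycle(835e9, cores=12, clock_hz=4.35e9)
xii_instr = per_core_per_cycle(417e9, cores=12, clock_hz=4.35e9)

# Assumed breakdown of 16 double-precision flops per cycle on the XIfx:
# 256-bit SIMD / 64-bit elements = 4 lanes, 2 flops per FMA, 2 pipelines.
SIMD_BITS = 256
FP64_BITS = 64
FLOPS_PER_FMA = 2
ASSUMED_FMA_PIPES = 2  # hypothetical value, not given on the slide
flops_per_cycle = (SIMD_BITS // FP64_BITS) * FLOPS_PER_FMA * ASSUMED_FMA_PIPES

# Forward calculation of the XIfx peak from the assumed breakdown.
xifx_peak = 32 * 2.2e9 * flops_per_cycle

print(f"XIfx: {xifx_flops:.2f} flops/core/cycle")
print(f"XII:  {xii_flops:.2f} flops/core/cycle, {xii_instr:.2f} instructions/core/cycle")
print(f"XIfx peak from breakdown: {xifx_peak / 1e12:.4f} TFlops")
```

Both chips come out close to 16 flops per core per cycle, so the difference in peak performance follows mainly from the trade-off named on the slides: many cores at a medium clock versus fewer cores at a high clock.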

SPARC64™ XIfx (HPC) Pipeline and SPARC64™ XII (UNIX) Pipeline

Shared Micro-architecture

[Figures: pipeline block diagrams. One diagram shows the stages Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit; the other shows Fetch, Decode, Issue, Reg-Read, Execute, Cache and Memory, Commit.]

Blocks labeled in the diagrams: L1 I$ 64KB, L1 D$ 64KB, Branch Target Address, Branch Prediction, Pattern History Table, Local Pattern Table, Instruction Buffer, Decode & Issue, Program Counter x4, Control Registers x4, GPR 188 Registers, FPR 128x4 Reg., Commit, L2 Cache, L3 Cache, IOC, CPU-CPU I/F, Pipeline-0, Pipeline-1

Core counts shown: 34 cores (XIfx), 12 cores (XII)

Legend:
RSE: Reservation Station for Execution
RSA: Reservation Station for Address generation
RSF: Reservation Station for Floating-point
RSBR: Reservation Station for Branch
FUB: FPR Update Buffer

Future Fujitsu Processors

Under Development

- Post K

Project Overview

• RIKEN and Fujitsu are currently developing the post-K computer, which aims to be the most advanced general-purpose supercomputer in the world

Goals of Japan’s Post-K Development Project

• Application performance

• Low power consumption

• User convenience

• Ability to produce ground-breaking results

Japan’s Post-K Computer Development Project

Functions & Architecture                     Post-K          K computer
Processor
  Base ISA + SIMD Extensions                 ARMv8-A+SVE     SPARCv9+HPC-ACE
  SIMD width [bit]                           512             128
  FP16 (half precision) support              ✔               -
  FMA: Floating-point multiply and add       ✔               ✔
  Math. acceleration primitives              ✔ Enhanced      ✔
  Inter-core barrier                         ✔               ✔
  Sector cache                               ✔ Enhanced      ✔
  Hardware “prefetch” assist                 ✔ Enhanced      ✔
Interconnect
  Tofu                                       ✔ Enhanced      ✔

Post-K Processor and Interconnect Features

Fujitsu Processor, adopting ARM ISA and enhanced Tofu interconnect

Inherits and enhances the K computer’s innovative features

Post-K Processor Supports FP16

Provides optimized precision for a wide range of applications

• Superior performance

• Reduces required bandwidth and power consumption

Target applications:

• Existing numerical applications

• Brand-new applications, including Deep Learning

[Figure labels: High Performance, More Applications; precision axis: Double Precision, Single Precision]
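The effect of element size on SIMD width and memory traffic can be made concrete with a little arithmetic. The sketch below counts how many elements fit in one SIMD register at the widths quoted in the comparison table (512 bit for Post-K, 128 bit for the K computer) and how many bytes a fixed-size array occupies at each precision. The helper names and the array length are hypothetical, and the sketch says nothing about actual instruction throughput, which the slides do not specify.

```python
POST_K_SIMD_BITS = 512   # from the comparison table
K_SIMD_BITS = 128        # from the comparison table

# Element sizes in bits for the three floating-point precisions.
ELEMENT_BITS = {"FP64": 64, "FP32": 32, "FP16": 16}


def lanes(simd_bits: int, element_bits: int) -> int:
    """Number of elements that fit in one SIMD register."""
    return simd_bits // element_bits


def array_bytes(num_elements: int, element_bits: int) -> int:
    """Memory footprint of an array, and so the traffic needed to stream it once."""
    return num_elements * element_bits // 8


N = 1_000_000  # hypothetical array length, for illustration only

for name, bits in ELEMENT_BITS.items():
    print(
        f"{name}: {lanes(POST_K_SIMD_BITS, bits):2d} lanes at 512 bit, "
        f"{array_bytes(N, bits):,} bytes for {N:,} elements"
    )

# The K computer has no FP16 support, so only FP64 and FP32 apply to it.
print("K computer FP64 lanes:", lanes(K_SIMD_BITS, ELEMENT_BITS["FP64"]))
```

Halving the element size doubles the elements per register and halves the bytes moved, which is the basis for the bandwidth and power reduction claimed for FP16.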

Future Fujitsu Processor

Development

- AI Processor (DLU™)

Processor Designed for Deep Learning

Features of DLU Architecture designed for Deep Learning

Low power consumption design

Optimized precision

➔ Goal: 10x Performance / Watt compared to competitors

Scalable design with Tofu interconnect technology

➔Ability to handle large-scale neural networks

The photograph is for illustration only and may differ from the actual product.

DLU (Deep Learning Unit)

FY2018 ～

Utilizing technologies derived from the K computer

DLU Design Target

Performance

Conflicting Demands

• Fewer transistors

• less control logic

• fewer execution units/$

• Lower Frequency

• More transistors

• state of the art O-O-O

• many execution units/$

• Higher Frequency

High Deep Learning performance / watt: 10x performance / watt

However, high performance and low power are not easy to achieve at the same time

Need for a New Architecture

A new architecture is required for the DLU to achieve the target.

The architecture is domain specific – Deep Learning

[Figure labels: General Purpose Computing, Supercomputer, Accelerator, Quantum Computer; Learning, Inference, Processing; axis: Specialization Required]

What’s the New Architecture for the DLU?

Conventional Architecture                      The New Architecture

General Use                                    1. Domain Specific
  Complicated O-O-O cores                        Domain specific cores

High Precision                                 2. Optimal Precision
  Double/Single precision FP                     Deep Learning Integer

Sequential + Parallel                          3. Massively Parallel
  Multiple strong cores                          Many cores w/ on-chip network

Domain specific, Optimal precision, and Massively parallel.

DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element

DLU Architecture

[Figure: DLU (Deep Learning Unit) block diagram. Labels: Host I/F, DPU-0 and further DPUs, each made up of DPEs, Inter-chip]

Large scale DLU interconnect through off-chip network

1. Domain specific

Domain specific Cores
- Newly designed ISA
- Simplified μ-architecture
- Fully software visible and controllable
- Heterogeneous cores ★
- DPE and Large RF ★

2. Optimal Precision

Deep Learning Integer ★

3. Massively Parallel

Many DPUs with an On-chip Network

Heterogeneous Cores

[Figure labels: Master, Memory, Memory Controller, Instructions/Data, CH-in, CH-out]

DPU: Execution
・Execute DL operations based on master core’s control

How to utilize many DPUs (convolution example)
・one CH-out / DPU
・multiple batch / DPU

Master Core: Memory Access and DPU control

• Push & pull instructions and data for DPUs
• Start/stop execution of DPUs

The combination of a few large cores (Master) and many small execution cores (DPUs) delivers more performance at lower power consumption than a conventional homogeneous structure
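The work split in the convolution example (one output channel per DPU, multiple batch items per DPU) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the DLU programming model: the functions `dpu_compute` and `master_run` are hypothetical, the convolution is reduced to a 1x1 kernel so spatial dimensions are ignored, the DPUs are simulated one after another rather than in parallel, and output channels are dealt out round-robin when there are more channels than DPUs.

```python
from typing import Dict, List

Matrix = List[List[float]]


def dpu_compute(dpu_id: int, num_dpus: int, inputs: Matrix, weights: Matrix) -> Dict[int, List[float]]:
    """Work of one hypothetical DPU: its output channels, for every batch item."""
    result: Dict[int, List[float]] = {}
    # Round-robin assignment of output channels (CH-out) to DPUs.
    for ch_out in range(dpu_id, len(weights), num_dpus):
        # 1x1 convolution: dot product over input channels (CH-in) per batch item.
        result[ch_out] = [
            sum(x * w for x, w in zip(sample, weights[ch_out])) for sample in inputs
        ]
    return result


def master_run(num_dpus: int, inputs: Matrix, weights: Matrix) -> Matrix:
    """Hypothetical master core: hands out work and gathers the results."""
    outputs = [[0.0] * len(weights) for _ in inputs]
    for dpu_id in range(num_dpus):  # sequential stand-in for parallel DPUs
        for ch_out, column in dpu_compute(dpu_id, num_dpus, inputs, weights).items():
            for batch_index, value in enumerate(column):
                outputs[batch_index][ch_out] = value
    return outputs


# Batch of 2 samples with 3 input channels; 4 output channels.
inputs = [[1.0, 2.0, 3.0], [0.5, -1.0, 2.0]]
weights = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Reference result computed without any partitioning.
reference = [[sum(x * w for x, w in zip(sample, row)) for row in weights] for sample in inputs]
result = master_run(2, inputs, weights)
print(result == reference)
```

Because each output channel depends only on the shared inputs and its own weights, the DPUs in this sketch never need to exchange partial results, which is one reason the output-channel split is a natural fit for many small execution cores.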

DPE & Large RF (Register File)

DPU: 128 SIMD* / 16 DPE

A DPU consists of 16 DPEs connected with an on-chip network

Each DPE includes a large RF and wide SIMD execution units to realize an efficient Deep Learning engine.

Unlike a cache, the RF is fully SW controllable, so software can extract the full HW potential

DPE: 8 SIMD* with large RF (~100x of typical CPU core)

* For FP32

Register File

        Name            RF/$ structure
UNIX    SPARC64 XII     RF + $
HPC     SPARC64 XIfx    RF + sector $
AI      DLU             Large RF

More SW controllability (from top to bottom)

Deep Learning Integer

Fujitsu’s “Deep Learning Integer” achieves the accuracy needed for Deep Learning with only 16-bit or 8-bit data (i.e. lower power consumption than FP32)

[Figure: Data Size and Precision. Axes: Data Size (Small to Large) and Effective Precision; marked: INT8, FP16, 16-bit, Required Precision for Deep Learning, Deep Learning Integer; note: 16/8bit area with minimum accuracy loss]

[Figure: DLU Data Type. Labels: int8 × int8, int16 × int16, Accumulator (>INT8,16; Int>16), HW gathered statistics]
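The slides do not define the Deep Learning Integer format, so the following is only a generic dynamic fixed-point sketch of the ideas named above: narrow integer operands, a wider accumulator, and a scale chosen from statistics of the data. The function names, the power-of-two scaling rule and the use of the maximum magnitude as the statistic are all assumptions for illustration and should not be read as Fujitsu's method.

```python
import math
from typing import List


def choose_shift(values: List[float], bits: int) -> int:
    """Pick a power-of-two scale so the largest magnitude fits in `bits` signed bits.

    The maximum magnitude stands in for hardware-gathered statistics (assumption).
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return 0
    limit = 2 ** (bits - 1) - 1
    return math.floor(math.log2(limit / max_abs))


def quantize(values: List[float], shift: int, bits: int) -> List[int]:
    """Scale by 2**shift, round, and clamp to the signed integer range."""
    limit = 2 ** (bits - 1) - 1
    return [max(-limit - 1, min(limit, round(v * 2 ** shift))) for v in values]


def dot_fixed(a: List[float], b: List[float], bits: int = 8) -> float:
    """Dot product with narrow integer operands and a wide integer accumulator."""
    shift_a, shift_b = choose_shift(a, bits), choose_shift(b, bits)
    qa, qb = quantize(a, shift_a, bits), quantize(b, shift_b, bits)
    # Python integers are unbounded, which models an accumulator wider than the inputs.
    accumulator = sum(x * y for x, y in zip(qa, qb))
    # Undo both scales to return to real-valued units.
    return accumulator / 2 ** (shift_a + shift_b)


a = [0.5, -0.25, 0.125, 0.9]
b = [1.5, 2.0, -0.75, 0.1]
exact = sum(x * y for x, y in zip(a, b))

dot8 = dot_fixed(a, b, bits=8)
dot16 = dot_fixed(a, b, bits=16)
print(f"exact={exact:.5f}  8-bit={dot8:.5f}  16-bit={dot16:.5f}")
```

In this sketch the scale adapts to the data, so the few available bits cover the range actually in use. That shows how a 16-bit or 8-bit integer can come close to a floating-point result for a given tensor.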

Deep Learning Integer Accuracy

(*) ImageNet(subset): image size=96x96, #categories=25

Deep Learning Integer (16bit/8bit)

Deep Learning Integer has shown accuracy similar to FP32 for Deep Learning

Deep Learning Integer (8bit)

Deep Learning Integer (16bit)

DLU Roadmap

Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors

1st Generation
• Host CPU required
• Inter-DLU direct connection

2nd Generation
• Embedded host CPU

Other special processors
• Neuromorphic
• Combinatorial optimization

Time axis: FY2018 to Future

* Subject to change without notice

Summary

Fujitsu Processor Design Style

Mainframe, UNIX, HPC: General Purpose
AI (DLU): Domain Specific (Deep Learning)

Instruction Set Architecture (HW-SW I/F)
• General Purpose: Standard ISA with FJ enhancements
• Domain Specific: newly developed ISA

Micro-architecture (CPU internal structure) ★Performance/RAS
• General Purpose: Shared [FJ development]
• Domain Specific: Simple + SW visible micro-architecture

Semiconductor Technology
• The latest semiconductor technology

Design Infrastructure
• Shared design infrastructure: Circuit, Methodology, People

Fujitsu Processor Direction

General purpose and Domain specific

Wider variety of processors in the future to meet different requirements.

[Figure labels: Supercomputer, SPARC64™ VII / VII+, SPARC64™ VIIIfx, Post-K; General Purpose, Domain Specific, HPC & AI, Diverge; axis: Specialization Required]

Summary

Fujitsu has designed processors for a long time (> 60 years)

Perpetual evolution over generations

General purpose computing

SPARC64 XIfx (HPC), SPARC64 XII (UNIX), and Post-K

Domain specific

New Architecture: Heterogeneous, DPE and large RF, Deep Learning Integer

Shared Design infrastructure: Circuit, Methodology, People

Fujitsu will continue to develop cutting-edge processors to meet the needs of a new era.
