Connect. Challenge. Inspire.
All Rights Reserved, Copyright© FUJITSU LIMITED 2015
ISC 2017
June 20th 2017
Fujitsu HPC and AI Processors
Takumi Maruyama, Senior Director
AI Platform Business Unit
Advanced System Research & Development Unit
Agenda
K computer
Fujitsu’s latest processors
HPC
UNIX
Future Fujitsu processors under development
Post K
AI processor: DLU
Summary
Copyright 2017 FUJITSU LIMITED
K Computer
WR#1
10.51 PFlops (Top500, 2011/11)
38,621 GTEPS (Graph500, 2016/11)
602.7 TFLOPS (HPCG, 2016/11)
Fujitsu Technologies in the K computer
• High Performance Processor: 8 cores
• Liquid Cooling: 4 processors
• Torus Network: 6D
• High Density Rack: 24 boards
• 864 racks
• 82,944 compute nodes
• 5,184 IO nodes
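The rack, board, and node counts on this slide are consistent with one another, which a short calculation makes explicit. The sketch below assumes that "24 boards" is per rack, that "4 processors" is per board, and that each processor is one compute node; the slide does not spell out those groupings, and the names are illustrative.

```python
# Consistency check of the K computer figures quoted on the slide.
# Assumptions (not stated on the slide): 24 boards per rack, 4 processors
# per board, and one processor per compute node.
RACKS = 864
BOARDS_PER_RACK = 24
PROCESSORS_PER_BOARD = 4
COMPUTE_NODES = 82_944
IO_NODES = 5_184


def compute_nodes_from_racks(racks: int, boards: int, processors: int) -> int:
    """Total compute nodes implied by the rack/board/processor hierarchy."""
    return racks * boards * processors


derived = compute_nodes_from_racks(RACKS, BOARDS_PER_RACK, PROCESSORS_PER_BOARD)
print(derived == COMPUTE_NODES)  # True under the assumptions above
print(IO_NODES / RACKS)          # I/O nodes per rack
```

Under these assumptions the quoted 82,944 compute nodes follow directly from 864 racks, and the I/O nodes divide evenly across the racks.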
The Latest Fujitsu Processors
Fujitsu Processor Development
Perpetual Evolution > 60 years: Always Targeting No.1
Mainframe/UNIX/HPC + AI: incremental development
[Figure: processor roadmap across the periods 2000~2003, 2004~2007, 2008~2011, 2012~2015, and 2016~, with Performance and Reliability as the two themes.
Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21 600, GS21 900, GS21 1600, GS21 2600, Next GS.
UNIX/HPC: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V, SPARC64 V+, SPARC64 VI, SPARC64 VII, SPARC64 VIIIfx, SPARC64 IXfx, SPARC64 X, SPARC64 X+, SPARC64 XIfx, SPARC64 XII, Next SPARC, Post-K (ARM).
AI: DLU.
Technology generations: 350nm, 250nm / 220nm, 180nm, 130nm, 90nm, 65nm, 45nm, 40nm, 28nm, 20nm (annotations: CMOS Cu, Tr=1B).
Technologies labeled on the figure: Store Ahead, Branch History, Prefetch, Single-chip CPU, Non-Blocking $, O-O-O Execution, Super-Scalar, L2$ on Die, HPC-ACE, System on Chip, Hardware Barrier, Multi-core Multi-thread, Virtual Machine Architecture, Software on Chip, High-speed Interconnect, $ ECC, Register/ALU Parity, Instruction Retry, $ Dynamic Degradation, Error Checkers/History.]
SPARC64™ XIfx Chip (HPC)
Architecture Features
• 32 computing cores + 2 assistant cores
• HPC-ACE2 (256bit SIMD): Fujitsu’s ISA enhancements
• Sector Cache: Cache with SW controllability
• 24 MB L2 cache
20nm CMOS
• 3,750M transistors
• 2.2GHz
Performance (peak)
• 1.1TFlops
• HMC 240GB/s x 2 (in/out)
• Tofu2 125GB/s x 2 (in/out)
[Figure: SPARC64 XIfx chip floorplan showing two groups of 16 cores, two assistant cores, two L2 caches, HMC interfaces, MACs, the Tofu2 interface and controller, and the PCI interface and controller]
Many (32+2) cores, Medium CPU GHz
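The quoted 1.1TFlops peak can be roughly reconstructed from the other figures on the slide. The sketch below takes the core count, SIMD width, and clock from the slide; the use of double precision, counting a fused multiply-add as two operations, and two SIMD FMA pipelines per core are assumptions chosen because they reproduce the quoted number, not details given in the source.

```python
# Rough reconstruction of the quoted 1.1 TFlops peak for SPARC64 XIfx.
# From the slide: 32 computing cores, 256-bit SIMD, 2.2 GHz.
# Assumptions (not on the slide): double precision, an FMA counted as
# 2 flops, and 2 SIMD FMA pipelines per core.
CORES = 32
CLOCK_HZ = 2.2e9
SIMD_BITS = 256
FP64_BITS = 64
FLOPS_PER_FMA = 2
FMA_PIPES_PER_CORE = 2  # assumed


def peak_flops(cores: int, clock_hz: float, simd_bits: int,
               elem_bits: int, pipes: int) -> float:
    """Peak floating-point operations per second for the whole chip."""
    lanes = simd_bits // elem_bits
    return cores * clock_hz * lanes * FLOPS_PER_FMA * pipes


peak = peak_flops(CORES, CLOCK_HZ, SIMD_BITS, FP64_BITS, FMA_PIPES_PER_CORE)
print(f"{peak / 1e12:.2f} TFlops")
```

The result lands slightly above 1.1 TFlops, in line with the rounded value on the slide; the two assistant cores are left out of the count.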
SPARC64™ XII Chip (UNIX)
Architecture Features
• 12 cores x 8 threads
• SWoC (“Software on Chip”): Fujitsu’s ISA enhancements
• 32MB L3 cache
• Embedded MAC and IOC
20nm CMOS
• 25.8mm x 30.8mm
• 5,450M transistors
• 4.25GHz (up to 4.35GHz with “High Speed Mode” enabled)
Performance (peak)
• 417GIPS / 835GFlops
• 153GB/s memory throughput
[Figure: SPARC64 XII chip floorplan showing 12 cores with per-core L2 caches, four L3 cache blocks, two MACs, two DDR4 interfaces, SERDES blocks, PCIe Gen3, the interconnect, and Interconnect & Coherence Control]
Multiple big cores, High CPU GHz
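Dividing the quoted peak figures by cores and clock gives the per-core, per-cycle rates they imply. The sketch below uses only numbers from the slide; which of the two clocks the peak values refer to is not stated, so both are tried, and the function name is illustrative.

```python
# Per-core, per-cycle rates implied by the SPARC64 XII peak figures.
# From the slide: 12 cores, 4.25 GHz (4.35 GHz in "High Speed Mode"),
# 417 GIPS and 835 GFlops peak. Which clock the peaks refer to is an
# assumption explored here, not something the slide states.
CORES = 12
PEAK_GIPS = 417.0
PEAK_GFLOPS = 835.0


def per_core_per_cycle(peak_giga: float, cores: int, clock_ghz: float) -> float:
    """Operations per cycle per core for a chip-wide peak given in giga-ops/s."""
    return peak_giga / (cores * clock_ghz)


for clock_ghz in (4.25, 4.35):
    ipc = per_core_per_cycle(PEAK_GIPS, CORES, clock_ghz)
    fpc = per_core_per_cycle(PEAK_GFLOPS, CORES, clock_ghz)
    print(f"{clock_ghz} GHz: {ipc:.2f} instr/cycle/core, {fpc:.2f} flops/cycle/core")
```

At 4.35GHz the results come out very close to 8 instructions and 16 floating-point operations per cycle per core, which suggests, without confirming, that the peaks are quoted for High Speed Mode.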
SPARC64™ XIfx (HPC) Pipeline
Pipeline stages: Fetch, Issue, Dispatch, Reg-Read, Execute, Cache and Memory, Commit
[Figure: per-core pipeline block diagram, replicated for 34 cores. Blocks shown: L1 I$ (64KB, 4 ways), Branch Target Address, Pattern History Table, Local Pattern Table, Decode & Issue, reservation stations (RSE, RSA, RSF, RSBR), GUB, GPR (188 registers), EXA, EXB, EXC, EXD, EAGA, EAGB, FPR (128x4 registers), FUB, FLA, FLB, Fetch Port, Store Port, L1 D$ (64KB, 4 ways), Write Buffer, CSE, PC, Control Registers. Shared blocks: L2$, MAC, IOC, CPU-CPU I/F.]
SPARC64™ XII (UNIX) Pipeline
Pipeline stages: Fetch, Decode, Issue, Reg-Read, Execute, Cache and Memory, Commit
Shared Micro-architecture
[Figure: per-core pipeline block diagram, replicated for 12 cores. Blocks shown: L1 Instruction Cache (64KB), Branch Prediction, Instruction Buffer, Decode, GUB, GPR x4, FPR x4, EXA, EXB, EAGA, EAGB, FLA, FLB, FLC, FLD, Pipeline-0, Pipeline-1, Fetch Port, Store Port, Store Buffer, L1 Data Cache (32KB), dTLB, L2 Cache, Commit Stack Entry, Commit, Program Counter x4, Control Registers x4. Shared blocks: L3 Cache, MAC, IOC, CPU-CPU i/f.]
RSE: Reservation Station for Execution
RSA: Reservation Station for Address generation
RSF: Reservation Station for Floating-point
RSBR: Reservation Station for Branch
FUB: FPR Update Buffer
Future Fujitsu Processors
Under Development
- Post K
Japan’s Post-K Computer Development Project
Project Overview
• RIKEN and Fujitsu are currently developing the post-K computer, which
aims to be the most advanced general-purpose supercomputer in the
world
Goals of Japan’s Post-K Development Project
• Application performance
• Low power consumption
• User convenience
• Ability to produce ground-breaking results
Post-K Processor and Interconnect Features

Functions & Architecture                  Post-K            K computer
Processor
  Base ISA + SIMD Extensions              ARMv8-A+SVE       SPARCv9+HPC-ACE
  SIMD width [bit]                        512               128
  FP16 (half precision) support           ✔                 -
  FMA: Floating-point multiply and add    ✔                 ✔
  Math. acceleration primitives           ✔ Enhanced        ✔
  Inter-core barrier                      ✔                 ✔
  Sector cache                            ✔ Enhanced        ✔
  Hardware “prefetch” assist              ✔ Enhanced        ✔
Interconnect
  Tofu                                    ✔ Enhanced        ✔

Fujitsu Processor, adopting ARM ISA and enhanced Tofu interconnect
Inherits and enhances the K computer’s innovative features
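The SIMD widths in the comparison translate into a number of elements handled per vector operation. The sketch below is plain width arithmetic using the 512-bit and 128-bit figures from the slide and the standard IEEE 754 element sizes; actual lane counts depend on how each ISA packs elements, which the slide does not describe.

```python
# Elements per SIMD vector for the widths quoted in the comparison.
# 512 and 128 bits come from the slide; element sizes are the standard
# IEEE 754 widths. This is width arithmetic only, not a statement about
# how either ISA actually packs its registers.
ELEMENT_BITS = {"FP64": 64, "FP32": 32, "FP16": 16}


def lanes(simd_bits: int, element_bits: int) -> int:
    """Number of elements of the given size that fit in one vector."""
    return simd_bits // element_bits


post_k = {name: lanes(512, bits) for name, bits in ELEMENT_BITS.items()}
k_fp64 = lanes(128, ELEMENT_BITS["FP64"])
print(post_k)  # {'FP64': 8, 'FP32': 16, 'FP16': 32}
print(k_fp64)  # 2
```

Going from 128 to 512 bits quadruples the double-precision elements per vector, and FP16 support doubles the count again relative to FP32.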
Post-K Processor Supports FP16
Provides optimized precision for a wide range of applications
• Superior performance
• Reduces required bandwidth and power consumption
Target applications:
• Existing numerical applications
• Brand-new applications, including Deep Learning
[Figure: Double Precision, Single Precision, and Half Precision, captioned “High Performance for More Applications”]
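The bandwidth argument for FP16 comes down to bytes per value, traded against precision. The sketch below uses Python's standard `struct` format codes for IEEE 754 double, single, and half precision; the array length is a hypothetical example and is not taken from the slides.

```python
import struct

# Bytes needed per value in each IEEE 754 format, using the standard
# library's struct codes: 'd' = FP64, 'f' = FP32, 'e' = FP16.
FORMATS = {"FP64": "<d", "FP32": "<f", "FP16": "<e"}


def bytes_per_value(fmt: str) -> int:
    """Storage size of one value in the given struct format."""
    return struct.calcsize(fmt)


def round_trip(value: float, fmt: str) -> float:
    """Store a value in the given format and read it back."""
    return struct.unpack(fmt, struct.pack(fmt, value))[0]


n_values = 1_000_000  # hypothetical array length
for name, fmt in FORMATS.items():
    size_mb = n_values * bytes_per_value(fmt) / 1e6
    print(f"{name}: {size_mb:.0f} MB, 0.1 stored as {round_trip(0.1, fmt)!r}")
```

Half precision needs a quarter of the bytes of double precision for the same number of values, while the round-tripped value shows the precision given up in exchange.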
Future Fujitsu Processor
Development
- AI Processor (DLU™)
Processor Designed for Deep Learning
DLU™ (Deep Learning Unit), FY2018 ~
Features of DLU
• Architecture designed for Deep Learning
• Low power consumption design
• Optimized precision
➔ Goal: 10x Performance / Watt compared to competitors
• Scalable design with Tofu interconnect technology
➔ Ability to handle large-scale neural networks
Utilizing technologies derived from the K computer
The photograph is an illustration and differs from the actual product.
DLU Design Target
High Deep Learning performance / watt: 10x performance / watt
However, high performance and low power are not easy to achieve at the same time
Conflicting Demands
High Performance
• More transistors
• state of the art O-O-O
• many execution units/$
• Higher Frequency
Low Power
• Fewer transistors
• less control logic
• fewer execution units/$
• Lower Frequency
Need for a New Architecture
A new architecture is required for the DLU to achieve the target.
The architecture is domain specific – Deep Learning
[Figure: computing categories plotted by Specialization versus Required Processing: General Purpose Computing, Supercomputer, Accelerator, Deep Learning, Inference, Brain Computing, Quantum Computer]
What’s the New Architecture for the DLU?
                        Conventional Architecture                      The New Architecture
1. Domain Specific      General Use: Complicated O-O-O cores           Domain specific cores
2. Optimal Precision    High Precision: Double/Single precision FP     Deep Learning Integer
3. Massively Parallel   Sequential + Parallel: Multiple strong cores   Many cores w/ on-chip network
Domain specific, Optimal precision, and Massively parallel.
DLU Architecture
DPU: Deep learning Processing Unit, DPE: Deep learning Processing Element
[Figure: DLU™ (Deep Learning Unit) block diagram with HBM2, Host I/F, DPU-0 through DPU-n each containing multiple DPEs, and an Inter-chip I/F; large scale DLU interconnect through off-chip network]
1. Domain specific
Domain specific Cores
- Newly designed ISA
- Simplified μ-architecture
- Fully software visible and controllable
- Heterogeneous cores ★
- DPE and Large RF ★
2. Optimal Precision
Deep Learning Integer ★
3. Massively Parallel
Many DPUs with an On-chip Network
Heterogeneous Cores
Master Core: Memory Access and DPU control
• Push & Pull instructions and data for DLUs.
• Start/stop execution of DLUs
DPU: Execution
• Execute DL operations based on master core’s control
How to utilize many DPUs (convolution example)
• one CH-out / DPU
• multiple batch / DPU
[Figure: the Master core, attached to Memory through a Memory Controller, distributes Instructions/Data to many DPUs; a convolution from CH-in to CH-out is mapped across the DPUs]
The combination of a few large cores (Master) and many small execution cores (DPU) results in more performance with less power consumption, compared to a conventional homogeneous structure
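The convolution mapping on this slide ("one CH-out / DPU", "multiple batch / DPU") can be sketched in a few lines. Everything concrete below is hypothetical: the DPU count, the round-robin assignment, the reduction to a 1x1 convolution on one pixel per sample, and all function names are illustrative choices, since the slides do not describe the actual scheduling.

```python
# Sketch of "one CH-out / DPU, multiple batch / DPU" for a 1x1 convolution.
# The DPU count, the round-robin assignment, and all names are
# hypothetical; the slides do not describe the actual scheduling.
from typing import Dict, List

Batch = List[List[float]]  # batch[b][c_in]: one pixel per sample, for brevity


def assign_channels(n_out: int, n_dpus: int) -> Dict[int, List[int]]:
    """Master core role: give each output channel to exactly one DPU."""
    plan: Dict[int, List[int]] = {d: [] for d in range(n_dpus)}
    for ch in range(n_out):
        plan[ch % n_dpus].append(ch)
    return plan


def dpu_execute(weights: List[List[float]], batch: Batch,
                channels: List[int]) -> Dict[int, List[float]]:
    """DPU role: compute its output channels for every sample in the batch."""
    return {
        ch: [sum(w * x for w, x in zip(weights[ch], sample)) for sample in batch]
        for ch in channels
    }


def convolve(weights: List[List[float]], batch: Batch, n_dpus: int) -> List[List[float]]:
    """Run the whole layer by dispatching channel groups to the DPUs."""
    plan = assign_channels(len(weights), n_dpus)
    out = [[0.0] * len(weights) for _ in batch]
    for channels in plan.values():  # DPUs would run in parallel in hardware
        for ch, values in dpu_execute(weights, batch, channels).items():
            for b, v in enumerate(values):
                out[b][ch] = v
    return out


weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 output channels, 2 input channels
batch = [[2.0, 3.0], [4.0, 5.0]]                # 2 samples
print(convolve(weights, batch, n_dpus=2))       # [[2.0, 3.0, 5.0], [4.0, 5.0, 9.0]]
```

Because each output channel depends only on its own weights and the shared input, the result is the same for any number of DPUs; only the amount of work per DPU changes.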
DPE & Large RF (Register File)
DPU: 128 SIMD* / 16DPE
• DPU consists of 16 DPEs connected with on-chip network
DPE: 8SIMD* with large RF (~100x of typical CPU core)
• DPE includes large RF and wide SIMD execution units to realize an efficient Deep Learning engine.
• RF is fully SW controllable unlike cache to extract full HW potential
[Figure: a DPU with a CNTL block and DPEs, each pairing an Exec UNIT with an RF]
* For FP32
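The DPU and DPE widths quoted on this slide multiply out directly. In the sketch below the FP32 figures come from the slide, while the scaling to narrower element sizes assumes a fixed datapath width, which the slides do not state; the function names are illustrative.

```python
# SIMD width arithmetic for one DPU, using the figures on the slide:
# 16 DPEs per DPU and 8 FP32 SIMD lanes per DPE.
DPES_PER_DPU = 16
FP32_LANES_PER_DPE = 8


def dpu_lanes(dpes: int, lanes_per_dpe: int) -> int:
    """FP32 SIMD lanes across all DPEs of one DPU."""
    return dpes * lanes_per_dpe


def lanes_for_width(fp32_lanes: int, element_bits: int) -> int:
    """Lanes at another element size, ASSUMING a fixed datapath width.

    The slide gives lane counts for FP32 only; scaling them by
    32 / element_bits is an illustration, not a documented DLU property.
    """
    return fp32_lanes * 32 // element_bits


fp32_total = dpu_lanes(DPES_PER_DPU, FP32_LANES_PER_DPE)
print(fp32_total)                       # 128
print(lanes_for_width(fp32_total, 16))  # 256 under the fixed-width assumption
```

The first result matches the "128 SIMD" quoted for a DPU; the second is only what a constant-width datapath would give for 16-bit elements.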
        Name            RF/$ structure
UNIX    SPARC64 XII     RF + $
HPC     SPARC64 XIfx    RF + sector $
AI      DLU             Large RF
More SW controllability
Deep Learning Integer
Fujitsu’s “Deep Learning Integer” realizes necessary accuracy for Deep Learning with only a 16 or 8 bit data size (i.e. less power consumption compared with FP32)
[Figure, Data Size and Precision: effective precision versus data size (Small to Large) for 8-bit INT, 16-bit INT, and FP32, with the required precision for Deep Learning marked; Deep Learning Integer covers the 16/8bit area with minimum accuracy loss, using HW gathered statistics]
[Figure, DLU Data Type: INT16 and INT8 operand layouts next to FP16 and FP32; int16 x int16 products are summed into an accumulator wider than 16 bits (Int>16), and int8 x int8 products into accumulators wider than 8 bits (Int>8)]
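The idea of narrow integer operands, a wider accumulator, and a scale chosen from statistics of the data can be illustrated with a generic fixed-point dot product. The sketch below is not Fujitsu's Deep Learning Integer format, whose details the slides do not give: the shared power-of-two scale, the use of the maximum magnitude as the statistic, and all names are assumptions made for illustration.

```python
import math
from typing import List, Tuple

# Illustrative fixed-point scheme: values share one power-of-two scale that is
# chosen from a statistic of the data (here the maximum magnitude). This is a
# generic sketch and NOT Fujitsu's Deep Learning Integer format, whose details
# the slides do not give.


def choose_shift(values: List[float], bits: int) -> int:
    """Pick the fractional shift so the largest magnitude fits in `bits`."""
    max_abs = max(abs(v) for v in values) or 1.0
    int_bits = math.floor(math.log2(max_abs)) + 1  # bits left of the point
    return (bits - 1) - int_bits                   # one bit kept for the sign


def quantize(values: List[float], bits: int) -> Tuple[List[int], int]:
    """Convert floats to clamped signed integers plus their shared shift."""
    shift = choose_shift(values, bits)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    q = [min(hi, max(lo, round(v * 2.0 ** shift))) for v in values]
    return q, shift


def dot_wide_accumulator(a: List[float], b: List[float], bits: int) -> float:
    """Dot product with narrow multiplies and a wide accumulator."""
    qa, sa = quantize(a, bits)
    qb, sb = quantize(b, bits)
    acc = 0  # Python ints are unbounded, standing in for a wider accumulator
    for x, y in zip(qa, qb):
        acc += x * y
    return acc / 2.0 ** (sa + sb)


a = [0.5, -0.25, 0.75, 0.125]
b = [1.5, 0.5, -1.0, 2.0]
exact = sum(x * y for x, y in zip(a, b))
print(exact, dot_wide_accumulator(a, b, 16), dot_wide_accumulator(a, b, 8))
```

The sample values were picked so that they are exactly representable at both widths; with arbitrary data the 8-bit result would drift further from the floating-point one than the 16-bit result.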
Deep Learning Integer Accuracy
Deep Learning Integer has shown accuracy similar to FP32 for Deep Learning
[Figure: accuracy plots (*) comparing FP32, INT8, Deep Learning Integer (16bit), and Deep Learning Integer (8bit)]
(*) ImageNet(subset): image size=96x96, #categories=25
DLU Roadmap
Multiple generations of DLUs over time, as we currently do for HPC/UNIX/Mainframe processors
Timeline: FY2018 to Future
1st Generation
• Host CPU required
• Inter-DLU direct connection
2nd Generation
• Embedded host CPU
Other special processors
• Neuromorphic
• Combinatorial optimization
* Subject to change without notice
Summary
Fujitsu Processor Design Style
Product lines: Mainframe, UNIX, HPC (General Purpose) and AI (DLU) (Domain Specific: Deep Learning)
• Instruction Set Architecture (HW-SW I/F): Standard ISA with FJ enhancements / newly developed ISA
  GS ISA, SPARC ISA, ARM ISA; New ISA for the DLU
• Micro-architecture (CPU internal structure): Shared / Simple + SW visible micro-architecture
  Shared [FJ development], ★Performance/RAS; Simple, SW visible for the DLU
• Semiconductor Technology: The latest semiconductor technology
• Design Infrastructure: Shared design infrastructure: Circuit, Methodology, People
Fujitsu Processor Direction
General purpose and Domain specific
Wider variety of processors in the future to meet different requirements.
[Figure: processors plotted by Specialization versus Required Processing, up to Supercomputer. General Purpose: SPARC64™ VII / VII+, SPARC64™ X, SPARC64™ XII, SPARC64™ VIIIfx, SPARC64™ XIfx, Post-K. Domain Specific: DLU. HPC & AI diverge.]
Summary
Fujitsu has designed processors for a long time (> 60 years)
Perpetual evolution over generations
SPARC64 XIfx (HPC), SPARC64 XII (UNIX), and Post-K
General purpose computing
DLU
Domain specific
New Architecture
Heterogeneous, DPE and large RF, Deep Learning Integer
Shared Design infrastructure: Circuit, Methodology, People
Fujitsu will continue to develop cutting-edge processors to
meet the needs of a new era.