
AI Chip Trends and Forecast

Joo-Young Kim

2019. 11. 6

ICT Industry Outlook Conference (ICT 산업전망컨퍼런스)

Outline

• Introduction
  - Brief history & deep neural network models
  - AI stack and new computing paradigm

• Trends in AI chips - ??

• Looking forward - ???

Motivation

Artificial Intelligence is pervasive in our everyday life.

Brief History of Neural Networks

• 1943: W. S. McCulloch & W. Pitts, adjustable but not learnable weights
• 1958: F. Rosenblatt, learnable weights and threshold
• 1960: B. Widrow & M. Hoff
• 1969: M. Minsky & S. Papert, the XOR problem (First Winter)
• 1986: D. Rumelhart, G. Hinton & R. Williams, nonlinear problem solved; but high computation, local optima, and overfitting (Second Winter)
• 2006: G. Hinton & R. Salakhutdinov, hierarchical feature learning: Deep Learning!
• Since then: ImageNet, AlphaGo, speech translation, video synthesis, smart factory, …

Deep Learning ≠ AI

• AI: any technique that enables computers to mimic human behavior
  - Searching, planning, knowledge representation, fuzzy logic, natural language processing, genetic algorithms, …

• Machine learning (ML): AI techniques that have computers learn without being explicitly programmed

• Deep learning: a subset of ML that makes the computation of multi-layer neural networks feasible

Deep Learning Revolution

[Figure: ImageNet (ILSVRC) top-5 error by year, with the human level at ~5%; credit: F. Veen, The Asimov Institute, 2016]

Deep learning starts to surpass human-level recognition on specific tasks.

What Has Changed?

• Traditional pattern recognition: hand-crafted features (HoG, SIFT, Haar-like) feed simple trainable classifiers (SVM, K-Means) to produce labels like "Dog", "Ship", "Car"

• Deep learning (model + data): trainable features & classifiers (CNN, DNN) are learned end-to-end from data to the same labels

[Figure: performance vs. amount of data - deep learning keeps improving with more data while traditional algorithms plateau; Andrew Ng, Stanford CS 229 class]

Popular Types of DNNs

                    MLP (Multi-Layer Perceptron)  | CNN (Convolutional)  | RNN (Recurrent)
Characteristic      Fully connected               | Convolutional layers | Sequential data, feedback path
Major application   Speech recognition            | Image recognition    | Speech / action recognition
Number of layers    3~10                          | Max ~100             | 3~5
Main computation    Matrix-vector multiplication  | 3D convolution       | Matrix-vector multiplication

[Figure: example topologies - CNN with convolution, pooling, and fully connected layers; MLP with input, hidden, and output layers; RNN with a feedback path]
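For intuition, here is a minimal NumPy sketch (my illustration, not from the slides) of the two kernels that dominate these models: the matrix-vector multiply behind MLP/RNN layers and the 3D convolution behind CNN layers. All shapes are arbitrary.

```python
import numpy as np

def fc_layer(W, x, b):
    """MLP/RNN building block: one matrix-vector multiplication."""
    return np.maximum(W @ x + b, 0.0)  # ReLU activation

def conv3d_naive(inp, kernels):
    """Naive CNN building block: 3D convolution over an input volume.
    inp: (C_in, H, W); kernels: (C_out, C_in, K, K); stride 1, no padding."""
    c_out, c_in, k, _ = kernels.shape
    _, h, w = inp.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    for o in range(c_out):
        for i in range(h - k + 1):
            for j in range(w - k + 1):
                # each output pixel = dot product over a (C_in, K, K) window
                out[o, i, j] = np.sum(inp[:, i:i + k, j:j + k] * kernels[o])
    return out

print(fc_layer(np.ones((64, 128)), np.ones(128), np.zeros(64)).shape)  # (64,)
print(conv3d_naive(np.ones((3, 8, 8)), np.ones((4, 3, 3, 3))).shape)   # (4, 6, 6)
```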

And Many More Models…

A model zoo spanning the 1970s-1990s (MLP, Cognitron, CNN/LeNet, LSTM), 2012~2014 (AlexNet, VGGNet, GoogleNet, R-CNN, GRU), 2015 (ResNet, Fast R-CNN, Faster R-CNN, YOLO, FCN, SegNet, CNN+RNN), 2016 (DenseNet, DeepLab, ENet, YOLO v2, PointNet, WaveNet), and 2017~ (attention-only networks, Tacotron, YOLO v3, BERT, DeepLab v3+, VoxelNet, PointNet++, WGAN, CycleGAN, StarGAN, DiscoGAN), …

DNN Characteristics

• Requires big data & big computation

• Modern hardware (e.g., the GPU) enabled the deep learning revolution

Face recognition comparison:
• Local-feature-based: ~0.1 billion operations, ~10 MB of memory accesses per face
• Deep-learning-based: ~2 billion operations, ~1 GB of memory accesses per face

AI Stack

Application
• Video/image: face recognition, image generation, video analysis, …
• Sound and speech: speech recognition, language synthesis, music generation, …
• NLP: text analysis, language translation, human-machine communication, …
• Robotics: autopilot, UAV, industrial automation, …

Algorithm
• Neural network topology: MLP, CNN, RNN, LSTM, SNN, …
• Deep neural networks: AlexNet, ResNet, GoogLeNet, …
• Neural network algorithms: reinforcement learning, adversarial learning, …
• Machine learning algorithms: SVM, k-NN, decision tree, Markov chain, …

Chip
• Neuromorphic chip: brain-inspired computing, biological brain simulation, …
• Programmable chip: GPU, ASIC, FPGA, DSP, …
• System-on-Chip: multi-core, many-core, SIMD, systolic array, …
• Development tool-chain: frameworks, compiler, simulator, optimizer, …

Device
• High-bandwidth off-chip memory: HBM, DRAM, GDDR, STT-MRAM, …
• High-speed interface: SerDes, optical communication
• CMOS 3D stacking
• Emerging computing device: analog computing, memristors, …
• Emerging memory device: ReRAM, PCRAM, …

New Computational Paradigm

• Being able to handle big data
  - Huge storage capacity, high-bandwidth, low-latency memory access
  - The "memory wall" problem

• Large amount of computation
  - Mainly linear algebraic operations, while control is relatively simple
  - Large number of parameters

• Training vs. inference
  - Training: accuracy, data capacity (~10^18 bytes), weight synchronization
  - Inference: speed, energy, hardware cost, efficient reading of weights

• Data precision / model compression / pruning
  - High precision is not always required (see the quantization sketch below)

• High configurability
  - Tradeoff between energy efficiency and adaptability to new algorithms
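As a concrete (hypothetical) illustration of the precision point, this NumPy sketch quantizes weights to 8-bit integers with a per-tensor scale; many inference chips exploit exactly this kind of reduced precision:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w ~= scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(256, 256).astype(np.float32)      # toy weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - scale * q).max())  # small vs. weight range
print("memory: 4x smaller than fp32")
```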

AI Chip Landscape

https://basicmi.github.io/AI-Chip/

DNN Hardware

• Mobile-based: specific AI, real-time, limited resources, low power

• Cloud-based: general AI, high computing, huge memory, fast & accurate learning

[Figure: cloud server, mobile, and edge terminals plotted by real-time operation (low to high) vs. global data sharing (high to low); control & models flow down to devices, data & learned models flow up to the cloud]

Cloud-based AI Computing

Training on a dataset and inference both run on the cloud/server.

[Figure: voice assistant example - the device/edge sends a question to the cloud/server, where a pre-trained network runs inference and returns the answer]

DNN Chips for Cloud Server

• Nvidia (GPU)
• Google (TPU)
• Microsoft (BrainWave)
• Amazon (Inferentia)
• Facebook
• Alibaba, Baidu

[Figure: the cloud server sits at high global data sharing but low real-time operation - it controls based on overall conditions and learns from data collected from edge devices; stand-alone AI occupies the opposite corner. Example chips: NVIDIA Volta, Google Cloud TPU]

Mobile/Edge-based AI Inference

Self-driving vehicles, intelligent cameras/speakers, IoT devices

The network is trained on the cloud/server; the device loads the pre-trained model and runs inference locally.

[Figure: the device/edge runs inference on local data from its sensors (camera, MIC, GPS, gyro, touch) under the user interface & apps platform, using a model pre-trained on the cloud/server]

Mobile/Edge DNN Applications

• Apple
• Huawei
• Qualcomm
• ARM
• CEVA
• Cambricon
• Horizon Robotics
• Mobileye
• Tesla

[Figure: edge platforms plotted by power consumption (low to high) vs. inference speed (slow to fast): IoT, wearable, smartphone, drone, mobile robot, automotive]

Cloud vs Edge Summary

• Cloud / datacenter, training: high performance, high precision, high flexibility, distributed, scalable

• Cloud / datacenter, inference: high throughput, low latency, power efficiency, distributed, scalable

• Edge / mobile, inference: diverse requirements (car, wearable, IoT), low-moderate throughput, low latency, power efficiency, low cost

• Edge / mobile, training: ?

Functional Integration

                 Classic                 | Domain-specific          | Reconfigurable
Hardware         Intel CPU, nVidia GPU,  | MIT Eyeriss, Google TPU, | KAIST LNPU, Wave DPU,
                 Xilinx FPGA             | Microsoft BrainWave      | Tsinghua Thinker
Domain           Cloud                   | Cloud/Edge               | Cloud/Edge
Target workload  Training-oriented       | Inference                | Inference & training
Stage            Early                   | 1st stage                | 2nd stage → ?

Courtesy of GTIC 2019

Two Different Directions

• Be more flexible (2014 → 2018)
  - Dedicated (DianNao) → RS dataflow (MIT Eyeriss) → systolic array (Google TPU) → sparse-aware (Nvidia SCNN) → flexible bitwidth (KAIST UNPU), …

• Be more compact (2016.2 → 2018.9)
  - Compression & pruning (EIE) → BWN → TWN → low-bit training (DoReFa-Net) → low-bit quantization (LQ-Nets), …

Courtesy of GTIC 2019

Von Neumann Bottleneck for AI

• A von Neumann architecture serially fetches data from storage

• AI applications need to access a tremendous amount of data (a rough arithmetic-intensity sketch follows)
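A back-of-the-envelope estimate (my numbers, not the speaker's) of why DNN layers hit the memory wall: in a fully connected layer, every weight fetched from memory feeds exactly one multiply-accumulate, so performance is capped by bandwidth rather than compute.

```python
# Arithmetic intensity of a fully connected layer, assuming fp16 weights
# dominate memory traffic (hypothetical sizes and bandwidth).
M, N = 4096, 4096                 # output x input dimensions
ops = 2 * M * N                   # 1 MAC per weight = 2 ops
bytes_moved = 2 * M * N           # each fp16 weight read once
print("ops/byte:", ops / bytes_moved)   # 1.0 -> heavily memory-bound

bw = 900e9                        # e.g. an HBM2-class 900 GB/s budget
print("bandwidth-bound ceiling: %.2f TFLOPS" % (ops / (bytes_moved / bw) / 1e12))
# -> 0.90 TFLOPS, regardless of how much peak compute the chip has
```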

[Figure: AI processor and memory connected by a bus - the bus is the bottleneck; the memory hierarchy processor → SRAM (cache) → DRAM → NVM hits the "memory wall"]

Increasing Memory Bandwidth

How can we increase bandwidth between processor and memory?

• Near-memory processing: processor and DRAM packaged side by side on the PCB
• 3D-stacked memory
• High Bandwidth Memory (HBM)

Advantage of HBM

Item         GDDR5                            HBM (High-B/W Memory)
DRAM         8Gb GDDR5 × 12                   4GB HBM × 4
Size         3120 mm²                         792 mm²
Density      12 GB                            16 GB
Bandwidth    384 GB/s                         1024 GB/s
Power        18.3 W (1.5 W × 12 GDDR5)        9.1 W (2.3 W × 4 HBM)
Pin (ball)   Speed: 8 Gbps                    Speed: 2 Gbps
             # I/O: 32 per chip (384 total)   # I/O: 1024 per cube (4096 total)

Predicted 2016 GFX specs: 4~6 HBM cubes, 4~8 GB, 512 GB/s~1 TB/s, 10 TFLOPs
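A quick consistency check on the bandwidth row, using only the table's own numbers: bandwidth is total I/O width times per-pin rate.

```python
# GB/s = (# I/O pins) * (Gbps per pin) / 8 bits
print(384 * 8 / 8)    # GDDR5: 384 pins x 8 Gbps  = 384 GB/s
print(4096 * 2 / 8)   # HBM:  4096 pins x 2 Gbps = 1024 GB/s
# HBM wins on sheer I/O width despite a 4x slower per-pin speed.
```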

[Figure: package comparison - a processor with 12 GDDR5 chips occupies 60mm × 52mm, while a processor with 4 HBM stacks fits in 33mm × 24mm; annotated -75% footprint, 1.3x density, 3.6x, +18%]

Emerging Non-Volatile Memories

DRAM-like speed, Flash-like capacity, and non-volatile

Source: White Paper on AI Chip Technologies (2018)

Toward In-Memory Computing

• Traditional: processor → SRAM (cache) → DRAM → NVM, across the von Neumann bottleneck
• Near-memory / emerging memory: move the processor next to the memory
• In-memory / memory-centric: processing elements (P) embedded throughout SRAM, DRAM, and NVM

Processing-In-Memory (PIM)

• Von Neumann: an AI processor and memory separated by a bus, which becomes the bottleneck
• Non-von Neumann: many small logic + memory tiles spread across the chip

• Converged logic + memory (high BW)
• Suitable for data-intensive workloads
• Little data movement (energy efficient)

PIM Chip

Renesas's ternary SRAM PIM for AI inference

S. Okumura et al., "A Ternary-Based Bit-Scalable 8.80 TOPS/W CNN Accelerator with Many-Core Processing-in-Memory Architecture with 896K Synapses/mm²," Symposium on VLSI Technology, 2019.

AI Framework

Provides a higher-level abstraction to developers/users:

• Convolution on volumes (1 line)
• Max pooling (1 line)
• Non-linear ReLU (1 line)
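For instance, in PyTorch (one plausible framework; the slide does not name one), each of those operations really is a single line:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),  # convolution on volumes (1 line)
    nn.MaxPool2d(kernel_size=2),                 # max pooling (1 line)
    nn.ReLU(),                                   # non-linear ReLU (1 line)
)

x = torch.randn(1, 3, 32, 32)   # dummy 32x32 RGB image
print(model(x).shape)           # torch.Size([1, 16, 16, 16])
```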

Hyper-Scale AI Accelerators

TPU v3 (2018) and Cerebras Wafer Scale Engine (2019)

Usually hundreds of processing units in an array structure. How do we program this? Who fills this gap?

Cerebras WSE: 1.2T transistors, 46,225 mm², 400,000 cores, 18 GB SRAM, 100 Pb/s interconnect

AI Software Tool-Chain

• Example: Xilinx AI Edge Platform

Problem: no de facto SW tool & hardware pair connecting SW developers/users to the few hardware vendors!

Software → Toolchain → Hardware
• C / Java → compiler toolchain → CPU
• OpenGL / CUDA → compiler toolchain → GPU
• Verilog / VHDL → synthesis toolchain → FPGA
• ? → ? → AI chip

Neuromorphic Chip

• 1st generation: perceptron-based; no non-linear functions; binary output

• 2nd generation: non-linear activation functions; continuous output; functional modeling of our brain; working real-life applications; we are here (FF, CNN, RNN, …)

• 3rd generation: "spiking neurons" that closely model a biological neuron's activity and incorporate the concept of time (integrate and fire); computationally expensive and difficult to train; not practical at the moment
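To make "integrate and fire" concrete, here is a minimal leaky integrate-and-fire neuron, a textbook model sketched for illustration; the parameters are arbitrary.

```python
def lif_neuron(currents, leak=0.9, threshold=1.0):
    """Leaky integrate-and-fire: the membrane potential integrates input
    current, leaks over time, and emits a spike (then resets) on crossing
    the threshold; this is where 'time' enters the 3rd generation."""
    v, spikes = 0.0, []
    for i in currents:        # one input current per discrete time step
        v = leak * v + i      # integrate with leak
        if v >= threshold:
            spikes.append(1)  # fire
            v = 0.0           # reset
        else:
            spikes.append(0)
    return spikes

print(lif_neuron([0.3, 0.4, 0.5, 0.0, 0.9, 0.6]))  # [0, 0, 1, 0, 0, 1]
```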

IBM TrueNorth

• 5.4 billion transistors in 28nm CMOS process

• 64 × 64 array of neurosynaptic cores, 256 neurons each

Paul A. Merolla et al., "A million spiking-neuron integrated circuit with a scalable communication network and interface," Science, 2014.

IBM TrueNorth

• Mimics the synapse with SRAM

• However, SRAM was not made for this (large area, cost)

A synapse is a structure that permits a neuron to pass an electrical signal to another.

[Figure: pre-neuron (Tx) drives input spikes into an 8T SRAM synapse array (WL/WLT word lines, BL/BLB/BLT bit lines); stored bits gate which columns contribute to the summed output spike voltage at the post-neuron (Rx)]

Neuromorphic Chip with Emerging Device

• New models require devices with new physics

• FeFET: better at storing/transferring analog signals

M. Jerry et al., "Ferroelectric FET analog synapse for acceleration of deep neural network training," IEEE IEDM 2017.

Neuromorphic Chip with Emerging NV RAM

• ReRAM (memristor)

Z. Wang et al., "Fully memristive neural networks for pattern classification with unsupervised learning," Nature Electronics, 2018.
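What makes a memristor crossbar attractive is that a full matrix-vector multiply happens in analog: stored conductances act as the weights, row voltages are the inputs, and each column current sums v·G by Kirchhoff's current law. An idealized behavioral sketch (my illustration; real devices add noise, drift, and nonlinearity):

```python
import numpy as np

def crossbar_mvm(G, v):
    """Idealized crossbar: column current I_j = sum_i v_i * G[i, j].
    Ohm's law does the multiplies; Kirchhoff's law does the accumulation."""
    return v @ G   # whole matrix-vector product in one "analog" step

G = np.random.uniform(1e-6, 1e-4, size=(128, 64))  # conductances = weights (S)
v = np.random.uniform(0.0, 0.2, size=128)          # read voltages on rows (V)
print(crossbar_mvm(G, v).shape)                    # (64,) column currents (A)
```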

1. Cloud and Edge Will be Closer

• Edge inference & learning will become more important due to privacy concerns, real-time operation, and power constraints

• Federated learning: leverage the cloud's big-data advantage on edge devices (see the sketch after the figure)

[Figure: federated learning loop - cloud servers broadcast a shared model to mobile devices; each device performs local learning to produce custom weights; encrypted & compressed updates are aggregated into an updated model]
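A minimal federated-averaging sketch of that loop (illustrative only; a real system adds encryption, compression, and weighting by local dataset size):

```python
import numpy as np

def local_learning(shared, data, lr=0.1):
    """On-device step: one least-squares gradient update on private data."""
    X, y = data
    grad = X.T @ (X @ shared - y) / len(y)
    return shared - lr * grad                  # the device's custom weights

def federated_round(shared, devices):
    """Broadcast shared model, learn locally, aggregate on the cloud."""
    updates = [local_learning(shared, d) for d in devices]
    return np.mean(updates, axis=0)            # updated (averaged) model

rng = np.random.default_rng(0)
devices = [(rng.normal(size=(32, 4)), rng.normal(size=32)) for _ in range(5)]
model = np.zeros(4)
for _ in range(10):
    model = federated_round(model, devices)    # raw data never leaves a device
print(model)
```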

2. AI Chips Will Support More Algorithms

• State-of-the-art algorithms are moving from traditional MLP, CNN, and RNN to GAN, reinforcement learning, and unsupervised learning

Roadmap: inference only (MLP/RNN or CNN) → inference only (MLP/CNN/RNN) → inference + training (MLP/CNN/RNN) → inference + training (GAN/RL/unsupervised + MLP/CNN/RNN)

3. AI Security Will Be Essential

• It is easy to break DNN-based recognition

• New cyberattack: imperceptible noise injection (sketched below), e.g. breaking state-of-the-art face recognition, physical attacks on autonomous vehicles
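The canonical "imperceptible noise" attack is the fast gradient sign method (FGSM); here is a hedged sketch of the idea, with the victim model's gradient faked by a placeholder:

```python
import numpy as np

def fgsm_attack(image, loss_grad, epsilon=0.007):
    """FGSM: shift every pixel by +/- epsilon along the direction that
    increases the classifier's loss; the change is bounded by epsilon,
    so humans see the same image while the predicted label can flip."""
    adv = image + epsilon * np.sign(loss_grad)
    return np.clip(adv, 0.0, 1.0)              # keep a valid image

img = np.random.rand(224, 224, 3)              # placeholder input image
grad = np.random.randn(224, 224, 3)            # placeholder d(loss)/d(image)
adv = fgsm_attack(img, grad)
print(np.abs(adv - img).max())                 # <= 0.007: imperceptible
```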

4. For the Success of AI Chips, SW Is the Key

• How did ARM dominate the mobile processor market?
  - Low power consumption with reasonable performance
  - ARM's competent compiler toolchain & licensing strategy

• Why did the GPU have big success in the early DNN revolution?
  - Because of CUDA, a generic programming language for data-intensive workloads like matrix-vector multiplication
  - CUDA was refined for several years before developers actually adopted it

AI Chip Research at KAIST

[Figure: chip gallery, 2008-2019 - multi-core object recognition (OR) processors with dual-layered 3-stage pipelines, simultaneous multi-threading, a multi-classifier system, and multi-core MIMD (2008-2013); heterogeneous many-SIMD processors with multi-modal UI/UX and a deep learning core; a face recognition & CNN-RNN chip and a stereo matching processor with convolution clusters, an FC-LSTM processor, an aggregation core, and pipelined CNN PEs (2018); a variable-bit DNN & 3D hand-gesture-recognition chip with an ICP-PSO engine and 16 NN PIM units; and a supervised & reinforcement learning processor (2019)]

One example chip summary:

Process            65nm 1P8M Logic CMOS
Area               4mm × 4mm
SRAM               448 KB
Supply             0.67V - 1.1V
Power              196 mW @ 200MHz, 1.1V; 2.4 mW @ 10MHz, 0.67V
Precision          Feature: bfloat16; Weight: 16/8/4-bit fixed point
Peak performance   204 GFLOPS @ 16b weight

[Figure: die photo with core clusters, PE arrays, an exponent compressor, and a 1-D SIMD unit for supervised & reinforcement learning; hand-tracking demo with separated VGA cameras - tracking accuracy 2.6mm @ 20cm, 4.6mm @ 30cm, 3.4mm @ 40cm distance]
