Hot Chips & SC14 Topics, and the Current State and Future of the CAE Prototype Board
2014/12/10
Toshiaki Kitamura (北村 俊明), Graduate School of Information Sciences, Hiroshima City University


From the Hot Chips 26 Presentations


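What Is Hot Chips?

✤ A conference held every summer since 1989, centered on semiconductors such as microprocessors

✤ Most presentations come from companies, and new products are often announced there in recent years

✤ Sessions cover processors ranging from mobile and PC to server and supercomputer

✤ There are also FPGA sessions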

[Program day labels: Monday August 11, Tuesday August 12]

HOT CHIPS 26 ADVANCE PROGRAM
A Symposium on High-Performance Chips
August 10-12, 2014
Flint Center for the Performing Arts, Cupertino, CA
http://www.hotchips.org

A Symposium of the Technical Committee on Microprocessors and Microcomputers of the IEEE Computer Society and the Solid-State Circuits Society

Warthman Associates, Technical Writers, www.warthman.com

High-Performance Computing
• SX-ACE Processor: NEC's Brand-New Vector Processor (NEC)
• SPARC64 XIfx: Fujitsu’s Next Generation Processor for HPC (Fujitsu)
• Anton 2: A 2nd-Generation ASIC for Molecular Dynamics Simulation (D.E. Shaw Research)

Keynote 1
• Power Constraints: From Sensors to Servers (Michael Muller, ARM)

Mobile Processors
• NVIDIA’s Tegra K1 System-on-Chip (NVIDIA)
• Applying AMD’s “Kaveri” APU for Heterogeneous Computing (AMD)
• NVIDIA’s Denver Processor (NVIDIA)

Technology
• HBM: Memory Solution for Bandwidth-Hungry Processors (SK Hynix Inc)
• Improved 3D Chip Stacking with ThruChip Wireless Connections (ThruChip Communications)
• CMOS Biochips for Point-of-Care Molecular Diagnostics (InSilixa)

ARM Servers
• The AMD Opteron “Seattle”: A 64b ARM Dense Server Processor (AMD)
• ARM Next-Generation IP Supporting LSI’s High-End Networking (ARM, LSI Logic)
• X-Gene2: 28nm Scale-Out Processor (Applied Micro)

FPGAs
• Design of a High-Density SOC-FPGA at 20nm (Altera)
• Large-Scale Reconfigurable Computing in a Microsoft Datacenter (Microsoft)
• Xilinx FPGAs Case Study: High Capacity and Performance 20nm FPGAs (Xilinx)
• SDA: Software-Defined Accelerator for Large-Scale DNN Systems (Baidu)

High-Performance ASICs
• Hardware-Accelerated Text Analytics (IBM)
• Myriad2 “Eye” of the Computational-Vision Storm (Movidius)
• Goldstrike 1: A 1st Generation Cryptocurrency Processor for Bitcoin Mining (Cointerra)
• RayChip: Real-Time Ray Tracing Chip for Embedded Applications (Siliconarts)

Keynote 2
• The Internet of Everything: What is it? What’s driving it? What comes next? (Rob Chandhok, Qualcomm)

Dense Servers and Server Technology
• SCORPIO: 36-Core Shared-Memory Processor with a Coherent Mesh (MIT)
• Oracle’s Next-Generation SPARC Processor Cache Hierarchy (Oracle)
• Unchaining the Datacenter with OpenPOWER: Reengineering a Server Ecosystem (IBM)
• Intel C2000 Atom Microserver: Power Efficient Processing for the Data Center (Intel)

Big-Iron Servers
• Performance Characteristics of the POWER8 Processor (IBM)
• Next-Generation Oracle SPARC Processor (Oracle)
• IvyBridge Server: Delivering Performance from Workstations to Mission Critical (Intel)

Tutorial 1: Emerging Trends in Hardware Support for Security
• Security Basics (Princeton)
• Mobile HW Security (ARM)
• Secure Systems Design (AMD)
• Mitigating Exploits, Rootkits and Advanced Persistent Threats (Intel)
• University Research in Hardware Security (Princeton)

Tutorial 2: Internet of Things
• Powering the Internet of Things (TI)
• Ultra Low Power Design Approaches for IoT (National University of Singapore)
• Connecting the IoT (Qualcomm)
• Standards for Constrained IoT Devices (ARM)

Organizing Committee
• Chair: Krste Asanovic (UC Berkeley)
• Vice Chair: Fred Weber
• Finance: Lily Jow (HP)
• Advertising: Don Draper (Oracle)
• Sponsorship: Amr Zaky (Invensense)
• Publications: Randall Neff
• Press: Ralph Wittig (Xilinx)
• Registration: Charlie Neuhauser (Neuhauser Associates)
• Location Services: John Sell (Microsoft), Allen Baum
• Volunteer Coordinator: Gary Brown (Tensilica)
• Webmaster, IT: Kevin Broch
• Production: Lance Hammond, Mike Albaugh, Keith Diefendorff

Steering Committee
• Chair: Alan Jay Smith
• Committee Members: Allen Baum, Don Draper (Oracle), Pradeep Dubey (Intel), Lily Jow (HP), John Mashey (Techviser), John Sell (Microsoft), Keith Diefendorff

Program Committee
• Program Co-Chairs: Sam Naffziger (AMD), Guri Sohi (U. Wisconsin)
• Committee Members: Forest Baskett (NEA), Pradeep Dubey (Intel), John Davis (Microsoft), Alan Jay Smith (UC Berkeley), Steve Miller (NetApp), Subhasish Mitra (Stanford), Stefan Rusu (Intel), Tom McWilliams (BayStorage), Behnam Robatmili (Qualcomm), Ralph Wittig (Xilinx), Mike Taylor (UCSD), Bill Dally (NVIDIA)
• Founder: Bob Stewart (SRE)

HOTCHIPS brings together designers and architects of high-performance chips, software, and systems. The tutorial and presentation sessions focus on up-to-the-minute developments in leading-edge industrial designs and research projects. Register now at: https://www.123signup.com/register?id=drvzv

[Program day label: Sunday August 10]

AMD's ARM Server

✤ Designed by AMD, not by ARM

✤ Targets server use with the ARM architecture rather than x86

THE AMD OPTERON™ A1100 PROCESSOR CODENAMED "SEATTLE"

SEAN WHITE 11 AUGUST 2014


| AMD “SEATTLE” | HOT CHIPS 26 | 11 AUGUST 2014 2

“SEATTLE” – WHAT IS IT AND WHY?

• What is it?
  ‒ “Seattle” is AMD’s first 64-bit ARM-based processor
  ‒ 8 ARM Cortex™-A57 cores
  ‒ 2 DDR3/4 DRAM channels
  ‒ 10G Ethernet, PCI-Express, SATA
  ‒ GlobalFoundries 28nm process

• Why did AMD build it?
  ‒ “Seattle” is a dense server processor for datacenter applications
  ‒ Performance/dollar/watt drives today’s datacenter designs
  ‒ A significant number of datacenter workloads have inherently low Instructions Per Clock (IPC) and high cache miss rates
  ‒ For such workloads, processors like “Seattle,” with smaller cores and caches, can deliver performance equivalent to traditional server processors with large cores and caches, while using much less power and area
  ‒ The 32-bit to 64-bit transition for the ARM architecture is a major shift in the industry, like the 32-bit to 64-bit transition in x86 was
  ‒ AMD is taking a leadership role in the 64-bit ARM space, as it did in the 64-bit x86 space


“SEATTLE” SOC OVERVIEW 28nm Process Technology

[Block diagram: eight 64-bit Cortex-A57 cores, four 1MB L2 caches, an 8MB L3 cache, two DDR3/4 memory controllers, a Cortex-A5 System Control Processor, and a Cryptographic Coprocessor, with I2C, UART, SPI, 1Gbit Ethernet (RGMII), 10Gbit Ethernet (KR), SATA 3, and PCIe Gen 3 interfaces]

Package
• 27mm x 27mm, SP1 BGA

Power Efficient Cores
• Up to eight ARM Cortex-A57 cores
• Up to 4MB shared L2 cache total

Cache Coherent Network
• Full cache coherency
• 8MB L3 cache
• SMMU: I/O address mapping and protection

High Performance, Flexible Memory
• Two 64-bit DDR3/4 channels with ECC
• Two DIMMs/channel up to 1866MHz
• SODIMM, UDIMM, RDIMM support
• Up to 128GB per CPU

Highly Integrated I/O
• 8x SATA 3 (6Gb/s) ports
• Two 10GBASE-KR Ethernet ports
• 8 lanes PCI-Express® Gen 3, supports x8, x4, x2

System Control Processor
• TrustZone® technology for enhanced security
• Dedicated 1GbE system management port (RGMII)
• SPI, UART, I2C interfaces

Cryptographic Coprocessor
• Separate cryptographic algorithm engine for offloading encryption, decryption, compression, decompression computations
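The memory figures on this slide can be sanity-checked with simple arithmetic. The sketch below is a back-of-envelope estimate rather than a vendor figure. It assumes that the "1866" speed on the slide refers to a DDR transfer rate of 1866 MT/s and that ECC bits are carried outside the 64-bit data path. The function and variable names are illustrative only.

```python
# Back-of-envelope memory estimates for the "Seattle" SoC.
# Assumption (not stated on the slide): the quoted 1866 speed is a
# DDR transfer rate of 1866 mega-transfers per second.

CHANNELS = 2                # two DDR3/4 channels (from the slide)
BUS_WIDTH_BITS = 64         # 64-bit data path per channel, ECC excluded
TRANSFERS_PER_SEC = 1866e6  # assumed transfer rate
DIMMS_PER_CHANNEL = 2       # two DIMMs per channel (from the slide)
MAX_MEMORY_GB = 128         # up to 128GB per CPU (from the slide)


def peak_bandwidth_gb_per_s(channels, bus_width_bits, transfers_per_sec):
    """Theoretical peak bandwidth in GB/s (10^9 bytes per second)."""
    bytes_per_transfer = bus_width_bits // 8
    return channels * bytes_per_transfer * transfers_per_sec / 1e9


def dimm_size_for_max_memory_gb(max_memory_gb, channels, dimms_per_channel):
    """DIMM capacity implied by the maximum memory with every slot filled."""
    return max_memory_gb / (channels * dimms_per_channel)


seattle_bw = peak_bandwidth_gb_per_s(CHANNELS, BUS_WIDTH_BITS, TRANSFERS_PER_SEC)
seattle_dimm_gb = dimm_size_for_max_memory_gb(
    MAX_MEMORY_GB, CHANNELS, DIMMS_PER_CHANNEL
)

print(f"peak DRAM bandwidth: {seattle_bw:.1f} GB/s")
print(f"implied DIMM size:   {seattle_dimm_gb:.0f} GB")
```

Under these assumptions the two channels give a little under 30 GB/s of peak bandwidth, and the 128GB maximum corresponds to four 32GB DIMMs. Sustained bandwidth in a real system would be lower.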


“SEATTLE” REFERENCE SYSTEM

Standalone uATX board
• 1P standalone platform intended to meet the needs of partners (ISV, OSV, IHV)
• Off-the-shelf 2U rack mount chassis
• DDR3 DIMMs only
• x8 PCIe Gen3 lanes supporting (1) x8 slot or alternatively (2) x4 slots
• NIC supported through add-in card option
• Supports up to 8 hard drives
• Provisions for remote access to start, stop, and remote console will be provided


“SEATTLE” REFERENCE SYSTEM BOARD

• uATX form factor

• 1 “Seattle” SP1 BGA processor

• DDR3 2-DIMM per memory channel config (up to 4 DIMMs per CPU)

• 1 x8 PCIe slot

• 2 x4 PCIe slots as an alternative via mux

• 8 SATA3 ports

• 2 10GBase-T connectors

• 4 I2C ports

• 2 UARTs

• Supports required debug features


FPGAs with Embedded ARM Cores

✤ Products built on a 20nm process

✤ An entire SoC, including ARM cores, on a single chip

Design of a High-Density SoC FPGA at 20nm
Brad Vest, Sean Atsatt, Mike Hutton
Altera, San Jose

High Capacity and High Performance 20nm FPGAs
Steve Young, Dinesh Gaitonde
August 2014
© Copyright 2014 Xilinx

Device Goals

• Mid-Range FPGA: balance of performance/power/cost targeting Key Market Applications

• Key Targets and Metrics:
  − 491 MHz fixed-point DSP datapath for Wireless RRU
  − 1M+ LEs at 350 MHz for 4xOTU4 (400G) OTN networks, with Partial Reconfig
  − Cloud Server Acceleration – Hardened Floating-Point
  − 28G transceivers to support 200G to 400G networking/routing
  − Dramatic die-size reduction


Overview and Floorplan

• TSMC 20SOC Process
  − 5.3B Tx, 11LM

• Resources
  − 1.15M LEs, 1.7M FFs
  − 64Mb embedded SRAM
  − 32 fPLL, 16 PLLs, 32 GCLK
  − 1.5 TFlops IEEE754 DSP
  − Dual-Core ARM A9
  − Row-based redundancy

• I/O
  − 28G SERDES, >1.7Tb b/w
  − x72 2.667Gbps DDR4 w/ Hard memory Controller
  − Hardened PCIe/ILKN/10GE

Hardened Floating Point DSP

• Hardened IEEE 754 Floating Point adder & Multiplier
  − 12% DSP Area increase (<<1% die area)

• 100% Fixed Point backwards compatible
  − No performance or power penalty

• ‘Have your cake and eat it too’

• How is this possible?
  − Overlaid FP algorithms on Fixed point circuits

Major Innovation – Hard Floating Point on a Commercial FPGA

[Figure: a multiplier (X) and an adder (+) with 32-bit data paths]


DSP Block – 1000s of blocks at very low latency

• 1.5 TFLOPS of aggregate computation; 50 GFLOPS/W
  − 1678 blocks @ 2 FLOPS/clock @ 450 MHz ≈ 1.5 TFLOPS
  − Can run individually or as large integrated DSP system

• Hardware recursive structure support (Vector Mode)
  − 10s/100s of DSP blocks can be seamlessly integrated
  − Internal/External pipelining of individual DSP elements

• Very small latency
  − Floating Point is used for iterative algorithms, which require small latency
  − Arria 10 Floating Point: 256 length dot products ~25 clocks
  − Standard FPGA Technology: 256 length systolic FIR filter ~750 clocks
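The aggregate throughput quoted on this slide is the product of block count, operations per clock, and clock frequency. The sketch below reproduces that arithmetic using only the numbers on the slide. The implied power figure is an inference: it assumes the 50 GFLOPS/W efficiency applies at the full peak rate, which the slide does not state. All names are illustrative.

```python
# Peak floating-point throughput from the figures quoted on the slide.

DSP_BLOCKS = 1678        # hardened DSP blocks (from the slide)
FLOPS_PER_CLOCK = 2      # one multiply and one add per block per clock
CLOCK_HZ = 450e6         # 450 MHz (from the slide)
GFLOPS_PER_WATT = 50     # quoted efficiency (from the slide)


def peak_gflops(blocks, flops_per_clock, clock_hz):
    """Peak throughput in GFLOPS, with every block busy on every clock."""
    return blocks * flops_per_clock * clock_hz / 1e9


arria_peak = peak_gflops(DSP_BLOCKS, FLOPS_PER_CLOCK, CLOCK_HZ)

# Assumption: the efficiency figure holds at the peak rate.
implied_watts = arria_peak / GFLOPS_PER_WATT

print(f"peak throughput: {arria_peak:.1f} GFLOPS ({arria_peak / 1000:.2f} TFLOPS)")
print(f"implied power at peak: {implied_watts:.1f} W")
```

The product comes to about 1,510 GFLOPS, consistent with the rounded 1.5 TFLOPS headline, and under the stated assumption it corresponds to roughly 30 W for the floating-point work.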

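[Figure: DSP blocks, each a multiplier (X) feeding an adder (+), chained in vector mode. Inputs A through P produce the partial sums AB+CD, EF+GH, IJ+KL, and IJ+KL+MN+OP, which combine into AB+CD+EF+GH and finally AB+CD+EF+GH+IJ+KL+MN+OP]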
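The diagram for this slide shows partial sums such as AB+CD and EF+GH being merged into one long dot product. The sketch below models that idea in software. It is a simplification, not a description of the Arria 10 hardware: the merge order is assumed to be a plain pairwise tree, and `dsp_block` and `vector_mode_dot` are hypothetical names.

```python
# Software model of chaining multiply-add blocks into a dot product.
# Assumption: partial sums are merged pairwise; the real routing
# between DSP blocks may differ.


def dsp_block(a, b, chain_in=0.0):
    """One multiply-add: a*b plus a value passed in from a neighbour."""
    return a * b + chain_in


def vector_mode_dot(xs, ys):
    """Dot product built from chained multiply-add blocks."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")

    # Stage 1: two chained blocks produce one partial sum (e.g. AB+CD).
    partial_sums = []
    for i in range(0, len(xs), 2):
        acc = dsp_block(xs[i], ys[i])
        if i + 1 < len(xs):
            acc = dsp_block(xs[i + 1], ys[i + 1], chain_in=acc)
        partial_sums.append(acc)

    # Stage 2: merge neighbouring partial sums until one value remains.
    while len(partial_sums) > 1:
        merged = []
        for i in range(0, len(partial_sums), 2):
            merged.append(sum(partial_sums[i:i + 2]))
        partial_sums = merged

    return partial_sums[0] if partial_sums else 0.0


# Eight products, matching the A..P inputs in the diagram.
result = vector_mode_dot([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
print(result)
```

A pairwise merge needs a number of stages that grows with the logarithm of the vector length. That is one plausible reason a long dot product can finish in few clocks, although the slide does not give the actual pipeline depth.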

UltraScale Results

• Vivado® routes more complex designs on UltraScale
• UltraScale shows lower congestion on complex designs
• As a result, timing closure is accelerated
• Delivers 1 speedgrade higher Fmax

[Figure: two charts with a "Routing complexity" axis; legend: No routing congestion, High routing congestion, Cannot route]


Power Optimizations

[Figure: power by category (Static, Dynamic, I/O, Transceiver) for three device generations: Spartan-6/Virtex-6 (45nm/40nm), 7 Series (28nm), and UltraScale (20nm/16nm). Reduction labels on the chart: up to 50%, up to 50%, 25-45%, up to 40%, up to 30%, up to 50%, up to 60%, up to 65%, up to 30%, up to 40%]

• Architectural optimizations
• Low power mode
• I/O multi-mode control (cont’d from 28nm)
• DDR4 voltage reduction
• CLB packing & reduced wire length
• HW based clock gating on leaf cells
• BRAM hardened data cascading
• BRAM dynamic power gating
• DSP hardened features
• MMCM & PLL lower supply voltage
• Process node
• Power binning & lower voltage scaling
• 3D IC static power binned slices

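From One Component of a System to the Whole System

✤ In line with the SoC trend, FPGA use is shifting from providing one function block within a system toward building the entire SoC on the FPGA

✤ What makes this possible is the rising integration density of semiconductors

✤ Faster circuits are demanded, together with lower power consumption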


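Hot Chips 26 Materials

✤ Materials from past years are available at http://www.hotchips.org.

✤ Presentation videos are also available, going back several years.

✤ For Hot Chips 26, only the keynotes are publicly available so far.

✤ Everything is scheduled to be released in December.

Topics from SuperComputing 2014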


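SuperComputing

✤ Held every November

✤ This year it ran November 16-22 at the convention center in New Orleans

✤ Besides the paper sessions for research papers, there are also an exhibition and BoF sessions.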