2014/12/10
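Hot Chips & SC14 Topics, and the Current Status and Future of the CAE Prototype Board
Toshiaki Kitamura, Graduate School of Information Sciences, Hiroshima City University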
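From the Presentations at Hot Chips 26
What Is Hot Chips?
✤ A conference held every summer since 1989, centered on semiconductors such as microprocessors
✤ Most presentations come from companies, and new products are often announced there these days
✤ Sessions range from mobile to PC, server, and supercomputer processors
✤ There is also an FPGA session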
Monday, August 11
Tuesday, August 12
August 10-12, 2014
A Symposium on High-Performance Chips
Flint Center for the Performing Arts, Cupertino, CA
http://www.hotchips.org
HOT CHIPS 26 ADVANCE PROGRAM
A Symposium of the Technical Committee on Microprocessors and Microcomputers of the IEEE Computer Society and the Solid-State Circuits Society
High-Performance Computing
• SX-ACE Processor: NEC's Brand-New Vector Processor (NEC)
• SPARC64 XIfx: Fujitsu’s Next Generation Processor for HPC (Fujitsu)
• Anton 2: A 2nd-Generation ASIC for Molecular Dynamics Simulation (D.E. Shaw Research)

Keynote 1
• Power Constraints: From Sensors to Servers (Michael Muller, ARM)

Mobile Processors
• NVIDIA’s Tegra K1 System-on-Chip (NVIDIA)
• Applying AMD’s “Kaveri” APU for Heterogeneous Computing (AMD)
• NVIDIA’s Denver Processor (NVIDIA)

Technology
• HBM: Memory Solution for Bandwidth-Hungry Processors (SK Hynix Inc)
• Improved 3D Chip Stacking with ThruChip Wireless Connections (ThruChip Communications)
• CMOS Biochips for Point-of-Care Molecular Diagnostics (InSilixa)

ARM Servers
• The AMD Opteron “Seattle”: A 64b ARM Dense Server Processor (AMD)
• ARM Next-Generation IP Supporting LSI’s High-End Networking (ARM, LSI Logic)
• X-Gene2: 28nm Scale-Out Processor (Applied Micro)

FPGAs
• Design of a High-Density SOC-FPGA at 20nm (Altera)
• Large-Scale Reconfigurable Computing in a Microsoft Datacenter (Microsoft)
• Xilinx FPGAs Case Study: High Capacity and Performance 20nm FPGAs (Xilinx)
• SDA: Software-Defined Accelerator for Large-Scale DNN Systems (Baidu)

High-Performance ASICs
• Hardware-Accelerated Text Analytics (IBM)
• Myriad2 “Eye” of the Computational-Vision Storm (Movidius)
• Goldstrike 1: A 1st Generation Cryptocurrency Processor for Bitcoin Mining (Cointerra)
• RayChip: Real-Time Ray Tracing Chip for Embedded Applications (Siliconarts)

Keynote 2
• The Internet of Everything: What is it? What’s driving it? What comes next? (Rob Chandhok, Qualcomm)

Dense Servers and Server Technology
• SCORPIO: 36-Core Shared-Memory Processor with a Coherent Mesh (MIT)
• Oracle’s Next-Generation SPARC Processor Cache Hierarchy (Oracle)
• Unchaining the Datacenter with OpenPOWER: Reengineering a Server Ecosystem (IBM)
• Intel C2000 Atom Microserver: Power Efficient Processing for the Data Center (Intel)

Big-Iron Servers
• Performance Characteristics of the POWER8 Processor (IBM)
• Next-Generation Oracle SPARC Processor (Oracle)
• IvyBridge Server: Delivering Performance from Workstations to Mission Critical (Intel)

Tutorial 1: Emerging Trends in Hardware Support for Security
• Security Basics (Princeton)
• Mobile HW Security (ARM)
• Secure Systems Design (AMD)
• Mitigating Exploits, Rootkits and Advanced Persistent Threats (Intel)
• University Research in Hardware Security (Princeton)

Tutorial 2: Internet of Things
• Powering the Internet of Things (TI)
• Ultra Low Power Design Approaches for IoT (National University of Singapore)
• Connecting the IoT (Qualcomm)
• Standards for Constrained IoT Devices (ARM)
Organizing Committee
• Chair: Krste Asanovic (UC Berkeley)
• Vice Chair: Fred Weber
• Finance: Lily Jow (HP)
• Advertising: Don Draper (Oracle)
• Sponsorship: Amr Zaky (Invensense)
• Publications: Randall Neff
• Press: Ralph Wittig (Xilinx)
• Registration: Charlie Neuhauser (Neuhauser Associates)
• Location Services: John Sell (Microsoft), Allen Baum
• Volunteer Coordinator: Gary Brown (Tensilica)
• Webmaster, IT: Kevin Broch
• Production: Lance Hammond, Mike Albaugh, Keith Diefendorff

Steering Committee
• Chair: Alan Jay Smith
• Committee Members: Allen Baum, Don Draper (Oracle), Pradeep Dubey (Intel), Lily Jow (HP), John Mashey (Techviser), John Sell (Microsoft), Keith Diefendorff

Program Committee
• Program Co-Chairs: Sam Naffziger (AMD), Guri Sohi (U. Wisconsin)
• Committee Members: Forest Baskett (NEA), Pradeep Dubey (Intel), John Davis (Microsoft), Alan Jay Smith (UC Berkeley), Steve Miller (NetApp), Subhasish Mitra (Stanford), Stefan Rusu (Intel), Tom McWilliams (BayStorage), Behnam Robatmili (Qualcomm), Ralph Wittig (Xilinx), Mike Taylor (UCSD), Bill Dally (NVIDIA)
• Founder: Bob Stewart (SRE)
HOT CHIPS brings together designers and architects of high-performance chips, software, and systems. The tutorial and presentation sessions focus on up-to-the-minute developments in leading-edge industrial designs and research projects.
Register now at: https://www.123signup.com/register?id=drvzv
Sunday, August 10
AMD's ARM Server
✤ Designed by AMD, not by ARM
✤ Targets server use with the ARM architecture rather than x86
THE AMD OPTERON™ A1100 PROCESSOR, CODENAMED "SEATTLE"
SEAN WHITE 11 AUGUST 2014
| AMD “SEATTLE” | HOT CHIPS 26 | 11 AUGUST 2014 2
“SEATTLE” – WHAT IS IT AND WHY?
• What is it?
‒ “Seattle” is AMD’s first 64-bit ARM-based processor
‒ 8 ARM Cortex™-A57 cores
‒ 2 DDR3/4 DRAM channels
‒ 10G Ethernet, PCI-Express, SATA
‒ GlobalFoundries 28nm process
• Why did AMD build it?
‒ “Seattle” is a dense server processor for datacenter applications
‒ Performance/dollar/watt drives today’s datacenter designs
‒ A significant number of datacenter workloads have inherently low Instructions Per Clock (IPC) and high cache miss rates
‒ For such workloads, processors like “Seattle,” with smaller cores and caches, can deliver performance equivalent to traditional server processors with large cores and caches, while using much less power and area
‒ The 32-bit to 64-bit transition for the ARM architecture is a major shift in the industry, as the 32-bit to 64-bit transition in x86 was
‒ AMD is taking a leadership role in the 64-bit ARM space, as it did in the 64-bit x86 space
“SEATTLE” SOC OVERVIEW
28nm Process Technology
[Block diagram: Cortex-A5 System Control Processor; Cryptographic Coprocessor; 8MB L3 cache; two DDR3/4 memory controllers; eight 64-bit Cortex-A57 cores with four 1MB L2 caches; I/O blocks: I2C, UART, SPI, 1Gbit Ethernet (RGMII), 10Gbit Ethernet (KR), SATA 3, PCIe Gen 3]
Package
• 27mm x 27mm, SP1 BGA
Power Efficient Cores
• Up to eight ARM Cortex-A57 cores
• Up to 4MB shared L2 cache total
Cache Coherent Network
• Full cache coherency
• 8MB L3 cache
• SMMU: I/O address mapping and protection
High Performance, Flexible Memory
• Two 64-bit DDR3/4 channels with ECC
• Two DIMMs/channel up to 1866MHz
• SODIMM, UDIMM, RDIMM support
• Up to 128GB per CPU
Highly Integrated I/O
• 8x SATA 3 (6Gb/s) ports
• Two 10GBASE-KR Ethernet ports
• 8 lanes PCI-Express® Gen 3, supports x8, x4, x2
System Control Processor
• TrustZone® technology for enhanced security
• Dedicated 1GbE system management port (RGMII)
• SPI, UART, I2C interfaces
Cryptographic Coprocessor
• Separate cryptographic algorithm engine for offloading encryption, decryption, compression, decompression computations
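As a rough worked example of the memory figures above, peak DRAM bandwidth follows from the channel count, data width, and transfer rate. This is only a sketch under two assumptions that the slide does not state: that "1866" means 1866 mega-transfers per second (the usual DDR3-1866 convention), and that the ECC bits are not counted in the 64-bit data width. The function and variable names are illustrative.

```python
def peak_dram_bandwidth_gb_s(channels: int, data_bits: int, mega_transfers_per_s: float) -> float:
    """Peak bandwidth in GB/s (10^9 bytes/s), ignoring ECC bits and protocol overhead."""
    bytes_per_transfer = data_bits / 8
    return channels * bytes_per_transfer * mega_transfers_per_s * 1e6 / 1e9


# Figures from the slide: two 64-bit channels at up to 1866 (assumed to be MT/s).
seattle_bw = peak_dram_bandwidth_gb_s(channels=2, data_bits=64, mega_transfers_per_s=1866)

# Capacity: two channels with two DIMMs each, up to 128GB per CPU.
dimm_slots = 2 * 2
gb_per_dimm_at_max = 128 / dimm_slots

print(f"peak DRAM bandwidth: about {seattle_bw:.1f} GB/s")
print(f"{dimm_slots} DIMMs of {gb_per_dimm_at_max:.0f} GB each at maximum capacity")
```

Under these assumptions the two channels together peak at just under 30 GB/s, and reaching 128GB requires 32GB in each of the four DIMM slots.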
“SEATTLE” REFERENCE SYSTEM
Standalone uATX board
• 1P standalone platform intended to meet the needs of partners (ISV, OSV, IHV)
• Off-the-shelf 2U rack mount chassis
• DDR3 DIMMs only
• x8 PCIe Gen3 lanes supporting (1) x8 slot or alternatively (2) x4 slots
• NIC supported through add-in card option
• Supports up to 8 hard drives
• Provisions for remote access to start, stop, and remote console will be provided
“SEATTLE” REFERENCE SYSTEM BOARD
• uATX form factor
• 1 “Seattle” SP1 BGA processor
• DDR3 2-DIMM per memory channel config (up to 4 DIMMs per CPU)
• 1 x8 PCIe slot
• 2 x4 PCIe slots as an alternative via mux
• 8 SATA3 ports
• 2 10GBase-T connectors
• 4 I2C ports
• 2 UARTs
• Supports required debug features
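To put the board's I/O in perspective, the upper bound on payload bandwidth can be estimated from the line rate and line coding of each interface. The sketch below uses the standard values for PCIe Gen 3 (8 GT/s per lane, 128b/130b coding) and SATA 3 (6 Gb/s, 8b/10b coding); these constants come from the respective standards, not from the slide, and the helper name is hypothetical. Protocol overhead above the line coding is ignored, so real throughput is lower.

```python
def payload_gbytes_per_s(line_rate_gbit_s: float, payload_bits: int,
                         coded_bits: int, lanes: int = 1) -> float:
    """Payload bandwidth in GB/s for one direction, after line coding only."""
    return line_rate_gbit_s * payload_bits / coded_bits * lanes / 8


# One x8 PCIe Gen 3 slot: 8 GT/s per lane, 128b/130b line coding.
pcie_gen3_x8 = payload_gbytes_per_s(8.0, 128, 130, lanes=8)

# Eight SATA 3 ports: 6 Gb/s per port, 8b/10b line coding.
sata3_x8 = payload_gbytes_per_s(6.0, 8, 10, lanes=8)

print(f"PCIe Gen 3 x8: about {pcie_gen3_x8:.2f} GB/s per direction")
print(f"8 x SATA 3:    about {sata3_x8:.2f} GB/s aggregate")
```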
FPGAs with ARM Cores
✤ Products built on a 20nm process
✤ An entire SoC, including ARM cores, on a single chip
Design of a High-Density SoC FPGA at 20nm
Brad Vest, Sean Atsatt, Mike Hutton
Altera, San Jose
High Capacity and High Performance 20nm FPGAs
Steve Young, Dinesh Gaitonde
August 2014
© Copyright 2014 Xilinx
Device Goals
• Mid-Range FPGA: balance of performance/power/cost targeting key market applications
• Key Targets and Metrics:
− 491 MHz fixed-point DSP datapath for Wireless RRU
− 1M+ LEs at 350 MHz for 4xOTU4 (400G) OTN networks, with Partial Reconfig
− Cloud Server Acceleration – Hardened Floating-Point
− 28G transceivers to support 200G to 400G networking/routing
− Dramatic die-size reduction
Overview and Floorplan
• TSMC 20SOC Process
− 5.3B transistors, 11 metal layers
• Resources
− 1.15M LEs, 1.7M FFs
− 64Mb embedded SRAM
− 32 fPLL, 16 PLLs, 32 GCLK
− 1.5 TFlops IEEE754 DSP
− Dual-Core ARM A9
− Row-based redundancy
• I/O
− 28G SERDES, >1.7Tb bandwidth
− x72 2.667Gbps DDR4 with hard memory controller
− Hardened PCIe/ILKN/10GE
Hardened Floating Point DSP
• Hardened IEEE 754 floating-point adder and multiplier
− 12% DSP area increase (<<1% die area)
• 100% fixed-point backwards compatible
− No performance or power penalty
• ‘Have your cake and eat it too’
• How is this possible?
− Overlaid FP algorithms on fixed-point circuits
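One way to see why floating point can be overlaid on fixed-point circuits is that the core of a floating-point multiply is an integer multiply of the two mantissas. The sketch below is a simplified software model of that idea and not a description of the actual Arria 10 hardware: it handles only normal, finite single-precision values, truncates instead of rounding, and ignores overflow and underflow. All function names are illustrative.

```python
import struct


def f32_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_f32(b: int) -> float:
    """Return the float encoded by a 32-bit IEEE 754 bit pattern."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def fp32_mul_via_int(a: float, b: float) -> float:
    """Multiply two normal float32 values using integer operations only."""
    ba, bb = f32_bits(a), f32_bits(b)
    sign = (ba ^ bb) & 0x80000000
    exp = ((ba >> 23) & 0xFF) + ((bb >> 23) & 0xFF) - 127  # remove one bias
    ma = (ba & 0x7FFFFF) | 0x800000  # restore the implicit leading 1 (24 bits)
    mb = (bb & 0x7FFFFF) | 0x800000
    prod = ma * mb                   # the fixed-point multiplier does this step
    if prod & (1 << 47):             # mantissa product in [2, 4): renormalize
        prod >>= 24
        exp += 1
    else:                            # mantissa product in [1, 2)
        prod >>= 23
    return bits_f32(sign | (exp << 23) | (prod & 0x7FFFFF))


print(fp32_mul_via_int(1.5, 2.5))
print(fp32_mul_via_int(-3.0, 0.5))
```

The mantissa product is the expensive part and maps onto an existing integer multiplier; the exponent add and the final shift are comparatively small additions, which is consistent with the small area increase quoted on the slide.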
Major Innovation – Hard Floating Point on a Commercial FPGA
[Figure: DSP block with a multiplier (X) and an adder (+) on 32-bit operands]
DSP Block – 1000s of blocks at very low latency
• 1.5 TFLOPS of aggregate computation; 50 GFLOPS/W
− 1678 blocks @ 2 FLOPS/clock @ 450 MHz ≈ 1.5 TFLOPS
− Can run individually or as large integrated DSP system
• Hardware recursive structure support (Vector Mode)
− 10s/100s of DSP blocks can be seamlessly integrated
− Internal/external pipelining of individual DSP elements
• Very small latency
− Floating point used for iterative algorithms, which require small latency
− Arria 10 floating point: 256-length dot products ~25 clocks
− Standard FPGA technology: 256-length systolic FIR filter ~750 clocks
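The aggregate throughput figure can be checked by multiplying the three numbers quoted on the slide. In the sketch below the function name is illustrative, and the power estimate assumes that the 50 GFLOPS/W efficiency applies at peak throughput, which the slide does not state.

```python
def peak_gflops(dsp_blocks: int, flops_per_clock: int, clock_mhz: float) -> float:
    """Peak throughput in GFLOPS: blocks x FLOPs per clock x clock rate."""
    return dsp_blocks * flops_per_clock * clock_mhz / 1000.0


peak = peak_gflops(dsp_blocks=1678, flops_per_clock=2, clock_mhz=450)

# Assumption: the quoted 50 GFLOPS/W holds at peak throughput.
watts_at_peak = peak / 50.0

print(f"peak: {peak:.1f} GFLOPS (about {peak / 1000:.2f} TFLOPS)")
print(f"implied DSP power at peak: about {watts_at_peak:.1f} W")
```

The product comes to roughly 1,510 GFLOPS, which matches the 1.5 TFLOPS headline figure.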
[Figure: chained DSP blocks, each with a multiplier (X) and an adder (+), forming a dot product of inputs A to P; partial sums AB+CD, EF+GH, AB+CD+EF+GH, IJ+KL, IJ+KL+MN+OP; final output AB+CD+EF+GH+IJ+KL+MN+OP]
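The chaining shown in the figure can be modelled functionally: each DSP block performs one multiply and one add, neighbouring blocks are paired to form sums such as AB+CD, and the partial sums are then combined pairwise. The sketch below is a behavioural model only. The function names and the exact reduction order are assumptions based on the partial sums visible in the figure, and the second stage uses plain additions where the hardware would reuse the adders inside the DSP blocks.

```python
def dsp_block(x: float, y: float, carry_in: float = 0.0) -> float:
    """One DSP block: a multiply (X) followed by an add (+)."""
    return x * y + carry_in


def dot_product(a: list, b: list) -> float:
    """Dot product built from paired DSP blocks and a pairwise reduction."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")

    # Stage 1: pair neighbouring blocks, e.g. A*B feeds the adder of C*D.
    partial = []
    for i in range(0, len(a), 2):
        first = dsp_block(a[i], b[i])
        if i + 1 < len(a):
            partial.append(dsp_block(a[i + 1], b[i + 1], first))
        else:
            partial.append(first)

    # Stage 2: combine partial sums pairwise, e.g. (AB+CD) + (EF+GH).
    while len(partial) > 1:
        combined = [partial[i] + partial[i + 1]
                    for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            combined.append(partial[-1])
        partial = combined
    return partial[0]


print(dot_product([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1]))
```

A pairwise reduction of this kind needs a number of stages that grows with the logarithm of the vector length, which is one plausible reading of why the slide reports a far lower latency for the dot product than for the systolic filter.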
UltraScale Results
• Vivado® routes more complex designs on UltraScale
• UltraScale shows lower congestion on complex designs
• As a result, timing closure is accelerated
• Delivers 1 speedgrade higher Fmax
[Figure: two charts with a "Routing complexity" axis; region labels: "No routing congestion", "High routing congestion", "Cannot route"]
Power Optimizations
[Figure: static, dynamic, I/O, and transceiver power compared across Spartan-6/Virtex-6 (45nm/40nm), 7 Series (28nm), and UltraScale (20nm/16nm); reduction labels: "up to 50%", "up to 50%", "25-45%"]
• Architectural optimizations
• Low power mode
• I/O multi-mode control (cont’d from 28nm)
• DDR4 voltage reduction
• CLB packing & reduced wire length
• HW based clock gating on leaf cells
• BRAM hardened data cascading
• BRAM dynamic power gating
• DSP hardened features
• MMCM & PLL lower supply voltage
• Process node
• Power binning & lower voltage scaling
• 3D IC static power binned slices
[Per-item saving labels, item mapping not recoverable: up to 40%, up to 30%, up to 50%, up to 60%, up to 65%, up to 30%, up to 40%]
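Several of the items above rely on lowering a supply voltage. A first-order way to see why this is effective is the standard CMOS dynamic power model, in which power scales with the square of the voltage and linearly with frequency. The sketch below applies it to the DDR3-to-DDR4 I/O voltage change. The 1.5V and 1.2V values are the nominal supply voltages of those memory standards rather than figures from the slide, and the model assumes unchanged capacitance and switching activity, so the result is only indicative.

```python
def dynamic_power_ratio(v_new: float, v_old: float,
                        f_new: float = 1.0, f_old: float = 1.0) -> float:
    """Ratio of dynamic power under P ~ C * V^2 * f, with C and activity fixed."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


# Nominal I/O supply: DDR3 at 1.5 V, DDR4 at 1.2 V, same frequency assumed.
ratio = dynamic_power_ratio(v_new=1.2, v_old=1.5)
saving_percent = (1.0 - ratio) * 100.0

print(f"dynamic power ratio: {ratio:.2f}")
print(f"first-order saving:  {saving_percent:.0f}%")
```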
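From One Component of a System to the Entire System
✤ Following the SoC trend, the approach is shifting from using FPGA functions as one building block of a system to building the whole SoC on the FPGA
✤ What makes this possible is the increasing integration density of semiconductors
✤ Faster circuits are demanded, along with reduced power consumption
Hot Chips 26 Materials
✤ Materials from past years are available at http://www.hotchips.org.
✤ For the last several years, videos of the presentations can also be viewed.
✤ For Hot Chips 26, only the keynotes are publicly available.
✤ Everything is scheduled to be made public in December.
Topics from SuperComputing 2014
SuperComputing
✤ Held every November
✤ This year: November 16-22 at the convention center in New Orleans
✤ Besides paper sessions for research presentations, there are also an exhibition and BoF sessions.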