In Silico Drug Discovery Era: Development Status of the Latest Chips and Applications

郡司 茂樹 (Shigeki Gunji), Solution Architect, Intel

バイオグリッド研究会2015 (BioGrid Research Meeting 2015)

Source: biogrid.jp/pdf/20150530bg-1.pdf

Product portfolio supporting advances in the life sciences

COMPUTE: Intel® Xeon®, Intel® Xeon Phi™

VISUALIZATION: Intel® iGFX, Intel® Xeon®, Intel® Xeon Phi™, Iris Pro™ Graphics, Embree Ray-Tracing

STORAGE: Intel® Lustre, Intel® SSD/NVMe, RAID Controller, Intel® Xeon®

NETWORKING: Intel® 10/40GbE, Intel® Switch Si

FABRIC: Intel® True Scale, Intel® Omni-Path, Intel® 10/40GbE

Intel® Software Developer Tools

Intel® Cluster Ready, Intel® Data Center Manager

Boards/Systems

IA Programming Model & Code Base: The Broadest Technical Computing Ecosystem

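Intel® Xeon® Processor E5 Family

The foundation of Intel's HPC performance, ideal for nearly the full range of workloads

Industry-leading performance and performance per watt

A general-purpose processor for serial and parallel workloads, with a standard range of core counts and a focus on fast serial performance as well

www.intel.com/xeon

Deep learning is a piece of cake, too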

Intel® Xeon Phi™ Coprocessor 7120P

61 cores, 244 threads, 1.238 GHz

1.21 TFLOPS (peak double-precision floating-point performance), 512-bit SIMD instructions

32 vector registers per thread, 16 GB GDDR5 memory, 352 GB/s

300 W (passive cooling), PCIe x16 (requires an IA host processor)

22 nm with the world's first 3-D Tri-Gate transistors

Linux operating system, IP addressable

Common x86/IA programming models and SW tools

www.intel.com/xeonphi

Is Xeon Phi† performance compelling vs. Xeon† E5v2? "2-socket Xeon E5v2 system" vs. "2-socket Xeon E5v2 system + Xeon Phi 7120"

http://www.intel.com/performance

†Xeon = Intel® Xeon® processor; †Xeon Phi = Intel® Xeon Phi™ coprocessor

Xeon Phi delivers up to 165 higher performance (with 1 card) versus 2-socket Xeon E5v2

Intel® Xeon Phi™ Product Family

Knights Corner: 1 TFLOPS¹; Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog

Knights Landing: 3+ TFLOPS²; processor (bootable), on-package high-bandwidth memory, integrated interconnect; first commercial systems in 2H'15; >50 systems providers expected³, plus many more card-based systems; >100 PFLOPS of customer system compute commits to date³

Knights Hill: 3rd-generation Intel® Xeon Phi™ product family; 2nd-generation Intel Omni-Path Architecture; 10 nm process technology

1 Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS.

2 Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency, and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle.

3 Intel internal estimate.
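The peak-performance footnote above is a simple product, and it can be checked with a few lines of Python. The sketch below is only an illustration: the function name is hypothetical, and it assumes the 16 DP FLOPS per clock per core come from 8 double-precision lanes in a 512-bit vector with a fused multiply-add counted as 2 operations. Using the 1.238 GHz clock from the 7120P specification (rather than the rounded 1.23 GHz in the footnote) reproduces the quoted 1.208 TFLOPS.

```python
def peak_tflops(cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak = cores x clock x FLOPs per cycle per core, in TFLOPS."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# A 512-bit SIMD register holds 8 doubles; a fused multiply-add counts as 2 FLOPs per lane.
DP_LANES = 512 // 64
FLOPS_PER_CYCLE = DP_LANES * 2  # 16 DP FLOPS per clock per core

xeon_phi_7120 = peak_tflops(cores=61, clock_ghz=1.238, flops_per_cycle=FLOPS_PER_CYCLE)
print(f"{xeon_phi_7120:.3f} TFLOPS")  # prints 1.208 TFLOPS
```

The same function applies to the Knights Landing footnote, but the core count and clock for that part are not given in these slides, so no value is computed for it here.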

Intel® Omni-Path Architecture (coming 2H'15)

www.intel.com/omnipath

100 Gbps line speed

48-port switch chip architecture

High port density: 1.3x (48 ports vs. 36 in InfiniBand). Maximize the SINGLE SWITCH investment: 48 ports support up to 12 additional nodes by adding only CABLES¹

Fewer switches²: up to ½ (small clusters, mainstream clusters)

High system scalability: 2.3x, over 27k NODES in a 2-tier, 5-hop FABRIC³ (supercomputers)

High application performance scalability: 56% lower latency⁴ than InfiniBand (lower is better)

1 As compared to a shipping 36-port edge InfiniBand switch.

2 Reduction in up to ½ fewer switches claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

3 2.3X based on 27,648 nodes for a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

4 Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.
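The two node counts in footnote 3 coincide with the textbook capacity of a full-bisection fat tree built from identical switch ASICs with three switching levels, ports³/4, in which the longest path crosses five switches. The sketch below checks that arithmetic. It is an assumption that the slide's figures were derived this way, and the function name is hypothetical.

```python
def max_fat_tree_nodes(ports: int) -> int:
    """Hosts in a full-bisection fat tree with three switching levels: ports**3 / 4."""
    half = ports // 2  # half of each switch's ports face down, half face up
    return ports * half * half


omni_path = max_fat_tree_nodes(48)   # 48-port switch ASIC
infiniband = max_fat_tree_nodes(36)  # 36-port switch ASIC
print(omni_path, infiniband, round(omni_path / infiniband, 2))  # prints 27648 11664 2.37
```

The ratio of roughly 2.37 is consistent with the 2.3x scalability figure on the slide, and 48/36 ≈ 1.33 is consistent with the 1.3x port-density figure.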

Intel® EE for Lustre: connectivity with Hadoop

Breakdown: open source, Intel proprietary extensions, connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme-scale HPC

Aurora, Argonne National Laboratory: >180 PF (April '15)

Theta, Argonne National Laboratory: >8.5 PF

Trinity, NNSA†: >40 PF (July '14)

Cori, NERSC‡: >30 PF (April '14)

>$200M

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC); † Cray XC Series at National Nuclear Security Administration (NNSA)

The Most Advanced Supercomputer Ever Built: an Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime contractor: Intel. Subcontractor: Cray.

Source: Argonne National Laboratory and Intel. †Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
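The 18X and >6X claims follow directly from the quoted peak and power figures. The short check below takes Aurora as 180 PFLOPS at 13 MW and MIRA as 10 PFLOPS at 4.8 MW; reading MIRA's power as 4.8 MW is an assumption, chosen because it is the reading consistent with the >6X efficiency claim. The variable names are illustrative only.

```python
aurora_pf, aurora_mw = 180.0, 13.0  # Aurora: >180 PFLOPS at 13 MW
mira_pf, mira_mw = 10.0, 4.8        # MIRA: 10 PFLOPS; 4.8 MW is an assumed reading

performance_ratio = aurora_pf / mira_pf
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)  # PFLOPS per MW
print(f"{performance_ratio:.0f}X performance, {efficiency_ratio:.2f}X energy efficiency")
# prints 18X performance, 6.65X energy efficiency
```

Because 180 PFLOPS is a lower bound on Aurora's peak, both ratios are lower bounds as well.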

Application support status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being."

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus, 1 Million Atoms (1 NODE)

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 18.7 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Offload processing on both, using the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (whoever has a license), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

Chart (comparative performance), legend:
Intel® Xeon® processor E5-2697 v2 (baseline)
Intel® Xeon® processor E5-2697 v2 (optimized)
Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A
Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP
Intel® Xeon® processor E5-2697 v3
Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A

Bar labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

Other names and brands may be claimed as the property of others.

APPROVED FOR PUBLIC PRESENTATION
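The 1.07X figure in the Results text can be cross-checked against the chart labels. The sketch below assumes that the bar label read as 2.26X belongs to the Xeon E5-2697 v2 + NVIDIA K40 configuration; the slide text does not state that mapping, so treat it as an assumption. Both speedups are relative to the same Xeon E5-2697 v2 baseline, so their ratio compares the two accelerated configurations directly.

```python
speedup_v3_phi = 2.41  # Xeon E5-2697 v3 (optimized) + Xeon Phi 7120A, from the Results text
speedup_v2_k40 = 2.26  # ASSUMPTION: this bar label is the Xeon E5-2697 v2 + NVIDIA K40 run

relative = speedup_v3_phi / speedup_v2_k40
print(f"{relative:.2f}X")  # prints 1.07X
```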

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT (408K Atoms), 3 NODES CLUSTER BENCHMARK

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 18.7 of the manual).

Usage Model: Baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms: 50% of the workload on the host, 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (whoever has a license), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

Chart (comparative performance; 2 nodes, 3 nodes), legend:
Intel® Xeon® processor E5-2697 v2 (baseline)
Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A
Xeon E5-2697 v2 + NVIDIA K40 DPFP

Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

GROMACS: 512K H2O with RF (1 NODE)

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1 available here and here. Recipe: available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation on the Intel® Xeon Phi™ coprocessor natively plus the host processor, using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Chart (GROMACS 512K H2O with RF speedup, comparative performance), legend:
Intel® Xeon® processor E5-2697 v2
1 Intel® Xeon Phi™ coprocessor 7120P/X
2 Intel® Xeon Phi™ coprocessor 7120P/X
Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X
Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X

Bar labels: 1, 1.03X, 1.56X, 1.72X, 1.79X

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

NWChem 6.3 rev2 and 6.5: CCSD(T) Method, 32-Node Speedup (CLUSTER BENCHMARK)

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed at the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available here and from the SVN repository. Recipe: available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including on clusters.

Results: Compared to the baseline of NWChem 6.3 rev2 on the Intel® Xeon® processor E5-2697 v2:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

Chart (NWChem CCSD(T) method, comparative performance, 32 nodes), legend and bar labels:
NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1
NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X
NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A: 1.52X

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

NAMD 2.10 Pre-Release: STMV (32 NODES CLUSTER BENCHMARK)

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

Chart (NAMD 2.10 pre-release cluster performance increase, STMV, ~1M atoms; 1 node, 8 nodes, 32 nodes), legend:
Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN)
Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN)
Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T)
Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T)

Bar labels: 1, 1.2X, 2X, 2.1X, 6.8X, 7.9X, 12.2X, 13.1X, 20X, 24.2X, 27.2X, 32X

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

NAMD 2.10 Pre-Release: ApoA1 (2 NODES CLUSTER BENCHMARK)

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

Chart (NAMD 2.10 pre-release cluster performance increase, ApoA1, ~92K atoms, 55 PPN; 1 node, 2 nodes), legend:
Intel® Xeon® processor E5-2697 v3 (baseline: 1 node, 55 PPN)
Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T)

Bar labels: 1, 1.52X, 1.94X, 2.61X

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

LAMMPS Stillinger-Weber Water Benchmark (32 NODES CLUSTER BENCHMARK) NEW

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup at higher node counts.

Chart (LAMMPS Liquid Crystal Benchmark Performance, mixed precision, comparative performance; 1 node, 32 nodes), legend:
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on
2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)

Bar labels: 1, 3X, 0.9X, 3.41X, 1, 3.05X, 3.6X (no testing on Tesla)

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3; "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

LAMMPS Rhodopsin Benchmark: 512K Atoms (32 NODES CLUSTER BENCHMARK)

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Chart (LAMMPS Rhodopsin Benchmark Performance, mixed precision, comparative performance), legend:
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)
2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package)

Bar labels, 1 node: 1, 1.27X, 1.68X
Bar labels, 32 nodes: 1, 1.07X, 1.47X

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

Johns Hopkins Bowtie 2: Multiple Workloads (1 NODE) NEW

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Chart (Johns Hopkins Bowtie 2 TGen workload speedup, comparative increase; workloads ERR161544, SRR034966_1, ERR000589, SRR002273_1), legend:
Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40
Intel® Xeon® processor E5-2697 v3

Bar labels: 1, 1.87X, 1.59X, 1.08X, 0.88X

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

Burrows-Wheeler Aligner (BWA-ALN): Human Genome (1 NODE)

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference data base).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Chart (BWA-ALN speedup, comparative performance), legend and bar labels:
2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1
2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X
2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014

BLAST: BLASTn v30 (1 NODE) NEW

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart (BLASTn v30 speedup, comparative performance), legend:
2S Xeon E5-2697 v2 (BLASTn v30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3; "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015
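The "sweet spot" in the BLASTn usage model can be pictured with a simple two-device load-sharing model: when the host and the coprocessor work on their shares concurrently, the node finishes when the slower side finishes, so the best split is the one where both finish together. The sketch below is only an illustration of that idea. The function names and the throughput rates are hypothetical and are not measurements from the slides.

```python
def node_time(total_work: float, host_share: float,
              host_rate: float, coproc_rate: float) -> float:
    """Wall time when host and coprocessor process their shares concurrently."""
    host_time = total_work * host_share / host_rate
    coproc_time = total_work * (1.0 - host_share) / coproc_rate
    return max(host_time, coproc_time)


def best_host_share(host_rate: float, coproc_rate: float) -> float:
    """Share of the work at which both devices finish at the same moment."""
    return host_rate / (host_rate + coproc_rate)


# Hypothetical rates (queries per second), chosen only for illustration.
HOST_RATE, COPROC_RATE = 4.0, 1.0
share = best_host_share(HOST_RATE, COPROC_RATE)
baseline = node_time(100.0, 1.0, HOST_RATE, COPROC_RATE)  # host does everything
balanced = node_time(100.0, share, HOST_RATE, COPROC_RATE)
print(f"host share {share:.0%}, speedup {baseline / balanced:.2f}X")
# prints host share 80%, speedup 1.25X
```

Moving away from the balanced share in either direction makes one device the bottleneck, which is one way to read the remark that the sweet spot is small. Real runs also depend on query lengths and transfer overheads that this model ignores.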

BLAST: BLASTp v30 (1 NODE) NEW

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart (BLASTp v30 speedup, comparative performance), legend:
2S Xeon E5-2697 v2 (BLASTn v30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

Bar labels: 1, 1.41X, 1.39X, 1.21X, 1.3X, 1.15X

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3; "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

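Legal Information

All products, computer systems, dates and figures specified in this document are based on current expectations and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number. Intel® processors, chipsets and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata. Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS and virtual machine monitor (VMM) that support the technology. Functionality, performance or other benefits of virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html. A system that supports Intel® Turbo Boost Technology is required. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost. Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English). Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

Legal Disclaimer: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. Buyers considering a system or component purchase should consult other sources of information to evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase. Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that correlates with the reported performance improvement. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English). The information in this document is provided as is, and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. No liability is assumed for any express or implied warranty relating to this information (including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright or other intellectual property right). Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Results vary with those factors. Buyers should consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.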
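The relative-performance rule in the disclaimer above (assign the baseline a value of 1.0, then divide each platform's result by the baseline platform's result) can be sketched in Python as follows. The function name and the sample numbers are illustrative assumptions, not figures from this presentation, and the sketch assumes a higher-is-better metric such as throughput.

```python
def relative_performance(results, baseline):
    """Divide every platform's result by the baseline platform's result."""
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical throughput figures (higher is better), for illustration only.
measured = {"baseline": 200.0, "platform_a": 250.0, "platform_b": 410.0}
scores = relative_performance(measured, "baseline")

for name, score in scores.items():
    print(f"{name}: {score:.2f}X")
```

For a lower-is-better metric such as wall-clock time, the ratio would be inverted (baseline time divided by platform time) to obtain the same kind of speedup figure.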


A product portfolio that supports advances in the life sciences

COMPUTE

FABRIC

VISUALIZATION NETWORKING

Intel® Xeon®, Intel® Xeon Phi™

Intel® iGFX, Intel® Xeon®

Intel® Xeon Phi™, Iris Pro™ Graphics

Embree Ray-Tracing

Intel® Lustre, Intel® SSD/NVMe, RAID Controller, Intel® Xeon®

Intel® 10/40GbE, Intel® Switch Si

Intel® True Scale, Intel® Omni-Path

Intel® 10/40GbE

Intel® Software Developer Tools

Intel® Cluster Ready, Intel® Data Center Manager

Boards/Systems

IA Programming Model & Code Base: The Broadest Technical Computing Ecosystem

STORAGE

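Intel® Xeon® Processor E5 Family

The foundation of Intel's HPC performance: ideal for nearly the full range of workloads

Industry-leading performance and performance per watt

A general-purpose processor for serial and parallel workloads, with a standard range of core counts and a focus on fast serial performance as well

www.intel.com/xeon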

Deep learning is a piece of cake, too

Intel® Xeon Phi™ Coprocessor 7120P

61 cores, 244 threads, 1.238 GHz

1.21 TFLOPS (peak double-precision floating-point performance), 512-bit SIMD instructions

32 vector registers per thread, 16 GB GDDR5 memory, 352 GB/s

300 W (passive cooling), PCIe x16 (requires an IA host processor)

22 nm with the world's first 3-D Tri-Gate transistors

Linux operating system, IP addressable

Common x86/IA programming models and SW tools

www.intel.com/xeonphi

Is Xeon Phi† performance compelling vs. Xeon† E5 v2? “2-socket Xeon E5 v2 system” vs. “2-socket Xeon E5 v2 system + Xeon Phi 7120”

http://www.intel.com/performance

†Xeon = Intel® Xeon® processor; †Xeon Phi = Intel® Xeon Phi™ coprocessor

Xeon Phi delivers up to 165 higher performance (with 1 card) versus 2-socket Xeon E5v2

3+ TFLOPS²

Intel® Xeon Phi™ Product Family

3rd-generation Intel® Xeon Phi™ Product Family

2nd-generation Intel Omni-Path Architecture

10 nm process technology

>50 systems providers expected³, plus many more card-based systems

Knights Hill

Knights Landing

Knights Corner

2H'15: First Commercial Systems (Knights Landing)

Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog

1. Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS

2. Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle. 3. Intel internal estimate
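The peak-performance claim in footnote 1 is a plain product of core count, clock frequency and FLOPS per cycle. The following Python sketch reproduces the arithmetic; the function name is a hypothetical helper. One observation, offered as an assumption about the footnote and not something the slides state: the quoted result of 1.208 TFLOPS corresponds to the 1.238 GHz clock listed on the coprocessor specification slide, whereas the rounded 1.23 GHz gives about 1.200 TFLOPS.

```python
def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPS per cycle per core."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# Values from the 7120-series footnote: 61 cores, 16 DP FLOPS per clock per core.
with_rounded_clock = peak_tflops(61, 1.23, 16)
with_spec_clock = peak_tflops(61, 1.238, 16)

print(f"{with_rounded_clock:.3f} TFLOPS at 1.23 GHz")   # 1.200 TFLOPS at 1.23 GHz
print(f"{with_spec_clock:.3f} TFLOPS at 1.238 GHz")     # 1.208 TFLOPS at 1.238 GHz
```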

- Processor (bootable) - On-package high-bandwidth memory - Integrated interconnect

1 TFLOPS¹

>100 PFLOPS customer system compute commits to date³

Intel® Omni-Path Architecture: Coming 2H'15

56% lower latency⁴ (chart vs. InfiniBand; lower is better)

vs. 36 in InfiniBand

100 Gbps line speed

48-port Switch Chip Architecture

High system scalability

High application performance scalability

1.3x

Maximize SINGLE SWITCH investment: 48 ports support up to 12 add'l nodes by only adding CABLES¹

Up to ½

Scalable: over 27k NODES in a 2-tier, 5-hop FABRIC³

High port density

Small clusters, mainstream clusters, supercomputers

Reduced switch count²

2.3x

www.intel.com/omnipath

1. As compared to a shipping 36-port edge InfiniBand switch. 2. The claim of up to ½ fewer switches is based on a 1,024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. 3. 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes. 4. Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches, compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1,024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
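The node counts in footnote 3 coincide with the textbook capacity of a three-level fat tree built from k-port switches, which is k³/4 end nodes. The source does not say how its figures were derived, so the Python sketch below is only a consistency check under that assumption; the function name is hypothetical.

```python
def fat_tree_max_nodes(radix):
    """End-node capacity of a full-bisection three-level fat tree: radix**3 / 4."""
    if radix <= 0 or radix % 2:
        raise ValueError("switch radix must be a positive even number")
    return radix ** 3 // 4


omni_path_nodes = fat_tree_max_nodes(48)   # 48-port switch ASIC
infiniband_nodes = fat_tree_max_nodes(36)  # 36-port switch ASIC
ratio = omni_path_nodes / infiniband_nodes

print(omni_path_nodes, infiniband_nodes)   # 27648 11664
print(f"{ratio:.2f}x")                     # 2.37x
```

Under this reading, the "2.3X" on the slide is the ratio of the two capacities, and the "1.3x" port-density figure is consistent with 48/36.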

Intel® EE for Lustre: Hadoop connectivity

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme-scale HPC

Aurora, Argonne National Laboratory: >180 PF (April '15)

Theta, Argonne National Laboratory: >8.5 PF

Trinity, NNSA†: >40 PF (July '14)

Cori, NERSC‡: >30 PF (April '14)

>$200M

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA)

The Most Advanced Supercomputer Ever Built: an Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. †Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW)
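The two dagger-marked claims follow directly from the figures on this slide. A short Python check, assuming the lower bound of 180 PFLOPS for Aurora and reading MIRA's power figure as 4.8 MW (the decimal point is an assumption about the extracted text); the variable names are illustrative:

```python
# Figures taken from the slide: peak PFLOPS and power in MW.
aurora_pflops, aurora_mw = 180.0, 13.0
mira_pflops, mira_mw = 10.0, 4.8

performance_ratio = aurora_pflops / mira_pflops
# Energy efficiency compared as peak PFLOPS per megawatt.
efficiency_ratio = (aurora_pflops / aurora_mw) / (mira_pflops / mira_mw)

print(f"{performance_ratio:.0f}X higher peak performance")  # 18X higher peak performance
print(f"{efficiency_ratio:.2f}X more energy efficient")     # 6.65X more energy efficient
```

Because 180 PFLOPS is a lower bound, both ratios are lower bounds as well, which matches the ">6X" wording.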

Application support status: Life Sciences

“Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being”

Wang Bingqiang, Head of High Performance Computing, BGI

Legend: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual)

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 plus the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor

Highlights: The code was optimized, delivered to the AMBER community (license holders) and made available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized offload on the Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X the performance of the NVIDIA K40

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

Chart: AMBER 14 Particle Mesh Ewald (PME), Tobacco Virus, 1 Million Atoms (1 node; y-axis: Comparative Performance). Bar labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X

Groups: 2 nodes, 3 nodes. Legend: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual)

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders) and made available as an update patch during code configuration

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2

Chart: AMBER 14 Particle Mesh Ewald (PME), Cellulose NPT (408K atoms); 3-node cluster benchmark; y-axis: Comparative Performance. Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X

Legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120PX; 2 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX

Bar label: 1.56X

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages

Availability: Code: version 5.0-rc1 available here and here. Recipe: available here

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor together with the host processor, using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

Chart: GROMACS 512K H2O with RF Speed Up (1 node; y-axis: Comparative Performance). Bar labels: 1.79X, 1, 1.03X, 1.72X

Legend: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A 2

Y-axis: Comparative Performance

NWChem 6.3rev2 and 6.5 CCSD(T) Method: 32-Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available here and from the SVN repository. Recipe: available here

Usage Model: Offload using LEO and OpenMP

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale

Results: Compared to the baseline of NWChem 6.3rev2 on the Intel® Xeon® processor E5-2697 v2: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A

SOURCE INTEL MEASURED RESULTS AS OF JULY 2014

Chart: NWChem CCSD(T) Method (32-node cluster benchmark). Bar labels: 1, 1.24X, 1.52X

Groups: 1 Node, 8 Nodes, 32 Nodes. Legend: Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T)

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN)

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); 32-node cluster benchmark; y-axis: Comparative Performance. Bar labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

Groups: 1 Node, 2 Nodes. Legend: Intel® Xeon® processor E5-2697 v3 (Baseline: 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T)

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; 2-node cluster benchmark; y-axis: Comparative Performance. Bar labels: 1 (Baseline: 1 node, 55 PPN), 1.52X, 1.94X, 2.61X

Groups: 1 Node, 32 Nodes. Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor, for calculations concurrent with the CPU

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts
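The usage model above relies on splitting work so that the host and the coprocessor finish at about the same time. The Python sketch below illustrates that idea with a simple static model: given each device's processing rate, it picks the offload fraction that equalizes the two completion times. This is an illustrative assumption and not the load balancer implemented in LAMMPS; the function names and the sample rates are hypothetical.

```python
def balanced_offload_fraction(host_rate, coprocessor_rate):
    """Fraction of work to offload so both devices finish at the same time."""
    if host_rate <= 0 or coprocessor_rate <= 0:
        raise ValueError("rates must be positive")
    return coprocessor_rate / (host_rate + coprocessor_rate)


def step_time(work, fraction, host_rate, coprocessor_rate):
    """Time for one step when the two devices compute concurrently."""
    host_time = work * (1.0 - fraction) / host_rate
    coprocessor_time = work * fraction / coprocessor_rate
    return max(host_time, coprocessor_time)


# Hypothetical rates in work units per second, for illustration only.
host_rate, coprocessor_rate = 1.0, 1.5
fraction = balanced_offload_fraction(host_rate, coprocessor_rate)

host_only = step_time(100.0, 0.0, host_rate, coprocessor_rate)
balanced = step_time(100.0, fraction, host_rate, coprocessor_rate)
print(f"offload fraction: {fraction:.2f}, speedup: {host_only / balanced:.2f}X")
```

In this idealized model the best achievable speedup is (host rate + coprocessor rate) / host rate. A dynamic balancer would re-estimate the rates from measured timings each step, and data-transfer overhead, which the sketch ignores, would lower the result.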

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision); 32-node cluster benchmark; y-axis: Comparative Performance. Bar labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X; no testing on Tesla

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3; “Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor, for calculations concurrent with the CPU

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU

Results: Up to 1.68X performance improvement on a single node, compared to the baseline configuration, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization. Performance gains hold at 1.47X when scaling up to 32 nodes

LAMMPS Rhodopsin Benchmark 512K Atoms

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

Groups: 1 Node, 32 Nodes. Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package)

Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision); 32-node cluster benchmark; y-axis: Comparative Performance. Bar labels: 1, 1.27X, 1.68X, 1, 1.07X, 1.47X

Workloads shown: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Legend: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available. Check for future availability here

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up (1 node; y-axis: Comparative Increase). Bar labels: 1, 1.87X, 1.59X, 1.08X, 0.88X

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database)

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here

Usage Model: Hybrid MPI + OpenMP using symmetric mode

Highlights: Results are identical to the unmodified run of BWA-ALN

Results: The symmetric process on the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

Legend: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A

Chart: Burrows-Wheeler Aligner (BWA-ALN), Human Genome: BWA-ALN Speed Up (1 node; y-axis: Comparative Performance). Bar labels: 1, 1.24X, 1.86X

Legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3; “Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna.00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80:20 and 59:23:18

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set

Results: Compared to the baseline simulation rate, the speedup with the heterogeneous model (Intel® Xeon® processor E5-2697 v3 plus Intel® Xeon Phi™ coprocessor 7120A) is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization


BLAST: BLASTn v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

APPROVED FOR PUBLIC PRESENTATION

[Chart: BLASTn v30 Speed Up, 1 node. Vertical axis: comparative performance. Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X]

Chart legend:
- 2S Xeon E5-2697 v2 (BLASTn v30 baseline)
- 2S Xeon E5-2697 v2 + Xeon Phi 7120A
- 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized
- 2S Xeon E5-2697 v3 (BLASTn v30 baseline)
- 2S Xeon E5-2697 v3 + Xeon Phi 7120A2
- 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

BLAST: BLASTp v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

[Chart: BLASTp v30 Speed Up, 1 node. Vertical axis: comparative performance. Bar labels: 1, 1.41X, 1.39X, 1.21X, 1.3X, 1.15X]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Legal Information

All products, computer systems, dates, and figures in this document are based on current expectations and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number. Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause them to deviate from published specifications. Contact Intel for currently characterized errata. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the virtualization technology vary depending on hardware and software configuration. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html. Intel® Turbo Boost Technology requires an enabled system. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost. Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as third-party software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English). Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

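Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase. Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported improvement. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English). The information in this document is provided as is and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right. Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.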
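The relative-performance rule in the performance disclaimer (the baseline is assigned 1.0 and every other result is divided by the baseline result) is what produces the "1.24X"-style bar labels throughout the deck. A minimal sketch of that normalization follows. The raw throughput numbers and platform names are hypothetical placeholders picked so the ratios resemble labels seen on the slides; the deck itself does not publish raw results.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results: mapping of platform name -> raw result (higher is better).
    baseline: key of the platform used as the reference.
    """
    base = results[baseline]
    return {name: value / base for name, value in results.items()}


# Hypothetical raw results (arbitrary units), not taken from the slides.
raw = {"baseline": 50.0, "optimized": 62.0, "with coprocessor": 93.0}
for name, ratio in relative_performance(raw, "baseline").items():
    print(f"{name}: {ratio:.2f}X")
```

For benchmarks where lower is better (for example, elapsed time), the same idea applies with the division inverted, which is an adaptation the disclaimer does not spell out.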


Intel® Xeon Phi™ Product Family

- Knights Corner: 1 TFLOPS (1)
- Knights Landing: 3+ TFLOPS (2). Processor (bootable), high-bandwidth on-package memory, integrated interconnect. First commercial systems in 2H'15. >50 systems providers expected (3), plus many more card-based systems. >100 PFLOPS of customer system compute commits to date (3).
- Knights Hill: 3rd-generation Intel® Xeon Phi™ Product Family, 2nd-generation Intel Omni-Path Architecture, 10nm process technology.

Intel® Xeon Phi™ Coprocessor - Applications and Solutions Catalog

1. Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core x 61 cores x 1.23 GHz = 1.208 TeraFLOPS.

2. Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency, and floating-point operations per cycle. FLOPS = cores x clock frequency x floating-point operations per second per cycle.

3. Intel internal estimate.
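Footnote 1's peak figure follows directly from the formula in footnote 2 (cores x clock frequency x floating-point operations per cycle). The short sketch below reproduces it. It assumes the 1.238 GHz clock listed on the 7120P specification slide; with the rounded 1.23 GHz quoted in the footnote the product is about 1.200 TFLOPS. The function and variable names are illustrative only.

```python
def peak_flops(cores, clock_hz, flops_per_cycle_per_core):
    """Theoretical peak: cores x clock frequency x FLOPs per cycle per core."""
    return cores * clock_hz * flops_per_cycle_per_core


# Knights Corner 7120P figures from the slides: 61 cores, 16 DP FLOPS/clock/core.
# The 1.238 GHz clock is taken from the specification slide (an assumption here).
knc_peak = peak_flops(cores=61, clock_hz=1.238e9, flops_per_cycle_per_core=16)
print(f"{knc_peak / 1e12:.3f} TFLOPS")
```

This is a theoretical ceiling; the application results later in the deck are measured speedups and are not derived from it.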

Intel® Omni-Path Architecture (coming 2H'15)

Higher application performance scalability:
- 100 Gbps line speed
- 56% lower latency than InfiniBand (4) (lower is better)

Higher system scalability, from a 48-port switch chip architecture (vs. 36 ports in InfiniBand):
- Small clusters: 1.3x higher port density. Maximize the single-switch investment: 48 ports support up to 12 additional nodes by only adding cables (1).
- Mainstream clusters: up to ½ fewer switches (2).
- Supercomputers: 2.3x more scalable. Over 27k nodes in a 2-tier, 5-hop fabric (3).

www.intel.com/omnipath

1. As compared to a shipping 36-port edge InfiniBand switch.

2. Reduction of up to ½ fewer switches: claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

3. 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

4. Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling and are provided for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.
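The node counts in footnote 3 match the standard capacity of a three-stage fat tree built from identical k-port switches, k^3 / 4, whose longest path crosses five switches. Treating that formula as the model behind the footnote is an assumption (the slide does not state how the counts were derived), but it reproduces both numbers and the 1.3x and 2.3x ratios. The function name is illustrative.

```python
def max_fat_tree_nodes(ports_per_switch):
    """Hosts supported by a three-stage fat tree of identical k-port switches.

    Assumed model: capacity = k^3 / 4, with at most five switch hops end to end.
    """
    k = ports_per_switch
    return k ** 3 // 4


omni_path_nodes = max_fat_tree_nodes(48)   # 48-port switch ASIC
infiniband_nodes = max_fat_tree_nodes(36)  # 36-port switch ASIC
print(omni_path_nodes, infiniband_nodes)
print(f"port density ratio: {48 / 36:.2f}x")
print(f"scalability ratio: {omni_path_nodes / infiniband_nodes:.2f}x")
```

The scalability ratio grows with the cube of the port ratio, which is why a 1.3x increase in ports per switch yields roughly 2.3x more nodes at the same fabric depth.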

Intel® EE for Lustre: Connectivity with Hadoop

Breakdown: open source, Intel proprietary extensions, connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award (April '15, >$200M) extends IA leadership in extreme-scale HPC:
- Aurora, Argonne National Laboratory: >180 PF
- Theta, Argonne National Laboratory: >8.5 PF
- Trinity, NNSA†: >40 PF (July '14)
- Cori, NERSC‡: >30 PF (April '14)

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built: an Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

- >180 PFLOPS (option to increase up to 450 PF)
- >50,000 nodes
- 13 MW
- 2018 delivery
- 18X higher performance†
- >6X more energy efficient†
- Prime contractor: Intel. Subcontractor: Cray.

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PF and 4.8 MW).
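The two dagger claims can be checked against the figures given for Aurora and Mira. The sketch below uses only the numbers on the slide, read as 180 PF and 13 MW for Aurora and 10 PF and 4.8 MW for Mira; the decimal placement in "4.8 MW" is an interpretation of the source text, supported by the fact that it is the reading consistent with the ">6X" claim. Variable names are illustrative.

```python
# Peak performance (PFLOPS) and power (MW) as quoted on the slide.
aurora_pf, aurora_mw = 180.0, 13.0
mira_pf, mira_mw = 10.0, 4.8

performance_ratio = aurora_pf / mira_pf
# Energy efficiency is peak performance per unit of power.
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)

print(f"performance: {performance_ratio:.0f}X")
print(f"energy efficiency: {efficiency_ratio:.2f}X")
```

Both ratios compare theoretical peaks, as the footnote states, so they say nothing about sustained application performance.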

Application support status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being."

Wang Bingqiang, Head of High Performance Computing, BGI

Chart legend:
- Intel® Xeon® processor E5-2697 v2 (baseline)
- Intel® Xeon® processor E5-2697 v2 (optimized)
- Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A
- Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP
- Intel® Xeon® processor E5-2697 v3
- Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 18.7 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A with offload processing on both, using the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X the performance of the NVIDIA K40.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

[Chart: AMBER 14 Particle Mesh Ewald (PME), Tobacco Virus, 1 million atoms, 1 node. Vertical axis: comparative performance. Bar labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X]

Chart legend (series shown for 2 nodes and 3 nodes):
- Intel® Xeon® processor E5-2697 v2 (baseline)
- Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A
- Xeon E5-2697 v2 + NVIDIA K40 DPFP

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 18.7 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement of up to 2.6X over the baseline Intel® Xeon® processor E5-2697 v2.

[Chart: AMBER 14 Particle Mesh Ewald (PME), Cellulose NPT (408K atoms), 3-node cluster benchmark. Vertical axis: comparative performance. Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X]

Chart legend:
- Intel® Xeon® processor E5-2697 v2
- 1 Intel® Xeon Phi™ coprocessor 7120PX
- 2 Intel® Xeon Phi™ coprocessor 7120PX
- Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX
- Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX

Bar label shown with this legend: 1.56X

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1, available here and here. Recipe: available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor together with the host processor, using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

[Chart: GROMACS 512K H2O with RF Speed Up, 1 node. Vertical axis: comparative performance. Bar labels: 1, 1.03X, 1.72X, 1.79X]

Chart legend:
- NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2
- NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2
- NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ Coprocessor 7120A 2

NWChem 6.3rev2 and 6.5 CCSD(T) Method, 32-Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available here and from the SVN repository. Recipe: available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

[Chart: NWChem CCSD(T) Method, 32 nodes, cluster benchmark. Bar labels: 1, 1.24X, 1.52X]

Chart legend (series shown for 1 node, 8 nodes, and 32 nodes; vertical axis from 0 to 30):
- Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN)
- Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN)
- Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T)
- Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T)

NAMD 2.10 Pre-Release STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms), 32 nodes, cluster benchmark. Vertical axis: comparative performance. Bar labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3
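A cluster speedup such as the 32X result mixes two effects: the gain from faster hardware on one node and the gain from adding nodes. A common way to separate them is scaling efficiency, the multi-node speedup divided by the single-node speedup of the same configuration and by the node count. The sketch below shows the arithmetic. The 32X figure comes from the Results text; pairing it with a 2.1X single-node value for the same configuration is an assumption about how the chart's bar labels map to series, which the slide text does not confirm.

```python
def scaling_efficiency(speedup_at_n, speedup_at_1, nodes):
    """Fraction of ideal linear scaling achieved when going from 1 to `nodes` nodes.

    Both speedups must be relative to the same baseline.
    """
    return (speedup_at_n / speedup_at_1) / nodes


# 32X at 32 nodes is stated in the Results text. The 2.1X single-node value
# is an assumed reading of the chart labels for the same configuration.
efficiency = scaling_efficiency(speedup_at_n=32.0, speedup_at_1=2.1, nodes=32)
print(f"{efficiency:.1%}")
```

A value well below 100% is typical for a fixed-size problem like STMV, because per-node work shrinks as nodes are added while communication cost does not.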

Chart legend (series shown for 1 node and 2 nodes):
- Intel® Xeon® processor E5-2697 v3 (baseline: 1 node)
- Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T)


NAMD 2.10 Pre-Release ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN, 2 nodes, cluster benchmark. Baseline: 1 node, 55 PPN. Vertical axis: comparative performance. Bar labels: 1, 1.52X, 1.94X, 2.61X]

Chart legend (series shown for 1 node and 32 nodes):
- 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)
- 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
- 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on
- 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent:
- data transfer between host and coprocessor
- calculations of neighbor-list, non-bond, bond, and long-range terms
The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision), 32 nodes, cluster benchmark. Vertical axis: comparative performance. Bar labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. No testing on Tesla at 32 nodes.]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

LAMMPS Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent:
- data transfer between host and coprocessor
- calculations of neighbor-list, non-bond, bond, and long-range terms
The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

Chart legend (series shown for 1 node and 32 nodes):
- 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)
- 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)
- 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package)

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision), 32 nodes, cluster benchmark. Vertical axis: comparative performance. Bar labels: 1, 1.27X, 1.68X (1 node); 1, 1.07X, 1.47X (32 nodes)]

Chart legend (series shown for workloads ERR161544, SRR034966_1, ERR000589, and SRR002273_1):
- Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40
- Intel® Xeon® processor E5-2697 v3

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: Compared against NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available. Check for future availability here.

Usage Model: Workloads ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1.

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up, 1 node. Vertical axis: comparative increase. Bar labels: 1, 1.87X, 1.59X, 1.08X, 0.88X]

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014

[Chart: BWA-ALN Speed Up, 1 node. Vertical axis: comparative performance.]

Chart legend:
- 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN)
- 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN)
- 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

Burrows-Wheeler Aligner (BWA-ALN) Human Genome

For configuration details go here

1

124X

186X

Other names and brands may be claimed as the property of others

[Chart legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide database volume set from the National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, giving sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
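The "sweet spot" in this load-sharing model can be illustrated with an idealized calculation: if each device processes queries at a steady rate, the best static split is the one where both devices finish at about the same time. The Python sketch below is only an illustration under that assumption. The function name `best_split` and the throughput figures are hypothetical and are not taken from the Intel measurements.

```python
def best_split(total_queries, host_rate, coproc_rate):
    """Return (host_share, coproc_share, makespan) minimizing finish time.

    Rates are in queries per second. Both devices are assumed to run
    concurrently at a constant rate, which ignores transfer and
    output-formatting overheads.
    """
    best = None
    for host_share in range(total_queries + 1):
        coproc_share = total_queries - host_share
        makespan = max(host_share / host_rate, coproc_share / coproc_rate)
        if best is None or makespan < best[2]:
            best = (host_share, coproc_share, makespan)
    return best


# Hypothetical rates: the host is assumed to be 4x faster than the coprocessor.
host_share, coproc_share, makespan = best_split(100, host_rate=4.0, coproc_rate=1.0)
host_only_time = 100 / 4.0
print(host_share, coproc_share, host_only_time / makespan)
```

With these made-up rates the split lands on 80/20 and the idealized speedup over a host-only run is 1.25X. Moving even one query to either side lengthens the slower device's run, which is one way to read the slide's remark that the sweet spot is small.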

BLAST: BLASTn v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

[Chart: BLASTn v30 Speed Up (1 node). Y-axis: Comparative Performance. Data labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of a protein query sequence against a known protein database volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, giving sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

[Chart: BLASTp v30 Speed Up (1 node). Y-axis: Comparative Performance. Legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Data labels: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of virtualization technology will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

A system with Intel® Turbo Boost Technology capability is required. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

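Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information and evaluate performance comprehensively. For more information on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of the third-party benchmarks or web sites referenced in this document. Intel recommends visiting the referenced web sites, or others where similar performance benchmarks are reported, to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests such as SYSmark and MobileMark are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, and evaluate performance comprehensively.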
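The relative-performance rule in this disclaimer (assign 1.0 to the baseline and divide each platform's result by the baseline result) can be sketched in a few lines of Python. The function name, the `lower_is_better` switch for time-based results, and the elapsed times below are assumptions made for illustration. They are not Intel's measurements.

```python
def relative_performance(results, baseline, lower_is_better=False):
    """Normalize benchmark results so that the baseline platform scores 1.0.

    For score-like results (higher is better) each value is divided by the
    baseline value. For elapsed times (lower is better) the ratio is inverted
    so that a faster run still yields a number above 1.0.
    """
    base = results[baseline]
    relative = {}
    for name, value in results.items():
        relative[name] = base / value if lower_is_better else value / base
    return relative


# Hypothetical elapsed times in seconds, for illustration only.
elapsed_seconds = {
    "baseline": 186.0,
    "optimized": 150.0,
    "optimized + coprocessor": 100.0,
}
rel = relative_performance(elapsed_seconds, "baseline", lower_is_better=True)
for name, value in rel.items():
    print(f"{name}: {value:.2f}X")
```

Speedup labels of the "1.24X" and "1.86X" kind on the benchmark slides are ratios of this form.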


Intel® Xeon Phi™ Product Family

- Knights Corner: 1 TFLOPS¹
- Knights Landing: 3+ TFLOPS²; processor (bootable); high-bandwidth on-package memory; integrated interconnect; first commercial systems in 2H'15
- Knights Hill: 3rd-generation Intel® Xeon Phi™ product family; 2nd-generation Intel® Omni-Path Architecture; 10 nm process technology
- >50 systems providers expected³, plus many more card-based systems
- >100 PFLOPS of customer system compute commits to date³
- Intel® Xeon Phi™ Coprocessor: Applications and Solutions Catalog

1 Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS.

2 Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency, and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle.

3 Intel internal estimate.
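The peak-performance formula in footnotes 1 and 2 can be checked with a short calculation. The Python sketch below is illustrative only. The function name `peak_tflops` is hypothetical, the 1.238 GHz clock is the 7120P figure from the coprocessor specification slide (the footnote rounds it to 1.23 GHz), and reading 16 DP FLOPS per clock as 8 double-precision lanes times 2 for a fused multiply-add is the usual interpretation of a 512-bit vector unit, stated here as an assumption.

```python
def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# A 512-bit vector holds 8 doubles; a fused multiply-add counts as 2 FLOPs.
dp_lanes = 512 // 64
flops_per_cycle = dp_lanes * 2  # 16 DP FLOPS/clock/core

knights_corner = peak_tflops(cores=61, clock_ghz=1.238, flops_per_cycle=flops_per_cycle)
print(f"{knights_corner:.3f} TFLOPS")
```

With the rounded 1.23 GHz clock the same formula gives about 1.200 TFLOPS, so the 1.208 figure in the footnote corresponds to the unrounded 1.238 GHz clock.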

Intel® Omni-Path Architecture: Coming 2H'15

- 100 Gbps line speed
- 48-port switch chip architecture (vs. 36 in InfiniBand)
- 56% lower latency⁴ than InfiniBand (lower is better)
- High system scalability and high application performance scalability
- Small clusters: 1.3x higher port density. Maximize single-switch investment: 48 ports support up to 12 additional nodes by only adding cables¹
- Mainstream clusters: up to ½ fewer switches²
- Supercomputers: 2.3x more scalable, with over 27k nodes in a 2-tier, 5-hop fabric³

www.intel.com/omnipath

1 As compared to a shipping 36-port edge InfiniBand switch.

2 Reduction of up to ½ fewer switches: claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

3 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

4 Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling and are provided to you for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.
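The node counts quoted in footnote 3 are consistent with the textbook capacity of a full-bisection fat tree, 2 × (k/2)^L end nodes for k-port switches arranged in L switch levels. The Python sketch below assumes L = 3, which gives a longest path of five switch hops. That reading of the slide's fabric, and the function name `fat_tree_max_nodes`, are assumptions of this sketch and not something the slide states.

```python
def fat_tree_max_nodes(ports, switch_levels):
    """Maximum end nodes in a full-bisection fat tree built from identical switches.

    Every switch below the top level splits its ports evenly between
    downlinks and uplinks, so capacity is 2 * (ports / 2) ** switch_levels.
    """
    half = ports // 2
    return 2 * half ** switch_levels


omni_path_nodes = fat_tree_max_nodes(ports=48, switch_levels=3)
infiniband_nodes = fat_tree_max_nodes(ports=36, switch_levels=3)
print(omni_path_nodes, infiniband_nodes, round(omni_path_nodes / infiniband_nodes, 2))

# Single-switch comparison: extra nodes and port-density ratio.
print(48 - 36, round(48 / 36, 2))
```

Under these assumptions the formula reproduces 27,648 and 11,664 nodes, a ratio of about 2.37 (quoted as 2.3X), and the 12 additional nodes and roughly 1.3x port density of a single 48-port switch.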

Intel® EE for Lustre: Connectivity with Hadoop

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award (>$200M) extends IA leadership in extreme-scale HPC

- Aurora, Argonne National Laboratory: >180 PF (April '15)
- Theta, Argonne National Laboratory: >8.5 PF
- Trinity, NNSA†: >40 PF (July '14)
- Cori, NERSC‡: >30 PF (April '14)

† Cray XC Series at National Nuclear Security Administration (NNSA)
‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC)

The Most Advanced Supercomputer Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

- >180 PFLOPS (option to increase up to 450 PF)
- >50,000 nodes
- 13 MW
- 2018 delivery
- 18X higher performance†
- >6X more energy efficient†
- Prime contractor: Intel. Subcontractor: Cray.

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
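The two daggered claims follow from the figures on the slide: the performance ratio is peak PFLOPS over peak PFLOPS, and the efficiency ratio compares PFLOPS per megawatt. A minimal Python check, using the slide's lower-bound value for Aurora (the variable names are illustrative):

```python
# Figures from the slide: Aurora >180 PF at 13 MW, MIRA 10 PF at 4.8 MW.
aurora_pf, aurora_mw = 180.0, 13.0
mira_pf, mira_mw = 10.0, 4.8

performance_ratio = aurora_pf / mira_pf
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)

print(f"{performance_ratio:.0f}X performance, {efficiency_ratio:.2f}X PFLOPS per MW")
```

Because 180 PF is stated as a lower bound, both ratios are lower bounds as well, which matches the slide's "18X" and ">6X" wording.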

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being."

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 plus the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and made available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel® Xeon® processor E5-2697 v3 with Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to the NVIDIA K40.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: AMBER 14 PME, Tobacco Virus, 1 Million Atoms (1 node). Y-axis: Comparative Performance. Legend: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A. Data labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.]

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2 host only (also measured at http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms: 50% of the workload on the host, 50% on the coprocessor.

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders) as an update patch available during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: AMBER PME Cellulose NPT (408K Atoms), 3-node cluster benchmark; groups: 2 nodes, 3 nodes. Y-axis: Comparative Performance. Legend: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP. Data labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with the RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1 available here and here. Recipe: available here.

Highlights:
- Highly optimized for Intel® Xeon® processors (AVX intrinsics).
- Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor, using a symmetric model.
- Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

[Chart: GROMACS 512K H2O with RF Speed Up (1 node). Y-axis: Comparative Performance. Legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120PX; 2 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX. Data labels: 1.56X, 1.79X, 1, 1.03X, 1.72X.]

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available here and from the SVN repository. Recipe: available here.

Usage Model: Offload using LEO (Language Extensions for Offload) and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application, including at cluster scale, for the NWChem community.

Results: Compared to the baseline of NWChem 6.3 rev2 on the Intel® Xeon® processor E5-2697 v2:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

[Chart: NWChem 6.3 rev2 and 6.5, CCSD(T) Method, 32 Node Speed Up (cluster benchmark). Y-axis: Comparative Performance. Series and data labels: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2, 1; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2, 1.24X; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ Coprocessor 7120A 2, 1.52X.]

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms), 32-node cluster benchmark; groups: 1 Node, 8 Nodes, 32 Nodes. Y-axis: Comparative Performance (0 to 30). Legend: Intel® Xeon® processor E5-2697 v2 (baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Data labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN, 2-node cluster benchmark; groups: 1 Node, 2 Nodes. Y-axis: Comparative Performance. Legend: Intel® Xeon® processor E5-2697 v3 (baseline, 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T). Data labels: 1, 1.52X, 1.94X, 2.61X.]

LAMMPS Stillinger-Weber Water Benchmark

[Chart legend (groups: 1 Node, 32 Nodes): 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).]

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights:
- Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A.
- Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms.
- The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup at higher node counts.
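The idea behind dynamic load balancing between host and coprocessor can be illustrated with a toy feedback loop: measure how long each side takes on its share, then shift work toward the side that finished first. The Python sketch below is a generic illustration and is not the balancer used in the LAMMPS Intel package. The function name, the gain, and the processing rates are all hypothetical.

```python
def balance_offload_fraction(host_rate, coproc_rate, steps=50, gain=0.25):
    """Iteratively adjust the fraction of work sent to the coprocessor.

    Each step "measures" how long the host and the coprocessor take on
    their shares and moves work toward whichever side finished first.
    Rates are work units per second (hypothetical values).
    """
    fraction = 0.5  # start with an even split
    for _ in range(steps):
        host_time = (1.0 - fraction) / host_rate
        coproc_time = fraction / coproc_rate
        # A positive imbalance means the host is the bottleneck.
        imbalance = (host_time - coproc_time) / (host_time + coproc_time)
        fraction = min(1.0, max(0.0, fraction + gain * imbalance))
    return fraction


# Hypothetical rates: the coprocessor is assumed to be half as fast as the host.
fraction = balance_offload_fraction(host_rate=1.0, coproc_rate=0.5)
print(round(fraction, 3))
```

With these rates the loop settles near one third of the work on the coprocessor, the point at which both sides finish together. A real balancer would work from timings measured during the run instead of fixed rates.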

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision), 32-node cluster benchmark. Y-axis: Comparative Performance. Data labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. Note on chart: no testing on Tesla.]

LAMMPS Rhodopsin Benchmark: 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights:
- Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A.
- Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms.
- The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement on a single node, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision), 32-node cluster benchmark; groups: 1 Node, 32 Nodes. Y-axis: Comparative Performance. Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Data labels: 1, 1.27X, 1.68X, 1, 1.07X, 1.47X.]

Johns Hopkins Bowtie 2: Multiple Workloads

[Chart legend (workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1): Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3.]

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available; check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1, and ERR024139_1

Highlights: See more here.

Results Bowtie2 running on the Intelreg Xeonreg processor E5-2697 v3 with Intelreg AVX2 port faster than NVBowtie running on the Intelreg Xeonreg processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads NVIDIA published data of K40 compared to Intelreg Xeonreg processor E5-2600 (6 cores) on one workload


SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2015

[Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up, 1 node. Y-axis: Comparative Increase. Bar labels as extracted: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Application: Burrows-Wheeler Aligner (BWA) version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available. Recipe: available.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 together with the Intel® Xeon Phi™ coprocessor, running in symmetric mode, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

[Figure: BWA-ALN Speed Up, 1 node. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A.]


Burrows-Wheeler Aligner (BWA-ALN) Human Genome

[Bar labels for the BWA-ALN chart: 1 (baseline), 1.24X (optimized), 1.86X (with coprocessor).]

[Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide db volume set from the National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available. Recipe: available.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the heterogeneous model (Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A) is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.


BLAST BLASTn v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

[Figure: BLASTn v30 Speed Up, 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

[Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]

BLAST BLASTp v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available. Recipe: available.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the heterogeneous model (Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A) is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

[Figure: BLASTp v30 Speed Up, 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3; "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

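Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number. Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata. Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of Virtualization Technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html. Intel® Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost. Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English). Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.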

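Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, you should consult other sources of information to evaluate performance comprehensively. For more information on the performance of Intel products, see http://www.intel.com/performance. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages you to visit the referenced Web sites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase. Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported performance improvement. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English). The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right. Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, you should consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance comprehensively.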
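The relative-performance rule in the disclaimer above (the baseline platform is assigned 1.0 and every other result is divided by the baseline result) can be sketched in a few lines. This is a minimal illustration only: the function name `relative_performance` and the throughput numbers are hypothetical and are not taken from the slides.

```python
def relative_performance(results, baseline):
    """Divide each platform's result by the baseline platform's result."""
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical throughput figures (higher is better); not measured data.
throughput = {
    "baseline": 50.0,
    "optimized": 62.0,
    "optimized + coprocessor": 93.0,
}

relative = relative_performance(throughput, "baseline")
for name, ratio in relative.items():
    print(f"{name}: {ratio:.2f}X")
```

The baseline always comes out as 1.00X, which is why each chart in this deck shows one bar labelled "1".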



Intel® Xeon Phi™ Coprocessor 7120P

61 cores, 244 threads, 1.238 GHz

1.21 TFLOPS (peak double-precision floating-point performance)

512-bit SIMD instructions

32 vector registers per thread

16 GB GDDR5 memory, 352 GB/s

300 W (passive cooling)

PCIe x16 (requires an IA host processor)

22 nm with the world's first 3-D Tri-Gate transistors

Linux operating system, IP addressable

Common x86/IA programming models and SW tools

www.intel.com/xeonphi

Is Xeon Phi† performance compelling vs. Xeon† E5 v2? "2-socket Xeon E5 v2 system" vs. "2-socket Xeon E5 v2 system + Xeon Phi 7120"

http://www.intel.com/performance

† Xeon = Intel® Xeon® processor; † Xeon Phi = Intel® Xeon Phi™ coprocessor

Xeon Phi delivers up to 165 higher performance (with 1 card) versus 2-socket Xeon E5v2

Intel® Xeon Phi™ Product Family

Knights Corner: 1 TFLOPS (1)

Knights Landing: 3+ TFLOPS (2); first commercial systems in 2H'15

- Processor (bootable)
- High-bandwidth on-package memory
- Integrated interconnect

>50 systems providers expected (3), plus many more card-based systems

>100 PFLOPS of customer system compute commits to date (3)

Knights Hill: 3rd-generation Intel® Xeon Phi™ product family; 2nd-generation Intel® Omni-Path Architecture; 10 nm process technology

Intel® Xeon Phi™ Coprocessor - Applications and Solutions Catalog

1 Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS.

2 Over 3 teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency, and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle.

3 Intel internal estimate.
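The peak-performance formula in footnotes 1 and 2 (cores × clock frequency × floating-point operations per clock per core) can be checked with a short calculation. The sketch below assumes that the extracted "1238 GHz" on the 7120P spec slide means 1.238 GHz and that the footnote's "123GHz" means 1.23 GHz; the function name `peak_dp_tflops` is hypothetical.

```python
def peak_dp_tflops(cores, clock_ghz, dp_flops_per_clock_per_core):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per clock per core."""
    return cores * clock_ghz * dp_flops_per_clock_per_core / 1000.0


# Clock as read from the 7120P spec slide (assumed to be 1.238 GHz).
spec_clock = peak_dp_tflops(61, 1.238, 16)
# Clock as printed in footnote 1 (assumed to be 1.23 GHz).
footnote_clock = peak_dp_tflops(61, 1.23, 16)

print(f"{spec_clock:.3f} TFLOPS with 1.238 GHz")
print(f"{footnote_clock:.3f} TFLOPS with 1.23 GHz")
```

Under these assumptions, the 1.238 GHz clock reproduces the footnote's 1.208 TeraFLOPS, while the truncated 1.23 GHz gives about 1.200, so the footnote's clock value appears to have been rounded down.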

Intel® Omni-Path Architecture: Coming 2H'15

100 Gbps line speed

56% lower latency (4) (lower is better; compared with InfiniBand)

48-port switch chip architecture (vs. 36 in InfiniBand)

Higher system scalability; higher application performance scalability

Small clusters: higher port density, 1.3x. Maximize single-switch investment: 48 ports support up to 12 additional nodes by only adding cables (1).

Mainstream clusters: fewer switches (2), up to ½.

Supercomputers: more scalable, 2.3x. Over 27k nodes in a 2-tier, 5-hop fabric (3).

www.intel.com/omnipath

1 As compared to a shipping 36-port edge InfiniBand switch.

2 Reduction of up to ½ fewer switches: claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

3 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

4 Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches, compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software, or configuration may affect your actual performance.
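The node counts in footnote 3 can be reproduced with a short calculation. The sketch below assumes the standard capacity formula for a three-level fat tree built from identical switches, radix³ / 4 end nodes; the slide does not state this formula, but it matches both 27,648 (48-port) and 11,664 (36-port). The function name `fat_tree_max_nodes` is hypothetical.

```python
def fat_tree_max_nodes(radix):
    """Maximum end nodes of a three-level fat tree built from radix-port switches.

    Assumed formula (not stated on the slide): radix**3 / 4.
    """
    return radix ** 3 // 4


opa_nodes = fat_tree_max_nodes(48)   # 48-port Omni-Path switch ASIC
ib_nodes = fat_tree_max_nodes(36)    # 36-port InfiniBand switch chip

print(opa_nodes, ib_nodes)
print(f"scalability ratio: {opa_nodes / ib_nodes:.2f}x")
print(f"port-density ratio: {48 / 36:.2f}x")
```

Under that assumption the ratio is about 2.37x, in line with the slide's 2.3x claim, and the 48/36 port ratio of about 1.33x matches the 1.3x port-density figure.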

Intel® EE for Lustre: Connectivity with Hadoop

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme scale HPC

Aurora, Argonne National Laboratory: >180 PF (April '15)

Theta, Argonne National Laboratory: >8.5 PF

Trinity, NNSA†: >40 PF (July '14)

Cori, NERSC‡: >30 PF (April '14)

>$200M

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC); † Cray XC Series at National Nuclear Security Administration (NNSA)

The Most Advanced Supercomputer Ever Built: An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
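The two daggered claims can be derived from the figures on this slide. The sketch below assumes that the extracted "48MW" for MIRA means 4.8 MW (the decimal point appears to have been lost in extraction) and uses the lower bounds given for Aurora, 180 PF and 13 MW.

```python
# Figures as read from the slide; the MIRA power value assumes "48MW" means 4.8 MW.
aurora_pf, aurora_mw = 180.0, 13.0
mira_pf, mira_mw = 10.0, 4.8

performance_ratio = aurora_pf / mira_pf
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)

print(f"{performance_ratio:.0f}X peak performance")
print(f"{efficiency_ratio:.2f}X peak PF per MW")
```

With these inputs the performance ratio is exactly 18X and the energy-efficiency ratio is about 6.65X, consistent with the slide's ">6X".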

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being"

Wang Bingqiang, Head of High Performance Computing, BGI

[Figure legend: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A.]


SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 with the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel® Xeon® processor E5-2697 v3 with Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3
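The comparison against the NVIDIA K40 in the Results line is a ratio of two relative speedups. The sketch below reads the extracted bar labels "241X" and "226X" as 2.41X and 2.26X, and assumes that the 2.26X bar belongs to the "Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP" series; both readings are assumptions, made because they reproduce the stated figure.

```python
# Relative speedups over the E5-2697 v2 baseline, as read from the chart labels.
phi_speedup = 2.41   # "241X": E5-2697 v3 (optimized) + Xeon Phi 7120A offload
k40_speedup = 2.26   # "226X": assumed to be the E5-2697 v2 (optimized) + K40 bar

# Dividing two speedups over the same baseline compares the configurations directly.
phi_vs_k40 = phi_speedup / k40_speedup
print(f"{phi_vs_k40:.2f}X")
```

The quotient rounds to 1.07X, which matches the Results line and supports reading the three-digit labels on these charts as values with a lost decimal point.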

AMBER 14 Particle Mesh Ewald (PME) Tobacco Virus

[Figure: AMBER 14 PME Tobacco Virus, 1 Million Atoms; 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.]

[Figure legend: x-axis 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP.]

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2


SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). The speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders), available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

AMBER 14 Particle Mesh Ewald (PME) Cellulose NPT

[Figure: AMBER PME Cellulose NPT (408K Atoms), 3-node cluster benchmark. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

[Figure legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar label extracted with the legend: 1.56X.]

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1 is available. Recipe: available.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

[Bar label: 1.79X. 1 node.]

SOURCE INTEL MEASURED RESULTS AS OF APRIL 2014


[Figure: GROMACS 512K H2O with RF Speed Up. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.03X, 1.72X.]


[Figure legend: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ Coprocessor 7120A 2. Y-axis: Comparative Performance.]

NWChem 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available, including from the SVN repository. Recipe: available.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

SOURCE INTEL MEASURED RESULTS AS OF JULY 2014

NWChem CCSD(T) Method

[Figure: 32-node cluster benchmark. Bar labels: 1, 1.24X, 1.52X.]

[Figure legend: y-axis ticks 0 to 30; x-axis 1 Node, 8 Nodes, 32 Nodes. Series: Intel® Xeon® processor E5-2697 v2 (Baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).]

NAMD 210 Pre-Release STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).


SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); 32-node cluster benchmark. Y-axis: Comparative Performance. Bar labels as extracted: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

[Figure legend: x-axis 1 Node, 2 Nodes. Series: Intel® Xeon® processor E5-2697 v3 (Baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).]


NAMD 210 Pre-Release ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; 2-node cluster benchmark. Y-axis: Comparative Performance. Baseline: 1 node, 55 PPN. Bar labels as extracted: 1, 1.52X, 1.94X, 2.61X.]

[Figure legend: y-axis ticks 0 to 3; x-axis 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).]

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup at higher node counts.

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

[Figure: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision), 32-node cluster benchmark. Y-axis: Comparative Performance. Bar labels as extracted: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X; one bar position is marked "No testing on Tesla".]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

LAMMPS Rhodopsin Benchmark: 512K Atoms

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision); 32-node cluster benchmark. Y-axis: Comparative Performance. X-axis: 1 Node, 32 Nodes.

Figure legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package).

Bar labels as extracted: 1, 1.27X, 1.68X (1 node); 1, 1.07X, 1.47X (32 nodes).

Figure legend. Workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3.

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available; check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here

Results: Bowtie2 with the Intel® AVX2 port running on the Intel® Xeon® processor E5-2697 v3 is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 with the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.


SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up; 1 node. Y-axis: Comparative Increase. Bar labels as extracted: 1, 1.87X, 1.59X, 1.08X, 0.88X.

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, running as a symmetric process, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014

Figure: BWA-ALN Speed Up; 1 node. Y-axis: Comparative Performance.

Figure legend: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A.


Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Bar labels as extracted: 1, 1.24X, 1.86X.

Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times, with the pick of queries randomized, for a sweet spot split 8020 and 592318.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
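The "sweet spot" in the load-sharing model above can be illustrated with a toy makespan model. This is a minimal sketch under stated assumptions, not the method used for the measurements: it assumes each device processes queries at a constant rate, that the two devices run concurrently, and that the run ends when the slower side finishes. The function names and the rates are hypothetical and are not taken from the slide.

```python
def makespan(n_queries, host_rate, copro_rate, host_fraction):
    """Wall time when a fraction of the queries goes to the host and the rest
    to the coprocessor, with both devices working concurrently."""
    host_time = host_fraction * n_queries / host_rate
    copro_time = (1.0 - host_fraction) * n_queries / copro_rate
    return max(host_time, copro_time)


def best_host_fraction(host_rate, copro_rate):
    """Split at which both devices finish at the same time."""
    return host_rate / (host_rate + copro_rate)


# Hypothetical rates (queries per second), chosen only for illustration.
HOST_RATE, COPRO_RATE, N = 4.0, 1.0, 100

f_star = best_host_fraction(HOST_RATE, COPRO_RATE)
t_host_only = N / HOST_RATE
t_best = makespan(N, HOST_RATE, COPRO_RATE, f_star)
t_even = makespan(N, HOST_RATE, COPRO_RATE, 0.5)

print(f"best host fraction: {f_star:.2f}")
print(f"speedup at best split: {t_host_only / t_best:.2f}X")
print(f"speedup at 50/50 split: {t_host_only / t_even:.2f}X")
```

In this model a badly chosen split is slower than running on the host alone, which is one way to read the remark that the sweet spot is small.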


BLAST: BLASTn v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

Figure: BLASTn v30 Speed Up; 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.

Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.

BLAST: BLASTp v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015


Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the split that gives the maximum speedup (the sweet spot). The experiment was repeated 20 times, with the pick of queries randomized, for a sweet spot split 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Figure: BLASTp v30 Speed Up; 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support it. Functionality, performance, and other benefits of the technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

A system that supports Intel® Turbo Boost Technology is required. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.


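Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance as a whole. For more information on performance ratings of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is". No license to any intellectual property rights is granted, express or implied, by estoppel or otherwise. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.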
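The relative-performance definition in the disclaimer above (the baseline is assigned 1.0 and every other result is divided by the baseline result) is what produces the "X" bar labels throughout this deck. A minimal sketch of that normalization follows; the function name and the raw scores are hypothetical and are not measurements from these slides.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark scores so the baseline platform scores 1.0.

    `results` maps a platform name to a raw score where higher is better
    (for example, simulation steps per second).
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline score must be positive")
    return {name: score / base for name, score in results.items()}


# Hypothetical raw scores, chosen only to illustrate the arithmetic.
raw = {
    "cpu_baseline": 200.0,
    "cpu_optimized": 250.0,
    "cpu_plus_coprocessor": 340.0,
}
rel = relative_performance(raw, "cpu_baseline")
for name, value in rel.items():
    print(f"{name}: {value:.2f}X")
```

For a benchmark that reports elapsed time instead of a rate, the division would be inverted (baseline time divided by platform time) so that larger numbers still mean faster.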


Intel® Xeon Phi™ Product Family

Knights Corner: 1 TFLOPS [1]. See the Intel® Xeon Phi™ Coprocessor – Applications and Solutions Catalog.

Knights Landing: 3+ TFLOPS [2]. Processor (bootable), high-bandwidth on-package memory, integrated interconnect. First commercial systems in 2H'15. >50 systems providers expected [3], plus many more card-based systems. >100 PFLOPS of customer system compute commits to date [3].

Knights Hill: 3rd-generation Intel® Xeon Phi™ Product Family. 2nd-generation Intel Omni-Path Architecture. 10nm process technology.

[1] Claim based on calculated theoretical peak double precision performance capability for a single coprocessor. 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS.

[2] Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle.

[3] Intel internal estimate.
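The peak-performance formula in footnotes 1 and 2 (cores × clock frequency × floating-point operations per cycle) can be checked with a few lines of arithmetic. The sketch below is illustrative only; the function name is hypothetical. It evaluates the formula twice: once with the clock as printed in footnote 1 (read here as 1.23 GHz, an assumption because the extraction dropped decimal points) and once with the 1.238 GHz clock listed on the 7120P specification slide earlier in this deck.

```python
def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak in TFLOPS: cores x clock (GHz) x FLOPs per cycle per core."""
    return cores * clock_ghz * flops_per_cycle / 1000.0


# 61 cores and 16 double-precision FLOPs per clock per core, as in footnote 1.
footnote_clock = peak_tflops(61, 1.23, 16)
spec_clock = peak_tflops(61, 1.238, 16)

print(f"with 1.23 GHz:  {footnote_clock:.3f} TFLOPS")
print(f"with 1.238 GHz: {spec_clock:.3f} TFLOPS")
```

Under these assumptions, the quoted 1.208 TeraFLOPS corresponds to the 1.238 GHz clock, while 1.23 GHz gives about 1.200.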

Intel® Omni-Path Architecture: Coming 2H'15

100 Gbps line speed

48-port switch chip architecture

High system scalability; high application-performance scalability

56% lower latency than InfiniBand [4] (lower is better)

1.3x higher port density: 48 ports vs. 36 in InfiniBand

Small clusters: maximize the single-switch investment. 48 ports support up to 12 additional nodes by only adding cables [1]

Mainstream clusters: up to ½ fewer switches [2]

Supercomputers: 2.3x more scalable. Over 27k nodes in a 2-tier, 5-hop fabric [3]

www.intel.com/omnipath

[1] As compared to a shipping 36-port edge InfiniBand switch.

[2] Reduction of up to ½ fewer switches claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

[3] 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

[4] Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
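The node counts in the footnotes above (27,648 for 48-port switch ASICs, 11,664 for 36-port ASICs) coincide with the textbook capacity of a three-level fat tree built from k-port switches, k³/4 hosts. That correspondence is an observation made here, not something the slide states, so the sketch below should be read as a plausibility check under that assumption; the function name is hypothetical.

```python
def three_level_fat_tree_hosts(ports):
    """Maximum host count of a non-blocking three-level fat tree built from
    identical switches with `ports` ports each (k**3 / 4 for even k)."""
    if ports <= 0 or ports % 2:
        raise ValueError("port count must be a positive even number")
    return ports ** 3 // 4


opa_nodes = three_level_fat_tree_hosts(48)  # 48-port switch ASIC
ib_nodes = three_level_fat_tree_hosts(36)   # 36-port switch ASIC

scale_ratio = opa_nodes / ib_nodes  # compare with the quoted 2.3x
port_ratio = 48 / 36                # compare with the quoted 1.3x
extra_ports = 48 - 36               # compare with "up to 12 additional nodes"

print(opa_nodes, ib_nodes)
print(f"scalability ratio: {scale_ratio:.2f}")
print(f"port density ratio: {port_ratio:.2f}, extra ports per switch: {extra_ports}")
```

The port-density and additional-node figures follow directly from 48 versus 36 ports per switch and need no topology assumption.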

Intel® EE for Lustre: Connectivity with Hadoop

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award (>$200M) extends IA leadership in extreme-scale HPC

Aurora, Argonne National Laboratory: >180 PF (April '15)

Theta, Argonne National Laboratory: >8.5 PF

Trinity, NNSA†: >40 PF (July '14)

Cori, NERSC‡: >30 PF (April '14)

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built: An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
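The two comparison figures on this slide can be reproduced from the numbers it quotes. The sketch below is a plain arithmetic check, not Intel's or ANL's calculation. It assumes Mira's figures read as 10 PF and 4.8 MW (the extracted text lost its decimal points) and uses the lower bound of 180 PF at 13 MW for Aurora; the variable names are hypothetical.

```python
# Figures quoted on the slide (peak double precision PFLOPS, power in MW).
aurora_pf, aurora_mw = 180.0, 13.0
mira_pf, mira_mw = 10.0, 4.8  # assumed reading of "10PFs and 48MW"

# Performance ratio: peak FLOPS of Aurora over peak FLOPS of Mira.
perf_ratio = aurora_pf / mira_pf

# Energy-efficiency ratio: PFLOPS per MW of Aurora over PFLOPS per MW of Mira.
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)

print(f"performance: {perf_ratio:.0f}X")
print(f"energy efficiency: {efficiency_ratio:.2f}X")
```

With these inputs the performance ratio is 18 and the efficiency ratio is a little above 6.6, consistent with the "18X" and ">6X" claims.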

Application Readiness: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being"

Wang Bingqiang, Head of High Performance Computing, BGI

Figure legend: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A.


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 with the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus

Figure: AMBER 14 PME Tobacco Virus, 1 Million Atoms; 1 node. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.

Figure legend (x-axis: 2 nodes, 3 nodes): Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP.

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available here (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). The speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

Figure: AMBER PME Cellulose NPT (408K Atoms); 3-node cluster benchmark. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.

GROMACS: 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with the RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1 available here and here. Recipe: available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

Figure legend (1 node): Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar labels as extracted: 1.56X, 1.79X.


Figure: GROMACS 512K H2O with RF Speed Up. Y-axis: Comparative Performance. Bar labels as extracted: 1, 1.03X, 1.72X.


Figure legend: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A. Y-axis: Comparative Performance.

NWChem: CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available here and from the SVN repository. Recipe: available here.

Usage Model: Offload using LEO and OpenMP

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, on a single node and at cluster scale.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

Figure: NWChem 6.3rev2 and 6.5 CCSD(T) Method, 32 Node Speed Up; 32-node cluster benchmark. Bar labels as extracted: 1, 1.24X, 1.52X.

Figure legend (y-axis range 0 to 30; x-axis: 1 Node, 8 Nodes, 32 Nodes): Intel® Xeon® processor E5-2697 v2 (Baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).


SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); 32-node cluster benchmark. Y-axis: Comparative Performance. Bar labels as extracted (decimal points not preserved): 1, 2X, 68X, 122X, 272X, 12X, 21X, 79X, 131X, 20X, 32X, 242X.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

Figure legend (x-axis: 1 Node, 2 Nodes): Intel® Xeon® processor E5-2697 v3 (Baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).


NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available here.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; 2-node cluster benchmark. Y-axis: Comparative Performance. Baseline: 1 node, 55 PPN. Bar labels as extracted: 1, 1.52X, 1.94X, 2.61X.


Figure legend (y-axis range 0 to 3; x-axis: 1 Node, 32 Nodes): 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).

LAMMPS Stillinger-Weber Water Benchmark

Application LAMMPS

Description Simulation of molecular systems with classical models More at httplammpssandiagov

Availability Code In main LAMMPS repository Recipe Available here

Usage Model Load balancer offloads part of neighbor-list and non-bond force calculations to Intelreg Xeon Phitrade coprocessor for concurrent calculations with CPU

Highlights Improved results with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A Dynamic load balancing allows for concurrent Data transfer between host and coprocessor Calculations of neighbor-list non-bond bond and long-

range terms Same routines in LAMMPS Intel Package also run faster on CPU

Results Simulation rate increase with Intelreg Package is up to 36X Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision)

Co

mp

ara

tiv

e P

erf

orm

an

ce

For configuration details go here

1

3X

341X

1

305X

36X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v3rdquo = Intelreg Xeonreg processor E5-2697 v3

ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

09X

APPROVED FOR PUBLIC PRESENTATION

No testing

on Tesla

NEW

32 NODES CLUSTER BENCHMARK

LAMMPS Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available.

Usage Model: The load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement on a single node, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014. 32 NODES, CLUSTER BENCHMARK.

[Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision). Y-axis: Comparative Performance. X-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Bar labels as extracted: 1, 1.27X, 1.68X, 1, 1.07X, 1.47X.]

Johns Hopkins Bowtie 2: Multiple workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available. Recipe: not available; check for future availability.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 and ERR024139_1

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015. 1 NODE.

[Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up. Y-axis: Comparative Increase. X-axis workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Bar labels as extracted: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available. Recipe: available.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, running as a symmetric process, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014. 1 NODE.

[Figure: BWA-ALN Speed Up. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A. Bar labels as extracted: 1, 1.24X, 1.86X.]

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available. Recipe: available.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split of 8020 and 592318.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015. 1 NODE.

[Figure: BLASTn v30 Speed Up. Y-axis: Comparative Performance. Series as extracted: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels as extracted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in a protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available. Recipe: available.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split of 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015. 1 NODE.

[Figure: BLASTp v30 Speed Up. Y-axis: Comparative Performance. Series as extracted: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels as extracted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

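Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Intel® Turbo Boost Technology requires a system with Intel® Turbo Boost Technology capability. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.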

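Legal Disclaimer: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Any difference in the design or configuration of system hardware or software may cause actual performance to differ from the published tests and ratings. Buyers considering the purchase of systems or components should consult other sources of information to evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or other Web sites where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing the actual benchmark result of each platform by the actual benchmark result of the baseline platform, and assigning each platform a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. Buyers considering a purchase should consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.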
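The relative-performance convention stated in the performance disclaimer (the baseline result is assigned 1.0 and every other platform's result is divided by the baseline result) can be sketched as follows. This is a minimal illustration, not Intel's tooling: the function name `relative_performance` and the throughput figures are hypothetical, and the sketch assumes a "higher is better" metric.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    `results` maps a platform label to a raw score where higher is better
    (an assumption; a time-based metric would need the ratio inverted).
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical raw throughput numbers, for illustration only.
raw = {"baseline": 4.0, "optimized": 5.0, "optimized + coprocessor": 7.5}
rel = relative_performance(raw, "baseline")
for name, value in rel.items():
    print(f"{name}: {value:.2f}X")
```

The "X" figures quoted on the benchmark slides are ratios of this kind, so they are only comparable within a chart that shares the same baseline.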



Intel® Xeon Phi™ Product Family

Knights Corner: 1 TFLOPS (1). See the Intel® Xeon Phi™ Coprocessor Applications and Solutions Catalog.

Knights Landing: 3+ TFLOPS (2). Processor (bootable), high-bandwidth memory on package, integrated interconnect. First commercial systems in 2H'15. More than 50 systems providers expected (3), plus many more card-based systems. More than 100 PFLOPS of customer system compute commits to date (3).

Knights Hill: 3rd-generation Intel® Xeon Phi™ Product Family, 2nd-generation Intel® Omni-Path Architecture, 10 nm process technology.

1. Claim based on calculated theoretical peak double-precision performance capability for a single coprocessor: 16 DP FLOPS/clock/core × 61 cores × 1.23 GHz = 1.208 TeraFLOPS.

2. Over 3 Teraflops of peak theoretical double-precision performance is preliminary and based on current expectations of cores, clock frequency and floating-point operations per cycle. FLOPS = cores × clock frequency × floating-point operations per second per cycle.

3. Intel internal estimate.
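The peak-performance formula in footnotes 1 and 2 (cores × clock frequency × floating-point operations per cycle) can be checked with a few lines of arithmetic. This is a sketch under stated assumptions: the function name `peak_tflops` is hypothetical, and the second call assumes the 1.238 GHz clock quoted for the 7120P on the earlier product slide, which is the value that reproduces the footnote's 1.208 TFLOPS.

```python
def peak_tflops(cores, clock_ghz, flops_per_cycle):
    """Theoretical peak = cores x clock (GHz) x FLOPs per cycle per core."""
    gflops = cores * clock_ghz * flops_per_cycle
    return gflops / 1000.0  # GFLOPS -> TFLOPS


# Figures from footnote 1: 16 DP FLOPS/clock/core, 61 cores, clock as printed.
as_printed = peak_tflops(cores=61, clock_ghz=1.23, flops_per_cycle=16)

# Assumption: the unrounded 1.238 GHz clock from the 7120P specification slide.
unrounded = peak_tflops(cores=61, clock_ghz=1.238, flops_per_cycle=16)

print(f"{as_printed:.3f} TFLOPS at 1.23 GHz")
print(f"{unrounded:.3f} TFLOPS at 1.238 GHz")
```

The small gap between the two results suggests the footnote rounds the clock for display while the quoted total uses the unrounded frequency.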

Intel® Omni-Path Architecture: Coming 2H'15

100 Gbps line speed.
48-port switch chip architecture (vs. 36 in InfiniBand).
56% lower latency than InfiniBand (4); lower is better.
Higher system scalability and higher application performance scalability.

Higher port density (1.3x): maximize single-switch investment. 48 ports support up to 12 additional nodes by only adding cables (1).
Fewer switches: up to ½ (2).
More scalable (2.3x): over 27k nodes in a 2-tier, 5-hop fabric (3).

Target systems: small clusters, mainstream clusters, supercomputers.

www.intel.com/omnipath

1. As compared to a shipping 36-port edge InfiniBand switch.

2. The claim of up to ½ fewer switches is based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.

3. 2.3X based on 27,648 nodes in a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.

4. Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and are provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
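The node counts in footnote 3 are consistent with the textbook bound for a full-bisection, three-stage fat tree built from k-port switches, which supports at most k³/4 end nodes. The sketch below assumes that bound (the slide itself only says "2-tier, 5-hop" and does not state the formula), and the function name is hypothetical.

```python
def max_nodes_three_stage_fat_tree(radix):
    """Maximum end nodes of a full-bisection three-stage fat tree: k^3 / 4."""
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even port count")
    return radix ** 3 // 4


opa_nodes = max_nodes_three_stage_fat_tree(48)  # 48-port switch ASIC
ib_nodes = max_nodes_three_stage_fat_tree(36)   # 36-port switch chip
extra_ports = 48 - 36                           # additional nodes on one switch

print(opa_nodes, ib_nodes, round(opa_nodes / ib_nodes, 2), extra_ports)
```

Under this assumption the 48-port radix yields 27,648 nodes against 11,664 for 36 ports, a ratio slightly above the 2.3X quoted, and the 12-port difference matches the "12 additional nodes" claim for a single switch.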

Intel® EE for Lustre: Connectivity with Hadoop

Lustre that can connect to Hadoop.

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions.

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award (>$200M) extends IA leadership in extreme-scale HPC.

Aurora, Argonne National Laboratory: >180 PF (April '15)
Theta, Argonne National Laboratory: >8.5 PF
Trinity, NNSA†: >40 PF (July '14)
Cori, NERSC‡: >30 PF (April '14)

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation.

>180 PFLOPS (option to increase up to 450 PF)
>50,000 nodes
13 MW
2018 delivery
18X higher performance†
>6X more energy efficient†

Prime Contractor: Intel. Subcontractor: Cray.

Source: Argonne National Laboratory and Intel. †Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
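The two daggered ratios on this slide follow from the figures in the source note. The sketch below reproduces them, assuming the extracted "48MW" for MIRA reads as 4.8 MW and using the slide's lower bound of 180 PFLOPS at 13 MW for Aurora; the variable names are illustrative only.

```python
# Theoretical peak (PFLOPS) and power (MW) as quoted on the slide.
aurora_pf, aurora_mw = 180.0, 13.0
mira_pf, mira_mw = 10.0, 4.8  # assumption: "48MW" in the extraction means 4.8 MW

# Performance ratio: peak FLOPS of Aurora over peak FLOPS of MIRA.
perf_ratio = aurora_pf / mira_pf

# Energy-efficiency ratio: PFLOPS per MW of Aurora over PFLOPS per MW of MIRA.
efficiency_ratio = (aurora_pf / aurora_mw) / (mira_pf / mira_mw)

print(f"performance: {perf_ratio:.0f}X, efficiency: {efficiency_ratio:.1f}X")
```

With these inputs the performance ratio is exactly 18 and the efficiency ratio lands between 6 and 7, which agrees with the ">6X" wording.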

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being"

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 PME Tobacco Virus, 1 Million Atoms

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available (Section 18.7 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (whoever has a license), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV/

Results: Optimized Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 1 NODE.

[Figure: AMBER 14 Particle Mesh Ewald (PME) Tobacco Virus. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A. Bar labels as extracted: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.

AMBER 14 Particle Mesh Ewald (PME) Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: available as a patch. Recipe: available (Section 18.7 of the manual).

Usage Model: Baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (whoever has a license), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 3 NODES, CLUSTER BENCHMARK.

[Figure: AMBER PME Cellulose NPT (408K Atoms). Y-axis: Comparative Performance. X-axis: 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP. Bar labels as extracted: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2.

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with the RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., to simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: version 5.0-rc1 available. Recipe: available.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014. 1 NODE.

[Figure: GROMACS 512K H2O with RF Speed Up. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar labels as extracted: 1.56X, 1.79X, 1, 1.03X, 1.72X.]

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: available, including from the SVN repository. Recipe: available.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the NWChem 6.3 rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014. 32 NODES, CLUSTER BENCHMARK.

[Figure: NWChem 6.3 rev2 and 6.5 CCSD(T) Method, 32 Node Speed Up. Y-axis: Comparative Performance. Series: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A. Bar labels as extracted: 1, 1.24X, 1.52X.]

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of the NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 32 NODES, CLUSTER BENCHMARK.

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms). Y-axis: Comparative Performance (0 to 30). X-axis: 1 Node, 8 Nodes, 32 Nodes. Series: Intel® Xeon® processor E5-2697 v2 (Baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Bar labels as extracted: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of the NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 2 NODES, CLUSTER BENCHMARK.

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN. Y-axis: Comparative Performance. X-axis: 1 Node, 2 Nodes. Series: Intel® Xeon® processor E5-2697 v3 (Baseline: 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T). Bar labels as extracted: 1, 1.52X, 1.94X, 2.61X.]

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In the main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor, where they run concurrently with the calculations on the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup at higher node counts.

Source: Intel measured results as of March 2015. 32-node cluster benchmark.

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision). Y-axis: comparative performance; x-axis: 1 node, 32 nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package). Data labels: 1, 3X, 3.41X, 0.9X, 1, 3.05X, 3.6X. Annotation: "No testing on Tesla".]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3
"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
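The dynamic load balancing described for the LAMMPS Intel Package can be illustrated with a minimal sketch. It is not the LAMMPS implementation: the function name `rebalance_offload_fraction` and the simple throughput model (work divides freely and timings are reproducible) are assumptions made for illustration. Given the fraction of work offloaded in the last step and the measured host and coprocessor times, it returns the fraction that would let both sides finish together.

```python
def rebalance_offload_fraction(fraction: float, host_time: float, device_time: float) -> float:
    """Return the offload fraction that equalizes host and coprocessor time.

    Hypothetical helper, not part of LAMMPS. `fraction` is the share of work
    sent to the coprocessor in the last step (0 < fraction < 1); the two
    times are the measured wall-clock times of each side for that step.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if host_time <= 0.0 or device_time <= 0.0:
        raise ValueError("times must be positive")
    # Observed throughput of each side, in units of work per second.
    device_rate = fraction / device_time
    host_rate = (1.0 - fraction) / host_time
    # Both sides finish together when work is split in proportion to throughput.
    return device_rate / (device_rate + host_rate)


# Example: half the work is offloaded, and the coprocessor finishes in 1 s
# while the host needs 3 s, so the coprocessor should receive more work.
new_fraction = rebalance_offload_fraction(0.5, host_time=3.0, device_time=1.0)
```

Under this model a single measurement is enough to reach the balanced split; a real balancer would likely smooth over several steps because timings fluctuate.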

LAMMPS Rhodopsin Benchmark: 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In the main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor, where they run concurrently with the calculations on the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Source: Intel measured results as of August 2014. 32-node cluster benchmark.

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision). Y-axis: comparative performance; x-axis: 1 node, 32 nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package). Data labels: 1 node: 1, 1.27X, 1.68X; 32 nodes: 1, 1.07X, 1.47X.]

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: nvBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 with the Intel® AVX2 port, running on the Intel® Xeon® processor E5-2697 v3, is faster than nvBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Source: Intel measured results as of January 2015. 1 node.

[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up. Y-axis: comparative increase; x-axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Data labels: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, running as a symmetric process, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of January 2014. 1 node.

[Chart: BWA-ALN Speed Up. Y-axis: comparative performance. Series and data labels: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1; 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X; 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X.]

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignments of nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node.

[Chart: BLASTn v30 Speed Up. Y-axis: comparative performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized (listed twice in the legend); 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2. Data labels: 1, 1.22X, 1.26X, 1.41X, 1.49X, 1.52X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3
"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
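The load-sharing idea in the BLASTn usage model, randomizing the pick of queries and dividing them among devices according to a split such as 80/20 or 59/23/18, can be sketched as follows. This is an illustration only, not Intel's benchmark harness: `split_queries` and the query names are hypothetical, and reading the split figures as query counts out of 100 is an assumption.

```python
import random
from typing import List, Sequence


def split_queries(queries: Sequence[str], shares: Sequence[int], seed: int = 0) -> List[List[str]]:
    """Shuffle queries and cut them into consecutive groups sized by `shares`.

    Hypothetical helper: `shares` holds the number of queries per device and
    must sum to len(queries), e.g. (80, 20) for 100 queries.
    """
    if sum(shares) != len(queries):
        raise ValueError("shares must sum to the number of queries")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # randomized pick, reproducible per seed
    groups, start = [], 0
    for share in shares:
        groups.append(shuffled[start:start + share])
        start += share
    return groups


queries = [f"query_{i:03d}" for i in range(100)]  # placeholder query IDs
host_part, coprocessor_part = split_queries(queries, (80, 20))
three_way = split_queries(queries, (59, 23, 18), seed=1)
```

Repeating the call with different seeds mirrors the repeated, randomized runs the slide mentions; finding the sweet-spot split itself would require timing each candidate split.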

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node.

[Chart: BLASTp v30 Speed Up. Y-axis: comparative performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized (listed twice in the legend); 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2. Data labels: 1, 1.15X, 1.21X, 1.3X, 1.39X, 1.41X.]

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for the currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the virtualization technology will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are only available on select Intel® processors. Consult your PC manufacturer. Performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

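Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning each platform a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability whatsoever for any express or implied warranty relating to this information (including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right).

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.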
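The relative-performance rule in the performance disclaimer (the baseline platform is assigned 1.0 and every other result is divided by the baseline result) can be written as a short sketch. The function name and the sample throughput figures are made up for illustration and are not taken from any benchmark in this deck; the sketch also assumes a "higher is better" metric.

```python
from typing import Dict


def relative_performance(results: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Divide every platform's result by the baseline platform's result."""
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Made-up throughput figures (higher is better), for illustration only.
raw = {"baseline": 40.0, "optimized": 50.0, "optimized + coprocessor": 74.0}
relative = relative_performance(raw, "baseline")
```

For a time-based metric, where lower is better, the ratio would be inverted (baseline time divided by platform time).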


Intel® Omni-Path Architecture: Coming 2H'15

100 Gbps line speed

48-port switch chip architecture (vs. 36 in InfiniBand)

High system scalability; high application performance scalability

Higher port density, 1.3x: Maximize single-switch investment. 48 ports supports up to 12 additional nodes by only adding cables.(1)

Fewer switches: up to ½.(2)

Scalable, 2.3x: Over 27k nodes in a 2-tier, 5-hop fabric.(3)

56% lower latency compared with InfiniBand (lower is better).(4)

For small clusters, mainstream clusters, and supercomputers.

www.intel.com/omnipath

(1) As compared to a shipping 36-port edge InfiniBand switch.
(2) Reduction in up to ½ fewer switches claim based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration, using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters.
(3) 2.3X based on 27,648 nodes, based on a cluster configured with the Intel® Omni-Path Architecture using 48-port switch ASICs, as compared with a 36-port switch chip that can support up to 11,664 nodes.
(4) Latency reductions based on Mellanox CS7500 Director Switch and Mellanox SB7700/SB7790 Edge switches compared to preliminary Intel simulations for Intel® Omni-Path switches, based on a 1024-node full bisectional bandwidth (FBB) Fat-Tree configuration (2-tier, 5 total switch hops), using a 48-port switch for the Intel® Omni-Path cluster and a 36-port switch ASIC for either Mellanox or Intel® True Scale clusters. Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.
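The node counts in footnote 3 (27,648 and 11,664) are consistent with the textbook capacity of a three-level fat tree built from k-port switch ASICs, k³/4 end nodes. The slide does not state this formula, so treating it as the basis of the claim is an assumption; the sketch below only checks the arithmetic behind the 2.3x and 1.3x figures.

```python
def fat_tree_max_nodes(ports: int) -> int:
    """Maximum end nodes of a three-level fat tree of `ports`-port switches (k^3 / 4)."""
    return ports ** 3 // 4


omni_path_nodes = fat_tree_max_nodes(48)   # 48-port switch ASIC
infiniband_nodes = fat_tree_max_nodes(36)  # 36-port switch ASIC
node_ratio = omni_path_nodes / infiniband_nodes
port_ratio = 48 / 36

print(omni_path_nodes, infiniband_nodes)           # 27648 11664
print(round(node_ratio, 2), round(port_ratio, 2))  # 2.37 1.33
```

The ratios 2.37 and 1.33 match the slide's 2.3x scalability and 1.3x port-density figures after truncation to one decimal place.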

Intel® EE for Lustre: Connectivity with Hadoop

Breakdown: open source; Intel proprietary extensions; connector to Hadoop distributions.

Lustre that can connect to Hadoop.

ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme-scale HPC (2 systems, >$200M).

Aurora, Argonne National Laboratory: >180 PF (April '15)
Theta, Argonne National Laboratory: >8.5 PF
Trinity, NNSA†: >40 PF (July '14)
Cori, NERSC‡: >30 PF (April '14)

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation.

>180 PFLOPS (option to increase up to 450 PF)
>50,000 nodes
13 MW
2018 delivery
18X higher performance†
>6X more energy efficient†

Prime Contractor; Subcontractor

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double-precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
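The two headline ratios follow directly from the figures on the slide, reading the footnote's MIRA values as 10 PF and 4.8 MW (the decimal point is an assumption about the original slide). A quick check, using the lower bounds quoted for Aurora:

```python
aurora_pflops, aurora_mw = 180.0, 13.0  # ">180 PFLOPS", "13 MW"
mira_pflops, mira_mw = 10.0, 4.8        # footnote: MIRA, read as 10 PF and 4.8 MW

performance_ratio = aurora_pflops / mira_pflops
efficiency_ratio = (aurora_pflops / aurora_mw) / (mira_pflops / mira_mw)

print(f"{performance_ratio:.0f}X performance")       # 18X performance
print(f"{efficiency_ratio:.2f}X energy efficiency")  # 6.65X energy efficiency
```

Both results agree with the slide's "18X higher performance" and ">6X more energy efficient" claims.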

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being."

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 PME: Tobacco Virus, 1 Million Atoms

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

Source: Intel measured results as of September 2014. 1 node.

[Chart: AMBER 14 Particle Mesh Ewald (PME), Tobacco Virus. Y-axis: comparative performance. Series and data labels, in legend order: Intel® Xeon® processor E5-2697 v2 (baseline): 1; Intel® Xeon® processor E5-2697 v2 (optimized): 1.52X; Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2X; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP: 2.26X; Intel® Xeon® processor E5-2697 v3: 1.93X; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2.41X.]

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: The baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and the speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of September 2014. 3-node cluster benchmark.

[Chart: AMBER PME Cellulose NPT (408K atoms). Y-axis: comparative performance; x-axis: 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP. Data labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

GROMACS: 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with the RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of April 2014. 1 node.

[Chart: GROMACS 512K H2O with RF Speed Up. Y-axis: comparative performance. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120PX; 2 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX. Data labels: 1, 1.03X, 1.56X, 1.72X, 1.79X.]

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the baseline of NWChem 6.3rev2 on the Intel® Xeon® processor E5-2697 v2: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of July 2014. 32-node cluster benchmark.

[Chart: NWChem 6.3rev2 and 6.5 CCSD(T) Method, 32-Node Speed Up. Y-axis: comparative performance. Series and data labels: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A 2: 1.52X.]

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

[Chart: comparative performance (axis 0 to 30) at 1 node, 8 nodes, and 32 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).]

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

32 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 210 (pre-release) Cluster Performance Increase STMV (~1M atoms)

Co

mp

ara

tiv

e P

erf

orm

an

ce

1 2X

68X

122X

272X

12X 21X

79X

131X

20X

32X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3

242X

0

1

2

1 Node 2 Nodes

Intelreg Xeonreg processor E5-2697 v3 (Baseline 1 node)

Intelreg Xeonreg processor E5-2697 v3 + Intelreg Xeon Phitrade coprocessor B1-

7110A (240T)

18

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION 2 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 2.10 (pre-release) Cluster Performance Increase: ApoA1 (~92K atoms), 55 PPN

[Bar chart. Y-axis: Comparative Performance. Baseline: 1 node, 55 PPN.]

Chart legend (1 node, 32 nodes):

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)

2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on

2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup and higher node counts.
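The load-balancing idea in the usage model above can be illustrated with a small sketch. This is not the algorithm used by the LAMMPS Intel Package; it is a hypothetical feedback rule (the names `rebalance` and `simulate` and the speed values are assumptions made for illustration) that adjusts the offloaded fraction of work until the host and the coprocessor finish a step at the same time.

```python
def rebalance(fraction, t_host, t_coproc):
    """Return a new offload fraction from the timings of the last step.

    fraction  -- share of the work that was sent to the coprocessor (0 < fraction < 1)
    t_host    -- seconds the host needed for its share
    t_coproc  -- seconds the coprocessor needed for its share
    """
    host_rate = (1.0 - fraction) / t_host   # work per second observed on the host
    coproc_rate = fraction / t_coproc       # work per second observed on the coprocessor
    # Give each device a share proportional to its observed rate.
    return coproc_rate / (host_rate + coproc_rate)


def simulate(host_speed, coproc_speed, fraction=0.5, steps=5):
    """Run a few steps with fixed (hypothetical) device speeds and rebalance each step."""
    for _ in range(steps):
        t_host = (1.0 - fraction) / host_speed
        t_coproc = fraction / coproc_speed
        fraction = rebalance(fraction, t_host, t_coproc)
    return fraction


# Hypothetical case: the coprocessor is three times as fast as the host for this kernel.
final_fraction = simulate(host_speed=1.0, coproc_speed=3.0)
print(f"offload fraction after rebalancing: {final_fraction:.2f}")
```

With fixed speeds the rule settles after one step; in a real run the timings fluctuate, which is why the fraction would be updated continuously.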

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision)

[Bar chart. Y-axis: Comparative Performance. Chart annotation: no testing on Tesla.]

For configuration details go here

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3

“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

32 NODES CLUSTER BENCHMARK

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement on a single node, compared to the baseline configuration, using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization. The gain holds at 1.47X when scaling up to 32 nodes.

LAMMPS Rhodopsin Benchmark: 512K Atoms


SOURCE INTEL MEASURED RESULTS AS OF AUGUST 2014

LAMMPS Rhodopsin Benchmark Performance (Mixed Precision)

[Bar chart. Y-axis: Comparative Performance. X-axis: 1 node, 32 nodes.]

Chart legend:

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS baseline)

2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package)

2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package)

32 NODES CLUSTER BENCHMARK

For configuration details go here

Chart legend (workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1):

Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40

Intel® Xeon® processor E5-2697 v3

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to an Intel® Xeon® processor E5-2600 (6 cores) on one workload.


SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2015

Johns Hopkins Bowtie 2: TGen Workload Speed Up

[Bar chart. Y-axis: Comparative Increase.]

1 NODE

For configuration details go here

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, running in symmetric mode, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

BWA-ALN Speed Up

[Bar chart. Y-axis: Comparative Performance.]

Chart legend:

2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN)

2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN)

2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A

1 NODE


Burrows-Wheeler Aligner (BWA-ALN) Human Genome

For configuration details go here

Bar values: 1 (baseline), 1.24X, 1.86X

Chart legend:

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the sweet spot for maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 8020 and 592318.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
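Why a static split has a "sweet spot" can be seen from a simple model. The sketch below is an illustration only, not the method used in the benchmark: it assumes each device processes queries at a constant rate, that the run ends when the slower side finishes, and that the "8020" in the usage model denotes an 80/20 split. The function names and the rates are hypothetical.

```python
def makespan(cpu_share, cpu_rate, phi_rate, total_queries=100):
    """Wall time when cpu_share of the queries go to the CPU and the rest to the coprocessor."""
    cpu_time = cpu_share * total_queries / cpu_rate
    phi_time = (1.0 - cpu_share) * total_queries / phi_rate
    # Both devices run concurrently, so the slower one determines the total time.
    return max(cpu_time, phi_time)


def best_share(cpu_rate, phi_rate):
    """CPU share at which both devices finish together (the sweet spot of this model)."""
    return cpu_rate / (cpu_rate + phi_rate)


# Hypothetical rates: the CPU handles queries four times as fast as the coprocessor,
# which in this model puts the sweet spot at an 80/20 split.
cpu_rate, phi_rate = 4.0, 1.0
share = best_share(cpu_rate, phi_rate)
for candidate in (0.5, 0.7, share, 0.9, 1.0):
    print(f"CPU share {candidate:.2f}: time {makespan(candidate, cpu_rate, phi_rate):.1f}")
```

In this model the time rises on either side of the balanced split, which matches the slide's remark that the sweet spot is small.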


BLAST BLASTn v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

BLASTn v30 Speed Up

[Bar chart. Y-axis: Comparative Performance.]

1 NODE

For configuration details go here

Chart legend:

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized

BLAST BLASTp v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor at the sweet spot for maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

BLASTp v30 Speed Up

[Bar chart. Y-axis: Comparative Performance.]

1 NODE

For configuration details go here

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Legal Information

All products, computer systems, dates, and figures described in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a single processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of virtualization technology vary with your hardware and software configuration. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). Intel® TXT also requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies with hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

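Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. If you are considering the purchase of systems or components, consult other sources of information to evaluate their performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages you to visit the referenced Web sites, or others where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that is proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. If you are considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.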
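The relative-performance rule stated in the disclaimer above (assign 1.0 to the baseline, divide every platform's result by the baseline result) can be sketched in a few lines. The function name and the raw numbers are hypothetical and chosen only for illustration; they are not measurements from this deck.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so that the baseline platform scores 1.0.

    results  -- mapping of platform name to a raw result where higher is better
    baseline -- name of the platform used as the reference
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical raw throughputs (for example ns/day), for illustration only.
raw = {"baseline": 4.0, "optimized": 5.0, "optimized + coprocessor": 8.0}
scores = relative_performance(raw, "baseline")
for name, score in scores.items():
    print(f"{name}: {score:.2f}X")
```

For results where lower is better (such as wall time), the ratio would be inverted before normalizing.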


Intel® EE for Lustre: Hadoop connectivity

Breakdown: open source; Intel proprietary extensions; connectors to Hadoop distributions

Lustre that can connect to Hadoop

ANL Selects Intel for World’s Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme scale HPC

Aurora (Argonne National Laboratory): >180 PF

April ’15

Theta (Argonne National Laboratory): >8.5 PF

Trinity (NNSA†): >40 PF

July ’14

Cori (NERSC‡): >30 PF

April ’14

>$200M

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC)

† Cray XC Series at National Nuclear Security Administration (NNSA)

The Most Advanced Supercomputer Ever Built: An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. †Comparison of theoretical peak double precision FLOPS and power consumption to ANL’s largest current system, MIRA (10 PFs and 4.8 MW)
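The two ratios on this slide can be checked with simple arithmetic. The sketch below reads the footnote's Mira figures as 10 PF and 4.8 MW and the Aurora figures as 180 PF and 13 MW; the placement of the decimal points is an assumption about the extracted text, and the variable names are chosen for illustration.

```python
# Figures as read from the slide (lower bounds for Aurora, footnote values for Mira).
aurora_pflops, aurora_mw = 180.0, 13.0
mira_pflops, mira_mw = 10.0, 4.8

# Peak performance ratio: theoretical peak double-precision FLOPS of Aurora over Mira.
performance_ratio = aurora_pflops / mira_pflops

# Energy-efficiency ratio: PFLOPS per MW of Aurora over PFLOPS per MW of Mira.
efficiency_ratio = (aurora_pflops / aurora_mw) / (mira_pflops / mira_mw)

print(f"performance: {performance_ratio:.0f}X")
print(f"energy efficiency: {efficiency_ratio:.2f}X")
```

Under these readings the performance ratio is 18 and the efficiency ratio is a little above 6.6, which is consistent with the slide's "18X" and ">6X".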

Application Support Status: Life Sciences

“Intel’s leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being”

Wang Bingqiang, Head of High Performance Computing, BGI

Chart legend:

Intel® Xeon® processor E5-2697 v2 (baseline)

Intel® Xeon® processor E5-2697 v2 (optimized)

Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A

Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP

Intel® Xeon® processor E5-2697 v3

Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A


SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A with offload processing on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X the performance of the NVIDIA K40.

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

AMBER 14 PME: Tobacco Virus, 1 Million Atoms

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus

[Bar chart. Y-axis: Comparative Performance.]

1 NODE

For configuration details go here

Chart legend (2 nodes, 3 nodes):

Intel® Xeon® processor E5-2697 v2 (baseline)

Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A

Xeon E5-2697 v2 + NVIDIA K40 DPFP

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2


SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: Baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and the speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders), available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

AMBER PME Cellulose NPT (408K Atoms)

[Bar chart. Y-axis: Comparative Performance. X-axis: 2 nodes, 3 nodes.]

3 NODES CLUSTER BENCHMARK

For configuration details go here

Chart legend:

Intel® Xeon® processor E5-2697 v2

1 Intel® Xeon Phi™ coprocessor 7120PX

2 Intel® Xeon Phi™ coprocessor 7120PX

Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX

Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

1 NODE

SOURCE INTEL MEASURED RESULTS AS OF APRIL 2014

GROMACS 512K H2O with RF Speed Up

[Bar chart. Y-axis: Comparative Performance.]

For configuration details go here

Chart legend:

NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A

NWChem 6.3 rev2 and 6.5 CCSD(T) Method: 32 Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed at the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling cluster application for the NWChem community.

Results: Compared to the baseline of NWChem 6.3 rev2 on the Intel® Xeon® processor E5-2697 v2:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

SOURCE INTEL MEASURED RESULTS AS OF JULY 2014

NWChem CCSD(T) Method

[Bar chart. Y-axis: Comparative Performance. Bar values: 1 (NWChem 6.3 baseline), 1.24X (NWChem 6.5), 1.52X (NWChem 6.5 with Intel® Xeon Phi™ coprocessor 7120A).]

32 NODES CLUSTER BENCHMARK

For configuration details go here

Chart legend (1 node, 8 nodes, 32 nodes):

Intel® Xeon® processor E5-2697 v2 (baseline, 1 node, 23 or 47 PPN)

Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN)

Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T)

Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T)

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

APPROVED FOR PUBLIC PRESENTATION

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

32 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 210 (pre-release) Cluster Performance Increase STMV (~1M atoms)

Co

mp

ara

tiv

e P

erf

orm

an

ce

1 2X

68X

122X

272X

12X 21X

79X

131X

20X

32X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3

242X

0

1

2

1 Node 2 Nodes

Intelreg Xeonreg processor E5-2697 v3 (Baseline 1 node)

Intelreg Xeonreg processor E5-2697 v3 + Intelreg Xeon Phitrade coprocessor B1-

7110A (240T)

18

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

NAMD 210 Pre-Release ApoA1

Application NAMD 210 pre-release ApoA1

Description A parallel object-oriented molecular dynamics code designed for high-performance simulation of large bio molecular systems More at httpwwwksuiuceduResearchnamd

Availability Code Intelreg Xeon Phitrade coprocessor support is available as a pre-

release Use the nightly build Recipe Available here

Usage Model Single rank on host with 55 threads Various computations are offloaded to Intelreg Xeon Phitrade coprocessor from each thread

Highlights Intelreg Xeon Phitrade coprocessor support is now in the development branch of NAMD 210 pre-release

Results For the ApoA1 workload 2-node performance can be accelerated by up to 261X using a single Intelreg Xeon Phitrade coprocessor

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION 2 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 210 (pre-release) Cluster Performance Increase ApoA1 (~92K atoms) 55 PPN

Co

mp

ara

tiv

e P

erf

orm

an

ce

1

152X

194X

261X

(Baseline 1 node 55PPN)

Other names and brands may be claimed as the property of others

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup at higher node counts.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

APPROVED FOR PUBLIC PRESENTATION. 32 NODES CLUSTER BENCHMARK.

For configuration details, go here.

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision). Y-axis: Comparative Performance (0 to 3). X-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package). Bar labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. One position is annotated "No testing on Tesla".]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
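The dynamic load balancing described in the LAMMPS usage model can be illustrated with a small sketch. This is not the LAMMPS Intel Package implementation: `rebalance`, its `damping` parameter, and the simulated host and coprocessor rates are hypothetical names and numbers chosen for illustration. The sketch assumes each step reports how long the host and the coprocessor took for their share of the work, and it nudges the offloaded fraction toward the split at which both would finish together.

```python
def rebalance(fraction, host_time, coproc_time, damping=0.5):
    """Return an updated fraction of the work to offload.

    fraction    -- share of work sent to the coprocessor in the last step (0..1)
    host_time   -- seconds the host needed for its (1 - fraction) share
    coproc_time -- seconds the coprocessor needed for its fraction share
    damping     -- hypothetical smoothing factor; 1.0 jumps straight to the target
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if host_time <= 0.0 or coproc_time <= 0.0:
        raise ValueError("timings must be positive")
    host_rate = (1.0 - fraction) / host_time   # work per second on the host
    coproc_rate = fraction / coproc_time       # work per second on the coprocessor
    # Split at which both devices would finish at the same time.
    target = coproc_rate / (host_rate + coproc_rate)
    return fraction + damping * (target - fraction)


def simulate(steps=20, host_rate=1.0, coproc_rate=3.0, start=0.5):
    """Run the balancer against made-up device speeds; return the final fraction."""
    fraction = start
    for _ in range(steps):
        host_time = (1.0 - fraction) / host_rate
        coproc_time = fraction / coproc_rate
        fraction = rebalance(fraction, host_time, coproc_time)
    return fraction


final_fraction = simulate()
print(f"offloaded fraction after balancing: {final_fraction:.3f}")
```

With the made-up rates above (coprocessor three times faster than the host), the balanced split is 3/(1+3), so the fraction settles near 0.75. A damping factor below 1.0 is a design choice that trades convergence speed for stability when timings are noisy.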

LAMMPS Rhodopsin Benchmark: 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

APPROVED FOR PUBLIC PRESENTATION. 32 NODES CLUSTER BENCHMARK.

For configuration details, go here.

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision). Y-axis: Comparative Performance. X-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Bar labels: 1, 1.27X, 1.68X, 1, 1.07X, 1.47X.]

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up. Y-axis: Comparative Increase. X-axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Bar labels: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: BWA-ALN Speed Up. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A. Bar labels: 1, 1.24X, 1.86X.]

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the simulation rate speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: BLASTn v30 Speed Up. Y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
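The BLASTn usage model distributes a concatenated query set between host and coprocessor, with the pick of queries randomized on each repetition. A minimal sketch of that partitioning step follows. It assumes the split figures on the slide read as percentage shares (80/20 and 59/23/18); `split_queries` and the placeholder query names are hypothetical and are not part of BLAST or of the benchmark recipe.

```python
import random


def split_queries(queries, shares, seed=None):
    """Shuffle the queries, then partition them by percentage shares.

    shares -- integers that must sum to 100, e.g. (80, 20)
    seed   -- optional seed so a randomized pick can be reproduced
    """
    if sum(shares) != 100:
        raise ValueError("shares must sum to 100")
    pool = list(queries)
    random.Random(seed).shuffle(pool)  # randomize which queries land where
    parts = []
    start = 0
    cumulative = 0
    for share in shares:
        cumulative += share
        end = len(pool) * cumulative // 100  # cumulative cut keeps every query
        parts.append(pool[start:end])
        start = end
    return parts


# Placeholder names standing in for 100 concatenated queries.
queries = [f"query_{i:03d}" for i in range(100)]
host_part, coprocessor_part = split_queries(queries, (80, 20), seed=0)
three_way = split_queries(queries, (59, 23, 18), seed=0)
print(len(host_part), len(coprocessor_part), [len(p) for p in three_way])
```

Cutting at cumulative percentages rather than rounding each share separately guarantees that every query is assigned exactly once, even when the query count is not a multiple of 100.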

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the simulation rate speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: BLASTp v30 Speed Up. Y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

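Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number. Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata. Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of the virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html. No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html. Intel® Turbo Boost Technology requires a system that supports it. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost. Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as third-party software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English). Intel, インテル, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

Legal Disclaimer: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance comprehensively. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or other Web sites where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase. Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that correlates with the reported performance improvement. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English). The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability whatsoever for any express or implied warranty relating to this information (including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right). Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance comprehensively.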
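The relative-performance rule stated in the disclaimer above (assign 1.0 to the baseline result, then divide each platform's result by the baseline platform's result) can be written as a short function. This is a minimal sketch: `relative_performance`, the platform labels, and the raw scores are hypothetical values for illustration, not measurements from this deck, and the sketch assumes that a higher raw score means better performance.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results  -- mapping of platform label to raw score (higher is better)
    baseline -- label of the platform that defines 1.0
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical raw scores, e.g. simulated nanoseconds per day.
raw_scores = {
    "baseline": 4.0,
    "optimized": 5.0,
    "optimized+coprocessor": 8.0,
}
relative = relative_performance(raw_scores, "baseline")
for name, value in relative.items():
    print(f"{name}: {value:.2f}X")
```

If a benchmark reports elapsed time instead of throughput, the ratio would need to be inverted (baseline time divided by platform time) so that larger numbers still mean faster.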



ANL Selects Intel for World's Biggest Supercomputer

2-system CORAL award extends IA leadership in extreme scale HPC

Aurora, Argonne National Laboratory

>180 PF

April '15

Theta, Argonne National Laboratory

>8.5 PF

Trinity, NNSA†

>40 PF

July '14

Cori, NERSC‡

>30 PF

April '14

>$200M

‡ Cray XC Series at National Energy Research Scientific Computing Center (NERSC). † Cray XC Series at National Nuclear Security Administration (NNSA).

The Most Advanced Supercomputer Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. † Comparison of theoretical peak double precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
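The two headline ratios on this slide follow from the quoted figures, and the arithmetic can be checked with a few lines. This is a sketch under stated assumptions: it uses the lower bounds given for Aurora (180 PFLOPS, 13 MW) and reads the MIRA figures as 10 PFLOPS and 4.8 MW, on the assumption that decimal points were lost in extraction. The variable names are illustrative only.

```python
# Figures as quoted on the slide (peak double precision, lower bounds for Aurora).
aurora_pflops = 180.0
aurora_megawatts = 13.0
# MIRA comparison system; 4.8 MW is an assumed reading of the extracted "48MW".
mira_pflops = 10.0
mira_megawatts = 4.8

# Ratio of theoretical peak performance.
performance_ratio = aurora_pflops / mira_pflops

# Energy efficiency is performance per unit of power (PFLOPS per MW).
aurora_efficiency = aurora_pflops / aurora_megawatts
mira_efficiency = mira_pflops / mira_megawatts
efficiency_ratio = aurora_efficiency / mira_efficiency

print(f"peak performance ratio: {performance_ratio:.1f}X")
print(f"energy efficiency ratio: {efficiency_ratio:.2f}X")
```

Under these assumptions the performance ratio is exactly 18 and the efficiency ratio is a little over 6.6, which is consistent with the slide's "18X" and ">6X" claims.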

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being"

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 plus the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (whoever has a license), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: AMBER 14 PME Tobacco Virus, 1 Million Atoms. Y-axis: Comparative Performance (0 to 2). Series: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A. Bar labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speed up is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (whoever has a license), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION. 3 NODES CLUSTER BENCHMARK.

For configuration details, go here.

[Chart: AMBER PME Cellulose NPT (408K Atoms). Y-axis: Comparative Performance. X-axis: 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP. Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation on the Intel® Xeon Phi™ coprocessor natively plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

APPROVED FOR PUBLIC PRESENTATION. 1 NODE.

For configuration details, go here.

[Chart: GROMACS 512K H2O with RF Speed Up. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar labels: 1.56X, 1.79X, 1, 1.03X, 1.72X.]

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling and cluster-compelling application for the NWChem community.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

APPROVED FOR PUBLIC PRESENTATION. 32 NODES CLUSTER BENCHMARK.

For configuration details, go here.

[Chart: NWChem 6.3rev2 and 6.5 CCSD(T) Method, 32 Node Speed Up. Y-axis: Comparative Performance. Series: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A. Bar labels: 1, 1.24X, 1.52X.]

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION. 32 NODES CLUSTER BENCHMARK.

For configuration details, go here.

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms). Y-axis: Comparative Performance (0 to 30). X-axis: 1 Node, 8 Nodes, 32 Nodes. Series: Intel® Xeon® processor E5-2697 v2 (Baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Bar labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

[Chart legend, NAMD 2.10 pre-release ApoA1. Y-axis: 0 to 2. X-axis: 1 Node, 2 Nodes. Series: Intel® Xeon® processor E5-2697 v3 (Baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

NAMD 210 Pre-Release ApoA1

Application NAMD 210 pre-release ApoA1

Description A parallel object-oriented molecular dynamics code designed for high-performance simulation of large bio molecular systems More at httpwwwksuiuceduResearchnamd

Availability Code Intelreg Xeon Phitrade coprocessor support is available as a pre-

release Use the nightly build Recipe Available here

Usage Model Single rank on host with 55 threads Various computations are offloaded to Intelreg Xeon Phitrade coprocessor from each thread

Highlights Intelreg Xeon Phitrade coprocessor support is now in the development branch of NAMD 210 pre-release

Results For the ApoA1 workload 2-node performance can be accelerated by up to 261X using a single Intelreg Xeon Phitrade coprocessor

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

APPROVED FOR PUBLIC PRESENTATION 2 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 210 (pre-release) Cluster Performance Increase ApoA1 (~92K atoms) 55 PPN

Co

mp

ara

tiv

e P

erf

orm

an

ce

1

152X

194X

261X

(Baseline 1 node 55PPN)

Other names and brands may be claimed as the property of others

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup at higher node counts.

Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision), 32-node cluster benchmark, relative performance at 1 node and 32 nodes.

Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).

Values as plotted: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. Chart annotation: No testing on Tesla.

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of March 2015.

LAMMPS Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement using Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision), 32-node cluster benchmark, relative performance.

1 Node: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline) 1; 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package) 1.27X; 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, turbo off (LAMMPS IA Package) 1.68X.

32 Nodes: LAMMPS Baseline 1; LAMMPS IA Package 1.07X; IA Package + Intel® Xeon Phi™ coprocessor 1.47X.

Source: Intel measured results as of August 2014.

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port.

Description: nvBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: Workloads ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1.

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than nvBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up, 1 node, relative increase. Workloads on the axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1.

Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3.

Values as plotted: 1, 1.87X, 1.59X, 1.08X, 88X.

Source: Intel measured results as of January 2015.

Burrows-Wheeler Aligner (BWA-ALN), Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, run as a symmetric process, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Chart: BWA-ALN Speed Up, 1 node, relative performance.

2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1

2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X

2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X

Source: Intel measured results as of January 2014.

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for a sweet-spot split of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart: BLASTn v30 Speed Up, 1 node, relative performance.

Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.

Values as plotted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of March 2015.

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for a sweet-spot split of 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart: BLASTp v30 Speed Up, 1 node, relative performance.

Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.

Values as plotted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of March 2015.

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of Virtualization Technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology capability. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

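Legal Disclaimer: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other information as well to evaluate performance comprehensively. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of the third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, then dividing each platform's benchmark result by the actual benchmark result of the baseline platform, which yields a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests such as SYSmark and MobileMark are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance comprehensively.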
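The relative-performance rule stated in the performance disclaimer (the baseline is assigned 1.0 and every platform's result is divided by the baseline's result) can be sketched as follows. This is a minimal illustration, not Intel's tooling: the function name `relative_performance`, the platform labels, and the raw throughput numbers are all hypothetical, and the sketch assumes that a higher raw value means better performance.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    Assumes higher raw values mean better performance (e.g. ns/day).
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical raw throughput figures, for illustration only.
raw = {"baseline": 4.0, "optimized": 6.0, "optimized + coprocessor": 8.0}
relative = relative_performance(raw, "baseline")

for name, score in relative.items():
    print(f"{name}: {score:.2f}X")
```

Every "X" value in the charts of this deck is a number of this kind, so values are only comparable within a chart that shares one baseline.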


The Most Advanced Supercomputer Ever Built

An Intel-led collaboration with ANL and Cray to accelerate discovery & innovation

>180 PFLOPS (option to increase up to 450 PF)

>50,000 nodes

13 MW

2018 delivery

18X higher performance†

>6X more energy efficient†

Prime Contractor

Subcontractor

Source: Argonne National Laboratory and Intel. †Comparison of theoretical peak double precision FLOPS and power consumption to ANL's largest current system, MIRA (10 PFs and 4.8 MW).
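The two headline ratios on this slide can be checked with a short worked calculation. The sketch below uses the slide's own lower-bound figures (180 PFLOPS, 13 MW) and reads the footnote's Mira figures as 10 PFLOPS and 4.8 MW; the 4.8 MW reading is an assumption about a decimal point lost in extraction, and the variable names are illustrative.

```python
# Figures as stated on the slide (theoretical peak, double precision).
aurora_pflops, aurora_mw = 180.0, 13.0
# Mira figures from the footnote; 4.8 MW is an assumed reading of "48MW".
mira_pflops, mira_mw = 10.0, 4.8

# Raw performance ratio: peak FLOPS of the new system over the old one.
perf_ratio = aurora_pflops / mira_pflops

# Energy-efficiency ratio: PFLOPS per MW of the new system over the old one.
efficiency_ratio = (aurora_pflops / aurora_mw) / (mira_pflops / mira_mw)

print(f"performance: {perf_ratio:.0f}X")
print(f"energy efficiency: {efficiency_ratio:.2f}X")
```

Under these assumptions the performance ratio comes out at 18X and the efficiency ratio at a little over 6X, which is consistent with the "18X" and ">6X" claims on the slide.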

Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power which enable us achieve more genome scientific research success for genome application development for China and for the whole human being"

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 PME Tobacco Virus, 1 Million Atoms

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (whoever has a license), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized Intel Xeon processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel Xeon processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

Chart: AMBER 14 Particle Mesh Ewald (PME) Tobacco Virus, 1 node, relative performance.

Intel® Xeon® processor E5-2697 v2 (baseline): 1

Intel® Xeon® processor E5-2697 v2 (optimized): 1.52X

Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2X

Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP: 2.26X

Intel® Xeon® processor E5-2697 v3: 1.93X

Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2.41X

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.

Source: Intel measured results as of September 2014.

AMBER 14 Particle Mesh Ewald (PME) Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: Baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and the speedup is shown with offload processing on both the Intel Xeon processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms: 50% of the workload on the host, 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (whoever has a license), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated cluster performance improvement of up to 2.6X over the baseline Intel® Xeon® processor E5-2697 v2.

Chart: AMBER PME Cellulose NPT (408K Atoms), 3-node cluster benchmark, relative performance at 2 nodes and 3 nodes.

Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP.

Values as plotted: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of September 2014.

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., to simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX-intrinsics). Able to run a full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Chart: GROMACS 512K H2O with RF Speed Up, 1 node, relative performance.

Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120PX; 2 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120PX; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120PX.

Values as plotted: 1.56X, 1.79X, 1, 1.03X, 1.72X.

Source: Intel measured results as of April 2014.

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, on a single node and on a cluster.

Results: Compared to the NWChem 6.3.rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

Chart: NWChem 6.3.rev2 and 6.5 CCSD(T) Method, 32 Node Speed Up, cluster benchmark, relative performance.

NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X

NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ Coprocessor 7120A 2: 1.52X

Source: Intel measured results as of July 2014.

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms), 32-node cluster benchmark, relative performance at 1 node, 8 nodes, and 32 nodes.

Series: Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).

Values as plotted: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.

Source: Intel measured results as of September 2014.

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN, 2-node cluster benchmark, relative performance at 1 node and 2 nodes (baseline: 1 node, 55 PPN).

Series: Intel® Xeon® processor E5-2697 v3 (baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).

Values as plotted: 1, 1.52X, 1.94X, 2.61X.

Source: Intel measured results as of September 2014.

0

1

2

3

1 Node 32 Nodes

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS Baseline)

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS IA Package)

2S Xeon E5-2697 v3 + Tesla K40c boost off ECC on

2S Xeon E5-2697 v3 + Xeon Phi 7120A turbo off (LAMMPS IA Package)

19

LAMMPS Stillinger-Weber Water Benchmark

Application LAMMPS

Description Simulation of molecular systems with classical models More at httplammpssandiagov

Availability Code In main LAMMPS repository Recipe Available here

Usage Model Load balancer offloads part of neighbor-list and non-bond force calculations to Intelreg Xeon Phitrade coprocessor for concurrent calculations with CPU

Highlights Improved results with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A Dynamic load balancing allows for concurrent Data transfer between host and coprocessor Calculations of neighbor-list non-bond bond and long-

range terms Same routines in LAMMPS Intel Package also run faster on CPU

Results Simulation rate increase with Intelreg Package is up to 36X Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision)

Co

mp

ara

tiv

e P

erf

orm

an

ce

For configuration details go here

1

3X

341X

1

305X

36X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v3rdquo = Intelreg Xeonreg processor E5-2697 v3

ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

09X

APPROVED FOR PUBLIC PRESENTATION

No testing

on Tesla

NEW

32 NODES CLUSTER BENCHMARK

20

LAMMPS Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Source: Intel measured results as of August 2014. 32-node cluster benchmark.

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision). Y-axis: Comparative Performance. Bar values (1 node / 32 nodes): 2S Intel® Xeon® processor E5-2697 v3, LAMMPS Baseline: 1 / 1; 2S Intel® Xeon® processor E5-2697 v3, LAMMPS IA Package: 1.27X / 1.07X; 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off, LAMMPS IA Package: 1.68X / 1.47X.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
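The usage model above depends on a load balancer that decides how much work goes to the coprocessor. As a rough illustration only, and not the LAMMPS Intel Package implementation, the Python sketch below nudges the coprocessor's work fraction after each step so that host and coprocessor would finish at the same time. The function names, the damping factor, and the processing rates in the simulation are all hypothetical.

```python
def rebalance(fraction, host_time, coproc_time, damping=0.5):
    """Return an updated share of work for the coprocessor.

    fraction    -- share of the work sent to the coprocessor last step (0..1)
    host_time   -- measured host time for its share of the last step
    coproc_time -- measured coprocessor time for its share (incl. transfers)
    damping     -- hypothetical smoothing factor; 1.0 jumps straight to target
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Estimate each side's throughput (work per unit time) from the last step.
    host_rate = (1.0 - fraction) / host_time
    coproc_rate = fraction / coproc_time
    # Both sides finish together when work is split in proportion to throughput.
    target = coproc_rate / (host_rate + coproc_rate)
    return fraction + damping * (target - fraction)


def simulate(host_rate, coproc_rate, steps=30, fraction=0.5):
    """Run the balancer against idealized, noise-free timings."""
    for _ in range(steps):
        host_time = (1.0 - fraction) / host_rate
        coproc_time = fraction / coproc_rate
        fraction = rebalance(fraction, host_time, coproc_time)
    return fraction


if __name__ == "__main__":
    # Hypothetical rates: the coprocessor sustains half the host's throughput.
    print(round(simulate(host_rate=1.0, coproc_rate=0.5), 4))
```

With these made-up rates the fraction settles near one third, the point where neither side waits on the other. A real balancer would also have to cope with timing noise and with work that cannot be offloaded.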

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port.

Description: nvBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. More at http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: Workloads ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1, and ERR024139_1.

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than nvBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Source: Intel measured results as of January 2015. 1 node. Marked "NEW".

[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up. Y-axis: Comparative Increase. X-axis workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Bar labels: 1, 1.87X, 1.59X, 1.08X, 88X [sic].]

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of January 2014. 1 node.

[Chart: BWA-ALN Speed Up. Y-axis: Comparative Performance. Bar values: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1; 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X; 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X.]

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30.

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times, with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node. Marked "NEW".

[Chart: BLASTn v30 Speed Up. Y-axis: Comparative Performance. Series as extracted: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.
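The "sweet spot" split described for this load-sharing model can be pictured with a simple throughput model: the run finishes when the slower of the two devices finishes, so the best split is the one that minimizes that finishing time. The Python sketch below is an illustration under that assumption and is not the procedure Intel used. The function name and the processing rates are hypothetical; the rates are chosen only so that the answer happens to be an 80/20 split.

```python
def best_split(n_queries, host_rate, coproc_rate):
    """Find the host/coprocessor query split with the shortest finishing time.

    n_queries   -- total number of queries to distribute
    host_rate   -- hypothetical host throughput in queries per second
    coproc_rate -- hypothetical coprocessor throughput in queries per second
    Returns (queries_on_host, queries_on_coprocessor, finishing_time).
    """
    best = None
    for n_coproc in range(n_queries + 1):
        n_host = n_queries - n_coproc
        # Both devices run concurrently, so the slower one sets the total time.
        finish = max(n_host / host_rate, n_coproc / coproc_rate)
        if best is None or finish < best[2]:
            best = (n_host, n_coproc, finish)
    return best


if __name__ == "__main__":
    # Hypothetical rates: the host processes queries four times faster.
    n_host, n_coproc, finish = best_split(100, host_rate=4.0, coproc_rate=1.0)
    host_only = 100 / 4.0
    print(n_host, n_coproc, finish, host_only / finish)
```

In this toy model, moving even one query away from the optimum lengthens the run, which is one way to read the remark that the sweet spot is small. Real queries vary in cost, which is presumably why the experiment randomized the pick of queries and repeated the run.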

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30.

Description: Searching for alignment in a protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times, with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node. Marked "NEW".

[Chart: BLASTp v30 Speed Up. Y-axis: Comparative Performance. Series as extracted: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the technology will vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

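Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. If you are considering purchasing systems or components, consult other sources of information to evaluate performance as a whole. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of the third-party benchmarks or websites referenced in this document. Intel encourages you to visit the referenced websites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided "as is". It grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. If you are considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.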
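The relative-performance rule in the disclaimer above (baseline fixed at 1.0, every other result divided by the baseline result) is easy to state in code. The Python sketch below is a minimal illustration of that rule; the function name and the sample numbers are hypothetical and are not taken from any slide. It assumes higher raw results are better, such as a simulation rate.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results  -- dict mapping platform name to a raw result (higher is better)
    baseline -- name of the platform to use as the 1.0 reference
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    # Each platform's result divided by the baseline platform's result.
    return {name: value / base for name, value in results.items()}


if __name__ == "__main__":
    # Hypothetical raw results, e.g. simulated nanoseconds per day.
    raw = {"baseline": 2.0, "optimized": 3.0, "with_coprocessor": 5.0}
    for name, score in relative_performance(raw, "baseline").items():
        print(f"{name}: {score:.2f}X")
```

If the raw result were a run time, where lower is better, the ratio would need to be inverted (baseline time divided by platform time) to give the same kind of "X" figure.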


Application Support Status: Life Sciences

"Intel's leading technology & product provide great high performance computing power, which enable us achieve more genome scientific research success for genome application development, for China and for the whole human being."

Wang Bingqiang, Head of High Performance Computing, BGI

AMBER 14 Particle Mesh Ewald (PME): Tobacco Virus

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 together with the Intel® Xeon Phi™ coprocessor 7120A. Offload processing runs on both, using the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (whoever has a license), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: The optimized Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

Source: Intel measured results as of September 2014. 1 node.

[Chart: AMBER 14 PME Tobacco Virus, 1 Million Atoms. Y-axis: Comparative Performance. Bar values in legend order: Intel® Xeon® processor E5-2697 v2 (baseline): 1; Intel® Xeon® processor E5-2697 v2 (optimized): 1.52X; Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2X; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP: 2.26X; Intel® Xeon® processor E5-2697 v3: 1.93X; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A: 2.41X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.

AMBER 14 Particle Mesh Ewald (PME): Cellulose NPT

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 18.7 of the manual).

Usage Model: The baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured at http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (whoever has a license), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement, up to 26X [sic] over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of September 2014. 3-node cluster benchmark.

[Chart: AMBER PME Cellulose NPT (408K Atoms). Y-axis: Comparative Performance. X-axis: 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP. Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2.

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with the RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor together with the host processor, using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of April 2014. 1 node.

[Chart: GROMACS 512K H2O with RF Speed Up. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar labels: 1, 1.03X, 1.56X, 1.72X, 1.79X.]
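The description above defines molecular dynamics as simulating the Newtonian equations of motion. To make that concrete, the Python sketch below integrates a single particle on a spring with the velocity Verlet scheme, a standard textbook integrator. It is offered only as an illustration of the idea: it is not GROMACS code, it makes no claim about which integrator a given GROMACS run uses, and the function name, the spring force, and the step size are assumptions chosen for the example.

```python
import math


def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance position x and velocity v under force(x) for a number of steps."""
    a = force(x) / mass
    for _ in range(steps):
        # Position update uses the current velocity and acceleration.
        x = x + v * dt + 0.5 * a * dt * dt
        # Recompute the acceleration at the new position.
        a_new = force(x) / mass
        # Velocity update averages the old and new accelerations.
        v = v + 0.5 * (a + a_new) * dt
        a = a_new
    return x, v


if __name__ == "__main__":
    # Hypothetical test system: unit mass on a unit-stiffness spring, F = -x.
    x, v = velocity_verlet(1.0, 0.0, lambda pos: -pos, 1.0, 0.01, 628)
    energy = 0.5 * v * v + 0.5 * x * x
    # After t = 6.28 (about one period) x should be close to cos(6.28).
    print(x, math.cos(6.28), energy)
```

A production code applies the same kind of update to every particle at every step, and the force evaluation dominates the cost, which is why the slides focus on vectorizing and offloading force calculations.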

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the baseline of NWChem 6.3rev2 on the Intel® Xeon® processor E5-2697 v2: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of July 2014. 32-node cluster benchmark.

[Chart: NWChem 6.3rev2 and 6.5 CCSD(T) Method, 32 Node Speed Up. Y-axis: Comparative Performance. Bar values: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A: 1.52X.]

NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV workload.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on the host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

Source: Intel measured results as of September 2014. 32-node cluster benchmark.

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms). Y-axis: Comparative Performance, 0 to 30. X-axis: 1 node, 8 nodes, 32 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Bar labels: 1, 1.2X, 2X, 2.1X; 6.8X, 7.9X, 12.2X, 13.1X; 20X, 24.2X, 27.2X, 32X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.
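When reading a cluster result such as "up to 32X on 32 nodes", it can help to separate the gain from adding nodes from the gain per node. A common way to do that is parallel efficiency: the measured speedup divided by the increase in node count. The Python sketch below shows that arithmetic. The function name and the sample numbers are hypothetical and are not values from the chart.

```python
def parallel_efficiency(speedup, nodes, baseline_nodes=1):
    """Return speedup divided by the growth in node count.

    speedup        -- measured performance relative to the baseline run
    nodes          -- number of nodes in the scaled run
    baseline_nodes -- number of nodes in the baseline run
    """
    if nodes <= 0 or baseline_nodes <= 0:
        raise ValueError("node counts must be positive")
    return speedup / (nodes / baseline_nodes)


if __name__ == "__main__":
    # Hypothetical example: a 24X speedup when going from 1 node to 32 nodes.
    print(parallel_efficiency(24.0, 32))
```

An efficiency of 1.0 means the speedup matches the node count. In the comparison above the baseline is an older processor without a coprocessor, so a figure near or above 1.0 there would reflect the hardware change as well as scaling.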

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1 workload.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on the host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

Source: Intel measured results as of September 2014. 2-node cluster benchmark.

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN. Y-axis: Comparative Performance. X-axis: 1 node, 2 nodes. Series: Intel® Xeon® processor E5-2697 v3 (baseline: 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T). Bar labels: 1, 1.52X, 1.94X, 2.61X.]

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: The simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup at higher node counts.

Source: Intel measured results as of March 2015. 32-node cluster benchmark. Marked "NEW".

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision). Y-axis: Comparative Performance, 0 to 3. X-axis: 1 node, 32 nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package). Bar labels: 1, 3X, 3.41X; 1, 3.05X, 3.6X; 0.9X. Annotation: "No testing on Tesla".]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Application LAMMPS

Description Simulation of molecular systems with classical models More at httplammpssandiagov

Availability Code In main LAMMPS repository Recipe Available here

Usage Model Load balancer offloads part of neighbor-list and non-bond force calculations to Intelreg Xeon Phitrade coprocessor for concurrent calculations with CPU

Highlights Improved results with Intelreg Xeonreg processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A Dynamic load balancing allows for concurrent Data transfer between host and coprocessor Calculations of neighbor-list non-bond bond and long-range

terms Same routines in LAMMPS Intel Package also run faster on CPU

Results Up to 168X performance improvement utilizing Intelreg Xeonreg processors and Intelreg Xeon Phitrade coprocessors with application optimization on a single node compared to the baseline configuration Performance gains continue to hold at 147X when scaling up to 32 nodes

LAMMPS Rhodopsin Benchmark 512K Atoms

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF AUGUST 2014

0

1

1 Node 32 Nodes

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS Baseline)

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS IA Package)

2S E5-2697 v3 + Intelreg Xeon Phitrade coprocessor 7110P7120A Turbo Off

(LAMMPS IA Package)

Co

mp

ara

tiv

e P

erf

orm

an

ce

LAMMPS Rhodopsin Benchmark Performance (Mixed Precision)

APPROVED FOR PUBLIC PRESENTATION 32 NODES CLUSTER BENCHMARK

For configuration details go here

1

127X

168X

1 107X

147X

Other names and brands may be claimed as the property of others

0

1

ERR161544 SRR034966_1 ERR000589 SRR002273_1

Intelreg Xeonreg processor E5-2697 v2 + 1 NVIDIA Tesla K40

Intelreg Xeonreg processor E5-2697 v3

21

Johns Hopkins Bowtie 2 Multiple workloads

Application Bowtie2 version 223 Intelreg AVX2 port

Description NVBowtie version 0993 Bowtie is a GPU-accelerated re-engineering of Bowtie2 a very widely used short-read aligner While being completely rewritten from scratch nvBowtie reproduces many (though not all) of the features of Bowtie2 httpnvlabsgithubionvbionvbowtie_pagehtml

Availability Code Available here Recipe Not available Check for future availability here

Usage Model ERR161544 SRR002273_1 HEK001(TGen) ERR000589_1 SRR033552_1 SRR034966_1 amp ERR024139_1

Highlights See more here

Results Bowtie2 running on the Intelreg Xeonreg processor E5-2697 v3 with Intelreg AVX2 port faster than NVBowtie running on the Intelreg Xeonreg processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads NVIDIA published data of K40 compared to Intelreg Xeonreg processor E5-2600 (6 cores) on one workload

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2015

[Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up; 1 node. Y-axis: Comparative Increase. Data labels: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode

Highlights: Results are identical to the unmodified run of BWA-ALN

Results: The Intel® Xeon® processor E5-2697 v2 and Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

Burrows-Wheeler Aligner (BWA-ALN) Human Genome

[Figure: BWA-ALN Speed Up; 1 node. Y-axis: Comparative Performance. 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1; 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X; 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X.]

[Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignments of nucleotide query sequences against a known nucleotide db volume set from the National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor to find the split (sweet spot) with maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
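The load-sharing idea in the usage model above, where a batch of concatenated queries is divided between the host processor and one or more coprocessors, can be sketched in Python as follows. This is only an illustration under stated assumptions: `split_queries` and the query names are hypothetical, the weights are the split ratios quoted in the usage model, and the mechanism the benchmark actually used to distribute queries is not described in this deck.

```python
def split_queries(queries, weights):
    """Split queries into consecutive chunks proportional to weights.

    Hypothetical helper for illustration only; it is not part of BLAST.
    """
    if not weights or any(w < 0 for w in weights) or sum(weights) == 0:
        raise ValueError("weights must be non-negative and not all zero")
    n = len(queries)
    total = sum(weights)
    exact = [n * w / total for w in weights]      # ideal (fractional) share
    counts = [int(x) for x in exact]              # floor of each share
    leftover = n - sum(counts)
    # Hand the remaining queries to the largest fractional remainders.
    by_remainder = sorted(range(len(weights)),
                          key=lambda i: exact[i] - counts[i],
                          reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    chunks, start = [], 0
    for count in counts:
        chunks.append(queries[start:start + count])
        start += count
    return chunks


# Hypothetical query identifiers standing in for the 100 NCBI queries.
queries = [f"query_{i:03d}" for i in range(100)]
host_part, coprocessor_part = split_queries(queries, [80, 20])
print(len(host_part), len(coprocessor_part))
```

Passing three weights, such as `[59, 23, 18]`, yields three chunks in the same way, which would correspond to a host plus two coprocessors if that is what the three-way split refers to.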

BLAST BLASTn v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

[Figure: BLASTn v30 Speed Up; 1 node. Y-axis: Comparative Performance. Data labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

[Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized.]

BLAST BLASTp v30

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor to find the split (sweet spot) with maximum speedup. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the simulation rate speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

[Figure: BLASTp v30 Speed Up; 1 node. Y-axis: Comparative Performance. Data labels: 1, 1.41X, 1.39X, 1.21X, 1.3X, 1.15X.]

Legal Information

All products, computer systems, dates, and figures described in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a single processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause them to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology capability. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. Consult your PC manufacturer for availability. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English)

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. All rights reserved.


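Legal Disclaimer: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. If you are considering purchasing systems or components, consult other sources of information to evaluate performance comprehensively. For more information on performance ratings of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or websites referenced in this document. Intel encourages you to visit the referenced websites, or other websites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning each a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English)

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English)

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English)

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability whatsoever regarding any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. When considering a purchase, you should consult other information and performance tests, including the performance of that product when combined with other products, to evaluate performance comprehensively.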
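The relative-performance calculation described in this disclaimer (the baseline platform is assigned 1.0 and every other platform's result is divided by the baseline platform's result) can be sketched in Python as below. The function name, platform names, and raw figures are hypothetical and chosen only for illustration; they are not measurements from this deck.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    `results` maps a platform name to a raw score where higher is better.
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {platform: value / base for platform, value in results.items()}


# Hypothetical raw throughput figures (higher is better), not measured data.
raw = {"platform_a": 40.0, "platform_b": 50.0, "platform_c": 74.0}
for platform, score in relative_performance(raw, "platform_a").items():
    print(f"{platform}: {score:.2f}X")
```

If the raw metric is a run time, where lower is better, the ratio would need to be inverted (baseline time divided by platform time) to obtain a speedup of the same form.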


[Figure legend: Intel® Xeon® processor E5-2697 v2 (baseline); Intel® Xeon® processor E5-2697 v2 (optimized); Xeon E5-2697 v2 (optimized) + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 (optimized) + NVIDIA K40 DPFP; Intel® Xeon® processor E5-2697 v3; Xeon E5-2697 v3 (optimized) + Intel® Xeon Phi™ coprocessor 7120A.]

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, compared to the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Offload processing on both, using the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code was optimized, delivered to the AMBER community (license holders), and is available as an update patch during code configuration. The benchmark information is at http://www.ks.uiuc.edu/Research/STMV

Results: Optimized Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A offload demonstrated up to 2.41X improved performance over the Intel® Xeon® processor E5-2697 v2. The optimized offload process demonstrated 1.07X increased performance compared to NVIDIA K40 performance.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

AMBER 14 Particle Mesh Ewald (PME) Tobacco Virus

[Figure: AMBER 14 PME Tobacco Virus, 1 Million Atoms; 1 node. Y-axis: Comparative Performance. Data labels: 1, 1.52X, 2X, 2.26X, 1.93X, 2.41X.]

[Figure legend: x-axis: 2 nodes, 3 nodes. Series: Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP.]

"Xeon E5-2697 v2" = Intel® Xeon® processor E5-2697 v2

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: Baseline is the Intel® Xeon® processor E5-2697 v2, host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks). Speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released double-precision code across the platforms: 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized, will be delivered to the AMBER community (license holders), and will be available as an update patch during code configuration.

Results: The optimized offload process demonstrated compelling cluster performance improvement of up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

AMBER 14 Particle Mesh Ewald (PME) Cellulose NPT

[Figure: AMBER PME Cellulose NPT (408K Atoms); 3-node cluster benchmark. Y-axis: Comparative Performance. Data labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.]

[Figure legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Data label: 1.56X.]

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e. simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run a full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: The symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE INTEL MEASURED RESULTS AS OF APRIL 2014

[Figure: GROMACS 512K H2O with RF Speed Up; 1 node. Y-axis: Comparative Performance. Data labels: 1, 1.03X, 1.72X, 1.79X.]

[Figure legend: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A 2. Y-axis: Comparative Performance.]

NWChem 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO (Language Extensions for Offload) and OpenMP

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application for the NWChem community, including at cluster scale.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A

SOURCE INTEL MEASURED RESULTS AS OF JULY 2014

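NWChem CCSD(T) Method

[Figure: 32-node cluster benchmark. Y-axis: Comparative Performance. NWChem 6.3 on Intel® Xeon® processor E5-2697 v2: 1; NWChem 6.5 on Intel® Xeon® processor E5-2697 v2: 1.24X; NWChem 6.5 on Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.52X.]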

[Figure legend: x-axis: 1 Node, 8 Nodes, 32 Nodes; y-axis from 0 to 30. Series: Intel® Xeon® processor E5-2697 v2 (Baseline 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).]

NAMD 2.10 Pre-Release STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).
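One way to read a cluster speedup such as the one quoted above is to divide it by the growth in node count. The Python sketch below does only that; `scaling_efficiency` is a hypothetical helper, and because the baseline here is a different configuration (an E5-2697 v2 node without a coprocessor), the ratio mixes hardware gains with scaling and should not be read as pure parallel efficiency.

```python
def scaling_efficiency(speedup, nodes, baseline_nodes=1):
    """Return speedup divided by the increase in node count.

    A value of 1.0 means the speedup matches the node-count increase.
    """
    if nodes <= 0 or baseline_nodes <= 0:
        raise ValueError("node counts must be positive")
    return speedup / (nodes / baseline_nodes)


# 32X on 32 nodes relative to a 1-node baseline, as quoted in the results.
print(scaling_efficiency(32.0, 32))
```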

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); 32-node cluster benchmark. Y-axis: Comparative Performance. Data labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

[Figure legend: x-axis: 1 Node, 2 Nodes. Series: Intel® Xeon® processor E5-2697 v3 (Baseline 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).]

NAMD 2.10 Pre-Release ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; 2-node cluster benchmark. Y-axis: Comparative Performance. Baseline: 1 node, 55 PPN. Data labels: 1, 1.52X, 1.94X, 2.61X.]

[Figure legend: x-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).]

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup and higher node counts.
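The dynamic load balancing mentioned above can be illustrated with a small Python sketch that re-estimates the share of work to offload so that host and coprocessor would finish at about the same time. This is an assumption-laden illustration, not the algorithm used by the LAMMPS Intel Package: the function name, the rate model, and the timings are all hypothetical.

```python
def rebalance_offload_fraction(fraction, host_time, coprocessor_time):
    """Estimate a new offload fraction from the last step's timings.

    `fraction` is the share of work sent to the coprocessor in the last step,
    and the two times are how long each side took on its share.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    if host_time <= 0 or coprocessor_time <= 0:
        raise ValueError("timings must be positive")
    host_rate = (1.0 - fraction) / host_time        # work per second on host
    coprocessor_rate = fraction / coprocessor_time  # work per second offloaded
    # Both sides finish together when work is split in proportion to rate.
    return coprocessor_rate / (host_rate + coprocessor_rate)


fraction = 0.5
for step in range(3):
    # Hypothetical timings: pretend the coprocessor is twice as fast as the host.
    host_time = (1.0 - fraction) / 1.0
    coprocessor_time = fraction / 2.0
    fraction = rebalance_offload_fraction(fraction, host_time, coprocessor_time)
    print(step, round(fraction, 3))
```

With noisy real timings, a balancer of this kind would typically smooth the updates rather than jump straight to the new estimate.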

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

[Figure: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision); 32-node cluster benchmark. Y-axis: Comparative Performance. Data labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X; "No testing on Tesla".]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculation of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

LAMMPS Rhodopsin Benchmark 512K Atoms


SOURCE INTEL MEASURED RESULTS AS OF AUGUST 2014

[Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision); 32-node cluster benchmark. Y-axis: Comparative Performance; x-axis: 1 Node, 32 Nodes. 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline): 1, 1; 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package): 1.27X, 1.07X; 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package): 1.68X, 1.47X.]



Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized.

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide db volume set from the National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the split that gives maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.
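
The load-sharing model above partitions a concatenated query set between the host processor and the coprocessor. A minimal sketch of such a partition follows, assuming the splits are read as 80/20 and 59/23/18 (each sums to the 100 queries). The function `split_queries` and the placeholder query IDs are hypothetical names for illustration; they are not part of BLAST or of the Intel recipe.

```python
from itertools import accumulate


def split_queries(queries, shares):
    """Partition `queries` into consecutive chunks proportional to integer `shares`."""
    total = sum(shares)
    n = len(queries)
    # Cumulative boundaries use integer arithmetic to avoid floating-point drift.
    bounds = [0] + [cum * n // total for cum in accumulate(shares)]
    return [queries[lo:hi] for lo, hi in zip(bounds, bounds[1:])]


# Placeholder query IDs (hypothetical), split 80/20 between host and coprocessor.
queries = [f"query_{i:03d}" for i in range(100)]
host_part, coprocessor_part = split_queries(queries, (80, 20))
print(len(host_part), len(coprocessor_part))
```

Passing three shares, for example `(59, 23, 18)`, yields three chunks in the same way. How each chunk is then dispatched to a device is outside the scope of this sketch.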


BLAST BLASTn v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

1 NODE

Figure: BLASTn v30 Speed Up; y-axis: Comparative Performance. Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.

Figure legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized.

BLAST BLASTp v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of protein query sequences against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed between the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor to find the split that gives maximum speedup (the sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 3/37 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

1 NODE

Figure: BLASTp v30 Speed Up; y-axis: Comparative Performance. Bar labels: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3; “Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

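Legal Information

All products, computer systems, dates, and figures described in this document are based on current expectations and are subject to change without notice. Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number.

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for the currently characterized errata.

Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of virtualization technology will vary depending on your hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html.

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html.

Requires a system with Intel® Turbo Boost Technology. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost.

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software to execute the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.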

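Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems and/or components and reflect the approximate performance of Intel products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. Buyers should consult other sources of information to evaluate the performance of systems or components they are considering purchasing. For more information on performance tests and on the performance of Intel products, see http://www.intel.com/performance.

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages all of its customers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that correlates with the performance improvements reported.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English).

The information in this document is provided as is, and no license, express or implied, by estoppel or otherwise, to any intellectual property rights is granted. Intel assumes no liability whatsoever for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.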
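
The disclaimer above describes how the relative-performance figures on the benchmark slides are derived: the baseline platform is assigned 1.0 and every other result is divided by the baseline's result. A minimal sketch of that normalization follows. The function name, the platform labels, and the input numbers are hypothetical (they are not measured data), and the `higher_is_better` switch for time-style metrics is an added assumption that the disclaimer does not mention.

```python
def relative_performance(results, baseline, higher_is_better=True):
    """Normalize benchmark results so that the baseline platform scores 1.0."""
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    if higher_is_better:
        # Throughput-style metric: a larger raw result means faster.
        return {name: value / base for name, value in results.items()}
    # Time-style metric: a smaller raw result means faster, so invert the ratio.
    return {name: base / value for name, value in results.items()}


# Hypothetical raw throughputs in arbitrary units, for illustration only.
throughput = {"baseline": 200.0, "optimized": 250.0, "with_coprocessor": 400.0}
print(relative_performance(throughput, "baseline"))
```

With elapsed times instead of throughputs, pass `higher_is_better=False` so that a shorter run still maps to a number above 1.0.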


Figure legend (x-axis: 2 nodes, 3 nodes): Intel® Xeon® processor E5-2697 v2 (baseline); Xeon E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A; Xeon E5-2697 v2 + NVIDIA K40 DPFP.

“Xeon E5-2697 v2” = Intel® Xeon® processor E5-2697 v2

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

Application: AMBER 14

Description: Biomolecular simulations (protein, DNA, RNA, virus, etc.). Full double precision (DPDP). More at http://ambermd.org

Availability: Code: Available as a patch. Recipe: Available here (Section 187 of the manual).

Usage Model: Baseline is on the Intel® Xeon® processor E5-2697 v2 host only (also measured in http://ambermd.org/gpus/benchmarks.htm#Benchmarks), and speedup is shown with offload processing on both the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A. Performance shown is for the released code, double precision across the platforms, with 50% of the workload on the host and 50% on the coprocessor.

Highlights: The code has been optimized and will be delivered to the AMBER community (license holders); it will be available as an update patch during code configuration.

Results: Optimized offload process demonstrated compelling cluster performance improvement of up to 26X over the baseline Intel® Xeon® processor E5-2697 v2.

AMBER 14 Particle Mesh Ewald (PME) Cellulose NPT

3 NODES CLUSTER BENCHMARK

Figure: AMBER PME Cellulose NPT (408K Atoms); y-axis: Comparative Performance. Bar labels: 1, 1.14X, 1.11X, 1.37X, 1.57X, 1.32X.

Figure legend: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar label: 1.56X.

GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX-intrinsics). Able to run the full simulation natively on the Intel® Xeon Phi™ coprocessor plus the host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel® Xeon Phi™ coprocessors.

Results: Symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.
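
To make "simulate the Newtonian equations of motion" concrete, here is a minimal sketch of one common time-integration scheme, velocity Verlet, applied to a single particle on a spring. It is an illustration only and is not GROMACS code; the function names, the spring force, and all parameter values are hypothetical choices made for this example.

```python
import math


def velocity_verlet(x, v, force, mass, dt, steps):
    """Integrate m * d2x/dt2 = force(x) for `steps` time steps of size `dt`."""
    f = force(x)
    for _ in range(steps):
        v_half = v + 0.5 * dt * f / mass   # half-step velocity update
        x = x + dt * v_half                # full-step position update
        f = force(x)                       # force at the new position
        v = v_half + 0.5 * dt * f / mass   # complete the velocity update
    return x, v


def spring_force(x, k=1.0):
    """Hooke's law restoring force for a spring of stiffness k."""
    return -k * x


# One particle with unit mass and stiffness, released from x = 1 at rest.
x_end, v_end = velocity_verlet(1.0, 0.0, spring_force, mass=1.0, dt=0.01, steps=1000)
energy = 0.5 * v_end**2 + 0.5 * x_end**2
print(f"x(10) = {x_end:.4f} (exact {math.cos(10.0):.4f}), energy = {energy:.6f}")
```

A production molecular dynamics code evaluates forces between many particles at every step, which is the part that the vectorization and coprocessor work described on this slide targets.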

1 NODE

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014

Figure: GROMACS 512K H2O with RF Speed Up; y-axis: Comparative Performance. Bar labels: 1, 1.03X, 1.72X, 1.79X.

Figure legend: NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A. Y-axis: Comparative Performance.

NWChem 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling application, including at cluster scale, for the NWChem community.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline:
1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2.
2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor 7120A.

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014

NWChem CCSD(T) Method

32 NODES CLUSTER BENCHMARK

Figure: NWChem CCSD(T) Method speed-up chart. Bar labels: 1, 1.24X, 1.52X.

Figure legend (x-axis: 1 Node, 8 Nodes, 32 Nodes; y-axis ticks: 0 to 30): Intel® Xeon® processor E5-2697 v2 (Baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T).

NAMD 2.10 Pre-Release STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

32 NODES CLUSTER BENCHMARK

Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); y-axis: Comparative Performance. Bar labels: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3

Figure legend (x-axis: 1 Node, 2 Nodes; y-axis ticks: 0 to 2): Intel® Xeon® processor E5-2697 v3 (Baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T).

NAMD 2.10 Pre-Release ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

2 NODES CLUSTER BENCHMARK

Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; y-axis: Comparative Performance. Bar labels: 1 (Baseline, 1 node, 55 PPN), 1.52X, 1.94X, 2.61X.

Figure legend (x-axis: 1 Node, 32 Nodes; y-axis ticks: 0 to 3): 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package).

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. Same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel® Xeon Phi™ coprocessor computations and MPI communications yield improved speedup and higher node counts.
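
The usage model above describes a load balancer that sends part of the work to the coprocessor so that both sides compute at the same time. The sketch below shows one simple way such a balancer could choose the offload fraction from measured per-step timings. It is an assumption made for illustration, not the algorithm used in the LAMMPS Intel package, and the function name, parameters, and timings are all hypothetical.

```python
def rebalance(fraction, t_host, t_coprocessor):
    """Return an offload fraction that would equalize host and coprocessor time.

    `fraction` is the share of work sent to the coprocessor in the last step;
    `t_host` and `t_coprocessor` are the times each side took for its share.
    Assumes time scales linearly with the amount of work on each side.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    rate_host = (1.0 - fraction) / t_host          # work per second on the host
    rate_coprocessor = fraction / t_coprocessor    # work per second when offloaded
    return rate_coprocessor / (rate_host + rate_coprocessor)


# Hypothetical timings: with a 50/50 split the host took three times as long.
new_fraction = rebalance(0.5, t_host=3.0, t_coprocessor=1.0)
print(new_fraction)
```

Under the linear assumption, repeating this adjustment each step keeps both sides finishing at about the same time even as the workload changes.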

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

32 NODES CLUSTER BENCHMARK

Figure: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision); y-axis: Comparative Performance. Bar labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. One configuration is marked "No testing on Tesla".

“Xeon E5-2697 v3” = Intel® Xeon® processor E5-2697 v3

“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. Same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

LAMMPS Rhodopsin Benchmark 512K Atoms

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014

32 NODES CLUSTER BENCHMARK

Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision); x-axis: 1 Node, 32 Nodes; y-axis: Comparative Performance. Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Bar labels: 1, 1.27X, 1.68X (1 node); 1, 1.07X, 1.47X (32 nodes).

Figure legend (x-axis workloads: ERR161544, SRR034966_1, ERR000589, SRR002273_1): Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3.

Johns Hopkins Bowtie 2 Multiple workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015

1 NODE

Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up; y-axis: Comparative Increase. Bar labels: 1, 1.87X, 1.59X, 1.08X, 88X.

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference data base).

Description BWA is a popular software package for mapping low-divergent sequences against a large reference genome such as the human genome More at httpbio-bwasourceforgenet

Availability Code Available here Recipe Available here

Usage Model Hybrid MPI + OpenMP using symmetric mode

Highlights Results are identical to the unmodified run of BWA-ALN

Results The Intelreg Xeonreg processor E5-2697 v2 and the Intelreg Xeon Phitrade coprocessor symmetric process demonstrated up to 186X improved performance over the baseline Intelreg Xeonreg processor E5-2697 v2

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

0

1

BWA-ALN Speed Up

2S Intelreg Xeonreg processor E5-2697 v2 (baseline BWA-ALN)

2S Intelreg Xeonreg processor E5-2697 v2 (optimized BWA-ALN)

2S Intelreg Xeonreg processor E5-2697 v2 + Intelreg Xeon Phitrade coprocessor

7120A

Co

mp

ara

tiv

e P

erf

orm

an

ce

APPROVED FOR PUBLIC PRESENTATION 1 NODE

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

Burrows-Wheeler Aligner (BWA-ALN) Human Genome

For configuration details go here

1

124X

186X

Other names and brands may be claimed as the property of others

0

1

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3

ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

23

Application Basic Local Alignment Search Tool (BLASTn) v30

Description Searching for alignment in nucleotide query sequences against a known nucleotide db volume set National Center for Biotechnology Information (NCBI) More at httpblastncbinlmnihgov

Availability Code Available here Recipe Available here

Usage Model 4 (multiple queries multiple db) 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intelreg Xeonreg processor and Intelreg Xeon Phitrade coprocessor for maximum speedup sweet spot Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 8020 and 592318

Highlights Throughput for this load sharing model has a small sweet spot for a sufficiently large query set

Results Compared to the baseline simulation rate speed up with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A heterogeneous model is 152X Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

BLAST BLASTn v30

Other names and brands may be claimed as the property of others SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

BLASTn v30 Speed Up

1 NODE

For configuration details go here

Co

mp

ara

tiv

e P

erf

orm

an

ce

1

152X

APPROVED FOR PUBLIC PRESENTATION

149X

122X

141X

126X

NEW

0

1

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

24

BLAST BLASTp v30

Other names and brands may be claimed as the property of others SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

Application Basic Local Alignment Search Tool (BLASTp) v30

Description Searching for alignment in protein query sequence against a known protein db volume set More at httpblastncbinlmnihgov

Availability Code Available here Recipe Available here

Usage Model 4 (multiple queries multiple db) 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to Intelreg Xeonreg processor and Intelreg Xeon Phitrade coprocessor for maximum speedup sweet spot Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 337 and 2857

Highlights Throughput for this offload model has a small sweet spot for a sufficiently large query set Throughput is limited due to GAT stage not parallelized

Results Compared to the baseline simulation rate speed up with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A heterogeneous model is 141X Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization

1 NODE

For configuration details go here

BLASTp v30 Speed Up

Co

mp

ara

tiv

e P

erf

orm

an

ce

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3 ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

141X

1

APPROVED FOR PUBLIC PRESENTATION

139X

121X 13X

115X

NEW

法務情報

本資料に記載されているすべての製品コンピューターシステム日付および数値は現在の予想に基づくものであり予告なく変更されることがあります インテルプロセッサーナンバーはパフォーマンスの指標ではありませんプロセッサーナンバーは 同一プロセッサーファミリー内の製品の機能を区別します異なるプロセッサーファミリー 間の機能の区別には用いません 詳細についてはhttpwwwintelcojpjpproductsprocessor_number を参照してください インテルreg プロセッサーチップセットおよびデスクトップボードにはエラッタと呼ばれる設計上の不具合が含まれている可能性があり公表されている仕様とは異なる動作をする場合があります現在確認済みのエラッタについてはインテルまでお問い合わせください インテルreg バーチャライゼーションテクノロジーを利用するには同テクノロジーに対応したインテルreg プロセッサーBIOSおよび仮想マシンモニター (VMM) を搭載したコンピューターシステムが必要です機能性性能もしくはその他のバーチャライゼーションテクノロジーの特長はご使用のハードウェアやソフトウェアの構成によって異なりますご利用になる OS によってはソフトウェアアプリケーションとの互換性がない場合があります各 PC メーカーにお問い合わせください 詳細についてはhttpwwwintelcojpcontentwwwjpjavirtualizationvirtualization-technologyhardware-assist-virtualization-technologyhtml を参照してください すべての条件下で絶対的なセキュリティーを提供できるコンピューターシステムはありませんインテルreg トラステッドエグゼキューションテクノロジー (インテルreg TXT) を利用するにはインテルreg バーチャライゼーションテクノロジーインテルreg TXT に対応したプロセッサーチップセットBIOSAuthenticated Code モジュールインテルreg TXT に対応した Measured Launched Environment (MLE) を搭載するコンピューターシステムが必要ですさらにインテルreg TXTを利用するにはシステムが TPM v1s を搭載している必要があります 詳細についてはhttpwwwintelcojpcontentwwwjpjadata-securitysecurity-overview-general-technologyhtml を参照してください インテルreg ターボブーストテクノロジーに対応したシステムが必要ですインテルreg ターボブーストテクノロジーおよびインテルreg ターボブーストテクノロジー 20 は一部のインテルreg プロセッサーでのみ利用可能です各 PC メーカーにお問い合わせください実際の性能はハードウェアソフトウェアシステム構成によって異なります詳細についてはhttpwwwintelcojpjptechnologyturboboost を参照してください インテルreg AES New Instructions (インテルreg AES-NI) を利用するにはインテルreg AES-NI に対応したプロセッサーを搭載したコンピューターシステムおよび命令を正しい手順で実行する他社製ソフトウェアが必要ですインテルreg AES-NI は一部のインテルreg プロセッサーで利用できます提供状況については各 PC メーカーなどにお問い合わせください詳細についてはhttpsoftwareintelcomen-usarticlesintel-advanced-encryption-standard-instructions-aes-ni (英語) を参照してください IntelインテルIntel ロゴIntel Inside ロゴXeonXeon InsideIntel Xeon Phi はアメリカ合衆国および またはその他の国における Intel Corporation の商標です copy 2012 Intel Corporation 無断での引用転載を禁じます

26

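Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. Buyers considering the purchase of systems or components should consult other sources of information to evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance. Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase. Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that correlates with the reported performance improvement. SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English). TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English). SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English). The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right. Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests such as SYSmark and MobileMark are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.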
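The relative-performance rule stated in the disclaimer above (assign 1.0 to the baseline, divide every platform's result by the baseline result) can be sketched in a few lines. This is a minimal illustration, not Intel's tooling: the function name, the `higher_is_better` switch for time-based metrics, and the sample numbers are all assumptions made for the example.

```python
def relative_performance(results, baseline, higher_is_better=True):
    """Normalize benchmark results so that the baseline platform scores 1.0.

    results: dict mapping platform name -> measured benchmark result.
    baseline: key of the platform used as the 1.0 reference.
    higher_is_better: True for throughput-style metrics (e.g. ns/day);
        False for elapsed-time metrics, where the ratio is inverted
        (this switch is an assumption, not part of the source text).
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    relative = {}
    for name, value in results.items():
        if value <= 0:
            raise ValueError(f"result for {name!r} must be positive")
        relative[name] = value / base if higher_is_better else base / value
    return relative


# Hypothetical throughput measurements, chosen only for illustration.
measured = {"baseline": 40.0, "optimized": 50.0, "with_coprocessor": 70.0}
scores = relative_performance(measured, "baseline")
```

With these made-up inputs the baseline maps to 1.0 and the other two platforms to 1.25 and 1.75, which is how the "1.24X" style labels on the benchmark slides should be read.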

27


GROMACS 512K H2O with RF

Application: GROMACS 5.0-RC1. Workload: 512K H2O with RF method.

Description: GROMACS is a versatile package to perform molecular dynamics, i.e., simulate the Newtonian equations of motion for systems with hundreds to millions of particles. It is one of the fastest and most popular molecular dynamics packages.

Availability: Code: Version 5.0-rc1 available here and here. Recipe: Available here.

Highlights: Highly optimized for Intel® Xeon® processors (AVX intrinsics). Able to run full simulation on Intel® Xeon Phi™ coprocessor natively + host processor using a symmetric model. Optimized with intrinsics for 512-bit vectorization on Intel Xeon Phi coprocessors.

Results: Symmetric process demonstrated up to 1.79X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

[Chart: GROMACS 512K H2O with RF Speed Up, 1 node; y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2; 1 Intel® Xeon Phi™ coprocessor 7120P/X; 2 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 1 Intel® Xeon Phi™ coprocessor 7120P/X; Intel® Xeon® processor E5-2697 v2 + 2 Intel® Xeon Phi™ coprocessor 7120P/X. Bar labels as extracted: 1.56X, 1.79X, 1, 1.03X, 1.72X.]

SOURCE: INTEL MEASURED RESULTS AS OF APRIL 2014. APPROVED FOR PUBLIC PRESENTATION.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

For configuration details go here.

Other names and brands may be claimed as the property of others.
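The GROMACS description above says molecular dynamics means simulating the Newtonian equations of motion. As a minimal sketch of what one such time-stepping loop looks like, the following integrates a single particle on a spring with the velocity Verlet scheme. It is an illustration only and is not GROMACS code: the one-dimensional harmonic force, the parameter values, and all function names are assumptions chosen so the example stays self-contained.

```python
def velocity_verlet(x, v, force, mass, dt, steps):
    """Advance position x and velocity v by `steps` time steps of size dt."""
    a = force(x) / mass
    for _ in range(steps):
        x = x + v * dt + 0.5 * a * dt * dt   # position update
        a_new = force(x) / mass              # force at the new position
        v = v + 0.5 * (a + a_new) * dt       # velocity update with averaged acceleration
        a = a_new
    return x, v


# Toy system (assumed for illustration): unit mass on a unit-stiffness spring.
K_SPRING = 1.0
MASS = 1.0


def spring_force(x):
    return -K_SPRING * x


def total_energy(x, v):
    return 0.5 * MASS * v * v + 0.5 * K_SPRING * x * x


x0, v0 = 1.0, 0.0
x1, v1 = velocity_verlet(x0, v0, spring_force, MASS, dt=0.01, steps=1000)
energy_drift = abs(total_energy(x1, v1) - total_energy(x0, v0))
```

A production package applies the same kind of update to every particle, with forces from bonded and non-bonded interactions; the near-constant total energy of this toy run is the property that makes such integrators suitable for long simulations.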

NWChem CCSD(T) Method

Application: NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality. NWChem is developed by the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL). More at http://www.nwchem-sw.org

Availability: Code: Available here and from the SVN repository. Recipe: Available here.

Usage Model: Offload using LEO and OpenMP.

Highlights: NWChem with Intel® Xeon Phi™ coprocessor 7120A offloading is a compelling and cluster compelling application for the NWChem community.

Results: Compared to the NWChem 6.3rev2 and Intel® Xeon® processor E5-2697 v2 baseline: 1) NWChem 6.5 CCSD(T) performed up to 1.24X faster with the Intel® Xeon® processor E5-2697 v2; 2) NWChem 6.5 CCSD(T) performed up to 1.52X faster with the Intel® Xeon® processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A.

[Chart: NWChem 6.3rev2 and 6.5 CCSD(T) Method 32 Node Speed Up; y-axis: Comparative Performance. NWChem 6.3, 64S Intel® Xeon® processor E5-2697 v2: 1; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2: 1.24X; NWChem 6.5, 64S Intel® Xeon® processor E5-2697 v2 + 64 Intel® Xeon Phi™ coprocessor 7120A: 1.52X.]

SOURCE: INTEL MEASURED RESULTS AS OF JULY 2014. 32 NODES CLUSTER BENCHMARK.

NAMD 2.10 Pre-Release STMV

Application: NAMD 2.10 pre-release, STMV.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms); y-axis: Comparative Performance (0 to 30); x-axis: 1 Node, 8 Nodes, 32 Nodes. Series: Intel® Xeon® processor E5-2697 v2 (Baseline, 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Bar labels as extracted: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 32 NODES CLUSTER BENCHMARK.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3
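One way to read a cluster figure such as the 32X reported for STMV on 32 nodes is to divide the speedup by the node count. The sketch below does only that. The function name is an assumption, and the ratio is not a pure scaling efficiency here, because the slide's baseline is a single E5-2697 v2 node while the 32-node result uses E5-2697 v3 hosts with coprocessors.

```python
def speedup_per_node(speedup, nodes):
    """Return the speedup over the single-node baseline divided by node count.

    A value of 1.0 means each node contributes as much as the baseline node;
    values above 1.0 are possible when the scaled-out nodes use faster
    hardware than the baseline, as on the STMV slide.
    """
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return speedup / nodes


# Figure taken from the Results line above: up to 32X on 32 nodes.
stmv_ratio = speedup_per_node(32.0, 32)
```

For the STMV result the ratio is 1.0, meaning the hardware upgrade and coprocessor offload together offset whatever parallel overhead 32 nodes introduce.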

NAMD 2.10 Pre-Release ApoA1

Application: NAMD 2.10 pre-release, ApoA1.

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release. Use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

[Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN; y-axis: Comparative Performance (0 to 2); x-axis: 1 Node, 2 Nodes. Series: Intel® Xeon® processor E5-2697 v3 (Baseline, 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T). Bar labels as extracted: 1, 1.52X, 1.94X, 2.61X.]

SOURCE: INTEL MEASURED RESULTS AS OF SEPTEMBER 2014. 2 NODES CLUSTER BENCHMARK.

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of neighbor-list and non-bond force calculations to Intel® Xeon Phi™ coprocessor for concurrent calculations with CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. Same routines in LAMMPS Intel Package also run faster on CPU.

Results: Simulation rate increase with Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup at higher node counts.

[Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision); y-axis: Comparative Performance (0 to 3); x-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package). Bar labels as extracted: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X, "No testing on Tesla".]

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015. NEW. 32 NODES CLUSTER BENCHMARK.

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
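The usage model on the LAMMPS slide describes a load balancer that offloads part of the work to the coprocessor so that both sides compute concurrently. One simple way such a balancer could work is to adjust the offloaded fraction after every step until host and coprocessor finish at the same time. The sketch below shows that feedback loop under stated assumptions: it is not the LAMMPS Intel package's algorithm, and the function names, the damping factor, and the toy timing model are all hypothetical.

```python
def rebalance(fraction, t_cpu, t_cop, damping=0.5):
    """Return an updated offload fraction from the last step's timings.

    fraction: share of the work sent to the coprocessor in the last step.
    t_cpu, t_cop: time the host and the coprocessor spent on their shares.
    """
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must lie strictly between 0 and 1")
    if t_cpu <= 0 or t_cop <= 0:
        raise ValueError("timings must be positive")
    cpu_rate = (1.0 - fraction) / t_cpu   # work per unit time on the host
    cop_rate = fraction / t_cop           # work per unit time on the coprocessor
    target = cop_rate / (cpu_rate + cop_rate)  # split that equalizes finish times
    updated = fraction + damping * (target - fraction)
    return min(max(updated, 0.01), 0.99)  # keep both sides busy


def simulate(cpu_speed, cop_speed, steps=20, fraction=0.5):
    """Toy timing model: time is proportional to assigned work over speed."""
    for _ in range(steps):
        t_cpu = (1.0 - fraction) / cpu_speed
        t_cop = fraction / cop_speed
        fraction = rebalance(fraction, t_cpu, t_cop)
    return fraction


# Hypothetical device speeds: the coprocessor is three times faster.
balanced_fraction = simulate(cpu_speed=1.0, cop_speed=3.0)
```

With these made-up speeds the fraction converges toward 0.75, the point where neither device waits for the other. The damping term is a design choice that keeps noisy timings from causing large swings between steps.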

LAMMPS Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of neighbor-list and non-bond force calculations to Intel® Xeon Phi™ coprocessor for concurrent calculations with CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. Same routines in LAMMPS Intel Package also run faster on CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

[Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision); y-axis: Comparative Performance; x-axis: 1 Node, 32 Nodes. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Bar labels as extracted: 1, 1.27X, 1.68X (1 Node); 1, 1.07X, 1.47X (32 Nodes).]

SOURCE: INTEL MEASURED RESULTS AS OF AUGUST 2014. 32 NODES CLUSTER BENCHMARK.

Johns Hopkins Bowtie 2, Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port.

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1.

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of K40 compared to Intel® Xeon® processor E5-2600 (6 cores) on one workload.

[Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up, 1 node; y-axis: Comparative Increase; x-axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Bar labels as extracted: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2015. NEW.

Burrows-Wheeler Aligner (BWA-ALN), Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference data base).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

[Chart: BWA-ALN Speed Up, 1 node; y-axis: Comparative Performance. 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1; 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X; 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X.]

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014.

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30.

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for maximum speedup sweet spot. Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 8020 and 592318.

Highlights: Throughput for this load sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

[Chart: BLASTn v30 Speed Up, 1 node; y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels as extracted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015. NEW.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A
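The BLASTn usage model distributes a batch of queries between the host processor and the coprocessor and looks for a "sweet spot" split. A rough sketch of how such a split could be chosen is to try every possible host share and keep the one with the shortest overall finish time. The code below is an illustration under stated assumptions, not the benchmark's recipe: the function name, the per-device query rates, and the assumption that every query costs the same are all hypothetical.

```python
def best_split(n_queries, host_rate, cop_rate):
    """Return (host_count, makespan) for the split that finishes earliest.

    n_queries: total number of queries in the batch.
    host_rate, cop_rate: queries per unit time each device can process
        (hypothetical inputs; equal cost per query is assumed).
    """
    if n_queries < 0:
        raise ValueError("n_queries must not be negative")
    if host_rate <= 0 or cop_rate <= 0:
        raise ValueError("rates must be positive")
    best_count, best_time = 0, n_queries / cop_rate
    for host_count in range(1, n_queries + 1):
        host_time = host_count / host_rate
        cop_time = (n_queries - host_count) / cop_rate
        makespan = max(host_time, cop_time)  # both devices run concurrently
        if makespan < best_time:
            best_count, best_time = host_count, makespan
    return best_count, best_time


# Hypothetical rates: the host processes queries four times faster.
queries = [f"query_{i}" for i in range(100)]
host_count, finish_time = best_split(len(queries), host_rate=4.0, cop_rate=1.0)
host_batch, coprocessor_batch = queries[:host_count], queries[host_count:]
```

With these made-up rates the search settles on 80 queries for the host and 20 for the coprocessor. Moving even one query either way lengthens the finish time, which is one reading of why the slide calls the sweet spot small.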

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30.

Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for maximum speedup sweet spot. Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

[Chart: BLASTp v30 Speed Up, 1 node; y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels as extracted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015. NEW.

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3; "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A


Page 16: インシリコ創薬時代の 最新チップとアプリの開発状況biogrid.jp/pdf/20150530bg-1.pdf512 bit SIMD instructions スレッドあたり32 個のベクトルレジスター

16

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

0

1

NWChem 63 64S Intelreg Xeonreg processor E5-2697 v2

NWChem 65 64S Intelreg Xeonreg processor E5-2697 v2

NWChem 65 64S Intelreg Xeonreg processor E5-2697 v2 + 64 Intelreg Xeon

Phitrade Coprocessor 7120A 2

Co

mp

ara

tiv

e P

erf

orm

an

ce

NWChem 63rev2 and 65 CCSD(T) Method 32 Node Speed Up Application NWChem is a computational chemistry software package that includes quantum chemical and molecular dynamics functionality NWChem is developed the Environmental Molecular Sciences Laboratory (EMSL) at the Pacific Northwest National Laboratory (PNNL) More at httpwwwnwchem-sworg

Availability Code Available here and from the SVN repository Recipe Available here

Usage Model Offload using LEO and OpenMP

Highlights NWChem with Intelreg Xeon Phitrade coprocessor 7120A offloading is a compelling and cluster compelling application for the NWChem community

Results Compared to the NWChem 63rev2 and Intelreg Xeonreg processor E5-2697 v2 baseline 1) NWChem 65 CCSD(T) performed up to 124X faster with the

Intelreg Xeonreg processor E5-2697 v2 2) NWChem 65 CCSD(T) performed up to 152X faster with the

Intelreg Xeonreg processor E5-2697 v2 and the Intel Xeon Phi coprocessor 7120A

SOURCE INTEL MEASURED RESULTS AS OF JULY 2014

APPROVED FOR PUBLIC PRESENTATION 32 NODES

NWChem CCSD(T) Method

CLUSTER BENCHMARK

For configuration details go here

1

124X

152X

Other names and brands may be claimed as the property of others

0

5

10

15

20

25

30

1 Node 8 Nodes 32 Nodes

Intelreg Xeonreg processor E5-2697 v2 (Baseline 1 node 23 or 47 PPN)

Intelreg Xeonreg processor E5-2697 v3 (27 or 55 PPN)

Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intelreg Xeon Phitrade coprocessor 7120A (240T)

Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intelreg Xeon Phitrade coprocessor 7110A (240T)

17

NAMD 210 Pre-Release STMV

Application NAMD 210 pre-release STMV

Description A parallel object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems More at httpwwwksuiuceduResearchnamd

Availability Code Intelreg Xeon Phitrade coprocessor support is

available as a pre-release Use the nightly build Recipe Available here

Usage Model Single rank on host with 47 threads Various computations are offloaded to Intelreg Xeon Phitrade coprocessor from each thread

Highlights Intelreg Xeon Phitrade coprocessor support is now in the development branch of NAMD 210 pre-release

Results For the STMV workload the Intelreg Xeonreg processor E5-2697 v3 and the Intelreg Xeon Phitrade coprocessor (32 nodes 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node 47 PPN)

APPROVED FOR PUBLIC PRESENTATION

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF SEPTEMBER 2014

32 NODES CLUSTER BENCHMARK

For configuration details go here

NAMD 210 (pre-release) Cluster Performance Increase STMV (~1M atoms)

Co

mp

ara

tiv

e P

erf

orm

an

ce

1 2X

68X

122X

272X

12X 21X

79X

131X

20X

32X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3

242X

NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

Source: Intel measured results as of September 2014. 2-node cluster benchmark. Approved for public presentation.

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN. X-axis: 1 Node, 2 Nodes. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v3 (baseline: 1 node, 55 PPN); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T). Bar labels in extraction order: 1, 1.52X, 1.94X, 2.61X.]

LAMMPS Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts.

Source: Intel measured results as of March 2015. 32-node cluster benchmark.

[Figure: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision). X-axis: 1 Node, 32 Nodes. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package). Bar labels in extraction order: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. Annotation: No testing on Tesla.]

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.
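The Usage Model above describes a load balancer that shares work between the CPU and the coprocessor so that both compute concurrently. The slides do not say how the share is chosen, so the following Python sketch is only an illustration under assumptions: the function names (`rebalance`, `simulate`), the `damping` parameter, and the proportional-rate model are all hypothetical and are not taken from the LAMMPS Intel Package.

```python
def rebalance(fraction, t_host, t_coproc, damping=0.5):
    """Return an updated share of work to offload to the coprocessor.

    fraction -- share of the work offloaded in the last step (0 < fraction < 1)
    t_host   -- seconds the host needed for its share (1 - fraction)
    t_coproc -- seconds the coprocessor needed for its share (fraction)
    damping  -- hypothetical smoothing factor; 1.0 jumps straight to the target
    """
    if not 0.0 < fraction < 1.0 or t_host <= 0.0 or t_coproc <= 0.0:
        raise ValueError("need 0 < fraction < 1 and positive timings")
    # Estimate how much work per second each side handled in the last step.
    rate_host = (1.0 - fraction) / t_host
    rate_coproc = fraction / t_coproc
    # Share at which both sides would finish at the same moment.
    target = rate_coproc / (rate_host + rate_coproc)
    updated = fraction + damping * (target - fraction)
    # Keep some work on both sides so both rates stay measurable.
    return min(0.95, max(0.05, updated))


def simulate(steps=20, host_speed=1.0, coproc_speed=2.0, fraction=0.5):
    """Toy driver: device speeds are made-up numbers, not measurements."""
    for _ in range(steps):
        t_host = (1.0 - fraction) / host_speed
        t_coproc = fraction / coproc_speed
        fraction = rebalance(fraction, t_host, t_coproc)
    return fraction


if __name__ == "__main__":
    print(f"offload fraction after 20 steps: {simulate():.4f}")
```

In this toy model the coprocessor is assumed to be twice as fast as the host, so the offload share moves from 0.5 toward 2/3, the point where neither side waits for the other.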

LAMMPS Rhodopsin Benchmark: 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: Load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Source: Intel measured results as of August 2014. 32-node cluster benchmark.

[Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision). X-axis: 1 Node, 32 Nodes. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package). Bar labels in extraction order: 1, 1.27X, 1.68X, 1, 1.07X, 1.47X.]

Johns Hopkins Bowtie 2: Multiple Workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Source: Intel measured results as of January 2015. 1 node.

[Figure: Johns Hopkins Bowtie 2 TGen Workload Speed Up. X-axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Y-axis: Comparative Increase. Series: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3. Bar labels in extraction order: 1, 1.87X, 1.59X, 1.08X, 0.88X.]

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB 30 GB reference data base).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of January 2014. 1 node.

[Figure: BWA-ALN Speed Up. Y-axis: Comparative Performance. Series: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A. Bar labels in extraction order: 1, 1.24X, 1.86X.]

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for maximum speedup (sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split 8020 and 592318.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline, the simulation rate speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node.

[Figure: BLASTn v30 Speed Up. Y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels in extraction order: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.
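The Usage Model above distributes a fixed set of queries between the processor and the coprocessor and reports a narrow "sweet spot" split. As a rough illustration of why such a sweet spot exists, the Python sketch below enumerates every split and keeps the one with the shortest elapsed time. The function name `best_split`, the constant-throughput model, and the rates used are assumptions made for this example; they are not measurements or code from the benchmark.

```python
def best_split(n_queries, rate_host, rate_coproc):
    """Return (queries_on_coprocessor, elapsed_seconds) for the fastest split.

    Assumes each device processes its queries at a constant rate
    (queries per second) and that both devices run concurrently, so the
    elapsed time is set by whichever device finishes last.
    """
    best = None
    for on_coproc in range(n_queries + 1):
        on_host = n_queries - on_coproc
        elapsed = max(on_host / rate_host, on_coproc / rate_coproc)
        if best is None or elapsed < best[1]:
            best = (on_coproc, elapsed)
    return best


if __name__ == "__main__":
    # Hypothetical rates: host 3 queries/s, coprocessor 1 query/s.
    n, host, coproc = 100, 3.0, 1.0
    on_coproc, elapsed = best_split(n, host, coproc)
    host_only = n / host
    print(f"offload {on_coproc} of {n} queries, "
          f"speedup vs host only: {host_only / elapsed:.2f}X")
```

Moving even a few queries away from the best split leaves one device idle while the other finishes, which is one plausible reading of the "small sweet spot" remark in the Highlights.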

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for maximum speedup (sweet spot). The experiment was repeated 20 times with the pick of queries randomized, for a sweet spot split 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline, the simulation rate speed up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015. 1 node.

[Figure: BLASTp v30 Speed Up. Y-axis: Comparative Performance. Series: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Bar labels in extraction order: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

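Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

A system that supports Intel® Turbo Boost Technology is required. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer or other vendor. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (in English).

Intel, インテル, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.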

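Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information as well and evaluate performance as a whole. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or websites referenced in this document. Intel encourages you to visit the referenced websites, or other websites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number in proportion to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (in English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (in English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (in English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability whatsoever for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests as well, including the performance of the product when combined with other products, and evaluate performance as a whole.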
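The disclaimer above describes how the relative performance figures ("1.52X" and so on) are derived: the baseline platform is assigned 1.0 and every other result is divided by the baseline result. A minimal Python sketch of that normalization follows. The function name `relative_performance`, the `higher_is_better` switch for time-based results, and the sample numbers are assumptions added for illustration and do not come from the slides.

```python
def relative_performance(results, baseline, higher_is_better=True):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results          -- mapping of platform name to raw benchmark result
    baseline         -- name of the platform assigned the value 1.0
    higher_is_better -- True for rates (e.g. ns/day); False for elapsed times,
                        where the ratio is inverted (an assumption of this sketch)
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    if higher_is_better:
        return {name: value / base for name, value in results.items()}
    return {name: base / value for name, value in results.items()}


if __name__ == "__main__":
    # Hypothetical throughput figures; not taken from the slides.
    throughput = {
        "cpu_baseline": 2.0,
        "cpu_optimized": 2.5,
        "cpu_plus_coprocessor": 3.5,
    }
    for name, score in relative_performance(throughput, "cpu_baseline").items():
        print(f"{name}: {score:.2f}X")
```

Because every value is a ratio to the same baseline, the numbers on one chart are comparable with each other but not with charts that use a different baseline platform.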


NAMD 2.10 Pre-Release: STMV

Application: NAMD 2.10 pre-release, STMV

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd/

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: Available here.

Usage Model: Single rank on host with 47 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the STMV workload, the Intel® Xeon® processor E5-2697 v3 and the Intel® Xeon Phi™ coprocessor (32 nodes, 55 PPN) improved performance by up to 32X compared to the baseline processor (1 node, 47 PPN).

Source: Intel measured results as of September 2014. 32-node cluster benchmark.

[Figure: NAMD 2.10 (pre-release) Cluster Performance Increase, STMV (~1M atoms). X-axis: 1 Node, 8 Nodes, 32 Nodes. Y-axis: Comparative Performance. Series: Intel® Xeon® processor E5-2697 v2 (baseline: 1 node, 23 or 47 PPN); Intel® Xeon® processor E5-2697 v3 (27 or 55 PPN); Xeon E5-2697 v2 (23 or 47 PPN) + 1 Intel® Xeon Phi™ coprocessor 7120A (240T); Xeon E5-2697 v3 (27 or 55 PPN) + 1 Intel® Xeon Phi™ coprocessor 7110A (240T). Bar labels in extraction order: 1, 2X, 6.8X, 12.2X, 27.2X, 1.2X, 2.1X, 7.9X, 13.1X, 20X, 32X, 24.2X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3.






0

1

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3

"Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searches for alignments of nucleotide query sequences against a known nucleotide db volume set from the National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart: BLASTn v30 Speed Up (1 node). Y-axis: Comparative Performance. Bar labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X. For configuration details go here.

Source: Intel measured results as of March 2015. Approved for public presentation. Other names and brands may be claimed as the property of others.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance
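The load-sharing usage model above can be illustrated with a minimal sketch. It assumes the split figures are query counts per device (they add up to the 100 queries), and that the three-way split covers a host plus two coprocessors; the device labels, query names, and the `split_queries` helper are all hypothetical and are not part of BLAST or of the Intel recipe.

```python
from typing import Dict, List, Sequence


def split_queries(queries: Sequence[str], shares: Dict[str, int]) -> Dict[str, List[str]]:
    """Partition queries into contiguous chunks sized by integer shares.

    `shares` maps a device label to the number of queries it receives;
    the counts must add up to len(queries).
    """
    if sum(shares.values()) != len(queries):
        raise ValueError("shares must add up to the number of queries")
    chunks: Dict[str, List[str]] = {}
    start = 0
    for device, count in shares.items():
        chunks[device] = list(queries[start:start + count])
        start += count
    return chunks


# Hypothetical query names standing in for the 100 concatenated NCBI queries.
queries = [f"query_{i:03d}" for i in range(100)]

# 80/20 split between the host CPU and one coprocessor (labels are illustrative).
two_way = split_queries(queries, {"xeon": 80, "xeon_phi": 20})

# 59/23/18 split, assuming a host plus two coprocessors.
three_way = split_queries(queries, {"xeon": 59, "xeon_phi_0": 23, "xeon_phi_1": 18})
```

Each chunk would then be written to its own query file and searched on the corresponding device. Finding the sweet spot means trying different share values until the host and the coprocessor finish at about the same time.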

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searches for alignments of a protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and the Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times with the pick of queries randomized, for sweet-spot splits of 33/7 and 28/5/7.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart: BLASTp v30 Speed Up (1 node). Y-axis: Comparative Performance. Bar labels: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X. For configuration details go here.

Legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

Source: Intel measured results as of March 2015.
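The remark that throughput is limited by a stage that is not parallelized is the situation described by Amdahl's law. The sketch below shows the arithmetic only; the fractions and speedups are made-up inputs, not measurements from the BLASTp runs, and the function name is hypothetical.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only part of the runtime is accelerated.

    parallel_fraction: share of the original runtime that is accelerated (0..1).
    parallel_speedup:  speedup factor applied to that share (> 0).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / parallel_speedup)


# Illustrative inputs: half of the runtime is accelerated 2x,
# the other half (the serial stage) is unchanged.
example = amdahl_speedup(0.5, 2.0)

# Even an enormous speedup of the accelerated half cannot
# push the overall speedup past 1 / serial_fraction = 2.
near_limit = amdahl_speedup(0.5, 1e9)
```

With a fixed serial share, adding more parallel hardware gives diminishing returns, which is why parallelizing a previously serial section such as output formatting can raise the ceiling.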

Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within a processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

A system with Intel® Turbo Boost Technology capability is required. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (English)

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries. © 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.

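Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance comprehensively. For more information on the performance evaluation of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or others where similar performance benchmarks are reported, and to confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that is proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (English)

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (English)

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (English)

The information in this document is provided "as is". No license to any intellectual property right is granted, whether express or implied, by estoppel or otherwise. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with those factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance comprehensively.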
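The relative-performance rule in the disclaimer (baseline set to 1.0, every other result divided by the baseline result) can be sketched as below. The platform names and throughput values are invented for illustration and are not taken from the slides; the sketch also assumes a higher-is-better metric such as throughput, so a runtime would need to be inverted first.

```python
from typing import Dict


def relative_performance(results: Dict[str, float], baseline: str) -> Dict[str, float]:
    """Normalize higher-is-better results so the baseline platform scores 1.0."""
    base = results[baseline]
    if base <= 0.0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical throughput measurements (for example, queries per hour).
measured = {
    "baseline": 200.0,
    "optimized": 250.0,
    "optimized_plus_coprocessor": 300.0,
}

relative = relative_performance(measured, baseline="baseline")
```

A bar label such as "1.5X" on the charts corresponds to a normalized value of 1.5 under this rule.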


NAMD 2.10 Pre-Release: ApoA1

Application: NAMD 2.10 pre-release, ApoA1

Description: A parallel, object-oriented molecular dynamics code designed for high-performance simulation of large biomolecular systems. More at http://www.ks.uiuc.edu/Research/namd

Availability: Code: Intel® Xeon Phi™ coprocessor support is available as a pre-release; use the nightly build. Recipe: available here.

Usage Model: Single rank on host with 55 threads. Various computations are offloaded to the Intel® Xeon Phi™ coprocessor from each thread.

Highlights: Intel® Xeon Phi™ coprocessor support is now in the development branch of NAMD 2.10 pre-release.

Results: For the ApoA1 workload, 2-node performance can be accelerated by up to 2.61X using a single Intel® Xeon Phi™ coprocessor.

Chart: NAMD 2.10 (pre-release) Cluster Performance Increase, ApoA1 (~92K atoms), 55 PPN (2-node cluster benchmark). X-axis: 1 Node, 2 Nodes. Y-axis: Comparative Performance. Bar labels: 1 (baseline, 1 node, 55 PPN), 1.52X, 1.94X, 2.61X. For configuration details go here.

Legend: Intel® Xeon® processor E5-2697 v3 (baseline, 1 node); Intel® Xeon® processor E5-2697 v3 + Intel® Xeon Phi™ coprocessor B1-7110A (240T)

Source: Intel measured results as of September 2014.

LAMMPS: Stillinger-Weber Water Benchmark

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Simulation rate increase with the Intel® Package is up to 3.6X. Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts.

Chart: LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision) (32-node cluster benchmark). X-axis: 1 Node, 32 Nodes. Y-axis: Comparative Performance. Bar labels: 1, 3X, 3.41X, 1, 3.05X, 3.6X, 0.9X. Annotation: No testing on Tesla. For configuration details go here.

Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S Xeon E5-2697 v3 + Tesla K40c, boost off, ECC on; 2S Xeon E5-2697 v3 + Xeon Phi 7120A, turbo off (LAMMPS IA Package)

"Xeon E5-2697 v3" = Intel® Xeon® processor E5-2697 v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

Source: Intel measured results as of March 2015.
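The idea behind offloading "part of" the work so that CPU and coprocessor compute concurrently can be sketched with a simple timing model. This is not the balancer implemented in the LAMMPS Intel Package; it is a generic illustration, and the function names and timings are hypothetical.

```python
def balanced_fraction(host_cost: float, coproc_cost: float) -> float:
    """Fraction of the work to offload so both devices finish together.

    Costs are seconds per unit of work. Solving
    fraction * coproc_cost == (1 - fraction) * host_cost gives the result.
    """
    if host_cost <= 0.0 or coproc_cost <= 0.0:
        raise ValueError("costs must be positive")
    return host_cost / (host_cost + coproc_cost)


def rebalance(fraction: float, host_elapsed: float, coproc_elapsed: float) -> float:
    """Update the offload fraction from the wall times of the previous step."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be strictly between 0 and 1")
    # Estimate each device's cost per unit of work from the share it handled.
    host_cost = host_elapsed / (1.0 - fraction)
    coproc_cost = coproc_elapsed / fraction
    return balanced_fraction(host_cost, coproc_cost)


# Illustrative step: with a 50/50 split the host took 3.0 s and the
# coprocessor 1.0 s, so more work should move to the coprocessor.
new_fraction = rebalance(0.5, host_elapsed=3.0, coproc_elapsed=1.0)
```

Repeating the update every few timesteps lets the split follow changes in the workload. A real balancer would also account for data-transfer time and would typically smooth the updates.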

LAMMPS: Rhodopsin Benchmark, 512K Atoms

Application: LAMMPS

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: in the main LAMMPS repository. Recipe: available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculation with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent data transfer between host and coprocessor and concurrent calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Chart: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision) (32-node cluster benchmark). X-axis: 1 Node, 32 Nodes. Y-axis: Comparative Performance. Bar labels: 1 node: 1, 1.27X, 1.68X; 32 nodes: 1, 1.07X, 1.47X. For configuration details go here.

Legend: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline); 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package); 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package)

Source: Intel measured results as of August 2014.

Johns Hopkins Bowtie 2: Multiple workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: available here. Recipe: not available; check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data for the K40 compared to an Intel® Xeon® processor E5-2600 (6 cores) on one workload.

Chart: Johns Hopkins Bowtie 2 TGen Workload Speed Up (1 node). X-axis: ERR161544, SRR034966_1, ERR000589, SRR002273_1. Y-axis: Comparative Increase. Bar labels: 1, 1.87X, 1.59X, 1.08X, 0.88X. For configuration details go here.

Legend: Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40; Intel® Xeon® processor E5-2697 v3

Source: Intel measured results as of January 2015.

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference data base).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor running as a symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Chart: BWA-ALN Speed Up (1 node). Y-axis: Comparative Performance. Bar labels: 1, 1.24X, 1.86X. For configuration details go here.

Legend: 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN); 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A

Source: Intel measured results as of January 2014.


Page 19: インシリコ創薬時代の 最新チップとアプリの開発状況biogrid.jp/pdf/20150530bg-1.pdf512 bit SIMD instructions スレッドあたり32 個のベクトルレジスター

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

0

1

2

3

1 Node 32 Nodes

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS Baseline)

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS IA Package)

2S Xeon E5-2697 v3 + Tesla K40c boost off ECC on

2S Xeon E5-2697 v3 + Xeon Phi 7120A turbo off (LAMMPS IA Package)

19

LAMMPS Stillinger-Weber Water Benchmark

Application LAMMPS

Description Simulation of molecular systems with classical models More at httplammpssandiagov

Availability Code In main LAMMPS repository Recipe Available here

Usage Model Load balancer offloads part of neighbor-list and non-bond force calculations to Intelreg Xeon Phitrade coprocessor for concurrent calculations with CPU

Highlights Improved results with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A Dynamic load balancing allows for concurrent Data transfer between host and coprocessor Calculations of neighbor-list non-bond bond and long-

range terms Same routines in LAMMPS Intel Package also run faster on CPU

Results Simulation rate increase with Intelreg Package is up to 36X Concurrent Intel Xeon Phi coprocessor computations and MPI communications yield improved speedup and higher node counts

SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

LAMMPS Liquid Crystal Benchmark Performance (Mixed Precision)

Co

mp

ara

tiv

e P

erf

orm

an

ce

For configuration details go here

1

3X

341X

1

305X

36X

Other names and brands may be claimed as the property of others

ldquoXeon E5-2697 v3rdquo = Intelreg Xeonreg processor E5-2697 v3

ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

09X

APPROVED FOR PUBLIC PRESENTATION

No testing

on Tesla

NEW

32 NODES CLUSTER BENCHMARK

20

Application LAMMPS

Description Simulation of molecular systems with classical models More at httplammpssandiagov

Availability Code In main LAMMPS repository Recipe Available here

Usage Model Load balancer offloads part of neighbor-list and non-bond force calculations to Intelreg Xeon Phitrade coprocessor for concurrent calculations with CPU

Highlights Improved results with Intelreg Xeonreg processor E5-2697 v3 and Intel Xeon Phi coprocessor 7120A Dynamic load balancing allows for concurrent Data transfer between host and coprocessor Calculations of neighbor-list non-bond bond and long-range

terms Same routines in LAMMPS Intel Package also run faster on CPU

Results Up to 168X performance improvement utilizing Intelreg Xeonreg processors and Intelreg Xeon Phitrade coprocessors with application optimization on a single node compared to the baseline configuration Performance gains continue to hold at 147X when scaling up to 32 nodes

LAMMPS Rhodopsin Benchmark 512K Atoms

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF AUGUST 2014

0

1

1 Node 32 Nodes

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS Baseline)

2S Intelreg Xeonreg processor E5-2697 v3 (LAMMPS IA Package)

2S E5-2697 v3 + Intelreg Xeon Phitrade coprocessor 7110P7120A Turbo Off

(LAMMPS IA Package)

Co

mp

ara

tiv

e P

erf

orm

an

ce

LAMMPS Rhodopsin Benchmark Performance (Mixed Precision)

APPROVED FOR PUBLIC PRESENTATION 32 NODES CLUSTER BENCHMARK

For configuration details go here

1

127X

168X

1 107X

147X

Other names and brands may be claimed as the property of others

0

1

ERR161544 SRR034966_1 ERR000589 SRR002273_1

Intelreg Xeonreg processor E5-2697 v2 + 1 NVIDIA Tesla K40

Intelreg Xeonreg processor E5-2697 v3

21

Johns Hopkins Bowtie 2 Multiple workloads

Application Bowtie2 version 223 Intelreg AVX2 port

Description NVBowtie version 0993 Bowtie is a GPU-accelerated re-engineering of Bowtie2 a very widely used short-read aligner While being completely rewritten from scratch nvBowtie reproduces many (though not all) of the features of Bowtie2 httpnvlabsgithubionvbionvbowtie_pagehtml

Availability Code Available here Recipe Not available Check for future availability here

Usage Model ERR161544 SRR002273_1 HEK001(TGen) ERR000589_1 SRR033552_1 SRR034966_1 amp ERR024139_1

Highlights See more here

Results Bowtie2 running on the Intelreg Xeonreg processor E5-2697 v3 with Intelreg AVX2 port faster than NVBowtie running on the Intelreg Xeonreg processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads NVIDIA published data of K40 compared to Intelreg Xeonreg processor E5-2600 (6 cores) on one workload

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2015

Johns Hopkins Bowtie 2 TGen Workload Speed Up

Co

mp

ara

tiv

e I

ncr

ea

se

1 NODE

For configuration details go here

1

187X

Other names and brands may be claimed as the property of others

APPROVED FOR PUBLIC PRESENTATION

159X

108X

88X

NEW

22

Burrows-Wheeler Aligner (BWA-ALN): Human Genome

Application: Burrows-Wheeler Aligner version 0.5.10. BWA-ALN is represented in this benchmark. The workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: Available here. Recipe: Available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor, run in symmetric mode, demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

Source: Intel measured results as of January 2014.

[Figure: BWA-ALN Speed Up (1 node). Y-axis: Comparative Performance. 2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1; 2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X; 2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X.]

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

BLAST: BLASTn v30

Application: Basic Local Alignment Search Tool (BLASTn) v30.

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set, National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times, with the pick of queries randomized, for sweet-spot splits of 80/20 and 59/23/18.

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015.

[Figure: BLASTn v30 Speed Up (1 node). Y-axis: Comparative Performance. Legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Data labels: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.
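The load-sharing usage model for BLASTn, in which one concatenated query set is divided between the processor and the coprocessor, can be illustrated with a small sketch. It assumes the "8020" on the slide denotes an 80/20 split; the function names and the per-device throughput figures are hypothetical and are not taken from the benchmark recipe. The point is only to show why such a model has a narrow sweet spot: the run finishes when the slower side finishes, so the best split is the one where both sides end at about the same time.

```python
def split_queries(queries, host_share):
    """Split a query list into (host_part, coprocessor_part) by host share."""
    if not 0.0 <= host_share <= 1.0:
        raise ValueError("host_share must be between 0 and 1")
    cut = round(len(queries) * host_share)
    return queries[:cut], queries[cut:]


def makespan(n_host, n_coproc, host_rate, coproc_rate):
    """Elapsed time when both devices run concurrently (slower side wins)."""
    return max(n_host / host_rate, n_coproc / coproc_rate)


def best_host_share(n_queries, host_rate, coproc_rate):
    """Brute-force the split that minimizes the makespan."""
    best_k = min(
        range(n_queries + 1),
        key=lambda k: makespan(k, n_queries - k, host_rate, coproc_rate),
    )
    return best_k / n_queries


# Hypothetical rates (queries per second), chosen so the optimum is 80/20.
queries = [f"query_{i}" for i in range(100)]
share = best_host_share(len(queries), host_rate=4.0, coproc_rate=1.0)
host_part, coproc_part = split_queries(queries, share)
print(share, len(host_part), len(coproc_part))
```

Moving even one query away from the balanced split lengthens the run in this model, which is one way to read the slide's remark about a small sweet spot.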

BLAST: BLASTp v30

Application: Basic Local Alignment Search Tool (BLASTp) v30.

Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: Available here. Recipe: Available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor to find the sweet spot of maximum speedup. The experiment was repeated 20 times, with the pick of queries randomized, for a sweet spot split 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.

Results: Compared to the baseline simulation rate, the speedup with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Source: Intel measured results as of March 2015.

[Figure: BLASTp v30 Speed Up (1 node). Y-axis: Comparative Performance. Legend: 2S Xeon E5-2697 v2 (BLASTn v30 baseline); 2S Xeon E5-2697 v2 + Xeon Phi 7120A; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized; 2S Xeon E5-2697 v3 (BLASTn v30 baseline); 2S Xeon E5-2697 v3 + Xeon Phi 7120A2; 2S Xeon E5-2697 v2 + Xeon Phi 7120A, OFS parallelized. Data labels: 1, 1.41X, 1.39X, 1.21X, 1.3X, 1.15X.]

"Xeon E5-2697 v2/v3" = Intel® Xeon® processor E5-2697 v2/v3. "Xeon Phi 7120A" = Intel® Xeon Phi™ coprocessor 7120A.

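Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an enabled Intel® processor, BIOS, and virtual machine monitor (VMM). Functionality, performance, or other benefits of the virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Intel® Turbo Boost Technology requires a system that supports it. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer or other supplier. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (in English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries.

© 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.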

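Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. When considering the purchase of systems or components, consult other sources of information to evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or other Web sites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing each platform's benchmark result by the actual benchmark result of the baseline platform, and assigning a relative performance number that is proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (in English).

TPC Benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (in English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (in English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability for any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance only on Intel® microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Results vary with these factors. When considering a purchase, consult other information and performance tests, including the performance of the product when combined with other products, to evaluate performance as a whole.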
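The relative-performance calculation described in the performance disclaimer (assign 1.0 to the baseline, divide every other result by the baseline result) can be written out as a short sketch. The function name, the platform labels, and the raw throughput values below are hypothetical; the raw numbers were picked only so that the ratios are easy to verify by hand. The `higher_is_better` switch for elapsed-time results is an added assumption and is not part of the disclaimer text.

```python
def relative_performance(results, baseline, higher_is_better=True):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    results: dict mapping platform name -> raw benchmark result (must be > 0)
    baseline: key of the platform that receives the value 1.0
    higher_is_better: True for throughput-style results; False for elapsed
        times, where the ratio is inverted (an assumption of this sketch).
    """
    base = results[baseline]
    relative = {}
    for platform, value in results.items():
        if value <= 0 or base <= 0:
            raise ValueError("benchmark results must be positive")
        relative[platform] = value / base if higher_is_better else base / value
    return relative


# Hypothetical raw throughputs (not measured data).
raw = {"baseline": 50.0, "optimized": 62.0, "with coprocessor": 93.0}
for platform, score in relative_performance(raw, "baseline").items():
    print(f"{platform}: {score:.2f}X")
```

With these made-up inputs the three platforms come out at 1.00X, 1.24X, and 1.86X, which is the form in which the speedup charts in this deck report their bars.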

LAMMPS Rhodopsin Benchmark: 512K Atoms

Application: LAMMPS.

Description: Simulation of molecular systems with classical models. More at http://lammps.sandia.gov

Availability: Code: In main LAMMPS repository. Recipe: Available here.

Usage Model: A load balancer offloads part of the neighbor-list and non-bond force calculations to the Intel® Xeon Phi™ coprocessor for concurrent calculations with the CPU.

Highlights: Improved results with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A. Dynamic load balancing allows for concurrent (1) data transfer between host and coprocessor and (2) calculations of neighbor-list, non-bond, bond, and long-range terms. The same routines in the LAMMPS Intel Package also run faster on the CPU.

Results: Up to 1.68X performance improvement utilizing Intel® Xeon® processors and Intel® Xeon Phi™ coprocessors with application optimization on a single node, compared to the baseline configuration. Performance gains continue to hold at 1.47X when scaling up to 32 nodes.

Source: Intel measured results as of August 2014.

[Figure: LAMMPS Rhodopsin Benchmark Performance (Mixed Precision), 32-node cluster benchmark. Y-axis: Comparative Performance. 1 Node: 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS Baseline) 1; 2S Intel® Xeon® processor E5-2697 v3 (LAMMPS IA Package) 1.27X; 2S E5-2697 v3 + Intel® Xeon Phi™ coprocessor 7110P/7120A, Turbo Off (LAMMPS IA Package) 1.68X. 32 Nodes, same order: 1, 1.07X, 1.47X.]
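The dynamic load balancing mentioned for the LAMMPS offload model can be pictured with a toy feedback loop. This is a minimal sketch under stated assumptions, not the algorithm used in the LAMMPS Intel Package: the function name `rebalance`, the damping factor, the clamping bounds, and the per-unit costs in the simulation are all hypothetical. The idea shown is that the fraction of work sent to the coprocessor is adjusted after each step so that host and coprocessor finish at about the same time.

```python
def rebalance(fraction, host_time, coproc_time, damping=0.5, lo=0.05, hi=0.95):
    """Return an updated offload fraction from the last step's timings.

    fraction: share of the work offloaded in the last step (0 < fraction < 1)
    host_time, coproc_time: measured times for each side in that step
    """
    # Estimate the cost per unit of work on each side.
    host_cost = host_time / (1.0 - fraction)
    coproc_cost = coproc_time / fraction
    # Both sides finish together when host_cost*(1-f) == coproc_cost*f.
    target = host_cost / (host_cost + coproc_cost)
    # Move only part of the way to the target to avoid oscillation.
    updated = fraction + damping * (target - fraction)
    return min(hi, max(lo, updated))


# Toy simulation: the host needs 3 time units per unit of work and the
# coprocessor needs 1 (hypothetical costs), so the balanced fraction is 0.75.
HOST_COST, COPROC_COST = 3.0, 1.0
f = 0.5
for step in range(20):
    host_t = HOST_COST * (1.0 - f)
    coproc_t = COPROC_COST * f
    f = rebalance(f, host_t, coproc_t)
print(round(f, 4))
```

In a real run the timings would be noisy and the costs would drift as the simulation evolves, which is why a damped update is used here instead of jumping straight to the estimated target.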

Johns Hopkins Bowtie 2: Multiple workloads

Application: Bowtie2 version 2.2.3, Intel® AVX2 port.

Description: NVBowtie version 0.9.9.3. nvBowtie is a GPU-accelerated re-engineering of Bowtie2, a very widely used short-read aligner. While being completely rewritten from scratch, nvBowtie reproduces many (though not all) of the features of Bowtie2. http://nvlabs.github.io/nvbio/nvbowtie_page.html

Availability: Code: Available here. Recipe: Not available. Check for future availability here.

Usage Model: ERR161544, SRR002273_1, HEK001 (TGen), ERR000589_1, SRR033552_1, SRR034966_1 & ERR024139_1.

Highlights: See more here.

Results: Bowtie2 running on the Intel® Xeon® processor E5-2697 v3 with the Intel® AVX2 port is faster than NVBowtie running on the Intel® Xeon® processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads. NVIDIA published data of the K40 compared to the Intel® Xeon® processor E5-2600 (6 cores) on one workload.

[Figure legend: workloads ERR161544, SRR034966_1, ERR000589, SRR002273_1; series Intel® Xeon® processor E5-2697 v2 + 1 NVIDIA Tesla K40 and Intel® Xeon® processor E5-2697 v3.]

Page 21: インシリコ創薬時代の 最新チップとアプリの開発状況biogrid.jp/pdf/20150530bg-1.pdf512 bit SIMD instructions スレッドあたり32 個のベクトルレジスター

0

1

ERR161544 SRR034966_1 ERR000589 SRR002273_1

Intelreg Xeonreg processor E5-2697 v2 + 1 NVIDIA Tesla K40

Intelreg Xeonreg processor E5-2697 v3

21

Johns Hopkins Bowtie 2 Multiple workloads

Application Bowtie2 version 223 Intelreg AVX2 port

Description NVBowtie version 0993 Bowtie is a GPU-accelerated re-engineering of Bowtie2 a very widely used short-read aligner While being completely rewritten from scratch nvBowtie reproduces many (though not all) of the features of Bowtie2 httpnvlabsgithubionvbionvbowtie_pagehtml

Availability Code Available here Recipe Not available Check for future availability here

Usage Model ERR161544 SRR002273_1 HEK001(TGen) ERR000589_1 SRR033552_1 SRR034966_1 amp ERR024139_1

Highlights See more here

Results Bowtie2 running on the Intelreg Xeonreg processor E5-2697 v3 with Intelreg AVX2 port faster than NVBowtie running on the Intelreg Xeonreg processor E5-2697 v2 and the NVIDIA Tesla K40 for 6 of 7 workloads NVIDIA published data of K40 compared to Intelreg Xeonreg processor E5-2600 (6 cores) on one workload

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2015

Johns Hopkins Bowtie 2 TGen Workload Speed Up

Co

mp

ara

tiv

e I

ncr

ea

se

1 NODE

For configuration details go here

1

187X

Other names and brands may be claimed as the property of others

APPROVED FOR PUBLIC PRESENTATION

159X

108X

88X

NEW

22

Application Burrows-Wheeler Aligner version 0510 BWA-ALN is represented in this benchmark Workload is korean_female (read file 35 GB 30 GB reference data base)

Description BWA is a popular software package for mapping low-divergent sequences against a large reference genome such as the human genome More at httpbio-bwasourceforgenet

Availability Code Available here Recipe Available here

Usage Model Hybrid MPI + OpenMP using symmetric mode

Highlights Results are identical to the unmodified run of BWA-ALN

Results The Intelreg Xeonreg processor E5-2697 v2 and the Intelreg Xeon Phitrade coprocessor symmetric process demonstrated up to 186X improved performance over the baseline Intelreg Xeonreg processor E5-2697 v2

SOURCE INTEL MEASURED RESULTS AS OF JANUARY 2014

0

1

BWA-ALN Speed Up

2S Intelreg Xeonreg processor E5-2697 v2 (baseline BWA-ALN)

2S Intelreg Xeonreg processor E5-2697 v2 (optimized BWA-ALN)

2S Intelreg Xeonreg processor E5-2697 v2 + Intelreg Xeon Phitrade coprocessor

7120A

Co

mp

ara

tiv

e P

erf

orm

an

ce

APPROVED FOR PUBLIC PRESENTATION 1 NODE

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

Burrows-Wheeler Aligner (BWA-ALN) Human Genome

For configuration details go here

1

124X

186X

Other names and brands may be claimed as the property of others

0

1

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3

ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

23

Application Basic Local Alignment Search Tool (BLASTn) v30

Description Searching for alignment in nucleotide query sequences against a known nucleotide db volume set National Center for Biotechnology Information (NCBI) More at httpblastncbinlmnihgov

Availability Code Available here Recipe Available here

Usage Model 4 (multiple queries multiple db) 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intelreg Xeonreg processor and Intelreg Xeon Phitrade coprocessor for maximum speedup sweet spot Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 8020 and 592318

Highlights Throughput for this load sharing model has a small sweet spot for a sufficiently large query set

Results Compared to the baseline simulation rate speed up with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A heterogeneous model is 152X Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

BLAST BLASTn v30

Other names and brands may be claimed as the property of others SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

BLASTn v30 Speed Up

1 NODE

For configuration details go here

Co

mp

ara

tiv

e P

erf

orm

an

ce

1

152X

APPROVED FOR PUBLIC PRESENTATION

149X

122X

141X

126X

NEW

0

1

2S Xeon E5-2697 v2 (BLASTn v30 baseline)

2S Xeon E5-2697 v2 + Xeon Phi 7120A

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

2S Xeon E5-2697 v3 (BLASTn v30 baseline)

2S Xeon E5-2697 v3 + Xeon Phi 7120A2

2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

24

BLAST BLASTp v30

Other names and brands may be claimed as the property of others SOURCE INTEL MEASURED RESULTS AS OF MARCH 2015

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors Performance tests such as SYSmark and MobileMark are measured using specific computer systems components software operations and functions Any change to any of those factors may cause the results to vary You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases including the performance of that product when combined with other products See benchmark tests and configurations in the speaker notes For more information go to httpwwwintelcomperformance

Application Basic Local Alignment Search Tool (BLASTp) v30

Description Searching for alignment in protein query sequence against a known protein db volume set More at httpblastncbinlmnihgov

Availability Code Available here Recipe Available here

Usage Model 4 (multiple queries multiple db) 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to Intelreg Xeonreg processor and Intelreg Xeon Phitrade coprocessor for maximum speedup sweet spot Experiment was repeated 20 times with the pick of queries randomized for a sweet spot split 337 and 2857

Highlights Throughput for this offload model has a small sweet spot for a sufficiently large query set Throughput is limited due to GAT stage not parallelized

Results Compared to the baseline simulation rate speed up with Intelreg Xeonreg processor E5-2697 v3 and Intelreg Xeon Phitrade coprocessor 7120A heterogeneous model is 141X Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization

1 NODE

For configuration details go here

BLASTp v30 Speed Up

Co

mp

ara

tiv

e P

erf

orm

an

ce

ldquoXeon E5-2697 v2v3rdquo = Intelreg Xeonreg processor E5-2697 v2v3 ldquoXeon Phi 7120Ardquo = Intelreg Xeon Phitrade coprocessor 7120A

141X

1

APPROVED FOR PUBLIC PRESENTATION

139X

121X 13X

115X

NEW

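Legal Information

All products, computer systems, dates, and figures specified in this document are based on current expectations and are subject to change without notice.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. For details, see http://www.intel.co.jp/jp/products/processor_number

Intel® processors, chipsets, and desktop boards may contain design defects known as errata, which may cause the product to deviate from published specifications. Contact Intel for currently characterized errata.

Intel® Virtualization Technology requires a computer system with an Intel® processor, BIOS, and virtual machine monitor (VMM) that support the technology. Functionality, performance, or other benefits of virtualization technology vary depending on hardware and software configurations. Software applications may not be compatible with all operating systems. Consult your PC manufacturer. For details, see http://www.intel.co.jp/content/www/jp/ja/virtualization/virtualization-technology/hardware-assist-virtualization-technology.html

No computer system can provide absolute security under all conditions. Intel® Trusted Execution Technology (Intel® TXT) requires a computer system with Intel® Virtualization Technology, an Intel® TXT-enabled processor, chipset, BIOS, Authenticated Code Modules, and an Intel® TXT-compatible Measured Launched Environment (MLE). In addition, Intel® TXT requires the system to contain a TPM v1.s. For details, see http://www.intel.co.jp/content/www/jp/ja/data-security/security-overview-general-technology.html

Requires a system with Intel® Turbo Boost Technology capability. Intel® Turbo Boost Technology and Intel® Turbo Boost Technology 2.0 are available only on select Intel® processors. Consult your PC manufacturer. Actual performance varies depending on hardware, software, and system configuration. For details, see http://www.intel.co.jp/jp/technology/turboboost

Intel® AES New Instructions (Intel® AES-NI) requires a computer system with an Intel® AES-NI-enabled processor, as well as non-Intel software that executes the instructions in the correct sequence. Intel® AES-NI is available on select Intel® processors. For availability, consult your PC manufacturer. For details, see http://software.intel.com/en-us/articles/intel-advanced-encryption-standard-instructions-aes-ni (in English).

Intel, the Intel logo, the Intel Inside logo, Xeon, Xeon Inside, and Intel Xeon Phi are trademarks of Intel Corporation in the United States and/or other countries.

© 2012 Intel Corporation. Unauthorized quotation or reproduction is prohibited.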

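Legal Disclaimers: Performance

Performance tests and ratings are measured using specific computer systems, components, or combinations of them, and reflect the approximate performance of Intel products as measured by those tests. Differences in system hardware or software design or configuration may cause actual performance to differ from the published tests and ratings. Buyers considering the purchase of systems or components should consult other sources of information and evaluate performance as a whole. For more information on the performance of Intel products, see http://www.intel.com/performance

Intel does not control or audit the design or implementation of third-party benchmarks or Web sites referenced in this document. Intel encourages readers to visit the referenced Web sites, or other Web sites where similar performance benchmarks are reported, and confirm whether the referenced benchmarks accurately reflect the performance of systems available for purchase.

Relative performance for each benchmark is calculated by assigning a baseline value of 1.0 to one benchmark result, dividing the benchmark result of each platform by the actual benchmark result of the baseline platform, and assigning each platform a relative performance number proportional to the reported performance improvement.

SPEC, SPECint, SPECfp, SPECrate, SPECpower, SPECjAppServer, SPECjEnterprise, SPECjbb, SPECompM, SPECompL, and SPEC MPI are trademarks of the Standard Performance Evaluation Corporation. For details, see http://www.spec.org/spec/trademarks.html (in English).

The TPC benchmark is a trademark of the Transaction Processing Council. For details, see http://www.tpc.org (in English).

SAP and SAP NetWeaver are registered trademarks of SAP AG in Germany and other countries. For details, see http://www.sap.com/benchmark (in English).

The information in this document is provided "as is" and grants no license, express or implied, by estoppel or otherwise, to any intellectual property rights. Intel assumes no liability with respect to any express or implied warranty relating to this information, including warranties of fitness for a particular purpose, merchantability, or non-infringement of any patent, copyright, or other intellectual property right.

Software and workloads used in performance tests may have been optimized for performance on Intel® microprocessors. Performance tests such as SYSmark and MobileMark are measured using specific computer systems, components, software, operations, and functions. Results vary depending on these factors. Buyers considering a purchase should consult other information and performance tests, including the performance of the product when combined with other products, and evaluate performance as a whole.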
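The relative-performance calculation described in the disclaimers above (baseline fixed at 1.0, every other platform divided by the baseline result) can be sketched in a few lines of Python. This is only an illustration: the function name `relative_performance` and the platform names and figures below are hypothetical, not measured data from this deck.

```python
def relative_performance(results, baseline):
    """Normalize raw benchmark results so the baseline platform scores 1.0.

    Assumes a "higher is better" metric such as throughput.
    """
    base = results[baseline]
    if base <= 0:
        raise ValueError("baseline result must be positive")
    return {name: value / base for name, value in results.items()}


# Hypothetical throughput figures, chosen only to illustrate the arithmetic.
raw = {"baseline": 200.0, "optimized": 250.0, "with_coprocessor": 370.0}
relative = relative_performance(raw, "baseline")

for name, score in relative.items():
    print(f"{name}: {score:.2f}X")
```

For a "lower is better" metric such as elapsed time, the ratio is inverted (baseline time divided by platform time) so that a larger number still means a faster platform.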


Application: Burrows-Wheeler Aligner (BWA) version 0.5.10. BWA-ALN is represented in this benchmark. Workload is korean_female (read file 35 GB, 30 GB reference database).

Description: BWA is a popular software package for mapping low-divergent sequences against a large reference genome, such as the human genome. More at http://bio-bwa.sourceforge.net

Availability: Code: available here. Recipe: available here.

Usage Model: Hybrid MPI + OpenMP using symmetric mode.

Highlights: Results are identical to the unmodified run of BWA-ALN.

Results: The Intel® Xeon® processor E5-2697 v2 and the Intel® Xeon Phi™ coprocessor symmetric process demonstrated up to 1.86X improved performance over the baseline Intel® Xeon® processor E5-2697 v2.

SOURCE: INTEL MEASURED RESULTS AS OF JANUARY 2014

Chart: BWA-ALN Speed Up (1 node; y-axis: Comparative Performance)
2S Intel® Xeon® processor E5-2697 v2 (baseline BWA-ALN): 1
2S Intel® Xeon® processor E5-2697 v2 (optimized BWA-ALN): 1.24X
2S Intel® Xeon® processor E5-2697 v2 + Intel® Xeon Phi™ coprocessor 7120A: 1.86X

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations, and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. See benchmark tests and configurations in the speaker notes. For more information go to http://www.intel.com/performance

Burrows-Wheeler Aligner (BWA-ALN) Human Genome

For configuration details go here

Other names and brands may be claimed as the property of others

Chart legend (BLASTn v30 Speed Up):
2S Xeon E5-2697 v2 (BLASTn v30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A

Application: Basic Local Alignment Search Tool (BLASTn) v30

Description: Searching for alignment in nucleotide query sequences against a known nucleotide db volume set. National Center for Biotechnology Information (NCBI). More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 100 NCBI queries (concatenated) against db refseq_rna00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times, with the pick of queries randomized, for a sweet spot split 8020 and 592318.
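The load-sharing idea in the usage model above (a randomized pick of queries, divided between host processor and coprocessor) can be sketched as follows. This is not the recipe Intel used: `split_queries`, the query IDs, and the 0.8 fraction are all hypothetical and chosen only for illustration.

```python
import random


def split_queries(queries, cpu_fraction, seed=None):
    """Shuffle the queries, then assign the first cpu_fraction of them to the host CPU.

    Returns (cpu_queries, coprocessor_queries).
    """
    if not 0.0 <= cpu_fraction <= 1.0:
        raise ValueError("cpu_fraction must be between 0 and 1")
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)  # randomized pick, reproducible per seed
    cut = round(len(shuffled) * cpu_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical query identifiers standing in for the 100 concatenated queries.
queries = [f"query_{i:03d}" for i in range(100)]
cpu_part, phi_part = split_queries(queries, cpu_fraction=0.8, seed=0)
print(len(cpu_part), len(phi_part))
```

Repeating the call with different seeds corresponds to repeating the experiment with a different random pick of queries on each side.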

Highlights: Throughput for this load-sharing model has a small sweet spot for a sufficiently large query set.
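Why a load-sharing model has a sweet spot can be seen with a toy model: if both devices run concurrently, the job finishes when the slower side finishes, so throughput peaks where the two sides take equal time. The sketch below sweeps the split under that assumption. The per-device rates are hypothetical, and the model ignores data transfer and per-query variation.

```python
def throughput(n_queries, cpu_fraction, cpu_rate, phi_rate):
    """Queries per second when both devices process their shares concurrently."""
    cpu_time = n_queries * cpu_fraction / cpu_rate
    phi_time = n_queries * (1.0 - cpu_fraction) / phi_rate
    return n_queries / max(cpu_time, phi_time)  # the slower side sets the finish time


def best_split(n_queries, cpu_rate, phi_rate, steps=100):
    """Sweep cpu_fraction over a grid and return the fraction with the highest throughput."""
    fractions = [i / steps for i in range(steps + 1)]
    return max(fractions, key=lambda f: throughput(n_queries, f, cpu_rate, phi_rate))


# Hypothetical rates in queries per second; not measured values.
best = best_split(n_queries=100, cpu_rate=4.0, phi_rate=1.0)
print(best, throughput(100, best, 4.0, 1.0))
```

In this toy model, moving even a few percent away from the best fraction leaves one device idle while the other finishes, which is one way to read "small sweet spot".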

Results: Compared to the baseline, the simulation rate speed-up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.52X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

BLAST BLASTn v30

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

Chart: BLASTn v30 Speed Up (1 node; y-axis: Comparative Performance; marked NEW). Bar labels as extracted: 1, 1.52X, 1.49X, 1.22X, 1.41X, 1.26X

BLAST BLASTp v30

Chart legend, as extracted:
2S Xeon E5-2697 v2 (BLASTn v30 baseline)
2S Xeon E5-2697 v2 + Xeon Phi 7120A
2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized
2S Xeon E5-2697 v3 (BLASTn v30 baseline)
2S Xeon E5-2697 v3 + Xeon Phi 7120A2
2S Xeon E5-2697 v2 + Xeon Phi 7120A OFS parallelized

SOURCE: INTEL MEASURED RESULTS AS OF MARCH 2015

Application: Basic Local Alignment Search Tool (BLASTp) v30

Description: Searching for alignment in protein query sequence against a known protein db volume set. More at http://blast.ncbi.nlm.nih.gov

Availability: Code: available here. Recipe: available here.

Usage Model: 4 (multiple queries, multiple db). 40 NCBI queries (concatenated) against db nr_sorted00-02 are distributed to the Intel® Xeon® processor and Intel® Xeon Phi™ coprocessor for the maximum-speedup sweet spot. The experiment was repeated 20 times, with the pick of queries randomized, for a sweet spot split 337 and 2857.

Highlights: Throughput for this offload model has a small sweet spot for a sufficiently large query set. Throughput is limited because the GAT stage is not parallelized.
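The remark that an unparallelized stage limits throughput is the usual Amdahl's law argument. A minimal Python sketch follows; the serial fractions used are hypothetical, since the slide does not state what share of the runtime the GAT stage accounts for.

```python
def amdahl_speedup(serial_fraction, parallel_speedup):
    """Overall speedup when only the non-serial part of the work is accelerated."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if parallel_speedup <= 0:
        raise ValueError("parallel_speedup must be positive")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / parallel_speedup)


# Hypothetical serial fractions, each with the remaining work sped up 2x.
for serial in (0.0, 0.25, 0.5):
    print(f"serial fraction {serial:.2f}: {amdahl_speedup(serial, 2.0):.2f}X overall")
```

However fast the parallel part becomes, the overall speedup in this model can never exceed 1 / serial_fraction.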

Results: Compared to the baseline, the simulation rate speed-up with the Intel® Xeon® processor E5-2697 v3 and Intel® Xeon Phi™ coprocessor 7120A heterogeneous model is 1.41X. Performance is also improved on the CPU due to Output Formatting Section (OFS) parallelization.

Chart: BLASTp v30 Speed Up (1 node; y-axis: Comparative Performance; marked NEW). Bar labels as extracted: 1.41X, 1, 1.39X, 1.21X, 1.3X, 1.15X

“Xeon E5-2697 v2/v3” = Intel® Xeon® processor E5-2697 v2/v3
“Xeon Phi 7120A” = Intel® Xeon Phi™ coprocessor 7120A
