TDC2016SP - Big Data Track

The Next Step Toward Exascale Computing: Programming for Multi-core and Many-core

Big Data Track

Igor Freitas – Intel do [email protected]

http://software.intel.com/en-us/articles/optimization-notice

Legal Disclaimers

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at [intel.com].

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest Intel product specifications and roadmaps.

Results have been estimated or simulated using internal Intel analysis or architecture simulation or modeling, and provided to you for informational purposes. Any differences in your system hardware, software or configuration may affect your actual performance.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at https://www-ssl.intel.com/content/www/us/en/high-performance-computing/path-to-aurora.html.

Tests document performance of components on a particular test, in specific systems. Differences in hardware, software, or configuration will affect actual performance. Consult other sources of information to evaluate performance as you consider your purchase. For more complete information about performance and benchmark results, visit http://www.intel.com/performance.

3D XPoint, Intel, the Intel logo, Intel. Experience What's Inside, the Intel. Experience What's Inside logo, Intel Xeon Phi, Optane, and Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States or other countries. *Other names and brands may be claimed as the property of others.

© 2016 Intel Corporation. All rights reserved.


The Next Step Toward Exascale Computing

Next Step on Intel's Path to Exascale Computing

Exascale vision: I/O, Memory, Processor Performance, Resiliency, Standard Programming Models, Power Efficiency

Next step: KNL
• Systems scalable to >100 PFlop/s
• Over 15 GF/Watt (1)
• ~500 GB/s sustained memory bandwidth with integrated on-package memory
• ~3X Flops and ~3X single-thread theoretical peak performance over Knights Corner (1)
• Up to 100 Gb/s with Storm Lake integrated fabric

(1) Projections based on internal Intel analysis during early product definition, as compared to prior generation Intel® Xeon Phi™ Coprocessors, and are provided for informational purposes only. Any difference in system hardware or software design or configuration may affect actual performance.


Intel® Xeon Phi™ Product Family x200

Intel® Xeon Phi™ Processor
• Host processor in Groveport platform
• Self-boot Intel® Xeon Phi™ processor
• Variant with integrated Intel® Omni-Path Fabric

Intel® Xeon Phi™ Coprocessor x200
• Ingredient of Grantley platforms
• Requires Intel® Xeon® processor host

Knights Landing (KNL) Architectural Diagram

Diagram is for conceptual purposes only and only illustrates a CPU and memory. It is not to scale and does not include all functional areas of the CPU, nor does it represent actual component layout.

[Diagram: KNL package with a mesh of tiles (up to 72 cores), eight MCDRAM devices, two groups of three DDR4 channels, PCIe Gen3, DMI to the Wellsburg PCH, and an HFI connected through micro-coax cable (IFP) connectors]

• Tile: 2 cores, 2 VPUs per core, 1 MB L2, hub
• Up to 72 cores, 2D mesh architecture
• Up to 16 GB high-bandwidth on-package memory (MCDRAM), exposed as a NUMA node, ~500 GB/s sustained BW
• Over 3 TF DP peak
• Full Intel® Xeon Phi™ ISA compatibility through Intel® AVX-512
• ~3x single-thread performance compared to Knights Corner
• 6 channels DDR4, up to 384 GB
• PCIe Gen3: x36 (KNL), x4 (KNL-F)
• 2 ports Intel® Omni-Path Integrated Fabric (KNL-F only), on-package, 50 GB/s total bi-directional BW
• DMI to Wellsburg PCH: common with Grantley PCH, 1S (no QPI/KTI)
• 2x 512b VPU (Vector Processing Unit) per core
• Based on Intel® Atom™ processor (Silvermont) with many HPC enhancements
  • Deep out-of-order buffers
  • Gather/scatter in hardware
  • Improved branch prediction
  • 4 threads/core
  • High cache bandwidth and more

Code modernization on Xeon® and Xeon Phi™ processors

Identifying optimization opportunities - Vectorization

Identifying optimization opportunities
Focus of this seminar: code modernization

Composer Edition
• Intel® C++ and Fortran compilers
• Parallel models (e.g., OpenMP*)
• Optimized libraries

Professional Edition
• Threading design & prototyping
• Parallel performance tuning
• Memory & thread correctness

Cluster Edition
• Multi-fabric MPI library
• MPI error checking and tuning

[Diagram: an HPC cluster exchanging MPI messages, built from vectorized & threaded nodes]

Optimization on a single processing "node": vectorization & parallelism

Identifying optimization opportunities
Focus of this seminar: code modernization

C/C++ or Fortran code

• Parallelism (multithreading): the work is distributed over threads and cores. The diagram shows one processor running Thread 0 / Core 0 through Thread 12 / Core 12 and another running Thread 0 / Core 0 through Thread 244 / Core 61.
• Vectorization: one Vector Processing Unit per core, 128, 256 or 512 bits wide.

Identifying optimization opportunities
Recap: what is vectorization / SIMD?

for (i=0;i<=MAX;i++) c[i]=a[i]+b[i];

Scalar: C = A + B
- One instruction
- One operation

Vector: c[i..i+7] = a[i..i+7] + b[i..i+7]
- One instruction
- Eight operations

• What is it?
  • The ability to perform a mathematical operation on two or more elements at the same time.
• Why vectorize?
  • Substantial performance gain!
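The scalar and vector pictures above can be sketched in plain C++. This is only an illustration: `add_scalar`, `add_blocked`, `blocked_matches_scalar` and `kLanes` are hypothetical names, and the width of 8 assumes single-precision values in a 256-bit register. The inner loop of `add_blocked` covers exactly the group of elements that a vectorizing compiler may map to one SIMD addition; whether it does so depends on the compiler and its options.

```cpp
#include <cstddef>
#include <vector>

// Scalar form: one addition per loop iteration.
void add_scalar(const float* a, const float* b, float* c, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];
}

// Assumption for illustration: 8 floats fit in one 256-bit vector register.
constexpr std::size_t kLanes = 8;

// Blocked form: each pass of the outer loop handles kLanes elements,
// the unit of work of a single SIMD add; leftovers go to a remainder loop.
void add_blocked(const float* a, const float* b, float* c, std::size_t n) {
    std::size_t i = 0;
    for (; i + kLanes <= n; i += kLanes) {
        for (std::size_t lane = 0; lane < kLanes; ++lane) {
            c[i + lane] = a[i + lane] + b[i + lane];
        }
    }
    for (; i < n; ++i) c[i] = a[i] + b[i];  // remainder elements
}

// Hypothetical helper: checks that both forms produce the same output.
bool blocked_matches_scalar(std::size_t n) {
    std::vector<float> a(n), b(n), c1(n), c2(n);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = static_cast<float>(i);
        b[i] = 2.0f * static_cast<float>(i);
    }
    add_scalar(a.data(), b.data(), c1.data(), n);
    add_blocked(a.data(), b.data(), c2.data(), n);
    return c1 == c2;
}
```

Both functions compute the same result; the blocked form only makes the grouping into lanes and the remainder loop explicit.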


Identifying optimization opportunities
Vectorization inside the "core"

Example code: Black-Scholes pricing code, "a mathematical model of a financial market containing certain derivative investment instruments."
Example taken from the book "High Performance Parallelism Pearls". Source code: http://lotsofcores.com/pearls.code

Article on optimizing this method: https://software.intel.com/en-us/articles/case-study-computing-black-scholes-with-intel-advanced-vector-extensions
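For reference, the standard Black-Scholes price of a European call option can be written as follows. The symbols follow the variable names of the example code ($S_0$ for pS0, $K$ for pK, $T$ for pT, $r$, and $\sigma$ for sig); $\Phi$ is the standard normal cumulative distribution function, which is assumed here to be what cdfnormf computes.

```latex
\begin{aligned}
d_1 &= \frac{\ln(S_0/K) + \left(r + \tfrac{1}{2}\sigma^2\right) T}{\sigma\sqrt{T}}, \\
d_2 &= \frac{\ln(S_0/K) + \left(r - \tfrac{1}{2}\sigma^2\right) T}{\sigma\sqrt{T}}, \\
C   &= S_0\,\Phi(d_1) - K\,e^{-rT}\,\Phi(d_2).
\end{aligned}
```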


Identifying optimization opportunities
Ways to optimize the code

From ease of use to fine tuning:
• Intel® Math Kernel Library
• Intel® Data Analytics Acceleration Library
• Array Notation: Intel® Cilk™ Plus
• Auto vectorization
• Semi-auto vectorization: #pragma (vector, ivdep, simd)
• C/C++ Vector Classes (F32vec16, F64vec8)

Three factors must be weighed:
• Performance requirements
• Resources available to optimize the code
• Code portability


Is the code optimized to run on a single thread?

• Compile the code with the "-qopt-report[=n]" option on Linux or "/Qopt-report[:n]" on Windows.
  • /Qopt-report-file:vecReport.txt
• Analyze the report and look for hints about non-vectorized loops and their main cause.

Variable initialization loop:

LOOP BEGIN at ...Black-scholes-ch19\02_ReferenceVersion.cpp(93,3)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed OUTPUT dependence between pT line 95 and pK line 97
   remark #25439: unrolled with remainder by 2
LOOP END

Loop inside the "GetOptionPrices" function:

LOOP BEGIN at ...Black-scholes-ch19\02_ReferenceVersion.cpp(56,3)
   remark #15344: loop was not vectorized: vector dependence prevents vectorization. First dependence is shown below. Use level 5 report for details
   remark #15346: vector dependence: assumed ANTI dependence between pS0 line 58 and pC line 62
LOOP END
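The word "assumed" in these remarks matters: when a loop reads and writes through plain pointers, the compiler usually cannot prove that the arrays do not overlap, so it keeps the loop scalar to stay safe. The following sketch, with hypothetical names (`add_one`, `overlapping_demo`, `disjoint_demo`), shows a case where overlap really does change the result, which is the situation the compiler has to guard against.

```cpp
#include <cstddef>

// out[i] = in[i] + 1 for i in [0, n).
// If the two ranges overlap, iteration i may read a value written by
// iteration i-1, so iterations cannot safely be executed in groups.
void add_one(const float* in, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] + 1.0f;
}

// Overlapping buffers: out starts one element after in.
// Sequential execution propagates values: buf becomes {0, 1, 2, 3}.
float overlapping_demo() {
    float buf[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    add_one(buf, buf + 1, 3);
    return buf[3];
}

// Disjoint buffers: every iteration is independent, out becomes {1, 1, 1}.
float disjoint_demo() {
    float in[3] = {0.0f, 0.0f, 0.0f};
    float out[3] = {0.0f, 0.0f, 0.0f};
    add_one(in, out, 3);
    return out[2];
}
```

In the overlapping call a naive group-wise execution would produce 1 in every written element instead of 1, 2, 3. When the programmer knows the arrays never overlap, that knowledge has to be passed to the compiler explicitly.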


• Run Intel® VTune "General Exploration" with the "Analyze memory bandwidth" option checked
• Identify the "hotspot": the function that takes the most execution time
• Check whether the functions are vectorized


Compilation parameters used:

/GS /W3 /Gy /Zc:wchar_t /Zi /O2 /Fd"x64\Release\vc110.pdb" /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Qipo /Zc:forScope /Oi /MD /Fa"x64\Release\" /EHsc /nologo /Fo"x64\Release\" /Qprof-dir "x64\Release\" /Fp"x64\Release\Reference.pch"

Run parameters: <number of elements> <number of threads>, here "60000000 1"


Scalar instructions such as "movsd" or "cvtsd2ss" ("s" for scalar) are being used instead of "vmovapd" ("v" for AVX, "p" for packed, "d" for double).

SSE uses 128 bits, not 256 bits as AVX instructions do.


What did we identify?
• The code is not vectorized; it does not use 256-bit AVX/AVX2 instructions
• The compiler reported data dependences
• Opportunities reported by VTune
  • "Back-end bound": poor performance in instruction execution
    • Memory bound: the application depends on data movement between cache and RAM
      • Load (RAM -> cache) and store (cache -> RAM) instructions
      • L1 bound: data is not found at this cache level
    • Port utilization: low utilization of the "core", "non-memory issues"
• The "cdfnormf" and "GetOptionPrices" functions are hotspots

Code modernization with "semi-auto-vectorization"

Code modernization
Semi-auto-vectorization

Using #pragma ivdep (for now with only 1 thread/core)
• The code is not vectorized; it does not use 256-bit AVX/AVX2 instructions
• The compiler reported a data dependence at line 56

#pragma ivdep
for (i = 0; i < N; i++) {
    d1 = (log(pS0[i] / pK[i]) + (r + sig * sig * 0.5) * pT[i]) / (sig * sqrt(pT[i]));
    d2 = (log(pS0[i] / pK[i]) + (r - sig * sig * 0.5) * pT[i]) / (sig * sqrt(pT[i]));
    p1 = cdfnormf(d1);
    p2 = cdfnormf(d2);
    pC[i] = pS0[i] * p1 - pK[i] * exp((-1.0) * r * pT[i]) * p2;
}

Performance with 60,000,000 elements
• Original code: time = 22.904422
• #pragma ivdep: time = 1.886624
• 12.1x speedup

Configuration
• Intel Core i5-4300 CPU 2.5 GHz
• 4 GB RAM
• Windows 8.1 x64
• Intel C++ Compiler 15.0
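A self-contained variant of this loop can be sketched as below. It is not the code from the book: `call_prices`, `cdf_normal` and `single_call_price` are hypothetical names, the arrays are double precision, and `cdf_normal` uses `std::erfc` in place of the `cdfnormf` call so that the sketch builds without Intel's math library. The pragma is guarded because `#pragma ivdep` is an Intel compiler hint; GCC spells it `#pragma GCC ivdep`, and other compilers simply build the loop without the hint.

```cpp
#include <cmath>
#include <cstddef>

// Standard normal CDF through erfc; stands in for cdfnormf (assumption).
inline double cdf_normal(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Prices n European call options. The caller guarantees that pC does not
// overlap any input array; that guarantee is what the ivdep hint asserts.
void call_prices(const double* pS0, const double* pK, const double* pT,
                 double* pC, std::size_t n, double r, double sig) {
#if defined(__INTEL_COMPILER)
#pragma ivdep
#elif defined(__GNUC__) && !defined(__clang__)
#pragma GCC ivdep
#endif
    for (std::size_t i = 0; i < n; ++i) {
        const double vol_sqrt_t = sig * std::sqrt(pT[i]);
        const double log_ratio = std::log(pS0[i] / pK[i]);
        const double d1 = (log_ratio + (r + sig * sig * 0.5) * pT[i]) / vol_sqrt_t;
        const double d2 = (log_ratio + (r - sig * sig * 0.5) * pT[i]) / vol_sqrt_t;
        pC[i] = pS0[i] * cdf_normal(d1) - pK[i] * std::exp(-r * pT[i]) * cdf_normal(d2);
    }
}

// Hypothetical convenience wrapper for a single option.
double single_call_price(double s0, double k, double t, double r, double sig) {
    double price = 0.0;
    call_prices(&s0, &k, &t, &price, 1, r, sig);
    return price;
}
```

For example, `single_call_price(100.0, 100.0, 1.0, 0.05, 0.2)` evaluates to about 10.45. The hint is a promise rather than a check: if the arrays did overlap, the vectorized loop could silently produce wrong results.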


Using #pragma ivdep
• Understand what changed by running VTune again and comparing the two versions!
• Fewer instructions executed!
• Fewer clock cycles per execution!
• Fewer L1 cache misses
• Better use of the "core"

• Compiler report: analysis of the vectorized loop and hints on how to vectorize

week\pdf-codigos-intel-lncc\paralelismo-dia-02\Black-scholes-ch19\02_ReferenceVersion.cpp(63,5) ]
   remark #15300: LOOP WAS VECTORIZED
   remark #15442: entire loop may be executed in remainder
   remark #15448: unmasked aligned unit stride loads: 1
   remark #15450: unmasked unaligned unit stride loads: 2
   remark #15451: unmasked unaligned unit stride stores: 1
   remark #15475: --- begin vector loop cost summary ---
   remark #15476: scalar loop cost: 760
   remark #15477: vector loop cost: 290.250
   remark #15478: estimated potential speedup: 2.560
   remark #15479: lightweight vector operations: 66
   remark #15480: medium-overhead vector operations: 2
   remark #15482: vectorized math library calls: 5
   remark #15487: type converts: 13
   remark #15488: --- end vector loop cost summary ---
LOOP END



[Chart: In-core optimizations, 1 thread. Speedup (axis 0 to 20) for 60Mi and 120Mi elements across four versions: Baseline, IVDEP, QxAVX + IVDEP, QxAVX + IVDEP + UNROLL(4)]

• IVDEP
  • Ignores assumed dependences between the vectors
• (QxAVX) AVX instructions, 256 bits
• Unroll(n)
  • Unrolls the loop so its iterations can be mapped to SIMD instructions

• Note on unrolling: https://software.intel.com/en-us/articles/avoid-manual-loop-unrolling
• Requirements for a loop to be vectorized: https://software.intel.com/en-us/articles/requirements-for-vectorizable-loops
• Loop unrolling: http://en.wikipedia.org/wiki/Loop_unrolling
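To make the unroll step concrete, here is a sketch of what unrolling by 4 means, written out by hand. The function names are hypothetical and the loop is a simple scaling kernel rather than the Black-Scholes loop. It is shown only to illustrate the transformation; in practice the unroll pragma or the compiler's own heuristics do this, and the first link above advises against unrolling manually.

```cpp
#include <cstddef>
#include <vector>

// Rolled form: one element per iteration.
void scale_rolled(const float* in, float* out, std::size_t n, float s) {
    for (std::size_t i = 0; i < n; ++i) out[i] = in[i] * s;
}

// Unrolled by 4: four independent statements per iteration, which cuts
// loop overhead and exposes more independent work to the hardware.
void scale_unrolled4(const float* in, float* out, std::size_t n, float s) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i]     = in[i]     * s;
        out[i + 1] = in[i + 1] * s;
        out[i + 2] = in[i + 2] * s;
        out[i + 3] = in[i + 3] * s;
    }
    for (; i < n; ++i) out[i] = in[i] * s;  // remainder loop
}

// Hypothetical helper: both forms must give identical results.
bool unrolled_matches_rolled(std::size_t n) {
    std::vector<float> in(n), a(n), b(n);
    for (std::size_t i = 0; i < n; ++i) in[i] = 0.5f * static_cast<float>(i);
    scale_rolled(in.data(), a.data(), n, 3.0f);
    scale_unrolled4(in.data(), b.data(), n, 3.0f);
    return a == b;
}
```

The remainder loop is the same thing the compiler reports as "unrolled with remainder" in its optimization report.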



Identifying optimization opportunities [2]

Numerical precision and data alignment


Identifying optimization opportunities
Vectorization inside the "core": precision and aligned data

LOOP BEGIN at ... Black-scholes-ch19\02_ReferenceVersion.cpp(58,3)
   remark #15389: vectorization support: reference pS0 has unaligned access [
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15399: vectorization support: unroll factor set to 2
   remark #15417: vectorization support: number of FP up converts: single precision to double precision 1
   remark #15389: vectorization support: reference pK has unaligned access [ ... Black-scholes-ch19\02_ReferenceVersion.cpp(64,5) ]
   remark #15381: vectorization support: unaligned access used inside loop body
   remark #15399: vectorization support: unroll factor set to 8
   remark #15417: vectorization support: number of FP up converts: single precision to double precision 1 [ ... Black-scholes-ch19\02_ReferenceVersion.cpp(60,5) ]

2 opportunities reported by the compiler!
• Reduce numerical precision: double (64 bits) to single (32 bits)
• Align the data
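Both opportunities can be illustrated with a small sketch. The names `lanes`, `is_aligned`, `AlignedBlock` and `aligned_block_demo` are hypothetical, and the 32-byte alignment is an assumption that matches the size of a 256-bit AVX register. Halving the element size doubles the number of elements handled per vector instruction, and aligned storage lets the compiler use aligned loads and stores.

```cpp
#include <cstddef>
#include <cstdint>

// Number of elements of a given size that fit in one vector register.
constexpr std::size_t lanes(std::size_t register_bits, std::size_t element_bytes) {
    return register_bits / (8 * element_bytes);
}

// With 256-bit registers: 8 single-precision or 4 double-precision values
// (assumes the usual 4-byte float and 8-byte double).
static_assert(lanes(256, sizeof(float)) == 8, "8 floats per 256-bit register");
static_assert(lanes(256, sizeof(double)) == 4, "4 doubles per 256-bit register");

// True when the address is a multiple of the requested alignment.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Storage for eight floats, aligned to a 32-byte boundary.
struct alignas(32) AlignedBlock {
    float values[8];
};

// Hypothetical helper: confirms that the block starts on a 32-byte boundary.
bool aligned_block_demo() {
    AlignedBlock block{};
    return is_aligned(block.values, 32);
}
```

For heap arrays the same effect needs an aligned allocation routine; which one is available depends on the compiler and platform.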


Optimizing the code for single precision

Single-precision version:

d1 = (logf(pS0[i] / pK[i]) + (r + sig * sig * 0.5f) * pT[i]) / (sig * sqrtf(pT[i]));
d2 = (logf(pS0[i] / pK[i]) + (r - sig * sig * 0.5f) * pT[i]) / (sig * sqrtf(pT[i]));
p1 = cdfnormf(d1);
p2 = cdfnormf(d2);
pC[i] = pS0[i] * p1 - pK[i] * expf((-1.0f) * r * pT[i]) * p2;

Previous version (double-precision literals and math functions):

d1 = (log(pS0[i] / pK[i]) + (r + sig * sig * 0.5) * pT[i]) / (sig * sqrt(pT[i]));
d2 = (log(pS0[i] / pK[i]) + (r - sig * sig * 0.5) * pT[i]) / (sig * sqrt(pT[i]));
p1 = cdfnormf(d1);
p2 = cdfnormf(d2);
pC[i] = pS0[i] * p1 - pK[i] * exp((-1.0) * r * pT[i]) * p2;

• 23.6x speedup vs. the original code
• 1.4x speedup vs. "AVX + Unroll + IVDEP"
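The reason the literals matter can be checked directly in C++: a literal such as 0.5 has type double, so multiplying it by a float promotes the whole expression to double, while 0.5f keeps it in single precision. The sketch below uses hypothetical function names and only demonstrates the language rule, not the performance effect.

```cpp
#include <type_traits>

// A double literal promotes the float operand: the product is a double.
static_assert(std::is_same<decltype(0.5 * 1.0f), double>::value,
              "double literal times float yields double");

// A float literal keeps the whole expression in single precision.
static_assert(std::is_same<decltype(0.5f * 1.0f), float>::value,
              "float literal times float yields float");

// Computed in double precision, then converted back to float.
float half_with_double_literal(float x) {
    return static_cast<float>(0.5 * x);
}

// Computed entirely in single precision, no conversions needed.
float half_with_float_literal(float x) {
    return 0.5f * x;
}
```

Each such promotion corresponds to a conversion inside the loop, which is the kind of "FP up converts" the compiler report counts.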



Optimizing the code for single precision

[Chart: In-core optimizations, 1 thread. Speedup (axis 0 to 30) for 60Mi and 120Mi elements]

Identifying parallelism opportunities (multi-threading)

"Do Not Guess – Measure"

Identifying parallelism opportunities
Multithreading with Intel® Advisor XE

Although it is vectorized (instruction-level parallelism), the code still runs on a single thread/core!

• Before starting to optimize the code, we can analyze whether parallelizing it across more threads is worthwhile.

Intel® Advisor XE
• Models and compares performance across several threading frameworks, on processors as well as coprocessors
  • OpenMP, Intel® Cilk™ Plus, Intel® Threading Building Blocks
  • C, C++, Fortran (OpenMP only) and C# (Microsoft TPL)
• Predicts code scalability: number of threads versus performance gain
• Identifies opportunities for parallelism in the code
• Checks code correctness (deadlocks, race conditions)

Identifying parallelism opportunities
Multithreading with Intel® Advisor XE: steps to use Intel Advisor

Step 1 - Include the header
#include "advisor-annotate.h"

Step 2 - Add the "include" directory to the include path and link the library into the project (Windows and Linux)

Windows with Visual Studio 2012: usually located at "C:\Program Files (x86)\Intel\Advisor XE\include"

Linux, compiling and linking with Advisor:
icpc -O2 -openmp 02_ReferenceVersion.cpp -o 02_ReferenceVersion -I/opt/intel/advisor_xe/include/ -L/opt/intel/advisor_xe/lib64/


Steps to use Intel Advisor

Step 3 - Run Advisor
Linux:
$ advixe-gui &

Create a new project
- The interface is the same on Linux and Windows
- With Visual Studio, Advisor can also be run integrated into the IDE


Steps to use Intel Advisor
Advisor Workflow

• Survey Target: analyzes the code looking for parallelism opportunities

• Annotate Sources: annotations are inserted in candidate parallel regions of the code

• Check Suitability: analyzes the annotated parallel regions and predicts the performance gain and scalability of the code

• Check Correctness: analyzes possible problems such as "race conditions" and "deadlocks"

• Add Parallel Framework: replaces the Advisor annotations with code for the chosen framework (OpenMP, Cilk Plus, TBB, etc.)



Identifying "hotspots" and which loops can be parallelized



Inserting the Advisor "annotations" to run the next phase: Check Suitability
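A minimal sketch of an annotated loop is shown below. It assumes the annotation macros `ANNOTATE_SITE_BEGIN`, `ANNOTATE_ITERATION_TASK` and `ANNOTATE_SITE_END` from "advisor-annotate.h"; check the header shipped with your Advisor version for the exact names and arguments. The fallback definitions, the function `scale` and the site and task names are hypothetical and exist only so that the sketch also builds on a machine without Advisor installed.

```cpp
#include <cstddef>

// Use the real annotations when the Advisor header is on the include path;
// otherwise define empty stand-ins (assumption, for illustration only).
#if defined(__has_include)
#  if __has_include("advisor-annotate.h")
#    include "advisor-annotate.h"
#    define HAVE_ADVISOR_ANNOTATIONS 1
#  endif
#endif
#ifndef HAVE_ADVISOR_ANNOTATIONS
#  define ANNOTATE_SITE_BEGIN(name)
#  define ANNOTATE_ITERATION_TASK(name)
#  define ANNOTATE_SITE_END()
#endif

// Marks the loop as a candidate parallel site in which each iteration
// is a task, so that Check Suitability can model running it in parallel.
void scale(float* data, std::size_t n, float factor) {
    ANNOTATE_SITE_BEGIN(scale_site);
    for (std::size_t i = 0; i < n; ++i) {
        ANNOTATE_ITERATION_TASK(scale_task);
        data[i] *= factor;
    }
    ANNOTATE_SITE_END();
}

// Hypothetical helper: runs the annotated loop and returns the last element.
float scaled_last() {
    float data[3] = {1.0f, 2.0f, 3.0f};
    scale(data, 3, 2.0f);
    return data[2];
}
```

The annotations do not change what the program computes; they only tell the tool where a parallel region and its tasks would be.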





Applying parallelism via OpenMP
Concurrency analysis with Intel® VTune

Optimized code: "IVDEP + AVX + UNROLL + FLOAT PRECISION"

Nthreads: 1    time = 0.967425
Nthreads: 2    time = 0.569371
Nthreads: 4    time = 0.387649
Nthreads: 8    time = 0.396282

[Chart: Multi-thread optimization, OpenMP, 60Mi elements. Speedup (axis 1 to 2.6) versus number of threads (1, 2, 4, 8)]
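The speedup plotted for each thread count is the single-thread time divided by the time with N threads. The sketch below, with hypothetical names, computes it together with parallel efficiency, and shows the basic shape of an OpenMP parallel loop. The loop is a simple squaring kernel, not the Black-Scholes code; without OpenMP enabled at compile time the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Speedup relative to the single-thread time.
double speedup(double t1, double tn) { return t1 / tn; }

// Parallel efficiency: speedup divided by the number of threads.
double efficiency(double t1, double tn, int threads) {
    return speedup(t1, tn) / threads;
}

// Independent iterations split among threads by OpenMP.
void square_all(std::vector<float>& v) {
    const long n = static_cast<long>(v.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        v[i] = v[i] * v[i];
    }
}

// Hypothetical helper: squares {1, 2, 3} and returns the last element.
float squared_last() {
    std::vector<float> v = {1.0f, 2.0f, 3.0f};
    square_all(v);
    return v[2];
}
```

With the times listed above, `speedup(0.967425, 0.387649)` is about 2.5 for 4 threads, and the 8-thread run is slightly slower than the 4-thread run, so adding threads beyond 4 brought no further gain on this machine.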



Useful links

• Intel Developer Zone, Modern Code: https://software.intel.com/pt-br/modern-code

• Catalog of applications and frameworks optimized for Xeon Phi: https://software.intel.com/en-us/mic-developer/app-catalogs

• Machine Learning: https://software.intel.com/en-us/machine-learning

• Intel Modern Code Workshops on Big Data and HPC, UNESP / Núcleo de Computação Científica: http://modern-code.ncc.unesp.br/events


Intel® Scalable System Framework
Fuel Your Insight

• Small Clusters Through Supercomputers
• Compute and Data-Centric Computing
• Standards-Based Programmability
• On-Premise and Cloud-Based

Compute
• Intel® Xeon® Processors
• Intel® Xeon Phi™ Processors
• Intel® Xeon Phi™ Coprocessors
• Intel® Server Boards and Platforms

Memory/Storage
• Intel® Solutions for Lustre*
• Intel® Optane™ Technology
• 3D XPoint™ Technology
• Intel® SSDs

Fabric
• Intel® Omni-Path Architecture
• Intel® True Scale Fabric
• Intel® Ethernet
• Intel® Silicon Photonics

Software
• Intel® HPC Orchestrator
• Intel® Software Tools
• Intel® Cluster Ready Program
• Intel Supported SDVis


Intel® Xeon Phi™ Coprocessor Product Family
Based on Intel® Many Integrated Core (MIC) Architecture

Per Intel's announced products or planning process for future products

2013: Knights Corner
Intel® Xeon Phi™ x100 product family
• 22 nm process
• Coprocessor
• Over 1 TF DP Peak
• Up to 61 Cores
• Up to 16GB GDDR5

2016: Knights Landing
The processor version of the next generation Intel Xeon Phi product family
• 14 nm process
• Processor & Coprocessor
• Over 3 TF DP Peak
• Up to 72 Cores
• On Package High-Bandwidth Memory
• 3x single-thread performance
• Out-of-order core
• Integrated Intel® Omni-Path
• Variants shown: Knights Landing; Knights Landing with Fabric

Future: Knights Hill
Next generation of Intel® MIC Architecture Product Line
• 10 nm process
• 2nd Generation Integrated Intel® Omni-Path
• In planning –

©2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel Xeon Phi are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others.

