Copying or disclosure prohibited without the written permission of CMG Brasil.
What's New in the x86 Platform for High Performance
Jefferson de A Silva, Systems Management & Product Specialist
Why High-Performance Systems
The Technology Trend – Every Year We Get Faster, with More Processors
[Chart: Intel CPU Trends (sources: Intel, Wikipedia, K. Olukotun). Labelled processors: 386, Pentium, Xeon, Paxville, Montecito. Annotation: breakdown in frequency scaling]
• Around early 2003, processor frequency scaling began to stall
• On the earlier trajectory, we should be above 10 GHz today!
• Historically, higher frequency has raised single-threaded performance
• Multi-core improves software performance only when the number of executing threads can be increased
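The point about multi-core and thread count can be made concrete with Amdahl's law, which the slides do not name but which is a standard way to estimate this effect. The sketch below assumes a hypothetical workload in which a fraction `parallel_fraction` of the single-core run time can be spread across threads; the function name and the example fractions are illustrative, not taken from the presentation.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Estimated speedup on `cores` cores when only `parallel_fraction`
    of the single-core run time can execute in parallel (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Hypothetical workloads: fully serial, half parallel, mostly parallel.
    for fraction in (0.0, 0.5, 0.9):
        for cores in (1, 2, 4):
            print(f"parallel={fraction:.1f} cores={cores} "
                  f"speedup={amdahl_speedup(fraction, cores):.2f}x")
```

A fully serial workload gains nothing from extra cores in this model, which is the single-threaded case the bullets describe.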
Capacity Planning Aspects in x86 Environments
1. Network Subsystem Performance Update
• TOE, IOAT technology overview
• TOE and IOAT Ethernet throughput
2. Storage Subsystem Performance Update
• 2.5” vs. 3.5” disk effects on performance
3. Memory Subsystem Performance Update
• Memory operation fundamentals
• Latency vs. Bandwidth
• DDR2 & FBDIMM memory performance
4. CPU Technology & Performance Update
• Snoop filter performance overview
• Multi-core processor performance update
• New processor architecture changes and performance
  • AMD® Opteron® Next Generation
  • Intel Core® (Xeon® 5100 Woodcrest)
  • Intel Tulsa
• Clovertown performance update
5. X Architecture Overview
6. Performance per Watt
– What is performance/watt?
– How does Xeon 5100 Series (Woodcrest) perform?
7. Product Positioning
– How to position Xeon vs. Opteron products?
Capacity Planning Aspects in x86 Environments
• TOE and IOAT (TCP/IP Offload, I/O Acceleration Technology)
• 2.5” vs. 3.5” disks
• SDRAM, DDR, DDR2 and FBD
• CPU (multi-core, new architectures)
• VT (on-chip and software)
• Power consumption
Network Architecture – Standard NIC
[Diagram: CPUs, memory, chipset, and LAN adapter, with data-path steps numbered 1 to 4]
Potential bottlenecks
1) Interrupt Process and Multiple Memory Accesses by the CPU
2) TCP Protocol Processing
3) CPU Memory Copies
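Bottleneck 3 has a rough user-space analogue that can be tried directly. The sketch below is only an illustration in Python, not a model of a NIC driver or TCP/IP stack: slicing a `bytes` object copies the payload, while a `memoryview` slice refers to the same buffer without copying. The buffer size, header length, and helper names are hypothetical.

```python
def payload_with_copy(frame: bytes, header_len: int) -> bytes:
    """Return the payload as a new bytes object (the data is copied)."""
    return frame[header_len:]


def payload_without_copy(frame: bytes, header_len: int) -> memoryview:
    """Return a view of the payload that shares the original buffer."""
    return memoryview(frame)[header_len:]


if __name__ == "__main__":
    import timeit

    frame = bytes(64 * 1024)  # hypothetical 64 KiB buffer with a 40-byte header
    copy_time = timeit.timeit(lambda: payload_with_copy(frame, 40), number=10_000)
    view_time = timeit.timeit(lambda: payload_without_copy(frame, 40), number=10_000)
    print(f"copy: {copy_time:.4f}s  view: {view_time:.4f}s")
```

The timings depend on the machine, so no expected output is given; the point is only that the copying version does work proportional to the payload size on every call.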
TCP/IP Offload - TOE
Benefit
1. Less code processing by CPU
2. Fewer CPU data copies
[Diagram: CPUs, memory, chipset, and a LAN adapter with TOE; data-path steps numbered 1 to 3]
I/O Acceleration Technology – Intel (IOAT)
[Diagram: CPUs, memory, chipset, and LAN adapter, with data-path steps numbered 1 to 4]
Benefit
1) Few if any data copies by the CPU
2) The first version will only help receive performance, since the offloaded copies apply only to frames moving from TCP/IP space to application space
[Diagram: three memory controllers, each on a parallel memory bus populated with DRAM devices, at 400 MHz, 533 MHz, and 667 MHz]
Not representative of any particular system. Diagram is intended to illustrate speed and DIMM count limitations.
And We Still Have The Capacity vs. Speed Trade-off
FBDIMM Solves This Problem With Serial Memory Bus And On-DIMM Advanced Memory Buffer (AMB)
[Diagram: memory controller connected to the DIMMs over a serial address bus and a serial data bus; same DDR2 DRAM technology]
FBDIMM Serial Bus Adds Latency Due to Hops
[Diagram: address and data passing hop by hop between the memory controller and the DIMMs on the serial address bus and serial data bus]
Additional Memory Channels = Greater Capacity And Greater Throughput Which Offsets Additional Latency Under Load
[Diagram: a DDR2 memory controller and an FBD memory controller, each with its DRAM devices. Labels: Greater Memory Bandwidth; Less Memory Bandwidth]
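The link between channel count and throughput can be checked with simple arithmetic. The sketch below assumes the usual 64-bit (8-byte) data path per DDR2 channel and treats the 400, 533, and 667 figures from the memory bus diagram as mega-transfers per second; the channel counts are hypothetical, and the results are theoretical peaks rather than measurements.

```python
BYTES_PER_TRANSFER = 8  # assumption: 64-bit data path per memory channel


def peak_bandwidth_gb_s(mega_transfers_per_s: float, channels: int) -> float:
    """Theoretical peak bandwidth in GB/s (1 GB = 10**9 bytes)."""
    bytes_per_s = mega_transfers_per_s * 1e6 * BYTES_PER_TRANSFER * channels
    return bytes_per_s / 1e9


if __name__ == "__main__":
    for rate in (400, 533, 667):
        for channels in (2, 4):  # hypothetical channel counts
            print(f"{rate} MT/s x {channels} channels: "
                  f"{peak_bandwidth_gb_s(rate, channels):.1f} GB/s peak")
```

In this model the peak grows linearly with the number of channels, which is why adding channels can offset the extra latency under load.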
Measured DDR2 vs. FBD Memory Throughput
[Chart: Memory Throughput for DDR2 vs. FBD. Y axis: memory throughput, bytes/sec (0 to 6000). X axis: Sequential Reads, Random Reads. Series: 3.2GHz Xeon DDR2; 3.0GHz Woodcrest FBD. Annotations: 39% increase; 2.8x increase]
CPU Bottleneck Performance Fundamentals
• Core Intensive - Processor is executing instructions as fast as CPU core can process
• Latency Intensive - Processor is executing instructions as fast as memory latency allows
• Bandwidth Intensive - Processor is executing instructions as fast as memory bandwidth allows
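One way to read the three categories is as three separate time estimates, with the largest one naming the bottleneck. The sketch below is a deliberately simplified model: the function, its parameters, and the example numbers are all hypothetical, and real processors overlap these costs to some degree.

```python
def classify_bottleneck(instructions: float, instr_per_s: float,
                        dependent_misses: float, miss_latency_s: float,
                        bytes_moved: float, bandwidth_bytes_per_s: float) -> str:
    """Name the largest of three simplified time estimates."""
    times = {
        # Time to execute the instructions if memory never stalled the core.
        "core intensive": instructions / instr_per_s,
        # Time spent waiting on cache misses that cannot be overlapped.
        "latency intensive": dependent_misses * miss_latency_s,
        # Time to stream the data at the sustained memory bandwidth.
        "bandwidth intensive": bytes_moved / bandwidth_bytes_per_s,
    }
    return max(times, key=times.get)


if __name__ == "__main__":
    # Made-up workload: 1e10 instructions, 1e6 dependent misses of 100 ns,
    # and 1 GB of memory traffic on a 5 GB/s memory system.
    print(classify_bottleneck(1e10, 1e10, 1e6, 100e-9, 1e9, 5e9))
```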
[Diagram: Potential Processor Bottlenecks, with three regions: Core Intensive, Bandwidth Intensive, Latency Intensive]
Dual Core System Design: Xeon vs. Opteron Performance Fundamentals
• Core Intensive - Processor is executing instructions as fast as CPU core can process
• Latency Intensive - Processor is executing instructions as fast as memory latency allows
• Bandwidth Intensive - Processor is executing instructions as fast as memory bandwidth allows
[Diagram: Potential Processor Bottlenecks, with three regions: Core Intensive, Bandwidth Intensive, Latency Intensive. Annotations: Woodcrest, Clovertown and Tulsa win by as much as 20+%; X3 Xeon wins; Woodcrest and Opteron about the same; Opteron wins by as much as 2X]
Xeon Coherency Protocol – CPU Snoop Request
[Diagram: memory controller with memory bridges, PCI buses, and I/O controllers (USB, IDE, SATA, etc.). Labels: Cache Miss Read Data; Snoop All Other Processor Caches]
Xeon Coherency Protocol – CPU Snoop Response
[Diagram: memory controller with memory bridges, PCI buses, and I/O controllers (USB, IDE, SATA, etc.). Label: Only Now Can Processor Operate on Data!]
Xeon Coherency Protocol – DMA Snoop Request
[Diagram: memory controller with memory bridges, PCI buses, and I/O controllers (USB, IDE, SATA, etc.). Labels: DMA Read Data; Snoop All Processor Caches]
Xeon Coherency Protocol – DMA Snoop Response
[Diagram: memory controller with memory bridges, PCI buses, and I/O controllers (USB, IDE, SATA, etc.). Labels: Snoop Responses Returned; Only Now Can Memory Be Accessed!]
AMD Architecture – Local Memory Access
[Diagram: four AMD Opteron CPUs (CPU 0 to CPU 3) linked by 6.4GB/s coherent HyperTransport, with PCI-X bridges (100 MHz and 133 MHz) attached over HyperTransport. Label: Processor Cache Miss – Local Read]
1. Local memory read happens fast. This low latency is well publicized.
AMD Architecture – Local Memory Access
[Diagram: four AMD Opteron CPUs (CPU 0 to CPU 3) linked by 6.4GB/s coherent HyperTransport, with PCI-X bridges (100 MHz and 133 MHz) attached over HyperTransport. Labels: Snoop CPU 1,3; Snoop CPU 2; Snoop CPU 3; Snoop Response From CPU1; Snoop Response From CPU2; Snoop Response From CPU3]
1. Local memory read happens fast. This low latency is well publicized.
2. But the processor cannot use the data until ALL snoops complete.
3. In a 4-way system there are always two hops for snoops.
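The Opteron snoop behaviour amounts to a simple rule: the data becomes usable only when both the local memory read and the slowest snoop response have arrived. The sketch below models that rule. The latencies are made-up placeholders, and the hop counts are an assumption drawn from the statement that a 4-way system always has two hops for snoops (here the farthest CPU is taken to be two hops away).

```python
from typing import Mapping


def usable_latency_ns(local_read_ns: float, snoop_hops: Mapping[str, int],
                      hop_ns: float) -> float:
    """Time until the requester may use the data: the later of the local
    memory read and the slowest snoop round trip (request plus response)."""
    slowest_snoop_ns = max(
        (2 * hops * hop_ns for hops in snoop_hops.values()), default=0.0
    )
    return max(local_read_ns, slowest_snoop_ns)


if __name__ == "__main__":
    # Hypothetical topology as seen from CPU 0: two neighbours, one far CPU.
    hops = {"CPU1": 1, "CPU3": 1, "CPU2": 2}
    print(usable_latency_ns(local_read_ns=60.0, snoop_hops=hops, hop_ns=25.0))
```

With these placeholder numbers the two-hop snoop, not the local read, sets the time at which execution can proceed.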
AMD Architecture – Local Memory Access
[Diagram: four AMD Opteron CPUs (CPU 0 to CPU 3) linked by 6.4GB/s coherent HyperTransport, with PCI-X bridges (100 MHz and 133 MHz) attached over HyperTransport. Labels: Read Complete And Usable; Only now can execution proceed!]
Processor Futures
• Both AMD and Intel have significant processor architecture changes happening soon
• AMD – Next Generation Processors
– Rev F (Dual Core)
– Barcelona (Quad Core)
• Intel – Core Micro-Architecture Processors
– Woodcrest (Dual Core)
– Clovertown (Quad Core)
• Intel Xeon MP
– Tulsa (Dual Core)
• Intel MP based on Core Micro-Architecture
– Tigerton (Quad-Core)
Opteron Next Gen Processors Add Faster DDR2 Memory
[Diagram: three Opteron processors with DDR2 memory controllers and their DRAM devices. Labels: Rev E 266/333MHz DDR1 -> 400/533 MHz DDR2; Rev E 400MHz DDR1 -> 667 MHz DDR2; Rev E 400MHz DDR1 -> 800 MHz DDR2]
Opteron Dual-Core Design
Cache to Cache data sharing is done through crossbar switch.
[Diagram: AMD Opteron™ Architecture. CPU0 and CPU1, each with a 1MB L2 Cache, connect through the System Request Interface and Crossbar Switch to the Memory Controller and HyperTransport links HT0, HT1, HT2]
AMD Opteron Quad-Core Design: Barcelona
[Diagram: AMD Opteron™ Architecture. Four cores with L2 caches, an L3 Cache, the System Request Interface, the Crossbar Switch, the Memory Controller, and HyperTransport links HT0, HT1, HT2]
Quad Core Design: Adds L3 Cache
Xeon 5100 Series (Woodcrest) DP Architecture
Wide Dynamic Execution (from: http://www.intel.com/technology/architecture/coremicro/#anchor2)
• Executes 4 instructions per clock cycle compared to 3 instructions per cycle for NetBurst
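As a back-of-the-envelope illustration of the bullet above, peak instruction throughput is issue width times clock rate. The sketch below pairs the two widths from the slide with the 3.2 GHz and 3.0 GHz clocks quoted in the memory throughput chart; pairing those clocks with these microarchitectures is an assumption, and sustained throughput on real code is lower than this theoretical peak.

```python
def peak_mips(instructions_per_cycle: int, clock_mhz: int) -> int:
    """Theoretical peak in millions of instructions per second."""
    return instructions_per_cycle * clock_mhz


if __name__ == "__main__":
    netburst = peak_mips(3, 3200)  # 3 per cycle at an assumed 3.2 GHz
    core = peak_mips(4, 3000)      # 4 per cycle at an assumed 3.0 GHz
    print(f"NetBurst peak: {netburst} MIPS")
    print(f"Core peak:     {core} MIPS")
    print(f"Ratio:         {core / netburst:.2f}x")
```

Under these assumptions the wider core has the higher peak despite the lower clock.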
[Diagram: NetBurst vs. Core Microarchitecture]
Xeon vs. Core™ Dual-Core Design
Intel Xeon Dual-Core Architecture: cache to cache data sharing was done through the bus interface (slow)
[Diagram: CPU0 and CPU1, each with its own 2MB L2 Cache, connected to a Bus Interface]
Intel Core™ Architecture: cache to cache data sharing is now done through the shared cache
[Diagram: CPU0 and CPU1 with a 4 MB Shared Cache, connected to a Bus Interface]
• In Xeon 5100 Series (Woodcrest), the L2 cache can be dynamically shared: if one processor needs all of the cache it can use it, or the cache can be shared equally
Intel Core™ Quad-Core Design
[Diagram: two dual-core dies, CPU0/CPU1 and CPU2/CPU3, each with a 4 MB Shared Cache and a Bus Interface, both attached to the FSB]
Clovertown is basically two Woodcrest dies in a single multi-chip module (MCM)
The MCM approach allows an easy transition and better yields than a monolithic die
The two dies must use the FSB interface for cache to cache communication
Die to die data sharing is done through the bus interface (slow)
Intel Caneland MP Platform
to X4 – Architectural Improvements
[Diagram: two block diagrams, each with EM64T processors, I/O bridges, a scalability controller with scalability ports, and a memory controller with memory interfaces; one shows memory cards 1 to 4 holding Nova 0 to Nova 7. Bandwidth labels: 3x @ 34.1GB/s vs. 10.6 GB/s; 2x @ 42.6GB/s vs. 21.3 GB/s; 1.6x @ 10.4GB/s vs. 6.4 GB/s]
• Quad FSB architecture delivers increased memory bandwidth
• IBM bus technology provides optimal memory read and write bandwidth
• Increased scalability port frequency for higher scalable bandwidth
• Lower loaded latency across the board
2 Socket Product Positioning Today – AMD Dual core and Intel Quad core
[Chart: workloads positioned between INTEL and AMD along two axes, Core Processing and Memory Bandwidth/Capacity. Products: Clovertown; Next Gen - RevF. Workloads: Integer Processing; HPC – core intensive; BPC – core intensive; Web-serving; Java; Database; Collaboration; Virtualization; File and Print; HPC – bandwidth intensive; EDA; Video Streaming; Media encode/decode; BPC – bandwidth intensive; Large memory set workloads; Data mining; SAP]
2 Socket Product Positioning 3Q07 – AMD Quad core and Intel Quad core
[Chart: workloads positioned between INTEL and AMD along two axes, Core Processing and Memory Bandwidth/Capacity. Products: Clovertown; RevF-Quad Core. Workloads: Integer Processing; HPC – core intensive; BPC – core intensive; Web-serving; Java; Database; Collaboration; Virtualization; File and Print; HPC – bandwidth intensive; EDA; Video Streaming; Media encode/decode; BPC – bandwidth intensive; Large memory set workloads; Data mining; SAP]
Thank you!