
Page 1: Prof. Uri Weiser, Technion

Professor Uri Weiser
Technion, Haifa, Israel

Handling Memory Accesses in Big Data Environment

Chipex 2016

The talk covers research done by: T. Horowitz, Prof. A. Kolodny, T. Morad, Prof. A. Mendelson, Daniel Raskin, Gil Shomron, Loren Jamal, Prof. U. Weiser

Page 2:

New Architecture Avenues in the Big Data Environment

The era of Heterogeneous computing: HW/SW tuned to fit the application; dynamic tuning; accelerators for performance and energy efficiency

Big Data = big: in general, non-repeated access to all the data

"Big Data" – what are the implications?

Page 3:

Heterogeneous computing: Application-Specific Accelerators

Continue the performance trend via architectures tuned to bypass current technological hurdles.

[Figure: accelerators and tuned architectures plotted as performance/power against the range of application behaviors]

Page 4:

New Architecture Avenues in the Big Data Environment

Heterogeneous computing – "tuning" HW to respond to specific needs

Example: Big Data memory access patterns and the potential savings

Reduction of data movement by bypassing DRAM

The bandwidth issue and a potential solution

Page 5:

Big Data usage of data

Input: unstructured data

Memory access: Read Once, Non-Temporal

The Funnel: β = BW_out / BW_in

Page 6:

Machine Learning

[Figure: pipeline from unstructured input data through (A) structuring/aggregation into structured data, (B) ML model creation, and (C) model usage at the client. Data structuring = ETL.]

Page 7:

Does Big Data exhibit a special memory access pattern?

It probably should, since revisiting ALL Big Data items would cause huge/slow data transfers from the data sources.

There are two access modes of memory operations:
- Temporal Memory Access
- Non-Temporal Memory Access

Many Big Data computations exhibit Non-Temporal Memory Accesses and/or a Funnel operation.

Page 8:

Non-Temporal Memory Access – initial analysis: Hadoop-grep single memory access pattern

~50% of Hadoop-grep's unique memory references are accessed only once
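The ~50% figure can be measured directly from an address trace: count how many unique addresses are referenced exactly once. A minimal sketch (the trace below is a toy example, not Hadoop-grep data):

```python
from collections import Counter

def single_access_fraction(trace):
    """Fraction of unique addresses that are referenced exactly once."""
    counts = Counter(trace)
    single = sum(1 for c in counts.values() if c == 1)
    return single / len(counts)

# Toy trace: 0x10 and 0x20 are reused (temporal); the rest are read once.
trace = [0x10, 0x20, 0x30, 0x10, 0x40, 0x20, 0x50, 0x60]
print(single_access_fraction(trace))  # 4 of 6 unique addresses -> ~0.67
```

Addresses with a count of one are exactly the "Read Once, Non-Temporal" references that gain nothing from caching.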

Page 9:

Non-Temporal Memory Accesses – preliminary results

WordCount – access to storage: non-temporal locality
Sort – access to storage: NO non-temporal locality

[Figure: WordCount and Sort I/O utilization over time – access rate [KB/s] vs. time [s]]

Page 10:

Where is energy wasted?

• DRAM
• Limited BW

Page 11:

From: Mark Horowitz (Stanford), "Computing's Energy Problem"
From: Bill Dally (NVIDIA and Stanford), "Efficiency and Parallelism, the Challenges of Future Computing"

Page 12:

Energy: DRAM

Page 13:

Memory Subsystem – copies

Data travels from source to destination through a chain of copies:

NV Storage (TBs, ~3 GB/sec)
→ Copy 1: DRAM / main memory (GBs, ~25 GB/sec)
→ Copy 2: LL Cache (10's of MBs, ~500 GB/sec)
→ Copy 3: L2 Cache (MBs)
→ Copy 4: L1 Cache (10's of KBs)
→ Copy 5: Registers (KBs, ~TB/sec) – destination
→ Core
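The cost of this copy chain can be sketched as energy per byte summed over the hops. The per-hop numbers below are illustrative assumptions (only the ~0.5 nJ/B DRAM figure appears later in the talk), not measurements:

```python
# Illustrative per-hop energies in nJ per byte; only the ~0.5 nJ/B DRAM
# figure comes from the talk, the rest are placeholders for the sketch.
hops_nj_per_byte = {
    "NV Storage -> DRAM": 0.5,   # DRAM write on the way in
    "DRAM -> LL Cache": 0.5,     # DRAM read on the way out
    "LL Cache -> L2$": 0.05,
    "L2$ -> L1$": 0.02,
    "L1$ -> Registers": 0.01,
}

# A read-once byte pays every hop exactly once and is then discarded.
total_nj = sum(hops_nj_per_byte.values())
print(f"~{total_nj:.2f} nJ per byte delivered")
```

Whatever the exact per-hop values, the DRAM traversal dominates, which is what motivates the bypass on the next slide.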

Page 14:

Memory Subsystem – DRAM bypass == DDIO

[Figure: the same copy chain as Page 13, but NV Storage (3-20 GB/sec) feeds the LL Cache directly, eliminating the DRAM copy]

Potential savings: @ 0.5 nJ/B (DRAM) and 10-20 GB/s NV BW → 5 W - 10 W

Reference: "Optimizing Read-Once Data Flow in Big-Data Applications", Morad, Shomron, Erez, Weiser, Kolodny, IEEE Computer Architecture Letters, 2016
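The 5 W - 10 W figure follows from power = (energy per byte) × (byte rate): at 0.5 nJ/B, every 1 GB/s of traffic that skips DRAM saves 0.5 W. A quick check (the function name is mine):

```python
def dram_bypass_savings_watts(nv_bw_gbps, dram_nj_per_byte=0.5):
    """Power saved by not staging read-once data in DRAM.
    Units: nJ/B * GB/s = (1e-9 J/B) * (1e9 B/s) = watts."""
    return dram_nj_per_byte * nv_bw_gbps

print(dram_bypass_savings_watts(10))  # 5.0 W at 10 GB/s NV bandwidth
print(dram_bypass_savings_watts(20))  # 10.0 W at 20 GB/s
```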

Page 15:

Bandwidth: when should we use a Funnel at the data source?

Page 16:

A: The bandwidth issue – the memory hierarchy is optimized for Temporal Locality; systems are built for temporal-locality accesses

[Figure: the hierarchy of Page 13 annotated with existing BW – highest bandwidth at the top (Registers/L1$/L2$ at TB/sec, LLC at 500 GB/sec, DRAM at 25 GB/sec, NV Storage at 3-20 GB/sec) – versus the BW desired by NTMA traffic at the bottom]

Page 17:

B: Memory accesses per operation impact BW (hint: memory accesses per operation)

Initial results:

Read Once – Non-Temporal Memory Accesses: SSD bandwidth = CPU bandwidth

Temporal Memory Accesses: separate CPU-bandwidth and SSD-bandwidth curves

[Figure: bandwidth [MB/s] and CPU utilization [%] vs. number of cores for both access modes]

Page 18:

Solution: flow of "Non-Temporal Data Accesses"

[Figure: temporal-locality accesses traverse the cache hierarchy (Registers, L1$, L2$, LLC, DRAM), while Non-Temporal Memory Accesses (NTMA) are served by the Funnel directly from NV Storage]

Use the Funnel when a bandwidth bottleneck occurs:
- "high" memory accesses per instruction
- limited BW
- non-temporal-locality memory accesses

*private communication with: Moinuddin Qureshi

Page 19:

"Funnel"ing "Read-Once" data in storage

Typical SSD architecture*

*Kang, Yangwook, Yang-suk Kee, Ethan L. Miller, and Chanik Park, "Enabling Cost-Effective Data Processing with Smart SSD", MSST 2013, pp. 1-12.
**K. Eshghi and R. Micheloni, "SSD Architecture and PCI Express Interface"

Page 20:

Analytical model of the Funnel

[Figure: data enters the Funnel at bandwidth BW_IN and leaves, post-processing, at BW_OUT]

β = BW_OUT / BW_IN
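In code form the model is just the β ratio: the funnel reads at BW_IN and forwards only BW_OUT = β · BW_IN to the host. A sketch (the function name and numbers are illustrative):

```python
def funnel_output_bw(bw_in_gbps, beta):
    """BW_OUT = beta * BW_IN, where beta = BW_OUT / BW_IN and 0 < beta <= 1."""
    assert 0 < beta <= 1, "a funnel only reduces bandwidth"
    return beta * bw_in_gbps

# A funnel that keeps 1 byte in 10 (beta = 0.1) in front of a 20 GB/s
# source leaves only 2 GB/s of post-processed data for the host link.
print(funnel_output_bw(20, 0.1))  # 2.0
```

The smaller β is (i.e., the more the funnel filters or aggregates), the less bandwidth the links above the data source must sustain.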

Page 21:

Proposed Architecture

Baseline configuration: SSD storage → PCIe → CPU; the CPU performs both the NTMA and the TMA work (B = Bandwidth)

Funnel configuration: a Funnel inside the SSD performs the NTMA work; the CPU performs only the TMA work

Page 22:

Funnel Performance

[Figure: performance and performance improvement of the Funnel configuration over the baseline as a function of β. The improvement is bounded by the 1/PCIe_BW and 1/SSD_BW terms, and drops as the CPU becomes the bottleneck.]
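The shape of these curves follows from a min-of-bottlenecks model: in the baseline, every raw byte crosses PCIe, while in the funnel configuration only a β fraction does. A sketch with illustrative bandwidth numbers (not measurements from the talk):

```python
def raw_input_rate(ssd_bw, pcie_bw, beta=1.0):
    """Sustained rate (GB/s) at which raw data can be consumed.
    Baseline: beta = 1 (the full raw stream crosses PCIe).
    Funnel: only a beta fraction of the raw stream crosses PCIe."""
    return min(ssd_bw, pcie_bw / beta)

ssd_bw, pcie_bw = 3.0, 1.0  # GB/s, illustrative values
baseline = raw_input_rate(ssd_bw, pcie_bw)               # PCIe-bound: 1.0
with_funnel = raw_input_rate(ssd_bw, pcie_bw, beta=0.1)  # SSD-bound: 3.0
print(with_funnel / baseline)  # 3.0x improvement
```

Once β is small enough that the SSD itself is the bottleneck, shrinking β further buys nothing, which matches the saturation of the improvement curve; a CPU-side limit would cap the rate the same way.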

Page 23:

Funnel Energy

[Figure: energy of the Funnel configuration vs. the baseline configuration as a function of β. The Funnel improves energy while the SSD BW or PCIe is the bottleneck; energy grows as the performance advantage shrinks and the CPU becomes the bottleneck, and the Funnel processor adds its own overhead.]

Page 24:

Solution: ?

Non-Temporal Memory Accesses should be processed as close as possible to the data source.
Data that exhibits Temporal Locality should use the current memory hierarchy.
Use Machine Learning (context aware*) to distinguish between the two phases.

Open questions:
- SW model
- Shared data
- HW implementation
- Computational requirements at the "Funnel"

*Reference: "Semantic Locality and Context-Based Prefetching", Peled, Mannor, Weiser, Etsion, ISCA 2015

Page 25:

Summary

Memory access is a critical path in computing.

The Funnel should be used to:
- Resolve systems' BW bottlenecks for specific applications
- Solve the system's BW issues for "Read Once" cases

Benefits:
- Reduction of data movement
- Freeing up the system's memory resources (re: Spark)
- Simple, energy-efficient engines at the front end

Open issues remain.
