21
© 2017 VMware Inc. All rights reserved. Resource disaggregation for the 99% Irina Calciu, Aasheesh Kolli, Jayneel Gandhi, Stanko Novakovic, Marcos K. Aguilera, Rajesh Venkatasubramanian, Pratap Subrahmanyam WAMS @ ASPLOS 2018

Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

© 2017 VMware Inc. All rights reserved.

Resource disaggregation for the 99%

Irina Calciu, Aasheesh Kolli, Jayneel Gandhi,

Stanko Novakovic, Marcos K. Aguilera,

Rajesh Venkatasubramanian, Pratap Subrahmanyam

WAMS @ ASPLOS 2018

Page 2: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

1

10

100

1000

1997 2007 2017

log

scal

eRatio of network to memory latency

Page 3: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Resource disaggregation

3

Logical disaggregationRequires new/expensive hw

Benefits(1) More resources w/ less overprovisioning(2) Better utilization and sharing(3) Decoupling of memory and compute

Page 4: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Logical resource disaggregation

1. Programming model

2. Resource management

3. Security

4. Performance

4

Page 5: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Programming model

5

Complete transparency

PerformanceSimplicity

App makes all decisions

Page 6: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Programming model

• Shared pool of memory• Coherent within host• Non-coherent across hosts

• Threads local and remote• Hierarchical rack-level scheduler

• Some failures exposed to the application

Page 7: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Failure management (component, host)

7

• What failures can we solve transparently?• What failures can we ignore?• What failures should we report to the application? • How to contain failures?

Page 8: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Logical resource disaggregation

1. Programming model

2. Resource management

3. Security

4. Performance

8

Page 9: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Logical resource disaggregation

2. Resource management

• Memory management

• Compute management

• Devices

• Isolation

9

Page 10: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Remote memory [SoCC 2017]

RAM RAM

Fast network

CPU+NICCPU+NIC

BenefitsBetter memory utilization

Big memoryEfficient sharing

10

Page 11: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

How does remote memory work?

RAM

CPU+NIC

cacheremotepages

RDMA

virtualmemory

functionality

Application host Application hostRack

resourcemanager

RAM

CPU+NIC

remotepages

11

Page 12: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

How does remote memory work?

RAM

CPU+NIC

flushdirtypages

RDMA

virtualmemory

functionality

Application host Application hostRack

resourcemanager

RAM

CPU+NIC

remotepages

12

Page 13: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Remote memory

13

• Cache remote pages:• page fault + RDMA

• Flush dirty pages: • Write protect page• Page fault on write • Dirty data amplification

Page 14: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Granularity

14

RDMA latency;Dirty data amplification

Memory management(Local and remote)Data movement

Metadata size; Translation overheads

Page 15: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Do we need new hardware?

15

• Large software overhead• Fix system software first

• Possible: Cache-coherent FPGAs • Network attached • Could enable more efficient data movement

• Good platform for experimenting with new hw

Page 16: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Technical challenges

16

1. Virtual memory overheads2. Dirty data amplification3. Local cache management4. Remote memory allocation5. Sharing model6. API & transparency level7. Remote host crashes8. Network slow or congested9. Non-uniform latency10. Virtual machine indirection11. Remote host compromised

Page 17: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Logical resource disaggregation

2. Resource management

• Memory management

• Compute management

• Devices

• Isolation

17

Page 18: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Scaling compute

18

• Create local thread• Create remote thread• Shared memory across remote threads

Page 19: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Heterogeneous compute

19

• Many heterogeneous compute elements• Heterogeneous cores, FPGAs, GPUs, etc.• Accessing accelerators across the rack• Scheduling across heterogeneous compute elements

Page 20: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

Conclusion

Still, many technical challenges for the community

20

Resource disaggregation is becoming feasible, but still risky

and expensive for most companies

Logical resource disaggregation presents most of the benefits,

at a lower cost

Page 21: Resource disaggregation for the 99%workshops.inf.ed.ac.uk/wams/slides/wams2018-resdisag.pdffor the 99% Irina Calciu, AasheeshKolli, JayneelGandhi, StankoNovakovic, Marcos K. Aguilera,

© 2017 VMware Inc. All rights reserved.

Thank you!

https://research.vmware.com/[email protected]

We are looking for students with FPGA expertise for full-time/internship roles

We’re hiring!