
[9-6-2016] Openlab Poster-v3


CERN, June 2016

[Diagram labels: ALFA, FairMQ, nanomsg, ofi:// Transport, libFabric, usNIC]

Introduction

For the ALICE O2 upgrade, the simulation and reconstruction software of the ALICE experiment uses the ALFA [1] framework.

Among other abstractions, the ALFA/FairROOT framework provides FairMQ [2], a message-queue abstraction library that is a lightweight wrapper around the ØMQ and NanoMsg libraries.

Since the project's goal is to implement a new transport for FairMQ, we decided to extend the functionality of one of these libraries.

We chose NanoMsg [3] because of its clean and modular internals.

[Figure: RDMA* message exchange. On each side the buffer sits in a Memory Region. One side calls fi_send( endpoint, buffer, len, mr_desc, context ); and the other calls fi_recv( endpoint, buffer, len, mr_desc, context ); a SEND and an ACK pass between them, and each side has a Tx CQ and an Rx CQ that it reads with fi_cq_read( &event );]

* libfabric has custom event polling functions

usNIC + Kernel Bypass

One of the powerful features of the usNIC fabric is that, when used through the libFabric [4] library, it lets user-space code bypass the Linux kernel.

This relieves the kernel of the IP stack overhead and frees its CPU time for more useful operations.

The ofi:// Transport

The project is implemented as a patch to the NanoMsg sources [5] that introduces the Open Fabrics Interface (OFI) transport.

The transport translates the POSIX-like API of NanoMsg into an RDMA-like API for libFabric, transparently to the user. To achieve this, it uses a dynamic memory registration (MR) mechanism that tries to reduce the number of registrations performed, while remaining agnostic of the user's intentions.
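One way to picture a registration-reducing mechanism is a small cache keyed by address range: a buffer that falls inside an already registered region reuses it, and only a miss pays for a new registration. The sketch below is an illustration under assumptions, not the transport's actual code. The names `mr_cache`, `mr_cache_get` and `mr_register_stub`, the cache size and the least-recently-used eviction policy are all hypothetical. With libfabric the stub would correspond to a real registration call such as `fi_mr_reg`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MR_CACHE_SLOTS 4            /* hypothetical cache size */

/* One cached registration: the address range it covers and a last-use stamp. */
struct mr_entry {
    uintptr_t base;
    size_t    len;
    unsigned  last_used;            /* 0 means the slot is empty */
};

struct mr_cache {
    struct mr_entry slot[MR_CACHE_SLOTS];
    unsigned clock;                 /* incremented on every lookup */
    unsigned registrations;         /* how many times a registration was needed */
};

/* Stand-in for the expensive step (assumption: a real transport would
 * register the region with the fabric here). It only counts calls. */
void mr_register_stub(struct mr_cache *c)
{
    c->registrations++;
}

/* Return the slot whose registration covers [buf, buf + len).
 * On a miss, evict the least recently used slot and register anew. */
int mr_cache_get(struct mr_cache *c, const void *buf, size_t len)
{
    uintptr_t start = (uintptr_t)buf;
    int victim = 0;

    assert(len > 0);
    c->clock++;

    for (int i = 0; i < MR_CACHE_SLOTS; i++) {
        struct mr_entry *e = &c->slot[i];

        if (e->last_used != 0 && start >= e->base &&
            start - e->base <= e->len &&
            len <= e->len - (start - e->base)) {
            e->last_used = c->clock;    /* hit: reuse the registration */
            return i;
        }
        if (e->last_used < c->slot[victim].last_used)
            victim = i;                 /* remember the oldest (or empty) slot */
    }

    mr_register_stub(c);                /* miss: pay for one registration */
    c->slot[victim].base = start;
    c->slot[victim].len = len;
    c->slot[victim].last_used = c->clock;
    return victim;
}
```

In this sketch, sending many messages carved out of one large buffer costs a single registration, which is the effect the paragraph above describes.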

1. Technical Design Report for the Upgrade of the Online-Offline Computing System, The ALICE Collaboration
2. https://github.com/FairRootGroup/FairRoot/tree/master/fairmq
3. http://nanomsg.org
4. http://ofiwg.github.io/libfabric
5. https://github.com/wavesoft/nanomsg-transport-ofi
6. https://github.com/wavesoft/robob
7. https://github.com/ofiwg/libfabric/issues?q=author%3Awavesoft
8. https://github.com/nanomsg/nanomsg/pull/612

In a similar manner, it uses the high-level libFabric polling API instead of the FD-based NanoMsg polling API, which makes it possible to support any fabric without modification.
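The difference from FD-based polling is that the application asks a completion queue directly whether any operation has finished, rather than waiting on a file descriptor. The sketch below simulates that model with a ring buffer so it can run without a fabric. `sim_cq`, `sim_cq_push`, `sim_cq_read` and `CQ_EMPTY` are hypothetical names. In libfabric the analogous call is `fi_cq_read`, which returns the number of completions read, or `-FI_EAGAIN` when none are available.

```c
#include <assert.h>
#include <stddef.h>

#define CQ_DEPTH 8          /* hypothetical queue depth */
#define CQ_EMPTY (-1)       /* hypothetical "nothing ready" code */

/* One completion event: which operation finished and how many bytes. */
struct completion {
    void  *context;
    size_t len;
};

struct sim_cq {
    struct completion ring[CQ_DEPTH];
    unsigned head;          /* next entry to read */
    unsigned tail;          /* next entry to write */
};

/* Called by the simulated provider when an operation finishes.
 * Returns 0 on success, -1 if the queue is full. */
int sim_cq_push(struct sim_cq *cq, void *context, size_t len)
{
    if (cq->tail - cq->head == CQ_DEPTH)
        return -1;
    cq->ring[cq->tail % CQ_DEPTH].context = context;
    cq->ring[cq->tail % CQ_DEPTH].len = len;
    cq->tail++;
    return 0;
}

/* Non-blocking read of up to `count` completions.
 * Returns the number read, or CQ_EMPTY when nothing is ready. */
int sim_cq_read(struct sim_cq *cq, struct completion *out, int count)
{
    int n = 0;

    assert(count > 0);
    while (n < count && cq->head != cq->tail) {
        out[n++] = cq->ring[cq->head % CQ_DEPTH];
        cq->head++;
    }
    return n > 0 ? n : CQ_EMPTY;
}
```

A worker loop would call the read function repeatedly and treat the empty code as "try again later". No file descriptor is involved, which is why the same loop works whatever fabric sits underneath.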

True Zero-Copy

One of the implementation requirements of the OFI transport was to ensure that no memcpy operations take place between the user's request and the transfer on the wire. That is reasonable if you consider that the message sizes vary from 50Mb to 1Gb.
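The zero-copy requirement amounts to handing the user's own pointer down the stack and returning ownership only when the transfer has completed. The following is a minimal sketch of that ownership hand-off, not the transport's implementation. `zc_msg`, `zc_wire_pointer`, `zc_complete` and `zc_free_and_count` are hypothetical names chosen for the illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* A message that borrows the user's buffer instead of copying it. */
struct zc_msg {
    void  *data;                    /* the user's own buffer */
    size_t len;
    void (*release)(void *data);    /* run once the transfer has completed */
};

int zc_released = 0;                /* counts hand-backs, for the demo only */

/* Example release hook: free the buffer and count the call. */
void zc_free_and_count(void *data)
{
    free(data);
    zc_released++;
}

/* Hypothetical send step: returns the address that would be given to
 * the NIC. It is the user's pointer itself, so no memcpy is involved. */
const void *zc_wire_pointer(const struct zc_msg *m)
{
    return m->data;
}

/* Hypothetical completion step: give the buffer back exactly once. */
void zc_complete(struct zc_msg *m)
{
    if (m->data != NULL && m->release != NULL)
        m->release(m->data);
    m->data = NULL;
    m->len = 0;
}
```

The trade-off in this design is that the user must not touch the buffer between the send and the completion, since the hardware may still be reading from it.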

By-Products

A useful by-product of this project was the development of robob [6], a fully automated benchmarking utility for ensuring the quality of the measured values.

Outcomes

We frequently encountered roadblocks while working on this project, since we were using new products and open source components.

We had frequent interactions with NanoMsg and Cisco developers [7] and we contributed our own modifications [8].

Nonetheless, we managed to create a prototype with which we demonstrated the feasibility of the transport and its performance.

Below we present some preliminary measurements using the OFI transport between two Intel Xeon E5-2690 machines with UCSC-PCIE-C40Q NICs, connected through a switch with a 40 Gbit copper cable.

[Figure: Throughput (Gbit/s) for Different Message Sizes. x-axis: message size (8192, 16384, 32768, 65536, 1048576, 2097152, 4194304, 8388608, 16777216, 33554432, 67108864, 134217728); y-axis: 0 to 35; series: OFI [Gbit/s], TCP [Gbit/s], ØMQ [Gbit/s].]
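Because bits and bytes are easy to confuse when reading such a chart, the conversion is spelled out here. A 40 Gbit/s link carries at most 5 GB/s, so values up to 35 only make sense in Gbit/s. The helper names below are hypothetical, the message size is assumed to be in bytes, and any timings passed in are illustrative, not measurements from the poster.

```c
#include <assert.h>

/* Throughput in gigabits per second (1 Gbit = 1e9 bits). */
double gbit_per_s(unsigned long long bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes * 8.0 / 1e9 / seconds;
}

/* Throughput in gigabytes per second (1 GB = 1e9 bytes). */
double gbyte_per_s(unsigned long long bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)bytes / 1e9 / seconds;
}
```

For example, moving 5,000,000,000 bytes in one second corresponds to 40 Gbit/s, or equivalently 5 GB/s, which is the line rate of the link described above.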

www.cern.ch/openlab

Poster by Ioannis Charalampidis. Special thanks to our supervisor, Predrag Buncic, to Artur Barczyk for his guidance, to Mohammad Al-Turany and Peter Hristov