NFS/RDMA over IB under Linux Charles J. Antonelli Center for Information Technology Integration University of Michigan, Ann Arbor February 7, 2005 (portions copyright Tom Talpey and Gary Grider)


NFS/RDMA over IB under Linux

Charles J. Antonelli
Center for Information Technology Integration
University of Michigan, Ann Arbor
February 7, 2005
(portions copyright Tom Talpey and Gary Grider)


Agenda

  • NFSv2,3,4
  • NFS/RDMA
  • Linux NFS/RDMA server
  • NFS Sessions
  • pNFS and RDMA


NFSv2,3

  • One of the major software innovations of the 80’s
  • Open systems
      • Open specification
  • Remote procedure call (RPC)
      • Invocation across machine boundaries
      • Support for heterogeneity
  • Virtual file system interface (VFS)
      • Abstract interface to file system functions
      • Read, write, open, close, etc.
  • Stateless server
      • Ease of implementation
      • Obviates lack of server reliability


Problems with NFSv2,3

  • Naming
      • Under client control (automounter helps)
  • Scalability
      • Caching is hard to get right
  • Consistency
      • Three-second rule
  • Performance
      • Chatty protocol


Problems with NFSv2,3

  • Access control
      • Trusted client
      • Identity agreement
  • Locking
      • Outside the NFS protocol specification
  • System administration
      • No tools for backend management
      • Proliferation of exported workstation disks


NFSv4

  • Major components
      • Export management
      • Compound RPC
      • Delegation
      • State and locks
      • Access control lists
      • Security: RPCSEC_GSS


NFSv4


Export Management

NFSv4 pseudo fs allows the client to mount the server root, and browse to discover offered exports

No more mountd

Access into an export is based on the user’s credentials

Obviates /etc/exports client list


Compound RPC

  • Designed to reduce wire traffic
  • Multiple operations per request:
      • Compound RPC
          • PUTROOTFH
          • LOOKUP
          • GETATTR
          • GETFH

“Start with the pseudo fs root, lookup mount point path name, and return attributes and file handle.”
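
The four-operation request above can be modeled as a short sketch. The operation names come from the slide; the in-memory file system, the handle strings, and the status strings are hypothetical stand-ins, not the NFSv4 wire encoding. The sketch relies on two properties of COMPOUND: operations share a current filehandle, and evaluation stops at the first operation that fails.

```python
# Illustrative model only: the server state, handle values, and status
# strings below are hypothetical.
FAKE_FS = {"/": {"export": "fh-export"}}            # directory -> {name: handle}
ATTRS = {"fh-export": {"type": "dir", "mode": 0o755}}


def run_compound(ops):
    """Evaluate ops in order against one current filehandle; stop at first error."""
    current_fh = None
    results = []
    for op, arg in ops:
        if op == "PUTROOTFH":
            current_fh, status, value = "/", "OK", None
        elif op == "LOOKUP":
            child = FAKE_FS.get(current_fh, {}).get(arg)
            if child is None:
                results.append((op, "NOENT", None))
                break                                # later ops are never run
            current_fh, status, value = child, "OK", None
        elif op == "GETATTR":
            status, value = "OK", ATTRS.get(current_fh)
        elif op == "GETFH":
            status, value = "OK", current_fh
        else:
            results.append((op, "NOTSUPP", None))
            break
        results.append((op, status, value))
    return results


reply = run_compound([("PUTROOTFH", None), ("LOOKUP", "export"),
                      ("GETATTR", None), ("GETFH", None)])
```

One round trip carries all four results, where NFSv3 would need a separate mount-protocol exchange plus individual LOOKUP and GETATTR calls.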


Delegation

  • Server issues delegations to clients
      • A read delegation on a file is a guarantee that no other clients are writing to the file
      • A write delegation on a file is a guarantee that no other clients are accessing the file
  • Reduces revalidation requirements
      • Not necessary for correctness
      • Intended to reduce RPC requests to the server
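
The two guarantees can be read as a compatibility rule, sketched below. This is a toy model under a simplifying assumption: it tracks only delegations, whereas a real server would also consider ordinary opens by other clients. The class and method names are hypothetical and do not reflect the Linux server's data structures.

```python
from collections import defaultdict


class DelegationTable:
    """Toy server-side bookkeeping for read and write delegations."""

    def __init__(self):
        self.readers = defaultdict(set)   # file -> clients holding read delegations
        self.writer = {}                  # file -> client holding the write delegation

    def grant(self, client, path, kind):
        """Return True if the delegation can be granted without breaking a guarantee."""
        others_read = self.readers[path] - {client}
        other_writer = self.writer.get(path) not in (None, client)
        if kind == "read" and not other_writer:
            self.readers[path].add(client)
            return True
        if kind == "write" and not other_writer and not others_read:
            self.writer[path] = client
            return True
        # A real server would recall the conflicting delegation via a callback.
        return False
```

Because delegations are not needed for correctness, returning False here only costs performance: the client falls back to ordinary revalidation.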


State and Locks

  • NFSv3 is an ostensibly stateless protocol
      • However, NFSv3 is typically used with a stateful auxiliary locking protocol (NLM)
  • NFSv4 locking is part of the protocol
      • No more lockd
  • LOCK operation sets up lock state
      • Client polls server when LOCK request is denied
  • NFSv4 servers also keep track of
      • Open files, mainly to support Windows share reservation semantics
      • Delegations
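
Client-side polling after a denied LOCK might look like the sketch below. `try_lock` is a hypothetical callable standing in for one LOCK request, and the exponential backoff is an assumption added here to keep polling from flooding the server; the slides do not specify a retry policy.

```python
import time


def acquire_lock(try_lock, attempts=5, delay=0.01, sleep=time.sleep):
    """Retry a denied LOCK request; return True once it is granted.

    try_lock: hypothetical callable returning True on grant, False on denial.
    sleep:    injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(attempts):
        if try_lock():
            return True
        sleep(delay * (2 ** attempt))   # wait longer after each denial
    return False
```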


State Management

  • Open file and lock state are lease-based
      • A lease is the amount of time a server will wait, while not receiving a state referencing operation from a client, before reaping the client’s state.
  • Delegation state is callback-based
      • A callback is a communication channel from the server back to the client
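
A minimal sketch of lease-based reaping follows. The class name, the method names, and the 90-second default are illustrative assumptions; the caller passes in the current time so the behavior is easy to follow.

```python
class LeaseTable:
    """Toy lease tracker: a client's state is reaped once its lease lapses."""

    def __init__(self, lease_seconds=90.0):   # illustrative default
        self.lease_seconds = lease_seconds
        self.last_seen = {}                   # client id -> time of last state-referencing op

    def renew(self, client, now):
        """Record a state-referencing operation from the client."""
        self.last_seen[client] = now

    def reap(self, now):
        """Drop and return the clients whose lease has expired."""
        expired = [c for c, t in self.last_seen.items()
                   if now - t > self.lease_seconds]
        for c in expired:
            del self.last_seen[c]
        return expired
```

A crashed client needs no explicit cleanup message: its state disappears on its own once the lease runs out.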


Access Control Lists

  • NFSv4 defines ACLs for file system objects
      • Richer and more granular than POSIX ACLs
      • Similar to NT ACLs
  • ACLs are showing up on local UNIX file systems


Security Model

  • Security added to RPC layer
  • RFC 2203 defines RPCSEC_GSS
      • Adds the GSSAPI to the ONC RPC
  • An application that uses the GSSAPI can "plug in" any security service implementing the API
  • NFSv4 mandates the implementation of Kerberos v5 and LIPKEY GSSAPI security mechanisms.
      • The combination of LIPKEY (and SPKM3) provides a security service similar to TLS


Existing NFSv4 Implementations

  • SUN Solaris client and server
  • Network Appliance multi-protocol server
      • NFSv4, NFSv3, CIFS
  • Hummingbird WinXXX client and server
  • CITI
      • Linux client and server
      • OpenBSD/FreeBSD client
  • EMC multi-protocol server
  • HPUX server
  • Guelph OpenBSD server
  • IBM AIX client and server


Future Implementations

  • Cluster-coherent NFS server
  • pNFS


NFS/RDMA

  • A way to run NFS v2/v3/v4 over RDMA
  • Greatly enhanced NFS performance
      • Low overhead
      • Full bandwidth
      • Direct I/O – true zero copy
  • Implemented on Linux
      • kDAPL API
      • Client today, server soon


RPC layer approach

  • Implemented within RPC layer
  • New RPC transport type
  • Adds RDMA-transport specific header
  • “Chunks” direct data transfer between client memory and server buffers
  • Bindings for NFSv2/v3, also NFSv4
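
A chunk can be thought of as a small descriptor that names a region of registered memory, which the RDMA-specific header carries in place of the data itself. The sketch below is one way to picture that; the field names, the `describe` helper, and the 64 KB segment size are hypothetical and are not the header layout used by the implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    handle: int    # identifies a registered memory region (hypothetical field)
    offset: int    # starting position within that region
    length: int    # number of bytes the descriptor covers


def describe(buffer_len, handle, max_segment=65536):
    """Split one registered buffer into descriptors of at most max_segment bytes."""
    return [Chunk(handle, off, min(max_segment, buffer_len - off))
            for off in range(0, buffer_len, max_segment)]


chunks = describe(150000, handle=7)   # three descriptors covering 150000 bytes
```

Only these few integers travel in the RPC message; the payload moves separately, directly between the memory regions they describe.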


Implementation Layering

  • Client implemented as kernel RPC transport
  • Server approach similar
  • RDMA API: kDAPL
  • NFS client code remains unchanged
  • Completely transparent to application


Use of kDAPL

  • All RDMA interfacing is via kDAPL
  • Very simple subset of kDAPL 1.1 API
      • Connection, connection DTOs
      • Kernel-virtual or physical LMRs, RMRs
      • Small (1KB-4KB typical) send/receive
      • Large RDMA (4KB-64KB typical)
  • All RDMA read/write initiated by server


Potential NFS/RDMA Users

Anywhere high bandwidth, low overhead is important:

  • HPC/Supercomputing clusters
  • Database
  • Financial applications
  • Scientific computing
  • General cluster computing


Linux NFS/RDMA server

  • Project goals
      • RPC/RDMA implementation
          • kDAPL API
          • Mellanox IB
      • Interoperate with NetApp RPC RDMA client
      • Performance gain over TCP transport


Linux NFS/RDMA server

  • Approach
      • Divide RPC layer into unified state management and abstract transport layer
      • Socket-specific code replaced by general interface implemented by socket or RDMA transports
      • Similar to client RPC transport switch concept
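
The split can be illustrated with an abstract interface and one concrete transport. This is a conceptual sketch in Python rather than kernel C; the class and method names are hypothetical, and the in-memory loopback stands in for a socket or RDMA implementation.

```python
from abc import ABC, abstractmethod


class Transport(ABC):
    """General interface the RPC layer calls; method names are hypothetical."""

    @abstractmethod
    def recv(self):
        """Return the next request, or None when there is nothing more."""

    @abstractmethod
    def send(self, reply):
        """Deliver one reply."""


class LoopbackTransport(Transport):
    """Stand-in for a socket or RDMA transport, backed by in-memory lists."""

    def __init__(self, requests):
        self.requests = list(requests)
        self.sent = []

    def recv(self):
        return self.requests.pop(0) if self.requests else None

    def send(self, reply):
        self.sent.append(reply)


def serve(transport, handler):
    """Transport-agnostic loop: it never asks which transport it was given."""
    while (request := transport.recv()) is not None:
        transport.send(handler(request))
```

Adding a new transport then means writing one more subclass, with no change to `serve` or to the request handlers above it.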


Linux NFS/RDMA server

  • Implementation stages
      • Listen for and accept connections
      • Process inline NFSv3 requests
      • NFSv3 RDMA
      • NFSv4 RDMA


Listen for and accept connections

  • svc_makexprt
      • Similar to svc_makesock for socket transports
      • RDMA transport tasks:
          • Open HCA
          • Register memory
          • Create endpoint for RDMA connections


Listen for and accept connections

  • svc_xprt
      • Retains transport-independent components of svc_sock
      • Add pointer to transport-specific structure
      • Support for registering dynamic transport implementations (eventually)


Listen for and accept connections

  • Reorganize code into transport-agnostic and transport-specific blocks
  • Update calling code to specify transport


Process inline NFSv3 requests

  • RDMA-specific send and receive routines
  • All data sent inline via RDMA Send
  • Tasks
      • Register memory buffers for RDMA send
      • Manage buffer transmission by the hardware
      • Process RDMA headers


NFSv3 RDMA

  • Use RDMA Read and Write for large transfers
  • RPC page management
      • xdr_buf contains initial kvec and list of pages
      • Initial kvec holds RPC header and short payloads
      • Page list used for large data transfer
  • Server memory registration
      • All server memory pre-registered
      • Allows simpler memory management
      • May need revisiting wrt security
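
The head-plus-page-list split can be sketched as follows. This is a simplified model of the idea, not the kernel's xdr_buf structure; the `build_buf` name, the 4096-byte page size, and the 1024-byte inline limit are assumptions chosen for illustration.

```python
PAGE_SIZE = 4096   # assumed page size


def build_buf(header, payload, inline_limit=1024):
    """Return (head, pages).

    The head carries the RPC header plus any short payload; a longer
    payload is split across a list of page-sized pieces instead.
    """
    if len(payload) <= inline_limit:
        return header + payload, []
    pages = [payload[i:i + PAGE_SIZE]
             for i in range(0, len(payload), PAGE_SIZE)]
    return header, pages


head, pages = build_buf(b"H" * 40, b"x" * 10000)   # 40-byte head, three pages
```

Keeping bulk data in whole pages is what lets the transport hand those pages to RDMA Read or Write without first copying them into a contiguous buffer.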


NFSv3 RDMA

  • Client write
      • Server issues RDMA Read from client-provided read chunks
      • Server reads into xdr_buf page list
      • Similar to socket-based receive for ULP
  • Client read
      • Server issues RDMA Write into client-provided write chunks
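
The direction of each transfer is easy to get backwards, so here is a small simulation: an NFS WRITE makes the server perform an RDMA Read, and an NFS READ makes it perform an RDMA Write. Client memory is modeled as a dictionary of byte arrays and chunks as `(handle, offset, length)` tuples; all names are hypothetical and no real RDMA API is involved.

```python
client_memory = {}   # handle -> bytearray, standing in for registered client buffers


def rdma_read(handle, offset, length):
    """Server pulls bytes out of client memory (used for an NFS WRITE)."""
    return bytes(client_memory[handle][offset:offset + length])


def rdma_write(handle, offset, data):
    """Server pushes bytes into client memory (used for an NFS READ)."""
    client_memory[handle][offset:offset + len(data)] = data


def serve_nfs_write(file_store, name, read_chunks):
    # The request carried only chunk descriptors; the server fetches the data.
    file_store[name] = b"".join(rdma_read(h, o, n) for h, o, n in read_chunks)


def serve_nfs_read(file_store, name, write_chunks):
    data, pos = file_store[name], 0
    for h, o, n in write_chunks:
        piece = data[pos:pos + n]
        rdma_write(h, o, piece)
        pos += len(piece)
    return pos   # bytes placed directly into client buffers
```

In both cases the server initiates the RDMA operation; the client only advertises where its buffers are.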


NFSv3 RDMA

  • Reply chunks
      • Applies when client requests generate replies that are too large for RDMA Send
      • Server issues RDMA write into client-supplied buffers
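
The choice the server faces for each reply can be written as a small decision function. The 1024-byte inline limit, the function name, and the returned labels are assumptions for illustration; the actual limit depends on the receive buffers the two ends have set up.

```python
INLINE_LIMIT = 1024   # assumed maximum RDMA Send payload, in bytes


def place_reply(reply, reply_chunk_len):
    """Pick how the server returns a reply.

    reply_chunk_len: size of the client-supplied reply buffer, or 0 if the
    client did not provide one.
    """
    if len(reply) <= INLINE_LIMIT:
        return "send-inline"
    if len(reply) <= reply_chunk_len:
        return "rdma-write-reply-chunk"
    return "error-reply-too-large"
```

A client therefore has to anticipate which requests may produce long replies, a large directory listing for example, and supply a reply buffer up front.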


NFSv4 RDMA

  • NFSv4 layered on RPC/RDMA
  • Task:
      • export modifications for RDMA transport


NFSv4.1 Sessions

  • Adds a session layer to NFSv4
  • Enhances protocol reliability
      • Accurate duplicate request caching
      • Bounded resources
  • Provides transport diversity
      • Trunking, multipathing

http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-00.txt
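
One way to get both accurate duplicate request caching and bounded resources is a fixed table of numbered slots, each remembering one sequence number and one cached reply. The sketch below assumes that design; the class and parameter names are hypothetical and the draft's wire format is not reproduced.

```python
class Session:
    """Toy session with a fixed number of slots; each slot caches one reply."""

    def __init__(self, slots, execute):
        self.slots = [{"seq": 0, "reply": None} for _ in range(slots)]
        self.execute = execute            # hypothetical callable that runs a request

    def call(self, slot_id, seq, request):
        slot = self.slots[slot_id]        # IndexError if the client exceeds its slots
        if seq == slot["seq"] and slot["reply"] is not None:
            return slot["reply"]          # retransmission: replay, do not re-execute
        if seq != slot["seq"] + 1:
            raise ValueError("misordered sequence number")
        slot["seq"], slot["reply"] = seq, self.execute(request)
        return slot["reply"]
```

The cache can never grow beyond one reply per slot, and a retransmitted non-idempotent request is answered from the cache rather than executed twice.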


pNFS basics

  • Separation of data and control, so NFS metadata requests go through NFS and data requests flow directly to devices (OBSD, Block/iSCSI, file)
  • This allows an NFSv4.X-pNFS client to be a native client to Object/SAN/data-filer file system and scale efficiently.
  • Limits the need for custom VFS clients for every version of every OS/kernel known to mankind
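
The separation of control and data paths can be pictured with a toy client. Everything here is hypothetical: the metadata server is a dictionary mapping a file name to a list of `(device, key)` locations, and the devices are dictionaries of byte strings. Real pNFS describes data placement in protocol-specific terms that this sketch does not attempt to model.

```python
def read_file(metadata_server, devices, name):
    """Control path: ask the metadata server where the data lives.
    Data path: fetch each piece straight from the device that holds it.
    """
    locations = metadata_server[name]                    # list of (device id, key)
    return b"".join(devices[dev][key] for dev, key in locations)


# Hypothetical two-device setup holding one file in two pieces.
mds = {"results.dat": [("dev0", "results.0"), ("dev1", "results.1")]}
devs = {"dev0": {"results.0": b"abc"}, "dev1": {"results.1": b"def"}}
data = read_file(mds, devs, "results.dat")
```

The metadata server is consulted once per file, while the bulk bytes never pass through it, which is what allows the data path to scale with the number of devices.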


pNFS and RDMA

  • NFSv4.x client with RDMA gives us a low-latency, low-overhead path for metadata (via RPC/RDMA layer)
  • pNFS gives us parallel paths for data direct to the storage devices or filers (for OBSD, block, and file methods)
      • For the file method, RPC/RDMA provides a standards-based data path to the data filer
      • For the block method, iSCSI/iSER or SRP could be used; this provides a standards-based data path (though it lacks transactional security)
      • For the OBSD method, since ANSI OBSD is iSCSI extended, if OBSD/iSCSI/iSER all get along, this provides a standards-based data path that is transactionally secure


pNFS and RDMA

With the previous two items, combined with other NFSv4 features like leasing, compound RPCs, etc., we have a first-class, standards-based file system client that gets native device performance, all provided by NFSv4.XXX and capable of effectively using any global parallel file system

AND ALL WITH STANDARDS!


pNFS and RDMA

We really need all this work to be enabled on both Ethernet and InfiniBand and to be completely routable between the two media.

  • Will higher-level apps that become RDMA aware be able to use both Ethernet and InfiniBand, and mixtures of both, transparently?
  • Will NFSv4 RPC/RDMA, iSCSI, and SRP be routable between media?


CITI

Developing NFSv4 reference implementation since 1999

NFS/RDMA and NFSv4.1 Sessions since 2003

Funded by Sun, Network Appliance, ASCI, PolyServe, NSF

http://www.citi.umich.edu/projects/nfsv4/


Key message

Give us kDAPL


Any questions?

http://www.citi.umich.edu/