NFS/RDMA over IB under Linux
Charles J. Antonelli
Center for Information Technology Integration
University of Michigan, Ann Arbor
February 7, 2005
(portions copyright Tom Talpey and Gary Grider)
Agenda
- NFSv2,3,4
- NFS/RDMA
- Linux NFS/RDMA server
- NFS Sessions
- pNFS and RDMA
NFSv2,3
- One of the major software innovations of the '80s
- Open systems
  - Open specification
- Remote procedure call (RPC)
  - Invocation across machine boundaries
  - Support for heterogeneity
- Virtual file system interface (VFS)
  - Abstract interface to file system functions: read, write, open, close, etc.
- Stateless server
  - Ease of implementation
  - Compensates for server unreliability: a rebooted server can resume serving immediately, and clients simply retry
Problems with NFSv2,3
- Naming
  - Under client control (the automounter helps)
- Scalability
  - Caching is hard to get right
- Consistency
  - Three-second rule (attributes cached for ~3 seconds, so clients may see stale data)
- Performance
  - Chatty protocol
Problems with NFSv2,3
- Access control
  - Trusted client
  - Identity agreement
- Locking
  - Outside the NFS protocol specification
- System administration
  - No tools for backend management
  - Proliferation of exported workstation disks
NFSv4
Major components:
- Export management
- Compound RPC
- Delegation
- State and locks
- Access control lists
- Security: RPCSEC_GSS
NFSv4
Export Management
- The NFSv4 pseudo fs allows the client to mount the server root and browse to discover the offered exports
  - No more mountd
- Access into an export is based on the user's credentials
  - Obviates the /etc/exports client list
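As a concrete illustration (a minimal sketch; paths are hypothetical, in the style of the Linux server of this era, where fsid=0 designates the pseudo fs root):

    # /etc/exports on the server: fsid=0 marks the pseudo fs root
    /export        *(rw,fsid=0,sync)
    /export/home   *(rw,sync)

    # On the client: mount the server root, then browse to the exports
    mount -t nfs4 server:/ /mnt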
Compound RPC
- Designed to reduce wire traffic
- Multiple operations per request:
Compound RPC
PUTROOTFH
LOOKUP
GETATTR
GETFH
“Start with the pseudo fs root, lookup mount point path name, and return attributes and file handle.”
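A rough C-style sketch of how such a request is organized on the wire, after the RFC 3530 XDR (the field names here are illustrative, not the literal definitions):

    /* A compound is a tag plus an ordered array of operations,
     * executed in sequence and stopped at the first failure. */
    struct compound_args {
            char              *tag;          /* label echoed in the reply */
            uint32_t           minorversion;
            uint32_t           argcount;     /* 4 for the example above */
            struct nfs_argop4 *argarray;     /* PUTROOTFH, LOOKUP,
                                                GETATTR, GETFH */
    };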
Delegation
- Server issues delegations to clients
  - A read delegation on a file guarantees that no other client is writing to the file
  - A write delegation on a file guarantees that no other client is accessing the file
- Reduces revalidation requirements
  - Not necessary for correctness
  - Intended to reduce RPC requests to the server
State and Locks

- NFSv3 is an ostensibly stateless protocol
  - However, NFSv3 is typically used with a stateful auxiliary locking protocol (NLM)
- NFSv4 locking is part of the protocol
  - No more lockd
- The LOCK operation sets up lock state
  - The client polls the server when a LOCK request is denied
- NFSv4 servers also keep track of
  - Open files, mainly to support Windows share reservation semantics
  - Delegations
State Management

- Open file and lock state are lease-based
  - A lease is the length of time a server will wait, without receiving a state-referencing operation from a client, before reaping that client's state
- Delegation state is callback-based
  - A callback is a communication channel from the server back to the client
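A minimal sketch of the resulting server-side check (structure and field names are illustrative, not taken from any implementation):

    /* A client's state may be reaped once its lease has run out.
     * last_renewal is refreshed by any state-referencing operation
     * (OPEN, LOCK, RENEW, READ with a stateid, ...). */
    int lease_expired(struct nfs4_client *clp, time_t now, time_t lease_time)
    {
            return now - clp->last_renewal > lease_time;
    }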
Access Control Lists

- NFSv4 defines ACLs for file system objects
  - Richer and more granular than POSIX ACLs
  - Similar to NT ACLs
- ACLs are showing up on local UNIX file systems
Security Model
- Security added to the RPC layer
  - RFC 2203 defines RPCSEC_GSS, which adds the GSS-API to ONC RPC
- An application that uses the GSS-API can "plug in" any security service implementing the API
- NFSv4 mandates implementation of the Kerberos v5 and LIPKEY GSS-API security mechanisms
  - LIPKEY (layered on SPKM-3) provides a security service similar to TLS
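For example, a Linux client of this era can request the Kerberos v5 mechanism for an NFSv4 mount with the sec option (server name hypothetical):

    mount -t nfs4 -o sec=krb5 server.example.edu:/ /mnt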
Existing NFSv4 Implementations
- Sun Solaris client and server
- Network Appliance multi-protocol server (NFSv4, NFSv3, CIFS)
- Hummingbird WinXXX client and server
- CITI Linux client and server
- OpenBSD/FreeBSD client
- EMC multi-protocol server
- HP-UX server
- Guelph OpenBSD server
- IBM AIX client and server
Future Implementations
- Cluster-coherent NFS server
- pNFS
NFS/RDMA
- A way to run NFSv2/v3/v4 over RDMA
- Greatly enhanced NFS performance
  - Low overhead
  - Full bandwidth
  - Direct I/O: true zero copy
- Implemented on Linux
  - kDAPL API
  - Client today, server soon
RPC layer approach
- Implemented within the RPC layer
- New RPC transport type
- Adds an RDMA-transport-specific header
- "Chunks" direct data transfer between client memory and server buffers (see the sketch below)
- Bindings for NFSv2/v3, also NFSv4
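The central idea is the chunk: an entry in the RDMA header advertising a registered client memory region that the peer can target directly. A C-style sketch of a chunk segment, after the rpcrdma Internet-Draft of the time (take the exact layout as an approximation):

    /* One segment of a read/write chunk in the RPC/RDMA header. */
    struct xdr_rdma_segment {
            uint32_t handle;   /* steering tag for the registered memory */
            uint32_t length;   /* length of the segment in bytes */
            uint64_t offset;   /* remote virtual address of the region */
    };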
Implementation Layering
- Client implemented as a kernel RPC transport
- Server approach is similar
- RDMA API: kDAPL
- NFS client code remains unchanged
- Completely transparent to applications
Use of kDAPL
- All RDMA interfacing is via kDAPL
- Very simple subset of the kDAPL 1.1 API
  - Connections, connection DTOs
  - Kernel-virtual or physical LMRs, RMRs
  - Small (1 KB-4 KB typical) send/receive
  - Large RDMA (4 KB-64 KB typical)
- All RDMA Read/Write initiated by the server
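A rough sketch of the server's RDMA Write path through the DAT 1.1 interface; the call and type names follow the spec, but treat the details as an approximation rather than working code:

    /* Push a bulk payload into a client-advertised chunk; the short
     * RPC reply then goes out via an ordinary inline Send. Triplet
     * setup and error handling are omitted. */
    DAT_LMR_TRIPLET local;    /* server buffer: lmr context, addr, length */
    DAT_RMR_TRIPLET remote;   /* client chunk: rmr context, addr, length */
    DAT_DTO_COOKIE  cookie;   /* returned with the completion event */

    dat_ep_post_rdma_write(ep, 1, &local, cookie, &remote,
                           DAT_COMPLETION_DEFAULT_FLAG);
    /* The completion is later reaped from the endpoint's DTO event
     * dispatcher (EVD). */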
Potential NFS/RDMA Users
Anywhere high bandwidth and low overhead are important:
- HPC/supercomputing clusters
- Database
- Financial applications
- Scientific computing
- General cluster computing
Linux NFS/RDMA server
Project goals:
- RPC/RDMA implementation
  - kDAPL API
  - Mellanox IB
- Interoperate with the NetApp RPC/RDMA client
- Performance gain over the TCP transport
Linux NFS/RDMA server
Approach:
- Divide the RPC layer into unified state management and an abstract transport layer
- Socket-specific code replaced by a general interface, implemented by socket or RDMA transports
- Similar to the client's RPC transport switch concept
Linux NFS/RDMA server
Implementation stages:
1. Listen for and accept connections
2. Process inline NFSv3 requests
3. NFSv3 RDMA
4. NFSv4 RDMA
Listen for and accept connections
- svc_makexprt
  - Similar to svc_makesock for socket transports
- RDMA transport tasks:
  - Open the HCA
  - Register memory
  - Create an endpoint for RDMA connections
Listen for and accept connections
- svc_xprt
  - Retains the transport-independent components of svc_sock
  - Adds a pointer to a transport-specific structure
  - Support for registering dynamic transport implementations (eventually)
Listen for and accept connections
- Reorganize code into transport-agnostic and transport-specific blocks (sketched below)
- Update calling code to specify the transport
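A sketch of the resulting split (apart from svc_xprt and svc_sock, the names are illustrative):

    /* Transport-independent core carved out of svc_sock. Each
     * transport (socket, RDMA) supplies its own ops and private state. */
    struct svc_xprt {
            struct svc_xprt_ops *xpt_ops;    /* transport entry points */
            void                *xpt_priv;   /* transport-specific struct */
            /* ... remaining transport-independent svc_sock fields ... */
    };

    struct svc_xprt_ops {
            int (*xpo_accept)(struct svc_xprt *xprt);
            int (*xpo_recvfrom)(struct svc_xprt *xprt);
            int (*xpo_sendto)(struct svc_xprt *xprt);
    };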
Process inline NFSv3 requests
- RDMA-specific send and receive routines
- All data sent inline via RDMA Send
- Tasks:
  - Register memory buffers for RDMA Send
  - Manage buffer transmission by the hardware
  - Process RDMA headers
NFSv3 RDMA
- Use RDMA Read and Write for large transfers
- RPC page management
  - xdr_buf contains an initial kvec and a list of pages
  - The initial kvec holds the RPC header and short payloads
  - The page list is used for large data transfers
- Server memory registration
  - All server memory pre-registered
  - Allows simpler memory management
  - May need revisiting wrt security
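The xdr_buf of that era looks roughly like this (an approximation of the Linux sunrpc header):

    /* The head kvec carries the RPC header and short payloads; the
     * page list carries bulk data, which is what the server RDMA-reads
     * into (client writes) or RDMA-writes from (client reads). */
    struct xdr_buf {
            struct kvec    head[1];    /* RPC header + short payloads */
            struct kvec    tail[1];    /* padding, trailing XDR */
            struct page  **pages;      /* bulk data pages */
            unsigned int   page_base;  /* offset into the first page */
            unsigned int   page_len;   /* length of page data */
            unsigned int   buflen;     /* total capacity */
            unsigned int   len;        /* bytes actually used */
    };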
NFSv3 RDMA
- Client write
  - Server issues RDMA Read from client-provided read chunks
  - Server reads into the xdr_buf page list
  - Similar to a socket-based receive for the ULP
- Client read
  - Server issues RDMA Write into client-provided write chunks
NFSv3 RDMA
- Reply chunks
  - Apply when a client request generates a reply too large for RDMA Send
  - Server issues RDMA Write into client-supplied buffers
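The server's send-path decision can then be summarized as (pseudocode, not taken from the implementation):

    /* Choosing how to return a reply to the client. */
    if (reply_len <= inline_threshold)
            rdma_send(reply);                  /* fits in an inline Send */
    else if (read_data && client_write_chunks)
            rdma_write(write_chunks, data);    /* bulk READ payload */
    else
            rdma_write(reply_chunk, reply);    /* oversized reply, e.g.
                                                  a large READDIR */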
NFSv4 RDMA
- NFSv4 layered on RPC/RDMA
- Task: export modifications for the RDMA transport
NFSv4.1 Sessions
- Adds a session layer to NFSv4
- Enhances protocol reliability
  - Accurate duplicate request caching
  - Bounded resources
- Provides transport diversity
  - Trunking, multipathing
http://www.ietf.org/internet-drafts/draft-ietf-nfsv4-sess-00.txt
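A sketch of why the duplicate request cache becomes accurate (field names illustrative): each request carries a slot id and a sequence id, so the server can distinguish a retransmission from a new request on that slot:

    /* One entry in a session's slot table. */
    struct session_slot {
            uint32_t      seqid;          /* last sequence id executed */
            struct reply *cached_reply;   /* kept for retransmissions */
    };

    /* On a request for slot s with sequence id q:
     *   q == s->seqid      -> duplicate: resend s->cached_reply
     *   q == s->seqid + 1  -> new request: execute, cache the reply
     *   otherwise          -> protocol error (false or misordered retry)
     */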
pNFS basics
- Separation of data and control: NFS metadata requests go through NFS, while data requests flow directly to the devices (OBSD, block/iSCSI, file)
- This allows an NFSv4.x pNFS client to be a native client of an object/SAN/data-filer file system and to scale efficiently
- Limits the need for custom VFS clients for every version of every OS/kernel known to mankind
pNFS and RDMA
- An NFSv4.x client with RDMA gives us a low-latency, low-overhead path for metadata (via the RPC/RDMA layer)
- pNFS gives us parallel paths for data, direct to the storage devices or filers (for the OBSD, block, and file methods)
- For the file method, RPC/RDMA provides a standards-based data path to the data filer
- For the block method, iSCSI/iSER or SRP could be used; this provides a standards-based data path (though it lacks transactional security)
- For the OBSD method, since ANSI OBSD extends iSCSI, if OBSD/iSCSI/iSER all get along, this provides a standards-based data path that is transactionally secure
pNFS and RDMA
With the previous two items, combined with other NFSv4 features such as leasing and compound RPCs, we have a first-class, standards-based file system client that gets native device performance, all provided by NFSv4.x, and is capable of effectively using any global parallel file system.
AND ALL WITH STANDARDS!
pNFS and RDMA
We really need all this work to be enabled on both Ethernet and InfiniBand, and to be completely routable between the two media.
- Will higher-level apps that become RDMA-aware be able to use both Ethernet and InfiniBand, and mixtures of both, transparently?
- Will NFSv4 RPC/RDMA, iSCSI, and SRP be routable between media?
CITI
- Developing the NFSv4 reference implementation since 1999
- NFS/RDMA and NFSv4.1 Sessions since 2003
- Funded by Sun, Network Appliance, ASCI, PolyServe, NSF
- http://www.citi.umich.edu/projects/nfsv4/
Key message
Give us kDAPL
Any questions?
http://www.citi.umich.edu/