Network Stack in Userspace (NUSE), Hajime Tazaki, High-Speed PC Router Workshop (高速PCルーター研究会), 2014/9/29


DESCRIPTION

A brief introduction of Network Stack in Userspace (NUSE).


Today’s talk

• Userspace version of (Linux) network stack

• not a high-speed stack in itself

• but useful in combination with high-speed network I/O


I have a new Layer-3/4 protocol! Yey!

• I have a new, great Layer-3/4 protocol! It will change the WORLD!

• Would you want to replace your network stack?

• No: your code will destroy my life?! (experimental? not tested?)

• Yes: I wanna be your slave.

• VM cloud = OK, not many users/services to interfere with

• multi-user server, PC, phone = nightmare, my life will be in trouble…


I have a new Layer-3/4 protocol! Yey! (cont’d)

• Kernel programming sucks

• LKM ? can cause panic anyway..

• Click ? only router/middlebox, not for end-hosts

• Slow evolution

• VM ? Hmm, I’m a lazy guy..

Rekindling Network Protocol Innovation with User-Level Stacks

Michio Honda*, Felipe Huici*, Costin Raiciu†, Joao Araujo‡, Luigi Rizzo§‖

NEC Europe Ltd.*, Universitatea Politehnica Bucuresti†, University College London‡, Università di Pisa§, International Computer Science Institute, Berkeley, CA‖

{first.last}@neclab.eu, [email protected], [email protected], [email protected]

ABSTRACT

Recent studies show that more than 86% of Internet paths allow well-designed TCP extensions, meaning that it is still possible to deploy transport layer improvements despite the existence of middleboxes in the network. Hence, the blame for the slow evolution of protocols (with extensions taking many years to become widely used) should be placed on end systems.

In this paper, we revisit the case for moving protocols stacks up into user space in order to ease the deployment of new protocols, extensions, or performance optimizations. We present MultiStack, operating system support for user-level protocol stacks. MultiStack runs within commodity operating systems, can concurrently host a large number of isolated stacks, has a fall-back path to the legacy host stack, and is able to process packets at rates of 10Gb/s.

We validate our design by showing that our mux/demux layer can validate and switch packets at line rate (up to 14.88 Mpps) on a 10 Gbit port using 1-2 cores, and that a proof-of-concept HTTP server running over a basic userspace TCP outperforms by 18–90% both the same server and nginx running over the kernel’s stack.

Categories and Subject Descriptors

C.2.2 [Computer-communication Networks]: Network Protocols; D.4.4 [Operating Systems]: Communications Management

Keywords

transport protocols, operating systems, deployability

1. INTRODUCTION

The TCP/IP protocol suite has been mostly implemented in the operating system kernel since the inception of UNIX to ensure performance, security and isolation between user processes. Over time, new protocols and features have appeared (e.g., SCTP, DCCP, MPTCP, improved versions of TCP), many of which have become part of mainstream OSes and distributions. Fortunately, the Internet is still able to accommodate the evolution of protocols: a recent study [10] has shown that as many as 86% of Internet paths still allow TCP extensions despite the existence of a large number of middleboxes.

However, the availability of a feature does not imply widespread, timely deployment. Being part of the kernel, new protocols/extensions have system-wide impact, and are typically enabled or installed during OS upgrades. These happen infrequently not only because of slow release cycles, but also due to their cost and potential disruption to existing setups. If protocol stacks were embedded into applications, they could be updated on a case-by-case basis, and deployment would be a lot more timely.

[Figure 1: TCP options deployment over time. Ratio of flows (0.00 to 1.00) by date (2007 to 2012) for the SACK, Timestamp and Windowscale options, inbound and outbound.]

For example, Mac OS, Windows XP and FreeBSD still use a traditional Additive Increase Multiplicative Decrease (AIMD) algorithm for TCP congestion control, while Linux and Windows Vista (and later) use newer algorithms that achieve better bandwidth utilization and mitigate RTT unfairness [21, 25]. From a user’s point of view there is no reason not to adopt such new algorithms, but they do not because it can only be done via OS upgrades that are often costly or unavailable. Even if they are available, OS default settings that disable such extensions or modifications can further hinder timely deployment.

Figure 1 shows another example, the usage of the three most pervasive TCP extensions: Window Scale (WS) [12], Timestamps (TS) [12] and Selective Acknowledgment (SACK) [16]*. For example, despite WS and TS being available since Windows 2000 and on by default since Windows Vista in 2006, as late as 2012 more than 30% and 70% of flows still did not negotiate these options (respectively), showing that it can take a long time to actually upgrade or change OSes and thus the network stacks in their kernels. We see wider deployment for SACK in 2007 (70%) compared to the other options thanks to it being on by default since Windows 2000, but even with this, 20% of flows still did not use this option as late as 2011. The argument remains

* We used a set of daily traces from the WIDE backbone network which provides connectivity to universities and research institutes in Japan [3].

ACM SIGCOMM Computer Communication Review 53 Volume 44, Number 2, April 2014

Slow evolution of network stack

Honda et al., Rekindling Network Protocol Innovation with User-Level Stacks, ACM SIGCOMM CCR, Vol. 44, Num. 2, April 2014

Virtual Machine ?


Jon Howell, Galen Hunt, David Molnar, and Donald E. Porter, Living Dangerously: A Survey of Software Download Practices, no. MSR-TR-2010-51, May 2010

Poll: “When you download and run software, how often do you use a virtual machine (to reduce security risks)?”

Meanwhile in the filesystem world..

• There is Filesystem in Userspace (FUSE)

• Userspace code can host new filesystem (sshfs, GmailFS, etc)

• Performance is bad, but doesn’t matter

• Flexibility and functionality do matter


http://fuse.sourceforge.net/

Problem Statements

• Slow evolution of network stack

• Interference with the host OS (which is untouchable)

• VMs are too heavyweight


What’s NUSE ?

• Network stack in Userspace

• Userspace as much as possible

• like Fuse (Filesystem in Userspace)

• Library version of network stack (of monolithic kernel)

• kernel bypassed

• (UNIX) Process-based virtualization

What can you do with NUSE ?

• Host operating system

• Linux (for the moment)

• Guest operating systems

• Linux (3.17-rc1 based)

• FreeBSD (ongoing)

• Well suited to kernel-bypass technologies

• DPDK/netmap with (full) network stack + (existing) applications

• Applications

• ping, iperf, nginx (partially working)


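FUSE vs NUSE

[Figure: side-by-side architecture diagrams. FUSE: "ls -l /tmp/fuse" goes through glibc into the kernel VFS, where FUSE sits alongside NFS, ext3, ...; FUSE hands the request back to userspace, to the example filesystem mounted at /tmp/fuse via libfuse and glibc. NUSE: the application (nuse example) goes through glibc into libnuse (TCP/IP, ARP/ndisc), all in userspace; packets reach the NIC via raw socket, netmap, DPDK (etc), with the kernel bypassed.]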

Design Goals

• No modification to userspace apps

• No modification to kernel space either

• Transparent

• LD_PRELOADable

• 1x the performance of the native OS (i.e. on par)


[Figure: NUSE architecture, top to bottom. Application; POSIX glue; Kernel layer (ARP, Qdisc, TCP, UDP, DCCP, SCTP, ICMP, IPv4, IPv6, Netlink, Bridging, Netfilter, IPSec, Tunneling); NUSE core (petit-scheduler, bottom halves/rcu/timer/interrupt, struct net_device); RAW, DPDK, netmap, ...; NIC.]

Recipe

1. (monolithic) kernel source

2. petit-scheduler

3. POSIX glue

• redirect system calls (at libc-level)

4. network I/O

• raw socket, DPDK, netmap, etc..


1) kernel build

• patch to kernel tree

• with new (hw independent) arch (arch/sim)

• robust to (frequent) mainstream changes

• build kernel source tree w/ the patch

• make menuconfig ARCH=sim

• make library ARCH=sim

• ➔ libnuse-linux-3.17-rc1.so


2) petit scheduler

• offer alternate context primitives

• interrupts, timer, thread, bottom halves (tasklet, workqueue, waiter, etc)


• Implemented with POSIX thread

• easily debuggable

• ucontext fiber for low overhead (not yet)
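One way to picture the POSIX-thread approach: deferred kernel work (a tasklet or softirq-like job) is queued and later run by a worker pthread. The sketch below is a minimal illustration under that assumption. The names petit_sched, petit_sched_raise, petit_count_task and the fixed queue length are hypothetical and are not taken from the NUSE source.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define PETIT_QUEUE_LEN 16            /* hypothetical fixed queue size */

typedef void (*petit_fn)(void *arg);

struct petit_task { petit_fn fn; void *arg; };

/* Hypothetical scheduler state: one worker thread drains a FIFO of deferred work. */
struct petit_sched {
    pthread_t worker;
    pthread_mutex_t lock;
    pthread_cond_t wake;
    struct petit_task queue[PETIT_QUEUE_LEN];
    size_t head, count;
    bool stopping;
};

static void *petit_worker(void *p)
{
    struct petit_sched *s = p;
    pthread_mutex_lock(&s->lock);
    for (;;) {
        while (s->count == 0 && !s->stopping)
            pthread_cond_wait(&s->wake, &s->lock);
        if (s->count == 0)            /* stopping and fully drained */
            break;
        struct petit_task t = s->queue[s->head];
        s->head = (s->head + 1) % PETIT_QUEUE_LEN;
        s->count--;
        pthread_mutex_unlock(&s->lock);
        t.fn(t.arg);                  /* run the deferred work outside the lock */
        pthread_mutex_lock(&s->lock);
    }
    pthread_mutex_unlock(&s->lock);
    return NULL;
}

/* Start the worker thread; returns 0 on success (pthread_create's result). */
int petit_sched_start(struct petit_sched *s)
{
    s->head = s->count = 0;
    s->stopping = false;
    pthread_mutex_init(&s->lock, NULL);
    pthread_cond_init(&s->wake, NULL);
    return pthread_create(&s->worker, NULL, petit_worker, s);
}

/* Queue deferred work, as raising a tasklet would; returns -1 if the queue is full. */
int petit_sched_raise(struct petit_sched *s, petit_fn fn, void *arg)
{
    assert(fn != NULL);
    int ret = -1;
    pthread_mutex_lock(&s->lock);
    if (s->count < PETIT_QUEUE_LEN && !s->stopping) {
        s->queue[(s->head + s->count) % PETIT_QUEUE_LEN] =
            (struct petit_task){ fn, arg };
        s->count++;
        ret = 0;
        pthread_cond_signal(&s->wake);
    }
    pthread_mutex_unlock(&s->lock);
    return ret;
}

/* Let the worker drain the queue, then join it. */
void petit_sched_stop(struct petit_sched *s)
{
    pthread_mutex_lock(&s->lock);
    s->stopping = true;
    pthread_cond_signal(&s->wake);
    pthread_mutex_unlock(&s->lock);
    pthread_join(s->worker, NULL);
    pthread_mutex_destroy(&s->lock);
    pthread_cond_destroy(&s->wake);
}

/* Example deferred job: bump an int counter. */
void petit_count_task(void *arg)
{
    (*(int *)arg)++;
}
```

A caller would start the scheduler, raise petit_count_task a few times with a pointer to a counter, then stop the scheduler and read the counter. Because the worker is an ordinary pthread, it can be inspected with standard tools such as gdb, which fits the "easily debuggable" point on the slide.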


3) POSIX glue code

• Hijack function calls

• socket => nuse_socket

• read => nuse_read

• libc level hijack

• apps are not aware of it

• LD_PRELOAD=libnuse.so ..

• can’t catch system calls issued directly (e.g. int 0x80)


extern int sim_sock_socket (int, int, int, struct socket **);

int socket (int family, int type, int proto)
{
  sim_update_jiffies ();
  struct socket *kernel_socket = sim_malloc (sizeof (struct socket));
  memset (kernel_socket, 0, sizeof (struct socket));
  int ret = sim_sock_socket (family, type, proto, &kernel_socket);
  g_fd_table[curfd++] = kernel_socket;
  sim_softirq_wakeup ();
  return curfd - 1;
}

https://github.com/thehajime/net-next-nuse/blob/nuse/arch/sim/nuse-glue.c
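The socket() glue above keeps its own descriptor table, so a hijacked read() has to decide whether a descriptor belongs to the userspace stack or to the host kernel. The following is a minimal sketch of that dispatch, not the NUSE implementation: NUSE_FD_BASE, the glue_* helpers and struct fake_sock are hypothetical stand-ins. Under LD_PRELOAD the dispatcher would be exported under the name read, and the host version would be resolved with dlsym(RTLD_NEXT, "read"); here it is named glue_read and calls read(2) directly so the example stays self-contained.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define NUSE_FD_BASE 1024   /* assumption: fds at or above this belong to the userspace stack */
#define NUSE_FD_MAX  64     /* assumption: small fixed table */

/* Stand-in for the stack's per-socket state (struct socket on the slide). */
struct fake_sock {
    const char *rx;         /* pending receive data */
    size_t rx_len;
};

static struct fake_sock *fd_table[NUSE_FD_MAX];

/* Register a socket and return an application-visible descriptor, or -1 if full. */
int glue_fd_alloc(struct fake_sock *sk)
{
    assert(sk != NULL);
    for (int i = 0; i < NUSE_FD_MAX; i++) {
        if (fd_table[i] == NULL) {
            fd_table[i] = sk;
            return NUSE_FD_BASE + i;
        }
    }
    return -1;
}

/* Return the socket behind fd, or NULL if fd is not ours. */
struct fake_sock *glue_fd_lookup(int fd)
{
    if (fd < NUSE_FD_BASE || fd >= NUSE_FD_BASE + NUSE_FD_MAX)
        return NULL;
    return fd_table[fd - NUSE_FD_BASE];
}

/* Free the slot so a later glue_fd_alloc can reuse it. */
void glue_fd_release(int fd)
{
    if (glue_fd_lookup(fd) != NULL)
        fd_table[fd - NUSE_FD_BASE] = NULL;
}

/* Stand-in for nuse_read: copy out whatever the fake socket has pending. */
static ssize_t fake_stack_read(struct fake_sock *sk, void *buf, size_t len)
{
    size_t n = len < sk->rx_len ? len : sk->rx_len;
    memcpy(buf, sk->rx, n);
    sk->rx += n;
    sk->rx_len -= n;
    return (ssize_t)n;
}

/* Dispatcher: userspace-stack descriptors go to the stack, everything else to the host. */
ssize_t glue_read(int fd, void *buf, size_t len)
{
    struct fake_sock *sk = glue_fd_lookup(fd);
    if (sk != NULL)
        return fake_stack_read(sk, buf, len);
    return read(fd, buf, len);
}
```

Reserving a descriptor range is only one possible convention. The slide code instead returns the index into g_fd_table directly, which is simpler but can collide with descriptors handed out by the host kernel.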


4) network I/O

• connect NUSE to NIC

• options

• raw socket (general)

• DPDK (if available)

• netmap (if available)

• Tap ?


void sim_dev_rx (struct SimDevice *device, struct SimDevicePacket packet)
{
  struct sk_buff *skb = packet.token;
  struct net_device *dev = &device->dev;
  skb->protocol = eth_type_trans (skb, dev);
  // Do the TCP checksum (FIXME: should be configurable)
  skb->ip_summed = CHECKSUM_PARTIAL;
  netif_rx (skb);
}

static netdev_tx_t kernel_dev_xmit (struct sk_buff *skb, struct net_device *dev)
{
  netif_stop_queue (dev);
  sim_dev_xmit ((struct SimDevice *)dev, skb->data, skb->len);
  dev_kfree_skb (skb);
  netif_wake_queue (dev);
  return 0;
}

static const struct net_device_ops sim_dev_ops = {
  .ndo_start_xmit = kernel_dev_xmit,
  .ndo_set_mac_address = eth_mac_addr,
};

https://github.com/thehajime/net-next-nuse/blob/nuse/arch/sim/sim-device.c
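Since the transmit hook above only forwards a buffer and a length, the choice between raw socket, DPDK and netmap can sit behind a small table of function pointers. The sketch below illustrates that idea under stated assumptions: struct io_backend, dev_xmit, dev_poll and the in-memory loopback backend are hypothetical and do not appear in the NUSE source. The loopback backend stands in for a real NIC path so the example runs without privileges.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define FRAME_MAX 1514      /* Ethernet frame without FCS */

/* Hypothetical backend interface: one instance per I/O option. */
struct io_backend {
    const char *name;
    /* send one frame; returns 0 on success, -1 on error */
    int (*xmit)(void *priv, const void *frame, size_t len);
    /* copy one received frame into buf; returns its length, 0 if none is pending */
    size_t (*recv)(void *priv, void *buf, size_t cap);
    void *priv;
};

/* In-memory loopback: a single-frame slot standing in for a raw socket, DPDK or netmap port. */
struct loopback {
    unsigned char frame[FRAME_MAX];
    size_t len;             /* 0 means the slot is empty */
};

static int loopback_xmit(void *priv, const void *frame, size_t len)
{
    struct loopback *lo = priv;
    if (len == 0 || len > FRAME_MAX || lo->len != 0)
        return -1;          /* bad size, or the slot is still occupied */
    memcpy(lo->frame, frame, len);
    lo->len = len;
    return 0;
}

static size_t loopback_recv(void *priv, void *buf, size_t cap)
{
    struct loopback *lo = priv;
    if (lo->len == 0 || lo->len > cap)
        return 0;
    size_t n = lo->len;
    memcpy(buf, lo->frame, n);
    lo->len = 0;
    return n;
}

/* Build a backend around caller-owned loopback storage. */
struct io_backend loopback_backend(struct loopback *lo)
{
    assert(lo != NULL);
    lo->len = 0;
    struct io_backend be = { "loopback", loopback_xmit, loopback_recv, lo };
    return be;
}

/* What a device transmit hook would do: hand the frame to whichever backend is attached. */
int dev_xmit(const struct io_backend *be, const void *frame, size_t len)
{
    return be->xmit(be->priv, frame, len);
}

/* What a receive loop would do before passing the frame up the stack. */
size_t dev_poll(const struct io_backend *be, void *buf, size_t cap)
{
    return be->recv(be->priv, buf, cap);
}
```

With this shape, adding another I/O option means supplying one more pair of xmit/recv functions, and the stack-facing code stays unchanged.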

How to use NUSE ?

• download

• git clone git://github.com/thehajime/net-next-nuse

• compile

• make library ARCH=sim NETMAP=yes

• execute

• sudo ./nuse (application)

• success ? : lucky guy !

• fail: add the missing hijack calls

Alternatives

• Container (LXC, OpenVZ, vimage)

• share kernel with host operating system (no flexibility)

• virtual machine (KVM,Xen,UML)

• flexible/functional, but heavy bootstrap

• Library OS

• from scratch: mTCP, Mirage, lwIP

• Porting: OSv, Sandstorm, libuinet (FreeBSD), Arrakis (lwIP), OpenOnload (lwIP?)

• Glue-layer: LKL (Linux-2.6), Rump (NetBSD)


Alternatives (cont’d) Rumpkernel

• https://github.com/rumpkernel/wiki/wiki

• One binary runs everywhere

• Linux, xBSD, Solaris, Cygwin hosts

• Xen Dom-U

• Bare metal (hardware, KVM, Virtualbox)

• Well-defined API (hypercall)


• Only the NetBSD network stack is available

Evaluation

• Performance ?

• not good so far..

• Generality

• Run all applications ? Depends on POSIX coverage


next time..

Ongoing work

• (efficient) thread scheduling

• batch Tx/Rx

• fork(2)/exec(2)

• multi-processes


• => migrate to rumpkernel ?

Summary

• Network Stack in Userspace (NUSE)

• network stack library

• light virtualization

• fast evolution, easy deployments


https://github.com/thehajime/net-next-nuse

GASPP: A GPU-Accelerated Stateful Packet Processing Framework


Giorgos Vasiliadis, Lazaros Koromilas, Michalis Polychronakis, and Sotiris Ioannidis, GASPP: A GPU-Accelerated Stateful Packet Processing Framework, USENIX ATC 2014, June, 2014