Container-relevant Kernel developments
Tycho Andersen, [email protected], GH: tych0


IMA
● Integrity Measurement Architecture (“IMA”, “I’ma”)
● In-kernel protection against unauthorized userspace file modification

IMA
open("/foo/bar", O_RDWR)

sha256sum("/foo/bar") == getxattr("/foo/bar", "security.ima")

verify("/foo/bar") == getxattr("/foo/bar", "security.evm")

If either check fails: open("/foo/bar", O_RDWR) = -EPERM
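The appraisal step above boils down to comparing a freshly computed digest with a stored reference. A minimal sketch of that decision follows. It is a toy userspace model, not kernel code: the function name appraise_open and the fixed SHA-256 length are assumptions made for illustration, and the -EPERM return value simply mirrors the slide.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define SHA256_LEN 32

/*
 * Toy model of the appraisal decision (hypothetical helper, not a kernel API).
 * "measured" is the digest computed from the file contents at open time;
 * "stored" stands in for the reference digest kept in the security.ima xattr.
 * Returns 0 to allow the open, -EPERM to deny it (as on the slide).
 */
int appraise_open(const unsigned char *measured,
                  const unsigned char *stored, size_t stored_len)
{
    if (stored == NULL || stored_len != SHA256_LEN)
        return -EPERM;      /* missing or malformed reference digest */
    if (memcmp(measured, stored, SHA256_LEN) != 0)
        return -EPERM;      /* contents differ from the labeled state */
    return 0;
}
```

A caller would hash the file, read the reference digest, and refuse the open whenever appraise_open() returns a negative value.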

IMA
$ tee /sys/kernel/security/ima/policy <<EOF
# PROC_SUPER_MAGIC = 0x9fa0
dont_measure fsmagic=0x9fa0
dont_appraise fsmagic=0x9fa0
# EXT4_MAGIC = 0xEF53
appraise fsmagic=0xEF53 fowner=$user
appraise func=MODULE_CHECK
EOF

ima_appraise={off,enforce,fix,log}

IMA

IMA namespacing
● global policy
● which namespace to pin?
● what about unshare()?
● ima: namespacing IMA audit messages https://lkml.org/lkml/2017/7/20/905

● IMA
● Audit
● struct container *
● LSM
● Time Namespace
● seccomp logging
● Landlock
● Wireguard
● KSPP
● XPFO

Audit

Audit
type=USER_LOGIN msg=audit(1506873468.459:1814706): pid=27995 uid=0 auid=4294967295 ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=? addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873489.492:1814707): pid=28128 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd" hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1506873489.492:1814708): pid=28128 uid=0 auid=4294967295 ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=? addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873491.708:1814709): pid=28128 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd" hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'
type=USER_LOGIN msg=audit(1506873491.708:1814710): pid=28128 uid=0 auid=4294967295 ses=4294967295 msg='op=login acct="root" exe="/usr/sbin/sshd" hostname=? addr=113.195.145.13 terminal=sshd res=failed'
type=USER_AUTH msg=audit(1506873493.864:1814711): pid=28128 uid=0 auid=4294967295 ses=4294967295 msg='op=PAM:authentication acct="root" exe="/usr/sbin/sshd" hostname=113.195.145.13 addr=113.195.145.13 terminal=ssh res=failed'

Audit namespacing
● which namespace to pin?
● what about unshare()?
● RFC: Audit Kernel Container IDs https://lkml.org/lkml/2017/9/13/383
● RFC(v2): Audit Kernel Container IDs https://lkml.org/lkml/2017/10/12/354


struct container *
int cfd = container_create(const char *name, unsigned int flags);

container_mount(int cfd, const char *source,
                const char *target, /* NULL -> root */
                const char *filesystemtype,
                unsigned long mountflags, const void *data);

container_chroot(int cfd, const char *path);

mkdirat(int cfd, const char *path, mode_t mode);
mknodat(int cfd, const char *path, mode_t mode, dev_t dev);

struct container *
container_bind_mount_across(int cfd, const char *source, const char *target);
int fd = openat(int cfd, const char *path, unsigned int flags, mode_t mode);
int fd = container_socket(int cfd, int domain, int type, int protocol);
fork_into_container(int cfd);
container_wait(int container_fd, int *_wstatus, unsigned int wait, struct rusage *rusage);
container_kill(int container_fd, int initonly, int signal);
container_add_key(const char *type, const char *description,
                  const void *payload, size_t plen, int container_fd);

struct container *
● Device restriction
● “supervising” the container
● Make containers kernel objects https://lkml.org/lkml/2017/5/22/645


LSM
● Linux Security Module
  ○ SELinux
  ○ AppArmor
  ○ Smack
  ○ Landlock
  ○ tomoyo
  ○ yama
  ○ loadpin
  ○ SARA

LSM namespacing (stacking, chaining)
● 2004: https://lwn.net/Articles/110432/ Stackable security modules
● 2010: https://lwn.net/Articles/393008/ LSM Stacking (again)
● 2011: https://lwn.net/Articles/426921/ Supporting multiple LSMs
● 2012: https://lwn.net/Articles/518345/ Another LSM stacking approach
● 2013: https://lwn.net/Articles/548314/ LSM: Multiple concurrent LSMs
● 2014: https://lwn.net/Articles/548314/ LSM: Generalize existing module stacking
● 2015: https://lwn.net/Articles/635771/ Progress in security module stacking
● 2016-2017: https://lwn.net/Articles/719731/ Stacking for major security modules

LSM namespacing (stacking, chaining)
Host: AppArmor

Guest: SELinux

Nested: Smack

LSM namespacing (stacking, chaining)
Host: AppArmor

Guest: AppArmor

Nested: AppArmor

LSM namespacing (stacking, chaining)
● SELinux in development: https://marc.info/?l=selinux&m=150696042210126&w=2


unshare(CLONE_NEWTIME)
gettimeofday(); settimeofday();
clock_getres();
clock_gettime(); clock_settime();
time();

virtual Dynamic Shared Object (vDSO)
● optimization to make frequent syscalls faster
● injected into a task’s address space by the kernel

unshare(CLONE_NEWTIME)?

Task 1    Task 2    Task 3    ...    Task n

kernel: tick_handle_periodic() -> update_vsyscall()

seccomp logging


seccomp can’t dereference pointers
ptr = "/tmp/foo";
open(ptr, O_RDWR);

__secure_computing(...) = 0
ptr = "/etc/passwd";

sys_open()
  do_sys_open()
    do_filp_open()
      path_openat()
        vfs_open()
          do_dentry_open()
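The race above can be imitated in plain userspace C. This is a simulation only: struct fake_syscall, FAKE_NR_OPEN, and filter_allows() are hypothetical stand-ins, not the seccomp API. The point it illustrates is that a filter which sees only register values keeps returning the same verdict after the pointed-to string has changed.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for what a filter is given: numbers only. */
struct fake_syscall {
    int nr;
    uint64_t args[6];
};

#define FAKE_NR_OPEN 2   /* illustrative value, not tied to any ABI */

/* Returns 1 to allow. The filter can compare args[0] as an integer,
 * but it never reads the bytes that the pointer refers to. */
int filter_allows(const struct fake_syscall *sc, uint64_t allowed_ptr)
{
    return sc->nr == FAKE_NR_OPEN && sc->args[0] == allowed_ptr;
}

/* Returns 1 if the verdict is unchanged after the buffer is rewritten. */
int decision_survives_rewrite(void)
{
    char path[32] = "/tmp/foo";
    struct fake_syscall sc = { FAKE_NR_OPEN, { (uint64_t)(uintptr_t)path } };

    int before = filter_allows(&sc, (uint64_t)(uintptr_t)path);
    /* In a real attack a second thread does this after the check. */
    strcpy(path, "/etc/passwd");
    int after = filter_allows(&sc, (uint64_t)(uintptr_t)path);

    return before && after && strcmp(path, "/etc/passwd") == 0;
}
```

The pointer value never changes, so the check passes both times even though the path eventually read through it is different.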

Landlock
● eBPF based Linux Security Module http://landlock.io

__secure_computing()
sys_open()
  do_sys_open()
    do_filp_open()
      path_openat()
        vfs_open()
          do_dentry_open()
            security_file_open()

Landlock
int security_file_open(struct file *file, struct cred *cred);

struct file {
    ...
    struct path f_path;
    struct inode *f_inode;
};


Wireguard
● WireGuard is an extremely simple yet fast and modern VPN https://www.wireguard.com/
● Allows for transparent encryption between endpoints

Wireguard
● IPSec: 400k lines
● OpenVPN: 100k lines + SSL
● Wireguard: 4k lines

Wireguard
● Noise protocol: https://noiseprotocol.org
● Curve25519, Blake2s, ChaCha20, Poly1305, SipHash2-4
● No cipher agility

Kernel Self Protection Project (KSPP)
● Currently ~12 organizations and ~10 individuals working on ~20 technologies
● KSPP focuses on the kernel protecting the kernel from attack
● More at: https://outflux.net/slides/2017/lss/kspp.pdf


eXclusive Page Frame Ownership (XPFO)
● Introduced in “Rethinking Kernel Isolation” by Kemerlis, Polychronakis, and Keromytis
● Protects against ret2dir attacks
● 29 files changed, 1013 insertions(+), 57 deletions(-)
● Implementation supports x86 and arm64

mm basics

0x00007fbcd334f000 (user)

0x1214b9000 (physical)

0xffff8801214b9000 (kernel)
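The physical and kernel addresses above differ by a fixed offset, which is what makes the kernel alias of a user page predictable. A small sketch of the arithmetic follows. The base 0xffff880000000000 is an assumption: it is the x86-64 direct-map base that matches the addresses on this slide, and it varies with kernel version and KASLR. The helper names are made up for illustration.

```c
#include <stdint.h>

/* Assumed base of the x86-64 direct map, chosen to match the
 * addresses on the slide; real kernels may use a different base. */
#define ASSUMED_PAGE_OFFSET 0xffff880000000000ULL

/* Kernel direct-map address that aliases a given physical address. */
uint64_t direct_map_alias(uint64_t phys)
{
    return ASSUMED_PAGE_OFFSET + phys;
}

/* Inverse: physical address behind a direct-map address. */
uint64_t direct_map_to_phys(uint64_t kaddr)
{
    return kaddr - ASSUMED_PAGE_OFFSET;
}
```

With the slide's values, direct_map_alias(0x1214b9000) gives 0xffff8801214b9000.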

Classic attack
struct file_operations {
    int (*flush) (...);
};

/* kernel text */
int do_flush(...)
{
    ...
}

/* userspace memory */
int bad_flush(...)
{
    commit_creds(prepare_kernel_cred(0));
}

Classic attack
● PaX UDEREF
● SMEP+SMAP on x86
● PXN on ARM

Updated attack
struct file_operations {
    int (*flush) (...);
};

/* kernel text */
int do_flush(...)
{
    ...
}

/* userspace memory 0x00007fbcd334f000 */
int bad_flush(...)
{
    commit_creds(prepare_kernel_cred(0));
}

/* userspace alias in kernel 0xffff8801214b9000 */

Enter XPFO!
● Keep track of who owns each page
● Map/unmap accordingly
● Flush TLB as necessary
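The three bullets above can be read as a small state machine. The sketch below is a toy userspace model and not the XPFO patch set: struct toy_page, the toy_* functions, and the flush counter are hypothetical names, and real page-table and TLB operations are replaced by a flag and a counter.

```c
#include <stdbool.h>

enum toy_owner { OWNER_KERNEL, OWNER_USER };

struct toy_page {
    enum toy_owner owner;
    bool in_direct_map;   /* is the kernel alias currently mapped? */
    int kmap_count;       /* nested temporary kernel mappings */
};

int toy_tlb_flushes;      /* counts simulated TLB flushes */

/* Page handed to userspace: drop the kernel alias and flush. */
void toy_alloc_user(struct toy_page *p)
{
    p->owner = OWNER_USER;
    if (p->in_direct_map) {
        p->in_direct_map = false;
        toy_tlb_flushes++;
    }
}

/* Kernel needs temporary access to a user-owned page. */
void toy_kmap(struct toy_page *p)
{
    if (p->owner == OWNER_USER && p->kmap_count++ == 0)
        p->in_direct_map = true;
}

/* Temporary access ends: unmap again and flush stale entries. */
void toy_kunmap(struct toy_page *p)
{
    if (p->owner == OWNER_USER && --p->kmap_count == 0) {
        p->in_direct_map = false;
        toy_tlb_flushes++;
    }
}

/* Page returns to the kernel: restore the direct-map alias. */
void toy_free(struct toy_page *p)
{
    p->owner = OWNER_KERNEL;
    p->kmap_count = 0;
    p->in_direct_map = true;
}
```

In this model a user-owned page is reachable through the kernel alias only between toy_kmap() and toy_kunmap(), and every unmap costs a flush.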

Get involved
● https://lists.linux-foundation.org/mailman/listinfo/containers
● http://www.openwall.com/lists/kernel-hardening/
● https://sourceforge.net/p/linux-ima/mailman/linux-ima-devel/

THANK YOU :)

Image credits
● Marty Bee for Brain Dump: http://www.martybee.com/
● https://en.wikipedia.org/wiki/White_Rabbit#/media/File:Down_the_Rabbit_Hole.png
● https://upload.wikimedia.org/wikipedia/commons/thumb/1/13/Container_ship_Hanjin_Taipei.jpg/1024px-Container_ship_Hanjin_Taipei.jpg
● https://en.wikipedia.org/wiki/Hansel_and_Gretel#/media/File:1903_Ludwig_Richter.jpg
● https://upload.wikimedia.org/wikipedia/commons/8/87/WinonaSavingsBankVault.JPG
● http://www.gizmodo.in/photo/20861051.cms
● https://upload.wikimedia.org/wikipedia/commons/b/be/TPM.svg
● Kyle Spiers (Security Intern at Docker) for Gordon photo

On allocation
[Diagram: allocate 0x00007fbcd334f000 -> TLB flush across four CPU cores]

On map/unmap
[Diagram: map 0x00007fbcd334f000 -> TLB flush across four CPU cores]

x86
void flush_tlb_kernel_range(unsigned long start, unsigned long end)
{
    ...
    on_each_cpu(do_kernel_range_flush, &info, 1);
}

x86
/*
 * Can deadlock when called with interrupts disabled.
 * ...
 */
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
             && !oops_in_progress);


Benchmark
● kernbench running on n/2 to n cores in steps of 2
● test inter-core interference from excess flushing

Test machines:
● 2x Xeon E5-2650 v4, 24 cores/48 threads, 2.2 GHz, 30 MB SmartCache
● Xeon E3-1240, 4 cores/8 threads, 3.3 GHz, 8 MB SmartCache
● Amlogic Cortex A53, 4 cores (odroid-C2), 1.5 GHz, 32k L1 (I/D), 512k L2

XPFO links
● Original paper: https://cs.brown.edu/~vpk/papers/ret2dir.sec14.pdf
● v6 posting: https://lkml.org/lkml/2017/9/7/445