This was a presentation given by Thomas Kejser, EMEA CTO of Fusion-io, during IPExpo 2012 in London. You can see a recording of the presentation here: http://fio.cc/QKjZxK

Flash memory solutions are quickly moving from innovative new technology to a crucial building block in today's data centers. Leading flash memory platforms are optimised to leave disk-era code behind in the disk era, where it belongs. Less disk-driven code means less latency and more performance, which is why companies across Europe are adopting flash as a new high-performance memory tier for their servers. Few would argue that flash is rapidly replacing disk for performance, but flash in the server is only half the battle. The next wave of flash memory innovation will follow as application developers move on from coding apps for disk and start to integrate flash memory optimisation into their software. Big Data applications are among the first to make this transition, with many other software developers in line to come next. In this talk, Fusion-io EMEA CTO Thomas Kejser will explore flash as a memory tier and how application optimisation will lead to the next wave of innovation in the flash memory revolution.
APPLICATION OPTIMISATION Flash’s Final Frontier
AGENDA
Where were we? Where are we? Where will we go from here?
ONCE UPON A TIME…
[Diagram: the CPU, where data is needed, sits far from the disk, where data is stored]
DOES TECHNOLOGY ALWAYS ADVANCE?
[Diagram: the same picture again: the CPU, where data is needed, still far from where data is stored]
ENTER: FLASH TECHNOLOGY
[Diagram: CPUs, with flash closing the gap between compute and storage]
TECHNOLOGY PROGRESSES TO A POINT…
Fusion devices today:
• 1 billion IOPS aggregate
• Millions of IOPS on a single PCI slot
• Capacity in the 10s of TB per server

Flash will get:
• Commoditized
• Cheaper
• Faster
• (Somewhat) denser
DESIGN HABITS
STORAGE BECAME…
• Highly aggregated
• Tunable (by storage experts)
• Layered
• Complex
[Diagram: sequential vs. random access patterns]
PROGRAMMERS ADAPTED…
[Diagram: sync model, where each I/O blocks its thread (problem: context switching), vs. async model, where requests pile up in a queue (problem: over-subscription)]
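To make the two models concrete, here is a minimal sketch using standard Linux APIs (blocking pread() vs. libaio); it illustrates the trade-off above and is not code from the talk. The async path links against -laio.

    /* Sync model: the thread sleeps in the kernel for every I/O,
       paying a context switch each time. */
    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <libaio.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define IO_SIZE 4096

    static void sync_read(int fd, char *buf, off_t off)
    {
        pread(fd, buf, IO_SIZE, off);            /* blocks until complete */
    }

    /* Async model: submit first, reap later. Throughput improves, but
       nothing stops the app from queueing far more requests than the
       device can absorb (over-subscription). */
    static void async_read(int fd, char *buf, off_t off)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;

        io_setup(32, &ctx);                      /* queue depth 32 */
        io_prep_pread(&cb, fd, buf, IO_SIZE, off);
        io_submit(ctx, 1, cbs);                  /* returns immediately */
        io_getevents(ctx, 1, 1, &ev, NULL);      /* reap the completion */
        io_destroy(ctx);
    }

    int main(void)
    {
        char *buf;
        int fd = open("testfile", O_RDONLY | O_DIRECT);
        posix_memalign((void **)&buf, 4096, IO_SIZE);  /* O_DIRECT alignment */
        sync_read(fd, buf, 0);
        async_read(fd, buf, IO_SIZE);
        free(buf);
        close(fd);
        return 0;
    }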
FILE SYSTEMS ADAPTED…
[Diagram: the OS I/O stack (file system, kernel block layer, device driver, sector mapping), with metadata kept in the OS]
Problems: double work; CPU overhead in the file system.
DATABASES ADAPTED…
[Diagram: pages A, B, C move between storage and RAM. An ACID change proceeds in three steps: (1) change the page in RAM, (2) log write (sequential), (3) flush/checkpoint (random).]
Problems: double write, defragmentation.
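A minimal sketch of that three-step path in C (illustrative only, not from the deck): the same change hits stable media twice, once in the log and once at checkpoint.

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    #define PAGE_SIZE 8192

    struct page { char data[PAGE_SIZE]; };

    void acid_change(int log_fd, int data_fd, struct page *p,
                     off_t page_off, const char *redo, size_t redo_len)
    {
        memcpy(p->data, redo, redo_len);      /* 1: change the page in RAM */
        write(log_fd, redo, redo_len);        /* 2: sequential log write   */
        fsync(log_fd);                        /*    durable at this point  */
        pwrite(data_fd, p->data, PAGE_SIZE, page_off);
                                              /* 3: random flush/checkpoint */
        fsync(data_fd);                       /* the change is now on media twice */
    }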
SIMPLIFICATION
MAKING USE OF THE NEW MEDIA
Existing I/O paradigm:
• open()
• read()
• write()
• seek()
• close()

New atomic extensions:
• nvm_vectored_write()
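A hedged sketch of how such a primitive might be called. The slides name nvm_vectored_write() but not its signature, so the prototype below is an assumption for illustration (the real call presumably also carries per-vector device offsets):

    #include <sys/uio.h>

    /* Assumed prototype -- not the documented API. */
    int nvm_vectored_write(int fd, const struct iovec *iov, int iov_count,
                           int flags);

    /* Persist two pages as one all-or-nothing unit: a crash mid-call
       leaves either both pages or neither, with no torn writes. */
    int write_pages_atomically(int fd, void *page_a, void *page_b, size_t len)
    {
        struct iovec iov[2] = {
            { .iov_base = page_a, .iov_len = len },
            { .iov_base = page_b, .iov_len = len },
        };
        return nvm_vectored_write(fd, iov, 2, 0);
    }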
EXAMPLE: ATOMIC I/O PRIMITIVES
[Diagram: the application issues a call to the atomic I/O primitives with a vector of buffers, iov[0] through iov[4]; the NVM translation layer maps them to LBA ranges (LBA 7, 24, 42, 68) and commits the whole vector as one unit]
TRANSACTION ENVELOPES
• Write all blocks atomically
• Trim all blocks atomically
• Write and trim atomically
ATOMIC I/O PRIMITIVES BENCHMARKS (ATOMIC I/O VS NON-ATOMIC I/O)
1U HP blade server with 16 GB RAM and 8 CPU cores (Intel Xeon X5472 @ 3.00 GHz), with a single 1.2 TB ioDrive2 Mono
Significantly more functionality with negligible performance cost
DIRECTFS – ELIMINATING DUPLICATE LOGIC
[Diagram: conventional path: ext/btrfs/xfs on the Linux VFS, then the kernel block layer, device driver, and sector mapping, with metadata duplicated along the way. directFS path: directFS calls the driver primitives directly, keeping a single copy of metadata.]
DIRECTFS WITH ATOMIC WRITES - ACHIEVING RAW DEVICE PERFORMANCE
[Charts: bandwidth (MiB/s) vs. I/O size, raw block device vs. directFS, at 1 thread and 8 threads]
Filesystem convenience AND atomic writes, with the performance of simple writes to a raw device
MAKING DATABASES RUN FASTER
[Diagram, before: pages A, B, C in RAM; an ACID change requires a log write (sequential) plus a flush/checkpoint (random). After: pages A, B, C in RAM; an ACID change requires only a single atomic write.]
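Continuing the hypothetical nvm_vectored_write() sketch from earlier, the write path above collapses to one step: the device's translation layer supplies the all-or-nothing guarantee that the log write used to provide.

    #include <sys/uio.h>

    #define PAGE_SIZE 8192
    #define MAX_PAGES 16

    /* Assumed prototype, as in the earlier sketch. */
    int nvm_vectored_write(int fd, const struct iovec *iov, int iov_count,
                           int flags);

    int acid_change_atomic(int nvm_fd, void *dirty_pages[], int n_pages)
    {
        struct iovec iov[MAX_PAGES];
        for (int i = 0; i < n_pages && i < MAX_PAGES; i++) {
            iov[i].iov_base = dirty_pages[i];
            iov[i].iov_len  = PAGE_SIZE;
        }
        /* No log write, no separate checkpoint: all dirty pages commit
           (or fail) together in a single atomic vectored write. */
        return nvm_vectored_write(nvm_fd, iov, n_pages, 0);
    }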
CASE STUDY: PERCONA SERVER (MYSQL)
Percona has added atomics support to Percona Server 5.5
▸ Removes the need for the MySQL double-write buffer
▸ Ensures data integrity in case of system crashes
▸ Writes 50% less, great for flash
▸ Removes complexity from the software stack
▸ Improves both transaction bandwidth and latency
▸ Works through the directFS filesystem or on raw devices
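As a rough idea of what this looks like operationally, here is a hedged my.cnf fragment; the exact option names vary by Percona Server release, so treat these as assumptions to verify against the release notes rather than confirmed settings.

    [mysqld]
    innodb_doublewrite = 0           # double-write buffer no longer needed
    innodb_flush_method = O_DIRECT   # bypass the OS page cache
    datadir = /mnt/directfs/mysql    # data files on directFS (or a raw device)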
PERCONA SERVER 5.5 TPC-C BENCHMARKS
Benchmarks run by Percona with Atomic I/O and directFS pre-release
[Chart: TPC-C throughput for Percona Server on ext4 vs. Percona Server on directFS with atomic I/O, in three configurations: ACID with the standard double-write, ACID with atomic writes replacing the double-write, and non-ACID with the double-write disabled for comparison]
50% more transactions with the same atomic durability on the same device.
DIRECTFS – BENEFITS IN ELIMINATING DUPLICATE LOGIC
File System Lines of Code
directFS 6879
ReiserFS 19996
ext4 25837
btrfs 51925
XFS 63230
ATOMICS AND DIRECTFS BENEFITS
▸ 9x reduction in source code: direct access to the underlying media provides a simplified code base with file system semantics.
▸ 2x flash media life: eliminating write-ahead logging increases the life span of the media.
▸ +50% transaction throughput: simplifying the database write path and using atomic storage primitives directly translates to increased throughput.
REVOLUTION
DISK OR MEMORY?
[Diagram: the memory hierarchy: four cores, each with private L1 and L2 caches, sharing an L3; access latencies range from roughly 1 ns and 10 ns (caches) through 100 ns (DRAM) and 10-100 us (flash) to 10 ms (disk)]
THE COMING SHIFT
As an SSD, flash accelerates applications.
As direct-access Non-Volatile Memory, flash transforms software development.
HOW?
Where the industry is headed:
Developers allocate 10/100/(1000?) TBs of Non-Volatile Memory, and never do explicit I/O again.
WHY? ELIMINATING THE MISMATCH
Manipulating data structures in memory is native to software development and fast.
Converting in-memory data structures to block I/O for persistence is foreign and expensive.
… but in-memory data has had no persistence.
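A small illustration of the mismatch (not from the talk): in memory the update is one pointer store, while persisting the same change through block I/O means flattening pointers into a page image and rewriting a whole block.

    #include <string.h>
    #include <sys/types.h>
    #include <unistd.h>

    struct node { int value; struct node *next; };

    /* Native and fast: a single store updates the structure. */
    void insert_in_memory(struct node *prev, struct node *n)
    {
        n->next = prev->next;
        prev->next = n;
    }

    /* Foreign and expensive: serialize the list into a 4 KB page and
       push the entire page through the block stack for one insert. */
    void persist_via_block_io(int fd, const struct node *list, off_t off)
    {
        char page[4096];
        size_t used = 0;
        for (const struct node *n = list;
             n && used + sizeof n->value <= sizeof page; n = n->next) {
            memcpy(page + used, &n->value, sizeof n->value);
            used += sizeof n->value;
        }
        pwrite(fd, page, sizeof page, off);
    }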
NOT JUST A BLOCK DEVICE ANYMORE…
Existing I/O paradigm:
• open(), read(), write(), seek(), close()

New atomic extensions:
• nvm_vectored_write()

Key-value store extensions:
• nvm_kv_open()
• kv_put()
• kv_get()
• kv_batch_*()
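A hedged sketch of how the key-value extensions might be used. Only the function names come from the slide; the signatures below are assumptions for illustration.

    #include <stdint.h>
    #include <string.h>

    /* Assumed prototypes -- not the documented API. */
    int nvm_kv_open(int fd, int pool_id);
    int kv_put(int kv, const void *key, uint32_t key_len,
               const void *value, uint32_t value_len, uint32_t expiry_secs);
    int kv_get(int kv, const void *key, uint32_t key_len,
               void *value_buf, uint32_t buf_len);

    int cache_session(int fd, const char *session_id,
                      const void *blob, uint32_t blob_len)
    {
        int kv = nvm_kv_open(fd, 0);     /* pool 0 of related keys */

        /* The expiry timer marks the pair for garbage collection
           after one hour; no application-side cleanup needed. */
        kv_put(kv, session_id, (uint32_t)strlen(session_id),
               blob, blob_len, 3600);

        char buf[1024];                  /* value returns in one I/O */
        return kv_get(kv, session_id, (uint32_t)strlen(session_id),
                      buf, sizeof buf);
    }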
EXAMPLE: KEY-VALUE STORE API LIBRARY
[Diagram: the application issues calls to the key-value store API; the NVM translation layer manages KV pairs (keys 1 to 128 bytes, values 64 bytes to 1 MB) grouped into pools A, B, C.
• kv_put() and kv_get() or kv_batch_get(): the key is hashed into a sparse address space to simplify collision management; the value is returned through a single I/O operation, regardless of value size.
• kv_get_current(), kv_next(): iterate through each KV pair in a pool of related keys.
• A key expiration timer marks KV pairs for VSL garbage collection.
• Writes are wrapped in an atomic transaction envelope.]
KEY-VALUE STORE API LIBRARY BENCHMARKS (NATIVE KV GET/PUT VS. RAW READS/WRITES)
0"
20000"
40000"
60000"
80000"
100000"
120000"
140000"
0" 20" 40" 60" 80" 100" 120" 140"
GETs/s&
Threads&
Sample&Performance&5&GET&
512B"
4KB"
16KB"
64KB"
0"
20000"
40000"
60000"
80000"
100000"
120000"
0" 20" 40" 60" 80" 100" 120" 140"
PUTs/
s&
Threads&
Sample&Performance&4&PUT&
512B"
4KB"
16KB"
64KB"
0"
20000"
40000"
60000"
80000"
100000"
120000"
140000"
0" 20" 40" 60" 80"
OPS/s
&
Threads&
Performane&rela2ve&to&ioDrive&
512B"Key"GET"
1KB"FIO"READ"
0"
20000"
40000"
60000"
80000"
100000"
120000"
0" 10" 20" 30" 40" 50" 60" 70"
OPS/s
&
Threads&
Performance&rela3ve&to&ioDrive&
512B"Key"PUT"
1K2FIO"WRITE"
Sample Performance - GET
Performance relative to ioDrive
Sample Performance - PUT
Performance relative to ioDrive
1U HP blade server with 16 GB RAM and 8 CPU cores (Intel Xeon X5472 @ 3.00 GHz), with a single 1.2 TB ioDrive2 Mono
Significantly more functionality with negligible performance cost
KEY-VALUE STORE API LIBRARY BENCHMARKS (VS. MEMCACHEDB)
KEY-VALUE STORE API LIBRARY BENEFITS
▸ 95% of raw device performance: smarter media now natively understands a key-value I/O interface, with lock-free updates, crash recovery, and no additional metadata overhead.
▸ Up to 3x capacity increase: dramatically reduces over-provisioning through coordinated garbage collection and automated key expiry.
▸ 3x throughput on the same SSD: early benchmarks against memcached with BerkeleyDB persistence show up to a 3x improvement.
OS SWAP VS. EXTENDED MEMORY
OS swap:
▸ Originally designed as a last resort to prevent OOM (out-of-memory) failures
▸ Never tuned for high-performance demand paging
▸ Never tuned for multi-threaded apps
▸ Poor performance, e.g. < 30 MB/s throughput

Extended memory:
▸ No application code changes required
▸ Designed to migrate hot pages to DRAM and cold pages to ioMemory
▸ Tuned to run natively on flash (leverages its native characteristics)
▸ Tuned for multi-threaded apps
▸ 10-15x throughput improvement over standard OS swap
[Diagram: the OS swap mechanism pages between system memory and non-volatile storage (disks, SSDs, etc.); the extended memory mechanism pages between system memory and NV memory used volatilely]
CHECKPOINTED MEMORY PERSISTENCE PATH
1. The application designates a virtual address space range to be checkpointed.
   a. This creates an independently-addressable linked clone of the checkpointed address range (no data moves or copies).
   b. The checkpoint appears as an addressable file in the directFS native filesystem namespace.
2. The application can continue manipulating the contents of the designated virtual address range without affecting the contents of the persisted checkpoint file.
3. The application can load or manipulate the persisted checkpoint file at a later time.
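A hypothetical sketch of the three steps. The slide names no API, so nvm_checkpoint() and the file path below are invented purely to make the flow concrete.

    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>

    /* Hypothetical call: create a linked clone of [addr, addr+len) that
       appears as a file in the directFS namespace (step 1). */
    int nvm_checkpoint(void *addr, size_t len, const char *name);

    void checkpoint_example(void)
    {
        size_t len = (size_t)1 << 30;               /* 1 GiB working set */
        void *heap = mmap(NULL, len, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        /* ... build and mutate data structures in heap ... */

        nvm_checkpoint(heap, len, "snapshot-001");  /* step 1: no copy made */

        memset(heap, 0, 4096);                      /* step 2: later writes
                                                       leave the snapshot intact */

        /* Step 3: reopen the checkpoint as an ordinary file later. */
        int fd = open("/mnt/directfs/snapshot-001", O_RDONLY);
        (void)fd;
    }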
[Diagram: system memory alongside NV memory holding the checkpointed memory]
API SPECS POSTED AT DEVELOPER.FUSIONIO.COM
Early access to the ioMemory SDK API specs and technical documentation (limited enrollment during the early-access phase): http://developer.fusionio.com
▸ Write less code to create high-performing apps
▸ Tap into performance not available with conventional I/O access to SSDs
▸ Reduce operating costs by decreasing RAM while increasing NVM
Direct access to NVM is for developers whose software retrieves and stores data.
OPEN INTERFACES AND OPEN SOURCE
▸ NVM Primitives: Open Interface
▸ directFS: Open Source, POSIX Interface
▸ NVM API Libraries: Open Source, Open Interface
▸ INCITS SCSI (T10) active standards proposals:
• SBC-4 SPC-5 Atomic-Write http://www.t10.org/cgi-bin/ac.pl?t=d&f=11-229r6.pdf
• SBC-4 SPC-5 Scattered writes, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-086r3.pdf
• SBC-4 SPC-5 Gathered reads, optionally atomic http://www.t10.org/cgi-bin/ac.pl?t=d&f=12-087r3.pdf
▸ SNIA NVM-Programming TWG active member
Tradi&onal SSDs
ioMemory™ with Conven&onal I/O
ioMemory™ as Transparent Cache
ioMemory™ with direct access I/O
ioMemory™ with memory seman&cs
Applica&
on
Applica&on
Applica&
on Applica&on Applica&on Applica&on Applica&on
User-‐defined I/O API Libraries
User-‐defined Memory API Libraries
OS Block I/O OS Block I/O OS Block I/O Direct-‐access I/O API Libraries
Memory Seman&cs API Libraries
Host
Host
File System File System File System directFS – NVM filesystem
directFS – NVM filesystem
Block Layer Block Layer Block Layer I/O Primi&ves Memory Primi&ves
SAS/SATA VSL™
expanded flash transla&on layer
directCache™ VSL™ VSL™
Network VSL™
Remote
RAID Controller
Read/Write Read/Write Read/Write Read/Write CPU Load/Store Flash
Transla&on Layer
Read/Write
Na&ve NVM Access |
FLASH MEMORY EVOLUTION
CATALYST FOR TOP INDUSTRY PLAYERS TO ACCELERATE PURSUIT OF NVM PROGRAMMING
… AND RESONATING THROUGH THE INDUSTRY
THANK YOU!