Finding a Needle in Haystack: Facebook's Photo Storage
Beaver, D., Kumar, S., Li, H. C., Sobel, J., Vajgel, P., Facebook Inc.
Advanced Seminar on Network Services — LIN YI 81517372
[Chart: Increase of photo uploads on Facebook — photos uploaded per year (in tens of billions), 2010–2013]
Why Do We Need a New One?
• Traditional POSIX-based file systems keep:
• Directories
• Per-file metadata
• Consequences:
• Wasted storage capacity
• Metadata must be read from disk into memory
• Accessing metadata becomes the bottleneck
• This is the key problem in using network-attached storage (NAS) appliances mounted over NFS
Why Do We Need a New One?
• Several disk operations are necessary to read a single photo:
• One or more to translate the filename to an inode number
• Another to read the inode from disk
• A final one to read the file itself
∴ Spending disk I/Os (input/output operations) on metadata is NOT GOOD!
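The per-photo cost above can be sketched as a toy model (hypothetical code, not Facebook's): each path component may need a directory lookup, plus one inode read and one data read.

```python
# Toy model (not Facebook code): count the disk operations a traditional
# POSIX filesystem performs to serve one photo read.

def disk_ops_for_read(path: str) -> int:
    """Return the number of disk I/Os needed to read a file at `path`."""
    components = [c for c in path.strip("/").split("/") if c]
    lookup_ops = len(components)   # one or more: filename -> inode number
    inode_ops = 1                  # another: read the inode from disk
    data_ops = 1                   # a final one: read the file itself
    return lookup_ops + inode_ops + data_ops

# A deeply nested photo path pays for every directory along the way:
print(disk_ops_for_read("/photos/user123/album9/pic.jpg"))  # 4 lookups + inode + data = 6
```

This is why keeping per-photo metadata out of the read path matters: the lookup cost grows with the namespace, not with the photo.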
The Procedure: How a Picture Is Downloaded
[Diagram: (1) the browser requests a page from the Web Server; (2) the Web Server returns URLs for the photos; (3) the browser requests each photo from the CDN; (4–5) on a cache miss, the CDN fetches the photo from Photo Storage; (6) the CDN returns the photo to the browser]
Why an NFS-based Design and Not Just a CDN?
Pros:
• CDNs do well on the hottest photos — profile pictures and photos that have been recently uploaded
Cons:
• Long tail: Facebook generates a large number of requests for less popular (often older) content
• Requests from the long tail lead to significant traffic
• It is impossible to cache all of them
NFS-based Design of Facebook
[Diagram: Browser, Web Server, CDN, Photo Store servers, NAS appliances; request/response steps 1–8]
• Each photo is stored in its own file on a set of commercial network-attached storage (NAS) appliances.
• A set of machines, the Photo Store servers, then mount all the volumes exported by these NAS appliances over NFS.
The Problem of This Architecture
• Each directory of an NFS volume holds thousands of files
• Loading a single image requires an excessive number of disk operations (~10)
The Problem of This Architecture
• Even after reducing each directory to hundreds of images, loading a single image still takes 3 disk operations:
• One to read the directory metadata into memory
• A second to load the inode into memory
• And a third to read the file contents
• The way NAS appliances manage directory metadata (placing thousands of files in a directory) was extremely inefficient
The Problem of This Architecture
[Diagram: the NFS-based design, with the Photo Store servers caching the file handles returned by the NAS appliances]
• Let the Photo Store servers explicitly cache the file handles returned by the NAS appliances
• Each server caches the filename-to-file-handle mapping
• A photo can then be opened directly using a custom system call, "open_by_filehandle"
• Only a minor improvement, because less popular photos are less likely to be cached to begin with
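The handle-caching idea above can be sketched as follows (a minimal, hypothetical sketch; the NAS class and method names are made up, standing in for NFS lookups and the custom open-by-handle call):

```python
# Minimal sketch (hypothetical) of a Photo Store server caching the
# NFS file handles returned by a NAS appliance.

class FakeNAS:
    """Stand-in for a NAS appliance; counts expensive name lookups."""
    def __init__(self):
        self.lookups = 0

    def lookup(self, filename):
        self.lookups += 1            # expensive filename -> handle resolution
        return hash(filename)        # pretend this is an NFS file handle

    def open_by_handle(self, handle):
        return f"photo-bytes@{handle}"

class PhotoStoreServer:
    def __init__(self, nas):
        self.nas = nas
        self.handle_cache = {}       # filename -> file handle

    def open_photo(self, filename):
        handle = self.handle_cache.get(filename)
        if handle is None:
            # Cache miss: pay for the full filename-to-handle resolution.
            handle = self.nas.lookup(filename)
            self.handle_cache[filename] = handle
        # Cache hit: open directly by handle, as the custom
        # "open_by_filehandle" system call would.
        return self.nas.open_by_handle(handle)
```

A second open of the same filename skips the lookup entirely; the catch, as the slide notes, is that long-tail photos rarely get that second open.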
Not Feasible to Rely on NAS Appliances
• Caching every image's metadata is an expensive requirement for traditional filesystems
• Focusing only on caching (the NAS appliance's cache or memcache) has limited impact on reducing disk operations — memcache cannot hold all the images
Proposal of a New Method Is Necessary
Existing systems fit other workloads:
• GFS — development work, log data, and photos
• NAS — development work and log data
• Hadoop — extremely large log data
None is a good fit for serving photo requests in the long tail.
Proposal of a New Method Is Necessary
To serve photo requests in the long tail:
• Build a custom storage system
• Reduce the amount of filesystem metadata per photo
• Keep all the metadata in main memory, which is cheaper than buying more NAS appliances
Haystack
• An object storage system for sharing photos on Facebook, where data is written once, read often, never modified, and rarely deleted
• Long-tail effect:
• A sharp rise in requests for photos that are a few days old
• A significant number of requests for old photos cannot be served from cached data
[Figure: cumulative distribution function of the number of photos requested, by age]
4 Goals:
• High throughput and low latency
• Fault-tolerant
• Cost-effective
• Simple
3 Contributions
• Haystack, an object storage system optimized for the efficient storage and retrieval of billions of photos
• Lessons learned in building and scaling an inexpensive, reliable, and available photo storage system
• A characterization of the requests made to Facebook’s photo sharing application
Strategy
• Straightforward approach: store multiple photos in a single file, and therefore maintain very large files. This keeps the design simple and efficient, and enabled rapid implementation and deployment.
• Two kinds of metadata:
• Application metadata describes the information needed to construct a URL that a browser can use to retrieve a photo
• Filesystem metadata identifies the data necessary for a host to retrieve the photos that reside on that host's disk
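The application metadata ultimately boils down to a photo URL of the form http://⟨CDN⟩/⟨Cache⟩/⟨Machine id⟩/⟨Logical volume, Photo⟩, as described in the paper. A sketch (the hostnames and ids below are made-up examples):

```python
# Sketch of constructing Haystack's photo URL:
#   http://<CDN>/<Cache>/<Machine id>/<Logical volume, Photo>
# (hostnames and ids below are hypothetical examples).

def photo_url(cdn, cache, machine_id, logical_volume, photo_id):
    return f"http://{cdn}/{cache}/{machine_id}/{logical_volume},{photo_id}"

url = photo_url("cdn.example.com", "cache.example.com", 42, 7, 123456)
print(url)
```

Each component of the path tells the next tier where to forward the request on a miss, which is how the Cache and Store stay addressable without extra lookups.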
Haystack Architecture
[Diagram: Browser, Web Server, CDN, Haystack Directory, Haystack Cache, Haystack Store; request/response steps 1–10]
Components of Haystack
• Haystack Directory
• Haystack Cache
• Haystack Store
4 Functions of the Haystack Directory
1. Provides a mapping from logical volumes to physical volumes
• Used by web servers to upload photos and to construct the image URLs for a page request
2. Balances writes across logical volumes and reads across physical volumes
3. Determines whether a photo request should be handled by the CDN or by the Cache
4. Identifies logical volumes that are read-only, either for operational reasons or because they have reached their maximum storage capacity
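Functions 1, 2, and 4 can be sketched with a toy Directory (hypothetical structure and names, for illustration only):

```python
import random

# Toy Haystack Directory sketch (hypothetical): maps logical volumes to
# the physical volumes that replicate them, and balances reads/writes.

class Directory:
    def __init__(self):
        # logical volume id -> list of physical replicas (machine, volume id)
        self.mapping = {}
        self.writable = set()

    def add_logical_volume(self, lv, physical_volumes, writable=True):
        self.mapping[lv] = list(physical_volumes)
        if writable:
            self.writable.add(lv)

    def pick_write_volume(self):
        # Function 2: balance writes across write-enabled logical volumes.
        return random.choice(sorted(self.writable))

    def pick_read_replica(self, lv):
        # Function 2: balance reads across the physical replicas.
        return random.choice(self.mapping[lv])

    def mark_read_only(self, lv):
        # Function 4: volume is read-only operationally or at capacity.
        self.writable.discard(lv)
```

Random choice is the simplest balancing policy; the real Directory can use whatever policy it likes because all routing decisions funnel through it.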
Features of the Haystack Cache
• Receives HTTP requests for photos from CDNs and also directly from users' browsers
• Organized as a distributed hash table, using a photo's id to locate cached data
• If it cannot respond to the request from cached data, it fetches the photo from the Store machine identified in the URL and replies to either the CDN or the user's browser
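The Cache's hit-or-fetch behavior can be sketched as follows (hypothetical code; `fetch_from_store` stands in for the HTTP fetch from the Store machine named in the URL):

```python
# Sketch (hypothetical) of the Haystack Cache: a hash table keyed by
# photo id; on a miss it fetches from the Store machine in the URL.

class HaystackCache:
    def __init__(self, fetch_from_store):
        self.table = {}                      # photo id -> photo bytes
        self.fetch_from_store = fetch_from_store

    def get(self, photo_id, store_machine):
        data = self.table.get(photo_id)
        if data is None:
            # Cannot respond from cached data: ask the Store machine
            # identified in the URL, then remember the result.
            data = self.fetch_from_store(store_machine, photo_id)
            self.table[photo_id] = data
        return data   # reply to the CDN or the user's browser
```

The point of the Cache is that only the first request for a photo reaches the Store; repeats are absorbed in memory.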
Features of the Haystack Store
• Holds multiple physical volumes, each with millions of photos; each volume is a large file (~100 GB) saved as '/hay/haystack_<logical volume id>'
• A photo can be accessed quickly using only the id of the corresponding logical volume and the file offset at which the photo resides
• Retrieves the filename, offset, and size for a particular photo without needing disk operations
• Maintains an in-memory data structure to retrieve needles quickly
• After a crash, reconstructs this data structure directly from the volume file before processing requests
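The Store's in-memory index can be sketched like this (a hypothetical toy; a bytearray stands in for the large append-only volume file):

```python
# Sketch (hypothetical) of a Haystack Store volume's in-memory index:
# map a photo's key to (offset, size) so a read needs at most one
# disk operation, with no disk I/O for metadata.

class StoreVolume:
    def __init__(self):
        self.data = bytearray()        # stands in for the big volume file
        self.index = {}                # key -> (offset, size), in memory

    def append_needle(self, key, photo_bytes):
        offset = len(self.data)
        self.data += photo_bytes       # needles are appended sequentially
        self.index[key] = (offset, len(photo_bytes))

    def read_photo(self, key):
        offset, size = self.index[key]                 # metadata: no disk I/O
        return bytes(self.data[offset:offset + size])  # one "disk" read
```

Crash recovery falls out of the layout: since the index is derived from the volume file, it can be rebuilt by scanning the file from the start.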
Each Physical Volume of the Haystack Store
• Each physical volume on every Store machine consists of a superblock followed by a sequence of needles (the photos stored in Haystack)
The Superblock and the Format of Each Needle
• Volume layout: Superblock, Needle 1, Needle 2, Needle 3, …, Needle N
• Each needle contains: Header, Cookie, Key, Alternate Key, Flags, Size, Data, Footer, Data Checksum, Padding
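The needle fields above can be sketched as a packed binary record. The field widths and magic numbers below are illustrative assumptions, not the paper's exact layout, and the checksum is a toy stand-in:

```python
import struct

# Sketch: packing one needle's fixed-size fields plus its data.
# Field widths and magic values are assumptions for illustration.
HEADER_MAGIC = 0xFACE
FOOTER_MAGIC = 0xFEED

def pack_needle(cookie, key, alt_key, flags, data):
    # Header, Cookie, Key, Alternate Key, Flags, Size ...
    header = struct.pack("<IQQIBI", HEADER_MAGIC, cookie, key,
                         alt_key, flags, len(data))
    # ... Data, Footer, Data Checksum, Padding (to an 8-byte boundary).
    checksum = sum(data) & 0xFFFFFFFF      # toy checksum for the sketch
    footer = struct.pack("<II", FOOTER_MAGIC, checksum)
    padding = b"\x00" * (-(len(header) + len(data) + len(footer)) % 8)
    return header + data + footer + padding
```

Because every field except Data has a fixed size, a reader that knows a needle's offset can parse the header, then jump straight to the data.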
Tolerance of Failure
Failures to tolerate:
• Faulty hard drives
• Misbehaving RAID controllers
• Bad motherboards
2 Techniques:
1. Pitchfork — for detection: a background task that periodically checks the health of each Store machine, and automatically marks all logical volumes of an unhealthy Store machine as read-only
2. Bulk Sync — for repair: resets the data of a Store machine; happens rarely (a few each month); simple but time-consuming
• Bottleneck: the amount of data to be bulk synced means the mean time to recovery is measured in hours
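Pitchfork's detection sweep can be sketched as follows (hypothetical code; the health-check and marking callables stand in for real RPCs to Store machines and the Directory):

```python
# Sketch (hypothetical) of a Pitchfork sweep: periodically check each
# Store machine's health and mark all its logical volumes read-only
# in the Directory when a check fails.

def pitchfork_sweep(store_machines, mark_read_only):
    """store_machines: dict of name -> health-check callable (True = healthy).
    mark_read_only: callable that marks a machine's volumes read-only."""
    unhealthy = []
    for name, health_check in store_machines.items():
        if not health_check():
            mark_read_only(name)     # stop routing writes to this machine
            unhealthy.append(name)
    return unhealthy
```

Marking volumes read-only is deliberately conservative: reads can still be served from replicas while operators decide whether a Bulk Sync is needed.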
Evaluation
1. Characterize the photo requests seen by Facebook
2. Effectiveness of the Directory
3. Effectiveness of the Cache
4. Analyze how well the Store performs using both synthetic and production workloads
Evaluation
• Characterize the photo requests seen by Facebook
[Figure: cumulative distribution function of the number of photos requested in a day, categorized by age (time since upload); volume of daily photo traffic]
Evaluation
• Effectiveness of the Directory
[Figure: volume of multi-write operations sent to 9 different write-enabled Haystack Store machines; the graph has 9 different lines that closely overlap each other]
⇒ The Directory balances writes well
Evaluation
• Effectiveness of the Cache
The Cache achieved high hit rates of approximately 80%.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
• Benchmarks:
1. Randomio — an open-source multithreaded disk I/O program, used to measure the raw capabilities of the storage devices
2. Haystress — a custom-built multithreaded program, used to evaluate Store machines under a variety of synthetic workloads (7 different Haystress workloads were used)
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
[Figure: throughput and latency of read and multi-write operations on synthetic workloads. Config B uses a mix of 8KB and 64KB images; the remaining configs use 64KB images.]
• Performing random reads to 64KB images on a Store machine with 201 volumes, Haystack delivers 85% of the raw throughput with only 17% higher latency.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
• Performing random reads to a mix of 30% 64KB images and 70% 8KB images yields higher throughput and lower latency.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
• Because Haystack can batch writes together, 1, 4, and 16 image writes were batched into a single multi-write: throughput improves by 30% in config D and 78% in config E, and per-image latency is also reduced.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
• Config F uses a mix of 98% reads & 2% multi-writes; config G uses 96% reads & 4% multi-writes; each multi-write writes 16 images. Haystack sustains high read throughput even in the presence of writes.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
[Figure: rate of different operations on two Haystack Store machines, one read-only and the other write-enabled]
• Photo uploads peak on Sundays and Mondays, with a smooth drop during the rest of the week.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
• The read request rate increases as more data gets written to the write-enabled machine, which therefore sees many more requests.
Evaluation
• Analyze how well the Store performs using synthetic & production workloads
[Figure: average latency of read and multi-write operations on the two Haystack Store machines over the same 3-week period]
• Multi-write latencies are very flat and stable, but read performance is unstable for 3 reasons:
1. The read traffic increases as the number of photos stored on the machine increases
2. The read-only machine does not need to cache photos
3. Recently written photos are usually read back immediately, because Facebook highlights recent content
Conclusion
• Haystack limits the number of disk operations (the bottleneck) to only those necessary for reading the actual photo data.
• It dramatically reduces the memory used for filesystem metadata, thereby making it practical to keep all of this metadata in main memory.
Thank you very much for your attention!