Tradeoffs in Scalable Data Routing for Deduplication Clusters
FAST '11

Wei Dong (Princeton University)
Fred Douglis, Kai Li, Hugo Patterson, Sazzala Reddy, Philip Shilane (EMC)

2011. 04. 21 (Thu)
Kwangwoon Univ. System Software Lab.
HoSeok Seo



Introduction

This paper proposes
- a deduplication cluster storage system having a primary node with a hard disk

Cluster storage systems are
- a well-known technique to increase capacity
- but they have two problems:
  - less deduplication than a single-node system
  - performance that does not scale linearly

Introduction: Goals

Scalable throughput
- use super-chunks for data transfer
- maximize the parallelism of disk I/O by routing data evenly across nodes
- reduce the disk I/O bottleneck by exploiting cache locality

Scalable capacity
- use a cluster storage system
- route repeated data to the same node
- maintain balanced utilization across nodes

High deduplication, close to that of a single-node system
- use super-chunks that consist of consecutive chunks

Introduction: Chunk

Definition
- a segment of a data stream

Merits
- when the chunk size is small: high deduplication
- when the chunk size is large: high throughput

Introduction: Super-chunk

Definition
- consists of consecutive chunks

Merits
- maintains high cache locality
- reduces system overhead
- achieves a deduplication rate similar to that of chunk-level routing

Demerits
- risk of creating duplicates (the same chunk may be stored on more than one node)
- can result in imbalanced utilization between nodes

Issues of super-chunks
- how they are formed
- how they are assigned to nodes
- how they are routed to nodes so that the load stays balanced

Dataflow of Deduplication Cluster

1. Divide data streams into chunks

2. Create fingerprints of the chunks

3. Create a super-chunk

4. Select a representative for the super-chunk from among its chunks

5. Route the super-chunk to one of the nodes
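
The five steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fixed-size chunking, the 8KB chunk size, the choice of the minimum fingerprint as representative, and all function names are hypothetical and not taken from the slides.

```python
import hashlib


def chunk_stream(data: bytes, chunk_size: int = 8 * 1024) -> list[bytes]:
    """Step 1: split a stream into chunks (fixed-size here, an assumption)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def fingerprint(chunk: bytes) -> bytes:
    """Step 2: compute a SHA-1 fingerprint of one chunk."""
    return hashlib.sha1(chunk).digest()


def make_super_chunks(chunks: list[bytes], chunks_per_super: int = 128) -> list[list[bytes]]:
    """Step 3: group consecutive chunks into super-chunks.

    128 chunks of 8KB give the 1MB super-chunk size; the 8KB figure is an assumption.
    """
    return [chunks[i:i + chunks_per_super]
            for i in range(0, len(chunks), chunks_per_super)]


def route(super_chunk: list[bytes], num_nodes: int) -> int:
    """Steps 4 and 5: pick a representative fingerprint and map it to a node.

    Using the minimum fingerprint is one possible choice (assumption).
    """
    representative = min(fingerprint(c) for c in super_chunk)
    return int.from_bytes(representative, "big") % num_nodes
```

Because routing depends only on content, two super-chunks with the same chunks always land on the same node, which is what lets repeated data deduplicate.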

Deduplication flow at a node

For each incoming chunk, the deduplication logic checks whether it is a duplicate:

1. Is the fingerprint in the cache?
   - yes: deduplication done
   - no: go to step 2
2. Is the fingerprint in the index?
   - yes: load the fingerprints that were written at the same time into the cache; deduplication done
   - no: go to step 3
3. Write the fingerprint and chunk to a container
4. Is the container full?
   - yes: write the container to disk

In the original figure, colored boxes mark the steps that require disk access.
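
A minimal sketch of this per-node flow follows, assuming in-memory stand-ins for the on-disk index and containers. The class name `DedupNode`, the tiny container capacity, and the decision to add a freshly written fingerprint to the cache are all assumptions made for illustration.

```python
import hashlib


class DedupNode:
    """Toy model of one node: fingerprint cache, index, and containers."""

    def __init__(self, container_capacity: int = 4):
        self.cache = set()          # in-memory fingerprint cache
        self.index = {}             # stand-in for the on-disk index: fingerprint -> container id
        self.containers = []        # stand-in for containers already written to disk
        self.open_container = []    # container currently being filled
        self.container_capacity = container_capacity
        self.index_lookups = 0      # counts simulated on-disk index lookups

    def store(self, chunk: bytes) -> bool:
        """Store one chunk; return True if it was a duplicate."""
        fp = hashlib.sha1(chunk).digest()
        if fp in self.cache:                     # fingerprint in cache?
            return True
        self.index_lookups += 1                  # fingerprint in index? (disk access)
        container_id = self.index.get(fp)
        if container_id is not None:
            # Load the fingerprints written at the same time into the cache.
            self.cache.update(f for f, _ in self.containers[container_id])
            return True
        # New data: write fingerprint and chunk to the open container.
        self.open_container.append((fp, chunk))
        self.cache.add(fp)                       # assumption: cache new fingerprints too
        if len(self.open_container) >= self.container_capacity:
            self._write_container()              # container full: write it to disk
        return False

    def _write_container(self) -> None:
        container_id = len(self.containers)
        self.containers.append(self.open_container)
        for fp, _ in self.open_container:
            self.index[fp] = container_id
        self.open_container = []
```

Loading a whole container's fingerprints on one index hit is what turns stream locality into cache hits: the chunks that follow usually arrived together the first time as well.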

What is a Container?

Definition
- a fixed-size, large unit of storage on disk
- consists of two parts: fingerprints and chunk data

Usage
- stores the fingerprints and chunks of non-duplicate data on disk

Layout: [ Fingerprints | Chunk Data ]

Issue 1: How are super-chunks formed?

Determine an average super-chunk size
- experimented with a variety of sizes from 8KB to 4MB
- generally, 1MB is a good choice

Issue 2: How are super-chunks assigned to nodes?

- Use a Bin Manager running on the master node
- The Bin Manager rebalances the nodes by bin migration (for stateless routing)

Steps
1. Assign a bin number to a super-chunk
2. Find a node by the bin number
3. Route the super-chunk to that node

(Figure: the bin manager maps bins bin1, bin2, bin3, ..., binM to nodes node1, node2, node3, ..., nodeN, with M > N.)
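
The bin indirection can be sketched as a small lookup table. This is an assumed illustration, not the authors' implementation: the class and method names are hypothetical, and the initial round-robin placement of bins on nodes is an assumption.

```python
class BinManager:
    """Maps M bins onto N nodes (M > N) and supports moving a bin to another node."""

    def __init__(self, num_bins: int, num_nodes: int):
        if num_bins <= num_nodes:
            raise ValueError("expected more bins than nodes (M > N)")
        self.num_bins = num_bins
        # Assumption: bins start out spread round-robin over the nodes.
        self.bin_to_node = [b % num_nodes for b in range(num_bins)]

    def bin_for(self, representative: bytes) -> int:
        """Step 1: assign a bin number from the representative fingerprint."""
        return int.from_bytes(representative, "big") % self.num_bins

    def node_for(self, representative: bytes) -> int:
        """Steps 2 and 3: find the node that currently owns the bin."""
        return self.bin_to_node[self.bin_for(representative)]

    def migrate(self, bin_id: int, target_node: int) -> None:
        """Rebalance by reassigning one bin to a different node."""
        self.bin_to_node[bin_id] = target_node
```

Having more bins than nodes means rebalancing only has to move whole bins; the mapping from super-chunk to bin never changes.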

Issue 3: How are super-chunks routed to nodes for balance?

Use two data routing techniques to overcome the demerits of super-chunks

Stateless technique with bin migration
- lightweight and well suited for most balanced workloads

Stateful technique
- improves deduplication while avoiding data skew

Stateless Technique

Basic steps
1. Create a fingerprint for each chunk
2. Select a representative fingerprint from among the fingerprints
3. Allocate a bin to the super-chunk (e.g., representative fingerprint mod number of bins)

How to create a fingerprint
- hash the whole chunk, a.k.a. hash(*)
- hash the first N bytes of the chunk, a.k.a. hash(N)
- the hash function is SHA-1

How to select the representative fingerprint
- first
- maximum
- minimum
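
The feature and selection choices above can be compared side by side in a short sketch. The function names and the default of 64 prefix bytes with the minimum selector are assumptions picked for illustration.

```python
import hashlib
from typing import Optional


def feature(chunk: bytes, prefix_len: Optional[int] = None) -> int:
    """hash(*) when prefix_len is None, otherwise hash(N) over the first N bytes."""
    data = chunk if prefix_len is None else chunk[:prefix_len]
    return int.from_bytes(hashlib.sha1(data).digest(), "big")


def representative(super_chunk: list[bytes], how: str = "min",
                   prefix_len: Optional[int] = 64) -> int:
    """Select the representative feature: 'first', 'max', or 'min'."""
    features = [feature(c, prefix_len) for c in super_chunk]
    if how == "first":
        return features[0]
    if how == "max":
        return max(features)
    if how == "min":
        return min(features)
    raise ValueError(f"unknown selector: {how}")


def stateless_bin(super_chunk: list[bytes], num_bins: int, how: str = "min",
                  prefix_len: Optional[int] = 64) -> int:
    """Allocate a bin: representative feature mod number of bins."""
    return representative(super_chunk, how, prefix_len) % num_bins
```

Note that hash(N) maps two chunks with the same first N bytes to the same feature even if their tails differ, which hash(*) does not.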

Stateful Technique (cont.)

Merits compared to stateless
- higher deduplication, close to a single-node backup system
- balanced load across nodes
- bin migration no longer needed

Demerits
- more operations
- increased cost of memory or communication

Stateful Technique: Process

1. Calculate the "weighted vote" of each node:

   weighted vote = number of matches * (1 / overloaded value)

2. Select the node that has the highest weighted vote

- number of matches: the number of duplicate chunks at each node
- overloaded value: the node's storage utilization relative to the average storage utilization (1.0 means average)
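
The voting step might look like the sketch below. Several things here are assumptions rather than facts from the slides: the exact form of the formula (matches divided by the overloaded value), the clamp of the overloaded value at 1.0 so that only above-average nodes are penalized, the fallback to the least-used node when no node reports a match, and all function names.

```python
def weighted_vote(matches: int, node_usage: float, average_usage: float) -> float:
    """Number of matches scaled down by how overloaded the node is."""
    if average_usage > 0:
        overloaded = node_usage / average_usage
    else:
        overloaded = 1.0  # empty cluster: treat every node as average
    # Assumption: only penalize nodes above average (also avoids dividing by zero).
    overloaded = max(overloaded, 1.0)
    return matches / overloaded


def pick_node(matches_per_node: list[int], usage_per_node: list[float]) -> int:
    """Return the index of the node with the highest weighted vote."""
    average = sum(usage_per_node) / len(usage_per_node)
    votes = [weighted_vote(m, u, average)
             for m, u in zip(matches_per_node, usage_per_node)]
    best = max(votes)
    if best == 0:
        # Assumption: with no matches anywhere, fall back to the least-used node.
        return usage_per_node.index(min(usage_per_node))
    return votes.index(best)
```

With matches of 10, 12, and 0 and usage of 100, 200, and 100, the second node has the most matches but is 1.5 times the average, so its vote drops to about 8 and the first node wins.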


Datasets

Evaluation Metrics

Capacity
- Total Deduplication (TD): original dataset size / deduplicated size
- Data Skew: max node utilization / avg node utilization
- Effective Deduplication (ED): TD / Data Skew
- Normalized ED: shows how close the deduplication is to that of a single-node system

Throughput
- number of on-disk fingerprint index lookups
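
The capacity metrics reduce to a few divisions, sketched below. Two points are assumptions: the deduplicated size is taken to be the sum of the per-node stored bytes, and Normalized ED is computed as ED divided by the single-node TD. The function names are hypothetical.

```python
def total_dedup(original_size: float, dedup_size: float) -> float:
    """TD: original dataset size divided by deduplicated size."""
    return original_size / dedup_size


def data_skew(node_usage: list[float]) -> float:
    """Data skew: maximum node utilization divided by average node utilization."""
    return max(node_usage) / (sum(node_usage) / len(node_usage))


def effective_dedup(original_size: float, node_usage: list[float]) -> float:
    """ED: TD divided by data skew (deduplicated size = sum of node usage, an assumption)."""
    return total_dedup(original_size, sum(node_usage)) / data_skew(node_usage)


def normalized_ed(ed: float, single_node_td: float) -> float:
    """Normalized ED: ED relative to a single-node system's TD (assumed definition)."""
    return ed / single_node_td
```

For example, a 1000-unit dataset stored as 100, 50, and 50 units on three nodes has TD = 5 and skew = 1.5, so ED is about 3.33; the fullest node limits the usable capacity of the whole cluster.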

Experimental Results: Overall Effectiveness

- Using trace-driven simulation

Experimental Results: Overall Effectiveness (with bin migration)

Experimental Results: Feature Selection

HYDRAstor
- routes chunks to nodes according to content
- good performance
- worse deduplication rate due to 64KB chunks

Experimental Results: Cache Locality and Throughput

- Logical skew: max(size before dedupe) / avg(size before dedupe)
- Max lookup: maximum normalized total number of fingerprint index lookups
- ED: Effective Deduplication

(Both plots: 32 nodes)

Experimental Results: Effect of Bin Migration

The ED drops between migration points due to increasing skew.

Summary

                 Stateless                         Stateful
                 Small clusters   Large clusters   All
Deduplication    Good             Bad              Good
Data skew        Good             Bad              Good
Overhead         Good             Good             Bad

Conclusion

1. Using super-chunks for data routing is superior to using individual chunks: it achieves scalable throughput while maximizing deduplication.

2. The stateless routing method (hash(64)) with bin migration is a simple and efficient approach.

3. The effective deduplication of a stateless routed cluster may drop quickly as the number of nodes increases. To solve this problem, the authors propose a stateful data routing approach. Simulations show good performance when using up to 64 nodes in a cluster.