Schlosser Dynamo

  • 8/12/2019 Schlosser Dynamo

    1/21

    Dynamo: Amazon's Highly Available Key-value Store

    Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,

    Avinash Lakshman, Alex Pilchin, Swami Sivasubramanian, Peter Vosshall,

    and Werner Vogels

    Presented by Steve Schlosser, Big Data Reading Group

    October 1, 2007

    What Dynamo is

    Dynamo is a highly available distributed key-value storage system

    put(), get() interface

    Sacrifices consistency for availability

    Provides storage for some of Amazon's key products (e.g., shopping carts, best seller lists, etc.)

    Uses synthesis of well known techniques to achieve scalability and availability

    Consistent hashing, object versioning, conflict resolution, etc.

    Scale

    Amazon is busy during the holidays

    Shopping cart: tens of millions of requests for 3 million checkouts in a single day

    Session state system: 100,000s of concurrently active sessions

    Failure is common

    Small but significant number of server and network failures at all times

    Customers should be able to view and add items to their shopping cart even if disks are failing, network routes are flapping, or data centers are being destroyed by tornados.

    Flexibility

    Minimal need for manual administration

    Nodes can be added or removed without manual partitioning or redistribution

    Apps can control availability, consistency, cost-effectiveness, performance

    Can developers know this up front?

    Can it be changed over time?

    Assumptions & requirements

    Simple query model

    values are small (

    Service level agreements

    SLAs are used widely at Amazon

    Sub-services must meet strict SLAs

    e.g., 300ms response time for 99.9% of requests at peak load of 500 requests/s

    Average-case SLAs are not good enough

    Mentioned a cost-benefit analysis that said 99.9% is the right number

    Rendering a single page can make requests to 150 services
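
A small sketch of why an average-case SLA can hide the tail that a 99.9th-percentile SLA exposes. The latency samples are made up for illustration, and the last line assumes (hypothetically) that the 150 services behave independently, which the slides do not claim.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical latencies in ms: 995 fast requests and 5 very slow ones.
latencies = [20] * 995 + [2000] * 5

mean = sum(latencies) / len(latencies)   # looks healthy
p999 = percentile(latencies, 99.9)       # exposes the slow tail

# Chance that a page touching 150 services finds every one of them inside
# its own 99.9% bound, under the independence assumption stated above.
all_within = 0.999 ** 150

print(f"mean={mean} ms, p99.9={p999} ms, all 150 within bound={all_within:.3f}")
```

Under these assumptions the mean stays far below a 300ms target while the 99.9th percentile does not, and a page that fans out to 150 services misses at least one per-service bound noticeably often.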

    Consistency

    Eventual consistency

    Always writable

    Can always write to shopping cart

    Pushes conflict resolution to reads

    Application-driven conflict resolution

    e.g., merge conflicting shopping carts

    Or Dynamo enforces last-writer-wins

    How often does this work?
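
The two reconciliation options above can be contrasted with a short sketch. The cart representation (a dict of item to quantity), the merge rule, and the timestamps are assumptions for illustration, not Amazon's actual logic.

```python
def merge_carts(versions):
    """Application-driven reconciliation: union of items, keeping the larger quantity."""
    merged = {}
    for cart in versions:
        for item, qty in cart.items():
            merged[item] = max(qty, merged.get(item, 0))
    return merged

def last_writer_wins(timestamped_versions):
    """Store-driven reconciliation: keep only the version with the newest timestamp."""
    return max(timestamped_versions, key=lambda pair: pair[0])[1]

# Two divergent versions of the same cart (hypothetical data).
cart_a = {"book": 1, "lamp": 1}
cart_b = {"book": 2, "socks": 1}

merged = merge_carts([cart_a, cart_b])
lww = last_writer_wins([(100, cart_a), (105, cart_b)])

print(merged)  # every item from both versions survives
print(lww)     # the lamp from the older version is silently dropped
```

The merge never loses an added item, at the cost of pushing work to the application; last-writer-wins needs no application code but can discard an update.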

    Other stuff

    Incremental scalability

    Minimal management overhead

    Symmetry

    No master/slave nodes

    Decentralized

    Centralized control leads to too many failures

    Heterogeneity

    Exploit capabilities of different nodes

    Interface

    get(key) returns object replica(s) for key, plus a context object

    context encodes metadata, opaque to caller

    put(key, context, object) stores object
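
A toy, single-process sketch of the shape of this interface. `ToyStore` is a hypothetical name, and the context here is just the set of version ids the caller has read; it stands in for the real metadata and is not how Dynamo encodes it.

```python
import itertools

class ToyStore:
    def __init__(self):
        self._data = {}                 # key -> {version_id: object}
        self._ids = itertools.count(1)  # source of fresh version ids

    def get(self, key):
        """Return all current versions for key, plus a context opaque to the caller."""
        versions = self._data.get(key, {})
        return list(versions.values()), frozenset(versions)

    def put(self, key, context, obj):
        """Store obj, superseding only the versions named in the caller's context."""
        versions = self._data.setdefault(key, {})
        for seen in context:
            versions.pop(seen, None)
        versions[next(self._ids)] = obj

store = ToyStore()
store.put("cart", frozenset(), ["book"])
_, ctx = store.get("cart")
store.put("cart", ctx, ["book", "lamp"])    # writer 1
store.put("cart", ctx, ["book", "socks"])   # writer 2 reuses the same stale context
siblings, ctx2 = store.get("cart")          # both divergent versions come back
store.put("cart", ctx2, ["book", "lamp", "socks"])
reconciled, _ = store.get("cart")           # one version again after the merged write
```

The point of the sketch is only that get() may hand back several replicas and that the context passed to put() tells the store which of them the write replaces.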

    Variant of consistent hashing

    [Figure: hash ring with nodes A, B, C, D, E, F, G and key K]

    Each node is assigned to multiple points in the ring (e.g., B, C, D store key range (A, B))

    # of points can be assigned based on node's capacity

    If node becomes unavailable, load is distributed to other nodes
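
A minimal sketch of a hash ring with multiple points per node, under a few assumptions: SHA-256 is an arbitrary hash choice, the `Ring` class and the `node#i` token labels are hypothetical, and replication is ignored (each key has a single owner, the first point clockwise from its position).

```python
import bisect
import hashlib

def ring_position(label):
    """Map a string to a position on the ring."""
    return int(hashlib.sha256(label.encode()).hexdigest(), 16)

class Ring:
    def __init__(self):
        self._points = []   # sorted list of (position, node)

    def add_node(self, node, tokens):
        """Place `tokens` points for this node; more tokens means more load."""
        for i in range(tokens):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def remove_node(self, node):
        self._points = [p for p in self._points if p[1] != node]

    def owner(self, key):
        """First point at or after the key's position, wrapping around the ring."""
        i = bisect.bisect_left(self._points, (ring_position(key), ""))
        return self._points[i % len(self._points)][1]

ring = Ring()
ring.add_node("A", tokens=8)
ring.add_node("B", tokens=8)
ring.add_node("C", tokens=16)   # assumed to have twice the capacity

keys = [f"key-{i}" for i in range(200)]
before = {k: ring.owner(k) for k in keys}
ring.remove_node("B")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Only keys that B owned change hands when B leaves, and because B's points are scattered around the ring those keys are typically split between the remaining nodes rather than all landing on one neighbor.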

    Replication

    [Figure: hash ring with nodes A, B, C, D, E, F, G; key K and the coordinator for key K are marked]

    D stores (A, B], (B, C], (C, D]

    B maintains a preference list for each data item specifying nodes storing that item

    Preference list skips virtual nodes in favor of physical nodes
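
One way to read "skips virtual nodes in favor of physical nodes": walk clockwise from the key and keep a point only if its physical host is not already in the list. The ring layout, host names, and positions below are made up for illustration.

```python
import bisect

def preference_list(ring, key_pos, n):
    """ring: sorted list of (position, physical_host) points.
    Returns the first n distinct hosts clockwise from key_pos;
    the first entry plays the role of coordinator."""
    start = bisect.bisect_left(ring, (key_pos, ""))
    chosen = []
    for step in range(len(ring)):
        host = ring[(start + step) % len(ring)][1]
        if host not in chosen:          # skip further points of a host already chosen
            chosen.append(host)
        if len(chosen) == n:
            break
    return chosen

# Hypothetical ring: host1 and host2 each own two points.
ring = [(10, "host1"), (20, "host2"), (30, "host2"),
        (40, "host3"), (50, "host1"), (60, "host4")]

print(preference_list(ring, 15, 3))   # ['host2', 'host3', 'host1']
print(preference_list(ring, 55, 3))   # ['host4', 'host1', 'host2']
```

Without the skip, the key at position 15 would get two of its three replicas on host2, so a single machine failure would take out two copies.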

    Data versioning

    put() can return before update is applied to all replicas

    Subsequent get()s can return older versions

    This is okay for shopping carts

    Branched versions are collapsed

    Deleted items can resurface

    A vector clock is associated with each object version

    Comparing vector clocks can determine whether two versions are parallel branches or causally ordered

    Vector clocks passed by the context object in get()/put()

    Application must maintain this metadata?
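
A sketch of the comparison described above, with a vector clock represented as a dict from node name to counter. The node names `Sx`, `Sy`, `Sz` and the clock values are illustrative assumptions.

```python
def descends(a, b):
    """True if clock a has seen every event that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify two versions as equal, causally ordered, or parallel branches."""
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a after b"
    if descends(b, a):
        return "b after a"
    return "concurrent"

def merge(a, b):
    """Clock for a reconciled version: element-wise maximum of both clocks."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}

d2 = {"Sx": 2}
d3 = {"Sx": 2, "Sy": 1}   # written via Sy on top of d2
d4 = {"Sx": 2, "Sz": 1}   # written via Sz on top of d2

print(compare(d3, d2))    # a after b: d2 can be discarded
print(compare(d3, d4))    # concurrent: both branches must be kept and reconciled
print(merge(d3, d4))
```

A causally ordered pair lets the store drop the older version on its own; a concurrent pair is what gets handed back to the reader for reconciliation.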

    Vector clock example

    Quorum-likeness

    get() & put() driven by two parameters:

    R: the minimum number of replicas to read

    W: the minimum number of replicas to write

    R + W > N yields a quorum-like system

    Latency is dictated by the slowest of the R (or W) replicas

    Sloppy quorum to tolerate failures

    Replicas can be stored on healthy nodes downstream in the ring, with metadata specifying that the replica should be sent to the intended recipient later
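
A brute-force sketch of the R + W > N condition: it checks that every possible read set shares at least one replica with every possible write set. It models a strict quorum over a fixed set of N replicas only; the sloppy-quorum handoff is not modeled, and the latency figures are made up.

```python
from itertools import combinations

def always_intersect(n, r, w):
    """True if every read set of size r overlaps every write set of size w."""
    replicas = range(n)
    return all(set(rs) & set(ws)
               for rs in combinations(replicas, r)
               for ws in combinations(replicas, w))

def quorum_latency(replica_latencies, k):
    """Time until k replicas have answered, i.e. the slowest of the k fastest."""
    return sorted(replica_latencies)[k - 1]

# The overlap guarantee holds exactly when R + W > N.
for n in range(1, 5):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert always_intersect(n, r, w) == (r + w > n)

print(always_intersect(3, 2, 2))          # True
print(always_intersect(3, 1, 2))          # False: a read can miss the latest write
print(quorum_latency([12, 300, 40], 2))   # 40: the straggler at 300 ms is not waited for
```

Lowering R or W trades the overlap guarantee for latency, since the operation returns as soon as the R-th (or W-th) replica responds.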

    Adding and removing nodes

    Explicit commands issued via CLI or browser

    Gossip-style protocol propagates changes among nodes

    New node chooses virtual nodes in the hash space

    Implementation

    Persistent store either Berkeley DB Transactional Data Store, BDB Java Edition, MySQL, or in-memory buffer w/ persistent backend

    All in Java!

    Common N, R, W setting is (3, 2, 2)

    Results are from several hundred nodes configured as (3, 2, 2)

    Not clear whether they run in a single datacenter

    [Figure: one tick = 12 hours]

    [Figure: one tick = 1 hour]

    [Figure: one tick = 30 minutes]

    During periods of high load, popular objects dominate

    During periods of low load, fewer popular objects are accessed

    Quantifying divergent versions

    In a 24 hour trace

    99.94% of requests saw exactly one version

    0.00057% received 2 versions

    0.00047% received 3 versions

    0.00009% received 4 versions

    Experience showed that divergence usually came from concurrent writers due to automated client programs (robots), not humans

    Conclusions

    Scalable: Easy to shovel in more capacity at Christmas

    Simple: get()/put() maps well to Amazon's workload

    Flexible: Apps can set N, R, W to match their needs

    Inflexible: Apps have to set N, R, W to match their needs

    Apps may have to do their own conflict resolution

    They claim it's easy to set these. Does this mean that there aren't many interesting points?

    Interesting?