HP: Hybrid Paxos for WANs
Dan Dobre, Matthias Majuntke, Marco Serafini and Neeraj Suri
{dan,majuntke,marco,suri}@cs.tu-darmstadt.de
TU Darmstadt, Germany

Dependable Embedded Systems & SW Group, www.deeds.informatik.tu-darmstadt.de
© Neeraj Suri, EU-NSF ICT March 2006



EDCC, Valencia, May 18, 2010, Matthias Majuntke

Resilience of Critical Services

[Figure: clients send a request to a single server and get no reply; with SMR, clients send the request to n ≥ 2t+1 replicas and get a reply.]

• Safety Critical Systems
  • Resilience against catastrophic failures
• State Machine Replication
  • Illusion of a single server that never fails
• Wide Area Replication
  • Large and unpredictable delays in WANs
  • Latency-optimal protocol


Which Consensus Protocol

• State Machine Replication (SMR)
  • Clients propose commands to replicas
  • Agreement on sequence of commands → replicas are in consistent state when executing command sequence
  • Consensus protocol needed
• Latency-optimal protocols
  • Latency: number of message delays between when a client proposes a command and when the command is learned by a learner (to be executed).
• Two Protocols by Lamport
  • Classic Paxos (CP): Client → Leader → Acceptors → Client
    • 3 message delays (during normal operation)
    • Majority quorum for recovery
  • Fast Paxos (FP): Client → Acceptors → Client
    • 2 message delays (during normal operation)
    • 2 + 4 message delays in presence of collisions
    • Larger quorum for recovery


Paxos vs. Fast Paxos

• Compared latency
  • "PlanetLab" experiments
  • Simulation of the CP and FP msg. patterns (different topologies)
• FP not always faster than CP
  • Some clients prefer CP, some FP
  • A single crash can turn the setting around
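The claim that FP is not always faster than CP can be illustrated with a small calculation. The following is a minimal sketch, not the simulation used in the talk: the delay values, the choice of n = 5 acceptors, and the function names are all assumptions made for illustration. It models latency as the time until the client has heard from a quorum (CP) or a fast quorum (FP) of acceptors.

```python
def kth_smallest(values, k):
    """Return the k-th smallest value (1-based)."""
    return sorted(values)[k - 1]


def cp_latency(client_to_leader, leader_to_acc, acc_to_client, quorum):
    # Client -> leader -> acceptors -> client. The client learns once the
    # quorum-th fastest acceptor reply has arrived.
    paths = [a + b for a, b in zip(leader_to_acc, acc_to_client)]
    return client_to_leader + kth_smallest(paths, quorum)


def fp_latency(client_to_acc, acc_to_client, fast_quorum):
    # Client -> acceptors -> client. The client needs replies from a
    # (larger) fast quorum, so slow acceptors matter more.
    paths = [a + b for a, b in zip(client_to_acc, acc_to_client)]
    return kth_smallest(paths, fast_quorum)


# Hypothetical one-way delays in ms; acceptor 0 doubles as the leader.
QUORUM, FAST_QUORUM = 3, 4

# Client A sits close to the leader and two acceptors, far from the other two.
a_cp = cp_latency(10, [0, 10, 10, 100, 100], [10, 10, 10, 100, 100], QUORUM)
a_fp = fp_latency([10, 10, 10, 100, 100], [10, 10, 10, 100, 100], FAST_QUORUM)

# Client B sits far from the leader but close to the other four acceptors.
b_cp = cp_latency(100, [0, 80, 80, 80, 80], [100, 10, 10, 10, 10], QUORUM)
b_fp = fp_latency([100, 10, 10, 10, 10], [100, 10, 10, 10, 10], FAST_QUORUM)

print(a_cp, a_fp)  # client A is better off with CP
print(b_cp, b_fp)  # client B is better off with FP
```

With these made-up delays, client A prefers CP because the fast quorum forces it to wait for a distant acceptor, while client B prefers FP because the detour through the leader dominates its CP latency.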


Motivation for a Hybrid Protocol

• No clear winner between CP and FP
  • With respect to latency
• Hybrid Protocol: Hybrid Paxos (HP)
  • Runs CP and FP in parallel
  • Chooses quickest outcome of the two protocols
  • Implements Generalized Consensus
    • Commuting commands may be chosen in any order
  • Does not negatively affect throughput
    • FP mode switched off when not beneficial


Outline of the Talk

• Contribution
• System Model
• Background on Paxos and Generalized Consensus
• Hybrid Paxos protocol
• Evaluation
• Discussion
• Conclusion


Contribution

• Hybrid Paxos (HP)
  • CP with additional "fast mode"
  • Fast learning in absence of collisions
  • 3 msg delays as CP in presence of collisions
  • Latency optimal
  • 2f+1 servers, f may crash (optimal)
  • Linear number of messages (optimal)
• First efficient implementation of Generalized Consensus
• Experiments using Emulab
  • HP reaches theoretical minimum of latency
  • HP does not negatively affect throughput


System Model

[Figure: clients and servers]

• Distributed System
  • n servers
  • Any number of clients (may crash)
  • Communication via reliable FIFO channels
  • Crash-stop model
  • At most a minority of servers fail (n ≥ 2f+1), f = number of crashes
• Asynchrony
  • Ω failure detector (eventually outputs the same correct leader)
• Generalized Consensus
  • Command History
  • Equivalence class of command sequences
  • Sequences c1 and c2 are equivalent iff executing them produces the same outputs and state
  • Commuting commands


Background on Generalized Consensus

• Protocol operates on command history = equivalence class of command sequences
• Terms on histories
  • Prefix relation ⊑ on histories
  • glb of histories (largest common prefix, intersection)
  • lub of histories (smallest common extension, union)
  • h and h' compatible iff there exists g: h ⊑ g, h' ⊑ g
• Definition of Generalized Consensus
  • Consistency: every two learned histories are compatible.
  • Nontriviality: if a history is chosen, then all contained commands have been proposed.
  • Conservatism: if history h is learned, then h was chosen.
  • Progress: if command c is proposed, eventually a history containing c is learned.
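The terms on histories can be made concrete with a small model. The sketch below is an illustration under stated assumptions, not the representation used by the authors: a history is stored as one representative sequence of unique commands, and a hypothetical `conflicts` predicate says which pairs do not commute. Two sequences then denote the same history exactly when every command has the same set of conflicting predecessors in both.

```python
def conflict_preds(history, conflicts):
    """Map each command to the set of earlier commands it conflicts with."""
    return {
        c: frozenset(d for d in history[:i] if conflicts(c, d))
        for i, c in enumerate(history)
    }


def equivalent(h1, h2, conflicts):
    # Same commands, and every conflicting pair is ordered the same way.
    return conflict_preds(h1, conflicts) == conflict_preds(h2, conflicts)


def is_prefix(h, g, conflicts):
    # h is a prefix of g iff g can be reordered (swapping only commuting
    # commands) so that it starts with h.
    ph, pg = conflict_preds(h, conflicts), conflict_preds(g, conflicts)
    return all(c in pg and ph[c] == pg[c] for c in ph)


def compatible(h1, h2, conflicts):
    p1, p2 = conflict_preds(h1, conflicts), conflict_preds(h2, conflicts)
    if any(p1[c] != p2[c] for c in p1.keys() & p2.keys()):
        return False  # a shared command sees different conflicting predecessors
    only1 = [c for c in h1 if c not in p2]
    only2 = [c for c in h2 if c not in p1]
    return not any(conflicts(c, d) for c in only1 for d in only2)


def lub(h1, h2, conflicts):
    """Smallest common extension; only defined for compatible histories."""
    if not compatible(h1, h2, conflicts):
        raise ValueError("incompatible histories have no common extension")
    return list(h1) + [c for c in h2 if c not in h1]


# Assumed banking example: deposits commute, a withdraw conflicts with everything.
def conflicts(a, b):
    return "withdraw" in (a[0], b[0])


d1, d2, w1 = ("deposit", "d1"), ("deposit", "d2"), ("withdraw", "w1")
```

For example, `[d1, d2]` and `[d2, d1]` are the same history, `[d2]` is a prefix of `[d1, d2, w1]`, and `[d1, w1]` is incompatible with `[d2, w1]` because the withdraw would have to follow different deposits.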


Background on Paxos Family

• Following holds for CP, FP, and HP
  • Clients are proposers and learners
  • Servers are acceptors
    • Cooperate to choose a single command history
  • Acceptors query Ω and elect a leader among them
    • Unique leader needed for progress only
  • Paxos-family protocols operate in rounds
    • Each leader is preassigned a set of round numbers
  • Operation modes
    • Recovery, to change rounds (must ensure consistency)
    • Normal operation
• Quorums of acceptors
  • CP: any two quorums intersect
  • FP: requires larger fast quorums
    • Intersection of a quorum and a fast quorum FQ is larger than n-|FQ|

[Figure: acceptors partitioned into a fast quorum of size |FQ| and the remaining n-|FQ|; the quorum overlap is marked n-|FQ|+1.]
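The two quorum conditions can be checked numerically. The sketch below is a literal reading of the slide rather than the authors' formulation: it assumes quorums are defined purely by size, so the worst-case overlap of a quorum of size q and a fast quorum of size fq is q + fq - n. The function names are made up for illustration.

```python
def quorums_ok(n, q, fq):
    """Check the quorum requirements for n acceptors.

    q  : size of a classic quorum
    fq : size of a fast quorum
    """
    classic_ok = 2 * q > n                 # any two classic quorums intersect
    worst_overlap = q + fq - n             # smallest possible |Q ∩ FQ|
    fast_ok = worst_overlap > n - fq       # overlap larger than n - |FQ|
    return classic_ok and fast_ok


def min_fast_quorum(n, q):
    """Smallest fast quorum size that satisfies quorums_ok, or None."""
    for fq in range(q, n + 1):
        if quorums_ok(n, q, fq):
            return fq
    return None


# Majority classic quorums (q = f + 1) for n = 2f + 1 acceptors.
sizes = {n: min_fast_quorum(n, n // 2 + 1) for n in (3, 5, 7)}
print(sizes)
```

Under these assumptions the fast quorum must contain all 3 of 3, 4 of 5, or 6 of 7 acceptors, which is why a single slow or crashed acceptor affects FP more than CP.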


CP and FP Message Patterns

[Figure: message sequence diagrams between client (cl), leader (ld) and acceptors (acc).
  • Recovery (all protocols): Phase 1 (1a, 1b), Phase 2 (2a, 2b)
  • Normal Operation of CP: propose, 2a, 2b
  • Normal Operation of FP: fast mode (propose, 2bfast), then recovery from collision (1a, 1b, 2a, 2b)
  • Further label: chosen]


Ideas behind Message Patterns

• Normal Operation CP
  • Client sends proposal (command) to leader
  • Leader appends command to history and sends history to acceptors (2a)
  • Acceptors accept history as local history
  • Acceptors send history back to client (2b)
• Normal Operation FP
  • Client sends proposal to acceptors
  • Acceptors append commands to local fast history (optimistic)
  • Acceptors send history back to client (and leader) (2bfast)
  • Collision recovery triggered by leader
• Recovery (to start a new round)
  • Phase 1: initialized by new leader (1a)
  • Acceptors send local histories to leader (1b)
  • Leader determines chosen history
  • Phase 2: leader synchronizes acceptors to chosen history (2a)
  • Reply to clients (2b)

Callout: Core of protocol
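The normal-operation steps above can be sketched as message handlers. This is a toy, in-memory illustration, not the authors' implementation: the class names, the tuple message format, and the use of plain strings as commands are assumptions, and rounds, recovery and networking are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Leader:
    history: list = field(default_factory=list)

    def on_propose(self, command):
        # CP: the leader orders commands by appending to its history (2a).
        self.history.append(command)
        return ("2a", tuple(self.history))


@dataclass
class ClassicAcceptor:
    name: str
    history: tuple = ()

    def on_2a(self, history):
        # CP: accept the leader's history as the local history, reply 2b.
        self.history = tuple(history)
        return ("2b", self.name, self.history)


@dataclass
class FastAcceptor:
    name: str
    fast_history: list = field(default_factory=list)

    def on_propose(self, command):
        # FP: optimistically append in arrival order, reply 2bfast.
        self.fast_history.append(command)
        return ("2bfast", self.name, tuple(self.fast_history))


# CP: every acceptor echoes the single order fixed by the leader.
leader = Leader()
_, ordered = leader.on_propose("withdraw(5)")
cp_replies = [ClassicAcceptor(f"a{i}").on_2a(ordered) for i in range(3)]

# FP: two non-commuting commands reach two acceptors in different orders.
a0, a1 = FastAcceptor("a0"), FastAcceptor("a1")
a0.on_propose("w1")
r0 = a0.on_propose("w2")
a1.on_propose("w2")
r1 = a1.on_propose("w1")
collision = r0[2] != r1[2]
```

The last lines show where collisions come from: without a leader fixing the order, acceptors can end up with different fast histories for commands that do not commute.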


Combining the two protocols

[Figure: HP message pattern between cl, ld and acc, overlaying the CP pattern (propose, 2a, 2b) and the FP pattern (propose, 2bfast) until a history is chosen. Panel labels: CP, FP, HP.]

• Execute CP and FP pattern in parallel
  • CP with additional FP mode
  • Acceptors locally maintain fast and classic history
    • History from ld as classic history
    • Commands from cl appended to fast history
• No naïve combination
  • Clients learn either by receiving
    • Quorum of equal 2b messages (learn)
    • Fast quorum of equal 2bfast messages and one 2b message (hybrid learn)

Callout: Needed also in FP for speculative execution
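The two learning conditions can be sketched as a check a client runs over the replies it has collected. This is a hedged illustration: the message tuples and the `try_learn` name are made up, "equal" messages are modelled as equal digest strings, and the single 2b message required for hybrid learning is assumed to carry the same digest as the fast quorum, which is one reading of the slide.

```python
from collections import defaultdict


def try_learn(messages, quorum, fast_quorum):
    """Decide whether a client can learn from the replies seen so far.

    messages: iterable of (kind, acceptor, digest), kind is "2b" or "2bfast".
    Returns ("learn", digest), ("hybrid learn", digest) or None.
    """
    classic = defaultdict(set)  # digest -> acceptors that sent it in a 2b
    fast = defaultdict(set)     # digest -> acceptors that sent it in a 2bfast
    for kind, acceptor, digest in messages:
        (classic if kind == "2b" else fast)[digest].add(acceptor)

    # Classic learning: a quorum of equal 2b messages.
    for digest, senders in classic.items():
        if len(senders) >= quorum:
            return ("learn", digest)

    # Hybrid learning: a fast quorum of equal 2bfast messages plus one 2b.
    for digest, senders in fast.items():
        if len(senders) >= fast_quorum and classic.get(digest):
            return ("hybrid learn", digest)
    return None


# n = 5 acceptors, quorum 3, fast quorum 4 (assumed sizes).
fast_only = [("2bfast", f"a{i}", "h1") for i in range(4)]
with_one_2b = fast_only + [("2b", "a0", "h1")]
classic_only = [("2b", f"a{i}", "h1") for i in range(3)]

print(try_learn(fast_only, 3, 4))
print(try_learn(with_one_2b, 3, 4))
print(try_learn(classic_only, 3, 4))
```

A fast quorum alone is not enough in this sketch; the extra 2b message ties the fast history to something the leader has actually sent.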


Hybrid Recovery

• Same message pattern
• Acceptors maintain separate histories
  • Classic history
  • Fast history
• Leader performs CP-like and FP-like recoveries in parallel
  • Determines history fh from FP recovery
  • Determines history h from CP recovery
• Problem: h and fh might be incompatible (no common extension)
  • Determine the largest prefix pfh of fh which is compatible with h
  • Pick lub of pfh and h (smallest common extension)
• Why is this correct (sufficient for Consistency)?
  • To show: any history lh learned by hybrid learn is a prefix of pfh.
  • lh ⊑ fh, and all prefixes of fh compatible with h are prefixes of pfh
  • Sufficient to show: lh compatible with h
  • By hybrid learning: some acceptor holds lh as classic history
  • lh and h have been sent by leader → lh and h are compatible

Callout: Neither h nor fh is sufficient. Goal: lub of h and fh
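The step "largest prefix pfh of fh compatible with h, then lub of pfh and h" can be sketched as follows. This is an illustrative model only, not the authors' algorithm: histories are sequences of unique commands, `conflicts` is a hypothetical predicate for non-commuting pairs, and pfh is built greedily by walking fh in order and keeping a command only if everything it depends on was kept and it does not clash with h.

```python
def conflict_preds(history, conflicts):
    """Map each command to the set of earlier commands it conflicts with."""
    return {
        c: frozenset(d for d in history[:i] if conflicts(c, d))
        for i, c in enumerate(history)
    }


def largest_compatible_prefix(fh, h, conflicts):
    p_fh, p_h = conflict_preds(fh, conflicts), conflict_preds(h, conflicts)
    kept = []
    for c in fh:
        if not p_fh[c] <= set(kept):
            continue  # a conflicting predecessor was dropped, so c must go too
        if c in p_h:
            # Shared command: must follow the same conflicting commands in both.
            ok = p_h[c] == p_fh[c]
        else:
            # New command: must not conflict with anything in h that is not
            # already part of the kept prefix.
            ok = all(d in kept for d in h if conflicts(c, d))
        if ok:
            kept.append(c)
    return kept


def hybrid_recover(h, fh, conflicts):
    """Return the lub of h and pfh as one representative sequence."""
    pfh = largest_compatible_prefix(fh, h, conflicts)
    return list(h) + [c for c in pfh if c not in h]


# Assumed banking example: deposits commute, a withdraw conflicts with everything.
def conflicts(a, b):
    return "withdraw" in (a[0], b[0])


d1, d2, d3 = ("deposit", "d1"), ("deposit", "d2"), ("deposit", "d3")
w1 = ("withdraw", "w1")

# fh ordered d2 before w1, h did not: d2 and everything after it is dropped.
clash = hybrid_recover([d1, w1], [d2, d1, w1], conflicts)

# Only commuting deposits: all of fh survives and is merged into h.
merged = hybrid_recover([d1], [d2, d1, d3], conflicts)
```

In the first case the recovered history is just h, in the second it is h extended by the extra deposits from the fast history.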


Implementation Optimization

• Optimization 1 (msg complexity)
  • Leader does not send entire history to acceptors (2a)
  • FIFO channels
• Optimization 2 (execution)
  • Implementing state machine at servers
  • Only leader executes commands (speculatively)
  • Prevents rollbacks at acceptors
  • Clients receive history digests + result
• Optimization 3 (latency)
  • Diverging fast and classic histories during normal mode prevent hybrid learning
  • Periodically acceptors locally align fh to h (as in hybrid recovery)
• Optimization 4 (throughput)
  • FP mode switched off during high load
  • Leader monitors load

Callout: Also true for FP


Evaluation

• Experimental setting
  • Banking system, two operations: deposit and withdraw
  • deposit operations are commutable (Generalized Consensus)
  • Emulab test bed
  • 20 ms link delay between client and servers, 100 Mbps
  • Topology similar to the "Europe" topology from the beginning of the presentation
  • Servers: 600 MHz PC, Fedora 6


Latency

[Figure: Latency of HP with varying withdraw rate (= probability of collisions)]
[Figure: Latency vs. throughput (with and without batching)]


Throughput

[Figure: Throughput with increasing number of clients]
[Figure: Throughput with increasing f]


Related Work

• [Lamport: ACM TOCS 1998] The Part-Time Parliament
• [Lamport: Dist. Comp. 2006] Fast Paxos
• [Lamport: TR2005] Generalized Consensus and Paxos
• [Dobre, Suri: DSN2006] One-step Consensus with Zero-degradation
• [Charron-Bost, Schiper: PRDC2006] Improving Fast Paxos: Being Optimal with no Overhead
  • Minimum latency of FP and CP only in failure-free runs
• [Camargos, Schmidt, Pedone: NCA2008] Multicoordinated Agreement Protocols for Higher Availability
  • Improved availability of CP by multiple leaders; collision resolution required
• [Zielinski: DISC2005] Optimistic Generic Broadcast
  • Parallel execution of CP and FP; not resilience optimal; quadratic msg complexity
• [Mao, Junqueira, Marzullo: OSDI2008] Mencius: Building Efficient Replicated State Machines for WANs
  • Based on CP; partitions consensus instances among several leaders (throughput)
  • Each client has a LAN connection to one leader (latency)
  • Perfect failure detector needed


Discussion

• Comparison to CP
  • Implements CP
  • Never worse than CP
  • FP mode switched off when leader is highly loaded
• Comparison to FP
  • HP and FP need 2 msg delays in absence of collisions
  • HP needs 3, FP needs 6 msg delays in presence of collisions
  • Experiments: collision rate grows faster than server utilization rate
    • Servers underutilized when hybrid learning rate below 50%
    • FP would spend >50% of the time recovering from collisions
• Optimizations
  • Batching possible
  • Increasing throughput by an order of magnitude
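The message-delay figures in the FP comparison can be turned into a rough expected-latency estimate. This is a back-of-the-envelope sketch, not a result from the talk: it assumes each command independently hits a collision with probability p and counts only message delays, ignoring load and actual link delays. The function name is made up.

```python
def expected_delays(p_collision, no_collision, on_collision):
    """Expected number of message delays per command."""
    return (1 - p_collision) * no_collision + p_collision * on_collision


p = 0.5  # assumed collision probability
hp = expected_delays(p, no_collision=2, on_collision=3)  # HP: 2 or 3 delays
fp = expected_delays(p, no_collision=2, on_collision=6)  # FP: 2 or 2 + 4 delays
print(hp, fp)
```

Under this model the expected cost is 2 + p message delays for HP and 2 + 4p for FP, so the gap between the two widens as collisions become more frequent.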


Summary

• HP: Hybrid Paxos
  • Idea: add fast learning to Paxos
• Generalized Consensus protocol
  • First protocol with 2 msg delays in absence of collisions and 3 msg delays otherwise
  • Optimal latency, resilience and number of messages
• Generalized Consensus is a practical approach for WAN replication
  • HP can outperform state-of-the-art protocols

HP is a Generalized Consensus protocol that features minimal latency and maximum throughput in most situations!


Thank you for your attention!

Questions?