Paxos Building Reliable System 2012-11 @drdrxp

Page 1: Paxos building-reliable-system

Paxos: Building Reliable System

2012-11 @drdrxp

Page 2: Paxos building-reliable-system

Background

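Distributed system:

Many machines working together to get one thing done

Most problems in distributed systems eventually come down to consistency

Paxos: the core algorithm of distributed systems: reaching agreement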

Page 3: Paxos building-reliable-system

Agenda

The problem

Basic replication algorithms and their shortcomings

The Paxos algorithm and how it works

Paxos optimizations

Paxos Demo

Page 4: Paxos building-reliable-system

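Problem

Disks: annual failure rate 4%

Servers: downtime 0.1%

Network: packet loss between remote IDCs < 30%

IDC failures: there are always a few

A storage system must be reliable: 99.99999999%

How do we get there?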

Page 5: Paxos building-reliable-system

Solution

Write multiple replicas, so that losing 1 (or n) of them does not lose data

2 replicas: ~0.003952%

3 replicas: < 0.000001%

Reference: http://weibo.com/1539471463/yAxrGkW1w

Page 6: Paxos building-reliable-system

Solution.

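How to write

Besides the number of copies, what else do we care about?

Reliability, availability

Integrity, atomicity, consistency

Transactionality ...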

Page 7: Paxos building-reliable-system

Fundamental Algorithms

Master-Slave Async

Master-Slave Sync

Master-Slave Semi-Sync

Quorum Write

Page 8: Paxos building-reliable-system

Master-Slave Async.

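The classic MySQL HA scheme

The write is sent to the Master

The Master writes to disk

The Master replies OK

The Master replicates to the Slaves

The disk may fail before replication completes

Result: unreliable

[Figure: timeline across Client, Master, Slave.1, Slave.2, with a disk failure]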

Page 9: Paxos building-reliable-system

Master-Slave Sync.

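The write is sent to the Master

The Master is responsible for replicating to the Slaves

The write blocks

until all Slaves reply OK

A single node failure makes the whole system unwritable

Result: reliable

Result: low availability

[Figure: timeline across Client, Master, Slave.1, Slave.2]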

Page 10: Paxos building-reliable-system

Master-Slave Semi-Sync

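The write is sent to the Master

The Master is responsible for replicating to the Slaves

The write blocks

until [1, n) Slaves reply OK

Result: high reliability, high availability

Result: data may be incomplete

Who is the next Master? --> Quorum Write

[Figure: timeline across Client, Master, Slave.1, Slave.2]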
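As a rough illustration of the three Master-Slave variants, the sketch below counts how many copies exist at the moment the client receives OK. The function name `copies_at_ack`, the `min_acks` parameter, and the model itself are assumptions made for this example; they do not come from the slides.

```python
def copies_at_ack(mode, slaves_up, min_acks=1):
    """Copies that exist when the client gets OK, or None if the write blocks.

    mode      -- "async", "sync" or "semi-sync"
    slaves_up -- list of booleans, one per Slave (True = reachable)
    min_acks  -- Slave acks required in semi-sync mode (assumed parameter)
    """
    up = sum(slaves_up)
    if mode == "async":
        # The Master replies right after its own disk write.
        return 1
    if mode == "sync":
        # Every Slave must reply; one dead Slave blocks the write.
        return 1 + len(slaves_up) if up == len(slaves_up) else None
    if mode == "semi-sync":
        # Only min_acks Slaves must reply.
        return 1 + min_acks if up >= min_acks else None
    raise ValueError(f"unknown mode: {mode}")


one_slave_down = [True, False]
print(copies_at_ack("async", one_slave_down))      # 1
print(copies_at_ack("sync", one_slave_down))       # None
print(copies_at_ack("semi-sync", one_slave_down))  # 2
```

With one Slave down, async acknowledges with a single copy, sync cannot acknowledge at all, and semi-sync acknowledges with two copies.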

Page 11: Paxos building-reliable-system

Quorum Write

Dynamo / Cassandra

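Only requires the write to succeed on a majority of nodes

No Master needed

Used together with Quorum Read

W + R > N

N = 2*F+1, where F is the number of node failures tolerated

2 + 2 > 3 (W = 2, R = 2, N = 3) fits most cases

[Figure: timeline across Client, Node.1, Node.2, Node.3]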
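The point of W + R > N is that every read set shares at least one node with every write set. A brute-force check can make that concrete; the helper name `quorums_always_overlap` is made up for this sketch.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """True if every W-node write set shares a node with every R-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(quorums_always_overlap(3, 2, 2))  # True:  2 + 2 > 3
print(quorums_always_overlap(3, 1, 2))  # False: 1 + 2 = 3
```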

Page 12: Paxos building-reliable-system

Quorum Write: Last-Win

A timestamp is usually used to decide the overwrite order

A later write overwrites an earlier one

[Figure: timeline across Client, Node.1, Node.2, Node.3]
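A minimal sketch of Last-Win on three nodes, assuming each node stores a `(timestamp, value)` pair. The function names and the in-memory list standing in for the nodes are illustrative only.

```python
def quorum_write(nodes, targets, ts, value):
    """Store (ts, value) on the target nodes; a node keeps only the newest timestamp."""
    for i in targets:
        if nodes[i] is None or ts > nodes[i][0]:
            nodes[i] = (ts, value)


def quorum_read(nodes, targets):
    """Return the value with the largest timestamp among the target nodes."""
    seen = [nodes[i] for i in targets if nodes[i] is not None]
    return max(seen, key=lambda pair: pair[0])[1] if seen else None


nodes = [None, None, None]
quorum_write(nodes, [0, 1], ts=1, value="a")
quorum_write(nodes, [1, 2], ts=2, value="b")
quorum_write(nodes, [2], ts=1, value="a")  # late, older write: ignored

print(quorum_read(nodes, [0, 1]))  # b
print(quorum_read(nodes, [0, 2]))  # b
```

Node 0 still holds the older value, but any two-node read also touches a node that holds the newer one.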

Page 13: Paxos building-reliable-system

Quorum Write..

Result: high reliability

Result: high availability

Result: data is complete

Is that enough?

Page 14: Paxos building-reliable-system

Quorum Write: W + R > N

Consistency: Immediate :) // depends on "R"..

Transactionality: Non-Atomic-Update :( Dirty-Read :( Lost-Update :(

http://en.wikipedia.org/wiki/Concurrency_control

Page 15: Paxos building-reliable-system

Imaginary Storage Service

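Build a 3-node store on top of the Quorum RW mechanism

Show where Quorum RW falls short and how Paxos solves it

Function: stores a single variable "i"

Each version of "i" corresponds to one record: i1, i2, i3..

Commands:

set <n> sets i to a value

inc <n> adds <n> to the value of i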

Page 16: Paxos building-reliable-system

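Imaginary Storage Service.

"set" maps directly to a Quorum Write

"inc" is the simplest transactional operation:

1. Quorum Read the value of the latest version of i: i1;
2. i2 = i1 + n;
3. Quorum Write to save the next version: i2;

[Figure: client X runs "get i" and reads i1=2, computes i2 = i1 + 1, then runs "set i2=3". Node states, written as value then version: 21 21 00 before the write, 32 21 32 after it.]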

Page 17: Paxos building-reliable-system

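Imaginary Storage Service..

[Figure: X and Y both run "get i" and read i1=2. X computes i2 = i1 + 1 and its "set i2=3" returns OK. Y computes i2 = i1 + 2, and its "set i2=4" must fail. Y executes Quorum RW again, which ends with i3=5. Node states, written as value then version: 21 21 00, then 32 21 32, then 53 21 53.]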
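To see why the second write must fail, the sketch below replays the same race on a plain Quorum RW store that lets both writes through. The `(version, value)` node layout and all function names are assumptions made for this example.

```python
def quorum_read(nodes, targets):
    """Return (version, value) of the newest version seen on the target nodes."""
    return max(nodes[i] for i in targets)


def quorum_write(nodes, targets, version, value):
    """Plain Quorum Write: overwrite whatever the target nodes hold."""
    for i in targets:
        nodes[i] = (version, value)


def next_version(snapshot, n):
    """Compute the record that "inc n" wants to write."""
    version, value = snapshot
    return version + 1, value + n


# Each node holds (version, value); two nodes already have i1=2.
nodes = [(1, 2), (1, 2), (0, 0)]

x_seen = quorum_read(nodes, [0, 1])  # X reads i1=2
y_seen = quorum_read(nodes, [0, 1])  # Y reads i1=2 as well

quorum_write(nodes, [0, 2], *next_version(x_seen, 1))  # X: set i2=3
quorum_write(nodes, [0, 1], *next_version(y_seen, 2))  # Y: set i2=4, not rejected

final = quorum_read(nodes, [0, 1])
print(final)  # (2, 4)
```

Two increments (+1 and +2) on 2 should give 5, but the store ends at 4: X's update is lost, and node 2 still holds X's conflicting i2=3.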

Page 18: Paxos building-reliable-system

Imaginary Storage Service...

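The condition for inc to run correctly in the example above:

Of several update operations on a given version of i, only one succeeds

Generalized to a general-purpose storage system:

Once the value of a record is determined, it cannot be changed

How do we determine a value?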

Page 19: Paxos building-reliable-system

Determine a Value

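[Figure: X asks a quorum "Any value set?", gets "No", and writes X to two of the three nodes (X X -). Y then asks "Any value set?", gets "Yes", and gives up.]

Before writing a record, run a Quorum Read to check whether a value already exists

// either already determined, or possibly determined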

Page 20: Paxos building-reliable-system

Determine a Value.

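[Figure: X and Y both ask "Any value set?" while the nodes are still empty, and both get "No". Both then write, and the nodes end up holding a mix of X and Y.]

But X and Y may both believe they are allowed to write

Lost Update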

Page 21: Paxos building-reliable-system

Determine a Value..

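[Figure: X asks "Any value set?"; the check is a Quorum Write that makes the nodes "only accept update from X", and the answer is "No". Y then asks "Any value set?"; its Quorum Write makes the nodes "only accept update from Y", and the answer is also "No". After that, the nodes that switched to Y no longer take X's update.]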

Page 22: Paxos building-reliable-system

Determine a Value...

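Following the procedure above, the value of any single record can be determined correctly

Map a record onto each version of "i" and you can build a reliable storage system

Leslie Lamport later wrote this procedure up as a paper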

Page 23: Paxos building-reliable-system

Paxos

Page 24: Paxos building-reliable-system

What is Paxos

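A reliable store (based on Quorum RW) that stores one value

The whole system agrees on one value, which then cannot be changed

One run of a Paxos instance determines one value (2 round trips)

Of several concurrent writes, only one succeeds

Immediate Consistency (Quorum RW)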

Page 25: Paxos building-reliable-system

Paxos

Classic Paxos: 2 rounds per instance

Multi Paxos: ~1 round per instance

Fast Paxos: 1 round per instance (without conflict), 2 rounds per instance (with conflict)

Page 26: Paxos building-reliable-system

Paxos: Precondition

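Requires:

Reliable storage; otherwise it degrades into Byzantine Paxos

Tolerates:

An unreliable network

Process crashes

Out-of-order messages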

Page 27: Paxos building-reliable-system

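Paxos: Concept

Proposer: the initiator of a write

Acceptor: a storage node that accepts writes; n = 2f + 1

Quorum (of acceptors): a majority of the Acceptors

Round: one run of Paxos, made of at least 2 phases: Phase 1 & Phase 2

Round Number (rnd): the unique identifier of each Round;

monotonically increasing; Last-Win;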

Page 28: Paxos building-reliable-system

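Paxos: Concept.

Last Round Number (last_rnd): the largest rnd the Acceptor has received; it identifies the Proposer the Acceptor currently allows to write

Value (v): the value the Acceptor has accepted

Value round number (vrnd): the rnd in which the Acceptor accepted that value

Finally determined value: only when a value has been accepted by a majority (Quorum) of Acceptors

does the Paxos system consider that value determined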

Page 29: Paxos building-reliable-system

Paxos: Classic - phase 1

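[Figure: Phase 1. Proposer X sends rnd=1 to Acceptors 1, 2, 3. The Acceptors reply last_rnd=0, v=nil, vrnd=0. Acceptor states go from - - - to 1, 1, -.]

Acceptor:

Records the rnd sent by the Proposer, meaning it is willing to accept the phase 2 request of this round

Only accepts an rnd greater than last_rnd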
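The Acceptor side of Phase 1 can be sketched as follows. The field names follow the slides (`last_rnd`, `v`, `vrnd`); the class layout, and the choice to reply with the state as it was before the update (as the figure's reply of last_rnd=0 suggests), are assumptions of this sketch.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Acceptor:
    last_rnd: int = 0
    v: Optional[str] = None
    vrnd: int = 0

    def phase1(self, rnd: int) -> Tuple[int, Optional[str], int]:
        """Reply with the current state, then remember rnd if it is newer."""
        reply = (self.last_rnd, self.v, self.vrnd)
        if rnd > self.last_rnd:
            # From now on only phase 2 requests of this round are welcome.
            self.last_rnd = rnd
        return reply


acceptor = Acceptor()
first_reply = acceptor.phase1(1)
print(first_reply, acceptor.last_rnd)  # (0, None, 0) 1
```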

Page 30: Paxos building-reliable-system

Paxos: Classic - phase 1.

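[Figure: Phase 1. Proposer X sends rnd=1; the Acceptors reply last_rnd=0, v=nil, vrnd=0. Acceptor states: 1, 1, -.]

Proposer:

Checks the last_rnd returned by each Acceptor; if a last_rnd is greater than its own rnd, it abandons this round;

Checks all v and vrnd returned by the Acceptors; if every v is empty, the Proposer is free to choose the value to write in Phase 2;

Otherwise it chooses the v with the largest vrnd;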
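The Proposer's decision after Phase 1 fits in a few lines. This is a sketch under the assumption that each reply is a `(last_rnd, v, vrnd)` tuple; the function name `choose_value` and the `(proceed, value)` return shape are invented for illustration.

```python
def choose_value(rnd, replies, my_value):
    """Return (proceed, value) for Phase 2.

    replies -- list of (last_rnd, v, vrnd) tuples from a quorum of Acceptors
    """
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        # Some Acceptor has already seen a newer round: give up this round.
        return False, None
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    if not accepted:
        # Nobody has accepted anything: free to write our own value.
        return True, my_value
    # Otherwise keep the value accepted in the largest round.
    return True, max(accepted, key=lambda pair: pair[0])[1]


print(choose_value(1, [(0, None, 0), (0, None, 0)], "x"))  # (True, 'x')
print(choose_value(3, [(2, "y", 2), (1, "x", 1)], "x"))    # (True, 'y')
print(choose_value(1, [(2, None, 0), (0, None, 0)], "x"))  # (False, None)
```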

Page 31: Paxos building-reliable-system

Paxos: Classic - phase 2

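[Figure: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1, 2, 3; the Acceptors reply Accepted and store v=x, vrnd=1. Acceptor states go from 1, 1, - to 1,x1 1,x1 -.]

Proposer: writes the v it has decided on to a Quorum of Acceptors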

Page 32: Paxos building-reliable-system

Paxos: Classic - phase 2.

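[Figure: Phase 2. Proposer X sends v="x", rnd=1 to Acceptors 1, 2, 3; the Acceptors reply Accepted and store v=x, vrnd=1. Acceptor states go from 1, 1, - to 1,x1 1,x1 -.]

Acceptor: only accepts a Phase 2 request whose rnd equals its last_rnd

last_rnd==rnd guarantees that no other Proposer stepped in between the two phases

Otherwise (the Acceptor has accepted a higher rnd) the write fails; the Proposer then raises its rnd and runs again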
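A sketch of the Acceptor's Phase 2 check, assuming the Acceptor state is kept in a plain dict with the slide's field names; `handle_phase2` is a hypothetical helper.

```python
def handle_phase2(acceptor, rnd, v):
    """Accept v only if no newer round has been seen since this round's Phase 1."""
    if rnd != acceptor["last_rnd"]:
        return False  # another Proposer stepped in; the caller retries with a higher rnd
    acceptor["v"], acceptor["vrnd"] = v, rnd
    return True


fresh = {"last_rnd": 1, "v": None, "vrnd": 0}
moved_on = {"last_rnd": 2, "v": None, "vrnd": 0}

print(handle_phase2(fresh, 1, "x"), fresh)
# True {'last_rnd': 1, 'v': 'x', 'vrnd': 1}
print(handle_phase2(moved_on, 1, "x"), moved_on)
# False {'last_rnd': 2, 'v': None, 'vrnd': 0}
```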

Page 33: Paxos building-reliable-system

Paxos: Classic without Conflict

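[Figure: Proposer X and Acceptors 1, 2, 3. Phase 1: X sends rnd=1; the Acceptors reply last_rnd=0, v=nil, vrnd=0; states go from - - - to 1, 1, -. Phase 2: X sends v="x", rnd=1; the Acceptors reply Accepted and store v=x, vrnd=1; states become 1,x1 1,x1 -.]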

Page 34: Paxos building-reliable-system

Paxos: Resolve Conflict

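[Figure: timeline with Proposers X (round=1) and Y (round=2). Phase 1 for X: rnd=1; states go from - - - to 1, 1, -. Phase 1 for Y: rnd=2, "OK, forget X"; states become 2, 1, 2. Phase 2: X sends v="x", rnd=1 and fails, leaving 2, 1,x1 2; Y sends v="y", rnd=2 and gets OK, leaving 2,y2 1,x1 2,y2.]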

Page 35: Paxos building-reliable-system

Paxos:Respect Existent Value

[Figure: round=3. States start at 2,y2 1,x1 2,y2. Phase 1: X sends rnd=3 and sees v="y", vrnd=2 and v="x", vrnd=1, so it chooses "y"; states become 3,y2 3,x1 2,y2. Phase 2: X writes v="y", vrnd=3 and gets OK; states become 3,y3 3,y3 3,y3.]
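The conflict and the later round with rnd=3 can be replayed in memory. The sketch below is a toy model, not the deck's implementation: there is no network, each call reaches only the Acceptors listed, and the names `run_phase1`, `run_phase2` and `QUORUM` are assumptions. In this replay the third Acceptor is not contacted in round 3, so it keeps vrnd=2.

```python
class Acceptor:
    def __init__(self):
        self.last_rnd, self.v, self.vrnd = 0, None, 0

    def phase1(self, rnd):
        reply = (self.last_rnd, self.v, self.vrnd)
        if rnd > self.last_rnd:
            self.last_rnd = rnd
        return reply

    def phase2(self, rnd, v):
        if rnd != self.last_rnd:
            return False
        self.v, self.vrnd = v, rnd
        return True


def run_phase1(acceptors, rnd, my_value):
    """Return the value to write in phase 2, or None if the round is abandoned."""
    replies = [a.phase1(rnd) for a in acceptors]
    if any(last_rnd > rnd for last_rnd, _, _ in replies):
        return None
    accepted = [(vrnd, v) for _, v, vrnd in replies if v is not None]
    return max(accepted, key=lambda pair: pair[0])[1] if accepted else my_value


def run_phase2(acceptors, rnd, v):
    """Return how many of the contacted Acceptors accepted the value."""
    return sum(a.phase2(rnd, v) for a in acceptors)


a1, a2, a3 = Acceptor(), Acceptor(), Acceptor()
QUORUM = 2

x_value = run_phase1([a1, a2], 1, "x")     # X: phase 1, rnd=1
y_value = run_phase1([a1, a3], 2, "y")     # Y: phase 1, rnd=2
x_acks = run_phase2([a1, a2], 1, x_value)  # X: a1 has moved on to rnd=2
y_acks = run_phase2([a1, a3], 2, y_value)  # Y: accepted by a quorum

retry_value = run_phase1([a1, a2], 3, "x")  # X retries with rnd=3, must keep "y"
retry_acks = run_phase2([a1, a2], 3, retry_value)

print(x_acks >= QUORUM, y_acks >= QUORUM)  # False True
print(retry_value, retry_acks)             # y 2
```

X wanted to write "x" in round 3, but Phase 1 shows it a value with vrnd=2, so it writes "y" instead and the determined value never changes.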

Page 36: Paxos building-reliable-system

Paxos........

Learner:

The accepted value is eventually sent to the learner; in most cases the client is the learner;

MegaStore: coordinator

Zookeeper: slave nodes

Livelock

Page 37: Paxos building-reliable-system

Multi Paxos

Run Phase 1 for multiple Paxos instances in a single request, then run Phase 2 separately for each instance

A stable writer

Acceptors write in a consistent order

Applications: chubby, zookeeper, megastore, spanner
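A heavily simplified sketch of the idea: one Phase 1 covers every instance, after which each instance costs a single Phase 2 exchange. It assumes a single stable writer starting from empty Acceptors, so it skips re-proposing values found in the Phase 1 replies; the class and variable names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.last_rnd = 0
        self.log = {}  # instance id -> (v, vrnd)

    def phase1(self, rnd):
        """One Phase 1 reply that covers every instance."""
        reply = (self.last_rnd, dict(self.log))
        if rnd > self.last_rnd:
            self.last_rnd = rnd
        return reply

    def phase2(self, rnd, instance, v):
        if rnd != self.last_rnd:
            return False
        self.log[instance] = (v, rnd)
        return True


acceptors = [Acceptor() for _ in range(3)]
round_trips = 0

# Phase 1 once, for all instances.
replies = [a.phase1(1) for a in acceptors]
round_trips += 1

# Phase 2 once per instance.
chosen = []
for instance, value in enumerate(["a", "b", "c"]):
    acks = sum(a.phase2(1, instance, value) for a in acceptors)
    round_trips += 1
    if acks >= 2:
        chosen.append(value)

print(chosen, round_trips)  # ['a', 'b', 'c'] 4
```

Three values cost 4 exchanges here instead of the 6 that three separate Classic Paxos runs would need, which is where "~1 round per instance" comes from as the number of instances grows.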

Page 38: Paxos building-reliable-system

Fast Paxos

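Multiple Proposers go straight to Phase 2; in Fast Paxos rnd=0;

An Acceptor only accepts when its v is nil

When several concurrent Fast writes conflict:

generate a new rnd>0 and run Classic Paxos to resolve the conflict

As cheap as Quorum Write?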

Page 39: Paxos building-reliable-system

Fast Paxos Quorum

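[Figure: five Acceptors, initially empty. X runs fast rnd=0 phase 2 and gets OK, with x stored as 0,x0. Y runs fast rnd=0 phase 2, reaches only 2/5 and fails; from Y's point of view the state of some Acceptors is unknown (?).]

Has X0 already been accepted by the system?

When Y can reach only half of the Acceptors, it must still be able to tell whether X0 was accepted. So the criterion for X0 being accepted becomes: among the half of the Acceptors that Y reached, more than half must also have accepted X0.

Therefore the Acceptors that accept X0 must number > 3/4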
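The "> 3/4" bound can be checked by brute force for small clusters. The sketch assumes Y reaches exactly a bare majority (n // 2 + 1 Acceptors) and asks how many Acceptors must hold X0 so that every such majority contains a strict majority of X0 holders; the function names are made up for this example.

```python
from itertools import combinations


def can_always_confirm(n, accepted_count):
    """True if every bare majority contains a strict majority that accepted X0.

    By symmetry it is enough to let Acceptors 0..accepted_count-1 hold X0.
    """
    majority = n // 2 + 1
    accepted = set(range(accepted_count))
    return all(
        2 * len(accepted & set(contacted)) > majority
        for contacted in combinations(range(n), majority)
    )


def smallest_fast_quorum(n):
    """Smallest number of X0 holders that always lets Y confirm X0."""
    return next(q for q in range(1, n + 1) if can_always_confirm(n, q))


for n in (3, 5, 7):
    # cluster size, brute-force result, smallest integer above 3n/4
    print(n, smallest_fast_quorum(n), 3 * n // 4 + 1)
```

For 3, 5 and 7 Acceptors the brute-force result matches the smallest integer above 3n/4 (3, 4 and 6), which is why a 5-Acceptor group needs 4 of 5.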

Page 40: Paxos building-reliable-system

Fast Paxos Quorum.

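A Quorum of > 3/4 of the Acceptors demands higher system availability

If Classic can work 99.99999999% of the time,
Fast can only work 99.999% of the time

Fast Paxos needs 5 Acceptors in each Paxos Group

Multi Datacenter Consistency needs 5 IDCs

Fast Paxos deployed on 3 IDCs: every write must be able to reach every single IDC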

Page 41: Paxos building-reliable-system

Fast Paxos 4/5 Y Conflict

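[Figure: five Acceptors, initially empty. X: fast rnd=0, phase 2, OK; four Acceptors hold 0,x0 and one is empty. Y: fast rnd=0, phase 2, 1/5, Fail; states 0,y0 0,x0 0,x0 0,x0 0,x0. Y: classic rnd=2, phase 1, OK, sees "x"; states 2,y0 0,x0 0,x0 2,x0 2,x0. Y: phase 2, OK, writes "x"; states 2,x2 0,x0 0,x0 2,x2 2,x2.]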

Page 42: Paxos building-reliable-system

Fast Paxos 4/5 XY conflict

[Figure: five Acceptors, initially empty. X and Y both run fast rnd=0 phase 2 and both hit a Conflict; states 0,x0 0,x0 0,x0 0,y0 0,y0. X: classic rnd=1, phase 1, OK, sees only "x"; states 1,x0 1,x0 1,x0 0,y0 0,y0. Y: classic rnd=2, phase 1, OK, sees only "y"; states 1,x0 1,x0 2,y0 2,y0 2,x0. Y: phase 2; states 2,y2 2,y2 2,y2 2,y2 2,y2. X fails in phase 2.]
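The fast-round conflict itself is easy to reproduce. This sketch models only the rnd=0 rule ("accept when v is nil") with 5 Acceptors and the 4-of-5 fast quorum; the list-of-values model and the name `fast_phase2` are assumptions, and the Classic Paxos recovery that follows is not modeled.

```python
FAST_QUORUM = 4  # more than 3/4 of 5 Acceptors


def fast_phase2(acceptors, targets, v):
    """Fast round (rnd=0): an Acceptor takes v only if it holds no value yet."""
    acks = 0
    for i in targets:
        if acceptors[i] is None:
            acceptors[i] = v
            acks += 1
    return acks


acceptors = [None] * 5
x_acks = fast_phase2(acceptors, [0, 1, 2], "x")        # X reaches three Acceptors first
y_acks = fast_phase2(acceptors, [0, 1, 2, 3, 4], "y")  # Y only wins the two empty ones

conflict = x_acks < FAST_QUORUM and y_acks < FAST_QUORUM
print(acceptors, x_acks, y_acks, conflict)
# ['x', 'x', 'x', 'y', 'y'] 3 2 True
```

Neither Proposer reaches 4 of 5, so both have to fall back to a Classic round with rnd>0.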

Page 43: Paxos building-reliable-system

Multi Paxos in Zookeeper

[Figure: a Leader and Slaves, set up by Paxos Leader Election. X sends write "x". Phase 2: accept "x"; the nodes store x and reply OK. Phase 3: learn "x" and make "x" visible; the nodes reply OK. X receives OK.]

Page 44: Paxos building-reliable-system

Talk is cheap. Show me the code

A Paxos storage system implementation:

3 nodes

Commands:

set my-id varname version value
get varname
inc my-id varname n

https://github.com/drmingdrmer/pypaxos

Page 45: Paxos building-reliable-system

Demo

set Bob i 1 10 # OK
set Bob i 1 20 # No change
set Alice i 1 10 # No change

inc Bob i 1 # 11
inc Alice i 1 # 12

Page 46: Paxos building-reliable-system

Demo.

#Shutdown a Paxos member..

inc Alice i 1 # 13

#Shutdown two Paxos members..

inc Alice i 1 # Failure

Page 47: Paxos building-reliable-system

Demo..

Concurrency

./example.py inc Alice i 1

./example.py inc Bob i 1

Page 48: Paxos building-reliable-system

Note

In Phase 2, an Acceptor can actually accept any request whose rnd is greater than or equal to its own last_rnd;
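A sketch of that relaxed check, with the Acceptor state kept in a dict. Raising `last_rnd` to the accepted rnd is an assumption of this sketch (it follows the usual Paxos formulation); the note itself only states the comparison.

```python
def phase2_relaxed(acceptor, rnd, v):
    """Accept any Phase 2 request that is not older than the last round seen."""
    if rnd < acceptor["last_rnd"]:
        return False
    # Assumption: accepting a newer round also moves last_rnd forward.
    acceptor.update(last_rnd=rnd, v=v, vrnd=rnd)
    return True


acceptor = {"last_rnd": 2, "v": "y", "vrnd": 2}
print(phase2_relaxed(acceptor, 1, "x"))  # False
print(phase2_relaxed(acceptor, 3, "y"))  # True
print(acceptor)  # {'last_rnd': 3, 'v': 'y', 'vrnd': 3}
```

Under this rule an Acceptor that missed Phase 1 of round 3 can still accept that round's Phase 2 request.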