
Towards a real-time collaboration system: from OT algorithms to CRDT systems

Hyun-Gul Roh (노현걸), Samsung Electronics S/W R&D Center

Contents
●  Real-time collaboration system
●  Operational Transformation (OT) algorithms
●  Conflict-free Replicated Data Types (CRDT)
○  WOOT/WOOTO/WOOTH
○  Treedoc
○  Logoot
○  RGA
●  Comparative analysis
●  Conclusion: from OT algorithms to CRDT systems

“An introduction to academic research results on real-time collaboration”

Real-time collaboration
●  Research on collaborative editing began in 1989 with Ellis and Gibbs

○  Operational transformation (OT) algorithm (dOPT)

●  Assumptions

○  Variable length document

■  An order is defined among the characters

■  Insert and Delete operations with integer index

○  Optimistic replication

■  Documents are replicated

■  Operations are optimistic: local first, sync later

○  Peer-to-peer (P2P) operation delivery (no client/server)

■  Every user may execute the operations in a different order

○  Eventual consistency

<Figure: P2P operation delivery>

Why is real-time collaboration hard?

●  Operations are related by causality or by concurrency
○  How can these relations be detected?

●  Concurrent operations are executed in different orders at different sites
○  Executing them in the same order (total order) at every site does not by itself solve the problem

●  Each operation carries an intention
○  Integer indices distort the intention of an operation

<Figure: site 1 and site 2>
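The distortion caused by integer indices can be seen in a small sketch. The document "abc", the two operations and the helper `apply_op` are illustrative assumptions, not taken from the slides.

```python
def apply_op(doc, op):
    """Apply ('ins', index, char) or ('del', index) to a string."""
    if op[0] == "ins":
        _, i, ch = op
        return doc[:i] + ch + doc[i:]
    _, i = op
    return doc[:i] + doc[i + 1:]

doc = "abc"
op1 = ("ins", 0, "x")  # site 1 intends to put 'x' in front of 'a'
op2 = ("del", 2)       # site 2 intends to delete 'c'

# Each site runs its own operation first, then the remote one unchanged.
site1 = apply_op(apply_op(doc, op1), op2)
site2 = apply_op(apply_op(doc, op2), op1)

print(site1)  # xac  (index 2 now points at 'b', so 'b' is deleted instead of 'c')
print(site2)  # xab  (the result both users intended)
```

The two sites diverge, and site 1 loses the intention of the delete, even though both applied exactly the same two operations.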

OT algorithms (1/3)
●  Idea: transform the integer indices of concurrent operations so that their intentions are preserved

●  Theory: two transformation properties, TP1 and TP2, must be satisfied
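As a rough illustration of index transformation, the sketch below handles single-character inserts and deletes and checks TP1 on one pair of concurrent operations. The function names, the tuple encoding of operations and the site-id tie-break are assumptions made for this example. It targets TP1 only and makes no attempt at TP2.

```python
def apply_op(doc, op):
    """Apply ('ins', i, ch), ('del', i) or ('nop', i) to a string."""
    kind, i = op[0], op[1]
    if kind == "ins":
        return doc[:i] + op[2] + doc[i:]
    if kind == "del":
        return doc[:i] + doc[i + 1:]
    return doc

def transform(op, other, op_site, other_site):
    """Rewrite `op` so that it can be executed after the concurrent `other`."""
    kind, i = op[0], op[1]
    okind, j = other[0], other[1]
    if kind == "nop" or okind == "nop":
        return op
    if okind == "ins":
        if kind == "ins":
            # Equal positions are ordered by site id (assumed tie-break).
            shift = j < i or (j == i and other_site < op_site)
        else:
            shift = j <= i
        return (kind, i + 1, *op[2:]) if shift else op
    # `other` is a delete
    if kind == "ins":
        return ("ins", i - 1, op[2]) if j < i else op
    if j < i:
        return ("del", i - 1)
    if j == i:
        return ("nop", i)  # both sites deleted the same character
    return op

doc = "abc"
op1, op2 = ("ins", 0, "x"), ("del", 2)  # issued concurrently at sites 1 and 2
site1 = apply_op(apply_op(doc, op1), transform(op2, op1, 2, 1))
site2 = apply_op(apply_op(doc, op2), transform(op1, op2, 1, 2))
print(site1, site2)  # xab xab
```

TP1 is the requirement that these two execution paths end in the same document state.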

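OT algorithms (2/3)
●  How can TP1/TP2 be satisfied?

●  To execute an operation, it must be compared with, and transformed against, the operations already in the history buffer
○  Causality or concurrency ➔ vector clock ➔ scalability problem
○  Remote operation: complexity of O(|H|^2) or worse (H: number of operations in the history buffer)

●  Depending on the algorithm, the operations in the history buffer must also be reordered (transposed)

○  e.g. ABT: inserts and deletes must be grouped separately, and concurrent inserts must be sorted: O(|H|^2)
●  No Update operation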
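The causality/concurrency check with vector clocks can be sketched as follows. Representing a clock as a dict from site id to counter, and the function name `compare`, are assumptions for illustration.

```python
def compare(vc_a, vc_b):
    """Classify two vector clocks: 'before', 'after', 'equal' or 'concurrent'."""
    sites = set(vc_a) | set(vc_b)
    a_le_b = all(vc_a.get(s, 0) <= vc_b.get(s, 0) for s in sites)
    b_le_a = all(vc_b.get(s, 0) <= vc_a.get(s, 0) for s in sites)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a causally precedes b
    if b_le_a:
        return "after"
    return "concurrent"      # neither has seen the other

print(compare({1: 1}, {1: 1, 2: 1}))   # before
print(compare({1: 2}, {1: 1, 2: 1}))   # concurrent
```

Every clock carries one counter per site, so both its size and the cost of each comparison grow with the number of sites, which is the scalability problem noted above.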

OT algorithms (3/3)
●  Evolution of OT algorithms
●  JUPITER

○  Client/server-based OT algorithm
○  A variant is reportedly used in Google Wave/Docs
○  The server serializes the clients' operations
○  No vector clock
○  A subset of the P2P algorithms; is the C/S model sufficient for SaaS services?

CRDT (1/2)
●  Idea:

○  Provide distributed data structures for real-time collaboration

●  CRDT
○  Originally Commutative/Convergent, later Conflict-free Replicated Data Type. Why?
○  Optimistic replication of abstract data types

●  Origin
○  Proposed by Marc Shapiro, Nuno Preguiça, Gérald Oster, Pascal Urso et al.
○  And me.. really?

●  Commutativity
○  In every case, OP1 ➔ OP2 ≡ OP2 ➔ OP1
○  How?

CRDT (2/2)
●  Characteristics

○  Unique IDs instead of integer indices
○  No history buffer
○  Metadata for resolving conflicts is stored together with each object
○  Vector clock free

●  Major CRDTs
○  Linked list: mainly Insert & Delete operations, like OT algorithms

■  WOOT [CSCW’06], Treedoc [ICDCS’09], Logoot [TPDS’10], RGA [JPDC’11]

■  RGA also supports Update
○  Array, Set, Hash table, Trees, and Graph…

■  Array, Set, Hash table, etc. use the last-write-wins technique
■  For trees and graphs, if the children are ordered, the linked-list techniques can be applied; research is ongoing
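Last-write-wins can be sketched as a register that keeps only the write with the largest stamp. The class name `LWWRegister` and the `(timestamp, site id)` stamp are assumptions for illustration. A hash table or array would hold one such register per key or slot.

```python
from dataclasses import dataclass

@dataclass
class LWWRegister:
    value: object = None
    stamp: tuple = (0, 0)  # (timestamp, site id); site id breaks timestamp ties

    def write(self, value, timestamp, site):
        """Apply a local or remote write; older writes are ignored."""
        if (timestamp, site) > self.stamp:
            self.value, self.stamp = value, (timestamp, site)

w1 = ("red", 5, 1)    # written at site 1
w2 = ("blue", 5, 2)   # written concurrently at site 2

a, b = LWWRegister(), LWWRegister()
a.write(*w1); a.write(*w2)   # replica a sees w1 first
b.write(*w2); b.write(*w1)   # replica b sees w2 first
print(a.value, b.value)      # blue blue
```

The metadata stored with the object (here, the stamp) makes the two writes commute, so neither a history buffer nor a transformation step is needed.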

CRDT for linked list

WOOT/WOOTO/WOOTH
●  The first CRDT: collaborative editing WithOut Operational Transformation
●  Each time a new object is inserted, the total order of all objects is computed from the partial order among the objects
○  Computing the total order requires searching the document: O(N) (N: number of objects)

●  Delete uses tombstones

●  WOOTO: optimizes the computation of the total order
●  WOOTH: influenced by RGA, introduces a hash table to improve performance

Insert(‘b’, ID(‘a’), ID(‘c’)); Insert(‘x’, ID(‘a’), ID(‘b’)); Insert(‘y’, ID(‘a’), ID(‘c’)); Insert(‘z’, ID(‘a’), ID(‘c’));

Treedoc
●  Dense index scheme: a new index can always be generated between any two objects

○  Tree paths are used as indices
○  An insert specifies the position at which the object is to be added
○  Concurrent inserts are temporarily given an additional id

●  Tombstone for Delete: garbage collection is difficult for non-leaf nodes
●  Sequential inserts make the indices grow ever longer

Insert(‘b’, 0) Insert(‘c’, 1) Insert(‘d’, 00) Insert(‘e’, 01) Insert(‘f’, 10) Insert(‘g’, 11)
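The way binary tree paths define the document order can be sketched with a sort key that reproduces an in-order (infix) traversal. The helper `path_key` is hypothetical, and so is the assumption that the root (empty path) already holds 'a'. The disambiguators for concurrent inserts are left out.

```python
def path_key(path):
    """Sort key for a binary path: the left subtree sorts before the node, the right after."""
    return tuple(-1 if bit == "0" else 1 for bit in path) + (0,)

# Paths from the inserts above, plus an assumed root 'a' at the empty path.
nodes = {"": "a", "0": "b", "1": "c", "00": "d", "01": "e", "10": "f", "11": "g"}

text = "".join(nodes[p] for p in sorted(nodes, key=path_key))
print(text)  # dbeafcg
```

A path only gets longer as inserts land next to each other, which is why sequential inserts make the indices grow.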

Logoot
●  Dense index scheme

○  UID: (LIST of <integer, site ID>, clock) ○  Site IDs are comparable

●  No tombstone for Delete
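A dense identifier of this shape can be sketched as a tuple of `(integer, site id)` pairs compared lexicographically. The sketch is simplified: the clock component is omitted, the midpoint is used instead of a random digit, and `BASE`, `FILLER_SITE`, `BEGIN`, `END` and `allocate` are all names assumed for this example.

```python
BASE = 2 ** 16       # assumed digit range
FILLER_SITE = -1     # assumed to be smaller than every real site id
BEGIN, END = (), ((BASE, FILLER_SITE),)   # sentinels around the document

def allocate(p, q, site):
    """Return a position strictly between p and q (p < q), tagged with `site`."""
    assert p < q
    new, i, bounded = [], 0, True   # bounded: q still limits the next digit
    while True:
        lo = p[i][0] if i < len(p) else 0
        hi = q[i][0] if bounded and i < len(q) else BASE
        if hi - lo > 1:             # room for a fresh digit at this depth
            new.append(((lo + hi) // 2, site))
            return tuple(new)
        # No room: keep p's element (or a filler) and go one level deeper.
        elem = p[i] if i < len(p) else (0, FILLER_SITE)
        new.append(elem)
        bounded = bounded and i < len(q) and elem == q[i]
        i += 1

x = allocate(BEGIN, END, site=1)
y = allocate(BEGIN, END, site=2)   # concurrent insert at the same place
z = allocate(x, y, site=1)
print(sorted([y, z, x]) == [x, z, y])  # True
```

Because a fresh position can always be found between two neighbours, a deleted object can simply be removed, with no tombstone needed to keep later positions valid.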

Replicated Growable Array (RGA)

●  Unique ID: S4Vector
○  <int ssn, int sid, int sum, int seq>

●  Precedence transitivity
○  The principle used to make operations commutative

■  Precedence must be transitive, taking both causality and concurrency into account

○  S4Vectors implement transitive precedence
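One way to obtain a transitive precedence from the four fields is a lexicographic comparison. The sketch below assumes the field meanings given in the comments and the ordering `(ssn, sum, sid)`, following my reading of the RGA paper. Neither is stated on the slide, so treat both as assumptions.

```python
from typing import NamedTuple

class S4Vector(NamedTuple):
    ssn: int  # session number (assumed meaning)
    sid: int  # site id
    sum: int  # sum of the issuing site's vector clock entries (assumed meaning)
    seq: int  # per-site counter, not used for ordering in this sketch

def precedes(a, b):
    """True if a precedes b: compare session, then clock sum, then site id."""
    return (a.ssn, a.sum, a.sid) < (b.ssn, b.sum, b.sid)

a = S4Vector(ssn=1, sid=1, sum=3, seq=1)
b = S4Vector(ssn=1, sid=2, sum=3, seq=1)   # concurrent with a, tie broken by sid
c = S4Vector(ssn=1, sid=1, sum=4, seq=2)   # issued after seeing a and b
print(precedes(a, b), precedes(b, c), precedes(a, c))  # True True True
```

Under these assumptions an operation that causally follows another always has a larger `sum`, and the lexicographic comparison is a total order, so precedence is transitive across both causal and concurrent operations.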

Replicated Growable Array (RGA)

●  Each object stores two S4Vectors
○  Supports Update/Delete

●  Hash table + linked list
○  Hash table for remote operations: O(1)

●  Tombstone for Delete
○  Garbage collection condition
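The hash table plus linked list layout can be sketched as below. This is a simplified stand-in, not the structure from the paper. Ids are plain `(sum, site id)` tuples rather than S4Vectors, Update and garbage collection are omitted, operations are assumed to arrive in causal order, and the class and method names are invented for this example.

```python
class Node:
    def __init__(self, ident, value, nxt=None):
        self.ident, self.value, self.next = ident, value, nxt
        self.tombstone = False

class RGA:
    def __init__(self):
        self.head = Node(None, None)
        self.index = {None: self.head}   # hash table: id -> node

    def insert(self, left_id, ident, value):
        prev = self.index[left_id]       # O(1) lookup of the reference node
        # Skip nodes whose ids take precedence over the new one.
        while prev.next is not None and prev.next.ident > ident:
            prev = prev.next
        node = Node(ident, value, prev.next)
        prev.next = node
        self.index[ident] = node

    def delete(self, ident):
        self.index[ident].tombstone = True   # kept as a tombstone, not unlinked

    def text(self):
        out, node = [], self.head.next
        while node is not None:
            if not node.tombstone:
                out.append(node.value)
            node = node.next
        return "".join(out)

ins_a = (None, (1, 1), "a")
ins_x = ((1, 1), (2, 1), "x")   # site 1 inserts 'x' after 'a'
ins_y = ((1, 1), (2, 2), "y")   # site 2 concurrently inserts 'y' after 'a'

r1, r2 = RGA(), RGA()
for op in (ins_a, ins_x, ins_y):
    r1.insert(*op)
for op in (ins_a, ins_y, ins_x):
    r2.insert(*op)
r1.delete((1, 1)); r2.delete((1, 1))
print(r1.text(), r2.text())  # yx yx
```

The tombstone for 'a' stays in the list so that later operations referring to its id can still find their reference node.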

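Comparative analysis (1/3)
●  c: average number of concurrent operations
●  n: size of the document (excluding tombstones)
●  N: total number of characters ever inserted (including tombstones)
●  t = N-n: number of tombstones
●  k: size of a Logoot id
●  R: number of replicas/sites
●  H: number of executed operations (in the history buffer)
●  d: ceiling((t+c)/n), the number of tombstones or concurrent objects between consecutive objects
●  In general, H>>N>n>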

Comparative analysis (2/3)
●  From “Evaluating CRDTs for Real-time Document Editing”, DocEng 2011

●  User operations: local operations with copy-paste


Comparative analysis (3/3)
●  From “Evaluating CRDTs for Real-time Document Editing”, DocEng 2011

●  Character operations: remote operations
○  In general, sites execute more remote operations than local ones


Conclusion: from OT algorithms to CRDT systems

●  Algorithm vs. system

○  Algorithm: “a finite set of steps for solving a given problem”

■  Euclid’s algorithm, Dijkstra’s algorithm, OT algorithms, etc.

○  System: “a collection of elements organized to serve a common purpose”

■  Operating system, file system, database system, etc.

○  Towards a data system for real-time collaborative applications...

■  Provides CRDTs for real-time collaboration

■  Provides operation delivery, sessions, garbage collection, etc.

■  Scalability, load balancing

■  Latency guarantees, as in Amazon Dynamo

●  Open problems / questions

○  Is a P2P algorithm necessary?

■  Run the servers in a P2P fashion for scalability & low latency

●  How should operations be delivered? Gossip protocol?

■  How can causality be preserved without vector clocks?

●  Lesson learned

○  The papers are hard.. very hard…

Supplement

Optimistic replication

Tie by Delete