Akka Cluster and Auto-scaling Ikuo Matsumura CyberAgent, Inc. 2017/02/26

Page 1: Akka Cluster and Auto-scaling

Akka Cluster and Auto-scaling

Ikuo Matsumura, CyberAgent, Inc.

2017/02/26

Page 2: Akka Cluster and Auto-scaling

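Akka Cluster

An Akka extension for building a decentralized group of nodes. This talk covers the pitfalls we hit and the lessons we learned over about 10 months of operation.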

• Decentralized cluster membership service

• no single point of failure, bottleneck

• distribute actors over multiple JVMs

• Used to build a sub-system for ad serving

• Tens of servers

• In operation for about 10 months

Page 3: Akka Cluster and Auto-scaling

Requirements in our case

• Host a large number of entities at low cost

• Fit into the existing Akka application

• Downtime is acceptable to some extent*

We want to deploy a large number of entities at low cost on top of Akka. Some downtime is acceptable.

* online machine learning

Page 4: Akka Cluster and Auto-scaling

Our application of Akka Cluster

Persistent actors are deployed with Cluster Sharding, and commodity services are used for data storage.

[Diagram: frontend nodes (existing app, tens of nodes, auto-scaling); entity nodes (new sub-system, several nodes); data stores: ElastiCache (Journal) and S3 (Snapshot).]

Page 5: Akka Cluster and Auto-scaling

Challenges

• Strategy for removing unreachable members

• Lifecycle of journals

I will talk about two challenges we stumbled over in operation.

Page 6: Akka Cluster and Auto-scaling

Membership Lifecycle in Cluster Specification

An overview of the cluster member lifecycle.

http://doc.akka.io/docs/akka/2.4/common/cluster.html#Membership_Lifecycle

[State diagram. States: joining, up, leaving, exiting, removed, down, unreachable. Labeled transitions: join, leave.]

Page 7: Akka Cluster and Auto-scaling

Joining and Leader Action

A member can communicate with the other members only after a “leader action”.

[State diagram: joining, up, down, unreachable. Highlighted: join.]

Page 8: Akka Cluster and Auto-scaling

Joining and Leader Action

A member can communicate with the other members only after a “leader action”.

[State diagram: joining, up, down, unreachable. Highlighted: leader action.]

Page 9: Akka Cluster and Auto-scaling

Joining and Leader Action

When a scale-in happens, the terminated member stays unreachable.

[State diagram: joining, up, down, unreachable. Highlighted: failure detector, leader action.]

Page 10: Akka Cluster and Auto-scaling

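Joining and Leader Action

Leader actions can no longer be performed. As a result, new members remain unable to communicate with the other members.

[State diagram: joining, up, down, unreachable. Highlighted: leader action.]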

Page 11: Akka Cluster and Auto-scaling

Joining and Leader Action

The member has to be marked as down, triggered by the scale-in.

[State diagram: joining, up, down, unreachable. Highlighted: leader action; scale-in trigger; mark as down.]

Page 12: Akka Cluster and Auto-scaling

Joining and Leader Action

Once the unreachable member is removed, leader actions can resume.

[State diagram: joining, up, down, unreachable. Highlighted: leader action.]

Page 13: Akka Cluster and Auto-scaling

Leader actions blocked by unreachables

Example log output when leader actions cannot be performed:

Members that are “up” but have not seen the current state

“Leader can currently not perform its duties”
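The blocking behavior on the preceding slides can be sketched as a small toy model. This is not Akka code: the `Member` class, the status strings, and the rule in `leader_can_act` are simplifying assumptions made for illustration. The model treats the leader as blocked while any member is unreachable and not yet marked as down.

```python
from dataclasses import dataclass


@dataclass
class Member:
    name: str
    status: str            # "joining", "up", "down" (toy subset of states)
    reachable: bool = True


def leader_can_act(members):
    """Assumed rule: blocked while any member is unreachable and not down."""
    return all(m.reachable or m.status == "down" for m in members)


def leader_action(members):
    """Promote joining members to up, but only if the leader is not blocked."""
    if not leader_can_act(members):
        return False
    for m in members:
        if m.status == "joining":
            m.status = "up"
    return True


cluster = [Member("a", "up"), Member("b", "up"), Member("new", "joining")]
cluster[1].reachable = False            # "b" was terminated by a scale-in
blocked = not leader_action(cluster)    # "new" stays in joining
cluster[1].status = "down"              # external trigger marks "b" as down
resumed = leader_action(cluster)        # "new" is promoted to up
```

In this model, nothing inside the cluster ever changes the status of "b", which is why an external trigger is needed.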

Page 14: Akka Cluster and Auto-scaling

Causes and actions on unreachables

Type of failure (example): possible external action

• Network partition: wait for recovery, or abandon a part

• Machine crash (scale-in): mark as down

• Quarantined in the Akka remote layer: restart the actor system

• Unresponsive process (long GC): restart the JVM

• Unresponsive process (CPU starvation by credit shortage in EC2): re-create the instance

The failure detector cannot tell what caused a failure. Recovery has to be driven from outside the cluster, according to the cause.

Page 15: Akka Cluster and Auto-scaling

Split Brain Resolver* (commercial add-on)

• Mark members as “down” when a part of the cluster becomes unreachable for some time

• Strategies

• Static Quorum

• Keep Majority - default in Lagom

• Keep Oldest

• Keep Referee

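* http://doc.akka.io/docs/akka/rp-current/scala/split-brain-resolver.html

The commercial add-on can mark members as down automatically and fairly comprehensively. It acts when member state and reachability have not changed for a certain period.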
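As an illustration of one strategy, the decision behind Keep Majority can be sketched as follows. This is a toy model, not the add-on's implementation: the function name and arguments are hypothetical, addresses are compared as plain strings, and the tie-break (keep the side containing the lowest address) reflects our understanding of the documented behavior.

```python
def keep_majority(all_members, reachable):
    """Return True if this side of the partition should stay up.

    all_members: addresses of every member known before the partition.
    reachable:   addresses this side can still reach, including itself.
    """
    mine = set(reachable)
    others = set(all_members) - mine
    if len(mine) != len(others):
        return len(mine) > len(others)
    # Tie: keep the side that contains the lowest address (assumed rule).
    return min(all_members) in mine


members = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
majority_side = keep_majority(members, ["10.0.0.1", "10.0.0.2", "10.0.0.3"])
minority_side = keep_majority(members, ["10.0.0.4", "10.0.0.5"])
```

Each side evaluates the same rule on its own view. The side that gets `False` downs itself, so at most one side survives.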

Page 16: Akka Cluster and Auto-scaling

Reset cluster membership (poor man’s)

Have the surviving nodes rejoin a new cluster.

[Diagram: seed(s); old cluster (ddata); new cluster (ddata).]

Page 19: Akka Cluster and Auto-scaling

Caution

• Side effects caused by restarting the app

• ddata is experimental (as of Akka 2.4)

• Use Akka 2.4.8 or higher*

Be careful about the side effects of restarts and about the Akka version.

* Includes a fix for distributed pub-sub (akka#20847)

Page 20: Akka Cluster and Auto-scaling

To keep cluster membership healthy

1. Trigger mark-as-down (or leave) on scale-in

2. Automate restart/recreation of the ActorSystem, JVM, or server instance

3. Set up a fallback mechanism such as a split brain resolver, or rejoining into a new cluster

Summary of countermeasures against unreachable members.

Page 21: Akka Cluster and Auto-scaling

Challenges

• Strategy for removing unreachable members

• Lifecycle of journals

Next, I will talk about the second challenge.

Page 22: Akka Cluster and Auto-scaling

Journal

[Diagram: entity nodes; ElastiCache (Journal); S3 (Snapshot).]

The journal is the API that corresponds to the event store in Event Sourcing. We planned to operate the journal like a cache.

Page 23: Akka Cluster and Auto-scaling

Cleaning up old journals in the Redis plug-in*

deleteMessages on the journal can leave some data behind. Take care to keep sequenceNr consistent with the snapshots.

Key in Redis | Removed on deleteMessages
journal:$persistenceId | Yes
journal:$persistenceId.highestSequenceNr | No

* https://github.com/hootsuite/akka-persistence-redis/blob/master/src/main/scala/com/hootsuite/akka/persistence/redis/journal/RedisJournal.scala

Deleting highestSequenceNr could cause an old version of a snapshot to be loaded. → Keep only the latest snapshot.
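The hazard can be illustrated with a toy model. It is not the plug-in's code and rests on two assumptions: that recovery is offered the snapshot with the highest sequenceNr, and that sequence numbering restarts from zero once the stored highest sequence number is gone. `SnapshotStore` and its methods are hypothetical names.

```python
class SnapshotStore:
    """Toy store: recovery is offered the snapshot with the highest sequenceNr."""

    def __init__(self):
        self.snapshots = {}            # sequenceNr -> state

    def save(self, sequence_nr, state):
        self.snapshots[sequence_nr] = state

    def latest(self):
        nr = max(self.snapshots)
        return nr, self.snapshots[nr]

    def keep_only(self, sequence_nr):
        self.snapshots = {sequence_nr: self.snapshots[sequence_nr]}


store = SnapshotStore()
store.save(100, "state-v1")            # taken when the counter was at 100

highest_sequence_nr = 0                # counter deleted: numbering restarts
highest_sequence_nr += 5               # five new events are persisted
store.save(highest_sequence_nr, "state-v2")

stale = store.latest()                 # the old snapshot wins on sequenceNr

store.keep_only(highest_sequence_nr)   # mitigation: keep only the latest
fresh = store.latest()                 # the newer state is offered
```

Under these assumptions the newer state is shadowed by the older snapshot until the old one is removed, which is why only the latest snapshot is kept.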

Page 24: Akka Cluster and Auto-scaling

Event Sourcing and Ecosystem

“it stores a complete history of the events associated with the aggregates in your domain”

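Reference 3: Introducing Event Sourcing, CQRS Journey [CQJ]

A proper event store is expected to hold the complete history of events. The further you deviate from that, the weaker the support from the ecosystem (plug-ins) becomes.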

Page 25: Akka Cluster and Auto-scaling

Summary

• Lessons learned from devops of an Akka Cluster app

• Strategy for removing unreachable members

• scale-in trigger

• automatic restart/recreation

• fallback mechanism: split brain resolver or rejoining

• Lifecycle of journals

• cost of deviation from Event Sourcing

Put several mechanisms in place for removing unreachable members. Operating a journal like a cache can be harder than expected.

Page 26: Akka Cluster and Auto-scaling

Reference

[CQJ] Exploring CQRS and Event Sourcing, Dominic Betts, Julian Dominguez, Grigori Melnik, Fernando Simonazzi, Mani Subramanian, 2012, https://msdn.microsoft.com/en-us/library/jj554200.aspx

[PSE] Persistence - Schema Evolution, Akka Documentation, http://doc.akka.io/docs/akka/2.4/scala/persistence-schema-evolution.html