68

Md sal clustering internals

Embed Size (px)

Citation preview

MD-SAL Clustering InternalsMoiz Raja, Technical Lead, Cisco Systems

IRC: moizer

#ODSummit

▪Abhishek Kumar▪Basheeruddin Ahmed▪Colin Dixon▪Harman Singh▪Kamal Rameshan▪Robert Varga▪Tony Tkacik

My Collaborators

3

Tom Pantelis

▪Luis Gomez▪Phillip Shea▪Radhika Hirannaiah▪and many more…

▪Architecture▪Modules▪Flows▪Diagnostics▪Questions

Agenda

#ODSummit

Architecture

Subsystems

6

member-1

member-2

member-3

Distributed Data Store

member-1 member-2

Remote RPC Connector

High Level Architecture

Distributed Data Store Remote RPC Connector

Persistence Remoting Clustering

Actor Systems

8

Distributed Data Store Remote RPC Connector

Actor Hierarchy

Configuration

Dispatchers

Data Synchronization

9

Data storeSynchronized Data TreeRaft for Distributed Consensus

Remote RPCSynchronized RPC RegistryGossip for data distribution

#ODSummit

Distributed Data Store Architecture

Accessing Remote Data

11

Client

member-1 member-2

Location Transparency

12

Client

member-1 member-2D

istrib

uted

Dat

aSto

re

DistributedDataStore

13

DOMStore

DistributedDataStore

Communication

14

Client

member-1 member-2D

istrib

uted

Dat

aSto

re

Shard

Data Distribution

15

Client

member-1D

istrib

uted

Dat

aSto

re

member-2

member-3

topology

inventory

Module Based Shards

16

/

/inventory /topology /toaster

HA

17

member-2

member-3

inventory – follower -1

inventory – follower - 2

Client

member-1

DistributedDataStore

inventory – leader

Raft Distributed Consensus

18

disc

over

s no

de w

ith h

ighe

r ter

m Follower

Candidate

Leader

starts up/ recovers

times out, starts elections

receives votesfrom majority of nodes

times out, restarts elections

follower-2follower-1

leader

appe

nd en

tries

append entries

Election Replication/Consensus

Journal replication

19

leader

follower-1

follower-2

transaction-1

transaction-2

transaction-3

transaction-4

transaction-1

transaction-2

transaction-3

transaction-4

transaction-1

transaction-2

transaction-3

transaction-4

Snapshot Replication

20

leader

follower-1

follower-2

Durability/Recovery

21

Journal

Snapshots

#ODSummit

Remote RPC Architecture

Invoking a Remote RPC

23

Consumer

member-1 member-2

Provider

Location Transparency

24

Consumer

member-1 member-2

Provider

RpcP

rovi

derP

roxy

Rem

oteR

pcBr

oker

RPC Registry

25

ProviderRPC

RegistrationListener

RPC Registry

RPC Registry Replication - Gossip

26

version=1

version=2

modify

changeversion

Local bucket updates change version

m1,v1

m2,v5

m3,v7

All buckets and theirversions known to allmembers

Every 1 second memberssend all known bucket versions to any one peer

m1

m2 m3status

statusstat

us

m2 m3

m1

status

upda

telocal versions higher – send updatelocal versions lower – send status to sender

#ODSummit

Modules

Modules

28

sal-clustering-commons

sal-akka-raft sal-remoterpc-connector

sal-distributed-datastore

sal-clustering-config

sal-akka-raft-example

sal-dummy-distributed-datastore

clustering-test-app

▪Some common messages▪Actor base classes▪The Protobuf messages used in Helium▪The Protobuf NormalizedNode serialization code▪The NormalizedNode streaming code▪Other miscellaneous utility classes

sal-clustering-commons

29

▪Implementation of the Raft Algorithm on top of akka▪Uses akka-persistence for durability▪Provides a base class called RaftActor which when can be extended

by anyone who wants to replicate state▪See sal-akka-raft-example which provides a simple implementation of

a replicated HashMap

sal-akka-raft

30

▪ConcurrentDOMDataBroker▪DistributedDataStore▪Implementation of the DOMStore SPI▪Shard built on top of RaftActor▪Creates Shards based on Sharding strategy▪Code for a client to interact with the Shard Leader

sal-distributed-datastore

31

▪RemoteRpcProvider▪Default RPC Provider. Invoked when an RPC is not found in the local

MD-SAL registry.▪Code for BucketStore which provides a mechanism to replicate state

based on Gossip▪Code for RpcBroker which allows invoking a remote rpc

sal-remoterpc-connector

32

#ODSummit

Data Store Flows

Startup

34

DistributedConfigDataStoreProviderModule

DistributedDataStore

ShardManager

Shard1 Shard Shard3 Shard4

createInstance

ActorContextwaitTillReadyLatch

create & waitTillReady

Recovery

35

Shard1 Shard Shard3 Shard4

ShardManager

read last known state from disk

ready

waitTillReadyLatch

countDown

▪Recovery must be complete▪All Shard Leaders must be known▪Three messages are monitored by ShardManager

▪Cluster.MemberStatusUp▪Used to figure out the address of a cluster member

▪LeaderStateChanged▪Used to figure out if a Follower has a different Leader

▪ShardRoleChanged▪Use to figured out any changes in a Shard’s Role

▪Waiting is not infinite, by default it lasts only 90 seconds but is configurable

▪Will block config sub-system

Waiting for Ready

36

Creating a Transaction

37

DistributedDataStorenewReadWriteTransaction

TransactionProxy

create

First Operation

38

ActorContext.findPrimary

PrimaryCache.lookup/ShardManager.findPrimary

Found?

LocalTransactionContextRemoteTransactionContext

NoOpTransactionContext

TransactionProxywrite(“inventory”, node)

Local?

N

Y

N Y

Transactions

39

Client

DistributedDataStore

inventory – leader

Client

DistributedDataStore

inventory – leader

Local Transaction Remote Transaction

mem

ber-

1 mem

ber-

1m

embe

r-2

Local Transaction Optimization

40

LocalTransactionContext Shard - Leader

write

merge

delete

ready

member-1

Remote Transaction Optimization

41

RemoteTransactionContext Shard Leader

write

merge

delete

ready

write mod

merge mod

delete mod

member-1 member-2

Transaction Rate Limiting

42

rate-limit = 100 Tx/Sec

TxCohort

Shard Leader

member-2

20ms

TxCohort

50ms

TxCohort

15ms

after rate-limit/2 transactions done….new-rate-limit = 25 Tx/Sec

Operation Limiting

43

RemoteTransactionContext Shard Leader

write

merge

delete

write mod

merge mod

delete mod

member-1 member-2

……

block

Commit Coordination

44

Shard Leader

member-2

ShardCommitCoordinator

Tx1 - ready

Tx2 - ready

Tx3 - ready

Tx1 - commit

Tx3 - commit

Tx3 - abort

Tx2 - commit

Tx1

Tx2

Tx3

Managing the in-memory journalReplicated To All

45

Client leader

follower-1 follower-2

commit transaction

txn

txn txn

Managing the in-memory journalCluster member unavailable

46

Client leader

follower-1

commit transaction

txn

txn

txntxntxn

txntxntxn

Data Change Notifications

47

Client leader

follower-1 follower-2

commit transaction

txn

txn txn

notify

#ODSummit

RPC Connector Flows

Startup

49

RemoteRpcBrokerModulecreateInstance

RpcManager

RemoteRpcProvider

RpcBroker RpcRegistry RemoteRpcImpl RpcListener

Default RPC Delegate

50

RpcManager SchemaContext

DOMRpcProviderService

read all rpc definitions

registerImplementation(remoteRpcImpl)

RPC Registered

51

RpcProviderRegistryaddRoutedRpcImpl

RoutedRpcRegistrationregisterPath

RpcListener

RpcRegistry

Invoking a Remote RPC

52

RemoteRpcImplinvokeRpc

RpcRegistry

Route found?

RpcBroker

ExecuteRpc

FooService

throw Exception

Invoking a Remote RPC

53

RemoteRpcImpl

Consumer

Provider

member-1 member-2

RpcBroker

RpcRegistry

invokeRpc

invokeRpcfindRoute

ExecuteRpc

#ODSummit

Data Store Diagnostics

Transaction Tracing

55

Created txn member-2-txn-9400 of type READ_WRITE on chain member-2-txn-chain-13

Client

Server

Tx member-2-txn-9400 read /(urn:opendaylight:inventory?...

member-3-shard-inventory-operational: Creating transaction : shard-member-2-txn-9400

Tx member-2-txn-9400 Readying 1 transactions for commit

Tx member-2-txn-9400 commit

member-3-shard-inventory-operational: Readying transaction member-2-txn-9400

member-3-shard-inventory-operational: Committing transaction member-2-txn-9400

Tx member-2-txn-9400: commit succeeded

Cluster MemberInitiator

Counter

Transaction Type

Module

Data store type

Replication Tracing

56

Leader

Sending AppendEntries to follower member-2-shard-topology-operational: AppendEntries [term=2, leaderId=member-1-shard-topology-operational, prevLogIndex=520, prevLogTerm=2, entries=[Entry{index=521, term=2}], leaderCommit=520, replicatedToAllIndex=-1]

Follower

handleAppendEntries: AppendEntries [term=2, leaderId=member-2-shard-topology-operational, prevLogIndex=520, prevLogTerm=2, entries=[Entry{index=521, term=2}], leaderCommit=520, replicatedToAllIndex=-1]

handleAppendEntries returning : AppendEntriesReply [term=2, success=true, logLastIndex=521, logLastTerm=2, followerId=member-1-shard-topology-operational]

handleAppendEntriesReply from member-2-shard-topology-operational: applying to log – commitIndex: 521, lastAppliedIndex: 520

handleAppendEntriesReply - FollowerLogInformation for member-2-shard-topology-operational updated: matchIndex: 521, nextIndex: 522

Shard MBean

57

org.opendaylight.controller:type=DistributedOperationalDataStore,Category=Shards,name=member-1-shard-inventory-operational

Operational

Config

member-1

member-2

member-3

default

inventory

topology

operational

config

AttributesAbortTransactionsCount CommitIndex CommittedTransactionsCount CurrentTerm FailedTransactionsCount

FollowerInfo FollowerInitialSyncStatus

InMemoryJournalDataSize

InMemoryJournalLogSize LastApplied

LastCommittedTransactionTime LastIndex LastTerm Leader RaftState

ReadOnlyTransactionCount

ReadWriteTransactionCount WriteOnlyTransactionCount

VotedFor and more….

ShardManager MBean

58

org.opendaylight.controller:type=DistributedOperationalDataStore,Category=ShardManager,name=shard-manager-operational

Operational

Config

operational

config

Attributes

• LocalShards• SyncStatus

Data store GeneralRuntimeInfo MBean

59

org.opendaylight.controller:type=DistributedConfigDatastore,name=GeneralRuntimeInfo

Operational

Config

Attributes

• TransactionCreationRateLimit

Transaction Commit RateMBean

60

org.opendaylight.controller.cluster.datastore:name=distributed-data-store.config.commit.rate

Attributes

• 50thPercentile• 75thPercentile• 90thPercentile• and so on…

operational

config

• Count• Min• Max• StdDev

Data store GeneralRuntimeInfo MBean

61

org.opendaylight.controller:type=DistributedConfigDatastore,name=GeneralRuntimeInfo

Operational

Config

Attributes

• TransactionCreationRateLimit

Message Statistics MBean

62

org.opendaylight.controller.actor.metric:name=/user/shardmanager-config.msg-rate.ActorInitialized

Attributes

• 50thPercentile• 75thPercentile• 90thPercentile• and so on…

operational

config

• Count• Min• Max• StdDev

Message Name

#ODSummit

Remote RPC Diagnostics

RemoteRpcBroker MBean

64

org.opendaylight.controller:type=RemoteRpcBroker,name=RemoteRpcRegistry

Attributes

• BucketVersions• GlobalRpc• LocalRegisteredRoutedRpc

Operations

• findRpcByName• findRpcByRoute

Message Statistics MBean

65

org.opendaylight.controller.actor.metric:name=/user/rpc/registry.msg-rate.AddOrUpdateRoutes

Attributes

• 50thPercentile• 75thPercentile• 90thPercentile• and so on…

• Count• Min• Max• StdDev

Message Name

66

▪Deploy a cluster▪Run clustering integration tests▪Write an application that works in the cluster▪Write bugs to report features which you find missing▪Try running dsBenchMark on a cluster▪Test out replication using the dummy data store▪Check out the code▪Send email to [email protected] with questions

Suggested Next Steps…

67

#ODSummit

Thank You