22
An SNMP based failure detection service Matthias Wiesmann, Péter Urbán, Xavier Défago FN NF S Swiss National Science Foundation 北陸先端科學技術大學院大學 1

An SNMP based failure detection service...• Consensus, Atomic broadcast, Atomic commitment • Formal definitions: P, ♢S, Ω, φ, etc. • Also outside distributed system community

  • Upload
    others

  • View
    9

  • Download
    0

Embed Size (px)

Citation preview

An SNMP based failure detection service

Matthias Wiesmann, Péter Urbán, Xavier Défago

FN NFSSwiss National Science Foundation

北陸先端科學技術大學院大學

1

An SNMP based failure detection service

Overview

• Failure Detection

• SNMP

• SNMP-FD Service

• Conclusion & Future Work

2

2

An SNMP based failure detection service

Failure detection

• Basic abstraction for distributed systems

• Needed for solving fundamental problems:

• Consensus, Atomic broadcast, Atomic commitment

• Formal definitions: P, ♢S, Ω, φ, etc.

• Also outside distributed system community

• System, Network, Cluster management, Middleware, Grid systems

• Common problem→ common solution

3

3

Failure detection service

An SNMP based failure detection service

•Factorize-out failure detection

• Shared, separate implementation

• Complex, adaptive policies

• Proposed in the literature

• Implemented by middleware systems, grid systems

• Problem: no standard

•Every system has its own service and interfaces

4

4

An SNMP based failure detection service

Standard Needed

• External standard

• Connect clients of the service to the service• Application, Middleware, Monitoring and administration tools

• Suspicion notifications, configuration

• Internal standard

• Use standard for heartbeat messages, internal queries

• We need some form of Middleware

5

5

An SNMP based failure detection service

Overview

• Failure Detection

• SNMP

• SNMP-FD Service

• Conclusion & Future Work

6

6

An SNMP based failure detection service

Why SNMP ?

• Simple Network Management Protocol

• Present in many network attached devices now

• Routers, Smart Switches, UPS, Printers…

• Simple Middleware

• Offers basic features

• Relatively lightweight

• Many applicable standard defined

• Monitoring, fault reporting.

7

7

Network

EquipmentNetwork

Management

StationSNMP

Agent

Monitored

ObjectMIB

An SNMP based failure detection service

SNMP Architecture

• Network Management Station

• Manages equipment

• Equipment’s agent

• Exposes Management Information Base (MIB)

• Sends asynchronous messages (traps) to NMS.

8

8

An SNMP based failure detection service

Overview

• Failure Detection

• SNMP

• SNMP-FD Service

• Conclusion & Future Work

9

9

!"#$%&'

!"#$%&'S

()*$'B!"#$%"&'()&"*'++,

p1B

-,."+%,&'+"/&*'+,!01

-,2"%$!*3%$"#,4,53&6'%,!01

-,.'3&%7'3%+,!01

!"#$%&'&($%)*$+,-&.%*,+/01,

!"#$%"&'()&"*'++,

p2B

!"#$%"&'()&"*'++,

p3B

-,!01,

23(4&56,%*

()*$'A

7,*

8+09:

2,*&;&7,*

<,0+*=,0*&

8+09:!"#$%&'&()%&$*"+",)-./%"012

!"34)-5"012

-,.'3&%7'3%,8#"9:'(6',!01

-,)&"*'++,8#"9:'(6',!01

-,;3$:/&',<'%'*%$"#,!01,

!"#$%&'&($%)*$+)%6&.%*,+/01,&

!"#$%"&$#6)&"*'++

pA

3$*)!10*)$%

2,*&;&7,*

6%)*7)-7"012

*/+%"=,!01

()

(*

)>?+$*3:,@$#A

!'++36'+

B/'&$'+,C,D&$%'+

An SNMP based failure detection service

SNMP-FD service

10

10

An SNMP based failure detection service

Service Architecture

• Distributed architecture

• Dæmon runs on each host

• Monitor local processes

• Exchange heartbeats

• Export MIB information

• Can interact with equipment

• Similar to other failure detection service architectures

11

!"#$%&'

!"#$%&'S

()*$'B!"#$%"&'()&"*'++,

p1B

-,."+%,&'+"/&*'+,!01

-,2"%$!*3%$"#,4,53&6'%,!01

-,.'3&%7'3%+,!01

!"#$%&'&($%)*$+,-&.%*,+/01,

!"#$%"&'()&"*'++,

p2B

!"#$%"&'()&"*'++,

p3B

-,!01,

23(4&56,%*

()*$'A

7,*

8+09:

2,*&;&7,*

<,0+*=,0*&

8+09:!"#$%&'&()%&$*"+",)-./%"012

!"34)-5"012

-,.'3&%7'3%,8#"9:'(6',!01

-,)&"*'++,8#"9:'(6',!01

-,;3$:/&',<'%'*%$"#,!01,

!"#$%&'&($%)*$+)%6&.%*,+/01,&

!"#$%"&$#6)&"*'++

pA

3$*)!10*)$%

2,*&;&7,*

6%)*7)-7"012

*/+%"=,!01

()

(*

)>?+$*3:,@$#A

!'++36'+

B/'&$'+,C,D&$%'+

11

An SNMP based failure detection service

Features

• Uses many standard MIBs

• Host Resource MIB → Process state

• Target & Notification MIB → Subscriber description

• Alarm MIB → Current suspicion description

• Lightweight messaging:

• Heartbeat are UDP traps.

• Leverages SNMP

• Configuration, Security, Interoperability with devices12

❏✔

12

An SNMP based failure detection service

New MIBs

• Heartbeat MIB

• Describes heartbeats

• Failure detection MIB

• Describes failure detector configuration

• Described using standard SNA syntax file

• Accessible using standard tools

13

SNMP table: SnmpFD::kbhbTable

index interval lastbeat lasttime count lostcount mean stddev min max kt-dhcp23.jaist.ac.jp 100 2678 0:0:04:38.49 2671 0 99.9955 28.3417 0 356

13

An SNMP based failure detection service

Implementation

• Validate the architecture

• Implemented in Java

• Open source libraries

• SNMP4J, SNMP4J Agent, Apache Libraries

• Available for download

• http://ddsg.jaist.ac.jp//en/projects/snmp-fd/

14

14

An SNMP based failure detection service

Implementation

• Validate the architecture

• Implemented in Java

• Open source libraries

• SNMP4J, SNMP4J Agent, Apache Libraries

• Available for download

• http://ddsg.jaist.ac.jp//en/projects/snmp-fd/

14

14

• Simple failure detector

• Stable performance

• Consistent with expectations

• Validates approach

• Issues

• High latency

• High CPU load on receiver

An SNMP based failure detection service

Performance

15

0.001

0.01

0.1

1

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Timeout (ms)

Mis

takes

per

sec

ond

99.85%

99.9%

99.95%

100.0%

Quer

y

Acc

ura

cyMistake Rate

Query Accuracy

15

• Simple failure detector

• Stable performance

• Consistent with expectations

• Validates approach

• Issues

• High latency

• High CPU load on receiver

An SNMP based failure detection service

Performance

15

0.001

0.01

0.1

1

0 1 2 3 4 5 6 7 8 9 10 11 12 13

Timeout (ms)

Mis

takes

per

sec

ond

99.85%

99.9%

99.95%

100.0%

Quer

y

Acc

ura

cyMistake Rate

Query Accuracy

}SNMP4JIssue, code spends 95% of time there.

15

An SNMP based failure detection service

Overview

• Failure Detection

• SNMP

• SNMP-FD Service

• Conclusion & Future Work

16

16

An SNMP based failure detection service

Conclusion

• Failure detection service

• Standard based approach desirable

• No need for new standard → use existing SNMP

• Approach validated by prototype

• Can be extended with more advance FD techniques

• Hierarchical FD, adaptive FD, etc.

• Can interact with network devices

• For instance link-down traps

17

17

An SNMP based failure detection service

Future work

• Look into performance issues

• Optimizations needed in the SNMP4J library.

•Port the framework to the latest version of SNMP4J

• Support for AgentX

• Implement framework as sub-agent

• Implement more advanced failure detectors

• Handle network equipment data

18

18

An SNMP based failure detection service

Questions?

19

19

An SNMP based failure detection service

Questions?

19

Thank you…

19