Splunk Live

8/3/2019 Splunk Live


Splunk Live October 2010




page 2

Who we are

Established financial services technology consulting company

Founded in 2004 by experts in risk management technology

Exclusive focus on Capital Markets

Engaged at top-tier international banks and hedge funds

Offices in NY, London, Bangalore

www.riskfocusinc.com

Broad product, functional and technology expertise

Expertise translates into common solution patterns which are reused to client benefit

Products: Credit, Rates, Commodities, FX

Process: Trade Capture, Valuation, on demand / end of day Valuation / Risk, Enterprise Market / Credit Risk

FpML: A Common Language for Financial Communication

Our Approach

We aim for better, generalized solutions to problem patterns



page 3

Presentation Agenda

The Enterprise IT Problem

Challenges of Enterprise Systems

Splunk Solutions for the whole Software Development Lifecycle

Cross-cutting concerns

Design

Release Cycle

Operations

Recommendations



page 4

Towers of Hanoi or Tower of Babel?



page 5

The algorithm

Common Language

Effective Communication

Strategic Success

Clear Message

Reactive

Splunk



page 6

The architecture

Common Format

Transparent Conversations

Robust System

Message Driven

Reactive

Splunk



page 7

Unified Operational Intelligence with Splunk

Capital Markets systems:

Expensive

Complex

Large operational and support teams

Maintenance/support lags development initiatives

Costly downtime

Maintenance: Preventive is better than Corrective

Corrective Maintenance: quick and replicable



page 8

EXAMPLE: Fictional Trading System Diagram



page 9

Operational Patterns in Large Systems

How do we apply behavior across functional components?

Cross-cutting concerns: apply to all parts, regardless of function

At application level, often handled via Aspect Oriented Programming:

Logging

Performance Profiling

Security

Transactionality

But what about at higher levels?

This is how the operations team experiences the system



page 10

Cross cutting at the APPLICATION Level

MessageListener

NovationHandler

TradeDAO

Logging
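A minimal sketch of logging as an application-level cross-cutting concern, using a Python decorator as a stand-in for an AOP aspect. The class names come from the slide; their methods (`on_message`, `handle`, `save`), the `logged` decorator and the trade dictionary are assumptions made for illustration only.

```python
import functools
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(name)s %(message)s")
log = logging.getLogger("trading")


def logged(func):
    """Log entry, exit and failure of a call, regardless of what the call does."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        log.info("enter %s", func.__qualname__)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Record the stack trace, then let the caller see the failure.
            log.exception("error in %s", func.__qualname__)
            raise
        log.info("exit %s", func.__qualname__)
        return result
    return wrapper


class TradeDAO:
    @logged
    def save(self, trade):  # hypothetical method
        return {"saved": trade["id"]}


class NovationHandler:
    def __init__(self, dao):
        self.dao = dao

    @logged
    def handle(self, trade):  # hypothetical method
        return self.dao.save(trade)


class MessageListener:
    def __init__(self, handler):
        self.handler = handler

    @logged
    def on_message(self, message):  # hypothetical method
        return self.handler.handle(message)


listener = MessageListener(NovationHandler(TradeDAO()))
outcome = listener.on_message({"id": "T1"})
```

The same logging behavior is applied to all three components without touching their business logic, which is the property the slide illustrates.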



page 11

Cross cutting at the SYSTEM Level

External Gateway

Client Trade Processing

Log Aggregation



page 12

Cross cutting at the ORGANIZATION Level

Valuation System

Trading System

Market Data System

Operational Intelligence



page 13

Design

Problem: The Design Paradox

Modular and Distributed are great for design and development

increased productivity

improved flexibility

They make a system look fragmented to the operational teams

Borders are problematic

Example

An issue occurs within one of the components

This leads to an incident across the border

The symptoms are observed in a different place at a different time

Solution

Aggregate all logs and cross-index them

Create an integrated dashboard
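A rough sketch of what aggregating and cross-indexing logs means, written in plain Python rather than Splunk. The log records, their tuple layout and the message IDs are invented for illustration; a real deployment would index the components' actual log files.

```python
import heapq
from collections import defaultdict

# Hypothetical per-component logs, each already sorted by time.
# Record layout (an assumption): (timestamp, component, message_id, text)
gateway_log = [
    (100, "External Gateway", "M1", "received"),
    (105, "External Gateway", "M2", "received"),
]
processing_log = [
    (102, "Client Trade Processing", "M1", "validated"),
    (110, "Client Trade Processing", "M2", "ERROR missing counterparty"),
]
valuation_log = [
    (130, "Valuation System", "M2", "ERROR no trade to value"),
]


def aggregate(*logs):
    """Merge time-sorted logs into a single time-ordered stream."""
    return list(heapq.merge(*logs))


def cross_index(events):
    """Index every event by the message it belongs to."""
    index = defaultdict(list)
    for event in events:
        index[event[2]].append(event)
    return dict(index)


events = aggregate(gateway_log, processing_log, valuation_log)
by_message = cross_index(events)

# The symptom shows up in valuation at t=130; the index leads back
# to the earlier error in another component.
first_error = next(e for e in by_message["M2"] if "ERROR" in e[3])
```

Here the failure is noticed in one component, but following message `M2` across the merged stream points to the earlier error on the other side of the border.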



page 14

Dashboard

See issues by:

functional area

component

support classification etc.



page 15

Conversation

Track a problem message across all components



page 16

Release Cycle

Problem: The Problem Only Occurs in Production (good acronym)

Tests passed

For some reason we only see the problem once the system is live

Example

Exception occurred in QA/UA, but tests passed and no one saw it

Same problem blew up in Production later

Solution with Splunk

Tag & Categorize events

Ignorable

Known (and have recipe for recovery)

New

Link to everything:

Knowledge Base (e.g. Support Wiki)

Source Control viewer (FishEye)

Build Server (TeamCity/Hudson)

Bug Database (e.g. Jira)
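The tagging idea above can be sketched as a small rule table, shown here in plain Python rather than in Splunk itself. The patterns, the exception name and the wiki URL are hypothetical placeholders, not taken from the presentation.

```python
import re

# Hypothetical rules: (pattern, category, link to a recovery recipe or None).
RULES = [
    (re.compile(r"heartbeat timeout"), "ignorable", None),
    (re.compile(r"StaleQuoteException"), "known",
     "https://wiki.example.com/support/stale-quote"),  # placeholder URL
]


def categorize(event_text):
    """Return (category, recipe_link) for a log event.

    Anything that matches no rule is 'new' and deserves a human look,
    even if the test run that produced it passed.
    """
    for pattern, category, recipe in RULES:
        if pattern.search(event_text):
            return category, recipe
    return "new", None


qa_events = [
    "WARN heartbeat timeout on session 4",
    "ERROR StaleQuoteException for trade T17",
    "ERROR NullPointerException in NovationHandler",
]
tagged = [(categorize(text)[0], text) for text in qa_events]
new_events = [text for category, text in tagged if category == "new"]
```

With rules like these, an exception that slipped through QA unnoticed surfaces as a "new" event instead of being lost among known noise.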



page 17

Root Cause

Show problem FpML message via REST

Drill through to Support Wiki for solution



page 18

Operations

Problem: The Non Sequitur

Lack of context makes investigation very expensive

Collaboration frequently means long conference calls

Example

"We have a problem. Can you look at it?"

Collaborative effort preceding call is lost

Inability to correlate events across components and over time

Inability to look historically. When did the problem appear first?

Did we just introduce it in this release?

Solution

Just email a Splunk link

Single entry point for ALL INTELLIGENCE on this problem

It can be passed around with no loss



page 19

Performance

Support Email: Sync was slow starting 1pm. Any ideas?

Useless without Splunk; legitimate with it

See trends over time, across releases

Confirm, drill down, resolve
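As an illustration of "confirm, drill down", the sketch below compares sync durations across releases and then by hour within the suspect release. It is plain Python over invented numbers; the sample layout and release labels are assumptions.

```python
from collections import defaultdict
from statistics import median

# Hypothetical samples: (release, hour_of_day, sync_duration_seconds)
samples = [
    ("1.4", 12, 2.0), ("1.4", 13, 2.2), ("1.4", 14, 2.1),
    ("1.5", 12, 2.1), ("1.5", 13, 6.5), ("1.5", 14, 7.0),
]


def median_by(rows, key_index):
    """Median sync duration grouped by the field at key_index."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key_index]].append(row[2])
    return {key: median(values) for key, values in groups.items()}


# Confirm: is the current release slower than the previous one?
by_release = median_by(samples, 0)

# Drill down: within release 1.5, when does the slowdown start?
by_hour_in_1_5 = median_by([s for s in samples if s[0] == "1.5"], 1)
```

With this made-up data the per-release medians confirm the complaint, and the hourly breakdown shows the slowdown beginning at 13:00, which matches the "starting 1pm" in the support email.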



page 20

Recommendations

Good Design takes into account the whole lifecycle of a System

You will be remembered for the failures

The challenge is Clear Communication. The requirements are Volume, speed, etc.

You CAN have it both ways: clarity does not have to hinder performance

Splunk helps

Design for transparency

Optimize for people not machines. Hardware is cheaper

Design for the end user

Design for the operations team

State should be human readable

Design for scalability

Make it faster by adding hardware not by compromising transparency

Make it faster only after it works and is transparent

A system chain is only as strong as the weakest link

Splunk unifies it all
