8/3/2019 Splunk Live
1/20
Splunk Live October 2010
page 2
Who we are
Established financial services technology consulting company
Founded in 2004 by experts in risk management technology
Exclusive focus on Capital Markets
Engaged at top-tier international banks and hedge funds
Offices in NY, London, Bangalore
www.riskfocusinc.com
Broad product, functional and technology expertise
Expertise translates into common solution patterns, which are reused to client benefit
Products: Credit, Rates, Commodities, FX
Process: Trade Capture, Valuation, on demand / end of day Valuation / Risk, Enterprise Market / Credit Risk
FpML: A Common Language for Financial Communication
Our Approach
We aim for better, generalized solutions to problem patterns
page 3
Presentation Agenda
The Enterprise IT Problem
Challenges of Enterprise Systems
Splunk Solutions for the whole Software Development Lifecycle
Cross-cutting concerns
Design
Release Cycle
Operations
Recommendations
page 4
Towers of Hanoi or Tower of Babel?
page 5
The algorithm
Common Language
Effective Communication
Strategic Success
Clear Message
Reactive
Splunk
page 6
The architecture
Common Format
Transparent Conversations
Robust System
Message Driven
Reactive
Splunk
page 7
Unified Operational Intelligence with Splunk
Capital Markets systems:
Expensive
Complex
Large operational and support teams
Maintenance/support lags development initiatives
Costly downtime
Maintenance: Preventive is better than Corrective
Corrective Maintenance: quick and replicable
page 8
EXAMPLE: Fictional Trading System Diagram
page 9
Operational Patterns in Large Systems
How do we apply behavior across functional components?
Cross-cutting concerns
Apply to all parts, regardless of function
At application level, often handled via Aspect Oriented Programming:
Logging
Performance Profiling
Security
Transactionality
But what about at higher levels?
This is how the operations team experiences the system
page 10
MessageListener
Cross cutting at the APPLICATION Level
NovationHandler
TradeDAO
Logging
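As a rough illustration of logging cutting across application components, here is a minimal Python sketch. A decorator stands in for Aspect Oriented Programming, which the deck mentions but does not show. The function names are hypothetical stand-ins for the MessageListener, NovationHandler and TradeDAO boxes in the diagram, and the logger name "trading" is an assumption.

```python
import functools
import logging

logger = logging.getLogger("trading")  # assumed logger name


def logged(func):
    """Wrap func so entry, exit and failures are logged in one place."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("enter %s args=%r kwargs=%r", func.__qualname__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # Log the stack trace once, then let the caller handle the error.
            logger.exception("error in %s", func.__qualname__)
            raise
        logger.info("exit %s result=%r", func.__qualname__, result)
        return result
    return wrapper


@logged
def save_trade(trade_id):        # hypothetical stand-in for TradeDAO
    return f"saved {trade_id}"


@logged
def handle_novation(trade_id):   # hypothetical stand-in for NovationHandler
    return save_trade(trade_id)


@logged
def on_message(message):         # hypothetical stand-in for MessageListener
    return handle_novation(message["trade_id"])
```

None of the three functions contains logging code of its own; the concern is applied uniformly from outside, which is the point of the diagram.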
page 11
ExternalGateway
Cross cutting at the SYSTEM Level
Client TradeProcessing
Log Aggregation
page 12
ValuationSystem
Cross cutting at the ORGANIZATION Level
TradingSystem
MarketDataSystem
Operational Intelligence
page 13
Design
Problem: The Design Paradox
Modular and Distributed are great for design and development
increased productivity
improved flexibility
They make a system look fragmented to the operational teams
Borders are problematic
Example
An issue occurs within one of the components
This leads to an incident across the border
The symptoms are observed in a different place at a different time
Solution
Aggregate all logs and cross-index them
Create an integrated dashboard
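To make "aggregate and cross-index" concrete, here is a toy Python sketch of the idea. It is not how Splunk works internally. The log tuples, timestamps and message ids are invented for illustration; only the component names come from the diagrams.

```python
import heapq
from collections import defaultdict

# Hypothetical per-component logs: (timestamp, component, text), each time-ordered.
gateway_log = [
    (100, "ExternalGateway", "received msg=M1"),
    (130, "ExternalGateway", "received msg=M2"),
]
processing_log = [
    (110, "TradeProcessing", "booked msg=M1"),
    (140, "TradeProcessing", "ERROR rejected msg=M2"),
]


def aggregate(*logs):
    """Merge several time-ordered logs into one time-ordered stream."""
    return list(heapq.merge(*logs))


def cross_index(events):
    """Map each token (component name or word) to the positions of events containing it."""
    index = defaultdict(list)
    for position, (_, component, text) in enumerate(events):
        for token in {component, *text.split()}:
            index[token].append(position)
    return index


events = aggregate(gateway_log, processing_log)
index = cross_index(events)
```

With the index in place, `index["msg=M2"]` finds every event about that message regardless of which component wrote it, so the border between components stops hiding the link between cause and symptom.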
page 14
Dashboard
See issues by: functional area
component
support classification etc.
page 15
Conversation
Track a problem message across all components
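Tracking a message as a conversation can be sketched as follows, assuming every component logs a shared message id. The `Event` type, its fields and the sample data are hypothetical; the deck does not say how messages are correlated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    timestamp: int
    component: str
    message_id: str   # assumed correlation key shared by all components
    text: str


def trace(events, message_id):
    """Return the hops of one message, oldest first, as readable lines."""
    hops = sorted(
        (e for e in events if e.message_id == message_id),
        key=lambda e: e.timestamp,
    )
    return [f"{e.timestamp} {e.component}: {e.text}" for e in hops]


# Hypothetical events, deliberately out of order and mixed across messages.
sample = [
    Event(140, "TradeProcessing", "M2", "rejected"),
    Event(110, "TradeProcessing", "M1", "booked"),
    Event(130, "ExternalGateway", "M2", "received"),
    Event(100, "ExternalGateway", "M1", "received"),
]
conversation = trace(sample, "M2")
```

The last line of the trace shows where the message stopped, which is usually the first place to look.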
page 16
Release Cycle
Problem: The Problem Only Occurs in Production (good acronym)
Tests passed
For some reason we only see the problem once the system is live
Example
Exception occurred in QA/UA, but tests passed and no one saw it
Same problem blew up in Production later
Solution with Splunk
Tag & Categorize events:
Ignorable
Known (and have recipe for recovery)
New
Link to everything:
Knowledge Base (e.g. Support Wiki)
Source Control viewer (FishEye)
Build Server (TeamCity/Hudson)
Bug Database (e.g. Jira)
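The Ignorable / Known / New tagging above could look roughly like this in Python. The patterns and the wiki path are made up for illustration; in Splunk itself this would more likely be done with its own tagging features rather than external code.

```python
import re

# Hypothetical rules: (pattern, tag, link to recovery recipe or None).
RULES = [
    (re.compile(r"heartbeat timeout"), "Ignorable", None),
    (re.compile(r"FpML validation failed"), "Known", "wiki/FpML-validation-recovery"),
]


def categorize(line):
    """Return (tag, recipe_link) for a log line; anything unmatched is New."""
    for pattern, tag, recipe in RULES:
        if pattern.search(line):
            return tag, recipe
    return "New", None
```

The useful property is the default: an exception nobody has classified yet surfaces as New in QA, instead of passing unnoticed until Production.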
page 17
Root Cause
Show problem FpML message via REST
Drill through to Support Wiki for solution
page 18
Operations
Problem: The Non Sequitur
Lack of context makes investigation very expensive
Collaboration frequently means long conference calls
Example
"We have a problem. Can you look at it?"
Collaborative effort preceding call is lost
Inability to correlate events across components and over time
Inability to look historically. When did the problem appear first?
Did we just introduce it in this release?
Solution
Just email a Splunk link
Single entry point for ALL INTELLIGENCE on this problem
It can be passed around with no loss
page 19
Performance
Support Email: Sync was slow starting 1pm. Any ideas?
Useless without Splunk; legitimate with it
See trends over time, across releases
Confirm, drill down, resolve
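A report like "sync was slow starting 1pm" can be confirmed from logged timings. The sketch below assumes sync durations are available as (hour, seconds) samples; the numbers and the factor of 2 are invented, not taken from the deck.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (hour of day, sync duration in seconds).
samples = [(11, 2.0), (11, 2.2), (12, 2.1), (12, 1.9), (13, 6.0), (13, 6.4), (14, 6.2)]


def hourly_mean(samples):
    """Average duration per hour, in chronological order."""
    buckets = defaultdict(list)
    for hour, seconds in samples:
        buckets[hour].append(seconds)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}


def first_slow_hour(means, factor=2.0):
    """First hour whose mean exceeds factor times the mean of all earlier hours."""
    hours = list(means)
    for i in range(1, len(hours)):
        baseline = mean(means[h] for h in hours[:i])
        if means[hours[i]] > factor * baseline:
            return hours[i]
    return None
```

Running the same comparison per release instead of per hour addresses the "did we just introduce it?" question.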
page 20
Recommendations
Good Design takes into account the whole lifecycle of a System
You will be remembered for the failures
The challenge is Clear Communication. The requirements are Volume, speed, etc.
You CAN have it both ways: clarity does not have to hinder performance
Splunk helps
Design for transparency
Optimize for people, not machines. Hardware is cheaper
Design for the end user
Design for the operations team
State should be human readable
Design for scalability
Make it faster by adding hardware, not by compromising transparency
Make it faster only after it works and is transparent
A system chain is only as strong as the weakest link
Splunk unifies it all
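One way to read "State should be human readable" is to log state as plain key=value pairs, a format that people can scan and that log indexers can generally parse. The helper below is a hypothetical sketch of that idea; the field names are invented.

```python
def format_event(event, **fields):
    """Render an event as space-separated key=value pairs."""
    parts = [f"event={event}"]
    for key, value in fields.items():
        text = str(value)
        if " " in text:
            # Quote values with spaces so the pair stays unambiguous.
            text = f'"{text}"'
        parts.append(f"{key}={text}")
    return " ".join(parts)


line = format_event("trade_booked", trade_id="T1", status="pending approval")
```

Here `line` reads `event=trade_booked trade_id=T1 status="pending approval"`, which an operator can understand without a decoder or a developer on the call.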