7/27/2019 abos_12892
ETL Implementation for Extreme Performance
Presented By:
Mrs. Catherine Boeving
Mr. Greg Wade
Topics
About Us
Tips and tricks for high performance mapping design
Pipeline techniques to improve throughput
Stacked pipelines to achieve extreme throughput
Ensuring data integrity in an extreme environment
Q&A
Copyright 2012 Lockheed Martin
About Us
Who we are: Catherine Boeving, Software Developer; Greg Wade, Information Systems Architect
What we do: Build large scale active data warehouses with near real time data loads and high availability for Department of Defense (DoD) customers
Where we work: Lockheed Martin Global Systems and Solutions, a leading federal services and information technology contractor
Our Environment
Input
Data Acquisition
EDW: Teradata
Staging: Oracle
ETL: SPARC Enterprise M5000, 144 GB RAM, Oracle Solaris 10 OS, Informatica PowerCenter and DataTransformation V9.1.0 HotFix 3
Excess ETL server capacity is needed to achieve extreme throughput.
Performance vs. Throughput
Performance: The execution time for one run of a workflow or mapping
Throughput: The volume of data that can be ETLed in a specified period of time
Performance is needed to achieve throughput
High performance workflows/mappings are not always enough to meet demanding service level agreements (SLAs)
Demanding SLAs require both high performance and throughput
Tips and Tricks
Lookups
Requirements
Reference Data Lookups: validate the source data; expand the source data
Stage Data Lookups: previous source data; processing of partial transactions
Improve integration
Carefully implemented lookups can improve mapping performance
Tips and Tricks
Lookups
Approaches
Cached: only one DB request; best used when referenced data does not change; better for small tables; may be able to compensate for a slow/overloaded DB
Un-Cached: many DB requests; required when referenced data changes; better for large tables; most performance improvement with a fast DB
Selecting the correct lookup type is key to performance
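The difference between the two approaches can be sketched outside of PowerCenter. The example below is only an illustration, assuming an in-memory SQLite table as a stand-in for the reference database; the names `build_reference_db`, `CachedLookup`, and `UncachedLookup` are hypothetical and are not part of any Informatica API.

```python
import sqlite3


def build_reference_db():
    """Create an in-memory table standing in for the reference database (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE parts (part_number TEXT PRIMARY KEY, description TEXT)")
    conn.executemany(
        "INSERT INTO parts VALUES (?, ?)",
        [(f"PN-{i:04d}", f"Part description {i}") for i in range(1000)],
    )
    conn.commit()
    return conn


class UncachedLookup:
    """Issues one query per lookup, so it always sees current data."""

    def __init__(self, conn):
        self.conn = conn
        self.db_requests = 0

    def get(self, part_number):
        self.db_requests += 1
        row = self.conn.execute(
            "SELECT description FROM parts WHERE part_number = ?", (part_number,)
        ).fetchone()
        return row[0] if row else None


class CachedLookup:
    """Reads the whole table once, then answers every lookup from memory."""

    def __init__(self, conn):
        self.db_requests = 1  # the single full-table read below
        self.cache = dict(conn.execute("SELECT part_number, description FROM parts"))

    def get(self, part_number):
        return self.cache.get(part_number)


conn = build_reference_db()
cached, uncached = CachedLookup(conn), UncachedLookup(conn)
for key in ["PN-0001", "PN-0500", "PN-0001"]:
    assert cached.get(key) == uncached.get(key)
print(cached.db_requests, uncached.db_requests)  # prints: 1 3
```

The cached version pays for one full-table transfer up front and would return stale values if the table changed afterwards, which is the trade-off the slide describes.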
Tips and Tricks
Lookups
Calculations
Part Number Example
1000 part numbers and descriptions
20 character part numbers with 50 character descriptions
Simple calculations provide some insight, but testing is needed in your environment
Tips and Tricks
Lookups
Calculations
DB Transfer of Entire Table = # rows * (characters per row) = 1000 * (20 + 50) = 70K Bytes
One Un-Cached Lookup = SQL request + SQL response = 100 + (20 + 50) = 170 Bytes
Break-even point = 70K / 170 = about 411 lookups per mapping execution
Below the break-even point, un-cached lookups transfer fewer bytes; above it, caching the whole table transfers fewer
A good estimate, but it ignores DB speed, network, etc.
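The break-even estimate can be reproduced with a few lines of arithmetic. This is a minimal sketch using the slide's figures; it assumes one byte per character and a 100 byte SQL request, and the function names are illustrative only.

```python
def table_transfer_bytes(rows, key_chars, value_chars):
    """Bytes moved when the whole lookup table is cached in one request."""
    return rows * (key_chars + value_chars)


def uncached_lookup_bytes(request_bytes, key_chars, value_chars):
    """Bytes moved by one un-cached lookup: SQL request plus one returned row."""
    return request_bytes + key_chars + value_chars


def break_even_lookups(rows, key_chars, value_chars, request_bytes):
    """Number of un-cached lookups that transfer as much as caching the table."""
    return table_transfer_bytes(rows, key_chars, value_chars) / uncached_lookup_bytes(
        request_bytes, key_chars, value_chars
    )


# Part number example: 1000 rows, 20 character keys, 50 character descriptions.
print(table_transfer_bytes(1000, 20, 50))    # 70000
print(uncached_lookup_bytes(100, 20, 50))    # 170
print(break_even_lookups(1000, 20, 50, 100))
```

Substituting your own row counts and column widths gives a first guess only; as the slide notes, DB speed and network behavior are not modeled.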
Tips and Tricks
Lookups
Implementation
Historical Data Load
Cached large table and process large amounts of data one time
25 files/10K rows of data against 25M cached lookup
Average 50 minute workflow execution times and 6 hours total load time
Peer review or inspection checklists should include validating lookup type selection
Tips and Tricks
Stored Procedures
Reduces Round Trips to the Database
Combine Several Lookups with Sequence Logic
Simplifies Complex Database Insert Logic
Potential Loss of Data Lineage
Controlling network chatter between ETL and the DB is essential for high performance
Tips and Tricks
Stored Procedures
Implementation
Eliminate redundant stored procedure calls
Tips and Tricks
Stored Procedures
Implementation
Ensure Matching Port and Parameter Sizes
Mismatched parameter sizes will send extra bytes to the database
Occurs when database and mapping development are done in parallel
Verify the Import of Stored Procedures
Tips and Tricks
Mapping Design
Goals
Rapid Development: reusable components and patterns
Understandability: onboard new staff with unique ETL approach
Maintenance: source system updates; future performance tuning
Tips and Tricks
Mapping Design
Mapplet Execution
Used to handle similar code from different sources
Smaller risk in one-time changes
Tips and Tricks
Mapping Design
Worklet Execution
Isolates performance tuning and minimizes regression testing
Implements standards for new developers
Tips and Tricks
Mapping Design
Stage Table Implementation
Improves critical path completion
Simplifies complex data
Pipeline Processing
Standard Approach
No Pipeline
2 Files Processed in 4 Minutes
Pipeline Processing
Pipeline Approach
2 Files Processed in 3 Minutes
Pipeline Processing
Extract, Pre-Process: sort transaction types; split large files
Transform, Transform in Multiple Steps: DataTransformation (DT); break sources into multiple logical parts
Load, Smart Loading: decouple DB loading from transformations; use of external loaders
Added complexity requires standards and review
Pipeline Processing
Implementing the Pipeline
Use Flat Files Between Pipeline Steps
High demands on the ETL server's file system
Requires highly tuned cluster file systems
Pipeline Steps Have Similar Run Times
Simple Three Step Pipeline
DT File, Workflow Output File, External DB Loader
Flat File Movement Adds Complexity and Must Be Monitored
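The hand-off of flat files between pipeline steps can be sketched as follows. This is not the presenters' implementation: it assumes three trivial stand-in transforms in place of the DT step, the workflow, and the external loader, and uses Python threads and queues in place of the real scheduling. All names (`run_step`, `run_pipeline`, `demo`) are hypothetical.

```python
import queue
import tempfile
import threading
from pathlib import Path

STOP = None  # sentinel telling a step that no more files are coming


def run_step(transform, inbox, outbox, out_dir):
    """Read each incoming flat file, write a transformed flat file, pass its path on."""
    while True:
        path = inbox.get()
        if path is STOP:
            if outbox is not None:
                outbox.put(STOP)
            break
        out_path = out_dir / path.name
        out_path.write_text(transform(path.read_text()))
        if outbox is not None:
            outbox.put(out_path)


def run_pipeline(source_files, work_dir):
    """Run three steps concurrently; each step starts a file as soon as it arrives."""
    steps = [
        ("parsed", str.strip),                 # stand-in for the DT step
        ("mapped", str.upper),                 # stand-in for the workflow step
        ("loaded", lambda text: text + "\n"),  # stand-in for the external loader
    ]
    queues = [queue.Queue() for _ in steps]
    threads = []
    for i, (name, transform) in enumerate(steps):
        out_dir = work_dir / name
        out_dir.mkdir()
        outbox = queues[i + 1] if i + 1 < len(steps) else None
        thread = threading.Thread(
            target=run_step, args=(transform, queues[i], outbox, out_dir)
        )
        thread.start()
        threads.append(thread)
    for path in source_files:
        queues[0].put(path)
    queues[0].put(STOP)
    for thread in threads:
        thread.join()
    return sorted((work_dir / "loaded").iterdir())


def demo():
    """Push three small files through the pipeline and return the final contents."""
    with tempfile.TemporaryDirectory() as tmp:
        work_dir = Path(tmp)
        incoming = work_dir / "incoming"
        incoming.mkdir()
        sources = []
        for i in range(3):
            path = incoming / f"file{i}.txt"
            path.write_text(f"  row {i}  ")
            sources.append(path)
        return [p.read_text() for p in run_pipeline(sources, work_dir)]


print(demo())
```

While step two works on the first file, step one is already free to start the second, which is where the throughput gain comes from. The intermediate directories are also the thing that has to be monitored, as the slide warns.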
Pipeline Processing
Implementing the Pipeline
Batch File Processing
Pipeline Processing
Stacked Pipelines
Threaded 3x
6 Files Processed in 3 Minutes
Files applied to the DB in non-deterministic order
Pipeline Processing
Pipeline Threading
Implementation
Manipulate XML
Replicate Parameter Files
Pipeline Processing
Example Calculations
F = files to process = 100 files
T = time to ETL and load a file = 5 minutes
P = number of pipeline steps = 3 steps
S = number of stacked pipelines = 4 pipelines
Assume
All pipeline steps take the same amount of time
Any overhead for intermediate files is ignored
Estimate with your workload to see the possibilities
Pipeline Processing
Example Calculations Summary
Standard Processing (No Pipeline)
F * T = 100 files * 5 minutes = 500 minutes
Pipeline
(P + (F - 1)) * (T/P) = (3 + (100 - 1)) * (5/3) = 170 minutes
294% speedup over standard processing (500 / 170 = 2.94x)
Stacked Pipelines
((P + (F - 1)) * (T/P)) / S = 170 / 4 = 42.5, about 43 minutes
395% speedup over single pipeline (170 / 43 = 3.95x)
1162% speedup over standard processing (500 / 43 = about 11.6x)
Theoretical speedup shown. Actual speedup depends on your environment
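The three estimates can be computed directly from F, T, P, and S. The sketch below assumes, as the slides do, that all pipeline steps take the same time and that intermediate file overhead is zero; the function names are illustrative.

```python
def standard_minutes(files, minutes_per_file):
    """No pipeline: every file is processed start to finish, one after another."""
    return files * minutes_per_file


def pipeline_minutes(files, minutes_per_file, steps):
    """Single pipeline: (P + (F - 1)) step slots, each lasting T / P minutes."""
    return (steps + (files - 1)) * (minutes_per_file / steps)


def stacked_minutes(files, minutes_per_file, steps, stacks):
    """Stacked pipelines: the single-pipeline time divided across S pipelines."""
    return pipeline_minutes(files, minutes_per_file, steps) / stacks


F, T, P, S = 100, 5, 3, 4
standard = standard_minutes(F, T)
pipeline = pipeline_minutes(F, T, P)
stacked = stacked_minutes(F, T, P, S)

print(f"standard: {standard:.1f} min")
print(f"pipeline: {pipeline:.1f} min, {standard / pipeline:.2f}x over standard")
print(f"stacked:  {stacked:.1f} min, {pipeline / stacked:.2f}x over pipeline, "
      f"{standard / stacked:.2f}x over standard")
```

Without rounding, the stacked time is 42.5 minutes, which is exactly 4x faster than one pipeline and about 11.8x faster than standard processing. The slide's 395% and 1162% figures follow from rounding 42.5 up to 43 before dividing.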
Pipeline Processing
Pipeline Metrics: 2 Steps
Standard Processing
387 sec for one file batch
Pipeline Processing
Step #1: 222 sec for one file batch
Step #2: 240 sec for one file batch
100 File Batches Calculation
Standard = 100 * 387 = 38700 sec
Pipeline = (2 + (100 - 1)) * (240/2) = 12120 sec
Speedup = 38700 / 12120 = about 319%
Pipeline Processing
Threaded Workflows Metrics (4 Threads)
                        Total Runtime   Total Input     Total Weighted    Total Rows / Total Weighted   Percent
                        (Seconds)       Rows            Average Runtime   Average Runtime               Difference
Non-Threaded Workflows  1,836,519.00    81,148,820.00   303,229.40        267.62                        -17.93%
Threaded Workflows      5,503,758.00    88,555,273.00   271,577.78        326.08                        21.85%
Full speedup not realized. Consider data volume when threading.
We ran out of files to process!
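The two derived columns in the metrics can be recomputed from the totals. This is a small sketch that assumes the percent difference compares each row's rows-per-runtime figure against the other row's; that reading matches the published numbers but is not stated on the slide.

```python
def rows_per_runtime(total_rows, weighted_avg_runtime):
    """Total input rows divided by the total weighted average runtime."""
    return total_rows / weighted_avg_runtime


def percent_difference(value, baseline):
    """Relative difference of value against baseline, in percent."""
    return (value - baseline) / baseline * 100


non_threaded = rows_per_runtime(81_148_820, 303_229.40)
threaded = rows_per_runtime(88_555_273, 271_577.78)

print(f"non-threaded: {non_threaded:.2f}")
print(f"threaded:     {threaded:.2f}")
print(f"threaded vs non-threaded: {percent_difference(threaded, non_threaded):.2f}%")
print(f"non-threaded vs threaded: {percent_difference(non_threaded, threaded):.2f}%")
```

With 4 threads an ideal gain would be far larger than the roughly 22% measured, which is consistent with the note that the threaded run ran out of files to process.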
Data Integrity
Customer Satisfaction, Trust, and Growth
Is the data accurate?
How complete is the picture?
Finding the Bottleneck?
Building the System of Record
Data Integrity
Extract, File Monitoring: system exchange; reaching pre-processing phase
Transform, Data Monitoring: check for data validity; track session execution times
Load, Output File Monitoring: output files load time; follow invalid output files
Data Integrity
Event Tracking
Events recorded across E, T, and L:
Exchange Event
Pre-Process Event
Bulk Process Event
Output File Event
Database File Event
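One way to use such events is to record a timestamp per file at each stage and then ask which files never reached the final stage. The sketch below is an assumption about how that could look, not the presenters' design; the stage identifiers mirror the five events on the slide, and `record_event` and `stalled_files` are hypothetical helpers.

```python
from collections import defaultdict

# Stage order taken from the slide's five events.
STAGES = ["exchange", "pre_process", "bulk_process", "output_file", "database_file"]


def record_event(log, file_id, stage, timestamp):
    """Store the time at which a file reached a stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    log[file_id][stage] = timestamp


def stalled_files(log):
    """Map each incomplete file to the first stage it has no event for."""
    stalled = {}
    for file_id, events in log.items():
        for stage in STAGES:
            if stage not in events:
                stalled[file_id] = stage
                break
    return stalled


log = defaultdict(dict)
for stage_number, stage in enumerate(STAGES):
    record_event(log, "file_a", stage, stage_number * 60)  # file_a completes
record_event(log, "file_b", "exchange", 10)
record_event(log, "file_b", "pre_process", 75)             # file_b stops here

print(stalled_files(log))  # {'file_b': 'bulk_process'}
```

The same log also yields per-stage durations by subtracting consecutive timestamps, which is the raw material for alerting and SLA metrics.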
Data Integrity
Proven Metrics
Checking the Box on the SLA
Quantifiable numbers
Build and Track Future Growth
Handle errors and invalid data
Review of metrics may require redesign.
Key Points
Tips and Tricks
Stored Procedures, Lookups, Mapping Design
Pipelines
Pipeline ETL Processing
Stacked Pipelines and Threaded Workflows
Data Integrity
Events for ETL, Alerting, Proven Metrics