abos_12892

  • 7/27/2019 abos_12892

    ETL Implementation for Extreme Performance

    Presented By:

    Mrs. Catherine Boeving

    Mr. Greg Wade

  • 7/27/2019 abos_12892

    3/36

    3

    Topics

    About Us

    Tips and tricks for high performance mapping design

    Pipeline techniques to improve throughput

    Stacked pipelines to achieve extreme throughput

    Ensuring data integrity in an extreme environment

    Q&A

    Copyright 2012 Lockheed Martin


    About Us

    Who we are

        Catherine Boeving, Software Developer

        Greg Wade, Information Systems Architect

    What we do

        Build large scale active data warehouses with near real time data loads and high availability for Department of Defense (DoD) customers

    Where we work

        Lockheed Martin Global Systems and Solutions: a leading federal services and information technology contractor


    Our Environment

    Input

    Data Acquisition

    EDW: Teradata

    Staging: Oracle

    ETL

        SPARC Enterprise M5000

        144 GB RAM

        Oracle Solaris 10 OS

        Informatica PowerCenter and DataTransformation V9.1.0 HotFix 3

    Excess ETL server capacity is needed to achieve extreme throughput.


    Performance vs. Throughput

    Performance

        The execution time for one run of a workflow/mapping

    Throughput

        The volume of data that can be ETLed in a specified period of time

    Performance is needed to achieve throughput

    High performance workflows/mappings are not always enough to meet demanding service level agreements (SLAs)

    Demanding SLAs require both high performance and throughput


    Tips and Tricks

    Lookups

    Requirements

    Reference Data Lookups

        Validate the source data

        Expand the source data

    Stage Data Lookups

        Previous source data

        Processing of partial transactions

        Improve integration

    Carefully implemented lookups can improve mapping performance


    Tips and Tricks

    Lookups

    Approaches

    Cached

        Only one DB request

        Best used when referenced data does not change

        Better for small tables

        May be able to compensate for slow/overloaded DB

    Un-Cached

        Many DB requests

        Required when referenced data changes

        Better for large tables

        Most performance improvement with fast DB

    Selecting the correct lookup type is key to performance
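
The difference between the two approaches can be sketched in a few lines of Python. This is an illustration only, not PowerCenter code: an in-memory SQLite table stands in for the reference database, and the table name `part`, the helper functions, and the sample rows are all hypothetical.

```python
import sqlite3

def make_reference_db():
    """Build a tiny in-memory reference table (hypothetical sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE part (part_number TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany(
        "INSERT INTO part VALUES (?, ?)",
        [("PN-001", "Bolt"), ("PN-002", "Washer"), ("PN-003", "Nut")],
    )
    return conn

def cached_lookup(conn, part_numbers):
    """Cached: one DB request reads the whole table, lookups happen in memory."""
    cache = dict(conn.execute("SELECT part_number, description FROM part"))
    results = [cache.get(pn) for pn in part_numbers]
    return results, 1  # (descriptions, number of DB requests)

def uncached_lookup(conn, part_numbers):
    """Un-cached: one DB request per input row, so changed data is always seen."""
    results = []
    for pn in part_numbers:
        row = conn.execute(
            "SELECT description FROM part WHERE part_number = ?", (pn,)
        ).fetchone()
        results.append(row[0] if row else None)
    return results, len(part_numbers)

conn = make_reference_db()
source_rows = ["PN-002", "PN-999", "PN-001", "PN-002"]
print(cached_lookup(conn, source_rows))
print(uncached_lookup(conn, source_rows))
```

Both functions return the same descriptions; what differs is the number of database requests, which is the cost the slide asks you to weigh against table size and how often the referenced data changes.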


    Tips and Tricks

    Lookups

    Calculations

    Part Number Example

        1000 part numbers and descriptions

        20 character part numbers with 50 character descriptions

    Simple calculations provide some insight, but testing is needed in your environment


    Tips and Tricks

    Lookups

    Calculations

    DB Transfer of Entire Table = # rows * (characters per row) = 1000 * (20 + 50) = 70K bytes

    One Un-Cached Lookup = SQL request + SQL response = 100 + (20 + 50) = 170 bytes

    Break-even point = 70K / 170 = about 412 lookups per mapping execution

    A good estimate, but it ignores DB speed, network, etc.
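
The same estimate can be rerun for other table sizes with a small helper. This is a sketch under the slide's own assumptions (one byte per character, a 100-byte SQL request); the function and parameter names are hypothetical.

```python
def lookup_break_even(rows, key_chars, value_chars, request_bytes=100):
    """Estimate how many un-cached lookups cost as much as caching the table.

    Assumes one byte per character and ignores DB speed and network
    latency, exactly as the slide's estimate does.
    """
    table_bytes = rows * (key_chars + value_chars)          # cached: whole table once
    per_lookup_bytes = request_bytes + key_chars + value_chars  # un-cached: per row
    return table_bytes, per_lookup_bytes, table_bytes / per_lookup_bytes

table_bytes, per_lookup_bytes, break_even = lookup_break_even(1000, 20, 50)
print(table_bytes, per_lookup_bytes, round(break_even, 1))
```

If a mapping run performs fewer lookups than the break-even count, the un-cached lookup moves fewer bytes; above it, caching moves fewer.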


    Tips and Tricks

    Lookups

    Implementation

    Historical Data Load

        Cached a large table to process large amounts of data one time

        25 files/10K rows of data against 25M cached lookup

        Average 50 minute workflow execution times and 6 hours total load time

    Peer review or inspection checklists should include validating lookup type selection


    Tips and Tricks

    Stored Procedures

    Reduces Round Trips to the Database

    Combine Several Lookups with Sequence Logic

    Simplifies Complex Database Insert Logic

    Potential Loss of Data Lineage

    Controlling network chatter between ETL and the DB is essential for high performance


    Tips and Tricks

    Stored Procedures

    Implementation

    Eliminate redundant stored procedure calls


    Tips and Tricks

    Stored Procedures

    Implementation

    Ensure Matching Port and Parameter Sizes

        Mismatched parameter sizes will send extra bytes to the database

        Occurs when database and mapping development are done in parallel

    Verify the Import of Stored Procedures


    Tips and Tricks

    Mapping Design

    Goals

    Rapid Development

        Reusable components and patterns

    Understandability

        Onboard new staff with unique ETL approach

    Maintenance

        Source system updates

        Future performance tuning


    Tips and Tricks

    Mapping Design

    Mapplet Execution

    Used to handle similar code from different sources

    Smaller risk in one-time changes


    Tips and Tricks

    Mapping Design

    Worklet Execution

    Isolates performance tuning and minimizes regression testing

    Implements standards for new developers


    Tips and Tricks

    Mapping Design

    Stage Table Implementation

    Improves critical path completion

    Simplifies complex data


    Pipeline Processing

    Standard Approach

    No Pipeline

    2 Files Processed in 4 Minutes


    Pipeline Processing

    Pipeline Approach

    2 Files Processed in 3 Minutes


    Pipeline Processing

    Extract: Pre-Process

        Sort transaction types

        Split large files

    Transform: Transform in Multiple Steps

        DataTransformation (DT)

        Break sources into multiple logical parts

    Load: Smart Loading

        Decouple DB loading from transformations

        Use of external loaders

    Added complexity requires standards and review


    Pipeline Processing

    Implementing the Pipeline

    Use Flat Files Between Pipeline Steps

        High demands on the ETL server's file system

        Requires highly tuned cluster file systems

    Pipeline Steps Have Similar Run Times

    Simple Three Step Pipeline

        DT File, Workflow Output File, External DB Loader

    Flat File Movement Adds Complexity and Must Be Monitored
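
A three step pipeline of this shape can be sketched with Python threads. This is an illustration of the idea only, not the presenters' implementation: in-memory queues stand in for the flat files handed between steps, and the three step functions (`dt_parse`, `workflow_transform`, `external_load`) are hypothetical placeholders that just tag the file name.

```python
import queue
import threading

SENTINEL = None  # marks the end of the file stream

def run_stage(work, inbox, outbox):
    """Apply `work` to each item from inbox and hand the result downstream."""
    while True:
        item = inbox.get()
        if item is SENTINEL:
            outbox.put(SENTINEL)  # tell the next stage there is no more input
            return
        outbox.put(work(item))

def run_pipeline(files, steps):
    """Run each step in its own thread so different files overlap in time."""
    queues = [queue.Queue() for _ in range(len(steps) + 1)]
    threads = [
        threading.Thread(target=run_stage, args=(step, queues[i], queues[i + 1]))
        for i, step in enumerate(steps)
    ]
    for t in threads:
        t.start()
    for f in files:
        queues[0].put(f)
    queues[0].put(SENTINEL)
    loaded = []
    while True:
        item = queues[-1].get()
        if item is SENTINEL:
            break
        loaded.append(item)
    for t in threads:
        t.join()
    return loaded

# Hypothetical stand-ins for the DT step, the workflow, and the external loader.
def dt_parse(name):
    return name + ".dt"

def workflow_transform(name):
    return name + ".out"

def external_load(name):
    return name + ".loaded"

print(run_pipeline(["f1", "f2"], [dt_parse, workflow_transform, external_load]))
```

While the second step works on the first file, the first step is already working on the second file, which is where the throughput gain comes from. A single pipeline like this one keeps files in arrival order because each stage is one thread reading a FIFO queue.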


    Pipeline Processing

    Implementing the Pipeline

    Batch File Processing


    Pipeline Processing

    Stacked Pipelines

    Threaded 3x

    6 Files Processed in 3 Minutes

    Files applied to the DB in non-deterministic order


    Pipeline Processing

    Pipeline Threading

    Implementation

    Manipulate XML

    Replicate Parameter Files


    Pipeline Processing

    Example Calculations

    F = files to process = 100 files

    T = time to ETL and load a file = 5 minutes

    P = number of pipeline steps = 3 steps

    S = number of stacked pipelines = 4 pipelines

    Assume

        All pipeline steps take the same amount of time

        Ignore any overhead for intermediate files

    Estimate with your workload to see the possibilities


    Pipeline Processing

    Example Calculations Summary

    Standard Processing (No Pipeline)

        F * T = 100 files * 5 minutes = 500 minutes

    Pipeline

        (P + (F - 1)) * (T/P) = (3 + (100 - 1)) * (5/3) = 170 minutes

        294% speedup over standard processing (500 / 170, about 2.9x)

    Stacked Pipelines

        ((P + (F - 1)) * (T/P)) / S = 170 / 4 = about 43 minutes

        395% speedup over single pipeline

        1162% speedup over standard processing

    Theoretical speedup shown. Actual speedup depends on your environment.
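
To estimate with your own workload, the three formulas can be wrapped in small functions. This is a sketch using the slide's symbols (F, T, P, S) and its assumptions of equal step times and no intermediate-file overhead; the function names are hypothetical.

```python
def standard_time(files, time_per_file):
    """No pipeline: every file is processed start to finish, one after another."""
    return files * time_per_file

def pipeline_time(files, time_per_file, steps):
    """Single pipeline: (P + (F - 1)) * (T / P), assuming equal step times."""
    return (steps + (files - 1)) * (time_per_file / steps)

def stacked_time(files, time_per_file, steps, stacks):
    """Stacked pipelines: the single-pipeline time divided across S pipelines."""
    return pipeline_time(files, time_per_file, steps) / stacks

F, T, P, S = 100, 5, 3, 4
std = standard_time(F, T)
pipe = pipeline_time(F, T, P)
stack = stacked_time(F, T, P, S)
print(round(std, 1), round(pipe, 1), round(stack, 1))
print(round(std / pipe, 2), round(pipe / stack, 2), round(std / stack, 2))
```

The unrounded stacked time is 42.5 minutes, so the ratios computed here come out near 4.0x and 11.8x. The slide's 395% and 1162% appear to follow from rounding the stacked time up to 43 minutes first.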


    Pipeline Processing

    Pipeline Metrics

    2 Steps

    Standard Processing

        387 sec for one file batch

    Pipeline Processing

        Step #1: 222 sec for one file batch

        Step #2: 240 sec for one file batch

    100 File Batches Calculation

        Standard = 100 * 387 = 38700 sec

        Pipeline = (2 + (100 - 1)) * (240/2) = 12120 sec

        Speedup = 38700 / 12120 = about 319%


    Pipeline Processing

    Threaded Workflows Metrics -- 4 Threads

    Workflows      Total Runtime   Total Input     Total Weighted    Total Rows / Total         Percent
                   (Seconds)       Rows            Average Runtime   Weighted Average Runtime   Difference
    Non-Threaded   1,836,519.00    81,148,820.00   303,229.40        267.62                     -17.93%
    Threaded       5,503,758.00    88,555,273.00   271,577.78        326.08                     21.85%

    Full speedup not realized. Consider data volume when threading.

    We ran out of files to process!
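
The derived columns on this slide can be reproduced from the raw totals. The slide does not state how the percent difference was computed; the sketch below assumes it is each row's rate relative to the other row's rate, which matches the published figures. The function names are hypothetical.

```python
def rows_per_weighted_second(total_rows, weighted_avg_runtime):
    """Total input rows divided by the total weighted average runtime."""
    return total_rows / weighted_avg_runtime

def percent_difference(value, baseline):
    """Relative change of `value` against `baseline` (assumed formula)."""
    return (value - baseline) / baseline * 100

non_threaded = rows_per_weighted_second(81_148_820, 303_229.40)
threaded = rows_per_weighted_second(88_555_273, 271_577.78)

print(round(non_threaded, 2), round(threaded, 2))
print(round(percent_difference(threaded, non_threaded), 2))
print(round(percent_difference(non_threaded, threaded), 2))
```

Four threads yielded roughly a 22% gain in rows per weighted second rather than a fourfold one, consistent with the slide's remark that the threads ran out of files to process.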


    Data Integrity

    Customer Satisfaction, Trust, and Growth

    Is the data accurate?

    How complete is the picture?

    Finding the Bottleneck?

    Building the System of Record


    Data Integrity

    Extract: File Monitoring

        System exchange

        Reaching pre-processing phase

    Transform: Data Monitoring

        Check for data validity

        Track session execution times

    Load: Output File Monitoring

        Output files load time

        Follow invalid output files


    Data Integrity

    Event Tracking

    [Diagram: event markers (E) placed along the Extract, Transform, Load flow]

        Exchange Event

        Pre-Process Event

        Bulk Process Event

        Output File Event

        Database File Event
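
One way to picture event tracking is a small in-memory tracker that records which of the five events each file has produced and reports the files that have not finished. This is a sketch only; the class, the stage identifiers, and the sample file names are hypothetical, and a real system would persist events rather than hold them in memory.

```python
from collections import defaultdict

# Stage names follow the five events on the slide, in flow order.
STAGES = ["exchange", "pre_process", "bulk_process", "output_file", "database_file"]

class EventTracker:
    def __init__(self):
        self._events = defaultdict(dict)  # file name -> {stage: timestamp}

    def record(self, file_name, stage, timestamp):
        """Store one event for one file, rejecting unknown stage names."""
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._events[file_name][stage] = timestamp

    def last_stage(self, file_name):
        """Return the furthest stage a file has reached, or None if unseen."""
        seen = self._events.get(file_name, {})
        reached = [s for s in STAGES if s in seen]
        return reached[-1] if reached else None

    def incomplete(self):
        """List files that have not yet produced the final database event."""
        return sorted(
            name for name, seen in self._events.items() if STAGES[-1] not in seen
        )

tracker = EventTracker()
for stage in STAGES:
    tracker.record("a.dat", stage, timestamp=0)
tracker.record("b.dat", "exchange", timestamp=1)
tracker.record("b.dat", "pre_process", timestamp=2)
print(tracker.incomplete(), tracker.last_stage("b.dat"))
```

Comparing the timestamps of consecutive events for the same file would give per-stage durations, which is one way to find where files stall.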


    Data Integrity

    Proven Metrics

    Checking the Box on the SLA

    Quantifiable numbers

    Build and Track Future Growth

    Handle errors and invalid data

    Review of metrics may require redesign.


    Key Points

    Tips and Tricks

    Stored Procedures, Lookups, Mapping Design

    Pipelines

    Pipeline ETL Processing

    Stacked Pipelines and Threaded Workflows

    Data Integrity

    Events for ETL, Alerting, Proven Metrics
