  • 1/17/2013

    1

    DIMENSIONAL MODELING

    PROF. DRAŽENA GAŠPAR

    15.01.2013.

    BUSINESS DATA MANAGEMENT

    INFORMATION

    New teaching plan: accelerated pace

    Software for building the cube

    http://www.bi-lite.com/product/DownloadCUBEitZERO.aspx

    CUBE-it Zero Foundation - free

  • 1/17/2013

    2

    ADVANCED CONCEPTS

    1. Degenerate dimension

    2. Snowflake schema (snowflaking)

    3. Too many dimensions

    4. Surrogate keys

    5. Periodic snapshot

    6. Semi-additive values (facts)

    7. Data warehouse bus matrix

    8. Conformed dimensions/facts

    9. Slowly changing dimensions

    10. Multivalued dimensions

    DEGENERATE DIMENSION

    A dimension table without attributes.

    A degenerate dimension is data that is

    dimensional in nature but stored in a

    fact table.

    Example: a dimension that only has Order

    Number and Order Line Number

    1:1 relationship with the Fact table

    Consequences?

  • 1/17/2013

    3

    DEGENERATE DIMENSION

    Consequence of keeping it as a separate dimension table:

    Two tables with a billion rows each

    instead of one table with a billion rows.

    Treated as a degenerate dimension, Order Number and Order Line Number

    are stored in the fact table instead.

    DEGENERATE DIMENSION

    Degenerate dimensions commonly occur

    when the fact table's grain is a single

    transaction (or transaction line).

    Transaction control header numbers

    assigned by the operational business

    process are typically degenerate

    dimensions, such as order, ticket, credit

    card transaction, or check numbers.

    These degenerate dimensions are natural

    keys of the "parents" of the line items.

  • 1/17/2013

    4

    DEGENERATE DIMENSIONS

    Example:

    ORDERS TRANSACTIONS

    order#

    customer id

    customer lname

    customer fname

    shipto street address

    shipto city

    shipto state

    shipto zip

    order total amount

    discount amount

    net order amount

    payment amount

    order date

    ORDERS FACTS

    customer key

    shipto address key

    order date key

    order total amount

    discount amount

    net order amount

    payment amount

    order#

    DIM CUSTOMER

    Customer key

    customer id

    customer lname

    customer fname

    DIM SHIPTO ADDRESS

    Shipto address key

    shipto street address

    shipto city

    shipto state

    shipto zip

    DIM Order Date

    Order date key

    Calendar date

    Calendar month
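The ORDERS FACTS layout above can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not the deck's own code: the snake_case column names, the reduced column set, and the sample rows are assumptions made for the example. The point is that order_number sits in the fact table and has no dimension table of its own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, customer_id TEXT,
                           customer_lname TEXT, customer_fname TEXT);
-- order_number is the degenerate dimension: stored in the fact table,
-- with no dimension table behind it.
CREATE TABLE orders_facts (customer_key INTEGER REFERENCES dim_customer,
                           order_number TEXT,
                           order_total_amount REAL,
                           discount_amount REAL);
-- Hypothetical sample rows.
INSERT INTO dim_customer VALUES (1, 'C-100', 'Horvat', 'Ana');
INSERT INTO orders_facts VALUES (1, 'ORD-1', 120.0, 20.0), (1, 'ORD-2', 80.0, 0.0);
""")

# The order number is still available for grouping and filtering,
# without joining to a billion-row dimension table.
rows = conn.execute("""
    SELECT f.order_number, c.customer_lname,
           f.order_total_amount - f.discount_amount
    FROM orders_facts f JOIN dim_customer c USING (customer_key)
    ORDER BY f.order_number
""").fetchall()
print(rows)
```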

    SNOWFLAKING

    Normalized star schema

  • 1/17/2013

    5

    SNOWFLAKING

    Problems:

    Increases complexity for users

    Decreases performance (numerous tables and joins)

    Slows down the users' ability to browse within a dimension (example of problem: all brands within a category)
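The browsing problem can be illustrated by asking for all brands within a category against both designs. The sketch below uses sqlite3; the table names (dim_product, sf_category, sf_brand, sf_product) and the sample rows are hypothetical. Both queries return the same brands, but the snowflaked version already needs a join for a simple browse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension.
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, product TEXT,
                          brand TEXT, category TEXT);
INSERT INTO dim_product VALUES (1, 'Cola 0.5l', 'FizzCo', 'Beverages'),
                               (2, 'Cola 2l', 'FizzCo', 'Beverages'),
                               (3, 'Iced tea', 'LeafCo', 'Beverages');
-- Snowflake: the same hierarchy normalized into three tables.
CREATE TABLE sf_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE sf_brand (brand_key INTEGER PRIMARY KEY, brand TEXT,
                       category_key INTEGER REFERENCES sf_category);
CREATE TABLE sf_product (product_key INTEGER PRIMARY KEY, product TEXT,
                         brand_key INTEGER REFERENCES sf_brand);
INSERT INTO sf_category VALUES (10, 'Beverages');
INSERT INTO sf_brand VALUES (100, 'FizzCo', 10), (101, 'LeafCo', 10);
INSERT INTO sf_product VALUES (1, 'Cola 0.5l', 100), (2, 'Cola 2l', 100),
                              (3, 'Iced tea', 101);
""")

# Star schema: browse within a single table.
star = conn.execute(
    "SELECT DISTINCT brand FROM dim_product "
    "WHERE category = 'Beverages' ORDER BY brand"
).fetchall()

# Snowflake schema: the same question requires a join.
snow = conn.execute("""
    SELECT DISTINCT b.brand FROM sf_brand b
    JOIN sf_category c ON c.category_key = b.category_key
    WHERE c.category = 'Beverages' ORDER BY b.brand
""").fetchall()
```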

    TOO MANY DIMENSIONS

  • 1/17/2013

    6

    TOO MANY DIMENSIONS

    A very large number of dimensions typically is a

    sign that several dimensions are not completely

    independent and should be combined into a

    single dimension.

    It is a dimensional modeling mistake to represent

    elements of a hierarchy as separate dimensions

    in the fact table.

    SURROGATE KEYS

    Surrogate (artificial, nonnatural, synthetic) keys are integers that are assigned sequentially as

    needed to populate a dimension.

    A surrogate key is a substitution for the natural

    primary key.

    It is meaningless.

    It is just a unique identifier or number for each row

    that can be used for the primary key to the table.

    The only requirement for a surrogate primary key

    is that it is unique for each row in the table.

    The surrogate keys merely serve to join

    dimensional tables to the fact table.

    It is useful because the natural primary key (e.g.,

    Customer Number in the Customer table) can change,

    and this makes updates more difficult.

  • 1/17/2013

    7

    Advantages of using surrogate keys

    Performance

    Efficient joins

    smaller indexes

    more rows per block

    Data integrity

    When the keys in operational systems are reused

    Discontinued products, Deceased customers, etc.

    Mapping when integrating data from different sources

    Keys from different sources may be different

    Mapping table of the surrogate key and keys from different

    sources

    SURROGATE KEYS

    Advantages of using surrogate keys (Cont)

    Handling unknown or N/A values

    Ease of assigning a surrogate key value to

    rows with these values

    Tracking changes in dimensional attribute values

    Creating new attributes and assigning the

    next available surrogate key
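A minimal sketch of surrogate key assignment, assuming a hypothetical SurrogateKeyMap helper (not part of any ETL tool). It hands out sequential integers, keeps a mapping table from (source, natural key) pairs to surrogate keys, and reserves key 0 for unknown or N/A values, which is an assumed convention.

```python
from itertools import count


class SurrogateKeyMap:
    """Maps (source system, natural key) pairs to sequential integer keys."""

    UNKNOWN_KEY = 0  # assumed convention: key 0 stands for unknown / N/A

    def __init__(self):
        self._next = count(1)
        self._by_natural = {}

    def lookup(self, source, natural_key):
        """Return the surrogate key, assigning the next available one if new."""
        if natural_key is None:
            return self.UNKNOWN_KEY
        pair = (source, natural_key)
        if pair not in self._by_natural:
            self._by_natural[pair] = next(self._next)
        return self._by_natural[pair]

    def alias(self, source, natural_key, existing_key):
        """Record that another source's natural key refers to an existing member."""
        self._by_natural[(source, natural_key)] = existing_key


keys = SurrogateKeyMap()
k1 = keys.lookup("crm", "CUST-17")     # first member gets key 1
k2 = keys.lookup("crm", "CUST-42")     # next member gets key 2
keys.alias("billing", "0017", k1)      # same customer, different source key
```

Because the fact table only ever sees the integer, a natural key that is reused or renumbered in a source system affects only this mapping.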

    SURROGATE KEYS

  • 1/17/2013

    8

    Disadvantages of using surrogate keys

    Assignment and management of surrogate keys and

    appropriate substitution of these keys for natural

    keys: extra load for the ETL system

    Many ETL tools have built-in capabilities to

    support surrogate key processing

    Once the process is developed, it can be

    easily reused for other dimensions

    SURROGATE KEYS

    PERIODIC SNAPSHOT

    At predetermined intervals, snapshots at the same level of detail are taken and stacked consecutively in the fact table

    Example: most financial reports, bank account balances, inventory levels

    Complements detailed transaction facts but does not replace them

    Shares the same conformed dimensions but has fewer dimensions

  • 1/17/2013

    9

    TYPES OF FACTS

    There are three types of facts:

    Additive: Additive facts are facts that

    can be summed up through all of the

    dimensions in the fact table.

    Semi-Additive: Semi-additive facts are

    facts that can be summed up for some of

    the dimensions in the fact table, but not

    the others.

    Non-Additive: Non-additive facts are

    facts that cannot be summed up for any of

    the dimensions present in the fact table.

    ADDITIVE FACTS

    The purpose of this table is to record the sales amount for each

    product in each store on a daily basis.

    Sales_Amount is the fact. In this case, Sales_Amount is an

    additive fact, because you can sum up this fact along any of

    the three dimensions present in the fact table -- date, store,

    and product. For example, the sum of Sales_Amount for all 7

    days in a week represents the total sales amount for that week.

    Date

    Store

    Product

    Sales_Amount
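A small sketch of what "additive" means for this table, using invented sample rows. Summing Sales_Amount along any one of the three dimensions gives a meaningful total, and every breakdown adds up to the same grand total.

```python
from collections import defaultdict

# (date, store, product, sales_amount) -- hypothetical sample rows
sales = [
    ("2013-01-14", "Store A", "Milk", 10.0),
    ("2013-01-14", "Store B", "Milk", 5.0),
    ("2013-01-15", "Store A", "Bread", 7.5),
    ("2013-01-15", "Store A", "Milk", 2.5),
]


def total_by(rows, position):
    """Sum sales_amount grouped by the dimension at the given tuple position."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[position]] += row[3]
    return dict(totals)


by_date = total_by(sales, 0)
by_store = total_by(sales, 1)
by_product = total_by(sales, 2)
```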

  • 1/17/2013

    10

    SEMIADDITIVE AND NONADDITIVE FACTS

    The purpose of this table is to record the current balance for each

    account at the end of each day, as well as the profit margin for each

    account for each day.

    Current_Balance and Profit_Margin are the facts.

    Current_Balance is a semi-additive fact, as it makes sense to add

    them up for all accounts (what's the total current balance for all

    accounts in the bank?), but it does not make sense to add them up

    through time (adding up all current balances for a given account for

    each day of the month does not give us any useful information).

    Profit_Margin is a non-additive fact, for it does not make sense to

    add it up at the account level or the day level.

    Date

    Account

    Current_Balance

    Profit_Margin
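The distinction can be sketched as follows, with invented balances. Current_Balance is summed across accounts but averaged across time. For Profit_Margin, one common approach (an assumption here, since the slide does not say how margin is computed) is to store additive components such as profit and revenue and recompute the ratio after aggregating.

```python
# (date, account, current_balance) -- hypothetical sample rows
balances = [
    ("2013-01-14", "ACC-1", 100.0),
    ("2013-01-14", "ACC-2", 50.0),
    ("2013-01-15", "ACC-1", 120.0),
    ("2013-01-15", "ACC-2", 50.0),
]


def total_balance_on(rows, date):
    """Summing across accounts for one date is meaningful."""
    return sum(balance for d, _, balance in rows if d == date)


def average_daily_balance(rows, account):
    """Across time, average the balance instead of summing it."""
    values = [balance for _, a, balance in rows if a == account]
    return sum(values) / len(values)


def profit_margin(profit, revenue):
    """A ratio is non-additive: recompute it from additive components."""
    return profit / revenue


bank_total = total_balance_on(balances, "2013-01-15")
acc1_average = average_daily_balance(balances, "ACC-1")
# Two accounts with margins 0.2 and 0.1: the combined margin is not 0.3.
combined_margin = profit_margin(20.0 + 30.0, 100.0 + 300.0)
```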

    CONFORMED DIMENSIONS/FACTS

    Master or common reference dimensions

    Shared across the DW environment joining to multiple fact tables representing various business processes

    2 types

    Identical dimensions

    One dimension being a subset of a more detailed dimension

  • 1/17/2013

    11

    CONFORMED DIMENSIONS/FACTS

    Identical dimensions

    Same content, interpretation, and presentation regardless of the business process involved

    Same keys, attribute names, attribute definitions, and domain values regardless of the fact table they join to

    Example: product dimension referenced by orders and the one referenced by inventory are identical

    One dimension being a perfect subset of a more detailed, granular dimension table

    Same attribute names, definitions, and domain values

    Example: sales is linked to a dimension table at the individual product level; sales forecast is linked at the brand level

    CONFORMED DIMENSIONS

    Sales Fact Table

    Date key FK

    Product key FK

    other FKeys

    Sales quantity

    Sales amount

    Product Dimension

    Product key PK

    Product description

    SKU number

    Brand description

    Sub class description

    Class description

    Department description

    Color

    size

    Display type

    Sales Forecast Fact Table

    Month key FK

    Brand key FK

    other FKeys

    Forecast quantity

    Forecast amount

    Brand Dimension

    Brand key PK

    Brand description

    Sub class description

    Class description

    Department description

    Display type

  • 1/17/2013

    12

    CONFORMED DIMENSIONS

    Benefits

    Consistency

    Every fact table is filtered consistently and results are labeled consistently

    Integration

    Users can write drill-across queries that query each fact table (business process) separately and then join the result sets on common dimension attributes

    Reduced development time to market

    Once created, conformed dimensions are reused
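A minimal drill-across sketch, assuming tiny in-memory stand-ins for the Sales and Sales Forecast fact tables (all values are invented). Each fact table is summarized separately at the brand level, and the two result sets are then joined on the conformed Brand description attribute.

```python
from collections import defaultdict

product_dim = {1: "FizzCo", 2: "FizzCo", 3: "LeafCo"}  # product key -> brand description
sales_fact = [(1, 70), (2, 50), (3, 40)]               # (product key, sales quantity)
forecast_fact = [("FizzCo", 100), ("LeafCo", 60)]      # (brand description, forecast quantity)

# Step 1: query each fact table separately, grouped by the conformed attribute.
sales_by_brand = defaultdict(int)
for product_key, quantity in sales_fact:
    sales_by_brand[product_dim[product_key]] += quantity
forecast_by_brand = dict(forecast_fact)

# Step 2: join the two result sets on the shared brand description.
report = {
    brand: (sales_by_brand.get(brand, 0), forecast_by_brand.get(brand, 0))
    for brand in sorted(set(sales_by_brand) | set(forecast_by_brand))
}
```

The join in step 2 only works because both dimensions use identical brand descriptions; if one spelled a brand differently, the rows would not line up.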

    CONFORMED FACTS

    If facts do live in more than one fact table, the underlying definitions and equations for these facts must be the same if they are to be called the same thing.

    If facts are labeled identically, then they need to be defined in the same dimensional context and with the same units of measure from data mart to data mart.

    Examples: revenue, profit, standard prices, standard costs, measures of quality, measures of customer satisfaction and other KPIs.

  • 1/17/2013

    13

    SLOWLY CHANGING DIMENSIONS

    Dimension table attributes change infrequently

    Mini-dimensions

    Separating more frequently changing attributes into their own separate dimension table, mini-dimension

    3 techniques for handling slowly changing dimensions

    Overwrite the dimension attribute (Type 1)

    Add a new dimension row (Type 2)

    Add a new dimension attribute (Type 3)

  • 1/17/2013

    14

    SLOWLY CHANGING DIMENSIONS - OVERWRITE THE DIMENSION ATTRIBUTE

    New values overwrite old ones

    No history is kept

    Problems occur if data was previously

    aggregated based on old values

    Will not match ad-hoc aggregations based

    on new values

    Previous aggregations need to be updated

    to keep aggregated data in-sync.
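The aggregation problem can be shown with a small sketch. The customer, cities, and amounts are invented; the point is that an aggregate stored before the overwrite no longer matches one computed afterwards.

```python
customers = {1: {"name": "Ana", "city": "Mostar"}}  # dimension keyed by surrogate key
sales = [(1, 100.0), (1, 50.0)]                      # (customer key, amount)


def sales_by_city(dim, facts):
    """Aggregate fact amounts by the customer's city attribute."""
    totals = {}
    for key, amount in facts:
        city = dim[key]["city"]
        totals[city] = totals.get(city, 0.0) + amount
    return totals


stored_aggregate = sales_by_city(customers, sales)   # computed before the change
customers[1]["city"] = "Split"                       # overwrite: no history is kept
fresh_aggregate = sales_by_city(customers, sales)    # ad-hoc aggregation afterwards
```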

    SLOWLY CHANGING DIMENSIONS - ADD A NEW DIMENSION ROW

    Most popular technique

    New row with new surrogate PK is inserted into

    dimension table to reflect new attribute values

    Both old and new values are stored, along with

    effective and expiration dates and a current row

    indicator

    Example:
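One possible illustration of the add-a-new-row technique, as a sketch under assumptions: the column names (surrogate_key, natural_key, effective_date, expiration_date, is_current), the far-future HIGH_DATE marker, and the sample customer are all hypothetical.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "no expiration yet"


def apply_type2_change(rows, natural_key, changes, change_date):
    """Expire the current row for natural_key and append a new current row."""
    current = next(r for r in rows
                   if r["natural_key"] == natural_key and r["is_current"])
    current["expiration_date"] = change_date
    current["is_current"] = False
    new_row = {**current, **changes,
               "surrogate_key": max(r["surrogate_key"] for r in rows) + 1,
               "effective_date": change_date,
               "expiration_date": HIGH_DATE,
               "is_current": True}
    rows.append(new_row)
    return new_row


dim = [{"surrogate_key": 1, "natural_key": "CUST-17", "city": "Mostar",
        "effective_date": date(2012, 1, 1), "expiration_date": HIGH_DATE,
        "is_current": True}]
new_row = apply_type2_change(dim, "CUST-17", {"city": "Split"}, date(2013, 1, 15))
```

Facts loaded before the change keep pointing at surrogate key 1, so history is preserved; facts loaded afterwards use key 2.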

  • 1/17/2013

    15

    SLOWLY CHANGING DIMENSIONS - ADD A NEW DIMENSION ATTRIBUTE

    Used infrequently

    A new column is added to the dimension table

    Old value is recorded in a prior attribute column

    New value is recorded in the existing column

    All BI applications transparently use the new attribute

    Queries can be written to access values stored in the prior attribute column
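A minimal sketch of the add-a-new-attribute technique, assuming a hypothetical "prior_" column prefix and an invented product row. The existing column keeps its name, so applications that read it see the new value without changes.

```python
def apply_type3_change(row, attribute, new_value):
    """Move the current value into a 'prior_' column, then store the new value."""
    row["prior_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


product = {"product_key": 7, "department": "Education"}
apply_type3_change(product, "department", "Strategy")
```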

    DATA WAREHOUSE BUS ARCHITECTURE

    The enterprise data warehouse cannot be built in one step.

    Building isolated pieces will defeat the consistency goal.

    Need an architected, incremental approach: the data warehouse bus architecture.

    By defining a standard bus interface for the data warehouse environment, separate data marts can be implemented by different groups at different times. The separate data marts can be plugged together and usefully coexist if they adhere to the standard.

  • 1/17/2013

    16

    DATA WAREHOUSE BUS ARCHITECTURE

    [Figure: Purchase Orders, Store Inventory, and Store Sales connected to shared dimensions: Date, Product, Store, Promotion, Warehouse, Vendor, Shipper]

    DATA WAREHOUSE BUS ARCHITECTURE

    During architecture phase, team designs a

    master suite of standardized dimensions

    and facts that have uniform interpretation

    across the enterprise.

    Separate data marts are then developed

    adhering to this architecture.

  • 1/17/2013

    17

    ENTERPRISE BUS ARCHITECTURE

    Requirements are gathered and represented in the form

    of an Enterprise Data Warehouse Bus Matrix

    Each row corresponds to a business process

    Each column corresponds to a dimension of the business

    Each column is a conformed dimension

    Enterprise Data Warehouse Bus Matrix documents

    the overall data architecture for DW/BI system
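A bus matrix can be sketched as a mapping from business processes (rows) to the dimensions they use (columns). The process and dimension names come from the bus architecture figure earlier in the deck; which process uses which dimension is an assumption made for illustration.

```python
# Rows: business processes. Values: the dimension columns marked for that row.
# The individual marks are assumed, not taken from the slides.
bus_matrix = {
    "Store Sales":     {"Date", "Product", "Store", "Promotion"},
    "Store Inventory": {"Date", "Product", "Store"},
    "Purchase Orders": {"Date", "Product", "Warehouse", "Vendor", "Shipper"},
}


def processes_using(matrix, dimension):
    """Return the business processes that reference a given dimension."""
    return sorted(p for p, dims in matrix.items() if dimension in dims)


def shared_dimensions(matrix):
    """Dimensions used by more than one process must be conformed."""
    all_dims = set().union(*matrix.values())
    return sorted(d for d in all_dims if len(processes_using(matrix, d)) > 1)


conformed = shared_dimensions(bus_matrix)
```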

    ENTERPRISE BUS ARCHITECTURE MATRIX

  • 1/17/2013

    18

    ENTERPRISE BUS ARCHITECTURE MATRIX

    Possible Problems:

    Level of detail for each column and row in the matrix

    Row-related

    Listing departments/imitating organizational

    chart instead of business processes

    Listing reports and analytics related to business

    process instead of the business process itself

    Ex. Shipping orders business process supports various

    analytics such as customer ranking, sales rep

    performance, product movement analyses

    ENTERPRISE BUS ARCHITECTURE MATRIX

    Possible Problems (Cont):

    Column-related

    Generalized columns/dimensions

    Example: Entity column is too general as it includes employees, suppliers, contractors, vendors, customers

    Too many columns related to the same dimension

    Worst case when each attribute is listed separately

    Example: Product, Product Group, LOB are all related to

    the Product dimension and should be listed as one.

  • 1/17/2013

    19

    DIMENSIONAL MODELING MISTAKES TO AVOID

    Place text attributes used for constraining and grouping in a

    fact table

    Limit verbose descriptive attributes in dimensions to save space

    Split hierarchies and hierarchy levels into multiple dimensions

    Ignore the need to track dimension attribute changes

    Solve all query performance problems by adding more hardware

    Use operational or smart keys to join dimension tables to a fact

    table

    Neglect to declare and then comply with the fact table's grain

    Design the dimensional model based on a specific report

    Expect users to query the lowest-level atomic data in a

    normalized format

    Fail to conform facts and dimensions across separate fact tables

    DW 2.0

    Modeling process

  • 1/17/2013

    20

    DW 2.0 - MODELING

    The starting point for DW2.0 is the modeling

    process.

    2 basic models:

    Process model

    Data model

    The process model applies to the data mart environment.

    The data model applies to the integrated sector, the near line sector and the archival sector.

  • 1/17/2013

    21

    CORPORATE DATA MODEL

    Corporate data model must have identified and structured

    the following:

    the major subjects of the enterprise,

    the relationships between the subjects,

    the creation of an ERD (entity relationship diagram),

    for each major subject area:

    the key(s) of the subject,

    the attributes of the subject,

    the subtypes of the subject,

    the connectors of one subject area to the next,

    the grouping of attributes.

    CORPORATE DATA MODEL

  • 1/17/2013

    22

    CORPORATE DATA MODEL

    The process analysis is interesting but usually is only an adjunct to the corporate data model because the process analysis applies directly to

    the operational environment, not the data warehouse environment. It

    is the corporate data model that forms the backbone of design for the

    data warehouse, not the process analysis.

    The corporate data model is usually broken into multiple levels - a high

    level and a mid level. The high level of the corporate data model

    contains the major subject areas and how they relate.

    CORPORATE DATA MODEL

    Example of a high-level corporate data model

    Four subject areas:

    - Customer

    - Account

    - Order

    - Product

    Direct relationship between customer

    and account, between account and

    order, and between order and product.

  • 1/17/2013

    23

    CORPORATE DATA MODEL

    The next level of modeling in the corporate data model is

    the mid level of modeling. The mid level of modeling is

    the place where much of the detail of the model is found.

    The mid level of modeling contains keys, attributes,

    subtypes, groupings of attributes, and connectors.

    CORPORATE DATA MODEL

    There is a relationship between each subject area identified

    in the high level model and the mid level models. For

    each subject area identified, there is a single mid level

    model.

  • 1/17/2013

    24

    Transformation of corporate data model to DW model

    through activities:

    the removal of purely operational data,

    the addition of an element of time to the key structure of

    the data warehouse if one is not already present,

    the addition of appropriate derived data,

    the transformation of data relationships into data

    artifacts,

    accommodating the different levels of granularity found

    in the data warehouse,

    merging like data from different tables together,

    creation of arrays of data, and

    the separation of data attributes according to their

    stability characteristics.

    CORPORATE DATA MODEL

    Removing operational data

    - Estimate whether there is a reasonable chance that the

    data will be used for DSS

    CORPORATE DATA MODEL

  • 1/17/2013

    25

    Adding an element of time to the warehouse key
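A minimal sketch of this transformation, assuming a hypothetical customer record and a snapshot date as the time element. The operational key (customer_id) alone would let a later value overwrite an earlier one; extending the key with a date lets the warehouse hold both.

```python
from datetime import date

operational_customer = {"customer_id": "CUST-17", "city": "Mostar"}


def to_warehouse_row(record, snapshot_date):
    """Build a warehouse row whose key is (customer_id, snapshot_date)."""
    key = (record["customer_id"], snapshot_date)
    return key, dict(record)


warehouse = {}
for snapshot_date, city in [(date(2013, 1, 14), "Mostar"),
                            (date(2013, 1, 15), "Split")]:
    key, row = to_warehouse_row({**operational_customer, "city": city}, snapshot_date)
    warehouse[key] = row
```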

    CORPORATE DATA MODEL

    Adding derived data

    As a rule data modelers do not include derived data as part of the

    data modeling process. Consequently, corporate data models do

    not contain derived data. The reason for the omission of derived

    data is that, when derived data is included in the data model,

    the data model grows to ungainly proportions and is never

    complete.

    The next transformation that must be made to the corporate data

    model is that of adding derived data to the data warehouse data

    model where appropriate. It is appropriate to add derived data to the data warehouse data model where the

    derived data is frequently accessed and calculated once.

    The addition of derived data makes sense because it reduces the amount of processing

    required upon accessing the data in the warehouse. In addition, once properly

    calculated, there is never any doubt about the integrity of the calculation. Once the

    derived data is properly calculated, there never is the chance that someone will

    come along and use an incorrect algorithm for the calculation of the data, thus

    enhancing the credibility of data in the data warehouse.
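As a sketch of this idea, the net order amount from the earlier orders example can be treated as derived data: it is computed once at load time and stored. The formula (total minus discount) and the sample rows are assumptions made for illustration.

```python
# Hypothetical source rows as they arrive from the operational system.
orders = [
    {"order": "ORD-1", "order_total_amount": 120.0, "discount_amount": 20.0},
    {"order": "ORD-2", "order_total_amount": 80.0, "discount_amount": 0.0},
]


def add_net_order_amount(row):
    """Compute the derived value once, at load time, with a single agreed formula."""
    return {**row,
            "net_order_amount": row["order_total_amount"] - row["discount_amount"]}


warehouse_rows = [add_net_order_amount(r) for r in orders]
```

Queries then read net_order_amount directly instead of each re-implementing the calculation.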

    CORPORATE DATA MODEL

  • 1/17/2013

    26

    Adding derived data

    CORPORATE DATA MODEL

    Changing granularity of data

    CORPORATE DATA MODEL

  • 1/17/2013

    27

    Merging tables

    CORPORATE DATA MODEL

    Preconditions:

    Tables share a common key

    Data from different tables is used together frequently

    Pattern of insertion is roughly the same

    Organizing data according to its stability

    CORPORATE DATA MODEL

  • 1/17/2013

    28

  • 1/17/2013

    29

    Questions?