UPP slides 4
1/17/2013
DIMENSIONAL MODELING
PROF. DRAŽENA GAŠPAR
15.01.2013.
BUSINESS DATA MANAGEMENT
INFORMATION
New teaching plan: accelerated pace
Software for building a cube
http://www.bi-lite.com/product/DownloadCUBEitZERO.aspx
CUBE-it Zero Foundation - free
ADVANCED CONCEPTS
1. Degenerate dimension
2. Snowflake schema (snowflaking)
3. Too many dimensions
4. Surrogate keys
5. Periodic snapshot
6. Semi-additive values (facts)
7. Data warehouse bus matrix
8. Conformed dimensions/facts
9. Slowly changing dimensions
10. Multivalued dimensions
DEGENERATE DIMENSION
Dimension table without attributes.
A degenerate dimension is data that is
dimensional in nature but stored in a
fact table.
Example: a dimension that only has Order
Number and Order Line Number
1:1 relationship with the Fact table
Consequences?
DEGENERATE DIMENSION
Consequence of keeping a separate dimension table:
Two tables with a billion rows each
instead of one table with a billion rows.
It is therefore treated as a degenerate dimension: Order Number and Order Line Number
are stored in the fact table.
DEGENERATE DIMENSION
Degenerate dimensions commonly occur
when the fact table's grain is a single
transaction (or transaction line).
Transaction control header numbers
assigned by the operational business
process are typically degenerate
dimensions, such as order, ticket, credit
card transaction, or check numbers.
These degenerate dimensions are natural
keys of the "parents" of the line items.
DEGENERATE DIMENSIONS
Example:
ORDERS TRANSACTIONS
order#
customer id
customer lname
customer fname
shipto street address
shipto city
shipto state
shipto zip
order total amount
discount amount
net order amount
payment amount
order date
ORDERS FACTS
customer key
shipto address key
order date key
order total amount
discount amount
net order amount
payment amount
order#
DIM CUSTOMER
Customer key
customer id
customer lname
customer fname
DIM SHIPTO ADDRESS
Shipto address key
shipto street address
shipto city
shipto state
shipto zip
DIM Order Date
Order date key
Calendar date
Calendar month
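The ORDERS FACTS example can be sketched with an in-memory SQLite database. This is only an illustration: the table names, column names and sample rows below are assumptions, and only one of the dimensions is included. The point is that the order number lives in the fact table and can be grouped on without any join.

```python
import sqlite3

# Hypothetical schema loosely following the ORDERS FACTS example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key   INTEGER PRIMARY KEY,  -- surrogate key
    customer_id    TEXT,
    customer_lname TEXT,
    customer_fname TEXT
);
CREATE TABLE orders_facts (
    customer_key       INTEGER REFERENCES dim_customer(customer_key),
    order_number       TEXT,   -- degenerate dimension: no dimension table behind it
    order_total_amount REAL,
    discount_amount    REAL,
    net_order_amount   REAL
);
""")
conn.execute("INSERT INTO dim_customer VALUES (1, 'C-100', 'Horvat', 'Ana')")
conn.executemany(
    "INSERT INTO orders_facts VALUES (?, ?, ?, ?, ?)",
    [(1, "SO-1001", 120.0, 20.0, 100.0),
     (1, "SO-1002", 50.0, 0.0, 50.0)],
)

# Grouping by the degenerate dimension needs only the fact table.
rows = conn.execute(
    "SELECT order_number, SUM(net_order_amount) FROM orders_facts "
    "GROUP BY order_number ORDER BY order_number"
).fetchall()
print(rows)
```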
SNOWFLAKING
Normalized star schema
SNOWFLAKING
Problems:
Increases complexity for users
Decreases performance (numerous tables and joins)
Slows down the user's ability to browse within a dimension (example of the problem: listing all brands within a category)
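The browsing problem can be sketched by asking the same question, "all brands within a category", of a star-style and a snowflaked product dimension. All table names and rows here are made up for illustration. The star version filters one table, while the snowflaked version already needs a join, and would need two to reach product-level attributes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: one denormalized product dimension (hypothetical names)
CREATE TABLE dim_product_star (
    product_key INTEGER PRIMARY KEY, product_name TEXT, brand TEXT, category TEXT);
-- Snowflake: the same hierarchy normalized into three tables
CREATE TABLE dim_category (category_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE dim_brand (
    brand_key INTEGER PRIMARY KEY, brand TEXT,
    category_key INTEGER REFERENCES dim_category(category_key));
CREATE TABLE dim_product_snow (
    product_key INTEGER PRIMARY KEY, product_name TEXT,
    brand_key INTEGER REFERENCES dim_brand(brand_key));
""")
conn.executemany("INSERT INTO dim_product_star VALUES (?, ?, ?, ?)", [
    (1, "Salted chips", "Crunchy", "Snacks"),
    (2, "Paprika chips", "Crunchy", "Snacks"),
    (3, "Pretzels", "Twist", "Snacks"),
    (4, "Cola", "FizzCo", "Beverages"),
])
conn.executemany("INSERT INTO dim_category VALUES (?, ?)",
                 [(1, "Snacks"), (2, "Beverages")])
conn.executemany("INSERT INTO dim_brand VALUES (?, ?, ?)",
                 [(1, "Crunchy", 1), (2, "Twist", 1), (3, "FizzCo", 2)])
conn.executemany("INSERT INTO dim_product_snow VALUES (?, ?, ?)",
                 [(1, "Salted chips", 1), (2, "Paprika chips", 1),
                  (3, "Pretzels", 2), (4, "Cola", 3)])

# Star schema: browse the hierarchy inside a single table.
star_brands = [r[0] for r in conn.execute(
    "SELECT DISTINCT brand FROM dim_product_star "
    "WHERE category = ? ORDER BY brand", ("Snacks",))]

# Snowflake schema: the same question requires a join.
snow_brands = [r[0] for r in conn.execute(
    "SELECT b.brand FROM dim_brand b "
    "JOIN dim_category c ON c.category_key = b.category_key "
    "WHERE c.category = ? ORDER BY b.brand", ("Snacks",))]
print(star_brands, snow_brands)
```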
TOO MANY DIMENSIONS
A very large number of dimensions typically is a
sign that several dimensions are not completely
independent and should be combined into a
single dimension.
It is a dimensional modeling mistake to represent
elements of a hierarchy as separate dimensions
in the fact table.
SURROGATE KEYS
Surrogate (artificial, nonnatural, synthetic) keys are integers that are assigned sequentially as
needed to populate a dimension.
A surrogate key is a substitution for the natural
primary key.
It is meaningless.
It is just a unique identifier or number for each row
that can be used for the primary key to the table.
The only requirement for a surrogate primary key
is that it is unique for each row in the table.
The surrogate keys merely serve to join
dimensional tables to the fact table.
It is useful because the natural primary key (e.g.
Customer Number in the Customer table) can change,
and this makes updates more difficult.
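A minimal sketch of sequential surrogate key assignment, assuming a simple in-memory generator. The class name `SurrogateKeyGenerator` and the sample natural keys are hypothetical. In practice this is usually handled by the ETL tool or a database sequence.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hands out meaningless sequential integers, one per natural key."""

    def __init__(self):
        self._next_key = count(1)
        self._by_natural_key = {}

    def key_for(self, natural_key):
        # Reuse the existing surrogate key if this natural key was seen before.
        if natural_key not in self._by_natural_key:
            self._by_natural_key[natural_key] = next(self._next_key)
        return self._by_natural_key[natural_key]


customer_keys = SurrogateKeyGenerator()
k1 = customer_keys.key_for("C-100")
k2 = customer_keys.key_for("C-200")
k1_again = customer_keys.key_for("C-100")
print(k1, k2, k1_again)
```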
Advantages of using surrogate keys
Performance
Efficient joins
smaller indexes
more rows per block
Data integrity
When the keys in operational systems are reused
Discontinued products, Deceased customers, etc.
Mapping when integrating data from different sources
Keys from different sources may be different
Mapping table of the surrogate key and keys from different
sources
SURROGATE KEYS
Advantages of using surrogate keys (Cont)
Handling unknown or N/A values
Ease of assigning a surrogate key value to
rows with these values
Tracking changes in dimensional attribute values
Creating new rows for changed attribute values and assigning the
next available surrogate key
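Two of these advantages, source-key mapping and unknown values, can be sketched as a lookup step during fact loading. The source system names, the sample keys and the convention of reserving key 0 for "unknown" are assumptions made for this example.

```python
UNKNOWN_KEY = 0  # assumed convention: a reserved dimension row for unknown / N/A

# Mapping table: (source system, natural key) -> surrogate key.
key_map = {
    ("crm", "C-100"): 1,
    ("billing", "0000100"): 1,  # same customer, different key in another source
    ("crm", "C-200"): 2,
}


def lookup_customer_key(source, natural_key):
    """Return the surrogate key, falling back to the unknown row."""
    if natural_key is None:
        return UNKNOWN_KEY
    return key_map.get((source, natural_key), UNKNOWN_KEY)


incoming = [  # (source, natural key, amount): hypothetical rows
    ("crm", "C-100", 40.0),
    ("billing", "0000100", 60.0),
    ("billing", None, 5.0),
]
fact_rows = [(lookup_customer_key(src, nk), amount) for src, nk, amount in incoming]
print(fact_rows)
```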
SURROGATE KEYS
Disadvantages of using surrogate keys
Assignment and management of surrogate keys and
appropriate substitution of these keys for natural
keys: extra load for the ETL system
Many ETL tools have built-in capabilities to
support surrogate key processing
Once the process is developed, it can be
easily reused for other dimensions
SURROGATE KEYS
PERIODIC SNAPSHOT
At predetermined intervals, snapshots at the same level of detail are taken and stacked consecutively in the fact table
Example: most financial reports, bank account value, inventory level
Complements detailed transaction facts but does not replace them
Shares the same conformed dimensions but has fewer dimensions
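As a rough sketch of the idea, daily inventory snapshots can be derived from transaction rows and stacked one day after another. The function name `daily_snapshots` and the sample transactions are hypothetical.

```python
from datetime import date, timedelta

transactions = [  # (date, product, quantity change): hypothetical rows
    (date(2013, 1, 14), "P1", +10),
    (date(2013, 1, 14), "P1", -3),
    (date(2013, 1, 16), "P1", -2),
]


def daily_snapshots(transactions, start, end):
    """Return one (day, product, level) row per product per day."""
    pending = sorted(transactions)
    levels = {}
    snapshots = []
    i = 0
    day = start
    while day <= end:
        # Apply every transaction up to and including this day.
        while i < len(pending) and pending[i][0] <= day:
            _, product, delta = pending[i]
            levels[product] = levels.get(product, 0) + delta
            i += 1
        # A snapshot row is written even on days without transactions.
        for product in sorted(levels):
            snapshots.append((day, product, levels[product]))
        day += timedelta(days=1)
    return snapshots


snapshots = daily_snapshots(transactions, date(2013, 1, 14), date(2013, 1, 16))
for row in snapshots:
    print(row)
```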
TYPES OF FACTS
There are three types of facts:
Additive: Additive facts are facts that
can be summed up through all of the
dimensions in the fact table.
Semi-Additive: Semi-additive facts are
facts that can be summed up for some of
the dimensions in the fact table, but not
the others.
Non-Additive: Non-additive facts are
facts that cannot be summed up for any of
the dimensions present in the fact table.
ADDITIVE FACTS
The purpose of this table is to record the sales amount for each
product in each store on a daily basis.
Sales_Amount is the fact. In this case, Sales_Amount is an
additive fact, because you can sum up this fact along any of
the three dimensions present in the fact table -- date, store,
and product. For example, the sum of Sales_Amount for all 7
days in a week represents the total sales amount for that week.
Date
Store
Product
Sales_Amount
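A small sketch of what "additive" means for this table: Sales_Amount can be summed along date, store or product, and every roll-up gives the same grand total. The sample rows and the helper `total_by` are made up for illustration.

```python
from collections import defaultdict

sales = [  # (date, store, product, sales_amount): hypothetical rows
    ("2013-01-14", "S1", "P1", 100),
    ("2013-01-14", "S2", "P1", 50),
    ("2013-01-15", "S1", "P2", 30),
]


def total_by(rows, dim_index):
    """Sum sales_amount grouped by the dimension at position dim_index."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[dim_index]] += row[3]
    return dict(totals)


by_date = total_by(sales, 0)
by_store = total_by(sales, 1)
by_product = total_by(sales, 2)
print(by_date, by_store, by_product)
```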
SEMI-ADDITIVE AND NON-ADDITIVE FACTS
The purpose of this table is to record the current balance for each
account at the end of each day, as well as the profit margin for each
account for each day.
Current_Balance and Profit_Margin are the facts.
Current_Balance is a semi-additive fact: it makes sense to add it up
across all accounts (what is the total current balance for all
accounts in the bank?), but it does not make sense to add it up
through time (adding up the current balances of a given account for
each day of the month does not give us any useful information).
Profit_Margin is a non-additive fact: it does not make sense to
add it up at either the account level or the day level.
Date
Account
Current_Balance
Profit_Margin
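The difference can be sketched with a few made-up rows. The revenue and profit columns used for the margin are an assumption and are not part of the table on the slide. They are there to show that a ratio has to be recomputed from additive components rather than summed.

```python
balances = [  # (date, account, current_balance): hypothetical rows
    ("2013-01-14", "A", 100), ("2013-01-14", "B", 200),
    ("2013-01-15", "A", 120), ("2013-01-15", "B", 200),
]

# Semi-additive: summing across accounts for one day is meaningful.
total_on_15th = sum(bal for day, _, bal in balances if day == "2013-01-15")

# Across time, use the closing value or an average instead of a sum.
account_a = [bal for day, acct, bal in sorted(balances) if acct == "A"]
closing_a = account_a[-1]
average_a = sum(account_a) / len(account_a)

# Non-additive: a ratio is recomputed from additive components
# (revenue and profit are assumed columns, not part of the slide's table).
components = [(100, 10), (300, 60)]  # (revenue, profit) per account
total_revenue = sum(rev for rev, _ in components)
total_profit = sum(prof for _, prof in components)
overall_margin = total_profit / total_revenue
sum_of_margins = sum(prof / rev for rev, prof in components)  # not a meaningful number

print(total_on_15th, closing_a, average_a, overall_margin)
```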
CONFORMED DIMENSIONS/FACTS
Master or common reference dimensions
Shared across the DW environment joining to multiple fact tables representing various business processes
2 types
Identical dimensions
One dimension being a subset of a more detailed dimension
CONFORMED DIMENSIONS/FACTS
Identical dimensions
Same content, interpretation, and presentation regardless of the business process involved
Same keys, attribute names, attribute definitions, and domain values regardless of the fact table they join to
Example: the product dimension referenced by orders and the one referenced by inventory are identical
One dimension being a perfect subset of a more detailed, granular dimension table
Same attribute names, definitions, and domain values
Example: sales is linked to a dimension table at the individual product level; the sales forecast is linked at the brand level
CONFORMED DIMENSIONS
Sales Fact Table
Date key FK
Product key FK
other FKeys
Sales quantity
Sales amount
Product Dimension
Product key PK
Product description
SKU number
Brand description
Sub class description
Class description
Department description
Color
size
Display type
Sales Forecast Fact Table
Month key FK
Brand key FK
other FKeys
Forecast quantity
Forecast amount
Brand Dimension
Brand key PK
Brand description
Sub class description
Class description
Department description
Display type
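One way to keep the Brand Dimension a perfect subset of the Product Dimension is to derive it from the product rows instead of maintaining it separately. The sketch below assumes a reduced set of attributes and made-up rows, and the function `roll_up_to_brand` is hypothetical.

```python
product_dim = [  # hypothetical rows with a reduced set of attributes
    {"product_key": 1, "product_description": "Cola 0.5 l",
     "brand_description": "FizzCo", "department_description": "Beverages"},
    {"product_key": 2, "product_description": "Cola 2 l",
     "brand_description": "FizzCo", "department_description": "Beverages"},
    {"product_key": 3, "product_description": "Salted chips",
     "brand_description": "Crunchy", "department_description": "Snacks"},
]

# Attributes that exist at the brand level; names and values are copied
# unchanged so both dimensions use the same labels and domain values.
BRAND_LEVEL_ATTRIBUTES = ("brand_description", "department_description")


def roll_up_to_brand(product_rows):
    """Build the brand dimension as the distinct brand-level attribute sets."""
    brand_keys = {}
    for row in product_rows:
        attrs = tuple(row[name] for name in BRAND_LEVEL_ATTRIBUTES)
        brand_keys.setdefault(attrs, len(brand_keys) + 1)
    return [dict(zip(BRAND_LEVEL_ATTRIBUTES, attrs), brand_key=key)
            for attrs, key in brand_keys.items()]


brand_dim = roll_up_to_brand(product_dim)
print(brand_dim)
```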
CONFORMED DIMENSIONS
Benefits
Consistency
Every fact table is filtered consistently and results are labeled consistently
Integration
Users can create queries that drill across fact tables representing different processes individually and then join the result sets on common dimension attributes
Reduced development time to market
Once created, conformed dimensions are reused
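Drilling across can be sketched as two separate result sets, one per fact table, merged on a shared attribute value. The brand labels and numbers are invented, and `drill_across` is a hypothetical helper. It works only because both result sets use the same conformed brand labels.

```python
# Each fact table is queried on its own, grouped by the conformed attribute.
sales_by_brand = {"FizzCo": 150, "Crunchy": 80}                      # from the sales facts
forecast_by_brand = {"FizzCo": 140, "Crunchy": 100, "NewBrand": 20}  # from the forecast facts


def drill_across(*result_sets):
    """Merge result sets on their common row label; missing values become None."""
    labels = sorted(set().union(*result_sets))
    return [(label, *(rs.get(label) for rs in result_sets)) for label in labels]


report = drill_across(sales_by_brand, forecast_by_brand)
for row in report:
    print(row)
```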
CONFORMED FACTS
If facts do live in more than one fact table, the underlying definitions and equations for these facts must be the same if they are to be called the same thing.
If facts are labeled identically, then they need to be defined in the same dimensional context and with the same units of measure from data mart to data mart.
Examples: revenue, profit, standard prices, standard costs, measures of quality, measures of customer satisfaction and other KPIs.
SLOWLY CHANGING DIMENSIONS
Dimension table attributes change infrequently
Mini-dimensions
Separating more frequently changing attributes into their own separate dimension table, mini-dimension
3 types of handling slowly changing dimensions
Overwrite the dimension attribute
Add a new dimension row
Add a new dimension attribute
SLOWLY CHANGING DIMENSIONS - OVERWRITE THE DIMENSION ATTRIBUTE
New values overwrite old ones
No history is kept
Problems occur if data was previously
aggregated based on old values
Will not match ad-hoc aggregations based
on new values
Previous aggregations need to be updated
to keep aggregated data in-sync.
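The aggregation problem can be shown in a few lines. The customer row, the city values and the helper `sales_by_city` are assumptions for illustration.

```python
customer_dim = {1: {"customer_id": "C-100", "city": "Mostar"}}
sales_facts = [(1, 100), (1, 50)]  # (customer_key, sales_amount)


def sales_by_city(dim, facts):
    """Aggregate sales by the customer's city as currently stored in the dimension."""
    totals = {}
    for customer_key, amount in facts:
        city = dim[customer_key]["city"]
        totals[city] = totals.get(city, 0) + amount
    return totals


# An aggregate computed and stored before the change.
stored_aggregate = sales_by_city(customer_dim, sales_facts)

# Overwrite: the new value replaces the old one, no history is kept.
customer_dim[1]["city"] = "Split"

# An ad-hoc aggregation after the change no longer matches the stored one.
adhoc_aggregate = sales_by_city(customer_dim, sales_facts)
print(stored_aggregate, adhoc_aggregate)
```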
SLOWLY CHANGING DIMENSIONS - ADD A NEW DIMENSION ROW
Most popular technique
New row with new surrogate PK is inserted into
dimension table to reflect new attribute values
Both old and new values are stored, along with
effective and expiration dates and the current row
indicator
Example:
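A minimal sketch of this technique, assuming a customer dimension held as a list of dictionaries. The column names, the far-future expiration date and the rule that the old row expires on the change date are all assumptions of this example.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed "open-ended" expiration date

dim_rows = [{
    "customer_key": 1, "customer_id": "C-100", "city": "Mostar",
    "effective_date": date(2012, 1, 1), "expiration_date": HIGH_DATE,
    "is_current": True,
}]


def add_new_row(dim_rows, natural_key, changes, change_date):
    """Expire the current row and insert a new row with a new surrogate key."""
    current = next(r for r in dim_rows
                   if r["customer_id"] == natural_key and r["is_current"])
    current["expiration_date"] = change_date
    current["is_current"] = False
    new_row = {
        **current,
        **changes,
        "customer_key": max(r["customer_key"] for r in dim_rows) + 1,
        "effective_date": change_date,
        "expiration_date": HIGH_DATE,
        "is_current": True,
    }
    dim_rows.append(new_row)
    return new_row


new_row = add_new_row(dim_rows, "C-100", {"city": "Split"}, date(2013, 1, 15))
for row in dim_rows:
    print(row)
```

Existing fact rows keep pointing at surrogate key 1, so history stays attached to the old attribute values.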
SLOWLY CHANGING DIMENSIONS - ADD A NEW DIMENSION ATTRIBUTE
Used infrequently
A new column is added to the dimension table
Old value is recorded in a prior attribute column
New value is recorded in the existing column
All BI applications transparently use the new attribute
Queries can be written to access values stored in the prior attribute column
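A short sketch of the same idea, assuming the prior column is simply named with a `prior_` prefix. The helper name and the sample values are hypothetical.

```python
def add_new_attribute(row, attribute, new_value):
    """Keep the old value in a 'prior_' column and store the new value in place."""
    row["prior_" + attribute] = row[attribute]
    row[attribute] = new_value
    return row


product_row = {"product_key": 7, "category": "Snacks"}
add_new_attribute(product_row, "category", "Salty snacks")
print(product_row)
```

Only one prior value is kept, so a second change would overwrite the first prior value.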
DATA WAREHOUSE BUS ARCHITECTURE
The enterprise data warehouse cannot be built in one step.
Building isolated pieces will defeat the consistency goal.
An architected, incremental approach is needed: the data warehouse bus architecture.
By defining a standard bus interface for the data warehouse environment, separate data marts can be implemented by different groups at different times. The separate data marts can be plugged together and usefully coexist if they adhere to the standard.
DATA WAREHOUSE BUS ARCHITECTURE
Business processes (rows): Purchase Orders, Store Inventory, Store Sales
Dimensions (columns): Date, Product, Store, Promotion, Warehouse, Vendor, Shipper
DATA WAREHOUSE BUS ARCHITECTURE
During the architecture phase, the team designs a
master suite of standardized dimensions
and facts that have a uniform interpretation
across the enterprise.
Separate data marts are then developed
adhering to this architecture.
ENTERPRISE BUS ARCHITECTURE
Requirements are gathered and represented in the form
of an Enterprise Data Warehouse Bus Matrix
Each row corresponds to a business process
Each column corresponds to a dimension of the business
Each column is a conformed dimension
The Enterprise Data Warehouse Bus Matrix documents
the overall data architecture for the DW/BI system
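A bus matrix can be sketched as a mapping from business process to the set of dimensions it uses. The process and dimension names below follow the retail example in this deck, but which cells are marked is an assumption made for illustration.

```python
# Rows: business processes. Values: dimensions marked for that process.
# The marks themselves are assumed for this sketch.
bus_matrix = {
    "Store Sales": {"Date", "Product", "Store", "Promotion"},
    "Store Inventory": {"Date", "Product", "Store"},
    "Purchase Orders": {"Date", "Product", "Warehouse", "Vendor", "Shipper"},
}


def shared_dimensions(matrix, process_a, process_b):
    """Dimensions on which two processes can be drilled across."""
    return matrix[process_a] & matrix[process_b]


def processes_using(matrix, dimension):
    """Business processes that must conform to the given dimension."""
    return sorted(p for p, dims in matrix.items() if dimension in dims)


common_to_all = set.intersection(*bus_matrix.values())
print(shared_dimensions(bus_matrix, "Store Sales", "Store Inventory"))
print(processes_using(bus_matrix, "Vendor"))
print(common_to_all)
```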
ENTERPRISE BUS ARCHITECTURE MATRIX
Possible Problems:
Level of detail for each column and row in the matrix
Row-related
Listing departments/imitating organizational
chart instead of business processes
Listing reports and analytics related to business
process instead of the business process itself
Ex. Shipping orders business process supports various
analytics such as customer ranking, sales rep
performance, product movement analyses
ENTERPRISE BUS ARCHITECTURE MATRIX
Possible Problems (Cont):
Column-related
Generalized columns/dimensions
Example: Entity column is too general as it includes employees, suppliers, contractors, vendors, customers
Too many columns related to the same dimension
Worst case when each attribute is listed separately
Example: Product, Product Group, LOB are all related to
the Product dimension and should be listed as one.
DIMENSIONAL MODELING MISTAKES TO AVOID
Place text attributes used for constraining and grouping in a
fact table
Limit verbose descriptive attributes in dimensions to save space
Split hierarchies and hierarchy levels into multiple dimensions
Ignore the need to track dimension attribute changes
Solve all query performance problems by adding more hardware
Use operational or smart keys to join dimension tables to a fact
table
Neglect to declare and then comply with the fact table's grain
Design the dimensional model based on a specific report
Expect users to query the lowest-level atomic data in a
normalized format
Fail to conform facts and dimensions across separate fact tables
DW 2.0
Modeling process
DW 2.0 - MODELING
The starting point for DW2.0 is the modeling
process.
2 basic models:
Process model
Data model
The process model applies to the data mart environment.
The data model applies to the integrated sector, the near-line sector and the archival sector.
CORPORATE DATA MODEL
The corporate data model must identify and structure
the following:
the major subjects of the enterprise,
the relationships between the subjects,
the creation of an ERD (entity relationship diagram),
for each major subject area:
the key(s) of the subject,
the attributes of the subject,
the subtypes of the subject,
the connectors of one subject area to the next,
the grouping of attributes.
CORPORATE DATA MODEL
The process analysis is interesting but usually is only an adjunct to the corporate data model because the process analysis applies directly to
the operational environment, not the data warehouse environment. It
is the corporate data model that forms the backbone of design for the
data warehouse, not the process analysis.
The corporate data model is usually broken into multiple levels - a high
level and a mid level. The high level of the corporate data model
contains the major subject areas and how they relate.
CORPORATE DATA MODEL
Example of a high-level corporate data model
Four subject areas:
- Customer
- Account
- Order
- Product
Direct relationship between customer
and account, between account and
order, and between order and product.
CORPORATE DATA MODEL
The next level of modeling in the corporate data model is
the mid level of modeling. The mid level of modeling is
the place where much of the detail of the model is found.
The mid level of modeling contains keys, attributes,
subtypes, groupings of attributes, and connectors.
CORPORATE DATA MODEL
There is a relationship between each subject area identified
in the high level model and the mid level models. For
each subject area identified, there is a single mid level
model.
Transformation of corporate data model to DW model
through activities:
the removal of purely operational data,
the addition of an element of time to the key structure of
the data warehouse if one is not already present,
the addition of appropriate derived data,
the transformation of data relationships into data
artifacts,
accommodating the different levels of granularity found
in the data warehouse,
merging like data from different tables together,
creation of arrays of data, and
the separation of data attributes according to their
stability characteristics.
CORPORATE DATA MODEL
Removing operational data
- Estimate whether there is a reasonable chance that the
data will be used for DSS
CORPORATE DATA MODEL
Adding an element of time to the warehouse key
CORPORATE DATA MODEL
Adding derived data
As a rule, data modelers do not include derived data as part of the
data modeling process. Consequently, corporate data models do
not contain derived data. The reason for the omission is that when
derived data is included in the data model, the data model grows to
ungainly proportions and is never complete.
The next transformation that must be made to the corporate data
model is adding derived data to the data warehouse data
model where appropriate. It is appropriate where the derived data is
frequently accessed and calculated once.
The addition of derived data makes sense because it reduces the amount of processing
required when accessing the data in the warehouse. In addition, once the derived data is
properly calculated, there is no doubt about the integrity of the calculation: nobody can
come along and use an incorrect algorithm to calculate the data, which
enhances the credibility of data in the data warehouse.
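As a small sketch of the idea, derived values can be computed once when a row is loaded and then stored, so that every later query reads the same result. The column names and the choice of revenue and gross profit as derived data are assumptions for this example.

```python
def build_fact_row(quantity, unit_price_cents, unit_cost_cents):
    """Compute derived values once at load time and store them with the row."""
    revenue_cents = quantity * unit_price_cents
    cost_cents = quantity * unit_cost_cents
    return {
        "quantity": quantity,
        "unit_price_cents": unit_price_cents,
        "unit_cost_cents": unit_cost_cents,
        # Derived data: calculated once, read many times.
        "revenue_cents": revenue_cents,
        "gross_profit_cents": revenue_cents - cost_cents,
    }


row = build_fact_row(quantity=3, unit_price_cents=250, unit_cost_cents=200)
print(row["revenue_cents"], row["gross_profit_cents"])
```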
CORPORATE DATA MODEL
Adding derived data
CORPORATE DATA MODEL
Changing granularity of data
CORPORATE DATA MODEL
Merging tables
CORPORATE DATA MODEL
Preconditions:
Tables share a common key
Data from different tables is used together frequently
Pattern of insertion is roughly the same
Organizing data according to its stability
CORPORATE DATA MODEL
Questions?