7/29/2019 Dw-itm 1st 2ndClass
Data Warehousing and OLAP
What is a data warehouse?
A multi-dimensional data model
Data warehouse architecture
Data warehouse implementation
Further development of data cube technology
What is a Data Warehouse?
A decision support database that is maintained separately from the organization's operational database.
Supports information processing by providing a solid platform of consolidated, historical data for analysis.
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
Data warehousing: the process of constructing and using data warehouses.
Data Warehouse: Subject-Oriented
Organized around major subjects, such as customer, product, sales.
Focuses on the modeling and analysis of data for decision makers, not on daily operations or transaction processing.
Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
Data Warehouse: Integrated
Constructed by integrating multiple, heterogeneous data sources: relational databases, flat files, on-line transaction records.
Data cleaning and data integration techniques are applied.
Ensure consistency in naming conventions, encoding structures, attribute measures, etc. among different data sources.
E.g., hotel price: currency, tax, breakfast covered, etc.
When data is moved to the warehouse, it is converted.
Data Warehouse: Time-Variant
The time horizon for the data warehouse is significantly longer than that of operational systems.
Operational database: current value data.
Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years).
Every key structure in the data warehouse contains an element of time, explicitly or implicitly.
But the key of operational data may or may not contain a "time element".
Data Warehouse: Nonvolatile
A physically separate store of data transformed from the operational environment.
Operational update of data does not occur in the data warehouse environment.
Does not require transaction processing, recovery, and concurrency control mechanisms.
Requires only two operations in data accessing: initial loading of data and access of data.
Data Warehouse vs. Heterogeneous DBMS
Traditional heterogeneous DB integration: build wrappers/mediators on top of heterogeneous databases.
Query-driven approach
When a query is posed to a client site, a meta-dictionary is used to translate the query into queries appropriate for the individual heterogeneous sites involved, and the results are integrated into a global answer set.
Complex information filtering, compete for resources
Data warehouse: update-driven, high performance
Information from heterogeneous sources is integrated in advance.
Data Warehouse vs. Operational DBMS
OLTP (on-line transaction processing)
Major task of traditional relational DBMS
Day-to-day operations: purchasing, inventory, banking, etc.
OLAP (on-line analytical processing)
Major task of data warehouse system
Data analysis and decision making
Distinct features (OLTP vs. OLAP):
Data contents: current, detailed vs. historical, consolidated
Database design: ER + application vs. star + subject
View: current, local vs. evolutionary, integrated
Access patterns: update vs. read-only but complex queries
OLTP vs. OLAP
                    OLTP                              OLAP
users               clerk, IT professional            knowledge worker
function            day-to-day operations             decision support
DB design           application-oriented              subject-oriented
data                current, up-to-date,              historical, summarized,
                    detailed, flat relational,        multidimensional,
                    isolated                          integrated, consolidated
access              read/write,                       lots of scans
                    index/hash on prim. key
unit of work        short, simple transaction         complex query
# records accessed  tens                              millions
# users             thousands                         hundreds
DB size             100MB-GB                          100GB-TB
metric              transaction throughput            query throughput, response
Why a Separate Data Warehouse?
High performance for both systems
DBMS tuned for OLTP: access methods, indexing, concurrency control, recovery
Warehouse tuned for OLAP: complex OLAP queries, multidimensional view, consolidation
Different functions and different data:
missing data: decision support requires historical data which operational DBs do not typically maintain
data consolidation: DS requires consolidation (aggregation, summarization) of data from heterogeneous sources
data quality: different sources typically use inconsistent data representations, codes and formats which have to be reconciled
From Tables and Spreadsheets to Data Cubes
A data warehouse is based on a multidimensional data model which views data in the form of a data cube.
A data cube, such as sales, allows data to be modeled and viewed in multiple dimensions.
Dimension tables, such as item (item_name, brand, type), or time (day, week, month, quarter, year)
Fact table contains measures (such as dollars_sold) and keys to each of the related dimension tables.
In data warehousing literature, an n-D base cube is called a base cuboid. The topmost 0-D cuboid, which holds the highest level of summarization, is called the apex cuboid. The lattice of cuboids forms a data cube.
Cube: A Lattice of Cuboids
0-D (apex) cuboid: all
1-D cuboids: time; item; location; supplier
2-D cuboids: time,item; time,location; time,supplier; item,location; item,supplier; location,supplier
3-D cuboids: time,item,location; time,item,supplier; time,location,supplier; item,location,supplier
4-D (base) cuboid: time, item, location, supplier
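The lattice for the four dimensions time, item, location, and supplier can be enumerated mechanically: each cuboid is one subset of the dimension set. The following is a minimal Python sketch, assuming no concept hierarchies on the dimensions (with hierarchies the number of cuboids grows further). The function name `cuboid_lattice` is illustrative only.

```python
from itertools import combinations


def cuboid_lattice(dimensions):
    """Group every subset of `dimensions` by its size k (a k-D cuboid)."""
    return {
        k: list(combinations(dimensions, k))
        for k in range(len(dimensions) + 1)
    }


dims = ("time", "item", "location", "supplier")
lattice = cuboid_lattice(dims)

for k, cuboids in lattice.items():
    # The empty subset is the apex cuboid, conventionally written "all".
    names = [",".join(c) if c else "all" for c in cuboids]
    print(f"{k}-D cuboids ({len(cuboids)}): {names}")

# With n dimensions and no hierarchies there are 2**n cuboids in total.
total = sum(len(cuboids) for cuboids in lattice.values())
print("total cuboids:", total)
```

For four dimensions this yields 1 + 4 + 6 + 4 + 1 = 16 cuboids, from the apex cuboid down to the base cuboid.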
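Conceptual Modeling of Data Warehouses
Star schema: A fact table in the middle connected to a set of dimension tables.
Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake.
Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation.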
Example of Star Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
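A star schema can be tried out with Python's built-in `sqlite3` module. The sketch below is a cut-down illustration, not the full schema from the slide: it keeps only the time and item dimensions, the table names `dim_time`, `dim_item`, and `fact_sales` are assumed names, and all rows are made-up sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimension tables (subset of the slide's attributes; names are assumptions).
CREATE TABLE dim_time (time_key INTEGER PRIMARY KEY, quarter TEXT, year INTEGER);
CREATE TABLE dim_item (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
-- Fact table: foreign keys to each dimension plus the measures.
CREATE TABLE fact_sales (
    time_key     INTEGER REFERENCES dim_time(time_key),
    item_key     INTEGER REFERENCES dim_item(item_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Hypothetical sample rows.
conn.executemany("INSERT INTO dim_time VALUES (?, ?, ?)",
                 [(1, "Q1", 2003), (2, "Q2", 2003)])
conn.executemany("INSERT INTO dim_item VALUES (?, ?, ?)",
                 [(1, "TV", "BrandA"), (2, "Radio", "BrandB")])
conn.executemany("INSERT INTO fact_sales VALUES (?, ?, ?, ?)",
                 [(1, 1, 2, 800.0), (1, 2, 5, 250.0), (2, 1, 1, 400.0)])

# A typical star join: the fact table joined to its dimensions, then aggregated.
rows = conn.execute("""
    SELECT t.quarter, i.item_name, SUM(s.dollars_sold)
    FROM fact_sales AS s
    JOIN dim_time AS t ON s.time_key = t.time_key
    JOIN dim_item AS i ON s.item_key = i.item_key
    GROUP BY t.quarter, i.item_name
    ORDER BY t.quarter, i.item_name
""").fetchall()

for row in rows:
    print(row)
```

Each fact row carries only keys and measures; descriptive attributes such as `quarter` and `item_name` live in the dimension tables and are reached through the joins.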
Example of Snowflake Schema
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures: units_sold, dollars_sold, avg_sales
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_key
supplier: supplier_key, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city_key
city: city_key, city, province_or_street, country
Example of Fact Constellation
Sales Fact Table: time_key, item_key, branch_key, location_key
Measures (sales): units_sold, dollars_sold, avg_sales
Shipping Fact Table: time_key, item_key, shipper_key, from_location, to_location
Measures (shipping): dollars_cost, units_shipped
time: time_key, day, day_of_the_week, month, quarter, year
item: item_key, item_name, brand, type, supplier_type
branch: branch_key, branch_name, branch_type
location: location_key, street, city, province_or_street, country
shipper: shipper_key, shipper_name, location_key, shipper_type
Measures: How are measures computed?
A data cube measure is a numerical function that can be evaluated at each point in the data cube space. A measure value is computed for a given point by aggregating the data corresponding to the respective dimension-value pairs defining the given point.
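As a small illustration of evaluating a measure at a point of the cube space, the sketch below sums a `dollars_sold` value over all fact rows that match a set of dimension-value pairs. The fact rows, the dimension names in `DIMS`, and the helper `measure_at` are hypothetical; dimensions left out of the point are treated as aggregated over ("all").

```python
# Hypothetical fact rows: (time, item, location, dollars_sold).
FACTS = [
    ("Q1", "TV", "Vancouver", 800.0),
    ("Q1", "TV", "Toronto", 600.0),
    ("Q1", "Radio", "Vancouver", 250.0),
    ("Q2", "TV", "Vancouver", 400.0),
]
DIMS = ("time", "item", "location")


def measure_at(point, rows=FACTS):
    """Sum dollars_sold over the rows matching every pair in `point`.

    `point` maps dimension names to values; a dimension that is absent
    from `point` is not constrained, i.e. it is rolled up to "all".
    """
    total = 0.0
    for row in rows:
        values = dict(zip(DIMS, row[:3]))
        if all(values[dim] == val for dim, val in point.items()):
            total += row[3]
    return total


print(measure_at({"time": "Q1", "item": "TV"}))                         # 2-D cuboid point
print(measure_at({"time": "Q1", "item": "TV", "location": "Toronto"}))  # base cuboid point
print(measure_at({}))                                                   # apex cuboid
```

The same function serves every cuboid in the lattice: the fewer dimensions the point fixes, the higher the level of summarization.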
Measures: Three Categories
distributive: if the result derived by applying the function to n aggregate values is the same as that derived by applying the function on all the data without partitioning. E.g., count(), sum(), min(), max().
algebraic: if it can be computed by an algebraic function with M arguments (where M is a bounded integer), each of which is obtained by applying a distributive aggregate function. E.g., avg(), min_N(), standard_deviation().
holistic: if there is no constant bound on the storage size needed to describe a subaggregate. E.g., median(), mode(), rank().
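The three categories can be checked on a small partitioned data set. The sketch below uses made-up numbers and Python's standard `statistics` module; it shows that sum() combines directly across partitions, that avg() can be rebuilt from two distributive sub-aggregates (sum and count), and that combining partition medians does not in general give the true median.

```python
from statistics import median

data = [3, 1, 4, 1, 5, 9, 2, 6, 5]
partitions = [data[0:3], data[3:6], data[6:9]]

# Distributive: applying sum() to the partition sums equals sum() over all data.
sum_from_parts = sum(sum(p) for p in partitions)

# Algebraic: avg() is computed from a bounded number of distributive
# sub-aggregates per partition, here the pair (sum, count).
sub_aggregates = [(sum(p), len(p)) for p in partitions]
avg_from_parts = (sum(s for s, _ in sub_aggregates)
                  / sum(n for _, n in sub_aggregates))

# Holistic: no fixed-size sub-aggregate is enough for median().
# Taking the median of the partition medians is NOT the true median here.
median_of_medians = median(median(p) for p in partitions)
true_median = median(data)

print(sum_from_parts, sum(data))
print(avg_from_parts, sum(data) / len(data))
print(median_of_medians, true_median)
```

This difference is why distributive and algebraic measures are cheap to precompute and roll up in a cube, while holistic measures generally need access to the underlying data.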
A Concept Hierarchy: Dimension (location)
A concept hierarchy defines a sequence of mappings from a set of low-level concepts to higher-level, more general concepts.
all: all
region: Europe, ..., North_America
country: Germany, ..., Spain (Europe); Canada, ..., Mexico (North_America)
...
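A concept hierarchy can be represented as a plain mapping from each lower-level concept to its parent, and a roll-up then re-aggregates a measure along that mapping. The sketch below uses the country and region names from the figure; the sales figures and the helper `roll_up` are hypothetical.

```python
from collections import defaultdict

# One level of the location hierarchy: country -> region.
COUNTRY_TO_REGION = {
    "Germany": "Europe",
    "Spain": "Europe",
    "Canada": "North_America",
    "Mexico": "North_America",
}


def roll_up(measure_by_child, child_to_parent):
    """Aggregate a summable measure from a lower level to its parent level."""
    totals = defaultdict(float)
    for child, value in measure_by_child.items():
        totals[child_to_parent[child]] += value
    return dict(totals)


# Hypothetical sales per country.
sales_by_country = {"Germany": 120.0, "Spain": 80.0, "Canada": 200.0, "Mexico": 50.0}

sales_by_region = roll_up(sales_by_country, COUNTRY_TO_REGION)
# Rolling up once more maps every region to the single top-level concept "all".
sales_all = roll_up(sales_by_region, {region: "all" for region in sales_by_region})

print(sales_by_region)
print(sales_all)
```

Drilling down is the reverse direction: moving from `sales_by_region` back to the more detailed `sales_by_country`.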
View of Warehouses and Hierarchies
A concept hierarchy that is a total or partial order among attributes in a database schema is called a schema hierarchy.
Specification of hierarchies
Schema hierarchy: day < month < ...
... avg(Length))
Query4: Suppose we are interested in those customers for whom the total length of their calls during summer (June to August) exceeds one-third of the total length of their calls during the entire year. For these customers, we would like to find the longest call made during the summer and the area code to which it was made.
select FromAC, FromTel, R.ToAC, R.Length
from CALLS
group by FromAC, FromTel : R
such that R.Date > 2003/05/31 and R.Date < 2003/09/01
having sum(R.Length) * 3 > sum(Length)
and R.Length = max(R.Length)
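Because the grouping-variable syntax above is an extension of standard SQL, it may help to see the same logic in ordinary code. The following Python sketch is one possible reading of Query4, assuming the filter means "three times the summer total exceeds the overall total"; the record layout, the function name `query4`, and the sample calls are hypothetical.

```python
from collections import defaultdict
from datetime import date


def query4(calls, year=2003):
    """Per (FromAC, FromTel): if summer calls exceed one-third of all calls
    by total length, return the longest summer call(s) with their ToAC."""
    groups = defaultdict(list)
    for call in calls:
        groups[(call["FromAC"], call["FromTel"])].append(call)

    start, end = date(year, 5, 31), date(year, 9, 1)
    result = []
    for (from_ac, from_tel), rows in groups.items():
        # Grouping variable R: the summer subset of this customer's calls.
        summer = [r for r in rows if start < r["Date"] < end]
        if not summer:
            continue
        total_length = sum(r["Length"] for r in rows)
        summer_length = sum(r["Length"] for r in summer)
        if summer_length * 3 > total_length:
            longest = max(r["Length"] for r in summer)
            for r in summer:
                if r["Length"] == longest:
                    result.append((from_ac, from_tel, r["ToAC"], r["Length"]))
    return result


# Hypothetical sample data.
calls = [
    {"FromAC": "212", "FromTel": "5550001", "ToAC": "415", "Date": date(2003, 7, 4), "Length": 30},
    {"FromAC": "212", "FromTel": "5550001", "ToAC": "617", "Date": date(2003, 8, 1), "Length": 50},
    {"FromAC": "212", "FromTel": "5550001", "ToAC": "305", "Date": date(2003, 2, 1), "Length": 40},
    {"FromAC": "310", "FromTel": "5550002", "ToAC": "206", "Date": date(2003, 6, 15), "Length": 10},
    {"FromAC": "310", "FromTel": "5550002", "ToAC": "702", "Date": date(2003, 11, 1), "Length": 90},
]

answer = query4(calls)
print(answer)
```

The first customer qualifies (80 of 120 units fall in summer) and the second does not (10 of 100), so only the first customer's longest summer call is reported.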
Information processing
supports querying, basic statistical analysis, and reporting using tables, charts and graphs
Analytical processing
multidimensional analysis of data warehouse data
supports basic OLAP operations, slice-dice, drilling, pivoting
Data mining
knowledge discovery from hidden patterns
supports associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
Differences among the three tasks
data mining systems?
The functionalities of OLAP and data mining can be viewed as disjoint: OLAP is a data summarization/aggregation tool that helps simplify data analysis, while data mining allows the automated discovery of implicit patterns and interesting knowledge hidden in large amounts of data.
OLAP tools are targeted towards simplifying and supporting interactive data analysis, whereas the goal of data mining tools is to automate as much of the process as possible, while still allowing users to guide the process.
In this sense, data mining goes one step beyond traditional on-line analytical processing.
From On-Line Analytical Processing to On-Line Analytical Mining (OLAM)
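OLAP-based exploratory data analysis: OLAP provides facilities for data mining on different subsets of data and at different levels of abstraction, by drilling, pivoting, filtering, dicing, and slicing on a data cube and on some intermediate data mining results. This, together with data/knowledge visualization tools, will greatly enhance the power and flexibility of exploratory data mining.
On-line selection of data mining functions: Often a user may not know what kinds of knowledge she would like to mine. By integrating OLAP with multiple data mining functions, OLAM provides users with the flexibility to select desired data mining functions and swap data mining tasks dynamically.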
An OLAM Architecture
Layer 4 (User Interface): User GUI API; accepts mining queries and returns mining results
Layer 3 (OLAP/OLAM): OLAM Engine and OLAP Engine, on top of the Data Cube API
Layer 2 (MDDB): MDDB, accessed through the Database API (filtering and integration)
Layer 1 (Data Repository): Databases and Data Warehouse (data cleaning, data integration, filtering)
An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing.
OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in the data analysis via a cube API.
A metadata directory is used to guide the access of the data cube.
The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections.
Since an OLAM server may perform multiple data mining tasks, such as concept description, association, classification, prediction, clustering, time-series analysis, and so on, it usually consists of multiple integrated data mining modules and is more sophisticated than an OLAP server.
The capability of OLAP to provide multiple and dynamic views of summarized data in a data warehouse sets a solid foundation for successful data mining.
OLAP sets a good example for interactive data analysis and provides the necessary preparation for exploratory data mining.