Dw-itm 1st 2ndClass

Embed Size (px)

Citation preview

  • 7/29/2019 Dw-itm 1st 2ndClass

    1/89

    Data Warehousing and OLAP

    What is a data warehouse?

    A multi-dimensional data model

    Data warehouse architecture

    Data warehouse implementation

    Further development of data cube technology

    1

  • 7/29/2019 Dw-itm 1st 2ndClass

    2/89

    , .

    A decision support database that is maintainedseparately from the organizations operationala a ase

    Support information processing by providing a solidplatform of consolidated, historical data for analysis.

    A data warehouse is a subject-oriented, integrated,time-variant, and nonvolatile collection of data in support

    -

    Inmon

    Data warehousing:

    2

    e process o cons ruc ng an us ng a a

    warehouses

  • 7/29/2019 Dw-itm 1st 2ndClass

    3/89

    -

    Organized around major subjects, such as customer,

    roduct sales.

    Focusing on the modeling and analysis of data for

    decision makers not on dail o erations or transaction

    processing.

    subject issues by excluding data that are not useful in

    3

  • 7/29/2019 Dw-itm 1st 2ndClass

    4/89

    ,

    data sources relational databases, flat files, on-line transaction

    recor s

    Data cleaning and data integration techniques areapplied.

    Ensure consistency in naming conventions, encodingstructures, attribute measures, etc. among different

    E.g., Hotel price: currency, tax, breakfast covered, etc.

    When data is moved to the warehouse, it is

    4

    .

  • 7/29/2019 Dw-itm 1st 2ndClass

    5/89

    he time horizon for the data warehouse is significantly

    longer than that of operational systems.

    Operationa ata ase: current va ue ata.

    Data warehouse data: provide information from as or ca perspec ve e.g., pas - years

    Every key structure in the data warehouse

    Contains an element of time, explicitly or implicitly

    But the key of operational data may or may not

    5

    conta n t me e ement .

  • 7/29/2019 Dw-itm 1st 2ndClass

    6/89

    -

    Aphysically separate store of data transformed from theoperational environment.

    Operational update of data does not occur in the data

    warehouse environment.

    Does not require transaction processing, recovery,

    and concurrency control mechanisms

    Requires only two operations in data accessing:

    initial loadin of dataand access of data.

    6

  • 7/29/2019 Dw-itm 1st 2ndClass

    7/89

    Data Warehouse vs. Hetero eneous DBMS

    Build wrappers/mediators on top of heterogeneous databases

    uer driven a roach

    When a query is posed to a client site, a meta-dictionary is

    used to translate the query into queries appropriate for,

    integrated into a global answer set

    Complex information filtering, compete for resources

    Data warehouse: update-driven, high performance Information from heterogeneous sources is integrated in advance

    7

  • 7/29/2019 Dw-itm 1st 2ndClass

    8/89

    .

    OLTP (on-line transaction processing)

    Major task of traditional relational DBMS

    Day-to-day operations: purchasing, inventory, banking,

    OLAP (on-line analytical processing)

    Major task of data warehouse system Data analysis and decision making

    Distinct features (OLTP vs. OLAP):

    .

    Data contents: current, detailed vs. historical, consolidated

    Database design: ER + application vs. star + subject

    8

    View: current, local vs. evolutionary, integrated

    Access patterns: update vs. read-only but complex queries

  • 7/29/2019 Dw-itm 1st 2ndClass

    9/89

    .

    OLTP OLAP

    users clerk, IT professional knowledge worker

    function day to day operations decision support

    DB desi n a lication-oriented sub ect-oriented

    data current, up-to-date

    detailed, flat relational

    isolated

    historical,

    summarized, multidimensional

    integrated, consolidated-

    access read/write

    index/hash on prim. key

    lots of scans

    unit of work short, simple transaction complex query

    # records accessed tens millions

    #users thousands hundreds

    DB size 100MB-GB 100GB-TB

    9

    metric transaction throughput query throughput, response

  • 7/29/2019 Dw-itm 1st 2ndClass

    10/89

    High performance for both systems

    DBMS tuned for OLTP: access methods, indexing,concurrency control, recovery

    multidimensional view, consolidation.

    Different functions and different data: m ss ng a a: ec s on suppor requ res s or ca a a

    which operational DBs do not typically maintain

    data consolidation: DS requires consolidation

    (aggregation, summarization) of data fromheterogeneous sources

    10

    inconsistent data representations, codes and formats

    which have to be reconciled

  • 7/29/2019 Dw-itm 1st 2ndClass

    11/89

    From Tables and Spreadsheets

    A data warehouse is based on a multidimensional data model which

    views data in the form of a data cube

    A data cube such as sales allows data to be modeled and viewed

    in multiple dimensions

    Dimension tables, such as item (item_name, brand, type), ortime ay, wee , mont , quarter, year

    Fact table contains measures (such as dollars_sold) and keys to

    each of the related dimension tables

    In data warehousing literature, an n-D base cube is called a base

    cuboid. The top most 0-D cuboid, which holds the highest-level of

    11

    summarization, is ca e t e apex cu oi . T e attice o cu oi s

    forms a data cube.

  • 7/29/2019 Dw-itm 1st 2ndClass

    12/89

    Cube: A Lattice of Cuboids

    all0-D(apex) cuboid

    me em oca on supp er1-D cuboids

    , ,

    time,supplier

    ,

    item,supplier

    ,

    2-D cuboids

    time,item,location

    time,item,supplier

    time,location,supplier

    item,location,supplier

    3-D cuboids

    12

    time, item, location, supplier

    4-D(base) cuboid

  • 7/29/2019 Dw-itm 1st 2ndClass

    13/89

    Conceptual Modeling of

    a a are ouses

    Star schema:A fact table in the middle connected to a

    Snowflake schema: A refinement of star schema

    set of smaller dimension tables, forming a shape

    Fact constellations: Multiple fact tables share

    13

    , ,

    therefore called galaxy schema or fact constellation

  • 7/29/2019 Dw-itm 1st 2ndClass

    14/89

    time

    item_day

    day_of_the_week

    monthSales Fact Table item_keyitem_name

    brandquarter

    year

    _

    item_key

    type

    supplier_type

    location_key

    street

    location_

    location_key

    units soldbranch_key

    branch name

    branch

    cityprovince_or_street

    country

    _

    dollars_sold

    avg_sales

    _

    branch_type

    14

    Measures

  • 7/29/2019 Dw-itm 1st 2ndClass

    15/89

    time

    item_

    day

    day_of_the_week

    month

    Sales Fact Tableitem_key

    item_name

    brandsupplier_key

    supplier_type

    supplier

    quarter

    year

    _

    item_key

    branch key

    ype

    supplier_key

    location_key

    street

    location_

    location_key

    units_soldbranch_key

    branch name

    branch

    c y_ eydollars_sold

    avg_sales

    _

    branch_typecity_key

    city

    city

    15

    Measures_ _

    country

  • 7/29/2019 Dw-itm 1st 2ndClass

    16/89

    Exam le of Fact Constellation

    time_key

    time

    item Shipping Fact Tableday

    day_of_the_weekmonth

    quarter

    Sales Fact Table

    time ke

    item_key

    item_namebrand

    t e

    time_key

    item_keyyear

    _

    item_key

    branch_key

    supplier_type shipper_key

    from_location

    location_key

    locationlocation_key

    units_soldbranch_key

    branch_name

    branch to_location

    dollars_cost

    city

    province_or_street

    country

    dollars_sold

    avg_sales

    branch_type units_shipped

    shipper

    16

    easures shipper_keyshipper_name

    location_keyshipper_type

  • 7/29/2019 Dw-itm 1st 2ndClass

    17/89

    Measures: How are measures com uted?

    A data cube measure is a numerical function that

    can be evaluated at each point in the data cubespace. A measure value is computed for a given

    point by aggregating the data corresponding to

    the respective dimension-value pairs defining thegiven point.

    17

  • 7/29/2019 Dw-itm 1st 2ndClass

    18/89

    Measures: T ree ategoriesdistributive: if the result derived by applying the function to

    naggregate values is the same as that derived by

    applying the function on all the data without partitioning.

    .g., coun , sum , m n , max .

    algebraic: if it can be computed by an algebraic function

    of which is obtained by applying a distributive aggregate

    function. E.g., avg(), min_N(), standard_deviation().

    holistic: if there is no constant bound on the storage size

    . E.g., median(), mode(), rank().

    18

  • 7/29/2019 Dw-itm 1st 2ndClass

    19/89

    A Concept Hierarchy: Dimension (location)AConcept hierarchy defines a sequence of mappings

    from a set of low level conce ts to hi her-level conce tsmore general concepts.

    allall

    Europe North_America...region

    MexicoCanadaSpainGermany ......countr

    ...

    19

    .. ...

  • 7/29/2019 Dw-itm 1st 2ndClass

    20/89

    View of Warehouses and HierarchiesA concept hierarchy that is a total or partial order among

    .

    Specification of hierarchies

    Schema hierarchyay < mon avg(Length))

    79

  • 7/29/2019 Dw-itm 1st 2ndClass

    80/89

    Query4: Suppose we are interested in those

    customers for whom the total length of theircalls during summer (June to august) exceedsone-third of the total length of their calls during

    . ,like to find the longest call made during the

    was made.

    80

  • 7/29/2019 Dw-itm 1st 2ndClass

    81/89

    select FromAC, FromTel, R.ToAC, R.Length

    from CALLS

    group by FromAC, FromTel : R

    such that R.Date > 2003 05 31andR.Date< 2003/09/01

    *

    AND R.Length=max(R.Length)

    81

  • 7/29/2019 Dw-itm 1st 2ndClass

    82/89

    Information processing

    supports querying, basic statistical analysis, and reporting

    us ng a es, c ar s an grap s

    Analytical processing

    multidimensional anal sis of data warehouse data supports basic OLAP operations, slice-dice, drilling, pivoting

    Data mining

    now e ge iscovery rom i en patterns

    supports associations, constructing analytical models,

    performing classification and prediction, and presenting the

    82

    mining results using visualization tools.

    Differences among the three tasks

  • 7/29/2019 Dw-itm 1st 2ndClass

    83/89

    data mining systems?

    The functionalities of OLAP and data mining can be viewed as

    simplify data analysis, while data mining allows the automated

    discovery of implicit patterns and interesting knowledge hidden

    .

    OLAP tools are targeted towards simplifying and supporting

    interactive data analysis, whereas the goal of data mining tools

    is to automate as much of the process as possible, while still

    allow users to guide the process.

    In this sense data minin oes one ste be ond traditional on-

    83

    line analytical processing.

  • 7/29/2019 Dw-itm 1st 2ndClass

    84/89

  • 7/29/2019 Dw-itm 1st 2ndClass

    85/89

    From On-Line Analytical Processing

  • 7/29/2019 Dw-itm 1st 2ndClass

    86/89

    -

    OLAP provides facilities for data mining on different subsets

    of data and at different levels of abstraction, by drilling,

    visualization tools, will greatly enhance the power andflexibility of exploratory data mining.

    On-line selection of data minin functions Often a user may not know what kinds of knowledge she

    would like to mine.

    ,

    OLAM provides users with the flexibility to select desireddata mining functions and swap data mining tasksd namicall .

    86

    An OLAM Architectureayer

  • 7/29/2019 Dw-itm 1st 2ndClass

    87/89

    User GUI API

    User Interfacen ng query n ng resu t

    OLAMEngine OLAPEngine

    Layer3

    OLAP/OLAM

    Data Cube API

    MDDBLayer2

    MDDB

    Database API

    La er1

    Filtering&Integration Filtering

    87

    Data

    Warehouse

    Data cleaning

    Data integrationData

    Repository

    Databases

    An OLAM Architecture

  • 7/29/2019 Dw-itm 1st 2ndClass

    88/89

    An OLAM server performs analytical mining in data cubes in asimilar manner as an OLAP server performs on-line analytical

    .

    OLAM and OLAP servers both accept user on-line queries via an

    graphical user interface API and work with the data cube in the.

    A metadata directory is used to guide the access of the data cube.

    The data cube can be constructed by accessing and/or integrating

    mu p e a a ases v a an an or y er ng a a awarehouse via a database API that may support OLEDB or ODBCconnections.

    nce an server may per orm mu p e a a m n ng as s

    such as concept description, association, classification, prediction,clustering, time-series analysis, and so on, it usually consists of

    88

    than OLAP server.

  • 7/29/2019 Dw-itm 1st 2ndClass

    89/89

    The capability of OLAP to provide multiple and dynamic

    views of summarized data in a data warehouse sets a.

    OLAP sets a good example for interactive data analysisand provides the necessary preparation for exploratorydata mining.

    89