Introduction to Data Warehousing Ki-Joon Han Database lab. Konkuk University

Introduction to Data Warehousing - :: 건국대학교 …db.konkuk.ac.kr/lecture_note/2005_2_TheDataWarehou… ·  · 2014-03-23Introduction to Data Warehousing Ki-Joon Han Database


Introduction to Data Warehousing

Ki-Joon Han
Database lab.

Konkuk University


Outline of Talk

• Data Warehousing and Information Integration

• Brief History of Data Warehousing
• OLTP vs. OLAP
• What is a Data Warehouse?
• Types of Data and Their Uses
• Data Warehouse Architectures
• Issues in Data Warehousing
• Course Objectives


A Brief History of Information Technology

• The “dark ages”: paper forms in file cabinets
• Computerized systems emerge
  – Initially for big projects like Social Security
  – Same functionality as old paper-based systems
• The “golden age”: databases are everywhere
  – Most activities tracked electronically
  – Stored data provides detailed history of activity
• The next step: use data for decision-making
  – Made possible by omnipresence of IT
  – Identify inefficiencies in current processes
  – Quantify likely impact of decisions


Databases for Decision Support

• 1st phase: Automating existing processes makes them more efficient.
  – Automation → Lots of well-organized, easily accessed data
• 2nd phase: Data analysis allows for better decision-making.
  – Analyze data → better understanding
  – Better understanding → better decisions
• “Data Entry” vs. “Thinking”
  – Data analysts are decision-makers: managers, executives, etc.


Problem: Heterogeneous Information Sources

“Heterogeneities are everywhere”

• Different interfaces
• Different data representations
• Duplicate and inconsistent information

[Figure: heterogeneous sources: personal databases, digital libraries, scientific databases, World Wide Web]


Problem: Data Management in Large Enterprises

• Vertical fragmentation of informational systems

• Result of application (user)-driven development of operational systems

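[Figure: informational systems fragmented by department (Sales, Administration, Finance, Manufacturing, ...), with separate applications such as Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory, ...]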


Goal: Unified Access to Data

Integration System

• Collects and combines information
• Provides integrated view, uniform user interface
• Supports sharing

[Figure: an integration system over the World Wide Web, digital libraries, scientific databases, and personal databases]


The Traditional Research Approach

[Figure: clients, an integration system with metadata, and one wrapper per source]

• Query-driven (lazy, on-demand)


Disadvantages of Query-Driven Approach

• Delay in query processing
  – Slow or unavailable information sources
  – Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
• Hasn’t caught on in industry


The Warehousing Approach

[Figure: clients, the data warehouse, an integration system with metadata, and one extractor/monitor per source]

• Information integrated in advance
• Stored in DW for direct querying and analysis


Advantages of Warehousing Approach

• High query performance
  – But not necessarily most current information
• Doesn’t interfere with local processing at sources
  – Complex queries at warehouse
  – OLTP at information sources
• Information copied at warehouse
  – Can modify, annotate, summarize, restructure, etc.
  – Can store historical information
  – Security, no auditing
• Has caught on in industry


Not Either-Or Decision

• Query-driven approach still better for
  – Rapidly changing information
  – Rapidly changing information sources
  – Truly vast amounts of data from large numbers of sources
  – Clients with unpredictable needs


Federated Databases

• An alternative to data warehouses
• Data warehouse
  – Create a copy of all the data
  – Execute queries against the copy
• Federated database
  – Pull data from source systems as needed to answer queries
• “lazy” vs. “eager” data integration

[Figure: Data warehouse: data is extracted from the source systems into the warehouse, which receives queries and returns answers. Federated database: a mediator receives queries, sends rewritten queries to the source systems, and returns answers.]
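The “lazy” vs. “eager” distinction can be illustrated with a small sketch. Everything here is hypothetical: `SOURCES`, `federated_total`, and the `Warehouse` class are stand-ins invented for this example, and real source systems would of course be databases rather than Python lists.

```python
# Hypothetical source systems: each holds (region, amount) rows.
SOURCES = {
    "store_a": [("east", 10), ("west", 5)],
    "store_b": [("east", 7)],
}

def federated_total(region):
    """Lazy integration: pull from every source at query time."""
    return sum(amount
               for rows in SOURCES.values()
               for r, amount in rows if r == region)

class Warehouse:
    """Eager integration: copy the data in advance, then query the copy."""
    def __init__(self):
        self.rows = []

    def refresh(self):
        # Re-extract everything from the sources (a full reload, for simplicity).
        self.rows = [row for rows in SOURCES.values() for row in rows]

    def total(self, region):
        return sum(amount for r, amount in self.rows if r == region)

warehouse = Warehouse()
warehouse.refresh()
east_before = (federated_total("east"), warehouse.total("east"))

# A new sale arrives at a source after the last warehouse refresh.
SOURCES["store_b"].append(("east", 3))
east_after = (federated_total("east"), warehouse.total("east"))
```

After the new sale, the federated query sees it immediately while the warehouse copy stays slightly out of date until the next `refresh()`.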


Warehouses vs. Federation

• Advantages of federated databases:
  – No redundant copying of data
  – Queries see “real-time” view of evolving data
  – More flexible security policy
• Disadvantages of federated databases:
  – Analysis queries place extra load on transactional systems
  – Query optimization is hard to do well
  – Historical data may not be available
  – Complex “wrappers” needed to mediate between analysis server and source systems
• Data warehouses are much more common in practice
  – Better performance
  – Lower complexity
  – Slightly out-of-date data is acceptable


OLTP vs. OLAP

• OLTP: On-Line Transaction Processing
  – Many short transactions (queries + updates)
  – Examples:
    • Update account balance
    • Enroll in course
    • Add book to shopping cart
  – Queries touch small amounts of data (one record or a few records)
  – Updates are frequent
  – Concurrency is biggest performance concern

• OLAP: On-Line Analytical Processing
  – Long transactions, complex queries
  – Examples:
    • Report total sales for each department in each month
    • Identify top-selling books
    • Count classes with fewer than 10 students
  – Queries touch large amounts of data
  – Updates are infrequent
  – Individual queries can require lots of resources
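As a rough illustration of the two workloads, the sketch below runs one OLTP-style update and one OLAP-style report against the same in-memory SQLite table. The `sales` table and its columns are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, dept TEXT, month TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (dept, month, amount) VALUES (?, ?, ?)",
    [("books", "2005-01", 20.0), ("books", "2005-01", 15.0),
     ("books", "2005-02", 30.0), ("music", "2005-01", 12.0)],
)

# OLTP-style: a short transaction that touches one record by primary key.
with conn:
    conn.execute("UPDATE sales SET amount = amount + 5 WHERE id = ?", (1,))

# OLAP-style: scan the whole table and aggregate
# (total sales for each department in each month).
report = conn.execute(
    "SELECT dept, month, SUM(amount) FROM sales "
    "GROUP BY dept, month ORDER BY dept, month"
).fetchall()
```

The update reads and writes a single row; the report has to visit every row, which is why the two kinds of queries scale so differently as the table grows.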


Why OLAP & OLTP don’t mix (1)

• Transaction processing (OLTP):
  – Fast response time important (< 1 second)
  – Data must be up-to-date, consistent at all times
• Data analysis (OLAP):
  – Queries can consume lots of resources
  – Can saturate CPUs and disk bandwidth
  – Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions
  – Transactions are slow → unhappy users
• Example:
  – Analysis query asks for sum of all sales
  – Acquires lock on sales table for consistency
  – New sales transaction is blocked

Different performance requirements


Why OLAP & OLTP don’t mix (2)

• Transaction processing (OLTP):
  – Normalized schema for consistency
  – Complex data models, many tables
  – Limited number of standardized queries and updates
• Data analysis (OLAP):
  – Simplicity of data model is important
    • Allow semi-technical users to formulate ad hoc queries
  – De-normalized schemas are common
    • Fewer joins → improved query performance
    • Fewer tables → schema is easier to understand

Different data modeling requirements
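A small sketch of the “fewer joins” point, using SQLite from Python. The `customer`, `product`, `orders`, and `sales_flat` tables are hypothetical; the only claim is that the same report needs two joins on the normalized schema and none on the de-normalized copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Normalized (OLTP-style) schema: three tables.
CREATE TABLE customer (cust_id INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE product  (prod_id INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE orders   (order_id INTEGER PRIMARY KEY,
                       cust_id INTEGER, prod_id INTEGER, amount REAL);
INSERT INTO customer VALUES (1, 'Seoul'), (2, 'Busan');
INSERT INTO product  VALUES (10, 'books'), (11, 'music');
INSERT INTO orders   VALUES (100, 1, 10, 20.0), (101, 1, 11, 5.0), (102, 2, 10, 8.0);

-- De-normalized (OLAP-style) copy: one wide table, built once in advance.
CREATE TABLE sales_flat AS
SELECT o.order_id, c.city, p.category, o.amount
FROM orders o
JOIN customer c ON c.cust_id = o.cust_id
JOIN product  p ON p.prod_id = o.prod_id;
""")

# Same report, written against each schema.
normalized = conn.execute("""
    SELECT c.city, p.category, SUM(o.amount)
    FROM orders o
    JOIN customer c ON c.cust_id = o.cust_id
    JOIN product  p ON p.prod_id = o.prod_id
    GROUP BY c.city, p.category
    ORDER BY c.city, p.category
""").fetchall()

flat = conn.execute("""
    SELECT city, category, SUM(amount)
    FROM sales_flat
    GROUP BY city, category
    ORDER BY city, category
""").fetchall()
```

Both queries return the same rows; the second is the one a semi-technical user is more likely to write correctly.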


Why OLAP & OLTP don’t mix (3)

• An OLTP system targets one specific process
  – For example: ordering from an online store
• OLAP integrates data from different processes
  – Combine sales, inventory, and purchasing data
  – Analyze experiments conducted by different labs
• OLAP often makes use of historical data
  – Identify long-term patterns
  – Notice changes in behavior over time
• Terminology, schemas vary across data sources
  – Integrating data from disparate sources is a major challenge

Analysis requires data from many sources


Data Warehouses

• Doing OLTP and OLAP in the same database system is often impractical
  – Different performance requirements
  – Different data modeling requirements
  – Analysis queries require data from many sources
• Solution: Build a “data warehouse”
  – Copy data from various OLTP systems
  – Optimize data organization, system tuning for OLAP
  – Transactions aren’t slowed by big analysis queries
  – Periodically refresh the data in the warehouse


Data Warehouse Evolution

[Figure: timeline from 1960 to 2000 (ticks at 1960, 1975, 1980, 1985, 1990, 1995, 2000). Eras: “Prehistoric Times”, “Middle Ages”, Data Revolution, Information-Based Management. Milestones: relational databases, PCs and spreadsheets, end-user interfaces, 1st DW article, data replication tools, “Building the DW” (Inmon, 1992), DW conferences, vendor DW frameworks, company DWs.]


What is a Data Warehouse? A Practitioner’s Viewpoint

“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”

-- Barry Devlin, IBM Consultant


What is a Data Warehouse? An Alternative Viewpoint

“A DW is a
  – subject-oriented,
  – integrated,
  – time-varying,
  – non-volatile
collection of data that is used primarily in organizational decision making.”

-- W.H. Inmon, Building the Data Warehouse, 1992


A Data Warehouse is...

• Stored collection of diverse data
  – A solution to data integration problem
  – Single repository of information
• Subject-oriented
  – Organized by subject, not by application
  – Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executives


A Data Warehouse is... (Cont’d)

• Large volume of data (GB, TB)
• Non-volatile
  – Historical
  – Time attributes are important
• Updates infrequent
• May be append-only
• Examples
  – All transactions ever at WalMart
  – Complete client histories at insurance firm
  – Stockbroker financial information and portfolios


Summary

[Figure: components labeled Operational Systems, Enterprise Modeling, Business Information Guide, Data Warehouse Catalog, Data Warehouse Population, Data Warehouse, and Business Information Interface]


Warehouse is a Specialized DB

Standard DB
• Mostly updates
• Many small transactions
• MB - GB of data
• Current snapshot
• Index/hash on p.k.
• Raw data
• Thousands of users (e.g., clerical users)

Warehouse
• Mostly reads
• Queries are long and complex
• GB - TB of data
• History
• Lots of scans
• Summarized, reconciled data
• Hundreds of users (e.g., decision-makers, analysts)


Types of Data

• Business Data - represents meaning
  – Real-time data (ultimate source of all business data)
  – Reconciled data
  – Derived data
• Metadata - describes meaning
  – Build-time metadata
  – Control metadata
  – Usage metadata
• Data as a product* - intrinsic meaning
  – Produced and stored for its own intrinsic value
  – e.g., the contents of a text-book


Data Warehouse Architectures: Conceptual View

• Single-layer
  – Every data element is stored once only
  – Virtual warehouse

[Figure, single-layer: operational systems and informational systems both work on the same “real-time data”]

• Two-layer
  – Real-time + derived data
  – Most commonly used approach in industry today

[Figure, two-layer: operational systems work on real-time data; informational systems work on derived data]


Three-layer Architecture: Conceptual View

• Transformation of real-time data to derived data really requires two steps

[Figure, three-layer: operational systems work on real-time data; reconciled data forms the physical implementation of the data warehouse; derived data is the view level serving “particular informational needs” of informational systems]


Data Warehousing: Two Distinct Issues

(1) How to get information into warehouse: “Data warehousing”

(2) What to do with data once it’s in warehouse: “Warehouse DBMS”

• Both rich research areas
• Industry has focused on (2)


Issues in Data Warehousing

• Warehouse Design
• Extraction
  – Wrappers, monitors (change detectors)
• Integration
  – Cleansing & merging
• Warehousing specification & Maintenance
• Optimizations
• Miscellaneous (e.g., evolution)


Loading the Data Warehouse

[Figure: Source Systems (OLTP) → Data Staging Area → Data Warehouse]

• Data is periodically extracted
• Data is cleansed and transformed
• Users query the data warehouse
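The three stages can be sketched as three small functions. This is a minimal illustration only: the `extract`, `transform`, and `load` names, the `sales` table, and the cleansing rules (drop rows with no amount, tidy up names) are all assumptions made for the example.

```python
import sqlite3

def extract(source_rows):
    """Periodic extraction: copy rows out of a (hypothetical) source system."""
    return list(source_rows)

def transform(staged):
    """Staging area: cleanse and convert rows to the warehouse format."""
    cleaned = []
    for name, amount in staged:
        if amount is None:              # drop rows with a missing amount
            continue
        cleaned.append((name.strip().title(), float(amount)))
    return cleaned

def load(conn, rows):
    """Append the cleaned rows to the warehouse table."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (customer TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()

source = [("  jane doe ", "12.50"), ("JOHN ROE", None), ("ann lee", 3)]
warehouse = sqlite3.connect(":memory:")
load(warehouse, transform(extract(source)))

# Users then query the warehouse, not the source system.
loaded = warehouse.execute(
    "SELECT customer, amount FROM sales ORDER BY customer"
).fetchall()
```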


Data Extraction

• Source types
  – Relational, flat file, WWW, etc.
• How to get data out?
  – Replication tool
  – Dump file
  – Create report
  – ODBC or third-party “wrappers”


Wrapper

• Converts data and queries from one data model to another
• Extends query capabilities for sources with limited capabilities

[Figure: a wrapper sits between Data Model A and Data Model B; queries pass through the wrapper to the source, and data comes back]
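A minimal sketch of both wrapper roles, under stated assumptions: `FlatFileSource` is a made-up source that can only hand back its whole contents as text, and `CsvWrapper` is a made-up wrapper that presents it as rows and adds selection and projection, which the source itself cannot do.

```python
import csv
import io

class FlatFileSource:
    """Hypothetical source with limited capabilities: returns all text, nothing else."""
    def __init__(self, text):
        self._text = text

    def read_all(self):
        return self._text

class CsvWrapper:
    """Converts the flat-file model to rows of dicts and adds simple queries."""
    def __init__(self, source):
        self._source = source

    def select(self, columns, where=lambda row: True):
        # Data conversion: text lines -> dict rows.
        reader = csv.DictReader(io.StringIO(self._source.read_all()))
        # Extended capability: filtering (where) and projection (columns).
        return [{c: row[c] for c in columns} for row in reader if where(row)]

source = FlatFileSource(
    "name,dept,salary\nKim,sales,300\nLee,finance,420\nPark,sales,350\n"
)
wrapper = CsvWrapper(source)
sales_names = wrapper.select(["name"], where=lambda r: r["dept"] == "sales")
```

The filtering here happens inside the wrapper after reading everything, which is the usual price of wrapping a source that cannot evaluate predicates itself.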


Data Transformations

• Convert data to uniform format
  – Byte ordering, string termination
  – Internal layout
• Remove, add & reorder attributes
  – Add key
  – Add data to get history
• Sort tuples
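Several of these transformations fit in a few lines. The sketch below is illustrative only: the `transform` function and the `cust`, `amount`, and `tmp` attributes are hypothetical, and a load date stands in for the “data to get history”.

```python
from datetime import date

def transform(rows, load_date):
    """Turn source rows (dicts) into warehouse tuples.

    - sorts tuples by customer
    - removes attributes the warehouse does not need (anything but cust, amount)
    - reorders attributes into a fixed column order
    - adds a surrogate key and a load date so history can be kept
    """
    ordered = sorted(rows, key=lambda r: r["cust"])
    return [
        (key, r["cust"], r["amount"], load_date.isoformat())
        for key, r in enumerate(ordered, start=1)
    ]

rows = [
    {"amount": 5, "cust": "lee", "tmp": "x"},
    {"amount": 9, "cust": "kim", "tmp": "y"},
]
result = transform(rows, date(2005, 9, 1))
```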


Monitors

• Goal: Detect changes of interest and propagate to integrator
• How?
  – Triggers
  – Replication server
  – Log sniffer
  – Compare query results
  – Compare snapshots/dumps
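The last option, comparing snapshots, is the easiest to sketch. This assumes each snapshot is available as a mapping from primary key to row; `diff_snapshots` and the sample data are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two snapshots (dicts: primary key -> row) and report changes."""
    inserted = {k: new[k] for k in new.keys() - old.keys()}
    deleted = {k: old[k] for k in old.keys() - new.keys()}
    updated = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return inserted, deleted, updated

yesterday = {1: ("kim", 300), 2: ("lee", 420), 3: ("park", 350)}
today = {1: ("kim", 300), 2: ("lee", 450), 4: ("choi", 280)}

# These three sets of changes are what the monitor would send to the integrator.
inserted, deleted, updated = diff_snapshots(yesterday, today)
```

Snapshot comparison needs no cooperation from the source, but it only sees the net change between two dumps, not every intermediate update.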


Data Integration

• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based
• Actions
  – Resolve inconsistencies
  – Eliminate duplicates
  – Integrate into warehouse (may not be empty)
  – Summarize data
  – Fetch more data from sources (DW updates)
  – etc.
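A toy integrator can show a few of these actions together. The rule used here (for the same key, the record with the later timestamp wins) is an assumption chosen for the sketch, not a rule from the slides; `integrate` and the two source feeds are likewise hypothetical.

```python
def integrate(warehouse, changes):
    """Apply changes from several sources to a warehouse dict.

    Rule (assumed for this sketch): for the same key, the record with
    the later timestamp wins. Identical repeats are therefore ignored,
    which also eliminates duplicates.
    """
    for key, value, updated_at in changes:
        current = warehouse.get(key)
        if current is None or updated_at > current[1]:
            warehouse[key] = (value, updated_at)
    return warehouse

# The warehouse may not be empty when changes arrive.
warehouse = {"c1": ("Seoul", "2005-01-10")}

# Two sources disagree about customer c1 and both report c2.
from_crm = [("c1", "Busan", "2005-03-01"), ("c2", "Daegu", "2005-02-01")]
from_billing = [("c1", "Seoul", "2005-02-15"), ("c2", "Daegu", "2005-02-01")]

integrate(warehouse, from_crm + from_billing)
```

ISO-formatted date strings compare correctly as plain strings, which keeps the rule to a single comparison.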


Data Integration is Hard

• Data warehouses combine data from multiple sources
• Data must be translated into a consistent format
• Data integration represents ~80% of effort for a typical data warehouse project!
• Some reasons why it’s hard:
  – Metadata is poor or non-existent
  – Data quality is often bad
    • Missing or default values
    • Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California)
  – Inconsistent semantics
    • What is a tree?


Data Cleansing

• Find (& remove) duplicate tuples
  – e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data
  – Attribute values that don’t match
• Patch missing, unreadable data
• Notify sources of errors found
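One simple way to catch the Jane Doe vs. Jane Q. Doe case is to compare names on a normalized key. The normalization below (lowercase, strip periods, drop single-letter initials) is an assumption made for this sketch; real cleansing tools use far richer matching, and this rule would wrongly merge two different people who share first and last names.

```python
def name_key(name):
    """Normalize a name: lowercase, strip periods, drop single-letter initials."""
    tokens = [t.strip(".").lower() for t in name.split()]
    return tuple(t for t in tokens if len(t) > 1)

def find_duplicates(names):
    """Group names that share a normalized key; return groups with more than one member."""
    groups = {}
    for name in names:
        groups.setdefault(name_key(name), []).append(name)
    return [group for group in groups.values() if len(group) > 1]

duplicates = find_duplicates(["Jane Doe", "John Roe", "Jane Q. Doe"])
```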


Course Objectives

• Gain practical understanding of how data warehouses are built and used
• Gain exposure to data modeling “best practices”
• Learn techniques used to process complex queries over very large data sets
• Understand the performance trade-offs that come from alternative data structures
• Learn commonly-used methods for mining and analysis of large data sets
• Become familiar with current research directions in data warehousing and related areas


Research Topics

• Logical Database Design
  – How should the data be modeled?
  – Designing the data warehouse schema
• Query Processing
  – Analysis queries are hard to answer efficiently
  – What techniques are available to the DBMS?
• Physical Database Design
  – How should the data be organized on disk?
  – What data structures should be used?
• Data Mining
  – What use is all this data?
  – Which questions should we ask our data warehouse?


Additional Topics

• Data integration
• Data cleaning
• Approximate query answering
• Data lineage
• Data visualization
• Incremental maintenance of materialized views
• Answering queries using views
• Indexing special data types (spatial, text, geographic)
• Metadata management