Intro. to Data Warehouse
Worapoj Kreesuradej, Ph.D., Associate Professor
Data Mining & Data Exploration Laboratory (DME Lab),
Faculty of Information Technology,
King Mongkut's Institute of Technology Ladkrabang
Web: www.it.kmitl.ac.th/dme
Email: [email protected]
Book
Paulraj Ponniah, Data Warehousing Fundamentals, John Wiley & Sons, 2001.
Ralph Kimball and Margy Ross, The Data Warehouse Toolkit, John Wiley & Sons, 2002.
Definition of DW
“A collection of integrated, subject-oriented databases designed to supply the information required for decision-making.” - W. Inmon
A decision support database that is maintained separately from the organization’s operational databases.
A physical repository where relational data are specially organized to provide enterprise-wide, cleansed data in a standardized format. - E. Turban et al.
R. Kimball’s definition of a DW
A data warehouse is a copy of transactional data specifically structured for querying and analysis.
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems
• Result of application (user)-driven development of operational systems
[Figure: separate departmental systems (Sales Administration, Finance, Manufacturing, ...), each with its own applications: Sales Planning, Stock Mngmt, Suppliers, Debt Mngmt, Num. Control, Inventory, ...]
Two approaches for accessing data:
• Query-driven (lazy)
• Warehouse (eager)
[Figure: two sources, with the open question of how to access them]
The Need for DW
Query-driven (lazy, on-demand)
[Figure: clients query an integration system (with metadata), which reaches each source through a wrapper]
Disadvantages of Query-Driven Approach
• Delay in query processing
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources
The Warehousing Approach
• Information integrated in advance
• Stored in the warehouse for direct querying and analysis
[Figure: an extractor/monitor on each source feeds an integration system (with metadata), which loads the data warehouse that clients query]
Advantages of Warehousing Approach
• High query performance
• Doesn’t interfere with local processing at sources
• Information copied at warehouse
  • Can modify, annotate, summarize, restructure, etc.
  • Can store historical information
  • Security, no auditing
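The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Source` class, the two order lists, and the function names are hypothetical and stand in for real operational systems.

```python
from collections import Counter

# Two hypothetical operational sources; names and rows are made up for illustration.
RETAIL_ORDERS = [("widget", 3), ("gadget", 1)]
CATALOG_ORDERS = [("widget", 2), ("gizmo", 5)]


class Source:
    """Stands in for an operational system; counts how often it is queried."""

    def __init__(self, rows):
        self._rows = rows
        self.hits = 0

    def fetch_all(self):
        self.hits += 1
        return list(self._rows)


def query_driven_total(sources, product):
    """Lazy: every query goes back to the sources and integrates on demand."""
    return sum(qty for s in sources for name, qty in s.fetch_all() if name == product)


class Warehouse:
    """Eager: extract and integrate once, then answer queries from the local copy."""

    def __init__(self, sources):
        self._totals = Counter()
        for s in sources:
            for name, qty in s.fetch_all():
                self._totals[name] += qty

    def total(self, product):
        return self._totals[product]
```

With the lazy function, each repeated query touches every source again; the `Warehouse` object touches each source once at load time and then answers from its own copy, which is why it does not compete with local processing at the sources.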
Characteristics of DW
• Subject oriented: data are organized by how users refer to them
• Integrated: inconsistencies are removed in both nomenclature and conflicting information (i.e., data are ‘clean’)
• Non-volatile: read-only data; data do not change over time
• Time variant: data are time series, not current status
Subject Oriented
Data warehouse is designed around “subjects” rather than processes
A company may have:
• Retail Sales System
• Outlet Sales System
• Catalog Sales System
DW will have a Sales Subject Area
[Figure: the OLTP systems (Retail Sales System, Outlet Sales System, Catalog Sales System) feed the Sales Subject Area of the data warehouse, giving subject-oriented sales information]
Integrated
• Heterogeneous source systems
• Need to integrate source data
• For example: product codes could be different in different systems
• Arrive at a common code in the DW
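The product-code example can be sketched as a lookup from each source system's local code to one warehouse-wide code. The system names, local codes, and common codes below are all hypothetical, chosen only to show the idea.

```python
# Hypothetical mapping from (source system, local product code) to a common DW code.
CODE_MAP = {
    ("retail", "P-001"): "PRD0001",
    ("outlet", "1"): "PRD0001",
    ("catalog", "WID-A"): "PRD0001",
    ("retail", "P-002"): "PRD0002",
}


def to_common_code(system, local_code):
    """Return the warehouse-wide product code, or raise if the code is unmapped."""
    try:
        return CODE_MAP[(system, local_code.strip().upper())]
    except KeyError:
        raise ValueError(f"no common code for {system}:{local_code}") from None


def integrate(records):
    """Rewrite (system, local_code, quantity) records onto the common code."""
    return [(to_common_code(system, code), qty) for system, code, qty in records]
```

Raising on an unmapped code, rather than passing it through, is a design choice for this sketch: it keeps unintegrated codes out of the warehouse so the inconsistency is fixed in the mapping instead of leaking into reports.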
[Figure: sources, extractors/monitors, integration system with metadata, data warehouse, and clients. Information is integrated in advance and stored in the DW for direct querying and analysis.]
Non-Volatile
• Operational update of data does not occur in the data warehouse environment.
• Does not require transaction processing, recovery, and concurrency control mechanisms
• Requires only two operations in data accessing: initial loading of data and access of data.
Non-Volatile (Read-Mostly)
[Figure: users write to and read from the OLTP system; users only read from the DW]
Time Variant
• The time horizon for the data warehouse is significantly longer than that of operational systems.
• Operational database: current value data.
• Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years)
• Most business analysis has a time component
• Trend analysis (historical data is required)
[Figure: sales by year, 2001 to 2004]
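Trend analysis only works if the history is kept, which is the point of the time-variant property. A minimal sketch follows; the sales rows are made-up illustrative figures, not data from the slides, and the function names are assumptions.

```python
from collections import defaultdict

# Made-up historical sales rows (year, amount); the figures are illustrative only.
SALES_HISTORY = [
    (2001, 100.0), (2001, 50.0),
    (2002, 180.0),
    (2003, 90.0), (2003, 126.0),
    (2004, 270.0),
]


def yearly_totals(rows):
    """Sum sales per year and return the totals ordered by year."""
    totals = defaultdict(float)
    for year, amount in rows:
        totals[year] += amount
    return dict(sorted(totals.items()))


def year_over_year_growth(totals):
    """Percentage change of each year relative to the previous year."""
    years = sorted(totals)
    return {
        curr: round(100.0 * (totals[curr] - totals[prev]) / totals[prev], 1)
        for prev, curr in zip(years, years[1:])
    }
```

An operational database holding only current values could not answer `year_over_year_growth`, because the earlier years would already have been overwritten.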
Data Warehousing Process Overview
The major components of a data warehousing process:
• Data sources
• Data extraction
• Data loading
• Comprehensive database / data store
• Data mart
• Metadata
• Middleware tools / information delivery tools
ETL
• Data Extraction
• Data Cleaning and Transformation: convert from legacy/host format to warehouse format
• Load: sort, summarize, consolidate, compute views, check integrity, build indexes, partition
The ETL Process
[Figure: Source Systems -> Extract -> Staging Area (Transform) -> Load -> DW Database]
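The three ETL stages can be sketched end to end with the Python standard library. This is a toy pipeline under stated assumptions: the CSV export, the `sales` table, and all function names are hypothetical, and an in-memory list plays the role of the staging area.

```python
import csv
import io
import sqlite3

# Hypothetical legacy export: inconsistent case, stray spaces, a duplicate row,
# and one row with an invalid amount.
LEGACY_CSV = """order_id,product,amount
1, Widget ,10.50
2,GADGET,4.00
2,GADGET,4.00
3,widget,not-a-number
"""


def extract(text):
    """Extract: read raw rows from the source export."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform in the staging area: clean, validate, and deduplicate."""
    seen, clean = set(), []
    for row in rows:
        try:
            record = (int(row["order_id"]), row["product"].strip().lower(), float(row["amount"]))
        except ValueError:
            continue  # reject rows that fail the quality check
        if record[0] not in seen:
            seen.add(record[0])
            clean.append(record)
    return clean


def load(conn, records):
    """Load: write cleaned rows into the warehouse table and build an index."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales (order_id INTEGER PRIMARY KEY, product TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", records)
    conn.execute("CREATE INDEX IF NOT EXISTS idx_sales_product ON sales (product)")
    conn.commit()


def run_etl(conn, text=LEGACY_CSV):
    """Run extract, transform, and load in sequence."""
    load(conn, transform(extract(text)))
```

Calling `run_etl(sqlite3.connect(":memory:"))` leaves two rows in `sales`: the duplicate order and the row with the invalid amount are dropped during the transform step, before anything reaches the warehouse table.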
Data Staging Area
• A storage area where extracted data is cleaned, transformed and deduplicated.
• Initial storage for data
• Need not be based on Relational model
• Mainly sorting and Sequential processing
• Does not provide data access to users
• Analogy – kitchen of a restaurant
ETL Process: Issues & Challenges
• Consumes 70-80% of project time
• Heterogeneous Source Systems
• Little or no control over source systems
• Source systems scattered
• Different currencies, measurement units
• Ensuring data quality
Comprehensive Database / Data Store
• Mostly a relational DB: Oracle, DB2, Sybase, SQL Server
• New DB design for special purpose of DW (e.g., scale up, speed up, parallel processing)
Data Warehouse Design
• OLTP systems are data capture systems: “DATA IN” systems
• DW are “DATA OUT” systems
[Figure: OLTP and DW]
Dimensional Modeling
• Facts are stored in FACT tables
• Dimensions are stored in DIMENSION tables
• Dimension tables contain textual descriptors of the business
• Fact and dimension tables form a Star Schema
• “BIG” fact table in center surrounded by “SMALL” dimension tables
Star Schema
SALES: # TIME_KEY, # PRODUCT_KEY, # CUSTOMER_KEY, * PRICE, * QUANTITY, * SALES
CUSTOMER: # CUSTOMER_KEY, * CID, * CNAME, * STATE, * CITY
PRODUCT: # PRODUCT_KEY, * PID, * PNAME, * PCNAME
TIME: # TIME_KEY, * ORDERDATE, * DAY_OF_WEEK, * DAY_NUMBER_IN_MONTH, * DAY_NUMBER_IN_YEAR, * WEEK_NUMBER, * MONTH, * QUARTER, * HOLIDAY_FLAG, * FISCAL_YEAR, * FISCAL_QUARTER
[Figure: SALES references CUSTOMER, PRODUCT, and TIME; each dimension table is referenced by SALES]
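The SALES star schema can be tried out with Python's built-in `sqlite3` module. The sketch below makes several assumptions: the column types, the composite primary key on SALES, the reduced set of TIME attributes, and every sample row are illustrative and not taken from the slides.

```python
import sqlite3

DDL = """
CREATE TABLE CUSTOMER (CUSTOMER_KEY INTEGER PRIMARY KEY, CID TEXT, CNAME TEXT, STATE TEXT, CITY TEXT);
CREATE TABLE PRODUCT  (PRODUCT_KEY INTEGER PRIMARY KEY, PID TEXT, PNAME TEXT, PCNAME TEXT);
-- Only a few of the TIME attributes are kept to keep the sketch short.
CREATE TABLE "TIME"   (TIME_KEY INTEGER PRIMARY KEY, ORDERDATE TEXT, MONTH INTEGER, QUARTER INTEGER, FISCAL_YEAR INTEGER);
CREATE TABLE SALES (
    TIME_KEY     INTEGER REFERENCES "TIME"(TIME_KEY),
    PRODUCT_KEY  INTEGER REFERENCES PRODUCT(PRODUCT_KEY),
    CUSTOMER_KEY INTEGER REFERENCES CUSTOMER(CUSTOMER_KEY),
    PRICE REAL, QUANTITY INTEGER, SALES REAL,
    PRIMARY KEY (TIME_KEY, PRODUCT_KEY, CUSTOMER_KEY)
);
"""


def build_star(conn):
    """Create the fact and dimension tables and insert made-up sample rows."""
    conn.executescript(DDL)
    conn.executemany("INSERT INTO CUSTOMER VALUES (?,?,?,?,?)", [
        (1, "C1", "Ann", "CA", "Fresno"), (2, "C2", "Bob", "NY", "Albany")])
    conn.executemany("INSERT INTO PRODUCT VALUES (?,?,?,?)", [
        (1, "P1", "Widget", "Hardware"), (2, "P2", "Gizmo", "Toys")])
    conn.executemany('INSERT INTO "TIME" VALUES (?,?,?,?,?)', [
        (1, "2003-02-10", 2, 1, 2003), (2, "2004-08-05", 8, 3, 2004)])
    conn.executemany("INSERT INTO SALES VALUES (?,?,?,?,?,?)", [
        (1, 1, 1, 10.0, 2, 20.0), (1, 2, 2, 5.0, 1, 5.0), (2, 1, 2, 10.0, 3, 30.0)])
    conn.commit()


def sales_by_year_and_state(conn):
    """Star join: the fact table joined to two dimensions, then aggregated."""
    return conn.execute('''
        SELECT t.FISCAL_YEAR, c.STATE, SUM(s.SALES)
        FROM SALES s
        JOIN "TIME" t ON t.TIME_KEY = s.TIME_KEY
        JOIN CUSTOMER c ON c.CUSTOMER_KEY = s.CUSTOMER_KEY
        GROUP BY t.FISCAL_YEAR, c.STATE
        ORDER BY t.FISCAL_YEAR, c.STATE
    ''').fetchall()
```

The query shape is typical for a star schema: numeric measures come from the fact table, while the grouping and filtering columns come from the dimension tables around it.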
Star Schema
Claim: # Physician ID, # Patient ID, # Service Code, # Payer ID, # Claim Number, # Line Item Number, # Claim Date, Date of Services, Amount of Charge, Unit of Services
Service: # Service Code, Service Description, # Category Code
Time Periods: # Claim Date, Year, Month, Quarter, Week
Payer: # Payer ID, Name, Address, Phone Number, EDI Number
Patient: # Patient ID, Patient Name, Address, Age, Sex, Insurance ID
Physician: # Physician ID, Physician Name, Specialty ID, Credential ID
Data mart
• Data mart = subset of DW for community users, e.g. accounting department
• Sometimes exists as a multidimensional database
• Info mart = summarized data + report for community users
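One way to picture the difference between a data mart and an info mart is a view over the warehouse versus a summary table built from that view. The sketch below is an assumption-laden illustration: the `dw_transactions` table, its columns, and its rows are hypothetical, and real data marts are often separate physical databases rather than views.

```python
import sqlite3


def build_warehouse(conn):
    """Create a tiny stand-in warehouse table with made-up rows."""
    conn.execute("CREATE TABLE dw_transactions (dept TEXT, account TEXT, year INTEGER, amount REAL)")
    conn.executemany("INSERT INTO dw_transactions VALUES (?,?,?,?)", [
        ("accounting", "payables", 2003, 120.0),
        ("accounting", "payables", 2003, 80.0),
        ("accounting", "receivables", 2004, 300.0),
        ("sales", "retail", 2004, 950.0),
    ])


def create_accounting_mart(conn):
    """Data mart: the accounting subset of the warehouse, exposed as a view."""
    conn.execute("""CREATE VIEW accounting_mart AS
        SELECT account, year, amount FROM dw_transactions WHERE dept = 'accounting'""")


def create_accounting_info_mart(conn):
    """Info mart: a summarized, report-ready table derived from the data mart."""
    conn.execute("""CREATE TABLE accounting_info_mart AS
        SELECT account, year, SUM(amount) AS total
        FROM accounting_mart GROUP BY account, year""")
```

After the three functions run against one connection, `accounting_mart` holds the three accounting rows at full detail, while `accounting_info_mart` holds one total per account and year.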
Meta Data
• Data about data
• Needed by both information technology personnel and users
• IT personnel need to know data sources and targets; database, table and column names; refresh schedules; data usage measures; etc.
• Users need to know entity/attribute definitions; reports/query tools available; report distribution information; help desk contact information, etc.
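The two audiences for metadata can be sketched as two views over the same record. Every field name and value below is a hypothetical choice for illustration; real metadata repositories carry far more detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnMetadata:
    """One metadata record; every field name here is an illustrative assumption."""
    # Technical metadata, mainly for IT personnel
    source: str
    target_table: str
    target_column: str
    refresh_schedule: str
    # Business metadata, mainly for users
    definition: str
    help_desk_contact: str


CATALOG = [
    ColumnMetadata(
        source="retail_sales.orders.unit_price",  # hypothetical source column
        target_table="SALES",
        target_column="PRICE",
        refresh_schedule="nightly",
        definition="Unit price charged to the customer",
        help_desk_contact="help desk",  # placeholder, not a real contact
    ),
]


def it_view(entry):
    """What IT personnel need: lineage and refresh information."""
    return {
        "source": entry.source,
        "target": f"{entry.target_table}.{entry.target_column}",
        "refresh": entry.refresh_schedule,
    }


def user_view(entry):
    """What users need: the business definition and where to get help."""
    return {
        "attribute": entry.target_column,
        "definition": entry.definition,
        "help": entry.help_desk_contact,
    }
```

Keeping both kinds of metadata in one record, and projecting it per audience, avoids maintaining two catalogs that can drift apart.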
Information Delivery Tools
Tools:
• Query & reporting
• OLAP
• Data mining, visualization, segmentation, clustering
• New developments: text mining, web mining & personalization
• Mining multimedia data
• Commercial tools: Crystal Report, Impromptu, WebFocus
• Increasingly common mode of delivery: Web-enabled
Data Warehouse Architecture
• Data Flow Architecture
• System Architecture
Data Flow Architecture
[Figure: data flow architecture diagrams]
Operational data stores (ODS)
• A type of database often used as an interim area for a data warehouse, especially for customer information files
MDB = Multidimensional databases
System Architectures
Three parts of the data warehouse:
• The data warehouse that contains the data and associated software
• Data acquisition (back-end) software that extracts data from legacy systems and external sources, consolidates and summarizes them, and loads them into the data warehouse
• Client (front-end) software that allows users to access and analyze data from the warehouse
Data Warehouse Development
Data warehouse development approaches:
• Inmon Model: EDW approach, enterprise-wide warehouse, top down
• Kimball Model: data mart approach, data mart, bottom up
Which model is best?
• There is no one-size-fits-all strategy to data warehousing
• When properly executed, both result in an enterprise-wide data warehouse, but with different architectures
The Data Mart Strategy
• The most common approach
• Begins with a single mart; architected marts are added over time for more subject areas
• Relatively inexpensive and easy to implement
• Can be used as a proof of concept for data warehousing
• Can perpetuate the “silos of information” problem
• Can postpone difficult decisions and activities
• Requires an overall integration plan
The Enterprise-wide Strategy
• A comprehensive warehouse is built initially
• An initial dependent data mart is built using a subset of the data in the warehouse
• Additional data marts are built using subsets of the data in the warehouse
• Like all complex projects, it is expensive, time consuming, and prone to failure
• When successful, it results in an integrated, scalable warehouse
DW Lifecycle (Ralph Kimball)
Data Warehouse Development
Some best practices for implementing a data warehouse (Weir, 2002):
• Project must fit with corporate strategy and business objectives
• There must be complete buy-in to the project by executives, managers, and users
• It is important to manage user expectations about the completed project
• The data warehouse must be built incrementally
• Build in adaptability
• The project must be managed by both IT and business professionals
• Develop a business/supplier relationship
• Only load data that have been cleansed and are of a quality understood by the organization
• Do not overlook training requirements
• Be politically aware