Business Intelligence for the Real Time Enterprise (BIRTE), Seoul, 11. 09. 2006 Balázs Rácz Csaba István Sidló András Lukács András A. Benczúr Data Mining and Web Search Research Group Computer and Automation Research Institute Hungarian Academy of Sciences (MTA SZTAKI) http://datamining.sztaki.hu/ Two-Phase Data Warehouse Optimized for Data Mining


András Lukács – alukacs@sztaki.hu, BIRTE 2006, 11. 09. 2006

Business Intelligence for the Real Time Enterprise (BIRTE), Seoul, 11. 09. 2006

Balázs Rácz Csaba István Sidló

András Lukács András A. Benczúr

Data Mining and Web Search Research Group

Computer and Automation Research Institute, Hungarian Academy of Sciences (MTA SZTAKI)

http://datamining.sztaki.hu/

Two-Phase Data Warehouse Optimized for Data Mining


Outline

• Demands and challenges

• Related data warehouse techniques

• The two-phase architecture

• The second phase component

• Data model and data mining framework

• Case studies and measurements


Data storage and management

• long term storage

Data manipulation, statistical queries

Data mining platform

• custom tasks

• quick development/reconfiguration

DM visualization specialized to user needs

Data analysis know-how

• e.g. web/telco usage, churn, user groups, …

Basic DBMS functionalities

Data sources

• long range

• changing

• high dimensional

Patterns

Knowledge

BI and knowledge discovery


Demands in data processing

Long term storage

• growing data volume

• cost of storage

Proper coupling between data warehouses and data mining tools

• optimized data access

• effectively implemented data mining tools

Data mining query language

• help and guide the knowledge discovery process

• flexibility, fast deployment

• code reuse


Related data warehouse techniques

Read/cost optimized databases

• column-oriented databases (C-store), column-wise compression

• nearline data warehouses (Sybase IQ, Sand/DNA)

Coupling with data mining systems

• tight – DM is integrated into the DBMS

• semi-tight – interfaced extension of the SQL

• loose – separate DM system


The two-phase architecture

[Architecture diagram. Components shown:]

• Data sources

• ETL tools

• Database (first phase), built by standard DBMS tools

• Short-term data cubes

• Metadata

• OLAP and other analysis tools

• Streaming data management and data mining toolkit (second phase), new C++ implementation

• Compression module

• Data mining module

• Data stream query engine

• Long-term disk storage

• Long-term aggregates, patterns


The second phase

Data stream approach

• optimize data mining access to very large data sets

• (block) sequential access, typically full scans of data

• fit for data mining algorithms: frequent itemset mining, partitioning clustering, Bayes classification, decision trees
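A minimal sketch of why a full sequential scan suits such algorithms, assuming a hypothetical record layout of (feature value, class label): the counts a naive Bayes classifier needs can be collected in one block-by-block pass, without random access. The names `read_blocks` and `scan_counts` and the sample data are illustrative and do not come from the toolkit described here.

```python
from collections import Counter, defaultdict
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (feature value, class label); hypothetical layout


def read_blocks(records: Iterable[Record], block_size: int) -> Iterator[List[Record]]:
    """Yield consecutive blocks of records, mimicking block-sequential reads."""
    block: List[Record] = []
    for record in records:
        block.append(record)
        if len(block) == block_size:
            yield block
            block = []
    if block:  # last, possibly shorter block
        yield block


def scan_counts(records: Iterable[Record], block_size: int = 2):
    """One full scan that gathers class counts and per-class feature counts."""
    class_counts: Counter = Counter()
    feature_counts = defaultdict(Counter)
    for block in read_blocks(records, block_size):
        for feature, label in block:
            class_counts[label] += 1
            feature_counts[label][feature] += 1
    return class_counts, feature_counts


# Hypothetical sample data: visited section and churn label
log = [("sports", "stay"), ("news", "stay"), ("news", "churn"),
       ("sports", "stay"), ("news", "churn")]
class_counts, feature_counts = scan_counts(log)

# Estimate of P(news | churn) from the counts gathered in the single scan
p_news_given_churn = feature_counts["churn"]["news"] / class_counts["churn"]
```

The same pattern (read a block, update a small in-memory summary, move on) carries over to the other listed algorithm families, each of which may need several such scans.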


Effective storage

• compression by columns

• large compression rate

• row-wise storage

• fast read-only access

• a new semantic compression method
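One way to read "compression by columns" combined with "row-wise storage" is that rows are stored in blocks, and within each block the values of one column are compressed together. The sketch below illustrates that generic idea with `zlib` from the Python standard library. It is not the semantic compression method of the slides, whose details are not given here. The function names and sample rows are hypothetical, and values are assumed to contain no newline characters.

```python
import zlib
from typing import List, Sequence, Tuple

Row = Tuple[str, ...]


def compress_block(rows: Sequence[Row]) -> List[bytes]:
    """Compress one block of rows column by column."""
    columns = zip(*rows)  # transpose: one tuple of values per column
    return [zlib.compress("\n".join(col).encode("utf-8")) for col in columns]


def decompress_block(blobs: Sequence[bytes]) -> List[Row]:
    """Restore the rows of a block from its compressed columns."""
    columns = [zlib.decompress(b).decode("utf-8").split("\n") for b in blobs]
    return list(zip(*columns))  # transpose back to rows


# Hypothetical web log rows: date, method, URL, status
rows = [
    ("2006-01-16", "GET", "/index.html", "200"),
    ("2006-01-16", "GET", "/news.html", "200"),
    ("2006-01-16", "GET", "/index.html", "404"),
]
blobs = compress_block(rows)
restored = decompress_block(blobs)
```

Values within a column tend to resemble each other (same date, same method), which is the usual reason column-wise grouping helps a general-purpose compressor.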


New compression method for log data

[Figures: decompression time and compression time (rel. to gzip's) plotted against compression rate rel. to gzip's, for gzip, bzip2 and dslc; compression rate of dslc plotted against size of weblog (kbyte), from 100 to 10 000 000.]

Examples of compressions

data type    original size (GB)    compr. size (MB)    compr. rate (%)
weblog       1.89                  57.03               2.95%
PIX log      0.92                  46.70               4.97%


Data models

sparse matrix data model

column index:              1 2 3 4 5 6 7 8 9 10 11 12 13 14
a row of a binary matrix:  1 0 0 1 1 0 0 0 1 0  0  0  0  0
sparse format of the row:  1 4 5 9

relational data model

a record: n-tuple
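The relation between a binary matrix row and its sparse format can be sketched as follows. The example assumes a 14-column row with ones in columns 1, 4, 5 and 9, matching the sparse format "1 4 5 9" on the slide. The function names are illustrative.

```python
def to_sparse(row):
    """Return the 1-based column indices of the nonzero entries of a row."""
    return [j for j, bit in enumerate(row, start=1) if bit]


def to_dense(indices, n_columns):
    """Rebuild the binary row from its sparse format."""
    present = set(indices)
    return [1 if j in present else 0 for j in range(1, n_columns + 1)]


row = [1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
sparse = to_sparse(row)          # [1, 4, 5, 9]
dense = to_dense(sparse, len(row))
```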


The data model of the second phase

A common generalization of the relational (k=0 or k=n) and the sparse matrix (h for the rows, b for the nonzero columns) data model

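Each record of n fields is split into a header of k fields and a body of n-k fields.

Bodies with the same header value are stored together:

(h, b1), (h, b2), (h, b3), (h, b4), (h, b5)  →  h: b1 b2 b3 b4 b5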
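A minimal sketch of the header/body grouping, assuming records arrive as tuples already clustered by their first k fields. `group_by_header` and the sample data are hypothetical and only illustrate the model.

```python
from itertools import groupby


def group_by_header(records, k):
    """Split each record into a k-field header and an (n-k)-field body,
    and yield each header once with the list of its bodies.
    Records with equal headers are assumed to be adjacent in the stream."""
    for header, group in groupby(records, key=lambda r: r[:k]):
        yield header, [r[k:] for r in group]


# Hypothetical (user, page) records, clustered by user
records = [("u1", "a"), ("u1", "b"), ("u1", "c"), ("u2", "a")]

# k = 1: the user is the header, visited pages are the bodies
# (the sparse-matrix view: h for the row, b for the nonzero columns)
grouped = list(group_by_header(records, 1))

# k = 0: the header is empty, every record is a plain body
flat = list(group_by_header(records, 0))
```

With k = 1 the header "u1" is stored once for three bodies, which is where the saving over repeating it in every record comes from.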


DM framework and its query language

Data mining toolkit

• flexible modular architecture

• standardized streaming interface between modules

• pre- and postprocessing, data transforming and mining

• variety of data mining algorithms implemented

• so far 200+ modules (more than 100k lines of C++ code)

• new modules can be added

DM query language

• order and configuration of the necessary modules should be given

• graphical application builder
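The idea of a query given as an ordered, configured list of modules that share one streaming interface can be sketched with Python generators. The module names (`select`, `project`, `count_by`), the `MODULES` registry and the specification format are assumptions made for illustration. They are not the toolkit's actual modules or its query language.

```python
def select(stream, field, value):
    """Pass on only the records whose field equals value."""
    for rec in stream:
        if rec.get(field) == value:
            yield rec


def project(stream, fields):
    """Keep only the listed fields of each record."""
    for rec in stream:
        yield {f: rec[f] for f in fields}


def count_by(stream, field):
    """Consume the stream and emit one count record per distinct value."""
    counts = {}
    for rec in stream:
        counts[rec[field]] = counts.get(rec[field], 0) + 1
    for key, n in counts.items():
        yield {field: key, "count": n}


MODULES = {"select": select, "project": project, "count_by": count_by}


def run_pipeline(source, spec):
    """Chain modules in the given order, each with its own configuration."""
    stream = iter(source)
    for name, config in spec:
        stream = MODULES[name](stream, **config)
    return stream


# Hypothetical hit records and pipeline specification
hits = [
    {"user": "u1", "status": 200, "page": "/a"},
    {"user": "u2", "status": 404, "page": "/b"},
    {"user": "u1", "status": 200, "page": "/c"},
    {"user": "u2", "status": 200, "page": "/a"},
]
spec = [
    ("select", {"field": "status", "value": 200}),
    ("project", {"fields": ["user"]}),
    ("count_by", {"field": "user"}),
]
result = list(run_pipeline(hits, spec))
```

Because every module consumes and produces a stream of records, adding a module only requires registering one more function, and reordering the pipeline only requires editing the specification.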


Example of a DM pipeline (telecom. data)

[Pipeline diagram. Labels shown:]

• transactional & customer data from DW

• compression for internal aggregates

• prefix, private, company

• users with several numbers, user groups

• ISDN

• churn classifier, clustering

• cluster info refeed as input

• results refilled into data tables


A case study: T-Online Hungary

• Field of operation: online content provider

• Business needs:

– Long-term log storage with access to analysis

– Custom reports to management and editors

• State before:

– Logs kept on tapes, never read back

– Previous data warehouse projects failed due to data volume

– MOLAP technology failed on dimensionality

• Data Size: 6.5M HTML hits/day, TB+ log/month


Space requirements of one month web log

Storing method                              Size on disk
Compressed (bzip) raw log files             180.7 GB
Compressed (bzip) preprocessed log files    17.1 GB
Standard DB table                           44.9 GB
Compressed DB table                         39.1 GB
Second-phase compressed storage             1.9 GB
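To put the sizes in proportion, the short calculation below divides each figure from the table by the second-phase size. Only the numbers listed on the slide are used. The variable names are illustrative.

```python
# Sizes in GB, as listed for one month of web log
sizes_gb = {
    "Compressed (bzip) raw log files": 180.7,
    "Compressed (bzip) preprocessed log files": 17.1,
    "Standard DB table": 44.9,
    "Compressed DB table": 39.1,
    "Second-phase compressed storage": 1.9,
}

second_phase = sizes_gb["Second-phase compressed storage"]

# How many times larger each storing method is than the second-phase storage
factors = {method: size / second_phase for method, size in sizes_gb.items()}

for method, factor in factors.items():
    print(f"{method}: {factor:.1f}x")
```

On these figures the standard DB table takes roughly 24 times, and the bzip-compressed preprocessed log files roughly 9 times, the space of the second-phase storage.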


Average execution times of regular jobs


Short summary of the two phases

first phase                     second phase
shorter time storage            longer time storage (archive)
relational data model           stream data model
handling dimension tables       prepared for data mining
built by standard DBMS tools    new compression method and data mining framework


Thanks

• Katalin Hum and László Lukács

• T-Online Hungary Inc.

• Hungarian National Office for Research and Technology (NKTH)

– T-mining GVOP-3.1.1-2004/-05-0054/3.0

– Data Riddle NKFP-2/0017/2002

• Hungarian Scientific Research Fund (OTKA)

– T042706

• Inter-University Center for Telecommunications and Informatics, Budapest (ETIK)


Compression of PIX router log

• 10^3 to 10^5 records/sec

• Goal: preservation for off-line security

• Example PIX log covering 100 minutes of router activity

– 5.2 million records

– 940 MB in its raw form

• Results:

– 46.7 MB compressed log

– 4.96% compression ratio

– 5 min parsing from text + 2 min compression on a standard P4


Execution times for reference SQL queries

Q1: select sum(PAGE_ID) from FACT_PAGE_IMPRESSION where DATE_KEY between 20060101 and 20060131

Q2: select count(*) from FACT_PAGE_IMPRESSION where DATE_KEY between 20060101 and 20060131 and HTTP_STATUS_CODE = 200

Q3: select count(distinct USER_ID) from FACT_PAGE_IMPRESSION where DATE_KEY between 20060116 and 20060122
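For comparison with the stream approach, the sketch below answers the three reference queries in one sequential pass over fact records. The slides do not describe how the second phase evaluates such queries, so this is only an assumption about what a scan-based evaluation could look like. The field names follow the SQL above. The sample rows are invented.

```python
def run_reference_queries(rows):
    """Evaluate Q1, Q2 and Q3 in a single pass over the fact records."""
    q1_sum = 0          # Q1: sum(PAGE_ID) for January 2006
    q2_count = 0        # Q2: count(*) for January 2006 with status 200
    q3_users = set()    # Q3: distinct USER_ID for 16-22 January 2006
    for row in rows:
        date = row["DATE_KEY"]
        if 20060101 <= date <= 20060131:
            q1_sum += row["PAGE_ID"]
            if row["HTTP_STATUS_CODE"] == 200:
                q2_count += 1
        if 20060116 <= date <= 20060122:
            q3_users.add(row["USER_ID"])
    return q1_sum, q2_count, len(q3_users)


# Invented sample rows of FACT_PAGE_IMPRESSION
rows = [
    {"DATE_KEY": 20060105, "USER_ID": 1, "PAGE_ID": 10, "HTTP_STATUS_CODE": 200},
    {"DATE_KEY": 20060117, "USER_ID": 1, "PAGE_ID": 20, "HTTP_STATUS_CODE": 404},
    {"DATE_KEY": 20060118, "USER_ID": 2, "PAGE_ID": 30, "HTTP_STATUS_CODE": 200},
    {"DATE_KEY": 20060120, "USER_ID": 1, "PAGE_ID": 5, "HTTP_STATUS_CODE": 200},
    {"DATE_KEY": 20060201, "USER_ID": 3, "PAGE_ID": 7, "HTTP_STATUS_CODE": 200},
]
q1, q2, q3 = run_reference_queries(rows)
```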


Frequent itemset mining (ACCIDENTS dataset)