STAATLICH ANERKANNTE FACHHOCHSCHULE

Author I: Dipl.-Inf. (FH) Johannes Hoppe
Author II: M.Sc. Johannes Hofmeister
Author III: Prof. Dr. Dieter Homeister
Date: 25.03.2011

STUDIEREN UND DURCHSTARTEN.

DMDW Lesson 04 - Data Mining Theory


Page 2: DMDW Lesson 04 - Data Mining Theory

Data Mining Theory

Page 3: DMDW Lesson 04 - Data Mining Theory

01 - Applied Data Warehousing

Page 4: DMDW Lesson 04 - Data Mining Theory

Applied Data Warehousing

A typical data flow looks like this:

[Figure: data flow diagram]

Page 5: DMDW Lesson 04 - Data Mining Theory

Data Warehouse

Practical task

› Create groups of 3 people
› Grab “dmdw_rooms_fh_heidelberg-2006-04-03.xls”
› Explore the data in the “room reservations” spreadsheet
› Discuss and create a simple database table / document that matches the data
› Find a way to migrate the data from the Excel spreadsheet to the database

Today's System:

› For today I recommend SQL Server Business Intelligence Development Studio + SQL Server
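For teams that want to try the migration step outside the SQL Server tooling, the idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sheet is assumed to have been exported to CSV, and the column names (`room`, `day`, `start_time`, `end_time`, `booked_by`), the table name `reservation`, and the sample rows are all hypothetical. The real layout has to come from exploring the spreadsheet.

```python
import csv
import io
import sqlite3

# Hypothetical CSV export of the "room reservations" sheet.
# The real columns must be taken from the spreadsheet itself.
SAMPLE_CSV = """room,day,start_time,end_time,booked_by
A101,2006-04-03,08:00,09:30,Smith
A101,2006-04-03,10:00,11:30,Jones
B205,2006-04-03,08:00,12:00,Miller
"""


def load_reservations(csv_text, connection):
    """Create the target table and copy every CSV row into it."""
    connection.execute(
        """CREATE TABLE IF NOT EXISTS reservation (
               room TEXT NOT NULL,
               day TEXT NOT NULL,
               start_time TEXT NOT NULL,
               end_time TEXT NOT NULL,
               booked_by TEXT
           )"""
    )
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = [
        (r["room"], r["day"], r["start_time"], r["end_time"], r["booked_by"])
        for r in reader
    ]
    connection.executemany("INSERT INTO reservation VALUES (?, ?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


# An in-memory SQLite database stands in for the real target database.
connection = sqlite3.connect(":memory:")
loaded = load_reservations(SAMPLE_CSV, connection)
rooms = connection.execute("SELECT COUNT(DISTINCT room) FROM reservation").fetchone()[0]
print(loaded, "rows loaded for", rooms, "rooms")
```

The same three moves (read the source, map columns to a table, bulk insert) apply to whichever tool a team picks.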

Page 6: DMDW Lesson 04 - Data Mining Theory

Data Warehouse

Next practical task

› One team will have to present a different solution to migrate the data
› It should be a hands-on lab for the other students
› I will upload the materials to my blog
› Preferred time box: 45 - 90 minutes
› First team will be: _________________________________

Next System: ?

Page 7: DMDW Lesson 04 - Data Mining Theory

Data Warehouse

ETL Teams

› Team 1: Access – 2x Sebastian, Matthias
› Team 2: Access Access2MySQL MySQL – Mercedes, Fabian, Marcus, Albert
› Team 3: Silverlight MS SQL – Sebastian, Patrick
› Team 4: PHP MySQL – Lars, Maurice, Jeff

Page 8: DMDW Lesson 04 - Data Mining Theory

02 - Data Mining Theory

Page 9: DMDW Lesson 04 - Data Mining Theory

Data Mining

Introduction (1/3)

› Data mining is done by running software that examines a database and looks for patterns in the data

› A data warehouse by itself only responds to queries from users

› It will not tell users about patterns in the data that they have not thought to ask about

› To find such patterns, data mining is used to extract key information from a data warehouse

Page 10: DMDW Lesson 04 - Data Mining Theory

Data Mining

Introduction (2/3)

› Data mining allows companies to collect information
  › … to make them more productive and
  › … to beat their competitors

› Data mining helps to identify
  › why customers buy certain products
  › ideas for very direct marketing
  › ideas for shelf placement
  › training of employees vs. employee retention
  › employee benefits vs. employee retention

Page 11: DMDW Lesson 04 - Data Mining Theory

Data Mining

Introduction (3/3)

› Data mining attempts to find patterns in data that we did not know about

› Often data mining is just a new buzzword for statistics
› But data mining differs from (school) statistics in that large volumes of data are used

Trivial information or well-known facts are not an aim of data mining!

Page 12: DMDW Lesson 04 - Data Mining Theory

Data Mining

Some Data Mining Algorithms

› Machine learning
› Statistics
› Pattern recognition
› Regression
› Association rules
› Genetic algorithms
› Decision trees
› Neural networks
› Clustering
› Classification
› etc.

Page 13: DMDW Lesson 04 - Data Mining Theory

Data Mining

Implementing DM on Top of a DW (1/2)

› Data mining tools / mining algorithms require data!
› There are two approaches:
  › Copy data from the Data Warehouse and mine it
  › Mine the data directly in the Data Warehouse
› Popular tools use a variety of different data mining algorithms:
  › association rules
  › genetic algorithms
  › decision trees
  › neural networks

Page 14: DMDW Lesson 04 - Data Mining Theory

Data Mining

Implementing DM on Top of a DW (2/2)

a) Copy data from the data warehouse to data mining tools
  › Advantage: Data mining tools may organize data so they can run faster
  › Disadvantage: Can be very "expensive" to move large amounts of data

b) Data mining tools can access data directly in the Data Warehouse
  › Advantage: No copy of data is needed for data mining
  › Disadvantage: Data may not be organized in a way that is efficient for the tool

Page 15: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

› Step 1: Data preparation: cleanup ("scrubbing"), selection, and review of the data by specialists (data warehouse).

› Step 2: Analysis phase: processing the data with a data mining algorithm.

› Step 3: Evaluation of the output: checking whether something new was discovered.

Page 16: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 1 - Data preparation

› It is useful to fetch data from a data warehouse. This eliminates the need to collect data from different sources, filter it, and handle inconsistencies. Theoretically a data warehouse is not absolutely necessary, but in practice it is.

› The data preparation process includes data selection and manipulation.

› Validation and cleaning are necessary to eliminate out-of-range values and to handle missing values in the raw data. This may include plausibility checks.
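A minimal sketch of such a validation and cleaning pass, in Python. The records, the field names, and the plausibility bounds are hypothetical. Dropping out-of-range records and filling missing values with the mean are just one possible policy, chosen here for brevity. In practice a domain expert decides how each case is handled.

```python
# Hypothetical raw customer records; None marks a missing value.
raw_records = [
    {"customer": "C1", "age": 34},
    {"customer": "C2", "age": None},
    {"customer": "C3", "age": 250},
    {"customer": "C4", "age": 46},
]

AGE_RANGE = (0, 120)  # assumed plausibility bounds, to be set by a domain expert


def clean(records, valid_range):
    """Drop out-of-range ages and fill missing ages with the mean of the valid ones."""
    low, high = valid_range
    valid = [r["age"] for r in records if r["age"] is not None and low <= r["age"] <= high]
    if not valid:
        raise ValueError("no plausible values to derive a fill value from")
    fill_value = sum(valid) / len(valid)
    cleaned = []
    for record in records:
        age = record["age"]
        if age is None:
            # Missing value: fill it in and flag the record as imputed.
            cleaned.append({**record, "age": fill_value, "imputed": True})
        elif low <= age <= high:
            cleaned.append({**record, "imputed": False})
        # Records with an out-of-range age are skipped (dropped).
    return cleaned


cleaned = clean(raw_records, AGE_RANGE)
for record in cleaned:
    print(record)
```

Keeping an `imputed` flag on each record lets the later evaluation step distinguish measured values from filled-in ones.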

Page 17: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 1 - Data preparation

› Even if the data warehouse data are already cleaned and filtered, experience shows that this is not good enough for data mining.

› Data preparation also includes formatting, scaling, and transformation of the raw data, depending on the needs of the data mining algorithm. Examples: scaling of numeric data, currency conversion, or metric/inch conversion.
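Two of these transformations, unit conversion and scaling of numeric data, could look like the following Python sketch. The sample widths and the function names are made up for illustration. Min-max scaling is only one of several common scaling schemes, and which one fits depends on the mining algorithm.

```python
CM_PER_INCH = 2.54  # exact by definition of the inch


def inches_to_cm(value_in_inches):
    """Convert a length from inches to centimetres."""
    return value_in_inches * CM_PER_INCH


def min_max_scale(values):
    """Rescale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    low, high = min(values), max(values)
    if high == low:
        # All values are identical; there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# Hypothetical raw data recorded in inches.
widths_in_inches = [10.0, 20.0, 40.0]
widths_in_cm = [inches_to_cm(w) for w in widths_in_inches]
scaled = min_max_scale(widths_in_cm)
print(widths_in_cm)
print(scaled)
```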

Page 18: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 1 - Data preparation

› Many joined tables may be involved, selecting rows or columns may be necessary, two fields may be combined into a ratio, or derived values may be needed.

› This process needs guidance from someone with good knowledge of the data and the problem domain.

› Data preparation typically consumes 50% to 80% of the data mining budget.
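As a small illustration of joining tables and deriving a ratio, the following Python sketch uses an in-memory SQLite database. The tables `room` and `reservation`, their columns, and the derived `occupancy` field are hypothetical and not taken from the course data.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE room (room_id TEXT PRIMARY KEY, seats INTEGER);
    CREATE TABLE reservation (room_id TEXT, attendees INTEGER);
    INSERT INTO room VALUES ('A101', 40), ('B205', 20);
    INSERT INTO reservation VALUES ('A101', 10), ('A101', 30), ('B205', 20);
""")

# Join the two tables, select only the needed columns,
# and combine two fields (attendees, seats) into a ratio.
rows = connection.execute("""
    SELECT r.room_id,
           CAST(r.attendees AS REAL) / m.seats AS occupancy
    FROM reservation AS r
    JOIN room AS m ON m.room_id = r.room_id
    ORDER BY r.room_id, occupancy
""").fetchall()

for room_id, occupancy in rows:
    print(room_id, occupancy)
```

Whether a ratio like this is a meaningful input for mining is exactly the kind of question that needs someone who knows the problem domain.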

Page 19: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 2 - Analysis phase

› Process the data with a data mining algorithm.
› Information discovery
› Built into Analysis Services:
  › Association
  › Clustering
  › Decision Trees
  › Linear Regression
  › Logistic Regression
  › Naive Bayes
  › Neural Network
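To give a feel for what one of these algorithm families does, here is a tiny clustering sketch in Python: plain k-means on one-dimensional data. It is a generic textbook version with hand-picked starting centers, not the clustering implementation of any particular product. The durations are invented sample data.

```python
def k_means_1d(values, centers, iterations=10):
    """Plain k-means on one-dimensional data with given starting centers."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        # Assignment step: attach each value to its nearest center.
        clusters = [[] for _ in centers]
        for v in values:
            nearest = min(range(len(centers)), key=lambda i: abs(v - centers[i]))
            clusters[nearest].append(v)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [
            sum(c) / len(c) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return centers, clusters


# Hypothetical reservation durations in minutes.
durations = [30, 45, 60, 180, 200, 240]
centers, clusters = k_means_1d(durations, centers=[30, 240])
print(centers)
print(clusters)
```

On this sample the algorithm separates short bookings from long ones without being told that such groups exist, which is the "information discovery" aspect of the analysis phase.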

Page 20: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 3 - Evaluation of the output

› The interpretation and presentation of the results.
› The purpose is either decision support or application development.
› Presentation:
  › A graphical representation is often useful to present the results to executives.
  › Example in text form: "If a customer buys washers or dryers, 61% buy a service agreement. This pattern is present in 1.0% of the transactions".
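In association rule terminology, the two numbers in the washer/dryer example are usually called confidence (the 61%) and support (the 1.0%). Reading the slide's 1.0% as the support of the whole rule is an assumption. The Python sketch below computes both measures on a handful of invented transactions, so its figures differ from the slide's.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence) for the rule 'any antecedent item -> consequent'."""
    # Transactions that contain at least one of the antecedent items.
    with_antecedent = [t for t in transactions if t & antecedent]
    # Of those, the transactions that also contain the consequent.
    with_both = [t for t in with_antecedent if consequent in t]
    support = len(with_both) / len(transactions)
    confidence = len(with_both) / len(with_antecedent) if with_antecedent else 0.0
    return support, confidence


# Hypothetical sales transactions, each a set of purchased items.
transactions = [
    {"washer", "service agreement"},
    {"dryer", "service agreement"},
    {"washer"},
    {"toaster"},
    {"dryer", "detergent"},
    {"kettle", "toaster"},
    {"washer", "dryer", "service agreement"},
    {"detergent"},
]

support, confidence = rule_metrics(
    transactions, antecedent={"washer", "dryer"}, consequent="service agreement"
)
print(f"support = {support:.1%}, confidence = {confidence:.1%}")
```

A rule with high confidence but very low support describes only a few transactions, so both numbers are normally reported together, as in the text example.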

Page 21: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 3 - Evaluation of the output

› Interpretation of the output data might be necessary.
› Data mining may replace DSS/EIS (which is mainly a query application with a graphical display).
› In addition to traditional business software with a clearly visible idea and algorithm, it can also offer the possibility of constructing automated decision support.

Page 22: DMDW Lesson 04 - Data Mining Theory

Data Mining

The Data Mining Process

Step 3 - Evaluation of the output

Automated decision support - Example:

› Every loan application at a bank is passed to a previously trained neural network, which returns a score ranging from "loan rejected" to "loan approved". The neural network is trained on the results of data mining over a large number of loan contracts.

› Such algorithms may work even if the underlying processes are not well understood.

› Warning: neural networks are a black box!!
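The scoring idea can be sketched in Python with the smallest possible "network": a single sigmoid neuron trained by gradient descent. This is a deliberate simplification, since a real system would use hidden layers and far more data. The two input features, the training records, and the hyperparameters are all invented for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, labels, epochs=2000, learning_rate=0.5):
    """Fit a single sigmoid neuron with batch gradient descent on the log loss."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for features, label in zip(samples, labels):
            prediction = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
            error = prediction - label
            grad_w = [g + error * f for g, f in zip(grad_w, features)]
            grad_b += error
        # Step against the averaged gradient.
        weights = [w - learning_rate * g / len(samples) for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / len(samples)
    return weights, bias


def score(weights, bias, features):
    """Score between 0 (towards 'loan rejected') and 1 (towards 'loan approved')."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)


# Hypothetical past loan contracts: (income, existing debt), both scaled to 0..1.
# Label 1 = loan was repaid, 0 = loan defaulted.
history = [(0.9, 0.1), (0.8, 0.3), (0.7, 0.2), (0.3, 0.8), (0.2, 0.6), (0.1, 0.9)]
outcome = [1, 1, 1, 0, 0, 0]

weights, bias = train(history, outcome)
good = score(weights, bias, (0.85, 0.15))
poor = score(weights, bias, (0.15, 0.85))
print(f"good applicant: {good:.2f}, poor applicant: {poor:.2f}")
```

With a single neuron the learned weights can still be read directly. Once hidden layers are added, that is no longer true, which is the black-box problem the warning refers to.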

Page 23: DMDW Lesson 04 - Data Mining Theory

References

Additional Books and References for Data Mining

David A. Grossman, Ophir Frieder: Introduction to Data Mining, Illinois Institute of Technology, 2005

J.P. Bigus: Data Mining with Neural Networks, McGraw-Hill, 1996

Olivia Parr Rud et al.: Data Mining Cookbook - Modeling Data for Marketing, Risk, and Customer Relationship Management, Wiley, 2001

Nong Ye (ed.): The Handbook of Data Mining, Lawrence Erlbaum Associates, 2003

http://www.eruditionhome.com/datamining/
http://en.wikipedia.org/wiki/Data_mining
http://www.the-data-mine.com/bin/view/Misc/IntroductionToDataMining

Page 24: DMDW Lesson 04 - Data Mining Theory

THANK YOU FOR YOUR ATTENTION