Data Mining per l’analisi dei dati: una breve introduzionegiangi/CORSI/SISD/Lezioni/SISD4.pdf · 2007-03-31 · data stored in databases, data warehouses and other information repositories

Data mining per l'analisi dei dati nella PA

Pisa, 9-10-11 Settembre 2004 1

Data Mining per l’analisi deidati: una breve introduzione

Pisa KDD Lab, ISTI-CNR & Univ. Pisahttp://www-kdd.isti.cnr.it/

2

Obiettivi del corsoIntrodurre i concetti di base del processo diestrazione di conoscenza, e le principalitecniche di data mining per l’analisi dei dati.Dare gli strumenti necessari per orientarsi trale molteplici tecnologie coinvolte.Comprendere le potenzialità in risposta adesigenze di indirizzo strategico nei diversisettori applicativiComprendere le diverse modalità con cuiintraprendere progetti di DM



3

Esperienze di FormazioneKDD Lab ( http://www-kdd.isti.cnr.it ), centro di ricercacongiunto fra Università di Pisa e ISTI–CNR, è stato uno deiprimi laboratori di ricerca europei focalizzato sul data mining;ha organizzato fin dal 1998 vari interventi formativi:Analisi dei dati aziendali mediante tecniche di data mining,1998-1999 e 2000-2001, 2002-2003• in collaborazione con il Dipartimento di Statistica per l’Economia e

finanziato dalle province di Pisa e di LivornoTecniche di Data Mining:• Corso di Laurea Specialistica in Informatica, Univ. Pisa

Analisi di dati ed estrazione di conoscenza: Corso di Laurea Specialistica in Informatica per l’Economia e

l’azienda, Univ. Pisa.Laboratorio di Data Warehouse e Data Mining: Corso di Laurea Specialistica in Informatica per l’Economia e

l’azienda, Univ. Pisa.

4

StrutturaParte1 – Introduzione e concetti di base

Motivazioni, applicazioni, il processo KDD, le tecnicheParte 2 – Comprensione e preparazione dei dati

Nozioni basiche di Datawarehouse Analisi Multidimensionale ed OLAPParte 3 – La tecnologia data Mining

Regole Associative e Market Basket Analysis Classificazione e regole predittive e Rilevazione di frodi Clustering e Customer Segmentation Demo di ambienti di Data MiningParte 4 – Data Mining per la PA: tendenze, esperienze e casi distudio

Progetti di Data Mining nelle Università statunitensi Mining Official Data Mining for Health Care



Seminar 1 – Introduction to DM

Motivations, applications,the Knowledge Discovery in

Databases process,the Data Mining techniques

6

Seminar 1 outlineMotivationsApplication AreasKDD Decisional ContextKDD ProcessArchitecture of a KDD systemThe KDD steps in short4 Examples in short



7

Evolution of Database Technology:from data management to data analysis

1960s: Data collection, database creation, IMS and network DBMS.

1970s: Relational data model, relational DBMS implementation.

1980s: RDBMS, advanced data models (extended-relational, OO,

deductive, etc.) and application-oriented DBMS (spatial,scientific, engineering, etc.).

1990s: Data mining and data warehousing, multimedia databases, and

Web technology.

8

What Is Data Mining?Data mining (knowledge discovery from data)

Extraction of interesting (non-trivial, implicit, previously unknown andpotentially useful) patterns or knowledge from huge amount of data

Data mining: a misnomer?

Alternative names

Knowledge discovery (mining) in databases (KDD), knowledgeextraction, data/pattern analysis, data archeology, data dredging,information harvesting, business intelligence, etc.

Watch out: Is everything “data mining”?

(Deductive) query processing.

Expert systems or small ML/statistical programs



9

Why Data MiningIncreased Availability of Huge Amounts of Data

point-of-sale customer data digitization of text, images, video, voice, etc. World Wide Web and Online collections

Data Too Large or Complex for Classical orManual Analysis

number of records in millions or billions high dimensional data (too many fields/features/attributes) often too sparse for rudimentary observations high rate of growth (e.g., through logging or automatic data

collection) heterogeneous data sources

Business Necessity e-commerce high degree of competition personalization, customer loyalty, market segmentation

10

Motivations“Necessity is the Mother of Invention”

Data explosion problem: Automated data collection tools, mature database

technology and internet lead to tremendous amounts ofdata stored in databases, data warehouses and otherinformation repositories.

We are drowning in information, but starving forknowledge! (John Naisbett)

Data warehousing and data mining : On-line analytical processing

Extraction of interesting knowledge (rules, regularities,

patterns, constraints) from data in large databases.



11

Abundance of business and industry dataCompetitive focus - Knowledge ManagementInexpensive, powerful computing enginesStrong theoretical/mathematical foundations machine learning & artificial intelligence statistics database management systems

Motivations for DM

12

Data Mining: Confluence ofMultiple Disciplines

Data Mining

Database Technology Statistics

OtherDisciplines

InformationScience

MachineLearning (AI) Visualization



13

Sources of DataBusiness Transactions widespread use of bar codes => storage of millions of

transactions daily (e.g., Walmart: 2000 stores => 20Mtransactions per day)

most important problem: effective use of the data in areasonable time frame for competitive decision-making

e-commerce data

Scientific Data data generated through multitude of experiments and

observations examples, geological data, satellite imaging data, NASA earth

observations rate of data collection far exceeds the speed by which we

analyze the data

14

Sources of DataFinancial Data company information economic data (GNP, price indexes, etc.) stock markets

Personal / Statistical Data government census medical histories customer profiles demographic data data and statistics about sports and athletes



15

Sources of DataWorld Wide Web and Online Repositories email, news, messages Web documents, images, video, etc. link structure of of the hypertext from millions of

Web sites Web usage data (from server logs, network

traffic, and user registrations) online databases, and digital libraries

16

Classes of applicationsDatabase analysis and decision support Market analysis

target marketing, customer relation management, marketbasket analysis, cross selling, market segmentation.

Risk analysis

Forecasting, customer retention, improved underwriting,quality control, competitive analysis.

Fraud detection

New Applications from New sources of data Text (news group, email, documents)

Web analysis and intelligent search



17

Where are the data sources for analysis? Credit card transactions, loyalty cards, discount

coupons, customer complaint calls, plus (public) lifestylestudies.

Target marketing Find clusters of “model” customers who share the same

characteristics: interest, income level, spending habits,etc.

Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage,

etc.

Cross-market analysis Associations/co-relations between product sales Prediction based on the association information.

Market Analysis

18

Customer profiling data mining can tell you what types of customers

buy what products (clustering or classification).

Identifying customer requirements identifying the best products for different customers

use prediction to find what factors will attract newcustomers

Summary information various multidimensional summary reports;

statistical summary information (data centraltendency and variation)

Market Analysis (2)



19

Finance planning and asset evaluation: cash flow analysis and prediction contingent claim analysis to evaluate assets trend analysis

Resource planning: summarize and compare the resources and spending

Competition: monitor competitors and market directions (CI:

competitive intelligence). group customers into classes and class-based pricing

procedures set pricing strategy in a highly competitive market

Risk Analysis

20

Applications: widely used in health care, retail, credit card services,

telecommunications (phone card fraud), etc.

Approach: use historical data to build models of fraudulent

behavior and use data mining to help identify similarinstances.

Examples: auto insurance: detect a group of people who stage

accidents to collect on insurance money laundering: detect suspicious money

transactions (US Treasury's Financial CrimesEnforcement Network)

medical insurance: detect professional patients and ringof doctors and ring of references

Fraud Detection



21

More examples: Detecting inappropriate medical treatment:

Australian Health Insurance Commission identifiesthat in many cases blanket screening tests wererequested (save Australian $1m/yr).

Detecting telephone fraud: Telephone call model: destination of the call,

duration, time of day or week. Analyze patterns thatdeviate from an expected norm.

Retail: Analysts estimate that 38% of retailshrink is due to dishonest employees.

Fraud Detection (2)

22

Sports IBM Advanced Scout analyzed NBA game statistics

(shots blocked, assists, and fouls) to gain competitiveadvantage for New York Knicks and Miami Heat.

Astronomy JPL and the Palomar Observatory discovered 22

quasars with the help of data mining

Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web

access logs for market-related pages to discovercustomer preference and behavior pages, analyzingeffectiveness of Web marketing, improving Web siteorganization, etc.

Other applications



23

The selection and processing of data for: the identification of novel, accurate, and

useful patterns, and the modeling of real-world phenomena.

Data mining is a major component of theKDD process - automated discovery ofpatterns and the development of predictiveand explanatory models.

What is Knowledge Discovery inDatabases (KDD)? A process!

24

Selection and Preprocessing

Data Mining

Interpretation and Evaluation

Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process



25

The KDD Process in Practice

KDD is an Iterative Process art + engineering rather than science

26

Learning the application domain: relevant prior knowledge and goals of application

Data consolidation: Creating a target data setSelection and Preprocessing Data cleaning : (may take 60% of effort!) Data reduction and projection:

find useful features, dimensionality/variable reduction, invariant representation.

Choosing functions of data mining summarization, classification, regression, association, clustering.

Choosing the mining algorithm(s)Data mining: search for patterns of interestInterpretation and evaluation: analysis of results. visualization, transformation, removing redundant patterns, …

Use of discovered knowledge

The steps of the KDD process



27

IdentifyProblem or Opportunity

Measure effectof Action

Act onKnowledge

Knowledge

ResultsStrategy

Problem

The virtuous cycle

28

Increasing potentialto supportbusiness decisions End User

Business Analyst

DataAnalyst

DBA

MakingDecisions

Data PresentationVisualization Techniques

Data MiningInformation Discovery

DataExploration

OLAP, MDA

Statistical Analysis, Querying and Reporting

Data Warehouses / Data Marts

Data SourcesPaper, Files, Information Providers, Database Systems, OLTP

Data mining and businessintelligence



29

Roles in the KDD process

30

A business intelligenceenvironment



31


Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

Data Sources

Patterns & Models

Prepared Data

ConsolidatedData

The KDD process

32

Garbage in Garbage outThe quality of results relates directly toquality of the data50%-70% of KDD process effort is spent ondata consolidation and preparationMajor justification for a corporate datawarehouse

Data consolidation andpreparation



33

From data sources to consolidated datarepository

RDBMS

Legacy DBMS

Flat Files

DataConsolidationand Cleansing

Warehouse

Object/Relation DBMS Multidimensional DBMS Deductive Database Flat files

External

Data consolidation

34

Determine preliminary list of attributesConsolidate data into working database Internal and External sources

Eliminate or estimate missing valuesRemove outliers (obvious exceptions)Determine prior probabilities of categoriesand deal with volume bias

Data consolidation



35


Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process

36

Generate a set of examples choose sampling method consider sample complexity deal with volume bias issues

Reduce attribute dimensionality remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)

Reduce attribute value ranges group symbolic discrete values quantify continuous numeric values

Transform data de-correlate and normalize values map time-series data to static representation

OLAP and visualization tools play key role

Data selection and preprocessing



37


Data Mining


Data Consolidation

Knowledge

p(x)=0.02

Warehouse

The KDD process

38

Data mining tasks andmethodsDirected Knowledge Discovery Purpose: Explain value of some field in terms

of all the others (goal-oriented) Method: select the target field based on some

hypothesis about the data; ask the algorithmto tell us how to predict or classify newinstances

Examples:what products show increased sale when

cream cheese is discountedwhich banner ad to use on a web page for a

given user coming to the site



39

Data mining tasks andmethodsUndirected Knowledge Discovery(Explorative Methods) Purpose: Find patterns in the data that may be

interesting (no target specified) Method: clustering, association rules (affinity

grouping) Examples:

which products in the catalog often selltogether

market segmentation (groups ofcustomers/users with similar characteristics)

40

Data Mining Models

Automated Exploration/Discovery e.g.. discovering new market segments clustering analysis

Prediction/Classification e.g.. forecasting gross sales given current factors regression, neural networks, genetic algorithms,

decision trees

Explanation/Description e.g.. characterizing customers by demographics and

purchase history decision trees, association rules

x1

x2

f(x)

x

if age > 35 and income < $35k then ...



41

Clustering: partitioning a set of data into a set ofclasses, called clusters, whose members share someinteresting common properties.

Distance-based numerical clustering metric grouping of examples (K-NN) graphical visualization can be used

Bayesian clustering search for the number of classes which result in

best fit of a probability distribution to the data AutoClass (NASA) one of best examples

Automated exploration anddiscovery

42

Learning a predictive modelClassification of a new case/sampleMany methods: Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-parametric)

Prediction and classification



43

The objective of learning is to achieve goodgeneralization to new unseen cases.Generalization can be defined as amathematical interpolation or regressionover a set of training pointsModels can be validated with a previouslyunseen test set or using cross-validationmethods

f(x)

x

Generalization and regression

44

Classification and prediction

Classify data based on the values of atarget attribute, e.g., classify countriesbased on climate, or classify cars based ongas mileage.

Use obtained model to predict someunknown or missing attribute values basedon other information.



45

Objective: Develop a general model orhypothesis from specific

examples

Function approximation (curve fitting)

Classification (concept learning, patternrecognition)

x1

x2A

B

f(x)

x

inductive modeling = learning

46

Learn a generalized hypothesis (model) fromselected dataDescription/Interpretation of model providesnew knowledgeAffinity GroupingMethods: Inductive decision tree and rule systems Association rule systems Link Analysis …

Explanation and description(MOBASHER RULES)



47

Affinity GroupingDetermine what items often go together (usually intransactional databases)Often Referred to as Market Basket Analysis used in retail for planning arrangement on shelves used for identifying cross-selling opportunities “should” be used to determine best link structure for a Web

site

Examples people who buy milk and beer also tend to buy diapers people who access pages A and B are likely to place an online

order

Suitable data mining tools association rule discovery clustering Nearest Neighbor analysis (memory-based reasoning)

48

Generate a model of normal activityDeviation from model causes alertMethods: Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools

Exception/deviation detection



49

Outlier and exception dataanalysis

Time-series analysis (trend and deviation):

Trend and deviation analysis: regression,sequential pattern, similar sequences,trend and deviation, e.g., stock analysis.

Similarity-based pattern-directed analysis

Full vs. partial periodicity analysis

Other pattern-directed or statistical analysis

50


Data Mining


Data Consolidationand Warehousing

Knowledge

p(x)=0.02

Warehouse

The KDD process



51

A data mining system/query may generate thousands ofpatterns, not all of them are interesting.

Interestingness measures: easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm

Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns, e.g.,

support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g.,

unexpectedness, novelty, etc.

Are all the discovered patterninteresting?

52

EvaluationStatistical validation and significance testingQualitative review by experts in the fieldPilot surveys to evaluate model accuracy

InterpretationInductive tree and rule models can be readdirectlyClustering results can be graphed and tabledCode can be automatically generated bysome systems (IDTs, Regression models)

Interpretation and evaluation



53

Why Data Mining Query Language?Automated vs. query-driven? Finding all the patterns autonomously in a

database?—unrealistic because the patterns could betoo many but uninteresting

Data mining should be an interactive process User directs what to be mined

Users must be provided with a set of primitivesto be used to communicate with the data miningsystemIncorporating these primitives in a data miningquery language Standardization of data mining industry and

practice

54

Primitives that Define a Data Mining Task

Task-relevant data

Type of knowledge to be mined

Background knowledge

Pattern interestingness measurements

Visualization/presentation of discovered

patterns



55

Primitive 1: Task-RelevantData

Database or data warehouse name

Database tables or data warehouse cubes

Condition for data selection

Relevant attributes or dimensions

Data grouping criteria

56

Primitive 2: Types of Knowledge to BeMined

Characterization

Discrimination

Association

Classification/prediction

Clustering

Outlier analysis

Other data mining tasks



57

Primitive 3: Background Knowledge (ConceptHierarchies)

Schema hierarchy

E.g., street < city < province_or_state < country

Set-grouping hierarchy

E.g., {20-39} = young, {40-59} = middle_aged

Operation-derived hierarchy

email address: [email protected] < department < university < country

Rule-based hierarchy

low_profit_margin (X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

58

Primitive 4: Measurements of PatternInterestingness

Simplicity

e.g., (association) rule length, (decision) tree size

Certainty

e.g., confidence, P(A|B) = #(A and B)/ #(B),classification reliability, etc.

Utility

potential usefulness, e.g., support (association), noisethreshold (description)

Novelty

not previously known, surprising (used to removeredundant rules, e.g., Bill Clinton vs. William Clinton)



59

Primitive 5: Presentation of DiscoveredPatternsDifferent backgrounds/usages may require different

forms of representation

E.g., rules, tables, pie/bar chart, etc.

Concept hierarchy is also important

Discovered knowledge might be more understandable

when represented at high level of abstraction

Interactive drill up/down, pivoting, slicing and dicing

provide different perspectives to data

Different kinds of knowledge require different

representation: association, classification, clustering, etc.

60

DMQL—A Data Mining QueryLanguage

Motivation

A DMQL can provide the ability to support ad-hoc and interactive data mining

By providing a standardized language like SQL Hope to achieve a similar effect like that SQL has on

relational database

Foundation for system development and evolution

Facilitate information exchange, technology transfer,commercialization and wide acceptance

Design

DMQL is designed with the primitivesdescribed earlier



61

An Example Query in DMQL

62

Other Data Mining Languages &Standardization Efforts

Association rule language specifications

MSQL (Imielinski & Virmani’99)

MineRule (Meo Psaila and Ceri’96)

Query flocks based on Datalog syntax (Tsur et al’98)

OLEDB for DM (Microsoft’2000) and recently DMX

(Microsoft SQLServer 2005)

Based on OLE, OLE DB, OLE DB for OLAP, C#

Integrating DBMS, data warehouse and data mining

DMML (Data Mining Mark-up Language) by DMG

(www.dmg.org)



63

Integration of Data Mining and DataWarehousing

Data mining systems, DBMS, Data warehouse

systems coupling

On-line analytical mining data

integration of mining and OLAP technologies

Interactive mining multi-level knowledge

Necessity of mining knowledge and patterns at

different levels of abstraction.

Integration of multiple mining functions

Characterized classification, first clustering and then

association

64

Architecture: Typical Data MiningSystem

data cleaning, integration, and selection

Database or DataWarehouse Server

Data Mining Engine

Pattern Evaluation

Graphical User Interface

Knowledge-Base

Database DataWarehouse

World-WideWeb

Other InfoRepositories



65

Seminar 1 - Bibliography

Jiawei Han, Micheline Kamber, DataMining: Concepts and Techniques,Morgan Kaufmann Publishers, 2000Micheael, J. A. Berry, Gordon S.Linoff, Mastering Data Mining, Wiley,2000 Klosgen, Zytkow, Handbook of DataMining, Oxford, 2001

66

Seminar 1 - Bibliography

Jiawei Han, Micheline Kamber, Data Mining: Conceptsand Techniques, Morgan Kaufmann Publishers, 2000http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-489-8

• U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R.Uthurusamy (editors). Advances in Knowledgediscovery and data mining, MIT Press, 1996.

• David J. Hand, Heikki Mannila, Padhraic Smyth,Principles of Data Mining, MIT Press, 2001.

• S. Chakrabarti, Mining the Web: Discovering Knowledgefrom Hypertext Data, Morgan Kaufmann, ISBN 1-55860-754-4, 2002



Examples of DM projectsCompetitive Intelligence

Fraud Detection,Health care,

Traffic Accident Analysis,Moviegoers database: asimple example at work

L’Oreal, a case-study oncompetitive intelligence:

Source: DM@CINECAhttp://open.cineca.it/datamining/dmCineca/



69

A small example

Domain: technology watch - a.k.a.competitive intelligence Which are the emergent technologies? Which competitors are investing on them? In which area are my competitors active? Which area will my competitor drop in the near

future?

Source of data: public (on-line) databases

70

The Derwent database

Contains all patents filed worldwide in last 10yearsSearching this database by keywords mayyield thousands of documentsDerwent document are semi-structured:many long text fields

Goal: analyze Derwent document to build amodel of competitors’ strategy



71

Structure of Derwentdocuments

72

Example dataset

Patents in the area: patch technology(cerotto medicale) 105 companies from 12 countries 94 classification codes 52 Derwent codes



73

Clustering output

74

Zoom on cluster 2



75

Zoom on cluster 2 - profilingcompetitors

76

Activity of competitors in the clusters



77

Temporal analysis of clusters

Fraud detection and auditplanning

Source: Ministero delle FinanzeProgetto Sogei, KDD Lab. Pisa



79

A major task in fraud detection isconstructing models of fraudulent behavior,for: preventing future frauds (on-line fraud detection) discovering past frauds (a posteriori fraud

detection)

analyze historical audit data to plan effectivefuture audits

Fraud detection

80

Audit planning

Need to face a trade-off betweenconflicting issues: maximize audit benefits: select subjects to

be audited to maximize the recovery ofevaded tax

minimize audit costs: select subjects to beaudited to minimize the resources neededto carry out the audits.



81

Available data sourcesDataset: tax declarations, concerning atargeted class of Italian companies, integratedwith other sources: social benefits to employees, official budget

documents, electricity and telephone bills.

Size: 80 K tuples, 175 numeric attributes.A subset of 4 K tuples corresponds to theaudited companies: outcome of audits recorded as the recovery attribute

(= amount of evaded tax ascertained )

82

Data preparationoriginaldataset81 K

auditoutcomes

4 K

data consolidationdata cleaning

attribute selection



83

Cost modelA derived attribute audit_cost is definedas a function of other attributes

f audit_cost

84

Cost model and the targetvariable

recovery of an audit after the audit costactual_recovery = recovery - audit_cost

target variable (class label) of our analysis isset as the Class of Actual Recovery (c.a.r.):

negative if actual_recovery ≤ 0c.a.r. =

positive if actual_recovery > 0.



85

Quality assessment indicators

The obtained classifiers are evaluatedaccording to several indicators, or metricsDomain-independent indicators confusion matrix misclassification rate

Domain-dependent indicators audit # actual recovery profitability relevance

86

Domain-dependent qualityindicators

audit # (of a given classifier): number oftuples classified as positive =

# (FP ∪ TP)actual recovery: total amount of actualrecovery for all tuples classified as positiveprofitability: average actual recovery perauditrelevance: ratio between profitability andmisclassification rate



87

The REAL case

Classifiers can be compared with theREAL case, consisting of the whole test-set:

audit # (REAL) = 366actual recovery(REAL) = 159.6 M euro

88

Model evaluation: classifier 1(min FP)

no replication in training-set (unbalance towards negative) 10-trees adaptive boosting

misc. rate = 22% audit # = 59 (11 FP) actual rec.= 141.7 Meuro profitability = 2.401



89

Model evaluation: classifier 2(min FN)

replication in training-set (balanced neg/pos) misc. weights (trade 3 FP for 1 FN) 3-trees adaptive boosting

misc. rate = 34% audit # = 188 (98 FP) actual rec.= 165.2 Meuro profitability = 0.878

Atherosclerosis preventionstudy

2nd Department of Medicine, 1st Faculty ofMedicine of Charles University and CharlesUniversity Hospital, U nemocnice 2, Prague 2(head. Prof. M. Aschermann, MD, SDr, FESC)



91

Atherosclerosis preventionstudy:

The STULONG 1 data set is a realdatabase that keeps information aboutthe study of the development ofatherosclerosis risk factors in apopulation of middle aged men.

Used for Discovery Challenge at PKDD00-02-03-04

92

Atherosclerosis preventionstudy:

Study on 1400 middle-aged men at Czechhospitals

Measurements concern development ofcardiovascular disease and other health data in aseries of exams

The aim of this analysis is to look forassociations between medical characteristicsof patients and death causes.Four tables

Entry and subsequent exams, questionnaireresponses, deaths



93

The input dataData from Entry and Exams

AlcoholLiquorsBeer 10Beer 12WineSmokingFormer smokerDuration of smokingTeaSugarCoffee

Chest painBreathlesnessCholesterolUrineSubscapularTriceps

Marital statusTransport to a jobPhysical activity in a jobActivity after a jobEducationResponsibilityAgeWeightHeight

habitsExaminationsGeneral characteristics

94

The input data

%PATIENTSDEATH CAUSE

100.0389TOTAL

5.7 22general atherosclerosis

29.3114tumorous disease

2.0 8unknown

5.9 23sudden death

20.3 79other causes

7.7 30stroke

8.5 33coronary heart disease

20.6 80myocardial infarction



95

Data selection

When joining “Entry” and “Death” tables weimplicitely create a new attribute “Cause ofdeath”, which is set to “alive” for subjectspresent in the “Entry” table but not in the“Death” table.We have only 389 subjects in death table.

96

The prepared data



97

Descriptive Analysis/ SubgroupDiscovery /Association RulesAre there strong relations concerning death cause?

1. General characteristics (?) ⇒ Death cause (?)

2. Examinations (?) ⇒ Death cause (?)

3. Habits (?) ⇒ Death cause (?)

4. Combinations (?) ⇒ Death cause (?)

98

Example of extracted rules

Education(university) & Height<176-180> ÞDeath cause (tumourosdisease), 16 ; 0.62It means that on tumorous disease havedied 16, i.e. 62% of patients with universityeducation and with height 176-180 cm.



99

Example of extracted rules

Physical activity in work(he mainly sits)& Height<176-180> Þ Death cause(tumouros disease), 24; 0.52It means that on tumorous disease havedied 24 i.e. 52% of patients that mainly sitin the work and whose height is 176-180cm.

100

Example of extracted rulesEducation(university) & Height<176-180>ÞDeath cause (tumouros disease), 16; 0.62; +1.1;the relative frequency of patients who died ontumorous disease among patients with universityeducation and with height 176-180 cm is 110 percent higher than the relative frequency of patientswho died on tumorous disease among all the 389observed patients



Moviegoer Database :

102



103

moviegoers.name sex age source movies.name

Amy f 27 Oberlin Independence DayAndrew m 25 Oberlin 12 MonkeysAndy m 34 Oberlin The BirdcageAnne f 30 Oberlin TrainspottingAnsje f 25 Oberlin I Shot Andy WarholBeth f 30 Oberlin Chain ReactionBob m 51 Pinewoods Schindler's ListBrian m 23 Oberlin Super CopCandy f 29 Oberlin EddieCara f 25 Oberlin PhenomenonCathy f 39 Mt. Auburn The BirdcageCharles m 25 Oberlin KingpinCurt m 30 MRJ T2 Judgment DayDavid m 40 MRJ Independence DayErica f 23 Mt. Auburn Trainspotting

SELECT moviegoers.name, moviegoers.sex, moviegoers.age,sources.source, movies.nameFROM movies, sources, moviegoersWHERE sources.source_ID = moviegoers.source_ID AND movies.movie_ID = moviegoers.movie_IDORDER BY moviegoers.name;

104

Example: Moviegoer DatabaseClassification determine sex based on age, source, and movies

seen determine source based on sex, age, and movies

seen determine most recent movie based on past

movies, age, sex, and source

Estimation for predict, need a continuous variable (e.g., “age”) predict age as a function of source, sex, and past

movies if we had a “rating” field for each moviegoer, we

could predict the rating a new moviegoer gives toa movie based on age, sex, past movies, etc.



105

Example: Moviegoer Database

Clustering find groupings of movies that are often seen by

the same people find groupings of people that tend to see the

same movies clustering might reveal relationships that are not

necessarily recorded in the data (e.g., we mayfind a cluster that is dominated by people withyoung children; or a cluster of movies thatcorrespond to a particular genre)

106

Example: Moviegoer DatabaseAssociation Rules market basket analysis (MBA): “which movies go

together?” need to create “transactions” for each moviegoer

containing movies seen by that moviegoer:

may result in association rules such as:{“Phenomenon”, “The Birdcage”} ==>{“Trainspotting”}

{“Trainspotting”, “The Birdcage”} ==> {sex = “f”}

name TID Transaction

Amy 001 {Independence Day, Trainspotting}Andrew 002 {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}Andy 003 {Super Cop, Independence Day, Kingpin}Anne 004 {Trainspotting, Schindler's List}… … ...



107

Example: Moviegoer Database

Sequence Analysis similar to MBA, but order in which items

appear in the pattern is important e.g., people who rent “The Birdcage”

during a visit tend to rent “Trainspotting”in the next visit.

On the road to knowledge: mining21 years of UK traffic accidentreports

Peter Flach et al.Silnet Network of Excellence



109

Mining traffic accident reports

the Hampshire County Council (UK) wantedto obtain a better insight into how thecharacteristics of traffic accidents may havechanged over the past 20 years as a result ofimprovements in highway design and invehicle design.The database, contained police trafficaccident reports for all UK accidents thathappened in the period 1979-1999.

110

Business UnderstandingUnderstanding of road safety in order toreduce the occurrences and severity ofaccidents.

influence of road surface condition; influence of skidding; influence of location (for example: junction approach); and influence of street lighting.

trend analysis: long-term overall trends,regional trends, urban trends, and ruraltrends.the comparison of different kinds of locationsis interesting: for example, rural versusmetropolitan versus suburban.



111

Data understanding

Low data quality. Many attribute valueswere missing or recorded as unknown.Different maps were created toinvestigate the effect of severalparameters like accident severity andaccident date.

112

Modelling

The aim of this effort was to findinteresting associations between roadnumber, conditions (e.g., weather, andlight) and serious or fatal accidents.Certain localities had been selectedand performed the analysis only overthe years 1998 and 1999.



113

Extracted rule

The relative frequency of fatal accidentsamong all accidents in the locality was 3%.The relative frequency of fatal accidents onthe road (V61) under fine weather with nowinds was 9.6% — more than 3 timesgreater.

114

Seminar 1 Case studies -Bibliography

Artif. Intell. Med. 2001 Jun;22(3), Special Issue on DataMining in MedicineKlosgen, Zytkow, Handbook of Data Mining, Oxford, 2001Micheael, J. A. Berry, Gordon S. Linoff, Mastering DataMining, Wiley, 2000 ECML/PKDD2004 Discovery Challengehomepage[http://lisp.vse.cz/challenge/ecmlpkdd2004/]On the road to knowledge: mining 21 years of UK trafficaccident reports, in Data Mining and Decision Support: Aspectsof Integration and Collaboration, pages 143--155. Kluwer AcademicPublishers, January 2003 ata Mining and Decision Support:Aspects of Integration and Collaboration, pages 143--155. Kluwer Academic Publishers, January 2003



How to develop a DataMining Project?

116

CRISP-DM: The life cicle of adata mining project

KDD Process



117

Business understandingUnderstanding the project objectivesand requirements from a businessperspective. then converting this knowledge into adata mining problem definition and apreliminary plan. Determine the Business Objectives Determine Data requirements for

Business Objectives Translate Business questions into Data

Mining Objective

118

Data understanding

Data understanding: characterize dataavailable for modelling. Provide assessmentand verification for data.



119

Modeling

In this phase, various modeling techniquesare selected and applied and theirparameters are calibrated to optimal values.Typically, there are several techniques forthe same data mining problem type. Sometechniques have specific requirements on theform of data.Therefore, stepping back to the datapreparation phase is often necessary.

120

EvaluationAt this stage in the project you have built amodel (or models) that appears to have highquality from a data analysis perspective.Evaluate the model and review the stepsexecuted to construct the model to be certainit properly achieves the business objectives.A key objective is to determine if there issome important business issue that has notbeen sufficiently considered.



121

DeploymentThe knowledge gained will need to be organizedand presented in a way that the customer canuse it.It often involves applying “live” models within anorganization’s decision making processes, forexample in real-time personalization of Webpages or repeated scoring of marketingdatabases.

122

Deployment

It can be as simple as generating a reportor as complex as implementing arepeatable data mining process across theenterprise.

In many cases it is the customer, not thedata analyst, who carries out thedeployment steps.

Documents

Data Mining per l’analisi dei dati: una breve introduzionegiangi/CORSI/SISD/Lezioni/SISD4.pdf · 2007-03-31 · data stored in databases, data warehouses and other information repositories