Upload
others
View
1
Download
0
Embed Size (px)
Citation preview
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 1
Data Mining per l’analisi deidati: una breve introduzione
Pisa KDD Lab, ISTI-CNR & Univ. Pisahttp://www-kdd.isti.cnr.it/
2
Obiettivi del corsoIntrodurre i concetti di base del processo diestrazione di conoscenza, e le principalitecniche di data mining per l’analisi dei dati.Dare gli strumenti necessari per orientarsi trale molteplici tecnologie coinvolte.Comprendere le potenzialità in risposta adesigenze di indirizzo strategico nei diversisettori applicativiComprendere le diverse modalità con cuiintraprendere progetti di DM
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 2
3
Esperienze di FormazioneKDD Lab ( http://www-kdd.isti.cnr.it ), centro di ricercacongiunto fra Università di Pisa e ISTI–CNR, è stato uno deiprimi laboratori di ricerca europei focalizzato sul data mining;ha organizzato fin dal 1998 vari interventi formativi:Analisi dei dati aziendali mediante tecniche di data mining,1998-1999 e 2000-2001, 2002-2003• in collaborazione con il Dipartimento di Statistica per l’Economia e
finanziato dalle province di Pisa e di LivornoTecniche di Data Mining:• Corso di Laurea Specialistica in Informatica, Univ. Pisa
Analisi di dati ed estrazione di conoscenza: Corso di Laurea Specialistica in Informatica per l’Economia e
l’azienda, Univ. Pisa.Laboratorio di Data Warehouse e Data Mining: Corso di Laurea Specialistica in Informatica per l’Economia e
l’azienda, Univ. Pisa.
4
StrutturaParte1 – Introduzione e concetti di base
Motivazioni, applicazioni, il processo KDD, le tecnicheParte 2 – Comprensione e preparazione dei dati
Nozioni basiche di Datawarehouse Analisi Multidimensionale ed OLAPParte 3 – La tecnologia data Mining
Regole Associative e Market Basket Analysis Classificazione e regole predittive e Rilevazione di frodi Clustering e Customer Segmentation Demo di ambienti di Data MiningParte 4 – Data Mining per la PA: tendenze, esperienze e casi distudio
Progetti di Data Mining nelle Università statunitensi Mining Official Data Mining for Health Care
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 3
Seminar 1 – Introduction to DM
Motivations, applications,the Knowledge Discovery in
Databases process,the Data Mining techniques
6
Seminar 1 outlineMotivationsApplication AreasKDD Decisional ContextKDD ProcessArchitecture of a KDD systemThe KDD steps in short4 Examples in short
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 4
7
Evolution of Database Technology:from data management to data analysis
1960s: Data collection, database creation, IMS and network DBMS.
1970s: Relational data model, relational DBMS implementation.
1980s: RDBMS, advanced data models (extended-relational, OO,
deductive, etc.) and application-oriented DBMS (spatial,scientific, engineering, etc.).
1990s: Data mining and data warehousing, multimedia databases, and
Web technology.
8
What Is Data Mining?Data mining (knowledge discovery from data)
Extraction of interesting (non-trivial, implicit, previously unknown andpotentially useful) patterns or knowledge from huge amount of data
Data mining: a misnomer?
Alternative names
Knowledge discovery (mining) in databases (KDD), knowledgeextraction, data/pattern analysis, data archeology, data dredging,information harvesting, business intelligence, etc.
Watch out: Is everything “data mining”?
(Deductive) query processing.
Expert systems or small ML/statistical programs
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 5
9
Why Data MiningIncreased Availability of Huge Amounts of Data
point-of-sale customer data digitization of text, images, video, voice, etc. World Wide Web and Online collections
Data Too Large or Complex for Classical orManual Analysis
number of records in millions or billions high dimensional data (too many fields/features/attributes) often too sparse for rudimentary observations high rate of growth (e.g., through logging or automatic data
collection) heterogeneous data sources
Business Necessity e-commerce high degree of competition personalization, customer loyalty, market segmentation
10
Motivations“Necessity is the Mother of Invention”
Data explosion problem: Automated data collection tools, mature database
technology and internet lead to tremendous amounts ofdata stored in databases, data warehouses and otherinformation repositories.
We are drowning in information, but starving forknowledge! (John Naisbett)
Data warehousing and data mining : On-line analytical processing
Extraction of interesting knowledge (rules, regularities,
patterns, constraints) from data in large databases.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 6
11
Abundance of business and industry dataCompetitive focus - Knowledge ManagementInexpensive, powerful computing enginesStrong theoretical/mathematical foundations machine learning & artificial intelligence statistics database management systems
Motivations for DM
12
Data Mining: Confluence ofMultiple Disciplines
Data Mining
Database Technology Statistics
OtherDisciplines
InformationScience
MachineLearning (AI) Visualization
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 7
13
Sources of DataBusiness Transactions widespread use of bar codes => storage of millions of
transactions daily (e.g., Walmart: 2000 stores => 20Mtransactions per day)
most important problem: effective use of the data in areasonable time frame for competitive decision-making
e-commerce data
Scientific Data data generated through multitude of experiments and
observations examples, geological data, satellite imaging data, NASA earth
observations rate of data collection far exceeds the speed by which we
analyze the data
14
Sources of DataFinancial Data company information economic data (GNP, price indexes, etc.) stock markets
Personal / Statistical Data government census medical histories customer profiles demographic data data and statistics about sports and athletes
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 8
15
Sources of DataWorld Wide Web and Online Repositories email, news, messages Web documents, images, video, etc. link structure of of the hypertext from millions of
Web sites Web usage data (from server logs, network
traffic, and user registrations) online databases, and digital libraries
16
Classes of applicationsDatabase analysis and decision support Market analysis
target marketing, customer relation management, marketbasket analysis, cross selling, market segmentation.
Risk analysis
Forecasting, customer retention, improved underwriting,quality control, competitive analysis.
Fraud detection
New Applications from New sources of data Text (news group, email, documents)
Web analysis and intelligent search
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 9
17
Where are the data sources for analysis? Credit card transactions, loyalty cards, discount
coupons, customer complaint calls, plus (public) lifestylestudies.
Target marketing Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits,etc.
Determine customer purchasing patterns over time Conversion of single to a joint bank account: marriage,
etc.
Cross-market analysis Associations/co-relations between product sales Prediction based on the association information.
Market Analysis
18
Customer profiling data mining can tell you what types of customers
buy what products (clustering or classification).
Identifying customer requirements identifying the best products for different customers
use prediction to find what factors will attract newcustomers
Summary information various multidimensional summary reports;
statistical summary information (data centraltendency and variation)
Market Analysis (2)
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 10
19
Finance planning and asset evaluation: cash flow analysis and prediction contingent claim analysis to evaluate assets trend analysis
Resource planning: summarize and compare the resources and spending
Competition: monitor competitors and market directions (CI:
competitive intelligence). group customers into classes and class-based pricing
procedures set pricing strategy in a highly competitive market
Risk Analysis
20
Applications: widely used in health care, retail, credit card services,
telecommunications (phone card fraud), etc.
Approach: use historical data to build models of fraudulent
behavior and use data mining to help identify similarinstances.
Examples: auto insurance: detect a group of people who stage
accidents to collect on insurance money laundering: detect suspicious money
transactions (US Treasury's Financial CrimesEnforcement Network)
medical insurance: detect professional patients and ringof doctors and ring of references
Fraud Detection
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 11
21
More examples: Detecting inappropriate medical treatment:
Australian Health Insurance Commission identifiesthat in many cases blanket screening tests wererequested (save Australian $1m/yr).
Detecting telephone fraud: Telephone call model: destination of the call,
duration, time of day or week. Analyze patterns thatdeviate from an expected norm.
Retail: Analysts estimate that 38% of retailshrink is due to dishonest employees.
Fraud Detection (2)
22
Sports IBM Advanced Scout analyzed NBA game statistics
(shots blocked, assists, and fouls) to gain competitiveadvantage for New York Knicks and Miami Heat.
Astronomy JPL and the Palomar Observatory discovered 22
quasars with the help of data mining
Internet Web Surf-Aid IBM Surf-Aid applies data mining algorithms to Web
access logs for market-related pages to discovercustomer preference and behavior pages, analyzingeffectiveness of Web marketing, improving Web siteorganization, etc.
Other applications
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 12
23
The selection and processing of data for: the identification of novel, accurate, and
useful patterns, and the modeling of real-world phenomena.
Data mining is a major component of theKDD process - automated discovery ofpatterns and the development of predictiveand explanatory models.
What is Knowledge Discovery inDatabases (KDD)? A process!
24
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns & Models
Prepared Data
ConsolidatedData
The KDD process
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 13
25
The KDD Process in Practice
KDD is an Iterative Process art + engineering rather than science
26
Learning the application domain: relevant prior knowledge and goals of application
Data consolidation: Creating a target data setSelection and Preprocessing Data cleaning : (may take 60% of effort!) Data reduction and projection:
find useful features, dimensionality/variable reduction, invariant representation.
Choosing functions of data mining summarization, classification, regression, association, clustering.
Choosing the mining algorithm(s)Data mining: search for patterns of interestInterpretation and evaluation: analysis of results. visualization, transformation, removing redundant patterns, …
Use of discovered knowledge
The steps of the KDD process
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 14
27
IdentifyProblem or Opportunity
Measure effectof Action
Act onKnowledge
Knowledge
ResultsStrategy
Problem
The virtuous cycle
28
Increasing potentialto supportbusiness decisions End User
Business Analyst
DataAnalyst
DBA
MakingDecisions
Data PresentationVisualization Techniques
Data MiningInformation Discovery
DataExploration
OLAP, MDA
Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data SourcesPaper, Files, Information Providers, Database Systems, OLTP
Data mining and businessintelligence
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 15
29
Roles in the KDD process
30
A business intelligenceenvironment
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 16
31
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
Data Sources
Patterns & Models
Prepared Data
ConsolidatedData
The KDD process
32
Garbage in Garbage outThe quality of results relates directly toquality of the data50%-70% of KDD process effort is spent ondata consolidation and preparationMajor justification for a corporate datawarehouse
Data consolidation andpreparation
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 17
33
From data sources to consolidated datarepository
RDBMS
Legacy DBMS
Flat Files
DataConsolidationand Cleansing
Warehouse
Object/Relation DBMS Multidimensional DBMS Deductive Database Flat files
External
Data consolidation
34
Determine preliminary list of attributesConsolidate data into working database Internal and External sources
Eliminate or estimate missing valuesRemove outliers (obvious exceptions)Determine prior probabilities of categoriesand deal with volume bias
Data consolidation
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 18
35
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
The KDD process
36
Generate a set of examples choose sampling method consider sample complexity deal with volume bias issues
Reduce attribute dimensionality remove redundant and/or correlating attributes combine attributes (sum, multiply, difference)
Reduce attribute value ranges group symbolic discrete values quantify continuous numeric values
Transform data de-correlate and normalize values map time-series data to static representation
OLAP and visualization tools play key role
Data selection and preprocessing
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 19
37
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidation
Knowledge
p(x)=0.02
Warehouse
The KDD process
38
Data mining tasks andmethodsDirected Knowledge Discovery Purpose: Explain value of some field in terms
of all the others (goal-oriented) Method: select the target field based on some
hypothesis about the data; ask the algorithmto tell us how to predict or classify newinstances
Examples:what products show increased sale when
cream cheese is discountedwhich banner ad to use on a web page for a
given user coming to the site
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 20
39
Data mining tasks andmethodsUndirected Knowledge Discovery(Explorative Methods) Purpose: Find patterns in the data that may be
interesting (no target specified) Method: clustering, association rules (affinity
grouping) Examples:
which products in the catalog often selltogether
market segmentation (groups ofcustomers/users with similar characteristics)
40
Data Mining Models
Automated Exploration/Discovery e.g.. discovering new market segments clustering analysis
Prediction/Classification e.g.. forecasting gross sales given current factors regression, neural networks, genetic algorithms,
decision trees
Explanation/Description e.g.. characterizing customers by demographics and
purchase history decision trees, association rules
x1
x2
f(x)
x
if age > 35 and income < $35k then ...
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 21
41
Clustering: partitioning a set of data into a set ofclasses, called clusters, whose members share someinteresting common properties.
Distance-based numerical clustering metric grouping of examples (K-NN) graphical visualization can be used
Bayesian clustering search for the number of classes which result in
best fit of a probability distribution to the data AutoClass (NASA) one of best examples
Automated exploration anddiscovery
42
Learning a predictive modelClassification of a new case/sampleMany methods: Artificial neural networks Inductive decision tree and rule systems Genetic algorithms Nearest neighbor clustering algorithms Statistical (parametric, and non-parametric)
Prediction and classification
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 22
43
The objective of learning is to achieve goodgeneralization to new unseen cases.Generalization can be defined as amathematical interpolation or regressionover a set of training pointsModels can be validated with a previouslyunseen test set or using cross-validationmethods
f(x)
x
Generalization and regression
44
Classification and prediction
Classify data based on the values of atarget attribute, e.g., classify countriesbased on climate, or classify cars based ongas mileage.
Use obtained model to predict someunknown or missing attribute values basedon other information.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 23
45
Objective: Develop a general model orhypothesis from specific
examples
Function approximation (curve fitting)
Classification (concept learning, patternrecognition)
x1
x2A
B
f(x)
x
inductive modeling = learning
46
Learn a generalized hypothesis (model) fromselected dataDescription/Interpretation of model providesnew knowledgeAffinity GroupingMethods: Inductive decision tree and rule systems Association rule systems Link Analysis …
Explanation and description(MOBASHER RULES)
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 24
47
Affinity GroupingDetermine what items often go together (usually intransactional databases)Often Referred to as Market Basket Analysis used in retail for planning arrangement on shelves used for identifying cross-selling opportunities “should” be used to determine best link structure for a Web
site
Examples people who buy milk and beer also tend to buy diapers people who access pages A and B are likely to place an online
order
Suitable data mining tools association rule discovery clustering Nearest Neighbor analysis (memory-based reasoning)
48
Generate a model of normal activityDeviation from model causes alertMethods: Artificial neural networks Inductive decision tree and rule systems Statistical methods Visualization tools
Exception/deviation detection
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 25
49
Outlier and exception dataanalysis
Time-series analysis (trend and deviation):
Trend and deviation analysis: regression,sequential pattern, similar sequences,trend and deviation, e.g., stock analysis.
Similarity-based pattern-directed analysis
Full vs. partial periodicity analysis
Other pattern-directed or statistical analysis
50
Selection and Preprocessing
Data Mining
Interpretation and Evaluation
Data Consolidationand Warehousing
Knowledge
p(x)=0.02
Warehouse
The KDD process
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 26
51
A data mining system/query may generate thousands ofpatterns, not all of them are interesting.
Interestingness measures: easily understood by humans valid on new or test data with some degree of certainty. potentially useful novel, or validates some hypothesis that a user seeks to confirm
Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc. Subjective: based on user’s beliefs in the data, e.g.,
unexpectedness, novelty, etc.
Are all the discovered patterninteresting?
52
EvaluationStatistical validation and significance testingQualitative review by experts in the fieldPilot surveys to evaluate model accuracy
InterpretationInductive tree and rule models can be readdirectlyClustering results can be graphed and tabledCode can be automatically generated bysome systems (IDTs, Regression models)
Interpretation and evaluation
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 27
53
Why Data Mining Query Language?Automated vs. query-driven? Finding all the patterns autonomously in a
database?—unrealistic because the patterns could betoo many but uninteresting
Data mining should be an interactive process User directs what to be mined
Users must be provided with a set of primitivesto be used to communicate with the data miningsystemIncorporating these primitives in a data miningquery language Standardization of data mining industry and
practice
54
Primitives that Define a Data Mining Task
Task-relevant data
Type of knowledge to be mined
Background knowledge
Pattern interestingness measurements
Visualization/presentation of discovered
patterns
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 28
55
Primitive 1: Task-RelevantData
Database or data warehouse name
Database tables or data warehouse cubes
Condition for data selection
Relevant attributes or dimensions
Data grouping criteria
56
Primitive 2: Types of Knowledge to BeMined
Characterization
Discrimination
Association
Classification/prediction
Clustering
Outlier analysis
Other data mining tasks
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 29
57
Primitive 3: Background Knowledge (ConceptHierarchies)
Schema hierarchy
E.g., street < city < province_or_state < country
Set-grouping hierarchy
E.g., {20-39} = young, {40-59} = middle_aged
Operation-derived hierarchy
email address: [email protected] < department < university < country
Rule-based hierarchy
low_profit_margin (X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50
58
Primitive 4: Measurements of PatternInterestingness
Simplicity
e.g., (association) rule length, (decision) tree size
Certainty
e.g., confidence, P(A|B) = #(A and B)/ #(B),classification reliability, etc.
Utility
potential usefulness, e.g., support (association), noisethreshold (description)
Novelty
not previously known, surprising (used to removeredundant rules, e.g., Bill Clinton vs. William Clinton)
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 30
59
Primitive 5: Presentation of DiscoveredPatternsDifferent backgrounds/usages may require different
forms of representation
E.g., rules, tables, pie/bar chart, etc.
Concept hierarchy is also important
Discovered knowledge might be more understandable
when represented at high level of abstraction
Interactive drill up/down, pivoting, slicing and dicing
provide different perspectives to data
Different kinds of knowledge require different
representation: association, classification, clustering, etc.
60
DMQL—A Data Mining QueryLanguage
Motivation
A DMQL can provide the ability to support ad-hoc and interactive data mining
By providing a standardized language like SQL Hope to achieve a similar effect like that SQL has on
relational database
Foundation for system development and evolution
Facilitate information exchange, technology transfer,commercialization and wide acceptance
Design
DMQL is designed with the primitivesdescribed earlier
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 31
61
An Example Query in DMQL
62
Other Data Mining Languages &Standardization Efforts
Association rule language specifications
MSQL (Imielinski & Virmani’99)
MineRule (Meo Psaila and Ceri’96)
Query flocks based on Datalog syntax (Tsur et al’98)
OLEDB for DM (Microsoft’2000) and recently DMX
(Microsoft SQLServer 2005)
Based on OLE, OLE DB, OLE DB for OLAP, C#
Integrating DBMS, data warehouse and data mining
DMML (Data Mining Mark-up Language) by DMG
(www.dmg.org)
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 32
63
Integration of Data Mining and DataWarehousing
Data mining systems, DBMS, Data warehouse
systems coupling
On-line analytical mining data
integration of mining and OLAP technologies
Interactive mining multi-level knowledge
Necessity of mining knowledge and patterns at
different levels of abstraction.
Integration of multiple mining functions
Characterized classification, first clustering and then
association
64
Architecture: Typical Data MiningSystem
data cleaning, integration, and selection
Database or DataWarehouse Server
Data Mining Engine
Pattern Evaluation
Graphical User Interface
Knowledge-Base
Database DataWarehouse
World-WideWeb
Other InfoRepositories
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 33
65
Seminar 1 - Bibliography
Jiawei Han, Micheline Kamber, DataMining: Concepts and Techniques,Morgan Kaufmann Publishers, 2000Micheael, J. A. Berry, Gordon S.Linoff, Mastering Data Mining, Wiley,2000 Klosgen, Zytkow, Handbook of DataMining, Oxford, 2001
66
Seminar 1 - Bibliography
Jiawei Han, Micheline Kamber, Data Mining: Conceptsand Techniques, Morgan Kaufmann Publishers, 2000http://www.mkp.com/books_catalog/catalog.asp?ISBN=1-55860-489-8
• U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, R.Uthurusamy (editors). Advances in Knowledgediscovery and data mining, MIT Press, 1996.
• David J. Hand, Heikki Mannila, Padhraic Smyth,Principles of Data Mining, MIT Press, 2001.
• S. Chakrabarti, Mining the Web: Discovering Knowledgefrom Hypertext Data, Morgan Kaufmann, ISBN 1-55860-754-4, 2002
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 34
Examples of DM projectsCompetitive Intelligence
Fraud Detection,Health care,
Traffic Accident Analysis,Moviegoers database: asimple example at work
L’Oreal, a case-study oncompetitive intelligence:
Source: DM@CINECAhttp://open.cineca.it/datamining/dmCineca/
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 35
69
A small example
Domain: technology watch - a.k.a.competitive intelligence Which are the emergent technologies? Which competitors are investing on them? In which area are my competitors active? Which area will my competitor drop in the near
future?
Source of data: public (on-line) databases
70
The Derwent database
Contains all patents filed worldwide in last 10yearsSearching this database by keywords mayyield thousands of documentsDerwent document are semi-structured:many long text fields
Goal: analyze Derwent document to build amodel of competitors’ strategy
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 36
71
Structure of Derwentdocuments
72
Example dataset
Patents in the area: patch technology(cerotto medicale) 105 companies from 12 countries 94 classification codes 52 Derwent codes
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 37
73
Clustering output
74
Zoom on cluster 2
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 38
75
Zoom on cluster 2 - profilingcompetitors
76
Activity of competitors in the clusters
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 39
77
Temporal analysis of clusters
Fraud detection and auditplanning
Source: Ministero delle FinanzeProgetto Sogei, KDD Lab. Pisa
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 40
79
A major task in fraud detection isconstructing models of fraudulent behavior,for: preventing future frauds (on-line fraud detection) discovering past frauds (a posteriori fraud
detection)
analyze historical audit data to plan effectivefuture audits
Fraud detection
80
Audit planning
Need to face a trade-off betweenconflicting issues: maximize audit benefits: select subjects to
be audited to maximize the recovery ofevaded tax
minimize audit costs: select subjects to beaudited to minimize the resources neededto carry out the audits.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 41
81
Available data sourcesDataset: tax declarations, concerning atargeted class of Italian companies, integratedwith other sources: social benefits to employees, official budget
documents, electricity and telephone bills.
Size: 80 K tuples, 175 numeric attributes.A subset of 4 K tuples corresponds to theaudited companies: outcome of audits recorded as the recovery attribute
(= amount of evaded tax ascertained )
82
Data preparationoriginaldataset81 K
auditoutcomes
4 K
data consolidationdata cleaning
attribute selection
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 42
83
Cost modelA derived attribute audit_cost is definedas a function of other attributes
f audit_cost
84
Cost model and the targetvariable
recovery of an audit after the audit costactual_recovery = recovery - audit_cost
target variable (class label) of our analysis isset as the Class of Actual Recovery (c.a.r.):
negative if actual_recovery ≤ 0c.a.r. =
positive if actual_recovery > 0.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 43
85
Quality assessment indicators
The obtained classifiers are evaluatedaccording to several indicators, or metricsDomain-independent indicators confusion matrix misclassification rate
Domain-dependent indicators audit # actual recovery profitability relevance
86
Domain-dependent qualityindicators
audit # (of a given classifier): number oftuples classified as positive =
# (FP ∪ TP)actual recovery: total amount of actualrecovery for all tuples classified as positiveprofitability: average actual recovery perauditrelevance: ratio between profitability andmisclassification rate
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 44
87
The REAL case
Classifiers can be compared with theREAL case, consisting of the whole test-set:
audit # (REAL) = 366actual recovery(REAL) = 159.6 M euro
88
Model evaluation: classifier 1(min FP)
no replication in training-set (unbalance towards negative) 10-trees adaptive boosting
misc. rate = 22% audit # = 59 (11 FP) actual rec.= 141.7 Meuro profitability = 2.401
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 45
89
Model evaluation: classifier 2(min FN)
replication in training-set (balanced neg/pos) misc. weights (trade 3 FP for 1 FN) 3-trees adaptive boosting
misc. rate = 34% audit # = 188 (98 FP) actual rec.= 165.2 Meuro profitability = 0.878
Atherosclerosis preventionstudy
2nd Department of Medicine, 1st Faculty ofMedicine of Charles University and CharlesUniversity Hospital, U nemocnice 2, Prague 2(head. Prof. M. Aschermann, MD, SDr, FESC)
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 46
91
Atherosclerosis preventionstudy:
The STULONG 1 data set is a realdatabase that keeps information aboutthe study of the development ofatherosclerosis risk factors in apopulation of middle aged men.
Used for Discovery Challenge at PKDD00-02-03-04
92
Atherosclerosis preventionstudy:
Study on 1400 middle-aged men at Czechhospitals
Measurements concern development ofcardiovascular disease and other health data in aseries of exams
The aim of this analysis is to look forassociations between medical characteristicsof patients and death causes.Four tables
Entry and subsequent exams, questionnaireresponses, deaths
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 47
93
The input dataData from Entry and Exams
AlcoholLiquorsBeer 10Beer 12WineSmokingFormer smokerDuration of smokingTeaSugarCoffee
Chest painBreathlesnessCholesterolUrineSubscapularTriceps
Marital statusTransport to a jobPhysical activity in a jobActivity after a jobEducationResponsibilityAgeWeightHeight
habitsExaminationsGeneral characteristics
94
The input data
%PATIENTSDEATH CAUSE
100.0389TOTAL
5.7 22general atherosclerosis
29.3114tumorous disease
2.0 8unknown
5.9 23sudden death
20.3 79other causes
7.7 30stroke
8.5 33coronary heart disease
20.6 80myocardial infarction
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 48
95
Data selection
When joining “Entry” and “Death” tables weimplicitely create a new attribute “Cause ofdeath”, which is set to “alive” for subjectspresent in the “Entry” table but not in the“Death” table.We have only 389 subjects in death table.
96
The prepared data
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 49
97
Descriptive Analysis/ SubgroupDiscovery /Association RulesAre there strong relations concerning death cause?
1. General characteristics (?) ⇒ Death cause (?)
2. Examinations (?) ⇒ Death cause (?)
3. Habits (?) ⇒ Death cause (?)
4. Combinations (?) ⇒ Death cause (?)
98
Example of extracted rules
Education(university) & Height<176-180> ÞDeath cause (tumourosdisease), 16 ; 0.62It means that on tumorous disease havedied 16, i.e. 62% of patients with universityeducation and with height 176-180 cm.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 50
99
Example of extracted rules
Physical activity in work(he mainly sits)& Height<176-180> Þ Death cause(tumouros disease), 24; 0.52It means that on tumorous disease havedied 24 i.e. 52% of patients that mainly sitin the work and whose height is 176-180cm.
100
Example of extracted rulesEducation(university) & Height<176-180>ÞDeath cause (tumouros disease), 16; 0.62; +1.1;the relative frequency of patients who died ontumorous disease among patients with universityeducation and with height 176-180 cm is 110 percent higher than the relative frequency of patientswho died on tumorous disease among all the 389observed patients
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 51
Moviegoer Database :
102
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 52
103
moviegoers.name sex age source movies.name
Amy f 27 Oberlin Independence DayAndrew m 25 Oberlin 12 MonkeysAndy m 34 Oberlin The BirdcageAnne f 30 Oberlin TrainspottingAnsje f 25 Oberlin I Shot Andy WarholBeth f 30 Oberlin Chain ReactionBob m 51 Pinewoods Schindler's ListBrian m 23 Oberlin Super CopCandy f 29 Oberlin EddieCara f 25 Oberlin PhenomenonCathy f 39 Mt. Auburn The BirdcageCharles m 25 Oberlin KingpinCurt m 30 MRJ T2 Judgment DayDavid m 40 MRJ Independence DayErica f 23 Mt. Auburn Trainspotting
SELECT moviegoers.name, moviegoers.sex, moviegoers.age,sources.source, movies.nameFROM movies, sources, moviegoersWHERE sources.source_ID = moviegoers.source_ID AND movies.movie_ID = moviegoers.movie_IDORDER BY moviegoers.name;
104
Example: Moviegoer DatabaseClassification determine sex based on age, source, and movies
seen determine source based on sex, age, and movies
seen determine most recent movie based on past
movies, age, sex, and source
Estimation for predict, need a continuous variable (e.g., “age”) predict age as a function of source, sex, and past
movies if we had a “rating” field for each moviegoer, we
could predict the rating a new moviegoer gives toa movie based on age, sex, past movies, etc.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 53
105
Example: Moviegoer Database
Clustering find groupings of movies that are often seen by
the same people find groupings of people that tend to see the
same movies clustering might reveal relationships that are not
necessarily recorded in the data (e.g., we mayfind a cluster that is dominated by people withyoung children; or a cluster of movies thatcorrespond to a particular genre)
106
Example: Moviegoer DatabaseAssociation Rules market basket analysis (MBA): “which movies go
together?” need to create “transactions” for each moviegoer
containing movies seen by that moviegoer:
may result in association rules such as:{“Phenomenon”, “The Birdcage”} ==>{“Trainspotting”}
{“Trainspotting”, “The Birdcage”} ==> {sex = “f”}
name TID Transaction
Amy 001 {Independence Day, Trainspotting}Andrew 002 {12 Monkeys, The Birdcage, Trainspotting, Phenomenon}Andy 003 {Super Cop, Independence Day, Kingpin}Anne 004 {Trainspotting, Schindler's List}… … ...
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 54
107
Example: Moviegoer Database
Sequence Analysis similar to MBA, but order in which items
appear in the pattern is important e.g., people who rent “The Birdcage”
during a visit tend to rent “Trainspotting”in the next visit.
On the road to knowledge: mining21 years of UK traffic accidentreports
Peter Flach et al.Silnet Network of Excellence
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 55
109
Mining traffic accident reports
the Hampshire County Council (UK) wantedto obtain a better insight into how thecharacteristics of traffic accidents may havechanged over the past 20 years as a result ofimprovements in highway design and invehicle design.The database, contained police trafficaccident reports for all UK accidents thathappened in the period 1979-1999.
110
Business UnderstandingUnderstanding of road safety in order toreduce the occurrences and severity ofaccidents.
influence of road surface condition; influence of skidding; influence of location (for example: junction approach); and influence of street lighting.
trend analysis: long-term overall trends,regional trends, urban trends, and ruraltrends.the comparison of different kinds of locationsis interesting: for example, rural versusmetropolitan versus suburban.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 56
111
Data understanding
Low data quality. Many attribute valueswere missing or recorded as unknown.Different maps were created toinvestigate the effect of severalparameters like accident severity andaccident date.
112
Modelling
The aim of this effort was to findinteresting associations between roadnumber, conditions (e.g., weather, andlight) and serious or fatal accidents.Certain localities had been selectedand performed the analysis only overthe years 1998 and 1999.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 57
113
Extracted rule
The relative frequency of fatal accidentsamong all accidents in the locality was 3%.The relative frequency of fatal accidents onthe road (V61) under fine weather with nowinds was 9.6% — more than 3 timesgreater.
114
Seminar 1 Case studies -Bibliography
Artif. Intell. Med. 2001 Jun;22(3), Special Issue on DataMining in MedicineKlosgen, Zytkow, Handbook of Data Mining, Oxford, 2001Micheael, J. A. Berry, Gordon S. Linoff, Mastering DataMining, Wiley, 2000 ECML/PKDD2004 Discovery Challengehomepage[http://lisp.vse.cz/challenge/ecmlpkdd2004/]On the road to knowledge: mining 21 years of UK trafficaccident reports, in Data Mining and Decision Support: Aspectsof Integration and Collaboration, pages 143--155. Kluwer AcademicPublishers, January 2003 ata Mining and Decision Support:Aspects of Integration and Collaboration, pages 143--155. Kluwer Academic Publishers, January 2003
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 58
How to develop a DataMining Project?
116
CRISP-DM: The life cicle of adata mining project
KDD Process
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 59
117
Business understandingUnderstanding the project objectivesand requirements from a businessperspective. then converting this knowledge into adata mining problem definition and apreliminary plan. Determine the Business Objectives Determine Data requirements for
Business Objectives Translate Business questions into Data
Mining Objective
118
Data understanding
Data understanding: characterize dataavailable for modelling. Provide assessmentand verification for data.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 60
119
Modeling
In this phase, various modeling techniquesare selected and applied and theirparameters are calibrated to optimal values.Typically, there are several techniques forthe same data mining problem type. Sometechniques have specific requirements on theform of data.Therefore, stepping back to the datapreparation phase is often necessary.
120
EvaluationAt this stage in the project you have built amodel (or models) that appears to have highquality from a data analysis perspective.Evaluate the model and review the stepsexecuted to construct the model to be certainit properly achieves the business objectives.A key objective is to determine if there issome important business issue that has notbeen sufficiently considered.
Data mining per l'analisi dei dati nella PA
Pisa, 9-10-11 Settembre 2004 61
121
DeploymentThe knowledge gained will need to be organizedand presented in a way that the customer canuse it.It often involves applying “live” models within anorganization’s decision making processes, forexample in real-time personalization of Webpages or repeated scoring of marketingdatabases.
122
Deployment
It can be as simple as generating a reportor as complex as implementing arepeatable data mining process across theenterprise.
In many cases it is the customer, not thedata analyst, who carries out thedeployment steps.