

Institut für Softwarewissenschaften – Universität Wien, P. Brezany

Data Mining and Data Warehousing

Prof. Dr. Peter Brezany
Institut für Softwarewissenschaften
Universität Wien
Tel. 4277 38825
Office hours: Tue, 12.30-13.30

Outline

• Business Intelligence and its components
• Knowledge discovery in databases
• Data mining techniques
– description
– classification
– prediction
– clustering
– neural networks
• Commercial data mining systems (demo of the SAS Enterprise Miner)
• Data webhousing
• Advanced topics: parallel and distributed data analysis


Literature

Mark and Mary Whitehorn: Business Intelligence: The IBM Solution. Springer-Verlag, 2000.

R. Kimball: The Data Warehouse Toolkit. John Wiley, 1996.

J. Han, M. Kamber: Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2000.

M. Ester, J. Sander: Knowledge Discovery in Databases. Springer-Verlag, 2000.

I.H. Witten, E. Frank: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann Publishers, 2000.

Business Intelligence

Definition:

Business Intelligence is an umbrella term, broadly covering the processes involved in extracting valuable business information and knowledge from the mass of data that exists within a typical enterprise.


Business Intelligence Tools

• Data warehouses

• OLAP (On-Line Analytical Processing) tools

• Data mining tools

• Text mining tools

• Web mining tools

• Data joiners

• Business Intelligence portals, etc.

[Slide callout: the focus of our lectures]

Business Intelligence Tools (cont.)

• Data warehouse – a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making.
• OLAP – analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.
• Data mining – extracting or "mining" knowledge from large data sets.
• Text mining – "mining" large textual (document) databases.
• Web mining – discovering knowledge from hypertext data.
• Data joiner – working with data from disparate, heterogeneous data sources.
• Business Intelligence portal – a Web site designed to be the first point of entry for visitors to information about a company. With the help of the portal's personalization functions, users can choose the information sources they need for a specific task. The portal gives trouble-free access to valuable information and data analyses, which improves the basis for competent decisions.


DATA MINING

Introduction

• This part of the lecture covers what has come to be known as data mining and knowledge discovery in large databases, data warehouses, and other massive information repositories.
• Data mining emerged during the late 1980s, made great strides during the late 1990s, and is expected to continue to flourish into the new millennium.
• The implementation methods discussed are particularly oriented towards the development of scalable and efficient data mining tools.
• We introduce interesting data mining techniques and systems, and discuss applications and research directions.


What Motivated Data Mining? Why Is It Important?

• Huge amounts of data are widely available, and there is an imminent need to turn such data into useful information and knowledge.

• Applications range from business management, production control, and market analysis to engineering design and science exploration.

• Data mining can be viewed as a result of the natural evolution of information technology - including database technology, artificial intelligence, machine learning, neural networks, statistics, pattern-recognition, knowledge-based systems, high-performance computing, and data visualization.

Collecting Data

[Figure: data sources (satellites; laboratories with microscopes, MRI/CT scanners, ...; computer simulations; experiments in high energy physics, ...; business), data repositories, and analysis.]


CERN's Challenge

• Starting point
– New detector LHC
» Large Hadron Collider, 14 TeV
» Goals: search for the Higgs boson and graviton (and others)
– Start 2006
• Challenges
– Data are accessed worldwide
» CERN and Regional Centers (Europe, Asia, America)
» 2000 users
– Huge data volumes
– Data semantics
– Performance and throughput

The LHC Detectors

[Figure: the LHC detectors CMS, ATLAS, and LHCb.]


Multi-Tier Model

The Evolution of Database Technology

Data Collection and Database Creation (1960s and earlier)
- Primitive file processing

Database Management Systems (1970s-early 1980s)
- Hierarchical, network and relational DB systems
- Query languages (SQL, etc.), query optimization
- Transaction management, concurrency control, recovery
- Data modeling tools

Advanced Database Systems (mid-1980s-present)
- Object-oriented, object-relational, spatial, temporal, multimedia

Web-based Database Systems (1990s-present)
- XML-based DB systems
- Web mining

Data Warehousing and Data Mining (late 1980s-present)
- Data warehouse and OLAP technology
- Data mining and knowledge discovery


Database Querying and Data Mining

Query languages like SQL are standardized and powerful, but they are too difficult for unskilled users. OLAP tools allow flexible multidimensional queries. Their methods are query-centric.

[Figure: query languages like SQL, OLAP tools, and data mining tools, shown together with a data warehouse.]


We Are Data Rich, But Information Poor


So, What Is Data Mining?

Data mining – searching for knowledge (interesting patterns) in your data.

Data Mining As a Step in the Process of Knowledge Discovery

• Many people treat data mining as a synonym for the term Knowledge Discovery in Databases, or KDD.

• Alternative view: data mining is one step in the KDD process:
– 1. Data cleaning (to remove noise and inconsistent data)
– 2. Data integration (where multiple data sources may be combined)
– 3. Data selection (where data relevant to the analysis task are retrieved from the database)
– 4. Data transformation (where data are transformed or consolidated into forms appropriate for mining, for instance by performing summary or aggregation operations)
– 5. Data mining (an essential process where intelligent methods are applied in order to extract patterns)
– 6. Pattern evaluation (to identify the truly interesting patterns representing knowledge, based on some interestingness measures)
– 7. Knowledge presentation to the user
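The seven KDD steps can be pictured as a chain of small functions. The following is only a minimal sketch: the record layout (`customer`, `amount`, `channel`), the two sample sources, and the spending limit of 1000 are hypothetical, and the "mining" step is a trivial labeling rule standing in for a real mining method.

```python
def clean(records):
    # Step 1: drop records that contain missing values.
    return [r for r in records if None not in r.values()]

def integrate(*sources):
    # Step 2: combine several data sources into one list.
    return [r for source in sources for r in source]

def select(records, fields):
    # Step 3: keep only the fields relevant to the analysis task.
    return [{f: r[f] for f in fields} for r in records]

def transform(records):
    # Step 4: aggregate the purchase amounts per customer.
    totals = {}
    for r in records:
        totals[r["customer"]] = totals.get(r["customer"], 0) + r["amount"]
    return totals

def mine(totals, limit):
    # Step 5: stand-in for a real mining method; label each customer.
    return {c: ("high" if t >= limit else "low", t) for c, t in totals.items()}

def evaluate(patterns):
    # Step 6: keep only the patterns considered interesting here.
    return {c: p for c, p in patterns.items() if p[0] == "high"}

def present(patterns):
    # Step 7: turn the remaining patterns into readable text.
    return [f"{c}: {label} spender (total {total})"
            for c, (label, total) in patterns.items()]

# Hypothetical sample data from two sources.
store_a = [{"customer": "C1", "amount": 700, "channel": "web"},
           {"customer": "C2", "amount": None, "channel": "shop"}]
store_b = [{"customer": "C1", "amount": 450, "channel": "shop"},
           {"customer": "C3", "amount": 120, "channel": "web"}]

data = integrate(clean(store_a), clean(store_b))
totals = transform(select(data, ["customer", "amount"]))
report = present(evaluate(mine(totals, limit=1000)))
print(report)
```

In a real system each step would be far more involved; the sketch only shows how the output of one step feeds the next.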


Data Mining in Knowledge Discovery

Architecture of a Data Mining System

[Figure: system components: graphical user interface, pattern evaluation, data mining engine, knowledge base, database or data warehouse server, and the underlying database and data warehouse, which are accessed through data cleaning, data integration, and filtering.]


Architecture of a Data Mining System (2)

Database, data warehouse, or other information repository: one or a set of databases, data warehouses, spreadsheets, etc.

Database or data warehouse server: responsible for fetching the relevant data, based on the user's data mining request.

Knowledge base: domain knowledge that is used to guide the search, or evaluate the interestingness of resulting patterns. Such knowledge can include concept hierarchies, used to organize attribute values into different levels of abstraction.

Data mining engine: essential to the data mining system; ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

Architecture of a Data Mining System (3)

Pattern evaluation module: This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.

Graphical user interface: This module communicates between users and the data mining system, allowing the user to
• specify a data mining query or task
• provide information to help focus the search
• perform exploratory data mining based on the intermediate data mining results
• browse database and data warehouse schemas or data structures
• evaluate mined patterns
• visualize the patterns in different forms.


Data Mining vs. Other Disciplines

From a data warehouse perspective, data mining can be viewed as an advanced stage of on-line analytical processing (OLAP). However, data mining goes far beyond OLAP.

There may be many "data mining systems" on the market, but not all of them can perform true data mining. Those that cannot should be more appropriately categorized as
• machine learning systems
• statistical data analysis tools
• experimental system prototypes
• deductive database systems

Data mining integrates techniques from multiple disciplines: database technology, statistics, machine learning, high-performance computing, neural networks, pattern recognition, and visualization.

Data Mining - On What Kind of Data?

• Relational Databases
• Data Warehouses
• Transactional Databases
• Advanced Database Systems and Advanced Database Applications
– Object-oriented databases
– Object-relational databases
– Spatial databases
– Temporal databases and time-series databases
– Text databases and multimedia databases
– Heterogeneous databases and legacy databases
– The World Wide Web


Relational Database - An Example

• A database system, also called a database management system (DBMS), consists of a collection of interrelated data, known as a database, and a set of software programs to manage and access the data.

• A relational database is a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple represents an object identified by a unique key.

• Relational data can be accessed by database queries written in a relational query language, such as SQL.

• Using data mining, one can search for trends or data patterns in relational databases.

Relational Databases - Example

The AllElectronics company is described by the following relation tables: customer, item, employee, and branch. Fragments of these tables are shown on the next slide; the attribute that represents the key or composite key component is underlined.

• The relation customer consists of a set of attributes, including a unique customer identity number (cust_ID), and so on.

• Tables can also be used to represent the relationships between or among multiple relational tables. E.g., these include purchases (customer purchases items, creating a sales transaction that is handled by an employee), items_sold (lists the items sold in the given transaction), and works_at (employee works at a branch of AllElectronics).
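To illustrate how relational data of this kind is accessed with SQL, here is a small sketch using Python's built-in `sqlite3` module. Only the table names `customer` and `purchases` and the key `cust_ID` come from the slides; the remaining columns (`name`, `trans_ID`, `amount`) and all rows are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical fragments of two AllElectronics relations.
    CREATE TABLE customer (cust_ID TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE purchases (
        trans_ID TEXT PRIMARY KEY,
        cust_ID  TEXT REFERENCES customer(cust_ID),
        amount   REAL
    );
    INSERT INTO customer VALUES ('C1', 'Smith'), ('C2', 'Jones');
    INSERT INTO purchases VALUES
        ('T100', 'C1', 1200.0), ('T200', 'C1', 300.0), ('T300', 'C2', 80.0);
""")

# A typical relational query: total purchase amount per customer.
rows = conn.execute("""
    SELECT c.name, SUM(p.amount) AS total
    FROM customer AS c
    JOIN purchases AS p ON p.cust_ID = c.cust_ID
    GROUP BY c.cust_ID, c.name
    ORDER BY total DESC
""").fetchall()
print(rows)
conn.close()
```

A query like this answers a question the user already knows how to ask; data mining goes further by searching for trends or patterns the user did not specify in advance.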


Fragments of Relations from AllElectronics DB

Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site.

Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.

The figure on the next slide shows the basic architecture of a data warehouse for AllElectronics.

In order to facilitate decision making, the data in a data warehouse are organized around major subjects, such as customer, item, supplier, and activity. The data are stored from a historical perspective and are typically summarized.


Architecture of a Data Warehouse

[Figure: data sources in Chicago, New York, Toronto, and Vancouver are cleaned, transformed, integrated, and loaded into the data warehouse, which clients access through query and analysis tools.]

Modeling a Data Warehouse

A data warehouse is usually modeled by a multidimensional database structure, where each dimension corresponds to an attribute in the schema and each cell stores the value of some aggregate measure, such as count or sales_amount.

The actual physical structure of a data warehouse may be a relational data store or a multidimensional data cube. It provides a multidimensional view of data and allows the precomputation and fast accessing of summarized data.

Example: A data cube for summarized sales data of AllElectronics is presented on the next slide.


A Multidimensional Data Cube

Modeling a Data Warehouse (2)

Data warehouse vs. data mart: A data warehouse collects information about subjects that span an entire organization, and thus its scope is enterprise-wide. A data mart, by contrast, is department-wide in scope.

Data warehouse systems are well suited for On-Line Analytical Processing, or OLAP.

OLAP operations allow the presentation of data at different levels of abstraction.

Examples of OLAP operations include drill-down and roll-up, which allow the user to view the data at different degrees of summarization, as illustrated on the previous slide.
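Roll-up and drill-down can be sketched with a tiny in-memory cube. The dimensions (`city`, `quarter`, `item`), the cell values, and the helper `roll_up` are hypothetical; only the measure name `sales_amount` is taken from the slides.

```python
from collections import defaultdict

# Hypothetical cube cells: one sales_amount per (city, quarter, item).
sales = [
    {"city": "Vancouver", "quarter": "Q1", "item": "computer", "sales_amount": 800},
    {"city": "Vancouver", "quarter": "Q2", "item": "computer", "sales_amount": 900},
    {"city": "Toronto", "quarter": "Q1", "item": "computer", "sales_amount": 700},
    {"city": "Toronto", "quarter": "Q1", "item": "phone", "sales_amount": 100},
]

def roll_up(cells, keep):
    """Sum sales_amount over every dimension that is not listed in keep."""
    totals = defaultdict(int)
    for cell in cells:
        key = tuple(cell[dim] for dim in keep)
        totals[key] += cell["sales_amount"]
    return dict(totals)

# Roll-up: summarize to one value per city.
by_city = roll_up(sales, ["city"])

# Drill-down: return to a finer view, one value per city and quarter.
by_city_quarter = roll_up(sales, ["city", "quarter"])

print(by_city)
print(by_city_quarter)
```

A real OLAP server would precompute and store many of these aggregates instead of scanning the cells on every request.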


Transactional Databases

A transactional database consists of a file where each record represents a transaction.

A transaction includes a unique transaction identity number (trans_id) and a list of the items making up the transaction (such as items purchased in a store).

The transactional database may have additional tables associated with it, which contain other information regarding the sale, such as the date of the transaction, the customer ID number, the ID number of the salesperson, etc.

Example: Transactions can be stored in a table, with one record per transaction. A fragment of a transactional database for AllElectronics is shown on the next slide.

Transactional Databases (2)

trans_id    list of item_IDs
T100        I1, I3, I8, I16
...         ...

The transactional database is usually either stored in a flat file in a format similar to that of the above table, or unfolded into a standard relation in a format similar to that of the items_sold table of the AllElectronics relational database (slide no. 18).

A regular data retrieval system is not able to answer queries like "Which items sold well together?"
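A question such as "Which items sold well together?" can be approached by counting how often each pair of items occurs in the same transaction. The sketch below is one simple way to do that; only transaction T100 comes from the slide, and the other transactions are made up for illustration.

```python
from collections import Counter
from itertools import combinations

# T100 is taken from the slide; T200 to T400 are hypothetical.
transactions = {
    "T100": ["I1", "I3", "I8", "I16"],
    "T200": ["I1", "I3"],
    "T300": ["I3", "I8"],
    "T400": ["I1", "I3", "I16"],
}

pair_counts = Counter()
for items in transactions.values():
    # Sort so that each pair is always counted under the same key.
    pair_counts.update(combinations(sorted(set(items)), 2))

best_pair, best_count = pair_counts.most_common(1)[0]
print(best_pair, best_count)
```

Counting all pairs this way becomes expensive for large item catalogs, which is one reason dedicated association mining methods exist.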


Advanced Database Systems and Database Applications

Relational DB systems have been widely used in business applications.

The new database applications include handling
• spatial data (e.g., maps)
• engineering design data (e.g., the design of buildings or integrated circuits)
• hypertext and multimedia data (text, image, video, audio data)
• time-related data (e.g., stock exchange data)
• the World Wide Web (a huge, widely distributed information repository made available by the Internet)

Data Mining Functionalities - What Kinds of Patterns Can Be Mined?

• Data mining functionalities are used to specify the kind of patterns that can be found in data mining tasks.
• Data mining tasks can be classified into 2 categories:
– Descriptive - they characterize the general properties of the data in the database.
– Predictive - they perform inference on the current data in order to make predictions.
• In some cases, users may have no idea which kinds of patterns may be interesting, so a system should be able to search for several different kinds of patterns in parallel.
• Data mining systems should be able to discover patterns at various granularities (abstraction levels).
• Users should be able to specify hints to guide or focus the search.


Concept/Class Description: Characterization and Discrimination

• Data can be associated with classes and concepts.
• E.g., in the AllElectronics store,
– classes of items for sale include computers and printers, and
– concepts of customers include bigSpenders and budgetSpenders.
• It can be useful to describe individual classes and concepts in summarized, concise, and yet precise terms => class/concept descriptions.
• These descriptions can be derived via
– 1. Data characterization, by summarizing the data of the class under study (often called the target class) in general terms, or
– 2. Data discrimination, by comparison of the target class with one or a set of comparative classes (often called contrasting classes), or
– 3. Both data characterization and discrimination.

Concept/Class Description - Examples

• Example 1: A description summarizing the characteristics of customers who spent more than $1000 a year at AllElectronics. The result could be a general profile of the customers, such as: they are 40-50 years old, employed, and have excellent credit ratings.

• Example 2: A comparison of 2 customer groups, such as those who shop for computer products regularly versus those who rarely shop for such products. The resulting description could be a general comparative profile of the customers, such as: 80% of the customers who frequently purchase computer products are between 20 and 40 years old and have a university education, whereas 60% of the customers who infrequently buy such products are seniors and have no university degree.


Association Analysis

• Association analysis is the discovery of association rules showing attribute-value conditions that occur frequently in a given set of data.
• The association rule X => Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y."
• Example: A data mining system may find in AllElectronics:
  age(X, "20..29") and income(X, "20K..29K") => buys(X, "CD player") [support = 2%, confidence = 60%]
• X is a variable representing a customer. The rule indicates that of the customers under study, 2% are 20 to 29 years of age with an income of 20K to 29K and have purchased a CD player. There is a 60% probability that a customer in this age and income group will purchase a CD player.

Association Analysis (Cont.)

• We would like to determine which items are frequently purchased together within the same transactions. E.g.,
  contains(T, "computer") => contains(T, "software") [support = 1%, confidence = 50%]
• Explanation: if a transaction, T, contains "computer", there is a 50% chance that it contains "software" as well, and 1% of all of the transactions contain both.
• This rule involves a single attribute or predicate (i.e., contains) => single-dimensional association rule. It can be written simply as "computer => software [1%, 50%]". Remark: the rule on the previous slide is a multi-dimensional association rule.
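Support and confidence of a rule such as "computer => software" can be computed directly from a list of transactions. The following sketch uses four made-up transactions, so the resulting percentages differ from the 1% and 50% quoted on the slide; the function names are chosen for the example.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(transactions, antecedent, consequent):
    """Fraction of transactions containing antecedent that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)

# Hypothetical transactions.
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"software", "printer"},
]

rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"computer => software [{rule_support:.0%}, {rule_confidence:.0%}]")
```

Here one of four transactions contains both items (support 25%), and one of the two transactions with a computer also contains software (confidence 50%).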


Classification and Prediction

• Classification is the process of finding a set of models (or functions) that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
• "How is the derived model presented?"
– Classification (IF-THEN) rules
– Mathematical formulae
– Decision tree - a flow-chart-like tree structure, where each node denotes a test on an attribute value, each branch represents an outcome of the test, and the tree leaves represent classes or class distributions.
– Neural networks - a collection of neuron-like processing units with weighted connections between the units.

Classification and Prediction (Cont.)

• Prediction - in many applications, users may wish to predict some missing or unavailable data values rather than class labels. The predicted values are usually numerical data.
• Prediction also encompasses the identification of distribution trends based on the available data.
• Classification and prediction may need to be preceded by relevance analysis, which attempts to identify attributes that do not contribute to the classification or prediction process. These attributes can then be excluded.
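As one possible illustration of predicting a missing numerical value, the sketch below fits a straight line by ordinary least squares. The slides do not name a specific prediction method, so linear regression is an assumption here, and the data points are invented.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical known values, e.g. years of membership vs. yearly spending.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)

# Predict the unavailable value for x = 5.
predicted = slope * 5 + intercept
print(slope, intercept, predicted)
```

Because the sample points lie exactly on a line, the fit recovers slope 2 and intercept 1; real data would leave residual error.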


Classification and Prediction (Cont.)

• Example: Classification of a large set of items in the store based on three kinds of responses to a sales campaign: good response, mild response, and no response. We would like to derive a model for each of these classes based on the descriptive features of the items, such as price, brand, place_made, type, and category.

• The resulting classification should maximally distinguish each class from the others, presenting an organized picture of the data set.

• A decision tree, for instance, may identify price as being the single factor that best distinguishes the three classes. Such a decision tree may help us understand the impact of the given campaign and design a more effective campaign for the future.
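How a decision tree might single out price as the most distinguishing attribute can be sketched with information gain, a common attribute selection measure. The slides do not say which measure is used, so this choice is an assumption, and the six items below are invented.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, attribute, target):
    """Reduction in entropy of target after splitting rows on attribute."""
    base = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder

# Hypothetical items with their response to the sales campaign.
items = [
    {"price": "low", "brand": "A", "response": "good"},
    {"price": "low", "brand": "B", "response": "good"},
    {"price": "medium", "brand": "A", "response": "mild"},
    {"price": "medium", "brand": "B", "response": "mild"},
    {"price": "high", "brand": "A", "response": "none"},
    {"price": "high", "brand": "B", "response": "none"},
]

gains = {a: information_gain(items, a, "response") for a in ("price", "brand")}
best = max(gains, key=gains.get)
print(gains, best)
```

In this toy data, price separates the three response classes perfectly while brand tells nothing, so price would become the root test of the tree.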

Cluster Analysis

• Clustering analyzes data objects without consulting a known class label.
• Clustering can be used to generate such labels.
• The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
• Each cluster can be viewed as a class of objects, from which rules can be derived.
• Example: Cluster analysis can be performed on AllElectronics customer data in order to identify homogeneous subpopulations of customers. These clusters may represent individual target groups for marketing. (The figure on the next slide shows a 2-D plot of customers with respect to customer locations in a city.)


Cluster Analysis - Example

A 2-D plot of customer data with respect to customer locations in a city, showing 3 data clusters. Each cluster "center" is marked with a "+".
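Three clusters with marked centers, as in the plot, are what a method like k-means produces. The slides do not name the clustering algorithm, so k-means is an assumption here; the customer locations and the starting centers are made up.

```python
from math import dist

def k_means(points, centers, iterations=10):
    """Plain k-means: assign points to the nearest center, then recompute centers."""
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        # New center = mean of the cluster; an empty cluster keeps its old center.
        centers = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters

# Hypothetical customer locations (x, y) in a city.
locations = [(1, 1), (1, 2), (2, 1),
             (8, 8), (8, 9), (9, 8),
             (1, 8), (2, 9), (1, 9)]

centers, clusters = k_means(locations, centers=[(1, 1), (8, 8), (1, 8)])
for center, cluster in zip(centers, clusters):
    print(center, cluster)
```

Each returned center plays the role of a "+" in the plot. The result of k-means depends on the starting centers, which are fixed here to keep the example deterministic.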

Outlier Analysis

• A database may contain data objects that do not comply with the general behaviour or model of the data. These data objects are outliers.
• Most data mining methods discard outliers as noise or exceptions.
• In some applications, such as fraud detection, the rare events can be more interesting than the more regularly occurring ones.
• Example: Outlier analysis may uncover fraudulent usage of credit cards by detecting purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account.
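The credit card example can be sketched by comparing each charge of an account with that account's typical charge. The rule below (flag anything above five times the median) is only one simple, assumed criterion, and the charges are invented.

```python
from statistics import median

def flag_outliers(charges, factor=5.0):
    """Return the charges that exceed factor times the median charge."""
    typical = median(charges)
    return [c for c in charges if c > factor * typical]

# Hypothetical charges for a single account number.
charges = [42.0, 18.5, 63.0, 25.0, 2950.0, 37.5]

suspicious = flag_outliers(charges)
print(suspicious)
```

The median is used instead of the mean because a single very large charge would pull the mean upwards and could hide itself.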


Evolution Analysis

• It describes and models regularities or trends for objects whose behavior changes over time.
• It includes time-series data analysis.
• Example: Suppose that we have the major stock market (time-series) data of the last several years available from the New York Stock Exchange and we would like to invest in shares of high-tech industrial companies. A data mining study of stock exchange data may identify stock evolution regularities for overall stocks and for the stocks of particular companies. Such regularities may help predict future trends in stock market prices.
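A very simple way to expose a trend in time-series data is a moving average, which smooths out short-term fluctuations. This is only an assumed stand-in for the evolution analysis described above, and the price series is invented.

```python
def moving_average(values, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(values[i - window + 1:i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Hypothetical closing prices of one stock on consecutive days.
prices = [10, 11, 13, 12, 15, 17, 16, 19]

smoothed = moving_average(prices, window=3)
rising = all(later > earlier for earlier, later in zip(smoothed, smoothed[1:]))
print(smoothed, rising)
```

The raw prices dip twice, but the smoothed series rises steadily, which is the kind of regularity a trend analysis would report.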

Are All of the Patterns Interesting?

• A data mining system has the potential to generate thousands or even millions of patterns, or rules.
• Only a small fraction of the patterns potentially generated would actually be of interest to any given user.
• Questions: What makes a pattern interesting? Can a data mining system generate all of the interesting patterns? Can a data mining system generate only interesting patterns?
• A pattern is interesting if it is
– (1) easily understood by humans,
– (2) valid on new or test data with some degree of certainty,
– (3) potentially useful, and
– (4) novel.


Interestingness of Patterns (Cont.)

• A pattern is also interesting if it validates a hypothesis that the user sought to confirm.
• An interesting pattern represents knowledge.
• Objective measures of pattern interestingness - these are based on the structure of discovered patterns and the statistics underlying them.
– An objective measure for association rules X => Y is rule support, representing the percentage of transactions from a transaction database that the given rule satisfies. This is taken to be the probability P(X U Y), where X U Y indicates that a transaction contains both X and Y, that is, the union of item sets X and Y.
– Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.
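Written as formulas, with confidence understood as the probability that a transaction containing X also contains Y, the two measures relate as follows. The notation is a sketch; the operator names are chosen here for readability.

```latex
\begin{align*}
\operatorname{support}(X \Rightarrow Y) &= P(X \cup Y), \\
\operatorname{confidence}(X \Rightarrow Y) &= P(Y \mid X) = \frac{P(X \cup Y)}{P(X)}.
\end{align*}
```

As a check against the earlier rule "computer => software" with support 1% and confidence 50%: P(computer) = 0.01 / 0.50 = 0.02, so 2% of all transactions would contain a computer.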

Interestingness of Patterns (Cont.)

• Each interestingness measure is associated with a threshold, which can be controlled by the user.
– For example, rules that do not satisfy a confidence threshold of, say, 50% can be considered uninteresting. Rules below the threshold likely reflect noise, exceptions, or minority cases and are probably of less value.
• Objective measures are insufficient unless combined with subjective measures which reflect the needs and interests of a particular user.
• Many patterns that are interesting by objective standards may represent common knowledge and, therefore, are actually uninteresting.
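Applying user-controlled thresholds amounts to a simple filter over the discovered rules. In the sketch below the first two rules reuse the figures from the association analysis slides, the third rule is hypothetical, and the function name and default thresholds are assumptions.

```python
def interesting_rules(rules, min_confidence=0.5, min_support=0.0):
    """Keep only the rules that meet both user-chosen thresholds."""
    return [
        r for r in rules
        if r["confidence"] >= min_confidence and r["support"] >= min_support
    ]

rules = [
    {"rule": "computer => software", "support": 0.01, "confidence": 0.50},
    {"rule": "age 20..29, income 20K..29K => CD player", "support": 0.02, "confidence": 0.60},
    {"rule": "printer => scanner", "support": 0.03, "confidence": 0.20},  # hypothetical
]

kept = interesting_rules(rules, min_confidence=0.5)
print([r["rule"] for r in kept])
```

Such a filter covers only the objective side; whether a surviving rule is common knowledge or actually useful still has to be judged by the user.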


Interestingness of Patterns (Cont.)

• Subjective interestingness measures are based on user beliefs in the data. These measures find patterns interesting if they are unexpected or offer strategic information on which the user can act.

• Patterns that are expected can be interesting if they confirm a hypothesis that the user wished to validate, or resemble a user’s hunch.