ADVANCED ALGORITHMS IN COMPUTATIONAL BIOLOGY (C3)
This is part 1, where I will cover the first two weeks' lectures:
2012/02/24: DATABASES: AN OVERVIEW
2012/03/02: INTRODUCTION TO DATA MINING


Page 1

ADVANCED ALGORITHMS IN COMPUTATIONAL BIOLOGY (C3)

This is part 1, where I will cover the first two weeks' lectures:

• 2012/02/24: DATABASES: AN OVERVIEW
• 2012/03/02: INTRODUCTION TO DATA MINING

Page 2

Class Info

• Lecturer: Chi-Yao Tseng (曾祺堯), [email protected]
• Grading:
  – No assignments
  – Midterm:
    • 2012/04/20
    • I'm in charge of 17x2 points out of 120
    • No take-home questions

Page 3

Outline

• Introduction
  – From data warehousing to data mining
• Mining Capabilities
  – Association rules
  – Classification
  – Clustering

• More about Data Mining

Page 4

Main Reference

• Jiawei Han, Micheline Kamber. Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006.
  – Official website: http://www.cs.uiuc.edu/homes/hanj/bk2/

Page 5

Why Data Mining?

• The explosive growth of data: from terabytes to petabytes (10^15 B = 1 million GB)
  – Data collection and data availability
    • Automated data collection tools, database systems, the Web, a computerized society
  – Major sources of abundant data
    • Business: Web, e-commerce, transactions, stocks, …
    • Science: remote sensing, bioinformatics, scientific simulation, …
    • Society and everyone: news, digital cameras, YouTube, Facebook
• We are drowning in data, but starving for knowledge!
• "Necessity is the mother of invention": data mining, the automated analysis of massive data sets

Page 6

Why Not Traditional Data Analysis?

• Tremendous amount of data
  – Algorithms must be highly scalable to handle, e.g., terabytes of data
• High dimensionality of data
  – Microarray data may have tens of thousands of dimensions
• High complexity of data
• New and sophisticated applications

Page 7

Evolution of Database Technology

• 1960s:
  – Data collection, database creation, IMS and network DBMS
• 1970s:
  – Relational data model, relational DBMS implementation
• 1980s:
  – RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
  – Application-oriented DBMS (spatial, scientific, engineering, etc.)
• 1990s:
  – Data mining, data warehousing, multimedia databases, and Web databases
• 2000s:
  – Stream data management and mining
  – Data mining and its applications
  – Web technology (XML, data integration) and global information systems

Page 8

What is Data Mining?

• Knowledge discovery in databases
  – Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data.
• Alternative names:
  – Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Page 9

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications

– Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications

– Data streams and sensor data

– Time-series data, temporal data, sequence data (incl. bio-sequences)

– Structured data, graphs, social networks and multi-linked data

– Object-relational databases

– Heterogeneous databases and legacy databases

– Spatial data and spatiotemporal data

– Multimedia database

– Text databases

– The World-Wide Web

Page 10

Knowledge Discovery (KDD) Process

[Figure: the KDD process]
Databases → (Data Cleaning & Integration) → Data warehouse → (Selection & Transformation) → Transformed data → (Data Mining) → Patterns → (Interpretation / Evaluation) → Knowledge!

• This is a view from the typical database systems and data warehousing communities.
• Data mining plays an essential role in the knowledge discovery process.

Page 11

Data Mining and Business Intelligence

[Figure: layers from data sources up to decision making, with increasing potential to support business decisions toward the top, and the typical user at each layer]

• Decision Making (End User)
• Data Presentation: visualization techniques (Business Analyst)
• Data Mining: information discovery (Data Analyst)
• Data Exploration: statistical summary, querying, and reporting (Data Analyst)
• Data Preprocessing/Integration, Data Warehouses (DBA)
• Data Sources: paper, files, Web documents, scientific experiments, database systems

Page 12

Data Mining: Confluence of Multiple Disciplines

Page 13

Typical Data Mining System

[Figure: layered architecture of a typical data mining system, top to bottom]
• Graphical User Interface
• Pattern Evaluation (consulting the Knowledge Base)
• Data Mining Engine
• Database or Data Warehouse Server (data cleaning, integration, and selection)
• Data sources: database, data warehouse, World-Wide Web, other info. repositories

Page 14

Data Warehousing

• A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process. (W. H. Inmon)

Page 15

Data Warehousing

• Subject-oriented:
  – Provides a simple and concise view around particular subject issues by excluding data that are not useful in the decision support process.
• Integrated:
  – Constructed by integrating multiple, heterogeneous data sources.
• Time-variant:
  – Provides information from a historical perspective (e.g., the past 5-10 years).
• Nonvolatile:
  – Operational updates of data do not occur in the data warehouse environment.
  – Usually requires only two operations: load data & access data.

Page 16

Data Warehousing

• The process of constructing and using data warehouses
• A decision support database that is maintained separately from the organization's operational database
• Supports information processing by providing a solid platform of consolidated, historical data for analysis
• Sets the stage for effective data mining

Page 17

Illustration of Data Warehousing

[Figure: data sources in Taipei, New York, London, … go through Clean / Transform / Integrate / Load into the Data Warehouse, which clients query through Query and Analysis Tools]

Page 18

OLTP vs. OLAP

• OLTP (On-line Transaction Processing): runs on an operational database; short online transactions (update, insert, delete); current & detailed data; versatile.
• OLAP (On-line Analytical Processing): runs on a data warehouse; complex queries; aggregated & historical data; static and low volume; supports analytics, data mining, and decision making.

Page 19

Multi-Dimensional View of Data Mining

• Data to be mined
  – Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
• Knowledge to be mined
  – Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  – Multiple/integrated functions and mining at multiple levels
• Techniques utilized
  – Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
• Applications adapted
  – Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Page 20

Mining Capabilities (1/4)

• Multi-dimensional concept description: characterization and discrimination
  – Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
• Frequent patterns (or frequent itemsets), association
  – Diaper → Beer [0.5%, 75%] (support, confidence)

Page 21

Mining Capabilities (2/4)

• Classification and prediction
  – Construct models (functions) that describe and distinguish classes or concepts for future prediction
    • E.g., classify countries based on climate, or classify cars based on gas mileage
  – Predict some unknown or missing numerical values

Page 22

Mining Capabilities (3/4)

• Clustering
  – Class label is unknown: group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
  – Maximize intra-class similarity & minimize inter-class similarity
• Outlier analysis
  – Outlier: a data object that does not comply with the general behavior of the data
  – Noise or exception? Useful in fraud detection and rare-event analysis

Page 23

Mining Capabilities (4/4)

• Time and ordering: trend and evolution analysis
  – Trend and deviation: e.g., regression analysis
  – Sequential pattern mining: e.g., digital camera → large SD memory card
  – Periodicity analysis
  – Motifs and biological sequence analysis
    • Approximate and consecutive motifs
  – Similarity-based analysis

Page 24

More Advanced Mining Techniques

• Data stream mining
  – Mining data that is ordered, time-varying, and potentially infinite
• Graph mining
  – Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (Web fragments)
• Information network analysis
  – Social networks: actors (objects, nodes) and relationships (edges)
    • E.g., author networks in CS, terrorist networks
  – Multiple heterogeneous networks
    • A person can appear in multiple information networks: friends, family, classmates, …
  – Links carry a lot of semantic information: link mining
• Web mining
  – The Web is a big information network: from PageRank to Google
  – Analysis of Web information networks
    • Web community discovery, opinion mining, usage mining, …

Page 25

Challenges for Data Mining

• Handling of different types of data
• Efficiency and scalability of mining algorithms
• Usefulness and certainty of mining results
• Expression of various kinds of mining results
• Interactive mining at multiple abstraction levels
• Mining information from different sources of data
• Protection of privacy and data security

Page 26

Brief Summary

• Data mining: discovering interesting patterns and knowledge from massive amounts of data
• A natural evolution of database technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed on a variety of data
• Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.

Page 27

A Brief History of the Data Mining Society

• 1989 IJCAI Workshop on Knowledge Discovery in Databases
  – Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
  – Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  – Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
  – PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
• ACM Transactions on KDD starting in 2007

More details here: http://www.kdnuggets.com/gpspubs/sigkdd-explorations-kdd-10-years.html

Page 28

Conferences and Journals on Data Mining

• KDD Conferences
  – ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  – SIAM Data Mining Conf. (SDM)
  – (IEEE) Int. Conf. on Data Mining (ICDM)
  – European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
  – Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
  – Int. Conf. on Web Search and Data Mining (WSDM)
• Other related conferences
  – DB: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT
  – Web & IR: CIKM, WWW, SIGIR
  – ML & PR: ICML, CVPR, NIPS
• Journals
  – Data Mining and Knowledge Discovery (DAMI or DMKD)
  – IEEE Trans. on Knowledge and Data Eng. (TKDE)
  – KDD Explorations
  – ACM Trans. on KDD

Page 29

CAPABILITIES OF DATA MINING

Page 30

FREQUENT PATTERNS & ASSOCIATION RULES

Page 31

Basic Concepts

• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  – What products were often purchased together? Beer and diapers?!
  – What are the subsequent purchases after buying a PC?
  – What kinds of DNA are sensitive to this new drug?
  – Can we automatically classify web documents?
• Applications
  – Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Page 32

Mining Association Rules

• Transaction data analysis. Given:
  – A database of transactions (each tx has a list of items purchased)
  – Minimum confidence and minimum support
• Find all association rules: the presence of one set of items implies the presence of another set of items

Diaper → Beer [0.5%, 75%] (support, confidence)

Page 33

Two Parameters

• Confidence (how true the rule is)
  – The rule X & Y → Z has 90% confidence: 90% of the customers who bought X and Y also bought Z.
• Support (how useful the rule is)
  – Useful rules should have some minimum transaction support.
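These two measures can be checked with a minimal Python sketch; the toy transactions below are made up for illustration, and any rule can be plugged in:

```python
# Toy transaction database (one set of items per transaction); made up for illustration.
transactions = [
    {"X", "Y", "Z"},
    {"X", "Y", "Z"},
    {"X", "Y"},
    {"Z"},
]

def support(itemset, txs):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in txs) / len(txs)

def confidence(antecedent, consequent, txs):
    """support(antecedent | consequent) / support(antecedent)."""
    return support(antecedent | consequent, txs) / support(antecedent, txs)

# Rule {X, Y} -> {Z}: 50% support, ~67% confidence.
print(support({"X", "Y", "Z"}, transactions))       # 0.5
print(confidence({"X", "Y"}, {"Z"}, transactions))  # 0.666...
```

Support counts transactions containing all of X, Y, Z together, while confidence conditions on the transactions containing the antecedent {X, Y}.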

Page 34

Mining Strong Association Rules in Transaction Databases (1/2)

• Measurement of rule strength in a transaction database:

A → B [support, confidence]

support(A → B) = Pr(A ∪ B) = (# of tx containing all items in A ∪ B) / (total # of tx)

confidence(A → B) = Pr(B | A) = (# of tx containing both A and B) / (# of tx containing A)

Page 35

Mining Strong Association Rules in Transaction Databases (2/2)

• We are often interested only in strong associations, i.e.,
  – support ≥ min_sup
  – confidence ≥ min_conf
• Examples:
  – milk → bread [5%, 60%]
  – tire and auto_accessories → auto_services [2%, 80%]

Page 36

Example of Association Rules

Transaction-id | Items bought
1 | A, B, D
2 | A, C, D
3 | A, D, E
4 | B, E, F
5 | B, C, D, E, F

Let min. support = 50% and min. confidence = 50%.
Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
Association rules: A → D (s = 60%, c = 100%)
                   D → A (s = 60%, c = 75%)

(Note D appears in 4 of the 5 transactions, which is why D → A has confidence 3/4 = 75%.)

Page 37

Two Steps for Mining Association Rules

• Determining "large (frequent) itemsets"
  – The main factor for overall performance
  – The downward closure property of frequent patterns:
    • Any subset of a frequent itemset must be frequent
    • If {beer, diaper, nuts} is frequent, so is {beer, diaper}
    • I.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Generating rules

Page 38

The Apriori Algorithm

• Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.)
  – Derivation of large 1-itemsets L1: at the first iteration, scan all the transactions and count the number of occurrences of each item.
  – Level-wise derivation: at the k-th iteration, the candidate set Ck contains the k-itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the # of occurrences of each candidate itemset.
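The level-wise iteration can be sketched in Python. This is an illustrative implementation of the idea, not the authors' original code, run here on the four-transaction database of the worked example (min. support = 2 tx's):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Level-wise Apriori sketch; min_sup is an absolute transaction count."""
    txs = [frozenset(t) for t in transactions]
    # C1: every item that occurs in the database.
    candidates = {frozenset([i]) for t in txs for i in t}
    frequent, k = {}, 1
    while candidates:
        # One full database scan per level: count each candidate's support.
        counts = {c: sum(c <= t for t in txs) for c in candidates}
        Lk = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(Lk)
        k += 1
        # Join L(k-1) with itself, then prune any candidate that has an
        # infrequent (k-1)-subset (the downward closure property).
        candidates = {
            a | b
            for a in Lk for b in Lk
            if len(a | b) == k
            and all(frozenset(s) in Lk for s in combinations(a | b, k - 1))
        }
    return frequent

# The lecture's example database (Tids 100-400), min. support = 2 tx's.
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori(tdb, 2)
print(sorted("".join(sorted(s)) for s in freq))
# ['A', 'AC', 'B', 'BC', 'BCE', 'BE', 'C', 'CE', 'E']
```

The pruning condition inside the candidate-generation comprehension is exactly the downward closure property: a k-itemset is kept only if all of its (k-1)-subsets were frequent.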

Page 39

The Apriori Algorithm—An Example

Database TDB (min. support = 2 tx's, i.e., 50%):

Tid | Items
100 | A, C, D
200 | B, C, E
300 | A, B, C, E
400 | B, E

1st scan → C1: {A}:2, {B}:3, {C}:3, {D}:1, {E}:3
        → L1: {A}:2, {B}:3, {C}:3, {E}:3

C2 (candidates): {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
2nd scan → C2 counts: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2
        → L2: {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2

C3 (candidates): {B,C,E}
3rd scan → L3: {B,C,E}:2

Page 40

From Large Itemsets to Rules

• For each large itemset m:
  – For each subset p of m: if (sup(m) / sup(m-p) ≥ min_conf)
    • output the rule (m-p) → p
      – confidence = sup(m) / sup(m-p)
      – support = sup(m)
• Example: m = {a,c,d,e,f,g} appears in 2000 tx's; p = {c,e,f,g}, so m-p = {a,d}, which appears in 5000 tx's
  – conf. = #{a,c,d,e,f,g} / #{a,d} = 2000 / 5000
  – rule: {a,d} → {c,e,f,g}, confidence 40%, support 2000 tx's
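The nested loop can be written as a small Python function. This is a sketch that assumes a precomputed support table for all frequent itemsets; the counts below are recomputed from the earlier five-transaction example (A:3, D:4, AD:3):

```python
from itertools import combinations

def rules_from_itemset(m, sup, min_conf):
    """Generate every rule (m-p) -> p from one large itemset m.
    `sup` maps a frozenset itemset to the number of tx's containing it."""
    rules = []
    for r in range(1, len(m)):                          # size of the consequent p
        for p in map(frozenset, combinations(m, r)):
            conf = sup[m] / sup[m - p]                  # conf = sup(m) / sup(m-p)
            if conf >= min_conf:
                rules.append((m - p, p, conf, sup[m]))  # rule support = sup(m)
    return rules

# Support counts recomputed from the 5-transaction example: A:3, D:4, AD:3.
sup = {frozenset("A"): 3, frozenset("D"): 4, frozenset("AD"): 3}
for lhs, rhs, conf, s in rules_from_itemset(frozenset("AD"), sup, min_conf=0.5):
    print(set(lhs), "->", set(rhs), f"conf={conf:.0%}, sup={s} tx's")
```

With min_conf = 0.5 this emits both A → D (conf. 100%) and D → A (conf. 75%); plugging in the slide's counts gives conf. = 2000/5000 = 40% for {a,d} → {c,e,f,g}.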

Page 41

Redundant Rules

• For the same support and confidence, if we have a rule {a,d} → {c,e,f,g}, do we also have [agga98a]:
  – {a,d} → {c,e,f} ?   Yes!
  – {a} → {c,e,f,g} ?   No!
  – {a,d,c} → {e,f,g} ?   Yes!
  – {a} → {c,d,e,f,g} ?   No!
(Shrinking the consequent, or moving items from the consequent into the antecedent, can only keep or raise the confidence; shrinking the antecedent cannot.)

Page 42

Practice

• Suppose we additionally have:
  – 500: A, C, E
  – 600: B, C, D
  – min. support = 3 tx's (50%), min. confidence = 66%
• Repeat the large-itemset generation
  – Identify all large itemsets
• Derive up to 4 rules
  – Generate rules from the large itemsets with the biggest number of elements (from big to small)

Page 43

Discussion of The Apriori Algorithm

• Apriori (R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.)
  – Derivation of large 1-itemsets L1: at the first iteration, scan all the transactions and count the number of occurrences of each item.
  – Level-wise derivation: at the k-th iteration, the candidate set Ck contains the k-itemsets whose every (k-1)-item subset is in Lk-1. Scan the DB and count the # of occurrences of each candidate itemset.
• The cardinality (number of elements) of C2 is huge.
• The execution time of the first 2 iterations is the dominating factor for overall performance!
• Database scans are expensive.

Page 44

Improvement of the Apriori Algorithm

• Reduce passes of transaction database scans

• Shrink the number of candidates

• Facilitate the support counting of candidates

Page 45

Example Improvement 1- Partition: Scan Database Only Twice

• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  – Scan 1: partition the database and find local frequent patterns
  – Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95.

Page 46

Example Improvement 2 - DHP

• DHP (direct hashing with pruning): Apriori + hashing
  – Uses a hash-based method to reduce the size of C2.
  – Allows effective reduction of the tx database size (both the number of tx's and the size of each tx).
• Example database: 100 A,C,D; 200 B,C,E; 300 A,B,C,E; 400 B,E
• J. Park, M.-S. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95.

Page 47

Mining Frequent Patterns w/o Candidate Generation

• A highly compact data structure: the frequent pattern tree (FP-tree).
• An FP-tree-based pattern-fragment growth mining method.
• Search technique in mining: a partitioning-based, divide-and-conquer method.
• J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In SIGMOD'2000.

Page 48

Frequent Pattern Tree (FP-tree)

• 3 parts:
  – One root, labeled 'null'
  – A set of item-prefix subtrees
  – A frequent-item header table
• Each node in a prefix subtree consists of:
  – Item name
  – Count
  – Node-link
• Each entry in the frequent-item header table consists of:
  – Item name
  – Head of node-link

Page 49

The FP-tree Structure

[Figure: the FP-tree for the running example]

root
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Frequent-item header table (item → head of node-links): f, c, a, b, m, p

Page 50

FP-tree Construction: Step1

• Scan the transaction database DB once (the first time), and derive the list of frequent items.
• Sort the frequent items in frequency-descending order.
• This ordering is important, since each path of the tree will follow this order.
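This first scan and ordering step can be sketched in Python on the five transactions of the running example; note that ties in frequency are broken here by first occurrence in the data, so the exact order of equally frequent items may differ slightly from the slide's f,c,a,b,m,p:

```python
from collections import Counter

# The five transactions from the running example (min. support = 3).
raw = ["f,a,c,d,g,i,m,p", "a,b,c,f,l,m,o", "b,f,h,j,o",
       "b,c,k,s,p", "a,f,c,e,l,p,m,n"]
txs = [t.split(",") for t in raw]

# First scan: count every item and keep the frequent ones.
counts = Counter(i for t in txs for i in t)
freq = {i: n for i, n in counts.items() if n >= 3}

# Frequency-descending order (ties broken by first occurrence in the data).
order = sorted(freq, key=lambda i: -freq[i])
print(order)  # ['f', 'c', 'a', 'm', 'p', 'b']

# Rewrite each transaction: keep only frequent items, in this global order.
ordered = [[i for i in order if i in t] for t in txs]
print(ordered[0])  # ['f', 'c', 'a', 'm', 'p']
```

Every rewritten transaction now lists its frequent items in one global order, which is what makes transactions share tree prefixes in the construction step that follows.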

Page 51

Example (min. support = 3)

Tx ID | Items Bought      | (ordered) Frequent Items
100   | f,a,c,d,g,i,m,p   | f,c,a,m,p
200   | a,b,c,f,l,m,o     | f,c,a,b,m
300   | b,f,h,j,o         | f,b
400   | b,c,k,s,p         | c,b,p
500   | a,f,c,e,l,p,m,n   | f,c,a,m,p

List of frequent items: (f:4), (c:4), (a:3), (b:3), (m:3), (p:3)

Frequent-item header table (item → head of node-links): f, c, a, b, m, p

Page 52

FP-tree Construction: Step 2

• Create the root of the tree, labeled "null".
• Scan the database a second time. The scan of the first tx leads to the construction of the first branch of the tree.

Scan of 1st transaction: f,a,c,d,g,i,m,p → ordered frequent items f,c,a,m,p
The 1st branch of the tree: <(f:1), (c:1), (a:1), (m:1), (p:1)>

root → f:1 → c:1 → a:1 → m:1 → p:1

Page 53

FP-tree Construction: Step 2 (cont’d)

• Scan of 2nd transaction: a,b,c,f,l,m,o → f,c,a,b,m
• It shares the prefix f,c,a with the first branch (those counts become f:2, c:2, a:2); two new nodes are created: (b:1) and (m:1).

root → f:2 → c:2 → a:2, which now branches into m:1 → p:1 and b:1 → m:1

Page 54

The FP-tree

Tx ID | Items Bought      | (ordered) Frequent Items
100   | f,a,c,d,g,i,m,p   | f,c,a,m,p
200   | a,b,c,f,l,m,o     | f,c,a,b,m
300   | b,f,h,j,o         | f,b
400   | b,c,k,s,p         | c,b,p
500   | a,f,c,e,l,p,m,n   | f,c,a,m,p

[Figure: the complete FP-tree after all five transactions]

root
├─ f:4
│  ├─ c:3
│  │  └─ a:3
│  │     ├─ m:2
│  │     │  └─ p:2
│  │     └─ b:1
│  │        └─ m:1
│  └─ b:1
└─ c:1
   └─ b:1
      └─ p:1

Frequent-item header table (item → head of node-links): f, c, a, b, m, p

Page 55

Mining Process

• Start from the least frequent item, p
  – Mining order: p → m → b → a → c → f
(i.e., the frequent-item header table f, c, a, b, m, p traversed bottom-up)

Page 56

Mining Process for item p

• Start from the least frequent item, p (min. support = 3).

Two paths in the FP-tree contain p:
<f:4, c:3, a:3, m:2, p:2> and <c:1, b:1, p:1>

Conditional pattern base of "p" (p's prefix paths, with each count set to p's count on that path):
<f:2, c:2, a:2, m:2> and <c:1, b:1>

Conditional frequent pattern: <c:3>

So we have two frequent patterns: {p:3}, {cp:3}

Page 57

Mining Process for Item m

(min. support = 3)

Two paths in the FP-tree contain m:
<f:4, c:3, a:3, m:2> and <f:4, c:3, a:3, b:1, m:1>

Conditional pattern base of "m":
<f:2, c:2, a:2> and <f:1, c:1, a:1, b:1>

Conditional frequent pattern: <f:3, c:3, a:3>

Page 58

Mining m’s Conditional FP-tree

Mine(<f:3, c:3, a:3> | m)
├─ (fm:3)
├─ (cm:3) → Mine(<f:3> | cm) → (fcm:3)
└─ (am:3) → Mine(<f:3, c:3> | am)
    ├─ (fam:3)
    └─ (cam:3) → Mine(<f:3> | cam) → (fcam:3)

So we have the frequent patterns:
{m:3}, {am:3}, {cm:3}, {fm:3}, {cam:3}, {fam:3}, {fcm:3}, {fcam:3}
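As a sanity check, this FP-growth result for item m can be reproduced by brute-force enumeration over the same five transactions (exponential in the number of items, so only viable for tiny examples like this one):

```python
from itertools import combinations

# The five transactions from the running example, as item sets.
txs = [set("facdgimp"), set("abcflmo"), set("bfhjo"),
       set("bcksp"), set("afcelpmn")]

items = sorted(set().union(*txs))
found = set()
# Enumerate every itemset containing 'm' and keep those with support >= 3.
for r in range(1, len(items) + 1):
    for combo in combinations(items, r):
        s = set(combo)
        if "m" in s and sum(s <= t for t in txs) >= 3:
            found.add(frozenset(s))

print(sorted("".join(sorted(f)) for f in found))
# ['acfm', 'acm', 'afm', 'am', 'cfm', 'cm', 'fm', 'm']
```

The eight itemsets match the slide's output (cam = acm, fcam = acfm, etc., here written with items sorted alphabetically).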

Page 59

Analysis of the FP-tree-based method

• Finds the complete set of frequent itemsets
• Efficient because it
  – Works on reduced sets of pattern bases
  – Performs mining operations less costly than generation & test
• Cons:
  – No advantage if most tx's are short
  – The FP-tree does not always fit into main memory

Page 60

Generalized Association Rules

• Given a class hierarchy (taxonomy), one would like to choose the proper data granularities for mining.
• Different confidence/support thresholds may be considered.
• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.

Page 61

Concept hierarchy:

  Clothes  → Outerwear (→ Jackets, Ski Pants), Shirts
  Footwear → Shoes, Hiking Boots

Transactions:

  Tx ID  Items Bought
  100    Shirt
  200    Jacket, Hiking Boots
  300    Ski Pants, Hiking Boots
  400    Shoes
  500    Shoes
  600    Jacket

Frequent itemsets (min. support 30%):

  Itemset                    Support count
  Jacket                     2
  Outerwear                  3
  Clothes                    4
  Shoes                      2
  Hiking Boots               2
  Footwear                   4
  Outerwear, Hiking Boots    2
  Clothes, Hiking Boots      2
  Outerwear, Footwear        2
  Clothes, Footwear          2

Rules (min. support 30%, min. confidence 60%):

  Rule                        Support  Confidence
  Outerwear → Hiking Boots    33%      66%
  Outerwear → Footwear        33%      66%
  Hiking Boots → Outerwear    33%      100%
  Hiking Boots → Clothes      33%      100%
  Jacket → Hiking Boots       16%      50%
  Ski Pants → Hiking Boots    16%      100%

61
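The support and confidence values in the table above can be recomputed by extending each transaction with the ancestors of its items under the taxonomy. The following is a minimal sketch (the helper names `extend`, `support`, and `confidence` are hypothetical, not from the slides):

```python
# Taxonomy from the slide: child -> parent.
parent = {
    "Jacket": "Outerwear", "Ski Pants": "Outerwear",
    "Outerwear": "Clothes", "Shirt": "Clothes",
    "Shoes": "Footwear", "Hiking Boots": "Footwear",
}

transactions = [
    {"Shirt"},
    {"Jacket", "Hiking Boots"},
    {"Ski Pants", "Hiking Boots"},
    {"Shoes"},
    {"Shoes"},
    {"Jacket"},
]

def extend(tx):
    """Add every ancestor of every purchased item to the transaction."""
    out = set(tx)
    for item in tx:
        while item in parent:
            item = parent[item]
            out.add(item)
    return out

extended = [extend(tx) for tx in transactions]

def support(itemset):
    # Fraction of (extended) transactions containing the whole itemset.
    return sum(itemset <= tx for tx in extended) / len(extended)

def confidence(lhs, rhs):
    return support(lhs | rhs) / support(lhs)

print(round(support({"Outerwear", "Hiking Boots"}), 2))       # 0.33
print(round(confidence({"Hiking Boots"}, {"Outerwear"}), 2))  # 1.0
```

This reproduces, e.g., Outerwear → Hiking Boots with 33% support and 2/3 confidence (the slide rounds to 66%).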

Page 62

Generalized Association Rules

62

Three ways to set minimum support across levels, for the hierarchy Milk [support = 10%] → {2% Milk [support = 6%], Skim Milk [support = 4%]}:

• Uniform support: Level 1 min_sup = 5%, Level 2 min_sup = 5%
• Reduced support: Level 1 min_sup = 5%, Level 2 min_sup = 3%
• Level filtering: Level 1 min_sup = 12%, Level 2 min_sup = 3%. Milk [10%] fails Level 1, so 2% Milk and Skim Milk are not examined.

Page 63

Other Relevant Topics

• Max patterns
  – R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD '98.
• Closed patterns
  – N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT '99.
• Sequential patterns
  – What items one will purchase if he/she has bought some certain items.
  – R. Srikant and R. Agrawal. Mining sequential patterns. ICDE '95.
• Traversal patterns
  – Mining path traversal patterns in a web environment where documents or objects are linked together to facilitate interactive access.
  – M.-S. Chen, J. Park, and P. Yu. Efficient Data Mining for Path Traversal Patterns. TKDE '98.

and more…

63

Page 64

CLASSIFICATION

64

Page 65

Classification

• Classifying tuples in a database.
• Each tuple has some attributes with known values.
• In the training set E:
  – Each tuple consists of the same set of attributes as the tuples in the large database W.
  – Additionally, each tuple has a known class identity.

65

Page 66

Classification (cont’d)

• Derive the classification mechanism from the training set E, and then use this mechanism to classify general data (the testing set).
• Decision-tree-based approaches have been influential in machine learning studies.

66

Page 67

Classification – Step 1: Model Construction

• Train model from the existing data pool

67

Training Data

name age income own cars?

Sandy <=30 low no

Bill <=30 low yes

Fox 31…40 high yes

Susan >40 med no

Claire >40 med no

Andy 31…40 high yes

Classification algorithm

Classification rules

Page 68

Classification – Step 2: Model Usage

68

Testing Data

  name   age    income  own cars?
  John   >40    high    ?
  Sally  <=30   low     ?
  Annie  31…40  high    ?

Applying the classification rules yields: John → No, Sally → No, Annie → Yes.

Page 69

What is Prediction?

• Prediction is similar to classification:
  – First, construct a model.
  – Second, use the model to predict the future of unknown objects.
• Prediction differs from classification:
  – Classification predicts categorical class labels.
  – Prediction predicts continuous values.
• Major method: regression

69

Page 70

Supervised vs. Unsupervised Learning

• Supervised learning (e.g., classification):
  – Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations.
• Unsupervised learning (e.g., clustering):
  – We are given a set of measurements, observations, etc., with the aim of establishing the existence of classes or clusters in the data.
  – There are no training data, or the "training data" are not accompanied by class labels.

70

Page 71

Evaluating Classification Methods

• Predictive accuracy
• Speed
  – Time to construct the model and time to use the model
• Robustness
  – Handling noise and missing values
• Scalability
  – Efficiency on large databases (data not memory-resident)
• Goodness of rules
  – Decision tree size
  – The compactness of classification rules

71

Page 72

A Decision-Tree Based Classification

72

• A decision tree of whether to play tennis or not:

  outlook?
    sunny    → humidity? (high → N, low → P)
    overcast → P
    rainy    → windy? (Yes → N, No → P)

• ID-3 and its extended version C4.5 (Quinlan '93): a top-down decision tree generation algorithm

Page 73

Algorithm for Decision Tree Induction (1/2)

• Basic algorithm (a greedy algorithm):
  – The tree is constructed in a top-down, recursive, divide-and-conquer manner.
  – Attributes are categorical. (If an attribute is a continuous number, it needs to be discretized in advance; e.g., 0 <= age <= 100 binned into 0–20, 21–40, 41–60, 61–80, 81–100.)
  – At the start, all the training examples are at the root.
  – Examples are partitioned recursively based on selected attributes.

73

Page 74

Algorithm for Decision Tree Induction (2/2)

• Basic algorithm (a greedy algorithm), continued:
  – Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain): maximize an information gain measure, i.e., favor the partitioning that makes the majority of examples belong to a single class.
  – Conditions for stopping the partitioning:
    • All samples for a given node belong to the same class.
    • There are no remaining attributes for further partitioning (majority voting is employed to label the leaf).
    • There are no samples left.

74

Page 75

Decision Tree Induction: Training Dataset

75

  age     income  student  credit_rating  buys_computer
  <=30    high    no       fair           no
  <=30    high    no       excellent      no
  31…40   high    no       fair           yes
  >40     medium  no       fair           yes
  >40     low     yes      fair           yes
  >40     low     yes      excellent      no
  31…40   low     yes      excellent      yes
  <=30    medium  no       fair           no
  <=30    low     yes      fair           yes
  >40     medium  yes      fair           yes
  <=30    medium  yes      excellent      yes
  31…40   medium  no       excellent      yes
  31…40   high    yes      fair           yes
  >40     medium  no       excellent      no

Page 76

76

Age?

<= 30 31…40 > 40

Page 77

Primary Issues in Tree Construction (1/2)

• Split criterion: goodness function
  – Used to select the attribute to be split at a tree node during the tree-generation phase.
  – Different algorithms may use different goodness functions:
    • Information gain (used in ID3/C4.5)
    • Gini index (used in CART)

77

Page 78

Primary Issues in Tree Construction (2/2)

• Branching scheme:
  – Determining the tree branch to which a sample belongs.
  – Binary vs. k-ary splitting.
• When to stop the further splitting of a node? E.g., an impurity measure.
• Labeling rule: a node is labeled as the class to which most samples at the node belong.

78

[Example of a k-ary split: income → high / medium / low]

Page 79

How to Use a Tree?

• Directly:
  – Test the attribute values of the unknown sample against the tree.
  – A path is traced from the root to a leaf, which holds the label.
• Indirectly:
  – The decision tree is converted to classification rules.
  – One rule is created for each path from the root to a leaf.
  – IF-THEN rules are easier for humans to understand.

79

Page 80

80

Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain.
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
 Expected information (entropy) needed to classify a tuple in D:

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

 Entropy is a measure of how "mixed up" an attribute is; it is sometimes equated with the purity or impurity of a variable. High entropy means that we are sampling from a uniform (boring) distribution.

Page 81

81

Expected Information (Entropy)

 Expected information (entropy) needed to classify a tuple in D (m: number of labels):

  Info(D) = - Σ_{i=1}^{m} p_i log2(p_i)

 Examples:

  Info(D) = I(3, 2) = -(3/5) log2(3/5) - (2/5) log2(2/5) = (3/5)(0.737) + (2/5)(1.322) = 0.971

  Info(D) = I(5, 0) = -(5/5) log2(5/5) - (0/5) log2(0/5) = 0 + 0 = 0

Page 82

82

Attribute Selection Measure: Information Gain (ID3/C4.5)

 Select the attribute with the highest information gain.
 Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_{i,D}| / |D|.
 Expected information (entropy) needed to classify a tuple in D:

  Info(D) = I(D) = - Σ_{i=1}^{m} p_i log2(p_i)

 Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) I(D_j)

 Information gained by branching on attribute A:

  Gain(A) = Info(D) - Info_A(D)

Page 83

83

Expected Information (Entropy)

 Information needed (after using A to split D into v partitions) to classify D:

  Info_A(D) = Σ_{j=1}^{v} (|D_j| / |D|) I(D_j)

 Examples:

  Info_A(D) = (2/5) Info(1, 1) + (3/5) Info(2, 1)

  Info_A(D) = (2/5) Info(2, 0) + (3/5) Info(3, 0)

Page 84

84

Attribute Selection: Information Gain

 Class P: buys_computer = "yes";  Class N: buys_computer = "no"

  Info(D) = I(9, 5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

 The term (5/14) I(2, 3) means "age <= 30" has 5 out of 14 samples, with 2 yes's and 3 no's:

  age      p_i   n_i   I(p_i, n_i)
  <=30     2     3     0.971
  31…40    4     0     0
  >40      3     2     0.971

  Info_age(D) = (5/14) I(2, 3) + (4/14) I(4, 0) + (5/14) I(3, 2) = 0.694

  Gain(age) = Info(D) - Info_age(D) = 0.246

 Similarly:

  Gain(income) = 0.029
  Gain(student) = 0.151
  Gain(credit_rating) = 0.048
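The numbers above can be verified with a few lines of Python. This is a minimal sketch over the 14-tuple training set, keeping only the age attribute and the class label (the function names are hypothetical):

```python
from math import log2

# The 14-tuple training set from the slide: (age, buys_computer).
data = [
    ("<=30", "no"), ("<=30", "no"), ("31…40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31…40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31…40", "yes"),
    ("31…40", "yes"), (">40", "no"),
]

def entropy(labels):
    """Info(D) = -sum p_i log2(p_i) over the class distribution."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return -sum(p * log2(p) for p in probs if p > 0)

def info_split(data):
    """Return (Info(D), Info_A(D)) for the single attribute in `data`."""
    labels = [y for _, y in data]
    before = entropy(labels)
    after = 0.0
    for v in set(x for x, _ in data):
        subset = [y for x, y in data if x == v]
        after += len(subset) / len(data) * entropy(subset)
    return before, after

info_d, info_age = info_split(data)
# Roughly 0.940, 0.694, and a gain of about 0.247
# (the slide truncates the gain to 0.246).
print(round(info_d, 3), round(info_age, 3), round(info_d - info_age, 3))
```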

Page 85

Gain Ratio for Attribute Selection (C4.5)

• The information gain measure is biased towards attributes with a large number of values.
• C4.5 (a successor of ID3) uses gain ratio to overcome the problem (a normalization of information gain):

  SplitInfo_A(D) = - Σ_{j=1}^{v} (|D_j| / |D|) log2(|D_j| / |D|)

  GainRatio(A) = Gain(A) / SplitInfo_A(D)

• Example: income splits D into 4 "high", 6 "medium", and 4 "low" tuples, so

  SplitInfo_income(D) = -(4/14) log2(4/14) - (6/14) log2(6/14) - (4/14) log2(4/14) = 1.557

  GainRatio(income) = 0.029 / 1.557 ≈ 0.019

• The attribute with the maximum gain ratio is selected as the splitting attribute.

85
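As a cross-check, SplitInfo for income's 4/6/4 partition of the 14 tuples can be recomputed directly; note that the arithmetic gives about 1.557 (some printings of this example quote 0.926 and a gain ratio of 0.031, which does not match the formula). A minimal sketch:

```python
from math import log2

# Partition sizes induced by income on the 14-tuple data set:
# 4 high, 6 medium, 4 low.
sizes, n = [4, 6, 4], 14

# SplitInfo_A(D) = -sum (|Dj|/|D|) log2(|Dj|/|D|)
split_info = -sum(s / n * log2(s / n) for s in sizes)

# Gain(income) = 0.029, taken from the information-gain slide.
gain_ratio = 0.029 / split_info

print(round(split_info, 3), round(gain_ratio, 3))
```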

Page 86

Gini index (CART, IBM IntelligentMiner)

• If a data set D contains examples from n classes, the gini index Gini(D) is defined as

  Gini(D) = 1 - Σ_{j=1}^{n} p_j²

  where p_j is the relative frequency of class j in D.

• If a data set D is split on A into two subsets D1 and D2, the gini index Gini_A(D) is defined as

  Gini_A(D) = (|D1| / |D|) Gini(D1) + (|D2| / |D|) Gini(D2)

• Reduction in impurity:

  ΔGini(A) = Gini(D) - Gini_A(D)

• The attribute that provides the smallest Gini_A(D) (or the largest reduction in impurity) is chosen to split the node (all possible splitting points for each attribute need to be enumerated).

86

Page 87

Gini index (CART, IBM IntelligentMiner)

87

• Ex. D has 9 tuples with buys_computer = "yes" and 5 with "no":

  Gini(D) = 1 - (9/14)² - (5/14)² = 0.459

• Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:

  Gini_{income ∈ {low, medium}}(D) = (10/14) Gini(D1) + (4/14) Gini(D2)
    = (10/14) [1 - (7/10)² - (3/10)²] + (4/14) [1 - (2/4)² - (2/4)²] = 0.443

• Similarly, Gini_{income ∈ {medium, high}}(D) = 0.450 and Gini_{income ∈ {low, high}}(D) = 0.458, so the partition {low, medium} | {high} gives the minimum gini index and is the best binary split on income.
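A small script can recompute these gini values from the per-income class counts of the 14-tuple training set (low: 3 yes / 1 no, medium: 4 yes / 2 no, high: 2 yes / 2 no); with those counts the {low, medium} | {high} split comes out lowest. The helper names below are hypothetical:

```python
# Class counts per income value in the 14-tuple data set.
counts = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def gini(yes, no):
    """Gini(D) = 1 - sum p_j^2 for a two-class node."""
    n = yes + no
    return 1 - (yes / n) ** 2 - (no / n) ** 2

def gini_split(group):
    """Weighted gini of the binary split (group, complement)."""
    total = sum(y + n for y, n in counts.values())
    out = 0.0
    for side in (set(group), set(counts) - set(group)):
        y = sum(counts[v][0] for v in side)
        n = sum(counts[v][1] for v in side)
        out += (y + n) / total * gini(y, n)
    return out

print(round(gini(9, 5), 3))                     # 0.459
print(round(gini_split({"low", "medium"}), 3))  # 0.443
print(round(gini_split({"medium", "high"}), 3)) # 0.45
print(round(gini_split({"low", "high"}), 3))    # 0.458
```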

Page 88

Other Attribute Selection Measures

88

• CHAID: a popular decision tree algorithm; measure based on the χ2 test for independence
• C-SEP: performs better than information gain and the gini index in certain cases
• G-statistic: has a close approximation to the χ2 distribution
• MDL (Minimal Description Length) principle (i.e., the simplest solution is preferred):
  – The best tree is the one that requires the fewest bits to both (1) encode the tree and (2) encode the exceptions to the tree.
• Multivariate splits (partitions based on multiple variable combinations):
  – CART: finds multivariate splits based on a linear combination of attributes.

Which attribute selection measure is the best? Most give good results; none is significantly superior to the others.

Page 89

Other Types of Classification Methods

• Bayes classification methods
• Rule-based classification
• Support vector machines (SVM)
• Some of these methods will be taught in the following lessons.

89

Page 90

CLUSTERING

90

Page 91

What is Cluster Analysis?

• Cluster: a collection of data objects
  – Similar to one another within the same cluster
  – Dissimilar to the objects in other clusters
• Cluster analysis:
  – Grouping a set of data objects into clusters
• Typical applications:
  – As a stand-alone tool to get insight into data distribution
  – As a preprocessing step for other algorithms

91

Page 92

General Applications of Clustering

• Spatial data analysis
  – Create thematic maps in GIS by clustering feature spaces.
  – Detect spatial clusters and explain them in spatial data mining.
• Image processing
• Pattern recognition
• Economic science (especially market research)
• WWW
  – Document classification
  – Clustering Web-log data to discover groups of similar access patterns

92

Page 93

Examples of Clustering Applications

• Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.

• Land use: Identification of areas of similar land use in an earth observation database.

• Insurance: Identifying groups of motor insurance policy holders with a high average claim cost.

• City-planning: Identifying groups of houses according to their house type, value, and geographical location.

93

Page 94

• A good clustering method will produce high-quality clusters with
  – high intra-class similarity, and
  – low inter-class similarity.
• The quality of a clustering result depends on both the similarity measure used by the method and its implementation.
• The quality of a clustering method is also measured by its ability to discover hidden patterns.

What is Good Clustering?

94

Page 95

Requirements of Clusteringin Data Mining (1/2)

• Scalability
• Ability to deal with different types of attributes
• Discovery of clusters with arbitrary shape
• Minimal requirements of domain knowledge for input
• Ability to deal with outliers

95

Page 96

Requirements of Clusteringin Data Mining (2/2)

• Insensitivity to the order of input records
• High dimensionality
  – Curse of dimensionality
• Incorporation of user-specified constraints
• Interpretability and usability

96

Page 97

Clustering Methods (I)

• Partitioning method
  – Construct various partitions and then evaluate them by some criterion, e.g., minimizing the sum of squared errors.
  – K-means, k-medoids, CLARANS
• Hierarchical method
  – Create a hierarchical decomposition of the set of data (or objects) using some criterion.
  – DIANA, AGNES, BIRCH, ROCK, CHAMELEON
• Density-based method
  – Based on connectivity and density functions.
  – Typical methods: DBSCAN, OPTICS, DenClue

97

Page 98

Clustering Methods (II)

• Grid-based approach
  – Based on a multiple-level granularity structure.
  – Typical methods: STING, WaveCluster, CLIQUE
• Model-based approach
  – A model is hypothesized for each of the clusters, and the method tries to find the best fit of that model to the data.
  – Typical methods: EM, SOM, COBWEB
• Frequent-pattern-based
  – Based on the analysis of frequent patterns.
  – Typical methods: pCluster
• User-guided or constraint-based
  – Clustering by considering user-specified or application-specific constraints.
  – Typical methods: cluster-on-demand, constrained clustering

98

Page 99

99

Typical Alternatives toCalculate the Distance between Clusters

• Single link: smallest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = min d(t_ip, t_jq)
• Complete link: largest distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = max d(t_ip, t_jq)
• Average: average distance between an element in one cluster and an element in the other, i.e., dis(K_i, K_j) = avg d(t_ip, t_jq)
• Centroid: distance between the centroids of the two clusters, i.e., dis(K_i, K_j) = d(C_i, C_j)
• Medoid: distance between the medoids of the two clusters, i.e., dis(K_i, K_j) = d(M_i, M_j)
  – Medoid: one chosen, centrally located object in the cluster
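These linkage definitions are short one-liners over pairwise distances. A minimal sketch for 2-D point clusters (the cluster data `K1`, `K2` are made up for illustration):

```python
from math import dist  # Euclidean distance, Python 3.8+

def single_link(K1, K2):
    # Smallest pairwise distance between the two clusters.
    return min(dist(p, q) for p in K1 for q in K2)

def complete_link(K1, K2):
    # Largest pairwise distance between the two clusters.
    return max(dist(p, q) for p in K1 for q in K2)

def average_link(K1, K2):
    # Mean of all pairwise distances.
    return sum(dist(p, q) for p in K1 for q in K2) / (len(K1) * len(K2))

def centroid(K):
    return tuple(sum(c) / len(K) for c in zip(*K))

K1 = [(0.0, 0.0), (0.0, 2.0)]
K2 = [(3.0, 0.0), (5.0, 0.0)]
print(single_link(K1, K2))               # 3.0
print(complete_link(K1, K2))
print(dist(centroid(K1), centroid(K2)))  # centroid-based distance
```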

Page 100

Centroid, Radius and Diameter of a Cluster (for numerical data sets)

• Centroid: the "middle" of a cluster

  C_m = (Σ_{i=1}^{N} t_ip) / N

• Radius: square root of the average squared distance from any point of the cluster to its centroid

  R_m = sqrt( Σ_{i=1}^{N} (t_ip - c_m)² / N )

• Diameter: square root of the average squared distance between all pairs of points in the cluster

  D_m = sqrt( Σ_{i=1}^{N} Σ_{j=1, j≠i}^{N} (t_ip - t_jq)² / (N (N - 1)) )

100

Note: diameter ≠ 2 × radius.
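A quick numeric check of these definitions, and of the note that the diameter is not twice the radius, on a hypothetical set of four corner points:

```python
from math import dist

def centroid(points):
    return tuple(sum(c) / len(points) for c in zip(*points))

def radius(points):
    # sqrt of the average squared distance from each point to the centroid.
    c = centroid(points)
    return (sum(dist(p, c) ** 2 for p in points) / len(points)) ** 0.5

def diameter(points):
    # sqrt of the average squared distance over all ordered pairs.
    n = len(points)
    s = sum(dist(p, q) ** 2 for p in points for q in points if p is not q)
    return (s / (n * (n - 1))) ** 0.5

pts = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
print(centroid(pts))   # (1.0, 1.0)
print(radius(pts))     # sqrt(2)
print(diameter(pts))   # smaller than 2 * radius
```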

Page 101

Partitioning Algorithms: Basic Concept

• Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
• Given a number k, find a partition of k clusters that optimizes the chosen partitioning criterion.
  – Global optimum: exhaustively enumerate all partitions.
  – Heuristic methods: k-means, k-medoids
    • k-means (MacQueen '67)
    • k-medoids, or PAM, partition around medoids (Kaufman & Rousseeuw '87)

101

Page 102

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:
  1. Arbitrarily choose k points as the initial cluster centroids.
  2. Update means (centroids): compute the seed points as the centers of the clusters of the current partition (center: mean point of the cluster).
  3. Re-assign points: assign each object to the cluster with the nearest seed point.
  4. Go back to step 2 (loop); stop when there are no more new assignments.

102
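The assign/update loop above can be sketched in a few lines of Python (the point data and the helper name `k_means` are made up for illustration):

```python
from math import dist
from statistics import mean

def k_means(points, centroids, max_iter=100):
    """Plain k-means: assign to nearest centroid, update means, repeat."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iter):
        # Re-assign step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[i].append(p)
        # Update step: recompute each centroid as the cluster mean.
        new = [tuple(mean(c) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:          # no more new assignments -> stop
            break
        centroids = new
    return centroids, clusters

pts = [(1.0, 1.0), (1.5, 2.0), (3.0, 4.0), (5.0, 7.0), (3.5, 5.0), (4.5, 5.0)]
centroids, clusters = k_means(pts, [(1.0, 1.0), (5.0, 7.0)])
print(centroids)
```

Starting from the two chosen seeds, the loop converges after the second pass on this tiny data set.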

Page 103

Example of the K-Means Clustering Method

103

Given k = 2:
  1. Arbitrarily choose k objects as the initial cluster centroids.
  2. Assign each object to the most similar centroid.
  3. Update the cluster means.
  4. Re-assign, update the cluster means again, and re-assign until stable.

[Figure: five scatter plots of the same 2-D points, showing the centroids and assignments after each assign/update step.]

Page 104

Comments on the K-Means Clustering

• Time complexity: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
• Often terminates at a local optimum. (The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.)
• Weaknesses:
  – Applicable only when the mean is defined; what about categorical data?
  – Need to specify k, the number of clusters, in advance.
  – Unable to handle noisy data and outliers.

104

Page 105

Why is K-Means Unable to Handle Outliers?

• The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data.
• K-medoids: instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster.

105

Page 106

PAM: The K-Medoids Method

• PAM: Partition Around Medoids
• Uses real objects to represent the clusters:
  1. Randomly select k representative objects as medoids.
  2. Assign each data point to the closest medoid.
  3. For each medoid m and each non-medoid data point o: swap m and o, and compute the total cost of the configuration.
  4. Select the configuration with the lowest cost.
  5. Repeat steps 2 to 4 (loop) until there is no change in the medoids.

106

Page 107

A Typical K-Medoids Algorithm (PAM)

107

Given k = 2:
  1. Arbitrarily choose k objects as the initial medoids m1, m2.
  2. Assign each remaining object to the nearest medoid.
  3. Swap each medoid and each data point, and compute the total cost of the configuration.

[Figure: four scatter plots illustrating the medoids m1, m2 and the resulting assignments.]

Page 108

108

PAM Clustering: Total swapping cost TCih=jCjih

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

j

ih

t

Cjih = 0

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

t

i h

j

Cjih = d(j, h) - d(j, i)

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

h

i t

j

Cjih = d(j, t) - d(j, i)

0

1

2

3

4

5

6

7

8

9

10

0 1 2 3 4 5 6 7 8 9 10

t

ih j

Cjih = d(j, h) - d(j, t)

- Original medoid: t, i

- h: swap with i

- j: any non-selected object

d(j,h)>d(j,t)d(j,h)>d(j,t)

d(j,h)<d(j,t)d(j,h)<d(j,t)

i h

j j

t t

j j

i t

j jt h

j j

Page 109

What is the Problem with PAM?

• PAM is more robust than k-means in the presence of noise and outliers, because a medoid is less influenced by outliers or other extreme values than a mean.
• PAM works efficiently for small data sets but does not scale well to large data sets:
  – O(k(n-k)²) per iteration, where n is the number of data points and k the number of clusters.
  – Improvements: CLARA (uses a sampled set to determine medoids), CLARANS

109

Page 110

Hierarchical Clustering

• Uses a distance matrix as the clustering criterion.
• This method does not require the number of clusters k as an input, but needs a termination condition.

110

[Figure: objects a, b, c, d, e clustered step by step. Agglomerative (AGNES) runs Step 0 to Step 4: {a}, {b}, {c}, {d}, {e} → {a, b}, {d, e} → {c, d, e} → {a, b, c, d, e}. Divisive (DIANA) runs the same hierarchy in the reverse order, Step 4 to Step 0.]

Page 111

AGNES (Agglomerative Nesting)

• Introduced in Kaufmann and Rousseeuw (1990)
• Uses the single-link method and the dissimilarity matrix.
• Merges the nodes that have the least dissimilarity.
• Continues in a non-descending fashion.
• Eventually all nodes belong to the same cluster.

111

[Figure: three scatter plots showing clusters being progressively merged.]

Page 112

Dendrogram:Shows How the Clusters are Merged

Decomposes data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

112

Page 113

DIANA (Divisive Analysis)

• Introduced in Kaufmann and Rousseeuw (1990)
• Inverse order of AGNES
• Eventually each node forms a cluster on its own.

113

[Figure: three scatter plots showing one cluster being progressively split apart.]

Page 114

More on Hierarchical Clustering

• Major weaknesses:
  – Does not scale well: time complexity is at least O(n²), where n is the number of total objects.
  – Can never undo what was done previously.
• Integration of hierarchical with distance-based clustering:
  – BIRCH (1996): uses the CF-tree data structure and incrementally adjusts the quality of sub-clusters.
  – CURE (1998): selects well-scattered points from the cluster and then shrinks them towards the center of the cluster by a specified fraction.

Page 115

Density-Based Clustering Methods

• Clustering based on density (a local cluster criterion), such as density-connected points
• Major features:
  – Discovers clusters of arbitrary shape
  – Handles noise
  – One scan
  – Needs density parameters as a termination condition
• Several interesting studies:
  – DBSCAN: Ester, et al. (KDD '96)
  – OPTICS: Ankerst, et al. (SIGMOD '99)
  – DENCLUE: Hinneburg & D. Keim (KDD '98)
  – CLIQUE: Agrawal, et al. (SIGMOD '98) (more grid-based)

115

Page 116

Density-Based Clustering: Basic Concepts

• Two parameters:
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point

116

[Figure: a circle of radius Eps around a point.]

Page 117

Density-Based Clustering: Basic Concepts

• Two parameters:
  – Eps: maximum radius of the neighborhood
  – MinPts: minimum number of points in an Eps-neighborhood of that point
• N_Eps(q) = {p | dist(p, q) <= Eps}  (p, q are two data points)
• Directly density-reachable: a point p is directly density-reachable from a point q w.r.t. Eps, MinPts if
  – p belongs to N_Eps(q), and
  – q satisfies the core point condition: |N_Eps(q)| >= MinPts

[Figure: core point q with p inside its Eps-neighborhood; MinPts = 5, Eps = 1 cm.]

117

Page 118

Density-Reachable and Density-Connected

• Density-reachable:

– A point p is density-reachable from a point q w.r.t. Eps, MinPts if there is a chain of points p1, …, pn, p1 = q, pn = p such that pi+1 is directly density-reachable from pi.

• Density-connected:

– A point p is density-connected to a point q w.r.t. Eps, MinPts if there is a point o such that both, p and q are density-reachable from o w.r.t. Eps and MinPts.

118

[Figures: a chain q, p2, …, p illustrating density-reachability; points p and q both density-reachable from o, hence density-connected]

Page 119


DBSCAN: Density Based Spatial Clustering of Applications with Noise

• Relies on a density-based notion of cluster: A cluster is defined as a maximal set of density-connected points.

• Discovers clusters of arbitrary shape in spatial databases with noise.

119

[Figure: core, border, and noise points for Eps = 1 cm, MinPts = 5]

Page 120

DBSCAN: The Algorithm

• Arbitrarily select an unvisited point p.

• Retrieve all points density-reachable from p w.r.t. Eps and MinPts.

• If p is a core point, a cluster is formed. Mark all these points as visited.

• If p is not a core point, no points are density-reachable from it; mark p as visited (tentatively noise, though it may later turn out to be a border point of another cluster) and DBSCAN visits the next point of the database.

• Continue the process until all of the points have been visited.
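The steps above can be sketched in Python as follows. This is a simplified, quadratic-time version (the visited/unvisited bookkeeping is folded into the labels array), and the point set and parameters are illustrative:

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point (cluster id, or -1 for noise)."""
    labels = [None] * len(points)          # None = unvisited
    cluster_id = 0

    def neighbors(i):
        return [j for j, p in enumerate(points) if math.dist(points[i], p) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not a core point: tentatively noise
            labels[i] = -1
            continue
        labels[i] = cluster_id             # p is a core point: start a new cluster
        queue = [j for j in seeds if j != i]
        while queue:                       # expand via density-reachability
            j = queue.pop()
            if labels[j] == -1:            # noise reached from a core point -> border point
                labels[j] = cluster_id
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_seeds = neighbors(j)
            if len(j_seeds) >= min_pts:    # only core points extend the cluster
                queue.extend(j_seeds)
        cluster_id += 1
    return labels

pts = [(0, 0), (0.3, 0), (0, 0.3), (10, 10), (10.3, 10), (10, 10.3), (50, 50)]
print(dbscan(pts, eps=1.0, min_pts=3))     # → [0, 0, 0, 1, 1, 1, -1]
```

Two clusters of arbitrary shape are found, and the isolated point is labeled as noise, matching the behavior described on the previous slides.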

120

Page 121

References (1)

• R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
• M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
• M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.
• P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
• F. Beil, M. Ester, X. Xu. Frequent term-based text clustering. KDD'02.
• M. M. Breunig, H.-P. Kriegel, R. Ng, J. Sander. LOF: Identifying density-based local outliers. SIGMOD'00.
• M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
• M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
• D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
• D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. VLDB'98.

121

Page 122

References (2)

• V. Ganti, J. Gehrke, R. Ramakrishnan. CACTUS: Clustering categorical data using summaries. KDD'99.
• S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. SIGMOD'98.
• S. Guha, R. Rastogi, and K. Shim. ROCK: A robust clustering algorithm for categorical attributes. ICDE'99, pp. 512-521, Sydney, Australia, March 1999.
• A. Hinneburg, D. A. Keim. An efficient approach to clustering in large multimedia databases with noise. KDD'98.
• A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
• G. Karypis, E.-H. Han, and V. Kumar. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. COMPUTER, 32(8):68-75, 1999.
• L. Kaufman and P. J. Rousseeuw. Clustering by means of medoids. In: Dodge, Y. (Ed.), Statistical Data Analysis Based on the L1 Norm, North Holland, Amsterdam, pp. 405-416, 1987.
• L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
• E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
• J. B. MacQueen. Some methods for classification and analysis of multivariate observations. Proc. 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, 1:281-297, 1967.
• G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
• P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
• R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.

122

Page 123

References (3)

• L. Parsons, E. Haque and H. Liu. Subspace clustering for high dimensional data: A review. SIGKDD Explorations, 6(1), June 2004.
• E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition.
• G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
• A. K. H. Tung, J. Han, L. V. S. Lakshmanan, and R. T. Ng. Constraint-based clustering in large databases. ICDT'01.
• A. K. H. Tung, J. Hou, and J. Han. Spatial clustering in the presence of obstacles. ICDE'01.
• H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
• W. Wang, J. Yang, R. Muntz. STING: A statistical information grid approach to spatial data mining. VLDB'97.
• T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. SIGMOD'96.
• Wikipedia: DBSCAN. http://en.wikipedia.org/wiki/DBSCAN.

123

Page 124

MORE ABOUT DATA MINING

124

Page 125

ICDM ’10 KEYNOTE SPEECH
“10 YEARS OF DATA MINING RESEARCH: RETROSPECT AND PROSPECT”

http://www.cs.uvm.edu/~xwu/PPT/ICDM10-Sydney/ICDM10-Keynote.pdf

125

Xindong Wu, University of Vermont, USA

Page 126

The Top 10 Algorithms
The 3-Step Identification Process

1. Nominations. ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners were invited in September 2006 to each nominate up to 10 best-known algorithms.

2. Verification. Each nomination was verified for its citations on Google Scholar in late October 2006, and those nominations that did not have at least 50 citations were removed. 18 nominations survived and were then organized in 10 topics.

3. Voting by the wider community.

126

Page 127

Top-10 Most Popular DM Algorithms: 18 Identified Candidates (I)

• Classification
– #1. C4.5: Quinlan, J. R. C4.5: Programs for Machine Learning. Morgan Kaufmann, 1993.
– #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, 1984.
– #3. K Nearest Neighbors (kNN): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. TPAMI, 18(6).
– #4. Naive Bayes: Hand, D. J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398.
• Statistical Learning
– #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag.
– #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York.
• Association Analysis
– #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94.
– #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00.

127

Page 128

The 18 Identified Candidates (II)

• Link Mining
– #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998.
– #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. SODA, 1998.
• Clustering
– #11. K-Means: MacQueen, J. B. Some methods for classification and analysis of multivariate observations. In Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967.
– #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96.
• Bagging and Boosting
– #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139.

128

Page 129

The 18 Identified Candidates (III)

• Sequential Patterns
– #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996.
– #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01.
• Integrated Mining
– #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD-98.
• Rough Sets
– #17. Finding reduct: Zdzislaw Pawlak. Rough Sets: Theoretical Aspects of Reasoning about Data. Kluwer Academic Publishers, Norwell, MA, 1992.
• Graph Mining
– #18. gSpan: Yan, X. and Han, J. 2002. gSpan: Graph-Based Substructure Pattern Mining. In ICDM '02.

129

Page 130

Top-10 Algorithm Finally Selected at ICDM’06

• #1: C4.5 (61 votes)
• #2: K-Means (60 votes)
• #3: SVM (58 votes)
• #4: Apriori (52 votes)
• #5: EM (48 votes)
• #6: PageRank (46 votes)
• #7: AdaBoost (45 votes)
• #7: kNN (45 votes)
• #7: Naive Bayes (45 votes)
• #10: CART (34 votes)

130

Page 131

10 Challenging Problems in Data Mining Research

• Developing a Unifying Theory of Data Mining
• Scaling Up for High Dimensional Data/High Speed Streams
• Mining Sequence Data and Time Series Data
• Mining Complex Knowledge from Complex Data
• Data Mining in Graph Structured Data
• Distributed Data Mining and Mining Multi-agent Data
• Data Mining for Biological and Environmental Problems
• Data-Mining-Process Related Problems
• Security, Privacy and Data Integrity
• Dealing with Non-static, Unbalanced and Cost-sensitive Data

131

Page 132

Advanced Topics in Data Mining

• Web & Text Mining
• Spatio-temporal Data Mining
• Data Stream Mining
• Uncertain Data Mining
• Privacy Preserving in Data Mining
• Graph Mining
• Social Network Mining
• Visualization of Data Mining
• …

132

Page 133

DATA STREAM MINING

133

Page 134

Data Stream Management
• Design synopsis structures for streams
• Design real-time and approximate algorithms for stream mining and query processing

Characteristics of Data Streams
• Arrive at high speed
• Arrive continuously, and possibly endlessly
• Have a huge volume

Examples: Sensor network data, network flow data, stock market data, etc.

[Diagram: multiple data streams feed a stream process system; online stream summarization maintains stream synopses, over which query processing and stream mining produce (approximate) results]

134
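The slide does not name a specific synopsis structure, but one classic example, added here as an illustration, is the reservoir sample: a fixed-size, uniform random sample maintained over a stream of unknown (possibly unbounded) length in a single pass:

```python
import random

def reservoir_sample(stream, k, seed=42):
    """Maintain a uniform random sample of k items from a one-pass stream."""
    rng = random.Random(seed)
    reservoir = []
    for n, item in enumerate(stream, start=1):
        if n <= k:
            reservoir.append(item)       # fill the reservoir with the first k items
        else:
            j = rng.randrange(n)         # item n replaces a slot with probability k/n
            if j < k:
                reservoir[j] = item
    return reservoir

print(len(reservoir_sample(range(1_000_000), k=5)))   # → 5
```

The synopsis uses O(k) memory regardless of stream length, which is exactly the trade-off the slide describes: approximate answers computed from a small summary rather than from the full stream.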

Page 135

GRAPH MINING

135

Page 136

Why Graph Mining?

• Graphs are ubiquitous
– Chemical compounds (Cheminformatics)
– Protein structures, biological pathways/networks (Bioinformatics)
– Program control flow, traffic flow, and workflow analysis
– XML databases, Web, and social network analysis
• Graph is a general model
– Trees, lattices, sequences, and items are degenerate graphs
• Diversity of graphs
– Directed vs. undirected, labeled vs. unlabeled (edges & vertices), weighted, with angles & geometry (topological vs. 2-D/3-D)
• Complexity of algorithms: many problems are of high complexity

136

Page 137

Graph Pattern Mining

• Frequent sub-graphs
– A (sub-)graph is frequent if its support (occurrence frequency) in a given dataset is no less than a minimum support threshold.
• Applications of graph pattern mining
– Mining biochemical structures
– Program control flow analysis
– Mining XML structures or Web communities
– Building blocks for graph classification, clustering, compression, comparison, and correlation analysis

137
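To make the support definition concrete, here is a heavily simplified Python sketch. Each graph is reduced to a set of labeled undirected edges, so "contains the pattern" becomes plain edge-set containment; this sidesteps real subgraph isomorphism (which general miners such as gSpan must handle) and is only faithful when vertex labels are unique within each graph:

```python
# A graph = a set of undirected labeled edges, each written as a frozenset of two labels.
def support(pattern_edges, dataset):
    """Number of graphs in the dataset containing every edge of the pattern."""
    return sum(1 for g in dataset if pattern_edges <= g)

def is_frequent(pattern_edges, dataset, min_sup):
    """Frequent iff support is no less than the minimum support threshold."""
    return support(pattern_edges, dataset) >= min_sup

g1 = {frozenset(("A", "B")), frozenset(("B", "C"))}
g2 = {frozenset(("A", "B")), frozenset(("A", "C"))}
pattern = {frozenset(("A", "B"))}
print(support(pattern, [g1, g2]))          # → 2
print(is_frequent(pattern, [g1, g2], 2))   # → True
```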

Page 138

Example: Frequent Subgraphs

• Graph dataset

138

[Figure: graph dataset (A), (B), (C) and the frequent patterns (1), (2) mined with min support 2]

Page 139

Graph Mining Algorithms

• Incomplete beam search
– Greedy (Subdue: Holder et al. KDD’94)
• Inductive logic programming (WARMR: Dehaspe et al. KDD’98)
• Graph theory-based approaches
– Apriori-based approach
• AGM/AcGM: Inokuchi, et al. (PKDD’00), FSG: Kuramochi and Karypis (ICDM’01), PATH#: Vanetik and Gudes (ICDM’02, ICDM’04), FFSM: Huan, et al. (ICDM’03)
– Pattern-growth approach
• MoFa: Borgelt and Berthold (ICDM’02), gSpan: Yan and Han (ICDM’02), Gaston: Nijssen and Kok (KDD’04)

139

Page 140

SOCIAL NETWORK MINING

140

Page 141

What is a Social Network?

141

Nodes: individuals
Links: social relationships (family/work/friendship/etc.)

Social Network: many individuals with diverse social interactions between them.

Page 142

Example of Social Networks

142

Friendship network
node: person
link: acquaintanceship

http://nexus.ludios.net/view/demo/

Page 143

Example of Social Networks

143

Ke, Visvanath & Börner, 2004

Co-author network
node: author
link: write papers together

Page 144

Mining on Social Networks

• Social network analysis has a long history in the social sciences.
– For a summary of the progress, see Linton Freeman, “The Development of Social Network Analysis.” Vancouver: Empirical Press, 2006.
• Today: convergence of social and technological networks; computing and information systems with intrinsic social structure. (Jon Kleinberg, Cornell University)
• Relevant topics:
– How to build a suitable model for search and diffusion in social networks.
– Link mining in a multi-relational, heterogeneous and semi-structured network.
– Community formation, clustering of social network data.
– Abstraction or summarization of social networks.
– Privacy preserving in social network data.
and many others…

144

Page 145

THANK YOU!

145