MÓDULO III - Faculdade de Farmácialgouveia/quimio2009/material/Aula13_20100527... · Hierarchical...

Mestrado em Farmacotecnia Avançada

Chemometrics MSc Program in Advanced Pharmaceutics 1

FFUL, Lisbon (Portugal), 2009

Luís F. Gouveia, lgouveia@ff.ul.pt

Lecture 13 (27.05.2010)

Module III

Classification & Clustering

Multivariate Classification / Classification & Clustering

Linear methods (without dimensionality reduction)

Hierarchical and non-hierarchical classification

Otto, M., Chemometrics Chap.5


Outline

clustering vs. classification

supervised vs. unsupervised learning/classification

types of clustering algorithms

Answers to...

What is cluster analysis ?

What do we use clustering for ?

Are there different approaches to data clustering ?

What are the major clustering techniques ?

Classification vs Clustering

Clustering:

The task is to learn a classification from the data. No predefined classification is required.

Clustering algorithms divide a data set into natural groups (clusters).

Instances (objects, observations, cases) in the same cluster are similar to each other; they share certain properties.

Classification:

The task is to learn to assign instances (objects, observations, cases) to predefined classes.

[Figure: example classes, circles and squares]

In cluster analysis we search for patterns in a data set by grouping the (multivariate) observations into clusters.

The goal is to find an optimal grouping for which the observations or objects within each cluster are similar, but the clusters are dissimilar to each other.

Methods of Multivariate analysis, 2nd Ed, Wiley 2002

Cluster analysis differs fundamentally from classification analysis.

In classification analysis, we allocate the observations to a known number of predefined groups or populations.

In cluster analysis, neither the number of groups nor the groups themselves are known in advance.

Methods of Multivariate analysis, 2nd Ed, Wiley 2002

Supervised vs unsupervised classification

Assign objects to classes (groups) on the basis of measurements made on the objects

Unsupervised: classes unknown, want to discover them from the data (cluster analysis)

Clustering is an unsupervised task, i.e., the training data doesn’t specify what we are trying to learn (the clusters).

Supervised: classes are predefined, want to use a (training or learning) set of labeled objects to form a classifier for classification of future observations.

Classification requires supervised learning, i.e., the training data has to specify what we are trying to learn (the classes).

Clustering can be used for:

Exploratory data analysis: visualize the data at hand, get a feeling for what the data look like, what its properties are. First step in building a model.

Generalization: discover objects/cases/instances that are similar to each other, and hence can be handled in the same way.

Hierarchical vs Flat clustering

Hierarchical clustering:

Preferable for detailed data analysis.

Provides more information than flat (non-hierarchical) clustering.

No single best algorithm exists (different algorithms are optimal for different applications).

Less efficient than flat clustering.

Flat clustering:

Preferable if efficiency is important or for large data sets.

k-means is conceptually the simplest method; it should be tried first on new data, and its results are often sufficient.

k-means assumes a Euclidean representation space; it is inappropriate for nominal data.

Classification vs Clustering

[Figure: a classification algorithm is built from a training set of objects with known classes (Class 1, Class 2, …, Class n) and is then used to assign new object(s) to those classes]

Clustering...

No predefined classification is required.

Q: What is the underlying criterion?

A: Colour

--> Only one criterion/variable/characteristic

[Figure: objects grouped into Class 1 to Class 4]

Data preprocessing

ID     Height (m)  Weight (kg)  Gender
A1     1.50        60           0
A2     1.68        67           0
A3     1.85        74           1
A4     1.63        65           0
A5     1.48        59           0
A6     1.53        61           0
A7     1.50        60           0
A8     1.53        61           0
A9     1.68        67           1
A10    1.70        68           0
A11    1.50        60           0
A12    1.59        64           0
A13    1.64        77           1
A14    1.72        69           0

Mean   1.61        65.19        0.21
Min    1.48        59           0
Max    1.85        77           1
Range  0.37        17.8         1

Dichotomous or nominal variables should not be used.

[Figure: scatter plots of height (m), weight (kg) and gender for the 14 objects, with the corresponding dendrograms (similarity axis)]

Is there something wrong with this clustering?

[Figure: dendrograms (similarity axis)]

1st discriminant is body mass

1st discriminant is gender (M/F)
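The effect of the different ranges can be checked with a few lines of Python. This is only an illustrative sketch using three rows of the table; the helper name `squared_contributions` is made up for this example.

```python
import numpy as np

# Height (m), weight (kg), gender (0/1) for objects A3 and A9 of the table.
a3 = np.array([1.85, 74.0, 1.0])
a9 = np.array([1.68, 67.0, 1.0])

def squared_contributions(x, y):
    """Share of each variable in the squared Euclidean distance between x and y."""
    sq = (x - y) ** 2
    return sq / sq.sum()

# Order of the shares: height, weight, gender.
share = squared_contributions(a3, a9)
```

On the raw data almost all of the distance between A3 and A9 comes from the weight column, simply because weight is expressed in larger numbers than height.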

Data preprocessing - Scaling

Standard deviation (autoscaling)

Scaling (by range)

ID     Height  Weight  Gender
A1     0.054   0.045   0.000
A2     0.541   0.449   0.000
A3     1.000   0.831   1.000
A4     0.405   0.337   0.000
A5     0.000   0.000   0.000
A6     0.135   0.112   0.000
A7     0.054   0.045   0.000
A8     0.135   0.112   0.000
A9     0.541   0.449   1.000
A10    0.595   0.494   0.000
A11    0.054   0.045   0.000
A12    0.297   0.247   0.000
A13    0.432   1.000   1.000
A14    0.649   0.539   0.000

Mean   0.35    0.34    0.21
Min    0       0       0
Max    1       1       1
Range  1       1       1

Similar spread (equal when range scaling is used)

[Figure: Weight (kg) vs Height (m), non-scaled data and scaled data]
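Both scaling options can be sketched in Python. The heights below are copied from the raw data table; the function names `range_scale` and `autoscale` are illustrative. The weight column is left out because the tabulated weights seem to be rounded, so the scaled weights would not be reproduced to three decimals.

```python
import numpy as np

# Heights (m) of objects A1-A14, copied from the raw data table.
height = np.array([1.50, 1.68, 1.85, 1.63, 1.48, 1.53, 1.50,
                   1.53, 1.68, 1.70, 1.50, 1.59, 1.64, 1.72])

def range_scale(x):
    """Map x linearly onto [0, 1]: (x - min) / (max - min)."""
    return (x - x.min()) / (x.max() - x.min())

def autoscale(x):
    """Subtract the mean and divide by the sample standard deviation."""
    return (x - x.mean()) / x.std(ddof=1)

scaled = range_scale(height)   # range scaling, as in the scaled table
auto = autoscale(height)       # autoscaling: mean 0, standard deviation 1
```

After range scaling every variable has range 1; after autoscaling every variable has standard deviation 1, so neither height nor weight dominates the distances.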

Data preprocessing - Centering

[Figure: Weight (kg) vs Height (m) for non-centered data, centered data, and scaled and centered data]
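Centering can be sketched in the same way. The example below uses four rows of the raw table and a made-up helper `pairwise`; it suggests that mean-centering shifts the cloud of points to the origin without changing the distances between objects.

```python
import numpy as np

# Height (m) and weight (kg) for objects A1-A4 of the raw data table.
X = np.array([[1.50, 60.0], [1.68, 67.0], [1.85, 74.0], [1.63, 65.0]])

# Mean-centering: subtract each column's mean so every variable averages zero.
centered = X - X.mean(axis=0)

def pairwise(M):
    """Euclidean distances between all pairs of rows of M."""
    return np.linalg.norm(M[:, None, :] - M[None, :, :], axis=2)

# Centering is a translation, so the distance matrix stays the same.
same_distances = np.allclose(pairwise(X), pairwise(centered))
```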

Clustering example 1

Dataset HIRSUTES (133 objects, 7 variables)

Categories/groups:

Control (26 normal, healthy women)

Hirsutes (107 women with varying degrees of hirsutism)

Variables:

Testosterone

Estradiol-17β

Estrone

Dehydroepiandrosterone

Salivary testosterone

Testosterone-estradiol binding globulin

Free (unbound) testosterone

Hirsutes dataset (ca 1990)

Olive oils...

70 objects (samples)

Variables: quantification of about 12 compounds

Clustering algorithms

All of them are based on the premise that objects belonging to one cluster are more similar to each other than to objects belonging to other clusters.

Hierarchical

Agglomerative (bottom-up): merge clusters iteratively. Start by placing each object in its own cluster, then merge these atomic clusters into larger and larger clusters until all objects are in a single cluster. Most hierarchical methods belong to this category. They differ only in their definition of between-cluster similarity.

Divisive (top-down): split a cluster iteratively. This does the reverse, starting with all objects in one cluster and subdividing them into smaller pieces. Divisive methods are not generally available and have rarely been applied.

Non-hierarchical (flat)

k-means algorithm

The k-means algorithm (non-hierarchical)

Iterative, hard, flat clustering algorithm based on Euclidean distance.

Intuitive formulation:

Specify k, the number of clusters to be generated.

Choose k points at random as cluster centers.

Assign each instance to its closest cluster center using Euclidean distance.

Calculate the centroid (mean) for each cluster and use it as the new cluster center.

Reassign all instances to the closest cluster center.

Iterate until the cluster centers no longer change.
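The formulation above can be sketched in Python with NumPy. This is a minimal illustration, not a reference implementation: the function name `kmeans`, the `max_iter` cap, the rule for empty clusters and the toy data are all assumptions made for the example.

```python
import numpy as np

def kmeans(X, k, rng, max_iter=100):
    """Plain k-means on the rows of X; returns (labels, centers)."""
    # Choose k distinct instances at random as the initial cluster centers.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        # Assign each instance to its closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Recompute each center as the centroid (mean) of its cluster.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # assumption: an empty cluster keeps its old center
                new_centers[j] = members.mean(axis=0)
        # Stop once the centers no longer move.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Toy data: two well-separated groups of one-dimensional values.
X = np.array([[0.0], [1.0], [2.0], [100.0], [101.0], [102.0]])
labels, centers = kmeans(X, k=2, rng=np.random.default_rng(0))
```

For groups this far apart, any choice of two distinct starting points ends with centers near 1 and 101; which group receives label 0 depends on the random start.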

[Figure: k-means example with the cluster centroids marked]

Properties of the algorithm:

It only finds a local optimum, not a global one.

The clusters it comes up with depend a lot on which random cluster centers are chosen initially.

Can be used for hierarchical clustering: first apply k-means with k = 2, yielding two clusters. Then apply it again on each of the two clusters, etc.

Distance metrics other than Euclidean distance can be used.
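The dependence on the starting centers can be made concrete with a small sketch. The function `kmeans_from`, the four toy points and the two sets of starting centers are assumptions chosen for illustration; the quality of each solution is measured here by the within-cluster sum of squared distances (SSE).

```python
import numpy as np

def kmeans_from(X, centers, max_iter=100):
    """k-means started from the given centers; returns (labels, centers, sse)."""
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Centroid of each cluster; an empty cluster keeps its previous center.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(len(centers))])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    sse = float(((X - centers[labels]) ** 2).sum())
    return labels, centers, sse

# Four corners of a 4 x 1 rectangle.
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])
_, _, sse_good = kmeans_from(X, [[0.0, 0.5], [4.0, 0.5]])  # left/right split
_, _, sse_poor = kmeans_from(X, [[2.0, 0.0], [2.0, 1.0]])  # bottom/top split
```

Both runs converge, because in each case the centers stop moving, but the second one is stuck in a much worse solution. A common remedy is to run k-means from several random starts and keep the result with the lowest SSE.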

Distances and similarity

Similarity

Average linkage

Single linkage

Complete linkage

[Figure: similarity between clusters under average, single and complete linkage]

Distances

Euclidean

City-block
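The two distances and the three linkage rules named above can be sketched as follows. The function names and the toy clusters `A` and `B` are made up for the example, and "average linkage" is taken here as the mean over all between-cluster pairs.

```python
import numpy as np

def euclidean(x, y):
    """Straight-line distance: square root of the summed squared differences."""
    return float(np.sqrt(((x - y) ** 2).sum()))

def city_block(x, y):
    """City-block (Manhattan) distance: sum of the absolute differences."""
    return float(np.abs(x - y).sum())

def linkage_distance(A, B, how, dist=euclidean):
    """Distance between clusters A and B (lists of points) under a linkage rule."""
    pairs = [dist(a, b) for a in A for b in B]
    if how == "single":      # nearest pair of members
        return min(pairs)
    if how == "complete":    # farthest pair of members
        return max(pairs)
    if how == "average":     # mean over all between-cluster pairs
        return sum(pairs) / len(pairs)
    raise ValueError(f"unknown linkage: {how}")

A = [np.array([0.0, 0.0]), np.array([1.0, 0.0])]
B = [np.array([4.0, 0.0]), np.array([4.0, 3.0])]
single = linkage_distance(A, B, "single")
complete = linkage_distance(A, B, "complete")
average = linkage_distance(A, B, "average")
```

For any pair of clusters the single-linkage distance is the smallest of the three and the complete-linkage distance the largest, which is why the three rules can produce different dendrograms from the same data.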

Summary

Clustering algorithms discover groups of similar instances, instead of requiring a predefined classification. They are unsupervised.

Depending on the application, hierarchical or flat clustering is appropriate.

The k-means algorithm assigns instances to clusters according to Euclidean distance to the cluster centers. Then it recomputes cluster centers as the means of the instances in the cluster.

Clusters can be evaluated against an external classification (expert-generated or predefined) or by how well they serve a given task.

Classification

In a typical pattern-recognition study, samples are classified according to a specific property by using measurements that are indirectly related to the property of interest.

An empirical relationship or classification rule is developed from a set of samples for which the property of interest and the measurements are known.

Practical Guide to Chemometrics, 2nd Ed 2006

The classification rule is then used to predict the property of samples that are not part of the original training set.

Developing a classification rule from spectroscopic or chromatographic data may be desirable for several reasons, including

the identification of the source of pollutants,

detection of odorants,

presence or absence of disease in a patient from whom a sample has been taken,

food/pharma quality testing.

The set of samples for which the property of interest and measurements are known is called the training set, whereas the set of measurements that describe each sample in the data set is called a pattern.

The determination of the property of interest by assigning a sample to its respective class is called recognition, hence the term “pattern recognition.”
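As a rough illustration of a classification rule built from a training set, the sketch below uses a nearest-centroid rule. This is only one simple choice, not a method prescribed by the text, and the training data, class labels and function names are all hypothetical.

```python
import numpy as np

# Hypothetical training set: each row is a pattern (two measurements per sample).
X_train = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
                    [3.0, 3.2], [3.1, 2.9], [2.9, 3.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])  # known property of interest

def fit_centroids(X, y):
    """Classification rule: one mean pattern (centroid) per class."""
    return {str(c): X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(centroids, x):
    """Assign pattern x to the class whose centroid is nearest (Euclidean)."""
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

rule = fit_centroids(X_train, y_train)
new_sample = np.array([2.8, 3.3])      # a sample outside the training set
predicted = predict(rule, new_sample)
```

The rule is developed only from samples whose class is known, and is then applied to a sample that was not part of the training set.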

Classification

[Figure: cases/objects are passed to a classifier, which assigns each one to Class A or Class B; in the example the two classes are Odd and Even]

...need to know the classification outcome (Class)!

Objects: 1 2 3 4 5 6 7 8

Class = [x/2 - Int(x/2)] * 2

0 → even

1 → odd
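The odd/even rule can be written directly in Python. The function name `parity_class` is illustrative, and `Int` is assumed to mean the integer part, which for positive x is the same as `math.floor`.

```python
import math

def parity_class(x):
    """Class = [x/2 - Int(x/2)] * 2, giving 0 for even and 1 for odd integers."""
    return int((x / 2 - math.floor(x / 2)) * 2)

# Classify the objects 1 to 8.
classes = {x: ("odd" if parity_class(x) else "even") for x in range(1, 9)}
```

Here the classifier is a known formula, so no training is needed; in practice the rule usually has to be learned from objects whose class is already known.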

Training set

Object  Char  Class 1   Class 2
A       1     Prime     Odd
B       2     NotPrime  Even
C       3     Prime     Odd
D       4     NotPrime  Even
E       5     Prime     Odd
F       6     NotPrime  Even
G       7     Prime     Odd
H       8     NotPrime  Even

Classification (training set): 1 is prime, 3 is prime, 5 is prime…, therefore all odd numbers are prime…

Choose a proper training and test set
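The pitfall can be reproduced with a short sketch. The training labels are copied from the training-set table (which lists 1 as prime and 2 as not prime); the test range 9 to 30 and the helper names are assumptions made for the example.

```python
def is_prime(n):
    """True if n is a prime number (trial division)."""
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def rule(n):
    """Rule suggested by the training set: every odd number is 'Prime'."""
    return n % 2 == 1

# Training labels as given in the training-set table.
train = {1: True, 2: False, 3: True, 4: False, 5: True, 6: False, 7: True, 8: False}
train_accuracy = sum(rule(n) == label for n, label in train.items()) / len(train)

# Independent test set, labelled with true primality.
test = range(9, 31)
test_accuracy = sum(rule(n) == is_prime(n) for n in test) / len(test)
```

The rule fits the training set perfectly but misclassifies 9, 15, 21, 25 and 27 in the test set, which is why performance must be judged on objects that were not used for training.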

Classification Example

Features

Object  Ca     Phosphate
1       8.00   5.50
2       8.25   5.75
3       8.70   6.30
4       10.00  3.00
5       10.25  4.00
6       9.75   3.50

Distance matrix

     1      2      3      4      5      6
1    0.000
2    0.354  0.000
3    1.063  0.711  0.000
4    3.202  3.260  3.547  0.000
5    2.704  2.658  2.774  1.031  0.000
6    2.658  2.704  2.990  0.559  0.707  0.000

Distance matrix after merging objects 1 and 2 into 1*

     1*     3      4      5      6
1*   0.000
3    0.887  0.000
4    3.231  3.547  0.000
5    2.681  2.774  1.031  0.000
6    2.681  2.990  0.559  0.707  0.000

Distance matrix after merging objects 4 and 6 into 4*

     1*     3      4*     5
1*   0.000
3    0.887  0.000
4*   2.956  3.269  0.000
5    2.681  2.774  0.869  0.000

Otto, Chemometrics, 1999

[Figure: dendrogram (similarity axis)]
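The first agglomeration step can be recomputed from the Ca and phosphate values. This sketch assumes that the distance from a merged cluster to every other object is the average of the two merged rows, which reproduces entries such as 3.231 and 2.681 in the reduced matrix; the variable names are illustrative.

```python
import numpy as np

# Ca and phosphate values for objects 1-6, as in the feature table.
X = np.array([[8.00, 5.50], [8.25, 5.75], [8.70, 6.30],
              [10.00, 3.00], [10.25, 4.00], [9.75, 3.50]])

# Pairwise Euclidean distance matrix.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)

# Closest pair of distinct objects (the zero diagonal is masked out).
masked = D + np.diag(np.full(len(X), np.inf))
i, j = np.unravel_index(masked.argmin(), masked.shape)

# Assumed update rule: distance from the merged cluster to each other object
# is the average of the two merged rows.
merged_row = (D[i] + D[j]) / 2
```

The indices are zero-based, so `i, j = 0, 1` corresponds to objects 1 and 2. Repeating the step on the reduced matrix until a single cluster remains gives the merge order and heights drawn in a dendrogram.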

The objects to be classified are first sensed by a transducer (camera), whose signals are preprocessed.

Next the features are extracted and finally the classification is emitted, here either “salmon” or “sea bass.”

Although the information flow is often chosen to be from the source to the classifier, some systems employ information flow in which earlier levels of processing can be altered based on the tentative or preliminary response in later levels (gray arrows).

Richard O. Duda, Peter E. Hart, and David G. Stork, Pattern Classification. John Wiley & Sons Inc, 2001


“Too” simple

“Too” complex

The trade-off...

Classification

Sensing: converts physical, chemical or other characteristics into numerical data

Segmentation: isolates the object from the background and/or from other objects

Feature extraction: extracts the information relevant for classifying the object

Classification

Decision

Classification system

Data collection

Feature selection

Choice of training and test sets

Model choice

Training of the “classifier”

Evaluation of the “classifier”

Discriminative ability
