Overview of Methods for Detecting and Handling Anomalies
Laure Berti-Équille, IRD (www.ird.fr), AAFD 2012



In Search of Data Quality Problems

“Dirty Data”:

– Malformatted data

– Aberrant data (outliers)

– Duplicates

– Inconsistent data

– Obsolete data

– False, incorrect, erroneous data

– Incomplete, truncated, censored data

– Missing data

AAFD'12, Univ. Paris 13, Institut Galilée, 28-29 June 2012


Outline

1. Motivating Example

2. Generic Guidelines

3. Methods for Anomaly Detection

4. Techniques for Cleaning Dirty Data

5. Summary and Conclusions


IP Data Streams: A Picture

• 10 Attributes, every 5 minutes, over four weeks

• Axes transformed for plotting

* L. Berti-Équille, T. Dasu, D. Srivastava: Discovery of complex glitch patterns: A novel approach to Quantitative Data Cleaning. Proc. of ICDE 2011, pp. 733-744, Hannover, Germany, 2011.

Detection of Patterns of Anomalies

[Figure: patterns of missing values, outliers and duplicates across the ten attributes Interfaces, Utilization_Out, Utilization_In, Bytes_Out, Bytes_In, Memory, CPU, Latency, Syslog_Events, CPU_Poll]

Detection: Main Issues

• A large variety of detection methods with conflicting results

• No benchmark

• DQ problems are not necessarily rare events

• DQ problems may be (partially) correlated

• Mutual masking effects impair detection, e.g.:

– missing values affect the detection of duplicates

– duplicate records affect the detection of outliers

– imputation methods may mask the presence of duplicates

• Classical assumptions do not hold (e.g., MCAR/MAR, normality, symmetry, unimodality)

Cleaning: What Can Be Done?

• Cleaning strategies (ad hoc)

– Impute missing values → component-wise median?

– De-duplicate → retain a random record?

– Handle outliers → identify and remove? So many methods, but with contradicting results

– Drop all records that have any imperfection

– Add special categories and analyze singularities in isolation

• Almost all existing approaches are one-shot treatments of univariate glitches. Why?

• Cleaning introduces new errors!?

So Many Choices…

• Missing values: deletion, imputation, modeling

• Duplicates: deletion, fusion, random selection

• Outliers: deletion, Winsorization, trimming


Guidelines: Step 1 – Explore the data distributions

Goal

– Detect and count missing, extreme and aberrant data values

– Decide not to consider some values or variables

– Decide which transformations and corrective actions to apply

For continuous variables

– Discretization

– Test for normality (essential for small datasets) and normalization

– Optional test for homoscedasticity (equality of variance-covariance matrices)

– Detect non-linearity and non-monotonicity

For discrete variables

– Group the variables with small populations

– Create new relevant aggregates

Step 1 - Data Distribution Characteristics

$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2}$, $CV = \frac{\sigma}{\mu}$

$S = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^3$

$K = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i - \bar{x}}{\sigma}\right)^4$

• Dispersion

– Standard deviation

– Coefficient of Variation (CV): a normalized measure of dispersion of a probability distribution

– IQR: Q3-Q1

– Homoscedasticity: equality of variances for a variable on different subsets, using Levene, Bartlett or Fisher tests (if p<.05 ⇒ heteroscedasticity)

• Skewness (S): measure of the asymmetry of the probability distribution of a real-valued random variable

– = 0: the distribution is symmetrical

– > 0: the mass of the distribution is concentrated on the left

– < 0: the mass of the distribution is concentrated on the right

• Kurtosis (K): measure of the flatness of the distribution

– = 3: as flat as the normal distribution

– > 3: more concentrated

– < 3: flatter than the Gaussian
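The dispersion and shape measures on this slide can be computed directly from their definitions. The following is a minimal sketch using only the Python standard library; the function name `describe` and the two toy samples are assumptions made for illustration, and the population (1/N) convention is used throughout.

```python
import math


def describe(values):
    """Return mean, population standard deviation, CV, skewness and kurtosis."""
    n = len(values)
    mean = sum(values) / n
    sigma = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    cv = sigma / mean  # assumes a non-zero mean
    skewness = sum(((x - mean) / sigma) ** 3 for x in values) / n
    kurtosis = sum(((x - mean) / sigma) ** 4 for x in values) / n
    return {"mean": mean, "sigma": sigma, "cv": cv,
            "skewness": skewness, "kurtosis": kurtosis}


# Hypothetical samples: one symmetric, one with a long right tail.
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_tailed = [1.0, 1.0, 1.0, 2.0, 10.0]

stats_sym = describe(symmetric)
stats_tail = describe(right_tailed)
```

In this sketch the symmetric sample has a skewness of zero (up to rounding), while the right-tailed sample has a positive skewness, in line with the sign convention listed above.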


Step 1- Test for Normality

• Many DM methods assume multivariate normal distributions

• Multivariate normality can be detected by inspecting the indices of multivariate skewness and kurtosis

• Lack of univariate normality occurs when the skewness index > 3.0 and kurtosis index > 10

• Non-normal distributions can sometimes be corrected by transforming variables

• Tests:

– Kolmogorov-Smirnov Test: non-parametric test that quantifies the maximum distance between the empirical distribution function of the variable and the cdf of the normal distribution

– Anderson-Darling Test: variant of the K-S test that gives more weight to the tails of distributions

– Lilliefors Test: variant of the K-S test for unknown mean and standard deviation

– Shapiro-Wilk Test: sorts the sample values in ascending order and uses the correlation to detect small departures from normality; not suitable for very large sample sizes (SAS proc UNIVARIATE)
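As an illustration of the K-S idea, the sketch below computes the maximum distance between the empirical distribution function and a normal cdf whose mean and standard deviation are estimated from the sample (the Lilliefors setting). It only returns the statistic: turning it into a p-value requires the appropriate critical values, which are not reproduced here. The function name and both samples are hypothetical.

```python
import math
from statistics import NormalDist


def ks_statistic_vs_normal(values):
    """Largest gap between the empirical CDF and a fitted normal CDF."""
    xs = sorted(values)
    n = len(xs)
    mu = sum(xs) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in xs) / (n - 1))
    fitted = NormalDist(mu, sigma)
    gap = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x.
        gap = max(gap, (i + 1) / n - f, f - i / n)
    return gap


# Hypothetical samples: evenly spaced normal quantiles, and a skewed transform of them.
near_normal = [NormalDist().inv_cdf((i + 0.5) / 50) for i in range(50)]
skewed = [math.exp(3 * z) for z in near_normal]

d_normal = ks_statistic_vs_normal(near_normal)
d_skewed = ks_statistic_vs_normal(skewed)
```

The statistic stays small for the near-normal sample and becomes much larger for the skewed one, which is the behaviour the tests above exploit.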


Guidelines: Step 2 – Analyze data relationships

Goal

– Detect inconsistencies between 2 or more variables

– Determine relationships between one target variable and one or more variables contributing to its explanation, in order to eliminate variables with no effect

– Determine relationships between explanatory variables in order to avoid multicollinearity, which may cause regression techniques to fail

– Quantify the strength of the relationship and its sensitivity in the presence of outliers

– Detect spurious correlations

Methods

– Bivariate statistics measuring pair-wise correlations

– Discover functional dependencies (FDs)

Guidelines: Steps 1 & 2 – Use the toolbox for detection

Toolbox, spanning univariate (UV) and multivariate (MV) statistics:

• Distributional techniques: skewness, kurtosis; goodness-of-fit tests (normality, chi-square tests, analysis of residuals, Kullback-Leibler divergence); control charts (X-Bar, CUSUM, R)

• Visualization: graphics, Q-Q plot, confusion matrix

• Model-based methods: linear and logistic regression, probabilistic methods

• Robust estimators: MCD, MVE

• Clustering: distance-based, density-based and subspace-based techniques

• Classification: rule-based techniques, SVM, neural networks, Bayesian networks, information-theoretic measures, kernel-based methods

• Rule & pattern discovery: association rule discovery; FD, AFD and CFD mining

Ultimate Research Goals

• Benchmarking

• Optimization

• Refinement

• Scalability

• Tuning

• Real-time

• Interactivity

Guidelines: Step 3 – Data Preparation: Major Tasks

• Data cleaning

– Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

• Data integration

– Integration of multiple databases, data cubes, or files

• Data transformation

– Normalization and aggregation

• Data reduction

– Obtains a representation that is reduced in volume but produces the same or similar analytical results

• Data discretization

– Part of data reduction but with particular importance, especially for numerical data


Outline

1. Motivating Example

2. Methods for Anomaly Detection


– Non standardized, misfielded/formatted

– Duplicates

– Outliers

– Inconsistencies

– Missing, truncated

– Out-of-date

– Erroneous, contradicting, false


Example: ICDM Steering Committee

Name | Affiliation | City, State, Zip, Country | Phone
Piatetsky-Shapiro G.,PhD | U. of Massachusetts | | 617-264-9914
David J. Hand | Imperial College | London, UK |
Benjamin W. Wah | Univ. of Illinois | IL 61801, USA | (217) 333-6903
Hand D.J. | | |
Vippin Kumar | U. of Minnesota, MI, USA | |
Xindong Wu | U. of Vermont | Burlington-4000 USA | NULL
Philip S. Yu | U. of Illinois | Chicago IL, USA | 999-999-9999
Osmar R. Zaiiane | U. of Alberta | CA | 111-111-1111

Callouts on the slide: misfielded value; non-standard representation.

Extract-Transform-Load (1/4)

Goals

• Format detection, verification, and conversion

• Standardization of values with loose or predictable structure, e.g., addresses, names, bibliographic entries

• Abbreviation enforcing

• Data consolidation based on dictionaries and constraints

Approaches

• Declarative language extensions

• Machine learning and HMM for field and record segmentation [Christen et al., 2002]

• Constraint-based method [Fan et al., 2008]

ETL Operators

Operators | Category | Application
Mapping, Convert, Select, Drop, Add, Merge, Format | Row-level | Locally applied to a single row
Copy, Filter, Split, Switch | Router | Locally decide, for each row, which of the many (output) destinations it should be sent to
Pivot/Unpivot, Aggregate, Clustering | Unary Grouper | Transform a set of rows to a single row
Union, Merge, Join, Look-up, Compare, Divide | Binary or N-ary | Combine many inputs into one output
Sort | Unary Holistic | Perform a transformation to the entire dataset

[Vassiliadis et al. 2007]

Open Source ETL: 2 of Many

• Kettle (PDI): http://www.pentaho.com/

• Febrl: http://cs.anu.edu.au/~Peter.Christen/Febrl/febrl-0.3/

Extract-Transform-Load (4/4)

Limitations

• Design of ad hoc scenarios

• Performance/scalability issues due to dependencies among ETL jobs and sequential processing

• DB bottleneck for bulk ETL operators

• Mainly for structured (relational) data

Research Directions

• Optimization of ETL workflows*

• Active data warehousing

• Cleaning of data streams

* A. Simitsis, P. Vassiliadis, T. K. Sellis. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), vol. 17, no. 10, pp. 1404-1419, October 2005.


Record Linkage (RL)

1. Blocking: reduce the search space by partitioning the dataset into mutually exclusive blocks to compare

• Hashing, sorted keys, sorted nearest neighbors, (multiple) windowing, clustering

2. Comparison: select and compute a comparison function measuring the similarity distance between pairs of records

• Token-based: n-gram comparison, Jaccard, TF-IDF, cosine similarity

• Edit-based: Jaro distance, edit distance, Levenshtein, Soundex

• Domain-dependent: data types, ad hoc rules, relationship-aware similarity measures

3. Classification: select a decision model to classify pairs of records as matching, non-matching or potentially matching

4. Fusion: select the deduplication method
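The first three record linkage steps can be sketched end to end on a toy example. This is an illustration under stated assumptions, not a reference implementation: the blocking key (first letter of the name), the bigram Jaccard similarity, the two thresholds and the sample records are all hypothetical choices, and the Levenshtein function is included only as an alternative edit-based comparison.

```python
from itertools import combinations


def bigrams(text):
    """Set of character 2-grams of the lower-cased text."""
    s = text.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """Token-based similarity: shared bigrams over all bigrams."""
    ga, gb = bigrams(a), bigrams(b)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def levenshtein(a, b):
    """Edit-based distance: minimum number of insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def classify(similarity, upper=0.7, lower=0.3):
    """Decision model with two (assumed) thresholds."""
    if similarity >= upper:
        return "match"
    if similarity >= lower:
        return "possible match"
    return "non-match"


def link(records, block_key):
    """Blocking, then pairwise comparison and classification inside each block."""
    blocks = {}
    for rid, name in records.items():
        blocks.setdefault(block_key(name), []).append(rid)
    decisions = {}
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            decisions[(a, b)] = classify(jaccard(records[a], records[b]))
    return decisions


# Hypothetical records; blocking on the first letter avoids comparing 4 with 1-3.
records = {1: "Jonathan Smith", 2: "Jonathon Smith", 3: "Jane Smythe", 4: "Peter Brown"}
decisions = link(records, block_key=lambda name: name[0].lower())
```

Blocking reduces the six possible pairs to three here; the fusion step (deciding which values survive in the merged record) is left out of the sketch.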


• ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.

• SimMetrics: Similarity Metric Java Library, http://sourceforge.net/projects/simmetrics/

• KOUDAS, NICK, SARAWAGI, SUNITA, & SRIVASTAVA, DIVESH. Record Linkage: Similarity Measures and Algorithms. Tutorial of SIGMOD 2006.

• DONG, LUNA, & NAUMANN, FELIX. Data Fusion: Resolving Data Conflicts for Integration. Tutorial of VLDB 2009.

Chaining or Spurious Linkage

ID | Name | Address
1 | AT&T | 180 Park. Av Florham Park
2 | ATT | 180 park Ave. Florham Park NJ
3 | AT&T Labs | 180 Park Avenue Florham Park
4 | ATT | Park Av. 180 Florham Park
5 | TAT | 180 park Av. NY
6 | ATT | 180 Park Avenue. NY NY
7 | ATT | Park Avenue, NY No. 180
8 | ATT | 180 Park NY NY

[Figure: the eight addresses above connected by pairwise similarity links, illustrating chaining]

Limitations:

• Expertise required for method selection and parameterization

• No benchmark
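Chaining arises when pairwise matches are closed transitively: A matches B and B matches C, so A and C end up in the same cluster even though they are dissimilar. The sketch below reproduces the effect with a union-find structure; the toy words, the function names and the threshold of one edit are assumptions chosen to keep the example small.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + cost))
        prev = cur
    return prev[-1]


def single_link_clusters(items, max_distance):
    """Merge items transitively whenever a pair is within max_distance."""
    parent = list(range(len(items)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if levenshtein(items[i], items[j]) <= max_distance:
                parent[find(i)] = find(j)

    clusters = {}
    for i, item in enumerate(items):
        clusters.setdefault(find(i), set()).add(item)
    return list(clusters.values())


# Hypothetical strings: each neighbour pair differs by one edit.
words = ["cat", "cot", "dot", "dog", "sun"]
clusters = single_link_clusters(words, max_distance=1)
```

Here "cat" and "dog" land in the same cluster although they are three edits apart, which is the spurious linkage the slide warns about.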


Outlier Taxonomy

Anomaly Detection

• Point Anomaly Detection

  – Classification Based: Rule Based, Neural Networks Based, SVM Based

  – Nearest Neighbor Based: Density Based, Distance Based

  – Statistical: Parametric, Non-parametric

  – Clustering Based

  – Others: Information Theory Based, Spectral Decomposition Based, Visualization Based

• Contextual Anomaly Detection

• Collective Anomaly Detection

• Online Anomaly Detection

• Distributed Anomaly Detection

Chandola, V., Banerjee, A., & Kumar, V. (2009). Anomaly Detection – A Survey. ACM Computing Surveys, 41(3), 1–58.

Example

• N1 and N2 are normal regions

• o1, o2 and o4 are point anomalies

• Region O3 is a collective anomaly

[Figure: scatter plot showing the normal regions N1 and N2, the points o1, o2 and o4, and the region O3]

So many detection methods…

• Bivariate analysis: the rejection area is the data space excluding the area defined between the 2% and 98% quantiles for X and Y

• Multivariate analysis: the rejection area is based on Mahalanobis_dist(cov(X,Y)) > χ2(.98,2)

• Comparison: legitimate outliers or data quality problems?

[Figure: scatter plots of the variables X, Y and Z comparing the bivariate and multivariate rejection areas]
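The multivariate rule can be sketched for two variables as follows. The sketch assumes that the quantity compared with the χ2 quantile is the squared Mahalanobis distance computed from the sample covariance of (X, Y), and it uses the closed form of the χ2 quantile for 2 degrees of freedom, -2 ln(1 - p). The function names and the data are hypothetical.

```python
import math


def chi2_quantile_2df(p):
    """Quantile of the chi-square distribution with 2 degrees of freedom.

    For 2 degrees of freedom the CDF is 1 - exp(-x / 2), so it can be inverted directly.
    """
    return -2.0 * math.log(1.0 - p)


def squared_mahalanobis_2d(points):
    """Squared Mahalanobis distance of each point to the sample mean."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # assumes a non-singular covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Quadratic form with the inverse of the 2x2 covariance matrix.
        distances.append((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)
    return distances


# Hypothetical data: 20 points close to the line y = x, plus one point off the line.
points = [(i, i + (0.5 if i % 2 == 0 else -0.5)) for i in range(20)] + [(5, 15)]

threshold = chi2_quantile_2df(0.98)
d2 = squared_mahalanobis_2d(points)
flagged = [p for p, d in zip(points, d2) if d > threshold]
```

The point (5, 15) is unremarkable on each axis taken separately, since both coordinates fall inside the observed ranges, yet it violates the joint correlation structure. That is the kind of disagreement between bivariate quantile rules and the multivariate rule that the slide compares.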


Contextual Anomaly

Also known as “conditional anomalies”*

[Figure: plot with observations labeled Normal and Anomaly]

* Xiuyao Song, Mingxi Wu, Christopher Jermaine, Sanjay Ranka. Conditional Anomaly Detection. IEEE Transactions on Knowledge and Data Engineering, 2006.

Collective Anomaly

• A collection of abnormal observations

• Requires the existence of a certain type of relationship between the observations:

– Sequential

– Spatial

– Connectivity (graph)

• An individual instance within a collective anomaly is not abnormal by itself

[Figure: sequence containing a subsequence anomaly]

Outlier Detection (1/4)

• Detection by inspecting frequency distributions and univariate measures of skewness and kurtosis

• Numerous detection techniques

– Distributional univariate technique: flag values more than 3σ away from the mean

– Goodness-of-fit tests: tests for normality, χ2 test, analysis of residuals, Q-Q plots, Kullback-Leibler divergence

– Control charts (X-Bar, R, CUSUM), error bounds, tolerance limits

– Regression-based technique: measures the outlyingness of a model, not an individual data point

– Geometric techniques: define layers of increasing depth; the outer layers contain the outlying points
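The 3σ rule from the list above is simple enough to show in full. The sketch below uses the population standard deviation and hypothetical measurements; it also illustrates a known weakness of the rule, namely that in a very small sample an extreme value inflates σ so much that nothing is flagged.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Values lying more than k standard deviations away from the mean."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [x for x in values if abs(x - mu) > k * sigma]


# Hypothetical measurements: 19 values near 10 and one extreme value.
large_sample = [9, 10, 11] * 6 + [10, 100]
small_sample = [9, 10, 11, 100]

flagged_large = three_sigma_outliers(large_sample)
flagged_small = three_sigma_outliers(small_sample)
```

With 20 observations the value 100 is flagged; with only 4 observations it is not, because no point can lie more than sqrt(n - 1) population standard deviations from the mean.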


Outlier Detection Methods (2/4)

• Popular methods: LOF, INFLO, LOCI see Tutorial of [Kriegel et al., 2009]

ELKI: http://elki.dbs.ifi.lmu.de/wiki

• Mixture distribution: Anomaly detection over noisy data using learned probability distributions [Eskin, 2000]

• Entropy: Discovering cluster-based local outliers [He, 2003]

• Projection into higher dimensional space: Kernel methods for pattern analysis [Shawe-Taylor, Cristianini, 2005]

Distance-based outliers (3/4)

Nearest Neighbour-based Approaches

• A point O in a dataset is a DB(p,d)-outlier if at least a fraction p of the points in the dataset lie at a distance greater than d from O. [Knorr, Ng, 1998]

• Outliers are the top n points whose distance to the k-th nearest neighbor is greatest. [Ramaswamy et al., 2000]

Limitations

– When normal points do not have a sufficient number of neighbours

– In high-dimensional spaces, due to data sparseness

– When datasets have modes with varying density

– Computationally expensive
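Both definitions translate almost literally into code. The sketch below is a brute-force illustration with quadratic cost, which also makes the "computationally expensive" limitation concrete. One assumption to note: the fraction p is taken over the other points of the dataset (excluding O itself). The function names and the sample points are hypothetical.

```python
import math


def db_outliers(points, p, d):
    """DB(p, d)-outliers: at least a fraction p of the other points lie farther than d."""
    n = len(points)
    result = []
    for i, o in enumerate(points):
        far = sum(1 for j, q in enumerate(points) if j != i and math.dist(o, q) > d)
        if far >= p * (n - 1):
            result.append(o)
    return result


def top_n_knn_outliers(points, k, n_top):
    """The n_top points with the largest distance to their k-th nearest neighbour."""
    scored = []
    for i, o in enumerate(points):
        dists = sorted(math.dist(o, q) for j, q in enumerate(points) if j != i)
        scored.append((dists[k - 1], o))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [o for _, o in scored[:n_top]]


# Hypothetical data: a tight group of five points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]

by_db = db_outliers(points, p=0.9, d=5.0)
by_knn = top_n_knn_outliers(points, k=2, n_top=1)
```

Both rules single out (10, 10) on this data; the choice of p, d, k and n is left to the analyst, which relates to the parameterization difficulties mentioned elsewhere in the deck.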


Density-based outliers (4/4)

Goal

Compute local densities of particular regions and declare data points in low density regions as potential anomalies

Methods

• Local Outlier Factor (LOF) [Breunig et al., 2000]

• Connectivity Outlier Factor (COF) [Tang et al., 2002]

• Multi-Granularity Deviation Factor [Papadimitriou et al., 2003]

[Figure: two points O1 and O2 with distances d1 and d2. NN: O2 is an outlier but O1 is not. LOF: O1 is an outlier but O2 is not.]

Limitations

• Difficult choice between methods with contradicting results

• In high-dimensional spaces, factor values tend to cluster because density is defined in terms of distance
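To make the density-based idea concrete, here is a compact, brute-force sketch of the Local Outlier Factor. It is simplified relative to the published definition: each point gets exactly k neighbours (ties at the k-th distance are broken by index order), and it assumes no duplicate points, since those would give a zero reachability sum. The function name and the sample data are hypothetical.

```python
import math


def lof_scores(points, k):
    """Local Outlier Factor of each point (values well above 1 suggest an outlier)."""
    n = len(points)
    dist = [[math.dist(a, b) for b in points] for a in points]

    neighbors = []  # indices of the k nearest neighbours of each point
    k_dist = []     # distance from each point to its k-th nearest neighbour
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        neighbors.append(order[:k])
        k_dist.append(dist[i][order[k - 1]])

    def local_reachability_density(i):
        # Reachability distance of i from j: max(k-distance of j, d(i, j)).
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        return len(reach) / sum(reach)

    lrd = [local_reachability_density(i) for i in range(n)]
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]


# Hypothetical data: a 3x3 grid of points plus one isolated point.
grid = [(x, y) for x in range(3) for y in range(3)]
points = grid + [(6, 6)]
scores = lof_scores(points, k=3)
```

On this data the grid points obtain scores close to 1, while the isolated point obtains a score of about 5, because its local density is far lower than that of its nearest neighbours.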


How to Handle Missing Data?

– Inclusion (applicable for less than 15%)

• Anomalies are treated as a specific category

– Deletion

• List-wise deletion omits the complete record (for less than 2%)

• Pair-wise deletion excludes only the anomaly value from a calculation

– Substitution (applicable for less than 15%)

• Single imputation based on mean, mode or median replacement

• Linear regression imputation

• Multiple imputation (MI)

• Full Information Maximum Likelihood (FIML)
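Two of the simpler options above, list-wise deletion and single imputation, can be sketched as follows. Missing values are represented by `None`; the column names, the records and the function names are hypothetical, and multiple imputation and FIML are outside the scope of this sketch.

```python
import statistics


def listwise_delete(rows):
    """Keep only the records without any missing value."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute(rows, column, strategy="median"):
    """Single imputation: replace missing values by the mean, median or mode."""
    estimators = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    observed = [r[column] for r in rows if r[column] is not None]
    fill = estimators[strategy](observed)
    # Build new records so that the input list is left unchanged.
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


# Hypothetical records with one missing value per column.
rows = [
    {"cpu": 10, "mem": 1},
    {"cpu": None, "mem": 2},
    {"cpu": 30, "mem": 3},
    {"cpu": 1000, "mem": None},
]

complete = listwise_delete(rows)
median_filled = impute(rows, "cpu", strategy="median")
mean_filled = impute(rows, "cpu", strategy="mean")
```

The example also shows the interaction between glitch types: the extreme value 1000 pulls the mean-based replacement far from the bulk of the data, while the median-based replacement stays at 30.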


How to Handle Dirty Data?

• Binning / Smoothing

– first sort data and partition into bins

– then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

• Clustering

– detect and remove outliers

• Combined computer and human inspection

– detect suspicious values and have a human check them

• Regression

– smooth by fitting the data into regression functions
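Binning-based smoothing, the first option above, can be illustrated on a small sorted list. The sketch partitions the sorted values into bins of equal depth, then smooths either by bin means or by bin boundaries; the values and function names are hypothetical, and ties between the two boundaries go to the lower one.

```python
def equal_depth_bins(values, depth):
    """Sort the data and cut it into consecutive bins of `depth` values."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[sum(b) / len(b)] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closest bin boundary (minimum or maximum)."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


# Hypothetical values.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_depth_bins(prices, depth=3)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

For this list the bins are [4, 8, 15], [21, 21, 24] and [25, 28, 34], with bin means 9, 22 and 29.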


Discretization (Binning) (1/3)

Goal

Transform continuous variables into a set of ranges treated as (ordered) categories

Advantages

– Simultaneous analysis of quantitative and qualitative variables

– Ability to capture non-linear correlations between continuous variables

– Neutralize extreme values

– Handle missing values with the creation of a specific category

– Cardinality reduction


Discretization (Binning) (2/3)

Recommendations

– Avoid large differences between the numbers of distinct values (categories) per variable

– Avoid categories with small populations

– The appropriate number of categories for a discrete or categorical variable is 4 or 5

– Remember:

• the weight of a variable is proportional to its number of distinct values

• the weight of a category is inversely proportional to its population

– Cardinality reduction on observations, variables, and categories

• Too few variables imply possible information loss

• Too many variables imply very small populations and less interpretable results

Binning Methods (3/3)

• Equal-width (distance) partitioning:

– Divides the range into N intervals of equal size (uniform grid)

– If A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B-A)/N

– The most straightforward approach

– But outliers may dominate the presentation

– Skewed data is not handled well

• Equal-depth (frequency) partitioning:

– Divides the range into N intervals, each containing the same number of samples

– Good data scaling

– Managing categorical attributes can be tricky
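The contrast between the two partitioning schemes shows up clearly on data with one extreme value. The sketch below assigns a bin index to each value under both schemes; the function names and the sample are hypothetical, and the equal-width version assumes that the values are not all identical (so that W is non-zero).

```python
def assign_equal_width(values, n_bins):
    """Bin index of each value with intervals of width W = (B - A) / N."""
    a, b = min(values), max(values)
    width = (b - a) / n_bins  # assumes b > a
    # The maximum value would fall on the upper edge, so clamp it into the last bin.
    return [min(int((v - a) / width), n_bins - 1) for v in values]


def assign_equal_depth(values, n_bins):
    """Bin index of each value so that bins hold (about) the same number of samples."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = rank * n_bins // len(values)
    return labels


# Hypothetical attribute with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 100]

width_labels = assign_equal_width(values, n_bins=3)
depth_labels = assign_equal_depth(values, n_bins=3)
```

With equal width, the value 100 stretches the range so that the eight other values share the first bin and the middle bin stays empty. With equal depth, each bin receives three values.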


Data Transformation

• Smoothing: remove noise from data

• Aggregation: summarization, data cube construction

• Generalization: concept hierarchy climbing

• Normalization: scaled to fall within a small, specified range

– min-max normalization

– z-score normalization

– normalization by decimal scaling

• Attribute/feature construction

– New attributes constructed from the given ones
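The three normalization schemes listed above can be sketched in a few lines. The function names and the sample values are hypothetical; the z-score uses the population standard deviation, and min-max scaling assumes that the values are not all equal.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    low, high = min(values), max(values)
    scale = (new_max - new_min) / (high - low)  # assumes high > low
    return [new_min + (v - low) * scale for v in values]


def z_score(values):
    """Center on the mean and divide by the standard deviation."""
    mu = statistics.mean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest integer such that max(|v|) / 10**j < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values]


scaled = min_max([10, 20, 30])
standardized = z_score([2, 4, 4, 4, 5, 5, 7, 9])
decimal = decimal_scaling([-986, 200, 917])
```

Note that min-max scaling and the z-score both depend on statistics (minimum, maximum, mean, standard deviation) that are themselves sensitive to outliers, so the order of cleaning and transformation steps matters.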


Summary

• Data preparation is a big issue for warehousing and data mining

• Data preparation includes:

– Anomaly Detection

– Data cleaning

– Data transformation

– Discretization

– Data reduction and feature selection

• A lot of methods have been developed: an extremely active area of research

Conclusions

• Much remains to be done to offer:

– An iterative process with performance and quality guarantees

– Benchmarks

– Optimization

– Formalized guidelines and rigorous methodologies

– User assistance

[Figure: iterative detection and cleaning loop (detection, cleaning, explanation) built on patterns and dependencies among anomalies: duplicates (deduplication), outliers (uni- and MV-detection), missing data (imputation), inconsistent data (constraint)]

Any questions?

Limited Bibliography

References

• Tutorials

– BATINI, CARLO, CATARCI, TIZIANA, & SCANNAPIECO, MONICA. 2004. A Survey of Data Quality Issues in Cooperative Systems. Tutorial of the 23rd International Conference on Conceptual Modeling, ER 2004.

– KOUDAS, NICK, SARAWAGI, SUNITA, & SRIVASTAVA, DIVESH. Record Linkage: Similarity Measures and Algorithms. Tutorial of SIGMOD 2006.

• Books

– NAUMANN, FELIX. Quality-Driven Query Answering for Integrated Information Systems. Lecture Notes in Computer Science, vol. 2261. Springer-Verlag, 2002.

– BATINI, CARLO, & SCANNAPIECO, MONICA. Data Quality: Concepts, Methodologies and Techniques. Data-Centric Systems and Applications. Springer-Verlag, 2006.

– DASU, TAMRAPARNI, & JOHNSON, THEODORE. Exploratory Data Mining and Data Cleaning. John Wiley, 2003.

– WANG, RICHARD Y., ZIAD, MOSTAPHA, & LEE, YANG W. Data Quality. Advances in Database Systems, vol. 23. Kluwer Academic Publishers, 2002.

• Data Profiling

– DASU, TAMRAPARNI, JOHNSON, THEODORE, MUTHUKRISHNAN, S., & SHKAPENYUK, V. Mining Database Structure; Or, How to Build a Data Quality Browser. Proc. SIGMOD Conf., 2002.

– CARUSO, FRANCESCO, COCHINWALA, MUNIR, GANAPATHY, UMA, LALK, GAIL, & MISSIER, PAOLO. 2000. Telcordia’s Database Reconciliation and Data Quality Analysis Tool. Pages 615–618 of: Proceedings of 26th International Conference on Very Large Data Bases, VLDB 2000. Cairo, Egypt.

References

• ETL

– CHRISTEN, PETER. Febrl: an open source data cleaning, deduplication and record linkage system with a graphical user interface. KDD 2008: 1065-1068, 2008.

– CHRISTEN, PETER, CHURCHES, TIM, ZHU, XI. Probabilistic name and address cleaning and standardization. Australasian Data Mining Workshop 2002.

– RAHM, E., DO, H.H., Data Cleaning: Problems and Current Approaches, Data Engineering Bulletin 23(4) 3-13, 2000.

– GALHARDAS, HELENA, FLORESCU, DANIELA, SHASHA, DENNIS, SIMON, ERIC, SAITA, CRISTIAN-AUGUSTIN. Declarative Data Cleaning: Language, Model, and Algorithms, Proc. VLDB Conf., pp. 371-380, 2001.

– JOHNSON, THEODORE, MARATHE, AMIT, & DASU, TAMRAPARNI. Database Exploration and Bellman. IEEE Data Eng. Bull. 26(3): 34-39, 2003.

– VASSILIADIS, PANOS, VAGENA, Z., SKIADOPOULOS, S., KARAYANNIDIS, N., & SELLIS, T. ARKTOS: A Tool For Data Cleaning and Transformation in Data Warehouse Environments. Bulletin of the Technical Committee on Data Engineering, vol. 23, no. 4, pp. 42-47, December 2000.

– VASSILIADIS, PANOS, KARAGIANNIS, ANASTASIOS, TZIOVARA, VASILIKI, & SIMITSIS, ALKIS. Towards a Benchmark for ETL Workflows. QDB 2007: 49-60, 2007.

– ELFEKY, MOHAMED G., ELMAGARMID, AHMED K., & VERYKIOS, VASSILIOS S. TAILOR: A Record Linkage Tool Box. Pages 17–28 of: Proceedings of the 18th International Conference on Data Engineering, ICDE 2002. San Jose, CA, USA, 2002.

– ELMAGARMID, AHMED K., IPEIROTIS, PANAGIOTIS G., & VERYKIOS, VASSILIOS S. Duplicate Record Detection: A Survey. IEEE Trans. Knowl. Data Eng., 19(1), 1–16, 2007.

References

• ETL

– LIM, EE-PENG, SRIVASTAVA, JAIDEEP, PRABHAKAR, SATYA, & RICHARDSON, JAMES. 1993. Entity Identification in Database Integration. Pages 294–301 of: Proceedings of the 9th International Conference on Data Engineering, ICDE 1993. Vienna, Austria.

– LOW, WAI LUP, LEE, MONG-LI, & LING, TOK WANG. 2001. A Knowledge-Based Approach for Duplicate Elimination in Data Cleaning. Inf. Syst., 26(8), 585–606.

– SIMITSIS, ALKIS, VASSILIADIS, PANOS, & SELLIS, TIMOS K. 2005. Optimizing ETL Processes in Data Warehouses. Pages 564–575 of: Proceedings of the 21st International Conference on Data Engineering, ICDE 2005. Tokyo, Japan.

– TEJADA, SHEILA, KNOBLOCK, CRAIG A., & MINTON, STEVEN. 2002. Learning Domain-Independent String Transformation Weights for High Accuracy Object Identification. Pages 350–359 of: Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2002. Edmonton, AB, Canada.

– SIMITSIS, ALKIS, VASSILIADIS, PANOS, & SELLIS, TIMOS K. State-Space Optimization of ETL Workflows. IEEE Transactions on Knowledge and Data Engineering (IEEE TKDE), vol. 17, no. 10, pp. 1404-1419, October 2005.

• Approximate String Matching

– NAVARRO, GONZALO. 2001. A Guided Tour to Approximate String Matching. ACM Comput. Surv., 33(1), 31–88.

– GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., JAGADISH, H. V., KOUDAS, NICK, MUTHUKRISHNAN, S., PIETARINEN, LAURI, & SRIVASTAVA, DIVESH. 2001. Using q-grams in a DBMS for Approximate String Processing. IEEE Data Eng. Bull., 24(4), 28–34.

References

• Record Linkage

– ANANTHAKRISHNA, ROHIT, CHAUDHURI, SURAJIT, & GANTI, VENKATESH. Eliminating Fuzzy Duplicates in Data Warehouses. pp. 586–597, Proc. of VLDB 2002.

– BAXTER, ROHAN A., CHRISTEN, PETER, & CHURCHES, TIM. A Comparison of Fast Blocking Methods for Record Linkage. Pages 27–29 of: Proceedings of the KDD’03 Workshop on Data Cleaning, Record Linkage and Object Consolidation, 2003.

– BILENKO, MIKHAIL, BASU, SUGATO, & SAHAMI, MEHRAN. 2005. Adaptive Product Normalization: Using Online Learning for Record Linkage in Comparison Shopping. Pages 58–65 of: Proceedings of the 5th IEEE International Conference on Data Mining, ICDM 2005. Houston, TX, USA, 2005.

– BHATTACHARYA, INDRAJIT, & GETOOR, LISE. Iterative Record Linkage for Cleaning and Integration. Pages 11–18 of: Proceedings of the 9th ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, DMKD, 2004.

– FELLEGI, IVAN P., & SUNTER, A.B. A Theory for Record Linkage. Journal of the American Statistical Association, 64, 1183–1210, 1969.

– WINKLER, WILLIAM E. The State of Record Linkage and Current Research Problems. Tech. Rept. Statistics of Income Division, Internal Revenue Service Publication R99/04. U.S. Bureau of the Census, Washington, DC, USA, 1999.

– WINKLER, WILLIAM E. Methods for Evaluating and Creating Data Quality. Inf. Syst., 29(7), 531–550, 2004.

– WINKLER, WILLIAM E., & THIBAUDEAU, YVES. An Application of the Fellegi-Sunter Model of Record Linkage to the 1990 U.S. Decennial Census. Tech. Rept. Statistical Research Report Series RR91/09. U.S. Bureau of the Census, Washington, DC, USA, 1991.

References

• Duplicate Detection

– HERNANDEZ, M., & STOLFO, S. The Merge/Purge Problem for Large Databases. Proc. SIGMOD Conf., pp. 127-135, 1995.

– HERNANDEZ, M., & STOLFO, S. Real-World Data is Dirty: Data Cleansing and the Merge/Purge Problem. Data Mining and Knowledge Discovery, 2(1), 9-37, 1998.

– BILENKO, MIKHAIL, & MOONEY, RAYMOND J. Adaptive Duplicate Detection Using Learnable String Similarity Measures. Proc. of the 9th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pp. 39–48, Washington, DC, USA, 2003.

– BILKE, ALEXANDER, BLEIHOLDER, JENS, BÖHM, CHRISTOPH, DRABA, KARSTEN, NAUMANN, FELIX, & WEIS, MELANIE. 2005. Automatic Data Fusion with HumMer. Proc. of the 31st Intl. Conf. on Very Large Data Bases, VLDB 2005, pp. 1251–1254. Trondheim, Norway.

– CHAUDHURI, SURAJIT, GANTI, VENKATESH, & KAUSHIK, RAGHAV. 2006. A Primitive Operator for Similarity Joins in Data Cleaning. Page 5 of: Proceedings of the 22nd International Conference on Data Engineering, ICDE 2006. Atlanta, GA, USA.

– GRAVANO, LUIS, IPEIROTIS, PANAGIOTIS G., KOUDAS, NICK, & SRIVASTAVA, DIVESH. Text Joins for Data Cleansing and Integration in an RDBMS. Proc. of the 19th Intl. Conf. on Data Engineering, ICDE 2003, pp. 729–731, Bangalore, India, 2003.

– MCCALLUM, ANDREW, NIGAM, KAMAL, & UNGAR, LYLE H. 2000. Efficient Clustering of High-Dimensional Data Sets with Application to Reference Matching. Proc. of the 6th ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, KDD 2000, pp. 169–178. Boston, MA, USA.

– MONGE, ALVARO E. 2000. Matching Algorithms within a Duplicate Detection System. IEEE Data Eng. Bull., 23(4), 14–20.

– WEIS, MELANIE, & NAUMANN, FELIX. 2004. Detecting Duplicate Objects in XML

– WEIS, MELANIE, NAUMANN, FELIX, & BROSY, FRANZISKA. 2006. A Duplicate Detection Benchmark for XML (and Relational) Data. Proc. of the 3rd Intl. ACM SIGMOD 2006 Workshop on Information Quality in Information Systems, IQIS 2006. Chicago, IL, USA.

References

• Data Preparation

– STATNOTES: Topics in Multivariate Analysis. Retrieved 10/17/2008 from http://www2.chass.ncsu.edu/garson/pa765/statnote.htm

– KLINE, R.B. Data Preparation and Screening, Chapter 3 in Principles and Practice of Structural Equation Modeling, NY: Guilford Press, pp. 45-62, 2005.

– BANSAL, NIKHIL, BLUM, AVRIM, and CHAWLA, SHUCHI. Correlation clustering. Machine Learning, 56(1-3):89–113, 2004.

– PARSONS, SIMON. Current Approaches to Handling Imperfect Information in Data and Knowledge Bases. IEEE Trans. Knowl. Data Eng., 8(3), 353–372, 1996.

– PEARSON, RONALD K. The problem of disguised missing data. SIGKDD Explorations 8(1): 83-92, 2006.

– PEARSON, RONALD K. Surveying Data for Patchy Structure. SDM 2005

– PEARSON, RONALD K. Mining Imperfect Data: Dealing with Contamination and Incomplete Records. Philadelphia: SIAM 2005.

References

• Geometric Outliers

– PREPARATA, F. P., & SHAMOS, M. I. Computational Geometry: An Introduction. Springer-Verlag, 1988.

• Distributional Outliers

– KNORR, EDWIN M., & NG, RAYMOND T. Algorithms for Mining Distance-Based Outliers in Large Datasets. Proc. of the 24th International Conference on Very Large Data Bases, VLDB 1998, pp. 392–403. New York City, NY, USA, 1998.

– BREUNIG, MARKUS M., KRIEGEL, HANS-PETER, NG, RAYMOND T., & SANDER, JÖRG. LOF: Identifying Density-Based Local Outliers. Proc. of the 2000 ACM SIGMOD International Conference on Management of Data, pp. 93–104. Dallas, TX, USA, 2000.

• Missing Value Imputation

– SCHAFER, J. L. Analysis of Incomplete Multivariate Data. New York: Chapman and Hall, 1997.

– LITTLE, R. J. A., & RUBIN, D. B. Statistical Analysis with Missing Data. New York: John Wiley & Sons, 1987.

– MCKNIGHT, P. E., FIGUEREDO, A. J., & SIDANI, S. Missing Data: A Gentle Introduction. Guilford Press, 2007.

– DEMPSTER, ARTHUR PENTLAND, LAIRD, NAN M., & RUBIN, DONALD B. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society, 39, 1–38, 1977.
