
Multivariate Outlier Detection and Robust Covariance Matrix Estimation

Daniel Peña and Francisco J. Prieto

Department of Statistics and Econometrics, Universidad Carlos III de Madrid

28903 Getafe (Madrid), Spain

([email protected]) ([email protected])

In this article, we present a simple multivariate outlier-detection procedure and a robust estimator for the covariance matrix, based on the use of information obtained from projections onto the directions that maximize and minimize the kurtosis coefficient of the projected data. The properties of this estimator (computational cost, bias) are analyzed and compared with those of other robust estimators described in the literature through simulation studies. The performance of the outlier-detection procedure is analyzed by applying it to a set of well-known examples.

    KEY WORDS: Kurtosis; Linear projection; Multivariate statistics.

The detection of outliers in multivariate data is recognized to be an important and difficult problem in the physical, chemical, and engineering sciences. Whenever multiple measurements are obtained, there is always the possibility that changes in the measurement process will generate clusters of outliers. Most standard multivariate analysis techniques rely on the assumption of normality and require the use of estimates for both the location and scale parameters of the distribution. The presence of outliers may distort arbitrarily the values of these estimators and render meaningless the results of the application of these techniques. According to Rocke and Woodruff (1996), the problem of the joint estimation of location and shape is one of the most difficult in robust statistics.

Wilks (1963) proposed identifying sets of outliers of size j in normal multivariate data by checking the minimum values of the ratios A(I)/A, where A(I) is the internal scatter of a modified sample in which the set of observations I of size j has been deleted and A is the internal scatter of the complete sample. The internal scatter is proportional to the determinant of the covariance matrix, and the ratios are computed for all possible sets of size j. Wilks computed the distribution of the statistic for j equal to 1 and 2. It is well known that this procedure is a likelihood ratio test and that for j = 1 the method is equivalent to selecting the observation with the largest Mahalanobis distance from the center of the data.

Because a direct extension of this idea to sets of outliers larger than 2 or 3 is not practical, Gnanadesikan and Kettenring (1972) proposed to reduce the multivariate detection problem to a set of univariate problems by looking at projections of the data onto some direction. They chose the direction of maximum variability of the data and, therefore, they proposed to obtain the principal components of the data and search for outliers in these directions. Although this method provides the correct solution when the outliers are located close to the directions of the principal components, it may fail to identify outliers in the general case.

An alternative approach is to use robust location and scale estimators. Maronna (1976) studied affinely equivariant M estimators for covariance matrices, and Campbell (1980)

proposed using the Mahalanobis distance computed using M estimators for the mean and covariance matrix. Stahel (1981) and Donoho (1982) proposed to solve the dimensionality problem by computing the weights for the robust estimators from the projections of the data onto some directions. These directions were chosen to maximize distances based on robust univariate location and scale estimators, and the optimal values for the distances could also be used to weigh each point in the computation of a robust covariance matrix. To ensure a high breakdown point, one global optimization problem with discontinuous derivatives had to be solved for each data point, and the associated computational cost became prohibitive for large high-dimensional datasets. This computational cost can be reduced if the directions are generated by a resampling procedure of the original data, but the number of directions to consider still grows exponentially with the dimension of the problem.

A different procedure was proposed by Rousseeuw (1985) based on the computation of the ellipsoid with the smallest volume or with the smallest covariance determinant that would encompass at least half of the data points. This procedure has been analyzed and extended in a large number of articles; see, for example, Hampel, Ronchetti, Rousseeuw, and Stahel (1986), Rousseeuw and Leroy (1987), Davies (1987), Rousseeuw and van Zomeren (1990), Tyler (1991), Cook, Hawkins, and Weisberg (1993), Rocke and Woodruff (1993, 1996), Maronna and Yohai (1995), Agulló (1996), Hawkins and Olive (1999), Becker and Gather (1999), and Rousseeuw and Van Driessen (1999). Public-domain codes implementing these procedures can be found in STATLIB; for example, the code FSAMVE from Hawkins (1994) and MULTOUT from Rocke and Woodruff (1993, 1996). FAST-MCD from Rousseeuw and Van Driessen is implemented as the "mcd.cov" function of S-PLUS.

    2001 American Statistical Association andthe American Society for Quality

    TECHNOMETRICS, AUGUST 2001, VOL. 43, NO. 3



Because these procedures are based on the minimization of certain nonconvex and nondifferentiable criteria, these estimators are computed by resampling. For example, Rousseeuw (1993) proposed selecting p observations from the original sample and computing the direction orthogonal to the hyperplane defined by these observations. The maximum over this finite set of directions is used as an approximation to the exact solution. Unfortunately, the number of candidate solutions grows exponentially with the size of the problem and, as a consequence, the corresponding procedures become computationally expensive for even moderately sized problems. Hadi (1992, 1994), Atkinson (1994), Hawkins and Olive (1999), and Rousseeuw and Van Driessen (1999) presented methods to compute approximations for these estimates requiring reasonable computation times.

In this article, we present an alternative procedure, based on the analysis of the projections of the sample points onto a certain set of 2p directions, where p is the dimension of the sample space. These directions are obtained by maximizing and minimizing the kurtosis coefficient of the projections. The proposed procedure can be seen as an empirically successful and faster way of implementing the Stahel-Donoho (SD) algorithm. The justification for using these directions is presented in Section 1. Section 2 describes the proposed procedure and illustrates its behavior on an example. Section 3 compares it to other procedures by a simulation study. It is shown that the proposed procedure works well in practice, is simple to implement, and requires reasonable computation times, even for large problems. Finally, Section 4 presents some conclusions.

    1. KURTOSIS AND OUTLIERS

The idea of using projections to identify outliers is the basis for several outlier-detection procedures. These procedures rely on the fact that in multivariate contaminated samples each outlier must be an extreme point along the direction from the mean of the uncontaminated data to the outlier. Unfortunately, high-breakdown-point methods developed to date along these lines, such as the SD algorithm, require projecting the data onto randomly generated directions and need very large numbers of directions to be successful. The efficiency of these methods could be significantly improved, at least from a computational point of view, if a limited number of appropriate directions would suffice to identify the outliers. Our proposal is to choose these directions based on the values of the kurtosis coefficients of the projected observations.

In this section, we study the impact of the presence of outliers on the kurtosis values and the use of this moment coefficient to identify them. We start by considering the univariate case, in which different types of outliers produce different effects on the kurtosis coefficient. Outliers generated by the usual symmetric contaminated model increase the kurtosis coefficient of the observed data. A small proportion of outliers generated by an asymmetric contaminated model also increases the kurtosis coefficient of the observed data. These two results suggest that for multivariate data outliers may be revealed on univariate projections onto directions obtained by maximizing the kurtosis coefficient of the projected data. However, a large proportion of outliers generated by an asymmetric contamination model can make the kurtosis coefficient of the data

very small, close to its minimum possible value. This result suggests searching for outliers also using directions obtained by minimizing the kurtosis of the projections. Therefore, a procedure that would search for outliers by projecting the data onto the directions that maximize or minimize the kurtosis of the projected points would seem promising.

In univariate normal data, outliers have often been associated with large kurtosis values, and some well-known tests of normality are based on the asymmetry and kurtosis coefficients. These ideas have also been used to test for multivariate normality (see Malkovich and Afifi 1973). Additionally, some projection indices that have been applied in projection pursuit algorithms are related to the third and fourth moments (Jones and Sibson 1987; Posse 1995). Hampel (1985) derived the relationship between the critical value and the breakdown point of the kurtosis coefficient in univariate samples. He also showed that two-point distributions are the least favorable for detecting univariate outliers using the kurtosis coefficient.

To understand the effect of different types of outliers on the kurtosis coefficient, suppose that we have a sample of univariate data from a random variable that has a distribution F with finite moments (the uncontaminated sample). We assume without loss of generality that μ_F = ∫ x dF(x) = 0, and we will use the notation m_F(j) = ∫ x^j dF(x). The sample is contaminated by a fraction α < 1/2 of outliers generated from some contaminating distribution G, with μ_G = ∫ x dG(x), and we will denote the centered moments of this distribution by m_G(j) = ∫ (x − μ_G)^j dG(x). Therefore, the resulting observed random variable X follows a mixture of two distributions, (1 − α)F + αG. The signal-to-noise ratio will be given by r² = μ_G²/m_F(2), and the ratio of variances of the two distributions will be v² = m_G(2)/m_F(2). The third- and fourth-order moment coefficients for the mixture and the original and contaminating distributions will be denoted by a_i = m_i(3)/m_i(2)^{3/2} and γ_i = m_i(4)/m_i(2)² for i = X, F, G, respectively. The ratio of the kurtosis coefficients of G and F will be ρ = γ_G/γ_F.

Some conditions must be introduced on F and G to ensure that this is a reasonable model for outliers. The first condition is that, for any values of the distribution parameters, the standard distribution F has a bounded kurtosis coefficient γ_F. This bound avoids the situation in which the tails of the standard distribution are so heavy that extreme observations, which cannot be distinguished from outliers, will appear with significant probability. Note that the most often used distributions (normal, Student's t, gamma, beta, ...) satisfy this condition. The second condition is that the contaminating distribution G is such that

μ_G m_G(3) ≥ 0;   (1)

that is, if the distribution is not symmetric, the relevant tail of G for the generation of outliers and the mean of G both lie on the same side with respect to the mean of F. This second assumption avoids situations in which most of the observations generated from G might not be outliers.

To analyze the effect of outliers on the kurtosis coefficient, we write its value for the contaminated population as (see the appendix for the derivation)

γ_X = [γ_F + α(1 − α)(c₄ + 4r c₃ + 6r² c₂ + r⁴ c₀)] / (h₀ + h₁ r² + h₂ r⁴),   (2)

where c₄ = γ_F(ρv⁴ − 1)/(1 − α), c₃ = a_G v³ − a_F, c₂ = α + (1 − α)v², c₀ = α³ + (1 − α)³, h₀ = (1 + α(v² − 1))², h₁ = 2α(1 − α)h₀^{1/2}, and h₂ = α²(1 − α)². We consider the two following cases:

1. The centered case, in which we suppose that both F and G have the same mean, and as a consequence μ_G = 0 and r = 0. From (2), we obtain for this case

γ_X = γ_F (1 + α(ρv⁴ − 1)) / (1 + α(v² − 1))².

Note that the kurtosis coefficient increases due to the presence of outliers, that is, γ_X ≥ γ_F, whenever ρv⁴ − 2v² + 1 ≥ α(v² − 1)². This holds if ρ ≥ 1, or equivalently if γ_G ≥ γ_F, and under this condition the kurtosis will increase for any value of α. Thus, in the usual situation in which the outlier model is built by using a contaminating distribution of the same family as the original distribution [as in the often-used normal scale-contaminated model; see, for instance, Box and Tiao (1968)] or with heavier tails, the kurtosis coefficient of the observed data is expected to be larger than that of the original distribution.

2. Consider now the noncentered case, in which both distributions are arbitrary and we assume that the means of G and F are different (μ_G ≠ 0). A reasonable condition to ensure that G will generate outliers for F is that the signal-to-noise ratio r is large enough. If we let r → ∞ in (2) (and we assume that the moment coefficients in the expression remain bounded), we obtain

γ_X → (α³ + (1 − α)³) / (α(1 − α)).

This result agrees with the one obtained by Hampel (1985). Note that if α = 0.5 the kurtosis coefficient of the observed data will be equal to 1, the minimum possible value. On the other hand, if α → 0 the kurtosis coefficient increases without bound and will become larger than γ_F, which is bounded. Therefore, in the asymmetric case, if the contamination is very large the kurtosis coefficient will be very small, whereas if the contamination is small the kurtosis coefficient will be large.

The preceding results agree with the dual interpretation of the standard fourth-moment coefficient of kurtosis (see Ruppert 1987; Balanda and MacGillivray 1988) as measuring tail heaviness and lack of bimodality. A small number of outliers will produce heavy tails and a larger kurtosis coefficient. But, if we increase the amount of outliers, we can start introducing bimodality and the kurtosis coefficient may decrease.
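To make these two effects concrete, the following small numerical illustration (our own sketch in Python/NumPy, not part of the original article) contrasts a mild symmetric scale contamination, which inflates the kurtosis coefficient, with a heavy shifted contamination (α close to 1/2), which drives it toward its lower bound of 1:

import numpy as np

rng = np.random.default_rng(0)

def kurtosis(x):
    # Standard fourth-moment kurtosis coefficient (equal to 3 for normal data)
    x = x - x.mean()
    return np.mean(x**4) / np.mean(x**2)**2

n = 100_000
clean = rng.normal(size=n)

# Symmetric scale contamination: 5% of the observations have standard deviation 5
scale_mix = np.where(rng.random(n) < 0.05, rng.normal(scale=5.0, size=n), clean)

# Heavy asymmetric contamination: 40% of the observations shifted to 10
shift_mix = np.where(rng.random(n) < 0.40, rng.normal(loc=10.0, size=n), clean)

print(kurtosis(clean))      # close to 3
print(kurtosis(scale_mix))  # clearly larger than 3 (heavier tails)
print(kurtosis(shift_mix))  # well below 3, approaching the lower bound 1 (bimodal sample)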

1.1 Kurtosis and Projections

The preceding discussion centered on the behavior of the kurtosis coefficient in the univariate case as an indicator for the presence of outliers. A multivariate method to take advantage of these properties would proceed through two stages: determining a set of projection directions to obtain univariate samples and then conducting an analysis of these samples to determine if any outlier may be present in the original sample. As indicated in the introduction, this is the approach developed by Stahel and Donoho. In this section we will show how to find interesting directions to detect outliers.

The study of the univariate kurtosis coefficient indicates that the presence of outliers in the projected data will imply particularly large (or small) values for the kurtosis coefficient. As a consequence, it would be reasonable to use as projection directions those that maximize or minimize the kurtosis coefficient of the projected data. For a standard multivariate contamination model, we will show that these directions are able to identify a set of outliers.

Consider a p-dimensional random variable X following a (contaminated normal) distribution given as a mixture of normals of the form (1 − α)N(0, I) + αN(δe₁, λ²I), where e₁ denotes the first unit vector. This contamination model is particularly difficult to analyze for many outlier-detection procedures (e.g., see Maronna and Yohai 1995). Moreover, the analysis in the preceding section indicates that this model, for noncentered contaminating distributions, may correspond to an unfavorable situation from the point of view of the kurtosis coefficient.

Since the kurtosis coefficient is invariant to affine transformations, we will center and scale the variable to ensure that it has mean 0 and covariance matrix equal to the identity. This transformed variable, Y, will follow a distribution of the form (1 − α)N(m₁, σ₁²S) + αN(m₂, σ₂²S), where

m₁ = −αδ₁ S^{1/2}e₁,   m₂ = (1 − α)δ₁ S^{1/2}e₁,   S = I − δ₂ e₁e₁',   (3)

and δ₁, δ₂, σ₁², and σ₂² denote auxiliary parameters, introduced to simplify the expressions (see the appendix for a derivation of these values).

We wish to study the behavior of the univariate projections of this variable and their kurtosis coefficient values. Consider an arbitrary projection direction u. Using the affine invariance of the kurtosis coefficient, we will assume ‖u‖ = 1. The projected univariate random variable Z = u'Y will follow a distribution (1 − α)N(m₁'u, σ₁²u'Su) + αN(m₂'u, σ₂²u'Su), with E(Z) = 0 and E(Z²) = 1. The kurtosis coefficient of Z will be given by

γ_Z(κ) = a(δ₁, δ₂) + b(δ₁, δ₂)κ² + c(δ₁, δ₂)κ⁴,   (4)

where κ = e₁'u, and the coefficients a(δ₁, δ₂), b(δ₁, δ₂), and c(δ₁, δ₂) are explicit functions of the auxiliary parameters δ₁ and δ₂   (5)

(see the appendix for details on this derivation).

We wish to study the relationship between the direction to the outliers, e₁ in our model, and the directions u that correspond to extremes for the projected kurtosis coefficient.


The optimization problem defining these extreme directions would be either

max γ_Z(κ)   s.t.   −1 ≤ κ ≤ 1,   (6)

or the equivalent minimization problem. From the first-order optimality conditions for (6), the extremes may correspond to either a point in the interval (−1, 1) such that γ_Z'(κ) = 0 or to the extreme points of the interval, κ = ±1. Since γ_Z'(κ) = 4cκ³ + 2bκ, the points that make this derivative equal to 0 are κ = 0 and κ = ±(−b/(2c))^{1/2}. We now analyze in detail the nature of each of these three possible extreme points.

1. κ = ±1, that is, the direction of the outliers. This direction corresponds to a local maximizer whenever 4c + 2b > 0 (the derivative at κ = 1) and to a minimizer whenever 4c + 2b < 0 (the case in which 4c + 2b = 0 will be treated when considering the third candidate to an extreme point). The expression for 4c + 2b from (5) is quite complex to analyze in the general case, but for the case of small contamination levels (α → 0) it holds that δ₁ → δ, δ₂/α → δ², and

lim_{α→0} (4c + 2b)/α = 4δ⁴ + 12δ²(λ² − 1),

implying that, since this value is positive except for very small values of δ and λ (λ² < 1 − δ²/3), for small values of α the direction of the outliers will be a maximizer for the kurtosis coefficient. Moreover, for large contamination levels (α → 1/2), after some manipulation of the expressions in (5) it holds that lim_{α→1/2} (4c + 2b) < 0 and, as a consequence, if α is large we always have a minimizer along the direction of the outliers.

2. κ = 0, a direction orthogonal to the outliers. Along this direction, as γ_Z''(0) = 2b, the kurtosis coefficient has a maximizer whenever b < 0. For small contaminations, from lim_{α→0} b/α = 6δ²(λ² − 1), the kurtosis coefficient has a maximizer for λ < 1 and a minimizer for λ > 1. Comparing the kurtosis coefficient values when the direction to the outliers and a direction orthogonal to it are both local maximizers (when λ < 1), from (4) and

lim_{α→0} (γ_Z(±1) − γ_Z(0))/α = δ²(δ² + 6(λ² − 1)),

it follows that κ = ±1 corresponds to the global maximizer, except for very small values of δ. For large contaminations, the limit of b as α → 1/2, obtained from (5), is negative, and the kurtosis coefficient has a maximizer at κ = 0.

3. κ = ±(−b/(2c))^{1/2}, an intermediate direction, if this value lies in the interval [−1, 1], that is, if 0 < −b/(2c) < 1. For small contamination levels (α → 0), it holds that

lim_{α→0} (−b/(2c)) = 3(1 − λ²)/δ²,

and this local extreme point exists whenever λ² is larger than 1 − δ²/3 and smaller than 1. Additionally, since in this case γ_Z'' = −4b at the interior critical point and b < 0 whenever λ < 1, it holds that γ_Z'' > 0. Consequently, for small concentrated contaminations this additional extreme point exists and is a minimizer. For large contamination levels, the limit of −b/(2c) as α → 1/2 is, for any δ and λ, either negative or larger than 1. As a consequence, no extreme intermediate direction exists if the contamination is large.

Table 1. Nature of the Extreme Directions for the Projected Kurtosis Coefficient

                              Small cont., λ < 1    Small cont., λ ≥ 1    Large cont.
κ = ±1                        Global max.           Global max.           Global min.
κ = 0                         Local max.            Global min.           Global max.
κ = ±(−b/(2c))^{1/2}          Global min.

Table 1 provides a brief summary of the preceding results. Entries "Global max." and "Global min." indicate whether a given direction is the global maximizer or the global minimizer of the kurtosis coefficient, respectively. "Local max." indicates the case in which the direction orthogonal to the outliers is a local maximizer for the kurtosis coefficient.

To detect the outliers, the procedure should be able to compute the direction to the outliers (κ = ±1) from Problem (6). However, from Table 1, to obtain this direction we would need both the global minimizer and the global maximizer of (6). In practice this cannot be done efficiently; as an alternative solution, we compute one local minimizer and one local maximizer. These computations can be done with reasonable computational effort, as we describe later on. This would not ensure obtaining the direction to the outliers, because in some cases these two directions might correspond to a direction orthogonal to the outliers and the intermediate direction, for example, but we are assured of obtaining either the direction to the outliers or a direction orthogonal to it. The computation procedure is then continued by projecting the data onto a subspace orthogonal to the computed directions, and additional directions are obtained by solving the resulting optimization problems. In summary, we will compute one minimizer and one maximizer for the projected kurtosis coefficient, project the data onto an orthogonal subspace, and repeat this procedure until 2p directions have been computed. For the case analyzed previously, this procedure should ensure that the direction to the outliers is one of these 2p directions.

Note that, although we have considered a special contamination pattern, this suggested procedure also seems reasonable in those cases in which the contamination patterns are different, for example, when more than one cluster of contaminating observations is present.

    2. DESCRIPTION OF THE ALGORITHM

We assume that we are given a sample (x₁, ..., xₙ) of a p-dimensional vector random variable X. The proposed


procedure is based on projecting each observation onto a set of 2p directions and then analyzing the univariate projections onto these directions in a similar manner to the SD algorithm. These directions are obtained as the solutions of 2p simple smooth optimization problems, as follows:

1. The original data are rescaled and centered. Let x̄ denote the mean and S the covariance matrix of the original data; then the points are transformed using

y_i = S^{−1/2}(x_i − x̄),   i = 1, ..., n.   (7)

2. Compute p orthogonal directions and projections maximizing the kurtosis coefficient.

a. Set y_i^{(1)} = y_i and the iteration index j = 1.
b. The direction that maximizes the coefficient of kurtosis is obtained as the solution of the problem

d_j = arg max_d (1/n) Σ_{i=1}^n (d'y_i^{(j)})⁴
s.t. d'd = 1.   (8)

c. The sample points are projected onto a lower-dimension subspace, orthogonal to the direction d_j. Define

v_j = d_j − e₁,   Q_j = I − v_j v_j'/(v_j'd_j)   if v_j'd_j ≠ 0,   Q_j = I otherwise,

where e₁ denotes the first unit vector. The resulting matrix Q_j is orthogonal, and we compute the new values

u_i^{(j)} = (z_i^{(j)}, y_i^{(j+1)'})' = Q_j y_i^{(j)},   i = 1, ..., n,

where z_i^{(j)} is the first component of u_i^{(j)}, which satisfies z_i^{(j)} = d_j'y_i^{(j)} (the univariate projection values), and y_i^{(j+1)} corresponds to the remaining p − j components of u_i^{(j)}. We set j = j + 1 and, if j < p, we go back to step 2b. Otherwise, we let z_i^{(p)} = y_i^{(p)}.

3. We compute another set of p orthogonal directions and projections minimizing the kurtosis coefficient.

a. Reset y_i^{(p+1)} = y_i and j = p + 1.
b. The preceding steps 2b and 2c are repeated, but now instead of (8) we solve the minimization problem

d_j = arg min_d (1/n) Σ_{i=1}^n (d'y_i^{(j)})⁴
s.t. d'd = 1   (9)

to compute the projection directions.

4. To determine if z_i^{(j)} is an outlier in any one of the 2p directions, we compute a univariate measure of outlyingness for each observation as

r_i = max_{1≤j≤2p} |z_i^{(j)} − median(z^{(j)})| / MAD(z^{(j)}).   (10)

5. These measures r_i are used to test if a given observation is considered to be an outlier. If r_i > β_p, then observation i is suspected of being an outlier and labeled as such. The cutoff values β_p are chosen to ensure a reasonable level of Type I errors and depend on the sample space dimension p.

Table 2. Cutoff Values for Univariate Projections

Sample space dimension p      5      10      20
Cutoff value β_p            4.1     6.9    10.8

6. If the condition in Step 5 were satisfied for some i, a new sample composed of all observations i such that r_i ≤ β_p is formed, and the procedure is applied again to the reduced sample. This is repeated until either no additional observations satisfy r_i > β_p or the number of remaining observations would be less than (n + p + 1)/2.

7. Finally, a Mahalanobis distance is computed for all observations labeled as outliers in the preceding steps, using the data (mean and covariance matrix) from the remaining observations. Let U denote the set of all observations not labeled as outliers. The algorithm computes

m̃ = (1/|U|) Σ_{i∈U} x_i,
S̃ = (1/(|U| − 1)) Σ_{i∈U} (x_i − m̃)(x_i − m̃)',

and

v_i = (x_i − m̃)'S̃^{−1}(x_i − m̃)   for all i ∉ U.

Those observations i ∉ U such that v_i < χ²_{p,0.99} are considered not to be outliers and are included in U. The process is repeated until no more such observations are found (or U becomes the set of all observations).

The values of β_p in Step 5 of the algorithm have been obtained from simulation experiments to ensure that, in the absence of outliers, the percentage of correct observations mislabeled as outliers is approximately equal to 5%. Table 2 shows the values used for several sample-space dimensions. The values for other dimensions could be obtained by interpolating log β_p linearly in log p.
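As an illustration of Steps 1 through 5, the following sketch (written in Python/NumPy for this presentation; it is not the authors' Matlab code, and the function and variable names are ours) standardizes the data, computes the 2p univariate projections by deflating onto orthogonal subspaces, and evaluates the outlyingness measure (10). The routines find_max_direction and find_min_direction stand for the computation of a local maximizer or minimizer of the projected kurtosis, discussed in Section 2.1 and sketched there.

import numpy as np

def kurtosis_projections(y, find_direction):
    """Steps 2-3: compute p orthogonal extreme-kurtosis directions by
    deflation and return the n x p matrix of univariate projections."""
    n, p = y.shape
    z = np.empty((n, p))
    yj = y.copy()
    for j in range(p):
        d = find_direction(yj)                 # unit vector of dimension p - j
        z[:, j] = yj @ d                       # univariate projections z^(j)
        v = d.copy()
        v[0] -= 1.0                            # v_j = d_j - e_1
        if abs(v @ d) > 1e-12:                 # Householder matrix mapping d_j to e_1
            Q = np.eye(len(d)) - np.outer(v, v) / (v @ d)
        else:
            Q = np.eye(len(d))
        yj = (yj @ Q.T)[:, 1:]                 # project onto the orthogonal subspace
    return z

def outlyingness(x, find_max_direction, find_min_direction):
    """Steps 1-4: standardize the data, project onto 2p directions, and
    compute the measure r_i of equation (10)."""
    xbar = x.mean(axis=0)
    S = np.cov(x, rowvar=False)
    L = np.linalg.cholesky(np.linalg.inv(S))
    y = (x - xbar) @ L                         # Cholesky whitening in place of S^{-1/2} in (7)
    z = np.hstack([kurtosis_projections(y, find_max_direction),
                   kurtosis_projections(y, find_min_direction)])
    med = np.median(z, axis=0)
    mad = np.median(np.abs(z - med), axis=0)
    return np.max(np.abs(z - med) / mad, axis=1)

# Step 5: observation i is labeled a suspected outlier when r_i > beta_p (Table 2);
# Steps 6-7 (iterative relabeling and the final chi-squared check) are omitted here.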

    2.1 Computation of the Projection Directions

The main computational effort in the application of the preceding algorithm is associated with the determination of local solutions for either (8) or (9). This computation can be conducted in several ways:

1. Applying a modified version of Newton's method.
2. Obtaining the solution directly from the first-order optimality conditions. The optimality conditions for both problems are

4 Σ_{i=1}^n (d'y_i^{(j)})³ y_i^{(j)} − 2λd = 0,
d'd = 1,

where λ denotes the Lagrange multiplier of the constraint.

Multiplying the first equation by d and replacing the constraint, we obtain the value of λ. The resulting condition is

Σ_{i=1}^n (d'y_i^{(j)})² y_i^{(j)} y_i^{(j)'} d = [Σ_{i=1}^n (d'y_i^{(j)})⁴] d.   (11)

This equation indicates that the optimal d will be a unit eigenvector of the matrix

M(d) ≡ Σ_{i=1}^n (d'y_i^{(j)})² y_i^{(j)} y_i^{(j)'},

that is, of a weighted covariance matrix for the sample, with positive weights (depending on d). Moreover, since the eigenvalue at the solution is the value of the fourth moment, we are interested in computing the eigenvector corresponding to the largest or smallest eigenvalue.

Table 3. Scaling Factors for the Robust Covariance Estimator

Sample space dimension p      5      10      20
Scaling factor k_p          .98     .95     .92

In summary, an iterative procedure to compute the direction d proceeds through the following steps:

1. Select an initial direction d̄₀ such that ‖d̄₀‖₂ = 1.
2. In iteration l + 1, compute d̄_{l+1} as the unit eigenvector associated with the largest (smallest) eigenvalue of M(d̄_l).
3. Terminate whenever ‖d̄_{l+1} − d̄_l‖ < ε, and set d_j = d̄_{l+1}.

Another relevant issue is the definition of the initial direction d̄₀. Our choice has been to start with the direction corresponding to the largest (when computing maximizers for the kurtosis) or the smallest (when minimizing) principal components of the normalized observations y_i^{(j)}/‖y_i^{(j)}‖. These directions have the property that, once the observations have been standardized, they are affine equivariant. They would also correspond to directions along which the observations projected onto the unit hypersphere seem to present some relevant structure; this would provide a reasonable starting point when the outliers are concentrated, for example. A more detailed discussion on the motivation for this choice of initial directions was given by Juan and Prieto (1997).

Figure 1. Scatterplots for a Dataset With α = 0.1. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).
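A minimal NumPy sketch of this fixed-point scheme follows (our own illustration; the tolerance, the iteration cap, and the name extreme_kurtosis_direction are arbitrary choices, not taken from the article). It starts from an extreme principal direction of the observations normalized to the unit sphere and repeatedly replaces d by the extreme unit eigenvector of the weighted covariance matrix M(d):

import numpy as np

def extreme_kurtosis_direction(y, maximize=True, tol=1e-8, max_iter=200):
    """Fixed-point iteration for a local solution of (8) or (9): d is updated
    as the unit eigenvector of M(d) = sum_i (d'y_i)^2 y_i y_i' associated with
    its largest (maximization) or smallest (minimization) eigenvalue."""
    n, p = y.shape
    # Initial direction: extreme principal direction of the observations
    # projected onto the unit hypersphere
    u = y / np.linalg.norm(y, axis=1, keepdims=True)
    _, vecs = np.linalg.eigh(u.T @ u / n)
    d = vecs[:, -1] if maximize else vecs[:, 0]
    for _ in range(max_iter):
        w = (y @ d) ** 2                       # positive weights (d'y_i)^2
        M = (y * w[:, None]).T @ y             # weighted covariance matrix M(d)
        _, vecs = np.linalg.eigh(M)
        d_new = vecs[:, -1] if maximize else vecs[:, 0]
        if d_new @ d < 0:                      # d and -d define the same direction
            d_new = -d_new
        if np.linalg.norm(d_new - d) < tol:
            break
        d = d_new
    return d_new

Passing, for instance, lambda yj: extreme_kurtosis_direction(yj, maximize=True) (or False) as the direction-finding routine in the sketch given after Section 2 reproduces the deflation loop of Steps 2 and 3.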

    2.2 Robust Covariance Matrix Estimation

Once the observations have been labeled either as outliers or as part of the uncontaminated sample, following the procedure described previously, it is possible to generate robust estimates for the mean and covariance of the data as the mean and covariance of the uncontaminated observations. This approach, as opposed to the use of weight functions, seems reasonable given the reduced Type I errors associated with the procedure. Nevertheless, note that it is necessary to correct the covariance estimator to account for the bias associated with these errors.


The proposed estimators become

m̃ = (1/|U|) Σ_{i∈U} x_i

and

S̃_c = [1/((|U| − 1)k_p)] Σ_{i∈U} (x_i − m̃)(x_i − m̃)',

where U is the set of all observations not labeled as outliers, |U| denotes the number of observations in this set, and k_p is a constant that has been estimated to ensure that the trace of the estimated matrix is unbiased.

The values of k_p have been obtained through a simulation experiment for several sample space dimensions and are given in Table 3. The values for other dimensions could be obtained by interpolating log k_p linearly in log p.
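In code, the correction simply divides the sample covariance of the retained observations by the corresponding factor; a small sketch (ours, with the function name chosen for illustration and k_p taken from Table 3) follows:

import numpy as np

def robust_estimates(x, outlier_mask, k_p):
    """Mean and bias-corrected covariance matrix of the observations not
    labeled as outliers (Section 2.2)."""
    u = x[~outlier_mask]                    # the retained observations, the set U
    m = u.mean(axis=0)
    r = u - m
    S_c = (r.T @ r) / ((len(u) - 1) * k_p)  # dividing by k_p corrects the trace bias
    return m, S_c

# Example: m, S_c = robust_estimates(x, r_i > beta_p, k_p=0.95) for p = 10.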

    2.3 Examples

To illustrate the procedure and the relevance of choosing projection directions in the manner described previously, we show the results from the computation of the projection directions for a few simple cases. The first ones are based on generating 100 observations from a model of the form (1 − α)N(0, I) + αN(10e, I) in dimension 2, where e = (1, 1)', for different values of α.

Figure 2. Scatterplots for a Dataset With α = 0.3. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).

Consider Figure 1, corresponding to the preceding model with α = 0.1. This figure shows the scatterplots of the data, where the axes have been chosen as (1) in Figure 1(a), the direction to the outliers (e) and a direction orthogonal to it; (2) in Figure 1(b), the direction giving a maximizer for the kurtosis coefficient (the x axis) and a direction orthogonal to it; and (3) in Figure 1(c), the direction corresponding to a minimizer for the kurtosis coefficient (also for the x axis) and a direction orthogonal to it. In this case, the direction maximizing the kurtosis coefficient allows the correct identification of the outliers, in agreement with the results in Table 1 for the case with small α.

Figure 2 shows another dataset, this time corresponding to α = 0.3, in the same format as Figure 1. As the analysis in the preceding section showed and the figure illustrates, here the relevant direction to identify the outliers is the one minimizing the kurtosis coefficient, given the large contamination present.

Finally, Figure 3 presents a dataset generated from the model using α = 0.2 in the same format as the preceding figures. It is remarkable in this case that both the direction maximizing the kurtosis coefficient and the direction minimizing it are not the best ones for identifying the outliers; instead, the direction orthogonal to that maximizing the kurtosis corresponds now to the direction to the outliers. The optimization procedure has computed a direction orthogonal to the outliers as the maximizer and an intermediate direction as the minimizer. As a consequence, the direction to the outliers is obtained once Problem (6) is solved for the observations projected onto the direction maximizing the kurtosis. This result justifies that in some cases (for intermediate contamination levels) it is important to compute directions orthogonal to those corresponding to extremes in the kurtosis coefficient, and this effect becomes even more significant as the sample-space dimension increases.

Figure 3. Scatterplots for a Dataset With α = 0.2. The axes correspond to projections onto (a) the direction to the outliers (x axis) and an orthogonal direction (y axis), (b) the direction maximizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis), (c) the direction minimizing the kurtosis coefficient (x axis) and an orthogonal direction (y axis).

Consider a final example in higher dimension. A sample of

100 observations has been obtained by generating 60 observations from an N(0, I) distribution in dimension 10 and 10 observations each from N(10d_i, I) distributions for i = 1, ..., 4, where the d_i were distributed uniformly on the unit hypersphere. Figure 4 shows the projections of these observations onto four of the directions obtained from the application of the proposed procedure. Each plot gives the value of the projection onto one of the directions for each observation in the sample. The outliers are the last 40 observations and have been plotted using the symbols "+", "o", "#", and "x" for each of the clusters, while the uncontaminated observations are the first 60 in the set and have been plotted using the symbol "*".

Figure 4(a) shows the projections onto the kurtosis maximization direction. This direction is able to isolate the observations corresponding to one of the clusters of outliers in the data (the one labeled as "#") but not the remaining outliers. The next direction, which maximizes the kurtosis on a subspace orthogonal to the preceding direction, reveals the outliers indicated as "+", as shown in Figure 4(b). This process is repeated until eight additional directions, maximizing the kurtosis onto the corresponding orthogonal subspaces, are generated. The next direction obtained in this way (the third one maximizing the kurtosis) is not able to reveal any outliers, but the fourth, shown in Figure 4(c), allows the identification of the outliers shown as "o". The remaining kurtosis maximization directions (not shown in the figure) are not able to reveal any additional groups of outliers.

Figure 4. Univariate Projections Onto Directions Generated by the Algorithm for a Dataset in Dimension 10. The x axis represents the observation number, while the y axis corresponds to the projections of each observation onto (a) the first maximization direction, (b) the second maximization direction, (c) the fourth maximization direction, (d) the third minimization direction.

To detect the outliers labeled as "x", the kurtosis minimization directions must be used. The first two of these are again unable to reveal the presence of any outliers. On the other hand, the third minimization direction, shown in Figure 4(d), allows the identification of (nearly) all the outliers at once (it is a direction in the subspace generated by the four directions to the centers of the outlier clusters). The remaining directions are not very useful.

This example illustrates the importance of using both minimization and maximization directions and, in each case, relying not just on the first optimizer but on computing a full set of orthogonal directions.

    3. PROPERTIES OF THE ESTIMATOR

The computation of directions maximizing the kurtosis coefficient is affine equivariant. Note that the standardization of the data in Step 1 of the algorithm ensures that the resulting data are invariant to affine transformations, except for a rotation. The computation of the projection directions preserves this property, and the values of the projections are affine invariant. Note also that the initial point for the optimization algorithm is not affected by affine transformations.

As a consequence of the analysis in Section 1.1, we conclude that the algorithm is expected to work properly if the directions computed are those to the outliers or orthogonal to them, since additional orthogonal directions will be computed in later iterations. It might fail if one of the computed directions is the one corresponding to the intermediate extreme direction (whenever it exists). This intermediate direction will correspond either to a maximizer or to a minimizer, depending on the values of α, δ, and λ. Because the projection step does not affect the values of δ or λ, this intermediate direction would be found either as part of the set of p directions maximizing the kurtosis coefficient or as part of the p minimizers, but it cannot appear in both sets. As a consequence, if this intermediate direction appears as a maximizer (minimizer), the set of minimizing (maximizing) directions will include only the directions corresponding to κ = 0 or κ = ±1 and, therefore, the true direction to the outliers will always be a member of one of these two sets.


Table 4. Results Obtained by the Proposed Algorithm, Using Both Maximization and Minimization Directions, on Some Small Datasets

Dataset                     Dimension   # Observations   # Outliers   Outliers                                                 Time (s.)
Heart                           2             12              5       2, 6, 8, 10, 12                                             .05
Phosphor                        2             18              7       1, 4, 6, 7, 10, 16, 18                                      .16
Stackloss                       3             21              8       1, 2, 3, 4, 13, 14, 20, 21                                  .27
Salinity                        3             28              8       5, 10, 11, 15, 16, 17, 23, 24                               .32
Hawkins-Bradu-Kass (1984)       3             75             14       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14               .17
Coleman                         5             20              7       1, 6, 9, 10, 11, 13, 18                                     .22
Wood                            5             20              4       4, 6, 8, 19                                                 .22
Bushfire                        5             38             15       7, 8, 9, 10, 11, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38     .11

    Simulation Results

We have conducted a number of computational experiments to study the practical behavior of the proposed procedure. Since the use of minimization directions for the kurtosis coefficient is not a very intuitive choice, we have implemented two versions of the proposed algorithm: kurtosis1 corresponds to the description given in Section 2; kurtosis2 uses only the set of p maximization directions, while preserving the remaining implementation details in the algorithm. Our first experiment has analyzed the outlier-detection

behavior of the algorithm on a collection of eight small datasets. The first seven were taken from Rousseeuw and Leroy (1987) and were studied by Rousseeuw and Van Driessen (1999), for example. The last one is from Campbell (1989) and was analyzed by Maronna and Yohai (1995) and Becker and Gather (1999), among others. Table 4 gives the corresponding results for algorithm kurtosis1, indicating the dataset, its dimension and number of observations, the number of observations that have been labeled as suspected outliers, the specific observations that have been so labeled, and the running times in seconds. The cutoff points used to label the observations as outliers have been those indicated in the description of the algorithm in Section 2 (Steps 5 and 7 and Table 2). All values are based on a Matlab implementation of the proposed procedure, and the running times have been obtained using Matlab 4.2 on a 450 MHz Pentium PC. These results are similar to those reported in the literature

for other outlier-detection methods, and they indicate that the proposed method behaves reliably on these test sets. These same test problems have been analyzed using kurtosis2. The results are given in Table 5, and for these small problems are nearly identical (except for the phosphor dataset) to the ones obtained using both minimization and maximization directions and presented in Table 4.

Table 5. Results Obtained by the Proposed Algorithm, Using Only Maximization Directions, on Some Small Datasets

Dataset                     # Outliers   Outliers                                                       Time (s.)
Heart                            5       2, 6, 8, 10, 12                                                   .05
Phosphor                         2       1, 6                                                              .05
Stackloss                        8       1, 2, 3, 4, 13, 14, 20, 21                                        .16
Salinity                         8       5, 10, 11, 15, 16, 17, 23, 24                                     .11
Hawkins-Bradu-Kass (1984)       14       1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14                     .06
Coleman                          7       1, 6, 9, 10, 11, 13, 18                                           .11
Wood                             4       4, 6, 8, 19                                                       .11
Bushfire                        16       7, 8, 9, 10, 11, 12, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38       .11

To explore further the properties of the method, we have performed an extensive set of simulation experiments for larger sample sizes and observation dimensions. The experiments compare the performance of both proposed algorithms, regarding the identification of the outliers and the estimation of covariance matrices, with the results from two other codes:

1. A recent and efficient algorithm for the implementation of the minimum covariance determinant (MCD) procedure proposed by Rousseeuw (1985). The FAST-MCD algorithm, based on the splitting of the problem into smaller subproblems, is much faster than previous algorithms; it was proposed by Rousseeuw and Van Driessen (1999).

2. A version of the SD algorithm, corresponding to the implementation described by Maronna and Yohai (1995). The choice of parameters has been the same as in this reference. In particular, the number of subsamples has been chosen as 1,000 for dimension 5. For dimensions 10 and 20, not included in the Monte Carlo study by Maronna and Yohai (1995), we have used 2,000 and 5,000 subsamples, respectively.

For a given contamination level α, we have generated a set of 100(1 − α) observations from an N(0, I) distribution in dimension p. We have added 100α additional observations from an N(δe, λ²I) distribution, where e denotes the vector (1, ..., 1)'. This model is analogous to the one used by Rousseeuw and Van Driessen (1999). This experiment has been conducted for different values of the sample-space dimension p (p = 5, 10, 20), the contamination level α (α = 0.1, 0.2, 0.3, 0.4), the distance of the outliers δ (δ = 10, 100), and the standard deviation of these outliers λ (λ = 0.1, 1, 5). For each set of values, 100 samples have been generated.


Table 6. Success Rates for the Detection of Outliers Forming One Cluster

                         δ = 10                                   δ = 100
p     α     λ    FAST-MCD    SD   Kurtosis1   Kurtosis2   FAST-MCD    SD   Kurtosis1   Kurtosis2
5    .3    .1        0      100      100         83          100    100      100          88
           1       100      100       95         38          100    100       94          31
     .4    .1        0        0       53          0            0    100      100           0
           1       100       99       91          0          100    100       93           0
           5       100       93      100        100          100    100      100         100
10   .2    .1        0      100      100        100          100    100      100         100
           1       100      100       60         83          100    100       61          84
     .3    .1        0      100      100          0            0    100      100           1
           1       100      100       23          2          100    100       21           0
     .4    .1        0        0       52          0            0      0      100           0
           1        74        0       82          0           67      0       81           0
           5       100       53      100        100          100     73      100         100
20   .1    .1       86      100      100        100          100    100      100         100
           1       100      100       87         88          100    100       84          82
     .2    .1        0       72      100          8            0    100      100           7
           1        98       61        1          2          100    100        0           0
           5       100       67      100        100          100    100      100         100
     .3    .1        0        0       98          0            0      0      100           0
           1        19        0        0          0           20      0        0           0
           5       100        0      100        100          100      0      100         100
     .4    .1        0        0        1          0            0      0        5           0
           1         0        0        9          0            1      0        8           0
           5       100        0       99         95          100      0       99          90

Table 6 gives the number of samples in which all the outliers have been correctly identified for each set of parameter values and both the proposed algorithms (kurtosis1 and kurtosis2) and the FAST-MCD and SD algorithms. To limit the size of the table, we have shown only those cases in which one of the algorithms scored less than 95 successes.

The proposed method (kurtosis1) seems to perform much better than FAST-MCD for concentrated contaminations, while its behavior is worse for those cases in which the shape of the contamination is similar to that of the original data (λ = 1). From the results in Section 1, this case tends to be one of the most difficult ones for the kurtosis algorithm because the objective function is nearly constant for all directions, and for finite samples it tends to present many local minimizers, particularly along directions that are nearly orthogonal to the outliers. Nevertheless, this behavior, closely associated with the value λ = 1, disappears as λ moves away from 1. For example, for p = 10 and α = 0.3 the number of successes in 100 trials goes up from 23 for λ = 1 to 61 for λ = 0.8 and 64 for λ = 1.25. In any case, we have included the values for λ = 1 to show the worst-case behavior of the algorithm.

Regarding the SD algorithm, the proposed method behaves better for large space dimensions and large contamination levels, showing that it is advantageous to study the data on a small set of reasonably chosen projection directions, particularly in those situations in which a random choice of directions would appear to be inefficient.

The variant of the algorithm that uses only maximization directions (kurtosis2) presents much worse results than kurtosis1 when the contamination level is high and the contamination is concentrated. As the analysis in Section 1 suggested, in those cases the minimization directions are important.

The case analyzed in Table 6 covers a particular contamination model, the one analyzed in Section 1. It is interesting to study the behavior of the algorithm on other possible contamination models, for example, when the outliers form several clusters. We have simulated cases with two and four clusters of outliers, constructed to contain the same number of observations (100α/k for k clusters), with centers that lie at a distance δ = 10√p from the origin (the center of the uncontaminated observations) along random uniformly distributed directions. The variability inside each cluster is the same for all of them. Table 7 gives the results of these simulations, in the same format as Table 6.

The results are similar to those in Table 6. The proposed method works much better than FAST-MCD for small values of λ and worse for values of λ close to 1. Regarding the SD algorithm, the random choice of directions works better as the number of clusters increases. Nevertheless, note that, as the sample space dimension and the contamination level increase, the preceding results seem to indicate that the SD algorithm may start to become less efficient.

The results in Tables 6 and 7 show that the 2p directions obtained as extremes of the kurtosis coefficient of the projections can be computed in a few seconds and perform in most cases better than the thousands of directions randomly generated by the SD estimator, requiring a much larger computational time. Moreover, the minimization directions play a significant role for large concentrated contaminations. These results suggest that the SD estimator can be easily improved, while preserving its good theoretical properties, by including these 2p directions in addition to the other randomly selected directions. From the same results, we also see that the FAST-MCD code performs very well in situations in which the kurtosis procedure fails and vice versa. Again, a combination of these two procedures can be very fruitful.

Table 7. Success Rate for the Detection of Outliers Forming Two and Four Clusters

                         2 clusters                               4 clusters
p     α     λ    FAST-MCD    SD   Kurtosis1   Kurtosis2   FAST-MCD    SD   Kurtosis1   Kurtosis2
5    .4    .1       65      100       16          0          100    100       89          94
           1       100      100      100         81          100    100      100         100
10   .3    .1       18      100      100         78          100    100      100         100
     .4    .1        0      100       51          0           72    100       15           5
           1        83      100       60          0           99    100       97          95
20   .2    .1       15      100      100        100           93    100      100         100
           1       100      100       90         88          100    100      100         100
     .3    .1        0      100      100          0           22    100      100          99
           1        37       98        1          0           99    100       99          98
           5       100       60      100        100          100    100      100         100
     .4    .1        0       60        5          0            0    100        6           0
           1         0       28        3          0            1     99        2           0
           5       100        0      100         93          100     23       98         100

The preceding tables have presented information related to

    the behavior of the procedures with respect to Type II errors.To complement this information, Type I errors have also beenstudied. Table 8 shows the average number of observationsthat are labeled as outliers by both procedures when 100 obser-vations are generated from an N401 I 5 distribution. Each valueis based on 100 repetitions.The kurtosis algorithm is able to limit the size of these

    errors through a proper choice of the constants p in Step 5of the algorithm. The SD algorithm could also be adjustedin this way, although, in the implementation used, the cutoff

    for the observations has been chosen asq2p1 095, following the

    suggestion of Maronna and Yohai (1995).A second important application of these procedures is the

A second important application of these procedures is the robust estimation of the covariance matrix. The same simulation experiments described previously have been repeated, but now measuring the bias in the estimation of the covariance matrix. The chosen measure has been the average of the logarithms of the condition numbers for the robust covariance matrix estimators obtained using the three methods: FAST-MCD, SD, and kurtosis. Given the sample generation process, a value close to 0 would indicate a small bias in this condition number. Tables 9 and 10 show the average values for these estimates for the settings in Tables 6 and 7, respectively. To limit the size of the tables, only two values for the contamination level (α = .1, .3) have been considered. The abnormally large entries in these tables correspond to situations in which the algorithm is not able to identify the outliers properly.

Table 8. Percentage of Normal Observations Mislabeled as Outliers

 Dimension   FAST-MCD    SD   Kurtosis1   Kurtosis2
     5           9.9     8.4       6.9        6.9
    10          22.9      .2       9.9       11.2
    20          36.2      .0       7.6        7.2

An interesting result is that the kurtosis procedure does a very good job with respect to this measure of performance in the estimation of the covariance matrix, at least whenever it is able to identify the outliers properly. Note in particular how well it compares to FAST-MCD, a procedure that should perform very well, particularly for small contamination levels or large dimensions. Its performance is even better when compared to SD, showing again the advantages of a nonrandom choice of projection directions.
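The bias measure reported in Tables 9 and 10 can be illustrated with a short sketch (assumptions: the good data are generated with identity covariance, so an unbiased scatter estimate should have a log condition number near 0; the function name is hypothetical):

```python
import numpy as np

# Sketch of the bias measure in Tables 9 and 10: the logarithm of the condition
# number of a covariance estimate. For data generated with identity covariance,
# an unbiased estimate should give a value close to 0 (sampling variability
# keeps it somewhat above 0 for moderate n).
def log_condition_number(cov_estimate):
    eigvals = np.linalg.eigvalsh(cov_estimate)   # symmetric matrix: eigvalsh
    return np.log(eigvals.max() / eigvals.min())

rng = np.random.default_rng(1)
clean = rng.normal(size=(100, 5))
print(log_condition_number(np.cov(clean, rowvar=False)))
```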

Regarding computational costs, comparisons are not simple to carry out because the FAST-MCD code is a FORTRAN code, while the kurtosis procedure has been written in Matlab. Tables 4 and 5 include running times for some small datasets. Table 11 presents some running times for larger datasets, constructed in the same manner as those included in Tables 6–10. All cases correspond to α = .2, δ = 10, and λ = .1. The SD code used 15,000 replications for p = 30 and 30,000 for p = 40. All other values have been fixed as indicated for Table 6. The times correspond to the analysis of a single dataset and are based on the average of the running times for 10 random datasets. They have been obtained on a 450-MHz Pentium PC under Windows 98.

These times compare quite well with those of SD and FAST-MCD. Since the version of FAST-MCD we have used is a FORTRAN code, this should imply additional advantages if a FORTRAN implementation of the proposed procedure were developed. A Matlab implementation of the proposed procedures is available at http://halweb.uc3m.es/fjp/download.html.

    4. CONCLUSIONS

A method to identify outliers in multivariate samples, based on the analysis of univariate projections onto directions that correspond to extremes of the kurtosis coefficient, has been motivated and developed. In particular, a detailed analysis has been conducted on the properties of the kurtosis coefficient in contaminated univariate samples and on the relationship between directions to outliers and extremes of the kurtosis in the multivariate case.

The method is affine equivariant, and it shows a very satisfactory practical performance, especially for large sample space dimensions and concentrated contaminations.
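A simplified sketch of the underlying idea (not the paper's 2p-direction algorithm, which works on affinely standardized data; the crude random search below is only for illustration) is:

```python
import numpy as np

# Sketch: the kurtosis coefficient of the data projected onto a direction u,
# and a crude random search for the directions with extreme projected kurtosis.
def projected_kurtosis(X, u):
    z = X @ u
    z = z - z.mean()
    return np.mean(z**4) / np.mean(z**2)**2

def extreme_kurtosis_directions(X, n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    dirs = rng.normal(size=(n_trials, X.shape[1]))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)   # unit directions
    k = np.array([projected_kurtosis(X, u) for u in dirs])
    return dirs[k.argmax()], dirs[k.argmin()]             # max- and min-kurtosis

rng = np.random.default_rng(2)
good = rng.normal(size=(90, 5))
outliers = rng.normal(size=(10, 5)) * 0.1 + np.r_[10, 0, 0, 0, 0]  # small cluster
u_max, u_min = extreme_kurtosis_directions(np.vstack([good, outliers]))
```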



Table 9. Average Logarithm of the Condition Numbers for Covariance Matrix Estimates, Outliers Forming One Cluster

                                 δ = 10                                δ = 100
 p    α    λ    FAST-MCD    SD    Kurtosis1  Kurtosis2    FAST-MCD    SD    Kurtosis1  Kurtosis2
 5   .1   .1        .97   1.26       .90       .88          1.01    1.22       .91       .90
           1       1.07   1.15       .90       .94          1.05    1.10       .84       .90
           5       1.02    .99       .88       .93           .99    1.06       .90       .90
     .3   .1       7.79   4.09       .97      1.71           .92    4.08       .95      1.87
           1        .91   2.95      1.21      3.48           .88    2.95      1.48      6.96
           5        .89   2.08      1.05      1.05           .93    2.56      1.03      1.07
10   .1   .1       1.87   2.00      1.60      1.63          1.84    1.97      1.56      1.58
           1       1.85   1.76      1.59      1.61          1.84    1.84      1.61      1.64
           5       1.86   1.60      1.53      1.59          1.85    1.71      1.55      1.59
     .3   .1       9.53   5.55      1.57      8.38         14.00    5.59      1.59     12.89
           1       1.63   4.41      5.11      6.21          1.59    4.45      8.85     10.89
           5       1.69   3.32      1.72      1.73          1.68    4.21      1.75      1.75
20   .1   .1       3.87   3.52      2.60      2.53          3.01    3.56      2.48      2.49
           1       2.99   3.54      3.00      2.88          3.10    3.49      3.85      4.05
           5       3.06   3.19      2.45      2.42          3.09    3.51      2.42      2.43
     .3   .1      10.97   7.97      2.45      9.96         15.56   12.57      2.44     14.59
           1       7.32   7.11      7.33      7.32         10.96   11.70     11.94     11.95
           5       2.87   5.26      2.61      2.56          2.77    9.85      2.60      2.60

In this sense, it complements the practical properties of MCD-based methods such as the FAST-MCD procedure. The method also produces good robust estimates for the covariance matrix, with low bias.

The associate editor of this article suggested a generalization of this method based on using the measure of multivariate kurtosis introduced by Arnold (1964) and discussed by Mardia (1970) and selecting h ≤ p directions at a time to maximize (or minimize) the h-variate kurtosis. A second set of h directions orthogonal to the first can then be obtained, and the procedure can be repeated as in the proposed algorithm. This idea seems very promising for further research on this problem.
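For reference, a minimal sketch of the multivariate kurtosis measure mentioned above (Mardia 1970), as it is usually computed, is:

```python
import numpy as np

# Sketch of Mardia's (1970) multivariate kurtosis b_{2,p}: the average of the
# squared Mahalanobis distances. Under multivariate normality its expectation
# is approximately p(p + 2).
def mardia_kurtosis(X):
    centered = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(X, rowvar=False))
    d2 = np.einsum("ij,jk,ik->i", centered, S_inv, centered)
    return np.mean(d2**2)

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))
print(mardia_kurtosis(X), 4 * (4 + 2))   # compare with p(p + 2) = 24
```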

Table 10. Average Logarithm of the Condition Numbers for Covariance Matrix Estimates, Outliers Forming Two and Four Clusters

                                2 clusters                             4 clusters
 p    α    λ    FAST-MCD    SD    Kurtosis1  Kurtosis2    FAST-MCD    SD    Kurtosis1  Kurtosis2
 5   .1   .1       1.01    .95       .93       .91          1.06     .79       .86       .83
           1       1.03    .90       .90       .92          1.04     .77       .89       .91
           5        .97    .84       .87       .90          1.03     .79       .92       .90
     .3   .1        .92   2.42       .94       .98           .92    1.48       .99      1.01
           1        .92   1.95      1.02      1.08           .90    1.37      1.06      1.04
           5        .89   1.56      1.04      1.02           .91    1.13      1.00      1.02
10   .1   .1       1.81   1.54      1.56      1.58          1.84    1.26      1.53      1.56
           1       1.85   1.44      1.57      1.64          1.87    1.23      1.51      1.60
           5       1.90   1.38      1.61      1.64          1.86    1.21      1.56      1.54
     .3   .1       8.00   3.40      1.56      2.41          1.68    2.27      1.57      1.59
           1       1.70   2.87      1.83      1.78          1.66    2.13      1.72      1.75
           5       1.67   2.51      1.75      1.71          1.69    1.89      1.73      1.76
20   .1   .1       3.04   2.74      2.51      2.45          3.11    2.14      2.42      2.38
           1       3.08   2.79      2.42      2.41          3.15    2.16      2.39      2.32
           5       3.08   2.61      2.40      2.39          3.16    2.07      2.37      2.24
     .3   .1      10.91   5.05      2.47      9.10          6.50    3.53      2.50      2.50
           1       5.87   4.93      6.96      6.98          2.76    3.82      2.63      2.70
           5       2.76   4.40      2.55      2.61          2.81    3.58      2.63      2.61

There are also practical problems in which the affine equivariance property may not be very relevant. For example, in many engineering problems arbitrary linear combinations of the design variables have no particular meaning. For these cases, and especially in the presence of concentrated contaminations, we have found that adding those directions that maximize the fourth central moment of the data results in a more powerful procedure.
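A sketch of this alternative objective (the function name is ours; contrast it with the scale-free kurtosis coefficient used elsewhere in the article):

```python
import numpy as np

# Sketch: the fourth central moment of the data projected onto a direction u.
# Unlike the kurtosis coefficient, it is not divided by the squared variance,
# so it is scale dependent and the resulting procedure is not affine equivariant.
def projected_fourth_moment(X, u):
    z = X @ (u / np.linalg.norm(u))
    return np.mean((z - z.mean()) ** 4)
```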

The results presented in this article emphasize the advantages of combining random and specific directions. It can be expected that, if we have a large set of random uniformly distributed outliers in high dimension, a method that computes a very large set of random directions will be more powerful than one that computes a small number of specific directions.



Table 11. Running Times (in seconds) on Large Synthetic Datasets

  p     n    FAST-MCD      SD   Kurtosis1  Kurtosis2
 10   100        5.5      4.0       1.2        .6
      200        9.8      8.0       2.6       1.8
 20   100       20.6     11.7       3.3       1.5
      200       36.0     22.1       7.9       4.0
 30   300      114.8    109.6      28.0      18.9
      500      183.6    182.8      54.1      46.6
 40   400      270.5    338.9      74.1      38.4

On the other hand, when the outliers appear along specific directions, a method that searches for these directions is expected to be very useful. These results emphasize the advantages of combining random and specific directions in the search for multivariate outliers. In particular, the incorporation of the kurtosis directions in the standard SD procedure can improve it in many cases at a small additional computational cost.

    ACKNOWLEDGMENTS

We thank Peter Rousseeuw and Katrien van Driessen for making their code FAST-MCD available to us. We also thank the referees, the associate editor, and Technometrics Editor Karen Kafadar for their suggestions and comments. They have been very helpful in improving the content and presentation of the article.

APPENDIX: DETAILS OF THE DERIVATION OF THEORETICAL RESULTS IN SECTION 1

A.1 An Expression for γ_X

To derive (2), we need expressions for m_X(4) and m_X(2) in terms of the moments of the distributions F and G. Note that, since μ_F = 0, we have E(X) = αμ_G. Moreover,

E(X²) = (1 − α)m_F(2) + α(m_G(2) + μ_G²) = m_F(2)(1 − α + αv² + αr²)

and

m_X(2) = m_F(2)(1 + α(v² − 1) + α(1 − α)r²).

For the fourth moment,

m_X(4) = (1 − α)∫(x − αμ_G)⁴ dF(x) + α∫(x − αμ_G)⁴ dG(x),

where

∫(x − αμ_G)⁴ dF(x) = m_F(4) − 4αμ_G m_F(3) + 6α²μ_G² m_F(2) + α⁴μ_G⁴

and

∫(x − αμ_G)⁴ dG(x) = m_G(4) + 4(1 − α)μ_G m_G(3) + 6(1 − α)²μ_G² m_G(2) + (1 − α)⁴μ_G⁴.

Combining these results and rearranging terms, we have

m_X(4)/m_F(2)² = γ_F + α(1 − α)[4r(a_G v³ − a_F) + (γ_G v⁴ − γ_F)/(1 − α) + 6r²(α + (1 − α)v²) + r⁴(α³ + (1 − α)³)].

The desired result follows from γ_X = m_X(4)/m_X(2)² and these expressions.
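A quick Monte Carlo check of the last expression is possible (a sketch under the notational assumptions made here: a_F, a_G denote the skewness coefficients, γ_F, γ_G the kurtosis coefficients, v² = m_G(2)/m_F(2), and r² = μ_G²/m_F(2)):

```python
import numpy as np

# Sketch: simulate X ~ (1 - alpha) F + alpha G with F = N(0, 1) and
# G = N(mu_G, sigma_G^2), so a_F = a_G = 0 (the skewness term vanishes),
# gamma_F = gamma_G = 3, r = mu_G, and v = sigma_G (since m_F(2) = 1).
rng = np.random.default_rng(4)
alpha, mu_G, sigma_G = 0.2, 3.0, 2.0
n = 500_000
from_G = rng.random(n) < alpha
x = np.where(from_G, rng.normal(mu_G, sigma_G, n), rng.normal(0.0, 1.0, n))

empirical = np.mean((x - x.mean())**4)        # m_X(4); equals the ratio since m_F(2) = 1
r, v, gF, gG = mu_G, sigma_G, 3.0, 3.0
formula = gF + alpha * (1 - alpha) * (
    (gG * v**4 - gF) / (1 - alpha)
    + 6 * r**2 * (alpha + (1 - alpha) * v**2)
    + r**4 * (alpha**3 + (1 - alpha)**3)
)
print(empirical, formula)                     # both close to 48.1 for these values
```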

    A.2 Parameters in the Distribution of Y

The mean of a random variable X following a distribution of the form (1 − α)N(0, I) + αN(δe₁, λI) is μ_X = αδe₁, and its covariance matrix is

Σ_X = (1 − α)I + α(λI + δ²e₁e₁′) − α²δ²e₁e₁′
    = (1 − α + αλ)I + α(1 − α)δ²e₁e₁′
    = ω₁[I + (α(1 − α)δ²/ω₁)e₁e₁′],

where ω₁ = 1 − α + αλ. The inverse of Σ_X will also be a rank-one modification of the identity. It is easy to check that

S ≡ Σ_X⁻¹ = (1/ω₁)(I − κ²e₁e₁′),  with κ² = α(1 − α)δ²/(ω₁ + α(1 − α)δ²).  (A.1)

Note that S is diagonal with all entries equal to 1/ω₁ except for the first one. Its square root, S^{1/2}, is also a diagonal matrix with all entries equal to 1/√ω₁ except for the first one, which equals √((1 − κ²)/ω₁). In particular,

S^{1/2}e₁ = √((1 − κ²)/ω₁) e₁ = (1/√(ω₁ + α(1 − α)δ²)) e₁.  (A.2)

The distribution of Y = S^{1/2}(X − μ_X) follows from these results.
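The covariance algebra above can be verified numerically; the following sketch (in the notation introduced here) checks (A.1) and the first diagonal entry of S^{1/2} in (A.2):

```python
import numpy as np

# Sketch: for (1 - alpha) N(0, I) + alpha N(delta e1, lambda I), the covariance
# is omega*I + alpha*(1 - alpha)*delta^2 * e1 e1' with omega = 1 - alpha + alpha*lambda,
# and its inverse is the rank-one correction in (A.1).
p, alpha, delta, lam = 4, 0.2, 10.0, 0.1
e1 = np.eye(p)[:, [0]]
omega = 1 - alpha + alpha * lam
b = alpha * (1 - alpha) * delta**2
Sigma = omega * np.eye(p) + b * (e1 @ e1.T)
kappa2 = b / (omega + b)
S = (np.eye(p) - kappa2 * (e1 @ e1.T)) / omega
print(np.allclose(S, np.linalg.inv(Sigma)))                 # True: (A.1) holds
print(np.sqrt(S[0, 0]), 1 / np.sqrt(omega + b))             # first entry of S^{1/2}, as in (A.2)
```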

A.3 An Expression for γ_Z

The kurtosis coefficient of Z will be equal to its fourth moment, E(Z⁴) = (1 − α)E(Z₁⁴) + αE(Z₂⁴), where Z₁ is N(m₁′u, u′Su), Z₂ is N(m₂′u, λu′Su), and

E(Zᵢ⁴) = E((Zᵢ − z̄ᵢ)⁴) + 6E((Zᵢ − z̄ᵢ)²)z̄ᵢ² + z̄ᵢ⁴ = 3σᵢ⁴ + 6σᵢ²z̄ᵢ² + z̄ᵢ⁴,  (A.3)

where z̄ᵢ and σᵢ denote the mean and standard deviation of Zᵢ. Letting ν = e₁′u, from (A.2) and (3) it follows that

z̄₁ = m₁′u = −αδν/√(ω₁ + α(1 − α)δ²),
z̄₂ = m₂′u = (1 − α)δν/√(ω₁ + α(1 − α)δ²),

and from (A.1)

σ₁² = σ₂²/λ = u′Su = (1 − κ²ν²)/ω₁.

Replacing these values in (A.3), we have

γ_Z = 3(1 − κ²ν²)²(1 − α + αλ²)/ω₁²
      + 6[(1 − κ²ν²)/ω₁][α(1 − α)δ²ν²(α + (1 − α)λ)/(ω₁ + α(1 − α)δ²)]
      + α(1 − α)δ⁴ν⁴(α³ + (1 − α)³)/(ω₁ + α(1 − α)δ²)².



Grouping all the terms that correspond to the same powers of ν and simplifying the resulting coefficients, the result in (4) is obtained.

    [Received March 1999. Revised June 2000.]

    REFERENCES

Agulló, J. (1996), "Exact Iterative Computation of the Multivariate Minimum Volume Ellipsoid Estimator With a Branch and Bound Algorithm," in Proceedings in Computational Statistics, ed. A. Prat, Heidelberg: Physica-Verlag, pp. 175–180.

Arnold, H. J. (1964), "Permutation Support for Multivariate Techniques," Biometrika, 51, 65–70.

Atkinson, A. C. (1994), "Fast Very Robust Methods for the Detection of Multiple Outliers," Journal of the American Statistical Association, 89, 1329–1339.

Balanda, K. P., and MacGillivray, H. L. (1988), "Kurtosis: A Critical Review," The American Statistician, 42, 111–119.

Becker, C., and Gather, U. (1999), "The Masking Breakdown Point of Multivariate Outlier Identification Rules," Journal of the American Statistical Association, 94, 947–955.

Box, G. E. P., and Tiao, G. C. (1968), "A Bayesian Approach to Some Outlier Problems," Biometrika, 55, 119–129.

Campbell, N. A. (1980), "Robust Procedures in Multivariate Analysis I: Robust Covariance Estimation," Applied Statistics, 29, 231–237.

(1989), "Bushfire Mapping Using NOAA AVHRR Data," technical report, CSIRO, North Ryde, Australia.

Cook, R. D., Hawkins, D. M., and Weisberg, S. (1993), "Exact Iterative Computation of the Robust Multivariate Minimum Volume Ellipsoid Estimator," Statistics and Probability Letters, 16, 213–218.

Davies, P. L. (1987), "Asymptotic Behavior of S-Estimators of Multivariate Location Parameters and Dispersion Matrices," The Annals of Statistics, 15, 1269–1292.

Donoho, D. L. (1982), "Breakdown Properties of Multivariate Location Estimators," unpublished Ph.D. qualifying paper, Harvard University, Dept. of Statistics.

Gnanadesikan, R., and Kettenring, J. R. (1972), "Robust Estimates, Residuals, and Outliers Detection with Multiresponse Data," Biometrics, 28, 81–124.

Hadi, A. S. (1992), "Identifying Multiple Outliers in Multivariate Data," Journal of the Royal Statistical Society, Ser. B, 54, 761–771.

(1994), "A Modification of a Method for the Detection of Outliers in Multivariate Samples," Journal of the Royal Statistical Society, Ser. B, 56, 393–396.

Hampel, F. R. (1985), "The Breakdown Point of the Mean Combined With Some Rejection Rules," Technometrics, 27, 95–107.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. (1986), Robust Statistics: The Approach Based on Influence Functions, New York: Wiley.

Hawkins, D. M. (1994), "The Feasible Solution Algorithm for the Minimum Covariance Determinant Estimator in Multivariate Data," Computational Statistics and Data Analysis, 17, 197–210.

Hawkins, D. M., Bradu, D., and Kass, G. V. (1984), "Location of Several Outliers in Multiple Regression Data Using Elemental Sets," Technometrics, 26, 197–208.

Hawkins, D. M., and Olive, D. J. (1999), "Improved Feasible Solution Algorithms for High Breakdown Estimation," Computational Statistics and Data Analysis, 30, 1–11.

Jones, M. C., and Sibson, R. (1987), "What Is Projection Pursuit?" Journal of the Royal Statistical Society, Ser. A, 150, 29–30.

Juan, J., and Prieto, F. J. (1997), "Identification of Point-Mass Contaminations in Multivariate Samples," Working Paper 97-13, Statistics and Econometrics Series, Universidad Carlos III de Madrid.

Malkovich, J. F., and Afifi, A. A. (1973), "On Tests for Multivariate Normality," Journal of the American Statistical Association, 68, 176–179.

Mardia, K. V. (1970), "Measures of Multivariate Skewness and Kurtosis with Applications," Biometrika, 57, 519–530.

Maronna, R. A. (1976), "Robust M-Estimators of Multivariate Location and Scatter," The Annals of Statistics, 4, 51–67.

Maronna, R. A., and Yohai, V. J. (1995), "The Behavior of the Stahel–Donoho Robust Multivariate Estimator," Journal of the American Statistical Association, 90, 330–341.

Posse, C. (1995), "Tools for Two-Dimensional Exploratory Projection Pursuit," Journal of Computational and Graphical Statistics, 4, 83–100.

Rocke, D. M., and Woodruff, D. L. (1993), "Computation of Robust Estimates of Multivariate Location and Shape," Statistica Neerlandica, 47, 27–42.

(1996), "Identification of Outliers in Multivariate Data," Journal of the American Statistical Association, 91, 1047–1061.

Rousseeuw, P. J. (1985), "Multivariate Estimators With High Breakdown Point," in Mathematical Statistics and its Applications (Vol. B), eds. W. Grossmann, G. Pflug, I. Vincze, and W. Wertz, Dordrecht: Reidel, pp. 283–297.

(1993), "A Resampling Design for Computing High-Breakdown Point Regression," Statistics and Probability Letters, 18, 125–128.

Rousseeuw, P. J., and Leroy, A. M. (1987), Robust Regression and Outlier Detection, New York: Wiley.

Rousseeuw, P. J., and Van Driessen, K. (1999), "A Fast Algorithm for the Minimum Covariance Determinant Estimator," Technometrics, 41, 212–223.

Rousseeuw, P. J., and van Zomeren, B. C. (1990), "Unmasking Multivariate Outliers and Leverage Points," Journal of the American Statistical Association, 85, 633–639.

Ruppert, D. (1987), "What Is Kurtosis?" The American Statistician, 41, 1–5.

Stahel, W. A. (1981), "Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen," unpublished Ph.D. thesis, Eidgenössische Technische Hochschule, Zurich.

Tyler, D. E. (1991), "Some Issues in the Robust Estimation of Multivariate Location and Scatter," in Directions in Robust Statistics and Diagnostics, Part II, eds. W. Stahel and S. Weisberg, New York: Springer-Verlag, pp. 327–336.

Wilks, S. S. (1963), "Multivariate Statistical Outliers," Sankhya, Ser. A, 25, 407–426.

Discussion

David M. Rocke
Department of Applied Science, University of California, Davis, CA 95616
([email protected])

David L. Woodruff
Graduate School of Management, University of California, Davis, CA 95616
([email protected])

Peña and Prieto present a new method for robust multivariate estimation of location and shape and identification of multivariate outliers. These problems are intimately connected in that identifying the outliers correctly automatically allows excellent robust estimation results and vice versa.

Some types of outliers are easy to find and some are difficult. In general, the previous literature concludes that problems are more difficult when the fraction of outliers is large. Moreover, widely scattered outliers are relatively easy, whereas concentrated outliers can be very difficult (Rocke and Woodruff 1996).





This article is aimed squarely at the case in which the outliers form a single cluster, separated from the main data and of the same shape (but possibly different size) as the main data, an especially difficult case for many outlier-detection methods (Rocke and Woodruff 1996). We believe that it is entirely appropriate that special methods like those in the present article be developed to handle this case, which is often too difficult for general-purpose outlier-detection and robust-estimation methods.

In this discussion, we make some comparisons of the estimators kurtosis1, FAST-MCD, SD, and also certain M and S estimators; then we point out the connection to cluster analysis (Rocke and Woodruff 2001). The MCD, MVE, SD, and S estimators with hard redescending influence functions are known to have maximum breakdown. Kurtosis1 probably also has maximum breakdown, although this article has no formal proof. M estimators are sometimes thought to have breakdown 1/(p + 1), but this is actually incorrect. Work by Maronna (1976), Donoho (1982), and Stahel (1981) showed that whenever the amount of contamination exceeds 1/(p + 1), a root of the estimating equations exists that can be carried over all bounds. In fact, a root may also exist that remains bounded. S estimators are a subclass of M estimators, as shown by Lopuhaä (1989), and have maximal breakdown when hard redescending functions are used and the parameters are correctly chosen. This provides an example of a class of M estimators that have maximal breakdown.

M estimators can be highly statistically efficient and are easy to compute by iteratively reweighted least squares but need a high-breakdown initial estimator to avoid converging on the bad root (this also applies to S estimators, which are computed by solving the related constrained M estimation problem). MULTOUT (Rocke and Woodruff 1996) combines a robust initial estimator with an M estimator to yield a high-breakdown, statistically efficient methodology. Note also that identifying outliers by distance and reweighting points with weight 0 if the point is declared an outlier and weight 1 otherwise is a type of M estimator with a weight function that is constant until the outlier rejection cutoff point. This is used by many methods (e.g., FAST-MCD, kurtosis1) as a final step, which improves statistical efficiency.

Thus, the key step in outlier identification and robust estimation is to use an initial procedure that gives a sufficiently good starting place. The gap can be quite large between the theoretical breakdown of an estimator and its practical breakdown, the amount and type of contamination such that success is unlikely. Consider, for example, the case in Table 6 in which n = 100, p = 20, α = .3, λ = 1, and δ = 100. The amount of contamination is well below the breakdown point (40%), and the contamination is at a great distance from the main data, but none of the methods used in this study are very successful.

The study reported in Table 6 has some difficulties:

1. The number of data points is held fixed at 100 as the dimension rises, which leads to a very sparse data problem in dimension 20. Many of these methods could perhaps do much better with an adequate amount of data. This is particularly true of FAST-MCD and MULTOUT, which are designed to handle large amounts of data.

2. When n = 100 and p = 20, the maximum breakdown of any equivariant estimator is .4. Of course, this does not mean that the estimator must break down whatever the configuration of the outliers, but it does show that at α = .4 and p = 20, and for every dataset generated, there is an outlier configuration such that no equivariant estimator will work.

3. The case λ = 5 is one in which the outliers are actually scattered rather than concentrated. This is especially true when δ = 10. With standard normal data in 20 dimensions, almost all observations lie at a distance less than the .001 point of a χ₂₀ distribution, which is 6.73. The outliers are placed at a distance of 10, and the .999 sphere for these has a radius of (5)(6.73) = 33.65 (see the short computation after this list). Thus, the outliers actually surround the good data, rather than lying on one side. This explains why FAST-MCD gets all of the λ = 5 cases regardless of other parameter settings.

4. When n = 100 and p = 20, the preceding computation shows that standard normal data will almost all lie in a sphere of radius 6.73 around the center. If the outliers are displaced by 10 with the same covariance (λ = 1), these spheres overlap considerably, and in some instances no method can identify all outliers because some of them lie within the good data. Shrunken outliers (λ = .1) do not have this problem since the radius of the outliers is only .67, so a displacement of 10 prevents overlap. Expanded outliers (λ = 5) are relatively easy because the outliers surround the main data, and the chance of one falling in the relatively small sphere of the good data is small.
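The radius computations used in items 3 and 4 can be reproduced directly (a short sketch; the χ₂₀ radius is obtained from scipy's chi-squared quantile function):

```python
import numpy as np
from scipy.stats import chi2

# Sketch of the radius computations above: the .999 radius of standard normal
# data in p = 20 dimensions, and the corresponding radii of outlier clusters
# whose standard deviations are scaled by lambda.
p = 20
r999 = np.sqrt(chi2.ppf(0.999, df=p))
print(r999)          # about 6.73
print(5 * r999)      # expanded outliers (lambda = 5): about 33.65
print(0.1 * r999)    # shrunken outliers (lambda = .1): about 0.67
```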

The results are better for two and four clusters (Table 7) than for one and would be even better for individually scattered outliers. The important insight is that many methods work well except in one hard case: when the outliers lie in one or two clusters. Methods such as FAST-MCD and MULTOUT must be supplemented by methods specifically designed to find clustered or concentrated outliers. These clustered-outlier methods are not replacements for the other methods because they may perform poorly when there are significant nonclustered outliers.

The remaining question concerns appropriate methods to supplement the more general estimation techniques and allow detection of clustered/concentrated outliers. The methods of Peña and Prieto are aimed at that task, by trying to find the direction in which the clustered outliers lie. The performance of these methods is documented in the article under discussion. Another attempt in this direction was given by Juan and Prieto (2001), who tried to find outliers by looking at the angles subtended by clusters of points projected on an ellipsoid. They showed reasonable performance of this method but provided computations only for very concentrated outliers (λ = .01).

Rocke and Woodruff (2001) presented another approach. If the most problematic case for methods like FAST-MCD and MULTOUT is when the outliers form clusters, why not apply methods of cluster analysis to identify them? There are many important aspects to this problem that cannot be treated in the limited space here, but we can look at a simple version of our procedure.



We search for clusters using a model-based clustering framework and heuristic search optimization methods, then apply an M estimator to the largest identified cluster (for details, see Coleman and Woodruff 2000; Coleman et al. 1999; Reiners 1998; Rocke and Woodruff 2001).

To illustrate our point, we ran a slightly altered version of the simulation study in Table 6 from Peña and Prieto. First, we placed the outliers on a diagonal rather than on a coordinate axis. If u = (1, 1, … , 1), then the mean of the outliers was taken to be δp^{−1/2}u, which lies at a distance of δ from 0. After generating the data matrix X otherwise in the same way as did Peña and Prieto, we centered the data to form a new matrix X and then generated sphered data in the following way: Let X = UDV^T, the singular value decomposition (SVD), in which D is the diagonal matrix of singular values. Then the matrix X̃ = XVD^{−1}V^T = UV^T is an affine transformed version of the data with observed mean vector 0 and observed covariance matrix I. The purpose of these manipulations is to ensure that methods that may use nonaffine-equivariant methods do not have an undue advantage. Pushing the outliers on the diagonal avoids giving an advantage to componentwise methods. The SVD method of sphering, unlike the usual Cholesky factor approach, preserves the direction of displacement of the outliers.
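A short sketch of this sphering step, as we read it from the description above (variable names are ours), is:

```python
import numpy as np

# Sketch: center the data, take the SVD X = U D V', and form X_tilde = X V D^{-1} V',
# which equals U V'. Its columns are orthonormal, so the sample covariance is
# proportional to the identity, while the direction in which the outliers were
# displaced is preserved.
rng = np.random.default_rng(5)
X = rng.normal(size=(100, 5)) @ np.diag([3, 2, 1, 1, 1])   # some non-spherical data
X = X - X.mean(axis=0)                                     # centered matrix
U, d, Vt = np.linalg.svd(X, full_matrices=False)
X_tilde = X @ Vt.T @ np.diag(1.0 / d) @ Vt
print(np.allclose(X_tilde, U @ Vt))                        # True
print(np.allclose(X_tilde.T @ X_tilde, np.eye(5)))         # orthonormal columns
```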

We ran all the cases from the simulation of Peña and Prieto. Time constraints prevented us from running the 100 repetitions, but out of the five repetitions we did run, we succeeded in 100% of the cases in identifying all of the outliers, except for the cases in which p = 20 and λ = 1; these required more data for our methods to work. We could identify all of the outliers in 100% of the cases when n = 500, for example, instead of n = 100. The successful trials included all of the other cases in which no other method reported in Table 6 was very successful.

This insight transforms part of the outlier-identification problem into a cluster-analysis problem. However, the latter is not necessarily easy (Rocke and Woodruff 2001). For example, we tried the same set of simulation trials using the standard clustering methods available in S-PLUS. In this case, we ran 10 trials of each method. The methods used were two hierarchical agglomeration methods, mclust (Banfield and Raftery 1993) and agnes (Kaufman and Rousseeuw 1990; Struyf, Hubert, and Rousseeuw 1997); diana, a divisive hierarchical clustering method; fanny, a fuzzy clustering method; pam, clustering around medoids (Kaufman and Rousseeuw 1990; Struyf et al. 1997); and k-means (Hartigan 1975). First, it should be noted that all of these methods succeed almost all of the time for separated clusters (λ = .1 or λ = 1) if the data are not sphered. All of these methods except mclust use Euclidean distance and make no attempt at affine equivariance. Mclust uses an affine equivariant objective function based on a mixture model, but the initialization steps that are crucial to its performance are not equivariant.

Over the 72 cases considered (with sphered data), none of these methods achieved as much as 50% success. The overall success rates were agnes 36%, diana 48%, fanny 41%, k-means 10%, mclust 37%, and pam 12% (compared to MCD 75%, SD 77%, kurtosis1 83%, and our clustering method 89%). For shrunken outliers (λ = .1), the most successful were fanny 81% and mclust 37%, and the least successful were agnes 1% and k-means 2% (compared to MCD 41%, SD 70%, kurtosis1 88%, and our clustering method 100%). For expanded outliers (λ = 5), the most successful were agnes 97%, diana 96%, mclust 73%, and fanny 39%, and the least successful was pam 0% (compared to MCD 100%, SD 88%, kurtosis1 100%, and our clustering method 100%). This case is quite easy for robust multivariate estimation methods; FAST-MCD and MULTOUT each get 100% of these. Again, the most difficult case is shift outliers (Rocke and Woodruff 1996; Hawkins 1980, p. 104), in which λ = 1. The best performance among these clustering methods is diana 34%, with the next best being k-means at 16% (compared to MCD 82%, SD 73%, kurtosis1 62%, and our clustering method 67%).

The poor performance of these clustering methods is probably due to two factors. First, use of the Euclidean metric is devastating when the shape of the clusters is very nonspherical. Since this can certainly occur in practice, such methods should at least not be the only method of attack. Second, some of these problems require more extensive computation than these methods allow, at least in the implementation in S-PLUS. Performing many descents of the algorithm from a wide variety of starting points, including random ones, can obtain better solutions than a single descent from a single plausible starting point. This is particularly true when the data are extremely sparse, as they are in these simulations. Many of these methods could, of course, be modified to use multiple starting points and might show enormously improved performance.

We confirmed this by using two additional model-based clustering methods that permit control of the amount of computation. The first of these was EMMIX (McLachlan and Basford 1988; McLachlan and Peel 2000). The second was EMSamp (Rocke and Dai 2001). For both programs, the performance was similar to that of our clustering method; success was usual if enough random starting points were used, except for shift outliers in dimension 20 in the present n = 100 case.

Since different methods are better at different types of outliers, and since once the outliers have been identified by any method they can be said to stay identified, an excellent strategy when using real data instead of simulated data (where the structure is known) is to use more than one method. Among the methods tested by Peña and Prieto, the best combination is FAST-MCD plus kurtosis1. Together these can improve the overall rate of identification from 75% and 83%, respectively, to 90%, since FAST-MCD is better for shift outliers and kurtosis1 is better for shrunken outliers. An even better combination is our clustering method plus FAST-MCD, which gets an estimated 95% of the cases.

We can summarize the most important points we have tried to make here as follows:

1. General-purpose robust-estimation and outlier-detection methods work well in the presence of scattered outliers or multiple clusters of outliers.

2. To deal with one or two clusters of outliers in difficult cases, methods specifically meant for this purpose, such as those of Peña and Prieto and Juan and Prieto (2001), are needed to supplement the more general methods.



3. An alternative approach for this case is to use clustering methods to supplement the general-purpose robust methods.

4. Choice of clustering method and computational implementation are important determinants of success. We have found that model-based clustering together with heuristic search technology can provide high-quality methods (Coleman and Woodruff 2000; Coleman et al. 1999; Reiners 1998; Rocke and Dai 2001; Rocke and Woodruff 2001).

5. The greatest chance of success comes from use of multiple methods, at least one of which is a general-purpose method such as FAST-MCD or MULTOUT, and at least one of which is meant for clustered outliers, such as kurtosis1, the angle method of Juan and Prieto (2001), or our clustering method (Rocke and Woodruff 2001).

    REFERENCES

Banfield, J. D., and Raftery, A. E. (1993), "Model-Based Gaussian and Non-Gaussian Clustering," Biometrics, 49, 803–821.

Coleman, D. A., Dong, X., Hardin, J., Rocke, D. M., and Woodruff, D. L. (1999), "Some Computational Issues in Cluster Analysis With no a Priori Metric," Computational Statistics and Data Analysis, 31, 1–11.

Coleman, D. A., and Woodruff, D. L. (2000), "Cluster Analysis for Large Datasets: Efficient Algorithms for Maximizing the Mixture Likelihood," Journal of Computational and Graphical Statistics, 9, 672–688.

Hartigan, J. A. (1975), Clustering Algorithms, New York: Wiley.

Hawkins, D. M. (1980), The Identification of Outliers, London: Chapman and Hall.

Juan, J., and Prieto, F. J. (2001), "Using Angles to Identify Concentrated Multivariate Outliers," Technometrics, 43, 311–322.

Kaufman, L., and Rousseeuw, P. J. (1990), Finding Groups in Data, New York: Wiley.

Lopuhaä, H. P. (1989), "On the Relation Between S-Estimators and M-Estimators of Multivariate Location and Covariance," The Annals of Statistics, 17, 1662–1683.

McLachlan, G. J., and Basford, K. E. (1988), Mixture Models: Inference and Application to Clustering, New York: Marcel Dekker.

McLachlan, G. J., and Peel, D. (2000), Finite Mixture Models, New York: Wiley.

Reiners, T. (1998), "Maximum Likelihood Clustering of Data Sets Using a Multilevel, Parallel Heuristic," unpublished Master's thesis, Technische Universität Braunschweig, Germany, Dept. of Economics, Business Computer Science and Information Management.

Rocke, D. M., and Dai, J. (2001), "Sampling and Subsampling for Cluster Analysis in Data Mining With Application to Sky Survey Data," unpublished manuscript.

Rocke, D. M., and Woodruff, D. L. (2001), "A Synthesis of Outlier Detection and Cluster Identification," unpublished manuscript.

Struyf, A., Hubert, M., and Rousseeuw, P. J. (1997), "Integrating Robust Clustering Techniques in S-PLUS," Computational Statistics and Data Analysis, 26, 17–37.

Discussion

Mia Hubert
Department of Mathematics and Computer Science, University of Antwerp, Antwerp, Belgium
([email protected])

Peña and Prieto propose a new algorithm to detect multivariate outliers. As a byproduct, the population scatter matrix is estimated by the classical empirical covariance matrix of the remaining data points.

The interest in outlier-detection procedures has grown fast since data mining became a standard analysis tool in both industry and research. Once information (data) is gathered, marketing people or researchers are not only interested in the behavior of the regular clients or measurements (the good data points), but they also want to learn about the anomalous observations (the outliers).

As pointed out by the authors, many different procedures have already been proposed over the last decades. However, none of them has a superior performance for all kinds of contamination patterns. The high-breakdown MCD covariance estimator of Rousseeuw (1984) is probably the most well known and most respected procedure. There are several reasons for this. First, the MCD has good statistical properties since it is affine equivariant and asymptotically normally distributed. It is also a highly robust estimator, achieving a breakdown value of 50% and a bounded influence function at any elliptical distribution (Croux and Haesbroeck 1999). Another advantage is the availability of a fast and efficient algorithm, called FAST-MCD (Rousseeuw and Van Driessen 1999), which is currently incorporated in S-PLUS and SAS. Therefore, the MCD can be used to robustify many multivariate techniques such as discriminant analysis (Hawkins and McLachlan 1997), principal-component analysis (PCA) (Croux and Haesbroeck 2000), and factor analysis (Pison, Rousseeuw, Filzmoser, and Croux 2000).

Whereas the primary goal of the MCD is to robustly estimate the multivariate location and scatter matrix, it can also be used to detect outliers by looking at the (squared) robust distances RD²(xᵢ) = (xᵢ − T_MCD)ᵗ S_MCD⁻¹ (xᵢ − T_MCD). Here, T_MCD and S_MCD stand for the MCD location and scatter estimates. One can compare these robust distances with the quantiles of the χ²_p distribution. Since this rejection rule often leads to an inflated Type II error, Hardin and Rocke (2000) developed more precise cutoff values to improve the MCD outlier-detection method.
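As a sketch of this rule (not the MCD algorithm itself; T_MCD and S_MCD are assumed to come from any MCD implementation, and the .975 level below is just one common choice):

```python
import numpy as np
from scipy.stats import chi2

# Sketch: squared robust distances based on given MCD location and scatter
# estimates, compared with a chi-squared quantile to flag outliers.
def robust_distances_sq(X, T_mcd, S_mcd):
    centered = X - T_mcd
    return np.einsum("ij,jk,ik->i", centered, np.linalg.inv(S_mcd), centered)

def mcd_outlier_flags(X, T_mcd, S_mcd, level=0.975):
    p = X.shape[1]
    return robust_distances_sq(X, T_mcd, S_mcd) > chi2.ppf(level, df=p)
```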

If one is mainly interested in finding the outliers, it is less important to estimate the shape of the good points with great accuracy. As the authors explain, it seems natural to try to find (for each data point) the direction along which the relative distance to the uncontaminated observations is largest.




This was the idea behind the Stahel–Donoho estimator. Since the center of the good data is unknown, this estimator projects the dataset on all possible directions and then considers a robust univariate distance [Formula (10)] for each data point. The maximal distance of the point over all directions is then defined as its outlyingness.
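A minimal sketch of this outlyingness measure, using random projection directions and the median/MAD as the robust univariate location and scale (one common choice; formula (10) itself is not reproduced here):

```python
import numpy as np

# Sketch of Stahel-Donoho outlyingness with random directions: for each point,
# the maximum over directions of its robustly standardized projected distance.
def sd_outlyingness(X, n_dirs=500, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    dirs = rng.normal(size=(n_dirs, p))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    out = np.zeros(n)
    for u in dirs:
        z = X @ u
        med = np.median(z)
        mad = np.median(np.abs(z - med)) + 1e-12      # guard against zero scale
        out = np.maximum(out, np.abs(z - med) / mad)  # max over directions
    return out
```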

This projection pursuit (PP) strategy has the advantage of being applicable even when p > n. This situation often occurs in chemometrics, genetics, and engineering. It is therefore not surprising that many multivariate methods in high dimensions have been robustified using the PP principle (e.g., see Li and Chen 1985; Croux and Ruiz-Gazen 1996, 2000; Hubert, Rousseeuw, and Verboven 2000). The computation time of these estimators would be huge if all possible directions were to be scanned. To reduce the search space, Stahel (1981) proposed taking the directions orthogonal to random p subsets. In PCA, where only orthogonal equivariance is required, Croux and Ruiz-Gazen (1996) and Hubert et al. (2000) looked at the n directions from the spatial median of the data to each data point. However, it is clear that the precision of these algorithms would improve if one were able to find the most interesting directions. This is exactly what the authors try to do in this article, which I have therefore read with great interest.

However, several key aspects of their proposal can be criticized.

1. The standardization in Step 1. First a standardization is performed [Formula (7)] using the classical mean and covariance matrix. It is well known that these estimators are extremely sensitive to outliers, which often leads to masking (large Type II errors, labeling outliers as good data points) and swamping (large Type I errors, flagging good points as outliers). Rousseeuw and Leroy (1987, pp. 271–273) gave an example of how classical prestandardization can destroy the robustness of the final results.

    2. Maximizing a