Consistent Procedures for Cluster Tree Estimation and Pruning


    Kamalika Chaudhuri, Sanjoy Dasgupta, Samory Kpotufe, and Ulrike von Luxburg

Abstract: For a density f on R^d, a high-density cluster is any connected component of {x : f(x) ≥ λ}, for some λ > 0. The set of all high-density clusters forms a hierarchy called the cluster tree of f. We present two procedures for estimating the cluster tree given samples from f. The first is a robust variant of the single linkage algorithm for hierarchical clustering. The second is based on the k-nearest neighbor graph of the samples. We give finite-sample convergence rates for these algorithms, which also imply consistency, and we derive lower bounds on the sample complexity of cluster tree estimation. Finally, we study a tree pruning procedure that guarantees, under milder conditions than usual, to remove clusters that are spurious while recovering those that are salient.

Index Terms: Clustering algorithms, convergence.

    I. INTRODUCTION

WE CONSIDER the problem of hierarchical clustering in a density-based setting, where a cluster is formalized as a region of high density. Given data drawn i.i.d. from some unknown distribution with density f in R^d, the goal is to estimate the hierarchical cluster structure of the density, where a cluster is defined as a connected subset of an f-level set {x ∈ X : f(x) ≥ λ}. These subsets form an infinite tree structure as λ ≥ 0 varies, in the sense that each cluster at some level λ is contained in a cluster at a lower level λ' < λ. This infinite tree is called the cluster tree of f and is illustrated in Figure 1.
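As a purely illustrative aside (not part of the paper), the following Python sketch makes the definition concrete for a hypothetical bimodal density on R evaluated on a grid: for each level λ, the clusters are the connected components of the level set {x : f(x) ≥ λ}. The function names and the particular density are assumptions chosen only for this example.

import numpy as np

def density(x):
    # A hypothetical mixture of two Gaussian-shaped bumps; any bimodal f would serve.
    return 0.5 * np.exp(-(x - 1.0) ** 2) + 0.5 * np.exp(-(x + 1.0) ** 2)

def level_set_components(grid, f_vals, lam):
    """Connected components of {x : f(x) >= lam}, returned as intervals (maximal runs of grid points)."""
    above = f_vals >= lam
    components, start = [], None
    for i, flag in enumerate(above):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            components.append((grid[start], grid[i - 1]))
            start = None
    if start is not None:
        components.append((grid[start], grid[-1]))
    return components

grid = np.linspace(-4.0, 4.0, 2001)
f_vals = density(grid)
for lam in [0.2, 0.45]:
    print(lam, level_set_components(grid, f_vals, lam))
# At lam = 0.2 the level set is a single interval (one cluster); at lam = 0.45,
# above the saddle between the two bumps, it splits into two disjoint intervals,
# that is, two clusters: two branches of the cluster tree.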

Our formalism of the cluster tree (Section II-B) and our notion of consistency follow early work on clustering, in particular that of Hartigan [1]. Much subsequent work has been devoted to estimating the connected components of a single level set; see, for example, [2]-[7].

Manuscript received November 4, 2013; revised July 16, 2014; accepted August 2, 2014. Date of publication October 3, 2014; date of current version November 18, 2014. K. Chaudhuri and S. Dasgupta were supported by the National Science Foundation under Grant IIS-1162581. U. von Luxburg was supported by the German Research Foundation under Grant LU1718/1-1 and Research Unit 1735 through the Project entitled "Structural Inference in Statistics: Adaptation and Efficiency."

K. Chaudhuri and S. Dasgupta are with the University of California at San Diego, La Jolla, CA 92093 USA (e-mail: kamalika@cs.ucsd.edu; dasgupta@cs.ucsd.edu).

S. Kpotufe was with the Toyota Technological Institute at Chicago, Chicago, IL 60637 USA. He is now with Princeton University, Princeton, NJ 08544 USA (e-mail: samory@princeton.edu).

U. von Luxburg is with the University of Hamburg, Hamburg, Germany (e-mail: luxburg@informatik.uni-hamburg.de).

Communicated by N. Cesa-Bianchi, Associate Editor for Pattern Recognition, Statistical Learning and Inference.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TIT.2014.2361055

Fig. 1. Top: A probability density f on R, and two clusters at a fixed level λ. Bottom: The same density, with the overall branching structure of the corresponding cluster tree; within each branch, the corresponding cluster shrinks gradually as one moves upward.

In contrast to these results, the present work is concerned with the simultaneous estimation of all level sets of an unknown density: recovering the cluster tree as a whole.

Are there hierarchical clustering algorithms which converge to the cluster tree? Previous theory work, [1] and [8], has provided partial consistency results for the well-known single-linkage clustering algorithm, while other work [9] has suggested ways to overcome the deficiencies of this algorithm by making it more robust, but without proofs of convergence. In this paper, we propose a novel way to make single-linkage more robust, while retaining most of its elegance and simplicity (see Figure 3). We show that this algorithm implicitly creates a hierarchy of geometric graphs, and we relate connected components of these graphs to clusters of the underlying density. We establish finite-sample rates of convergence for the clustering procedure (Theorem 3.3); the centerpiece of our argument is a result on continuum percolation (Theorem 4.7). This also implies consistency in the sense of Hartigan.
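The precise algorithm appears in the paper's Figure 3; the Python sketch below conveys only the flavor of such a robust single-linkage hierarchy. The specific vertex/edge rule, the parameter names k and alpha, and the chosen values are assumptions made for this illustration, not a statement of the paper's exact construction.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def robust_single_linkage_sketch(X, k=5, alpha=1.4, levels=50):
    """Illustrative sketch of a robust single-linkage-style hierarchy.

    r_k(x) is the distance from x to its k-th nearest neighbor. At scale r,
    only points with r_k <= r are admitted as vertices, and two admitted
    points are joined if they lie within distance alpha * r of each other;
    the connected components at scale r form one level of the hierarchy.
    """
    D = cdist(X, X)                                 # pairwise distances
    r_k = np.sort(D, axis=1)[:, k]                  # k-th nearest-neighbor radii
    hierarchy = []
    for r in np.linspace(r_k.min(), D.max(), levels):
        active = np.where(r_k <= r)[0]              # vertices admitted at scale r
        if len(active) == 0:
            hierarchy.append((r, []))
            continue
        adj = D[np.ix_(active, active)] <= alpha * r
        _, labels = connected_components(csr_matrix(adj), directed=False)
        clusters = [active[labels == c] for c in range(labels.max() + 1)]
        hierarchy.append((r, clusters))
    return hierarchy

# Example: two well-separated blobs appear as two clusters over a range of scales.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(3, 0.3, (50, 2))])
for r, clusters in robust_single_linkage_sketch(X)[::10]:
    print(round(r, 2), [len(c) for c in clusters])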

We then give an alternative procedure based on the k-nearest neighbor graph of the sample (see Figure 4). Such graphs are widely used in machine learning, and interestingly there is still much to understand about their expressiveness.


We show that by successively removing points from this graph, we can create a hierarchical clustering that also converges to the cluster tree, at roughly the same rate as the linkage-based scheme (Theorem 3.4).
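The exact construction is given in the paper's Figure 4; the following is only a rough Python sketch of the underlying idea of sweeping low-density points out of a k-nearest neighbor graph and tracking components. Every detail below (symmetrization of the graph, the use of the k-NN radius as a density surrogate, the sweep order) is an assumption for illustration.

import numpy as np
from scipy.spatial.distance import cdist
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def knn_graph_hierarchy_sketch(X, k=5):
    """Illustration only; the paper's exact procedure is its Figure 4.

    Build a symmetrized k-NN graph over the sample, then sweep a threshold r
    downward over the k-NN radii r_k: at each step, keep only the points with
    r_k < r (the locally denser points) and record the connected components
    of the induced subgraph. The nested components form a hierarchy.
    """
    n = len(X)
    D = cdist(X, X)
    order = np.argsort(D, axis=1)
    r_k = np.sort(D, axis=1)[:, k]
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        adj[i, order[i, 1:k + 1]] = True            # edges to the k nearest neighbors
    adj |= adj.T                                    # symmetrize the graph
    hierarchy = []
    for r in np.sort(np.unique(r_k))[::-1]:         # decreasing k-NN radii
        keep = np.where(r_k < r)[0]                 # drop the sparsest points first
        if len(keep) == 0:
            continue
        sub = adj[np.ix_(keep, keep)]
        _, labels = connected_components(csr_matrix(sub), directed=False)
        hierarchy.append((r, [keep[labels == c] for c in range(labels.max() + 1)]))
    return hierarchy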

Next, we use tools from information theory to give a lower bound on the problem of cluster tree estimation (Theorem VI.1), which matches our upper bounds in its dependence on most of the parameters of interest.

The convergence results for our two hierarchical clustering procedures nevertheless leave open the possibility that the trees they produce contain spurious branching. This is a well-studied problem in the cluster tree literature, and we address it with a pruning method (Figure 9) that preserves the consistency properties of the tree estimators while providing finite-sample guarantees on the removal of false clusters (Theorem VII.5). This procedure is based on simple intuition that can carry over to other cluster tree estimators.

    II. DEFINITIONS AND PREVIOUS WORK

Let X be a subset of R^d. We exclusively consider Euclidean distance on X, denoted ‖·‖. Let B(x, r) be the closed ball of radius r around x.

    A. Clustering

We start by considering the more general context of clustering. While clustering procedures abound in statistics and machine learning, it remains largely unclear whether clusters in finite data (for instance, the clusters returned by a particular procedure) reveal anything meaningful about the underlying distribution from which the data is sampled. Understanding what statistical estimation based on a finite data set reveals about the underlying distribution is a central preoccupation of statistics and machine learning; however, this kind of analysis has proved elusive in the case of clustering, except perhaps in the case of density-based clustering.

Consider for instance k-means, possibly the most popular clustering procedure in use today. If this procedure returns k clusters on an n-sample from a distribution f, what do these clusters reveal about f? Pollard [10] proved a basic consistency result: if the algorithm always finds the global minimum of the k-means cost function (which, incidentally, is NP-hard and thus computationally intractable in general; see [11, Th. 3]), then as n → ∞, the clustering is the globally optimal k-means solution for f, suitably defined. Even then, it is unclear whether the best k-means solution to f is an interesting or desirable quantity in settings outside of vector quantization.

Our work, and more generally work on density-based clustering, relies on meaningful formalisms of how a clustering of data generalizes to unambiguous structures of the underlying distribution. The main such formalism is that of the cluster tree.

    B. The Cluster Tree

We start with notions of connectivity. A path P in S ⊆ X is a continuous function P : [0, 1] → S. If x = P(0) and y = P(1), we write x ⇝_P y and we say that x and y are connected in S.

Fig. 2. A probability density f, and the restriction of C_f to a finite set of eight points.

The relation "connected in S" is an equivalence relation that partitions S into its connected components. We say S ⊆ X is connected if it has a single connected component.

The cluster tree is a hierarchy each of whose levels is a partition of a subset of X, which we will occasionally call a subpartition of X. Write Π(X) = {subpartitions of X}.

Definition 2.1: For any f : X → R, the cluster tree of f is a function C_f : R → Π(X) given by C_f(λ) = connected components of {x ∈ X : f(x) ≥ λ}. Any element of C_f(λ), for any λ, is called a cluster of f.

For any λ, C_f(λ) is a set of disjoint clusters of X. They form a hierarchy in the following sense.

Lemma 2.2: Pick any λ ≤ λ'. Then:
1) For any C ∈ C_f(λ'), there exists C' ∈ C_f(λ) such that C ⊆ C'.
2) For any C ∈ C_f(λ') and C' ∈ C_f(λ), either C ⊆ C' or C ∩ C' = ∅.

We will sometimes deal with the restriction of the cluster tree to a finite set of points x_1, ..., x_n. Formally, the restriction of a subpartition 𝒞 ∈ Π(X) to these points is defined to be 𝒞[x_1, ..., x_n] = {C ∩ {x_1, ..., x_n} : C ∈ 𝒞}. Likewise, the restriction of the cluster tree is C_f[x_1, ..., x_n] : R → Π({x_1, ..., x_n}), where C_f[x_1, ..., x_n](λ) = C_f(λ)[x_1, ..., x_n] (Figure 2).
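To make the restriction concrete, here is a small sketch continuing the illustrative snippet from Section I (the hypothetical density, grid, f_vals, and level_set_components defined there are assumptions, not the paper's notation): for a given λ, each cluster of C_f(λ)[x_1, ..., x_n] is simply the set of sample points falling inside one connected component of {x : f(x) ≥ λ}, and components containing no sample points disappear.

# Continues the earlier illustrative snippet: grid, f_vals, and
# level_set_components are the hypothetical objects defined there.

def restrict_to_sample(components, sample):
    """C_f(lam)[x_1, ..., x_n]: group sample points by the level-set
    component (interval) containing them; components with no sample
    points vanish from the restriction."""
    restricted = []
    for lo, hi in components:
        pts = [x for x in sample if lo <= x <= hi]
        if pts:
            restricted.append(pts)
    return restricted

sample = [-2.1, -1.2, -0.9, 0.05, 0.8, 1.1, 1.3, 2.4]    # eight sample points
comps = level_set_components(grid, f_vals, 0.45)         # two clusters at this level
print(restrict_to_sample(comps, sample))                 # sample points grouped by cluster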

    C. Notion of Convergence and Previous Work

Suppose a sample X_n ⊂ X of size n is used to construct a tree C_n that is an estimate of C_f. Hartigan [1] provided a sensible notion of consistency for this setting.

Definition 2.3: For any sets A, A' ⊆ X, let A_n (resp., A'_n) denote the smallest cluster of C_n containing A ∩ X_n (resp., A' ∩ X_n). We say C_n is consistent if, whenever A and A' are different connected components of {x : f(x) ≥ λ} (for some λ > 0), Pr(A_n is disjoint from A'_n) → 1 as n → ∞.

It is well known that if X_n is used to build a uniformly consistent density estimate f_n (that is, s