Fast Algorithm for the Lasso based L1-Graph Construction

Yasuhiro Fujiwara†‡∗, Yasutoshi Ida†, Junya Arai†, Mai Nishimura†, Sotetsu Iwamura†

†NTT Software Innovation Center, 3-9-11 Midori-cho Musashino-shi, Tokyo, Japan
‡NTT Communication Science Laboratories, 2-4 Seika-cho Soraku-gun, Kyoto, Japan

∗Osaka University, 1-5 Yamadaoka, Suita-shi, Osaka, Japan
{fujiwara.yasuhiro, ida.yasutoshi, arai.junya, nishimura.mai, iwamura.sotetsu}@lab.ntt.co.jp

ABSTRACT
The lasso-based L1-graph is used in many applications since it can effectively model a set of data points as a graph. The lasso is a popular regression approach and the L1-graph represents data points as nodes by using the regression result. More specifically, by solving the L1-optimization problem of the lasso, the sparse regression coefficients are used to obtain the weights of the edges in the graph. Conventional graph structures such as the k-NN graph use two steps, adjacency searching and weight selection, for constructing the graph, whereas the lasso-based L1-graph derives the adjacency structure as well as the edge weights simultaneously by using a coordinate descent. However, the construction cost of the lasso-based L1-graph is impractical for large data sets since the coordinate descent iteratively updates the weights of all edges until convergence. Our proposal, Castnet, can efficiently construct the lasso-based L1-graph. In order to avoid updating the weights of all edges, we prune edges that cannot have nonzero weights before entering the iterations. In addition, we update edge weights only if they are nonzero in the iterations. Experiments show that Castnet is significantly faster than existing approaches.

1. INTRODUCTION
We are now living in the big data era. With the rapid development of database technologies, high-dimensional data has become prevalent in many application domains, which demand techniques to process data in an effective manner [21, 17]. From the data mining perspective, it is important to detect the hidden structures of high-dimensional data that can be collectively revealed only by processing large amounts of data. This is because the hidden structures allow us to associate, and thus utilize, the semantic information of data; data belonging to the same cluster share the same semantic information [9]. From a practical perspective, finding the hidden structures of high-dimensional data is a fundamental process in computational biology, human-computer interaction, medical analysis, scientific data exploration, and information retrieval [10].

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-nd/4.0/. For any use beyond those covered by this license, obtain permission by emailing [email protected]. Proceedings of the VLDB Endowment, Vol. 10, No. 3. Copyright 2016 VLDB Endowment 2150-8097/16/11.

A graph is a fundamental data model for finding the hidden structures of high-dimensional data. The nodes and edges of a graph represent data points and the relationships among them, respectively. Since graphs are useful, various approaches to exploit graphs have been proposed in many communities [8, 9, 11, 19]. However, the database community has paid relatively little attention to developing graph construction approaches although it is quite an important research problem in real-world applications. Conceptually, a graph should capture the intrinsic complex structures induced by the high dimensionality of the data. A primitive approach for graph construction is k-nearest neighbor (k-NN). In a k-NN graph, each node is connected to its k nearest neighbor nodes by exploiting Euclidean distance. However, Euclidean distance is sensitive to data noise, a problem that is inherent in real-world applications [4].

In order to address this problem, Meinshausen et al. proposed to construct the L1-graph by solving the L1-optimization problem of the lasso [16]. The lasso is a popular least squares regression approach that represents each data point as a linear combination of the remaining data points along with coefficients [13]. By adding an L1-norm regularization term, the lasso achieves sparsity of the solution; it assigns exactly zero coefficients to the most irrelevant data points in the regression. The idea of Meinshausen et al. is to exploit the lasso to construct graphs; they applied the lasso to each node, where the regression coefficients are used as edge weights in the graph. In the approach, the neighborhood relationships and edge weights are simultaneously obtained by solving the L1-optimization problem of the lasso. Thus, the resulting graph is referred to as the L1-graph. The L1-graph is fundamentally different from traditional k-NN graphs. Since the L1-graph utilizes the higher-order relationships among data points, it is robust to data noise unlike k-NN graphs [4]. In addition, the L1-graph has small memory consumption since the lasso can sparsely represent each node by using a small number of the remaining nodes [16].

Even though the lasso-based L1-graph is used in many applications, its construction incurs high computation cost [4, 14, 15]. The graph construction approach proposed by Meinshausen et al. picks up nodes one by one and computes the lasso for each node. In terms of processing speed, the coordinate descent [5] is an efficient approach to the lasso that iteratively updates edge weights one at a time to compute the regression result. However, it is infeasible to apply the coordinate descent to large-scale data sets. This is because it iteratively computes the weights of all edges until convergence even if the weights of almost all edges are zero due to the L1-norm regularization term of the lasso.

1.1 Problem Statement
If N and M are the numbers of nodes and dimensions, respectively, a set of high-dimensional data is represented as a matrix X of size N × M. We address the following problem:

Problem (L1-graph construction).
Given: Matrix X of the high-dimensional data and tuning parameter λ.
Construct: the L1-graph of N nodes with edge weights computed by the lasso, efficiently.

In this problem, tuning parameter λ is a positive regularization parameter that balances the trade-off between the loss function and the L1-norm regularization of the lasso. Our approach can be used in various applications, as shown below, due to the generality of the problem.

Brain Analysis. Recent advances in computational neuroscience allow researchers to use large-scale gene expression databases in analyzing genes and anatomies. It is crucial to detect gene-anatomy associations for understanding brain function from molecular and genetic information. The lasso-based L1-graph has been used to model the anatomical organization of brain structure [14]. In the graph, each node represents the spatial location of a gene, and the edges between nodes encode the correlations between locations in the brain. Since spatially adjacent regions tend to exhibit correlated expression patterns, most of the edges connect to spatially adjacent regions in the graph. However, the graph revealed an apparent exception; the nodes annotated as dentate gyrus (DG) were found to be highly correlated to many other nodes in distant regions of the brain. DG is an important area for learning and memory to process spatial information. According to neuroanatomy, DG receives multiple sensory inputs including visual and auditory. Thus, DG plays an important role in blocking and filtering excitatory activity from the inputs.

Motion Segmentation. In the field of computer vision, motion segmentation is a fundamental process in decomposing a video into moving objects and a background. Motion segmentation is a pre-processing step in surveillance, tracking, and action recognition. It is important to develop a motion segmentation approach that is robust against noise since data is never clean in real-world applications. For example, in human image recognition, the data exhibits different degrees of noise depending on rotation, lighting, scale, and so on. Although many approaches have been proposed in the literature, the approach based on the affine camera model is arguably the most popular for motion segmentation. An important intuition behind the approach is to use motion; we can easily discern independently moving objects by simply seeing their motion. In the approach, all the trajectories associated with a single rigid object lie in a linear subspace. Therefore, motion segmentation can be achieved by clustering trajectories into different motion subspaces [4]. More specifically, the approach constructs the L1-graph from the trajectory data to apply spectral clustering since the L1-graph is robust against noise unlike other types of graphs. The approach is more effective than existing approaches.

Lesion Detection. Automatic lesion detection is an important process in early medical screening or in providing second opinions for decision making. In computer-aided diagnosis, analyzing CT images is an essential step in generating 3D models. Although it is conceptually simple that lesions are just regions whose features differ from those of normal anatomical structures, it is technically challenging to detect lesions in CT images due to the poor contrast between the lesion and surrounding tissues, and the high variability of lesion shape. The lasso-based L1-graph is a robust approach to lesion detection where a voxel in a CT image corresponds to a node in the graph [15]. The approach identifies lesion regions in CT images by applying a semi-supervised learning approach that propagates labels of manually-labeled nodes to unlabeled nodes in the graph. The approach is motivated by the superior discriminant power of the sparse representations yielded by the lasso. Due to the sparsity constraint imposed by the L1-norm regularization term, the L1-graph has a sparse structure where only nodes of similar features are connected to each other. As a result, the approach can effectively determine the lesion regions from noisy data. Experiments show that the approach can distinctly identify prostate cancer, one of the leading causes of male cancer death in the United States.

1.2 Contributions
In this paper, we propose Castnet, a novel and efficient approach to constructing the lasso-based L1-graph. Before entering the iterations, we prune those edges whose weights cannot be nonzero. In addition, we efficiently update nonzero weights in each iteration by pruning edges whose weights are zero. Although our approach skips unnecessary weight updates similarly to the state-of-the-art approach for the lasso [7], we substantially improve the efficiency compared to the previous approach by effectively pruning the inner product computations used for graph construction. As a result, our approach is up to 130 times faster than the state-of-the-art approach. Even though the lasso-based L1-graph was designed to overcome the sensitivity of k-NN graphs to data noise, it is seldom applied to large-scale data sets due to the high computation cost. However, our approach can efficiently construct graphs, which will improve the usefulness of many applications such as brain analysis, motion segmentation, and lesion detection.

The remainder of this paper is organized as follows: Section 2 describes related work. Section 3 gives an overview of the background. Section 4 introduces our approach. Section 5 reviews our experiments and the results. Section 6 details case studies. Section 7 provides our conclusions.

2. RELATED WORK
Meinshausen et al. proposed the lasso-based L1-graph as a more robust approach against noise than traditional k-NN graphs [16]. Their approach picks up nodes one by one and uses the lasso to compute weights for each node. A popular approach to solving the L1-optimization problem of the lasso is the coordinate descent [5]. To solve the L1-optimization problem, it iteratively updates edge weights one at a time. In terms of efficiency, Sling is a state-of-the-art approach based on the coordinate descent, although it is not well suited to constructing the L1-graph [7]. Sling uses the sequential strong rule to discard unnecessary edges. It is based on the assumption that a grid of tuning parameters (λ1 > λ2 > λ3 > ...) is used in solving the lasso, the same as in other screening techniques [24]; the tuning parameter controls the sparsity of the solution. In each iteration, Sling estimates the parameters of the soft-thresholding operator used in updating weights. In addition, Sling employs the covariance-based approach to efficiently update weights by computing the inner products of all nodes before entering each iteration [6]. After convergence, Sling checks the Karush-Kuhn-Tucker (KKT) condition of all discarded edges since the sequential strong rule may erroneously discard edges of nonzero weights [24]. Although Sling is faster than the original coordinate descent approach for the lasso, it is not practical to construct the L1-graph by directly using Sling. This is because the tuning parameter is constant when constructing a graph; the sequential strong rule is not effective in pruning unnecessary edges since λ does not dynamically change during graph construction. In addition, Sling computes the inner products of all nodes even if almost all edge weights are zero as a result of the lasso. To the best of our knowledge, Castnet is the first approach that can efficiently construct the lasso-based L1-graph.

Table 1: Definitions of main symbols.
  Symbol      Definition
  N           Number of nodes in the graph
  M           Number of dimensions of each data point
  m           Target rank of SVD
  λ           Tuning parameter
  p           Node for weight computation
  e_p[u]      Edge from node u to p
  w_p[u]      Weight of edge e_p[u]
  d_p         Number of edges to node p
  K[u|w_p]    KKT condition score of node u for weight vector w_p
  K̄[u|w_p]    Upper bound of KKT condition score K[u|w_p]
  K̲[u|w_p]    Lower bound of KKT condition score K[u|w_p]
  x_p         p-th row vector of matrix X
  w_p         1 × N weight vector of node p
  X           N × M matrix of high-dimensional data
  V           Set of nodes in the graph
  M[w_p]      Set of edges that must have nonzero weights for w_p
  C[w_p]      Set of edges that can have nonzero weights for w_p

3. PRELIMINARY
We formally define the notations and introduce the background of this paper. Table 1 lists the main symbols and their definitions. In the lasso-based L1-graph, nodes and edges correspond to data points and their relationships, respectively. Let p be a node in the graph and V be the set of nodes in the graph; the lasso is used to compute the weights of edges by picking up each node p ∈ V. Since the lasso sets zero coefficients for almost all edges, the graph has a sparse structure. Let X ∈ R^{N×M} be the matrix of N data points that have M-dimensional features and x_p = (x_p[1], x_p[2], ..., x_p[M]) be the p-th row vector in matrix X; vector x_p corresponds to the p-th data point, node p. We assume that each row vector is centered and normalized [6]; each vector has average 0 and variance 1. Let w_p be a 1 × N weight vector whose u-th element w_p[u] is the weight of the edge from node u to p. In the lasso-based L1-graph, each node is represented as a sparse linear superposition of the other nodes by solving the lasso optimization problem. Specifically, we compute the weights of edges to node p by minimizing the following objective function of the regression, where w_p[p] = 0 [16]:

$$\min_{w_p \in \mathbb{R}^N} \; \tfrac{1}{2M}\|x_p - w_p X\|_2^2 + \lambda\|w_p\|_1 \qquad (1)$$

where ‖·‖₁ and ‖·‖₂ are the L1-norm and L2-norm of a vector, respectively, and λ > 0 is a tuning parameter. Note that the graph does not have self-loop edges since w_p[p] = 0 for each picked-up node. In Equation (1), the first term corresponds to the squared loss of the regression; the second term corresponds to an L1-norm constraint on the coefficients. Therefore, Equation (1) implements the regularization of weight vector w_p by trading off the accuracy of the regression for the sparsity of the coefficients; it has the effect of giving a solution to the objective function with few edges of nonzero weights [23]. It is clear from Equation (1) that the graph becomes sparser as we raise tuning parameter λ. As described in the previous paper [6], if some nodes are strongly correlated, a single node would be used to represent node p; only an edge from that single node is likely to have nonzero weight in that case. If matrix X has full row rank, we have a unique solution for the optimization problem of Equation (1) [13]. Otherwise, which is necessarily the case when N > M, there may not be a unique solution.
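To make the objective concrete, the following minimal NumPy sketch evaluates Equation (1) for a single node p; the function name and signature are ours, not the paper's:

```python
import numpy as np

def lasso_objective(X, p, w_p, lam):
    """Objective (1) for node p: (1/2M)||x_p - w_p X||_2^2 + lam ||w_p||_1.

    X is the N x M data matrix with centered/normalized rows; w_p is the
    1 x N weight vector with w_p[p] fixed to 0 (no self-loop edges).
    """
    M = X.shape[1]
    residual = X[p] - w_p @ X
    return np.dot(residual, residual) / (2.0 * M) + lam * np.abs(w_p).sum()
```

Raising lam drives more entries of w_p to exactly zero, which is what gives the L1-graph its sparse structure.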

In order to efficiently solve the lasso, Friedman et al. proposed the coordinate descent, which updates the weights of edges one at a time until they reach convergence [5]. The coordinate descent partially conducts optimization with respect to weight w_p[u] by supposing that it has already estimated the other weights for all nodes u such that u ∈ V\p. Specifically, the coordinate descent updates weights as follows:

$$w_p[u] \leftarrow S[z_p[u \mid w_p], \lambda] \qquad (2)$$

where S[·,·] is the following soft-thresholding operator [5]:

$$S[z_p[u \mid w_p], \lambda] = \begin{cases} z_p[u \mid w_p] - \lambda & (z_p[u \mid w_p] > 0 \text{ and } |z_p[u \mid w_p]| > \lambda) \\ z_p[u \mid w_p] + \lambda & (z_p[u \mid w_p] < 0 \text{ and } |z_p[u \mid w_p]| > \lambda) \\ 0 & (|z_p[u \mid w_p]| \le \lambda) \end{cases} \qquad (3)$$

In addition, z_p[u|w_p], a parameter of node u for weight vector w_p, is computed as follows:

$$z_p[u \mid w_p] = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]\,(x_p[i] - x_p^{(u)}[i]) \qquad (4)$$

where x_p^{(u)}[i] is the regression result for element x_p[i] without using node u, computed as follows:

$$x_p^{(u)}[i] = \sum_{v \in V\setminus\{p,u\}} w_p[v]\, x_v[i] \qquad (5)$$

As shown by the above equations, z_p[u|w_p] and x_p^{(u)}[i] are required to update a weight by Equation (2). Since x_p^{(u)}[i] is different for each weight, the above equations indicate that the original coordinate descent approach needs a different regression result for each weight. It takes O(M) time to compute z_p[u|w_p] from Equation (4). In addition, Equation (4) requires O(d_p) additional time to compute x_p^{(u)}[i] for each dimension by Equation (5), where d_p is the number of nonzero-weight edges to node p; d_p is the degree of node p. As a result, if T is the number of updates until convergence, it takes O(d_p M T) time to compute the edge weights to node p by using Equation (2). Consequently, the computation cost of constructing the L1-graph can be prohibitive for large datasets.
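For reference, the following sketch implements this baseline directly from Equations (2)-(5); helper names are ours. It recomputes the regression result of Equation (5) from scratch for every weight, which is exactly the per-update cost Castnet is designed to avoid:

```python
import numpy as np

def soft_threshold(z, lam):
    # Soft-thresholding operator S[z, lam] of Equation (3).
    return z - lam if z > lam else (z + lam if z < -lam else 0.0)

def naive_coordinate_descent(X, p, lam, max_sweeps=100, tol=1e-6):
    """Original coordinate descent for node p (Equations (2)-(5))."""
    N, M = X.shape
    w_p = np.zeros(N)
    for _ in range(max_sweeps):
        max_delta = 0.0
        for u in range(N):
            if u == p:
                continue                        # w_p[p] = 0: no self-loop
            mask = np.ones(N, dtype=bool)       # Equation (5): regression
            mask[[p, u]] = False                # result without node u
            x_pu = w_p[mask] @ X[mask]
            z = np.dot(X[u], X[p] - x_pu) / M   # Equation (4)
            new_w = soft_threshold(z, lam)      # Equation (2)
            max_delta = max(max_delta, abs(new_w - w_p[u]))
            w_p[u] = new_w
        if max_delta < tol:
            break
    return w_p
```

This dense rendering costs O(NM) per update; the paper's O(d_p M) figure assumes only the nonzero weights are touched when forming Equation (5).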

4. PROPOSED METHOD
This section presents Castnet, which can efficiently construct the lasso-based L1-graph. Section 4.1 overviews the ideas that underlie our approach. Section 4.2 describes our approach to determining the edge sets used to update weights by using upper and lower bounds of the KKT condition score. In Section 4.3, we introduce the approach that computes the bounds of the KKT condition score. Section 4.4 shows the graph construction algorithm along with its theoretical properties.

4.1 Main Ideas
In constructing the lasso-based L1-graph, the original coordinate descent approach incurs high computation cost. This is because (1) it updates the weights of all edges until convergence although almost all edges have zero weights due to the L1-norm constraint of the lasso, and (2) it updates each weight by computing a different regression result for each weight, as described in Section 3.

For efficient graph construction, our approach is to prune unnecessary edges that cannot have nonzero weights so as to determine a set of edges for weight updating. More specifically, we first limit weight updates to the set of edges that must have nonzero weight until convergence. After that, we update the weights of the set of edges that can have nonzero weight. As a result, we can efficiently obtain the weights for each node. Theoretically, we efficiently determine the edge sets that must/can have nonzero weight by computing the upper and lower bounds of KKT condition scores; the KKT condition was originally used to find edges that are erroneously discarded by the sequential screening rule, as described in Section 2 [24]. In addition, we exploit two types of efficient update computations for weights. The first type of update computation is based on the residual of the lasso, which is the same for each weight. Since we do not need to compute different regression results in each weight update, we can greatly enhance computation speed compared to the original coordinate descent approach. We use this type of computation for edges that have zero weight. Otherwise, we update the weight using the covariance-based approach [6]; this is the second type of update computation. Our approach can update weights more efficiently than the previous approach even though we exploit the same covariance-based approach [7]. This is because we use the residual-based update computation if the weight of an edge is zero. As described in Section 2, the covariance-based approach requires the computation of inner products for all edges to update weights before entering the iterations. Since we do not update all edge weights by the covariance-based approach, we can effectively prune the inner product computations unlike the previous approach. Furthermore, we skip update computations for edges if they do not have nonzero weight. This is because the edge sets to update may include zero-weight edges. By skipping the update computations for such edges, we improve the efficiency with which nonzero weights are updated in the iterations.

Even though we adopt a different strategy from the previous approach in constructing the L1-graph, we can provably guarantee to output the same graph as the previous approach if matrix X has full row rank. This is because our approach converges on the unique lasso solution in that case. Therefore, our approach is effective and efficient in constructing the lasso-based L1-graph.

4.2 Weight Updates
This section describes the approach to updating weights. We first introduce the definitions of edge sets as well as their properties in Section 4.2.1. We then describe the weight update computations in Section 4.2.2. We finally show our approach of skipping update computations for zero-weight edges in Section 4.2.3.

4.2.1 Definitions
Before entering the iterations, we compute the two edge sets M[w_p] and C[w_p]. M[w_p] is the set of edges whose weights must be nonzero when updated with weight vector w_p. Similarly, C[w_p] is the set of edges that can have nonzero weights when updated with weight vector w_p. Theoretically, edge sets M[w_p] and C[w_p] are based on the KKT condition [24]. We obtain these edge sets by computing the upper and lower bounds of KKT condition scores, which are given in Section 4.3. Before defining the edge sets, we show how the KKT condition can check whether weight w_p[u] has a nonzero score or not for weight vector w_p as follows:

Definition 1 (KKT condition [24]). If K[u|w_p] is the KKT condition score of node u for weight vector w_p, we have

$$K[u \mid w_p] = \tfrac{1}{\lambda}w_p[u] + \tfrac{1}{\lambda M}(x_p - w_p X)\,x_u^\top \qquad (6)$$

where x_u^⊤ is the transpose of x_u. For all nodes u such that u ∈ V\p, if the weight of the edge from node u to p is updated with weight vector w_p, K[u|w_p] has the following scores based on weight w_p[u] after the update:

$$\begin{cases} K[u \mid w_p] > 1 & (\text{iff } w_p[u] > 0) \\ K[u \mid w_p] < -1 & (\text{iff } w_p[u] < 0) \\ -1 \le K[u \mid w_p] \le 1 & (\text{iff } w_p[u] = 0) \end{cases} \qquad (7)$$

It is clear that we can obtain a set of edges that have nonzero weights by directly computing the KKT condition scores from Equation (6). This, however, incurs high computation cost; it takes O(NM) time to determine each KKT condition score in the worst case since the sizes of x_p, w_p, X, and x_u^⊤ in Equation (6) are 1 × M, 1 × N, N × M, and M × 1, respectively. To overcome this problem, we efficiently compute the upper and lower bounds of the KKT condition scores introduced in Section 4.3. If K̄[u|w_p] and K̲[u|w_p] are the upper and lower bounds of KKT condition score K[u|w_p], respectively, we have K̲[u|w_p] ≤ K[u|w_p] ≤ K̄[u|w_p]. By exploiting the bounds, the two edge sets are given as follows:
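As a reference point, the exact score of Equation (6) can be computed directly; the following sketch (names ours) shows the O(NM)-per-edge computation that the bounds below are designed to avoid:

```python
import numpy as np

def kkt_score(X, p, u, w_p, lam):
    """Exact KKT condition score K[u|w_p] of Equation (6)."""
    M = X.shape[1]
    residual = X[p] - w_p @ X                  # x_p - w_p X, O(NM)
    return w_p[u] / lam + np.dot(residual, X[u]) / (lam * M)
```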

Definition 2 (Edge Set M). For node u such that u ≠ p, the definition of edge set M[w_p] is given as follows:

$$M[w_p] = \{e_p[u] : \overline{K}[u \mid w_p] < -1 \text{ or } \underline{K}[u \mid w_p] > 1\} \qquad (8)$$

Definition 3 (Edge Set C). Letting u ≠ p, the following equation defines edge set C[w_p]:

$$C[w_p] = \{e_p[u] : \overline{K}[u \mid w_p] > 1 \text{ or } \underline{K}[u \mid w_p] < -1\} \qquad (9)$$

In Equations (8) and (9), e_p[u] is the edge from node u to p. Therefore, an edge is included in M[w_p] if K̄[u|w_p] < −1 or K̲[u|w_p] > 1 holds. Similarly, when we have K̄[u|w_p] > 1 or K̲[u|w_p] < −1, an edge is placed in edge set C[w_p]. The KKT condition yields the following properties for the edge sets:

Lemma 1 (Edge Set M). Each edge e_p[u] such that e_p[u] ∈ M[w_p] must have nonzero weight if its weight is updated with weight vector w_p.

Proof. For each edge included in edge set M[w_p], we have K̄[u|w_p] < −1 or K̲[u|w_p] > 1 if its weight is updated for weight vector w_p, from Definition 2. In the case of K̄[u|w_p] < −1, we have K[u|w_p] ≤ K̄[u|w_p] < −1 since K̄[u|w_p] is the upper bound of K[u|w_p]. Therefore, weight w_p[u] of edge e_p[u] must have a negative score from the KKT condition. If K̲[u|w_p] > 1 holds for edge e_p[u], we similarly have K[u|w_p] ≥ K̲[u|w_p] > 1 since K[u|w_p] ≥ K̲[u|w_p] holds. As a result, edge e_p[u] must have positive weight in this case from the KKT condition. □

Lemma 2 (Edge Set C). For each edge e_p[u] such that e_p[u] ∈ C[w_p], its weight w_p[u] can have a nonzero score if we perform the update computation with weight vector w_p.

Proof. To prove Lemma 2, we show that the weight of an edge must be zero after being updated with weight vector w_p if K̄[u|w_p] ≤ 1 and K̲[u|w_p] ≥ −1 hold. In that case, we have −1 ≤ K̲[u|w_p] ≤ K[u|w_p] ≤ K̄[u|w_p] ≤ 1. Therefore, such an edge cannot have nonzero weight from the KKT condition. In addition, it is clear that such an edge cannot be included in edge set C[w_p] from Definition 3. As a result, if we have K̄[u|w_p] > 1 or K̲[u|w_p] < −1 for an edge, the edge can have nonzero weight and so is included in C[w_p]. □

In terms of the relationship between edge sets M[w_p] and C[w_p], we have the following property:

Lemma 3 (Edge Sets M and C). If an edge is included in M[w_p], the edge must be included in C[w_p].

Proof. From Definition 2, we have K̄[u|w_p] < −1 or K̲[u|w_p] > 1 for an edge if the edge is included in M[w_p]. If K̄[u|w_p] < −1 holds, we have K̲[u|w_p] ≤ K̄[u|w_p] < −1 for the edge. In addition, if we have K̲[u|w_p] > 1, then K̄[u|w_p] ≥ K̲[u|w_p] > 1 holds for the edge. Therefore, it is clear that such an edge is also included in C[w_p] from Definition 3. □

Lemma 3 indicates that, if an edge is determined to have nonzero weight by M[w_p], the edge is also determined to have nonzero weight by C[w_p]. Similarly, Lemma 3 indicates that, if an edge is determined not to have nonzero weight by C[w_p], the edge is also determined not to have nonzero weight by M[w_p]; an edge cannot have nonzero weight if it is not included in C[w_p]. We determine M[w_p] and C[w_p] before entering the iterations to efficiently perform update computations for edges that must/can have nonzero weights.
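A sketch of the resulting classification logic, assuming the bounds are already available (the function and label names are ours):

```python
def classify_edge(k_upper, k_lower):
    """Classify one edge from its KKT bounds (Definitions 2 and 3).

    Returns 'must' for M[w_p], 'can' for C[w_p], or 'zero' when
    -1 <= K[u|w_p] <= 1 is guaranteed and the edge can be pruned.
    """
    if k_upper < -1.0 or k_lower > 1.0:   # sign of the weight is certain
        return 'must'
    if k_upper > 1.0 or k_lower < -1.0:   # nonzero weight is possible
        return 'can'
    return 'zero'
```

Consistent with Lemma 3, every edge labeled 'must' here also satisfies the 'can' condition.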

4.2.2 Efficient Update Computation
The edge sets introduced in the previous section can, before entering the iterations, reduce the number of edges to update. This section describes our approach to improving the speed of a single update computation in the iterations. As shown in Equation (2), the original coordinate descent approach updates weights by exploiting the soft-thresholding operator S[z_p[u|w_p], λ]. Since parameter z_p[u|w_p] requires different regression results for each weight as described in Section 3, the original coordinate descent approach incurs high computation cost. In our approach, we use two types of efficient update computations. The first type of computation updates a weight by computing parameter z_p[u|w_p] without demanding different regression results as follows:

$$z_p[u \mid w_p] = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]\, r_p[i] + w_p[u] \qquad (10)$$

where r_p[i] is the residual for element x_p[i] in the regression result of the lasso, computed as follows:

$$r_p[i] = x_p[i] - \sum_{v \in V\setminus p} w_p[v]\, x_v[i] \qquad (11)$$

Since the residual used in Equation (11) is the same for each weight, we can efficiently compute parameter z_p[u|w_p] if the regression result does not change after the update. While Equation (4) needs O(d_p M) time, Equation (10) requires O(M) time to compute parameter z_p[u|w_p]. For Equation (10), we have the following property:

Lemma 4 (Parameter z). Equations (4) and (10) yield the same result in computing parameter z_p[u|w_p].

Proof. From Equations (5) and (11), we have

$$x_p^{(u)}[i] = \sum_{v \in V\setminus\{p,u\}} w_p[v]x_v[i] = \sum_{v \in V\setminus p} w_p[v]x_v[i] - w_p[u]x_u[i] = x_p[i] - r_p[i] - w_p[u]x_u[i]$$

Therefore, from Equation (4),

$$z_p[u \mid w_p] = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i](x_p[i] - x_p^{(u)}[i]) = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i](r_p[i] + w_p[u]x_u[i]) = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]r_p[i] + \tfrac{1}{M}w_p[u]\sum_{i=1}^{M}(x_u[i])^2$$

Since the variance of vector x_u is 1 as described in Section 3, we have $\sum_{i=1}^{M}(x_u[i])^2 = M$. As a result,

$$z_p[u \mid w_p] = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]r_p[i] + w_p[u]$$

which completes the proof from Equation (10). □

We exploit Equation (10) if an edge has zero weight before the update. Since such an edge is expected to have zero weight again after the update [5], the residual is likely to have the same score after the update. Therefore, we can effectively utilize residual r_p[i] again in the next iteration when updating other weights. Even if the edge has nonzero weight after the update, we can incrementally update residual r_p[i] in O(1) time by exploiting Equation (11).

If an edge has nonzero weight before the update, we use the covariance-based approach to update the weight since it can efficiently compute parameter z_p[u|w_p] in O(d_p) time [6]:

$$z_p[u \mid w_p] = w_p[u] + \tfrac{1}{M}\Big(\langle x_p, x_u\rangle - \sum_{v : |w_p[v]| > 0} w_p[v]\langle x_v, x_u\rangle\Big) \qquad (12)$$

where ⟨x_p, x_u⟩ is the inner product of vectors x_p and x_u, i.e., $\langle x_p, x_u\rangle = \sum_{i=1}^{M} x_p[i]x_u[i]$. Note that Equations (4) and (12) give the same result. Although we exploit the same covariance-based approach, our approach has lower computation cost than the previous approach [7]. Before updating weights, the covariance-based approach requires inner product computations between nonzero-weight edges and all edges undergoing update computations. Since the previous approach updates the weights of all the edges by the covariance-based approach, it needs O(d_p N M) time to compute the inner products. On the other hand, we exploit the covariance-based approach only if the edge has nonzero weight before the update. As a result, our approach takes O(d_p² M) time to compute the inner products. Note that we have d_p ≪ N since the L1-graph has a sparse structure.
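The two update types can be sketched as follows; this is a minimal NumPy rendering under our own naming, with the inner-product callback standing in for whatever lazy cache an implementation uses:

```python
import numpy as np

def soft_thr(z, lam):
    # Closed form of the soft-thresholding operator of Equation (3).
    return np.sign(z) * max(abs(z) - lam, 0.0)

def update_via_residual(X, u, w_p, r_p, lam):
    """Residual-based update (Equations (10)-(11)) for a zero weight.

    The residual r_p = x_p - w_p X is shared by all weights, so computing
    z costs O(M). If the weight turns nonzero, r_p is patched in place so
    the invariant r_p = x_p - w_p X still holds for the next update.
    """
    M = X.shape[1]
    z = np.dot(X[u], r_p) / M + w_p[u]            # Equation (10)
    new_w = soft_thr(z, lam)
    if new_w != w_p[u]:
        r_p -= (new_w - w_p[u]) * X[u]
        w_p[u] = new_w
    return new_w

def update_via_covariance(X, p, u, w_p, inner, lam):
    """Covariance-based update (Equation (12)) for a nonzero weight.

    inner(a, b) returns the inner product <x_a, x_b>; Castnet evaluates
    it lazily for nonzero-weight edges only, for O(d_p^2 M) total cost.
    """
    M = X.shape[1]
    s = sum(w_p[v] * inner(v, u) for v in np.nonzero(w_p)[0])
    z = w_p[u] + (inner(p, u) - s) / M            # Equation (12)
    w_p[u] = soft_thr(z, lam)
    return w_p[u]
```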

4.2.3 Gradual Edge Addition
As shown in Section 4.2.1, we determine edge sets M[w_p] and C[w_p] before entering the iterations. If U is an edge set whose weights are to be updated, we can naively reduce the number of updates by setting U = M[w_p] and U = C[w_p] and directly applying the coordinate descent. However, this approach may iteratively perform update computations for zero-weight edges since we use the upper and lower bounds to obtain M[w_p] and C[w_p]; we exploit the approximations to determine M[w_p] and C[w_p]. This section describes the approach that skips unnecessary update computations to improve the efficiency of the L1-graph construction.¹

¹Although all the edges included in M[w_p] have nonzero weights after the update computation with weight vector w_p, the weight vector changes after the updates. Therefore, M[w_p] can include edges whose weights are zero as a result of update computations.

For efficient graph construction, we add edges one by one from edge set A to edge set U. In our approach, we first obtain edge set U as the edges of nonzero weight and perform update computations until convergence for edge set U. After convergence, we compute M[w_p] and obtain the set of edges to add as A = M[w_p]\U. Then, we perform an update computation for each edge in A one by one. If an edge has nonzero weight after the update computation, we add the edge to U and update the weights of U until convergence. Otherwise, we do not iteratively perform update computations for the edge. Similarly, we compute A = C[w_p]\U and update weights. Since we avoid iterative computations if an edge does not have nonzero weight, we can improve the efficiency of weight updates. Theoretically, this approach is based on the following property of converged weights:

Lemma 5 (Converged Weights). Let w*_p be a converged weight vector for edge set U[w_p] ∪ {u} obtained by the coordinate descent based on weight vector w_p. If (1) weight vector w_p reaches convergence by the coordinate descent with edge set U[w_p], and (2) w_p[u] = 0 and w*_p[u] = 0 hold for node u, we have w*_p = w_p.

Proof. Since w_p[u] = 0 and w*_p[u] = 0, the u-th elements of w_p and w*_p are obviously the same. If w*_p[u] = 0 holds, it is clear that weight w*_p[u] does not have any impact on the update results from Equation (5). In addition, weight vector w_p has already reached convergence by exploiting edge set U[w_p]. Therefore, we have w*_p[u] = w_p[u] for all elements in U[w_p] after the update computations to obtain weight vector w*_p if we use weight vector w_p in the coordinate descent. As a result, we have w*_p = w_p. □

Lemma 5 indicates that we can skip the update computations for the edge from node u to p when its weight is zero in the iterations, without changing the regression results yielded by the coordinate descent. This skipping of unnecessary computations allows us to efficiently update nonzero weights. We show the detailed algorithm of this approach in Section 4.4 as a part of the graph construction algorithm.

4.3 Upper and Lower Bounds
As described in Section 4.2, we can prune edges that do not have nonzero weight by computing KKT condition scores. However, it takes a high cost of O(NM) time to compute the score for each edge because the size of data matrix X is N × M. This section introduces our approaches to efficiently computing the upper and lower bounds of KKT condition scores. Section 4.3.1 describes our approach to computing the bounds by using SVD (singular value decomposition). Section 4.3.2 shows the incremental approach for bound computations.

4.3.1 SVD-Based Bound Computations
In this approach, we approximate matrix X by computing its SVD. We adopt SVD since it gives high approximation quality in terms of squared loss [20]. Let V be a unitary matrix and X̃ be the transformed matrix of X; matrix X̃ is represented as X̃ = XV by SVD. Similarly, let vector x̃_u be the transformed vector of x_u; vector x̃_u is represented as x̃_u = x_u V. If r̃_p = x̃_p − w_p X̃, the definitions of the upper and lower bounds are given as follows:

Definition 4 (Upper and Lower Bounds). If K̄[u|w_p] and K̲[u|w_p] are the upper and lower bounds of K[u|w_p], respectively, K̄[u|w_p] and K̲[u|w_p] are given as follows:

$$\overline{K}[u \mid w_p] = \tfrac{1}{\lambda}w_p[u] + \tfrac{1}{2\lambda M}\Big(\|\tilde{r}_p\|_2^2 + M - \textstyle\sum_{i=1}^{m}(\tilde{r}_p[i] - \tilde{x}_u[i])^2\Big) \qquad (13)$$

$$\underline{K}[u \mid w_p] = \tfrac{1}{\lambda}w_p[u] - \tfrac{1}{2\lambda M}\Big(\|\tilde{r}_p\|_2^2 + M - \textstyle\sum_{i=1}^{m}(\tilde{r}_p[i] + \tilde{x}_u[i])^2\Big) \qquad (14)$$

In Definition 4, ‖r̃_p‖₂ is the L2-norm of vector r̃_p, r̃_p[i] is the i-th element of r̃_p, and m is the target rank of the SVD. The following lemma shows that K̄[u|w_p] and K̲[u|w_p] give the upper and lower bounds, respectively:

Lemma 6 (Upper and Lower Bounds). If we update the weight of the edge from node u to p for weight vector w_p, we have K̄[u|w_p] ≥ K[u|w_p] and K̲[u|w_p] ≤ K[u|w_p] for the upper and lower bounds given in Definition 4.

Proof. Since VV^⊤ = I, where I is the identity matrix,

$$(x_p - w_pX)x_u^\top = (x_p - w_pX)VV^\top x_u^\top = \tilde{r}_p\tilde{x}_u^\top = \sum_{i=1}^{M}\tilde{r}_p[i]\tilde{x}_u[i]$$

Since $\tilde{r}_p[i]\tilde{x}_u[i] = \tfrac{1}{2}\{(\tilde{r}_p[i])^2 + (\tilde{x}_u[i])^2 - (\tilde{r}_p[i] - \tilde{x}_u[i])^2\}$ holds,

$$(x_p - w_pX)x_u^\top = \tfrac{1}{2}\sum_{i=1}^{M}\{(\tilde{r}_p[i])^2 + (\tilde{x}_u[i])^2 - (\tilde{r}_p[i] - \tilde{x}_u[i])^2\}$$

Note that $\sum_{i=1}^{M}(\tilde{x}_u[i])^2 = M$ holds since the SVD is an orthogonal transformation and each vector is normalized as described in Section 3. Therefore, we have from Equation (6)

$$K[u \mid w_p] = \tfrac{1}{\lambda}w_p[u] + \tfrac{1}{2\lambda M}\Big(\|\tilde{r}_p\|_2^2 + M - \sum_{i=1}^{M}(\tilde{r}_p[i] - \tilde{x}_u[i])^2\Big)$$

Since $(\tilde{r}_p[i] - \tilde{x}_u[i])^2 \ge 0$ holds, we have

$$K[u \mid w_p] \le \tfrac{1}{\lambda}w_p[u] + \tfrac{1}{2\lambda M}\Big(\|\tilde{r}_p\|_2^2 + M - \sum_{i=1}^{m}(\tilde{r}_p[i] - \tilde{x}_u[i])^2\Big)$$

Therefore, we have K̄[u|w_p] ≥ K[u|w_p] from Equation (13). Similarly, we have K̲[u|w_p] ≤ K[u|w_p] since $\tilde{r}_p[i]\tilde{x}_u[i] = \tfrac{1}{2}\{-(\tilde{r}_p[i])^2 - (\tilde{x}_u[i])^2 + (\tilde{r}_p[i] + \tilde{x}_u[i])^2\}$ holds. □

From the above proof, it is clear that the errors of the upper and lower bounds are proportional to $\sum_{i=m+1}^{M}(\tilde{r}_p[i] - \tilde{x}_u[i])^2$ and $\sum_{i=m+1}^{M}(\tilde{r}_p[i] + \tilde{x}_u[i])^2$, respectively. Although SVD gives the smallest squared error in approximating matrix X, it is not guaranteed to give the tightest upper and lower bounds since vector r̃_p is distorted by vector w_p in the form of r̃_p = x̃_p − w_p X̃. However, we can expect SVD to improve the quality of the upper and lower bounds. This is because r̃_p[i] and x̃_u[i] are likely to take small absolute values as i increases.

It takes O(m) time to compute the bounds from Equations (13) and (14) if we have norm ‖r̃_p‖₂. Note that norm ‖r̃_p‖₂ is computed in O(d_p m) time and, once computed, is used for all edges to node p. In addition, we can efficiently compute the SVD of matrix X in O(NM log m) time by using an existing technique [12]. Note that we compute the SVD only once in constructing the graph.
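A minimal sketch of the SVD-based bound computation; np.linalg.svd stands in for the faster randomized technique the paper cites [12], and the names are ours:

```python
import numpy as np

def svd_transform(X, m):
    """Rank-m transform X_tilde = X V_m of Section 4.3.1."""
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:m].T                    # N x m transformed matrix

def kkt_bounds_svd(Xt, rt, r_norm_sq, u, w_pu, lam, M):
    """Bounds of Equations (13)-(14), O(m) per edge.

    Xt is the transformed matrix, rt the first m entries of the
    transformed residual, and r_norm_sq = ||r_tilde||_2^2, which equals
    ||x_p - w_p X||_2^2 because V is orthogonal.
    """
    base = w_pu / lam
    upper = base + (r_norm_sq + M - np.sum((rt - Xt[u]) ** 2)) / (2 * lam * M)
    lower = base - (r_norm_sq + M - np.sum((rt + Xt[u]) ** 2)) / (2 * lam * M)
    return upper, lower
```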

4.3.2 Incremental Computation
As described in Section 4.2.3, we update weights by adding edges one by one to edge set U from edge set A by computing A = M[w_p]\U and A = C[w_p]\U. We can reduce the computation time needed to obtain edge sets M[w_p] and C[w_p] by exploiting the SVD-based approach introduced in the previous section. However, the efficiency of that approach would only be moderate since it requires O(m) time to compute the bounds for each edge every time we compute the edge sets. How can we cut down the computation cost of determining the bounds? This is the motivation behind our approach of incremental computation; we incrementally compute the bounds in O(1) time. This approach is based on the approach that adds edges one by one as introduced in Section 4.2.3; we can effectively update the bounds since the weight vector is not so different before and after the edge additions. This section first describes the relation between the KKT condition score and parameter z_p[u|w_p]. Then, it shows the incremental approach to computing the bounds.

The following lemma confirms that we can compute the KKT condition score from parameter z_p[u|w_p]:

Lemma 7 (Equivalence of KKT condition score). For node u and weight vector w_p, we can compute KKT condition score K[u|w_p] from parameter z_p[u|w_p] as follows:

$$K[u \mid w_p] = \tfrac{1}{\lambda} z_p[u \mid w_p] \qquad (15)$$

Proof. From Equation (6), we have

$$\lambda K[u \mid w_p] = w_p[u] + \tfrac{1}{M}(x_p - w_pX)x_u^\top = \tfrac{1}{M}\big(w_p[u]M + \langle x_u, x_p\rangle - \langle x_u, w_pX\rangle\big)$$

where ⟨x_p, x_u⟩ is the inner product of vectors x_p and x_u. Therefore, we have

$$\lambda K[u \mid w_p] = \tfrac{1}{M}\Big(w_p[u]M + \sum_{i=1}^{M}\big(x_u[i]x_p[i] - x_u[i]\sum_{v \in V} w_p[v]x_v[i]\big)\Big) = \tfrac{1}{M}\Big(w_p[u]M + \sum_{i=1}^{M} x_u[i]\big(x_p[i] - \sum_{v \in V} w_p[v]x_v[i]\big)\Big)$$

Since $\sum_{i=1}^{M}(x_u[i])^2 = M$ and w_p[p] = 0 as described in Section 3, we have the following equation from Equation (5):

$$\lambda K[u \mid w_p] = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]\big(w_p[u]x_u[i] + x_p[i] - \sum_{v \in V} w_p[v]x_v[i]\big) = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]\big(x_p[i] - \sum_{v \in V\setminus\{p,u\}} w_p[v]x_v[i]\big) = \tfrac{1}{M}\sum_{i=1}^{M} x_u[i]\big(x_p[i] - x_p^{(u)}[i]\big)$$

Therefore, λK[u|w_p] = z_p[u|w_p] holds from Equation (4). □

Lemma 7 indicates that we can compute the KKT condition score in O(1) time if (1) an edge is included in the edge set and (2) we have parameter z_p[u|w_p] of the edge. We incrementally compute the upper and lower bounds after edge additions as follows:

Definition 5 (Incremental Computation). Let Λ be the number of edges in edge set A that are added to edge set U and w^i_p = (w^i_p[1], w^i_p[2], ..., w^i_p[N]) be the converged weight vector after the i-th edge addition (1 ≤ i ≤ Λ). In addition, let r^i_p = (r^i_p[1], r^i_p[2], ..., r^i_p[M]) be the residual vector that corresponds to weight vector w^i_p, given as follows:

$$r^i_p = x_p - w^i_p X \qquad (16)$$

Let Δ^i_p be the score that corresponds to the difference of the KKT condition score before and after the i-th edge addition:

$$\Delta^i_p = \tfrac{1}{\lambda}\big\{\|r^i_p - r^{i-1}_p\|_2^2 + \|w^i_p - w^{i-1}_p\|_1\big\} \qquad (17)$$

We compute the upper and lower bounds after the iterations based on Lemma 7 as follows if an edge is included in edge set A and added to edge set U in the κ-th addition (1 ≤ κ ≤ Λ):

$$\overline{K}[u \mid w^\Lambda_p] = \tfrac{1}{\lambda} z_p[u \mid w^\kappa_p] + \Delta_p \qquad (18)$$

$$\underline{K}[u \mid w^\Lambda_p] = \tfrac{1}{\lambda} z_p[u \mid w^\kappa_p] - \Delta_p \qquad (19)$$

In Equations (18) and (19), $\Delta_p = \sum_{i=1}^{\Lambda} \Delta^i_p$. If the edge is not included in edge set A, the bounds are computed as follows:

$$\overline{K}[u \mid w^\Lambda_p] = \overline{K}[u \mid w^0_p] + \Delta_p \qquad (20)$$

$$\underline{K}[u \mid w^\Lambda_p] = \underline{K}[u \mid w^0_p] - \Delta_p \qquad (21)$$

In Definition 5, K̄[u|w^0_p] and K̲[u|w^0_p] correspond to the bounds before the edge additions; K̄[u|w^Λ_p] and K̲[u|w^Λ_p] correspond to the bounds after the edge additions. Therefore, we can incrementally update the bounds by using z_p[u|w^κ_p], K̄[u|w^0_p], and K̲[u|w^0_p] from Definition 5. We have the following property for the bounds:

Lemma 8 (Incremental Computation). After the Λ-th edge addition, we have K̄[u|w^Λ_p] ≥ K[u|w^Λ_p] and K̲[u|w^Λ_p] ≤ K[u|w^Λ_p] for the bounds given by Definition 5 if the weights of edges are updated for weight vector w^Λ_p.

Proof. From Equations (6) and (16), we have the following equation from the Cauchy-Schwarz inequality [22] for the i-th edge addition (1 ≤ i ≤ Λ):

$$|K[u \mid w^i_p] - K[u \mid w^{i-1}_p]| = \tfrac{1}{\lambda}\big|\tfrac{1}{M}(r^i_p - r^{i-1}_p)x_u^\top + (w^i_p[u] - w^{i-1}_p[u])\big| \le \tfrac{1}{\lambda}\big\{\tfrac{1}{M}\|r^i_p - r^{i-1}_p\|_2^2\,\|x_u^\top\|_2^2 + \|w^i_p - w^{i-1}_p\|_1\big\} = \tfrac{1}{\lambda}\big\{\|r^i_p - r^{i-1}_p\|_2^2 + \|w^i_p - w^{i-1}_p\|_1\big\}$$

This is because we have $\|x_u^\top\|_2^2 = M$. Therefore, from Equation (17), $|K[u \mid w^i_p] - K[u \mid w^{i-1}_p]| \le \Delta^i_p$. As a result, for an edge included in the edge set,

$$|K[u \mid w^\Lambda_p] - K[u \mid w^\kappa_p]| = \Big|\sum_{i=\kappa+1}^{\Lambda}\big(K[u \mid w^i_p] - K[u \mid w^{i-1}_p]\big)\Big| \le \Big|\sum_{i=\kappa+1}^{\Lambda}\Delta^i_p\Big| \le \Big|\sum_{i=1}^{\Lambda}\Delta^i_p\Big| = \Delta_p$$

Thus, from Lemma 7, we have

$$\tfrac{1}{\lambda}z_p[u \mid w^\kappa_p] - \Delta_p \le K[u \mid w^\Lambda_p] \le \tfrac{1}{\lambda}z_p[u \mid w^\kappa_p] + \Delta_p$$

Similarly, for an edge that is not included in the edge set,

$$\underline{K}[u \mid w^0_p] - \Delta_p \le K[u \mid w^\Lambda_p] \le \overline{K}[u \mid w^0_p] + \Delta_p$$

which completes the proof. □

As described in Section 4.2.2, we compute the residuals for the added edges every time the edges change the regression result so as to efficiently update the weights. Since the coordinate descent updates weights one at a time, we can incrementally compute Δ^i_p and Δ_p in O(1) time from Equation (17). As a result, we can incrementally compute the bounds in O(1) time by exploiting Definition 5.
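A sketch of the bookkeeping this implies; the class layout and names are ours. With one-at-a-time coordinate updates, the difference terms of Equation (17) reduce to single-entry changes, which is what makes the per-update maintenance O(1):

```python
import numpy as np

class IncrementalBounds:
    """Maintain Delta_p of Equation (17) and the O(1) bounds of
    Equations (18)-(21) while edges are added one by one."""

    def __init__(self, lam):
        self.lam = lam
        self.delta = 0.0    # Delta_p, accumulated over edge additions

    def add_edge(self, r_new, r_old, w_new, w_old):
        # Equation (17) for the i-th addition. Written densely here;
        # in practice only the changed entries need to be touched.
        d = (np.sum((r_new - r_old) ** 2)
             + np.sum(np.abs(w_new - w_old))) / self.lam
        self.delta += d

    def bounds_added(self, z_kappa):
        # Equations (18)-(19): edge added in the kappa-th addition.
        center = z_kappa / self.lam
        return center + self.delta, center - self.delta

    def bounds_not_added(self, upper0, lower0):
        # Equations (20)-(21): edge never added in this round.
        return upper0 + self.delta, lower0 - self.delta
```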

In terms of computing the upper and lower bounds, the incremental approach of Definition 5 serves the same purpose as the SVD-based approach of Definition 4. However, they yield different bounds since they adopt totally different processes to compute the bounds. In addition, the incremental approach can compute the bounds more efficiently. Therefore, we first exploit the incremental approach to obtain the bounds for each edge. Then, we use the SVD-based approach only for edges that the incremental approach determines to need weight updates, as described in the next section.

4.4 Graph Construction Algorithm
Algorithm 1 gives a full description of Castnet, which efficiently constructs the lasso-based L1-graph. In Algorithm 1, W is an N × N matrix whose p-th row corresponds to weight vector w_p, and P is the set of nodes selected for weight computation. Castnet initializes weights based on the observation that the weight of an edge from node u to v is similar to that from node v to u; w_v[u] ≈ w_u[v] [4].


Algorithm 1 Castnet
Input: matrix X, tuning parameter λ, target rank of SVD m
Output: matrix W
 1: P = ∅;
 2: compute rank-m SVD of matrix X;
 3: for i = 1 to N do
 4:   p = argmax{‖w_u‖₁ | u ∈ V\P};
 5:   U = {e_p[u] | w_p[u] ≠ 0};
 6:   update weights for U by Equation (12);
 7:   for j = 1 to N do
 8:     K̲[u_j|w_p] = −∞, K̄[u_j|w_p] = ∞;
 9:   step = 1;
10:   while step ≤ 2 do
11:     for each u ∈ V\p do
12:       compute the bounds by Definition 5;
13:     if step = 1 then
14:       M[w_p] = ∅;
15:       for each u ∈ V\p do
16:         if K̲[u|w_p] < −1 or K̄[u|w_p] > 1 then
17:           compute the bounds of node u by Definition 4;
18:           if K̄[u|w_p] < −1 or K̲[u|w_p] > 1 then
19:             add edge e_p[u] to M[w_p];
20:       A = M[w_p]\U;
21:     else
22:       C[w_p] = ∅;
23:       for each u ∈ V\p do
24:         if K̄[u|w_p] > 1 or K̲[u|w_p] < −1 then
25:           compute the bounds of node u by Definition 4;
26:           if K̄[u|w_p] > 1 or K̲[u|w_p] < −1 then
27:             add edge e_p[u] to C[w_p];
28:       A = C[w_p]\U;
29:     for each u ∈ A do
30:       compute w_p[u] for weight vector w_p by Equation (10);
31:       if w_p[u] ≠ 0 then
32:         add edge e_p[u] to edge set U;
33:         update weights for U by Equation (12);
34:     if A = ∅ then
35:       step++;
36:   add node p to P;
37:   for each u ∈ V\P do
38:     w_u[p] = w_p[u];
return weights of the graph;

Algorithm 1 starts by initializing node set P and computing the SVD of data matrix X to allow the upper and lower bounds to be determined in the iterations (lines 1-2). It picks up the node that has the maximum L1-norm of its weight vector since, by the above observation, we can accurately perform the regression for that node (line 4). It then performs update computations for edges that have nonzero weights by using Equation (12) and initializes the bounds (lines 5-8). Our approach has two steps in computing the weights of a selected node; the first step is based on edge set M[w_p] and the second step exploits edge set C[w_p]. In order to prune unnecessary updates, it computes the bounds determined by the incremental approach as described in Section 4.3.2 (lines 11-12) and checks the conditions of edge sets M[w_p] and C[w_p] (lines 16 and 24). If an edge can meet the conditions of the edge sets (Definitions 2 and 3), it computes the bounds again by using the SVD-based approach to determine the edge sets (lines 17-19 and lines 25-27). As described in Section 4.2.3, it adds edges one by one to edge set U so as to increase the efficiency. Therefore, it computes edge set A after determining edge sets M[w_p] and C[w_p] (lines 20 and 28). If the weight of an edge is zero, we can efficiently compute its weight by using Equation (10) (line 30); such edges have no impact on the regression result, as shown in Lemma 5. Therefore, it updates weights until convergence only if the added edge has nonzero weight (lines 31-33). The weight vector clearly reaches convergence if we have no edge to add. Therefore, it proceeds to the next step of constructing the graph if we have A = ∅ (lines 34-35). After the iterations, it sets the weights of edges to unselected nodes based on the observation (lines 37-38).
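The per-node control flow of Algorithm 1 can be paraphrased as follows; the callables stand for the helpers sketched in earlier sections, and the naming is ours, not the paper's:

```python
def castnet_node(p, V, w_p, bounds_inc, bounds_svd, update_zero, converge):
    """Two-step skeleton of Algorithm 1 for one node p.

    bounds_inc / bounds_svd return (upper, lower) per Definitions 5 / 4;
    update_zero applies Equation (10); converge runs covariance-based
    sweeps (Equation (12)) over the active edge set U until convergence.
    """
    U = {u for u in V if u != p and w_p[u] != 0.0}   # warm-started edges
    converge(U)
    for step in (1, 2):
        while True:
            A = set()
            for u in V - {p} - U:
                up, lo = bounds_inc(u)               # cheap O(1) filter
                if step == 1 and not (lo < -1 or up > 1):
                    continue                          # cannot be in M[w_p]
                if step == 2 and not (up > 1 or lo < -1):
                    continue                          # cannot be in C[w_p]
                up, lo = bounds_svd(u)               # tighter O(m) bounds
                in_M = up < -1 or lo > 1             # Definition 2
                in_C = up > 1 or lo < -1             # Definition 3
                if (step == 1 and in_M) or (step == 2 and in_C):
                    A.add(u)
            if not A:
                break                                # proceed to next step
            for u in A:                              # gradual edge addition
                if update_zero(u) != 0.0:            # Equation (10)
                    U.add(u)
                    converge(U)
    return U
```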

In Algorithm 1, we implicitly assumed a single thread where we pick up nodes one by one from the entire set of nodes. However, Algorithm 1 is easily parallelized; we divide the set of nodes into several sets and perform Algorithm 1 in parallel for each obtained set. Since we compute edges from all the nodes for each obtained set, we can obtain the same graph as the single-thread approach even if we use the parallelization approach. We use k-means clustering to effectively divide the set according to the hidden structure of the data points. Although the k-means clustering is performed in a single thread, we can efficiently divide the entire set of nodes since (1) we use SVD to reduce the number of dimensions and (2) we apply k-means++ to increase the processing speed of the k-means clustering [1].
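A sketch of this parallelization under stated assumptions: run_castnet is a hypothetical helper that applies Algorithm 1 to a list of target node ids, and scikit-learn's KMeans provides the k-means++ clustering on the SVD-reduced features:

```python
import numpy as np
from multiprocessing import Pool
from sklearn.cluster import KMeans

def partition_nodes(X_tilde, n_parts):
    """Divide the node set along the hidden structure of the data,
    using k-means++ on the rank-m features X_tilde (N x m)."""
    labels = KMeans(n_clusters=n_parts, init='k-means++',
                    n_init=1).fit_predict(X_tilde)
    return [np.where(labels == c)[0] for c in range(n_parts)]

def construct_parallel(X_tilde, run_castnet, n_threads=8):
    """Run Algorithm 1 on each node subset in parallel. Every subset
    still regresses against all nodes, so the union of the per-subset
    results equals the single-thread graph."""
    parts = partition_nodes(X_tilde, n_threads)
    with Pool(n_threads) as pool:
        results = pool.map(run_castnet, parts)
    return results
```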

For Algorithm 1, we have the following properties:

Theorem 1 (Computation Cost). Let t be the number of update computations in our approach. Castnet takes O(d_p t + M(N + log m + t + d_p²)) time to compute the weights of a selected node in the lasso-based L1-graph.

Proof. Castnet first computes the SVD of matrix X used in computing the bounds in O(M log m) time amortized over the number of nodes [12]. In the SVD-based bound computations, it takes O(d_p m) time to compute norm ‖r̃_p‖₂ and O(Nm) time to compute the bounds. Similarly, we need O(N) time to compute the bounds by the incremental computation approach. In addition, we can compute edge sets M[w_p], C[w_p], and A in O(N) time. In performing update computations for edges whose weights are zero, it takes O(Mt) time to compute the residuals and O(M(N − d_p)) time to perform the update computations. In updating nonzero weights, it needs O(M d_p²) time to compute the inner products and O(d_p t) time to update weights. Therefore, Castnet needs O(d_p t + M(N + log m + t + d_p²)) time. □

In practice, the number of update computations t increases with the number of nodes N. However, we can reduce t if a node has small degree d_p. We show these relationships in Section 5.

Theorem 2 (Construction Result). If matrix X has full row rank, the proposed approach converges on the same regression result as the original coordinate descent approach in computing edge weights for a selected node.

Proof. As shown in Algorithm 1, Castnet first updates weights if K̄[u|w_p] < −1 or K̲[u|w_p] > 1 holds. It then updates weights if K̄[u|w_p] > 1 or K̲[u|w_p] < −1 holds. As shown in Lemma 3, we have K̄[u|w_p] > 1 or K̲[u|w_p] < −1 if K̄[u|w_p] < −1 or K̲[u|w_p] > 1 holds. Therefore, these two processes of the proposed approach indicate that we do not perform update computations for weights of edges such that K̄[u|w_p] ≤ 1 and K̲[u|w_p] ≥ −1. It is clear that such edges cannot have nonzero weights from the KKT condition (Definition 1). As a result, our approach does not prune edges that have nonzero weights. Since there is a unique regression result if matrix X has full row rank, the coordinate descent approach converges to the unique result [6, 24]. Since our approach is based on the coordinate descent, it is clear that Castnet converges to the same result as the original coordinate descent approach if matrix X has full row rank. □


[Figure 1: Graph construction time of each approach. Three panels, (1) λ = 0.1, (2) λ = 0.2, and (3) λ = 0.3, each plotting wall clock time [s] on a logarithmic scale for Castnet, Sling, and Original on the USPS, Reuters, Protein, SensIT, MSD, and INRIA datasets.]

If matrix X does not have full row rank, which is necessarily the case when N > M, there may not be a unique regression result. In this case, our approach yields almost the same result as the original coordinate descent approach. In the next section, we detail experiments that show the efficiency and effectiveness of the proposed approach.

5. EXPERIMENTAL EVALUATION

We performed experiments to demonstrate the effectiveness of our approach. In the experiments, we used six datasets taken from various domains: USPS, Reuters, Protein, SensIT, MSD, and INRIA. USPS is a popular dataset of handwritten digits captured from envelopes by the U.S. Postal Service^2. This dataset contains 2,007 grayscale images, each 16 × 16 pixels; the number of features is 256. Reuters is a corpus of newswire stories that contains 8,293 documents^3. In this dataset, tf-idf is used as the document feature; it has 18,933 dimensions. Protein is a set of amino acid sequences where the numbers of data and features are 17,766 and 357, respectively^4. SensIT is a dataset obtained from a distributed wireless sensor network for vehicle type classification^5. In this dataset, the numbers of data and features are 78,823 and 100, respectively. MSD is a music dataset that contains 515,345 songs of 90 features^6; the songs are mostly western, commercial tracks. INRIA consists of 1,000,000 image features extracted from both personal and Flickr photos, each of which is represented by a 128-D SIFT descriptor^7. Since we have N > M, matrix X does not have full row rank except for Reuters, for which matrix X has full row rank. The numbers of nodes in the datasets increase in the order of USPS, Reuters, Protein, SensIT, MSD, and INRIA.

In what follows, "Castnet", "Sling", and "Original" represent our approach, the state-of-the-art approach for the lasso [7], and the original coordinate descent approach [5], respectively. We set the target rank of SVD to 10 for the proposed approach. All the experiments were conducted on a Linux 2.70 GHz Intel Xeon server with 1 TB of main memory. We implemented all the approaches using GCC.

5.1 Graph Construction Time

We assessed the graph construction time needed for each approach. Figure 1 shows the results where we set tuning parameter λ to 0.1, 0.2, and 0.3.

^2 https://www.otexts.org/1577
^3 http://www.cad.zju.edu.cn/home/dengcai/Data/data.html
^4 http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
^5 http://mldata.org/repository/data/
^6 http://labrosa.ee.columbia.edu/millionsong/
^7 http://corpus-texmex.irisa.fr/

Table 2: Degree and updates in INRIA.
            λ     N = 250,000    N = 500,000    N = 1,000,000
Degree     0.1    16.77          15.02          15.32
           0.2    9.91           9.34           9.59
           0.3    6.76           6.82           7.17
Updates    0.1    5.44 × 10^5    1.03 × 10^6    2.04 × 10^6
           0.2    4.82 × 10^5    9.27 × 10^5    1.82 × 10^6
           0.3    4.22 × 10^5    7.56 × 10^5    1.53 × 10^6

In Figure 1, we omit the result of Sling for the MSD and INRIA datasets since it could not construct the L1-graph within a month. Similarly, we omit the results of the original approach for the SensIT, MSD, and INRIA datasets for the same reason. Table 2 shows the average degree and update computations of each node; INRIA was used as the dataset and we changed the number of nodes in the graph. To evaluate the scalability offered by our parallelization approach, we plot speedup over single thread vs. the number of threads for each graph in Figure 2. Figure 3 shows the computation overhead of k-means clustering in the parallelization approach.

Figure 1 shows that our approach is much faster than the previous approaches. Specifically, Castnet is up to 1,300 times faster than the original approach. In order to obtain edge weights, the original approach requires a different regression result for each weight as described in Section 3. Therefore, the original approach must compute the regression result every time it updates each weight. In addition, the original approach iteratively updates weights until convergence even though almost all edges have zero weights due to the sparse structure of the L1-graph, which explains its high computation cost. To improve the computation speed for the lasso, Sling updates weights by using the covariance-based approach as described in Section 2. Since the covariance-based approach does not need the different regression results, Sling can update weights in each iteration more efficiently than the original approach. However, the covariance-based approach needs inner product computations for all edges to update the weights before entering the iterations. On the other hand, our approach can avoid the heavy computations of inner products by using the two types of update computations as described in Section 4.2.2. In addition, our approach selectively updates weights by computing edge sets from upper and lower bounds of KKT condition scores as described in Section 4.2.1. As a result, our approach is up to 130 times faster than the state-of-the-art approach. As shown in Figure 1, the computation time of our approach increases with dataset size. This is because we need more update computations as the number of nodes increases, as shown in Table 2.
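For reference, the "Original" baseline corresponds to plain coordinate descent with soft-thresholding [5]. A minimal sketch is shown below, assuming unit-norm columns and the unscaled objective; note that every sweep recomputes a residual and touches all N weights, including the many that stay zero, which is exactly the cost Castnet avoids.

import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * max(abs(z) - lam, 0.0)

def lasso_cd(X, y, lam, n_sweeps=100):
    # Plain coordinate descent for (1/2)||y - Xw||^2 + lam*||w||_1,
    # assuming the columns of X have unit L2 norm. The paper iterates
    # until convergence; a fixed number of sweeps is used for brevity.
    M, N = X.shape
    w = np.zeros(N)
    for _ in range(n_sweeps):
        for u in range(N):
            r_u = y - X @ w + X[:, u] * w[u]  # residual excluding column u
            w[u] = soft_threshold(X[:, u] @ r_u, lam)
    return w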


Figure 2: Effect of parallelization approach. [Speedup over single thread (up to 30×) with 4, 8, 16, and 32 threads on each dataset.]

Figure 3: Overhead with parallelization. [Wall clock time [s] of the k-means clustering with 4, 8, 16, and 32 threads on each dataset, log scale.]

Figure 4: Effect of target rank number. [Wall clock time [s] of Castnet(10), Castnet(20), and Castnet(30) on each dataset, log scale.]

Table 3: Number of update computations.
Dataset    m = 10          m = 20          m = 30
USPS       2.362 × 10^7    2.296 × 10^7    1.833 × 10^7
Reuters    1.471 × 10^8    1.470 × 10^8    1.469 × 10^8
Protein    1.400 × 10^9    1.380 × 10^9    1.372 × 10^9
SensIT     1.373 × 10^10   1.341 × 10^10   1.092 × 10^10
MSD        4.258 × 10^11   1.975 × 10^11   7.196 × 10^10
INRIA      2.038 × 10^12   2.029 × 10^12   1.992 × 10^12

On the other hand, Figure 1 indicates that we can increase the efficiency by increasing tuning parameter λ. This is because, as shown in Table 2, the number of update computations decreases if we reduce the degree of each node by assigning large values to the tuning parameter. Since our approach can effectively prune unnecessary computations, it can construct the lasso-based L1-graph more efficiently than the previous approaches.

As described in Section 4.4, Castnet can be parallelized by using multiple threads. As shown in Figure 2, our parallelization approach is up to 25 times faster than the single-thread approach with 32-thread execution. Although our parallelization approach needs the additional process of k-means clustering in a single thread, its computation overhead has a negligible impact on the graph construction time. This is because we can efficiently perform the k-means clustering by using SVD and k-means++, as shown in Figure 3; it takes at most 5 seconds to perform the clustering for the largest dataset. Therefore, we can effectively increase the processing speed by exploiting the parallelization approach.

5.2 Efficiency vs. Target Rank Number

As described in Section 4.3, our approach computes the SVD of rank m for the data matrix to increase the computation speed of the upper and lower bounds. Since the bounds are used in computing the edge sets for updating weights, the target rank of SVD is expected to impact the graph construction time. In this section, we show the graph construction times recorded for different target rank values of SVD. Figure 4 shows the results where we set λ = 0.1 as the tuning parameter of the lasso. In this figure, the results of our approach are indicated by "Castnet(m)" where m is the target rank of SVD. In addition, Table 3 shows the number of update computations for all the nodes needed by our approach to construct the L1-graph when λ = 0.1.

As shown in Figure 4, the graph construction time varies only slightly with the target rank of SVD for the data matrix; our approach is not sensitive to rank m in terms of efficiency. As described in Section 4.3, it takes O(m) time to compute the upper and lower bounds by SVD.

Table 4: Regression results of each approach.
Dataset    Approach    Objective function    Squared loss    L1-norm constraint
USPS       Castnet     0.1355                0.0348          0.1007
           Sling       0.1355                0.0348          0.1007
           Original    0.1355                0.0348          0.1007
Reuters    Castnet     0.3985                0.3360          0.0625
           Sling       0.3985                0.3360          0.0625
           Original    0.3985                0.3360          0.0625
Protein    Castnet     0.3323                0.1651          0.1672
           Sling       0.3323                0.1651          0.1672
           Original    0.3323                0.1651          0.1672

Therefore, we can increase the efficiency of computing the bounds if rank m is small. On the other hand, as shown in Table 3, we can reduce the number of update computations if rank m is large. This is because we can improve the accuracy of SVD in approximating the data matrix by increasing the target rank. In conclusion, we can efficiently compute the bounds by reducing the target rank at the cost of an increased number of update computations, whereas we can efficiently update weights with a large target rank that incurs high computation cost for the bounds. Due to this trade-off derived from rank m, our approach is not sensitive to the target rank of SVD.
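The shape of a rank-m bound makes this trade-off visible. The helper below is our own illustration, not the paper's exact derivation in Section 4.3: writing x_u and the residual r_p as their coordinates in the top-m singular subspace plus out-of-subspace remainders, Cauchy-Schwarz [22] confines x_u^T r_p to an interval whose width grows with the remainder norms, so a larger m tightens the bounds at a higher cost of maintaining them.

def kkt_score_bounds(x_m, r_m, x_tail_norm, r_tail_norm, lam):
    # x_m, r_m: length-m coordinates of x_u and the residual r_p in the
    # top-m singular subspace of X (precomputed once per node);
    # x_tail_norm, r_tail_norm: norms of the remaining components.
    # By Cauchy-Schwarz, |x^T r - x_m^T r_m| <= x_tail_norm * r_tail_norm,
    # giving O(m)-time lower/upper bounds on K = x^T r / lam.
    core = float(x_m @ r_m)
    slack = x_tail_norm * r_tail_norm
    return (core - slack) / lam, (core + slack) / lam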

5.3 Effectiveness of the Graph Structure

As described in Section 4.4, one major advantage of our approach is that it yields the same L1-graph as the original approach if matrix X has full row rank. However, our approach can output different results from the original approach if matrix X does not have full row rank, since there may not be a unique lasso solution in terms of the objective function introduced in Section 3 (Equation (1)). In this section, we evaluate the effectiveness of the regression result in constructing the L1-graph of each approach. Table 4 shows the averages of the objective function in computing edge weights for each node in the L1-graph when λ = 0.1. In the table, "Squared loss" and "L1-norm constraint" correspond to the first and second terms of the objective function, respectively. Note that the datasets used in this experiment do not have full row rank due to N > M, except for the Reuters dataset. In addition, like our approach, Sling can output different regression results from the original approach if matrix X does not have full row rank [7].

Table 4 indicates that our approach yields the same objective function scores as the original approach for the Reuters dataset, which does have full row rank. This is because our approach provably guarantees to output the same solution for the L1-optimization problem, as described in Theorem 2.


Table 5: Label propagation.
                          Castnet    L1        k-NN      LN
Precision                 0.581      0.576     0.433     0.554
Graph construction [s]    19.19      78.17     3.35      7.19
Number of edges           34,804     39,329    36,926    30,426

Although our approach can yield different regression results if matrix X does not have full row rank, the graph structures obtained by our approach were almost the same as those obtained by the original approach. This is because we perform update computations for edges that must/can have nonzero weights by computing the upper and lower bounds of KKT condition scores, as shown in Algorithm 1. Since KKT condition scores correspond to gradients in coordinate descent [5], our approach can identify the edges that effectively improve the objective function by obtaining the edge set that must/can have nonzero weights. Therefore, our approach can effectively compute the solution for the optimization problem of the lasso by using the upper and lower bounds. As a result, the objective function scores of our approach and the original approach are the same in practice for the USPS and Protein datasets, as shown in Table 4, even though these datasets do not have full row rank. Along with Figure 1, Table 4 indicates that our approach significantly improves the efficiency of constructing the L1-graph while the quality of the graph structure is as high as that of the previous approaches.

6. CASE STUDIES

In order to effectively find the hidden structure of data points, it is important to construct a graph that is robust against noise. Our approach allows users to efficiently construct the lasso-based L1-graph so as to enhance the quality of applications more than is possible with traditional k-NN graphs. This section compares our approach to a k-NN graph as well as the L1-graph proposed by Cheng et al. [2] and the linear neighborhood graph proposed by Wang et al. [25]. The L1-graph by Cheng et al. uses only the L1-norm constraint as the objective function, while our approach uses the sum of the squared loss and the L1-norm constraint as described in Section 3. The linear neighborhood graph by Wang et al. computes the edge weights of each node as a linear combination of its neighbors. The contrast between the two L1-graph formulations is sketched below.
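Concretely, the two formulations can be written as follows, where x_p is the selected node and X_{\setminus p} collects the remaining data points. This is our rendering of the standard forms; the exact statement of Equation (1) is given in Section 3 and may differ in scaling:

\min_{w_p} \|w_p\|_1 \quad \text{s.t.} \quad x_p = X_{\setminus p}\, w_p \qquad \text{(Cheng et al. [2])}

\min_{w_p} \tfrac{1}{2}\, \|x_p - X_{\setminus p}\, w_p\|_2^2 + \lambda\, \|w_p\|_1 \qquad \text{(lasso-based, Equation (1))}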

We used the COIL-100 dataset, which contains images of 100 objects^8. The objects were placed on a turntable that was rotated through 360 degrees to vary object pose with respect to a fixed camera. Images of each object were taken at pose intervals of 5 degrees, i.e., 72 poses per object, resulting in 7,200 images. We used bags of keypoints of 128 dimensions as image features, where we exploited the SIFT descriptor to compute the feature vectors [3]. Since we have N > M, data matrix X does not have full row rank for this dataset. In order to evaluate the effectiveness, we used label propagation [9]; it estimates the labels of unlabeled data points from labeled data points. To this end, it propagates labels to the unlabeled nodes in the given graph from a labeled seed node. We randomly selected a data point from each object as the seed node. Table 5 shows the labeling accuracy and graph construction time of each approach; labeling accuracy is the ratio of labeled nodes that correspond to the same object as the seed node. Table 5 also shows the number of edges obtained by each approach.

^8 http://www1.cs.columbia.edu/CAVE/software/

Table 6: Edge weights of connected nodes by each approach (the node images of the original table are omitted).
Castnet    0.610   0.512   0.653   0.161   0.609   0.353
           0.394   0.008   0.014   0.002   0.061   0.060
L1         0.422   0.350   0.330   0.273   0.418   0.279
           0.313   0.002   0.270   0.058   0.255   0.070
k-NN       0.295   0.282   0.252   0.222   0.221   0.206
           0.254   0.184   0.220   0.219   0.199   0.190
LN         0.350   0.339   0.374   0.277   0.353   0.299
           0.305   0.005   0.193   0.194   0.167

In addition, Table 6 shows examples of connected nodes and their edge weights obtained by each approach. In Tables 5 and 6, "L1" and "LN" correspond to the results of the L1-graph by Cheng et al. [2] and the linear neighborhood graph by Wang et al. [25]. In constructing the k-NN graph, we used FLANN, which approximately finds the nearest neighbors; it is a popular approach used in OpenCV [18]. We set the parameters of FLANN so that it yields an accuracy of 90% in computing the nearest neighbors since this setting is widely used in OpenCV. Note that FLANN outperforms LSH-based approaches in terms of efficiency. We set λ = 0.4 to construct the L1-graph and computed the top four nodes to obtain edges in the k-NN graph. As a result, each approach yielded a similar number of edges, as shown in Table 5. For the k-NN graph, we computed edge weights from normalized graph Laplacians [9]. Note that our approach yields the same labeling results as the original coordinate descent approach.
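For context, a minimal sketch of label propagation in the style of [9] is shown below; it iterates F <- alpha*S*F + (1 - alpha)*Y over a symmetrically normalized weight matrix S. This is a generic sketch under our own naming, not necessarily the exact variant used in the experiments.

import numpy as np

def label_propagation(W, Y, alpha=0.9, n_iter=50):
    # W: (N, N) symmetric nonnegative edge weights (e.g., a symmetrized
    # L1-graph); Y: (N, C) one-hot seed labels with zero rows for the
    # unlabeled nodes. Returns the predicted label of every node.
    d = W.sum(axis=1)
    d[d == 0] = 1.0                      # guard against isolated nodes
    S = W / np.sqrt(np.outer(d, d))      # D^{-1/2} W D^{-1/2}
    F = Y.astype(float).copy()
    for _ in range(n_iter):
        F = alpha * (S @ F) + (1 - alpha) * Y
    return F.argmax(axis=1)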

Table 5 indicates that our approach can identify semantically equivalent nodes more effectively than the other graphs. In addition, as shown in Table 6, our approach successfully connected semantically equivalent nodes to the picked-up nodes. This is because, by effectively using the regression approach, the lasso-based L1-graph can connect semantically equivalent nodes; semantically equivalent nodes comprise a cluster structure in the high-dimensional feature space. In terms of constructing the L1-graph, the approach by Cheng et al. is similar to our approach. However, since their approach does not use the squared loss in the objective function, it is not robust against noise, as described in the previous study [4].


This indicates that our approach can improve the robustness against noise by using the squared loss in the objective function. For example, in the third case of Table 6, where a white car is the node, both our approach and theirs connected a green car to the node. However, our approach assigns a much lower edge weight to the semantically different node than to semantically equivalent nodes, whereas their approach assigns a high edge weight to the semantically different node. In addition, Table 5 indicates that our approach is more efficient than their approach in constructing the L1-graph. This is because our approach skips unnecessary weight updates. In terms of efficiency, the k-NN graph can be constructed more efficiently than the lasso-based L1-graph, as shown in Table 5. However, the k-NN graph as well as the linear neighborhood graph offers lower labeling accuracy than our approach. This is because, as shown in Table 6, they are not as effective as our approach in connecting semantically equivalent nodes. For example, in the first case, where a white cat is the node, the k-NN graph connects a red doll to the node. In addition, they give high weights to semantically different nodes, as shown in Table 6.

Although the traditional k-NN graph is the most popular approach, it is sensitive to data noise, as shown in Table 5. Effectiveness is quite important in many applications such as lesion detection, as described in Section 1.1; we can reduce deaths caused by prostate cancer by effectively performing lesion detection. The lasso-based L1-graph was proposed to address this sensitivity to data noise. Although it overcomes the drawback, the original coordinate descent approach does not scale well to large data sets, as shown in Figure 1. The contribution of our approach is to significantly increase the efficiency of constructing L1-graphs without sacrificing the usefulness of the graph structure. As a result, the proposed approach is an attractive option for the research community in constructing lasso-based L1-graphs.

7. CONCLUSIONS

We addressed the problem of efficiently constructing the lasso-based L1-graph. Our approach, Castnet, limits update computations to the edges that must/can have nonzero weights before entering the iterations by computing the upper and lower bounds of KKT condition scores. In addition, our approach efficiently updates nonzero weights in each iteration by pruning edges whose weights are zero. Experiments show that our approach achieves higher efficiency than existing approaches. Since the L1-graph can effectively capture the cluster structures that share useful semantic information, it plays a fundamental role in many applications such as brain analysis, motion segmentation, and lesion detection. Our approach allows many applications to be implemented more efficiently, and will help to improve the effectiveness of future data mining applications.

8. REFERENCES

[1] D. Arthur and S. Vassilvitskii. k-means++: The Advantages of Careful Seeding. In SODA, pages 1027–1035, 2007.
[2] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. S. Huang. Learning with l1-graph for Image Analysis. IEEE Trans. Image Processing, 19(4):858–866, 2010.
[3] C. Dance, J. Willamowski, L. Fan, C. Bray, and G. Csurka. Visual Categorization with Bags of Keypoints. In ECCV International Workshop on Statistical Learning in Computer Vision, 2004.
[4] E. Elhamifar and R. Vidal. Sparse Subspace Clustering. In CVPR, pages 2790–2797, 2009.
[5] J. Friedman, T. Hastie, H. Hofling, and R. Tibshirani. Pathwise Coordinate Optimization. Annals of Applied Statistics, 1(2):302–332, 2007.
[6] J. H. Friedman, T. Hastie, and R. Tibshirani. Regularization Paths for Generalized Linear Models via Coordinate Descent. Journal of Statistical Software, 33(1):1–22, 2010.
[7] Y. Fujiwara, Y. Ida, H. Shiokawa, and S. Iwamura. Fast Lasso Algorithm via Selective Coordinate Descent. In AAAI, pages 1561–1567, 2016.
[8] Y. Fujiwara and G. Irie. Efficient Label Propagation. In ICML, pages 784–792, 2014.
[9] Y. Fujiwara, G. Irie, S. Kuroyama, and M. Onizuka. Scaling Manifold Ranking Based Image Retrieval. PVLDB, 8(4):341–352, 2014.
[10] Y. Fujiwara, M. Nakatsuji, H. Shiokawa, Y. Ida, and M. Toyoda. Adaptive Message Update for Fast Affinity Propagation. In SIGKDD, pages 309–318, 2015.
[11] Y. Fujiwara, M. Nakatsuji, H. Shiokawa, T. Mishima, and M. Onizuka. Fast and Exact Top-k Algorithm for PageRank. In AAAI, 2013.
[12] N. Halko, P. Martinsson, and J. A. Tropp. Finding Structure with Randomness: Probabilistic Algorithms for Constructing Approximate Matrix Decompositions. SIAM Review, 53(2):217–288, 2011.
[13] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer, 2011.
[14] S. Ji. Computational Network Analysis of the Anatomical and Genetic Organizations in the Mouse Brain. Bioinformatics, 27(23):3293–3299, 2011.
[15] S. Liao, Y. Gao, and D. Shen. Sparse Patch Based Prostate Segmentation in CT Images. In MICCAI, pages 385–392, 2012.
[16] N. Meinshausen and P. Buhlmann. High-Dimensional Graphs and Variable Selection with the Lasso. The Annals of Statistics, 34(3):1436–1462, 2006.
[17] T. Mishima and Y. Fujiwara. Madeus: Database Live Migration Middleware under Heavy Workloads for Cloud Environment. In SIGMOD, pages 315–329, 2015.
[18] M. Muja and D. G. Lowe. Scalable Nearest Neighbor Algorithms for High Dimensional Data. IEEE Trans. Pattern Anal. Mach. Intell., 36(11):2227–2240, 2014.
[19] M. Nakatsuji, Y. Fujiwara, A. Tanaka, T. Uchiyama, and T. Ishida. Recommendations over Domain Specific User Graphs. In ECAI, pages 607–612, 2010.
[20] W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery. Numerical Recipes, 3rd Edition. Cambridge University Press, 2007.
[21] H. Shiokawa, Y. Fujiwara, and M. Onizuka. SCAN++: Efficient Algorithm for Finding Clusters, Hubs and Outliers on Large-scale Graphs. PVLDB, 8(11):1178–1189, 2015.
[22] J. M. Steele. The Cauchy-Schwarz Master Class: An Introduction to the Art of Mathematical Inequalities. Cambridge University Press, 2004.
[23] R. Tibshirani. Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society, Series B, 58:267–288, 1996.
[24] R. Tibshirani, J. Bien, J. Friedman, T. Hastie, N. Simon, J. Taylor, and R. J. Tibshirani. Strong Rules for Discarding Predictors in Lasso-type Problems. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(2):245–266, 2012.
[25] F. Wang and C. Zhang. Label Propagation through Linear Neighborhoods. IEEE Trans. Knowl. Data Eng., 20(1):55–67, 2008.
