Enh Aced Mod Com Detect

Embed Size (px)

Citation preview

  • 8/12/2019 Enh Aced Mod Com Detect

    1/13

    Enhanced Modularity-based Community Detection by Random Walk Network

    Preprocessing

    Darong Lai and Hongtao LuDepartment of Computer Science and Engineering, Shanghai Jiao Tong University,

    800 Dong Chuan Road, 200240, Shanghai, China

    Christine Nardini

    Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology,Chinese Academy of Sciences, 320 Yue Yang Road, 200031, Shanghai, China

    (Dated: April 20, 2010)

    The representation of real systems with network models is becoming increasingly common andcritical to both capture and simplify systems complexity, notably, via the partitioning of networksinto communities. In this respect, the definition of modularity, a common and broadly used qualitymeasure for networks partitioning, has induced a surge of efficient modularity-based communitydetection algorithms. However, recently, the optimization of modularity has been found to showa resolution limit, which reduces its effectiveness and range of applications. Therefore, one recenttrend in this area of research has been related to the definition of novel quality functions, alternativeto modularity. In this paper, however, instead of laying aside the important body of knowledge de-veloped so far for modularity-based algorithms, we propose to use a strategy to preprocess networksbefore feeding them into modularity-based algorithms. This approach is based on the observationthat dynamic processes triggered on vertices in the same community possess similar behavior pat-

    terns but dissimilar on vertices in different communities. Validations on real-world and syntheticnetworks demonstrate that network preprocessing can enhance the modularity-based community de-tection algorithms to find more natural clusters and effectively alleviates the problem of resolutionlimit.

    PACS numbers: 89.75.Hc

    I. INTRODUCTION

    Many real complex systems have recently been conve-niently modeled as networks, with elements as verticesand relations between pairs of vertices as edges. Al-though as diverse as real complex systems can be, thesenetworks exhibit number of interesting and importantcommon properties, such as power-law degree distribu-tion, small-world average shortest paths and high localinterconnections, i.e. high clustering coefficients [13].Recent researches reveal that networked systems are or-ganized as interconnected subgroups of vertices, calledcommunities. For example, communities may be sets ofweb pages with common topics [4], functional modules[5, 6] and groups of peoples in different organizations [5].

    Communities are in general subgroups of vertices withmore chances to interconnect in the same subgroup butless often to connect to other vertices in different sub-groups. Since the seminal work of community structureanalysis in complex networks by Girvan and Newman [5],

    this topic has attracted growing attention. Modularity,an important and popular quality function, is then pro-posed to qualify the community decomposition [7], whichgreatly drives the advance of the researches on commu-nity detection. Modularity is often calculated based onthe comparison of the actual network with its null model

    [email protected]

    (or configuration model) [8]. The null model is a networkwith the same number of vertices and edges incident toeach vertex, but randomly rewired. The larger the valueof the modularity of a partition, the more exactly thenetwork can be decomposed into its real communities.Modularity was first used in a class of divisive algorithmsto select the partition with the highest value [7], and has

    frequently been used as a quality index ever since. Worksof this type include spectral methods based on Laplacianmatrix [9], information based divisive algorithm [10] andmore. Modularity has also been used as a target functionto be optimized directly, in algorithms such as simulatedannealing [11], extremal optimization [12], spectral opti-mization [8, 13, 14] (for a more complete description seereview [15]).

    Although modularity is important and popular, For-tunato and Barthelemy recently found that modularityoptimization has a resolution limit [16], indicating thatclusters that are small with respect to the size of thewhole network tend to be merged into larger ones by

    modularity optimization, even if they are cliques. Severalworks have tried to address this problem. For example,some works concentrate on recursively maximizing mod-ularity for each single cluster found by previous partitions[16, 17], without warranty of finding the proper commu-nities; others have tried to define novel quality indexesas alternatives, such as modularity density [18], whichare completely different from modularity-based ones andforce to abandon the large body of previously developedefficient modularity-based methods. A spin model-based

  • 8/12/2019 Enh Aced Mod Com Detect

    2/13

    2

    formulation with a tunable parameter has also been pro-posed [19, 20]. In fact, the spin model-based method isdesigned not to address the resolution limit of modular-ity but to generalize modularity in view of statistical me-chanics in physics. It is thus expected that such a gener-alized quality function (Hamiltonian) has resolution limitas well [20]. A parameter is used to tune modular reso-lution to reveal underlying hierarchical structure in net-

    works if a prior information on the community structureis available. But in many cases the community structureis not known in advance, thus there is no simple wayto find which will give the most probable mesoscaledescription of the network. As it has already been rec-ognized [20], when the size distribution of communitiesof a network is broad (for example, collaboration net-works), there is not unique properto identify the mostprobable communities. Besides the aforementioned ef-forts, little attention has been drawn to preprocess thenetworks before they are fed into modularity-based algo-rithms. Arenas et al. [21] worked in this direction andproposed a method to modify the topology of the originalnetwork by adding self-loops with an identical weight rto every vertex and then used modularity optimization todetect dozens of possible candidates of community par-titions. Such a method needs to sample a great numberof values of r, and it is usually difficult to choose oneor more probable partitions from so many possible can-didates ifa prioriinformation is lacking (for larger net-works this becomes increasingly difficult since partitionswith equal number of groups are not necessary identi-cal). In this paper we similarly avoid developing a newcommunity detection algorithm, but we aim at combin-ing modularity-based methods with a network prepro-cessing strategy to analyze network communities. Wewill show that modularity-based methods coupled with

    network preprocessing can alleviate the problem of reso-lution limit in an effective way. Several other works usea similar strategy to attack multiscale modular structureor overlapping communities of networks [22, 23], with thehelp of the concept of stability of a network [24]. How-ever, these methods are fundamentally different from theone proposed in this paper with respect to the problemsaddressed and the underlying methodology (more thor-ough discussions of these differences will be given laterwhen the proposed method is detailed).

    Community structure reflects that each community ina network is to some extent independent from other com-munities. If a dynamic process as random walk (a process

    that a walker randomly takes from one vertex to one of itsneighbors with a probability proportional to the weightof the edge connecting them) is triggered on one vertex,the process is more likely to persist on the vertices inthe same community and far less on the vertices in dif-ferent communities. Based on this fact, random walkersstarting from vertices in the same community are likelyto have similar patterns when visiting or influencing thewhole network. In other words, random walkers startingfrom vertices in the same community will have similar

    trajectories when they randomly walk on the network,and will be likely to have very different trajectories fromthose of random walkers starting from vertices in differ-ent communities. If a suitable measure to capture suchtype of relations is available, we can differentiate edgesby the partial and incomplete information on the commu-nity membership of their connected vertices, and devisea strategy to sharpen the differences.

    In this paper, we indeed use random walk as a surro-gate of the dynamic processes on the network to unveilsuch relationship. If two random walkers starting fromtwo different vertices have highly similar patterns, thevertices are said to be in the same community with highprobability, but if the patterns are very different, thevertices are in the same community with very low proba-bility. As a result, we can set high positive weight on anedge to show that its two connected vertices are in thesame community with high probability, and very low pos-itive weight otherwise. Obviously, the similarity of thepatterns associated with any two vertices can be used asa way to increase or decrease the weight on the edge join-ing the vertices. After reweighting the edges on the orig-inal network, the topology of the network appears to bechanged, as caused by our diminishing uncertainty on thecommunity membership of the vertices on the network.By iteratively operating the same strategy on the newlyweighted network, i.e. continuing to reduce the uncer-tainty, we can sharpen the differences of weights betweenintra- and inter-community edges. After a few rounds ofedge reweighting, the community structure of the origi-nal network becomes more obvious since the weights ofthe inter-community edges nearly vanish. Interestingly,we can show that the cost of the network preprocessing islow, while the ability of the modularity-based communitydetection algorithms to find more natural communities in

    networks is enhanced.In the following, after briefly introducing the concept

    of modularity and its resolution limit, in Section II weextend the concept on binary networks to weighted ones,a step necessary to apply weighting concepts to modu-larity, originally defined on binary networks. We thenpropose our combined strategy to find communities innetworks in Section III. We also give in Section IV aseries of examples of application of the proposed methodto real and synthetic networks with known communitystructure, to show its effectiveness. We further discussthe proposed method and draw conclusions in Section V.

    II. MODULARITY AND ITS RESOLUTION

    LIMIT

    Modularity guides us to identify the best decomposi-tions of networks into communities. In this section, wefirst introduce the modularity for binary networks andits resolution limit when it is optimized, and then extendthe analysis to weighted networks.

  • 8/12/2019 Enh Aced Mod Com Detect

    3/13

    3

    A. Binary Networks

    An undirected binary network is usually associatedwith a symmetric adjacency matrix A, whose elementaij = 1 if there exists an edge between vertex i and j,andaij = 0 otherwise. Vertex degree kiis the total num-ber of edges incident to vertex i. Given a partition of anundirected network intoCgroups, modularity is defined

    as [7, 8]:

    Q=Cs=1

    ls

    L (

    ds

    2L)2

    = 1

    2L

    ij

    (aij kikj

    2L )(ci, cj)

    (1)

    where L is the total number of edges in the network andthe sum is over Cgroups which can be cast into element-wise sum by function (, )((ci, cj) = 1 ifi and j are inthe same community, i.e. community index ci = cj , and0 otherwise); ls is the number of edges within the group

    s, and the total degree of vertices in this group is ds.The goal of modularity-based community detection algo-rithms is to maximize the modularity. Modularity max-imization sets an intrinsic scale to the communities thatcan be found, which means that communities that aresmaller than this scale may not be discovered [16]. Thisresolution limit depends on the total number of edgesin the network and the degree of interconnectedness be-tween pairs of communities, which can be as large as theorder of size of the network. Fortunato and Barthelemydefined a sub-graph to be a valid community if in a binarynetwork it satisfies:

    ls

    L (

    ds

    2L )2

    >0 (2)

    To illustrate the behavior of modularity maximization,we briefly present an adaptation of the example usedby Fortunato and Barthelemy to illustrate the concept[16]. Consider a binary network with L edges and withat least three communities (community C1, C2 and therest of the network Rnet that contains one or more com-munities). All these communities satisfy the definitionin Eq. 2. Two different partitions of this network areconsidered. The first partition A considersC1 and C2 asindependent communities, while the second partition Bmerges them into one larger community. The partition

    of the rest of the networkR

    netwith respect toC1

    andC2

    is the same in both partition A and B. Intuitively, it ismore reasonable to say partition A is more natural thanpartition B since both sub-graphsC1and C2 are commu-nities under the definition given in Eq. 2. Consequently,the modularity QA of partition A should be larger thanthe modularity QB of partition B, i.e. QA > QB, whichimplies:

    l2 > 2La1

    (a1+ b1+ 2)(a2+ b2+ 2) (3)

    Here,l2is the number of edges in C2;a1is the ratio of thenumberlC1C2 of edges between C1 and C2 to the numberof edgesl1in C1 , anda2 is the one between lC1C2 andl2.Similarly, b1 is the ratio of the number of edges betweenC1 and Rnet to l1, and b2 is the ratio of the number ofedges between C2 andRnet to l2. If there is no edge be-tweenC1 and C2, i.e. a1 = a2 = 0, we always obtain par-tition A as we maximize modularity. In an undirected bi-

    nary network, however, there must exist at least one edgebetween C1 and C2 if they are directly connected. Underthe simplest case where l1= l2 = l, the minimal value ofthe coefficients is a1 = a2 = b1 = b2 =

    1

    l. In this case,

    if l2

    L2

    1, we obtain partition B as optimal. This

    analysis demonstrates that modularity sets a resolutionscale to the communities found and some of them maybe the combinations of other smaller communities belowthis scale. However, the network should be decomposedinto natural groups, with more edges within groups butfar fewer between them. If a binary network can be trans-formed into a weighted one by incorporating informationon the structure of the network, a possible solution to

    this issue would be to add more weights inside the groupsby putting fewer weights on the inter-community edges,iteratively increasing the differences between inter- andintra-community edges. The idea of random walk basediterative edge reweighting has been used in Markov Clus-ter Algorithm [25] and random walk based hierarchicalclustering [26], however, to the best of our knowledge,its implications in network modularity were not foreseen.Weighting edges by random walk has also been found tooptimize modularity [22, 23], with the help of the con-cept of stability of a network [24]. Although both stabil-ity weighting and our method determine edge weights byrandom walks within a certain time scale (called random

    walk lengthin the current paper), the stability weightingstrategy consists of a single time weighting and is funda-mentally different from the one used in the current paper.Stability weighting, in fact, tends to add new edges to orremove old edges from the original network by increasingor decreasing the time scale, which modifies the connect-edness of a network while our method does not, since itonly weights differently already existing edges accordingto the knowledge on inter- and intra- community edges.More importantly, stability weighting uses directly therandom walk probabilities as similarities, which biasesthe weighting process since it gives higher values betweenhigh degree vertices and other vertices. Conversely, ourapproach uses the sum of probabilities as dimensionalproperties of vertex vectors and thus the similarities areinduced from the similarities between vertex vectors (seeSection III for more details).

    B. Weighted Networks

    To adapt the concept of iterative reweighting to edges,a first necessary step consists of extending the modularity

  • 8/12/2019 Enh Aced Mod Com Detect

    4/13

    4

    concept from binary networks to weighted ones. This canbe easily done since weighted networks can be regardedas binary networks with multiple edges between pairs ofvertices, if suitable weight unit is chosen [27]. A weightednetwork is determined by its weighted adjacency matrixW, whose element wij is a positive real number if thereexists an edge between vertex i and j, and 0 otherwise.The weighted extension of modularity is [28]:

    Qw =

    Cs=1

    wss

    M (

    ws

    2M)2

    = 1

    2M

    ij

    (wij wi.wj.

    2M )(ci, cj)

    (4)

    here M is the total weight of edges in the network; wiand ws are respectively the weighted degree of vertex iand groups, whilewssis the total weight of edges withingroup s. Similarly, the extension of the valid definitionof community in a weighted network is:

    wss

    M (

    ws

    2M)2

    >0 (5)

    Since in an undirected binary network a1 = a2 = b1 =b2 =

    1

    lis the minimal value that we can obtain, nothing

    can be done to improve the resolution scale. However,if the binary network is transformed into a weighted onein an effective way, the minimal value of these coeffi-cients can be as small as desired, since we can weigh theinter-community edges. Moreover, if the cost of networkpreprocessing is not excessively high we are sure to pre-process the networks. The benefit is that we can usealready developed algorithms without any modificationand we do not add important extra costs to achieve bet-

    ter results. With this in mind, we first consider a simpleextreme case and assume aw1

    = aw2

    = bw1

    = bw2

    = wminw

    with the total weights of C1 and C2 in a weighted net-work being w1 and w2, and set w1 = w2 = w. As aresult, we obtain partition A as our optimal partitionwhen maximizing the modularity if:

    w2 = w > 2M aw

    1

    (aw1

    + bw1

    + 2)((aw2

    + bw2

    + 2)

    = 2Mwmin

    w

    (wminw

    + wminw

    + 2)(wminw

    + wminw

    + 2)

    (6)

    Conversely, ifw

    Mwmin2

    wmin, we get partition B

    as optimal by maximizing the modularity of the weightednetwork. Ideally, if we can put infinitesimal weights onthe inter-community edges, i.e. wmin 0, we can over-come the resolution limit problem imposed by modular-ity. That is to say, we will always find the more naturalpartition A of the network into communities.

    III. RANDOM WALK NETWORK

    PREPROCESSING

    Dynamic processes like random walk on a network canexplore the underlying network structure and this con-cept has already been employed to reveal communitiesin networks [2224, 2933]. Starting from a vertex, awalker will randomly move to one of the neighbors of

    that vertex in the next time step with a probability de-termined by a transition matrix P. The element pij ofthe transition matrix is the ratio between the weight ofedge (i, j) and the total weight of edges associated tovertex i, i.e. pij =

    aijki

    or pij = wijwi

    . In recent re-searches, it has been found that a random walker will betrapped for a longer time within than between communi-ties [22, 24, 32]. In other words, random walkers startingfrom vertices in the same community will behave simi-larly when they randomly walk across the network. Inthe following, for convenience, we use vertex behavior toindicate the trajectory of a random walker starting fromit. Based on this, if two vertices have very similar be-

    haviors, we can say with high confidence that -if they areconnected- these two vertices are in the same community,and thus the edge connecting them is very likely to bean intra-community edge. Conversely, large differencesbetween behaviors of two connected vertices mean thatthe edge they are connected with is very likely to be aninter-community edge.

    In order to get some sense of this approach, we brieflyintroduce an example making use of a social network(Zachary Karate Club Network [34]), which will be dis-cussed in more details later in Section IV to compare thebehaviors of vertices in the same versus different commu-nities. In the real community partition of this network,vertex 2 and 8 are in the same community, while vertex

    1 and 34 are in different ones. As expected, we can seefrom Fig. 1(a) that the behavior patterns of vertex 2 and8 are comparably consistent, indicating that vertex 2 and8 are in the same community with high probability. Thefar difference between the behavior patterns of vertex 2and 34 gives the opposite evidence that they are unlikelyto be in the same community (see Fig. 1(b)).

    One of the implications of community partition is tocorrectly differentiate inter-community edges from intra-community ones. Modularity optimization can be re-garded as a way to add as many weights as possibleinto communities and reduce weights of inter-communityedges. A pair of vertices can easily and frequently be

    visited by random walks triggered on them if they are inthe same community. As a consequence, we can give highweight to an edge with its two endpoints having similarbehaviors under random walk processes and low weightotherwise. Based on a suitable similarity metric that cap-tures well the similarities of behaviors among vertices, wecan reweigh the edges of the original network without de-teriorating its underlying community structure.

    The probability of a walker starting from one vertexto reach another in t-step random walk is determined

  • 8/12/2019 Enh Aced Mod Com Detect

    5/13

    5

    0 5 10 15 20 25 30 350

    0.5

    1

    Behaviorquantity

    (a)

    0 5 10 15 20 25 30 350

    0.2

    0.4

    0.6

    0.8

    vertex index

    Behaviorquantity

    (b)

    vertex 1

    vertex 34

    vertex 2

    vertex 8

    FIG. 1. Illustration of random walk behavior patterns of ver-tices in Zachary karate club network [34]. (a) Behavior pat-terns of vertex 2 and 8 that are in the same community; (b)Behavior patterns of vertex 1 and 34 that are in differentcommunities.

    by matrix Pt (t is called random walk length here). Pt

    records the trajectories of random walks and has beenused to reveal community structure in networks [32],where each row of Pt is considered as a vector in n-dimensional Euclidean metric space to induce similari-ties between pairs of vertices in term of Euclidean dis-tances between vectors, denoted PL-Similarity here. Wesimilarly view each vertex in network space as a vectorbut use instead

    t=1 P

    to track the trajectories of ran-dom walks. As the behavior patterns can be quantifiedas n-dimensional vectors, several similarity/dissimilaritymeasures can capture their coherence, such as square Eu-clidean distance [32], cosine similarity or angular distance[9], and so on. In practice, we empirically find that co-sine similarity or angular distance captures such simi-larity much more accurately, since the dimension of thevectornis very high. What the similarity needs to reflectis the difference between the directions of two vectors, forinstance to what extent they are parallel, not the abso-lute distance to indicate how far apart they are.

    Both PL-Similarity and the one proposed here arebased on the fact that random walks triggered on ver-tices in the same community will have similar trajec-tories. However, PL-Similarity only considers a t-steprandom walk, and one of the difficulties of PL-Similarityis its sensitive dependence to parts of the network far

    away from the target vertices considered each time. Ifthe network is nearly bipartite, PL-Similarity will havefluctuations which result in an unstable measure. Con-versely, the similarity used here takes into accountttypesof random walks whose steps vary from 1 to t. Such asimilarity emphasizes the contributions from the verticesnear the target vertices currently considered, since theidentification of two target vertices as being in the samecommunity is mainly dependent on the interconnected-ness between the target vertices and the nearby vertices,

    not on that between the target vertices and the far awayvertices. The cosine similarity of two vectors vi and vj isdefined:

    Simcos(vi, vj) = (vi, vj)(vi, vi)

    (vj , vj)

    (7)

    where (vi, vj) is the inner product of vector vi andvj . If

    two behaviorial vectors vi and vj are highly consistent,i.e. vi almost parallels to vj , cosine similarityS imcos 1; but ifvi and vj are orthogonal, i.e. elements ofvi andvj are alternatively 0, S imcos 0. Thus, S imcos can beused as the probability of an edge to lie in a community.

    For each edge of the original network, we set as weightthe cosine similarity of the behavior patterns of its twoconnected vertices. A new weighted network is thenproduced irrespectively from whether the original net-work is binary or weighted. Repeating the same strat-egy on the newly obtained weighted network, we itera-tively sharpen the differences between weights of intra-and inter-community edges. After the original network ispreprocessed, we feed the finally weighted network into

    an already developed community detection algorithm andobtain the community decomposition of the original net-work.

    In this paper, we focus on four main modularity-basedcommunity detection algorithms. Two have comparableaccuracy and high efficiency, and are the eigenvector-based method proposed by Newman [14], denoted asEigenMod, and a very fast algorithm developed byBlondel et al. [35], denoted as FastMod. Two otherstochastic algorithms are also considered in the casesof small networks in section IV: one by Guimera andAmaral [11] uses simulated annealing to optimize mod-ularity, and is denoted SAMod, and the other by Duch

    and Arenas [12] uses extremal optimization, named EO-Mod. The two stochastic algorithms can obtain higheraccuracy if they are run several times to select the bestresults, but consequently this requires much higher com-putation time than EigenMod and FastMod. The fourmodularity-based algorithms all have very good perfor-mance in terms of accuracy, and any improvement onthe results obtained by these algorithms indicates thatthe proposed strategy is useful and effective.

    We here briefly introduce the four modularity-basedalgorithms. EigenMod reformulates modularity in aquadratic form:

    Q=

    1

    2Mij

    (wij wiwj

    2M )(ci, cj)

    = 1

    4M

    ij

    siBijsj =sTBs

    (8)

    and casts community detection as Eigen-decompositionof the matrix B. Here Bij = wij

    wiwj2M

    and s is themembership vector representing the partition of the net-work into two communities (vertex i is assigned to onecommunity if si = +1, or to another one if si = 1).

  • 8/12/2019 Enh Aced Mod Com Detect

    6/13

    6

    EigenMod searches vector s to be as much as possibleparallel to vector u corresponding to the largest eigen-value, which is achieved by extracting the signs of thecomponents of the vector u: si = +1 if ui > 0 andsi= 1 ifui 0. The result is further refined by vertexsweep to get the highest increase of modularity. This bi-secting procedure continues on each of the clusters untilmodularity cannot be improved any further. Differently

    from EigenMod, FastMod first treats each vertex as aseparate community and then computes the modularitygain by removing vertex i from its community and mov-ing it to the community of its neighbor j. If there exists amoving with highest positive modularity gain, vertexi isput in that community and otherwise stays in its originalcommunity. This step of vertex moving is repeated untilno improvement can be achieved. The next step builds anew network whose vertices are communities found in thefirst step, and the weights of edges between new verticesare the sum of weights of edges between vertices in thecommunities represented by these new vertices. The twosteps are repeated until the local maximal value of mod-ularity is obtained. FastMod is very fast and can dealwith networks with millions of vertices, based on the factthat the modularity gain of vertex moving can be easilycomputed:

    Q = [

    win+ ki,in

    2W (

    wtot+ ki

    2W )2]

    [

    win

    2W (

    wtot

    2W )2

    ki

    2W]

    (9)

    herewinis the total weight of the edges in the communitywhere vertex i can be moved to, and wtot and ki,in arethe total weight of the edges associated with the com-munity and the degree of vertex i in that community,

    respectively. SAMod uses minus modularity as energy tobe minimized. At each temperature, a number of ran-dom updates are accepted with some probability: if theenergy is reduced the update is definitely accepted, oth-erwise it is accepted with a probability inversely propor-tional to the exponential increase of energy. The randomupdates are generated jointly by a series of individualvertex movements from one community to another andcollective movements involving merging two communitiesand splitting a community. EOMod reformulates mod-ularity as the sum of individual vertex contributions. Itinitially splits the whole network into two random partswith equal number of vertices, and each connected com-ponents is considered an initial community. It then iter-atively moves the vertex with lower modularity contribu-tion from one community to another until local maximalof modularity is reached. After that each communityfound is treated as a separate network by deleting itslinks to other communities and the process continues un-til the total modularity cannot be improved any further.

    We can summarize the procedure to combine networkpreprocessing with already developed community detec-tion algorithms as follows:Step 1. Set the random walk length tto a suitable value,

    and calculate the quantityt

    =1 P(i) for each vertex i

    of the input network with a weighted adjacency matrixW, where P = (D1W), andD is the diagonal matrixwith weighted degrees of vertices on its diagonal;Step 2. For each edge (i, j) of the network, calculate thecosine similarity of behavior patterns of vertex i and j,i.e. wij = cos(

    t=1 P

    (i),t

    =1 P(j)), and set wij as

    its new weight;

    Step 3. Run Step 1 and Step 2 iteratively for severalroundsI.Step 4. Feed the finally weighted network into a commu-nity detection algorithm adopted, and output the com-munity decomposition of the original network.

    The procedure can be divided into two main phases:preprocessing of the original network (step 1-3) and de-composition of the newly obtained network into commu-nities (step 4). The complexity of step 1 depends on the

    computation oft

    =1 P(i), which can be done on aver-

    age in O(< k >t), where < k > is the average numberof neighbors of vertices (degrees) in the network. If thedegree of a network is bounded, as it is always the case

    in real applications, calculatingt

    =1 P

    (i) is done inconstant time. Thus the cost of step 1 is O(n). It is easyto show that for L edges in the network the average costof step 2 is O(L < k >), which isO(n < k >) for a sparsenetwork. Consequently, the total cost of the first phase isO(IL < k >) orO(In < k >) for a sparse network. As wewill show later in Section IV, few iterations of step 3 aresufficient to distinguish the intra- and inter-communityedges and thus the complexity of the first phase is O(n) ifthe network is sparse and its degree is bounded. The costof the second phase is that of the community detectionalgorithm adopted, which areO(n2 log n) for both Eigen-Mod and EOMod, and approximately O(n) for FastMod.The complexity of SAMod is far higher than that of the

    three others and cannot be simply estimated in terms ofnor L for its dependence on the parameters used. In allapplications in this paper, we find that the first phaseof the procedure is fast and thus does not impose heavyburden on the total cost.

    Before applying the procedure to detect communitiesin networks, the parametertof random walk length needsto be specified in advance (step 1). Since the conver-gence speed of random walk is exponential, t in practiceshould not be chosen too large, i.e. possibly not greaterthan log(n). To examine the influence of t on the cosinesimilarity more thoroughly, we consider an extreme casewhen t . As it has already been found [32, 36],

    when t the probability of random walker being ona vertex j only depends on the degree wj of vertex j :

    limt

    Ptij = wj

    2M, 1 i n (10)

    Let tc be the converged steps of random walk, then ift tc, P

    t C. HereC is an nconstant matrix whoserows are identical vectors with jth entry being

    wj2M

    . The

    behavior quantityt

    =1 P can thus be separated into

  • 8/12/2019 Enh Aced Mod Com Detect

    7/13

    7

    two partial sums over tx < tc and t tc:

    t=1

    P =

    tx=1

    P + (t+ 1 tc)C, t tc,

    t= t; (11)tx

    =1

    P + 0, t < tc,

    tx= t.(11)

    Let X=tx

    =1 P and P =

    t=1 P

    tx

    =1 P. Co-

    sine similarity can then be rewritten in a similar way:

    Simcos(xi, xj)

    = (xi, xj) + (xi, ) + (xj , ) + (,)

    (xi, xi) + 2(xi, ) + (,)

    (xj , xj) + 2(xj , ) + (,).

    (12)

    here xi and are the ith rows of X and P, respec-

    tively. When t , (, ) , which makesSimcos(xi, xj) 1 for any pair of edge endpoints iand j. Edge weighting is in such case equivalent tosimply ignoring all edge weights in the original networkand thus produces a binary network. This is undesir-able since we never obtain any information on the net-work structure but loose valuable information containedin edge weights. Information needed to distinguish inter-and intra-community edges comes merely from the localstructure of a network. Therefore we need to choose asmall value of t to preprocess network by edge weight-ing. The strategy to choose the value of t is similar tothat for choosing the number of principal componentsable to reveal the internal structure of the data withPrincipal Component Analysis (PCA, [37]) widely usedin data mining and pattern recognition. If we imagine

    Ps (1 t) like principal components, the valueof t < tc is chosen to best explain local structure of thenetwork and can be thought of as follows: the Ps cor-responding to smaller values of account for as muchof the information on local structure in the network aspossible. It is also possible to choose t < tc as large aspossible, however, random walks corresponding to largervalue of t give little additional information on the net-works local structure while introducing additional infor-mation on global network structure that is not so usefulin distinguishing inter- and intra-community edges andsometimes misleading (i.e. relevant information on localstructure contains noises). Furthermore, larger t results

    in more iterations for the procedure to eliminate unnec-essary or misleading information. Although there is sofar no theoretical foundation for determining the opti-mal values of t, in our experience, we observed that inmost cases small values like t = 2, 3, 4 are sufficient tocapture the information on local structure of the net-work without over-determining it (t = 2, 3, 4 are chosento reflect the fact that triplets, triangles and rectanglesare more frequently found within communities than be-tween). Moreover, for a very sparse network the value oftshould increase while for a very dense network decrease,

    due to the exponential converging speed of the randomwalk. The choice of small values of t achieves good em-pirical compromise between efficiency and accuracy as wewill see in a series of examples in next section.

    IV. APPLICATIONS

    To experimentally validate the effectiveness of the com-bined procedure, we apply it to a number of real andsynthetic networks with known community structure.

    A. Zacharys Karate Club Network

    This network is constructed by W. W. Zachary [34],and consists of 34 vertices representing members of akarate club and 78 edges to show friendship relations be-tween its members. The network had been split into twodisjoined groups (the first one consists of 16 vertices: 1-8, 11-14, 17-18, 20, 22, and the second one consists of

    the rest) due to the disagreement between the admin-istrator and the instructor of the club during the yearsin which W. W. Zachary studied it. The network hasbecome one of the well-known benchmarks to test com-munity detection algorithms [1214, 21, 29], and, whenoptimizing modularity, it is always split into four groups(similar to the ones shown in Fig. 2 (b) except vertex 10is put in the red up-triangle group, with modularity value0.4188 in most cases, including EigenMod and EOMod).

    (a) (b)

    FIG. 2. (a) Community partition of Zachary karate club net-work when t = 3; (b) Community partition of Zachary karateclub network when t = 4. All four modularity-based algo-rithms produce the same results when optimizing the modu-larity of partitions of the reweighted network.

    The output of FastMod somewhat depends on the in-put order of the vertices though the influence is limited.When we permute the input order of vertices, the maxi-mal modularity that FastMod can achieve is 0.4198 (alsothe highest modularity that SAMod can achieve), andthe exact partitioning is shown in Fig. 2 (b) where ver-tex 10 is now in the group with most of its friends. To testthe random walk preprocessing, we choose two randomwalk lengths: t = 3 and t = 4. We iterated 5 times toreweight the network (I= 5), and fed the novel network

  • 8/12/2019 Enh Aced Mod Com Detect

    8/13

    8

    into all four modularity-based algorithms introduced insection III. Results are shown in Fig. 2 with differentrandom walk lengths. For FastMod, we permuted theinput order of the vertices and chose the partition withthe highest modularity. For the two stochastic algorithmsSAMod and EOMod, we ran them under no less 30 differ-ent initial conditions and selected the partitions with thehighest modularity on the weighted network. The differ-

    ence between the partitions in Fig 2 (a) and (b) consistsof different subdivision of the second groups from the realsplitting. Intuitively, the partition in Fig. 2 (a) is morenatural than the one in Fig. 2 (b) since vertex 25, 26and 32 form a triangle and the blue square vertices forma much more compact group. As we have seen, whenoptimizing modularity calculated on the original net-work, both EigenMod and EOMod obtained lower mod-ularity (0.4188) while FastMod and SAMod can achievehigher modularity (0.4198). After the network was pre-processed, the results of all the four modularity-basedalgorithms became consistent. Such an interesting phe-nomenon indicates that modifying the topology of theoriginal network by edge reweighting changes the search-ing space of modularity landscape and sometimes facili-tates the heuristics to achieve optimal modularity.

    B. Network of American College Football Teams

    We applied the proposed procedure to the network ofAmerican college football games during the fall regularseason 2000 [5]. The network contains 115 vertices repre-senting teams and 613 edges representing games playedbetween the teams. The 115 teams come from 12 confer-ences and games are more frequent between teams fromthe same conference than between teams from different

    conferences. We first applied all the four modularity-based algorithms to this network to reproduce the results.All the methods decompose the network into 10 commu-nities, but the values of modularity obtained are different,namely 0.6009, 0.6046, 0.6044 and 0.6043 by EigenMod,FastMod, SAMod and EOMod, respectively. The dif-ferences among these partitions are the different assign-ments of vertices 37 (CentralFlorida), 59 (LouisianaT-ech), 60 (LouisianaMonroe), 64 (MiddleTennesseeState),83 (NotreDame) and 98 (LouisianaLafayette). For exam-ple, vertices 37, 59, 60, 64 and 98 are merged by EOModinto the conference Southeastern, while by SAMod ver-tices 59, 60, 64 and 98 are put into Conference USA

    and vertex 37 is moved to conference Mid-American.Although the difference between values of modularity ob-tained by EigenMod and FastMod is a little more signifi-cant, the only difference between the partitions of Eigen-Mod and FastMod is the classification of vertex 83. Wethus conjecture that the searching space of values of mod-ularity on the original network contains many local min-ima resulting in such inconsistent behaviors of the fourmodularity-based algorithms.

    To preprocess this network, we again used two ran-

    dom walk lengths: t = 3 and t = 4, and fed the pro-cessed weighted network into all four algorithms. Al-though these algorithms give different partitions whendirectly acting on the original network, the partitionsobtained become consistent when the preprocessed oneis fed. To see more clearly the decomposition behavior ofthe proposed procedure, we visualized the confusion ma-trix whose elements (i, j) represent the number of com-

    mon vertices found both in communityi by the proposedprocedure and in real community j . The columns of theconfusion matrix were rearranged to make the diagonalelements as large as possible to obtain the one-to-one cor-respondence between the communities found and the realones, see Fig. 3. The procedure decomposes the networkinto 12 communities when t = 3 and 11 communitieswhen t = 4. The difference is that community indexed11 is merged into community indexed 7, i.e. the elementsof the 11th row of the matrix in Fig. 3 (a) are moved andadded to the 7th row (see the difference between the con-fusion matrices in Fig. 3 (a) and Fig. 3 (b)). The groupin green square and the blue arrow in Fig. 4 illustrate thismerging. In these two partitions, the real community In-dependence (indexed 6) is disassembled and its membersare distributed to three other conferences. This is dueto the fact that teams in the conference Independenceplay more games with inter-conference teams than thosein their own conference, which is against the definitionused here that vertices in the same community have moreedges than vertices in different communities. The valuesof modularity calculated on the original network of thepartitions with t = 3 and that with t = 4 are 0.6032 and0.6010, respectively. Although the values obtained by theproposed procedure appear to be smaller than the maxi-mal one (i.e. 0.6046), we find that the partitions are morenatural than those obtained by solely applying the detec-

    tion algorithms to the network. If we use the number ofcorrectly classified vertices as quality index, i.e. accuracy,we can quantify this assumption. To calculate the accu-racy, we first searched the correspondence with maximaltotal sum of common members between the communi-ties found by the procedure and the real ones. In orderto make the correspondence to be a one-to-one mapping,we assigned the real community index to the newly foundone with the largest common members and each real com-munity index could only be assigned once. We continuedthis assignment until all found communities had been re-indexed or the indexes of the real communities were usedup. Consequently, the accuracy is the sum of the num-

    ber of common members between found communities andtheir corresponding real communities (for example, theaccuracy of the proposed procedure at t = 3 is the sumof the diagonal elements of the matrix shown in Fig. 3(a)). This type of accuracy is a very stringent measureto put emphasis not only on the correct classification ofthe vertices into communities but also on the correctnessof the number of communities found. By this index ofperformance, both partitions found by the proposed pro-cedure correctly classify 104 vertices out of 115 vertices

  • 8/12/2019 Enh Aced Mod Com Detect

    9/13

    9

    (vertices in ellipses and square are considered to be mis-classified, see Fig. 4), while partitions found by directlyfeeding the network into the four modularity-based algo-rithms correctly classify 100 vertices. The higher accu-racy by the proposed procedure results from the differentclassification of the conference Sun Belt. The confer-ence Sun Belt is merged by pure modularity-based al-gorithms into the conference Mountain West (due to

    resolution limit), while most of its team members are cor-rectly identified by the proposed combined procedure asan independent community. The advantages of randomwalk network preprocessing in this case are two folds.First, network preprocessing makes the behaviors of thefour modularity-based community detection algorithmsconsistent. Secondly, and more importantly, network pre-processing helps the four algorithms to find more naturalcommunities, i.e. it can not only assign vertices to theirtrue communities but also predict the number of commu-nities more correctly.

    We also used this network to show that iterative edgereweighting makes the behavior patterns of vertices in thesame community become consistent. For example, vertex1 and vertex 5 in this network are in the same commu-nity. Fig. 5 shows that their patterns quickly becomeconsistent after 3 iterations. Iterative reweighting putsiteratively heavier weights on the intra-community edgesand lighter weights on the inter-community ones, whichmakes a random walker starting from a vertex spend mostof its time visiting its neighbors in the same commu-nity and hardly reach vertices outside the community. Inall the applications in the present paper we iterativelyweighted each network 5 times (I= 5) if not explicitlydeclared.

    C. Clique Chain Networks

    We tested our method on the same synthetic dataseton which Fortunato and Barthelemy identified modular-ity resolution limit [16]. Fig. 6 gives two types of cliquechain networks. The network in Fig. 6 (a) consists of30 identical cliques, with 5 vertices each, connected by asingle edge. The one in Fig. 6 (b) is a network with twopairs of identical cliques and the big pair has 20 verticeseach while the other pair is a pair of 5-vertex cliques.Intuitively, the cliques are the natural communities andproper community detection algorithms should find themcorrectly. If the network is not preprocessed, none of the

    four modularity-based methods can correctly identify theregularity of cliques being communities. The communi-ties they detect are the combinations of cliques. If thenetwork is preprocessed first, i.e. iteratively reweightingedges with t = 3 or t = 4, they all correctly decomposethe network into cliques. We further performed testingon ten similar clique chain networks by varying the num-ber of cliques chained. The results are summarized inTable. I. Note that the combined procedure correctly de-tects the number of cliques over the whole range, while

    13 0 0 0 0 0 0 0 0 0 2 0

    0 10 0 0 0 0 0 0 0 0 0 0

    0 0 9 0 0 0 0 0 0 0 0 0

    0 0 0 11 0 0 0 0 0 0 0 0

    0 0 0 0 8 0 0 0 0 0 2 0

    0 0 1 0 0 8 0 0 0 0 0 0

    0 0 0 0 0 0 12 0 0 0 0 0

    0 0 0 0 0 1 0 4 0 0 1 0

    0 0 0 0 0 0 0 0 8 0 0 0

    0 0 0 0 0 0 0 0 0 9 0 0

    0 0 0 0 0 1 0 3 0 0 0 0

    0 0 0 0 0 0 0 0 0 0 0 12

    1

    2

    3

    4

    5

    6

    7

    8

    9

    10

    11

    12

    7 9 5 3 2 12 10 11 8 1 6 4

    found

    communityindex

    real community index

    (a)t=3

    13 0 0 0 0 0 0 0 0 0 2 0

    0 10 0 0 0 0 0 0 0 0 0 0

    0 0 9 0 0 0 0 0 0 0 0 0

    0 0 0 11 0 0 0 0 0 0 0 0

    0 0 0 0 8 0 0 0 0 0 2 0

    0 0 1 0 0 8 0 0 0 0 0 0

    0 0 0 0 0 1 12 3 0 0 0 0

    0 0 0 0 0 1 0 4 0 0 1 0

    0 0 0 0 0 0 0 0 8 0 0 0

    0 0 0 0 0 0 0 0 0 9 0 0

    0 0 0 0 0 0 0 0 0 0 0 0

    0 0 0 0 0 0 0 0 0 0 0 12

    1

    2

    3

    4

    5

    6

    7

    8

    9

    10

    merged

    12

    7 9 5 3 2 12 10 11 8 1 6 4

    real community index

    foundcommunityind

    ex

    (b)t=4

    FIG. 3. Confusion matrix constructed by the communitiesfound by the proposed procedure and the real ones. Row la-bels represent the labels of the communities found and columnlabels are real community indexes (as they are labeled in [5],plus 1). The columns are rearranged to make the diagonal el-ements as large as possible. (a) Confusion matrix with t = 3.(b) Confusion matrix with t= 4.

    the four algorithms initially find the correct number ofcliques but fail when the number of cliques increases asshown analytically [16]. Similarly, the four algorithmsfind 3 communities of the network in Fig. 6 (b) with two20-vertex cliques correctly detected and the other oneidentified as the combination of the 5-vertex clique pair.

    In contrast, the combined procedure detects all 4 cliquescorrectly. The success of the combined procedure is dueto the incorporation of the information on the local struc-ture of the networks.

    Besides the arguments in section III, we further gen-erated a series of the same type of networks as the oneshown in Fig. 6 (a) but varied the number of chainedcliques from 100 to 2000 to get more intuition on the ef-fect of parameterstandIon the behavior of the proposedprocedure. Random walk lengths t were chosen from 1

  • 8/12/2019 Enh Aced Mod Com Detect

    10/13

    10

    TABLE I. Results for the resolution limit test on networks made out of cliques chained by a single edge. The number of cliquesranges from 20 to 30. Without network preprocessing, all four modularity-based methods cannot detect the correct number ofcliques as it increases. If the networks are preprocessed as suggested in the paper, all of them detect the cliques correctly. HereEitherMod represents either of the four modularity-based algorithms.

    # Number of cliquesMethods 20 21 22 23 24 25 26 27 28 29 30EigenMod only 20 21 14 15 16 16 16 16 16 16 16FastMod only 20 21 22 12 12 13 13 14 14 15 15

    SAMod only 20 21 21 12 12 13 14 14 15 16 16EOMod only 20 21 14 14 12 13 15 15 16 16 16Preprocessing+EitherMod 20 21 22 23 24 25 26 27 28 29 30

    FIG. 4. Result with random walk length t = 3. In thiscase, the combined procedure finds 12 communities, which is

    consistent with the ground truth number of communities. 11out of 115 vertices in ellipse and in square are misclassifiedby the suggested accuracy index.

    to 6, and at each value oft the number of iterations I in-creased from 1 to the first value able to detect the correctnumber of cliques chained in a network. Table II summa-rizes the results. The information on local structure ofsuch type of clique chain networks can be fully expressedby t = 1 since all the vertices in the same community(clique) can be visited in a one step walk. Thus the num-ber of iterations necessary to extract relevant informationis only one in all cases. When t increases, information on

    global structure of the network is also included, thereforemore iterations are needed to extract relevant informa-tion on local structure from the one on global structurethat may hinder the procedure to quickly differentiateinter- and intra-community edges. Such type of cliquechain networks are so special that they are different frommost of real-world networks. When applying the pro-posed procedure to networks with realistic features, dif-ferent combinations of the parameters t and Ishow verygood and similar performance if networks possess clear

    0 50 1000

    0.05

    0.1

    0.15

    0.2

    Iteration #1

    0 50 1000

    0.1

    0.2

    0.3

    0.4

    Iteration #2

    0 50 1000

    0.1

    0.2

    0.3

    0.4

    Iteration #3

    0 50 1000

    0.1

    0.2

    0.3

    0.4

    Iteration #4

    0 50 1000

    0.1

    0.2

    0.3

    0.4

    Iteration #5

    0 50 1000

    0.1

    0.2

    0.3

    0.4

    Iteration #6

    FIG. 5. Iteration quickly makes the behavior patterns of ver-tices in the same community consistent. Vertex 1 (red solid)and vertex 5 (black dash-dot) in the network of American col-lege football teams are chosen to illustrate this phenomenon.

    (a) (b)

    FIG. 6. Illustration of two types of clique chain networks. (a)Network made of 30 cliques, with 5 vertices each, connected bya single edge. (b) Network made out of two pairs of identicalcliques, with big pair having 20 vertices each and the otherpair 5 vertices each.

    community structure, as will see in the coming subsec-tion.

    D. LFR Benchmark Networks

    All the networks we have tested so far are small orspecial. To test the random walk preprocessing on largernetworks, we used a recently introduced class of bench-mark networks with realistic features like power law dis-tributions of degree and community size [38]. The vertexdegrees of a network are controlled by a power law dis-

  • 8/12/2019 Enh Aced Mod Com Detect

    11/13

    11

    TABLE II. Relationship between random walk length t and number of iterations I to identify correct number of cliques inclique chain networks. Each clique contains 5 vertices and the number of cliques varies from 100 to 2000.

    # Number of cliques/100I

    t 1 2 3 4 5 6 7 8 9 10-13 14-16 17 18-201 1 1 1 1 1 1 1 1 1 1 1 1 12 2 3 3 4 4 4 4 5 5 5 5 6 63 2 3 4 4 5 5 5 5 6 6 6 6 7

    4 3 4 5 5 5 6 6 6 6 7 7 7 85 3 4 5 6 6 7 7 7 7 8 8 9 96 3 5 6 7 7 7 8 8 8 9 10 10 10

    tribution with exponent t1, and the community size alsodistributes according to power law with exponentt2. Theratio between the external degree of each vertex with re-spect to its community and the total degree of the vertexis determined by a common mixing parameter . Thelarger the value of a network is, the harder it is todetect communities in it.

    SAMod and EOMod are stochastic algorithms and

    thus need to rerun many times to obtain comparable re-sults, this results in much higher computational cost. Inaddition, compared to SAMod and EOMod even witha large number of repetitions, EigenMod and FastModare empirically found to perform much better on LFRnetworks both in accuracy and efficiency. For this rea-son in this validation on LFR networks we only reportedthe improvements on EigenMod and FastMod to see ifthe proposed procedure is effective. We first generateda set of undirected binary networks with 1000 vertices.For the exponent of the degree distribution and that ofthe community size, the default values provided by thealgorithm were used: t1 = 2, t2 = 1. The mixing

    parameter varies from 0.05 to 0.8. The average degreeand the maximal degree are 20 and 50, respectively. Topreprocess these networks, we set the random walk lengthto 3. We used the same strategy to calculate accuracyand the results are presented in terms of the fraction ofcorrectly classified vertices to ease figure readability. Theperformance comparison on this set of networks is shownin Fig. 7 (a). Since our strategy for calculating accuracywill penalize heavily the results with wrong predictionof the number of communities, we can see from the fig-ure that EigenMod and FastMod cannot correctly detectcommunities in these networks even if the networks haveclear community structure (0.35 0.5). The reasonis that resolution limit occurs when optimizing modular-ity on the partitions of these networks. Conversely, afterthe networks are preprocessed, EigenMod and FastModcan correctly detect communities as long as 0.6. Theproblem of resolution limit becomes even worse when thenumber of vertices in the network increases and the com-munity sizes are diverse. We further generated two othersets of networks to confirm the comparison. The vertexnumber of the first set of networks is 5000 each, whilethat of the second one is 10000. The parameters for gen-

    erating these larger networks are the same as those for1000-vertex networks except that the minimal commu-nity size and the maximal community size in the 10000-vertex networks are respectively 20 and 200 to make thecommunity size more diverse. The results are shown inFig. 7 (b). We only presented the performance for Fast-Mod, but for EigenMod it is similar. Since the problem ofresolution limit becomes more severe in these cases, the

    performance of FastMod without network preprocessingbecomes so bad that it cannot detect communities cor-rectly in the entire range of the mixing parameter. Asexpected, the performance of FastMod becomes excel-lent if the networks are preprocessed before they are fedinto the modularity-based algorithm. If networks pos-sess clear community structure, different random walklengths will give very similar results. When 0.5,the procedure by network preprocessing with t = 2, 3, 4correctly detects the communities in LFR networks (seeFig. 8). The reason is that in such cases the procedurewill quickly set very small (nearly vanishing) weights oninter-community edges (Fig. 7). When > 0.5 (i.e.

    the external degree of a vertex is larger than the inter-nal one), community structures of LFR networks becomefuzzier, and it is much harder for the procedure to dif-ferentiate intra-community edges from inter-communityones. As a consequence, weights for real inter-communityedges cannot be quickly reduced and more iterations areneeded. For example, when we iterate 10 times to weightedges of LFR networks with 5000 vertices, the procedurewith t = 3 can correctly detect communities as long as 0.6, showing higher performance than that with 5iterations (Fig. 8, >0.55).

    V. DISCUSSION AND CONCLUSIONS

    In this paper, we have proposed a combined proce-dure to enhance the ability of modularity to detect com-munities in complex networks. The proposed procedurecombines previously developed community detection al-gorithms with random walk network preprocessing andaims at alleviating the resolution limit problem of mod-ularity optimization. By analyzing the resolution limitof modularity optimization in weighted undirected net-

  • 8/12/2019 Enh Aced Mod Com Detect

    12/13

    12

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.80

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    mixing parameter

    fraction

    of

    correctly

    classified

    vertices

    N = 1000

    FastMod

    PreFastModEigenMod

    PreEigenMod

    (a)

    0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.80

    0.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    mixing parameter

    fraction

    ofcorrectly

    classified

    vertices

    N=5000, FastMod

    N=5000, PreFastMod

    N=10000,FastMod

    N=10000,PreFastMod

    (b)

    FIG. 7. Performance comparison with or without ran-dom walk network preprocessing. (a) Performance on net-works with 1000 vertices. (b) Performance on networks with5000 and 10000 vertices. PreFastMod means the proce-dure that combines network preprocessing with FastMod, andPreEigenMod represents the procedure that combines net-work preprocessing with EigenMod.

    works, as the simple extension of the similar analysis inundirected binary ones, we find that the resolution limitof modularity optimization can in theory be overcome aslong as we can iteratively reweigh differently intra- andinter-community edges. Network preprocessing is thusused to iteratively put much heavier weights on possible

    intra-community edges but far lighter weights on inter-community ones. The processed network is then fed intothe adopted modularity-based community detection al-gorithm to produce the partition with the highest mod-ularity, which corresponds to the community decomposi-tion for the original network. We applied the combinedprocedure to sets of real and synthetic networks. Theexperimental results demonstrate that the procedure candetect communities as natural as possible and alleviatethe resolution limit problem of modularity optimization.

    0.5 0.55 0.6 0.65 0.7 0.75 0.80.1

    0.2

    0.3

    0.4

    0.5

    0.6

    0.7

    0.8

    0.9

    1

    mixing parameter

    fraction

    of

    correctly

    classified

    vertices

    N = 5000

    t = 2, I = 5

    t = 3, I = 5

    t = 4, I = 5

    t = 3, I = 10

    FIG. 8. Performance comparison on 5000-vertex LFR net-works with different random walk lengths and iterations.

    To preprocess the network, we use random walk as asurrogate of dynamic process on it to explore its local

    structure, which gives us enough knowledge to differenti-ate intra- from inter-community edges. We use the sumof transition matrices of random walks in path length 3or 4 to induce a quantity matrix. Each row of this ma-trix is a vector to be used as the behavior pattern ofa vertex. Due to the exponential convergence speed ofrandom walk, small random walk length is empiricallychosen to capture enough information on local structurewithout over-determination and should be not greaterthan log(n), with n being the number of vertices in thenetwork. Based on the fact that random walks triggeredon vertices in the same community have similar trajec-tories, an edge is likely to be an intra-community edge

    if the behavior patterns associated with the two verticesit connects are very similar, and the likelihood is deter-mined by the cosine similarity between patterns. All theedges in the network are weighted by such likelihoods,and a new weighted network with much clearer commu-nity structure than the original one is induced. Itera-tively performing edge weighting for a few rounds givesmore and more knowledge on intra- and inter-communityedges, this knowledge can then be used by previouslydeveloped modularity-based community detection algo-rithms to find more natural communities. We combinedthe network preprocessing with several state-of-the-artmodularity-based algorithms. Improvements on the re-sults produced by these algorithms indicate the effective-ness of the proposed procedure.

    Although modularity presents a resolution limit thatprevents it to some extent from being effective in all ap-plications, its effectiveness can be enhanced by combiningmodularity optimization with a suitable strategy as therandom walk network preprocessing suggested here. Thebenefit coming from the procedure is that we do not needto develop a completely new community detection algo-rithm but it is possible to adopt any already developed

  • 8/12/2019 Enh Aced Mod Com Detect

    13/13

    13

    method without modification and with limited computa-tional extra cost. We expect the proposed procedure tobe applied to a variety of networks, including social andbiological ones.

    ACKNOWLEDGMENTS

    The authors thank M.E.J Newman for providingZachary Karate club network and American football net-

    work data. The authors also thank M.E.J Newman, A.Arenas, S. Fortunato, R. Lambiotte, R. Guimera, forsharing the codes or tools with us. This work is sup-ported by National Natural Science Foundation of Chinaunder grant No. 60873133 (NSFC, No. 60873133).

    [1] R. Albert and A. L. Barabasi, Reviews of modern physics74, 47 (2002).

    [2] M. E. J. Newman, SIAM Review45, 167 (2003).[3] S. Boccaletti, V. Latora, Y. Moreno, M. Chavez, and

    D. U. Hwang, Physics Reports 424, 175 (2006).[4] G. W. Flake, S. Lawrence, C. L. Giles, and F. M. Coetzee,

    IEEE Computer 35, 66 (2002).[5] M. Girvan and M. E. J. Newman, Proc. Natl. Acad. Sci.

    USA99

    , 7821 (2002).[6] G. Palla, I. Derenyi, I. Farkas, and T. Vicsek, Nature435, 814 (2005).

    [7] M. E. J. Newman and M. Girvan, Phys. Rev. E 69,026113 (2004).

    [8] M. E. J. Newman, Phys. Rev. E 74, 036104 (2006).[9] L. Donetti and M. A. Munoz, Journal of Statistical Me-

    chanics: Theory and Experiment, P10012(2004).[10] S. Fortunato, V. Latora, and M. Marchiori, Phys. Rev.

    E 70, 056104 (2004).[11] R. Guimera and L. A. N. Amaral, Nature 433, 895

    (2005).[12] J. Duch and A. Arenas, Phys. Rev. E72, 027104 (2005).[13] S. White and P. Smyth, in SIAM Data Mining Confer-

    ence (Society for Industrial Mathematics, 2005) p. 274.

    [14] M. E. J. Newman, Proc. Natl. Acad. Sci. USA103, 8577(2006).

    [15] S. Fortunato, Physics Reports 486, 75 (2010).[16] S. Fortunato and M. Barthelemy, Proc. Natl. Acad. Sci.

    USA104, 36 (2007).[17] J. Ruan and W. Zhang, Phys. Rev. E 77, 016104 (2008).[18] Z. Li, S. Zhang, R. S. Wang, X. S. Zhang, and L. Chen,

    Phys. Rev. E 77, 036109 (2008).[19] J. Reichardt and S. Bornholdt, Phys. Rev. E 74, 016110

    (2006).[20] J. M. Kumpula, J. Saramaki, K. Kaski, and J. Kertesz,

    Eur. Phys. J. B 56, 41 (2007).[21] A. Arenas, A. Fernandez, and S. Gomez, New J. Phys.

    10, 053039 (2008).[22] R. Lambiotte, J. Delvenne, and M. Barahona, arXiv:

    0812.1770 (2009).[23] T. Evans and R. Lambiotte, Phys. Rev. E 80, 016105

    (2009).[24] J. C. Delvenne, S. N. Yaliraki, and M. Barahona, arXiv:

    0812.1811(2008).[25] S. Van Dongen, PhD Thesis University of Utrecht(2000).

    [26] D. Harel and Y. Koren, Lecture Notes in Computer Sci-ence2245, 18 (2001).

    [27] M. E. J. Newman, Phys. Rev. E 70, 056131 (2004).[28] A. Arenas, J. Duch, A. Fernandez, and S. Gomez, New

    J. Phys. 9, 176 (2007).[29] H. Zhou, Phys. Rev. E 67, 041908 (2003).[30] H. Zhou, Phys. Rev. E 67, 061901 (2003).[31] H. Zhou and R. Lipowsky, Lecture Notes in Computer

    Science3038, 1062 (2004).[32] P. Pons and M. Latapy, Lecture notes in computer science

    3733, 284 (2005).[33] D. Lai, H. Lu, and C. Nardini, Physica A: Statistical

    Mechanics and its Applications 389, 2443 (2010).

    [34] W. W. Zachary, Journal of Anthropological Research33,452 (1977).

    [35] V. D. Blondel, J. L. Guillaume, R. Lambiotte, andE. Lefebvre, Journal of Statistical Mechanics: Theoryand Experiment, P10008(2008).

    [36] J. Noh and H. Rieger, Phys. Rev. Lett. 92, 118701 (2004).[37] K. Fukunaga, Introduction to Statistical Pattern Recog-

    nition, 2nd ed. (Academic Press, 1990).[38] A. Lancichinetti and S. Fortunato, Phys. Rev. E 80,

    016118 (2009).