Mol Biol Evol 2012 de Vienne Molbev_msr317

Embed Size (px)

Citation preview

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    1/12

    Phylo-MCOA: A Fast and Efficient Method to Detect OutlierGenes and Species in Phylogenomics Using MultipleCo-inertia Analysis

    Damien M. de Vienne,*,1,2 Sebastien Ollier,2 and Gabriela Aguileta1,2

    1Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain2Laboratoire Ecologie, Systematique et Evolution, UMR 8079, Universite Paris-Sud 11, Orsay cedex, France

    *Corresponding author: E-mail: [email protected] editor: Herve Philippe

    Abstract

    Full genome data sets are currently being explored on a regular basis to infer phylogenetic trees, but there are oftendiscordances among the trees produced by different genes. An important goal in phylogenomics is to identify whichindividual gene and species produce the same phylogenetic tree and are thus likely to share the same evolutionary history.On the other hand, it is also essential to identify which genes and species produce discordant topologies and thereforeevolve in a different way or represent noise in the data. The latter are outlier genes or species and they can providea wealth of information on potentially interesting biological processes, such as incomplete lineage sorting, hybridization,and horizontal gene transfers. Here, we propose a new method to explore the genomic tree space and detect outlier genesand species based on multiple co-inertia analysis (MCOA), which efficiently captures and compares the similarities in thephylogenetic topologies produced by individual genes. Our method allows the rapid identification of outlier genes andspecies by extracting the similarities and discrepancies, in terms of the pairwise distances, between all the species in all thetrees, simultaneously. This is achieved by using MCOA, which finds successive decomposition axes from individualordinations (i.e., derived from distance matrices) that maximize a covariance function. The method is freely available asa set of R functions. The source code and tutorial can be found online at http://phylomcoa.cgenomics.org.

    Key words: phylogenomics, outlier genes, phylogenetic markers, multivariate analysis, fungal genomes.

    Introduction

    The analysis of genomic data sets has generated the press-ing need to compare a large amount of phylogenetic trees.

    The trees under comparison may, for example, representa set of shared orthologous genes from a given organismor taxonomic group, sampled across their full genomes.The goal of such comparisons can be 2-fold: either to iden-tify the set of genes producing a concordant topology thatmay reflect the species history or to identify the set of out-lier genes that generate topological incongruities and arelikely to evolve differently or represent noise. While mostefforts have been devoted to detecting genes that produceconcordant gene trees, less attention has been given toidentifying outliers, although they are very importantcandidates for studying interesting biological processes.

    Detecting outlier genes and species is particularly rele-

    vant in phylogenomics because genes whose tree topolo-gies are discordant with respect to an overall topologicalconsensus are good candidates for evolving at significantlydifferent rates, to be under positive selection for functionaldivergence, to have been horizontally transferred, to beparalogous instead of orthologous genes, or to be randomlysampled from incompletely sorted lineages. On the otherhand, species that appear as outliers (i.e., that are placedvery differently in different gene trees) may representspecies that are more likely to experience gene transfers,

    having sequencing errors, having undergone bottlenecks,

    or other biological processes at the species level. Even

    though topological discordances can be detected at thegene or the species level, most methods so far are designed

    to look at genes only.In the context of phylogenetic gene tree comparison and

    phylogenomics, several approaches and methods havebeen proposed to detect concordant and discordant topol-

    ogies trying to integrate as much as possible the available

    information. A solution that attempts to deal with all the

    available data is the construction of a super matrix from

    the alignment of all or a large number of the available

    individual genes, an approach known as total evidence

    (Gadagkar et al. 2005; Aguileta et al. 2008). However, most

    methods involve some sort of consensus solution either

    by classical consensus of multiple competing phylogenies

    (Felsenstein 1985), by building a super tree (Gordon1986; Sanderson et al. 1998), by Bayesian sampling (Ane

    et al. 2007), or a summary by agreement subtrees (Cranston

    and Rannala 2007). Other approaches work by trying to

    filter out the noise and/or increasing the phylogenetic

    signal (Rodriguez-Ezpeleta et al. 2007; Roure et al. 2007)

    or by clustering congruent loci (Leigh et al. 2008, 2011).Another possibility involves identifying the conflicting

    bipartitions and representing all the possible alternativesin a graphic way, an approach that results in networks

    The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, pleasee-mail: [email protected]

    Mol. Biol. Evol. doi:10.1093/molbev/msr317 Advance Access publication January 3, 2012 1

    MBE Advance Access published February 14, 2012

    http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/
  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    2/12

    of genes (Huson and Bryant 2006), or a projection of theconflicting signals in 2Ds (White et al. 2007). Also, in orderto obtain a single tree from multiple individual trees of thesame organism or species, the coalescent approach can beused (Rosenberg 2002; Suchard et al. 2003; Suchard 2005;Degnan and Rosenberg 2006) or the Bayesian reconstruc-tion of gene trees taking the species tree as a prior (BEST)(Edwards et al. 2007). Finally, approaches for graphically

    comparing phylogenetic trees include the use ofmultidimensional scaling to visualize tree-to-tree pairwisedistances (Hillis et al. 2005); the use of principal componentanalysis (PCA) to detect events of horizontal gene transfer(Brochier et al. 2002; Susko et al. 2006); and employingHadamard spectral analysis represented in a triangulargraph called Treeness Triangles to investigate the lossof phylogenetic signal (White et al. 2007). However, byconcatenating the multilocus data or by summarizing orobtaining a consensus of their individual gene trees, mostof the latter methods lose a wealth of potentially interest-ing information, especially by removing outlier data or

    trees. Phylo-MCOA is unique in that it analyzes all the treessimultaneously without discarding data a priori. Further-more, as far as we know, the possibility to also detect out-lier species when comparing phylogenetic trees has notbeen proposed before.

    Here, we present Phylo-MCOA, a fast method for com-paring multiple phylogenetic trees and detecting outliergenes and species based on multiple co-inertia analysis(MCOA) (Chessel and Hanafi 1996). The method allowsthe efficient identification of concordant and outliergenes and species, by enabling the quick comparison ofa large number of trees. It builds a consensus typologyby extracting the similarities, in terms of the pairwise

    distances between all the species in all the trees, simulta-neously. The consensus typology is a 2D graphic represen-tation of the distance interrelationships between speciessupported by the majority of the genes. Phylo-MCOA findssuccessive decomposition axes from individual ordinations(derived from matrices of distances between species) thatmaximize a covariance function in order to build the con-sensus typology, and it estimates the contribution of eachindividual gene tree to this typology (i.e., to what extenteach gene deviates or agrees with what the majority ofgenes support). MCOA is expected to reveal the commonfeatures (i.e., pairwise distances and species placements) of

    single-tree topologies, build a consensus typology and com-pare individual trees with this reference. Until now, MCOAhas been used with ecological (Bady et al. 2004; Hedde et al.2005) and genetic data (Laloe et al. 2007; Berthouly et al.2008); however, as far as we know, this is the first time thatMCOA is applied to phylogenetic analysis. Phylo-MCOA isunique in tracing the position of each species with respectto all other species across the compared phylogenies. Thesimultaneous analysis of all the species placements in alltrees makes it possible to detect, not only concordant genesbut also outlier gene trees, and even more importantly, itcan show which species andgenes explainthose differences.Our method is thus new in allowing the analysis of the data

    at the species level for comparing the histories of differentspecies across individual gene trees.

    Phylo-MCOA can be very useful for phylogenomic anal-yses, where big genomic data sets are typically analyzed.First, Phylo-MCOA readily identifies the outlier genesand species that cause topological incongruence. This isuseful for studying the evolution of specific genes, partic-ularly those that seem to differ from the rest of the genome.

    The method will also identify candidate outlier species,which may undergo interesting biological processes atthe species level. Phylo-MCOA is unique in pointing notonly to outlier genes but also to outlier species. Second,when the goal is to construct a species tree or a tree thatreflects the evolution of most genes in the genome, Phylo-MCOA can be used to identify which genes producediscordant trees and remove them for subsequent phylo-genetic analysis. In this way, only the genes that share thesame evolutionary history are kept in order to build a treethat reflects the evolution of the species or that of most ofthe genome. Third, Phylo-MCOA can be used to investigate

    specific evolutionary hypotheses including, but not limitedto, which genes may have been laterally transferred, whichones may be more readily lost or replaced, or which speciesproduce phylogenetic discordance.

    Materials and Methods

    MCOA in the Context of Phylogenetic AnalysisIn phylogenomics, there is often the need to compare setsof phylogenetic trees, determine how the position of eachspecies changes in every tree, and to find out which treesare more similar between each other in terms of the pair-wise topological distance (i.e., either nodal or patristic). It

    would thus be advantageous to compare the position ofeach species in all trees at the same time, to know if thereis a correlation among the species positions and distancesrepresented in all trees (i.e., create a consensus typologythat serves as reference), and to what extent each individ-ual tree deviates from such a consensus. The problem athand involves measuring the relationship between theposition of each species and the distances between themin all the trees of a given set.

    In the context of multivariate data analysis, one is inter-ested in ordering objects (e.g., variables, sites, and tables) bytheir similarity, and the relationships between the objects

    are measured by different metrics (Manly 2004). Generally,objects are projected onto principal axes and are thus char-acterized numerically or graphically. This procedure placessimilar objects near each other and dissimilar ones far apartin geometric space. Traditional ordering techniques (e.g.,PCA) typically use a singular decomposition axis value ofone table. In the case of our proposed method, MCOAallows the simultaneous ordination of many Euclidean dis-tance matrices derived from every tree in the set by findingsuccessive decomposition axes from each distance matrixthat maximize a covariance function. Doing this allows theextraction of the consensus typology contributed by alltrees.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    2

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    3/12

    Here, we introduce the basic concepts of the method;interested readers are referred to the appendix in thesupplementary material (Supplementary Material online)and to Chessel and Hanafi (1996) for its mathematicaldescription.

    To better understand how MCOA is applied to phylo-genetic data (i.e., a set ofK trees), we can decompose themethod in two steps. We first explain how one tree can berepresented in a 2D space through Principal CoordinateAnalysis (Gower 1984; Gower and Legendre 1986) (PCO,fig. 1a), and we then explain how K trees can be simulta-neously analyzed through MCOA (fig. 1b).

    A single tree can be transformed into a distance matrixby calculating the pairwise distances between species in thetrees (fig. 1a, step 1). Distance between species can eitherbe nodal (number of nodes separating two leaves in thetree) or patristic (sum of the branch lengths separatingtwo leaves). We then compute the square root of this

    distance matrix in order to get a Euclidean distance matrix,which is an optimal condition for the PCO to be performed.Whereas the methods typically used to obtain Euclideandistances impose a geometric distortion on them, takingtheir square root does not affect the results (de Vienne

    et al. 2011). The principal coordinates of the leaves (i.e.,terminal branches) in a Euclidean space are then derivedfrom the Euclidean distance matrix by a PCO analysis(fig. 1a, step 2). Each axis defined by the PCO analysis max-imizes the variance of the coordinates of the species. Whenoriginal distances between species are Euclidean, all theeigenvalues are positive and correspond to the varianceassociated to the axis. In that case, the distances betweenspecies in the new space defined by the PCO analysis arethe same as the original distances defined by the tree.Because most of the variance is generally concentratedin the first axis (depending on the topology of the tree),we can represent the respective position of the leaves in

    FIG. 1. Principle of MCOA as applied to phylogenetic trees. (a) From one tree to a 2D representation of the tree. Step 1: A single phylogenetic

    tree can be converted into a distance matrix using either nodal or patristic distances between the leaves. Step 2: A PCO on the square root of

    the distance matrix allows retrieving the coordinates of the species in a Euclidean space. Step 3: Positions of the species in a 2D space formed

    by the two first axes of the PCO. (b) MCOA from Ktrees to the cohesion plot. Steps 1 and 2: From K trees, Kdistance matrices and KEuclidean

    spaces are obtained, as in A. Each Euclidean space can then be seen as a scatter plot (or set of points) in an n 1 dimensional space (n beingthe number of leaves). Step 3: MCOA finds K auxiliary vectors (u11,u12, . . ., u

    1K) and a reference vector m

    1 so that the sum of the covariances

    between each u1 vector and the unique m1 vector is maximized. This is done recursively for m1, m2, etc. Step 4: Cohesion plot for the two first

    axes of the MCOA analysis m1 and m2. Labels represent the reference position of the species (consensus typology). Black points represent the

    positions of the species in each individual gene tree.

    Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

    3

    http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1
  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    4/12

    a reduced space formed by these axes of the PCO (fig. 1a,step 3). The relative positions of the taxa in this reducedspace represent the main dichotomies observed in the tree.

    The first steps of the MCOA analysis, as applied to mul-tiple phylogenetic trees, are the same as before. Each one ofthe K trees is transformed into a distance matrix (fig. 1b,step1), and the square root of the distance matrices arethen computed and used to perform K-independent PCOs(fig. 1b, step2). We thus obtain KEuclidean spaces, one foreach initial tree. Each one of these spaces can be seen as

    a set of points (scatter plots on fig. 1b) distributed ona space having n 1 dimensions (or axes), where n isthe number of leaves in each tree. The basic goal of MCOAis then to find a set of K vectors (one for each of the KEuclidean spaces) called auxiliary vectors (from u11 tou1K), as well as a single reference vector (called m

    1), so thatthe sum of the covariances between each of the uKauxiliaryvectors and the m1 vector is maximized (fig. 1b, step 3). Ananalytic demonstration of this maximization has beengiven by Chessel and Hanafi (1996). The obtained m1 vectoris the first axis of the MCOA analysis. The same method isthen applied to obtain a m2 vector by maximizing the sum

    of the covariance between this vector and K u2

    vectors(from u21 to u

    2K). Thus m

    2 becomes the second axis of theMCOA. This is done recursively for retrieving the n axesof the MCOA analysis.

    Phylo-MCOA: Using MCOA for PhylogeneticAnalysisFollowing the analysis described above, the reference posi-tion of each species (coordinates of the species projectedon the m1, m2, . . . mn vectors) as well as the position of thespecies in each individual tree (coordinates of the speciesprojected on the u1, u2, . . . , un auxiliary vectors) can becomputed. It is possible to plot the reference positions

    of each species, as well as their position in each individualtree, using the two first axes of the MCOA analysis (fig. 1bpart 4). This plot is usually called the cohesion plot. The setof reference positions of the species in the cohesion plot iscalled consensus typology. For a given species, the distancebetween its reference position and its position in each in-dividual tree is represented by lines. Long lines representcases were the species has a position in a given tree thatis not concordant with its position in all other trees (andtherefore neither in the consensus typology). On the con-

    trary, short lines represent cases where the position of thespecies in a given tree is roughly the same as in all the othertrees (and therefore in the consensus typology). The cohe-sion plot thus contains N Tdots, where N represents thenumber of species in each tree and T the number of trees.

    The cohesion plot is very informative for small data sets,where one can assess the position of each species in eachindividual tree relative to its reference position. However,this plot becomes difficult to read and to interpret asthe number of genes and species increases because specieslabels superimpose and the cloud of points and linesbecomes overcrowded. Moreover, the cohesion plot gives

    only a graphical representation of the data and does notallow the extraction of statistically significant outlier genesand species. Therefore, we show instead the informationcontained in cohesion plots but using a different and morereadable representation: the two-Way Reference matrix(hereafter referred to as the 2WR matrix). The 2WR matrixis built by computing, for every species, the distance be-tween its reference position and its position in each indi-vidual tree in the multidimensional space. The rows in the2WR matrix represent species and columns represent genes(fig. 2a). Note that the order of species (rows) in the 2WRmatrix follows the average relative position of species inthe gene trees. In other words, two species that are close

    FIG. 2. Detection of complete outliers by Phylo-MCOA. (a) The 2WR matrix is computed by calculating, for every species, the distance

    separating its position in each gene tree to its reference position (see text). The 2WR matrix provides information about species in rows and

    genes in columns. (b) The 2WR matrix is transformed into two binary matrices (2WRBgn and 2WRBsp) by detecting outlier cells as described in

    the Methods section. (c) A score is computed for each gene from the 2WRBgn matrices and for each species from the 2WRBsp matrix and

    represented by a bar plot. The dashed gray line on the bar plot represents the threshold chosen by the user over which a gene or a species is

    considered as a complete outlier.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    4

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    5/12

    in the 2WR matrix (two contiguous rows) are two speciesthat are closely related in the gene trees.

    Detection of OutliersPhylo-MCOA can detect two types of outliers: completeoutliers and cell-by-cell outliers. Complete outliers are oftwo types: complete outlier species and complete outliergenes. Complete outlier genes place all the species in a posi-tion different from their positions in the other gene trees,whereas complete outlier species are species whose posi-tion with respect to other species is different in all the genetrees. Cell-by-cell outliers represent species whose position

    in some specific gene trees is not concordant with theirposition in the other gene trees (i.e., single cells in the2WR matrix whose value seems outlier with respect tothe other values in the matrix).

    Cell-by-cell outliers cannot be detected by Phylo-MCOAfrom the 2WR matrix if complete outliers are present be-cause it hides the cell-by-cell signal. The detection of out-liers by Phylo-MCOA is thus performed in multiple steps: 1)Phylo-MCOA analysis of the original data set; 2) detectionof complete outliers; 3) removal of the complete outliersfrom the original data set; 4) Phylo-MCOA analysis ofthe new data set; and 5) detection of cell-by-cell outliers. If

    no complete outliers are detected, steps 3 and 4 are omitted.For detecting complete outliers (fig. 2), the 2WR matrix

    is separated in two matrices, (2WRBsp and 2WRBgn, fig. 2b)by detecting outlier cells. For constructing the 2WRBsp ma-trix, a cell is considered as an outlier (and assigned the value1), if its value in the 2WR matrix is higher than Q3 kIQR,where Q3 represents the third quartile of the distributionof values of its column and IQR represents the interquartilerange of these values. Symmetrically, for the 2WRBgn matrix,a cell is considered as outlier if its value in the 2WR matrix ishigher than Q3 kIQR, where Q3 and IQR are now com-puted based on the values in the row. We set k5 1.5 as thedefault value as is classically done when detecting outliers

    with this method. The two obtained binary matrices arerepresented in figure 2b. From the 2WRBgn matrix, a scoreis computed for each gene by computing the proportion of1s in its column. From the 2WRBsp matrix, a score is com-puted for each species by computing the proportion of 1sin its row. This leads to the two barplots presented in figure2c. In this example, gene 2 and gene 13 are detected as out-liers by more than 90% of the species, and t16 and t33 aredetected as outlier species by more than 90% of the genes.

    The user can fix a threshold for considering that a speciesor a gene is a complete outlier. By default, we set thisthreshold to 50% (i.e., species detected as outlier for more

    than 50% of the genes are seen as complete outliers andgenes detected as outliers for more than 50% of the speciesare considered as complete outliers).

    For the detection of cell-by-cell outliers (fig. 3), the 2WRmatrix (fig. 3a) is transformed into the 2WRgn and 2WRsp

    matrices by normalizing the 2WR matrix by rows and col-umns, respectively; the product of these two normalizedmatrices gives the 2WRgnsp matrix that is used for thecell-by-cell outlier detection.

    The 2WRgnsp matrix is first transformed into a binarymatrix by detecting outlier cells, according to the valuesin the rows, and to a second binary matrix by detecting

    outlier cells according to the values in the columns. Thesetwo binary matrices are finally multiplied in order to con-serve only cells detected as outliers for both rows and col-umns. This last binary matrix is called 2WRBgnsp (fig. 3b).

    This procedure for detecting cell-by-cell outliers is, how-ever, too permissive, and therefore, a second filtering step isnecessary. As explained before, Phylo-MCOA starts bytransforming trees into distance matrices. Species in thetrees are not independent from each other. Therefore, evena single modification in a tree will change all the values inthe associated distance matrix. When we detect a cell-by-cell outlier, we have to distinguish between true outliersand species that are affected by this outlier just because

    FIG. 3. Detection of cell-by-cell outliers by Phylo-MCOA. (a) The 2WR matrix is obtained as described in the text. (b) The 2WR matrix is then

    transformed into the 2WRBgnsp binary matrix by successive normalizations and detection of outliers (see text). This matrix contains islands of

    outliers representing groups of closely related species erroneously detected as outliers surrounding the true outlier. (c) Islands are removed by

    considering as true outlier only the cell having the maximum value in the island. This maximum is taken from the 2WRgnsp matrix.

    Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

    5

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    6/12

    they are close in the distance matrix. To solve this problem,we take advantage of a Phylo-MCOA property that ordersspecies in the 2WR matrix by their mean proximity in thecohesion plot. When two or more species that are contig-uous in the 2WRBgnsp matrix for a given gene are detectedas outliers, we define this group as an island (dashedcircles in fig. 3b). In each island, we need to distinguish be-tween the true outlier species and the contiguous species

    that are erroneously detected as outliers. To do so, we con-sider that the outlier cell with the highest value in the2WRgnsp is the true outlier, and the contiguous cells areartifacts. This filtering step allows obtaining a binary matrixfree of islands that contains only true cell-by-cell outliers(fig. 3c).

    Handling Tree and Node SupportWhen comparing phylogenetic trees, it is very importantto consider the statistical support of both trees and nodes.To take into account measures of tree support such as like-lihood scores, posterior probabilities, or any other suitable

    scores, Phylo-MCOA can assign different weights, specifiedby the user, to the different trees according to their sup-port. To do so, during the MCOA analysis, each table (rep-resenting Euclidean spaces, see fig. 1) is multiplied by thesquare root of the weight that is associated to it (in abso-lute values). As a result, the values in the tables correspond-ing to a specific gene with a high weight are higher andthe importance of the corresponding auxiliary vectorsfor retrieving the m vectors also increases (fig. 1b, step3). In this sense, those genes will weigh more in the MCOA.It is also possible to take into account node support (e.g.,bootstrap values) when building the consensus typology,by choosing a cutoff level of node confidence below

    which nodes will be collapsed. In this way, poorly sup-ported nodes (i.e., unlikely species bipartitions) will haveless or no influence in the consensus typology.

    Handling Missing DataPhylo-MCOA is capable of analyzing data sets that areincompletely sampled (i.e., with missing data), in the caseswhere a given gene is not available for one or more speciesbut is found in all other species. To handle incomplete sam-ples, the distances that correspond to missing species ina given pairwise distance matrix are assigned the valueof the average distance obtained from all the other distance

    matrices where the species is present.

    Results

    Phylo-MCOA Evaluation: Simulated Data SetsSimulated data sets were generated in order to evaluate theefficiency and robustness of Phylo-MCOA. All the data setswere created in the same way: we generated a random treewith N species by random splitting of edges as imple-mented in function rtree of the ape package in R language(Paradis et al. 2004) and duplicated this gene tree in orderto get Tidentical gene trees. A new gene tree was randomlygenerated for each test performed. To create a complete

    outlier species, we changed its position randomly in allT trees. To generate a complete outlier gene, we replacedone of the gene trees by a completely random tree.

    Subsequently, we evaluated the efficiency of Phylo-MCOA by its ability to retrieve the correct complete outliergene and species, using the corresponding 2WR matrix. Alloutliers were detected based on nodal distances, using theoption distance 5 nodal in Phylo-MCOA.

    To simulate cell-by-cell outliers, we randomly sampledone species and one tree and changed randomly the posi-tion of the selected species in the selected tree. By repeat-ing this operation multiple times, we obtained data setswith a variable proportion of cell-by-cell outliers.

    Effect of Sample Size on the Ability of Phylo-MCOA to

    Retrieve Complete Outliers

    Simulated data sets were obtained containing between 10and 100 genes (10, 20, 30, 50, and 100, respectively) andbetween 10 and 100 species (10, 20, 30, 50, and 100, respec-tively); all possible combinations were explored. In all thesedata sets, we artificially introduced three complete outliergenes and three complete outlier species.

    We analyzed these data sets with Phylo-MCOA and eval-uated the proportion of complete outlier genes and speciesthat the method was able to retrieve (table 1). We observedthat as long as the number of both species and genes islarge enough (!20), Phylo-MCOA always correctly re-trieved the outliers (black cells in table 1). On the otherhand, when the number of species is small (10), Phylo-MCOA is unable to retrieve the outlier genes correctly,even for a large number of genes (first column, light gray

    cells in table 1). However, outlier species are easier to re-trieve as long as the sample size is large enough (!20), evenif there is a small number of genes (10, first row, dark graycells in table 1). Finally, for moderately small number ofoutlier species (520), outlier genes are easier to retrievethan species (second column, dark gray cells in table 1).

    Effect of the Proportion of Complete Outliers on the Ability

    of Phylo-MCOA to Retrieve Them

    We created a data set of 50 genes and 50 species and testedthe ability of Phylo-MCOA to retrieve complete outliersas their proportion increased. We tested data sets wherebetween 2% and 60% of the genes, and between 2% and

    Table 1. Effect of the Size of the Data on the Ability of Phylo-

    MCOA to Correctly Retrieve Complete Outlier Genes and Species.

    Retrieved

    Species/Genes

    Number of Species

    10 20 30 50 80 100

    Number of genes 10 0/0 0/0 1/0 3/0 3/0 3/0

    20 0/0 1/2 3/3 3/3 3/3 3/3

    30 0/0 0/3 3/3 3/3 3/3 3/3

    50 0/0 0/3 3/3 3/3 3/3 3/3

    80 0/0 0/3 3/3 3/3 3/3 3/3100 0/1 0/3 3/3 3/3 3/3 3/3

    NOTE.Three outlier genes and three outlier species were simulated. Each cellcontains two values separated by a slash (/): number of outlier species retrieved/number of outlier genes retrieved. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers byPhylo-MCOA.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    6

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    7/12

    60% of the species were outliers (with values 2%, 10%, 20%,40%, and 60%). All possible combinations of these propor-tions were tested. We then evaluated the number of com-plete outlier genes and species that Phylo-MCOA was ableto retrieve correctly in each case (table 2). We observedthat even when 20% of genes and species are outliers, Phy-lo-MCOA was still able to retrieve almost all of them cor-rectly (black cells in table 2). A large proportion of outlier

    genes (40%) does not prevent retrieving outlier species.Similarly, a large proportion of outlier species is not a prob-lem for retrieving outlier genes (dark gray cells in table 2).However, if a large proportion of both genes and species areoutliers, this will make it difficult to retrieve outliers (lightgray cells in table 2).

    Effect of the Proportion of Missing Data on the Ability of

    Phylo-MCOA to Retrieve Complete Outliers

    In typical biological data sets, there are plenty of missingdata, therefore, methods for real data set analysis shouldbe robust to incompletely sampled data sets. In order toevaluate the ability of Phylo-MCOA to retrieve outlier

    genes and species despite missing data, we created a dataset with 100 genes and 100 species containing three com-plete outlier species and five complete outlier genes. Weincreased the proportion of missing data progressivelyby randomly removing species from the original set ofgenes. We removed 1%, 2%, 5%, 10%, 15%, 20%, 30%,and 50% of the data, respectively (table 3). Note that a pro-portion of 20% of missing data means that, on average,each tree has lost 20% of its leaves. We observed thatPhylo-MCOA is very robust to missing data. Even with20% of the data missing, the three complete outlier speciesand one of the five complete outlier genes were correctlyretrieved (table 2).

    Capacity of Phylo-MCOA to Detect Complete Outliers in the

    Presence of Cell-by-Cell Outliers

    The detection of cell-by-cell outliers by Phylo-MCOA isregularly performed on a data set from which completeoutliers have been removed. In order to verify whetherPhylo-MCOA is able to correctly detect complete outliersin the presence of cell-by-cell outliers, we simulated a dataset with 100 genes and 100 species containing three com-

    plete outlier species and three complete outlier genes, andevaluated the ability of Phylo-MCOA to correctly detectthe complete outliers given different numbers of cell-by-celloutliers, from 1 to 1,500. Results are shown in table 4.If,1,200 cell-by-cell outliers are present, Phylo-MCOA isalways able to retrieve correctly the complete outliers. For1,500 cell-by-cell outliers, all complete outlier genes are stillretrieved, but the complete outlier species are not retrieved.

    Capacity of Phylo-MCOA to Retrieve Cell-by-Cell Outliers

    As previously explained, the detection of cell-by-cell out-liers by Phylo-MCOA is performed on a data set free of

    complete outliers, therefore, for this test complete outliersare excluded. We analyzed 100 genes and 100 species witha variable proportion of cell-by-cell outliers and evaluatedthe proportion that was detected by Phylo-MCOA. Foreach number of cell-by-cell outliers, we repeated the oper-ation 20 times in order to have an estimation of the stan-dard deviation of the mean proportion of cell-by-celloutliers retrieved (table 5). We observed that the mean pro-portion of cell-by-cell outliers retrieved decreased veryslowly, starting at 100% (SD 5 0) for two and five cell-by-cell outliers included, to more than 85% for 1,000cell-by-cell included. Phylo-MCOA is thus able to detectcell-by-cell outliers even when there are several of them.

    Table 2. Effect of the Proportion of Outlier Genes and Species in the Ability of Phylo-MCOA to Correctly Retrieve Them.

    Retrieved Species/Genes (%)

    Proportion of Complete Outlier Species [Number]

    2% [1] 10% [5] 20% [10] 40% [20] 60% [30]

    Proportion of complete outlier genes [number] 2 [1] 1/1 5/1 9/1 0/1 0/0

    10 [5] 1/5 5/5 8/5 0/5 0/0

    20 [10] 1/10 5/10 10/10 0/0 0/0

    40 [20] 1/0 5/0 0/0 0/0 0/0

    60 [30] 0/0 0/0 0/0 0/0 0/0

    NOTE.The data set was composed of 50 genes and 50 species. The proportions of outliers are given in percentage and the absolute number of outliers is given underbracket ([ ]). Each cell contains two values separated by a slash (/): number of complete outlier species retrieved/number of complete outlier genes retrieved. Nodaldistances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.

    Table 3. Effect of the Proportion of Missing Data on the Number

    of Complete Outliers Retrieved (genes and species).

    Proportion of Missing Data

    1% 2% 5% 10% 20% 30% 50%

    Species retrieved 3 3 3 3 3 0 0

    Genes retrieved 5 5 5 5 1 0 0

    NOTE.The data set was comprised of 100 genes and 100 species with fivecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.

    Table 4. Number of Complete Outliers Retrieved with Increasing

    Number of Cell-by-Cell Outliers.

    Number of Cell-by-Cell Outliers

    1 10 50 100 300 500 1,000 1,200 1,500

    Species retrieved 3 3 3 3 3 3 3 3 0

    Genes retrieved 3 3 3 3 3 3 3 3 3

    NOTE.The data set was comprised of 100 genes and 100 species with threecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.

    Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

    7

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    8/12

    Phylo-MCOA Evaluation: Real Data SetWe used Phylo-MCOA on a real data set comprising 246gene trees for 21 fungal species. This data set was used pre-viously by Aguileta et al. (2008) for retrieving the best genesto reconstruct the phylogeny of fungi, and the data areavailable in FUNYBASE (Marthey et al. 2008). Aguiletaet al. (2008) had compared each individual gene tree tothe species tree obtained from the concatenation of the246 genes. From this comparison, they had obtained a scorefor each gene tree representing how suitable it was for find-ing the reference (concatenated) tree. Using Phylo-MCOA,we analyzed the data set of Aguileta et al. (2008) using

    nodal distance between species and obtained a score foreach species and gene as explained in the Materials andMethods section. These scores are plotted in figure 4 forthe genes (top) and for the species (bottom). Phylo-MCOAdetected five complete outlier genes (black bars in fig. 4)and no complete outlier species. The fact that no speciesappeared as outliers in this data set is not surprisingbecause the goal of their work was to find optimal genemarkers for reconstructing the fungal species tree, there-fore, species producing incongruent trees were removedfrom the data set in Aguileta et al. (2008). In order to checkfor consistency, we compared the score that Aguileta et al.obtained for each gene (a topological similarity score re-

    ferred to as topscore) to the score we obtained with Phy-lo-MCOA. Figure 5 (white dots) represents the scatter plot

    of these two scores. Phylo-MCOA is able to retrieve thecorrect ranking of trees and to identify genes that canbe considered as outliers. We found that trees consideredas outliers in Aguileta et al. (2008) were indeed also outliersaccording to the analysis using Phylo-MCOA (gray arrowsin fig. 5). When using patristic distances instead of nodaldistances for the Phylo-MCOA analysis (black dots infig. 5), the same trend is still present but the correlationwith the topscore values of Aguileta et al. (2008) seemsto slightly decrease. This is expected because the topscoreused by Aguileta et al. (2008) is a topological distancemeasure, thus more similar to Phylo-MCOA with nodal

    distances than to Phylo-MCOA with patristic distances.Overall, this illustrates the general utility of Phylo-MCOA for analyzing real data sets. Moreover, the methodis very fast: using the set of trees in Aguileta et al. (2008), ittook about 1 min on a pc laptop.

    Using Phylo-MCOA for Filtering Phylogenomic DataSetsDeep relationships in the animal phylogeny are still de-bated. In 2009, Schierwater et al. (2009) proposed a phylog-eny of the Metazoa that was questioned because of the lowquality of the data set used to construct it. More precisely,Philippe et al. (2011) detected important problems in thedata set of Schierwater et al. (2009), such as contamina-tions, deep paralogies, or horizontal gene transfers. They

    FG1000

    FG1010

    FG1015

    FG1020

    FG1021

    FG1024

    FG1031

    FG1046

    FG1049

    FG1056

    FG1063

    FG1073

    FG1108

    FG436

    FG459

    FG465

    FG478

    FG487

    FG491

    FG499

    FG503

    FG507

    FG508

    FG524

    FG525

    FG529

    FG533

    FG534

    FG535

    FG542

    FG543

    FG546

    FG549

    FG551

    FG552

    FG556

    FG559

    FG569

    FG570

    FG575

    FG576

    FG578

    FG579

    FG583

    FG586

    FG588

    FG589

    FG591

    FG595

    FG598

    FG610

    FG612

    FG616

    FG621

    FG627

    FG630

    FG634

    FG635

    FG636

    FG638

    FG644

    FG646

    FG649

    FG651

    FG652

    FG657

    FG659

    FG662

    FG665

    FG668

    FG673

    FG678

    FG679

    FG684

    FG685

    FG686

    FG687

    FG689

    FG690

    FG691

    FG692

    FG694

    FG695

    FG696

    FG698

    FG699

    FG702

    FG704

    FG705

    FG708

    FG716

    FG718

    FG720

    FG722

    FG723

    FG730

    FG734

    FG735

    FG740

    FG746

    FG747

    FG750

    FG752

    FG756

    FG757

    FG758

    FG761

    FG762

    FG763

    FG771

    FG781

    FG784

    FG788

    FG789

    FG798

    FG805

    FG813

    FG819

    FG821

    FG825

    FG832

    FG834

    FG841

    FG844

    FG848

    FG850

    FG854

    FG855

    FG861

    FG862

    FG863

    FG864

    FG866

    FG870

    FG878

    FG884

    FG893

    FG894

    FG896

    FG897

    FG898

    FG901

    FG909

    FG912

    FG916

    FG927

    FG933

    FG942

    FG948

    FG964

    FG975

    FG994

    MS241

    MS277

    MS282

    MS294

    MS302

    MS304

    MS305

    MS307

    MS311

    MS313

    MS320

    MS325

    MS329

    MS333

    MS334

    MS337

    MS345

    MS348

    MS351

    MS353

    MS355

    MS358

    MS359

    MS361

    MS362

    MS368

    MS375

    MS377

    MS378

    MS379

    MS380

    MS382

    MS384

    MS386

    MS393

    MS394

    MS395

    MS397

    MS398

    MS399

    MS400

    MS401

    MS407

    MS408

    MS409

    MS411

    MS413

    MS414

    MS415

    MS417

    MS418

    MS422

    MS424

    MS428

    MS429

    MS430

    MS431

    MS432

    MS437

    MS441

    MS442

    MS444

    MS447

    MS448

    MS452

    MS453

    MS454

    MS456

    MS459

    MS460

    MS462

    MS463

    MS467

    MS470

    MS481

    MS484

    MS485

    MS486

    MS487

    MS489

    MS493

    MS497

    MS501

    MS517

    MS524

    MS529

    MS532

    MS541

    MS547

    MS549

    MS550

    MS561

    MS578

    MS621

    Genes

    Outlierdetectionproportion

    0.0

    0.2

    0.4

    0.6

    0.8

    Sce

    Spa

    Pch

    Cne

    Uma

    Yli

    Dha

    Clu

    Spo

    Ssc

    Ani

    Afu

    Cim

    Mgr

    Ncr

    Tre

    Fgr

    Ago

    Kla

    Cgl

    Sba

    Species

    outlierdetectionproportion

    0.00

    0.02

    0.04

    0.06

    0.08

    0.10

    0.12

    FIG. 4. Bar plots for the fungal data showing the score calculated for each gene (top) and each species (bottom). Black bars represent significant

    outliers according to the threshold chosen by the user (here 0.4). Nodal distances were used for the Phylo-MCOA analysis in this case.

    Table 5. Mean Proportion of Cell-by-Cell Outliers Retrieved Out of 20 Repetitions with Increasing the Number of Cell-by-Cell Outliers.

    Proportion Retrieved

    Number of Cell-by-Cell Outliers

    2 5 10 20 50 100 200 300 400 500 600 700 800 900 1,000

    Mean 1 1 0.975 0.973 0.960 0.947 0.938 0.932 0.923 0.907 0.897 0.888 0.888 0.871 0.863

    SD 0 0 0.055 0.034 0.030 0.022 0.018 0.015 0.011 0.011 0.014 0.011 0.010 0.008 0.009

    NOTE.The data set was comprised of 100 genes and 100 species. Nodal distances were used in the Phylo-MCOA analysis.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    8

  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    9/12

    reanalyzed the data set of Schierwater after having replacedproblematic sequences by correct ones and showed thatthe obtained tree, even if it had less supported nodes,was more realistic given previous studies.

    In order to test whether Phylo-MCOA is able to filter

    problematic sequences from a real data set, we analyzed

    the set of phylogenetic trees used by Schierwater et al.(2009). We removed the outliers detected by Phylo-MCOA,built a phylogeny using only the filtered data set (i.e., with-out problematic sequences) and compared our tree withthe one obtained without filtering and the one producedby Philippe et al. (2011) after filtering the data.

    To obtain the animal phylogeny, we used the originaldata set of Schierwater et al. (2009). It was first aligned

    and trimmed with the same tools and parameters as inPhilippe et al. (2011), and the individual gene trees werereconstructed using the same model as Philippe et al.(2011). As in Philippe et al. (2011), four genes (AT8,DCR, N4L, and ND6) were removed from the analysis be-cause they contained too few positions after trimming. Allthe trees were analyzed with Phylo-MCOA using patristicdistances and all default parameters for the outlier detec-tion. Four complete outliers genes were detected by Phylo-MCOA and removed before performing the cell-by-celldetection: rrna16S, rrna18S, rrna28S, and rrna58S. Of the984 cells in the 2WR matrix (41 genes 24 species), 21

    cell-by-cell outliers were detected (see supplementary tableS1, Supplementary Material online) and were removedfrom the original data set (by removing the sequencesin the original sequence files). The remaining sequences(available as supplementary dataset S1, Supplementary Ma-terial online) were then concatenated and used to recon-struct a phylogenetic tree with 100 bootstrap replicatesusing RAxML (Stamatakis 2006) and the same model as inSchierwater et al. (2009). The tree we obtained withoutfiltering (fig. 6a) was similar to the one proposed by Schier-water et al. (2009), which is known to contain noisy dataand thus produce an inconsistent topology. After filtering,there was a general decrease of bootstrap support for no-

    des in the diploblasts group (fig. 6b), while the support

    FIG. 5. Scatter plot of the topological congruence (top score)

    between 246 fungal gene trees and the consensus species tree

    obtained from the concatenation of individual trees (x axis) againstthe score of the genes according to Phylo-MCOA (y axis). Black dots

    represent the scores when using patristic distances. White dots

    represent the scores when using nodal distances. Outlier genes

    according to Phylo-MCOA and identified by black bars in figure 4

    are shown with small gray arrows. They appear to be poorly

    congruent with the species tree (small values on the x axis).

    FIG. 6. Comparison of the phylogeny of animals from the Schierwater et al. (2009) data set. (a) Tree obtained from the original sequences.

    (b) Tree obtained after removal of cell-by-cell outliers using Phylo-MCOA. Species with a star (*) represent Porifera. The gray arrows point to

    the node support of the diploblasts group in the two trees. Nodes with black dots represent bootstrap supports higher than 95. A hundred

    bootstrap replicates were performed for each tree.

    Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

    9

    http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1
  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    10/12

    of the nodes outside the diploblasts were almost notaffected. Overall, the bootstrap support for the mono-phyly of diploblasts went from 100 in the original dataset to 63 in the filtered data set (gray arrows in fig. 6),which is in accordance with the accepted vision that dip-loblasts are certainly not monophyletic. The monophylyof Porifera (species with an * in fig. 6) was lost, which isalso in accordance with previous studies and coherent

    with the result of Philippe et al. (2011) on this dataset, who obtained a node for the Porifera with a supportvalue of 36 (Philippe et al. 2011).

    Phylo-MCOA was thus able to identify true outliers thatneed to be filtered out prior to further phylogenetic anal-ysis in order to estimate a more consistent topology. How-ever, we could not retrieve exactly the same topology as theone obtained by Philippe et al. This can be due to the factthat we do not replace the sequences identified as prob-lematic by correct sequences as done in Philippe et al.(2011). Moreover, Phylo-MCOA considers that the correctsignal is the one that is present in the majority of the genes,

    but this is not true in this data set for the Hexactinellida,where the majority of the sequences are slow evolvingmitochondrial sequences being contaminations from theDemospongiae, while true nuclear sequences of Hexactinel-lidae are fast evolving (Philippe et al. 2011). As a conse-quence, Hexactinellida and Demospongiae still forma very solid group after filtering, and the branch lengthleading to Hexactinellida is reduced in the new tree(fig. 6b) compared with the original tree (fig. 6a).

    The Phylo-MCOA Program: Availability andRequirementsPhylo-MCOA is provided as a set of R functions whichmake use of the ade4 (Chessel et al. 2004; Dray et al.2007), ape (Paradis et al. 2004), and lattice (for graphicalfunctions, Sarkar 2010) packages and newly developedfunctions. It requires a recent version of R (.2.8.0) installedon the machine. Phylo-MCOA is freely distributed underthe GNU General Public License. It can be downloadedat http://phylomcoa.cgenomics.org.

    Discussion

    It has become a standard practice to use multiple genes inorder to obtain a reliable and concordant phylogeny. When

    sampling many different genes for the same set of species,the individual gene trees will often be partly incongruent. Inthis case, one is interested in detecting which subset ofgenes and species is responsible for the discrepancies, inother words, in finding the outliers. Phylo-MCOA, themethod proposed here, provides a quick and powerfulway to detect outliers by integrating the phylogenetic in-formation contained in all the individual gene trees. Phylo-MCOA determines how different the compared topologiesare but more importantly, it identifies which genes and spe-cies explain those differences. This may prove very usefulwhen determining possible causes for incongruities, suchas horizontal gene transfers, different selective pressures,

    recombination events, among others. Outlier genes are typ-ically affected by these factors, so identifying which subsetof species explain the differences may help to focus theefforts for studying the affected species. On the other hand,by building a consensus typology Phylo-MCOA also findsthe genes that produce concordant phylogenies. Thesegenes are excellent candidates for recovering the speciestree, which can be subsequently verified by other methods.

    Phylo-MCOA thus provides good candidates, not onlyfor outliers, but for phylogenetic markers producingconcordant topologies.

    To our knowledge, this is the first time that MCOA isapplied directly to analyze phylogenetic distances. Basedon a covariation optimization criterion, MCOA is a methodthat captures the similarities (and dissimilarities) obtainedfrom the ordination of several tables (types) of data, as ithas been previously demonstrated for the analysis of eco-logical data (Bady et al. 2004; Hedde et al. 2005; Laloe et al.2007). Furthermore, it is a statistically solid technique thatlends itself well to the analysis of phylogenetic data (i.e., in

    terms of topological distances), where the covariation ofthe histories of genes and species can be integrated in a sin-gle ordination analysis. Even though a consensus typologyis proposed by the method, MCOA takes into account allthe gene trees in order to build it. We believe this makesa better use of the individual tree information than otherforms of phylogenetic consensus that do not take intoaccount outliers or use only a part of the available data.Our method is unique in simultaneously detecting outlierspecies and genes from the comparison of multiple genetrees.

    One important aspect of tree comparisons involves theestimation of pairwise distance matrices. The Eigen decom-

    position and further ordination represented in the consen-sus typology take the distances as absolute distances.Therefore, the consensus typology will change dependingon the type of distance that is actually being measured.There are two ways to calculate the distance betweentwo species in a phylogenetic tree: the sum of the branchlengths or the number of nodes separating the species inthe trees. Both distance metrics have advantages and dis-advantages: using the sum of the branch lengths (patristicdistances) allows incorporating the relative rates of evolu-tion but can artificially bring closer species that have smallbranch lengths and separate species with longer branches.

    Conversely, if the internodal distance separating species isused, then the consensus typology informs us about a de-gree of phylogenetic proximity, but it tells us nothing aboutthe relative rates of evolution. It is important to be explicitabout the pairwise distance method employed in orderto derive conclusions, as genes identified as outliers whenusing one method may seem concordant with other geneswhen using another one. Comparing the results given byboth methods can be very informative. A gene that is iden-tified as an outlier when using patristic distances but notwhen using nodal distances will represent a gene evolvingat higher or slower rates in some species compared withother genes.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    10

    http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/
  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    11/12

    Phylo-MCOA allows weighing trees according to differ-ent scores (e.g., likelihood scores), which ensures that morecredible trees are given more weight in the analysis. It isequally important to assign weights to nodes to distinguishsupported relationships from spurious or dubious ones. Inprinciple, one could incorporate a measure of node supportinto MCOA metrics; however, this is not straightforward.At present, we have arrived at a solution that may alleviate

    the problem of node support assignment to some extent.We have incorporated the possibility of collapsing lowlysupported nodes, according to a cutoff value given bythe user. In this way, the nodes that are poorly supportedwill contribute little or nothing to the construction of theconsensus typology. Assessing node support across phylog-enies is difficult, especially with cases of missing or dupli-cated data, as not all nodes will be present in all trees. Forthis reason, we suggest the use of Treeko (Marcet-Houbenand Gabaldon 2011) for trees with duplicated genes, priorto analysis with Phylo-MCOA. Interestingly, our methodcan be used even in the absence of completely sampled

    data sets. The program can deal with gene losses, as dem-onstrated by randomly deleting species from the simulateddata analyzed.

    We have shown that Phylo-MCOA is able to correctlydetect outlier genes and species from the comparison ofmultiple phylogenetic trees. We have shown this by ana-lyzing simulated data sets but, more interestingly, fromthe analysis of real data sets. We have shown, with two dif-ferent examples, how Phylo-MCOA can detect trueoutliers. First, with the fungal data set by Aguileta et al.(2008), where we found the same concordant and discor-dant genes as in the original study; and second, where weuse Phylo-MCOA to predict which cell-by-cell outliers need

    to be removed from phylogenetic reconstruction in orderto recover the animal phylogeny obtained after removingnoisy data, as done by Philippe et al. (2011).

    Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Phylo-MCOA can detecttwo types of outliers. Complete outliers can identify whichgenes and species deviate from the evolutionary history ofmost genes analyzed. This information can be very usefulfor filtering out noisy data in phylogenetic reconstruction.On the other hand, cell-by-cell outliers point to particularcases of topological discordance, where one specific species

    is an outlier in a given gene. This information can be used topostulate specific hypotheses concerning particular genesor species. Although cell-by-cell outlier detection is imple-mented in the current Phylo-MCOA version, there is roomfor improvement, specifically in the way outliers are scoredto define significance.

    Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Nevertheless, caution shouldalways be exerted because the method will give moreweight to the information given by the majority of genesunder study, and if the data set analyzed is plaguedwith dubious sequences, the method will favor an incorrect

    signal. This is what we observed when we analyzed the dataset ofSchierwater et al. (2009) that is known to be riddenwith multiple faulty sequences. External information on thequality of the sequences should be employed wheneverpossible prior to analysis with our method.

    Supplementary Material

    Supplementary table S1 and data set S1 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

    Acknowledgments

    We would like to thank Herve Philippe for the time hespent on this manuscript and the helpful commentsand suggestions he made. We thank David Moreira for con-structive discussions, Avner Bar-Hen for interesting com-ments on the use of MCOA for phylogenomics analysis,Vincent Berry for suggestions on future developments

    and comments on distances between species, and FranSupek for his suggestions on statistical significance testing.Finally, we thank Tatiana Giraud and Toni Gabaldon forinsightful comments on the manuscript as well as threeanonymous reviewers. D.M.V. acknowledges postdoctoralgrants from the Centre National de la RechercheScientifique (CNRS) and Universite Paris-Sud 11. G.A. ac-knowledges postdoctoral grants from the CNRS, UniversiteParis-Sud 11, and Agence National de la Recherche (ANR-06-BLAN-0201).

    ReferencesAguileta G, Marthey S, Chiapello H, Lebrun MH, Rodolphe F,

    Fournier E, Gendrault-Jacquemard A, Giraud T. 2008. Assessing

    the performance of single-copy genes for recovering robust

    phylogenies. Syst Biol. 57:613627.

    Ane C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian

    estimation of concordance among gene trees. Mol Biol Evol. 24:

    412426.

    Bady P, Doledec S, Dumont B, Fruget JF. 2004. Multiple co-inertia

    analysis: a tool for assessing synchrony in the temporal

    variability of aquatic communities. C R Biol. 327:2936.

    Berthouly C, BedHom B, Tixier-Boichard M, Chen CF, Lee YP,

    Laloe D, Legros H, Verrier E, Rognon X. 2008. Using molecular

    markers and multivariate methods to study the genetic diversity

    of local European and Asian chicken breeds. Anim Genet. 39:

    121129.

    Brochier C, Bapteste E, Moreira D, Philippe H. 2002. Eubacterial

    phylogeny based on translational apparatus proteins. Trends

    Genet. 18:15.

    Chessel D, Dufour AB, Thioulouse J. 2004. The ade4 package-I-one-

    table methods. R News. 4:510.

    Chessel D, Hanafi M. 1996. Analyses de la co-inertie de K nuages de

    points. Rev Stat Appl. 44:3560.

    Cranston KA, Rannala B. 2007. Summarizing a posterior distribution

    of trees using agreement subtrees. Syst Biol. 56:578590.

    de Vienne DM, Aguileta G, Ollier S. 2011. The Euclidean nature of

    phylogenetic distance matrices. Syst Biol. 60:826832.

    Degnan JH, Rosenberg NA. 2006. Discordance of species trees with

    their most likely gene trees. PLoS Genet. 2:762768.

    Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

    11

    http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1
  • 7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

    12/12

    Dray S, Dufour AB, Chessel D. 2007. The ade4 package-II: two-table

    and K-table methods. R News. 7:4752.

    Edwards SV, Liu L, Pearl DK. 2007. High-resolution species trees

    without concatenation. PNAS 104:59365941.

    Felsenstein J. 1985. Confidence-limits on phylogeniesan approach

    using the bootstrap. Evolution 39:783791.

    Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species

    phylogenies from multiple genes: concatenated sequence tree

    versus consensus gene tree. J Exp Zool B Mol Dev Evol. 304:6474.

    Gordon AD. 1986. Consensus supertreesthe synthesis of rootedtrees containing overlapping sets of labeled leaves. J Classif. 3:

    335348.

    Gower JC. 1984. Distance matrices and their Euclidean approxima-

    tion. In: Jambu M, Diday E, Lebart L, Pages J, Tomassone R,

    editors. Data analysis and informatics III. Amsterdam (The

    Netherlands): Elsevier. p. 321.

    Gower JC, Legendre P. 1986. Metric and Euclidean properties of

    dissimilarity coefficients. J Classif. 3:548.

    Hedde M, Lavelle P, Joffre R, Jimenez JJ, Decaens T. 2005. Specific

    functional signature in soil macro-invertebrate biostructures.

    Funct Ecol. 19:785793.

    Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of

    tree space. Syst Biol. 54:471482.

    Huson DH, Bryant D. 2006. Application of phylogenetic networks inevolutionary studies. Mol Biol Evol. 23:254267.

    Laloe D, Jombart T, Dufour AB, Moazami-Goudarzi K. 2007.

    Consensus genetic structuring and typological value of

    markers using multiple co-inertia analysis. Genet Sel Evol.39:545567.

    Leigh JW, Schliep K, Lopez P, Bapteste E. 2011. Let them fall where

    they may: congruence analysis in massive, phylogenetically

    messy datasets. Mol Biol Evol. 28:27732785.

    Leigh JW, Susko E, Baumgartner M, Roger AJ. 2008. Testing

    congruence in phylogenomic analysis. Syst Biol. 57:104115.

    Manly BFJ. 2004. Multivariate statistical methods: a primer. 3rd ed.

    London: Chapman & Hall, Ltd.

    Marcet-Houben M, Gabaldon T. 2011. TreeKO: a duplication-aware

    algorithm for the comparison of phylogenetic trees.NucleicAcids Res. 39(10):e66.

    Marthey S, Aguileta G, Rodolphe F, et al. (11 co-authors). 2008.

    FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinfor-

    matics. 9:456.

    Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics

    and evolution in R language. Bioinformatics 20:289290.

    Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M,

    Worheide G, Baurain D. 2011. Resolving difficult phylogenetic

    questions: why more sequences are not enough. PLoS Biol. 9:

    e1000602.

    Rodriguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF,Philippe H. 2007. Detecting and overcoming systematic errors in

    genome-scale phylogenies. Syst Biol. 56:389399.

    Rosenberg NA. 2002. The probability of topological concordance of

    gene trees and species trees. Theor Popul Biol. 61:225247.

    Roure B, Rodriguez-Ezpeleta N, Philippe H. 2007. SCaFoS: a tool for

    selection, concatenation and fusion of sequences for phyloge-

    nomics. BMC Evol Biol. 7(1 Suppl):S2.

    Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees:

    assembling the trees of life. Trends Ecol Evol. 13:105109.

    Sarkar D. 2010. Lattice: lattice graphics [Internet]. R package version

    0.18-8. Available from: http://CRAN.R-project.org/package=lattice.

    Schierwater B, Eitel M, Jakob W, Osigus H-J, Hadrys H, Dellaporta SL,

    Kolokotronis S-O, DeSalle R. 2009. Concatenated analysis sheds

    light on early metazoan evolution and fuels a modernurmetazoon hypothesis. PLoS Biol. 7:e1000020.

    Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based

    phylogenetic analyses with thousands of taxa and mixed models.

    Bioinformatics 22:26882690.

    Suchard MA. 2005. Stochastic models for horizontal gene transfer:

    taking a random walk through tree space. Genetics 170:419431.

    Suchard MA, Weiss RE, Sinsheimer JS, Dorman KS, Patel M,

    McCabe ERB. 2003. Evolutionary similarity among genes. J Am

    Stat Assoc. 98:653662.

    Susko E, Leigh J, Doolittle WF, Bapteste E. 2006. Visualizing and

    assessing phylogenetic congruence of core gene sets: a case

    study of the gamma-proteobacteria. Mol Biol Evol. 23:10191030.

    White WT, Hills SF, Gaddam R, Holland BR, Penny D. 2007. Treeness

    triangles: visualizing the loss of phylogenetic signal.Mol Biol Evol.24:20292039.

    de Vienne et al. doi:10.1093/molbev/msr317 MBE

    http://cran.r-project.org/package=latticehttp://cran.r-project.org/package=latticehttp://cran.r-project.org/package=lattice