Mol Biol Evol 2012 de Vienne Molbev_msr317

7/28/2019 Mol Biol Evol 2012 de Vienne Molbev_msr317

1/12

Phylo-MCOA: A Fast and Efficient Method to Detect OutlierGenes and Species in Phylogenomics Using MultipleCo-inertia Analysis

Damien M. de Vienne,*,1,2 Sebastien Ollier,2 and Gabriela Aguileta1,2

1Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG) and UPF, Barcelona, Spain2Laboratoire Ecologie, Systematique et Evolution, UMR 8079, Universite Paris-Sud 11, Orsay cedex, France

*Corresponding author: E-mail: [email protected] editor: Herve Philippe

Abstract

Full genome data sets are currently being explored on a regular basis to infer phylogenetic trees, but there are oftendiscordances among the trees produced by different genes. An important goal in phylogenomics is to identify whichindividual gene and species produce the same phylogenetic tree and are thus likely to share the same evolutionary history.On the other hand, it is also essential to identify which genes and species produce discordant topologies and thereforeevolve in a different way or represent noise in the data. The latter are outlier genes or species and they can providea wealth of information on potentially interesting biological processes, such as incomplete lineage sorting, hybridization,and horizontal gene transfers. Here, we propose a new method to explore the genomic tree space and detect outlier genesand species based on multiple co-inertia analysis (MCOA), which efficiently captures and compares the similarities in thephylogenetic topologies produced by individual genes. Our method allows the rapid identification of outlier genes andspecies by extracting the similarities and discrepancies, in terms of the pairwise distances, between all the species in all thetrees, simultaneously. This is achieved by using MCOA, which finds successive decomposition axes from individualordinations (i.e., derived from distance matrices) that maximize a covariance function. The method is freely available asa set of R functions. The source code and tutorial can be found online at http://phylomcoa.cgenomics.org.

Key words: phylogenomics, outlier genes, phylogenetic markers, multivariate analysis, fungal genomes.

Introduction

The analysis of genomic data sets has generated the press-ing need to compare a large amount of phylogenetic trees.

The trees under comparison may, for example, representa set of shared orthologous genes from a given organismor taxonomic group, sampled across their full genomes.The goal of such comparisons can be 2-fold: either to iden-tify the set of genes producing a concordant topology thatmay reflect the species history or to identify the set of out-lier genes that generate topological incongruities and arelikely to evolve differently or represent noise. While mostefforts have been devoted to detecting genes that produceconcordant gene trees, less attention has been given toidentifying outliers, although they are very importantcandidates for studying interesting biological processes.

Detecting outlier genes and species is particularly rele-

vant in phylogenomics because genes whose tree topolo-gies are discordant with respect to an overall topologicalconsensus are good candidates for evolving at significantlydifferent rates, to be under positive selection for functionaldivergence, to have been horizontally transferred, to beparalogous instead of orthologous genes, or to be randomlysampled from incompletely sorted lineages. On the otherhand, species that appear as outliers (i.e., that are placedvery differently in different gene trees) may representspecies that are more likely to experience gene transfers,

having sequencing errors, having undergone bottlenecks,

or other biological processes at the species level. Even

though topological discordances can be detected at thegene or the species level, most methods so far are designed

to look at genes only.In the context of phylogenetic gene tree comparison and

phylogenomics, several approaches and methods havebeen proposed to detect concordant and discordant topol-

ogies trying to integrate as much as possible the available

information. A solution that attempts to deal with all the

available data is the construction of a super matrix from

the alignment of all or a large number of the available

individual genes, an approach known as total evidence

(Gadagkar et al. 2005; Aguileta et al. 2008). However, most

methods involve some sort of consensus solution either

by classical consensus of multiple competing phylogenies

(Felsenstein 1985), by building a super tree (Gordon1986; Sanderson et al. 1998), by Bayesian sampling (Ane

et al. 2007), or a summary by agreement subtrees (Cranston

and Rannala 2007). Other approaches work by trying to

filter out the noise and/or increasing the phylogenetic

signal (Rodriguez-Ezpeleta et al. 2007; Roure et al. 2007)

or by clustering congruent loci (Leigh et al. 2008, 2011).Another possibility involves identifying the conflicting

bipartitions and representing all the possible alternativesin a graphic way, an approach that results in networks

The Author 2012. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, pleasee-mail: [email protected]

Mol. Biol. Evol. doi:10.1093/molbev/msr317 Advance Access publication January 3, 2012 1

MBE Advance Access published February 14, 2012
http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/


2/12

of genes (Huson and Bryant 2006), or a projection of theconflicting signals in 2Ds (White et al. 2007). Also, in orderto obtain a single tree from multiple individual trees of thesame organism or species, the coalescent approach can beused (Rosenberg 2002; Suchard et al. 2003; Suchard 2005;Degnan and Rosenberg 2006) or the Bayesian reconstruc-tion of gene trees taking the species tree as a prior (BEST)(Edwards et al. 2007). Finally, approaches for graphically

comparing phylogenetic trees include the use ofmultidimensional scaling to visualize tree-to-tree pairwisedistances (Hillis et al. 2005); the use of principal componentanalysis (PCA) to detect events of horizontal gene transfer(Brochier et al. 2002; Susko et al. 2006); and employingHadamard spectral analysis represented in a triangulargraph called Treeness Triangles to investigate the lossof phylogenetic signal (White et al. 2007). However, byconcatenating the multilocus data or by summarizing orobtaining a consensus of their individual gene trees, mostof the latter methods lose a wealth of potentially interest-ing information, especially by removing outlier data or

trees. Phylo-MCOA is unique in that it analyzes all the treessimultaneously without discarding data a priori. Further-more, as far as we know, the possibility to also detect out-lier species when comparing phylogenetic trees has notbeen proposed before.

Here, we present Phylo-MCOA, a fast method for com-paring multiple phylogenetic trees and detecting outliergenes and species based on multiple co-inertia analysis(MCOA) (Chessel and Hanafi 1996). The method allowsthe efficient identification of concordant and outliergenes and species, by enabling the quick comparison ofa large number of trees. It builds a consensus typologyby extracting the similarities, in terms of the pairwise

distances between all the species in all the trees, simulta-neously. The consensus typology is a 2D graphic represen-tation of the distance interrelationships between speciessupported by the majority of the genes. Phylo-MCOA findssuccessive decomposition axes from individual ordinations(derived from matrices of distances between species) thatmaximize a covariance function in order to build the con-sensus typology, and it estimates the contribution of eachindividual gene tree to this typology (i.e., to what extenteach gene deviates or agrees with what the majority ofgenes support). MCOA is expected to reveal the commonfeatures (i.e., pairwise distances and species placements) of

single-tree topologies, build a consensus typology and com-pare individual trees with this reference. Until now, MCOAhas been used with ecological (Bady et al. 2004; Hedde et al.2005) and genetic data (Laloe et al. 2007; Berthouly et al.2008); however, as far as we know, this is the first time thatMCOA is applied to phylogenetic analysis. Phylo-MCOA isunique in tracing the position of each species with respectto all other species across the compared phylogenies. Thesimultaneous analysis of all the species placements in alltrees makes it possible to detect, not only concordant genesbut also outlier gene trees, and even more importantly, itcan show which species andgenes explainthose differences.Our method is thus new in allowing the analysis of the data

at the species level for comparing the histories of differentspecies across individual gene trees.

Phylo-MCOA can be very useful for phylogenomic anal-yses, where big genomic data sets are typically analyzed.First, Phylo-MCOA readily identifies the outlier genesand species that cause topological incongruence. This isuseful for studying the evolution of specific genes, partic-ularly those that seem to differ from the rest of the genome.

The method will also identify candidate outlier species,which may undergo interesting biological processes atthe species level. Phylo-MCOA is unique in pointing notonly to outlier genes but also to outlier species. Second,when the goal is to construct a species tree or a tree thatreflects the evolution of most genes in the genome, Phylo-MCOA can be used to identify which genes producediscordant trees and remove them for subsequent phylo-genetic analysis. In this way, only the genes that share thesame evolutionary history are kept in order to build a treethat reflects the evolution of the species or that of most ofthe genome. Third, Phylo-MCOA can be used to investigate

specific evolutionary hypotheses including, but not limitedto, which genes may have been laterally transferred, whichones may be more readily lost or replaced, or which speciesproduce phylogenetic discordance.

Materials and Methods

MCOA in the Context of Phylogenetic AnalysisIn phylogenomics, there is often the need to compare setsof phylogenetic trees, determine how the position of eachspecies changes in every tree, and to find out which treesare more similar between each other in terms of the pair-wise topological distance (i.e., either nodal or patristic). It

would thus be advantageous to compare the position ofeach species in all trees at the same time, to know if thereis a correlation among the species positions and distancesrepresented in all trees (i.e., create a consensus typologythat serves as reference), and to what extent each individ-ual tree deviates from such a consensus. The problem athand involves measuring the relationship between theposition of each species and the distances between themin all the trees of a given set.

In the context of multivariate data analysis, one is inter-ested in ordering objects (e.g., variables, sites, and tables) bytheir similarity, and the relationships between the objects

are measured by different metrics (Manly 2004). Generally,objects are projected onto principal axes and are thus char-acterized numerically or graphically. This procedure placessimilar objects near each other and dissimilar ones far apartin geometric space. Traditional ordering techniques (e.g.,PCA) typically use a singular decomposition axis value ofone table. In the case of our proposed method, MCOAallows the simultaneous ordination of many Euclidean dis-tance matrices derived from every tree in the set by findingsuccessive decomposition axes from each distance matrixthat maximize a covariance function. Doing this allows theextraction of the consensus typology contributed by alltrees.

de Vienne et al. doi:10.1093/molbev/msr317 MBE

2


3/12

Here, we introduce the basic concepts of the method;interested readers are referred to the appendix in thesupplementary material (Supplementary Material online)and to Chessel and Hanafi (1996) for its mathematicaldescription.

To better understand how MCOA is applied to phylo-genetic data (i.e., a set ofK trees), we can decompose themethod in two steps. We first explain how one tree can berepresented in a 2D space through Principal CoordinateAnalysis (Gower 1984; Gower and Legendre 1986) (PCO,fig. 1a), and we then explain how K trees can be simulta-neously analyzed through MCOA (fig. 1b).

A single tree can be transformed into a distance matrixby calculating the pairwise distances between species in thetrees (fig. 1a, step 1). Distance between species can eitherbe nodal (number of nodes separating two leaves in thetree) or patristic (sum of the branch lengths separatingtwo leaves). We then compute the square root of this

distance matrix in order to get a Euclidean distance matrix,which is an optimal condition for the PCO to be performed.Whereas the methods typically used to obtain Euclideandistances impose a geometric distortion on them, takingtheir square root does not affect the results (de Vienne

et al. 2011). The principal coordinates of the leaves (i.e.,terminal branches) in a Euclidean space are then derivedfrom the Euclidean distance matrix by a PCO analysis(fig. 1a, step 2). Each axis defined by the PCO analysis max-imizes the variance of the coordinates of the species. Whenoriginal distances between species are Euclidean, all theeigenvalues are positive and correspond to the varianceassociated to the axis. In that case, the distances betweenspecies in the new space defined by the PCO analysis arethe same as the original distances defined by the tree.Because most of the variance is generally concentratedin the first axis (depending on the topology of the tree),we can represent the respective position of the leaves in

FIG. 1. Principle of MCOA as applied to phylogenetic trees. (a) From one tree to a 2D representation of the tree. Step 1: A single phylogenetic

tree can be converted into a distance matrix using either nodal or patristic distances between the leaves. Step 2: A PCO on the square root of

the distance matrix allows retrieving the coordinates of the species in a Euclidean space. Step 3: Positions of the species in a 2D space formed

by the two first axes of the PCO. (b) MCOA from Ktrees to the cohesion plot. Steps 1 and 2: From K trees, Kdistance matrices and KEuclidean

spaces are obtained, as in A. Each Euclidean space can then be seen as a scatter plot (or set of points) in an n 1 dimensional space (n beingthe number of leaves). Step 3: MCOA finds K auxiliary vectors (u11,u12, . . ., u

1K) and a reference vector m

1 so that the sum of the covariances

between each u1 vector and the unique m1 vector is maximized. This is done recursively for m1, m2, etc. Step 4: Cohesion plot for the two first

axes of the MCOA analysis m1 and m2. Labels represent the reference position of the species (consensus typology). Black points represent the

positions of the species in each individual gene tree.

Detecting Outlier Genes and Species in Phylogenomics doi:10.1093/molbev/msr317 MBE

3
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1


4/12

a reduced space formed by these axes of the PCO (fig. 1a,step 3). The relative positions of the taxa in this reducedspace represent the main dichotomies observed in the tree.

The first steps of the MCOA analysis, as applied to mul-tiple phylogenetic trees, are the same as before. Each one ofthe K trees is transformed into a distance matrix (fig. 1b,step1), and the square root of the distance matrices arethen computed and used to perform K-independent PCOs(fig. 1b, step2). We thus obtain KEuclidean spaces, one foreach initial tree. Each one of these spaces can be seen as

a set of points (scatter plots on fig. 1b) distributed ona space having n 1 dimensions (or axes), where n isthe number of leaves in each tree. The basic goal of MCOAis then to find a set of K vectors (one for each of the KEuclidean spaces) called auxiliary vectors (from u11 tou1K), as well as a single reference vector (called m

1), so thatthe sum of the covariances between each of the uKauxiliaryvectors and the m1 vector is maximized (fig. 1b, step 3). Ananalytic demonstration of this maximization has beengiven by Chessel and Hanafi (1996). The obtained m1 vectoris the first axis of the MCOA analysis. The same method isthen applied to obtain a m2 vector by maximizing the sum

of the covariance between this vector and K u2

vectors(from u21 to u

2K). Thus m

2 becomes the second axis of theMCOA. This is done recursively for retrieving the n axesof the MCOA analysis.

Phylo-MCOA: Using MCOA for PhylogeneticAnalysisFollowing the analysis described above, the reference posi-tion of each species (coordinates of the species projectedon the m1, m2, . . . mn vectors) as well as the position of thespecies in each individual tree (coordinates of the speciesprojected on the u1, u2, . . . , un auxiliary vectors) can becomputed. It is possible to plot the reference positions

of each species, as well as their position in each individualtree, using the two first axes of the MCOA analysis (fig. 1bpart 4). This plot is usually called the cohesion plot. The setof reference positions of the species in the cohesion plot iscalled consensus typology. For a given species, the distancebetween its reference position and its position in each in-dividual tree is represented by lines. Long lines representcases were the species has a position in a given tree thatis not concordant with its position in all other trees (andtherefore neither in the consensus typology). On the con-

trary, short lines represent cases where the position of thespecies in a given tree is roughly the same as in all the othertrees (and therefore in the consensus typology). The cohe-sion plot thus contains N Tdots, where N represents thenumber of species in each tree and T the number of trees.

The cohesion plot is very informative for small data sets,where one can assess the position of each species in eachindividual tree relative to its reference position. However,this plot becomes difficult to read and to interpret asthe number of genes and species increases because specieslabels superimpose and the cloud of points and linesbecomes overcrowded. Moreover, the cohesion plot gives

only a graphical representation of the data and does notallow the extraction of statistically significant outlier genesand species. Therefore, we show instead the informationcontained in cohesion plots but using a different and morereadable representation: the two-Way Reference matrix(hereafter referred to as the 2WR matrix). The 2WR matrixis built by computing, for every species, the distance be-tween its reference position and its position in each indi-vidual tree in the multidimensional space. The rows in the2WR matrix represent species and columns represent genes(fig. 2a). Note that the order of species (rows) in the 2WRmatrix follows the average relative position of species inthe gene trees. In other words, two species that are close

FIG. 2. Detection of complete outliers by Phylo-MCOA. (a) The 2WR matrix is computed by calculating, for every species, the distance

separating its position in each gene tree to its reference position (see text). The 2WR matrix provides information about species in rows and

genes in columns. (b) The 2WR matrix is transformed into two binary matrices (2WRBgn and 2WRBsp) by detecting outlier cells as described in

the Methods section. (c) A score is computed for each gene from the 2WRBgn matrices and for each species from the 2WRBsp matrix and

represented by a bar plot. The dashed gray line on the bar plot represents the threshold chosen by the user over which a gene or a species is

considered as a complete outlier.


4


5/12

in the 2WR matrix (two contiguous rows) are two speciesthat are closely related in the gene trees.

Detection of OutliersPhylo-MCOA can detect two types of outliers: completeoutliers and cell-by-cell outliers. Complete outliers are oftwo types: complete outlier species and complete outliergenes. Complete outlier genes place all the species in a posi-tion different from their positions in the other gene trees,whereas complete outlier species are species whose posi-tion with respect to other species is different in all the genetrees. Cell-by-cell outliers represent species whose position

in some specific gene trees is not concordant with theirposition in the other gene trees (i.e., single cells in the2WR matrix whose value seems outlier with respect tothe other values in the matrix).

Cell-by-cell outliers cannot be detected by Phylo-MCOAfrom the 2WR matrix if complete outliers are present be-cause it hides the cell-by-cell signal. The detection of out-liers by Phylo-MCOA is thus performed in multiple steps: 1)Phylo-MCOA analysis of the original data set; 2) detectionof complete outliers; 3) removal of the complete outliersfrom the original data set; 4) Phylo-MCOA analysis ofthe new data set; and 5) detection of cell-by-cell outliers. If

no complete outliers are detected, steps 3 and 4 are omitted.For detecting complete outliers (fig. 2), the 2WR matrix

is separated in two matrices, (2WRBsp and 2WRBgn, fig. 2b)by detecting outlier cells. For constructing the 2WRBsp ma-trix, a cell is considered as an outlier (and assigned the value1), if its value in the 2WR matrix is higher than Q3 kIQR,where Q3 represents the third quartile of the distributionof values of its column and IQR represents the interquartilerange of these values. Symmetrically, for the 2WRBgn matrix,a cell is considered as outlier if its value in the 2WR matrix ishigher than Q3 kIQR, where Q3 and IQR are now com-puted based on the values in the row. We set k5 1.5 as thedefault value as is classically done when detecting outliers

with this method. The two obtained binary matrices arerepresented in figure 2b. From the 2WRBgn matrix, a scoreis computed for each gene by computing the proportion of1s in its column. From the 2WRBsp matrix, a score is com-puted for each species by computing the proportion of 1sin its row. This leads to the two barplots presented in figure2c. In this example, gene 2 and gene 13 are detected as out-liers by more than 90% of the species, and t16 and t33 aredetected as outlier species by more than 90% of the genes.

The user can fix a threshold for considering that a speciesor a gene is a complete outlier. By default, we set thisthreshold to 50% (i.e., species detected as outlier for more

than 50% of the genes are seen as complete outliers andgenes detected as outliers for more than 50% of the speciesare considered as complete outliers).

For the detection of cell-by-cell outliers (fig. 3), the 2WRmatrix (fig. 3a) is transformed into the 2WRgn and 2WRsp

matrices by normalizing the 2WR matrix by rows and col-umns, respectively; the product of these two normalizedmatrices gives the 2WRgnsp matrix that is used for thecell-by-cell outlier detection.

The 2WRgnsp matrix is first transformed into a binarymatrix by detecting outlier cells, according to the valuesin the rows, and to a second binary matrix by detecting

outlier cells according to the values in the columns. Thesetwo binary matrices are finally multiplied in order to con-serve only cells detected as outliers for both rows and col-umns. This last binary matrix is called 2WRBgnsp (fig. 3b).

This procedure for detecting cell-by-cell outliers is, how-ever, too permissive, and therefore, a second filtering step isnecessary. As explained before, Phylo-MCOA starts bytransforming trees into distance matrices. Species in thetrees are not independent from each other. Therefore, evena single modification in a tree will change all the values inthe associated distance matrix. When we detect a cell-by-cell outlier, we have to distinguish between true outliersand species that are affected by this outlier just because

FIG. 3. Detection of cell-by-cell outliers by Phylo-MCOA. (a) The 2WR matrix is obtained as described in the text. (b) The 2WR matrix is then

transformed into the 2WRBgnsp binary matrix by successive normalizations and detection of outliers (see text). This matrix contains islands of

outliers representing groups of closely related species erroneously detected as outliers surrounding the true outlier. (c) Islands are removed by

considering as true outlier only the cell having the maximum value in the island. This maximum is taken from the 2WRgnsp matrix.


5


6/12

they are close in the distance matrix. To solve this problem,we take advantage of a Phylo-MCOA property that ordersspecies in the 2WR matrix by their mean proximity in thecohesion plot. When two or more species that are contig-uous in the 2WRBgnsp matrix for a given gene are detectedas outliers, we define this group as an island (dashedcircles in fig. 3b). In each island, we need to distinguish be-tween the true outlier species and the contiguous species

that are erroneously detected as outliers. To do so, we con-sider that the outlier cell with the highest value in the2WRgnsp is the true outlier, and the contiguous cells areartifacts. This filtering step allows obtaining a binary matrixfree of islands that contains only true cell-by-cell outliers(fig. 3c).

Handling Tree and Node SupportWhen comparing phylogenetic trees, it is very importantto consider the statistical support of both trees and nodes.To take into account measures of tree support such as like-lihood scores, posterior probabilities, or any other suitable

scores, Phylo-MCOA can assign different weights, specifiedby the user, to the different trees according to their sup-port. To do so, during the MCOA analysis, each table (rep-resenting Euclidean spaces, see fig. 1) is multiplied by thesquare root of the weight that is associated to it (in abso-lute values). As a result, the values in the tables correspond-ing to a specific gene with a high weight are higher andthe importance of the corresponding auxiliary vectorsfor retrieving the m vectors also increases (fig. 1b, step3). In this sense, those genes will weigh more in the MCOA.It is also possible to take into account node support (e.g.,bootstrap values) when building the consensus typology,by choosing a cutoff level of node confidence below

which nodes will be collapsed. In this way, poorly sup-ported nodes (i.e., unlikely species bipartitions) will haveless or no influence in the consensus typology.

Handling Missing DataPhylo-MCOA is capable of analyzing data sets that areincompletely sampled (i.e., with missing data), in the caseswhere a given gene is not available for one or more speciesbut is found in all other species. To handle incomplete sam-ples, the distances that correspond to missing species ina given pairwise distance matrix are assigned the valueof the average distance obtained from all the other distance

matrices where the species is present.

Results

Phylo-MCOA Evaluation: Simulated Data SetsSimulated data sets were generated in order to evaluate theefficiency and robustness of Phylo-MCOA. All the data setswere created in the same way: we generated a random treewith N species by random splitting of edges as imple-mented in function rtree of the ape package in R language(Paradis et al. 2004) and duplicated this gene tree in orderto get Tidentical gene trees. A new gene tree was randomlygenerated for each test performed. To create a complete

outlier species, we changed its position randomly in allT trees. To generate a complete outlier gene, we replacedone of the gene trees by a completely random tree.

Subsequently, we evaluated the efficiency of Phylo-MCOA by its ability to retrieve the correct complete outliergene and species, using the corresponding 2WR matrix. Alloutliers were detected based on nodal distances, using theoption distance 5 nodal in Phylo-MCOA.

To simulate cell-by-cell outliers, we randomly sampledone species and one tree and changed randomly the posi-tion of the selected species in the selected tree. By repeat-ing this operation multiple times, we obtained data setswith a variable proportion of cell-by-cell outliers.

Effect of Sample Size on the Ability of Phylo-MCOA to

Retrieve Complete Outliers

Simulated data sets were obtained containing between 10and 100 genes (10, 20, 30, 50, and 100, respectively) andbetween 10 and 100 species (10, 20, 30, 50, and 100, respec-tively); all possible combinations were explored. In all thesedata sets, we artificially introduced three complete outliergenes and three complete outlier species.

We analyzed these data sets with Phylo-MCOA and eval-uated the proportion of complete outlier genes and speciesthat the method was able to retrieve (table 1). We observedthat as long as the number of both species and genes islarge enough (!20), Phylo-MCOA always correctly re-trieved the outliers (black cells in table 1). On the otherhand, when the number of species is small (10), Phylo-MCOA is unable to retrieve the outlier genes correctly,even for a large number of genes (first column, light gray

cells in table 1). However, outlier species are easier to re-trieve as long as the sample size is large enough (!20), evenif there is a small number of genes (10, first row, dark graycells in table 1). Finally, for moderately small number ofoutlier species (520), outlier genes are easier to retrievethan species (second column, dark gray cells in table 1).

Effect of the Proportion of Complete Outliers on the Ability

of Phylo-MCOA to Retrieve Them

We created a data set of 50 genes and 50 species and testedthe ability of Phylo-MCOA to retrieve complete outliersas their proportion increased. We tested data sets wherebetween 2% and 60% of the genes, and between 2% and

Table 1. Effect of the Size of the Data on the Ability of Phylo-

MCOA to Correctly Retrieve Complete Outlier Genes and Species.

Retrieved

Species/Genes

Number of Species

10 20 30 50 80 100

Number of genes 10 0/0 0/0 1/0 3/0 3/0 3/0

20 0/0 1/2 3/3 3/3 3/3 3/3

30 0/0 0/3 3/3 3/3 3/3 3/3

50 0/0 0/3 3/3 3/3 3/3 3/3

80 0/0 0/3 3/3 3/3 3/3 3/3100 0/1 0/3 3/3 3/3 3/3 3/3

NOTE.Three outlier genes and three outlier species were simulated. Each cellcontains two values separated by a slash (/): number of outlier species retrieved/number of outlier genes retrieved. Nodal distances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers byPhylo-MCOA.


6


7/12

60% of the species were outliers (with values 2%, 10%, 20%,40%, and 60%). All possible combinations of these propor-tions were tested. We then evaluated the number of com-plete outlier genes and species that Phylo-MCOA was ableto retrieve correctly in each case (table 2). We observedthat even when 20% of genes and species are outliers, Phy-lo-MCOA was still able to retrieve almost all of them cor-rectly (black cells in table 2). A large proportion of outlier

genes (40%) does not prevent retrieving outlier species.Similarly, a large proportion of outlier species is not a prob-lem for retrieving outlier genes (dark gray cells in table 2).However, if a large proportion of both genes and species areoutliers, this will make it difficult to retrieve outliers (lightgray cells in table 2).

Effect of the Proportion of Missing Data on the Ability of

Phylo-MCOA to Retrieve Complete Outliers

In typical biological data sets, there are plenty of missingdata, therefore, methods for real data set analysis shouldbe robust to incompletely sampled data sets. In order toevaluate the ability of Phylo-MCOA to retrieve outlier

genes and species despite missing data, we created a dataset with 100 genes and 100 species containing three com-plete outlier species and five complete outlier genes. Weincreased the proportion of missing data progressivelyby randomly removing species from the original set ofgenes. We removed 1%, 2%, 5%, 10%, 15%, 20%, 30%,and 50% of the data, respectively (table 3). Note that a pro-portion of 20% of missing data means that, on average,each tree has lost 20% of its leaves. We observed thatPhylo-MCOA is very robust to missing data. Even with20% of the data missing, the three complete outlier speciesand one of the five complete outlier genes were correctlyretrieved (table 2).

Capacity of Phylo-MCOA to Detect Complete Outliers in the

Presence of Cell-by-Cell Outliers

The detection of cell-by-cell outliers by Phylo-MCOA isregularly performed on a data set from which completeoutliers have been removed. In order to verify whetherPhylo-MCOA is able to correctly detect complete outliersin the presence of cell-by-cell outliers, we simulated a dataset with 100 genes and 100 species containing three com-

plete outlier species and three complete outlier genes, andevaluated the ability of Phylo-MCOA to correctly detectthe complete outliers given different numbers of cell-by-celloutliers, from 1 to 1,500. Results are shown in table 4.If,1,200 cell-by-cell outliers are present, Phylo-MCOA isalways able to retrieve correctly the complete outliers. For1,500 cell-by-cell outliers, all complete outlier genes are stillretrieved, but the complete outlier species are not retrieved.

Capacity of Phylo-MCOA to Retrieve Cell-by-Cell Outliers

As previously explained, the detection of cell-by-cell out-liers by Phylo-MCOA is performed on a data set free of

complete outliers, therefore, for this test complete outliersare excluded. We analyzed 100 genes and 100 species witha variable proportion of cell-by-cell outliers and evaluatedthe proportion that was detected by Phylo-MCOA. Foreach number of cell-by-cell outliers, we repeated the oper-ation 20 times in order to have an estimation of the stan-dard deviation of the mean proportion of cell-by-celloutliers retrieved (table 5). We observed that the mean pro-portion of cell-by-cell outliers retrieved decreased veryslowly, starting at 100% (SD 5 0) for two and five cell-by-cell outliers included, to more than 85% for 1,000cell-by-cell included. Phylo-MCOA is thus able to detectcell-by-cell outliers even when there are several of them.

Table 2. Effect of the Proportion of Outlier Genes and Species in the Ability of Phylo-MCOA to Correctly Retrieve Them.

Retrieved Species/Genes (%)

Proportion of Complete Outlier Species [Number]

2% [1] 10% [5] 20% [10] 40% [20] 60% [30]

Proportion of complete outlier genes [number] 2 [1] 1/1 5/1 9/1 0/1 0/0

10 [5] 1/5 5/5 8/5 0/5 0/0

20 [10] 1/10 5/10 10/10 0/0 0/0

40 [20] 1/0 5/0 0/0 0/0 0/0

60 [30] 0/0 0/0 0/0 0/0 0/0

NOTE.The data set was composed of 50 genes and 50 species. The proportions of outliers are given in percentage and the absolute number of outliers is given underbracket ([ ]). Each cell contains two values separated by a slash (/): number of complete outlier species retrieved/number of complete outlier genes retrieved. Nodaldistances were used in the Phylo-MCOA analysis. The darker the shading, the better the retrieval of outliers by Phylo-MCOA.

Table 3. Effect of the Proportion of Missing Data on the Number

of Complete Outliers Retrieved (genes and species).

Proportion of Missing Data

1% 2% 5% 10% 20% 30% 50%

Species retrieved 3 3 3 3 3 0 0

Genes retrieved 5 5 5 5 1 0 0

NOTE.The data set was comprised of 100 genes and 100 species with fivecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.

Table 4. Number of Complete Outliers Retrieved with Increasing

Number of Cell-by-Cell Outliers.

Number of Cell-by-Cell Outliers

1 10 50 100 300 500 1,000 1,200 1,500

Species retrieved 3 3 3 3 3 3 3 3 0

Genes retrieved 3 3 3 3 3 3 3 3 3

NOTE.The data set was comprised of 100 genes and 100 species with threecomplete outlier genes and three complete outlier species. Nodal distances wereused in the Phylo-MCOA analysis. The darker the shading, the better the retrievalof outliers by Phylo-MCOA.


7


8/12

Phylo-MCOA Evaluation: Real Data SetWe used Phylo-MCOA on a real data set comprising 246gene trees for 21 fungal species. This data set was used pre-viously by Aguileta et al. (2008) for retrieving the best genesto reconstruct the phylogeny of fungi, and the data areavailable in FUNYBASE (Marthey et al. 2008). Aguiletaet al. (2008) had compared each individual gene tree tothe species tree obtained from the concatenation of the246 genes. From this comparison, they had obtained a scorefor each gene tree representing how suitable it was for find-ing the reference (concatenated) tree. Using Phylo-MCOA,we analyzed the data set of Aguileta et al. (2008) using

nodal distance between species and obtained a score foreach species and gene as explained in the Materials andMethods section. These scores are plotted in figure 4 forthe genes (top) and for the species (bottom). Phylo-MCOAdetected five complete outlier genes (black bars in fig. 4)and no complete outlier species. The fact that no speciesappeared as outliers in this data set is not surprisingbecause the goal of their work was to find optimal genemarkers for reconstructing the fungal species tree, there-fore, species producing incongruent trees were removedfrom the data set in Aguileta et al. (2008). In order to checkfor consistency, we compared the score that Aguileta et al.obtained for each gene (a topological similarity score re-

ferred to as topscore) to the score we obtained with Phy-lo-MCOA. Figure 5 (white dots) represents the scatter plot

of these two scores. Phylo-MCOA is able to retrieve thecorrect ranking of trees and to identify genes that canbe considered as outliers. We found that trees consideredas outliers in Aguileta et al. (2008) were indeed also outliersaccording to the analysis using Phylo-MCOA (gray arrowsin fig. 5). When using patristic distances instead of nodaldistances for the Phylo-MCOA analysis (black dots infig. 5), the same trend is still present but the correlationwith the topscore values of Aguileta et al. (2008) seemsto slightly decrease. This is expected because the topscoreused by Aguileta et al. (2008) is a topological distancemeasure, thus more similar to Phylo-MCOA with nodal

distances than to Phylo-MCOA with patristic distances.Overall, this illustrates the general utility of Phylo-MCOA for analyzing real data sets. Moreover, the methodis very fast: using the set of trees in Aguileta et al. (2008), ittook about 1 min on a pc laptop.

Using Phylo-MCOA for Filtering Phylogenomic DataSetsDeep relationships in the animal phylogeny are still de-bated. In 2009, Schierwater et al. (2009) proposed a phylog-eny of the Metazoa that was questioned because of the lowquality of the data set used to construct it. More precisely,Philippe et al. (2011) detected important problems in thedata set of Schierwater et al. (2009), such as contamina-tions, deep paralogies, or horizontal gene transfers. They

FG1000

FG1010

FG1015

FG1020

FG1021

FG1024

FG1031

FG1046

FG1049

FG1056

FG1063

FG1073

FG1108

FG436

FG459

FG465

FG478

FG487

FG491

FG499

FG503

FG507

FG508

FG524

FG525

FG529

FG533

FG534

FG535

FG542

FG543

FG546

FG549

FG551

FG552

FG556

FG559

FG569

FG570

FG575

FG576

FG578

FG579

FG583

FG586

FG588

FG589

FG591

FG595

FG598

FG610

FG612

FG616

FG621

FG627

FG630

FG634

FG635

FG636

FG638

FG644

FG646

FG649

FG651

FG652

FG657

FG659

FG662

FG665

FG668

FG673

FG678

FG679

FG684

FG685

FG686

FG687

FG689

FG690

FG691

FG692

FG694

FG695

FG696

FG698

FG699

FG702

FG704

FG705

FG708

FG716

FG718

FG720

FG722

FG723

FG730

FG734

FG735

FG740

FG746

FG747

FG750

FG752

FG756

FG757

FG758

FG761

FG762

FG763

FG771

FG781

FG784

FG788

FG789

FG798

FG805

FG813

FG819

FG821

FG825

FG832

FG834

FG841

FG844

FG848

FG850

FG854

FG855

FG861

FG862

FG863

FG864

FG866

FG870

FG878

FG884

FG893

FG894

FG896

FG897

FG898

FG901

FG909

FG912

FG916

FG927

FG933

FG942

FG948

FG964

FG975

FG994

MS241

MS277

MS282

MS294

MS302

MS304

MS305

MS307

MS311

MS313

MS320

MS325

MS329

MS333

MS334

MS337

MS345

MS348

MS351

MS353

MS355

MS358

MS359

MS361

MS362

MS368

MS375

MS377

MS378

MS379

MS380

MS382

MS384

MS386

MS393

MS394

MS395

MS397

MS398

MS399

MS400

MS401

MS407

MS408

MS409

MS411

MS413

MS414

MS415

MS417

MS418

MS422

MS424

MS428

MS429

MS430

MS431

MS432

MS437

MS441

MS442

MS444

MS447

MS448

MS452

MS453

MS454

MS456

MS459

MS460

MS462

MS463

MS467

MS470

MS481

MS484

MS485

MS486

MS487

MS489

MS493

MS497

MS501

MS517

MS524

MS529

MS532

MS541

MS547

MS549

MS550

MS561

MS578

MS621

Genes

Outlierdetectionproportion

0.0

0.2

0.4

0.6

0.8

Sce

Spa

Pch

Cne

Uma

Yli

Dha

Clu

Spo

Ssc

Ani

Afu

Cim

Mgr

Ncr

Tre

Fgr

Ago

Kla

Cgl

Sba

Species

outlierdetectionproportion

0.00

0.02

0.04

0.06

0.08

0.10

0.12

FIG. 4. Bar plots for the fungal data showing the score calculated for each gene (top) and each species (bottom). Black bars represent significant

outliers according to the threshold chosen by the user (here 0.4). Nodal distances were used for the Phylo-MCOA analysis in this case.

Table 5. Mean Proportion of Cell-by-Cell Outliers Retrieved Out of 20 Repetitions with Increasing the Number of Cell-by-Cell Outliers.

Proportion Retrieved

Number of Cell-by-Cell Outliers

2 5 10 20 50 100 200 300 400 500 600 700 800 900 1,000

Mean 1 1 0.975 0.973 0.960 0.947 0.938 0.932 0.923 0.907 0.897 0.888 0.888 0.871 0.863

SD 0 0 0.055 0.034 0.030 0.022 0.018 0.015 0.011 0.011 0.014 0.011 0.010 0.008 0.009

NOTE.The data set was comprised of 100 genes and 100 species. Nodal distances were used in the Phylo-MCOA analysis.


8


9/12

reanalyzed the data set of Schierwater after having replacedproblematic sequences by correct ones and showed thatthe obtained tree, even if it had less supported nodes,was more realistic given previous studies.

In order to test whether Phylo-MCOA is able to filter

problematic sequences from a real data set, we analyzed

the set of phylogenetic trees used by Schierwater et al.(2009). We removed the outliers detected by Phylo-MCOA,built a phylogeny using only the filtered data set (i.e., with-out problematic sequences) and compared our tree withthe one obtained without filtering and the one producedby Philippe et al. (2011) after filtering the data.

To obtain the animal phylogeny, we used the originaldata set of Schierwater et al. (2009). It was first aligned

and trimmed with the same tools and parameters as inPhilippe et al. (2011), and the individual gene trees werereconstructed using the same model as Philippe et al.(2011). As in Philippe et al. (2011), four genes (AT8,DCR, N4L, and ND6) were removed from the analysis be-cause they contained too few positions after trimming. Allthe trees were analyzed with Phylo-MCOA using patristicdistances and all default parameters for the outlier detec-tion. Four complete outliers genes were detected by Phylo-MCOA and removed before performing the cell-by-celldetection: rrna16S, rrna18S, rrna28S, and rrna58S. Of the984 cells in the 2WR matrix (41 genes 24 species), 21

cell-by-cell outliers were detected (see supplementary tableS1, Supplementary Material online) and were removedfrom the original data set (by removing the sequencesin the original sequence files). The remaining sequences(available as supplementary dataset S1, Supplementary Ma-terial online) were then concatenated and used to recon-struct a phylogenetic tree with 100 bootstrap replicatesusing RAxML (Stamatakis 2006) and the same model as inSchierwater et al. (2009). The tree we obtained withoutfiltering (fig. 6a) was similar to the one proposed by Schier-water et al. (2009), which is known to contain noisy dataand thus produce an inconsistent topology. After filtering,there was a general decrease of bootstrap support for no-

des in the diploblasts group (fig. 6b), while the support

FIG. 5. Scatter plot of the topological congruence (top score)

between 246 fungal gene trees and the consensus species tree

obtained from the concatenation of individual trees (x axis) againstthe score of the genes according to Phylo-MCOA (y axis). Black dots

represent the scores when using patristic distances. White dots

represent the scores when using nodal distances. Outlier genes

according to Phylo-MCOA and identified by black bars in figure 4

are shown with small gray arrows. They appear to be poorly

congruent with the species tree (small values on the x axis).

FIG. 6. Comparison of the phylogeny of animals from the Schierwater et al. (2009) data set. (a) Tree obtained from the original sequences.

(b) Tree obtained after removal of cell-by-cell outliers using Phylo-MCOA. Species with a star (*) represent Porifera. The gray arrows point to

the node support of the diploblasts group in the two trees. Nodes with black dots represent bootstrap supports higher than 95. A hundred

bootstrap replicates were performed for each tree.


9
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1


10/12

of the nodes outside the diploblasts were almost notaffected. Overall, the bootstrap support for the mono-phyly of diploblasts went from 100 in the original dataset to 63 in the filtered data set (gray arrows in fig. 6),which is in accordance with the accepted vision that dip-loblasts are certainly not monophyletic. The monophylyof Porifera (species with an * in fig. 6) was lost, which isalso in accordance with previous studies and coherent

with the result of Philippe et al. (2011) on this dataset, who obtained a node for the Porifera with a supportvalue of 36 (Philippe et al. 2011).

Phylo-MCOA was thus able to identify true outliers thatneed to be filtered out prior to further phylogenetic anal-ysis in order to estimate a more consistent topology. How-ever, we could not retrieve exactly the same topology as theone obtained by Philippe et al. This can be due to the factthat we do not replace the sequences identified as prob-lematic by correct sequences as done in Philippe et al.(2011). Moreover, Phylo-MCOA considers that the correctsignal is the one that is present in the majority of the genes,

but this is not true in this data set for the Hexactinellida,where the majority of the sequences are slow evolvingmitochondrial sequences being contaminations from theDemospongiae, while true nuclear sequences of Hexactinel-lidae are fast evolving (Philippe et al. 2011). As a conse-quence, Hexactinellida and Demospongiae still forma very solid group after filtering, and the branch lengthleading to Hexactinellida is reduced in the new tree(fig. 6b) compared with the original tree (fig. 6a).

The Phylo-MCOA Program: Availability andRequirementsPhylo-MCOA is provided as a set of R functions whichmake use of the ade4 (Chessel et al. 2004; Dray et al.2007), ape (Paradis et al. 2004), and lattice (for graphicalfunctions, Sarkar 2010) packages and newly developedfunctions. It requires a recent version of R (.2.8.0) installedon the machine. Phylo-MCOA is freely distributed underthe GNU General Public License. It can be downloadedat http://phylomcoa.cgenomics.org.

Discussion

It has become a standard practice to use multiple genes inorder to obtain a reliable and concordant phylogeny. When

sampling many different genes for the same set of species,the individual gene trees will often be partly incongruent. Inthis case, one is interested in detecting which subset ofgenes and species is responsible for the discrepancies, inother words, in finding the outliers. Phylo-MCOA, themethod proposed here, provides a quick and powerfulway to detect outliers by integrating the phylogenetic in-formation contained in all the individual gene trees. Phylo-MCOA determines how different the compared topologiesare but more importantly, it identifies which genes and spe-cies explain those differences. This may prove very usefulwhen determining possible causes for incongruities, suchas horizontal gene transfers, different selective pressures,

recombination events, among others. Outlier genes are typ-ically affected by these factors, so identifying which subsetof species explain the differences may help to focus theefforts for studying the affected species. On the other hand,by building a consensus typology Phylo-MCOA also findsthe genes that produce concordant phylogenies. Thesegenes are excellent candidates for recovering the speciestree, which can be subsequently verified by other methods.

Phylo-MCOA thus provides good candidates, not onlyfor outliers, but for phylogenetic markers producingconcordant topologies.

To our knowledge, this is the first time that MCOA isapplied directly to analyze phylogenetic distances. Basedon a covariation optimization criterion, MCOA is a methodthat captures the similarities (and dissimilarities) obtainedfrom the ordination of several tables (types) of data, as ithas been previously demonstrated for the analysis of eco-logical data (Bady et al. 2004; Hedde et al. 2005; Laloe et al.2007). Furthermore, it is a statistically solid technique thatlends itself well to the analysis of phylogenetic data (i.e., in

terms of topological distances), where the covariation ofthe histories of genes and species can be integrated in a sin-gle ordination analysis. Even though a consensus typologyis proposed by the method, MCOA takes into account allthe gene trees in order to build it. We believe this makesa better use of the individual tree information than otherforms of phylogenetic consensus that do not take intoaccount outliers or use only a part of the available data.Our method is unique in simultaneously detecting outlierspecies and genes from the comparison of multiple genetrees.

One important aspect of tree comparisons involves theestimation of pairwise distance matrices. The Eigen decom-

position and further ordination represented in the consen-sus typology take the distances as absolute distances.Therefore, the consensus typology will change dependingon the type of distance that is actually being measured.There are two ways to calculate the distance betweentwo species in a phylogenetic tree: the sum of the branchlengths or the number of nodes separating the species inthe trees. Both distance metrics have advantages and dis-advantages: using the sum of the branch lengths (patristicdistances) allows incorporating the relative rates of evolu-tion but can artificially bring closer species that have smallbranch lengths and separate species with longer branches.

Conversely, if the internodal distance separating species isused, then the consensus typology informs us about a de-gree of phylogenetic proximity, but it tells us nothing aboutthe relative rates of evolution. It is important to be explicitabout the pairwise distance method employed in orderto derive conclusions, as genes identified as outliers whenusing one method may seem concordant with other geneswhen using another one. Comparing the results given byboth methods can be very informative. A gene that is iden-tified as an outlier when using patristic distances but notwhen using nodal distances will represent a gene evolvingat higher or slower rates in some species compared withother genes.


10
http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/http://phylomcoa.cgenomics.org/


11/12

Phylo-MCOA allows weighing trees according to differ-ent scores (e.g., likelihood scores), which ensures that morecredible trees are given more weight in the analysis. It isequally important to assign weights to nodes to distinguishsupported relationships from spurious or dubious ones. Inprinciple, one could incorporate a measure of node supportinto MCOA metrics; however, this is not straightforward.At present, we have arrived at a solution that may alleviate

the problem of node support assignment to some extent.We have incorporated the possibility of collapsing lowlysupported nodes, according to a cutoff value given bythe user. In this way, the nodes that are poorly supportedwill contribute little or nothing to the construction of theconsensus typology. Assessing node support across phylog-enies is difficult, especially with cases of missing or dupli-cated data, as not all nodes will be present in all trees. Forthis reason, we suggest the use of Treeko (Marcet-Houbenand Gabaldon 2011) for trees with duplicated genes, priorto analysis with Phylo-MCOA. Interestingly, our methodcan be used even in the absence of completely sampled

data sets. The program can deal with gene losses, as dem-onstrated by randomly deleting species from the simulateddata analyzed.

We have shown that Phylo-MCOA is able to correctlydetect outlier genes and species from the comparison ofmultiple phylogenetic trees. We have shown this by ana-lyzing simulated data sets but, more interestingly, fromthe analysis of real data sets. We have shown, with two dif-ferent examples, how Phylo-MCOA can detect trueoutliers. First, with the fungal data set by Aguileta et al.(2008), where we found the same concordant and discor-dant genes as in the original study; and second, where weuse Phylo-MCOA to predict which cell-by-cell outliers need

to be removed from phylogenetic reconstruction in orderto recover the animal phylogeny obtained after removingnoisy data, as done by Philippe et al. (2011).

Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Phylo-MCOA can detecttwo types of outliers. Complete outliers can identify whichgenes and species deviate from the evolutionary history ofmost genes analyzed. This information can be very usefulfor filtering out noisy data in phylogenetic reconstruction.On the other hand, cell-by-cell outliers point to particularcases of topological discordance, where one specific species

is an outlier in a given gene. This information can be used topostulate specific hypotheses concerning particular genesor species. Although cell-by-cell outlier detection is imple-mented in the current Phylo-MCOA version, there is roomfor improvement, specifically in the way outliers are scoredto define significance.

Phylo-MCOA can be used to detect candidate genes forfurther analysis, or for filtering out problematic genes inphylogenetic reconstruction. Nevertheless, caution shouldalways be exerted because the method will give moreweight to the information given by the majority of genesunder study, and if the data set analyzed is plaguedwith dubious sequences, the method will favor an incorrect

signal. This is what we observed when we analyzed the dataset ofSchierwater et al. (2009) that is known to be riddenwith multiple faulty sequences. External information on thequality of the sequences should be employed wheneverpossible prior to analysis with our method.

Supplementary Material

Supplementary table S1 and data set S1 are available atMolecular Biology and Evolution online (http://www.mbe.oxfordjournals.org/).

Acknowledgments

We would like to thank Herve Philippe for the time hespent on this manuscript and the helpful commentsand suggestions he made. We thank David Moreira for con-structive discussions, Avner Bar-Hen for interesting com-ments on the use of MCOA for phylogenomics analysis,Vincent Berry for suggestions on future developments

and comments on distances between species, and FranSupek for his suggestions on statistical significance testing.Finally, we thank Tatiana Giraud and Toni Gabaldon forinsightful comments on the manuscript as well as threeanonymous reviewers. D.M.V. acknowledges postdoctoralgrants from the Centre National de la RechercheScientifique (CNRS) and Universite Paris-Sud 11. G.A. ac-knowledges postdoctoral grants from the CNRS, UniversiteParis-Sud 11, and Agence National de la Recherche (ANR-06-BLAN-0201).

ReferencesAguileta G, Marthey S, Chiapello H, Lebrun MH, Rodolphe F,

Fournier E, Gendrault-Jacquemard A, Giraud T. 2008. Assessing

the performance of single-copy genes for recovering robust

phylogenies. Syst Biol. 57:613627.

Ane C, Larget B, Baum DA, Smith SD, Rokas A. 2007. Bayesian

estimation of concordance among gene trees. Mol Biol Evol. 24:

412426.

Bady P, Doledec S, Dumont B, Fruget JF. 2004. Multiple co-inertia

analysis: a tool for assessing synchrony in the temporal

variability of aquatic communities. C R Biol. 327:2936.

Berthouly C, BedHom B, Tixier-Boichard M, Chen CF, Lee YP,

Laloe D, Legros H, Verrier E, Rognon X. 2008. Using molecular

markers and multivariate methods to study the genetic diversity

of local European and Asian chicken breeds. Anim Genet. 39:

121129.

Brochier C, Bapteste E, Moreira D, Philippe H. 2002. Eubacterial

phylogeny based on translational apparatus proteins. Trends

Genet. 18:15.

Chessel D, Dufour AB, Thioulouse J. 2004. The ade4 package-I-one-

table methods. R News. 4:510.

Chessel D, Hanafi M. 1996. Analyses de la co-inertie de K nuages de

points. Rev Stat Appl. 44:3560.

Cranston KA, Rannala B. 2007. Summarizing a posterior distribution

of trees using agreement subtrees. Syst Biol. 56:578590.

de Vienne DM, Aguileta G, Ollier S. 2011. The Euclidean nature of

phylogenetic distance matrices. Syst Biol. 60:826832.

Degnan JH, Rosenberg NA. 2006. Discordance of species trees with

their most likely gene trees. PLoS Genet. 2:762768.


11
http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1http://www.mbe.oxfordjournals.org/lookup/suppl/doi:10.1093/molbev/msr317/-/DC1


12/12

Dray S, Dufour AB, Chessel D. 2007. The ade4 package-II: two-table

and K-table methods. R News. 7:4752.

Edwards SV, Liu L, Pearl DK. 2007. High-resolution species trees

without concatenation. PNAS 104:59365941.

Felsenstein J. 1985. Confidence-limits on phylogeniesan approach

using the bootstrap. Evolution 39:783791.

Gadagkar SR, Rosenberg MS, Kumar S. 2005. Inferring species

phylogenies from multiple genes: concatenated sequence tree

versus consensus gene tree. J Exp Zool B Mol Dev Evol. 304:6474.

Gordon AD. 1986. Consensus supertreesthe synthesis of rootedtrees containing overlapping sets of labeled leaves. J Classif. 3:

335348.

Gower JC. 1984. Distance matrices and their Euclidean approxima-

tion. In: Jambu M, Diday E, Lebart L, Pages J, Tomassone R,

editors. Data analysis and informatics III. Amsterdam (The

Netherlands): Elsevier. p. 321.

Gower JC, Legendre P. 1986. Metric and Euclidean properties of

dissimilarity coefficients. J Classif. 3:548.

Hedde M, Lavelle P, Joffre R, Jimenez JJ, Decaens T. 2005. Specific

functional signature in soil macro-invertebrate biostructures.

Funct Ecol. 19:785793.

Hillis DM, Heath TA, St John K. 2005. Analysis and visualization of

tree space. Syst Biol. 54:471482.

Huson DH, Bryant D. 2006. Application of phylogenetic networks inevolutionary studies. Mol Biol Evol. 23:254267.

Laloe D, Jombart T, Dufour AB, Moazami-Goudarzi K. 2007.

Consensus genetic structuring and typological value of

markers using multiple co-inertia analysis. Genet Sel Evol.39:545567.

Leigh JW, Schliep K, Lopez P, Bapteste E. 2011. Let them fall where

they may: congruence analysis in massive, phylogenetically

messy datasets. Mol Biol Evol. 28:27732785.

Leigh JW, Susko E, Baumgartner M, Roger AJ. 2008. Testing

congruence in phylogenomic analysis. Syst Biol. 57:104115.

Manly BFJ. 2004. Multivariate statistical methods: a primer. 3rd ed.

London: Chapman & Hall, Ltd.

Marcet-Houben M, Gabaldon T. 2011. TreeKO: a duplication-aware

algorithm for the comparison of phylogenetic trees.NucleicAcids Res. 39(10):e66.

Marthey S, Aguileta G, Rodolphe F, et al. (11 co-authors). 2008.

FUNYBASE: a FUNgal phYlogenomic dataBASE. BMC Bioinfor-

matics. 9:456.

Paradis E, Claude J, Strimmer K. 2004. APE: analyses of phylogenetics

and evolution in R language. Bioinformatics 20:289290.

Philippe H, Brinkmann H, Lavrov DV, Littlewood DTJ, Manuel M,

Worheide G, Baurain D. 2011. Resolving difficult phylogenetic

questions: why more sequences are not enough. PLoS Biol. 9:

e1000602.

Rodriguez-Ezpeleta N, Brinkmann H, Roure B, Lartillot N, Lang BF,Philippe H. 2007. Detecting and overcoming systematic errors in

genome-scale phylogenies. Syst Biol. 56:389399.

Rosenberg NA. 2002. The probability of topological concordance of

gene trees and species trees. Theor Popul Biol. 61:225247.

Roure B, Rodriguez-Ezpeleta N, Philippe H. 2007. SCaFoS: a tool for

selection, concatenation and fusion of sequences for phyloge-

nomics. BMC Evol Biol. 7(1 Suppl):S2.

Sanderson MJ, Purvis A, Henze C. 1998. Phylogenetic supertrees:

assembling the trees of life. Trends Ecol Evol. 13:105109.

Sarkar D. 2010. Lattice: lattice graphics [Internet]. R package version

0.18-8. Available from: http://CRAN.R-project.org/package=lattice.

Schierwater B, Eitel M, Jakob W, Osigus H-J, Hadrys H, Dellaporta SL,

Kolokotronis S-O, DeSalle R. 2009. Concatenated analysis sheds

light on early metazoan evolution and fuels a modernurmetazoon hypothesis. PLoS Biol. 7:e1000020.

Stamatakis A. 2006. RAxML-VI-HPC: maximum likelihood-based

phylogenetic analyses with thousands of taxa and mixed models.

Bioinformatics 22:26882690.

Suchard MA. 2005. Stochastic models for horizontal gene transfer:

taking a random walk through tree space. Genetics 170:419431.

Suchard MA, Weiss RE, Sinsheimer JS, Dorman KS, Patel M,

McCabe ERB. 2003. Evolutionary similarity among genes. J Am

Stat Assoc. 98:653662.

Susko E, Leigh J, Doolittle WF, Bapteste E. 2006. Visualizing and

assessing phylogenetic congruence of core gene sets: a case

study of the gamma-proteobacteria. Mol Biol Evol. 23:10191030.

White WT, Hills SF, Gaddam R, Holland BR, Penny D. 2007. Treeness

triangles: visualizing the loss of phylogenetic signal.Mol Biol Evol.24:20292039.

http://cran.r-project.org/package=latticehttp://cran.r-project.org/package=latticehttp://cran.r-project.org/package=lattice

Documents

Mol Biol Evol 2012 de Vienne Molbev_msr317