1756-0500-2-21

Embed Size (px)

Citation preview

  • 8/14/2019 1756-0500-2-21

    1/5

    BioMedCentral

    Page 1 of 5(page number not for citation purposes)

    BMC Research Notes

    Open AccesShort Report

    A new measure of population structure using multiple singlenucleotide polymorphisms and its relationship with FST

    Hongyan Xu*, Bayazid Sarkar and Varghese GeorgeAddress: Deparment of Biostatistics, Medical College of Georgia, Augusta, GA, USA

    Email: Hongyan Xu* - [email protected]; Bayazid Sarkar - [email protected]; Varghese George - [email protected]

    * Corresponding author

    Abstract

    Background: Large-scale genome-wide association studies are promising for unraveling thegenetic basis of complex diseases. Population structure is a potential problem, the effects of whichon genetic association studies are controversial. The first step to systematically quantify the effectsof population structure is to choose an appropriate measure of population structure for humandata. The commonly used measure is Wright's FST. For a set of subpopulations it is generallyassumed to be one value of FST. However, the estimates could be different for distinct loci. Sincepopulation structure is a concept at the population level, a measure of population structure thatutilized the information across loci would be desirable.

    Findings: In this study we propose an adjusted C parameter according to the sample size from eachsub-population. The new measure C is based on the c parameter proposed for SNP data, which wasassumed to be subpopulation-specific and common for all loci. In this study, we performedextensive simulations of samples with varying levels of population structure to investigate theproperties and relationships of both measures. It is found that the two measures generally agreewell.

    Conclusion: The new measure simultaneously uses the marker information across the genome.It has the advantage of easy interpretation as one measure of population structure and yet can alsoassess population differentiation.

    BackgroundLarge scale genome-wide association studies are promis-ing in unraveling the genetic basis of complex diseases inhumans. There are many such studies currently being car-ried out. However, the size of the data produces severalissues and challenges in analysis and interpretation. Oneof the potential problems is hidden population structurein the samples. It can cause spurious associations whencases and controls differ in ancestry and is thus a con-founding factor. However, the effects of population struc-ture in real large-scale association studies are very

    controversial. Therefore, a systematic study is needed toquantify the levels of population structure and its effectson genetic association studies.

    The first step to quantify the effects of population struc-ture is to choose an appropriate measure of populationstructure for human data. The commonly used measure is

    Wright's FST [1]. For a set of subpopulations it is generallyassumed to be one value of FST. However, the estimatescould be different for distinct loci. It could be a problemif population structure is adjusted with local estimates in

    Published: 6 February 2009

    BMC Research Notes2009, 2:21 doi:10.1186/1756-0500-2-21

    Received: 31 December 2008Accepted: 6 February 2009

    This article is available from: http://www.biomedcentral.com/1756-0500/2/21

    2009 Xu et al; licensee BioMed Central Ltd.This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

    http://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/info/about/charter/http://-/?-http://www.biomedcentral.com/1756-0500/2/21http://creativecommons.org/licenses/by/2.0http://www.biomedcentral.com/info/about/charter/http://www.biomedcentral.com/http://-/?-http://creativecommons.org/licenses/by/2.0http://www.biomedcentral.com/1756-0500/2/21
  • 8/14/2019 1756-0500-2-21

    2/5

    BMC Research Notes2009, 2:21 http://www.biomedcentral.com/1756-0500/2/21

    Page 2 of 5(page number not for citation purposes)

    genome-wide association studies because it could maskreal association and lead to loss of power. With the avail-able of genomic data, we would like a measure utilizingthe information across markers. Therefore, we proposed anew measure C for SNP data. The new measure is same for

    all loci and utilizes information across loci. It is based onthe c parameters for the subpopulation that measures thedivergence of the subpopulation from the common ances-tor [2]. We then performed extensive simulations to inves-tigate the performance the new measure and compared itto the traditional FST statistic.

    MethodsSimulation model

    We simulated samples with population structure usingthe Balding-Nichols model [2,3]. Specifically, let xij be thenumber of copies of the chosen SNP variant at locus i inpopulationj, ij be the corresponding allele frequency at

    locus i in population j, then the model specifies that

    where L is the number of loci, P is the number of popula-tions, nij is the number of chromosomes genotyped at theith SNP in thejth population, i is the ancestral allele fre-quency for the ith SNP and the variance parameter cj spec-ifies how far the jth subpopulation's allele frequencies

    tend to be away from the ancestral allele frequency. In oursimulations, we sample cj and i from a uniform distribu-tion on (0, 1). The simulations were performed using thesimMD program in the Popgen package [4].

    Estimating cjparameter

    For each sample in the simulated data sets, we estimatethe c parameter for each subpopulation using a Bayesianapproach. We assume uniform priors on both c and parameters and use Markov Chain Monte Carlo (MCMC)methods (a Gibbs sampler) to sample from the posteriordistribution. The Markov Chain was run for 20,000 itera-tions and the first 10,000 iterations were discarded as

    burn-in. We estimated the c parameter by using the poste-rior mean values from the posterior samples.

    New measure

    When summarizing the level of population structureacross subpopulations, it is desirable to have a single sta-tistic, instead of one for each subpopulation. We proposeto use the weighted mean of the c parameters as the newmeasure of population structure. Specifically, suppose wehave P subpopulations, and let cj be the variance parame-

    ter for thejth subpopulation as defined above, then thenew measure C is defined as

    where wj is the weight for thejth subpopulation. There are

    many possible weighting schemes. Here we propose to

    use sample size from each subpopulation as a weight.

    That is, , where nj is the number of indi-

    viduals from subpopulationj in our sample. In our imple-

    mentation, we used the posterior estimate of cj and took

    the weighted mean as an estimate of C.

    From the simulated samples, FSTwas estimated using theunbiased estimator at bi-allelic SNP described by [5].

    ResultsEqual sample size

    In the first set of simulations, we sampled equal numberof chromosomes from each of the 3 sub-populations. Wegenerated genotypes of 100 SNPs according to our simu-lation model. The posterior estimates of the varianceparameter cj were obtained using MCMC method. Wethen estimate C by taking simple average of the estimatesof the cj parameters. FSTwas also estimated from the same

    simulated sample. A total of 100 samples were simulated.The correlation coefficient between the estimates of C and

    x n i L j P

    i cj

    cj

    i

    ij ij ij

    ij

    ~ , , , , ,

    ~ ,

    Bin

    Beta

    a

    a

    p p

    ( ) = =

    ( ) ( )

    1 1

    1 1

    11( )

    cj

    cj,

    wn

    nj

    j

    ii

    P== 1

    Linear regression of FST on the estimates of the C parame-ters with sample size 60 from each of the 3 populationsFigure 1Linear regression of FST on the estimates of the Cparameters with sample size 60 from each of the 3populations.

    0.0 0.2 0.4 0.6 0.8 1.0

    0.05

    0.10

    0.15

    0.20

    0.25

    0.30

    0.35

    0.40

    0.45

    0.50

    FST

    = 0.522C+0.065

    R2=0.92, n=60

    FST

    C

    http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/14/2019 1756-0500-2-21

    3/5

  • 8/14/2019 1756-0500-2-21

    4/5

    BMC Research Notes2009, 2:21 http://www.biomedcentral.com/1756-0500/2/21

    Page 4 of 5(page number not for citation purposes)

    though the total sample size is constant. However, the cor-relation levels off with 3 or more sub-populations.

    DiscussionNatural populations of the same species from differentgeographic regions tend to differ genetically. Human pop-

    ulation is no exception. Previous research has shown thatignoring the genetic differences among sub-populations isa potential problem for genetic association studies ofhuman diseases, especially for genome-wide associationstudies [6]. The problem could be severe for large multi-centered studies and/or studies in admixed populations,such as African Americans.

    The explosion of SNP data in human populations pro-vides an unprecedented opportunity to further character-ize population structure and relationships. In this paper,

    we proposed a new measure of population structure spe-cifically for SNPs. It is based on the c parameter which is

    population specific and measures the differentiation of

    the population from the common ancestor population. Incontrast, the new measure C is an index of the overall lev-els of population structure across populations. Throughextensive simulations, we showed that the new measure Chas very high correlations with the traditional Wright'sFST. The correlation increases as we have more informa-

    tion (more SNPs and/or more sub-populations in thesamples).

    While the new measure is different from the c parameter,it has some inherited advantages from the c parameters.First, it is specific for SNPs and takes account of the ascer-tainment bias in the process of SNP discovery. Since SNPdiscovery is generally conducted in small samples, SNPs

    with high minor allele frequencies are more likely to bediscovered than SNPs with low minor allele frequencies,thus creating the possible ascertainment bias. It has beenshown that the ascertainment bias could affect the estima-tion of population parameters in genetic analysis [7]. This

    ascertainment bias has been explicitly accounted for in themodel for estimating the individual c parameter for eachsub-population. It is assumed that a large number ofpotential loci are examined in small samples from each ofthe sub-populations, and a locus is chosen if it is not fixedfor the same allele in all sub-populations.

    Second, the new measure is based on inferences from aBayesian framework. Therefore, it is very flexible in mod-eling and can incorporate prior information on theparameters. In our simulation studies, we used unin-

    Table 3: Correlation of the estimates of FST and C from several weighing schemes

    Weighting scheme Correlation coefficient

    Sample size (30, 40, 90) Sample size (30, 60, 70)

    Scheme 1 0.866 0.875

    Scheme 2 0.940 0.947

    Scheme 3 0.931 0.937

    Linear regression of FST on the estimates of the C parame-ters with 1000 SNPsFigure 2Linear regression of FST on the estimates of the Cparameters with 1000 SNPs.

    0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

    0.1

    0.2

    0.3

    0.4

    0.5

    FST

    =0.488C+0.084

    R2=0.96, n=(30, 60, 90)

    FST

    C

    Table 4: Correlation of the estimates of FST and C with varying

    number of sub-populations

    Sample size Correlation coefficient

    (90, 90) 0.874

    (60, 60, 60) 0.959

    (45, 45, 45, 45) 0.965

    (36, 36, 36, 36, 36) 0.971

    http://-/?-http://-/?-http://-/?-http://-/?-
  • 8/14/2019 1756-0500-2-21

    5/5

    Publish with BioMedCentraland everyscientist can read your work free of charge

    "BioMed Central will be the most significant development for

    disseminating the results of biomedical research in our lifetime."

    Sir Paul Nurse, Cancer Research UK

    Your research papers will be:

    available free of charge to the entire biomedical community

    peer reviewed and published immediately upon acceptance

    cited in PubMed and archived on PubMed Central

    yours you keep the copyright

    Submit your manuscript here:

    http://www.biomedcentral.com/info/publishing_adv.asp

    BioMedcentral

    BMC Research Notes2009, 2:21 http://www.biomedcentral.com/1756-0500/2/21

    Page 5 of 5(page number not for citation purposes)

    formative prior distributions for the c and parameters. Ifwe have any prior knowledge regarding the distribution,we could easily incorporate it in the estimations, whichcan lead to more accurate estimates than the moment-based estimates of FST [2].

    In summary, we proposed a new measure of populationstructure based on a Bayesian hierarchical model forSNPs. It uses the information at multiple markers and hashigh correlations with the traditional measure FST. We rec-ommend reporting the new measure along with the indi-

    vidual c parameters for sub-populations so that we couldget an idea of the level of population structure and thedivergence of each sub-population as well.

    Competing interestsThe authors declare that they have no competing interests.

    Authors' contributionsHX conceived of the study, performed the analysis anddrafted the manuscript. BS participated in the analysis andhelped to draft the manuscript. VG participated in thedesign and coordination of the study. All authors read andapproved the final manuscript.

    AcknowledgementsWe thank the two anonymous reviewers for their constructive comments.This work is supported by R21NS057506 from US National Institute ofHealth and Scientist Training Grant from Medical College of Georgia toHX.

    References1. Wright S: Isolation by Distance. Genetics1943, 28:114-138.2. Nicholson G, Smith AV, Jnsson F, Gstafsson , Stefnsson K, Don-

    nelly P: Assessing population differentiation and isolationfrom single-nucleotide polymorphism data. J R Stat Soc Ser BStat Methodol2002, 64(4):695-715.

    3. Balding DJ, Nichols RA: A method for quantifying differentia-tion between populations at multi-allelic loci and its implica-tions for investigating identity and paternity. Genetica1995,96(12):3-12.

    4. Marchini's Hompage [http://www.stats.ox.ac.uk/~marchini/software.html]

    5. Weir BS, Cockerham CC: Estimating F-statistics for the analy-sis of population structure. Evolution1984, 38:1358-1370.

    6. Xu H, Shete S: Effects of population structure on genetic asso-ciation studies. BMC Genet2005, 6(suppl 1):S109.

    7. Wakeley J, Nielsen R, Liu-Cordero SN, Ardlie K: The discovery ofsingle-nucleotide polymorphisms and inferences abouthuman demographic history. Am J Hum Genet 2001,69(6):1332-1347.

    http://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/info/publishing_adv.asphttp://www.biomedcentral.com/http://www.biomedcentral.com/http://www.biomedcentral.com/http://-/?-http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=17247074http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.stats.ox.ac.uk/~marchini/software.htmlhttp://www.stats.ox.ac.uk/~marchini/software.htmlhttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16451565http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16451565http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.biomedcentral.com/http://www.biomedcentral.com/info/publishing_adv.asphttp://www.biomedcentral.com/http://-/?-http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=11704929http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16451565http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=16451565http://www.stats.ox.ac.uk/~marchini/software.htmlhttp://www.stats.ox.ac.uk/~marchini/software.htmlhttp://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=7607457http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&dopt=Abstract&list_uids=17247074