Zhu Etal 2008 Plant Genome 1 5


  • 7/25/2019 Zhu Etal 2008 Plant Genome 1 5

    THE PLANT GENOME, JULY 2008, VOL. 1, NO. 1

    REVIEW & INTERPRETATION

    Status and Prospects of Association Mapping in Plants

    Chengsong Zhu, Michael Gore, Edward S. Buckler, and Jianming Yu*

    Abstract

    There is tremendous interest in using association mapping to identify genes responsible for quantitative variation of complex traits with agricultural and evolutionary importance. Recent advances in genomic technology, impetus to exploit natural diversity, and development of robust statistical analysis methods make association mapping appealing and affordable to plant research programs. Association mapping identifies quantitative trait loci (QTLs) by examining the marker-trait associations that can be attributed to the strength of linkage disequilibrium between markers and functional polymorphisms across a set of diverse germplasm. General understanding of association mapping has increased significantly since its debut in plants. We have seen a more concerted effort in assembling various association-mapping populations and initiating experiments through either candidate-gene or genome-wide approaches in different plant species. In this review, we describe the current status of association mapping in plants and outline opportunities and challenges in complex trait dissection and genomics-assisted crop improvement.

    Large-scale genome-wide association analyses of major human diseases have yielded very promising results, corroborating findings of previous candidate-gene association studies and identifying novel disease loci that were previously unknown (The Wellcome Trust Case Control Consortium, 2007). The same strategy is being exploited in many plant species thanks to the dramatic reduction in costs of genomic technologies. In contrast to linkage analysis, which is widely used in traditional mapping research in plants, association mapping searches for functional variation in a much broader germplasm context. Association mapping enables researchers to use modern genomic technologies to exploit natural diversity, the wealth of which is known to plant geneticists and breeders but was utilized only on a small scale before the genomics era. Owing to the ease of producing large numbers of progenies from controlled crosses and conducting replicated trials with immortal individuals (inbreds and recombinant inbred lines, RILs), association mapping in plants may prove to be more promising than in human or animal genetics. In the current review, we focus on presenting association mapping as a new strategy for genetic dissection of complex traits, the steps to initiate an association mapping study, and common methods in genotyping, phenotyping, and data analysis. Interested readers may also refer to previous reviews on other technical aspects such as linkage disequilibrium, population structure, and statistical analysis (Ersoz et al., 2008; Flint-Garcia et al., 2003; Yu and Buckler, 2006).

    Published in The Plant Genome 1:5–20. Published 16 July 2008. doi: 10.3835/plantgenome2008.02.0089. Crop Science Society of America, 677 S. Segoe Rd., Madison, WI 53711 USA. An open-access publication.

    All rights reserved. No part of this periodical may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Permission for printing and for reprinting the material contained herein has been obtained by the publisher.

    C. Zhu and J. Yu, Dep. of Agronomy, Kansas State University, 2004 Throckmorton Hall, Manhattan, KS 66506; M. Gore, Dep. of Plant Breeding and Genetics, Cornell University, Ithaca, NY 14853; Edward S. Buckler, USDA-ARS and Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the USDA. Received 11 Feb. 2008. *Corresponding author ([email protected]).

    Abbreviations: AB-QTL, advanced backcross QTL; AFLP, amplified fragment length polymorphism; GC, genomic control; IL, introgression library; K, kinship matrix; lcyE, lycopene epsilon cyclase; LD, linkage disequilibrium; NAM, nested association mapping; oligo, oligonucleotide; PCA, principal component analysis; Q, population structure; QTDT, quantitative transmission disequilibrium test; QTLs, quantitative trait loci; RAPD, random amplified polymorphic DNA; RILs, recombinant inbred lines; SA, structured association; SBE, single base extension; SFP, single feature polymorphism; SNPs, single nucleotide polymorphisms; SSRs, simple sequence repeats.

    WHY ASSOCIATION MAPPING?

    New Tool

    The phenotypic variation of many complex traits of agricultural or evolutionary importance is influenced by multiple quantitative trait loci (QTLs), their interaction, the environment, and the interaction between QTL and environment. Linkage analysis and association mapping are the two most commonly used tools for dissecting complex traits (Fig. 1). Linkage analysis in plants typically localizes QTLs to 10 to 20 cM intervals because of the limited number of recombination events that occur during the construction of mapping populations and the cost for propagating and evaluating a large number of lines (Doerge, 2002; Holland, 2007). While hundreds of linkage analysis studies have been conducted in various plant species over the past two decades (Holland, 2007; Kearsey and Farquhar, 1998), only a limited number of identified QTLs were cloned or tagged at the gene level (Price, 2006). Association mapping, also known as linkage disequilibrium (LD) mapping, has emerged as a tool to resolve complex trait variation down to the sequence level by exploiting historical and evolutionary recombination events at the population level (Nordborg and Tavaré, 2002; Risch and Merikangas, 1996). As a new alternative to traditional linkage analysis, association mapping offers three advantages: (i) increased mapping resolution, (ii) reduced research time, and (iii) greater allele number (Yu and Buckler, 2006). Since its introduction to plants (Thornsberry et al., 2001), association mapping has continued to gain favorability in genetic research because of advances in high-throughput genomic technologies, interest in identifying novel and superior alleles, and improvements in statistical methods (Fig. 2).

    Based on the scale and focus of a particular study, association mapping generally falls into two broad categories (Fig. 3): (i) candidate-gene association mapping, which relates polymorphisms in selected candidate genes that have purported roles in controlling phenotypic variation for specific traits; and (ii) genome-wide association mapping, or genome scan, which surveys genetic variation in the whole genome to find signals of association for various complex traits (Risch and Merikangas, 1996). While researchers interested in a specific trait or a suite of traits often exploit candidate-gene association mapping, a large consortium of researchers might choose to conduct comprehensive genome-wide analyses of various traits by testing hundreds of thousands of molecular markers distributed across the genome for association.

    Figure 1. Schematic comparison of linkage analysis with designed mapping populations and association mapping with diverse collections.

    Genomic Technology

    Advances in high-throughput genotyping and sequencing technologies have markedly reduced the cost per data point of molecular markers, particularly single nucleotide polymorphisms (SNPs) (Hirschhorn and Daly, 2005; Syvanen, 2005). For candidate-gene association mapping, information regarding the location and function of genes involved in either biochemical or regulatory pathways that lead to final trait variation is often required. Fortunately, due to the availability of annotated genome sequences from several model species and the general application of genomic technology (e.g., sequencing, genotyping, gene expression profiling, comparative genomics, bioinformatics, linkage analysis, mutagenesis, and biochemistry), a whole host of candidate gene sequences for various complex traits is now available for further association analysis. On the other hand, as it becomes affordable to identify hundreds of thousands of SNPs through resequencing a core set of diverse lines and to genotype these SNPs across a larger number of samples, researchers are moving toward genome-wide association analyses of complex traits. For example, the Arabidopsis HapMap provided a powerful catalog of genetic diversity with more than 1 million SNPs (i.e., on average one SNP every 166 bp) (Clark et al., 2007), a rate about 11-fold higher than that of human populations (Hinds et al., 2005).

    Not too long ago, our capacity to conduct even a thorough linkage analysis study with a few hundred molecular markers was limited by the cost of genotyping. Now, a new question faced by many researchers is "How can I take advantage of the high-throughput genomic technologies?" Obviously, association mapping is one approach that heavily leverages these emerging genomic technologies, with sequencing, resequencing, and genotyping as the intermediate steps to the final goal of linking functional polymorphisms to complex trait variation.

    Natural Diversity

    Association mapping harnesses the genetic diversity of natural populations to potentially resolve complex trait variation to single genes or individual nucleotides. Conventional linkage analysis with experimental populations derived from a bi-parental cross provides pertinent information about traits that tends to be specific to the same or genetically related populations, while results from association mapping are more applicable to a much wider germplasm base. The ability to map QTLs in collections of breeding lines, landraces, or samples from natural populations has great potential for future trait improvement and germplasm security. With regard to exploring natural diversity, advanced backcross QTL (AB-QTL) and introgression library (IL) are well-known strategies for mining alleles from exotic germplasm to improve the productivity, adaptation, quality, and nutritional value of crops (Tanksley and McCouch, 1997; Zamir, 2001). Association mapping is complementary to AB-QTL and IL in that it is an additional tool for evaluating extant functional diversity in each crop species on a much larger scale (Breseghello and Sorrells, 2006a; Flint-Garcia et al., 2003).

    Figure 2. Main driving forces of the current interest in association mapping. Genomic technologies for high-throughput genome sequencing and genotyping made it more affordable to obtain a large amount of marker data across a large diversity panel for complex trait dissection and superior allele mining. Methodology development alleviated the issue of false positives due to population structure.

    Methodology Development

    Conventional linkage mapping in plant species, including single marker analysis, interval mapping, multiple interval mapping, and Bayesian interval mapping, is well developed and validated (Doerge, 2002; Zeng, 2005). In contrast, little effort has been made to develop robust methods of association mapping in plant species. False positives generated by population structure have long been regarded as a hurdle to association mapping, and it has been difficult to replicate significant results in independent studies and follow up on detected signals with costly molecular and biochemical analyses. Given the geographical origins, local adaptation, and breeding history of assembled genotypes in an association mapping panel, these non-independent samples usually contain both population structure and familial relatedness (Yu and Buckler, 2006). Recently, several statistical methods have been proposed to account for population structure and familial relatedness: structured association (SA) (Falush et al., 2003; Pritchard and Rosenberg, 1999; Pritchard et al., 2000a), genomic control (GC) (Devlin and Roeder, 1999), the mixed model approach (Yu et al., 2006), and the principal component approach (Price et al., 2006). The essence of these approaches is to use genotypic information from random molecular markers across the genome to account for genetic relatedness in association tests either explicitly (e.g., SA and mixed model) or through ad hoc adjustment (e.g., GC). With these methods, the issue of false positives generated by population structure can now be dealt with accordingly (Price et al., 2006; Yu et al., 2006; Zhao et al., 2007).

    Figure 3. Schematic diagram and contrast of genome-wide association mapping and candidate-gene association mapping. The inclusion of population structure (Q), relative kinship (K), or both in final association analysis depends on the genetic relationship of the association mapping panel and the divergence of the trait examined. E stands for residual variance.
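To make the ad hoc adjustment used by genomic control more concrete, the following is a minimal sketch, not the implementation of any of the cited studies. It assumes that every marker has already been tested with a 1-degree-of-freedom chi-square statistic, estimates the inflation factor (conventionally called lambda) from random background markers, and deflates the candidate statistics by it. The function names, the example statistics, and the choice to floor lambda at 1 are assumptions made for illustration.

```python
import math
import statistics

# Median of a chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def chi2_1df_pvalue(stat):
    """Upper-tail p-value of a 1-df chi-square statistic."""
    return math.erfc(math.sqrt(stat / 2.0))


def genomic_control(candidate_stats, null_stats):
    """Deflate candidate test statistics by the inflation factor lambda.

    null_stats: 1-df chi-square statistics from random background markers.
    candidate_stats: 1-df chi-square statistics for the markers of interest.
    Returns (lambda, adjusted statistics, adjusted p-values).
    """
    lam = statistics.median(null_stats) / CHI2_1DF_MEDIAN
    lam = max(lam, 1.0)  # assumption: never inflate the statistics
    adjusted = [s / lam for s in candidate_stats]
    return lam, adjusted, [chi2_1df_pvalue(s) for s in adjusted]


# Hypothetical statistics, for illustration only.
background = [0.2, 0.5, 0.9, 1.4, 2.1, 0.7, 1.1, 3.0, 0.4, 1.8]
candidates = [12.0, 4.0]
lam, adjusted, pvalues = genomic_control(candidates, background)
```

Because the background statistics here have a median of 1.0, well above the 0.455 expected with no structure, lambda exceeds 2 and both candidate statistics shrink accordingly. The explicit approaches (SA, mixed model) instead model relatedness directly and are not covered by this sketch.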

    Current Status

    So far, a series of research papers focusing on LD and association mapping have been published, spanning more than a dozen plant species (Table 1). Many major crops, such as maize (Zea mays L.), soybean (Glycine max (L.) Merr.), barley (Hordeum vulgare L.), wheat (Triticum aestivum L.), tomato (Lycopersicon esculentum Mill.), sorghum (Sorghum bicolor (L.) Moench), and potato (Solanum tuberosum L.), as well as tree species such as aspen (Populus tremula L.) and loblolly pine (Pinus taeda L.), have been studied. Many questions still demand further study as we attempt to gain a better grasp of the various genetic and statistical aspects of association mapping. For example, should one choose a highly pedigreed group of individuals from breeding programs or a diverse collection of germplasm bank accessions? Does one need to be concerned about false positives due to population structure? What is the appropriate analysis method? Should one start a candidate-gene or genome-wide association analysis? Are cryptic genetic relationships adequately estimated by random markers? We offer our opinions on some of these questions in the following sections.

    HOW TO INITIATE ASSOCIATION MAPPING?

    Species and Germplasm

    Before initiating association mapping, researchers should carefully consider all genetic aspects of the species and the associated germplasm available. The ploidy level of individuals from a plant species whose genetics are not well characterized should be evaluated, particularly if the assembled population contains wild accessions obtained from a germplasm bank. This helps to avoid the difficulty in differentiating the effects of functional polymorphisms from those of allele dosage. Because the task of assembling and studying an association mapping population requires a long-term commitment, it is worthwhile to examine various genetic tools available for a given species. Are there groups of scientists who have been conducting genetic, physiological, or biochemical studies within the species? What are the available molecular markers that have been developed for this species? What is the current status of linkage analysis for the targeted traits?

    Table 1. Examples of association mapping studies in various plant species.

    Plant species | Populations | Sample size | Background markers | Traits | References
    --- | --- | --- | --- | --- | ---
    Maize | Diverse inbred lines | 92 | 141 | Flowering time | (Thornsberry et al., 2001)
     | Elite inbred lines | 71 | 55 | Flowering time | (Andersen et al., 2005)
     | Diverse inbred lines and landraces | 375 + 275 | 55 | Flowering time | (Camus-Kulandaivelu et al., 2006)
     | Diverse inbred lines | 95 | 192 | Flowering time | (Salvi, 2007)
     | Diverse inbred lines | 102 | 47 | Kernel composition, starch pasting properties | (Wilson et al., 2004)
     | Diverse inbred lines | 86 | 141 | Maysin synthesis | (Szalma et al., 2005)
     | Elite inbred lines | 75 | | Kernel color | (Palaisa et al., 2004)
     | Diverse inbred lines | 57 | | Sweet taste | (Tracy et al., 2006)
     | Elite inbred lines | 553 | 8950 | Oleic acid content | (Belo et al., 2008)
     | Diverse inbred lines | 282 | 553 | Carotenoid content | (Harjes et al., 2008)
    Arabidopsis | Diverse ecotypes | 95 | 104 | Flowering time | (Olsen et al., 2004)
     | Diverse ecotypes | 95 | 2553 | Disease resistance, flowering time | (Aranzana et al., 2005) (Zhao et al., 2007)
     | Diverse accessions | 96 | | Shoot branching | (Ehrenreich et al., 2007)
    Sorghum | Diverse inbred lines | 377 | 47 | Community resource report | (Casa et al., 2008)
    Wheat | Diverse cultivars | 95 | 95 | Kernel size, milling quality | (Breseghello and Sorrells, 2006b)
    Barley | Diverse cultivars | 148 | 139 | Days to heading, leaf rust, yellow dwarf virus, rachilla hair length, lodicule size | (Kraakman et al., 2006)
    Potato | Diverse cultivars | 123 | 49 | Late blight resistance | (Malosetti et al., 2007)
    Rice | Diverse land races | 105 | | Glutinous phenotype | (Olsen and Purugganan, 2002)
     | Diverse land races | 577 | 577 | Starch quality | (Bao et al., 2006)
     | Diverse accessions | 103 | 123 | Yield and its components | (Agrama et al., 2007)
    Pinus taeda | Unstructured natural population | 32 | 21 | Wood specific gravity, late wood | (Gonzalez-Martinez et al., 2006)
     | Lines | 435 | 288 | Microfibril angle, cellulose content | (Gonzalez-Martinez et al., 2007)
    Sugarcane | Diverse clones | 154 | 2209 | Disease resistance | (Wei et al., 2006)
    Eucalyptus | Unstructured natural population | 290 | 35 | Microfibril angle | (Thumma and Nolan, 2005)
    Perennial ryegrass | Diverse natural germplasms | 26 | 589 | Heading date | (Skøt et al., 2005)
     | Diverse natural germplasms | 96 | 506 | Flowering time, water soluble carbohydrate | (Skøt et al., 2007)

    Choice of germplasm is critical to the success of association analysis (Breseghello and Sorrells, 2006a; Flint-Garcia et al., 2003; Yu et al., 2006). Genetic diversity, extent of genome-wide LD, and relatedness within the population determine the mapping resolution, marker density, statistical methods, and mapping power. Generally, plant populations amenable to association studies can be classified into one of five groups (Yu and Buckler, 2006; Yu et al., 2006): (i) ideal sample with subtle population structure and familial relatedness, (ii) multi-family sample, (iii) sample with population structure, (iv) sample with both population structure and familial relationships, and (v) sample with severe population structure and familial relationships. Due to local adaptation, selection, and breeding history in many plant species, many populations for association mapping would fall into category four. Alternatively, we can classify populations according to the source of materials: germplasm bank collections, synthetic populations, and elite germplasm (Breseghello and Sorrells, 2006a).

    Linkage Disequilibrium

    Linkage disequilibrium, or gametic phase disequilibrium, measures the degree of non-random association between alleles at different loci. The difference between the observed haplotype frequency and the frequency expected from the allele frequencies is defined as D:

    \[ D = p_{AB} - p_A p_B \]

    where \(p_{AB}\) is the frequency of gamete AB, and \(p_A\) and \(p_B\) are the frequencies of alleles A and B, respectively. In the absence of other forces, recombination through random mating breaks down the LD with \(D_t = D_0 (1 - r)^t\), where r is the recombination fraction between the two loci and \(D_t\) is the remaining LD between them after t generations of random mating from the original \(D_0\). Several statistics have been proposed for LD, and these measurements largely differ in how they are affected by marginal allele frequencies and small sample sizes (Hedrick, 1987). Both \(D'\) (Lewontin, 1964) and \(r^2\) (Hill and Robertson, 1968) have been widely used to quantify LD. For two bi-allelic loci, \(D'\) and \(r^2\) have the following formulas:

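    \[ D' = \frac{|D|}{D_{\max}}, \qquad D_{\max} = \begin{cases} \min(p_A p_b,\; p_a p_B) & \text{if } D > 0 \\ \min(p_A p_B,\; p_a p_b) & \text{if } D < 0 \end{cases} \]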
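The LD quantities discussed in this section can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and example frequencies are hypothetical, the frequencies of the alternative alleles a and b are taken as 1 - p_A and 1 - p_B, and r² is computed with the standard Hill and Robertson definition, D² / (p_A p_a p_B p_b), which is an assumption as far as this excerpt goes.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, D', r^2) for two bi-allelic loci.

    p_ab: frequency of the AB gamete (haplotype).
    p_a, p_b: frequencies of alleles A and B; the alternative alleles
    a and b have frequencies 1 - p_a and 1 - p_b.
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b
    d = p_ab - p_a * p_b
    # D_max depends on the sign of D.
    if d > 0:
        d_max = min(p_a * q_b, q_a * p_b)
    else:
        d_max = min(p_a * p_b, q_a * q_b)
    d_prime = abs(d) / d_max if d_max > 0 else 0.0
    denom = p_a * q_a * p_b * q_b
    r2 = d * d / denom if denom > 0 else 0.0
    return d, d_prime, r2


def ld_after(d0, r, t):
    """Expected LD after t generations of random mating: D_t = D_0 (1 - r)^t."""
    return d0 * (1.0 - r) ** t


# Hypothetical frequencies, for illustration only.
d, d_prime, r2 = ld_statistics(p_ab=0.4, p_a=0.5, p_b=0.5)
d_10 = ld_after(d, r=0.1, t=10)
```

With these example frequencies, D is about 0.15, D' about 0.6, and r² about 0.36; after 10 generations at a recombination fraction of 0.1, roughly a third of the original D remains. Monomorphic loci, where the denominators vanish, are reported as zero here by choice rather than by any convention in the text.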