FORWARD INTO THE PAST. METHODS OF GENETIC DIAGNOSTICS OF ANCIENT DNA, OR “WHAT ANCIENT BONES KEEP SILENT ABOUT AND WHAT THEY TELL US”. TATIANA TATARINOVA, UNIVERSITY OF SOUTHERN CALIFORNIA


DNA Research. Toward high-resolution population genomics using archaeological samples. Irina Morozova, Pavel Flegontov, Alexander Mikheyev, Hosseinali Asgharian, Petr Ponomarenko, Vladimir Klyuchnikov, GaneshPrasad ArunKumar, Sergey Bruskin, Egor Prokhortchouk, Yuriy Gankin, Evgeny Rogaev, Yuri Nikolsky, Ancha Baranova, Eran Elhaik, Tatiana V. Tatarinova


Speech gene FOXP2
http://www.nature.com/nature/journal/v418/n6900/fig_tab/nature01025_F2.html
http://www.sciencedirect.com/science/article/pii/S0960982207020659
Extremely conserved among mammals
All humans have two functional amino-acid changes
This gene has been the target of selection during recent human evolution
Neanderthals have the same variant of FOXP2 as modern humans


Lactase persistence appears to have started spreading in the Bronze Age

http://www.nature.com/ncomms/2014/141021/ncomms6257/full/ncomms6257.html

The lactase persistence allele is absent in Neolithic samples but present in Bronze Age samples.
http://www.nature.com/nature/journal/v522/n7555/full/nature14507.html

Response to the artificial environment: Lactase persistence

http://mideats.com
Most mammals normally can't produce lactase after weaning, but some human populations have developed lactase persistence into adulthood.
Domesticated cattle -> Milking -> Lactase persistence
Geographic distribution of the lactase persistence allele in contemporary Europeans
Cattle breeds (blue dots) sampled across Europe and Turkey
Diversity in cattle milk genes

Limits of the geographic distribution of early Neolithic cattle pastoralists (Funnel Beaker Culture)

http://www.nature.com/ng/journal/v35/n4/full/ng1263.html


Lactase persistence: Gene-culture co-evolution

LDC = lactose digestion capacity
https://www.msu.edu/course/eng/473/johnsen/LDC.pdf
Milking evolves first, and the evolution of high LDC is highly dependent on milking.

Industrial Revolution: Changes in microbial ecosystems
Industrial Revolution (17th-19th centuries):
New technologies: industrially processed flour and sugar
Changes in oral microbiota
http://www.nature.com/ng/journal/v45/n4/full/ng.2536.html

Mesolithic hunter-gatherers
Farming: a distinct shift in the early Neolithic toward more caries- and periodontal disease-associated taxa
Consistency in the composition of bacteria through the medieval period
Today's oral environment is much less biodiverse and is dominated by potentially cariogenic bacteria
Decrease of diversity
Domination of cariogenic bacteria


Rychkov et al. 2014.


Der Sarkissian. University of Adelaide. 2011


Keyser et al. Hum. Genet. 2009


Der Sarkissian. University of Adelaide. 2011; Rychkov et al. 2014.

Keyser et al. Hum. Genet. 2009; Der Sarkissian. University of Adelaide. 2011 ( < 20 )


Amelogenin gene in females and males:
Different length
Different sequence

TATCCCAGATGTTTCTCCATCCCAAATAAAGTG...

Amel X, Amel Y
X+ Y- (female), X+ Y+ (male)

NGS Genetic methods

Sex determination: genetic methods
Samples: Female (X+ Y-), Female (X+ Y-), Male (X+ Y+)
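The amelogenin read-out above can be written as a small decision rule. This is only a sketch: the function name `infer_sex` and the "inconclusive" outcome are assumptions for illustration, not part of the slides. In degraded samples the Y-specific signal may fail to amplify, so a "female" call from aDNA should be treated with caution.

```python
def infer_sex(amel_x_detected: bool, amel_y_detected: bool) -> str:
    """Interpret an amelogenin test: X+ Y- -> female, X+ Y+ -> male."""
    if amel_x_detected and amel_y_detected:
        return "male"
    if amel_x_detected:
        return "female"
    # Assumption: without an X signal the assay is treated as failed,
    # e.g. because the DNA is too degraded to amplify.
    return "inconclusive"


calls = [infer_sex(True, False), infer_sex(True, False), infer_sex(True, True)]
print(calls)
```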



After the death of an organism, all of its biomolecules are degraded either by host enzymes released from their proper compartments or by saprobic microorganisms. Therefore, compared to modern DNA, aDNA has a lower concentration; it is fragmented (fragments may be as short as 50-70 nt), contaminated, and chemically modified.
Relative preservation of DNA in old samples depends on environmental circumstances, such as temperature, humidity, pH, or oxygen, rather than on the absolute age of the sample. For instance, DNA extracted from frozen remains dated to thousands or even hundreds of thousands of years can be of better quality than DNA from much more recent samples. Recent studies showed that the age of aDNA readable by current methods is restricted to about 1-1.5 million years. At present, the 560-780 thousand-year-old Middle Pleistocene horse is the most ancient organism from which reliable aDNA data have been procured.

DNase enzymes

Sawyer et al. 2012

Ancient DNA is often contaminated with some level of exogenous DNA: DNA from ancient or modern saprotrophic bacteria or fungi, DNA from other organisms that came into contact with the remains after death, or modern human DNA from the researchers themselves.

In the 1990s, a large number of papers were published reporting DNA sequences from extremely ancient remains such as Miocene plant fossils, amber-entombed organisms, 250-million-year-old bacteria in salt crystals, and dinosaur bones and eggs. In one such case, researchers reported successful extraction and amplification of an mtDNA cytochrome b fragment from a dinosaur. The sequences differed from all modern cytochrome b sequences. This led the authors to believe that they had sequenced authentic DNA from 80-million-year-old bones. It was later discovered that those mtDNA sequences were not close to avian and reptilian mtDNAs, as would be expected from their phylogenetic history, but rather to mammalian (including human) mtDNAs. It was thereby suggested that the alleged dinosaur DNA was contaminated, presumably by modern human DNA. A similar course of events occurred in the study of ancient bacterial DNA supposedly preserved in 250-million-year-old salt crystals, which turned out to be modern bacterial DNA. In addition to these examples, several other aDNA projects have been impeded by contamination of ancient samples.

To prevent contamination, the experiment must be properly managed, with special requirements for sample collection, sterilization of the working area, DNA authentication, and independent reproducibility.
Sample decontamination:
Mechanical removal of the upper layer and UV and/or bleach treatment of the sample.
Incubation of the sample in an extraction buffer and subsequent removal of the buffer. This step alone increases the fraction of endogenous DNA severalfold.
Even so, a substantial fraction of the reads comes from contamination with environmental DNA from bacteria and fungi. Microbial sequences from known species can be flagged by a standard BLAST search against the NCBI non-redundant nucleotide database. This strategy, however, misses microbial sequences from species that have yet to be sequenced. It is therefore not surprising that a large fraction of reads in many aDNA libraries is labeled as unknown or unclassified, mainly due to the unidentified microbial content.


Der Sarkissian et al. 2015

http://mammoth.psu.edu/hair.html

Transitions vs transversions
Comparison of modern and ancient humans
Post-mortem base modifications in aDNA often involve C to U (T) and A to G transitions. Contamination with external DNA can therefore be estimated more reliably from transversion or indel counts.

Base modifications are often observed in the 5-7 final bases of DNA fragments and are thought to occur more readily in terminal, single-stranded overhangs.

An excess of C->T (and G->A) transitions in modern-ancient alignments provides an estimate of base modification.
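A minimal sketch of how such an excess can be measured: for each position counted from the 5' end of a read, compute the fraction of reference C bases that are read as T. The function name, the ungapped (reference, read) input format, and the toy sequences are assumptions for illustration; real analyses work from alignment files and also track the complementary signal at the 3' end.

```python
from collections import Counter


def ct_rate_by_position(pairs, max_pos=10):
    """Fraction of reference C read as T at each position from the 5' end.

    pairs: iterable of (reference, read) strings, same length, ungapped.
    """
    c_total = Counter()
    c_to_t = Counter()
    for ref, read in pairs:
        for i, (r, q) in enumerate(zip(ref[:max_pos], read[:max_pos])):
            if r == "C":
                c_total[i] += 1
                if q == "T":
                    c_to_t[i] += 1
    return [c_to_t[i] / c_total[i] if c_total[i] else 0.0 for i in range(max_pos)]


# Toy alignments (hypothetical): damage is concentrated at the first base.
pairs = [("CACGT", "TACGT"), ("CCGTA", "TCGTA"), ("ACCGT", "ACCGT")]
rates = ct_rate_by_position(pairs, max_pos=5)
print(rates)
```

A rate that is high at the terminal positions and drops toward the read interior is the pattern expected for authentic ancient fragments.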

Base calling
De-multiplexing:
Trim adapters at both ends
Clip low-quality sequences
Stitch overlapping reads

Mapping and Realignment


Estimating post-mortem damage and contamination
Variant calling
Reduction of heterozygosity/homozygosity
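Two of the read-preparation steps listed above, quality clipping and stitching of overlapping paired reads, can be sketched as follows. The function names, the quality threshold, and the exact-match overlap rule are simplifying assumptions; production tools tolerate mismatches and use base qualities when merging.

```python
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def clip_low_quality(seq, quals, min_q=20):
    """Trim bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]


def stitch(forward, reverse, min_overlap=11):
    """Merge a read pair into one fragment if the reads overlap exactly.

    Assumes adapters are already trimmed. Returns None when no overlap
    of at least min_overlap bases is found.
    """
    mate = reverse_complement(reverse)
    for k in range(min(len(forward), len(mate)), min_overlap - 1, -1):
        if forward[-k:] == mate[:k]:
            return forward + mate[k:]
    return None


# Toy 16-nt fragment read 10 nt from each end (4-nt overlap).
fragment = "ACGTTGCAAGGCTTAC"
forward_read = fragment[:10]
reverse_read = reverse_complement(fragment[6:])
merged = stitch(forward_read, reverse_read, min_overlap=4)
print(merged)
```

Because aDNA fragments are often shorter than the read length, most read pairs overlap, which is why stitching is a routine step in these pipelines.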


Ancestry SNP chip
Faster and cheaper than whole-genome sequencing
Well-designed SNP chips contain carefully selected markers

From SNPs to Admixture
To infer population structure from genotype data, the dimensionality of the dataset must first be reduced, because it encompasses thousands of SNPs.

Thousands of SNPs

Columns: Sample ID, North East Asian, Mediterranean, South African, South West Asian, Native American, Oceanian, South East Asian, Northern European, Sub-Saharan African
HGDP00985  0.5253  0.0202  0  0.2222  0.0404  0.0101  0.0101  0.1717  0
HGDP01094  0.04  0.04  0  0.03  0.83  0  0.01  0.05  0
HGDP00982  0.0102  0.1531  0.0306  0.0714  0.0408  0  0.0102  0.2041  0.4796

ADMIXTURE

Admixture proportions in geographically adjacent populations, such as Italians and Greeks, and in populations sharing a similar history, like British and Germans, are similar.

603 unrelated individuals representing 54 worldwide populations and subpopulations, with ~15 samples per population. The results of the admixture analysis exhibit spatial patterns: individuals of the same genetic background are similar, and similarity decreases with distance. In a few cases, the distributions of admixture proportions overlapped: in geographically adjacent populations, such as Italians and Greeks, and in populations sharing a similar history, like British and Germans.

3/15/2016

Question
How to link genetic and geographic divergence?

Input: genetics
Samples with known origin
Columns: Sample ID, North East Asian, Mediterranean, South African, South West Asian, Native American, Oceanian, South East Asian, Northern European, Sub-Saharan African
Chinese1  0.718826  0.000419  0.00001  0.00001  0.00001  0.00001  0.280695  0.00001  0.00001
Chinese2  0.734967  0.00001  0.00001  0.00001  0.001061  0.00001  0.263912  0.00001  0.00001
Chinese3  0.74693  0.00001  0.00001  0.00001  0.010271  0.003244  0.239505  0.00001  0.00001
Chinese4  0.671209  0.00001  0.00001  0.00001  0.00001  0.00001  0.328721  0.00001  0.00001
Chinese5  0.725614  0.00001  0.00001  0.00001  0.00001  0.00001  0.274316  0.00001  0.00001
Chinese6  0.72071  0.00001  0.00001  0.001098  0.01665  0.00001  0.261492  0.00001  0.00001
Chinese7  0.695701  0.00001  0.00001  0.00001  0.00001  0.00001  0.304229  0.00001  0.00001
Chinese8  0.709767  0.00001  0.00001  0.00001  0.00001  0.00001  0.290163  0.00001  0.00001
Chinese9  0.715808  0.01056  0.00001  0.00001  0.00001  0.00001  0.273572  0.00001  0.00001
Chinese10  0.732043  0.00001  0.00001  0.00001  0.012694  0.00001  0.255203  0.00001  0.00001
Chinese11  0.655995  0.00001  0.00001  0.00001  0.00001  0.00001  0.343935  0.00001  0.00001
Chinese12  0.712607  0.00001  0.00001  0.00001  0.00001  0.00001  0.287323  0.00001  0.00001

Input: geography
For every reference population, find the corresponding coordinates.
Columns: Population, Latitude, Longitude
Chinese  39.55  116.2
Russian  55.75  37.62
Tatar  55.55  50.93

Moscow

Relationship between genetic and geographic distances

We correlated the admixture patterns with geography by calculating two distance matrices between all populations.
For all reference samples, compute the genetic and geographic distances between samples.
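A minimal sketch of the two distance computations, using the mean admixture vectors and coordinates that the slides give for three reference populations. The slides do not state which distance measures were used; Euclidean distance between admixture vectors and great-circle (haversine) distance between coordinates are assumptions here, as are all function and variable names.

```python
import math
from itertools import combinations

# Per-population mean admixture vector (9 components) and (lat, lon) in degrees,
# copied from the slides' reference tables.
REFERENCE = {
    "Chinese": ([0.711681, 0.000923, 1e-05, 0.000101, 0.003396,
                 0.00028, 0.283589, 1e-05, 1e-05], (39.55, 116.2)),
    "Russian": ([0.068867, 0.265222, 0.001241, 0.224659, 0.035011,
                 0.008622, 0.031844, 0.363107, 0.001419], (55.75, 37.62)),
    "Tatar": ([0.15794, 0.209897, 1e-05, 0.210957, 0.011902,
               0.002605, 0.005703, 0.400975, 1e-05], (55.55, 50.93)),
}


def haversine_km(a, b, radius_km=6371.0):
    """Great-circle distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(h))


genetic = {}     # assumed: Euclidean distance between admixture vectors
geographic = {}  # assumed: haversine distance in kilometres
for p, q in combinations(REFERENCE, 2):
    genetic[(p, q)] = math.dist(REFERENCE[p][0], REFERENCE[q][0])
    geographic[(p, q)] = haversine_km(REFERENCE[p][1], REFERENCE[q][1])

for pair in genetic:
    print(pair, round(genetic[pair], 3), round(geographic[pair]), "km")
```

In this toy set the geographically close pair (Russian, Tatar) is also the genetically closest pair, which is the kind of relationship the correlation is meant to capture.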

Question
Knowing the relationship between geographic and genetic distances, is it possible to find the geographic origin of a person of known genotype?

We decided to try a simple approach

AB

X

First step: calculate mean admixture vectors
For every reference population, calculate mean admixture vectors
Columns: Population, North East Asia, Mediterranean, South Africa, South West Asia, Native America, Oceania, South East Asia, Northern Europe, Sub-Saharan Africa
Chinese  0.711681  0.000923  1.00E-05  0.000101  0.003396  0.00028  0.283589  1.00E-05  1.00E-05
Russian  0.068867  0.265222  0.001241  0.224659  0.035011  0.008622  0.031844  0.363107  0.001419
Tatar  0.15794  0.209897  1.00E-05  0.210957  0.011902  0.002605  0.005703  0.400975  1.00E-05
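As a check on this step, the sketch below averages the twelve Chinese reference samples listed earlier in the slides, component by component. The function name is an assumption; the input values are copied from the sample table.

```python
from statistics import fmean

# Admixture proportions of samples Chinese1..Chinese12 (9 components each).
CHINESE_SAMPLES = [
    [0.718826, 0.000419, 0.00001, 0.00001, 0.00001, 0.00001, 0.280695, 0.00001, 0.00001],
    [0.734967, 0.00001, 0.00001, 0.00001, 0.001061, 0.00001, 0.263912, 0.00001, 0.00001],
    [0.74693, 0.00001, 0.00001, 0.00001, 0.010271, 0.003244, 0.239505, 0.00001, 0.00001],
    [0.671209, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.328721, 0.00001, 0.00001],
    [0.725614, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.274316, 0.00001, 0.00001],
    [0.72071, 0.00001, 0.00001, 0.001098, 0.01665, 0.00001, 0.261492, 0.00001, 0.00001],
    [0.695701, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.304229, 0.00001, 0.00001],
    [0.709767, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.290163, 0.00001, 0.00001],
    [0.715808, 0.01056, 0.00001, 0.00001, 0.00001, 0.00001, 0.273572, 0.00001, 0.00001],
    [0.732043, 0.00001, 0.00001, 0.00001, 0.012694, 0.00001, 0.255203, 0.00001, 0.00001],
    [0.655995, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.343935, 0.00001, 0.00001],
    [0.712607, 0.00001, 0.00001, 0.00001, 0.00001, 0.00001, 0.287323, 0.00001, 0.00001],
]


def mean_admixture_vector(samples):
    """Component-wise mean of a list of equal-length admixture vectors."""
    return [fmean(component) for component in zip(*samples)]


chinese_mean = mean_admixture_vector(CHINESE_SAMPLES)
print([round(x, 6) for x in chinese_mean])
```

Rounded to six decimals, the first and seventh components agree with the Chinese row of the mean-vector table (0.711681 and 0.283589).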


Dealing with individuals of unknown origin
Columns: Sample, North East Asia, Mediterranean, South Africa, South West Asia, Native America, Oceania, South East Asia, Northern Europe, Sub-Saharan Africa
Unknown  0.711681  0.000923  1.00E-05  0.000101  0.003396  0.00028  0.283589  1.00E-05  1.00E-05

Find distances between the Unknown vector and all reference vectors
Sort reference populations by distance from smallest to largest
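These two steps can be sketched directly with the vectors shown on the slides. Euclidean distance is an assumption (the slides do not name the metric), and the function name is hypothetical. The sketch stops at the ranking and does not attempt to convert distances into coordinates.

```python
import math

# Mean admixture vectors of the reference populations, from the slides.
REFERENCE_MEANS = {
    "Chinese": [0.711681, 0.000923, 1e-05, 0.000101, 0.003396,
                0.00028, 0.283589, 1e-05, 1e-05],
    "Russian": [0.068867, 0.265222, 0.001241, 0.224659, 0.035011,
                0.008622, 0.031844, 0.363107, 0.001419],
    "Tatar": [0.15794, 0.209897, 1e-05, 0.210957, 0.011902,
              0.002605, 0.005703, 0.400975, 1e-05],
}

# The "Unknown" individual from the slides.
UNKNOWN = [0.711681, 0.000923, 1e-05, 0.000101, 0.003396,
           0.00028, 0.283589, 1e-05, 1e-05]


def rank_populations(unknown, reference):
    """Return (distance, population) pairs sorted from nearest to farthest."""
    return sorted((math.dist(unknown, vector), name)
                  for name, vector in reference.items())


ranking = rank_populations(UNKNOWN, REFERENCE_MEANS)
for distance, name in ranking:
    print(f"{name}: {distance:.3f}")
```

Here the Unknown vector coincides with the Chinese mean, so Chinese ranks first with distance zero.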

Example

Unknown samples

Accuracy of the GPS algorithm
Leave-one-out approach

GPS1 maps 80% of the individuals to their countries of origin, and 60% of all individuals to their exact inner-country region. The assignment accuracy was largely affected (r=0.45) by the genetic diversity of the reference populations, as estimated by the standard deviation of their admixture proportions.

GPS1 accurately assigned:
~100% of all individuals to their continental regions
80% of all individuals to their country of origin
60% of all individuals to their inner-country region

Populations for which inner-country data was available are marked with *. The average accuracy is calculated across populations given equal weights.
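The leave-one-out idea can be illustrated with a nearest-mean classifier: each sample is removed in turn, the population means are recomputed from the remaining samples, and the held-out sample is assigned to the nearest mean. This is a simplified stand-in for the evaluation, not the procedure behind the figures above; the two-component vectors, the population labels "A" and "B", and the function names are all hypothetical.

```python
import math
from statistics import fmean


def mean_vector(vectors):
    """Component-wise mean of equal-length vectors."""
    return [fmean(component) for component in zip(*vectors)]


def leave_one_out_accuracy(samples):
    """samples: list of (population, admixture_vector) pairs."""
    correct = 0
    for i, (true_pop, vector) in enumerate(samples):
        rest = samples[:i] + samples[i + 1:]
        by_pop = {}
        for pop, v in rest:
            by_pop.setdefault(pop, []).append(v)
        means = {pop: mean_vector(vs) for pop, vs in by_pop.items()}
        predicted = min(means, key=lambda pop: math.dist(vector, means[pop]))
        correct += predicted == true_pop
    return correct / len(samples)


# Hypothetical toy data: the last "B" sample lies closer to population "A".
toy_samples = [
    ("A", [0.9, 0.1]), ("A", [0.8, 0.2]),
    ("B", [0.2, 0.8]), ("B", [0.1, 0.9]), ("B", [0.6, 0.4]),
]
accuracy = leave_one_out_accuracy(toy_samples)
print(accuracy)
```

The outlying "B" sample is misassigned, which mirrors the observation that accuracy drops when a reference population is genetically diverse.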


Application of GPS to aDNA (Bronze Age)

30 out of 100 Bronze Age samples (Allentoft et al. 2015) had over 500 ancestry-informative markers.
We applied the GPS algorithm to find the closest modern population.


Phenotype prediction from aDNA
Problem: coverage is low, and the reliability of each individual SNP is poor.
Solution: consider the population as a whole, group SNPs by disease, and rank diseases by the number of SNPs.

Phenotype

Links can be taken from HGMD or ClinVar databases
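A minimal sketch of the grouping-and-ranking idea: given the SNPs observed across a population and a SNP-to-condition mapping, count the distinct SNPs per condition and sort. The identifiers and condition names below are placeholders, and the dictionary format is an assumption; it does not reflect the actual export format of HGMD or ClinVar.

```python
from collections import Counter


def rank_conditions(observed_snps, snp_to_conditions):
    """Rank conditions by the number of distinct observed SNPs linked to them."""
    counts = Counter()
    for snp in set(observed_snps):
        for condition in snp_to_conditions.get(snp, ()):
            counts[condition] += 1
    # Sort by descending count, then by name so that ties are deterministic.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical mapping and observations (placeholder identifiers).
snp_to_conditions = {
    "snp1": ["condition A"],
    "snp2": ["condition A", "condition B"],
    "snp3": ["condition B"],
    "snp4": ["condition A"],
}
observed = ["snp1", "snp2", "snp4", "snp2", "snp9"]
ranked = rank_conditions(observed, snp_to_conditions)
print(ranked)
```

Counting across the whole population rather than per individual is what makes the ranking usable at low coverage, since no single genotype call has to be trusted on its own.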

Conditions with the highest/lowest number of SNPs in Bronze Age Europe
Adenomatous polyposis coli
Liver glycogenosis
Muir-Torre syndrome
Haemoglobin variant
Congenital disorder of glycosylation 1a
Thalassaemia alpha
Von Willebrand disease 2a
Diabetes, permanent neonatal
Short stature
Diabetes, neonatal

F3 X, Y Z (), Y1 Y2 X ()F3(Z;x,y)>0 X,Y Z