
Workshop 2011


Page 1: Workshop 2011

Permissions: you are free to blog or live-blog about this presentation as long as you attribute the work to its authors

Population Structure Analysis using STRUCTURE software

Chang Bum Hong

kt Bioinformatics TF, [email protected], twitter @hongiiv, hongiiv.tistory.com

Friday, August 12, 11

Page 2: Workshop 2011

Genetic test

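When alcohol is consumed, it is first converted into acetaldehyde, a toxic substance that causes facial flushing, a pounding heart, and vomiting. Acetaldehyde is then broken down by ALDH into acetate, which is harmless to the body. The ALDH2 gene is responsible for breaking down acetaldehyde as soon as even a small amount is produced, and individuals fall into three types depending on their ALDH2 genotype.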


Page 3: Workshop 2011

23andMe


Page 4: Workshop 2011

Northwestern Europe

Southeastern Europe


Page 5: Workshop 2011


HGDP(Human Genome Diversity Project)


Page 6: Workshop 2011


PASNP(Pan-Asian SNP Consortium)


Page 7: Workshop 2011

East Asia - Public genotype data

Dataset           SNPs        Individuals   Populations
PASNP (a)         54,794      1,928         75
HGDP              2,834~      1,056         52
HapMap            1,481,135   1,397         11
SGVP (b)          268,667     292           3
Korean            58,625      159           10
China (Yanbian)   58,625      16            1
Japan (Kobe)      58,625      5             1
Korea-Japan       58,625      6             1
Vietnam           58,625      16            1
Korean-Vietnam    58,625      8             1
Cambodia          58,625      16            1
Mongol            58,625      16            1

a. Pan-Asian SNP Consortium (http://www4a.biotec.or.th/PASNP)

b. Singapore Genome Variation Project (http://www.nus-cme.org.sg/SGVP)


Page 8: Workshop 2011

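Korean Data

Sampling regions (10): GyeongJu, UlSan, GoRyeong, GimJe, NaJu, JeJu, YeonCheon, CheonAn, JeCheon, PyeongChang

Map area labels: SE, SW, MW

Samples per region (map labels): 15 in one region, 16 in each of the other nine

Sample selection: average age >70 years, long settlement in the region

Genotyping: Affymetrix 50K Xba, 58,960 SNPs

Other populations: China (Yanbian), Japan (Kobe), Korea-Japan, Vietnam, Korean-Vietnam, Cambodia, Mongol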


Page 9: Workshop 2011

Missing genotype individuals

Panel 1: Before QC, 58,960 SNPs, All Asian

Panel 2: Before QC, 58,960 SNPs, Korean

Regions labeled on the plot: GyeongJu, GoRyeong, GimJe


Page 10: Workshop 2011

Relatedness between the 153 Korean individuals (10 regions)

PCA using 46,559 autosomal SNP markers (n = 153, Korean)

Plot legend (regions): GyeongJu, UlSan, JeJu, GimJe, NaJu, GoRyeong, CheonAn, JeCheon, PyeongChang, YeonCheon
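As a rough illustration of what a PCA of genotype data involves, the sketch below computes first-principal-component scores for a tiny, made-up matrix of 0/1/2 allele counts using power iteration. The data, the function name `first_pc`, and the choice of power iteration are all assumptions for illustration; they are not the software or data used for the plot on this slide.

```python
import math


def first_pc(genotypes, iterations=200):
    """Return PC1 scores, one per individual (row), from 0/1/2 allele counts."""
    n = len(genotypes)
    m = len(genotypes[0])
    # Center each SNP column so that PC1 reflects differences between individuals.
    means = [sum(row[j] for row in genotypes) / n for j in range(m)]
    centered = [[row[j] - means[j] for j in range(m)] for row in genotypes]
    # n x n similarity matrix between individuals (G times G transposed).
    sim = [[sum(a * b for a, b in zip(ri, rj)) for rj in centered]
           for ri in centered]
    # Power iteration converges to the leading eigenvector of the similarity matrix.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iterations):
        w = [sum(sim[i][k] * v[k] for k in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            raise ValueError("no variation between individuals")
        v = [x / norm for x in w]
    return v


# Hypothetical toy data: 6 individuals x 5 SNPs, two visibly different groups.
toy_genotypes = [
    [0, 0, 1, 2, 2],
    [0, 1, 0, 2, 2],
    [0, 0, 0, 2, 1],
    [2, 2, 1, 0, 0],
    [2, 1, 2, 0, 0],
    [2, 2, 2, 0, 1],
]
pc1 = first_pc(toy_genotypes)
for index, score in enumerate(pc1, start=1):
    print(f"individual {index}: PC1 = {score:+.3f}")
```

On this toy matrix the first three individuals receive PC1 scores of one sign and the last three of the opposite sign, which is the kind of separation a PCA scatter plot makes visible.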


Page 11: Workshop 2011

PCA analysis of individuals of East Asian descent

Illustration of the geographic correspondence of ethnic group locations

Groups labeled on the plot: Korea-Vietnam, Korea-Japan, Jeju, Kobe, JPT-HapMap, Yanbian, Mongol, CHB-HapMap, Vietnam, Cambodia


Page 12: Workshop 2011

Relationship between eigenvector values and latitude

R² = 0.8621, y = 36.65 + 166.33x

Values labeled on the plot: 37.53, 47.81, 39.98, 14.72
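For readers who want to see how a fitted line and R² of this kind are obtained, here is a minimal least-squares sketch. It assumes that x is the eigenvector value and y is latitude, which the slide title suggests but does not state outright. The helper names `fit_line` and `slide_line` and the sample points are hypothetical; only the coefficients 36.65 and 166.33 come from the slide.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x; returns (intercept, slope, r2)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1.0 - ss_res / ss_tot


def slide_line(x):
    """Fitted line reported on the slide (assumed: x = eigenvector value, y = latitude)."""
    return 36.65 + 166.33 * x


# Hypothetical, exactly linear points, used only to check the fitting code.
intercept, slope, r2 = fit_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
print(intercept, slope, r2)
print(slide_line(0.01))
```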


Page 13: Workshop 2011

STRUCTURE software

• A model-based clustering method (Pritchard et al. 2000)

• Free software (http://pritch.bsd.uchicago.edu/structure.html)

• Bayesian approach (MCMC: Markov chain Monte Carlo)

• Detects the underlying genetic populations among a set of individuals genotyped at multiple markers

• Computes the proportion of each individual's genome originating from each inferred population (quantitative clustering method)


Page 14: Workshop 2011

Input data

• A matrix in which individuals are in rows and loci are in columns

• For an n-ploid species, the data for each individual occupy n consecutive rows

• Genotypes should be coded as integers

• Missing data should be indicated by a number that does not occur elsewhere in the data (e.g. -1)

• The data file for running STRUCTURE should be a plain text file (.txt), not an Excel file (.xls)


Page 15: Workshop 2011

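Input format

Information on user-defined populations

Label: a unique ID for each individual; any combination of digits or letters is fine (e.g., CEPH1334.10)

PopID: a unique number for the population the individual belongs to, assigned by the user (e.g., 5 for Chinese (CHB), 1 for European (CEU))

Flag: whether the PopID information should be used when running STRUCTURE (1 = use, 0 = do not use)

Location: the sampling location of the individual, assigned by the user (e.g., 1 for East Asia (EAS), 2 for Europe (EURA))

Genotype coding (1, 2, 5): AA = 11, AB = 12, BB = 22, missing = 55

File layout: the first row holds the marker names (MarkerName...); each following row holds Label PopID Flag Location Genotype...

One row per individual (both alleles of a locus are on the same row)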
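The layout above can be sketched as a small converter from AA/AB/BB calls to one text row per individual. This is an illustration only: the function names, the marker names, and the second individual are hypothetical, and writing each genotype as two space-separated allele columns (so that "11" becomes `1 1`) is an assumption about how the shorthand on the slide maps onto the file.

```python
# Allele codes from the slide: AA = 11, AB = 12, BB = 22, missing = 55.
CODES = {"AA": (1, 1), "AB": (1, 2), "BB": (2, 2)}
MISSING = (5, 5)


def structure_row(label, pop_id, flag, location, genotypes):
    """Build one row: Label PopID Flag Location followed by two alleles per locus."""
    fields = [label, str(pop_id), str(flag), str(location)]
    for call in genotypes:
        first, second = CODES.get(call, MISSING)  # any unknown call is treated as missing
        fields += [str(first), str(second)]
    return " ".join(fields)


def structure_file(marker_names, individuals):
    """Marker names on the first row, then one row per individual."""
    lines = [" ".join(marker_names)]
    lines += [structure_row(*individual) for individual in individuals]
    return "\n".join(lines) + "\n"


# Hypothetical example: marker names and genotype calls are made up.
markers = ["snp_a", "snp_b", "snp_c"]
people = [
    ("CEPH1334.10", 1, 1, 2, ["AA", "AB", "NN"]),
    ("SAMPLE0002", 5, 1, 1, ["BB", "AB", "AA"]),
]
print(structure_file(markers, people), end="")
```

Writing the result to a `.txt` file with plain `open(..., "w")` keeps it in the text format the program expects.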


Page 16: Workshop 2011

Input format (cont.)


Page 17: Workshop 2011

Running STRUCTURE from a graphical interface, Front End


Page 18: Workshop 2011

Importing input data into a project


Page 19: Workshop 2011

Importing input data into a project (cont.)


Page 20: Workshop 2011

Importing input data into a project (cont.)


Page 21: Workshop 2011

Importing input data into a project (cont.)


Page 22: Workshop 2011

Importing input data into a project (cont.)


Page 23: Workshop 2011

Importing input data into a project (cont.)


Page 24: Workshop 2011

Importing input data into a project (cont.)


Page 25: Workshop 2011

Importing input data into a project (cont.)


Page 26: Workshop 2011

Configuring a parameter set


Page 27: Workshop 2011

Configuring a parameter set (cont.)

Length of Burnin Period: how long to run the simulation before collecting data, to minimize the effect of the starting configuration (that is, the number of iterations allowed for the chain to converge)

Number of MCMC Reps after Burnin: how long to run the simulation after the burn-in to get accurate parameter estimates
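To make the role of the burn-in concrete, here is a toy Metropolis sampler started far from its target distribution. It is not STRUCTURE's sampler: the standard normal target, the starting value, and the step size are assumptions chosen for illustration, and `BURNIN` and `NUMREPS` are used here only as plain Python variables.

```python
import math
import random


def metropolis_chain(start, n_steps, seed=0):
    """Random-walk Metropolis sampler whose target is a standard normal distribution."""
    rng = random.Random(seed)
    x = start
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.uniform(-1.0, 1.0)
        # Log of the target density ratio p(proposal) / p(x).
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        chain.append(x)
    return chain


BURNIN = 1000   # iterations discarded while the chain forgets its starting point
NUMREPS = 4000  # iterations kept for estimation

chain = metropolis_chain(start=50.0, n_steps=BURNIN + NUMREPS)
mean_all = sum(chain) / len(chain)
mean_after_burnin = sum(chain[BURNIN:]) / NUMREPS
print(f"mean using every iteration:    {mean_all:.3f}")
print(f"mean after discarding burn-in: {mean_after_burnin:.3f}")
```

Because the chain starts at 50 while the target is centered on 0, the estimate that includes the early iterations is pulled toward the starting point, whereas the post-burn-in estimate is not.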


Page 28: Workshop 2011

Configuring a parameter set (cont.)


Page 29: Workshop 2011

Configuring a parameter set (cont.)


Page 30: Workshop 2011

Configuring a parameter set (cont.)


Page 31: Workshop 2011

Running STRUCTURE: a single run


Page 32: Workshop 2011

Running STRUCTURE: a single run (cont.)


Page 33: Workshop 2011

Running STRUCTURE: a batch run


Page 34: Workshop 2011

Running STRUCTURE: a batch run (cont.)


Page 35: Workshop 2011

Ln P(D): estimated log probability of the data for each K


Page 36: Workshop 2011


Page 37: Workshop 2011

Analysis of genome-wide SNP data

• For very large data sets, running STRUCTURE with the default settings may be impractically slow

• Use reduced data sets (e.g., pruned)

• You may get accurate results using much shorter runs than the default (e.g., small values of NUMREPS)

• Download the source code and compile it on your own machine (a 64-bit machine)

• Use the command-line version of STRUCTURE
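One way to organise command-line runs is to generate one command per value of K and per replicate, as in the sketch below. The option letters (`-m`, `-e`, `-K`, `-L`, `-N`, `-i`, `-o`, `-D`) are given as I understand them from the STRUCTURE documentation and should be checked against the installed version. The file names, output prefix, and seed scheme are hypothetical. The commands are only built and printed here, not executed.

```python
def structure_commands(k_values, replicates, infile, n_inds, n_loci, out_prefix="run"):
    """Build one argument list per (K, replicate) pair for the command-line program."""
    commands = []
    for k in k_values:
        for rep in range(1, replicates + 1):
            commands.append([
                "structure",
                "-m", "mainparams",        # main parameter file (assumed file name)
                "-e", "extraparams",       # extra parameter file (assumed file name)
                "-K", str(k),              # number of populations assumed for this run
                "-L", str(n_loci),         # number of loci in the input file
                "-N", str(n_inds),         # number of individuals in the input file
                "-i", infile,
                "-o", f"{out_prefix}_K{k}_rep{rep}",
                "-D", str(1000 * k + rep), # a distinct random seed for each run
            ])
    return commands


# 153 individuals and 46,559 SNPs are the counts quoted earlier in the slides;
# the input file name is made up.
commands = structure_commands(range(1, 4), replicates=2, infile="korean.txt",
                              n_inds=153, n_loci=46559)
for command in commands:
    print(" ".join(command))
```

Each argument list could be handed to `subprocess.run` or to a cluster scheduler, so that several values of K run in parallel.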


Page 38: Workshop 2011

An example of MCMC convergence


Page 39: Workshop 2011

Inference of the true K (number of populations)

• The log likelihood for each K: Ln P(D) = L(K)

• Two approaches to determine the best K

• Use of L(K): once K reaches the true value, L(K) plateaus and its variance between runs increases

• Use of an ad hoc quantity (∆K): based on the second-order rate of change of the likelihood with respect to K. ∆K shows a clear peak at the true value of K
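The ∆K idea can be sketched in a few lines. The version below takes the mean of L(K) over replicate runs, forms the absolute second difference |L(K+1) - 2L(K) + L(K-1)|, and divides by the standard deviation of L(K) across runs, following the statistic proposed by Evanno et al. (2005). Using run means before differencing is a simplifying assumption, and every Ln P(D) value below is invented for illustration.

```python
from statistics import mean, stdev


def delta_k(lnpd_by_k):
    """Delta K for every K that has both neighbours K-1 and K+1 available."""
    ks = sorted(lnpd_by_k)
    avg = {k: mean(lnpd_by_k[k]) for k in ks}
    result = {}
    for k in ks:
        if k - 1 not in avg or k + 1 not in avg:
            continue  # the second difference needs both neighbours
        second_diff = abs(avg[k + 1] - 2 * avg[k] + avg[k - 1])
        # Note: stdev is zero if all replicate runs agree exactly, which would divide by zero.
        result[k] = second_diff / stdev(lnpd_by_k[k])
    return result


# Hypothetical Ln P(D) values from three replicate runs at each K.
runs = {
    1: [-5000.0, -5010.0, -4990.0],
    2: [-4500.0, -4505.0, -4495.0],
    3: [-4000.0, -4002.0, -3998.0],
    4: [-3990.0, -4030.0, -3950.0],
    5: [-3985.0, -4050.0, -3920.0],
}
dk = delta_k(runs)
best_k = max(dk, key=dk.get)
print(dk)
print("K with the largest Delta K:", best_k)
```

In these made-up numbers L(K) rises steeply up to K = 3 and then flattens while the spread between runs grows, so ∆K peaks at K = 3.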


Page 40: Workshop 2011


Page 41: Workshop 2011

Simulation Result

Q-matrix: the estimated membership of each individual in each subpopulation
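A Q-matrix can be read as one row of membership proportions per individual. The sketch below assigns each individual to the cluster with the largest proportion and leaves clearly admixed individuals unassigned. The 0.5 threshold, the function name, and all Q values are assumptions for illustration, not output from the runs shown on this slide.

```python
def assign_clusters(q_matrix, threshold=0.5):
    """Map each individual to its top cluster (1-based), or None if below threshold."""
    assignments = {}
    for label, proportions in q_matrix.items():
        top = max(range(len(proportions)), key=lambda k: proportions[k])
        assignments[label] = top + 1 if proportions[top] >= threshold else None
    return assignments


# Hypothetical Q-matrix for K = 3; each row sums to 1.
q = {
    "ind1": [0.90, 0.07, 0.03],
    "ind2": [0.10, 0.85, 0.05],
    "ind3": [0.40, 0.35, 0.25],  # admixed: no cluster reaches the threshold
}
for label, cluster in assign_clusters(q).items():
    print(label, cluster)
```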


Page 42: Workshop 2011

Simulation Result (cont.)


Page 43: Workshop 2011

Enjoy running STRUCTURE


Page 44: Workshop 2011

We may not always be able to know the TRUE value of K, but we should aim for the smallest value of K that captures the major structure in the data

Pritchard et al. (2000)
