suk-namgoong
Bioinformatics, Lecture 11
Bioinformatics
2014, second semester
Department of Life Systems Science
Hannam University
Lecture 11, 2014.11.11
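Syllabus
Week  Topic
Week 1  Overview and basic theory of bioinformatics
Week 2  Chuseok (no class)
Week 3  Principles of sequence analysis I
Week 4  Principles of sequence analysis II
Week 5  Protein structure and function prediction
Week 6  Genome sequencing and sequence assembly
Week 7  Midterm exam
Week 8  Next Generation Sequencing
Week 9  Personal genomics I
Week 10  Personal genomics II
Week 11  Gene expression profiling (transcriptomics)
Week 12  Metagenomics
Week 13  Recent research trends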
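Cells
Cell: the basic unit of life
Many different types of cells make up an organism
All cells derived from a single individual share the same genetic information (DNA)
Transcriptome diversity
What allows diverse cell types to arise from the same genome: the diversity of the transcriptome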
Different Cell
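These are all made from the same ingredients
But different mixing ratios produce foods with different properties.
Cell diversity arising from transcriptome diversity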
Snyder M et al. Genes Dev. 2010;24:423-431
Splicing Variants
Methods for measuring the amount of a specific RNA in a cell
Northern Blot
Labor Intensive
Low-throughput (fewer than 5-10 genes per experiment)
Real-time qRT-PCR
Relatively Easy
Mid-throughput (100-200 genes per experiment)
TaqMan probes, SYBR Green
Microarray: cDNA microarray, oligonucleotide microarray
PM: perfect match, identical to the gene sequence
MM: mismatch, identical except that the 13th base is changed, which reduces the rate of binding
Controls for experimental variation and nonspecific binding
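Limitations of microarrays for studying gene expression
1. The organism's genome must already be sequenced
2. A microarray must have been built from that genome sequence
3. Relies on hybridization
4. Background is high because homologous sequences exist within the genome
5. The dynamic range of detection is low
- Highly overexpressed genes may be measured as less expressed than they really are
- Genes expressed at very low levels may be measured as more expressed than they really are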
RNA Sequencing (RNA-Seq)
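mRNA has a poly-A tail.
RNA carrying a poly-A tail is converted to cDNA with reverse transcriptase
Adapters are ligated to build a library
cDNA sequencing
Mapping to the genome sequence
- Reads map to exon regions
- Reads can also map across a splice junction
- Measure how many reads correspond to the mRNA of a given gene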
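The last step above, counting how many reads fall on each gene, can be sketched as follows. This is a minimal illustration that assumes the reads are already mapped and each is described only by its start position; the gene names, exon coordinates, and the `assign_gene` and `count_reads` helpers are made up for this example.

```python
from collections import Counter

# Hypothetical gene models: name -> list of exon intervals (0-based, end-exclusive).
GENES = {
    "geneA": [(100, 200), (300, 400)],
    "geneB": [(1000, 1500)],
}

def assign_gene(read_start, read_len, genes=GENES):
    """Return the first gene with an exon overlapping the read, or None."""
    read_end = read_start + read_len
    for name, exons in genes.items():
        for exon_start, exon_end in exons:
            # Two half-open intervals overlap when each starts before the other ends.
            if read_start < exon_end and exon_start < read_end:
                return name
    return None

def count_reads(read_starts, read_len=50):
    """Count mapped reads per gene; reads outside all exons are ignored."""
    counts = Counter()
    for start in read_starts:
        gene = assign_gene(start, read_len)
        if gene is not None:
            counts[gene] += 1
    return counts

counts = count_reads([90, 120, 350, 1100, 1200, 1490, 5000])
print(dict(counts))
```

Raw counts like these are not yet comparable between genes or samples, because they depend on gene length and sequencing depth.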
Transcriptomics
1995 P. Brown et al. Gene expression profiling using spotted cDNA microarray: expression levels of known genes
2002 Affymetrix, whole genome expression profiling using tiling array: identifying and profiling novel genes and splicing variants
2008 many groups, mRNA-seq: direct sequencing of mRNAs using next generation sequencing techniques (NGS)
Hybridization-based
http://darwin.informatics.indiana.edu/col/courses/I519-12/Lecture/RNASeq.ppt
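Comparison of RNA-Seq with other transcriptomics tools
RNA purification
RNA purification: obtaining undegraded RNA decides whether the experiment succeeds
RNA quality check: Agilent 2100 BioAnalyzer
RNA quantification (Qubit); NanoDrop is considered too inaccurate
Illumina TruSeq library preparation workflow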
Library Construction
Removal of ribosomal RNA (95% of total RNA)
polyA selection (for mRNA)
High-quality strand information
Can be used with low-quality/low-abundance RNA (10-100 ng)
48 barcodes allow multiplexing
Small RNAs can be sequenced directly
Large RNAs must be fragmented
Library Construction can be done at cost at Gonda Genomics Core
$1300/8 samples
Joseph deYoung
Sequencing Apparatus
Experimental Design: Single End (SR) vs Paired End (PE)
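Single Read: the cDNA is sequenced in one direction only
Paired End: each cDNA is sequenced from both ends
SR: useful for simply measuring expression levels or for SNP discovery; not suitable for discovering new splicing isoforms
PE: useful for discovering new transcripts or isoforms
Replicates
• Technical replicate
– Multiple datasets obtained from the same sample
• Biological replicate
o Replicates using different samples under the same conditions
o Some example concerns/challenges: environmental factors, growth conditions, time
o Correlation coefficient 0.92-0.98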
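A correlation coefficient between two replicates can be computed as in the sketch below. The expression values are invented, and the log2(x + 1) transform is an assumption of this example (a common choice, not something the slides specify).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up expression values for the same five genes in two replicates.
rep1 = [10.0, 200.0, 35.0, 0.5, 80.0]
rep2 = [12.0, 180.0, 40.0, 0.7, 95.0]

# Log-transform so that a few highly expressed genes do not dominate.
log_rep1 = [math.log2(v + 1) for v in rep1]
log_rep2 = [math.log2(v + 1) for v in rep2]

r = pearson(log_rep1, log_rep2)
print(f"replicate correlation: {r:.3f}")
```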
Bowtie/TopHat/Cufflinks/Cuffdiff
RNA-seq Pipeline
Inputs: raw sequence data (.fastq files), reference genome (.fa file), gene annotation (.gtf file)
Sequencing: RNA-seq reads (2 x 100 bp)
Read alignment: Bowtie/TopHat alignment (genome)
Transcript compilation: Cufflinks
Gene identification: Cufflinks (cuffmerge)
Differential expression: Cuffdiff (A:B comparison)
Visualization: CummeRbund
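The pipeline steps can be written out as command lines. The sketch below only builds and prints the commands, it does not run them. All file names, sample names, and the two helper functions are hypothetical; the options follow the commonly published Tuxedo protocol, so check them against the manuals of the installed versions before use.

```python
import shlex

def sample_commands(sample, bowtie_index, gtf, fq1, fq2, threads=4):
    """TopHat alignment and Cufflinks assembly for one paired-end sample."""
    tophat_out = f"{sample}_thout"
    cufflinks_out = f"{sample}_clout"
    return [
        ["tophat", "-p", str(threads), "-G", gtf, "-o", tophat_out,
         bowtie_index, fq1, fq2],
        ["cufflinks", "-p", str(threads), "-o", cufflinks_out,
         f"{tophat_out}/accepted_hits.bam"],
    ]

def comparison_commands(genome_fa, gtf, labels, bams_per_condition, threads=4):
    """Merge the assemblies, then compare conditions with Cuffdiff."""
    # Replicate BAM files of one condition are joined with commas.
    bam_args = [",".join(bams) for bams in bams_per_condition]
    return [
        # assemblies.txt is assumed to list each sample's transcripts.gtf.
        ["cuffmerge", "-g", gtf, "-s", genome_fa, "-p", str(threads),
         "assemblies.txt"],
        ["cuffdiff", "-o", "diff_out", "-p", str(threads),
         "-L", ",".join(labels), "merged_asm/merged.gtf", *bam_args],
    ]

commands = sample_commands("A_rep1", "genome", "genes.gtf",
                           "A_rep1_1.fastq", "A_rep1_2.fastq")
commands += comparison_commands(
    "genome.fa", "genes.gtf", ["A", "B"],
    [["A_rep1_thout/accepted_hits.bam"], ["B_rep1_thout/accepted_hits.bam"]],
)
for cmd in commands:
    print(shlex.join(cmd))
```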
My RNA Seq Workflow
Data QC
Data Trimming
Low-quality data are cut away
Before After
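Trimming can be illustrated with a very simple rule: drop bases from the 3' end of a read while their quality is below a threshold. The sketch assumes Phred+33 encoded quality strings and a cutoff of Q20; both are assumptions of this example, and real trimming tools apply more elaborate rules such as sliding windows.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]

def trim_3prime(seq, qual, min_q=20):
    """Remove trailing bases whose quality is below min_q."""
    scores = phred_scores(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

# 'I' encodes Q40, '#' encodes Q2, '!' encodes Q0 under Phred+33.
seq = "ACGTACGTAC"
qual = "IIIIIIII#!"
trimmed_seq, trimmed_qual = trim_3prime(seq, qual)
print(trimmed_seq, trimmed_qual)
```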
RNA-seq alignment challenges
• Computational cost
– 100's of millions of reads
• Introns!
– Spliced vs. unspliced alignments
• Can I just align my data once using one approach and be done with it?
– Unfortunately probably not
• Is TopHat the only mapper to consider for RNA-seq data?
– http://www.biostars.org/p/60478/
Three RNA-seq mapping strategies
Diagrams from Cloonan & Grimmond, Nature Methods 2010
• De novo assembly
• Align to transcriptome
• Align to reference genome
Which alignment strategy is best?
• De novo assembly
– If a reference genome does not exist for the species being studied
– If complex polymorphisms/mutations/haplotypes might be missed by comparing to the reference genome
• Align to transcriptome
– If you have short reads (< 50bp)
• Align to reference genome
– All other cases
• Each strategy involves different alignment/assembly tools
Which read aligner should I use?
http://wwwdev.ebi.ac.uk/fg/hts_mappers/
RNA, Bisulfite, DNA, microRNA
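Splice-aware mapping
• RNA-seq reads can span exon-exon boundaries.
• Such reads contain no intron sequence, so when mapped to the genome they align to two or more separate locations.
• Unless the reads are short (50 bp or less), use a splice-aware mapper for RNA-seq data.
– TopHat, STAR, MapSplice, etc.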
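Why a spliced read needs special handling can be seen on a toy example. The sketch below uses an invented genome (two exons around one intron) and a brute-force split search. It is not how TopHat or STAR work internally: real aligners use indexed search and splice-site signals to pick the junction.

```python
def find_spliced(genome, read, min_anchor=4):
    """Map a read either contiguously or as two pieces separated by a gap.

    Returns ("unspliced", pos), ("spliced", left_pos, split, right_pos), or None.
    """
    pos = genome.find(read)
    if pos != -1:
        return ("unspliced", pos)
    # Try every split point that leaves at least min_anchor bases on each side.
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_pos = genome.find(left)
        while left_pos != -1:
            # The right piece must start after a gap (the putative intron).
            right_pos = genome.find(right, left_pos + split + 1)
            if right_pos != -1:
                return ("spliced", left_pos, split, right_pos)
            left_pos = genome.find(left, left_pos + 1)
    return None

# Invented sequences: exon 1, an intron with GT...AG ends, exon 2.
exon1 = "ATGGCCATTC"
intron = "GTAAGT" + "T" * 8 + "CAG"
exon2 = "TACCGGATCC"
genome = exon1 + intron + exon2

# A read from the mature mRNA that crosses the exon-exon junction.
read = exon1[-6:] + exon2[:6]
result = find_spliced(genome, read)
print(result)
```

The gap between the two pieces (here 17 bases) corresponds to the intron that is absent from the read.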
Bowtie/TopHat
• TopHat: a splice-aware aligner dedicated to RNA-Seq
• Requires a reference genome
• Splits each read into several pieces and aligns them with Bowtie
• Extends the aligned pieces to build seeds
Trapnell et al. 2009
Output of Bowtie/TopHat
• A SAM/BAM file
– SAM stands for Sequence Alignment/Map format
– BAM is the binary version of a SAM file
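A SAM record is a tab-separated line, and a spliced alignment shows up as an `N` operation in its CIGAR string. The sketch below parses one invented record and extracts the implied intron. The helper names are hypothetical, and in practice a library such as pysam or the samtools command line would normally be used instead.

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_sam_line(line):
    """Extract the main mandatory fields of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    return {
        "qname": fields[0],
        "rname": fields[2],
        "pos": int(fields[3]),           # 1-based leftmost mapping position
        "mapq": int(fields[4]),
        "cigar": fields[5],
        "seq": fields[9],
        "is_reverse": bool(flag & 0x10)  # FLAG bit 0x10: read on reverse strand
    }

def introns_from_cigar(pos, cigar):
    """Return 1-based inclusive (start, end) intervals for each N operation."""
    introns = []
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            introns.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume the reference
            ref += length
    return introns

# Invented record: 6 matched bases, a 17-base skipped region, 6 matched bases.
line = "read1\t0\tchr1\t101\t50\t6M17N6M\t*\t0\t0\tCCATTCTACCGG\tIIIIIIIIIIII"
record = parse_sam_line(line)
print(record["qname"], introns_from_cigar(record["pos"], record["cigar"]))
```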
Visualization with the IGV browser
Figure labels: ideogram, control pop-up info, gene track, reads track, coverage track, single reads (not spliced), single reads (spliced), coverage scale, viewer position, coverage pileup, +ve strand, -ve strand
http://www.broadinstitute.org/igv/
Cufflinks: reconstructs and quantifies transcripts from the mapping results
INPUT
.bam file (accepted hits)
Reference (.gtf): RefSeq, Ensembl, etc.
Output (tabular form, Excel)
Expression quantified as FPKM
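Quantifying RNA-seq results
RPKM: Reads Per Kilobase of transcript per Million mapped reads (Mortazavi et al. Nature Methods 2008)
If we simply count the reads mapped to each gene, longer genes receive more mapped reads.
Deeper sequencing also produces more mapped reads.
The results therefore need to be normalized by gene length and sequencing depth.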
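With C reads mapped to a gene of length L bp, out of N mapped reads in total, RPKM = 10^9 × C / (N × L). A minimal sketch with invented numbers shows how this removes the effects of gene length and sequencing depth:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Invented numbers: (reads on gene, gene length in bp, total mapped reads).
short_gene = rpkm(500, 2000, 10_000_000)
long_gene = rpkm(1000, 4000, 10_000_000)      # twice as long, twice the reads
deeper_run = rpkm(1000, 2000, 20_000_000)     # twice the depth, twice the reads

print(short_gene, long_gene, deeper_run)
```

All three cases give the same RPKM, although the raw read counts differ by a factor of two.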
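What can RNA-seq tell us?
- Differences in RNA expression between tissues (or treatments)
(Which RNAs are expressed, and at what levels, in sample A versus sample B?)
- Differences in RNA splicing between tissues
(Which splicing isoforms are present in sample A versus sample B?)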
Differential Expression
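* RNA expression levels from the different samples have been quantified.
* Which genes, then, are expressed differently between these samples?
DEG: differentially expressed gene
Search for genes whose expression differs between the two samples with statistical significance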
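The idea behind a DEG search can be sketched with a log2 fold change and a Welch t-statistic per gene. This is only an illustration: the expression values and both cutoffs are invented, no p-value is computed, and tools such as Cuffdiff use their own statistical models together with multiple-testing correction.

```python
import math
from statistics import mean, variance

def log2_fold_change(a, b, pseudocount=1.0):
    """log2 of mean(b) over mean(a); the pseudocount avoids division by zero."""
    return math.log2((mean(b) + pseudocount) / (mean(a) + pseudocount))

def welch_t(a, b):
    """Welch's t-statistic for two groups of replicates (unequal variances)."""
    diff = mean(b) - mean(a)
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    if se == 0:
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / se

# Invented expression values: three replicates each for samples A and B.
expression = {
    "gene1": ([10.0, 12.0, 11.0], [40.0, 44.0, 48.0]),
    "gene2": ([100.0, 90.0, 110.0], [95.0, 105.0, 100.0]),
}

# Arbitrary illustrative cutoffs: at least 2-fold change and |t| >= 4.
candidates = []
for gene, (a, b) in expression.items():
    lfc = log2_fold_change(a, b)
    t = welch_t(a, b)
    print(f"{gene}: log2FC={lfc:.2f}, t={t:.2f}")
    if abs(lfc) >= 1.0 and abs(t) >= 4.0:
        candidates.append(gene)

print("candidate DEGs:", candidates)
```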
Hierarchical clustering
Groups genes whose expression changes in the same pattern across multiple conditions
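A small agglomerative clustering example makes the grouping concrete. The sketch assumes average linkage and a distance of 1 minus the Pearson correlation, which are common choices but not stated in the slides; the expression profiles are invented, and a real analysis would typically use a library implementation.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_distance(profiles, cluster_a, cluster_b):
    """Average-linkage distance between two clusters of gene names."""
    dists = [1 - pearson(profiles[a], profiles[b])
             for a in cluster_a for b in cluster_b]
    return sum(dists) / len(dists)

def hierarchical_clusters(profiles, k):
    """Merge the two closest clusters repeatedly until k clusters remain."""
    clusters = [[name] for name in profiles]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(profiles, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Invented expression profiles across four conditions.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.5],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [8.0, 6.0, 4.5, 2.0],
}
clusters = hierarchical_clusters(profiles, k=2)
print(clusters)
```

Genes g1 and g2 rise together across the conditions and g3 and g4 fall together, so they end up in separate groups.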
Alternative Splicing
[Griffith and Marra 07]
Tutorial using Galaxy
https://usegalaxy.org/u/jeremy/p/galaxy-rna-seq-analysis-exercise
Quality Check
Mapping on Genome Sequence Using Tophat
Visualize
Cufflinks
Mapping the mRNA sequences
CuffDiff
RNA Sequence Data
Import in Galaxy
Quality Check using FastQC
http://www.slideshare.net/hongiiv/galaxy-rnaseq-analysis-tuxedo-protocol
Mapping with Tophat
http://www.slideshare.net/hongiiv/galaxy-rnaseq-analysis-tuxedo-protocol
Cuffdiff : Differential Expression
Other applications of mRNA-seq: gene fusion
After the short mRNA reads are aligned to a reference genome, most reads fall within a single exon, and a smaller but still large set is expected to map to known exon-exon junctions. The remaining unmapped short reads can then be analyzed further to determine whether they match an exon-exon junction where the exons come from different genes. An alternative approach uses paired-end reads: a potentially large number of read pairs will have each end mapped to a different exon, giving better coverage of these events. Either way, the end result is a set of multiple and potentially novel combinations of genes, an ideal starting point for further validation.
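The paired-end approach can be sketched as counting read pairs whose two ends map to different genes. The input format (one gene name per mate, `None` for an unmapped mate), the read pairs themselves, and the `min_support` threshold are assumptions made for this example; dedicated fusion callers apply many additional filters.

```python
from collections import Counter

def fusion_candidates(pairs, min_support=2):
    """Count read pairs whose mates map to different genes.

    pairs: iterable of (gene_of_mate1, gene_of_mate2); None means unmapped.
    Returns {(geneX, geneY): supporting_pairs} for pairs of genes with at
    least min_support supporting read pairs.
    """
    counts = Counter()
    for gene1, gene2 in pairs:
        if gene1 is None or gene2 is None or gene1 == gene2:
            continue
        # Sort so that (A, B) and (B, A) are counted as the same event.
        counts[tuple(sorted((gene1, gene2)))] += 1
    return {genes: n for genes, n in counts.items() if n >= min_support}

# Invented read pairs for illustration only.
pairs = [
    ("BCR", "ABL1"),
    ("ABL1", "BCR"),
    ("BCR", "ABL1"),
    ("TP53", "TP53"),   # both mates in the same gene: not a fusion signal
    ("GAPDH", "ACTB"),  # a single discordant pair: below min_support
    ("MYC", None),      # one mate unmapped: ignored
]
candidates = fusion_candidates(pairs)
print(candidates)
```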
Acknowledgement: Wiki – mRNA-seq
http://www.biomedcentral.com/1755-8794/4/75/figure/F2?highres=y