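Hyungyong Kim (김형용), Kyuyeol Lee (이규열), Sungchan Lee (이성찬) | 2013.02.05 ~ 2013.02.06 | R&D Center, Insilicogen, Inc.
NGS Analysis using Galaxy: 2013 Korea Genome Organization Winter Symposium, Bioinformatics Analysis Training Workshop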
Index: NGS Analysis using Galaxy
01 Galaxy introduction
02 Galaxy examples 1,2
03 Galaxy installation
04 Galaxy function details
05 Galaxy examples 3,4
06 Galaxy tools
07 Galaxy on Grid
08 Galaxy on Cloud
Agenda
3 Copyrightⓒ Insilicogen,Inc. 2011. All rights reserved.
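Part 1: Introduction and Application
15:00 ~ 15:20  Galaxy introduction (presenter: Hyungyong Kim)
15:20 ~ 15:50  Galaxy analysis example demo
    1. Finding the human exons with the most SNPs
    2. NGS QC and assembly example
16:00 ~ 16:20  Galaxy installation (presenter: Sungchan Lee)
16:20 ~ 17:10  Galaxy installation and analysis example hands-on
    1. Galaxy installation hands-on
    2. Finding the human exons with the most SNPs hands-on
    3. NGS QC and assembly example hands-on
17:20 ~ 17:50  Detailed Galaxy functions (presenter: Hyungyong Kim)
Part 2: Custom operation
09:00 ~ 09:20  Galaxy analysis example demo (presenter: Hyungyong Kim)
    1. RNA-seq analysis example
    2. NGS analysis example 2
09:20 ~ 09:50  Galaxy analysis example hands-on
    1. RNA-seq analysis example
    2. NGS analysis example 2
10:00 ~ 10:20  Understanding Galaxy tools (presenter: Hyungyong Kim)
10:20 ~ 11:00  Galaxy tool writing hands-on
    1. Primer design
11:10 ~ 11:30  Galaxy on Grid (presenter: Kyuyeol Lee)
    1. Understanding the grid
    2. Distributed job demo
11:30 ~ 11:50  Galaxy on Cloud (presenter: Hyungyong Kim)
    1. Understanding the cloud
    2. Galaxy on Amazon EC2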
NGS Technologies
Sequencer Comparison
Platform   Instrument          Gb/day   Yield
Illumina   HiSeq 2000          55       600 Gb
Illumina   HiSeq 1000          35       300 Gb
Illumina   HiScan SQ           17.5     150 Gb
Illumina   GAIIx               6.5      95 Gb
454        GS FLX              10h      35 Mb
SOLiD      5500 microbeads     10-15    90 Gb
SOLiD      5500xl microbeads   20-30    180 Gb
SOLiD      5500xl nanobeads    30-45    300 Gb
Read length
• Illumina: 2X100 bp, 2X150 bp
• 454: 400 bp
• SOLiD: Mate pair 60 bp X 60 bp; Paired-end 75 bp X 35 bp; Fragment 75 bp
Required input (Illumina)
• 50 ng with Nextera
• 100 ng – 1 μg with TruSeq
Accuracy
• Illumina: 85% (2X50 bp, >Q30), 80% (2X100 bp, >Q30)
• 454: 99% (>Q20)
• SOLiD: 99.99%
Notes
• The Illumina Gb/day figures are from 2X100 bp runs.
• Illumina read lengths: 1X35, 2X50, 2X100; GA: 1X35, 2X50, 2X100, 2X150
Applications
Interaction of DNA and Protein
Mutation Detection
Structure Variation
Transcriptional Control
Personal Genomics
Environmentology
Microbiology Toxicology
Chemical Biology
Application of NGS Technique
Issue of New Genomic Era.
Many researchers, having invested in next generation sequencing instruments, now face a computational bottleneck in their research work-flow.
BGI
Most Significant Improvement to Your Next Generation Sequencing Workflow
(Source: The Global Outlook for Next Generation Sequencing: Usage, Platform Drivers & Workflow, October 31, 2011. BioInformatics, LLC)
Issue of New Genomic Era.
Library construction
Template purification
Sequence delineation
Finishing & Assembly
Sequence annotation
Secondary annotation
Data delivery
• DNA shearing • Insert into high and/or low copy number vectors
• Primer walking • Transposon insertion methods • Proprietary & commercial assembly
• PCR Amplicons • BACs • Cosmids/Fosmids
• Big Dye • ABI 3730 • Data compilation
• Gene prediction • BLAST search
• FTP • Web browser • Commercial software
• SNP • Comparative genomics • Expression analysis
Cost
Process
Bioinformatics
Application of Next Genomic Data
Practical Software Platforms for NGS data analysis
What kind of?
• Biological Features
• Framework (Enterprise/Informatics) Features
• Service
• Price
List of NGS Frameworks
HugeSeq: a pipeline specialized for genetic variant detection
CLC Genomics Server: provides a user-friendly GUI environment
1. CLC Genomics Server
   - Enterprise solution with a 3-tier system architecture for data analysis, sharing, and management
2. CLC Bioinformatics Database
   - Database for centralized storage, sharing, and management of data
3. CLC Assembly Cell
   - Ultra-fast assembly analysis solution for NGS data (command-line based)
4. CLC Genomics Workbench
   - Solution offering a wide range of bioinformatics analyses for NGS data (GUI based)
5. CLC Developer Kit
   - Solution for customizing bioinformatics analysis tools and workflows to the user's needs
30x human genome, 1 sample (150G): 5 million KRW (1 year of storage)
Funded by Google; integrated with the NCBI SRA service; analysis can start online right away without running an experiment
GALAXY
What is Galaxy
Galaxy, a web-based genome analysis platform
http://usegalaxy.org
• An open-source framework for integrating various computational tools and databases into a cohesive workspace
• A web-based service we provide, integrating many popular tools and resources for comparative genomics
• A completely self-contained application for building your own Galaxy style sites
Galaxy Usage
• One of the fastest growing open source bioinformatics projects: a highly successful high throughput data analysis platform for Life Sciences with over 15,000 users worldwide
• Annual Galaxy Community Conference
Galaxy visualization
External Genome Browser
UCSC
Ensembl
GBrowse
Trackster
Track/data viewer in web browser
HTML5 Canvas, jQuery
Renders in browser, not on server
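Galaxy components
Main Galaxy components
• Datasources: specify the input data. Data from a separate local system or from an external website can be registered.
• Tool: the smallest unit of analysis. On a local installation you can write and add the tools you want.
• History: the list of intermediate results produced as input data passes through a combination of Tools.
• Workflow: a History yields new results when only the input data and parameters are changed. This process is registered separately as a Workflow.
• Visualization: connects analysis results to visualization tools.
• Page: a report-writing feature that brings together the elements above.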
Example: an Eprimer3 tool written separately and registered in Galaxy
What a Galaxy tool is
Input format -> Tool -> Output format
A tool takes input data (in the expected format), processes it, and produces output data (in the expected format).
Combining tools yields a Workflow.
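The idea that a tool maps one format to another, and that chained tools form a workflow, can be sketched in a few lines of Python. This is only an illustration of the concept, not Galaxy's implementation; the `Tool` class, `run_workflow`, and the two toy tools are hypothetical names invented for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tool:
    """Hypothetical stand-in for a tool: a function plus its declared formats."""
    name: str
    input_format: str
    output_format: str
    run: Callable[[str], str]


def run_workflow(tools: List[Tool], data: str, data_format: str) -> Tuple[str, str]:
    """Run tools in order, checking that each tool accepts the current format."""
    for tool in tools:
        if tool.input_format != data_format:
            raise ValueError(
                f"{tool.name} expects {tool.input_format}, got {data_format}"
            )
        data = tool.run(data)
        data_format = tool.output_format
    return data, data_format


def fasta_to_tabular(text: str) -> str:
    """Toy tool: turn FASTA records into 'id<TAB>sequence' rows."""
    rows, name, seq = [], None, []
    for line in text.splitlines():
        if line.startswith(">"):
            if name is not None:
                rows.append(f"{name}\t{''.join(seq)}")
            name, seq = line[1:].split()[0], []
        elif line.strip():
            seq.append(line.strip())
    if name is not None:
        rows.append(f"{name}\t{''.join(seq)}")
    return "\n".join(rows)


def sort_by_length(text: str) -> str:
    """Toy tool: sort tabular rows by sequence length, longest first."""
    rows = [row.split("\t") for row in text.splitlines() if row]
    rows.sort(key=lambda row: len(row[1]), reverse=True)
    return "\n".join("\t".join(row) for row in rows)


workflow = [
    Tool("fasta_to_tabular", "fasta", "tabular", fasta_to_tabular),
    Tool("sort_by_length", "tabular", "tabular", sort_by_length),
]
```

Calling `run_workflow(workflow, fasta_text, "fasta")` runs both toy tools in sequence; swapping the input text re-runs the same analysis on new data, which is the point of registering a History as a Workflow.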
Galaxy formats
Auto-detect: Automatically detects the format of the data.
Ab1: A binary sequence file in 'ab1' format with a '.ab1' file extension. You must manually select this 'File Format' when uploading the file.
Axt: blastz pairwise alignment format. Each alignment block in an axt file contains three lines: a summary line and 2 sequence lines. Blocks are separated from one another by blank lines. The summary line contains chromosomal position and size information about the alignment. It consists of 9 required fields.
Bam: A binary file compressed in the BGZF format with a '.bam' file extension.
Bed: Tab delimited format (tabular). Does not require a header line.
Fasta: A sequence in FASTA format consists of a single-line description, followed by lines of sequence data. The first character of the description line is a greater-than (">") symbol in the first column. All lines should be shorter than 80 characters.
FastqSolexa: Illumina (Solexa) variant of the Fastq format, which stores sequences and quality scores in a single file.
Gff: GFF lines have nine required fields that must be tab-separated.
Gff3: The GFF3 format addresses the most common extensions to GFF, while preserving backward compatibility with previous formats.
Interval (Genomic Intervals): Tab delimited format (tabular).
Lav: Lav is the primary output format for BLASTZ. The first line of a .lav file begins with #:lav.
MAF: TBA and multiz multiple alignment format. The first line of a .maf file begins with ##maf. This word is followed by white-space-separated "variable=value" pairs. There should be no white space surrounding the "=".
Scf: A binary sequence file in 'scf' format with a '.scf' file extension. You must manually select this 'File Format' when uploading the file.
Sff: A binary file in 'Standard Flowgram Format' with a '.sff' file extension.
Tabular (tab delimited): Any data in tab delimited format (tabular).
Wig: The wiggle format is line-oriented. Wiggle data is preceded by a track definition line, which adds a number of options for controlling the default display of this track.
Other text type: Any text file.
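As a rough illustration of what format auto-detection involves, the sketch below guesses a format from the first lines of a text file, using only the markers listed above (">" for FASTA, "##maf" for MAF, "#:lav" for Lav, tabs for tabular). The function name `sniff_format` and the order of the checks are assumptions made for this example; Galaxy's own detection covers many more formats and is more careful.

```python
def sniff_format(text: str) -> str:
    """Guess a format name from the content of a text file (toy heuristic)."""
    lines = [line for line in text.splitlines() if line.strip()]
    if not lines:
        return "txt"
    first = lines[0]
    if first.startswith("##maf"):
        return "maf"      # first line of a .maf file begins with ##maf
    if first.startswith("#:lav"):
        return "lav"      # first line of a .lav file begins with #:lav
    if first.startswith(">"):
        return "fasta"    # description line starts with ">"
    if all("\t" in line for line in lines):
        return "tabular"  # every non-empty line is tab delimited
    return "txt"          # fallback: any other text
```

Binary formats such as Ab1, Scf, and Bam cannot be recognised this way, which is consistent with the note above that some of them must be selected manually at upload time.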
Galaxy features, once more
Uses the Amazon Cloud
Built-in NGS analysis functions; Galaxy URLs are provided in papers
Transparent analysis
Biologist
Recent trends in Galaxy usage
Written in Python, but extensions do not have to be written in Python
"Transparent" analysis flows can be built, shared, and extended.
Almost any bioinformatics analysis can be done with Galaxy.
"We would hire someone just for being good at Galaxy" (NCBI)
…
Bioinformatician
GALAXY Examples 1
Example 1.
Finding Human Exons with the highest number of SNPs
1. Download all Human Exons from NCBI or Ensembl BioMart or UCSC TableBrowser
2. Download all Human SNPs from …
3. Scripting
   • Join 1, 2 according to position
   • Group by Exon id
   • Sort by SNP count
   • Filter exons that have more than 10 SNPs
Have to do programming! (Python, Perl, …)
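For comparison with the Galaxy version, the scripting step above (join by position, group by exon, sort, filter) might look like the following minimal Python sketch. It assumes exons are given as `(chrom, start, end, exon_id)` tuples with 0-based half-open coordinates and SNPs as `(chrom, position)` tuples; these layouts and the function names are illustrative assumptions, and the nested loop is only practical for small inputs.

```python
from collections import Counter


def count_snps_per_exon(exons, snps):
    """Join exons and SNPs by position and count SNPs per exon id."""
    counts = Counter()
    for chrom, start, end, exon_id in exons:
        for snp_chrom, pos in snps:
            # A SNP belongs to an exon when it lies inside [start, end).
            if snp_chrom == chrom and start <= pos < end:
                counts[exon_id] += 1
    return counts


def top_exons(exons, snps, more_than=10, limit=5):
    """Sort exons by SNP count (descending) and keep those above a threshold."""
    counts = count_snps_per_exon(exons, snps)
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(exon_id, n) for exon_id, n in ranked if n > more_than][:limit]
```

With real genome-wide data an interval index or a sorted sweep would replace the nested loop; Galaxy's interval Join tool plays that role in the workflow on the following slides.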
On Galaxy
http://usegalaxy.org
On Galaxy
1. Get data > UCSC main: fetch the exon data
2. Get data > UCSC main: fetch the SNP data
3. Operate on Genomic Intervals > Join: extract the exons whose regions overlap SNPs
4. Join, Subtract and Group > Group: group by exon name and count the SNPs
5. Filter and Sort > Sort: sort the exons by SNP count
6. Text Manipulation > Select first: extract the top 5 exons with the most SNPs
7. Join, Subtract and Group > Compare two Datasets: recover the exon information lost in the earlier steps
GALAXY Examples 2
Example 2.
Human NGS data QC and assembly
1. NGS Quality Control
2. NGS Single End Mapping
3. SNP Calling
4. Compare with dbSNP
Have to work in Unix and need programming! (Python, Perl, …)
On Galaxy
http://usegalaxy.org
On Galaxy
Additional programs must be installed for NGS analysis
( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )
Program         Used for             Installation
Fastx-toolkit   NGS QC               Ubuntu apt-get
Gnuplot         NGS QC boxplot       Ubuntu apt-get
Bowtie2         Reference assembly   Copy, then set PATH
SAMTools        SNP calling          Ubuntu apt-get
On Galaxy
1. Get data > Upload File: upload the human Illumina fastq file
2. NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
3. NGS: QC and manipulation > Compute quality statistics: view fastq quality statistics
4. NGS: QC and manipulation > Draw quality score boxplot: draw a boxplot from the fastq quality statistics
5. NGS: QC and manipulation > FASTQ Trimmer, Quality Trimmer, Masker: trim or mask the uninformative parts
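To make the "quality statistics" step concrete, here is a small sketch of how per-position quality summaries can be computed from FASTQ text. It assumes Sanger-style (Phred+33) quality encoding, i.e. data that has already been through a groomer, and four-line records; the function names are hypothetical and this is not the code of the Galaxy tool itself.

```python
from statistics import mean, median


def parse_fastq(text):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    lines = [line for line in text.splitlines() if line]
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = lines[i:i + 4]
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield header[1:], seq, qual


def per_position_quality(text, offset=33):
    """Summarise Phred scores at each read position (assumes Phred+offset)."""
    columns = []
    for _name, _seq, qual in parse_fastq(text):
        for pos, char in enumerate(qual):
            if pos == len(columns):
                columns.append([])
            columns[pos].append(ord(char) - offset)
    return [
        {
            "position": i + 1,
            "min": min(scores),
            "max": max(scores),
            "mean": mean(scores),
            "median": median(scores),
        }
        for i, scores in enumerate(columns)
    ]
```

A boxplot of these per-position summaries is what shows where read quality drops off, which in turn suggests where to trim.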
On Galaxy
1. Get data > Upload File: provide the reference sequence for reference assembly
2. NGS: Mapping > Bowtie2: assembly (mapping) with Bowtie2
3. NGS: SAM Tools > MPileup: extract SNP and indel information from the BAM file
4. NGS: SAM Tools > Filter pileup: keep the high-scoring SNPs and indels
5. NGS: SAM Tools > Pileup-to-interval: convert to genomic interval format
6. Get data > UCSC Main: fetch dbSNP information
7. Operate on Genomic Intervals > Join: extract the SNPs whose regions overlap
Galaxy Installation
Install Virtualbox - Ubuntu
1. Copy the Virtualbox and Galaxy folders from the USB drive.
2. Install Virtualbox.
3. Start Virtualbox and import the Galaxy image.
4. In the settings, change the network to Bridged.
5. After starting Ubuntu, delete the network configuration file.
   $ sudo rm /etc/udev/rules.d/70-persistent-net.rules
6. Restart Linux (Ubuntu).
   $ sudo shutdown -h now
Creating your own Galaxy
Running Galaxy in a production environment
By default, Galaxy uses
• SQLite database
• Built-in HTTP server for all tasks
• Local job runner
• Single process
• Simplest error-proof configuration
Change configuration for service
• Disable the developer settings: use_interactive = False, debug = False
• Get a real database: PostgreSQL
• Offload the menial tasks: proxy (nginx, Apache)
• Let your tools free: cluster. Move intensive processing to other hosts (TORQUE, GRID, DRMAA)
• Other advanced settings
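As a hedged sketch of the first two changes, the snippet below rewrites an INI-style Galaxy configuration so that the developer settings are off and the database points at PostgreSQL. The section name `app:main`, the option names `debug`, `use_interactive`, and `database_connection`, and the connection URL are assumptions for illustration; check them against the sample configuration shipped with your Galaxy version before relying on them.

```python
import configparser
import io

# Assumed option names and a placeholder connection URL (not real credentials).
PRODUCTION_OVERRIDES = {
    "debug": "False",
    "use_interactive": "False",
    "database_connection": "postgresql://galaxy_user:change_me@localhost/galaxy_db",
}


def apply_production_settings(ini_text: str, section: str = "app:main") -> str:
    """Return the INI text with the production overrides applied."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(ini_text)
    if not parser.has_section(section):
        parser.add_section(section)
    for option, value in PRODUCTION_OVERRIDES.items():
        parser.set(section, option, value)
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()
```

Note that `configparser` drops comments when it writes the file back, so for a real server it is usually simpler to edit the configuration by hand; the sketch only shows which settings are involved.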
Galaxy on Cluster
Intensive processes to other hosts
TORQUE
GRID
DRMAA
Working with Galaxy on the Cloud
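Virtualization
• Refers to the abstraction of computer resources
• Creates virtual versions of physical resources
   • A single physical hardware resource is logically divided into several, or
   • several hardware resources are logically combined into one
• Recently in the spotlight as a way to solve problems such as hardware management and disaster recovery
Advantages of virtualization
• Cost reduction
   One server can be partitioned into several servers.
   Server purchase, electricity, floor-space, and server administration costs are reduced.
• Efficient use of resources
   Idle server resources are used to create virtual machines, so resources are used efficiently.
• Stable operation
   Servers are backed up as images and are easy to migrate, so failures can be handled quickly.
• Continued operation of software
   When the life cycle of the server hardware ends, the OS vendor stops supporting its device drivers, which causes migration problems.
   Because the existing system runs on a virtual machine, device driver problems do not arise.
Virtualization is a basic building block of cloud services.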
Public Galaxy environment
Example of Cloud
Source: iSC 2012 Amazon HPC session
Running Galaxy Web server
1. Check the IP address of your machine.
   $ ifconfig
2. Move to the Galaxy folder.
   $ cd galaxy-dist
3. Start the Galaxy web server.
   $ sh run.sh
4. In a web browser on your host OS (Windows), enter the following in the address bar.
   IP Address:8080 (e.g., 172.20.8.162:8080)
Galaxy Detail functions
Get Data
Get Data / Send Data
Text Manipulation
Convert Format
FASTA manipulation
Filter and Sort
Join, Subtract and Group
Operate on Genomic Intervals
NGS Toolbox
Galaxy Examples 3
Example 3.
Human RNA-seq
1. RNA-seq result: adrenal_1,2.fastq, brain_1,2.fastq
2. Reference: iGenome UCSC hg19, chr19 gene annotation (GTF format)
Have to work in Unix and need programming! (Python, Perl, …)
On Galaxy
http://usegalaxy.org
On Galaxy
Additional programs must be installed for RNA-seq analysis
( http://wiki.galaxyproject.org/Admin/NGS%20Local%20Setup )
Program     Used for           Installation
java        FastQC             Ubuntu apt-get install openjdk-7-jre
FastQC      NGS QC             Copy to tool-data/shared/jars/
Tophat      RNA-seq mapping    (see the next page)
Cufflinks   RNA-seq assembly   Ubuntu apt-get install cufflinks
Tophat install in Ubuntu
# Copy both source archives into the work directory
$ cp samtools-0.1.18.tar.bz2 tophat-1.4.1.tar.gz ~/work
$ cd ~/work
# Build SAMtools (Tophat needs its library and headers)
$ bzip2 -d samtools-0.1.18.tar.bz2
$ tar xvf samtools-0.1.18.tar
$ cd samtools-0.1.18
$ make
$ cd ..
# Build and install Tophat
$ tar zxvf tophat-1.4.1.tar.gz
$ cd tophat-1.4.1
$ sudo apt-get install libboost libbam libboost-thread-dev
$ sudo cp ../samtools-0.1.18/libbam.a /usr/local/lib
$ sudo mkdir /usr/local/include/bam
$ sudo cp ../samtools-0.1.18/*.h /usr/local/include/bam
$ ./configure
$ make
$ sudo make install
On Galaxy
1. Get data > Upload File: upload the fastq, chr19.fa, and gtf files
2. NGS: QC and manipulation > FASTQ Groomer: convert to fastqsanger format
3. NGS: QC and manipulation > FastQC:Read QC: view fastq quality statistics
4. NGS: RNA Analysis > Tophat for Illumina: find splice junctions in the RNA-seq fastq data, using chr19.fa as the reference
5. NGS: RNA Analysis > Cufflinks: transcript assembly and FPKM estimation
6. NGS: RNA Analysis > Cuffmerge: merge the brain and adrenal data against the reference
7. NGS: RNA Analysis > Cuffdiff: find significant expression changes
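The FPKM value mentioned in the Cufflinks step is, by definition, fragments per kilobase of transcript per million mapped fragments. The sketch below computes that definition directly, plus a naive log2 fold change between two samples. It is only meant to show what the number means: Cufflinks and Cuffdiff estimate abundances and significance with statistical models, not with this arithmetic, and the pseudocount used here is an arbitrary illustrative choice.

```python
import math


def fpkm(fragment_count, transcript_length_bp, total_mapped_fragments):
    """Fragments Per Kilobase of transcript per Million mapped fragments."""
    if transcript_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("length and total fragment count must be positive")
    return fragment_count * 1e9 / (transcript_length_bp * total_mapped_fragments)


def log2_fold_change(fpkm_a, fpkm_b, pseudocount=1.0):
    """Naive log2 ratio of sample B over sample A (pseudocount avoids log of 0)."""
    return math.log2((fpkm_b + pseudocount) / (fpkm_a + pseudocount))
```

For example, 500 fragments on a 2,000 bp transcript in a library of 10 million mapped fragments gives 500 × 10^9 / (2,000 × 10^7) = 25 FPKM.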
Galaxy Tools
Creating your own Galaxy
Primer design tool
Primer3
Primer3
• Primer design program
• http://primer3.sourceforge.net/releases.php
• Download from http://sourceforge.net/projects/primer3/files/primer3/1.1.4/primer3-1.1.4.tar.gz
• make & copy to PATH
eprimer3
• Wrapper for Primer3, part of the EMBOSS package
• Easy command line interface
• http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/eprimer3.html
• apt-get install emboss
eprimer3
$ eprimer3 -sequence INPUT_FASTA_FILE -outfile PRIMER_DESIGN_RESULT -osize OSIZE -gcclamp GCCLAMP …
# EPRIMER3 RESULTS FOR GL020027.1
#                      Start  Len   Tm     GC%   Sequence
   1 PRODUCT SIZE: 199
     FORWARD PRIMER  571071   20  60.06  45.00  CTTGCCAATAGCGAATGGAT
     REVERSE PRIMER  571250   20  59.99  55.00  GACGGCGTAGATCTTCAAGC
   2 PRODUCT SIZE: 199
     FORWARD PRIMER   55074   20  60.05  55.00  TAACACCACTGCTCCTGCTG
     REVERSE PRIMER   55253   20  59.97  50.00  CATTGCATGGTCAGAACCAC
   3 PRODUCT SIZE: 200
     FORWARD PRIMER   71990   20  60.03  45.00  GGGGTTGATTTTCATTGTGG
     REVERSE PRIMER   72170   20  59.88  45.00  GTTTGCACCAACCTGGTTTT
   4 PRODUCT SIZE: 200
     FORWARD PRIMER  427182   20  59.83  50.00  CTGATGTGCTCTGTGGGAAA
     REVERSE PRIMER  427362   20  60.01  55.00  CCGTGTATGTAGCCCGAGTT
   5 PRODUCT SIZE: 197
     FORWARD PRIMER  427185   20  59.97  50.00  ATGTGCTCTGTGGGAAAACC
     REVERSE PRIMER  427362   20  60.01  55.00  CCGTGTATGTAGCCCGAGTT
We want to convert this result format so that it can be used as input to other Galaxy tools.
Building a primer design Galaxy tool ourselves
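One way to make the eprimer3 report usable by other Galaxy tools is to flatten it into tab-delimited rows. The sketch below parses a report in the layout shown above and emits one row per primer. It is an illustrative assumption about what such a wrapper script might do, not the contents of the workshop's eprimer3.py, and the column order is a choice made for this example.

```python
import re

HEADER_RE = re.compile(r"^# EPRIMER3 RESULTS FOR (\S+)")
PRODUCT_RE = re.compile(r"^\s*(\d+)\s+PRODUCT SIZE:\s*(\d+)")
PRIMER_RE = re.compile(
    r"^\s*(FORWARD|REVERSE) PRIMER\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.]+)\s+([ACGTNacgtn]+)"
)


def eprimer3_to_tabular(report: str) -> str:
    """Convert an eprimer3 text report into tab-delimited rows.

    Columns: sequence id, pair number, product size, direction,
    start, length, Tm, GC%, primer sequence.
    """
    rows = []
    seq_id = pair = size = ""
    for line in report.splitlines():
        match = HEADER_RE.match(line)
        if match:
            seq_id = match.group(1)
            continue
        match = PRODUCT_RE.match(line)
        if match:
            pair, size = match.group(1), match.group(2)
            continue
        match = PRIMER_RE.match(line)
        if match:
            direction, start, length, tm, gc, primer = match.groups()
            rows.append("\t".join(
                [seq_id, pair, size, direction.lower(), start, length, tm, gc, primer]
            ))
    return "\n".join(rows)
```

Because the output is plain tabular text, generic Galaxy tools such as Filter and Sort can operate on it directly.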
eprimer3.xml
eprimer3.py
tool_conf.xml
…
  <section name="VCF Tools" id="vcf_tools">
    <tool file="vcf_tools/intersect.xml" />
    <tool file="vcf_tools/annotate.xml" />
    <tool file="vcf_tools/filter.xml" />
    <tool file="vcf_tools/extract.xml" />
  </section>
  <section name="MyTools" id="mytools">
    <tool file="mytools/eprimer3.xml" />
  </section>
</toolbox>
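The same registration can be done programmatically. The following sketch adds a tool entry to a tool_conf.xml document with Python's standard `xml.etree.ElementTree`, creating the section if it does not exist yet. The function name `add_tool` is hypothetical, and the approach of editing the file by script rather than by hand is an assumption; the slide itself simply edits the XML manually.

```python
import xml.etree.ElementTree as ET


def add_tool(tool_conf_xml: str, section_id: str, section_name: str, tool_file: str) -> str:
    """Return tool_conf.xml text with tool_file registered under the given section."""
    root = ET.fromstring(tool_conf_xml)
    section = None
    for candidate in root.findall("section"):
        if candidate.get("id") == section_id:
            section = candidate
            break
    if section is None:
        # Section does not exist yet: append it at the end of the toolbox.
        section = ET.SubElement(root, "section", {"name": section_name, "id": section_id})
    already_listed = any(t.get("file") == tool_file for t in section.findall("tool"))
    if not already_listed:
        ET.SubElement(section, "tool", {"file": tool_file})
    return ET.tostring(root, encoding="unicode")
```

For example, `add_tool(xml_text, "mytools", "MyTools", "mytools/eprimer3.xml")` yields a toolbox containing the MyTools section shown above; running it twice does not duplicate the entry.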
EMBOSS eprimer3 tool added
Hands-on
1. Install Primer3: compile with make, then add primer3_core to PATH
2. Install EMBOSS: sudo apt-get install emboss
3. Install Biopython: sudo apt-get install python-biopython
4. Copy eprimer3.py, eprimer3.xml to galaxy-dist/tools/mytools/: create the mytools directory yourself
5. Edit tool_conf.xml: register mytools/eprimer3.xml
Galaxy on Grid
Grid vs Cluster
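In common
• Computation on large data is split into many small computations and distributed across many small computers.
Difference (grid)
• Connects heterogeneous machines over a WAN
• Connects diverse platforms to one another
• No limit on the number of connected machines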
Grid
Globus Toolkit
A representative computational grid middleware
Open source toolkit for building computing grids, developed and provided by the Globus Alliance
Standards implementation
• Open Grid Service Architecture (OGSA)
• Open Grid Service Infrastructure (OGSI)
• Web Services Resource Framework (WSRF)
• Job Submission Description Language (JSDL)
• Distributed Resource Management Application API (DRMAA)
• SOAP
• WSDL
• Grid Security Infrastructure
High level Open Grid Forum API specification for submission and control of jobs to a Distributed Resource Management (DRM, Job scheduler) system, such as a Cluster or Grid computing infrastructure
PBS (Portable Batch System)
Computer software that performs job scheduling in Unix cluster environment
A component of the Globus Toolkit
Originally developed by NASA
Following versions
• OpenPBS
• TORQUE – a fork of OpenPBS
• PBS Professional (PBS pro) - commercial
TORQUE
Distributed resource manager providing control over batch jobs and distributed compute nodes
It stands for Terascale Open Source Resource and QUEue Manager
It holds configuration information for each slave node (number of CPUs, number of cores, RAM size, temporary storage, and so on) and allocates cluster resources when the scheduler makes a request.
Master, Slave 1, Slave 2, Slave 3 (shared NFS)
> qsub a.sh
The a.sh command is handed to a slave according to the scheduler.
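The division of labour described above (the resource manager knows each node's capacity, and a job goes to a node that can hold it) can be illustrated with a toy first-fit allocator. This is a deliberately simplified sketch and not TORQUE's actual scheduling policy; the `Node` class, the `submit` function, and the first-fit rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical record of one slave node's capacity and current usage."""
    name: str
    cores: int
    ram_gb: int
    used_cores: int = 0
    used_ram_gb: int = 0

    def can_run(self, cores: int, ram_gb: int) -> bool:
        free_cores = self.cores - self.used_cores
        free_ram = self.ram_gb - self.used_ram_gb
        return free_cores >= cores and free_ram >= ram_gb


def submit(nodes: List[Node], job_name: str, cores: int, ram_gb: int) -> Optional[str]:
    """Place a job on the first node with enough free resources (first fit).

    job_name (e.g. "a.sh") is only a label in this toy model.
    Returns the node name, or None if the job has to wait in the queue.
    """
    for node in nodes:
        if node.can_run(cores, ram_gb):
            node.used_cores += cores
            node.used_ram_gb += ram_gb
            return node.name
    return None
```

In a real cluster the submitted script must be visible on whichever slave runs it, which is why the nodes in the diagram share storage over NFS.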
Virtualized Galaxy (Test-bed)
Galaxy on Cloud
Cloud computing
Delivery of computing and storage capacity as a service to a heterogeneous community of end-recipients.
VPS (Virtual Private Server)
A term used by Internet hosting services to refer to a virtual machine in a cloud.
Amazon EC2 (Amazon Elastic Compute Cloud)
Virtualization + Grid (Cluster) computing in a Cloud
Amazon S3 (Amazon Simple Storage Service)
Galaxy on Cloud
Using Amazon EC2 + S3
Select AMIs in Community AMIs
Galaxy on Insilicogen
Galaxy localization on cluster
Tool development
Workflow development
www.insilicogen.com E-mail [email protected] Tel 031-278-0061 Fax 031-278-0062
www.insilicogen.com E-mail [email protected] Tel 031-548-1008,1009 Fax 031-278-0062