Computational Skill for Modern Biology Research
Department of Biology, Chungbuk National University
1st Lecture 2015.9.1
Introduction
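Course overview
Instructor: Suk Namgoong (남궁석)
Department of Animal Science, College of Agriculture and Life Sciences, Chungbuk National University
suknamgoong@chungbuk.ac.kr
suk.namgoong@gmail.com
HP: 010-4103-2415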
Syllabus
Week: Topic
Week 1: Introduction: Why do we need to learn this stuff?
Week 2: Basics of Unix
Week 3: Unix Command Prompt II and software installation
Week 4: Running bioinformatics software (BLAST) and shell scripting
Week 5: Python Scripting I
Week 6: Python Scripting II
Week 7: Python Scripting III
Week 8: Next Generation Sequencing
Week 9:
Week 10: Next Generation Sequencing Analysis
Week 11: R and statistical analysis
Week 12: Bioconductor I
Week 13: Bioconductor II
Week 14: Network analysis
Objective of this lecture
• Essential computational skills needed for biology in the 21st century
• Survival skills for ‘omics’ research
….for wet lab biologists like you
… We are not even computer science majors, so why do we have to learn this?
Some of you may think…
The objective of this lecture is..
Not to turn you into these people….
This is not the computer science department
Or..
Nor is it about developing the complicated software used in biological research. That is a job for professional developers, not for us.
All we need to learn is…
Survival kit & Skills
To survive the challenges of modern biology
Current situation of biologists
You are here!
Welcome to the jungle…
What is the ‘challenge’ for modern biologists?
High-throughput experiments and their data analysis
• Microarray
• RNA-Seq
• Next Generation Sequencing
• ChIP-Seq
• RIP-Seq
• Mass Spectrometry
Interdisciplinary nature of modern research
• Genetics
• Cell Biology
• Developmental Biology
• Biochemistry
• Biophysics
• Bioinformatics
• …
Genomic scale of data
Genome
Transcriptome
Proteome
Metabolome
Metagenome
Genomic Scale of Data
How do you handle data at this scale?
Human Genome Project draft sequence published: 2001
Genome size: 3.2 Gb (human)
Number of genes: about 22,000
Average number of variants unique to an individual person: single nucleotide polymorphisms, 10-30 million bases
Transcripts: about 140,000
3,209,286,105 base pairs
Simple Examples
Question: I want to find all human (insert your favorite organism here) protein kinase genes and classify them based on sequence homology.
How can you do that?
The human genome contains more than 500 protein kinase genes. (Some of them are serine/threonine kinases and others are tyrosine kinases.)
Most biologists will go to NCBI and search like this…
But they will end up with these results…
The way typical biologists do it.
A little tweaking of the search may yield a more reasonable number…
Cut & paste one by one…
Repeat the same procedure more than 1,000 times…
Do you want to do that?
If you have 2-3 proteins, you can do it this way. But when you are doing genome-scale research, you need a more efficient way…
Don’t use a cheat key like “make an undergraduate do it”
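As a rough illustration of replacing the cut-and-paste loop with a script, here is a minimal Python sketch. It assumes the sequences have already been downloaded as FASTA text; the records, the function names, and the keyword filter are all made up for this example. A keyword match on the header is only a first-pass selection, not the homology-based classification the question asks for.

```python
def read_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts; store the previous one first.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_by_keyword(records, keyword):
    """Keep records whose header mentions the keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [(h, s) for h, s in records if keyword in h.lower()]


# Hypothetical toy input; real data would be a downloaded sequence file.
EXAMPLE = """>protA serine/threonine-protein kinase (made-up record)
MKKLSE
QWERTY
>protB hypothetical transcription factor (made-up record)
MSTNPK
>protC tyrosine-protein Kinase (made-up record)
MAAGLV
"""

kinases = select_by_keyword(read_fasta(EXAMPLE), "kinase")
for header, seq in kinases:
    # Print the record identifier and the sequence length.
    print(header.split()[0], len(seq))
```

The same few lines work unchanged whether the file holds three records or several thousand, which is the point of scripting the task.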
Other examples: measuring expression levels of mRNA
Northern blot: 2-3 genes
qPCR: 5-100 genes
When you are checking expression levels for a few genes, it is not difficult to find differentially expressed genes (DEGs) by eye…
Microarray, RNA-Seq: transcriptome-wide (~22,000 genes)
Question: Find the genes expressed differentially between two conditions (example: cancer and normal tissues).
But in the age of genome-scale expression profiling..
Microarray, RNA-Seq
Gene Expression Omnibus at NCBI
6 microarray datasets
Control
Treatment
Three biological repeats
Each dataset contains…
More than 30,000 values, each corresponding to the intensity of a probe
You cannot find differentially expressed genes by eye..
Too much data to look at on a genome scale
You need the skills to look through these data..
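To make "looking through these data" concrete, here is a small Python sketch that screens genes by log2 fold change between three control and three treatment replicates. The gene names, the intensity values, the pseudocount, and the threshold are all invented for illustration. A fold-change cutoff alone is not a statistical test, so a real analysis would also assess significance across the replicates.

```python
import math
from statistics import mean


def log2_fold_change(control, treatment, pseudocount=1.0):
    """log2 ratio of mean treatment to mean control intensity.

    The pseudocount (an assumption of this sketch) avoids division by zero.
    """
    return math.log2((mean(treatment) + pseudocount) / (mean(control) + pseudocount))


def find_deg(expression, threshold=1.0):
    """Return {gene: log2FC} for genes whose |log2FC| reaches the threshold.

    `expression` maps a gene name to (control_values, treatment_values).
    """
    hits = {}
    for gene, (control, treatment) in expression.items():
        lfc = log2_fold_change(control, treatment)
        if abs(lfc) >= threshold:
            hits[gene] = lfc
    return hits


# Made-up probe intensities: three control and three treatment replicates.
EXPRESSION = {
    "geneA": ([100, 110, 90], [410, 390, 400]),   # clearly up
    "geneB": ([50, 55, 45], [52, 48, 50]),        # unchanged
    "geneC": ([800, 820, 780], [95, 105, 100]),   # clearly down
}

hits = find_deg(EXPRESSION)
for gene in sorted(hits):
    print(gene, round(hits[gene], 2))
```

With 30,000 probes instead of three genes, only the input dictionary changes; the screening code stays the same.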
General-purpose software (like Office) is not enough for this purpose
Premade software for analyzing these data will not fit all your purposes
Many biologists use (more correctly, abuse) spreadsheet programs like Excel…
But Excel is not designed for this purpose. I will show why.
Excel is not enough..
Original gene names: Sep1, Sep2, Sep3, March1, March2
Example: human gene names (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/locus_groups/protein-coding_gene.txt)
Excel converts them to dates
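One way to guard against this is to scan a gene list for symbols that look like dates before it ever goes near a spreadsheet. The Python sketch below does that with a regular expression. The pattern is an assumption made for this example (a month name or abbreviation followed by one or two digits); it is not Excel's actual conversion rule.

```python
import re

# Assumed pattern: month name or abbreviation, optional hyphen, 1-2 digits.
DATE_LIKE = re.compile(
    r"^(jan|feb|mar|march|apr|may|jun|jul|aug|sep|sept|oct|nov|dec)-?\d{1,2}$",
    re.IGNORECASE,
)


def date_like_symbols(symbols):
    """Return the gene symbols a spreadsheet might reinterpret as dates."""
    return [s for s in symbols if DATE_LIKE.match(s)]


# Sep1, Sep2 and March1 come from the slide; the others are added for contrast.
symbols = ["Sep1", "Sep2", "March1", "TP53", "BRCA1", "DEC1"]
print(date_like_symbols(symbols))
```

Because a script reads the file as plain text, the symbols stay exactly as written; nothing is silently converted.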
Excel is not enough..
Can you find the nucleotide sequence of a selected gene?
Can you find the neighboring genes on the chromosome?
Can you find the genes homologous to the one in the current row?
Can you find the genes that belong to the same signaling pathway?
Can you find the common SNPs in a specific gene?
……
Probably you cannot find the answers so easily with Excel…
Genome Level High-throughput Experiments
Genomic DNA
PCR
Transformation
What if you want to construct a ‘genome-wide library’?
Go to PubMed..
Find the appropriate gene sequence…
Cut & paste the nucleotide sequence into a web server for primer design…
Retrieve the designed primer sequences..
Then go back to the first step and repeat until finished
Order primers..
Done?
YES
PCR
NO
Workflow
Probably okay if you have a few genes to work on.. (Still OK if you have 10)
But..
If you have 100 genes…
Or 1,000 genes..
Or even a whole genome (10,000-20,000 genes): can you still do it this way?
You can do it manually..
The need for automation
It reminds me of something else…
Done?
YES
PCR
NO
Software flowchart
If we know exactly what we want to do, we can program and automate it.
…..Using Computer Software!
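The "Done? YES / NO" loop in the flowchart maps directly onto a program loop. The Python sketch below walks through a gene list and picks a primer pair for each entry. The gene names and sequences are invented, and the primer "design" is deliberately naive: it takes the first bases as the forward primer and the reverse complement of the last bases as the reverse primer, ignoring melting temperature, GC content, and specificity, which a real primer-design tool would check.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def naive_primers(sequence, length=20):
    """Pick a (forward, reverse) primer pair from the ends of the sequence.

    This is a placeholder for a real primer-design step.
    """
    sequence = sequence.upper()
    if len(sequence) < 2 * length:
        raise ValueError("sequence too short for two primers")
    return sequence[:length], reverse_complement(sequence[-length:])


def design_all(genes, length=20):
    """Repeat the design step for every gene until none are left."""
    orders = {}
    todo = list(genes.items())
    while todo:                      # Done? NO: take the next gene
        name, seq = todo.pop(0)
        orders[name] = naive_primers(seq, length)
    return orders                    # Done? YES: ready to order primers


# Made-up toy sequences; short primers (length=6) keep the output readable.
GENES = {
    "gene1": "ATGGCCAAGTTTGACCTGTAA",
    "gene2": "ATGCCCGGGAAATTTCACTAG",
}

orders = design_all(GENES, length=6)
for name, (forward, reverse) in orders.items():
    print(name, forward, reverse)
```

Going from 2 genes to 20,000 only changes the size of the input; the loop itself does not grow.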
High-throughput screening of chemicals using zebrafish
More complicated examples
http://www.bloodjournal.org/content/bloodjournal/119/24/5614/
You can get as many images as you want
But do you need to check all of these images one by one?
High-Content Screening
• Automatically analyze images
• Quantify the desired features in the images
• Find possible drug candidates or targets
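A toy version of "quantify a feature, then rank the images" can be sketched in a few lines of Python. Here each image is a small grid of grayscale values, and the feature is simply the fraction of pixels above a brightness threshold. The well names, the pixel values, and the choice of feature are assumptions for illustration; an actual screen would use an image-analysis library and a biologically meaningful measurement.

```python
def bright_fraction(image, threshold=128):
    """Fraction of pixels whose intensity is at or above the threshold."""
    pixels = [p for row in image for p in row]
    return sum(p >= threshold for p in pixels) / len(pixels)


def rank_images(images, threshold=128):
    """Score every image and return (name, score) pairs, brightest first."""
    scores = {name: bright_fraction(img, threshold) for name, img in images.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Made-up 3x3 grayscale "images", one per screening well.
IMAGES = {
    "well_A1": [[0, 10, 200], [5, 220, 240], [0, 0, 30]],
    "well_A2": [[0, 0, 0], [0, 150, 0], [0, 0, 0]],
    "well_A3": [[255, 255, 200], [190, 180, 0], [0, 0, 0]],
}

ranking = rank_images(IMAGES)
for name, score in ranking:
    print(name, round(score, 2))
```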
Computers
Originally developed for..
Perform complex and repetitive scientific calculation (tasks)
ENIAC Programming (Circa 1950)
Until 1970-1980..
The use of computers was primarily limited to scientific calculations..
The current widespread use of computers came after the introduction of the ‘personal computer’
Two Steves with their personal computer
People started to use ‘personal computers’ for activities other than scientific research…
Nowadays most people have even forgotten that the computer was originally made for scientific research, to automate repetitive tasks..
Your computer is not just for these tasks…
Use your computer for the purpose it was originally invented for! The computer was originally made for scientific research.
Software
To do a specific task with a computer, you need software for that task
If someone else has done the task before you, there is probably software for it..
Use it, if it lets you finish your task
But sometimes you cannot finish all of your work with premade software.
Web-based software / Desktop-based software
Commercial Software vs Research Software
User base: huge (millions)
Features : very extensive
Developed by commercial company
Cost : $$$
Easy to use UI (User interface)
Commercial Software vs Research Software
User base: relatively small (1 – 1,000 or 10,000)
Features: vary
Developed by academia (maintained by a very small number of developers)
Cost: mostly free
Focused on function
Sometimes not easy to use
(Making a good UI costs $$$)
Don’t expect the quality of “Office” from research software
“Oh, you can buy (expensive) software to analyze genome sequences”
“Or someone will make a web site to analyze data for my purpose…”
Sometimes you need to make your own tools (or get things done by combining pre-existing tools)
Essentially, doing something to find a ‘novel’ thing.
Sometimes you will not have suitable tools for your novel research subject.
Do not expect to always find research software for your job
‘Doing Research’ means…
You need to use multiple software tools to get your job done
NGS sequencing
Align to reference sequences
Find variants
Filter variants
Visualize
Short read mappers: BWA, Bowtie
Integrative Genomics Viewer
GATK
FASTQ format
BAM format
VCF format
X-ray Crystal Diffraction Data
Space group determination
Phasing
Refine
Model building
Data Reduction
Visualize
Each step of a computational analysis requires different software
Command Line
muscle
MUSCLE v3.8.31 by Robert C. Edgar
http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.
Basic usage
muscle -in <inputfile> -out <outputfile>
Common options (for a complete list please see the User Guide):
    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
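A command like the one above can also be assembled and launched from a script, so the exact options are written down rather than retyped. The Python sketch below builds the argument list from the options shown in this help text (-in, -out, -maxiters, -clw) and runs it only if a `muscle` executable is found. The file names and function names are hypothetical, and other MUSCLE versions may use different options.

```python
import shutil
import subprocess


def muscle_command(infile, outfile, maxiters=None, clw=False):
    """Build the argument list for a MUSCLE run, using the options listed above."""
    cmd = ["muscle", "-in", infile, "-out", outfile]
    if maxiters is not None:
        cmd += ["-maxiters", str(maxiters)]
    if clw:
        cmd.append("-clw")
    return cmd


def run_muscle(infile, outfile, **options):
    """Run MUSCLE if it is installed; otherwise just report the command."""
    cmd = muscle_command(infile, outfile, **options)
    if shutil.which("muscle") is None:
        print("muscle not found on PATH; would have run:", " ".join(cmd))
        return None
    return subprocess.run(cmd, check=True)


# Hypothetical file names; this only prints the command, it does not run it.
print(" ".join(muscle_command("kinases.fasta", "kinases.aln", maxiters=2)))
```

Calling `run_muscle("kinases.fasta", "kinases.aln", maxiters=2)` would execute the same command on a machine where MUSCLE is installed and the input file exists.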
Command-line driven software
You need to type a command to run the software
Not all scientific software has a ‘user-friendly’ user interface
Developing a user interface requires time and $$$$
Most academic software developers have neither :(
Advantage of Command Line
You can combine various software modules into a “pipeline”
• Software 1 -> software 2 -> software 3 -> software 4 -> final
• You can customize & automate your workflow
Reproducible Research
• All the conditions used to run the software are preserved as ‘scripts’
• You can analyze various samples under the same conditions
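The two ideas above, chaining steps into a pipeline and keeping the run conditions in a script, can be sketched together in Python. In this toy example each "software" step is a small function, the settings live in one dictionary, and every sample goes through the same steps with the same settings. The step names, the settings, and the read data are all invented for illustration; in practice each step would call an external program.

```python
def run_pipeline(steps, data, settings):
    """Apply each step in order and log how many items remain after it."""
    log = []
    for step in steps:
        data = step(data, settings)
        log.append((step.__name__, len(data)))
    return data, log


def drop_short(reads, settings):
    """Step 1: discard reads shorter than the configured minimum length."""
    return [r for r in reads if len(r) >= settings["min_length"]]


def to_upper(reads, settings):
    """Step 2: normalize reads to upper case."""
    return [r.upper() for r in reads]


def deduplicate(reads, settings):
    """Step 3: remove duplicate reads and return them in sorted order."""
    return sorted(set(reads))


# The conditions of the analysis, written down once and reused for all samples.
SETTINGS = {"min_length": 4}
STEPS = [drop_short, to_upper, deduplicate]

# Made-up reads for two samples.
SAMPLES = {
    "sample1": ["acgt", "ACGT", "ac", "ggcca"],
    "sample2": ["ttt", "ttaac", "TTAAC"],
}

results = {name: run_pipeline(STEPS, reads, SETTINGS) for name, reads in SAMPLES.items()}
for name, (reads, log) in results.items():
    print(name, reads, log)
```

The script itself is the record of what was done: rerunning it reproduces the analysis, and changing one value in `SETTINGS` changes it for every sample at once.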
Reproducible Research
Different setups may cause differences in the analysis
In GUI environments, you change the setup each time, so getting consistent results is sometimes challenging..
With the command line and scripting, you can document your research
It is like writing a ‘lab notebook’ for wet lab experiments
http://nbviewer.ipython.org/gist/hyeshik/cf9f3d7686e07eedbfda?revision=6
Don’t be afraid of the command line
Some time ago, we used the command line for everyday computing tasks
After the Windows & Internet era, people started to be afraid of the command line…
Don’t be afraid of it. It doesn’t bite. (If you know how to use it)
The key to mastering this course is to conquer the command line interface
“Experimental Protocol” vs “Scripting”
Wet lab experiments (experimental protocol): Sacrifice animal -> DNA extraction -> PCR amplification -> Sequencing
Analysis (scripting): Read data -> Alignment -> Quantification -> Differentially expressed genes
Kit vs Premade Software
Kit for wet lab experiments
We can do routine, well-defined experiments using a premade kit
Even without understanding the principles of the experiment, we can get results..
Kit vs Premade Software
Web server for bioinformatics analysis
We can do routine, well-defined analyses using a premade web server or software
Even without understanding the principles of the analysis, we can get results..
Limitation of ‘Kit’
You cannot do all of your experiments using a ‘kit’
If you are the first one to develop the protocol, there will be no kit for your experiments
Sometimes you need to buy the individual components of an experiment and optimize it
These statements also apply to bioinformatics analysis.
Sometimes your analysis cannot be done with a ‘premade kit’. Then what will you do?
Purpose of This Lecture
You will learn how to make your own ‘kit’ using preexisting components
It will require…
A little knowledge of Unix-like operating systems..
How to operate a computer using the command line
A little bit of programming..
Why Unix?
Most scientific software is developed on Unix-based operating systems
Examples of Unix-based operating systems
- Linux (Ubuntu, Fedora, RedHat…)
- Mac OS
- Windows (not Unix-based)
Most bioinformatics/biology-related software is developed on Unix-based systems
It is more convenient to use the command line on a Unix-based system
But I know most of you are using Windows.
Three alternatives
1. You can install Linux on your computer (you need to format your computer)
2. You can install Linux inside Windows (using a ‘virtual machine’)
https://www.virtualbox.org/
3. Install Cygwin (https://www.cygwin.com/)
Use Windows with Unix-like commands
If you are a Mac user..
Applications – Utilities – Terminal
A Mac is a Unix-based computer, so you don’t need anything else..
First assignment
For the next class, please bring a notebook computer on which you can use the Unix command line
Select one of these options
- Install Linux
- Install Linux via VirtualBox
- Install Cygwin
- Bring a Mac
How to install?
Google it!
Your ‘Cheat Key’ for Life
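Bioinformatics is not something like a fusion of computer science and biology.
The computer is just a tool for handling very large data.
Bioinformatics is a research technique of modern biology for handling large amounts of data
Bioinformatics is biology
Of course, studying bioinformatics may require knowledge of computer science, statistics, programming, mathematics, and so on.
But bioinformatics is by no means a subfield of computer science, statistics, programming, or mathematics.
• At CERN (the European Organization for Nuclear Research), where the experiments that discovered the Higgs boson were carried out, complex programming, mathematics, and statistics are also used to analyze vast amounts of data.
• We call these people ‘physicists’.
• People who analyze the vast data associated with living organisms are ‘biologists’, and the methods they use for that analysis are bioinformatics methods.
Misconceptions
Biologist
Produces data: “The people who are good with computers will do all the analysis for me.”
Bioinformatician
Analyzes data: “I don’t really know what this experimental result is, but I’ll just run some program and draw a pretty figure.”
The fundamental reason this kind of research has not worked well
If you come from a biology background..
• Understand the basic theory of bioinformatics analysis
• Grasp the basic concepts of bioinformatics analysis
• Understand basic programming concepts
• Understand that bioinformatics analysis is not a ‘black box’
If you come from a programming background
• Understand basic biology concepts
• Understand how the data are produced
• Recognize that experimental data always contain errors
The man from Mars and the woman from Venus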
In the next class..
Please bring a notebook computer on which you can use the command line interface
One-and-a-half-hour lecture
One-and-a-half-hour practice & question session
We will talk about the basics of the UNIX command line interface and installing BLAST on your computer