Computational Skills for Biology Research, Lecture 1

Computational Skill for Modern Biology Research

Department of Biology, Chungbuk National University

1st Lecture 2015.9.1

Introduction

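Course Overview

Instructor: Suk Namgoong (남궁석)

Department of Animal Science, College of Agriculture and Life Sciences, Chungbuk National University

suknamgoong@chungbuk.ac.kr

suk.namgoong@gmail.com

Mobile: 010-4103-2415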

Syllabus

Week / Topic

Week 1: Introduction: Why do we need to learn this stuff?

Week 2: Basics of Unix

Week 3: Unix Command Prompt II and software installation

Week 4: Running bioinformatics software (BLAST) and shell scripting

Week 5: Python Scripting I

Week 6: Python Scripting II

Week 7: Python Scripting III

Week 8: Next Generation Sequencing

Weeks 9-10: Next Generation Sequencing Analysis

Week 11: R and statistical analysis

Week 12: Bioconductor I

Week 13: Bioconductor II

Week 14: Network analysis

Objective of this lecture

• Essential computational skills needed for biology in the 21st century

• Survival skills for ‘omics’ research

….for wet lab biologists like you

… We are not even computer science majors, so why do we have to learn this?

Some of you may think…

The objective of this lecture is..

Not to turn you into these people….

This is not the computer science department

Or..

It is not about developing the complicated software used in biological research. That is a job for professional developers, not for us.

All we need to learn is…

Survival kit & Skills

To survive challenges from modern biology

Current situation of biologists

You are here!

Welcome to the jungle…

What is the ‘challenge’ for modern biologists?

High-throughput experiments and their data analysis

• Microarray
• RNA-Seq
• Next Generation Sequencing
• ChIP-Seq
• RIP-Seq
• Mass spectrometry

Interdisciplinary nature of modern research

• Genetics
• Cell Biology
• Developmental Biology
• Biochemistry
• Biophysics
• Bioinformatics
• …

Genomic scale of data

• Genome
• Transcriptome
• Proteome
• Metabolome
• Metagenome

Genomic Scale of Data

How do you handle data at this scale?

Human Genome Project: draft sequence published in 2001, declared complete in 2003

Genome size: 3.2 Gb (human), 3,209,286,105 base pairs

Number of genes: about 22,000

Transcripts: about 140,000

Variants carried by an individual person (single nucleotide polymorphisms): 10-30 million bases

Simple Examples

Question: I want to find all human (insert your favorite organism here) protein kinase genes and classify them based on sequence homology..

How can you do that?

The human genome contains more than 500 protein kinase genes. (Some of them are serine/threonine kinases and others are tyrosine kinases.)

Most biologists will go to NCBI and search like this…

But they will end up with results like these…

The typical biologist's way.

A little tweaking of the search may yield a more reasonable number…

Cut&Paste one by one…

Repeat the same procedure more than 1,000 times…

Do you want to do that?

If you have 2-3 proteins, you can do it this way. But when you are doing genome-scale research, you need a more efficient way…

Don’t use a cheat key like “make an undergraduate do it”
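As a rough illustration of what "a more efficient way" could look like, here is a minimal Python sketch that reads many sequences at once and keeps only those annotated as kinases, instead of cutting and pasting them one by one. The records in `FASTA_TEXT` are invented placeholders, not real database entries, and matching on the header text is only an assumption made for this example; a real classification would rely on sequence homology.

```python
# Minimal FASTA parser plus keyword filter.
# The records below are invented placeholders, not real database entries.
FASTA_TEXT = """>P1 hypothetical serine/threonine kinase A
MKKLSGGE
>P2 hypothetical transcription factor
MSTNPKPQ
>P3 hypothetical tyrosine kinase B
MGLLHRDL
KPENILLD
"""


def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_by_keyword(records, keyword):
    """Keep records whose header contains the keyword (case-insensitive)."""
    return [(h, s) for h, s in records if keyword.lower() in h.lower()]


kinases = select_by_keyword(parse_fasta(FASTA_TEXT), "kinase")
for header, seq in kinases:
    print(header, len(seq))
```

The same few lines work whether the file holds 3 records or 30,000, which is the point of automating the task.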

Other examples: measuring expression levels of mRNA

Northern blot: 2-3 genes

qPCR: 5-100 genes

When you are checking expression levels for a few genes, it is not difficult to find differentially expressed genes (DEGs) by eye…

But in the age of genome-scale expression profiling..

Microarray, RNA-Seq

Transcriptome-wide (~22,000 genes)

Question: Find the genes expressed differentially between two conditions (example: cancer and normal tissues).

Gene Expression Omnibus at NCBI

6 Microarray datasets

Control

Treatment

Three biological replicates

Each dataset contains…

More than 30,000 values, each corresponding to the intensity of a probe

You cannot find differentially expressed genes by eye..

At genome scale, there is too much data to look at

You need the skills to look through these data..
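To make the idea concrete, here is a small Python sketch that scans a table of probe intensities and flags genes whose average expression changes at least two-fold between control and treatment. The gene names and intensity values are invented for illustration, and the plain log2 fold-change cutoff is a simplifying assumption; a real analysis would also use normalization and a statistical test.

```python
import math
from statistics import mean

# Invented probe intensities: three control and three treatment replicates.
expression = {
    "geneA": {"control": [100.0, 110.0, 90.0], "treatment": [400.0, 420.0, 380.0]},
    "geneB": {"control": [200.0, 210.0, 190.0], "treatment": [205.0, 195.0, 200.0]},
    "geneC": {"control": [800.0, 820.0, 780.0], "treatment": [100.0, 90.0, 110.0]},
}


def log2_fold_change(control, treatment):
    """log2 of (mean treatment intensity / mean control intensity)."""
    return math.log2(mean(treatment) / mean(control))


def find_candidates(data, threshold=1.0):
    """Return genes whose absolute log2 fold change reaches the threshold."""
    hits = {}
    for gene, values in data.items():
        lfc = log2_fold_change(values["control"], values["treatment"])
        if abs(lfc) >= threshold:
            hits[gene] = lfc
    return hits


candidates = find_candidates(expression)
for gene, lfc in candidates.items():
    print(gene, round(lfc, 2))
```

The loop does not care whether the dictionary holds 3 genes or 30,000 probes.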

General-purpose software (like Office) is not enough for this purpose

Premade software for analyzing these data will not fit all your purposes

Many biologists use (more correctly, abuse) spreadsheet programs like Excel…

But Excel is not designed for this purpose. I will show why.

Excel is not enough..

Original gene names: Sep1, Sep2, Sep3, March1, March2

Example: human gene names (ftp://ftp.ebi.ac.uk/pub/databases/genenames/new/tsv/locus_groups/protein-coding_gene.txt)

Excel converts them to dates
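For comparison, here is a short Python sketch that reads a tab-separated gene table with the standard `csv` module. The two-column table is invented for this example (the symbols are the ones on the slide); the point is only that a script keeps every symbol as plain text unless you explicitly convert it.

```python
import csv
import io

# Invented two-column table; the gene symbols are the ones from the slide.
TSV_TEXT = "symbol\tvalue\nSep1\t10\nMarch1\t20\nSep3\t30\n"


def read_gene_table(handle):
    """Read (symbol, value) pairs; symbols stay strings, values become ints."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [(row["symbol"], int(row["value"])) for row in reader]


rows = read_gene_table(io.StringIO(TSV_TEXT))
symbols = [symbol for symbol, _ in rows]
print(symbols)
```

Nothing is silently reinterpreted as a date, because the only conversion is the `int(...)` call written in the script.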

Excel is not enough..

Can you find the nucleotide sequence of the selected gene?

Can you find the neighboring genes on the chromosome?

Can you find the genes homologous to the one in the current row?

Can you find the genes that belong to the same signaling pathway?

Can you find the common SNPs in a specific gene?

……

You probably cannot find the answers so easily with Excel…

Genome Level High-throughput Experiments

Genomic DNA

PCR

Transformation

What if you want to construct ‘genome wide library’?

Go to PubMed..

Find the appropriate gene sequence…

Cut & paste the nucleotide sequence into a web server for primer design…

Retrieve the designed primer sequences..

Then go back to the first step and repeat until finished

Order Primers..

Done?

YES

PCR

NO

Workflow

Probably okay if you have a few genes to work on.. (still OK if you have 10)

But..

If you have 100 genes…

Or 1,000 genes..

Or even the whole genome (10,000-20,000 genes): can you still do it this way?

You can do it manually..

The need for automation

It reminds me of something else…

Done?

YES

PCR

NO

Software flowchart

If we know exactly what we want to do, we can program and automate it.

…..Using Computer Software!
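The primer-ordering flowchart can be sketched as a loop in Python. This is only an illustration under stated assumptions: the two sequences are invented, the "design" step naively takes the first and last 18 bases (real primer design checks much more), and the melting temperature uses the rough Wallace rule, Tm = 2(A+T) + 4(G+C), which is meant for short oligos.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def wallace_tm(primer):
    """Rough melting temperature: 2*(A+T) + 4*(G+C), for short oligos only."""
    at = primer.count("A") + primer.count("T")
    gc = primer.count("G") + primer.count("C")
    return 2 * at + 4 * gc


def naive_primers(seq, length=18):
    """Forward primer = first bases; reverse primer = revcomp of last bases."""
    forward = seq[:length]
    reverse = reverse_complement(seq[-length:])
    return forward, reverse


# Invented sequences standing in for the genes of a library.
genes = {
    "gene1": "ATGGCCAAGCTTGGATCCGAATTCCTGCAGGTCGACTCTAGAGGATCCCCGGGTACCTAA",
    "gene2": "ATGAAACGCATTAGCACCACCATTACCACCACCATCACCATTACCACAGGTAACGGTTGA",
}

# "Done? NO -> next gene" from the flowchart becomes a simple for loop.
orders = []
for name, seq in genes.items():
    fwd, rev = naive_primers(seq)
    orders.append((name, fwd, wallace_tm(fwd), rev, wallace_tm(rev)))

for order in orders:
    print(*order, sep="\t")
```

Going from 2 genes to 20,000 only changes the size of the `genes` dictionary, not the amount of manual work.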

High-throughput chemical screening using zebrafish

More complicated Examples

http://www.bloodjournal.org/content/bloodjournal/119/24/5614/

You can get as many images as you want

But do you need to check all of these images one by one?

High-Content Screening

• Automatically analyze images
• Quantify the desired features in the images
• Find possible drug candidates or targets
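A toy version of "quantify a feature in every image" might look like the Python sketch below. The 4x4 grids of pixel intensities and the well names are invented, and "fraction of pixels above a threshold" is just an assumed stand-in for a real image feature; actual screens use dedicated image-analysis software.

```python
def bright_fraction(image, threshold=128):
    """Fraction of pixels whose intensity is at or above the threshold."""
    pixels = [value for row in image for value in row]
    bright = sum(1 for value in pixels if value >= threshold)
    return bright / len(pixels)


# Invented 4x4 grayscale "images", one per well of a screening plate.
images = {
    "well_A1": [[0, 0, 0, 0], [0, 200, 210, 0], [0, 220, 230, 0], [0, 0, 0, 0]],
    "well_A2": [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 150, 0], [0, 0, 0, 0]],
}

# Score every image the same way, then rank wells by the measured feature.
scores = {name: bright_fraction(img) for name, img in images.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Nobody has to look at the images one by one; only the top-ranked wells need a closer look.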

Computers

Originally developed for..

Performing complex and repetitive scientific calculations (tasks)

ENIAC Programming (Circa 1950)

Until the 1970s-1980s, the use of computers was primarily limited to scientific calculation..

The current widespread use of computers came after the introduction of the ‘personal computer’

The two Steves with their personal computer

People started to use the ‘personal computer’ for activities other than scientific research…

Nowadays most people have even forgotten that the computer was originally made to automate repetitive tasks in scientific research..

Your computer is not just for these tasks…

Use your computer for the purpose it was originally invented for! The computer was originally made for scientific research.

Software

To do a specific task on a computer, you need software for that task

If someone else has done the task before you, there is probably software for it..

Use it, if you can finish your task with it

But sometimes you cannot finish all of your work with premade software.

Web-based software, desktop-based software

Commercial Software vs Research Software

Commercial software

• User base: huge (millions)
• Features: very extensive
• Developed by commercial companies
• Cost: $$$
• Easy-to-use UI (user interface)

Research software

• User base: relatively small (1 to 1,000 or 10,000)
• Features: vary
• Developed in academia (maintained by a very small number of developers)
• Cost: mostly free
• Focused on function; sometimes not easy to use (making a good UI costs $$$)

Don’t expect “Office” quality from research software

“Oh, you can buy (expensive) software to analyze genome sequences”

“Or someone will make a web site to analyze data for my purpose…”

Sometimes you need to make your own tools (or get things done by combining pre-existing tools)

‘Doing research’ essentially means doing something to find a ‘novel’ thing.

Sometimes there will be no suitable tool for your novel research subject.

Do not expect to always find research software for your job

You need to use multiple software packages to get your job done

NGS sequencing (FASTQ format)

Align to reference sequences: short read mappers (BWA, Bowtie) (BAM format)

Find variants: GATK (VCF format)

Filter variants

Visualize: Integrative Genomics Viewer (IGV)
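As a small taste of the "filter variants" step, here is a Python sketch that reads VCF-style text and keeps variants above a quality cutoff. The three variant lines are invented, only the first six tab-separated columns (CHROM, POS, ID, REF, ALT, QUAL) are used, and the cutoff of 30 is an arbitrary assumption for illustration; real VCF files can also have a missing QUAL ("."), which this sketch does not handle.

```python
# Invented VCF-style text: header lines start with '#', data lines are tab-separated.
VCF_TEXT = """##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
chr1\t1000\t.\tA\tG\t50.0\tPASS\t.
chr1\t2000\t.\tC\tT\t12.5\tPASS\t.
chr2\t3000\t.\tG\tA\t99.0\tPASS\t.
"""


def parse_vcf(text):
    """Parse data lines into dictionaries, skipping header lines."""
    variants = []
    for line in text.splitlines():
        if not line or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alt, qual = line.split("\t")[:6]
        variants.append(
            {"chrom": chrom, "pos": int(pos), "ref": ref, "alt": alt, "qual": float(qual)}
        )
    return variants


def filter_by_quality(variants, min_qual=30.0):
    """Keep variants whose QUAL value is at least min_qual."""
    return [v for v in variants if v["qual"] >= min_qual]


kept = filter_by_quality(parse_vcf(VCF_TEXT))
for v in kept:
    print(v["chrom"], v["pos"], v["ref"], v["alt"], v["qual"])
```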

X-ray Crystal Diffraction Data

Space group determination

Phasing

Refine

Model building

Data Reduction

Visualize

Each step of the computational analysis requires different software

Command Line

muscle

MUSCLE v3.8.31 by Robert C. Edgar

http://www.drive5.com/muscle
This software is donated to the public domain.
Please cite: Edgar, R.C. Nucleic Acids Res 32(5), 1792-97.

Basic usage

    muscle -in <inputfile> -out <outputfile>

Common options (for a complete list please see the User Guide):

    -in <inputfile>    Input file in FASTA format (default stdin)
    -out <outputfile>  Output alignment in FASTA format (default stdout)
    -diags             Find diagonals (faster for similar sequences)
    -maxiters <n>      Maximum number of iterations (integer, default 16)
    -maxhours <h>      Maximum time to iterate in hours (default no limit)
    -html              Write output in HTML format (default FASTA)
    -msf               Write output in GCG MSF format (default FASTA)
    -clw               Write output in CLUSTALW format (default FASTA)
    -clwstrict         As -clw, with 'CLUSTAL W (1.81)' header
    -log[a] <logfile>  Log to file (append if -loga, overwrite if -log)
    -quiet             Do not write progress messages to stderr
    -version           Display version information and exit

Without refinement (very fast, avg accuracy similar to T-Coffee): -maxiters 2
Fastest possible (amino acids): -maxiters 1 -diags -sv -distance1 kbit20_3
Fastest possible (nucleotides): -maxiters 1 -diags
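A command like this can also be launched from a script. The Python sketch below builds the argument list using only the `-in`, `-out` and `-maxiters` options listed in the help text above (other MUSCLE versions may use different options), and runs it only if a `muscle` executable is found on the PATH. The file names are hypothetical.

```python
import shutil
import subprocess


def muscle_command(input_fasta, output_fasta, max_iters=None):
    """Build the argument list for MUSCLE v3.8-style options."""
    command = ["muscle", "-in", input_fasta, "-out", output_fasta]
    if max_iters is not None:
        command += ["-maxiters", str(max_iters)]
    return command


def run_muscle(input_fasta, output_fasta, max_iters=None):
    """Run MUSCLE if it is installed; return True if the command was executed."""
    if shutil.which("muscle") is None:
        return False  # MUSCLE is not on the PATH, so nothing is run
    subprocess.run(muscle_command(input_fasta, output_fasta, max_iters), check=True)
    return True


# Hypothetical file names, used here only to show the resulting command line.
command = muscle_command("kinases.fasta", "kinases_aligned.fasta", max_iters=2)
print(" ".join(command))
```

Calling `run_muscle(...)` inside a loop over many input files is then all it takes to align them with identical settings.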

Command-line driven software

You need to type a command to run the software

Not all scientific software has a ‘user-friendly’ interface

Developing a user interface requires time and $$$$

Most academic software developers have neither :(

Advantage of Command Line

You can combine various software modules into a “pipeline”

• Software 1 -> software 2 -> software 3 -> software 4 -> final

• You can customize & automate your workflow

Reproducible Research

• All the conditions used to run the software are preserved as ‘scripts’

• You can analyze various samples under the same conditions
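Here is a minimal Python sketch of that idea: a few small steps chained into a pipeline, with all settings kept in one place so that every sample is processed under the same conditions. The steps (trimming trailing N bases, a length check, a GC-content check), the parameter values, and the sample sequences are all invented for illustration.

```python
# All settings live in one place; the script itself documents the conditions.
PARAMS = {"min_length": 5, "min_gc": 0.4}


def trim(seq):
    """Step 1: remove trailing ambiguous bases (N)."""
    return seq.rstrip("N")


def gc_content(seq):
    """Step 2: fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)


def run_pipeline(seq, params):
    """Step 1 -> length check -> step 2 -> final result for one sample."""
    trimmed = trim(seq)
    if len(trimmed) < params["min_length"]:
        return None  # too short after trimming, sample is dropped
    gc = gc_content(trimmed)
    return {"sequence": trimmed, "gc": gc, "passes": gc >= params["min_gc"]}


# Invented sample sequences.
samples = {"sample1": "ATGCGCNN", "sample2": "ATATATATNN", "sample3": "ACNN"}

# Every sample goes through exactly the same steps with the same parameters.
results = {name: run_pipeline(seq, PARAMS) for name, seq in samples.items()}
for name, result in results.items():
    print(name, result)
```

Rerunning the script later, or on new samples, reproduces the analysis without anyone having to remember which settings were used.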

Reproducible Research

Different setups may cause differences in the analysis

In GUI environments, you change the setup each time, so getting consistent results is sometimes challenging..

With the command line and scripting, you can document your research

It is like writing a ‘lab notebook’ for wet lab experiments

http://nbviewer.ipython.org/gist/hyeshik/cf9f3d7686e07eedbfda?revision=6

Don’t be afraid of the command line

Some time ago, we used the command line for everyday computing tasks

After the Windows & Internet era, people started to be afraid of the command line…

Don’t be afraid of it. It doesn’t bite. (If you know how to use it)

The key to mastering this course is to conquer the command line interface

“Experimental Protocol” vs “Scripting”

Wet lab experiments (experimental protocol): Sacrifice animal -> DNA extraction -> PCR amplification -> Sequencing

Analysis (scripting): Read data -> Alignment -> Quantification -> Differentially expressed genes

Kit vs Premade Software

Kits for wet lab experiments

We can do routine, well-defined experiments using premade kits

Even without understanding the principle of the experiment, we can get results..

Kit vs Premade Software

Web servers for bioinformatics analysis

We can do routine, well-defined analyses using premade web servers or software

Even without understanding the principle of the analysis, we can get results..

Limitation of ‘Kit’

You cannot do all of your experiments using ‘kit’

If you are the first one to develop a protocol, there will be no kit for your experiment

Sometimes you need to buy the individual components of an experiment and optimize them

The same applies to bioinformatics analysis.

Sometimes your analysis cannot be done with a ‘premade kit’. Then what will you do?

Purpose of This Lecture

You will learn how to make your own ‘kit’ using preexisting components

It will require…

A little knowledge of Unix-like operating systems..

Knowing how to operate a computer from the command line

A little bit of programming..

Why Unix?

Most scientific software was developed on Unix-based operating systems

Examples of Unix-based operating systems

- Linux (Ubuntu, Fedora, RedHat…)
- Mac OS

Windows is not Unix-based

Most bioinformatics and biology-related software was developed on Unix-based systems

It is more convenient to use the command line on a Unix-based system

But I know most of you are using Windows.

Three alternatives

1. You can install Linux on your computer (you need to format your computer)

2. You can install Linux inside Windows (using a ‘virtual machine’)

https://www.virtualbox.org/

3. Install Cygwin (https://www.cygwin.com/)

Use Windows with Unix-like commands

If you are a Mac user..

Applications – Utilities – Terminal

The Mac is a Unix-based computer, so you don’t need anything else..

First assignment

To the next class, please bring a notebook computer on which you can use the Unix command line

Select one of these options

- Install Linux
- Install Linux via VirtualBox
- Install Cygwin
- Bring a Mac

How to install?

Google it!

Your ‘Cheat Key’ for Life

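Bioinformatics is.. not some kind of fusion of computer science and biology.

The computer is merely a tool for handling very large data

Bioinformatics is a research technique of modern biology for handling large amounts of data

Bioinformatics is biology

Of course, studying bioinformatics may require knowledge of computer science, statistics, programming, mathematics, and so on.

But bioinformatics is by no means a subfield of computer science, statistics, programming, or mathematics.

• At CERN (the European Organization for Nuclear Research), where the experiments to discover the Higgs particle were carried out, complex programming, mathematics, and statistics are also used to analyze massive amounts of data.

• We call these people ‘physicists’.

• The people who analyze the massive data associated with living organisms are ‘biologists’, and the methods they use for that analysis are bioinformatics methods.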

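Misconceptions

Biologist

Produces data: “The people who are good with computers will do all the analysis for us”

Bioinformatician

Analyzes data: “I don’t really know what this experimental result is, but I can just run a program and draw a pretty figure”

This is the fundamental reason why such research has not worked well

If you come from a biology background..

• Understand the basic theory of bioinformatics analysis
• Grasp the basic concepts of bioinformatics analysis
• Understand basic programming concepts
• Understand that bioinformatics analysis is not a ‘black box’

If you come from a programming background

• Understand basic biology concepts
• Understand how the data are produced
• Recognize that experimental data always contain errors

Men from Mars and women from Venus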

In the next class..

Please bring a notebook computer on which you can use the command line interface

One and a half hours of lecture

One and a half hours of practice & question session

We will talk about the basics of the UNIX command line interface and install BLAST on your computer
