SAM and BAM formats

Amel Ghouila, Claudia Chica, Emna Achouri & Fatma Guerfali
C3BI Hands-on NGS course – IPP – 23rd Nov 2016


SAM, BAM formats

Raw sequence data (FASTQ files) → Mapping (Bowtie, BWA or others) → BAM/SAM files

• After mapping the FASTQ file to the reference genome, you will end up with a SAM or BAM alignment file.

• SAM stands for Sequence Alignment/Map format.

• A single SAM file can store mapped, unmapped, and even QC-failed reads from a sequencing run; once sorted and converted to BAM, it can be indexed to allow rapid access. Because all reads are kept, the raw sequencing data can be fully recapitulated from the SAM/BAM file.


SAM format

(Figure: overview of the SAM format. Source: Li Shen, 2014)


• SAM is rarely useful as a working format and takes up a lot of space, which is why, as a rule, we use only BAM.

• A BAM file (.bam) is the binary version of a SAM file (it saves storage and is faster to manipulate).


• A SAM file (.sam) is a tab-delimited text file that contains sequence alignment data.

• SAM files can be opened in a text editor or viewed with the UNIX "more" command.

• Most alignment programs will supply:

  - a header: describes the format version, the sorting order of the reads and the genomic sequences to which the reads were mapped

  - an alignment section: contains, for each sequence, the information about where and how it aligns to the reference genome
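The two-part layout described above can be illustrated with a minimal Python sketch. It relies on the SAM convention that header lines start with "@". The function name `split_sam` and the three example lines are made up for illustration and are not taken from the course material.

```python
def split_sam(text):
    """Split the text of a SAM file into header lines and alignment lines."""
    header, alignments = [], []
    for line in text.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        if line.startswith("@"):
            header.append(line)      # header lines start with '@'
        else:
            alignments.append(line)  # everything else is an alignment record
    return header, alignments


# Made-up example: two header lines and one alignment record.
example = "\n".join([
    "@HD\tVN:1.0\tSO:coordinate",
    "@SQ\tSN:ref\tLN:45",
    "r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*",
])

header, alignments = split_sam(example)
print("header lines:", len(header))
print("alignment lines:", len(alignments))
```

For real files, reading line by line from an open file handle would avoid loading the whole SAM file into memory.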


(Figure: example SAM file showing the header, followed by the alignment section with its 11 tab-separated columns)


SAM format

http://samtools.sourceforge.net/SAM1.pdf
http://genome.sph.umich.edu/wiki/SAM

The 11 columns of the alignment section, in order:

QNAME  FLAG  RNAME  POS  MAPQ  CIGAR  RNEXT  PNEXT  TLEN  SEQ  QUAL
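As a rough sketch of how these 11 columns map onto one line of a SAM file, the Python snippet below splits an alignment line on tabs and names each field. The `SamRecord` type, the `parse_alignment` function and the example record are hypothetical, introduced only for illustration. Optional tags after column 11 are ignored here.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM columns, in file order."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


def parse_alignment(line: str) -> SamRecord:
    """Parse one tab-separated alignment line (optional tags are dropped)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Made-up record for illustration.
record = parse_alignment("r001\t163\tref\t7\t30\t8M\t=\t37\t39\tTTAGATAA\t*")
print(record.rname, record.pos, record.cigar)
```

In practice a dedicated library would normally be used to read SAM/BAM files; the sketch is only meant to make the column order concrete.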


SAM format

QNAME: Query template NAME. Reads/segments having identical QNAME are regarded as coming from the same template. A QNAME of '*' indicates that the information is unavailable.

(http://samtools.github.io/hts-specs/SAMv1.pdf)


SAM format (2)

FLAG: bitwise FLAG (ideal for compression).

12 boolean flags, all stored in a single column.

(http://samtools.github.io/hts-specs/SAMv1.pdf)


SAM flag: example

In the example SAM file, the read mapped to position 7 has FLAG 163 (= 1 + 2 + 32 + 128):

- The read is paired and mapped in a proper pair (1 + 2)

- Its mate is mapped to position 37 on the reverse strand (32)

- The read is the second read in the pair (128)
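The decomposition of FLAG 163 can be reproduced with a few lines of Python. This is a minimal sketch: the bit values follow the SAM specification, the short descriptions are paraphrased, and the names `FLAG_BITS` and `explain_flag` are made up for this example.

```python
# Bit values from the SAM specification; descriptions are paraphrased.
FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "not primary alignment"),
    (0x200, "read fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]


def explain_flag(flag):
    """Return the (bit value, meaning) pairs that are set in a SAM FLAG."""
    return [(bit, meaning) for bit, meaning in FLAG_BITS if flag & bit]


for bit, meaning in explain_flag(163):
    print(bit, meaning)
```

Each flag is tested with a bitwise AND, so any combination of flags stored in the single FLAG column can be decoded the same way.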


Decoding SAM flags

Explain flags tool: https://broadinstitute.github.io/picard/explain-flags.html


SAM format (3)

MAPQ: MAPping Quality. It equals −10 log10 Pr{mapping position is wrong}, rounded to the nearest integer.

The MAPQ value can be used to figure out how unique an alignment is in the genome.

✓ A large number (> 10) indicates that the alignment is likely to be unique; the exact scale depends on the aligner.

✓ 255 indicates that the mapping quality is not available.

(http://samtools.github.io/hts-specs/SAMv1.pdf)
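To make the MAPQ formula concrete, the sketch below converts between a MAPQ value and the probability that the mapping position is wrong. The function names are hypothetical. The code simply applies the formula, whereas real aligners estimate and cap MAPQ in their own ways.

```python
import math


def mapq_to_error_probability(mapq):
    """Probability that the mapping position is wrong, or None if MAPQ is 255."""
    if mapq == 255:
        return None  # 255 means the mapping quality is not available
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong):
    """MAPQ = -10 * log10(p_wrong), rounded to the nearest integer (p_wrong > 0)."""
    return round(-10 * math.log10(p_wrong))


for q in (0, 10, 20, 30):
    print(q, mapq_to_error_probability(q))
```

Under this formula, a MAPQ of 10 corresponds to a 1 in 10 chance that the position is wrong, and each additional 10 points divides that chance by 10.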


SAM format: CIGAR string

• The CIGAR string is a sequence of numbers and letters that describes how the bases of a read align to the reference: which bases align with the reference (as either a match or a mismatch), which reference bases are missing from the read (deletions), and which read bases are not in the reference (insertions).

More information about these formats is available here:
http://samtools.sourceforge.net
https://samtools.github.io/hts-specs/SAMv1.pdf


SAM format: CIGAR string

Mapped and unmapped reads are imported into SAM/BAM format.

The standard CIGAR description of a pairwise alignment defines three operations: 'M' for alignment match (a sequence match or mismatch), 'I' for insertion compared with the reference and 'D' for deletion.

POS: 5
CIGAR: 3M1I3M1D2M

(NB: the POS indicates that the read aligns starting at position 5 on the reference.)

The CIGAR:

- 3M = 3 bases in the read sequence align with the reference.

- 1I = the next base in the read does not exist in the reference.

- 1D = the reference base does not exist in the read sequence.

http://genome.sph.umich.edu/wiki/SAM
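The example above (POS 5, CIGAR 3M1I3M1D2M) can be worked through in Python. This is a minimal sketch under stated assumptions: the helper names `parse_cigar` and `spans` are made up, and the operations beyond M, I and D (N, S, H, P, =, X), together with which of them consume read or reference bases, are taken from the SAM specification rather than from the slide.

```python
import re

CIGAR_TOKEN = re.compile(r"(\d+)([MIDNSHP=X])")
CONSUMES_READ = set("MIS=X")       # operations that use up bases of the read
CONSUMES_REFERENCE = set("MDN=X")  # operations that use up reference positions


def parse_cigar(cigar):
    """Turn a CIGAR string such as '3M1I3M1D2M' into (length, operation) pairs."""
    if not re.fullmatch(r"(?:\d+[MIDNSHP=X])+", cigar):
        raise ValueError(f"not a valid CIGAR string: {cigar!r}")
    return [(int(length), op) for length, op in CIGAR_TOKEN.findall(cigar)]


def spans(pos, cigar):
    """Return (read length, last reference position covered), 1-based POS."""
    ops = parse_cigar(cigar)
    read_len = sum(n for n, op in ops if op in CONSUMES_READ)
    ref_len = sum(n for n, op in ops if op in CONSUMES_REFERENCE)
    return read_len, pos + ref_len - 1


print(parse_cigar("3M1I3M1D2M"))
read_len, ref_end = spans(5, "3M1I3M1D2M")
print("read length:", read_len, "last reference position:", ref_end)
```

For this example the read is 9 bases long (3 + 1 + 3 + 2) and covers reference positions 5 to 13, because the inserted base uses no reference position while the deleted base does.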


SAM format: CIGAR string

Examples of CIGAR strings for different types of alignments (figure panels: the alignments and the corresponding SAM file; Li et al., 2009)


SAM format (5)

RNEXT: reference sequence name of the mate (mate pair information for paired-end sequencing)

PNEXT: position of the mate (mate pair information)

Obviously, the chromosome and position are important. The CIGAR string is also important for knowing where gaps might exist in your read: insertions, deletions, or skipped reference regions such as introns (CIGAR operation 'N').

(http://samtools.github.io/hts-specs/SAMv1.pdf)