next-gen sequencing informatics primer - boston · pdf filenext-gen sequencing informatics...

35
next-gen sequencing informatics primer Adrian Heilbut <[email protected] > BU Bioinfo Program Retreat May 22 2010

Upload: vuonghanh

Post on 19-Mar-2018

220 views

Category:

Documents


3 download

TRANSCRIPT

next-gen sequencing informatics primer

Adrian Heilbut <[email protected]>

BU Bioinfo Program RetreatMay 22 2010

1. Technologies & Costs2. Applications

Library Design3. Informatics

Inputs: FASTQAlignment AlgorithmsOutputs: SAM/BAMQC issuesAnalysis: SVs, SNPs

growth in sequencing DBsTHE SEQUENCE EXPLOSION

SOU

RCE:

NCB

I; G

RAPH

ICS

BY N

. SPE

NCE

R &

W. F

ERN

AN

DES

Nature, 1 April 2010

Technologies

454 ABI SolidIllumina (Solexa)Helicos

PacBio

Complete Genomics

Ion Torrent

454 (Roche)

Nature Reviews | Microbiology

Adaptors

gDNA

a 454 technology

Beads

gDNA

gDNA

b SOLiD technology

c SOLEXA technology

Beads

PCR

PCR

Enzymes on beadsand primer

Polymerase

PPi

ATP

Light

Ligase

Sample preparation Pyrosequencing

Sample preparation Sequencing by synthesis

Sample preparation Sequencing by ligation

T

A CG

Polymerase

AT C

AT

C

T

Adaptors

Adaptors

PCR

T/A/G/C

Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are

deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.

REVIEWS

422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro

• first 2nd generation technology - 2005• can do long reads (400bp)• suitable for de novo genome sequencing of

moderately sized genomes

SOLID

Nature Reviews | Microbiology

Adaptors

gDNA

a 454 technology

Beads

gDNA

gDNA

b SOLiD technology

c SOLEXA technology

Beads

PCR

PCR

Enzymes on beadsand primer

Polymerase

PPi

ATP

Light

Ligase

Sample preparation Pyrosequencing

Sample preparation Sequencing by synthesis

Sample preparation Sequencing by ligation

T

A CG

Polymerase

AT C

AT

C

T

Adaptors

Adaptors

PCR

T/A/G/C

Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are

deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.

REVIEWS

422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro

• 60 gigabases per run• “emulsion PCR”• sequencing by ligation• colorspace output

SOLiD sequencing (I)

APPLICATION NOTE Di-Base Sequencing and the Advantages of Color Space Analysis

Principles of Di-Base Sequencing and the Advantages of Color Space Analysis in the SOLiD™ System

IntroductionThe SOLiD™ System is the only next generation sequencing system to employ ligation based chemistry with di-base labeled probes. This unique approach enables a method which provides signifi cant advantages in terms of system accuracy and downstream data analysis.

• Unique built-in error checking capability distinguishes between measurement errors and true polymorphisms

• Ability to detect more complicated genomic variation such as adjacent SNPs, insertions, deletions and structural rearrangements

• Final data reported as standard base calls relative to provided reference sequence

The process of 2 base encoding and the benefi ts of performing analysis in the di-base alphabet (a.k.a. “color space”) are described below.

Principles of Ligation Based Chemistry and 2 Base EncodingThe SOLiD™ System enables massively parallel sequencing of clonally amplifi ed DNA fragments linked to beads. This unique sequencing methodology is based on sequential ligation of dye labeled oligonucleotide probes whereby each probe queries two base positions at a time. The system uses four fl uorescent dyes to encode for the sixteen possible two base combinations. Multiple ligation cycles

Ligation Cycle

Prim

er R

ound

1 2 3 4 5 6 7

3’AT

TA

Universal seq primer (n)

3’P1 Adapter Template Sequence

POH

Universal seq primer (n-1)

Ligase

Phosphatase

+

1. Prime and Ligate

2. Image 5. Repeat steps 1-4 to Extend Sequence

3’

Universal seq primer (n-1)1. Melt off extended sequence

2. Primer reset3’

AA CA

GCGT

CG TC AA

AG GG

CC

T TT

6. Primer Reset

7. Repeat steps 1-5 with new primer

8. Repeat Reset with , n-2, n-3, n-4 primers

AT

TA

TA3’

Excite Fluorescence

ATAA GA CA AATATT CT GT TT CA

GTGCCG

3’

4. Cleave off Fluor

AT

TA3’

Cleavage Agent

P

HO

3. Cap Unextended Strands

3’

PO4

1 2 3 4 5 6 7 ... (n cycles)Ligation cycle

3’

3’1 !mbead

1 !mbead

1 !mbead

-1

Universal seq primer (n-1)

Universal seq primer (n)

Universal seq primer (n-2)

Universal seq primer (n-3)

Universal seq primer (n-4)

3’

3’

3’

3’

3’

1

2

3

4

5

PRIMER ROUND 1

PRIMER ROUND 2

1 base shift

Read Position

Indicates positions of interogation

35343332313029282726252423222120191817161514131211109876543210

Figure 1: SOLiD™ System – Sequencing by ligation using di-base labeled probes

108235.indd 2108235.indd 2 4/25/08 1:09:15 PM4/25/08 1:09:15 PM

SOLiD sequencing (II)

APPLICATION NOTE Di-Base Sequencing and the Advantages of Color Space Analysis

Principles of Di-Base Sequencing and the Advantages of Color Space Analysis in the SOLiD™ System

IntroductionThe SOLiD™ System is the only next generation sequencing system to employ ligation based chemistry with di-base labeled probes. This unique approach enables a method which provides signifi cant advantages in terms of system accuracy and downstream data analysis.

• Unique built-in error checking capability distinguishes between measurement errors and true polymorphisms

• Ability to detect more complicated genomic variation such as adjacent SNPs, insertions, deletions and structural rearrangements

• Final data reported as standard base calls relative to provided reference sequence

The process of 2 base encoding and the benefi ts of performing analysis in the di-base alphabet (a.k.a. “color space”) are described below.

Principles of Ligation Based Chemistry and 2 Base EncodingThe SOLiD™ System enables massively parallel sequencing of clonally amplifi ed DNA fragments linked to beads. This unique sequencing methodology is based on sequential ligation of dye labeled oligonucleotide probes whereby each probe queries two base positions at a time. The system uses four fl uorescent dyes to encode for the sixteen possible two base combinations. Multiple ligation cycles

Ligation Cycle

Prim

er R

ound

1 2 3 4 5 6 7

3’AT

TA

Universal seq primer (n)

3’P1 Adapter Template Sequence

POH

Universal seq primer (n-1)

Ligase

Phosphatase

+

1. Prime and Ligate

2. Image 5. Repeat steps 1-4 to Extend Sequence

3’

Universal seq primer (n-1)1. Melt off extended sequence

2. Primer reset3’

AA CA

GCGT

CG TC AA

AG GG

CC

T TT

6. Primer Reset

7. Repeat steps 1-5 with new primer

8. Repeat Reset with , n-2, n-3, n-4 primers

AT

TA

TA3’

Excite Fluorescence

ATAA GA CA AATATT CT GT TT CA

GTGCCG

3’

4. Cleave off Fluor

AT

TA3’

Cleavage Agent

P

HO

3. Cap Unextended Strands

3’

PO4

1 2 3 4 5 6 7 ... (n cycles)Ligation cycle

3’

3’1 !mbead

1 !mbead

1 !mbead

-1

Universal seq primer (n-1)

Universal seq primer (n)

Universal seq primer (n-2)

Universal seq primer (n-3)

Universal seq primer (n-4)

3’

3’

3’

3’

3’

1

2

3

4

5

PRIMER ROUND 1

PRIMER ROUND 2

1 base shift

Read Position

Indicates positions of interogation

35343332313029282726252423222120191817161514131211109876543210

Figure 1: SOLiD™ System – Sequencing by ligation using di-base labeled probes

108235.indd 2108235.indd 2 4/25/08 1:09:15 PM4/25/08 1:09:15 PM

SOLIDof probe hybridization, ligation, imaging and analysis are performed to extend the strand from a primer hybridized to a ligated adaptor proximal to the immobilized bead (P1 adaptor). The resulting product is then removed and the process repeated for 5 more extension rounds with primers hybridized to positions n-1, n-2 etc., in the P1 adaptor (Figure 1 on previous page).

There are several fundamental properties unique to ligation based sequencing which contribute to the high accuracy inherent to the SOLiD system. The advantages of these properties and their contribution to data quality are described below.

1. Two bases are interrogated in each ligation reaction providing increased specifi city.

2. The primer is periodically reset for 5 independent rounds of extension improving the signal to noise ratio of the system.

3. Each base is interrogated twice in independent primer rounds providing increased confi dence in each call.

4. Four dyes are used to encode for sixteen possible two base combinations. The design of the encoding matrix enables built in error checking capability.

Fundamentals of SOLiD Color Space“Color space” is not a new concept. Data from Sanger sequencing is also encoded in color space by the four dyes used in the sequencing chemistry and displayed as peaks in an electropherogram. One difference between Sanger and SOLiD color space is that previously, in Sanger sequencing, each color represented only a single nucleotide and was automatically translated to A,C,G or T. With the SOLiD System each color now represents 4 potential two base combinations and the conversion into nucleotide base space is usually done after the sequence is aligned to a reference genome transcribed in color space. Alternatively, translation can occur following generation of a consensus

sequence. Since this system uses 4 fl uorescent dyes, there are 16 possible two color combinations (Figure 2).

Please note that the SOLiD color space schema was specifi cally designed to have the following properties which enable the unique error checking capability of the SOLiD™ System.

• For each di-base the reverse (e.g., CA and AC) is always in the same color

• For each di-base the complement (e.g., CA and GT) is always in the same color

• For each di-base the reversed complement (e.g., CA and TG) is always in the same color

Advantages of 2 base encoding and color space — higher accuracy for SNP detectionTwo base encoding provides higher system accuracy and built-in error checking capability which enables discrimination between measurement errors and true polymorphisms. Since each base is interrogated twice in independent reactions, the information about each base is included in two adjacent pieces of color space data. There are 16 possible two color combinations that may encode for a transition at any given single base position (Table 1). A transition at that position will result in a characteristic change that is limited to a subset of all the possible positions. This restriction allows for anomalies to be easily detected and discarded as errors.

Figure 2. SOLiD Color Space Code — Four dyes encode for sixteen potential two base combinations

Possible Dinucleotides Encoded By Each Color

A

A

C

C

G

G

T

T

1st B

ase

2nd Base

TACGGCTA

ACCAGTTG

AAC CGGT T

GATCAGCT

A AT G G

With 2 base encoding each base is defined twice

Double Interrogation

Template Sequence

All 2 color possibilities are represented for the four possible dyes. Four sets of allowable changes are classifi ed by color. Changes that fall outside a particular set can be classifi ed as possible errors for further analysis.

Blue Green Yellow Red

Blue BB BG BY BR

Green GB GG GY GR

Yellow YB YG YY YR

Red RB RG RY RR

TABLE 1. COLOR SPACE TRANSITIONS

108235.indd 3108235.indd 3 4/25/08 1:09:15 PM4/25/08 1:09:15 PM

Illumina

• 20-200 gigabases per run (depending on instrument)

Nature Reviews | Microbiology

Adaptors

gDNA

a 454 technology

Beads

gDNA

gDNA

b SOLiD technology

c SOLEXA technology

Beads

PCR

PCR

Enzymes on beadsand primer

Polymerase

PPi

ATP

Light

Ligase

Sample preparation Pyrosequencing

Sample preparation Sequencing by synthesis

Sample preparation Sequencing by ligation

T

A CG

Polymerase

AT C

AT

C

T

Adaptors

Adaptors

PCR

T/A/G/C

Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are

deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.

REVIEWS

422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro

illumina sequencing (I)TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING

5. DENATURE THE DOUBLE-STRANDED MOLECULES

1. PREPARE GENOMIC DNA SAMPLE

Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.

2. ATTACH DNA TO SURFACE

Bind single-stranded fragments randomly to the inside surface of the flow cell channels.

3. BRIDGE AMPLIFICATION

Add unlabeled nucleotides and en-zyme to initiate solid-phase bridge amplification.

4. FRAGMENTS BECOME DOUBLE-STRANDED

The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.

6. COMPLETE AMPLIFICATION

Several million dense clusters of double-stranded DNA are generat-ed in each channel of the flow cell.

Denaturation leaves single-stranded templates anchored to the substrate.

SEQUENCING TECHNOLOGY OVERVIEW

DNA

Adapters Adapter

DNA fragment

Dense lawn of primers

Adapter

Attached terminus

Attached terminus

Free terminus

Attached

Attached

Clusters

illumina sequencing (II)

TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING

5. DENATURE THE DOUBLE-STRANDED MOLECULES

1. PREPARE GENOMIC DNA SAMPLE

Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.

2. ATTACH DNA TO SURFACE

Bind single-stranded fragments randomly to the inside surface of the flow cell channels.

3. BRIDGE AMPLIFICATION

Add unlabeled nucleotides and en-zyme to initiate solid-phase bridge amplification.

4. FRAGMENTS BECOME DOUBLE-STRANDED

The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.

6. COMPLETE AMPLIFICATION

Several million dense clusters of double-stranded DNA are generat-ed in each channel of the flow cell.

Denaturation leaves single-stranded templates anchored to the substrate.

SEQUENCING TECHNOLOGY OVERVIEW

DNA

Adapters Adapter

DNA fragment

Dense lawn of primers

Adapter

Attached terminus

Attached terminus

Free terminus

Attached

Attached

Clusters

illumina sequencing (III)TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING

7. DETERMINE FIRST BASE

The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase.

8. IMAGE FIRST BASE 9. DETERMINE SECOND BASE

The next cycle repeats the incor-poration of four labeled reversible terminators, primers, and DNA polymerase.

10. IMAGE SECOND CHEMISTRY CYCLE

11. SEQUENCING OVER MUL-TIPLE CHEMISTRY CYCLES

12. ALIGN DATA

After laser excitation, the image is captured as before, and the identity of the second base is recorded.

The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

The data are aligned and com-pared to a reference, and sequenc-ing differences are identified.

After laser excitation, the emit-ted fluorescence from each cluster is captured and the first base is identified.

LaserLaser

GCTGA...

illumina sequencing (IV)

TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING

7. DETERMINE FIRST BASE

The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase.

8. IMAGE FIRST BASE 9. DETERMINE SECOND BASE

The next cycle repeats the incor-poration of four labeled reversible terminators, primers, and DNA polymerase.

10. IMAGE SECOND CHEMISTRY CYCLE

11. SEQUENCING OVER MUL-TIPLE CHEMISTRY CYCLES

12. ALIGN DATA

After laser excitation, the image is captured as before, and the identity of the second base is recorded.

The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.

The data are aligned and com-pared to a reference, and sequenc-ing differences are identified.

After laser excitation, the emit-ted fluorescence from each cluster is captured and the first base is identified.

LaserLaser

GCTGA...

Current Costs (eg. Illumina)

~$1k per lane (54 cycle, single end)

~$2k per lane (74 cycle, paired end)-> 6 gigabases of sequence-> 1-2x human genome coverage

Applications & Sample Prep

• De Novo sequencing

• Resequencing

• SNP discovery

• Structural variation

• RNA-seq

• gene expression

• small RNAs

Library Preparation

• ~ 3ug source DNA needed

• whole genome amplification

• every PCR step adds potential bias and artifacts

• Paired-ends, Jumping libraries

• increase effective coverage

Hybrid Capture / Enrichment

• Pre-designed exon capture arrays

• Design-your-own capture arrays

• may require optimization / iteration

DNA Sample Preparation 2

Agilent SureSelect DNA Capture Array Protocol 17

Figure 2 SureSelect Target Enrichment System Capture Process

Paired-End & Mate-Pair Libraries10

Cat # PE-930-1003Part # 15008135 Rev. A

Figure 9 Origin and Alignment of Inward and Outward-Facing Reads

B

Circularized Molecule

Outer Ends

Alignment of Outward and Inward FacingReads to Reference

400 bp Gap Size

BB

3.1 kb Gap Size

Outward Facing Read

Inward Facing Read

Sheared Molecule

Outer Ends

B

Internal Fragment

BRead 1

Read 2

3.5 kb DNA Moleculewith Biotin End Labels

Outer End

Outer End

BB

Biotinylated End Fragment

Read 1

Read 2

Unbiotinylated Internal Fragment

Sequencing of Sheared Fragments

Source: Illumina Mate Pair Library V2 Sample Prep Guide

Informatics

Read files & quality scoresPrimary QC Quality recalibrationAlignment algorithmsAlignment output formatsOpen problems?

FASTQ

@EAS54_6_R1_2_1_413_324CCCTTCTTGTCTTCAGCGTTTCTCC+;;3;;;;;;;;;;;;7;;;;;;;88@EAS54_6_R1_2_1_540_792TTGGCAGGCCAAGGCCGATGGATCA+;;;;;;;;;;;7;;;;;-;;;3;83@EAS54_6_R1_2_1_443_348GTTGCTTCTGGCGTGGGTGGGGGGG+EAS54_6_R1_2_1_443_348;;;;;;;;;;;9;7;;.7;393333

PHRED quality encoding

Encoded in in FASTQ / SAM by quality string ofASCII value - 33scales differ slightly betweeen PHRED and older versions of Illumina outputs

Q = −10log10P P = 10−Q10

QC

Quality vs. Cycle; Quality vs. Tile

Often bias towards errors at 3’ end

Fastx_tools, PIQA helpful tools for visualizing quality

Aligner Performance Varies WidelyShort-read alignment Short-read aligners

The speed varies...

by Bala et al.Heng Li (Broad Institute) BWA 4 Feburary 2010 6 / 17

Bala et al. via Heng Li

Hashing / Spaced Seed Aligners

Hash Queries: Eland (Cox - illumina) MAQ (Li, Durbin)Hash Genome: Mosaik (Stromberg, Marth

Burrows-Wheeler Alignment

BWA (Li, Durbin)Bowtie (UMD) SOAP2 (BGI)Use FM-index data structure to effectively maintain a suffix array index of the genome in < 3Gb RAMFast exact matchingMismatches searched by varying query sequence

SAM

Sequence Alignment / Map Formatstandardized alignment output format supported by most modern alignment programssimple, tab-delimited text-file with an optional bzip-compressed binary encoding (BAM)

SAM Record Format2.2.1. Overview

The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> \ [<TAG>:<VTYPE>:<VALUE> [...]]

The format of each field is explained in the following table. More detailed descriptions are given in the sections below.

Field Regular expression Range Description

QNAME [^ \t\n\r]+ Query pair NAME if paired; or Query NAME if unpaired 2

FLAG [0-9]+ [0,216-1] bitwise FLAG (Section 2.2.2)

RNAME [^ \t\n\r@=]+ Reference sequence NAME 3

POS [0-9]+ [0,229-1] 1-based leftmost POSition/coordinate of the clipped sequence

MAPQ [0-9]+ [0,28-1] MAPping Quality (phred-scaled posterior probability that the mapping

position of this read is incorrect) 4

CIGAR ([0-9]+[MIDNSHP])+|\* extended CIGAR string

MRNM [^ \t\n\r@]+ Mate Reference sequence NaMe; “=” if the same as <RNAME> 3

MPOS [0-9]+ [0,229-1] 1-based leftmost Mate POSition of the clipped sequence

ISIZE -?[0-9]+ [-229,229] inferred Insert SIZE 5

SEQ [acgtnACGTN.=]+|\* query SEQuence; “=” for a match to the reference; n/N/. for ambiguity;

cases are not maintained 6,7

QUAL [!-~]+|\* [0,93] query QUALity; ASCII-33 gives the Phred base quality 6,7

TAG [A-Z][A-Z0-9] TAG

VTYPE [AifZH] Value TYPE

VALUE [^\t\n\r]+ match <VTYPE> (space allowed)

Notes:

1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and

CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM

equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is

not a star, it must be of the same length as SEQ.

2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in

different alignment records, representing multiple or split hits. The maximum string length is 254.

3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.

4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a

calculation is difficult, 255 should be applied, indicating the mapping quality is not available.

5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of

the 5!-end of the mate and of the 5!-end of the current read; otherwise ISIZE equals 0 (by the “5!-end” we mean the

5!-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping

coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a

smaller coordinate than the current read.

6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color

sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).

7. All mapped reads are represented on the forward genomic strand. "The bases are reverse complemented from the

unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. "This

applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. "The strand

bits in the flag simply indicates whether this reverse complement transform was applied from the original read

sequence to obtain the bases listed in the SAM file.

2.2.2. The <flag> field

Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:

SAM Format Specification 0.1.2-draft (20090820)

- 5 -

2.2.1. Overview

The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> \ [<TAG>:<VTYPE>:<VALUE> [...]]

The format of each field is explained in the following table. More detailed descriptions are given in the sections below.

Field Regular expression Range Description

QNAME [^ \t\n\r]+ Query pair NAME if paired; or Query NAME if unpaired 2

FLAG [0-9]+ [0,216-1] bitwise FLAG (Section 2.2.2)

RNAME [^ \t\n\r@=]+ Reference sequence NAME 3

POS [0-9]+ [0,229-1] 1-based leftmost POSition/coordinate of the clipped sequence

MAPQ [0-9]+ [0,28-1] MAPping Quality (phred-scaled posterior probability that the mapping

position of this read is incorrect) 4

CIGAR ([0-9]+[MIDNSHP])+|\* extended CIGAR string

MRNM [^ \t\n\r@]+ Mate Reference sequence NaMe; “=” if the same as <RNAME> 3

MPOS [0-9]+ [0,229-1] 1-based leftmost Mate POSition of the clipped sequence

ISIZE -?[0-9]+ [-229,229] inferred Insert SIZE 5

SEQ [acgtnACGTN.=]+|\* query SEQuence; “=” for a match to the reference; n/N/. for ambiguity;

cases are not maintained 6,7

QUAL [!-~]+|\* [0,93] query QUALity; ASCII-33 gives the Phred base quality 6,7

TAG [A-Z][A-Z0-9] TAG

VTYPE [AifZH] Value TYPE

VALUE [^\t\n\r]+ match <VTYPE> (space allowed)

Notes:

1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and

CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM

equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is

not a star, it must be of the same length as SEQ.

2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in

different alignment records, representing multiple or split hits. The maximum string length is 254.

3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.

4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a

calculation is difficult, 255 should be applied, indicating the mapping quality is not available.

5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of

the 5!-end of the mate and of the 5!-end of the current read; otherwise ISIZE equals 0 (by the “5!-end” we mean the

5!-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping

coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a

smaller coordinate than the current read.

6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color

sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).

7. All mapped reads are represented on the forward genomic strand. "The bases are reverse complemented from the

unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. "This

applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. "The strand

bits in the flag simply indicates whether this reverse complement transform was applied from the original read

sequence to obtain the bases listed in the SAM file.

2.2.2. The <flag> field

Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:

SAM Format Specification 0.1.2-draft (20090820)

- 5 -

SAM Bitwise FlagsFlag Description

0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair

0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) 1

0x0004 the query sequence itself is unmapped

0x0008 the mate is unmapped 1

0x0010 strand of the query (0 for forward; 1 for reverse strand)

0x0020 strand of the mate 1

0x0040 the read is the first read in a pair 1,2

0x0080 the read is the second read in a pair 1,2

0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)

0x0200 the read fails platform/vendor quality checks

0x0400 the read is either a PCR duplicate or an optical duplicate

Notes:

!" Flag 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.

#" If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should

be present and 0x40 and 0x80 are both zero.

2.2.3. Extended CIGAR format

A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows

for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format

further allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:

op Description

M Alignment match (can be a sequence match or mismatch)

I Insertion to the reference

D Deletion from the reference

N Skipped region from the reference

S Soft clip on the read (clipped sequence present in <seq>)

H Hard clip on the read (clipped sequence NOT present in <seq>)

P Padding (silent deletion from the padded reference sequence)

2.2.4. Format of optional fields

Optional fields are in the format: <TAG>:<VTYPE>:<VALUE>. Each tag is encoded in two alphanumeric characters and

appears only once for an alignment. The <VTYPE> follows Perl!s rule (see also perldoc -f pack). Valid types in SAM are:

Type Description

A Printable character

i Signed 32-bit integer

f Single-precision float number

Z Printable string

H Hex string (high nybble first)

Predefined tags are shown in the following table. You can freely add new tags, and if a new tag may be of general

interest, you can email [email protected] to add the new tag to the specification. Note that tags

started with "X!, "Y! and "Z! are reserved for local use and will not be formally defined in any future version of this

specification.

Tag Type Description

X? ? Reserved fields for end users (together with Y? and Z?)

RG Z Read group. Value matches the header RG-ID tag if @RG is present in the header.

LB Z Library. Value should be consistent with the header RG-LB tag if @RG is present.

SAM Format Specification 0.1.2-draft (20090820)

- 6 -

• flags added together to produce value in FLAG field• to test a flag: (value && testFlag)==testFlag

SAMTools

C API and common tools for manipulating SAM & BAM filesGATK / Picard does similar things in Java

Alignment Viewers

IGV

!"#via Jim Robinson

GATLING

• Many current sequencing projects are looking at SNPs -> lots of infrastructure for pileups and SNP-calling

• No standardized tools for analyzing structural variation from next-gen data

• GATLING is a set of lightweight utilities (a la samtools) for analyzing SAM/BAM output for SV and visualizing results

• still under development...

open problems

alignment speed / error tolerancesorting speedstorage???

Resources

SeqAnswers.comi.seqanswers.combiostar.stackexchange.com

samtools-develbio-bwa-help

questions / discussion