next-gen sequencing informatics primer - boston · pdf filenext-gen sequencing informatics...
TRANSCRIPT
next-gen sequencing informatics primer
Adrian Heilbut <[email protected]>
BU Bioinfo Program RetreatMay 22 2010
1. Technologies & Costs2. Applications
Library Design3. Informatics
Inputs: FASTQAlignment AlgorithmsOutputs: SAM/BAMQC issuesAnalysis: SVs, SNPs
growth in sequencing DBsTHE SEQUENCE EXPLOSION
SOU
RCE:
NCB
I; G
RAPH
ICS
BY N
. SPE
NCE
R &
W. F
ERN
AN
DES
Nature, 1 April 2010
454 (Roche)
Nature Reviews | Microbiology
Adaptors
gDNA
a 454 technology
Beads
gDNA
gDNA
b SOLiD technology
c SOLEXA technology
Beads
PCR
PCR
Enzymes on beadsand primer
Polymerase
PPi
ATP
Light
Ligase
Sample preparation Pyrosequencing
Sample preparation Sequencing by synthesis
Sample preparation Sequencing by ligation
T
A CG
Polymerase
AT C
AT
C
T
Adaptors
Adaptors
PCR
T/A/G/C
Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are
deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.
REVIEWS
422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro
• first 2nd generation technology - 2005• can do long reads (400bp)• suitable for de novo genome sequencing of
moderately sized genomes
SOLID
Nature Reviews | Microbiology
Adaptors
gDNA
a 454 technology
Beads
gDNA
gDNA
b SOLiD technology
c SOLEXA technology
Beads
PCR
PCR
Enzymes on beadsand primer
Polymerase
PPi
ATP
Light
Ligase
Sample preparation Pyrosequencing
Sample preparation Sequencing by synthesis
Sample preparation Sequencing by ligation
T
A CG
Polymerase
AT C
AT
C
T
Adaptors
Adaptors
PCR
T/A/G/C
Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are
deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.
REVIEWS
422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro
• 60 gigabases per run• “emulsion PCR”• sequencing by ligation• colorspace output
SOLiD sequencing (I)
APPLICATION NOTE Di-Base Sequencing and the Advantages of Color Space Analysis
Principles of Di-Base Sequencing and the Advantages of Color Space Analysis in the SOLiD™ System
IntroductionThe SOLiD™ System is the only next generation sequencing system to employ ligation based chemistry with di-base labeled probes. This unique approach enables a method which provides signifi cant advantages in terms of system accuracy and downstream data analysis.
• Unique built-in error checking capability distinguishes between measurement errors and true polymorphisms
• Ability to detect more complicated genomic variation such as adjacent SNPs, insertions, deletions and structural rearrangements
• Final data reported as standard base calls relative to provided reference sequence
The process of 2 base encoding and the benefi ts of performing analysis in the di-base alphabet (a.k.a. “color space”) are described below.
Principles of Ligation Based Chemistry and 2 Base EncodingThe SOLiD™ System enables massively parallel sequencing of clonally amplifi ed DNA fragments linked to beads. This unique sequencing methodology is based on sequential ligation of dye labeled oligonucleotide probes whereby each probe queries two base positions at a time. The system uses four fl uorescent dyes to encode for the sixteen possible two base combinations. Multiple ligation cycles
Ligation Cycle
Prim
er R
ound
1 2 3 4 5 6 7
3’AT
TA
Universal seq primer (n)
3’P1 Adapter Template Sequence
POH
Universal seq primer (n-1)
Ligase
Phosphatase
+
1. Prime and Ligate
2. Image 5. Repeat steps 1-4 to Extend Sequence
3’
Universal seq primer (n-1)1. Melt off extended sequence
2. Primer reset3’
AA CA
GCGT
CG TC AA
AG GG
CC
T TT
6. Primer Reset
7. Repeat steps 1-5 with new primer
8. Repeat Reset with , n-2, n-3, n-4 primers
AT
TA
TA3’
Excite Fluorescence
ATAA GA CA AATATT CT GT TT CA
GTGCCG
3’
4. Cleave off Fluor
AT
TA3’
Cleavage Agent
P
HO
3. Cap Unextended Strands
3’
PO4
1 2 3 4 5 6 7 ... (n cycles)Ligation cycle
3’
3’1 !mbead
1 !mbead
1 !mbead
-1
Universal seq primer (n-1)
Universal seq primer (n)
Universal seq primer (n-2)
Universal seq primer (n-3)
Universal seq primer (n-4)
3’
3’
3’
3’
3’
1
2
3
4
5
PRIMER ROUND 1
PRIMER ROUND 2
1 base shift
Read Position
Indicates positions of interogation
35343332313029282726252423222120191817161514131211109876543210
Figure 1: SOLiD™ System – Sequencing by ligation using di-base labeled probes
108235.indd 2108235.indd 2 4/25/08 1:09:15 PM4/25/08 1:09:15 PM
SOLiD sequencing (II)
APPLICATION NOTE Di-Base Sequencing and the Advantages of Color Space Analysis
Principles of Di-Base Sequencing and the Advantages of Color Space Analysis in the SOLiD™ System
IntroductionThe SOLiD™ System is the only next generation sequencing system to employ ligation based chemistry with di-base labeled probes. This unique approach enables a method which provides signifi cant advantages in terms of system accuracy and downstream data analysis.
• Unique built-in error checking capability distinguishes between measurement errors and true polymorphisms
• Ability to detect more complicated genomic variation such as adjacent SNPs, insertions, deletions and structural rearrangements
• Final data reported as standard base calls relative to provided reference sequence
The process of 2 base encoding and the benefi ts of performing analysis in the di-base alphabet (a.k.a. “color space”) are described below.
Principles of Ligation Based Chemistry and 2 Base EncodingThe SOLiD™ System enables massively parallel sequencing of clonally amplifi ed DNA fragments linked to beads. This unique sequencing methodology is based on sequential ligation of dye labeled oligonucleotide probes whereby each probe queries two base positions at a time. The system uses four fl uorescent dyes to encode for the sixteen possible two base combinations. Multiple ligation cycles
Ligation Cycle
Prim
er R
ound
1 2 3 4 5 6 7
3’AT
TA
Universal seq primer (n)
3’P1 Adapter Template Sequence
POH
Universal seq primer (n-1)
Ligase
Phosphatase
+
1. Prime and Ligate
2. Image 5. Repeat steps 1-4 to Extend Sequence
3’
Universal seq primer (n-1)1. Melt off extended sequence
2. Primer reset3’
AA CA
GCGT
CG TC AA
AG GG
CC
T TT
6. Primer Reset
7. Repeat steps 1-5 with new primer
8. Repeat Reset with , n-2, n-3, n-4 primers
AT
TA
TA3’
Excite Fluorescence
ATAA GA CA AATATT CT GT TT CA
GTGCCG
3’
4. Cleave off Fluor
AT
TA3’
Cleavage Agent
P
HO
3. Cap Unextended Strands
3’
PO4
1 2 3 4 5 6 7 ... (n cycles)Ligation cycle
3’
3’1 !mbead
1 !mbead
1 !mbead
-1
Universal seq primer (n-1)
Universal seq primer (n)
Universal seq primer (n-2)
Universal seq primer (n-3)
Universal seq primer (n-4)
3’
3’
3’
3’
3’
1
2
3
4
5
PRIMER ROUND 1
PRIMER ROUND 2
1 base shift
Read Position
Indicates positions of interogation
35343332313029282726252423222120191817161514131211109876543210
Figure 1: SOLiD™ System – Sequencing by ligation using di-base labeled probes
108235.indd 2108235.indd 2 4/25/08 1:09:15 PM4/25/08 1:09:15 PM
SOLIDof probe hybridization, ligation, imaging and analysis are performed to extend the strand from a primer hybridized to a ligated adaptor proximal to the immobilized bead (P1 adaptor). The resulting product is then removed and the process repeated for 5 more extension rounds with primers hybridized to positions n-1, n-2 etc., in the P1 adaptor (Figure 1 on previous page).
There are several fundamental properties unique to ligation based sequencing which contribute to the high accuracy inherent to the SOLiD system. The advantages of these properties and their contribution to data quality are described below.
1. Two bases are interrogated in each ligation reaction providing increased specifi city.
2. The primer is periodically reset for 5 independent rounds of extension improving the signal to noise ratio of the system.
3. Each base is interrogated twice in independent primer rounds providing increased confi dence in each call.
4. Four dyes are used to encode for sixteen possible two base combinations. The design of the encoding matrix enables built in error checking capability.
Fundamentals of SOLiD Color Space“Color space” is not a new concept. Data from Sanger sequencing is also encoded in color space by the four dyes used in the sequencing chemistry and displayed as peaks in an electropherogram. One difference between Sanger and SOLiD color space is that previously, in Sanger sequencing, each color represented only a single nucleotide and was automatically translated to A,C,G or T. With the SOLiD System each color now represents 4 potential two base combinations and the conversion into nucleotide base space is usually done after the sequence is aligned to a reference genome transcribed in color space. Alternatively, translation can occur following generation of a consensus
sequence. Since this system uses 4 fl uorescent dyes, there are 16 possible two color combinations (Figure 2).
Please note that the SOLiD color space schema was specifi cally designed to have the following properties which enable the unique error checking capability of the SOLiD™ System.
• For each di-base the reverse (e.g., CA and AC) is always in the same color
• For each di-base the complement (e.g., CA and GT) is always in the same color
• For each di-base the reversed complement (e.g., CA and TG) is always in the same color
Advantages of 2 base encoding and color space — higher accuracy for SNP detectionTwo base encoding provides higher system accuracy and built-in error checking capability which enables discrimination between measurement errors and true polymorphisms. Since each base is interrogated twice in independent reactions, the information about each base is included in two adjacent pieces of color space data. There are 16 possible two color combinations that may encode for a transition at any given single base position (Table 1). A transition at that position will result in a characteristic change that is limited to a subset of all the possible positions. This restriction allows for anomalies to be easily detected and discarded as errors.
Figure 2. SOLiD Color Space Code — Four dyes encode for sixteen potential two base combinations
Possible Dinucleotides Encoded By Each Color
A
A
C
C
G
G
T
T
1st B
ase
2nd Base
TACGGCTA
ACCAGTTG
AAC CGGT T
GATCAGCT
A AT G G
With 2 base encoding each base is defined twice
Double Interrogation
Template Sequence
All 2 color possibilities are represented for the four possible dyes. Four sets of allowable changes are classifi ed by color. Changes that fall outside a particular set can be classifi ed as possible errors for further analysis.
Blue Green Yellow Red
Blue BB BG BY BR
Green GB GG GY GR
Yellow YB YG YY YR
Red RB RG RY RR
TABLE 1. COLOR SPACE TRANSITIONS
108235.indd 3108235.indd 3 4/25/08 1:09:15 PM4/25/08 1:09:15 PM
Illumina
• 20-200 gigabases per run (depending on instrument)
Nature Reviews | Microbiology
Adaptors
gDNA
a 454 technology
Beads
gDNA
gDNA
b SOLiD technology
c SOLEXA technology
Beads
PCR
PCR
Enzymes on beadsand primer
Polymerase
PPi
ATP
Light
Ligase
Sample preparation Pyrosequencing
Sample preparation Sequencing by synthesis
Sample preparation Sequencing by ligation
T
A CG
Polymerase
AT C
AT
C
T
Adaptors
Adaptors
PCR
T/A/G/C
Figure 3 | Post-Sanger sequencing technologies. a | The 454 sequencing method is a highly parallel, two-step approach. First, the DNA is sheared and oligonucleotide adaptors are attached. Each fragment is attached to a bead and the beads are PCR amplified within droplets of an oil–water emulsion. This generates multiple copies of the same DNA sequence on each bead. Second, the beads are captured in picolitre-sized wells in a fabricated substrate and pyrosequencing (pyrophosphate-based sequencing) is performed in parallel on each DNA fragment as shown (the DNA fragment has been artificially elongated in the figure). Nucleotide incorporation is detected by the release of inorganic pyrophosphate (PPi), which leads to the enzymatic generation of photons: PPi is released and converted to ATP and luciferase uses the ATP to generate light. The cycle is iteratively repeated for each of the four bases. The average read length has already increased from 110 bp to approximately 250 bp, and future developments will probably increase it further to over 400 bp.. b | SOLiD technology has an amplification procedure that is conceptually similar to that of 454, but the sequencing strategy is radically different. Beads are
deposited onto glass slides and the sequence is determined by sequential hybridization and ligation of partially random oligonucleotides with a central determined base (or pair of bases) that is identified by a specific fluorophore. After the colour is recorded, the ligated oligonucleotide is cleaved and removed and the process is then repeated. The reads that are generated are currently ~25 bp, but will probably increase to more than 50 bp in the future. c | The first step of SOLEXA sequencing is based on the amplification of DNA on a solid surface using fold-back PCR and anchored primers. Multiple cycles of the solid-phase amplification followed by denaturation create clusters of ~1,000 copies of single-stranded DNA molecules. Sequencing is performed sequentially using primers, DNA polymerase and four fluorophore-labelled, reversibly terminating nucleotides. After the incorporation of a nucleotide, the image is captured and the identity of the first base is recorded. The terminators and fluorophores are then removed and the incorporation, detection and identification steps are repeated. The average read length is currently ~40 bp, but this will also probably increase in the future.
REVIEWS
422 | JUNE 2008 | VOLUME 6 www.nature.com/reviews/micro
illumina sequencing (I)TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING
5. DENATURE THE DOUBLE-STRANDED MOLECULES
1. PREPARE GENOMIC DNA SAMPLE
Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.
2. ATTACH DNA TO SURFACE
Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
3. BRIDGE AMPLIFICATION
Add unlabeled nucleotides and en-zyme to initiate solid-phase bridge amplification.
4. FRAGMENTS BECOME DOUBLE-STRANDED
The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
6. COMPLETE AMPLIFICATION
Several million dense clusters of double-stranded DNA are generat-ed in each channel of the flow cell.
Denaturation leaves single-stranded templates anchored to the substrate.
SEQUENCING TECHNOLOGY OVERVIEW
DNA
Adapters Adapter
DNA fragment
Dense lawn of primers
Adapter
Attached terminus
Attached terminus
Free terminus
Attached
Attached
Clusters
illumina sequencing (II)
TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING
5. DENATURE THE DOUBLE-STRANDED MOLECULES
1. PREPARE GENOMIC DNA SAMPLE
Randomly fragment genomic DNA and ligate adapters to both ends of the fragments.
2. ATTACH DNA TO SURFACE
Bind single-stranded fragments randomly to the inside surface of the flow cell channels.
3. BRIDGE AMPLIFICATION
Add unlabeled nucleotides and en-zyme to initiate solid-phase bridge amplification.
4. FRAGMENTS BECOME DOUBLE-STRANDED
The enzyme incorporates nucleotides to build double-stranded bridges on the solid-phase substrate.
6. COMPLETE AMPLIFICATION
Several million dense clusters of double-stranded DNA are generat-ed in each channel of the flow cell.
Denaturation leaves single-stranded templates anchored to the substrate.
SEQUENCING TECHNOLOGY OVERVIEW
DNA
Adapters Adapter
DNA fragment
Dense lawn of primers
Adapter
Attached terminus
Attached terminus
Free terminus
Attached
Attached
Clusters
illumina sequencing (III)TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING
7. DETERMINE FIRST BASE
The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase.
8. IMAGE FIRST BASE 9. DETERMINE SECOND BASE
The next cycle repeats the incor-poration of four labeled reversible terminators, primers, and DNA polymerase.
10. IMAGE SECOND CHEMISTRY CYCLE
11. SEQUENCING OVER MUL-TIPLE CHEMISTRY CYCLES
12. ALIGN DATA
After laser excitation, the image is captured as before, and the identity of the second base is recorded.
The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
The data are aligned and com-pared to a reference, and sequenc-ing differences are identified.
After laser excitation, the emit-ted fluorescence from each cluster is captured and the first base is identified.
LaserLaser
GCTGA...
illumina sequencing (IV)
TECHNOLOGY SPOTLIGHT: ILLUMINA® SEQUENCING
7. DETERMINE FIRST BASE
The first sequencing cycle begins by adding four labeled reversible terminators, primers, and DNA polymerase.
8. IMAGE FIRST BASE 9. DETERMINE SECOND BASE
The next cycle repeats the incor-poration of four labeled reversible terminators, primers, and DNA polymerase.
10. IMAGE SECOND CHEMISTRY CYCLE
11. SEQUENCING OVER MUL-TIPLE CHEMISTRY CYCLES
12. ALIGN DATA
After laser excitation, the image is captured as before, and the identity of the second base is recorded.
The sequencing cycles are repeated to determine the sequence of bases in a fragment, one base at a time.
The data are aligned and com-pared to a reference, and sequenc-ing differences are identified.
After laser excitation, the emit-ted fluorescence from each cluster is captured and the first base is identified.
LaserLaser
GCTGA...
Current Costs (eg. Illumina)
~$1k per lane (54 cycle, single end)
~$2k per lane (74 cycle, paired end)-> 6 gigabases of sequence-> 1-2x human genome coverage
Applications & Sample Prep
• De Novo sequencing
• Resequencing
• SNP discovery
• Structural variation
• RNA-seq
• gene expression
• small RNAs
Library Preparation
• ~ 3ug source DNA needed
• whole genome amplification
• every PCR step adds potential bias and artifacts
• Paired-ends, Jumping libraries
• increase effective coverage
Hybrid Capture / Enrichment
• Pre-designed exon capture arrays
• Design-your-own capture arrays
• may require optimization / iteration
DNA Sample Preparation 2
Agilent SureSelect DNA Capture Array Protocol 17
Figure 2 SureSelect Target Enrichment System Capture Process
Paired-End & Mate-Pair Libraries10
Cat # PE-930-1003Part # 15008135 Rev. A
Figure 9 Origin and Alignment of Inward and Outward-Facing Reads
B
Circularized Molecule
Outer Ends
Alignment of Outward and Inward FacingReads to Reference
400 bp Gap Size
BB
3.1 kb Gap Size
Outward Facing Read
Inward Facing Read
Sheared Molecule
Outer Ends
B
Internal Fragment
BRead 1
Read 2
3.5 kb DNA Moleculewith Biotin End Labels
Outer End
Outer End
BB
Biotinylated End Fragment
Read 1
Read 2
Unbiotinylated Internal Fragment
Sequencing of Sheared Fragments
Source: Illumina Mate Pair Library V2 Sample Prep Guide
Informatics
Read files & quality scoresPrimary QC Quality recalibrationAlignment algorithmsAlignment output formatsOpen problems?
FASTQ
@EAS54_6_R1_2_1_413_324CCCTTCTTGTCTTCAGCGTTTCTCC+;;3;;;;;;;;;;;;7;;;;;;;88@EAS54_6_R1_2_1_540_792TTGGCAGGCCAAGGCCGATGGATCA+;;;;;;;;;;;7;;;;;-;;;3;83@EAS54_6_R1_2_1_443_348GTTGCTTCTGGCGTGGGTGGGGGGG+EAS54_6_R1_2_1_443_348;;;;;;;;;;;9;7;;.7;393333
PHRED quality encoding
Encoded in in FASTQ / SAM by quality string ofASCII value - 33scales differ slightly betweeen PHRED and older versions of Illumina outputs
Q = −10log10P P = 10−Q10
QC
Quality vs. Cycle; Quality vs. Tile
Often bias towards errors at 3’ end
Fastx_tools, PIQA helpful tools for visualizing quality
Aligner Performance Varies WidelyShort-read alignment Short-read aligners
The speed varies...
by Bala et al.Heng Li (Broad Institute) BWA 4 Feburary 2010 6 / 17
Bala et al. via Heng Li
Hashing / Spaced Seed Aligners
Hash Queries: Eland (Cox - illumina) MAQ (Li, Durbin)Hash Genome: Mosaik (Stromberg, Marth
Burrows-Wheeler Alignment
BWA (Li, Durbin)Bowtie (UMD) SOAP2 (BGI)Use FM-index data structure to effectively maintain a suffix array index of the genome in < 3Gb RAMFast exact matchingMismatches searched by varying query sequence
SAM
Sequence Alignment / Map Formatstandardized alignment output format supported by most modern alignment programssimple, tab-delimited text-file with an optional bzip-compressed binary encoding (BAM)
SAM Record Format2.2.1. Overview
The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> \ [<TAG>:<VTYPE>:<VALUE> [...]]
The format of each field is explained in the following table. More detailed descriptions are given in the sections below.
Field Regular expression Range Description
QNAME [^ \t\n\r]+ Query pair NAME if paired; or Query NAME if unpaired 2
FLAG [0-9]+ [0,216-1] bitwise FLAG (Section 2.2.2)
RNAME [^ \t\n\r@=]+ Reference sequence NAME 3
POS [0-9]+ [0,229-1] 1-based leftmost POSition/coordinate of the clipped sequence
MAPQ [0-9]+ [0,28-1] MAPping Quality (phred-scaled posterior probability that the mapping
position of this read is incorrect) 4
CIGAR ([0-9]+[MIDNSHP])+|\* extended CIGAR string
MRNM [^ \t\n\r@]+ Mate Reference sequence NaMe; “=” if the same as <RNAME> 3
MPOS [0-9]+ [0,229-1] 1-based leftmost Mate POSition of the clipped sequence
ISIZE -?[0-9]+ [-229,229] inferred Insert SIZE 5
SEQ [acgtnACGTN.=]+|\* query SEQuence; “=” for a match to the reference; n/N/. for ambiguity;
cases are not maintained 6,7
QUAL [!-~]+|\* [0,93] query QUALity; ASCII-33 gives the Phred base quality 6,7
TAG [A-Z][A-Z0-9] TAG
VTYPE [AifZH] Value TYPE
VALUE [^\t\n\r]+ match <VTYPE> (space allowed)
Notes:
1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and
CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM
equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is
not a star, it must be of the same length as SEQ.
2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in
different alignment records, representing multiple or split hits. The maximum string length is 254.
3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.
4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a
calculation is difficult, 255 should be applied, indicating the mapping quality is not available.
5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of
the 5!-end of the mate and of the 5!-end of the current read; otherwise ISIZE equals 0 (by the “5!-end” we mean the
5!-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping
coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a
smaller coordinate than the current read.
6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color
sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).
7. All mapped reads are represented on the forward genomic strand. "The bases are reverse complemented from the
unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. "This
applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. "The strand
bits in the flag simply indicates whether this reverse complement transform was applied from the original read
sequence to obtain the bases listed in the SAM file.
2.2.2. The <flag> field
Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:
SAM Format Specification 0.1.2-draft (20090820)
- 5 -
2.2.1. Overview
The alignment section consists of multiple TAB-delimited lines with each line describing an alignment. Each line is:
<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL> \ [<TAG>:<VTYPE>:<VALUE> [...]]
The format of each field is explained in the following table. More detailed descriptions are given in the sections below.
Field Regular expression Range Description
QNAME [^ \t\n\r]+ Query pair NAME if paired; or Query NAME if unpaired 2
FLAG [0-9]+ [0,216-1] bitwise FLAG (Section 2.2.2)
RNAME [^ \t\n\r@=]+ Reference sequence NAME 3
POS [0-9]+ [0,229-1] 1-based leftmost POSition/coordinate of the clipped sequence
MAPQ [0-9]+ [0,28-1] MAPping Quality (phred-scaled posterior probability that the mapping
position of this read is incorrect) 4
CIGAR ([0-9]+[MIDNSHP])+|\* extended CIGAR string
MRNM [^ \t\n\r@]+ Mate Reference sequence NaMe; “=” if the same as <RNAME> 3
MPOS [0-9]+ [0,229-1] 1-based leftmost Mate POSition of the clipped sequence
ISIZE -?[0-9]+ [-229,229] inferred Insert SIZE 5
SEQ [acgtnACGTN.=]+|\* query SEQuence; “=” for a match to the reference; n/N/. for ambiguity;
cases are not maintained 6,7
QUAL [!-~]+|\* [0,93] query QUALity; ASCII-33 gives the Phred base quality 6,7
TAG [A-Z][A-Z0-9] TAG
VTYPE [AifZH] Value TYPE
VALUE [^\t\n\r]+ match <VTYPE> (space allowed)
Notes:
1. QNAME and FLAG are required for all alignments. If the mapping position of the query is not available, RNAME and
CIGAR are set as “*”, and POS and MAPQ as 0. If the query is unpaired or pairing information is not available, MRNM
equals “*”, and MPOS and ISIZE equal 0. SEQ and QUAL can both be absent, represented as a star “*”. If QUAL is
not a star, it must be of the same length as SEQ.
2. The name of a pair/read is required to be unique in the SAM file, but one pair/read may appear multiple times in
different alignment records, representing multiple or split hits. The maximum string length is 254.
3. If SQ is present in the header, RNAME and MRNM must appear in an SQ header record.
4. Field MAPQ considers pairing in calculation if the read is paired. Providing MAPQ is recommended. If such a
calculation is difficult, 255 should be applied, indicating the mapping quality is not available.
5. If the two reads in a pair are mapped to the same reference, ISIZE equals the difference between the coordinate of
the 5!-end of the mate and of the 5!-end of the current read; otherwise ISIZE equals 0 (by the “5!-end” we mean the
5!-end of the original read, so for Illumina short-insert paired end reads this calculates the difference in mapping
coordinates of the outer edges of the original sequenced fragment). ISIZE is negative if the mate is mapped to a
smaller coordinate than the current read.
6. Color alignments are stored as normal nucleotide alignments with additional tags describing the raw color
sequences, qualities and color-specific properties (see also Note 5 in section 2.2.4).
7. All mapped reads are represented on the forward genomic strand. "The bases are reverse complemented from the
unmapped read sequence and the quality scores and cigar strings are recorded consistently with the bases. "This
applies to information in the mate tags (R2, Q2, S2, etc.) and any other tags that are strand sensitive. "The strand
bits in the flag simply indicates whether this reverse complement transform was applied from the original read
sequence to obtain the bases listed in the SAM file.
2.2.2. The <flag> field
Field <flag> is a bitwise flag. The meaning of predefined bits is shown in the following table:
SAM Format Specification 0.1.2-draft (20090820)
- 5 -
SAM Bitwise FlagsFlag Description
0x0001 the read is paired in sequencing, no matter whether it is mapped in a pair
0x0002 the read is mapped in a proper pair (depends on the protocol, normally inferred during alignment) 1
0x0004 the query sequence itself is unmapped
0x0008 the mate is unmapped 1
0x0010 strand of the query (0 for forward; 1 for reverse strand)
0x0020 strand of the mate 1
0x0040 the read is the first read in a pair 1,2
0x0080 the read is the second read in a pair 1,2
0x0100 the alignment is not primary (a read having split hits may have multiple primary alignment records)
0x0200 the read fails platform/vendor quality checks
0x0400 the read is either a PCR duplicate or an optical duplicate
Notes:
!" Flag 0x02, 0x08, 0x20, 0x40 and 0x80 are only meaningful when flag 0x01 is present.
#" If in a read pair the information on which read is the first in the pair is lost in the upstream analysis, flag 0x01 should
be present and 0x40 and 0x80 are both zero.
2.2.3. Extended CIGAR format
A CIGAR string is comprised of a series of operation lengths plus the operations. The conventional CIGAR format allows
for three types of operations: M for match or mismatch, I for insertion and D for deletion. The extended CIGAR format
further allows four more operations, as is shown in the following table, to describe clipping, padding and splicing:
op Description
M Alignment match (can be a sequence match or mismatch)
I Insertion to the reference
D Deletion from the reference
N Skipped region from the reference
S Soft clip on the read (clipped sequence present in <seq>)
H Hard clip on the read (clipped sequence NOT present in <seq>)
P Padding (silent deletion from the padded reference sequence)
2.2.4. Format of optional fields
Optional fields are in the format: <TAG>:<VTYPE>:<VALUE>. Each tag is encoded in two alphanumeric characters and
appears only once for an alignment. The <VTYPE> follows Perl!s rule (see also perldoc -f pack). Valid types in SAM are:
Type Description
A Printable character
i Signed 32-bit integer
f Single-precision float number
Z Printable string
H Hex string (high nybble first)
Predefined tags are shown in the following table. You can freely add new tags, and if a new tag may be of general
interest, you can email [email protected] to add the new tag to the specification. Note that tags
started with "X!, "Y! and "Z! are reserved for local use and will not be formally defined in any future version of this
specification.
Tag Type Description
X? ? Reserved fields for end users (together with Y? and Z?)
RG Z Read group. Value matches the header RG-ID tag if @RG is present in the header.
LB Z Library. Value should be consistent with the header RG-LB tag if @RG is present.
SAM Format Specification 0.1.2-draft (20090820)
- 6 -
• flags added together to produce value in FLAG field• to test a flag: (value && testFlag)==testFlag
SAMTools
C API and common tools for manipulating SAM & BAM filesGATK / Picard does similar things in Java
GATLING
• Many current sequencing projects are looking at SNPs -> lots of infrastructure for pileups and SNP-calling
• No standardized tools for analyzing structural variation from next-gen data
• GATLING is a set of lightweight utilities (a la samtools) for analyzing SAM/BAM output for SV and visualizing results
• still under development...