high throughput sequencing : informatics & software aspects

High throughput sequencing: informatics & software aspects Gabor T. Marth Boston College Biology Department BI543 Fall 2013 January 29, 2013


Page 1: High throughput sequencing : informatics & software aspects

High throughput sequencing: informatics & software aspects

Gabor T. Marth
Boston College Biology Department

BI543 Fall 2013
January 29, 2013

Page 2: High throughput sequencing : informatics & software aspects

Traditional DNA sequencing

Page 3: High throughput sequencing : informatics & software aspects

Genetics of living organisms

DNA

Chromosomes

Page 4: High throughput sequencing : informatics & software aspects

Radioactive label gel sequencing

Page 5: High throughput sequencing : informatics & software aspects

Four-color capillary sequencing

(Figure labels, genome sizes: ~1 Mb, ~100 Mb, >100 Mb, ~3,000 Mb)

ABI 3700 four-color sequence trace

Page 6: High throughput sequencing : informatics & software aspects

Individual human resequencing

Page 7: High throughput sequencing : informatics & software aspects

Next-generation DNA sequencing

Page 8: High throughput sequencing : informatics & software aspects

New sequencing technologies…

Page 9: High throughput sequencing : informatics & software aspects

… vast throughput, many applications

(Figure: bases per machine run, from 1 Mb to 1 Tb, plotted against read length, from 10 bp to 1,000 bp, for ABI / capillary, 454, and Illumina / SOLiD)

Page 10: High throughput sequencing : informatics & software aspects

Sequencing chemistries

• DNA ligation
• DNA base extension

Church, 2005

Page 11: High throughput sequencing : informatics & software aspects

Template clonal amplification

Church, 2005

Page 12: High throughput sequencing : informatics & software aspects

Massively parallel sequencing

Church, 2005

Page 13: High throughput sequencing : informatics & software aspects

Chemistry of paired-end sequencing

Double-stranded DNA is folded into a bridge shape and then separated into single strands. The end of each strand is then sequenced.

(Figure courtesy of Illumina)

Page 14: High throughput sequencing : informatics & software aspects

Paired-end reads

• fragment amplification: fragment length 100 - 600 bp
• fragment length limited by amplification efficiency

• circularization: 500 bp - 10 kb (sweet spot ~3 kb)
• fragment length limited by library complexity

Korbel et al. Science 2007

Page 15: High throughput sequencing : informatics & software aspects

Features of NGS data

Short sequence reads
• 100-200 bp
• 25-35 bp (micro-reads)

Huge amount of sequence per run
• Up to gigabases per run

Huge number of reads per run
• Up to hundreds of millions

Higher error rate than Sanger sequencing

Error profile different from Sanger

Page 16: High throughput sequencing : informatics & software aspects

Application areas of next-gen sequencing

Page 17: High throughput sequencing : informatics & software aspects

Application areas

• Genome resequencing
  • variant discovery
  • somatic mutation detection
  • mutational profiling

• De novo assembly

• Identification of protein-bound DNA
  • chromatin structure
  • methylation
  • transcription binding sites

• RNA-Seq
  • expression
  • transcript discovery

Mikkelsen et al. Nature 2007

Cloonan et al. Nature Methods, 2008

Page 18: High throughput sequencing : informatics & software aspects

SNP and short-INDEL discovery

Page 19: High throughput sequencing : informatics & software aspects

Structural variation detection

• structural variations (deletions, insertions, inversions and translocations) from paired-end read map locations

• copy number (for amplifications, deletions) from depth of read coverage
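The read-depth idea in the second bullet can be illustrated with a small Python sketch. It is only an illustration, not a production CNV caller: the window size, the use of the median window depth as the diploid baseline, and the function name `copy_number_from_depth` are all assumptions made for this example.

```python
from statistics import median

def copy_number_from_depth(depths, window=4, ploidy=2):
    """Estimate copy number per window from per-base read depth.

    Assumption: the median window depth represents the normal (diploid)
    state. Real callers also correct for GC content and mappability.
    """
    # Average depth in consecutive, non-overlapping windows.
    means = [
        sum(depths[i:i + window]) / window
        for i in range(0, len(depths) - window + 1, window)
    ]
    baseline = median(means)
    # Scale each window so that the baseline corresponds to `ploidy` copies.
    return [round(ploidy * m / baseline) for m in means]

# Toy depth profile: a heterozygous deletion (half depth) and a
# duplication (double depth) embedded in normal 30x coverage.
depths = [30] * 8 + [15] * 4 + [30] * 4 + [60] * 4 + [30] * 8
print(copy_number_from_depth(depths))  # [2, 2, 1, 2, 4, 2, 2]
```

Windows with roughly half the baseline depth come out as one copy, and windows with double the depth as four copies.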

Page 20: High throughput sequencing : informatics & software aspects

Identification of protein-bound DNA

genome sequence

aligned reads

Chromatin structure (ChIP-seq) (Mikkelsen et al. Nature 2007)

Transcription factor binding sites (Robertson et al. Nature Methods, 2007)

Page 21: High throughput sequencing : informatics & software aspects

Novel transcript discovery (genes)

Mortazavi et al. Nature Methods

• novel exons
• novel transcripts containing known exons

Page 22: High throughput sequencing : informatics & software aspects

Novel transcript discovery (miRNAs)

Ruby et al. Cell, 2006

Page 23: High throughput sequencing : informatics & software aspects

Expression profiling

(Figure: aligned reads over two genes; Jones-Rhoades et al. PLoS Genetics, 2007)

• tag counting (e.g. SAGE, CAGE)
• shotgun transcript sequencing
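Counting aligned reads per gene, the core of both approaches, might look like the following Python sketch. The gene names, coordinates and the half-open interval representation are hypothetical, and the length normalization is one simple choice among several.

```python
def count_reads_per_gene(read_positions, genes):
    """Count aligned reads whose start position falls inside each gene.

    `genes` maps a gene name to a half-open (start, end) interval.
    """
    counts = {name: 0 for name in genes}
    for pos in read_positions:
        for name, (start, end) in genes.items():
            if start <= pos < end:
                counts[name] += 1
    return counts

def reads_per_kb(counts, genes):
    """Normalize counts by gene length so that long genes are not favored."""
    return {
        name: counts[name] / ((end - start) / 1000)
        for name, (start, end) in genes.items()
    }

# Hypothetical genes and read start positions on one chromosome.
genes = {"geneA": (100, 1100), "geneB": (2000, 2500)}
reads = [150, 300, 900, 1099, 2100, 2400, 5000]

counts = count_reads_per_gene(reads, genes)
print(counts)                       # {'geneA': 4, 'geneB': 2}
print(reads_per_kb(counts, genes))  # {'geneA': 4.0, 'geneB': 4.0}
```

In this toy case geneA collects twice as many reads as geneB but is also twice as long, so the length-normalized values are equal.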

Page 24: High throughput sequencing : informatics & software aspects

De novo genome sequencing

(Figure labels: short reads, longer reads, read pairs, assembled sequence contigs)

Lander et al. Nature 2001

Page 25: High throughput sequencing : informatics & software aspects

The informatics of sequencing

Page 26: High throughput sequencing : informatics & software aspects

Re-sequencing informatics pipeline

(i) base calling
(ii) read mapping
(iii) SNP and short INDEL calling
(iv) SV calling
(v) data viewing, hypothesis generation

(Figure labels: REF, IND, GigaBayes)

Page 27: High throughput sequencing : informatics & software aspects

The variation discovery toolbox

• base callers

• read mappers

• SNP callers

• SV callers

• assembly viewers

Page 28: High throughput sequencing : informatics & software aspects

Raw data processing / base calling

Trace extraction

Base calling

• These steps are usually handled well by the machine manufacturers’ software

• What most analysts want to see is base calls and well-calibrated base quality values
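Base quality values are conventionally reported on the Phred scale, Q = -10 log10(p), where p is the estimated probability that the base call is wrong. The Python sketch below converts between the two and decodes a quality string. The ASCII offset of 33 is an assumption about the file encoding, and the function names are invented for this example.

```python
import math

def phred_to_error_prob(q):
    """Convert a Phred quality value to an error probability."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Convert an error probability to a Phred quality value."""
    return -10 * math.log10(p)

def decode_quality_string(qual, offset=33):
    """Decode an ASCII quality string (assumed offset: 33)."""
    return [ord(ch) - offset for ch in qual]

quals = decode_quality_string("II5+")
print(quals)  # [40, 40, 20, 10]
print([phred_to_error_prob(q) for q in quals])
```

A quality value is well calibrated when bases reported at Q20 are in fact wrong about once in a hundred calls.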

Page 29: High throughput sequencing : informatics & software aspects

Sequence traces are machine-specific

Base calling is increasingly left to machine manufacturers

Page 30: High throughput sequencing : informatics & software aspects

Read mapping…

…is like a jigsaw puzzle…

…where they give you the cover on the box

Page 31: High throughput sequencing : informatics & software aspects

Some pieces are easier to place than others…

…pieces with unique features

pieces that look like each other…

Page 32: High throughput sequencing : informatics & software aspects

Repeats: the multiple mapping problem

Lander et al. 2001

Page 33: High throughput sequencing : informatics & software aspects

Paired-end (PE) reads

fragment length: 100 – 600bp

Korbel et al. Science 2007

fragment length: 1 – 10kb

PE reads are now the standard for whole-genome short-read sequencing

Page 34: High throughput sequencing : informatics & software aspects

Mapping quality values

(Figure: three candidate map locations for one read, with probabilities 0.8, 0.19 and 0.01)
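The three values on this slide (0.8, 0.19, 0.01) can be read as the probabilities of three candidate placements of one read. Under that reading, a mapping quality can be sketched in Python as the Phred-scaled probability that the best placement is wrong. The cap of 60 and the function name are assumptions for this example, not the behavior of any particular mapper.

```python
import math

def mapping_quality(placement_probs, cap=60):
    """Phred-scaled probability that the best placement is wrong.

    `placement_probs` are the (possibly unnormalized) probabilities of
    each candidate map location for one read.
    """
    total = sum(placement_probs)
    best = max(placement_probs) / total
    p_wrong = 1.0 - best
    if p_wrong <= 0.0:
        # A single, unambiguous placement: report the capped maximum.
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))

print(mapping_quality([0.8, 0.19, 0.01]))  # 7
```

A read whose best placement has probability 0.8 therefore gets a low mapping quality, reflecting a one-in-five chance that it belongs elsewhere.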

Page 35: High throughput sequencing : informatics & software aspects

SNP calling

Page 36: High throughput sequencing : informatics & software aspects

SNP calling: what goes into it?

sequencing error

true polymorphism

Base qualities

Base coverage

Prior expectation

Page 37: High throughput sequencing : informatics & software aspects

Bayesian SNP calling

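$$
P(\mathrm{SNP}) = \sum_{\text{all variable } S}
\frac{
\dfrac{P(S_1 \mid R_1)}{P_{\mathrm{Prior}}(S_1)} \cdots \dfrac{P(S_N \mid R_N)}{P_{\mathrm{Prior}}(S_N)} \, P_{\mathrm{Prior}}(S_1, \ldots, S_N)
}{
\displaystyle \sum_{S_{i_1} \in [A,C,G,T]} \cdots \sum_{S_{i_N} \in [A,C,G,T]}
\dfrac{P(S_{i_1} \mid R_1)}{P_{\mathrm{Prior}}(S_{i_1})} \cdots \dfrac{P(S_{i_N} \mid R_N)}{P_{\mathrm{Prior}}(S_{i_N})} \, P_{\mathrm{Prior}}(S_{i_1}, \ldots, S_{i_N})
}
$$

Inputs: Base call + Base quality, Expected polymorphism rate, Base composition, Depth of coverage

Output: Bayesian posterior probability

(Figure: monomorphic permutations AAAAA, CCCCC, TTTTT, GGGGG contrasted with a polymorphic permutation)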
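The ingredients named on this slide (base calls with base qualities, base composition, an expected polymorphism rate and depth of coverage) can be combined in a brute-force Python sketch that enumerates every permutation of true bases across the reads. It is not the PolyBayes implementation: splitting each error probability equally among the three other bases, using a uniform base composition of 0.25, and spreading the polymorphism rate `theta` uniformly over all polymorphic permutations are simplifying assumptions made here.

```python
from itertools import product

BASES = "ACGT"

def base_posterior(call, error_prob):
    """P(S | R): the called base gets 1 - e, the other three share e."""
    return {b: (1 - error_prob if b == call else error_prob / 3) for b in BASES}

def snp_probability(calls, error_probs, theta=0.001):
    """Posterior probability that the aligned bases are polymorphic.

    Enumerates all 4**N permutations, so it is only usable at low depth.
    """
    n = len(calls)
    posteriors = [base_posterior(c, e) for c, e in zip(calls, error_probs)]
    n_poly = 4 ** n - 4  # number of polymorphic permutations
    numerator = 0.0
    denominator = 0.0
    for perm in product(BASES, repeat=n):
        polymorphic = len(set(perm)) > 1
        # Assumed prior: theta shared by polymorphic permutations,
        # 1 - theta shared by the four monomorphic ones.
        term = theta / n_poly if polymorphic else (1 - theta) / 4
        for s, post in zip(perm, posteriors):
            term *= post[s] / 0.25  # divide by the base-composition prior
        denominator += term
        if polymorphic:
            numerator += term
    return numerator / denominator

print(snp_probability("AAAA", [0.001] * 4))
print(snp_probability("AACC", [0.001] * 4))
```

With four high-quality reads that all agree, the posterior stays close to zero; with two reads supporting each of two alleles it rises close to one despite the small prior.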

Page 38: High throughput sequencing : informatics & software aspects

The PolyBayes software

http://bioinformatics.bc.edu/~marth/PolyBayes

Marth et al., Nature Genetics, 1999

• First statistically rigorous SNP discovery tool
• Correctly analyzes alternative cDNA splice forms

Page 39: High throughput sequencing : informatics & software aspects

SNP calling (continued)

Aligned base calls at one site, per sample:
sample 1 (B1 = aacc): -----a----- -----a----- -----c----- -----c-----
sample i (Bi = aaaac): -----a----- -----a----- -----a----- -----a----- -----c-----
sample n (Bn = cccc): -----c----- -----c----- -----c----- -----c-----

“genotype likelihoods”
P(B1=aacc|G1=aa) P(B1=aacc|G1=cc) P(B1=aacc|G1=ac)
P(Bi=aaaac|Gi=aa) P(Bi=aaaac|Gi=cc) P(Bi=aaaac|Gi=ac)
P(Bn=cccc|Gn=aa) P(Bn=cccc|Gn=cc) P(Bn=cccc|Gn=ac)

Prior(G1,..,Gi,.., Gn)

“genotype probabilities”
P(G1=aa|B1=aacc; Bi=aaaac; Bn=cccc) P(G1=cc|B1=aacc; Bi=aaaac; Bn=cccc) P(G1=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gi=aa|B1=aacc; Bi=aaaac; Bn=cccc) P(Gi=cc|B1=aacc; Bi=aaaac; Bn=cccc) P(Gi=ac|B1=aacc; Bi=aaaac; Bn=cccc)
P(Gn=aa|B1=aacc; Bi=aaaac; Bn=cccc) P(Gn=cc|B1=aacc; Bi=aaaac; Bn=cccc) P(Gn=ac|B1=aacc; Bi=aaaac; Bn=cccc)

P(SNP)
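A genotype likelihood such as P(B1=aacc|G1=ac) can be computed from the base calls and their error probabilities. The Python sketch below does this for a single diploid sample and then applies a prior to obtain genotype probabilities. Unlike the slide, which uses a joint prior over all samples, it treats one sample on its own with a flat prior. That simplification, the error model and the function names are assumptions for this example.

```python
def genotype_likelihood(bases, error_probs, genotype):
    """P(B | G) for a diploid genotype given as a two-letter string."""
    likelihood = 1.0
    for base, e in zip(bases, error_probs):
        # Each read samples one of the two alleles with probability 1/2;
        # a miscall lands on any of the three other bases equally often.
        p = 0.0
        for allele in genotype:
            p += 0.5 * ((1 - e) if base == allele else e / 3)
        likelihood *= p
    return likelihood

def genotype_posteriors(bases, error_probs, priors):
    """Combine likelihoods with per-genotype priors and normalize."""
    unnorm = {
        g: priors[g] * genotype_likelihood(bases, error_probs, g)
        for g in priors
    }
    total = sum(unnorm.values())
    return {g: v / total for g, v in unnorm.items()}

flat = {"aa": 1 / 3, "ac": 1 / 3, "cc": 1 / 3}
for calls in ("aacc", "aaaac", "cccc"):
    post = genotype_posteriors(calls, [0.01] * len(calls), flat)
    print(calls, max(post, key=post.get))
```

For the three samples on the slide this prefers the heterozygote for aacc and the cc homozygote for cccc. The single c among aaaac is enough to favor the heterozygote under a flat prior, which is why the choice of prior matters at low coverage.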

Page 40: High throughput sequencing : informatics & software aspects

Insertion/deletion (INDEL) variants

These variants have been on the “radar screen” for decades
Accurate automated detection is difficult

• Different mutation mechanisms
• Often appear in repetitive sequence and are therefore difficult to align
• Often multi-allelic
• Deleted allele has no base quality values

Page 41: High throughput sequencing : informatics & software aspects

Alignment methods became more refined

Original alignment

After left realignment

After haplotype-aware realignment
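Left realignment rests on the observation that, inside a repeat, the same deletion can be placed at several positions without changing the resulting sequence, so reads may report it inconsistently. A minimal Python sketch of shifting a deletion to its leftmost equivalent position follows. The function names and the 0-based coordinates are assumptions, and haplotype-aware realignment is not attempted here.

```python
def left_align_deletion(ref, pos, length):
    """Shift a deletion of `length` bases starting at 0-based `pos`
    as far left as possible without changing the resulting sequence."""
    # Moving one step left is allowed when the base entering the deleted
    # span on the left equals the base leaving it on the right.
    while pos > 0 and ref[pos - 1] == ref[pos + length - 1]:
        pos -= 1
    return pos

def apply_deletion(ref, pos, length):
    """Return the sequence with `length` bases removed at `pos`."""
    return ref[:pos] + ref[pos + length:]

ref = "GGCACACACTT"
print(left_align_deletion(ref, 6, 2))  # 2
print(apply_deletion(ref, 6, 2) == apply_deletion(ref, 2, 2))  # True
```

Deleting one CA unit at position 6 or at position 2 yields the same sequence, so normalizing every read to the leftmost placement makes the evidence for the variant line up.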

Page 42: High throughput sequencing : informatics & software aspects

Medium length INDELs still a problem

Guillermo Angel

Page 43: High throughput sequencing : informatics & software aspects

Structural variation detection

Feuk et al. Nature Reviews Genetics, 2006

Page 44: High throughput sequencing : informatics & software aspects

Structural variant detection (cont’d)

Page 45: High throughput sequencing : informatics & software aspects

Detection Approaches

• Read Depth: good for big CNVs
• Paired-end: all types of SV
• Split-Reads: good break-point resolution
• deNovo Assembly: ~ the future

(Figure labels: Sample, Reference, Lmap, read, contig)

SV slides courtesy of Chip Stewart, Boston College
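The paired-end approach compares the mapped distance between the two reads of a pair (Lmap on the slide) with the expected fragment length. A rough Python sketch of that classification is shown below. The fragment-length mean and standard deviation, the three-standard-deviation cutoff and the category labels are all assumptions for this example.

```python
def classify_pair(l_map, same_chrom=True, proper_orientation=True,
                  mean_frag=300, sd_frag=30, n_sd=3):
    """Classify one read pair by its mapped distance and orientation."""
    if not same_chrom:
        return "translocation candidate"
    if not proper_orientation:
        return "inversion candidate"
    if l_map > mean_frag + n_sd * sd_frag:
        # Reads map farther apart than the fragment allows: sequence
        # present in the reference is missing from the sample.
        return "deletion candidate"
    if l_map < mean_frag - n_sd * sd_frag:
        # Reads map closer together than expected: the sample carries
        # extra sequence between them.
        return "insertion candidate"
    return "concordant"

for l_map in (310, 1500, 100):
    print(l_map, classify_pair(l_map))
```

A single discordant pair is weak evidence. In practice a call would require a cluster of pairs supporting the same event.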

Page 46: High throughput sequencing : informatics & software aspects

SV detection – resolution

(Figure: relative numbers of events plotted against CNV event length [bp], showing expected CNVs and the ranges covered by karyotype, micro-array and sequencing)

Page 47: High throughput sequencing : informatics & software aspects

Standard data formats

Reads: FASTQ

Alignments: SAM/BAM

Variants: VCF
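To give a feel for two of these formats, here is a small Python sketch that reads four-line FASTQ records and the first five fixed columns of a VCF data line (CHROM, POS, ID, REF, ALT). It is a teaching sketch rather than a full parser: multi-line FASTQ, VCF header lines and the remaining VCF columns are ignored, and the read and variant shown are made-up examples.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) from four-line FASTQ records."""
    it = iter(lines)
    # zip over the same iterator consumes four lines per record.
    for header, seq, _plus, qual in zip(it, it, it, it):
        yield header[1:].strip(), seq.strip(), qual.strip()

def parse_vcf_line(line):
    """Parse the first five tab-separated columns of a VCF data line."""
    chrom, pos, vid, ref, alt = line.rstrip("\n").split("\t")[:5]
    return {
        "chrom": chrom,
        "pos": int(pos),  # 1-based position
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # several alternate alleles are allowed
    }

fastq_lines = ["@read1\n", "ACGT\n", "+\n", "IIII\n"]
print(list(parse_fastq(fastq_lines)))

vcf_line = "20\t14370\t.\tG\tA,T\t29\tPASS\t.\n"
print(parse_vcf_line(vcf_line))
```

For real data, the tools listed on the next slide handle compression, indexing and the many corner cases that this sketch skips.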

Page 48: High throughput sequencing : informatics & software aspects

Tools for analyzing & manipulating 1000G data

Alignments: SAM/BAM
• samtools: http://samtools.sourceforge.net/
• BamTools: http://sourceforge.net/projects/bamtools/
• GATK: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit

Variants: VCF
• VCFTools: http://vcftools.sourceforge.net/
• VcfCTools: https://github.com/AlistairNWard/vcfCTools