introduction to next-generation sequencing data and related bioinformatic analysis

Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis Han Liang, Ph.D. Department of Bioinformatics and Computational Biology 3/25/2014 @ Rice University


Page 1: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Introduction to Next-Generation Sequencing Data and Related Bioinformatic Analysis

Han Liang, Ph.D.
Department of Bioinformatics and Computational Biology
3/25/2014 @ Rice University

Page 2: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Outline

• History
• NGS Platforms
• Applications
• Bioinformatics Analysis
• Challenges

Page 3: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Central Dogma

Page 4: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Sanger sequencing

• DNA is fragmented
• Cloned to a plasmid vector
• Cyclic sequencing reaction
• Separation by electrophoresis
• Readout with fluorescent tags

Page 5: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Sanger sequencing was the dominant DNA sequencing method for about 30 years, but demand grew for greater sequencing throughput and more economical sequencing technology.

NGS can process millions of sequence reads in parallel rather than 96 at a time (at about 1/6 of the cost).

Objections: fidelity, read length, infrastructure cost, and handling large volumes of data.

Page 6: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Roche/454 FLX: 2004
• Illumina Solexa Genome Analyzer: 2006
• Applied Biosystems SOLiD™ System: 2007
• Helicos HeliScope™: recently available
• Pacific Biosciences SMRT: launching 2010

Page 7: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Rapidly Falling Cost

Page 8: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Three Leading Sequencing Platforms

• Roche 454
• Illumina Solexa
• Applied Biosystems SOLiD

Page 9: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

The general experimental procedure

Wang et al. Nature Reviews Genetics 2009

Page 10: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

454

bead microreactor

Mardis, Annu. Rev. Genomics Hum. Genet. 2008

Page 11: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 12: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Illumina (Solexa)

Bridge amplification

Mardis, Annu. Rev. Genomics Hum. Genet. 2008

Page 13: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

SOLiD

color coding

Mardis, Annu. Rev. Genomics Hum. Genet. 2008

Page 14: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Comparison of existing methods

Page 15: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Real Data – nucleotide space

@SRR002051.1 :8:1:325:773 length=33
AAAGAACATTAAAGCTATATTATAAGCAAAGAT
+SRR002051.1 :8:1:325:773 length=33
IIIIIIIIIIIIIIIIIIIIIIIII'II@I$)…
@SRR002051.2 :8:1:409:432 length=33
AAGTTATGAAATTGTAATTCCAATATCGTAAGC
+SRR002051.2 :8:1:409:432 length=33
…
@SRR002051.3 :8:1:488:490 length=33
AATTTCTTACCATATTAGACAAGGCACTATCTT
+SRR002051.3 :8:1:488:490 length=33
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIII&I
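Nucleotide-space reads like these are stored as four-line FASTQ records: a header starting with "@", the bases, a "+" line, and one quality character per base. The sketch below is a minimal illustration, not the pipeline used in the talk. It assumes the Phred+33 quality encoding and a strict four-lines-per-read layout, and it uses a made-up toy read; the names `FastqRecord`, `parse_fastq`, and `phred_scores` are hypothetical.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def parse_fastq(lines: List[str]) -> Iterator[FastqRecord]:
    """Yield records from a listing with exactly four lines per read."""
    for i in range(0, len(lines) - 3, 4):
        header, seq, plus, qual = (line.rstrip("\n") for line in lines[i:i + 4])
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record starting at line {i + 1}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header}")
        yield FastqRecord(header[1:], seq, qual)


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to Phred scores (assumes Phred+33 encoding)."""
    return [ord(ch) - offset for ch in quality]


# Toy read (hypothetical, not taken from the run shown on the slide).
toy_lines = [
    "@toy_read_1 length=10",
    "ACGTACGTAC",
    "+toy_read_1 length=10",
    "I" * 7 + "&I#",
]

records = list(parse_fastq(toy_lines))
scores = phred_scores(records[0].quality)
print(records[0].name, scores)
```

Under the Phred+33 assumption, "I" corresponds to a score of 40 (an estimated error probability of 1 in 10,000), while characters such as "&" or "#" mark low-confidence base calls.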

Page 16: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Real Data – color space

• SOLiD Data

>1_24_47_F3
T1.1.23..0120230.320033300030030010022.00.0201.0201
>1_24_52_F3
T2.3.21..2122321.213110332101132321002.11.0111.1222
>1_24_836_F3
T0.2.22..2222222.010203032021102220200.01.2211.2211
>1_24_1404_F3
T2.3.30..2013222.222103131323012313233.22.2220.0213
>1_25_202_F3
T0.3213.111202312203021101111330201000313.121122211
>1_25_296_F3
T0.1130.100123202213120023121112113212121.013301210
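In color space, each digit encodes the transition between two adjacent bases rather than a base itself, and the leading letter is the known primer base. The sketch below shows one way to translate such a read into nucleotides using the standard two-base color table. It is an illustration only: the function name `decode_color_space` is hypothetical, and the choice to emit "N" for everything after a missing color (".") is an assumption made here, since one missing color leaves all later bases ambiguous.

```python
# Two-base colour code: 0 keeps the base, 1 swaps A<->C and G<->T,
# 2 swaps A<->G and C<->T, 3 swaps A<->T and C<->G.
TRANSITIONS = {
    "0": {"A": "A", "C": "C", "G": "G", "T": "T"},
    "1": {"A": "C", "C": "A", "G": "T", "T": "G"},
    "2": {"A": "G", "G": "A", "C": "T", "T": "C"},
    "3": {"A": "T", "T": "A", "C": "G", "G": "C"},
}


def decode_color_space(read: str) -> str:
    """Translate 'primer base + colours' into nucleotides.

    The primer base itself is not part of the output. After the first
    missing colour ('.'), the remaining positions are reported as 'N'
    (an assumption of this sketch).
    """
    previous = read[0]
    colors = read[1:]
    bases = []
    for pos, color in enumerate(colors):
        if color not in TRANSITIONS:
            bases.extend("N" * (len(colors) - pos))
            break
        previous = TRANSITIONS[color][previous]
        bases.append(previous)
    return "".join(bases)


print(decode_color_space("T0123"))  # complete colour calls
print(decode_color_space("T2.3"))   # a missing call truncates the decoding
```

Because every base depends on all colors before it, a single color error corrupts the rest of a naively decoded read. This is one reason color-space reads are usually aligned in color space and converted to nucleotides only afterwards.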

Page 17: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Data output difference among the three platforms

• Nucleotide space vs. color space
• Length of short reads: 454 (400~500 bp) > SOLiD (70 bp) ~ Solexa (36~120 bp)

Page 18: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Applications with “Digital output”

• De novo genome assembly
• Genome re-sequencing
• RNA-Seq (gene expression, exon-intron structure, small RNA profiling, and mutation)
• ChIP-Seq (protein-DNA interaction)
• Epigenetic profiling

Page 19: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Degraded state of the sample: mtDNA sequencing
• Nuclear genomes of ancient remains: cave bear, mammoth, Neanderthal (10⁶ bp)

Problems: contamination with modern human DNA and co-isolation of bacterial DNA

Page 20: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 21: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Key part in regulating gene expression

• ChIP: a technique to study DNA-protein interactions

• Recently genome-wide ChIP-based studies of DNA-protein interactions

• Readout of ChIP-derived DNA sequences onto NGS platforms

• Insights into transcription factor/histone binding sites in the human genome

• Enhance our understanding of the gene expression in the context of specific environmental stimuli

Page 22: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• The presence of ncRNAs in a genome is difficult to predict with high certainty by computational methods because of their evolutionary diversity

• Detecting expression level changes that correlate with changes in environmental factors, with disease onset and progression, or with complex disease onset or severity

• Enhance the annotation of sequenced genomes (impact of mutations more interpretable)

Page 23: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Characterizing the biodiversity found on Earth
• The growing number of sequenced genomes enables us to interpret partial sequences obtained by direct sampling of specific environmental niches.
• Examples: ocean, acid mine site, soil, coral reefs, and the human microbiome, which may vary according to the health status of the individual

Page 24: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Common variants have not yet completely explained complex disease genetics; rare alleles also contribute

• Also structural variants, large and small insertions and deletions

• Accelerating biomedical research

Page 25: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

• Enables the study of genome-wide patterns of methylation and how these patterns change through the course of an organism’s development.

• Enhanced potential to combine the results of different experiments, correlative analyses of genome-wide methylation, histone binding patterns and gene expression, for example.

Page 26: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Kahvejian et al. 2008

Integrating Omics

mRNA expression

Alternative Splicing

microRNA expression

Protein-DNA interaction

Mutation discovery

Copy number variation

Page 27: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Data Analysis Flow

SOLiD machine: raw data

Central server: basic processing (decoding, filtering, and mapping)

Local machine: downstream analysis

Page 28: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Short Read Mapping

• DNA resequencing: BLAST-like approach
• RNA-Seq

Page 29: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 30: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 31: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Read length and pairing

• Short reads are problematic, because short sequences do not map uniquely to the genome.

• Solution #1: Get longer reads.
• Solution #2: Get paired reads.

ACTTAAGGCTGACTAGC TCGTACCGATATGCTG
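The effect of read length and pairing on mapping uniqueness can be seen on a toy example. The sketch below is an illustration under strong simplifying assumptions: it uses exact string matching only (real aligners tolerate mismatches and use indexes), a made-up 32 bp reference containing a repeat, both mates given in the same orientation, and a start-to-start distance window standing in for the library insert size. The names `find_all` and `paired_hits` are hypothetical.

```python
from typing import List, Tuple


def find_all(reference: str, read: str) -> List[int]:
    """Return every start position where the read matches exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions


def paired_hits(reference: str, left: str, right: str,
                min_dist: int, max_dist: int) -> List[Tuple[int, int]]:
    """Keep placements where the right mate starts min_dist..max_dist
    bases downstream of the left mate (start-to-start distance)."""
    return [
        (l, r)
        for l in find_all(reference, left)
        for r in find_all(reference, right)
        if min_dist <= r - l <= max_dist
    ]


# Toy reference: the repeat "ACGTTGCA" occurs twice.
repeat = "ACGTTGCA"
reference = repeat + "GGATCCGG" + repeat + "TTAACCTT"

short_hits = find_all(reference, "ACGT")        # short read inside the repeat
long_hits = find_all(reference, "ACGTTGCATT")   # longer read spans the repeat boundary
pair_hits = paired_hits(reference, "ACGT", "AACC", 8, 12)

print(short_hits, long_hits, pair_hits)
```

The short read lands in both copies of the repeat. The longer read extends into unique flanking sequence, and the mate constraint rules out the placement whose distance does not fit the expected window.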

Page 32: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Post-alignment Analysis

• DNA-Seq
  • SNP calling
• RNA-Seq
  • Quantifying gene expression level

Page 33: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Concepts

The reference genome: hg19 (GRCh37)
Main assembly: Chr1-22, X, and Y
3,095,677,412 bp

Target region: exome

Ensembl: 85.3 million bp (2.94%)
RefSeq: 67.7 million bp (2.34%)
CCDS: 31,266,049 bp (1.08%), consisting of 185,446 non-redundant exons

Page 34: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Target Coverage

Page 35: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 36: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

SOLiD

color coding

Mardis, Annu. Rev. Genomics Hum. Genet. 2008

Page 37: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

SNP calling

Page 38: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis
Page 39: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Array-based High-throughput Dataset

Page 40: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Limitations of hybridization-based approach

• Reliance on existing knowledge about genome sequence

• Background noise and a limited dynamic detecting range

• Cross-experiment comparison is difficult
• Requiring complicated normalization methods

Wang et al. Nature Reviews Genetics 2009

Page 41: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Quantifying gene expression using RNA-Seq data

RPKM: Reads Per Kilobase of exon model per Million mapped reads
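As a worked illustration of the definition, RPKM divides a gene's read count by its exon length in kilobases and by the total number of mapped reads in millions. The sketch below uses made-up numbers, and the function name `rpkm` is hypothetical.

```python
def rpkm(gene_reads: int, exon_length_bp: int, total_mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads.

    Equivalent to gene_reads / (exon_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    if exon_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("exon length and total mapped reads must be positive")
    return gene_reads * 1e9 / (exon_length_bp * total_mapped_reads)


# Hypothetical gene: 1,500 reads on 2,000 bp of exons, 75 million mapped reads in total.
value = rpkm(gene_reads=1500, exon_length_bp=2000, total_mapped_reads=75_000_000)
print(value)
```

Normalizing by exon length makes long and short genes comparable within a sample, and normalizing by the total number of mapped reads makes samples sequenced to different depths comparable.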

Page 42: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Large Dynamic Range

Mortazavi et al. Nature Methods 2008

Page 43: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

High reproducibility

Mortazavi et al. Nature Methods 2008

Page 44: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

High Accuracy

Wang et al. Nature 2008

Page 45: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Advantages of RNA-Seq

• Not limited to the existing genomic sequence
• Very low (if any) background signal
• Large dynamic detecting range
• High reproducibility
• High accuracy
• Requires less sample
• Low cost per base

Wang et al. Nature Reviews Genetics 2009

Page 46: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Huge amount of data!

• For a typical RNA-Seq SOLiD run:
  ~ 2 TB of image files
  ~ 120 GB of text files for downstream analysis
  ~ 75 M short reads per sample

Efficient methods for data storage and management

Page 47: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Considerable sequencing error

High-quality image analysis for base calling

Page 48: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Genome alignment and assembly: time consuming and memory demanding

• To perform genome mapping for SOLiD data

32-Opteron HP DL785 with 128 GB of RAM: 12~14 hours per sample

High-performance parallel computing

Page 49: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Bioinformatics Challenges

• Efficient methods to store, retrieve and process huge amount of data

• To reduce errors in image analysis and base calling

• Fast and accurate methods for genome alignment and assembly

• New algorithms in downstream analyses

Page 50: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Experimental Challenges

Library fragmentation

Strand specific

Wang et al. Nature Reviews Genetics 2009

Page 51: Introduction  to Next-Generation Sequencing Data  and Related Bioinformatic Analysis

Question & Answer

Han Liang
E-mail: [email protected]: 713-745-9815