databases and tools for high throughput sequencing analysis

Databases and Tools for High Throughput Sequencing Analysis. P. Tang (鄧致剛); PJ Huang (黄栢榕), Bioinformatics Center, Chang Gung University.


Page 1: Databases and Tools for High Throughput Sequencing Analysis

Databases and Tools for High Throughput Sequencing Analysis

P. Tang (鄧致剛); PJ Huang (黄栢榕)

Bioinformatics Center, Chang Gung University.

Page 2: Databases and Tools for High Throughput Sequencing Analysis
Page 3: Databases and Tools for High Throughput Sequencing Analysis

HTseq Platforms

Page 4: Databases and Tools for High Throughput Sequencing Analysis
Page 5: Databases and Tools for High Throughput Sequencing Analysis

Applications on Biomedical Sciences

Page 6: Databases and Tools for High Throughput Sequencing Analysis

Analysis Strategies: Reference Sequence Alignment (Mapping) vs De novo Assembly

or transcriptome

Page 7: Databases and Tools for High Throughput Sequencing Analysis

HTseq Experiment

Page 8: Databases and Tools for High Throughput Sequencing Analysis

Great… I got my data, now what…

• Data and information management is slowly moving out of infancy in genomics science… it is at the toddler stage…

• The Good news

– Some data formats are being accepted widely

• The Bad news

– Still many competing standards in some areas

– Interoperability of data standards is almost non‐existent

– Governance is questionable

Page 9: Databases and Tools for High Throughput Sequencing Analysis

Storage & Computing Power

Next-generation sequencers generate gigabases (Gbp) to terabases (Tbp) of data

Page 10: Databases and Tools for High Throughput Sequencing Analysis

Data Format Types

• Raw Sequence Data e.g. fasta

• Aligned data e.g. BAM

• Processed data e.g. BED

Page 11: Databases and Tools for High Throughput Sequencing Analysis

Interpreting raw data

Page 12: Databases and Tools for High Throughput Sequencing Analysis

How deep should we go? Coverage

(a) 80% of yeast genes (genome size: ~12 Mb) were detected at 4 million uniquely mapped RNA‐Seq reads, and coverage reaches a plateau afterwards despite the increasing sequencing depth. Expressed genes are defined as having at least four independent reads from a 50‐bp window at the 3' end.

(b) The number of unique start sites detected starts to reach a plateau when the depth of sequencing reaches 80 million in two mouse transcriptomes. ES, embryonic stem cells; EB, embryonic body.

Nature Reviews Genetics 10, 57‐63 
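A common back-of-the-envelope way to relate read count to depth is the expected average coverage, C = N × L / G (number of reads × read length / size of the sequenced target). The sketch below illustrates that arithmetic only; it is not taken from the slides, and the function names are hypothetical. For RNA‐Seq the relevant target is the expressed transcriptome rather than the whole genome, so treat the result as a rough guide.

```python
import math


def expected_coverage(num_reads, read_length, target_size):
    """Average depth: total sequenced bases divided by the target size (in bases)."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    return num_reads * read_length / target_size


def reads_needed(coverage, read_length, target_size):
    """Smallest number of reads that reaches the requested average depth."""
    if read_length <= 0:
        raise ValueError("read_length must be positive")
    return math.ceil(coverage * target_size / read_length)


# Illustrative numbers only (assumed, not from the slides):
# one million 100-bp reads over a 10 Mb target.
print(expected_coverage(1_000_000, 100, 10_000_000))
```

Average coverage says nothing about how evenly reads are distributed, which is why saturation curves like the ones described above are still needed.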

Page 13: Databases and Tools for High Throughput Sequencing Analysis

Genome Size

De novo assembled rice transcriptome: 1.3 Gb RNA‐Seq data (genome size: ~400 Mb)

85% of assembled unigenes were covered by gene models

Page 14: Databases and Tools for High Throughput Sequencing Analysis

HTseq Raw Data Format

• fasta (Sanger)

• csfasta (SOLiD)

• fastq (Solexa)

• sff (454)

• … and about 30 other file formats

• http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
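With this many formats in circulation, a quick look at the first bytes of a file is often enough to tell the common ones apart. The following is a minimal sketch, assuming the usual conventions: binary SFF files begin with the magic bytes `.sff`, FASTQ headers start with `@`, FASTA headers start with `>`, and csfasta reads consist of a leading base followed by color digits. The function name `guess_format` is hypothetical and the heuristic is deliberately simple.

```python
import re


def guess_format(first_bytes: bytes) -> str:
    """Guess a sequence file format from the first bytes of the file."""
    if first_bytes.startswith(b".sff"):
        return "sff"
    text = first_bytes.decode("ascii", errors="replace")
    # Ignore blank lines and '#' comment lines (common in csfasta files).
    lines = [ln for ln in text.splitlines() if ln and not ln.startswith("#")]
    if not lines:
        return "unknown"
    if lines[0].startswith("@"):
        return "fastq"
    if lines[0].startswith(">"):
        # Color-space reads look like 'T0123...': one base, then digits 0-3 or '.'.
        if len(lines) > 1 and re.fullmatch(r"[ACGT][0-3.]+", lines[1]):
            return "csfasta"
        return "fasta"
    return "unknown"


print(guess_format(b">read1\nT0123\n"))
```

A heuristic like this cannot distinguish FASTQ quality encodings; that needs a look at the range of quality characters.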

Page 15: Databases and Tools for High Throughput Sequencing Analysis

SOLiD Color Space

Page 16: Databases and Tools for High Throughput Sequencing Analysis

(cs)Fasta/(cs)Fastq

• FASTA

– Header line “>”

– Sequence

• FASTQ

– Add QVs encoded as single byte ASCII codes

• Most aligners accept FASTA/Q as input

• Issue: data is voluminous (2 bytes per base for FASTQ)

• Do PHRED scaled values provide the most information?
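The record layout described above can be made concrete with a small reader. This is a minimal sketch, assuming the common four-lines-per-record FASTQ layout (header starting with `@`, sequence, a `+` separator line, quality string); it does not handle multi-line records, and the names `FastqRecord` and `read_fastq` are hypothetical.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-lines-per-record FASTQ stream."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = handle.readline().rstrip("\r\n")
        plus = handle.readline().rstrip("\r\n")
        qual = handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(seq) != len(qual):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], seq, qual)


example = io.StringIO("@read1\nACGT\n+\nhhhg\n")
records = list(read_fastq(example))
print(records[0].sequence, records[0].quality)
```

The length check reflects the "2 bytes per base" point: every base carries exactly one quality character.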

Page 17: Databases and Tools for High Throughput Sequencing Analysis

Fastq: Illumina & Sanger

Page 18: Databases and Tools for High Throughput Sequencing Analysis

Fastq: Illumina & NCBI

Page 19: Databases and Tools for High Throughput Sequencing Analysis

sff (text format): 454

Page 20: Databases and Tools for High Throughput Sequencing Analysis

454 fasta with quality file

Page 21: Databases and Tools for High Throughput Sequencing Analysis

454 base quality?

Page 22: Databases and Tools for High Throughput Sequencing Analysis

All Platforms have Errors

Illumina, SOLiD/ABI‐Life, Roche 454, Ion Torrent

1. Removal of low quality bases / low complexity regions

2. Removal of adaptor sequences

3. Homopolymer-associated base call errors (3 or more identical DNA bases) cause a higher number of (artificial) frameshifts
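To see where homopolymer-associated errors are most likely, it can help to locate runs of three or more identical bases in a read. The sketch below is one simple way to do that with a regular expression; the function name `find_homopolymers` is hypothetical and the threshold of 3 follows the definition above.

```python
import re


def find_homopolymers(sequence, min_length=3):
    """Return (start, base, run_length) for each run of identical bases."""
    if min_length < 1:
        raise ValueError("min_length must be at least 1")
    # ([ACGT]) captures one base; \1{n,} requires at least n further copies.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_length - 1))
    return [
        (m.start(), m.group(1), len(m.group(0)))
        for m in pattern.finditer(sequence.upper())
    ]


print(find_homopolymers("ACGTTTTAGGGC"))
```

Reads with long runs reported here are the ones worth inspecting first when unexpected frameshifts appear in 454 or Ion Torrent data.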

Page 23: Databases and Tools for High Throughput Sequencing Analysis

Trace File

High quality region ‐ NO ambiguities (Ns)

Medium quality region ‐ SOME ambiguities (Ns)

Poor quality region ‐ LOW confidence

Page 24: Databases and Tools for High Throughput Sequencing Analysis

Quality Control Is Essential

Page 25: Databases and Tools for High Throughput Sequencing Analysis

Assessing Quality: phred scores

Page 26: Databases and Tools for High Throughput Sequencing Analysis

Assessing Quality: phred scores
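A phred score Q relates to the estimated probability P that a base call is wrong by Q = −10 · log10(P), so Q20 corresponds to a 1 in 100 error rate and Q30 to 1 in 1000. The helper names below are hypothetical; the sketch only illustrates the conversion in both directions.

```python
import math


def phred_to_error_prob(q):
    """Error probability implied by a phred quality score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred quality score implied by an error probability in (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_prob(q))
```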

Page 27: Databases and Tools for High Throughput Sequencing Analysis

454 output formats

Standard flowgram format

.sff

.fna

.qual

Page 28: Databases and Tools for High Throughput Sequencing Analysis

Illumina output formats

.seq.txt

.prb.txt

Illumina FASTQ (ASCII – 64 is Illumina score)

Qseq (ASCII – 64 is Phred score)

Phred quality scores

Illumina single line format

SCARF: Solexa Compact ASCII Read Format

Page 29: Databases and Tools for High Throughput Sequencing Analysis

Illumina FastQ

• ASCII value for h = 104

• Quality of base A at position 1 = 104 ‐ 64

• 104 ‐ 64 = 40

• Where 40 is the phred score
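The decoding step can be checked directly in code. This is a minimal sketch, assuming the offset of 64 used by the Illumina FASTQ variant described here and the offset of 33 used by Sanger/NCBI FASTQ; the function name `decode_qualities` is hypothetical. Note that `ord('h')` is 104, so `h` decodes to 40 with offset 64, while `g` (103) decodes to 39.

```python
def decode_qualities(quality_string, offset=64):
    """Convert a FASTQ quality string into a list of integer scores."""
    scores = [ord(ch) - offset for ch in quality_string]
    if any(score < 0 for score in scores):
        raise ValueError("negative score: the offset is probably wrong")
    return scores


print(decode_qualities("hg"))             # Illumina-style, offset 64
print(decode_qualities("I", offset=33))   # Sanger/NCBI-style, offset 33
```

Using the wrong offset shifts every score by 31, so it is worth confirming the encoding before any quality-based filtering.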

Page 30: Databases and Tools for High Throughput Sequencing Analysis

Quality Control

Read quality distribution

Library insert size

Mapping Rate

Duplication assessment
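Two of these checks, the read quality distribution and the duplication assessment, can be approximated with a few lines of code. The sketch below is illustrative only: it assumes quality strings with an offset of 64 and treats reads with identical sequence as duplicates, which is a simplification of what dedicated QC tools do. The function names are hypothetical.

```python
from collections import Counter


def mean_quality_per_position(quality_strings, offset=64):
    """Mean quality score at each read position (reads may differ in length)."""
    sums, counts = [], []
    for quality in quality_strings:
        for i, ch in enumerate(quality):
            if i == len(sums):
                sums.append(0)
                counts.append(0)
            sums[i] += ord(ch) - offset
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def duplication_rate(sequences):
    """Fraction of reads that repeat an already seen sequence."""
    counts = Counter(sequences)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return 1 - len(counts) / total


print(mean_quality_per_position(["hh", "hd"]))
print(duplication_rate(["ACGT", "ACGT", "TTTT", "GGGG"]))
```

Insert size and mapping rate need aligned reads, so they are normally computed after mapping rather than from the raw FASTQ.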

Page 31: Databases and Tools for High Throughput Sequencing Analysis

Quality Control Tools

Page 32: Databases and Tools for High Throughput Sequencing Analysis

NGS QC Toolkit & FastQC

NGS QC Toolkit is for quality check and filtering of high‐quality reads

This toolkit is a standalone and open source application freely available at http://www.nipgr.res.in/ngsqctoolkit.html

The application has been implemented in the Perl programming language

QC of sequencing data generated using Roche 454 and Illumina platforms

Additional tools to aid QC (sequence format converter and trimming tools) and analysis (statistics tools)

FastQC can be used only for preliminary analysis
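The idea of filtering for high‐quality reads can be sketched as follows: keep a read only if a large enough fraction of its bases reaches a minimum phred score. This is an illustration of the general technique, not the implementation used by NGS QC Toolkit or FastQC; the cutoff of Q20, the 70% fraction, the offset of 64, and the function name are all assumptions chosen for the example.

```python
def passes_quality_filter(quality_string, offset=64, min_score=20, min_fraction=0.7):
    """True if at least min_fraction of the bases have a score >= min_score."""
    if not quality_string:
        return False  # empty reads never pass
    good = sum(1 for ch in quality_string if ord(ch) - offset >= min_score)
    return good / len(quality_string) >= min_fraction


reads = {"r1": "hhhh", "r2": "hhhB", "r3": "hhBB"}
kept = [name for name, qual in reads.items() if passes_quality_filter(qual)]
print(kept)
```

Stricter or looser thresholds trade read count against per-base reliability, so the values should be chosen with the downstream analysis in mind.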

Page 33: Databases and Tools for High Throughput Sequencing Analysis
Page 34: Databases and Tools for High Throughput Sequencing Analysis
Page 35: Databases and Tools for High Throughput Sequencing Analysis

http://www.ncbi.nlm.nih.gov/geo/

Page 36: Databases and Tools for High Throughput Sequencing Analysis

http://www.ncbi.nlm.nih.gov/gds/

expression profiling by array
expression profiling by genome tiling array
expression profiling by high throughput sequencing
expression profiling by mpss
expression profiling by rt pcr
expression profiling by sage
expression profiling by snp array
genome binding/occupancy profiling by array
genome binding/occupancy profiling by genome tiling array
genome binding/occupancy profiling by high throughput sequencing
genome binding/occupancy profiling by snp array
genome variation profiling by array
genome variation profiling by genome tiling array
genome variation profiling by high throughput sequencing
genome variation profiling by snp array
methylation profiling by array
methylation profiling by genome tiling array
methylation profiling by high throughput sequencing
methylation profiling by snp array
non coding rna profiling by array
non coding rna profiling by genome tiling array
non coding rna profiling by high throughput sequencing
other
protein profiling by mass spec
protein profiling by protein array
snp genotyping by snp array
third party reanalysis
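A GEO DataSets search such as the `"Illumina Genome Analyzer" AND smallRNA` query used in these slides can also be issued programmatically through the NCBI E-utilities ESearch endpoint. The sketch below only builds the request URL and does not contact NCBI; it assumes the `gds` database name (matching the /gds/ address above), and the helper name and `retmax` default are hypothetical choices for illustration.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_gds_search_url(term, retmax=20):
    """Build an ESearch URL that queries GEO DataSets (db=gds) for a term."""
    params = {"db": "gds", "term": term, "retmax": retmax}
    return ESEARCH_URL + "?" + urlencode(params)


url = build_gds_search_url('"Illumina Genome Analyzer" AND smallRNA')
print(url)
```

Fetching the URL returns a list of matching record identifiers; NCBI's usage guidelines on request rates apply when doing this in bulk.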

Page 37: Databases and Tools for High Throughput Sequencing Analysis
Page 38: Databases and Tools for High Throughput Sequencing Analysis

"Illumina Genome Analyzer" AND smallRNA

Page 39: Databases and Tools for High Throughput Sequencing Analysis
Page 40: Databases and Tools for High Throughput Sequencing Analysis
Page 41: Databases and Tools for High Throughput Sequencing Analysis
Page 42: Databases and Tools for High Throughput Sequencing Analysis
Page 43: Databases and Tools for High Throughput Sequencing Analysis

http://seqanswers.com/