• Different Next-Generation Sequencing technologies:
Day 1: Introduction to NGS
Technology            Read length [bp]    Throughput
454                   400                 16 bp/day
Illumina              100 (150 GAIIx)     2x27 Gbp/day
SOLiD                 75                  20-30 Gbp/day
Ion Torrent           200                 16 bp/run (=2h)
Pacific Biosciences   ?                   ?
Third-generation sequencing: does not require DNA amplification
• De novo sequencing
• Population genetics by re-sequencing
– Sequencing of individuals
– Pool-Seq
• Mapping
– RAD-Seq
– NGS-speed-mapping
• Genome evolution
– CNV
– Inversions
• Expression profiling
– mRNA
– micro RNA
• ChIP-Seq
• Amplicon sequencing
– Biodiversity surveys
Day 1: NGS applications
• Introduction to some basic Unix commands
Day 1: Unix Basics
The absolute basics: ls, cd, pwd
Basic file control: mv, cp, mkdir, rmdir, rm, | (pipe), > (write to file), < (read from file)
Viewing/creating/editing files: less, head, tail, touch, nano
Misc. useful commands: man, chmod, source, wc, curl, diff
Power commands: uniq, sort, cut, tr, grep, sed
Process-related commands: top, ps, kill
• MAQ & ELAND:
– Build hash table for reads
– Map reference to reads
• Bowtie and BWA
– Burrows-Wheeler index
– faster
Day 2: Mapping reads
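To make the Burrows-Wheeler idea more concrete, here is a minimal sketch of exact read matching with a toy Burrows-Wheeler index. It is not how Bowtie or BWA are implemented (they use compressed indexes and allow mismatches); the function names `bwt_index` and `find_exact` and the example sequences are made up for illustration.

```python
def bwt_index(reference):
    """Build a toy Burrows-Wheeler index for a short reference string."""
    text = reference + "$"  # '$' marks the end and sorts before A, C, G, T
    # Suffix array: start positions of all suffixes in sorted order.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in suffix_array)
    alphabet = sorted(set(text))
    # first[c]: number of characters in the text that sort before c.
    first, total = {}, 0
    for c in alphabet:
        first[c] = total
        total += text.count(c)
    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {c: [0] for c in alphabet}
    for ch in bwt:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    return suffix_array, first, occ


def find_exact(read, index):
    """Return all reference positions where the read matches exactly."""
    suffix_array, first, occ = index
    lo, hi = 0, len(suffix_array)  # half-open range of matching suffixes
    for c in reversed(read):       # backward search: last base first
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(suffix_array[i] for i in range(lo, hi))


index = bwt_index("ACGTACGTTA")
print(find_exact("ACGT", index))
print(find_exact("GTT", index))
```

The index is built once per reference, and each read is then matched in a number of steps proportional to its length, which is one reason index-based mappers scale well to many reads.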
• Raw SNP and indel calling: Samtools
• Local re-alignment: GATK/Dindel
• Multiple alignment across individuals: Samtools/GATK
Day 2: SNP and Indel calling
• Read depth
• Read pair (e.g. BreakDancer)
• Split reads (e.g. Pindel)
Day 2: Structural variation calling
Deletion
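The read-pair approach can be sketched as follows: read pairs that span a deletion map farther apart on the reference than the library insert size predicts. This is a simplified illustration, not BreakDancer's algorithm; the function name, the `n_sd` cutoff, and the example numbers are assumptions.

```python
def flag_deletion_pairs(insert_sizes, expected_mean, expected_sd, n_sd=3.0):
    """Return indices of read pairs whose mapped insert size is unusually large.

    expected_mean / expected_sd describe the library insert-size distribution
    (assumed to be known, e.g. from the bulk of properly mapped pairs).
    """
    cutoff = expected_mean + n_sd * expected_sd
    return [i for i, size in enumerate(insert_sizes) if size > cutoff]


# Hypothetical mapped insert sizes for five read pairs (in bp).
sizes = [300, 310, 295, 900, 305]
print(flag_deletion_pairs(sizes, expected_mean=300, expected_sd=20))
```

A real caller would additionally require several such pairs clustering at the same location before reporting a deletion.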
• Different approaches for SNP identification
– Sequence individuals separately
– Sequence a pool of individuals
Day 3: Population Genomics - PoolSeq
Identify SNPs from consensus
Identify SNPs from pool
• Sequence individuals separately
– Preserves haplotype information
– Singletons may be identified
– Creates redundant information (inefficient)
• Sequence pools of individuals
– Haplotype information is lost
– Identification of singletons is difficult (high error rate; introduce minimum allele counts)
– No redundant sequence information is created, so more SNPs may be identified
– Requires a correction of standard population genetics estimators (singletons and multiple samplings of identical sequences)
Day 3: Population Genomics - PoolSeq
Advantage of pooling increases with pool size!
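A minimal sketch of the minimum allele count idea for pooled data: alleles seen fewer than `min_count` times at a position are treated as likely sequencing errors and dropped before frequencies are estimated. The function names, the default `min_count=2`, and the example pileups are illustrative assumptions, not taken from a specific Pool-Seq tool.

```python
from collections import Counter


def pool_allele_frequencies(bases, min_count=2):
    """Estimate allele frequencies at one position from pooled reads.

    bases: string of base calls covering the position, e.g. "AAAATT".
    Alleles with fewer than min_count supporting reads are discarded.
    """
    counts = Counter(b for b in bases.upper() if b in "ACGT")
    kept = {b: n for b, n in counts.items() if n >= min_count}
    total = sum(kept.values())
    if total == 0:
        return {}
    return {b: n / total for b, n in kept.items()}


def is_snp(bases, min_count=2):
    """Call a SNP if at least two alleles pass the minimum count."""
    return len(pool_allele_frequencies(bases, min_count)) >= 2


print(pool_allele_frequencies("AAAAAAAATT"))  # two alleles pass the filter
print(is_snp("AAAAAAAAAT"))                   # the single T is discarded
```

Note the trade-off shown by the second call: with a minimum count of 2, a true singleton cannot be distinguished from an error, which is why singletons are hard to identify in pools.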
Problematic regions for SNP identification:
• Repetitive (ambiguous) regions: exclude ambiguously mapped reads
• Copy number variations: exclude high coverage regions
– High coverage at SNP
– Balanced allele frequencies (e.g.: 50% A, 50% T)
Day 3: Population Genomics - PoolSeq
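Excluding high coverage regions can be sketched as a simple depth filter. The rule used here (cutoff = a multiple of the median coverage) and the `(position, coverage)` input format are assumptions for illustration; in practice the cutoff is chosen per data set.

```python
from statistics import median


def coverage_cutoff(sites, factor=2.0):
    """Hypothetical rule: maximum allowed coverage = factor * median coverage."""
    return factor * median(cov for _, cov in sites)


def filter_sites(sites, max_coverage):
    """Keep only sites whose coverage does not exceed max_coverage.

    sites: list of (position, coverage) tuples.
    """
    return [(pos, cov) for pos, cov in sites if cov <= max_coverage]


# Hypothetical candidate SNP sites; position 300 has suspiciously high depth.
sites = [(100, 48), (250, 52), (300, 160), (420, 50)]
cutoff = coverage_cutoff(sites)
print(filter_sites(sites, cutoff))
```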
Aligning to the reference transcriptome:
• Pros:
– Quick
– Reduced reference complexity
– The output is easy to interpret
• Cons:
– Multiple transcripts exist for the same gene
– Inability to identify novel transcribed regions
Day 4: RNA-seq
Aligning to the reference genome:
• Pros:
– Allows novel transcribed regions to be found
– Not mapping to the same exon multiple times
• Cons:
– Mapping reads across exon-exon junctions
– Output requires further analysis to obtain a measure of transcript expression
De novo assembly:
• Not recommended at present if a (close) reference exists!
Day 4: RNA-seq
Quantifying expression:
• Single transcript:
– Sum the number of reads mapping to each of the exons
• Multiple transcript:
– Count the reads spanning unique exon junctions
– Use statistical techniques based upon mixture models
Quantitative data require normalization (adjust for total read count, gene length)
Day 4: RNA-seq
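For the single-transcript case, summing exon read counts and normalizing for total read count and gene length can be sketched with RPKM (reads per kilobase of transcript per million mapped reads). The slide does not name a specific measure, so RPKM is an assumption here, as are the function name and the example numbers.

```python
def rpkm(exon_read_counts, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads.

    exon_read_counts: reads mapping to each exon of a single transcript.
    transcript_length_bp: summed exon length in base pairs.
    total_mapped_reads: all mapped reads in the sample.
    """
    reads = sum(exon_read_counts)  # single transcript: sum over its exons
    # 1e9 = 1e3 (bp -> kb) * 1e6 (reads -> millions of reads)
    return reads * 1e9 / (transcript_length_bp * total_mapped_reads)


# Hypothetical gene: 3 exons, 2 kb of exonic sequence, 10 million mapped reads.
print(rpkm([50, 30, 20], transcript_length_bp=2000, total_mapped_reads=10_000_000))
```

Dividing by length makes long and short genes comparable within a sample, and dividing by total mapped reads makes samples of different sequencing depth comparable.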
RNA-seq applications:
• Detecting differential expression: DESeq (R package)
• Detecting allele-specific expression
• Correlating genotypes with expression
Day 4: RNA-seq
Programs
• CLC Genomics Workbench:
– Commercial
– Fast
– Easy to use
– GUI
• ABySS:
– Freely available
– Command line
– Only short reads
Day 5: De novo assembly