
• Different Next-Generation Sequencing technologies:


Day 1: Introduction to NGS

Technology            Read length [bp]      Throughput
454                   400                   1 Gbp/day
Illumina              100 (150 on GAIIx)    2x27 Gbp/day
SOLiD                 75                    20-30 Gbp/day
Ion Torrent           200                   1 Gbp/run (= 2 h)
Pacific Biosciences   ?                     ?

Pacific Biosciences: third-generation sequencing, does not require DNA amplification.

• De novo sequencing
• Population genetics by re-sequencing
– Sequencing of individuals
– Pool-Seq
• Mapping
– RAD-Seq
– NGS-speed-mapping
• Genome evolution
– CNV
– Inversions
• Expression profiling
– mRNA
– micro RNA
• ChIP-Seq
• Amplicon sequencing (biodiversity surveys)


Day 1: NGS applications

• Introduction to some basic Unix commands


Day 1: Unix Basics

The absolute basics: ls, cd, pwd

Basic file control: mv, cp, mkdir, rmdir, rm

Viewing/creating/editing files: less, head, tail, touch, nano, | (pipe), > (write to file), < (read from file)

Misc. useful commands: man, chmod, source, wc, curl, diff

Power commands: uniq, sort, cut, tr, grep, sed

Process-related commands: top, ps, kill

• MAQ & ELAND:

– Build hash table for reads

– Map reference to reads

• Bowtie and BWA

– Burrows-Wheeler index

– faster
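The hash-table strategy listed for MAQ and ELAND can be sketched in a few lines. This is a minimal illustration, not the actual algorithm of either program: it assumes all reads have the same length and match the reference exactly, and the function names (`build_read_index`, `map_reference_to_reads`) and the toy sequences are made up for this example.

```python
from collections import defaultdict


def build_read_index(reads):
    """Hash table from read sequence to the ids of all reads with that sequence."""
    index = defaultdict(list)
    for read_id, seq in reads.items():
        index[seq].append(read_id)
    return index


def map_reference_to_reads(reference, reads):
    """Slide a window along the reference and look each window up in the read index.

    Simplifying assumptions: equal read length, exact matches only,
    forward strand only. Real mappers relax all three.
    """
    index = build_read_index(reads)
    read_len = len(next(iter(reads.values())))
    hits = defaultdict(list)
    for pos in range(len(reference) - read_len + 1):
        window = reference[pos:pos + read_len]
        for read_id in index.get(window, []):
            hits[read_id].append(pos)
    return dict(hits)


# Toy data (hypothetical): r1 occurs twice, r2 once, r3 does not map.
reference = "ACGTACGTTAGC"
reads = {"r1": "ACGT", "r2": "TTAG", "r3": "GGGG"}
hits = map_reference_to_reads(reference, reads)
print(hits)
```

The reference is scanned only once, whatever the number of reads, which is the point of hashing the reads rather than the reference. Burrows-Wheeler based mappers index the reference instead and are not covered by this sketch.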


Day 2: Mapping reads

• Raw SNP and indel calling: Samtools

• Local re-alignment: GATK/Dindel

• Multiple alignment across individuals: Samtools/GATK


Day 2: SNP and Indel calling

• Read depth

• Read pair: BreakDancer

• Split reads: Pindel
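The read-pair idea can be illustrated with a small sketch: if the two reads of a pair map much further apart on the reference than the library insert size allows, the sample may carry a deletion between them. This is only a toy version of the principle and not how BreakDancer is implemented; the function name `discordant_pairs`, the three-standard-deviation cutoff, and the example coordinates are assumptions made for illustration.

```python
def discordant_pairs(pairs, mean_insert, sd_insert, n_sd=3):
    """Return (pair_id, span) for pairs mapped further apart than expected.

    pairs: list of (pair_id, left_start, right_end) on one chromosome.
    A span well above the expected insert size hints at a deletion
    in the sequenced sample relative to the reference.
    """
    threshold = mean_insert + n_sd * sd_insert
    candidates = []
    for pair_id, left_start, right_end in pairs:
        span = right_end - left_start
        if span > threshold:
            candidates.append((pair_id, span))
    return candidates


# Hypothetical library: mean insert 300 bp, standard deviation 20 bp.
pairs = [("p1", 1000, 1310), ("p2", 2000, 2290), ("p3", 5000, 6200)]
candidates = discordant_pairs(pairs, mean_insert=300, sd_insert=20)
print(candidates)
```

A single discordant pair is weak evidence; in practice several pairs supporting the same breakpoints would be required before calling a variant.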


Day 2: Structural variation calling

[Figure: deletion]

• Different approaches for SNP identification

– Sequence individuals separately

– Sequence a pool of individuals


Day 3: Population Genomics - PoolSeq

Identify SNPs from consensus

Identify SNPs from pool

• Sequence individuals separately
– Haplotype information is preserved
– Singletons may be identified
– Creates redundant information (inefficient)

• Sequence pools of individuals
– Haplotype information is lost
– Identification of singletons is difficult (the high error rate makes a minimum allele count necessary)
– No redundant sequence information is created, so more SNPs may be identified
– Requires a correction of standard population genetics estimators (singletons and multiple samplings of identical sequences)
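The minimum allele count mentioned for pools can be made concrete with a short sketch. It assumes the base calls covering one site have already been collected into a string; the function name `call_pool_snp` and the default threshold of 2 are illustrative choices, not values taken from the course.

```python
from collections import Counter


def call_pool_snp(bases, min_count=2):
    """Return alleles supported by at least min_count reads, or None if fewer than two pass.

    bases: base calls from the pooled reads covering one site.
    Alleles seen fewer than min_count times are treated as possible
    sequencing errors, which is why true singletons are hard to detect.
    """
    counts = Counter(bases.upper())
    passing = {
        allele: n
        for allele, n in counts.items()
        if allele in "ACGT" and n >= min_count
    }
    return passing if len(passing) >= 2 else None


site_a = call_pool_snp("AAAAAAAATT")  # T seen twice: passes the threshold
site_b = call_pool_snp("AAAAAAAAAT")  # T seen once: indistinguishable from an error
print(site_a, site_b)
```

Raising `min_count` lowers the number of false SNPs caused by sequencing errors at the cost of missing rare alleles.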



Advantage of pooling increases with pool size!

Problematic regions for SNP identification:

• Repetitive (ambiguous) regions: exclude ambiguously mapped reads

• Copy number variations: exclude high-coverage regions
– High coverage at the SNP
– Balanced allele frequencies (e.g. 50% A, 50% T)
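Excluding high-coverage regions amounts to a simple depth filter. The sketch below assumes per-site coverage is already available; the record layout, the expected coverage of 50x, and the cutoff of twice that value are hypothetical and would need to be chosen for the data at hand.

```python
def filter_high_coverage(sites, max_coverage):
    """Keep only sites whose read depth does not exceed max_coverage."""
    return [site for site in sites if site["coverage"] <= max_coverage]


# Made-up example sites; the last one shows the pattern expected for a
# duplicated region: very high coverage and balanced allele frequencies.
sites = [
    {"pos": 101, "coverage": 48, "freq_A": 0.90},
    {"pos": 250, "coverage": 51, "freq_A": 0.75},
    {"pos": 730, "coverage": 212, "freq_A": 0.50},
]
max_coverage = 2 * 50  # assumption: twice the expected mean coverage
kept = filter_high_coverage(sites, max_coverage)
print([site["pos"] for site in kept])
```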


Aligning to the reference transcriptome:

• Pros:

– Quick

– Reduced reference complexity

– The output is easy to interpret

• Cons:

– Multiple transcripts exist for the same gene

– Inability to identify novel transcribed regions


Day 4: RNA-seq

Aligning to the reference genome:

• Pros:
– Allows novel transcribed regions to be found
– Reads are not mapped to the same exon multiple times

• Cons:
– Reads must be mapped across exon-exon junctions
– Output requires further analysis to obtain a measure of transcript expression

De novo assembly:

• Not recommended at present if a (close) reference exists!


Day 4: RNA-seq

Quantifying expression:

• Single transcript:

– Sum the number of reads mapping to each of the exons

• Multiple transcript:

– Count the reads spanning unique exon junctions

– Use statistical techniques based upon mixture models

Quantitative data require normalization (adjust for total read count, gene length)
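One common way to adjust for both total read count and gene length is RPKM (reads per kilobase of transcript per million mapped reads). The slides do not name a specific measure, so treating RPKM as the normalization is an assumption here, and the exon counts and lengths below are invented for the example.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Single-transcript gene: sum the reads over all of its exons.
exon_counts = {"exon1": 120, "exon2": 80, "exon3": 200}
exon_lengths = {"exon1": 500, "exon2": 300, "exon3": 1200}

gene_reads = sum(exon_counts.values())
gene_length = sum(exon_lengths.values())
value = rpkm(gene_reads, gene_length, total_mapped_reads=10_000_000)
print(value)
```

Dividing by gene length makes long and short genes comparable within a sample, and dividing by the total read count makes samples of different sequencing depth comparable with each other.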


Day 4: RNA-seq

RNA-seq applications:

• Detecting differential expression: DESeq (R package)

• Detecting allele-specific expression

• Correlating genotypes with expression


Day 4: RNA-seq

Programs

• CLC Genomics Workbench:
– Commercial
– Fast
– Easy to use
– GUI

• ABySS:
– Freely available
– Command line
– Only short reads


Day 5: De novo assembly
