NGS: Mapping and de novo assembly

Next-Generation Sequencing Analysis Series January 28, 2015 Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH


Page 1: NGS: Mapping and de novo assembly

Next-Generation Sequencing Analysis Series

January 28, 2015

Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH

Page 2: NGS: Mapping and de novo assembly

Bioinformatics and Computational Biosciences Branch

§  Bioinformatics Software Developers
§  Computational Biologists
§  Project Managers & Analysts

http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx

Page 3: NGS: Mapping and de novo assembly

Objectives

§  Give you an introduction to common methods used to process and analyze Next Generation Sequence data

§  Learn methods for 1) Mapping NGS reads and 2) De novo assembly of NGS reads

§  Give exposure to various applications for NGS experiments


Page 4: NGS: Mapping and de novo assembly

Illumina sequencing

[Figure (Illumina): sample DNA library and Illumina sequencing, cycles 1 to 5]

What other platforms?

Page 5: NGS: Mapping and de novo assembly

Illumina Paired-End Library Preparation

Source: Illumina

Page 6: NGS: Mapping and de novo assembly

Illumina Mate Pair Libraries


Page 7: NGS: Mapping and de novo assembly

A growing list of NGS applications

High-throughput sequencing:
§  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
§  Epigenetics (ChIP-Seq, MNase-seq, Bisulfite-Seq)
§  CNV, structural variations
§  Targeted resequencing
§  “Exome analysis”
§  Whole genome sequencing
§  Metagenomics (16S microbiome, environmental WGS)
§  Somatic mutations
§  Variants in Mendelian diseases
§  De novo genome assembly

What applications?

Page 8: NGS: Mapping and de novo assembly

Alignment versus De Novo Assembly

Short sequence “reads”: is a reference genome available?
•  Yes: alignment to reference
•  No: de novo assembly

To check: http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)

Page 9: NGS: Mapping and de novo assembly

Short Read Alignment

Read: CTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGA
Coordinate: chr6:27,373,801

Page 10: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

1.  Get your sequence data

2.  Check quality of sequence data

3.  Choose an alignment/mapping program

4.  Run the alignment

5.  View the alignments

6.  Downstream Processing


Page 11: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 1 of 6: Get your sequence data

Page 12: NGS: Mapping and de novo assembly

Public Short Read Repositories

§  NIH/NCBI
•  Sequence Read Archive, formerly Short Read Archive (http://www.ncbi.nlm.nih.gov/sra)
•  Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)
•  1000 Genomes (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/)
§  EMBL-EBI
•  European Nucleotide Archive (http://www.ebi.ac.uk/ena/)

fastq-dump SRR036642

Page 13: NGS: Mapping and de novo assembly

Understanding file formats

@F29EPBU01CZU4O
GCTCCGTCGTAAAAGGGG
+
24469:666811//..,,
@F29EPBU01D60ZF
CTCGTTCTTGATTAATGAAACATTCTTGGCAAATGCTTTCGCTCTGGTCCGTCTTGCGCCGGTCCAAGAATTTCACCTCTAGCGGCGCAATACGAATGCCCAAACACACCCAACACACCA
+
G???HHIIIIIIIIIBG555?=IIIIIIIIHHGHHIHHHIIIIIIHHHIIHHHIIIIIIIIIH99;;CBBCCEI???DEIIIIII??;;;IIGDBCEA?9944215BB@>>@A=BEIEEE
@F29EPBU01EIPCX
TTAATGATTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTTGAGAGCTTGTTTGGAGGTTCTAGCAGGGGAGCGCATCTCCCCAAACACACCCAACACACCA
+
IIIIIIIIIIIIIIIIIIIIIIHHHHIIIIHHHIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIHHHIIIIIIIIEIIB94422=4GEEEEEIBBBBHHHFIH???CII=?AEEEE
@F29EPBU01DER7Q
TGACGTGCAAATCGGTCGTCCGACCTCGGTATAGGGGCGAAGACTAATCGAACCATCTAGTAGCTGGTT
+
IIIIIIIG666GIIIIIIIIIIIIIIIIIIIII====:2:::<EEIIIIIIIIIIIIIIIIIGGGIIII
@F29EPBU01B2FE3
TCAACGATTAAAGTCCTACGTGATCTGAGTTCAGACCGGAGCAATCCAGGTCGGTTTCTATCTATTCAACATTTCTCCCTGTACGAAAGGACAAGAGAAATAGGGCCCACTTCACAATAGCGCCCNCCNCCNCCACACACACACACAC
+
ADBBBBD?B666FFFFHHHIFFFFFFFFFFFFFC86666DDDDDBBDFFFFFFF???FFFFCAA>ABBBB=336:<F??DDDDFFFFFDD?A===BA111?688;;;;?<:::<?>>?>?980..!//!86!669888??=<999822
@F29EPBU01BD6BJ
GTTGTAGGTCGGTAGTGTCGTCGGTAC
+
80006:;<4/..9:233342225984/
@F29EPBU01DDR5H
TGTGATGTGTCTTTATAGTAGCATGATTTATAATCCTTTGGGTATATACTCAATAATGGGATCACTGGGTCAAATGGA
+
ADDBBBDHHHF:::FFGGDDDDDDBB;44498411144;555ABDFFFFFFFFFFFFFFFF???A?=:88>8889889
@F29EPBU01BXIBN
GTGGAGGTCCGTAGCGGTCTTGACGTGCAAATCGGTCGTCCGACCTGGGTATAGGGGCGG
+
HIIIIIIIIIIIIIIIIIIIIIIIIIIIHHHIIIIIIIIIIIIBBCDEBEE;8/---0,,
@F29EPBU01EMLVL
TACCTCGCTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTGACTCGCCAAACACACCC
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIGFDB@@E?E<333
@F29EPBU01DHVGX
TGGTGGCGTACT
+
:662114,////
@F29EPBU01DSSQC
CGATCTGATAAATGCACGCATCCCNCC
+
BAAAAAB@6666?ABA??===862!,,
@F29EPBU01A3JW3
GGGAGTCGGGGAGTTGCAATTAATTTCCCCACCT
+
44271::....71;9676;688886622227///
@F29EPBU01CTSZ0
CGTGGGTGCGC
+
43422230422
@F29EPBU01EIWX8
TGGGGCTGGAATTACCGCGGCTGCTGGCACCAGACTTGCCCTCCAATGGATCCTCGTTAAAGGATTTAAAGTGGACTCATTCCAATTACAGGGCCTCGA
+
A1111@@EEEEGEEHHIIIIIIIIIIIIIIIIIIIIIIHHHIGGHHIHHHIIIIIIIIHHHIIIHHHHHHIIIIIIIIIIIIIIIIIIII???HHIIII
@F29EPBU01DP0LM
TGCGTGGGCGATTGTCTGGTTAATTCCGATAACGAACGAGACTCTCCCA
+
>@>=>444<042276=<222244===89998AABBBDBBAA?A?@;84/
@F29EPBU01DSYYK
GTGTAGTGATATCG
+
:::89863445244
@F29EPBU01CLHKI
GTCTCGTTCGTTATCGGAATTAACCAGACAAATCGCTCCACCAAATAAGAACGCCAAACACACCCAACC
+
IIIIIIHHIIIIIII====>>IIIIIIIIEBBGGGFG;7542??@E???::<@FF765A=AA===1/.,
@F29EPBU01AYJ2F
TCTCTTAGTCATAGTGA
+
FFFHGIIIIIIIIIIII
@F29EPBU01DVTHA
TGCGTCGTGGTTAGAATTCCTATAGGTAATACG
+
100//6<=112242<<<<992448?<<2232:7
@F29EPBU01A4RY6
TGTAGGTAGGGACAGTGGGAATCTCGTTCATCCATTCATGCGCGTCACTAATTAGATGACGAGGCATTTTGGCTACCTTAAGAGAGTCATAGTTACT
+
145IHHII<<>GIIIIHHHIIIIIIIIIIIIIIIGGGIIIIIIIIIIIIIIIIIIIIIIIHGEED>111///=A=4445@AIIIIIIIIIHIHEGA9
@F29EPBU01D9UUA
GACAGCTCTTTAGACACTAGGAAAACCTTATATAGAGNGTAAAAGCATAACCACCATAGTTAGCCCAAAAGCAGCCATC
+
HIIIIIIHHEEIIIII==99:BBBBIIIIFIIIIIBB!@@====B?<<;@@@DDEEEGG@@@F>>>AABBIIIIIIIB@
@F29EPBU01C0T3C
GGTATAGT

Page 14: NGS: Mapping and de novo assembly

Sequence file formats

§  Next-gen sequence file formats are based on the commonly used FASTA format

>sequence_ID and optional comments
ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
TTCGAAATTGGCGTCAGT

§  The Phred quality scores per base were added to form the FASTQ format
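To make the four-line FASTQ record structure concrete, here is a minimal Python sketch of a reader. The names `FastqRecord` and `parse_fastq` are illustrative assumptions, not part of any tool mentioned in these slides, and the sketch assumes unwrapped records (exactly one sequence line and one quality line per read).

```python
from typing import Iterator, NamedTuple


class FastqRecord(NamedTuple):
    read_id: str
    sequence: str
    quality: str


def parse_fastq(lines) -> Iterator[FastqRecord]:
    """Yield records from an iterable of FASTQ lines (4 lines per record)."""
    # Drop line endings and skip blank lines between records.
    it = (line.rstrip("\n") for line in lines if line.strip())
    while True:
        header = next(it, None)
        if header is None:
            return  # clean end of input
        sequence = next(it, None)
        separator = next(it, None)
        quality = next(it, None)
        if quality is None:
            raise ValueError(f"truncated FASTQ record: {header!r}")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch: {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two short records copied from the example reads shown earlier.
example = """@F29EPBU01CZU4O
GCTCCGTCGTAAAAGGGG
+
24469:666811//..,,
@F29EPBU01DHVGX
TGGTGGCGTACT
+
:662114,////
"""
records = list(parse_fastq(example.splitlines()))
for record in records:
    print(record.read_id, len(record.sequence))
```

For real files, an established parser (for example Biopython's `SeqIO`) is usually the safer choice; the sketch only shows what the format contains.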


Page 15: NGS: Mapping and de novo assembly

Sequence file formats

§  Illumina FASTQ format (FASTA format with quality values for each base)

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA    (base calls)
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@    (base quality + 33)

Full read header description:
@<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>

A space separates the read ID (first group of fields) from the second group of fields.
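The header layout described above can be split mechanically. The following Python sketch parses the example header into named fields; the function name and the dictionary keys are illustrative assumptions, and the sketch only handles headers that follow exactly this two-group layout.

```python
def parse_illumina_header(header: str) -> dict:
    """Split an Illumina FASTQ header into its colon-separated fields."""
    # The single space separates the read ID from the second group of fields.
    read_id, description = header.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, is_filtered, control_number, barcode = description.split(":")
    return {
        "instrument": instrument,
        "run_id": int(run_id),
        "flowcell": flowcell,
        "lane": int(lane),
        "tile": int(tile),
        "x_pos": int(x_pos),
        "y_pos": int(y_pos),
        "read_number": int(read_number),
        "is_filtered": is_filtered == "Y",
        "control_number": int(control_number),
        "barcode": barcode,
    }


fields = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(fields["flowcell"], fields["lane"], fields["barcode"])
```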

Page 16: NGS: Mapping and de novo assembly

Fastq Quality values

Quality scores are normally expected up to 40 on the Phred scale. They are encoded as ASCII characters (http://en.wikipedia.org/wiki/ASCII).

BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@

The highest base quality score in this sequence: ‘D’ = (68 - 33) = 35

If base quality = 35: $P = 10^{-35/10} \approx 0.00032$ (about 1 in 3,200 base calls incorrect)

From http://en.wikipedia.org/wiki/FASTQ_format
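The same arithmetic can be checked in a few lines of Python. This is a minimal sketch assuming the Phred+33 encoding used on this slide; the function names are illustrative.

```python
def phred_scores(quality: str, offset: int = 33) -> list:
    """Convert a FASTQ quality string to Phred scores (ASCII code minus offset)."""
    return [ord(char) - offset for char in quality]


def error_probability(score: int) -> float:
    """Probability that a base call is wrong: P = 10^(-Q/10)."""
    return 10 ** (-score / 10)


quality = "BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@"
scores = phred_scores(quality)
best = max(scores)  # the 'D' character: 68 - 33
print(best, error_probability(best))
```

Older Illumina pipelines used an offset of 64 instead of 33, so the `offset` parameter should match how the file was produced.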

Page 17: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 2 of 6: Check quality of sequence data

Page 18: NGS: Mapping and de novo assembly

Running FastQC

Open the FastQC program, then open the report in a browser: fastqc_report.html

[Figure: FastQC “Per base sequence quality” plot, annotated with the error probability at four quality levels: p = 0.0001 (Q40), p = 0.001 (Q30), p = 0.01 (Q20), p = 0.05 (Q13)]

Babraham Bioinformatics: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Page 19: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 3 of 6: Choose an alignment/mapping program

Page 20: NGS: Mapping and de novo assembly

Short Read Alignment Software

§  BFAST
§  BLASTN
§  BLAT
§  Bowtie
§  BWA
§  ELAND
§  GNUMAP
§  GMAP and GSNAP
§  MAQ
§  mrFAST and mrsFAST
§  MOSAIK
§  Novoalign
§  RUM
§  SHRiMP
§  SOAP
§  SpliceMap
§  SSAHA and SSAHA2
§  STAR
§  TopHat
§  ~20 more…

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
http://tinyurl.com/seqanswers-mapping

ls /usr/local/bio_apps/

Page 21: NGS: Mapping and de novo assembly

Issues of Consideration for Alignment Software

§  Library types:
•  Genomic DNA (for resequencing)
•  ChIP DNA (PCR bias)
•  RNA-seq cDNA
–  mRNA-seq (junction mapping)
–  smRNA-seq (adapter trimming)

[Figure (Illumina): excerpts from two sample preparation protocols. “Preparing Samples for ChIP Sequencing of DNA”: adapter sequences are added onto the ends of the DNA fragments; the adapters correspond to the two surface-bound oligos on the flow cell. “Preparing Samples for Analysis of Small RNA”: 5’ and 3’ RNA adapters are ligated to the isolated small RNA, followed by RT-PCR, giving a cDNA fragment flanked by adapters; the protocol takes a minimum of 4 days.]

Page 22: NGS: Mapping and de novo assembly

Issues of Consideration for Alignment Software

§  Types of reads
•  Single-end, e.g. SRR036642.fastq
•  Paired-end (reads 1 and 2), e.g. SRR027894_1.fastq, SRR027894_2.fastq
–  Mean, Standard Deviation of Inner Distance
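Some aligners ask for the mean and standard deviation of the inner distance between mates. A minimal Python sketch of that calculation follows, assuming the common definition (fragment length minus the two read lengths). The fragment lengths and read length below are hypothetical values chosen for illustration, not data from these slides.

```python
from statistics import mean, stdev


def inner_distance(fragment_length: int, read_length: int) -> int:
    """Distance between the inner ends of the two mates of a pair."""
    return fragment_length - 2 * read_length


# Hypothetical fragment sizes, e.g. estimated from a library size trace.
fragment_lengths = [290, 300, 310, 295, 305]
read_length = 100

distances = [inner_distance(f, read_length) for f in fragment_lengths]
print(mean(distances), round(stdev(distances), 2))
```

A negative inner distance simply means the two mates overlap.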

Page 23: NGS: Mapping and de novo assembly

Issues of Consideration for Alignment Software

§  Library types, continued:
•  Multiplexed library (demultiplex)
•  Mate pair library
•  Bisulfite-converted (C->T reference genome)

Figure credits: Illumina; Heng Li, 2010
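For bisulfite-converted libraries, the "C->T reference genome" idea can be sketched as an in-silico conversion: the reference is converted the same way the chemistry converts unmethylated cytosines, so converted reads can match it. The Python below is a simplified illustration with made-up sequences, not the procedure of any specific bisulfite aligner.

```python
def c_to_t(sequence: str) -> str:
    """In-silico bisulfite conversion of the forward strand (every C becomes T)."""
    return sequence.upper().replace("C", "T")


def g_to_a(sequence: str) -> str:
    """Equivalent conversion as seen from the reverse strand (every G becomes A)."""
    return sequence.upper().replace("G", "A")


reference = "ACGTCCGA"   # hypothetical reference fragment
read = "ATGTTTGA"        # hypothetical read from fully converted, unmethylated DNA

matches_original = read in reference
matches_converted = read in c_to_t(reference)
print(matches_original, matches_converted)
```

Methylated cytosines are protected from conversion and stay as C in the read, which is what the downstream methylation call relies on.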

Page 24: NGS: Mapping and de novo assembly

Issues of Consideration for Alignment Software

§ Platform differences

•  Bases (ACTG)

•  Colorspace (2-base encoding, SOLiD)

•  Read Length

•  454 (homopolymers)


Page 25: NGS: Mapping and de novo assembly

Issues of Consideration for Alignment Software

§  Software Properties

•  Open-source or proprietary ($)

•  Accuracy

•  Speed of algorithm

•  Multi-threaded or single processor

•  RAM requirements (2GB vs 50GB for loading index)

•  Use of base quality score

•  Gapped alignment (indels)


Page 26: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 4 of 6: Run the alignment

Page 27: NGS: Mapping and de novo assembly

The Command Line Terminal

A New World to Some

Page 28: NGS: Mapping and de novo assembly

File Manager/Browser by Operating System

OS: Windows | Mac OSX | Unix
File manager (FM): Explorer | Finder | Shell
Input Method:

Page 29: NGS: Mapping and de novo assembly

Anatomy of the Terminal, “Command Line”, or “Shell”

[Screenshot labels: Window, Prompt (computer_name:current_directory username), Cursor, Command, Argument, Output]

Mac: Applications -> Utilities -> Terminal
Windows: download open source software
•  PuTTY (http://www.chiark.greenend.org.uk/~sgtatham/putty/)
•  Other SSH clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
•  Cygwin (http://www.cygwin.com/)

Page 30: NGS: Mapping and de novo assembly

How to execute a command

[Screenshot labels: command, argument, output]

Page 31: NGS: Mapping and de novo assembly

ls (“list”)

ls        list the files, links, subdirectories, etc. in a directory
ls -a     same as “ls”, but also show the “hidden” files
ls -l     list files with details (size, timestamp, ownership, permissions)
ls -lh    use “human-readable” file sizes

*See handout for more options!*

Page 32: NGS: Mapping and de novo assembly

cd (“change directory”), mkdir (“make directory”) and viewing files

cd ~                  change to home directory
cd test_data          change to “test_data” directory
cd ..                 change to higher directory (“go up”)
cd ~/unix_hpc         change to the “unix_hpc” directory inside the home directory
mkdir dir_name        make directory “dir_name”
pwd                   “print working directory”
head junctions.bed    view the first 10 lines of “junctions.bed”
head -5 file1         view the first 5 lines of “file1”
tail lymph3K.fastq    view the last 10 lines of “lymph3K.fastq”
tail -5 file1         view the last 5 lines of “file1”
less lymph3K.fastq    view a file; Space to page down, Ctrl-b to page up, arrow keys also work; “/” to search, “q” to quit (faster for huge files)

Page 33: NGS: Mapping and de novo assembly

Mapping ChIP-seq Reads with Bowtie


Page 34: NGS: Mapping and de novo assembly

Using ChIP-seq to analyze protein-DNA contacts

§ Proteins called transcription factors (TFs) are involved in regulation of gene activation

§ The first step in gene activation is binding of the TF to its target gene.

[Figure labels: TF, RNA Polymerase, Gene X]

Page 35: NGS: Mapping and de novo assembly

Chromatin Immunoprecipitation (ChIP)

See also: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31

Page 36: NGS: Mapping and de novo assembly

Burrows-Wheeler Transformation

§  Aligners that use the Burrows-Wheeler Transformation (BWT) have a
•  small genome index
•  small memory footprint (RAM) during alignment
•  faster alignment
§  Good at getting very accurate alignments quickly
§  Used in BWA, Bowtie, Bowtie2

[Figure: Reference Sequence -> Burrows-Wheeler Transformation -> Indexed Sequence. Create all rotations of the sequence, then sort them.]

Langmead B, Trapnell C, Pop M, Salzberg SL. Genome Biol 10:R25.
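The "create all rotations, then sort" step can be written out directly. The Python sketch below is a naive, teaching-sized version: real aligners build the transform from a suffix array and add an FM index on top, which this example does not attempt. The input string `acaacg` is a toy sequence chosen for illustration.

```python
def bwt(text: str, terminator: str = "$") -> str:
    """Burrows-Wheeler transform via sorted cyclic rotations (naive version)."""
    text += terminator  # unique end marker that sorts before the letters
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, terminator: str = "$") -> str:
    """Rebuild the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


transformed = bwt("acaacg")
restored = inverse_bwt(transformed)
print(transformed, restored)
```

The transform is reversible, which is why the index can stand in for the reference sequence while staying small.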

Page 37: NGS: Mapping and de novo assembly

Mapping RNA-seq Reads with TopHat


Page 38: NGS: Mapping and de novo assembly

Mapping RNA-seq Reads


Page 39: NGS: Mapping and de novo assembly

Steps in TopHat Alignment

Source: Genome Biology (2013) 14:R36.

Page 40: NGS: Mapping and de novo assembly

Alignment for Variant Analysis

§  Variants
•  Small-scale
–  single nucleotide variants (SNV) or single nucleotide polymorphism (SNP)
–  short insertions or deletions
–  deletion followed by insertion (indel)
•  Large-scale, structural
–  copy number variants (CNV)
–  inversions and translocations
§  Alignment software that will support gapped alignment for small-scale variation
•  BWA (also uses Burrows-Wheeler algorithm)
•  Novoalign
•  Bowtie2
•  GSNAP
•  GEM
•  mrFAST
•  MOSAIK
•  RMAP
•  rNA
•  RTG Investigator
•  Segemehl
•  SHRiMP
•  Stampy
•  SToRM

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
http://www.hgvs.org/mutnomen/recs-DNA.html

Page 41: NGS: Mapping and de novo assembly

iGenomes

§  Common standard datasets for genomic analysis, organized in standardized directory structure
•  Found in /gpfs/bio_data/iGenomes on NIAID HPC
•  Found in /fdb/igenomes on Biowulf
§  Files have additional formatting required by TopHat, Cufflinks
§  Maintained by Illumina, hosted on TopHat / Cufflinks website
•  http://tophat.cbcb.umd.edu/igenomes.html
§  Approximately 500 GB for all species together
§  Genomes available:

Arabidopsis_thaliana, Bacillus_cereus_ATCC_10987, Bacillus_subtilis_168, Bos_taurus, Caenorhabditis_elegans, Canis_familiaris, Drosophila_melanogaster, Enterobacteriophage_lamdba, Equus_caballus, Escherichia_coli_K_12_DH10B, Escherichia_coli_K_12_MG1655, Gallus_gallus, Glycine_max, Homo_sapiens, Macaca_mulatta, Mus_musculus, Mycobacterium_tuberculosis_H37RV, Oryza_sativa_japonica, Pan_troglodytes, PhiX, Pseudomonas_aeruginosa_PAO1, Rattus_norvegicus, Rhodobacter_sphaeroides_2.4.1, Saccharomyces_cerevisiae, Schizosaccharomyces_pombe, Sorangium_cellulosum_So_ce_56, Sorghum_bicolor, Staphylococcus_aureus_NCTC_8325, Sus_scrofa, Zea_mays

Page 42: NGS: Mapping and de novo assembly

iGenomes Directory Structure

[Species]
    Ensembl | NCBI | UCSC
        hg18 | hg19
            Annotation (GTF, other formats): Genes, SmallRNA, Variation
            Sequence (FASTA files, pre-built indexes): BWAIndex, BowtieIndex, Chromosomes, WholeGenomeFasta, AbundantSequences
            GenomeStudio

Examples:
iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome
iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf

Page 43: NGS: Mapping and de novo assembly

Are you still awake?


Page 44: NGS: Mapping and de novo assembly

Mapping Demo with Bowtie and BWA

§  SRR036642 from SRA
•  ChIP-seq
•  Map using Bowtie
§  SRR062634 from SRA
•  Human Resequencing data
•  Map using BWA

Page 45: NGS: Mapping and de novo assembly

No HPC available to you? Free, Alternative Ways to Map NGS Reads

§  Galaxy
•  Web-based analysis workflow interface
•  https://main.g2.bx.psu.edu/
•  Emphasis on NGS tools
•  Includes Bowtie, BWA, TopHat
§  KBase
•  Web-based command-line interface
•  http://kbase.science.energy.gov/
•  Includes Bowtie, BWA
§  Disadvantages of online tools:
•  Takes a long time to upload data to servers
•  Disk space limitations
•  Limited customization of analysis workflow

Page 46: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 5 of 6: View the alignments

Page 47: NGS: Mapping and de novo assembly

Most commonly used alignment file formats

§  SAM (Sequence Alignment/Map)
•  Unified format for storing alignments to a reference genome
§  BAM (binary version of SAM)
•  Compressed SAM file, is normally indexed

Page 48: NGS: Mapping and de novo assembly

SAM/BAM format (sequence alignment map): Most commonly used alignment file formats

Fields: QNAME FLAG RNAME POSITION MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT

§  Unified format for storing alignments to a reference genome
§  BAM is a compressed SAM file, normally indexed

http://samtools.sourceforge.net/samtools.shtml
http://samtools.sourceforge.net/SAM1.pdf
http://picard.sourceforge.net/explain-flags.html
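To show how the columns and the bitwise FLAG fit together, here is a small Python sketch that splits one tab-separated alignment line and decodes its FLAG. The column names follow the current SAM specification (POS, RNEXT and PNEXT correspond to POSITION, MRNM and MPOS above). The alignment line itself is a made-up example, and the function names are illustrative.

```python
SAM_COLUMNS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
               "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

# Bit meanings as defined in the SAM specification.
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}


def parse_sam_line(line: str) -> dict:
    """Split one SAM alignment line into its 11 mandatory fields plus tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = fields[11:]  # optional TAG:TYPE:VALUE fields
    return record


def decode_flag(flag: int) -> list:
    """List the names of all bits set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


# Hypothetical alignment line for illustration only.
line = "\t".join(["read1", "99", "chr6", "27373801", "60", "36M", "=",
                  "27374000", "235", "CTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGA",
                  "I" * 36, "NM:i:0"])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], decode_flag(record["FLAG"]))
```

In practice a library such as pysam or the `samtools view` command would be used; the sketch is only meant to make the field layout tangible.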

Page 49: NGS: Mapping and de novo assembly

Picard Tools

AddOrReplaceReadGroups.jar, BamIndexStats.jar, BamToBfq.jar, BuildBamIndex.jar, CalculateHsMetrics.jar, CleanSam.jar, CollectAlignmentSummaryMetrics.jar, CollectCDnaMetrics.jar, CollectGcBiasMetrics.jar, CollectInsertSizeMetrics.jar, CollectMultipleMetrics.jar, CompareSAMs.jar, CreateSequenceDictionary.jar, EstimateLibraryComplexity.jar, ExtractIlluminaBarcodes.jar, ExtractSequences.jar, FastqToSam.jar, FixMateInformation.jar, IlluminaBasecallsToSam.jar, MarkDuplicates.jar, MeanQualityByCycle.jar, MergeBamAlignment.jar, MergeSamFiles.jar, NormalizeFasta.jar, picard-1.45.jar, QualityScoreDistribution.jar, ReorderSam.jar, ReplaceSamHeader.jar, RevertSam.jar, sam-1.45.jar, SamFormatConverter.jar, SamToFastq.jar, SortSam.jar, ValidateSamFile.jar, ViewSam.jar

http://broadinstitute.github.io/picard/

java -jar QualityScoreDistribution.jar I=file.bam CHART=file.pdf
/usr/local/bio_apps/java/bin/java -jar /usr/local/bio_apps/picard-tools/CollectMultipleMetrics.jar …

Page 50: NGS: Mapping and de novo assembly

Visualization of output in Integrative Genomics Viewer (IGV)

§  IGV download
•  http://www.broadinstitute.org/igv/projects/current/igv_mm.jnlp (Windows 1.2GB)
•  http://www.broadinstitute.org/igv/projects/current/igv_lm.jnlp (Mac 2GB)
§  Open IGV by double-clicking. Load data by selecting File → Load from URL, and entering the following links.
§  Links to BAM files
•  ChIP-seq:
–  https://dl.dropbox.com/u/12821862/SRR036642.bam
•  DNA-seq:
–  https://dl.dropbox.com/u/30379708/SRR062634.sorted.bam
•  RNA-seq:
–  https://dl.dropbox.com/u/30379708/Upenn/lymph_accepted_hits.bam
–  https://dl.dropbox.com/u/30379708/Upenn/wbc_accepted_hits.bam
§  Examples:
•  chr6:26,224,647-26,402,373
•  rs1205023, chr6:3,582,094-3,582,266
•  AIF1
•  LST1

Page 51: NGS: Mapping and de novo assembly

Steps in Alignment/Mapping

Step 6 of 6: Downstream Processing

Page 52: NGS: Mapping and de novo assembly

Downstream Processing

§  Finding peaks (ChIP-seq)
§  Annotating peaks to genes (ChIP-seq)
§  Assembling transcripts (RNA-seq)
§  Annotating transcripts to genes (RNA-seq)
§  Etc.

Figure sources: Park, Nat Rev Genet, 2009; http://grimmond.imb.uq.edu.au/mammalian_transcriptome.html

Page 53: NGS: Mapping and de novo assembly


Examples of using different mapping strategies for NGS

Page 54: NGS: Mapping and de novo assembly

ChIP-seq and Differential Expression RNA-seq

§ The transcription factor T-bet is induced by multiple pathways and prevents an endogenous Th2 cell program during Th1 cell responses. Immunity. 2012 Oct 19;37(4):660-73. doi: 10.1016/j.immuni.2012.09.007. Epub 2012 Oct 4. Zhu J, Jankovic D, Oler AJ, Wei G, Sharma S, Hu G, Guo L, Yagi R, Yamane H, Punkosdy G, Feigenbaum L, Zhao K, Paul WE.

§  ChIP-seq Methods
•  Mapping: Bowtie
•  Peaks: MACS
§  RNA-seq Methods
•  Mapping: TopHat
•  Expression: USeq

Page 55: NGS: Mapping and de novo assembly

Resequencing/Variant Analysis

§ Whole genome sequencing of peach (Prunus persica L.) for SNP identification and selection. BMC Genomics. 2011 Nov 22;12:569. doi: 10.1186/1471-2164-12-569. Ahmad R, Parfitt DE, Fass J, Ogundiwin E, Dhingra A, Gradziel TM, Lin D, Joshi NA, Martinez-Garcia PJ, Crisosto CH.

§  Methods
•  Mapping: BWA
•  SNP calling: SAMtools

Image: http://www.themoneytimes.com/files/peach.jpg?1270231192

Page 56: NGS: Mapping and de novo assembly

RNA-seq alternative splicing

§ RNA-Seq analysis of the parietal cortex in Alzheimer's disease reveals alternatively spliced isoforms related to lipid metabolism. Neurosci Lett. 2013 Mar 1;536:90-5. doi: 10.1016/j.neulet.2012.12.042. Epub 2013 Jan 7. Mills JD, Nalpathamkalam T, Jacobs HI, Janitz C, Merico D, Hu P, Janitz M.

§  Methods
•  Mapping: TopHat
•  Splicing: Cufflinks, Cuffdiff

Page 57: NGS: Mapping and de novo assembly

Stranded RNA-seq

§ Directional gene expression and antisense transcripts in sexual and asexual stages of Plasmodium falciparum. BMC Genomics. 2011 Nov 30;12:587. doi: 10.1186/1471-2164-12-587. López-Barragán MJ, Lemieux J, Quiñones M, Williamson KC, Molina-Cruz A, Cui K, Barillas-Mury C, Zhao K, Su XZ.

§  Methods
•  Mapping: TopHat
•  Expression: Cufflinks

Page 58: NGS: Mapping and de novo assembly

Genome-wide Bisulfite Sequencing

§ Whole-genome bisulfite DNA sequencing of a DNMT3B mutant patient. Epigenetics. 2012 Jun 1;7(6):542-50. doi: 10.4161/epi.20523. Epub 2012 Jun 1. Heyn H, Vidal E, Sayols S, Sanchez-Mut JV, Moran S, Medina I, Sandoval J, Simó-Riudalbas L, Szczesna K, Huertas D, Gatto S, Matarazzo MR, Dopazo J, Esteller M.

§  Methods
•  Mapping and Bisulfite analysis: BSMAP

Page 59: NGS: Mapping and de novo assembly

Cross-linking immunoprecipitation sequencing (CLIP-seq)

§ LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol Cell. 2012 Oct 26;48(2):195-206. doi: 10.1016/j.molcel.2012.08.004. Epub 2012 Sep 6. Wilbert ML, Huelga SC, Kapeli K, Stark TJ, Liang TY, Chen SX, Yan BY, Nathanson JL, Hutt KR, Lovci MT, Kazan H, Vu AQ, Massirer KB, Morris Q, Hoon S, Yeo GW.

§  Methods
•  Mapping: Bowtie
•  Peaks: Custom scripts

Page 60: NGS: Mapping and de novo assembly

Ribosomal Profiling sequencing (Ribo-seq)

§ Genome-wide ribosome profiling reveals complex translational regulation in response to oxidative stress. Proc Natl Acad Sci U S A. 2012 Oct 23;109(43):17394-9. doi: 10.1073/pnas.1120799109. Epub 2012 Oct 8. Gerashchenko MV, Lobanov AV, Gladyshev VN.

§  Methods
•  Mapping: Bowtie
•  Translation Efficiency: Custom Perl scripts

Page 61: NGS: Mapping and de novo assembly

Chromosome Conformation Capture Sequencing (4C)

§ Multiplexed chromosome conformation capture sequencing for rapid genome-scale high-resolution detection of long-range chromatin interactions. Nat Protoc. 2013 Feb 14;8(3):509-24. doi: 10.1038/nprot.2013.018. Epub 2013 Feb 14. Stadhouders R, Kolovos P, Brouwer R, Zuin J, van den Heuvel A, Kockx C, Palstra RJ, Wendt KS, Grosveld F, van Ijcken W, Soler E.

§  Methods
•  Mapping: Bowtie via NARWHAL
•  Post-alignment: BED Tools

Page 62: NGS: Mapping and de novo assembly

DNase I Hypersensitivity (DNase-seq)

§ Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012 Jun 13;13(9):R53. doi: 10.1186/gb-2012-13-9-r53. Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, Gingeras TR, Gerstein M, Guigó R, Birney E, Weng Z.

§  Methods
•  Mapping: Maq
•  Peak calling: F-Seq

http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=328941501&g=wgEncodeChromatinMap&hgTracksConfigPage=configure

Page 63: NGS: Mapping and de novo assembly

16S rRNA Microbiome Sequencing

§ Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One. 2011;6(12):e27310. doi: 10.1371/journal.pone.0027310. Epub 2011 Dec 14. Schloss PD, Gevers D, Westcott SL.

§  Methods
•  Taxonomic Assignment: Mothur classify.seqs
•  Alignment: Mothur align.seqs

Page 64: NGS: Mapping and de novo assembly

Polyploid Genome Re-sequencing

§ PolyCat: A Resource for Genome Categorization of Sequencing Reads From Allopolyploid Organisms. G3 (Bethesda). 2013 Mar;3(3):517-25. doi: 10.1534/g3.112.005298. Epub 2013 Mar 1. Page JT, Gingle AR, Udall JA.

§  Methods
•  Mapping: GSNAP
•  Homoeo-SNP calling: PolyCat

Page 65: NGS: Mapping and de novo assembly

Additional Resources

§  Commercial Software for NGS Analysis (No Command Line!)
•  Partek Genomics Suite
–  http://www.partek.com/?q=partekgs
•  CLCBio Genomics Workbench
–  http://www.clcbio.com/products/clc-genomics-workbench/

Page 66: NGS: Mapping and de novo assembly

Next-Generation Sequencing Analysis Series

Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH

Page 67: NGS: Mapping and de novo assembly

Alignment versus De Novo Assembly

Short sequence “reads”: is a reference genome available?
•  Yes: alignment to reference
•  No: de novo assembly

To check: http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)

Page 68: NGS: Mapping and de novo assembly

General strategy of assembling a genome de novo

1.  Pre-process short reads (trim, quality filter…)
2.  Assemble sequences into contigs
3.  Order contigs into scaffolds
4.  Annotate genome

Page 69: NGS: Mapping and de novo assembly

Basic Preprocessing

Tools for evaluating quality
•  PrinSeq (web and command line) - http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
•  FastQC (stand-alone and command line) - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Tools for trimming reads and removing adapters
•  Btrim - http://www.ncbi.nlm.nih.gov/pubmed/21651976
–  Trims off adapters, barcodes and/or low quality regions from single or paired-end reads
•  Cutadapt - http://code.google.com/p/cutadapt/
–  Provides many options for trimming
–  Accepts fasta, fastq and csfasta/qual
–  Paired reads need to be put back in matching order after trimming; this could be done with the cmpfastq script (http://compbio.brc.iop.kcl.ac.uk/software/download/cmpfastq)

Page 70: NGS: Mapping and de novo assembly

Assembly of Sequences

§ Algorithms

1.  Greedy

2.  Overlap-layout-consensus (OLC)

3.  De Bruijn Graph

Schatz M C et al. Genome Res. 2010;20:1165-1173

Page 71: NGS: Mapping and de novo assembly

Greedy

Was used in the very early next-gen assemblers (e.g. SSAKE, VCAKE)
1.  The highest-scoring overlap is merged first; the growing contig then takes on the read with the next-highest score
2.  The paired-end reads are used to generate super contigs
3.  Mate pairs could also be used to determine contig order

* Repeats can cause big problems in this approach
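The greedy idea can be sketched in a few lines of Python: find the pair of sequences with the longest exact suffix-prefix overlap, merge them, and repeat until nothing overlaps. This is a toy illustration with made-up reads and exact matching only; it is not how SSAKE or VCAKE are implemented, and it ignores reverse complements, sequencing errors and paired-end information.

```python
from itertools import permutations


def overlap(a: str, b: str, min_length: int = 3) -> int:
    """Length of the longest suffix of a that equals a prefix of b."""
    start = 0
    while True:
        start = a.find(b[:min_length], start)
        if start == -1:
            return 0
        if b.startswith(a[start:]):
            return len(a) - start
        start += 1


def greedy_assemble(reads, min_length: int = 3) -> list:
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(reads)
    while True:
        best_length, best_a, best_b = 0, None, None
        for a, b in permutations(contigs, 2):
            length = overlap(a, b, min_length)
            if length > best_length:
                best_length, best_a, best_b = length, a, b
        if best_length == 0:
            return contigs  # nothing left to merge
        contigs.remove(best_a)
        contigs.remove(best_b)
        contigs.append(best_a + best_b[best_length:])


reads = ["ATGGCGT", "GCGTACG", "TACGGAT"]  # made-up reads
contigs = greedy_assemble(reads)
print(contigs)
```

Because each step commits to the locally best overlap, a repeat that matches several reads equally well can send the merge down the wrong path, which is the weakness noted above.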

Page 72: NGS: Mapping and de novo assembly

Imperfect Overlap Between Reads Can Lead to Incorrect Assembly in the Greedy Approach

[Figure: a correct assembly versus an incorrect assembly caused by an imperfect overlap. Source: Brief Bioinform. 2009 July; 10(4): 354–366.]

Page 73: NGS: Mapping and de novo assembly

Greedy Extension Leads to Arrested Assembly if Multiple Matches are Found

[Figure: an existing contig and two unassembled reads that both match the contig; the assembler can’t resolve the conflict, so assembly stops]

Page 74: NGS: Mapping and de novo assembly

Overlap Graph or Overlap-layout-consensus (OLC)

•  Performs better overall
•  All-against-all comparison using k-mers as seeds; a Seed & Extend algorithm is used.
•  Good for long reads (e.g. Sanger or other >100bp, such as 454, Ion Torrent, PacBio) due to minimum overlap threshold
•  Examples: CABOG (Celera), ARACHNE
•  Newbler, developed for 454, is based on OLC and is now being used for Ion Torrent

Page 75: NGS: Mapping and de novo assembly

De Bruijn Graph

•  Reads are broken into successive k-mers, and the graph maps the k-mers
•  Each k-mer is a node, and edges are drawn between consecutive k-mers in a read.
•  Repeat sequences create a fork in the graph; alternative sequences create a bubble.
•  The k-mer size can only be determined by “trial and error”.
•  A small value of k will create a complex graph, but a large value of k may miss small overlaps. A good starting point would be a k-mer size that is 2/3 the size of the read
•  Good for short reads or small genomes. With long reads and/or large genomes, may require lots of RAM (e.g., ~0.5 TB for human)
•  Examples are: Velvet, SOAPdenovo, ALLPATHS-LG, ABySS
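A small Python sketch can make the graph construction concrete. It follows the description above (each k-mer is a node, with an edge between consecutive k-mers of a read) and then looks for nodes with more than one successor, which is where a repeat creates a fork. The reads and function names are made up for illustration; real assemblers also handle reverse complements, coverage and error removal.

```python
from collections import defaultdict


def kmers(read: str, k: int) -> list:
    """All successive k-mers of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def de_bruijn_graph(reads, k: int) -> dict:
    """Map each k-mer to the set of k-mers that directly follow it in a read."""
    graph = defaultdict(set)
    for read in reads:
        read_kmers = kmers(read, k)
        for left, right in zip(read_kmers, read_kmers[1:]):
            graph[left].add(right)
    return dict(graph)


def forks(graph: dict) -> list:
    """Nodes with more than one successor (ambiguous extension)."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two overlapping made-up reads: the graph is a simple path.
simple = de_bruijn_graph(["ATGGCG", "GGCGTA"], k=3)

# A made-up read in which ATG occurs twice: the repeat creates a fork.
repeated = de_bruijn_graph(["ATGCATGG"], k=3)
print(forks(simple), forks(repeated))
```

Raising `k` in the second example to 4 removes the fork, which is the trade-off behind choosing the k-mer size by trial and error.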

Page 76: NGS: Mapping and de novo assembly

Evaluating the assembly

§  Genome assembly results:
•  contig size and number of contigs produced
•  scaffold size and number
•  N50 and N90
§  Coverage
§  GC Content
§  Genome annotation
•  repeats analysis and annotation
•  protein-coding gene annotation (including gene structure prediction and gene function annotation)
•  non-coding RNA gene annotation (including annotation of microRNA, tRNA, rRNA, and other ncRNA)
•  transposon and tandem repeats annotation
§  Comparative genomics and evolution (chromosome structure, conserved gene families)

Page 77: NGS: Mapping and de novo assembly

Evaluating the assembly

Basic statistics

N50 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.

Contig size (bp)
3000
2000    <- N50
1200
800
600     <- N90
400
Total: 8000

N90 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 90% of the total length of all contigs.
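The definition translates directly into code. This Python sketch computes N50 and N90 for the six contig sizes listed above; the function name `nx` is an illustrative choice.

```python
def nx(contig_lengths, percent: int) -> int:
    """Length of the shortest contig such that contigs of that length or
    longer cover at least `percent` of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer arithmetic avoids floating-point edge cases at the threshold.
        if running * 100 >= percent * total:
            return length
    raise ValueError("contig_lengths must not be empty")


contigs = [3000, 2000, 1200, 800, 600, 400]  # contig sizes (bp) from the slide
n50 = nx(contigs, 50)
n90 = nx(contigs, 90)
print(sum(contigs), n50, n90)
```

For N50, the running total reaches 5000 bp (3000 + 2000) which passes half of 8000 bp, so the answer is 2000; for N90 the total must reach 7200 bp, which happens at the 600 bp contig.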


Page 78: NGS: Mapping and de novo assembly

To Determine Optimal kmer Size, Try Many

[Figure: Effect of kmer Length on Contig Length in ABySS. x-axis: kmer (bp), 20 to 64; y-axis: Contigs (bp), 0 to 1000; series: ABySS N25, ABySS N50, ABySS N75]

*This will vary based on dataset (genome, read length, etc.)

*Good starting point is 2/3 of the read length.

Page 79: NGS: Mapping and de novo assembly

Example of de novo genome assembly from start to finish: Giant Panda

§ “The sequence and de novo assembly of the giant panda genome.” Nature. 2010 Jan 21;463(7279):311-7. doi: 10.1038/nature08696. Epub 2009 Dec 13.


Page 80: NGS: Mapping and de novo assembly

Panda Genome Karyotype


Page 81: NGS: Mapping and de novo assembly

Sequencing Strategies for De novo Assembly

Complex genome (if any condition met)
•  GC content: < 35% or > 65%
•  Repeat content: > 50%
•  Heterozygous diploid or polyploid
•  Heterozygosity rate > 0.5%

Page 82: NGS: Mapping and de novo assembly

Flowchart of the panda genome assembly


Page 83: NGS: Mapping and de novo assembly

Supplementary Methods for Details about Panda Genome Assembly

§  Illumina GA Platform, 35-71 bp paired-end reads
§  “In total, we generated 176-Gb of usable sequence (equal to 73-fold coverage of the whole genome), with an average read length of 52 bp”
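The quoted figures are linked by simple arithmetic: fold coverage is total sequenced bases divided by genome size. The sketch below only rearranges the numbers quoted on this slide; the derived genome size and read count are back-of-the-envelope estimates, not values reported in the paper.

```python
total_bases = 176e9      # 176 Gb of usable sequence (from the slide)
fold_coverage = 73       # 73-fold coverage (from the slide)
mean_read_length = 52    # average read length in bp (from the slide)

# coverage = total_bases / genome_size  =>  genome_size = total_bases / coverage
implied_genome_size = total_bases / fold_coverage

# Approximate number of reads needed to produce that much sequence.
approx_reads = total_bases / mean_read_length

print(f"Implied genome size: {implied_genome_size / 1e9:.2f} Gb")
print(f"Approximate number of reads: {approx_reads / 1e9:.2f} billion")
```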

Page 84: NGS: Mapping and de novo assembly

Summary of Sequencing Reads


Page 85: NGS: Mapping and de novo assembly

Sequencing Error Correction and Filtering

§  “The quality requirements for de novo sequencing is far higher than for re-sequencing, because sequencing errors can create difficulties for the short-read assembly algorithm. We therefore carried out a stringent filtering process.”

§  Remove reads that contain only/mostly adapter.
•  How would you do that?

§  Exclude datasets/lanes with too much low-quality sequence.

§  Trimming at 3’ end to remove low-quality bases
§  Remove duplicate base call reads
§  Remove reads with significant excess of “N” and low-quality bases.
•  How would you do that?
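As one possible answer to "How would you do that?", here is a minimal Python sketch of three of the filters. It is not the panda paper's pipeline: the function names, every threshold, the Phred+33 quality encoding, and the example adapter string are assumptions chosen for illustration.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores (assumed encoding)."""
    return [ord(c) - 33 for c in qual]

def mostly_adapter(seq, adapter, min_insert=20, seed=12):
    """True if the adapter's first `seed` bases begin within the first
    `min_insert` bases of the read, leaving too little real insert."""
    pos = seq.find(adapter[:seed])
    return pos != -1 and pos < min_insert

def trim_3prime(seq, qual, min_q=20):
    """Drop bases from the 3' end until one has quality >= min_q."""
    scores = phred33(qual)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    return seq[:end], qual[:end]

def keep_read(seq, qual, max_n_frac=0.1, max_lowq_frac=0.5, min_q=20, min_len=30):
    """Reject reads that are too short, too N-rich, or mostly low quality."""
    if len(seq) < min_len:
        return False
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(q < min_q for q in phred33(qual)) / len(seq)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac

# Example usage with a hypothetical adapter sequence.
adapter = "AGATCGGAAGAGC"
print(mostly_adapter("ACGT" + adapter + "TTTT", adapter))
print(trim_3prime("ACGTACGT", "IIIIII##"))
print(keep_read("N" * 20 + "ACGT" * 5, "I" * 40))
```

Dedicated trimming tools handle partial adapter matches and mismatches, which this exact-match sketch does not.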

Page 86: NGS: Mapping and de novo assembly

Sequencing Error Correction and Filtering

§  Error correction by K-mer frequency: “Prior to assembly, the sequence errors were corrected based on K-mer frequency information. For the panda genome assembly, we chose K=17 bp, and corrected sequencing errors for the 17-mers with a frequency lower than 4. In summary, we corrected 8.4% of the reads and 0.2% of the bases. The total, the number of distinct 27-mers (we used 27-mer in graph construction and assembly) was reduced from 8.62 billion to 2.69 billion (3.2 times smaller) through this error correction step.”

§  Internal to SOAPdenovo
§  Quake or ALLPATHS-LG error corrector can be used as standalone methods to do this
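The idea behind k-mer frequency error correction can be sketched in Python: k-mers that occur fewer times than a cutoff are likely to contain a sequencing error. This is only the detection step on toy data, not SOAPdenovo's, Quake's, or ALLPATHS-LG's implementation; the function names and toy reads are assumptions, and a real tool would also count reverse complements and then correct the flagged bases.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer across all reads (forward strand only)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def low_frequency_kmers(counts, min_count):
    """k-mers seen fewer than `min_count` times, i.e. suspected errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Five copies of a correct read plus one read with a single G->C error.
reads = ["ACGTACGTAC"] * 5 + ["ACGTACCTAC"]
counts = kmer_counts(reads, k=5)

# The slide's cutoff (frequency lower than 4) applied to toy 5-mers instead of 17-mers.
suspects = low_frequency_kmers(counts, min_count=4)
print(sorted(suspects))
```

Only the k-mers that span the erroneous base fall below the cutoff, which is what localizes the error.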

Page 87: NGS: Mapping and de novo assembly

http://soap.genomics.org.cn/down/soapdenovo.pdf

A. Create Graph

Kmer = 27

Strategy of SOAPdenovo

http://1.usa.gov/oTUrWC

Page 88: NGS: Mapping and de novo assembly

B. Simplify the graph by removing errors

72 million 2.6 million

(Keep Contigs >100bp) N50: 1483 N90: 224

Strategy of SOAPdenovo

Page 89: NGS: Mapping and de novo assembly

Strategy of SOAPdenovo

C. Realign reads into contigs and use paired end information to create scaffolds
•  Require at least 3 consistent pairs to make a connection
•  Start with small inserts, progressively add larger insert libraries
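The "at least 3 consistent pairs" rule can be illustrated with a small Python sketch. It assumes each read pair has already been mapped to contigs and simply counts pairs that span two different contigs; it ignores the orientation and insert-size checks that make a pair "consistent" in a real scaffolder. The function name and contig labels are hypothetical.

```python
from collections import Counter

def scaffold_links(pair_hits, min_pairs=3):
    """pair_hits: iterable of (contig_of_read1, contig_of_read2).
    Returns contig pairs supported by at least `min_pairs` read pairs."""
    support = Counter()
    for contig_a, contig_b in pair_hits:
        if contig_a != contig_b:  # pairs within one contig add no link
            support[frozenset((contig_a, contig_b))] += 1
    return {tuple(sorted(link)) for link, n in support.items() if n >= min_pairs}

hits = [("c1", "c2"), ("c2", "c1"), ("c1", "c2"),  # 3 pairs link c1 and c2
        ("c2", "c3"), ("c2", "c3"),                # only 2 pairs link c2 and c3
        ("c1", "c1")]                              # same-contig pair
print(scaffold_links(hits))
```

Running the same function first on small-insert pairs and then on larger-insert libraries mirrors the progressive strategy in the second bullet.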

Page 90: NGS: Mapping and de novo assembly

Scaffolding Statistics


“In principle, the scaffold size could have been further improved by using even more distant insert-sized paired-end data, such as fosmid ends (~35 Kb) and BAC ends (100~150 Kb).”

Page 91: NGS: Mapping and de novo assembly

Strategy of SOAPdenovo

D. Close Gaps using Paired-end reads
•  Mainly repeats (masked during scaffold construction)
•  Local assembly of the reads that align to the gap
•  If unknown copy number, fill with Ns
•  97% of gaps filled
•  Increased coverage from 84.2% to 93.6%

Page 92: NGS: Mapping and de novo assembly

Are you still awake?


Page 93: NGS: Mapping and de novo assembly

SOAPdenovo Demo

§  Xpr1 variants in mouse
§  “Endogenous gammaretrovirus acquisition in Mus musculus subspecies carrying functional variants of the XPR1 virus receptor.”
•  J Virol. 2013 Sep;87(17):9845-55

§  In IGV, go to Mouse, mm9
§  “Load from URL”:
•  https://dl.dropbox.com/u/30379708/H12.bam
•  https://dl.dropbox.com/u/30379708/H15.bam

§  Go to chr1:157,136,824-157,137,961 in browser

Page 94: NGS: Mapping and de novo assembly

SOAPdenovo2

§  Updates
•  Reduced memory consumption in graph construction
•  Resolves more repeat regions in contig assembly
•  Increased coverage and length in scaffold construction
•  Improved gap closing
•  Optimization for large genome

§  Luo et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012 1:18.

Page 95: NGS: Mapping and de novo assembly

Annotation of Assembly: Repeats

§  RepeatMasker
•  http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html
•  Known repeats
–  Uses RepBase database of known repeats
•  Low complexity repeats, satellites, etc.
–  “100 bp stretch of DNA is masked when it is >87% AT or >89% GC, a 30 bp stretch has to contain 29 A/T (or GC) nucleotides”
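The quoted low-complexity thresholds can be written out as a small Python sketch. This only encodes the rule as quoted on the slide for windows of exactly 100 bp and 30 bp; it is not RepeatMasker's actual code, and the function names are illustrative.

```python
def composition(window):
    """Return (A+T count, G+C count) for a DNA window."""
    w = window.upper()
    at = w.count("A") + w.count("T")
    gc = w.count("G") + w.count("C")
    return at, gc

def masked_100bp(window):
    """100 bp rule: masked when >87% AT or >89% GC."""
    at, gc = composition(window)
    return at / len(window) > 0.87 or gc / len(window) > 0.89

def masked_30bp(window):
    """30 bp rule: masked when at least 29 bases are A/T (or G/C)."""
    at, gc = composition(window)
    return at >= 29 or gc >= 29

print(masked_100bp("AT" * 45 + "GC" * 5))  # 90% AT
print(masked_100bp("AT" * 43 + "GC" * 7))  # 86% AT
print(masked_30bp("A" * 29 + "G"))         # 29 of 30 are A/T
```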

Page 96: NGS: Mapping and de novo assembly

Annotation of Assembly: Gene Structure and Function

§  Known Genes
•  Mapped human and dog genes to panda assembly
•  ~20,000 genes

§  Novel Gene Prediction
•  Genscan
•  Augustus
•  Required at least 3 exons, and at least 30% of the translated sequence should align to SwissProt

§  Gene Function
•  Predict Domains (InterPro)
•  Functional Gene Ontology
•  ncRNAs (e.g., tRNAs, etc.) using INFERNAL
•  Pathways using KEGG

Page 97: NGS: Mapping and de novo assembly

Assessment of Assembly: Coverage and Annotation


Page 98: NGS: Mapping and de novo assembly

Choosing a de novo Assembler

§  Assemblathon 1
•  Genome Res. 2011 21: 2224-2241

§  Genome Assembly Gold-standard Evaluations (GAGE)
•  Genome Res. 2012 22: 557-567
•  http://gage.cbcb.umd.edu/results/index.html

Page 99: NGS: Mapping and de novo assembly

Assemblathon 1


•  BROAD (ALLPATHS-LG) and BGI (SOAPdenovo) performed best overall.

Page 100: NGS: Mapping and de novo assembly

GAGE

§  Multiple genomes
•  Human chr14 (88 Mb)
•  S. aureus (2.9 Mb)
•  R. sphaeroides (4.6 Mb)
•  B. impatiens (~250 Mb)

§  Used Quake and ALLPATHS-LG for error correction for all datasets prior to assembly (chose the best for final report)

§  Compared assembly to known reference to determine how many errors, etc.

Page 101: NGS: Mapping and de novo assembly

GAGE Results

•  Corrected N50 is most instructive

Page 102: NGS: Mapping and de novo assembly

GAGE Results

[Figure: GAGE results table. Annotations on the table, in slide order: Unnecessary Duplication/Compression; Goal: 100%; Small contigs, Goal: 0%; Reference Bases Missing, Goal: 0%; Sequence not in Reference, Goal: 0%; Human more difficult]

Page 103: NGS: Mapping and de novo assembly

GAGE Results


Page 104: NGS: Mapping and de novo assembly

GAGE Results


Misjoins in the assembly are visible in dot-plot graphs

Page 105: NGS: Mapping and de novo assembly

GAGE Summary

•  N50 is average of the three genomes with a known reference
•  Vertical axis is distance between errors
•  “Best” is top right area of graph

Page 106: NGS: Mapping and de novo assembly

GAGE Conclusions

•  “ALLPATHS-LG demonstrated consistently strong performance based on contig and scaffold size, with the best trade-off between size and error rate”

•  “Considering all metrics, and with the caveat that it requires a precise recipe of input libraries, ALLPATHS-LG appears to be the most consistently performing assembler, both in terms of contiguity and correctness.”

•  “SOAPdenovo produced results that initially seemed superior to most assemblers, but on closer inspection it generated many misassemblies that would be impossible to detect without access to a reference genome.”

•  “Despite its poor performance on human, SOAPdenovo performed very well on the bacteria, creating contigs that were eight times larger than it built on the human data.”

•  “Velvet had a particularly high error rate for its scaffolds, creating many more inversions and translocations than any other algorithm.”

•  “Finally, we should note that all of the assemblers considered here are under constant development, and many will be improved by the time this analysis appears.”


Page 107: NGS: Mapping and de novo assembly

Some Strategies for Refining an Assembly

§  Deeper coverage: the shorter the reads, the deeper the coverage needed to produce long contigs

§  Mix of short and long read sizes
§  Combinatorial approach
•  e.g., assemble short reads with de Bruijn (e.g., Velvet), then treat the contigs as long reads in an OLC (overlap-layout-consensus) assembler (e.g., CABOG)

§  Comparative assembly (using a reference sequence to assist)

§  Libraries with a variety of insert sizes and mate pair libraries to scaffold contigs into supercontigs

Schatz M C et al. Genome Res. 2010;20:1165-1173

Page 108: NGS: Mapping and de novo assembly

Thank You

For questions or comments, please contact:

[email protected]

[email protected]


Page 109: NGS: Mapping and de novo assembly

Bowtie Command-line

bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]

•  <ebwt>: Index name (genome)
•  -1 <m1> -2 <m2>: Paired-end “OR” --12 <r>: Tab-delimited (uncommon) “OR” <s>: Single-end
•  [<hit>]: Output file (optional)

e.g., Paired-end
bowtie hg19 -1 SRR027894_1.fastq -2 SRR027894_2.fastq

e.g., Single-end
bowtie hg19 SRR036642.fastq
bowtie hg19 SRR036642.fastq,SRR036643.fastq

http://bowtie-bio.sourceforge.net/manual.shtml

Page 110: NGS: Mapping and de novo assembly

Bowtie Command-line Options

To get options, type: /usr/local/bio_apps/bowtie/bowtie

--solexa1.3-quals  Use for Illumina pipeline 1.3-1.7 quality scores (phred+64) (omit for Illumina 1.8)
-p <int>  Number of threads/processors (default: 1)

Alignment:
-v <int>  Number of mismatches allowed in sequence
OR
-n <int>  Number of mismatches allowed in “seed” portion (first part of read) (default: 2)
-l <int>  Length of seed (default: 28bp)
-e <int>  Maximum sum of scores of all mismatched bases (default: 70)

Reporting Reads:
-k <int>  Number of alignments to report (default: 1)
-a  Report all alignments (disables -k; default: off)
-m <int>  Skip read if more than this many alignments (default: no limit)
-M <int>  Like -m but reports one random alignment instead of skipping (default: no limit)
--best  Order in best-to-worst quality alignment (i.e., fewest mismatches first)
--strata  Only consider those alignments with the fewest mismatches

Output:
-t  Print out time at each step (to terminal)
-S  Output in SAM format
--un <file>  Save unaligned reads to a file (give it a name)
--max <file>  Save reads with more alignments than -m to a file (i.e. repeats; give it a name)

http://bowtie-bio.sourceforge.net/manual.shtml

Page 111: NGS: Mapping and de novo assembly

Bowtie n mode versus v mode

Read sequence:      CTCTGCACGTGTGGGTTCGAGTCCCACCTTCGTTTG
Reference sequence: ATTGTGCTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGACCGTTT
Quality:            FHHHHIGHHFHIFFFGHGCD/DBA>=@?A980/*-)

Quality at the mismatched bases: 37 + 14 + 9 + 8 = 68

In v mode (e.g., -v 2 commonly used): REJECT (because >2 mismatches)
In n mode (default -n 2 -e 70 -l 28): KEEP (because <=70)
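The two decisions on this slide can be reproduced with a short Python sketch. It follows the slide's simplified model only: the mismatch positions (0-based) are assumed to be the four bases whose Phred+33 qualities are 37, 14, 9, and 8, and the function names are illustrative. The Bowtie manual also describes rounding of quality values before they are summed, which this sketch ignores.

```python
def phred33(qual):
    """Decode a Phred+33 quality string into integer scores."""
    return [ord(c) - 33 for c in qual]

def v_mode_accepts(mismatch_positions, v=2):
    """-v mode: only the number of mismatches matters."""
    return len(mismatch_positions) <= v

def n_mode_accepts(qual, mismatch_positions, n=2, e=70, seed_len=28):
    """-n mode: at most n mismatches in the seed, and the summed quality
    at all mismatched positions must not exceed e."""
    scores = phred33(qual)
    seed_mismatches = sum(1 for p in mismatch_positions if p < seed_len)
    quality_sum = sum(scores[p] for p in mismatch_positions)
    return seed_mismatches <= n and quality_sum <= e

qual = "FHHHHIGHHFHIFFFGHGCD/DBA>=@?A980/*-)"
mismatches = [9, 20, 33, 35]  # assumed positions with qualities 37, 14, 9, 8

print("quality sum:", sum(phred33(qual)[p] for p in mismatches))
print("v mode keeps read:", v_mode_accepts(mismatches))
print("n mode keeps read:", n_mode_accepts(qual, mismatches))
```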

Page 112: NGS: Mapping and de novo assembly

Example Bowtie Commands
§  These are some things you could add to a script

#Default alignment settings (plus threaded and SAM output):
bowtie -p 2 -t -S bowtie_hg19/genome SRR036642.fastq out.sam

#Unique alignments:
bowtie -p 2 -t -S -m 1 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -m 1 -a bowtie_hg19/genome SRR036642.fastq out.sam

#Allowing up to 10 repeats (for gene families):
bowtie -p 2 -t -S -m 10 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -m 10 -a bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -k 10 --best --strata bowtie_hg19/genome SRR036642.fastq out.sam

#Input a gzipped file to bowtie (- means stdin)
gunzip -c SRR036642.fastq.gz | bowtie -p 2 -t bowtie_hg19/genome -

Page 113: NGS: Mapping and de novo assembly

Effects of Various Options on Bowtie Output

Total reads in test dataset: 190293

alignment settings | time (s) | reads aligned | reads not aligned* | reads suppressed by -m (repeat) | # alignments reported | reads/alignments ratio
default | 3 | 164801 (86.60%) | 25492 (13.40%) | 0 | 164801 | 1
unique: m1, a, best, strata | 4 | 132168 (69.45%) | 25379 (13.34%) | 32746 (17.21%) | 132168 | 1
unique: m1, a | 11 | 120069 (63.10%) | 25492 (13.40%) | 44732 (23.51%) | 120069 | 1
max10: m10, a, best, strata | 5 | 147459 (77.49%) | 25379 (13.34%) | 17455 (9.17%) | 191860 | 0.768
max10: m10, a | 14 | 135517 (71.21%) | 25492 (13.40%) | 29284 (15.39%) | 180796 | 0.750
max10: k10, best, strata | 5 | 164914 (86.66%) | 25379 (13.34%) | 0 | 366410 | 0.450

*Playing with -l, -n, -e settings could decrease this number, but you will still have some not aligned

Page 114: NGS: Mapping and de novo assembly

Let’s Run Bowtie

Exercise 1: Get today’s alignment dataset
cp -r /scratch/aln ~
ls ~/aln

Exercise 2:
cd ~/aln
*Hint: You can use nano (or other text editor) to change email in test_bowtie.sh script to your email address.
qsub test_bowtie.sh
**qsub output will tell you your jobID (needed for the next step)
qstat -u $LOGNAME (to check status of job occasionally)
cat test_bowtie.sh (to look at the script that we submitted)
*Hint: Use “genome” as the genome name in commands instead of hg19 because of the index basename in the folder
e.g., bowtie bowtie_hg19/genome SRR036642.fastq
*Hint: to learn PBS syntax for submitting jobs on Biowulf, see their website: http://biowulf/user_guide.html

Page 115: NGS: Mapping and de novo assembly

Bowtie Output Stats

Exercise 2, continued:
cat bowtie_test.oXXXXXX (substitute XXXXXX for jobID)

Unique alignments
Seeded quality full-index search: 00:00:09
# reads processed: 190293
# reads with at least one reported alignment: 132168 (69.45%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 32746 (17.21%)
Reported 132168 alignments to 1 output stream(s)
Overall time: 00:00:51

Unique + repeat alignments max 10
Seeded quality full-index search: 00:00:10
# reads processed: 190293
# reads with at least one reported alignment: 147459 (77.49%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 17455 (9.17%)
Reported 191860 alignments to 1 output stream(s)
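If you want these numbers in a script rather than reading them by eye, the summary lines can be parsed with a regular expression. This is a minimal sketch that assumes each "# reads ..." statistic sits on its own line, as in the job output file; the function name and the pattern are illustrative, not part of Bowtie.

```python
import re

# Matches lines such as "# reads processed: 190293" or
# "# reads that failed to align: 25379 (13.34%)".
STAT_LINE = re.compile(r"^# (reads [^:]+): (\d+)(?: \(([\d.]+)%\))?$")

def parse_bowtie_stats(text):
    """Return {statistic name: read count} for every '# reads ...' line."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line.strip())
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats

sample = """\
# reads processed: 190293
# reads with at least one reported alignment: 132168 (69.45%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 32746 (17.21%)
Reported 132168 alignments to 1 output stream(s)
"""

stats = parse_bowtie_stats(sample)
accounted = (stats["reads with at least one reported alignment"]
             + stats["reads that failed to align"]
             + stats["reads with alignments suppressed due to -m"])
print(stats)
print("all reads accounted for:", accounted == stats["reads processed"])
```

The three categories (aligned, failed, suppressed) should add up to the number of reads processed, which is a quick sanity check on any run.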

Page 116: NGS: Mapping and de novo assembly

TopHat Command Line

tophat [options]* <index_base> <reads>
(<index_base>: Index name (genome); <reads>: Fastq file)

•  See “TopHat” Section of Exercise Handout
•  Copy/Paste these commands, waiting for each to finish before going to the next (should take ~1-2 minutes altogether):
§  cd ~/rnaseq_upenn
§  tophat -o wbc -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 wbc_aln.fastq.gz
§  tophat -o lymph -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 lymph_aln.fastq.gz

http://tophat.cbcb.umd.edu/manual.html

Page 117: NGS: Mapping and de novo assembly

Using SAM Tools to Get Sorted BAM File

Convert sam to bam (to compress) or bam to sam
samtools view [options] <in.bam or in.sam>
Options:
-b  Output is BAM
-S  Input is SAM
-h  Include header if output is SAM
-o  Output file (default: stdout)
e.g.,
samtools view -bS -o SRR062634.bam SRR062634.sam
samtools view -h SRR062634.bam | head -n 100

Sort bam file
samtools sort [options] <in.bam> <out.prefix>
(.bam extension will be added to the “prefix”)
e.g.,
samtools sort SRR062634.bam SRR062634.sorted

Index bam file
samtools index <in.sorted.bam>
e.g.,
samtools index SRR062634.sorted.bam

http://samtools.sourceforge.net/samtools.shtml

Page 118: NGS: Mapping and de novo assembly

BAM Stats with SAMtools and Picard

#Put samtools and other binaries on your PATH
export PATH=/usr/local/bio_apps/R/bin/:/usr/local/bio_apps/java/bin/:/usr/local/bio_apps/samtools/:$PATH

Get BAM Stats
samtools idxstats <in.bam>
#Outputs chr, length of chr, # mapped reads, # unmapped reads
e.g.,
samtools idxstats SRR062634.sorted.bam
samtools flagstat SRR062634.sorted.bam
java -jar /usr/local/bio_apps/picard-tools/CollectMultipleMetrics.jar I=SRR062634.sorted.bam O=SRR062634.sorted

Page 119: NGS: Mapping and de novo assembly

Alignment QC Plots

[Figure: three QC plots for accepted_hits.bam.
(1) Insert Size Histogram for All_Reads in file accepted_hits.bam. x-axis: Insert Size; y-axis: Count; series: FR.
(2) accepted_hits.bam Quality By Cycle. x-axis: Cycle; y-axis: Mean Quality; series: Mean Quality, Mean Original Quality.
(3) accepted_hits.bam Quality Score Distribution. x-axis: Quality Score; y-axis: Observations; series: Quality Scores, Original Quality Scores.]