wings2014 workshop 1 design, sequence, align, count, visualize
Slides from Workshop 1 of WiNGS 2014
Workshops in next-generation science at UNC Charlotte 2014
Workshop 1 - Design, sequence, align, count, visualize
Workshop Locations
• Section 1 - Room 801 – Ann Loraine, UNC Charlotte – Naim Matasci, University of Arizona, iPlant
• Section 2 - Room 802 – Ivory Clabaugh Blakley, UNC Charlotte – Xiangqin Cui, University of Alabama Birmingham
• Please stay in your section – Cover same material, but timing may vary
Meet your TAs
• Graduate students from UNCC Dept of Bioinformatics and Genomics
– 801 Roshonda Barner, Ibro Mujacic, Chi-Yu "Jack" Yen, Warren (G.) Cole, Tony Dao, Greg Linchango, Sushma Madamanchi, Anuja Jain
– 802 Richard Linchangco, Fred Lin, Chris Ball, Lu Tian, Shawn Chaffin, Natascha Moestl, Walter Clemens, Adriano Schneider
• Loraine Lab members
– 801 Kyle Suttlemyre (IGB support), April Estrada (Research Specialist, Expert IGB User)
– 802 David Norris (IGB Developer)
Schedule
• Workshop 1 - planning an experiment, data processing, visualization – 9:00 to 11:30, then lunch
• Workshop 2 - introduction to R & RStudio for data analysis, differential expression – 12:30 to 2:30, then a 30-minute break
• Workshop 3 - biological interpretation using pathway tools, Gene Ontology, the Web – 3:00 to 5:00, then done
Using RNA-Seq data set for WiNGS 2014
pollennetwork.org
• Sponsored by Pollen Research Coordination Network in Integrative Pollen Biology (annual meeting starts tonight)
• Visit Web site for more info
RNA-Seq data set for the workshop
• Goal: Provide resources for pollen biology – Example RNA-Seq data analysis – Catalog of genes expressed in pollen – Highlight important area of pollen research
• Problem: Pollen in some plant species is vulnerable to heat stress, which reduces yields – Exposure to mild heat stress (acclimation) can protect against more severe stress later - called acquired thermotolerance (Firon 2012)
• To learn more, we sequenced RNA extracted from pollen undergoing a mild heat stress – Same temperature that can establish thermotolerance
Samples from the lab of Nurit Firon, Volcani Institute, Israel
• Firon lab studies effects of heat stress on tomato pollen
• Showed (along with others) that high temp. reduces pollen viability, sugar content
• Studying a heat-tolerant tomato cultivar: Hazera 3042 – Pollen is sensitive to heat stress but not as much as other varieties
Nurit's experiment: RNA-Seq of heat-tolerant tomato cultivar Hazera 3042
• Collected pollen from plants growing in temperature-controlled greenhouses – Control 25/18° C optimal temperature – Treatment 32/26° C mild chronic heat stress
• Collected batches of pollen from ~10 plants during Sep. & Oct. 2013 – One treatment, one control per collection – Made RNA from five collections: 5 treatment, 5 control "batches" – Sequenced at UCLA (69 base, PE)
Arabidopsis cold stress RNA-Seq
• Simpler data set with one treatment & control – Using data from part of chr1, treatment sample to illustrate data processing, visualization, effects of parameter settings on results (maximum intron size in tophat spliced alignment program)
• For details, see: – experiment record at the Short Read Archive http://www.ncbi.nlm.nih.gov/sra/SRP029896 – sample http://www.ncbi.nlm.nih.gov/sra/SRX348640
• Published in Methods in Molecular Biology http://www.ncbi.nlm.nih.gov/pubmed/24792048
Workshop 1: RNA-Seq: Design, sequence, align, count, visualize
WiNGS 2014
Goals
• Learn the basics (20 min) – Plan an experiment – Library prep for RNA-Seq – Illumina sequencing
• Practice: Quality analysis using FastQC (30 min)
• Practice: Data processing (30 min) – Align reads (make BAM files and junction files) – Make counts files for statistical analysis – Merge reads into transcript models w/ Cufflinks
• Practice: Visualize results in IGB (60 min) – Compare to data set in Galaxy, TAIR10 gene models
Workshop 1 Overview
[Figure: workflow diagram with the labels: Sequencing Strategy; FASTQ files (WildType1a.fastq); FASTQC; Alignment onto Genome ($Command Line…); WildType1a.bam; Visualization using IGB; Generation of Counts Data (Counts.txt); Workshop 2]
RNA-seq: ultra-high throughput cDNA sequencing
• Several papers published in 2008, first in May
http://blog.sbgenomics.com/rna-seq-the-first-wave/
[Figure labels: Ecker lab; Snyder lab; 999 cites; 1,076 cites]
Mortazavi 2008 "Mapping and quantifying mammalian transcriptomes by RNA-Seq" Nature Methods
• Published later in 2008, but > 3000 citations
• Why? Maybe because it emphasized RNA-Seq as a replacement for expression DNA microarrays
• Comment in same issue: "Beginning of the end for microarrays?"
Citation counts from Google Scholar
RNA-Seq Overview - Illumina
[Figure: flow chart of an RNA-Seq experiment in four stages]
1. Design: plan experiment (biological replication, sequencing strategy, data analysis strategy); collect samples
2. Making Libraries: extract RNA, purify polyA+; fragment; synthesize cDNA (random hexamers); repair ends; add "A" bases to 3' ends; ligate adapters; amplify; library reflects RNA from original sample
3. Sequencing: prepare flowcell; sequence by synthesis; data are fastq sequence files, millions of reads per library
4. Data Analysis: quality assessment; map to genome (alignments); count reads per gene; improve gene models; identify differentially expressed genes; analyze splicing; and much more
Five steps for design
1. Articulate your questions or hypothesis
2. Define your unit of biological replication
3. Write up your sample collection protocol in detail – Does the protocol allow you to test your hypothesis?
4. Define library synthesis & sequencing strategy – Read lengths, paired end vs. single end, depth, barcoding
5. Ask an experienced data analyst to review your plan, revise as needed
Library synthesis
[Figure labels: Fork or "Y" adapters; size selection]
Image: David C Corney Ph.D. http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html
Example library molecule
• Y adapters contain indexes, allow multiplexing
• Rd1 & Rd2 are from reverse complements, might overlap
[Figure labels: Unknown sequence; Rd1; Rd2; barcode; Universal adapter; Index Primer; P5; P7]
Ref: http://nextgen.mgh.harvard.edu/IlluminaChemistry.html
Flow cell preparation & sequencing by synthesis
https://www.youtube.com/watch?v=HMyCqWhwB8E
Review: Paired End vs Single End
• Single End (SE) – cheaper
• Paired End (PE) – more expensive – two reads per fragment – counting fragments, not reads – call normalized counts FPKM, not RPKM
[Figure labels: sequenced in SE; sequenced in PE; indexed adapter]
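
The FPKM idea above (fragments per kilobase of transcript per million mapped fragments) can be sketched in a few lines of Python. This is only an illustration: the function name and the example numbers are made up, not taken from the workshop data. RPKM uses the same arithmetic with reads in place of fragments.

```python
def fpkm(fragments_in_gene, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragments_in_gene / (kilobases * millions)


# Hypothetical gene: 500 fragments, 2,000 bp long, in a library of
# 10 million mapped fragments (all three numbers are invented).
example_fpkm = fpkm(500, 2_000, 10_000_000)
print(example_fpkm)
```

With paired-end data each fragment contributes two reads, so counting fragments avoids counting the same molecule twice.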
Get the reads in a FASTQ file
• File contains millions of records – Each record has four lines, represents ONE sequence
• Line 1 – the name, starts with @
• Line 2 – the sequence, starts at new line
• Line 3 – starts with +, optionally followed by a repeat of the name
• Line 4 – the quality scores, starts at new line
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA
+
BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII
Example: base = T, quality character = F, score = 37
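
A minimal Python sketch of reading one four-line record and turning the quality characters into numbers. It assumes the Phred+33 encoding (character code minus 33), which is consistent with "F = 37" on the slide; the function names are invented for this example.

```python
record = """@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA
+
BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII"""


def parse_fastq_record(text):
    """Split one FASTQ record into (name, sequence, quality string)."""
    name, sequence, plus, quality = text.splitlines()
    if not name.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return name[1:], sequence, quality


def phred33_scores(quality):
    """Decode quality characters, assuming Phred+33 encoding."""
    return [ord(char) - 33 for char in quality]


name, sequence, quality = parse_fastq_record(record)
scores = phred33_scores(quality)
print(len(sequence), min(scores), max(scores))
```

Every base gets exactly one quality character, which is why the two lines must have the same length.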
Phred Quality score Q
http://en.wikipedia.org/wiki/FASTQ_format
Describes how exponentially unlikely it is that a given base call is wrong.
Q = -10 log10(p_e), where p_e is the estimated probability that the base call is wrong
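
The relationship between Q and the error probability can be checked numerically. A small sketch (function names are made up for illustration):

```python
import math


def error_probability(q):
    """Probability that a base call is wrong, given Phred score Q."""
    return 10 ** (-q / 10)


def phred_score(p_error):
    """Phred score Q for a given error probability."""
    return -10 * math.log10(p_error)


for q in (10, 20, 30, 37):
    print(q, error_probability(q))
```

Each step of 10 in Q means a tenfold drop in error probability: Q = 20 is 1 error in 100 calls, Q = 30 is 1 in 1,000.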
http://drive5.com/usearch/manual/quality_score.html
Different Illumina data processing pipelines used different score encodings
Get two files - Read1 & Read2 - from paired end sequencing
• Read1 and Read2 have same read identifier, are read from opposite ends and opposite strands of the same fragment
• Example is from processing pipeline CASAVA 1.8; older versions used different naming conventions
R1:
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA
+
BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII
R2:
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 2:N:0:12
CATTTTCGACGTTGTTAATAAGCTCTGCGTACTTGCAAGCTATCTGCGCGAACG
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFF
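
One quick sanity check on paired-end files is that the Nth record in each file describes the same fragment. A sketch, assuming the header style shown above (identifier, a space, then the read number); the helper names are invented:

```python
def fragment_id(header_line):
    """Part of the header before the space, without the leading '@'."""
    return header_line.lstrip("@").split(" ")[0]


def read_number(header_line):
    """First field after the space: 1 for Read1, 2 for Read2."""
    return int(header_line.split(" ")[1].split(":")[0])


def is_mate_pair(header_1, header_2):
    """True if the two headers are Read1 and Read2 of one fragment."""
    same_fragment = fragment_id(header_1) == fragment_id(header_2)
    numbers = {read_number(header_1), read_number(header_2)}
    return same_fragment and numbers == {1, 2}


r1 = "@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12"
r2 = "@SN1083:379:H8VA1ADXX:2:1101:1248:2144 2:N:0:12"
print(is_mate_pair(r1, r2))
```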
Sequence identifier line in CASAVA 1.8
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
Before the space: machine : run# : flow-cell-id : lane : tile : x-pos : y-pos
After the space: read# : is-filtered : control : index (barcode)
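
The field layout can be made concrete by splitting the example identifier line in Python. This is a sketch only; the class and field names are chosen here for readability and are not part of any Illumina tool.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    machine: str
    run: int
    flowcell: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control: int
    index: str


def parse_casava_header(line):
    """Split a CASAVA 1.8 style identifier line into named fields."""
    left, right = line.strip().lstrip("@").split(" ")
    machine, run, flowcell, lane, tile, x_pos, y_pos = left.split(":")
    read, filtered, control, index = right.split(":")
    return CasavaHeader(machine, int(run), flowcell, int(lane), int(tile),
                        int(x_pos), int(y_pos), int(read),
                        filtered == "Y", int(control), index)


header = parse_casava_header("@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12")
print(header)
```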
FastQC
• Many groups use FastQC as a first pass quality assessment
• Free from Babraham http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• Run interactively (point-and-click) or command line (won't cover this)
Practice: Using FastQC
• Go to Conference DropBox link: – http://bitly.com/rnaseq2014
• Note two folders – FastQC and FastQC-Examples – FastQC-Examples has FastQC reports from different species, sample types (next slide)
• From FastQC folder, download – Example.fastq – FastQC_Manual.pdf
• Start FastQC, open Example.fastq
Practice: Watch FastQC video
• https://www.youtube.com/watch?v=bz93ReOv87Y (start around 34 sec)
• Take-home #1: FastQC assesses whether your data files are typical
• Take-home #2: A "bad result" from FastQC doesn't always mean your data are not useful or valuable
• Explore on your own! (~15 minutes)
Practice: View reports in FastQC-Examples (~15 min)
• Blueberry – OnealRipe_1 – OzarkblueGreen_1
• Tomato pollen – T2_1 – C2_1
• Rice – Control2h-R2
[Figure label: Per read %GC]
Practice: Data processing
• Double-click "Alignment.tar.gz" on your Desktop to unpack it
• Also available from http://bitly.com/rnaseq2014
Practice: Look at "align.sh"
• Open Alignment folder
• Right-click "align.sh"
• Select "open with text editor"
• This is a shell script – Commands executed in sequence – Very useful for automating tasks
• First line is the "shebang" line – names the program (the shell) that runs the script
• All other lines starting with # are comments (not run)
[Book cover: Learning the bash Shell. Great guide to writing shell scripts]
align.sh - simple pipeline for RNA-Seq data processing
• Aligns a sample fastq file to genome – tophat2, bowtie2 – fastq file is from Arabidopsis cold stress experiment (Short Read Archive SRX348640) – file ColdTreatment-little.fastq.gz (gzip-compressed, .gz)
• Counts reads that align to TAIR10 genes – featureCounts – only counting reads that uniquely align
• Merges alignments into transcript models – cufflinks
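
The three steps can be pictured as three command lines run one after another. The Python sketch below only builds and prints the commands; it is not the contents of align.sh, which is not reproduced in these slides. The index name, annotation file, output folder names, and the maximum intron value are all placeholders. The options used (`-o` and `-I` for tophat2, `-a` and `-o` for featureCounts, `-o` for cufflinks) are standard ones for those tools.

```python
def build_pipeline(fastq, bowtie_index, annotation_gtf,
                   tophat_dir="tophat_out", max_intron=None):
    """Return the align, count, and assemble commands as argument lists."""
    align = ["tophat2", "-o", tophat_dir]
    if max_intron is not None:
        # -I sets the maximum intron length considered by tophat2
        align += ["-I", str(max_intron)]
    align += [bowtie_index, fastq]

    bam = tophat_dir + "/accepted_hits.bam"
    # featureCounts leaves out multi-mapping reads unless asked otherwise
    count = ["featureCounts", "-a", annotation_gtf, "-o", "counts.txt", bam]
    assemble = ["cufflinks", "-o", "cufflinks_out", bam]
    return [align, count, assemble]


# Placeholder inputs; only the fastq file name comes from the slide above.
commands = build_pipeline("ColdTreatment-little.fastq.gz",
                          "A_thaliana_index", "TAIR10.gtf",
                          max_intron=2000)
for command in commands:
    print(" ".join(command))
```

Writing the steps in a script means the same processing can be repeated on every sample without retyping anything.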
Practice: Intro to Terminal
• Double-click Terminal shortcut on desktop – Program for entering commands or running scripts – Can open multiple Terminal windows
• Each window called a "shell" or "Unix shell"
• Terminal shows hierarchical view of file system – An upside-down tree, where every folder is inside another folder – Folders are also called "directories" – The top folder (that contains everything else) is called "root" directory - / (forward slash)
Practice: Open Terminal, try these commands
• cd - "change directory" – by itself means "go to user home directory" – with an argument means: go there – with ".." means go up one
• pwd - "print the current working directory" & find out where you are
Practice: Try these commands
• ls - lists files and directories in the current directory
• ls -l - "list long" – report more information about files – "d" means it's a directory (folder)
Practice: Run align.sh in Terminal
• Go to home directory
• Go to Desktop
• Go to Alignment
• Run align.sh
Now Running: tophat2 spliced alignment tool
[Figure: "TopHat: discovering splice junctions with RNA-Seq", Cole Trapnell, Lior Pachter and Steven L. Salzberg, Figure 1]
Tophat Output - we'll open in IGB
• Creates new folder with files, including...
• accepted_hits.bam - "binary alignments" file contains read alignments – BAM - compressed version of SAM ("Sequence Alignment/Map"), needs index ".bai" file (made using samtools)
• junctions.bed - reports boundaries of introns, called "junction" features – BED format, tab-delimited plain text file – one junction feature per line – fifth field is score, no. spliced reads aligned across the junction
– see: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
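
Because BED is tab-delimited text, pulling out the score takes only a few lines. A sketch in Python; the example line is invented to show the layout and is not from the workshop output.

```python
def parse_junction(bed_line):
    """Return the first six BED fields of one junction feature."""
    fields = bed_line.rstrip("\n").split("\t")
    return {
        "chrom": fields[0],
        "start": int(fields[1]),
        "end": int(fields[2]),
        "name": fields[3],
        "score": int(fields[4]),   # fifth field: number of spliced reads
        "strand": fields[5],
    }


# Hypothetical junction line (made-up coordinates and name).
example_line = ("chr1\t1000\t1500\tJUNC00000001\t12\t+"
                "\t1000\t1500\t255,0,0\t2\t50,60\t0,440")
junction = parse_junction(example_line)
print(junction["name"], junction["score"])
```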
Practice: Start IGB while script runs
• Double-click IGB desktop icon
• Click Arabidopsis flower on start screen
Practice: How to get IGB if you're using your own computer
• Go to http://bioviz.org
• Follow Download link
• Choose Medium Memory option (typical)
TAIR10 annotations, June 2009 Columbia-0 genome release
• TAIR10 protein-coding gene models loaded automatically from IGB data server
• Forward & reverse strand in separate tracks
[Figure labels: Forward; Reverse]
RNA-Seq, ChIP-Seq, other data sets available in Data Access tab
• IGB data servers, can set up your own
Arabidopsis pollen data sets
• Read alignments, coverage graphs, junction files
• From 2013 Plant Phys. Pollen RNA-Seq paper
Practice: Combine Plus & Minus Tracks
• Click "+/-" to combine tracks
• Use Data Management Table to change track color, name, visibility, load options, strand options
Summary of moving and zooming
• Animated zooming – click to position zoom stripe, sets zoom focus – horizontal zoom & vertical stretch
• Moving from side to side (panning) – arrows in toolbar – hand icon - the move tool
• Jump-zooming – Click-drag coordinate axis with arrow tool – Double-click to zoom in on a feature – Search by name
Practice: Zoom in on a feature
• Zoom in on alt-spliced gene models (*) on chr1
• This is animated zooming
1. Click to set zoom focus
2. Drag slider to zoom in
Practice: Click move arrows to reposition during zoom
• Click data display to re-focus zoom on target location
Practice: Or use move tool (hand) to reposition during zoom
• Click display to focus zoom on target
1. Select move tool (hand)
2. Click-drag to move
Practice: Click-drag sequence axis to jump-zoom to a region
1. Select pointer tool
2. Click number line
3. Drag
4. Release
• Highlighted region becomes new view
Practice: Jump-zoom to gene model
• Double-click label, space a little above exon blocks, or intron to jump-zoom to a gene model – Also selects it, selected items outlined in red
1. Select pointer tool
2. Double-click label or intron
After jump-zoom, gene model is selected
• Arrows indicate direction of transcription
• Selected gene model outlined in red
Practice: Gene model close-up
• Use vertical slider to make gene models taller (drag slider to stretch vertically)
• Increase window size to make more room
Practice: Interact with data using pointer
• Select pointer (arrow) in toolbar
• Click intron, label, or region above blocks to select whole gene model
• Click blocks to select parts of a gene model
• SHIFT-click to multi-select
• CLICK-drag to select & count everything in a region
• Selection Info, top right, reports counts – "i" button shows info if one item selected
Practice: View edge matching
• Edges that match selected item edges are highlighted in red
• To change edge-match color choose File > Preferences > Other Options
• To turn off or on, see View > Edge Matching
Practice: To work with sequence data, click Load Sequence
• Sequence appears in Coordinates track
Practice: Zoom in to see amino acids
• Note: Must load genomic sequence first
Practice: Zoom in on end of translation
• Click the "thick end" and then zoom in
• Note: Variants encode same C-term amino acids
Practice: Select genomic sequence
1. Choose pointer tool in toolbar
2. Click-drag genomic sequence to select a region
3. CTRL-click to copy
• Length of selected region reported in Selection Info box (top right)
• Useful for designing primers, measuring regions
Practice: Right-click (or CTRL-click) gene model
• Shows options to run a Web search, BLAST search, view sequence
Practice: Quick Search
• Enter search text, select option
• Jump-zoom to selected gene
• Choose At-SR30
• Zoomed to At-SR30, RNA-binding protein involved in splicing
Looking ahead to Workshop 3
• Some genes that were highly expressed in tomato pollen are annotated as "Unknown" proteins & have no counterpart in Arabidopsis.
• You can use IGB to quickly find those genes and then run BLASTX or BLASTP searches at NCBI to find out... – Are they unique to tomato? – Could they be non-coding?
Practice: Open files from align.sh
• Zoom out to show more of At-SR30 region
• Choose File > Open – Select "accepted_hits.bam" & "junctions.bed"
• A new empty track appears for each file
• Click Load Data to load reads and junctions
[Figure labels: read alignments stack; reads at top of stack not being shown (too many to fit); junction features, summarizing spliced reads]
Practice: Configure view - Load Sequence
• Click Load Sequence to load genomic bases for this region
Practice: Configure view - Lock mRNA track height
1. Click TAIR10 mRNA track label to select it
2. Open Annotation tab
3. Select Lock Track Height, enter 170, click Apply
Practice: Configure view - configure junction track
1. Click junctions track label to select junctions track
2. Open Annotation tab
3. Select score in Label Field
4. Select +/- in Strand
Practice: Configure view - lock junction track height
1. Click junctions track label to select it
2. Open Annotation tab
3. Select Lock Track Height, enter 120, click Apply
Practice: Change read stack height to see more reads
1. CTRL-click (or right-click) accepted_hits.bam track label
2. Choose Set Stack Height...
3. Enter 50
Practice: Set mRNA stack height
1. Right-click TAIR10 mRNA track label, choose Set Stack Height
2. Enter 3 - tallest stack has 3 models
Note: Tabs are minimized to make more space
Practice: Note read support for alternative splicing
Take-home: Many spliced reads support both variants, but there are also many reads inside the introns, indicating failure to splice. This may be typical of alt-spliced introns?
Practice: Use junction track to quantify support for splice variants
1. Click-drag to genes track
2. Scores are number of spliced reads supporting each junction.
Practice: Compare Cufflinks GTF file to gene models
• Open Alignments > cufflinks_cold > transcripts.gtf
Practice: View Cufflinks gene models
1. Click Load Data to see Cufflinks models
2. Click-drag new track next to gene models
3. Use vertical slider to make more room
Take-home: Cufflinks annotations close, but incomplete.
Practice: Load data from Galaxy
1. Go to usegalaxy.org
2. Open Shared Data
3. Choose Published Histories
Practice: Load data from Galaxy
1. Search for Cold
3. Select Cold stress in Arabidopsis (with default maximum intron size)
Practice: Load data from Galaxy
• Illustrates results when tophat is run with default settings: – default maximum intron size is 500,000 bases
• Tophat was developed with human data in mind, where large introns are common
• Select Import History
Practice: Select "start using this history"
1. Select Treatment junctions
2. Select display in IGB View
New tab opens. Select Click to go to IGB
New track: 1. Click Load Data
Practice: Remove reads - don't need them now
1. Right-click accepted_hits.bam
2. Choose Delete Track
1. Zoom out all the way
2. Click Load Data
Your data are here
Take-home: Tophat run with default parameters predicts enormous introns. Important to understand parameter settings - defaults are not always best.
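
One way to spot the problem outside the browser is to compute the intron length for each junction feature and flag the very long ones. The sketch below assumes the junction BED layout produced by TopHat, in which each feature has two blocks flanking the intron. The example lines and the 5,000-base cutoff are made up for illustration, not recommended values.

```python
def intron_length(bed_line):
    """Length of the gap between the two blocks of a junction feature."""
    fields = bed_line.rstrip("\n").split("\t")
    feature_start = int(fields[1])
    block_sizes = [int(s) for s in fields[10].rstrip(",").split(",")]
    block_starts = [int(s) for s in fields[11].rstrip(",").split(",")]
    intron_start = feature_start + block_starts[0] + block_sizes[0]
    intron_end = feature_start + block_starts[1]
    return intron_end - intron_start


# Two hypothetical junction lines: one ordinary, one suspiciously long.
lines = [
    "chr1\t1000\t1500\tJUNC00000001\t12\t+\t1000\t1500\t255,0,0\t2\t50,60\t0,440",
    "chr1\t1000\t251085\tJUNC00000002\t1\t+\t1000\t251085\t255,0,0\t2\t40,45\t0,250040",
]

CUTOFF = 5000  # hypothetical threshold, chosen only for this example
lengths = [intron_length(line) for line in lines]
suspicious = [length for length in lengths if length > CUTOFF]
print(lengths, suspicious)
```

A junction with a huge intron and a score of only one or two reads is a good candidate for a closer look in IGB.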
Now you can
• Describe Illumina library synthesis, sequencing
• Evaluate data quality using FastQC
• Run a data processing pipeline (shell script)
• View and explore data in a genome browser – and load data sets from Galaxy, local files
Thank you for your attention!