wings2014 workshop 1 design, sequence, align, count, visualize

Workshops in next-generation science at UNC Charlotte 2014

Workshop 1 - Design, sequence, align, count, visualize

Workshop Locations

•  Section 1 - Room 801
–  Ann Loraine, UNC Charlotte
–  Naim Matasci, University of Arizona, iPlant

•  Section 2 - Room 802
–  Ivory Clabaugh Blakley, UNC Charlotte
–  Xiangqin Cui, University of Alabama Birmingham

•  Please stay in your section
–  Cover same material, but timing may vary

Meet your TAs

•  Graduate students from UNCC Dept of Bioinformatics and Genomics
–  801 Roshonda Barner, Ibro Mujacic, Chi-Yu "Jack" Yen, Warren (G.) Cole, Tony Dao, Greg Linchango, Sushma Madamanchi, Anuja Jain
–  802 Richard Linchangco, Fred Lin, Chris Ball, Lu Tian, Shawn Chaffin, Natascha Moestl, Walter Clemens, Adriano Schneider

•  Loraine Lab members
–  801 Kyle Suttlemyre (IGB support), April Estrada (Research Specialist, Expert IGB User)
–  802 David Norris (IGB Developer)

Schedule

•  Workshop 1 - planning an experiment, data processing, visualization
–  9:00 to 11:30, then lunch

•  Workshop 2 - introduction to R & RStudio for data analysis, differential expression
–  12:30 to 2:30, then a 30-minute break

•  Workshop 3 - biological interpretation using pathway tools, Gene Ontology, the Web
–  3:00 to 5:00, then done

Using RNA-Seq data set for WiNGS2014

pollennetwork.org

•  Sponsored by Pollen Research Coordination Network in Integrative Pollen Biology (annual meeting starts tonight)

•  Visit Web site for more info

RNA-Seq data set for the workshop

•  Goal: Provide resources for pollen biology
–  Example RNA-Seq data analysis
–  Catalog of genes expressed in pollen
–  Highlight important area of pollen research

•  Problem: Pollen in some plant species is vulnerable to heat stress, which reduces yields
–  Exposure to mild heat stress (acclimation) can protect against more severe stress later; this is called acquired thermotolerance (Firon 2012)

•  To learn more, we sequenced RNA extracted from pollen undergoing a mild heat stress
–  Same temperature that can establish thermotolerance

Samples from the lab of Nurit Firon, Volcani Institute, Israel

•  Firon lab studies effects of heat stress on tomato pollen

•  Showed (along with others) that high temperature reduces pollen viability, sugar content

•  Studying a heat-tolerant tomato cultivar: Hazera 3042
–  Pollen is sensitive to heat stress but not as much as other varieties

Nurit's experiment: RNA-Seq of heat-tolerant tomato cultivar Hazera 3042

•  Collected pollen from plants growing in temperature-controlled greenhouses
–  Control 25/18 °C optimal temperature
–  Treatment 32/26 °C mild chronic heat stress

•  Collected batches of pollen from ~10 plants during Sep. & Oct. 2013
–  One treatment, one control per collection
–  Made RNA from five collections: 5 treatment, 5 control "batches"
–  Sequenced at UCLA (69 base, PE)

Arabidopsis cold stress RNA-Seq

•  Simpler data set with one treatment & control
–  Using data from part of chr1, treatment sample, to illustrate data processing, visualization, and effects of parameter settings on results (maximum intron size in tophat spliced alignment program)

•  For details, see:
–  experiment record at the Short Read Archive http://www.ncbi.nlm.nih.gov/sra/SRP029896
–  sample http://www.ncbi.nlm.nih.gov/sra/SRX348640

•  Published in Methods in Molecular Biology http://www.ncbi.nlm.nih.gov/pubmed/24792048

Workshop 1: RNA-Seq: Design, sequence, align, count, visualize

Goals

•  Learn the basics (20')
–  Plan an experiment
–  Library prep for RNA-Seq
–  Illumina sequencing

•  Practice: Quality analysis using FastQC (30')

•  Practice: Data processing (30')
–  Align reads (make BAM files and junction files)
–  Make counts files for statistical analysis
–  Merge reads into transcript models w/ Cufflinks

•  Practice: Visualize results in IGB (60')
–  Compare to data set in Galaxy, TAIR10 gene models

Workshop 1 Overview

[Figure: workflow diagram with labels: Sequencing Strategy; FASTQ files (WildType1a.fastq); FastQC; Alignment onto Genome (command line); WildType1a.bam; Generation of Counts Data (Counts.txt); Visualization using IGB; Workshop 2]

RNA-seq: ultra-high throughput cDNA sequencing

•  Several papers published in 2008, first in May

[Figure: early RNA-Seq papers from the Ecker lab and the Snyder lab; citation counts shown are 999 and 1,076. Source: http://blog.sbgenomics.com/rna-seq-the-first-wave/]

Mortazavi 2008 "Mapping and quantifying mammalian transcriptomes by RNA-Seq" Nature Methods

•  Published later in 2008, but > 3000 citations (Google Scholar)

•  Why? Maybe because it emphasized RNA-Seq as a replacement for expression DNA microarrays

•  Comment in same issue: "Beginning of the end for microarrays?"

RNA-Seq Overview - Illumina

1. Design
–  Plan experiment: biological replication, sequencing strategy, data analysis strategy
–  Collect samples

2. Making Libraries
–  Extract RNA, purify polyA+
–  Fragment
–  Synthesize cDNA (random hexamers)
–  Repair ends
–  Add "A" bases to 3' ends
–  Ligate adapters
–  Amplify
–  Library reflects RNA from original sample

3. Sequencing
–  Prepare flowcell
–  Sequence by synthesis
–  Data: fastq sequence files, millions of reads per library

4. Data Analysis
–  Quality assessment
–  Map to genome (alignments), count reads per gene
–  Improve gene models, identify differentially expressed genes, analyze splicing, and much more

Five steps for design

1.  Articulate your questions or hypothesis
2.  Define your unit of biological replication
3.  Write up your sample collection protocol in detail
–  Does the protocol allow you to test your hypothesis?
4.  Define library synthesis & sequencing strategy
–  Read lengths, paired end vs. single end, depth, barcoding
5.  Ask an experienced data analyst to review your plan, revise as needed

Library synthesis

[Figure: library synthesis steps, with labels for fork or "Y" adapters and size selection. Image: David C Corney Ph.D. http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html]

Y adapters contain indexes, allow multiplexing

[Figure: example library molecule, with labels: unknown sequence, Rd1, Rd2, barcode, universal adapter, index primer, P5, P7]

•  Rd1 and Rd2 are read from opposite strands of the same fragment (reverse complements) and might overlap. Ref: http://nextgen.mgh.harvard.edu/IlluminaChemistry.html

Flow cell preparation & sequencing by synthesis

https://www.youtube.com/watch?v=HMyCqWhwB8E

Review: Paired End vs Single End

•  Single End – cheaper

•  Paired End – more expensive
–  two reads per fragment
–  counting fragments, not reads
–  call normalized counts FPKM, not RPKM

[Figure: a fragment sequenced in SE and in PE, showing the indexed adapter]
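As a rough illustration of the FPKM idea above, the sketch below computes fragments per kilobase of gene length per million mapped fragments. The function name and the example numbers are made up for illustration, and real tools may apply further corrections.

```python
def fpkm(fragment_count, gene_length_bp, total_mapped_fragments):
    """Fragments per kilobase of gene length per million mapped fragments."""
    if gene_length_bp <= 0 or total_mapped_fragments <= 0:
        raise ValueError("gene length and library size must be positive")
    kilobases = gene_length_bp / 1_000
    millions = total_mapped_fragments / 1_000_000
    return fragment_count / (kilobases * millions)


# Hypothetical numbers: 500 fragments on a 2,000-base gene,
# in a library with 10 million mapped fragments.
print(fpkm(500, 2_000, 10_000_000))
```

With paired-end data each fragment is counted once even though it produced two reads, which is why the unit is fragments rather than reads.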

Get the reads in a FASTQ file

•  File contains millions of records
–  Each record has four lines, represents ONE sequence

•  Line 1 – the name, starts with @
•  Line 2 – the sequence, starts at new line
•  Line 3 – starts with +, optionally followed by some other stuff (such as the name again)
•  Line 4 – the quality scores, starts at new line

@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA
+
BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII

Example: base = T, score = F = 37
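The four-line record layout can be read with a few lines of Python. This is a minimal sketch, assuming plain (uncompressed) text and records that are never wrapped across extra lines; the function name is ours, not part of any workshop script.

```python
import io


def read_fastq(handle):
    """Yield (name, sequence, quality) for each four-line FASTQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ: " + header)
        yield header[1:], sequence, quality


# The example record from the slide.
EXAMPLE = "\n".join([
    "@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12",
    "CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA",
    "+",
    "BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII",
]) + "\n"

for name, sequence, quality in read_fastq(io.StringIO(EXAMPLE)):
    print(name, len(sequence), len(quality))
```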

Phred Quality score Q

•  Describes how exponentially unlikely it is that a given base call is wrong

$Q = -10 \log_{10} p_e$, where $p_e$ is the probability that the base call is wrong

http://en.wikipedia.org/wiki/FASTQ_format

Different Illumina data processing pipelines used different score encodings

http://drive5.com/usearch/manual/quality_score.html
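The relationship between quality characters, Q scores, and error probabilities can be sketched as follows. The default offset of 33 is the Sanger-style encoding that matches the slide's example (F = 37); some older Illumina pipelines used an offset of 64 instead, so the offset is a parameter. Function names are ours.

```python
def decode_quality(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]


def error_probability(q):
    """Invert Q = -10 * log10(p_e) to get the error probability p_e."""
    return 10 ** (-q / 10)


scores = decode_quality("BFI<")
print(scores)
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 30 to 1 in 1,000.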

Get two files - Read1 & Read2 - from paired end sequencing

•  Read1 and Read2 have the same read identifier and are read from opposite ends of the same fragment, on opposite strands (reverse complements)

•  Example is from processing pipeline Casava 1.8; older versions used different naming conventions

R1
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12
CCTAAATGGTGCCATGCTAGGAGGCCGTGCCCTTCTTGAAAAGTTGTATGTGAA
+
BBBFFFFFFBFFFIIIIFI<FFIIIIIFIIIIFBFIIIIIIIIFFFIIIIFIII

R2
@SN1083:379:H8VA1ADXX:2:1101:1248:2144 2:N:0:12
CATTTTCGACGTTGTTAATAAGCTCTGCGTACTTGCAAGCTATCTGCGCGAACG
+
BBBFFFFFFFFFFIIIIIIIIIIIIIIIIFIIIIIIIIIIIIIIIIIIIIIFFF
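One quick sanity check on paired files is that the Read1 and Read2 headers agree up to the space and differ only in the read number. The sketch below assumes Casava 1.8 style headers like the two shown above; the helper names are hypothetical.

```python
def split_header(header):
    """Return (shared fragment name, read number) from a Casava 1.8 header."""
    name, _, description = header.lstrip("@").partition(" ")
    read_number = description.split(":")[0] if description else ""
    return name, read_number


def is_mate_pair(header1, header2):
    """True if two headers name the same fragment as read 1 and read 2."""
    name1, read1 = split_header(header1)
    name2, read2 = split_header(header2)
    return name1 == name2 and {read1, read2} == {"1", "2"}


r1 = "@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12"
r2 = "@SN1083:379:H8VA1ADXX:2:1101:1248:2144 2:N:0:12"
print(is_mate_pair(r1, r2))
```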

Sequence identifier line in Casava 1.8

@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12

•  Before the space: machine : run# : flow-cell-id : lane : tile : x-pos : y-pos
•  After the space: read# : is-filtered : control : index (barcode)
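For illustration, the identifier line can be split into named fields as below. This is a sketch that assumes the Casava 1.8 layout (seven colon-separated fields, a space, then four more); the class and function names are ours.

```python
from typing import NamedTuple


class CasavaHeader(NamedTuple):
    machine: str
    run: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: bool
    control: int
    index: str


def parse_casava_header(line):
    """Split a Casava 1.8 identifier line into its named fields."""
    left, right = line.strip().lstrip("@").split(" ", 1)
    machine, run, flowcell_id, lane, tile, x_pos, y_pos = left.split(":")
    read, filtered, control, index = right.split(":")
    return CasavaHeader(
        machine, int(run), flowcell_id, int(lane), int(tile),
        int(x_pos), int(y_pos), int(read), filtered == "Y",
        int(control), index,
    )


header = parse_casava_header("@SN1083:379:H8VA1ADXX:2:1101:1248:2144 1:N:0:12")
print(header)
```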

FastQC

•  Many groups use FastQC as a first pass quality assessment

•  Free from Babraham http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

•  Run interactively (point-and-click) or from the command line (won't cover this)

Practice: Using FastQC

•  Go to Conference DropBox link:
–  http://bitly.com/rnaseq2014

•  Note two folders – FastQC and FastQC-Examples
–  FastQC-Examples has FastQC reports from different species, sample types (next slide)

•  FastQC folder, download
–  Example.fastq
–  FastQC_Manual.pdf

•  Start FastQC, open Example.fastq

Practice: Watch FastQC video

•  https://www.youtube.com/watch?v=bz93ReOv87Y (start around 34 sec)

•  Take-home #1: FastQC assesses whether your data files are typical

•  Take-home #2: A "bad result" from FastQC doesn't always mean your data are not useful or valuable

•  Explore on your own! (~15 minutes)

Practice: View reports in FastQC-Examples (~15 min)

•  Blueberry
–  OnealRipe_1
–  OzarkblueGreen_1

•  Tomato pollen
–  T2_1
–  C2_1

•  Rice
–  Control2h-R2

[Figure: per read %GC]

Practice: Data processing

•  Double-click "Alignment.tar.gz" on your Desktop to unpack it

•  Also available from http://bitly.com/rnaseq2014

Practice: Look at "align.sh"

•  Open Alignment folder
•  Right-click "align.sh"
•  Select "open with text editor"
•  This is a shell script
–  Commands executed in sequence
–  Very useful for automating tasks

•  First line is "she-bang" line
–  tells the system which program (here, the shell) should run the script

•  All other lines starting with # are comments (not run)

[Figure: book cover, "Learning the bash shell". Great guide to writing shell scripts]

align.sh - simple pipeline for RNA-Seq data processing

•  Aligns a sample fastq file to genome – tophat2, bowtie2
–  fastq file is from Arabidopsis cold stress experiment (Short Read Archive SRX348640)
–  file ColdTreatment-little.fastq.gz (gzip-compressed, .gz)

•  Counts reads that align to TAIR10 genes – featureCounts
–  only counting reads that uniquely align

•  Merges alignments into transcript models – cufflinks
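The three steps above could be strung together roughly as follows. This sketch only builds and prints the command lines; it does not run them, and it is not the contents of align.sh. The index name, annotation file name, and maximum intron size are hypothetical placeholders, and the real script may use different options.

```python
import shlex


def build_pipeline(reads, bowtie_index, annotation_gtf, max_intron):
    """Return the align, count, and assemble commands as argument lists."""
    tophat_dir = "tophat_out"
    bam = tophat_dir + "/accepted_hits.bam"
    return [
        # Spliced alignment; -I sets the maximum intron length, -o the output folder.
        ["tophat2", "-I", str(max_intron), "-o", tophat_dir, bowtie_index, reads],
        # Count reads per gene; -a is the annotation, -o the counts file.
        ["featureCounts", "-a", annotation_gtf, "-o", "counts.txt", bam],
        # Merge alignments into transcript models.
        ["cufflinks", "-o", "cufflinks_cold", bam],
    ]


commands = build_pipeline(
    reads="ColdTreatment-little.fastq.gz",
    bowtie_index="TAIR10_index",      # hypothetical index base name
    annotation_gtf="TAIR10.gtf",      # hypothetical annotation file
    max_intron=2000,                  # hypothetical value, not from the slides
)
for command in commands:
    print(shlex.join(command))
```

Keeping each command as a list makes it easy to hand to `subprocess.run` later, once the tools are installed and the placeholder names are replaced.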

Practice: Intro to Terminal

•  Double-click Terminal shortcut on desktop
–  Program for entering commands or running scripts
–  Also called a "shell" or "Unix shell"
–  Can open multiple Terminal windows; each window runs its own shell

•  Terminal shows hierarchical view of file system
–  An upside-down tree, where every folder is inside another folder
–  Folders are also called "directories"
–  The top folder (that contains everything else) is called the "root" directory - / (forward slash)

Practice: Open Terminal, try these commands

•  cd - "change directory"
–  by itself means "go to user home directory"
–  with an argument means: go there
–  with ".." means go up one level

•  pwd - "print the current working directory" & find out where you are

Practice: Try these commands

•  ls - lists files and directories in the current directory

•  ls -l - "list long"
–  reports more information about files
–  "d" means it's a directory (folder)

Practice: Run align.sh in Terminal

•  Go to home directory
•  Go to Desktop
•  Go to Alignment
•  Run align.sh

Now Running: tophat2 spliced alignment tool

[Figure: Figure 1 from "TopHat: discovering splice junctions with RNA-Seq", Cole Trapnell, Lior Pachter and Steven L. Salzberg]

Tophat Output - we'll open in IGB

•  Creates new folder with files, including...

•  accepted_hits.bam - "binary alignments" file, contains read alignments
–  BAM - compressed binary version of SAM ("Sequence Alignment/Map"); needs index ".bai" file (made using samtools)

•  junctions.bed - reports boundaries of introns, called "junction" features
–  BED format, tab-delimited plain text file
–  one junction feature per line
–  fifth field is score, no. of spliced reads aligned across the junction
–  see: http://genome.ucsc.edu/FAQ/FAQformat.html#format1
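To make the BED layout concrete, here is a sketch that reads one junction line and pulls out the score and the intron boundaries. It assumes a 12-column BED line in which each junction is drawn as two blocks flanking the intron, which is our understanding of how TopHat writes junctions.bed; the example line is invented, not taken from the workshop output.

```python
from typing import NamedTuple


class Junction(NamedTuple):
    chrom: str
    intron_start: int   # 0-based, first base of the intron
    intron_end: int     # exclusive end of the intron
    name: str
    spliced_reads: int  # fifth field: the score
    strand: str


def parse_junction(line):
    """Parse one tab-delimited, 12-column BED junction line."""
    fields = line.rstrip("\n").split("\t")
    chrom, start, end = fields[0], int(fields[1]), int(fields[2])
    name, score, strand = fields[3], int(fields[4]), fields[5]
    block_sizes = [int(size) for size in fields[10].rstrip(",").split(",")]
    # The intron lies between the end of the first block and the start of the last.
    return Junction(chrom, start + block_sizes[0], end - block_sizes[-1],
                    name, score, strand)


# Invented example: 12 spliced reads support this junction.
example = "chr1\t1000\t2000\tJUNC00000001\t12\t+\t1000\t2000\t255,0,0\t2\t50,60\t0,940"
junction = parse_junction(example)
print(junction, junction.intron_end - junction.intron_start)
```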

Practice: Start IGB while script runs

•  Double-click IGB desktop icon
•  Click Arabidopsis flower on start screen

Practice: How to get IGB if you're using your own computer

•  Go to http://bioviz.org
•  Follow Download link
•  Choose Medium Memory option (typical)

TAIR10 annotations, June 2009 Columbia-0 genome release

•  TAIR10 protein-coding gene models loaded automatically from IGB data server

•  Forward & reverse strand in separate tracks

RNA-Seq, ChIP-Seq, other data sets available in Data Access tab

•  IGB data servers, can set up your own

Arabidopsis pollen data sets

•  Read alignments, coverage graphs, junction files
•  From 2013 Plant Phys. pollen RNA-Seq paper

Practice: Combine Plus & Minus Tracks

•  Click "+/-" to combine tracks

•  Use Data Management Table to change track color, name, visibility, load options, strand options

Summary of moving and zooming

•  Animated zooming
–  click to position zoom stripe, sets zoom focus
–  horizontal zoom & vertical stretch

•  Moving from side to side (panning)
–  arrows in toolbar
–  hand icon - the move tool

•  Jump-zooming
–  Click-drag coordinate axis with arrow tool
–  Double-click to zoom in on a feature
–  Search by name

Practice: Zoom in on a feature

•  Zoom in on alt-spliced gene models (marked * in the figure) on chr1
•  This is animated zooming

1. Click to set zoom focus
2. Drag slider to zoom in

Practice: Click move arrows to reposition during zoom

•  Click data display to re-focus zoom on target location

Practice: Or use move tool (hand) to reposition during zoom

•  Click display to focus zoom on target

1. Select move tool (hand)
2. Click-drag to move

Practice: Click-drag sequence axis to jump-zoom to a region

1. Select pointer tool
2. Click number line
3. Drag
4. Release

•  Highlighted region becomes new view

Practice: Jump-zoom to gene model

•  Double-click the label, the space a little above the exon blocks, or an intron to jump-zoom to a gene model
–  Also selects it; selected items are outlined in red

1. Select pointer tool
2. Double-click label or intron

After jump-zoom, gene model is selected

•  Arrows indicate direction of transcription

•  Selected gene model outlined in red

Practice: Gene model close-up

•  Use vertical slider to make gene models taller (drag slider to stretch vertically)
•  Increase window size to make more room

Practice: Interact with data using pointer

•  Select pointer (arrow) in toolbar

•  Click intron, label, or region above blocks to select whole gene model

•  Click blocks to select parts of a gene model
•  SHIFT-click to multi-select
•  Click-drag to select & count everything in a region
•  Selection Info, top right, reports counts
–  "i" button shows info if one item selected

Practice: View edge matching

•  Edges that match selected item edges are highlighted in red

•  To change edge-match color choose File > Preferences > Other Options

•  To turn off or on, see View > Edge Matching

Practice: To work with sequence data, click Load Sequence

•  Sequence appears in Coordinates track

Practice: Zoom in to see amino acids

•  Note: Must load genomic sequence first

Practice: Zoom in on end of translation

•  Click the "thick end" and then zoom in
•  Note: Variants encode same C-terminal amino acids

Practice: Select genomic sequence

1. Choose pointer tool in toolbar
2. Click-drag genomic sequence to select a region
3. CTRL-click to copy

•  Length of selected region reported in Selection Info box (top right)

•  Useful for designing primers, measuring regions

Practice: Right-click (or CTRL-click) gene model

•  Shows options to run a Web search, BLAST search, view sequence

Practice: Quick Search

•  Enter search text, select option (choose At-SR30)
•  Jump-zoom to selected gene

•  Zoomed to At-SR30, RNA-binding protein involved in splicing

Looking ahead to Workshop 3

•  Some genes that were highly expressed in tomato pollen are annotated as "Unknown" proteins & have no counterpart in Arabidopsis.

•  You can use IGB to quickly find those genes and then run BLASTX or BLASTP searches at NCBI to find out...
–  Are they unique to tomato?
–  Could they be non-coding?

Practice: Open files from align.sh

•  Zoom out to show more of At-SR30 region
•  Choose File > Open
–  Select "accepted_hits.bam" & "junctions.bed"

•  A new empty track appears for each file

•  Click Load Data to load reads and junctions

[Figure: read alignments stack; reads at top of stack not being shown (too many to fit)]

[Figure: junction features, summarizing spliced reads]

Practice: Configure view - Load Sequence

•  Click Load Sequence to load genomic bases for this region

Practice: Configure view - Lock mRNA track height

1. Click TAIR10 mRNA track label to select it
2. Open Annotation tab
3. Select Lock Track Height, enter 170, click Apply

Practice: Configure view - configure junction track

1. Click junctions track label to select junctions track
3. Select score in Label Field
4. Select +/- in Strand

Practice: Configure view - lock junction track height

1. Click junctions track label to select it
3. Select Lock Track Height, enter 120, click Apply

Practice: Change read stack height to see more reads

1.  CTRL-click (or right-click) accepted_hits.bam track label
2.  Choose Set Stack Height...
3.  Enter 50

Practice: Set mRNA stack height

1. Right-click TAIR10 mRNA track label, choose Set Stack Height
2. Enter 3 - tallest stack has 3 models

•  Note: Tabs are minimized to make more space

Practice: Note read support for alternative splicing

•  Take-home: Many spliced reads support both variants, but there are also many reads inside the introns, indicating failure to splice. This may be typical of alt-spliced introns?

Practice: Use junction track to quantify support for splice variants

1.  Click-drag the junction track next to the genes track
2.  Scores are the number of spliced reads supporting each junction
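One simple way to turn junction scores into a comparison between splice variants is to express each score as a share of the spliced reads across the competing junctions. This is only a sketch of that arithmetic; the junction names and counts are invented, and it ignores reads inside the intron.

```python
def junction_support(scores):
    """Return each junction's share of the spliced reads across all junctions."""
    total = sum(scores.values())
    if total == 0:
        return {name: 0.0 for name in scores}
    return {name: count / total for name, count in scores.items()}


# Invented scores for two alternative junctions of the same intron.
scores = {"junction_A": 30, "junction_B": 10}
print(junction_support(scores))
```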

Practice: Compare Cufflinks GTF file to gene models

•  Open Alignments > cufflinks_cold > transcripts.gtf

Practice: View Cufflinks gene models

1. Click Load Data to see Cufflinks models
2. Click-drag new track next to gene models
3. Use vertical slider to make more room

•  Take-home: Cufflinks annotations are close, but incomplete.

Practice: Load data from Galaxy

1. Go to usegalaxy.org
2. Open Shared Data
3. Choose Published Histories

1. Search for Cold
3. Select Cold stress in Arabidopsis (with default maximum intron size)

•  Illustrates results when tophat is run with default settings:
–  default maximum intron size is 500,000 bases

•  Tophat was developed with human data in mind, where large introns are common

•  Select Import History

Practice: Select start using this history

1. Select Treatment junctions
2. Select display in IGB View

•  New tab opens. Select Click to go to IGB

•  New track appears in IGB
1. Click Load Data

Practice: Remove reads - don't need them now

1.  Right-click accepted_hits.bam
2.  Choose Delete Track

1.  Zoom out all the way
2.  Click Load Data

[Figure: zoomed-out view, with callout "Your data are here"]

Take-home: Tophat run with default parameters predicts enormous introns. It is important to understand parameter settings; defaults are not always best.
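A quick way to spot this problem outside the browser is to list predicted introns that are longer than you would expect for your species. The sketch below assumes you already have intron start and end coordinates; the threshold and the coordinates are invented for illustration, and a sensible cutoff depends on the organism.

```python
def flag_long_introns(introns, max_plausible_length):
    """Return (start, end, length) for introns longer than the threshold."""
    flagged = []
    for start, end in introns:
        length = end - start
        if length > max_plausible_length:
            flagged.append((start, end, length))
    return flagged


# Invented intron coordinates (0-based start, exclusive end).
introns = [(1050, 1940), (5000, 5090), (20000, 420000)]
print(flag_long_introns(introns, max_plausible_length=5000))
```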

Now you can

•  Describe Illumina library synthesis, sequencing
•  Evaluate data quality using FastQC
•  Run a data processing pipeline (shell script)
•  View and explore data in a genome browser
–  and load data sets from Galaxy, local files

Thank you for your attention!
