RNA-seq short intro - Göteborgs universitet · bio.lundberg.gu.se/courses/vt13/rnaseq_intro.pdf ·...
RNA-Seq practical
Basic processing: UNIX tools and IGV
Erik Larsson
RNA-Seq practical
• TopHat
– Alignment
• IGV
– Visualization
• Cufflinks
– Gene discovery
– Find differentially expressed genes
[Figure: stretch of human genomic DNA sequence, annotated: <3% coding sequence; ~40% coding genes; ~60% transcribed]
The human transcriptome (according to GENCODE v11)
• 19,999 protein-coding
• 12,534 pseudogene
• 10,419 lncRNA
• 1,944 snRNA
• 1,756 microRNA
• 1,521 snoRNA
• 1,190 misc. RNA
Shahrouki, Larsson, Frontiers in Genetics 2012
RNA-seq, RNA sequencing, transcriptome sequencing, total RNA-seq, mRNA-seq, miRNA-seq…
• Many names, sometimes meaning the same thing
• All about characterizing RNA with next-generation sequencing (NGS) in one way or another
Microarrays vs. RNA-seq
• Microarrays
– Simultaneously quantify most known genes
• RNA-seq
– Simultaneously quantify all known genes at high accuracy
– Identify new genes
– Study splicing patterns
– Discover mutations
– Fusion transcripts
– Find viruses
– Allele-specific expression
– …
New toys
• Applied Biosystems 3730 (2002): 50,000-100,000 bp per run
• Illumina HiSeq 2000 (2010): ~200,000,000,000 bp per run
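To put the two per-run figures side by side, here is a small back-of-the-envelope calculation. The only inputs are the numbers quoted above; the variable names are illustrative.

```python
# Per-run yields quoted above, in base pairs.
sanger_bp_low, sanger_bp_high = 50_000, 100_000   # Applied Biosystems 3730
hiseq_bp = 200_000_000_000                        # Illumina HiSeq 2000 (approximate)

# Fold increase in bases per run, against the high and low end of the 3730 range.
fold_min = hiseq_bp // sanger_bp_high
fold_max = hiseq_bp // sanger_bp_low

print(f"Roughly {fold_min:,}x to {fold_max:,}x more bases per run")
```

This is a comparison of raw yield only; read length and error profile differ between the two platforms and are not captured here.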
NGS principle (Illumina/Solexa)
• Add labeled nucleotides, primers, polymerase
• Take picture to figure out first base in each cluster
• Remove terminators and repeat everything many times
Source: Illumina
Standard RNA-seq workflow (polyA+)
• Isolate polyA+ RNA
• Fragmentation
• Add random primers
• cDNA synthesis (first and second strand)
• Ligate adapters
• Sequencing
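The three-step cycle listed under "NGS principle" (add labeled nucleotides, take a picture, remove terminators, repeat) can be mimicked with a toy loop. This is only a sketch of the idea: the function name and the example templates are made up, and real base calling works on images of millions of clusters with error-prone chemistry.

```python
# Toy model of sequencing by synthesis; not how real base calling works.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def sequence_by_synthesis(templates, cycles):
    """Return one read per cluster after the given number of cycles.

    Each template string is assumed to be written in the order the
    polymerase traverses it, so cycle k reports the complement of position k.
    """
    reads = ["" for _ in templates]
    for cycle in range(cycles):
        for i, template in enumerate(templates):
            if cycle >= len(template):
                continue  # this cluster's template is exhausted
            # 1. Add labeled nucleotides: one terminated base is incorporated,
            #    the complement of the current template base.
            base = COMPLEMENT[template[cycle]]
            # 2. Take a picture: the label identifies the incorporated base.
            reads[i] += base
            # 3. Remove the terminator; the next cycle extends by one more base.
    return reads

clusters = ["ACGTTG", "TTGACA"]  # hypothetical template fragments
print(sequence_by_synthesis(clusters, cycles=4))
```

Every cluster gains exactly one base per cycle, which is why the number of cycles sets the read length.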
Directional/strand-specific RNA-seq: dUTP method
Levin et al., Nature Methods 2010
[Figure: RNA → dsDNA (one strand containing U) → adapters → UNG treatment]
RNA-seq data analysis
• Alignment
• Gene discovery
• Expression quantification
• Testing for differential expression
• Variant discovery
Pairwise alignment
• Figure out where one sequence belongs within another sequence
• Trivial if not for substitutions, insertions, deletions
Genome: TGCGTACGCTCGATAGCTCGCATCGCTAGCCTCGCATAGCTAGCGATCGT
                         ||||||||||||||||||| |||||||
RNA:                     TCGCATCGCTAGCCTCGCAGAGCTAGC
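As a minimal sketch of the simplest case, the code below slides the RNA along the genome and keeps the placement with the fewest mismatches. It handles substitutions only; insertions and deletions need a dynamic-programming method, and real aligners use indexes rather than this brute-force scan. The function name is illustrative.

```python
def best_ungapped_alignment(genome, read):
    """Return (offset, mismatches) for the best ungapped placement of read."""
    best_offset, best_mismatches = None, None
    for offset in range(len(genome) - len(read) + 1):
        window = genome[offset:offset + len(read)]
        # Count positions where the genome and the read disagree.
        mismatches = sum(1 for g, r in zip(window, read) if g != r)
        if best_mismatches is None or mismatches < best_mismatches:
            best_offset, best_mismatches = offset, mismatches
    return best_offset, best_mismatches

# The genome and RNA strings from the example above.
genome = "TGCGTACGCTCGATAGCTCGCATCGCTAGCCTCGCATAGCTAGCGATCGT"
rna = "TCGCATCGCTAGCCTCGCAGAGCTAGC"
offset, mismatches = best_ungapped_alignment(genome, rna)
print(offset, mismatches)
```

For the example above this places the RNA at 0-based offset 17 with a single mismatch, which corresponds to the one gap in the row of bars.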
Aligning RNA-seq reads
• Why? Figure out where they were transcribed from
• Required prior to most analyses
Two main options:
• Align to transcriptome
– Fast, simple
– Avoids problems with “spliced”/junction-spanning reads
• Align to genome
– Requires a specialized RNA-seq aligner (one that can handle junction-spanning reads)
Gapped alignments
• Aligners for RNA-seq will need to handle gapped alignments
• Junction-spanning reads will otherwise be lost
[Figure: genome, spliced mRNA with poly(A) tail, and NGS reads]
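To see why a junction-spanning read is lost without gapped alignment, consider a toy genome with two exons and an intron. The sketch below tries every split point of the read and looks for the two halves separated by a gap. All sequences and names here are invented for illustration, and exact matching of both halves is a simplification that real aligners do not rely on.

```python
def find_spliced_alignment(genome, read, min_anchor=4):
    """Return (left_start, left_end, right_start) for a two-part match, or None.

    The left part must match exactly at genome[left_start:left_end]; the right
    part must match exactly further downstream, leaving a gap of at least one
    base (the putative intron).
    """
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + len(left)
            right_start = genome.find(right, left_end + 1)
            if right_start != -1:
                return left_start, left_end, right_start
            left_start = genome.find(left, left_start + 1)
    return None

# Hypothetical toy locus: exon 1, intron, exon 2.
exon1, intron, exon2 = "ACGTTGCA", "GTAAGTCCCCTTTTAG", "TGGACCTA"
toy_genome = "TT" + exon1 + intron + exon2 + "GG"
junction_read = exon1[-5:] + exon2[:5]    # read spanning the exon-exon junction

print(junction_read in toy_genome)        # contiguous match fails
print(find_spliced_alignment(toy_genome, junction_read))
```

The read does not occur contiguously in the genome, but splitting it recovers both exon anchors and the gap between them.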
Splice-junction aware aligners
• TopHat
– Popular option, big online user community
– Finds new junctions but can be guided by known annotation
– Cuts up reads into smaller pieces and calls the Bowtie short-read aligner
• SOAPsplice
• SpliceMap
• …
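A TopHat run for single-end reads could be assembled as in the sketch below. The index prefix, file names and the helper function are hypothetical; the options used (`-o` for the output directory, `-p` for threads, `-G` for a known-annotation GTF) should be checked against `tophat --help` for the installed version.

```python
import shlex

def tophat_command(index_prefix, reads, output_dir="tophat_out",
                   annotation_gtf=None, threads=1):
    """Build the argument list for a single-end TopHat run (nothing is executed)."""
    cmd = ["tophat", "-o", output_dir, "-p", str(threads)]
    if annotation_gtf is not None:
        # Guide junction discovery with known annotation.
        cmd += ["-G", annotation_gtf]
    # Positional arguments: Bowtie index prefix, then comma-separated read files.
    cmd += [index_prefix, ",".join(reads)]
    return cmd

# Hypothetical index prefix and file names.
cmd = tophat_command("hg19", ["sample1.fastq"], annotation_gtf="genes.gtf")
print(shlex.join(cmd))
```

Building the command as a list keeps it easy to pass to `subprocess.run` later without shell quoting problems.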
TopHat output visualized using IGV (human ACTB locus)
Transcriptome assembly/gene discovery
• Task:
– Use aligned reads to discover genes and figure out transcript structures
• Tools:
– Cufflinks
  • Most popular choice
  • Lots of online support, actively developed
– Scripture
– Trans-ABySS
Cufflinks discovers new transcripts/genes from aligned reads
[Figure: aligned reads → discovered transcript isoforms → abundance estimates]
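The slide does not name the unit of the abundance estimates. Cufflinks reports FPKM (fragments per kilobase of transcript per million mapped fragments), and the normalization itself is simple arithmetic. The sketch below uses made-up counts and ignores the statistical model Cufflinks uses to assign fragments to overlapping isoforms.

```python
def fpkm(fragments, transcript_length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bp per kilobase) * 1e6 (fragments per million).
    return fragments * 1e9 / (transcript_length_bp * total_mapped_fragments)

# Hypothetical numbers: 500 fragments on a 2,000 bp transcript,
# in a library with 20 million mapped fragments.
print(fpkm(500, 2_000, 20_000_000))
```

Dividing by transcript length and library size is what makes values comparable between long and short transcripts and between samples sequenced to different depths.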
Testing for differential expression
• Normal t-test not optimal
– RNA-seq is “digital” (read counts) rather than continuous
• Negative binomial distribution is better
– edgeR, DESeq
  • Run in the R environment
– Cuffdiff (Cufflinks package)
  • + Easy: uses alignments without prior quantification
  • + Can test for differential splicing
  • - Very conservative
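One way to see why the negative binomial fits count data better than a simpler Poisson model: under Poisson the variance equals the mean, while the negative binomial allows variance = mean + dispersion × mean². The sketch below estimates the dispersion from replicate counts by the method of moments. The counts are hypothetical, and this is not the procedure edgeR or DESeq use; those tools share information across genes to stabilize the estimate.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion: solve var = mu + alpha * mu**2 for alpha."""
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance across replicates
    # Clamp at zero: variance at or below the mean is consistent with Poisson.
    return max(0.0, (var - mu) / mu ** 2)

replicates = [80, 100, 120, 140, 60]  # hypothetical read counts for one gene
print(mean(replicates), variance(replicates), moment_dispersion(replicates))
```

Here the mean is 100 but the sample variance is 1000, ten times what a Poisson model would predict, so a test that assumes Poisson noise would overstate significance.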
http://bio.lundberg.gu.se/courses/vt13/rnaseq.html
Read the intro carefully
Good luck!