Page 1: KisSplice - Identifying and Quantifying SNPs, indels and ...kissplice.prabi.fr/documentation/kissplice_prabiformation.pdf · read3 CTGACT KisSplice. The software Results Experimental

The software

Results

Experimental Validation

In Practice

Post-treatment in development

KisSplice

Identifying and Quantifying SNPs, indels and Alternative Splicing Events from RNA-seq data

29th May 2013

KisSplice

Next Generation Sequencing

A sequencing experiment now produces millions of short reads (∼ 100 nt) in a single run for a reasonable cost (∼ 10^3 euros)

For model species, the first step is usually to map the reads to the reference genome/transcriptome

For non-model species, the first step is usually to assemble the reads and reconstruct the genome/transcriptome

Downstream analysis includes the analysis of polymorphism (SNPs, rearrangements, splicing)

Our main idea is to extract polymorphism directly from the reads, without assembling the genome/transcriptome

The Model

Algorithm outline

De-Bruijn Graph

De Bruijn graphs (DBG) are used as a first step in many short-read assemblers.

Node = k-mer

Edge = overlap of k-1 bases

Example

CACTCAA, k = 3
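
The construction for this example can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names `kmers` and `de_bruijn` are made up here, reverse complements are ignored, and an edge is drawn between any two k-mers that overlap by k-1 bases.

```python
def kmers(seq, k):
    """Return the k-mers of seq in order of appearance."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn(reads, k):
    """Node-centric de Bruijn graph: one node per distinct k-mer, and an
    edge u -> v whenever the last k-1 bases of u equal the first k-1 of v."""
    nodes = sorted({km for read in reads for km in kmers(read, k)})
    return {u: [v for v in nodes if u[1:] == v[:-1]] for u in nodes}


graph = de_bruijn(["CACTCAA"], 3)
for node, successors in graph.items():
    print(node, "->", successors)
```

With this definition TCA points to both CAA and CAC, because all three k-mers share the overlap CA, even though TCA and CAC are not adjacent in the sequence.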

De-Bruijn Graph

More complicated example

Reference : CACTCAACTG (unknown)

read1 CACTCA

read2 CAACTG

De-Bruijn Graph

Even more complicated example

Reference : CACTCAACTGACT (unknown)

read1 CACTCA

read2 CAACTG

read3 CTGACT

Compressed De Bruijn Graph

Even more complicated example

Reference : CACTCAACTGACT (unknown)

read1 CACTCA

read2 CAACTG

read3 CTGACT
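
A compressed graph merges every non-branching chain of k-mers into a single node. The sketch below is an illustration only: it assumes k = 3 for this example (the slide does not state k), ignores reverse complements, and uses the made-up names `de_bruijn` and `compress`.

```python
def de_bruijn(reads, k):
    """Node-centric de Bruijn graph (edge = overlap of k-1 bases)."""
    nodes = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
    return {u: [v for v in nodes if u[1:] == v[:-1]] for u in nodes}


def compress(graph):
    """Merge chains u -> v where u has one successor and v one predecessor.
    Isolated cycles without any branching node are not reported."""
    preds = {n: [] for n in graph}
    for u, succs in graph.items():
        for v in succs:
            preds[v].append(u)

    merged = []
    for n in graph:
        # Skip n if it continues a chain started at an earlier node.
        if len(preds[n]) == 1 and len(graph[preds[n][0]]) == 1:
            continue
        seq, cur = n, n
        # Extend while the outgoing edge is the only way out and the only way in.
        while len(graph[cur]) == 1 and len(preds[graph[cur][0]]) == 1:
            cur = graph[cur][0]
            seq += cur[-1]
        merged.append(seq)
    return sorted(merged)


reads = ["CACTCA", "CAACTG", "CTGACT"]
compressed = compress(de_bruijn(reads, 3))
print(compressed)
```

The nine k-mers of the three reads collapse into five nodes; the branching k-mers ACT and TCA are the points where the chains stop.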

De-Bruijn Graph

An assembly is a walk in the de Bruijn graph which contains all reads as subwalks

This problem is known to be NP-complete

In practice, heuristics are used which consist in simplifying the graph to "make it linear"

However, the structures that are removed may correspond to relevant biological structures (SNPs, alternative splicing).

Specificities of RNA-seq data

Dynamic range of gene expression

Few genes are highly expressed

Many are poorly expressed

Alternative splicing

A gene may give rise to several transcripts

Example of DBG built from RNA-seq data

Polymorphism in RNA-seq data

Polymorphism in RNA-seq data

If the purpose is to identify polymorphism, then assemblers are not well suited

The variable parts are precisely the ones that will be removed

3 types of polymorphisms are expected in RNA-seq:

At the genomic level

SNP

Approximate tandem repeats

At the transcriptomic level

Alternative splicing

SNPs

SNPs correspond to recognizable patterns in the de Bruijn graph

Issue : how to discriminate SNPs from sequencing errors ?

Idea : require a minimum coverage for each path
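
The coverage idea can be illustrated as follows. This is a sketch, not KisSplice's actual filter: the threshold name `min_cov`, the helper names, and the choice of defining the coverage of a path as the count of its least covered k-mer are all assumptions made for the example.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence in the reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def path_coverage(path, counts, k):
    """Coverage of a path, taken here as the count of its least covered k-mer."""
    return min(counts[path[i:i + k]] for i in range(len(path) - k + 1))


def both_paths_supported(upper, lower, counts, k, min_cov):
    """Keep a bubble only if each of its two paths reaches min_cov."""
    return (path_coverage(upper, counts, k) >= min_cov
            and path_coverage(lower, counts, k) >= min_cov)


k = 3
reads = ["CACGTAG"] * 3 + ["CACTTAG"]   # toy data: one read carries a mismatch
counts = kmer_counts(reads, k)
upper, lower = "ACGTA", "ACTTA"         # the two paths of the bubble
cov_upper = path_coverage(upper, counts, k)
cov_lower = path_coverage(lower, counts, k)
kept = both_paths_supported(upper, lower, counts, k, min_cov=2)
print(cov_upper, cov_lower, kept)
```

Here the second path is seen in a single read, so with `min_cov=2` the bubble is treated as a likely sequencing error rather than a SNP.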

Approximate Tandem Repeats

Alternative splicing events

Exon skipping

Intron retention

Alternative 5’ or 3’ splice site

Alternative splicing events

Not covered by this pattern :

Alternative transcription start and end

Mutually exclusive exons

A general model for polymorphism in DBG

SNP: 2 paths of length 2k − 1

Repeats: 1 path of length at most 2k − 2, the two paths align

AS: 1 path of length at most 2k − 2
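
One way to read these lengths is to count the bases covered by the k-mers that distinguish the two paths. The sketch below is an interpretation, not taken from the slides: the sequences, `snp_path` and `junction_path` are invented for illustration, and only the generic case is shown (the "at most" in the model allows shorter paths).

```python
def snp_path(seq, pos, k):
    """Bases covered by the k k-mers of seq that contain position pos
    (requires k - 1 <= pos <= len(seq) - k)."""
    return seq[pos - k + 1:pos + k]


def junction_path(left, right, k):
    """Bases covered by the k-1 k-mers that span the junction left|right."""
    return left[-(k - 1):] + right[:k - 1]


k = 4
allele = "GATTACAGGCT"                 # toy sequence, variable base at index 5
snp = snp_path(allele, 5, k)
exon_a, exon_b = "GGATCCA", "TTGCAAG"  # toy exons flanking a skipped exon
short = junction_path(exon_a, exon_b, k)
print(snp, len(snp))      # 2k - 1 bases
print(short, len(short))  # 2k - 2 bases
```

A single variable base is contained in k consecutive k-mers, which together cover 2k − 1 bases; a junction between two exons is spanned by k − 1 k-mers, which cover 2k − 2 bases.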

Algorithm outline

KisSplice

1. De Bruijn graph construction;

2. Biconnected components (BCC) decomposition;

3. Four-node compression (SNPs and sequencing errors);

4. Enumeration of all bubbles whose shorter path has length at most 2k − 2;

5. Quantification and classification.
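
To make the notion of a bubble concrete, the sketch below enumerates bubbles by brute force on a toy graph. It is not the enumeration algorithm used by KisSplice: the BCC decomposition and compression steps are skipped, the search is exponential in general, and all names (`de_bruijn`, `simple_paths`, `spell`, `bubbles`, `max_nodes`) are invented for the example.

```python
def de_bruijn(reads, k):
    """Node-centric de Bruijn graph (edge = overlap of k-1 bases)."""
    nodes = sorted({r[i:i + k] for r in reads for i in range(len(r) - k + 1)})
    return {u: [v for v in nodes if u[1:] == v[:-1]] for u in nodes}


def simple_paths(graph, start, max_nodes):
    """Yield simple paths of 2 to max_nodes nodes that begin at start."""
    stack = [[start]]
    while stack:
        path = stack.pop()
        if len(path) > 1:
            yield path
        if len(path) < max_nodes:
            for nxt in graph[path[-1]]:
                if nxt not in path:
                    stack.append(path + [nxt])


def spell(path):
    """Sequence spelled by a path of overlapping k-mers."""
    return path[0] + "".join(node[-1] for node in path[1:])


def bubbles(graph, max_nodes):
    """Pairs of internally vertex-disjoint paths sharing both end nodes."""
    found = set()
    for source in graph:
        if len(graph[source]) < 2:
            continue  # a bubble can only open at a branching node
        paths = list(simple_paths(graph, source, max_nodes))
        for i, p in enumerate(paths):
            for q in paths[i + 1:]:
                if p[-1] == q[-1] and not set(p[1:-1]) & set(q[1:-1]):
                    found.add(tuple(sorted((spell(p), spell(q)))))
    return sorted(found)


k = 3
graph = de_bruijn(["CACGTAG", "CACTTAG"], k)
result = bubbles(graph, max_nodes=k + 2)
print(result)
```

The two reads differ by one base, and the only bubble found opens at CAC and closes at TAG; between these two nodes each path spells 2k − 1 = 5 bases, which is the SNP pattern of the model.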

Simulated Data

Real Data

Comparison with Trinity

Unresolved BCCs

Assembly Vs Mapping

Results on simulated data

We simulated the sequencing of one Drosophila gene with two alternative transcripts (using FluxSimulator)

For different values of the coverage, we test if our method recovers the AS event

KisSplice recovers the AS event when the coverage is above 8X

Trinity recovers the AS event when the coverage is above 18X

Note: Trinity's purpose is more general as it reconstructs full-length transcripts, but for this task, it is less sensitive

Impact of k

At 8X, kmin=17, kmax=29

Results on real data

In order to assess whether our predicted AS events are true positives, we need to test our method in a case where a reference genome is available.

Data:

Human Body Map 2.0 data (ERP000546)

2 tissues (out of 16): brain and liver

75 bp reads, 32M and 39M

Distribution of BCCs

Confirming AS events

We align the two paths of the bubble to the reference genome using BLAT

If the two paths align with the same initial and final coordinates, then it is a true positive

Otherwise, it is a false positive

Next, we check if the alignment coordinates correspond to an annotated AS event

If the coordinates match, then it is a known event

Otherwise it is a novel AS event
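
The decision rule above can be sketched as a small function. Everything about the data representation is hypothetical: the `Alignment` type, the block coordinates, and the way annotated events are stored as `(chrom, start, end)` intervals are assumptions made for the example, and parsing BLAT output is not shown.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    chrom: str
    blocks: tuple  # sorted, non-overlapping (start, end) genomic intervals

    @property
    def start(self):
        return self.blocks[0][0]

    @property
    def end(self):
        return self.blocks[-1][1]


def classify(longer, shorter, annotated):
    """Apply the rule: same end coordinates first, then annotation lookup."""
    same_ends = (longer.chrom == shorter.chrom
                 and longer.start == shorter.start
                 and longer.end == shorter.end)
    if not same_ends:
        return "false positive"
    # Blocks present only in the longer path are taken as the variable part.
    variable = set(longer.blocks) - set(shorter.blocks)
    known = all((longer.chrom, s, e) in annotated for s, e in variable)
    return "known event" if known else "novel event"


longer = Alignment("chr1", ((100, 150), (200, 250), (300, 350)))
shorter = Alignment("chr1", ((100, 150), (300, 350)))
elsewhere = Alignment("chr2", ((100, 150), (300, 350)))
annotated = {("chr1", 200, 250)}   # toy annotation: one cassette exon

print(classify(longer, shorter, annotated))    # known event
print(classify(longer, shorter, set()))        # novel event
print(classify(longer, elsewhere, annotated))  # false positive
```

The toy case is an exon skipping: both paths start and end at the same coordinates, and the middle block of the longer path is the skipped exon.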

Confirming AS events

Annotated exon skipping

Annotated intron retention

Novel alternative 5’ splice site

Novel complex event

Novel AS events are less expressed

Novel AS events are shorter

1. Short AS events tend to be under-annotated (e.g. NAGNAG)

2. We also detect genomic indels that are within genes, which we mistake for AS events

Comparison with Trinity on real data

Memory usage is better (5 Gb / 100M reads)

KisSplice is faster, which is expected because it solves a simpler task

KisSplice finds 4099 events, while Trinity finds 1123, of which 570 are common

50% of the events found only by Trinity are false positives

The rest are hidden in very large BCCs, and we can recover part of them using larger values of k

Unresolved BCC

Unresolved BCC

This is not an elephant, this is a gene family :)

Assembly Vs Mapping

For model species, the pipeline is usually TopHat + Cufflinks

Even in this case, KisSplice (or other assembly-based approaches) may be useful.

Example of event missed by Cufflinks, but which is annotated

Experimental Validation (ongoing)

With Didier Auboeuf (CRCL)

Validation by RT-PCR

Almost all novel events are validated

Novel events found both by KisSplice and Cufflinks are almost all validated

Novel events found by KisSplice alone are validated only if:

The minor isoform has a relative abundance of at least 15%

The splicing event is simple, not complex

KisSplice in practice

Input: FASTA/FASTQ files

Output: 5 files (SNPs, AS events, repeats, indels < 3 nt, others)

Format :

After the counts

Testing if a variant is specific to a condition :

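M_Reduced: $Y_{v,c} = \mu + \beta^{\mathrm{variant}}_v + \beta^{\mathrm{cond}}_c$

M_Full: $Y_{v,c} = \mu + \beta^{\mathrm{variant}}_v + \beta^{\mathrm{cond}}_c + \beta^{\mathrm{variant} \ast \mathrm{cond}}_{v,c}$

$\mu$: local mean expression of the gene that contains the variant

$\beta^{\mathrm{variant}}_v$: contribution of variant $v$

$\beta^{\mathrm{cond}}_c$: contribution of condition $c$

Counts are modelled using a negative binomial distribution

We compute the likelihood of both models and compare them with a $\chi^2$ test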
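
A plausible reading of this last step, assuming it is the usual likelihood-ratio test for nested models (the slide only states that the two likelihoods are compared with a χ2), is sketched below. The symbols $\ell$ (maximised log-likelihood), $V$ (number of variants) and $C$ (number of conditions) are introduced here and do not appear in the slides; the degrees of freedom assume the standard identifiability constraints on the coefficients.

```latex
D = 2\,\bigl(\ell_{\mathrm{Full}} - \ell_{\mathrm{Reduced}}\bigr),
\qquad
D \;\overset{H_0}{\sim}\; \chi^2_{df},
\qquad
df = (V - 1)(C - 1)
```

Under this reading, a bubble with two paths ($V = 2$) compared across two conditions ($C = 2$) gives $df = 1$, and a large $D$ indicates that the interaction term is needed, that is, that the variant is condition-specific.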

KisSplice2IGV

Combining KisSplice output with the context given by a full-length transcriptome assembler (Trinity, Oases, etc.)

Visualisation in a genome browser (IGV)

The colour of an alignment depends on the log10(RPKM) (Reads Per Kilobase per Million mapped reads)
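
For reference, the standard RPKM definition can be computed as below. The function name and the numbers are illustrative only, and how KisSplice2IGV maps the resulting value to a colour is not shown here.

```python
import math


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Toy values: 500 reads on a 2 kb sequence, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)
log_value = math.log10(value)
print(value, round(log_value, 3))
```

The factor 1e9 combines the two normalisations: 1e3 for the length in kilobases and 1e6 for the library size in millions of reads.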

KisSplice2IGV

Conclusion

KisSplice detects various polymorphisms (SNPs, AS, repeats) in RNA-seq data

It provides quantification for such events.

KisSplice is more sensitive than Trinity for finding AS events

KisSplice is relevant for studies on non-model species

It brings information even when there is a model species and can be used in addition to classical pipelines

KisSplice People and download

Download

KisSplice : http://kissplice.prabi.fr

DBG construction: http://minia.genouest.org/

KisSplice People

Rennes: Rayan Chikhi, Pavlos Antoniou, Guillaume Rizk, Raluca Uricaru, Pierre Peterlongo

Lyon: Gustavo Sacomoto, Alice Julien-Laferriere, David Parsons, Janice Kielbassa, Lilia Brinza, Marie-France Sagot, Vincent Miele, Vincent Lacroix

Thank you!

Questions?

Further analysis on short events
