kissplice - identifying and quantifying snps, indels and...

Post on 15-Aug-2020

The software

Results

Experimental Validation

In Practice

Post-treatment in development

KisSplice: Identifying and Quantifying SNPs, Indels and Alternative Splicing Events from RNA-seq Data

29 May 2013

Next Generation Sequencing

A sequencing experiment now produces millions of short reads (∼ 100 nt) in a single run for a reasonable cost (∼ 10^3 euros)

For model species, the first step is usually to map the reads to the reference genome/transcriptome

For non-model species, the first step is usually to assemble the reads and reconstruct the genome/transcriptome

Downstream analysis includes the analysis of polymorphism (SNPs, rearrangements, splicing)

Our main idea is to extract polymorphism directly from the reads, without assembling the genome/transcriptome

The Model

Algorithm outline

De-Bruijn Graph

De Bruijn graphs (DBG) are used as a first step in many short-read assemblers.

Node = k-mer

Edge = overlap of k-1 bases

Example

CACTCAA, k = 3
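The example above can be reproduced with a minimal Python sketch. It assumes the node-centric definition given on the slide (an edge wherever the last k-1 bases of one k-mer equal the first k-1 bases of another) and ignores reverse complements, which a real tool has to handle. `build_dbg` is a hypothetical helper name, not part of KisSplice.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Return {k-mer: set of successor k-mers} (hypothetical helper)."""
    # Nodes: every distinct k-mer occurring in at least one read.
    kmers = {read[i:i + k] for read in reads for i in range(len(read) - k + 1)}
    # Index the k-mers by their (k-1)-base prefix to find overlaps quickly.
    by_prefix = defaultdict(set)
    for kmer in kmers:
        by_prefix[kmer[:k - 1]].add(kmer)
    # Edge u -> v when the (k-1)-suffix of u equals the (k-1)-prefix of v.
    return {kmer: set(by_prefix.get(kmer[1:], ())) for kmer in kmers}


graph = build_dbg(["CACTCAA"], k=3)
for kmer in sorted(graph):
    print(kmer, "->", sorted(graph[kmer]))
```

Under this definition TCA has two successors (CAA and CAC), because CA is also the prefix of the first k-mer of the sequence.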

More complicated example

Reference : CACTCAACTG (unknown)

read1 CACTCA

read2 CAACTG

Even more complicated example

Reference : CACTCAACTGACT (unknown)

read1 CACTCA

read2 CAACTG

read3 CTGACT

Compressed De Bruijn Graph

Even more complicated example

Reference : CACTCAACTGACT (unknown)

read1 CACTCA

read2 CAACTG

read3 CTGACT
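Compression merges every maximal non-branching path of the graph into a single node. The sketch below illustrates the idea on the three reads above; the value k = 3 is an assumption (the slide does not state k for this example), reverse complements are ignored, and `build_dbg` and `compact` are hypothetical helper names.

```python
from collections import defaultdict


def build_dbg(reads, k):
    """Node-centric de Bruijn graph: {k-mer: set of successor k-mers}."""
    kmers = {r[i:i + k] for r in reads for i in range(len(r) - k + 1)}
    by_prefix = defaultdict(set)
    for kmer in kmers:
        by_prefix[kmer[:k - 1]].add(kmer)
    return {kmer: set(by_prefix.get(kmer[1:], ())) for kmer in kmers}


def compact(graph):
    """Return the sequences of the maximal non-branching paths."""
    preds = {node: set() for node in graph}
    for node, succs in graph.items():
        for succ in succs:
            preds[succ].add(node)

    def mergeable(u, v):
        # u -> v is the only edge leaving u and the only edge entering v.
        return u != v and graph[u] == {v} and preds[v] == {u}

    def is_start(node):
        return not any(mergeable(p, node) for p in preds[node])

    unitigs, seen = [], set()
    # First pass: paths beginning at a node that cannot be merged backwards.
    # Second pass: whatever is left lies on isolated cycles.
    for candidates in ([n for n in sorted(graph) if is_start(n)], sorted(graph)):
        for start in candidates:
            if start in seen:
                continue
            seen.add(start)
            seq, node = start, start
            while True:
                nxt = next(iter(graph[node]), None)
                if nxt is None or nxt in seen or not mergeable(node, nxt):
                    break
                seq += nxt[-1]  # extend the sequence by one base
                seen.add(nxt)
                node = nxt
            unitigs.append(seq)
    return sorted(unitigs)


reads = ["CACTCA", "CAACTG", "CTGACT"]
unitigs = compact(build_dbg(reads, k=3))
print(unitigs)
```

With these assumptions the 9 k-mers collapse into 5 compressed nodes; the branching k-mer ACT stays on its own.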

De-Bruijn Graph

An assembly is a walk in the de Bruijn graph that contains all reads as subwalks

This problem is known to be NP-complete

In practice, heuristics are used that simplify the graph to "make it linear"

However, the structures that are removed may correspond to relevant biological structures (SNPs, alternative splicing).

Specificities of RNA-seq data

Dynamic range of gene expression

Few genes are highly expressed

Many are poorly expressed

Alternative splicing

A gene may give rise to several transcripts

Example of DBG built from RNA-seq data

Polymorphism in RNA-seq data

If the purpose is to identify polymorphism, then assemblers are not well suited

The variable parts are precisely the ones that will be removed

Three types of polymorphism are expected in RNA-seq data:

At the genomic level: SNPs and approximate tandem repeats

At the transcriptomic level: alternative splicing

SNPs

SNPs correspond to recognizable patterns in the de Bruijn graph

Issue: how to discriminate SNPs from sequencing errors?

Idea: require a minimum coverage for each path
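The coverage idea can be sketched as follows. This is only an illustration, not the KisSplice implementation: the support of a path is taken here to be the count of its least frequent k-mer, and the threshold `min_cov` as well as all function names are assumptions.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer occurrence in the reads."""
    return Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))


def path_support(path_seq, counts, k):
    """Support of a path = count of its least covered k-mer (assumption)."""
    return min(counts[path_seq[i:i + k]] for i in range(len(path_seq) - k + 1))


def is_supported_bubble(path_a, path_b, counts, k, min_cov):
    """Keep a bubble only if both paths reach the minimum coverage."""
    return (path_support(path_a, counts, k) >= min_cov
            and path_support(path_b, counts, k) >= min_cov)


# Three reads carry the C allele; a single read carries a T (possible error).
reads = ["GGACGTT"] * 3 + ["GGATGTT"]
counts = count_kmers(reads, k=3)
print(path_support("GACGT", counts, 3), path_support("GATGT", counts, 3))
print(is_supported_bubble("GACGT", "GATGT", counts, k=3, min_cov=2))
```

With `min_cov=2` the bubble is rejected, because the T path is seen only once; with `min_cov=1` it would be reported as a candidate SNP.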

Approximate Tandem Repeats

Alternative splicing events

Exon skipping

Intron retention

Alternative 5’ or 3’ splice site

Not covered by this pattern:

Alternative transcription start and end

Mutually exclusive exons

A general model for polymorphism in DBG

SNP: 2 paths of length 2k − 1

Repeats: 1 path of length at most 2k − 2, the two paths align

AS event: 1 path of length at most 2k − 2

Algorithm outline

KisSplice

1. De Bruijn graph construction

2. Biconnected components (BCC) decomposition

3. Four-node compression (SNPs and sequencing errors)

4. Enumeration of all bubbles whose shorter path has length at most 2k − 2

5. Quantification and classification
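The classification step can be illustrated with the path-length rules of the general model. The sketch below is a simplification under stated assumptions: path lengths are measured in nucleotides, the decision "the two paths align" is supplied by the caller as a boolean rather than computed, and `classify_bubble` is a hypothetical name, not a KisSplice function.

```python
def classify_bubble(path_a, path_b, k, paths_align=False):
    """Classify a bubble from the lengths of its two paths (sketch).

    path_a, path_b: nucleotide sequences of the two paths.
    paths_align: assumed to come from a separate alignment step.
    """
    shorter, longer = sorted((len(path_a), len(path_b)))
    if shorter == longer == 2 * k - 1:
        return "SNP"
    if shorter <= 2 * k - 2:
        return "repeat" if paths_align else "AS event"
    return "other"


print(classify_bubble("GACGT", "GATGT", k=3))
print(classify_bubble("GACG", "GATTTTCG", k=3))
print(classify_bubble("GACG", "GACGACG", k=3, paths_align=True))
```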

Simulated Data

Real Data

Comparison with Trinity

Unresolved BCCs

Assembly Vs Mapping

Results on simulated data

We simulated the sequencing of one Drosophila gene with two alternative transcripts (using FluxSimulator)

For different values of the coverage, we test whether our method recovers the AS event

KisSplice recovers the AS event when the coverage is above 8X

Trinity recovers the AS event when the coverage is above 18X

Note: Trinity's purpose is more general, as it reconstructs full-length transcripts, but for this task it is less sensitive

Impact of k

At 8X, kmin=17, kmax=29

Results on real data

To assess whether our predicted AS events are true positives, we need to test our method in a case where a reference genome is available.

Data:

Human Body Map 2.0 data (ERP00546)

2 tissues (out of 16): brain and liver

75 bp reads, 32M and 39M

Distribution of BCCs

Confirming AS events

We align the two paths of the bubble to the reference genome using BLAT

If the two paths align with the same initial and final coordinates, then it is a true positive

Otherwise, it is a false positive

Next, we check whether the alignment coordinates correspond to an annotated AS event

If the coordinates match, then it is a known event

Otherwise, it is a novel AS event
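The decision procedure above can be written as a small function. This is a deliberately simplified sketch: each alignment is reduced to a hypothetical `(chromosome, start, end)` tuple, and the annotation is a plain set of such tuples, whereas a real comparison with an annotation would also look at the internal exon boundaries reported by the aligner.

```python
def assess_bubble(upper_aln, lower_aln, annotated_events):
    """Label a bubble from the alignments of its two paths (sketch).

    upper_aln, lower_aln: (chrom, start, end), or None if unaligned.
    annotated_events: set of (chrom, start, end) for annotated AS events.
    """
    if upper_aln is None or lower_aln is None or upper_aln != lower_aln:
        return "false positive"
    if upper_aln in annotated_events:
        return "known event"
    return "novel event"


annotation = {("chr1", 1000, 2500)}
print(assess_bubble(("chr1", 1000, 2500), ("chr1", 1000, 2500), annotation))
print(assess_bubble(("chr2", 400, 900), ("chr2", 400, 900), annotation))
print(assess_bubble(("chr1", 1000, 2500), ("chr3", 10, 300), annotation))
```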

Annotated exon skipping

Annotated intron retention

Novel alternative 5’ splice site

Novel complex event

Novel AS events are less expressed

Novel AS events are shorter

1. Short AS events tend to be under-annotated (e.g. NAGNAG)

2. We also detect genomic indels within genes, which we mistake for AS events

Comparison with Trinity on real data

Memory usage is better (5 GB for 100M reads)

KisSplice is faster, which is expected because it solves a simpler task

KisSplice finds 4099 events, while Trinity finds 1123, of which 570 are common

50% of the events found only by Trinity are false positives

The rest are hidden in very large BCCs, and we can recover part of them using larger values of k

Unresolved BCC

This is not an elephant, this is a gene family :)

Assembly Vs Mapping

For model species, the pipeline is usually TopHat + Cufflinks

Even in this case, KisSplice (or other assembly-based approaches) may be useful.

Example of an event missed by Cufflinks, but which is annotated

Experimental Validation (ongoing)

With Didier Auboeuf (CRCL)

Validation by RT-PCR

Almost all novel events are validated

Novel events found by both KisSplice and Cufflinks are almost all validated

Novel events found by KisSplice alone are validated only if:

The minor isoform has a relative abundance of at least 15%

The splicing event is simple, not complex

KisSplice in practice

Input: FASTA/FASTQ files

Output: 5 files (SNPs, AS events, repeats, indels < 3 nt, others)

Format:

After the counts

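Testing whether a variant is specific to a condition:

$M_{\text{Reduced}}$: $Y_{v,c} = \mu + \beta^{\text{variant}}_{v} + \beta^{\text{cond}}_{c}$

$M_{\text{Full}}$: $Y_{v,c} = \mu + \beta^{\text{variant}}_{v} + \beta^{\text{cond}}_{c} + \beta^{\text{variant} \ast \text{cond}}_{v,c}$

$\mu$: local mean expression of the gene that contains the variant

$\beta^{\text{variant}}_{v}$: contribution of variant $v$

$\beta^{\text{cond}}_{c}$: contribution of condition $c$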

Counts are modelled using a negative binomial

We compute the likelihood of both models and compare them with a χ² test (likelihood-ratio test)
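For reference, the comparison of two nested models by a χ² test is usually written as a likelihood-ratio statistic. The notation below ($\ell$ for the maximised log-likelihood, $D$ for the statistic, $d$ for the degrees of freedom) is introduced here for illustration and does not appear on the slides.

```latex
\[
D \;=\; 2\,\bigl(\ell(M_{\text{Full}}) - \ell(M_{\text{Reduced}})\bigr)
\;\sim\; \chi^2_{d},
\qquad
d \;=\; \#\text{params}(M_{\text{Full}}) - \#\text{params}(M_{\text{Reduced}})
\]
```

Here $d$ counts the additional free interaction terms $\beta^{\text{variant} \ast \text{cond}}_{v,c}$, and a large value of $D$ suggests that the variant behaves differently across conditions.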

KisSplice2IGV

Combining KisSplice output with the context given by a full-length transcriptome assembler (Trinity, Oases, etc.)

Visualisation in a genome browser (IGV)

The colour of an alignment depends on the log10(RPKM) (Reads Per Kilobase per Million mapped reads)
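As a reminder of how the colour scale is obtained, RPKM can be computed as below. This uses the standard RPKM definition; the function names and the example numbers are illustrative and are not taken from KisSplice2IGV.

```python
import math


def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads Per Kilobase per Million mapped reads (standard definition)."""
    return read_count * 1e9 / (length_bp * total_mapped_reads)


def log10_rpkm(read_count, length_bp, total_mapped_reads):
    """Value on which the colour scale is assumed to be based."""
    return math.log10(rpkm(read_count, length_bp, total_mapped_reads))


# 500 reads on a 2 kb sequence, 10 million mapped reads in total.
value = rpkm(500, 2000, 10_000_000)
print(value, log10_rpkm(500, 2000, 10_000_000))
```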

Conclusion

KisSplice detects various polymorphisms (SNPs, AS, repeats) in RNA-seq data

It provides quantification for such events.

KisSplice is more sensitive than Trinity for finding AS events

KisSplice is relevant for studies of non-model species

It brings information even when there is a model species, and can be used in addition to the classical pipeline

KisSplice People and download

Download

KisSplice: http://kissplice.prabi.fr

DBG construction: http://minia.genouest.org/

KisSplice People

Rennes: Rayan Chikhi, Pavlos Antoniou, Guillaume Rizk, Raluca Uricaru, Pierre Peterlongo

Lyon: Gustavo Sacomoto, Alice Julien-Laferriere, David Parsons, Janice Kielbassa, Lilia Brinza, Marie-France Sagot, Vincent Miele, Vincent Lacroix

Thank you! Questions?

Further analysis on short events
