big 2018 slides - lpantano.github.io
TRANSCRIPT
12-06-2018
miRNA wars: THE isomiRs menace
Lorena Pantano, Ph.D. https://lpantano.github.io
@lopantano
https://www.vinylinfo.org/news/population-growth-sustainability-and-our-future
150 human liver samples with small RNAseq data
miRNAs
Gene regulation by imperfect complementarity between the seed region of the miRNA and the 3’UTR of the targeted RNA molecule.
https://www.sciencedirect.com/science/article/pii/S002228361300154X
isomiRs https://www.flickr.com/photos/ajc1/4401411500
What is the best way to analyze this data?
Main projects:
- a format
- a way to compare
By Kieran O’Neill
Format derived from GFF3: mirGFF3
Python API: mirtop
What is the best way to analyze this data?
Samples suitable for benchmarking
Giraldez et al.
Mestdagh et al.
Dard-Dascot et al.
+ 6 synthetic RNAs
Tewari synthetic
Tewari custom
Different protocols
Protocols
https://rnaseq.uoregon.edu/
Variations
Protocol   3' adapter         5' adapter    PEG
TruSeq     -                  -             -
NEBNext    -                  -             Yes
NextFlex   4N                 4N            Yes
CleanTag   Methylphosphonate  2’OMe         -
CATS       poly-T             TSO with rX   -
SMARTer    poly-T             TSO           -
Methods and Metrics
Sample
Library size = 20
Sample
miRNA 2 = 10
miRNA 1 = 10
Library size = 20
Sample
miRNA 2 = x10
isomiR 2.1 = x3 (matches the spike-in perfectly)
isomiR 2.2 = x5 (does NOT match the spike-in perfectly)
isomiR 2.3 = x2 (does NOT match the spike-in perfectly)
miRNA 1 = x10
isomiR 1.1 = x7 (matches the spike-in perfectly)
isomiR 1.2 = x3 (does NOT match the spike-in perfectly)
Library size = 20
Sample
Library size = 20
There are 20 reads in total (lines), but 5 unique sequences (colors)
Sample
Library size = 20
isomiR 1.1 is 7/20=35% of the reads and 1/5=20% of the sequences
miRNA 1 = x10, 2 unique isomiRs
miRNA 2 = x10, 3 unique isomiRs
Sample
Library size = 20
isomiR 1.1 is 1/2=50% of the miRNA1 sequences
isomiR 1.1 is 7/10=70% of the miRNA1 reads
miRNA 2 = 10 reads, 3 isomiRs
miRNA 1 = 10 reads, 2 isomiRs
Sample
Library size = 20
PCT = share of the unique sequences: isomiR 1.1 is 1/5 = 20% of the sequences
IMPORTANCE = share of the reads of its miRNA: isomiR 1.1 is 7/10 = 70% of the miRNA 1 reads
miRNA 2 = 10 reads, 3 isomiRs
miRNA 1 = 10 reads, 2 isomiRs
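The toy sample used in these slides (20 reads, 5 unique sequences, 2 miRNAs) is small enough to recompute by hand. The sketch below does that in Python. It is only an illustration under stated assumptions, not mirtop's implementation: the function name `isomir_metrics`, the dictionary layout and the labels (including "isomiR 2.3" for the third isomiR of miRNA 2) are made up for this example.

```python
from collections import defaultdict

# Toy sample from the slides: (miRNA, isomiR) -> read count.
# The labels are illustrative, not mirtop identifiers.
counts = {
    ("miRNA 1", "isomiR 1.1"): 7,
    ("miRNA 1", "isomiR 1.2"): 3,
    ("miRNA 2", "isomiR 2.1"): 3,
    ("miRNA 2", "isomiR 2.2"): 5,
    ("miRNA 2", "isomiR 2.3"): 2,
}


def isomir_metrics(counts):
    """Return per-isomiR percentages for one sample.

    pct_reads  : share of all reads in the library
    pct        : share of the unique sequences in the library
    importance : share of the reads of the parent miRNA
    """
    library_size = sum(counts.values())   # 20 reads in the toy sample
    n_sequences = len(counts)              # 5 unique sequences
    reads_per_mirna = defaultdict(int)
    for (mirna, _), n in counts.items():
        reads_per_mirna[mirna] += n

    metrics = {}
    for (mirna, isomir), n in counts.items():
        metrics[(mirna, isomir)] = {
            "pct_reads": 100 * n / library_size,
            "pct": 100 / n_sequences,
            "importance": 100 * n / reads_per_mirna[mirna],
        }
    return metrics


metrics = isomir_metrics(counts)
m = metrics[("miRNA 1", "isomiR 1.1")]
print(m)  # {'pct_reads': 35.0, 'pct': 20.0, 'importance': 70.0}
```

The three numbers match the slides: 7/20 = 35% of the reads, 1/5 = 20% of the sequences, and 7/10 = 70% of the miRNA 1 reads.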
Pipeline razers3 + mirtop
Reference Trimmed Data
Razers3
Mirtop GFF3
Mirtop tabular
snakemake
Data analysis - Filters
[Flowchart: starting from all sequences in a miRNA, three NO/YES questions]
Human?
Is the spike-in detected?
Any sequence in the group mapped to another miRNA?
Tewari-synthetic: library size
[Figure: bar plot of reads (y axis, 0e+00 to 6e+07) per sample (x axis), colored by protocol (clean, neb, tru, x4n); bar labels range from 12K to 200K. Samples: clean_tag_lab5, neb_next_lab1, neb_next_lab3, neb_next_lab4, neb_next_lab5, neb_next_lab9, tru_seq_lab1, tru_seq_lab2, tru_seq_lab3, tru_seq_lab5, tru_seq_lab6, tru_seq_lab8, tru_seq_lab9, x4n_a_lab5, x4n_b_lab2, x4n_b_lab4, x4n_b_lab6, x4n_c_lab6, x4n_d_lab1, x4n_nex_tflex_lab8, x4n_xu_lab5]
40 million reads and 85K different sequences.
Tewari-synthetic: spike-ins are the top sequence?
[Figure: bar plot of pct (y axis, 0 to 80) per sample (x axis), colored by protocol (clean, neb, tru, x4n); bar labels range from 578 to 885. Samples: clean_tag_lab5, neb_next_lab1, neb_next_lab3, neb_next_lab4, neb_next_lab5, neb_next_lab9, tru_seq_lab1, tru_seq_lab2, tru_seq_lab3, tru_seq_lab5, tru_seq_lab6, tru_seq_lab8, tru_seq_lab9, x4n_a_lab5, x4n_b_lab2, x4n_b_lab4, x4n_b_lab6, x4n_c_lab6, x4n_d_lab1, x4n_nex_tflex_lab8, x4n_xu_lab5]
In 70% of miRNAs (692 being the total detected miRNAs), the top sequence is the expected one.
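The percentage on this slide can be pictured as a per-miRNA check: is the most abundant sequence the expected spike-in? Here is a minimal sketch of that check. The function `pct_top_is_expected`, the input layout and the toy sequence labels are assumptions for illustration and are not taken from the analysis code.

```python
def pct_top_is_expected(counts, expected):
    """Percentage of detected miRNAs whose top sequence is the expected one.

    counts   : {mirna: {sequence: reads}} for one sample
    expected : {mirna: spike-in sequence}
    A miRNA counts as detected when it has at least one sequence.
    On ties, the first sequence in the dictionary is taken as the top one.
    """
    detected = [mirna for mirna, seqs in counts.items() if seqs]
    if not detected:
        return 0.0
    hits = 0
    for mirna in detected:
        seqs = counts[mirna]
        top_sequence = max(seqs, key=seqs.get)
        if top_sequence == expected.get(mirna):
            hits += 1
    return 100 * hits / len(detected)


# Toy data: placeholder labels stand in for real sequences.
toy_counts = {
    "miRNA 1": {"spike_1": 7, "variant_1a": 3},
    "miRNA 2": {"spike_2": 3, "variant_2a": 5, "variant_2b": 2},
}
toy_expected = {"miRNA 1": "spike_1", "miRNA 2": "spike_2"}

result = pct_top_is_expected(toy_counts, toy_expected)
print(result)  # 50.0
```

In the toy data the spike-in is the top sequence for miRNA 1 but not for miRNA 2, so one of the two detected miRNAs passes.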
Most abundant isomiR type for each miRNA whose most abundant sequence is not the spike-in
[Figure: stacked bar plot of # of miRNAs (y axis, 0 to 600) per sample (x axis); legend (titled IMPORTANCE): add3p, add3p + other, shift3p, shift5p, shift5p shift3p, snp, snp + other]
Tewari-synthetic: Importance of the ‘isomiRs’ detected
[Figure: PCT (y axis) per sample (x axis), one panel per isomiR type (snp, snp + other, shift5p, shift5p shift3p, reference, shift3p, add3p, add3p + other); legend IMPORTANCE: <0.1, 0.1−1, 1−5, 5−10, 10−20, 20−50, >50]
isomiRs importance by type
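The IMPORTANCE legend groups isomiRs into percentage ranges. One way to assign those categories is sketched below with the standard-library `bisect` module. The names `EDGES`, `LABELS` and `importance_bin` are hypothetical, and how boundary values are handled (a value equal to an edge goes into the upper bin) is an assumption, since the slides do not say.

```python
from bisect import bisect_right
from collections import Counter

# Bin edges and labels taken from the plot legend (in percent).
EDGES = [0.1, 1, 5, 10, 20, 50]
LABELS = ["<0.1", "0.1-1", "1-5", "5-10", "10-20", "20-50", ">50"]


def importance_bin(importance):
    """Map an importance percentage to its legend category.

    Assumption: a value equal to an edge falls in the upper bin,
    for example 1.0 -> "1-5" and 50.0 -> ">50".
    """
    return LABELS[bisect_right(EDGES, importance)]


# Importance values of the five isomiRs in the toy sample:
# miRNA 1 -> 70% and 30%; miRNA 2 -> 30%, 50% and 20%.
toy_importance = [70.0, 30.0, 30.0, 50.0, 20.0]
bins = Counter(importance_bin(value) for value in toy_importance)
print(bins)
```

With real data, counting the bins per isomiR type and per sample gives the quantities that the stacked bars summarise.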
[Figure: bar plot of % removed (y axis, 0 to 90) per sample (x axis), colored by protocol (clean, neb, tru, x4n); bar labels range from 2980 to 6076]
From 85K to 4K unique sequences. 90% of sequences are removed.
[Figure: PCT (y axis) per sample (x axis), one panel per isomiR type (snp, snp + other, shift5p, shift5p shift3p, reference, shift3p, add3p, add3p + other); legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50, >50]
isomiRs importance by type with pct > 1
[Figure: PCT (y axis) by protocol (x axis: clean, neb, tru, x4n), one panel per isomiR type (snp, snp + other, shift3p, shift5p, shift5p shift3p, add3p, add3p + other, reference); legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50, >50]
Tewari-synthetic: Other tools
[Figure: PCT (y axis, 0 to 40) by protocol (x axis: clean, neb, tru, x4n); columns: shift3p, shift5p, snp; rows: bcbio, mirGe, razer3; legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50]
Tewari-synthetic: Other data
[Figure: PCT (y axis, 0 to 60) by protocol (x axis: clean, neb, nf1, nf2, nf3, nf4, tru, ts1, ts2, ts3, ts4, ts5, x4n); columns: shift3p, shift5p, snp; rows: tewari, tewari_custom, vandijk; legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50]
Tewari: synthetic vs human plasma
[Figure: PCT (y axis, 0 to 40) by protocol (x axis: clean, neb, tru, x4n); columns: shift3p, shift5p, snp; rows: human plasma, tewari; legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50]
Tewari-synthetic: Shift type
[Figure: % of isomiRs with shifting events (y axis, 0 to 10) by protocol (x axis: clean, neb, tru, x4n), one panel per shift type (1 0, 2 0, 3 0, 4 0; 0 1, 0 2, 0 3, 0 4; 0 −1, 0 −2, 0 −3, 0 −4; −1 0, −2 0, −3 0, −4 0); legend pct_cat: <0.1, 0.1−1, 1−5, 5−10, 10−20, 20−50]
[Figure: bar plot of % of miRNAs with shift events (y axis, 0 to 100) per sample (x axis), colored by protocol (clean, neb, tru, x4n)]
Summary
✓ Each miRNA generates a diversity of isomiRs:
✓ 90% contribute to < 1% of the miRNA abundance
✓ 10% enrichment on truncations at both sides
✓ Independent of pipelines and data sets
✓ 90% of miRNAs affected
✓ Custom 4N protocols perform worst
https://github.com/miRTop/mirGFF3
https://github.com/miRTop/mirtop
https://github.com/miRTop/incubator/tree/master/projects/tewari_equimolar
Thomas Desvignes, Phillipe Loher, Karen Eilbeck, Bastian Fromm, Gianvito Urgese, Isidore Rigoutsos, Michael Hackenberg, Ioannis S. Vlachos, Marc K. Halushka
Jeffery Ma, Jason Sydes, Yin Lu, Ernesto Aparicio-Puerta, Shruthi Bandyadka, Victor Barrera, Peter Batzel,
Kieran O’Neill, Eric Londin, Aristeidis G. Telonis, Elisa Ficarra, John H. Postlethwait,
Rafa Allis, Roderic Espin
miRNA wars: THE NEW Hope
Summary
[Figure: pct (y axis, 0 to 40) by nt (x axis: A, C, G, T); legend position: first, last]
[Figure: % of isomiRs (y axis, 0 to 20) by protocol (x axis: clean, neb, tru, x4n), panels for shift types −1 0 and 0 −1 split by nt (a, c, g, t); legend: <0.1, 0.1−1, 1−5, 5−10, 10−20, 20−50]
[Figure: PCT (y axis, 0 to 75) by protocol (x axis: clean, lex, neb, nf1, nf2, nf3, nf4, peb, qia, som, tri, tru, ts1, ts2, ts3, ts4, ts5, x4n); columns: shift3p, shift5p, snp; rows: dsrg, human plasma, tewari, tewari_custom, vandijk; legend IMPORTANCE: 1−5, 5−10, 10−20, 20−50]