MSPC: Joint Analysis of ChIP-seq Replicates
TRANSCRIPT
POLITECNICO
DI MILANO
Department of Electronics,
Information and Bioengineering
July 20, 2015
Using combined evidence from replicates to evaluate ChIP-seq peaks
Vahid Jalili
Vahid Jalili ([email protected])
Matteo Matteucci ([email protected])
Marco Masseroli ([email protected])
Marco Morelli ([email protected])
Website: https://mspc.codeplex.com
Politecnico di Milano | Vahid Jalili | July 20, 2015 | 2 / 35
Motivation
[Figure: tag count along genomic DNA for a ChIP-seq sample, showing signal and background. A stringent threshold and a permissive threshold are compared in terms of true positives, false positives, false negatives, and true negatives.]
Motivation
Benefit from Replicates
Use replicates to discriminate between sub-threshold binding and truly non-bound regions.
[Figure: tag count along genomic DNA, with signal and background, for Replicate 1 and Replicate 2.]
Method
Notations
$\mathcal{T}_s$: strong threshold
$\mathcal{T}_w$: weak threshold
Strong peak: $p\text{-value} \leq \mathcal{T}_s$
Weak peak: $\mathcal{T}_s < p\text{-value} \leq \mathcal{T}_w$
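The notation above can be read as a simple classification rule. The following is a minimal sketch, not the tool's actual code: the function name, the label for peaks above the weak threshold ("background"), and the example threshold values are all assumptions made for illustration.

```python
def classify_peak(p_value, strong_threshold, weak_threshold):
    """Label a peak by its p-value, following the slide's notation.

    strong_threshold plays the role of T_s and weak_threshold of T_w.
    The "background" label is a hypothetical name for peaks above T_w.
    """
    if not 0.0 <= strong_threshold <= weak_threshold:
        raise ValueError("expected 0 <= T_s <= T_w")
    if p_value <= strong_threshold:
        return "strong"          # p-value <= T_s
    if p_value <= weak_threshold:
        return "weak"            # T_s < p-value <= T_w
    return "background"          # p-value > T_w


# Illustrative thresholds only; the slides do not state specific values.
T_S, T_W = 1e-8, 1e-4
labels = [classify_peak(p, T_S, T_W) for p in (1e-10, 1e-6, 1e-2)]
```

With these illustrative thresholds, the three example p-values fall into the strong, weak, and background classes respectively.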
Method
Combining Evidence
How to combine evidence?
Fisher's combined probability test:
$X^2_{2k} = -2 \sum_{i=1}^{k} \ln p_i$
$X^2_{2k}$ follows a $\chi^2$ distribution with $2k$ degrees of freedom.
Confirm if $X^2_{2k} \geq \text{threshold}$
Discard if $X^2_{2k} < \text{threshold}$
Alternatives for combining test statistics:
Liptak's method (Liptak, 1958)
Mudholkar and George (Mudholkar & George, 1979)
Wilkinson's method (Wilkinson, 1951)
Truncated product method (Zaykin, Zhivotovsky, Westfall, & Weir, 2002)
…
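As an illustration of Fisher's combined probability test, the sketch below computes the statistic and its right-tail probability using only the standard library. The function names are hypothetical, and the closed-form tail used here is the standard one for a chi-squared distribution with an even number of degrees of freedom; it is not taken from the slides.

```python
import math


def fisher_statistic(p_values):
    """X^2 = -2 * sum(ln p_i) over the k p-values being combined."""
    if not p_values or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must be non-empty and lie in (0, 1]")
    return -2.0 * sum(math.log(p) for p in p_values)


def chi2_right_tail_even_dof(x, k):
    """P(X >= x) for a chi-squared variable with 2k degrees of freedom.

    For even degrees of freedom the tail has the closed form
    exp(-x/2) * sum_{j=0}^{k-1} (x/2)^j / j!.
    """
    half = x / 2.0
    term = total = 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combined_p(p_values):
    """Combined p-value of k independent tests under Fisher's method."""
    return chi2_right_tail_even_dof(fisher_statistic(p_values), len(p_values))


combined = fisher_combined_p([0.05, 0.05])
```

Comparing $X^2_{2k}$ against a threshold, as on the slide, is equivalent to comparing this combined p-value against the corresponding significance level.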
Method
Combining Evidence
Which evidence to combine?
[Figure: Replicate 1, Replicate 2, Replicate 3, Replicate 4]
Method
Intersection Determination
The challenge: an optimal method for finding the intersections
Some possible methods and their complexity:
• Sorted lists: $O(mn)$
• Naïve method: $O(n^m)$
• Hashing based: $O\left(\frac{n \log^2 w}{w} + mr\right)$
• Interval trees: $O(n \log_2 n)$
where
• $n$: average peak count per sample
• $m$: sample count
• $w$: number of bits in a machine word
• $r$: intersection size
Method
Intersection Determination: Interval Trees
[Figure: an interval tree over positions 0 to 31. Each node stores an interval and its data. Intervals shown: [16, 21], [8, 9], [25, 30], [17, 19], [26, 27], [19, 20], [15, 23], [5, 8], [6, 10], [0, 3].]
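To make the interval-tree idea concrete, here is a minimal sketch of one common variant: a binary search tree keyed on the interval's low endpoint, where each node also stores the maximum high endpoint in its subtree. The class and function names are hypothetical, the tree is left unbalanced for brevity, and the slides do not specify which variant the tool uses.

```python
class Node:
    """One interval plus the largest high endpoint found in its subtree."""

    def __init__(self, low, high, data=None):
        self.low, self.high, self.data = low, high, data
        self.max_high = high
        self.left = self.right = None


def insert(root, low, high, data=None):
    """Insert [low, high] into the tree, ordered by low endpoint."""
    if root is None:
        return Node(low, high, data)
    if low < root.low:
        root.left = insert(root.left, low, high, data)
    else:
        root.right = insert(root.right, low, high, data)
    root.max_high = max(root.max_high, high)
    return root


def find_overlaps(root, low, high, found=None):
    """Collect every stored interval that overlaps the closed query [low, high]."""
    if found is None:
        found = []
    # Nothing in this subtree ends at or after the query start: prune it.
    if root is None or root.max_high < low:
        return found
    find_overlaps(root.left, low, high, found)
    if root.low <= high and low <= root.high:
        found.append((root.low, root.high))
    # Right subtree intervals start at root.low or later: skip if past the query.
    if root.low <= high:
        find_overlaps(root.right, low, high, found)
    return found


# Intervals from the slide's figure, inserted in the order they are listed.
INTERVALS = [(16, 21), (8, 9), (25, 30), (17, 19), (26, 27),
             (19, 20), (15, 23), (5, 8), (6, 10), (0, 3)]
tree = None
for lo, hi in INTERVALS:
    tree = insert(tree, lo, hi)

hits = find_overlaps(tree, 22, 25)
```

Here `hits` holds the intervals overlapping the query [22, 25]. A production version would keep the tree balanced so that lookups stay logarithmic in the number of stored intervals.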
Method
Algorithm
Algorithm: an example
[Figure: Replicate 1, Replicate 2, and Replicate 3 with regions R1 (weak peak), R2 (weak peak), R3 (weak peak), and R4 (strong region).]
Steps for R1 (weak peak):
1. Determine intersecting regions across all samples.
2. If multiple regions are determined to be intersecting on a sample, choose the strongest one.
3. Combine test statistics using Fisher's method.
4. $X^2 \geq \text{threshold}$? NO!
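The steps shown so far can be sketched for a single region as follows. This is an illustrative reading of the slides, not the tool's implementation: the `Peak` type, the function names, the coordinates, the p-values, and the threshold are all invented for the example, "strongest" is assumed to mean lowest p-value, and a linear scan stands in for the interval-tree lookup.

```python
import math
from typing import NamedTuple


class Peak(NamedTuple):
    name: str
    start: int
    end: int
    p_value: float


def intersects(a, b):
    """True when two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def evaluate(peak, other_replicates, x2_threshold):
    """Confirm or discard one peak using evidence from the other replicates."""
    chosen = [peak]
    for replicate in other_replicates:
        # Step 1: regions of this replicate that intersect the peak.
        hits = [q for q in replicate if intersects(peak, q)]
        # Step 2: keep only the strongest one (assumed: lowest p-value).
        if hits:
            chosen.append(min(hits, key=lambda q: q.p_value))
    # Step 3: Fisher's statistic over the chosen regions.
    x2 = -2.0 * sum(math.log(q.p_value) for q in chosen)
    # Step 4: compare against the threshold.
    decision = "confirm" if x2 >= x2_threshold else "discard"
    return decision, x2, [q.name for q in chosen]


# Hypothetical coordinates and p-values; they do not come from the slides.
r1 = Peak("R1", 100, 200, 1e-3)
replicate_2 = [Peak("R2", 90, 120, 1e-4), Peak("R3", 180, 260, 1e-3)]
replicate_3 = [Peak("R4", 240, 300, 1e-12)]

decision, x2, used = evaluate(r1, [replicate_2, replicate_3], x2_threshold=40.0)
```

In practice the threshold would depend on the number of combined p-values, since the statistic has $2k$ degrees of freedom; a fixed value is used here only to keep the sketch short.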
[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Regions shown: R1, R2.]
Steps for R2 (weak peak):
1. Determine intersecting regions across all samples.
2. Since R2 intersects only with R1, and the R1-R2 test has already been performed, no further processing is needed.
Next step. Regions involved: R3 (weak peak), R4 (strong region), R1 (weak peak).
1. Determine intersecting regions across all samples.
2. Combine test statistics using Fisher's method.
3. $X^2 \geq \text{threshold}$? YES!
[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Regions shown: R1, R2, R3, R4.]
Next step. Regions involved: R4 (strong region), R3 (weak peak).
1. Determine intersecting regions across all samples.
2. Combine test statistics using Fisher's method.
3. $X^2 \geq \text{threshold}$? YES!
[Figure: updated intermediate sets for Replicate 1, Replicate 2, and Replicate 3, color-coded as Confirmed Peaks Set or Discarded Peaks Set. Regions shown: R1, R2, R3, R4.]
[Figure: intermediate sets for Replicate 1, Replicate 2, and Replicate 3 (regions R1, R2, R3, R4) alongside the output sets, color-coded as Confirmed Peaks Set, Discarded Peaks Set, or Output Set.]
Results
[Figure: panels labeled Myc2_1, Myc2_2, Myc3_1, and Myc3_2. Axis ticks range from 0e+00 to 1e+05 and from 0 to 30000.]
Abbreviation File name
Myc2_1 wgEncodeSydhTfbsK562CmycIggrabAlnRep1
Myc2_2 wgEncodeSydhTfbsK562CmycIggrabAlnRep2
Myc3_1 wgEncodeSydhTfbsK562CmycStdAlnRep1
Myc3_2 wgEncodeSydhTfbsK562CmycStdAlnRep2
Category                 Abbreviation  Color  Implication
Input (source BED file)  In            ██     Strong
                                       ██     Weak
Analysis Results         Re            ██     Strong Confirmed
                                       ██     Weak Confirmed
                                       ██     Weak Discarded
[Figure: In and Re columns for each sample, grouped as Set 1, Set 2, and Set 3.]
Results
Presence of E-box
[Figure legend: motif was enriched in the sequence defined by peaks; motif was NOT enriched in the sequence defined by peaks.]
Implementation
Performance
[Figure: Running Time. x-axis: Peaks Count (x 10000), from 0 to 45. y-axis: Time (seconds), from 0 to 50. Series: 2-Replicates, 4-Replicates, 6-Replicates.]
Demo
Questions
Questions are welcome at: https://mspc.codeplex.com/discussions