![Page 1: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/1.jpg)
Efficient and Accurate Clustering for Large-Scale Genetic Mapping
V. Strnadová (Neeley)*,++, Aydın Buluç*, Jarrod Chapman*,¶, Joseph Gonzalez§,
John Gilbert++, Stefanie Jegelka§, Daniel Rokhsar*,§,¶, Leonid Oliker*
* Lawrence Berkeley National Labs, ++ UC Santa Barbara, § UC Berkeley, ¶ Joint Genome Institute
![Page 2: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/2.jpg)
Motivation
• High-throughput sequencing methods have produced a flood of inexpensive genetic information
• Genetic maps are important to breeding studies but genetic mapping software is prohibitively slow on large data sets
![Page 3: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/3.jpg)
The Genetic Mapping Problem
𝑖1 𝑖2 𝑖3 𝑖4 𝑖5 𝑖6
𝑚1 A B - - A -
𝑚2 A B A A B A
𝑚3 A A - - - B
𝑚4 A - B - B B
𝑚5 B - B A - A
𝑚6 A A B A - -
𝑚7 - - - A B B
𝑚8 A B A B - A
𝑚9 A B - B - -
𝑚10 B B B - A A
𝑚11 A A A A B B
𝑚12 B - A B A -
𝑚13 B B - A A -
𝑚14 - - - B A A
𝑚15 B - - A A B
Data: genotype matrix with markers 𝑚1 to 𝑚15 as rows and individuals 𝑖1 to 𝑖6 as columns; "-" marks missing data
![Page 4: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/4.jpg)
The Genetic Mapping Problem
[Figure: the data matrix (markers 𝑚1 to 𝑚15 by individuals 𝑖1 to 𝑖6, with missing entries) is clustered into two linkage groups: {𝑚1, 𝑚2, 𝑚8, 𝑚9, 𝑚15} and {𝑚3, 𝑚4, 𝑚5, 𝑚6, 𝑚7, 𝑚10, 𝑚11, 𝑚12, 𝑚13, 𝑚14}.]
![Page 5: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/5.jpg)
The Genetic Mapping Problem
[Figure: data, then cluster, then linkage groups, then ordering. After clustering, the markers of each linkage group are arranged in a linear order, shown as 𝑚8, 𝑚15, 𝑚2, 𝑚1, 𝑚9 and 𝑚11, 𝑚6, 𝑚13, 𝑚3, 𝑚12, 𝑚10, 𝑚4, 𝑚7, 𝑚5, 𝑚14.]
![Page 6: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/6.jpg)
The Genetic Mapping Problem
[Figure: the same diagram, with the clustering step (data to linkage groups) highlighted.]
![Page 7: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/7.jpg)
The Need for Large-Scale Clustering in Genetic Mapping
• Hundreds of thousands of genetic markers available, but current software can only handle up to ~10,000 markers
• A major bottleneck is the linkage-group-finding phase
• Popular mapping tools all handle this phase the same way, with an $O(M^2)$ clustering algorithm for $M$ markers
![Page 8: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/8.jpg)
The Need for Large-Scale Clustering in Genetic Mapping
Our solution: A fast, scalable clustering algorithm tailored to genetic marker data
![Page 9: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/9.jpg)
Standard Approach to Genetic Marker Clustering
[Figure: the data matrix is clustered into Linkage group 1 and Linkage group 2.]
![Page 10: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/10.jpg)
Standard Approach to Genetic Marker Clustering
(1) Compute the similarity between all $O(M^2)$ pairs of markers, producing a complete graph with $M$ vertices
• Similarity function is the “LOD score”, a logarithm of odds that two markers are genetically linked
(2) Cut all edges below a LOD threshold
(3) The resulting connected components = linkage groups
[Figure, step (1): complete graph over markers 𝑚1 to 𝑚15, computed from the data matrix.]
![Page 11: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/11.jpg)
![Page 12: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/12.jpg)
LOD Score
Compares the likelihood of obtaining the test data if the two markers are indeed linked to the likelihood of observing the same data purely by chance:
$$LOD(m_i, m_j) = \log_{10} \frac{P(\text{linkage}_{ij})}{P(\text{no linkage}_{ij})}$$
Formally,
$$LOD = \log_{10} \frac{(1 - \theta_{ij})^{NR_{ij}} \, \theta_{ij}^{R_{ij}}}{0.5^{R_{ij} + NR_{ij}}}$$
Where:
$R_{ij}$ = number of recombinant offspring
$NR_{ij}$ = number of nonrecombinant offspring
$\theta_{ij}$ = recombination fraction, i.e. $\frac{R_{ij}}{R_{ij} + NR_{ij}}$
Example marker pair:
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
![Page 13: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/13.jpg)
![Page 14: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/14.jpg)
LOD Score
Worked example: the markers $m_i$ and $m_j$ below share three non-missing entries, of which two agree ($NR_{ij} = 2$) and one differs ($R_{ij} = 1$), so $\theta_{ij} = 1/3$:
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
$$LOD(m_i, m_j) = \log_{10} \frac{(1 - \frac{1}{3})^{2} \, (\frac{1}{3})^{1}}{0.5^{3}} = 0.074$$
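The LOD formula can be checked numerically. The following is a minimal Python sketch, not the authors' code. It assumes, as the numbers in the worked example suggest, that a position observed in both markers counts as nonrecombinant when the genotypes agree and as recombinant when they differ, and that "-" marks a missing entry. The function name `lod_score` and the string encoding of markers are illustrative choices.

```python
import math


def lod_score(mi: str, mj: str, missing: str = "-") -> float:
    """LOD score of two equal-length genotype strings (assumed encoding)."""
    # Keep only the positions where both markers are observed.
    pairs = [(a, b) for a, b in zip(mi, mj) if a != missing and b != missing]
    r = sum(a != b for a, b in pairs)   # recombinant count R
    nr = len(pairs) - r                 # nonrecombinant count NR
    if r + nr == 0:
        return 0.0                      # no shared data: no evidence either way
    theta = r / (r + nr)                # recombination fraction
    # log10 of (1 - theta)^NR * theta^R, treating 0 * log(0) as 0.
    log_linked = 0.0
    if nr:
        log_linked += nr * math.log10(1 - theta)
    if r:
        log_linked += r * math.log10(theta)
    # Subtract log10 of the chance likelihood 0.5^(R + NR).
    return log_linked - (r + nr) * math.log10(0.5)


print(round(lod_score("AB--A-", "ABAABA"), 3))  # 0.074
```

With only six individuals no pair can score higher than about 1.8, which is why realistic thresholds such as 5 to 30 need hundreds of individuals.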
![Page 15: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/15.jpg)
Standard Approach to Genetic Marker Clustering
(2) Cut all edges below a LOD threshold
[Figure, step (2): edges below the LOD threshold are removed from the marker graph.]
![Page 16: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/16.jpg)
Standard Approach to Genetic Marker Clustering
(3) The resulting connected components = linkage groups
[Figure, step (3): the remaining connected components are labeled Linkage group 1 and Linkage group 2.]
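The three steps of the standard approach can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the code of any particular mapping tool: markers are genotype strings with "-" for missing data, agreement counts as nonrecombinant, an edge survives when its LOD is at or above the threshold, and the names `lod` and `linkage_groups` are hypothetical. Connected components are found with a small union-find structure.

```python
import math
from itertools import combinations


def lod(mi, mj, missing="-"):
    """LOD score of two genotype strings (same assumptions as stated above)."""
    pairs = [(a, b) for a, b in zip(mi, mj) if missing not in (a, b)]
    r = sum(a != b for a, b in pairs)           # recombinants
    nr = len(pairs) - r                         # nonrecombinants
    score = (r + nr) * math.log10(2)            # minus log10 of 0.5^(R + NR)
    if r and nr:                                # theta strictly between 0 and 1
        theta = r / (r + nr)
        score += nr * math.log10(1 - theta) + r * math.log10(theta)
    return score


def linkage_groups(markers, threshold):
    """markers: dict of name -> genotype string. Returns a list of name sets."""
    parent = {name: name for name in markers}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]       # path halving
            x = parent[x]
        return x

    # Steps (1) and (2): score all O(M^2) pairs, keep edges at or above threshold.
    for a, b in combinations(markers, 2):
        if lod(markers[a], markers[b]) >= threshold:
            parent[find(a)] = find(b)

    # Step (3): connected components of the remaining graph.
    groups = {}
    for name in markers:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


toy = {"a": "AAAABBBB", "b": "AAAABBBB", "c": "ABABABAB", "d": "ABABABAB"}
print(linkage_groups(toy, threshold=2.0))
```

The double loop over pairs is the quadratic cost that the slides identify as the bottleneck.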
![Page 17: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/17.jpg)
Our Approach: The BubbleCluster Algorithm
Primary assumption: Clusters have a “linear structure”
![Page 18: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/18.jpg)
Our Approach: The BubbleCluster Algorithm
Key idea: Maintain a set of representative or “sketch” points which reveal the cluster structure
[Figure: sketch points along a cluster, with the LOD threshold marked.]
![Page 19: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/19.jpg)
The BubbleCluster Algorithm: Overview
Input: set of markers, LOD threshold 𝜏, non-missing threshold 𝜂, low-quality threshold 𝑐, cluster size threshold 𝜎
Output: set of clusters C and set of representative points R
Phase I: Build the initial set of clusters and the set of representative points using high-quality markers (those with at least 𝜂 non-missing entries)
Phase II: Add low-quality markers (fewer than 𝜂 non-missing entries) to the initial set of clusters
Phase III: Attempt to merge small clusters with large ones
![Page 20: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/20.jpg)
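The BubbleCluster Algorithm
Iteration i: find $r_{MAX} := r_j$ for which $LOD(m, r_j)$ is maximal; set $C_{MAX} := C_K \in C$ containing $r_{MAX}$
[Figure: new marker 𝒎 compared with each representative point $r_j$ of clusters $C_1$ and $C_2$ via $LOD(m, r_j)$.]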
![Page 21: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/21.jpg)
![Page 22: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/22.jpg)
The BubbleCluster Algorithm
If $LOD(m, r_{MAX}) < \text{LOD\_threshold}$: $C = C \cup \{m\}$
[Figure: 𝒎 starts a new cluster $C_3$, separate from $C_1$ and $C_2$.]
![Page 23: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/23.jpg)
![Page 24: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/24.jpg)
The BubbleCluster Algorithm
Else If IS_INTERIOR($r_{MAX}$): $C_{MAX} = C_{MAX} \cup \{m\}$
[Figure: 𝒎 joins $C_1 = C_{MAX}$ next to the representative point $r_{MAX}$; $C_2$ is unchanged.]
![Page 25: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/25.jpg)
![Page 26: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/26.jpg)
The BubbleCluster Algorithm
Else If IS_EXTERIOR($m$, $r_{MAX}$):
Add $m$ to representative points of $C_{MAX}$
Add $m$ to $C_{MAX}$
[Figure: 𝒎, the representative point $r_{MAX}$, and clusters $C_1 = C_{MAX}$ and $C_2$.]
![Page 27: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/27.jpg)
![Page 28: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/28.jpg)
The BubbleCluster Algorithm
Else // $m$ is interior to the outer point $r_{MAX}$
Add $m$ to $C_{MAX}$
[Figure: 𝒎, the outer representative point $r_{MAX}$, and clusters $C_1 = C_{MAX}$ and $C_2$.]
![Page 29: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/29.jpg)
![Page 30: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/30.jpg)
The BubbleCluster Algorithm
If $m$ has a LOD score above the threshold to two clusters, then merge the clusters and add $m$ to the merged cluster: $C_{NEW} = C_1 \cup C_2$
[Figure: 𝒎 joins $C_1$ and $C_2$ into a single cluster $C_{NEW}$.]
![Page 31: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/31.jpg)
End of Phase I
Stop when all markers with at least 𝜂 non-missing entries have been processed
Running time: 𝑂(|𝛨| log2 𝛨 + |𝛨||𝑅|)
where: |𝛨| = size of high-quality marker set, |𝑅| = size of representative point set
Phase II: add low-quality markers
Phase III: merge small clusters
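The Phase I control flow shown on the preceding slides can be summarized in a short Python sketch. It is a simplified illustration, not the authors' implementation: the slides do not define `IS_INTERIOR` or `IS_EXTERIOR`, so both are passed in as caller-supplied predicates (with an extra argument, the cluster's current representative list, that is an assumption of this sketch), every marker is scored against all representative points by a plain linear scan, and the merge check is applied before the interior/exterior test.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Set, Tuple

Marker = Hashable


def phase_one(
    markers: Iterable[Marker],
    lod: Callable[[Marker, Marker], float],
    tau: float,
    is_interior: Callable[[Marker, List[Marker]], bool],          # hypothetical
    is_exterior: Callable[[Marker, Marker, List[Marker]], bool],  # hypothetical
) -> Tuple[Dict[int, Set[Marker]], Dict[int, List[Marker]]]:
    clusters: Dict[int, Set[Marker]] = {}   # cluster id -> member markers
    reps: Dict[int, List[Marker]] = {}      # cluster id -> representative points
    next_id = 0
    for m in markers:
        # Score m against every current representative point.
        scored = [(lod(m, r), cid, r) for cid, rs in reps.items() for r in rs]
        best = max(scored, key=lambda t: t[0], default=None)
        if best is None or best[0] < tau:
            # No representative is similar enough: m starts a new cluster.
            clusters[next_id], reps[next_id] = {m}, [m]
            next_id += 1
            continue
        _, c_max, r_max = best
        # m links to other clusters at or above tau: merge them into c_max.
        linked = {cid for s, cid, _ in scored if s >= tau and cid != c_max}
        for cid in linked:
            clusters[c_max] |= clusters.pop(cid)
            reps[c_max] += reps.pop(cid)
        # Only a marker exterior to an outer point becomes a representative.
        if not is_interior(r_max, reps[c_max]) and is_exterior(m, r_max, reps[c_max]):
            reps[c_max].append(m)
        clusters[c_max].add(m)
    return clusters, reps
```

A toy run on points of a line, with `lod = lambda a, b: 10 - abs(a - b)` standing in for a real LOD score and predicates that test whether a point lies inside or outside the span of the representatives, groups `[0, 1, 2, 10, 11]` into two clusters at `tau = 8`.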
![Page 32: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/32.jpg)
BubbleCluster Parameters
LOD threshold
Non-missing threshold: determines how many markers are high-quality
Recall the LOD score:
$$LOD = \log_{10} \frac{(1 - \theta)^{NR} \, \theta^{R}}{0.5^{R + NR}}$$
Highest achievable LOD:
$$\lim_{R \to 0} \log_{10} \frac{(1 - \theta)^{NR} \, \theta^{R}}{0.5^{R + NR}} = NR \log_{10} 2$$
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
[Figure labels: 𝜏1, 𝜏2]
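The limit can be verified in a few lines once the logarithm is expanded and $\theta = R/(R + NR)$ is substituted. The derivation below uses only the symbols defined for the LOD score and treats $R$ as a continuous quantity, which is an assumption made for the sake of taking the limit.

```latex
\begin{aligned}
LOD &= NR \log_{10}(1 - \theta) + R \log_{10}\theta + (R + NR)\log_{10} 2, \\
R \to 0 &\;\Rightarrow\; \theta \to 0, \quad
  NR \log_{10}(1 - \theta) \to 0, \quad
  R \log_{10}\theta \to 0, \\
\lim_{R \to 0} LOD &= NR \log_{10} 2 .
\end{aligned}
```

For the six-individual example pair, at most $NR = 6$ entries can agree, so no pair can exceed $6 \log_{10} 2 \approx 1.81$. A marker needs about $\tau / \log_{10} 2$ jointly observed entries before it can reach a threshold $\tau$ at all, which is what ties the non-missing threshold to the LOD threshold.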
![Page 33: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/33.jpg)
![Page 34: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/34.jpg)
BubbleCluster Parameters
Example LOD:
$$\log_{10} \frac{(1 - \frac{1}{3})^{2} \, (\frac{1}{3})^{1}}{0.5^{3}} = 0.074$$
𝑚𝑖 A B - - A -
𝑚𝑗 A B A A B A
![Page 35: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/35.jpg)
![Page 36: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/36.jpg)
Evaluation Metric: 𝐹-score
Given a golden standard clustering, the 𝐹-score measures the quality of another clustering by comparing it to the golden standard
Range: 0 – 1
The 𝐹-score is a harmonic mean of precision and recall
An 𝐹-score of 1 indicates perfect precision and perfect recall for every golden standard cluster
![Page 37: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/37.jpg)
Results: BubbleCluster on Real Data Sets
Dataset Size Time F-Score
Barley 64K 15 sec 0.9993
Switchgrass 113K 8.9 min 0.9745
Switchgrass 548K 1.9 hrs 0.9894
Wheat 1.58M 1.22 hrs N/A *
* Results under review at Genome Biology
![Page 38: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/38.jpg)
Comparison of Clustering Algorithms for Simulated Data
Clustering Method 12.5K Markers 25K Markers
F-score Time F-score Time
JoinMap 0.99964 14 min 0.99982 46 min
MSTMap 0.99964 4.5 min 0.99982 20 min
PIC 0.47024 11 sec (+ 4 min) 0.60782 44 sec (+ 16.5 min)
BubbleCluster 0.99964 6 sec 0.99982 15 sec
Simulated data created with Nicholas Tinker’s Spaghetti Software.
![Page 39: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/39.jpg)
Scaling Results for BubbleCluster on Simulated Data
[Figure: log-scale plot of runtime (s, axis from 1 to 1024) and error (1 - F-score, axis from 1E-5 to 1E+0) against dataset size (12.5, 25, 50, 100, 200, 400 thousand markers), with one error series and one runtime series each for 65% and 35% missing data. Data labels on the two runtime series: 4.4, 9.8, 21.0, 47.2, 96.6, 237.8 and 6.7, 14.0, 31.7, 86.4, 198.7, 475.1.]
![Page 40: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/40.jpg)
Effect of the LOD threshold
LOD threshold 5 10 15 20 25 30
F-Score 0.6225 0.9999 0.9999 0.9999 0.9999 0.9999
Time (s) 48.6 67.0 70.9 82.0 106 170
Fixed missing entry threshold, increasing LOD threshold
200K markers, 300 individuals, 35% missing rate
[Figure: LOD threshold 𝜏1 vs. 𝜏2]
![Page 41: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/41.jpg)
Effect of the missing data threshold
Fixed LOD threshold, increasing non-missing entry threshold
Non-missing threshold 132 166 172 179 186 192
F-Score 0.9999 0.9999 0.9992 0.9930 0.9610 0.8948
Time (s) 82.0 84.6 82.7 83.0 81.7 82.0
200K markers, 300 individuals, 35% missing rate
![Page 42: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/42.jpg)
![Page 43: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/43.jpg)
Conclusion
By exploiting the structure underlying genetic marker clusters, we were able to design a fast clustering algorithm tailored to genetic marker data
[Figure: marker 𝒎 and clusters 𝐶1 and 𝐶2]
While remaining highly accurate, we outperform popular existing tools in both runtime and scalability
Clustering Method 12.5K Markers 25K Markers
F-score Time F-score Time
JoinMap 0.99964 14 min 0.99982 46 min
MSTMap 0.99964 4.5 min 0.99982 20 min
PIC 0.47024 11 sec (+ 4 min) 0.60782 44 sec (+ 16.5 min)
BubbleCluster 0.99964 6 sec 0.99982 15 sec
![Page 44: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/44.jpg)
![Page 45: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/45.jpg)
![Page 46: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/46.jpg)
Future Work
• Use representative points as starting point for ordering phase
• Provide a more thorough theoretical analysis of achievable clustering as well as order accuracy given assumptions about error and missing data rates
• Develop efficient and accurate, large-scale genetic mapping software
![Page 47: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/47.jpg)
Thank You
Code for BubbleCluster soon available at: www.ucsb.edu/~veronika
![Page 48: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/48.jpg)
Backup Slides
![Page 49: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/49.jpg)
Choosing LOD and non-missing threshold
Goal: minimize $P(\text{mistake})$ and maximize $F$-score
Let $p = P(LOD(m_i, m_j) > LOD_{threshold} \mid m_i \in C_i,\ m_j \in C_j,\ i \neq j)$
Let $\tau$ = LOD threshold
By definition of the LOD score, $p = \frac{1}{10^{\tau}}$
Let $n_{comp}$ = number of LOD comparisons we make
Then, if we want to ensure that $(1 - p)^{n_{comp}} > 1 - \varepsilon$, i.e. that no false link occurs with probability above $1 - \varepsilon$, we need:
$$\tau > \log_{10} \left( \frac{1}{1 - (1 - \varepsilon)^{1/n_{comp}}} \right)$$
At the same time, we want to include a marker in the high-quality set only if we expect that it will achieve a LOD of $\tau$ or greater with another marker, requiring:
$$n_{nm} > \frac{\tau}{(1 - \mu) \log_{10} 2}$$
Where $\mu$ is the missing rate and $n_{nm}$ is the number of non-missing entries in the marker
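Both bounds are easy to evaluate for a concrete experiment. The Python sketch below is a hedged illustration: it reads the first condition as requiring the probability of no false link, $(1 - p)^{n_{comp}}$, to stay above $1 - \varepsilon$, and the second bound as $\tau$ divided by $(1 - \mu) \log_{10} 2$. The function names are hypothetical, and `expm1`/`log1p` are used only to keep the arithmetic stable when the number of comparisons is very large.

```python
import math


def min_lod_threshold(n_comp: float, eps: float) -> float:
    """Smallest tau with (1 - 10**-tau)**n_comp >= 1 - eps (assumed reading)."""
    # 1 - (1 - eps)**(1 / n_comp), computed without catastrophic cancellation.
    p_max = -math.expm1(math.log1p(-eps) / n_comp)
    return math.log10(1.0 / p_max)


def min_non_missing(tau: float, mu: float) -> float:
    """Non-missing entries needed so the expected maximal LOD can reach tau."""
    return tau / ((1.0 - mu) * math.log10(2.0))


# One million comparisons with a 1% tolerated chance of any false link.
print(round(min_lod_threshold(1e6, 0.01), 2))
# LOD threshold 20 at a 35% missing rate.
print(round(min_non_missing(20, 0.35), 1))
```

The second call gives a lower bound of roughly 102 non-missing entries for a LOD threshold of 20 at a 35% missing rate.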
![Page 50: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/50.jpg)
Evaluation Metric for Cluster Quality
F-score (range: 0 to 1)
• Given a “golden standard clustering”, the F-score measures the quality of a clustering as follows:
• The F-score between a golden standard cluster $g$ and a test cluster $c$ is a harmonic mean of precision $P$ and recall $R$:
$$Fscore(g, c) = \frac{2PR}{P + R}$$
• The overall F-score between the golden standard clustering $G$ and a test clustering $C$ is a weighted average of the F-scores for each golden standard cluster $g$:
$$overall\_Fscore(G, C) = \frac{1}{m} \sum_{g \in G} |g| \cdot \max_{c \in C} Fscore(g, c)$$
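The overall F-score can be computed directly from two clusterings given as lists of sets. The sketch below is an illustration under assumptions the slide does not spell out: precision is taken as $|g \cap c| / |c|$, recall as $|g \cap c| / |g|$, each golden standard cluster is weighted by its size, and $m$ is the total number of markers in the golden standard clustering. The function names are hypothetical.

```python
from typing import Hashable, List, Set


def f_score(g: Set[Hashable], c: Set[Hashable]) -> float:
    """Harmonic mean of precision and recall of test cluster c against g."""
    overlap = len(g & c)
    if overlap == 0:
        return 0.0
    precision = overlap / len(c)   # assumed definition: share of c that is in g
    recall = overlap / len(g)      # assumed definition: share of g found in c
    return 2 * precision * recall / (precision + recall)


def overall_f_score(gold: List[Set[Hashable]], test: List[Set[Hashable]]) -> float:
    """Size-weighted average, over gold clusters, of the best-matching F-score."""
    m = sum(len(g) for g in gold)  # assumed: m is the total marker count
    return sum(len(g) * max(f_score(g, c) for c in test) for g in gold) / m


gold = [{1, 2, 3}, {4, 5}]
print(overall_f_score(gold, gold))                          # 1.0
print(round(overall_f_score(gold, [{1, 2, 3, 4, 5}]), 4))   # 0.6786
```

Lumping everything into one test cluster keeps recall perfect but lowers precision, which is why the second value falls well below 1.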
![Page 51: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/51.jpg)
Local Linearity Assumption
Although the LOD score does not obey the triangle inequality, we assume that it does at close ranges and with enough data
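[Figure: marker 𝒎 and representative points $r_1$ and $r_2$, with edges labeled $LOD(m, r_1)$, $LOD(m, r_2)$, and $LOD(r_2, r_1)$]
$LOD(m, r_2) < LOD(m, r_1)$ &&
$LOD(m, r_2) < LOD(r_1, r_2)$ &&
$LOD(m, r_1) > LOD(r_1, r_2)$
𝒎 is a new boundary point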
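The three inequalities translate directly into a predicate. The sketch below reads them as the condition under which 𝒎 is declared a new boundary point, which is an interpretation of the slide's layout rather than a confirmed definition; the function name and argument names are hypothetical, and the three LOD values are assumed to have been computed beforehand.

```python
def is_new_boundary_point(lod_m_r1: float, lod_m_r2: float, lod_r1_r2: float) -> bool:
    """True when all three inequalities from the slide hold.

    lod_m_r1:  LOD(m, r1), with r1 the current outer representative point
    lod_m_r2:  LOD(m, r2), with r2 the neighboring representative point
    lod_r1_r2: LOD(r1, r2)
    """
    return (
        lod_m_r2 < lod_m_r1        # m is more similar to r1 than to r2
        and lod_m_r2 < lod_r1_r2   # m and r2 are the least similar pair
        and lod_m_r1 > lod_r1_r2   # m is more similar to r1 than r2 is
    )


print(is_new_boundary_point(lod_m_r1=12.0, lod_m_r2=6.0, lod_r1_r2=9.0))  # True
print(is_new_boundary_point(lod_m_r1=12.0, lod_m_r2=11.0, lod_r1_r2=9.0))  # False
```

Because LOD is a similarity rather than a distance, the least similar pair plays the role of the two endpoints of a line segment.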
![Page 52: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/52.jpg)
The Effect of the LOD threshold
The standard approach to clustering can be viewed as single linkage clustering
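Time complexity: $O(M^2)$
[Figure: dendrogram over markers 𝑚1 to 𝑚15 with levels LOD 10, LOD 9, LOD 8, LOD 7, and LOD 6; the leaves appear in two groups, 𝑚15, 𝑚2, 𝑚1, 𝑚8, 𝑚9 and 𝑚4, 𝑚3, 𝑚5, 𝑚6, 𝑚7, 𝑚10, 𝑚11, 𝑚12, 𝑚13, 𝑚14.]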
![Page 53: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/53.jpg)
The Effect of the LOD threshold
A high LOD threshold ensures that the only edges that remain in the completely connected graph are between markers that are extremely likely to be genetically linked
[Figure: the same dendrogram, with levels LOD 10 down to LOD 6.]
![Page 54: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/54.jpg)
The Effect of the LOD threshold
Example: LOD threshold = 10
[Figure: the dendrogram cut at LOD 10, next to the resulting marker graph with Linkage group 1 and Linkage group 2 labeled.]
![Page 55: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/55.jpg)
The Effect of the LOD threshold
Example: LOD threshold = 8
[Figure: the dendrogram cut at LOD 8, next to the resulting marker graph with Linkage group 1 and Linkage group 2 labeled.]
![Page 56: Efficient and Accurate Clustering for Large-Scale Genetic Mappingveronika/Efficient and... · 2014. 11. 3. · Genetic Mapping •Hundreds of thousands of genetic markers available,](https://reader033.vdocument.in/reader033/viewer/2022051915/60073a68074cd3561555ec41/html5/thumbnails/56.jpg)
The Effect of the LOD threshold
Example: LOD threshold = 7
[Figure: the dendrogram cut at LOD 7, next to the resulting marker graph with Linkage group 1 and Linkage group 2 labeled.]