Reconstructing Kinship Relationships in Wild Populations
"I do not believe that the accident of birth makes people sisters and brothers. It makes them siblings. Gives them mutuality of parentage."
Maya Angelou
Isabel Caballero, UIC
Priya Govindan, Rutgers
Chun-An (Joe) Chou, Rutgers
Saad Sheikh, Ecole Polytechnique
Alan Perez-Rathke, UIC
Mary Ashley, UIC
W. Art Chaovalitwongse, Rutgers
Ashfaq Khokhar, UIC
Bhaskar DasGupta, UIC
Tanya Berger-Wolf, UIC
Microsatellites (STR)
Advantages:
• Codominant (easy inference of genotypes and allele frequencies)
• Many heterozygous alleles per locus
• Possible to estimate other population parameters
• Cheaper than SNPs
But: Few loci
And: Large families, self-mating, …
Example repeat: 5’ CACACACA
Alleles:
#1 CACACACA
#2 CACACACACACA
#3 CACACACACACACA
Genotypes: 1/1, 2/2, 3/3, 1/2, 1/3, 2/3
Siblings: two children with the same parents
Question: given a set of children, find the sibling groups
Diploid Siblings
Each genotype lists one pair of alleles per locus:
father: (.../...), (a/b), (.../...), (.../...)
mother: (.../...), (c/d), (.../...), (.../...)
child: (.../...), (e/f), (.../...), (.../...)
At each locus the child carries one allele from the father and one from the mother.
Why Reconstruct Sibling Relationships?
Used in: conservation biology, animal management, molecular ecology, genetic epidemiology
Necessary for: estimating heritability of quantitative characters, characterizing mating systems and fitness.
• But: hard to sample parent/offspring pairs. Sampling cohorts of juveniles is easier
The Problem
Genotypes are written as allele 1/allele 2.
Ind  Locus 1  Locus 2
1    1/2      1/2
2    1/3      3/4
3    1/4      3/5
4    3/3      7/6
5    1/3      3/4
6    1/3      3/7
7    1/5      8/2
8    1/6      2/2
Sibling Groups:
2, 4, 5, 6
1, 3
7, 8
Existing Methods
Method | Approach | Error detection | Assumptions
Almudevar & Field (1999, 2003) | Minimal sibling groups under likelihood | No | Minimal sibgroups, representative allele frequencies
KinGroup (2004) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
Family Finder (2003) | Partition population using likelihood graphs | No | Allele frequencies etc. are representative
Pedigree (2001) | Markov Chain Monte Carlo/ML | No | Allele frequencies etc. are representative
COLONY (2004) | Simulated annealing/ML | Yes | Monogamy for one sex
Fernandez & Toro (2006) | Simulated annealing/ML | No | Co-ancestry matrix is a good measure; parents can be reconstructed or are available
Inheritance Rules
father: (.../...), (a/b), (.../...), (.../...)
mother: (.../...), (c/d), (.../...), (.../...)
child 1: (.../...), (e1/f1), (.../...), (.../...)
child 2: (.../...), (e2/f2), (.../...), (.../...)
child 3: (.../...), (e3/f3), (.../...), (.../...)
…
child n: (.../...), (en/fn), (.../...), (.../...)
4-allele rule: siblings have at most 4 distinct alleles in a locus
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Our Approach: Mendelian Constraints
4-allele rule: siblings have at most 4 different alleles in a locus
Yes: 3/3, 1/3, 1/5, 1/6
No: 3/3, 1/3, 1/5, 1/6, 3/2
2-allele rule: in a locus in a sibling group, a + R ≤ 4, where
a = number of distinct alleles
R = number of alleles that appear with 3 others or are homozygous
Yes: 3/3, 1/3, 1/5
No: 3/3, 1/3, 1/5, 1/6
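The two rules can be checked mechanically for one locus. The sketch below is a minimal illustration that follows the slide's wording (a + R ≤ 4) literally; the function names and the tuple representation of genotypes are assumptions made for this example, not the authors' code.

```python
def four_allele_ok(genotypes):
    """4-allele rule: at most 4 distinct alleles at this locus."""
    alleles = {a for g in genotypes for a in g}
    return len(alleles) <= 4


def two_allele_ok(genotypes):
    """2-allele rule as worded on the slide: a + R <= 4."""
    alleles = {a for g in genotypes for a in g}
    partners = {a: set() for a in alleles}  # other alleles each allele appears with
    homozygous = set()                      # alleles seen in a homozygous genotype
    for x, y in genotypes:
        if x == y:
            homozygous.add(x)
        else:
            partners[x].add(y)
            partners[y].add(x)
    a = len(alleles)
    r = sum(1 for al in alleles if al in homozygous or len(partners[al]) >= 3)
    return a + r <= 4


# The Yes/No examples from the slide, written as (allele, allele) tuples.
four_yes = [(3, 3), (1, 3), (1, 5), (1, 6)]
four_no = four_yes + [(3, 2)]
two_yes = [(3, 3), (1, 3), (1, 5)]
two_no = [(3, 3), (1, 3), (1, 5), (1, 6)]
```

In the last example a = 4 (alleles 1, 3, 5, 6) and R = 2 (allele 3 is homozygous, allele 1 appears with three others), so a + R = 6 and the group is rejected.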
Our Approach: Sibling Reconstruction
Given: n diploid individuals sampled at l loci
Find: minimum number of 2-allele sets that contain all individuals
• NP-complete even when we know sibsets are at most 3
• 1.0065 approximation gap (Ashley et al. ’09)
• ILP formulation (Chaovalitwongse et al. ’07, ’10)
• Minimum Set Cover based algorithm with optimal solution (using CPLEX) (Berger-Wolf et al. ’07)
• Parallel implementation (Sheikh, Khokhar, Berger-Wolf ’10)
Canonical families
Observed alleles at a locus and the same genotypes with alleles relabeled as small integers:
ID  Observed alleles  Canonical alleles
1   55/43             1/2
2   43/114            2/3
3   43/55             2/1
4   55/114            1/3
5   114/43            3/2
6   55/78             1/4
Possible canonical genotypes: 1/1, 1/2, 1/3, 1/4, 2/2, 2/3, 2/4, 3/4, 3/3, 4/4
[Figure: canonical families, drawn as sets of genotypes over alleles 1 to 4]
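The two ID/alleles tables (observed sizes such as 55/43 and small integers such as 1/2) are consistent with relabeling alleles in order of first appearance. The sketch below assumes that relabeling scheme; `canonicalize` is a hypothetical helper name, not taken from the slides.

```python
def canonicalize(genotypes):
    """Relabel alleles 1, 2, 3, ... in order of first appearance (assumed scheme)."""
    mapping = {}
    relabeled = []
    for x, y in genotypes:
        for allele in (x, y):
            if allele not in mapping:
                mapping[allele] = len(mapping) + 1
        relabeled.append((mapping[x], mapping[y]))
    return relabeled, mapping


# Observed alleles for individuals 1..6 from the slide.
observed = [(55, 43), (43, 114), (43, 55), (55, 114), (114, 43), (55, 78)]
canonical, mapping = canonicalize(observed)
# canonical == [(1, 2), (2, 3), (2, 1), (1, 3), (3, 2), (1, 4)]
```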
Aside: Minimum Set Cover
Given: universe U = {1, 2, …, n} and a collection of sets S = {S1, S2, …, Sm}, where each Si is a subset of U
Find: the smallest number of sets in S whose union is the universe U
$\min_{I \subseteq [m]} |I|$ such that $\bigcup_{i \in I} S_i = U$
Minimum Set Cover is NP-hard
(1 + ln n)-approximable (sharp)
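The (1 + ln n) bound is achieved by the standard greedy heuristic, sketched below. This is not the optimal CPLEX-based solver mentioned earlier in the talk; it is the textbook greedy algorithm, and the candidate sets in the usage lines are hypothetical (the first three are the sibling groups from The Problem slide, the last two are made-up extras).

```python
def greedy_set_cover(universe, sets):
    """Repeatedly pick the set covering the most still-uncovered elements."""
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("the given sets do not cover the universe")
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


universe = range(1, 9)
candidates = [{2, 4, 5, 6}, {1, 3}, {7, 8}, {1, 2}, {3, 4}]
cover = greedy_set_cover(universe, candidates)  # indices into candidates
```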
Are we done?
Challenges:
• No ground truth available
• Growing number of methods
• Biologists need (one) reliable reconstruction
• Genotyping errors
Answer: Consensus
"Consensus is what many people say in chorus but do not believe as individuals."
Abba Eban (1915-2002), Israeli diplomat, in "The New Yorker," 23 Apr 1990
Consensus Methods
• Combine multiple solutions to a problem to generate one unified solution
• C: S* → S
• Based on Social Choice Theory
• Commonly used where the real solution is not known, e.g. phylogenetic trees
[Diagram: S1, S2, ..., Sk → Consensus → S]
Error-Tolerant Approach (Sheikh et al. ’08)
[Diagram: Locus 1, Locus 2, Locus 3, ..., Locus l → Sibling Reconstruction Algorithm → S1, S2, ..., Sk → Consensus → S]
Distance-based Consensus
[Diagram: S1, S2, ..., Sk → Consensus → Ss → Search (using fq and fd) → S]
Algorithm:
– Compute a consensus solution S = {g1, ..., gk}
– Search for a good solution near S
NP-hard for any fd, fq, or an arbitrary linear combination (Sheikh et al. ’08)
A Greedy Approach - Algorithm
• Compute a strict consensus
• While total distance is not too large: merge two sibgroups with minimal (total) distance
Quality: fq = n − |C|
Distance function from solution C to C’:
fd(C, C’) = sum of costs of merging groups in C to obtain C’
          = sum of costs of assigning individuals to groups
Cost of assigning an individual to a group:
• Benefit: alleles and allele pairs shared
• Cost: minimum edit distance
Auto Greedy Consensus
• Change costs to average per-locus costs
• Compare max group error on a per-locus basis
• Treat cost and benefit independently
In order to qualify, a merge must satisfy:
• Cost <= maxcost
• Benefit >= minbenefit
• Benefit = max benefit among possible merges
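The basic greedy loop (merge the cheapest pair while the total distance stays small) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it ignores the benefit criterion, takes the merge cost from a lookup table, and treats merges it has no cost for as infinitely expensive. The numbers are the distance matrix from the worked example on the "A Greedy Approach" slide; the budget `max_total=0.5` is a hypothetical threshold.

```python
def greedy_merge(groups, cost, max_total):
    """Merge the cheapest pair of groups until the distance budget is exhausted."""
    groups = list(groups)
    total = 0.0
    while len(groups) > 1:
        src, dst = min(
            ((s, d) for s in groups for d in groups if s != d),
            key=lambda pair: cost(pair[0], pair[1]),
        )
        c = cost(src, dst)
        if total + c > max_total:
            break  # the next merge would make the total distance too large
        total += c
        groups.remove(src)
        groups.remove(dst)
        groups.append(src | dst)
    return groups, total


labels = [frozenset({1, 2}), frozenset({3}), frozenset({4}),
          frozenset({5}), frozenset({6, 7})]
rows = [  # rows[i][j]: distance from labels[i] to labels[j]; diagonal unused
    [None, 3.5, 1.1, 2.5, 5.1],
    [0.5, None, 0.3, 0.5, 0.1],
    [1.0, 3.0, None, 0.6, 1.1],
    [2.0, 1.2, 3.5, None, 4.9],
    [0.6, 0.9, 1.2, 4.1, None],
]
table = {(labels[i], labels[j]): rows[i][j]
         for i in range(5) for j in range(5) if i != j}


def table_cost(src, dst):
    # Unknown pairs (for example, newly merged groups) are treated as unmergeable.
    return table.get((src, dst), float("inf"))


merged, total = greedy_merge(labels, table_cost, max_total=0.5)
```

With this budget only the cheapest merge, {3} with {6,7} at cost 0.1, is accepted; the next candidate ({4} with {5}, cost 0.6) would exceed it.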
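A Greedy Approach
S1 = { {1,2,3}, {4,5}, {6,7} }
S2 = { {1,2,3}, {4}, {5,6,7} }
S3 = { {1,2}, {3,4,5}, {6,7} }
Strict Consensus
S = { {1,2}, {3}, {4}, {5}, {6,7} }
Distances between groups (diagonal omitted):
        {1,2}  {3}   {4}   {5}   {6,7}
{1,2}   -      3.5   1.1   2.5   5.1
{3}     0.5    -     0.3   0.5   0.1
{4}     1.0    3.0   -     0.6   1.1
{5}     2.0    1.2   3.5   -     4.9
{6,7}   0.6    0.9   1.2   4.1   -
The smallest entry, 0.1, pairs {3} with {6,7}, so these two groups are merged.
Distances after the merge (as shown on the slide, which still lists {6,7}):
          {1,2}  {3,6,7}  {4}   {5}   {6,7}
{1,2}     -      3.5      1.1   2.5   5.1
{3,6,7}   1.7    -        3.1   2.2   6.1
{4}       1.0    3.0      -     0.6   1.1
{5}       2.0    1.2      3.5   -     4.9
{6,7}     0.6    0.9      1.2   4.1   -
Before the merge: S = { {1,2}, {3}, {4}, {5}, {6,7} }
After the merge: S = { {1,2}, {3,6,7}, {4}, {5} }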
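The strict consensus in this example keeps two individuals together only if every input solution puts them in the same group. A minimal sketch, assuming each individual appears exactly once in every solution (`strict_consensus` is a hypothetical helper name):

```python
from collections import defaultdict


def strict_consensus(solutions):
    """Group individuals that share a group in every input solution."""
    signature = defaultdict(list)  # individual -> its group index in each solution
    for solution in solutions:
        for index, group in enumerate(solution):
            for individual in group:
                signature[individual].append(index)
    groups = defaultdict(set)
    for individual, sig in signature.items():
        groups[tuple(sig)].add(individual)
    return sorted(groups.values(), key=min)


s1 = [{1, 2, 3}, {4, 5}, {6, 7}]
s2 = [{1, 2, 3}, {4}, {5, 6, 7}]
s3 = [{1, 2}, {3, 4, 5}, {6, 7}]
consensus = strict_consensus([s1, s2, s3])
# consensus == [{1, 2}, {3}, {4}, {5}, {6, 7}]
```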
Testing and Validation: Protocol
1. Get a dataset with known sibgroups (real or simulated)
2. Find sibgroups using our algorithm
3. Compare the solutions
• Partition distance (Gusfield ’03) = assignment problem
• Compare to other sibship methods: Family Finder, COLONY
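Partition distance is the minimum number of individuals that must be removed so that two partitions become identical; it equals n minus the best one-to-one matching of groups by overlap, which is an assignment problem. The sketch below solves that assignment by brute force over permutations, which is only practical for a handful of groups; a real implementation would use a polynomial-time assignment solver. The function name and example partitions are illustrative.

```python
from itertools import permutations


def partition_distance(p, q):
    """Minimum number of elements to delete so that partitions p and q coincide."""
    p = [set(g) for g in p]
    q = [set(g) for g in q]
    n = sum(len(g) for g in p)
    k = max(len(p), len(q))
    p += [set()] * (k - len(p))  # pad with empty groups so both sides have k groups
    q += [set()] * (k - len(q))
    overlap = [[len(a & b) for b in q] for a in p]
    best = max(
        sum(overlap[i][perm[i]] for i in range(k))
        for perm in permutations(range(k))
    )
    return n - best


true_groups = [{1, 2, 3}, {4, 5}, {6, 7}]
found_groups = [{1, 2, 3}, {4}, {5, 6, 7}]
d = partition_distance(true_groups, found_groups)  # removing individual 5 suffices
```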
Test Data
• Salmon (Salmo salar), Herbinger et al., 1999: 351 individuals, 6 families, 4 loci. No missing alleles.
• Shrimp (Penaeus monodon), Jerry et al., 2006: 59 individuals, 13 families, 7 loci. Some missing alleles.
• Ants (Leptothorax acervorum), Hammond et al., 2001: ants are a haplodiploid species. The data consist of 377 diploid worker ants.
• Simulated populations of juveniles for a range of values of number of parents, offspring per parent, alleles per locus, number of loci, and the distributions of those.
Experimental Protocol
1. Generate F females and M males (F = M = 5, 10, 20), each with l loci (l = 2, 4, 6, 8, 10), each locus with a alleles (a = 10, 15)
2. Generate f families (f = 5, 10, 20); for each family select a female and a male uniformly at random
3. For each parent pair generate o offspring (o = 5, 10); for each offspring, at each locus, choose the allele outcome uniformly at random
4. Introduce random errors
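A minimal sketch of the simulation steps, under stated assumptions: parental alleles are drawn uniformly from 1..a (the slides also mention other distributions), the same parent pair may be drawn for more than one family, and the error-introduction step is left out. The function and parameter names are made up for this example.

```python
import random


def simulate(num_parents=5, loci=4, alleles=10, families=5, offspring=5, seed=0):
    """Return (children, truth): genotypes per child and the family index of each."""
    rng = random.Random(seed)

    def random_parent():
        # One (allele, allele) pair per locus, alleles uniform in 1..alleles (assumption).
        return [(rng.randint(1, alleles), rng.randint(1, alleles)) for _ in range(loci)]

    females = [random_parent() for _ in range(num_parents)]
    males = [random_parent() for _ in range(num_parents)]
    children, truth = [], []
    for family in range(families):
        mother, father = rng.choice(females), rng.choice(males)
        for _ in range(offspring):
            # At each locus: one allele from the father, one from the mother.
            child = [(rng.choice(f), rng.choice(m)) for f, m in zip(father, mother)]
            children.append(child)
            truth.append(family)
    return children, truth


children, truth = simulate(seed=1)
```

Because every child in a family draws from the same two parents, each simulated family satisfies the 4-allele rule at every locus before errors are added.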
Conclusions
• Combinatorial algorithms with minimal assumptions
• Behave well on real and simulated data
• Better than others with few loci, few large families
• Error tolerant
• Useful, high demand
New and improved:
• Efficient implementation (Perez-Rathke et al., in submission)
• Other objectives (bio vs. math) (Ashley et al. ’10)
• Other genealogical relationships (Sheikh et al. ’09, ’10)
• Different combinatorial approach (Brown & Berger-Wolf ’10)
• Pedigree amalgamation