Cleaning genotype data · Karl W Broman · biostat.wisc.edu/~kbroman/presentations/cleaning.pdf
TRANSCRIPT
1
Cleaning genotype data
Karl W Broman
2
Why talk about cleaning data?
• Gene hunters need high quality data
• Genetic studies are expensive, so we should ensure that the data is the best possible
• Cleaning data is tedious and requires care and experience
• Data cleaning is not always performed appropriately
3
Sources of errors
• Pedigree errors
– non-paternity
– unreported adoption/twinning
– errors in data entry
– sample mix-ups
• Genotyping errors
– errors in data entry
– misinterpretation of pattern on gel
– mutations may masquerade as errors
4
The cleaning process
• Verify the subjects’ sexes
• Verify relationships
– Families with many Mendelian inconsistencies
– Infer pairwise relationships using whole genome data [RelPair]
• Mendelian inheritance
– Find problem genotypes
– Identify responsible individuals [PedCheck]
• Beyond Mendel
– Unlikely multiple recombination events
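As a minimal sketch of the first check in the cleaning process: a male carries a single X chromosome, so a reported male who looks heterozygous at many X-linked markers (outside the pseudoautosomal regions) deserves a second look. The function names, the data layout (allele pairs keyed by marker name) and the 0.1 threshold are illustrative assumptions, not taken from the talk.

```python
from typing import Dict, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # (allele1, allele2), or None if missing


def x_heterozygosity(x_genotypes: Dict[str, Genotype]) -> Optional[float]:
    """Fraction of typed X-linked markers at which the individual is heterozygous."""
    typed = [g for g in x_genotypes.values() if g is not None]
    if not typed:
        return None  # no typed X markers, nothing to judge
    het = sum(1 for a, b in typed if a != b)
    return het / len(typed)


def sex_looks_wrong(reported_sex: str, x_genotypes: Dict[str, Genotype],
                    max_male_het: float = 0.1) -> bool:
    """Flag reported males with too many heterozygous X genotypes.

    max_male_het is an arbitrary, illustrative threshold that leaves
    room for a few genotyping errors.
    """
    rate = x_heterozygosity(x_genotypes)
    if rate is None or reported_sex != "M":
        return False
    return rate > max_male_het


# Hypothetical individual: marker names and allele sizes are made up.
suspect = {"X1": (201, 205), "X2": (150, 150), "X3": (98, 102), "X4": None}
print(sex_looks_wrong("M", suspect))
```

The check only runs in one direction: a reported female who is homozygous at every X marker is suspicious but far less conclusive, and Y-linked genotypes, where available, would give a complementary test.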
5
Example data
• 2141 individuals in 400 families (mostly sibships; a handful of smallish pedigrees)
• Weber screening set 9
– 366 autosomal markers
– 17 on X, 4 on Y
• 803,739 total genotypes (~ 3% dropped)
6
RelPair
For each pair of individuals, calculate
Pr(genetic data | R)
for R = MZ twins, parent/offspring, full sibs, half sibs, unrelated
Boehnke and Cox (1997) AJHG 61:423–429
Broman and Weber (1998) AJHG 63:1563–1564
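To make Pr(genetic data | R) concrete, here is a deliberately simplified sketch. It assumes Hardy-Weinberg proportions, no genotyping error, and independent (unlinked) markers; a real analysis of a genome screen would have to account for linkage between markers, so this is not a description of how RelPair itself computes the likelihood. All function names and the example allele frequencies are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Genotype = Tuple[int, int]   # unordered pair of allele labels
Freqs = Dict[int, float]     # allele -> population frequency

# Pr(sharing 0, 1, 2 alleles identical by descent) for each relationship R.
IBD_PROBS = {
    "MZ twins": (0.0, 0.0, 1.0),
    "parent/offspring": (0.0, 1.0, 0.0),
    "full sibs": (0.25, 0.5, 0.25),
    "half sibs": (0.5, 0.5, 0.0),
    "unrelated": (1.0, 0.0, 0.0),
}


def geno_prob(g: Genotype, freq: Freqs) -> float:
    """Hardy-Weinberg probability of an unordered genotype."""
    a, b = g
    return freq[a] ** 2 if a == b else 2 * freq[a] * freq[b]


def prob_given_shared(shared: int, g: Genotype, freq: Freqs) -> float:
    """Pr(genotype g | one allele is `shared`, the other is drawn from the population)."""
    a, b = g
    if a == b:
        return freq[a] if shared == a else 0.0
    if shared == a:
        return freq[b]
    if shared == b:
        return freq[a]
    return 0.0


def pair_prob(g1: Genotype, g2: Genotype, freq: Freqs, rel: str) -> float:
    """Pr(g1, g2 | R) at one marker, mixing over the IBD states."""
    k0, k1, k2 = IBD_PROBS[rel]
    p1 = geno_prob(g1, freq)
    ibd0 = p1 * geno_prob(g2, freq)
    ibd1 = p1 * 0.5 * (prob_given_shared(g1[0], g2, freq)
                       + prob_given_shared(g1[1], g2, freq))
    ibd2 = p1 if sorted(g1) == sorted(g2) else 0.0
    return k0 * ibd0 + k1 * ibd1 + k2 * ibd2


def log10_likelihoods(markers: List[Tuple[Genotype, Genotype, Freqs]]) -> Dict[str, float]:
    """Sum log10 Pr(data | R) over markers, treating markers as independent."""
    result = {}
    for rel in IBD_PROBS:
        total = 0.0
        for g1, g2, freq in markers:
            p = pair_prob(g1, g2, freq, rel)
            total += math.log10(p) if p > 0 else float("-inf")
        result[rel] = total
    return result


# Hypothetical data: three markers, four equally frequent alleles each.
freq = {1: 0.25, 2: 0.25, 3: 0.25, 4: 0.25}
markers = [((1, 2), (1, 2), freq), ((3, 4), (3, 4), freq), ((1, 1), (1, 3), freq)]
ll = log10_likelihoods(markers)
best = max(ll.values())
for rel, value in ll.items():
    print(f"{rel:18s} {value - best:8.2f}")  # 0 marks the best-supported relationship
```

Without an error model, a single mismatching genotype drives the MZ-twin likelihood to zero, which is one reason a practical tool needs to allow for genotyping error.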
7
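Family 239
log10 likelihood (full sibs, half sibs, unrelated) and IBS counts (0, 1, 2) for each relative pair

| Rel. pair | Full sibs | Half sibs | Unrelated | IBS 0 | IBS 1 | IBS 2 |
| --- | --- | --- | --- | --- | --- | --- |
| 1342,1343 | -11 | 0 | -27 | 73 | 230 | 57 |
| 1342,1344 | -12 | 0 | -34 | 61 | 252 | 45 |
| 1343,1344 | 0 | -32 | -110 | 39 | 174 | 140 |

[Pedigree diagram: individuals 1342, 1343, 1344]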
8
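Family 403
log10 likelihood (full sibs, half sibs, unrelated) and IBS counts (0, 1, 2) for each relative pair

| Rel. pair | Full sibs | Half sibs | Unrelated | IBS 0 | IBS 1 | IBS 2 |
| --- | --- | --- | --- | --- | --- | --- |
| 2190,2191 | 0 | -26 | -98 | 47 | 190 | 125 |
| 2190,2192 | 0 | -28 | -111 | 27 | 204 | 127 |
| 2191,2192 | 0 | -48 | -141 | 24 | 189 | 148 |

[Pedigree diagram: individuals 2190, 2191, 2192]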
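The IBS columns in the family examples count, for each pair, the markers at which the two individuals share 0, 1 or 2 alleles identical by state. A minimal sketch of that tally follows; the function names, data layout and example genotypes are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # (allele1, allele2), or None if missing


def ibs_state(g1: Tuple[int, int], g2: Tuple[int, int]) -> int:
    """Number of alleles (0, 1 or 2) that two genotypes share identical by state."""
    a, b = g1
    c, d = g2
    # Try both ways of pairing up the alleles and keep the better match.
    return max((a == c) + (b == d), (a == d) + (b == c))


def ibs_counts(ind1: Dict[str, Genotype], ind2: Dict[str, Genotype]) -> Tuple[int, int, int]:
    """Count markers with IBS 0, 1 and 2, skipping markers missing in either individual."""
    counts = [0, 0, 0]
    for marker, g1 in ind1.items():
        g2 = ind2.get(marker)
        if g1 is None or g2 is None:
            continue
        counts[ibs_state(g1, g2)] += 1
    return counts[0], counts[1], counts[2]


# Hypothetical genotypes: marker names and allele sizes are made up.
sib_a = {"M1": (267, 259), "M2": (205, 193), "M3": (150, 150), "M4": None}
sib_b = {"M1": (259, 267), "M2": (209, 193), "M3": (148, 152), "M4": (101, 103)}
print(ibs_counts(sib_a, sib_b))
```

Pairs that are expected to be full sibs but show many IBS 0 markers, or pairs with almost every marker at IBS 2, are the ones worth following up with the likelihood calculation.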
9
Pedigree errors

| Error type | Count |
| --- | --- |
| Non-paternity (10: no data on dad; 1: maybe dad’s brother) | 17 |
| MZ twins (including 1 set MZ triplets, 1 father/son) | 12 |
| Sample switch (mother/daughter) | 1 |
| Full sibs reported to be half sibs | 1 |
| Completely unrelated | 2 |
| Total | 33 |
10
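Apparent error rate (via unreported twins)

| Fam | Individuals | IBS 0 | IBS 1 | IBS 2 |
| --- | --- | --- | --- | --- |
| 25 | 126,131 | 0 | 1 | 348 |
| 81 | 396,397 | 2 | 36 | 298 |
| 107 | 537,539 | 0 | 1 | 346 |
| 127 | 628,629 | 0 | 0 | 357 |
| 149 | 742,744,745 | 0 | 3 | 348 |
| 173 | 877,878 | 1 | 27 | 323 |
| 212 | 1127,1129 | 0 | 0 | 361 |
| 231 | 1258,1259 | 0 | 2 | 349 |
| 251 | 1411,1412 | 0 | 0 | 359 |
| 295 | 1628,1630 | 0 | 0 | 358 |
| 330 | 1820,1826 | 0 | 0 | 351 |
| 393 | 2139,2140 | 0 | 4 | 343 |

Overall error rate = 0.91%
Bad samples = 5.1%
Good samples = 0.15%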
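The slide does not say how the error rates were derived from the twin comparisons. One plausible reading, sketched here as an assumption, is that MZ twins should match at every marker, so each mismatching marker (IBS 0 or 1) reflects one erroneous genotype among the two compared, giving a per-genotype rate of mismatches / (2 × markers compared). The counts are copied from the slide's table; the function name is hypothetical.

```python
# (family, IBS 0, IBS 1, IBS 2) for the unreported twin sets, as listed on the slide
TWIN_IBS = [
    (25, 0, 1, 348), (81, 2, 36, 298), (107, 0, 1, 346), (127, 0, 0, 357),
    (149, 0, 3, 348), (173, 1, 27, 323), (212, 0, 0, 361), (231, 0, 2, 349),
    (251, 0, 0, 359), (295, 0, 0, 358), (330, 0, 0, 351), (393, 0, 4, 343),
]


def apparent_error_rate(rows):
    """Per-genotype error rate, ASSUMING one bad genotype per mismatching marker."""
    mismatches = sum(ibs0 + ibs1 for _, ibs0, ibs1, _ in rows)
    compared = sum(ibs0 + ibs1 + ibs2 for _, ibs0, ibs1, ibs2 in rows)
    # Each compared marker involves two genotypes, hence the factor of 2.
    return mismatches / (2 * compared)


print(f"{100 * apparent_error_rate(TWIN_IBS):.2f} %")
```

Under this reading the overall figure agrees with the slide to two decimals. How the "bad" and "good" sample rates were split is not stated, so they are not attempted here.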
11
PedCheck
1. Check on nuclear family level
– easy inconsistencies
2. Genotype elimination
– find all inconsistencies
3. Determine critical genotypes
– removing one individual’s genotype eliminates inconsistency
4. For critical genotypes, calculate odds ratio to determine the most likely error.
O’Connell and Weeks (1998) AJHG 63:259–266
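The first of these steps, the nuclear-family check, can be sketched as follows. This is an illustration of the idea only, not PedCheck's actual implementation; the function names and the example genotypes are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Genotype = Optional[Tuple[int, int]]  # (allele1, allele2), or None if untyped


def child_consistent(child: Genotype, father: Genotype, mother: Genotype) -> bool:
    """True if the child's genotype can be explained by Mendelian transmission."""
    if child is None:
        return True  # nothing to check
    a, b = child
    if father is not None and mother is not None:
        # One allele must come from each parent.
        return (a in father and b in mother) or (b in father and a in mother)
    known = father if father is not None else mother
    if known is None:
        return True  # neither parent typed
    # With one typed parent, the child must carry at least one of that parent's alleles.
    return a in known or b in known


def nuclear_family_errors(father: Genotype, mother: Genotype,
                          children: Dict[str, Genotype]) -> List[str]:
    """IDs of children whose genotype is inconsistent with the parents at this marker."""
    return [cid for cid, g in children.items()
            if not child_consistent(g, father, mother)]


# Hypothetical nuclear family at one marker (allele sizes are made up).
father = (267, 236)
mother = (259, 256)
children = {"c1": (267, 259), "c2": (236, 256), "c3": (255, 244), "c4": None}
print(nuclear_family_errors(father, mother, children))
```

A trio-level check like this catches only the easy cases. It cannot tell whether the child or a parent carries the wrong genotype, and it misses inconsistencies that are visible only across the whole family (for example, more than four distinct alleles among siblings with untyped parents), which is where the later steps come in.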
12
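[Pedigree diagram: Family 403, marker D1S2141. Genotype labels in extraction order (pedigree positions not recoverable): 267/236, 267/256, 267/259, 267/256, 259/236, 259/259, 259/259, 267/255, 255/255, 255/244, 267/255, 259/-, 255/244, 259/256]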
13
[Pedigree diagram: Family 195, marker GATA49D12. Genotype labels in extraction order (pedigree positions not recoverable): 205/193, 205/193, 205/189, 209/193, 209/205, 209/193, 209/193, 217/209, 213/193, 209/193, 209/205, 209/205, 209/193]
14
The result of the Mendel checks
• Removed 2,216/801,848 genotypes (~ 0.3%)
• Mismatches in unreported twins:
– 51/3,869 (~ 1.3%) prior to Mendel checks
– 50/3,868 (~ 1.3%) after Mendel checks
15
Beyond Mendel
Individual 195-1032, chromosome 9
CRI-MAP chrompic output
ii--ii-ii-io-oo-ioiiiiooooooo-o-
15 cM
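A pattern like the one on this slide can be screened automatically. The sketch below assumes that `i` and `o` mark the two grandparental origins of the transmitted allele and that `-` marks an uninformative marker; under that assumption, every switch between `i` and `o` is an apparent recombination event, and an isolated single-marker run in the middle of the chromosome implies two events side by side, which is more often a genotyping error than real biology. The function names are hypothetical, and marker spacing is ignored.

```python
from itertools import groupby
from typing import List, Tuple


def informative(pattern: str) -> List[str]:
    """Keep only informative positions (assumed to be 'i' and 'o')."""
    return [c for c in pattern if c in "io"]


def count_switches(pattern: str) -> int:
    """Number of apparent recombination events along the chromosome."""
    calls = informative(pattern)
    return sum(1 for prev, cur in zip(calls, calls[1:]) if prev != cur)


def runs(pattern: str) -> List[Tuple[str, int]]:
    """Consecutive informative markers with the same origin, as (state, length)."""
    return [(state, len(list(group))) for state, group in groupby(informative(pattern))]


def suspicious_runs(pattern: str) -> List[int]:
    """Indices of interior runs of length 1: candidates for genotyping errors."""
    r = runs(pattern)
    return [i for i in range(1, len(r) - 1) if r[i][1] == 1]


# The chrompic line from the slide (individual 195-1032, chromosome 9).
pattern = "ii--ii-ii-io-oo-ioiiiiooooooo-o-"
print(count_switches(pattern))
print(suspicious_runs(pattern))
```

A fuller version would weigh each run by the genetic distance it spans, since several switches within a short interval are far less plausible than the same number spread over a whole chromosome.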
16
Conclusions
• Cleaning data requires care and experience
• Fix pedigree errors prior to removing genotypes
• The process could be more automated, but now requires a human brain
• Biologists should learn to program (especially perl)
• Analysts should participate in data cleaning
• Data on diallelic markers will require new techniques