In silico blood genotyping from exome sequencing data
Silvio Tosatto
BioComputing UP, Department of Biology, University of Padova, Italy
URL: http://protein.bio.unipd.it/
Today
• Personalized genetics has been upon us for some time
• How good are we at actually identifying phenotype from whole genome?
The CAGI Personal Genome Project (PGP) Challenge
• Few goals are more pure to genome interpretation than predicting traits from raw sequence (or genotype) data
• In this CAGI challenge, phenotypes/traits are predicted for real people with genetic data
• Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10)
Dataset provided by George Church
Personal genome project (PGP) ‐ Predict individuals’ phenotype
Numerical traits
33. Birth weight (in g)
34. HDL level (in mg/dL) *
35. LDL level (in mg/dL) *
36. Triglyceride level (in mg/dL) *
37. Fasting blood glucose level (in mg/dL)
38. Warfarin dose (in mg)
39. Age at Menarche
40. Annual income (in $)
Blood Groups
• Clear genetic cause of phenotypes
• Model system for phenotype prediction
• Good description in literature
• High relevance, especially for blood transfusions
(Blood. 2009;114: 248-256)
Example: ABO glycosyltransferase
Blood Grp   Genes   Antigens
ABO         ABO     A, B, O
Amino acid residues differing between blood group A- and B-active transferases, respectively (Arg176Gly; Gly235Ser; Leu266Met; Gly268Ala), are shown with the single-letter code and their positions indicated.
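As an illustration of how these four positions separate the two transferases, here is a minimal Python sketch. It assumes the residues at positions 176, 235, 266 and 268 have already been extracted from the translated sequence. The helper name `classify_transferase` and the all-four-must-agree policy are assumptions made for this example, not part of the talk.

```python
# Residues distinguishing A- from B-active transferases, as listed above:
# Arg176Gly, Gly235Ser, Leu266Met, Gly268Ala (A residue first, B residue second).
AB_SITES = {
    176: ("R", "G"),
    235: ("G", "S"),
    266: ("L", "M"),
    268: ("G", "A"),
}


def classify_transferase(residues):
    """Return 'A', 'B' or 'ambiguous' for a dict {position: one-letter residue}.

    Hypothetical helper: all four sites must agree to make a call.
    """
    a_hits = sum(residues.get(pos) == a for pos, (a, _) in AB_SITES.items())
    b_hits = sum(residues.get(pos) == b for pos, (_, b) in AB_SITES.items())
    if a_hits == len(AB_SITES):
        return "A"
    if b_hits == len(AB_SITES):
        return "B"
    return "ambiguous"
```

Requiring all four sites to agree is a deliberately conservative choice for the sketch. Mixed or missing residues are reported as ambiguous rather than guessed.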
Relevant Blood Types
Blood Grp   Genes               Antigens
ABO         ABO                 A, B, O
RH          RHCE, RHD           D, E, C plus 50 minor
DUFFY       DARC                FY(a), FY(b)
Kell        KEL                 K1, K2 plus 23 minor
Diego       SLC4A1              Dia, Dib, Wra, Wrb
Kidd        SLC14A1             Jk(a), Jk(b)
Lewis       FUT3                a, b
Lutheran    BCAM                Lu(a), Lu(b) plus 15 minor
MNS         GYPA, GYPB, GYPE    M, N, S plus 40 minor
Bombay      FUT1, FUT2          H, secretor
10 out of ca. 30 blood groups are relevant for transfusions
BOOGIE: BlOOd Group IdEntifier
• A knowledge-based system to predict blood groups from sequencing data
• All 10 groups relevant for blood transfusions are predicted
• A specialized genotype-phenotype knowledge base is required
BOOGIE: Knowledge representation
• Stored in tree-like structure
• Rules expressed in “if <mutation(s)> then <phenotype(s)>” form
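The rule form above can be sketched in Python as follows. This is not BOOGIE's actual data model. The nested dictionary only stands in for the tree-like structure, the identifiers such as `GENE1:mut1` are placeholders rather than real rules, and the tie-break (the most specific matching rule wins) is an assumption made for the example.

```python
# Hypothetical knowledge base: blood group system -> list of rules.
# Each rule is (required mutations, predicted phenotype).
# All identifiers are placeholders, not real BOOGIE rules.
KNOWLEDGE_BASE = {
    "SystemX": [
        (frozenset({"GENE1:mut1"}), "phenotype 1"),
        (frozenset({"GENE1:mut1", "GENE1:mut2"}), "phenotype 2"),
    ],
    "SystemY": [
        (frozenset({"GENE2:mut3"}), "phenotype 3"),
    ],
}


def predict(observed, knowledge_base=KNOWLEDGE_BASE):
    """Return {system: phenotype} for rules whose mutations are all observed.

    Assumption: if several rules of one system match, the rule with the
    largest mutation set (the most specific one) is chosen.
    """
    observed = set(observed)
    result = {}
    for system, rules in knowledge_base.items():
        matching = [rule for rule in rules if rule[0] <= observed]
        if matching:
            mutations, phenotype = max(matching, key=lambda rule: len(rule[0]))
            result[system] = phenotype
    return result
```

For instance, `predict({"GENE1:mut1", "GENE1:mut2"})` selects the two-mutation rule of the placeholder system and returns `{"SystemX": "phenotype 2"}`.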
BOOGIE: Knowledge collection
– Manually curated
– 580 rules derived
Relevant variants
• ANNOVAR (Wang et al., Nucleic Acids Research 2010) is used to reduce the SNVs to a manageable number
• Filtering steps, from millions of SNVs down to a few relevant SNVs:
– Gene-based annotation of variants
– Select conserved positions
– Remove unrelated genes
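The "remove unrelated genes" step can be illustrated with a short Python sketch. It assumes each variant has already been annotated with a gene symbol (for example by ANNOVAR) and loaded into a dictionary. The `gene` key and the function name `keep_relevant` are hypothetical, and the gene set is simply the list of genes named in this talk.

```python
# Genes named in this talk for the ten transfusion-relevant blood groups.
BLOOD_GROUP_GENES = {
    "ABO", "RHCE", "RHD", "DARC", "KEL", "SLC4A1", "SLC14A1",
    "FUT3", "BCAM", "GYPA", "GYPB", "GYPE", "FUT1", "FUT2",
}


def keep_relevant(variants, genes=BLOOD_GROUP_GENES):
    """Keep only variants annotated to one of the given genes.

    Each variant is a dict with a hypothetical 'gene' key holding the
    gene symbol assigned during annotation; variants without it are dropped.
    """
    return [v for v in variants if v.get("gene") in genes]


# Toy input with made-up coordinates, for illustration only.
example = [
    {"gene": "ABO", "pos": 1},
    {"gene": "TP53", "pos": 2},
    {"pos": 3},
]
relevant = keep_relevant(example)
```

Here `relevant` contains only the ABO entry. The other two variants fall outside the gene set or carry no annotation.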
BOOGIE Pipeline
Benchmarking
• BOOGIE covers all known blood group variants
• Difficulty in finding genome sequences with known blood phenotypes
• Personal Genome Project (PGP) as annotated benchmark set
Personal Genome Project (PGP)
The mission of the PGP is to encourage the development of personal genomics
• Genetic information for 10 individuals from the Personal Genome Project is provided (PGP-10)
• A larger dataset (PGP-1K) aims to cover at least 1,000 genomes
Unfortunately, only ABO and Rh blood group information is available
PGP-10 Data
Back row (left to right): James Sherley, Misha Angrist, John Halamka, Keith Batchelder, Rosalynn Gill.
Front row (left to right): Esther Dyson, George Church, Kirk Maxey.
Not shown: Stan Lapidus and Steven Pinker.
PGP-10 Results
            PGP1               PGP4               PGP8
Known       O +                A -                B +
ABO         O                  A                  B
Rh          c; e; weak D       c; e; weak D       c; e; weak D
DUFFY       FY(a+); FY(b-)     FY(a-); FY(b+)     FY(a-); FY(b+)
Diego       Dib; Memph neg     Dib; Memph neg     Dib; Memph neg
KIDD        Jk(a-); Jk(b+)     Jk(a-); Jk(b+)     Jk(a+); Jk(b-)
Lewis       negative           negative           negative
MNS         M; S               M; s               M, s
Bombay      H+; secretor       H+; secretor       H+; secretor

KELL (identical for PGP1, PGP4 and PGP8): K2; K21+; K4-; K3-; K11; K17; K14; K24; K6+; K7-

Lutheran
PGP1: Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4; Lu8+; Aua+; Aub-
PGP4: Lu(a-); Lu(b+); Lu6-; Lu9+; Lu4-; Lu8+; Aua-; Aub+
PGP8: Lu(a-); Lu(b+); Lu6+; Lu9-; Lu4-; Lu8+; Aua+; Aub-
BOOGIE correctly predicts all ABO types and all Rh groups except one (PGP-4)
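The comparison between known and predicted types can be written as a small Python check. The sketch covers only the three individuals shown in the table (PGP1, PGP4, PGP8), so its fractions are not the figures reported for the full benchmark. Treating any D call, including "weak D", as Rh positive is an assumption made here for illustration.

```python
# Known types and predicted calls for the three individuals in the table.
known = {"PGP1": ("O", "+"), "PGP4": ("A", "-"), "PGP8": ("B", "+")}
predicted = {
    "PGP1": ("O", "c; e; weak D"),
    "PGP4": ("A", "c; e; weak D"),
    "PGP8": ("B", "c; e; weak D"),
}


def rh_sign(rh_call):
    # Assumption: a call ending in "D" (including "weak D") means Rh positive.
    return "+" if rh_call.strip().endswith("D") else "-"


def accuracy(known, predicted):
    """Return (ABO accuracy, Rh accuracy) as fractions of individuals."""
    n = len(known)
    abo_ok = sum(known[p][0] == predicted[p][0] for p in known)
    rh_ok = sum(known[p][1] == rh_sign(predicted[p][1]) for p in known)
    return abo_ok / n, rh_ok / n


abo_acc, rh_acc = accuracy(known, predicted)
```

On these three individuals the ABO calls all agree, while the Rh call for PGP4 (known Rh negative, predicted weak D) is the single mismatch.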
PGP-1K Results
• A second dataset was built from all PGP-1K participants with available blood group information, for a total of 22 individuals
• This dataset contains microarray data (23andMe SNPs)
P = predicted; R = real; * = blood-group-relevant SNPs missing from the dataset
Conclusions
• We developed a method, called BOOGIE, to predict the ten blood groups relevant for transfusions from sequencing data
– Specialized knowledgebase with 580 genotype to phenotype rules
– Novel variants can be easily considered
• Benchmarking was (so far) only possible on PGP data for the ABO and Rh blood groups
– The ABO and Rh systems are correctly predicted in 85-100% of cases
– The Rh- type presents some additional difficulties
Acknowledgements
Manuel Giollo
Giovanni Minervini
Marta Scalzotto (not shown)
Emanuela Leonardi
Carlo Ferrari
URL: http://protein.bio.unipd.it/
Funding
FIRB Futuro in Ricerca
Università di Padova
CARIPLO
AIRC