milkER – a milk informatics resource
Stephen Edwards BSc.
University of Edinburgh
BioNLP meeting 6th June 2005
Overview
Aims of milkER
milkER database
Text-mining
Potential targets
milkER aims
To amalgamate dispersed milk information into one resource, allowing more focused analysis of milk proteins in relation to dairy issues, health and disease.
A milk database
Knowledge on milk affects many industries
UniProt, GenBank excellent resources
Marsupial genomics database (New Zealand)
Glasgow genomics data
Chinese database
Polish bioactive peptide database
Food property database (commercial)
Milk components
Fat, carbohydrates, proteins, minerals
Growth factors, enzymes, enzyme inhibitors, immunoglobulins, allergens, disease factors, anti-bacterial proteins, opioids
1. Deliberate
2. Leakage from blood
3. Result of disease conditions
4. Engineered
5. Bacterial origin
milkER database
Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries
LOCUS       NM_173929   790 bp   mRNA   linear   MAM 27-OCT-2004
DEFINITION  Bos taurus lactoglobulin, beta (LGB), mRNA.
ACCESSION   NM_173929
VERSION     NM_173929.2  GI:31343239
KEYWORDS    .
SOURCE      Bos taurus (cow)
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Cetartiodactyla; Ruminantia; Pecora; Bovidae;
            Bovinae; Bos.
REFERENCE   1  (bases 1 to 790)
  AUTHORS   Jayat,D., Gaudin,J.C., Chobert,J.M., Burova,T.V., Holt,C.,
            McNae,I., Sawyer,L. and Haertle,T.
  TITLE     A recombinant C121S mutant of bovine beta-lactoglobulin is more
            susceptible to peptic digestion and to denaturation by reducing
            agents and heating
  JOURNAL   Biochemistry 43 (20), 6312-6321 (2004)
   PUBMED   15147215
  REMARK    GeneRIF: Results suggest that the stability of beta-lactoglobulin
            arising from the hydrophobic effect is reduced by the C121S
            mutation so that unfolded or partially unfolded states are more
            favored.
ORIGIN
        1 actccactcc ctgcagagct cagaagcgtg atcccggctg cagccatgaa gtgcctcctg
       61 cttgccctgg ccctcacctg tggcgcccag gccctcatcg tcacccagac catgaagggc
…..
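As a rough illustration of what incorporating such an entry involves, the sketch below pulls a few header fields out of a GenBank flat-file record using only the Python standard library. It assumes the standard layout in which keywords occupy the first 12 columns and continuation lines are indented past them. The function name `parse_genbank_header` and the choice of fields are assumptions made for this example; it is not the BioSQL loading code behind milkER, and a production pipeline would normally rely on an established parser.

```python
WANTED = {"LOCUS", "DEFINITION", "ACCESSION", "VERSION", "SOURCE"}

def parse_genbank_header(record):
    """Collect selected top-level header fields (hypothetical helper)."""
    fields = {}
    current = None
    for line in record.splitlines():
        key = line[:12].strip()    # keyword column (columns 1-12)
        value = line[12:].strip()  # field text (column 13 onwards)
        if key == "ORIGIN":
            break                  # sequence data follows; header is done
        if key:
            # A new keyword or sub-keyword (e.g. ORGANISM) ends the old field.
            current = key if key in WANTED else None
            if current:
                fields[current] = value
        elif current and value:
            # Continuation line: append to the field being collected.
            fields[current] += " " + value
    return fields

# Shortened copy of the entry shown above, built line by line so the
# column alignment is explicit.
RECORD = "\n".join([
    "LOCUS       NM_173929 790 bp mRNA linear MAM 27-OCT-2004",
    "DEFINITION  Bos taurus lactoglobulin, beta (LGB), mRNA.",
    "ACCESSION   NM_173929",
    "VERSION     NM_173929.2 GI:31343239",
    "SOURCE      Bos taurus (cow)",
    "  ORGANISM  Bos taurus",
    "ORIGIN",
    "        1 actccactcc ctgcagagct cagaagcgtg",
])

header = parse_genbank_header(RECORD)
print(header["ACCESSION"], "|", header["DEFINITION"])
```

Fields such as the accession and definition are the kind of values a relational schema would index, which is why they are singled out here.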
[Diagram: milkER population. Labels: EMBL, UniProt, Other Databases, Other Sources (e.g. published tables), Information retrieval, Information extraction, milkER, Web Query]
milkER database
Database using BioSQL which allows incorporation of UniProt, EMBL, GenBank entries
Library of literature on milk
User interface (www.milker.org.uk)
Text-mining
Machine ‘reading’ of text
Many techniques involved:
– Tokenisation
– Stemming (Activation → Activat)
– POS tagging (Protein → noun)
– Abbreviation expansion (CN → Casein)
– Entity identification (Casein → protein)
– Dictionary
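Several of the techniques listed above can be sketched in a few lines of Python. This is a toy illustration only: the abbreviation table, the entity dictionary and the suffix list are assumptions invented for the example, the stemmer is naive suffix stripping rather than a published algorithm such as Porter's, and POS tagging is left out because it needs a trained tagger.

```python
import re

# Toy resources; every entry here is an illustrative assumption.
ABBREVIATIONS = {"CN": "casein"}
ENTITY_DICTIONARY = {"casein": "protein", "iga": "antibody", "b-lg": "protein"}
SUFFIXES = ("ion", "ing", "ed", "s")

def tokenise(text):
    """Split text into word tokens, keeping internal hyphens (e.g. B-LG)."""
    return re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)

def stem(token):
    """Strip the first matching suffix, leaving at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def expand_abbreviation(token):
    """Replace a known abbreviation with its long form."""
    return ABBREVIATIONS.get(token, token)

def identify_entity(token):
    """Dictionary lookup of the entity type; None if the token is unknown."""
    return ENTITY_DICTIONARY.get(token.lower())

def annotate(text):
    """Return (token, stem of expanded token, entity type) for each token."""
    rows = []
    for token in tokenise(text):
        expanded = expand_abbreviation(token)
        rows.append((token, stem(expanded), identify_entity(expanded)))
    return rows

for row in annotate("Activation of CN"):
    print(row)
```

Running the steps in this order means abbreviation expansion feeds the dictionary lookup, so `CN` is recognised as a protein via `casein`.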
Example sentence:
“Increased levels of IgA antibodies to B-LG were found and were shown to be an independent risk marker for type 1 diabetes.”
Tokeniser / POS tagger: Increased [past participle] levels [plural noun] of [preposition]
Entity identification: IgA [antibody] B-LG [protein] Diabetes [disease]
Parser: [IgA antibodies to B-LG] ‘MARKER’ [type 1 diabetes]
Information extraction
Rule based
– ‘interact’ ‘bind’ ‘activate’
– [protein] (0-5 words) [verbs] (0-5 words) [protein] (Blaschke and Valencia, 2002)
Machine-learning
– Statistical methods, Hidden Markov Models
– Learn interfillers, text lying between tagged entities (Bunescu et al, 2004)
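The rule-based frame above, [protein] (0-5 words) [verb] (0-5 words) [protein], can be sketched over a token list as follows. This is a simplified reading of the pattern, not the published system: the inflection handling is a crude assumption, protein tagging is taken as already done, and `ProteinA` / `ProteinB` are placeholder names.

```python
INTERACTION_VERBS = {"interact", "bind", "activate"}  # verbs named on the slide
MAX_GAP = 5  # at most five words on either side of the verb

def normalise_verb(token):
    """Map a few simple inflections to a listed verb, else return None."""
    word = token.lower()
    for verb in INTERACTION_VERBS:
        if word in (verb, verb + "s", verb + "d", verb + "ed"):
            return verb
    return None

def extract_interactions(tokens, proteins):
    """Return (protein, verb, protein) triples that fit the frame.

    tokens:   list of word tokens for one sentence
    proteins: set of tokens already tagged as proteins (assumed input)
    """
    hits = []
    positions = [i for i, tok in enumerate(tokens) if tok in proteins]
    for a in positions:
        for b in positions:
            if b <= a:
                continue
            for v in range(a + 1, b):
                verb = normalise_verb(tokens[v])
                left_gap = v - a - 1   # words between first protein and verb
                right_gap = b - v - 1  # words between verb and second protein
                if verb and left_gap <= MAX_GAP and right_gap <= MAX_GAP:
                    hits.append((tokens[a], verb, tokens[b]))
    return hits

sentence = "ProteinA strongly binds to the receptor ProteinB in vitro".split()
print(extract_interactions(sentence, {"ProteinA", "ProteinB"}))
```

A frame like this is easy to inspect and tune, but it misses irregular forms such as "bound" and ignores negation, which is part of the motivation for the machine-learning approaches listed alongside it.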
Difficulties
Synonyms
Proteins and genes with same name
Funny names e.g. ERK-1/2, ‘and’ gene!
Variability of natural language
Compounded names
Co-ordination, negatives, speeling errors
Evaluation
Precision (P) - how much of the output is correct
Recall (R) - how much of what should be found is picked up
F-measure - combines P and R
IE systems can achieve high results, but not enough to populate databases automatically
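The three measures above can be computed from a set of extracted items and a gold-standard set, as in this short sketch. The function name and the integer stand-ins for extracted relations are assumptions for the example; the F-measure shown is the usual weighted harmonic mean, with beta = 1 giving equal weight to P and R.

```python
def precision_recall_f(predicted, gold, beta=1.0):
    """Return (precision, recall, F-measure) for two collections of items."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_measure = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_measure

# Four extracted items, three of them correct, out of six in the gold set.
p, r, f = precision_recall_f({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7})
print(p, r, round(f, 3))
```

In this example three quarters of the output is correct but only half of the gold items are found, which shows how a system can look precise while still leaving a database half empty.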
Text-mining uses
Aim to extract interactions and diseases
Swanson (Fish oil)
Srinivasan (Turmeric)
General model for discovering implicit links between topics
Starting topic: Turmeric (inhibits)
Intermediate topic: Nuclear factor-kappa B (involved in)
Terminal topic: Crohn’s disease
Diagram taken from Srinivasan et al, 2004
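The starting / intermediate / terminal topic model can be sketched as a two-step lookup over extracted relations. This is a bare-bones illustration, not the published method: it treats every relation as an unweighted (source, target) pair, whereas real systems rank candidate links, and the function name `implicit_links` is an assumption.

```python
from collections import defaultdict

def implicit_links(start, relations):
    """Find terminal topics reachable from start only via an intermediate.

    relations: iterable of (source, target) pairs taken from the literature.
    Returns {terminal topic: sorted list of intermediate topics}.
    """
    targets = defaultdict(set)
    for source, target in relations:
        targets[source].add(target)

    direct = targets.get(start, set())
    found = defaultdict(set)
    for intermediate in direct:
        for terminal in targets.get(intermediate, set()):
            # Skip links that are already explicit in the input.
            if terminal != start and terminal not in direct:
                found[terminal].add(intermediate)
    return {terminal: sorted(mids) for terminal, mids in found.items()}

# The two explicit relations from the diagram.
relations = [
    ("Turmeric", "Nuclear factor-kappa B"),            # inhibits
    ("Nuclear factor-kappa B", "Crohn's disease"),     # involved in
]
print(implicit_links("Turmeric", relations))
```

The output pairs each candidate terminal topic with the intermediates that support it, which is the raw material for a hypothesis to check experimentally.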
Targets for text mining
Many milk relationships still require further investigation
Positive reasons
- nutritional benefits
- neonatal growth
- antimicrobial activity
- bioactive peptides
Targets for text mining (cont.)
Negative reasons
- recent link with Alzheimer's
- diabetes link
- asthma
- human reactions to cow hormones (e.g. Acne, Danby 2005)
- drug transfer to milk and effects
- allergic reactions/intolerance
- toxic contaminants
milkER process
897 proteins, 772 DNA, 1232 RNA
Analyze references (1465 MEDLINE refs)
– MeSH terms, GO terms etc
POS tag
UMLS standardisation
Gene/protein dictionary
Extract relations
Milk literature
                     Articles   English   Abstracts
Milk                   38,097    32,207      23,201
Diabetes mellitus     174,498   133,844     103,868
Milk and diabetes         210       191         132
milkER interactions
Table of interacting proteins
– Store as queryable XML strings?
Discover links between proteins and disease
Create hypotheses
Confirm experimentally
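One way the "queryable XML strings" idea might look is sketched below with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`interaction`, `protein`, `role`, `pubmed`) are a hypothetical schema invented for this example, not a milkER design, and the protein names and reference value are placeholders.

```python
import xml.etree.ElementTree as ET

def interaction_to_xml(protein_a, protein_b, verb, pubmed_id):
    """Serialise one interaction as an XML string (hypothetical schema)."""
    elem = ET.Element("interaction", {"type": verb, "pubmed": str(pubmed_id)})
    ET.SubElement(elem, "protein", {"role": "agent"}).text = protein_a
    ET.SubElement(elem, "protein", {"role": "target"}).text = protein_b
    return ET.tostring(elem, encoding="unicode")

def partners_of(xml_strings, protein):
    """Return every protein stored in an interaction with the given one."""
    partners = []
    for xml_string in xml_strings:
        names = [p.text for p in ET.fromstring(xml_string).findall("protein")]
        if protein in names:
            partners.extend(name for name in names if name != protein)
    return partners

# Placeholder rows standing in for a database column of XML strings.
rows = [
    interaction_to_xml("ProteinA", "ProteinB", "bind", "PMID-placeholder"),
    interaction_to_xml("ProteinC", "ProteinA", "activate", "PMID-placeholder"),
]
print(partners_of(rows, "ProteinA"))
```

Keeping the source reference on each interaction lets a discovered link be traced back to the abstract it came from.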
Diabetes
Pancreas secretes hormones
– Glucagon, increases conversion of glycogen → glucose
– Insulin, increases conversion of glucose → glycogen. Allows glucose into cells.
“Condition where the amount of glucose in the blood is abnormally high as the body cannot use it adequately as fuel”
Diabetes
Affects 3-5% of industrialised populations
Type 1 (~10%)
– Genetic and environmental factors (e.g. diet)
– Decreased insulin production
– Mostly develops < age 20
Type 2 (~90%)
– Resistance of body to insulin
– Normally develops > age 40
– Often associated with high blood pressure, cholesterol and arterial disease
Milk and diabetesMilk and diabetes
“More research is needed on all aspects of lactation in women with diabetes.”
– Reader D. et al, Curr Diab Rep. 2004
“The effect of high protein intakes from different sources on glucose-insulin metabolism needs further study”
– Hoppe et al, European Journal of Clinical Nutrition 2005
Selected quotes
“American children also tend to be heavier than those from European countries, skewing the [growth] charts further.”
– The Scotsman Sat 5 Feb 2005
The government currently recommends that babies should be fed breast milk alone for the first six months - the WHO recommends two years.
Conclusions
Knowledge of milk vital in many areas
milkER aims to bring disparate milk data together
Text-mining can wade through large amounts of data to retrieve and discover vital information
Future work
Relation extraction of milk literature
Extend content of milkER to include interaction data
Create hypotheses for experimental work
Acknowledgements
Prof. Lindsay Sawyer
Dr. Carl Holt (Hannah Research Institute, Ayr)
Prof. Bonnie Webber (Informatics)
Dr. Alistair Kerr and Dr. Douglas Armstrong for technical support
References
Acne/milk
– Acne and milk, the diet myth, and beyond (Danby, 2005)
Diabetes/milk
– Milk and diabetes (Schrezenmeir et al, 2000) REVIEW
– The role of β-casein variants in the induction of insulin-dependent diabetes (Elliott et al, 1997)
Text-mining
– Natural language processing and systems biology (Cohen et al, 2004) REVIEW
– Mining MEDLINE for implicit links between dietary substances and diseases (Srinivasan et al, 2004)
– Learning to extract proteins and their interactions from MEDLINE abstracts (Bunescu et al, 2003)