Machine-learning in building bioinformatics databases for infectious diseases
Victor Tong
Institute for Infocomm Research
A*STAR, Singapore
ASEAN-China International Bioinformatics Workshop 2008
17 Apr 2008
Overview
Definitions and background
Architectures of existing immunological databases
Machine-learning for biological databases
Conclusion
Biology produces more data than we can process
  >3000 HLA alleles
  10^7-10^15 different T-cell receptors
  10^11 linear 9mer epitopes
  Post-translational spliced epitopes
Data are stored in databases, literature, laboratory records, clinical records, …
A major issue: turning data into knowledge
The information centric world
Impractical to do manual curation
  ≥ 16 million PubMed abstracts
  ~80K immunology-related references
Large amounts of data that are difficult to interpret
  Protein-protein interaction extraction from text
Bioinformatics: systematic construction and updating of databases
Use of bioinformatics
[Diagram] Ad hoc bioinformatics: Biological system, Computational analysis, Biological interpretation
[Diagram] More systematic use of bioinformatics: Biological system, Computational analysis, Biological interpretation, with the additional steps Formal description, Mathematical problem, Conversion of results
Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases
1) Data explosion
Current databases:
  Volume of data increasing exponentially
  GenBank, SWISS-PROT, IMGT, PubMed, etc.
New databases:
  Growth in numbers
  Increase in size
  More complex
Biologists:
  Maintain personal data bank
  Information relevant to their research
  Define objectives for data mining and analysis
2) Data quality
Nature of biological data:
  Fuzzy and complex
  Varying interpretations
Problems with raw data:
  Inconsistent
  Inaccurate
  Redundant
  Irrelevant
  Incomplete
  Incorrect
Data cleaning:
  Limit on the percentage error that can be tolerated in the data
  Prevent propagation of errors to our databases
  Prevent depreciation of data quality
3) Database creation and maintenance
Software tools and programming efforts:
  Data collection
  Constructing databases
  Integrating data mining tools
  Updating the databases
Nature of the databases:
  Short lifespan
  Hard to maintain
4) Data integration
Disparities in data sources:
  Data structures
  Data formats
  Views
  Search mechanisms
  Location
Web-resources for immune epitope information
Immune Epitope Database and Analysis Resource (IEDB)
  Contains B-cell epitopes, T-cell epitopes and MHC ligands for humans, non-human primates, rodents and other animal species.
  URL: http://www.immuneepitope.org
The international ImMunoGeneTics information system (IMGT)
  Specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily and related proteins of the immune system of human and other vertebrate species.
  URL: http://imgt.cines.fr/
SYFPEITHI
  Contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents.
  URL: http://www.syfpeithi.de/
MHCBN
  Contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents.
  URL: http://www.imtech.res.in/raghava/mhcbn/
MPID-T
  Contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles.
  URL: http://surya.bic.nus.edu.sg/mpidt/
AntiJen/JenPep
  Contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes.
  URL: http://www.jenner.ac.uk/antijen/
The IEDB class diagram
Relationships between an epitope & contexts
Naïve Bayes classifiers
Attribute values are conditionally independent given the target value
Goal: to assign a new instance the most probable target value Vtarget, chosen among the candidate values vj, given its set of attribute values <a1, a2, …, an>
The target class may be defined as:
Vtarget = argmax over vj of P(vj) Π P(ai|vj), where the product runs over i = 1, …, n
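As a rough illustration of the rule above, here is a minimal sketch of a naïve Bayes text classifier. The toy abstracts, the labels "relevant" and "irrelevant", the function names and the add-one smoothing are all assumptions made for this example and are not taken from the talk.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Count what is needed to estimate P(vj) and P(ai | vj)."""
    class_counts = Counter(labels)        # documents per class
    word_counts = defaultdict(Counter)    # word occurrences per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify(tokens, model):
    """Return argmax over vj of log P(vj) + sum_i log P(ai | vj)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # log P(vj)
        total_words = sum(word_counts[label].values())
        for tok in tokens:
            if tok not in vocab:
                continue                                  # ignore unseen words
            # Add-one smoothing (an assumption) avoids zero probabilities.
            p = (word_counts[label][tok] + 1) / (total_words + len(vocab))
            score += math.log(p)                          # log P(ai | vj)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical toy training data: tokenized abstracts and their labels.
docs = [
    ["peptide", "binds", "hla", "allele"],
    ["epitope", "mhc", "binding", "assay"],
    ["patient", "cohort", "survey", "results"],
    ["hospital", "survey", "cost", "analysis"],
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
model = train_naive_bayes(docs, labels)
print(classify(["hla", "binding", "peptide"], model))
```

Working in log space keeps the product of many small probabilities from underflowing; the argmax is unchanged because the logarithm is monotonic.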
Comparison of popular text classification algorithms
Dataset
  20,910 PubMed abstracts
  181,299 unique words
AROC (area under the ROC curve)
  NBC: 0.838
  ANN: 0.831
  SVM: 0.825
  DT: 0.809
Wang et al., BMC Bioinformatics 2007, 8:269
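For readers unfamiliar with the metric, the sketch below shows one common way to compute an AROC value: the fraction of positive/negative pairs in which the positive document receives the higher score, with ties counted as one half. The scores and labels are made-up values, and this is not necessarily how the cited study computed its figures.

```python
def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for a positive (relevant) document, 0 for a negative one.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))

# Hypothetical classifier scores for five abstracts.
scores = [0.9, 0.8, 0.4, 0.35, 0.1]
labels = [1, 0, 1, 0, 0]
print(aroc(scores, labels))
```

A value of 0.5 corresponds to random ranking and 1.0 to a perfect separation of the two classes.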
Feature selection (FS)
Data source
  PubMed abstracts
  Medical Subject Headings (MeSH): National Library of Medicine's controlled vocabulary used for indexing articles, for cataloging books and other holdings
  Publication title
  Author(s)
  etc.
Algorithms
  Document frequency (DF): ranks features based on the number of abstracts they appear in
  Information gain (IG): measures the number of bits of information obtained for category prediction based on a feature's presence or absence in a document
IG(u) = -∑ P(ci) log P(ci) + P(u) ∑ P(ci|u) log P(ci|u) + P(ū) ∑ P(ci|ū) log P(ci|ū)
where u is the feature of interest, ū denotes its absence, ci (i = 1, …, m) denotes the set of categories the documents belong to, and each sum runs over i = 1, …, m
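The two ranking criteria can be sketched as follows, treating the weight of the last term of the IG formula as the probability that the feature is absent. The toy documents, the labels and the function names are assumptions made for illustration; logarithms are taken base 2 so that the result is in bits.

```python
import math

def document_frequency(term, docs):
    """DF: number of documents (sets of tokens) that contain the term."""
    return sum(1 for d in docs if term in d)

def information_gain(term, docs, labels):
    """IG of a term, following the formula above, in bits."""
    n = len(docs)
    categories = sorted(set(labels))
    present = [lab for d, lab in zip(docs, labels) if term in d]
    absent = [lab for d, lab in zip(docs, labels) if term not in d]

    def sum_p_log_p(subset):
        # sum over categories of P(ci | subset) * log2 P(ci | subset),
        # with the convention 0 * log 0 = 0
        total = 0.0
        for c in categories:
            p = subset.count(c) / len(subset) if subset else 0.0
            if p > 0:
                total += p * math.log2(p)
        return total

    p_u = len(present) / n                     # P(u): term present
    return (-sum_p_log_p(labels)               # entropy of the categories
            + p_u * sum_p_log_p(present)       # term present
            + (1 - p_u) * sum_p_log_p(absent)) # term absent

# Hypothetical toy corpus: each document is a set of tokens.
docs = [
    {"epitope", "hla", "study"},
    {"epitope", "mhc", "study"},
    {"cohort", "study"},
    {"survey", "study"},
]
labels = ["relevant", "relevant", "irrelevant", "irrelevant"]
for term in ("epitope", "study"):
    print(term, document_frequency(term, docs), information_gain(term, docs, labels))
```

In this toy corpus "study" has the highest document frequency but carries no information about the category, while "epitope" separates the two categories completely, which is why the two criteria can rank features differently.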
Feature condensation (FC)
Stemming
  To reduce words to their common root, e.g. “binding, binds, bind” to “bind”
  Porter stemmer – AROC = 0.846 to AROC = 0.842
  Domain-specific vocabulary may be reduced to unsuitable terms
Feature extraction (FE)
Rules to capture immune-related expressions and group them together
  Reduction of feature space (i.e. no. of unique words)
  Enrichment of information content
  Better performance?
Examples:
Sequence length
  – identify sequence lengths and replace them with “~range<50~” or “~range>50~”, depending on whether the sequence to be mapped stretches beyond 50 amino acids
MHC alleles
  – identify MHC alleles and replace them with “~mhc_allele~”
Protein sequences
  – identify sequences as a) exclusively containing characters representing the 20 aa, b) in upper case, with length > threshold, and replace them with “~sequence~”
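The three rules above could be approximated with regular expressions, as in the sketch below. The patterns, the length threshold of 8 residues, the handling of a length of exactly 50 and the example sentence are all assumptions for illustration; the actual rules used in the study are not given in these slides.

```python
import re

SEQ_MIN_LEN = 8                  # assumed threshold for a protein sequence
AA = "ACDEFGHIKLMNPQRSTVWY"      # one-letter codes of the 20 amino acids

# Upper-case runs made only of amino-acid letters, at least SEQ_MIN_LEN long.
SEQUENCE_RE = re.compile(rf"\b[{AA}]{{{SEQ_MIN_LEN},}}\b")
# Simplified HLA nomenclature, e.g. HLA-A*0201 or HLA-DRB1*04:01 (assumed).
MHC_RE = re.compile(r"\bHLA-[A-Z]+[0-9]*\*[0-9]{2,4}(?::[0-9]{2,3})*")
# Length mentions such as "9-mer", "15mer" or "120 amino acids" (assumed).
LENGTH_RE = re.compile(
    r"\b(\d+)(?:-?mers?|\s*(?:amino acids?|aa|residues))\b", re.IGNORECASE
)

def length_token(match):
    """Map a length mention to one of the two range tokens."""
    n = int(match.group(1))
    # Lengths of exactly 50 are grouped with the shorter range (assumption).
    return "~range>50~" if n > 50 else "~range<50~"

def extract_features(text):
    """Replace immune-related expressions with grouped placeholder tokens."""
    text = MHC_RE.sub("~mhc_allele~", text)     # alleles first, so their
    text = SEQUENCE_RE.sub("~sequence~", text)  # letters are not read as sequence
    text = LENGTH_RE.sub(length_token, text)
    return text

example = ("The 9-mer peptide GILGFVFTL binds HLA-A*0201 "
           "within a 120 amino acid domain.")
print(extract_features(example))
```

Each distinct allele name, sequence or length in the corpus collapses to a single token, which shrinks the feature space while keeping the fact that such an expression occurred.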
Performance comparison
Wang et al., BMC Bioinformatics 2007, 8:269
Conclusion
Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery
This approach must be applied with due care and must be scientifically and technically sound