machine-learning in building bioinformatics databases for infectious diseases

Machine-learning in building bioinformatics databases for infectious diseases. Victor Tong, Institute for Infocomm Research, A*STAR, Singapore. ASEAN-China International Bioinformatics Workshop 2008, 17 Apr 2008.

Page 1: Machine-learning in  building bioinformatics databases  for infectious diseases

Machine-learning in building bioinformatics databases for infectious diseases

Victor Tong
Institute for Infocomm Research
A*STAR, Singapore

ASEAN-China International Bioinformatics Workshop 2008
17 Apr 2008

Page 2: Machine-learning in  building bioinformatics databases  for infectious diseases

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 3: Machine-learning in  building bioinformatics databases  for infectious diseases

The information centric world

Biology produces more data than we can process:
- >3000 HLA alleles
- 10^7 to 10^15 different T-cell receptors
- 10^11 linear 9mer epitopes
- Post-translationally spliced epitopes

Data are stored in databases, literature, laboratory records, clinical records, …

A major issue: turning data into knowledge

Page 4: Machine-learning in  building bioinformatics databases  for infectious diseases

Use of bioinformatics

Impractical to do manual curation:
- ≥ 16 million PubMed abstracts
- ~80K immunology-related references

Large amounts of data that are difficult to interpret:
- Protein-protein interaction extraction from text

Bioinformatics: systematic construction and updating of databases

Page 5: Machine-learning in  building bioinformatics databases  for infectious diseases

Ad hoc bioinformatics

Biological system

Computational analysis

Biological interpretation

Page 6: Machine-learning in  building bioinformatics databases  for infectious diseases

More systematic use of bioinformatics

Biological system

Computational analysis

Biological interpretation

Formal description

Mathematical problem

Conversion of results

Page 7: Machine-learning in  building bioinformatics databases  for infectious diseases

Knowledge discovery from databases is the process of automated extraction of useful information or knowledge from individual or multiple databases

Page 8: Machine-learning in  building bioinformatics databases  for infectious diseases

1) Data explosion

Current databases:
- Volume of data increasing exponentially
- GenBank, SWISS-PROT, IMGT, PubMed, etc.

New databases:
- Growth in numbers
- Increase in size
- More complex

Biologists:
- Maintain personal data bank
- Information relevant to their research
- Define objectives for data mining and analysis

Page 9: Machine-learning in  building bioinformatics databases  for infectious diseases

2) Data quality

Nature of biological data:
- Fuzzy and complex
- Varying interpretations

Problems with raw data:
- Inconsistent
- Inaccurate
- Redundant
- Irrelevant
- Incomplete
- Incorrect

Data cleaning:
- Limit on the percentage error that can be tolerated in the data
- Prevent propagation of errors to our databases
- Prevent depreciation of data quality
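The data-cleaning goals above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the talk: the `EpitopeRecord` type, the `clean_records` function, the choice to treat empty fields as incomplete and non-amino-acid characters as incorrect, and the 5% default tolerance.

```python
from dataclasses import dataclass

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")  # the 20 standard amino acids


@dataclass(frozen=True)
class EpitopeRecord:
    """Hypothetical record type: a peptide sequence and where it came from."""
    sequence: str
    source: str


def clean_records(records, max_error_rate=0.05):
    """Drop incomplete, incorrect and redundant records.

    Raises ValueError when the fraction of incomplete or incorrect records
    exceeds max_error_rate, so a bad batch is not propagated to the database.
    """
    seen = set()
    kept = []
    rejected = 0
    for rec in records:
        seq = rec.sequence.strip().upper()
        src = rec.source.strip()
        if not seq or not src:
            rejected += 1  # incomplete record
        elif not set(seq) <= VALID_AA:
            rejected += 1  # incorrect: contains a non-amino-acid character
        elif seq in seen:
            continue  # redundant: dropped, but not counted as an error
        else:
            seen.add(seq)
            kept.append(EpitopeRecord(seq, src))
    error_rate = rejected / len(records) if records else 0.0
    if error_rate > max_error_rate:
        raise ValueError(f"error rate {error_rate:.2f} exceeds {max_error_rate:.2f}")
    return kept, error_rate
```

Raising an exception instead of silently filtering is one way to enforce a tolerance limit; a real pipeline might log the rejected records for manual review instead.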

Page 10: Machine-learning in  building bioinformatics databases  for infectious diseases

3) Database creation and maintenance

Software tools and programming efforts:
- Data collection
- Constructing databases
- Integrating data mining tools
- Updating the databases

Nature of the databases:
- Short lifespan
- Hard to maintain

Page 11: Machine-learning in  building bioinformatics databases  for infectious diseases

4) Data integration

Disparities in data sources:
- Data structures
- Data formats
- Views
- Search mechanisms
- Location

Page 12: Machine-learning in  building bioinformatics databases  for infectious diseases

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 13: Machine-learning in  building bioinformatics databases  for infectious diseases

Web-resources for immune epitope information

Immune Epitope Database and Analysis Resource (IEDB)
Contains B-cell epitopes, T-cell epitopes and MHC ligands for humans, non-human primates, rodents, and other animal species.
URL: http://www.immuneepitope.org

The international ImMunoGeneTics information system (IMGT)
Specializes in Ig, T-cell receptors, MHC, Ig superfamily, MHC superfamily, and related proteins of the immune system of human and other vertebrate species.
URL: http://imgt.cines.fr/

SYFPEITHI
Contains ~3,500 T-cell epitopes, MHC ligands and peptide motifs for humans and rodents.
URL: http://www.syfpeithi.de/

Page 14: Machine-learning in  building bioinformatics databases  for infectious diseases

Web-resources for immune epitope information

MHCBN
Contains T-cell epitopes, TAP ligands, MHC binding peptides and MHC non-binding peptides for humans and rodents.
URL: http://www.imtech.res.in/raghava/mhcbn/

MPID-T
Contains 3D structural information of 187 T-cell receptors, MHCs and interacting epitopes for humans and rodents, spanning 40 alleles.
URL: http://surya.bic.nus.edu.sg/mpidt/

AntiJen/JenPep
Contains T-cell epitopes, MHC ligands, TAP ligands and B-cell epitopes.
URL: http://www.jenner.ac.uk/antijen/

Page 15: Machine-learning in  building bioinformatics databases  for infectious diseases

The IEDB class diagram

Page 16: Machine-learning in  building bioinformatics databases  for infectious diseases

Relationships between an epitope & contexts

Page 17: Machine-learning in  building bioinformatics databases  for infectious diseases
Page 18: Machine-learning in  building bioinformatics databases  for infectious diseases
Page 19: Machine-learning in  building bioinformatics databases  for infectious diseases

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 20: Machine-learning in  building bioinformatics databases  for infectious diseases

Naïve Bayes classifiers

Assumption: attribute values are conditionally independent given the target value

Goal: to assign a new instance, described by a set of attribute values <a1, a2, …, an>, the most probable target value v_target from the set V of candidate values v_j

The target class may be defined as:

$$v_{\text{target}} = \arg\max_{v_j \in V} P(v_j) \prod_{i=1}^{n} P(a_i \mid v_j)$$
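The decision rule above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used in the cited work: the function names `train_naive_bayes` and `classify`, the bag-of-words attributes, the add-one (Laplace) smoothing and the use of log-probabilities are all assumptions made here.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Collect the counts needed to estimate P(v_j) and P(a_i | v_j)."""
    class_counts = Counter(labels)       # number of documents per class
    word_counts = defaultdict(Counter)   # word occurrences per class
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def classify(tokens, model):
    """Return argmax over v_j of log P(v_j) + sum_i log P(a_i | v_j)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)  # log prior
        # Add-one smoothing: an assumption of this sketch, used so that
        # words unseen in a class do not force the product to zero.
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokens:
            if tok in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

Summing logarithms instead of multiplying probabilities gives the same argmax while avoiding numerical underflow on long abstracts.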

Page 21: Machine-learning in  building bioinformatics databases  for infectious diseases

Comparison of popular text classification algorithms

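Dataset:
- 20,910 PubMed abstracts
- 181,299 unique words

Area under the ROC curve (AROC):
- Naïve Bayes classifier (NBC): 0.838
- Artificial neural network (ANN): 0.831
- Support vector machine (SVM): 0.825
- Decision tree (DT): 0.809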

Wang et al., BMC Bioinformatics 2007, 8:269
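For readers unfamiliar with the AROC metric reported on this slide, the sketch below computes it directly from its probabilistic definition: the chance that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `aroc` and the 0/1 label encoding are assumptions of this example, and the quadratic pairwise loop is chosen for clarity over speed.

```python
def aroc(scores, labels):
    """Area under the ROC curve via pairwise comparison of scores.

    scores: classifier outputs, higher meaning "more likely positive"
    labels: 1 for positive examples, 0 for negative examples
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.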

Page 22: Machine-learning in  building bioinformatics databases  for infectious diseases

Feature selection (FS)

Data source:
- PubMed abstracts
- Medical Subject Headings (MeSH): the National Library of Medicine's controlled vocabulary used for indexing articles and for cataloging books and other holdings
- Publication title
- Author(s)
- etc.

Page 23: Machine-learning in  building bioinformatics databases  for infectious diseases

Feature selection (FS)

Algorithms:
- Document frequency (DF): ranks features based on the number of abstracts they appear in
- Information gain (IG): measures the number of bits of information obtained for category prediction from the presence or absence of a feature in a document

$$IG(u) = -\sum_{i=1}^{m} P(c_i)\log P(c_i) + P(u)\sum_{i=1}^{m} P(c_i \mid u)\log P(c_i \mid u) + P(\bar{u})\sum_{i=1}^{m} P(c_i \mid \bar{u})\log P(c_i \mid \bar{u})$$

where u is the feature of interest, ū denotes its absence, and c_i (i = 1, …, m) are the categories the documents belong to
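Both ranking criteria can be computed from tokenised abstracts with a short Python sketch. The function names `document_frequency` and `information_gain` are hypothetical, probabilities are estimated as simple relative frequencies, and base-2 logarithms are assumed so that the result is in bits.

```python
import math
from collections import Counter


def document_frequency(docs):
    """Number of documents each feature appears in (counted once per document)."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def _sum_p_log_p(labels, classes):
    """Sum of P(c) * log2 P(c) over the classes, with 0 * log 0 taken as 0."""
    if not labels:
        return 0.0
    counts = Counter(labels)
    probs = [counts[c] / len(labels) for c in classes]
    return sum(p * math.log2(p) for p in probs if p > 0)


def information_gain(term, docs, labels):
    """IG(u) following the three-term formula above."""
    n = len(docs)
    classes = sorted(set(labels))
    with_u = [y for toks, y in zip(docs, labels) if term in toks]
    without_u = [y for toks, y in zip(docs, labels) if term not in toks]
    ig = -_sum_p_log_p(labels, classes)                            # class entropy
    ig += (len(with_u) / n) * _sum_p_log_p(with_u, classes)        # P(u) term
    ig += (len(without_u) / n) * _sum_p_log_p(without_u, classes)  # P(ū) term
    return ig
```

A feature that perfectly separates two equally frequent categories scores 1 bit, while a feature that is spread evenly across them scores 0.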

Page 24: Machine-learning in  building bioinformatics databases  for infectious diseases

Feature condensation (FC)

Stemming:
- To reduce words to their common root, e.g. “binding, binds, bind” to “bind”
- Porter stemmer: AROC = 0.846 to AROC = 0.842
- Domain-specific vocabulary may be reduced to unsuitable terms
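The idea of stemming, and the risk it carries for domain vocabulary, can be seen with a deliberately crude suffix stripper. This is not the Porter stemmer, which applies a much longer, ordered set of rewrite rules; the `crude_stem` function, its suffix list and its minimum stem length are assumptions made purely for illustration.

```python
SUFFIXES = ("ing", "ed", "s")  # illustrative suffix list, far smaller than Porter's


def crude_stem(word, min_stem_len=3):
    """Strip the first matching suffix if at least min_stem_len characters remain."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem_len:
            return w[: -len(suffix)]
    return w


# "binding", "binds" and "bind" all collapse to the same feature
stems = [crude_stem(w) for w in ["binding", "binds", "bind"]]

# a domain term can be damaged by the same rule
damaged = crude_stem("virus")  # becomes "viru"
```

The second case shows how a blanket rule can turn a meaningful biological term into an unsuitable feature.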

Page 25: Machine-learning in  building bioinformatics databases  for infectious diseases

Feature extraction (FE)

Rules to capture immune-related expressions and group them together:
- Reduction of feature space (i.e. no. of unique words)
- Enrichment of information content
- Better performance?

Page 26: Machine-learning in  building bioinformatics databases  for infectious diseases

Feature extraction (FE)

Examples:
- Sequence length: identify the sequence length and replace it with “~range<50~” or “~range>50~”, depending on whether the mapped sequence stretches 50 amino acids
- MHC alleles: identify MHC alleles and replace them with “~mhc_allele~”
- Protein sequences: identify sequences as strings that a) exclusively contain characters representing the 20 amino acids, b) are in upper case and c) have length > threshold, and replace them with “~sequence~”
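Rules of this kind can be written as regular-expression substitutions. The sketch below is one possible reading of the three examples and does not reproduce the rules used in the cited work: the patterns `MHC_ALLELE_RE` and `LENGTH_RE`, the default threshold of 8 residues, the treatment of a length of exactly 50 as the upper range, and the function name `extract_features` are all assumptions.

```python
import re

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 amino acids

# Hypothetical pattern for human alleles such as HLA-A*0201 or HLA-DRB1*0401.
MHC_ALLELE_RE = re.compile(r"\bHLA-[A-Z]{1,3}\d?\*\d{2,4}\b")

# Hypothetical pattern for length mentions such as "9-mer" or "120 amino acids".
LENGTH_RE = re.compile(r"\b(\d+)(?:-mer| amino acids)\b")


def extract_features(text, min_len=8, cutoff=50):
    """Replace allele names, peptide sequences and length mentions with tokens."""
    # 1) MHC alleles first, so their letters are not mistaken for sequences.
    text = MHC_ALLELE_RE.sub("~mhc_allele~", text)

    # 2) Upper-case runs of amino-acid letters with at least min_len characters.
    sequence_re = re.compile(r"\b[%s]{%d,}\b" % (AMINO_ACIDS, min_len))
    text = sequence_re.sub("~sequence~", text)

    # 3) Length mentions, binned around the cutoff (equal to cutoff goes high).
    def length_token(match):
        length = int(match.group(1))
        return f"~range<{cutoff}~" if length < cutoff else f"~range>{cutoff}~"

    return LENGTH_RE.sub(length_token, text)


example = extract_features(
    "The 9-mer SIINFEKL binds HLA-A*0201; a 120 amino acids fragment did not."
)
```

With these assumed patterns, `example` becomes "The ~range<50~ ~sequence~ binds ~mhc_allele~; a ~range>50~ fragment did not.", so many distinct surface forms collapse into three features.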

Page 27: Machine-learning in  building bioinformatics databases  for infectious diseases

Performance comparison

Wang et al., BMC Bioinformatics 2007, 8:269

Page 28: Machine-learning in  building bioinformatics databases  for infectious diseases

Overview

Definitions and background

Architectures of existing immunological databases

Machine-learning for biological databases

Conclusion

Page 29: Machine-learning in  building bioinformatics databases  for infectious diseases

Conclusion

Machine-learning algorithms enable a systematic approach to database construction and facilitate scientific discovery

It must be performed with due care and must be scientifically and technically sound