Applications of Text and Data Mining of biomedical databases

Miguel Andrade Computational Biology & Data Mining group Max Delbrück Center for Molecular Medicine

[email protected]

Gene structures

Gene expression

Protein sequences

Protein databases

Literature databases

Biological predictions

Human disease

001001001000100101011110110110

AGCTGGTACGAAGATGTCTCGCA

MLVPIEKAEVPRYILKTEFRKAILTS

In a phosphorylation dependent ma

Molecular Biology databases

Protein and nucleotide sequences (UniProt, Entrez)

Protein domains (PFAM, SMART)

Structures (PDB)

Diseases (OMIM)

Gene expression (GEO)

Bibliography: records (MEDLINE), full text (PubMed Central)

PubMed, compressed XML: 17GB, 23M items (exhaustive back to 1966, oldest from 1809)

PubMed Central open access subset: 26GB of raw XML files (text only), 8GB compressed, 2.6M items

1 Human Genome: 320GB
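At these sizes the XML is better read as a stream than loaded whole. A minimal sketch, assuming the standard PubMed XML element names (`PubmedArticle`, `PMID`, `ArticleTitle`, `AbstractText`); the sample records and the function name `iter_pubmed_records` are made up for illustration:

```python
import gzip
import io
import xml.etree.ElementTree as ET


def iter_pubmed_records(stream):
    """Yield (pmid, title, abstract) tuples from a PubMed XML stream, one record at a time."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag != "PubmedArticle":
            continue
        pmid = elem.findtext("MedlineCitation/PMID", default="")
        title_node = elem.find(".//ArticleTitle")
        title = "".join(title_node.itertext()) if title_node is not None else ""
        # An abstract may be split into several AbstractText sections.
        abstract = " ".join("".join(n.itertext()) for n in elem.iter("AbstractText"))
        yield pmid, title, abstract
        elem.clear()  # drop the parsed record so memory use stays flat


# Made-up two-record sample standing in for a real compressed PubMed file.
SAMPLE = b"""<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><PMID>1</PMID><Article>
    <ArticleTitle>First title</ArticleTitle>
    <Abstract><AbstractText>Some abstract text.</AbstractText></Abstract>
  </Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><PMID>2</PMID><Article>
    <ArticleTitle>Second title</ArticleTitle>
  </Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

with gzip.open(io.BytesIO(gzip.compress(SAMPLE))) as handle:
    records = list(iter_pubmed_records(handle))
```

For a real file, `gzip.open(path)` replaces the in-memory sample; the generator itself does not change.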

Mapping systems

ENTRY annotated with keywords K1 and K2

PROTEIN (SwissProt) annotated with KW and GO: Trans membrane, GPCR Receptor

ENTRY (K1) linked to PAPER (K2)

PROTEIN (SwissProt, K1: GO) linked to PAPER (MEDLINE, K2: MeSH)

PROTEIN STRUCTURE (PDB, K1: SCOP) linked to PAPER (MEDLINE, K2: MeSH): TIM barrel, Enzyme

Iterate! GENE (EntrezGene) linked to PROTEIN (SwissProt, KW) linked to PAPER (MEDLINE, MeSH)
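One way to read the mapping slides: when entries from two databases are linked, keywords from one system (K1) can be associated with keywords from the other (K2) by counting how often they annotate linked entries. A minimal sketch under that assumption; the scoring (conditional frequency of K2 given K1) and the term labels are illustrative choices, not taken from the slides:

```python
from collections import Counter
from itertools import product


def keyword_associations(linked_pairs):
    """Score K1 -> K2 associations from linked entries.

    linked_pairs: iterable of (k1_terms, k2_terms), one pair per link,
    e.g. the GO terms of a protein and the MeSH terms of a paper citing it.
    Returns {(k1, k2): fraction of links carrying k1 that also carry k2}.
    """
    pair_counts = Counter()
    k1_counts = Counter()
    for k1_terms, k2_terms in linked_pairs:
        k1_set, k2_set = set(k1_terms), set(k2_terms)
        k1_counts.update(k1_set)
        pair_counts.update(product(k1_set, k2_set))
    return {(a, b): n / k1_counts[a] for (a, b), n in pair_counts.items()}


# Hypothetical links between annotated proteins (K1) and annotated papers (K2).
links = [
    ({"k1_term_a"}, {"k2_term_x", "k2_term_y"}),
    ({"k1_term_a"}, {"k2_term_x"}),
    ({"k1_term_b"}, {"k2_term_y"}),
]
scores = keyword_associations(links)
```

Iterating, as the last slide suggests, would mean chaining such tables across several linked databases (gene to protein to paper).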

Databases: MEDLINE, Entrez Gene, UniProt, NetAffx, UniGene, ProDom, PDB, OMIM, GEO, PubMed

Annotations: KW, authors, words, MeSH, GO, fold

MedlineRanker

Rank MEDLINE according to a topic

Jean-Fred Fontaine

http://cbdm.mdc-berlin.de/tools/medlineranker/

Fontaine et al. (2009) Nucleic Acids Research
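The slide does not say how the ranking is computed. As a rough illustration of ranking abstracts by topic, the sketch below learns per-word log-odds weights from a set of topic abstracts versus a background set (a generic naive-Bayes-style scheme, assumed here) and scores candidates by summing the weights of their words. All function names and example texts are hypothetical:

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word set of a text (presence only, no counts)."""
    return set(re.findall(r"[a-z]+", text.lower()))


def train_word_weights(topic_abstracts, background_abstracts):
    """Log-odds weight per word: positive if more frequent in the topic set."""
    topic_df = Counter(w for a in topic_abstracts for w in tokenize(a))
    back_df = Counter(w for a in background_abstracts for w in tokenize(a))
    n_topic, n_back = len(topic_abstracts), len(background_abstracts)
    weights = {}
    for word in set(topic_df) | set(back_df):
        # Add-one smoothing keeps unseen words from producing log(0).
        p_topic = (topic_df[word] + 1) / (n_topic + 2)
        p_back = (back_df[word] + 1) / (n_back + 2)
        weights[word] = math.log(p_topic / p_back)
    return weights


def rank_abstracts(candidates, weights):
    """Return (score, abstract) pairs, highest score first."""
    scored = [(sum(weights.get(w, 0.0) for w in tokenize(a)), a) for a in candidates]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


# Made-up training and candidate texts.
topic = ["kinase phosphorylation signaling", "protein kinase activity"]
background = ["patient cohort survey", "clinical trial patient outcome"]
weights = train_word_weights(topic, background)
ranked = rank_abstracts(["survey of patient outcome", "kinase signaling in cells"], weights)
```

Words unseen in training contribute zero, so a candidate is ranked only by the vocabulary it shares with the training sets.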

Génie

http://cbdm.mdc-berlin.de/tools/genie/

Ranks a set of genes from a whole genome according to a topic

Fontaine et al. (2011) Nucleic Acids Research

Human

PESCADOR

http://cbdm.mdc-berlin.de/tools/pescador/

Extract interactions and filter by concepts

Barbosa-Silva et al. (2010) BMC Bioinformatics

Adriano Barbosa

Barbosa-Silva et al. (2011) BMC Bioinformatics

Co-occurrence types in PESCADOR

Type 1: Term + [Biointeraction] + Term

Type 2: [Biointeraction] + Term + Term + [Biointeraction]

Type 3: Term + Term

Type 4: co-occurrence in abstract
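The four types can be sketched as a small classifier. This is an illustration, not PESCADOR's code. It assumes Types 1 to 3 apply within a single sentence, reads Type 2 as an interaction word before or after the term pair, and uses a tiny hypothetical list of interaction words with naive tokenization:

```python
import re

# Hypothetical, deliberately tiny list of interaction words.
BIOINTERACTION_WORDS = {"binds", "phosphorylates", "activates", "inhibits", "interacts"}


def cooccurrence_type(abstract, term_a, term_b, interaction_words=BIOINTERACTION_WORDS):
    """Return the strongest co-occurrence type (1 to 4) for two terms, or None."""
    a, b = term_a.lower(), term_b.lower()
    best = None
    seen_a = seen_b = False
    for sentence in re.split(r"(?<=[.!?])\s+", abstract):
        tokens = re.findall(r"[\w-]+", sentence.lower())
        pos_a = [i for i, t in enumerate(tokens) if t == a]
        pos_b = [i for i, t in enumerate(tokens) if t == b]
        seen_a = seen_a or bool(pos_a)
        seen_b = seen_b or bool(pos_b)
        if not (pos_a and pos_b):
            continue
        pos_i = [i for i, t in enumerate(tokens) if t in interaction_words]
        # Type 1 needs an interaction word strictly between the two terms.
        between = any(
            min(i, j) < k < max(i, j) for i in pos_a for j in pos_b for k in pos_i
        )
        kind = 1 if between else 2 if pos_i else 3
        best = kind if best is None else min(best, kind)
    if best is not None:
        return best
    # Both terms appear, but never in the same sentence.
    return 4 if seen_a and seen_b else None


t1 = cooccurrence_type("RAD51 binds BRCA2 in vitro.", "RAD51", "BRCA2")
t2 = cooccurrence_type("The complex of RAD51 and BRCA2 activates repair.", "RAD51", "BRCA2")
t3 = cooccurrence_type("RAD51 and BRCA2 were studied.", "RAD51", "BRCA2")
t4 = cooccurrence_type("RAD51 was studied. BRCA2 was cloned.", "RAD51", "BRCA2")
```

When several sentences mention both terms, the lowest type number wins, on the assumption that Type 1 is the strongest evidence of an interaction.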

Country-specific variations of English

Netzel et al. (2003)

EMBO Reports

Worldwide scientific publishing activity

Perez-Iratxeta and Andrade (2002) Science

Approximate number of publications for the years 1996–2001 per million inhabitants, by country (map scale: 1, 10, 100, 1,000, 10,000)

Worldwide scientific publishing activity

Ratio of publications for 1996–2001 / 1989–95 (map legend: +++, ++, +, =, -, --, ---)

Perez-Iratxeta and Andrade (2002) Science

peer2ref

Find referees

Carolina Perez-Iratxeta (OHRI-Ottawa)

http://www.ogic.ca/peer2ref/

Andrade-Navarro et al. (2012) BioData Mining

MLTrends

Graph historical term usage in MEDLINE

Gareth Palidwor (OHRI-Ottawa)

http://www.ogic.ca/mltrends/

Palidwor and Andrade-Navarro (2010) Journal of Biomedical Discovery and Collaboration
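A term-usage curve of this kind can be sketched as the fraction of records per year that mention the term. Normalising by the yearly record count and matching by plain substring are assumptions made here for illustration; the records below are made up:

```python
from collections import Counter


def term_trend(records, term):
    """Fraction of records per year whose text mentions the term.

    records: iterable of (year, text) pairs, e.g. publication year and abstract.
    """
    totals, hits = Counter(), Counter()
    needle = term.lower()
    for year, text in records:
        totals[year] += 1
        if needle in text.lower():
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


# Made-up records; real input would come from MEDLINE titles and abstracts.
sample_records = [
    (2000, "Gene expression measured by microarray"),
    (2000, "Protein folding kinetics"),
    (2001, "Microarray analysis of tumours"),
    (2001, "Normalisation of microarray data"),
]
trend = term_trend(sample_records, "microarray")
```

Dividing by the yearly total matters because the number of records grows over time, so raw counts would rise for almost any term.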

http://cbdm.mdc-berlin.de/

Enrique Muro

Martin Schaefer

Arvind Mer

David Fournier

Nancy Mah

Marie Gebhardt

Jean-Fred Fontaine

Computational Biology and Data Mining group