large-scale integration of data and text
TRANSCRIPT
Large-scale integration of data and text
Lars Juhl Jensen
four parts
association networks
guilt by association
molecular networks
STRING
9.6 million proteins
Szklarczyk et al., Nucleic Acids Research, 2015string-db.org
STITCH
300,000 chemicals
Kuhn et al., Nucleic Acids Research, 2014stitch-db.org
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
conserved neighborhood
operons
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
binding assays
activity assays
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
homology-based transfer
Franceschini et al., Nucleic Acids Research, 2013
missing most of the data
text mining
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
CDC2
flexible matching
cyclin dependent kinase 1
cyclin-dependent kinase 1
orthographic variation
CDC2
hCdc2
“black list”
SDS
text corpus
~22 million abstracts
Medline
~2 million full-text articles
restricted access
information extraction
co-mentioning
counting
within documents
within paragraphs
within sentences
scoring scheme
score calibration
NLPNatural Language Processing
grammatical analysis
part-of-speech tagging
what you learned in schoolpronoun pronoun verb preposition noun
semantic tagging
words of special interest
sentence parsing
Gene and protein namesCue words for entity recognitionVerbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]]is controlled by[nxpg HAP1]
Saric et al., Proceedings of ACL, 2004
more precise
worse recall
reusing the wheel
broadly applicable
experimental data
curated knowledge
text mining
common identifiers
quality scores
unified infrastructure
Swiss army knife syndrome
targeted web resources
COMPARTMENTS
subcellular localization
Binder et al., Database, 2014compartments.jensenlab.org
TISSUES
tissue expression
tissues.jensenlab.org Santos et al., PeerJ, 2015
DISEASES
disease–gene associations
diseases.jensenlab.org Frankild et al., Methods, 2015
medical data mining
Jensen et al., Nature Reviews Genetics, 2012
opt-out
opt-in
structured data
Jensen et al., Nature Reviews Genetics, 2012
unstructured data
Danish
comprehensive lexicon
adverse drug events
drugs
Clozapine
Clozapineclozapi
n
clossapin
klozapine
chlosapin
chlosapine
chlozapin
chlozapine
klossapin
closapine
klozapinklosapi
n
rule-based extraction
Eriksson et al., Drug Safety, 2014
Eriksson et al., Drug Safety, 2014
Eriksson et al., Drug Safety, 2014
Eriksson et al., Drug Safety, 2014
estimate ADR frequencies
Eriksson et al., Drug Safety, 2014
new adverse drug reactions
Eriksson et al., Drug Safety, 2014
Drug substance ADE p-value
Chlordiazepoxide Nystagmus 4.0e-8Simvastatin Personality
changes8.4e-8
Dipyridamole Visual impairment
4.4e-4
Citalopram Psychosis 8.8e-4Bendroflumethiazide
Apoplexy 8.5e-3
direct medical implications
AcknowledgmentsMolecular networksMichael KuhnDamian SzklarczykAndrea Franceschini Milan SimonovicAlexander RothSune Pletscher-FrankildChristian von MeringPeer Bork
General frameworkSune Pletscher-FrankildAlberto SantosKalliopi TsafouAlbert PallejaJanos BinderChristian StolteChristos Arvanitidis Reinhardt SchneiderSean O’Donoghue
EHR miningAnders Boeck JensenRobert ErikssonPeter Bjødstrup JensenAndreas Bok AndersenSune Pletscher-FrankildThomas WergeSøren Brunak