large-scale integration of data and text
Lars Juhl Jensen
Large-scale integration of data and text
Ph.D.
sequence analysis
postdoc
staff scientist
protein networks
cellular signalling
group leader
cofounder
data integration
omics data
association networks
text mining
biomedical literature
electronic health records
association networks
guilt by association
STRING
Franceschini et al., Nucleic Acids Research, 2013
1100+ genomes
genomic context
gene fusion
Korbel et al., Nature Biotechnology, 2004
operons
Korbel et al., Nature Biotechnology, 2004
bidirectional promoters
Korbel et al., Nature Biotechnology, 2004
phylogenetic profiles
Korbel et al., Nature Biotechnology, 2004
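The phylogenetic-profile idea can be illustrated with a minimal sketch: genes that are present and absent in the same set of genomes are candidates for a functional association. The gene names, genome labels, the Jaccard similarity measure, and the 0.8 threshold below are all assumptions made for illustration; the slides do not say which similarity measure STRING actually uses.

```python
from itertools import combinations

# Hypothetical presence/absence profiles: gene -> genomes containing an ortholog.
profiles = {
    "geneA": {"g1", "g2", "g5"},
    "geneB": {"g1", "g2", "g5"},
    "geneC": {"g3", "g4"},
}

def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def profile_links(profiles, threshold=0.8):
    """Return (gene1, gene2, similarity) for pairs with similar profiles."""
    links = []
    for g1, g2 in combinations(sorted(profiles), 2):
        similarity = jaccard(profiles[g1], profiles[g2])
        if similarity >= threshold:
            links.append((g1, g2, similarity))
    return links

print(profile_links(profiles))  # [('geneA', 'geneB', 1.0)]
```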
a real example
Cell
Cellulosomes
Cellulose
experimental data
gene coexpression
physical interactions
Jensen & Bork, Science, 2008
genetic interactions
Beyer et al., Nature Reviews Genetics, 2007
curated knowledge
pathways
Letunic & Bork, Trends in Biochemical Sciences, 2008
many databases
different formats
different identifiers
variable quality
not comparable
not same species
hard work
(Ph.D. students)
quality scores
von Mering et al., Nucleic Acids Research, 2005
calibrate vs. gold standard
von Mering et al., Nucleic Acids Research, 2005
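Calibrating against a gold standard can be sketched as follows: rank the pairs by raw score, cut the ranking into bins, and record what fraction of each bin is confirmed by the gold standard. This is a simplified illustration, not the published STRING procedure; the function name, the bin size, and the toy data are assumptions.

```python
def calibrate(scored_pairs, gold_positive, bin_size=2):
    """Map raw scores to the fraction of gold-standard positives per bin.

    scored_pairs:  list of (pair, raw_score)
    gold_positive: set of pairs accepted as true by the gold standard
    Returns [(lowest_raw_score_in_bin, fraction_true), ...], best bin first.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    table = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        table.append((chunk[-1][1], hits / len(chunk)))
    return table

# Toy data (hypothetical): four scored pairs, three of them in the gold standard.
scored = [(("a", "b"), 0.9), (("a", "c"), 0.8), (("b", "c"), 0.4), (("c", "d"), 0.1)]
gold = {("a", "b"), ("a", "c"), ("b", "c")}
print(calibrate(scored, gold))  # [(0.8, 1.0), (0.1, 0.5)]
```

Once every evidence type is expressed as such a fraction, scores from different sources sit on one scale and become comparable.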
homology-based transfer
Franceschini et al., Nucleic Acids Research, 2013
missing most of the data
text mining
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive lexicon
cyclin dependent kinase 1
CDC2
flexible matching
cyclin dependent kinase 1
cyclin-dependent kinase 1
orthographic variation
CDC2
hCdc2
“black list”
SDS
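The recognition steps above (lexicon, flexible matching, orthographic variation, black list) can be sketched as a small dictionary-based tagger. This is a minimal illustration under stated assumptions, not the tagger behind the slides: the identifier `ENTITY_1` and `ENTITY_2`, the optional "h" prefix rule, and the function names are hypothetical.

```python
import re

# Hypothetical lexicon: name -> entity identifier (identifiers are placeholders).
LEXICON = {
    "cyclin dependent kinase 1": "ENTITY_1",
    "CDC2": "ENTITY_1",
    "SDS": "ENTITY_2",
}
# Names on the black list are never tagged, even though the lexicon has them.
BLACKLIST = {"sds"}

def name_pattern(name):
    """Build a case-insensitive regex that tolerates spelling variants."""
    tokens = re.split(r"[\s\-]+", name)
    # Flexible matching: a space, a hyphen, or nothing between tokens.
    body = r"[\s\-]?".join(re.escape(token) for token in tokens)
    # Orthographic variation: optional "h" prefix, as in hCdc2 (assumed rule).
    return re.compile(r"\b(?:h)?" + body + r"\b", re.IGNORECASE)

def tag(text):
    """Return sorted (start, end, matched_text, entity) tuples."""
    matches = []
    for name, entity in LEXICON.items():
        if name.lower() in BLACKLIST:
            continue
        for m in name_pattern(name).finditer(text):
            matches.append((m.start(), m.end(), m.group(0), entity))
    return sorted(matches)

text = "hCdc2, also called cyclin-dependent kinase 1, was run on SDS gels."
for match in tag(text):
    print(match)
```

In this sketch both "hCdc2" and "cyclin-dependent kinase 1" resolve to the same identifier, while "SDS" is skipped because it is black-listed.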
augmented browsing
Reflect
browser add-on
real-time text mining
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
O’Donoghue et al., Journal of Web Semantics, 2010
information extraction
co-mentioning
within documents
within paragraphs
within sentences
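Co-mentioning at the three levels above can be sketched as weighted counting: a pair of names scores a little for sharing a document, more for sharing a paragraph, and most for sharing a sentence. The weights, the naive sentence splitter, and the substring matching are assumptions for illustration only; the slides do not give the actual scoring scheme.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical weights: closer co-mentions count more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def entities_in(text, names):
    """Names found in the text (naive case-insensitive substring match)."""
    lowered = text.lower()
    return {name for name in names if name.lower() in lowered}

def comention_scores(documents, names):
    """Sum weighted co-mentions for every pair of names."""
    scores = defaultdict(float)
    for doc in documents:
        units = [("document", doc)]
        for paragraph in doc.split("\n\n"):
            if not paragraph.strip():
                continue
            units.append(("paragraph", paragraph))
            for sentence in re.split(r"(?<=[.!?])\s+", paragraph):
                if sentence.strip():
                    units.append(("sentence", sentence))
        for level, text in units:
            for a, b in combinations(sorted(entities_in(text, names)), 2):
                scores[(a, b)] += WEIGHTS[level]
    return dict(scores)

docs = ["CYC1 is controlled by HAP1. Unrelated sentence.\n\nCYC7 is mentioned here."]
print(comention_scores(docs, ["CYC1", "CYC7", "HAP1"]))
```

Here CYC1 and HAP1 share a document, a paragraph, and a sentence (1 + 2 + 4 = 7), whereas CYC7 only shares the document with the other two names (score 1 each).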
NLP
Natural Language Processing
grammatical analysis
Gene and protein names
Cue words for entity recognition
Verbs for relation extraction
[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]
more precise
worse recall
related web resources
STITCH
STRING + 300k chemicals
stitch-db.org
COMPARTMENTS
compartments.jensenlab.org
TISSUES
tissues.jensenlab.org
DISEASES
diseases.jensenlab.org
general framework
curated knowledge
experimental data
text mining
computational predictions
common identifiers
quality scores
visualization
web resources
download files
why so many?
Swiss army knife syndrome
targeted resources
common infrastructure
medical data mining
Jensen et al., Nature Reviews Genetics, 2012
opt-out
opt-in
centralized registries
structured data
Jensen et al., Nature Reviews Genetics, 2012
14 years
6.2 million patients
119 million diagnoses
distributions
Jensen et al., submitted, 2014
diagnosis trajectories
Jensen et al., submitted, 2014
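One building block of a diagnosis trajectory is a directed pair: diagnosis A is first recorded before diagnosis B in the same patient. The sketch below counts such pairs over structured registry-style data. It is an assumption about how one might start, not the method of the cited study; the patient records and the codes "A", "B", "C" are made up, and no correction for confounding factors is attempted.

```python
from collections import Counter
from itertools import combinations

def directed_pair_counts(patients):
    """Count patients in whom one diagnosis is first seen before another.

    patients: dict patient_id -> list of (year, diagnosis_code)
    Returns Counter mapping (earlier_code, later_code) -> number of patients.
    """
    counts = Counter()
    for history in patients.values():
        first_seen = {}
        for year, code in sorted(history):
            first_seen.setdefault(code, year)  # keep the earliest year per code
        for a, b in combinations(sorted(first_seen), 2):
            if first_seen[a] < first_seen[b]:
                counts[(a, b)] += 1
            elif first_seen[b] < first_seen[a]:
                counts[(b, a)] += 1
            # Same-year pairs have no clear direction and are skipped.
    return counts

# Hypothetical toy registry with three patients.
patients = {
    "p1": [(2000, "A"), (2003, "B")],
    "p2": [(2001, "A"), (2002, "B"), (2005, "C")],
    "p3": [(2004, "B"), (2006, "A")],
}
print(directed_pair_counts(patients))
```

Pairs whose counts are strongly skewed in one direction could then be chained into longer trajectories, which would be a separate step not shown here.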
Jensen et al., submitted, 2014
complex trajectories
Jensen et al., submitted, 2014
confounding factors
correlation ≠ causation
electronic health records
unstructured data
Danish
busy doctors
pharmacovigilance
custom dictionaries
drugs
adverse drug events
typo rules
complex filters
Eriksson et al., Drug Safety, 2014
new adverse drug reactions
Eriksson et al., Drug Safety, 2014
Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3
direct medical implications
Acknowledgments
STRING/STITCH: Christian von Mering, Damian Szklarczyk, Michael Kuhn, Manuel Stark, Samuel Chaffron, Chris Creevey, Jean Muller, Tobias Doerks, Philippe Julien, Alexander Roth, Milan Simonovic, Jan Korbel, Berend Snel, Martijn Huynen, Peer Bork
Text mining: Sune Frankild, Jasmin Saric, Evangelos Pafilis, Kalliopi Tsafou, Alberto Santos, Janos Binder, Heiko Horn, Michael Kuhn, Nigel Brown, Reinhardt Schneider, Sean O’Donoghue
EHR mining: Anders Boeck Jensen, Peter Bjødstrup Jensen, Robert Eriksson, Francisco S. Roque, Henriette Schmock, Marlene Dalgaard, Massimo Andreatta, Thomas Hansen, Karen Søeby, Søren Bredkjær, Anders Juul, Tudor Oprea, Pope Moseley, Thomas Werge, Søren Brunak