Posted on 13-Apr-2017

Large-scale integration of data and text

Lars Juhl Jensen

four parts

association networks

guilt by association

molecular networks

STRING

9.6 million proteins

Szklarczyk et al., Nucleic Acids Research, 2015

string-db.org

STITCH

300,000 chemicals

Kuhn et al., Nucleic Acids Research, 2014

stitch-db.org

genomic context

gene fusion

Korbel et al., Nature Biotechnology, 2004

conserved neighborhood

operons

Korbel et al., Nature Biotechnology, 2004

phylogenetic profiles

Korbel et al., Nature Biotechnology, 2004
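The phylogenetic-profile idea can be illustrated with a small sketch: genes that are present and absent in the same genomes are candidates for a functional link. The gene names, the six-genome profiles, and the choice of the Jaccard index as the similarity measure are all assumptions made for illustration; the slides do not say which measure is actually used.

```python
from itertools import combinations

# Hypothetical presence (1) / absence (0) of each gene across six genomes.
profiles = {
    "geneA": (1, 1, 0, 1, 0, 0),
    "geneB": (1, 1, 0, 1, 0, 0),
    "geneC": (0, 1, 1, 0, 1, 1),
}

def profile_similarity(p, q):
    """Jaccard index over the genomes in which either gene is present."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return both / either if either else 0.0

def rank_pairs(profiles):
    """Return (similarity, gene, gene) tuples, most similar pair first."""
    scored = [
        (profile_similarity(profiles[g], profiles[h]), g, h)
        for g, h in combinations(sorted(profiles), 2)
    ]
    return sorted(scored, reverse=True)

ranking = rank_pairs(profiles)
```

Here geneA and geneB share an identical profile and therefore rank first, while geneC overlaps with them in only one genome.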

a real example

Cell

Cellulosomes

Cellulose

experimental data

gene coexpression

physical interactions

Jensen & Bork, Science, 2008

binding assays

activity assays

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

many databases

different formats

different identifiers

variable quality

not comparable

not same species

hard work

(Ph.D. students)

quality scores

von Mering et al., Nucleic Acids Research, 2005

calibrate vs. gold standard

von Mering et al., Nucleic Acids Research, 2005
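One way to picture calibration against a gold standard: rank the pairs by raw score, cut the ranking into bins, and record what fraction of each bin is confirmed by the gold standard. That fraction can then stand in for the raw score as a comparable quality score. The sketch below assumes made-up pairs, raw scores, and gold-standard set, and uses simple fixed-size bins; it is not the procedure from the cited paper.

```python
def calibrate(scored_pairs, gold_positive, bin_size):
    """Bin pairs by descending raw score and report per-bin precision.

    Returns a list of (mean_raw_score, fraction_in_gold_standard) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[1], reverse=True)
    curve = []
    for start in range(0, len(ranked), bin_size):
        chunk = ranked[start:start + bin_size]
        mean_score = sum(score for _, score in chunk) / len(chunk)
        hits = sum(1 for pair, _ in chunk if pair in gold_positive)
        curve.append((mean_score, hits / len(chunk)))
    return curve

# Hypothetical raw scores from one evidence channel.
scored = [
    (("A", "B"), 0.9),
    (("A", "C"), 0.8),
    (("B", "C"), 0.4),
    (("C", "D"), 0.2),
]
# Hypothetical gold standard of trusted associations.
gold = {("A", "B"), ("A", "C"), ("C", "D")}

curve = calibrate(scored, gold, bin_size=2)
```

Because every evidence type is mapped onto the same scale (fraction confirmed by the gold standard), scores from otherwise incomparable sources can be put side by side.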

homology-based transfer

Franceschini et al., Nucleic Acids Research, 2013

missing most of the data

text mining

>10 km

too much to read

computer

as smart as a dog

teach it specific tricks

named entity recognition

comprehensive lexicon

cyclin dependent kinase 1

CDC2

flexible matching

cyclin dependent kinase 1

cyclin-dependent kinase 1

orthographic variation

CDC2

hCdc2

“black list”

SDS
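The ingredients listed above (lexicon, flexible matching, orthographic variation, black list) can be combined in a toy dictionary tagger. This is only a sketch under stated assumptions: the three-entry lexicon, the placeholder identifier strings, the rule that hyphens and spaces are interchangeable, and the rule that a leading "h" may be stripped are all chosen for illustration. SDS is blocked here presumably because in text it usually refers to the detergent rather than to a gene.

```python
import re

# Hypothetical mini-lexicon: normalized name -> placeholder identifier.
LEXICON = {
    "cyclin dependent kinase 1": "CDK1",
    "cdc2": "CDK1",
    "sds": "SDS",
}
# Names that are in the lexicon but too ambiguous to tag.
BLACKLIST = {"sds"}

def normalize(text):
    """Lowercase and treat hyphens and whitespace runs as a single space."""
    return re.sub(r"[-\s]+", " ", text.lower()).strip()

def lookup(surface):
    """Return the matching lexicon key, or None if there is no safe match."""
    norm = normalize(surface)
    candidates = [norm]
    if norm.startswith("h") and len(norm) > 1:
        candidates.append(norm[1:])  # assumed rule: hCdc2 -> cdc2
    for cand in candidates:
        if cand in LEXICON and cand not in BLACKLIST:
            return cand
    return None

def tag(text, max_words=4):
    """Greedy longest-match tagging; returns (surface form, identifier)."""
    tokens = re.findall(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*", text)
    matches, i = [], 0
    while i < len(tokens):
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            surface = " ".join(tokens[i:i + n])
            name = lookup(surface)
            if name is not None:
                matches.append((surface, LEXICON[name]))
                i += n
                break
        else:
            i += 1
    return matches

found = tag("hCdc2 and cyclin-dependent kinase 1 were run on SDS gels")
```

In this example both hCdc2 and the hyphenated long name resolve to the same identifier, and SDS is left untagged.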

text corpus

~22 million abstracts

Medline

~2 million full-text articles

restricted access

information extraction

co-mentioning

counting

within documents

within paragraphs

within sentences

scoring scheme

score calibration
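Co-mention counting at the three levels named above might look roughly like the following. The input is assumed to be already tagged: each document is a list of paragraphs, each paragraph a list of sentences, and each sentence a set of entity names. The weights are hypothetical, chosen so that closer co-mentions count more; the slides do not give the actual scoring scheme, and the calibration step is not shown.

```python
from collections import defaultdict
from itertools import combinations

# Assumed weights, for illustration only.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 3.0}

def pairs(entities):
    """All unordered entity pairs, each as a sorted tuple."""
    return {tuple(sorted(p)) for p in combinations(entities, 2)}

def comention_scores(documents, weights=WEIGHTS):
    """Sum weighted co-mentions; each level counts once per document."""
    scores = defaultdict(float)
    for doc in documents:
        doc_entities, par_pairs, sent_pairs = set(), set(), set()
        for paragraph in doc:
            par_entities = set()
            for sentence in paragraph:
                sent_pairs |= pairs(sentence)
                par_entities |= sentence
            par_pairs |= pairs(par_entities)
            doc_entities |= par_entities
        for pair in pairs(doc_entities):
            scores[pair] += weights["document"]
            if pair in par_pairs:
                scores[pair] += weights["paragraph"]
            if pair in sent_pairs:
                scores[pair] += weights["sentence"]
    return dict(scores)

# Two hypothetical tagged documents.
doc1 = [[{"A", "B"}, {"C"}], [{"A"}]]
doc2 = [[{"A"}], [{"C"}]]
scores = comention_scores([doc1, doc2])
```

A and B share a sentence and so receive all three weights; A and C share only a paragraph in the first document and only the document in the second.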

NLP

Natural Language Processing

grammatical analysis

part-of-speech tagging

what you learned in school

pronoun pronoun verb preposition noun

semantic tagging

words of special interest

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Saric et al., Proceedings of ACL, 2004

more precise

worse recall

reusing the wheel

broadly applicable

experimental data

curated knowledge

text mining

common identifiers

quality scores

unified infrastructure

Swiss army knife syndrome

targeted web resources

COMPARTMENTS

subcellular localization

Binder et al., Database, 2014

compartments.jensenlab.org

TISSUES

tissue expression

Santos et al., PeerJ, 2015

tissues.jensenlab.org

DISEASES

disease–gene associations

Frankild et al., Methods, 2015

diseases.jensenlab.org

medical data mining

Jensen et al., Nature Reviews Genetics, 2012

opt-out

opt-in

structured data

Jensen et al., Nature Reviews Genetics, 2012

unstructured data

Danish

comprehensive lexicon

adverse drug events

drugs

Clozapine

Clozapine

clozapin

clossapin

klozapine

chlosapin

chlosapine

chlozapin

chlozapine

klossapin

closapine

klozapin

klosapin
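The spelling variants above follow a regular pattern (c, ch, or k at the start; z, s, or ss in the middle; optional final e), so for illustration they can be covered by one regular expression. This is a hypothetical shortcut, not the method of the cited study, which the slides describe only as a comprehensive lexicon; a pattern this permissive would need checking against real clinical text before use.

```python
import re

# Assumed pattern covering the variants listed on the slide.
CLOZAPINE = re.compile(r"\b(?:ch?|k)lo(?:z|ss?)apine?\b", re.IGNORECASE)

VARIANTS = [
    "Clozapine", "clozapin", "clossapin", "klozapine",
    "chlosapin", "chlosapine", "chlozapin", "chlozapine",
    "klossapin", "closapine", "klozapin", "klosapin",
]

def mentions_clozapine(text):
    """True if the text contains any spelling the pattern accepts."""
    return CLOZAPINE.search(text) is not None

all_matched = all(mentions_clozapine(v) for v in VARIANTS)
```

An explicit lexicon, as named on the slide, is easier to audit; a pattern generalizes to misspellings that have not been seen yet, at the risk of false matches.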

rule-based extraction

Eriksson et al., Drug Safety, 2014

estimate ADR frequencies

Eriksson et al., Drug Safety, 2014

new adverse drug reactions

Eriksson et al., Drug Safety, 2014

Drug substance        ADE                  p-value
Chlordiazepoxide      Nystagmus            4.0e-8
Simvastatin           Personality changes  8.4e-8
Dipyridamole          Visual impairment    4.4e-4
Citalopram            Psychosis            8.8e-4
Bendroflumethiazide   Apoplexy             8.5e-3

direct medical implications

Acknowledgments

Molecular networks: Michael Kuhn, Damian Szklarczyk, Andrea Franceschini, Milan Simonovic, Alexander Roth, Sune Pletscher-Frankild, Christian von Mering, Peer Bork

General framework: Sune Pletscher-Frankild, Alberto Santos, Kalliopi Tsafou, Albert Palleja, Janos Binder, Christian Stolte, Christos Arvanitidis, Reinhardt Schneider, Sean O’Donoghue

EHR mining: Anders Boeck Jensen, Robert Eriksson, Peter Bjødstrup Jensen, Andreas Bok Andersen, Sune Pletscher-Frankild, Thomas Werge, Søren Brunak