Applied text mining Lars Juhl Jensen

Post on 15-Jul-2015

>10 km

too much to read

exponential growth

~40 seconds per paper

computer

as smart as a dog

teach it specific tricks

information retrieval

named entity recognition

information extraction

text/data integration

medical text mining

information retrieval

find the relevant papers

ad hoc retrieval

user-specified query

“yeast AND cell cycle”

PubMed

indexing

fast lookup
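
The indexing idea behind ad hoc retrieval can be sketched as an inverted index: each word points to the documents that contain it, so a Boolean AND query becomes a set intersection. This is a minimal illustration only; the toy documents and the function names `build_index` and `search_and` are invented here and do not describe how PubMed is implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lower-cased token to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search_and(index, terms):
    """Return ids of documents that contain every query term (Boolean AND)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

# Toy collection; ids and texts are invented for illustration.
docs = {
    1: "yeast cell cycle regulation",
    2: "human cell cycle checkpoints",
    3: "yeast metabolism",
}
index = build_index(docs)
hits = search_and(index, ["yeast", "cell", "cycle"])
```

Here `hits` contains only document 1, the single toy document that mentions all three words. The lookup cost depends on the size of the posting sets, not on the size of the whole collection, which is what makes it fast.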

stemming

word endings

dynamic query expansion

MeSH terms
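
Stemming and query expansion can be sketched together: word endings are stripped before comparison, and each query term is widened with related terms from a thesaurus. The suffix list, the one-entry `THESAURUS`, and all function names below are assumptions made for illustration; a real system would use a proper stemmer and a controlled vocabulary such as MeSH.

```python
# Hypothetical one-entry thesaurus standing in for a controlled vocabulary.
THESAURUS = {"yeast": ["saccharomyces"]}

def crude_stem(word):
    """Strip a few common endings; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def expand_query(terms, thesaurus=THESAURUS):
    """Turn each query term into a set of acceptable stems (term plus synonyms)."""
    expanded = []
    for term in terms:
        stem = crude_stem(term.lower())
        variants = {stem}
        for synonym in thesaurus.get(stem, []):
            variants.add(crude_stem(synonym))
        expanded.append(variants)
    return expanded

def matches(text, expanded):
    """True if the text contains at least one variant of every query term."""
    stems = {crude_stem(token) for token in text.lower().split()}
    return all(variants & stems for variants in expanded)

query = expand_query(["yeasts", "cycle"])
found = matches("Saccharomyces cell cycles", query)
missed = matches("human cell cycle", query)
```

With these toy settings `found` is true even though the text never uses the word "yeast", while `missed` is false because no variant of the first query term occurs.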

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

no tool will find that

named entity recognition

identify the concepts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

comprehensive lexicon

CDC2

cyclin dependent kinase 1

orthographic variation

flexible matching

upper- and lower-case

CDC2

Cdc2

spaces and hyphens

cyclin dependent kinase 1

cyclin-dependent kinase 1

name expansions

prefixes and suffixes

CDC2

hCDC2

“black list”

SDS
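
The matching rules above (case, spaces and hyphens, prefixes, black list) can be sketched as a small dictionary-based tagger. Everything here is an assumption for illustration: the three-entry `LEXICON`, the identifiers it maps to, the single prefix `h`, and the regular-expression approach. It is not the tagger described in the cited paper and makes no attempt to be efficient.

```python
import re

# Toy lexicon mapping names to illustrative identifiers (assumed, not a real resource).
LEXICON = {
    "CDC2": "CDK1",
    "cyclin dependent kinase 1": "CDK1",
    "SDS": "SDS",
}
BLACKLIST = {"sds"}   # names that are too ambiguous to tag
PREFIXES = ("h",)     # optional prefixes, e.g. "h" as in hCDC2

def name_pattern(name):
    """Build a case-insensitive pattern tolerant of space/hyphen variation."""
    parts = [re.escape(part) for part in re.split(r"[\s-]+", name)]
    body = r"[\s-]?".join(parts)
    prefix = "(?:" + "|".join(PREFIXES) + ")?"
    # The lookarounds keep matches from starting or ending inside a longer token.
    return re.compile(
        r"(?<![A-Za-z0-9])" + prefix + body + r"(?![A-Za-z0-9])",
        re.IGNORECASE,
    )

def tag(text):
    """Return sorted (start, end, matched text, identifier) tuples."""
    found = []
    for name, identifier in LEXICON.items():
        if name.lower() in BLACKLIST:
            continue
        for m in name_pattern(name).finditer(text):
            found.append((m.start(), m.end(), m.group(0), identifier))
    return sorted(found)

text = "hCDC2, Cdc2 and cyclin-dependent kinase 1 were run on SDS gels"
tags = tag(text)
```

In this example the three name variants are all tagged with the same identifier, and the black-listed name is left alone even though it is in the lexicon.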

efficient tagger

Pafilis et al., PLOS ONE, 2013

benchmarking

the formal way

manually annotated corpus

precision

recall

much work

the pragmatic way

random sampling

precision

no recall

much less work
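
The two benchmarking routes can be sketched side by side. With a manually annotated corpus both precision and recall can be computed; with random sampling only the tagger's own output is inspected, so only precision can be estimated. The function names, the `is_correct` callback (standing in for a human judging each sampled match), and the toy data are assumptions for illustration.

```python
import random

def precision_recall(predicted, gold):
    """Formal way: compare predictions against a manually annotated gold set."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall

def sampled_precision(predicted, is_correct, sample_size, seed=0):
    """Pragmatic way: judge a random sample of predictions; recall stays unknown."""
    items = list(predicted)
    rng = random.Random(seed)
    sample = rng.sample(items, min(sample_size, len(items)))
    if not sample:
        return 0.0
    return sum(1 for item in sample if is_correct(item)) / len(sample)

# Toy data: four predicted matches, three annotated ones.
predicted = ["a", "b", "c", "d"]
gold = {"a", "b", "e"}
precision, recall = precision_recall(predicted, gold)
estimate = sampled_precision(predicted, lambda item: item in gold, sample_size=4)
```

The sampled estimate never looks at entities the tagger missed, which is why it costs much less annotation effort and why it says nothing about recall.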

augmented browsing

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

Reflect

Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009

reflect.ws

information extraction

formalize the facts

Mitotic cyclin (Clb2)-bound Cdc28 (Cdk1 homolog) directly phosphorylated Swe1 and this modification served as a priming step to promote subsequent Cdc5-dependent Swe1 hyperphosphorylation and degradation

two approaches

the formal way

NLP

Natural Language Processing

part-of-speech tagging

what you learned in school

pronoun pronoun verb preposition noun

multiword detection

semantic tagging

sentence parsing

Gene and protein names

Cue words for entity recognition

Verbs for relation extraction

[nxexpr The expression of [nxgene the cytochrome genes [nxpg CYC1 and CYC7]]] is controlled by [nxpg HAP1]

Saric et al., Proceedings of ACL, 2004

extract stated facts

high precision

poor recall
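
Why the formal way gives high precision but poor recall can be shown with a deliberately simple stand-in: a single hand-written pattern over text whose gene names are already known. This is far cruder than the part-of-speech tagging and parsing pipeline outlined above; the pattern, the `GENES` set, and the relation label are assumptions for illustration.

```python
import re

# Gene names assumed to have been recognized already (toy set).
GENES = {"CYC1", "CYC7", "HAP1"}

# One explicit rule: "expression of <targets> is controlled by <regulator>".
PATTERN = re.compile(
    r"expression of (?P<targets>.+?) is controlled by (?P<regulator>\w+)",
    re.IGNORECASE,
)

def extract_regulation(sentence):
    """Return (regulator, relation, target) triples stated by the sentence."""
    match = PATTERN.search(sentence)
    if not match or match.group("regulator") not in GENES:
        return []
    targets = [
        word for word in re.findall(r"\w+", match.group("targets"))
        if word in GENES
    ]
    return [(match.group("regulator"), "controls expression of", t) for t in targets]

stated = extract_regulation(
    "The expression of the cytochrome genes CYC1 and CYC7 is controlled by HAP1"
)
unmatched = extract_regulation("HAP1 activates CYC1")
```

The first sentence yields two triples with HAP1 as regulator; the second states a similar fact in different words and yields nothing, which is the recall problem in miniature.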

the pragmatic way

guilt by association

co-mentioning

counting

within documents

within paragraphs

within sentences

quality score

high recall

high precision

undirected associations

unknown type
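
Co-mention counting can be sketched as follows: every pair of entities mentioned together adds to a score, with closer co-occurrence (same sentence) counting more than looser co-occurrence (same paragraph or document). The weights and the nested-list input format are assumptions for illustration and are not the scoring scheme of any particular resource. Pairs are stored in sorted order because the associations are undirected.

```python
from collections import Counter
from itertools import combinations

# Hypothetical weights; closer co-mentions contribute more.
WEIGHTS = {"document": 1.0, "paragraph": 2.0, "sentence": 4.0}

def comention_scores(documents):
    """Score entity pairs by weighted co-mention counts.

    documents: list of documents, each a list of paragraphs,
    each a list of sentences, each a set of entity names.
    """
    scores = Counter()
    for document in documents:
        doc_entities = set()
        for paragraph in document:
            par_entities = set()
            for sentence in paragraph:
                for pair in combinations(sorted(sentence), 2):
                    scores[pair] += WEIGHTS["sentence"]
                par_entities |= set(sentence)
            for pair in combinations(sorted(par_entities), 2):
                scores[pair] += WEIGHTS["paragraph"]
            doc_entities |= par_entities
        for pair in combinations(sorted(doc_entities), 2):
            scores[pair] += WEIGHTS["document"]
    return scores

# One toy document with two paragraphs.
documents = [
    [
        [{"Cdc28", "Swe1"}, {"Cdc5"}],  # paragraph 1: two sentences
        [{"Clb2"}],                     # paragraph 2: one sentence
    ]
]
scores = comention_scores(documents)
```

A pair mentioned in the same sentence is also counted at paragraph and document level, so Cdc28 and Swe1 score 7.0 here, while Clb2 and Swe1, which only share the document, score 1.0. The score says the two are associated but not how or in which direction.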

text/data integration

STRING

protein associations

Szklarczyk et al., Nucleic Acids Research, 2015

string-db.org

STITCH

STRING + 300k chemicals

Kuhn et al., Nucleic Acids Research, 2014

stitch-db.org

COMPARTMENTS

subcellular localization

Binder et al., Database, 2014

compartments.jensenlab.org

TISSUES

tissue expression

tissues.jensenlab.org Santos et al., submitted, 2015

DISEASES

disease–gene associations

diseases.jensenlab.org Frankild et al., Methods, 2015

curated knowledge

pathways

Letunic & Bork, Trends in Biochemical Sciences, 2008

experimental data

gene expression

computational predictions

gene neighborhood

Korbel et al., Nature Biotechnology, 2004

many databases

different formats

different identifiers

variable quality

not comparable

hard work

common identifiers

quality scores

score calibration

visualization

web interfaces

bulk download
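
Score calibration, one of the steps listed above, can be sketched as binning raw scores and measuring, for each bin, what fraction of the scored pairs appear in a gold standard. Raw scores from different sources then become comparable as estimated reliabilities. The binning approach, the function name `calibrate`, and the toy data are assumptions for illustration; a real resource might fit a smooth calibration curve instead.

```python
def calibrate(scored_pairs, gold, bin_edges):
    """Return, for each score bin, the fraction of pairs found in the gold standard.

    scored_pairs: dict mapping a pair to its raw score
    gold: set of pairs accepted as true
    bin_edges: ascending edges; bin i covers [edges[i], edges[i + 1])
    """
    counts = [[0, 0] for _ in range(len(bin_edges) - 1)]  # [total, hits] per bin
    for pair, score in scored_pairs.items():
        for i in range(len(bin_edges) - 1):
            if bin_edges[i] <= score < bin_edges[i + 1]:
                counts[i][0] += 1
                counts[i][1] += pair in gold
                break
    return [hits / total if total else None for total, hits in counts]

# Toy raw scores and gold standard, invented for illustration.
scored_pairs = {
    ("A", "B"): 7.0,
    ("A", "C"): 6.0,
    ("B", "C"): 1.0,
    ("C", "D"): 2.0,
}
gold = {("A", "B"), ("A", "C"), ("C", "D")}
calibrated = calibrate(scored_pairs, gold, bin_edges=[0, 5, 10])
```

In this toy case half of the low-scoring pairs and all of the high-scoring pairs are in the gold standard, so a raw score in the upper bin would be reported with higher confidence.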

why so many resources?

Swiss army knife syndrome

EMBO Practical Course Computational Biology: Genomes to Systems

Puerto Varas, 3-9 April 2014

Thanks for your attention!
