dictionary-based named entity recognition
Lars Juhl Jensen
Dictionary-based named entity recognition
>10 km
too much to read
computer
as smart as a dog
teach it specific tricks
named entity recognition
comprehensive dictionary
synonyms
cyclin dependent kinase 1
CDC2
normalization
CDK1_HUMAN
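The normalization step above can be sketched as a plain lookup table in which every synonym points to a single identifier. This is a minimal illustration, not the speaker's implementation; the table `SYNONYMS` and the function `normalize` are hypothetical names, and lowercasing the input is an assumption.

```python
# Toy synonym dictionary: each known name maps to one normalized identifier.
# Only the two names shown on the slides are included.
SYNONYMS = {
    "cyclin dependent kinase 1": "CDK1_HUMAN",
    "cdc2": "CDK1_HUMAN",
}


def normalize(name):
    """Return the identifier for a name, or None if the name is unknown."""
    # Assumption: names are compared case-insensitively.
    return SYNONYMS.get(name.lower())


print(normalize("CDC2"))
print(normalize("Cyclin dependent kinase 1"))
```

Both calls resolve to the same identifier, which is the point of normalization: different surface names, one entity.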
dictionary compilation
genes/proteins
UniProtKB
Ensembl
RefSeq
chemical compounds
PubChem
species/organisms
NCBI Taxonomy
functions
pathways
compartments
Gene Ontology
tissues
Brenda Tissue Ontology
diseases
Disease Ontology
phenotypes
Human Phenotype Ontology
environments
Environment Ontology
filters
redundant terms
insulin
broad synonyms
CDK holoenzyme
related synonyms
polyubiquitin
wrong synonyms
BRCA1
dictionary expansion
shortened forms
protein kinase activity
protein kinase
Wnt signaling pathway
Wnt signaling
synonymous forms
metabolic disease
metabolic disorder
plural forms
protein kinase
protein kinases
mitochondrion
mitochondria
cancer
cancers
adjective forms
mitochondrion
mitochondrial
abbreviated forms
Escherichia coli
E. coli
prefixes and suffixes
CDC2
hCDC2
mCDC2
Cdc28
Cdc28p
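The expansion rules above can be sketched as small string transformations. The function names, the irregular-plural table, and the list of droppable head words are all illustrative assumptions; a real dictionary would need far more rules and exceptions.

```python
# Assumed rule tables, seeded only with the examples from the slides.
IRREGULAR_PLURALS = {"mitochondrion": "mitochondria"}
DROPPABLE_HEADS = ("activity", "pathway")


def plural_form(term):
    """Irregular plural if listed, otherwise a naive trailing 's'."""
    return IRREGULAR_PLURALS.get(term, term + "s")


def shortened_form(term):
    """Drop a trailing generic head word such as 'activity' or 'pathway'."""
    words = term.split()
    if len(words) > 1 and words[-1] in DROPPABLE_HEADS:
        return " ".join(words[:-1])
    return term


def abbreviated_species(binomial):
    """Abbreviate the genus of a two-word species name."""
    genus, epithet = binomial.split()
    return f"{genus[0]}. {epithet}"


def prefixed_forms(name, prefixes=("h", "m")):
    """Add organism prefixes (commonly h for human, m for mouse)."""
    return [prefix + name for prefix in prefixes]


def suffixed_forms(name, suffixes=("p",)):
    """Add a protein suffix, as in the yeast convention Cdc28p."""
    return [name + suffix for suffix in suffixes]


print(plural_form("protein kinase"), plural_form("mitochondrion"))
print(shortened_form("Wnt signaling pathway"))
print(abbreviated_species("Escherichia coli"))
print(prefixed_forms("CDC2"), suffixed_forms("Cdc28"))
```

Every generated variant is a new dictionary entry, which is why expansion makes the dictionary much larger and introduces new ambiguity.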
huge dictionary
additional ambiguity
handling ambiguity
three options
allow
disallow
disambiguate
acceptable ambiguity
orthologous genes
overlapping ontologies
disease
phenotype
unacceptable ambiguity
unrelated entities
APC
adenomatous polyposis coli
anaphase promoting complex
disambiguation
ranking of name sources
remove unlikely meanings
acronym definitions
three-letter acronym (TLA)
other names mentioned
C. sativa
Camelina sativa
Cannabis sativa
Castanea sativa
marijuana
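One way to read the "other names mentioned" strategy: an ambiguous abbreviation is resolved only when the same text also contains a full or alternative name of exactly one candidate. The sketch below assumes this reading. Apart from "marijuana", the context names are illustrative additions, and `disambiguate` is a hypothetical helper.

```python
# Candidate expansions of the ambiguous abbreviation, as listed on the slide.
CANDIDATES = {
    "C. sativa": ["Camelina sativa", "Cannabis sativa", "Castanea sativa"],
}

# Other names that hint at one species. Only "marijuana" is from the slides;
# "camelina" and "chestnut" are assumptions added for illustration.
CONTEXT_NAMES = {
    "Camelina sativa": ["camelina"],
    "Cannabis sativa": ["marijuana"],
    "Castanea sativa": ["chestnut"],
}


def disambiguate(abbreviation, text):
    """Return the single candidate supported by the text, else None."""
    lowered = text.lower()
    supported = [
        species
        for species in CANDIDATES.get(abbreviation, [])
        if species.lower() in lowered
        or any(name in lowered for name in CONTEXT_NAMES.get(species, []))
    ]
    # Resolve only when exactly one meaning has supporting evidence.
    return supported[0] if len(supported) == 1 else None


print(disambiguate("C. sativa", "Seeds of C. sativa (marijuana) were analysed."))
print(disambiguate("C. sativa", "C. sativa was grown in the field."))
```

Returning `None` when no evidence or conflicting evidence is found leaves the choice between allowing and disallowing the ambiguous name to the caller.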
species autodetection
two rounds of NER
species/organisms
genes/proteins
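The two rounds can be sketched as follows, assuming the first round finds the species mentioned in a text and the second round matches gene names only against the dictionaries of those species. The dictionaries are toy data, the function names are hypothetical, and the mouse identifier is included purely for illustration.

```python
import re

# Round 1 dictionary: species names (toy data, single-word names only).
SPECIES_NAMES = {"human": "Homo sapiens", "mouse": "Mus musculus"}

# Round 2 dictionaries: gene names per species. The same name exists in
# both species, so the species context decides which identifier is used.
GENE_NAMES = {
    "Homo sapiens": {"cdc2": "CDK1_HUMAN"},
    "Mus musculus": {"cdc2": "CDK1_MOUSE"},  # illustrative identifier
}


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def detect_species(text):
    """Round 1: return the set of species mentioned in the text."""
    return {SPECIES_NAMES[t] for t in tokenize(text) if t in SPECIES_NAMES}


def tag_genes(text):
    """Round 2: match gene names only for the detected species."""
    species = sorted(detect_species(text))
    hits = []
    for token in tokenize(text):
        for organism in species:
            identifier = GENE_NAMES.get(organism, {}).get(token)
            if identifier is not None:
                hits.append((token, identifier))
    return hits


print(tag_genes("CDC2 is essential in mouse cells"))
print(tag_genes("CDC2 is essential"))
```

With no species mentioned, this sketch tags nothing; a real system would need a fallback policy for that case.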
text matching
uppercase / lowercase
spaces and hyphens
punctuation
too many variants
flexible matching
finite state automaton
LINNAEUS
custom hash function
C++ tagger
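Flexible matching can be approximated by reducing both dictionary names and text mentions to the same key before comparing them, so that case, spaces, hyphens, and punctuation no longer matter. The sketch below uses a Python dict in place of the custom hash function; it is not the C++ tagger's actual code, and `match_key` and `lookup` are hypothetical names.

```python
import re


def match_key(name):
    """Lowercase and drop everything except ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


# The dictionary is keyed on the reduced form, so one entry covers
# every spelling variant instead of listing each variant explicitly.
DICTIONARY = {
    match_key("cyclin dependent kinase 1"): "CDK1_HUMAN",
    match_key("CDC2"): "CDK1_HUMAN",
}


def lookup(mention):
    """Return the identifier for a mention, or None if nothing matches."""
    return DICTIONARY.get(match_key(mention))


print(lookup("Cyclin-dependent kinase 1"))
print(lookup("cdc-2"))
```

The design trade-off is the one named on the slides: enumerating all variants makes the dictionary explode, whereas normalizing at match time keeps it small.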
efficiency
Pafilis et al., PLOS ONE, 2013
performance
~85% precision
~75% recall
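As a reminder of what the two figures measure: precision is the fraction of tagged names that are correct, and recall is the fraction of true mentions that were tagged. The counts below are hypothetical, chosen only so that the results land near the reported values; they are not data from the cited evaluation.

```python
def precision(true_positives, false_positives):
    """Fraction of tagged names that are correct."""
    return true_positives / (true_positives + false_positives)


def recall(true_positives, false_negatives):
    """Fraction of true mentions that were tagged."""
    return true_positives / (true_positives + false_negatives)


# Hypothetical counts: 100 true mentions, 75 found, 13 spurious tags.
tp, fp, fn = 75, 13, 25

print(f"precision = {precision(tp, fp):.2f}")
print(f"recall    = {recall(tp, fn):.2f}")
```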
“black list”
bad names
SDS
a
an
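A black list can be applied as a filter on the tagger's output: any match whose text is a known bad name is discarded. This is a minimal sketch; the match representation as `(text, identifier)` pairs and the case-insensitive comparison are assumptions, and the identifiers in the example are placeholders.

```python
# Bad names from the slides, stored in lowercase.
BLACK_LIST = {"sds", "a", "an"}


def filter_matches(matches, black_list=BLACK_LIST):
    """Drop matches whose text is black-listed (case-insensitive)."""
    return [
        (text, identifier)
        for text, identifier in matches
        if text.lower() not in black_list
    ]


# Placeholder identifiers, for illustration only.
raw = [("CDC2", "CDK1_HUMAN"), ("SDS", "ID_1"), ("an", "ID_2")]
print(filter_matches(raw))
```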
web resources
indexing of literature
term co-occurrence
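Term co-occurrence can be sketched as counting, for every pair of entities, the number of documents in which both are mentioned. The document contents below are invented toy data, and `cooccurrence_counts` is a hypothetical helper; real resources use weighted scores rather than raw counts.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents):
    """Count the documents in which each pair of entities co-occurs."""
    counts = Counter()
    for entities in documents:
        # Deduplicate and sort so each pair has one canonical order.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts


# Toy input: the entities recognized in each of three documents.
documents = [
    ["CDC2", "APC"],
    ["APC", "CDC2", "BRCA1"],
    ["BRCA1"],
]

for pair, count in sorted(cooccurrence_counts(documents).items()):
    print(pair, count)
```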
iHOP
Hoffmann & Valencia, Nature Genetics, 2004
www.ihop-net.org
STRING
Szklarczyk et al., Nucleic Acids Research, 2015
string-db.org
real-time text mining
Reflect
augmented browsing
Pafilis, O’Donoghue, Jensen et al., Nature Biotechnology, 2009
reflect.ws
EXTRACT
interactive annotation
Pafilis et al., Proceedings of BioCreative V, 2015
extract.hcmr.gr