
Post on 18-Dec-2015

The Value of a Vocabulary of Science

(for Improved Search Quality)

1) Queries and Documents

2) Terms and Variants

3) Search Results and Refinements

Franz Guenthner

Centrum für Informations- und Sprachverarbeitung (CIS)

LMU-München

[email protected]

Introduction and Overview

• Inadequacies of current search engines (Google, Alltheweb, etc.), or: why is search still unsatisfactory (esp. for scientific searches)?

• Recent improvements for scientific search on Scirus (Categorized Search) and others

• New approaches: Patent/Medline Demos (dynamic result set indexes (DRSI) based on „interesting“ vocabulary selection)

Queries and Documents

• The two main problems with Search Quality
– Literal evaluation of queries
– All queries are dealt with in the same way (as far as evaluation and ranking are concerned)

• Any kind of term variation will (almost always!!) yield different results (even singular/plural)

• No special attention (with minor exceptions) to the specific form or even the language of queries

• The query evaluator „knows“ nothing about the terms in the query (besides their frequency)

• Result presentation depends in no way on the „intention“ of the queries

Q = A B C

The Query & Document Matrix

• A Classification of Query Features
– General Queries (cancer, medical schools)
– Problem Queries (methods for treating brain tumors in dogs)
– Specific Queries (g-protein chemokine receptor ccr5 hdgnr1)

• A Classification of Document Features
– Content (keywords, phrases, names, addresses, etc.)
– Format (download, gif, bibliography, homepage, etc.)
– Reference (metatags, external and internal anchor texts, citations, etc.)

The Query/Document Matrix

                    Content    Format    Reference
General queries
Problem queries
Specific queries

Remarks on Queries

Query distribution
• Top 10 = 1 %
• Top 100 = 3 %
• Top 1000 = 7 %
• Top 10000 = 15 %
• Top 100000 = 28 %
• Top 1000000 = 44 %
• Top 10000000 = 60 %
• Top 100000000 = 99 %

(Data from an Altavista query log from June/July 1998, about 750 million queries and about 250 million distinct queries)

• About 1/3 of all queries (per day) are unique

• More than 1/10 of all queries are orthographic variants of other queries

• About 1/3 of all „correct“ queries are variants of more frequent queries

• Average query frequency of 3

• There is a high degree of partial repetition in queries
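Figures like the ones above can be derived from any raw query log. The following is a minimal sketch, assuming the log is simply a list of query strings; the function names and the toy log are illustrative and are not taken from the Altavista data.

```python
from collections import Counter

def top_k_coverage(queries, ks):
    """Share of all query occurrences covered by the k most frequent distinct queries."""
    counts = Counter(queries)
    total = sum(counts.values())
    # Frequencies sorted from the most to the least frequent query
    freqs = sorted(counts.values(), reverse=True)
    return {k: sum(freqs[:k]) / total for k in ks}

def unique_share(queries):
    """Share of query occurrences whose query string appears exactly once in the log."""
    counts = Counter(queries)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / sum(counts.values())

# Toy log, invented for illustration only
log = ["sensor", "sensor", "sensor", "water", "water", "oil", "3d gis", "dna"]
print(top_k_coverage(log, [1, 2]))
print(unique_share(log))
```

On a real log the same two functions would give the "Top N" coverage figures and the share of one-time queries.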

Remarks on Queries

• 1 time queries = 23 %
• 1 or 2 time queries = 36 %
• <= 3 time queries = 40 %
• <= 10 time queries = 50 %
• <= 100 time queries = 67 %
• <= 1000 time queries = 83 %
• <= 10000 time queries = 94 %
• <= 100000 time queries = 99 %

The Head/Container Structure of Queries

• Typical queries (to all search engines) have always been very short (2.4 words on average)

• Longer queries are typically of the form
– Head + Container
– Or a combination of heads and containers
Example: „statistics on infant mortality“

• What is a head?
• What is a container?
• Close relation to predicates and arguments in natural language
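One way to make the head/container idea concrete is a pattern-based split. This is only a sketch under an assumption the slides leave open: it reads „statistics on“ as the container and „infant mortality“ as the head. The pattern list is hypothetical and a real system would need a large, curated inventory of containers.

```python
import re

# Hypothetical container patterns, for illustration only
CONTAINER_PATTERNS = [
    r"statistics on",
    r"methods for",
    r"recent papers on",
    r"history of",
]
_CONTAINER_RE = re.compile(
    r"^(?P<container>" + "|".join(CONTAINER_PATTERNS) + r")\s+(?P<head>.+)$",
    re.IGNORECASE,
)

def split_query(query):
    """Split a query into an (assumed) container and head; container is None if no pattern matches."""
    query = query.strip()
    match = _CONTAINER_RE.match(query)
    if match:
        return {"container": match.group("container").lower(), "head": match.group("head")}
    return {"container": None, "head": query}

print(split_query("statistics on infant mortality"))
print(split_query("infant mortality"))
```

A query without a recognized container is treated as a bare head, which matches the observation that most queries are very short.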

Most frequent queries on www.scirus.com
• 2093656 sensor
• 2058653 sensors
• 448114 water
• 425579 analysis
• 417029 acid
• 339454 cell
• 296468 oil
• 292716 design
• 291155 science
• 287315 gas
• 285510 management
• 272517 model
• 271354 review
• 250483 system
• 246219 "actuators"
• 239109 protein
• 237993 carbon
• 228638 heat
• 223321 polymer
• 222793 plant
• 220677 synthesis
• 219260 structure
• 216344 film
• 215497 food
• 214434 laser
• 211470 control
• 209631 journal
• 208670 human
• 208452 "electrochemical"
• 204949 cancer
• 198333 metal
• 197929 soil
• 196628 field
• 195406 temperature
• 189066 "ultrasonic"
• 187696 thermal
• 186828 "Human"
• 182020 treatment

Most frequent queries on www.scirus.com
• 181272 "chemical sensors"
• 180006 "detectors"
• 179783 "gas sensors"
• 179739 energy
• 176521 hydrogen
• 176461 processing
• 173274 "electrodes"
• 172220 surface
• 171170 growth
• 171041 research
• 170652 flow
• 169209 effect
• 168216 theory
• 168041 fish
• 166366 production
• 164540 paper
• 163423 chemistry
• 162353 substrate
• 161866 data
• 161851 stress
• 161346 IBRO
• 159778 conference
• 158848 method
• 156280 chemical
• 155877 systems
• 155138 properties
• 150150 semiconductor
• 148528 oxide
• 146871 "temperature sensor"
• 146567 phase
• 146387 engineering
• 145979 development
• 143533 climate
• 143141 artificial
• 142927 drag
• 141812 "biosensors"
• 139159 fiber
• 138854 technology

Most frequent queries on www.scirus.com
• 138334 disease
• 138145 "Animal"
• 137970 copper
• 137694 reduction
• 136025 education
• 135039 transfer
• 135005 "sensor array"
• 134380 test
• 133729 organic
• 133578 sheet
• 133453 leaf
• 132996 well
• 132058 "electrode"
• 131153 process
• 129228 DNA
• 128631 determination
• 127745 health
• 127539 extraction
• 126398 drug
• 125754 vision
• 124914 liquid
• 123880 membrane
• 123511 chromatography
• 123509 bacteria
• 123001 parallel
• 122960 molecular
• 122677 cells
• 122157 "Utility"
• 122026 drill
• 121637 Journal
• 121582 pressure
• 121421 oxidation
• 121148 "Support, Non U.S. Gov't"
• 120825 geologic
• 120713 steel

Queries in the middle range on www.scirus.com

• 402 Hazardous waste management policies
• 402 "genetic population structure"
• 402 "fourier transform ion cyclotron resonance"
• 402 "experimental characterization"
• 402 "eukaryotic initiation factor"
• 402 "enterprise architecture"
• 402 "DNA Directed DNA Polymerase"
• 402 "breast cancer cell line"
• 402 biodiversity conservation
• 402 "Avoidant Personality Disorder"
• 402 abstraction aggregation association classification instantiation decomposition
• 401 "water quality criteria"
• 401 "superoxide dismutase activity"
• 401 "ornithine decarboxylase"
• 401 "non verbal communication"
• 401 "international whaling commission"
• 401 comparison of parametric classification procedures
• 399 "scanning transmission electron microscopy"
• 380 Preparation properties of serum plasma proteins
• 357 asymmetric information in financial markets
• 353 free amino acids chromatography
• 351 "expression of epidermal growth factor"
• 348 rhodamine 6g lifetime measurements
• 343 cell cycle regulation of transcription
• 342 executive compensation in america : optimal contracting extraction of rents
• 339 "progress in nuclear magnetic resonance spectroscopy"
• 335 "electron spectroscopy for chemical analysis"

Queries in the middle range on www.scirus.com

• 334 "composition morphology studies of ultrathin CaF"

• 332 use modelling probabilistic approaches in risk assessment

• 332 "Gene Expression Regulation, Neoplastic"

• 331 "translationally controlled tumor protein"

• 331 "metabotropic glutamate receptor subtype"

• 330 "electrospray tandem mass spectrometry"

• 327 "maintenance scheduling" "case based reasoning"

• 322 stereoselectivity, electron transfer, complex

• 322 diagnosis of disease electron microscopy

• 320 "[Physical Astronomy Classification Scheme] 92.40.Qk"

• 320 "high resolution electron energy loss"

• 320 "fourier transform infrared spectrometry"

• 319 "methylotrophic yeast pichia pastoris"

• 319 ambulatory blood pressure monitoring

• 317 "comparative concentration analysis of cr co in fesi"

Rare (specific) queries on www.scirus.com

• 50 "bone morphogenetic protein 3"
• 50 arrangement of subunits in polymeric proteins overlapping epitopes
• 50 antocyanin* grape*
• 50 Antigens from Echinococcus granulosus post translational modifications
• 50 "alkaline single cell gel electrophoresis"
• 50 albicans streptococcus
• 50 adrenomyeloleukodystrophy
• 50 "adrenomedullin expression"
• 50 8 OH DPAT
• 50 8 hydroxyquinoline iron
• 50 68hc11 interrupts
• 50 5 phosphate
• 50 "5 Lipoxygenase"
• 50 5 hydroxytryptophan gonadotropin
• 50 5 hydroxymethyl
• 50 5 chloro 1 indanone
• 50 "4 bromobenzophenone"
• 50 3 methyl 3 buten 1 ol
• 50 3dstudio
• 50 3D QSAR
• 50 3d gis
• 50 3 cyclohexen 1 yl
• 50 "3 aminopropyltriethoxysilane"
• 50 3 Aminopropyl
• 50 3' 5' exonuclease
• 50 3,4 dimethoxyphenyl
• 50 3,4 dihydroxyphenylpyruvic
• 50 2 phenylindole
• 50 2 methyl 3 buten 2 ol
• 50 2H NMR imaging of stress in strained elastomers.

The Structure of Scientific Vocabulary (compared to general English)

• Size of elementary forms in English = ca. 300K to 400K

• Size of phrases in general English = many millions

• Basic assumption: domain-specific (in particular, scientific) vocabulary should be studied and maintained along similar lines to general vocabulary

• How to classify scientific vocabulary
– Morpho-syntactic forms and variants
– Semantic Types
– Scientific Domains

Terms and Variants

• What is a „scientific term“?
– A domain-specific expression that can take many forms

• The very large majority of scientific terms are (very unfortunately!!) not controlled (by authors, editors, publishers, readers) in any way:
– Orthography
– Morphology
– Syntax
– Semantics

• There are probably more than 50 million (!!) such terms in English scientific literature

Tomography on Scirus

• 21874 tomography
• 4592 "emission tomography"
• 2239 positron emission tomography
• 2130 "Tomography, X Ray Computed"
• 594 optical coherence tomography
• 591 "computed tomography"
• 585 "Tomography, Emission Computed"
• 511 microtomography
• 435 "seismic tomography"
• 430 "process tomography"
• 421 "optical coherence tomography"
• 383 "single photon emission computed tomography"
• 343 "emission computed tomography"
• 274 "Tomography, Emission Computed, Single Photon"
• 256 electrical impedance tomography
• 244 "optical tomography"
• 232 "electrical impedance tomography"
• 225 "industrial process tomography"
• 222 "acoustic tomography"
• 199 reflection tomography

Example: „Tomography“ in EB

• computed tomography
• computerized axial tomography
• linear tomography
• multidirectional tomography
• ocean acoustic tomography
• positron emission tomography
• tomography

Uses of Scientific Vocabulary

• „Perfect“ Spell-checking
Examples
– Misspelled: thrommbocytopennia hrombopoetin
– Corrected: thrombocytopenia thrombopoietin

• Extract automatically (most/all) scientific terms from a text

• Classify texts into many scientific domains
– Remark on the number of domains

• Find „related“ documents (in time, in content, etc.)

• Structure the evolution of scientific thought and scientific progress via vocabulary trends and evolution
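The spell-checking use case can be sketched with a plain similarity lookup against a term list. This is a minimal illustration, assuming a tiny hand-made vocabulary and Python's standard `difflib`; the function name `correct` and the 0.8 cutoff are choices made for this example, not part of the approach described in the slides.

```python
import difflib

# Tiny stand-in for a vocabulary of science (illustrative only)
VOCABULARY = ["thrombocytopenia", "thrombopoietin", "tomography", "microtomography"]

def correct(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary term, or None if nothing is similar enough."""
    if word in vocabulary:
        return word
    matches = difflib.get_close_matches(word, vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

for misspelled in ["thrommbocytopennia", "hrombopoetin"]:
    print(misspelled, "->", correct(misspelled))
```

With a vocabulary of tens of millions of terms, a linear scan like this would be too slow; the point here is only that a domain vocabulary is what makes the correction possible at all.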

Term Variants

• Orthographic variation
• Types of orthographic variants
– Spelling variation
– „Official variants“
– German orthography reform
– Brit./Am. English
– Abbreviations
– Acronyms

Search results for title: Computational cranial tomografy tumor

Morphological Variants

• What are morphological variants?
– Stemming vs. Lemmatization
– Derivational morphology
– „Official forms“

Syntactic Variants

• Real term variation
– Local Syntax of Terms
– Terms and containers

• The form of a „Dictionary of Science“
– Variation automata/transducers
– The Role of Normalization
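The role of normalization can be illustrated on the tomography queries listed earlier. The sketch below is a simplification and not the variation automata/transducers mentioned above: it strips quotes, lowercases, reverses thesaurus-style inverted entries at commas, and applies a small hypothetical British/American spelling table. The counts in the usage example are taken from the Scirus tomography list.

```python
import re
from collections import defaultdict

# Hypothetical spelling table (British -> American), far from complete
SPELLING = {"tumour": "tumor", "computerised": "computerized", "colour": "color"}

def normalize(term):
    """Map a surface variant of a term to one normalized form."""
    t = term.strip().strip('"').lower()
    # Inverted entries: "Tomography, Emission Computed" -> "emission computed tomography"
    if "," in t:
        t = " ".join(part.strip() for part in reversed(t.split(",")))
    t = re.sub(r"[-_/]+", " ", t)  # hyphen, underscore and slash variants
    return " ".join(SPELLING.get(word, word) for word in t.split())

def group_variants(counted_queries):
    """Sum the counts of all queries that share a normalized form."""
    groups = defaultdict(int)
    for count, query in counted_queries:
        groups[normalize(query)] += count
    return dict(groups)

queries = [
    (594, 'optical coherence tomography'),
    (421, '"optical coherence tomography"'),
    (383, '"single photon emission computed tomography"'),
    (274, '"Tomography, Emission Computed, Single Photon"'),
]
print(group_variants(queries))
```

The comma rule is deliberately naive: a query that is a plain comma-separated keyword list would be reordered incorrectly, which is one reason a real dictionary of science would encode variation explicitly per term.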

Entities in Scientific Texts

• Persons/authors (i.e. scientists)
• Domains
• Concepts (divided into semantic classes and hyper-classes)
• Abbreviations
• Acronyms
• Titles
• Sub-titles
• Bibliographical citations
• Journals
• Publishers
• Learned Societies
• Books
• Universities
• Departments
• Patents
• Citations
• Conferences
• Abstracts
• Reviews
• Handbooks
• Proceedings
• Prizes
• And several hundred more.

What if we (i.e. the search device) had a Classification of Entities?

• Entities will be recognized in queries
• Entities will be recognized in documents
• The query evaluator will synchronize the two
• The next step is to weight the role of the different types of entities in the documents where they occur
• Another next step is quasi-structured search on document zones that have these entities as features
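The recognize-then-synchronize idea can be sketched with a small lookup table. Everything concrete here is an assumption made for illustration: the lexicon entries, the type names, the weights, and the longest-match strategy.

```python
import re

# Hypothetical gazetteer mapping surface forms to entity types
LEXICON = {
    "positron emission tomography": "CONCEPT",
    "tomography": "CONCEPT",
    "harvard": "UNIVERSITY",
    "medline": "DATABASE",
}
# Assumed weights per entity type
TYPE_WEIGHTS = {"CONCEPT": 2.0, "UNIVERSITY": 1.0, "DATABASE": 1.0}
MAX_LEN = max(len(key.split()) for key in LEXICON)

def recognize(text):
    """Return the set of (surface form, type) pairs found in text, longest match first."""
    tokens = re.findall(r"[\w-]+", text.lower())
    found, i = set(), 0
    while i < len(tokens):
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in LEXICON:
                found.add((candidate, LEXICON[candidate]))
                i += n
                break
        else:
            i += 1  # no entity starts at this token
    return found

def score(query, document):
    """Sum the type weights of the entities shared by query and document."""
    shared = recognize(query) & recognize(document)
    return sum(TYPE_WEIGHTS[entity_type] for _, entity_type in shared)

print(recognize("positron emission tomography at Harvard"))
print(score("tomography harvard", "Harvard researchers use tomography daily."))
```

Because both sides are reduced to typed entities before matching, the weighting step can treat a shared concept differently from a shared institution.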

„Almanac (Factual) Queries“

• Birth Date of Bill Clinton
• President of Harvard in 1996
• Address of the United Nations
• Population of Nigeria
• Senators of Iowa
• Olympic Gold Medal Winner in High Jump 1964

„Intensional (List) Queries“

• Famous computer scientists
– COM(PERSONSxCS)

• Prizes in mathematics
– COM(SCIENTIFIC_PRIZExMATH)

• Concepts in string search
– COM(CONxString_Search)

• Recent papers on tomography
– Sort_By_Publication_Date(COM(CONxTomography))

• Universities famous for tomography
– COM(UNIxTomography)
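The slides do not define the COM operator. One possible reading, used here purely as an assumption, is "entities of a given type that co-occur with a topic, ranked by how often they do". The sketch below implements that reading over a handful of invented, pre-annotated documents; the document structure and the placeholder names are hypothetical.

```python
from collections import Counter

# Invented, pre-annotated documents: recognized entities grouped by type
DOCS = [
    {"year": 2001, "entities": {"CON": ["tomography"], "UNI": ["Univ A", "Univ B"]}},
    {"year": 2003, "entities": {"CON": ["tomography"], "UNI": ["Univ A"]}},
    {"year": 2002, "entities": {"CON": ["string search"], "UNI": ["Univ C"]}},
]

def com(entity_type, topic, docs, topic_type="CON"):
    """Rank entities of entity_type by the number of documents they share with topic."""
    counts = Counter()
    for doc in docs:
        if topic in doc["entities"].get(topic_type, []):
            counts.update(doc["entities"].get(entity_type, []))
    return counts.most_common()

# A rough analogue of COM(UNIxTomography) under the assumption above
print(com("UNI", "tomography", DOCS))
```

Under this reading, a list query is answered from entity annotations rather than from literal keyword matches, which is why it presupposes the entity classification discussed earlier.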

Results of Searches and their Refinement

• Typical results on current search engines
– In general, all results are presented on a par
– Fast Data Search currently allows for a variety of „field searches“

• What can be „refined“?
– The zones of searches BEFORE searching
  • E.g. by category or page format (e.g. Scirus)
– The results of searches AFTER the query

• What would we like to refine on INSIDE results and result sets?

Entities in Scientific Texts

• Semantic Types in general
– Entities are in general domain-dependent
– Semantic Categories and sub-categories occur in many domains but are more predominant in one or the other
– Semantic hyper-categories
– Domain types
– Domain specific types (e.g. patents, applicants, inventors, claims, etc. in patent documents)

Refinements After Search

• Related Queries

• Related Concepts

• Related Entities

• Related Containers

• Examples

What is a DRSI? („Dynamic Result Set Indexes“)

• What are we already used to in connection with searching articles and books?
– Classification
– Key Words
– Abstracts
– Tables of Contents
– Bibliographies
– Back-of-Book-Indexes
  • Of authors
  • Of concepts and entities

What is a „Complete“ DRSI?

• All the „interesting“ terms in the document (the document set)
• Ordered alphabetically or by frequency
• Ordered by category
• Other kinds of operations
– Translating a DRSI (i.e. the need for multi-lingual dictionaries of science at least for the most frequent terms)
– Clustering (which keywords co-occur heavily and by category)
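The operations listed above can be sketched as a small index built over a result set. This is an illustration under stated assumptions, not the implementation behind the Patent/Medline demos: the vocabulary, its categories, and the whitespace tokenization are all hypothetical simplifications.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical "interesting" vocabulary with one category per term
VOCAB = {"tomography": "method", "positron": "particle", "tumor": "disease", "brain": "anatomy"}

def build_drsi(results):
    """results: {doc_id: text}. Returns {term: sorted list of doc_ids containing it}."""
    index = defaultdict(set)
    for doc_id, text in results.items():
        for word in set(text.lower().split()):
            if word in VOCAB:
                index[word].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def by_frequency(index):
    """Terms ordered by number of documents (descending), then alphabetically."""
    return sorted(index, key=lambda term: (-len(index[term]), term))

def by_category(index):
    """Terms grouped by their vocabulary category."""
    categories = defaultdict(list)
    for term in sorted(index):
        categories[VOCAB[term]].append(term)
    return dict(categories)

def cooccurrence(index):
    """Number of shared documents for every pair of terms that co-occur at least once."""
    pairs = {}
    for a, b in combinations(sorted(index), 2):
        shared = len(set(index[a]) & set(index[b]))
        if shared:
            pairs[(a, b)] = shared
    return pairs

results = {1: "brain tumor tomography", 2: "positron tomography", 3: "brain tumor surgery"}
drsi = build_drsi(results)
print(by_frequency(drsi))
print(by_category(drsi))
print(cooccurrence(drsi))
```

The index is rebuilt for each result set, which is what makes it "dynamic": the terms offered for refinement depend on the documents the query actually returned.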