Query-driven dictionary enhancement
Primož Jakopin, Birte Lönneker
Scientific Research Center ZRC SAZU, Ljubljana, Slovenia
7 July, 2004 11th EURALEX Congress - Query-driven dictionary enhancement 2
Motivation
Log files of online dictionaries provide direct access to the users' requests.
Make use of them!
Dictionary author's question: What are the needs of the users of my dictionary?
Overview
The dictionary: Online SLO-DE-SLO
The log file
Use of the log file
  to evaluate current dictionary contents
  to choose the most promising corpus type for enlarging the dictionary
Conclusions
Dictionary: Online SLO-DE-SLO
Bidirectional online dictionary German-Slovenian
On the Web since 2001
Initially a learners' dictionary for German-speaking learners of Slovenian
Online SLO-DE-SLO user interface
Online SLO-DE-SLO contents
Evaluated version (October 2003)
  Textbook corpus: 5,172 entries
  Newspaper corpus: 729 entries
  Total: 5,901 entries
Current version (June 2004)
  Textbook corpus: 5,544 entries
  Newspaper corpus: 743 entries
  Technical corpus: 829 entries
  Total: 7,116 entries
Online SLO-DE-SLO entry concept
Each entry is bilingual
Exactly one equivalence per entry
An entry can describe
  a basic word form
  an inflected word form
  an example sentence or phrase
  a collocation
Online SLO-DE-SLO query results
The log file
When a user submits a query to the dictionary, a program writes data about the query into the log file, e.g.
  Source language
  Submitted query string
  Selected search options (exact string match, match at beginning of word, match anywhere)
  Time stamp
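A minimal sketch of how one such log record could be represented and stored. The tab-separated layout, the field names, the language codes and the `MatchMode` values are assumptions made for illustration; the actual Online SLO-DE-SLO log format is not described in the slides.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import Enum


class MatchMode(Enum):
    # The three search options named above; the string values are hypothetical.
    EXACT = "exact"
    WORD_START = "word_start"
    ANYWHERE = "anywhere"


@dataclass(frozen=True)
class LogRecord:
    source_language: str   # assumed codes, e.g. "de" or "sl"
    query: str             # the submitted query string
    match_mode: MatchMode  # the selected search option
    timestamp: datetime    # when the query was submitted

    def to_line(self) -> str:
        """Serialize as one tab-separated log line (assumed layout)."""
        return "\t".join([
            self.timestamp.isoformat(),
            self.source_language,
            self.match_mode.value,
            self.query,
        ])

    @classmethod
    def from_line(cls, line: str) -> "LogRecord":
        """Parse a line produced by to_line(); the query is the last field."""
        stamp, lang, mode, query = line.rstrip("\n").split("\t", 3)
        return cls(lang, query, MatchMode(mode), datetime.fromisoformat(stamp))


record = LogRecord("de", "danke", MatchMode.EXACT, datetime(2003, 10, 10, 12, 0, 0))
line = record.to_line()
```

With records in this shape, restricting an evaluation to exact string match queries is a simple filter on `match_mode`.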
The log file: details
Evaluation period: 6 January 2002 to 10 October 2003
Number of queries stored in log file: 131,674
Number of queries with exact string match: 88,879
Only exact string match queries are evaluated
The log file: preprocessing
Preprocessing has to take into account how the matching is performed when the dictionary finds an entry for the user
Example 1:
  Dictionary matching: case insensitive (user enters A for a)
  Preprocessing: downcase all letters in log file (and in dictionary evaluation file)
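The downcasing step could look like the sketch below. The helper name `downcase` and the sample queries are illustrative; the point is that the same function is applied to the log file and to the dictionary evaluation file.

```python
def downcase(text: str) -> str:
    """Lowercase a query or headword so that matching is case insensitive."""
    # str.lower() keeps German ß and Slovenian č, š, ž intact;
    # str.casefold() would rewrite ß as "ss" and thereby change the spelling.
    return text.lower()


queries = ["Haus", "HVALA", "Čaj", "Straße"]
normalized = [downcase(q) for q in queries]
```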
The log file: preprocessing
Example 2a:
  Dictionary matching: substitution of special characters for easier access (user enters ae for ä)
  Preprocessing version I:
    Make a second version of log file
    Replace ae by ä in second version
    Use spell checker word list to find valid versions
    Check ambiguous cases manually
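One possible reading of preprocessing version I, as a sketch: generate every spelling in which some occurrences of ae are replaced by ä, keep those found in a word list, and flag queries with no valid version or several for manual checking. The function names and the toy word list are assumptions; a real run would load a full spell checker list.

```python
from itertools import product


def variants(query, digraph="ae", special="ä"):
    """Return every spelling obtained by replacing any subset of `digraph` occurrences."""
    parts = query.split(digraph)
    results = set()
    # Each gap between parts is filled with either the digraph or the special character.
    for joiners in product((digraph, special), repeat=len(parts) - 1):
        spelling = parts[0]
        for joiner, part in zip(joiners, parts[1:]):
            spelling += joiner + part
        results.add(spelling)
    return results


def valid_versions(query, word_list):
    """Keep only the variants that occur in the word list."""
    return sorted(variants(query) & word_list)


def needs_manual_check(query, word_list):
    """Zero or several valid versions cannot be resolved automatically."""
    return len(valid_versions(query, word_list)) != 1


# Toy stand-in for a German spell checker word list.
WORD_LIST = {"mädchen", "käse", "israel"}
```

The entry "israel" shows why a blind replacement is not enough: it contains ae, but only the unreplaced spelling is a valid word.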
The log file: preprocessing
Example 2b:
  Dictionary matching: substitution of special characters for easier access (user enters c for č)
  Preprocessing version II:
    Make a second version of log file
    Replace c by č in second version
    Use frequencies of parallel spellings to find valid versions
    Check ambiguous cases manually
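The slides do not say where the frequencies of the parallel spellings come from or how they are compared. The sketch below assumes a frequency table (for example from a corpus), picks the most frequent spelling, and flags a query for manual checking when the runner-up is not clearly rarer. The threshold `ratio`, the function names and all counts are invented for illustration.

```python
from itertools import product

# Slide example only; "s": "š" and "z": "ž" could be added in the same way.
SUBSTITUTIONS = {"c": "č"}


def parallel_spellings(query, substitutions=SUBSTITUTIONS):
    """All spellings where each substitutable letter is kept or replaced."""
    options = [
        (ch, substitutions[ch]) if ch in substitutions else (ch,)
        for ch in query
    ]
    return {"".join(combo) for combo in product(*options)}


def pick_spelling(query, frequencies, ratio=10.0):
    """Return (chosen spelling, needs_manual_check)."""
    scored = sorted(
        ((frequencies.get(s, 0), s) for s in parallel_spellings(query)),
        reverse=True,
    )
    best_freq, best = scored[0]
    second_freq = scored[1][0] if len(scored) > 1 else 0
    if best_freq == 0:
        # No spelling is attested at all: leave the query unchanged and flag it.
        return query, True
    # Ambiguous when the second-best spelling is within a factor `ratio` of the best.
    return best, second_freq * ratio > best_freq


# Invented counts, for illustration only.
FREQUENCIES = {"čas": 120, "cesta": 300, "cel": 50, "čel": 20}
```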
The log file: preprocessing
Users sometimes select the wrong source language (SL)
The correct SL could be found using spell checker lists for both languages
In our case, „spell checker lists“ taken from Online SLO-DE-SLO detect
  ... wrongly determined SL Slovenian: 378
  ... wrongly determined SL German: 593
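A sketch of this check, assuming one word list per language: a query counts as having a wrongly determined SL when it is missing from the list of the declared language but present in the list of the other one. The language codes, word lists and log entries are toy data.

```python
from collections import Counter


def corrected_source_language(query, declared, word_lists):
    """Return the language whose word list contains the query, preferring the declared one."""
    q = query.lower()
    if q in word_lists[declared]:
        return declared
    others = [
        lang for lang, words in word_lists.items()
        if lang != declared and q in words
    ]
    # Only switch when exactly one other language claims the query.
    return others[0] if len(others) == 1 else declared


# Toy word lists standing in for the headword lists of the dictionary.
WORD_LISTS = {"de": {"danke", "haus"}, "sl": {"hvala", "hiša"}}

# (declared source language, query) pairs as they might appear in a log.
LOG = [("de", "danke"), ("de", "hvala"), ("sl", "Haus"), ("sl", "hiša"), ("de", "xyz")]

wrongly_determined = Counter(
    declared
    for declared, query in LOG
    if corrected_source_language(query, declared, WORD_LISTS) != declared
)
```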
Evaluation I: Queries against dictionary
Question: To what extent does the dictionary satisfy users' requests?
Method: match preprocessed queries against downcased dictionary entries, language by language
Evaluation I
Dictionary entries: 5,901
  German distinct (downcased): 5,289
  Slovenian distinct: 5,103
Compare these lists with queries
Result I („tokens“):
  German: 40.7% of queries match
  Slovenian: 38.3% of queries match
Evaluation I
Result II (types; types are distinct queries):
  German: 10.4% of types match
  Slovenian: 12.7% of types match
Well-known frequency distribution also in query log file: a few types occur very often and many types occur rarely
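The difference between the token and the type result can be made concrete with a small sketch. The function name and the toy data are assumptions; the queries are taken to be preprocessed already, and `headwords` is the set of distinct downcased dictionary entries for one language.

```python
from collections import Counter


def match_rates(queries, headwords):
    """Return (token match rate, type match rate) for a list of queries."""
    if not queries:
        raise ValueError("queries must not be empty")
    counts = Counter(queries)
    # Tokens: every submitted query counts, including repetitions.
    matched_tokens = sum(n for q, n in counts.items() if q in headwords)
    # Types: each distinct query counts once.
    matched_types = sum(1 for q in counts if q in headwords)
    return matched_tokens / len(queries), matched_types / len(counts)


# Toy data: a few frequent queries match, the rare ones do not.
QUERIES = ["haus"] * 6 + ["tag"] * 2 + ["kuss", "gruß"]
HEADWORDS = {"haus", "tag"}
token_rate, type_rate = match_rates(QUERIES, HEADWORDS)
```

Because frequent queries are more likely to be covered, the token rate comes out higher than the type rate, which is the pattern in the reported figures.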
Evaluation I: Qualitative results
Online SLO-DE-SLO still lacks some expressions and words used in social relations and everyday life, e.g.
  Slovenian „top unmatched“ queries: regard, offer, confirmation, cow, payment, kiss, oak, to miss, fond of, to teach, sale, ...
  German „top unmatched“ queries: kiss, welcome, regard, regards, good morning, treasure, to fuck, good evening, ...
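A list of „top unmatched“ queries can be obtained by counting the queries that have no dictionary entry and sorting by frequency. This is a sketch with toy data; the function name is illustrative.

```python
from collections import Counter


def top_unmatched(queries, headwords, n=10):
    """Return the n most frequent queries that have no dictionary entry."""
    unmatched = Counter(q for q in queries if q not in headwords)
    return unmatched.most_common(n)


# Toy German queries; only "haus" is in the dictionary.
QUERIES = ["kuss"] * 3 + ["haus"] * 5 + ["gruß"] * 2 + ["eiche"]
HEADWORDS = {"haus"}
ranking = top_unmatched(QUERIES, HEADWORDS)
```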
Corpus-based enlargement
Log file entries alone are not enough
The enlargement of the dictionary should stay corpus-based, because the dictionary author wants to
  find appropriate examples of use
  also find collocations and idioms
  find more words that are likely to be of interest to typical users
Evaluation II: Outline
Which corpus should be used next to enlarge Online SLO-DE-SLO?
Which corpus best reflects the structure of the entire vocabulary entered by the users?
Evaluation of Slovenian queries using Slovenian corpora (subcorpora of the Nova Beseda corpus)
Evaluation II: Queries against corpora
Evaluate corpora of three text types
  Newspaper: 88 million
  Fiction: 5.7 million
  Technical: 6.3 million
Evaluation II. Method
Compare lemmas in user queries with relative frequencies in the three corpora.
1. Lemmatize Slovenian queries and assign POS
2. Retain lemmatized content words and interjections: 7,246 „query lemmas“
Evaluation II. Method
3. Lemmatize each corpus (currently with ambiguities)
4. Calculate relative frequencies (per 1 million) of lemmas in the three corpora
5. Assign a „weight“ to each lemma: for each query lemma and corpus, multiply the number of queries by the relative frequency
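Steps 4 and 5 can be sketched as follows. The function names, corpus sizes and counts are invented for illustration; only the two formulas (frequency per million, and weight as number of queries times relative frequency) come from the method described here.

```python
def per_million(count, corpus_size):
    """Relative frequency of a lemma per 1 million corpus tokens."""
    return count * 1_000_000 / corpus_size


def lemma_weight(query_count, freq_pm):
    """Weight of one query lemma in one corpus."""
    return query_count * freq_pm


def corpus_score(query_counts, corpus_counts, corpus_size):
    """Sum of the weights of all query lemmas for one corpus."""
    return sum(
        lemma_weight(n, per_million(corpus_counts.get(lemma, 0), corpus_size))
        for lemma, n in query_counts.items()
    )


# Invented numbers: how often each lemma was queried ...
QUERY_COUNTS = {"hiša": 4, "računalnik": 2}
# ... and how often it occurs in two toy corpora: name -> (lemma counts, corpus size).
CORPORA = {
    "fiction": ({"hiša": 300, "računalnik": 10}, 2_000_000),
    "technical": ({"hiša": 20, "računalnik": 400}, 1_000_000),
}

scores = {
    name: corpus_score(QUERY_COUNTS, counts, size)
    for name, (counts, size) in CORPORA.items()
}
```

A lemma that does not occur in a corpus contributes a weight of zero, so a corpus scores highly only if it contains the frequently queried lemmas often.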
Evaluation II. Example
First seven lines of fiction corpus evaluation (alphabetical order)

lemma          English      FRQ pm   # queries   weight
absoluten:P    absolute       7.64           1     7.64
absolvent:S    graduate       0.28           2     0.56
adaptacija:S   adaptation     0.14           2     0.28
adrenalin:S    adrenalin      1.13           1     1.13
afera:S        affair         1.56           3     4.68
afna:S         ape; „at“      0.14           3     0.42
agencija:S     agency         6.65          13    86.45
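The weights in this example can be recomputed from the other two numeric columns. The sketch below uses `Decimal` so that the two-decimal values compare exactly; the variable names are illustrative, and the partial sum covers only these seven rows, not the whole corpus.

```python
from decimal import Decimal

# (lemma, FRQ pm, # queries, weight) exactly as listed in the example.
ROWS = [
    ("absoluten:P", "7.64", 1, "7.64"),
    ("absolvent:S", "0.28", 2, "0.56"),
    ("adaptacija:S", "0.14", 2, "0.28"),
    ("adrenalin:S", "1.13", 1, "1.13"),
    ("afera:S", "1.56", 3, "4.68"),
    ("afna:S", "0.14", 3, "0.42"),
    ("agencija:S", "6.65", 13, "86.45"),
]

# weight = relative frequency per million * number of queries
computed = {lemma: Decimal(freq) * n for lemma, freq, n, _ in ROWS}
all_match = all(computed[lemma] == Decimal(w) for lemma, _, _, w in ROWS)
partial_sum = sum(computed.values())
```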
Evaluation II. Result
All lemma weights are summed up for each of the three corpora separately
1. Fiction 10,262,558
2. Newspaper 9,694,125
3. Technical 9,369,494
The fiction corpus reflects the user queries best
Evaluation II. Top twenty weights
Lemmas in at least two corpora (translated): to be, to have, to give, to go, good, day, house, beautiful, table, to come, light (ADJ), to know, to see, big, year, to work/to do
Top 20 weight in fiction: to think, to say, to look, fond of
Top 20 weight in newspaper: town/place; Slovenian
Top 20 weight in technical: computer, picture, data item
Evaluation II. Improvements and variations
Improvement: unambiguously lemmatized corpora (work in progress for Slovenian)
Variation: evaluate only non-matched queries
  Not: overall structure of all queries
  But: overall structure of unsuccessful queries (might change after enhancements)
Conclusion
We have shown
  query-driven methods of evaluation for online dictionaries
  query-driven methods for finding adequate corpora as sources for enhancing dictionaries
Result: the example dictionary Online SLO-DE-SLO should and will be enlarged based on literary texts first
Thank you for your attention
German: danke
Slovenian: hvala
http://www.rrz.uni-hamburg.de/slowenisch