
Query-driven dictionary enhancement

Primož Jakopin, Birte Lönneker

Scientific Research Center ZRC SAZU, Ljubljana, Slovenia


7 July, 2004 11th EURALEX Congress - Query-driven dictionary enhancement 2

Motivation

Log files of online dictionaries provide direct access to the users' requests.

Make use of them!

Dictionary author's question: What are the needs of the users of my dictionary?


Overview

- The dictionary: Online SLO-DE-SLO
- The log file
- Use of the log file
  - to evaluate current dictionary contents
  - to choose the most promising corpus type for enlarging the dictionary
- Conclusions


Dictionary: Online SLO-DE-SLO

- Bidirectional online dictionary German-Slovenian
- On the Web since 2001
- Initially a learners' dictionary for German-speaking learners of Slovenian


Online SLO-DE-SLO user interface


Online SLO-DE-SLO contents

Evaluated version (October 2003)
- Textbook corpus: 5,172 entries
- Newspaper corpus: 729 entries
- Total: 5,901 entries

Current version (June 2004)
- Textbook corpus: 5,544 entries
- Newspaper corpus: 743 entries
- Technical corpus: 829 entries
- Total: 7,116 entries


Online SLO-DE-SLO entry concept

- Each entry is bilingual
- Exactly one equivalence per entry
- An entry can describe
  - a basic word form
  - an inflected word form
  - an example sentence or phrase
  - a collocation


Online SLO-DE-SLO query results


The log file

When a user submits a query to the dictionary, a program writes data about the query into the log file, e.g.
- Source language
- Submitted query string
- Selected search options (exact string match, match at beginning of word, match anywhere)
- Time stamp


The log file: details

- Evaluation period: 6 January 2002 to 10 October 2003
- Number of queries stored in log file: 131,674
- Number of queries with exact string match: 88,879

Only exact string match queries are evaluated.
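
The filtering step could look roughly like the sketch below. The talk does not show the actual log layout, so the tab-separated line format, the language codes, the option labels (`exact`, `prefix`, `anywhere`) and the names `LogRecord`, `parse_line` and `exact_match_queries` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class LogRecord:
    source_language: str   # e.g. "sl" or "de" (codes are an assumption)
    query: str             # submitted query string
    search_option: str     # "exact", "prefix" or "anywhere" (labels are an assumption)
    timestamp: datetime


def parse_line(line: str) -> LogRecord:
    """Parse one line of a hypothetical tab-separated log file."""
    lang, query, option, stamp = line.rstrip("\n").split("\t")
    return LogRecord(lang, query, option, datetime.fromisoformat(stamp))


def exact_match_queries(records):
    """Keep only the records that used the exact string match option."""
    return [r for r in records if r.search_option == "exact"]


lines = [
    "sl\thvala\texact\t2003-10-10T12:00:00",
    "de\tdank\tprefix\t2002-01-06T08:30:00",
]
records = [parse_line(line) for line in lines]
print(len(exact_match_queries(records)))
```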


The log file: preprocessing

Preprocessing has to take into account how the matching is performed when the dictionary finds an entry for the user.

Example 1:
- Dictionary matching: case insensitive (user enters A for a)
- Preprocessing: lowercase all letters in the log file (and in the dictionary evaluation file)
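
A minimal sketch of the lowercasing step in Example 1, assuming queries and headwords are available as plain Python strings. The function names are hypothetical, and trimming surrounding whitespace is an extra assumption that the talk does not mention.

```python
def normalize(text: str) -> str:
    """Lowercase a query or headword so that 'A' and 'a' compare equal."""
    # Stripping surrounding whitespace is an assumption, not part of the talk.
    return text.strip().lower()


def matched_queries(queries, headwords):
    """Return the normalized queries that equal some normalized headword."""
    known = {normalize(h) for h in headwords}
    return [q for q in map(normalize, queries) if q in known]


print(matched_queries(["Hvala", "DANKE ", "poljub"], ["hvala", "Danke"]))
```

Applying the same function to both the log file and the dictionary evaluation file keeps the two sides comparable.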


The log file: preprocessing

Example 2a:
- Dictionary matching: substitution of special characters for easier access (user enters ae for ä)
- Preprocessing version I:
  - Make a second version of the log file
  - Replace ae by ä in the second version
  - Use a spell checker word list to find valid versions
  - Check ambiguous cases manually
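
Preprocessing version I could be sketched as follows. This is one possible reading, not the authors' program: every spelling obtainable by replacing digraphs with umlauts is generated and then checked against a word list. Only "ae" for "ä" is named in the talk; the "oe" and "ue" mappings, the function names and the sample word lists are assumptions.

```python
import re
from itertools import product

# Only "ae" for "ä" is named in the talk; "oe" and "ue" are added by assumption.
DIGRAPHS = {"ae": "ä", "oe": "ö", "ue": "ü"}
PATTERN = re.compile("|".join(DIGRAPHS))


def variants(query: str) -> set[str]:
    """All spellings obtained by replacing any subset of digraphs with umlauts."""
    parts = PATTERN.split(query)      # text between the digraphs
    found = PATTERN.findall(query)    # digraphs in order of appearance
    options = [(d, DIGRAPHS[d]) for d in found]
    result = set()
    for choice in product(*options):
        word = parts[0]
        for piece, rest in zip(choice, parts[1:]):
            word += piece + rest
        result.add(word)
    return result


def resolve(query: str, word_list: set[str]):
    """Classify a query by how many of its variants the word list accepts."""
    valid = sorted(variants(query) & word_list)
    if len(valid) == 1:
        return "resolved", valid
    if valid:
        return "ambiguous", valid     # left for manual checking
    return "unknown", valid


print(resolve("maedchen", {"mädchen", "neue"}))
print(resolve("neue", {"mädchen", "neue"}))
```

Generating every variant, rather than replacing all digraphs blindly, keeps words such as "neue" intact, where "ue" is not a substitute for "ü".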


The log file: preprocessing

Example 2b:
- Dictionary matching: substitution of special characters for easier access (user enters c for č)
- Preprocessing version II:
  - Make a second version of the log file
  - Replace c by č in the second version
  - Use frequencies of parallel spellings to find valid versions
  - Check ambiguous cases manually
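
The talk does not spell out how the frequencies of parallel spellings are used. One plausible reading, sketched below, is to look up every diacritic variant of a query in the log itself and rank the attested spellings by how often users typed them. The mappings for s and z, the function names and the toy log are assumptions; only c for č comes from the talk.

```python
from collections import Counter
from itertools import product

# Only "c" for "č" is named in the talk; "s" and "z" are added by assumption.
CARON = {"c": "č", "s": "š", "z": "ž"}


def variants(query: str) -> set[str]:
    """All spellings obtained by adding a caron to any subset of c, s, z."""
    options = [(ch, CARON[ch]) if ch in CARON else (ch,) for ch in query]
    return {"".join(choice) for choice in product(*options)}


def parallel_spellings(query: str, log_counts: Counter):
    """Attested spellings of `query` with log frequencies, most frequent first."""
    attested = {v: log_counts[v] for v in variants(query) if log_counts[v] > 0}
    return sorted(attested.items(), key=lambda kv: (-kv[1], kv[0]))


log_counts = Counter(["cas", "čas", "čas", "hisa", "hiša"])
print(parallel_spellings("cas", log_counts))
print(parallel_spellings("hisa", log_counts))
```

A clear frequency winner suggests the intended spelling, while ties such as the second call are the ambiguous cases left for manual checking. The number of variants doubles with each c, s or z, which is acceptable for short queries.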


The log file: preprocessing

Users sometimes select the wrong source language (SL).

The correct SL could be found using spell checker lists for both languages.

In our case, "spell checker lists" taken from Online SLO-DE-SLO detect
- wrongly determined SL Slovenian: 378
- wrongly determined SL German: 593
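
A minimal sketch of the source-language check, assuming one word list per language is available as a Python set. The language codes, the function name and the rule used here (reassign only when the query is missing from the declared language and found in exactly one other list) are assumptions, not the authors' stated procedure.

```python
def corrected_language(query: str, declared: str, word_lists: dict[str, set[str]]) -> str:
    """Return the language whose word list contains the query, else the declared one."""
    if query in word_lists[declared]:
        return declared
    others = [lang for lang, words in word_lists.items()
              if lang != declared and query in words]
    # Reassign only when exactly one other language claims the query.
    return others[0] if len(others) == 1 else declared


word_lists = {"sl": {"hvala", "miza"}, "de": {"danke", "tisch"}}
print(corrected_language("danke", "sl", word_lists))
```

Queries found in neither list keep their declared language, since the lists offer no evidence either way.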


Evaluation I: Queries against dictionary

Question: To what extent does the dictionary satisfy users' requests?

Method: match preprocessed queries against lowercased dictionary entries, language by language


Evaluation I

Dictionary entries: 5,901
- German distinct (lowercased): 5,289
- Slovenian distinct: 5,103

Compare these lists with queries

Result I ("tokens"):
- German: 40.7% of queries match
- Slovenian: 38.3% of queries match


Evaluation I

Result II ("types"; types are distinct queries):
- German: 10.4% of types match
- Slovenian: 12.7% of types match

The well-known frequency distribution also holds in the query log file: a few types occur very often and many types occur rarely.
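
The difference between the token and type results can be made concrete with a small sketch. It assumes the preprocessed queries are a list of strings and the dictionary side is a set of lowercased headwords; the function name and the toy data are invented for illustration and are not the figures from the talk.

```python
from collections import Counter


def match_rates(queries, headwords):
    """Return (token match rate, type match rate) for a list of queries."""
    if not queries:
        raise ValueError("queries must not be empty")
    counts = Counter(queries)                       # type -> number of tokens
    matched_tokens = sum(n for q, n in counts.items() if q in headwords)
    matched_types = sum(1 for q in counts if q in headwords)
    return matched_tokens / len(queries), matched_types / len(counts)


queries = ["hvala"] * 6 + ["miza"] * 2 + ["poljub", "krava"]
token_rate, type_rate = match_rates(queries, {"hvala", "miza"})
print(token_rate, type_rate)
```

Because a few frequent types account for most tokens, the token rate can be much higher than the type rate, as in the results above.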


Evaluation I: Qualitative results

Online SLO-DE-SLO still lacks some expressions and words used in social relations and everyday life, e.g.
- Slovenian "top unmatched" queries: regard, offer, confirmation, cow, payment, kiss, oak, to miss, fond of, to teach, sale, ...
- German "top unmatched" queries: kiss, welcome, regard, regards, good morning, treasure, to fuck, good evening, ...


Corpus-based enlargement

- Log file entries alone are not enough
- The enlargement of the dictionary should stay corpus-based, because the dictionary author wants to
  - find appropriate examples of use
  - find also collocations and idioms
  - find more words that are likely to be of interest to typical users


Evaluation II: Outline

Which corpus should be used next to enlarge Online SLO-DE-SLO?

Which corpus best reflects the structure of the entire vocabulary entered by the users?

Evaluation of Slovenian queries using Slovenian corpora (subcorpora of the Nova Beseda corpus)


Evaluation II: Queries against corpora

Evaluate corpora of three text types
- Newspaper: 88 million
- Fiction: 5.7 million
- Technical: 6.3 million


Evaluation II. Method

Compare lemmas in user queries with relative frequencies in the three corpora.

1. Lemmatize Slovenian queries and assign POS

2. Retain lemmatized content words and interjections: 7,246 "query lemmas"


Evaluation II. Method

3. Lemmatize each corpus (currently with ambiguities)

4. Calculate relative frequencies (per 1 million) of lemmas in three corpora

5. Assign a "weight" to each lemma: for each query lemma and corpus, multiply the number of queries by the relative frequency
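
Steps 4 and 5 amount to two small calculations, sketched here under the assumption that raw lemma counts and corpus sizes are available as numbers. The function names are hypothetical and the count of 57 is made up. The pair of 13 queries and 6.65 per million is the "agencija" row from the fiction corpus example in the talk.

```python
def relative_frequency_pm(lemma_count: int, corpus_size: int) -> float:
    """Occurrences of a lemma per one million corpus tokens (step 4)."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return lemma_count / corpus_size * 1_000_000


def lemma_weight(num_queries: int, freq_pm: float) -> float:
    """Number of queries for a lemma times its relative frequency (step 5)."""
    return num_queries * freq_pm


# Made-up count: 57 occurrences in a corpus of 5.7 million tokens.
print(round(relative_frequency_pm(57, 5_700_000), 2))
# "agencija" in the fiction corpus: 13 queries, 6.65 per million.
print(round(lemma_weight(13, 6.65), 2))
```

Using relative rather than absolute frequencies keeps the much larger newspaper corpus from dominating the comparison simply because of its size.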


Evaluation II. Example

First seven lines of fiction corpus evaluation (alphabetical order)

lemma          English       FRQ pm   # queries   weight
absoluten:P    absolute        7.64           1     7.64
absolvent:S    graduate        0.28           2     0.56
adaptacija:S   adaptation      0.14           2     0.28
adrenalin:S    adrenalin       1.13           1     1.13
afera:S        affair          1.56           3     4.68
afna:S         ape; "at"       0.14           3     0.42
agencija:S     agency          6.65          13    86.45


Evaluation II. Result

All lemma weights are summed up for each of the three corpora separately

1. Fiction: 10,262,558
2. Newspaper: 9,694,125
3. Technical: 9,369,494

The fiction corpus reflects the user queries best
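
The summation and ranking could be sketched as follows, assuming each corpus is represented as a mapping from lemma to relative frequency per million. The function names are hypothetical. The fiction values for "agencija" and "afera" come from the example in the talk, while the newspaper values are invented for the demonstration.

```python
def corpus_score(query_counts: dict[str, int], freq_pm: dict[str, float]) -> float:
    """Sum of lemma weights for one corpus; lemmas absent from the corpus add 0."""
    return sum(n * freq_pm.get(lemma, 0.0) for lemma, n in query_counts.items())


def rank_corpora(query_counts, corpora):
    """Return (corpus name, score) pairs, highest score first."""
    scores = {name: corpus_score(query_counts, freq) for name, freq in corpora.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


query_counts = {"agencija": 13, "afera": 3}
corpora = {
    "fiction": {"agencija": 6.65, "afera": 1.56},   # values from the talk's example
    "newspaper": {"agencija": 2.0},                 # invented for illustration
}
for name, score in rank_corpora(query_counts, corpora):
    print(name, round(score, 2))
```

The corpus with the highest total is the one whose vocabulary distribution is closest to what users ask for.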


Evaluation II. Top twenty weights

- Lemmas in at least two corpora (translated): to be, to have, to give, to go, good, day, house, beautiful, table, to come, light (ADJ), to know, to see, big, year, to work/to do
- Top 20 weight in fiction: to think, to say, to look, fond of
- Top 20 weight in newspaper: town/place; Slovenian
- Top 20 weight in technical: computer, picture, data item


Evaluation II. Improvements and Variations

- Improvement: unambiguously lemmatized corpora (work in progress for Slovenian)
- Variation: evaluate only non-matched queries
  - Not: overall structure of all queries
  - But: overall structure of unsuccessful queries (might change after enhancements)


Conclusion

We have shown
- query-driven methods of evaluation for online dictionaries
- query-driven methods for finding adequate corpora as sources for enhancing dictionaries

Result: the example dictionary Online SLO-DE-SLO should and will be enlarged based on literary texts first


Thank you for your attention

German: danke
Slovenian: hvala

http://www.rrz.uni-hamburg.de/slowenisch