![Page 1: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/1.jpg)
CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque
I. Leturia, A. Gurrutxaga¹, I. Alegria, A. Ezeiza²
WAC3 – September 15-16, 2007 – Louvain-la-Neuve
¹ Elhuyar R&D, Usurbil, Basque Country
² IXA Group, University of the Basque Country, Donostia, Basque Country
![Page 2: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/2.jpg)
Contents
• Motivation
• Problems with Basque language
• Our approach
• CorpEus, a ‘web as corpus’ tool for Basque
• EusBila, a search service for Basque
• Evaluation
![Page 4: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/4.jpg)
Motivation
• No doubt corpora are necessary:
  – for linguistic research
  – for language normalization
  – for developing language technologies
• But many corpora are exclusively used for these purposes
• They are not made publicly available and searchable through the Internet
![Page 5: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/5.jpg)
Motivation
• For Basque, it is essential to have corpora available for querying
  – Standardization of Basque started only in 1968
  – Many rules, words and spellings have changed since then, and the Academy of the Basque Language still releases new rules every now and then
  – It was not taught in schools until the seventies and in universities until the eighties
  – For many words and subject areas, no decision on the correct word or spelling has yet been taken
  – Even written production abounds with misspellings, errors, uncertainties, etc.
![Page 6: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/6.jpg)
Motivation
• The Basque-speaking community needs corpora
  – Teachers
  – Writers
  – Technical text producers
  – Dictionary makers
  – Translators
  – Students
  – Academics in the field of standardization
• Basque is not a language rich in corpora
  – The existing ones are few, small and not kept up to date
![Page 7: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/7.jpg)
Motivation
• Only corpora available (I):
  – XX. mendeko euskararen corpusa:
    • Academy of the Basque language
    • 4.6 million words
    • Balanced
    • Literary texts
    • Twentieth century
    • http://www.euskaracorpusa.net/XXmendea/Konts_arrunta_fr.html
![Page 8: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/8.jpg)
Motivation
• Only corpora available (II):
  – Ereduzko prosa gaur:
    • University of the Basque Country
    • 23.8 million words
    • Literary and press texts regarded as “reference”
    • 2000 - 2005
    • http://www.ehu.es/euskara-orria/euskara/ereduzkoa/araka.html
![Page 9: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/9.jpg)
Motivation
• Only corpora available (III):
  – Zientzia eta teknologiaren corpusa:
    • Elhuyar Foundation and the IXA Group of the University of the Basque Country
    • 7.6 million words
    • Texts on science and technology
    • 1990 - 2002
    • http://www.ztcorpusa.net
![Page 10: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/10.jpg)
Motivation
• Only corpora available (IV):
  – Klasikoen gordailua:
    • Susa publishing house
    • 10.7 million words
    • Non-tagged
    • Classic texts
    • http://klasikoak.armiarma.com/corpus.htm
![Page 11: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/11.jpg)
Motivation
• But we do have the Internet
  – Huge repository of texts
  – Constantly updated
• A tool for querying the Internet as if it were a Basque corpus would be very interesting
![Page 12: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/12.jpg)
Motivation
• Also disadvantages:
  – Not linguistically tagged:
    • Always some uncertainty
    • Variants and misspellings will not appear when looking for a word
  – It will never show everything, only what appears in the first results returned by search engines
  – The Internet is often considered non-representative
  – The Internet is full of redundancy
![Page 13: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/13.jpg)
Motivation
• Nevertheless, we thought that the benefits far exceeded the disadvantages
• We embarked on a project to build a ‘web as corpus’ tool for Basque
![Page 16: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/16.jpg)
Problems with Basque language
• Similar services exist:
  – WebConc (http://www.niederlandistik.fu-berlin.de/cgi-bin/web-conc.cgi)
  – WebCorp (http://www.webcorp.org.uk/)
  – KWiCFinder (http://www.kwicfinder.com)
• But these rely on search engines
• Search engines don’t work well for Basque
![Page 17: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/17.jpg)
Problems with Basque language
• Looking for conjugations and inflections
  – Basque is an agglutinative language
    • A given lemma makes many different word forms
      – lan (“work”): lana (“the work”), lanak (“works” or “the works”), lanari (“to the work”), lanei (“to the works”), lanaren (“of the work”), lanen (“of the works”)…
  – Looking only for the exact given word, or the word plus an “s” for the plural, is not enough
  – Wildcards are not an appropriate solution
    • Looking for lan* would also return forms of the words lanabes (“tool”), lanbro (“fog”)…
![Page 18: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/18.jpg)
Problems with Basque language
• Language discrimination
  – No search engine offers the possibility of returning only pages in Basque
  – Big problem when looking for:
    • Technical words that exist also in other languages: anorexia, sulfuroso, byte, allegro, sistema, energia…
    • Short words: katu (“cat”), ur (“water”)…
    • Proper nouns: Egipto, Newton, Pluton…
  – Many non-Basque results are returned, often no Basque results at all
![Page 19: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/19.jpg)
Problems with Basque language
• Lack of knowledge about the language
  – Status of language:
    • Late standardization
    • Still many changes in words and rules
    • Late teaching in schools and universities
    • Many non-standardised areas or words
    • Many misspellings and errors in written production
  – A word might be incorrect but appear often on the web
  – The user might think it is correct, without knowing that a more appropriate word exists
![Page 22: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/22.jpg)
Our approach
• Looking for conjugations and inflections: Morphological query expansion (I)
  – Morphological generator created by the IXA Group of the University of the Basque Country
  – We obtain all the forms of a given lemma
  – We ask the search engine for all of them using an OR operator
  – etxe (“house”) => etxe OR etxea OR etxeak OR etxeari OR etxeek OR …
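The expansion step above can be sketched in a few lines of Python. This is only an illustration: the list of forms stands in for the output of the IXA morphological generator, whose real interface is not described in these slides, and `build_or_query` is a hypothetical helper name.

```python
def build_or_query(forms):
    """Join word forms into one OR query, dropping duplicates but keeping order."""
    seen = set()
    unique = []
    for form in forms:
        if form not in seen:
            seen.add(form)
            unique.append(form)
    return " OR ".join(unique)


# Assumed generator output for the lemma "etxe" (the forms shown on the slide).
etxe_forms = ["etxe", "etxea", "etxeak", "etxeari", "etxeek"]
query = build_or_query(etxe_forms)
print(query)
```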
![Page 23: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/23.jpg)
Our approach
• Looking for conjugations and inflections: Morphological query expansion (II)
  – Minor problems:
    • Each search engine API limits the number of words or the length of the search phrase
      – we had to discover the limits by trial and error
    • Due to these limits, a truly lemmatised search is impossible
      – we looked in a corpus for the most frequent cases, numbers, tenses, etc. among the declensions and inflections of words
      – only these most frequent forms are sent in the query
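A minimal sketch of how forms might be chosen under such limits, assuming the forms arrive already ranked from most to least frequent. The function name and the limit values are hypothetical; the slides only say that the real limits were found by trial and error.

```python
def select_forms(forms_by_frequency, max_terms, max_chars):
    """Keep the most frequent forms that fit within the API limits.

    forms_by_frequency: forms ordered from most to least frequent.
    max_terms: assumed limit on the number of query terms.
    max_chars: assumed limit on the length of the joined OR query.
    """
    separator = " OR "
    selected = []
    length = 0
    for form in forms_by_frequency:
        if len(selected) >= max_terms:
            break
        extra = len(form) + (len(separator) if selected else 0)
        if length + extra > max_chars:
            # Stop at the first form that does not fit, so that frequency
            # order is respected rather than squeezing in rarer short forms.
            break
        selected.append(form)
        length += extra
    return selected


ranked = ["etxe", "etxea", "etxeak", "etxeari"]
print(" OR ".join(select_forms(ranked, max_terms=3, max_chars=100)))
```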
![Page 24: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/24.jpg)
Our approach
• Language discrimination: Language-filtering words (I)
  – We looked in a corpus for the most frequent words in Basque
  – We include them in the search phrase using an AND operator
![Page 25: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/25.jpg)
Our approach
• Language discrimination: Language-filtering words (II)
  – Minor problems (I):
    • The most frequent words in Basque exist in other languages too
    • Several language-filtering words had to be used
      – the more filtering words, the higher the precision (fewer non-Basque pages returned) but the lower the recall (more Basque pages left out), and vice versa
      – we favoured precision and include four filtering words
      – if few results are returned, the user can try again with increased recall
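The trade-off above can be illustrated with a small query builder. Everything specific here is an assumption: the four filter words are common Basque function words chosen for illustration (the slides do not list the ones actually used), and the parenthesised `AND`/`OR` syntax is a generic stand-in for whatever each search engine API accepts.

```python
# Illustrative filter words only; the actual list is not given in the slides.
FILTER_WORDS = ["eta", "da", "ez", "ere"]


def build_filtered_query(or_query, n_filters=4):
    """Append up to n_filters language-filtering words with AND.

    More filter words favour precision; fewer favour recall.
    """
    filters = FILTER_WORDS[:n_filters]
    return " AND ".join(["(" + or_query + ")"] + filters)


strict = build_filtered_query("katu OR katua")       # default: four filter words
relaxed = build_filtered_query("katu OR katua", 2)   # retry with higher recall
print(strict)
print(relaxed)
```

Retrying with a smaller `n_filters` corresponds to the "try again" option offered to the user when few results come back.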
![Page 26: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/26.jpg)
Our approach
• Language discrimination: Language-filtering words (III)
  – Minor problems (II):
    • In bilingual pages, the searched word can be in a piece of text that is not in Basque
      – we use LangId, a free language identifier developed by the IXA Group of the University of the Basque Country
      – it is applied to some context around the word to see whether the occurrence is in a piece of Basque text
      – it does not work well with small contexts, but if the context is too big, pieces in other languages can be included
      – we start with quite a broad context and progressively reduce its length until the minimum length LangId needs to work properly is reached
      – if at any point LangId says the context is in Basque, we stop and show the occurrence
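The shrinking-context check can be sketched as follows. LangId's real interface is not shown in the slides, so the identifier is passed in as a callable; `toy_identifier`, the radius values and the halving step are all hypothetical and serve only to make the example runnable.

```python
def occurrence_is_basque(text, start, end, identify_language,
                         initial_radius=400, min_radius=50):
    """Test ever smaller contexts around text[start:end] until one is Basque."""
    radius = initial_radius
    while radius >= min_radius:
        lo = max(0, start - radius)
        hi = min(len(text), end + radius)
        if identify_language(text[lo:hi]) == "eu":
            return True  # stop as soon as any context is judged Basque
        radius //= 2  # assumed shrink step
    return False


def toy_identifier(snippet):
    """Crude stand-in for a language identifier (not LangId itself)."""
    markers = {"eta", "da", "ez", "ere", "bat"}
    words = snippet.lower().split()
    if not words:
        return "unknown"
    hits = sum(word in markers for word in words)
    return "eu" if hits / len(words) >= 0.2 else "other"


english = "the cat is a small and pretty animal " * 20
page = english + "katua etxean da eta ez da handia ere"
pos = page.index("katua")
print(occurrence_is_basque(page, pos, pos + 5, toy_identifier))
```

In this made-up bilingual page the broad contexts are dominated by English, and only the smallest window is judged Basque, which is the situation the progressive reduction is meant to handle.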
![Page 27: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/27.jpg)
Our approach
• Lack of knowledge about the language: Variant suggestion (I)
  – EDBL, lexical database created by the IXA Group of the University of the Basque Country
  – Each word is linked to its variants, common errors, old spellings, etc.
  – When a user enters a word, its standard form or variants are suggested
![Page 28: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/28.jpg)
Our approach
• Lack of knowledge about the language: Variant suggestion (II)
  – Partly alleviates one of the problems caused by the web not being linguistically tagged:
    • in a tagged corpus, variants would be assigned the correct lemma and would appear when looking for the lemma
    • with our approach, the user can obtain the variants too
![Page 31: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/31.jpg)
CorpEus
• System architecture:
  – User enters word
  – Query the EDBL for variants
  – Query morphological generator to obtain conjugations and inflections
  – Query APIs of search engines
  – Download pages
  – Find occurrences of the forms of the word
  – Query LangId for the language each occurrence is in
  – Show KWiCs and counts
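The "find occurrences" and "show KWiCs and counts" steps can be sketched with the standard library alone. This is a simplified illustration, not the CorpEus implementation: the function names, the fixed context width and the sample sentence are all made up for the example.

```python
import re
from collections import Counter


def kwic(text, forms, width=30):
    """Return (left context, matched form, right context) for each occurrence."""
    if not forms:
        return []
    # Longest forms first so that "etxeak" is preferred over "etxe".
    ordered = sorted(forms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(form) for form in ordered) + r")\b",
        re.IGNORECASE,
    )
    rows = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        rows.append((left, match.group(0), right))
    return rows


def count_forms(rows):
    """Count how often each word form was found, ignoring case."""
    return Counter(form.lower() for _, form, _ in rows)


sample = "Etxea handia da. Etxeak ederrak dira, eta etxe bat erosi dut."
rows = kwic(sample, ["etxe", "etxea", "etxeak"])
for left, form, right in rows:
    print(f"{left:>30} [{form}] {right}")
print(count_forms(rows))
```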
![Page 32: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/32.jpg)
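[Architecture diagram. The User sends a Word to CorpEus and receives Occurrence KWiCs and counts. CorpEus exchanges data with five components: EDBL (IXA): Word in, Variants out; Morphological generator (IXA): Word and variants in, Inflections and conjugations out; Search engines’ APIs: Search phrase in, URLs out; WWW: URLs in, Web pages out; LangId (IXA): Occurrence contexts in, Language out.]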
![Page 33: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/33.jpg)
CorpEus
• Features (I):
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion
![Page 34: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/34.jpg)
![Page 35: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/35.jpg)
CorpEus
• Features (II):
  – Ambiguous or unrecognised words:
    • The user chooses the analysis upon which to base the morphological generation
![Page 36: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/36.jpg)
![Page 37: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/37.jpg)
CorpEus
• Features (III):
  – Search for more than one word:
    • Lemma-based search performed for all of them
    • Occurrences of any of the words are shown
![Page 38: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/38.jpg)
![Page 39: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/39.jpg)
CorpEus
• Features (IV):
  – Noun phrase or term searching:
    • Enclosing various terms in double quotes
    • Morphological generation applied to last word
    • Thus, proper lemma-based search for whole noun phrases or terms (in Basque, only the last component of the noun phrase or term is inflected)
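Applying generation to the last word only might look like the sketch below. The generator is passed in as a callable because its real interface is not described here; `toy_generator` simply appends three endings and is not a model of Basque morphology.

```python
def expand_phrase(phrase, generate_forms):
    """Expand a quoted term by inflecting only its last word."""
    words = phrase.split()
    if not words:
        return []
    head, last = words[:-1], words[-1]
    return ['"' + " ".join(head + [form]) + '"' for form in generate_forms(last)]


def toy_generator(lemma):
    """Hypothetical stand-in for the morphological generator."""
    return [lemma, lemma + "a", lemma + "ak"]


variants = expand_phrase("energia berriztagarri", toy_generator)
print(" OR ".join(variants))
```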
![Page 40: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/40.jpg)
![Page 41: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/41.jpg)
CorpEus
• Features (V):
  – Different ordering criteria:
    • Order of arrival of pages (default)
    • Form of searched word
    • Context after the word
    • Context before the word
  – Results are ordered on the fly as they arrive
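One way to keep results ordered while they are still arriving is to insert each new row into an already sorted list. The sketch below assumes rows are `(left context, form, right context)` tuples; the class and key names are hypothetical, and sorting the left context by its words read right to left is a common concordancer convention rather than something the slides specify.

```python
import bisect


class SortedKwic:
    """Keep KWIC rows sorted by a chosen key as they arrive."""

    def __init__(self, key):
        self._key = key
        self._entries = []  # (sort key, arrival index, row)
        self._count = 0

    def add(self, row):
        # The arrival index breaks ties, so rows themselves are never compared.
        bisect.insort(self._entries, (self._key(row), self._count, row))
        self._count += 1

    def rows(self):
        return [row for _, _, row in self._entries]


def by_form(row):
    return row[1].lower()


def by_context_after(row):
    return row[2].lower().split()


def by_context_before(row):
    return row[0].lower().split()[::-1]  # nearest word first


arriving = [
    ("handia da.", "Etxeak", "ederrak dira"),
    ("bat erosi", "etxea", "zaharra da"),
    ("", "etxe", "bat"),
]
table = SortedKwic(by_form)
for row in arriving:
    table.add(row)
print([form for _, form, _ in table.rows()])
```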
![Page 42: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/42.jpg)
![Page 43: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/43.jpg)
CorpEus
• Features (VI):
  – Analysis of the words:
    • Possible lemmas and POSs of the forms of the searched word are shown in a floating box
    • Different colours:
      – Light green: correct word, unambiguous
      – Dark green: variant, unambiguous
      – Light yellow: correct word, ambiguous
      – Dark yellow: variant, ambiguous
      – Red: unrecognised word
![Page 44: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/44.jpg)
![Page 45: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/45.jpg)
CorpEus
• Features (VII):
  – Count charts:
    • Word forms
    • Possible lemma or POS
    • Word before or after
    • Lemma of word before or after
    • …
![Page 46: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/46.jpg)
![Page 47: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/47.jpg)
CorpEus
• Features (VIII):
  – Many file types with textual content:
    • HTML
    • XML
    • RSS
    • TXT
    • PDF
    • DOC
    • RTF
    • PPT
    • XLS
    • …
  – Parallel downloading of pages to avoid blocking
![Page 48: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/48.jpg)
CorpEus
• Demo: http://www.corpeus.org
![Page 51: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/51.jpg)
EusBila
• Search engines don’t work well for Basque
• We decided to build a search service for Basque based on the principles of CorpEus:
  – API based
  – Lemma-based search
  – Language-filtered search
  – Variant suggestion
• But it returns URLs and snippets, not KWiCs or charts
![Page 52: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/52.jpg)
![Page 53: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/53.jpg)
EusBila
• Problem: the APIs limit the number of calls per day
  – Google: 1,000 calls per day
  – Yahoo!: 5,000 calls per day
  – Windows Live Search: 10,000 calls per day
• The limits can be enough for a corpus tool, but not for a general-use search service
• Microsoft recently raised the limit to 25,000 calls per day and also launched an unlimited-use commercial license
![Page 54: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/54.jpg)
EusBila
• We published a paper at iNEWS07 (Improving Non-English Web Searching), a workshop at SIGIR’07 (July 2007, Amsterdam)
• It aroused interest, as it is a cost-effective web search solution that can be used by other minority languages with few resources
![Page 55: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/55.jpg)
EusBila
• Launch:
  – By Eleka Ingeniaritza Linguistikoa
  – Under commercial name Elebila
  – October 2007
![Page 59: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/59.jpg)
Evaluation
• The methodology used in EusBila and CorpEus was evaluated for the iNEWS07 paper on EusBila
• We evaluated:
  – Gain in recall due to morphological query expansion
  – Gain in precision due to language-filtering words
  – Loss in recall due to language-filtering words
![Page 60: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/60.jpg)
Evaluation
• Indicator for precision: percentage of results that were actually in Basque
• Indicator for recall: estimated hit counts returned by the API
• Compared Windows Live Search’s API with EusBila using this same API
• The words for the evaluation were taken from the search logs of a very popular science portal in Basque
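The two indicators reduce to simple arithmetic, sketched below. The function names and the sample numbers are invented for illustration and are not the evaluation data.

```python
def precision_pct(is_basque_flags):
    """Percentage of returned results that were judged to be in Basque."""
    if not is_basque_flags:
        raise ValueError("no results to evaluate")
    return 100 * sum(is_basque_flags) / len(is_basque_flags)


def relative_change_pct(hits_before, hits_after):
    """Relative change in estimated hit counts, used as a proxy for recall."""
    return 100 * (hits_after - hits_before) / hits_before


# Made-up example values.
print(precision_pct([True, False, True, True]))
print(relative_change_pct(200, 300))
```

A gain in precision is then reported as a difference in percentage points between two such percentages, while a gain or loss in recall is the relative change in hit counts.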
![Page 61: CorpEus, a ‘web as corpus’ tool designed for the agglutinative nature of Basque](https://reader035.vdocument.in/reader035/viewer/2022070401/56813647550346895d9dc6a6/html5/thumbnails/61.jpg)
Evaluation
| Evaluation | Condition: language-filtering words | Condition: morphological query expansion | Condition: words | Measured variable | Result |
|---|---|---|---|---|---|
| Gain in recall due to morphological query expansion | Not applied | - | Only Basque | Hit counts | 89.43% increase |
| Gain in precision due to language-filtering words | - | Not applied | Any kind | % of results in Basque | 70.55 points increase, from 27.19% to 97.74% |
| Loss in recall due to language-filtering words | - | Not applied | Only Basque | Hit counts | Decrease of between 6.48% and 57.69%, depending on the number of language-filtering words* |
| Gain in recall due to morphological query expansion | Applied | - | Any kind | Hit counts | 40.19% increase |

* The number of filtering words can optionally be reduced to increase the recall when few results are returned