A Suite to Compile and Analyze an LSP Corpus

6th International Conference on Language Resources and Evaluation (LREC 2008)

Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré, {rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

Post on 05-Jan-2016

TRANSCRIPT

Page 1: A Suite to Compile and Analyze  an LSP Corpus

A Suite to Compile and Analyze an LSP Corpus

6th International Conference

on Language Resources and Evaluation

LREC 2008

Rogelio Nazar – Jorge Vivaldi – M. Teresa Cabré

{rogelio.nazar; jorge.vivaldi; teresa.cabre}@upf.edu

Page 2: A Suite to Compile and Analyze  an LSP Corpus

Introduction

This system (JAGUAR) is a set of tools for compiling and exploring an LSP corpus from the web: http://jaguar.iula.upf.edu

Usage examples:

• Terminology extraction

• Bilingual lexicon extraction

• Neologism extraction

Architecture: a system divided into two main modules:

1. Compilation of an LSP corpus from the web

2. Analysis of the corpus with statistical techniques

Page 3: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

1. Document retrieval by querying search engines

2. Classification of the collection along two axes:

a) Degree of relevance to the topic

• Possibility of corpus tuning with user feedback

b) Degree of specialization of the document

• Structure of the document (abstract, introduction, etc.)

• System for bibliographical references, etc.

The final classification is the result of combining these factors.

Page 4: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

Classification by degree of relevance to the topic

Page 5: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

Classification by degree of relevance to the topic: co-occurrence graphs
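
The co-occurrence graphs named on this slide can be illustrated roughly as follows. The slides do not describe how JAGUAR builds or weights its graphs, so everything here is an assumption made for the example: the tokenizer, the use of the whole document as the co-occurrence window, the `min_count` threshold, and the helper names `cooccurrence_graph` and `neighbours`.

```python
from collections import Counter
from itertools import combinations
import re


def tokenize(text):
    # Lowercase alphabetic tokens; a stand-in for the real system's tokenizer.
    return re.findall(r"[a-z]+", text.lower())


def cooccurrence_graph(documents, min_count=2):
    """Return {(word_a, word_b): count} for word pairs that share a document.

    Hypothetical helper: the co-occurrence window is the whole document, and
    pairs seen together in fewer than `min_count` documents are dropped.
    """
    edges = Counter()
    for doc in documents:
        vocab = sorted(set(tokenize(doc)))  # sorted so each pair has one key
        edges.update(combinations(vocab, 2))
    return {pair: n for pair, n in edges.items() if n >= min_count}


def neighbours(graph, term):
    # Words linked to `term`, strongest edges first.
    linked = Counter()
    for (a, b), n in graph.items():
        if a == term:
            linked[b] += n
        elif b == term:
            linked[a] += n
    return linked.most_common()


# Invented mini-collection, not data from the experiments.
docs = [
    "spastic diplegia affects the legs",
    "spastic diplegia is a form of cerebral palsy",
    "cerebral palsy affects movement",
]
graph = cooccurrence_graph(docs)
diplegia_neighbours = neighbours(graph, "diplegia")
```

A document whose vocabulary overlaps strongly with the neighbours of the query term could then be scored as more relevant, although the scoring actually used by the system is not given in the slides.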

Page 6: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

Evaluation of the document classification:

Cumulative precision in the ranking of documents with the term "spastic diplegia".
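
Cumulative precision over a ranking, as plotted on this slide, is the fraction of relevant documents among the top k, computed for every k. The sketch below assumes a list of boolean relevance judgments in ranked order; that input format and the function name are hypothetical, and the judgments are invented.

```python
def cumulative_precision(ranked_relevance):
    """Precision after each rank position of a ranked document list.

    `ranked_relevance` holds one boolean per ranked document, True when the
    document was judged relevant (assumed input format).
    """
    precisions = []
    relevant_so_far = 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        relevant_so_far += int(is_relevant)
        precisions.append(relevant_so_far / rank)
    return precisions


# Invented judgments for a five-document ranking, not experimental data.
curve = cumulative_precision([True, True, False, True, False])
```

A good ranking keeps this curve high for the first positions and lets it fall only towards the end of the list.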

Page 7: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

Evaluation of the document classification:

Precision and recall for the experiments.

Term                    Documents   Precision   Recall
Spastic Diplegia        67          88.46%      92.00%
Giant Cell Aortitis     21          92.85%      81.25%
DNA Virus Infections    50          79.31%      74.19%
Meige Syndrome          76          73.07%      90.47%
Down Syndrome           76          85.10%      92.15%

Average:                            83.75%      86.01%
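
For reference, precision and recall follow the standard definitions, and the averages on this slide are plain means over the five terms. The sketch below recomputes those means from the reported percentages. The raw counts passed to `precision_recall` are illustrative only, since the slides do not report true or false positives.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    # Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN).
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Illustrative counts only; the slides give percentages, not raw counts.
example_precision, example_recall = precision_recall(23, 3, 2)

# (precision %, recall %) per term, as reported on the slide.
reported = {
    "Spastic Diplegia": (88.46, 92.00),
    "Giant Cell Aortitis": (92.85, 81.25),
    "DNA Virus Infections": (79.31, 74.19),
    "Meige Syndrome": (73.07, 90.47),
    "Down Syndrome": (85.10, 92.15),
}
avg_precision = sum(p for p, _ in reported.values()) / len(reported)
avg_recall = sum(r for _, r in reported.values()) / len(reported)
```

The means come out at about 83.758 and 86.012, which agrees with the reported 83.75% and 86.01% if the slide truncates to two decimals rather than rounding.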

Page 8: A Suite to Compile and Analyze  an LSP Corpus

Module 1: Compilation of an LSP corpus from the web

Evaluation of the document classification:

Probability distribution of precision as a random variable (performance of 10,000 random classifiers).
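
A random-classifier baseline of this kind can be simulated in a few lines. The sketch below assumes that each random classifier selects a fixed number of documents uniformly at random; the slides do not say how the random classifiers were defined, and the collection size, the share of relevant documents, and the function name are invented for the example.

```python
import random


def random_classifier_precisions(labels, n_selected, n_trials=10_000, seed=0):
    """Precision of classifiers that pick `n_selected` documents at random.

    `labels` holds 1 for a relevant document and 0 otherwise (assumed format).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    precisions = []
    for _ in range(n_trials):
        picked = rng.sample(labels, n_selected)
        precisions.append(sum(picked) / n_selected)
    return precisions


# Invented collection: 100 documents, 40 of them relevant.
labels = [1] * 40 + [0] * 60
precisions = random_classifier_precisions(labels, n_selected=20)
mean_precision = sum(precisions) / len(precisions)
```

The resulting distribution centres on the share of relevant documents in the collection (0.4 in this invented case), which is the reference point against which a real classifier's precision can be compared.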

Page 9: A Suite to Compile and Analyze  an LSP Corpus

Module 2: Analysis of the corpus with statistical techniques

1. Input: from module 1 or from a user-compiled corpus

2. Main functions:

• Measures of vocabulary richness

• Analysis of sample representativeness

• Automatic language recognition

• KWIC search

• N-gram extraction and sorting

• Collocation extraction

• Measures of association

• Models of term distribution

• Coefficients for vector comparison
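
Several of the functions listed above have simple textbook forms. The sketch below shows four of them: type-token ratio as one measure of vocabulary richness, frequency-sorted n-grams, KWIC search, and cosine similarity as one coefficient for vector comparison. It is not JAGUAR's code or API; all function names and the tokenizer are assumptions, and the system may use different measures.

```python
from collections import Counter
import math
import re


def tokenize(text):
    # Lowercase word tokens; a stand-in for the real system's tokenizer.
    return re.findall(r"\w+", text.lower())


def type_token_ratio(tokens):
    # Vocabulary richness as distinct words divided by total words.
    return len(set(tokens)) / len(tokens)


def ngrams(tokens, n):
    # All n-grams with their frequencies, most frequent first.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()


def kwic(tokens, keyword, width=2):
    # Keyword in context: up to `width` tokens on each side of every hit.
    hits = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            hits.append((left, tok, right))
    return hits


def cosine(freq_a, freq_b):
    # Cosine similarity between two word-frequency vectors (dicts or Counters).
    dot = sum(freq_a[w] * freq_b[w] for w in freq_a if w in freq_b)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


tokens = tokenize("The corpus is small but the corpus is specialized")
richness = type_token_ratio(tokens)
bigrams = ngrams(tokens, 2)
contexts = kwic(tokens, "corpus", width=1)
self_similarity = cosine(Counter(tokens), Counter(tokens))
```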

Page 10: A Suite to Compile and Analyze  an LSP Corpus

http://rc16.upf.es/jaguar

Page 19: A Suite to Compile and Analyze  an LSP Corpus

Conclusions

We have presented the JAGUAR system, a set of tools for compiling and exploring an LSP corpus from the web.

The main characteristics of this suite are the following:

• It is able to collect an LSP corpus from the web, ensuring thematic adequacy and a degree of specialization suited to a given domain

• It offers tools to statistically explore such a collection in a friendly interface

• It has also been conceived as a library

The original algorithms have been successfully evaluated.

Its use saves time and effort in the analysis of a corpus and also offers new insights: a perspective on the data that is invisible to the naked eye.

Page 20: A Suite to Compile and Analyze  an LSP Corpus

Future Work

The project is now growing in different directions:

• Progressive enhancement with new functions and algorithms

• Turning into a desktop application

Page 21: A Suite to Compile and Analyze  an LSP Corpus

Thanks!