Post on 23-Jan-2018


CorpusStudio web application

Erwin R. Komen
Meertens Instituut // Radboud University Nijmegen // SIL-International
E.Komen@ru.nl

1. Background
• Existing software:
  • CorpusStudio – Windows
  • Cesax – Windows
  • Successfully used in linguistic research
• Web application version?
  • Central location for corpora (‘last’ version)
  • Platform independent: MacOS/Linux/Windows
  • Fast parallel processing

2. Formats
• FoLiA xml
  • Dutch: Nederlab, CGN, Sonar/Lassy
• TEI-Psdx xml
  • English historical + SLA
  • Caucasian: Chechen, Lak, Lezgi
  • Old Welsh
  • Dutch
• Additional formats
  • Convert via ‘Cesax’ (Alpino, Negra, …)
  • Add handler into CorpusStudio

4. Defining queries
• Definition editor
  • Constants
  • Functions (XQuery)
• Query editor
  • Subcategorization (XQuery)
• Constructor editor
  • Execution order
  • Options (examples, output, complement)
• Result database feature editor
  • XQuery user functions calculate the features
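
The division of labour between definitions, queries, subcategories and features can be sketched in a few lines. The example below is in Python rather than XQuery, and it is only an illustration: the XML fragment is a simplified, namespace-free FoLiA-like structure made up for this purpose, and the names `SAMPLE`, `NOUN_TAG`, `is_noun` and `run_query` are hypothetical, not part of CorpusStudio.

```python
import xml.etree.ElementTree as ET

# Simplified, namespace-free stand-in for a corpus document (assumption:
# real FoLiA uses an XML namespace and much richer annotation).
SAMPLE = """
<text>
  <s id="s1">
    <w id="s1.w1"><t>the</t><pos class="DET"/></w>
    <w id="s1.w2"><t>old</t><pos class="ADJ"/></w>
    <w id="s1.w3"><t>book</t><pos class="N"/></w>
  </s>
  <s id="s2">
    <w id="s2.w1"><t>books</t><pos class="N"/></w>
    <w id="s2.w2"><t>fall</t><pos class="V"/></w>
  </s>
</text>
"""

NOUN_TAG = "N"  # plays the role of a constant from the definition editor


def is_noun(word):
    """Plays the role of a reusable function from the definition editor."""
    pos = word.find("pos")
    return pos is not None and pos.get("class") == NOUN_TAG


def run_query(root):
    """Return one record per noun, with a subcategory and features."""
    hits = []
    for sentence in root.iter("s"):
        words = list(sentence.iter("w"))
        for position, word in enumerate(words):
            if not is_noun(word):
                continue
            hits.append({
                "id": word.get("id"),
                # Subcategorization: split the hits by sentence position.
                "subcat": "initial" if position == 0 else "non-initial",
                # Features calculated per hit, as for a result database.
                "features": {
                    "form": word.findtext("t"),
                    "sentence_length": len(words),
                },
            })
    return hits


hits = run_query(ET.fromstring(SAMPLE))
for hit in hits:
    print(hit["id"], hit["subcat"], hit["features"])
```

Calculating the features at query time means that a later viewer or editor only has to read stored records; it does not need to revisit the corpus.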

6. Availability
• CorpusStudio sources (build your own version)
  • https://github.com/ErwinKomen
• CLARIN-NL access
  • http://www.clarin.nl/node/2095

7. References

Boag, Scott, Don Chamberlin, Mary F. Fernández, Daniela Florescu, Jonathan Robie, and Jérôme Siméon. 2010. XQuery 1.0: An XML Query Language (Second Edition). W3C Recommendation. <http://www.w3.org/XML/Query>.

van Gompel, Maarten, and Martin Reynaert. 2013. FoLiA: A practical XML format for linguistic annotation - a descriptive and comparative study. Computational Linguistics in the Netherlands Journal 3:63-81.

Komen, Erwin R. 2013. Corpus databases with feature pre-calculation. In Sandra Kübler, Petya Osenova & Martin Volk (eds), Proceedings of the twelfth workshop on treebanks and linguistic theories (TLT12), 85-96. Sofia, Bulgaria: The Institute of Information and Communication Technologies, Bulgarian Academy of Sciences.

[Figure: architecture of the CorpusStudio web application. Labels in the diagram: User information, Project information; Meta Data Editor, Definition Editor, Query Editor, Constructor Editor, Input Selector, Database feature editor; Definitions, Queries, Corpus Research Project (.crpx); Search service: crpp; Query Executor, Database Creator, Output Monitor, Status; Documents (.xml), Results (.xml), Corpus Research Database (.xml), Result database; Result Grouping, Standard grouping (.json); Result Viewer, Table Viewer, Grouping Viewer, Corpus Viewer, Result dbase Viewer, Result dbase Editor. Connections are labelled xml or json.]

3. Corpus Research Projects
• All information for one research project
  • Meta information (author, dates, goal)
  • Input (language, corpus, filter)
  • All definition and query files used
  • Execution order
  • Optional: result database features
• Exchange
  • Upload/download
  • Compatible with Windows CorpusStudio
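
The list above amounts to a single self-contained record per project. A minimal Python sketch of such a record follows; it is not the actual .crpx schema, and all class names, field names and file names (`ResearchProject`, `QueryStep`, `input_from`, `nouns.xq`, and so on) are assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class QueryStep:
    """One line of the execution order (hypothetical field names)."""
    query: str                  # name of the query file to run
    input_from: str = "source"  # "source" or the name of an earlier query


@dataclass
class ResearchProject:
    """Illustrative container; not the actual .crpx schema."""
    name: str
    author: str
    goal: str
    language: str
    corpus: str
    corpus_filter: str = ""
    definitions: list = field(default_factory=list)
    queries: list = field(default_factory=list)
    execution_order: list = field(default_factory=list)
    features: list = field(default_factory=list)  # optional

    def check(self):
        """Return names of executed queries that the project does not hold."""
        known = set(self.queries)
        return [s.query for s in self.execution_order if s.query not in known]


project = ResearchProject(
    name="demo", author="A. Researcher", goal="find sentence-initial nouns",
    language="nld", corpus="example-corpus",
    definitions=["common.xq"],
    queries=["nouns.xq", "initial.xq"],
    execution_order=[QueryStep("nouns.xq"),
                     QueryStep("initial.xq", input_from="nouns.xq")],
)

# One serialisable object is what keeps upload/download exchange simple.
payload = json.dumps(asdict(project), indent=2)
restored = json.loads(payload)
print(restored["execution_order"])
print(project.check())
```

Keeping the execution order inside the project, next to the queries it refers to, allows a consistency check such as `check()` to run before anything is sent to a search service.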

CorpusStudio components
• Meta Data Editor
• Definition Editor
• Input Selector
• Query Editor
• Constructor Editor
• Output Monitor
• Query Executor
• Result Viewer
• Corpus Viewer
• Database feature editor

5. Future
• Grouping editor
  • Group output over meta-data categories
  • User-definable (XQuery)
• Query/project wizard
  • Tabular input of principal components
  • Relations, names, feature calculations
• Result database editor
  • View and edit result database records
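
Grouping output over meta-data categories could look roughly like the Python sketch below. The hit records, the meta-data fields (`period`, `genre`) and the function `group_hits` are invented for illustration; the planned grouping editor itself is described above as user-definable in XQuery.

```python
from collections import Counter, defaultdict

# Hypothetical hits: each carries a subcategory and document meta-data.
HITS = [
    {"subcat": "initial", "meta": {"period": "1500-1600", "genre": "letter"}},
    {"subcat": "non-initial", "meta": {"period": "1500-1600", "genre": "letter"}},
    {"subcat": "initial", "meta": {"period": "1600-1700", "genre": "sermon"}},
    {"subcat": "initial", "meta": {"period": "1600-1700", "genre": "letter"}},
]


def group_hits(hits, key_func):
    """Count subcategories per group; key_func is the user-definable part."""
    table = defaultdict(Counter)
    for hit in hits:
        table[key_func(hit["meta"])][hit["subcat"]] += 1
    return {group: dict(counts) for group, counts in table.items()}


by_period = group_hits(HITS, lambda meta: meta["period"])
by_genre = group_hits(HITS, lambda meta: meta["genre"])
print(by_period)
print(by_genre)
```

Passing the grouping key as a function keeps the counting logic fixed while the user decides which meta-data category, or combination of categories, defines a group.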
