NooJ International Conference, Komotini, May 2010

Portability of Armenian Corpus by NooJ

Anaid Donabedian & Victoria Khurshudian

Institut National des Langues et Civilisations Orientales (INALCO), Paris

Armenian: preliminaries

an Indo-European language

right-branching

of an accusative type

typically with an SOV structure and

predominantly with an agglutinative morphology

Historical Armenia

Republic of Armenia

Periodization

prealphabetical

alphabetical (405 A.D. – up to present).

1. Old Armenian or Grabar (V-XI);

2. Middle Armenian (XII-XVI);

3. Modern Armenian (XVII – up to present)

Western (based on Constantinople dialect); dialects…

Eastern (based on Ararat dialect); dialects…

Objective

Provide data compatibility and portability between NooJ and the Eastern Armenian National Corpus (EANC) platform

What is the Eastern Armenian National Corpus?

www.eanc.net

Corpus Technologies

Michael Daniel, Victoria Khurshudian, Dmitri Levonian, Vladimir Plungian, Alexey Polyakov, Sergey Rubakov

[Diagram: Source texts → PARSER (Annotation algorithm, Grammatical dictionary) → Annotated texts]
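
The slides give only the labels of this pipeline, so the following is a toy sketch of dictionary-based annotation rather than the EANC parser itself. The dictionary entries, the tag names and the `annotate` function are all invented for illustration; the wordforms are romanized placeholders.

```python
# Hypothetical miniature grammatical dictionary: wordform -> list of
# (lexeme, grams) analyses. Entries and tags are invented for illustration.
GRAMMATICAL_DICTIONARY = {
    "tun": [("tun", ("N", "nom", "sg"))],
    "tner": [("tun", ("N", "nom", "pl"))],
}


def annotate(source_text, dictionary=GRAMMATICAL_DICTIONARY):
    """Attach every dictionary analysis to each whitespace-separated token."""
    annotated = []
    for wordform in source_text.split():
        # Unknown wordforms keep an empty analysis list instead of being dropped.
        analyses = dictionary.get(wordform.lower(), [])
        annotated.append({"wordform": wordform, "analyses": analyses})
    return annotated


annotated = annotate("Tun tner girk")
```

Storing a list of analyses per wordform leaves room for homonymous forms, which would simply carry more than one entry.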

EANC History

Moscow, Russia

March 2006: Project Launch

July 2007: 1st Release

May 2008: 2nd Release

March 2009: 3rd Release

Eastern Armenian National Corpus (EANC) is:

about 110 million tokens

morphological and other markup

English translations for frequent tokens

covers Standard Eastern Armenian (SEA) from the mid-19th century to the present

both written and oral discourse

full-text view for over 100 Armenian classic titles

open internet access

Written Discourse

over 106 mln. tokens

510 authors (1841-2009)

1039 fiction texts (including 206 translated texts)

7858 press issues

non-fiction (scientific and other) texts

Oral Discourse (3.5 mln. tokens)

Spontaneous discourse

Polylogues

Task-oriented discourse

TV-shows transcripts

Movies …

☼ The EANC oral corpus was recorded and transcribed entirely by the project.

EANC Functionality

Search Functionality

Token queries

Context queries

Subcorpus selection

Simple token queries:

• lexeme search

• wordform search

• gram search

• translation search

• lexeme + gram search

Advanced options for token queries:

case-sensitivity

punctuation marks

position in the sentence

wildcard (*)

logical operators (e.g. ‘or’: |)

negated features

grammatical/lexical homonymy inclusion/exclusion
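
As an illustration of how the options above might combine in one token query, here is a minimal sketch. It is not the EANC query engine: the `Token` class, the `matches` function, the tag names and the romanized sample forms are assumptions made for the example. Wildcards are handled by Python's standard `fnmatch` module, and `|` separates alternatives.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Token:
    wordform: str
    lexeme: str
    grams: frozenset  # grammatical tags; the tag names below are invented


def matches(token, wordform="*", lexeme="*", grams=(), not_grams=(),
            case_sensitive=False):
    """Return True if the token satisfies every part of the query.

    wordform / lexeme: wildcard patterns; '|' separates alternatives.
    grams: tags the token must carry; not_grams: tags it must not carry.
    """
    def text_ok(value, pattern):
        if not case_sensitive:
            value, pattern = value.lower(), pattern.lower()
        return any(fnmatchcase(value, alt) for alt in pattern.split("|"))

    return (text_ok(token.wordform, wordform)
            and text_ok(token.lexeme, lexeme)
            and set(grams) <= token.grams
            and not (set(not_grams) & token.grams))


# Illustrative tokens only.
tokens = [
    Token("tun", "tun", frozenset({"N", "nom", "sg"})),
    Token("tner", "tun", frozenset({"N", "nom", "pl"})),
    Token("grum", "grel", frozenset({"V", "cvb"})),
]

# Nouns whose wordform starts with "t" or "g", excluding plurals.
non_plural_nouns = [t.wordform for t in tokens
                    if matches(t, wordform="t*|g*", grams=["N"],
                               not_grams=["pl"])]
```

Every criterion defaults to "match anything", so a lexeme search, a gram search and a lexeme + gram search are the same call with different arguments.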

Subcorpus selection by:

time

author(s) / title(s)

genres

types of texts (translated vs. original)

superposition of any of the above
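
The superposition of criteria can be pictured as a chain of optional metadata filters. The sketch below assumes a hypothetical `TextMeta` record and invented sample texts; the field names are not taken from the EANC platform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextMeta:
    title: str
    author: str
    year: int
    genre: str
    translated: bool


def select_subcorpus(texts, years=None, authors=None, genres=None,
                     translated=None):
    """Keep texts that satisfy every criterion that was supplied.

    A criterion left as None is ignored, so any combination can be stacked.
    """
    selected = []
    for t in texts:
        if years is not None and not (years[0] <= t.year <= years[1]):
            continue
        if authors is not None and t.author not in authors:
            continue
        if genres is not None and t.genre not in genres:
            continue
        if translated is not None and t.translated != translated:
            continue
        selected.append(t)
    return selected


# Invented records for illustration only.
texts = [
    TextMeta("Text A", "Author 1", 1890, "fiction", False),
    TextMeta("Text B", "Author 2", 1975, "fiction", True),
    TextMeta("Text C", "Author 1", 2005, "press", False),
]

original_fiction = select_subcorpus(texts, genres={"fiction"}, translated=False)
```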

Display options

context expanding

‘sort by’ (time, lexeme, wordform etc.)

Latin transliteration

glossed display

KWIC (key word in context)
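
For readers unfamiliar with the KWIC layout, a minimal sketch follows. It shows the general technique only, not the EANC display code; the function names and the English sample sentence are assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return (left context, keyword, right context) for each hit."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows


def format_kwic(rows):
    """Right-align left contexts so the keywords line up in one column."""
    pad = max((len(left) for left, _, _ in rows), default=0)
    return "\n".join(f"{left:>{pad}}  [{kw}]  {right}"
                     for left, kw, right in rows)


sample = "the corpus shows the word in context and the word again".split()
rows = kwic(sample, "word", width=2)
print(format_kwic(rows))
```

Keeping the hit collection separate from the formatting means the same rows could be re-sorted (by left or right context, for instance) before display.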

Transliterated samples:

Glossed samples:

KWIC samples:

Main Current Tasks:

Make the NooJ-based Western Armenian morphological annotation compatible with the EANC grammatical dictionary structure

Make the EANC and NooJ Western Armenian platforms mutually portable

Achieve full mutual coverage of NooJ and EANC capabilities (e.g. NooJ's syntactic annotation)
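
One way to picture the compatibility task is a tag-by-tag conversion table between the two annotation schemes. The sketch below is an assumption throughout: the input is a NooJ-style string of the form `lemma,CATEGORY+feature+...`, and neither column of the table reproduces the actual Western Armenian tagset or the actual EANC gram inventory.

```python
# Hypothetical correspondence table: neither column reproduces the real
# NooJ Western Armenian tagset or the real EANC gram inventory.
NOOJ_TO_EANC = {
    "N": "N",
    "V": "V",
    "s": "sg",
    "p": "pl",
    "Nom": "nom",
    "Dat": "dat",
}


def convert_analysis(nooj_analysis, table=NOOJ_TO_EANC):
    """Convert 'lemma,CAT+feat+feat' into (lexeme, grams, unmapped)."""
    lemma, _, tags = nooj_analysis.partition(",")
    grams, unmapped = [], []
    for tag in filter(None, tags.split("+")):
        if tag in table:
            grams.append(table[tag])
        else:
            unmapped.append(tag)  # kept for review rather than dropped
    return lemma, grams, unmapped


result = convert_analysis("tun,N+p+Nom+Hum")
```

Returning the unmapped tags separately makes gaps in coverage visible, which matters if the goal is full mutual coverage rather than a lossy export.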
