Bastien Kindt – [email protected] Tamar Pataridze – [email protected] Emmanuel Van Elverdinghe – [email protected] International Workshop on Computer Aided Processing of Intertextuality in Ancient Languages Lyon, 2 nd -4 th June 2014


  • Bastien Kindt [email protected] Tamar Pataridze [email protected]

    Emmanuel Van Elverdinghe [email protected]

    International Workshop on Computer Aided Processing of Intertextuality in Ancient Languages

    Lyon, 2nd-4th June 2014

  • PRLG

    Projet de recherche en lexicologie grecque

  • Two goals

    Creating an electronic dictionary of Ancient Greek

    Lemmatizing patristic and historiographical Byzantine texts

  • The Dictionary

    DICTIONNAIRE AUTOMATIQUE GREC (D.A.G.)

    Lexical data stemming directly from corpus-based observations, which ensures comprehensiveness and coherence

    No restriction on the date, literary genre, language level or dialect of the texts handled

  • The Dictionary

    DICTIONNAIRE AUTOMATIQUE GREC (D.A.G.)

    434,190 word-forms

    66,772 lemmata

    Every morphosyntactic category

  • The Dictionary

    DICTIONNAIRE AUTOMATIQUE GREC (D.A.G.)

    Proper names: anthroponyms, toponyms

    Numeric determiners

    Crases (984 different forms)

    Elided forms (1,160 forms)

  • Lemmatization

    1990-1991 Thesaurus Sancti Gregorii Nazianzeni

  • Lemmatization

    Clement of Alexandria (2nd-3rd c.)

    Basil of Caesarea (4th c.)

    Gregory of Nyssa (4th c.)

    Procopius of Caesarea (6th c.)

    Theophylact Simocatta (6th-7th c.)

    Theophanes Confessor (8th-9th c.)

    Joseph Genesius (10th c.)

    Doukas (15th c.)

    Etc.

  • Lemmatization

    Comprehensive inventories of the vocabulary of Byzantine patristic and historiographic texts, built with the D.A.G.

  • Lemmatization

    Concordances published in the Thesaurus Patrum Graecorum series (Brepols Publishers)

    24 volumes published

    Concordances on microfiches (!)

  • PRLG's tools

    Lemmatized concordances

    Frequency indexes

    Reverse indexes

    End-of-book indexes

    Indexes of words common or specific (to two corpora or corpus parts)

    Etc.
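A minimal sketch of how three of the tools listed above could be derived from a list of lemmata. It is an illustration only, not the PRLG implementation; the function names and the transliterated sample lemmata are assumptions made for the example.

```python
from collections import Counter


def frequency_index(lemmata):
    """Return (lemma, count) pairs sorted by descending frequency, then alphabetically."""
    counts = Counter(lemmata)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


def reverse_index(lemmata):
    """Return the distinct lemmata sorted by their reversed spelling."""
    return sorted(set(lemmata), key=lambda lemma: lemma[::-1])


def common_or_specific(lemmata_a, lemmata_b):
    """Split the vocabulary of two corpora into common and corpus-specific lemmata."""
    vocab_a, vocab_b = set(lemmata_a), set(lemmata_b)
    return {
        "common": sorted(vocab_a & vocab_b),
        "only_a": sorted(vocab_a - vocab_b),
        "only_b": sorted(vocab_b - vocab_a),
    }


# Hypothetical lemmatized tokens (transliterated placeholders, not project data).
corpus_a = ["logos", "theos", "logos", "anthropos"]
corpus_b = ["theos", "polis", "polis"]

freq = frequency_index(corpus_a)
rev = reverse_index(corpus_a)
split = common_or_specific(corpus_a, corpus_b)
```

A reverse index groups words by their endings, which is why the sort key is the reversed spelling rather than the word itself.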

  • From PRLG to GREgORI project

    Going fully digital

    Extending the computing tools and linguistic resources developed for Greek to the other languages of the Christian Orient

    Building multilingual lexica

  • From PRLG to GREgORI project

    Armenian: studying the formulaic style in manuscript colophons (E. Van Elverdinghe)

    Georgian: studying translation methods from Greek (T. Pataridze)

  • Armenian manuscript colophons

  • I. Lemmatizing Armenian

    II. The Armenian colophons project

    III. An illustrative case-study

  • Lemmatizing Armenian

    SOME METHODOLOGICAL NOTES

    Indo-European, inflectional language with a leaning towards agglutination

    Grammatical categories

    Tokenization issues

    Diachrony

  • II. The Armenian colophons project

    1. CORPUS

    2. PURPOSE

    3. METHOD

    I. Lemmatizing Armenian

    III. An illustrative case-study

  • 1. CORPUS

    Text

    Digitized text editions

    6,154 pages in 9 volumes

    5th century to 1500 + 1601 to 1660

    16,000 different colophons (from 1 word to several pages)

    >1,300,000 forms

    Processed through Unitex

    Database

    Gathering metadata extracted from the editions

    Reference

    Date

    Author

    Place

    Manuscript content

    Etc.

  • 2. PURPOSE

    Studying stereotypical patterns (formulae)

    Lifespan

    Frequency

    Variation

    Evolution

    Geographical diffusion

    Relevant for manuscript studies as a whole

  • 3. METHOD

    1) Spotting formulaic patterns (= collocations)

    2) Determining the formula's structure

    3) Extracting all utterances

    4) Cross-analysis with information stored in the database

    5) Sketch of the formula's life and deeds
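Step 4 above (cross-analysis with the database) might look like the following sketch, which joins the extracted occurrences with colophon metadata to count attestations per century. The record layout, the colophon identifiers and the `year` field are assumptions made for the example; the slides do not describe the actual database schema.

```python
from collections import Counter


def century_of(year):
    """Map a year AD to its century number (e.g. 989 -> 10, 1500 -> 15, 1601 -> 17)."""
    return (year - 1) // 100 + 1


def attestations_by_century(occurrences, metadata):
    """Count formula occurrences per century by joining on the colophon id."""
    counts = Counter()
    for colophon_id in occurrences:
        year = metadata[colophon_id]["year"]
        counts[century_of(year)] += 1
    return dict(sorted(counts.items()))


# Hypothetical metadata records keyed by a made-up colophon id.
metadata = {
    "c1": {"year": 989},
    "c2": {"year": 1331},
    "c3": {"year": 1399},
    "c4": {"year": 1601},
}
# One entry per occurrence of the formula; a colophon may appear more than once.
occurrences = ["c1", "c2", "c3", "c4", "c3"]

per_century = attestations_by_century(occurrences, metadata)
```

The same join could group by place or copyist instead of date, which is what makes the lifespan and geographical diffusion of a formula traceable.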

  • II. The Armenian colophons project

    1. COLLOCATIONS

    2. STRUCTURE

    3. ANALYSIS

    4. LEARNINGS

    I. Lemmatizing Armenian

    III. An illustrative case-study

  • An illustrative case-study

    [I wrote this] from a good and choice exemplar

    396 occurrences (variants included)

    From 989 AD onwards

    Relatively stable

    Verbose colophons

  • 1. COLLOCATIONS

    Log-likelihood

    Browsing the concordance

    Literature survey
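As an illustration of the log-likelihood criterion named above, here is a minimal sketch of Dunning's G² statistic applied to adjacent word pairs. The slides do not say which variant or tool settings were used, so the 2x2 contingency formulation and the helper `score_bigrams` are assumptions.

```python
import math
from collections import Counter


def log_likelihood(k11, k12, k21, k22):
    """Dunning's G2 for a 2x2 contingency table of co-occurrence counts.

    k11: A followed by B          k12: A followed by another word
    k21: another word before B    k22: neither A first nor B second
    """
    total = k11 + k12 + k21 + k22
    row_sums = (k11 + k12, k21 + k22)
    col_sums = (k11 + k21, k12 + k22)
    observed = ((k11, k12), (k21, k22))
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # an empty cell contributes nothing to the sum
                expected = row_sums[i] * col_sums[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def score_bigrams(tokens):
    """Return {(a, b): G2} for every adjacent pair of tokens."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    n = sum(bigrams.values())
    first = Counter(tokens[:-1])   # how often each word opens a bigram
    second = Counter(tokens[1:])   # how often each word closes a bigram
    scores = {}
    for (a, b), k11 in bigrams.items():
        k12 = first[a] - k11
        k21 = second[b] - k11
        k22 = n - k11 - k12 - k21
        scores[(a, b)] = log_likelihood(k11, k12, k21, k22)
    return scores


# Placeholder tokens standing in for lemmatized colophon text.
scores = score_bigrams(["x", "y", "x", "y", "z", "w"])
best = max(scores, key=scores.get)
```

A high score only signals that a pair departs from independence, which is presumably why the slide pairs the statistic with browsing the concordance and a literature survey.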

  • 2. STRUCTURE

    Yields 56 pertinent matches

    Provides new vocabulary

  • = good, choice

    but also = reliable

    true

    glorious etc.

  • Concordance for each word

    Some variation in structure

    3 qualifiers

    Repetition/omission of the preposition

    Polysyndeton/asyndeton

    Rarer vocabulary

    New, more complex, and exhaustive graph
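The variation described above (2 or 3 qualifiers, repeated or omitted preposition, polysyndeton or asyndeton) was captured in the project with a graph. The sketch below swaps in a regular expression over a lemmatized line to show the same idea; it is not the authors' graph, and the English glosses stand in for the Armenian lemmata, which are not preserved here.

```python
import re

# English glosses used as placeholders for the Armenian qualifier lemmata.
QUALIFIERS = ["good", "choice", "reliable", "true", "glorious"]
Q = "|".join(QUALIFIERS)

# from Q, then 1 or 2 further qualifiers, each optionally preceded by
# "and" (polysyndeton) and by a repeated "from", then the head noun.
FORMULA = re.compile(
    rf"\bfrom (?:{Q})"
    rf"(?: (?:and )?(?:from )?(?:{Q})){{1,2}}"
    r" exemplar\b"
)


def find_formulae(lemma_line):
    """Return (matched text, number of qualifiers) for each match in a lemmatized line."""
    results = []
    for match in FORMULA.finditer(lemma_line):
        qualifiers = re.findall(rf"\b(?:{Q})\b", match.group(0))
        results.append((match.group(0), len(qualifiers)))
    return results


two = find_formulae("i write this from good and choice exemplar")
three = find_formulae("from reliable true and from glorious exemplar")
none = find_formulae("from good exemplar")
```

Matching on lemmata rather than surface forms keeps the pattern independent of inflection and spelling variants, at the cost of requiring a lemmatized corpus first.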

  • Total: 396 matches

    2 qualifiers

    3 qualifiers

  • 3. ANALYSIS

    Statistical outlook

    Most frequent adjectives

    Making up 90% of attestations

    [Table of adjective pairs A and B; labels not preserved. Surviving counts: 205, 9, 16, 2, 45, 1, 78]

  • Attestations by century

    Varieties found in only 1 manuscript

    [Two charts over the 10th, 11th, 12th, 13th, 14th, 15th and 17th c.; the pairing of values with centuries is not preserved. Percentages: 20%, 27.3%, 12.3%, 1.7%, 3.4%. Counts: 2, 1, 5, 33, 57, 118, 400?]

  • 3. ANALYSIS

    Formula in context: metadata

    Date, copyist and locality almost always known

    Found alongside the formula

    Part of a wider-scale pattern

    In early times, mostly in biblical manuscripts

    Not very significant

    Subtypes distinguishable by the context

  • [Armenian text of the formula not preserved]

    "I wrote this with my unworthy hands, from a reliable and choice exemplar, with a tormented life and through much emigration ..."

    N.B. Both word orders are attested.

  • 42 attestations

    83% before 1500

    First time in 1331

    Gospels, Bibles

    Lake Van (1 from Jerusalem)

    Often the same copyists

    7 attestations in the 17th c.

    Lake Van; 1 from today's Armenia

    Gospels, canon-books

  • : 21 attestations (50%)

    Gospels only

    First time in 1399, in Ałtʿamar (island on Lake Van)

    17 times during the 15th century: 14 manuscripts from Ałtʿamar, 3 written on the shores of Lake Van

    All manuscripts from Ałtʿamar with this formula present this word order

  • vs

    History and geography of the formula

  • 4. LEARNINGS

    Copyists' mentality

    Stylistic and orthographic habits

    Increasing standardization

    Inferring missing information about some manuscripts

    Insight into the life and activity of copyists: passing down of techniques, traditions, and knowledge

  • Lemmatizing Georgian

    Bilingual index of Gregory of Nazianzus

  • Edited, digitized and formatted text

    DATABASE (SQL)

    UNITEX Corpus processor

    - First disambiguation step
    - Lexical lookup
    - Fully manual disambiguation
    - Lexical data export

    Production of lexical tools

    - Lemmatized concordances
    - Frequency index
    - Reverse index
    - End-of-book index
    - Common or specific vocabulary index
    - Fully tagged corpus
    - Bilingual index

    State of the art for processing Greek
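The lexical lookup step in the pipeline above can be pictured with the following sketch. It is a plain-Python stand-in, not Unitex itself: each word-form is looked up in a form-to-analyses dictionary, and forms with zero or several analyses are flagged for manual disambiguation. The dictionary entries are placeholders; only the tags (N+Com, V, A) come from the slides.

```python
def lexical_lookup(tokens, dictionary):
    """Attach every candidate (lemma, tag) analysis to each token.

    Returns a list of (token, analyses) pairs. An empty list marks an unknown
    form; more than one analysis marks a form awaiting disambiguation.
    """
    return [(token, dictionary.get(token, [])) for token in tokens]


def needs_review(looked_up):
    """Return the tokens that do not have exactly one analysis."""
    return [token for token, analyses in looked_up if len(analyses) != 1]


# Hypothetical dictionary: placeholder forms and lemmata, tags as in the slides.
dictionary = {
    "form_a": [("lemma_a", "N+Com")],
    "form_b": [("lemma_b", "V"), ("lemma_c", "A")],
}

looked_up = lexical_lookup(["form_a", "form_b", "form_x"], dictionary)
pending = needs_review(looked_up)
```

Keeping all candidate analyses until a human decides matches the "fully manual disambiguation" step listed in the pipeline.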

  • Lemmatization principles for Georgian

    B. Kindt, La lemmatisation des sources patristiques et byzantines au service d'une description lexicale du grec ancien. Les principes de formulation des lemmes du Dictionnaire Automatique Grec, in: Byzantion, 74 (2004), pp. 213-272.

    B. Coulie, B. Kindt, T. Pataridze, Lemmatisation automatique des sources du géorgien ancien, in: Le Muséon, 126 (2013), pp. 161-201.

  • Final goal: a bilingual Greek-Georgian index

  • DATABASE (SQL)

    - SOURCE language tagged corpus

    - TARGET language tagged corpus

    Production of bilingual index

    mkAlign Text alignment processor

    Method: from text-alignment to bilingual index
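A minimal sketch of the last step, from aligned tagged corpora to a bilingual index. It assumes the alignment (done with mkAlign in the project) has already been reduced to word-level triples of source entry, target entry and text reference; that input format, the function name and the placeholder lemmata are assumptions, while the tags come from the slides.

```python
from collections import defaultdict


def bilingual_index(aligned_pairs):
    """Group aligned (source, target, ref) triples into an index.

    Returns {source: [(target, frequency, percentage, [refs...]), ...]} with the
    targets of each source sorted by descending frequency.
    """
    refs = defaultdict(lambda: defaultdict(list))
    for source, target, ref in aligned_pairs:
        refs[source][target].append(ref)

    index = {}
    for source, targets in refs.items():
        total = sum(len(r) for r in targets.values())
        rows = [
            (target, len(r), round(100 * len(r) / total, 1), r)
            for target, r in targets.items()
        ]
        index[source] = sorted(rows, key=lambda row: (-row[1], row[0]))
    return index


# Hypothetical aligned entries: (lemma, tag) pairs with made-up lemmata and refs.
pairs = [
    (("G1", "V"), ("K1", "V+Mas"), "1.1"),
    (("G1", "V"), ("K1", "V+Mas"), "1.4"),
    (("G1", "V"), ("K2", "V+Part"), "2.3"),
]

index = bilingual_index(pairs)
```

Each row carries a frequency, a percentage and the text references, so a correspondence such as Greek V to Georgian V+Mas can be both counted and checked in context.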

  • Some analysis

    Verbs and Common Nouns

  • VERBS

    [Gr.] V = V+Mas [Geo.]

    [Gr.] V = V+Part [Geo.]

  • Correspondence

  • Correspondence

    Greek Georgian

    N+Com

    V+Part

  • [Geo.] V+Part = A [Gr.]

    A V+Part

    Although [ketili] has the morphology of a participle, formed with the morpheme [-il], it is used as an adjective

  • [Geo.] V+Part = N+Com [Gr.]

    N+Com V+Part

    [mo-u-ar-i] is an active participle formed with the [mo--ar] morphemes. Literally it means "one who teaches", but once it has become a substantive it takes on a meaning close to "professor".

    Another example, with a past participle: V+Part [mo-na-geb-i], "what was obtained, earned", leads to the meaning "goods, property". The corresponding Greek term is an N+Com.

  • COMMON NOUNS

    Greek Georgian

    N+Com V+Part

    N+Com

  • [Gr.] N+Com = V+Part [Geo.]

    N+Com A

    N+Com V+Part

    [na-ksov-i], past participle, "something that is knit", with the meaning "material, fabric"

    [Gr.] N+Com = A [Geo.]

    "Grass, lawn, greenery, green": the meaning can be expressed by the suffix of possession [-ovan], where [mcuanil-ovan-il] means something like "holder of grass"

  • Greek Georgian

    N+Com

    I+Adv

    N+Com

  • [Geo.] N+Com = I+Adv [Gr.]

    PRO+Per1p N+Com

    V N+Com

    I+Adv N+Com

    [Geo.] N+Com = V [Gr.]

    [Geo.] N+Com = Pro+Pers [Gr.]

    [tavi] = head

  • [Geo.] N+Com = A [Gr.]

    = genitive of "stone". The genitive of the common noun receives a nominative lemma tagged N+Com.

    Genitive of common noun [in Georgian] = adjective [in other languages]

    Adverbial case of the adjectives and participles [in Georgian] = adverbs [in other languages]

  • _{.PRO+Per1s.54-0} _{ ().V.55-0} = -_{.V+Mas.52-0}

    [Gr.] PRO+Pers + V = V [Geo.]

  • - = - [tana]

    V V+Mas

    V V+Mas

    V V+Mas

    V V+Mas

    V V+Part

  • A $ A$N+Com

    A V+Part

    frequency % + text ref

    frequency % + text ref

    frequency % + text ref

    frequency % + text ref

    frequency % + text ref

    frequency % + text ref

    frequency % + text ref
