template produced at the graphics support workshop, media centre combining the strengths of umist...



Template produced at the Graphics Support Workshop, Media Centre. Combining the strengths of UMIST and The Victoria University of Manchester.


Aims

The GerManC project involves the compilation of a representative corpus of German texts for the period 1650-1800.

It is designed to parallel historical corpora of English (e.g. ARCHER, Helsinki) for this period in order to facilitate comparative synchronic study of the two languages.

Design

The corpus will consist of 2,000-word extracts from eight text types:

orally-oriented: drama, newspapers, sermons, letters

print-oriented: narrative prose, academic texts, medical texts, legal texts

To ensure representativeness there will be an equal number of extracts from:

three sub-periods: 1650-1700; 1701-1750; 1751-1800

five regions: North; West Central; East Central; South-West; South-East

This will result in a corpus of about 800,000 words and will be the first representative corpus of German for this period.
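The design arithmetic can be sketched as follows. The poster does not state how many extracts fall in each combination of text type, sub-period and region; the figure of three used here is an assumption inferred from the pilot's 45 newspaper extracts spread over 3 sub-periods and 5 regions. On that assumption the totals come out somewhat below the rounded figures quoted on the poster (100,000 and about 800,000 words).

```python
# Sketch of the corpus-size arithmetic implied by the design.
# EXTRACTS_PER_CELL is an assumption (45 pilot extracts / (3 periods * 5 regions)),
# not a figure stated on the poster.
SUB_PERIODS = ["1650-1700", "1701-1750", "1751-1800"]
REGIONS = ["North", "West Central", "East Central", "South-West", "South-East"]
WORDS_PER_EXTRACT = 2000
EXTRACTS_PER_CELL = 3


def corpus_size(n_text_types, extracts_per_cell=EXTRACTS_PER_CELL):
    """Return the approximate word count for a balanced corpus."""
    cells = n_text_types * len(SUB_PERIODS) * len(REGIONS)
    return cells * extracts_per_cell * WORDS_PER_EXTRACT


pilot_words = corpus_size(1)  # newspapers only
full_words = corpus_size(8)   # all eight text types
```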

It will further the synchronic study of the development of German syntax and lexis in the early modern period, and also provide material for investigating the process of standardization in German. The regional representativeness is vital for this; these 150 years saw the decline of local linguistic norms and the emergence of a supraregional standard accepted throughout the Holy Roman Empire.

Methods

stage 1 - digitization

For the pilot project 45 extracts from German newspapers of this period were digitized by double-keying, i.e. entered independently by two people and the results compared and checked with the original to eliminate mistakes. Scanning (apart from being potentially more prone to error) was not feasible as there is no reliable OCR program for black letter (‘Gothic’) typefaces.
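The comparison step of double-keying can be illustrated with a minimal sketch. It is not the project's own tooling, which the poster does not describe; the function name and the sample strings are invented for illustration. It aligns two independently keyed versions word by word and lists the places where they disagree, so that each discrepancy can be checked against the original.

```python
from difflib import SequenceMatcher


def keying_discrepancies(version_a, version_b):
    """List the word-level differences between two keyed versions.

    Returns tuples of (word index in version_a, words in a, words in b).
    """
    words_a = version_a.split()
    words_b = version_b.split()
    matcher = SequenceMatcher(a=words_a, b=words_b, autojunk=False)
    discrepancies = []
    for op, a0, a1, b0, b1 in matcher.get_opcodes():
        if op != "equal":
            discrepancies.append((a0, words_a[a0:a1], words_b[b0:b1]))
    return discrepancies


# Invented sample lines, not taken from the corpus.
keyed_by_first_typist = "Auß Wien vom 12. Martii"
keyed_by_second_typist = "Aus Wien vom 12. Martii"
differences = keying_discrepancies(keyed_by_first_typist, keyed_by_second_typist)
```

Each reported discrepancy still needs a human decision against the printed source, since either typist (or both) may be wrong.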

stage 2 - annotation

The corpus was then annotated according to the standards of the Text Encoding Initiative (TEI). Each text was supplied with administrative metadata (header information, etc.) and marked for significant textual features using the TEI tagset.

The TEI conventions were applied rigorously, and as this corpus consists of newspapers with a wealth of relevant detail it required a very intensive level of annotation. It was marked for loan words, passages in languages other than German, proper names (of places, people, organizations etc.), numbers, dates, times, abbreviations with expansions, special characters and other diacritics, illustrations, text decorations and any formatting conventions.
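To give an idea of what this level of markup looks like, the sketch below parses a small invented fragment and counts the annotation elements in it. The element names (name, date, num, foreign, choice, abbr, expan) are standard TEI elements, but the poster does not say which elements or attributes the project actually used, so the fragment is an assumption and not a sample from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Invented TEI-style fragment for illustration only.
SAMPLE = (
    '<p xmlns="http://www.tei-c.org/ns/1.0">Aus '
    '<name type="place">Wien</name> vom '
    '<date when="1701-03-12">12. Martii</date>: '
    '<choice><abbr>Ihr. Maj.</abbr><expan>Ihro Majestät</expan></choice> haben '
    '<num value="3">drey</num> '
    '<foreign xml:lang="fr">Regimenter</foreign> abgeschickt.</p>'
)


def count_annotations(xml_text):
    """Count elements by local name, ignoring the namespace prefix."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for element in root.iter():
        local_name = element.tag.split("}", 1)[-1]
        counts[local_name] += 1
    return counts


annotation_counts = count_annotations(SAMPLE)
```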

Exchanger XML was used as editing software, and CLaRK for automatic conformance checking in line with TEI P5 standards. Each stage of corpus construction and annotation was documented in detail and any deviations from and modifications of existing TEI standards were noted and accounted for.

Analytical tools

A major objective was to develop programs for tagging and lemmatizing the corpus.

The Stuttgart-Tübingen tagset was adapted and this produced good results, with some 80% of word forms tagged and lemmatized accurately. Significant regularities could be exploited to automate assigning basic leading forms for specific variants for each text. Programs were developed to normalize variant spellings, capturing the relationship between the variants and a standardized form and establishing an overall lexicon of variant forms for each lemma.
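The idea of normalizing variant spellings and collecting a lexicon of variants can be sketched as below. This is a simplified illustration and not the project's programs: the mapping table is a small hand-made example, the function names are hypothetical, and the lexicon is keyed on the standardized form rather than on a true lemma.

```python
from collections import defaultdict

# Hand-made illustrative mapping from variant spelling to standardized form.
VARIANT_TO_STANDARD = {
    "vnd": "und",
    "unnd": "und",
    "seyn": "sein",
    "thun": "tun",
}


def normalize(token):
    """Return the standardized form of a token, or the lower-cased token itself."""
    lowered = token.lower()
    return VARIANT_TO_STANDARD.get(lowered, lowered)


def build_variant_lexicon(tokens):
    """Map each standardized form to the sorted list of spellings observed for it."""
    lexicon = defaultdict(set)
    for token in tokens:
        lexicon[normalize(token)].add(token.lower())
    return {form: sorted(variants) for form, variants in lexicon.items()}


variant_lexicon = build_variant_lexicon(["vnd", "und", "Unnd", "seyn"])
```

A table-driven approach like this only covers listed variants; exploiting the regularities mentioned above would mean adding rewrite rules on top of it.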

Application

Further programs were developed, e.g. to allow searches for particular tag sequences. Thus, by searching for sequences of determiner + adjective + noun, lists can be generated to show the inflection of adjectives within the noun phrase. This was subject to considerable variation at this time, and the corpus shows the elimination of one variant, leaving only the one that was eventually adopted into the standard language.
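A search for a tag sequence of this kind can be sketched as a sliding window over tagged tokens. The sketch assumes the corpus is available as (word, tag) pairs and uses the Stuttgart-Tübingen tags ART, ADJA and NN for determiner, attributive adjective and noun; the function name and the sample sentence are invented, and the project's adapted tagset may differ.

```python
def find_tag_sequences(tagged_tokens, pattern):
    """Return the word sequences whose tags match the pattern exactly."""
    width = len(pattern)
    hits = []
    for start in range(len(tagged_tokens) - width + 1):
        window = tagged_tokens[start:start + width]
        if all(tag == wanted for (_, tag), wanted in zip(window, pattern)):
            hits.append([word for word, _ in window])
    return hits


# Invented sample, not taken from the corpus.
sample = [
    ("der", "ART"), ("gute", "ADJA"), ("Mann", "NN"),
    ("kam", "VVFIN"), ("mit", "APPR"),
    ("dem", "ART"), ("alten", "ADJA"), ("Freund", "NN"),
]
noun_phrases = find_tag_sequences(sample, ["ART", "ADJA", "NN"])
```

Listing the adjective in each hit alongside its determiner is then enough to compare competing inflectional endings across sub-periods and regions.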

Further developments

In the proposed extended project, with the compilation of the complete corpus of 800,000 words, further tools will be developed, in particular to parse the corpus.

It would also be desirable to identify the morphosyntactic properties of each word-form. A start has been made with a program identifying singular and plural nouns and their cases with a reasonable degree of accuracy (ca. 75%).

GerManC

an annotated, spatialised, multi-genre corpus of Early Modern German

Martin Durrell, Astrid Ensslin, Paul Bennett

Pilot

The project was piloted by the compilation of a corpus of 100,000 words from one text type – newspapers – with this design, i.e. with an equal number of texts from the three sub-periods and five regions.

This was completed with an ESRC grant (RES-000-22-1609) between March 2006 and March 2007. A bid for funding of the complete project, which will include the other text types, is currently awaiting decision.