celex user guide

Upload: amirzetz

Post on 04-Jun-2018


  • 8/13/2019 Celex user guide

    1/30

    CELEX

    A GUIDE FOR USERS

    GAVIN BURNAGE

    CELEX

    CENTRE FOR LEXICAL INFORMATION


    CELEX CENTRE FOR LEXICAL INFORMATION

    Max Planck Institute for Psycholinguistics
    Wundtlaan 1
    6525 XD Nijmegen
    The Netherlands

    Telephone
    31-(0)24-3615797
    31-(0)24-3615751

    Fax
    31-(0)24-3521213

    Electronic mail
    internet: [email protected]

    First published in the Netherlands in 1990

    © 1990 CELEX CENTRE FOR LEXICAL INFORMATION

    ISBN 90 373 0063 4

    No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher.

    Typeset using the TeX computer typesetting system.

    Printed by drukkerij SSN, Nijmegen.

    TeX is a trademark of the American Mathematical Society.


    INTRODUCTION

    There can be no doubt that lexicography is a very difficult sphere of linguistic activity. Many lexicographers have given vent to their feelings in this respect. Perhaps the most colourful of these opinions based on a lexicographer's long experience is that of J.J. Scaliger (16th-17th cent.), who says in fine Latin verses that the worst criminals should neither be executed nor sentenced to forced labour, but should be condemned to compile dictionaries, because all the tortures are included in this work.

    LADISLAV ZGUSTA Manual of Lexicography (1971)

    The 1980s will one day be seen as a watershed in lexicography: the decade in which computer applications began to alter radically the methods and the potential of lexicography. Gone are the days of painstaking manual transcription and sorting on paper slips: the future is on disk, in the form of vast lexical databases, continuously updated, that can generate a dictionary of a given size and scope in a fraction of the time it used to take.

    DAVID CRYSTAL The Cambridge Encyclopedia of Language (1988)


    CONTENTS

    1 DATABASES AND LEXICONS
    1.1 Why use a database?

    2 LEXICON TYPES
    2.1 Dutch Lemmas
    2.2 Dutch Wordforms
    2.3 Dutch Abbreviations
    2.4 Dutch INL Corpus Types
    2.5 English Lemmas
    2.6 English Wordforms
    2.7 English COBUILD Corpus Types
    2.8 German Lemmas
    2.9 German Wordforms
    2.10 German Mannheim Corpus Types


    1 DATABASES AND LEXICONS

    This introduction tries to do two things. In the first section, for those who aren't familiar with the ideas and possibilities of databases and lexicons, there is a description of the ways in which a computer database and lexicon is like, and more importantly unlike, a traditional paper dictionary. If you're already familiar with such things, you may like to skip ahead to the second section, where there is a description of each of the main lexicon types available to you in FLEX. Fundamental to this description is the difference between wordforms (the words we use in everyday speech and writing) and lemmas (words used to represent families of wordforms, in the same way as bold-type dictionary headings, which take the form of stems or headwords). Since the linguistic information available to you depends on the type of your lexicon, you should make sure you understand the differences between the various lexicon types before beginning your work. And when you start work with FLEX, the special program which helps you build and use your lexicons, you'll be better off for having read these sections carefully. In the third and last section of this introductory chapter, you can find out how to log into CELEX using local, national, and international computer networks.

    1.1 WHY USE A DATABASE?

    Since we are dealing with words, we can start off by thinking of databases in terms of a paper dictionary. A book like the Van Dale Groot Woordenboek der Nederlandse Taal is essentially a long list of words with information supplied alongside each word. The key to a dictionary is the alphabetical order of its word entries: you can only look up one particular word at a time and examine the information given for it. If you've got time, you can look at every page to find all the words with a certain grammatical code or pronunciation, but, quite understandably, most people don't do this unless they're really desperate.

    In its simplest form, a database can be like a dictionary: just a list of words, and some information alongside each word. The first important difference between a computer database and a paper dictionary is that the database uses different columns to store separate types of information, whereas the dictionary uses one paragraph of text, and marks different sorts of information within that text by using different typefaces and coding systems, or by giving the information in a particular order. Dictionary text is fixed once it is printed. You can't move bits of an entry around, or leave them out: you are presented with everything at once, and you may have to read a lot of irrelevant information before you find what you're looking for. The columns which make up a database are much more regimented, but that, paradoxically, is what gives a database its flexibility. Each type of information keeps strictly to its own dedicated place, which means it's easier for the computer to locate and serve up one individual item, or several particular items, relating to each word that interests you. So, you can look up a word and its word class code and pronunciation, say, without even having to glance at all the other information. The diagram below is a simple representation of how information is held in a database.

    Headword      Class  Phonetics

    aback         ADV    @-b&k
    abacus        N      &-b@-k@s
    abandon       N      @-b&n-d@n
    abandon       V      @-b&n-d@n
    abandoned     A      @-b&n-d@nd
    abandonment   N      @-b&n-d@n-m@nt
    abase         V      @-beIs
    abasement     N      @-beIs-m@nt
    abash         V      @-b&S
    abate         V      @-beIt

    The crucial difference between a database and a dictionary is the flexibility that a computer can achieve with properly defined rows and columns: you can gather together different parts of the database, and display the information in any way you like. This illustration shows you three vertical columns, which are entitled Headword, Class, and Phonetics, and ten horizontal rows, each of which displays the information for one headword under the correct column heading. A row thus contains every type of information for one word, while a column contains one specific type of information for every headword.
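
    The row-and-column idea can be sketched in a few lines of Python. This is only an illustration of the principle, using the ten rows of the table above; the names COLUMNS, ROWS, column and rows_for are invented here and have nothing to do with how FLEX itself is implemented.

```python
# Each tuple is one row: (Headword, Class, Phonetics), copied from the table above.
COLUMNS = ("Headword", "Class", "Phonetics")
ROWS = [
    ("aback", "ADV", "@-b&k"),
    ("abacus", "N", "&-b@-k@s"),
    ("abandon", "N", "@-b&n-d@n"),
    ("abandon", "V", "@-b&n-d@n"),
    ("abandoned", "A", "@-b&n-d@nd"),
    ("abandonment", "N", "@-b&n-d@n-m@nt"),
    ("abase", "V", "@-beIs"),
    ("abasement", "N", "@-beIs-m@nt"),
    ("abash", "V", "@-b&S"),
    ("abate", "V", "@-beIt"),
]


def column(name):
    """Return one specific type of information for every row."""
    index = COLUMNS.index(name)
    return [row[index] for row in ROWS]


def rows_for(headword):
    """Return every type of information for each row with this headword."""
    return [dict(zip(COLUMNS, row)) for row in ROWS if row[0] == headword]


print(column("Class"))      # a whole column
print(rows_for("abandon"))  # the noun row and the verb row
```

    Reading down gives a column; reading across gives a row. Everything else in this chapter is a matter of choosing which rows and which columns to look at.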

    The illustration is, of course, only a very simple example. To get an idea of what the whole CELEX Dutch, English or German database might look like, imagine three hundred or so more column headings added to the right-hand side, and a hundred thousand or so more rows added at the bottom. This diagram would then represent a small part of the top left corner of an enormous grid packed with lexical information. Experts have calculated that if you printed out the rest of this table in full, you would end up with a piece of paper approximately 5.5 m wide and 2.4 km long, so you could probably walk round it in just under an hour. Using FLEX, which itself uses a database management system to access the information in the grid, you can extract tiny bits of information, or long and detailed lists, just as you please.

    When you create a lexicon, you're essentially creating a little dictionary, designed to your own specifications.
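
    The "just under an hour" estimate can be checked with a little arithmetic. The paper dimensions come from the text; the walking speed of 5 km/h is an assumption made for this sketch, since the guide does not state one.

```python
WIDTH_M = 5.5            # from the text
LENGTH_M = 2400.0        # 2.4 km, from the text
WALKING_SPEED_KMH = 5.0  # assumption: an ordinary walking pace

# Walking "round" the sheet means covering its perimeter.
perimeter_m = 2 * (WIDTH_M + LENGTH_M)
minutes = perimeter_m / 1000 / WALKING_SPEED_KMH * 60

print(f"perimeter: {perimeter_m:.0f} m, walking time: about {minutes:.0f} minutes")
```

    At that pace the walk takes a little under 58 minutes, which agrees with the figure quoted.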

    Unlike a dictionary, a database lets you use keys other than the headword when you look something up. On a simple level, this means you can look up the verb walk instead of the noun walk. On another level, it means that you can get a list of all the verbs in the database, excluding all the other words which are not verbs. The individual printed paragraphs for each word in a dictionary are fixed, but the corresponding rows in a computer database can be moved about and rearranged just as you want them. So, it's possible to create a lexicon like the one illustrated below by using FLEX restrictions. You simply state that you want to see all the words which have the word class code V, and you can then get as much information as you like about the verbs in your list. The example below shows a list of verbs with their pronunciations:

    Headword      Transcription

    abandon       @-b&n-d@n
    abase         @-beIs
    abash         @-b&S
    abate         @-beIt

    Since you've specified that you only want verbs in your list, there's no need to put the word class code column on display. The computer uses it in preparing the list, but you don't have to look at it: you'd just get a list of Vs. The possibilities for creating all sorts of lexicons are seemingly endless. You have hundreds of columns to choose from, most of which contain information you might want to inspect on your screen. Other columns contain information which can be used to control what is shown on your screen or in your file: the word class column in the illustration above, for example, or the Inflectional features columns under the morphology of Dutch wordforms, which simply say yes when a wordform does have a particular inflectional feature, and no when it doesn't. A screen display of those columns isn't particularly interesting, but a file of wordforms created using the information they contain may well be very interesting.
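
    The difference between a column used to control a list and a column put on display can be sketched as follows. This is a minimal illustration in Python, not FLEX's own mechanism; make_lexicon and its parameters are hypothetical names, and the rows are the ten from the first illustration.

```python
COLUMNS = ("Headword", "Class", "Phonetics")
DATA = [
    ("aback", "ADV", "@-b&k"), ("abacus", "N", "&-b@-k@s"),
    ("abandon", "N", "@-b&n-d@n"), ("abandon", "V", "@-b&n-d@n"),
    ("abandoned", "A", "@-b&n-d@nd"), ("abandonment", "N", "@-b&n-d@n-m@nt"),
    ("abase", "V", "@-beIs"), ("abasement", "N", "@-beIs-m@nt"),
    ("abash", "V", "@-b&S"), ("abate", "V", "@-beIt"),
]
ROWS = [dict(zip(COLUMNS, values)) for values in DATA]


def make_lexicon(rows, restriction, show):
    """Keep the rows that pass `restriction`, then display only the `show` columns."""
    return [tuple(row[name] for name in show) for row in rows if restriction(row)]


# The Class column controls the selection but is not itself displayed.
verbs = make_lexicon(
    ROWS,
    restriction=lambda row: row["Class"] == "V",
    show=("Headword", "Phonetics"),
)
for headword, transcription in verbs:
    print(f"{headword:<12}{transcription}")
```

    The output reproduces the verb list above: four rows, two columns, and no column of Vs.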

    And there are still more possibilities. If you want to build up a lexicon which contains words with, say, certain phonetic features in common, then FLEX lets you do it with the help of the pattern matcher. For example, you might want to see the words which contain, in a non-initial position, syllables beginning with a dental plosive or dental fricative. The required pattern is @*-[tdTD]@*, which, when applied to a syllabified phonetic transcription column, tells FLEX to find transcriptions which consist of zero or more characters of any sort ( @* ), followed by a syllable marker ( - ), followed by one of the dental phonemes t, d, T or D, followed by zero or more characters of any sort. The resulting lexicon would start off something like this:

    Headword      Class  Phonetics

    abandon       N      @-b&n-d@n
    abandon       V      @-b&n-d@n
    abandoned     A      @-b&n-d@nd
    abandonment   N      @-b&n-d@n-m@nt
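
    For readers who know regular expressions, the pattern above can be approximated in Python. The sketch assumes, from the description in the text, that @* in a FLEX pattern means "zero or more characters of any sort" and that a bracketed set such as [tdTD] works as it does in a regular expression; it handles only those two constructs, and flex_pattern_to_regex is a hypothetical helper, not part of FLEX.

```python
import re

DATA = [
    ("aback", "ADV", "@-b&k"), ("abacus", "N", "&-b@-k@s"),
    ("abandon", "N", "@-b&n-d@n"), ("abandon", "V", "@-b&n-d@n"),
    ("abandoned", "A", "@-b&n-d@nd"), ("abandonment", "N", "@-b&n-d@n-m@nt"),
    ("abase", "V", "@-beIs"), ("abasement", "N", "@-beIs-m@nt"),
    ("abash", "V", "@-b&S"), ("abate", "V", "@-beIt"),
]


def flex_pattern_to_regex(pattern):
    """Rewrite '@*' as '.*'; everything else is passed through unchanged (assumption)."""
    return re.compile(pattern.replace("@*", ".*"))


regex = flex_pattern_to_regex("@*-[tdTD]@*")

# The whole transcription must fit the pattern, hence fullmatch.
matches = [row for row in DATA if regex.fullmatch(row[2])]
for headword, word_class, phonetics in matches:
    print(f"{headword:<14}{word_class:<7}{phonetics}")
```

    Because the dental must follow a syllable marker, a word such as abate (@-beIt), whose only dental closes the final syllable, is not selected.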

    Compared to a dictionary, then, a lexicon-based database system like FLEX has significant advantages for the linguistic researcher. You have a great store of linguistic detail available, and the means to tailor and craft it according to the research you have to do, rather than being limited to the inflexible format of a dictionary. And while for a beginner the prospect of learning to use FLEX may at first seem daunting, the alternative (page-by-page inspection of a dictionary) should be enough to convince anyone who doubts its usefulness.

    The next section describes the differences between the various types of lexicon you can develop within FLEX. The columns that you can add to your lexicons are described in the three Linguistic Guides. Read them carefully as you plan the construction of your personal lexicons.


    2 LEXICON TYPES

    When you work with FLEX, you create your own lexicons. The database which FLEX accesses is enormous, and you only ever see a tiny fraction of it on your screen at any one time. Lexicons allow you to narrow down the information you get on screen or in a file, so that you have a manageable view of the parts of the database which interest you most.

    CELEX has several databases available for your use, and you can create lexicons using any one of them. When you're asked what type of lexicon you want, you are in fact being asked "from which part of which database would you like your information?" The LEXICON TYPE menu is the menu screen that lets you choose:

    LEXICON TYPE

    Dutch lemmas
    Dutch wordforms
    Dutch abbreviations
    Dutch INL corpus types
    English lemmas
    English wordforms
    English COBUILD corpus types
    German lemmas
    German wordforms
    German Mannheim corpus types

    The most basic choice is, obviously, whether you want information on Dutch, English or German. After that, the choice you make depends on the work you want to do. For now, remember that in this context the terms lemma, wordform, abbreviation and corpus type refer to the database equivalent of a bold-type dictionary entry. Each database, and thus each of your lexicons, holds particular sorts of entries. So, if you choose a lemma lexicon, it is as if you are using a dictionary where every entry represents a full inflectional paradigm, making a lemma lexicon the closest thing to a normal dictionary that CELEX offers. If you choose a wordform lexicon, on the other hand, the dictionary entries are the inflectional forms themselves: every entry (or row) deals specifically with one inflected form, something which conventional dictionaries never do. In effect you have a dictionary which contains all the words which are used in natural language. Naturally enough, an abbreviations lexicon is like a dictionary of abbreviations: the entry is always an abbreviated form of some sort. And a corpus types lexicon contains rows specific to each distinct item in one of the text corpora used to extract lexical features for each of the languages. Read on to discover more about each lexicon type. Once you've read and understood the descriptions, you'll be able to choose the lexicon type most appropriate to whatever task you have to carry out.

    2.1 DUTCH LEMMAS

    When you look up a word in a dictionary, you don't always find the exact word you want. Quite often you come across a shorter version in bold type, which represents the particular word you had in mind, as well as various other forms which you know intuitively belong to the same word. Thus when you're interested in a word like loopt, you know that you will find all the information you want under the bold-type entry for the verb lopen. These bold-type words in dictionaries are called headwords or canonical forms, since they represent what can be called the full canon or paradigm of inflections: lopen is the headword which stands for the wordforms loop, loopt, lopen, liep, liepen, gelopen, lopend, lopende, lopenden, and (occasionally) lope.

    Because headword forms have become so firmly established, people often presume that there must be something linguistically special about them. Usually, the shortest form in the inflectional paradigm is the headword, but not always: for verbs, Dutch dictionaries use the present tense plural form (which is also the infinitive), even though the present tense first person singular is shorter: consider openen as against open. Many linguists in fact prefer to use the shorter form as the canonical form in their work, because all the other forms can be made from this basic form by adding inflectional affixes (though this is putting it very simply, of course). So, what form is used as the canonical form in the CELEX databases?


    As far as CELEX is concerned, a lemma is an abstract way of representing a whole inflectional paradigm. The dictionary headword, as described above, is one form a lemma can take to represent a word in all its inflected forms. It is possible (but probably not very helpful for humans) to signify the word by some completely different word, or even a number; anything will do, so long as it is understood to represent the whole inflectional paradigm. A lemma is that underlying form; it doesn't really exist, except for use in databases and dictionaries. It looks like a real word, but in fact it's just a convenient way of expressing something bigger.

    Since the lemma is an abstract notion, we need now to identify the more concrete forms it can take. Two are used in the databases, and you can choose for yourself which one you use. First, there is the headword, which corresponds exactly to the traditional lexicographic headword used in dictionaries. And second, there is the stem, the form which most linguists prefer. Since the forms headwords and stems take are often assumed rather than explicitly stated, Table 1 defines what headwords and stems look like in the CELEX Dutch database. It holds true for just about every lemma; there are very few exceptions.

    Table 1 shows that CELEX headwords and stems are very similar to the traditional lexicographic forms which are normally used in dictionaries. The major exception is the stem of a verb. One other important feature of stems is the possibility of using so-called abstract stems, which some linguists like to use in certain circumstances, again for reasons concerning the formation of inflected forms. These forms are dealt with in the Dutch Linguistic Guide.

    Word Class          Headword                          Stem (where different from
                                                          the headword)

    Noun                Nominative singular, except for   As for the headword, except
                        pluralia tantum, which use the    that pluralia tantum are
                        nominative plural. Diminutive     given a nominative singular
                        forms are not treated as          form.
                        separate lemmas.

    Adjective           The shortest positive form.

    Quantifier/Numeral  For a number, the cardinal and
                        ordinal forms are two separate
                        lemmas, and are used as the two
                        headwords. For quantifiers, the
                        shortest form is used.

    Verb                Infinitive.                       First person singular
                                                          present tense form
                                                          (non-separated form as used
                                                          in relative clauses).

    Determiner          Nominative form.

    Pronoun             Shortest form.

    Adverb              Shortest form.

    Preposition         The only form.

    Conjunction         The only form.

    Interjection        The only form.

    Table 1: CELEX canonical forms for Dutch

    There is still one major difference between dictionary entries and CELEX lemmas, however: CELEX lemmas are never distinguished solely on the basis of meaning. In a dictionary, there might be two entries for the noun bank, one explaining that it means a sofa, the other that it means a financial institution. In the CELEX database, there is only one lemma for the noun bank, and thus it gets only one row in the database (which corresponds to an entry or sub-paragraph in a dictionary). On what basis, then, does CELEX differentiate between lemmas? There are six possible criteria. If two potential lemmas are the same on all six points, then they are taken to be one single lemma. This remains true even if the two words differ in meaning. If, however, they differ on any one criterion, and differ in meaning, then they are treated as two separate lemmas. The six distinguishing criteria are as follows:

    1. Orthography of the wordforms. The adjectives rauw and rouw are two different lemmas because they are spelt differently and have a different meaning.

    2. Syntactic class. The noun wit and the adjective wit are different lemmas because they each have a different word class. Sometimes the difference in word class is itself the only way a difference in meaning is indicated.

    3. Gender. The noun de pas (meaning the pace) and the noun het pas (meaning a spirit level) are different lemmas because they differ in gender and also in meaning.

    4. Inflectional paradigm. The verb malen (meaning to crush) and the verb malen (to be delirious) are two different lemmas because the first has the past participle gemalen, while the second has the past participle gemaald, and they differ in meaning.

    5. Morphological structure. The noun koker (someone who cooks: kook + er) and the noun koker (a cylindrical object; a monomorphemic word) are two distinct lemmas because they differ in their derivational morphological structure and their meaning.

    6. Pronunciation of the wordforms. The noun kip (meaning chicken) and the noun kip (meaning the act of dumping) would be different lemmas because in standard Dutch the first is pronounced [ k p ] and the second is pronounced [ ki p ], and they differ in meaning.
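
    The decision rule behind the six criteria can be written down compactly. The following Python sketch is one reading of the rule as stated above, not CELEX's actual procedure; the Candidate fields and the function separate_lemmas are invented for illustration, and the field values marked as placeholders are not real database content. The text does not say what happens when two candidates differ on a criterion but not in meaning, so the sketch returns None for that case rather than guessing.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    """One potential lemma, described by the six distinguishing criteria."""
    orthography: str
    word_class: str
    gender: Optional[str]
    paradigm: str        # here represented by the past participle or a placeholder
    morphology: str
    pronunciation: str   # placeholder values only


def separate_lemmas(a, b, differ_in_meaning):
    """Return True for two lemmas, False for one, None where the text is silent."""
    if a == b:
        return False          # same on all six points: one lemma, whatever the meaning
    if differ_in_meaning:
        return True           # differ on a criterion and in meaning: two lemmas
    return None               # not covered by the rule as stated


# bank (sofa) and bank (financial institution): identical on all six criteria.
bank = Candidate("bank", "N", "de", "placeholder", "placeholder", "placeholder")
print(separate_lemmas(bank, bank, differ_in_meaning=True))

# malen (to crush) and malen (to be delirious): different past participles.
malen_crush = Candidate("malen", "V", None, "gemalen", "placeholder", "placeholder")
malen_rave = Candidate("malen", "V", None, "gemaald", "placeholder", "placeholder")
print(separate_lemmas(malen_crush, malen_rave, differ_in_meaning=True))
```

    The point to notice is that meaning never decides the matter on its own: it only counts once at least one of the six criteria already separates the candidates.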

    It should be clear by now that a lemma is a notional representation of an inflectional paradigm, and that the forms CELEX gives to a lemma are headwords and stems. These forms are convenient representations only, which exist to make life easier for dictionary users and computer lexicon builders. A lemma lexicon contains general information about an inflectional paradigm, similar to the way an ordinary dictionary does. Within a lemma lexicon, the lemmas are given as stems or headwords, and you can choose either form when you make your lexicon.

    For lemmas, there is orthographic, phonetic, morphological, syntactic and frequency information available. In the appendices you can find diagrams which give an overview of the lemma columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of lemmas currently available in the CELEX databases.

    2.2 DUTCH WORDFORMS

    Wordforms can be thought of as real words: we use them every day, in speech and in writing. You are at this moment reading English wordforms. Even the shortest forms in an inflectional paradigm are wordforms (as opposed to lemmas) simply because they are working parts in the language. When you use a wordforms lexicon, it is as if you are looking in a dictionary which lists every possible word, instead of abstract forms which represent particular sets of words. For this reason, while a lexicon of type lemma only yields kat, a wordforms lexicon gives you all the occurring forms of the lemma: for example, kat, katje and katten are all wordforms.

    Sometimes individual wordforms (in this case verbs) can be split into two distinct parts, depending on the way the sentence is formed. Both the whole form and the separated form are included in the wordforms information. For example, bel op (ik bel jou op) and opbel (als ik jou opbel...) can both be included in a wordforms lexicon.

    Information about each wordform's lemma is supplied too, so that this lexicon type also covers all the information a normal lemma lexicon can contain. You can include such information by going to the morphology section of the ADD COLUMNS menus, where you can choose to include Stem information and/or Inflectional features.

    For wordforms, there is orthographic, phonetic, morphological, and frequency information available. You can also use all the information relating to the lemma that each wordform belongs to. In the appendices you can find diagrams which give an overview of the wordform columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of wordforms currently available in the CELEX databases.

    2.3 DUTCH ABBREVIATIONS

    Abbreviations are shortened forms of words or names. For example, gem. is a shortened form of gemiddeld and gemeubileerd. Other abbreviations are composed of the first letter from each word in a name: BBC is thus an abbreviation for British Broadcasting Corporation. Many such abbreviations have two spellings, for example one with and one without dots.

    The abbreviations given are drawn from the Van Dale Groot Woordenboek van Hedendaags Nederlands and the sizeable text corpus of the INL (Instituut voor Nederlandse Lexicologie, the Institute for Dutch Lexicology, based in Leiden).

    For abbreviations, there is orthographic and frequency information available. In the appendices you can find diagrams which give an overview of the abbreviation columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of abbreviations currently available in the CELEX databases.

    2.4 DUTCH INL CORPUS TYPES

    INL tokens are strings in the large INL text corpus of modern Dutch, and here a string can be taken to mean at least one alphabetic character in series with zero or more other alphanumeric characters, delimited at either end by a space. (So, for example, zes is a token, and so is 6de, but 6 by itself is not, because it does not contain at least one alphabetic character. This applies to all numerals.) INL corpus types are distinct tokens; that is, not a list of the many million tokens, but a representative list that includes each separate token which occurs in the corpus exactly once.

    In fact, the criteria for inclusion in the type list can be more closely defined. The INL corpus is made up of many different contemporary texts, or, looking at it in another way, several million tokens. Included in the CELEX INL corpus type list, then, are all the types which occur in at least two different corpus texts.
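
    The definitions of token and type, and the two-text condition, can be illustrated with a small Python sketch. It follows one reading of the description above (split on spaces, keep purely alphanumeric strings that contain at least one letter); the three miniature "texts" are made up for the example and are not taken from the INL corpus, and the function names are hypothetical.

```python
def tokens(text):
    """Space-delimited alphanumeric strings containing at least one letter."""
    return [s for s in text.split()
            if s.isalnum() and any(ch.isalpha() for ch in s)]


def corpus_types(texts, min_texts=2):
    """Distinct tokens that occur in at least `min_texts` different texts."""
    found_in = {}
    for number, text in enumerate(texts):
        for token in set(tokens(text)):
            found_in.setdefault(token, set()).add(number)
    return sorted(t for t, where in found_in.items() if len(where) >= min_texts)


# Invented sample texts, for illustration only.
texts = [
    "zes katten en 6 honden",
    "de 6de kat",
    "zes honden en de 6de hond",
]

print(tokens("zes 6de 6"))   # 6 on its own is not a token
print(corpus_types(texts))   # types found in two or more of the texts
```

    In this toy corpus katten, kat and hond are tokens, but each occurs in only one text, so none of them reaches the type list.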

    Corpus types complement the lemma and wordform information; it's safe to say that amongst them, you can find almost every item which occurs in written text. Unlike the dictionary-style lemma and wordform lexicons, no syntactic, morphological, or phonetic information is available. What you do have is a database of real-life words, distinguished on the basis of their orthography, with detailed information on their frequency.

    In the appendices there are diagrams which give an overview of the corpus type columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of types currently available in the CELEX databases, and the size and contents of the INL corpus.

    2.5 ENGLISH LEMMAS

    When you look up an English word in a dictionary, you don't always find the particular form you want. Instead, you come across a shorter version in bold type, which represents the particular form you had in mind, along with various other similar forms which you intuitively know belong to the same word. So when you're interested in a word like walking, you know that you can find lots of information about it under the bold-type entry for the verb walk. These bold-type words in dictionaries are called headwords or canonical forms, since they represent what can be called the full canon or paradigm of inflections: walk is the headword which stands for the wordforms walk, walks, walking and walked.

    As far as CELEX is concerned, such a form is a lemma, an abstract way of representing a whole inflectional paradigm. The dictionary headword, as described above, is one form a lemma can take to represent a word in all its inflected forms. It is possible (but probably not very helpful for humans) to signify the word by some other word, or even a number; anything will do, so long as it is understood to represent the whole inflectional paradigm. A lemma is that underlying form; it doesn't really exist, except for use in databases and dictionaries. It looks like a real word, but in fact it's just a convenient way of expressing something bigger.

    In an English lemma lexicon, the lemma is given in the form of the traditional lexicographic headword. This is in contrast to Dutch lemma lexicons, where the lemma can take the form either of the traditional headword or of a stem, which is a form more suitable for most linguistic research. No such complications apply to English, however: the underlying lemma always becomes the traditional headword when it comes to the surface. Table 2 sets out exactly which form that is for each lemma. It is almost always accurate; there are only a few exceptions, such as the verb to be, which is given as be in accordance with long-standing tradition.

    Word Class          Headword

    Noun                The singular form, except for pluralia tantum,
                        which use the plural form.

    Adjective           The positive form.

    Quantifier/Numeral  For a number, the cardinal and ordinal forms are
                        two separate lemmas, and are used as the two
                        headwords. For quantifiers, the only form is used.

    Verb                First person singular present tense form.

    Pronoun             The only form.

    Adverb              The positive form.

    Preposition         The only form.

    Conjunction         The only form.

    Interjection        The only form.

    Table 2: Canonical forms for English lemmas

    There is one major difference between dictionary entries and CELEX English lemmas, however: CELEX lemmas are never distinguished solely on the basis of meaning. In a dictionary, there might be two entries for the noun bank, one explaining that it means the land at the side of a river, the other that it means a financial institution. In the CELEX database, there is only one lemma for the noun bank, and thus it gets only one row in the database (which corresponds to an entry or sub-paragraph in a dictionary). On what basis, then, does CELEX differentiate between lemmas? There are five possible criteria. If two potential lemmas are the same on all five points, then they are considered as belonging to one lemma. This remains true even if the two words differ in meaning. If, however, they differ on any one criterion, and differ in meaning, then they are treated as two separate lemmas. The five distinguishing criteria are as follows:

    1. Orthography of the wordforms. The nouns peek and peak are two different lemmas because they are spelt differently and have a different meaning.

    2. Syntactic class. The adjective meet and the adverb meet are different lemmas because they each have a different word class and a different meaning. Sometimes the difference in word class is itself the only way a difference in meaning is indicated, as with words like water (verb) and water (noun).

    3. Inflectional paradigm. The noun antenna (meaning radio aerial) and the noun antenna (an anatomical feature of some insects) are two different lemmas because the first has the plural antennas, while the second has the plural antennae, and they differ in meaning.

    4. Morphological structure. The noun rubber (someone or something that rubs: rub + er) and the noun rubber (the elastic substance; a monomorphemic word) are two distinct lemmas because they differ in their derivational morphological structure and their meaning.

    5. Pronunciation of the wordforms. The verb recount (meaning count again) and the verb recount (meaning to tell a tale) would be different lemmas because the first is pronounced [ ri -ka nt ] and the second is pronounced [ r - ka nt], and they differ in meaning.

    For lemmas, there is orthographic, phonetic, morphological, syntactic and frequency information available. In the appendices you can find diagrams which give an overview of the lemma columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of lemmas currently available in the CELEX databases, and the sources from which the information derives.

    2.6 ENGLISH WORDFORMS

    Wordforms can be thought of as real words: we use them every day, in speech and in writing. You are at this moment reading English wordforms. Even the shortest forms in an inflectional paradigm are wordforms (as opposed to lemmas) simply because they are working parts in the language. When you use a wordforms lexicon, it is as if you're looking in a dictionary which lists every possible word, instead of abstract forms which represent particular sets of words. For this reason, while a lexicon of type lemma only yields dog, a wordforms lexicon gives you all the occurring forms of the lemma: for example, both dog and dogs are wordforms.
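
    The relationship between the two lexicon types can be pictured as a wordform table in which every row also names its lemma. The Python sketch below uses only the examples given in this guide (dog and walk); the table layout and the function names are assumptions made for illustration, not the structure of the CELEX databases.

```python
from collections import defaultdict

# Each row: (wordform, lemma). Examples taken from the text of this guide.
WORDFORM_ROWS = [
    ("dog", "dog"), ("dogs", "dog"),
    ("walk", "walk"), ("walks", "walk"),
    ("walking", "walk"), ("walked", "walk"),
]


def lemma_lexicon(rows):
    """One entry per inflectional paradigm."""
    return sorted({lemma for _, lemma in rows})


def wordforms_by_lemma(rows):
    """Every wordform, grouped under the lemma it belongs to."""
    grouped = defaultdict(list)
    for wordform, lemma in rows:
        grouped[lemma].append(wordform)
    return dict(grouped)


print(lemma_lexicon(WORDFORM_ROWS))
print(wordforms_by_lemma(WORDFORM_ROWS))
```

    A lemma lexicon shows the first view, a wordforms lexicon the second; because each wordform row points to its lemma, lemma information can be pulled into a wordforms lexicon as well.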


    Information about each wordform's lemma is supplied too, so that this lexicon type also covers all the information a normal lemma lexicon can contain. You can include such information by going to the morphology section of the ADD COLUMNS menus, where you can choose to include Lemma information and/or Inflectional features.
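
    As a rough illustration of how a wordforms lexicon can carry its lemma's information along, the sketch below joins two tiny tables in Python. The table layout, the column names (wordform, lemma_id, features, headword, word_class) and the function with_lemma_info are all hypothetical and are not the actual CELEX columns; only the dog/dogs example comes from the text above.

```python
# Hypothetical miniature lexicons; the column names are illustrative only
# and do not correspond to real CELEX column names.
lemmas = {
    1: {"headword": "dog", "word_class": "noun"},
}

wordforms = [
    {"wordform": "dog", "lemma_id": 1, "features": "singular"},
    {"wordform": "dogs", "lemma_id": 1, "features": "plural"},
]


def with_lemma_info(rows, lemma_table):
    """Return a copy of each wordform row with its lemma's columns attached."""
    return [{**row, **lemma_table[row["lemma_id"]]} for row in rows]


for row in with_lemma_info(wordforms, lemmas):
    print(row["wordform"], row["headword"], row["word_class"], row["features"])
```

    Each wordform row ends up with both its own inflectional features and the columns of the lemma it belongs to, which is the effect of adding Lemma information to a wordforms lexicon.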

    For wordforms, there is orthographic, phonetic, morphological and frequency information available. You can also use all the information relating to the lemma that each wordform belongs to. In the appendices you can find diagrams which give an overview of the wordform columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of wordforms currently available in the CELEX databases, and the sources from which the information derives.

    2.7 ENGLISH COBUILD CORPUS TYPES

    COBUILD tokens are strings in the large COBUILD text corpus of modern English, and here a string can be taken to mean at least one alphabetic character in series with zero or more other alphanumeric characters, delimited at either end by a space. (So, for example, six is a token, and so is 6th, but 6 by itself is not, because it does not contain at least one alphabetic character. This applies to all numerals.) COBUILD corpus types are distinct tokens; that is, not a list of the many million tokens, but a representative list that includes each separate token which occurs in the corpus exactly once.
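
    The token and type definitions above can be sketched in a few lines of Python. This is only an approximation under stated assumptions: the function names is_token, tokens and types are invented here, strings are split on whitespace, types are treated as case-sensitive, and nothing is assumed about how the real corpus handles punctuation.

```python
def is_token(s):
    """True if s is alphanumeric throughout and contains at least one letter."""
    return s.isalnum() and any(ch.isalpha() for ch in s)


def tokens(text):
    """All space-delimited strings in text that satisfy the token definition."""
    return [s for s in text.split() if is_token(s)]


def types(text):
    """The distinct tokens of text, each listed once (case-sensitive here)."""
    return sorted(set(tokens(text)))


sample = "six of the 6 runners finished 6th and six did not"
print(tokens(sample))  # every qualifying string, repeats included
print(types(sample))   # each distinct token once
```

    In the sample, six and 6th count as tokens while the bare numeral 6 does not, and six appears twice among the tokens but only once among the types.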

    Corpus types complement the lemma and wordform information; it's safe to say that amongst them, you can find almost every item which occurs in written text. Unlike the dictionary-style lemma and wordform lexicons, no syntactic, morphological, or phonetic information is available. What you do have is a database of real-life words, distinguished on the basis of their orthography, with detailed information on their frequency.

    In the appendices there are diagrams which give an overview of the corpus type columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of types currently available in the CELEX databases, and the size and contents of the COBUILD corpus.


    Word Class          Lemma
                        Headword                         Stem (where different from
                                                         the headword)

    Noun                Nominative singular, except for  As for the headword, except
                        pluralia tantum, which use the   that pluralia tantum are given
                        nominative plural. Diminutive    a nominative-singular-like
                        forms are not treated as         form.
                        separate lemmas.

    Adjective           The shortest positive form.

    Quantifier/Numeral  For a number, the cardinal and
                        ordinal forms are two separate
                        lemmas, and are used as the two
                        headwords. For quantifiers, the
                        shortest form is used.

    Verb                Infinitive.                      Infinitive without the
                                                         (e)n-ending.

    Article             The only forms are der and ein.  The only forms are der and ein.

    Pronoun             The nominative singular forms.   The shortest form.

    Adverb              Shortest form.

    Preposition         The shortest form.

    Conjunction         The only form.

    Interjection        The only form.

    Table 3: CELEX canonical forms for German


    criteria. If two potential lemmas are the same on all six points, then they are considered as belonging to one lemma. This remains true even if the two words differ in meaning. If, however, they differ on any one criterion, and differ in meaning, then they are treated as two separate lemmas. The six distinguishing criteria are as follows:

    1. Orthography of the wordforms. The nouns fallen and fällen are two different lemmas because they are spelt differently and have a different meaning.

    2. Syntactic class. The adjective anderweitig and the adverb anderweitig are different lemmas because they each have a different word class. Sometimes the difference in word class is itself the only way a difference in meaning is indicated, as with words like ledern (verb) and ledern (adjective).

    3. Inflectional paradigm. The noun Bank (the bank in the park) and the noun Bank (die Deutsche Bank) are two different lemmas because the first has the plural Bänke, while the second has the plural Banken, and they differ in meaning.

    4. Morphological structure. The noun Messer (knife) and the noun Messer (meter or measurer) are two distinct lemmas because they differ in their morphological structure and their meaning.

    5. Pronunciation of the wordforms. The noun Band (meaning a group of people making music) and the noun Band (meaning the relationship between two people) would be different lemmas because the first is pronounced [ bEnt ] and the second is pronounced [ bant ], and they differ in meaning.

    6. Gender of the wordforms. The noun das Tor (the gate) and der Tor (the mad person) will be different lemmas because the gender of the first noun is neuter and the gender of the second one is masculine.
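
    The decision rule built on these six criteria can be sketched as follows. The Candidate class, its field names and the classify function are hypothetical, and the morphology and pronunciation values are opaque placeholders rather than real CELEX codes; only the two plurals of Bank are taken from the text. The rule as stated does not say what happens when two candidates differ on a criterion but share a meaning, so the sketch reports that case as undecided instead of guessing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One potential lemma, described by the six distinguishing criteria."""
    orthography: str
    word_class: str
    paradigm: str       # here simply the plural form
    morphology: str     # opaque placeholder, not a real CELEX code
    pronunciation: str  # opaque placeholder, not a real transcription
    gender: str


def classify(a, b, same_meaning):
    """Apply the rule: identical on all six criteria means one lemma."""
    if a == b:  # dataclass equality compares all six fields
        return "one lemma"  # true even if the meanings differ
    if not same_meaning:
        return "separate lemmas"
    return "undecided by the rule as stated"


bench = Candidate("Bank", "noun", "Bänke", "M1", "P1", "feminine")
finance = Candidate("Bank", "noun", "Banken", "M1", "P1", "feminine")
print(classify(bench, finance, same_meaning=False))
```

    The two Bank candidates differ only in their inflectional paradigm, and since they also differ in meaning they come out as separate lemmas.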

    For lemmas, there is orthographic, phonetic, morphological, syntactic and frequency information available. In the appendices you can find diagrams which give an overview of the lemma columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of lemmas currently available in the CELEX databases, and the sources from which the information derives.


    2.9 GERMAN WORDFORMS

    Wordforms can be thought of as real words: we use them every day, in speech and in writing. You are at this moment reading English wordforms. Even the shortest forms in an inflectional paradigm are wordforms (as opposed to lemmas) simply because they are working parts in the language. When you use a wordforms lexicon, it is as if you're looking in a dictionary which lists every possible word, instead of abstract forms which represent particular sets of words. For this reason, while a lexicon of type lemma only yields Kind, a wordforms lexicon gives you all the occurring forms of the lemma: for example, Kind, Kindes, Kinde, Kinder and Kindern are all wordforms.

    Information about each wordform's lemma is supplied too, so that this lexicon type also covers all the information a normal lemma lexicon can contain. You can include such information by going to the morphology section of the ADD COLUMNS menus, where you can choose to include Lemma information and/or Inflectional features.

    For wordforms, there is orthographic, phonetic, morphological and frequency information available. You can also use all the information relating to the lemma that each wordform belongs to. In the appendices you can find diagrams which give an overview of the wordform columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of wordforms currently available in the CELEX databases, and the sources from which the information derives.

    2.10 GERMAN MANNHEIM CORPUS TYPES

    Mannheim tokens are strings in the Mannheim text corpus of modern German, and here a string can be taken to mean at least one alphabetic character in series with zero or more other alphanumeric characters, delimited at either end by a space. (So, for example, fünfzehn is a token, and so is 15jährige, but 15 by itself is not, because it does not contain at least one alphabetic character. This applies to all numerals.) Mannheim corpus types are distinct tokens; that is, not a list of the many million tokens, but a representative list that includes each separate token which occurs in the corpus exactly once.


    In fact, the criteria for inclusion in the type list can be more closely defined. The Mannheim corpus is made up of many different contemporary texts, or, looking at it another way, several million tokens. Included in the CELEX Mannheim corpus type list, then, are all the types which occur in at least two different corpus texts.
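
    The inclusion criterion amounts to counting, for each type, the number of different texts it occurs in. The Python sketch below shows one way to do this; the names is_token and corpus_types, the min_texts parameter and the three toy texts are invented for illustration, and whitespace splitting stands in for whatever tokenisation the real corpus used.

```python
from collections import Counter


def is_token(s):
    """True if s is alphanumeric throughout and contains at least one letter."""
    return s.isalnum() and any(ch.isalpha() for ch in s)


def corpus_types(texts, min_texts=2):
    """Return the types that occur in at least min_texts different texts."""
    text_counts = Counter()
    for text in texts:
        # set() makes a type count once per text, however often it occurs there
        text_counts.update({s for s in text.split() if is_token(s)})
    return sorted(t for t, n in text_counts.items() if n >= min_texts)


toy_texts = [
    "fünfzehn Kinder",
    "die Kinder und 15 Hunde",
    "fünfzehn Hunde Hunde",
]
print(corpus_types(toy_texts))
```

    Here die and und are dropped because each occurs in only one text, and 15 is never a token in the first place; repeating Hunde within a single text does not raise its count.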

    Corpus types complement the lemma and wordform information; it's safe to say that amongst them, you can find almost every item which occurs in written text. Unlike the dictionary-style lemma and wordform lexicons, no syntactic, morphological, or phonetic information is available. What you do have is a database of real-life words, distinguished on the basis of their orthography, with detailed information on their frequency.

    In the appendices there are diagrams which give an overview of the corpus type columns that are described in detail in the Linguistic Guides, as well as some basic information about the number of types currently available in the CELEX databases, and the size and contents of the Mannheim corpus.