Computational Lexicology, Morphology and Syntax

Diana Trandabăţ

Course 2

Academic year 2017-2018

Introduction

• Since the 1990s, the corpus methodology has revolutionized nearly all branches of linguistics

– Corpus analysis can be illuminating in “virtually all branches of linguistics or language learning.” (Leech 1997)

• One of the strengths of corpus data lies in its empirical and attested nature

– … pools together the intuitions of a great number of speakers

– … makes linguistic analysis more objective

• Corpus linguistics

– …introduces the theoretical and practical issues of using corpora in linguistic studies

– …explores how the corpus-based approach and other methodologies can be combined in linguistic studies

What is a corpus?

• The word corpus comes from Latin (“body”) and the plural is corpora

• A corpus is a body of naturally occurring language

– …but rarely a random collection of text

– Corpora “are generally assembled with particular purposes in mind, and are often assembled to be (informally speaking) representative of some language or text type.” (Leech 1992)

• “A corpus is a collection of (1) machine-readable (2) authentic texts (including transcripts of spoken data) which is (3) sampled to be (4) representative of a particular language or language variety.” (MXT 2006: 5)

What is not a corpus?

• A list of words is not a corpus

– Building blocks of language

• A text archive is not a corpus

– A random collection of texts

• A collection of citations is not a corpus

– A short quotation which contains a word or phrase that is the reason for its selection

• A collection of quotations is not a corpus

– A short selection from a text chosen on internal criteria by human beings

• A text is not a corpus

– Intended to be read in different ways

• The Web is not a corpus

– Its dimensions are unknown and constantly changing, and it was not designed from a linguistic perspective

Sinclair (2005)

What is a corpus for?

• A corpus is made for the study of language in a broad sense

– To test existing linguistic theory and hypotheses

– To generate and verify new linguistic hypotheses

– Beyond linguistics, to provide textual evidence in text-based humanities and social sciences subjects

• The purpose is reflected in a well-designed corpus

Why use corpora?

• Even expert speakers have only a partial knowledge of a language

– A corpus can be more comprehensive and balanced

• Even expert speakers tend to notice the unusual and think of what is possible

– A corpus can show us what is common and typical

• Even expert speakers cannot quantify their knowledge of language

– A corpus can readily give us accurate statistics

Why use corpora?

• Even expert speakers cannot remember everything they know

– A corpus can store and recall all the information that has been stored in it

• Even expert speakers cannot make up natural examples

– A corpus can provide us with a vast number of examples in real communicative contexts

• Even expert speakers have prejudices and preferences, and every language has cultural connotations and underlying ideology

– A corpus can give you more objective evidence

Why use corpora?

• Even expert speakers are not always available to be consulted

– A corpus can be made permanently accessible to all

• Even expert speakers cannot keep up with language change

– A constantly updated corpus can reflect even recent changes in the language

• Even expert speakers lack authority: they can be challenged by other expert speakers

– A corpus can encompass the actual language use of many expert speakers

Intuitions as an alternative

• Intuitions are always useful in linguistics

– To invent (grammatical, ungrammatical, or questionable) example sentences for linguistic analysis

– To make judgments about the acceptability / grammaticality or meaning of an expression

– To help with categorization

Intuitions as an alternative

• Intuitions should be applied with caution

– Possibly biased as they are likely to be influenced by one’s dialect or sociolect

– Introspective data is artificial and may not represent typical language use as one is consciously monitoring one’s language production

– Introspective data is decontextualized because it exists in the analyst’s mind rather than in any real linguistic context

– Intuitions are not observable and verifiable by everyone as corpora are

– Excessive reliance on intuitions blinds the analyst to the realities of language usage because we tend to notice the unusual but overlook the commonplace

– There are areas in linguistics where intuitions cannot be used reliably e.g. language variation, historical linguistics, register and style, first and second language acquisition

– Human beings have only the vaguest notion of the frequency of a construct or a word

Benefits of corpus data

• Corpus data is more reliable

– A corpus pools together linguistic intuitions of a range of language speakers, which offsets the potential biases in intuitions of individual speakers

• Corpus data is more natural

– It is used in real communications instead of being invented specifically for linguistic analysis

• Corpus data is contextualized

– Attested language use which has already occurred in real linguistic context

• Corpus data is quantitative

– Corpora can provide frequencies and statistics readily

• Corpus data can find differences that intuitions alone cannot perceive

– E.g. synonyms totally, absolutely, utterly, completely, entirely

Corpora vs. intuitions

• Corpora and intuitions are not necessarily antagonistic; they corroborate each other and can be gainfully viewed as complementary

– Armchair linguists and corpus linguists “need each other. Or better, […] the two kinds of linguists, wherever possible, should exist in the same body.” (Fillmore 1992)

– “Neither the corpus linguist of the 1950s, who rejected intuitions, nor the general linguist of the 1960s, who rejected corpus data, was able to achieve the interaction of data coverage and the insight that characterize the many successful corpus analyses of recent years.” (Leech 1991)

• The key to using corpus data is to find the balance between the use of corpus data and the use of one’s intuitions

The corpus methodology

• It is debatable whether corpus linguistics is a methodology or a branch of linguistics

– Corpus linguistics goes well beyond this methodological role and has become an independent discipline

• In spite of the name, corpus linguistics is indeed a methodology rather than an independent branch of linguistics in the same sense as phonetics, syntax, semantics or pragmatics

– These latter areas of linguistics describe, or explain, a certain aspect of language use

– Corpus linguistics, in contrast, is not restricted to a particular aspect of language - it can be employed to explore almost any area of linguistic research

A brief history of Corpus Linguistics

• The term corpus linguistics first appeared only in the early 1980s, but corpus-based language study has a substantial history

• The history of CL can be split into two periods: before and after Chomsky

A brief history of CL

• Before Chomsky

– Field linguists and linguists of the structuralist tradition used “shoebox corpora” – shoeboxes filled with paper slips

• Their methodology was essentially “corpus-based” in the sense that it was empirical and based on observed data

– The work of early corpus linguistics was underpinned by two fundamental, yet flawed assumptions

• The sentences of a natural language are finite.

• The sentences of a natural language can be collected and enumerated.

– Most linguists saw the “corpus” as the only source of linguistic evidence in the formation of linguistic theories

A brief history of CL

• Chomsky revolution: Between 1957 and 1965 Chomsky changed the direction of linguistics from empiricism towards rationalism

– “Any natural corpus will be skewed. Some sentences won’t occur because they are obvious, others because they are false, still others because they are impolite. The corpus, if natural, will be so wildly skewed that the description would be no more than a mere list.” (Chomsky 1962)

– Our internal knowledge of language in the human brain (competence, I-language) replaces observed data (performance, E-language)

– Intuitions started to be relied on as evidence

• [Xiao, R. (2008) “Theory-driven corpus research: using corpora to inform aspect theory”. In A. Lüdeling & M. Kyto (eds.) Corpus Linguistics: An International Handbook. Berlin: Mouton de Gruyter]

A brief history of CL

• Revival of CL

– Corpus research was continued in a few centres (Brown, Lancaster) in the 60s-70s

• The Brown University Standard Corpus of Present-day American English (Brown corpus)

• Lancaster-Oslo-Bergen Corpus of BrE (LOB)

– The hardware still imposed some restrictions until the real development started in the 1980s

• The marriage of corpora with computer technology rekindled interest in the corpus methodology

• Since then, the number and size of corpora and corpus-based studies have increased dramatically

– Nowadays, the corpus methodology enjoys widespread popularity, and has opened up or foregrounded many new areas of research

Areas that have used corpora

• Lexicography

• Lexical studies

• Grammatical studies

• Register/genre analysis

• Language variation

• Contrastive analysis

• Translation studies

• Language change

• Language teaching

• Semantics

• Pragmatics

• Stylistics

• Literary study

• Sociolinguistics

• Discourse analysis

• Forensic linguistics

• Computational linguistics

• …

Why use computers?

• Development of computer technology has revived CL

• Machine-readability is a de facto attribute of modern corpora

• Electronic corpora have advantages unavailable to their “shoebox” ancestors

– It is the use of computerized corpora, together with computer programs which facilitate linguistic analysis, that distinguishes modern electronic corpora from early ‘drawer-cum-slip’ corpora

Why use computers?

• Computerized corpora can be processed and manipulated rapidly at minimal cost

– E.g. searching, selecting, sorting and formatting

• Computers can process machine-readable data accurately and consistently

• Computers can avoid human bias in an analysis, thus making the result more reliable

• Machine-readability allows further automatic processing to be performed on the corpus so that corpus texts can be enriched with various metadata and linguistic analyses

– Corpus markup and corpus annotation

A question for Deep Thought

“Alright,” said the computer Deep Thought. “The Answer to the Great Question...”

“Yes...!”

“Of Life, the Universe and Everything ...” said Deep Thought.

“Yes...!”

“Is...”

“Yes...!!!...?”

“Forty-two,” said Deep Thought, with infinite majesty and calm.

It was a long time before anyone spoke.

“Forty-two!” yelled someone in the audience. “Is that all you’ve got to show for seven and a half million years’ work?”

“I checked it very thoroughly,” said the computer, “and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you’ve never actually known what the question is.”

The Hitchhiker’s Guide to the Galaxy by Douglas Adams

What can we learn from this story?

What corpora cannot do

• Corpora do not provide negative evidence

– Cannot tell us what is possible or not possible

– Can show what is central and typical in language

• Corpora can yield findings but rarely provide explanations for what is observed

– Interfacing other methodologies

• The use of corpora as a methodology also defines the boundaries of any given study

– Importance of amenable research questions

• The findings based on a particular corpus only tell us what is true in that corpus

– Generalisation vs. representativeness

Ask corpora the right questions

• Corpus linguistics as a methodology is only one of the (many) ways of doing things

– “doing linguistics”

• The usefulness of corpora depends upon the research question being investigated

– “They are invaluable for doing what they do, and what they do not do must be done in another way.” (Hunston 2002: 20)

• The development of the corpus-based approach as a tool in language studies has been compared to the invention of telescopes in astronomy

– If it is ridiculous to criticize a telescope for not being a microscope, it is equally pointless to criticize the corpus-based approach for not doing what it is not intended to do

• It is up to you to formulate research questions amenable to corpus-based investigation and to decide how to combine corpora with other resources

Corpus revolution in lexicographic and lexical studies

• Lexicographic and lexical studies are the greatest beneficiaries of corpora

• Corpora have “revolutionised” dictionary making and reference publishing

– It is now nearly unheard of for new dictionaries and new editions of old dictionaries published from the 1990s onwards not to claim to be based on corpus data

Why use corpora in dictionary making?

• Machine-readable corpora allow dictionary makers to extract all authentic, typical examples of the usage of a lexical item from a large body of text in a few seconds

• Corpora allow dictionary makers to select entries based on frequency information

• Corpora can readily provide frequency information and collocation information for readers

• Textual (e.g. register, genre and domain) and sociolinguistic (e.g. user gender and age) information encoded in corpora allows lexicographers to give a more accurate description of the usage of a lexical item
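
The frequency-based selection of entries mentioned above can be sketched in a few lines of Python. This is only a minimal illustration, assuming a toy corpus held in a single string and a naive regex tokenizer; the helper names `tokenize` and `frequency_list` and the sample text are invented for this example and do not come from any lexicographic toolkit.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (naive, ASCII letters only)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(text, top_n=None):
    """Return (word, count) pairs sorted by descending frequency."""
    counts = Counter(tokenize(text))
    return counts.most_common(top_n)

# Invented sample text standing in for a real corpus.
toy_corpus = (
    "Strong tea is popular. She made strong tea and strong coffee. "
    "He drives a powerful car."
)
freq = frequency_list(toy_corpus)
for word, count in freq[:5]:
    print(word, count)
```

A real dictionary project would work with a far larger, sampled corpus and would usually count lemmas rather than raw word forms; the sketch only shows where the frequency information comes from.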

Why use corpora in dictionary making?

• Corpus annotations such as part-of-speech tagging and word sense disambiguation also enable a more sensible grouping of polysemous words and homographs

• A “monitor corpus” allows lexicographers to track subtle changes in the meaning and usage of a lexical item so as to keep their dictionaries up to date

• Corpus evidence can complement or refute the intuitions of individual lexicographers, which are not always reliable because of potential biases

Five emphases

• Changes brought about by corpora to dictionaries and other reference books - five “emphases” (Hunston 2002)

– an emphasis on frequency

– an emphasis on collocation and phraseology

– an emphasis on variation

– an emphasis on lexis in grammar

– an emphasis on authenticity

Corpus-based learner dictionaries

• First ‘fully corpus-based’ dictionary

– Collins Cobuild English Dictionary (1987)

• Some corpus-based learner dictionaries

– Longman Dictionary of Contemporary English (3rd edition)

– Oxford Advanced Learner’s Dictionary (OALD, 5th edition)

– Cambridge International Dictionary of English (1st edition)

Corpus classification

• Textual vs. Speech Corpus

• Public vs. Private Corpus

• Particular vs. Reference Corpus

• Particular:

– literature corpus classified by year/domain/author etc.

– Corpus with the language of children, etc.

• Reference:

– Very large, covers all relevant language varieties and the common vocabulary of a language.

– Is usually hierarchically structured in sub-corpora

– Usually built by specialised linguistic institutions

Corpus classification

• Diachronic corpus (language in its evolution)

• Monolingual vs. Multilingual Corpus

• Parallel vs. Comparable corpus

Collocation

• Collocation is among the linguistic concepts which have benefited most from advances in corpus linguistics

• What is collocation?

– strong tea, powerful car (Halliday 1976)

– “collocations of a given word are statements of the habitual or customary places of that word…the company that words keep” (Firth 1968:181-2)

• “One of the meanings of night is its collocability with dark” (Firth 1957:196)

– “a frequent co-occurrence of two lexical items in the language” (Greenbaum 1974:82)

• expel a school child vs. cashier an army officer

• “I propose to bring forward as a technical term, meaning by collocation, and apply the test of collocability” (Firth 1957: 194)

Meaning by collocation

• “There is frequently so high a degree of interdependence between lexemes which tend to occur in texts in collocation with one another that their potentiality for collocation is reasonably described as being part of their meaning” (Lyons 1977: 613)

• Complete description of the meaning of a word would have to include the other word or words that collocate with it

• “You shall know a word by the company it keeps!” (Firth 1968:179)

• Collocation is part of the word meaning

Two types of collocation

• Coherence collocation vs. neighbourhood (horizontal) collocation (Scott 1998)

– Coherence collocation

• Collocates associated with a word (e.g. letter – stamp, post office)

– Neighbourhood collocation

• Words which do actually co-occur with the word (letter - my, this, a, etc)

Coherence collocation

• “A cover term for the cohesion that results from the co-occurrence of lexical items that are in some way or other typically associated with one another, because they tend to occur in similar environments.” (Halliday & Hasan 1976:287)

– candle – flame – flicker

– hair – comb – curl – wave

– sky – sunshine – cloud – rain

• Difficult to measure using a statistical formula

Neighbourhood collocation

• Collocation in corpus linguistics

• Structure of collocation – collocation window

– “We may use the term node to refer to an item whose collocations we are studying, and we may then define a span as the number of lexical items on each side of a node that we consider relevant to that node. Items in the environment set by the span we will call collocates.” (Sinclair 1966:415)

• Casual vs. significant collocation

– Significant collocation: collocation that occurs more frequently than would be expected (in a statistical sense) on the basis of the individual items

• n.b. Neighbourhood (horizontal) collocations can include some coherence collocations
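
The node / span / collocate terminology can be made concrete with a short Python sketch. It assumes the corpus is already a flat list of tokens and, as a simplification, counts every token inside the window rather than only lexical items; the function name `collocates` and the sample sentence are invented for illustration.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count the tokens occurring within `span` positions on each side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        left = tokens[max(0, i - span):i]      # up to `span` tokens before the node
        right = tokens[i + 1:i + 1 + span]     # up to `span` tokens after the node
        counts.update(left + right)
    return counts

# Invented sentence, pre-tokenized by whitespace.
tokens = (
    "she posted the letter at the post office "
    "and bought a stamp for another letter"
).split()
window = collocates(tokens, "letter", span=2)
print(window.most_common())
```

With a span of 2, the window around "letter" picks up neighbours such as "the" and "another", while "stamp" (a typical coherence collocate) falls outside it; widening the span would bring it in.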

Intuition vs. collocation

• Greenbaum (1974): “people disagree on collocations” in introspection-based elicitation experiments

• Although “collocation can be observed informally” on the basis of intuitions, “it is more reliable to measure it statistically, and for this a corpus is essential” (Hunston 2002: 68)

• Intuition is often a poor guide to collocation

– “because each of us has only a partial knowledge of the language, we have prejudices and preferences, our memory is weak, our imagination is powerful (so we can conceive of possible contexts for the most implausible utterances), and we tend to notice unusual words or structures but often overlook ordinary ones” (Krishnamurthy 2000: 32-33)

• Collocation can be measured on the basis of co-occurrence statistics (mutual information (MI), z-score, t-score, log-likelihood (LL), etc.) – more discussion to follow
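
As a rough preview of two of these measures, the sketch below computes an MI score and a t-score from raw counts. It follows one common formulation (expected co-occurrence under independence, scaled by the window size); other textbooks and tools define the expected value slightly differently, so treat the formulas as an assumption rather than as the definitions used later in the course. All counts are invented.

```python
import math

def expected_cooccurrence(f_node, f_collocate, corpus_size, window_size):
    """Co-occurrences expected by chance if the two words were independent."""
    return f_node * f_collocate * window_size / corpus_size

def mutual_information(observed, expected):
    """MI score: log2 of observed over expected co-occurrence."""
    return math.log2(observed / expected)

def t_score(observed, expected):
    """t-score: observed minus expected, scaled by the square root of observed."""
    return (observed - expected) / math.sqrt(observed)

# Invented counts, for illustration only (not taken from any real corpus).
N = 1_000_000      # corpus size in tokens
f_node = 500       # frequency of the node word
f_coll = 2_000     # frequency of the candidate collocate
observed = 40      # co-occurrences found inside the window
window = 8         # 4 tokens on each side of the node

expected = expected_cooccurrence(f_node, f_coll, N, window)
mi = mutual_information(observed, expected)
t = t_score(observed, expected)
print(f"expected={expected:.2f}  MI={mi:.2f}  t={t:.2f}")
```

Both scores compare what is observed with what chance alone would predict, which is exactly the sense of "significant collocation" given earlier; MI tends to favour rare, exclusive pairings, while the t-score favours frequent ones.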

Collocation is syntagmatic

famous boots. On the stroke of full time the Stoke

the lead on the stroke of half-time with a goal

Smith sin-binned on the stroke of half-time, added a

clinched their win on the stroke of lunch after resuming

chase by declaring on the stroke of lunch. <p> With a lead

expectant crowd, on the stroke of midday. The bird

hour began not upon the stroke of midnight but upon the

of midnight but upon the stroke of noon. There was,

booked in advance. On the stroke of seven, a gong summons

Promptly on the stroke of six 'clock, the chooks

from Edinburgh on the stroke of the Millennium.

Parole (Utterance)

syntagmatic

Langue (Language system)

paradigmatic
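
Concordance lines like those shown for "the stroke of" are usually produced by a keyword-in-context (KWIC) search. A minimal sketch follows, assuming whitespace tokenization and a fixed context width measured in tokens; the function name `kwic` and the sample sentence are invented for this example.

```python
def kwic(tokens, node, width=4):
    """Return (left context, node, right context) triples for every hit of `node`."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == node:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented sentence modelled on the concordance lines above.
text = "They took the lead on the stroke of half-time and won on the stroke of midnight"
hits = kwic(text.split(), "stroke", width=3)
for left, node, right in hits:
    # Right-align the left context so the node words line up in one column.
    print(f"{left:>20}  {node}  {right}")
```

Reading down the aligned column shows the syntagmatic pattern (what comes before and after the node), while comparing the words that fill the same slot across lines (half-time, midnight) gives a glimpse of the paradigmatic dimension.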

Collocation vs. colligation

• Collocation

– Relationship between a lexical item and other lexical items

• Relationship between words at the lexical level

• E.g. very collocates with good

• Colligation

– Relationship between a lexical item and a grammatical category

• Relationship between words at the grammatical level

• E.g. very colligates with ADJ
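
The difference between the two relations shows up clearly once a corpus is part-of-speech tagged: collocation counts the words next to the node, colligation counts their grammatical categories. The sketch below assumes a hand-tagged toy token list; the tag labels (ADV, ADJ, ...) and the function name `colligates` are illustrative and not tied to any particular tagset or tagger.

```python
from collections import Counter

def colligates(tagged_tokens, node):
    """Count the part-of-speech tags of the tokens that immediately follow `node`."""
    counts = Counter()
    for (word, _), (_, next_tag) in zip(tagged_tokens, tagged_tokens[1:]):
        if word.lower() == node:
            counts[next_tag] += 1
    return counts

# Hand-tagged toy data; a real study would use an automatically tagged corpus.
tagged = [
    ("a", "DET"), ("very", "ADV"), ("good", "ADJ"), ("idea", "NOUN"),
    ("and", "CONJ"), ("a", "DET"), ("very", "ADV"), ("old", "ADJ"),
    ("house", "NOUN"), ("built", "VERB"), ("very", "ADV"), ("quickly", "ADV"),
]
following = colligates(tagged, "very")
print(following.most_common())
```

In this toy sample the category most often found after "very" is ADJ, which matches the slide's example; counting the following words instead of their tags would give the collocates (good, old, quickly).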

“Words are, of course, the most powerful drug used by mankind.”

― Rudyard Kipling

Until next week…