lecture 2 - corpus building
TRANSCRIPT
-
8/12/2019 Lecture 2 - Corpus Building
Principles of corpus construction
Matthew Brook O'Donnell
University of Liverpool - Corpus Linguistics Summer Institute 2008
-
Aims
What is a corpus?
What principles guide the construction, development and selection of a corpus?
When and How to build a corpus
Can the web be used for building corpora?
Workshop: Build a small corpus of web texts
-
What is a corpus?
John Sinclair (1933-2007)
A corpus is a collection of pieces of language
text in electronic form, selected according to
external criteria to represent, as far as possible,
a language or language variety as a source of
data for linguistic research.
(Sinclair 2004)
-
Corpus
Authentic language data
Electronic/machine-readable form
Designed and collected according to sampling procedures
Representative of language
For linguistic investigation
-
But first...
Let's talk about food!
How could we compile a representative list (= CORPUS) of foods/dishes from around the world?
-
[Images: national flags, some paired with a dish: China, Spain, United States (hamburger), United Kingdom, France, Japan, Italy (cheese pizza), Turkey, Mexico (taco)]
-
How can we group these foods?
1. Where they come from
-
1. Continental Food Corpus
Europe
Asia
America (North & South)
?
[Images: cheese pizza, taco, hamburger]
-
2. Main component corpus
Fish
Meat
Vegetarian
[Images: cheese pizza, taco, hamburger]
-
Where do you eat it?
-
How can we group these foods?
1. Where they come from
2. Their main component
3. Where you usually eat it
-
3. Fast food potential corpus
Takeaway
Restaurant
[Images: cheese pizza, taco, hamburger]
-
How can we group these foods?
1. Where they come from
2. Their main component
3. Where it is usually eaten
4. What you use to eat it
-
4. Consumption Implement Corpus
Knife & fork
Chopsticks
Hands
[Images: cheese pizza, taco, hamburger]
-
1+2. Continental & Main Component Corpus
                          | Fish | Meat | Veg.
Europe                    |      |      |
Asia                      |      |      |
America (North & South)   |      |      |
[Images: cheese pizza, taco, hamburger]
-
Corpus
Authentic language data
Electronic/machine-readable form
Designed and collected according to sampling procedures
Representative of language
For linguistic investigation
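The food exercise illustrates what "designed and collected according to sampling procedures" means in practice: each item is classified by external criteria, and the cells of the cross-classification show where the collection is thin. A minimal Python sketch follows; the dishes and the labels assigned to them are hypothetical examples, not classifications given in the lecture.

```python
from collections import Counter

# Hypothetical items: (dish, continent, main component). Labels are illustrative only.
dishes = [
    ("cheese pizza", "Europe", "Veg."),
    ("hamburger", "America", "Meat"),
    ("taco", "America", "Meat"),
    ("sushi", "Asia", "Fish"),
    ("paella", "Europe", "Fish"),
]

continents = ["Europe", "Asia", "America"]
components = ["Fish", "Meat", "Veg."]

# Count how many items fall into each (continent, component) cell.
cells = Counter((continent, component) for _, continent, component in dishes)

# Cells with no items show where the collection is unbalanced.
empty_cells = [(c, m) for c in continents for m in components if cells[(c, m)] == 0]

for continent in continents:
    row = "  ".join(f"{m}: {cells[(continent, m)]}" for m in components)
    print(f"{continent:<8} {row}")
print("Empty cells:", empty_cells)
```

The same pattern applies to texts: replace the dishes with documents and the two criteria with whatever external categories the corpus design calls for.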
-
Language corpora
Different types of corpus
Corpus size
Sample size
Representativeness - sampling
Classification criteria
-
Types of corpus
Comparable Corpus: texts in 2 languages or 2 varieties but not matched up
Parallel Corpus: texts are translations of each other, e.g. Canadian Hansard, corpus of versions of Plato, Bible
Translation Corpus: 2 or more sets of texts classified as either originals or translations, the purpose being to identify features of translation (Manchester: Baker)
Diachronic Corpus: Helsinki, LOB v. FLOB
Learner Corpus: texts are written by language learners
-
Corpus Size
Is bigger better?
1st Generation Corpora = 1 Million Words: BROWN, LOB, ICE corpora
2nd Generation Sample Corpora: BNC, ANC = 100 Million Words
Monitor Corpora: Bank of English (450+ million and growing!)
Specialized corpora: depends on source and scope of problem under investigation
-
Sample Size
Personally I would like to see whole text as a default condition, thus classifying sample
corpora as one of the categories of special corpora... To me the use of small samples is
just a remnant of the early restraints on corpus building, and the advantages of whole texts
can be set out in powerful argument. The use of samples of constant size gains only a spurious
air of scientific method, since it confers no benefit on the corpus, and is as practical as
Genghis Khan's fabled policy of having all his soldiers the same height. (Sinclair 1995: 27-28)
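Sinclair's objection can be made concrete by checking how much of each text a constant-size sample retains. The sketch below is illustrative only: the text lengths are invented, and the 2,000-token default is an assumption chosen to resemble the sample length of first-generation corpora such as Brown.

```python
def fixed_size_sample(tokens, size=2000, start=0):
    """Return a contiguous sample of at most `size` tokens."""
    return tokens[start:start + size]

# Hypothetical texts, represented only by their length in tokens.
texts = {
    "letter to the editor": ["w"] * 90,
    "news report": ["w"] * 800,
    "novel": ["w"] * 80_000,
}

coverage = {}
for name, tokens in texts.items():
    sample = fixed_size_sample(tokens)
    # Share of the whole text that the sample retains.
    coverage[name] = len(sample) / len(tokens)
    print(f"{name}: {len(sample)} of {len(tokens)} tokens ({coverage[name]:.1%})")
```

Short texts are kept whole while the long text is reduced to a small fraction, so a constant sample size does not give every text the same treatment.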
-
Sampling
Population
Production
Reception
-
Classifying Texts
Internal Criteria
Topic (aboutness)
Register/Style
-
Brits treat English with such disdain
By Mr M. Rasheed Iqbal
Published: June 30 2007 03:00 | Last updated: June 30 2007 03:00
From Mr M. Rasheed Iqbal.
Sir, I agree with Henry von Blumenthal (Letters, June 23). It is very
discouraging to hear news presenters saying "gonna" and "wanna" on the BBC news.
We were brought up to speak English properly and it is disappointing to see
Brits treat the language with such disdain.
I hope the BBC will stem the tide and pull up its socks.
M. Rasheed Iqbal,
National Bank of Dubai,
Deira, Dubai, UAE
Copyright The Financial Times Limited 2007
-
Mode
Primary Channel: Written
Format: Published (print & web)
Setting: Public
-
Tenor
Addressee
Plurality: individual (editor) / plural (readers)
Presence: absent
Interactiveness: written correspondence/response
Shared knowledge: readers of same publication
Addressor
Demographic: Male, from Dubai, works in Bank, educated, at least bilingual?
Acknowledgement: Self-identified in text
-
Field
Factuality: responding to actual event (TV broadcast), expressing personal opinion
Purposes: complain, express viewpoint, condemn slipping standards, correct perceived decline
Topics: use of British English on BBC, value of language education in former era
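The Mode, Tenor and Field description of the letter can be stored as a metadata record, so that texts can later be selected by these criteria. This is a minimal sketch: the class name, field names and `select` helper are hypothetical, and only the values come from the slides.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextClassification:
    """Situational description of one corpus text (hypothetical schema)."""
    title: str
    channel: str          # Mode: primary channel
    text_format: str      # Mode: format
    setting: str          # Mode: setting
    addressee: str        # Tenor
    addressor: str        # Tenor
    purposes: List[str]   # Field
    topics: List[str]     # Field

letter = TextClassification(
    title="Brits treat English with such disdain",
    channel="written",
    text_format="published (print & web)",
    setting="public",
    addressee="individual (editor) / plural (readers)",
    addressor="male, from Dubai, works in a bank",
    purposes=["complain", "express viewpoint"],
    topics=["use of British English on BBC"],
)

def select(records, **criteria):
    """Keep records whose attributes equal every given criterion."""
    return [r for r in records
            if all(getattr(r, key) == value for key, value in criteria.items())]

written_public = select([letter], channel="written", setting="public")
print([r.title for r in written_public])
```

With many such records, `select` gives a simple way to pull out a subcorpus, for example all written, public texts.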
-
When and How to build a corpus
1. DON'T! Use one of the available corpora
a. If interested in differences in conversational language in British English (age, sex, class etc. differences) USE British National Corpus
b. Combine and subsample existing corpora to match your
2. Repurpose existing archive/collection
a. Any electronic texts available: results of surveys, DA/CA transcripts
3. Build your own!
a. OCR, download, extract from PDF
b. TYPE IT IN!!!!
-
Using the web as source for corpora
-
Web as corpus: Advantages
Massive (and expanding) amounts of electronic text
Whole texts
Wide reach of text-types/topics/genres
Much in the public domain
Google (etc) as corpus query tool
-
Web as corpus: Disadvantages
So much text that balance is difficult to achieve
Copyright is difficult to ascertain for many documents
Pages containing large amounts of extraneous material (menus, formatting, graphics)
Explosion of information: who is writing it? Who is reading it?
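One of the disadvantages above, extraneous page material, can be reduced with a simple filter before a page enters the corpus. The sketch below uses only Python's standard-library `html.parser`; the set of skipped tags and the sample page are assumptions for illustration, and real web pages usually need more robust boilerplate removal.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text, skipping elements that rarely hold running text."""
    # Assumed list of tags to ignore; adjust for the sites being collected.
    SKIP = {"head", "script", "style", "nav", "header", "footer", "aside"}

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside any skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.chunks.append(text)

def extract_text(html):
    """Return the text of a page, one line per text chunk."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)

# Invented sample page with a menu, a style block and a footer.
page = """<html><head><title>Letters</title><style>p {color: red}</style></head>
<body><nav><a href="/">Home</a> <a href="/news">News</a></nav>
<p>Sir, I agree with the previous letter.</p>
<footer>Copyright notice</footer></body></html>"""

print(extract_text(page))
```

Only the paragraph text survives; the menu links, style rules, page title and footer are dropped.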