Natural Language Processing with Python
CS372: Spring, 2021
Lecture 3: Accessing Text Corpora and Lexical Resources
Jong C. Park
School of Computing
Korea Advanced Institute of Science and Technology
ACCESSING TEXT CORPORA AND LEXICAL RESOURCES
• Accessing Text Corpora
• Conditional Frequency Distributions
• More Python: Reusing Code
• Lexical Resources
• WordNet
2021-03-09 CS372: NLP with Python
Introduction
Questions
• What are some useful text corpora and lexical resources, and how can we access them with Python?
• Which Python constructs are most helpful for this work?
• How do we avoid repeating ourselves when writing Python code?
Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus
Gutenberg Corpus
The Project Gutenberg electronic text archive
• contains some 25,000 electronic books; and
• is hosted at http://www.gutenberg.org/.
Average sentence length and lexical diversity appear to be characteristics of particular authors.
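The statistics behind this observation can be sketched as follows. This is a minimal sketch, not the slide's original code: `text_stats` and `gutenberg_stats` are illustrative names, the toy sentences are made up, and the NLTK wrapper assumes the `gutenberg` corpus (and the sentence tokenizer data that `sents()` relies on) has been downloaded.

```python
def text_stats(raw, words, sents):
    """Return (average word length, average sentence length, lexical diversity)."""
    num_chars = len(raw)                          # every character, spaces included
    num_words = len(words)
    num_sents = len(sents)
    num_vocab = len({w.lower() for w in words})   # distinct words, ignoring case
    return (num_chars / num_words,
            num_words / num_sents,
            num_words / num_vocab)


def gutenberg_stats(fileid):
    """Compute the same statistics for one Gutenberg file, e.g. 'austen-emma.txt'."""
    # Imported lazily so the rest of the sketch runs without NLTK data installed.
    from nltk.corpus import gutenberg
    return text_stats(gutenberg.raw(fileid),
                      gutenberg.words(fileid),
                      gutenberg.sents(fileid))


# Toy input standing in for a corpus file (made up for illustration).
raw = "The cat sat. The dog sat too."
sents = [["The", "cat", "sat", "."], ["The", "dog", "sat", "too", "."]]
words = [w for s in sents for w in s]
avg_word_len, avg_sent_len, diversity = text_stats(raw, words, sents)
```

Because the character count includes spaces, the average word length is only a rough figure; the other two values are the ones that tend to differ between authors.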
The sents() function divides the text into its sentences, which are lists of words.
Most NLTK corpus readers include a variety of access methods in addition to words(), raw(), and sents().
Web and Chat Text
NLTK’s collection of web text includes
• content from a Firefox discussion forum;
• conversations overheard in New York;
• the movie script of Pirates of the Caribbean;
• personal advertisements; and
• wine reviews.
A corpus of instant messaging chat sessions:
• originally collected by the Naval Postgraduate School (nps) for research on automatic detection of Internet predators;
• contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information;
• organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).
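The file organization can be read directly off the file names. The sketch below assumes the naming convention used by NLTK's `nps_chat` corpus, where a fileid such as `10-19-20s_706posts.xml` encodes the month, day, chatroom, and number of posts; `parse_chat_fileid` and `load_posts` are illustrative names, not NLTK functions.

```python
import re

# Assumed fileid shape: MM-DD-room_NNNposts.xml
FILEID_PATTERN = re.compile(r"(\d{2})-(\d{2})-([a-z0-9]+)_(\d+)posts\.xml")


def parse_chat_fileid(fileid):
    """Split a chat fileid into month, day, chatroom, and post count."""
    match = FILEID_PATTERN.fullmatch(fileid)
    if match is None:
        raise ValueError(f"unexpected fileid format: {fileid!r}")
    month, day, room, posts = match.groups()
    return {"month": int(month), "day": int(day),
            "room": room, "posts": int(posts)}


def load_posts(fileid):
    """Return the tokenized posts of one file (needs nltk.download('nps_chat'))."""
    from nltk.corpus import nps_chat   # imported lazily; requires the corpus data
    return nps_chat.posts(fileid)


info = parse_chat_fileid("10-19-20s_706posts.xml")
```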
Brown Corpus
The Brown Corpus
• was the first million-word electronic corpus of English;
• was created in 1961 at Brown University;
• contains text from 500 sources; and
• has its sources categorized by genre, such as news, editorial, and so on.
See http://icame.uib.no/brown/bcm-los.html for a complete list.
We can access the corpus as a list of words or a list of sentences.
• We may optionally specify particular categories or files to read.
It is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.
Is there any other selection of words that one can try for similar stylistics?
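One way to explore this question is to count a fixed selection of words in a genre and then swap the selection. The sketch below assumes the original selection was modal verbs (as in the NLTK book) and offers wh-words as one alternative; `count_selected` and `brown_counts` are illustrative names, and the sample sentence is made up.

```python
from collections import Counter

MODALS = ["can", "could", "may", "might", "must", "will"]
WH_WORDS = ["what", "when", "where", "who", "why"]   # one alternative selection


def count_selected(words, targets):
    """Count how often each target word occurs, ignoring case."""
    counts = Counter(w.lower() for w in words)
    return {t: counts[t] for t in targets}


def brown_counts(category, targets):
    """Same count over one Brown genre (needs nltk.download('brown'))."""
    from nltk.corpus import brown      # imported lazily; requires the corpus data
    return count_selected(brown.words(categories=category), targets)


# Toy token list standing in for brown.words(categories='news').
sample = "What can we do ? We can ask why , and who might know .".split()
modal_counts = count_selected(sample, MODALS)
wh_counts = count_selected(sample, WH_WORDS)
```

Calling `brown_counts('news', WH_WORDS)` and `brown_counts('romance', WH_WORDS)` side by side would give a first impression of whether the selection separates the genres.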
Computing counts for each genre of interest.
• Use NLTK’s support for conditional frequency distributions.
What kind of observations can we make?
Reuters Corpus
The Reuters Corpus
• It contains 10,788 news documents totaling 1.3 million words.
• The documents are classified into 90 topics, and grouped into two sets, “training” and “test”.
• For example, the text with fileid ‘test/14826’ is a document drawn from the test set.
• The split is for training and testing algorithms that automatically detect the topic of a document.
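Since the set name is the part of the fileid before the slash, the training/test split can be recovered from the fileids alone. A minimal sketch, with `split_by_set` and `reuters_split` as illustrative names; only ‘test/14826’ comes from the slide, the other fileids are placeholders.

```python
def split_by_set(fileids):
    """Group fileids such as 'test/14826' by the set name before the slash."""
    groups = {}
    for fileid in fileids:
        subset, _, doc_id = fileid.partition("/")
        groups.setdefault(subset, []).append(doc_id)
    return groups


def reuters_split():
    """Apply the grouping to the real corpus (needs nltk.download('reuters'))."""
    from nltk.corpus import reuters    # imported lazily; requires the corpus data
    return split_by_set(reuters.fileids())


demo = split_by_set(["test/14826", "test/14828", "training/9865"])
```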
Unlike the Brown Corpus, categories in the Reuters Corpus overlap with each other.
Why?
We can specify the words or sentences we want in terms of files or categories.
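Overlap means that one document can carry several topic labels, so looking up files by category and categories by file are both useful. The sketch below uses made-up fileids and labels to show the idea; `fileids_by_category`, `overlapping`, and `reuters_overlap` are illustrative names, while `reuters.fileids(category)` is the NLTK call the wrapper relies on.

```python
from collections import defaultdict


def fileids_by_category(categories_of):
    """Invert a fileid -> categories mapping into category -> set of fileids."""
    index = defaultdict(set)
    for fileid, categories in categories_of.items():
        for category in categories:
            index[category].add(fileid)
    return index


def overlapping(index, cat_a, cat_b):
    """Documents labeled with both categories."""
    return index[cat_a] & index[cat_b]


def reuters_overlap(cat_a, cat_b):
    """Same question on the real corpus (needs nltk.download('reuters'))."""
    from nltk.corpus import reuters    # imported lazily; requires the corpus data
    return set(reuters.fileids(cat_a)) & set(reuters.fileids(cat_b))


# Made-up labels for illustration only.
toy_labels = {
    "training/1": ["barley", "corn", "grain"],
    "test/2": ["barley"],
    "training/3": ["money-fx"],
}
index = fileids_by_category(toy_labels)
both = overlapping(index, "barley", "corn")
```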
Inaugural Address Corpus
The Inaugural Address Corpus
• is a collection of 55 texts, one for each presidential address; and
• has an interesting property: its time dimension.
‘2021-Biden.txt’ is not yet available.
Looking at how the words America and citizen are used over time.
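A possible way to do this counting is sketched below, assuming fileids begin with the four-digit year (as in ‘2021-Biden.txt’) and that a word counts toward a target if its lowercased form starts with it, so "Americans" and "Citizens" are included. `target_counts_by_year` and `inaugural_counts` are illustrative names and the toy word lists are made up.

```python
from collections import Counter, defaultdict


def target_counts_by_year(docs, targets):
    """docs maps a fileid like '1789-Washington.txt' to its list of words."""
    counts = defaultdict(Counter)
    for fileid, words in docs.items():
        year = fileid[:4]                       # the year prefix of the fileid
        for word in words:
            lowered = word.lower()
            for target in targets:
                if lowered.startswith(target):
                    counts[target][year] += 1
    return counts


def inaugural_counts(targets=("america", "citizen")):
    """Same count over the real corpus (needs nltk.download('inaugural'))."""
    from nltk.corpus import inaugural  # imported lazily; requires the corpus data
    docs = {f: inaugural.words(f) for f in inaugural.fileids()}
    return target_counts_by_year(docs, targets)


# Made-up word lists standing in for two addresses.
toy_docs = {
    "1789-Washington.txt": ["Fellow", "Citizens", "of", "America"],
    "1793-Washington.txt": ["American", "citizen", "citizens"],
}
counts = target_counts_by_year(toy_docs, ["america", "citizen"])
```

Plotting each target's counts against the year then shows how usage changes over time.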
Annotated Text Corpora
Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
• Consult http://www.nltk.org/data for information about downloading them.
Corpora in Other Languages
NLTK comes with corpora for many languages, though in some cases we need to learn how to manipulate character encodings in Python.
the Floresta Sinta(c)tica Corpus, http://www.linguateca.pt/Floresta/ (cf. http://nltk.googlecode.com/svn/trunk/doc/howto/portuguese_en.html)
the CESS-ESP Treebank, with 6030 parsed sentences
bangla, hindi, marathi, telugu
The corpus, udhr, contains the Universal Declaration of Human Rights in over 300 languages.
• The fileids include information about the character encoding used in the file, such as UTF8 or Latin1.
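The encoding can be read off a fileid with a simple split. This sketch assumes udhr fileids follow the shape `Language-Encoding`, for example `English-Latin1` or `German_Deutsch-Latin1`, with the encoding after the last hyphen; `split_udhr_fileid` and `udhr_words` are illustrative names.

```python
def split_udhr_fileid(fileid):
    """Split 'German_Deutsch-Latin1' into ('German_Deutsch', 'Latin1')."""
    language, sep, encoding = fileid.rpartition("-")
    if not sep:                 # no hyphen: no encoding suffix in the name
        return fileid, None
    return language, encoding


def udhr_words(language, encoding="Latin1"):
    """Read one translation (needs nltk.download('udhr'))."""
    from nltk.corpus import udhr       # imported lazily; requires the corpus data
    return udhr.words(f"{language}-{encoding}")


english = split_udhr_fileid("English-Latin1")
german = split_udhr_fileid("German_Deutsch-Latin1")
```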
Words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
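Figures like these come from a cumulative distribution of word lengths. A minimal sketch of that computation follows; `cumulative_share` and `udhr_share` are illustrative names, the toy sentence is made up, and the wrapper assumes the udhr corpus is installed and the language has a Latin1 file.

```python
from collections import Counter
from itertools import accumulate


def cumulative_share(words, max_len):
    """Share of word tokens with length <= n, for n = 1..max_len."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    if total == 0:
        return {}
    running = accumulate(lengths[n] for n in range(1, max_len + 1))
    return {n: count / total for n, count in enumerate(running, start=1)}


def udhr_share(language, max_len=5):
    """Share of short words in one translation (needs nltk.download('udhr'))."""
    from nltk.corpus import udhr       # imported lazily; requires the corpus data
    return cumulative_share(udhr.words(language + "-Latin1"), max_len)[max_len]


toy_words = "we the peoples of the united nations".split()
share = cumulative_share(toy_words, 7)
```

In the toy sentence four of the seven words have five or fewer letters, so `share[5]` is 4/7.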
Text Corpus Structure
Common structures
There is a difference between some of the corpus access methods:
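The difference is in the shape of what comes back: `raw()` gives one string, `words()` a flat list of tokens, and `sents()` a list of token lists. The sketch below imitates the three views on a made-up sentence pair with a crude regex tokenizer, which is only a stand-in and not NLTK's tokenizer; `toy_views` and `corpus_views` are illustrative names.

```python
import re


def toy_views(raw):
    """Return (raw, words, sents) for a short string, mimicking the three views."""
    sentence_strings = re.split(r"(?<=[.!?])\s+", raw.strip())
    sents = [re.findall(r"\w+|[^\w\s]", s) for s in sentence_strings]
    words = [w for s in sents for w in s]
    return raw, words, sents


def corpus_views(fileid="burgess-busterbrown.txt"):
    """The real views for one Gutenberg file (needs nltk.download('gutenberg'))."""
    from nltk.corpus import gutenberg  # imported lazily; requires the corpus data
    return gutenberg.raw(fileid), gutenberg.words(fileid), gutenberg.sents(fileid)


raw, words, sents = toy_views("Buster Bear yawned. He went fishing.")
# Indexing the same position gives a character, a word, or a whole sentence.
first_of_each = (raw[0], words[0], sents[0])
```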
Loading Your Own Corpus
Load your own collection of text files.
Replace /usr/share/dict with your own path.
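A sketch of loading a directory of text files is given below. It assumes the reader in question is NLTK's `PlaintextCorpusReader` (the one used for this purpose in the NLTK book); `write_demo_corpus` and `load_corpus` are illustrative names, and a temporary directory with two made-up files stands in for your own path. If NLTK is not installed, the demo falls back to listing the files it wrote.

```python
import os
import tempfile


def write_demo_corpus(root):
    """Write two tiny text files into root and return their names, sorted."""
    texts = {"a.txt": "Hello corpus world.", "b.txt": "A second small file."}
    for name, text in texts.items():
        with open(os.path.join(root, name), "w", encoding="utf8") as handle:
            handle.write(text)
    return sorted(texts)


def load_corpus(root, pattern=r".*\.txt"):
    """Create a corpus reader over every file under root matching pattern."""
    from nltk.corpus import PlaintextCorpusReader   # imported lazily
    return PlaintextCorpusReader(root, pattern)


with tempfile.TemporaryDirectory() as corpus_root:   # replace with your own path
    names = write_demo_corpus(corpus_root)
    try:
        reader = load_corpus(corpus_root)
        loaded = reader.fileids()
        first_words = list(reader.words("a.txt"))
    except ImportError:
        loaded = names                               # NLTK not available
```

The second argument is either a list of fileids or a regular expression; `r".*\.txt"` restricts the corpus to text files.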
Another example
corpus reader for corpora that consist of parenthesis-delineated parse trees
Replace /corpora/penntreebank/parsed/mrg/wsj with your own path.
Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams
Conditions and Events
While a frequency distribution counts observable events, a conditional frequency distribution needs to pair each event with a condition.
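The pairing can be made concrete with plain Python. The sketch below stores one counter per condition, which is roughly what NLTK's `ConditionalFreqDist` does with extra methods on top; `conditional_counts` and `brown_cfd` are illustrative names, and the (genre, word) pairs are made up.

```python
from collections import Counter, defaultdict


def conditional_counts(pairs):
    """Build condition -> Counter of events from (condition, event) pairs."""
    table = defaultdict(Counter)
    for condition, event in pairs:
        table[condition][event] += 1
    return table


def brown_cfd(genres=("news", "romance")):
    """The NLTK equivalent over Brown genres (needs nltk.download('brown'))."""
    import nltk                        # imported lazily; requires the corpus data
    from nltk.corpus import brown
    return nltk.ConditionalFreqDist(
        (genre, word) for genre in genres for word in brown.words(categories=genre)
    )


# Each pair is (condition, event): here (genre, word).
pairs = [("news", "The"), ("news", "Fulton"), ("romance", "The"),
         ("romance", "They"), ("news", "The")]
table = conditional_counts(pairs)
conditions = sorted(table)
```

Looking up `table["news"]` gives an ordinary frequency distribution for that one condition.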
Counting Words by Genre
Plotting and Tabulating Distributions
1,638 words of the English text have nine or fewer letters.
Generating Random Text with Bigrams
Create a table of bigrams using a conditional frequency distribution.
Example 2-1. Generating random text
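The idea of the example can be sketched in plain Python as follows. This is not the original listing: it assumes the generator always picks the most frequent follower of the current word (which is what `cfd[word].max()` does on an NLTK `ConditionalFreqDist` built from `nltk.bigrams(text)`), and `bigram_table` and `generate` are illustrative names working on a made-up sentence.

```python
from collections import Counter, defaultdict


def bigram_table(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table


def generate(table, word, num=15):
    """Emit up to num words, always moving to the most frequent follower."""
    output = []
    for _ in range(num):
        output.append(word)
        followers = table.get(word)
        if not followers:              # the word never appears before another word
            break
        word = followers.most_common(1)[0][0]
    return output


text = "the cat sat and the cat sat down".split()
table = bigram_table(text)
generated = generate(table, "and", num=4)
```

Because the most likely follower is chosen every time, the output is deterministic and tends to fall into a loop on longer runs; sampling followers in proportion to their counts would be one way to vary it.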
Summary (1/2)
Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus
Summary (2/2)
Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams