Slides: nlpcl.kaist.ac.kr/~cs372_2021/slides/cs372-3-spring-2021.pdf · 2021. 3. 9.

Natural Language Processing with Python

CS372: Spring, 2021

Lecture 3
Accessing Text Corpora and Lexical Resources

Jong C. Park
School of Computing
Korea Advanced Institute of Science and Technology

ACCESSING TEXT CORPORA AND LEXICAL RESOURCES
• Accessing Text Corpora
• Conditional Frequency Distributions
• More Python: Reusing Code
• Lexical Resources
• WordNet

2021-03-09 CS372: NLP with Python

Introduction

Questions
• What are some useful text corpora and lexical resources, and how can we access them with Python?
• Which Python constructs are most helpful for this work?
• How do we avoid repeating ourselves when writing Python code?

Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus

Gutenberg Corpus

The Project Gutenberg electronic text archive
• contains some 25,000 electronic books;
• is hosted at http://www.gutenberg.org/.

Average sentence length and lexical diversity appear to be characteristics of particular authors.
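The statistics behind this observation can be sketched without NLTK. The snippet below is a minimal illustration on a made-up two-sentence text; the regex tokenizer and the naive sentence split are assumptions standing in for a corpus reader's raw(), words(), and sents() views, and text_stats is a hypothetical helper name.

```python
import re

def text_stats(raw):
    """Return (avg word length, avg sentence length, lexical diversity).

    Uses a toy tokenizer, not an NLTK corpus reader.
    """
    # Naive sentence split on ., ! or ? (assumption: no abbreviations).
    sents = [s for s in re.split(r"[.!?]+", raw) if s.strip()]
    # Words are runs of letters, digits, or apostrophes; punctuation is dropped.
    sent_words = [re.findall(r"[A-Za-z0-9']+", s) for s in sents]
    words = [w for ws in sent_words for w in ws]
    vocab = {w.lower() for w in words}
    num_chars = sum(len(w) for w in words)
    return (
        num_chars / len(words),   # average word length in characters
        len(words) / len(sents),  # average sentence length in words
        len(words) / len(vocab),  # average number of uses per vocabulary item
    )

toy = "The cat sat on the mat. The dog sat too."
avg_word, avg_sent, diversity = text_stats(toy)
print(round(avg_word, 2), avg_sent, round(diversity, 2))
```

Running the same function over texts by different authors would give one triple per text, which is the kind of comparison the observation above rests on.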


The sents() function divides the text into its sentences, which are lists of words.

Most NLTK corpus readers include a variety of access methods in addition to words(), raw(), and sents().

Web and Chat Text

NLTK’s collection of web text includes
• content from a Firefox discussion forum;
• conversations overheard in New York;
• the movie script of Pirates of the Caribbean;
• personal advertisements; and
• wine reviews.

A corpus of instant messaging chat sessions:
• originally collected by the Naval Postgraduate School (nps) for research on automatic detection of Internet predators;
• contains over 10,000 posts, anonymized by replacing usernames with generic names of the form “UserNNN”, and manually edited to remove any other identifying information;
• organized into 15 files, where each file contains several hundred posts collected on a given date, for an age-specific chatroom (teens, 20s, 30s, 40s, plus a generic adults chatroom).

Brown Corpus

The Brown Corpus
• the first million-word electronic corpus of English;
• created in 1961 at Brown University;
• contains text from 500 sources; and
• the sources have been categorized by genre, such as news, editorial, and so on.

See http://icame.uib.no/brown/bcm-los.html for a complete list.

We can access the corpus as a list of words or a list of sentences.
• We may optionally specify particular categories or files to read.

The Brown Corpus is a convenient resource for studying systematic differences between genres, a kind of linguistic inquiry known as stylistics.

Is there any other selection of words that one could try for a similar stylistic comparison?

Computing counts for each genre of interest:
• Use NLTK’s support for conditional frequency distributions.

What kind of observations can we make?
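As a rough picture of what per-genre counting involves, the sketch below keeps one counter per genre and prints a small table. It uses only the standard library rather than NLTK's own conditional frequency distribution class, and the two-genre toy_corpus and the modals word list are made-up assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Hypothetical miniature "corpus": genre -> list of word tokens.
toy_corpus = {
    "news": "the mayor said he will act and may resign".split(),
    "romance": "she could not stay but she would try and could hope".split(),
}
# Illustrative selection of words to compare across genres.
modals = ["can", "could", "may", "might", "must", "will"]

# One Counter per genre: the genre is the condition, the word is the event.
counts = defaultdict(Counter)
for genre, words in toy_corpus.items():
    for word in words:
        counts[genre][word.lower()] += 1

# Tabulate only the selected words, one row per genre.
print(" " * 8 + "".join(f"{m:>7}" for m in modals))
for genre in toy_corpus:
    print(f"{genre:<8}" + "".join(f"{counts[genre][m]:>7}" for m in modals))
```

Swapping a different word list into modals is one way to try another selection of words on the same counts.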

Reuters Corpus

The Reuters Corpus
• It contains 10,788 news documents totaling 1.3 million words.
• The documents are classified into 90 topics, and grouped into two sets, “training” and “test”.
• For example, the text with fileid ‘test/14826’ is a document drawn from the test set.
• The split is for training and testing algorithms that automatically detect the topic of a document.

Unlike those in the Brown Corpus, categories in the Reuters Corpus overlap with each other.

Why?
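One likely answer is that a single news story often covers more than one topic. The sketch below illustrates that many-to-many relationship with plain dictionaries; only the fileid ‘test/14826’ comes from the slides, while the other fileids and all topic labels are invented for illustration.

```python
from collections import defaultdict

# Hypothetical fileid -> topics table (labels are illustrative only).
doc_topics = {
    "test/14826": ["trade"],
    "training/0001": ["barley", "corn"],
    "training/0002": ["corn", "wheat"],
}

# The prefix of each fileid says which set the document belongs to.
split = defaultdict(list)
for fileid in doc_topics:
    split[fileid.split("/")[0]].append(fileid)

# Invert the table: topic -> documents. A document with several topics
# appears under each of them, which is what makes the categories overlap.
topic_docs = defaultdict(set)
for fileid, topics in doc_topics.items():
    for topic in topics:
        topic_docs[topic].add(fileid)

overlap = topic_docs["barley"] & topic_docs["corn"]
print(sorted(split["training"]), sorted(overlap))
```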

We can specify the words or sentences we want in terms of files or categories.

Inaugural Address Corpus

The Inaugural Address Corpus
• a collection of 55 texts, one for each presidential address;
• its time dimension is an interesting property.

‘2021-Biden.txt’ is not yet available.

Looking at how the words America and citizen are used over time.
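A count of this kind can be sketched as follows, assuming each file name starts with a four-digit year as in ‘2021-Biden.txt’. The toy_addresses dictionary and its contents are made up, and the sketch uses plain counters rather than NLTK, so it shows only the shape of the computation.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the corpus: fileid -> tokens.
toy_addresses = {
    "1789-Washington.txt": "fellow citizens of the senate".split(),
    "1961-Kennedy.txt": "my fellow Americans and citizens of America".split(),
}
targets = ["america", "citizen"]

# Condition = target prefix, event = year taken from the fileid.
by_target = defaultdict(Counter)
for fileid, words in toy_addresses.items():
    year = fileid[:4]
    for w in words:
        for target in targets:
            # Prefix match so that "Americans" and "citizens" also count.
            if w.lower().startswith(target):
                by_target[target][year] += 1

for target in targets:
    print(target, sorted(by_target[target].items()))
```

Plotting each target's counts against the year would then show how its use changes over time.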

Annotated Text Corpora

Many text corpora contain linguistic annotations, representing part-of-speech tags, named entities, syntactic structures, semantic roles, and so forth.
• Consult http://www.nltk.org/data for information about downloading them.

Corpora in Other Languages

NLTK comes with corpora for many languages, though in some cases we need to learn how to manipulate character encodings in Python.

the Floresta Sinta(c)tica Corpus, http://www.linguateca.pt/Floresta/ (cf. http://nltk.googlecode.com/svn/trunk/doc/howto/portuguese_en.html)

the CESS-ESP Treebank, with 6030 parsed sentences

bangla, hindi, marathi, telugu

The corpus, udhr, contains the Universal Declaration of Human Rights in over 300 languages.
• The fileids include information about the character encoding used in the file, such as UTF8 or Latin1.

Words having five or fewer letters account for about 80% of Ibibio text, 60% of German text, and 25% of Inuktitut text.
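Figures like these come from a cumulative word-length distribution. The following is a minimal sketch of that calculation on a short made-up English word list; cumulative_length_share is a hypothetical helper, not an NLTK function.

```python
from collections import Counter

def cumulative_length_share(words, max_len):
    """Fraction of tokens whose length is at most 1, 2, ..., max_len."""
    lengths = Counter(len(w) for w in words)
    total = sum(lengths.values())
    shares, running = {}, 0
    for n in range(1, max_len + 1):
        running += lengths[n]        # tokens of exactly n letters
        shares[n] = running / total  # tokens of n or fewer letters
    return shares

toy_words = "we hold these truths to be self evident".split()
shares = cumulative_length_share(toy_words, 7)
print(shares[5])
```

Computing shares for the same document in several languages and reading off the value at 5 gives numbers comparable to the percentages quoted above.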

Text Corpus Structure

Common structures

There is a difference between some of the corpus access methods: raw() returns the text as a single string of characters, words() returns a list of words, and sents() returns a list of sentences, each of which is a list of words.

Loading Your Own Corpus

Load your own collection of text files.

Use your own path to replace /usr/share/dict.
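NLTK ships its own corpus reader classes for this task. To show the idea without depending on them, the sketch below defines TinyCorpus, a hypothetical standard-library stand-in that exposes fileids(), raw(), and words() over a directory of text files. The regex tokenizer and the temporary directory are assumptions made so the example runs anywhere.

```python
import re
import tempfile
from pathlib import Path

class TinyCorpus:
    """Toy stand-in for a plain-text corpus reader (not an NLTK class)."""

    def __init__(self, root, pattern="*.txt"):
        self.root = Path(root)
        self.pattern = pattern

    def fileids(self):
        # File names under the corpus root, in sorted order.
        return sorted(p.name for p in self.root.glob(self.pattern))

    def raw(self, fileid):
        # The whole file as one string.
        return (self.root / fileid).read_text(encoding="utf-8")

    def words(self, fileid):
        # Alphanumeric runs and single punctuation marks as tokens.
        return re.findall(r"\w+|[^\w\s]", self.raw(fileid))

# Build a throwaway directory so the example is self-contained;
# in practice, point root at your own directory of text files.
with tempfile.TemporaryDirectory() as root:
    Path(root, "a.txt").write_text("Hello, corpus!", encoding="utf-8")
    Path(root, "b.txt").write_text("Second file.", encoding="utf-8")
    corpus = TinyCorpus(root)
    ids = corpus.fileids()
    first_words = corpus.words("a.txt")

print(ids, first_words)
```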

Another example

A corpus reader for corpora that consist of parenthesis-delineated parse trees.

Use your own path to replace /corpora/penntreebank/parsed/mrg/wsj.

Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams

Conditions and Events

While a frequency distribution counts observable events, a conditional frequency distribution needs to pair each event with a condition.
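The difference between the two kinds of distribution can be made concrete with plain Python counters. This is a minimal sketch on invented (condition, event) pairs, using the standard library in place of NLTK's own classes.

```python
from collections import Counter, defaultdict

# Each observation is a (condition, event) pair, e.g. (genre, word).
pairs = [
    ("news", "the"), ("news", "fulton"), ("news", "the"),
    ("romance", "they"), ("romance", "the"),
]

# A plain frequency distribution ignores the condition.
fd = Counter(event for _, event in pairs)

# A conditional frequency distribution keeps one distribution per condition.
cfd = defaultdict(Counter)
for condition, event in pairs:
    cfd[condition][event] += 1

print(fd["the"], cfd["news"]["the"], cfd["romance"]["the"])
```

The plain count for "the" is the sum of its per-condition counts, which is a quick sanity check on any table built this way.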

Counting Words by Genre


Plotting and Tabulating Distributions

1,638 words of the English text have nine or fewer letters.

Generating Random Text with Bigrams

Create a table of bigrams using a conditional frequency distribution.

Example 2-1. Generating random text
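The code for this example did not survive in the transcript, so what follows is only a sketch of the general technique, not the slide's listing. It builds a bigram table with standard-library counters and then repeatedly steps to the most frequent successor; note that this choice is greedy rather than random. The names bigram_table and generate and the toy sentence are assumptions.

```python
from collections import Counter, defaultdict

def bigram_table(words):
    """Map each word to a Counter of the words that follow it."""
    table = defaultdict(Counter)
    for first, second in zip(words, words[1:]):
        table[first][second] += 1
    return table

def generate(table, word, num=5):
    """Emit up to num words, always stepping to the most frequent successor."""
    out = [word]
    for _ in range(num - 1):
        followers = table.get(word)
        if not followers:  # dead end: this word never had a successor
            break
        word = followers.most_common(1)[0][0]
        out.append(word)
    return out

toy = "the cat saw the cat saw the dog".split()
table = bigram_table(toy)
print(" ".join(generate(table, "the", num=4)))
```

Because the most frequent successor is always chosen, output of this kind tends to fall into a loop after a few words; sampling successors in proportion to their counts is one way to avoid that.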

Summary (1/2)

Accessing Text Corpora
• Gutenberg Corpus
• Web and Chat Text
• Brown Corpus
• Reuters Corpus
• Inaugural Address Corpus
• Annotated Text Corpora
• Corpora in Other Languages
• Text Corpus Structure
• Loading Your Own Corpus

Summary (2/2)

Conditional Frequency Distributions
• Conditions and Events
• Counting Words by Genre
• Plotting and Tabulating Distributions
• Generating Random Text with Bigrams