
Page 1

Lab 11: Processing Corpora, Online Data Resources

Ling 1330/2330: Intro to Computational Linguistics

Na-Rae Han

Page 2

Objectives


How to process an ARCHIVE of text, i.e., a corpus?

From the NLTK Corpora page, download:

C-Span Inaugural Address Corpus

How to process data resources downloaded from the internet?

From Norvig's data page, download:

Xkcd's simple words: words.js

Word 1-grams: count_1w.txt

Big file! Make sure the entire thing is downloaded.

Using the set() function

Page 3

Processing xkcd's words.js


Download from: http://norvig.com/ngrams/

In JavaScript format; used for xkcd's Simple Writer: https://xkcd.com/simplewriter/

Let's process this file into a word list.

How to do this?

Page 4

Step 1: stare at the file.


[Screenshot of words.js, with callouts:]
- Extra stuff at the beginning
- Extra stuff at the end
- Words are separated by |
- Contracted words

Page 5

Step 2: read in, shed extras.

>>> f = open('words.js')
>>> txt = f.read()
>>> f.close()
>>> txt[:100]
'/**\n *\n * XKCD Simple Writer Word List 0.2.1\n */\nwindow.__WORDS = "understandings|understanding|conv'
>>> txt[:67]
'/**\n *\n * XKCD Simple Writer Word List 0.2.1\n */\nwindow.__WORDS = "'
>>> txt[-10:]
'e|an|i|a";'
>>> txt[-2:]
'";'
>>> chopped = txt[67:-2]
>>> print(chopped[:200])
understandings|understanding|conversations|disappearing|informations|grandmothers|grandfathers|questionings|conversation|information|approaching|understands|immediately|positioning|quest

Middle slice without the extra stuff at either end

May also need: encoding='utf-8'
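The 67 and -2 offsets above are tied to this exact version of words.js; if the header comment changes, they break. A minimal sketch of a more robust alternative, assuming the word list is the only double-quoted string in the file: locate the quotes with find() and rfind().

# A sketch: locate the quote delimiters instead of hardcoding 67 and -2.
# Assumes the word list is the only double-quoted string in words.js.
f = open('words.js', encoding='utf-8')   # explicit encoding, to be safe
txt = f.read()
f.close()
start = txt.find('"') + 1     # index right after the opening quote
end = txt.rfind('"')          # index of the closing quote
chopped = txt[start:end]
print(chopped[:200])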

Page 6

Step 3: split away.

>>> chopped[-200:]
"t|mad|low|lot|hot|lip|how|lit|lie|kid|i'm|let|i’m|leg|i'd|i’d|ice|led|act|lay|law|ins|yes|yet|you|its|job|no|at|by|my|on|ha|do|ok|he|oh|is|tv|me|us|as|hi|go|if|of|am|up|to|we|so|in|or|it|be|an|i|a"
>>>

We have to split on ' as well as | …

Solution: change every ' into |, and then split on |.

Page 7

Step 3: split away.

>>> chopped[-200:]
"t|mad|low|lot|hot|lip|how|lit|lie|kid|i'm|let|i’m|leg|i'd|i’d|ice|led|act|lay|law|ins|yes|yet|you|its|job|no|at|by|my|on|ha|do|ok|he|oh|is|tv|me|us|as|hi|go|if|of|am|up|to|we|so|in|or|it|be|an|i|a"
>>> xkcd_words = chopped.replace("'", '|').split('|')
>>> xkcd_words[-50:]
['i', 'm', 'let', 'i’m', 'leg', 'i', 'd', 'i’d', 'ice', 'led', 'act', 'lay', 'law', 'ins', 'yes', 'yet', 'you', 'its', 'job', 'no', 'at', 'by', 'my', 'on', 'ha', 'do', 'ok', 'he', 'oh', 'is', 'tv', 'me', 'us', 'as', 'hi', 'go', 'if', 'of', 'am', 'up', 'to', 'we', 'so', 'in', 'or', 'it', 'be', 'an', 'i', 'a']

SUCCESS!

But! Because we uncoupled contracted words, 'i' is now listed twice (or more…).

It would be nice to remove these duplicates.

Page 8

Step 4: remove duplicates.

>>> xkcd_words.count('i')
4
>>> xkcd_words.count('he')
2
>>> len(xkcd_words)
3652
>>> xkcd_words = list(set(xkcd_words))
>>> xkcd_words.count('i')
1
>>> len(xkcd_words)
3630
>>>

Turns the list into a set (duplicates removed!) and then back into a list

Page 9

Last step: pickle.

>>> import pickle
>>> f = open('xkcd_simple_words.p', 'wb')
>>> pickle.dump(xkcd_words, f, -1)
>>> f.close()
>>>
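To get the word list back in a later session, pickle.load() reverses the dump; a minimal sketch:

>>> import pickle
>>> f = open('xkcd_simple_words.p', 'rb')
>>> xkcd_words = pickle.load(f)
>>> f.close()
>>> len(xkcd_words)
3630

Note 'rb' rather than 'wb': pickle files are binary for reading, too.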

Page 10

The set data type


set is a built-in data type in Python. Just like dictionaries, it is built with { } and is orderless.

But unlike dictionaries, it does not have key:value pairs as entries. It has single elements as entries.

Just like dictionaries do not allow duplicate keys, sets do not allow duplicate entries.

>>> cities = {'Boston', 'New York', 'Akron', 'Pittsburgh'}
>>> 'Chicago' in cities
False
>>> medals = {'gold', 'bronze', 'silver', 'silver'}
>>> medals
{'silver', 'gold', 'bronze'}
>>>

Duplicates are quietly ignored.

Page 11

Using set() to remove duplicates

set() is also useful as a type-conversion function.

Can be used to remove duplicates!

It returns a set type, but it can then be converted to some other type: use list() or sorted() for a list type.

>>> li = [1, 2, 3, 3, 3, 4, 4, 5]
>>> set(li)
{1, 2, 3, 4, 5}
>>> list(set(li))
[1, 2, 3, 4, 5]

>>> li2 = 'rose is a rose is a rose'.split()
>>> li2
['rose', 'is', 'a', 'rose', 'is', 'a', 'rose']
>>> set(li2)
{'rose', 'is', 'a'}
>>> list(set(li2))
['rose', 'is', 'a']
>>> sorted(set(li2))
['a', 'is', 'rose']

output now a list

same, but in sorted order!

Page 12

Processing count_1w.txt


Download from http://norvig.com/ngrams/

Data derived from the Google Web Trillion Word Corpus

Essentially unigram frequency data

Top 333K entries, taken from Google's original data (which is much bigger)

Let's process this file into a Python data object.

How to do this?

Huge file. Wait until your browser fully loads the page before hitting "save as"!
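If the browser keeps cutting the file short, a hedged alternative is to fetch it programmatically with urllib from the standard library; a sketch:

# A sketch: download count_1w.txt directly, so a partially loaded
# browser page cannot truncate it.
import urllib.request

url = 'http://norvig.com/ngrams/count_1w.txt'
urllib.request.urlretrieve(url, 'count_1w.txt')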

Page 13

Step 1: stare at the file.


One word per line, followed by count

Separated by white space: most likely a TAB

Already sorted by frequency

Page 14

Step 2: read in as list of lines

>>> f = open('count_1w.txt')
>>> lines = f.readlines()
>>> f.close()
>>> lines[0]
'the\t23135851162\n'
>>> lines[1]
'of\t13151942776\n'
>>> len(lines)
333333
>>>

Because of the "one entry per line" format of the original file, .readlines() is better suited.

May also need: encoding='utf-8'
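A 333,333-line file still fits comfortably in memory, but for truly huge files a lower-memory sketch is to iterate over the file object itself, one line at a time, instead of calling .readlines():

# A sketch: process the file line by line; only the current line
# is held in memory.
n = 0
with open('count_1w.txt', encoding='utf-8') as f:
    for line in f:     # each iteration yields one 'word\tcount\n' line
        n += 1         # process the line here; this version just counts
print(n)               # 333333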

Page 15

Step 3: decide on data structure.

>>> f = open('count_1w.txt')
>>> lines = f.readlines()
>>> f.close()
>>> lines[0]
'the\t23135851162\n'
>>> lines[1]
'of\t13151942776\n'
>>> len(lines)
333333
>>> lines[1].split()
['of', '13151942776']
>>>

Let's build: a list where each item is a (word, count) tuple.

The count is a string; it must be turned into an integer.

Page 16

Step 4: experiment with a small copy.

>>> short = lines[:5]
>>> short
['the\t23135851162\n', 'of\t13151942776\n', 'and\t12997637966\n', 'to\t12136980858\n', 'a\t9081174698\n']
>>> short[0].split()
['the', '23135851162']
>>> for s in short:
        li = s.split()
        tu = (li[0], int(li[1]))
        print(tu)

('the', 23135851162)
('of', 13151942776)
('and', 12997637966)
('to', 12136980858)
('a', 9081174698)
>>>

Mini version of lines.

Build a (word, count) tuple from each line.

Page 17

>>> foo = []
>>> for s in short:
        li = s.split()
        tu = (li[0], int(li[1]))
        foo.append(tu)

>>> foo
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]
>>> foo[0]
('the', 23135851162)
>>> foo[1]
('of', 13151942776)
>>>

foo looks good.

Mini version of the big list we're building

Page 18

Step 5: build the real thing.

>>> goog_list = []
>>> for s in lines:
        li = s.split()
        tu = (li[0], int(li[1]))
        goog_list.append(tu)

>>> goog_list[:10]
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698), ('in', 8469404971), ('for', 5933321709), ('is', 4705743816), ('on', 3750423199), ('that', 3400031103)]
>>> goog_list[100]
('price', 501651226)
>>> goog_list[1000]
('stay', 80694073)
>>> len(goog_list)
333333
>>>

DONE!

The real deal

Page 19

Alternate data format: dictionary


But suppose we want to know where 'platypus' ranks… You cannot look up a word in this list!

A dictionary is a better data format for this purpose.

Let's build goog_dict: word as key, (rank, count) as value.

Page 20

Step 4': experiment with a small copy.

>>> short = goog_list[:5]
>>> short
[('the', 23135851162), ('of', 13151942776), ('and', 12997637966), ('to', 12136980858), ('a', 9081174698)]
>>> for i in range(len(short)):
        print(i, short[i])

0 ('the', 23135851162)
1 ('of', 13151942776)
2 ('and', 12997637966)
3 ('to', 12136980858)
4 ('a', 9081174698)
>>>

Build from goog_list this time.

Need indexes 0, 1, 2…! Use range(len(li)) to produce a sequence of indexes.

Page 21

>>> for i in range(len(short)):
        print(i, short[i][0], short[i][1])

0 the 23135851162
1 of 13151942776
2 and 12997637966
3 to 12136980858
4 a 9081174698
>>> foo = {}
>>> for i in range(len(short)):
        word = short[i][0]
        count = short[i][1]
        rank = i + 1
        foo[word] = (rank, count)

>>> foo
{'a': (5, 9081174698), 'the': (1, 23135851162), 'to': (4, 12136980858), 'and': (3, 12997637966), 'of': (2, 13151942776)}
>>> foo['and']
(3, 12997637966)
>>>

Add 1 to index to get rank

foo looks good.

Mini version of the big dictionary we're building

Page 22

Step 5': build the real thing.

>>> goog_dict = {}
>>> for i in range(len(goog_list)):
        word = goog_list[i][0]
        count = goog_list[i][1]
        rank = i + 1
        goog_dict[word] = (rank, count)

>>> goog_dict['important']
(573, 136103455)
>>> goog_dict['platypus']
(36770, 565585)
>>> goog_dict['pittsburgh']
(3733, 19654781)
>>> goog_dict['philadelphia']
(2631, 30179898)
>>> goog_dict['cleveland']
(3813, 19041185)
>>>

DONE!
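As a side note, enumerate() pairs each item with its index directly, which reads a little more idiomatically than range(len(...)); a sketch of the same build:

# A sketch: enumerate() yields (index, item) pairs.
goog_dict = {}
for i, (word, count) in enumerate(goog_list):
    goog_dict[word] = (i + 1, count)   # rank is index + 1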

Page 23

Last step: pickle both data.

>>> import pickle
>>> f = open('google_unigram_list.p', 'wb')
>>> pickle.dump(goog_list, f, -1)
>>> f.close()
>>>
>>> f2 = open('google_unigram_dict.p', 'wb')
>>> pickle.dump(goog_dict, f2, -1)
>>> f2.close()
>>>

Page 24

Beyond a single, short text


So far, we have been handling relatively short texts, one at a time.

Going multiple

Find out what's involved in processing a text archive of multiple text files (aka corpus)

Going big

Find out what's involved in processing HUMONGOUS text files

Let's try this today

Page 25

Processing multiple texts


From the NLTK Corpora page, download:

C-Span Inaugural Address Corpus

http://www.nltk.org/nltk_data/

The C-Span Inaugural Address Corpus

Includes 56 past presidential inaugural addresses, from 1789 (Washington) to 2009 (Obama).

The directory has 56 .txt files and one README file.

QUESTION: How do we effectively process this many files?

Page 26

Corpus vs. sub-corpora

[Diagram: the entire corpus divided into Sub-corpus 1 and Sub-corpus 2]

Page 27

Big token lists for sub-corpora

[Diagram: the texts in each sub-corpus pooled into one big token list per sub-corpus: sub-corpus 1 TOKENS and sub-corpus 2 TOKENS]

Good when individual texts don't need separate attention.

Page 28

Pools & individual token lists

[Diagram: each text gets its own token list, and the token lists also feed the sub-corpus pools: sub-corpus 1 TOKENS and sub-corpus 2 TOKENS]

Individual token lists as well as sub-corpus pools.

Page 29

Using glob


glob: a file-name globbing utility

Returns a list of file names that match the specified pattern

>>> import glob
>>> files = glob.glob(r'D:\Lab\inaugural\*.txt')
>>> len(files)
56
>>> files[:5]
['D:\\Lab\\inaugural\\1789-Washington.txt', 'D:\\Lab\\inaugural\\1793-Washington.txt', 'D:\\Lab\\inaugural\\1797-Adams.txt', 'D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt']
>>> files[-1]
'D:\\Lab\\inaugural\\2009-Obama.txt'
>>>

All files ending in .txt; excludes README.

Page 30

Using glob

Addresses from the 1800s only:

>>> files2 = glob.glob(r'D:\Lab\inaugural\18*.txt')
>>> len(files2)
25
>>> files2[:5]
['D:\\Lab\\inaugural\\1801-Jefferson.txt', 'D:\\Lab\\inaugural\\1805-Jefferson.txt', 'D:\\Lab\\inaugural\\1809-Madison.txt', 'D:\\Lab\\inaugural\\1813-Madison.txt', 'D:\\Lab\\inaugural\\1817-Monroe.txt']
>>> files2[-1]
'D:\\Lab\\inaugural\\1897-McKinley.txt'
>>>

All files starting with '18' and ending with '.txt'.

Page 31

Build dictionary of texts

For-loop through the file names and build a dictionary with the file name as key and the text content as value.

>>> files[0]
'D:\\Lab\\inaugural\\1789-Washington.txt'
>>> files[0][12:-4]
'ural\\1789-Washington'
>>> files[0][17:-4]
'1789-Washington'
>>> files[2][17:-4]
'1797-Adams'

>>> fn2txt = {}
>>> for longname in files:
        f = open(longname)
        txt = f.read()
        f.close()
        fname = longname[17:-4]
        fn2txt[fname] = txt

>>> fn2txt['1809-Madison'][:40]
'Unwilling to depart from examples of the'
>>> fn2txt['1789-Washington'][:40]
'Fellow-Citizens of the Senate and of the'

fn2txt: file name as key, text string as value.
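The [17:-4] slice only works because every path here starts with the 17-character prefix 'D:\Lab\inaugural\'. A more portable sketch uses os.path to peel off the directory and the extension:

# A sketch: extract '1789-Washington' without counting characters.
# (These are Windows paths; os.path uses the running OS's separators.)
import os

longname = 'D:\\Lab\\inaugural\\1789-Washington.txt'
fname = os.path.splitext(os.path.basename(longname))[0]
print(fname)    # 1789-Washington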

Page 32

Treating files as a single corpus


Task: Compile word frequency of the Inaugural Speeches.

>>> import textstats
>>> alltoks = []
>>> for txt in fn2txt.values():
        toks = textstats.getTokens(txt)
        alltoks.extend(toks)

>>> len(alltoks)
145774
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']
>>> alltoks[-15:]
['you', '.', 'god', 'bless', 'you', '.', 'and', 'god', 'bless', 'the', 'united', 'states', 'of', 'america', '.']

For this, we only need to build a single pool of tokenized words. For each text, tokenize it, and then add the result to the pool.
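textstats here is the course's own module. If it isn't at hand, NLTK's tokenizer can stand in; a sketch, with the caveat that its token counts will differ somewhat from getTokens:

# A sketch substituting nltk.word_tokenize for textstats.getTokens.
# Assumes nltk (and its 'punkt' tokenizer data) is installed.
import nltk

alltoks = []
for txt in fn2txt.values():
    # lowercased to match the getTokens output shown above
    alltoks.extend(nltk.word_tokenize(txt.lower()))
print(len(alltoks))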

Page 33

Word frequency of entire corpus

>>> allfreq = textstats.getFreq(alltoks)
>>> allfreq['citizens']
237
>>> allfreq['battle']
12
>>> for k in sorted(allfreq, key=allfreq.get, reverse=True)[:10]:
        print(k, allfreq[k])

the 9906
of 6986
, 6862
and 5139
. 4749
to 4432
in 2749
a 2193
our 2058
that 1726
>>>

Page 34

Treating files as a single corpus, take 2


Task: Compile word frequency of the Inaugural Speeches.

>>> alltxt = '\n'.join(fn2txt.values())
>>> alltoks = textstats.getTokens(alltxt)
>>> len(alltoks)
145774
>>> alltoks[:15]
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', 'among', 'the']

Alternative approach: join all text strings into a single gigantic text string… and then tokenize it all at once.

All speech texts are concatenated with a line break in between.

Page 35

Processing each text


Task: Compute the average sentence length for each presidential address.

We have to build separate token lists for each speech.

>>> fn2toks = {}
>>> for (fn, txt) in fn2txt.items():
        toks = textstats.getTokens(txt)
        fn2toks[fn] = toks

>>> len(fn2toks)
56
>>> fn2toks['1789-Washington']
['fellow', '-', 'citizens', 'of', 'the', 'senate', 'and', 'of', 'the', 'house', 'of', 'representatives', ':', ...
>>> fn2toks['2001-Bush'][:10]
['president', 'clinton', ',', 'distinguished', 'guests', 'and', 'my', 'fellow', 'citizens', ',']

fn2toks: file name as key, token list as value.

Page 36

Average sentence length, per address

>>> for fn in sorted(fn2toks):
        toks = fn2toks[fn]
        sentcount = toks.count('.') + toks.count('!') + toks.count('?')
        avgsentlen = len(toks)/sentcount
        print(avgsentlen, '\t', fn)

66.9130434783 	 1789-Washington
36.75 	 1793-Washington
69.8648648649 	 1797-Adams
47.1951219512 	 1801-Jefferson
52.9777777778 	 1805-Jefferson
60.2380952381 	 1809-Madison
...
18.824742268 	 2001-Bush
23.3939393939 	 2005-Bush
24.7909090909 	 2009-Obama
>>>

Assumes every sentence ends with '.', '!', or '?'.
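A hedged, more robust alternative: NLTK's sentence tokenizer handles abbreviations and other edge cases that simple punctuation counting misses, so the averages will shift somewhat; a sketch:

# A sketch using nltk.sent_tokenize instead of counting '.', '!', '?'.
# Assumes nltk (and its 'punkt' models) is installed.
import nltk

for fn in sorted(fn2txt):
    sents = nltk.sent_tokenize(fn2txt[fn])
    avgsentlen = len(fn2toks[fn]) / len(sents)
    print(avgsentlen, '\t', fn)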

Page 37

HW 5A: Two Presidents

George W. Bush:

Thank you very much. Mr. Speaker, Vice President Cheney, Members of Congress, distinguished guests, fellow citizens: As we gather tonight, our Nation is at war; our economy is in recession; and the civilized world faces unprecedented dangers. Yet, the state of our Union has never been stronger.

We last met in an hour of shock and suffering. In 4 short months, our Nation has comforted the victims, begun to rebuild New York and the Pentagon, rallied a great coalition, captured, arrested, and rid the world of thousands of terrorists, destroyed Afghanistan's terrorist training camps, saved a people from starvation, and freed a country from brutal oppression.

The American flag flies again over our …

Barack Obama:

Madam Speaker, Vice President Biden, Members of Congress, distinguished guests, and fellow Americans: Our Constitution declares that from time to time, the President shall give to Congress information about the state of our Union. For 220 years, our leaders have fulfilled this duty. They've done so during periods of prosperity and tranquility, and they've done so in the midst of war and depression, at moments of great strife and great struggle.

It's tempting to look back on these moments and assume that our progress was inevitable, that America was always destined to succeed. But when the Union was turned back at Bull Run and the Allies first landed at Omaha Beach, victory was very much in doubt. …

Page 38

HW 5B: Two EFL Corpora

Bulgarian Students:

It is time, that our society is dominated by industrialization. The prosperity of a country is based on its enormous industrial corporations that are gradually replacing men with machines. Science is highly developed and controls the economy. From the beginning of school life students are expected to master a huge amount of scientific data. Technology is part of our everyday life.

Children nowadays prefer to play with computers rather than with our parents' wooden toys. But I think that in our modern world which worships science and technology there is still a place for dreams and imagination.

There has always been a place for them in man's life. Even in the darkness of the …

Japanese Students:

I agree greatly this topic mainly because I think that English becomes an official language in the not too distant. Now, many people can speak English or study it all over the world, and so more people will be able to speak English. Before the Japanese fall behind other people, we should be able to speak English, therefore, we must study English not only junior high school students or over but also pupils. Japanese education system is changing such a program. In this way, Japan tries to internationalize rapidly. However, I think this way won't suffice for becoming international humans. To becoming international humans, we should study English not only school but also daily life. If we can do it, we are able to master English conversation. It is important for us to master English honorific words. …

Page 39

Wrapping up


Next class:

Introduction to corpora

How to do corpus analysis

Homework 5: Corpus analysis

One week long. You will have a choice between:

- Bush vs. Obama SOU speech corpus

- Bulgarian vs. Japanese EFL learner corpus

Recitation students: work on PART 1 before tomorrow

START EARLY!!!

Midterm exam: see next slide.

Page 40

Midterm exam


2/21 (Tuesday)

At LMC's PC lab (CL G17)

More room!

ALL pencil-and-paper exam questions!