lecture 5: text mining and knowledge...

HG8003 Technologically Speaking: The intersection of language and technology. Text Mining and Knowledge Acquisition Francis Bond Division of Linguistics and Multilingual Studies http://www3.ntu.edu.sg/home/fcbond/ [email protected] Lecture 5 Location: LT8 HG8003 (2014)


Lecture 5: Text Mining and Knowledge Acquisition (compling.hss.ntu.edu.sg/courses/hg8003/pdf/wk-05.pdf)

HG8003 Technologically Speaking: The intersection of language and technology.

Text Mining and Knowledge Acquisition

Francis Bond
Division of Linguistics and Multilingual Studies
http://www3.ntu.edu.sg/home/fcbond/

[email protected]

Lecture 5
Location: LT8

HG8003 (2014)


Schedule

Lec.  Date   Topic
1     01-16  Introduction, Organization: Overview of NLP; Main Issues
2     01-23  Representing Language
3     02-06  Representing Meaning
4     02-13  Words, Lexicons and Ontologies
5     02-20  Text Mining and Knowledge Acquisition  Quiz
6     02-27  Structured Text and the Semantic Web
             Recess
7     03-13  Citation, Reputation and PageRank
8     03-20  Introduction to MT, Empirical NLP
9     03-27  Analysis, Tagging, Parsing and Generation  Quiz
10    Video  Statistical and Example-based MT
11    04-03  Transfer and Word Sense Disambiguation
12    04-10  Review and Conclusions
Exam  05-06  17:00

➣ Video week 10


Introduction

➣ Review

➣ Text Mining and Knowledge Acquisition

➣ Homework


Review of Lexicons and Ontologies


Review

➣ Storing information on machines allows us to manipulate it in many ways

➣ Information for humans can be made easier to search and validate

➢ Machine Readable Dictionaries

➣ Information for machines must be made explicit

➢ Dictionaries for various processors
➢ Ontologies

➣ We can reuse knowledge to make new resources


Machine Readable Lexicon

definition (n) a concise explanation of the meaning of a word or phrase or symbol

➣ Headword: definition

➣ Part of Speech: n (noun)

➣ Definition:

➢ genus: explanation
➢ differentia: concise; of the meaning of a word or phrase or symbol

? Implied: countable (a), regular plural


Erin McKean’s TED Talk

➣ Redefining the dictionary (by Erin McKean; TED Talk 2007) (http://blog.ted.com/2007/08/30/redefining_the/)

➣ Dictionaries still don’t cover all words
  many, many new words are undefined
  as many as one per book?

➣ We need to define these words in context

➣ On-line dictionaries allow us to do this without space limitations

➢ Dictionaries can describe usage with real examples


Ontology Example (WordNet)

Synset 06744396-n: definition

Def: ‘a concise explanation of the meaning of a word or phrase or symbol.’

Hype: account
Hypo: redefinition, explicit definition, recursive definition, stipulative definition, contextual definition, ostensive definition, dictionary definition

SUMO: = equivalentContentInstance

Has-Part: genus
Has-Part: differentia


What is an Ontology?

➣ A set of statements in a formal language that describes/conceptualizes knowledge in a given domain

➢ What kinds of entities exist (in that domain)
➢ What kinds of relationships hold among them

➣ Ontologies usually assume a particular level of granularity

➢ they don’t capture all details


Text Mining


Overview of Text Mining

➣ Text Mining

➣ Template Filling

➢ Named Entity Recognition

➣ Relation Detection

➣ Learning Lexical Knowledge


Why Text Mining?

➣ Too much (textual) information

➣ We now have electronic books, documents, web pages, emails, blogs, news, chats, memos, research papers, . . .

. . . much of it immediately accessible, thanks to databases and Information Retrieval (IR)

➣ An estimated 80–85% of all data stored in databases is natural language

➣ But humans did not scale so well. . .

➣ This results in the common perception of Information Overload


Example: The BioTech Industry

➣ Access to information is a serious problem

➢ 80% of biological knowledge is only in research papers
➢ finding the information you need is prohibitively expensive

➣ Humans do not scale well

➢ if you scan 60 research papers/week
➢ and read the 10% of those which are interesting
➢ a scientist manages 6/week, or 300/year

➣ This is not good enough

➢ MedLine adds more than 10,000 abstracts each month
➢ Chemical Abstracts Registry (CAS) registers 4000 entities each day


The growth in PubMed articles


What is Text Mining?

➣ The discovery by computer

➢ of new, previously unknown information,
➢ by automatically extracting information
➢ from a usually large amount
➢ of different unstructured textual resources.


➣ What does previously unknown mean?

➢ Implies discovering genuinely new information.
➢ Marti Hearst’s analogy:

Discovering new knowledge vs. merely finding patterns is like the difference between a detective following clues to find the criminal vs. analysts looking at crime statistics to assess overall trends in car theft.

➣ What about unstructured?

➢ Naturally occurring text.
➢ As opposed to HTML, XML, databases, . . .


Text Mining Process

1. Document Collection

2. Preprocessing

3. Mining

➣ Template Filling
➣ Relation Extraction

4. Presentation/Visualization


Document Collection

➣ This is basically information retrieval

➣ Normally want to restrict the text domain in some way

➢ Existing document collections
  ∗ Research Papers
  ∗ Newspapers
  ∗ Phone conversations

➢ Induced document collections
  ∗ Find all documents similar to a seed

➣ What you do depends on the goal


Issues with documents

➣ Large document collections contain erroneous data

➢ Mistaken analyses
➢ Deliberately erroneous data
➢ Out-of-date data
➢ Fictional data

➣ Text is typically noisy

➢ Spelling errors
➢ Conversion errors (hyphens, headers, footers)


Study: 58 Percent Of U.S. Exercise Televised

WASHINGTON, DC—According to a new Department of Health and Human Services study, 58 percent of all exercise performed in the U.S. is broadcast on television. “Of the 3.5 billion push-ups performed in 2003, 2.03 billion took place on exercise shows on the Lifetime Network and ESPN3 or fitness segments on Good Morning America,” the study read. “The abundance of TV exercise would create the impression that America is a healthy society, if everyone didn’t already know that we’re a bunch of disgusting, near-immobile spectators.” The DHHS study also indicated that 99.3 percent of the nation’s Soloflex workouts are televised.

The ONION America’s Finest News Source

http://www.theonion.com/articles/study-58-percent-of-us-exercise-televised,4623/

MSNBC News 2004-03-12

NORVILLE: Finally tonight, if you were watching the show earlier this week, you heard Health and Human Services Tommy Thompson encourage Americans to work out and watch what they eat.

Good advice, because it turns out most Americans are watching their workouts. Yes, according to a new study by Thompson’s department, 58 percent of all the exercise done in America is broadcast on television. For instance, of the 3.5 billion sit-ups done during 2003, two million, 30,000 of them were on exercise shows on Lifetime or one of the ESPN channels. Put it another way, according to the study, 99 percent of the time that someone is using one of those Soloflex machines, it’s when it’s being broadcast on one of those late-night commercials. [. . . ]

We want to hear from you, so send us your e-mails and ideas to us [email protected].

Thanks for watching. I’m Deborah Norville.

http://www.msnbc.msn.com/id/4533441/

Text Preprocessing

➣ Text cleanup

➢ remove ads from web pages
➢ normalize text converted from binary formats
➢ deal with tables, figures and formulas

➣ Tokenization

➢ Splitting up a string of characters into a sequence of tokens.
➢ Need to deal with issues like:

∗ Apostrophes, e.g., “John’s sick”: is it 2 or 3 tokens?
∗ Hyphens, e.g., database vs. data-base vs. data base.
∗ How should we deal with ‘C++’, ‘A/C’, ‘:-)’, ‘. . .’?
∗ Is the amount of white space significant?
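
The tokenization questions above have no single right answer; a tokenizer simply commits to one policy. The sketch below is a hypothetical regular-expression tokenizer (the name `tokenize` and every rule in it are illustrative assumptions, not something the lecture prescribes): it splits off the clitic 's, keeps hyphenated and slashed words together, keeps C++ and simple emoticons whole, and ignores the amount of white space.

```python
import re

# Hypothetical policy, chosen only for illustration; each alternative is one decision.
TOKEN_RE = re.compile(r"""
    [:;]-?[()]                      # simple emoticons such as :-) or ;)
  | \.\.\.                          # an ellipsis kept as one token
  | [A-Za-z]\+\+                    # names like C++
  | [A-Za-z]+(?:[-/][A-Za-z]+)*     # words, with internal hyphens or slashes
  | '[a-z]+                         # clitics such as 's, split from the word
  | \d+(?:\.\d+)?                   # numbers, optionally with a decimal part
  | \S                              # any other single non-space character
""", re.VERBOSE)


def tokenize(text):
    """Return the list of tokens found in text (white space is discarded)."""
    return TOKEN_RE.findall(text)


print(tokenize("John's sick"))          # the clitic becomes its own token
print(tokenize("C++ A/C :-) ..."))
```

Under this policy "John's sick" is three tokens; dropping the clitic rule would instead split it as John, ', s, sick, so the two-token reading (treating "John's" as one unit) needs a rule of its own.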


Text Processing

➣ Sentence Splitting (split into sentences)

➣ Part of Speech Tagging (annotate POS)

➣ Chunking (find constituents, typically noun phrases)

➣ Lemmatization

➢ try to find the root form (mice → mouse)

➣ Stemming

➢ try to find the stem (computing, computer → comput)

➣ Parsing


Take out your clickers!


What to Mine?

➣ Email, Instant Messages, Blogs, Twitter, . . .

➢ Entities (Persons, Companies, Organizations, . . . )
➢ Events (Inventions, Offers, Attacks, . . . )

Biggest existing system: ECHELON (UK/USA)

➣ News: Newspaper articles, Newswires, . . .

➢ Collections of articles (e.g., from different agencies, describing the same event)

➢ Contrastive summaries (e.g., event described by U.S. newspaper vs. Arabic newspaper)


➣ (Scientific) Books, Papers, . . .

➢ detect new trends in research
➢ automatic curation of research results in Bioinformatics

need to deal with highly specific language

➣ Software Requirement Specifications, Documentation, . . .

➢ extract requirements from software specification
➢ detect conflicts between source code and its documentation

➣ Web Mining

➢ extract and analyse information from web sites
➢ mine companies’ web pages (detect new products & trends)
➢ mine Intranets (gather knowledge, find ‘illegal’ content, . . . )


Typical Text Mining Tasks

➣ Classification and Clustering

➢ Email Spam-Detection, Classification (Orders, Offers, . . . )
➢ Clustering of large document sets (vivisimo.com)
➢ Creation of topic maps (www.leximancer.com)

➣ Web Mining

➢ Trend Mining, Opinion Mining, Novelty Detection
➢ Ontology Creation, Entity Tracking, Information Extraction

➣ Summarization


Template Filling

➣ Look for a fixed template

➢ Seminar Announcement

Title        The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature
Speaker      A/P Wee Lian Hee
Institution  Hong Kong Baptist University
Date         24 August 2009, Thursday
Place        HSS SEM RM 3 (HSS-B1-10)

➣ This can be used to fill in a calendar


Extract from email

You are cordially invited to the CLASS Seminar on “The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature” by A/P Wee Lian Hee of Hong Kong Baptist University on 24 August 2009, Thursday at HSS SEM RM 3 (HSS-B1-10). Please disseminate this email to your colleagues and students who may be interested to attend.

Note: 24 August 2009 was a Monday; a mismatch like this is easier to check once the date has been extracted
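
A minimal sketch of that check, assuming the date always appears in the form "24 August 2009, Thursday" (the pattern `DATE_RE` and the helper `check_announced_date` are hypothetical names introduced here, and real announcements would need many more date formats):

```python
import re
from datetime import date

MONTHS = ["January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December"]
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday",
            "Saturday", "Sunday"]

# Assumed surface form: "<day> <Month> <year>, <Weekday>"
DATE_RE = re.compile(r"(\d{1,2}) ([A-Z][a-z]+) (\d{4}), ([A-Z][a-z]+day)")

EMAIL = ("You are cordially invited to the CLASS Seminar on "
         "\"The Artificial Boundary of Humanities and Science: "
         "Arguments from Linguistics and Literature\" by A/P Wee Lian Hee "
         "of Hong Kong Baptist University on 24 August 2009, Thursday "
         "at HSS SEM RM 3 (HSS-B1-10).")


def check_announced_date(text):
    """Fill the Date slot and compare the stated weekday with the calendar."""
    m = DATE_RE.search(text)
    if m is None or m.group(2) not in MONTHS:
        return None
    day, month, year, stated = m.groups()
    # date.weekday() returns 0 for Monday through 6 for Sunday
    actual = WEEKDAYS[date(int(year), MONTHS.index(month) + 1, int(day)).weekday()]
    return {"date": f"{day} {month} {year}", "stated": stated,
            "actual": actual, "consistent": stated == actual}


print(check_announced_date(EMAIL))
```

Once the slot holds a structured date rather than a string, the calendar application can flag the announcement instead of silently filing it under the wrong day.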


Template Extraction

➣ Identify entities

➢ Named Entity Recognition

➣ Look for patterns that match slots

➢ Relation Extraction


Named Entity Recognition

➣ Identify interesting things, typically

PER People
ORG Organization
LOC Location
GPE Geo-Political Entity
FAC Facility
TIM Time
MON Money

➣ Task Dependent

TIT Talk Title


Named Entity Recognition

You are cordially invited to the CLASS Seminar on
[TIT “The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature”] by
[PER A/P Wee Lian Hee] of
[ORG Hong Kong Baptist University] on
[TIM 24 August 2009, Thursday] at
[LOC HSS SEM RM 3 (HSS-B1-10)]. Please disseminate this email to your colleagues and students who may be interested to attend.


Named Entity Recognition (NER) as Sequence Labeling


NER as Sequence Labeling

➣ Typically learn a series of classifiers

➢ one for each NE type
➢ choose the one with the highest score

➣ IOB encoding
  B Beginning
  I Inside
  O Outside

➢ by/O Wee/B_PER Lian/I_PER Hee/I_PER of/O Hong/B_ORG Kong/I_ORG Baptist/I_ORG University/I_ORG on/O
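
To see why a per-token labelling is enough to recover whole entities, the sketch below reads the IOB tags back into spans. It assumes tags are written `B_PER`, `I_PER`, `O` (the underscore spelling and the function name `bio_to_spans` are assumptions made for this example).

```python
def bio_to_spans(tokens, tags):
    """Collect (type, text) pairs from parallel lists of tokens and IOB tags."""
    spans, current_type, current_tokens = [], None, []
    for token, tag in zip(tokens, tags):
        starts_new = tag.startswith("B_") or (
            tag.startswith("I_") and tag[2:] != current_type)
        if starts_new:
            # Close any open entity, then open a new one of this type
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I_"):
            current_tokens.append(token)
        else:  # "O": outside any entity
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


tokens = "by Wee Lian Hee of Hong Kong Baptist University on".split()
tags = ["O", "B_PER", "I_PER", "I_PER", "O",
        "B_ORG", "I_ORG", "I_ORG", "I_ORG", "O"]
print(bio_to_spans(tokens, tags))
```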


Typical Features for NER

➣ Words

➣ Stemmed words (or lemmatized)

➣ Shape (the orthographic form)

➣ Part of Speech

➣ Chunks (constituents: e.g., noun phrases)

➣ Gazetteer (Name List)

➣ n-gram bag-of-words


Typical Shape Features

Feature                    Example   Comment
Lower Case                 cummings
Capitalized                Nanyang   Name
All Caps                   NTU
Mixed Case                 eBay
Capital letter and Period  F.        Person Name
Ends in digit              A9
Hyphenated                 H-P
Four numbers               1967      Year
Eight numbers              64561967  Phone Number
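
Shape features like these are cheap to compute. The function below is a sketch that maps a token to one of the shapes in the table; the label strings, the rule order, and the name `word_shape` are assumptions for illustration, and a real system would usually emit several overlapping features rather than a single label.

```python
import re


def word_shape(token):
    """Return one coarse orthographic shape label for a token."""
    if re.fullmatch(r"\d{4}", token):
        return "four_numbers"          # e.g. a year
    if re.fullmatch(r"\d{8}", token):
        return "eight_numbers"         # e.g. a phone number
    if re.fullmatch(r"[A-Z]\.", token):
        return "capital_period"        # e.g. an initial in a person name
    if "-" in token.strip("-"):
        return "hyphenated"            # a hyphen inside the token
    if token[-1:].isdigit():
        return "ends_in_digit"
    if token.isalpha():
        if token.islower():
            return "lower_case"
        if token.isupper():
            return "all_caps"
        if token[0].isupper() and token[1:].islower():
            return "capitalized"
        return "mixed_case"
    return "other"


for example in ["cummings", "Nanyang", "NTU", "eBay", "F.", "A9", "H-P",
                "1967", "64561967"]:
    print(example, word_shape(example))
```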


Gazetteer

➣ Geographical dictionary or directory (Original Meaning)

➣ Names from US Census

➣ Companies from stock market lists

➣ International Organization for Standardization (ISO) lists
  e.g., ISO3166-2 Region Names

ISO3166-2 regions for AU
AU-NS : New South Wales
AU-QL : Queensland
AU-SA : South Australia

➣ Facebook names, Student Lists, Phonebooks, . . .


Training a Classifier

➣ Training Input

➢ Labelled examples (IOB)
➢ Features (extracted)
➢ Gazetteers

➣ Classifier

➢ Takes text labelled with features
➢ Labels Named Entities
➢ Many machine learning methods (HMM, SVM, EM, kNN, . . . )

➣ Typically high precision (80-90%), low recall (30-40%)

➢ Better on restricted text


Using a Classifier


Different ways of tagging chunks

Tokens      IO     BIO    BMEWO  BMEWO+
Yesterday   O      O      O      BOS_O
afternoon   O      O      O      O
,           O      O      O      O_PER
John        I_PER  B_PER  B_PER  B_PER
J           I_PER  I_PER  M_PER  M_PER
.           I_PER  I_PER  M_PER  M_PER
Smith       I_PER  I_PER  E_PER  E_PER
traveled    O      O      O      PER_O
to          O      O      O      O_LOC
Washington  I_LOC  B_LOC  W_LOC  W_LOC
.           O      O      O      O_EOS

From Bob Carpenter’s lingpipe blog: lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/


IO, BIO Encoding

➣ IO encoding

➢ Tags
  I_X  token is in named entity X
  O    token is outside a named entity

⊗ Can’t represent two entities of the same type next to each other

➣ BIO encoding

➢ Tags: O and:
  B_X  token is beginning of named entity X
  I_X  token is a continuation of named entity X

➢ Industry standard


BMEWO Encoding

➣ BMEWO encoding

➢ Tags: B, O and:
  M_X  token is in the middle of named entity X (sometimes I_X)
  E_X  token is at the end of named entity X
  W_X  single-token named entity X (sometimes S_X)

➢ Useful with more powerful machine learning (e.g., max entropy)
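
BMEWO carries the same span information as BIO, so one can be derived from the other by looking one token ahead. The sketch below assumes the underscore tag spelling (`B_PER`, `M_PER`, ...) and a hypothetical helper name `bio_to_bmewo`; it is an illustration of the encoding, not code from LingPipe.

```python
def bio_to_bmewo(tags):
    """Rewrite a BIO tag sequence as BMEWO by peeking at the next tag."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append("O")
            continue
        prefix, etype = tag.split("_", 1)
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == "I_" + etype   # does the entity go on?
        if prefix == "B":
            out.append(("B_" if continues else "W_") + etype)
        else:  # an I_ tag becomes Middle or End
            out.append(("M_" if continues else "E_") + etype)
    return out


bio = ["O", "O", "O", "B_PER", "I_PER", "I_PER", "I_PER",
       "O", "O", "B_LOC", "O"]
print(bio_to_bmewo(bio))
```

Applied to the BIO tags for "Yesterday afternoon , John J . Smith traveled to Washington ." this reproduces the BMEWO tagging of that sentence: John is `B_PER`, Smith is `E_PER`, and the single-token Washington becomes `W_LOC`.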


BMEWO+ Encoding

➣ BMEWO+ encoding (Bob Carpenter)

➢ Tags: B, M, W, E, O and:
  O_X    token is before named entity X
  X_O    token is after named entity X
  BOS_O  Beginning of Sentence
  O_EOS  End of Sentence

➢ Adding finer-grained information to the tags themselves implicitly encodes a kind of longer-distance information about preceding or following words: John said , in Boston.

➢ Begin and end of sentence tags help to reduce the confusion between English sentence capitalization and proper name capitalization.


Try it!

Tag the following:

➣ Eric Raymond was a GNU contributor in the mid- 1980s .

➣ Of the 3.5 billion push-ups performed in 2003, 2.03 billion took place on exercise shows on the Lifetime Network and ESPN3 or fitness segments on Good Morning America.

➣ Use BIO with the following tags:

PER People
ORG Organization
LOC Location
TIM Time
MON Money
OTH Other


Evaluation Metrics

Precision  Ratio of correctly labeled / labeled (P: Accuracy)

Recall  Ratio of correctly labeled / should have been labeled (R)

F-measure  A measure of overall goodness (F): $F = \frac{2PR}{P+R}$

More generally, the F-measure is $F_\beta = \frac{(1+\beta^2)PR}{\beta^2 P + R}$.

Most often we set $\beta = 1$. If Recall is more important, increase $\beta$; if Precision is more important, decrease it below 1.
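
As a worked example of these metrics, the sketch below scores a set of predicted entities against a gold set. The function name `precision_recall_f` and the toy entity sets are invented for illustration. In the general formula a larger β pulls the score toward recall and a smaller β toward precision.

```python
def precision_recall_f(gold, predicted, beta=1.0):
    """Return (P, R, F_beta) for two collections of labelled items."""
    gold, predicted = set(gold), set(predicted)
    correct = len(gold & predicted)
    p = correct / len(predicted) if predicted else 0.0
    r = correct / len(gold) if gold else 0.0
    if p == 0.0 and r == 0.0:
        return p, r, 0.0
    b2 = beta * beta
    f = (1 + b2) * p * r / (b2 * p + r)
    return p, r, f


# Toy data (invented): four gold entities, three predictions, two correct
gold = {"Wee Lian Hee", "Hong Kong Baptist University",
        "24 August 2009", "HSS SEM RM 3"}
predicted = {"Wee Lian Hee", "Hong Kong Baptist University", "CLASS"}

for beta in (0.5, 1.0, 2.0):
    p, r, f = precision_recall_f(gold, predicted, beta)
    print(f"beta={beta}: P={p:.3f} R={r:.3f} F={f:.3f}")
```

Here P = 2/3 and R = 1/2, so F with β = 1 is 4/7; with β = 2 the score drops toward the lower recall, and with β = 0.5 it rises toward the higher precision.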


Confidence in Evaluation

➣ We train on one part of the data (training set)

➣ We tune the algorithm on another part (development set)

➣ We test on a third part (test set)

➢ Ideally this is unseen by the developers

➣ However, if you split the data in different ways, you may get different results

➢ Some parts may be more difficult, or less similar to the test set


n-fold cross validation

➣ We can average over differences in data using n-fold cross validation

➢ divide the data into n parts (folds)
➢ e.g., for n = 10, train on folds $n_1 \dots n_9$ and test on $n_{10}$

➢ do this with each fold $n_i$ held out as the test set in turn, and use the average
➢ we can also check the variation
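
The procedure above can be sketched in a few lines. The helper names `n_fold_splits` and `cross_validate`, and the `evaluate(train, test)` callback that returns a score, are assumptions for this example; in practice the data would normally be shuffled before being split into folds.

```python
def n_fold_splits(items, n):
    """Yield (train, test) pairs, holding out each of the n folds once."""
    folds = [items[i::n] for i in range(n)]      # every n-th item per fold
    for i in range(n):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test


def cross_validate(items, n, evaluate):
    """Average a score over all folds and report its variance as well."""
    scores = [evaluate(train, test) for train, test in n_fold_splits(items, n)]
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance


# Toy run: the "score" is just the mean of the held-out items
data = list(range(20))
print(cross_validate(data, 10, lambda train, test: sum(test) / len(test)))
```

A large variance across folds is a sign that a single train/test split would have given a misleading number.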


Factors in Classifier Performance

➣ More training data improves performance

➣ Similar training data improves performance

➢ Out of domain performance is often a problem

➣ Orthogonal knowledge sources improve performance

➢ Different sources allow cross-checking
➢ e.g., Use gazetteers and shape-based features

➣ Better machine learning algorithms improve performance

➢ Typically by allowing more or better features


Relation Detection


Relation Detection (Learning Lexical Knowledge)

➣ Try to find relations

➢ Hypernymy
➢ Synonymy
➢ Speaker
➢ . . .

➣ Can also find lexical knowledge

➢ Syntactic structure
➢ Countability

will give two examples
∗ Bootstrapping from ontology
∗ Learning from text


The basic approach

➣ Similar relations behave the same
➢ Look for patterns
➢ Look for contexts

➣ Overcome noise by looking at multiple examples


Acquisition: Patterns

➣ Why disease carrying animals such as rat and cockroach didn’t get disease from the bacteria they carried?

➣ Certain species of birds, such as the Phainopepla, a slim, glossy, black bird with a slender crest, breed during the relatively cool spring, then leave the desert for cooler areas at higher elevations or along the Pacific coast.

➣ A few desert animals, such as the Round-tailed Ground Squirrel, a diurnal mammal, enter a state of estivation when the days become too hot and the vegetation too dry.

➣ Skeptics paint a picture of Noah going to countries remote from the Middle East to gather animals such as kangaroos and koalas from Australia, and kiwis from New Zealand.


Acquisition: Patterns

➣ Hypernyms

➢ S (such as|like|e.g.) A, B and C (S ⊃ A, B, C)
➢ A, B, C and other S
➢ S (including|especially) A, B, C
➢ the A, an S,

➣ Synonyms

➢ both A and B (A ≈ B)
➢ either A or B
➢ neither A nor B
➢ A (B)


➣ Templates

➢ (seminar |talk) on TIT
➢ PER of ORG ⇒ Institution
➢ (seminar |talk) . . . by PER ⇒ Speaker

Seminar on “The Artificial Boundary of Humanities and Science: Arguments from Linguistics and Literature” by A/P Wee Lian Hee of Hong Kong Baptist University


Example: Hypernym

As the world looks around anxiously for an alternative to oil, energy sources such as biofuels, solar, and nuclear seem like they could be the magic ticket

➣ Extract

➢ energy source ⊃ biofuel
➢ energy source ⊃ solar
➢ energy source ⊃ nuclear

➣ Note: use of lemmatization and chunking.

➣ Need to find multiple examples (can be different patterns)
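
The extraction above can be sketched with a regular expression over chunked text. The code assumes a chunker has already bracketed the noun phrases (the brackets in `SENTENCE` were added by hand for this example), and `lemmatize` is a deliberately crude stand-in that only strips a final plural "s"; both are assumptions, not real tools.

```python
import re

CHUNK = r"\[([^\]]+)\]"                          # one bracketed noun phrase
HEARST = re.compile(
    CHUNK + r" (?:such as|like|e\.g\.) "        # hypernym S, then the cue phrase
    + r"((?:\[[^\]]+\](?:, (?:and |or )?| and | or )?)+)"   # the list A, B and C
)


def lemmatize(phrase):
    """Crude stand-in for a lemmatizer: drop a plural 's' from the last word."""
    words = phrase.lower().split()
    if not words:
        return ""
    last = words[-1]
    if last.endswith("s") and not last.endswith("ss"):
        words[-1] = last[:-1]
    return " ".join(words)


def extract_hypernyms(chunked_sentence):
    """Return (hypernym, hyponym) pairs matched by the 'S such as A, B and C' pattern."""
    pairs = []
    for m in HEARST.finditer(chunked_sentence):
        hypernym = lemmatize(m.group(1))
        for hyponym in re.findall(CHUNK, m.group(2)):
            pairs.append((hypernym, lemmatize(hyponym)))
    return pairs


SENTENCE = ("As [the world] looks around anxiously for [an alternative] to [oil], "
            "[energy sources] such as [biofuels], [solar], and [nuclear] "
            "seem like [they] could be [the magic ticket]")
print(extract_hypernyms(SENTENCE))
```

Note that "seem like [they]" is not matched, because the cue "like" is not preceded by a noun-phrase chunk; without chunking the pattern would have no way to tell where the hypernym starts or where the list ends.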


Acquisition: Learning Patterns!

➣ If you know some relations, look for patterns they occur in

➢ n-grams : dog w1 w2 w3 animal
➢ dependencies : SUBJ(dog, w1), OBJ(animal , w1)

➣ Then use the learned patterns to find more relational pairs

➢ e.g., train on positive and negative WordNet pairs
  positive: 〈dog, animal〉, 〈food, pizza〉, . . .
  negative: 〈dog, truck〉, 〈food, trust〉, . . .

➣ Used to add 30,000 entries to WordNet (Snow et al., 2006)
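
A toy version of the n-gram idea: given known pairs, count the word sequences that connect them. This is only a sketch under simplifying assumptions (single-word terms, whitespace tokenization, invented example sentences, and a hypothetical function name `learn_patterns`); it is not the classifier-based method of Snow et al.

```python
from collections import Counter


def learn_patterns(sentences, seed_pairs, max_gap=3):
    """Count the word sequences linking known (hyponym A, hypernym S) pairs."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.lower().split()
        for hypo, hyper in seed_pairs:
            if hypo not in words or hyper not in words:
                continue
            a, s = words.index(hypo), words.index(hyper)
            first, last = min(a, s), max(a, s)
            gap = words[first + 1:last]           # words between the two terms
            if not 0 < len(gap) <= max_gap:
                continue
            left, right = ("A", "S") if a < s else ("S", "A")
            counts[" ".join([left, *gap, right])] += 1
    return counts


# Invented toy sentences, loosely modelled on the examples in this lecture
sentences = [
    "a heavy hydrogen atom called deuterium",
    "a bone cancer called sarcoma",
    "a dog is an animal",
]
seeds = [("deuterium", "atom"), ("sarcoma", "cancer"), ("dog", "animal")]
print(learn_patterns(sentences, seeds).most_common())
```

Patterns that recur across many seed pairs (here "S called A") are the ones worth applying to new text; a pattern seen only once is more likely to be noise.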


Example of a discovered pattern

➣ S called A (S ⊃ A)

➣ Learned from cases (in WordNet) such as:

➢ 〈sarcoma, cancer〉 . . . an uncommon bone cancer called osteogenic sarcoma and to . . .

➢ 〈deuterium, atom〉 . . . heavy water rich in the doubly heavy hydrogen atom called deuterium

➣ Finds cases (not in WordNet):

➢ 〈efflorescence, condition〉 . . . and a condition called efflorescence are other reasons for . . .

➢ 〈hat creek outfit, ranch〉 . . . run a small ranch called the Hat Creek Outfit.

➢ 〈tardive dyskinesia, problem〉 . . . irreversible problem called tardive dyskinesia . . .


Pattern and Bootstrapping-based Relation Extraction


Bootstrapping

➣ “help oneself, often through improvised means”

➣ In machine learning: “any method that takes a few seed examples and learns patterns from them”. (There are other, more technical meanings.)


Obscene patterns

(Cartoon from http://xkcd.com/798/)

Case Studies


Two Examples of Knowledge Acquisition

➣ Attempt to find the countability of English nouns

➢ Countability and Semantics (knowledge based) (Bond and Vatikiotis-Bateson, 2002)

➢ Countability and Distribution (text based) (Baldwin and Bond, 2003)


Why should we care?

➣ In generation need to decide between:

➢ a cake, cakes, a piece of cake
➢ Especially important in machine translation

➣ In analysis, helps to resolve ambiguity:

➢ I like dogs (in general)
➢ I like a dog (a specific dog)
➢ I like dog (dog meat)

➣ Useful in teaching English (yet not marked in dictionaries)


Countability from an Ontology

➣ Many grammatical phenomena are both:

➢ semantically motivated
➢ arbitrarily marked in different languages

➣ For example, information is

➢ Countable in French; Uncountable in English

➣ How much of syntax is predictable from meaning?


Outline

➣ How far is English countability predictable from meaning?

Short Answer: 78%
Long Answer: It depends

➢ Definition of countability
   Five Classes for Noun Countability Preferences

➢ Definition of meaning
   Hierarchical ontology of 2,710 semantic classes


Noun Phrase Countability

➣ Semantically motivated:

➢ bounded, indivisible individuals (things)
   prototypically COUNTABLE: a dog, two dogs

➢ unbounded, divisible substances (stuff)
   prototypically UNCOUNTABLE: gold


Noun Phrase Countability

➣ Knowing the referent is not enough (Wierzbicka 1996), e.g. scales

1. Thought of as being made of two arms (British): a pair of scales

2. Thought of as a set of numbers (Australian): a set of scales

3. Thought of as discrete whole objects (American): one scale/two scales


➣ Also varies from language to language

➢ [a flash of] lightning (English)
➢ ein Blitz (German)
➢ un éclair (French)

➣ A well known problem for non-native speakers

➣ How often can we predict it from the referent’s meaning?

➢ There must be some connection


Noun Countability Preferences

Noun Countability    Code  Example    Default  Default     #       %
Preference                            Number   Classifier
fully countable      CO    knife      sg       —           47,255  65.8
strongly countable   BC    cake       sg       —            3,110   4.3
weakly countable     BU    beer       sg       —            3,377   4.7
uncountable          UC    furniture  sg       piece       15,435  21.5
plural only          PO    scissors   pl       pair         2,107   2.9


Lexicon

➣ ALT-J/E’s semantic transfer lexicon

Index: usagi
sense 1
   English Translation: rabbit
   Part of Speech: noun
   Noun Countability Pref.: strongly countable
   Default Number: singular
   Semantic Classes: [common noun: animal, meat]

➣ 71,833 linked Japanese-English noun pairs

➣ 41,285 are multiword expressions (57.4%)


The Goi-Taikei Ontology

➣ A rich ontology with wide coverage of Japanese

➣ Used in many NLP applications such as MT

➣ Several hierarchies of concepts:

➢ 2,710 semantic classes (12-level tree structure) for common nouns
➢ 200 classes (9-level tree structure) for proper nouns
➢ Not designed with countability in mind


Experiment

➣ How well do the semantic classes predict the countability preferences?

➣ Treat every combination of semantic classes as a different semantic class.

➣ Most frequent NCP is assigned to all members of a class.

➢ Ties are resolved as follows: fully countable beats strongly countable beats weakly countable beats uncountable beats plural only.

➣ Baseline (all fully countable = 65.8%)


Example

➣ Semantic Class 910:tableware

➢ crockery ↔ toukirui (UC)
➢ dinner set ↔ youshokki (CO)
➢ tableware ↔ shokki (UC)
➢ Western-style tableware ↔ youshokki (UC)

➣ The most common NCP is UC: associate uncountable with 910:tableware.

➣ This predicts the NCP correctly 75% of the time.
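The majority-class assignment and the tie-break rule from the Experiment slide can be written down directly. The sketch below is an illustration only: `PRIORITY`, `train` and `accuracy` are names chosen here, and the four tableware entries are the ones from this example slide. To treat a combination of semantic classes as its own class, the key could simply be a tuple of class names.

```python
from collections import Counter, defaultdict

# Tie-break order given in the Experiment slide: earlier codes win ties.
PRIORITY = ["CO", "BC", "BU", "UC", "PO"]

def train(entries):
    """Map each semantic class to its most frequent countability code."""
    counts = defaultdict(Counter)
    for sem_class, ncp in entries:
        counts[sem_class][ncp] += 1
    model = {}
    for sem_class, ncp_counts in counts.items():
        # Highest count wins; on equal counts the earlier PRIORITY code wins.
        model[sem_class] = max(
            PRIORITY,
            key=lambda code: (ncp_counts[code], -PRIORITY.index(code)),
        )
    return model

def accuracy(model, entries, default="CO"):
    """Fraction of entries whose code equals the one predicted for its class."""
    correct = sum(model.get(sem_class, default) == ncp
                  for sem_class, ncp in entries)
    return correct / len(entries)

tableware = [
    ("910:tableware", "UC"),  # crockery
    ("910:tableware", "CO"),  # dinner set
    ("910:tableware", "UC"),  # tableware
    ("910:tableware", "UC"),  # Western-style tableware
]
model = train(tableware)
print(model, accuracy(model, tableware))
```

Classes never seen in training fall back to `default`, which mirrors the fully countable default used in the experiment.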


Results

Conditions                Entries  %     Range      Baseline
Training=Test             all      77.9  76.8–78.6  65.8
Tenfold Cross Validation  all      71.2  69.8–72.1  65.8

➣ Tested using stratified ten-fold validation

➣ 11.6% given default value (fully countable), i.e. we couldn’t decide


Discussion

➣ Problems of granularity: Where should cutlery go?

➢ WordNet
   tableware
      cutlery  chopsticks  crockery  dishware  dinnerware  tea set  ...
         table knife  fork  spoon

➢ ALT-J/E
   tableware
      crockery  cutlery/chopsticks  cookware  other tableware

cutlery is uncountable, but knives, forks, spoons are countable.


Other Discussion

➣ Plural-only nouns that take pair are almost all wrong: binoculars, trousers, headphones
   Need some spatial representation!

➣ 7% or so errors in the ontology

➢ ソフトカラー sofuto kara
   soft colour   clothing  BC
   soft collar   hue       CO

➣ It is hard for Japanese speakers to judge countability


Applications

➣ Adding a checker to the dictionary

➢ Warn if semantic class does not predict the assigned countability
➢ Check both semantic class and countability

➣ Predict countability for unknown words

➢ If we know their semantics


Examples

➣ totoro is a monster (∈ 222:monster)

⇒ totoro is fully countable

➣ gavagai is an edible animal (∈ 537:animal,810:meat)

⇒ gavagai is strongly countable

➣ ununquadium is an element (Uuq, atomic number 114) (∈ 710:element)

⇒ ununquadium is uncountable


Conclusion

➣ With a limited ontology and noisy lexicon, semantics predicts countability around 78% of the time; therefore countability is semantically motivated

➣ If we can find the semantic class of a word, we can predict something about its syntactic properties


Countability from Corpus Data

➣ Acquire lexical knowledge from corpora

➢ English noun countability preferences
➢ Precision of 94.6% (for words with frequency > 10)

➣ Extract features in three ways

➢ POS tagging
➢ Full text chunking
➢ Robust parsing

➣ Combine in a memory-based learner (TiMBL)


Background

➣ Countability is a syntactic property of English, not marked morphologically

➣ In generation used to decide between:

➢ a cake, cakes, a piece of cake
➢ Especially important in machine translation (J-E, . . . )

➣ In analysis, helps to resolve ambiguity:

➢ I need a paper by this evening (academic/newspaper)
➢ I need some paper by this evening (material)
➢ I need the paper by this evening (ambiguous)


Noun Countability Classes

Countable: one dog, two dogs, many dogs, a dog kennel
   #one piece of dog, #much dog, ∗a dogs kennel

Uncountable: much butter, a bit of butter, a butter knife
   #butters, #one butter, #two butters

Plural Only: some goods, a goods train
   ∗good, ∗one good, ?two goods

Bipartite: a pair of scissors, a scissor kick, ?some scissors
   ∗a scissor, ∗one scissors, ∗two scissors


Resources

➣ Gold standard data created by comparing two lexicons

➢ ALT-J/E’s Japanese-English Lexicon: 56,000 noun-countability combinations

➢ COMLEX 3.0: 14,000 noun-countability combinations

➣ Inter-resource agreement of 93.8%.

➢ Few actual errors
➢ Almost half of the disagreements came from words with two countabilities in ALT-J/E but only one in COMLEX.


Learning Countability

➣ Identify lexical and/or constructional features associated with each countability class

➣ Determine the relative corpus occurrence of the features for each noun

➣ Use the noun feature vectors to classify the noun as a member of each of the countability classes

➢ paper +countable, +uncountable
➢ uranium +uncountable
➢ tanuki +countable
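The second step, turning corpus occurrences into a per-noun feature vector, might look roughly like this. It is a deliberately tiny sketch and not the feature set of Baldwin and Bond (2003): the two determiner lists, the three feature names and the function `feature_vector` are assumptions made for the example, and plural forms are recognised only by a final -s.

```python
from collections import Counter

# Hypothetical cue words; the actual feature clusters are much richer.
COUNTABLE_DETS = {"a", "an", "another", "each", "many", "few"}
UNCOUNTABLE_DETS = {"much", "little"}
FEATURES = ("plural", "countable_det", "uncountable_det")

def feature_vector(noun, sentences):
    """Relative frequency of three toy features over a noun's occurrences."""
    counts = Counter()
    total = 0
    for sentence in sentences:
        # Pad with a start symbol so a sentence-initial noun has a "previous" token.
        tokens = ["<s>"] + sentence.lower().split()
        for prev, word in zip(tokens, tokens[1:]):
            if word not in (noun, noun + "s"):
                continue
            total += 1
            if word == noun + "s":
                counts["plural"] += 1
            if prev in COUNTABLE_DETS:
                counts["countable_det"] += 1
            elif prev in UNCOUNTABLE_DETS:
                counts["uncountable_det"] += 1
    return {f: (counts[f] / total if total else 0.0) for f in FEATURES}

corpus = [
    "We bought a dog",
    "Many dogs bark",
    "There is not much butter",
    "Each dog ate a little butter",
]
dog = feature_vector("dog", corpus)
butter = feature_vector("butter", corpus)
print(dog)
print(butter)
```

Vectors like these, one per noun, are what a supervised classifier would then be trained on for each countability class.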


Feature space

Head noun number: 1D target noun number as head of NP (e.g. a shaggy dog = SINGULAR)

Modifier noun number: 1D target noun number as modifier in NP (e.g. dog food = SINGULAR)

Subject–verb agreement: 2D target noun number as subject vs. verb number agreement (e.g. the dog barks = 〈SINGULAR,SINGULAR〉)

Coordinate noun number: 2D target noun number vs. the number of the head nouns of conjuncts (e.g. dogs and mud = 〈PLURAL,SINGULAR〉)

N1 of N2 constructions: 2D number of N2 vs. type of N1 (e.g. the type of dog = 〈TYPE,SINGULAR〉). We have identified a total of 11 N1 types for use in this feature cluster (e.g. COLLECTIVE, LACK, TEMPORAL).


Occurrence in PPs: 2D the presence or absence of a determiner (±DET) in singular head complement of PP (e.g. per dog = 〈per, −DET〉).

Pronoun co-occurrence: 2D what pronouns occur in the same sentence as singular and plural instances (e.g. The dog ate its dinner = 〈its,SINGULAR〉). Approximation of pronoun co-indexation.

Singular determiners: 1D singular-selecting dependents (e.g. a dog = a). Two types: countable (e.g. another, each), uncountable (e.g. much, little).

Plural determiners: 1D plural-selecting dependents (e.g. few dogs = few).

Non-bounded determiners: 2D which non-bounded dependents vs. target noun number (e.g. more dogs = 〈more,PLURAL〉).


Feature extraction

➣ Features extracted from the written portion of the BNC (re-tagged)

➢ British National Corpus (Balanced Corpus of English)

➣ Data considered: nouns with ≥ 10 instances for all 3 methods

➢ 20,530 common nouns


Classifier architecture

➣ Four parallel supervised classifiers (all use TiMBL, k = 9)

➣ A noun may be in multiple classes: train on nouns in the BNC with

➢ positive examples in both ALT-J/E and COMLEX
➢ negative examples in either ALT-J/E or COMLEX

Class        Positive data  Negative data  Baseline
Countable    4,342          1,476          .746
Uncountable  1,519          5,471          .783
Bipartite    35             5,639          .994
Plural only  84             5,639          .985
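The Baseline column is consistent with a majority-class baseline, that is, always answering with whichever of "member" or "non-member" is more frequent in the training data. The slides do not spell this definition out, so treat it as an assumption; the short check below (with the illustrative names `TRAINING_DATA` and `majority_baseline`) reproduces the four figures from the counts in the table.

```python
# Training-set sizes copied from the table: (positive, negative).
TRAINING_DATA = {
    "Countable": (4342, 1476),
    "Uncountable": (1519, 5471),
    "Bipartite": (35, 5639),
    "Plural only": (84, 5639),
}

def majority_baseline(positive, negative):
    """Accuracy of always predicting the more frequent of the two labels."""
    return max(positive, negative) / (positive + negative)

baselines = {
    name: round(majority_baseline(pos, neg), 3)
    for name, (pos, neg) in TRAINING_DATA.items()
}
print(baselines)
```

The very high baselines for the two small classes show why accuracy alone says little there: a classifier that always answers NO is already right more than 98% of the time.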


Cross-validated results (1)

Class        Accuracy (e.r.)  F-score
Countable    .939 (.759)      .960
Uncountable  .952 (.779)      .892
Bipartite    .996 (.403)      .722
Plural Only  .990 (.323)      .582

➣ Performs well for countable and uncountable

➣ Much harder for small classes: easier to always say NO


Open data results

➣ 11,499 unseen feature-mapped common nouns

➣ Classified 10,355 (90.0%):

Class        Nouns  %      Example
Countable    7,974  77.0%  alchemist
Uncountable  2,588  25.0%  ingenuity
Bipartite    9      0.1%   headphones
Plural only  80     0.8%   damages

➣ 139 nouns assigned to multiple countability classes

➣ Combined lexicon contained 4,982 of the nouns: precision for these nouns is 94.6% (baseline 89.7%)


Hand evaluation

➣ 100 nouns from test data

➢ Precision of 92.4% (37.7% e.r.)
➢ Baseline (87.8%: all countable)
➢ Agreement with lexicons 92.4%

➣ 100 nouns from training data

➢ Baseline (80.5%: all countable)
➢ Agreement with lexicons 86.8%

➣ Classifiers agree with the corpus better than the lexicons do
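The "e.r." figures are error reductions over the baseline. The slides do not define the term; assuming the usual definition, the share of the baseline's errors that the classifier removes, the 37.7% quoted for the hand evaluation and the .779 quoted for the uncountable classifier can both be recomputed from the rounded scores. The function name `error_reduction` is chosen here for illustration.

```python
def error_reduction(score, baseline):
    """Share of the baseline's errors that the classifier removes."""
    return (score - baseline) / (1.0 - baseline)

# Hand evaluation on test data: precision 92.4%, baseline 87.8%.
hand_eval = round(error_reduction(0.924, 0.878), 3)

# Uncountable classifier under cross-validation: accuracy .952, baseline .783.
uncountable = round(error_reduction(0.952, 0.783), 3)

print(hand_eval, uncountable)
```

Because the denominator is the baseline's error rate, a small absolute gain over a strong baseline can still be a large relative improvement.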


Corpus Data

➣ Able to classify nouns with a precision of 94.6%

➣ Need to assign multiple classes more often

➣ Can classify more finely

➢ ideally a continuum from countable to uncountable
➢ at least the noun countability preferences (ALT-J/E): Fully, Strongly, Weakly, Un-Countable, Plural Only, Bipartite

➣ Final precision comparable with existing lexicons

⇒ We can automatically acquire English noun countability information from text


Meta-Conclusion

➣ Distributional approach works well (better than ontology)

“You shall know a word by the company it keeps.”

Firth, J.R. Modes of Meaning. Papers in Linguistics, 1934-1951. London: Oxford University Press, 1957, p11.

➣ However, it only works for words with > 10 examples

➣ It does not find countability per sense

➣ Ideally we should combine syntactic distribution with semantic distribution

➣ Should use a more fine grained ontology


Conclusions

➣ There is a lot of information out there

➣ Much of it is unstructured text

➣ Using NLP techniques we can extract this information

➢ But we can’t trust it all


Readings and Acknowledgments

➣ Jurafsky and Martin (2008) Chapter 22, esp 22.2, 22.4

➣ Some of the text mining slides are based on www.rene-witte.net/system/files/IntroductionToTextMining.pdf

➣ Some figures are from Jurafsky and Martin (2008)

➣ Great Survey on NER: Nadeau, David and Satoshi Sekine (2007) A survey of named entity recognition and classification. Linguisticae Investigationes 30(1):3–26. nlp.cs.nyu.edu/sekine/papers/li07.pdf


Bibliography


References

Timothy Baldwin and Francis Bond. 2003. Learning the countability of English nouns from corpus data. In 41st Annual Meeting of the Association for Computational Linguistics: ACL-2003, pages 463–470. Sapporo, Japan.

Francis Bond and Caitlin Vatikiotis-Bateson. 2002. Using an ontology to determine English countability. In 19th International Conference on Computational Linguistics: COLING-2002, volume 1, pages 99–105. Taipei.

Daniel Jurafsky and James H. Martin. 2008. Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics and Speech Recognition. Prentice Hall, second edition.


Rion Snow, Daniel Jurafsky, and Andrew Y. Ng. 2006. Semantic taxonomy induction from heterogenous evidence. In Proceedings of the 21st International Conference on Computational Linguistics and 44th Annual Meeting of the Association for Computational Linguistics, pages 801–808. Association for Computational Linguistics, Sydney, Australia. URL http://www.aclweb.org/anthology/P/P06/P06-1101.
