Text Mining, by Hadi Mohammadzadeh



Page 1: Text mining, By Hadi Mohammadzadeh


Hadi Mohammadzadeh Text Mining Pages

By: Hadi Mohammadzadeh, Institute of Applied Information Processing, University of Ulm – 15 Dec. 2009

Seminar on

Text Mining

Page 2: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining

Outline
1. Basics
2. Latent Semantic Indexing
3. Part of Speech (POS) Tagging
4. Information Extraction
5. Clustering Documents
6. Text Categorization

Page 3: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining
Part One

Basics

Page 4: Text mining, By Hadi Mohammadzadeh


Definition: Text Mining

• Text Mining can be defined as a knowledge-intensive process in which a user interacts with a document collection over time by using a suite of analysis tools.

And

• Text Mining seeks to extract useful information from data sources (document collections) through the identification and exploration of interesting patterns.

Page 5: Text mining, By Hadi Mohammadzadeh


Similarities between Data Mining and Text Mining

• Both types of systems rely on:
– Preprocessing routines
– Pattern-discovery algorithms
– Presentation-layer elements such as visualization tools

Page 6: Text mining, By Hadi Mohammadzadeh


Preprocessing Operations in Data Mining and Text Mining

• In Data Mining, data are assumed to be stored in a structured format, so preprocessing focuses on scrubbing and normalizing data and on creating extensive numbers of table joins.

• In Text Mining, preprocessing operations center on the identification and extraction of representative features for natural-language documents, to transform unstructured data stored in doc collections into a more explicitly structured intermediate format.

Page 7: Text mining, By Hadi Mohammadzadeh


Weakly Structured and Semistructured Docs

• Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure are referred to as free-format or weakly structured docs (such as most scientific research papers, business reports, and news stories).

• Documents with extensive and consistent format elements, in which field-type metadata can be more easily inferred, are described as semistructured docs (such as some e-mail, HTML web pages, and PDF files).

Page 8: Text mining, By Hadi Mohammadzadeh


Document Features

• Although many potential features can be employed to represent docs, the following four types are most commonly used:
– Characters
– Words
– Terms
– Concepts

• High Feature Dimensionality (HFD)
– Problems relating to HFD are typically of much greater magnitude in TM systems than in classic DM systems.

• Feature Sparsity
– Only a small percentage of all possible features for a document collection as a whole appears in any single doc.
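As an aside not on the original slides, feature sparsity is easy to see with a toy example: build the word-level feature space for a tiny hypothetical collection and count how many features each doc actually uses (a minimal Python sketch; the documents are invented):

```python
from collections import Counter

# Toy document collection (hypothetical examples).
docs = [
    "text mining extracts patterns from document collections",
    "data mining relies on preprocessing and pattern discovery",
    "clustering groups similar documents together",
]

# Word-level features: the union of all words is the feature space.
tokenized = [doc.split() for doc in docs]
vocabulary = sorted(set(w for toks in tokenized for w in toks))

# Each doc uses only a small fraction of the full feature space.
for toks in tokenized:
    counts = Counter(toks)
    used = sum(1 for w in vocabulary if counts[w] > 0)
    print(f"{used}/{len(vocabulary)} features non-zero")
```

Even in this tiny collection each document leaves most of the feature space empty; real collections are far sparser.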

Page 9: Text mining, By Hadi Mohammadzadeh


Representational Model of a Document

• An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole. We refer to such a set of features as the representational model of a document.

Page 10: Text mining, By Hadi Mohammadzadeh


Character-Level Representation

• Without positional information
– Often of very limited utility in TM applications
• With positional information
– Somewhat more useful and common (e.g. bigrams or trigrams)
• Disadvantage:
– Character-based representations can often be unwieldy for some types of text processing techniques, because the feature space for a doc is fairly unoptimized

Page 11: Text mining, By Hadi Mohammadzadeh


Word-Level Representation

• Without positional information
– Often of very limited utility in TM applications
• With positional information
– Somewhat more useful and common (e.g. bigrams or trigrams)
• Disadvantage:
– Word-based representations can often be unwieldy for some types of text processing techniques, because the feature space for a doc is fairly unoptimized

Page 12: Text mining, By Hadi Mohammadzadeh


Term-Level Representation

• Normalized terms come out of a term-extraction methodology
– Sequences of one or more tokenized and lemmatized words

• What is a term-extraction methodology?

Page 13: Text mining, By Hadi Mohammadzadeh


Concept-Level Representation

• Concepts are features generated for a document by means of a manual, statistical, rule-based, or hybrid categorization methodology.

Page 14: Text mining, By Hadi Mohammadzadeh


General Architecture of Text Mining Systems – Abstract Level

• A text mining system takes raw docs as input and generates various types of output, such as:
– Patterns
– Maps of connections
– Trends

[Figure: documents (input) → text mining system → patterns, connections, trends (output)]

Page 15: Text mining, By Hadi Mohammadzadeh


General Architecture of Text Mining Systems – Functional Level

• TM systems follow the general model provided by some classic DM applications and are thus divisible into four main areas:
– Preprocessing tasks
– Core mining operations
– Presentation-layer components and browsing functionality
– Refinement techniques

Page 16: Text mining, By Hadi Mohammadzadeh


System Architecture for Generic Text Mining System

Page 17: Text mining, By Hadi Mohammadzadeh


System Architecture for Domain-oriented Text Mining System

Page 18: Text mining, By Hadi Mohammadzadeh


System Architecture for an Advanced Text Mining System with a Background Knowledge Base

Page 19: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining

Part Two

Latent Semantic Indexing (LSI)

Page 20: Text mining, By Hadi Mohammadzadeh


Problems with Lexical Semantics

• Ambiguity and association in natural language

– Polysemy: Words often have a multitude of meanings and different types of usage such as bank (more severe in very heterogeneous collections).

– The vector space model is unable to discriminate between different meanings of the same word.

Page 21: Text mining, By Hadi Mohammadzadeh


Problems with Lexical Semantics

– Synonymy: Different terms may have an identical or a similar meaning (weaker: words indicating the same topic).

– No associations between words are made in the vector space representation.

– The problem of synonymy may be solved with LSI

Page 22: Text mining, By Hadi Mohammadzadeh


Polysemy and Context

• Document similarity on the single-word level: polysemy and context

[Figure: the word "saturn" appears in two clusters of co-occurring terms — meaning 1: ring, jupiter, space, voyager, planet; meaning 2: car, company, dodge, ford. A shared word contributes to similarity if used in the 1st meaning, but not if in the 2nd.]

Page 23: Text mining, By Hadi Mohammadzadeh


Latent Semantic Indexing: Introduction

• Problem: the first frequency-based indexing methods did not utilize any global relationships within the doc collection

• Solution: LSI is an indexing method based on the Singular Value Decomposition (SVD)

• How: SVD transforms the word–document matrix such that the major intrinsic associative patterns in the collection are revealed

Page 24: Text mining, By Hadi Mohammadzadeh


Latent Semantic Indexing: Introduction

• Main advantage: LSI does not depend on individual words to locate documents, but rather uses the concept or topic to find relevant docs

• Usage: when a researcher submits a query, it is transformed into the LSI space and compared with other docs in the same space

Page 25: Text mining, By Hadi Mohammadzadeh


Singular Value Decomposition

For an M×N matrix A of rank r there exists a factorization (Singular Value Decomposition = SVD) as follows:

A = U Σ V^T

where U is M×M, Σ is M×N, and V is N×N.

The columns of U are orthogonal eigenvectors of AA^T.
The columns of V are orthogonal eigenvectors of A^T A.
Σ = diag(σ1, …, σr) holds the singular values, with σi = √λi, where the eigenvalues λ1 … λr of AA^T are also the eigenvalues of A^T A.
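The factorization can be checked numerically. The sketch below (not from the slides; NumPy on a small hypothetical matrix) verifies A = UΣV^T, the orthonormality of the columns of U and V, and that the singular values are the square roots of the eigenvalues of A^T A:

```python
import numpy as np

# Small hypothetical term–document matrix A (M=4 terms, N=3 docs).
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=True)  # A = U diag(s) V^T

# Rebuild A from the factors: pad diag(s) out to an M x N Sigma.
Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)
assert np.allclose(A, U @ Sigma @ Vt)

# Columns of U (resp. V) are orthonormal eigenvectors of AA^T (resp. A^T A).
assert np.allclose(U.T @ U, np.eye(U.shape[0]))
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))

# Singular values are the square roots of the eigenvalues of A^T A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(s**2, eigvals)
```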

Page 26: Text mining, By Hadi Mohammadzadeh


Singular Value Decomposition

• Illustration of SVD dimensions and sparseness

Page 27: Text mining, By Hadi Mohammadzadeh


• Solution via SVD

Low-rank Approximation

Set the smallest r−k singular values to zero:

A_k = U diag(σ1, …, σk, 0, …, 0) V^T

Column notation: sum of k rank-1 matrices:

A_k = Σ_{i=1}^{k} σi u_i v_i^T
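A minimal sketch (NumPy, small hypothetical matrix) showing that zeroing the smallest singular values is the same as summing the first k rank-1 terms:

```python
import numpy as np

# Hypothetical rank-3 matrix.
A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0],
              [2.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2
# Keep only the k largest singular values ...
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ... which equals the sum of the first k rank-1 matrices sigma_i * u_i v_i^T.
A_k_sum = sum(s[i] * np.outer(U[:, i], Vt[i, :]) for i in range(k))
assert np.allclose(A_k, A_k_sum)
assert np.linalg.matrix_rank(A_k) == k
```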

Page 28: Text mining, By Hadi Mohammadzadeh


Reduced SVD

• If we retain only k singular values, and set the rest to 0, then we don't need the matrix parts in red
• Then Σ is k×k, U is M×k, V^T is k×N, and A_k is M×N
• This is referred to as the reduced SVD
• It is the convenient (space-saving) and usual form for computational applications
• It's what Matlab gives you

Page 29: Text mining, By Hadi Mohammadzadeh


Approximation error

• How good (bad) is this approximation?
• It's the best possible, measured by the Frobenius norm of the error:

min_{X : rank(X)=k} ||A − X||_F = ||A − A_k||_F = sqrt(σ_{k+1}² + … + σ_r²)

where the σi are ordered such that σi ≥ σi+1. This suggests why the Frobenius error drops as k is increased.
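This optimality claim can be probed numerically. The hypothetical sketch below checks that the Frobenius error of A_k equals the root-sum-square of the discarded singular values, and therefore shrinks as k grows:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 5))          # hypothetical dense matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)

for k in range(1, 5):
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    err = np.linalg.norm(A - A_k, "fro")
    # Frobenius error = sqrt of the sum of squares of the discarded
    # singular values, so it drops monotonically as k increases.
    assert np.isclose(err, np.sqrt(np.sum(s[k:] ** 2)))
```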

Page 30: Text mining, By Hadi Mohammadzadeh


SVD Low-rank approximation

• Whereas the term–doc matrix A may have M = 50,000 and N = 10 million (and rank close to 50,000),
• we can construct an approximation A_100 with rank 100.
– Of all rank-100 matrices, it would have the lowest Frobenius error.
• Great … but why would we?
• Answer: Latent Semantic Indexing

Page 31: Text mining, By Hadi Mohammadzadeh


Latent Semantic Indexing (LSI)

• Perform a low-rank approximation of the document–term matrix (typical rank 100–300)

• General idea:
– Map documents (and terms) to a low-dimensional representation.
– Design the mapping such that the low-dimensional space reflects semantic associations (latent semantic space).
– Compute document similarity based on the inner product in this latent semantic space.
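The general idea above can be sketched end-to-end (a hypothetical toy term–document matrix; NumPy assumed). Documents are mapped to k dimensions and compared there:

```python
import numpy as np

# Hypothetical term–document count matrix (4 terms x 3 docs).
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 1.0, 2.0]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
# k-dimensional LSI representation of each document:
# the columns of diag(s_k) @ V_k^T, taken as rows here.
doc_vecs = (np.diag(s[:k]) @ Vt[:k, :]).T

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Compare documents by cosine similarity in the latent space.
print(cos(doc_vecs[0], doc_vecs[1]), cos(doc_vecs[0], doc_vecs[2]))
```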

Page 32: Text mining, By Hadi Mohammadzadeh


Goals of LSI

• Similar terms map to similar locations in the low-dimensional space

• Noise reduction by dimension reduction

Page 33: Text mining, By Hadi Mohammadzadeh


Latent Semantic Analysis

• Latent semantic space: illustrating example

courtesy of Susan Dumais

Page 34: Text mining, By Hadi Mohammadzadeh


Performing the maps

• Each row and column of A gets mapped into the k-dimensional LSI space, by the SVD.

• Claim: this is not only the mapping with the best (Frobenius error) approximation to A, but it in fact improves retrieval.

• A query q is also mapped into this space, by

q_k = Σ_k^{-1} U_k^T q

– The query is NOT a sparse vector.
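A minimal sketch of the query mapping q_k = Σ_k^{-1} U_k^T q (NumPy assumed; the matrix and query are hypothetical):

```python
import numpy as np

# Hypothetical term–document matrix (4 terms x 3 docs).
A = np.array([[2.0, 0.0, 0.0],
              [0.0, 2.0, 0.0],
              [0.0, 0.0, 2.0],
              [1.0, 1.0, 2.0]])
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2

def to_lsi(q):
    """Map a sparse term-space vector into the k-dim LSI space:
    q_k = Sigma_k^{-1} U_k^T q."""
    return np.diag(1.0 / s[:k]) @ U[:, :k].T @ q

q = np.array([1.0, 0.0, 0.0, 1.0])  # hypothetical query using terms 0 and 3
q_k = to_lsi(q)
print(q_k)  # a dense k-dimensional vector, comparable to the mapped docs
```

Mapping a document column with the same function lands it on the corresponding row of V_k, so queries and documents are directly comparable in the latent space.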

Page 35: Text mining, By Hadi Mohammadzadeh


But why is this clustering?

• We’ve talked about docs, queries, retrieval and precision here.

• What does this have to do with clustering?

• Intuition: Dimension reduction through LSI brings together “related” axes in the vector space.

Page 36: Text mining, By Hadi Mohammadzadeh


Intuition from block matrices

[Figure: an M-terms × N-documents matrix with homogeneous non-zero blocks Block 1 … Block k on the diagonal and 0's elsewhere.]

What's the rank of this matrix?

Page 37: Text mining, By Hadi Mohammadzadeh


Intuition from block matrices

[Figure: the same block-diagonal M-terms × N-documents matrix.]

Vocabulary partitioned into k topics (clusters); each doc discusses only one topic.

Page 38: Text mining, By Hadi Mohammadzadeh


Intuition from block matrices

[Figure: the block-diagonal matrix again, with non-zero entries inside the blocks Block 1 … Block k and 0's elsewhere.]

What's the best rank-k approximation to this matrix?

Page 39: Text mining, By Hadi Mohammadzadeh


Intuition from block matrices

[Figure: an almost block-diagonal matrix (Block 1 … Block k) with a few nonzero entries outside the blocks; e.g. terms such as wiper, tire, V6 and car, automobile with a few cross-block entries.]

Likely there's a good rank-k approximation to this matrix.

Page 40: Text mining, By Hadi Mohammadzadeh


Simplistic picture

[Figure: points grouped around three directions labeled Topic 1, Topic 2, Topic 3.]

Page 41: Text mining, By Hadi Mohammadzadeh


Some wild extrapolation

• The "dimensionality" of a corpus is the number of distinct topics represented in it.

• More mathematical wild extrapolation:
– If A has a rank-k approximation of low Frobenius error, then there are no more than k distinct topics in the corpus.

Page 42: Text mining, By Hadi Mohammadzadeh


LSI has many other applications

• In many settings in pattern recognition and retrieval, we have a feature–object matrix.
– For text, the terms are features and the docs are objects.
– Could be opinions and users …
– This matrix may be redundant in dimensionality.
– Can work with a low-rank approximation.
– If entries are missing (e.g., users' opinions), can recover if dimensionality is low.

• Powerful general analytical technique
– Close, principled analog to clustering methods.

Page 43: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining

Part Three

Part of Speech (POS) Tagging

Page 44: Text mining, By Hadi Mohammadzadeh


Definition of POS

“The process of assigning a part-of-speech or other lexical class marker to each word in a corpus” (Jurafsky and Martin)

[Figure: WORDS "the girl kissed the boy on the cheek" each assigned one of the TAGS {N, V, P, DET}.]

Page 45: Text mining, By Hadi Mohammadzadeh


An Example

[Figure: WORD "the girl kissed the boy on the cheek" → LEMMA "the girl kiss the boy on the cheek" → TAG +DET +NOUN +VPAST +DET +NOUN +PREP +DET +NOUN]

Page 46: Text mining, By Hadi Mohammadzadeh


Motivation of POS

• Speech synthesis — pronunciation
• Speech recognition — class-based N-grams
• Information retrieval — stemming, selection of high-content words
• Word-sense disambiguation
• Corpus analysis of language & lexicography

Page 47: Text mining, By Hadi Mohammadzadeh


Word Classes

Basic word classes: Noun, Verb, Adjective, Adverb, Preposition, …

Open vs. closed classes
Open: Nouns, Verbs, Adjectives, Adverbs
Closed:
– determiners: a, an, the
– pronouns: she, he, I
– prepositions: on, under, over, near, by, …

Page 48: Text mining, By Hadi Mohammadzadeh


Word Classes: Tag Sets

• Tag sets vary in number of tags: from a dozen to over 200
• The size of the tag set depends on language, objectives, and purpose
– Some tagging approaches (e.g., constraint-grammar based) make fewer distinctions, e.g., conflating prepositions, conjunctions, particles
– Simple morphology = more ambiguity = fewer tags

Page 49: Text mining, By Hadi Mohammadzadeh


Word Classes: Tag set example

Page 50: Text mining, By Hadi Mohammadzadeh


The Problem

• Words often have more than one word class, e.g. this:
– This is a nice day = PRP (pronoun)
– This day is nice = DT (determiner)
– You can go this far = RB (adverb)

Page 51: Text mining, By Hadi Mohammadzadeh


Word Class Ambiguity (in the Brown Corpus)

• Unambiguous (1 tag): 35,340
• Ambiguous (2–7 tags): 4,100
– 2 tags: 3,760
– 3 tags: 264
– 4 tags: 61
– 5 tags: 12
– 6 tags: 2
– 7 tags: 1
(DeRose, 1988)

Page 52: Text mining, By Hadi Mohammadzadeh


POS Tagging Methods

• Stochastic tagger: HMM-based (using the Viterbi algorithm)

• Rule-based tagger: ENGTWOL (ENGlish TWO Level analysis)

• Transformation-based tagger (Brill)

Page 53: Text mining, By Hadi Mohammadzadeh


Stochastic Tagging

• Based on the probability of a certain tag occurring, given various possibilities
• Requires a training corpus
• No probabilities for words not in the corpus
• Simple method: choose the most frequent tag in the training text for each word!
– Result: 90% accuracy
– This is the baseline; others will do better
– HMM is an example
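The "most frequent tag" baseline described above fits in a few lines of Python (the tiny tagged training corpus here is hypothetical):

```python
from collections import Counter, defaultdict

# Tiny hypothetical tagged training corpus: (word, tag) pairs.
train = [("the", "DT"), ("race", "NN"), ("is", "VBZ"), ("on", "IN"),
         ("to", "TO"), ("race", "VB"), ("the", "DT"), ("race", "NN")]

# For each word, count its tags and keep the most frequent one.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
most_frequent_tag = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def tag(sentence, default="NN"):
    # Unseen words get a default tag (cf. Method 1 for unknown words: assume noun).
    return [(w, most_frequent_tag.get(w, default)) for w in sentence]

print(tag(["the", "race", "tomorrow"]))
# → [('the', 'DT'), ('race', 'NN'), ('tomorrow', 'NN')]
```

Note that this baseline always tags "race" as NN, even after "to"; that is exactly the kind of error the context-aware taggers below fix.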

Page 54: Text mining, By Hadi Mohammadzadeh


HMM Tagger

• Intuition: pick the most likely tag for this word.
• HMM taggers choose the tag sequence that maximizes this formula:
– P(word|tag) × P(tag|previous n tags)
• Let T = t1,t2,…,tn and W = w1,w2,…,wn
• Find the POS tags that generate the sequence of words, i.e., look for the most probable sequence of tags T underlying the observed words W.
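The most probable tag sequence can be found with the Viterbi algorithm mentioned earlier. Below is a sketch on a hypothetical toy HMM (all probabilities invented for illustration; real taggers estimate them from a tagged corpus):

```python
import math

# Hypothetical toy HMM: transition P(tag|prev tag) and emission P(word|tag).
states = ["DT", "NN", "VB"]
start_p = {"DT": 0.6, "NN": 0.3, "VB": 0.1}
trans_p = {"DT": {"DT": 0.05, "NN": 0.9, "VB": 0.05},
           "NN": {"DT": 0.1, "NN": 0.3, "VB": 0.6},
           "VB": {"DT": 0.5, "NN": 0.3, "VB": 0.2}}
emit_p = {"DT": {"the": 0.9, "race": 0.0, "runs": 0.0},
          "NN": {"the": 0.0, "race": 0.8, "runs": 0.2},
          "VB": {"the": 0.0, "race": 0.3, "runs": 0.7}}

def viterbi(words):
    # V[t][s] = best log-prob of any tag sequence ending in state s at position t.
    V = [{s: math.log(start_p[s] * emit_p[s][words[0]] + 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(words)):
        V.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: V[t-1][p] + math.log(trans_p[p][s] + 1e-12))
            V[t][s] = (V[t-1][prev] + math.log(trans_p[prev][s] + 1e-12)
                       + math.log(emit_p[s][words[t]] + 1e-12))
            back[t][s] = prev
    # Trace back the most probable tag sequence.
    last = max(states, key=lambda s: V[-1][s])
    tags = [last]
    for t in range(len(words) - 1, 0, -1):
        tags.append(back[t][tags[-1]])
    return list(reversed(tags))

print(viterbi(["the", "race", "runs"]))  # → ['DT', 'NN', 'VB']
```

The tiny 1e-12 terms are only there to avoid log(0) for impossible emissions.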

Page 55: Text mining, By Hadi Mohammadzadeh


Rule-Based Tagging

• Basic idea:
– Assign all possible tags to words
– Remove tags according to a set of rules of this type:

  if word+1 is an adj, adv, or quantifier and the following is a sentence boundary and word-1 is not a verb like "consider", then eliminate non-adv, else eliminate adv.

– Typically more than 1000 hand-written rules are used, but rules may also be machine-learned

Page 56: Text mining, By Hadi Mohammadzadeh


Stage 1 of ENGTWOL Tagging

First stage:
– Run words through a Kimmo-style morphological analyzer to get all parts of speech.

Example: "Pavlov had shown that salivation …"

Pavlov      PAVLOV N NOM SG PROPER
had         HAVE V PAST VFIN SVO
            HAVE PCP2 SVO
shown       SHOW PCP2 SVOO SVO SV
that        ADV
            PRON DEM SG
            DET CENTRAL DEM SG
            CS
salivation  N NOM SG

Page 57: Text mining, By Hadi Mohammadzadeh


Stage 2 of ENGTWOL Tagging

• Second stage: apply constraints.

• Constraints are used in a negative way.

• Example: adverbial "that" rule

Given input: "that"
If
  (+1 A/ADV/QUANT)
  (+2 SENT-LIM)
  (NOT -1 SVOC/A)
Then eliminate non-ADV tags
Else eliminate ADV

Page 58: Text mining, By Hadi Mohammadzadeh


Transformation-Based Tagging (Brill Tagging)

• Combination of rule-based and stochastic tagging methodologies
– Like rule-based taggers because rules are used to specify tags in a certain environment
– Like the stochastic approach because machine learning is used, with a tagged corpus as input

• Input:
– a tagged corpus
– a dictionary (with most frequent tags), usually constructed from the tagged corpus

Page 59: Text mining, By Hadi Mohammadzadeh


Transformation-Based Tagging (cont.)

• Basic idea:
– Set the most probable tag for each word as a start value
– Change tags according to rules of the type "if word-1 is a determiner and word is a verb then change the tag to noun", applied in a specific order

• Training is done on a tagged corpus:
1. Write a set of rule templates
2. Among the set of rules, find the one with the highest score
3. Continue from step 2 until a lowest-score threshold is passed
4. Keep the ordered set of rules

• Rules make errors that are corrected by later rules

Page 60: Text mining, By Hadi Mohammadzadeh


TBL Rule Application

• The tagger labels every word with its most-likely tag
– For example, race has the following probabilities in the Brown corpus:
• P(NN|race) = .98
• P(VB|race) = .02

• Transformation rules make changes to tags
– "Change NN to VB when the previous tag is TO"

… is/VBZ expected/VBN to/TO race/NN tomorrow/NN
becomes
… is/VBZ expected/VBN to/TO race/VB tomorrow/NN
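The rule application above can be sketched directly (a hypothetical helper, not Brill's actual implementation):

```python
# A minimal sketch of applying one Brill-style transformation rule.
def apply_rule(tagged, from_tag, to_tag, prev_tag):
    """Change from_tag to to_tag when the previous tag is prev_tag."""
    out = list(tagged)
    for i in range(1, len(out)):
        word, tag = out[i]
        if tag == from_tag and out[i - 1][1] == prev_tag:
            out[i] = (word, to_tag)
    return out

sentence = [("is", "VBZ"), ("expected", "VBN"), ("to", "TO"),
            ("race", "NN"), ("tomorrow", "NN")]

# "Change NN to VB when the previous tag is TO"
print(apply_rule(sentence, "NN", "VB", "TO"))
# → [('is', 'VBZ'), ('expected', 'VBN'), ('to', 'TO'), ('race', 'VB'), ('tomorrow', 'NN')]
```

Only "race" changes: "tomorrow" is also NN, but its previous tag is not TO.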

Page 61: Text mining, By Hadi Mohammadzadeh


TBL: Rule Learning

• 2 parts to a rule:
– Triggering environment
– Rewrite rule

• The range of triggering environments of templates (from Manning & Schütze 1999:363):

[Table: nine schemas, each marking with * which of the tag positions t-3 … t+3 around ti triggers the rule.]

Page 62: Text mining, By Hadi Mohammadzadeh


TBL: The Algorithm

• Step 1: Label every word with its most likely tag (from the dictionary)

• Step 2: Check every possible transformation & select the one that most improves the tagging

• Step 3: Re-tag the corpus by applying the rules

• Repeat steps 2–3 until some criterion is reached, e.g., X% correct with respect to the training corpus

• RESULT: a sequence of transformation rules

Page 63: Text mining, By Hadi Mohammadzadeh


TBL: Rule Learning (cont’d)

• Problem: Could apply transformations ad infinitum!

• Constrain the set of transformations with "templates":
– Replace tag X with tag Y, provided tag Z or word Z' appears in some position

• Rules are learned in ordered sequence

• Rules may interact.

• Rules are compact and can be inspected by humans

Page 64: Text mining, By Hadi Mohammadzadeh


TBL: Problems

• Execution speed: the TBL tagger is slower than the HMM approach
– Solution: compile the rules to a Finite State Transducer (FST)

• Learning speed: Brill's implementation took over a day (600k tokens)

Page 65: Text mining, By Hadi Mohammadzadeh


Tagging Unknown Words

• New words are added to (newspaper) language at 20+ per month
• Plus many proper names …
• Unknown words increase error rates by 1-2%

• Method 1: assume they are nouns
• Method 2: assume the unknown words have a probability distribution similar to words occurring only once in the training set
• Method 3: use morphological information, e.g., words ending with -ed tend to be tagged VBN

Page 66: Text mining, By Hadi Mohammadzadeh


Evaluation

• The result is compared with a manually coded "Gold Standard"
– Typically accuracy reaches 96-97%
– This may be compared with the result for a baseline tagger (one that uses no context).

• Important: 100% is impossible even for human annotators.

• Factors that affect the performance:
– The amount of training data available
– The tag set
– The difference between training corpus and test corpus
– The dictionary
– Unknown words

Page 67: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining
Part Four

Information Extraction (IE)

Page 68: Text mining, By Hadi Mohammadzadeh


Definition

• An Information Extraction system generally converts unstructured text into a form that can be loaded into a database.

Page 69: Text mining, By Hadi Mohammadzadeh


Information Retrieval vs. Information Extraction

• While information retrieval deals with the problem of finding relevant documents in a collection, information extraction identifies useful (relevant) text in a document.

• Useful information is defined as a text segment and its associated attributes.

Page 70: Text mining, By Hadi Mohammadzadeh


An Example

• Query:
– List the news reports of car bombings in Basra and surrounding areas between June and December 2004.

• Answering this query is difficult with an information-retrieval system alone.

• To answer such queries, we need additional semantic information to identify text segments that refer to an attribute.

Page 71: Text mining, By Hadi Mohammadzadeh


Elements Extracted from Text

• There are four basic types of elements that can be extracted from text:
– Entities: the basic building blocks found in text documents, e.g. people, companies, locations, drugs
– Attributes: features of the extracted entities, e.g. the title of a person, the age of a person, the type of an organization
– Facts: the relations that exist between entities, e.g. the relationship between a person and a company
– Events: an activity or occurrence of interest in which entities participate, e.g. a terrorist act, a merger between two companies

Page 72: Text mining, By Hadi Mohammadzadeh


IE Applications

• E-Recruitment
• Extracting sales information
• Intelligence collection from news articles
• Message Understanding (MU)

Page 73: Text mining, By Hadi Mohammadzadeh


Named Entity Recognition (NER)

• NER can be viewed as a classification problem in which words are assigned to one or more semantic classes.

• The same methods we used to assign POS tags to words can be applied here.

• Unlike POS tags, not every word is associated with a semantic class.

• Like POS taggers, we can train an entity extractor to find entities in text using a tagged data set.

• Decision trees, HMMs, and rule-based methods can be applied to the classification task.

Page 74: Text mining, By Hadi Mohammadzadeh


Problems of NER

• Unknown words: they are difficult to categorize

• Finding the exact boundary of an entity

• Polysemy and synonymy — methods used for WSD are applicable here

Page 75: Text mining, By Hadi Mohammadzadeh


Architecture of an IE System

1. Extraction of tokens and tags
2. Semantic analysis: a partial parser is usually sufficient
3. Extractor: we look for domain-specific entities (e.g., a weather DB)
4. Merging multiple references to the same entity: finding a single canonical form
5. Template generation: a template contains a list of slots (fields)

[Figure: pipeline — Text → Tokenization and tagging (tokens, POS tags) → Sentence analysis (POS groups) → Extractor (assigned entities) → Merging (combined entities) → Template generation.]

Page 76: Text mining, By Hadi Mohammadzadeh


IE tools

• Fastus
– Finite State Automaton Text Understanding System

• Rapier– Robust Automated Production of Information Extraction Rules

Page 77: Text mining, By Hadi Mohammadzadeh


Fastus

• It is based on a series of finite-state machines that solve specific problems at each stage of the IE pipeline.

• A finite-state machine (FSM) recognizes a regular language, which can be described by regular expressions.

• A regular expression (regex) represents a string pattern.

• Regexes are used in IE to identify text segments that match some predefined pattern.

• An FSM applies a pattern to a window of text and transitions from one state to another until the pattern matches or fails to match.
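A minimal illustration of regex-based extraction in this spirit (the pattern and text below are hypothetical, not Fastus's actual rules):

```python
import re

# Hypothetical pattern for company names:
# one or more capitalized words followed by a corporate suffix.
company = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Inc|Corp|Ltd)\.?")

text = ("Acme Corp announced a merger with Global Widgets Inc. "
        "The deal was reported yesterday.")

print(company.findall(text))  # → ['Acme Corp', 'Global Widgets Inc.']
```

A regex engine is itself a finite-state machine, so this is the same mechanism the slide describes, just at a much smaller scale.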

Page 78: Text mining, By Hadi Mohammadzadeh


Stages of Fastus

• In the first stage, composite words and proper nouns are extracted, e.g. "set up", "carry out".

[Figure: five stages — Text → Stage 1: Complex Words → Stage 2: Basic Phrases → Stage 3: Complex Phrases → Stage 4: Event Structures → Stage 5: Merged Structures.]

Page 79: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining
Part Five

Clustering Documents

Page 80: Text mining, By Hadi Mohammadzadeh


What is clustering?

• Clustering: the process of grouping a set of objects into classes of similar objects
– Documents within a cluster should be similar.
– Documents from different clusters should be dissimilar.

• The commonest form of unsupervised learning
– Unsupervised learning = learning from raw data, as opposed to supervised learning, where a classification of examples is given
– A common and important task that finds many applications in IR and other places

Page 81: Text mining, By Hadi Mohammadzadeh


Applications of clustering in IR

• Whole-corpus analysis/navigation (Scatter/Gather)
– Better user interface: search without typing

• For improving recall in search applications
– Better search results

• For better navigation of search results
– Effective "user recall" will be higher

• For speeding up vector-space retrieval
– Cluster-based retrieval gives faster search

Page 82: Text mining, By Hadi Mohammadzadeh


Google News: automatic clustering gives an effective news presentation metaphor

Page 83: Text mining, By Hadi Mohammadzadeh


1. Scatter/Gather: Cutting, Karger, and Pedersen

Page 84: Text mining, By Hadi Mohammadzadeh


2. For improving search recall

• Cluster hypothesis: documents in the same cluster behave similarly with respect to relevance to information needs

• Therefore, to improve search recall:
– Cluster docs in the corpus a priori
– When a query matches a doc D, also return other docs in the cluster containing D

• Hope if we do this: the query "car" will also return docs containing automobile
– Because clustering grouped together docs containing car with those containing automobile

Page 85: Text mining, By Hadi Mohammadzadeh


3. For better navigation of search results

• For grouping search results thematically

Page 86: Text mining, By Hadi Mohammadzadeh


What makes docs “related”?

• Ideal: semantic similarity.

• Practical: statistical similarity
– We will use cosine similarity.
– Docs as vectors.
– For many algorithms, it is easier to think in terms of a distance (rather than a similarity) between docs.
– We will use Euclidean distance.
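Both measures are easy to state in code (a minimal sketch with hypothetical term-weight vectors):

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

d1 = [2.0, 0.0, 1.0]   # hypothetical term-weight vectors
d2 = [4.0, 0.0, 2.0]   # same direction as d1, different length
d3 = [0.0, 3.0, 0.0]   # no shared terms with d1

print(cosine(d1, d2), euclidean(d1, d2))  # cosine 1.0: identical direction
print(cosine(d1, d3), euclidean(d1, d3))  # cosine 0.0: orthogonal
```

Note the difference the two measures make: d1 and d2 point the same way (cosine 1.0) yet are a nonzero Euclidean distance apart, which is why cosine is preferred when document length should not matter.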

Page 87: Text mining, By Hadi Mohammadzadeh


Clustering Algorithms

• Flat algorithms
– Usually start with a random (partial) partitioning
– Refine it iteratively
• K-means clustering
• (Model-based clustering)

• Hierarchical algorithms
– Bottom-up, agglomerative
– (Top-down, divisive)

Page 88: Text mining, By Hadi Mohammadzadeh


Hard vs. soft clustering

• Hard clustering: each document belongs to exactly one cluster
– More common and easier to do

• Soft clustering: a document can belong to more than one cluster
– Makes more sense for applications like creating browsable hierarchies
– You may want to put a pair of sneakers in two clusters: (i) sports apparel and (ii) shoes
– You can only do that with a soft clustering approach

Page 89: Text mining, By Hadi Mohammadzadeh


Partitioning Algorithms

• Partitioning method: construct a partition of n documents into a set of K clusters

• Given: a set of documents and the number K

• Find: a partition into K clusters that optimizes the chosen partitioning criterion
– Globally optimal: exhaustively enumerate all partitions
– Effective heuristic methods: the K-means and K-medoids algorithms

Page 90: Text mining, By Hadi Mohammadzadeh


K-Means

• Assumes documents are real-valued vectors.

• Clusters are based on centroids (aka the center of gravity or mean) of the points in a cluster c:

μ(c) = (1/|c|) Σ_{x∈c} x

• Reassignment of instances to clusters is based on distance to the current cluster centroids.

Page 91: Text mining, By Hadi Mohammadzadeh


K-Means Algorithm

Select K random docs {s1, s2, …, sK} as seeds.
Until clustering converges (or another stopping criterion is met):
  For each doc di:
    Assign di to the cluster cj such that dist(di, sj) is minimal.
  (Update the seeds to the centroid of each cluster:)
  For each cluster cj:
    sj = μ(cj)
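The loop above can be sketched in plain Python. This is a minimal, unoptimized version; keeping an empty cluster's old seed is one common choice and not part of the slide's pseudocode:

```python
import random

def kmeans(docs, k, iters=100):
    # docs: equal-length real-valued vectors. Returns (centroids, clusters).
    seeds = random.sample(docs, k)
    clusters = []
    for _ in range(iters):
        # Assignment step: each doc goes to the cluster with the nearest seed.
        clusters = [[] for _ in range(k)]
        for d in docs:
            j = min(range(k),
                    key=lambda i: sum((a - b) ** 2 for a, b in zip(d, seeds[i])))
            clusters[j].append(d)
        # Update step: move each seed to the centroid (mean) of its cluster.
        new_seeds = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else seeds[j]
            for j, cl in enumerate(clusters)
        ]
        if new_seeds == seeds:  # centroid positions unchanged -> converged
            break
        seeds = new_seeds
    return seeds, clusters

random.seed(1)
docs = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
centroids, clusters = kmeans(docs, 2)
print(clusters)
```

On these two well-separated toy blobs the algorithm recovers the obvious 2-cluster partition regardless of which docs are sampled as seeds.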

Page 92: Text mining, By Hadi Mohammadzadeh


Termination conditions

• Several possibilities, e.g.:
  – A fixed number of iterations.
  – Doc partition unchanged.
  – Centroid positions don’t change.

Page 93: Text mining, By Hadi Mohammadzadeh


Seed Choice

• Results can vary based on random seed selection.

• Some seeds can result in a poor convergence rate, or convergence to sub-optimal clusterings.
  – Select good seeds using a heuristic (e.g., the doc least similar to any existing mean).
  – Try out multiple starting points.
  – Initialize with the results of another method.

• Example showing sensitivity to seeds: if you start with B and E as centroids you converge to {A,B,C} and {D,E,F}; if you start with D and F you converge to {A,B,D,E} and {C,F}.
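The first heuristic above ("doc least similar to any existing mean") is essentially furthest-first seeding. A minimal sketch, using Euclidean distance as the dissimilarity and made-up 2-D docs:

```python
import math

def furthest_first_seeds(docs, k):
    # Start from one doc, then repeatedly add the doc whose distance to its
    # nearest already-chosen seed is largest (i.e., least similar to any seed).
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    seeds = [docs[0]]
    while len(seeds) < k:
        seeds.append(max(docs, key=lambda d: min(dist(d, s) for s in seeds)))
    return seeds

docs = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [9.0, 0.0]]
print(furthest_first_seeds(docs, 3))
```

The chosen seeds land in three different regions of the data, which is exactly the spread that random sampling cannot guarantee.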

Page 94: Text mining, By Hadi Mohammadzadeh


How Many Clusters?

• Number of clusters K is given
  – Partition n docs into a predetermined number of clusters.

• Finding the “right” number of clusters is part of the problem
  – Given docs, partition into an “appropriate” number of subsets.
  – E.g., for query results, the ideal value of K is not known up front, though the UI may impose limits.

• Can usually take an algorithm for one flavor and convert to the other.

Page 95: Text mining, By Hadi Mohammadzadeh


K not specified in advance

• Given a clustering, define the Benefit for a doc to be the cosine similarity to its centroid

• Define the Total Benefit to be the sum of the individual doc Benefits.

Page 96: Text mining, By Hadi Mohammadzadeh


Penalize lots of clusters

• For each cluster, we have a Cost C.

• Thus for a clustering with K clusters, the Total Cost is KC.

• Define the Value of a clustering to be: Total Benefit − Total Cost.

• Find the clustering of highest Value, over all choices of K.
  – Total Benefit increases with increasing K, but we can stop when it doesn’t increase by “much”; the Cost term enforces this.
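The Benefit/Cost trade-off can be sketched as follows; the docs and the per-cluster Cost C = 0.5 are made-up illustration values:

```python
import math

def cos_to(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return (sum(x * y for x, y in zip(a, b)) / (na * nb)) if na and nb else 0.0

def value(clusters, cost_per_cluster):
    # Total Benefit: sum over docs of cosine similarity to the cluster centroid.
    # Total Cost: a fixed Cost C per cluster, i.e. K * C.
    total_benefit = 0.0
    for cl in clusters:
        centroid = [sum(col) / len(cl) for col in zip(*cl)]
        total_benefit += sum(cos_to(d, centroid) for d in cl)
    return total_benefit - cost_per_cluster * len(clusters)

docs = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0], [0.1, 0.9], [0.2, 0.8]]
print(value([docs[:3], docs[3:]], 0.5))   # two natural clusters
print(value([[d] for d in docs], 0.5))    # six singletons: max Benefit, high Cost
```

Six singleton clusters maximize Total Benefit (every doc coincides with its centroid), yet the 2-cluster split scores a higher Value here because the Cost term K·C penalizes the extra clusters.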

Page 97: Text mining, By Hadi Mohammadzadeh


Hierarchical Clustering

• Build a tree-based hierarchical taxonomy (dendrogram) from a set of documents.

• One approach: recursive application of a partitional clustering algorithm.

  animal
  ├── vertebrate: fish, reptile, amphib., mammal
  └── invertebrate: worm, insect, crustacean

Page 98: Text mining, By Hadi Mohammadzadeh


Dendrogram: Hierarchical Clustering

• A clustering is obtained by cutting the dendrogram at a desired level: each connected component forms a cluster.

Page 99: Text mining, By Hadi Mohammadzadeh


Hierarchical Agglomerative Clustering (HAC)

• Starts with each doc in a separate cluster
  – then repeatedly joins the closest pair of clusters, until there is only one cluster.

• The history of merging forms a binary tree or hierarchy.

Page 100: Text mining, By Hadi Mohammadzadeh


Closest pair of clusters

Many variants of defining the closest pair of clusters:

• Single-link
  – Similarity of the most cosine-similar pair
• Complete-link
  – Similarity of the “furthest” points, i.e., the least cosine-similar pair
• Centroid
  – Clusters whose centroids (centers of gravity) are the most cosine-similar
• Average-link
  – Average cosine between pairs of elements

Page 101: Text mining, By Hadi Mohammadzadeh


Closest pair of clusters

Page 102: Text mining, By Hadi Mohammadzadeh


Single Link Agglomerative Clustering

• Use maximum similarity of pairs:

• Can result in “straggly” (long and thin) clusters due to chaining effect.

• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$sim(c_i, c_j) = \max_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$

$sim((c_i \cup c_j), c_k) = \max\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)$

Page 103: Text mining, By Hadi Mohammadzadeh


Single Link Example

Page 104: Text mining, By Hadi Mohammadzadeh


Complete Link Agglomerative Clustering

• Use minimum similarity of pairs:

• Makes “tighter,” spherical clusters that are typically preferable.

• After merging ci and cj, the similarity of the resulting cluster to another cluster, ck, is:

$sim(c_i, c_j) = \min_{\vec{x} \in c_i,\, \vec{y} \in c_j} sim(\vec{x}, \vec{y})$

$sim((c_i \cup c_j), c_k) = \min\big(sim(c_i, c_k),\, sim(c_j, c_k)\big)$

Page 105: Text mining, By Hadi Mohammadzadeh


Complete Link Example

Page 106: Text mining, By Hadi Mohammadzadeh


Group Average Agglomerative Clustering

• Similarity of two clusters = average similarity of all pairs within merged cluster.

• Compromise between single- and complete-link.

• Two options:
  – Averaged across all ordered pairs in the merged cluster
  – Averaged over all pairs between the two original clusters

• No clear difference in efficacy.

$sim(c_i, c_j) = \frac{1}{|c_i \cup c_j|\,(|c_i \cup c_j| - 1)} \sum_{\vec{x} \in c_i \cup c_j} \; \sum_{\substack{\vec{y} \in c_i \cup c_j \\ \vec{y} \neq \vec{x}}} sim(\vec{x}, \vec{y})$

Page 107: Text mining, By Hadi Mohammadzadeh


Computing Group Average Similarity

• Always maintain sum of vectors in each cluster.

• Compute similarity of clusters in constant time:

$\vec{s}(c_j) = \sum_{\vec{x} \in c_j} \vec{x}$

$sim(c_i, c_j) = \frac{\big(\vec{s}(c_i) + \vec{s}(c_j)\big) \cdot \big(\vec{s}(c_i) + \vec{s}(c_j)\big) - (|c_i| + |c_j|)}{(|c_i| + |c_j|)\,(|c_i| + |c_j| - 1)}$
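The constant-time formula relies on the docs being unit-length, so that $\vec{x} \cdot \vec{x} = 1$ and cosine similarity reduces to a dot product. A sketch verifying it against the brute-force average, with made-up 2-D vectors:

```python
import math

def unit(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sum_vec(cluster):
    return [sum(col) for col in zip(*cluster)]

def group_avg_fast(ci, cj):
    # s.s expands to the sum of dot products over all ordered pairs; subtracting
    # the |ci|+|cj| self-pairs (each x.x = 1 for unit vectors) leaves the x != y pairs.
    s = [a + b for a, b in zip(sum_vec(ci), sum_vec(cj))]
    m = len(ci) + len(cj)
    return (dot(s, s) - m) / (m * (m - 1))

def group_avg_brute(ci, cj):
    # Average cosine over all ordered pairs x != y in the merged cluster.
    merged = ci + cj
    m = len(merged)
    total = sum(dot(x, y) for x in merged for y in merged if x is not y)
    return total / (m * (m - 1))

docs = [unit([1.0, 0.2]), unit([0.9, 0.3]), unit([0.2, 1.0]), unit([0.1, 0.8])]
print(group_avg_fast(docs[:2], docs[2:]))
print(group_avg_brute(docs[:2], docs[2:]))
```

Maintaining the sum vector of each cluster is what makes the fast version constant-time per merge, versus the quadratic pair enumeration of the brute-force version.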

Page 108: Text mining, By Hadi Mohammadzadeh


Seminar on Text Mining
Part Six

Text Categorization (TC)

Page 109: Text mining, By Hadi Mohammadzadeh


Approaches to TC

There are two main approaches to TC:

• Knowledge Engineering (KE)
  – The main drawback of the KE approach is what might be called the knowledge acquisition bottleneck: the huge amount of highly skilled labor and expert knowledge required to create and maintain the knowledge-encoding rules.

• Machine Learning (ML)
  – Requires only a set of manually classified training instances, which are much less costly to produce.

Page 110: Text mining, By Hadi Mohammadzadeh


Applications of TC

Three common TC applications are:

• Text Indexing

• Document sorting and text filtering

• Web page categorization

Page 111: Text mining, By Hadi Mohammadzadeh


Text Indexing(TI)

• The task of assigning keywords from a controlled vocabulary to text documents is called TI. If the keywords are viewed as categories, then TI is an instance of the general TC problem.

Page 112: Text mining, By Hadi Mohammadzadeh


Document sorting and text filtering

• Examples:
  – In a newspaper, the classified ads may need to be categorized into “Personal”, “Car Sales”, “Real Estate”.
  – Emails can be sorted into categories such as “Complaints”, “Deals”, “Job applications”.

• The text filtering activity can be seen as document sorting with only two bins: the “relevant” and the “irrelevant” docs.

Page 113: Text mining, By Hadi Mohammadzadeh


Web page categorization

• A common use of TC is the automatic classification of Web pages under the hierarchical catalogues posted by popular Internet portals such as Yahoo.

• Whenever the number of docs in a category exceeds k, it should be split into two or more subcategories.

• Web docs contain links, which may be an important source of information for the classifier, because linked docs often share semantics.

Page 114: Text mining, By Hadi Mohammadzadeh


Definition of the Problem

• The general text categorization task can be formally defined as the task of approximating an unknown category assignment function

  $F : D \times C \to \{0, 1\}$

  where D is the set of all possible docs and C is the set of predefined categories.

• The value of $F(d, c)$ is 1 if the document d belongs to the category c, and 0 otherwise.

• The approximation function $M : D \times C \to \{0, 1\}$ is called a classifier, and the task is to build a classifier that produces results as “close” as possible to the true category assignment function F.

Page 115: Text mining, By Hadi Mohammadzadeh


Types of Categorization

• Single-Label versus Multilabel Categorization
  – In multilabel categorization the categories overlap, and a document may belong to any number of categories.

• Document-Pivoted versus Category-Pivoted Categorization
  – The difference is significant only when not all docs, or not all categories, are immediately available.

• Hard versus Soft Categorization
  – Fully automated vs. semiautomated.

Page 116: Text mining, By Hadi Mohammadzadeh


Machine Learning Approaches to TC

• Decision Tree classifiers
• Naïve Bayes (probabilistic classifier)
• K-Nearest Neighbor classification
• Rocchio methods
• Decision Rule classifiers
• Neural Networks
• Support Vector Machines
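As an illustration of one entry in the list, a minimal multinomial Naïve Bayes classifier with add-one smoothing; the tiny spam/ham training set is made up for the example:

```python
import math
from collections import Counter, defaultdict

def train_nb(labeled_docs):
    # labeled_docs: list of (token_list, label) pairs.
    class_counts = Counter()              # docs per class
    word_counts = defaultdict(Counter)    # token frequencies per class
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def classify_nb(model, tokens):
    class_counts, word_counts, vocab = model
    n_docs = sum(class_counts.values())
    best = None
    for c in class_counts:
        total = sum(word_counts[c].values())
        # log P(c) + sum of log P(token | c), with add-one (Laplace) smoothing.
        logp = math.log(class_counts[c] / n_docs)
        for t in tokens:
            logp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if best is None or logp > best[0]:
            best = (logp, c)
    return best[1]

train = [("cheap pills buy now".split(), "spam"),
         ("buy cheap meds".split(), "spam"),
         ("meeting agenda attached".split(), "ham"),
         ("project meeting notes".split(), "ham")]
model = train_nb(train)
print(classify_nb(model, "cheap meds now".split()))  # -> "spam"
```

This "requires only a set of manually classified training instances", exactly the ML property the previous slide contrasts with knowledge engineering.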

Page 117: Text mining, By Hadi Mohammadzadeh


References

• Books
  – Introduction to Information Retrieval, 2008
  – Managing Gigabytes, 1999
  – The Text Mining Handbook
  – Text Mining Application Programming
  – Web Data Mining

Page 118: Text mining, By Hadi Mohammadzadeh


References

• Slide sets
  – Introduction to Information Retrieval, 2008
  – Text Mining Application Programming
  – Web Data Mining
  – Word classes and part of speech tagging, Rada Mihalcea (some of the material adapted from Chris Brew’s (OSU) slides on part-of-speech tagging)