
Ameeta Agrawal

Outline
- Parser Evaluation
- Text Clustering
- Common N-Grams classification method (CNG)

2

Parser Evaluation
- PARSEVAL measure
- Precision, recall, F-measure and crossing brackets

http://www.haskell.org/communities/11-2009/html/Parsing-ParseModule.jpg

3

Why automate evaluation?
- Manual inspection is slow and error-prone.
- Human evaluators may introduce bias.

4

Parser output vs. gold standard: formal definitions
- A gold standard is a 2-tuple GS = (S, A), where:
  - S = (s1, s2, …, sn) is a finite sequence of grammatical structures, i.e. constituents, dependency links or sentences.
  - A = (a1, a2, …, an) is a finite sequence of analyses; for each i, 1 ≤ i ≤ n, ai ∈ A is the analysis of si ∈ S.
- Let P be a parser. The parser output O(P, GS) = (P(s1), P(s2), …, P(sn)) is a sequence of analyses such that P(si), for each i, 1 ≤ i ≤ n, is the analysis assigned by parser P to sentence si ∈ S.

5

Parser evaluation
- Compare each element of O(P, GS) to the corresponding element of A.

[Figure: the sets of analyses in parser evaluation.]

6

PARSEVAL
- Parsers are usually evaluated using the PARSEVAL measures (Black et al., 1991).
- To compute the PARSEVAL measures:
  - The parse trees are decomposed into labelled constituents (LC). An LC is a triple consisting of the starting and ending point of a constituent's span in a sentence, and the constituent's label.
  - For each sentence, the set of LCs obtained from the parser tree (PT) is compared to the set obtained from the gold standard parse tree (GT).

7

Labelled vs. unlabelled
- Labelled PARSEVAL: two analyses match if and only if both the brackets and the labels (POS and syntactic tags) match.
- Unlabelled PARSEVAL: compares only the brackets.

8

PARSEVAL measures
- Precision
- Recall
- F-score
- Crossing brackets

9

Precision, recall, F-score
- Precision = (# of correct constituents in parser output) / (total # of constituents in parser output)
- Recall = (# of correct constituents in parser output) / (total # of constituents in gold standard)
- F-score = harmonic mean of precision and recall
  = 2 · (labelled precision) · (labelled recall) / ((labelled precision) + (labelled recall))
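As a concrete illustration of these definitions, here is a small Python sketch that scores one sentence by comparing sets of (start, end, label) constituent triples; the example spans are hypothetical, not taken from the slides.

```python
# Minimal labelled-PARSEVAL scoring sketch: constituents are (start, end, label)
# triples; a constituent is correct if it appears in both sets.

def parseval_scores(parser_constituents, gold_constituents):
    """Return labelled precision, recall and F-score for one sentence."""
    parser_set = set(parser_constituents)
    gold_set = set(gold_constituents)
    correct = len(parser_set & gold_set)              # same span and same label
    precision = correct / len(parser_set) if parser_set else 0.0
    recall = correct / len(gold_set) if gold_set else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score

# Illustrative (made-up) spans for "time flies like an arrow"
gold = [(0, 5, "S"), (0, 2, "NP"), (2, 5, "VP"), (3, 5, "NP")]
parsed = [(0, 5, "S"), (0, 1, "NP"), (1, 5, "VP"), (2, 5, "PP"), (3, 5, "NP")]
print(parseval_scores(parsed, gold))
```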

10

Crossing brackets
- The mean number of bracketed sequences in the parser output that overlap with, but are not nested in, constituents of the gold standard structure.
- Non-crossing and crossing brackets: the phrase boundaries [i, j] and [i', j'] are boundaries in the gold standard and the parser output respectively. The pair [i, j], [i', j'] is defined as a pair of crossing brackets if they overlap, that is, if i < i' < j < j'.
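A minimal sketch of the overlap test above, with spans given as hypothetical (start, end) pairs:

```python
def crosses(gold_span, parser_span):
    """True if the two spans overlap without one being nested inside the other."""
    i, j = gold_span
    i2, j2 = parser_span
    return (i < i2 < j < j2) or (i2 < i < j2 < j)

print(crosses((0, 4), (2, 6)))  # True: the spans partially overlap
print(crosses((0, 4), (1, 3)))  # False: the second span is nested inside the first
```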

11

Labelled PARSEVAL example
Consider the following two sentences, which are ambiguous for a parser:
- Time flies like an arrow.
- He ate the cake with a spoon.

12

Gold standard parse trees

(S (NP (NN time) (NN flies))
   (VP (VB like)
       (NP (DT an) (NN arrow))))

(S (NP (PRP he))
   (VP (VBD ate)
       (NP (DT the) (NN cake))
       (PP (IN with)
           (NP (DT a) (NN spoon)))))

13

Parser output parse trees

(S (NP (NN time))
   (VP (VB flies)
       (PP (IN like)
           (NP (DT an) (NN arrow)))))

(S (NP (PRP he))
   (VP (VBD ate)
       (NP (DT the) (NN cake)
           (PP (IN with)
               (NP (DT a) (NN spoon))))))

14

Labelled edges of parse trees - 1

15

Labelled edges of parse trees - 2

16

Result
- Precision = 73.9% (17/23)
- Recall = 77.2% (17/22)
- F-score = 75.5%

17

Unlabelled PARSEVAL example

A) [[He [hit [the post]]] [while [[the all-star goalkeeper] [was [out [of [the goal]]]]]]]

B) [He [[hit [the post]] [while [[the [[all-star] goalkeeper]] [was [out of [the goal]]]]]]]

A) is the gold standard structure and B) the parser output (adapted from Lin, 1998).

- Precision = 75.0% (9/12)
- Recall = 81.8% (9/11)
- F-score = 78.3%
- Crossing brackets = 1 pair

18

Strengths & weaknesses: PARSEVAL
Strength:
- State-of-the-art parsers obtain up to 90% precision and recall on the Penn Treebank data (Bod, 2003; Charniak and Johnson, 2005).
Weaknesses:
- Evaluation based on phrase-structure constituents abstracts away from basic predicate-argument relationships, which are important for correctly capturing the semantics of a sentence (Lin, 1998; Carroll et al., 2002).
- Using the same resource for training and testing may result in the parser learning systematic errors that are present in both the training and testing material (Rehbein and van Genabith, 2007).
Other metrics: the Leaf-Ancestor metric (G. Sampson, 1980s).

19

Text Clustering
- Task definition
- Partitional clustering: simple K-means
- Hierarchical clustering: divisive & agglomerative
- Evaluation of clustering: inter-cluster similarity, cluster purity, entropy or information gain

http://www.miner3d.com/images/kmeans_medium.jpg

20

Clustering
- Partition unlabeled examples into disjoint subsets (clusters), so that:
  - examples within a cluster are very similar
  - examples in different clusters are very different
- Discover new categories in an unsupervised manner.
- Inter-cluster distances are maximized; intra-cluster distances are minimized.

21

Notion of a cluster can be ambiguous
How many clusters? The same points can plausibly be grouped into two, four, or six clusters.

(Data Mining, Cluster Analysis: Basic Concepts and Algorithms, by Tan, Steinbach, Kumar)

22

Ambiguous web queries
- Web queries are often truly ambiguous: jaguar, NLP, paris hilton.
- It seems as if word sense disambiguation should help: different senses of jaguar include the animal, the car, OS X, …
- In practice, WSD doesn't help for web queries: disambiguation is either impossible ("jaguar") or trivial ("jaguar car").
- Instead, "cluster" the results into useful groupings.

23

Clusty: the clustering search engine

24

Types of clustering
- Partitional clustering: a division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset. E.g. K-means.
- Hierarchical clustering: a set of nested clusters organized as a hierarchical tree. E.g. agglomerative and divisive clustering.
- Density-based clustering: produces arbitrary-shaped clusters; a cluster is regarded as a region in which the density of data objects exceeds a threshold. E.g. DBSCAN and OPTICS.

25

Partitional clustering

[Figure: the original points and a partitional clustering of them.]

26

Hierarchical clustering

[Figure: traditional and non-traditional hierarchical clusterings of points p1–p4, with the corresponding traditional and non-traditional dendrograms.]

27

Other types of clustering
- Hard vs. soft: in hard clustering, each document is a member of exactly one cluster; in soft clustering, a document has fractional membership in several clusters.
- Exclusive vs. non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
- Fuzzy vs. non-fuzzy: in fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1, and the weights must sum to 1; probabilistic clustering has similar characteristics.
- Partial vs. complete: in some cases, we only want to cluster some of the data.
- Heterogeneous vs. homogeneous: clusters of widely different sizes, shapes, and densities.

28

K-means clustering
- Documents are represented as length-normalized vectors in a real-valued space (normalized, TF-IDF-weighted vectors).
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is, typically, the mean of the points in the cluster.
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.

29

K-means algorithm
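The algorithm itself is shown only as a figure in the original slides; below is a minimal NumPy sketch of the standard (Lloyd's) K-means procedure, for illustration.

```python
import numpy as np

def kmeans(points, k, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Choose k initial centroids at random from the data points.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # 2. Assign each point to the nearest centroid (Euclidean distance).
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # 3. Recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # 4. Stop when the centroids (and hence the assignments) no longer change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

data = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + [5, 5]])
labels, centroids = kmeans(data, k=2)
```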

30

Stopping criteria
- A fixed number of iterations has been completed.
- The assignment of documents to clusters does not change between iterations.
- The centroids do not change between iterations.
- The distance between the centroids and the data points falls below a threshold.

31

Two different K-means clusterings

[Figure: the same original points clustered two ways, a sub-optimal clustering and the optimal clustering.]

32

Choosing initial centroids - 1

[Figure: six panels showing iterations 1–6 of K-means from one choice of initial centroids.]

33

Choosing initial centroids - 1 (continued)

[Figure: the cluster assignments at iterations 1–6 of the same run, shown panel by panel.]

34

Choosing initial centroids - 2

[Figure: five panels showing iterations 1–5 of K-means from a different choice of initial centroids.]

35

Choosing initial centroids - 2 (continued)

[Figure: the cluster assignments at iterations 1–5 of the same run, shown panel by panel.]

36

Strengths & weaknesses: K-means
Strength:
- Relatively efficient: complexity is O(n · K · I), where n = number of points, K = number of clusters, I = number of iterations; normally K, I << n.
Weaknesses:
- Sensitive to the initial centroids.
- Need to specify K, the number of clusters, in advance.
- Very sensitive to noise and outliers.
- May have problems when clusters have different sizes.
- Not suitable for discovering clusters with non-convex shapes.
- Often terminates at a local optimum.

37

Hierarchical clustering
Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix, and merge or split one cluster at a time.

38

Agglomerative clustering (bottom-up)
The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4.   Merge the two closest clusters.
5.   Update the proximity matrix.
6. Until only a single cluster remains.
The key operation is the computation of the proximity of two clusters.
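For concreteness, a brief sketch of bottom-up clustering using SciPy's hierarchical-clustering routines (the library choice is an assumption; the slides name none). The method argument selects the inter-cluster proximity discussed on the following slides: 'single' = MIN, 'complete' = MAX, 'average' = group average, 'ward' = Ward's method.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.rand(10, 2)                   # ten random 2-D points
Z = linkage(points, method="average")            # bottom-up merges, group-average proximity
labels = fcluster(Z, t=3, criterion="maxclust")  # cut the dendrogram into 3 clusters
print(labels)
```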

39

Agglomerative example
Start with clusters of individual points and a proximity matrix.

[Figure: points p1–p12 as singleton clusters, with their proximity matrix.]

40

Agglomerative example
After some merging steps, we have some clusters.

[Figure: intermediate clusters C1–C5 and the corresponding proximity matrix.]

41

Agglomerative example
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.

[Figure: clusters C1–C5, with C2 and C5 about to be merged, and the proximity matrix.]

42

Agglomerative example
The question is: how do we update the proximity matrix after merging C2 and C5?

[Figure: the proximity matrix with the row and column of the merged cluster C2 ∪ C5 marked with question marks.]

43

Inter-cluster similarity
How do we define the similarity between two clusters?
- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function (e.g. Ward's method uses squared error)

[Figure: two candidate clusters among points p1–p5 and their proximity matrix.]

44

Inter-cluster similarity (continued)

[Figures: slides 45–49 repeat the list of inter-cluster proximity definitions (MIN, MAX, group average, distance between centroids, Ward's method) and illustrate them on the same sample points p1–p5.]

45–49

Hierarchical clustering comparison

[Figure: the same six points clustered with MIN, MAX, group average, and Ward's method, showing the different merge orders as dendrograms.]

http://www-users.cs.umn.edu/~kumar/dmbook/index.php#item4

50

Divisive clustering
- Top-down: split a cluster iteratively.
- Start with all objects in one cluster and subdivide them into smaller pieces.
- Less popular than agglomerative clustering.

51

Hierarchical clustering: strengths & weaknesses
Strengths:
- We do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level.
- The clusters may correspond to meaningful taxonomies, e.g. the animal kingdom in the biological sciences.
Weaknesses:
- When clusters are merged or split, the decision is permanent; erroneous decisions are impossible to correct later.
- Hierarchical methods do not scale well: space complexity is O(n²), where n = total number of points, and time complexity is O(n³) in many cases, since there are n steps and at each step a proximity matrix of size n² must be updated and searched.

52

Evaluating clustering
- Internal criterion: high intra-cluster similarity and low inter-cluster similarity.
- External criteria: compare against a gold standard produced by humans, e.g.:
  - Purity
  - Normalized mutual information
  - Rand index
  - F measure

53

Purity
- Each cluster is assigned to the class which is most frequent in the cluster.
- The accuracy of this assignment is then measured by counting the number of correctly assigned documents and dividing by N, the total number of data points.

54

Purity

[Figure: purity as an external evaluation criterion for cluster quality. The majority class accounts for 5 points (x) in cluster 1, 4 points (o) in cluster 2, and 3 points (⋄) in cluster 3, so purity is (5 + 4 + 3) / 17 ≈ 0.71.]

0 < Purity ≤ 1
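A short sketch of the purity computation: for each cluster, count its majority class, sum these counts, and divide by N. The cluster/class assignment below is hypothetical but reproduces the 5 + 4 + 3 majority counts of the example.

```python
from collections import Counter

def purity(clusters, classes):
    """clusters and classes are parallel lists: cluster id and true class per point."""
    n = len(classes)
    total = 0
    for c in set(clusters):
        members = [classes[i] for i in range(n) if clusters[i] == c]
        total += Counter(members).most_common(1)[0][1]   # size of the majority class
    return total / n

clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]) + (["o"] * 4 + ["x", "d"]) + (["d"] * 3 + ["x", "o"])
print(purity(clusters, classes))   # (5 + 4 + 3) / 17 ≈ 0.71
```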

55

Pitfall of purity
- Purity is 1 if each document gets its own cluster.
- Thus, we cannot use purity to trade off the quality of the clustering against the number of clusters.
- Solution: Normalized Mutual Information.

56

Normalized Mutual Information

NMI(Ω, C) = I(Ω; C) / [ (H(Ω) + H(C)) / 2 ]

where I is the mutual information between the clustering Ω and the classes C, H is entropy, and NMI is the mutual information divided by the averaged entropy:

I(Ω; C) = Σk Σj P(ωk ∩ cj) · log [ P(ωk ∩ cj) / (P(ωk) · P(cj)) ]
H(Ω) = − Σk P(ωk) · log P(ωk)

where P(ωk), P(cj), and P(ωk ∩ cj) are the probabilities of a document being in cluster ωk, in class cj, and in the intersection of ωk and cj, respectively.

57
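A hedged sketch of NMI as defined above, computed directly from cluster and class labels (natural logarithms; the same toy assignment as in the purity sketch):

```python
import math
from collections import Counter

def nmi(clusters, classes):
    n = len(classes)
    p_w = {w: c / n for w, c in Counter(clusters).items()}          # cluster probabilities
    p_c = {c: k / n for c, k in Counter(classes).items()}           # class probabilities
    p_wc = {wc: k / n for wc, k in Counter(zip(clusters, classes)).items()}
    mi = sum(p * math.log(p / (p_w[w] * p_c[c])) for (w, c), p in p_wc.items())
    h_w = -sum(p * math.log(p) for p in p_w.values())
    h_c = -sum(p * math.log(p) for p in p_c.values())
    return mi / ((h_w + h_c) / 2)

clusters = [1] * 6 + [2] * 6 + [3] * 5
classes = (["x"] * 5 + ["o"]) + (["o"] * 4 + ["x", "d"]) + (["d"] * 3 + ["x", "o"])
print(nmi(clusters, classes))   # between 0 and 1 for this toy clustering
```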

Mutual information
- MI measures the amount of information by which our knowledge about the classes increases when we are told what the clusters are.
- MI is minimal if the clustering is random, and maximal if K = N, i.e. one-document clusters.
- So MI has the same problem as purity: it does not penalize large cardinalities, and does not capture the bias that, other things being equal, fewer clusters are better.
- The normalization by the denominator [H(Ω) + H(C)] / 2 fixes this problem, since entropy tends to increase with the number of clusters.
- 0 ≤ NMI ≤ 1

58

CNG - Common N-Gram analysis
- Definition
- Example
- Similarity measure

http://afflatus.ucd.ie/attachment/2009_6/tn_1246380626644.jpg
http://home.arcor.de/David-Peters/n-Grams.png

59

N-grams
- An n-gram model is a type of probabilistic model for predicting the next item in a sequence.
- Items can be phonemes, syllables, letters, words or base pairs.
- Building n-grams involves splitting a sentence into chunks of n consecutive items.

60

N-grams example

“I don’t know what to say”
- 1-gram (unigram): I, don’t, know, what, to, say
- 2-gram (bigram): I don’t, don’t know, know what, what to, to say
- 3-gram (trigram): I don’t know, don’t know what, know what to, etc.
- … n-gram

“TEXT” (character level, with ‘_’ marking padding / word boundaries)
- unigram: {T, E, X, T}
- bigram: {_T, TE, EX, XT, T_}
- trigram: {_TE, TEX, EXT, XT_, T__}
- … n-gram
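A small sketch of character n-gram extraction with ‘_’ padding; the padding convention (one pad character before the text, n − 1 after) is inferred from the “TEXT” example above.

```python
def char_ngrams(text, n, pad="_"):
    padded = pad + text + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("TEXT", 2))  # ['_T', 'TE', 'EX', 'XT', 'T_']
print(char_ngrams("TEXT", 3))  # ['_TE', 'TEX', 'EXT', 'XT_', 'T__']
```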

61

Why do we want to predict items?
- Author attribution
- Plagiarism detection
- Malicious code detection
- Genre classification
- Sentiment classification
- Spam identification
- Language and encoding identification
- Spelling correction

62

Common N-grams method
- Compares the content of two text, audio or video data files.
- Builds a byte-level n-gram profile of an author's writing.
- The profile is a small set of L pairs {(x1, f1), (x2, f2), …, (xL, fL)} of frequent n-grams and their normalized frequencies, generated from training data.
- Two important operations:
  - choose the optimal set of n-grams for a profile
  - calculate the similarity between two profiles
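A hedged sketch of profile construction: extract byte-level n-grams, keep the L most frequent, and store their normalized frequencies (function and parameter names are ours, not from the CNG papers).

```python
from collections import Counter

def build_profile(text, n=3, L=5):
    data = text.encode("utf-8")                     # byte level: language independent
    grams = [data[i:i + n] for i in range(len(data) - n + 1)]
    counts = Counter(grams)
    total = sum(counts.values())
    # Keep the L most frequent n-grams, with frequencies normalized over all n-grams.
    return {g: c / total for g, c in counts.most_common(L)}

profile = build_profile("Marley was dead: to begin with. There is no doubt whatever about that.")
print(profile)
```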

63

Common N-grams method
- Does not use any language-dependent information (no special treatment of the space character, newline character, uppercase or lowercase).
- The approach does not depend on a specific language: it does not require segmentation for languages such as Chinese or Thai.
- There is no text preprocessing, so we avoid the need for taggers, parsers and feature selection.

64

How do n-grams work?

"Marley was dead: to begin with. There is no doubt whatever about that. …" (from A Christmas Carol by Charles Dickens)

With n = 3, slide a window over the text to obtain the character trigrams: Mar, arl, rle, ley, ey_, y_w, _wa, was, …

Sort the trigrams by frequency; the profile keeps the top L (here L = 5):
_th 0.015
___ 0.013
the 0.013
he_ 0.011
and 0.007
_an 0.007
nd_ 0.007
ed_ 0.006

(Detection of New Malicious Code Using N-grams Signatures, © 2004 T. Abou-Assaleh, N. Cercone, V. Keselj & R. Sweidan)

65

Comparing profiles

Dickens, A Christmas Carol:
_th 0.015, ___ 0.013, the 0.013, he_ 0.011, and 0.007

Dickens, A Tale of Two Cities:
_th 0.016, the 0.014, he_ 0.012, and 0.007, nd_ 0.007

Carroll, Alice's Adventures in Wonderland:
_th 0.017, ___ 0.017, the 0.014, he_ 0.014, ing 0.007

Which of these profiles are most similar to each other?

Similarity measure
In order to "normalize" the differences between two profiles, we divide them by the average frequency of the n-gram in the two profiles, (f1(s) + f2(s))/2. E.g. a difference of 0.1 for an n-gram with frequencies 0.9 and 0.8 in the two profiles is weighted less than the same difference for an n-gram with frequencies 0.2 and 0.1.

d(profile 1, profile 2) = Σs ( (f1(s) − f2(s)) / ((f1(s) + f2(s)) / 2) )²
                        = Σs ( 2 · (f1(s) − f2(s)) / (f1(s) + f2(s)) )²

where s is any n-gram from one of the two profiles, and f1(s) and f2(s) are the frequencies of s in the two profiles.

67

Profile dissimilarity algorithm

Returns a positive number, which is a measure of dissimilarity.

For identical texts, the dissimilarity is 0.
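A minimal sketch of that dissimilarity, assuming profiles are dictionaries mapping n-grams to normalized frequencies (as in the profile sketch earlier); the example frequencies are the ones from the "Comparing profiles" slide.

```python
def dissimilarity(profile1, profile2):
    """Sum of squared normalized frequency differences over the union of n-grams."""
    total = 0.0
    for s in set(profile1) | set(profile2):
        f1 = profile1.get(s, 0.0)
        f2 = profile2.get(s, 0.0)
        total += (2 * (f1 - f2) / (f1 + f2)) ** 2   # divide by the average frequency
    return total

carol = {"_th": 0.015, "___": 0.013, "the": 0.013, "he_": 0.011, "and": 0.007}
tale = {"_th": 0.016, "the": 0.014, "he_": 0.012, "and": 0.007, "nd_": 0.007}
alice = {"_th": 0.017, "___": 0.017, "the": 0.014, "he_": 0.014, "ing": 0.007}
print(dissimilarity(carol, tale), dissimilarity(carol, alice))
print(dissimilarity(carol, carol))   # identical profiles give 0.0
```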

68

Text classification using CNG
- Given a test document, a test profile is produced.
- The distances between the test profile and the author profiles are calculated.
- The test document is classified using the k-nearest-neighbours method with k = 1: it is attributed to the author whose profile is closest to the test profile.
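A sketch of the 1-nearest-neighbour attribution step, reusing the hypothetical build_profile and dissimilarity functions sketched earlier (both are our illustrations, not the authors' code).

```python
def classify(test_text, author_profiles, n=3, L=5):
    test_profile = build_profile(test_text, n=n, L=L)
    # k = 1: attribute the test document to the author with the closest profile.
    return min(author_profiles, key=lambda a: dissimilarity(author_profiles[a], test_profile))

author_profiles = {
    "Dickens": build_profile("Marley was dead: to begin with. There is no doubt whatever about that."),
    "Carroll": build_profile("Alice was beginning to get very tired of sitting by her sister on the bank."),
}
print(classify("It was the best of times, it was the worst of times.", author_profiles))
```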

69

Strengths & weaknesses: CNG method
Strengths:
- Easy to compute
- Easy to test
Weaknesses:
- Computational resources needed for training
- Imbalanced datasets
- Automatic selection of n and L

70

As an aside: ordering doesn't matter

"Aoccdrnig to rscheearch at an Elingsh uinervtisy, it deosn't mttaer in waht oredr the ltteers in a wrod are, olny taht the frist and lsat ltteres are at the rghit pcleas. The rset can be a toatl mses and you can sitll raed it wouthit a porbelm. Tihs is bcuseae we do not raed ervey lteter by ilstef, but the wrod as a wlohe."

“Humans are interesting” – Ryuk

71

References
- E. Black et al. A procedure for quantitatively comparing the syntactic coverage of English grammars. 1991.
- Tuomo Kakkonen. Framework and resources for natural language parser evaluation. 2007.
- Cornoiu Sorina. Solving the heterogeneity problem in e-government using n-grams.
- Alberto Barrón-Cedeño and Paolo Rosso. On automatic plagiarism detection based on n-grams comparison.
- V. Kešelj, N. Cercone et al. N-gram-based author profiles for authorship attribution. 2003.
- V. Kešelj, N. Cercone. CNG method with weighted voting.
- T. Abou-Assaleh, N. Cercone et al. N-gram-based detection of new malicious code. 2004.
- Book: Introduction to Data Mining. Tan, Steinbach, Kumar.

72

Thank you! Questions?

73

Evaluating K-means clusters
- The most common measure is the Sum of Squared Error (SSE).
- For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:

  SSE = Σ_{i=1}^{K} Σ_{x ∈ Ci} dist²(mi, x)

  where x is a data point in cluster Ci and mi is the representative point of cluster Ci; one can show that mi corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce SSE is to increase K, the number of clusters.
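A short sketch of the SSE computation above, for points, labels and centroids as produced by a K-means run:

```python
import numpy as np

def sse(points, labels, centroids):
    # Sum of squared Euclidean distances from each point to its cluster centroid.
    return sum(np.sum((points[labels == i] - c) ** 2) for i, c in enumerate(centroids))

points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 0.5], [5.5, 5.0]])
print(sse(points, labels, centroids))   # 0.25 + 0.25 + 0.25 + 0.25 = 1.0
```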

74

Measures of cluster validity
Numerical measures:
- External index: used to measure the extent to which cluster labels match externally supplied class labels. E.g. entropy.
- Internal index: used to measure the goodness of a clustering structure without respect to external information. E.g. Sum of Squared Error (SSE).
- Relative index: used to compare two different clusterings or clusters. Often an external or internal index is used for this function, e.g. SSE or entropy.

75

External Measures of Cluster Validity: Entropy and Purity

76

Cluster validity
- For supervised classification, we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
- For cluster analysis, the analogous question is how to evaluate the 'goodness' of the resulting clusters.
- But 'clusters are in the eye of the beholder'!
- Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters

77

Clusters found in random data

[Figure: the same random points shown as the original data and as 'clusters' found by K-means and by complete-link hierarchical clustering.]

78

K-means clustering
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.

79

Text clustering
Text clustering is quite different:
- Feature representations of text typically have a large number of dimensions (10³–10⁶).
- Euclidean distance isn't necessarily the best distance metric for such feature representations.
- Typically we use normalized, TF-IDF-weighted vectors and cosine similarity.
- Computations are optimized for sparse vectors.
Applications:
- During retrieval, add other documents in the same cluster as the initially retrieved documents to improve recall.
- Cluster the results of retrieval to present more organized results to the user (e.g. Clusty, NorthernLight folders).
- Automated production of hierarchical taxonomies of documents for browsing purposes (e.g. Yahoo).
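A hedged sketch of the setup described above, using scikit-learn (an assumption; the slides name no library): TF-IDF vectors are L2-normalized, kept sparse, and clustered with K-means.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

docs = [
    "the jaguar is a large cat native to the americas",
    "the new jaguar car model was unveiled this year",
    "leopards and jaguars are both big cats",
]
vectorizer = TfidfVectorizer()             # TF-IDF weighting; rows L2-normalized by default
X = vectorizer.fit_transform(docs)         # sparse document-term matrix
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```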

80

Cluster similarity: MIN (single link)
- Similarity of two clusters is based on the two most similar (closest) points in the different clusters.
- Determined by one pair of points, i.e. by one link in the proximity graph.

Similarity matrix over items I1–I5 (also used on the following linkage slides):
      I1    I2    I3    I4    I5
I1  1.00  0.90  0.10  0.65  0.20
I2  0.90  1.00  0.70  0.60  0.50
I3  0.10  0.70  1.00  0.40  0.30
I4  0.65  0.60  0.40  1.00  0.80
I5  0.20  0.50  0.30  0.80  1.00

[Figure: the corresponding single-link dendrogram over items 1–5.]

81

Strengths: MIN

[Figure: original points and the two clusters found by single link.]

• Can handle non-elliptical shapes

82

Limitation: MIN

[Figure: original points and the two clusters found by single link.]

• Sensitive to noise and outliers

83

Cluster similarity: MAX (complete link)
- Similarity of two clusters is based on the two least similar (most distant) points in the different clusters.
- Determined by all pairs of points in the two clusters.

(Same similarity matrix over items I1–I5 as on the MIN slide; the figure shows the corresponding complete-link dendrogram.)

84

Strength: MAX

[Figure: original points and the two clusters found by complete link.]

• Less susceptible to noise and outliers

85

Limitation: MAX

[Figure: original points and the two clusters found by complete link.]

• Tends to break large clusters
• Biased towards globular clusters

86

Cluster similarity: group average
- Proximity of two clusters is the average of pairwise proximity between points in the two clusters:

  proximity(Cluster_i, Cluster_j) = Σ_{p_i ∈ Cluster_i, p_j ∈ Cluster_j} proximity(p_i, p_j) / (|Cluster_i| · |Cluster_j|)

(Same similarity matrix over items I1–I5 as on the MIN slide; the figure shows the corresponding group-average dendrogram.)

87
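As a small worked example of the formula above, the group-average proximity of the hypothetical clusters {I1, I2} and {I4, I5}, using the similarity matrix from the MIN slide:

```python
sim = {
    ("I1", "I4"): 0.65, ("I1", "I5"): 0.20,
    ("I2", "I4"): 0.60, ("I2", "I5"): 0.50,
}

def group_average(cluster_a, cluster_b, sim):
    pairs = [(a, b) for a in cluster_a for b in cluster_b]
    return sum(sim[p] for p in pairs) / (len(cluster_a) * len(cluster_b))

print(group_average(["I1", "I2"], ["I4", "I5"], sim))  # (0.65 + 0.20 + 0.60 + 0.50) / 4 = 0.4875
```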

Strength & limitation: group average
- A compromise between single and complete link.
- Strength: less susceptible to noise and outliers.
- Limitation: biased towards globular clusters.

88

Cluster similarity: Ward's method
- Similarity of two clusters is based on the increase in squared error when the two clusters are merged.
- Similar to group average if the distance between points is the squared distance.
- Less susceptible to noise and outliers.
- Biased towards globular clusters.
- Can be used to initialize K-means.

89
