Clustering: Shallow Processing Techniques for NLP
Ling570, November 30, 2011
Roadmap
- Clustering: motivation & applications
- Clustering approaches
- Evaluation
Clustering
Task: Given a set of objects, create a set of clusters over those objects.
Applications:
- Exploratory data analysis
- Document clustering
- Language modeling: generalization for class-based LMs
- Unsupervised word sense disambiguation
- Automatic thesaurus creation
- Unsupervised part-of-speech tagging
- Speaker clustering, …
Example: Document Clustering
Input: a set of individual documents
Output: sets of document clusters
Many different types of clustering:
- Category: news, sports, weather, entertainment
- Genre clustering: similar styles, e.g. blogs, tweets, newswire
- Author clustering
- Language ID: language clusters
- Topic clustering: documents on the same topic, e.g. OWS, debt supercommittee, Seattle Marathon, Black Friday, …
Example: Word Clustering
Input: words
Barbara, Edward, Gov, Mary, NFL, Reds, Scott, Sox, ballot, finance, inning, payments, polls, profit, quarterback, researchers, science, score, scored, seats
Output: word clusters
Example clusters (from the NYT):
- ballot, polls, Gov, seats
- profit, finance, payments
- NFL, Reds, Sox, inning, quarterback, scored, score
- researchers, science
- Scott, Mary, Barbara, Edward
Questions
- What should a cluster represent? Similarity among objects
- How can we create clusters?
- How can we evaluate clusters?
- How can we improve NLP with clustering?
Due to F. Xia
Similarity
- Between two instances
- Between an instance and a cluster
- Between clusters
Similarity Measures
Given x = (x1, x2, …, xn) and y = (y1, y2, …, yn):
- Euclidean distance: d(x, y) = sqrt(Σi (xi − yi)²)
- Manhattan distance: d(x, y) = Σi |xi − yi|
- Cosine similarity: cos(x, y) = Σi xi·yi / (sqrt(Σi xi²) · sqrt(Σi yi²))
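As a sketch, the three measures in Python (plain tuples as vectors; the function names are illustrative):

```python
import math

def euclidean(x, y):
    # square root of the summed squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # summed absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def cosine(x, y):
    # dot product normalized by the two vector lengths
    dot = sum(xi * yi for xi, yi in zip(x, y))
    nx = math.sqrt(sum(xi * xi for xi in x))
    ny = math.sqrt(sum(yi * yi for yi in y))
    return dot / (nx * ny)
```

Note that the first two are distances (smaller means more similar), while cosine is a similarity (larger means more similar).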
Clustering Algorithms
Types of Clustering
Flat vs. hierarchical clustering:
- Flat: partition the data into k clusters
- Hierarchical: nodes form a hierarchy
Hard vs. soft clustering:
- Hard: each object is assigned to exactly one cluster
- Soft: allows degrees of membership, and membership in more than one cluster; often a probability distribution over cluster membership
Hierarchical Clustering
Hierarchical vs. Flat
Hierarchical clustering:
- More informative
- Good for data exploration
- Many algorithms, none good for all data
- Computationally expensive
Flat clustering:
- Fairly efficient
- Simple baseline algorithm: K-means
- Probabilistic models use the EM algorithm
Clustering Algorithms
Flat clustering:
- K-means clustering
- K-medoids clustering
Hierarchical clustering:
- Greedy, bottom-up clustering
K-Means Clustering
Initialize:
- Randomly select k initial centroids (the center, i.e. mean, of a cluster)
Iterate until clusters stop changing:
- Assign each instance to the nearest cluster (the cluster whose centroid is nearest)
- Recompute each cluster centroid as the mean of the instances in the cluster
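The initialize/iterate loop above can be sketched as follows (a minimal pure-Python sketch of Lloyd's algorithm; the function name and iteration cap are illustrative):

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    """Assign points to the nearest centroid, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)            # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: nearest centroid by Euclidean distance
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[i].append(p)
        # update step: centroid = mean of the cluster's instances
        new = [tuple(sum(c) / len(pts) for c in zip(*pts)) if pts else centroids[i]
               for i, pts in enumerate(clusters)]
        if new == centroids:                     # clusters stopped changing
            break
        centroids = new
    return centroids, clusters
```

Usage: `kmeans([(0, 0), (0, 1), (10, 10), (10, 11)], 2)` separates the two point groups into one cluster each.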
K-Means: 1 step (figure)
K-Means
Running time:
- Each iteration is O(kn), where k is the number of clusters and n the number of instances
- Converges in a finite number of steps
Issues:
- Need to pick the number of clusters k
- Can find only a local optimum
- Sensitive to outliers
- Requires Euclidean distance: what about enumerable classes (e.g. colors)?
Medoid
Medoid: the element in a cluster with the highest average similarity to the other elements in the cluster.
Finding the medoid: for each element p in cluster c, compute
f(p) = (1 / (|c| − 1)) · Σ q∈c, q≠p sim(p, q)
then select the element with the highest f(p).
K-Medoids
Initialize:
- Select k instances at random as medoids
Iterate until no changes:
- Assign each instance to the cluster with the nearest medoid
- Recompute the medoid for each cluster
Greedy, Bottom-Up Hierarchical Clustering
Initialize:
- Make an individual cluster for each instance
Iterate until all instances are in the same cluster:
- Merge the two most similar clusters
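A sketch of the greedy bottom-up loop, recording each merge; the single-link choice (minimum pairwise distance between clusters) is one of several linkage options, not mandated by the slide:

```python
def agglomerative(points, dist, link=min):
    """Start with singleton clusters, repeatedly merge the two closest
    clusters (single-link by default), and record each merge."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        # find the pair of clusters with the smallest linkage distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: link(dist(p, q)
                                       for p in clusters[ij[0]]
                                       for q in clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges
```

The sequence of merges is exactly the hierarchy: reading it bottom-up gives the dendrogram.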
Evaluation
Evaluation
With respect to a gold standard:
- Accuracy: for each cluster, assign the most common label to all items
- Rand index
- F-measure
Alternatives:
- Extrinsic evaluation
- Human inspection
Configuration
Given:
- A set of objects O = {o1, o2, …, on}
- A partition X = {x1, …, xr}
- A partition Y = {y1, …, ys}
Each pair of objects falls into one of four cells:

                         In same set in X    In different sets in X
In same set in Y                a                      d
In different sets in Y          c                      b
Rand Index
A measure of cluster similarity (Rand, 1971), based on the pair counts a, b, c, d above:
RI = (a + b) / (a + b + c + d)
No agreement gives 0; full agreement gives 1.
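Counting the four pair types directly gives the index; a sketch, assuming the two partitions are represented as object-to-label dicts:

```python
from itertools import combinations

def rand_index(X, Y):
    """X, Y: dicts mapping each object to its cluster label in the two
    partitions. RI = (a + b) / (a + b + c + d)."""
    a = b = c = d = 0
    for p, q in combinations(X, 2):
        same_x, same_y = X[p] == X[q], Y[p] == Y[q]
        if same_x and same_y:
            a += 1          # together in both partitions
        elif not same_x and not same_y:
            b += 1          # apart in both partitions
        elif same_x:
            c += 1          # together in X, apart in Y
        else:
            d += 1          # apart in X, together in Y
    return (a + b) / (a + b + c + d)
```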
Precision & Recall
Assume X is the gold-standard partition and Y is the system-generated partition.
For each pair of items in a cluster in Y:
- Correct if they appear together in a cluster in X
From these pair counts we can compute P, R, and F-measure.
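Pairwise precision, recall, and F-measure can be sketched the same way (the dict-of-labels representation is an assumption):

```python
from itertools import combinations

def pairwise_prf(gold, sys):
    """gold, sys: dicts mapping each item to its cluster label.
    A pair is predicted if clustered together in sys, and correct
    if it is also together in gold."""
    pairs = list(combinations(gold, 2))
    predicted = {(p, q) for p, q in pairs if sys[p] == sys[q]}
    actual = {(p, q) for p, q in pairs if gold[p] == gold[q]}
    tp = len(predicted & actual)
    P = tp / len(predicted) if predicted else 0.0
    R = tp / len(actual) if actual else 0.0
    F = 2 * P * R / (P + R) if P + R else 0.0
    return P, R, F
```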
HW #10
Due to F. Xia
HW #10: Unsupervised POS Tagging
Word clustering by neighboring-word cooccurrence:
- Create feature vectors. Features: counts of adjacent word occurrences, e.g. L=he:10 or R=run:3
- Perform clustering: K-medoids algorithm (with cosine similarity)
- Evaluate clusters: cluster mapping + accuracy
Q1: create_vectors.* training_file word_file feat_file outfile
- training_file: one sentence per line: w1 w2 w3 … wn
- word_file: list of words to cluster, one per line: word<tab>freq
- feat_file: list of words to use as features: feat<tab>freq
- outfile: one line per word in word_file, in the format:
  word L=he 10 L=she 5 … R=gone 2 R=run 3 …
Features
Features are of the form (L|R)=xx freq, where:
- xx is a word in feat_file
- L or R is the position (left or right neighbor) where the feature appeared
- freq is the number of times word xx appeared in that position in the training file
Example: suppose 'New York' appears 540 times in the corpus. Then the line for York includes:
York L=New 540 … R=New 0 …
Vector File
- One line per word in word_file
- Lines should be ordered as in word_file
- Features should be sorted alphabetically by feature name, e.g. L=an 3 L=the 10 … R=aqua 1 R=house 5
- Feature sorting aids the cosine computation
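The counting step behind Q1 might look like this in-memory sketch (the real program reads and writes the files described above; the function name and interface are illustrative):

```python
from collections import Counter

def build_vectors(sentences, words, feats):
    """For each target word, count how often each feature word appears
    immediately to its left (L=feat) or right (R=feat)."""
    words, feats = set(words), set(feats)
    vecs = {w: Counter() for w in words}
    for sent in sentences:
        toks = sent.split()
        for i, tok in enumerate(toks):
            if tok not in words:
                continue
            if i > 0 and toks[i - 1] in feats:
                vecs[tok]['L=' + toks[i - 1]] += 1    # left neighbor
            if i + 1 < len(toks) and toks[i + 1] in feats:
                vecs[tok]['R=' + toks[i + 1]] += 1    # right neighbor
    return vecs
```

Writing each word's counter out with its keys sorted alphabetically then yields the vector-file format above.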
Q2: k_medoids.* vector_file num_clusters sys_cluster_file
- vector_file: created by Q1
- num_clusters: number of clusters to create
- sys_cluster_file: output representing the clustering of the vectors, one cluster per line:
  medoid w1 w2 w3 … wn
  where medoid is the medoid representing the cluster and w1 … wn are the words in the cluster
Q2: K-Medoids
- Similarity measure: cosine similarity
- Initial medoids: medoid i is placed at a fixed instance position determined by N and C, where N is the number of words to cluster and C is the number of clusters
Mapping Sys to Gold: One-to-One
- Find the highest number in the matrix
- Remove the corresponding row and column
- Repeat until all rows are removed
Result: s1 => g2 (10), s2 => g1 (7), s3 => g3 (6); acc = (10+7+6)/sum
Due to F. Xia

      g1   g2   g3
s1     2   10    9
s2     7    4    2
s3     0    9    6
s4     5    0    3
![Page 74: Clustering Shallow Processing Techniques for NLP Ling570 November 30, 2011](https://reader036.vdocument.in/reader036/viewer/2022062300/56649d2d5503460f94a0469e/html5/thumbnails/74.jpg)
Mapping Sys to Gold: Many-to-One
- Find the highest number in the matrix
- Remove the corresponding row (but not the column)
- Repeat until all rows are removed
Result: s1 => g2 (10), s2 => g1 (7), s3 => g2 (9), s4 => g1 (5); acc = (10+7+9+5)/sum
Due to F. Xia

      g1   g2   g3
s1     2   10    9
s2     7    4    2
s3     0    9    6
s4     5    0    3
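Both greedy mappings can be sketched with one function over a counts matrix (the matrix representation and function name are assumptions):

```python
def map_clusters(matrix, one_to_one):
    """matrix[s][g] = number of items that system cluster s shares with
    gold cluster g. Greedily take the largest remaining cell; in
    one-to-one mode each gold cluster may be used only once."""
    cells = sorted(((matrix[s][g], s, g)
                    for s in range(len(matrix))
                    for g in range(len(matrix[0]))), reverse=True)
    used_s, used_g, total = set(), set(), 0
    for count, s, g in cells:
        if s in used_s or (one_to_one and g in used_g):
            continue                      # row (or column) already removed
        used_s.add(s)
        used_g.add(g)
        total += count
    return total
```

The function returns the accuracy numerator; accuracy is this count divided by the total number of items.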
Q3: calculate_accuracy.* sys_clust gold_clust flag map_file acc_file
- sys_clust: output of Q2: m w1 w2 …
- gold_clust: same format, gold standard
- flag: 0 = one-to-one; 1 = many-to-one
- map_file: mapping of system to gold clusters: sys_clust_num => gold_clust_num count
- acc_file: just the overall accuracy
Experiments
- Compare different numbers of words and different feature representations
- Compare different mapping strategies for accuracy
- Tabulate results