Text Similarities - PG Pushpin

Junaid Surve (6644418)
Posted on 06-Dec-2014

TRANSCRIPT

Page 1: Text Similarities - PG Pushpin

TEXT SIMILARITIES

Junaid Surve
6644418

Page 2: Text Similarities - PG Pushpin

AGENDA
- Introduction
- Data Retrieval: TF/IDF, Document-Term Matrix, VSM, LSA
- Similarity Measurements: Cosine Similarity, SOC-PMI
- Applications & Prototype
- Summary


Page 4: Text Similarities - PG Pushpin

INTRODUCTION
- The WWW is a huge, tangled web of information.
- Issues faced: duplication, plagiarism, copyright violation, etc.
- Aim: to detect and report duplicates.
- Method: compare documents and output their level of similarity, i.e. their "TEXT SIMILARITY".

Page 5: Text Similarities - PG Pushpin

Text Similarity has two aspects:
- Content Similarity: the words themselves are compared.
  e.g. "I have a car" and "I have a vehicle" are 75% similar.
- Expression Similarity: the meaning of the information is considered.
  e.g. "I have a car" and "I have a vehicle" can be considered 100% similar.

Scope of this talk - Content Similarity

Page 6: Text Similarities - PG Pushpin

A two-step process:

STEP I: Data Retrieval
"The area of study concerned with searching for documents, for information within documents, and for metadata about documents, as well as that of searching structured storage, relational databases, and the World Wide Web" [1]

STEP II: Similarity Measurements
Correlate the words or terms of two or more documents or web pages.


Page 8: Text Similarities - PG Pushpin

DATA RETRIEVAL
- Translation of literature into mathematics.
- A variety of concrete techniques exist: TF/IDF, Document-Term Matrix, VSM, LSA.
- The corresponding mathematical structure is derived based on the data retrieval methodology used.

Page 9: Text Similarities - PG Pushpin

TF/IDF - Term Frequency / Inverse Document Frequency

Idea: the more common a term is, the less importance it carries, and hence it should sit at the least significant end of the query spectrum.

Two linear, independent aspects:
- Term Frequency: frequency of occurrence of a term in a given document.
- Inverse Document Frequency: a measure of the general importance of the term.

Page 10: Text Similarities - PG Pushpin

TF-IDF Example [7]

Three documents:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Two steps:
1. Calculate the Term Frequency
2. Calculate the Inverse Document Frequency

Page 11: Text Similarities - PG Pushpin

TF-IDF Example

Terms      D1  D2  D3   dfi   D/dfi       IDF = log(D/dfi)
a           1   1   1    3    3/3 = 1     0
arrived         1   1    2    3/2 = 1.5   0.1761
damaged     1            1    3/1 = 3     0.4771
delivery        1        1    3/1 = 3     0.4771
fire        1            1    3/1 = 3     0.4771
gold        1       1    2    3/2 = 1.5   0.1761
in          1   1   1    3    3/3 = 1     0
of          1   1   1    3    3/3 = 1     0
silver          2        1    3/1 = 3     0.4771
shipment    1       1    2    3/2 = 1.5   0.1761
truck           1   1    2    3/2 = 1.5   0.1761
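The dfi and IDF columns above can be checked with a short Python sketch; the slide's log is evidently base 10, since log10(3) ≈ 0.4771:

```python
import math

# The three example documents from the slides
docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}

D = len(docs)  # total number of documents
terms = sorted({t for words in docs.values() for t in words})

# df_i: number of documents containing term i
df = {t: sum(t in words for words in docs.values()) for t in terms}

# IDF = log10(D / df_i), matching the slide's log(D/dfi) column
idf = {t: math.log10(D / df[t]) for t in terms}

for t in terms:
    print(f"{t:10s} df={df[t]}  idf={idf[t]:.4f}")
```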

Page 12: Text Similarities - PG Pushpin

Document-Term Matrix

"A Document-Term Matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents." [2]

Rows - Documents
Columns - Terms

It only depicts which document contains which term and the number of occurrences of that term in the document.

Page 13: Text Similarities - PG Pushpin

Document-Term Matrix Example
D1 = "I like databases"
D2 = "I hate hate databases"

      I   like   databases   hate
D1    1    1        1         0
D2    1    0        1         2
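The matrix above can be built directly from word counts; a minimal Python sketch:

```python
from collections import Counter

docs = {
    "D1": "I like databases",
    "D2": "I hate hate databases",
}
vocab = ["I", "like", "databases", "hate"]  # column order from the slide

# One Counter per document = one row of the document-term matrix
matrix = {name: Counter(text.split()) for name, text in docs.items()}

for name in docs:
    row = [matrix[name][term] for term in vocab]  # missing terms count as 0
    print(name, row)
```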

Page 14: Text Similarities - PG Pushpin

VSM

"Vector Space Model (VSM) is an algebraic model for representing text documents (and any objects, in general) as vectors of identifiers, such as index terms." [3]

Each document and query is represented as a vector:
document: dj = (w1,j, w2,j, ..., wn,j)
query:    q  = (w1,q, w2,q, ..., wn,q)

Terms can be individual words, keywords, or phrases, depending on the application.

Page 15: Text Similarities - PG Pushpin

VSM Example [7]

Three documents:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Query: "gold silver truck"

Page 16: Text Similarities - PG Pushpin

VSM Example continued... Calculating the TF-IDF weights

Terms      Q   D1  D2  D3   IDFi     Q*IDFi   D1*IDFi  D2*IDFi  D3*IDFi
a          0    1   1   1   0        0        0        0        0
arrived    0    0   1   1   0.1761   0        0        0.1761   0.1761
damaged    0    1   0   0   0.4771   0        0.4771   0        0
delivery   0    0   1   0   0.4771   0        0        0.4771   0
fire       0    1   0   0   0.4771   0        0.4771   0        0
gold       1    1   0   1   0.1761   0.1761   0.1761   0        0.1761
in         0    1   1   1   0        0        0        0        0
of         0    1   1   1   0        0        0        0        0
silver     1    0   2   0   0.4771   0.4771   0        0.9542   0
shipment   0    1   0   1   0.1761   0        0.1761   0        0.1761
truck      1    0   1   1   0.1761   0.1761   0        0.1761   0.1761
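The Q*IDF and Di*IDF columns are just term counts scaled by IDF; a minimal Python sketch (term order and log base 10 assumed as in the earlier TF-IDF table):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

terms = sorted({t for words in docs.values() for t in words})
D = len(docs)
idf = {t: math.log10(D / sum(t in words for words in docs.values())) for t in terms}

def weights(words):
    # one weight per vocabulary term: w_i = tf_i * idf_i
    return [words.count(t) * idf[t] for t in terms]

wq = weights(query)                                          # the Q*IDF column
wd = {name: weights(words) for name, words in docs.items()}  # the Di*IDF columns
```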

Page 17: Text Similarities - PG Pushpin

LSA

"Latent Semantic Analysis (LSA) is a theory and method for extracting and representing the meaning of words and passages of words." [4]

Built on the assumption that similar terms tend to appear in close proximity, which makes it easier to identify correlation patterns between documents or terms.

A two-step process:
1. Construction of the Document-Term Matrix
2. Singular Value Decomposition

Page 18: Text Similarities - PG Pushpin

LSA Example

Three documents:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Query: "gold silver truck"

Page 19: Text Similarities - PG Pushpin


LSA Example contd...

STEP 1: Constructing the Term-Document Matrix & Query Matrix

Page 20: Text Similarities - PG Pushpin


LSA Example contd...

STEP 2: Evaluating the Singular Value Decomposition

Page 21: Text Similarities - PG Pushpin


LSA Example contd...

STEP 3: Reducing Dimensionality w.r.t. k

Page 22: Text Similarities - PG Pushpin

A similar SVD evaluation and reduction is done for the query vector Q.

At the end we have:
- Reduced SVD matrix V (for the documents)
- Reduced SVD matrix Q (for the query)

These can then be supplied to a similarity measurement technique.
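Steps 1-3 can be sketched end to end with NumPy; k = 2 and the query-folding step q_k = q^T * U_k * S_k^-1 follow the tutorial in [7]. This is a sketch of the technique, not a reproduction of the slides' exact matrices:

```python
import numpy as np

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()
terms = sorted({t for words in docs.values() for t in words})

# STEP 1: term-document count matrix A (rows = terms, columns = documents)
A = np.array([[words.count(t) for words in docs.values()] for t in terms], float)

# STEP 2: singular value decomposition, A = U S V^T
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# STEP 3: keep only the k largest singular values (rank-k approximation)
k = 2
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k].T   # rows of Vk = reduced document coordinates

# Fold the query into the same reduced space: q_k = q^T U_k S_k^-1
q = np.array([query.count(t) for t in terms], float)
qk = (q @ Uk) / sk

# Compare query and documents with cosine similarity in the reduced space
def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sims = {name: cos(qk, Vk[i]) for i, name in enumerate(docs)}
```

With this data, D2 comes out most similar to the query, matching the cosine result later in the deck.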


Page 24: Text Similarities - PG Pushpin

SIMILARITY MEASUREMENTS

The major focus of the "Text Similarities" methodology.

Uses the mathematical structures generated by the data retrieval techniques to evaluate the degree of likeness between two or more documents or web pages.

Two major techniques in focus here:
- Cosine Similarity
- SOC-PMI

Page 25: Text Similarities - PG Pushpin

COSINE SIMILARITY

Evaluates the similarity between two vectors by measuring the cosine of the angle between them.

The cosine of the angle determines whether the vectors point in roughly the same direction.

In our scope, similarity ranges between 0 and 1, since term weights are always positive; i.e. the angle between two vectors never exceeds 90°.

Page 26: Text Similarities - PG Pushpin

COSINE Example [7] (continued from the VSM example)

Three documents:
D1: "Shipment of gold damaged in a fire"
D2: "Delivery of silver arrived in a silver truck"
D3: "Shipment of gold arrived in a truck"

Query: "gold silver truck"

We have already calculated the weights using the TF-IDF scheme.

Next step - calculate the cosine similarity:
  CosineΘDi = (Q . Di) / (|Q| x |Di|)
i.e. first calculate the dot product Q . Di, then the product of the vector magnitudes |Q| x |Di|.

Page 27: Text Similarities - PG Pushpin

COSINE Example continued...

Dot products: Q . Di = Σj (wQ,j * wi,j)
Q . D1 = 0.0310, Q . D2 = 0.4862, Q . D3 = 0.0620

Magnitude products: |Q| x |Di| = sqrt(Σj wQ,j^2) * sqrt(Σj wi,j^2)
|Q| x |D1| = 0.3871, |Q| x |D2| = 0.5896, |Q| x |D3| = 0.1896

Cosine similarities:
CosineΘD1 = 0.0801
CosineΘD2 = 0.8246
CosineΘD3 = 0.3271
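Putting the TF-IDF weights and the cosine formula together in Python reproduces the ranking above (values match to about three decimals, since the slides round their intermediate products):

```python
import math

docs = {
    "D1": "shipment of gold damaged in a fire".split(),
    "D2": "delivery of silver arrived in a silver truck".split(),
    "D3": "shipment of gold arrived in a truck".split(),
}
query = "gold silver truck".split()

terms = sorted({t for w in docs.values() for t in w})
idf = {t: math.log10(len(docs) / sum(t in w for w in docs.values())) for t in terms}

def tfidf(words):
    # TF-IDF weight vector over the whole vocabulary
    return [words.count(t) * idf[t] for t in terms]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

q = tfidf(query)
sims = {name: cosine(q, tfidf(words)) for name, words in docs.items()}
```

D2 ranks highest, as expected: it shares "silver" (twice) and "truck" with the query.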

Page 28: Text Similarities - PG Pushpin

SOC-PMI

"Second-Order Co-occurrence Pointwise Mutual Information (SOC-PMI) is a semantic similarity measure using pointwise mutual information to sort lists of important neighbor words of the two target words from a large corpus." [5]

A substantial amount of mathematics is involved in deriving the formula.

The resulting similarity measure is also normalized, so that similarity values fall between 0 and 1.
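The PMI building block, and the second-order idea that two words are similar when their neighbor words overlap, can be sketched on a toy corpus. Everything here (the corpus, the 5-word window, the unnormalized score) is an illustrative simplification; the paper's beta, gamma and delta machinery is omitted:

```python
import math
from collections import Counter

# Toy corpus; real SOC-PMI uses a large corpus
corpus = ("the car drove down the road "
          "the automobile drove along the street "
          "the car and the automobile share the road").split()

window = 5                      # co-occurrence window (illustrative choice)
unigrams = Counter(corpus)
pairs = Counter()               # co-occurrence counts within the window
for i, w in enumerate(corpus):
    for j in range(i + 1, min(i + 1 + window, len(corpus))):
        pairs[frozenset((w, corpus[j]))] += 1

m = len(corpus)                 # corpus size

def pmi(a, b):
    # PMI(a, b) = log2( f(a, b) * m / (f(a) * f(b)) ); 0 if never co-occurring
    f_ab = pairs[frozenset((a, b))]
    return math.log2(f_ab * m / (unigrams[a] * unigrams[b])) if f_ab else 0.0

def neighbors(w):
    # the "important neighbor words" of w: co-occurring words with positive PMI
    return {x for x in unigrams if x != w and pmi(w, x) > 0}

# Second-order evidence: "car" and "automobile" need not co-occur themselves;
# similarity comes from the PMI mass of the neighbors they share.
shared = neighbors("car") & neighbors("automobile")
score = sum(pmi("car", x) + pmi("automobile", x) for x in shared)
```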

Page 29: Text Similarities - PG Pushpin

SOC-PMI with an example

A complicated method involving many mathematical formulae.

Example [6]: W1 = car, W2 = automobile
m = 70, n = 43

Assumptions: γ = 3, δ = 0.7, a window of 11 words
β1 = β2 = 24.88

Page 30: Text Similarities - PG Pushpin

SOC-PMI example contd...

Types and frequencies, bigram frequencies, and the sets X and Y of words with their PMI values (shown as tables in the original slides).

Page 31: Text Similarities - PG Pushpin

SOC-PMI example contd... (calculation steps shown as figures in the original slides)


Page 33: Text Similarities - PG Pushpin

APPLICATIONS
- Plagiarism Detection: text similarity plays an important role in the field of plagiarism detection.
- Copyright Violation: copies of restricted software/data can be detected using text similarities.
- Recommender Services

Page 34: Text Similarities - PG Pushpin

PROTOTYPE

AIM: finding the degree of similarity between files.

Two steps:
1. Data Retrieval: TF-IDF
2. Similarity Measurement: Cosine, Pearson Correlation, Distribution Matrix, Co-occurrence

Page 35: Text Similarities - PG Pushpin

Prototype - Data Retrieval

Steps followed to retrieve data using the TF-IDF scheme:
1. SequenceFilesFromDirectory - converts files into sequence files <Text, Text>
2. DocumentProcessor - converts the sequence files into <Text, StringTuple>
3. DictionaryVectorizer - creates TF vectors <Text, VectorWritable>, dfcount <IntWritable, LongWritable>, and wordcount <Text, LongWritable>
4. TFIDFConverter - creates TF-IDF vectors <Text, VectorWritable>

Page 36: Text Similarities - PG Pushpin

Prototype - Similarity Measurement

Intermediate step:
- Convert the TF-IDF vectors into a matrix <IntWritable, VectorWritable>

Similarity measurement:
- Distribution multiplication: Matrix * Matrix'
- Cosine, Pearson Correlation and Co-occurrence via RowSimilarityJob (similarity classname):
  SIMILARITY_UNCENTERED_ZERO_ASSUMING_COSINE
  SIMILARITY_PEARSON_CORRELATION
  SIMILARITY_COOCCURRENCE
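What each RowSimilarityJob measure computes over two rows of the TF-IDF matrix can be approximated in plain Python (a sketch of the measures themselves, not of Mahout's distributed implementation; the example rows are hypothetical):

```python
import math

def cosine(u, v):
    # angle-based similarity of two rows
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def pearson(u, v):
    # Pearson correlation = cosine similarity of the mean-centered rows
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mu for a in u], [b - mv for b in v])

def cooccurrence(u, v):
    # number of dimensions in which both rows have a non-zero entry
    return sum(1 for a, b in zip(u, v) if a != 0 and b != 0)

# Hypothetical TF-IDF rows for two documents
row1 = [0.4771, 0.0,    0.1761, 0.1761]
row2 = [0.0,    0.9542, 0.1761, 0.1761]
print(cosine(row1, row2), pearson(row1, row2), cooccurrence(row1, row2))
```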

Page 37: Text Similarities - PG Pushpin

Prototype - Similarity Measurement

Results for Cosine, Pearson Correlation, Distribution Matrix and Co-occurrence (shown as output figures in the original slides).


Page 39: Text Similarities - PG Pushpin

SUMMARY
- What text similarity is; scope - content similarity
- Steps involved in the process:
  - Data Retrieval: TF/IDF, Document-Term Matrix, VSM, LSA
  - Similarity Measurements: Cosine Similarity, SOC-PMI
- Applications & Prototype


Page 41: Text Similarities - PG Pushpin

References
[1] Wikipedia: Information retrieval (2012), http://en.wikipedia.org/wiki/Information_retrieval
[2] Wikipedia: Document-term matrix (2011), http://en.wikipedia.org/wiki/Document-term_matrix
[3] Wikipedia: Vector space model (2011), http://en.wikipedia.org/wiki/Vector_space_model
[4] Wikipedia: Latent semantic indexing (2011), http://en.wikipedia.org/wiki/Latent_semantic_indexing
[5] Wikipedia: Second-order co-occurrence pointwise mutual information (2011), http://en.wikipedia.org/wiki/Second-order_co-occurrence_pointwise_mutual_information
[6] Islam, A. and Inkpen, D. (2006). Second Order Co-occurrence PMI for Determining the Semantic Similarity of Words. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2006), Genoa, Italy, pp. 1033-1038.
[7] Dr. E. Garcia, Mi Islita.com, http://www.miislita.com/information-retrieval-tutorial/svd-lsi-tutorial-4-lsi-how-to-calculations.html