Text Mining: Fast Phrase-based Text Indexing and Matching

Khaled Hammouda, Ph.D. Student, PAMI Research Group, University of Waterloo, Waterloo, Ontario, Canada

LORNET Theme 4


TRANSCRIPT

Page 1: Text Mining: Fast Phrase-based Text Indexing and Matching

Khaled Hammouda, Ph.D. Student

PAMI Research Group, University of Waterloo

Waterloo, Ontario, Canada

LORNET Theme 4

Page 2: The Problem

- Information Source: Web / LOR
- Text Documents, Web Documents, Discussion Articles, ...
- Automatic Clustering/Grouping: Programming Languages, Database Systems, Pattern Recognition, Data Mining

How do we judge similarity?

Page 3: Clustering Documents

Group Similar Documents Together

- Maximize intra-cluster similarity
- Minimize inter-cluster similarity

Need to accurately calculate document similarity

[Figure: three document clusters, with intra-cluster similarity marked inside a cluster and inter-cluster similarity marked between clusters]
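The two clustering goals can be made concrete with a small sketch. It assumes a precomputed pairwise similarity matrix and a list of cluster labels; the function name `cluster_quality` and the toy numbers are hypothetical and not taken from the slides.

```python
from itertools import combinations


def cluster_quality(sim, labels):
    """Return (mean intra-cluster similarity, mean inter-cluster similarity).

    sim    -- square matrix, sim[i][j] is the similarity of documents i and j
    labels -- labels[i] is the cluster assigned to document i
    """
    intra, inter = [], []
    for i, j in combinations(range(len(labels)), 2):
        # Pairs in the same cluster count as intra, all other pairs as inter.
        (intra if labels[i] == labels[j] else inter).append(sim[i][j])

    def mean(values):
        return sum(values) / len(values) if values else 0.0

    return mean(intra), mean(inter)


# Hypothetical toy data: four documents, two clusters.
sim = [
    [1.0, 0.8, 0.1, 0.2],
    [0.8, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.2, 0.1, 0.7, 1.0],
]
labels = [0, 0, 1, 1]
intra, inter = cluster_quality(sim, labels)
```

A good grouping shows a high first value and a low second value. Both depend entirely on how accurately `sim` was computed, which is the point the slide makes.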

Page 4: Document Similarity

How similar is each document to every other document?

Very time consuming! O(n²)

Page 5: Document Similarity

Information Theoretic Measure (Dekang'98):

$\mathrm{sim}(A, B) = \frac{|A \cap B|}{|A \cup B|}$

- How do we intersect every pair of documents without sacrificing efficiency?
- What features should we intersect?
  - Words
  - Phrases
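A minimal sketch of the feature-intersection idea, assuming the measure is the ratio of shared features to the union of features of both documents. It uses the three small example documents that appear on the Document Index Graph slide, and it approximates "phrases" by word bigrams, which is a simplification chosen here for illustration. The helper names `features` and `sim` are hypothetical.

```python
from itertools import combinations

documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}


def features(sentences, n):
    """Set of word n-grams in a document (n=1: words, n=2: two-word phrases)."""
    grams = set()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            grams.add(tuple(words[i:i + n]))
    return grams


def sim(a, b):
    """Shared features divided by all features of the two documents."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Every pair is compared once: n(n-1)/2 comparisons, hence O(n^2).
word_sims = {}
phrase_sims = {}
for d1, d2 in combinations(documents, 2):
    word_sims[(d1, d2)] = sim(features(documents[d1], 1),
                              features(documents[d2], 1))
    phrase_sims[(d1, d2)] = sim(features(documents[d1], 2),
                                features(documents[d2], 2))
```

On this toy set, documents 1 and 3 share two words but no two-word phrase, so the choice of feature changes which pairs look related. The explicit loop over all pairs is the cost that an index structure aims to avoid.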

Page 6: Fast Phrase-based Document Indexing and Matching

Document Index Graph Structure

A model based on a digraph representation of the phrases in the document set

- Nodes correspond to unique terms
- Edges maintain phrase representation
- A phrase is a path in the graph
- The model is an inverted list (terms → documents)
- Nodes carry term weight information for each document in which they appear
- Shared phrases can be matched efficiently

Phrase-based Features

- Phrases: more informative feature than individual words (local context matching)
- Represent sentences rather than words
- Facilitate phrase-matching between documents
- Achieves accurate document pair-wise similarity
- Avoid high-dimensionality of vector space model
- Allow incremental processing

Document 1

- river rafting
- mild river rafting
- river rafting trips

Document 2

- wild river adventures
- river rafting vacation plan

Document 3

- fishing trips
- fishing vacation plan
- booking fishing trips
- river fishing

[Figure: Document Index Graph with nodes mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]
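The structure described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the author's implementation: the class name `DocumentIndexGraph`, its methods, and the storage layout (plain dictionaries keyed by term and by term pair) are all hypothetical. Nodes are unique terms, edges are consecutive term pairs annotated with (sentence, position), and a shared phrase is a path whose edges continue at consecutive positions in the same sentence of an earlier document.

```python
from collections import defaultdict


class DocumentIndexGraph:
    """Simplified, illustrative Document Index Graph (not the author's code)."""

    def __init__(self):
        # Node table: term -> {doc_id: number of occurrences}
        self.nodes = defaultdict(lambda: defaultdict(int))
        # Edge table: (term, next_term) -> {doc_id: {(sentence_no, position)}}
        self.edges = defaultdict(lambda: defaultdict(set))

    def _longest_match(self, words, i, doc):
        """Length of the longest run words[i:i+k] found contiguously in doc."""
        if doc not in self.nodes.get(words[i], {}):
            return 0
        length = 1
        if i + 1 == len(words):
            return length
        # Places in doc where the edge words[i] -> words[i+1] occurs.
        live = set(self.edges.get((words[i], words[i + 1]), {}).get(doc, ()))
        while live:
            length += 1
            j = i + length - 1  # index of the last matched word
            if j + 1 == len(words):
                break
            nxt = self.edges.get((words[j], words[j + 1]), {}).get(doc, ())
            # Keep only paths that continue in the same sentence, next position.
            live = {(s, p + 1) for (s, p) in live if (s, p + 1) in nxt}
        return length

    def _match_sentence(self, doc_id, words, shared):
        """Record maximal phrases of this sentence found in earlier documents."""
        others = {d for w in words for d in self.nodes.get(w, {})} - {doc_id}
        for doc in others:
            prev_end = 0
            for i in range(len(words)):
                length = self._longest_match(words, i, doc)
                # Skip matches fully contained in a phrase already reported.
                if length and i + length > prev_end:
                    shared[doc].add(" ".join(words[i:i + length]))
                    prev_end = i + length

    def add_document(self, doc_id, sentences):
        """Index a document; return {earlier_doc_id: sorted shared phrases}."""
        shared = defaultdict(set)
        for s_no, sentence in enumerate(sentences):
            words = sentence.split()
            self._match_sentence(doc_id, words, shared)
            for pos, word in enumerate(words):
                self.nodes[word][doc_id] += 1
                if pos + 1 < len(words):
                    self.edges[(word, words[pos + 1])][doc_id].add((s_no, pos))
        return {doc: sorted(phrases) for doc, phrases in shared.items()}


documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}
dig = DocumentIndexGraph()
results = {doc_id: dig.add_document(doc_id, sentences)
           for doc_id, sentences in documents.items()}
```

Matching happens while a document is being inserted, so each new document is compared only against the parts of the graph its own words touch. In this sketch, adding Document 3 reports "river" and "trips" against Document 1, and "river" and "vacation plan" against Document 2.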

Page 7: Document Index Graph

Document 1

- river rafting
- mild river rafting
- river rafting trips

Document 2

- wild river adventures
- river rafting vacation plan

Document 3

- fishing trips
- fishing vacation plan
- booking fishing trips
- river fishing

[Figure: the Document Index Graph as Documents 1, 2, and 3 are added; node labels: mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan]

Phrases marked in the figure:

- river rafting
- river
- vacation plan

- river
- trips

Page 8: Phrase-based Document Indexing

Document Index Graph (internal structure)

[Figure: node "river" with outgoing edges e0 (to rafting), e1 (to fishing), and e2 (to adventures)]

Document Table and Edge Tables

doc | TF      | ET
1   | {0,0,3} | e0: s1(1), s2(2), s3(1)
2   | {0,0,2} | e0: s2(1); e2: s1(2)
3   | {0,0,1} | e1: s4(1)
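The table entries can be reproduced from the three example documents. The sketch below is an illustration under assumptions: the function name `node_table` and the dictionary layout are hypothetical, sentence numbers and word positions are 1-based to match the `s1(1)` notation, and the term frequency is kept as a single count rather than the three-part `{0,0,3}` form shown on the slide.

```python
from collections import defaultdict

documents = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}


def node_table(term, documents):
    """Build the document table of one node.

    Returns {doc_id: {"tf": count, "edges": {next_term: [(sentence, position)]}}}
    where sentence and position are 1-based.
    """
    table = {}
    for doc_id, sentences in documents.items():
        tf = 0
        edges = defaultdict(list)
        for s_no, sentence in enumerate(sentences, start=1):
            words = sentence.split()
            for pos, word in enumerate(words, start=1):
                if word != term:
                    continue
                tf += 1
                if pos < len(words):
                    # words[pos] is the word that follows (pos is 1-based).
                    edges[words[pos]].append((s_no, pos))
        if tf:
            table[doc_id] = {"tf": tf, "edges": dict(edges)}
    return table


river = node_table("river", documents)
```

For the node "river", document 1 gets a count of 3 and the edge to "rafting" at sentences 1, 2, and 3 (positions 1, 2, and 1), which lines up with the `e0: s1(1), s2(2), s3(1)` entry.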

[Figure: Document Index Graph (size scalability)]

[Figure: Document Index Graph (time performance)]

Page 9: Effect of using phrase-based similarity over individual words

[Figure: Effect of using phrase similarity (F-measure)]

[Figure: Effect of using phrase similarity (Entropy)]

Page 10: Applications

- Grouping search engine results on-the-fly (incremental processing)
- Creating taxonomies of documents (Yahoo! and Open Directory style)
- Implementing "Find Related" or "Find Similar" features of information retrieval systems
- Automatic generation of descriptive phrases about a set of documents (i.e. labeling clusters)
- Detecting plagiarism

Page 11: Collaboration

- Provide Data Mining services (primarily text mining) for other groups
- Opportunity for collaboration with U of Saskatchewan:
  - I-Help Discussion System
  - Course Delivery Tools
- Others are welcome

Page 12: Questions

- Instant Messaging: MSN Messenger: [email protected]
- E-mail: [email protected]@pami.uwaterloo.ca