Text Mining: Fast Phrase-based Text Indexing and Matching
Khaled Hammouda, Ph.D. Student
PAMI Research Group
University of Waterloo
Waterloo, Ontario, Canada
LORNET Theme 4
The Problem
Information source: Web / LOR
- Text documents
- Web documents
- Discussion articles
- ...
Automatic clustering/grouping (example groups: Programming Languages, Database Systems, Pattern Recognition, Data Mining)
How do we judge similarity?
Clustering Documents
- Group similar documents together
- Maximize intra-cluster similarity
- Minimize inter-cluster similarity
- Need to accurately calculate document similarity
[Figure: three document clusters, illustrating intra-cluster and inter-cluster similarity]
Document Similarity
How similar is each document to every other document?
Very time consuming! O(n^2)
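To make the quadratic cost concrete, here is a minimal sketch that compares every unordered pair of documents. The function names and the shared-word count used as the measure are hypothetical placeholders, not the measure from the talk.

```python
from itertools import combinations

def pairwise_similarities(docs, sim):
    """Return {(i, j): sim(docs[i], docs[j])} for every unordered pair i < j."""
    return {(i, j): sim(docs[i], docs[j])
            for i, j in combinations(range(len(docs)), 2)}

def word_overlap(a, b):
    # Hypothetical placeholder measure: the number of shared words.
    return len(set(a.split()) & set(b.split()))

docs = ["river rafting", "wild river adventures", "fishing trips", "river fishing"]
pairs = pairwise_similarities(docs, word_overlap)
print(len(pairs))  # n * (n - 1) / 2 comparisons for n documents
```

With n documents the dictionary holds n(n-1)/2 entries, so the number of comparisons grows quadratically with the collection size.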
Document Similarity
Information Theoretic Measure (Dekang’98):
sim(A, B) = |A ∩ B| / |A ∪ B|
- How do we intersect every pair of documents without sacrificing efficiency?
- What features should we intersect?
  - Words
  - Phrases
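The choice of features matters for any intersection-based measure. The sketch below assumes a simple ratio of shared features to all features (an assumption for illustration, not necessarily the exact measure on the slide) and compares word features with phrase features. All function names and the `max_len` parameter are hypothetical.

```python
def word_features(sentences):
    """Set of individual words in a document given as a list of sentences."""
    return {w for s in sentences for w in s.split()}

def phrase_features(sentences, max_len=3):
    """Set of contiguous word sequences of length 2..max_len."""
    feats = set()
    for s in sentences:
        words = s.split()
        for n in range(2, max_len + 1):
            for i in range(len(words) - n + 1):
                feats.add(tuple(words[i:i + n]))
    return feats

def overlap_similarity(fa, fb):
    """Assumed measure: shared features divided by all features."""
    union = fa | fb
    return len(fa & fb) / len(union) if union else 0.0

doc_a = ["river rafting trips"]
doc_b = ["rafting river trips"]
word_sim = overlap_similarity(word_features(doc_a), word_features(doc_b))
phrase_sim = overlap_similarity(phrase_features(doc_a), phrase_features(doc_b))
print(word_sim, phrase_sim)
```

The two toy documents contain the same words in a different order, so the word-based score is maximal while the phrase-based score is zero. This illustrates why phrases capture local context that single words miss.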
Fast Phrase-based Document Indexing and Matching
Document Index Graph Structure
A model based on a digraph representation of the phrases in the document set
- Nodes correspond to unique terms
- Edges maintain phrase representation
- A phrase is a path in the graph
- The model is an inverted list (terms → documents)
- Nodes carry term weight information for each document in which they appear
- Shared phrases can be matched efficiently
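The structure described above can be sketched as a small class: one node per unique term (with per-document counts standing in for term weights) and one directed edge per pair of consecutive terms. This is a simplified illustration under those assumptions; the class and method names are hypothetical.

```python
from collections import defaultdict

class PhraseGraph:
    """Toy digraph: nodes are unique terms, edges link consecutive terms."""

    def __init__(self):
        # term -> doc_id -> occurrence count (a stand-in for term weight)
        self.term_docs = defaultdict(lambda: defaultdict(int))
        # (term, next_term) -> set of doc_ids in which the edge occurs
        self.edges = defaultdict(set)

    def add_sentence(self, doc_id, sentence):
        words = sentence.split()
        for w in words:
            self.term_docs[w][doc_id] += 1
        for a, b in zip(words, words[1:]):
            self.edges[(a, b)].add(doc_id)

    def is_path(self, phrase):
        """True if every term is a node and every consecutive pair is an edge."""
        words = phrase.split()
        return (all(w in self.term_docs for w in words)
                and all((a, b) in self.edges for a, b in zip(words, words[1:])))

g = PhraseGraph()
g.add_sentence(1, "mild river rafting")
g.add_sentence(2, "wild river adventures")
print(g.is_path("river rafting"), g.is_path("rafting river"))
print(dict(g.term_docs["river"]))
```

Note that a path alone is not proof that a phrase occurred: "mild river adventures" is a path in this toy graph although no document contains it. Telling real phrases from accidental paths requires per-edge sentence and position information.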
Phrase-based Features
- Phrases: a more informative feature than individual words
- Local context matching
- Represent sentences rather than words
- Facilitate phrase matching between documents
- Achieve accurate pairwise document similarity
- Avoid the high dimensionality of the vector space model
- Allow incremental processing
Document Index Graph (example)
Document 1
- river rafting
- mild river rafting
- river rafting trips
Document 2
- wild river adventures
- river rafting vacation plan
Document 3
- fishing trips
- fishing vacation plan
- booking fishing trips
- river fishing
Graph nodes: mild, wild, river, rafting, adventures, booking, fishing, trips, vacation, plan
Document Index Graph
[Figure: the graph as Documents 1, 2, and 3 are added. Matching phrases labeled in the figure: river rafting, river, vacation plan, trips]
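The following sketch indexes the three example documents one at a time and reports, for each new document, the phrases it shares with every earlier document. It is a simplified illustration, not the author's implementation: the class name, method names, and the greedy left-to-right matching strategy are assumptions.

```python
from collections import defaultdict

class DocumentIndexGraph:
    """Toy phrase index: nodes are terms, edges link consecutive terms."""

    def __init__(self):
        # term -> doc_id -> set of (sentence_no, position)
        self.nodes = defaultdict(lambda: defaultdict(set))
        # (term, next_term) -> doc_id -> set of (sentence_no, position of term)
        self.edges = defaultdict(lambda: defaultdict(set))
        self.doc_ids = []

    def _match(self, words, doc_id):
        """Greedy, non-overlapping runs of `words` that also occur in doc_id."""
        found, i = set(), 0
        while i < len(words):
            active = self.nodes.get(words[i], {}).get(doc_id)
            if not active:
                i += 1
                continue
            j = i
            while j + 1 < len(words):
                edge = self.edges.get((words[j], words[j + 1]), {}).get(doc_id, set())
                # Keep only occurrences that continue in the same sentence.
                nxt = {(s, p + 1) for (s, p) in active if (s, p) in edge}
                if not nxt:
                    break
                active, j = nxt, j + 1
            found.add(" ".join(words[i:j + 1]))
            i = j + 1
        return found

    def add_document(self, doc_id, sentences):
        """Index a document and return {earlier_doc_id: set of shared phrases}."""
        shared = {}
        split = [s.split() for s in sentences]
        for words in split:
            for other in self.doc_ids:
                hits = self._match(words, other)
                if hits:
                    shared.setdefault(other, set()).update(hits)
        for s_no, words in enumerate(split):
            for pos, w in enumerate(words):
                self.nodes[w][doc_id].add((s_no, pos))
            for pos, (a, b) in enumerate(zip(words, words[1:])):
                self.edges[(a, b)][doc_id].add((s_no, pos))
        self.doc_ids.append(doc_id)
        return shared

dig = DocumentIndexGraph()
m1 = dig.add_document(1, ["river rafting", "mild river rafting", "river rafting trips"])
m2 = dig.add_document(2, ["wild river adventures", "river rafting vacation plan"])
m3 = dig.add_document(3, ["fishing trips", "fishing vacation plan",
                          "booking fishing trips", "river fishing"])
print(m2)
print(m3)
```

Because matching happens while a document is being added, each new document is compared only against what is already in the graph, which is what makes incremental processing possible.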
Phrase-based Document Indexing
Document Index Graph (internal structure)
[Figure: the node "river" with outgoing edges e0 (to "rafting"), e1 (to "fishing"), and e2 (to "adventures")]
Document table for "river" (TF = term frequency, ET = edge table):
doc   TF        ET
1     {0,0,3}   e0: s1(1), s2(2), s3(1)
2     {0,0,2}   e0: s2(1); e2: s1(2)
3     {0,0,1}   e1: s4(1)
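As a rough check on how the edge-table entries relate to the example documents, the sketch below recomputes, for one term, its frequency per document and the (sentence, position) pairs for each outgoing edge. The function name and return layout are hypothetical, and edges are keyed by the following term rather than by labels such as e0.

```python
from collections import defaultdict

docs = {
    1: ["river rafting", "mild river rafting", "river rafting trips"],
    2: ["wild river adventures", "river rafting vacation plan"],
    3: ["fishing trips", "fishing vacation plan",
        "booking fishing trips", "river fishing"],
}

def edge_tables(docs, term):
    """Return {doc_id: (term_frequency, {next_term: [(sentence_no, position)]})}.

    Sentence numbers and positions are 1-based.
    """
    tables = {}
    for doc_id, sentences in docs.items():
        tf, out = 0, defaultdict(list)
        for s_no, sentence in enumerate(sentences, start=1):
            words = sentence.split()
            for pos, w in enumerate(words, start=1):
                if w != term:
                    continue
                tf += 1
                if pos < len(words):
                    # words[pos] is the word right after the 1-based position pos
                    out[words[pos]].append((s_no, pos))
        if tf:
            tables[doc_id] = (tf, dict(out))
    return tables

river = edge_tables(docs, "river")
for doc_id, (tf, edges) in river.items():
    print(doc_id, tf, edges)
```

For "river", the pairs for the edge to "rafting" in Document 1 come out as sentence 1 position 1, sentence 2 position 2, and sentence 3 position 1.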
Document Index Graph (size scalability)
Document Index Graph (time performance)
Effect of using phrase-based similarity over individual words
[Figures: effect of using phrase similarity (F-measure); effect of using phrase similarity (Entropy)]
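The slides name F-measure and entropy as the evaluation metrics without giving formulas. The sketch below uses commonly used definitions of both for clustering (size-weighted cluster entropy, and the class-weighted best F score over clusters); treating these as the definitions behind the plots is an assumption, and the function names are hypothetical.

```python
from collections import Counter
from math import log

def clustering_entropy(labels, clusters):
    """Size-weighted average entropy of true labels within each cluster (lower is better)."""
    n = len(labels)
    total = 0.0
    for c in set(clusters):
        members = [l for l, k in zip(labels, clusters) if k == c]
        counts = Counter(members)
        e = -sum((m / len(members)) * log(m / len(members)) for m in counts.values())
        total += len(members) / n * e
    return total

def clustering_f_measure(labels, clusters):
    """For each class take the best F score over clusters, weighted by class size (higher is better)."""
    n = len(labels)
    class_sizes = Counter(labels)
    cluster_sizes = Counter(clusters)
    joint = Counter(zip(labels, clusters))
    total = 0.0
    for cls, n_i in class_sizes.items():
        best = 0.0
        for clu, n_j in cluster_sizes.items():
            n_ij = joint.get((cls, clu), 0)
            if n_ij:
                p, r = n_ij / n_j, n_ij / n_i  # precision, recall
                best = max(best, 2 * p * r / (p + r))
        total += n_i / n * best
    return total

labels = ["a", "a", "b", "b"]
good = [0, 0, 1, 1]   # clusters match the classes exactly
mixed = [0, 1, 0, 1]  # every cluster mixes both classes
print(clustering_f_measure(labels, good), clustering_entropy(labels, good))
print(clustering_f_measure(labels, mixed), clustering_entropy(labels, mixed))
```

A better clustering shows a higher F-measure and a lower entropy, which is the direction to look for when reading the two plots.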
Applications
- Grouping search engine results on-the-fly (incremental processing)
- Creating taxonomies of documents (Yahoo! and Open Directory style)
- Implementing “Find Related” or “Find Similar” features of information retrieval systems
- Automatic generation of descriptive phrases about a set of documents (i.e. labeling clusters)
- Detecting plagiarism
Collaboration
- Provide Data Mining services (primarily text mining) for other groups
- Opportunity for collaboration with U of Saskatchewan:
  - I-Help Discussion System
  - Course Delivery Tools
- Others are welcome
Questions
- Instant Messaging: MSN Messenger: [email protected]
- E-mail: [email protected] (domain: pami.uwaterloo.ca)