document clustering and collection selection diego puppin web mining, 2006-2007
![Page 1: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/1.jpg)
Document Clustering and Collection Selection
Diego Puppin
Web Mining, 2006-2007
![Page 2: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/2.jpg)
Web Search Engines: Parallel Architecture
To improve throughput, latency, and result quality
A broker works as a single interface
Composed of several computing servers
Queries are routed to a subset of the servers
The broker collects and merges results
![Page 3: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/3.jpg)
![Page 4: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/4.jpg)
[Figure: term-document matrix with documents d1 to d8 as columns and terms t1 to t9 as rows; an x marks a term that occurs in a document]
![Page 5: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/5.jpg)
![Page 6: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/6.jpg)
![Page 7: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/7.jpg)
Doc-Partitioned Approach
The document base is split among servers
Each server indexes and manages queries for its own documents
It knows all terms of some documents
Better scalability of indexing/search
Each server is independent
Documents can be easily added/removed
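A minimal sketch of the doc-partitioned layout, under stated assumptions: documents are assigned round-robin (the slide does not fix a policy), text is tokenized on whitespace, and the names `build_index`, `doc_partition` and `search` are hypothetical helpers, not part of any real system.

```python
from collections import defaultdict

def build_index(docs):
    """Build an inverted index {term: set of doc ids} for one server's documents."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def doc_partition(docs, num_servers):
    """Assign documents round-robin (an assumed policy) and index each share locally."""
    shares = [{} for _ in range(num_servers)]
    for i, (doc_id, text) in enumerate(sorted(docs.items())):
        shares[i % num_servers][doc_id] = text
    return [build_index(share) for share in shares]

def search(indexes, term):
    """Ask every server for the term and union the local answers."""
    hits = set()
    for index in indexes:
        hits |= index.get(term, set())
    return hits

docs = {
    "d1": "web search engine",
    "d2": "parallel search architecture",
    "d3": "document clustering",
    "d4": "web document collection",
}
indexes = doc_partition(docs, num_servers=2)
print(search(indexes, "web"))
```

Each server holds every term of its own documents, so adding or removing a document only touches one local index.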
![Page 8: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/8.jpg)
Term-Partitioned Approach
The dictionary is split among servers
Each server stores the index for some terms
It knows documents where its terms occur
Potential for load reduction
Poor load balancing (some work...)
![Page 9: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/9.jpg)
Some considerations
Every time you add/remove a doc
  You must update MANY servers
With queries:
  only relevant servers are queried
  but... servers with hot terms are overloaded
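The trade-off above can be made concrete with a small sketch of term partitioning. It assumes terms are placed by a stable hash (`zlib.crc32`), which is only one possible placement rule; `server_for`, `term_partition` and `servers_touched` are hypothetical names.

```python
import zlib
from collections import defaultdict

def server_for(term, num_servers):
    """Map a term to a server with a stable hash (an assumed placement rule)."""
    return zlib.crc32(term.encode("utf-8")) % num_servers

def term_partition(docs, num_servers):
    """Each server stores the posting lists of the terms assigned to it."""
    servers = [defaultdict(set) for _ in range(num_servers)]
    for doc_id, text in docs.items():
        for term in text.lower().split():
            servers[server_for(term, num_servers)][term].add(doc_id)
    return servers

def servers_touched(text, num_servers):
    """Servers to update when a document with this text is added or removed,
    or to contact when the text is a query."""
    return {server_for(term, num_servers) for term in text.lower().split()}

docs = {
    "d1": "web search engine",
    "d2": "parallel search architecture",
    "d3": "document clustering",
    "d4": "web document collection",
}
servers = term_partition(docs, num_servers=4)
print(len(servers_touched("web", 4)), "server(s) for a one-term query")
print(len(servers_touched(docs["d4"], 4)), "server(s) to update for d4")
```

A short query contacts few servers, while a long document may spread its terms over many of them.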
![Page 10: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/10.jpg)
How To Load Balance
Put together related terms
  This minimizes the number of servers hit
Try to put together groups of documents with similar overall frequency
  Servers shouldn't be overloaded
Query logs could be used to predict the load
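One possible way to use a query log for balancing, sketched under assumptions: term frequency in past queries stands in for future load, and terms are placed with a greedy "most frequent first, least loaded server" heuristic. This heuristic is swapped in for illustration and is not taken from the slides; it also ignores the "related terms" criterion. `term_frequencies` and `balance_terms` are hypothetical names.

```python
import heapq
from collections import Counter

def term_frequencies(query_log):
    """Count how often each term occurs in a list of past queries."""
    counts = Counter()
    for query in query_log:
        counts.update(query.lower().split())
    return counts

def balance_terms(frequencies, num_servers):
    """Greedy assignment: take terms from most to least frequent and give
    each one to the server with the lowest load so far."""
    heap = [(0, server) for server in range(num_servers)]  # (load, server id)
    heapq.heapify(heap)
    assignment = {}
    ordered = sorted(frequencies.items(), key=lambda kv: (-kv[1], kv[0]))
    for term, freq in ordered:
        load, server = heapq.heappop(heap)
        assignment[term] = server
        heapq.heappush(heap, (load + freq, server))
    loads = [0] * num_servers
    for term, server in assignment.items():
        loads[server] += frequencies[term]
    return assignment, loads

query_log = ["web search", "web mining", "search engine", "web"]
assignment, loads = balance_terms(term_frequencies(query_log), num_servers=2)
print(assignment, loads)
```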
![Page 11: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/11.jpg)
Multiple indexes
Term-based index
  Query-vector or Bag-of-words
Hot text index
  Titles, anchors, bold, etc.
Link-based index
  It can find related pages, etc.
Key-phrase index
  For some idioms
![Page 12: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/12.jpg)
[Figure: three indexes and what each is searched with: terms (query terms), links (referrer, related), hot (query terms)]
![Page 13: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/13.jpg)
How To Doc-Partition?
1. Random doc assignment + Query broadcast
2. Collection selection for independent collections (meta-search)
3. Smart doc assignment + Collection selection
4. Random assignment + Random selection
![Page 14: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/14.jpg)
1. Random + Broadcast
Used by commercial WSEs
No computing effort for doc clustering
Very high scalability
Low latency on each server
Result collection and merging is the heaviest part
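A sketch of the broadcast-and-merge step, assuming each server returns its local top k as (score, doc id) pairs and that scores are comparable across servers. The scoring (number of matching query terms) is a toy stand-in, and `local_top_k` and `broadcast` are hypothetical names.

```python
import heapq

def local_top_k(index, query_terms, k):
    """Score a server's documents by the number of matching query terms
    (a toy scoring function) and return the k best as (score, doc_id)."""
    scores = {}
    for term in query_terms:
        for doc_id in index.get(term, ()):
            scores[doc_id] = scores.get(doc_id, 0) + 1
    return heapq.nlargest(k, ((s, d) for d, s in scores.items()))

def broadcast(indexes, query, k):
    """Broker: send the query to every server, then merge the partial rankings."""
    terms = query.lower().split()
    partial = [local_top_k(index, terms, k) for index in indexes]
    merged = heapq.nlargest(k, (hit for hits in partial for hit in hits))
    return [doc_id for _, doc_id in merged]

# Two servers, each with its own inverted index {term: set of doc ids}.
indexes = [
    {"web": {"d1"}, "search": {"d1"}, "document": {"d3"}},
    {"search": {"d2"}, "web": {"d4"}, "document": {"d4"}},
]
print(broadcast(indexes, "web search", k=2))
```

The broker receives up to k hits from every server, which is why collection and merging grows with the number of servers.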
![Page 15: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/15.jpg)
[Figure: Broker between the query (t1,t2,…,tq), the results (r1,r2,…,rr), and IR Core 1, IR Core 2, …, IR Core k, each with its own index (idx). Caption: Distributed/Replicated Documents]
![Page 16: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/16.jpg)
2. Independent collections
The WSE uses data from several sources
It routes the query to the most authoritative collection(s)
It collects the results according to independent ranking choices (HARD)
Example: Biology, News, Law
![Page 17: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/17.jpg)
[Figure: Broker between the query (t1,t2,…,tq), the results (r1,r2,…,rr), and the collections Coll1, Coll2, …, Coll.k]
![Page 18: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/18.jpg)
3. Assignment + Selection
The WSE creates document groups
Each server holds one group
The broker knows the group placement
The selection strategy routes the query suitably
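A deliberately naive sketch of a selection strategy, assuming the broker keeps per-group document frequencies and queries only the n best groups. The score (sum of document frequencies of the query terms) and the statistics below are made up for illustration; `select_clusters` is a hypothetical name.

```python
def select_clusters(cluster_stats, query, n):
    """Rank groups by how many of their documents contain each query term
    (a naive score) and keep the n best; groups with no match are skipped."""
    terms = query.lower().split()
    scored = []
    for cluster_id, df in cluster_stats.items():
        score = sum(df.get(term, 0) for term in terms)
        if score > 0:
            scored.append((score, cluster_id))
    scored.sort(key=lambda sc: (-sc[0], sc[1]))
    return [cluster_id for _, cluster_id in scored[:n]]

# Broker-side knowledge: {group: {term: number of documents containing it}}.
cluster_stats = {
    "biology": {"cell": 120, "protein": 80, "court": 1},
    "news": {"court": 40, "election": 60, "cell": 5},
    "law": {"court": 200, "appeal": 90},
}
print(select_clusters(cluster_stats, "court appeal", n=2))
```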
![Page 19: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/19.jpg)
[Figure: Broker between the query (t1,t2,…,tq), the results (r1,r2,…,rr), and Doc cluster 1, Doc cluster 2, …, Doc cluster k, each with its own index (idx)]
![Page 20: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/20.jpg)
4. Random + Random
If data (pages, resources...) are replicated, interchangeable, hard to index
Data are stored in the server that publishes them
We query a few servers hoping to get something
![Page 21: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/21.jpg)
[Figure: Broker between the query (t1,t2,…,tq), the results (r1,r2,…,rr), and Doc 1, Doc 2, …, Doc k, each with its own index (idx)]
![Page 22: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/22.jpg)
CORI
The Effect of Database Size Distribution on Resource Selection Algorithms, Luo Si and Jamie Callan
Extends the concept of TF.IDF to collections
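A sketch of how the TF.IDF idea carries over to collections, using the belief formula as it is commonly quoted in the CORI literature: T plays the role of TF (document frequency of the term inside the collection, dampened by collection size) and I the role of IDF (how few collections contain the term). The constants 50, 150 and b = 0.4 are the usually cited defaults and should be treated as assumptions here, as should the plain averaging over query terms; `cori_belief` and `cori_rank` are hypothetical names.

```python
import math

def cori_belief(df, cf, cw, avg_cw, num_collections, b=0.4):
    """Belief that a collection satisfies one query term.
    df: documents in the collection containing the term
    cf: number of collections containing the term
    cw: number of words in the collection, avg_cw: average over collections"""
    if df == 0 or cf == 0:
        return b  # default belief when the term is absent
    t = df / (df + 50 + 150 * cw / avg_cw)
    i = math.log((num_collections + 0.5) / cf) / math.log(num_collections + 1.0)
    return b + (1 - b) * t * i

def cori_rank(collections, query):
    """collections: {name: {"df": {term: doc count}, "cw": word count}}.
    Returns collection names, best first, by average belief over query terms."""
    terms = query.lower().split()
    if not terms:
        return []
    n = len(collections)
    avg_cw = sum(c["cw"] for c in collections.values()) / n
    scores = {}
    for name, c in collections.items():
        total = 0.0
        for term in terms:
            cf = sum(1 for o in collections.values() if o["df"].get(term, 0) > 0)
            total += cori_belief(c["df"].get(term, 0), cf, c["cw"], avg_cw, n)
        scores[name] = total / len(terms)
    return sorted(scores, key=scores.get, reverse=True)

# Made-up statistics for three collections of equal size.
collections = {
    "biology": {"df": {"cell": 120, "court": 1}, "cw": 10000},
    "news": {"df": {"court": 40}, "cw": 10000},
    "law": {"df": {"court": 200}, "cw": 10000},
}
print(cori_rank(collections, "court"))
```

The broker needs `df` and `cw` for every collection, which is the kind of collaboration that independent collections rarely offer.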
![Page 23: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/23.jpg)
![Page 24: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/24.jpg)
![Page 25: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/25.jpg)
CORI
Needs a deep collaboration from the collections:
  Data about terms, documents, size
Unfeasible with independent collections
  Statistical sampling, Query-based sampling
Term-based: no links, no anchors
Very large footprint
![Page 26: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/26.jpg)
![Page 27: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/27.jpg)
Querylog-based Collection Selection
Using Query Logs to Establish Vocabularies in Distributed Information Retrieval
Milad Shokouhi; Justin Zobel; S.M.M. Tahaghoghi; Falk Scholer
The collections are sampled using data from a query log
Before: Queries over a dictionary (QBS)
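A rough sketch of sampling a collection with probe queries, assuming the collection only exposes a search interface. The only difference between the two variants on this slide is where `probe_queries` come from (a query log versus a dictionary). `search(query, k)` is a hypothetical callable returning up to k (doc id, text) pairs, and the defaults for `docs_per_query` and `max_docs` are assumptions, not values from the paper.

```python
def sample_collection(search, probe_queries, docs_per_query=4, max_docs=300):
    """Probe a collection and build a vocabulary sample from the returned documents.
    Returns ({term: sampled document frequency}, set of sampled doc ids)."""
    seen = set()
    vocabulary = {}
    for query in probe_queries:
        for doc_id, text in search(query, docs_per_query):
            if doc_id in seen:
                continue
            seen.add(doc_id)
            for term in set(text.lower().split()):
                vocabulary[term] = vocabulary.get(term, 0) + 1
        if len(seen) >= max_docs:
            break
    return vocabulary, seen

def make_search(docs):
    """Toy search interface over an in-memory collection, for the demo only."""
    def search(query, k):
        terms = set(query.lower().split())
        hits = [(d, t) for d, t in sorted(docs.items())
                if terms & set(t.lower().split())]
        return hits[:k]
    return search

docs = {
    "d1": "web search engine",
    "d2": "web mining course",
    "d3": "protein folding",
}
probe_queries = ["web", "search"]  # e.g. taken from a query log
vocabulary, seen = sample_collection(make_search(docs), probe_queries)
print(sorted(vocabulary.items()))
```

Documents that no probe query recalls (d3 here) never enter the sample, so the choice of probe queries shapes the vocabulary the broker sees.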
![Page 28: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/28.jpg)
Recall
  Fraction of the relevant documents that are found (not for web)
Precision at X (P@X)
  Fraction of relevant documents among the first X results (X = 5, 10, 20, 100)
Average precision
  Average of P@X for increasing X, taken at the ranks where a relevant document appears
MAP (Mean Average Precision)
  The average of the average precision over all queries
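The four measures can be sketched as follows, assuming a ranking is a list of doc ids (best first) and the relevance judgments are a set of doc ids. Average precision is computed in its usual form: P@X summed at each rank X that holds a relevant document, divided by the number of relevant documents. The function names and the example judgments are made up for illustration.

```python
def precision_at(ranking, relevant, x):
    """Fraction of the first x results that are relevant."""
    return sum(1 for doc in ranking[:x] if doc in relevant) / x

def recall(ranking, relevant):
    """Fraction of the relevant documents that appear in the ranking."""
    if not relevant:
        return 0.0
    return sum(1 for doc in ranking if doc in relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Sum of P@X at every rank X holding a relevant document,
    divided by the total number of relevant documents."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)

def mean_average_precision(runs):
    """runs: list of (ranking, relevant) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

runs = [
    (["d1", "d2", "d3", "d4"], {"d1", "d3"}),  # query 1
    (["d5", "d6"], {"d6"}),                    # query 2
]
print(precision_at(*runs[0], x=2))
print(average_precision(*runs[0]))
print(mean_average_precision(runs))
```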
![Page 29: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/29.jpg)
![Page 30: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/30.jpg)
And now for something completely different!
![Page 31: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/31.jpg)
Important features
It can use any underlying WSE
  Links, snippets, anchors...
  As good or bad as the WSE it uses!
Small footprint
It can be added as another index
![Page 32: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/32.jpg)
[Figure: four indexes and what each is searched with: terms (query terms), links (referrer, related), hot (query terms), query-vector (query mapping)]
![Page 33: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/33.jpg)
Developments
Query suggestions
  The system finds related queries
Result grouping
  Documents are already organized into groups
Query expansion
  Can find more complex queries that still match
![Page 34: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/34.jpg)
[Figure: query-document matrix with documents d1 to d8 as columns and queries Q1 to Q9 as rows; an x marks a document recalled by a query]
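A sketch of how a query-document matrix of this kind could be filled in, assuming each document is represented by the queries that recall it on an underlying search engine. `search(query, k)` is a hypothetical callable returning up to k doc ids (best first), the 1/rank weight is an assumption made for illustration, and `build_query_vectors` and `related_queries` are hypothetical names.

```python
from collections import defaultdict

def build_query_vectors(search, query_log, k=10):
    """Return {doc_id: {query: weight}}: one sparse row of queries per document."""
    doc_vectors = defaultdict(dict)
    for query in dict.fromkeys(query_log):  # unique queries, first-seen order
        for rank, doc_id in enumerate(search(query, k), start=1):
            doc_vectors[doc_id][query] = 1.0 / rank  # assumed rank-based weight
    return doc_vectors

def related_queries(doc_vectors, query):
    """Queries that share at least one recalled document with the given query."""
    related = set()
    for vector in doc_vectors.values():
        if query in vector:
            related.update(vector)
    related.discard(query)
    return related

# Toy stand-in for the underlying WSE: fixed result lists per query.
results = {"q1": ["d1", "d2"], "q2": ["d2", "d3"], "q3": ["d4"]}

def search(query, k):
    return results.get(query, [])[:k]

doc_vectors = build_query_vectors(search, ["q1", "q2", "q3", "q1"])
print(dict(doc_vectors))
print(related_queries(doc_vectors, "q1"))
```

Documents that no logged query recalls get no row at all, and a query with unseen terms matches nothing in this representation.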
![Page 35: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/35.jpg)
![Page 36: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/36.jpg)
![Page 37: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/37.jpg)
Still missing
Using the QV model as a full IR model:
  Is it possible to perform queries over this representation?
New query terms cannot be found
  CORI should be used for unseen queries
Better testing with topic shift
![Page 38: Document Clustering and Collection Selection Diego Puppin Web Mining, 2006-2007](https://reader034.vdocument.in/reader034/viewer/2022051105/5a4d1b807f8b9ab0599babb2/html5/thumbnails/38.jpg)
Possible seminars/projects
Advanced collection selection
  Statistical sampling, Query-based sampling
Partitioning a LARGE collection (1 TB)
Load balancing for doc- and term-partitioning
Topic shift
Query log analysis