Topical Crawling for Business Intelligence
Gautam Pant* and Filippo Menczer**
*Department of Management Sciences, The University of Iowa, Iowa City, IA 52246
**School of Informatics, Indiana University, Bloomington, IN 47408
Overview
• Topical Crawling
• The Business Intelligence Problem
• Test Bed
• Crawling Algorithms
• Results
• Finding Better Seeds
Crawling as Graph Search
[Figure: crawl graph with regions labeled Seeds, Frontier, and History]
• Node expansion – downloading and parsing a page
• Open list – Frontier
• Closed list – History
• Expansion order – Crawl path
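The graph-search vocabulary on this slide can be sketched as a small crawl loop. This is a minimal illustration, not the authors' crawler: the toy graph `TOY_WEB`, the function name `crawl`, and the `fetch_links` callback are all hypothetical, and no real pages are downloaded.

```python
from collections import deque

# Hypothetical toy Web graph (URL -> outgoing links); stands in for real pages.
TOY_WEB = {
    "seed": ["a", "b"],
    "a": ["c", "seed"],
    "b": ["c", "d"],
    "c": [],
    "d": ["a"],
}


def crawl(seeds, fetch_links, max_pages):
    """Breadth-first crawl that returns the crawl path (expansion order)."""
    frontier = deque(seeds)  # open list: URLs waiting to be expanded
    history = set()          # closed list: URLs already expanded
    crawl_path = []
    while frontier and len(crawl_path) < max_pages:
        url = frontier.popleft()
        if url in history:
            continue
        history.add(url)
        crawl_path.append(url)
        # Node expansion: "download and parse" the page to extract its links.
        for link in fetch_links(url):
            if link not in history:
                frontier.append(link)
    return crawl_path


path = crawl(["seed"], lambda url: TOY_WEB.get(url, []), max_pages=10)
print(path)
```

Swapping the FIFO queue for a priority queue changes only the expansion order; the frontier and history play the same roles.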
Exhaustive vs. Preferential Crawling
• Exhaustive: blind expansion order (e.g. Breadth First)
• Preferential: heuristic-based expansion order (e.g. Best First)
• Topical Crawling: the guiding heuristic is based on a topic or a set of topics
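A preferential crawler differs from an exhaustive one only in how the next URL is picked from the frontier. The sketch below uses a priority queue for a best-first order. The toy graph, the `TOPIC_SCORE` table, and the `score_link` callback are hypothetical stand-ins for a topic-based heuristic, not values from the study.

```python
import heapq


def best_first_crawl(seeds, fetch_links, score_link, max_pages):
    """Expand the highest-scoring frontier URL first."""
    frontier = []  # heap of (-score, insertion order, url)
    counter = 0    # tie-breaker so equal scores are expanded in FIFO order
    for url in seeds:
        heapq.heappush(frontier, (-1.0, counter, url))
        counter += 1
    history = set()
    crawl_path = []
    while frontier and len(crawl_path) < max_pages:
        _, _, url = heapq.heappop(frontier)
        if url in history:
            continue
        history.add(url)
        crawl_path.append(url)
        for link in fetch_links(url):
            if link not in history:
                # heapq is a min-heap, so the score is negated.
                heapq.heappush(frontier, (-score_link(url, link), counter, link))
                counter += 1
    return crawl_path


# Hypothetical toy data: link structure and heuristic scores per link.
TOY_WEB = {
    "seed": ["news", "biotech"],
    "biotech": ["pharma"],
    "news": ["sports"],
    "pharma": [],
    "sports": [],
}
TOPIC_SCORE = {"news": 0.1, "biotech": 0.9, "pharma": 0.8, "sports": 0.0}

order = best_first_crawl(
    ["seed"],
    lambda url: TOY_WEB.get(url, []),
    lambda parent, link: TOPIC_SCORE[link],
    max_pages=5,
)
print(order)
```

A breadth-first crawler would visit `news` before `pharma` on this graph; the heuristic pulls the on-topic pages forward.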
Business Intelligence Problem
• Web-based information about related business entities
• Related through the area of competence, research thrust, etc.
• Topical crawlers can help in creating a small but focused collection of Web pages that is rich in information about related business entities
Business Intelligence Problem
• A list of business entities is available
• We create a focused document collection that can be further explored with ranking, indexing and text-mining tools
• We investigate the crawling techniques for the task
Finding paths in a competitive community
[Figure: link graph with several nodes labeled .com and a node labeled .edu, .org, .gov]
Test Bed
• DMOZ categories: “Companies”, “Consultants”, “Manufacturers”
• 159 topics, each with seeds, targets, keywords and a description
• Each crawler crawls up to 10,000 pages for each topic
Sample Topic
Performance Metrics
• Precision@N
$$\text{precision}_t@N = \frac{1}{N} \sum_{i=1}^{N} \text{sim}(d_t, p_i)$$
• Target Recall@N
[Figure: Venn diagram of the Relevant, Targets, and Crawled sets]
• |Crawled ∩ Relevant| / |Relevant|
• |Crawled ∩ Targets| / |Targets|
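Both metrics on this slide are simple to compute once a similarity function is chosen. The sketch below is illustrative only: it assumes cosine similarity over raw term counts as `sim`, and it reads the precision formula as an average similarity between a topic description and the first N crawled pages. The function names and sample data are hypothetical.

```python
import math
from collections import Counter


def cosine_sim(text_a, text_b):
    """Cosine similarity over raw term counts (an assumed choice of sim)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def precision_at_n(description, crawled_pages, n):
    """Average similarity between the description and the first n pages."""
    pages = crawled_pages[:n]
    if not pages:
        return 0.0
    return sum(cosine_sim(description, page) for page in pages) / len(pages)


def target_recall_at_n(crawled_urls, target_urls, n):
    """|Crawled ∩ Targets| / |Targets| over the first n crawled URLs."""
    crawled = set(crawled_urls[:n])
    targets = set(target_urls)
    return len(crawled & targets) / len(targets) if targets else 0.0


# Hypothetical sample data.
p = precision_at_n("drug discovery", ["drug discovery", "football scores"], 2)
r = target_recall_at_n(["a", "b", "c", "d"], ["b", "d", "z"], 3)
print(p, r)
```

Target recall uses only the known target pages, so it can be computed without knowing the full relevant set.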
Crawling Infrastructure
Crawling Algorithms
• Breadth First
• Naïve Best First
Crawling Algorithms – DOM Crawler
$$\text{link\_score}_{dom} = \gamma \cdot \text{link\_score}_{bfs} + (1 - \gamma) \cdot \text{context\_score}$$
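The DOM crawler's link score is a weighted combination of the Best-First link score and a context score, with the two weights summing to one. A minimal sketch follows. The parameter name `gamma` and its default value are assumptions made for illustration; the slide does not state the weight used.

```python
def link_score_dom(link_score_bfs, context_score, gamma=0.5):
    """Convex combination of the Best-First score and the context score.

    gamma is a hypothetical name and default for the mixing weight.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return gamma * link_score_bfs + (1.0 - gamma) * context_score


score = link_score_dom(0.4, 0.8, gamma=0.25)
print(score)
```

With `gamma=1.0` the function reduces to the Best-First score, and with `gamma=0.0` it uses the context score alone.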
Hub-Seeking Crawler
[Equation: hub_score defined in terms of n]
n – number of seed hosts
$$\text{link\_score}_{hub} = \max(\text{hub\_score},\ \text{link\_score}_{dom})$$
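The hub-seeking rule takes the larger of a hub score and the DOM link score, where the hub score depends on n, the number of seed hosts. The sketch below shows that structure only. It assumes n is counted over the hosts that a page links to, and `placeholder_hub_score` is an illustrative increasing function, not the formula from the slide; all names and URLs are hypothetical.

```python
from urllib.parse import urlparse


def count_seed_hosts(out_links, seed_urls):
    """n: number of distinct seed hosts among the hosts a page links to."""
    seed_hosts = {urlparse(url).netloc.lower() for url in seed_urls}
    linked_hosts = {urlparse(url).netloc.lower() for url in out_links}
    return len(seed_hosts & linked_hosts)


def placeholder_hub_score(n):
    # ASSUMPTION: illustrative increasing function of n with values in [0, 1).
    # It is NOT the hub_score formula from the slide.
    return n / (1.0 + n)


def link_score_hub(out_links, seed_urls, link_score_dom,
                   hub_score=placeholder_hub_score):
    """max(hub_score, link_score_dom) for links found on a page."""
    n = count_seed_hosts(out_links, seed_urls)
    return max(hub_score(n), link_score_dom)


# Hypothetical example data.
seeds = ["http://a.com/x", "http://b.com/", "http://c.com/"]
links = ["http://a.com/about", "http://A.com/jobs",
         "http://b.com/p", "http://other.org/"]

print(count_seed_hosts(links, seeds))
print(link_score_hub(links, seeds, link_score_dom=0.3))
```

Passing a different `hub_score` callable lets the same structure be reused with whichever formula is actually intended.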
Performance
Improving the Seed Set
• Top 10 hubs based on back-links from Google
• Avoiding mirrors of DMOZ
• Augmented seed set
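One way to picture the seed-augmentation step is to rank pages by how many seeds they link to and add the best ones to the seed set. The sketch below assumes the back-links per seed have already been collected into a dictionary. The ranking rule, the crude mirror check in `looks_like_dmoz_mirror`, and all data are assumptions for illustration, not the procedure used in the study.

```python
from collections import Counter
from urllib.parse import urlparse


def looks_like_dmoz_mirror(url):
    # ASSUMPTION: crude hostname check; real mirror detection is harder.
    return "dmoz" in urlparse(url).netloc.lower()


def top_hubs(backlinks_by_seed, k=10, is_mirror=looks_like_dmoz_mirror):
    """Rank back-linking pages by the number of distinct seeds they link to."""
    counts = Counter()
    for pages in backlinks_by_seed.values():
        for page in set(pages):
            counts[page] += 1
    # Sort by count (descending), then by URL for a deterministic order.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [page for page, _ in ranked if not is_mirror(page)][:k]


def augment_seeds(seeds, backlinks_by_seed, k=10):
    """Original seeds followed by the top-k hubs not already in the seed set."""
    hubs = top_hubs(backlinks_by_seed, k=k)
    return list(seeds) + [hub for hub in hubs if hub not in seeds]


# Hypothetical back-link data: seed -> pages that link to it.
backlinks = {
    "s1": ["h1", "h2", "http://dmoz.org/x"],
    "s2": ["h1", "http://dmoz.org/x"],
    "s3": ["h1", "h2", "h3"],
}
augmented = augment_seeds(["s1", "s2", "s3"], backlinks, k=2)
print(augmented)
```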
Performance
Related Work
• Chakrabarti et al. [1998]: use of hubs
• Menczer et al. [2001]: framework for evaluating topical crawlers
• Chakrabarti et al. [2002]: use of DOM
Conclusion
• Investigated the problem of creating a small collection through topical crawling for locating related business entities
• The Hub-Seeking crawler, which seeks hubs at crawl time and exploits the tag tree structure of Web pages, outperforms Naïve Best-First
• Positive effects of identifying hubs before and during the crawl process
• Future work: find the optimal aggregation node; compare the benefits of identifying hubs in competitive vs. collaborative communities
Thank You
Acknowledgements:
Robin McEntire (GlaxoSmithKline R&D)
Valdis A. Dzelzkalns (GlaxoSmithKline R&D)
Paul Stead (GlaxoSmithKline R&D)
NSF grant to FM