Effective Topic Distillation with Key Resource Pre-selection
Yiqun Liu, Min Zhang and Shaoping Ma
State Key Lab of Intelligent Tech. & Sys. Tsinghua University, Beijing, 100084
(2004/10/19)
For AIRS presentation 04/10/19
Outline
• Why Key Resource Pre-selection?
• Possibilities of selecting key resources
• How to select key resources?
• Experiments
• Conclusion
Why Key resource selection? (1)
• The amount of web pages
Medium: Internet (2002)
  Surface Web: 167 TB
  Deep Web: 91,850 TB
  # Surface web pages: 20 billion
  # Deep web pages: 130 billion
According to "How Much Information", 2003. http://www.sims.berkeley.edu/how-much-info-2003.
Why Key resource selection? (2)
• Index amount of web search engine
GG=Google,
ATW=AllTheWeb,
INK=Inktomi,
TMA=Teoma,
AV=AltaVista
Billions of Textual Documents Indexed
According to a report by the Search Engine Watch website, September 2, 2003
Less than 1/6
Why Key resource selection? (3)
Not all pages can be indexed by web IR tools
Many indexed pages aren't key resources
TD is difficult, so key resource selection is needed
Definitions of TD and key resource
• Key Resource (Key Resource Page)
  – High-quality web pages for a particular topic
    • Offering credible information/service for this topic
    • Introducing other useful web pages for this topic
  – Key resources are only a small part of relevant pages
• Topic Distillation (TD)
  – To find key resources for certain topics
  – A major task for web search (it covers over 70% of web search queries)
Outline
• Selecting key resources is useful for TD
• Possibilities of selecting key resources
  – Is there any difference between ordinary pages and key resource pages?
• How to select key resources?
• Experiments
• Conclusion
Non-content features of key resources
• Key resources vs. ordinary pages (non-content features)
  – Commonly used features
    • In-degree, URL-type, Doc-length
  – Features involving site's self-link analysis
    • In-site out-link number, anchor text rate
• Two data sets to compare the differences
  – Key resource page training set
    • Built with TREC 11 TD task relevance judgments (qrels)
  – Ordinary page set: .GOV (over 1.2M web pages from the .GOV domain)
In-degree
• Key resource pages have more in-links
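As a minimal sketch of how in-degree could be computed, assume the link graph is available as (source, target) URL pairs. The function name `in_degrees`, the toy page names, and the choice to ignore self-links and duplicate links are illustrative assumptions, not details from the slides.

```python
from collections import Counter

def in_degrees(links):
    """Count in-links per page from (source, target) pairs.

    Assumption: self-links and repeated links between the same pair
    of pages are counted at most once.
    """
    unique_links = {(src, dst) for src, dst in links if src != dst}
    return Counter(dst for _, dst in unique_links)

# Hypothetical toy link graph: three pages point at "hub".
links = [("a", "hub"), ("b", "hub"), ("c", "hub"), ("hub", "a"), ("a", "hub")]
degrees = in_degrees(links)
print(degrees["hub"], degrees["a"], degrees["c"])
```

A `Counter` returns 0 for pages that never appear as a link target, so pages without in-links need no special handling.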
URL-type
• Key resource pages tend to be non-FILE type
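The slide only names the FILE type. The sketch below assumes the four-way ROOT / SUBROOT / PATH / FILE taxonomy that is commonly used for URL-type features in TREC web track work; the function name `url_type`, the set of default document names, and the example URLs are hypothetical.

```python
from urllib.parse import urlparse

# Assumption: these file names are treated as directory default documents.
DEFAULT_DOCS = {"index.html", "index.htm"}

def url_type(url):
    """Classify a URL as ROOT, SUBROOT, PATH or FILE (assumed taxonomy)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if segments and segments[-1].lower() in DEFAULT_DOCS:
        segments.pop()          # ".../index.html" counts as its directory
        is_directory = True
    else:
        is_directory = path.endswith("/") or not segments
    if not is_directory:
        return "FILE"           # ends in a file name other than a default doc
    if not segments:
        return "ROOT"           # bare host name
    if len(segments) == 1:
        return "SUBROOT"        # host plus a single directory
    return "PATH"               # host plus a deeper directory path

for u in ["http://www.example.gov/",
          "http://www.example.gov/health/index.html",
          "http://www.example.gov/health/topics/",
          "http://www.example.gov/health/report.pdf"]:
    print(u, url_type(u))
```

Under this heuristic a directory URL written without a trailing slash is classified as FILE, so URLs would need to be normalized first if that matters.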
Doc-length
[Figure: distribution of pages by doc-length, Training Set vs. .GOV Corpus. x-axis: <200 to >30000; y-axis: 0.00% to 18.00%]
• Key resources don’t have too few words
In-site Out-link analysis
• Definition
[Diagram: Site A containing pages P1 and P2, with links numbered 1, 2 and 3]
• Features
  – In-site out-link number
  – In-site out-link anchor text rate

$$\text{rate} = \frac{\text{WordCount}(\text{in-site out-link anchor})}{\text{WordCount}(\text{web page full text})}$$
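The two in-site out-link features could be computed roughly as follows. This is a sketch under stated assumptions: "in-site" is taken to mean "same host name as the page", words are whitespace-separated tokens, and all text nodes (including any script or style content) count toward the full-text word count. The names `in_site_out_link_features` and `_LinkTextParser` and the example page are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class _LinkTextParser(HTMLParser):
    """Collects every <a href> with its anchor text, plus all page text."""
    def __init__(self):
        super().__init__()
        self.links = []        # list of [href, [anchor text pieces]]
        self.text_parts = []   # every text node on the page
        self._open = None      # the link currently being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._open = [href, []]
                self.links.append(self._open)

    def handle_endtag(self, tag):
        if tag == "a":
            self._open = None

    def handle_data(self, data):
        self.text_parts.append(data)
        if self._open is not None:
            self._open[1].append(data)

def in_site_out_link_features(page_url, html):
    """Return (in-site out-link number, in-site out-link anchor text rate)."""
    parser = _LinkTextParser()
    parser.feed(html)
    parser.close()
    site = urlparse(page_url).netloc.lower()
    # Assumption: a link is "in-site" when it resolves to the same host.
    in_site = [parts for href, parts in parser.links
               if urlparse(urljoin(page_url, href)).netloc.lower() == site]
    anchor_words = sum(len(" ".join(parts).split()) for parts in in_site)
    total_words = len(" ".join(parser.text_parts).split())
    rate = anchor_words / total_words if total_words else 0.0
    return len(in_site), rate

# Hypothetical page: two in-site links, one link to another site.
html = ('<html><body><p>Welcome to the agency home page</p>'
        '<a href="/grants/">Grants and funding</a> '
        '<a href="news.html">News</a> '
        '<a href="http://other.example.org/">Partner site</a></body></html>')
count, rate = in_site_out_link_features("http://www.example.gov/index.html", html)
print(count, rate)
```

Here the page has 12 words in total, of which 4 sit inside in-site anchors, so the rate is 4/12.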
In-site Out-link analysis
• Key resource pages have more in-site out-links and longer in-site out-link anchor texts
[Figure panels: In-site out-link anchor text rate; In-site out-link anchor number]
Outline
• Selecting key resources is useful for TD
• Possibilities of selecting key resources
• How to select key resources?
  – Construction of a key resource decision tree
• Experiments
• Conclusion
Construction of a key resource decision tree
• Why decision tree?
  – The most effective and efficient classifier when there is only a small number of features
    • 5 non-content features
  – Providing a metric to estimate the quality of these features in the form of
    • Information gain (ID3)
    • Information gain ratio (C4.5)
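The two feature-quality metrics can be sketched for a single discrete feature as follows. This uses the standard definitions (entropy reduction for ID3, gain divided by split information for C4.5); the function names and the toy "FILE" / "non-FILE" data are illustrative and not taken from the slides.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(items).values())

def information_gain(values, labels):
    """ID3 criterion: entropy of the labels minus the weighted entropy
    of the labels within each feature value."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """C4.5 criterion: information gain normalized by split information,
    which is the entropy of the feature values themselves."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0

# Hypothetical toy sample: URL type against a key/ordinary label.
url_types = ["non-FILE", "non-FILE", "non-FILE", "FILE", "FILE", "FILE", "FILE", "FILE"]
labels = ["key", "key", "ordinary", "ordinary", "ordinary", "ordinary", "ordinary", "ordinary"]
print(information_gain(url_types, labels), gain_ratio(url_types, labels))
```

Gain ratio penalizes features that split the data into many small groups, which is why C4.5 prefers it over raw information gain.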
Construction of a key resource decision tree
[Figure: the key resource decision tree. Annotation: 68.53% of .GOV]
Outline
• Selecting key resources is useful for TD
• Possibilities of selecting key resources
• How to select key resources?
• Experiments
  – Is this key resource selection process effective?
  – Does TD perform better on the key resource result set?
• Conclusion
Is this key resource selection process effective?
• Key resource selection algorithm is effective
[Figure annotations: 70%, 20%]
Does TD perform better on the key resource result set?
• Test set:
  – From TREC 2003 TD task
  – 50 topics and corresponding relevance judgments (qrels)
• Evaluation metrics:
  – Precision at 10 documents
  – R-precision (precision at #relevant documents)
• Weighting
  – BM2500 ranking, default parameters
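The two evaluation metrics listed above can be sketched as below, assuming a ranked list of document IDs and a set of relevant IDs for one topic. The function names and the sample IDs are hypothetical.

```python
def precision_at_k(ranked, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant.
    Divides by k even when fewer than k documents were returned."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc in ranked[:r] if doc in relevant) / r

# Hypothetical run for one topic: ten ranked results, four relevant documents.
ranked = ["d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9", "d10"]
relevant = {"d1", "d3", "d7", "d42"}
print(precision_at_k(ranked, relevant))  # 3 of the top 10 are relevant: 0.3
print(r_precision(ranked, relevant))     # 2 of the top 4 are relevant: 0.5
```

Per-topic values like these would then be averaged over the 50 topics of the test set.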
Does TD perform better on the key resource result set?
• Text retrieval on different data sets
[Figure: retrieval results on different data sets. Legend: G = .GOV corpus, K = Key resource set, F = Full text, A = Anchor text, T = TREC 2003 best run. Annotations: 76%, 83%, 24.89% .GOV data]
Conclusion
• Key resource pre-selection is needed for TD
  – Finding high-quality pages independent of a given user request
• A new type of non-content feature
  – In-site out-link analysis
• An algorithm that uses a decision tree to find key resources
• Key resource page set:
  – uses less than 20% of .GOV pages
  – covers more than 70% of key resource information
  – gets better performance than the whole page set (a 76% performance improvement in P@10)
Welcome to contact me:
Thank you!
Questions and comments?