Focused Crawling for both Topical Relevance and Quality of Medical Information
By Tim Tang, David Hawking, Nick Craswell, Kathy Griffiths
CIKM ’05, November 2005
Outline
Problems and Motivation
The experiment
– Focused crawling
– Relevance and quality prediction
– The three crawlers
– Measures for relevance and quality
– Results, findings
Future work
Why Health Information on the Web?
The Internet is a free medium
Health information of various quality
Incorrect health advice may be dangerous
High user demand for Web health information
Problems
Relevance (in IR):
– Topical relevance based on text
– Navigational and distillation relevance based on links
None of these techniques guarantees quality
Our previous study (Tang et al., JIR ’05) showed Google returns a lot of low-quality health results -> PageRank does not guarantee quality
Problems: Quality of Health Info
Quality of health information is often measured against evidence-based medicine, i.e. interventions that a systematic review of the evidence has shown to be effective.
Low-quality health information originates from untrusted sites: personal home pages, commercial sites, chat sites, web forums, and even some published materials.
Wrong Advice from an Article
Dangerous Information from Personal Web Pages
Commercial Promotion
Why Domain-specific Search?
Impose domain restriction
Results from previous work (Tang et al., JIR ’05)
Quality: Domain-specific engines performed much better than Google
Relevance: GoogleD was best
Coverage analysis: BPS & 4sites are poor
| Engine  | Relevance | Quality |
|---------|-----------|---------|
| GoogleD | 0.407     | 78      |
| BPS     | 0.319     | 127     |
| 4sites  | 0.225     | 143     |
| Google  | 0.195     | 28      |
The Problems of Domain-specific Engines
The current method to build domain-specific engines is very expensive: manual, rule-based.
Example: BluePages Search
– A depression portal at the ANU (http://bluepages.anu.edu.au)
– Manual judgments of health sites by domain experts for two weeks to decide what to include in the index.
– Low coverage: only 207 Web sites in the index
– Tedious maintenance process: Web pages change, cease to exist, new pages come out, etc.
-> A quality focused crawler may be a cheaper approach, maintaining high quality while improving coverage
The FC Process
Designed to selectively fetch content relevant to a specified topic of interest using the Web’s hyperlink structure.
[Diagram of the crawl loop. Components: URL Frontier, Download, Link extractor, Classifier. Edge labels: dequeue, {URLs, link info}, {URLs, scores}, enqueue.]
Link info = anchor text, URL, source page’s content, and so on.
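The loop on this slide can be sketched roughly as follows. This is a minimal illustration, not the authors' crawler: the function names (`focused_crawl`, `download`, `extract_links`, `score_link`) and the toy in-memory "Web" are assumptions made for the example, and each URL is scored only once, when it is first discovered.

```python
import heapq
from typing import Callable, Iterable, List, Tuple

# Hypothetical types: the slides name the components but not their interfaces.
Link = Tuple[str, str]  # (target URL, link info such as anchor text)


def focused_crawl(
    seeds: Iterable[str],
    download: Callable[[str], str],
    extract_links: Callable[[str, str], List[Link]],
    score_link: Callable[[str, str], float],
    max_pages: int,
) -> List[str]:
    """Fetch up to max_pages URLs, always taking the best-scored URL next."""
    frontier: List[Tuple[float, int, str]] = []  # (negated score, tie-breaker, URL)
    seen = set()
    counter = 0
    for url in seeds:
        if url not in seen:
            seen.add(url)
            heapq.heappush(frontier, (-1.0, counter, url))
            counter += 1

    fetched: List[str] = []
    while frontier and len(fetched) < max_pages:
        _, _, url = heapq.heappop(frontier)          # dequeue
        page = download(url)                         # Download
        fetched.append(url)
        for target, info in extract_links(url, page):  # Link extractor
            if target in seen:
                continue
            seen.add(target)
            score = score_link(target, info)         # Classifier
            heapq.heappush(frontier, (-score, counter, target))  # enqueue
            counter += 1
    return fetched


# Toy in-memory "Web", for illustration only.
TOY_WEB = {
    "seed": [("a", "weather"), ("b", "depression treatment")],
    "a": [],
    "b": [("c", "depression therapy")],
    "c": [],
}

order = focused_crawl(
    seeds=["seed"],
    download=lambda url: url,
    extract_links=lambda url, page: TOY_WEB[url],
    score_link=lambda target, info: 1.0 if "depression" in info else 0.0,
    max_pages=4,
)
print(order)
```

Passing a constant `score_link` makes the same loop behave as a FIFO breadth-first crawl, because the tie-breaking counter preserves discovery order.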
Relevance Prediction
Anchor text: text appearing in a hyperlink
Text around the link: 50 bytes before and after the link
URL words: words formed by parsing the URL address
Relevance Indicators
URL: http://www.depression.com/psychotherapy.html
=> URL words: depression, com, psychotherapy
Anchor text: psychotherapy
Text around the link:
– 50 bytes before: section, learn
– 50 bytes after: talk, therapy, standard, treatment
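As a rough sketch of how these indicators might be extracted: the stop-list below is an assumption chosen so that the slide's URL example comes out as shown (the slides do not list which tokens were dropped), the window is measured in characters rather than bytes for simplicity, and the sample sentence is invented for the example.

```python
import re

# Assumed stop-list: the slide's example omits the scheme, "www" and the file
# extension, but the exact set of discarded tokens is not given in the slides.
URL_STOP_TOKENS = {"http", "https", "www", "html", "htm"}


def url_words(url: str) -> list:
    """Split a URL into lowercase alphabetic tokens, minus the stop-list."""
    tokens = re.findall(r"[a-z]+", url.lower())
    return [t for t in tokens if t not in URL_STOP_TOKENS]


def link_context(page_text: str, start: int, end: int, window: int = 50):
    """Return (text before, anchor text, text after) for a link at [start, end)."""
    before = page_text[max(0, start - window):start]
    anchor = page_text[start:end]
    after = page_text[end:end + window]
    return before, anchor, after


print(url_words("http://www.depression.com/psychotherapy.html"))

# Invented page text, for illustration only.
text = ("In this section you can learn about psychotherapy, "
        "a talk therapy that is a standard treatment.")
start = text.index("psychotherapy")
before, anchor, after = link_context(text, start, start + len("psychotherapy"))
print(repr(before), repr(anchor), repr(after))
```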
Methods
Machine learning approach: Train and test classifiers on relevant and irrelevant Web pages using the indicators discussed above.
Evaluated different learning algorithms: k-nearest neighbor, Naïve Bayes, C4.5, Perceptron.
Result: The C4.5 decision tree was the best at predicting relevance.
A Laplace correction formula (Margineantu et al., LNS, ’02) was used to produce a confidence score (confidence_level) at each leaf node of the tree.
The same method was applied to predict quality, but without success -> Link anchor context cannot predict quality
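The slides cite the Laplace correction without reproducing it. A minimal sketch, assuming the commonly used form (k + 1) / (n + C), where n training examples reach a leaf, k of them belong to the relevant class, and C is the number of classes; the exact variant used in the study may differ, and the function name is hypothetical.

```python
def laplace_confidence(n_relevant: int, n_total: int, n_classes: int = 2) -> float:
    """Laplace-corrected estimate of P(relevant) at a decision-tree leaf.

    Assumed form: (k + 1) / (n + C). It pulls estimates from small leaves
    towards 1 / C instead of reporting an overconfident 0 or 1.
    """
    if n_total < 0 or not 0 <= n_relevant <= n_total:
        raise ValueError("need 0 <= n_relevant <= n_total")
    return (n_relevant + 1) / (n_total + n_classes)


print(laplace_confidence(8, 10))  # leaf with 8 relevant pages out of 10
print(laplace_confidence(1, 1))   # a tiny, pure leaf is not given confidence 1.0
```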
Quality Prediction
Using evidence-based medicine, and
Using Relevance Feedback (RF) technique
Evidence-based Medicine
Evidence-based treatments were divided into single and 2-word terms.
Example:
– Cognitive behavioral therapy
-> cognitive, behavioral, therapy, cognitive behavioral, behavioral therapy
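The split into single words and 2-word terms can be sketched as below. The function name is hypothetical, and whitespace tokenisation with lowercasing is an assumption based on the example above.

```python
def treatment_terms(treatment: str) -> list:
    """Return the single words of a treatment name followed by its 2-word terms."""
    words = treatment.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


print(treatment_terms("Cognitive behavioral therapy"))
```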
Relevance Feedback
Well-known IR approach of query by example.
Basic idea: Do an initial query, get feedback from users about which documents are relevant, then add words from the relevant documents to the query.
Goal: Add terms to the query in order to get more relevant results.
Usually, 20 terms are added to the query in total
Our RF Approach
Not for relevance, but for quality
Not only single terms, but also phrases
Generate a list of single terms and 2-word phrases and their associated weights
Select the top weighted terms and phrases
Cut-off point at the lowest-ranked term that appears in the evidence-based treatment list
20 phrases and 29 single words form a ‘quality query’
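The cut-off rule can be illustrated with a short sketch. The weights and terms below are made up, the function name is hypothetical, and the slides do not say how the weights are computed or whether the rule is applied to words and phrases separately, so the example simply takes one weighted list as input.

```python
def quality_query(weighted_terms: dict, evidence_terms: set) -> list:
    """Keep ranked terms down to the lowest-ranked one found in the evidence list."""
    ranked = sorted(weighted_terms.items(), key=lambda kv: kv[1], reverse=True)
    last_hit = -1
    for i, (term, _weight) in enumerate(ranked):
        if term in evidence_terms:
            last_hit = i
    return ranked[: last_hit + 1]


# Invented weights, for illustration only.
weights = {
    "cognitive behavioral": 3.2,
    "therapy": 2.9,
    "doctor": 2.1,
    "exercise": 1.7,
    "click": 0.4,
}
evidence = {"cognitive behavioral", "therapy", "exercise"}
selected = quality_query(weights, evidence)
print([term for term, _ in selected])
```

Terms ranked above the cut-off are kept even when they are not in the evidence-based list ("doctor" here), which is how the query can grow beyond the treatment vocabulary.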
Predicting Quality
For downloaded pages, a quality score (QScore) is computed using a modification of the BM25 formula, taking into account term weights.
The quality of a new page is then predicted based on the quality of all the downloaded pages linking to it. (Assumption: there is quality locality; pages with similar content are inter-connected (Davison, SIGIR ’00).)
Predicted quality score of a page with n downloaded source pages:
PScore = (QScore(P1) + QScore(P2) + … + QScore(Pn)) / n
[Diagram: downloaded source pages P1, P2, …, Pn, each linking to the Target page.]
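A small sketch of the PScore bookkeeping, assuming QScore values are already available for downloaded pages (the modified BM25 formula itself is not given in the slides and is not reproduced here). The class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Optional


class QualityPredictor:
    """Tracks, per target URL, the mean QScore of downloaded pages linking to it."""

    def __init__(self) -> None:
        self._total = defaultdict(float)
        self._count = defaultdict(int)

    def add_source(self, target_url: str, source_qscore: float) -> None:
        """Record that a downloaded page with this QScore links to target_url."""
        self._total[target_url] += source_qscore
        self._count[target_url] += 1

    def pscore(self, target_url: str) -> Optional[float]:
        """Return the sum of source QScores divided by n, or None if n == 0."""
        n = self._count.get(target_url, 0)
        if n == 0:
            return None
        return self._total[target_url] / n


predictor = QualityPredictor()
predictor.add_source("target", 2.0)  # QScore of P1
predictor.add_source("target", 4.0)  # QScore of P2
print(predictor.pscore("target"))
```

Keeping a running sum and count lets the prediction be refreshed cheaply each time another page linking to the same target is downloaded.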
Combining Relevance and Quality
We need to balance between relevance and quality
Quality and relevance score combination is new
Our method uses a product of the two scores:
URLScore = confidence_level * PScore
Other ways to combine these scores will be explored in future work
A quality focused crawler will rely on this combined score to order its crawl queue
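To illustrate the product combination, here is a minimal sketch with invented values for three candidate URLs; the URL names and numbers are assumptions for the example only.

```python
def url_score(confidence_level: float, pscore: float) -> float:
    """Combined score from the slide: URLScore = confidence_level * PScore."""
    return confidence_level * pscore


# (confidence_level, PScore) per candidate URL; values invented for illustration.
candidates = {
    "url_a": (0.9, 1.0),  # very likely relevant, low predicted quality
    "url_b": (0.6, 4.0),  # moderately relevant, good predicted quality
    "url_c": (0.2, 5.0),  # probably off-topic, high predicted quality
}

ranked = sorted(candidates, key=lambda u: url_score(*candidates[u]), reverse=True)
print(ranked)
```

With a product, a URL needs a reasonable value on both scores to reach the front of the queue: a near-zero relevance confidence or a near-zero predicted quality pulls the combined score down regardless of the other factor.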
The Three Crawlers
The Breadth-first crawler: Traverses the link graph in a FIFO fashion (serves as baseline for comparison)
The Relevance crawler: For topical relevance, ordering the crawl queue using the C4.5 decision tree
The Quality crawler: For both relevance and quality, ordering the crawl queue using the combination of the C4.5 decision tree and RF techniques.
Measures
Relevance: The relevance performance of the three crawlers was evaluated using a relevance classifier.
Quality: Pages were judged by domain experts using the evidence-based guidelines from the Centre for Evidence Based Mental Health (CEBMH).
– Overall quality: taking into account all pages
– High and low quality categories: the top 25% and the bottom 25% of results in each crawl were compared.
Results
Relevance
Quality
High Quality Pages
AAQ = Above Average Quality: top 25%
Low Quality Pages
BAQ = Below Average Quality: bottom 25%
Findings
Topical relevance can be predicted using link anchor context.
The relevance feedback technique proved useful in quality prediction.
Domain-specific search portals can be successfully built using focused crawling techniques.
Future Work
We experimented with only one health topic. Our plan is to repeat the same experiments with another topic, and to generalise the technique to another domain.
Other ways of combining relevance and quality should be explored.
Experiments comparing our quality crawl with other health portals are necessary.