VisHue: Web Page Segmentation for Improved Query Interface for MedlinePlus Medical Encyclopaedia

Upload: aastha-madaan

Post on 26-Feb-2018


  • 7/25/2019 VisHue: Web Page Segmentation for Improved Query Interface for MedlinePlus Medical Encyclopaedia


    to Advance Knowledge for Humanity

    Aastha Madaan, Wanming Chu, Subhash Bhalla

    University of Aizu

    VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia

    12/10/2011


    Outline

    1. Introduction

    2. Background

    a) Hierarchical structure

    b) Page-Level Segmentation

    3. Web Page Segmentation Algorithms

    a) Features

    b) Main focus

    c) Comparison

    4. The Proposal: The VisHue Algorithm

    5. Query by Segment

    6. Performance Analysis

    7. Discussions

    8. Summary and Conclusions


    1. Introduction

    The WWW is a common source of information, and the largest one

    Deep querying: gaining importance

    Understanding web page semantics: improves the user's search experience

    Within a web page: identify semantic groups

    Important: discovering these semantic blocks


    1(i). The Statement [1]

    A. Large variety of HTML pages: suitable query and search?

    B. Basic requirements: searching and querying

    From simple querying and searching to semantic querying and searching

    C. Significant: recognizing the semantic and coherent segments

    From page level to segment level

    D. Case example: medical encyclopedia

    MedlinePlus: various choices of medical encyclopedias


    1(i). The Statement [2]

    [Figure: UML class diagram]


    2. Background: MedlinePlus Web Page

    i. Relevant content ii. Irrelevant content

    a. Relevant content:

    i. Topic headings ii. Topic-wise contents

    b. Irrelevant content: navigation bars, header, footer, advertisements

    Headings: identify the hierarchical structure

    Distinct blocks: what a user's perception identifies

    Main focus: skilled and semi-skilled users

    Assumption: headings become query attributes


    2(a). Hierarchical Structure

    1. Hierarchical structure: the logical structure within the page (document)

    2. Indicates the binary relationships (belongingness) between a pair of segments

    3. Accurate hierarchical representation: user-level query attributes (in segments)

    4. Proposed hierarchical structure: based on domain knowledge (skilled and semi-skilled users)

    Captures the user's perception
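The belongingness relation described on this slide can be pictured as a small tree. The following is only a sketch under stated assumptions: the `SegmentNode` type, the `belongs_to_pairs` helper, and the "Asthma" and "Emergency symptoms" labels are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SegmentNode:
    """One node of a page's logical structure (hypothetical type)."""
    label: str
    children: List["SegmentNode"] = field(default_factory=list)

    def add(self, child: "SegmentNode") -> "SegmentNode":
        """Attach a child segment and return it so calls can be chained."""
        self.children.append(child)
        return child


def belongs_to_pairs(root: SegmentNode) -> List[Tuple[str, str]]:
    """Collect every (child label, parent label) pair in the tree."""
    pairs = []
    stack = [root]
    while stack:
        node = stack.pop()
        for child in node.children:
            pairs.append((child.label, node.label))
            stack.append(child)
    return pairs


# Hypothetical page: the page title is the root, headings are its children.
page = SegmentNode("Asthma")
page.add(SegmentNode("Causes"))
symptoms = page.add(SegmentNode("Symptoms"))
symptoms.add(SegmentNode("Emergency symptoms"))

pairs = belongs_to_pairs(page)
for child_label, parent_label in pairs:
    print(f"{child_label} belongs to {parent_label}")
```

Each printed pair is one binary relationship between two segments; the set of all pairs is enough to rebuild the hierarchy.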


    2(a).(i). Segmentation: Semantic Query

    [Figure labels: User; Semantic query and search (in future); Common Web user]


    2(b). Page-Level Segmentation

    Definition

    A segment is a self-contained logical region within a Web page that is:

    (i) not nested within any other segment;

    (ii) represented by a pair (l; c),

    where l is the label of the segment and

    c is the portion of text of the segment [1].
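To make the (l; c) pair concrete, here is a minimal sketch that groups a flat sequence of page blocks into label and text pairs. It assumes the page has already been reduced to ("heading" or "text", value) blocks; that input format, the `pair_up` helper, and the sample text are hypothetical, and this is not the VisHue algorithm itself.

```python
from typing import List, NamedTuple, Optional, Tuple


class Segment(NamedTuple):
    label: str    # l: the label of the segment
    content: str  # c: the portion of text of the segment


def pair_up(blocks: List[Tuple[str, str]]) -> List[Segment]:
    """Group text blocks under the heading that precedes them.

    Because the output is a flat list, no segment is nested in another.
    Text that appears before the first heading is dropped.
    """
    segments: List[Segment] = []
    label: Optional[str] = None
    buffer: List[str] = []
    for kind, value in blocks:
        if kind == "heading":
            if label is not None:
                segments.append(Segment(label, " ".join(buffer)))
            label, buffer = value, []
        elif label is not None:
            buffer.append(value)
    if label is not None:
        segments.append(Segment(label, " ".join(buffer)))
    return segments


# Hypothetical blocks for one page.
blocks = [
    ("text", "Skip to navigation"),
    ("heading", "Causes"),
    ("text", "Allergens."),
    ("text", "Cold air."),
    ("heading", "Symptoms"),
    ("text", "Wheezing."),
]
segments = pair_up(blocks)
for segment in segments:
    print(segment.label, "->", segment.content)
```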


    3. Segmentation Algorithms

    i. History: segmentation traces back to the year 2001 (and continues through 2011)

    ii. Various application domains

    iii. Various techniques for segmenting

    iv. Various terminologies used

    v. Proposed: for MedlinePlus, the items of the user's focus become query attributes


    3(a). Features of a Segmentation Algorithm

    A. Match and identify a user's points of focus

    B. Discover informative segments

    i. Better search and query

    ii. Segments become query-able attributes

    iii. Skilled users aim to query the informative areas (only)

    C. Generate the true hierarchical structure

    D. Segmentation process: low space and time complexity


    3(b). Main Focus

    Find an algorithm best suited to:

    1. Generate the hierarchical structure

    2. Convert segments to attributes in a database

    3. Facilitate in-depth querying
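One way to picture converting segments to database attributes is a table with one column per segment label. The sketch below uses Python's standard `sqlite3` module; the table name `encyclopedia`, the `column_name` and `build_table` helpers, and the sample page are assumptions for illustration, and only the four labels (Title, Causes, Symptoms, Post-Care) come from the slides.

```python
import sqlite3
from typing import Dict, List

# Segment labels that become columns; taken from the QBT interface slide.
ATTRIBUTES = ["Title", "Causes", "Symptoms", "Post-Care"]


def column_name(label: str) -> str:
    """Map a label to a safe column name, e.g. 'Post-Care' -> 'post_care'."""
    return "".join(ch.lower() if ch.isalnum() else "_" for ch in label)


def build_table(pages: List[Dict[str, str]]) -> sqlite3.Connection:
    """Store each page as one row, with one TEXT column per segment label."""
    conn = sqlite3.connect(":memory:")
    columns = [column_name(label) for label in ATTRIBUTES]
    conn.execute(
        "CREATE TABLE encyclopedia ({})".format(
            ", ".join(f"{column} TEXT" for column in columns)
        )
    )
    placeholders = ", ".join("?" for _ in columns)
    for page in pages:
        # A page without a given segment gets an empty string in that column.
        row = [page.get(label, "") for label in ATTRIBUTES]
        conn.execute(f"INSERT INTO encyclopedia VALUES ({placeholders})", row)
    conn.commit()
    return conn


# Hypothetical page content.
pages = [{"Title": "Asthma", "Causes": "Allergens", "Symptoms": "Wheezing"}]
conn = build_table(pages)
stored = conn.execute(
    "SELECT title, symptoms, post_care FROM encyclopedia"
).fetchall()
print(stored)
```

Column names are derived only from the fixed `ATTRIBUTES` list, never from page content, so building the `CREATE TABLE` statement by string formatting stays safe in this sketch.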


    3(b).(i). Segmentation Methods: Web Technologies


    3(b).(ii). Classification of Algorithms


    3(b).(iii). Timeline: Techniques

    Technique                        Algorithm          Year
    Template Detection               [9], [6]           2002, 2007
    DOM-Node Recognition             [8], [11], [10]    2001, 2002, 2006
    Visual-DOM based Rendering       [2]                2003
    Visual-Heuristics based Method   Proposed           -
    Graph-theoretic Method           [3]                2008
    Linguistics based Method         [7]                2008
    Image of the Web Page            [4], [5]           2010, 2009
    Site-Oriented Method             [1]                2011


    3(c). Comparison


    3(c).(i). Main Focus


    3(c).(ii). Comparison: Vision-based Methods


    3(c).(iii). Content Structure by VisHue


    4. The Proposal: VisHue Algorithm


    4.(i). Query Interfaces

    Querying vs. Searching

    Searching: Recent Trends

    1. Object-based search

    2. Block-based search

    3. Entity-based search

    Querying: Recent Trends

    Very few efforts have been made


    5. Query by Segment

    Query by Segment is realized as Query by Tag (Heading), QBT

    Based on the content structure (VisHue algorithm): query by attributes

    MedlinePlus medical encyclopedia: 3886 web pages

    Target: focused and explicit querying

    i. Beneficial to skilled and semi-skilled users

    ii. The medical encyclopedia is the result of years of effort by experts


    5.(i). The QBT Interface

    [Figure panels: Traditional search on MedlinePlus medical encyclopedia; QBT interface. DB attributes shown: Title, Causes, Symptoms, Post-Care]


    5.(ii). QBT Interface: Hierarchical Structure

    Labels: query attributes

    QBT interface: search and query

    Child nodes: search attributes

    Left siblings limit the scope of search of right siblings in the interface

    Segments: attributes for deep query over all pages of MedlinePlus
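One possible reading of "left siblings limit the scope of search of right siblings" is that conditions are applied from left to right, with each one searching only the pages that survived the conditions before it. The sketch below illustrates that reading; the `narrow` helper, the dictionary page format, and the sample pages are hypothetical and are not taken from the QBT implementation.

```python
from typing import Dict, List, Tuple

Page = Dict[str, str]


def narrow(pages: List[Page], conditions: List[Tuple[str, str]]) -> List[Page]:
    """Apply (attribute, keyword) conditions in order, left to right.

    Each condition only sees the pages kept by the conditions before it,
    so an earlier (left) attribute limits the scope of a later (right) one.
    """
    candidates = pages
    for attribute, keyword in conditions:
        candidates = [
            page
            for page in candidates
            if keyword.lower() in page.get(attribute, "").lower()
        ]
    return candidates


# Hypothetical pages, each already split into labelled segments.
pages = [
    {"Title": "Asthma", "Symptoms": "Wheezing and cough"},
    {"Title": "Asthma in children", "Symptoms": "Shortness of breath"},
    {"Title": "Bronchitis", "Symptoms": "Wheezing"},
]

result = narrow(pages, [("Title", "asthma"), ("Symptoms", "wheezing")])
titles = [page["Title"] for page in result]
print(titles)
```

Here the Title condition first reduces three pages to two, and the Symptoms condition then searches only those two.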


    6. Performance Analysis

    i. Qualitative comparison with traditional keyword search

    ii. Query formulation and interpretation

    iii. Quantitative performance analysis of the interface


    6.(i). QBT vs. Keyword Search


    6.(ii). Query Formulation: A Comparison


    6.(iii). Query Example

    Query 1: Cases where the patient has hypertension but not high blood pressure

    QBT query:

    Symptoms: Hypertension

    Symptoms: NOT High Blood Pressure
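If the segments are stored as columns of a relational table, Query 1 could be expressed as one positive and one negated text match on the Symptoms attribute. The following is a sketch under that assumption, using Python's standard `sqlite3` module; the `encyclopedia` table, the `qbt_query` helper, and the three sample rows are hypothetical, and the slides do not state how QBT evaluates the query internally.

```python
import sqlite3
from typing import List

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE encyclopedia (title TEXT, symptoms TEXT)")
conn.executemany(
    "INSERT INTO encyclopedia VALUES (?, ?)",
    [
        # Hypothetical rows, one per encyclopedia page.
        ("Page A", "Hypertension, headache"),
        ("Page B", "Hypertension, also described as high blood pressure"),
        ("Page C", "Fever"),
    ],
)


def qbt_query(conn: sqlite3.Connection, include: str, exclude: str) -> List[str]:
    """Titles whose Symptoms text contains `include` but not `exclude`.

    SQLite's LIKE is case-insensitive for ASCII letters by default.
    """
    sql = (
        "SELECT title FROM encyclopedia "
        "WHERE symptoms LIKE ? AND symptoms NOT LIKE ? "
        "ORDER BY title"
    )
    rows = conn.execute(sql, (f"%{include}%", f"%{exclude}%"))
    return [row[0] for row in rows]


# Symptoms: Hypertension / Symptoms: NOT High Blood Pressure
matches = qbt_query(conn, "Hypertension", "High Blood Pressure")
print(matches)
```

The match is purely textual: a page is excluded only when the phrase "high blood pressure" literally appears in its Symptoms segment.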


    6.(iv). Query Attributes


    6.(v). Query Results


    6.(vi). Quantitative Performance Analysis

    QBT Query

    Symptom: Hypertension

    Symptom: NOT High Blood Pressure

    Before Procedure: Stop

    After Procedure: Normal

    Cause: High Blood Pressure

    Symptom: Heart Attack

    Food Source: Fish

    Side Effect: Poisoning


    7. Discussions

    Content fragments, as perceived by skilled and semi-skilled domain users, are determined by the web page segmentation process

    Proposed effort: formulating a generic algorithm based on heuristic design rules and visual features

    The QBT interface: query over user-identified segments (attributes)

    Aim: convert MedlinePlus pages into a DB

    Contention: a web page is a good source for an easy-to-use new query language interface for segments


    8. Summary and Conclusions

    A. Heuristics + visual features based segmentation is a turning point:

    i. Provides an independent solution

    ii. Improves query interfaces for the chosen domain

    B. The medical domain: need to make the information accessible to the end users

    C. Query by Segment or Tag (QBT): an attempt

    i. Aim: return query-able attributes to the users


    References

    1. "A Site Oriented Method for Segmenting Web Pages", David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR'11, July 24-28, 2011.

    2. "Extracting Content Structure for Web Pages based on Visual Representation", Deng Cai, Shipeng Yu, Ji-Rong Wen, Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xi'an, China, April 23-25, 2003, Proceedings (2003), pp. 596-596.

    3. "Graph-Theoretic Approach to Webpage Segmentation", Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008, Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.

    4. "A segmentation method for web page analysis using shrinking and dividing", Jiuxin Cao, Bo Mao, Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.

    5. "Web Page Layout via Visual Segmentation", Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.

    6. "Page-level template detection via isotonic smoothing", D. Chakrabarti, R. Kumar, K. Punera, in 16th WWW, pages 61-70, 2007.

    7. "A Densitometric Approach to Web Page Segmentation", Christian Kohlschütter, Wolfgang Nejdl, CIKM'08, October 26-30, 2008.

    8. "HTML Page Analysis Based on Visual Cues", Yudong Yang, HongJiang Zhang, IEEE, 2001.

    9. "Template Detection via Data Mining and its Applications", Ziv Bar-Yossef, Sridhar Rajagopalan, in Proceedings of WWW'02, May 7-11, 2002, Honolulu, Hawaii, USA.

    10. "DeSeA: A Page Segmentation based Algorithm for Information Extraction", He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).

    11. "Reverse Engineering for Web Data: From Visual to Semantic Structures", Christina Yip Chung, Michael Gertz, Neel Sundaresan, in Proceedings of the 18th International Conference on Data Engineering (ICDE'02).


    Thank you

    Questions
