

Assuming Accurate Layout Information for Web Documents is Available, What Now?

Hassan Alam, Rachmat Hartono, Aman Kumar, Fuad Rahman, Yuliya Tarnikova and Che Wilcox

Human Computer Interaction Group
BCL Technologies Inc., Santa Clara, CA 95050
www.bcltechnologies.com

[email protected]

Overview of the talk

Web pages vs. document layout

Why do we need layout information?

Web page summarization for handheld devices

The future: Marrying Ontology with XML

Conclusion and Future Work

Related Work

Handcrafting

Transcoding

Adaptive Re-authoring

Handcrafting typically involves a set of content experts crafting web pages by hand for device-specific output.

Transcoding replaces HTML tags with suitable device-specific tags, such as HDML, WML and others.

Research on web page re-authoring either uses natural language processing (NLP) explicitly or relies on non-NLP techniques.
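The transcoding approach described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The tag mapping `TAG_MAP`, the class `WmlTranscoder` and the function `transcode` are hypothetical names chosen for illustration, and the output is only WML-style markup, not validated against the WML specification.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape

# Hypothetical HTML-to-WML tag mapping, for illustration only.
# Tags mapped to None, and tags missing from the table, are dropped
# while their text content is kept.
TAG_MAP = {
    "p": "p",
    "h1": "p",
    "h2": "p",
    "b": "b",
    "strong": "b",
    "i": "i",
    "em": "i",
    "div": None,
    "span": None,
}


class WmlTranscoder(HTMLParser):
    """Rewrites a small HTML fragment into WML-style markup."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        target = TAG_MAP.get(tag)
        if target:
            self.parts.append(f"<{target}>")

    def handle_endtag(self, tag):
        target = TAG_MAP.get(tag)
        if target:
            self.parts.append(f"</{target}>")

    def handle_data(self, data):
        # Text arrives unescaped, so escape it again for the output markup.
        self.parts.append(escape(data))


def transcode(html_fragment):
    """Return the fragment wrapped in a single WML-style card."""
    parser = WmlTranscoder()
    parser.feed(html_fragment)
    parser.close()
    return "<wml><card>" + "".join(parser.parts) + "</card></wml>"


print(transcode("<h1>News</h1><div><p>Hello <strong>world</strong></p></div>"))
```

A pure tag-for-tag substitution like this keeps all of the text, which is why transcoding alone does not solve the small-screen problem that summarization addresses.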

Web Page Summarization for Handheld Devices

Web Page Data Structure

Content Analysis

Content Processing for Re-authoring

Verbatim

Transcode

Summarize

Node Merging

Representing the Complete Web page

When to Summarize?

Creating a label

Creating a Summary


The Future: Marrying Ontology with XML

We assume that we have layout information for a web page

What do we do then?

How do we use this information?

How does that information help us get better re-authoring solutions?

We then define an ontology for that domain!

We define an XML format to encode that information

To define an ontology for the domain of web pages
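As a rough illustration of encoding layout information in XML, the following Python sketch serializes a few layout blocks with the standard `xml.etree.ElementTree` module. The element and attribute names (`page`, `block`, `role`, `x`, `y`, `w`, `h`), the sample blocks and the URL are all assumptions made for this example. The talk does not specify the actual XML vocabulary.

```python
import xml.etree.ElementTree as ET

# Hypothetical layout blocks, as a layout analyzer might report them.
# Coordinates are pixel positions on the rendered page.
sample_blocks = [
    {"role": "headline", "x": 0, "y": 0, "w": 800, "h": 80, "text": "Daily News"},
    {"role": "navigation", "x": 0, "y": 80, "w": 150, "h": 600, "text": "Home | World | Sport"},
    {"role": "body", "x": 150, "y": 80, "w": 650, "h": 600, "text": "Story text goes here."},
]


def layout_to_xml(url, blocks):
    """Build an XML tree with one <block> element per layout block."""
    page = ET.Element("page", {"url": url})
    for index, block in enumerate(blocks):
        attributes = {"id": f"b{index}", "role": block["role"]}
        for key in ("x", "y", "w", "h"):
            attributes[key] = str(block[key])  # XML attributes must be strings
        node = ET.SubElement(page, "block", attributes)
        node.text = block["text"]
    return page


xml_text = ET.tostring(layout_to_xml("http://example.com/", sample_blocks), encoding="unicode")
print(xml_text)
```

Keeping the geometry and the role of each block in the same record is one possible design. It lets a later stage reason about both where a block sits and what it means.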

What is Ontology and How do We Define it?

Ontology is a specification of a conceptualization.

Ontology establishes a joint terminology between members of a community of interest.

These members can be human or automated agents.

A list of elements

Concept hierarchy

Concept association

Rules or axioms
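The four parts listed above (elements, concept hierarchy, concept association, rules) can be sketched as a small Python data structure. This is an illustration under assumptions, not the ontology used in the talk. The class `Ontology`, its methods, and the sample concepts (`page_element`, `text`, `image`, `headline`, `caption`) are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Ontology:
    elements: set = field(default_factory=set)        # list of elements
    parent: dict = field(default_factory=dict)        # hierarchy: child -> parent
    associations: list = field(default_factory=list)  # (subject, relation, object)
    rules: list = field(default_factory=list)         # (description, predicate)

    def add_concept(self, name, parent=None):
        self.elements.add(name)
        if parent is not None:
            self.elements.add(parent)
            self.parent[name] = parent

    def is_a(self, name, ancestor):
        """Walk up the concept hierarchy looking for the ancestor."""
        while name is not None:
            if name == ancestor:
                return True
            name = self.parent.get(name)
        return False

    def associate(self, subject, relation, obj):
        self.associations.append((subject, relation, obj))

    def related(self, subject, relation):
        return [o for s, r, o in self.associations if s == subject and r == relation]

    def violated_rules(self):
        """Return the descriptions of all rules whose predicate fails."""
        return [desc for desc, predicate in self.rules if not predicate(self)]


# Hypothetical fragment of a web-page ontology.
web = Ontology()
web.add_concept("text", parent="page_element")
web.add_concept("image", parent="page_element")
web.add_concept("headline", parent="text")
web.add_concept("caption", parent="text")
web.associate("caption", "describes", "image")
web.rules.append((
    "every caption must describe something",
    lambda onto: bool(onto.related("caption", "describes")),
))

print(web.is_a("caption", "page_element"))
print(web.violated_rules())
```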

A List of Elements in the Web Domain

Concept Hierarchy

and so on…

Concept Association

and so on…

Rules or Axioms

and so on…

Web Page Summarization for Handheld Devices using Ontology

Web Page Data Structure

Content Analysis

Content Processing for Re-authoring

Verbatim

Transcode

Summarize

Node Merging

Representing the Complete Web page

When to Summarize?

Creating a label

Creating a Summary

Output Level Decided

Use Ontology to re-format the web page

XML Structure Derived

Device Specific Display
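The last four steps above (output level decided, ontology-driven re-formatting, XML structure, device-specific display) can be sketched as a toy Python pipeline. Everything in it is an assumption made for illustration: the role priorities standing in for the ontology, the length-based rule for choosing an output level, the truncation used in place of real summarization, and all function names. The transcode level is left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the ontology: display priority per block role.
ROLE_PRIORITY = {"headline": 0, "body": 1, "navigation": 2, "advert": 3}


def decide_level(block, max_chars):
    """Hypothetical rule: short text is shown verbatim, long text is summarized."""
    return "verbatim" if len(block["text"]) <= max_chars else "summarize"


def summarize(text, max_chars):
    """Placeholder summary: cut at a word boundary. A real system would use NLP."""
    return text[:max_chars].rsplit(" ", 1)[0] + "..."


def reauthor(blocks, max_chars):
    """Order blocks by role priority and derive an XML structure from them."""
    page = ET.Element("page")
    ordered = sorted(blocks, key=lambda b: ROLE_PRIORITY.get(b["role"], len(ROLE_PRIORITY)))
    for block in ordered:
        level = decide_level(block, max_chars)
        text = block["text"] if level == "verbatim" else summarize(block["text"], max_chars)
        node = ET.SubElement(page, "block", {"role": block["role"], "level": level})
        node.text = text
    return page


def render_for_device(page):
    """Device-specific display: one plain-text line per block."""
    return "\n".join(f"[{b.get('role')}] {b.text}" for b in page)


demo_blocks = [
    {"role": "body", "text": "The quick brown fox jumps over the lazy dog"},
    {"role": "headline", "text": "Fox news"},
]
demo_page = reauthor(demo_blocks, max_chars=20)
print(render_for_device(demo_page))
```

Separating the XML structure from the final rendering means that a different device only needs a different `render_for_device`, while the re-authoring step stays the same.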

What is the Advantage of using Ontology?

It improves the quality of the output in many ways.

It becomes possible to capture the contextual relationship among various components within the document.

It leads to better understanding of the information contained within the document.

This additional information can be used in other processes, such as document categorization and contextual search.

Future Work

It is assumed that the future of mobile browsing lies in the adoption of semantic web technology.

Until that is realized, the proposed approach offers a workable compromise for generating high-fidelity re-authored web pages.

This is an exploratory paper offering a specific pathway to the future of web page re-authoring, provided accurate layout information is available.

Currently, it is beyond the capability of any algorithm to achieve this level of accuracy. However, approximations to that accuracy are attainable and even practical. It will be interesting to discuss other possibilities in this space.

Conclusions

We presented some ideas on how to produce better web page re-authoring solutions by using linguistic knowledge and ontology, assuming accurate layout information for web pages is available.

It is shown that such an approach will produce high-quality, intelligent summaries of web pages, allowing fast and efficient web browsing on small-display handheld devices.