prie.ppt

Machine Learning for the Semantic Web, Feb 14th 2005

Information extraction from HTML product catalogues

Martin Labský1, Vojtěch Svátek1, Pavel Praks2, Ondřej Šváb1

{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

1 Dept. of Information and Knowledge Engineering, Prague University of Economics

2 Dept. of Applied Mathematics, Technical University of Ostrava

Coupling quantitative and knowledge-based approaches


Agenda

• Overview of the Rainbow project
• Extraction of product offers
  – Annotation using HMMs
  – Impact of image information
  – Ontology-based instance extraction
  – Search interface
• Future work


Rainbow overview

• Goal
  – to present the content and structure of legacy websites to a user or computer agent
• How
  – multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge
• Modular architecture, web services
  – information extraction (HMMs)
  – discovery of website navigation structure (link graph)
  – image classifiers (histograms, dimensions, similarity)
  – URL classifier (rule-based)
  – extractor of summarizing sentences (bootstrapped indicator keywords)


Application of Rainbow


Extraction of product offers

• Combines
  – automatic document annotation using HMMs
  – image classifier
  – ontology-based instance composition
  – URL classifier for focused crawling
  – structured search interface powered by Sesame
• The data
  – over 1000 bicycle offers (labeled using 15 attributes)
  – in 100 pages from different websites


Sample data


Preprocessing

• HTML cleanup
  – conversion to valid XHTML
• Only potentially relevant blocks kept
  – blocks that do not directly contain text or images omitted
• Formatting tags
  – attributes removed
  – several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
  – baseline: all images treated as a single token
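The token stream that results from these steps can be illustrated with a minimal sketch. It assumes Python's standard `html.parser` and a hypothetical token name `<IMG>`; the block filtering and the rules for add-to-basket forms are not reproduced, and none of this is the Rainbow implementation.

```python
from html.parser import HTMLParser

class CatalogueTokenizer(HTMLParser):
    """Flatten (X)HTML into tokens: bare tags, words, one <IMG> per image."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are dropped; every image maps to the same token.
        self.tokens.append("<IMG>" if tag == "img" else f"<{tag}>")

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <img ... /> yield a single token.
        self.handle_starttag(tag, attrs)

    def handle_endtag(self, tag):
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words become individual tokens.
        self.tokens.extend(data.split())

def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens

print(tokenize('<td class="p"><b>Trek 4300</b> <img src="a.jpg"/> 399 EUR</td>'))
# ['<td>', '<b>', 'Trek', '4300', '</b>', '<IMG>', '399', 'EUR', '</td>']
```

Collapsing all images into one token is the baseline named above; it leaves the sequence model no way to tell a product photo from a logo.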


Annotation using HMMs

• HMM structure
  – target, prefix, suffix and background states
  – adopted from [Freitag, McCallum 99]
• Single tag trigram model for all tags
• F-measures
  – 83% for name, 89% for price
  – 56% average for 13 other attributes (17-90%)
• Variations
  – word-ngram models for lexical probabilities of target states
  – state substructures instead of single target states, learned by EM
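To make the role of the four state types concrete, here is a toy sketch of Viterbi decoding over one background, prefix, target and suffix state. It is a first-order simplification, not the trigram model from the slide, and every probability and token below is invented for illustration.

```python
import math

def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Most probable state sequence for tokens, computed in log space."""
    def lp(p):
        # Floor unseen events so that log(0) never occurs (crude smoothing).
        return math.log(max(p, floor))

    # best[s] = (log-probability, path) of the best path ending in state s
    best = {s: (lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(tokens[0], 0.0)), [s])
            for s in states}
    for tok in tokens[1:]:
        nxt = {}
        for s in states:
            score, path = max((best[r][0] + lp(trans_p[r].get(s, 0.0)), best[r][1])
                              for r in states)
            nxt[s] = (score + lp(emit_p[s].get(tok, 0.0)), path + [s])
        best = nxt
    return max(best.values())[1]

# Hypothetical parameters for a single "price" attribute.
STATES = ["background", "prefix", "target", "suffix"]
START = {"background": 1.0}
TRANS = {
    "background": {"background": 0.7, "prefix": 0.3},
    "prefix": {"target": 1.0},
    "target": {"target": 0.3, "suffix": 0.7},
    "suffix": {"background": 1.0},
}
EMIT = {
    "background": {"new": 0.5, "bike": 0.5},
    "prefix": {"Price:": 1.0},
    "target": {"399": 1.0},
    "suffix": {"EUR": 1.0},
}

labels = viterbi(["new", "bike", "Price:", "399", "EUR"], STATES, START, TRANS, EMIT)
print(labels)
# ['background', 'background', 'prefix', 'target', 'suffix']
```

Tokens decoded as target form the extracted attribute value; the prefix and suffix states model the context that typically surrounds it.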


Impact of image information

• Image classifier
  – classifies into 3 classes: Pos, Neg, Unk
  – before HMM annotation, each image occurrence in a document is substituted by its class
  – best result 6.6% error rate for binary classification with multi-layer perceptron (Weka)
• Features used for classification
  – dimensions (estimated 2-dimensional normal distribution)
  – similarity (latent semantic similarity [Praks 2004])
  – whether the same image repeats in the same document
• Results
  – image precision increased by 19.1%, recall by 2%
  – improvements for other tags negligible
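The substitution step can be sketched as follows. The token names (`<IMG>`, `<IMG_Pos>` and so on) and the hand-written `toy_classifier` rule are assumptions that stand in for the trained classifier; only the "repeats in the same document" feature and the image dimensions come from the slide, and the similarity feature is omitted.

```python
from collections import Counter

def image_features(images):
    """images: list of (src, width, height) tuples in document order."""
    counts = Counter(src for src, _, _ in images)
    return [{"width": w, "height": h, "repeats": counts[src] > 1}
            for src, w, h in images]

def toy_classifier(feat):
    # Hypothetical rule, NOT the trained model: repeated images are
    # treated as decoration, large unique images as product photos.
    if feat["repeats"]:
        return "Neg"
    if feat["width"] >= 100 and feat["height"] >= 100:
        return "Pos"
    return "Unk"

def substitute_images(tokens, images, classify=toy_classifier):
    """Replace the i-th <IMG> token by the class of the i-th image."""
    classes = iter([classify(f) for f in image_features(images)])
    return [f"<IMG_{next(classes)}>" if t == "<IMG>" else t for t in tokens]

tokens = ["<IMG>", "Trek", "4300", "<IMG>", "<IMG>"]
images = [("logo.gif", 80, 20), ("trek.jpg", 200, 150), ("logo.gif", 80, 20)]
print(substitute_images(tokens, images))
# ['<IMG_Neg>', 'Trek', '4300', '<IMG_Pos>', '<IMG_Neg>']
```

After substitution the HMM sees three distinct image tokens instead of one, which is what allows image class to influence the annotation.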


Ontology-based instance extraction

[Diagram with components: document annotated by HMM, presentation ontology, instance extraction algorithm, instances (XML), Sesame RDF repository]


[Figures: Domain ontology; Presentation ontology]


Instance extraction algorithm

• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, an empty WI is created. The old WI is saved only if it is consistent.

    WI = empty_instance;
    while (more_attributes) {
        A = next_attribute;
        if (cannot_add(WI, A)) {
            if (consistent(WI)) {
                store(WI);
            }
            WI = empty_instance;
        }
        add(WI, A);
    }
    if (consistent(WI)) store(WI);  // otherwise the last instance is lost
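The same loop can be written as a short runnable sketch. The two rules used here are assumptions chosen for illustration: an attribute cannot be added if the instance already has one with that label, and an instance is consistent if it contains every label in `required`. In the system described, such constraints would presumably come from the presentation ontology. The sketch also checks the working instance once more after the loop so that the last offer in a document is not lost.

```python
def extract_instances(attributes, required=("name",)):
    """Group (label, value) pairs, given in document order, into instances."""

    def cannot_add(wi, label):
        # Assumed rule: each attribute occurs at most once per instance.
        return label in wi

    def consistent(wi):
        # Assumed rule: an instance must contain all required attributes.
        return all(r in wi for r in required)

    stored, wi = [], {}
    for label, value in attributes:
        if cannot_add(wi, label):
            if consistent(wi):
                stored.append(wi)
            wi = {}  # start a new working instance
        wi[label] = value
    if consistent(wi):  # final check for the last working instance
        stored.append(wi)
    return stored

annotated = [
    ("name", "Trek 4300"), ("price", "399"),
    ("name", "Author Basic"), ("price", "249"),
    ("price", "199"),  # stray price with no name: dropped as inconsistent
]
print(extract_instances(annotated))
# [{'name': 'Trek 4300', 'price': '399'}, {'name': 'Author Basic', 'price': '249'}]
```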


Search interface powered by Sesame


Future work

• Learn to correct annotation errors
  – use document structure to detect unlabeled attributes
  – bootstrap from these new examples
  – use ontology constraints on values (types, lists, regexps)
• Population algorithm
  – utilize scores for each annotated attribute
  – augment presentation ontology with frequencies of attribute orderings
  – use approximate name matching to identify instances
• Improve search interface
  – approximate name matching (word and char edit distance)
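As an illustration of word and character edit distance for name matching, the following sketch applies plain Levenshtein distance at both granularities. The function names and the lowercasing step are assumptions; the slide does not specify how the two distances would be computed or combined.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution or match
        prev = cur
    return prev[-1]

def char_distance(a, b):
    """Character-level distance, case-insensitive."""
    return edit_distance(a.lower(), b.lower())

def word_distance(a, b):
    """Word-level distance: each whitespace-separated word is one symbol."""
    return edit_distance(a.lower().split(), b.lower().split())

print(char_distance("Trek 4300", "Trek 4500"))       # 1
print(word_distance("Trek 4300 Disc", "trek 4300"))  # 1
```

Character distance tolerates typos inside a model name, while word distance tolerates extra or missing words such as a trim level.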


Thank you!

rainbow.vse.cz
