

Machine Learning for the Semantic Web, Feb 14th 2005

Information extraction from HTML product catalogues

Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹

{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

¹ Dept. of Information and Knowledge Engineering, Prague University of Economics

² Dept. of Applied Mathematics, Technical University of Ostrava

Coupling quantitative and knowledge-based approaches


Agenda

• Overview of the Rainbow project
• Extraction of product offers
  – Annotation using HMMs
  – Impact of image information
  – Ontology-based instance extraction
  – Search interface

• Future work


Rainbow overview

• Goal
  – to present the content and structure of legacy websites to a user or computer agent
• How
  – multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge
• Modular architecture, web services
  – information extraction (HMMs)
  – discovery of website navigation structure (link graph)
  – image classifiers (histograms, dimensions, similarity)
  – URL classifier (rule-based)
  – extractor of summarizing sentences (bootstrapped indicator keywords)


Application of Rainbow


Extraction of product offers

• Combines
  – automatic document annotation using HMMs
  – image classifier
  – ontology-based instance composition
  – URL classifier for focused crawling
  – structured search interface powered by Sesame
• The data
  – over 1000 bicycle offers (labeled using 15 attributes)
  – in 100 pages from different websites


Sample data


Preprocessing

• HTML cleanup
  – conversion to valid XHTML
• Only potentially relevant blocks kept
  – blocks that do not directly contain text or images omitted
• Formatting tags
  – attributes removed
  – several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
  – baseline: all images treated as a single token
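The tag-level steps above (attributes removed, all images mapped to one token) can be sketched as follows. This is a minimal illustration using Python's standard `html.parser`, not the Rainbow implementation; the names `IMAGE_TOKEN`, `Tokenizer` and `tokenize` are hypothetical, and the XHTML cleanup, block filtering and construction-matching rules are not covered.

```python
from html.parser import HTMLParser

IMAGE_TOKEN = "<IMG>"  # hypothetical name for the single baseline image token


class Tokenizer(HTMLParser):
    """Flattens markup into tokens: attribute-free tags, words, image tokens."""

    def __init__(self):
        super().__init__()
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored, so <td class="x"> becomes <td>.
        if tag == "img":
            self.tokens.append(IMAGE_TOKEN)
        else:
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # Self-closing <img/> also triggers an end-tag event; skip it.
        if tag != "img":
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words become individual tokens.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print(tokenize('<td class="x"><img src="a.jpg"/><b>Trek 4300</b> 499 EUR</td>'))
```

The resulting token sequence is the kind of input an HMM annotator could consume, with formatting tags and words sharing one stream.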


Annotation using HMMs

• HMM structure
  – target, prefix, suffix and background states
  – adopted from [Freitag, McCallum 99]
• Single tag trigram model for all tags
• F-measures
  – 83% for name, 89% for price
  – 56% average for 13 other attributes (17-90%)
• Variations
  – word n-gram models for lexical probabilities of target states
  – state substructures instead of single target states, learned by EM
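To illustrate how the four state types interact during annotation, here is a minimal Viterbi decoding sketch. It is a simplification under stated assumptions: a first-order HMM for a single attribute (the slides describe a tag trigram model), with toy, hand-set probabilities that are purely illustrative and not taken from the experiments. All names (`STATES`, `viterbi`, and so on) are hypothetical.

```python
import math

STATES = ["background", "prefix", "target", "suffix"]

# Toy parameters (assumptions for illustration only).
START_P = {"background": 1.0}
TRANS_P = {
    "background": {"background": 0.5, "prefix": 0.5},
    "prefix": {"target": 1.0},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"background": 1.0},
}
EMIT_P = {
    "background": {"Price": 0.5, "bike": 0.5},
    "prefix": {":": 1.0},
    "target": {"499": 1.0},
    "suffix": {"EUR": 1.0},
}


def viterbi(tokens, states, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most likely state sequence for a non-empty token list."""

    def lp(p):
        # Log-probability with a floor so unseen events stay finite.
        return math.log(max(p, floor))

    # Each column maps state -> (best log score, best previous state).
    columns = [{s: (lp(start_p.get(s, 0.0)) + lp(emit_p[s].get(tokens[0], 0.0)), None)
                for s in states}]
    for token in tokens[1:]:
        last = columns[-1]
        column = {}
        for s in states:
            prev = max(states, key=lambda p: last[p][0] + lp(trans_p[p].get(s, 0.0)))
            score = (last[prev][0] + lp(trans_p[prev].get(s, 0.0))
                     + lp(emit_p[s].get(token, 0.0)))
            column[s] = (score, prev)
        columns.append(column)

    # Backtrack from the best final state.
    state = max(states, key=lambda s: columns[-1][s][0])
    path = [state]
    for column in reversed(columns[1:]):
        state = column[state][1]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    print(viterbi(["Price", ":", "499", "EUR"], STATES, START_P, TRANS_P, EMIT_P))
```

Tokens decoded into the target state would be labeled with the attribute; prefix and suffix states model the context immediately around it.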


Impact of image information

• Image classifier
  – classifies into 3 classes: Pos, Neg, Unk
  – before HMM annotation, each image occurrence in a document is substituted by its class
  – best result: 6.6% error rate for binary classification with a multi-layer perceptron (Weka)
• Features used for classification
  – dimensions (estimated 2-dimensional normal distribution)
  – similarity (latent semantic similarity [Praks 2004])
  – whether the same image repeats in the same document
• Results
  – image precision increased by 19.1%, recall by 2%
  – improvements for other tags negligible
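The dimensions feature can be illustrated with a small sketch: fit a 2-dimensional normal distribution to the (width, height) pairs of known product images and score a new image by its log density. This is an assumed reading of the bullet above; the slides do not say how the density enters the classifier, and the sample sizes and function names below are hypothetical.

```python
import math


def fit_gaussian_2d(points):
    """Estimate mean and (unbiased) covariance of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in points) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def log_density(point, mean, cov):
    """Log density of a 2-D normal, using the closed-form 2x2 inverse."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2  # must be positive for a valid density
    maha = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * maha - math.log(2 * math.pi) - 0.5 * math.log(det)


# Hypothetical (width, height) pairs of positive example images.
PRODUCT_SIZES = [(200, 150), (220, 160), (180, 140), (210, 170), (190, 130)]

if __name__ == "__main__":
    mean, cov = fit_gaussian_2d(PRODUCT_SIZES)
    print(log_density((205, 155), mean, cov))  # typical product picture size
    print(log_density((10, 10), mean, cov))    # tiny icon, far less likely
```

Such a score could then serve as one input feature alongside the similarity and repetition features listed above.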


Ontology-based instance extraction

[Diagram components: document annotated by HMM; presentation ontology; instance extraction algorithm; instances (XML); Sesame RDF repository]


[Figure panels: Domain ontology; Presentation ontology]


Instance extraction algorithm

• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent.

    WI = empty_instance;
    while (more_attributes) {
        A = next_attribute;
        if (cannot_add(WI, A)) {
            if (consistent(WI)) {
                store(WI);
            }
            WI = empty_instance;
        }
        add(WI, A);
    }
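A runnable sketch of this loop in Python follows. The two constraint checks are hypothetical stand-ins for the presentation ontology: here an attribute cannot be added if the instance already has one with the same name, and an instance is consistent only if it has both a name and a price. Flushing the last working instance after the loop is also an assumption; the pseudocode on the slide ends without storing it.

```python
def cannot_add(instance, attr):
    """Hypothetical constraint: each attribute name occurs at most once."""
    return any(name == attr[0] for name, _ in instance)


def consistent(instance):
    """Hypothetical constraint: an offer needs at least a name and a price."""
    names = {name for name, _ in instance}
    return {"name", "price"} <= names


def extract_instances(attributes, cannot_add, consistent):
    """Group a sequence of (name, value) attributes into instances."""
    stored = []
    working = []
    for attr in attributes:
        if cannot_add(working, attr):
            # Close the current instance; keep it only if it is consistent.
            if consistent(working):
                stored.append(working)
            working = []
        working.append(attr)
    # Assumption: also flush the final working instance.
    if consistent(working):
        stored.append(working)
    return stored


ANNOTATED = [
    ("name", "Trek 4300"), ("price", "499"),
    ("name", "Author Basic"), ("picture", "img1"),  # no price: dropped
    ("name", "Scott Aspect"), ("price", "620"),
]

if __name__ == "__main__":
    for instance in extract_instances(ANNOTATED, cannot_add, consistent):
        print(instance)
```

Passing the checks in as functions keeps the loop independent of any particular ontology encoding.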


Search interface powered by Sesame


Future work

• Learn to correct annotation errors
  – use document structure to detect unlabeled attributes
  – bootstrap from these new examples
  – use ontology constraints on values (types, lists, regexps)
• Population algorithm
  – utilize scores for each annotated attribute
  – augment presentation ontology with frequencies of attribute orderings
  – use approximate name matching to identify instances
• Improve search interface
  – approximate name matching (word and char edit distance)
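As a sketch of what word and character edit distance could look like for approximate name matching, the function below computes the standard Levenshtein distance over any two sequences, so the same code handles characters (strings) and words (lists of tokens). The function name and the sample product names are hypothetical; the slides do not specify which edit-distance variant is planned.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    # Character level: tolerant of typos within a word.
    print(edit_distance("Trek 4300", "Trec 4300"))
    # Word level: tolerant of added or missing words in a product name.
    print(edit_distance("Trek 4300 Disc".split(), "Trek 4300".split()))
```

Character distance suits misspellings, while word distance suits names that differ by an extra model qualifier.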


Thank you!

rainbow.vse.cz