Machine Learning for the Semantic Web, Feb 14th 2005
Information extraction from HTML product catalogues
Martin Labský (1), Vojtěch Svátek (1), Pavel Praks (2), Ondřej Šváb (1)
{labsky, svatek, xsvao06}@vse.cz, [email protected]
rainbow.vse.cz
(1) Dept. of Information and Knowledge Engineering, Prague University of Economics
(2) Dept. of Applied Mathematics, Technical University of Ostrava
Coupling quantitative and knowledge-based approaches
Agenda
• Overview of the Rainbow project
• Extraction of product offers
– Annotation using HMMs
– Impact of image information
– Ontology-based instance extraction
– Search interface
• Future work
Rainbow overview
• Goal
– to present the content and structure of legacy websites to a user or computer agent
• How
– multiway analysis of websites: utilize features derived from text, images, formatting, URLs, navigation structure and background knowledge
• Modular architecture, web services
– information extraction (HMMs)
– discovery of website navigation structure (link graph)
– image classifiers (histograms, dimensions, similarity)
– URL classifier (rule-based)
– extractor of summarizing sentences (bootstrapped indicator keywords)
Application of Rainbow
Extraction of product offers
• Combines
– automatic document annotation using HMMs
– image classifier
– ontology-based instance composition
– URL classifier for focused crawling
– structured search interface powered by Sesame
• The data
– over 1000 bicycle offers (labeled using 15 attributes)
– in 100 pages from different websites
Sample data
Preprocessing
• HTML cleanup
– conversion to valid XHTML
• Only potentially relevant blocks kept
– blocks that do not directly contain text or images omitted
• Formatting tags
– attributes removed
– several rules matching common constructions (add-to-basket form, choose-amount button)
• Images
– baseline: all images treated as a single token
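A minimal sketch of two of the preprocessing steps above (attribute removal and the single-token image baseline), using only Python's standard `html.parser`. The token spelling `<IMG>`, the `VOID_TAGS` set and the function name `tokenize` are assumptions made for this illustration; the slides do not say how the original system represents tokens, and block filtering and the form-matching rules are not covered here.

```python
from html.parser import HTMLParser

# Hypothetical choices for this sketch (not taken from the slides).
IMAGE_TOKEN = "<IMG>"
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class Tokenizer(HTMLParser):
    """Turns HTML into a flat token list: bare tags, words, image tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            # Baseline: every image becomes the same single token.
            self.tokens.append(IMAGE_TOKEN)
        else:
            # Attributes are dropped; only the tag name is kept.
            self.tokens.append(f"<{tag}>")

    def handle_endtag(self, tag):
        # Void elements have no meaningful closing tag.
        if tag not in VOID_TAGS:
            self.tokens.append(f"</{tag}>")

    def handle_data(self, data):
        # Whitespace-separated words become individual tokens.
        self.tokens.extend(data.split())


def tokenize(html):
    parser = Tokenizer()
    parser.feed(html)
    parser.close()
    return parser.tokens


if __name__ == "__main__":
    print(tokenize('<td class="x"><b>Trek 4300</b> <img src="a.jpg"/> 399 EUR</td>'))
```

A token stream of this kind is what a sequence model such as an HMM would then label.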
Annotation using HMMs
• HMM structure
– target, prefix, suffix and background states
– adopted from [Freitag, McCallum 99]
• Single tag trigram model for all tags
• F-measures
– 83% for name, 89% for price
– 56% average for 13 other attributes (17-90%)
• Variations
– word-ngram models for lexical probabilities of target states
– state substructures instead of single target states, learned by EM
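To illustrate how the four state types label a token sequence, here is a hedged sketch of Viterbi decoding over a toy HMM. It is a simplification: it uses first-order transitions rather than the trigram model mentioned above, and every probability, state name (`bg`, `prefix`, `target`, `suffix`) and the `FLOOR` smoothing value is invented for the example rather than taken from the Rainbow system.

```python
import math

NEG_INF = float("-inf")


def log(p):
    """Natural log that maps probability 0 to -inf instead of raising."""
    return math.log(p) if p > 0 else NEG_INF


# Toy parameters, invented for illustration only.
STATES = ["bg", "prefix", "target", "suffix"]
START = {"bg": 0.7, "prefix": 0.3}
TRANS = {
    "bg": {"bg": 0.8, "prefix": 0.2},
    "prefix": {"target": 1.0},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"bg": 1.0},
}
EMIT = {
    "bg": {"<td>": 0.4, "</td>": 0.4},
    "prefix": {"<b>": 0.9},
    "target": {"Trek": 0.3, "4300": 0.3},
    "suffix": {"</b>": 0.9},
}
FLOOR = 1e-4  # probability given to tokens a state has never emitted


def viterbi(tokens):
    """Return the most probable state sequence for `tokens`."""
    if not tokens:
        return []
    # best[i][s]: log-probability of the best path ending in state s at token i
    best = [{s: log(START.get(s, 0)) + log(EMIT[s].get(tokens[0], FLOOR))
             for s in STATES}]
    back = []
    for tok in tokens[1:]:
        scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[-1][p] + log(TRANS[p].get(s, 0)))
            scores[s] = (best[-1][prev] + log(TRANS[prev].get(s, 0))
                         + log(EMIT[s].get(tok, FLOOR)))
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Follow back-pointers from the best final state.
    state = max(STATES, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    tokens = ["<td>", "<b>", "Trek", "4300", "</b>", "</td>"]
    print(list(zip(tokens, viterbi(tokens))))
```

With these toy parameters the tokens "Trek" and "4300" are labeled `target`, framed by a `prefix` and a `suffix` state; in a real system one such target state (or substructure) would exist per extracted attribute.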
Impact of image information
• Image classifier
– classifies into 3 classes: Pos, Neg, Unk
– before HMM annotation, each image occurrence in a document is substituted by its class
– best result: 6.6% error rate for binary classification with a multi-layer perceptron (Weka)
• Features used for classification
– dimensions (estimated 2-dimensional normal distribution)
– similarity (latent semantic similarity [Praks 2004])
– whether the same image repeats in the same document
• Results
– image precision increased by 19.1%, recall by 2%
– improvements for other tags negligible
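The dimension feature and the substitution step can be sketched as follows. This is only an illustration under stated assumptions: it fits a bivariate normal to the (width, height) pairs of known product images and thresholds the log-density to pick a class token. The thresholds `hi` and `lo`, the token names `IMG_POS`, `IMG_NEG`, `IMG_UNK` and the sample sizes are hypothetical; the system described above combines several features in a multi-layer perceptron rather than thresholding one score.

```python
import math


def fit_normal_2d(points):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(points)
    mx = sum(w for w, _ in points) / n
    my = sum(h for _, h in points) / n
    sxx = sum((w - mx) ** 2 for w, _ in points) / n
    syy = sum((h - my) ** 2 for _, h in points) / n
    sxy = sum((w - mx) * (h - my) for w, h in points) / n
    return (mx, my), (sxx, sxy, syy)


def log_density(point, mean, cov):
    """Log of the bivariate normal density at `point`."""
    (mx, my), (sxx, sxy, syy) = mean, cov
    dx, dy = point[0] - mx, point[1] - my
    det = sxx * syy - sxy ** 2
    # Quadratic form d^T Sigma^{-1} d, using the closed-form 2x2 inverse.
    q = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return -0.5 * q - math.log(2 * math.pi) - 0.5 * math.log(det)


def image_token(size, mean, cov, hi=-8.0, lo=-20.0):
    """Map an image size to a class token (thresholds are hypothetical)."""
    score = log_density(size, mean, cov)
    if score >= hi:
        return "IMG_POS"
    if score <= lo:
        return "IMG_NEG"
    return "IMG_UNK"


if __name__ == "__main__":
    # Made-up sizes of images known to depict products.
    positives = [(100, 80), (120, 90), (110, 100), (90, 70)]
    mean, cov = fit_normal_2d(positives)
    for size in [(105, 85), (10, 10)]:
        print(size, image_token(size, mean, cov))
```

Each image token in the document would then be replaced by the returned class token before annotation, so the sequence model can treat likely product images differently from decorative ones.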
Ontology-based instance extraction
[Diagram labels: Document annotated by HMM; Presentation ontology; Instance extraction algorithm; Instances (XML); Sesame RDF repository]
Domain ontology Presentation ontology
Instance extraction algorithm
• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, a new empty working instance is created. The old working instance is saved only if it is consistent.

    WI = empty_instance;
    while (more_attributes) {
        A = next_attribute;
        if (cannot_add (WI, A)) {
            if (consistent (WI)) {
                store (WI);
            }
            WI = empty_instance;
        }
        add (WI, A);
    }
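A runnable Python sketch of the same loop, for readers who want to experiment with it. The two predicates are passed in as functions because the slides do not spell out how the presentation ontology defines them; the example rules used here (an attribute may occur at most once, and an offer needs both a name and a price) are assumptions. The pseudocode does not show what happens to the last working instance when the input ends; this sketch stores it if it is consistent, which is also an assumption.

```python
def extract_instances(attributes, cannot_add, consistent):
    """Group a stream of (attribute, value) pairs into instances."""
    instances = []
    working = {}
    for name, value in attributes:
        if cannot_add(working, name):
            # Adding would cause an inconsistency: close the current instance.
            if consistent(working):
                instances.append(working)
            working = {}
        working[name] = value
    # Assumption: flush the final working instance (not shown on the slide).
    if consistent(working):
        instances.append(working)
    return instances


# Hypothetical ontology rules, for illustration only.
def at_most_once(instance, name):
    return name in instance


def has_name_and_price(instance):
    return "name" in instance and "price" in instance


if __name__ == "__main__":
    annotated = [
        ("name", "Trek 4300"), ("price", "399"),
        ("name", "Author Basic"), ("price", "250"),
        ("price", "199"),  # stray price with no name: dropped as inconsistent
    ]
    for offer in extract_instances(annotated, at_most_once, has_name_and_price):
        print(offer)
```

Passing the predicates as parameters keeps the loop independent of any particular ontology, so stricter or looser constraints can be tried without touching the algorithm.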
Search interface powered by Sesame
Future work
• Learn to correct annotation errors
– use document structure to detect unlabeled attributes
– bootstrap from these new examples
– use ontology constraints on values (types, lists, regexps)
• Population algorithm
– utilize scores for each annotated attribute
– augment presentation ontology with frequencies of attribute orderings
– use approximate name matching to identify instances
• Improve search interface
– approximate name matching (word and char edit distance)
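Word-level and character-level edit distance, as mentioned for approximate name matching, can share one implementation because Levenshtein distance works on any sequence. The sketch below is a generic textbook version, not the project's code; the helper `name_distance` and the lowercasing step are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # delete x
                curr[j - 1] + 1,         # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def name_distance(a, b):
    """Return (word-level, char-level) distance, ignoring case."""
    a, b = a.lower(), b.lower()
    return edit_distance(a.split(), b.split()), edit_distance(a, b)


if __name__ == "__main__":
    print(name_distance("Trek 4300 Disc", "trek 4300"))
```

Word distance is tolerant of long extra words such as a model suffix, while character distance catches small spelling variations inside a word, so the two are complementary.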
Thank you!
rainbow.vse.cz