multimedia information extraction from html product catalogues

DATESO, April 14th 2005. Multimedia Information extraction from HTML product catalogues. Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹. {labsky, svatek, xsvao06}@vse.cz, [email protected], rainbow.vse.cz. ¹ Dept. of Information and Knowledge Engineering, Prague University of Economics. ² Dept. of Applied Mathematics, Technical University of Ostrava.

Page 1: Multimedia Information extraction from HTML product catalogues

DATESO, April 14th 2005

Multimedia Information extraction from HTML product catalogues

Martin Labský¹, Vojtěch Svátek¹, Pavel Praks², Ondřej Šváb¹

{labsky, svatek, xsvao06}@vse.cz, [email protected]

rainbow.vse.cz

¹ Dept. of Information and Knowledge Engineering, Prague University of Economics

² Dept. of Applied Mathematics, Technical University of Ostrava

Page 2: Multimedia Information extraction from HTML product catalogues


Agenda

• Information Extraction from Internet

• Annotation using Hidden Markov Models

• Extracting images

• Instance composition guided by ontology

• Bicycle search application

Page 3: Multimedia Information extraction from HTML product catalogues

IE from Internet

• Motivation
  – Semantic and structured search over large document collections, e.g. searching for objects of type Bicycle in the price range €500 - €900
• Requirements
  – Identify relevant documents
  – Perform automatic IE: find structures (name, price, equipment)
    • documents are semi-structured and have heterogeneous layouts and formatting

Page 4: Multimedia Information extraction from HTML product catalogues

Our approach to IE

[Diagram: processing pipeline. Acquire new document (HTML) → Preprocessing (token sequence w1 w2 ... wn) → Annotation using HMMs (tokens labelled with classes, e.g. w3 w4 as name, w6 w7 as price, w9 as picture) → Instance extraction (a Bicycle offer instance with name = w3 w4, price = w6 w7, picture = w9)]

Page 5: Multimedia Information extraction from HTML product catalogues

Relevant documents

Page 6: Multimedia Information extraction from HTML product catalogues


Agenda

• Information Extraction from Internet

• Annotation using Hidden Markov Models

• Extracting images

• Instance composition guided by ontology

• Bicycle search application

Page 7: Multimedia Information extraction from HTML product catalogues

Preprocessing

• HTML cleanup
  – conversion to valid XHTML
• Only potentially relevant blocks kept
  – blocks that do not directly contain text or images are omitted
• Formatting tags
  – attributes removed
  – several rules match common constructions (add-to-basket form, choose-amount button)
• Images
  – baseline: all images treated as a single token

Page 8: Multimedia Information extraction from HTML product catalogues

Preprocessing – example

Original HTML source:

<p align="center"><a href="/products.php?plid=m1b0s1p979">
<img src=/smsimg/3/tn_m2366_05tksession77.jpg width=100 height=70 alt="TREK Session 77" border=0><br>
TREK Session 77</a><br>
(2005)<br>
OUR PRICE £3000.00
<form method=post action=/products.php?plid=m1b0s1p0 name=buyit>
<input type=hidden name=cartadditem id=cartadditem value=979>
<select name="selected_size" id="selected_size">
<option value="size not specified">-- Select Size --</option>
<option value="15.5">15.5</option>
<option value=" 17.5"> 17.5</option>
<option value=" 19"> 19</option>
</select><br>
<input type="hidden" name="selected_colour" id="selected_colour" value="default">
<select name=add_qty id=add_qty><option value=0>0</option><option value=1 SELECTED>1</option><option value=2>2</option><option value=3>3</option><option value=4>4</option><option value=5>5</option></select>
<input type=submit name=submit id=submit value="Add to Basket"></form>

Token sequence after preprocessing:

<p> <img/> <br/> TREK Session 77 <br/> ( 2005 ) <br/> OUR PRICE &pound; 3000 . 00 <p> - - Select Size - - 15 . 5 17 . 5 19 <br/> <_CHOOSEAMOUNT/> <_ADDTOBASKET/>
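The attribute-stripping and tokenization part of this preprocessing step might be sketched as follows. This is a simplified illustration, not the authors' implementation: the class name `CatalogueTokenizer`, the set of kept tags and the tokenization regex are all assumptions, and the form-matching rules that produce `<_CHOOSEAMOUNT/>` and `<_ADDTOBASKET/>` are left out.

```python
import re
from html.parser import HTMLParser

# Assumption: only these formatting tags are kept; everything else is dropped.
KEPT_TAGS = {"p", "br", "img"}
VOID_TAGS = {"br", "img"}  # emitted in XHTML style, e.g. <br/>


class CatalogueTokenizer(HTMLParser):
    """Turns HTML into a flat token list: attribute-free tags plus word tokens."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        # Attributes are ignored, so every image becomes the same <img/> token.
        if tag in KEPT_TAGS:
            self.tokens.append(f"<{tag}/>" if tag in VOID_TAGS else f"<{tag}>")

    def handle_data(self, data):
        # Split text into runs of word characters and single punctuation marks.
        self.tokens.extend(re.findall(r"\w+|[^\w\s]", data))


def tokenize(html):
    parser = CatalogueTokenizer()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    return parser.tokens


example = ('<p align="center"><a href="/p?id=1"><img src=a.jpg width=100><br>'
           ' TREK Session 77</a><br> (2005)<br> OUR PRICE £3000.00')
tokens = tokenize(example)
```

Dropping all attributes is what makes the baseline treat every image as one and the same token.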

Page 9: Multimedia Information extraction from HTML product catalogues

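Document modeling using HMMs

• Generative model
• Document = $[w_1 c_1]\,[w_2 c_2]$, where $w_i$ is a word and $c_i$ its class
• $P([w_1 c_1]\,[w_2 c_2]) = P(c_1)\,P(c_2 \mid c_1)\,P(w_1 \mid c_1)\,P(w_2 \mid c_2)$
• $c_1 c_2 = \arg\max_{i,j} P([w_1 c_i]\,[w_2 c_j])$
• Transition probabilities, e.g. $P(c_2 \mid c_1)$, and lexical probabilities, e.g. $P(w_1 \mid c_1)$, are estimated from training data (frequencies)

[Diagram: two states c1 and c2 connected by the transitions P(c2|c1) and P(c1|c2), emitting words with lexical probabilities P(w1|c1) and P(w1|c2)]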

Page 10: Multimedia Information extraction from HTML product catalogues

HMM Structure

• States
  – adopted from [Freitag, McCallum 99]
  – Target, Prefix, Suffix and Background
  – densely connected
• Class trigram model
  – P(name | name_prefix, name)
• Variations
  – word n-gram models for lexical probabilities of target states, P(wi | wi-1, name)
  – state substructures instead of single target states, learned by EM
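A class trigram probability such as P(name | name_prefix, name) can be estimated by relative frequency from annotated class sequences. The sketch below assumes unsmoothed counts and a hypothetical `<s>` padding symbol for document starts; neither detail is specified in the slides.

```python
from collections import Counter


def trigram_probs(class_sequences):
    """Relative-frequency estimates of P(c | a, b) for class trigrams (a, b, c)."""
    tri, bi = Counter(), Counter()
    for seq in class_sequences:
        padded = ["<s>", "<s>"] + list(seq)  # "<s>" marks the start (assumption)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            bi[(a, b)] += 1  # how often the two-class history (a, b) occurred
    return {key: count / bi[key[:2]] for key, count in tri.items()}


# Toy class sequences, invented for illustration only.
probs = trigram_probs([
    ["name_prefix", "name", "name", "name_suffix"],
    ["name_prefix", "name", "name_suffix"],
])
p_name = probs[("name_prefix", "name", "name")]  # P(name | name_prefix, name)
```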

Page 11: Multimedia Information extraction from HTML product catalogues


Agenda

• Information Extraction from Internet

• Annotation using Hidden Markov Models

• Extracting images

• Instance composition guided by ontology

• Bicycle search application

Page 12: Multimedia Information extraction from HTML product catalogues

Extracting Images

• Baseline
  – every image represented by the same <img/> token
  – HMM only extracts product images based on context, e.g. P(product_picture | name, product_picture_prefix)
• Use image classifier to preprocess images
  – classifies into 3 classes: Pos, Neg, Unk
  – before HMM annotation, each image occurrence in the document is substituted by its class
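The substitution step could look like the sketch below. The token spelling `<img_pos/>`, the function name `substitute_images` and the stub classifier are hypothetical; the slides only state that each image occurrence is replaced by its class before HMM annotation.

```python
def substitute_images(tokens, image_urls, classify):
    """Replace the i-th <img/> token by a token naming the i-th image's class.

    `classify` maps an image URL to "Pos", "Neg" or "Unk".
    """
    urls = iter(image_urls)
    out = []
    for tok in tokens:
        if tok == "<img/>":
            out.append(f"<img_{classify(next(urls)).lower()}/>")
        else:
            out.append(tok)
    return out


def stub_classifier(url):
    # Stand-in for the real image classifier (assumption, for illustration only).
    return "Pos" if "tn_" in url else "Unk"


annotatable = substitute_images(
    ["<p>", "<img/>", "TREK", "<img/>"],
    ["/img/tn_a.jpg", "/img/logo.gif"],
    stub_classifier,
)
```

With distinct image tokens, the HMM can learn different lexical probabilities for likely product pictures and for other images.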

Page 13: Multimedia Information extraction from HTML product catalogues

Image Classification – Features

• Image size
  – 2-dimensional normal distribution NC(x, y) estimated from a set of 1000 unique bicycle images
  – decision threshold (1-feature binary classifier) estimated using a held-out set of 150 images (60% positive)
• Image similarity
  – latent semantic similarity sim(I1, I2) [Praks 2004]
  – decision threshold estimated for a 1-feature binary classifier
• Does the image repeat in the document?
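The image size feature can be illustrated with a small sketch: fit a 2-dimensional normal distribution to (width, height) pairs, then choose a density threshold on held-out images. The function names, the accuracy criterion for picking the threshold and the toy sizes are assumptions; the slides do not say how the threshold was selected.

```python
import math


def fit_normal_2d(sizes):
    """Maximum-likelihood mean and covariance of (width, height) pairs."""
    n = len(sizes)
    mx = sum(x for x, _ in sizes) / n
    my = sum(y for _, y in sizes) / n
    sxx = sum((x - mx) ** 2 for x, _ in sizes) / n
    syy = sum((y - my) ** 2 for _, y in sizes) / n
    sxy = sum((x - mx) * (y - my) for x, y in sizes) / n
    return (mx, my), (sxx, sxy, syy)


def density(size, params):
    """Value of the fitted normal density at `size`, playing the role of NC(x, y)."""
    (mx, my), (sxx, sxy, syy) = params
    det = sxx * syy - sxy ** 2  # must be positive (non-degenerate covariance)
    dx, dy = size[0] - mx, size[1] - my
    # Squared Mahalanobis distance via the closed-form inverse of a 2x2 matrix.
    m = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.exp(-0.5 * m) / (2 * math.pi * math.sqrt(det))


def best_threshold(held_out, params):
    """Density threshold that maximises accuracy on labelled held-out images."""
    scored = [(density(size, params), label) for size, label in held_out]

    def accuracy(t):
        return sum((d >= t) == label for d, label in scored) / len(scored)

    return max(sorted({d for d, _ in scored}), key=accuracy)


# Toy sizes, invented for illustration only.
params = fit_normal_2d([(100, 70), (120, 80), (110, 75), (90, 70), (105, 72)])
threshold = best_threshold(
    [((100, 72), True), ((115, 78), True), ((468, 60), False), ((16, 16), False)],
    params,
)
```

An image is then accepted by this one-feature classifier when `density(size, params) >= threshold`.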

Page 14: Multimedia Information extraction from HTML product catalogues

Image Classification

• Combined binary classifier
  – Multi-layer perceptron (Weka)
  – Features: NC(x, y), simC(I), repeats(I)
• Performance of binary classifiers
  – 10-fold cross-validation, document-level folds

Page 15: Multimedia Information extraction from HTML product catalogues

Annotation Results

• Combined ternary classifier
  – outputs Pos, Unk, Neg
  – decision list based on predictions of all 3 single-feature ternary classifiers

Page 16: Multimedia Information extraction from HTML product catalogues


Agenda

• Information Extraction from Internet

• Annotation using Hidden Markov Models

• Extracting images

• Instance composition guided by ontology

• Bicycle search application

Page 17: Multimedia Information extraction from HTML product catalogues

Instance Composition

[Diagram: the document annotated by the HMM and the presentation ontology are inputs to the instance extraction algorithm, which outputs instances (XML) that are stored in a Sesame RDF repository]

Page 18: Multimedia Information extraction from HTML product catalogues

Domain ontology

Presentation Ontology

Page 19: Multimedia Information extraction from HTML product catalogues

Instance extraction algorithm

• Sequentially parses the annotated document
• Adds annotated attributes to the working instance WI
• If adding an attribute would cause an inconsistency, an empty working instance is created. The old working instance is saved only if it is consistent.

WI = empty_instance;
while (more_attributes) {
    A = next_attribute;
    if (cannot_add (WI, A)) {
        if (consistent (WI)) {
            store (WI);
        }
        WI = empty_instance;
    }
    add (WI, A);
}

http://eso.vse.cz/~labsky/cgi-bin/client/
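A runnable reading of the instance extraction loop is sketched below. The constraints are hypothetical stand-ins for what the presentation ontology would supply: here an attribute cannot be added twice, and an instance is consistent only if it has a name. The final flush of the last working instance is also an assumption, since the slide's pseudocode ends with the loop.

```python
def extract_instances(attributes, cannot_add, consistent):
    """Group a stream of (attribute, value) annotations into instances."""
    instances = []
    wi = {}  # working instance
    for name, value in attributes:
        if cannot_add(wi, name, value):
            if consistent(wi):
                instances.append(wi)
            wi = {}
        wi[name] = value
    # Assumption: also keep the last working instance if it is consistent.
    if wi and consistent(wi):
        instances.append(wi)
    return instances


def cannot_add(wi, name, value):
    # Hypothetical constraint: each attribute occurs at most once per instance.
    return name in wi


def consistent(wi):
    # Hypothetical constraint: every bicycle offer needs a name.
    return "name" in wi


annotated = [
    ("name", "TREK Session 77"), ("price", "3000.00"), ("picture", "a.jpg"),
    ("name", "GIANT Rock"), ("price", "450.00"),
    ("price", "399.00"),  # stray price with no name: dropped as inconsistent
]
offers = extract_instances(annotated, cannot_add, consistent)
```

Passing the two predicates as parameters keeps the loop itself independent of any particular ontology.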

Page 20: Multimedia Information extraction from HTML product catalogues


Agenda

• Information Extraction from Internet

• Annotation using Hidden Markov Models

• Extracting images

• Instance composition guided by ontology

• Bicycle search application

Page 21: Multimedia Information extraction from HTML product catalogues


Bicycle search application, powered by Sesame RDF DB

http://rainbow.vse.cz:8000/sesame/

Page 22: Multimedia Information extraction from HTML product catalogues

Future work

• Learn to correct annotation errors
  – use document structure to detect unlabeled attributes
  – bootstrap from these new examples
  – use ontology constraints on values (types, lists, regexps)
• Population algorithm
  – utilize scores for each annotated attribute
  – augment presentation ontology with frequencies of attribute orderings
  – use approximate name matching to identify instances
• Improve search interface
  – approximate name matching (word and char edit distance)

Page 23: Multimedia Information extraction from HTML product catalogues


Thank you!

rainbow.vse.cz