Information Extraction on Real Estate Rental Classifieds Eddy Hartanto Ryohei Takahashi

Post on 21-Dec-2015


Page 1: Information Extraction on Real Estate Rental Classifieds Eddy Hartanto Ryohei Takahashi

Information Extraction on Real Estate Rental Classifieds

Eddy Hartanto

Ryohei Takahashi

Page 2: Overview

We want to extract 10 fields:

- Security deposit
- Square footage
- Number of bathrooms
- Contact person’s name
- Contact phone number
- Nearby landmarks
- Cost of parking
- Date available
- Building style / architecture
- Number of units in building

These fields can’t easily be served by keyword search.

Page 3: Approach

- Hand-labeled test set as the basis for computing precision and recall
- Pattern-matching approach with Rapier
- Statistical approach using HMMs with different structures

Page 4: Demo …

Page 5: Hidden Markov Models

- We consider three different HMM structures.
- We train one HMM per field.
- Words in postings are the output symbols of the HMM.
- Hexagons represent target states, which emit the relevant words for that field.

Page 6: Training Data

- We use a randomly selected set of 110 postings as the training data.
- We manually label which words in each posting are relevant to each of the 10 fields.

Page 7: HMM Structure #1

- A single prefix state and a single suffix state
- Prefixes and suffixes can be of arbitrary length
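As a rough illustration of how a structure like this can pull a field out of a posting, the sketch below runs Viterbi decoding over a three-state chain (prefix, target, suffix). The slide's diagram is not preserved in this transcript, so the exact topology, all probabilities, the toy vocabulary, and the names `TRANS`, `EMIT`, `FLOOR`, `viterbi`, and `extract` are assumptions made for illustration only. The self-loops on the prefix and suffix states are what allow prefixes and suffixes of arbitrary length.

```python
import math

STATES = ["prefix", "target", "suffix"]

# Hypothetical transition probabilities. The self-loops on "prefix" and
# "suffix" let those states absorb any number of words.
TRANS = {
    "prefix": {"prefix": 0.8, "target": 0.2},
    "target": {"target": 0.5, "suffix": 0.5},
    "suffix": {"suffix": 1.0},
}
START = {"prefix": 1.0}  # assumption: every posting starts in the prefix state

# Hypothetical emission probabilities over a toy vocabulary.
EMIT = {
    "prefix": {"security": 0.3, "deposit": 0.3, "is": 0.3},
    "target": {"$500": 0.9},
    "suffix": {"call": 0.5, "now": 0.4},
}
FLOOR = 1e-6  # small probability assigned to words a state has never emitted


def log_p(p):
    """Log of a probability, mapping zero to negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(words):
    """Return the most likely state sequence for a non-empty list of words."""
    # best[s] = (log probability of the best path ending in s, that path)
    best = {
        s: (log_p(START.get(s, 0.0)) + math.log(EMIT[s].get(words[0], FLOOR)), [s])
        for s in STATES
    }
    for word in words[1:]:
        new_best = {}
        for s in STATES:
            score, path = max(
                ((best[r][0] + log_p(TRANS[r].get(s, 0.0)), best[r][1]) for r in STATES),
                key=lambda cand: cand[0],
            )
            new_best[s] = (score + math.log(EMIT[s].get(word, FLOOR)), path + [s])
        best = new_best
    return max(best.values(), key=lambda cand: cand[0])[1]


def extract(words):
    """Keep the words that the decoded path assigns to the target state."""
    return [w for w, s in zip(words, viterbi(words)) if s == "target"]
```

Calling `extract(["security", "deposit", "is", "$500", "call", "now"])` keeps only the word decoded as a target emission. Working in log space here serves the same purpose as normalization in the forward pass: it avoids multiplying many small probabilities together.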

Page 8: HMM Structure #2

Varying numbers of prefix, suffix, and target states

Page 9: HMM Structure #3

- Varying numbers of prefix, suffix, and target states
- Prefixes and suffixes are fixed in length

Page 10: Cross-Validation

We use cross-validation to find the optimal number of prefix, suffix, and target states.
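A minimal sketch of this kind of model selection follows. The slides do not say how many folds were used, which state counts were tried, or which score was optimized, so the fold count `k`, the candidate grid, and the caller-supplied `train_and_score` function are all hypothetical placeholders.

```python
from itertools import product


def k_fold_splits(items, k):
    """Yield (train, held_out) pairs, using each of k folds as held-out data once."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]


def select_structure(postings, candidates, train_and_score, k=5):
    """Return the candidate with the highest mean held-out score, and that score.

    train_and_score(candidate, train, held_out) is a hypothetical callable that
    trains an HMM with the given (n_prefix, n_target, n_suffix) state counts and
    returns a score such as the F-measure on the held-out postings.
    """
    best, best_score = None, float("-inf")
    for candidate in candidates:
        scores = [
            train_and_score(candidate, train, held_out)
            for train, held_out in k_fold_splits(postings, k)
        ]
        mean = sum(scores) / len(scores)
        if mean > best_score:
            best, best_score = candidate, mean
    return best, best_score


# Hypothetical grid: 1 to 3 states each for prefix, target, and suffix.
CANDIDATES = list(product(range(1, 4), repeat=3))
```

Because one HMM is trained per field, a search like this would presumably be repeated for each field separately.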

Page 11: Preventing Underflow

- Postings are hundreds of words long.
- Forward and backward probabilities become incredibly small, which causes underflow.
- To avoid underflow, we normalize the forward probabilities, using

$$\hat{\alpha}_t(i) = P(q_t = S_i \mid O_1, \ldots, O_t, \lambda)$$

instead of

$$\alpha_t(i) = P(O_1, \ldots, O_t, q_t = S_i \mid \lambda)$$
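The normalized forward pass described above can be sketched as follows. This is an illustrative implementation, not the authors' code: the function name `forward_scaled`, the list-of-lists parameter layout, and the use of integer symbol indices are assumptions. At each step the raw forward values are divided by their sum, so each stored row is the distribution over states given the observations so far, and the log-likelihood is recovered as the sum of the logs of the normalizers.

```python
import math


def forward_scaled(start, trans, emit, obs):
    """Normalized forward pass for a discrete HMM.

    start[i]    : probability of starting in state i
    trans[j][i] : probability of moving from state j to state i
    emit[i][o]  : probability that state i emits symbol index o
    obs         : list of symbol indices

    Returns (alpha_hat, log_likelihood), where alpha_hat[t][i] is the
    probability of being in state i at time t given obs[0..t].
    Assumes every observation has non-zero probability under the model.
    """
    n = len(start)
    alpha_hat = []
    log_likelihood = 0.0
    prev = None
    for t, o in enumerate(obs):
        if t == 0:
            raw = [start[i] * emit[i][o] for i in range(n)]
        else:
            raw = [
                sum(prev[j] * trans[j][i] for j in range(n)) * emit[i][o]
                for i in range(n)
            ]
        norm = sum(raw)  # probability of obs[t] given obs[0..t-1]
        prev = [x / norm for x in raw]
        alpha_hat.append(prev)
        log_likelihood += math.log(norm)
    return alpha_hat, log_likelihood
```

With this scheme, a sequence of several hundred or several thousand words still yields a finite log-likelihood, whereas the unnormalized forward values would round to zero in double precision.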

Page 12: Smoothing

We perform add-one smoothing for the emission probabilities:

$$b_i^*(k) = \frac{\sum_{t=1,\, O_t = v_k}^{T} \gamma_t(i) + 1}{\sum_{t=1}^{T} \gamma_t(i) + M}$$
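A small sketch of add-one smoothing for emission probabilities is given below. It assumes that the per-time-step state posteriors (written `gamma[t][i]` here) have already been computed, for example by a forward-backward pass, and that `num_symbols` is the vocabulary size M. The function name and the data layout are illustrative choices, not taken from the slides.

```python
def smoothed_emissions(gamma, obs, num_symbols):
    """Add-one smoothed emission estimates.

    gamma[t][i] : posterior probability of being in state i at time t
    obs[t]      : symbol index observed at time t, in range(num_symbols)
    num_symbols : vocabulary size M

    Returns b where b[i][k] estimates the probability that state i emits symbol k.
    """
    n_states = len(gamma[0])
    b = []
    for i in range(n_states):
        # Expected number of time steps spent in state i.
        total = sum(g[i] for g in gamma)
        row = []
        for k in range(num_symbols):
            # Expected number of times state i emitted symbol k.
            expected = sum(g[i] for g, o in zip(gamma, obs) if o == k)
            row.append((expected + 1.0) / (total + num_symbols))
        b.append(row)
    return b
```

Each row still sums to one, and a symbol that a state never emitted in training gets a small non-zero probability instead of zero, which matters with only about a hundred training postings.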

Page 13: Rapier

- Rapier automatically learns rules to extract fields from training examples.
- We use the same 110 training postings as for the HMMs.

Page 14: Data Preparation

- Sentence Splitter (Cognitive Computation Group at UIUC, http://l2r.cs.uiuc.edu/~cogcomp/tools.php): puts one sentence on each line
- Stanford Tagger (Stanford NLP Group, http://nlp.stanford.edu/software/tagger.shtml): tags each word with its part of speech
- We then manually create a template file for each of the files, with the information for the 10 fields filled in

Page 15: Test Data

- We use a randomly selected set of 100 postings as the test data.
- We manually label these 100 postings with the fields.

Page 16: Rapier Results

We use Rapier’s “test2” program to evaluate performance on the labeled postings.

| Data set | Precision | Recall | F-measure |
|---|---|---|---|
| Training Set | 0.990099 | 0.408998 | 0.578871 |
| Test Set | 0.747126 | 0.151869 | 0.252427 |

Page 17: Another run at Rapier

Overall: Precision 0.847, Recall 0.201, F-measure 0.324

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 10 | 10 | 1 | 0.417 | 0.588 |
| no_bathrooms | 58 | 28 | 25 | 0.893 | 0.431 | 0.581 |
| contact_person | 40 | 28 | 24 | 0.857 | 0.6 | 0.706 |
| contact_phone | 93 | 2 | 1 | 0.5 | 0.011 | 0.021 |
| nearby_landmarks | 76 | 8 | 5 | 0.625 | 0.066 | 0.119 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 1 | 0 | 0 | 0 | 0 |
| building_style | 6 | 4 | 4 | 1 | 0.667 | 0.8 |
| no_units | 14 | 4 | 3 | 0.75 | 0.214 | 0.333 |
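The per-field and overall figures for this Rapier run are consistent with the usual definitions of precision, recall, and F-measure, with the overall row pooling the counts across all ten fields. The sketch below recomputes them from the counts reported above. The function names and the convention of reporting 0 when nothing was retrieved are assumptions, chosen to match how the zero rows appear in the table.

```python
def prf(correct, retrieved, correct_retrieved):
    """Precision, recall, and F-measure from raw counts.

    correct           : number of gold-standard field values
    retrieved         : number of values the extractor returned
    correct_retrieved : number of returned values that were correct
    Ratios with a zero denominator are reported as 0.0.
    """
    precision = correct_retrieved / retrieved if retrieved else 0.0
    recall = correct_retrieved / correct if correct else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# (correct, retrieved, correct & retrieved) per field, copied from the Rapier table.
RAPIER_COUNTS = {
    "security_deposit": (23, 0, 0),
    "square_footage": (24, 10, 10),
    "no_bathrooms": (58, 28, 25),
    "contact_person": (40, 28, 24),
    "contact_phone": (93, 2, 1),
    "nearby_landmarks": (76, 8, 5),
    "parking_cost": (4, 0, 0),
    "date_available": (21, 1, 0),
    "building_style": (6, 4, 4),
    "no_units": (14, 4, 3),
}


def overall(counts):
    """Pool the counts over all fields, then compute the three measures."""
    correct = sum(c for c, _, _ in counts.values())
    retrieved = sum(r for _, r, _ in counts.values())
    correct_retrieved = sum(cr for _, _, cr in counts.values())
    return prf(correct, retrieved, correct_retrieved)
```

Rounding `overall(RAPIER_COUNTS)` to three decimals reproduces the overall row of this run (0.847, 0.201, 0.324), which suggests the overall numbers are pooled over fields rather than averaged per field.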

Page 18: HMM Structure #1

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 0 | 0 | 0 | 0 | 0 |
| no_bathrooms | 58 | 100 | 17 | 0.17 | 0.293 | 0.215 |
| contact_person | 40 | 100 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 41 | 26 | 0.634 | 0.28 | 0.388 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 59 | 0 | 0 | 0 | 0 |
| date_available | 21 | 100 | 0 | 0 | 0 | 0 |
| building_style | 6 | 100 | 2 | 0.02 | 0.333 | 0.038 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.09, Recall 0.125, F-measure 0.105

Page 19: HMM Structure #2

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 0 | 0 | 0 | 0 | 0 |
| contact_person | 40 | 0 | 0 | 0 | 0 | 0 |
| contact_phone | 93 | 100 | 7 | 0.07 | 0.075 | 0.073 |
| nearby_landmarks | 76 | 0 | 0 | 0 | 0 | 0 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 0 | 0 | 0 | 0 | 0 |
| building_style | 6 | 3 | 0 | 0 | 0 | 0 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.134, Recall 0.042, F-measure 0.064

Page 20: HMM Structure #3

| Field | Correct | Retrieved | Correct & Retrieved | Precision | Recall | F-measure |
|---|---|---|---|---|---|---|
| security_deposit | 23 | 0 | 0 | 0 | 0 | 0 |
| square_footage | 24 | 9 | 8 | 0.889 | 0.333 | 0.485 |
| no_bathrooms | 58 | 100 | 37 | 0.37 | 0.638 | 0.468 |
| contact_person | 40 | 100 | 4 | 0.04 | 0.1 | 0.057 |
| contact_phone | 93 | 100 | 6 | 0.06 | 0.065 | 0.062 |
| nearby_landmarks | 76 | 100 | 7 | 0.07 | 0.092 | 0.08 |
| parking_cost | 4 | 0 | 0 | 0 | 0 | 0 |
| date_available | 21 | 4 | 0 | 0 | 0 | 0 |
| building_style | 6 | 31 | 1 | 0.032 | 0.167 | 0.054 |
| no_units | 14 | 0 | 0 | 0 | 0 | 0 |

Overall: Precision 0.142, Recall 0.175, F-measure 0.157

Page 21: Insights

- Relatively good performance with Rapier (high precision, though low recall).
- Weaker performance with the HMMs, due to a lack of training data (only 0.67%, or 100 postings sampled randomly from 15000), while the test data is 10%, or 1500 postings sampled from the same 15000.
- Automatic spelling correction has its limits, even when enhanced with California town, city, and county names and with people's first names.
- A more advanced ontology would help, as WordNet is somewhat limited: it should recognize entities such as SJSU, Albertson, and street names.

Page 22: Question & Answer