an empirical study on using hidden markov models for search interface segmentation
Post on 11-Jun-2015
AN EMPIRICAL STUDY ON USING HIDDEN MARKOV MODEL
FOR SEARCH INTERFACE SEGMENTATION
Ritu Khare and Yuan An
The iSchool at Drexel
Drexel University, USA
Presentation Order
1. Problem: Interface Segmentation
2. Solution: Hidden Markov Model
3. Empirical Results
4. Summing Up
Part 1
1. Problem: Interface Segmentation
     Motivation: The Deep Web
     Search Interface Segmentation
     Challenges
     Novelty of the Solution
2. Solution: Hidden Markov Model
3. Empirical Results
4. Summing Up
Motivation: The Deep Web
What is the DEEP WEB?
  The portion of the Web that search engines do not return through traditional crawling and indexing.
  Its contents lie in online databases and are accessed by manually filling in HTML forms on search interfaces.
How to make it USEFUL?
  Meta-search engines, e.g. Wu et al. (2004), He et al. (2004), Chang, He and Zhang (2005)
  Deep Web crawlers, e.g. Raghavan and Garcia-Molina (2001), Madhavan et al. (2008)
A prerequisite is a thorough understanding of the semantics of search interfaces.
Search Interface Segmentation
A critical part of understanding the semantics of search interfaces.
  The segmentation of a search interface into logical groups of implied queries.
  The grouping of related interface components together.
[Figure: example search interface with two segments; top segment = 7 components, bottom segment = 4 components]
Why is Segmentation Challenging?
Human designer / user:
  A segment has apparent semantic existence.
  Visual arrangements
  Past experiences
Machine:
  Cannot “see” a segment.
  Visually close components might be located far apart in the HTML code.
  No cognitive ability
In this paper, we investigate whether a machine can “learn” how to segment an interface.
The Novelty of the Solution: Model-Based
Shortcomings of existing works:
  They use rules and heuristics for segmentation (Zhang et al., 2004; He et al., 2004; Raghavan and Garcia-Molina, 2001; Kaljuvee et al., 2001).
  These techniques have problems handling scalability and heterogeneity.
We overcome these shortcomings with a model-based approach.
[Figure: implicit knowledge (used by a designer to design an interface) -> HMM (artificial designer) -> SEGMENTATION]
The Novelty of the Solution: The Domain Aspect
To segment interfaces from a given subject domain …
  Existing works have compared the accuracies attained by two methods, the domain-specific method and the generic method.
Using Hidden Markov Models . . .
  We do not limit ourselves to the comparison between the two methods.
  For a given domain, we investigate what kind of training interfaces result in high segmentation accuracy, and why.
  This is a fresh perspective: the deep Web has diverse domains, and interface designs differ across domains.
[Figure: two ways to segment an interface I from domain Di.
  Domain-specific method: trained on interfaces from domain Di.
  Generic method: trained on interfaces from a mix of arbitrary domains D1, D2, D3 …]
Part 2
1. Problem: Interface Segmentation
2. Solution: Hidden Markov Model
     Hidden Markov Models (HMMs)
     Search Interface Analysis
     HMM: An Artificial Designer
     2-Layered Approach
     Model Specification & Architecture
3. Empirical Results
4. Summing Up
What is an HMM?
“A finite state automaton with stochastic state transitions and symbol emissions” (Rabiner, 1989).
[Figure: a chain of hidden STATES q0 q1 q2 q3 q4 linked by TRANSITIONS; each state qi produces an observable SYMBOL σi through an EMISSION]
1. State Space: a finite set of states {q0, q1, q2, …, qn}.
2. Transition Matrix: probability P(qi → qj) of transitioning from state qi to state qj.
3. Symbol Space: a set of output tokens {σ1, σ2, …, σm}.
4. Emission Matrix: probability P(qi ↑ σk) of state qi emitting the token σk.
Two ‘stochastic processes’: state transitions and symbol emissions. They are needed to model and explain ‘real-world processes’ that are implicit and unobservable.
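The four components listed on the HMM slide can be written down directly as a small data structure. The sketch below is illustrative only: the class name `HMM`, the `start` distribution (the slide lists no initial-state distribution, so this is an added assumption), and all probabilities are invented for the example and are not taken from the study.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class HMM:
    states: List[str]                    # state space {q0, ..., qn}
    symbols: List[str]                   # symbol space {sigma1, ..., sigmam}
    start: Dict[str, float]              # P(first state); an added assumption
    trans: Dict[str, Dict[str, float]]   # trans[qi][qj] = P(qi -> qj)
    emit: Dict[str, Dict[str, float]]    # emit[qi][sigma] = P(qi emits sigma)

    def joint_probability(self, state_seq: List[str], symbol_seq: List[str]) -> float:
        """P(states, symbols): the product of every transition and emission used."""
        if not state_seq or len(state_seq) != len(symbol_seq):
            raise ValueError("need two non-empty sequences of equal length")
        p = self.start[state_seq[0]] * self.emit[state_seq[0]][symbol_seq[0]]
        for prev, cur, sym in zip(state_seq, state_seq[1:], symbol_seq[1:]):
            p *= self.trans[prev][cur] * self.emit[cur][sym]
        return p


# Toy numbers, invented for illustration only.
toy = HMM(
    states=["attribute-name", "operand"],
    symbols=["text", "textbox"],
    start={"attribute-name": 0.8, "operand": 0.2},
    trans={"attribute-name": {"attribute-name": 0.1, "operand": 0.9},
           "operand": {"attribute-name": 0.7, "operand": 0.3}},
    emit={"attribute-name": {"text": 0.9, "textbox": 0.1},
          "operand": {"text": 0.2, "textbox": 0.8}},
)

# Probability that the hidden labels (attribute-name, operand)
# produced the observed components (text, textbox).
p = toy.joint_probability(["attribute-name", "operand"], ["text", "textbox"])
```

With these toy numbers the joint probability is 0.8 × 0.9 × 0.9 × 0.8 = 0.5184: one start term, one transition, and two emissions.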
Search Interface Analysis: Semantic Labels
For data-driven Web applications, interface components are translated into structured query (e.g. SQL) expressions:
  SELECT * FROM Gene WHERE Gene_Name = 'maggie';
A segment in a search interface corresponds to a WHERE clause: it collects values, qualified by a built-in operator, for a particular attribute in the DB schema.
Segmentation is a two-fold problem:
  Identification of the boundaries of logical groups.
  Assignment of semantic labels to components.
[Figure: an interface divided into logical groups; each logical group contains an Attribute-name, an Operator, and an Operand]
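The correspondence between a segment and a WHERE clause can be made concrete with a short sketch. Everything here is hypothetical: the `Segment` type, the `to_query` helper, the operator whitelist, the `?` placeholder style, and the choice to join several segments with AND are assumptions for illustration, not part of the presented work.

```python
from typing import List, NamedTuple, Tuple


class Segment(NamedTuple):
    attribute: str   # attribute in the DB schema, e.g. Gene_Name
    operator: str    # built-in operator qualifying the value
    operand: str     # value collected from the user


# Hypothetical whitelist; a real application would define its own.
ALLOWED_OPERATORS = {"=", "<", ">", "<=", ">=", "LIKE"}


def to_query(table: str, segments: List[Segment]) -> Tuple[str, List[str]]:
    """Build a parameterised SELECT from segments (one WHERE condition each)."""
    if not table.isidentifier():
        raise ValueError(f"bad table name: {table!r}")
    clauses, params = [], []
    for seg in segments:
        if not seg.attribute.isidentifier():
            raise ValueError(f"bad attribute name: {seg.attribute!r}")
        if seg.operator not in ALLOWED_OPERATORS:
            raise ValueError(f"unsupported operator: {seg.operator!r}")
        clauses.append(f"{seg.attribute} {seg.operator} ?")
        params.append(seg.operand)   # the value travels separately from the SQL text
    sql = f"SELECT * FROM {table}"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)   # AND is an assumption
    return sql + ";", params


sql, params = to_query("Gene", [Segment("Gene_Name", "=", "maggie")])
```

For the slide's example, `sql` is `SELECT * FROM Gene WHERE Gene_Name = ?;` and `params` is `["maggie"]`.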
INTERFACE DESIGN PROCESS
While the components are observable, their semantic roles appear hidden to a machine.
The succession of one semantic label by another is similar to the transitioning of HMM states.
[Figure: observable components and their hidden semantic labels.
  Text (Gene ID) = Attribute-name, Textbox = Operand;
  Text (Gene Name) = Attribute-name, RB Group = Operator, Textbox = Operand]
HMM: An Artificial Designer
An HMM can act like a human designer who designs an interface and determines the segment boundaries and semantic labels of its components.
We encoded the implicit knowledge required for interface segmentation in an HMM-based artificial designer.
We employ a 2-layered HMM:
  The first layer, T-HMM, tags each component with an appropriate semantic label (attribute-name, operator, or operand).
  The second layer, S-HMM, segments the interface into logical attributes.
2-LAYERED HMM
[Figure: Parser -> T-HMM -> S-HMM.
  Parser output: Text, Textbox, Text, RB Group, Textbox
  T-HMM output: Attribute-name, Operand, Attribute-name, Operator, Operand
  S-HMM output: Begin-segment, End-segment, Begin-segment, Inside-segment, End-segment]
MODEL SPECIFICATION: T-HMM & S-HMM
T-HMM
  Symbols: HTML constructs (text label, textbox, textarea, radio button, checkbox, select list, etc.)
  States: semantic labels (Attribute-name, Operator, Operand, Text-Misc)
S-HMM
  Symbols: semantic labels (Attribute-name, Operator, Operand, Text-Misc)
  States: segment positions (Begin, Inside, End, Outside)
[Figure: training interfaces supply the symbol and state sequences for T-HMM and S-HMM; the trained models output semantic labels and segment boundaries for the test interfaces]
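Because the states of T-HMM are the symbols of S-HMM, the two layers can be chained: decode the component sequence into semantic labels, then decode those labels into segment positions. The sketch below uses Viterbi decoding (named later in the slides as the testing method) on the five-component example from the 2-layered HMM figure. All probabilities are invented toy values, the toy T-HMM omits the Text-Misc state for brevity, and the `start` distributions and the `floor` parameter are added assumptions; none of this reflects the trained models from the study.

```python
import math
from typing import Dict, List

Table = Dict[str, Dict[str, float]]


def viterbi(obs: List[str], states: List[str], start: Dict[str, float],
            trans: Table, emit: Table, floor: float = 1e-9) -> List[str]:
    """Return the most probable state sequence for `obs` (log-space Viterbi)."""
    if not obs:
        return []

    def lg(p: float) -> float:
        return math.log(max(p, floor))   # floor avoids log(0)

    score = {s: lg(start.get(s, 0.0)) + lg(emit[s].get(obs[0], 0.0)) for s in states}
    back = []                            # back[t][s] = best predecessor of s
    for o in obs[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + lg(trans[r].get(s, 0.0)))
            score[s] = prev[best] + lg(trans[best].get(s, 0.0)) + lg(emit[s].get(o, 0.0))
            ptr[s] = best
        back.append(ptr)
    path = [max(states, key=lambda s: score[s])]
    for ptr in reversed(back):           # follow back-pointers from the end
        path.append(ptr[path[-1]])
    return path[::-1]


# Layer 1 (toy T-HMM): HTML constructs -> semantic labels.
T_STATES = ["attribute-name", "operator", "operand"]
T_START = {"attribute-name": 0.9, "operator": 0.05, "operand": 0.05}
T_TRANS = {"attribute-name": {"attribute-name": 0.1, "operator": 0.3, "operand": 0.6},
           "operator": {"attribute-name": 0.05, "operator": 0.05, "operand": 0.9},
           "operand": {"attribute-name": 0.8, "operator": 0.1, "operand": 0.1}}
T_EMIT = {"attribute-name": {"text": 0.95, "textbox": 0.025, "rb-group": 0.025},
          "operator": {"text": 0.1, "textbox": 0.1, "rb-group": 0.8},
          "operand": {"text": 0.05, "textbox": 0.85, "rb-group": 0.1}}

# Layer 2 (toy S-HMM): semantic labels -> segment positions.
S_STATES = ["begin", "inside", "end", "outside"]
S_START = {"begin": 0.85, "inside": 0.025, "end": 0.025, "outside": 0.1}
S_TRANS = {"begin": {"begin": 0.05, "inside": 0.45, "end": 0.45, "outside": 0.05},
           "inside": {"begin": 0.05, "inside": 0.3, "end": 0.6, "outside": 0.05},
           "end": {"begin": 0.8, "inside": 0.025, "end": 0.025, "outside": 0.15},
           "outside": {"begin": 0.7, "inside": 0.05, "end": 0.05, "outside": 0.2}}
S_EMIT = {"begin": {"attribute-name": 0.9, "operator": 0.03, "operand": 0.04, "text-misc": 0.03},
          "inside": {"attribute-name": 0.1, "operator": 0.7, "operand": 0.15, "text-misc": 0.05},
          "end": {"attribute-name": 0.03, "operator": 0.05, "operand": 0.9, "text-misc": 0.02},
          "outside": {"attribute-name": 0.04, "operator": 0.03, "operand": 0.03, "text-misc": 0.9}}

components = ["text", "textbox", "text", "rb-group", "textbox"]   # parser output
labels = viterbi(components, T_STATES, T_START, T_TRANS, T_EMIT)
positions = viterbi(labels, S_STATES, S_START, S_TRANS, S_EMIT)
```

With these toy parameters, `labels` and `positions` reproduce the sequences shown in the figure: two segments, the first with two components and the second with three.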
Part 3
1. Problem: Interface Segmentation
2. Solution: Hidden Markov Model
3. Empirical Results
     Initial Experiments
     Variations of Models
     Some Interesting Results
     Conclusions
4. Summing Up
INITIAL EXPERIMENTS: Domain-Specific
FIRST EXP.: BIOLOGY DOMAIN
  Dataset: 200 interfaces
  Cross validation: 190 training and 10 testing examples
  Training: Maximum Likelihood method
  Testing: Viterbi algorithm

  Model   Semantic label     Accuracy (%)
  S-HMM   Segment            86.05
  T-HMM   Attribute-name     90.11
  T-HMM   Attribute-name *   99.75
  T-HMM   Operator           85.10
  T-HMM   Operand            98.60
  * For segments with multiple instances of attribute-names, at least one was correctly identified.

COMPARISON WITH LEX (He et al. 2007): 4 DOMAINS
  Dataset: 100 interfaces each

  Domain       LEX     HMM      HMMbio
  Biology      70.94   +16.66   +16.66
  Health       66.85   +5.39    +13.74
  Automobile   54.34   +24.66   +18.01
  Movie        70      +0       +5.9
  (HMM and HMMbio columns show the change in accuracy relative to LEX.)

  Why did the 2-layered HMM outperform LEX?
    LEX does not model text-misc and thus suffered from under-segmentation.
    LEX considers only those texts as attribute-names that are located within a 2-top-row distance from the form element. In reality, attribute-name and operand might be located far apart in the source code.
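The training setup named on this slide (maximum likelihood training on manually tagged interfaces, with 190 training and 10 testing examples per split) can be sketched as relative-frequency counting. This is a minimal illustration under stated assumptions: the function names `train_ml` and `folds`, the optional add-`alpha` smoothing, the uniform fallback for states never seen in training, and the use of disjoint test folds are all choices made for this example and are not confirmed by the slides.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Tagged = List[Tuple[str, str]]   # one interface: [(symbol, state), ...]


def train_ml(data: Sequence[Tagged], states: List[str], symbols: List[str],
             alpha: float = 0.0):
    """Estimate start, transition, and emission probabilities by counting.

    alpha = 0 gives plain maximum likelihood; alpha > 0 adds smoothing
    (an assumption, not something the slides specify).
    """
    start_c: Counter = Counter()
    trans_c: Dict[str, Counter] = defaultdict(Counter)
    emit_c: Dict[str, Counter] = defaultdict(Counter)
    for seq in data:
        if not seq:
            continue
        start_c[seq[0][1]] += 1
        for sym, st in seq:
            emit_c[st][sym] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans_c[a][b] += 1

    def normalise(counts: Counter, keys: List[str]) -> Dict[str, float]:
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        if total == 0:                       # state never observed: fall back to uniform
            return {k: 1.0 / len(keys) for k in keys}
        return {k: (counts[k] + alpha) / total for k in keys}

    start = normalise(start_c, states)
    trans = {s: normalise(trans_c[s], states) for s in states}
    emit = {s: normalise(emit_c[s], symbols) for s in states}
    return start, trans, emit


def folds(items: list, test_size: int) -> Iterator[Tuple[list, list]]:
    """Yield (train, test) splits with disjoint test blocks of `test_size`."""
    for i in range(0, len(items), test_size):
        yield items[:i] + items[i + test_size:], items[i:i + test_size]


# Two hand-tagged toy interfaces (invented for illustration).
STATES = ["attribute-name", "operator", "operand"]
SYMBOLS = ["text", "textbox", "radio-group"]
tagged = [
    [("text", "attribute-name"), ("textbox", "operand")],
    [("text", "attribute-name"), ("radio-group", "operator"), ("textbox", "operand")],
]
start, trans, emit = train_ml(tagged, STATES, SYMBOLS)

# 200 interfaces with 10 held out per split gives 20 splits of 190/10.
splits = list(folds(list(range(200)), 10))
```

In the toy data, attribute-name is followed once by operand and once by operator, so both transitions are estimated at 0.5. Setting `alpha` above zero keeps unseen transitions from receiving zero probability, which matters when a test interface contains a pattern absent from training.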
HMM Variations: T-HMM Topology
Design preferences of designers from different domains are different.
[Figure: T-HMM topologies for AUTOMOBILE, BIOLOGY, HEALTH, REFERENCE & EDU, MOVIE, and MIXED training data; transitions with probability below 5% are not shown]
RESULTS
HMM variations (named after their training data) by test domain; accuracy in %:

  Test domain   HMMauto   HMMbio   HMMhealth   HMMmovie   HMMref_edu   HMMmixed
  Auto          79        72.35    73.63       68.81      67.52        70.7
  Bio           48.7      87.6     48.72       45.29      52.56        51.2
  Health        70.35     80.59    72.24       69         74.12        73.05
  Movie         72.96     75.9     73.33       70         74.81        74
  Ref. & Edu.   44.44     62.3     43.25       38.88      51           44

1. Domain-Specific 2. Generic 3. Cross-Domain
Domain-specific models do not always result in the best performance, e.g. in the movie domain.
[Figures: a pattern captured by the domain-specific model; a pattern captured by the cross-domain model. Panel labels: Automobile, Health, Text-misc]
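The results table can be checked mechanically to see which model wins in each test domain. The sketch below copies the numbers from the table as given; the variable names and the mapping from each test domain to its own domain-specific model are conveniences introduced for this example.

```python
MODELS = ["HMMauto", "HMMbio", "HMMhealth", "HMMmovie", "HMMref_edu", "HMMmixed"]

# Accuracy (%) per test domain, in the column order of MODELS.
ACCURACY = {
    "Auto":        [79, 72.35, 73.63, 68.81, 67.52, 70.7],
    "Bio":         [48.7, 87.6, 48.72, 45.29, 52.56, 51.2],
    "Health":      [70.35, 80.59, 72.24, 69, 74.12, 73.05],
    "Movie":       [72.96, 75.9, 73.33, 70, 74.81, 74],
    "Ref. & Edu.": [44.44, 62.3, 43.25, 38.88, 51, 44],
}

# Domain-specific model for each test domain.
OWN_MODEL = {"Auto": "HMMauto", "Bio": "HMMbio", "Health": "HMMhealth",
             "Movie": "HMMmovie", "Ref. & Edu.": "HMMref_edu"}

# Best-performing model per test domain.
best = {domain: MODELS[row.index(max(row))] for domain, row in ACCURACY.items()}

# Domains where the domain-specific model is also the best one.
own_is_best = [d for d in ACCURACY if best[d] == OWN_MODEL[d]]

# Share of test domains in which HMMbio is the best model.
bio_share = sum(model == "HMMbio" for model in best.values()) / len(best)
```

On these numbers the domain-specific model is the best choice only for Auto and Bio, while HMMbio is the best model in four of the five test domains.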
CONCLUSION
Frequent pattern P from domain D1:
  P can be recovered by HMMD1. E.g., biology and automobile.
Rare pattern P from domain D1:
  P can best be recovered using HMMD2, where D2 is a domain that has P as a frequent pattern. E.g., movie and health, wherein most of the rare patterns are recovered by HMMbio.
An artificial designer trained on more appropriate interfaces leads to more accurate results. The appropriateness depends on:
  the frequency of segment design patterns in the test domain
  the frequency of segment design patterns in the training dataset
Part 4
1. Problem: Interface Segmentation
2. Solution: Hidden Markov Model
3. Empirical Results
4. Summing Up
     Contributions
     Future Work
CONTRIBUTIONS
Introduction of a 2-layered HMM approach for interface segmentation, motivated by the probabilistic nature of the interface design process.
  First work to apply HMMs to deep Web search interfaces.
Effectiveness test across representative domains of the deep Web.
  High segmentation accuracy in most domains.
  Outperformed a previous approach, LEX, by at least 10% in most cases.
Design and comparison of various learning models.
  A single model has the potential to accurately segment interfaces from multiple domains, provided it is trained on data with an appropriate variety and frequency of design patterns.
  An example is HMMbio, which performed better than the other models on 80% of the tested domains. The variety and frequency of patterns in the biology domain help HMMbio contain more design knowledge and be a smarter designer.
FUTURE WORK
Design a minimal set of models that reaches as many deep Web domains as possible.
  Involve more domains.
  Each model returns higher accuracy than its domain-specific counterpart.
Transition to a new interface representation scheme: distributed segments and segments with intertwined components.
Recover the schema of deep Web databases: extracting finer details, such as data types and constraints.
Overcome the challenges posed by HMMs.
  Manual tagging of training data: explore unsupervised training methods such as the Baum-Welch algorithm.
  Time taken by the Viterbi algorithm for state recovery: find optimization techniques to improve efficiency.
Use this method as an off-line pre-processing module for other applications such as meta-search engines and deep Web crawlers.
Suggestions, Thoughts, Ideas, Questions…
THANK YOU !
Acknowledgements: To the Anonymous Reviewers of CIKM 2009
References: [1] to [23] (in full paper).