searching repositories of web application models
DESCRIPTION
Project repositories are a central asset in software development, as they preserve the technical knowledge gathered in past development activities. However, locating relevant information in a vast project repository is problematic, because it requires manually tagging projects with accurate metadata, an activity which is time consuming and prone to errors and omissions. This paper investigates the use of classical Information Retrieval techniques for easing the discovery of useful information from past projects. Differently from approaches based on textual search over the source code of applications or on querying structured metadata, we propose to index and search the models of applications, which are available in companies applying Model-Driven Engineering practices. We contrast alternative index structures and result presentations, and evaluate a prototype implementation on real-world experimental data.TRANSCRIPT
Politecnico di MilanoDipartimento di Elettronica e Informazione
Searching Repositoriesof Web Application Models
Alessandro Bozzon, Marco Brambilla, Piero Fraternali
ICWE 2010Vienna, July 7th 2010
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Context
•Project repositories are a central asset in (Web) software development
• they preserve the technical knowledge gathered in past development activities
• repositories now overcome the boundaries of individual organizations and have a social role in the diffusion of coding and design solutions
• they allow for reuse of knowledge and artifacts
•Locating relevant information in a vast project repository is problematic
• Two options
1. Manual tagging time consuming and prone to errors, omissions and incoherencies
2. Automatic analysis a lot of semantic can be lost in the process
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Addressed Problem
•Objective: easing the discovery of useful information from past software projects
•Main resource: application models
•available in companies applying Model-Driven Engineering practices
• In contrast to existing solutions, that mainly focus on discovery of code, documentation, and annotations
•Why dealing with application models is an advantage?• Increased result quality (thanks to the more valuable
information embedded in models wrt to the code)
• Less need for manual tagging
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Related work: Component search
•Retrieval of annotated pieces of software dates back to the '90s. Various approaches:
•worldwide search engine based on JavaBeans and Corba [Agora, Internet Computing, 1998]
•Search engines for Web services based on indexed Vector Space Model characterization of their properties [Dustdar et al., ECWS 2005]
•Significance based search that exploits graph models of a software component library (usage relations used as links propagating significance) [Inoue et al., TOSEM 2005]
• Combination of formal and semi-formal specification to describe behaviour and structure of components [Khalifa et al., ASEA 2008]
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Related work: Source code search• Several communities and on-line tools for sharing and
retrieving code: Google code, Snipplr, Koders, Codase, Jexamples, SourceForge
•Keyword queries directly matched to the code
• Results are the exact locations where the keyword(s) appear
• Plus advanced behaviours: regular expressions (Google), wildcards (Codase), restriction to specific concept types (Jexamples, Codase), advanced ranking, e.g., based on rank results based on relevance of match, activity, date of registration, recency of last update (SourceForge)
• Other approaches:
• Information retrieval techniques for software reuse [Frakes et al., SIGIR Forum 1987]
• taking advantage of code structural information [Holmes and Murphy, ICSE 2005] and [Sourcerer Project by Bajracharya et al., SUITE ICSE workshop 2009]
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Related work: Model search
• The problem is usually restricted to Searching UML or ER models
• XML / XMI format for indexing seamlessly UML models, text files, and others [Gibb et al., 2000] [Lorens et al., 2004] [Moogle: Lucredio et al., Models 2008]
• UML artifacts classified with WordNet terms and extracted though Case-Based Reasoning [Gomes et al., AI Comm., 2004]
• database conceptual model retrieval based on text search, schema matching, and structurally-aware scoring methods, with queries by example and keword-based [Schemr: Chen, Halevy, SIGMOD09]
• IR techniques applied to models and code together, for tracing the association between requirements, design artifacts, and code [Antoniol et al., 2000] […]
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Related work: Business Process Discovery
• Different approaches to extraction of BP models from repositories
• Based on the workflow topology only: graph-based comparison or XML-based querying [Beeri et al., VLDB 2006] [Lu et al., BPM 2006] [Shao et al., ICDE 2009]
• Based on semantic reasoning and discovery, using SPARQL, query by example, SQL-like languages, and so on [Kiefer et al., ESWC 2007] [Goderis et al., ICWS 2006] [Awad et al., EDOC 2008] [Zhuge 2002] [Belhajjame, Brambilla, BPMDS 2009]
• Based on IR techniques [Dongen, Dijkman et al., Caise 2008]
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Our contribution
A model-based search solution, with several innovations: • it automatically exploits the semantics from the searched
conceptual models• It does not require manual annotation• it supports alternative indexing and ranking functions, based of
the meta-model of the considered DSL(s) • it is based on a model-independent framework, which can be
customized to any meta-model
User study to evaluate acceptance and the quality perceived by users
Performance tests to evaluate scalability
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Overall Architecture of the System
Content processing
QUERY (keyword)
RESULTS
SegmentationProject
repository
Indexing
Segment analysis
Linguistic normalization
Query processing
RESULTS
Search User Interface
QUERY(keyword)
Project analysisDSL Metamodel
Metadata
Index Search Engine
Content processing flow L
egen
d Query and result presentation flow
En
gin
eerin
g W
eb
Search
Ap
plica
tion
Bozzo
n, B
ram
billa
, Tu
toria
l @IC
WE
20
10
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Overall Architecture of the System
Content processing
QUERY (keyword)
RESULTS
SegmentationProject
repository
Indexing
Segment analysis
Linguistic normalization
Query processing
RESULTS
Search User Interface
QUERY(keyword)
Project analysisDSL Metamodel
Metadata
Index Search Engine
Content processing flow L
egen
d Query and result presentation flow
The Content Processing Flow extracts meaningful information from projects and uses it to create the search
engine index.1. CONTENT PROCESSING • project analysis captures project-level, global metadata • segmentation splits the project into smaller units • segment analysis extracts from segments the information to be indexed • linguistic normalization applies the typical normalization operations of IR
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Overall Architecture of the System
Content processing
QUERY (keyword)
RESULTS
SegmentationProject
repository
Indexing
Segment analysis
Linguistic normalization
Query processing
RESULTS
Search User Interface
QUERY(keyword)
Project analysisDSL Metamodel
Metadata
Index Search Engine
Content processing flow L
egen
d Query and result presentation flow
The Content Processing Flow extracts meaningful information from projects and uses it to create the search
engine index.2. INDEXING• each project or segment is physically represented as a document• the search engine indexes are built based on the documents• the DSL metamodel is taken into account
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Overall Architecture of the System
Content processing
QUERY (keyword)
RESULTS
SegmentationProject
repository
Indexing
Segment analysis
Linguistic normalization
Query processing
RESULTS
Search User Interface
QUERY(keyword)
Project analysisDSL Metamodel
Metadata
Index Search Engine
Content processing flow L
egen
d Query and result presentation flow
The query and result presentation Flow deals with the submitted
queries and the production of the result set. 1. USER INTERFACE supports
• Keyword-based queries• Content-based queries (aka QBE)• Rendering of the results
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Overall Architecture of the System
Content processing
QUERY (keyword)
RESULTS
SegmentationProject
repository
Indexing
Segment analysis
Linguistic normalization
Query processing
RESULTS
Search User Interface
QUERY(keyword)
Project analysisDSL Metamodel
Metadata
Index Search Engine
Content processing flow L
egen
d Query and result presentation flow
The query and result presentation Flow deals with the submitted
queries and the production of the result set. 2. QUERY PROCESSING
• matches the query to the indexed content using a given similarity criteria• produces ranked results
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Design Dimensions of Model Retrieval (1/2)
Segmentation Granularity: the “size” of atomic unit of retrieval for the user
• Project
• Sub-project
• Model concepts (all or only the main ones)
Index structure: one or more fields (associated with an boosting score)
• Flat: a simple list of terms without taking into account model semantics• Weighted: model concepts used to weight terms in the ranking• Multi-field: terms belonging to
different model concepts are collected into separate fields
• Structured: the model is translated into a representation that reflects the hierarchies and associations among concepts
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Example of Model Indexing
Metamodel Model Model XML Representation
Product Catalogue Catalogue Home Page List Products List of product in the catalogue View Details Details of a
selected product
HYPERTEXT MODEL
1
ID
Product Catalogue Application
PROJECT NAME
Multi-Field
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Example of Model Indexing
Metamodel Model Model XML Representation
Product|2.0 Catalogue|2.0 Catalogue|1.0 Home|1.0 Page|1.0 List|0.5 Products|0.5 List|0.2 of|0.2 products|0.2 in|0.2
the|0.2 catalogue|0.2 View Details Details of a selected product
HYPERTEXT MODEL
1
ID
Product Catalogue Application
PROJECT NAME
Multi-Field, Weighted Index
2.0
1.0 0.5
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Design Dimensions of Model Retrieval (2/2)
Query Language and Result Presentation: the way queries and results are presented.
• Keyword-based search
• Document-based search: the system extracts the most significant words and submits them as a query
• Search by example: the query is a model, analyzed and matched by similarity
• Faceted search: exploration using facets (i.e., property-value pairs) extracted from the indexed documents
• Snippet visualization: with the matching points highlighted in graphical or textual form
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Our model-based search engine prototype
General purpose, model-independent, configurable system:
Configuration of a general purpose search engine according to the selected design dimensions
metamodel-aware rules to analyze models and populate the index segmentation and text-extraction steps model
transformation rules
Offline collection analysis compute statistics for fine-tuning the retrieval and ranking
Stop Domain Concept removal
optimization of the weights assigned to each model concept
Provides a visual interface to perform queries and inspect results.
Content processing has been implemented by extending the text processing and analysis components provided by Apache Lucene
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Detailed indexing process
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Experiment Settings - Dataset
48 real-world WebML projects from WebRatio
• trouble ticketing, human resource management, multimedia search engines, Web portals, etc.
• Italian and English
•~ 250 Modeling Concepts
•3,800 data model entities (with about 35,000 attributes and 3,800 relationships)
•138 site views with about 10,000 pages and 470,000 units, and 20 Web services.
•The overall repository takes around 85MB of disk space
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Experiment settings - Configurations
• 3 different settings of the design dimensions: A, B, C• A flat index structure;
• B and C multi-field weighted (projectID, projectName, documentType, text)
Option Description A B C
Segmentation Granularity
Project Entire project X
Sub-project Subproject X X
Single-Concept Arbitrary model concepts X X
Index Structure
Flat Flat list of words X X
Weighted Words weighted by the model concepts they belong to
X
Multi-field Words belonging to each model concept in separate fields
X X
Query Language and Result Presentation
Keyword-based
Query By Keywords X X X
Faceted Query refined through specific dimensions X X X
Snippets Visualization and exploration of result previews X X X
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Experiment C – model-based scoring function
• Experiments A and B exploit a traditional TF-IDF ranking function
• Experiment C exploits the DSL metamodel
•mtw(m, t) : Model Term Weight, a metamodel specific boost that depends on the concept m containing the term t
•dw(d) : Document Weight, a metamodel and model-specific boosting value that expresses the importance of a given document (according to the selected granularity)
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
User Interface
(A) Rendered result set and facets
(B) Snippet window with highlighted matches
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
User Evaluation – Perceived Quality
• User study has been conducted with 5 expert WebML designers to assess the quality and perception of alternative configurations
• Users rated the results in the result sets,
• Votes ranged from 1 (highly inappropriate) to 5 (highly appropriate)
• Experiment B and C got more votes in high range of the scale
• Success Factor:
• Injecting the semantic
of the meta-model
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
User Evaluation - Acceptance
• Users were asked 10 questions about the features of the application
• Votes ranged from 1 (bad) to 5 (good)
Avg.
Var..
Features
Keyword Search 3.6 0.24
Search Result Ranking 3.2 0.16
Faceted Search 3.8 0.16
Match Highlighting 3.6 0.24
Application
Help reducing the maintenance costs 3.2 0.56
Help improving the quality of the delivered application?
3.0 0.4
Help understanding the model assets in the company?
4.4 0.24
Help providing better estimates for future application costs?
2.8 0.56
Wrap Up
Overall Evaluation of the system 4.0 0.4
Would you use the system in your activities? 3.0 1.2
useful for model maintenance and reuse
role in improving the quality of the applications
• a certain distance between the overall judged quality and the adoption likelihood• But there is a bias due
to the lack of a graphical viewer
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Performance Evaluation - Query Time
• About 400 2-terms and 3-terms randomly generated keyword queries
• Each query has been executed 20 times
• Query time is abundantly sub-second and curves indicate a sub-linear growth
• The addition of Faceted Search and Snippet Visualization impacts heavily
• with the number of inde
• NOSeg: No Segmentation • Seg: Segmentation• KS: Keyword Search• FS: Faceted Search• Snip: Snippet
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Performance Evaluation - Index Size
• Size grows almost linearly with the number of projects in all configurations
• Baseline configurations feature index sizes about 10 times smaller than the repository size
• Faceted Search doubles the index size
• NOSeg: No Segmentation • Seg: Segmentation• KS: Keyword Search• FS: Faceted Search• Snip: Snippet
A. Bozzon, M. Brambilla, P. Fraternali. Searching Repositories of Web Application Models
Conclusions and future directions
• A metamodel-aware approach and a system prototype for searches over model repositories
• Scalability tests and user studies in different experimental settings
• Future works:
• Integration of content-based search
• Improve result visualization: integration in the WebRatio tool-suite for WebML and visual highlighting of the matches in the projects
• Adaptive fine-tuning for improving precision and recall
• Experiments with more modeling languages (e.g. BPMN)
• Definition of generic benchmark criteria for model-driven repository search
Thanks!
Questions?
Alessandro BozzonMarco Brambilla Piero Fraternali
?Searc
hin
g R
eposi
tori
es
of
Web A
pp
licati
on M
odels