Dynamic Query Forms for Database Queries

Liang Tang, Tao Li, Yexi Jiang, and Zhiyuan Chen

Liang Tang, Tao Li, and Yexi Jiang are with the School of Computer Science, Florida International University, Miami, Florida, 33199, U.S.A. E-mail: {ltang002, taoli, yjian004}@cs.fiu.edu. Zhiyuan Chen is with the Information Systems Department, University of Maryland Baltimore County, Baltimore, MD, 21250, U.S.A.

Abstract—Modern scientific databases and web databases maintain large and heterogeneous data. These real-world databases contain hundreds or even thousands of relations and attributes. Traditional predefined query forms are not able to satisfy the various ad-hoc queries from users on those databases. This paper proposes DQF, a novel database query form interface, which is able to dynamically generate query forms. The essence of DQF is to capture a user's preference and rank query form components, assisting him/her to make decisions. The generation of a query form is an iterative process guided by the user. At each iteration, the system automatically generates ranking lists of form components and the user then adds the desired form components into the query form. The ranking of form components is based on the captured user preference. A user can also fill the query form and submit queries to view the query result at each iteration. In this way, a query form can be dynamically refined until the user is satisfied with the query results. We utilize the expected F-measure for measuring the goodness of a query form. A probabilistic model is developed for estimating the goodness of a query form in DQF. Our experimental evaluation and user study demonstrate the effectiveness and efficiency of the system.

Index Terms—Query Form, User Interaction, Query Form Generation


1 INTRODUCTION

Query forms are one of the most widely used user interfaces for querying databases. Traditional query forms are designed and predefined by developers or DBAs in various information management systems. With the rapid development of web information and scientific databases, modern databases have become very large and complex. In the natural sciences, such as genomics and disease research, databases have hundreds of entities for chemical and biological data resources [22] [13] [25]. Many web databases, such as Freebase and DBPedia, typically have thousands of structured web entities [4] [2]. Therefore, it is difficult to design a set of static query forms that satisfies the various ad-hoc database queries on those complex databases.

Many existing database management and development tools, such as EasyQuery [3], Cold Fusion [1], SAP, and Microsoft Access, provide mechanisms that let users create customized queries on databases. However, the creation of customized queries depends entirely on users' manual editing [16]. If a user is not familiar with the database schema in advance, those hundreds or thousands of data attributes would confuse him/her.

1.1 Our Approach

In this paper, we propose a Dynamic Query Form system, DQF: a query interface which is capable of dynamically generating query forms for users. Different from traditional document retrieval, users in database retrieval are often willing to perform many rounds of actions (i.e., refining query conditions) before identifying the final candidates [7]. The essence of DQF is to capture user interests during user interactions and to adapt the query form iteratively. Each iteration consists of two types of user interactions: Query Form Enrichment and Query Execution (see Table 1). Figure 1 shows the work-flow of DQF. It starts with a basic query form which contains very few primary attributes of the database. The basic query form is then enriched iteratively via the interactions between the user and our system until the user is satisfied with the query results. In this paper, we mainly study the ranking of query form components and the dynamic generation of query forms.

TABLE 1
Interactions Between Users and DQF

Query Form Enrichment:
  1) DQF recommends a ranked list of query form components to the user.
  2) The user selects the desired form components into the current query form.
Query Execution:
  1) The user fills out the current query form and submits a query.
  2) DQF executes the query and shows the results.
  3) The user provides feedback about the query results.

Fig. 1. Flowchart of Dynamic Query Form

1.2 Contributions

Our contributions can be summarized as follows:

- We propose a dynamic query form system which generates query forms according to the user's desire at run time. The system provides a solution for the query interface in large and complex databases.

- We apply the F-measure to estimate the goodness of a query form [30]. The F-measure is a typical metric for evaluating query results [33]. This metric is also appropriate for query forms because query forms are designed to help users query the database: the goodness of a query form is determined by the query results generated from that form. Based on this, we rank and recommend the potential query form components so that users can refine the query form easily.

- Based on the proposed metric, we develop efficient algorithms to estimate the goodness of the projection and selection form components. Here efficiency is important because DQF is an online system where users expect quick responses.

The rest of the paper is organized as follows. Section 2 describes the related work. Section 3 defines the query form and introduces how users interact with our dynamic query form. Section 4 defines a probabilistic model to rank query form components. Section 5 describes how to estimate the ranking score. Section 6 reports experimental results, and finally Section 7 concludes the paper.

2 RELATED WORK

How to let non-expert users make use of relational databases is a challenging topic. Many research works focus on database interfaces which assist users in querying a relational database without SQL. QBE (Query-By-Example) [36] and query forms are the two most widely used database querying interfaces. At present, query forms are utilized in most real-world business and scientific information systems. Current studies mainly focus on how to generate the query forms.

Customized Query Form: Existing database clients and tools make great efforts to help developers design and generate query forms, such as EasyQuery [3], Cold Fusion [1], SAP, Microsoft Access, and so on.

They provide visual interfaces for developers to create or customize query forms. The problem with these tools is that they are intended for professional developers who are familiar with their databases, not for end-users [16]. [17] proposed a system which allows end-users to customize an existing query form at run time. However, an end-user may not be familiar with the database. If the database schema is very large, it is difficult for them to find the appropriate database entities and attributes and to create the desired query forms.

Automatic Static Query Form: Recently, [16] [18] proposed automatic approaches to generate database query forms without user participation. [16] presented a data-driven method: it first finds a set of data attributes which are most likely to be queried, based on the database schema and data instances, and then generates the query forms from the selected attributes. [18] is a workload-driven method: it applies a clustering algorithm to historical queries to find representative queries, and the query forms are then generated from those representative queries. One problem with the aforementioned approaches is that, if the database schema is large and complex, user queries can be quite diverse. In that case, even if many query forms are generated in advance, there are still user queries that cannot be satisfied by any of them. Another problem is that, when a large number of query forms are generated, letting users find an appropriate and desired query form becomes challenging. A solution that combines keyword search with query form generation is proposed in [12]. It automatically generates many query forms in advance, and the user inputs several keywords to find relevant query forms among them. It works well for databases which have rich textual information in data tuples and schemas. However, it is not appropriate when the user does not have concrete keywords to describe the queries at the beginning, especially for numeric attributes.

Autocompletion for Database Queries: In [26], [21], novel user interfaces have been developed to assist the user in typing database queries based on the query workload, the data distribution, and the database schema. Different from our work, which focuses on query forms, the queries in their work are in the form of SQL and keywords.

Query Refinement: Query refinement is a common practical technique used by most information retrieval systems [15]. It recommends new terms related to the query, or modifies the terms according to the navigation path of the user in the search engine. For a database query form, however, a database query is a structured relational query, not just a set of terms.

Dynamic Faceted Search: Dynamic faceted search is a type of search engine where relevant facets are presented to the users according to their navigation paths [29] [23].

Dynamic faceted search engines are similar to our dynamic query forms if we only consider Selection components in a query. However, besides Selections, a database query form has other important components, such as Projection components. Projection components control the output of the query form and cannot be ignored. Moreover, the designs of Selection and Projection inherently influence each other.

Database Query Recommendation: Recent studies introduce collaborative approaches to recommend database query components for database exploration [20] [9]. They treat SQL queries as items in a collaborative filtering approach, and recommend similar queries to related users. However, they do not consider the goodness of the query results. [32] proposes a method to recommend an alternative database query based on the results of a query. The difference from our work is that their recommendation is a complete query, whereas our recommendation is a query component at each iteration.

Dynamic Data Entry Form: [11] develops an adaptive forms system for data entry, which can be dynamically changed according to the previous data input by the user. Our work is different as we are dealing with database query forms instead of data-entry forms.

Active Feature Probing: Zhu et al. [35] develop the active feature probing technique for automatically generating clarification questions to provide appropriate recommendations to users in database search. Different from their work, which focuses on finding the appropriate questions to ask the user, DQF aims to select appropriate query components.

3 QUERY FORM INTERFACE

3.1 Query Form

In this section we formally define the query form. Each query form corresponds to an SQL query template.

Definition 1: A query form F is defined as a tuple (A_F, R_F, σ_F, ⋈(R_F)), which represents a database query template as follows:

F = (SELECT A_1, A_2, ..., A_k FROM ⋈(R_F) WHERE σ_F),

where A_F = {A_1, A_2, ..., A_k} is the set of k projection attributes, k > 0. R_F = {R_1, R_2, ..., R_n} is the set of n relations (or entities) involved in this query, n > 0. Each attribute in A_F belongs to one relation in R_F. σ_F is a conjunction of expressions for selections (or conditions) on the relations in R_F. ⋈(R_F) is a join function that generates a conjunction of expressions for joining the relations of R_F.

In the user interface of a query form F, A_F is the set of columns of the result table, and σ_F determines the set of input components for users to fill in. Query forms allow users to fill in parameters to generate different queries. R_F and ⋈(R_F) are not visible in the user interface; they are usually generated by the system according to the database schema. For a query form F, ⋈(R_F) is automatically constructed according to the foreign keys among the relations in R_F. Meanwhile, R_F is determined by A_F and σ_F: R_F is the union of the relations which contain at least one attribute of A_F or σ_F. Hence, the components of query form F are actually determined by A_F and σ_F. As mentioned, only A_F and σ_F are visible to the user in the user interface. In this paper, we focus on the projection and selection components of a query form. Ad-hoc joins are not handled by our dynamic query form because the join is not part of the query form and is invisible to users. As for Aggregation and Order By in SQL, there are limited options for users. For example, Aggregation can only be MAX, MIN, AVG, and so on, and Order By can only be increasing or decreasing order. Our dynamic query form can easily be extended to include those options by implementing them as dropdown boxes in the user interface of the query form.
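To make the template concrete, the sketch below models a query form as a small data structure and renders the SQL template of Definition 1. It is only an illustration, not the paper's implementation; the class and method names (QueryForm, to_sql) and the way join predicates are passed in are our own assumptions.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class QueryForm:
    """A query form F = (A_F, R_F, sigma_F, join(R_F)) in the sense of Definition 1."""
    projection: List[str]          # A_F: attributes shown as result columns
    relations: List[str]           # R_F: relations (entities) involved
    selections: List[str] = field(default_factory=list)  # sigma_F: conjunctive conditions
    joins: List[str] = field(default_factory=list)       # join(R_F): foreign-key predicates

    def to_sql(self) -> str:
        """Render the SQL template represented by this form."""
        where = " AND ".join(self.joins + self.selections) or "1=1"
        return (f"SELECT {', '.join(self.projection)} "
                f"FROM {', '.join(self.relations)} WHERE {where}")

# Example: a form over the NBA schema used later in the paper.
form = QueryForm(projection=["p.firstname", "p.lastname"],
                 relations=["players p", "player_playoffs_career c"],
                 joins=["p.ilkid = c.ilkid"],
                 selections=["c.minutes > 5000"])
print(form.to_sql())
```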

3.2 Query Results

To decide whether a query form is desired or not, a user does not have time to go over every data instance in the query results. In addition, many database queries output a huge number of data instances. In order to avoid this Many-Answer problem [10], we first output only a compressed result table that shows a high-level view of the query results. Each instance in the compressed table represents a cluster of actual data instances. The user can then click through the clusters of interest to view the detailed data instances. Figure 2 shows the flow of user actions. The compressed high-level view of query results is proposed in [24]. There are many one-pass clustering algorithms for generating the compressed view efficiently [34], [5]. In our implementation, we choose the incremental data clustering framework [5] for efficiency reasons. Certainly, different data clustering methods would produce different compressed views for the users, and different clustering methods are preferable for different data types. In this paper, clustering is only used to provide a better view of the query results for the user; the system developers can select a different clustering algorithm if needed.

    Fig. 2. User Actions
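As an illustration of how a compressed view can be built in one pass, the sketch below uses a simple threshold-based "leader" clustering over numeric result rows. This is only a stand-in under our own assumptions (Euclidean distance, a fixed radius threshold); the actual system uses the incremental clustering framework of [5].

```python
import math
from typing import List, Tuple

def leader_clustering(rows: List[Tuple[float, ...]], radius: float):
    """One-pass clustering: each row joins the first cluster whose
    representative (leader) is within `radius`, else it starts a new cluster."""
    leaders: List[Tuple[float, ...]] = []
    clusters: List[List[Tuple[float, ...]]] = []
    for row in rows:
        for i, leader in enumerate(leaders):
            if math.dist(row, leader) <= radius:
                clusters[i].append(row)
                break
        else:
            leaders.append(row)
            clusters.append([row])
    # The compressed view shows one representative row per cluster.
    return list(zip(leaders, clusters))

# Example: query results with (City_MPG, Hwy_MPG) pairs.
view = leader_clustering([(20, 28), (21, 29), (35, 40)], radius=3.0)
for leader, members in view:
    print(leader, "covers", len(members), "rows")
```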

Another important use of the compressed view is to collect user feedback. Using the collected feedback, we can estimate the goodness of a query form and thereby recommend appropriate query form components. In the real world, end-users are reluctant to provide explicit feedback [19]. The click-through on the compressed view table is an implicit feedback signal that tells our system which cluster (or subset) of data instances is desired by the user. The clicked subset is denoted by D_uf. Note that D_uf is only a subset of all user-desired data instances in the database, but it can help our system generate recommended form components that help users discover more desired data instances. In some recommendation systems and search engines, end-users are also allowed to provide negative feedback, i.e., a collection of data instances that are not desired by the user. In the query form results, we assume most of the queried data instances are not desired by the user, because if they were already desired, then the query form generation would be almost done. Therefore, positive feedback is more informative than negative feedback in query form generation. Our proposed model can easily be extended to incorporate negative feedback.

4 RANKING METRIC

Query forms are designed to return the user's desired result. There are two traditional measures to evaluate the quality of query results: precision and recall [30]. Query forms can produce different queries with different inputs, and different queries output different query results and achieve different precisions and recalls, so we use expected precision and expected recall to evaluate the expected performance of a query form. Intuitively, expected precision is the expected proportion of the query results that the current user is interested in, and expected recall is the expected proportion of user-interested data instances that are returned by the current query form. The user interest is estimated from the user's click-through on the query results displayed by the query form. For example, if some data instances are clicked by the user, these data instances must have high user interest; then the query form components which can capture these data instances should be ranked higher than other components. Next we introduce some notation and then define expected precision and recall.

Notation: Table 2 lists the symbols used in this paper. Let F be a query form with selection condition σ_F and projection attribute set A_F. Let D be the collection of instances in ⋈(R_F), and let N be the number of data instances in D. Let d be an instance in D with attribute set A = {A_1, A_2, ..., A_n}, where n = |A|. We use d_{A_F} to denote the projection of instance d onto the attribute set A_F and call it a projected instance. P(d) is the occurrence probability of d in D. P(σ_F | d) is the probability that d satisfies σ_F, with P(σ_F | d) ∈ {0, 1}.

TABLE 2
Symbols and Notations

F         query form
R_F       set of relations involved in F
A         set of all attributes in ⋈(R_F)
A_F       set of projection attributes of query form F
A_r(F)    set of relevant attributes of query form F
σ_F       set of selection expressions of query form F
OP        set of relational operators in selections
d         data instance in ⋈(R_F)
D         the collection of data instances in ⋈(R_F)
N         number of data instances in D
d_{A_1}   data instance d projected onto attribute set A_1
D_{A_1}   set of unique values of D projected onto attribute set A_1
Q         database query
D_Q       results of Q
D_uf      user feedback as clicked instances in D_Q
γ         fraction of instances desired by the user

P(σ_F | d) = 1 if d is returned under σ_F, and P(σ_F | d) = 0 otherwise.

Since query form F projects instances onto the attribute set A_F, we have D_{A_F} as the projected database and P(d_{A_F}) as the probability of the projected instance d_{A_F} in the projected database. Since there are often duplicate projected instances, P(d_{A_F}) may be greater than 1/N. Let P_u(d) be the probability that d is desired by the user, and P_u(d_{A_F}) the probability that the user is interested in the projected instance d_{A_F}. We give an example below to illustrate this notation.

TABLE 3
Data Table

ID  C1  C2  C3  C4  C5
I1  a1  b1  c1  20  1
I2  a1  b2  c2  20  100
I3  a1  b2  c3  30  99
I4  a1  b1  c4  20  1
I5  a1  b3  c4  10  2

Example 1: Consider a query form F_i over the relational data table shown in Table 3. There are 5 data instances in this table, D = {I1, I2, ..., I5}, with 5 data attributes A = {C1, C2, C3, C4, C5}, so N = 5. Query form F_i executes a query Q as SELECT C2, C5 FROM D WHERE C2 = b1 OR C2 = b2. The query result is D_Q = {I1, I2, I3, I4} projected onto C2 and C5. Thus P(σ_{F_i} | d) is 1 for I1 to I4 and 0 for I5. Instances I1 and I4 have the same projected values, so we can use I1 to represent both of them and P(I1_{C2,C5}) = 2/5.

Metrics: We now describe the two measures, expected precision and expected recall, for query forms.

Definition 2: Given a set of projection attributes A and a universe of selection expressions, the expected precision and expected recall of a query form F = (A_F, R_F, σ_F, ⋈(R_F)) are Precision_E(F) and Recall_E(F), respectively, i.e.,

Precision_E(F) = [ Σ_{d∈D_{A_F}} P_u(d_{A_F}) P(d_{A_F}) P(σ_F | d) N ] / [ Σ_{d∈D_{A_F}} P(d_{A_F}) P(σ_F | d) N ],    (1)

Recall_E(F) = [ Σ_{d∈D_{A_F}} P_u(d_{A_F}) P(d_{A_F}) P(σ_F | d) N ] / (γ N),    (2)

where A_F ⊆ A, σ_F is drawn from the universe of selection expressions, and γ is the fraction of instances desired by the user, i.e., γ = Σ_{d∈D} P_u(d) P(d).

The numerators of both equations represent the expected number of data instances in the query result that are desired by the user. In the query result, each data instance is projected onto the attributes in A_F, so P_u(d_{A_F}) represents the user interest in instance d in the query result. P(d_{A_F})·N is the expected number of rows in D that the projected instance d_{A_F} represents. Further, given a data instance d ∈ D, d being desired by the user and d satisfying σ_F are independent. Therefore, the product of P_u(d_{A_F}) and P(σ_F | d) can be interpreted as the probability that d is desired by the user and, at the same time, returned in the query result. Summing over all data instances gives the expected number of data instances in the query result that are desired by the user.

Similarly, the denominator of Eq. (1) is simply the number of instances in the query result, and the denominator of Eq. (2) is the expected number of instances desired by the user in the whole database. In both equations N cancels out, so we do not need to consider N when estimating precision and recall. The probabilities in these equations can be estimated using the methods described in Section 5. γ = Σ_{d∈D} P_u(d) P(d) is the fraction of instances desired by the user. P(d) is given by D, and P_u(d) can be estimated by the method described in Section 5.1.

For example, suppose that in Example 1, after projecting onto C2, C5, there are only 4 distinct instances I1, I2, I3, and I5 (I4 has the same projected values as I1). The probabilities of these projected instances are 0.4, 0.2, 0.2, and 0.2, respectively. Suppose P_u for I2 and I3 is 0.9 and P_u for I1 and I5 is 0.03. The expected precision equals (0.03·0.4 + 0.9·0.2 + 0.9·0.2 + 0) / (0.4 + 0.2 + 0.2 + 0) = 0.465. Suppose γ = 0.4; then the expected recall equals (0.03·0.4 + 0.9·0.2 + 0.9·0.2 + 0) / 0.4 = 0.93.

Considering both expected precision and expected recall, we derive the overall performance measure, the expected F-measure, shown in Equation (3). Note that β is a constant parameter that controls the preference between expected precision and expected recall.

Definition 3: Given a set of projection attributes A and a universe of selection expressions, the expected F-measure of a query form F = (A_F, R_F, σ_F, ⋈(R_F)) is FScore_E(F), i.e.,

FScore_E(F) = (1 + β²) · Precision_E(F) · Recall_E(F) / (β² · Precision_E(F) + Recall_E(F)).
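The snippet below reproduces the Example 1 numbers to show how expected precision, expected recall, and the expected F-measure fit together. It is a minimal sketch: the probability values and γ = 0.4 are taken from the example above, and β is left as a free parameter as in Definition 3.

```python
# Projected instances of Example 1 after projecting onto (C2, C5):
# each entry is (P(d_A), P_u(d_A), P(sigma_F | d)).
instances = [
    (0.4, 0.03, 1),  # I1 (also covers I4)
    (0.2, 0.90, 1),  # I2
    (0.2, 0.90, 1),  # I3
    (0.2, 0.03, 0),  # I5, filtered out by the selection
]
gamma = 0.4  # fraction of instances desired by the user

num = sum(p * pu * sat for p, pu, sat in instances)   # expected #desired rows returned
ret = sum(p * sat for p, pu, sat in instances)        # expected #rows returned (N cancels out)

precision_e = num / ret
recall_e = num / gamma

def fscore_e(beta: float) -> float:
    """Expected F-measure of Definition 3 for a given preference parameter beta."""
    return ((1 + beta ** 2) * precision_e * recall_e
            / (beta ** 2 * precision_e + recall_e))

print(round(precision_e, 3), round(recall_e, 2), round(fscore_e(2.0), 3))
# -> 0.465 0.93 ...
```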

Problem Definition: In our system, we provide a ranked list of query form components to the user. Problem 1 is the formal statement of this ranking problem.

Problem 1: Let the current query form be F_i and the next query form be F_{i+1}. Construct a ranking of all candidate form components in descending order of FScore_E(F_{i+1}), where F_{i+1} is the query form F_i enriched by the corresponding form component.

FScore_E(F_{i+1}) is the estimated goodness of the next query form F_{i+1}. Since we aim to maximize the goodness of the next query form, the form components are ranked in descending order of FScore_E(F_{i+1}). In the next section, we discuss how to compute FScore_E(F_{i+1}) for a specific form component.

    5 ESTIMATION OF RANKING SCORE

    5.1 Ranking Projection Form Components

DQF provides a two-level ranked list for projection components. The first level is the ranked list of entities; the second level is the ranked list of attributes within the same entity. We first describe how to rank each entity's attributes locally, and then describe how to rank entities.

    5.1.1 Ranking Attributes

Suggesting projection components means suggesting attributes for projection. Let the current query form be F_i and the next query form be F_{i+1}. Let A_{F_i} = {A_1, A_2, ..., A_j} and A_{F_{i+1}} = A_{F_i} ∪ {A_{j+1}}, with j+1 ≤ |A|. A_{j+1} is the projection attribute we want to suggest for F_{i+1}, i.e., the one that maximizes FScore_E(F_{i+1}). From Definition 3, we obtain FScore_E(F_{i+1}) as follows:

FScore_E(F_{i+1})
  = (1 + β²) · Precision_E(F_{i+1}) · Recall_E(F_{i+1}) / (β² · Precision_E(F_{i+1}) + Recall_E(F_{i+1}))
  = (1 + β²) · Σ_{d∈D_{A_{F_{i+1}}}} P_u(d_{A_{F_{i+1}}}) P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}} | d) / ( Σ_{d∈D} P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}} | d) + β²γ ).    (3)

Note that adding a projection component A_{j+1} does not affect the selection part of F_i. Hence σ_{F_{i+1}} = σ_{F_i} and P(σ_{F_{i+1}} | d) = P(σ_{F_i} | d). Since F_i has already been used by the user, we can estimate P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}} | d) as follows. For each query submitted through form F_i, we keep the query results including all columns in R_F. Clearly, instances not in the query results have P(σ_{F_{i+1}} | d) = 0 and need not be considered. For each instance d in the query results, we simply count the number of times it appears in the results, and P(d_{A_{F_{i+1}}}) P(σ_{F_{i+1}} | d) equals this occurrence count divided by N.
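The counting step above can be implemented directly over the cached results of F_i. The sketch below is a minimal illustration under our own assumptions about how rows are represented (dictionaries keyed by attribute name).

```python
from collections import Counter
from typing import Dict, List

def projected_probabilities(cached_rows: List[Dict[str, object]],
                            proj_attrs: List[str],
                            n_total: int) -> Dict[tuple, float]:
    """Estimate P(d_{A_{F_{i+1}}}) * P(sigma_{F_{i+1}} | d) for every projected
    instance seen in the cached results of F_i: occurrence count divided by N."""
    counts = Counter(tuple(row[a] for a in proj_attrs) for row in cached_rows)
    return {proj: c / n_total for proj, c in counts.items()}

# Example with the Table 3 data (N = 5), projecting onto (C2, C5):
rows = [{"C2": "b1", "C5": 1}, {"C2": "b2", "C5": 100},
        {"C2": "b2", "C5": 99}, {"C2": "b1", "C5": 1}]
print(projected_probabilities(rows, ["C2", "C5"], n_total=5))
# {('b1', 1): 0.4, ('b2', 100): 0.2, ('b2', 99): 0.2}
```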

Now we only need to estimate P_u(d_{A_{F_{i+1}}}). For the projection components, we have:

P_u(d_{A_{F_{i+1}}}) = P_u(d_{A_1}, ..., d_{A_j}, d_{A_{j+1}}) = P_u(d_{A_{j+1}} | d_{A_{F_i}}) · P_u(d_{A_{F_i}}).    (4)

P_u(d_{A_{F_i}}) in Eq. (4) can be estimated from the user's click-through on the results of F_i. The click-through D_uf ⊆ D is the set of data instances clicked by the user in previous query results. We apply kernel density estimation to estimate P_u(d_{A_{F_i}}): each clicked instance x ∈ D_uf contributes a Gaussian component of the user's interest. Then,

P_u(d_{A_{F_i}}) = (1 / |D_uf|) Σ_{x∈D_uf} (1 / √(2πσ²)) exp( −d(d_{A_{F_i}}, x_{A_{F_i}})² / (2σ²) ),

where d(·, ·) denotes the distance between two data instances and σ² is the variance of the Gaussian kernels. For numerical data, the Euclidean distance is a conventional choice of distance function. For categorical data, such as strings, previous literature proposes several context-based similarity functions which can be employed for categorical data instances [14] [8].
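The sketch below shows this kernel density estimate over the clicked instances. It is a minimal illustration under our own assumptions: instances are numeric vectors, Euclidean distance is used, and the bandwidth sigma is a fixed parameter rather than anything prescribed by the paper.

```python
import math
from typing import List, Sequence

def p_u_kde(d_proj: Sequence[float],
            clicked_proj: List[Sequence[float]],
            sigma: float = 1.0) -> float:
    """Kernel density estimate of the user's interest P_u(d_{A_Fi}):
    one Gaussian kernel centered at each clicked (projected) instance."""
    if not clicked_proj:
        return 0.0
    norm = 1.0 / math.sqrt(2 * math.pi * sigma ** 2)
    total = sum(norm * math.exp(-math.dist(d_proj, x) ** 2 / (2 * sigma ** 2))
                for x in clicked_proj)
    return total / len(clicked_proj)

# Example: the user clicked two rows whose projection is (City_MPG, Hwy_MPG).
clicks = [(20.0, 28.0), (22.0, 30.0)]
print(p_u_kde((21.0, 29.0), clicks, sigma=2.0))
```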

P_u(d_{A_{j+1}} | d_{A_{F_i}}) in Eq. (4) is not observable in the run-time data, since A_{j+1} has not been used before F_{i+1}. We can only estimate it from other data sources. We mainly consider the following two data-driven approaches to estimating the conditional probability P_u(d_{A_{j+1}} | d_{A_{F_i}}).

Workload-Driven Approach: The conditional probability P_u(d_{A_{j+1}} | d_{A_{F_i}}) can be estimated from the query results of historical queries. If many users queried attributes A_{F_i} and A_{j+1} together on instance d, then P_u(d_{A_{j+1}} | d_{A_{F_i}}) should be high.

Schema-Driven Approach: The database schema implies the relationships among the attributes. If two attributes are contained in the same entity, they are more relevant to each other.

Each of the two approaches has its own drawback. The workload-driven approach has a cold-start problem, since it needs a large number of queries. The schema-driven approach is not able to distinguish between attributes of the same entity. In our system, we combine the two approaches as follows:

P_u(d_{A_{j+1}} | d_{A_{F_i}}) = (1 − λ) · P_b(d_{A_{j+1}} | d_{A_{F_i}}) + λ · sim(A_{j+1}, A_{F_i}),

where P_b(d_{A_{j+1}} | d_{A_{F_i}}) is the probability estimated from the historical queries, sim(A_{j+1}, A_{F_i}) is the similarity between A_{j+1} and A_{F_i} estimated from the database schema, and λ is a weight parameter in [0, 1] that balances the workload-driven and schema-driven estimates. The similarity is

sim(A_{j+1}, A_{F_i}) = 1 − ( Σ_{A∈A_{F_i}} d(A_{j+1}, A) ) / ( |A_{F_i}| · d_max ),

where d(A_{j+1}, A) is the schema distance between attributes A_{j+1} and A in the schema graph, and d_max is the diameter of the schema graph. The idea of treating a database schema as a graph was initially proposed by [16], which uses a PageRank-like algorithm to compute the importance of an attribute in the schema according to the schema graph. In this paper, we utilize the schema graph to compute the relevance of two attributes. A database schema graph is denoted by G = (R, FK, θ, A), in which R is the set of nodes representing the relations, A is the set of attributes, FK is the set of edges representing the foreign keys, and θ : A → R is an attribute labeling function indicating which relation contains each attribute. Based on the database schema graph, the schema distance is defined as follows.

Definition 4 (Schema Distance): Given two attributes A_1, A_2 ∈ A in a database schema graph G = (R, FK, θ, A), the schema distance between A_1 and A_2 is d(A_1, A_2), the length of the shortest path between node θ(A_1) and node θ(A_2).
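To make the schema-driven part concrete, the sketch below computes the schema distance of Definition 4 by breadth-first search over a relation-level graph and then the similarity sim(A_{j+1}, A_{F_i}) above. It is only an illustration; the graph encoding (an adjacency list over relation names plus an attribute-to-relation map) is our own assumption.

```python
from collections import deque
from typing import Dict, List

def schema_distance(graph: Dict[str, List[str]], attr_rel: Dict[str, str],
                    a1: str, a2: str) -> int:
    """Definition 4: length of the shortest path between the relations
    containing attributes a1 and a2, following foreign-key edges (BFS)."""
    start, goal = attr_rel[a1], attr_rel[a2]
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        rel, dist = queue.popleft()
        if rel == goal:
            return dist
        for nxt in graph.get(rel, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return len(graph)  # unreachable: fall back to an upper bound on the diameter

def similarity(graph, attr_rel, a_new: str, proj_attrs: List[str], d_max: int) -> float:
    """sim(A_{j+1}, A_{F_i}) = 1 - (sum of schema distances) / (|A_{F_i}| * d_max)."""
    total = sum(schema_distance(graph, attr_rel, a_new, a) for a in proj_attrs)
    return 1.0 - total / (len(proj_attrs) * d_max)

# Example: players -- player_playoffs_career -- teams, linked by foreign keys.
g = {"players": ["player_playoffs_career"],
     "player_playoffs_career": ["players", "teams"],
     "teams": ["player_playoffs_career"]}
rels = {"firstname": "players", "minutes": "player_playoffs_career", "location": "teams"}
print(similarity(g, rels, "location", ["firstname", "minutes"], d_max=2))
```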

5.1.2 Ranking Entities

The ranking score of an entity is the average FScore_E(F_{i+1}) of that entity's attributes. Intuitively, if one entity has many high-scoring attributes, it should be ranked higher.

    5.2 Ranking Selection Form Components

The selection attributes must be relevant to the currently projected entities; otherwise the selection would be meaningless. Therefore, the system should first find the relevant attributes for creating the selection components. We first describe how to select relevant attributes, and then describe a naive method and a more efficient One-Query method to rank selection components.

5.2.1 Relevant Attribute Selection

The relevance of attributes in our system is measured based on the database schema, as follows.

Definition 5 (Relevant Attributes): Given a database query form F with a schema graph G = (R, FK, θ, A), the set of relevant attributes is A_r(F) = {A | A ∈ A, ∃A_j ∈ A_F, d(A, A_j) ≤ t}, where t is a user-defined threshold and d(A, A_j) is the schema distance defined in Definition 4.

The choice of t depends on how compactly the schema is designed. For instance, some databases put all attributes of one entity into a single relation, in which case t can be 1; other databases split the attributes of one entity across several relations, in which case t should be greater than 1. Using a depth-first traversal of the database schema graph, A_r(F) can be obtained in O(|A_r(F)| · t).
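A relevant-attribute scan of this kind can be written as a bounded traversal of the schema graph, as in the sketch below. It reuses the same hypothetical adjacency-list encoding of the schema graph as the earlier sketch and is only an illustration; a breadth-first search bounded by t is used here for simplicity.

```python
from collections import deque
from typing import Dict, List, Set

def relevant_attributes(graph: Dict[str, List[str]], rel_attrs: Dict[str, List[str]],
                        attr_rel: Dict[str, str], proj_attrs: List[str],
                        t: int) -> Set[str]:
    """Definition 5: all attributes whose relation is within schema distance t
    of the relation of some projection attribute of the current form."""
    relevant: Set[str] = set()
    for a_j in proj_attrs:
        seen = {attr_rel[a_j]}
        queue = deque([(attr_rel[a_j], 0)])
        while queue:
            rel, dist = queue.popleft()
            relevant.update(rel_attrs.get(rel, []))
            if dist < t:
                for nxt in graph.get(rel, []):
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append((nxt, dist + 1))
    return relevant

# Example with the same toy NBA-like schema, threshold t = 1.
g = {"players": ["player_playoffs_career"],
     "player_playoffs_career": ["players", "teams"],
     "teams": ["player_playoffs_career"]}
attrs = {"players": ["firstname", "lastname"],
         "player_playoffs_career": ["minutes"], "teams": ["team", "location"]}
owner = {"firstname": "players", "lastname": "players",
         "minutes": "player_playoffs_career", "team": "teams", "location": "teams"}
print(sorted(relevant_attributes(g, attrs, owner, ["minutes"], t=1)))
```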

5.2.2 Ranking Selection Components

When enriching the selection form components of a query form, the set of projection components A_F is fixed, i.e., A_{F_{i+1}} = A_{F_i}. Therefore, FScore_E(F_{i+1}) only depends on σ_{F_{i+1}}.

For simplicity of the user interface, most selection components of query forms are simple binary relations of the form "A_j op c_j", where A_j is an attribute, c_j is a constant, and op is a relational operator such as =, ≤, ≥, and so on. In each cycle, the system provides a ranked list of such binary relations for users to enrich the selection part. Since the total number of binary relations is very large, we only select the best selection component for each attribute.

For an attribute A_s ∈ A_r(F), let σ_{F_{i+1}} = σ_{F_i} ∪ {σ_s}, where σ_s is a candidate selection expression over A_s. According to the formula of FScore_E(F_{i+1}), in order to find the σ_s that maximizes FScore_E(F_{i+1}), we only need to estimate P(σ_{F_{i+1}} | d) for each data instance d ∈ D. Note that in our system σ_F represents a conjunctive expression which connects all elementary binary expressions by AND; σ_{F_{i+1}} holds if and only if both σ_{F_i} and σ_s hold. Hence σ_{F_{i+1}} ⇔ σ_{F_i} ∧ σ_s, and we have:

P(σ_{F_{i+1}} | d) = P(σ_{F_i}, σ_s | d) = P(σ_s | σ_{F_i}, d) · P(σ_{F_i} | d).    (5)

P(σ_{F_i} | d) can be estimated from the previous queries executed on query form F_i, as discussed in Section 5.1. P(σ_s | σ_{F_i}, d) is 1 if and only if d satisfies both σ_{F_i} and σ_s, and 0 otherwise. The remaining problem is to determine the space of σ_s, since we have to enumerate all candidate σ_s to compute their scores. Note that σ_s is a binary expression of the form "A_s op_s c_s", in which A_s is fixed and given, op_s ∈ OP, where OP is a finite set of relational operators {=, ≤, ≥, ...}, and c_s belongs to the data domain of A_s in the database. Therefore, the space of σ_s is the finite set OP × D_{A_s}. In order to efficiently estimate the new FScore induced by a query condition σ_s, we propose the One-Query method. The idea of One-Query is simple: we sort the values of the attribute in σ_s and incrementally compute the FScore over all possible values of that attribute.

To find the best selection component for the next query form, the first step is to query the database to retrieve the data instances. Eq. (5) shows that P(σ_{F_{i+1}} | d) depends on the previous query conditions σ_{F_i}: if P(σ_{F_i} | d) = 0, then P(σ_{F_{i+1}} | d) must be 0. Hence, in order to compute P(σ_{F_{i+1}} | d) for each d ∈ D, we do not need to retrieve all data instances in the database. What we need is only the set of data instances D′ ⊆ D such that each d ∈ D′ satisfies P(σ_{F_i} | d) > 0. Therefore, the selection condition of the One-Query query is the union (disjunction) of the query conditions previously executed on F_i.

In addition, the One-Query algorithm does not send each candidate query condition σ_s to the database engine to select data instances, which would be a heavy burden on the database engine since the number of query conditions is large. Instead, it retrieves the set of data instances D′ once and checks every data instance against every query condition itself. For this purpose, the algorithm needs to know the values of all selection attributes of D′. Hence, One-Query adds all the selection attributes to the projections of its query.

Algorithm 1 describes the One-Query query construction. The function GenerateQuery generates the database query for the given set of projection attributes A_one with selection expression σ_one.

Algorithm 1: QueryConstruction
Data: Q = {Q_1, Q_2, ...} is the set of previous queries executed on F_i.
Result: Q_one is the query of One-Query.
begin
    σ_one ← false
    for Q ∈ Q do
        σ_one ← σ_one ∨ σ_Q
    A_one ← A_{F_i} ∪ A_r(F_i)
    Q_one ← GenerateQuery(A_one, σ_one)
end
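In implementation terms, Algorithm 1 amounts to assembling one SQL statement whose WHERE clause is the disjunction of the previously executed conditions and whose SELECT list covers both the current projection attributes and all relevant selection attributes. The sketch below shows this assembly under our own assumptions about how the conditions and attribute lists are passed in; it is only an illustration.

```python
from typing import List

def build_one_query(prev_conditions: List[str],
                    proj_attrs: List[str],
                    relevant_attrs: List[str],
                    relations: List[str],
                    join_preds: List[str]) -> str:
    """Algorithm 1 in SQL form: project A_Fi plus A_r(Fi) and select the
    union (OR) of all query conditions previously executed on F_i."""
    sigma_one = " OR ".join(f"({c})" for c in prev_conditions) or "1=0"
    a_one = list(dict.fromkeys(proj_attrs + relevant_attrs))  # de-duplicate, keep order
    where = " AND ".join(join_preds + [f"({sigma_one})"])
    return f"SELECT {', '.join(a_one)} FROM {', '.join(relations)} WHERE {where}"

print(build_one_query(
    prev_conditions=["c.minutes > 5000", "c.minutes > 3000"],
    proj_attrs=["p.firstname", "p.lastname"],
    relevant_attrs=["c.minutes"],
    relations=["players p", "player_playoffs_career c"],
    join_preds=["p.ilkid = c.ilkid"]))
```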

When the system receives the result of query Q_one from the database engine, it calls the second part of One-Query to find the best query condition.

We first discuss ≤ conditions. The basic idea of this algorithm is based on a simple property: for a specific attribute A_s and a data instance d, given two conditions

σ_{s1}: A_s ≤ a_1,
σ_{s2}: A_s ≤ a_2,

with a_1 ≤ a_2, if σ_{s1} is satisfied then σ_{s2} must also be satisfied. Based on this property, we can incrementally compute the FScore of every query condition by scanning the data instances in one pass. There are two steps:

1) First, we sort the values of A_s in the order a_1 ≤ a_2 ≤ ... ≤ a_m, where m is the number of distinct values of A_s. Let D_{a_j} denote the set of data instances in which the value of A_s equals a_j.

2) Then, we go through every data instance in the order of its A_s value. Let the query condition be σ_{s_j} = "A_s ≤ a_j" and its corresponding FScore be fscore_j. According to Eq. (3), fscore_j can be computed as

fscore_j = (1 + β²) · n_j / d_j,

n_j = Σ_{d∈D_{Q_one}} P_u(d_{A_{F_i}}) P(d_{A_{F_i}}) P(σ_{F_i} | d) P(σ_{s_j} | d),

d_j = Σ_{d∈D_{Q_one}} P(d_{A_{F_i}}) P(σ_{F_i} | d) P(σ_{s_j} | d) + β²γ.

For j > 1, n_j and d_j can be calculated incrementally:

n_j = n_{j−1} + Σ_{d∈D_{a_j}} P_u(d_{A_{F_i}}) P(d_{A_{F_i}}) P(σ_{F_i} | d) P(σ_{s_j} | d),

d_j = d_{j−1} + Σ_{d∈D_{a_j}} P(d_{A_{F_i}}) P(σ_{F_i} | d) P(σ_{s_j} | d).

Algorithm 2 shows the pseudocode for finding the best ≤ condition.

Algorithm 2: FindBestLessEqCondition
Data: γ is the fraction of instances desired by the user, D_{Q_one} is the query result of Q_one, A_s is the selection attribute.
Result: σ_s* is the best ≤ query condition on A_s.
begin
    D_sorted ← Sort(D_{Q_one}, A_s)        // sort the instances by their A_s values
    σ_s* ← ∅, fscore* ← 0
    n ← 0, den ← β²γ
    for i ← 1 to |D_sorted| do
        x ← D_sorted[i]
        σ_s ← "A_s ≤ x[A_s]"
        // incrementally update the fscore of "A_s ≤ x[A_s]"
        n ← n + P_u(x_{A_{F_i}}) · P(x_{A_{F_i}}) · P(σ_{F_i} | x) · P(σ_s | x)
        den ← den + P(x_{A_{F_i}}) · P(σ_{F_i} | x) · P(σ_s | x)
        fscore ← (1 + β²) · n / den
        if fscore ≥ fscore* then
            σ_s* ← σ_s, fscore* ← fscore
end
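For concreteness, a runnable version of this incremental scan is sketched below. It follows the structure of Algorithm 2 but is our own illustration: rows are dictionaries carrying the attribute value together with precomputed P_u(d_{A_{F_i}}), P(d_{A_{F_i}}), and P(σ_{F_i} | d) terms, and β and γ are passed in as parameters.

```python
from typing import Dict, List, Optional, Tuple

def find_best_leq_condition(rows: List[Dict[str, float]], attr: str,
                            beta: float, gamma: float) -> Tuple[Optional[float], float]:
    """One pass over the rows, sorted by `attr`, incrementally scoring every
    candidate condition "attr <= value" and returning the best one."""
    rows = sorted(rows, key=lambda r: r[attr])
    best_value, best_fscore = None, 0.0
    num, den = 0.0, beta ** 2 * gamma
    for r in rows:
        # r["pu"], r["p"], r["p_sigma"] stand for P_u(d_AFi), P(d_AFi), P(sigma_Fi | d).
        num += r["pu"] * r["p"] * r["p_sigma"]
        den += r["p"] * r["p_sigma"]
        fscore = (1 + beta ** 2) * num / den
        if fscore >= best_fscore:
            best_value, best_fscore = r[attr], fscore
    return best_value, best_fscore

rows = [{"minutes": 1000, "pu": 0.1, "p": 0.2, "p_sigma": 1},
        {"minutes": 4000, "pu": 0.8, "p": 0.2, "p_sigma": 1},
        {"minutes": 6000, "pu": 0.9, "p": 0.2, "p_sigma": 1}]
value, score = find_best_leq_condition(rows, "minutes", beta=2.0, gamma=0.4)
print(f"best condition: minutes <= {value} (fscore {score:.3f})")
```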

Complexity: For other query conditions, such as = and ≥, we can find similar incremental approaches to compute their FScores. They all share the sorting result of the first step, and in the second step all incremental computations can be merged into one pass over D_{Q_one}. Therefore, the time complexity of finding the best query condition for one attribute is O(|D_{Q_one}| · |A_{F_i}|), and ranking the selection components of every relevant attribute is O(|D_{Q_one}| · |A_{F_i}| · |A_r(F_i)|).

5.2.3 Diversity of Selection Components

Two selection components may have a large overlap (or redundancy). For example, if a user is interested in customers with age between 30 and 45, then two selection components, age > 28 and age > 29, would obtain similar FScores and similar sets of data instances; the two selections are therefore redundant. Besides high precision, we also want the recommended selection components to have high diversity. Diversity is a recent research topic in recommendation systems and web search engines [6] [28]. However, simultaneously maximizing precision and diversity is an NP-hard problem [6] and cannot be implemented efficiently in an interactive system. In our dynamic query form system, we observe that most redundant selection components are constructed from the same attribute. Thus, we only recommend the best selection component for each attribute.

6 EVALUATION

The goal of our evaluation is to verify the following hypotheses:

H1: Is DQF more usable than existing approaches such as static query forms and customized query forms?
H2: Is DQF more effective at ranking projection and selection components than the baseline method and the random method?
H3: Is DQF efficient enough to rank the recommended query form components in an online user interface?

6.1 System Implementation and Experimental Setup

We implemented the dynamic query form system as a web-based application using JDK 1.6 with Java Server Pages. The dynamic web interface for the query forms uses the open-source JavaScript library jQuery 1.4. We used MySQL 5.1.39 as the database engine. All experiments were run on a machine with an Intel Core 2 CPU and 3.5 GB of main memory, running Windows XP SP2. Figure 3 shows a system prototype.

Data sets: Three databases were used in our experiments: NBA, Green Car, and Geobase (see the data set sources listed below). Table 4 gives a general description of these databases.

TABLE 4
Data Description

Name       #Relations  #Attributes  #Instances
NBA        10          180          44,590
Green Car  1           17           2,187
Geobase    9           32           1,329

Form Generation Approaches: We compared three approaches to generating query forms:

- DQF: the dynamic query form system proposed in this paper.
- SQF: the static query form generation approach proposed in [18]. It also uses the query workload: queries in the workload are first divided into clusters, and each cluster is converted into a query form.
- CQF: the customized query form generation used by many existing database clients, such as Microsoft Access, EasyQuery, and ActiveQueryBuilder.

User Study Setup: We conducted a user study to evaluate the usability of our approach.

Data set sources: (1) NBA: http://www.databasebasketball.com; (2) Green Car: http://www.epa.gov/greenvehicles; (3) Geobase is a database of geographic information about the USA, which is used in [16].

Fig. 3. Screenshot of the Web-based Dynamic Query Form

We recruited 20 participants, including graduate students, UI designers, and software engineers. The system prototype is shown in Figure 3. The user study contains two phases: a query collection phase and a testing phase. In the collection phase, each participant used our system to submit some queries, and we collected these queries: 75 queries for NBA, 68 for Green Car, and 132 for Geobase. These queries were used as the query workload to train our system (see Section 5.1). In the second phase, we asked each participant to complete 12 tasks (none of which appeared in the workload), listed in Table 5. Each participant used all three form generation approaches to form the queries. The order of the three approaches was randomized to remove bias. We set the parameter λ = 0.001 in our experiments because our databases collect a certain amount of historical queries, so we mainly rely on the probability estimated from the historical queries.

Simulation Study Setup: We also used the collected queries in a larger-scale simulation study. We used a cross-validation approach which partitions the queries into a training set (used as workload information) and a testing set, and we report the average performance over the testing sets.

6.2 User Study Results

Usability Metrics: We employ several metrics widely used in Human-Computer Interaction and Software Quality to measure the usability of a system [31], [27]. These metrics are listed in Table 7.

TABLE 7
Usability Metrics

Metric    Definition
ACmin     The minimal number of actions for users
AC        The actual number of actions performed by users
ACratio   ACmin / AC × 100.0%
FNmax     The total number of provided UI functions for users to choose from
FN        The number of UI functions actually used by the user
FNratio   FN / FNmax × 100%
Success   The percentage of users who successfully completed a specific task

In database query forms, one action means a mouse click or a keyboard input into a textbox. ACmin is the minimal number of actions for a querying task. One function means a provided option for the user to use, such as a query form or a form component. In a web-page-based system, FNmax is the total number of UI components in the web pages explored by the user. In this user study, each page contains at most 5 UI components. The smaller ACmin, AC, FNmax, and FN, the better the usability; similarly, the higher ACratio, FNratio, and Success, the better the usability. There is a trade-off between ACmin and FNmax. An extreme case is to generate all possible query forms in one web page: the user only needs to choose one query form to finish his or her query task, so ACmin is 1, but FNmax would then be the number of all possible query forms with their components, which is a huge number. On the other hand, if users have to interact a lot with a system, that system will know more about the users' desires. In that case, the system can cut down many unnecessary functions, so FNmax could be smaller, but ACmin would be higher since there are many user interactions.


TABLE 5
Query Tasks

T1:  SELECT ilkid, firstname, lastname FROM players
     Find all NBA players' IDs and full names.
T2:  SELECT p.ilkid, p.firstname, p.lastname FROM players p, player_playoffs_career c WHERE p.ilkid = c.ilkid AND c.minutes > 5000
     Find players who have played more than 5000 minutes in the playoffs.
T4:  SELECT t.team, t.location, c.firstname, c.lastname, c.year FROM teams t, coaches c WHERE t.team = c.team AND t.location = 'Los Angeles'
     Find the names of teams located in Los Angeles, with their coaches.
T5:  SELECT Models, Hwy_MPG FROM cars WHERE City_MPG > 20
     Find the highway MPG of cars whose city road MPG is greater than 20.
T6:  SELECT Models, Displ, Fuel FROM cars WHERE Sales_Area = 'CA'
     Find the model, displacement, and fuel of cars sold in California.
T7:  SELECT Models, Displ FROM cars WHERE Veh_Class = 'SUV'
     Find the displacement of all SUV cars.
T8:  SELECT Models FROM cars WHERE Drive = '4WD'
     Find all four-wheel-drive cars.
T9:  SELECT t0.population FROM city t0 WHERE t0.state = 'california'
     Find all cities in California with the population of each city.
T10: SELECT t1.state, t0.area FROM state t0, border t1 WHERE t1.name = 'wisconsin' AND t0.area > 80000 AND t0.name = t1.name
     Find the neighbor states of Wisconsin whose area is greater than 80,000 square miles.
T11: SELECT t0.name, t0.length FROM river t0 WHERE t0.state = 'illinois'
     Find all rivers across Illinois with each river's length.
T12: SELECT t0.name, t0.elevation, t0.state FROM mountain t0 WHERE t0.elevation > 5000
     Find all mountains whose elevation is greater than 5000 meters, with each mountain's state.

User Study Analysis: Table 6 shows the average results of the usability experiments for those query tasks. For SQF, we generated 10 static query forms based on the collected user queries for each database (i.e., 10 clusters were generated on the query workload).

The results show that users could not accomplish all querying tasks with SQF. The reason is that SQF is built from the query workload and may not be able to answer ad-hoc queries in the query tasks. For example, SQF does not contain any relevant attributes for query tasks T3 and T11, so users failed to accomplish those queries with SQF.

TABLE 8
Statistical Test on FNmax (compared with CQF)

Task     T1      T4      T5      T10     T11     T12
P Value  0.0106


TABLE 6
Usability Results

Task  Query Form  ACmin  AC    ACratio  FNmax  FN  FNratio  Success
T1    DQF         6      6.7   90.0%    40.0   3   7.5%     100.0%
T1    CQF         6      7.0   85.7%    60.0   3   5%       100.0%
T1    SQF         1      1.0   100.0%   35.0   3   8.6%     44.4%
T2    DQF         7      7.7   91.0%    65.0   4   6.2%     100.0%
T2    CQF         8      10.0  80.0%    86.7   4   4.6%     100.0%
T2    SQF         1      1.0   100.0%   38.3   4   10.4%    16.7%
T3    DQF         10     10.7  93.5%    133.3  6   3.8%     100.0%
T3    CQF         12     13.3  90.2%    121.7  6   4.9%     100.0%
T3    SQF         1      N/A   N/A      N/A    6   N/A      0.0%
T4    DQF         11     11.7  94.0%    71.7   6   8.4%     100.0%
T4    CQF         12     13.3  90.2%    103.3  6   5.8%     100.0%
T4    SQF         1      1.0   100.0%   70.0   6   8.6%     16.7%
T5    DQF         5      5.7   87.7%    28.3   3   10.6%    100.0%
T5    CQF         6      6.7   90.0%    56.7   3   5.3%     100.0%
T5    SQF         1      1.0   100.0%   10     3   30.0%    66.7%
T6    DQF         7      7.7   91.0%    61.7   4   6.5%     100.0%
T6    CQF         8      10    80.0%    61.7   4   6.5%     100.0%
T6    SQF         1      1     100.0%   23.3   4   17.2%    41.7%
T7    DQF         5      6.0   83.3%    48.3   3   6.2%     100.0%
T7    CQF         6      6.7   90.0%    50.0   3   6.0%     100.0%
T7    SQF         1      1.0   100.0%   18.3   3   16.4%    44.4%
T8    DQF         3      3.3   91.0%    21.7   2   9.2%     100.0%
T8    CQF         4      4.7   85.1%    26.7   2   7.5%     100.0%
T8    SQF         1      1.0   100.0%   38.3   2   5.2%     66.7%
T9    DQF         5      6.3   79.3%    31.7   3   9.5%     100.0%
T9    CQF         6      6.7   90.0%    36.7   3   8.2%     100.0%
T9    SQF         1      1.0   100.0%   106.7  3   2.8%     66.7%
T10   DQF         6      6.7   90.0%    43.3   4   9.2%     100.0%
T10   CQF         8      8.7   92.0%    63.3   4   6.3%     100.0%
T10   SQF         1      1.0   100.0%   75.0   4   5.3%     33.3%
T11   DQF         5      6.3   79.4%    36.7   3   8.2%     100.0%
T11   CQF         6      6.7   90.0%    50.0   3   6.0%     100.0%
T11   SQF         1      N/A   N/A      N/A    3   N/A      0.0%
T12   DQF         7      7.7   91.0%    46.7   4   8.6%     100.0%
T12   CQF         8      10.0  80.0%    85.0   4   4.7%     100.0%
T12   SQF         1      1.0   100.0%   31.7   4   12.6%    25.0%

Provided that all users' queries are satisfied, the complexity of the query forms should be as small as possible. DQF generates one customized query form for each query: the average form complexity is 8.1 for NBA, 4.5 for Green Car, and 6.7 for Geobase. For SQF, in contrast, the complexity is 30 for NBA and 16 for Green Car (2 static query forms). This result shows that, in order to satisfy various query tasks, statically generated query forms have to be more complex.

    6.4 Effectiveness

We compare the ranking function of DQF with two other ranking methods: the baseline method and the random method. The baseline method ranks projection and selection attributes in ascending order of their schema distance (see Definition 4) to the current query form; for the query condition, it chooses the most frequently used condition for that attribute in the training set. The random method randomly suggests a query form component. The ground truth of the query form component ranking is obtained from the query workloads described in Section 6.1.

Ranking projection components: The ranking score is a supervised measure of the accuracy of the recommendation. It is obtained by comparing the computed ranking with the optimal ranking. In the optimal ranking, the component actually selected by the user is ranked first, so the ranking score evaluates how far from the top the actually selected component is ranked. The ranking score is computed as

RankScore(Q, A_j) = 1 / ( log(r̂(A_j)) + 1 ),

where Q is a test query, A_j is the j-th projection attribute of Q, and r̂(A_j) is the computed rank of A_j.
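As a small illustration of this measure, the snippet below scores one suggested ranking. It is a sketch under our own assumptions about the input format (a ranked list of attribute names and the attribute the user actually selected); the logarithm base is not specified in the text, so the natural logarithm is used here.

```python
import math
from typing import List

def rank_score(ranked_attrs: List[str], selected_attr: str) -> float:
    """RankScore(Q, A_j) = 1 / (log(rank of the selected attribute) + 1)."""
    r_hat = ranked_attrs.index(selected_attr) + 1  # ranks are 1-based
    return 1.0 / (math.log(r_hat) + 1.0)

# The user actually picked "minutes"; the system ranked it second.
print(rank_score(["points", "minutes", "rebounds"], "minutes"))  # ~0.59
```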

Figure 4 shows the average ranking scores for all queries in the workload. We compare three methods: DQF, Baseline, and Random. The x-axis indicates the portion of queries used for training, and the remaining queries are used for testing; the y-axis indicates the average ranking score over all testing queries. DQF always outperforms the baseline and random methods, and the gap grows as the portion of training queries increases, because DQF can better utilize the training queries.

Ranking selection components: The F-measure is utilized to measure the ranking of selection components.

Fig. 4. Ranking Scores of Suggested Projection Components: (a) NBA data, (b) Green Car data, (c) Geobase data. Each plot shows the average ranking score (y-axis) versus the training queries ratio (x-axis) for DQF, Baseline, and Random.

Fig. 5. Average F-Measure of Suggested Selection Components: (a) NBA data (top 5 ranked components), (b) Green Car data (top 5 ranked components), (c) Geobase data (top 3 ranked components). Each plot shows the average F-measure (y-axis) versus the feedback ratio (x-axis) for DQF, Baseline, and Random.

Fig. 6. Scalability of Ranking Selection Components: running time in milliseconds (y-axis) versus the number of data instances in the query result (x-axis) for queries Q1-Q4 on (a) NBA data, (b) Green Car data, and (c) Geobase data.

Intuitively, if the query result obtained by using the suggested selection component is closer to the actual query result, the F-measure should be higher. For a test query Q, we define the ground truth as the set of data instances returned by Q. We also construct a query Q̂ which is identical to Q except for its last selection component; the last selection component of Q̂ is the top-ranked component returned by one of the three ranking methods. We then compare the results of Q̂ to the ground truth to compute the F-measure. We randomly selected half of the queries in the workload as the training set and used the rest for testing. Since DQF uses the user's click-through as implicit feedback, we randomly selected a small portion (the Feedback Ratio) of the ground truth as click-through.

Figure 5 shows the F-measure (β = 2) of all methods on the data sets. The x-axis of those figures indicates the Feedback Ratio over the whole ground truth, and the y-axis is the average F-measure over all collected queries. As the figures show, DQF performs well even when there is no click-through information (Feedback Ratio = 0).

    6.5 Efficiency

The run-time cost of ranking projection and selection components in DQF depends on the current form components and the query result size. We therefore selected 4 complex queries with large result sizes for each data set. Table 10, Table 11, and Table 12 list these queries; the join conditions are implicit inner joins written in the WHERE clause. We varied the query result size by query paging in the MySQL engine. The running times for ranking projection components are all less than 1 millisecond, since DQF only computes the schema distances and conditional probabilities of attributes. Figure 6 shows the time for DQF to rank selection components for queries on the data sets. The results show that the execution time grows approximately linearly with the query result size. The execution time is between 1 and 3 seconds for NBA when the results contain 10,000 records, less than 0.11 seconds for Green Car when the results contain 2,000 records, and less than 0.5 seconds for Geobase when the results contain 10,000 records. So DQF can be used in an interactive environment.

TABLE 10
NBA's Queries in the Scalability Test

Q1: SELECT t0.coachid, t2.leag, t2.location, t2.team, t3.d_blk, t1.fta, t0.season_win, t1.fgm FROM coaches t0, player_regular_season t1, teams t2, team_seasons t3 WHERE t0.team = t1.team AND t1.team = t3.team AND t3.team = t2.team
Q2: SELECT t2.lastname, t2.firstname, t1.won FROM player_regular_season t0, team_seasons t1, players t2 WHERE t1.team = t0.team AND t0.ilkid = t2.ilkid
Q3: SELECT t0.lastname, t0.firstname FROM players t0, player_regular_season t1, team_seasons t2 WHERE t2.team = t1.team AND t1.ilkid = t0.ilkid
Q4: SELECT t0.won, t3.name, t2.h_feet FROM team_seasons t0, player_regular_season t1, players t2, teams t3 WHERE t3.team = t0.team AND t0.team = t1.team AND t1.ilkid = t2.ilkid

TABLE 11
Green Car's Queries in the Scalability Test

Q1: SELECT Underhood_ID, Displ, Hwy_MPG FROM cars WHERE City_MPG

REFERENCES

[12] E. Chu, A. Baid, X. Chai, A. Doan, and J. F. Naughton. Combining keyword search and forms for ad hoc querying of databases. In Proceedings of the ACM SIGMOD Conference, pages 349-360, Providence, Rhode Island, USA, June 2009.

[13] S. Cohen-Boulakia, O. Biton, S. Davidson, and C. Froidevaux. BioGuideSRS: querying multiple sources with a user-centric perspective. Bioinformatics, 23(10):1301-1303, 2007.

[14] G. Das and H. Mannila. Context-based similarity measures for categorical databases. In Proceedings of PKDD 2000, pages 201-210, Lyon, France, September 2000.

[15] W. B. Frakes and R. A. Baeza-Yates. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, 1992.

[16] M. Jayapandian and H. V. Jagadish. Automated creation of a forms-based database query interface. In Proceedings of the VLDB Endowment, pages 695-709, August 2008.

[17] M. Jayapandian and H. V. Jagadish. Expressive query specification through form customization. In Proceedings of the International Conference on Extending Database Technology (EDBT), pages 416-427, Nantes, France, March 2008.

[18] M. Jayapandian and H. V. Jagadish. Automating the design and construction of query forms. IEEE TKDE, 21(10):1389-1402, 2009.

[19] T. Joachims and F. Radlinski. Search engines that learn from implicit feedback. IEEE Computer, 40(8):34-40, 2007.

[20] N. Khoussainova, M. Balazinska, W. Gatterbauer, Y. Kwon, and D. Suciu. A case for a collaborative query management system. In Proceedings of CIDR, Asilomar, CA, USA, January 2009.

[21] N. Khoussainova, Y. Kwon, M. Balazinska, and D. Suciu. SnipSuggest: Context-aware autocompletion for SQL. PVLDB, 4(1):22-33, 2010.

[22] J. C. Kissinger, B. P. Brunk, J. Crabtree, M. J. Fraunholz, B. Gajria, A. J. Milgram, D. S. Pearson, J. Schug, A. Bahl, S. J. Diskin, H. Ginsburg, G. R. Grant, D. Gupta, P. Labo, L. Li, M. D. Mailman, S. K. McWeeney, P. Whetzel, C. J. Stoeckert, and D. S. Roos. The Plasmodium genome database: Designing and mining a eukaryotic genomics resource. Nature, 419:490-492, 2002.

[23] C. Li, N. Yan, S. B. Roy, L. Lisham, and G. Das. Facetedpedia: dynamic generation of query-dependent faceted interfaces for Wikipedia. In Proceedings of WWW, pages 651-660, Raleigh, North Carolina, USA, April 2010.

[24] B. Liu and H. V. Jagadish. Using trees to depict a forest. PVLDB, 2(1):133-144, 2009.

[25] P. Mork, R. Shaker, A. Halevy, and P. Tarczy-Hornoch. PQL: a declarative query language over dynamic biological schemata. In Proceedings of the American Medical Informatics Association Fall Symposium, pages 533-537, San Antonio, Texas, 2007.

[26] A. Nandi and H. V. Jagadish. Assisted querying using instant-response interfaces. In Proceedings of ACM SIGMOD, pages 1156-1158, 2007.

[27] J. Nielsen. Usability Engineering. Morgan Kaufmann, San Francisco, 1993.

[28] D. Rafiei, K. Bharat, and A. Shukla. Diversifying web search results. In Proceedings of WWW, pages 781-790, Raleigh, North Carolina, USA, April 2010.

[29] S. B. Roy, H. Wang, U. Nambiar, G. Das, and M. K. Mohania. DynaCet: Building dynamic faceted search systems over databases. In Proceedings of ICDE, pages 1463-1466, Shanghai, China, March 2009.

[30] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1984.

[31] A. Seffah, M. Donyaee, R. B. Kline, and H. K. Padda. Usability measurement and metrics: A consolidated model. Software Quality Journal, 14(2):159-178, 2006.

[32] Q. T. Tran, C.-Y. Chan, and S. Parthasarathy. Query by output. In Proceedings of SIGMOD, pages 535-548, Providence, Rhode Island, USA, September 2009.

[33] G. Wolf, H. Khatri, B. Chokshi, J. Fan, Y. Chen, and S. Kambhampati. Query processing over incomplete autonomous databases. In Proceedings of VLDB, pages 651-662, 2007.

[34] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of SIGMOD, pages 103-114, Montreal, Canada, June 1996.

[35] S. Zhu, T. Li, Z. Chen, D. Wang, and Y. Gong. Dynamic active probing of helpdesk databases. Proc. VLDB Endow., 1(1):748-760, August 2008.

[36] M. M. Zloof. Query-by-example: the invocation and definition of tables and forms. In Proceedings of VLDB, pages 1-14, Framingham, Massachusetts, USA, September 1975.

Liang Tang received the master's degree in computer science from the Department of Computer Science, Sichuan University, Chengdu, China, in 2009. He is currently a Ph.D. student in the School of Computing and Information Sciences, Florida International University, Miami. His research interests are data mining, computing system management, and machine learning.

Tao Li received the Ph.D. degree in computer science from the Department of Computer Science, University of Rochester, Rochester, NY, in 2004. He is currently an Associate Professor in the School of Computing and Information Sciences, Florida International University, Miami. His research interests are data mining, computing system management, information retrieval, and machine learning. He is a recipient of an NSF CAREER Award and multiple IBM Faculty Research Awards.

Yexi Jiang received the master's degree in computer science from the Department of Computer Science, Sichuan University, Chengdu, China, in 2010. He is currently a Ph.D. student in the School of Computing and Information Sciences, Florida International University, Miami. His research interests are system-oriented data mining, intelligent cloud computing, large-scale data mining, databases, and the semantic web.

Zhiyuan Chen is an associate professor in the Department of Information Systems, University of Maryland Baltimore County. He received his Ph.D. in Computer Science from Cornell University in 2002, and his M.S. and B.S. in computer science from Fudan University, China. His research interests are in database systems and data mining, including privacy-preserving data mining, data exploration and navigation, health informatics, data integration, XML, automatic database administration, and database compression.