

Annotating Search Results From Web Databases

    Yiyao Lu, Hai He, Hongkun Zhao, Weiyi Meng, Clement Yu

Abstract: An increasing number of databases have become Web accessible through HTML form-based search interfaces. The data units returned from the underlying database are usually encoded into the result pages dynamically for human browsing. For the encoded data units to be machine processable, which is essential for many applications such as deep Web data collection and Internet comparison shopping, they need to be extracted out and assigned meaningful labels. In this paper, we present an automatic annotation approach that first aligns the data units on a result page into different groups such that the data in the same group have the same semantic. Then, for each group, we annotate it from different aspects and aggregate the different annotations to predict a final annotation label for it. An annotation wrapper for the search site is automatically constructed and can be used to annotate new result pages from the same Web database. Our experiments indicate that the proposed approach is highly effective.

Index Terms: Data alignment, data annotation, Web database, wrapper generation

1 INTRODUCTION

A large portion of the deep Web is database-based, i.e., for many search engines, the data encoded in the returned result pages come from underlying structured databases. Such search engines are often referred to as Web databases (WDB). A typical result page returned from a WDB has multiple search result records (SRRs). Each SRR contains multiple data units, each of which describes one aspect of a real-world entity. Fig. 1 shows three SRRs on a result page from a book WDB. Each SRR represents one book with several data units; e.g., the first book record in Fig. 1 has data units "Talking Back to the Machine: Computers and Human Aspiration", "Peter J. Denning", etc.

In this paper, a data unit is a piece of text that semantically represents one concept of an entity. It corresponds to the value of a record under an attribute. It is different from a text node, which refers to a sequence of text surrounded by a pair of HTML tags. Section 3.1 describes the relationships between text nodes and data units in detail. In this paper, we perform data unit level annotation.

There is a high demand for collecting data of interest from multiple WDBs. For example, once a book comparison shopping system collects multiple result records from different book sites, it needs to determine whether any two SRRs refer to the same book. The ISBNs can be compared to achieve this. If ISBNs are not available, their titles and authors could be compared. The system also needs to list the prices offered by each site.

• Y. Lu is with the Department of Computer Science, Binghamton University, Binghamton, NY 13902. E-mail: [email protected].

• H. He is with Morningstar Inc., Chicago, IL 60602. E-mail: [email protected].

• H. Zhao is with Bloomberg L.P., Princeton, NJ 08858. E-mail: [email protected].

• W. Meng is with the Department of Computer Science, Binghamton University, Binghamton, NY 13902. E-mail: [email protected].

• C. Yu is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL 60607. E-mail: [email protected].

Thus, the system needs to know the semantic of each data unit. Unfortunately, the semantic labels of data units are often not provided in result pages. For instance, in Fig. 1, no semantic labels for the values of title, author, publisher, etc. are given. Having semantic labels for data units is not only important for the above record linkage task, but also for storing collected SRRs into a database table (e.g., for deep Web crawlers [23]) for later analysis. Early applications required tremendous human effort to annotate data units manually, which severely limited their scalability. In this paper, we consider how to automatically assign labels to the data units within the SRRs returned from WDBs.

    (a). Original HTML page.

    (b). Simplified HTML source for the first SRR.

    Fig. 1. Example search results from Bookpool.com

Given a set of SRRs that have been extracted from a result page returned from a WDB, our automatic annotation


solution consists of three phases, as illustrated in Fig. 2. Let d_ij denote the data unit belonging to the ith SRR of concept j. The SRRs on a result page can be represented in a table format (Fig. 2(a)), with each row representing an SRR. Phase 1 is the alignment phase. In this phase, we first identify all data units in the SRRs and then organize them into different groups, with each group corresponding to a different concept (e.g., all titles are grouped together). Fig. 2(b) shows the result of this phase, with each column containing data units of the same concept across all SRRs. Grouping data units of the same semantic can help identify the common patterns and features among these data units. These common features are the basis of our annotators. In Phase 2 (the annotation phase), we introduce multiple basic annotators, each exploiting one type of features. Every basic annotator is used to produce a label for the units within its group holistically, and a probability model is adopted to determine the most appropriate label for each group. Fig. 2(c) shows that at the end of this phase, a semantic label L_j is assigned to each column. In Phase 3 (the annotation wrapper generation phase), as shown in Fig. 2(d), for each identified concept, we generate an annotation rule R_j that describes how to extract the data units of this concept in the result page and what the appropriate semantic label should be. The rules for all aligned groups, collectively, form the annotation wrapper for the corresponding WDB, which can be used to directly annotate the data retrieved from the same WDB in response to new queries without the need to perform the alignment and annotation phases again. As such, annotation wrappers can perform annotation quickly, which is essential for online applications.

    Fig. 2. Illustration of our three-phase annotation solution
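To make the data flow of Fig. 2 concrete, the following minimal Python sketch wires the three phases together. The function bodies are illustrative placeholders only (naive positional alignment and dummy labels); the actual algorithms are developed in Sections 4 through 6:

    from dataclasses import dataclass

    @dataclass
    class AnnotationRule:
        """Annotation rule R_j for one concept: a label plus extraction hints."""
        label: str
        column_index: int

    def align_data_units(srrs):
        """Phase 1 (alignment), sketched here as naive positional alignment;
        the real clustering-based shifting algorithm is given in Section 4."""
        width = max(len(srr) for srr in srrs)
        return [[srr[j] for srr in srrs if j < len(srr)] for j in range(width)]

    def annotate_groups(groups):
        """Phase 2 (annotation), a dummy stand-in for the six basic
        annotators and the probabilistic combiner of Section 5."""
        return ["attribute_%d" % j for j in range(len(groups))]

    def build_wrapper(labels):
        """Phase 3 (wrapper generation): one rule per concept (Section 6)."""
        return [AnnotationRule(label, j) for j, label in enumerate(labels)]

    srrs = [["Talking Back to the Machine", "Peter J. Denning", "$17.50"],
            ["The Time Machine", "H. G. Wells", "$18.95"]]
    wrapper = build_wrapper(annotate_groups(align_data_units(srrs)))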

This paper has the following contributions. (1) While most existing approaches simply assign labels to each HTML text node, we thoroughly analyze the relationships between text nodes and data units. We perform data unit level annotation. (2) We propose a clustering-based shifting technique to align data units into different groups so that the data units inside the same group have the same semantic. Instead of using only the DOM tree or other HTML tag tree structures of the SRRs to align the data units (like most current methods do), our approach also considers other important features shared among data units, such as their data types, text contents, presentation formats, and adjacency information. (3) We utilize the integrated interface schema (IIS) over multiple WDBs in the same domain to enhance data unit annotation. To the best of our knowledge, we are the first to utilize the IIS for annotating SRRs. (4) We employ six basic annotators; each annotator can independently assign labels to data units based on certain features of the data units. We also employ a probabilistic model to combine the results from different annotators into a single label. This model is highly flexible, so existing basic annotators may be modified and new annotators may be added easily without affecting the operation of the other annotators. (5) We construct an annotation wrapper for any given WDB. The wrapper can be applied to efficiently annotate the SRRs retrieved from the same WDB with new queries.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 analyzes the relationships between data units and text nodes, and describes the data unit features we use. Section 4 introduces our data alignment algorithm. Our multi-annotator approach is presented in Section 5. In Section 6, we propose our wrapper generation method. Section 7 reports our experimental results, and Section 8 concludes the paper.

2 RELATED WORK

Web information extraction and annotation has been an active research area in recent years. Many systems ([18], [20]) rely on human users to mark the desired information on sample pages and label the marked data at the same time; the system can then induce a series of rules (a wrapper) to extract the same set of information on web pages from the same source. These systems are often referred to as wrapper induction systems. Because of the supervised training and learning process, these systems can usually achieve high extraction accuracy. However, they suffer from poor scalability and are not suitable for applications ([24], [31]) that need to extract information from a large number of Web sources.

Embley et al. [8] utilize ontologies together with several heuristics to automatically extract data in multi-record documents and label them. However, ontologies for different domains must be constructed manually. Mukherjee et al. [25] exploit the presentation styles and the spatial locality of semantically related items, but their learning process for annotation is domain-dependent. Moreover, a seed of instances of semantic concepts in a set of HTML documents needs to be hand-labeled. These methods are not fully automatic.

[1], [5], and [21] are efforts to automatically construct wrappers, but the wrappers are used for data extraction only (not for annotation). We are aware of several works ([2], [28], [30], [36]) that aim at automatically assigning meaningful labels to the data units in SRRs. Arlotta et al. [2] basically annotate data units with the closest labels on result pages. This method has limited applicability because many WDBs do not encode data units with their labels on result pages. In the ODE system [28], ontologies are first constructed using query interfaces and result pages from WDBs in the same domain. The domain ontology is then used to assign labels to each data unit on the result page. After labeling, the data values with the same label are naturally aligned. This method is sensitive to the quality and completeness of the ontologies generated. DeLa [30]


first uses HTML tags to align data units by filling them into a table through a regular-expression-based data tree algorithm. Then it employs four heuristics to select a label for each aligned table column. The approach in [36] performs attribute extraction and labeling simultaneously. However, the label set is predefined and contains only a small number of values.

We align data units and annotate the ones within the same semantic group holistically. Data alignment is an important step in achieving accurate annotation, and it is also used in [25] and [30]. Most existing automatic data alignment techniques are based on one or very few features. The most frequently used feature is HTML tag paths [33]. The assumption is that the sub-trees corresponding to two data units in different SRRs but with the same concept usually have the same tag structure. However, this assumption is not always correct, as the tag tree is very sensitive to even minor differences, which may be caused by the need to emphasize certain data units or by erroneous coding. ViDIE [21] uses visual features on result pages to perform alignment, and it also generates an alignment wrapper. But its alignment is only at the text node level, not the data unit level. The method in [7] first splits each SRR into text segments. The most common number of segments is determined to be the number of aligned columns (attributes). The SRRs with more segments are then re-split using the common number. For each SRR with fewer segments than the common number, each segment is assigned to the most similar aligned column.

Our data alignment approach differs from previous works in the following aspects. First, our approach handles all types of relationships between text nodes and data units (see Section 3.1), while existing approaches consider only some of the types (i.e., one-to-one [25] or one-to-many ([7], [30])). Second, we use a variety of features together, including the ones used in existing approaches, while existing approaches use significantly fewer features (e.g., HTML tags in [30] and [33], visual features in [21]). All the features that we use can be automatically obtained from the result page and do not need any domain-specific ontology or knowledge. Third, we introduce a new clustering-based shifting algorithm to perform alignment.

Among all existing works, DeLa [30] is the most similar to ours. But our approach is significantly different from DeLa's approach. First, DeLa's alignment method is purely based on HTML tags, while ours uses other important features such as data type, text content, and adjacency information. Second, our method handles all types of relationships between text nodes and data units, whereas DeLa deals with only two of them (i.e., one-to-one and one-to-many). Third, DeLa and our approach utilize different search interfaces of WDBs for annotation. Ours uses an integrated interface schema (IIS) of multiple WDBs in the same domain, whereas DeLa uses only the local interface schema (LIS) of each individual WDB. Our analysis shows that utilizing IISs has several benefits, including significantly alleviating the local interface schema inadequacy problem and the inconsistent label problem (see Section 5.1 for more details). Fourth, we significantly enhanced DeLa's annotation method. Specifically, among the six basic annotators in our method, two (the schema value annotator and the frequency-based annotator) are new (i.e., not used in DeLa), three (the table annotator, query-based annotator, and common knowledge annotator) have better implementations than the corresponding annotation heuristics in DeLa, and one (the in-text prefix/suffix annotator) is the same as a heuristic in DeLa. For each of the three annotators that have different implementations, the specific difference and the motivation for using a different implementation will be given in Section 5.2. Furthermore, it is not clear how DeLa combines its annotation heuristics to produce a single label for an attribute, while we employ a probabilistic model to combine the results of different annotators. Finally, DeLa builds a wrapper for each WDB just for data unit extraction. In our approach, we construct an annotation wrapper describing the rules not only for extraction but also for assigning labels.

To enable fully automatic annotation, the result pages have to be automatically obtained and the SRRs need to be automatically extracted. In a metasearch context, result pages are retrieved by queries submitted by users (some reformatting may be needed when the queries are dispatched to individual WDBs). In the deep Web crawling context, result pages are retrieved by queries automatically generated by the deep Web crawler. We employ ViNTs [34] to extract SRRs from result pages in this work. Each SRR is stored in a tree structure with a single root, and each node in the tree corresponds to an HTML tag or a piece of text in the original page. With this structure, it becomes easy to locate each node in the original HTML page. The physical position information of each node on the rendered page, including its coordinates and area size, can also be obtained using ViNTs.

This paper is an extension of our previous work [22]. The following summarizes the main improvements of this paper over [22]. First, a significantly more comprehensive discussion of the relationships between text nodes and data units is provided. Specifically, this paper identifies four relationship types and provides an analysis of each type, while only two of the four types (i.e., one-to-one and one-to-many) were very briefly mentioned in [22]. Second, the alignment algorithm is significantly improved. A new step is added to handle the many-to-one relationship between text nodes and data units. In addition, a clustering-shift algorithm is introduced in this paper to explicitly handle the one-to-nothing relationship between text nodes and data units, while the previous version had a pure clustering algorithm. With these two improvements, the new alignment algorithm takes all four types of relationships into consideration. Third, the experiment section (Section 7) is significantly different from the previous version. The dataset used for experiments has been expanded by one domain (from 6 to 7) and by 22 WDBs (from 91 to 112). Moreover, the experiments on alignment and annotation have been re-done based on the new dataset and the improved alignment algorithm. Fourth, several related papers that were published recently have been reviewed and compared in this paper.


3 FUNDAMENTALS

3.1 Data Unit and Text Node

Each SRR extracted by ViNTs has a tag structure that determines how the contents of the SRR are displayed on a Web browser. Each node in such a tag structure is either a tag node or a text node. A tag node corresponds to an HTML tag surrounded by '<' and '>' in the HTML source, while a text node is the text outside the '<' and '>'. Text nodes are the visible elements on the web page, and data units are located in the text nodes. However, as we can see from Fig. 1, text nodes are not always identical to data units. Since our annotation is at the data unit level, we need to identify data units from text nodes.

Depending on how many data units a text node may contain, we identify the following four types of relationships between a data unit (U) and a text node (T):

• One-to-One Relationship (denoted as T = U). In this type, each text node contains exactly one data unit, i.e., the text of this node contains the value of a single attribute. This is the most frequently seen case. For example, in Fig. 1, each text node surrounded by the pair of tags <A> and </A> is a value of the Title attribute. We refer to such text nodes as atomic text nodes. An atomic text node is equivalent to a data unit.

• One-to-Many Relationship (denoted as T ⊃ U). In this type, multiple data units are encoded in one text node. For example, in Fig. 1, part of the second line of each SRR (e.g., "Springer-Verlag / 1999 / 0387984135 / 0.06667" in the first record) is a single text node. It consists of four semantic data units: Publisher, Publication Date, ISBN, and Relevance Score. Since the text of such a node can be considered a composition of the texts of multiple data units, we call it a composite text node. An important observation can be made: if the data units of attributes A1...Ak in one SRR are encoded as a composite text node, it is usually true that the data units of the same attributes in other SRRs returned by the same WDB are also encoded as composite text nodes, and those embedded data units always appear in the same order. This observation is valid in general because SRRs are generated by template programs. We need to split each composite text node to obtain the real data units and annotate them. Section 4 describes our splitting algorithm.

• Many-to-One Relationship (denoted as T ⊂ U). In this case, multiple text nodes together form a data unit. Fig. 3 shows an example of this case. The value of the Author attribute is contained in multiple text nodes, with each embedded inside a separate pair of (<A>, </A>) HTML tags. As another example, the tags <B> and </B> surrounding the keyword "Java" split the title string into 3 text nodes. It is a general practice that web page designers use special HTML tags to embellish certain information. [35] calls this kind of tags decorative tags, because they are used mainly for changing the appearance of part of the text nodes. For the purpose of extraction and annotation, we need to identify and remove these tags inside SRRs so that the wholeness of each split data unit can be restored. The first step of our alignment algorithm handles this case specifically (see Section 4 for details).

• One-to-Nothing Relationship (denoted as T ≠ U). The text nodes belonging to this category are not part of any data unit inside SRRs. For example, in Fig. 3, text nodes like "Author" and "Publisher" are not data units, but are instead the semantic labels describing the meanings of the corresponding data units. Further observations indicate that these text nodes are usually displayed in a certain pattern across all SRRs. Thus, we call them template text nodes. We employ a frequency-based annotator (see Section 5.2) to identify template text nodes.

    Fig. 3. An example illustrating the Many-to-One relationship

3.2 Data Unit and Text Node Features

We identify and use five common features shared by the data units belonging to the same concept across all SRRs, and all of them can be automatically obtained. It is not difficult to see that all these features are applicable to text nodes, including composite text nodes involving the same set of concepts, and template text nodes.

Data Content (DC): The data units or text nodes with the same concept often share certain keywords. This is true for two reasons. First, the data units corresponding to the search field where the user enters a search condition usually contain the search keywords. For example, in Fig. 1, the sample result page is returned for a search on the title field with keyword "machine". We can see that all the titles have this keyword. Second, web designers sometimes put a leading label in front of a certain data unit within the same text node to make it easier for users to understand the data. Text nodes that contain data units of the same concept usually have the same leading label. For example, in Fig. 1, the price of every book has the leading words "Our Price" in the same text node.

Presentation Style (PS): This feature describes how a data unit is displayed on a web page. It consists of 6 style features: font face, font size, font color, font weight, text decoration (underline, strike, etc.), and whether it is italic. Data units of the same concept in different SRRs are usually displayed in the same style. For example, in Fig. 1, all the availability information is displayed in the same font (in italic).

Data Type (DT): Each data unit has its own semantic type, although it is just a text string in the HTML code. The following basic data types are currently considered in our approach: Date, Time, Currency, Integer, Decimal, Percentage, Symbol, and String. The String type is further divided into All-Capitalized-String, First-Letter-Capitalized-String, and Ordinary String. The data type of a composite


text node is the concatenation of the data types of all its data units. For example, the data type of the text node "Premier Press / 2002 / 1931841616 / 0.06667" in Fig. 1 is <First-Letter-Capitalized-String, Integer, Integer, Decimal>. Consecutive terms with the same data type are treated as a single term, and only one of them will be kept. Each type except Ordinary String has certain pattern(s) so that it can be easily identified (a minimal sketch of such pattern-based recognition is given at the end of this subsection). The data units of the same concept, or text nodes involving the same set of concepts, usually have the same data type.

Tag Path (TP): A tag path of a text node is a sequence of tags traversing from the root of the SRR to the corresponding node in the tag tree. Since we use ViNTs for SRR extraction, we adopt the same tag path expression as in [34]. Each node in the expression contains two parts: one is the tag name, and the other is the direction indicating whether the next node is the next sibling (denoted as 'S') or the first child (denoted as 'C'). A text node itself is simply represented as <#text>. For example, in Fig. 1(b), the tag path of the text node "Springer-Verlag / 1999 / 0387984135 / 0.06667" is a sequence of tag names interleaved with the directions C, C, S, S, C, C (the tag names themselves appear in Fig. 1(b)). An observation is that the text nodes with the same set of concepts have very similar tag paths, though in many cases not exactly the same.

Adjacency (AD): For a given data unit d in an SRR, let p_d and s_d denote the data units immediately before and after d in the SRR, respectively. We refer to p_d and s_d as the preceding and succeeding data units of d, respectively. Consider two data units d1 and d2 from two different SRRs. It can be observed that if p_d1 and p_d2 belong to the same concept and/or s_d1 and s_d2 belong to the same concept, then it is more likely that d1 and d2 also belong to the same concept.

We note that none of the above five features is guaranteed to hold for any particular pair of data units (or text nodes) with the same concept. For example, in Fig. 1, "Springer-Verlag" and "McGraw-Hill" are both publishers, but they do not share any content words. However, such data units usually share some other features. As a result, our alignment algorithm (Section 4) can still work well even in the presence of some violations of these features. This is confirmed by our experimental results in Section 7.
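As a concrete illustration of the Data Type feature (the sketch promised above), the following Python fragment recognizes the basic types with regular expressions and collapses consecutive duplicates. The paper does not list its exact patterns, so these regexes are assumptions:

    import re

    # Illustrative patterns only; the paper does not specify its exact ones.
    TYPE_PATTERNS = [
        ("Date", re.compile(r"^\d{1,2}/\d{1,2}/\d{2,4}$")),
        ("Time", re.compile(r"^\d{1,2}:\d{2}(:\d{2})?$")),
        ("Currency", re.compile(r"^\$\d+(\.\d+)?$")),
        ("Percentage", re.compile(r"^\d+(\.\d+)?%$")),
        ("Integer", re.compile(r"^\d+$")),
        ("Decimal", re.compile(r"^\d+\.\d+$")),
        ("All-Capitalized-String", re.compile(r"^[A-Z][A-Z .-]*$")),
        ("First-Letter-Capitalized-String", re.compile(r"^[A-Z][a-z].*$")),
    ]

    def data_type(term):
        """Return the first matching basic type, else Ordinary String."""
        for name, pattern in TYPE_PATTERNS:
            if pattern.match(term.strip()):
                return name
        return "Ordinary String"

    def type_sequence(units):
        """Concatenate the types of a composite node's units, collapsing
        consecutive duplicates as described in Section 3.2."""
        types = [data_type(u) for u in units]
        return [t for i, t in enumerate(types) if i == 0 or t != types[i - 1]]

    # -> ['First-Letter-Capitalized-String', 'Integer', 'Decimal']
    print(type_sequence(["Premier Press", "2002", "1931841616", "0.06667"]))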

4 DATA ALIGNMENT

4.1 Data Unit Similarity

The purpose of data alignment is to put the data units of the same concept into one group so that they can be annotated holistically. Whether two data units belong to the same concept is determined by how similar they are based on the features described in Section 3.2. In this paper, the similarity between two data units (or two text nodes) d1 and d2 is a weighted sum of the similarities of the five features between them, i.e.:

Sim(d_1, d_2) = w_1 \cdot SimC(d_1, d_2) + w_2 \cdot SimP(d_1, d_2) + w_3 \cdot SimD(d_1, d_2) + w_4 \cdot SimT(d_1, d_2) + w_5 \cdot SimA(d_1, d_2)    (1)

The weights in the above formula are obtained using a genetic algorithm based method [10], and the trained weights are given in Section 7.2. The similarity for each individual feature is defined as follows:

• Data content similarity (SimC): It is the Cosine similarity [27] between the term frequency vectors of d1 and d2:

SimC(d_1, d_2) = \frac{V_{d_1} \cdot V_{d_2}}{\lVert V_{d_1} \rVert \, \lVert V_{d_2} \rVert}    (2)

where V_d is the frequency vector of the terms inside data unit d, ||V_d|| is the length of V_d, and the numerator is the inner product of the two vectors.

• Presentation style similarity (SimP): It is the average of the style feature scores (FS) over all 6 presentation style features (F) between d1 and d2:

SimP(d_1, d_2) = \frac{1}{6} \sum_{i=1}^{6} FS_i    (3)

where FS_i is the score of the ith style feature, defined by FS_i = 1 if F_i^{d_1} = F_i^{d_2} and FS_i = 0 otherwise, and F_i^d is the ith style feature of data unit d.

• Data type similarity (SimD): It is determined by the common sequence of the component data types between two data units. The longest common sequence (LCS) cannot be longer than the number of component data types in these two data units. Thus, let t1 and t2 be the sequences of the data types of d1 and d2, respectively, and let TLen(t) represent the number of component types of data type t; the data type similarity between data units d1 and d2 is

SimD(d_1, d_2) = \frac{LCS(t_1, t_2)}{\max(TLen(t_1), TLen(t_2))}    (4)

• Tag path similarity (SimT): This is based on the edit distance between the tag paths of two data units. The edit distance (EDT) here refers to the number of insertions and deletions of tags needed to transform one tag path into the other. It can be seen that the maximum number of possible operations needed is the total number of tags in the two tag paths. Let p1 and p2 be the tag paths of d1 and d2, respectively, and let PLen(p) denote the number of tags in tag path p; the tag path similarity between d1 and d2 is

SimT(d_1, d_2) = 1 - \frac{EDT(p_1, p_2)}{PLen(p_1) + PLen(p_2)}    (5)

Note that in our edit distance calculation, a substitution is considered a deletion followed by an insertion, requiring two operations. The rationale is that two attributes of the same concept tend to be encoded in the same sub-tree in the DOM (relative to the root of their SRRs), even though some decorative tags may appear in one SRR but not in the other. For example, consider two pairs of tag paths (<T1><T2><T3>, <T1><T3>) and (<T1><T2><T3>, <T1><T4><T3>). The two tag paths in the first pair are more likely to point to the


attributes of the same concept, as T2 might be a decorative tag. Based on our method, the first pair has edit distance 1 (insertion of T2) while the second pair has edit distance 2 (deletion of T2 plus insertion of T4). In other words, the first pair has a higher similarity.

• Adjacency similarity (SimA): The adjacency similarity between two data units d1 and d2 is the average of the similarity between p_d1 and p_d2 and the similarity between s_d1 and s_d2, that is

SimA(d_1, d_2) = \frac{1}{2} \left( Sim'(p_{d_1}, p_{d_2}) + Sim'(s_{d_1}, s_{d_2}) \right)    (6)

When computing the similarities (Sim') between the preceding/succeeding units, only the first four features are used, and the weight for the adjacency feature (w5) is proportionally distributed to the other four weights. Our alignment algorithm also needs the similarity between two data unit groups, where each group is a collection of data units. We define the similarity between groups G1 and G2 to be the average of the similarities between every data unit in G1 and every data unit in G2.
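Eqs. (1)-(6) translate fairly directly into code. The sketch below is a minimal Python rendering under a few assumptions: each data unit carries pre-extracted features, the five weights are uniform placeholders (the trained values appear in Section 7.2), and both LCS and the insert/delete-only edit distance use standard dynamic programming (for insert/delete-only operations, EDT(p1, p2) = PLen(p1) + PLen(p2) - 2*LCS(p1, p2)):

    import math
    from dataclasses import dataclass
    from collections import Counter

    @dataclass
    class DataUnit:
        text: str
        style: tuple                 # the 6 presentation-style features
        types: list                  # sequence of component data types
        tag_path: list               # sequence of (tag, direction) pairs
        preceding: "DataUnit | None" = None
        succeeding: "DataUnit | None" = None

    # Uniform placeholder weights; the trained weights come from [10].
    W = (0.2, 0.2, 0.2, 0.2, 0.2)

    def sim_c(d1, d2):
        """Eq. (2): cosine similarity of term-frequency vectors."""
        v1, v2 = Counter(d1.text.lower().split()), Counter(d2.text.lower().split())
        dot = sum(v1[t] * v2[t] for t in v1)
        n1 = math.sqrt(sum(c * c for c in v1.values()))
        n2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def sim_p(d1, d2):
        """Eq. (3): fraction of the 6 style features that agree."""
        return sum(a == b for a, b in zip(d1.style, d2.style)) / 6

    def lcs_len(s1, s2):
        """Longest common subsequence length (standard DP)."""
        m = [[0] * (len(s2) + 1) for _ in range(len(s1) + 1)]
        for i, a in enumerate(s1):
            for j, b in enumerate(s2):
                m[i + 1][j + 1] = m[i][j] + 1 if a == b else max(m[i][j + 1], m[i + 1][j])
        return m[-1][-1]

    def sim_d(d1, d2):
        """Eq. (4): LCS of the type sequences over the longer length."""
        if not d1.types or not d2.types:
            return 0.0
        return lcs_len(d1.types, d2.types) / max(len(d1.types), len(d2.types))

    def sim_t(d1, d2):
        """Eq. (5): insert/delete-only edit distance via LCS."""
        p1, p2 = d1.tag_path, d2.tag_path
        if not p1 and not p2:
            return 1.0
        edt = len(p1) + len(p2) - 2 * lcs_len(p1, p2)
        return 1 - edt / (len(p1) + len(p2))

    def sim_prime(d1, d2):
        """Sim' over the first four features, with w5 redistributed."""
        scale = 1 / (1 - W[4]) if W[4] < 1 else 1.0
        feats = (sim_c(d1, d2), sim_p(d1, d2), sim_d(d1, d2), sim_t(d1, d2))
        return scale * sum(w * f for w, f in zip(W[:4], feats))

    def sim_a(d1, d2):
        """Eq. (6): average of preceding and succeeding similarities."""
        pre = sim_prime(d1.preceding, d2.preceding) if d1.preceding and d2.preceding else 0.0
        suc = sim_prime(d1.succeeding, d2.succeeding) if d1.succeeding and d2.succeeding else 0.0
        return (pre + suc) / 2

    def sim(d1, d2):
        """Eq. (1): weighted sum of the five feature similarities."""
        feats = (sim_c(d1, d2), sim_p(d1, d2), sim_d(d1, d2), sim_t(d1, d2), sim_a(d1, d2))
        return sum(w * f for w, f in zip(W, feats))

    def group_sim(g1, g2):
        """Average pairwise similarity between the units of two groups."""
        return sum(sim(a, b) for a in g1 for b in g2) / (len(g1) * len(g2))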

4.2 Alignment Algorithm

Our data alignment algorithm is based on the assumption that attributes appear in the same order across all SRRs on the same result page, although the SRRs may contain different sets of attributes (due to missing values). This is true in general because the SRRs from the same WDB are normally generated by the same template program. Thus, we can conceptually consider the SRRs on a result page in a table format where each row represents one SRR and each cell holds a data unit (or is empty if the data unit is not available). Each table column, in our work, is referred to as an alignment group, containing at most one data unit from each SRR. If an alignment group contains all the data units of one concept and no data unit from any other concept, we call this group well-aligned. The goal of alignment is to move the data units in the table so that every alignment group is well-aligned, while the order of the data units within every SRR is preserved.

Our data alignment method consists of the following four steps. The details of each step will be provided later.

Step 1: Merge text nodes. This step detects and removes decorative tags from each SRR to allow the text nodes corresponding to the same attribute (separated by decorative tags) to be merged into a single text node.

Step 2: Align text nodes. This step aligns text nodes into groups so that eventually each group contains the text nodes with the same concept (for atomic nodes) or the same set of concepts (for composite nodes).

Step 3: Split (composite) text nodes. This step aims to split the values in composite text nodes into individual data units. It is carried out based on the text nodes in the same group holistically. A group whose values need to be split is called a composite group.

Step 4: Align data units. This step separates each composite group into multiple aligned groups, with each containing the data units of the same concept.

As we discussed in Section 3.1, the Many-to-One relationship between text nodes and data units usually occurs because of decorative tags. We need to remove them to restore the integrity of data units. In Step 1, we use a modified version of the method in [35] to detect the decorative tags. For every HTML tag, statistical scores of a set of predefined features are collected across all SRRs, including the distance to its leaf descendants, the number of occurrences, and the first and last occurring positions in every SRR, etc. Each individual feature score is normalized between 0 and 1, and all normalized feature scores are then averaged into a single score s. A tag is identified as a decorative tag if s ≥ P (P = 0.5 is used in this work, following [35]). To remove decorative tags, we do a breadth-first traversal over the DOM tree of each SRR. If the traversed node is identified as a decorative tag, its immediate child nodes are moved up as the right siblings of this node, and then the node is deleted from the tree.

Fig. 4. Alignment algorithm (pseudocode of the procedures ALIGN and CLUSTERING)

In Step 2, as shown in ALIGN in Fig. 4, text nodes are initially aligned into alignment groups based on their


positions within SRRs so that group G_j contains the jth text node from each SRR (lines 3-4). Since a particular SRR may have no value(s) for certain attribute(s) (e.g., a book would not have a discount price if it is not on sale), G_j may contain elements of different concepts. We apply the agglomerative clustering algorithm [17] to cluster the text nodes inside this group (line 7 and CLUSTERING). Initially, each text node forms a separate cluster of its own. We then repeatedly merge the two clusters that have the highest similarity value until no two clusters have similarity above a threshold T. After clustering, we obtain a set of clusters V, and each cluster contains elements of the same concept only.

Recall that attributes are assumed to be encoded in the same order across all SRRs. Suppose an attribute A_j is missing in an SRR. Since we initially align by position, the element of attribute A_k, where k > j, is put into group G_j. This element must have certain commonalities with some other elements of the same concept in the groups following G_j, which should be reflected by higher similarity values. Thus, if we get multiple clusters from the above step, the one having the least similarity (V[c]) with the groups following G_j should belong to attribute A_j (lines 9-13). Finally in Step 2, for each element in the clusters other than V[c], we shift it (and all elements after it) to the next group. Lines 14-16 show that we achieve this by inserting a NIL element at position j in the corresponding SRRs. This process is repeated for position j+1 (line 17) until all the text nodes are considered (lines 5-6).

Example 1. In Fig. 5, after initial alignment, there are three alignment groups. The first group G1 is clustered into two clusters {{a1, b1}, {c1}}. Suppose {a1, b1} is the least similar to G2 and G3; we then shift c1 one position to the right. The figure on the right depicts all groups after shifting.

    Fig. 5. An example illustrating Step 2 of the alignment algorithm.
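A compact sketch of this clustering-based shifting step follows. It assumes a pairwise sim(a, b) function such as Eq. (1) and a clustering threshold T, and it averages pairwise similarities for cluster-to-group comparison as defined in Section 4.1. This is an illustrative rendering of the prose above, not the paper's exact pseudocode:

    NIL = None
    T = 0.6  # assumed clustering threshold; the paper sets its own value

    def cluster(items, sim, threshold):
        """Agglomerative clustering: repeatedly merge the two most
        similar clusters until no pair exceeds the threshold."""
        clusters = [[x] for x in items if x is not NIL]
        def csim(c1, c2):
            return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))
        while len(clusters) > 1:
            i, j = max(((i, j) for i in range(len(clusters))
                        for j in range(i + 1, len(clusters))),
                       key=lambda p: csim(clusters[p[0]], clusters[p[1]]))
            if csim(clusters[i], clusters[j]) <= threshold:
                break
            clusters[i] += clusters.pop(j)
        return clusters

    def align(srrs, sim):
        """Step 2: positional alignment plus clustering-based shifting.
        srrs: list of lists of text nodes, padded in place with NIL."""
        j = 0
        while True:
            group = [srr[j] if j < len(srr) else NIL for srr in srrs]
            if all(x is NIL for x in group):
                break
            clusters = cluster(group, sim, T)
            if len(clusters) > 1:
                # Elements of the groups following position j, in all SRRs.
                following = [x for srr in srrs for x in srr[j + 1:]]
                def to_following(c):
                    if not following:
                        return 0.0
                    return sum(sim(a, b) for a in c for b in following) / (len(c) * len(following))
                stay = min(clusters, key=to_following)  # V[c]: least similar to what follows
                for srr in srrs:
                    if j < len(srr) and srr[j] is not NIL and srr[j] not in stay:
                        srr.insert(j, NIL)  # shift this element (and the rest) right
            j += 1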

After the text nodes are grouped using the above procedure, we need to determine whether a group needs to be further split to obtain the actual data units (Step 3). First, we identify groups whose text nodes are not splittable. Each such group satisfies one of the following conditions: (a) each of its text nodes is a hyperlink (a hyperlink is assumed to already represent a single semantic unit); (b) the texts of all the nodes in the group are the same; (c) all the nodes in the group have the same non-string data type; or (d) the group is not a merged group from Step 1. Next, we try to split each group that does not satisfy any of the above conditions. To do the splitting correctly, we need to identify the right separators. We observe that the same separator is usually used to separate the data units of a given pair of attributes across all SRRs retrieved from the same WDB, although it is possible that different separators are used for data units of different attribute pairs. Based on this observation, in this step, we first scan the text strings of every text node in the group and select the symbol(s) (non-letter, non-digit, non-currency, non-parenthesis, and non-hyphen) that occur in most strings in consistent positions. Second, for each text node in the group, its text is split into several small pieces using the separator(s), each of which becomes a real data unit. As an example, in "Springer-Verlag / 1999 / 0387984135 / 0.06667", '/' is the separator, and it splits the composite text node into four data units.
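The separator selection can be sketched as follows. The candidate filter mirrors the constraints above (the decimal point is additionally excluded here to protect numeric values, which is my simplification), and the "most strings, consistent positions" test is reduced to a majority count:

    import re

    # Exclude letters, digits, whitespace, currency, parentheses, hyphens,
    # and (as a simplification) the decimal point.
    CANDIDATE = re.compile(r"[^\w\s$()\-.]")

    def find_separator(texts):
        """Pick the candidate symbol occurring in the most group strings."""
        counts = {}
        for text in texts:
            for symbol in set(CANDIDATE.findall(text)):
                counts[symbol] = counts.get(symbol, 0) + 1
        if not counts:
            return None
        best = max(counts, key=counts.get)
        return best if counts[best] > len(texts) / 2 else None

    def split_group(texts):
        """Split every composite text node on the chosen separator."""
        sep = find_separator(texts)
        if sep is None:
            return [[t] for t in texts]
        return [[piece.strip() for piece in t.split(sep)] for t in texts]

    # '/' wins and splits each composite node into four data units:
    print(split_group(["Springer-Verlag / 1999 / 0387984135 / 0.06667",
                       "Premier Press / 2002 / 1931841616 / 0.06667"]))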

The data units in a composite group are not always aligned after splitting, because some attributes may have missing values in the composite text node. Our solution is to apply the same alignment algorithm as in Step 2 here, i.e., initially align based on each data unit's natural position and then apply the clustering-based shifting method. The only difference is that, in Step 4, since all data units to be aligned are split from the same composite text node, they share the same presentation style and tag path. Thus, in this case, these two features are not used for calculating the similarity for aligning data units. Their feature weights are proportionally distributed to the three features used.

DeLa [30] also detects the separators inside composite text nodes and uses them to separate the data units. However, in DeLa, the separated pieces are simply aligned by their natural order, and optional attribute values are not considered.

5 ASSIGNING LABELS

5.1 Local vs. Integrated Interface Schemas

For a WDB, its search interface often contains some attributes of the underlying data. We denote a local interface schema (LIS) as S_i = {A_1, A_2, ..., A_k}, where each A_j is an attribute. When a query is submitted against the search interface, the entities in the returned results also have a certain hidden schema, denoted as S_e = {a_1, a_2, ..., a_n}, where each a_j (j = 1, ..., n) is an attribute to be discovered. The schema of the retrieved data and the LIS usually share a significant number of attributes [29]. This observation provides the basis for some of our basic annotators (see Section 5.2). If an attribute a_t in the search results has a matched attribute A_t in the LIS, all the data units identified with a_t can be labeled by the name of A_t.

However, it is quite often the case that S_e is not entirely contained in S_i, because some attributes of the underlying database are not suitable or needed for specifying query conditions, as determined by the developer of the WDB, and these attributes are not included in S_i. This phenomenon raises a problem called the local interface schema inadequacy problem. Specifically, it is possible that a hidden attribute discovered in the search result schema S_e does not have a matching attribute A_t in the LIS S_i. In this case, there will be no label in the search interface that can be assigned to the discovered data units of this attribute.

Another potential problem associated with using LISs for annotation is the inconsistent label problem, i.e., different labels are assigned to semantically identical data units returned from different WDBs, because different LISs may


give different names to the same attribute. This can cause problems when using the annotated data collected from different WDBs, e.g., for data integration applications.

In our approach, for each domain used, we use WISE-Integrator [14] to build an integrated interface schema (IIS) over multiple WDBs in that domain. The generated IIS combines all the attributes of the LISs. For matched attributes from different LISs, their values in the local interfaces (e.g., values in a selection list) are combined as the values of the integrated global attribute [14]. Each global attribute has a unique global name, and an attribute-mapping table is created to establish the mapping between the name of each LIS attribute and its corresponding name in the IIS. In this paper, for an attribute A on a LIS, we use gn(A) to denote the name of A's corresponding attribute (i.e., the global attribute) in the IIS.

For each WDB in a given domain, our annotation method uses both the LIS of the WDB and the IIS of the domain to annotate the data retrieved from this WDB. Using IISs has two major advantages. First, it has the potential to increase the annotation recall. Since the IIS contains the attributes of all the LISs, there is a better chance that an attribute discovered from the returned results has a matching attribute in the IIS even though it has no matching attribute in the LIS. Second, when an annotator discovers a label for a group of data units, the label will be replaced by its corresponding global attribute name (if any) in the IIS by looking up the attribute-mapping table, so that the data units of the same concept across different WDBs will have the same label.

We should point out that even though using the IIS can significantly alleviate both the local interface schema inadequacy problem and the inconsistent label problem, it cannot solve them completely. For the first problem, it is still possible that some attributes of the underlying entities do not appear in any local interface, and as a result, such attributes will not appear in the IIS. As to the second problem, for example, in Fig. 1, "$17.50" is annotated by "Our Price", but at another site a price may be displayed as "You Pay: $50.50" (i.e., "$50.50" is annotated by "You Pay"). If one or more of these annotations are not local attribute names in the attribute-mapping table for this domain, then using the IIS cannot solve the problem, and new techniques are needed.

5.2 Basic Annotators

In a returned result page containing multiple SRRs, the data units corresponding to the same concept (attribute) often share special common features, and such common features are usually associated with the data units on the result page in certain patterns. Based on this observation, we define six basic annotators to label data units, with each of them considering a special type of patterns/features. Four of these annotators (the table annotator, query-based annotator, in-text prefix/suffix annotator, and common knowledge annotator) are similar to the annotation heuristics used by DeLa [30], but we have different implementations for three of them (the table annotator, query-based annotator, and common knowledge annotator). Details of the differences and the advantages of our implementations for these three annotators will be provided when these annotators are introduced below.

Table Annotator (TA)

Many WDBs use a table to organize the returned SRRs. In the table, each row represents an SRR. The table header, which indicates the meaning of each column, is usually located at the top of the table. Fig. 6 shows an example of SRRs presented in a table format. Usually, the data units of the same concept are well aligned with their corresponding column header. This special feature of the table layout can be utilized to annotate the SRRs.

    Fig. 6. SRRs in table format

Since the physical position information of each data unit is obtained during SRR extraction, we can utilize this information to associate each data unit with its corresponding header. Our Table Annotator works as follows. First, it identifies all the column headers of the table. Second, for each SRR, it takes a data unit in a cell and selects the column header whose area (determined by coordinates) has the maximum vertical overlap (i.e., based on the x-axis) with the cell. This unit is then assigned to this column header and labeled by the header text A (actually by its corresponding global name gn(A) if gn(A) exists). The remaining data units are processed similarly. In case the table header is not provided or is not successfully extracted by ViNTs [34], the Table Annotator will not be applied.
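The header association step reduces to an interval-overlap computation over the x-axis coordinates that ViNTs provides. The Box type and its field names below are illustrative assumptions:

    from dataclasses import dataclass

    @dataclass
    class Box:
        left: float   # x-coordinate of the left edge on the rendered page
        right: float  # x-coordinate of the right edge

    def x_overlap(a, b):
        """Length of the overlap between two boxes along the x-axis."""
        return max(0.0, min(a.right, b.right) - max(a.left, b.left))

    def best_header(cell, headers):
        """Label a cell with the column header that overlaps it most.
        headers: list of (header_text, Box) pairs."""
        return max(headers, key=lambda h: x_overlap(cell, h[1]))[0]

    headers = [("Title", Box(0, 200)), ("Author", Box(210, 330)), ("Price", Box(340, 400))]
    print(best_header(Box(215, 320), headers))  # -> "Author"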

DeLa [30] also searches for table header texts as the possible labels for the data units listed in table format. However, DeLa relies only on the HTML tags <TH> and <THEAD> for this purpose. But many HTML tables do not use <TH> or <THEAD> to encode their headers, which limits the applicability of DeLa's approach. In the test dataset we collected, 11 WDBs have SRRs in table format, but only 3 use <TH> or <THEAD>. In contrast, our table annotator does not have this limitation.

Query-based Annotator (QA)

The basic idea of this annotator is that the returned SRRs from a WDB are always related to the specified query. Specifically, the query terms entered in the search attributes on the local search interface of the WDB will most likely appear in some retrieved SRRs. For example, in Fig. 1, the query term "machine" is submitted through the Title field on the search interface of the WDB, and all three titles of the returned SRRs contain this query term. Thus, we can use the name of the search field, Title, to annotate the title values of these SRRs. In general, query terms against an attribute may be entered in a textbox or chosen from a selection list on the local search interface.

Our Query-based Annotator works as follows. Given a query with a set of query terms submitted against an attribute A on the local search interface, first find the


group that has the largest total occurrences of these query terms, and then assign gn(A) as the label to the group.
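A minimal sketch of this selection, assuming aligned groups of data-unit strings and a set of query terms:

    def query_based_label(groups, query_terms, global_name):
        """Assign gn(A) to the group with the most query-term occurrences.
        groups: dict mapping group id -> list of data-unit strings."""
        def occurrences(units):
            return sum(unit.lower().count(term.lower())
                       for unit in units for term in query_terms)
        best = max(groups, key=lambda g: occurrences(groups[g]))
        return {best: global_name} if occurrences(groups[best]) > 0 else {}

    groups = {0: ["Talking Back to the Machine", "The Time Machine"],
              1: ["$17.50", "$18.95"]}
    print(query_based_label(groups, ["machine"], "Title"))  # -> {0: 'Title'}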

As mentioned in Section 5.1, the LIS of a WDB usually does not have all the attributes of the underlying database. As a result, the query-based annotator by itself cannot completely annotate the SRRs.

DeLa [30] also uses query terms to match the data unit texts and uses the name of the queried form element as the label. However, DeLa uses only local schema element names, not element names in the IIS.

Schema Value Annotator (SA)

Many attributes on a search interface have pre-defined values on the interface. For example, the attribute Publishers may have a set of pre-defined values (i.e., publishers) in its selection list. More attributes in the IIS tend to have pre-defined values, and these attributes are likely to have more such values than those on LISs, because when attributes from multiple interfaces are integrated, their values are also combined [14]. Our schema value annotator utilizes the combined value set to perform annotation.

Given a group of data units G_i = {d_1, ..., d_n}, the schema value annotator discovers the best-matched attribute to the group from the IIS. Let A_j be an attribute containing a list of values {v_1, ..., v_m} on the IIS. For each data unit d_k, this annotator first computes the Cosine similarities between d_k and all values in A_j to find the value (say v_t) with the highest similarity. Then the data fusion function CombMNZ [19] is applied to the similarities for all the data units. More specifically, the annotator sums up the similarities and multiplies the sum by the number of non-zero similarities. This final value is treated as the matching score between G_i and A_j.

The schema value annotator first identifies the attribute A_j that has the highest matching score among all attributes and then uses gn(A_j) to annotate the group G_i. Note that multiplying the above sum by the number of non-zero similarities gives preference to attributes that have more matches (i.e., having non-zero similarities) over those that have fewer matches. This has been found to be very effective in improving the retrieval effectiveness of combination systems in information retrieval [4].
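Under the definitions above, the CombMNZ-style matching score can be sketched as follows, with cosine_sim standing in for the Cosine similarity of Eq. (2) applied to two strings:

    import math
    from collections import Counter

    def cosine_sim(s1, s2):
        """Cosine similarity between term-frequency vectors of two strings."""
        v1, v2 = Counter(s1.lower().split()), Counter(s2.lower().split())
        dot = sum(v1[t] * v2[t] for t in v1)
        n1 = math.sqrt(sum(c * c for c in v1.values()))
        n2 = math.sqrt(sum(c * c for c in v2.values()))
        return dot / (n1 * n2) if n1 and n2 else 0.0

    def matching_score(group, attribute_values):
        """CombMNZ: sum of each unit's best similarity to the attribute's
        value list, multiplied by the number of non-zero matches."""
        best = [max(cosine_sim(d, v) for v in attribute_values) for d in group]
        nonzero = sum(1 for s in best if s > 0)
        return sum(best) * nonzero

    def schema_value_label(group, iis_attributes):
        """Pick the IIS attribute with the highest matching score.
        iis_attributes: dict mapping global attribute name -> value list."""
        return max(iis_attributes, key=lambda a: matching_score(group, iis_attributes[a]))

    iis = {"Publisher": ["Springer-Verlag", "McGraw-Hill", "Premier Press"],
           "Format": ["Hardcover", "Paperback"]}
    print(schema_value_label(["Springer-Verlag", "Premier Press"], iis))  # -> Publisher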

Frequency-based Annotator (FA)

In Fig. 1, "Our Price" appears in all three records, while the price values that follow it are all different. In other words, the adjacent units have different occurrence frequencies. As argued in [1], the data units with the higher frequency are likely to be attribute names, as part of the template program for generating records, while the data units with the lower frequency most probably come from the database as embedded values. Following this argument, "Our Price" can be recognized as the label of the value immediately following it. The phenomenon described in this example is widely observable on result pages returned by many WDBs, and our frequency-based annotator is designed to exploit it.

Consider a group G_i whose data units have a lower frequency. The frequency-based annotator intends to find common preceding units shared by all the data units of the group G_i. This can be easily done by following their preceding chains recursively until the encountered data units are different. All found preceding units are concatenated to form the label for the group G_i.

Example 2. In Fig. 1, during the data alignment step, a group is formed for {"$17.50", "$18.95", "$20.50"}. Clearly the data units in this group have different values. These values share the same preceding unit "Our Price", which occurs in all SRRs. Furthermore, "Our Price" does not have preceding data units because it is the first unit in this line. Therefore, the frequency-based annotator will assign the label "Our Price" to this group.
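A sketch of the preceding-chain walk, assuming each data unit records its preceding unit and text (as in the DataUnit structure of the earlier similarity sketch); the lockstep traversal is one reading of the prose above:

    def frequency_based_label(group):
        """Walk the preceding chains of all units in the group in lockstep,
        collecting units whose text is identical across all SRRs (template
        text), and concatenate them into a label."""
        label_parts = []
        current = [unit.preceding for unit in group]
        while all(current) and len({u.text for u in current}) == 1:
            label_parts.insert(0, current[0].text)  # same template text everywhere
            current = [u.preceding for u in current]
        return " ".join(label_parts) or None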

In-text Prefix/Suffix Annotator (IA)

In some cases, a piece of data is encoded together with its label to form a single unit without any obvious separator between the label and the value; the node contains both of them. Such nodes may occur in all or multiple SRRs. After data alignment, all such nodes would be aligned together to form a group. For example, in Fig. 1, after alignment, one group may contain three data units: {"You Save $9.50", "You Save $11.04", "You Save $4.45"}.

The in-text prefix/suffix annotator checks whether all data units in the aligned group share the same prefix or suffix. If the same prefix is confirmed and it is not a delimiter, then it is removed from all the data units in the group and is used as the label to annotate the values following it. If the same suffix is identified and the number of data units having the same suffix matches the number of data units inside the next group, the suffix is used to annotate the data units inside the next group. In the above example, the label "You Save" will be assigned to the group of prices. Any group whose data unit texts are completely identical is not considered by this annotator.
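The shared-prefix check can be sketched with the standard-library commonprefix helper; trimming the prefix back to a word boundary is my own simplification:

    import os

    def in_text_prefix_label(group_texts):
        """If all units share a non-trivial prefix, strip it and use it as the label."""
        prefix = os.path.commonprefix(group_texts)
        prefix = prefix.rsplit(" ", 1)[0] if " " in prefix else ""  # cut at a word boundary
        if not prefix.strip():
            return None, group_texts
        values = [t[len(prefix):].strip() for t in group_texts]
        return prefix.strip(), values

    label, values = in_text_prefix_label(["You Save $9.50", "You Save $11.04", "You Save $4.45"])
    print(label, values)  # -> You Save ['$9.50', '$11.04', '$4.45']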

Common Knowledge Annotator (CA)

Some data units on the result page are self-explanatory because of the common knowledge shared by human beings. For example, "in stock" and "out of stock" occur in many SRRs from e-commerce sites. Human users understand that this is about the availability of the product because it is common knowledge. So our common knowledge annotator tries to exploit this situation by using some predefined common concepts.

Each common concept contains a label and a set of patterns or values. For example, a country concept has the label "country" and a set of values such as "U.S.A.", "Canada", and so on. As another example, the email address concept (assuming all lower case) has the pattern [a-z0-9._%+-]+@([a-z0-9-]+\.)+[a-z]{2,4}. Given a group of data units from the alignment step, if all the data units match the pattern or values of a concept, the label of this concept is assigned to the data units of this group.
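A sketch of concept matching, using the email pattern quoted above plus an illustrative (incomplete) country value set:

    import re

    # Each concept: label -> either a compiled pattern or a set of known values.
    CONCEPTS = {
        "email": re.compile(r"^[a-z0-9._%+-]+@([a-z0-9-]+\.)+[a-z]{2,4}$"),
        "country": {"u.s.a.", "canada", "united kingdom"},  # illustrative subset
    }

    def common_knowledge_label(group_texts):
        """Return a concept label only if every unit in the group matches it."""
        for label, matcher in CONCEPTS.items():
            if isinstance(matcher, set):
                if all(t.strip().lower() in matcher for t in group_texts):
                    return label
            elif all(matcher.match(t.strip().lower()) for t in group_texts):
                return label
        return None

    print(common_knowledge_label(["[email protected]", "[email protected]"]))  # -> email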

DeLa [30] also uses some conventions to annotate data units. However, it considers only certain patterns. Our common knowledge annotator considers both patterns and certain value sets, such as the set of countries.

It should be pointed out that our common concepts are different from the ontologies that are widely used in some works on the semantic Web (e.g., [6], [11], [12], [16], [26]). First, our common concepts are domain independent. Second,


    they can be obtained from existing information resourceswith little additional human effort.

5.3 Combining Annotators

Our analysis indicates that no single annotator is capable of fully labeling all the data units on different result pages. The applicability of an annotator is the percentage of the attributes to which the annotator can be applied. For example, if, out of 10 attributes, 4 appear in tables, then the applicability of the table annotator is 40%. Table 1 shows the average applicability of each basic annotator across all testing domains in our dataset. This indicates that the results of different basic annotators should be combined in order to annotate a higher percentage of data units. Moreover, different annotators may produce different labels for a given group of data units. Therefore, we need a method to select the most suitable one for the group.

Our annotators are fairly independent from each other since each exploits an independent feature. Based on this characteristic, we employ a simple probabilistic method to combine different annotators. For a given annotator L, let P(L) be the probability that L is correct in identifying a correct label for a group of data units when L is applicable. P(L) is essentially the success rate of L. Specifically, suppose L is applicable to N cases and among these cases M are annotated correctly; then P(L) = M / N. If k independent annotators L_i, i = 1, ..., k, identify the same label for a group of data units, then the combined probability that at least one of the annotators is correct is

1 - \prod_{i=1}^{k} \left( 1 - P(L_i) \right)    (7)

To obtain the success rate of an annotator, we use the annotator to annotate every result page in a training data set (DS1 in Section 7.1). The training result is listed in Table 1. It can be seen that the table annotator is 100% correct when applicable. The query-based annotator also has a very high success rate, while the schema value annotator is the least accurate.

TABLE 1
APPLICABILITIES AND SUCCESS RATES OF ANNOTATORS

    Annotator                          Applicability   Success Rate
    Table annotator                          6%            1.0
    Query-based annotator                   35%            0.95
    Schema value annotator                  14%            0.5
    Frequency-based annotator               41%            0.86
    In-text prefix/suffix annotator          7%            0.85
    Common knowledge annotator              25%            0.84

An important issue DeLa did not address is what to do if multiple heuristics can be applied to a data unit. In our solution, if multiple labels are predicted for a group of data units by different annotators, we compute the combined probability for each label based on the annotators that identified the label, and select the label with the largest combined probability. One advantage of this model is its high flexibility: if an existing annotator is modified or a new annotator is added, all we need to do is obtain the applicability and success rate of the new/revised annotator while keeping all remaining annotators unchanged. We also note that no domain-specific training is needed to obtain the applicability and success rate of each annotator.
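A minimal sketch of this combination rule, using the success rates from Table 1, is given below; the annotator keys and the predictions structure are illustrative assumptions.

```python
from math import prod

# Success rates P(L) from Table 1 (valid when the annotator is applicable).
SUCCESS_RATE = {
    "table": 1.0,
    "query": 0.95,
    "schema_value": 0.5,
    "frequency": 0.86,
    "prefix_suffix": 0.85,
    "common_knowledge": 0.84,
}

def combined_probability(annotators):
    """Eq. (7): probability that at least one agreeing annotator is correct."""
    return 1 - prod(1 - SUCCESS_RATE[a] for a in annotators)

def select_label(predictions):
    """predictions maps a candidate label to the annotators that produced it;
    return the label with the largest combined probability."""
    return max(predictions, key=lambda lbl: combined_probability(predictions[lbl]))

# Example: two annotators agree on "Price", one predicts "Title".
preds = {"Price": ["query", "frequency"], "Title": ["schema_value"]}
print(select_label(preds))                           # Price
print(combined_probability(["query", "frequency"]))  # 1 - 0.05*0.14 = 0.993
```

Note how two individually imperfect annotators that agree (success rates 0.95 and 0.86) already yield a combined confidence of 0.993, which is why agreement among independent annotators is strong evidence for a label.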

6 ANNOTATION WRAPPER

Once the data units on a result page have been annotated, we use these annotated data units to construct an annotation wrapper for the WDB so that new SRRs retrieved from the same WDB can be annotated using this wrapper quickly, without re-applying the entire annotation process. We now describe our method for constructing such a wrapper.

Each annotated group of data units corresponds to an attribute in the SRRs. The annotation wrapper is a description of the annotation rules for all the attributes on the result page. After the data unit groups are annotated, they are organized based on the order of their data units in the original SRRs. Consider the ith group Gi. Every SRR has a tag-node sequence like Fig. 1(b) that consists of only HTML tag names and texts. For each data unit in Gi, we scan the sequence both backward and forward to obtain the prefix and suffix of the data unit. The scan stops when an encountered unit is a valid data unit with a meaningful label assigned. Then we compare the prefixes of all the data units in Gi to obtain the common prefix shared by these data units. Similarly, the common suffix is obtained by comparing all the suffixes of these data units. For example, the data unit for book title in Fig. 1(b) has one common tag sequence as its prefix and another as its suffix. If a data unit is generated by splitting from a composite text node, then its prefix and suffix are the same as those of its parent data unit. This wrapper is similar to the LR wrapper in [18]: we use the prefix as the left delimiter and the suffix as the right delimiter to identify data units. However, the LR wrapper has difficulty extracting data units packed inside composite text nodes, because there is no HTML tag within a text node. To overcome this limitation, besides the prefix and suffix, we also record the separators used for splitting the composite text node, as well as the data unit's position index in the split unit vector. Thus, the annotation rule for each attribute consists of 5 components, expressed as attribute_i = (label_i, prefix_i, suffix_i, separators_i, unitindex_i). The annotation wrapper for the site is simply the collection of the annotation rules for all the attributes identified on the result page, ordered according to the data unit groups.

To use the wrapper to annotate a new result page, for each data unit in an SRR, the annotation rules are applied one by one in the order they appear in the wrapper. If the data unit has the same prefix and suffix as specified in a rule, the rule is matched and the unit is labeled with label_i given in the rule. If separators are specified, they are used to split the unit, and label_i is assigned to the unit at position unitindex_i.
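The following sketch illustrates how such a 5-component rule might be applied to a new data unit. The Rule fields mirror the annotation rule above, but the representation of prefixes and suffixes as plain tag strings, and all names, are our own assumptions.

```python
import re
from dataclasses import dataclass
from typing import Optional

@dataclass
class Rule:
    label: str
    prefix: str                         # common tag sequence before the unit
    suffix: str                         # common tag sequence after the unit
    separators: Optional[list] = None   # for composite text nodes
    unit_index: int = 0                 # position in the split unit vector

def annotate_unit(text, before, after, rules):
    """Apply rules in wrapper order; return (label, value) or None.
    `before`/`after` are the tag sequences surrounding the text node."""
    for rule in rules:
        if before.endswith(rule.prefix) and after.startswith(rule.suffix):
            if rule.separators:
                # Split the composite text node and pick the indexed unit.
                parts = re.split("|".join(map(re.escape, rule.separators)), text)
                return rule.label, parts[rule.unit_index].strip()
            return rule.label, text.strip()
    return None

rules = [Rule(label="Title", prefix="<A>", suffix="</A>"),
         Rule(label="Price", prefix="<B>", suffix="</B>",
              separators=["/"], unit_index=0)]
print(annotate_unit("$17.50 / $12.99", "<TD><B>", "</B></TD>", rules))
# -> ('Price', '$17.50')
```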


7 EXPERIMENTS

7.1 Data Sets and Performance Measure

Our experiments are based on 112 WDBs selected from 7 domains: book, movie, music, game, job, electronics, and auto. For each WDB, its LIS is constructed automatically using WISE-iExtractor [15]. For each domain, WISE-Integrator [14] is used to build the IIS automatically. These collected WDBs are randomly divided into two disjoint groups: the first group contains 22 WDBs and is used for training, and the second group contains 90 WDBs and is used for testing. Data set DS1 is formed by obtaining one sample result page from each training site. Two testing data sets, DS2 and DS3, are generated by collecting 2 sample result pages from each testing site using different queries. We note that we largely recollected the result pages from the WDBs used in our previous study [22]; we have noticed that result pages have become more complex in general as Web developers try to make their pages fancier.

Each testing data set contains one page from each WDB. Some general terms are manually selected as the query keywords to obtain the sample result pages. The query terms are selected in such a way that they yield result pages with many SRRs from all WDBs of the same domain, for example, "Java" for Title, "James" for Author, etc. The query terms and the form element each query is submitted to are also stored together with each LIS for use by the query-based annotator.

Data set DS1 is used for learning the weights of the data unit features and the clustering threshold T in the alignment step (see Section 4), and for determining the success rate of each basic annotator (see Section 5). For each result page in this data set, the data units are manually extracted, aligned in groups, and assigned labels by a human expert. We use a genetic algorithm based method [10] to obtain the combination of feature weights and clustering threshold T that leads to the best performance over the training data set.

DS2 is used to test the performance of our alignment and annotation methods based on the parameter values and statistics obtained from DS1. At the same time, the annotation wrapper for each site is generated. DS3 is used to test the quality of the generated wrappers. The correctness of data unit extraction, alignment, and annotation is again manually verified by the same human expert for performance evaluation purposes.

We adopt the precision and recall measures from information retrieval to evaluate the performance of our methods. For alignment, precision is defined as the percentage of correctly aligned data units over all units aligned by the system; recall is the percentage of data units correctly aligned by the system over all data units aligned manually by the expert. A result data unit is counted as incorrect if it is mistakenly extracted (e.g., it failed to be split from a composite text node). For annotation, precision is the percentage of correctly annotated units over all data units annotated by the system, and recall is the percentage of data units correctly annotated by the system over all manually annotated units. A data unit is said to be correctly annotated if its system-assigned label has the same meaning as its manually assigned label.
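In symbols, if S denotes the set of data units aligned (or annotated) by the system, E the set handled manually by the expert, and C the system results that agree with the expert, these definitions reduce to the standard measures:

$\text{precision} = \frac{|C|}{|S|}, \qquad \text{recall} = \frac{|C|}{|E|}$

This is a compact restatement of the definitions above, not an additional metric.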

7.2 Experimental Results

The optimal feature weights obtained through our genetic training method (see Section 4) over DS1 are {0.64, 0.81, 1.0, 0.48, 0.56} for SimC, SimP, SimD, SimT, and SimA, respectively, and 0.59 for the clustering threshold T. The average alignment precision and recall converge at about 97%. The learning result shows that the data type and the presentation style are the most important features in our alignment method. We then apply our annotation method on DS1 to determine the success rate of each annotator (see Section 5.3).
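As an illustration of how the learned weights enter the alignment decision, the sketch below combines the five per-feature similarities into a single score and compares it against the threshold T. The function names and the normalization by the total weight are our own assumptions; the actual similarity definitions are given in Section 4.

```python
# Learned weights for SimC (content), SimP (presentation), SimD (data type),
# SimT (tag path), and SimA (adjacency), plus the clustering threshold T.
WEIGHTS = {"C": 0.64, "P": 0.81, "D": 1.0, "T": 0.48, "A": 0.56}
THRESHOLD = 0.59

def alignment_similarity(sims):
    """Weighted combination of the five feature similarities (each in [0, 1])."""
    total_weight = sum(WEIGHTS.values())
    return sum(WEIGHTS[f] * sims[f] for f in WEIGHTS) / total_weight

def same_cluster(sims):
    """Two data units are aligned if their combined similarity exceeds T."""
    return alignment_similarity(sims) > THRESHOLD

# Example: identical data type and presentation, different content.
print(same_cluster({"C": 0.1, "P": 1.0, "D": 1.0, "T": 0.9, "A": 0.8}))  # True
```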

Table 2 shows the performance of our data alignment algorithm for all 90 pages in DS2. The precision and recall for every domain are above 95%, and the average precision and recall across all domains are above 98%. The performance is consistent with that obtained over the training set. The errors usually happen in the following cases. First, some composite text nodes fail to be split into correct data units when no explicit separators can be identified; for example, the data units in some composite text nodes are separated only by blank spaces created by consecutive HTML entities (such as &nbsp;) or by certain formatting HTML tags. Second, the data units of the same attribute across different SRRs may sometimes vary considerably in appearance or layout. For example, promotion price information often has a color or font type different from that of regular price information; in this case, the two price data units have low similarity in content, presentation style, and tag path, and even though they share the same data type, the overall similarity between them is still low. Finally, the decorative tag detection (Step 1 of the alignment algorithm) is not perfect (accuracy about 90%), which causes some tags to be falsely detected as decorative, leading to incorrect merging of the values of different attributes. We will address these issues in the future.

TABLE 2
PERFORMANCE OF DATA ALIGNMENT

Domain         Precision   Recall
Auto           98.6%       98.6%
Book           98.4%       97.3%
Electronics    100%        100%
Game           99.6%       99.6%
Job            95.2%       100%
Movie          98.7%       98.0%
Music          99.0%       99.1%
Overall Avg.   98.5%       98.9%

Table 3 lists the results of our annotation performance over DS2. We can see that the overall precision and recall are very high, which shows that our annotation method is very effective. Moreover, high precision and recall are achieved for every domain, which indicates that our annotation method is domain independent. We notice that the overall recall is slightly higher than the precision, mostly because some data units are not correctly separated in the alignment step. We also found that in a few cases some texts are not assigned labels by any of our basic annotators. One reason is that some texts serve cosmetic or navigational purposes; these texts do not represent any attribute of the real-world entity and are not the labels of any data unit, so they belong to our one-to-nothing relationship type. It is also possible that some of these texts are indeed data units but none of our current basic annotators is applicable to them.

TABLE 3
PERFORMANCE OF ANNOTATION

Domain         Precision   Recall
Auto           97.7%       97.8%
Book           97.4%       96.4%
Electronics    98.3%       98.3%
Game           96.1%       96.1%
Job            95.2%       100%
Movie          97.2%       96.7%
Music          97.6%       97.7%
Overall Avg.   97.1%       97.6%

For each web page in DS2, once its SRRs have been annotated, an annotation wrapper is built and applied to the corresponding page in DS3. In terms of processing speed, the time needed to annotate one result page drops from 10-20 seconds without the wrapper to 1-2 seconds with it, depending on the complexity of the page. The reason is that the wrapper-based approach directly extracts all data units specified by the tag path(s) for each attribute and assigns the label specified in the rule to those data units. In contrast, the non-wrapper-based approach needs to go through time-consuming steps such as result page rendering and data unit similarity matrix computation for every result page.

TABLE 4
PERFORMANCE OF ANNOTATION WITH WRAPPER

Domain         Precision   Recall
Auto           94.9%       91.0%
Book           93.7%       92.1%
Electronics    95.1%       94.5%
Job            93.7%       92.0%
Movie          93.0%       92.1%
Music          95.3%       94.7%
Game           95.6%       93.8%
Overall Avg.   94.4%       93.0%

In terms of precision and recall, the wrapper-based approach reduces the precision by 2.7 percentage points and the recall by 4.6 percentage points, as can be seen from the results in Tables 3 and 4. One reason is that our wrapper uses only the tag path for text node extraction and alignment, and composite text node splitting is based solely on data unit position. In the future, we will investigate how to incorporate other features into the wrapper to increase the performance. Another reason is that in this experiment we use only one page per site in DS2 to build the wrapper. As a result, it is possible that some attributes that appear on the page in DS3 do not appear on the training page in DS2 (and so do not appear in the wrapper expression). For example, some WDBs only allow queries that use one attribute at a time (such attributes are called exclusive attributes in [15]). In this case, if the query term is based on Title, then the data units for attribute Author may not be correctly identified and annotated, because the query-based annotator is the main technique used for attributes that have a textbox on the search interface. One possible solution to this problem is to combine multiple result pages based on different queries to build a more robust wrapper.

We also conducted experiments to evaluate the significance of each feature on the performance of our alignment algorithm. For this purpose, we compare the performance when a feature is used with that when it is not. Each time, one feature is selected not to be used, and its weight is proportionally distributed to the other features based on the ratios of their weights to the total weight of the used features. The alignment process then runs on DS2 with the new parameters. Fig. 7 shows the results. It can be seen that when any one of these features is not used, both precision and recall decrease, indicating that all the features in our approach are valid and useful. We can also see that the data type and the presentation style are the most important features, because when they are not used the precision and recall drop the most (around 28 and 23 percentage points for precision, and 31 and 25 percentage points for recall, respectively). This result is consistent with our training result, where the data type and the presentation style have the highest feature weights. The adjacency and tag path features are comparatively less significant, but without either of them, precision and recall drop by more than 15 percentage points.

    Fig. 7. Evaluation of alignment features.
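The proportional weight redistribution used in this ablation can be sketched as follows; the function name and dictionary representation are illustrative assumptions that simply restate the rule described above.

```python
WEIGHTS = {"C": 0.64, "P": 0.81, "D": 1.0, "T": 0.48, "A": 0.56}

def redistribute(weights, removed):
    """Drop one feature and give its weight to the remaining features,
    proportionally to their share of the remaining total."""
    kept = {f: w for f, w in weights.items() if f != removed}
    total_kept = sum(kept.values())
    extra = weights[removed]
    return {f: w + extra * (w / total_kept) for f, w in kept.items()}

# Ablate the data type feature (SimD); the other weights grow proportionally.
print(redistribute(WEIGHTS, "D"))
# -> {'C': 0.897, 'P': 1.135, 'T': 0.673, 'A': 0.785} (values rounded)
```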

We use a similar method to evaluate the significance of each basic annotator. Each time, one annotator is removed and the remaining annotators are used to annotate the pages in DS2. Fig. 8 shows the results. Omitting any annotator causes both precision and recall to drop, i.e., every annotator contributes positively to the overall performance. Among the 6 annotators considered, the query-based annotator and the frequency-based annotator are the most significant. Another observation is that when an annotator is removed, the recall decreases more dramatically than the precision. This indicates that each of our annotators is fairly independent in terms of describing the attributes: each annotator captures one aspect of an attribute that, to a large extent, is not covered by the other annotators.

    Fig. 8. Evaluation of basic annotators.

Finally, we conducted experiments to study the effect of using LISs versus the IIS in annotation. We ran the annotation process on DS2 again, but this time, instead of using the IIS built for each domain, we used the LIS of each WDB. The results are shown in Table 5. Comparing the results in Table 5 with those in Table 3, where the IIS is used, we can see that using the LIS has a relatively small impact on precision but a significant effect on recall (the overall average recall is reduced by more than 6 percentage points) because of the local interface schema inadequacy problem described in Section 5.1. This shows that using the integrated interface schema can indeed increase the annotation performance.

As reported in [30], only three domains (Book, Job, and Car) were used to evaluate DeLa, and for each domain only 9 sites were selected. The overall precision and recall of DeLa's annotation method on that data set are around 80%. In contrast, the precision and recall of our annotation method for these three domains are well above 95% (see Table 3), although different sets of sites for these domains are used in the two works.

TABLE 5
PERFORMANCE USING LOCAL INTERFACE SCHEMA

Domain         Precision   Recall
Auto           98.3%       98.3%
Book           96.1%       85.6%
Electronics    97.5%       91.7%
Game           95.9%       92.2%
Job            95.3%       95.1%
Movie          96.8%       90.8%
Music          95.2%       83.9%
Overall Avg.   96.4%       91.1%

8 CONCLUSION

In this paper, we studied the data annotation problem and proposed a multi-annotator approach to automatically construct an annotation wrapper for annotating the search result records retrieved from any given Web database. This approach consists of six basic annotators and a probabilistic method to combine them. Each annotator exploits one type of feature for annotation, and our experimental results show that each annotator is useful and that together they are capable of generating high-quality annotation. A special feature of our method is that, when annotating the results retrieved from a Web database, it utilizes both the LIS of that Web database and the IIS of multiple Web databases in the same domain. We also explained how the use of the IIS can help alleviate the local interface schema inadequacy problem and the inconsistent label problem.

In this paper, we also studied the automatic data alignment problem. Accurate alignment is critical to achieving holistic and accurate annotation. Our method is a clustering-based shifting method that utilizes rich yet automatically obtainable features. It is capable of handling a variety of relationships between HTML text nodes and data units, including one-to-one, one-to-many, many-to-one, and one-to-nothing. Our experimental results show that the precision and recall of this method are both above 98%. There is still room for improvement in several areas, as mentioned in Section 7.2. For example, we need to enhance our method to split composite text nodes when there are no explicit separators. We would also like to try different machine learning techniques, and to use more sample pages from each training site to obtain the feature weights, in order to identify the best technique for the data alignment problem.

ACKNOWLEDGMENT

This work is supported in part by the following NSF grants: IIS-0414981, IIS-0414939, CNS-0454298, and CNS-0958501. The authors would also like to express their gratitude to the anonymous reviewers for providing very constructive suggestions to improve the manuscript.

REFERENCES

[1] A. Arasu and H. Garcia-Molina. Extracting Structured Data from Web Pages. SIGMOD Conference, 2003.
[2] L. Arlotta, V. Crescenzi, G. Mecca, and P. Merialdo. Automatic Annotation of Data Extracted from Large Web Sites. WebDB Workshop, 2003.
[3] P. Chan and S. Stolfo. Experiments on Multistrategy Learning by Meta-Learning. CIKM Conference, 1993.
[4] W. B. Croft. Combining Approaches for Information Retrieval. In Advances in Information Retrieval: Recent Research from the Center for Intelligent Information Retrieval, Kluwer Academic, 2000.
[5] V. Crescenzi, G. Mecca, and P. Merialdo. RoadRunner: Towards Automatic Data Extraction from Large Web Sites. VLDB Conference, 2001.
[6] S. Dill, N. Eiron, D. Gibson, et al. SemTag and Seeker: Bootstrapping the Semantic Web via Automated Semantic Annotation. WWW Conference, 2003.
[7] H. Elmeleegy, J. Madhavan, and A. Halevy. Harvesting Relational Tables from Lists on the Web. VLDB Conference, 2009.
[8] D. Embley, D. Campbell, Y. Jiang, S. Liddle, D. Lonsdale, Y. Ng, and R. Smith. Conceptual-Model-Based Data Extraction from Multiple-Record Web Pages. Data and Knowledge Engineering, 31(3):227-251, 1999.


[9] D. Freitag. Multistrategy Learning for Information Extraction. ICML Conference, 1998.
[10] D. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, 1989.
[11] S. Handschuh, S. Staab, and R. Volz. On Deep Annotation. WWW Conference, 2003.
[12] S. Handschuh and S. Staab. Authoring and Annotation of Web Pages in CREAM. WWW Conference, 2003.
[13] B. He and K. Chang. Statistical Schema Matching across Web Query Interfaces. SIGMOD Conference, 2003.
[14] H. He, W. Meng, C. Yu, and Z. Wu. Automatic Integration of Web Search Interfaces with WISE-Integrator. VLDB Journal, 13(3):256-273, September 2004.
[15] H. He, W. Meng, C. Yu, and Z. Wu. Constructing Interface Schemas for Search Interfaces of Web Databases. WISE Conference, 2005.
[16] J. Heflin and J. Hendler. Searching the Web with SHOE. AAAI Workshop, 2000.
[17] L. Kaufman and P. Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley & Sons, 1990.
[18] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper Induction for Information Extraction. IJCAI Conference, 1997.
[19] J. Lee. Analyses of Multiple Evidence Combination. ACM SIGIR Conference, 1997.
[20] L. Liu, C. Pu, and W. Han. XWRAP: An XML-Enabled Wrapper Construction System for Web Information Sources. IEEE ICDE Conference, 2001.
[21] W. Liu, X. Meng, and W. Meng. ViDE: A Vision-Based Approach for Deep Web Data Extraction. IEEE Transactions on Knowledge and Data Engineering, 22(3):447-460, March 2010.
[22] Y. Lu, H. He, H. Zhao, W. Meng, and C. Yu. Annotating Structured Data of the Deep Web. IEEE ICDE Conference, 2007.
[23] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's Deep Web Crawl. PVLDB, 1(2):1241-1252, 2008.
[24] W. Meng, C. Yu, and K. Liu. Building Efficient and Effective Metasearch Engines. ACM Computing Surveys, 34(1), 2002.
[25] S. Mukherjee, I. V. Ramakrishnan, and A. Singh. Bootstrapping Semantic Annotation for Content-Rich HTML Documents. IEEE ICDE Conference, 2005.
[26] B. Popov, A. Kiryakov, A. Kirilov, D. Manov, D. Ognyanoff, and M. Goranov. KIM - Semantic Annotation Platform. ISWC Conference, 2003.
[27] G. Salton and M. McGill. Introduction to Modern Information Retrieval. McGraw-Hill, 1983.
[28] W. Su, J. Wang, and F. H. Lochovsky. ODE: Ontology-Assisted Data Extraction. ACM Transactions on Database Systems, 34(2), June 2009.
[29] J. Wang, J. Wen, F. Lochovsky, and W. Ma. Instance-Based Schema Matching for Web Databases by Domain-Specific Query Probing. VLDB Conference, 2004.
[30] J. Wang and F. H. Lochovsky. Data Extraction and Label Assignment for Web Databases. WWW Conference, 2003.
[31] Z. Wu, V. Raghavan, et al. Towards Automatic Incorporation of Search Engines into a Large-Scale Metasearch Engine. IEEE/WIC WI-2003 Conference, 2003.
[32] O. Zamir and O. Etzioni. Web Document Clustering: A Feasibility Demonstration. ACM SIGIR Conference, 1998.
[33] Y. Zhai and B. Liu. Web Data Extraction Based on Partial Tree Alignment. WWW Conference, 2005.
[34] H. Zhao, W. Meng, Z. Wu, V. Raghavan, and C. Yu. Fully Automatic Wrapper Generation for Search Engines. WWW Conference, 2005.
[35] H. Zhao, W. Meng, and C. Yu. Mining Templates from Search Result Records of Search Engines. ACM SIGKDD Conference, 2007.
[36] J. Zhu, Z. Nie, J. Wen, B. Zhang, and W. Ma. Simultaneous Record Detection and Attribute Labeling in Web Data Extraction. ACM SIGKDD Conference, 2006.

Yiyao Lu is currently a Ph.D. student in the Department of Computer Science at the State University of New York at Binghamton. His research interests include Web information systems, metasearch engines, and information extraction.

Hai He received his Ph.D. degree in computer science from the State University of New York at Binghamton in 2005. He currently works for Morningstar. His research interests include Web database integration systems, and search interface extraction and integration.

Hongkun Zhao received his Ph.D. degree in computer science from the State University of New York at Binghamton in 2007. He currently works for Bloomberg. His research interests include metasearch engines, Web information extraction, and automatic wrapper generation.

Weiyi Meng received the BS degree in mathematics from Sichuan University, China, in 1982, and the MS and PhD degrees in computer science from the University of Illinois at Chicago, in 1988 and 1992, respectively. He is currently a professor in the Department of Computer Science at the State University of New York at Binghamton. His research interests include Web-based information retrieval, metasearch engines, and Web database integration. He is a coauthor of two books, Principles of Database Query Processing for Advanced Applications and Advanced Metasearch Technology. He has published more than 120 technical papers. He is a member of the IEEE.

Clement Yu received the BS degree in applied mathematics from Columbia University in 1970 and the PhD degree in computer science from Cornell University in 1973. He is a professor in the Department of Computer Science at the University of Illinois at Chicago. His areas of interest include search engines and multimedia retrieval. He has published in various journals, such as the IEEE Transactions on Knowledge and Data Engineering, ACM Transactions on Database Systems, and Journal of the ACM, and in various conferences, such as VLDB, ACM SIGMOD, ACM SIGIR, and ACM Multimedia. He previously served as chairman of the ACM Special Interest Group on Information Retrieval and as a member of the advisory committee to the US National Science Foundation. He was/is a member of the editorial boards of the IEEE Transactions on Knowledge and Data Engineering, the International Journal of Software Engineering and Knowledge Engineering, Distributed and Parallel Databases, and the World Wide Web Journal. He is a senior member of the IEEE.
