
    Unsupervised Learning of Tree Alignment Models for Information Extraction

Philip Zigoris, Damian Eads, and Yi Zhang
Department of Computer Science
University of California, Santa Cruz
1156 High Street
Santa Cruz, CA 95064
{zigoris,eads,yiz}@soe.ucsc.edu

    Abstract

We propose an algorithm for extracting fields from HTML search results. The output of the algorithm is a database table, a data structure that better lends itself to high-level data mining and information exploitation. Our algorithm effectively combines tree and string alignment algorithms, as well as domain-specific feature extraction, to match semantically related data across search results. The applications of our approach are vast and include hidden web crawling, semantic tagging, and federated search. We build on earlier research on the use of tree alignment for information extraction. In contrast to previous approaches that rely on hand-tuned parameters, our algorithm makes use of a variant of Support Vector Machines (SVMs) to learn a parameterized, site-independent tree alignment model.

This model can then be used to deduce common structural and textual elements of a set of HTML parse trees. We report some preliminary results of our system's performance on data from websites with a variety of different layouts.

    1 Introduction

There is a proliferation of research in the field of Knowledge Discovery in Databases (KDD) aiming to derive high-value conclusions from information stored in a database [7, 9]. Many of the advances in the field rely on the presence of highly structured data. Unfortunately, a vast amount of content on the Internet is in the form of semi-structured HTML search results, making the data unusable to many algorithms. The field of information extraction (IE) tries to address this problem by developing tools for transforming semi-structured text into highly-structured database content [17, 1, 11, 2, 18, 6, 3]. Thus, successes in IE will enable the exploitation of a broad class of KDD and Data Mining (DM) algorithms on the largest source of information in the world: the Internet.

The main intuition driving our approach to information extraction is that search results will often contain a high degree of repetition, and this indirectly yields information about the structure of the data. In order to identify the repetitive elements we use a variety of parameterized tree alignment models. Simply put, a tree alignment model assigns a cost to pairing vertices from two trees. By finding a minimum cost pairing between vertices we presumably identify the common structural, and possibly textual, elements of the trees. These models are introduced formally in Section 2.

One of the major contributions of our work is an unsupervised method for learning the tree alignment parameters. With well-tuned parameters these models are resilient to the structural variation in dynamic HTML returned by a web site, but a well-informed alignment model will often have too many parameters to tune effectively by hand.

Specifically, we explore the use of Support Vector Machines (SVMs), a popular machine learning algorithm, for learning these parameters. The details of our approach are presented in Section 3.

The other novel aspect of our work is a simple method for generating and representing schema. Rather than inducing a set of rules for processing unseen data, our method works by simply comparing new data against that which has already been seen, analogous to nearest neighbor classification. The details of this method are presented in Section 4.

    In Section 5 we present a preliminary evaluation of ourwork.

    2 Tree Edit Distance

This section introduces two ideas which are central to our approach: tree alignments and tree edit distance, both originally due to Tai [15]. They are, respectively, analogous to the well-studied concepts of string alignment and string edit distance. Instead of strings, however, we are concerned with vertex labeled trees, a tree coupled with a labeling function l(v) that maps vertices to a label. In our work we study HTML parse trees where vertices are labeled with tag identifiers or free text. We will often refer to vertices labeled with text as textual vertices to distinguish them from vertices labeled with HTML tags.

Intuitively, a tree alignment is an association between vertices in two labeled trees, T1 and T2. The tree edit distance between the two trees corresponds to the minimal cost of transforming T1 into T2. The edit distance provides a measure of similarity between trees (and their sub-trees), and the alignment provides a way to identify the common elements of each tree, with respect to both structure and the vertex labeling.

In this work we concern ourselves with rooted ordered labeled trees. This is a special case of labeled trees where a vertex r is identified as the root and the children of every node have a fixed ordering. We can, therefore, speak of the left and right, as well as the i-th, child of a node.

Throughout this paper we will assume all trees are rooted, ordered, and labeled. With a fixed ordering and designated root, we can formally define an alignment between trees T1 and T2 as a set A ⊆ T1 × T2 such that for all (a, b), (a', b') ∈ A:

- a = a' ⟺ b = b',
- a is to the left of a' ⟺ b is to the left of b', and
- a is an ancestor of a' ⟺ b is an ancestor of b'.

In other words, an alignment is a list of pairs of nodes, one from each tree, such that each vertex is paired with at most one other vertex and there are no crossovers. An example of an alignment between two HTML parse trees is shown in Figure 1.
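The three conditions above can be checked mechanically. Below is a minimal sketch (not from the paper) in which each vertex is encoded by its preorder interval (lo, hi): lo is the vertex's preorder index and hi the largest preorder index among its descendants, so ancestry and left-of relations reduce to interval comparisons.

```python
def is_valid_alignment(pairs):
    """Check the alignment constraints for ordered rooted trees.

    `pairs` is a list of ((lo, hi), (lo, hi)) tuples, one interval per
    tree. `x` is an ancestor of `y` iff x.lo < y.lo and y.hi <= x.hi;
    `x` is to the left of `y` iff x.hi < y.lo.
    """
    def ancestor(x, y):
        return x[0] < y[0] and y[1] <= x[1]

    def left_of(x, y):
        return x[1] < y[0]

    for a, b in pairs:
        for a2, b2 in pairs:
            if (a == a2) != (b == b2):              # at most one partner
                return False
            if left_of(a, a2) != left_of(b, b2):    # no crossovers
                return False
            if ancestor(a, a2) != ancestor(b, b2):  # ancestry preserved
                return False
    return True
```

For example, swapping two sibling leaves between the trees is rejected because the left-of relation is not preserved.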

Associated with an alignment is a set of editing operations for transforming one tree into the other: every pair (a, b) in the alignment corresponds to a relabeling/copying of a vertex, an operation we denote by R(a, b). A vertex a from T1 that does not occur in the alignment corresponds to a deletion, denoted by D(a). Similarly, a vertex b ∈ T2 not occurring in the alignment corresponds to an insertion, denoted by I(b).

In order to discuss tree edit distance we require a function c_θ that assigns a cost to each operation. We will often refer to the parameter θ as the operation costs, although the terminology is somewhat imprecise. It is natural, given c_θ, to define the cost of an alignment, C_θ(A), as the sum of the costs of its associated operations:

C_θ(A) = Σ_{(a,b) ∈ A} c_θ(R(a, b)) + Σ_{a ∈ T1 : ∄ b ∈ T2, (a,b) ∈ A} c_θ(D(a)) + Σ_{b ∈ T2 : ∄ a ∈ T1, (a,b) ∈ A} c_θ(I(b))

Finally, the tree edit distance between trees T1 and T2, d_θ(T1, T2), is defined to be the minimum cost over all alignments. We will denote by A*_θ(T1, T2) the minimizing alignment.

Figure 1. An example of an alignment between two HTML parse trees (two search-result records rendering as "Dune / Price $9" and "Middlesex / Free Shipping / Price $15"). The rendered HTML text is illustrated at the bottom left of each box. Vertices labeled with HTML tags are shown as ellipses and textual vertices are shown as squares.

Finding the minimum cost alignment can be done in, at best, cubic time with dynamic programming [8, 4]. However, for large trees even cubic running time can be prohibitive. To alleviate this, other work in information extraction has relied on approximate alignment algorithms such as partial tree alignment [17] and restricted top-down mappings [14]. In our work the trees were small enough that resorting to such methods was unnecessary. However, incorporating our methodology into a practical system would probably require their use.

    2.1 Cost functions

The task of information extraction requires a high degree of specificity in the cost function. A field will typically contain different strings with similar semantics (e.g. prices, dates, ISBNs). In order for the vertices in a field to align well, the cost function must assign a low cost to aligning strings with similar content. For instance, consider aligning the text "Price $4.99" with "$100". Despite the large syntactic differences between them, both strings have a similar function (i.e. to convey the price of an item).

In order to study the sensitivity to these issues we developed three different cost functions with varying degrees of specificity. The first, referred to as the Simple cost function, is parameterized by only three numbers: θ_M is the cost of copying any vertex, θ_ID is the cost of inserting/deleting any vertex, and θ_R is the cost of relabeling any vertex. The Simple function completely ignores the semantics of the labeling, and so aligning "$100" with "$5" will have the same cost as aligning "$100" with "July 4th".

The second cost function we developed differs from the Simple function only in the cost of aligning textual vertices. We refer to this as the string edit distance or SED cost function; it includes one extra parameter, θ_S. The cost of aligning two text-labeled vertices is θ_S times the normalized string edit distance of those two strings.

The third cost function incorporates some simple rules for determining the semantic relationship between two strings of text, and so we refer to this as the Semantic cost function. Here the cost of aligning two vertices labeled with strings s1 and s2, respectively, is θ_S · f(s1, s2), where f(·, ·) is a feature vector and θ_S are the weights associated with each feature. The features include whether or not both strings represent a(n): street, email or web address, date, phone number, or price. In total the Semantic cost function has 24 parameters.
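The three text-alignment costs might be sketched as follows. The parameter values, regular expressions, and feature choices here are illustrative assumptions, not the paper's actual 24-parameter model.

```python
import re

def normalized_edit_distance(s, t):
    """Levenshtein distance divided by the length of the longer string."""
    if not s and not t:
        return 0.0
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1] / max(len(s), len(t))

def simple_cost(s, t, theta_r=1.0):
    # Simple model: relabeling any two unequal strings costs theta_r.
    return 0.0 if s == t else theta_r

def sed_cost(s, t, theta_s=1.0):
    # SED model: cost scales with normalized string edit distance.
    return theta_s * normalized_edit_distance(s, t)

PRICE = re.compile(r"\$\s*\d")
DATE = re.compile(r"\b(\d{1,2}/\d{1,2}|\w+ \d{1,2}(st|nd|rd|th)?)\b")

def semantic_cost(s, t, weights=(1.0, -0.5, -0.5)):
    # Semantic model: weighted feature vector, theta_S . f(s, t).
    f = (
        float(s != t),                                           # differ
        float(bool(PRICE.search(s)) and bool(PRICE.search(t))),  # prices
        float(bool(DATE.search(s)) and bool(DATE.search(t))),    # dates
    )
    return sum(w * x for w, x in zip(weights, f))
```

Under the semantic model, "Price $4.99" is cheaper to align with "$100" than with "July 4th", which is exactly the behavior the Simple model cannot express.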

3 Learning Operation Costs with SVMstruct

In the previous section we presented parameterized tree alignments. Here we present an algorithm for learning these parameters from what is, effectively, unlabeled data. It is an extension of work by Tsochantaridis et al. [16] on using Support Vector Machines (SVMs) for learning structured labels. In their work they outline a very general framework that accommodates settings such as multi-class learning, grammar learning, and learning sequence alignment parameters.

Assume we are given a collection of HTML parse trees T that are each labeled with their site of origin s ∈ S, where S is the set of web sites (e.g., google.com, amazon.com, soe.ucsc.edu). In our work each parse tree T ∈ T corresponds to one data record from a search results page. Denote by T_s all trees originating from site s.

The constraint we impose is that two trees from one site must be closer to one another, in terms of tree edit distance, than to a tree from another website, as illustrated in Figure 2. Note that the labeling, i.e. the site of origin, is provided by the system that fetches search results, and so no manual labeling is required.

Formally, we seek θ (the cost function parameters) such that for all s1 ≠ s2 ∈ S, all T1, T1' ∈ T_{s1}, and all T2 ∈ T_{s2}, the following holds:

d_θ(T1, T1') ≤ d_θ(T1, T2)

Figure 2. The points represent records from three different sites (Amazon.com, DVD.com, Godiva.com). The task is to find operation costs θ that preserve this grouping under tree edit distance. A represents inter-site distance and B represents site width.

We refer to the maximum distance between any two trees from site s ∈ S as the width of s. We refer to the minimum distance between any two trees in sites s1 and s2 as the inter-site distance.
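Given any tree distance function, these two quantities can be computed directly. A small illustrative helper (not from the paper); `dist` stands in for d_θ:

```python
from itertools import combinations, product

def site_width(trees, dist):
    """Maximum pairwise distance among one site's trees."""
    return max(dist(a, b) for a, b in combinations(trees, 2))

def inter_site_distance(trees_s1, trees_s2, dist):
    """Minimum distance between trees from two different sites."""
    return min(dist(a, b) for a, b in product(trees_s1, trees_s2))
```

The learning objective below can then be read as: make every site's width smaller than its inter-site distance to every other site.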

It may be the case that no θ satisfies the above constraints. Accordingly, we can relax the constraints by introducing slack variables ξ_s for s ∈ S and penalize a solution by the sum of the slack variables. Our optimization problem becomes

min_{θ, ξ ≥ 0} Σ_{s ∈ S} ξ_s

s.t. ∀ s1 ≠ s2 ∈ S, ∀ T1, T1' ∈ T_{s1}, ∀ T2 ∈ T_{s2}: d_θ(T1, T1') − d_θ(T1, T2) ≤ ξ_{s1}

Note that ξ_s corresponds to the maximum inter-site distance between s and any other site.

In order to introduce a notion of margin we require that the distance function scale linearly with the parameter values. That is,

∀ θ, ∀ α > 0: d_{αθ}(a, b) = α · d_θ(a, b)

All of the cost functions discussed previously satisfy this requirement. In this way we can specify a unique solution by giving favor to parameter settings that are small in magnitude. This has been shown to improve generalization error in the case of linear separators [5]. We balance the cost of the slack variables against the parameter magnitude with the parameter C by adding the term ½‖θ‖² to the objective function, giving us a quadratic program. The effect of maximizing the margin is not only to maximize the inter-site distances but also to minimize the widths of the sites.
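Collecting the pieces, the relaxed problem can be written as a single quadratic program. This combined display is a reconstruction from the surrounding text (the paper does not show it in one place), using a unit margin as in the constraints of the iterative algorithm of this section:

```latex
\begin{aligned}
\min_{\theta,\,\xi \ge 0}\quad & \tfrac{1}{2}\lVert\theta\rVert^{2} \;+\; C \sum_{s \in S} \xi_{s} \\
\text{s.t.}\quad & d_{\theta}(T_1, T_1') - d_{\theta}(T_1, T_2) \;\le\; \xi_{s_1} - 1 \\
& \forall\, s_1 \neq s_2 \in S,\;\; T_1, T_1' \in T_{s_1},\;\; T_2 \in T_{s_2}
\end{aligned}
```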

The above optimization problems are difficult to solve directly, because the distance between two trees is a function of a hidden variable, namely the minimum cost alignment. Our solution is an EM-like algorithm¹, the format of which should be familiar to most readers:

Set θ_0 to be all zeros and t = 0.

Do:

  Expectation-like step: Find the minimum cost alignment, A*_{θ_t}(·, ·), between pairs of trees.

  Maximization-like step: Solve the above optimization problem, but replace the constraints with

    C_θ(A*_{θ_t}(T1, T1')) − C_θ(A*_{θ_t}(T1, T2)) ≤ ξ_{s1} − 1

  Store the result as θ_{t+1} and set t ← t + 1.

While (θ_t ≠ θ_{t−1})
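The control flow of this loop can be sketched as follows. The helpers `min_cost_alignments` and `solve_relaxed_qp` are hypothetical stand-ins for the alignment routine and the quadratic-program solve; only the alternation and stopping rule are taken from the text.

```python
def learn_operation_costs(tree_pairs, n_params,
                          min_cost_alignments, solve_relaxed_qp,
                          max_iter=50):
    """Alternate between fixing alignments (E-like step) and
    re-solving the relaxed QP (M-like step) until theta stops changing."""
    theta = [0.0] * n_params  # theta_0 = all zeros
    for _ in range(max_iter):
        # E-like step: best alignments under the current parameters.
        alignments = min_cost_alignments(theta, tree_pairs)
        # M-like step: new parameters from the QP with alignments fixed.
        new_theta = solve_relaxed_qp(alignments)
        if new_theta == theta:  # converged: theta_t == theta_{t-1}
            return theta
        theta = new_theta
    return theta
```

In practice the cap on iterations guards against a solver that oscillates rather than converging exactly.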

We implemented our algorithm using the SVMstruct package, which is built on the popular SVMlight software [10].

    4 Schema Generator

Our approach differs from previous work in schema induction in that we do not explicitly construct rules for extracting fields [14, 13]. Instead, incoming records are aligned against a set of stored and labeled HTML parse trees (i.e. records), referred to as keys. This is akin to nearest neighbor classification. A benefit of this approach, besides its simplicity, is that when a website changes its formatting all that is necessary to update the schema is to cull a new set of keys from the website. In particular, there is no need to label data or retrain.

Our algorithm for generating schema takes two arguments: an alignment model and a set of keys, which are HTML parse trees. We assume that all fields occur as text (i.e., we do not extract accompanying images or formatting tags). The initial field labeling attempts to label the keys' textual vertices with numeric abstract field ids. The rule we follow is that a vertex should have the same field id as all of the vertices with which it aligns. The algorithm for doing this, illustrated in Figure 3, amounts to finding the connected components of a graph:

- Initialize a graph G = (V, E) such that the edge set E is empty and the vertex set V is the set of text vertices from the keys.

- For every pair of keys, find their minimum cost alignment A*.

  - For every (u, v) ∈ A*, if u and v are both text vertices, then add (u, v) to E.

¹ Technically it is not EM, since there is no explicit concept of probability or expectation in our framework.


Figure 3. Key labeling algorithm. The first step is to align all pairs of keys. This induces a graph, of which we find the connected components. Each component is labeled with a field identifier, which is interpreted as a table.

- Assign a unique field label to every connected component of G containing more than one vertex. Every vertex assumes the field label of the component to which it belongs.
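The labeling step above can be sketched compactly with union-find. The vertex and alignment-edge representations here are illustrative assumptions:

```python
def label_fields(text_vertices, aligned_pairs):
    """Group text vertices into fields: vertices connected through
    alignment edges share a field id; singletons get no label (None)."""
    parent = {v: v for v in text_vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for u, v in aligned_pairs:  # union the endpoints of each edge
        parent[find(u)] = find(v)

    components = {}
    for v in text_vertices:
        components.setdefault(find(v), []).append(v)

    labels, next_id = {}, 1
    for members in components.values():
        if len(members) > 1:  # only multi-vertex components become fields
            for v in members:
                labels[v] = next_id
            next_id += 1
        else:
            labels[members[0]] = None
    return labels
```

On the Figure 3 example this produces two fields: one containing {a, d, g, h} and one containing {b, c, e, f, i}.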

The process for extracting fields from a new record is similar to the schema generation process. We simply align the record with each of the keys in the schema. Every textual vertex in the new record assumes the field label of the vertices with which it aligns; in the event of a conflict, the vertex assumes the majority field label.
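The extraction step then reduces to a majority vote over key labels. Again a sketch, with assumed representations for vertices and pooled alignments:

```python
from collections import Counter

def extract_fields(record_vertices, alignments_with_keys, key_labels):
    """record_vertices: textual vertices of the new record.
    alignments_with_keys: (record_vertex, key_vertex) pairs pooled over
    the alignments with every key.
    key_labels: field label of each key vertex (None for unlabeled)."""
    votes = {v: Counter() for v in record_vertices}
    for rv, kv in alignments_with_keys:
        label = key_labels.get(kv)
        if rv in votes and label is not None:
            votes[rv][label] += 1
    # each vertex takes the majority field label, or None if unaligned
    return {v: (c.most_common(1)[0][0] if c else None)
            for v, c in votes.items()}
```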

5 Preliminary Experiments

In this section we briefly describe our experimental results.

Search results were gathered from 13 popular websites. On each website the results for between 1 and 4 queries were collected. The search results were then partitioned into data records using the MDR tool developed by Liu et al. [12]. Eight sites (24 queries) were chosen as a training set. This set contained: Bed, Bath & Beyond, Epicurious, Google, Jet Blue, Metro North Schedule, RIT Course Schedule, Tiger Direct, and Zagats. The test set included 5 sites (16 queries): Bank Rate (BR), Choice Hotels (CH), Deep Discount DVD (DDD), Olive Garden (OG), and Walmart (WM).

To evaluate our methods we generated four separate schema for every query on every test site. The four schema included each of the tree alignment models described earlier as well as a baseline generator. As a baseline we used the following simple schema generator: align two keys by traversing them in parallel and pairing two textual vertices that occur simultaneously in the traversal. We measured the

performance of each method as the percentage of pairs of text fields that are correctly grouped together.

Site   # Queries   Baseline   Simple   SED     Semantic
BR     5           0.897      0.955    0.955   0.955
CH     2           0.838      0.960    0.956   0.946
DDD    2           0.796      0.867    0.845   0.859
OG     4           0.710      1.000    1.000   0.994
WM     4           0.883      0.849    0.901   0.835
Avg    3.4         0.830      0.930    0.939   0.923

Figure 4. Comparison of the performance of different tree alignment models. Performance is measured by the percentage of correctly grouped textual vertices.
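One plausible reading of this metric (an assumption on our part; the paper does not define it precisely) is the fraction of vertex pairs whose same-group/different-group status agrees with the ground truth:

```python
from itertools import combinations

def pair_grouping_accuracy(predicted, truth):
    """predicted, truth: dicts mapping each text vertex to a group id.
    Returns the fraction of unordered vertex pairs whose co-membership
    (same group vs. different groups) matches the ground truth."""
    pairs = list(combinations(list(truth), 2))
    agree = sum(
        (predicted[u] == predicted[v]) == (truth[u] == truth[v])
        for u, v in pairs
    )
    return agree / len(pairs)
```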

The results of the experiments are given in the table in Figure 4. In general, the schema generators that rely on tree alignments outperform the baseline generator. However, there are cases where the baseline schema generator performs surprisingly well. This is likely due to the fact that some web sites have very little variation in the formatting of their search results.

The String Edit Distance (SED) model almost always outperforms all the other models. One possible explanation is that many web sites' search results contain a number of fixed strings like "Price" or "by". The vertices associated with these strings will align very well under the SED model since their edit distance is 0; these vertices essentially act as anchor points for the alignment.

It is disappointing that the Semantic model performed worse than even the Simple model. We believe this is due to overfitting in the features used. For example, one of the results contained the address "475 Ohio Pike". The features corresponding to addresses did not detect the word "Pike", and so that particular field was labeled incorrectly. Including the string edit distance as a feature would probably alleviate these kinds of issues.

    6 Conclusion

We have presented an unsupervised framework for information extraction that enables higher-level data mining. Work is underway to demonstrate the power of our approach on a larger data set. An experimental comparison is needed to understand the relative benefits and weaknesses of our approach compared to other approaches. We have demonstrated the use of machine learning to learn tree alignment parameters, and the results are promising. Our preliminary work also demonstrates how our framework can include a rich set of extensible cost functions. This feature makes our approach to learning tree alignment models broadly relevant, since it can accommodate any tree-type phenomena; in particular, natural and programming language grammars.

    References

[1] A. Arasu and H. Garcia-Molina. Extracting structured data from web pages. In SIGMOD '03: Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data, pages 337–348, New York, NY, USA, 2003. ACM Press.

[2] L. Arlotta, V. Crescenzi, and G. Mecca. Automatic annotation of data extracted from large web sites, 2003.

[3] R. Baumgartner, S. Flesca, and G. Gottlob. Visual web information extraction with Lixto. In The VLDB Journal, pages 119–128, 2001.

[4] P. Bille. A survey on tree edit distance and related problems. Theor. Comput. Sci., 337(1-3):217–239, 2005.

[5] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[6] C.-H. Chang and S.-C. Lui. IEPAD: information extraction based on pattern discovery. In WWW '01: Proceedings of the 10th International Conference on World Wide Web, pages 681–688, New York, NY, USA, 2001. ACM Press.

[7] M.-S. Chen, J. Han, and P. S. Yu. Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8:866–883, 1996.

[8] E. D. Demaine, S. Mozes, B. Rossman, and O. Weimann. An O(n³)-time algorithm for tree edit distance. Technical report, MIT, 2005.

[9] U. M. Fayyad, G. Piatetsky-Shapiro, and P. Smyth. The KDD process for extracting useful knowledge from volumes of data. Commun. ACM, 39(11):27–34, 1996.

[10] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169–184. MIT Press, 1999.

[11] N. Kushmerick. Wrapper induction: efficiency and expressiveness. Artif. Intell., 118(1-2):15–68, 2000.

[12] B. Liu, R. Grossman, and Y. Zhai. Mining data records in web pages. In KDD '03: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 601–606, 2003.

[13] I. Muslea, S. Minton, and C. A. Knoblock. Hierarchical wrapper induction for semistructured information sources. Autonomous Agents and Multi-Agent Systems, 4(1/2):93–114, 2001.

[14] D. Reis, P. Golgher, A. Silva, and A. Laender. Automatic web news extraction using tree edit distance. In WWW '04: Proceedings of the 13th International Conference on World Wide Web, 2004.

[15] K.-C. Tai. The tree-to-tree correction problem. J. ACM, 26(3):422–433, 1979.

[16] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interdependent and structured output spaces. In ICML '04: Proceedings of the 21st International Conference on Machine Learning, 2004.

[17] Y. Zhai and B. Liu. Web data extraction based on partial tree alignment. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 76–85, New York, NY, USA, 2005. ACM Press.

[18] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. 2D conditional random fields for web information extraction. In ICML '05: Proceedings of the 22nd International Conference on Machine Learning, pages 1044–1051, New York, NY, USA, 2005. ACM Press.