Extracting Structured Data from Web Pages
Arvind Arasu, Hector Garcia-Molina
Stanford University, {arvinda, hector}@cs.stanford.edu
Abstract
Many web sites contain large sets of pages generated using a common template or layout. For example, Amazon lays out the author, title, comments, etc. in the same way in all its book pages. The values used to generate the pages (e.g., the author, title, ...) typically come from a database. In this paper, we study the problem of automatically extracting the database values from such template-generated web pages without any learning examples or other similar human input. We formally define the notion of a template, and propose a model that describes how values are encoded into pages using a template. We present an extraction algorithm that uses sets of words that have a similar occurrence pattern in the input pages to construct the template. The constructed template is then used to extract values from the pages. We show experimentally that the extracted values make semantic sense in most cases.
1 Introduction
The World Wide Web is a vast and rapidly growing source of information. Most of this information is in unstructured HTML pages that are targeted at a human audience. The unstructured nature of these pages makes it hard to do sophisticated querying over the information present in them. There are, however, many web sites that contain a large collection of pages that have more “structure.” These web pages encode data from an underlying structured source, like a relational database, and are typically generated dynamically. An example of such a collection is the set of book pages in Amazon [1]. Figure 1(a) shows two example book pages from Amazon. There are two important characteristics of such a collection of pages. First, all the pages in the collection encode the same kind of data. For instance, each page in Figure 1(a) contains the title, authors, and the price of a book. In other words, the “schema” of the data in each page is the same. Second, the data in each page is encoded in a similar fashion. In both pages of Figure 1(a), the title of the book appears in the beginning, followed by the word “by”, followed by the author(s). In other words, these pages can be generated from a common template by “plugging in” values for the title, the list of authors and so on.
(a) Two book pages from Amazon [1]
(b) Extracted Data [table with columns Page, A, B, C and one row per page; a missing attribute is shown as (NULL); the cell values are not recoverable]
This paper studies the problem of automatically extracting structured data from a collection of pages described above, without any human input like manually generated rules or training sets. By structured data, we mean “relations”: sets of tuples of the same kind that can be stored and processed in a database. For instance, from a collection of pages like those shown in Figure 1(a) we would like to extract book tuples, where each tuple consists of the title, the set of authors, the (optional) list-price, and other attributes (Figure 1(b)). Note that, as Figure 1(b) indicates, it is not our goal to find semantically meaningful attribute names for the extracted data.
Extracting structured data from the web pages is clearly very useful, since it enables us to pose complex queries over the data. Extracting structured data is also useful in information integration systems [9, 17, 15, 11], which integrate the data present in different web sites.
It is not surprising, therefore, that extracting structured data from web pages is a well-studied problem in the database and AI communities. However, most of the existing work on this problem [14, 13, 16, 6, 7, 12] assumes significant human input, for example, in the form of training examples of the data to be extracted. This is time consuming, especially if the template used to generate the pages changes often. In contrast, we aim for complete automation in the extraction process. To the best of our knowledge, the ROADRUNNER project [8] is the only work that tries to solve the same problem as we do. However, ROADRUNNER makes several simplifying assumptions that limit its applicability. We defer a more technical comparison between our work and theirs until Section 7.
The basic idea of our approach is as follows. First, we deduce the template of the input set of pages. As we mentioned earlier, the template is the text of the pages that is “independent” of the actual data encoded in the pages, and is more or less “common” to all the input pages. For example, in Figure 1(a), the text “by”, “Our Price:”, “List Price:”, “Availability” are all part of the template. Once the template is deduced, the remaining text in each page that is not part of the template is extracted as the data.
Although our basic approach looks intuitively simple, there are several challenges that have to be addressed in order to make it feasible.
1. Complex Schema: The “schema” of the information encoded in the web pages could be very complex, with arbitrary levels of nesting. For instance, each book page can contain a set of authors, with each author having a set of addresses and so on. Even the notion of a template is not very obvious in the presence of such complex schema.
2. Template vs Data: Syntactically, there is nothing that distinguishes the text that is part of the template and the text that is part of the data. This makes the task of identifying the template challenging.
The rest of the paper is organized as follows. Section 2 provides the preliminary definitions, proposes a model for page creation and formally states the EXTRACT problem that we are trying to solve in this paper. Section 3 provides a brief overview of our algorithm, EXALG, for solving the EXTRACT problem. EXALG is described in greater detail in Sections 4 and 5. Section 6 describes our experiments. Section 7 describes related work, and Section 8 gives our concluding remarks.
2 Model and Problem Formulation
In this section we formally define structured data, the kind of data that we are hoping to extract from the web pages. We also propose a model for page creation that describes how data is encoded using a template. Finally, we formulate the EXTRACT problem that we are trying to solve in this paper.
2.1 Structured Data
Structured Data is any set of data values conforming to a common schema or type. A type is defined recursively as follows [5]:
1. The Basic Type, denoted by $\mathcal{B}$, represents a string of tokens. A token is some basic unit of text. We define a token to be a word or an HTML tag. However, a token could have been defined as a bit or a character as well.
2. If $T_1, \ldots, T_n$ are types, then their ordered list $\langle T_1, \ldots, T_n \rangle$ is also a type. We say that the type $\langle T_1, \ldots, T_n \rangle$ is constructed from the types $T_1, \ldots, T_n$ using a tuple constructor of order $n$.
3. If $T$ is a type, then $\{T\}$ is also a type. We say that the type $\{T\}$ is constructed from $T$ using a set constructor.
We use the term type constructor to refer to either a tuple or set constructor. An instance of a schema is defined recursively as follows.
1. An instance of the basic type, $\mathcal{B}$, is any string of tokens.
2. An instance of type $\langle T_1, T_2, \ldots, T_n \rangle$ is a tuple of the form $\langle i_1, i_2, \ldots, i_n \rangle$ where $i_1, i_2, \ldots, i_n$ are instances of types $T_1, T_2, \ldots, T_n$, respectively. Instances $i_1, i_2, \ldots, i_n$ are called attributes of the tuple.
3. An instance of type $\{T\}$ is a set of elements $\{e_1, e_2, \ldots, e_m\}$, such that $e_i$ $(1 \le i \le m)$ is an instance of type $T$.
We also use the term value to denote an instance. Also, string denotes a string of tokens. Sometimes type constructor symbols, $\{\}$ and $\langle\rangle$, are subscripted to help us refer to the corresponding type constructors.
Example 2.1 Consider a set of pages, each containing information about a book. Each page contains the title, the set of authors, and the cost of a book. Further, each author has a first name and a last name. Then the schema of the data encoded in the pages is $S_1 = \langle \mathcal{B}, \{\langle \mathcal{B}, \mathcal{B} \rangle_{\tau_3}\}_{\tau_2}, \mathcal{B} \rangle_{\tau_1}$. Schema $S_1$ has two tuple constructors, $\tau_1$ and $\tau_3$, and one set constructor, $\tau_2$. An instance of $S_1$ is the value $x_1 = \langle t, \{\langle f_1, l_1 \rangle, \langle f_2, l_2 \rangle\}, c \rangle$ where, for example, $t$ denotes the title of the book, $f_1$ denotes the first name of an author and $c$ the cost. □
Schemas and values can be equivalently viewed as trees. Figure 1(d) shows the tree representation of schema $S_1$ and value $x_1$. A sub-tree of a schema tree is also a schema, and is called a sub-schema of the original schema. A sub-value of a value is similarly defined.
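As an illustration of the recursive definitions above, the following minimal Python sketch checks whether a value is an instance of a type. The encoding of types as nested Python tuples, the helper names `tuple_of`, `set_of` and `is_instance`, and the use of a list to stand in for a set are assumptions made for this sketch; they are not part of the model itself.

```python
# Hypothetical encoding of types; all names here are illustrative only.
BASIC = "B"                        # the basic type: a string of tokens


def tuple_of(*types):
    """Tuple constructor of order len(types)."""
    return ("tuple", types)


def set_of(elem_type):
    """Set constructor over a single element type."""
    return ("set", elem_type)


def is_instance(value, typ):
    """Return True if `value` is an instance of `typ` (recursive definition)."""
    if typ == BASIC:
        return isinstance(value, str)
    kind, arg = typ
    if kind == "tuple":
        return (isinstance(value, tuple) and len(value) == len(arg)
                and all(is_instance(v, t) for v, t in zip(value, arg)))
    if kind == "set":
        # A list stands in for a set so that elements keep a listing order.
        return isinstance(value, list) and all(is_instance(e, arg) for e in value)
    return False


# Schema of Example 2.1: <B, {<B, B>}, B>
S1 = tuple_of(BASIC, set_of(tuple_of(BASIC, BASIC)), BASIC)
x1 = ("C Programming Language",
      [("Brian", "Kernighan"), ("Dennis", "Ritchie")],
      "$30.00")
print(is_instance(x1, S1))
```

Viewing `S1` and `x1` as nested structures also makes the tree representation concrete: each nested tuple or list is an internal node and each string is a leaf.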
2.2 Model of Page Creation
We now describe a model for page creation. According to our model (Figure 1(c)), a value $x$ (taken from a database shown on the left) is encoded into a page using a template $T$. We denote the page resulting from encoding of $x$ using $T$ by $\lambda(T, x)$.
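(c) Model for Page Creation [diagram: a value $x$ from the Database and the Template $T$ produce the Output Page $\lambda(T, x)$]
(d) Example Schema and Instance [diagram: tree representations of schema $S_1$, with constructors $\tau_1$, $\tau_2$, $\tau_3$, and of value $x_1$, with leaves $t$, $f_1$, $l_1$, $f_2$, $l_2$, $c$]
Figure 1: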
Definition 2.1 (Template) A template $T$ for a schema $S$ is defined as a function that maps each type constructor $\tau$ of $S$ into an ordered set of strings $T(\tau)$, such that,
1. If $\tau$ is a tuple constructor of order $n$, $T(\tau)$ is an ordered set of $n+1$ strings $\langle C_{\tau 1}, \ldots, C_{\tau (n+1)} \rangle$.
2. If $\tau$ is a set constructor, $T(\tau)$ is a string $S_\tau$ (trivially an ordered set of unit size). □
Optionally, we represent template $T$ as $T_S$ to denote that $T$ is defined for schema $S$. For case 1 (resp. case 2) of Definition 2.1, we say string $C_{\tau i}$ $(1 \le i \le n+1)$ (resp. string $S_\tau$) is associated with type constructor $\tau$. If a string is associated with a type constructor in a template, any token that occurs within the string is also said to be associated with the type constructor.
Example 2.2 A template, $T_1$, for schema $S_1$ of Example 2.1 is given by the mapping $T_1(\tau_1) = \langle A, B, C, D \rangle$, $T_1(\tau_2) = H$, $T_1(\tau_3) = \langle E, F, G \rangle$. Each letter $A$–$H$ is a string.
Template $T_1$ tells us how to encode a page from a value. For example, the encoding $\lambda(T_1, x_1)$ is the string $A\,t\,B\,E\,f_1\,F\,l_1\,G\,H\,E\,f_2\,F\,l_2\,G\,C\,c\,D$.
For concreteness, let strings ($A$–$H$) be as shown in Figure 2(a) (the figure uses one special symbol for the empty string and another for white space). The web page corresponding to the book tuple $\langle$C Programming Language, $\{\langle$Brian, Kernighan$\rangle$, $\langle$Dennis, Ritchie$\rangle\}$, \$30.00$\rangle$ is shown in Figure 2(b). □
[Figure 2, panel (a): the strings $A$–$H$ of the template of Example 2.2; panel (b): the HTML page encoding the book tuple. The panel contents are not recoverable.]
Figure 2: Template and Page of Example 2.2
Formally, given a template $T_S$, the encoding $\lambda(T, x)$ of an instance $x$ of $S$ is defined recursively in terms of encoding of sub-values of $x$. Since it causes no ambiguity, we use the $\lambda(T, x)$ notation for values $x$ that are instances of a sub-schema of $S$.
1. If $x$ is of basic type, $\mathcal{B}$, $\lambda(T, x)$ is defined to be $x$ itself.
2. If $x$ is a tuple of the form $\langle x_1, \ldots, x_n \rangle_{\tau_t}$, $\lambda(T, x)$ is the string $C_1\,\lambda(T, x_1)\,C_2\,\lambda(T, x_2) \ldots \lambda(T, x_n)\,C_{n+1}$. Here, $x$ is an instance of the sub-schema that is rooted at type constructor $\tau_t$ in $S$, and $T(\tau_t) = \langle C_1, \ldots, C_{n+1} \rangle$.
3. If $x$ is a set of the form $\{e_1, \ldots, e_m\}_{\tau_s}$, $\lambda(T, x)$ is given by the string $\lambda(T, e_1)\,S_{\tau_s}\,\lambda(T, e_2)\,S_{\tau_s} \ldots S_{\tau_s}\,\lambda(T, e_m)$. Here $x$ is an instance of the sub-schema that is rooted at type constructor $\tau_s$ in $S$, and $T(\tau_s) = S_{\tau_s}$.
We represent a template using an infix notation. For example, the template of Example 2.2 is represented as $\langle A * B \{\langle E * F * G \rangle\}_H\, C * D \rangle$. The “$*$” symbol is similar to the UNIX wild-card, and indicates positions where values of basic type appear in an encoding using the template. Note that string $H$, associated with $\tau_2$ of Example 2.2, is placed as a subscript of $\{\}$.
Our model captures the requirement that the web pages be generated in a consistent manner. In particular, it ensures that values for the same attribute in a tuple occur in the same relative position with respect to the values of other attributes, in all the pages. In Example 2.2 above, the book name always occurs before the list of authors and the price. The encoding of the set captures the intuition that elements of a set are usually listed contiguously, and that the elements of the set are formatted in a similar manner.
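The recursive encoding can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an implementation from the paper: the schema is written as nested tuples that carry a constructor name, the template is a dictionary from constructor names to strings, and single letters are used as placeholder template strings so that the result can be compared with the infix form of the example template.

```python
def encode(template, schema, value):
    """lambda(T, x): recursively encode `value`, an instance of `schema`."""
    if schema == "B":                         # basic type: the value itself
        return value
    kind, tau, sub = schema
    if kind == "tuple":                       # C1 enc(x1) C2 ... enc(xn) C(n+1)
        strings = template[tau]
        parts = [strings[0]]
        for child_schema, child_value, sep in zip(sub, value, strings[1:]):
            parts.append(encode(template, child_schema, child_value))
            parts.append(sep)
        return "".join(parts)
    # Set constructor: element encodings joined by the separator string.
    return template[tau].join(encode(template, sub, e) for e in value)


# Hypothetical names "tau1".."tau3" and placeholder strings "A".."H".
S1 = ("tuple", "tau1",
      ["B", ("set", "tau2", ("tuple", "tau3", ["B", "B"])), "B"])
T1 = {"tau1": ["A", "B", "C", "D"], "tau2": "H", "tau3": ["E", "F", "G"]}
x1 = ("t", [("f1", "l1"), ("f2", "l2")], "c")

print(encode(T1, S1, x1))   # AtBEf1Fl1GHEf2Fl2GCcD
```

Because a tuple constructor always emits its strings in the same order, the title is necessarily encoded before the author set and the cost, which is the consistency property discussed above.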
2.3 Optionals and Disjunctions
As we saw in Section 2.1, a schema is built from two kinds of type constructors, tuple and set, and the basic type $\mathcal{B}$. There are two other kinds of type constructors that occur commonly in the schema of web pages, namely, optionals and disjunctions. For example, the list-price of a book in Amazon book pages is optional since only pages for books sold at a discount price have list-price information. As an example of a disjunction, the address information in a web page could be in one of two formats, based on whether the address is a US address or not, in which case the schema of the address is a disjunction of the schema for US addresses and the schema for non-US addresses.
We view optionals and disjunctions as special type constructors built from set and tuple constructors. If $T$ is a type, then $(T)?$ represents the optional type $T$, and is equivalent to $\{T\}_\tau$ with the constraint that in any instantiation $\tau$ has a cardinality of $0$ or $1$. Similarly, if $T_1$ and $T_2$ are types, $(T_1 \mid T_2)$ represents a type which is the disjunction of $T_1$ and $T_2$, and is equivalent to $\langle \{T_1\}_{\tau_1}, \{T_2\}_{\tau_2} \rangle_\tau$, where for every instantiation of $\tau$ exactly one of $\tau_1, \tau_2$ has cardinality one and the other, cardinality zero.
The above view of optionals and disjunctions enables us to use our model of page creation for schema involving optionals and disjunctions without any modification.
2.4 Problem Statement
Extract Problem: Given a set of $n$ pages, $p_i = \lambda(T, x_i)$ $(1 \le i \le n)$, created from some unknown template $T$ and values $\{x_1, \ldots, x_n\}$, deduce the template $T$ and values $\{x_1, \ldots, x_n\}$ from the set of pages alone.
In its general form, the EXTRACT problem is ill-defined since there are several templates and values that could have created a given set of pages, as the following example illustrates.
Example 2.3 Consider three input pages $p_1 = A\,a_1\,B\,b_1\,C\,c_1\,D$, $p_2 = A\,a_2\,B\,b_2\,C\,c_2\,D$, $p_3 = A\,a_3\,B\,b_3\,C\,c_3\,D$. These pages can be created from the template $\langle A * B * C * D \rangle$ and a corresponding set of values¹. For instance, the value used to create $p_1$ is $\langle a_1, b_1, c_1 \rangle$. These pages can also be created from the template $\langle A * C * D \rangle$ and a corresponding set of values. For this template, the value used to create $p_1$ is $\langle a_1\,B\,b_1, c_1 \rangle$. □
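The ambiguity in this example can be checked mechanically. The sketch below is only an illustration: the helper `encode_flat` is a hypothetical name, and it handles just flat tuple templates whose tokens are separated by single spaces.

```python
def encode_flat(strings, values):
    """Encode a flat tuple as C1 v1 C2 v2 ... vn C(n+1), tokens space-separated."""
    tokens = [strings[0]]
    for value, following_string in zip(values, strings[1:]):
        tokens += [value, following_string]
    return " ".join(tokens)


# Template <A * B * C * D> with value <a1, b1, c1>
page_a = encode_flat(["A", "B", "C", "D"], ["a1", "b1", "c1"])
# Template <A * C * D> with value <a1 B b1, c1>
page_b = encode_flat(["A", "C", "D"], ["a1 B b1", "c1"])

print(page_a == page_b)   # True: two different explanations of the same page
```

Both template and value pairs yield the same page, so the pages alone cannot tell the two explanations apart.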
However, given a set of real web pages from a site like Amazon, a human rarely has any ambiguity in picking the right template and values encoded in the pages. Our goal is to solve the EXTRACT problem for real web pages, i.e., produce the template and values that would be considered correct by a human.
Example 2.4 We use the instance of the EXTRACT problem with the set of 4 pages $P_e = \{p_{e1}, p_{e2}, p_{e3}, p_{e4}\}$ shown in Figure 3 as a running example. Each page in $P_e$ contains the title and the set of reviews of a book. Each review contains the name of the reviewer, the rating given by the reviewer and the text of her comments. The entire text of comments is not shown due to space limitations. Arguably, the pages were created from template $T_e$ and values $\{x_{e1}, x_{e2}, x_{e3}, x_{e4}\}$ shown in Figure 4. The schema of the values is $S_e = \langle \mathcal{B}, \{\langle \mathcal{B}, \mathcal{B}, \mathcal{B} \rangle_{\tau_{e3}}\}_{\tau_{e2}} \rangle_{\tau_{e1}}$. The correct solution of the EXTRACT problem for the input $P_e$ is the template $T_e$ and values $\{x_{e1}, x_{e2}, x_{e3}, x_{e4}\}$. □
¹ In many, but not all, cases, a template and a page created from the template uniquely identify the value encoded in the page.
[Figure 3: HTML source of the four input pages, panels (a) $p_{e1}$, (b) $p_{e2}$, (c) $p_{e3}$ and (d) $p_{e4}$. Each page-token created from a template-token carries the subscript of that template-token. The page contents are not recoverable.]
Figure 3: Input pages of EXTRACT problem
$T_e = \langle$ <HTML>$_1$ <BODY>$_2$ <B>$_3$ Book$_4$ Name$_5$ </B>$_6$ $*$ <B>$_7$ Reviews$_8$ </B>$_9$ <OL>$_{10}$ $\{\langle$ <LI>$_{11}$ <B>$_{12}$ Reviewer$_{13}$ Name$_{14}$ </B>$_{15}$ $*$ <B>$_{16}$ Rating$_{17}$ </B>$_{18}$ $*$ <B>$_{19}$ Text$_{20}$ </B>$_{21}$ $*$ </LI>$_{22}$ $\rangle\}$ </OL>$_{23}$ </BODY>$_{24}$ </HTML>$_{25}$ $\rangle$
[The values $x_{e1}, \ldots, x_{e4}$ shown in the figure are not recoverable.]
Figure 4: The correct solution to the EXTRACT problem
2.5 Miscellaneous Terminology, Definitions
An occurrence of a token in a template (resp. value, page) is called a template-token (resp. value-token, page-token). Note the distinction between a token and its occurrence. According to our model, each page-token is created from either a template-token or a value-token. Each template-token of $T_e$ in Figure 4 is subscripted to help us refer to it subsequently. The page-tokens of $P_e$ in Figure 3 that are created from a template-token have the same subscript as the template-token. Two page-tokens are said to have the same role if they have been generated by the same template-token. Therefore, two page-tokens in $P_e$ have the same role iff they have the same subscript in Figure 4.
3 Overview of our Approach
In this paper, we present an algorithm, EXALG, to solve the EXTRACT problem. Figure 5 shows the different sub-modules of EXALG. Broadly, EXALG works in two stages. In the first stage (ECGM), it discovers sets of tokens associated with the same type constructor in the (unknown) template used to create the input pages. In the second stage (Analysis), it uses the above sets to deduce the template. The deduced template is then used to extract the values encoded in the pages. This section outlines EXALG for our running example.
[Figure 5 (diagram): the Input Pages enter the Equivalence Class Generation Module (ECGM), whose sub-modules are DiffForm (Differentiate Roles Using Format), FindEq (Find Equivalence Classes), HandInv (Handle Invalid Equivalence Classes) and DiffEq (Differentiate Roles Using Eq Class). The Analysis Module, with sub-modules ConstTemp (Construct Template) and ExVal (Extract Value), produces the Template, Schema and Values.]
Figure 5: Modules of EXALG
In the first stage, EXALG (within Sub-module FINDEQ) computes “equivalence classes”: sets of tokens having the same frequency of occurrence in every page in $P_e$. An example of an equivalence class (call it $E_{e1}$) is the set of 8 tokens {<HTML>, <BODY>, Book, ..., </HTML>}, where each token occurs exactly once in every input page. There are several other equivalence classes. EXALG retains only the equivalence classes that are large and whose tokens occur in a large number of input pages. We call such equivalence classes LFEQs (for Large and Frequently occurring EQuivalence classes). For the running example there are two LFEQs. The first is $E_{e1}$ shown above. The second, which we call $E_{e3}$, consists of the 5 tokens {<LI>, Reviewer, Rating, Text, </LI>}. Each token of $E_{e3}$ occurs once in $p_{e1}$, twice in $p_{e2}$ and so on. The basic intuition behind LFEQs is that it is very unlikely for LFEQs to be formed by “chance”. Almost always, LFEQs are formed by tokens associated with the same type constructor in the (unknown) template used to create the input pages. This intuition is easily verified for the running example where all tokens of $E_{e1}$ (resp. $E_{e3}$) are associated with $\tau_{e1}$ (resp. $\tau_{e3}$) of $S_e$ in $T_e$².
For this simple example, Sub-module HANDINV does not play any role, but for real pages HANDINV detects and removes “invalid” LFEQs, that is, those that are not formed by tokens associated with a type constructor.
However, not all the tokens associated with $\tau_{e1}$ are in $E_{e1}$. For example, the token Name does not occur in $E_{e1}$ although it is associated with $\tau_{e1}$ in $T_e$. This happens because Name has multiple “roles”: it is associated with two type constructors, namely, $\tau_{e1}$ and $\tau_{e3}$. EXALG tries to add more tokens to LFEQs by “differentiating” roles of tokens using the context in which they occur. For example, EXALG infers (within Sub-module DIFFFORM)³ that the “role” of Name when it occurs in Book Name is different from the “role” when it occurs in Reviewer Name, using the fact that these two occurrences always have different paths from the root in the html parse trees of the pages. EXALG also infers (within Sub-module DIFFEQ) that the role of <B> when it occurs in <B>Book Name is different from the role when it occurs in <B>Reviews, using the fact that these two occur in different “positions” with respect to the LFEQ $E_{e1}$. The former always occurs between tokens <BODY> and Book of $E_{e1}$, and the latter between tokens Book and Reviews. Returning to token Name, let us refer to Name as Name$_a$ when it occurs in Book Name and Name$_b$ when it occurs in Reviewer Name. We call Name$_a$ and Name$_b$ dtokens (for differentiated tokens). Now, EXALG computes the occurrence frequencies of the dtokens (again within FINDEQ) and checks if they belong to any of the existing LFEQs or form new ones. In this case, Name$_a$ occurs exactly once in every page and is, therefore, added to $E_{e1}$. Similarly, Name$_b$ is added to $E_{e3}$. Similarly, the dtokens formed from <B> and </B> are added to one of $E_{e1}$ and $E_{e3}$. The reader can verify that the above step of differentiating tokens and adding them to existing LFEQs increases the size of $E_{e1}$ (see Figure 6) from 8 to 13 and the size of $E_{e3}$ from 5 to 12.
² The subscript $e1$ (resp. $e3$) of $E_{e1}$ (resp. $E_{e3}$) has been chosen to correspond to the subscript of $\tau_{e1}$ (resp. $\tau_{e3}$). This also explains why there is no $E_{e2}$: there are no tokens associated with $\tau_{e2}$ in $T_e$.
³ For exposition, the sequence of execution of EXALG described here is slightly different from the actual sequence described in Section 4. Actually, DIFFFORM executes before FINDEQ as suggested by Figure 5.
[Figure 6 (diagram): the 13 tokens of $E_{e1}$ grouped into three strings: $C_{11}$ = <HTML><BODY><B>Book Name</B>, $C_{12}$ = <B>Reviews</B><OL>, $C_{13}$ = </OL></BODY></HTML>.]
Figure 6: $E_{e1}$ at end of ECGM module
EXALG enters the second stage when it cannot grow LFEQs, or find new ones. In this stage, it builds an output template $T'_{S'}$ using the LFEQs constructed in the previous stage. In order to construct $S'$, EXALG first considers the root LFEQ, the LFEQ whose tokens occur exactly once in every input page. In our running example $E_{e1}$ is the root LFEQ. EXALG determines the positions between consecutive tokens of $E_{e1}$ that are non-empty⁴. A position between two consecutive tokens is empty if the two tokens always occur contiguously, and non-empty otherwise. There are two non-empty positions in $E_{e1}$: the position between the tokens </B> and <B>, and the position between the tokens <OL> and </OL>. The position between the first (<HTML>) and the second (<BODY>) token of $E_{e1}$ is empty since <BODY> always occurs immediately after <HTML>. EXALG generates a tuple constructor $\tau'_{e1}$ of order 2 (one attribute for each non-empty position of $E_{e1}$) corresponding to $E_{e1}$. The first non-empty position does not have any equivalence classes occurring within it. EXALG uses this information to deduce that the type of the first attribute of $\tau'_{e1}$ is $\mathcal{B}$. The second non-empty position (between <OL> and </OL>) always has zero or more occurrences of $E_{e3}$. For this case, EXALG recursively constructs the type $T_{e3}$ corresponding to $E_{e3}$, and deduces the type of the second attribute of $\tau'_{e1}$ to be $\{T_{e3}\}_{\tau'_{e2}}$. It can be verified that $T_{e3}$ constructed by EXALG is $\langle \mathcal{B}, \mathcal{B}, \mathcal{B} \rangle_{\tau'_{e3}}$. The output schema, $S'$, produced by EXALG is the type corresponding to the root equivalence class, $E_{e1}$, which is $\langle \mathcal{B}, \{\langle \mathcal{B}, \mathcal{B}, \mathcal{B} \rangle_{\tau'_{e3}}\}_{\tau'_{e2}} \rangle_{\tau'_{e1}}$.
EXALG constructs the output template $T'$ by generating a mapping from each type constructor in $S'$ to an ordered set of strings. By definition, since $\tau'_{e1}$ is a tuple constructor of order 2, $T'(\tau'_{e1})$ is an ordered set of 3 strings, $\langle C_{11}, C_{12}, C_{13} \rangle$. EXALG constructs the above 3 strings from tokens of $E_{e1}$. The string $C_{11}$ is the ordered set of tokens of $E_{e1}$ that occur before the first non-empty position: <HTML><BODY>...</B>. The string $C_{12}$ is the ordered set of tokens between the first non-empty position and the second. The string $C_{13}$ is similarly constructed (see Figure 6). EXALG infers that the mapping $T'(\tau'_{e2})$ is the empty string, since there is no “separator” between consecutive occurrences of $E_{e3}$. The mapping $T'(\tau'_{e3})$ is constructed similarly to the mapping $T'(\tau'_{e1})$ described earlier.
The reader can verify that $T' = T_e$ and $S' = S_e$. We have not described how EXALG extracts the data values. But for this case, the values are uniquely defined given $T'$ and $P_e$, and can be verified to be equal to $\{x_{e1}, x_{e2}, x_{e3}, x_{e4}\}$. Therefore, EXALG produces the correct output on our running example.
Also, we have not described Sub-module HANDINV in this section. HANDINV detects and removes “invalid” LFEQs formed in FINDEQ. It does not play any role for our simple running example since there were no invalid LFEQs formed.
⁴ The discussion of this stage of EXALG uses the fact that $E_{e1}$ and $E_{e3}$ are ordered. We will discuss this in Section 4.
4 Equivalence Classes
This section defines an equivalence class, and describes how equivalence classes are used in EXALG. Except when we refer to our running example, the discussion of Sections 4, 5, and 6 is in the context of an arbitrary set of pages $P = \{p_1, \ldots, p_n\}$, where $p_i = \lambda(T_S, x_i)$ $(1 \le i \le n)$. Schema $S$ consists of type constructors $\{\tau_1, \ldots, \tau_k\}$. The pages $\{p_1, \ldots, p_n\}$ form the input to EXALG. Note, however, that EXALG does not have knowledge of $T$, $S$ and $\{x_1, \ldots, x_n\}$.
Definition 4.1 (Occurrence Vector) The occurrence-vector of a token $t$ is defined as the vector $\langle f_1, \ldots, f_n \rangle$, where $f_i$ is the number of occurrences of $t$ in $p_i$. □
Definition 4.2 (Equivalence Class) An equivalence class is a maximal set of tokens having the same occurrence-vector. □
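Definitions 4.1 and 4.2 translate directly into a grouping of tokens by occurrence vector. The following Python sketch is an illustration only: the function name `equivalence_classes` is hypothetical, pages are assumed to be already tokenized into lists, and the two toy pages are invented for the example rather than taken from the running example.

```python
from collections import Counter, defaultdict


def equivalence_classes(pages):
    """Group tokens by occurrence vector.

    `pages` is a list of token lists; the result maps each occurrence
    vector (a tuple with one count per page) to its set of tokens.
    """
    counts = [Counter(page) for page in pages]
    vocabulary = set().union(*pages)
    classes = defaultdict(set)
    for token in vocabulary:
        vector = tuple(count[token] for count in counts)
        classes[vector].add(token)
    return dict(classes)


# Two toy pages (invented for illustration), tokenized on white space.
pages = [page.split() for page in (
    "<ol> <li> x </li> </ol>",
    "<ol> <li> y </li> <li> z </li> </ol>",
)]
for vector, tokens in sorted(equivalence_classes(pages).items()):
    print(vector, sorted(tokens))
```

Since every token has exactly one occurrence vector, the classes returned are disjoint and cover all tokens of the input pages.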
The set of equivalence classes defines a partition over the set of tokens that occur in $P$. As we saw in Section 3, there are several equivalence classes (including $E_{e1}$ and $E_{e3}$) for pages $P_e$ of our running example. The occurrence vector of tokens in $E_{e1}$ is $\langle 1, 1, 1, 1 \rangle$ and the occurrence vector of tokens in $E_{e3}$ is $\langle 1, 2, 1, 0 \rangle$.
We are interested in equivalence classes because, in practice, tokens associated with the same type constructor in $T$ tend to occur in the same equivalence class. In our running example, 8 of the 13 tokens associated with $\tau_{e1}$ in $T_e$ occur in $E_{e1}$. Observe that all occurrences of these 8 tokens are generated by unique template-tokens. For example, all occurrences of token <HTML> are generated by template-token <HTML>$_1$. On the other hand, a token like Name that does not occur in $E_{e1}$, in spite of being associated with $\tau_{e1}$ in $T_e$, is generated by more than one template-token, namely, Name$_5$ and Name$_{14}$. A token is said to have a unique role if all the occurrences of the token in the pages are generated by a single template-token.
Observation 4.1 Tokens associated with the same type constructor $\tau_j$ in $T$ that have unique roles occur in the same equivalence class. □
If, as in Observation 4.1, all the tokens of an equivalence class $E$ have unique roles and are associated with the same type constructor $\tau_j$ of $S$, we say that $E$ is derived from $\tau_j$. We call an equivalence class valid if it is derived from some $\tau_j$, and invalid otherwise. For instance, in our running example, the equivalence class made up of the value tokens that occur only in $p_{e2}$ (Data, Mining, Jeff, Jane and the two ratings on that page), with occurrence vector $\langle 0, 1, 0, 0 \rangle$, is invalid. Note that the tokens in this equivalence class occur very infrequently: in just a single page. This observation is valid in general for real web pages. Define the support of a token to be the number of pages in which the token occurs. The support of an equivalence class is the common support of the tokens in it. The size of an equivalence class is the number of tokens in the equivalence class.
Observation 4.2 For real pages, an equivalence class of large size and support is usually valid. □
We call such equivalence classes LFEQs (for Large and Frequent EQuivalence class). Observation 4.2 is true because LFEQs are rarely formed by “chance”. Two tokens rarely have the same occurrence frequency in a large number of pages unless they occur in the pages due to the same “reason”. Typically, the number of times two type constructors are instantiated is not the same in every input page⁵. Therefore, tokens associated with different type constructors usually do not occur in the same equivalence class. Tokens generated by value-tokens (e.g., Databases in our running example) usually occur infrequently and therefore do not occur in an LFEQ.
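The size and support criteria can be sketched as a simple filter over equivalence classes. The sketch below is an assumption-laden illustration: the function name `select_lfeqs` and the thresholds `min_size` and `min_support` are hypothetical, since the text above only says “large” and “frequent”, and the input classes are written out by hand from the running example rather than computed from pages.

```python
def select_lfeqs(classes, min_size, min_support):
    """Keep equivalence classes that are both large and frequent.

    `classes` maps an occurrence vector to its set of tokens. Support is
    the number of pages in which the tokens of the class occur, i.e. the
    number of non-zero entries of the occurrence vector.
    """
    selected = {}
    for vector, tokens in classes.items():
        support = sum(1 for count in vector if count > 0)
        if len(tokens) >= min_size and support >= min_support:
            selected[vector] = tokens
    return selected


# Hand-written classes modeled on the running example (four input pages).
classes = {
    (1, 1, 1, 1): {"<HTML>", "<BODY>", "Book", "Reviews",
                   "<OL>", "</OL>", "</BODY>", "</HTML>"},
    (1, 2, 1, 0): {"<LI>", "Reviewer", "Rating", "Text", "</LI>"},
    (0, 1, 0, 0): {"Data", "Mining", "Jeff", "Jane"},
}
# Threshold values chosen only for this toy input.
lfeqs = select_lfeqs(classes, min_size=3, min_support=3)
print(sorted(lfeqs))   # [(1, 1, 1, 1), (1, 2, 1, 0)]
```

The class of value tokens is dropped because its support is 1, even though it is not small.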
Observation 4.2 forms the crux of our extraction technique, which can be loosely summarized as follows: since typically LFEQs consist only of tokens associated with the same type constructor in the (unknown) input template, use LFEQs to deduce the template and schema.
There are two main obstacles that we must overcome in order to make the above idea feasible. First, note that Observation 4.2 is heuristic. There is no guarantee that all the LFEQs for a set of pages satisfy Observation 4.2. In practice, we have observed that there are always some invalid LFEQs. Second, an LFEQ, even if it is valid, only contains tokens that have unique roles, and therefore only contains partial information about the template used to generate the pages. We address both these obstacles in this section. But, in order to do so, we observe a few properties that valid equivalence classes satisfy.
4.1 Properties of Equivalence classes
Definition 4.3 (Ordered Equivalence Classes) An equivalence class is ordered if its tokens can be ordered $\langle t_1, \ldots, t_m \rangle$, such that, for every page $p_i$ $(1 \le i \le n)$, and every pair of tokens $t_j, t_k$ $(1 \le j < k \le m)$,
1. If $t_j$ occurs at least $l$ times in $p_i$, the $l^{th}$ occurrence of $t_j$ in $p_i$ occurs before the $l^{th}$ occurrence of $t_k$ in $p_i$, and
2. If $t_j$ occurs at least $(l+1)$ times in $p_i$, the $(l+1)^{st}$ occurrence of $t_j$ in $p_i$ is after the $l^{th}$ occurrence of $t_k$ in $p_i$.
We denote the above ordered equivalence class by $\langle t_1, \ldots, t_m \rangle$. □
⁵ This statement is not valid if schema $S$ is not in “canonical” form (e.g., $\langle \langle \mathcal{B} \rangle_{\tau_1}, \langle \mathcal{B} \rangle_{\tau_2} \rangle$). However, for any schema there always exists a “structurally equivalent” schema in canonical form (e.g., $\langle \mathcal{B}, \mathcal{B} \rangle$ for the example schema above).
[Figure 7 (diagram): two occurrences of an ordered equivalence class $\langle t_1, t_2, t_3 \rangle$ in a page, each marked with its span and its positions $Pos(1)$ and $Pos(2)$.]
Figure 7: Occurrence and span of an occurrence of equivalence class
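One way to test the two conditions of Definition 4.3 for a proposed token order is sketched below. The sketch relies on an observation that is ours rather than the paper's wording: because all tokens of an equivalence class occur the same number of times in a page, the two conditions hold exactly when the page, restricted to the tokens of the class, reads $t_1 \ldots t_m$ repeated. The function name `is_ordered` and the toy pages are assumptions for illustration.

```python
def is_ordered(order, pages):
    """Check a proposed order <t1, ..., tm> of an equivalence class.

    For every page, the sub-sequence of tokens that belong to the class
    must be t1 ... tm repeated some whole number of times.
    """
    if not order:
        return True
    members = set(order)
    for page in pages:
        projected = [token for token in page if token in members]
        repeats, leftover = divmod(len(projected), len(order))
        if leftover or projected != list(order) * repeats:
            return False
    return True


# Toy pages invented for illustration, tokenized on white space.
pages = [page.split() for page in (
    "<ol> <li> x </li> </ol>",
    "<ol> <li> y </li> <li> z </li> </ol>",
)]
print(is_ordered(["<li>", "</li>"], pages))   # True
print(is_ordered(["</li>", "<li>"], pages))   # False
```

A class is ordered if some permutation of its tokens passes this check; since the first page in which the class occurs already fixes the only candidate order, trying all permutations is unnecessary.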
Let $E = \langle t_1, \ldots, t_m \rangle$ be an ordered equivalence class, and let the tokens of $E$ occur $f$ times in page $p_j$. Then, we say that $E$ occurs $f$ times in $p_j$. The $i^{th}$ occurrence of $E$ refers collectively to the $i^{th}$ occurrence of tokens $t_1, \ldots, t_m$ in $p_j$. The span of the $i^{th}$ occurrence of $E$ in $p_j$ is the text starting at (and including) the $i^{th}$ occurrence of $t_1$ and ending at (and including) the $i^{th}$ occurrence of $t_m$ in $p_j$. The span of each occurrence of $E$ is sub-divided into $(m-1)$ positions, namely, $Pos(1), \ldots, Pos(m-1)$. $Pos(k)$ $(1 \le k < m)$ of the $i^{th}$ occurrence of $E$ in $p_j$ denotes the text starting at (but not including) the $i^{th}$ occurrence of $t_k$ and ending at (but not including) the $i^{th}$ occurrence of $t_{k+1}$. Figure 7 illustrates the span and $Pos(i)$ $(1 \le i \le 2)$ for two occurrences of an equivalence class $\langle t_1, t_2, t_3 \rangle$ in a page.
Definition 4.4 (Nesting of Equivalence classes) A pair of equivalence classes, $E_i$ and $E_j$, is nested if,
1. The span of any occurrence of $E_i$ does not overlap with the span of any occurrence of $E_j$, or
2. The span of all occurrences of $E_j$ is within $Pos(p)$ of some occurrence of $E_i$ for some fixed $p$; or vice-versa.
A set of equivalence classes $\{E_1, \ldots, E_n\}$ is nested if every pair of equivalence classes of the set is nested. □
Observation 4.3 A valid equivalence class is ordered and a pair of two valid equivalence classes is nested. □
It can be verified that $E_{e1}$ and $E_{e3}$ are ordered. The set $\{E_{e1}, E_{e3}\}$ is nested since the span of each occurrence of $E_{e3}$ is always within the same position of an occurrence of $E_{e1}$, namely the position between <OL> and </OL>.
14
4.2|
Handling Invalid Equivalenceclasses
As we mentionedearlier, therearealwayssomeinvalid LFEQs thatareformed,for mostinput setsof web
pages.However, typically invalid LFEQsareeithernotorderedor notnestedwith respectto otherLFEQs.
ModuleHANDINV takesasinput asetof LFEQs (determinedby FINDEQ), detectstheexistenceof invalid
LFEQs usingviolationsof orderedandnestingproperties,and“processes”theinvalid LFEQs found— it
discardssomeof theLFEQs completely, andbreaksothersinto smallerLFEQs. Theoutputof HANDINV
is an orderedsetof nested(with high probability valid, seeSection6) LFEQs. A detaileddescriptionof
HANDINV is in AppendixA.
4.3 Differentiating roles of tokens
Recall that the fundamental idea of EXALG is to use LFEQs to discover the template tokens. However, an LFEQ typically contains only tokens that have unique roles. Therefore, not all template-tokens can be discovered using LFEQs. This section presents a powerful technique, called differentiating roles of tokens, that is used in EXALG to discover a greater number (in practice, almost all) of template-tokens. Briefly, when we differentiate roles of tokens, we identify "contexts" such that the occurrences of a token in different contexts necessarily have different roles. The notion of a context should become clear when we present the two techniques for differentiating roles used in EXALG.
The first technique for differentiating roles uses the HTML formatting information of the input pages. An HTML page can be equivalently viewed as a parse tree. An occurrence-path of a page-token is the path from the root to the page-token in the parse tree. For instance, the occurrence-path of the first <B> in $p_{s1}$ is <HTML><BODY><B> 6.
Observation 4.4 In practice, two page-tokens with different occurrence-paths have different roles. □
Equivalently, Observation 4.4 asserts that all page-tokens generated by a template-token have the same occurrence-path. It can be verified that Observation 4.4 is valid for our running example. In the full version of the paper, we use well-formedness properties of HTML pages to argue that Observation 4.4 is true for real-world pages and templates.
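Differentiation by occurrence-path can be sketched with the Python standard library's `html.parser`. The class below is only an illustration, not the prototype's tokenizer: it assumes well-formed HTML without void elements such as `<br>`, splits text on whitespace, and writes paths as slash-separated element names.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class PathTokenizer(HTMLParser):
    """Emit (token, occurrence_path) pairs, i.e. dtokens in the sense of Observation 4.4."""

    def __init__(self) -> None:
        super().__init__()
        self.stack: List[str] = []          # currently open elements, root first
        self.dtokens: List[Tuple[str, str]] = []

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        self.dtokens.append((f"<{tag}>", "/".join(self.stack)))

    def handle_endtag(self, tag):
        # An end-tag shares the path of its element.
        self.dtokens.append((f"</{tag}>", "/".join(self.stack)))
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        for word in data.split():
            self.dtokens.append((word, "/".join(self.stack)))


def differentiate_by_path(html: str) -> List[Tuple[str, str]]:
    parser = PathTokenizer()
    parser.feed(html)
    parser.close()
    return parser.dtokens
```

Two `<b>` tags that sit under different ancestors receive different paths and therefore become different dtokens.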
6 There is a bit of abuse of notation here. A tag such as <HTML> in the occurrence-path above does not refer to the start-tag, but to the HTML "element" in the parse tree.
The second technique for differentiating roles uses valid equivalence classes, and is based on the following observation.
Observation 4.5 Let $E$ be a valid equivalence class derived from $\tau_i$. The role of an occurrence of a token $t$ that is outside the span of any occurrence of $E$ is different from the role of an occurrence that is within the span of some occurrence of $E$. Further, the role of an occurrence of $t$ that is within $Pos(l)$ of some occurrence of $E$ is different from the role of an occurrence of $t$ that is within $Pos(m)$ $(m \ne l)$ of some occurrence of $E$. □
Equivalently, Observation 4.5 asserts that all page-tokens generated by a template-token occur within a fixed $Pos(p)$ of $E$, or outside the span of any occurrence of $E$. Observation 4.5 can be proved in a straightforward way based on the definition of our model. In our running example, all page-tokens generated by template-token <B>$_3$ occur in $Pos(2)$ of some occurrence of $E_{s1}$, and outside the span of any occurrence of $E_{s3}$.
We differentiate roles of a token by identifying a set of contexts for the token using Observation 4.4 or 4.5, such that each occurrence of the token is within a unique context of the set, and occurrences of the token in different contexts have different roles. The set of contexts is the set of occurrence-paths of the token if we use Observation 4.4, and the set of positions of $E$ if we use Observation 4.5 with a valid equivalence class $E$. We use the term dtoken (for differentiated token) to jointly refer to a token and a context identified by differentiation. For example, if we differentiate token <B> in our running example using Observation 4.4, 2 dtokens are formed: one corresponding to the occurrence-path (context) <HTML><BODY><B>, and the other to <HTML><BODY><OL><LI><B>. Instead, if we differentiate using Observation 4.5 with $E_{s1}$, 3 dtokens are formed: the first corresponds to the context defined by $Pos(2)$ of occurrences of $E_{s1}$, the second to the context defined by $Pos(3)$, and the third to the context defined by $Pos(5)$.
A dtoken is almost like a token (a token is a dtoken with no context). We extend the notation defined for tokens to dtokens. The following statements and notation apply to dtokens: each occurrence of a dtoken is generated by a template-token or a value-token; by definition, each template-token generates a unique dtoken; a dtoken is said to have a unique role if all occurrences of the dtoken are generated by a single template-token; a page can be viewed as a string of dtokens.
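Differentiation with Observation 4.5 amounts to labelling each occurrence of a token with the position of an equivalence class in which it falls. The sketch below assumes pages are token lists and uses context 0 for "outside every span"; the sample page in the usage note is only modelled loosely on the running example and is an assumption of this illustration.

```python
from typing import List, Sequence


def position_contexts(page: Sequence[str], eq_class: Sequence[str], token: str) -> List[int]:
    """For each occurrence of `token`, return p if it lies in Pos(p) of some
    occurrence of the ordered class `eq_class`, and 0 if it lies outside all spans."""
    per_token = [[i for i, tok in enumerate(page) if tok == t] for t in eq_class]
    class_occurrences = list(zip(*per_token))
    contexts = []
    for index, tok in enumerate(page):
        if tok != token:
            continue
        context = 0
        for occ in class_occurrences:
            for k in range(len(occ) - 1):
                if occ[k] < index < occ[k + 1]:
                    context = k + 1  # positions are 1-based
        contexts.append(context)
    return contexts
```

On a page such as `<html> <body> <b> Book Name </b> v <b> Reviews </b> <ol> <li> <b> r </b> </li> </ol> </body> </html>` with the class `<html>, <body>, Book, Reviews, <ol>, </ol>, </body>, </html>`, the three `<b>` occurrences receive three distinct contexts, so three dtokens are formed.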
4.3.1 Equivalence Classes and dtokens
For exposition, we have defined equivalence classes as sets of tokens. In fact, EXALG works with equivalence classes defined using dtokens. Most of the discussion in this section admits a straightforward generalization from tokens to dtokens. We restate the main ideas in terms of dtokens.
An occurrence vector of a dtoken is the vector of occurrence frequencies of the dtoken in the input pages. An equivalence class is a maximal set of dtokens having the same occurrence vector. The dtokens generated by tokens associated with the same type constructor $\tau_j$ in $T$ and having unique roles occur in the same equivalence class (generalization of Observation 4.1). Observation 4.2 and Observation 4.3 are also valid for equivalence classes defined using dtokens. Section 4.2 can also be generalized to dtokens. Finally, the roles of dtokens themselves can be further differentiated using one of the two techniques described earlier. As an illustration of the last statement, consider the three dtokens formed by differentiating roles of token <B> using Observation 4.5 with $E_{s1}$. The third dtoken (the one with context $Pos(5)$ of $E_{s1}$) can be further differentiated into 3 new dtokens using Observation 4.5 with a different equivalence class $E_{s3}$. For instance, the first of the 3 new dtokens corresponds to the context defined by $Pos(5)$ of $E_{s1}$ and $Pos(1)$ of $E_{s3}$.
4.4 Equivalence Class Generation Module
The input to ECGM is the set of input pages $P$. The output of ECGM is a set of LFEQs of dtokens and the pages $P$ represented as strings of dtokens.
First, sub-module DIFFFORM differentiates roles of tokens in $P$ using Observation 4.4, and represents the input pages $P$ as strings of the dtokens formed as a result of the differentiation. The sub-modules FINDEQ, HANDINV and DIFFEQ then iterate in a loop. In each iteration, the input pages are represented as strings of dtokens. This representation changes from one iteration to the next because new dtokens are formed in each iteration. FINDEQ computes occurrence vectors of the dtokens in the input pages and determines LFEQs. FINDEQ needs two parameters, SIZETHRES and SUPTHRES, to determine if an equivalence class is an LFEQ. Equivalence classes with size and support greater than SIZETHRES and SUPTHRES, respectively, are considered LFEQs. HANDINV processes the LFEQs determined by FINDEQ, as described in Section 4.2, and produces a nested set of ordered LFEQs. DIFFEQ optimistically assumes that each LFEQ produced by HANDINV is valid, and uses Observation 4.5 to differentiate dtokens. If any new dtokens are formed as a result, it modifies the input pages to reflect the occurrence of the new dtokens, and control passes back to FINDEQ for another iteration. Otherwise, ECGM terminates with the set of LFEQs output by HANDINV and the current representation of the input pages as the output.
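The FINDEQ step described above can be sketched as follows. This is an illustration rather than the prototype's code: pages are lists of dtokens (plain strings here), size is taken to be the number of dtokens in a class, and support is assumed to be the number of pages in which the class occurs at least once.

```python
from collections import defaultdict
from typing import Dict, Sequence, Set, Tuple


def find_lfeqs(
    pages: Sequence[Sequence[str]], size_thres: int, sup_thres: float
) -> Dict[Tuple[int, ...], Set[str]]:
    """Group dtokens by occurrence vector and keep the large, frequent classes."""
    # Occurrence vector: one frequency per input page.
    vectors = defaultdict(lambda: [0] * len(pages))
    for page_index, page in enumerate(pages):
        for dtoken in page:
            vectors[dtoken][page_index] += 1

    # Equivalence class: maximal set of dtokens sharing one occurrence vector.
    classes = defaultdict(set)
    for dtoken, vector in vectors.items():
        classes[tuple(vector)].add(dtoken)

    lfeqs = {}
    for vector, members in classes.items():
        support = sum(1 for frequency in vector if frequency > 0)
        if len(members) > size_thres and support > sup_thres:
            lfeqs[vector] = members
    return lfeqs
```

In the full loop, the classes returned here would be passed to HANDINV and then used by DIFFEQ to form new dtokens before the next call.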
On our running example, with SIZETHRES and SUPTHRES both set to 3, ECGM runs for two iterations and produces two equivalence classes, $E'_{s1}$ and $E'_{s3}$, of sizes 13 and 12, respectively. The ordered set of tokens corresponding to dtokens in $E'_{s1}$ is $\langle$<HTML>, <BODY>, <B>, Book, ..., </BODY>, </HTML>$\rangle$, and that of $E'_{s3}$ is $\langle$<LI>, <B>, Reviewer, ..., </LI>$\rangle$.
We conclude this section with a remark on the representation of dtokens. It might seem extremely complex to store the context information of a dtoken. In fact, it is not necessary to explicitly store any context information of a dtoken. The context information of a dtoken is implicitly stored in its occurrences in the pages. In our prototype implementation we used integers to represent dtokens, and maintained a mapping from each dtoken integer to the token (a character string) corresponding to the dtoken.
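One way to realize such an integer representation is a small interning table. The class below is a hypothetical sketch, not the prototype's data structure; the lookup dictionary keyed by context is needed only while new dtokens are being formed, and afterwards the pages (as integer lists) and the integer-to-token mapping suffice.

```python
from typing import Dict, Hashable, List, Tuple


class DtokenTable:
    """Assign a fresh integer to every distinct (token, context) pair."""

    def __init__(self) -> None:
        self._ids: Dict[Tuple[str, Hashable], int] = {}
        self.token_of: List[str] = []  # dtoken integer -> underlying token string

    def intern(self, token: str, context: Hashable) -> int:
        key = (token, context)
        if key not in self._ids:
            self._ids[key] = len(self.token_of)
            self.token_of.append(token)
        return self._ids[key]
```

A page is then stored as a list of the integers returned by `intern`, and `token_of` recovers the character string when the template is printed.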
5 Building Template and Extracting Values
This section describes the ANALYSIS module of EXALG. The input of the ANALYSIS module is a set of LFEQs and a set of pages represented as strings of dtokens, and the output is a template and a set of values. The ANALYSIS module consists of two sub-modules: CONSTTEMP and EXVAL (Figure 5). We do not describe EXVAL in this paper since it is reasonably straightforward to derive.
5.1 Notation
We need the following algebra of templates to describe the recursive construction of templates in CONSTTEMP: (a) If $T_1^{S_1}, T_2^{S_2}, \ldots, T_m^{S_m}$ are templates, and $C_1, C_2, \ldots, C_{m+1}$ are strings, $T^S = \langle T_1, T_2, \ldots, T_m \rangle_{\langle C_1, C_2, \ldots, C_{m+1} \rangle}$ denotes a template, where $S = \langle S_1, S_2, \ldots, S_m \rangle_{\tau}$. $T$ is defined by the mappings $T(\tau) = \langle C_1, C_2, \ldots, C_{m+1} \rangle$, and $T(\tau_k) = T_i(\tau_k)$, for all $\tau_k$ in $S_i$ $(1 \le i \le m)$; (b) If $T_i^{S_i}$ is a template, and $C$ a string, $T^S = \{T_i^{S_i}\}_C$ denotes a template, where $S = \{S_i\}_{\tau}$. $T$ is defined by the mappings $T(\tau) = \langle C \rangle$ and $T(\tau_k) = T_i(\tau_k)$, for all $\tau_k$ in $S_i$; (c) $(T_i^{S_i})?$ (for schema $(S_i)?$) and $(T_i^{S_i} \mid T_j^{S_j})$ (for schema $(S_i \mid S_j)$) are similarly defined; (d) $T_B$ denotes the trivial template for the basic type $B$.
$Pos(p)$ of an ordered equivalence class $E = \langle d_1, d_2, \ldots, d_n \rangle$ is defined to be empty if dtokens $d_p$ and $d_{p+1}$ always occur contiguously. An equivalence class is defined to be empty if all its positions are empty. In our running example, both $E'_{s1}$ and $E'_{s3}$ are non-empty: $Pos(6)$ and $Pos(10)$ of $E'_{s1}$ are non-empty; $Pos(5)$, $Pos(8)$ and $Pos(11)$ of $E'_{s3}$ are non-empty.
For an occurrence of equivalence class $E$ and a non-empty $Pos(p)$ of $E$, $PosString(E, p)$ is the string formed by concatenating the tokens and equivalence classes 7 that occur in $Pos(p)$ of that occurrence of $E$, but do not occur within the span of some other equivalence class $E'$ whose span is also within $Pos(p)$ of the above occurrence of $E$. As an example, $PosString(E'_{s1}, 10)$ of the only occurrence of $E'_{s1}$ in $p_{s2}$ is the string "$E'_{s3}\,E'_{s3}$". Although a dtoken formed from another token [illegible in source] occurs in $Pos(10)$ of the above occurrence of $E'_{s1}$, it is not present in $PosString(E'_{s1}, 10)$ since it is within the span of an occurrence of $E'_{s3}$.
7 More formally, some unique symbol corresponding to each equivalence class. We use the name of the equivalence class (e.g., $E'_{s1}$, $E'_{s3}$) as its symbol.
5.2 CONSTTEMP
Let $\{E_1, E_2, \ldots, E_m\}$ be the input set of LFEQs of the ANALYSIS module. For every non-empty equivalence class $E_i$, CONSTTEMP recursively constructs a template, $T_{E_i}$, corresponding to $E_i$, and a template, $T_{E_i,p}$, corresponding to each non-empty position $p$ of $E_i$. The output template of CONSTTEMP is the template corresponding to the root equivalence class, i.e., the equivalence class with occurrence vector $\langle 1, 1, \ldots \rangle$ 8.
The template $T_{E_i}$ is defined in terms of $T_{E_i,p}$. Let $E_i = \langle d_1, d_2, \ldots, d_l \rangle$, and let $t_k$ $(1 \le k \le l)$ be the token corresponding to dtoken $d_k$. Let $p_j$ $(1 \le j \le q)$ denote the non-empty positions of $E_i$. Define $q+1$ strings $C_{i1}, C_{i2}, \ldots, C_{i(q+1)}$ as follows: $C_{i1} = t_1 \ldots t_{p_1}$, $C_{ij} = t_{p_{j-1}+1} \ldots t_{p_j}$ $(1 < j \le q)$, and $C_{i(q+1)} = t_{p_q+1} \ldots t_l$. The strings $C_{ij}$ $(1 \le j \le q+1)$ just partition the tokens $t_1, \ldots, t_l$ using the non-empty positions of $E_i$. The template $T_{E_i}$ is defined as: $T_{E_i} = \langle T_{E_i,p_1}, \ldots, T_{E_i,p_q} \rangle_{\langle C_{i1}, C_{i2}, \ldots, C_{i(q+1)} \rangle}$.
To construct template $T_{E_i,p}$, CONSTTEMP checks if the set of strings $PosString(E_i, p)$, corresponding to every occurrence of $E_i$, has some recognizable pattern. Table 1 lists some patterns that our prototype implementation of CONSTTEMP used, and the definition of $T_{E_i,p}$ for each pattern, if the set of strings $PosString(E_i, p)$ has that pattern. In our running example, $PosString(E'_{s1}, 6)$ is a string of dtokens for every occurrence of $E'_{s1}$, which matches the string-of-dtokens pattern of Table 1. Therefore, $T_{E'_{s1},6}$ is defined to be $T_B$. $PosString(E'_{s1}, 10)$ is always a string of 0 or more occurrences of "$E'_{s3}$", which matches Pattern 1, and hence $T_{E'_{s1},10}$ is defined to be $\{T_{E'_{s3}}\}_{\epsilon}$. The reader can recursively construct $T_{E'_{s3}}$ and verify that the output template $T_{E'_{s1}}$ produced by CONSTTEMP is the same as the correct template $T_s$.
Table 1: Patterns used in definition of $T_{E_i,p}$ [table body garbled in the source; the recoverable rows are: a repetition of a single equivalence class $E_j$ (Pattern 1) yields a set template over $T_{E_j}$; a string of dtokens and empty equivalence classes yields $T_B$; an unknown pattern yields $T_B$]
8 We can always ensure that such an equivalence class exists by prepending and appending more than SIZETHRES dummy tokens to the beginning and end of each page, respectively.
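The partition of the tokens of an equivalence class by its non-empty positions, used above to define the strings of the template, can be sketched in a few lines. The token list in the usage note is hypothetical and the function name is an assumption of this example.

```python
from typing import List, Sequence


def partition_by_positions(tokens: Sequence[str], nonempty_positions: Sequence[int]) -> List[List[str]]:
    """Split t_1..t_l after each non-empty position p_j (1-based).

    Position p lies between t_p and t_{p+1}, so q non-empty positions
    yield q + 1 groups of consecutive tokens.
    """
    cuts = [0] + sorted(nonempty_positions) + [len(tokens)]
    return [list(tokens[start:end]) for start, end in zip(cuts, cuts[1:])]
```

For instance, a class with the eight tokens `<html> <body> <h1> </h1> <ul> </ul> </body> </html>` and non-empty positions 3 and 5 is split into three strings; the values extracted from positions 3 and 5 sit between them.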
6 Experiments
EXALG makes several assumptions regarding the template and values used to generate its input pages. We summarize the important assumptions:
A1: LFEQs are always formed due to "deterministic" causes. By this we mean that LFEQs are either valid, or if they are not, they are formed due to one of the causes anticipated by HANDINV, so that the output LFEQs of HANDINV are valid.
A2: There is a sufficient number of dtokens with unique roles after differentiation using Observation 4.4 to bootstrap the process of forming LFEQs and differentiating using the LFEQs to discover additional dtokens. Also, as a result of the iterative differentiation, eventually all dtokens generated by template-tokens have unique roles and become part of the valid equivalence classes.
A3: For each tuple constructor $\tau_j$ in $S$, a large number of template-tokens is associated with $\tau_j$ in $T$, and $\tau_j$ is instantiated a non-zero number of times in a large number of pages. Therefore, if Assumption A2 holds, a valid LFEQ derived from $\tau_j$ is formed.
A4: Strings associated with tuple constructors are non-empty.
We study experimentally (a) to what extent the assumptions are satisfied, and (b) the impact on the output of EXALG when some of the assumptions above are not satisfied. For the purpose of experimentation we have built a data extraction system based on EXALG.
We describe the results of our experiments for 9 representative collections of input pages. Of these, 6 collections (1 to 6 of Table 2) were obtained from the ROADRUNNER site [8]. The remaining 3 collections were crawled from well-known data-rich sites like E-bay [2] and Netflix [4]. The crawled web pages were usually the first few search results for some search query. The schema of these collections is more complex (larger number of type constructors) and "less structured", i.e., has a large number of disjunctions and optionals. In Section 7 we argue that the techniques used in [8] do not work well for collections with such complex schema.
Recall that EXALG uses two parameters, SIZETHRES and SUPTHRES. In our experiments we set SIZETHRES to 3. We wanted SIZETHRES to be as small as possible, since any type constructor with fewer than SIZETHRES template-tokens associated with it fails to be discovered by EXALG (Assumption A3). We did not use the value 2 since this leads to the formation of many invalid equivalence classes involving start-tags and their matching end-tags. For SUPTHRES we used a value of [multiplier illegible in source] times the number of input pages. This value was empirically determined to be good.
For each collection $C$ above, we manually generated the schema $S_m$ of the values encoded in each page of the collection, using the semantics of the application. We ignored non-text values like images and values occurring within tag attributes (e.g., URLs) when generating $S_m$. Let $S_e$ denote the output schema of our automatic system. We considered each leaf attribute $A_m$ in $S_m$, and classified it into one of the following 3 categories to reflect how successful our system was in extracting the values of $A_m$.
• Correct: $A_m$ was classified as correct if there existed a leaf attribute $A_e$ in $S_e$ such that for each page in $C$, the set of values of $A_m$ in the page is equal to the set of values of $A_e$ in the page.
• Partially Correct: $A_m$ was classified as partially correct if it was not correct and there existed a leaf attribute $A_e$ in the extracted schema such that for each page in $C$, each value of $A_m$ occurred as part of a value of $A_e$ in that page.
• Incorrect: $A_m$ was classified as incorrect if it was neither correct nor partially correct.
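The three categories can be expressed as a short procedure. The sketch below is an illustration of the criteria only, not the evaluation scripts used for the experiments; it assumes values are strings, that "occurs as part of" means substring containment, and that each attribute is given as one set of values per page.

```python
from typing import Dict, List, Set


def classify(manual: List[Set[str]], extracted: Dict[str, List[Set[str]]]) -> str:
    """Classify one manual leaf attribute against all extracted leaf attributes.

    `manual[i]` is the set of values of the manual attribute in page i;
    `extracted[name][i]` is the set of values of extracted attribute `name` in page i.
    """
    # Correct: some extracted attribute has exactly the same values in every page.
    for values in extracted.values():
        if all(m == e for m, e in zip(manual, values)):
            return "correct"
    # Partially correct: every manual value is part of some extracted value, page by page.
    for values in extracted.values():
        if all(
            all(any(v in candidate for candidate in e) for v in m)
            for m, e in zip(manual, values)
        ):
            return "partially correct"
    return "incorrect"
```

An extracted attribute that merges a title with a neighbouring attribute therefore counts as partially correct, while one that splits the title at a template word counts as incorrect.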
As an illustration of how an attribute in $S_m$ would be classified as incorrect, consider a hypothetical collection of book pages containing an attribute, book-title. Also assume that there is a token (word) $w$ that occurs exactly once in the title of every page in the collection. In this case, EXALG will push $w$ to the template, and will extract two attributes corresponding to the book title (the text of the book title before and after the word $w$), making the attribute book-title in our manual schema incorrect.
We use an instance from our experiments to illustrate partially correct attributes. Each movie page in our Netflix collection had an attribute, movie-title. There was also an optional attribute, local-name, for some foreign-language movies. However, since local-name appeared in a very small number of input pages (Assumption A3 fails), EXALG did not recognize the optional attribute. Instead, it combined the optional attribute with the movie-title attribute whenever the former occurred in a page. For this case, we classified both attributes, movie-title and local-name, as partially correct.
The above classification was done with a view to Assumptions A1-A4. It can be shown that if Assumption A1 is satisfied then none of the attributes would be incorrect, irrespective of whether the other assumptions are satisfied or not. Partially correct attributes result if Assumptions A2-A4 are not satisfied.
We have placed the detailed results of our experiments at the URL [3]. The above link contains, for each collection, the set of input pages in the collection, the template discovered by our system, and the values extracted for each input page. It also has a log of the execution of our system for each input collection. The log contains information like the set of LFEQs formed, the set of strings $PosString(E, p)$ and the pattern that matches the set (Section 5), for every non-empty position $p$ of an LFEQ $E$, and so on. Finally, the above link contains details of our evaluation: the manual schema $S_m$ that we constructed for each input collection, and for each attribute $A_m$ in $S_m$, the category that we assigned $A_m$ to, and the reason for doing so.
Index  Collection                   No. pages  No. of leaf attributes  Correct  Partially Correct  Incorrect
1      Amazon Cars                  21         13                      13       0                  0
2      Amazon Pop Artist Lists      19         5                       5        0                  0
3      Baseball players             10         7                       7        0                  0
4      rpm packages                 20         6                       6        0                  0
5      UEFA national teams info     20         9                       9        0                  0
6      UEFA team players            20         2                       2        0                  0
7      E-bay                        50         22                      18       4                  0
8      Netflix                      50         29                      23       6                  0
9      ATP Tennis Player Profiles   32         35                      33       2                  0
Table 2: Experimental Results
Table 2 summarizes the experimental results. Specifically, it shows for each collection $C$, the size of the collection, the total number of leaf attributes in $S_m$, and the distribution of these attributes into one of the 3 categories described earlier.
The results in Table 2 show that EXALG is effective in correctly extracting data from web page collections: across the nine collections, 116 of the 128 leaf attributes were classified as correct and the remaining 12 as partially correct. All partially correct attributes occurred in the three more complex collections. As we mentioned earlier, partially correct attributes result when EXALG extracts data at a "granularity" coarser than the best possible. This happens, for example, when EXALG combines adjacent attributes, or includes some words that are part of the template within the extracted data. Although partially correct attributes are less desirable than correct attributes, they are better than incorrect attributes, because one can potentially convert partially correct attributes to correct attributes by developing more sophisticated techniques for (post)processing each leaf attribute of the schema extracted by EXALG. We plan to develop such techniques as part of our future work. Finally, the absence of incorrect attributes indicates that all the input collections satisfied Assumption A1.
Our experimental results also illustrate another desirable property of EXALG: the impact of the failed assumptions is localized. For example, if Assumption A3 is not valid for some type constructor $\tau_j$, then EXALG fails to extract the attributes of $\tau_j$, and sometimes attributes of "surrounding" type constructors (see our example of partially correct attributes involving Netflix movie-title and local-name above). In the full paper, we provide a more detailed description of why the impact of failed assumptions is localized.
The running time of EXALG depends on the number of iterations of the loop involving the sub-modules FINDEQ, HANDINV and DIFFEQ. The running time of each sub-module is linear in the input size. For all the input collections, the number of iterations was less than ten. Hence, for all practical purposes, EXALG is linear in the input size. EXALG as described so far works on the entire input collection of pages. When the input collection is large, we can modify EXALG to work in a "wrapper-mode." In this mode, instead of using the entire collection to generate the template, EXALG uses a small sample of pages to generate the template, and then uses the generated template to extract the data from the full collection.
7 Related Work
Most of the related work uses a "wrapper-based" system for extracting data. In a wrapper-based system, extraction is a two-step process. In the first step, a wrapper for the given set of pages is generated. In the second step, the wrapper is used to extract the data from the web pages. A wrapper is just a program that extracts the data from the set of pages. Note that a wrapper is specific to the set of pages: the wrapper for the pages of one web site will be different from the wrapper for the pages of a second web site. In Hammer et al. [12] a human expresses the location of the data to be extracted as declarative rules. A program converts these rules into a wrapper. In [14, 13, 16, 6] training examples consisting of the pages and the data that occur in those pages are used to "learn" the wrapper using various machine learning techniques. All the above wrapper-based systems clearly require more human input, either in the form of rules or learning examples. If the template changes, as happens frequently in practice, new rules or examples may be needed.
DIPRE [7] is an instance of a non-wrapper-based system that uses learning examples. The interesting aspect of DIPRE is that it tries to extract relations from the entire web and not just a particular web site as in our case. However, its techniques apply only to simple "relations", while we aim to extract data with more complex schema.
Our work is most closely related to the ROADRUNNER project [10, 8]. ROADRUNNER uses a model of page creation using a template that is very similar to ours. ROADRUNNER starts off with the entire first input page as its initial template. Then, for each subsequent page, it checks if the page can be generated by the current template. If it cannot be, it modifies its current template so that the modified template can generate all the pages seen so far. There are several limitations to the ROADRUNNER approach:
1. ROADRUNNER assumes that every HTML tag in the input pages is generated by the template. This assumption is crucial in ROADRUNNER to check if an input page can be generated by the current template. This assumption is clearly invalid for pages in many web sites since HTML tags can also occur within data values. For example, a book review in Amazon [1] could contain tags: the review could be in several paragraphs, in which case it contains <P> tags, or some words in the review could be highlighted using <I> tags. When the input pages contain such data values, ROADRUNNER will either fail to discover any template, or produce a wrong template.
2. ROADRUNNER assumes that the "grammar" of the template used to generate the pages is union-free. This is equivalent to the assumption that there are no disjunctions in the input schema. The authors of ROADRUNNER themselves have pointed out in [8] that this assumption does not hold for many collections of pages. Moreover, as the experimental results in [8] suggest, ROADRUNNER might fail to produce any output if there are disjunctions in the input schema.
3. When ROADRUNNER discovers that the current template does not generate an input page, it performs a complicated heuristic search involving "backtracking" for a new template. This search is exponential in the size of the schema of the pages. It is, therefore, not clear how ROADRUNNER would scale to web page collections with a large and complex schema.
8 Conclusion
This paper presented an algorithm, EXALG, for extracting structured data from a collection of web pages generated from a common template. EXALG first discovers the unknown template that generated the pages and uses the discovered template to extract the data from the input pages. EXALG uses two novel concepts, equivalence classes and differentiating roles, to discover the template. Our experiments on several collections of web pages, drawn from many well-known data-rich sites, indicate that EXALG is extremely good at extracting the data from the web pages. Another desirable feature of EXALG is that it does not completely fail to extract any data even when some of the assumptions made by EXALG are not met by the input collection. In other words, the impact of the failed assumptions is limited to a few attributes.
There are several interesting directions for future work. The first direction is to develop techniques for crawling, indexing and providing querying support for the "structured" pages in the web. Clearly, a lot of information in these pages is lost when naive keyword indexing and searching is used. We indicate two specific problems in this direction. First, how do we automatically locate collections of pages that are structured? Second, is it feasible to generate some large "database" from these pages? Any technique for solving the latter problem has to be much less sophisticated than the one discussed here, possibly by sacrificing accuracy for efficiency. Also, when we work at the scale of the entire web we might be able to leverage the redundancy of the data on the web as in Brin [7]. The second direction of work is to develop techniques for automatically annotating the extracted data, possibly using the words that appear in the template.
References
[1] Amazon.com. http://www.amazon.com.
[2] ebay.com. http://www.ebay.com.
[3] Experimental results. http://www-db.stanford.edu/~arvind/extract/index.html.
[4] Netflix. http://www.netflix.com.
[5] Serge Abiteboul, Richard Hull, and Victor Vianu. Foundations of Databases. Addison Wesley, Reading, Massachusetts, 1995.
[6] G. Barish, Y. S. Chen, D. DiPasquo, and C. A. Knoblock. Theaterloc: Using information integration technology to rapidly build virtual applications. In Proc. of the 2000 Intl. Conf. on Data Engineering, pages 681–682, 2000.
[7] Sergey Brin. Extracting patterns and relations from the world wide web. In WebDB Workshop at 6th International Conference on Extending Database Technology, EDBT'98, 1998.
[8] Valter Crescenzi, Giansalvatore Mecca, and Paolo Merialdo. Roadrunner: Towards automatic data extraction from large web sites. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, 2001.
[9] Hector Garcia-Molina, Yannis Papakonstantinou, Dallan Quass, Anand Rajaraman, Yehoshua Sagiv, Jeff Ullman, and Jennifer Widom. The tsimmis project: Integration of heterogenous information sources. Journal of Intelligent Information Systems, 8(2):117–132, 1997.
[10] Stéphane Grumbach and Giansalvatore Mecca. In search of the lost schema. In Proceedings of the Intl. Conference of Database Theory (ICDT), 1999.
[11] Laura M. Haas, Donald Kossmann, Edward L. Wimmers, and Jun Yang. Optimizing queries across diverse data sources. In Proc. of the 1997 Intl. Conf. on Very Large Data Bases, pages 276–285, 1997.
[12] Joachim Hammer, Hector Garcia-Molina, Junghoo Cho, Arturo Crespo, and Rohan Aranha. Extracting semistructured information from the web. In Proceedings of the Workshop on Management of Semistructured Data, 1997.
[13] C. N. Hsu and M. T. Dung. Generating finite-state transducers for semi-structured data extraction from the web. Information Systems Special Issue on Semistructured Data, 23(8), 1998.
[14] N. Kushmerick, D. Weld, and R. Doorenbos. Wrapper induction for information extraction. In Proc. of the 1997 Intl. Joint Conf. on Artificial Intelligence, 1997.
[15] Alon Levy, Anand Rajaraman, and Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In Proc. of the 1996 Intl. Conf. on Very Large Data Bases, pages 251–262, 1996.
[16] I. Muslea, S. Minton, and C. A. Knoblock. A hierarchical approach to wrapper induction. In Proceedings of Third International Conference on Autonomous Agents, Seattle, WA, 1999.
[17] Jeffrey D. Ullman. Information integration using logical views. In Proc. of the International Conference on Database Theory (ICDT), pages 19–40, 1997.
A Handling Invalid Equivalence Classes
We now present a procedure HANDINV, which is a sub-module of EXALG, whose goal is to eliminate invalid equivalence classes. HANDINV uses the ordered and nesting properties of equivalence classes to detect the existence of invalid equivalence classes. It subsequently "processes" the invalid LFEQs found by sometimes discarding them, and sometimes breaking them into smaller LFEQs.
It follows from Observation 4.3 that any equivalence class that is not ordered is not valid, and for any pair of equivalence classes that are not nested, at least one of them is not valid. HANDINV works under the hypothesis that invalid equivalence classes expose their presence by not satisfying the orderedness property or the nesting property or both (which is the converse of Observation 4.3).
The input to HANDINV is a set of equivalence classes, possibly containing equivalence classes that are not ordered, or pairs of equivalence classes that are not nested. The output is a nested set of ordered equivalence classes. Before presenting HANDINV we examine some common causes for the formation of invalid equivalence classes. These provide the insight for the procedure that follows.
A.0.1 Causes for formation of invalid equivalence classes
In practice there are three primary causes for the occurrence of invalid equivalence classes.
1. Correlation of instantiation of two or more type constructors. The tokens having unique roles and associated with two or more type constructors that have the same frequency of instantiation in every input page form an invalid equivalence class. In practice, this happens because related information is sometimes physically distributed in non-contiguous parts of an output page. For example, an Amazon book page has an optional attribute for the rating of the book, and an optional set of customer reviews of the book. Any book that has one or more customer reviews has a rating, and the rating attribute is absent if the set of customer reviews does not exist.
2. Frequently occurring tuples. In some cases, the same (sub-value) tuple occurs repeatedly in many different pages. The tokens occurring in the tuple tend to form an equivalence class since all the tokens occur whenever the tuple occurs and none otherwise. For example, a Netflix movie page has information about related movies. Each related movie is a tuple consisting of the movie name, rating, and other similar information. A movie could be related to several other movies, and therefore the tuple corresponding to the movie appears in the pages corresponding to all the movies that it is related to.
3. Overlap in the tokens associated with different type constructors. Consider two type constructors $\tau_1$ and $\tau_2$ of $S$. Let $t_1, t_2, \ldots$ be the set of tokens that are associated with both type constructors. The set of tokens $t_1, t_2, \ldots$ forms an equivalence class in a set of pages if the tokens are not associated with any other type constructor in the schema, and are not generated by any value-token. Consider our running example, and assume that we use a size threshold of 2 to determine if an equivalence class is an LFEQ. In this case, in addition to the equivalence classes $E_{s1}$ and $E_{s3}$, there is an additional equivalence class, {<B>, </B>}, which has an occurrence vector $\langle 5, 8, 5, 2 \rangle$. It is an example of an equivalence class formed due to the overlap of the tokens <B> and </B>, which are associated with both $\tau_{s1}$ and $\tau_{s3}$ of $S_s$.
We call invalid equivalence classes formed by the first two causes invalid equivalence classes of Type A, and invalid equivalence classes formed by cause 3 and other causes invalid equivalence classes of Type B.
The invalid equivalence classes of Type A are just collections of valid equivalence classes 9. They have easily verifiable characteristic occurrence properties: they are "almost" ordered and "almost" nested. Procedure HANDINV tries to partition these kinds of invalid equivalence classes into their constituent valid equivalence classes. The Type B invalid equivalence classes cannot be partitioned into valid equivalence classes. Procedure HANDINV discards Type B equivalence classes whenever it detects one.
A.0.2 HANDINV
The classification of invalid equivalence classes into Type $A$ and Type $B$ equivalence classes requires
knowledge of the schema $S$ and template $T$, which is not available to HANDINV. Procedure HANDINV just
uses the ordering and nesting properties of the equivalence classes to predict whether an invalid equivalence class
is of Type $A$ or Type $B$.
Definition A.1 (Almost-Ordered) An equivalence class is almost-ordered if the tokens in the equivalence
class can be partitioned into ordered equivalence classes that do not overlap. □
Example A.1 Consider an equivalence class $\mathcal{E} = \{A, B, C, D\}$ defined for an input of three pages. If the
relative order of occurrence of the tokens in the three pages is $A\,B\,A\,B\,C\,D\,C\,D$, $A\,B\,C\,D$, and $A\,B\,A\,B\,A\,B\,C\,D\,C\,D\,C\,D$,
respectively, then $\mathcal{E}$ is almost-ordered. $\mathcal{E}$ can be split into equivalence classes $\{A, B\}$ and $\{C, D\}$, which are
ordered and do not overlap with each other. □

Footnote 9: This is not entirely true for equivalence classes formed by cause 2. However, it should become clear to the reader after reading
Section 5 that no harm is done by ignoring this fact.
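The two conditions of Definition A.1 can be checked mechanically for a candidate split. The sketch below is only an illustration under stated assumptions: a page is a list of token strings, a class is "ordered" when its tokens appear in every page as repetitions of one fixed sequence, and two classes "overlap" when the spans of their occurrences intersect. These working definitions, the letter tokens, and the helper names (`is_ordered`, `spans`, `is_valid_split`) are hypothetical simplifications.

```python
def is_ordered(pages, tokens):
    """True if, in every page, the class's tokens form repetitions of one fixed sequence."""
    k = len(tokens)
    pattern = None
    for page in pages:
        seq = [t for t in page if t in tokens]
        if not seq:
            continue
        if pattern is None:
            pattern = seq[:k]
            if set(pattern) != set(tokens):
                return False
        if len(seq) % k != 0:
            return False
        if any(seq[i] != pattern[i % k] for i in range(len(seq))):
            return False
    return True

def spans(page, tokens):
    """(first index, last index) of each occurrence of an ordered class in a page."""
    idx = [i for i, t in enumerate(page) if t in tokens]
    k = len(tokens)
    return [(idx[j], idx[j + k - 1]) for j in range(0, len(idx), k)]

def overlaps(pages, a, b):
    """True if some occurrence span of class a intersects one of class b."""
    for page in pages:
        for s1, e1 in spans(page, a):
            for s2, e2 in spans(page, b):
                if s1 <= e2 and s2 <= e1:
                    return True
    return False

def is_valid_split(pages, parts):
    """All parts ordered, and no two parts overlap."""
    if not all(is_ordered(pages, p) for p in parts):
        return False
    return not any(
        overlaps(pages, a, b)
        for i, a in enumerate(parts)
        for b in parts[i + 1:]
    )

# Token sequences in the style of Example A.1 (letters are placeholder tokens).
pages = [list("ABABCDCD"), list("ABCD"), list("ABABABCDCDCD")]

whole_is_ordered = is_ordered(pages, set("ABCD"))
good_split = is_valid_split(pages, [set("AB"), set("CD")])
bad_split = is_valid_split(pages, [set("AC"), set("BD")])
print(whole_is_ordered, good_split, bad_split)
```

The sketch only verifies a given split; it does not search for one, which is the part HANDINV would actually have to perform.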
Definition A.2 (Almost-Nested) Two equivalence classes $\mathcal{E}_i$ and $\mathcal{E}_j$ are said to be almost-nested if for every
token $t \in \mathcal{E}_i$ (and vice versa) there exists a position $p$ such that each occurrence of $t$ is either outside the
span of any occurrence of $\mathcal{E}_j$ or within $\mathit{Pos}(p)$ of some occurrence of $\mathcal{E}_j$. □
Example A.2 Consider two equivalence classes $\mathcal{E}_1 = \{A, B, C, D\}$ and $\mathcal{E}_2 = \{E, F, G, H\}$ for an
input of three pages $p_1, p_2, p_3$. If the relative order of occurrences of the tokens $A$-$H$ in $p_1, p_2, p_3$ is
$A\,B\,C\,D\,A\,(E\,F)\,B\,C\,(G\,H)\,D$, $A\,(E\,F)\,B\,C\,(G\,H)\,D$, and $E\,F\,G\,H$, respectively, the classes $\mathcal{E}_1$ and $\mathcal{E}_2$ are almost-nested
with respect to each other. As an example, token $E$ of $\mathcal{E}_2$ always either occurs in $\mathit{Pos}(1)$ of an
occurrence of $\mathcal{E}_1$ or does not occur within the span of any occurrence of $\mathcal{E}_1$. □
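Definition A.2 can be turned into a direct check. The sketch below assumes that both classes are already ordered, that a page is a list of token strings, and that position p of an occurrence is the gap after its p-th token. The letter tokens and the names `occurrences`, `positions_used`, and `almost_nested` are hypothetical, chosen for this illustration.

```python
def occurrences(page, tokens):
    """Index lists of the successive occurrences of an ordered class in one page."""
    idx = [i for i, t in enumerate(page) if t in tokens]
    k = len(tokens)
    return [idx[j:j + k] for j in range(0, len(idx), k)]

def positions_used(pages, token, other):
    """Positions of class `other` inside which `token` is found (empty if always outside)."""
    used = set()
    for page in pages:
        occs = occurrences(page, other)
        for i, t in enumerate(page):
            if t != token:
                continue
            for occ in occs:
                if occ[0] < i < occ[-1]:
                    # position = number of tokens of this occurrence seen before i
                    used.add(sum(1 for j in occ if j < i))
    return used

def almost_nested(pages, a, b):
    """Each token of a uses at most one position of b, and vice versa."""
    return (all(len(positions_used(pages, t, b)) <= 1 for t in a)
            and all(len(positions_used(pages, t, a)) <= 1 for t in b))

# Token sequences in the style of Example A.2 (letters are placeholder tokens).
pages = [
    "A B C D A E F B C G H D".split(),
    "A E F B C G H D".split(),
    "E F G H".split(),
]
first, second = set("ABCD"), set("EFGH")
result = almost_nested(pages, first, second)

# A contrasting hypothetical input: token E shows up in two different positions.
other_pages = ["A E B E C D".split()]
counter = almost_nested(other_pages, set("ABCD"), {"E"})
print(result, counter)
```

In the first input every token of one class sits in a single position of the other class or outside it, so the check succeeds; in the second input the repeated token falls into two positions and the check fails.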
The following lemma is the analogue of Lemma 4.3 for Type $A$ equivalence classes.
Lemma A.1 An unordered Type $A$ equivalence class is almost-ordered. Two ordered equivalence classes,
each of which is either a valid equivalence class or an invalid Type $A$ equivalence class, are almost-nested
with respect to each other. □
We are now ready to describe HANDINV.
First, HANDINV checks whether each equivalence class is ordered or almost-ordered. Any equivalence class
that is neither ordered nor almost-ordered is neither a valid equivalence class nor a Type $A$ equivalence class. Such
equivalence classes are removed from consideration. Any almost-ordered equivalence class is partitioned
into ordered, non-overlapping equivalence classes (Definition A.1). This results in a set of ordered equivalence
classes.
Next, HANDINV checks the nesting property of every pair of remaining equivalence classes. If two
equivalence classes $\mathcal{E}_i$ and $\mathcal{E}_j$ are neither nested nor almost-nested with respect to each other, at least one of
them is a Type $B$ invalid equivalence class. HANDINV greedily picks the one that is badly nested with
many other equivalence classes and deletes it from the set of equivalence classes. The intuition is that the
non-Type $A$ equivalence class in the pair is likely to be improperly nested with many other equivalence
classes, while the valid or Type $A$ equivalence class in the pair is not. In other words, the valid
and Type $A$ equivalence classes mutually "vote" each other as "good", and thus help in locating the "bad"
equivalence classes. The following example illustrates this idea.
Example A.3 Consider our running example. Let the size threshold for determining whether an equivalence
class is an LFEQ be Á. For this threshold, there are three equivalence classes: $\mathcal{E}_{Ã1}$, $\mathcal{E}_{Ã3}$, and
$\mathcal{E}_{Æ}$ = {<É>, </É>}. With knowledge of the schema $S_{Ã}$ and template $T_{Ã}$, we know that $\mathcal{E}_{Æ}$ is an invalid
Type $B$ equivalence class.
Consider the nesting properties of the three pairs formed from the three equivalence classes. The
equivalence classes $\mathcal{E}_{Ã1}$ and $\mathcal{E}_{Ã3}$ are nested with respect to each other. The other pairs, ($\mathcal{E}_{Ã1}$, $\mathcal{E}_{Æ}$) and ($\mathcal{E}_{Ã3}$, $\mathcal{E}_{Æ}$), are not nested. Since $\mathcal{E}_{Æ}$ participates in a larger number of non-nesting relationships, HANDINV predicts
that $\mathcal{E}_{Æ}$ is a Type $B$ equivalence class and discards it. □
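The greedy "voting" step can be sketched as follows. This is a minimal illustration, assuming the pairwise test (nested or almost-nested) is already available as a predicate; the function name `discard_badly_nested`, the class labels, and the tie-breaking rule (the first class with the highest conflict count is removed) are assumptions, not details taken from the procedure itself.

```python
def discard_badly_nested(classes, well_nested):
    """Repeatedly drop the class that is badly nested with the most other classes."""
    remaining = list(classes)
    while True:
        conflicts = {
            c: sum(1 for d in remaining if d != c and not well_nested(c, d))
            for c in remaining
        }
        if not conflicts or max(conflicts.values()) == 0:
            return remaining
        # Assumed tie-break: the first class with the highest count is removed.
        worst = max(remaining, key=conflicts.get)
        remaining.remove(worst)

# Hypothetical outcome of the pairwise checks: only E_x and E_y nest with each other.
good_pairs = {frozenset({"E_x", "E_y"})}

def well_nested(a, b):
    return frozenset({a, b}) in good_pairs

kept = discard_badly_nested(["E_x", "E_y", "E_z"], well_nested)
print(kept)
```

Here the third class takes part in two failing pairs while the others take part in one each, so it is the one removed, after which no conflicts remain.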
After the previous step, every pair of remaining equivalence classes is either
nested or almost-nested with respect to each other. In the final step, HANDINV considers each pair of equivalence
classes that is almost-nested but not nested. Let $\mathcal{E}_i$ and $\mathcal{E}_j$ be a pair of almost-nested equivalence
classes. At least one of $\mathcal{E}_i$ and $\mathcal{E}_j$ is an invalid Type $A$ equivalence class. It is also possible that both are
invalid Type $A$ equivalence classes. HANDINV tries to predict the invalid equivalence classes and partition
them into smaller equivalence classes such that the resulting set of equivalence classes is nested.
It can be easily verified that if $\mathcal{E}_i$ is a valid equivalence class and $\mathcal{E}_j$ an invalid Type $A$ equivalence
class, then each occurrence of $\mathcal{E}_j$ overlaps with an occurrence of $\mathcal{E}_i$ but not vice versa (otherwise, $\mathcal{E}_i$ and
$\mathcal{E}_j$ would have the same occurrence vector and would have belonged to the same equivalence class). In this
case, HANDINV partitions $\mathcal{E}_j$: each contiguous set of tokens of $\mathcal{E}_j$ that occur in the same position $p$ of an
occurrence of $\mathcal{E}_i$ belongs to the same partition. Each partition of tokens is made into a separate equivalence
class.
If there exists some occurrence of $\mathcal{E}_i$ that does not overlap with $\mathcal{E}_j$ and there exists some occurrence of $\mathcal{E}_j$
that does not overlap with $\mathcal{E}_i$, then neither $\mathcal{E}_i$ nor $\mathcal{E}_j$ is valid, and both are partitioned: each contiguous
set of tokens of $\mathcal{E}_j$ (resp. $\mathcal{E}_i$) that occur in the same position $p$ of an occurrence of $\mathcal{E}_i$ (resp. $\mathcal{E}_j$) belongs to
the same partition.
Example A.4 Consider the two equivalence classes $\mathcal{E}_1$ and $\mathcal{E}_2$ from Example A.2. HANDINV would predict
that both equivalence classes are invalid, since there exists an occurrence of $\mathcal{E}_1$ (the first occurrence in $p_1$) that
does not overlap with an occurrence of $\mathcal{E}_2$, and an occurrence of $\mathcal{E}_2$ (in page $p_3$) that does not overlap with an
occurrence of $\mathcal{E}_1$. HANDINV would partition $\mathcal{E}_1$ into three equivalence classes $\{A\}$, $\{B, C\}$, $\{D\}$ and $\mathcal{E}_2$
into two equivalence classes $\{E, F\}$ and $\{G, H\}$.
On the other hand, if the relative occurrence of tokens $A$-$H$ in $p_3$ were to be $A\,B\,C\,D$, then HANDINV
would predict that $\mathcal{E}_1$ is valid and $\mathcal{E}_2$ is invalid. In this case, it would partition $\mathcal{E}_2$ alone into two equivalence
classes $\{E, F\}$ and $\{G, H\}$. □
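The final step can be sketched for a single almost-nested pair. The code below is an illustration under assumptions: both classes are ordered and given as token lists in their order of occurrence, pages are lists of token strings, overlap means intersecting occurrence spans, and tokens that never fall inside the other class are grouped with their neighbours only when contiguous. The letter tokens and all function names are hypothetical.

```python
from itertools import groupby

def occurrences(page, tokens):
    """Index lists of the successive occurrences of an ordered class in one page."""
    idx = [i for i, t in enumerate(page) if t in tokens]
    k = len(tokens)
    return [idx[j:j + k] for j in range(0, len(idx), k)]

def has_free_occurrence(pages, a, b):
    """True if some occurrence of class a overlaps no occurrence of class b."""
    for page in pages:
        b_occs = occurrences(page, b)
        for occ in occurrences(page, a):
            if not any(occ[0] <= o[-1] and o[0] <= occ[-1] for o in b_occs):
                return True
    return False

def position_of(pages, token, other):
    """Position of `other` that holds `token`, or None if it is always outside."""
    for page in pages:
        for occ in occurrences(page, other):
            for i, t in enumerate(page):
                if t == token and occ[0] < i < occ[-1]:
                    return sum(1 for j in occ if j < i)
    return None

def split_by_position(pages, a, other):
    """Group contiguous tokens of a that share the same position of `other`."""
    keyed = [(position_of(pages, t, other), t) for t in a]
    return [[t for _, t in grp] for _, grp in groupby(keyed, key=lambda kt: kt[0])]

def handle_pair(pages, a, b):
    """Partition whichever class is predicted invalid (possibly both)."""
    a_free = has_free_occurrence(pages, a, b)
    b_free = has_free_occurrence(pages, b, a)
    new_a = split_by_position(pages, a, b) if b_free else [a]
    new_b = split_by_position(pages, b, a) if a_free else [b]
    return new_a, new_b

first, second = list("ABCD"), list("EFGH")
p1 = "A B C D A E F B C G H D".split()
p2 = "A E F B C G H D".split()

# Inputs in the style of Example A.4 (letters are placeholder tokens).
both = handle_pair([p1, p2, "E F G H".split()], first, second)
only_second = handle_pair([p1, p2, "A B C D".split()], first, second)
print(both)
print(only_second)
```

With the third page consisting of the second class only, both classes have a free occurrence and both are split; with the third page consisting of the first class only, just the second class is split.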
The set of equivalence classes after the last step is the output of HANDINV. It can be verified that the
output is a nested set of ordered equivalence classes.