
Towards Combining Web Classification and Web Information Extraction: a Case Study

Ping Luo, Fen Lin, Yuhong Xiong, Yong Zhao, Zhongzhi Shi

HP Laboratories
HPL-2009-86

Keyword(s): Classification, Information extraction, Graphical model

Abstract: Web content analysis often has two sequential and separate steps: Web Classification to identify the target Web pages, and Web Information Extraction to extract the metadata contained in the target Web pages. This decoupled strategy is highly ineffective since the errors in Web classification will be propagated to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose to combine them by using a model of Conditional Random Fields (CRFs). This model can be used to simultaneously recognize the target Web pages and extract the corresponding metadata. Systematic experiments in our project OfCourse for online course search show that this model significantly improves the F1 value for both of the two steps. We believe that our model can be easily generalized to many Web applications.

External Posting Date: April 21, 2009 [Fulltext]. Approved for External Publication.
Internal Posting Date: April 21, 2009 [Fulltext].

To be published and presented at the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.

Copyright The 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.


Towards Combining Web Classification and Web Information Extraction: a Case Study

Ping Luo†, Fen Lin‡, Yuhong Xiong†, Yong Zhao†, Zhongzhi Shi‡
†Hewlett Packard Labs China, {ping.luo, yong.zhao, yuhong.xiong}@hp.com
‡Institute of Computing Technology, CAS, {linf, shizz}@ics.ict.ac.cn

ABSTRACT

Web content analysis often has two sequential and separate steps: Web Classification to identify the target Web pages, and Web Information Extraction to extract the metadata contained in the target Web pages. This decoupled strategy is highly ineffective since the errors in Web classification will be propagated to Web information extraction and eventually accumulate to a high level. In this paper we study the mutual dependencies between these two steps and propose to combine them by using a model of Conditional Random Fields (CRFs). This model can be used to simultaneously recognize the target Web pages and extract the corresponding metadata. Systematic experiments in our project OfCourse for online course search show that this model significantly improves the F1 value for both of the two steps. We believe that our model can be easily generalized to many Web applications.

Categories and Subject Descriptors
I.5.1 [Pattern Recognition]: Models - Statistical

General Terms
Algorithms, Experimentation

Keywords
Classification, Information extraction, Graphical model

1. INTRODUCTION

The explosive growth and popularity of the World Wide Web has resulted in a huge amount of information on the Internet. Currently, access to this huge collection of information has been limited to searching and browsing at the level of Web pages. However, sophisticated Web applications, such as vertical search and entity search, call for flexible information extraction techniques that transform information on the Web pages into structured (relational) data [7]. In recent years, progress in this area has moved us closer to this goal. Some research and industry groups have built systems that support more precise queries on certain content verticals, such as online products [2], academic publications [8], etc.

Usually, to build such systems we need a process that is a cascade of two main steps, i.e., Web classification and Web information extraction, after the relevant Web pages are downloaded by focused crawling [10]. Since focused crawling inevitably includes some irrelevant pages, we need to conduct a finer level of Web-page classification to separate out the relevant pages from the irrelevant ones. Then, after the Web pages are further categorized, Web information extraction is performed on the truly relevant pages to extract target entities with their metadata. As stated by McCallum [7], it is highly ineffective to use this decoupled strategy of attempting to do Web classification and Web information extraction in two separate phases. Specifically, since these two steps are separate, the errors in Web classification will be propagated to Web information extraction and eventually accumulate to a high level. Therefore, the overall performance is upper-bounded by that of Web classification.

In this paper, we propose a method to combine Web classification and Web information extraction and achieve mutual enhancement between these two operations. Rather than conducting these two steps separately and sequentially, our method utilizes a probabilistic graphical model to simultaneously detect the target Web pages and extract the metadata in them, through which we aim to improve both the recall and precision of Web classification and Web information extraction.

1.1 Motivating Examples

We begin by illustrating the problem with an example drawn from a project OfCourse at HP Labs China. OfCourse¹ is a vertical search engine for online courses. The goal of this project is to accurately identify the online course pages and extract key information from them. The key information includes course title, course time, course ID and so on, and we call it course metadata. The traditional approach to reach this goal is to first identify the course homepages by Web page classification, then extract metadata from the homepages. After studying this problem, we find that these two steps are closely related in that information obtained from one step can greatly help the other step. On one hand, if a Web page is a course homepage it usually contains a course title. On the other hand, the existence of some course metadata, in turn, indicates that the current Web page is a course homepage.

¹Available at http://fusion.hpl.hp.com/OfCourse/


Figure 1: The Motivating Examples of Course Homepages. (a) with course features and without course metadata; (b) without course features and with course metadata.

This means that there is a forward dependency from Web classification to Web information extraction, and also a backward dependency from Web information extraction to Web classification.

To understand, at an intuitive level, the room for improvement from considering the backward dependency, we give two typical course homepages, as shown in Figure 1. Before we detail these two examples, we first give a brief introduction on how to identify course homepages by text classification. Online course homepages usually contain some terms about course and teaching, e.g., “instructor”, “teaching assistant”, “course scheduling”, “syllabus”, etc. These terms, which frequently appear in course homepages, can be used as features to distinguish them from non course homepages. However, the occurrence of these terms is not a sufficient condition for course homepages. The Web page in Figure 1(a) describes how to enroll in a class, and also contains many terms about course and teaching. Thus, it is likely that a classifier makes a wrong decision by predicting it to be a course homepage. As for the course homepage in Figure 1(b), it contains a conspicuous course title “Quantum Mechanics” with the course ID “physics 113B”. However, it lacks the course terms to indicate that it is a course homepage (“Recommended Books” is the only course term it contains). Thus, it is likely that a classifier predicts the page in Figure 1(b) to be not a course homepage. To show the effect of the dependency from information extraction to classification, assume for the moment that there exists an oracle to tell whether a Web page contains a course title or not, and that a Web page is a course homepage if and only if it contains a course title. Then, under this assumption we can reach the right conclusion that the Web page in Figure 1(a) is not a course homepage since it does not contain a course title, and the one in Figure 1(b) is a course homepage since it contains a course title. These two examples show that we can leverage the existence of certain entity metadata to improve not only the precision (shown by the example in Figure 1(a)) but also the recall (shown by the example in Figure 1(b)) in classification and extraction. However, how can we obtain the oracle for course metadata? This is the problem we address in this paper.

1.2 Our Solution

As described above, we find that Web classification and Web information extraction are actually two coupled steps, and they should not be separated. Therefore, in this paper we propose a method to jointly and simultaneously optimize these two steps by the probabilistic graphical model CRFs [5].

Given a Web page x and a set of DOM (Document Object Model) tree leaf nodes x_1, ..., x_k in x, we formulate the problem of Web classification and Web information extraction as the task of assigning labels to x, x_1, ..., x_k. The label on x indicates the class of this Web page, while the labels on x_1, ..., x_k indicate the types of the metadata for each leaf node. This way these two tasks can be integrated into one probabilistic model. Additionally, we explicitly consider the constraints between Web classification and Web information extraction (actually the constraints among the labels of x, x_1, ..., x_k), and inject these constraints into both model learning and prediction. This way we greatly reduce the label space and achieve exact inference efficiently.
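For intuition, here is a minimal sketch of this representation using only the Python standard library. The simple parser below, which treats every non-empty text chunk as a leaf node, is an illustrative assumption, not the preprocessing pipeline used in the paper.

```python
# Representing a Web page x by its DOM-tree leaf text nodes x_1, ..., x_k.
# The labels y (page class) and y_1, ..., y_k (metadata type per node) are
# what the joint model assigns.
from html.parser import HTMLParser

class LeafTextCollector(HTMLParser):
    """Collects non-empty text chunks, approximating DOM leaf nodes."""
    def __init__(self):
        super().__init__()
        self.leaves = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.leaves.append(text)

html = "<html><body><h1>Quantum Mechanics</h1><p>physics 113B</p></body></html>"
collector = LeafTextCollector()
collector.feed(html)
print(collector.leaves)   # ['Quantum Mechanics', 'physics 113B']
```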

To the best of our knowledge, our approach is the first to simultaneously conduct Web classification and Web information extraction by a unified graphical model. Specifically, we make the following contributions: 1) A mutually beneficial integration of Web classification and Web information extraction. This method explicitly considers both the forward and backward dependencies between Web classification and Web information extraction, and it shows significant improvement in the F1 value of these two steps. 2) An empirical study of this combination method on the task of online course search. Two baseline methods are proposed and compared with our unified graphical model. 3) The online course search engine OfCourse. It is built based on the method proposed in this paper and is fully functional for public access now. It currently contains over 60,000 courses from the top 50 schools in the US.

2. PROBLEM FORMULATION

In the following we use our work in online course search as an example to formulate the problem of combining Web classification and Web information extraction.


Figure 2: Examples of course homepages. (a) Course homepage 1; (b) Course homepage 2.

2.1 Problem Description

In this work, Web classification is to identify the course homepages in a given group of Web pages, and Web information extraction is to extract the course metadata from the course homepages. Here, the course homepages, whose contents are provided by the teachers of the courses, include all the information about the course. Figure 2 gives two example segments of course homepages. In this paper, we consider three types of course metadata:

• Course Title: the name of the course.
• Course ID: the identifier of the course, often provided by the university or the department.
• Course Time: the semester or time of the year for the course (it is not the time of the week when the lectures are offered, like Tuesday and Thursday).

These three kinds of course metadata in Figure 2(a) are “Data Structures”, “ECS 110”, and “Spring 1997” respectively, while those in Figure 2(b) are “Introductory Microeconomics”, “Economics 304K”, and “Spring 2000” respectively.

As shown by these two course homepages, course ID & time usually conform to certain patterns, which can be expressed by regular expressions (the patterns for course ID & time, and their extraction process, will be detailed later in Section 4.3). However, since online course pages are written in different formats by different authors and they cover a variety of disciplines, it is impossible to manually compose rules, in terms of functions of dozens of features, to extract course titles. Although a course title is usually shown prominently in terms of both position and format, the degree of prominence of course titles cannot be accurately defined by simple rules. For example, one might think that a course title is the field with the biggest font size in the course homepage. The example in Figure 2(b) actually invalidates this assumption: in this course homepage the field with the biggest font size is the course ID “Economics 304K”, while the course title “Introductory Microeconomics” is below it in a smaller font. Therefore, in this work we adopt the statistical models proposed in this paper to extract course titles, and use pattern matching to extract course ID & time. The reasons why we do not incorporate the extraction of course ID & time into the unified graphical model will also be detailed in Section 4.3.

2.2 Problem Formulation

Usually, the input HTML Web page is represented as a DOM tree, and a leaf node in the DOM tree contains text and format information (in the following we call a leaf node in the DOM tree a DOM node for short). Here, we assume that a DOM node may correspond to only one kind of metadata. Although this assumption is true in most cases, there are some exceptions as shown in Figure 2, which require some preprocessing. The text “ECS 110 - Data Structures - Spring 1997 (Sections A & B)” inside the red rectangle in Figure 2(a) is a DOM node, which contains all of the three kinds of course metadata: ID (“ECS 110”), title (“Data Structures”) and time (“Spring 1997”). In this case, we have to split such a DOM node into multiple units. We conduct node splitting by turning every candidate of course ID & time, which matches the course ID & time patterns, into a new DOM node.

Therefore, a Web page can be represented as $x, x_1, \cdots, x_k$, where $x$ is the input (observed) variable corresponding to the whole page and $x_1, \cdots, x_k$ are the $k$ DOM nodes of this page after preprocessing. Based on this representation of a Web page, we can now formally define the concepts of Web Classification and Web Information Extraction as follows.

Definition 1. (Web Classification) Given a Web page, Web classification is the task of assigning a label to this page from a pre-defined set of labels.

Definition 2. (Web Information Extraction) For all the DOM nodes after preprocessing, information extraction is the process of assigning metadata labels to these nodes.

The problem of joint optimization of both Web classification and Web information extraction is defined as follows.

Definition 3. (Joint Optimization of Web Classification and Web Information Extraction) Given a Web page represented as $\vec{x} = x, x_1, \cdots, x_k$, where $x$ is the input variable corresponding to the whole page and $x_1, \cdots, x_k$ are the input variables of the DOM nodes after preprocessing, let $\vec{y} = y, y_1, \cdots, y_k$ be one possible label assignment for $x, x_1, \cdots, x_k$. By the principle of Maximum A Posteriori (MAP), the goal of Web classification and Web information extraction is to identify the label assignment $y^*$ such that

$$y^* = \arg\max_{\vec{y}} P(\vec{y}|\vec{x}). \quad (1)$$

Note that the label sets for classification and information extraction are different. Specifically, the label set for $y$ contains all the categories of Web pages, while the label set for $y_1, \cdots, y_k$ contains all types of metadata. Based on the above definitions, the main difficulty is the calculation of the conditional probability $P(\vec{y}|\vec{x})$. In the next section we introduce a model of CRFs with constrained output to compute this probability.


Figure 3: The graphical model for combining Web classification and Web information extraction

3. CRFS FOR TASK COMBINATION

In this section, we first introduce some basic concepts of CRFs and its simplest version for traditional classification problems with a single output variable. Then, we describe our proposed model for integrating Web classification and Web information extraction, and give the method to learn the parameters and perform prediction.

3.1 Preliminary

CRFs [5] is an undirected graphical model that is globally conditioned on observations. Let $G = (\vec{v}, \vec{e})$ be an undirected graph, where the node set can be partitioned into two groups $\vec{x}$ and $\vec{y}$: $\vec{x}$ are input variables over the observations, and $\vec{y}$ are output variables over the corresponding labels. The conditional distribution of the labels $\vec{y}$ given the observations $\vec{x}$ has the form

$$P(\vec{y}|\vec{x}) = \frac{1}{Z(\vec{x})} \prod_{c \in C} \varphi_c(\vec{x}_c, \vec{y}_c), \quad (2)$$

where $C$ is the set of all the cliques in $G$, $\vec{y}_c$ and $\vec{x}_c$ are the components of $\vec{y}$ and $\vec{x}$ respectively, $\varphi_c$ is a potential function with non-negative real values, and $Z(\vec{x})$ is the normalization factor with the form

$$Z(\vec{x}) = \sum_{\vec{y}} \prod_{c \in C} \varphi_c(\vec{x}_c, \vec{y}_c). \quad (3)$$

The potential function can be expressed in terms of the feature functions $q_i(\vec{x}_c, \vec{y}_c)$ and their weights $\lambda_i$ as

$$\varphi_c(\vec{x}_c, \vec{y}_c) = \exp\Big(\sum_i \lambda_i q_i(\vec{x}_c, \vec{y}_c)\Big). \quad (4)$$

Let $\vec{x} = x$ and $\vec{y} = y$ ($\vec{x}$ and $\vec{y}$ are both single-variable vectors); then $P(y|x)$ simplifies to

$$P(y|x) = \frac{\exp(\sum_i \alpha_i q_i(x, y))}{\sum_{y} \exp(\sum_i \alpha_i q_i(x, y))}. \quad (5)$$

When $y$ is a discrete random variable, Equation (5), which is equivalent to Logistic Regression [3], can be used for classification.
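As a concrete illustration of Equation (5), the following minimal sketch computes P(y|x) by normalizing exponentiated per-label linear scores, which is exactly multiclass logistic regression; the feature vector and weights are invented for the example.

```python
import numpy as np

def p_y_given_x(x, weights):
    """Equation (5): P(y|x) over the label set.

    x: feature values q_i(x, .) of the observation, shape (d,)
    weights: one weight vector alpha per label, shape (n_labels, d)
    """
    scores = weights @ x              # sum_i alpha_i q_i(x, y), one score per y
    scores -= scores.max()            # stabilize the exponentials
    expo = np.exp(scores)
    return expo / expo.sum()          # normalize over all labels y

x = np.array([1.0, 0.0, 2.0])
weights = np.array([[0.5, -0.2, 0.1],   # label y = -1
                    [0.1,  0.3, 0.4]])  # label y = +1
print(p_y_given_x(x, weights))          # a distribution over the two labels
```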

3.2 CRFs for Two Joint Tasks

In the graphical model of Figure 3 (in this figure an open circle indicates that the variable is not generated by the model, and the edges among $y_1, \cdots, y_k$ are omitted), $x$ and $y$ are the observed and the output variables for a Web page respectively. The label space of $y$ contains all the possible class labels for Web classification; in our application, $y = \pm 1$, indicating whether the page is a course homepage or not. $x_j$ $(j = 1, \cdots, k)$ is the observed variable corresponding to a DOM node in the Web page $x$, and $y_j$ $(j = 1, \cdots, k)$ is the output variable of $x_j$. The label space of $y_j$ contains all the possible labels for information extraction; in our application, $y_j = \pm 1$, indicating whether the node is a course title or not. Then, the conditional probability can be expressed as

$$P(\vec{y}|\vec{x}) = P(y, y_1, \cdots, y_k | x, x_1, \cdots, x_k) = \frac{\exp(E(y, y_1, \cdots, y_k, x, x_1, \cdots, x_k))}{\sum_{\vec{y}} \exp(E(y, y_1, \cdots, y_k, x, x_1, \cdots, x_k))}, \quad (6)$$

where

$$E(\vec{y}, \vec{x}) = E(y, y_1, \cdots, y_k, x, x_1, \cdots, x_k) = \sum_i \alpha_i f_i(x, y) + \sum_{j=1}^{k} \sum_r \beta_r g_r(x_j, y_j) + \sum_t \gamma_t h_t(y, y_1, \cdots, y_k) = \vec{\alpha}^T \vec{f}(x, y) + \sum_{j=1}^{k} \vec{\beta}^T \vec{g}(x_j, y_j) + \vec{\gamma}^T \vec{h}(y, y_1, \cdots, y_k), \quad (7)$$

and $f_i$ is the feature function applied to $x$ and $y$ for Web classification, $g_r$ is the feature function applied to $x_j$ and $y_j$ on the $j$-th DOM node for Web information extraction ($j = 1, \cdots, k$), and $h_t$ is the feature function on all the output variables $y, y_1, \cdots, y_k$. The $h$ functions explicitly consider the dependency between $y$ and the set of $y_1, \cdots, y_k$. This dependency is denoted by the lines between $y$ and the set of $y_1, \cdots, y_k$ in Figure 3. This way, the $h$ functions further bridge the labels of the Web page and all the DOM nodes, and Equations (6) and (7) become a general model for both Web classification and information extraction.
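Computationally, the energy of Equation (7) is just three groups of dot products, one per parameter vector; a minimal sketch with invented dimensions:

```python
import numpy as np

def energy(alpha, beta, gamma, f_xy, g_per_node, h_y):
    """E(y, x) of Equation (7): page term + per-node terms + label-constraint term."""
    return (alpha @ f_xy                          # alpha^T f(x, y)
            + sum(beta @ g for g in g_per_node)   # sum_j beta^T g(x_j, y_j)
            + gamma @ h_y)                        # gamma^T h(y, y_1, ..., y_k)

rng = np.random.default_rng(0)                    # illustrative random features
print(energy(rng.normal(size=4), rng.normal(size=5), rng.normal(size=2),
             rng.normal(size=4), [rng.normal(size=5) for _ in range(3)],
             rng.normal(size=2)))
```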

3.3 Model Inference with Constrained Output

Given a set of labeled data $\{(\vec{x}_l, \vec{y}_l)\}_{l=1}^{n}$, the model parameters $\vec{\alpha}$, $\vec{\beta}$ and $\vec{\gamma}$ can be estimated under the principle of Maximum A-Posteriori. Specifically, with the Gaussian prior, the model learning problem is to find the models that maximize

$$L(\vec{\alpha}, \vec{\beta}, \vec{\gamma}) = \sum_{l=1}^{n} \ln P(\vec{y}_l|\vec{x}_l) - \frac{\lambda_1}{2}\|\vec{\alpha}\|^2 - \frac{\lambda_2}{2}\|\vec{\beta}\|^2 - \frac{\lambda_3}{2}\|\vec{\gamma}\|^2. \quad (8)$$

This concave function can be globally optimized by nonlinear optimization methods, such as L-BFGS [6] (we adopt this method in this paper). The gradient vectors on $\vec{\alpha}$, $\vec{\beta}$ and $\vec{\gamma}$ are given as

$$\frac{\partial L}{\partial \vec{\alpha}} = \sum_{l=1}^{n} \Big( \vec{f}(x_l, y_l) - \frac{\sum_{\vec{y}} \vec{f}(x_l, y) \exp(E(\vec{y}, \vec{x}_l))}{\sum_{\vec{y}} \exp(E(\vec{y}, \vec{x}_l))} \Big) - \lambda_1 \vec{\alpha}, \quad (9)$$

$$\frac{\partial L}{\partial \vec{\beta}} = \sum_{l=1}^{n} \Big( \sum_{j=1}^{k_l} \vec{g}(x_l^j, y_l^j) - \frac{\sum_{\vec{y}} \big( \sum_{j=1}^{k_l} \vec{g}(x_l^j, y^j) \big) \exp(E(\vec{y}, \vec{x}_l))}{\sum_{\vec{y}} \exp(E(\vec{y}, \vec{x}_l))} \Big) - \lambda_2 \vec{\beta}, \quad (10)$$

$$\frac{\partial L}{\partial \vec{\gamma}} = \sum_{l=1}^{n} \Big( \vec{h}(y_l, y_l^1, \cdots, y_l^{k_l}) - \frac{\sum_{\vec{y}} \vec{h}(y, y^1, \cdots, y^{k_l}) \exp(E(\vec{y}, \vec{x}_l))}{\sum_{\vec{y}} \exp(E(\vec{y}, \vec{x}_l))} \Big) - \lambda_3 \vec{\gamma}, \quad (11)$$

where $E$ is defined in Equation (7), and $k_l$ in Equations (10) and (11) is the number of DOM nodes in the $l$-th Web page.

Given the model, we are interested in finding the most probable configuration of $\vec{y}$ in the label space for a given instance $\vec{x}$, as shown in Equation (1). This is the process of model inference.

The key problem in both model learning and inference is the computation of the normalization factor $\sum_{\vec{y}} \exp(E(\vec{y}, \vec{x}))$, since the label space of $\vec{y}$ is usually exponential in the length of $\vec{y}$. However, using the relation between the page type and the metadata, we can significantly reduce the output label space. In our application of course homepage identification and course metadata extraction, we assume that 1) a course homepage contains one and only one course title; and 2) a non course homepage does not contain a course title.

These two assumptions are reasonable for the following reasons. 1) Here, a course homepage is defined as the entry page of an online course, including all the information about the course; thus, the course title is an indispensable element of a course homepage². 2) A course homepage may contain the title string more than once, but we only consider the most prominent one in terms of format and place as the course title. 3) A non course homepage may talk about a course by mentioning the course name. Since the mentioned course name is not the key information in a non course homepage, it does not serve the purpose of a title by highlighting the format and position, so we do not consider this appearance of a course name as the metadata we want to extract. 4) A course listing page may list all the information of several different courses, including their titles, IDs and time. A course listing page is not considered a course homepage in this paper.

²There do exist some rare course homepages which do not contain a course title; we ignore these course homepages in this work.

Thus, under these assumptions the label space of $y, y_1, \cdots, y_k$ has only $(k + 1)$ elements:

$$\begin{array}{l} y = -1, \; y_1 = -1, \; y_2 = -1, \; \cdots, \; y_k = -1 \\ y = 1, \; y_1 = 1, \; y_2 = -1, \; \cdots, \; y_k = -1 \\ y = 1, \; y_1 = -1, \; y_2 = 1, \; \cdots, \; y_k = -1 \\ \cdots \\ y = 1, \; y_1 = -1, \; y_2 = -1, \; \cdots, \; y_k = 1 \end{array}$$

With this reduced label space the denominator $\sum_{\vec{y}} \exp(E(\vec{y}, \vec{x}))$ is efficiently tractable. Therefore, exact inference can be achieved in this model.

In our project we only define two $h$ functions:

$$h_1(y, y_1, \cdots, y_k) = \begin{cases} 1 & \text{if } y = 1 \text{ and } \exists j \text{ such that } y_j = 1 \\ 0 & \text{otherwise,} \end{cases}$$

$$h_2(y, y_1, \cdots, y_k) = \begin{cases} 1 & \text{if } y = -1 \text{ and } y_j = -1 \text{ for } j = 1, \cdots, k \\ 0 & \text{otherwise.} \end{cases}$$

These two $h$ functions are defined to express the correlations between course homepage and course title.
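To make the constrained learning and inference concrete, here is a minimal end-to-end sketch on one invented toy page with k = 2 DOM nodes. It enumerates the (k + 1) admissible assignments, scores them with the energy of Equation (7) using the two h functions above, and maximizes the penalized log-likelihood of Equation (8) with L-BFGS as the paper does; for brevity it lets SciPy approximate the gradients of Equations (9)-(11) numerically instead of coding them.

```python
import numpy as np
from scipy.optimize import minimize

# One toy page with k = 2 DOM nodes; all feature vectors are invented.
f_x = np.array([1.0, 0.5])                           # page-level features
g_x = [np.array([0.2, 1.0]), np.array([0.9, 0.1])]   # one vector per DOM node

def feat_f(y):                       # f(x, y): fires only for y = +1 (a toy choice)
    return f_x if y == 1 else np.zeros(2)

def feat_h(y, ys):                   # the two h functions defined above
    has_title = 1 in ys
    return np.array([float(y == 1 and has_title),
                     float(y == -1 and not has_title)])

def assignments(k):                  # the (k + 1) admissible label assignments
    yield -1, tuple([-1] * k)                        # non course homepage
    for j in range(k):                               # course homepage, title at node j
        yield 1, tuple(1 if i == j else -1 for i in range(k))

def energy(theta, y, ys):            # Equation (7)
    alpha, beta, gamma = theta[0:2], theta[2:4], theta[4:6]
    node_term = sum(beta @ g_x[j] for j, yj in enumerate(ys) if yj == 1)
    return alpha @ feat_f(y) + node_term + gamma @ feat_h(y, ys)

gold = (1, (1, -1))                  # labeled page: course homepage, title at node 0

def neg_objective(theta, lam=0.1):   # negative of Equation (8), single training page
    es = np.array([energy(theta, y, ys) for y, ys in assignments(2)])
    log_z = es.max() + np.log(np.exp(es - es.max()).sum())   # only k + 1 terms
    return -(energy(theta, *gold) - log_z) + 0.5 * lam * (theta @ theta)

res = minimize(neg_objective, np.zeros(6), method="L-BFGS-B")
best = max(assignments(2), key=lambda a: energy(res.x, *a))
print(best)                          # MAP assignment of Equation (1): (1, (1, -1))
```

Because the admissible assignments are enumerated explicitly, the normalization sum has only k + 1 terms, which is exactly the exact-inference property the constraints buy.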

3.4 Some Discussion

Different networked CRFs models, such as 2D CRFs [13] and hierarchical CRFs [14], have been proposed according to the different structures in $\vec{y}$ and $\vec{x}$. In some of these models the computation of the normalization factor of Equation (3) is often intractable; thus, approximate inference is often required in their inference process. However, a recent work by A. Kulesza and F. Pereira [4] shows that the learning of structured models can fail even with an approximate inference method with rigorous approximation guarantees. The proposed model of CRFs injects the constraints among output variables deep into it. This way, we greatly reduce the label space of $\vec{y}$, and thus enable exact inference to achieve the optimal result.

The way to constrain the output label space is application-specific, but we believe that this idea is generally applicable to a wide range of applications. For example, in another project we are trying to classify Web pages as news, blog, product, recipe, and so on, and also extract metadata from the pages. The metadata for different types of pages are different: for example, we want to extract the title, author, and date from news pages, but the product name, picture, and price from product pages. If we use the model in this paper to conduct classification and extraction jointly, we can use the relations and constraints between the page type and the metadata to significantly reduce the output label space and make the problem tractable.

4. FEATURES FOR THE JOINT MODEL

Since a Web page and a DOM node are different entities with different granularity, we use two groups of features to characterize them separately.

4.1 Features for Course Homepage Detection

Course homepages usually contain some course terms, e.g., “instructor”, “teaching assistant”, “course introduction”, “course scheduling”, “syllabus”, etc. These terms are the key features to distinguish course homepages from non course homepages. Additionally, we find that these terms often appear at a conspicuous place with some HTML formatting in a course homepage. To filter out noisy terms, the HTML format and position of a text node are considered. Specifically, we use the following heuristics to extract the candidates of course terms: 1) the course term candidate cannot be too long in terms of the number of English words in it; 2) the course term candidate cannot be located in the right block, top block or bottom block of a Web page; 3) the course term candidate is usually conspicuous in the Web page in terms of its format (e.g., header decoration, bold, all capitals, etc.); 4) a term ending with a colon is a course term candidate.

Furthermore, we manually collected a set of course terms and partitioned them into multiple groups, where the course terms in the same group have similar semantics. Table 1 shows some groups of similar course terms. Each group of course terms corresponds to a feature for Web classification, and this feature is set to true when a course term candidate, extracted by the above heuristics from the Web page, appears in the corresponding term group; a sketch of this mapping is given below. Some preliminary experiments show that the proposed features for Web classification are much better than treating the Web page as a bag of words.
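A minimal sketch of this group-to-feature mapping; the term groups below are a small excerpt in the spirit of Table 1, and the candidate terms are assumed to come from the heuristics above.

```python
# One boolean feature per group of semantically similar course terms.
COURSE_TERM_GROUPS = [
    {"course name", "course title", "course id", "course code"},
    {"instructor", "co-instructor", "course administrator"},
    {"teaching assistant", "ta", "teaching fellow", "course assistant"},
    {"syllabus", "course outline", "course overview", "course summary"},
]

def group_features(candidate_terms):
    """Set a group's feature to True when any candidate falls in the group."""
    terms = {t.strip().lower().rstrip(":") for t in candidate_terms}
    return [bool(group & terms) for group in COURSE_TERM_GROUPS]

print(group_features(["Instructor:", "Syllabus", "Office Hours"]))
# [False, True, False, True]
```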

4.2 Features for Course Title Extraction

To extract course titles we use as many effective features as possible. All these features can be divided into three groups: format features, position features and content features. The format features, which are about font size, font color, decoration type and so on, are used to describe the characteristic that the course title is usually the most prominent field in the whole Web page, or that it is conspicuous locally with a different format from its neighboring texts.


Table 1: The features for course homepage

No.  Course Terms
1    course name, course title, course id, course code, ...
2    instructor, co-instructor, course administrator, ...
3    teaching assistant, ta, teaching fellow, course assistant, ...
4    prerequisite, pre-requisite, course requirement, ...
5    course objective, course target, learning goal, ...
6    course outline, course overview, course summary, ...
7    course introduction, course description, ...
8    course schedule, course plan, course calendar, ...
9    course topic, course content, course document, ...
10   recommended reading, useful book, reference resource, ...
...  ...

The position features are used to capture the heuristic that the course title is often located in a noticeable position, such as the top half region of the Web page. The format features and position features are application-independent; however, the content features are application-dependent. Some typical characteristics addressed by the content features include: 1) the course title cannot be too short or too long in terms of the number of English words in it; 2) the course title cannot be any of the following expressions: Email, URL, Telephone, Time, or combinations of letters and numbers; 3) the course title usually has a course ID or course time around it. Altogether we induce 34 features for course title extraction, including 29 application-independent features and 5 application-dependent features. The details of these features are omitted due to space limitation.
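As an illustration of the three feature groups, a minimal sketch of a per-node feature extractor; the concrete thresholds and fields are invented, not the paper's 34 features.

```python
def title_features(node):
    """node: dict with text, font_size, bold, y_position (0.0 = page top)."""
    words = node["text"].split()
    return {
        "fmt_big_font": node["font_size"] >= 18,                        # format
        "fmt_bold": node["bold"],                                       # format
        "pos_top_half": node["y_position"] < 0.5,                       # position
        "content_len_ok": 2 <= len(words) <= 10,                        # content
        "content_near_id_or_time": node.get("near_id_or_time", False),  # content
    }

print(title_features({"text": "Introductory Microeconomics",
                      "font_size": 16, "bold": True, "y_position": 0.2}))
```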

4.3 Extraction of Course ID & Time

In general, a course ID is a combination of letters and numbers. Usually, its starting letter string is 1) the first letter of each keyword in the course title, 2) the name of the course discipline, 3) the abbreviation of the course discipline, or 4) a combination of 2) or 3) for multi-discipline courses. Following this starting letter string is the course code, comprising numbers, letters, or delimiters. A course time usually contains information about the year, semester, season or month. Therefore, course ID & time can be extracted by matching these patterns. However, not all fields that match the specified patterns are the course ID or course time (for example, other time information, such as publication time and exam time, may appear in a course homepage). That is, matching a pattern is a necessary condition but not sufficient for extracting course ID & time. Nevertheless, we notice that the course ID, course time and course title usually appear close together in a local area of the course homepage. Thus, we can leverage this heuristic to identify the true course ID & time among all the course ID & time candidates.

Based on the above analysis, we can extract course metadata as follows: 1) find all the candidates of course ID & time, which match the corresponding patterns; 2) extract the course title by the proposed graphical model; 3) if Step 2 recognizes the given page as a course homepage and obtains the corresponding course title, identify the course ID & time candidates which are closest to the extracted course title, and output them as the extracted course ID & time.
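A minimal sketch of this procedure's pattern-plus-proximity part; the regular expressions are illustrative stand-ins for the paper's course ID & time patterns, and title_index is assumed to come from the graphical model of Step 2.

```python
import re

# Illustrative stand-ins for the paper's patterns (assumptions, not the
# authors' actual regular expressions).
COURSE_TIME = re.compile(r"\b(Spring|Summer|Fall|Winter)\s+(19|20)\d{2}\b")
COURSE_ID = re.compile(r"\b[A-Za-z]{2,12}\s?-?\s?\d{2,4}[A-Z]?\b")

def extract_id_and_time(nodes, title_index):
    """Among all pattern matches, keep the candidate whose DOM node is
    closest to the node the model labeled as the course title (Step 3)."""
    def closest(pattern, skip=None):
        hits = []
        for j, text in enumerate(nodes):
            m = pattern.search(text)
            if m and not (skip and skip.search(m.group())):
                hits.append((abs(j - title_index), m.group()))
        return min(hits)[1] if hits else None
    # Time matches are removed from the ID candidates because the broad
    # ID pattern would also match strings like "Spring 2000".
    return closest(COURSE_ID, skip=COURSE_TIME), closest(COURSE_TIME)

nodes = ["Economics 304K", "Introductory Microeconomics", "Spring 2000"]
print(extract_id_and_time(nodes, title_index=1))
# ('Economics 304K', 'Spring 2000')
```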

Note that we do not incorporate the extraction of course ID & time into the graphical model. This is because: 1) as mentioned before, the candidates of course ID & time can be easily extracted by pattern matching; 2) excluding course ID & time from the graphical model further simplifies the label space of $y_1, \cdots, y_k$, and thus helps to achieve exact inference in model training and prediction; 3) we notice that there exist some correlations among course title, time and ID. Specifically, course ID, time and title usually appear close together in a local area of the course homepage. To take this point into account, we introduce a feature for course title extraction that checks whether a DOM node has a neighbor that is a course ID or time candidate. We believe that this method has the same effect as adding course ID & time to the label set of course metadata and defining the feature functions on the labels of neighboring DOM nodes. Therefore, this technique reduces the complexity of the graphical model without sacrificing performance.

This approach of using a pattern-based method to extract some metadata and using machine learning to extract the rest is generally applicable in many applications. For example, if we want to build a vertical search engine for product search, the product metadata usually include product name, model number, price, and availability. Here, the model number, price, and availability can be characterized by patterns. Specifically, the model number is a short sequence of characters, numbers, and special characters like “-”; the price usually follows the $ sign and has two digits after the decimal point; and the availability usually contains descriptions like “in stock”, “out of stock”, “available”, or “ships in 1-2 business days”. On the other hand, the product name cannot be characterized by patterns and is better extracted by machine learning methods.

5. BASELINE METHODS

We compare our approach with two baseline methods. For each baseline method, we detail its training and prediction processes. For training, we are given a set $D$ of labeled Web pages, including a set $D_c$ of course homepages and a set $D_n$ of non course homepages. For each course homepage in $D_c$, its course title is also labeled. In these baseline methods, the classification model in Equation (5), which is equivalent to Logistic Regression, is adopted.

5.1 Local Training and Separate Inference

In this method we train two classifiers, $C_c$ and $C_e$, for Web classification and Web information extraction respectively. $C_c$ uses the features for Web classification (detailed in Section 4.1) to give the probability that a Web page is a course homepage, while $C_e$ uses the features for Web information extraction (detailed in Section 4.2) to give the probability that a DOM node is a course title. Specifically, we use all the labeled Web pages in $D$ to train $C_c$, and we use the labeled DOM nodes from the course homepages in $D_c$ to train $C_e$.

In the prediction process we first use $C_c$ to identify all the course homepages. Then, for each identified course homepage, we use $C_e$ to output, for each DOM node in this page, the probability that it is a course title. Finally, we select the DOM node with the largest probability in a course homepage as the node for the course title.
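A minimal sketch of this separate-inference procedure; the two lambdas stand in for the trained classifiers Cc and Ce and are illustrative, not learned models.

```python
def separate_inference(page_feats, node_feats, p_course, p_title):
    """p_course: P(course homepage | page); p_title: P(course title | node)."""
    if p_course(page_feats) < 0.5:
        return -1, None                        # predicted non course homepage
    probs = [p_title(f) for f in node_feats]
    best = max(range(len(probs)), key=probs.__getitem__)
    return 1, best                             # page label, title-node index

label, title_node = separate_inference(
    page_feats=[1.0],
    node_feats=[[0.2], [0.9]],
    p_course=lambda f: 0.8,                    # stand-in for Cc
    p_title=lambda f: f[0])                    # stand-in for Ce
print(label, title_node)                       # 1 1
```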

Since each course homepage contains only one course title (positive instance) while all of its remaining nodes are negative instances, the training data set for $C_e$ is highly unbalanced. However, in Section 6.1 we will show that this property of unbalanced data does not affect the performance of course title extraction from the course homepages.


5.2 Local Training and Joint Inference

In this method we train the same two classifiers, $C_c$ and $C_e$, as those in Section 5.1. In the prediction process, given a Web page $x, x_1, \cdots, x_k$, we use a joint inference similar to the one in Section 3.3 to assign their labels. Specifically, we use the conditional probability in Equation (6), in which the function $E(\vec{y}, \vec{x})$ is replaced by

$$E(\vec{y}, \vec{x}) = E(y, y_1, \cdots, y_k, x, x_1, \cdots, x_k) = \sum_i \alpha_i f_i(x, y) + \eta \sum_{j=1}^{k} \sum_r \beta_r g_r(x_j, y_j) = \vec{\alpha}^T \vec{f}(x, y) + \eta \sum_{j=1}^{k} \vec{\beta}^T \vec{g}(x_j, y_j), \quad (12)$$

where $\vec{\alpha}$ and $\vec{\beta}$ are actually the parameters in $C_c$ and $C_e$, and $\eta > 0$ is the weight to balance the tradeoff between the two models. The larger the value of $\eta$, the more influence the information extraction model $C_e$ has on the final result; the smaller the value of $\eta$, the more influence the classification model $C_c$ has. In this formulation we ignore the $h$ functions of Equation (7).
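A minimal sketch of this joint inference: scoring each constrained assignment of Section 3.3 with Equation (12) and taking the best one. The parameters and features are invented; only the role of η is of interest.

```python
import numpy as np

alpha = np.array([0.7, -0.3])                        # Cc parameters (invented)
beta = np.array([0.5, 0.4])                          # Ce parameters (invented)
f_x = np.array([1.0, 0.2])                           # page features
g_x = [np.array([0.1, 0.9]), np.array([0.8, 0.0])]   # node features

def E(y, ys, eta):
    """Equation (12): page term plus eta-weighted node terms."""
    page = alpha @ f_x if y == 1 else 0.0
    node = sum(beta @ g_x[j] for j, yj in enumerate(ys) if yj == 1)
    return page + eta * node

def joint_inference(k, eta=1.0):
    """Score every constrained assignment of Section 3.3, return the best."""
    cands = [(-1, tuple([-1] * k))] + \
            [(1, tuple(1 if i == j else -1 for i in range(k))) for j in range(k)]
    return max(cands, key=lambda c: E(*c, eta))

print(joint_inference(2, eta=1.0))                   # (1, (1, -1)) for these numbers
```

Increasing eta gives the extraction scores more weight in the final assignment, which is exactly the tradeoff the text above describes.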

In summary, the method in Section 5.1 performs local training and separate inference, while the method in Section 5.2 conducts local training and joint inference. Compared with these baseline methods, our approach conducts joint learning in the training process and joint inference in the prediction process.

6. EXPERIMENTAL EVALUATION

In this section, we report empirical results from applying our joint model to perform classification and information extraction simultaneously. We compare our approach with the two baseline methods detailed in Section 5. For the two classifiers $C_c$ and $C_e$ used in the two baseline methods, we carefully tune the parameters used in their training and compare their best performances with those of the proposed model. The results show that our model significantly improves both Web classification and Web information extraction.

To evaluate all these methods, we collected a labeled data set $D$, including the set $D_c$ of 530 course homepages and the set $D_n$ of 1200 non course pages, and manually annotated the course title, ID and time for each course homepage.

For both Web classification and Web information extraction, we use the standard accuracy, precision, recall and F1 measures to evaluate these methods. For course homepage identification these metrics are calculated using all the data in our data set $D$. For course title extraction the evaluation data sets are different, depending on the extraction method and the situation. Specifically, when evaluating the performance of $C_e$ in course title extraction, we assume that the inputs of $C_e$ are all course homepages; thus, the corresponding evaluation metrics are calculated within the course homepages only. However, when evaluating the joint model for course title extraction, the inputs to this model include both course homepages and non course pages, so the evaluation is conducted with both types of pages. Since our goal is to find all the course titles accurately from a mixture of course homepages and non course pages, the metrics calculated on this mixed data set are the most important measures we care about.

Table 2: The experimental results of Cc for course homepage identification

λ     Accuracy  Precision  Recall  F1
0     0.949     0.903      0.821   0.858
0.1   0.949     0.902      0.818   0.856
1     0.95      0.919      0.807   0.857
5     0.947     0.929      0.775   0.844
10    0.947     0.936      0.768   0.842
20    0.943     0.952      0.736   0.828
100   0.924     0.980      0.607   0.747

Figure 4: The accuracy of Ce for course title extraction from course homepages

Since we aim to show the effectiveness of the proposed graphical model, we omit the experimental results of course ID & time extraction in this section. Note also that all the experimental results in this section are obtained from 10-fold cross validation.

6.1 Results on $C_c$ and $C_e$

The evaluation results of the classifier $C_c$ for course homepage identification are listed in Table 2, where $\lambda$ is the parameter on the penalty term that compensates for the overfitting of complex models. These results show that 1) $\lambda = 0$ achieves the best F1 value; and 2) as the value of $\lambda$ increases, the precision increases while the recall decreases. We have conducted further theoretical analysis on the relationship between $\lambda$ and the precision and recall performance; however, it is beyond the scope of this paper.

We evaluate the performance of $C_e$ for course title extraction under the assumption that $C_e$ operates only on course homepages. Thus, we use only the 530 course homepages to conduct the 10-fold cross validation. Note again that the DOM tree for each page contains only one node that is the course title (the positive instance) but many more (several hundred) negative instances, so the data set is very unbalanced. To alleviate this problem, we under-sample the negative instances by a certain ratio $p$ in each round of cross validation, but keep all the positive instances. The experiments in this part aim to answer the following questions: 1) What is the relationship between the sampling proportion $p$ and the extraction accuracy? 2) How much does the extraction accuracy of course titles improve with the 5 application-dependent features?
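A minimal sketch of this under-sampling step; purely illustrative.

```python
import random

def under_sample(instances, labels, p, seed=0):
    """Keep every positive instance; keep each negative with probability p."""
    rng = random.Random(seed)
    return [(x, y) for x, y in zip(instances, labels)
            if y == 1 or rng.random() < p]

data, labels = list(range(10)), [1] + [-1] * 9
print(under_sample(data, labels, p=0.3))   # all positives, ~30% of negatives
```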

To answer these questions, we measure the performance of $C_e$ with two groups of features (all of the 34 features, and the 29 application-independent features only) and ten different under-sampling ratios: $0.1, 0.2, \cdots, 1$. The results are shown in Figure 4.


Table 3: The experimental results on the two baseline methods

                                            Course Homepage Identification   Course Title Extraction
Baseline Methods                            Accuracy Precision Recall F1     Accuracy Precision Recall F1
Local Training and Separate Inference       0.949    0.903     0.821  0.858  0.941    0.855     0.774  0.811
Local Training and Joint Inference (η = 1)  0.934    0.801     0.871  0.833  0.927    0.764     0.832  0.795

Table 4: The experimental results on the joint optimization method

                 Course Homepage Identification   Course Title Extraction
(λ1, λ2, λ3)     Accuracy Precision Recall F1     Accuracy Precision Recall F1
(0, 0, 0)        0.963    0.922     0.879  0.898  0.953    0.863     0.825  0.842
(1, 1, 0.001)    0.967    0.954     0.864  0.906  0.956    0.889     0.807  0.846
(5, 5, 5)        0.965    0.968     0.839  0.898  0.955    0.904     0.786  0.840
(10, 10, 0.1)    0.963    0.972     0.829  0.894  0.952    0.900     0.768  0.828
(10, 10, 10)     0.963    0.972     0.829  0.897  0.952    0.900     0.768  0.828

We give the experimental findings and the corresponding analysis as follows: 1) The application-dependent features significantly improve the extraction accuracy of course titles. This improvement is independent of the value of the under-sampling ratio; on average, the performance of $C_e$ increases by 8.72% when the application-dependent features are added. 2) $C_e$ yields almost the same performance under different under-sampling ratios. The reason is that different degrees of data imbalance may change the absolute value of the probability that a node is a course title, but they will not change the ranking of the nodes in decreasing order of this probability. Therefore, the node with the biggest probability value in a course Web page remains the same under different imbalance settings.

Note that in the above experiments the parameter $\lambda$ used in training $C_e$ is set to 0; we found that different values of this parameter give similar results.

6.2 Results of Web Classification and Web Information Extraction on Any Web Pages

In this subsection we evaluate the performance of the joint optimization model and the two baseline methods for course homepage identification and course title extraction given a mixture of course homepages and non course pages. Thus, we use all the Web pages in $D$ to conduct the 10-fold cross validation.

As shown in Section 6.1, $C_c$ and $C_e$ achieve their best performances when the parameter $\lambda$ is set to 0 in their training. Thus, we adopt this parameter setting when training $C_c$ and $C_e$ for the two baseline methods. Furthermore, for the second baseline method we evaluate its performance under different values of the tradeoff parameter $\eta$ in Equation (12). The results show that setting $\eta$ to 1 achieves the best F1 value for both course homepage identification and course title extraction; thus, we only report this best performance for the second baseline method.

The performances of these two baseline methods are shown in Table 3. It shows that 1) the first method is slightly better than the second one in terms of the F1 value for course title extraction; and 2) the second method increases the recall of course title extraction at the cost of precision.

The performances of our joint optimization method with different settings of $\lambda_1, \lambda_2, \lambda_3$ in Equation (8) are recorded in Table 4. We also conduct a t-test on the results from the 10-fold cross validation to check whether the proposed model significantly outperforms the baseline methods at a 95% confidence level. The results show that 1) under all these parameter settings the proposed method is significantly better than the two baseline methods in terms of the F1 values for both course homepage identification and course title extraction; 2) under all the parameter settings this joint model significantly outperforms the first baseline method in terms of all the evaluation metrics, including precision, recall and F1 value; and 3) the parameter setting (1, 1, 0.001) achieves the best performance of course title extraction in terms of F1 value. In this parameter setting we intentionally set $\lambda_3$ (the parameter on the $h$ functions) to a very small value in order to increase the effect of the $h$ functions in the model. It shows that this setting slightly improves the performance of the model.

7. OFCOURSE PORTAL

We have applied this combination model to the OfCourse portal for online course search. This portal currently contains over 60,000 courses from the top 50 schools in the US. The current portal supports basic keyword search, advanced search, and browsing by subject areas. In the advanced search, the search can be conducted within the scope of a user-specified university and year. For each course in the search results, the extracted course title is displayed, together with the school, the extracted year, and an excerpt from the course page containing the search term. The course title links to the course homepage on the Web. These functions are all supported by the combination model proposed in this paper. Some user studies show that the results of classification and information extraction are quite satisfactory. Additionally, the fully automatic process of course homepage identification and course metadata extraction allows us to easily make our portal an open system, in which any user can add new courses into the portal by simply submitting their URLs. The model behind the portal identifies whether a submitted URL is a course homepage and, if it is, simultaneously extracts its course metadata. Interested users can experience the effectiveness of our joint model by submitting any URL at the OfCourse³ portal.

8. RELATED WORK

The work most closely related to course title extraction is title extraction from Web pages [12]. The assumption used in [12] is that a general title is usually the most conspicuous field in any Web page. However, the likelihood that this assumption holds is much lower for the course titles of online courses. To achieve high precision in course metadata extraction, we propose some application-dependent features, which implicitly consider the relationships among the features and labels of neighboring DOM nodes.

³http://fusion.hpl.hp.com/OfCourse/addNewCourse.jsp


All of these previous works operate under the assumption that we have a perfect classifier, which can precisely and completely identify the target Web pages (always containing the metadata we want to extract) from a mixture of Web pages. However, this assumption is not always true. The motivating examples in Section 1.1 intuitively show the room for improvement in Web classification from considering the backward dependency from Web information extraction to Web classification. Thus, we seamlessly combine these two tasks into one graphical model to mutually boost their performances.

The methodology of combining multiple mutually-dependent tasks into one statistical graphical model appears in some previous works for different applications. Roth et al. focus on the topic of learning and inference over structured and constrained output with applications in natural language processing [11, 9]. The work in [11] combines the tasks of entity and relation recognition, in which separate classifiers are first trained for entities and relations, and then these local classifiers are used to make global inferences for the most probable assignment for all entities and relations of interest; it actually motivates the baseline method in Section 5.2. Another related work is the integration of data record detection and attribute labeling for extracting online products from product Web pages by Hierarchical CRFs [14], in which both learning and inference are jointly optimized. The work in this paper further emphasizes the difference between local and joint training, and the difference between local and joint inference. A recent paper [1] from Bhattacharya et al. deals with the combination of structured entity identification and document categorization. The two tasks combined in that paper are different from those in this paper. Additionally, they use a generative model in training and prediction, whereas we use the discriminative model CRFs to achieve the combination.

9. CONCLUSIONS

This paper discussed the problem of combining Web classification and Web information extraction, which are actually two coupled steps in Web content analysis. Previous works in this area treat them as two separate and sequential steps, which considers the forward dependency from Web classification to Web information extraction but ignores the backward dependency from Web information extraction to Web classification. To mutually boost these two steps, we propose to combine them by using a model of CRFs. Specifically, this graphical model contains two kinds of observed random variables with different granularity: the variable of the whole Web page for Web classification, and the variables of the DOM nodes inside the Web page for Web information extraction. This way these two steps are formulated as a joint task of assigning labels to these variables. Additionally, the constraints among the two kinds of random variables can be utilized to simplify the learning and inference processes of this model. The experiments in our project OfCourse for online course search show that our joint model significantly outperforms the two baseline methods in terms of F1 value. Even though this approach is designed to solve the specific problem of course homepage identification and course metadata extraction, the methodology of combining Web classification and Web information extraction is problem-independent and can be applied to many other applications.

10. ACKNOWLEDGMENTS

The authors thank Baoyao Zhou for providing part of the test data and a Web page segmentation tool. We are also grateful to Shicong Feng and Liwei Zheng for helpful collaborations in the OfCourse project. The authors Fen Lin and Zhongzhi Shi are partially supported by the National Science Foundation of China (60775035), the 863 National High-Tech Program (No. 2007AA01Z132), and the National Basic Research Priorities Programme (No. 2003CB317004, 2007CB311004).

11. REFERENCES

[1] I. Bhattacharya, S. Godbole, and S. Joshi. Structured entity identification and document categorization: two tasks with one joint model. In Proc. of the 14th ACM SIGKDD, 2008.

[2] M. Castellanos, Q. Chen, U. Dayal, M. Hsu, M. Lemon, P. Siegel, and J. Stinger. Component adviser: a tool for automatically extracting electronic component data from web datasheets. In Proc. of the Workshop on Reuse of Web-based Information, the 7th WWW, 1998.

[3] D. Hosmer and S. Lemeshow. Applied Logistic Regression. Wiley, New York, 2000.

[4] A. Kulesza and F. Pereira. Structured learning with approximate inference. In Proc. of the 21st NIPS, 2007.

[5] J. D. Lafferty, A. McCallum, and F. C. N. Pereira. Conditional random fields: probabilistic models for segmenting and labeling sequence data. In Proc. of the 18th ICML, 2001.

[6] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45:503–528, 1989.

[7] A. McCallum. Information extraction: distilling structured data from unstructured text. ACM Queue, 2005.

[8] Z. Nie, J. Wen, and W. Ma. Object-level vertical search. In Proc. of the Conf. on Innovative Data Systems Research, 2007.

[9] V. Punyakanok, D. Roth, W. Yih, and D. Zimak. Learning and inference over constrained output. In Proc. of the 19th IJCAI, 2005.

[10] J. Rennie and A. McCallum. Using reinforcement learning to spider the web efficiently. In Proc. of the 16th ICML, 1999.

[11] D. Roth and W. Yih. Probabilistic reasoning for entity and relation recognition. In Proc. of the 19th COLING, 2002.

[12] Y. Xue, Y. Hu, G. Xin, R. Song, S. Shi, Y. Cao, C.-Y. Lin, and H. Li. Web page title extraction and its application. Information Processing and Management, 43(5):1332–1347, 2007.

[13] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. 2D conditional random fields for web information extraction. In Proc. of the 22nd ICML, 2005.

[14] J. Zhu, Z. Nie, J.-R. Wen, B. Zhang, and W.-Y. Ma. Simultaneous record detection and attribute labeling in web data extraction. In Proc. of the 12th ACM SIGKDD, 2006.