WEB MINING FOR KNOWLEDGE DISCOVERY
by
Zhongming Ma
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Business Administration
David Eccles School of Business
The University of Utah
May 2007
ABSTRACT
The Web has become an unprecedented world-wide repository of knowledge. It
contains valuable information for managers, analysts, and all types of knowledge
workers; yet the Web is dynamic and noisy. Hence, knowledge discovery from the Web,
while challenging, is an essential tool for the knowledge economy. This
dissertation covers two related topics – personalized search and business relationship
discovery – in the area of knowledge discovery from the Web.
In Part I, we propose an automatic personalized search approach that categorizes
search results under a user’s interests by first mapping a user’s known interests to Open
Directory Project (ODP) categories. In two sets of controlled experiments, we compare
our personalized categorization system (PCAT) with two baseline systems, a list interface
system (LIST) and a nonpersonalized categorization system (CAT). We analyze system
performances on the basis of the type of task and query length and identify conditions
under which our system outperforms a baseline system.
In Part II, we present a news-driven, social network analysis (SNA)-based business
relationship discovery framework and study two different business relationships,
the company revenue relation (CRR) and the competitor relationship, to illustrate
the effectiveness of our approach. As a news story pertaining to a company often cites
several other companies, we construct an intercompany network using such citations,
employ SNA techniques to identify a set of attributes from the network structure, and use
the attributes to predict CRRs and discover the competitor relationships. We find that, for
the two business relationships studied, the structural attributes of the intercompany
network are valuable in predicting the business relationships. Also, our news-driven,
SNA-based business relationship discovery framework is scalable (as compared to
manual approaches) and language-neutral. While we validate our approach with data for
public companies in the U.S., the approach can be easily extended to discover business
relationships for private and foreign companies, for which such data are either
unavailable or hard to collect.
TABLE OF CONTENTS
ABSTRACT.......................................................................................................................iv
ACKNOWLEDGMENTS..................................................................................................ix
Chapter
1 INTRODUCTION.......................................................................................................1
1.1 Knowledge Discovery on the Web..................................................................1
1.2 Personalized Search.........................................................................................4
1.3 Business Relationship Discovery....................................................................7
1.4 Overview of Dissertation...............................................................................10
PART I PERSONALIZED SEARCH.........................................................................11
2 INTRODUCTION AND LITERATURE REVIEW.................................................12
2.1 Introduction....................................................................................................12
2.2 Related Literature..........................................................................................16
3 OUR APPROACH.....................................................................................................25
3.1 Step 1: Obtaining an Interest Profile.............................................................26
3.2 Step 2: Generating Category Profiles............................................................26
3.3 Step 3: Mapping Interests to ODP Categories...............................................28
3.4 Step 4: Resolving Mapped Categories...........................................................31
3.5 Step 5: Categorizing Search Results..............................................................36
3.6 Implementation..............................................................................................38
4 EXPERIMENTS........................................................................................................46
4.1 Studied Domains and Domain Experts..........................................................47
4.2 Professional Interests, Search Tasks, and Query Length...............................47
4.3 Subjects..........................................................................................................51
4.4 Experiment Process.......................................................................................54
5 EVALUATIONS AND DISCUSSIONS...................................................................55
5.1 Comparing Mean Log Search Time by Query Length..................................55
5.2 Comparing Mean Log Search Time for Information Gathering Tasks.........58
5.3 Comparing Mean Log Search Time for Site Finding Tasks..........................60
5.4 Comparing Mean Log Search Time for Finding Tasks.................................61
5.5 Questionnaire and Hypotheses......................................................................61
5.6 Hypothesis Test Based on Questionnaire......................................................63
5.7 Comparing Indices of Relevant Results........................................................65
5.8 Discussions....................................................................................................69
5.9 Limitations and Future Directions.................................................................71
PART II BUSINESS RELATIONSHIP DISCOVERY...............................................74
6 INTRODUCTION AND LITERATURE REVIEW.................................................75
6.1 Introduction....................................................................................................75
6.2 Literature Review..........................................................................................78
7 NETWORK-BASED ATTRIBUTES AND DATA..................................................82
7.1 Notation in Directed Graphs..........................................................................82
7.2 Notation in Directed, Weighted Graphs........................................................83
7.3 Raw Data.......................................................................................................89
7.4 Preliminary Data Processing..........................................................................90
7.5 Node and Link Identification.........................................................................91
7.6 Attribute Distributions...................................................................................91
8 PREDICTING COMPANY REVENUE RELATIONS............................................97
8.1 Measurements of CRR...................................................................................98
8.2 Research Questions........................................................................................99
8.3 Research Methods........................................................................................100
8.4 Results and Analyses...................................................................................103
8.5 Discussions..................................................................................................117
9 DISCOVERING COMPETITOR RELATIONSHIPS............................................120
9.1 Approach Outline and Research Questions.................................................120
9.2 Data Sets......................................................................................................121
9.3 Examining Competitor Coverage and Density of the Intercompany Network....125
9.4 Competitor Discovery..................................................................................129
9.5 Competitor Extension..................................................................................143
9.6 Explorations on Competitors vs. Noncompetitor Pairs................................149
9.7 Discussions..................................................................................................150
10 CONCLUSIONS.....................................................................................................153
REFERENCES................................................................................................................158
ACKNOWLEDGMENTS
I would like to first thank my advisors, Dr. Gautam Pant and Dr. Olivia R. Liu
Sheng, for their great efforts in consistently improving my research ideas, which led to
the three essays in this dissertation. I also thank Dr. Ellen Riloff, Dr. Paul Hu, and Dr.
Wei Gao for providing constructive comments for my dissertation.
I am sincerely grateful to the David Eccles School of Business for four years of
financial support during my Ph.D. study. I also appreciate the generous support from Dr.
Olivia R. Liu Sheng, Dr. David Plumlee (former department head), and Dr. Robert D.
Allen (department head) for covering some expenses incurred in my research and my
fifth year’s tuition. I am thankful to the eBusiness Center at Pennsylvania State
University for the award funding that supported my research project in personalized search.
I should not forget that Dr. Olivia R. Liu Sheng brought me into this program.
And finally, I would like to give my special thanks to my parents and my brother for
their continuous and unconditional support, no matter where I am or what I am doing,
through ups and downs.
CHAPTER 1
INTRODUCTION
1.1 Knowledge Discovery on the Web
Knowledge discovery from databases (KDD) refers to “the nontrivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in
data” [Fayyad et al. 1996]. KDD has achieved a broad range of applications including
pattern recognition and predictive analytics in many different areas, such as engineering,
business, and science. Knowledge discovery has two types of goals, verification and
discovery. In general the former goal refers to verifying a user’s hypothesis and the latter
can be further divided into prediction (i.e., predicting unknown or future values) and
description (i.e., presenting identified results such as patterns in a human-understandable
form) [Fayyad et al. 1996].
The Web has become a universal repository with a tremendous amount of data that
can be accessed from anywhere in the world, and it has experienced continuous growth in
both content and users. Therefore, the Web presents immense opportunities for
discovering knowledge. However, unlike conventional databases, the data on the Web are
mostly semistructured or unstructured. This situation makes knowledge discovery from
the Web (KDW) challenging as compared to KDD. The KDW process requires considerable
effort on identifying, selecting, and processing Web data possibly from multiple sources
and in different (often free-form text) formats. Manual analysis that turns such large and
heterogeneous Web data into knowledge is impractical, and thus KDW becomes an
attempt to address the accentuated problem of data overload on the Web. We adapt the
KDD process presented in [Fayyad et al. 1996] for the Web context and present the
process of KDW in Figure 1.
Web mining is a step in the KDW process that aims to analyze data and
discover knowledge from the Web. The Web data include all kinds of Web documents,
hyperlinks among Web pages, and Web usage logs. Depending on the type of Web data
being mined, Web mining can be broadly divided into three categories: Web content
mining, Web structure mining, and Web usage mining [Srivastava et al. 2000].
Web content mining is the process of discovering knowledge from Web page content
(i.e., often text), and it often uses techniques based on data mining and text mining.
According to [Liu 2007], important Web content mining problems include Web
crawling [e.g., Brin and Page 1998; Pant and Srinivasan 2006], Web search [e.g., Brin
and Page 1998], processing (e.g., clustering or categorizing) of search results
according to page content [e.g., Zamir and Etzioni 1999; Dumais and Chen 2001],
extraction of Web information such as online opinions [e.g., Peng et al. 2002; Hu and
Liu 2004], Web information integration [e.g., Kalfoglou and Schorlemmer 2003; He
and Chang 2003], etc.

Figure 1. Process of Knowledge Discovery from the Web
Web structure mining tries to discover useful information such as importance of
pages from the structure of hyperlinks on the basis of social network analysis (SNA)
techniques and graph theory. Its research topics cover ranking pages [e.g., Brin and
Page 1998; Chakrabarti et al. 1999], finding Web communities [e.g., Gibson et al.
1998], etc.
Web usage mining is the automatic discovery of user access patterns from Web logs
[Cooley et al. 1997]. The identified visit patterns can help in understanding the
overall access patterns and trends for all users [e.g., Zaïane et al. 1998] and allow for
Web site design to be responsive to business goals and customer needs, such as user-
level customization [e.g., Eirinaki and Vazirgiannis 2003].
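The log-aggregation step that underlies Web usage mining can be sketched minimally as follows; the sample log lines, field layout, and use of client IP as a user identifier are simplifying assumptions for illustration only:

```python
from collections import Counter, defaultdict

def visits_per_user(log_lines):
    """Aggregate page-request counts per user (approximated here by
    client IP) from Common Log Format entries."""
    counts = defaultdict(Counter)
    for line in log_lines:
        parts = line.split()
        if len(parts) < 8:
            continue  # skip malformed lines
        ip, path = parts[0], parts[6]  # request path is the 7th token
        counts[ip][path] += 1
    return counts

# Fabricated sample entries in Common Log Format
log = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.4 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 4510',
    '5.6.7.8 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
stats = visits_per_user(log)
```

Real usage-mining systems would go on to sessionize these counts and mine access patterns from them.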
My dissertation consists of two related topics/parts: personalized (online) search
and business relationship discovery, both of which are in the area of KDW. The first
topic presents and evaluates an automatic personalized search framework that categorizes
search results under a user’s interests, in order to examine whether the proposed personalized
search approach outperforms noncategorized and nonpersonalized baseline systems. This
research falls under Web content mining. The second topic proposes an approach to identifying
an intercompany network using company citations from Web content (more specifically,
online news stories) and discovers business relationships between companies from the
network on the basis of SNA and machine learning techniques. Therefore the second
topic covers both Web content mining and Web structure mining. The main research
question we explore is whether structural attributes derived from the intercompany
network, which in turn is derived from company citations in online news, can identify
business relationships. As shown in Figure 2, at a high level, the first topic connects Web
content to people, and the second uses Web content to discover relationships between
companies. Thus the two topics are connected through mining of Web content. However,
the two topics generate different types of knowledge – categorized and personalized
search results versus company relationships – and hence entail diverse adoptions of Web
data, processing, and Web mining. In the next two sections we briefly introduce the two
topics.
1.2 Personalized Search
Most search engines, including the popular ones such as Google and Yahoo!, ignore
users’ search context, such as users’ interests. As a result the same query from different
Figure 2. Process View of the Two Topics of the Dissertation
users with different information needs retrieves the same search results displayed in the
same way. Hence, they use a “one size fits all” [Lawrence 2000] approach. We note that
currently Google is attempting to address this problem with some level of voluntary
personalization. Personalization techniques that consider users’ context during search can
improve search efficiency [Pitkow et al. 2002]. We propose and implement an automatic
approach to categorizing search results according to a user’s interests, to help users find
relevant information more quickly. Our approach is particularly well suited for a
workplace scenario where much of the information, needed by the proposed system,
about professional interests and skills of knowledge workers is available to the employer.
Personalizing based on such information within an organization can be expected to raise
fewer privacy concerns than a general-purpose search engine gathering data on
user interests. Moreover, unlike other approaches, our approach does not impose any
burden of implicit or explicit feedback from the user.
We customize the general process of KDW in Figure 1 and present the process of
interest-based personalized search for knowledge discovery in Figure 3, where the
processes spanned by the horizontal double-arrow lines correspond to their equivalents
in Figure 1. The proposed approach includes a mapping framework that automatically maps
user interests into a group of categories from the Open Directory Project (ODP) taxonomy.
A text classifier is built from the content of the mapped ODP categories and later is used
at query-time to categorize search results under user interests. For a workplace scenario
where the employees’ professional interests and skills can be automatically extracted
from their resumes or the company’s database, this approach is fully automatic in that users
do not need to provide implicit or explicit feedback during the search. Also, the use of
ODP is transparent to the users because the mapping between interests and ODP
categories is automatically generated. The lack of explicit or implicit feedback and the
use of the ODP taxonomy without a user’s awareness of it differentiate this work from many
others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005]. In addition, we
study three search systems with different interfaces for displaying search results. The first
system (LIST) shows search results in a page-by-page list. The second (CAT) categorizes
and displays results under certain ODP categories. The third (PCAT) is what we propose,
and PCAT categorizes and displays results under user interests. We compare PCAT
with LIST and PCAT with CAT on the basis of different query lengths and different
types of search tasks.
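The interest-to-category mapping and result categorization described above can be sketched, very roughly, as follows. This is an illustrative term-overlap (cosine similarity) toy, not the actual mapping framework detailed in Chapter 3; the category names and profile texts are fabricated stand-ins for ODP category content:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def map_interest(interest, category_profiles, k=2):
    """Rank taxonomy categories by textual similarity to an interest phrase."""
    q = Counter(interest.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), cat)
              for cat, text in category_profiles.items()]
    return [cat for score, cat in sorted(scored, reverse=True)[:k] if score > 0]

# Hypothetical category profiles standing in for ODP category content
profiles = {
    "Computers/Data_Mining": "data mining knowledge discovery patterns",
    "Business/Marketing": "marketing advertising brand customers",
    "Computers/Internet/Searching": "search engines web query retrieval",
}
result = map_interest("web search engines", profiles)
# result → ['Computers/Internet/Searching']
```

In the actual system, data under the mapped categories would then train text classifiers that sort search results under each interest at query time.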
Contributions of this research are that we present an automatic approach to
personalizing Web search given a set of user interests and compare our proposed
approach with each of two baseline systems to identify boundary conditions under
which our system outperforms a baseline system. The main findings include that (1) PCAT
is better than LIST for one-word queries and for Information Gathering tasks,
and (2) PCAT outperforms CAT for free-form queries and for both Information Gathering
and Finding types of tasks, in terms of the time spent on finding relevant results. We
conclude that no system is universally better than the others – the performance of a
system depends on parameters such as query length and type of task.

Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search
1.3 Business Relationship Discovery
Business news contains rich and current information about companies and the
relationships among them. However, reading news is very time-consuming and requires a
reader to possess certain skills, the most basic of which is a good understanding of the language in
which the news is written. The huge volume of news stories makes the manual
identification of relationships among a large number of companies nontrivial and
unscalable. The previous literature using news to automatically discover business
relationships among companies is sparse. Many researchers in areas such as organization
behavior and sociology employ SNA techniques to investigate the nature and
implications of business relationships on the basis of explicitly given company
relationships provided by reliable data sources [e.g., Levine 1972; Walker et al. 1997;
Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and
computer science tend to identify links between nodes using implicit signals, such as
article citations, URL links, and email communications, derived from large and noisy
data sources. They study problems such as identifying importance of individual nodes
(e.g., Web pages, journal articles) in a network [e.g., Garfield 1979; Brin and Page 1998;
Kleinberg 1999] and finding communities on the Web [e.g., Kautz et al. 1997; Gibson et
al. 1998], instead of discovering business relationships between companies. We present
an approach of automatic discovery of company relationships from online business news
using machine learning and SNA techniques. Figure 4 illustrates the knowledge
discovery process for business relationship discovery from Web data (i.e., online news).
Given that a news story pertaining to a company often cites one or more other
companies, we construct a directed and weighted intercompany network on the basis of
citations from a large amount of online news by considering company citations as
directed links from the focal companies to the cited companies. Further we identify four
types of attributes from the network structure using SNA techniques. More specifically,
they are dyadic degree-based, node degree-based, node centrality-based, and structural
equivalence-based attributes. These attributes differ in their coverage of the network.
With those network attributes, we study two types of company relationships using
classification methods. This news-driven, SNA-based business relationship discovery
approach is scalable and language-neutral. Research along this line consists of two
studies that differ in their target business relationships and we describe them as follows.
Figure 4. KDW Process for Business Relationship Discovery
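The network construction and attribute extraction described above can be sketched minimally as follows. The company names and citation pairs are fabricated for illustration, and only two simple attributes are shown; the dissertation's actual attribute set, defined in Chapter 7, is richer:

```python
from collections import Counter

# Each tuple: (focal company of a news story, company cited in that story).
# These pairs are fabricated for illustration.
citations = [
    ("AcmeCorp", "BetaInc"), ("AcmeCorp", "BetaInc"),
    ("BetaInc", "AcmeCorp"), ("AcmeCorp", "GammaLtd"),
]

# Directed, weighted links: weight = number of citing stories
weights = Counter(citations)

def out_degree(node):
    """Weighted out-degree: total citations made by a company's news."""
    return sum(w for (src, _), w in weights.items() if src == node)

def in_degree(node):
    """Weighted in-degree: total citations received from others' news."""
    return sum(w for (_, dst), w in weights.items() if dst == node)

def dyadic_weight(a, b):
    """Dyadic attribute: citation weight in each direction of a pair."""
    return weights[(a, b)], weights[(b, a)]
```

Such per-node and per-dyad quantities are the kind of structural attributes later fed to the classifiers.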
The first one concentrates on predicting a company revenue relation (CRR).
Given a pair of companies, CRR refers to the relative size of two companies’ annual
revenues. We find that degree-based and centrality-based attributes derived from network
structure can predict CRR with reasonable precision, recall, and accuracy (all above 70%)
for all directly linked company pairs in the network. Contributions of this study are that
(1) our approach can serve as a data filtering step for studying the revenue relations
among a very large number of companies; (2) since the revenue information for public
companies is available only quarterly, our approach can be used as a prediction tool for
revenues; and (3) our approach can be applied to discover the revenue relations for
private or foreign companies as well.
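As a toy stand-in for the classification methods actually used in Chapter 8, the sketch below predicts the CRR of a pair from a single network attribute (weighted in-degree) and scores the heuristic against ground-truth labels; all names, in-degrees, and labels are fabricated for illustration:

```python
def predict_larger(pair, in_degree):
    """Toy heuristic: the company with the higher weighted in-degree is
    predicted to have the larger annual revenue."""
    a, b = pair
    return a if in_degree.get(a, 0) >= in_degree.get(b, 0) else b

# Fabricated in-degrees and ground-truth labels for illustration
in_deg = {"AcmeCorp": 40, "BetaInc": 12, "GammaLtd": 25}
labeled_pairs = [  # (pair, company with larger actual revenue)
    (("AcmeCorp", "BetaInc"), "AcmeCorp"),
    (("BetaInc", "GammaLtd"), "GammaLtd"),
    (("AcmeCorp", "GammaLtd"), "GammaLtd"),
]
correct = sum(predict_larger(p, in_deg) == truth for p, truth in labeled_pairs)
accuracy = correct / len(labeled_pairs)
```

The real study replaces this single-attribute rule with trained classifiers over the full set of degree-based and centrality-based attributes.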
In the second work we study the competitor relationship between companies. We
discover the competitor relationship between a pair of connected companies in the
intercompany network on the basis of the four types of attributes. In particular, we
study the classification of company pairs for an imbalanced data set, where the number of
competitor pairs is much smaller than that of noncompetitor pairs. We use two gold
standards, Hoovers.com and Mergentonline.com, which are professional company profile
websites containing manually identified competitors for each company, to evaluate the
classification performance of our approach. Given that neither of the gold standards is
complete in the coverage of competitors, we estimate the coverage of each gold standard.
Finally we present metrics to estimate how much our approach can extend each of the
gold standards. Contributions of this work include that we present an automatic
approach to discovering competitor relationships between companies. Our approach is
particularly useful to serve as an initial data filtering step to identify a group of potential
competitors for each of many companies. We study an imbalanced data set problem and
report the classification performance for competitor pairs in both the imbalanced data set
and the whole data set. Most important, we report the estimated extension of our
approach to each of two gold standards.
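The imbalanced-data issue noted above is why per-class metrics matter. The sketch below, with fabricated labels, shows how a degenerate classifier that never predicts the minority class can still reach high accuracy while recalling no competitor pairs at all:

```python
def minority_metrics(y_true, y_pred, positive="competitor"):
    """Precision, recall, and accuracy, with the rare class as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, acc

# 2 competitor pairs among 10; a classifier that always answers "non"
# reaches 80% accuracy yet identifies no competitors.
y_true = ["competitor"] * 2 + ["non"] * 8
y_pred = ["non"] * 10
p, r, a = minority_metrics(y_true, y_pred)
```

This is why the classification performance for competitor pairs is reported separately for the imbalanced data set and the whole data set.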
1.4 Overview of Dissertation
At a high level the dissertation consists of two parts. Part I, which consists of
chapters 2 to 5, covers the first topic of the dissertation: Interest-Based Personalized
Search. Part II, which includes chapters 6 to 9, covers the two related studies in business
relationship discovery. More specifically we highlight each chapter as follows.
Chapter 2 introduces the research on personalized search and reviews related prior
work. We detail our approach of personalized search in Chapter 3. Experiments are
covered in Chapter 4 and result analyses and conclusions are discussed in Chapter 5. We
introduce the topic of business relationship discovery and review prior literature in
Chapter 6. Chapter 7 describes how to identify attributes from the network structure and
explains the data and data processing procedures. We concentrate on predicting CRR in
Chapter 8 and on discovering competitor relationships in Chapter 9. Finally we conclude
the dissertation in Chapter 10.
CHAPTER 2
INTRODUCTION AND LITERATURE REVIEW
2.1 Introduction
The Web provides an extremely large and dynamic source of information, and the
continuous creation and updating of Web pages magnifies information overload on the
Web. Both casual and noncasual users (e.g., knowledge workers) often use search
engines to find a needle in this constantly growing “haystack.” Sellen et al. [2002], who
define a knowledge worker as someone “whose paid work involves significant time spent
in gathering, finding, analyzing, creating, producing or archiving information,” report
that 59% of the tasks performed on the Web by a sample of knowledge workers fall into
the categories of Information Gathering and Finding, which require an active use of Web
search engines.
Most existing Web search engines return a list of search results based on a user’s
query but ignore the user’s specific interests and/or search context. Therefore, the
identical query from different users or in different contexts will generate the same set of
results displayed in the same way for all users, a so-called one-size-fits-all [Lawrence
2000] approach. Furthermore, the number of search results returned by a search engine is
often so large that the results must be partitioned into multiple result pages. In addition,
individual differences in information needs, polysemy (multiple meanings of the same
word), and synonymy (multiple words with the same meaning) pose problems [Deerwester et
al. 1990] in that a user may have to go through many irrelevant results or try several
queries before finding the desired information. Problems encountered in searching are
exacerbated further when search engine users employ short queries [Jansen et al.
1998]. However, personalization techniques that put a search in the context of the user’s
interests may alleviate some of these issues.
In this study, which focuses on knowledge workers’ search for information online
in a workplace setting, we assume that some information about the knowledge workers,
such as their professional interests and skills, is known to the employing organization and
can be extracted automatically with an information extraction (IE) tool or with database
queries. The organization can then use such information as an input to a system based on
our proposed approach and provide knowledge workers with a personalized search tool
that will reduce their search time and boost their productivity.
For a given query, a personalized search can provide different results for different
users or organize the same results differently for each user. It can be implemented on
either the server side (search engine) or the client side (organization’s intranet or user’s
computer). Personalized search implemented on the server side is computationally
expensive when millions of users are using the search engine, and it also raises privacy
concerns when information about users is stored on the server. A personalized search on
the client side can be achieved by query expansion and/or result processing [Pitkow et al.
2002]. By adding extra query terms associated with user interests or search context, the
query expansion approach can retrieve different sets of results. The result processing
includes result filtering, such as removal of some results, and reorganizing, such as
reranking, clustering, and categorizing the results.
Our proposed approach is a form of client-side personalization based on an
interest-to-taxonomy mapping framework and result categorization. It piggybacks on a
standard search engine such as Google1 and categorizes and displays search results on the
basis of known user interests. As a novel feature of our approach, the mapping
framework automatically maps the known user interests onto a set of categories in a Web
directory, such as the Open Directory Project2 (ODP) or Yahoo!3 directory. An advantage
of this mapping framework is that, after user interests have been mapped onto the
categories, a large amount of manually edited data under these categories is freely
available to be used to build text classifiers that correspond to these user interests. The
text classifiers then can categorize search results according to the user’s various interests
at query time. The same text classifiers may be used to categorize emails and other digital
documents, which suggests that our approach may be extended to a broader domain of
content management.
The main research questions that we explore are as follows: (1) What is an
appropriate framework for mapping a user’s professional interests and skills onto a group
of concepts in a taxonomy such as a Web directory? (2) How does a personalized
1 http://www.google.com.
2 http://www.dmoz.com.
3 http://www.yahoo.com.
categorization system (PCAT) based on our proposed approach perform differently from
a list interface system (LIST), similar to a conventional search engine? (3) How does
PCAT perform differently from a nonpersonalized categorization system (CAT) that
categorizes results without any personalization? The third question attempts to separate
the effect of categorization from the effect of personalization in the proposed system. We
explore the second and third questions along two dimensions, type of task and query
length.
Figure 5 illustrates the input and output of these three systems. LIST requires two
inputs: a search query and a search engine, and its output, similar to what a conventional
search engine adopts, is a page-by-page list of search results. Using a large taxonomy
(ODP Web directory), CAT classifies search results and displays them under some
taxonomy categories; in other words, it uses the ODP taxonomy as an additional input.
Finally, PCAT adds another input, namely, a set of user interests. The mapping
framework in PCAT automatically identifies a group of categories from the ODP
taxonomy as relevant to the user’s interests. Using data from these relevant categories,
the system generates text classifiers to categorize search results under the user’s various
interests at query time.
We compare PCAT with LIST and with CAT in two sets of controlled
experiments. Compared with LIST, PCAT works better for searches with short queries
and for Information Gathering tasks. In addition, PCAT outperforms CAT for both
Information Gathering and Finding tasks and for searches with free-form queries.
Subjects indicate that PCAT enables them to identify relevant results and complete given
tasks more quickly and easily than does LIST or CAT.
Figure 5. Input and Output of the Three Systems
2.2 Related Literature
This section reviews prior studies pertaining to personalized search. We also
consider several studies using the ODP taxonomy to represent a search context, review
studies on the taxonomy of Web activities, and end by briefly discussing text
categorization.
According to Lawrence [2000], next-generation search engines will increasingly
use context information. Pitkow et al. [2002] also suggest that a contextual computing
approach that enhances user interactions through a greater understanding of the user, the
context, and the applications may prove a breakthrough in personalized search efficiency.
They further identify two primary ways to personalize search, query expansion and result
processing [Pitkow et al. 2002], which can complement each other.
2.2.1 Query Expansion
We use an approach similar to query expansion for finding terms related to user
interests in our interest mapping framework. Query expansion refers to the process of
augmenting a query from a user with other words or phrases in order to improve search
effectiveness. It originally was applied in information retrieval (IR) to solve the problem
of word mismatch that arises when search engine users employ different terms than those
used by content authors to describe the same concept [Xu and Croft 1996]. Because the
word mismatch problem can be reduced through the use of longer queries, query
expansion may offer a solution [Xu and Croft 1996].
In line with query expansion, current literature provides various definitions of
context. In the Inquirus 2 project [Glover et al. 1999], a user manually chooses a context
in the form of a category, such as research papers or organizational homepages, before
starting a search. Y!Q,4 a large-scale contextual search system, allows a user to choose a
context in the form of a few words or a whole article through three methods: a novel
information widget executed in the user’s Web browser, Yahoo! Toolbar,5 or Yahoo!
Messenger6 [Kraft et al. 2005]. In the Watson project, Budzik and Hammond [2000]
derive context information from the whole document a user views. Instead of using a
whole document, Finkelstein et al. [2002] limit the context to the text surrounding a user-
marked query term(s) in the document. That text is part of the whole document, so their
query expansion is based on a local context analysis approach [Xu and Croft 1996].
4 http://yq.search.yahoo.com.
5 http://toolbar.yahoo.com.
6 http://beta.messenger.yahoo.com.
Leroy et al. [2003] define context as the combination of titles and descriptions of clicked
search results after an initial query. In all these studies, queries get expanded on the basis
of the context information, and results are generated according to the expanded queries.
2.2.2 Result Processing
Relatively few studies deal with result processing, which includes result filtering
and reorganizing. Domain filtering eliminates documents irrelevant to given domains
from the search results [Oyama et al. 2004]. For example, Ahoy!, a homepage finder
system, uses domain-specific filtering to eliminate most results returned by one or more
search engines but retain the few pages that are likely to be personal homepages [Shakes
et al. 1997]. Tan and Teo [1998] propose a system that filters out news items that may not
be of interest to a given user according to that user’s explicit (e.g., satisfaction ratings)
and implicit (e.g., viewing order, duration) feedback to create personalized news.
Another approach to result processing is to reorganize, which involves reranking,
clustering, and categorizing search results. For example, Teevan et al. [2005] construct a
user profile (context) over time with rich resources including issued queries, visited Web
pages, composed or read documents and emails. When the user sends a query, the system
reranks the search results on the basis of the learned profile. Shen et al. [2005a] use
previous queries and summaries of clicked results in the current session to rerank results
for a given query. Similarly, UCAIR [Shen et al. 2005b], a client-side personalized
search agent, employs both query expansion on the basis of the immediately preceding
query and result reranking on the basis of summaries of viewed results. Other works also
consider reranking according to a user profile [Gauch et al. 2003; Sugiyama et al. 2004;
Speretta and Gauch 2005; Chirita et al. 2005; Kraft et al. 2005]. Gauch et al. [2003] and
Sugiyama et al. [2004] learn a user’s profile from his or her browsing history, whereas
Speretta and Gauch [2005] build the profile on the basis of search history, and Chirita et
al. [2005] require the user to specify the profile entries manually.
Scatter/Gather [Cutting et al. 1992] is one of the first systems to present
documents in clusters. Another system, Grouper [Zamir and Etzioni 1999], uses snippets
of search engine results to cluster the results. Tan [2002] presents a user-configurable
clustering approach that clusters search results using titles and snippets of search results
and the user can manually modify these clusters.
Finally, in comparing seven interfaces that display search results, Dumais and
Chen [2001] report that all interfaces that group results into categories are more effective
than conventional interfaces that display results as a list. They also conclude that the best
performance occurs when both category names and individual page titles and summaries
are presented. We closely follow these recommendations for the two categorization
systems we study (PCAT and CAT). In recent work, Käki [2005] also finds that result
categorization is helpful when the search engine fails to provide relevant results at the top
of the list.
2.2.3 Representing Context Using Taxonomy
In our approach, we map user interests to categories in the ODP taxonomy. Figure
6 shows a portion of the ODP taxonomy in which Computers is a depth-one category, and
C++ and Java are categories at depth four. We refer to Computers/Programming/
Languages as the parent category of category C++ or Java. Hence various concepts
(categories) are related through a hierarchy in the taxonomy. Currently, the ODP is a
manually edited directory of 4.6 million URLs that have been categorized into 787,774
categories by 68,983 human editors. The ODP taxonomy has been applied to
personalization of Web search in some prior studies [Pitkow et al. 2002; Gauch et al.
2003; Liu et al. 2004; Chirita et al. 2005].
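For illustration, such category paths can be manipulated directly as strings; the following sketch is ours and the helper names are not part of ODP:

```python
# Minimal helpers for working with ODP-style category paths as in Figure 6.
# The function names are illustrative, not part of the ODP data format.

def depth(category: str) -> int:
    """Depth of a category, counted from the root (a depth-one category
    such as Computers has a single path component)."""
    return len(category.split("/"))

def parent(category: str) -> str:
    """Parent category, obtained by dropping the last path component."""
    return "/".join(category.split("/")[:-1])

cpp = "Computers/Programming/Languages/C++"
assert depth(cpp) == 4
assert parent(cpp) == "Computers/Programming/Languages"
assert depth("Computers") == 1
```

Representing categories as plain paths also makes the later truncation step (resolving a depth-four category to its depth-three parent) a one-line operation.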
For example, the Outride personalized search system (acquired by Google)
performs both query modification and result processing. It builds a user profile (context)
on the basis of a set of personal favorite links, the user’s last 1000 unique clicks, and the
ODP taxonomy, then modifies queries according to that profile. It also reranks search
results on the basis of usage and the user profile. The main focus of the Outride system is
capturing a user’s profile through his or her search and browsing behaviors [Pitkow et al.
2002]. The OBIWAN system [Gauch et al. 2003] automatically learns a user’s interest
profile from his or her browsing history and represents those interests with concepts in
the Magellan taxonomy. It maps each visited Web page into the five taxonomy concepts with the
Figure 6. ODP Taxonomy
highest similarities; thus, the user profile consists of accumulated categories generated
over a collection of visited pages. Liu et al. [2004] also build a user profile that consists
of previous search query terms and five words that surround each query term in each
Web page clicked after the query is issued. The user profile then is used to map the user’s
search query onto three depth-two ODP categories. In contrast, Chirita et al. [2005] use a
system in which a user manually selects ODP categories as entries in his or her profile.
When reranking search results, they measure the similarity between a search result and
the user profile using the node distance in a taxonomy concept tree, which means the
search result must associate with an ODP category. A difficulty in their study is that
many parameters’ values have been set without explanations. The current Google
personalized search7 also explicitly asks users to specify their interests through the
Google directory.
Similar to Gauch et al. [2003], we represent user interests with taxonomy
concepts, but we do not need to collect browsing history. Unlike Liu et al. [2004], we do
not need to gather previous search history, such as search queries and clicked pages, or
know the ODP categories corresponding to the clicked pages. Whereas Gauch et al.
[2003] map a visited page onto five ODP categories and Liu et al. [2004] map a search
query onto three categories, we automatically map a user interest onto an ODP category.
A difference between Chirita et al. [2005] and our approach is that when mapping a
user’s interest onto a taxonomy concept, we employ text, that is, page titles and
summaries associated with the concept in taxonomy, while they use the taxonomy
category title and its position in the concept tree when computing the tree-node distance.
7 http://labs.google.com/personalized.
Also, in contrast to UCAIR [Shen et al. 2005b] that uses contextual information in the
current session (short-term context) to personalize search, our approach personalizes
search according to a user’s long-term interests, which may be extracted from his or her
resume.
Haveliwala [2002] and Jeh and Widom [2003] extend the PageRank algorithm
[Brin and Page 1998] to generate personalized ranks. Using 16 depth-one categories in
ODP, Haveliwala [2002] computes a set of topic-sensitive PageRank scores. The original
PageRank is a global measure of the query- or topic-insensitive popularity of Web pages
measured solely by a linkage graph derived from a large part of the Web. Haveliwala’s
experiments indicate that, compared with the original PageRank, a topic-sensitive
PageRank achieves greater precision in top-ten search results. Topic-sensitive PageRank
also can be used for personalization after a user’s interests have been mapped onto
appropriate depth-one categories of the ODP, which can be achieved through our
proposed mapping framework. Jeh and Widom [2003] present a scalable personalized
PageRank method in which they identify a linear relationship between basis vectors and
the corresponding personalized PageRank vectors. At query time, their method constructs
an approximation to the personalized PageRank vector from the precomputed basis
vectors.
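Neither implementation is described in code in these papers; the following is only a generic power-iteration sketch of PageRank with a teleport (personalization) vector, the core idea behind the topic-sensitive variant. The graph, damping factor, and function name are illustrative:

```python
# Generic sketch of PageRank with a personalization (teleport) vector v:
# r = d * M r + (1 - d) * v, iterated to a fixed point. In the topic-sensitive
# formulation, v places mass only on pages in a chosen topic (e.g., an ODP
# depth-one category). Illustration only, not Haveliwala's or Jeh and
# Widom's actual code.

def personalized_pagerank(links, v, d=0.85, iters=50):
    """links: dict node -> list of out-neighbors; v: teleport distribution."""
    nodes = list(links)
    r = {n: v.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) * v.get(n, 0.0) for n in nodes}
        for n, outs in links.items():
            if not outs:                 # dangling node: redistribute via v
                for m in nodes:
                    nxt[m] += d * r[n] * v.get(m, 0.0)
            else:
                share = d * r[n] / len(outs)
                for m in outs:
                    nxt[m] += share
        r = nxt
    return r

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
topic_v = {"a": 1.0}                     # teleport only to topic page "a"
ranks = personalized_pagerank(graph, topic_v)
```

With the teleport mass concentrated on "a", the scores remain a probability distribution, and pages close to the topic page receive higher rank than distant ones.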
2.2.4 Taxonomy of Web Activities
We study the performance of the three systems (described in Section 2.1) by
considering different types of Web activities. Sellen et al. [2002] categorize Web
activities into six categories: Finding (locate something specific), Information Gathering
(answer a set of questions; less specific than Finding), Browsing (visit sites without
explicit goals), Transacting (execute a transaction), Communicating (participate in chat
rooms or discussion groups), and Housekeeping (check the accuracy and functionality of
Web resources). As Craswell et al. [2001] define a Site Finding task specifically as "one
where the user wants to find a particular site, and their query names the site," we consider
it a type of Finding task. It should be noted that some Web activities, especially
Information Gathering, can involve several searches. On the basis of the intent behind
Web queries, Broder [2002] classifies Web searches into three classes: Navigational
(reach a particular site), Informational (acquire information from one or more Web
pages), and Transactional (perform some Web-mediated activities). As the taxonomy of
search activities suggested by Sellen et al. [2002] is broader than that of Broder [2002],
in this article we choose to study the two major types of activities from Sellen et al.
[2002].
2.2.5 Text Categorization
In our study, CAT and PCAT systems employ text classifiers to categorize search
results. Text categorization (TC) is a supervised learning task that classifies new
documents into a set of predefined categories [Yang and Liu 1999]. As a joint discipline
of machine learning and IR, TC has been studied extensively, and many different
classification algorithms (classifiers) have been introduced and tested, including the
Rocchio method, naïve Bayes, decision tree, neural networks, and support vector
machines [Sebastiani 2002]. A standard information retrieval metric, cosine similarity
[Salton and McGill 1986], computes the cosine angle between vector representations of
two text fragments or documents. In TC, a document can be assigned to the category with
the highest similarity score. Due to its simplicity and effectiveness, cosine similarity has
been used by many studies for TC [e.g., Yang and Liu 1999; Sugiyama et al. 2004; Liu et
al. 2004].
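As a toy illustration (our own sketch, not code from the studies cited above), this cosine-based assignment can be written in a few lines:

```python
import math
from collections import Counter

# Toy sketch: assign a document to the category profile with the highest
# cosine similarity between term-frequency vectors. The profiles below are
# invented examples, not real ODP category profiles.

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc: str, profiles: dict) -> str:
    d = Counter(doc.lower().split())
    return max(profiles,
               key=lambda c: cosine(d, Counter(profiles[c].lower().split())))

profiles = {
    "Java": "java virtual machine class bytecode sun",
    "C++": "c++ template pointer compiler stl",
}
assert classify("sun java bytecode tutorial", profiles) == "Java"
```

The same similarity function reappears in Steps 2 and 3 of our approach, where the compared vectors are a user interest and an ODP category profile.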
In summary, to generate user profiles for personalized search, previous studies
have asked users for explicit feedback, such as ratings and preferences, or collected
implicit feedback, such as search and browsing history. However, users are unwilling to
provide explicit feedback even when they anticipate a long-run benefit [Carroll and
Rosson 1987]. Implicit feedback has shown promising results for personalizing search
using short-term context [Leroy et al. 2003; Shen et al. 2005b]. However, generating user
profiles for long-term context through implicit feedback will take time and may raise
privacy concerns. In addition, a user profile generated from implicit feedback may
contain noise because the user preferences have been estimated from behaviors and not
explicitly specified. In our approach, two user-related inputs, a search query and the user’s
professional interests and skills, are explicitly given to a system, so some prior work
[Leroy et al. 2003; Gauch et al. 2003; Liu et al. 2004; Sugiyama et al. 2004; Kraft et al.
2005] that relies on modeling user interests through searching or browsing behavior is not
readily applicable.
CHAPTER 3
OUR APPROACH
Our approach begins with the assumption that some user interests are known and
therefore is well suited for a workplace setting in which employees’ resumes often are
maintained in a digital form or information about users’ professional interests and skills
is stored in a database. An IE tool or database queries can extract such information as
input to complement the search query, search engine, and contents of the ODP taxonomy.
However, we do not include such an IE program in this study and assume instead that the
interests have already been given. Our interest-category mapping framework tries to
automatically identify an ODP category associated with each of the given user interests.
Then our system uses URLs organized under those categories as training examples to
classify search results into various user interests at query time. We expect the result
categorization to help the user quickly focus on results of interest and decrease total time
spent in searching. The result categorization may also lead to the discovery of
serendipitous connections between the concepts being searched and the user’s other
interests. This form of personalization therefore should reduce search effort and possibly
provide interesting and useful resources the user would not notice otherwise. We focus on
work-related search performance, but our approach could be easily extended to include
personal interests as well. We illustrate a process view of our proposed approach in
Figure 7 and present our approach in five steps. Steps 3 and 4 cover the mapping
framework.
3.1 Step 1: Obtaining an Interest Profile
Step 1 (Figure 7) pertains to how the user interests can be extracted from a
resume. Our study assumes that user interests are available to our personalized search
system in the form of a set of words and phrases, which we call a user’s interest profile.
3.2 Step 2: Generating Category Profiles
As we explained previously, ODP is a manually edited Web directory with
millions of URLs placed under different categories. Each ODP category contains URLs
that point to external Web pages that human editors consider relevant to the category.
Figure 7. Process View of Proposed Approach
Those URLs are accompanied by manually composed titles and summaries that we
believe accurately represent the corresponding Web page content. The category profile of
an ODP category thus is built by concatenating the titles and summaries of the URLs
listed under the category. The constructed category profiles provide a solution to the
cold-start problem, which arises from the difficulty of creating a profile for a new user
from scratch [Maltz and Ehrlich 1995], and they later serve to categorize the search
results. Gauch et al. [2003], Menczer et al. [2004], and Srinivasan et al. [2005] use
similar concatenation to build topic profiles. In our study, we combine up to 30 pairs of
manually composed titles and summaries of URL links under an ODP category as the
category profile.8 In support of this approach, Shen et al. [2004] report that classification
using manually composed summaries in the LookSmart Web directory achieves
higher accuracy than the use of the content of Web pages. For building the category
profile, we pick the first 30 URLs based on the sequence in which they are provided by
8 A category profile does not include titles or summaries of its child (subcategory) URLs.
ODP. We note that ODP can have more than 30 URLs listed under a category. In order to
use similar amounts of information for creating profiles for different ODP categories, we
only use the titles and summaries of the first 30 URLs. When generating profiles for
categories in Magellan taxonomy, Gauch et al. [2003] show that a number of documents
between 5 and 60 provide reasonably accurate classification.
At depth-one, ODP contains 17 categories (for a depth-one category, Computers,
see Figure 6). We select five of these (Business, Computers, Games, Reference, and
Science) that are likely to be relevant to our subjects and their interests. These five broad
categories comprise a total of 8,257 categories between depths one and four. We generate
category profiles by removing stop words and applying Porter stemming9 [Porter 1980].
We also filter out any terms that appear only once in a profile to avoid noise and remove
any profiles that contain fewer than two terms. Finally, the category profile is represented
as a term vector [Salton and McGill 1986] with term frequencies (tf) as weights. Shen et
al. [2004] also use a tf-based weighting scheme, representing a Web page by its manually
composed summary in the LookSmart Web directory.
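A minimal sketch of this profile construction follows; the stop-word list is abbreviated and Porter stemming is omitted for brevity, so it only approximates the actual preprocessing:

```python
from collections import Counter

# Sketch of Step 2: build a tf vector for one ODP category from the titles
# and summaries of its first 30 URLs. The stop-word list is abbreviated, and
# Porter stemming (used in the dissertation) is omitted here.

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "for", "on"}

def build_category_profile(entries, max_urls=30):
    """entries: list of (title, summary) pairs for URLs under one category.
    Returns a term-frequency vector, or None if the profile is too small."""
    text = " ".join(t + " " + s for t, s in entries[:max_urls])
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    tf = Counter(terms)
    tf = Counter({t: f for t, f in tf.items() if f > 1})  # drop singletons (noise)
    return tf if len(tf) >= 2 else None                   # drop tiny profiles

entries = [("Java Tutorial", "a tutorial on the Java language"),
           ("Java FAQ", "answers to common Java language questions")]
profile = build_category_profile(entries)
```

On these two invented entries, the resulting profile keeps only terms occurring at least twice ("java", "tutorial", "language"), mirroring the noise-filtering rule described above.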
3.3 Step 3: Mapping Interests to ODP Categories
Next, we need a framework to map a user’s interests onto appropriate ODP
categories. The framework then can identify category profiles for building text classifiers
that correspond to the user’s interests. Some prior studies [Pitkow et al. 2002; Liu et al.
2004] and the existing Google personalized search use ODP categories with a few
hundred categories up to depth two, but for our study, categories up to depth two may
9 http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/porter.java.
lack sufficient specificity. For example, Programming, a depth-two category, is too broad
to map a user interest in specific programming languages such as C++, Java, or Perl.
Therefore, we map user interests to ODP categories up to depth four. As we mentioned in
Step 2, a total of 8,257 such categories can be used for interest mapping. We employ four
different mapping methods to evaluate the mapping performance by testing and
comparing them individually as well as in different combinations. When generating an
output category, a mapping method includes the parent category of the mapped category;
for example, if the mapped category is C++, the output will be Computers/Programming/
Languages/C++.
3.3.1 Mapping Method 1 (m1-category-label):
Simple Term Match
The first method uses a string comparison to find a match between an interest and
the label of the category in ODP. If an interest is the same as a category label, the
category is considered a match to the interest. Plural forms of terms are transformed to
their singular forms by a software tool from the National Library of Medicine.10
Therefore, the interest of search engine is matched with the ODP category Search
Engines, and the output category is Computers/Internet/Searching/Search Engines.
10 http://umlslex.nlm.nih.gov/nlsRepository/nlp/doc/userDoc/index.html.
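A minimal sketch of this exact-match strategy, with a naive singularization stub standing in for the NLM lexical tool:

```python
# Sketch of m1-category-label: exact string match between an interest and a
# category label after singularization. The real system uses an NLM lexical
# tool; singularize() below is a crude stand-in for illustration only.

def singularize(word: str) -> str:
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(phrase: str) -> str:
    return " ".join(singularize(w) for w in phrase.lower().split())

def m1_matches(interest: str, categories: list) -> list:
    """Return every category whose label (last path component) matches."""
    target = normalize(interest)
    return [c for c in categories
            if normalize(c.split("/")[-1]) == target]

cats = ["Computers/Internet/Searching/Search Engines",
        "Computers/Programming/Languages/Java"]
assert m1_matches("search engine", cats) == [cats[0]]
```

Because a label can recur in several places in the taxonomy, the function returns a list; Step 4 later resolves such multiple matches.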
3.3.2 Mapping Method 2 (m2-category-profile):
Most Similar Category Profile
The cosine similarities between an interest and each of the category profiles are
computed, and the ODP category with the highest similarity is selected as the
output.
3.3.3 Mapping Method 3 (m3-category-profile-noun): Most Similar
Category Profile While Augmenting Interest
With Potentially Related Nouns
Methods m1-category-label and m2-category-profile will fail if the category labels and
profiles do not contain any of the words that form a given interest, so it may be
worthwhile to augment the interest concept by adding a few semantically similar or
related terms. According to Harris [1985], terms in a language do not occur arbitrarily but
appear at a certain position relative to other terms. On the basis of the concept of
cooccurrence, Riloff and Shepherd [1997] present a corpus-based bootstrapping
algorithm that starts with a few given seed words that belong to a specific domain and
discovers more domain-specific semantically-related lexicons from a corpus. Similar to
query expansion, it is desirable to augment the original interest with a few semantically
similar or related terms.
For m3-category-profile-noun, one of our programs conducts a search on Google
using an interest as a search query and finds the N nouns that most frequently cooccur in
the top ten search results (page titles and snippets). We find cooccurring nouns because
most terms in interest profiles are nouns (for terms from some sample user interests, see
Table 1). Terms semantically similar or related to those of the original interest thus can
be obtained without having to ask a user for input such as feedback or a corpus. A noun is
identified by looking up the word in a lexical reference system,11 WordNet [Miller et al.
1990], to determine whether the word has the part-of-speech tag of noun. The similarities
between a concatenated text (a combination of the interest and N most frequently
cooccurring nouns) and each of the category profiles then are computed to determine the
category with the highest similarity as the output of this method.
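The noun-harvesting step can be sketched as follows; the fixed NOUNS set stands in for the WordNet part-of-speech lookup, and the snippets stand in for live results from the top ten Google hits:

```python
from collections import Counter

# Sketch of m3's augmentation step: find the N nouns that most frequently
# cooccur with the interest in result titles/snippets. NOUNS is a stand-in
# for a WordNet part-of-speech lookup; the snippets are invented examples.

NOUNS = {"tutorial", "sun", "language", "class", "machine"}

def top_cooccurring_nouns(snippets, n=2):
    counts = Counter(w for s in snippets
                     for w in s.lower().split() if w in NOUNS)
    return [w for w, _ in counts.most_common(n)]

snippets = ["Java tutorial from Sun", "The Java language tutorial",
            "Sun Java class tutorial"]
augmented = "java " + " ".join(top_cooccurring_nouns(snippets))
```

The augmented string ("java tutorial sun" for these snippets) then replaces the bare interest when computing similarities against the category profiles, exactly as in m2.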
3.3.4 Mapping Method 4 (m4-category-profile-np): Most Similar
Category Profile While Augmenting Interest With
Potentially Related Noun Phrases
Although similar to m3-category-profile-noun, this method finds the M most
frequently cooccurring noun phrases on the first result page from up to ten search results.
We developed a shallow parser program to parse sentences in the search results into NPs
(noun phrases), VPs (verb phrases), and PPs (prepositional phrases), where a NP can
appear in different forms, such as a single noun, a concatenation of multiple nouns, an
article followed by a noun, or any number of adjectives followed by a noun.
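As an illustration only, a toy chunker with a tiny hand-made lexicon can recognize the NP forms listed above (the actual shallow parser also handles VPs and PPs and uses real part-of-speech tagging):

```python
import re
from collections import Counter

# Toy stand-in for the shallow parser: extract NPs of the forms mentioned in
# the text (optional article, any adjectives, one or more nouns), using a
# tiny hand-made part-of-speech lexicon instead of a real tagger.

LEXICON = {"the": "DET", "a": "DET", "general": "ADJ", "fast": "ADJ",
           "c": "N", "compiler": "N", "language": "N", "code": "N"}

NP_PATTERN = re.compile(r"(DET )?(ADJ )*(N )+")

def noun_phrases(sentence):
    words = sentence.lower().split()
    tags = " ".join(LEXICON.get(w, "X") for w in words) + " "
    phrases = []
    for m in NP_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")   # word index where the NP begins
        end = tags[:m.end()].count(" ")       # word index just past the NP
        phrases.append(" ".join(words[start:end]))
    return phrases

nps = noun_phrases("the general c compiler generates fast code")
```

On this invented sentence, the pattern yields "the general c compiler" (article + adjective + nouns) and "fast code" (adjective + noun); their frequencies across snippets would then be counted as in m3.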
Table 1 lists some examples of frequently cooccurring nouns and NPs identified
by m3-category-profile-noun and m4-category-profile-np. Certain single-noun NPs
generated by m4-category-profile-np differ from the individual nouns identified by
m3-category-profile-noun because a noun identified by m3-category-profile-noun may
combine with other terms to form a phrase and therefore not appear as a single noun in
the result generated by m4-category-profile-np.
11 http://wordnet.princeton.edu/.
3.4 Step 4: Resolving Mapped Categories
For a given interest, each mapping method in Step 3 may generate a different
mapped ODP category, and m1-category-label may generate multiple ODP categories for
the same interest because the same category label sometimes is repeated in the ODP
taxonomy. For example, the category Databases appears in several different places in the
hierarchy of the taxonomy, such as Computers/Programming/Databases and
Computers/Programming/Internet/Databases.
Using 56 professional interests in the computer domain that were manually
extracted from several resumes of professionals collected from ODP (eight interests are
shown in the first column of Table 1), Table 2 compares the performances of each
individual mapping method. After verification by a domain expert, m1-category-label
generated mapped categories for 29 of 56 interests, and only two did not contain the right
category. We note that m1-category-label has much higher precision than the other three
methods, but it generates the fewest mapped interests. Machine learning research [e.g.,
Dietterich 1997] has shown that an ensemble of classifiers can outperform each classifier
in that ensemble. Since the mapping methods can be viewed as classification techniques
that classify interests into ODP categories, a combination of the mapping methods may
outperform any one method.
Table 1.
Frequently Cooccurring Nouns and NPs

Domain    Interest                      Two cooccurring nouns    Cooccurring NP
Computer  C++                           programme, resource      general c
          IBM DB2                       database, software       database
          Java                          tutorial, sun            sun
          Machine Learning              information, game        ai topic
          Natural Language Processing   intelligence, speech     intelligence
          Object Oriented Programming   concept, link            data
          Text Mining                   information, data        text mine tool
          UML                           model, tool              acceptance *
          Web Site Design               html, development        library resource web development
Finance   Bonds                         saving, rate             saving bond
          Day Trading                   resource, article        book
          Derivatives                   trade, international     gold
          Mutual Funds                  news, stock              account
          Offshore Banking              company, formation       bank account
          Risk Management               open, source *           software risk evaluation *
          Stocks Exchange               trade, information       official site
          Technical Analysis            market, chart            market pullback
          Trading Cost                  service, cap             product
* Some cooccurring nouns or NPs may not be semantically similar or related.
Table 2.
Individual Mapping Method Comparison (Based on 56 Computer Interests)

Mapping method                              m1      m2      m3      m4
Number of correctly mapped interests        27      29      25      19
Number of incorrectly mapped interests       2      25      30      36
Number of total mapped interests            29      54      55      55
Precision (correct / total mapped)       93.0%   53.7%   45.5%   34.5%
Recall (correct / 56)                    48.2%   51.8%   44.6%   33.9%
F1                                       63.5%   52.7%   45.0%   34.2%
Figure 8 lists the detailed pseudocode of the procedure used to automatically
resolve a final set of categories for an interest profile with the four mapping methods. M1
represents the set of mapped categories generated by m1-category-label, as do M2,
M3, and M4. Because of its high precision, we prioritize the category/categories
generated by m1-category-label as shown in Step (2); if a category generated by m1-
category-label is the same as, or a parent category of, a category generated by any other
method, we include the category generated by m1-category-label in the list of final
resolved categories. Because m1-category-label uses an exact match strategy, it does not
always generate a category for a given interest. In Step (3), if methods m2-category-
profile, m3-category-profile-noun, and m4-category-profile-np generate the same mapped
category, we select that category, irrespective of whether m1-category-label generates
one. Steps (2) and (3) attempt to produce a category for an interest by considering
overlapping categories from different methods. If no such overlap is found, we look for
overlapping categories generated for different interests in Step (6) because if more than
one interest is mapped to the same category, it is likely to be of interest. In Step (8), we
try to represent all remaining categories at a depth of three or less by truncating the
category at depth four, thereby hoping to find overlapping categories through the parent
categories. Step (9) is similar to Step (5) except that all remaining categories are at the
depth of three or less.
(1) For each interest i in the interest profile
      Given i, the four mapping methods generate M1, M2, M3, and M4
(2)   For each category c in M1
        If c is the same as, or a parent of, a category in M2, M3, or M4,
          add c to the list of final categories, then go to Step (1)
      End For
(3)   If M2, M3, and M4 contain the same category c,
        add c to the list of final categories, then go to Step (1)
(4)   Put any category c in M1, M2, M3, and M4 into a list of candidate categories
    End For
(5) For each category c in candidate categories
      Count the frequency of c
    End For
(6) For each depth-four category c in candidate categories
      If frequency of c >= threshold, add c to the list of final categories
      (We chose the threshold equal to the number of mapping methods – 1, which
      was three in our tests because we used four mapping methods. A frequency of
      three or larger means the candidate category overlaps between at least two
      different interests, so we choose the overlapping candidate category to
      represent these interests.)
    End For
(7) Remove all candidate categories for the interests mapped in Step (6)
(8) Resolve all remaining depth-four categories to depth three by truncating the
    category at depth four. For example, after truncating to depth three,
    reference/knowledge management/publications/articles is resolved as
    reference/knowledge management/publications
(9) For each category c in candidate categories
      Count the frequency of c
    End For
(10) For each depth-three category c in candidate categories
       If frequency of c >= threshold, add c to the list of final categories
     End For

Figure 8. Category Resolving Procedures
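A simplified Python rendering of Steps (1) through (6) of the procedure, with categories represented as path strings (our own condensation of Figure 8; Steps (7)-(10) are omitted):

```python
# Simplified sketch of the resolving procedure in Figure 8, Steps (1)-(6).
# Each mapping result Mi is a set of category path strings.

def is_parent(p, c):
    return c.startswith(p + "/")

def resolve(interest_maps, threshold=3):
    """interest_maps: list of (M1, M2, M3, M4) tuples, one per interest."""
    final, candidates = [], []
    for m1, m2, m3, m4 in interest_maps:
        others = m2 | m3 | m4
        hit = next((c for c in m1
                    if c in others or any(is_parent(c, o) for o in others)),
                   None)
        if hit:                        # Step (2): trust m1 when others agree
            final.append(hit)
        elif m2 & m3 & m4:             # Step (3): m2, m3, m4 unanimous
            final.append(next(iter(m2 & m3 & m4)))
        else:                          # Step (4): defer to frequency counting
            candidates.extend(m1 | m2 | m3 | m4)
    counts = {}                        # Steps (5)-(6): cross-interest overlap
    for c in candidates:
        counts[c] = counts.get(c, 0) + 1
    final.extend(c for c, n in counts.items() if n >= threshold)
    return final

maps = [({"Computers/Programming/Languages"},
         {"Computers/Programming/Languages/C++"}, set(), set())]
resolved = resolve(maps)
```

In this invented example, m1's category is a parent of m2's, so Step (2) accepts the m1 category, matching the priority the procedure gives to m1-category-label.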
To determine appropriate values for N (number of nouns) and M (number of NPs)
for m3-category-profile-noun and m4-category-profile-np, we tested different
combinations of values ranging from 1 to 3 with the 56 computer interests. According to
the number of correctly mapped interests, choosing the two most frequently cooccurring
nouns and one most frequently cooccurring NP offers the best mapping result (see Table
1 for some examples of identified nouns and NPs). With the 56 interests, Table 3
compares the number of correctly mapped interests when different mapping methods are
combined. Using all four mapping methods provides the best results; 39 of the 56
interests were correctly mapped onto ODP categories. The resolving procedures in Figure
8 thus are based on four mapping methods. When using three methods, we adjusted the
procedures accordingly, such as setting the thresholds in Steps (6) and (10) to two instead
of three.
Table 4 lists mapped and resolved categories for some interests in computer and
finance domains.
After the automatic resolving procedures, mapped categories for some interests
may not be resolved because different mapping methods generate different categories.
Table 3.
Comparison of Combined Mapping Methods

Combination of mapping methods        m1+m2+m3  m1+m2+m4  m1+m3+m4  m1+m2+m3+m4
Number of correctly mapped interests      34        35        32         39
Precision*                             60.7%     62.5%     57.1%      69.6%
* Recall and F1 were the same as precision because the number of mapped interests was 56.
Unresolved interests can be handled by having the user manually map them onto the ODP
taxonomy. An alternative approach could use an unresolved user interest as a query to a
search engine (in a manner similar to m3-category-profile-noun and m4-category-profile-
np), then combine the search results, such as page titles and snippets, to compose an ad
hoc category profile for the interest. Such a profile could flexibly represent any interest
and avoid the limitation that a taxonomy contains only a finite set of categories. It would
be worthwhile to examine the effectiveness of such ad hoc category profiles in a future
study. In this article, user interests are fully mapped and resolved to ODP categories.
These four steps are performed just once for each user, possibly during a software
installation phase, unless the user’s interest profile changes. To reflect such a change in
interests, our system can automatically update the mapping periodically or allow a user to
request an update from the system. As shown in Figure 7, the first four steps can be
performed in a client-side server, such as a machine on the organization’s intranet, and
the category profiles can be shared by each user’s machine.
Finally, user interests, even long-term professional ones, are dynamic in nature. In
the future, we will explore more techniques to learn about and fine-tune interest mapping
and handle the dynamics of user interests.
3.5 Step 5: Categorizing Search Results
When a user submits a query, our system obtains search results from Google and
downloads the content of up to the top-50 results which correspond to the first five result
pages.

Table 4.
Resolved Categories

Domain    Interest                       ODP category
Computer  C++                            computers/programming/languages/c++
Computer  IBM DB2                        computers/software/databases/ibm db2
Computer  Java                           computers/programming/languages/java
Computer  Machine Learning               computers/artificial intelligence/machine learning
Computer  Natural Language Processing    computers/artificial intelligence/natural language
Computer  Object Oriented Programming    computers/software/object-oriented
Computer  Text Mining                    reference/knowledge management/knowledge discovery/text mining
Computer  UML                            computers/software/data administration *
Computer  Web Site Design                computers/internet/web design and development
Finance   Bonds                          business/investing/stocks and bonds/bonds
Finance   Day Trading                    business/investing/day trading
Finance   Derivatives                    business/investing/derivatives
Finance   Mutual Funds                   business/investing/mutual funds
Finance   Offshore Banking               business/financial services/offshore services
Finance   Risk Management                business/management/software *
Finance   Stocks Exchange                business/investing/stocks and bonds/exchanges
Finance   Technical Analysis             business/investing/research and analysis/technical analysis
Finance   Trading Cost                   business/investing/derivatives/brokerages
* Because the mapping and resolving steps are automatic, some resolved categories are erroneous.

The average number of result pages viewed by a typical user for a query is 2.35
[Jansen et al. 2000], and a more recent study [Jansen et al. 2005] reports that about
85–92% of users view no more than two result pages. Hence, our system covers
approximately double the number of results normally viewed by a search engine user. On
the basis of page content, the system categorizes the results into various user interests. In
PCAT, we employ a user’s original interests as class labels rather than the ODP category
labels because the mapped and resolved ODP categories are associated with user
interests. Therefore, the use of ODP (or any other Web directory) is transparent to the
user. A Web page that corresponds to a search result is categorized by (1) computing the
cosine similarity between the page content and each of the category profiles of the
mapped and resolved ODP categories that correspond to user interests and (2) assigning
the page to the category with the maximum similarity if the similarity is greater than a
threshold. If a search result does not fall into any of the resolved user interests, it is
assigned to the Other category.
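The two-step assignment above can be sketched as follows. The profile representation is simplified to raw term counts, and the toy profiles are illustrative; the actual system builds its profiles from ODP category data and applies stemming and stop-word removal first.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.1  # the threshold value chosen in Section 3.6

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(page_terms, interest_profiles):
    """Assign a result page to the interest whose category profile is
    most similar, or to 'Other' if no similarity exceeds the threshold."""
    best, best_sim = "Other", SIM_THRESHOLD
    for interest, profile in interest_profiles.items():
        sim = cosine(page_terms, profile)
        if sim > best_sim:
            best, best_sim = interest, sim
    return best

# Toy profiles standing in for real category profiles:
profiles = {
    "Java": Counter("java class jvm bytecode java".split()),
    "Bonds": Counter("bond yield coupon treasury".split()),
}
page = Counter("java tutorial class example".split())
```

Here `categorize(page, profiles)` returns "Java", while a page whose terms overlap with no profile falls into the Other category.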
The focus of our study is to explore the use of PCAT, an implementation based on
the proposed approach, and compare it with LIST and CAT. With regard to interest
mapping and result categorization (classification problems), we choose the simple and
effective cosine similarity instead of comparing different classification algorithms and
selecting the best one.
3.6 Implementation
We developed three search systems12 with different interfaces to display search
results, and the online searching portion was implemented as a wrapper on Google search
engine using the Google Web API.13 Although the current implementation of our
approach uses a single search engine (Google), following the metasearch approach
[Dreilinger and Howe 1997], it can be extended to handle results from multiple engines.
Because Google has become the most popular search engine,14 we use Google’s search
12 In experiments, we named the systems A, B, or C; in this article, we call them PCAT, LIST, or CAT, respectively.
13 http://www.google.com/apis/.
14 http://www.comscore.com/press/release.asp?press=873.
results to feed the three systems. That is, the systems have the same set of search results
for the same query; recall that LIST can be considered very similar to Google. For
simplicity, we limit the search results in each system to Web pages in HTML format. In
addition, for a given query, each of the systems retrieves up to 50 search results.
PCAT and CAT download the contents of Web pages that correspond to search
results and categorize them according to user interests and ODP categories, respectively.
For faster processing, the systems use multithreading for simultaneous HTTP connections
and download up to 10KB of text for each page. It took our program about five seconds
to fetch 50 pages. We note that our page-fetching program is not an industry-strength
module, and much better concurrent download speeds have been reported in other works
[Hafri and Djeraba 2004; Najork and Heydon 2001]. Hence, we feel that our page-
fetching time can be greatly reduced in a production implementation. After fetching the
pages, the systems remove stop words and perform word stemming before computing the
cosine similarity between each page content and a category profile. Each Web page is
assigned to the category (and its associated interest for PCAT) with the greatest cosine
similarity. However, if the similarity is not greater than a similarity threshold, the page is
assigned to the Other category. We determined the similarity threshold by testing query
terms from “irrelevant” domains (not relevant to any of the user’s interests). For example,
given that our user interests are related to computer and finance, we tested ten irrelevant
queries, such as NFL, Seinfeld, allergy, and golden retriever. For these irrelevant queries,
when we set the threshold at 0.1, at least 90% (often 96% or higher) of retrieved results
were categorized under the Other category. Thus we chose 0.1 as our similarity threshold.
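The threshold calibration can be sketched roughly as follows. The maximum-similarity scores and candidate thresholds below are hypothetical, standing in for the results retrieved by the ten irrelevant queries.

```python
def fraction_other(max_similarities, threshold):
    """Fraction of retrieved results that would fall into 'Other', where
    each result is represented by its maximum cosine similarity to any
    interest profile (a result goes to Other when that maximum does not
    exceed the threshold). Illustrative sketch only."""
    return sum(1 for s in max_similarities if s <= threshold) / len(max_similarities)

# Hypothetical maximum similarities for one irrelevant query's results:
max_sims = [0.02, 0.05, 0.03, 0.11, 0.04, 0.06, 0.01, 0.08, 0.07, 0.02]

# Pick the smallest candidate threshold that routes at least 90% of
# irrelevant results into the Other category:
candidates = [0.05, 0.1, 0.2]
chosen = next(t for t in candidates if fraction_other(max_sims, t) >= 0.9)
```

With these hypothetical scores the procedure selects 0.1, mirroring the calibration described in the text.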
The time for classifying results according to user interests in PCAT is negligible (tens of
milliseconds). However, the time for CAT is three orders of magnitude greater than that
for PCAT because the number of potential categories for CAT is 8,547, whereas the
number of interests in PCAT is fewer than 8.
Figure 9 displays a sample output from PCAT for the query of regular expression.
Once a user logs in with his or her unique identification, PCAT displays a list of the
user’s interests on top of the GUI. After a query is issued, search results are categorized
into various interests and displayed in the result area, as shown in Figure 9. A number
next to the interest indicates how many search results are classified under that interest; if
there is no classified search result, the interest will not be displayed in the result area.
Under each interest (category), PCAT (CAT) shows no more than three results on the
main page. If more than three results occur under an interest or category, a More link
appears next to the number of results. (In Figure 9, there is a More link for the interest of
Java.) Upon clicking this link, the user sees all of the results under that interest in a new
window as shown in Figure 10.
Figure 9. Sample Output of PCAT. Category titles are user interests mapped and
resolved to ODP categories
Figure 11 displays a sample output of LIST for the same query of regular
expression and shows all search results in the result area as a page-by-page list. Clicking
a page number causes a result page with up to ten results to appear in the result area of
the same window. For the search task in Figure 11, the first relevant document is shown
as the sixth result on page 2 in LIST.
Figure 11. Sample Output of LIST
Figure 12 displays a sample output for CAT in which the category labels in the
result area are ODP category names sorted alphabetically such that output categories
under business are displayed before those under computers.
We now describe some of the features of the implemented systems that would not
appear in a production system but are meant only for experimental use. We predefined a
set of search tasks that specified what information, and how many Web pages, needed to
be found; the subjects used these tasks to conduct searches during the experiments.
(Section 4.2.2 describes the search tasks in more detail.)

Figure 12. Sample Output of CAT. Category labels are ODP category titles.

Each search result consists of a page title, snippet, URL, and a link called relevant15
next to the title. Except for the relevant link, the
items are the same as those found in typical search engines. A subject can click the
hyperlinked page title to open the page in a regular Web browser, such as Internet
Explorer. The subject determines whether a result is relevant to a search task by looking
at the page title, snippet, URL, and/or the content of the page.
Many of our search tasks require subjects to find one relevant Web page for a task,
but some require two. In Figure 9, the task requires finding two Web pages, which is also
indicated by the number 2 at the end of the task description. Once the user finds enough
relevant pages, he or she can click the Next button to proceed to the next task; clicking on
Next before enough relevant page(s) have been found prompts a warning message, which
allows the user to either give up or continue the current search task.
We record search time, or the time spent on a task, as the difference between the
time that the search results appear in the result area and the time that the user finds the
required number of relevant result(s).
15 When a user clicks on the relevant link, the corresponding search result is treated as the answer or solution for the current search task. This clicked result is considered as relevant, and is not necessarily the most relevant among all search results.
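The timing rule above can be sketched as follows; the class and method names are illustrative rather than taken from the actual implementation.

```python
import time

class TaskTimer:
    """Sketch of the search-time measurement: the clock starts when the
    results appear in the result area and stops once the required number
    of relevant results has been marked."""
    def __init__(self, required_relevant):
        self.required = required_relevant
        self.start = None
        self.marked = 0
        self.elapsed = None

    def results_displayed(self):
        # Called when the result area is populated for the current task.
        self.start = time.monotonic()

    def mark_relevant(self):
        # Called each time the subject clicks a relevant link.
        self.marked += 1
        if self.marked >= self.required and self.elapsed is None:
            self.elapsed = time.monotonic() - self.start
```

A task requiring two relevant pages would only record its elapsed time after the second relevant click.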
CHAPTER 4
EXPERIMENTS
We conducted two sets of controlled experiments to examine the effects of
personalization and categorization. In experiment I, we compare PCAT with LIST, that
is, a personalized system that uses categorization versus a system similar to a typical
search engine. Experiment II compares PCAT with CAT in order to study the difference
between personalization and nonpersonalization, given that categorization is common to
both systems. These experiments were designed to examine whether subjects’ mean log
search time16 for different types of search tasks and query lengths varied between the
compared systems. The metric evaluates the efficiency of each system because all three
systems return the same set of search results for the same query. Before experiment I, we
conducted a preliminary experiment comparing PCAT and LIST with several subjects
who did not later participate in either experiment I or II. The preliminary experiment
16 Mean log search time is the average log-transformed search time for a task across a group of subjects using the same system. We transformed the original search times (measured in seconds) with base 2 log to make the log search times closer to a normal distribution. In addition, taking the average makes the mean log search times more normally distributed.
helped us make decisions relating to experiment and system design. Next we introduce
our experiments I and II in detail.
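As a small illustration of the metric defined in footnote 16, with hypothetical per-subject search times for one task:

```python
import math

def mean_log_search_time(times_seconds):
    """Average of the base-2 log-transformed search times for one task
    across the subjects in a group, as defined in footnote 16."""
    return sum(math.log2(t) for t in times_seconds) / len(times_seconds)

# Hypothetical search times (in seconds) for one task in one group:
times = [16, 32, 64, 8]
m = mean_log_search_time(times)  # (4 + 5 + 6 + 3) / 4 = 4.5
```

The log transform compresses occasional very long search times, which is why the transformed values are closer to normally distributed than the raw seconds.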
4.1 Studied Domains and Domain Experts
Because we were interested in personalizing search according to a user’s
professional interests, we chose two representative professional domains, computer and
finance, that appear largely disjoint.
For the computer domain, two of the authors, who are researchers in the area of
information systems, served as the domain experts. Both experts also have industrial
experience related to computer science. For the finance domain, one expert has a
doctoral degree and the other has a master’s degree in finance.
4.2 Professional Interests, Search Tasks, and Query Length
4.2.1 Professional Interests (Interest Profiles)
For each domain, the two domain experts manually chose several interests and
skills that could be considered fundamental, which enabled us to form a generic interest
profile that would be shared by all subjects within the domain. Moreover, the
fundamental nature of these interests allows us to recruit more subjects, leading to greater
statistical significance in our results. By defining some fundamental skills in the
computer domain, such as programming language, operating system, database, and
applications, the two computer domain experts identified six professional interests:
algorithms, artificial intelligence, C++, Java, Oracle, and Unix. Similarly, the two finance
experts provided seven fundamental professional interests: bonds, corporate finance, day
trading, derivatives, investment banking, mutual funds, and stock exchange.
4.2.2 Search Tasks
The domain experts generated search tasks on the basis of the chosen interest
areas but also considered different types of tasks, that is, Finding and Information
Gathering. The content of those search tasks includes finding a software tool, locating a
person’s or organization’s homepage, finding pages to learn about a certain concept or
technique, collecting information from multiple pages, and so forth. Our domain experts
predefined 26 nondemo search tasks for each domain as well as 8 and 6 demo tasks for
the computer and finance domains, respectively. The demo tasks were similar, but not
identical, to the nondemo tasks and therefore offered subjects some familiarity with both
systems before they started to work on the nondemo tasks. Nondemo tasks are used in
post-experiment analysis, while demo tasks are not. All demo and nondemo search tasks
belong to the categories of Finding and Information Gathering [Sellen et al. 2002] as
discussed in Section 2.2.4, and within the finding tasks, we included some Site Finding
tasks [Craswell et al. 2001].
4.2.3 Query Length
Using different query lengths, we specified four types of queries for search tasks
in each domain:
(1) One-word query (e.g., jsp, underinvestment)
(2) Two-word query (e.g., neural network, security line)
(3) Three-word query (e.g., social network analysis)
(4) Free-form query, which had no limitations on the number of words used
For a given task a user was free to enter any query word(s) of his or her own
choice that conformed to the associated query-length requirement, and the user could
issue multiple queries for the same task. For example, Table 5 shows some sample search
tasks, types of search tasks, and their associated query lengths.
Table 6 lists the distributions of search tasks and their associated query lengths.
For each domain, we divided the 26 nondemo search tasks and demo tasks into two
groups such that the two groups have the same number of tasks and distribution of query
lengths. During each experiment, subjects searched for the first group of tasks using one
system, and the second group of tasks using the other.
Table 5.
Examples of Search Tasks, Types of Tasks, and Query Lengths

Domain    Search task                                                 Type of search task     Query length
Computer  You need an open source IDE (Integrated Development         Finding                 one-word
          Environment) for C++. Find a page that provides any
          details about such an IDE.
Computer  You need to provide a Web service to your clients. Find     Information Gathering   two-word
          two pages that describe Web services support using Java
          technology.
Finance   Find a portfolio management spreadsheet program.            Finding                 three-word
Finance   Find the homepage of New York Stock Exchange.               Site Finding            free-form
Table 6.
Distribution of Search Tasks and their Associated Query Lengths

Experiment  Domain\Query length  One-word  Two-word  Three-word  Free-form  Total tasks
I & II      Computer             6         6         4           10         26
            Finance              8         6         6           6          26
We chose these different query lengths for several reasons. First, numerous
studies show that users tend to submit short Web queries with an average length of two
words. A survey by the NEC Research Institute in Princeton reports that up to 70% of
users typically issue a query with one word in Web searches, and nearly half of the
Institute’s staff—who should be Web-savvy (knowledge workers and researchers)—fail
to define their searches precisely with query terms [Butler 2000]. By collecting search
histories for a two-month period from 16 faculty members across various disciplines at a
university, Käki [2005] found that the average query length was 2.1 words. Similarly,
Jansen et al. [1998] find through their analysis of transaction logs on Excite that, on
average, a query contains 2.35 words. In yet another study, Jansen et al. [2000] report that
the average length of a search query is 2.21 words. From their analysis of users’ logs in
the Encarta encyclopedia, Wen et al. [2002] report that the average length of Web queries
is less than 2 words.
Second, we chose different query lengths to simulate different types of Web
queries and examine how these different types affect system performance. A prior study
follows a similar approach; in comparing the IntelliZap system with four popular search
engines, Finkelstein et al. [2002] set the length of queries to one, two, and three words
and allow users to type in their own query terms.
Third, in practice, queries are often incomplete or may not incorporate enough
contextual information, which leads to many irrelevant results and/or relevant results that
do not appear at the top of the list. A user then has two obvious options: enter a different
query to start a new search session or go through the long result list page-by-page, both
of which consume time and effort. From a study with 33,000 respondents, Sullivan
[2000] finds that 76% of users employ the same search engine and engage in multiple
search sessions on the same topic. To investigate this problem of incomplete or vague
queries, we associate search tasks with different query lengths to simulate the real-world
problem of incomplete or vague queries. We believe that categorization will present
results in such a way to help disambiguate such queries. Unlike Leroy et al. [2003], who
extract extra query terms from users’ behaviors during consecutive searches, we do not
modify users’ queries but rather observe how a result-processing approach (personalized
categorization of search results) can improve search performance.
4.3 Subjects
Prior to the experiments, we sent emails to students in the business school and the
computer science department of our university, as well as to some professionals in the
computer industry, to solicit their participation. In these emails, we explicitly listed the
predefined interests and skills we expected potential subjects to have. We also asked
several questions, including the following two self-reported ones:
(1) When searching online for topics in the computer or finance domain, what do you
think of your search performance (with a search engine) in general?
(a) slow (b) normal (c) fast
(2) How many hours do you spend on online browsing and searching per week (not
limited to your major)?
(a) [0, 7) (b) [7, 14) (c) [14+)
We verified their responses to ensure each subject possessed the predefined skills
and interests. After the experiments we did not manually verify the correctness of
subject-selected relevant documents. However, in our preliminary experiment with
different subjects, we manually examined all of the relevant documents chosen by
subjects, and we confirmed that, on average, nearly 90% of their choices were correct.
We assume that, with sufficient background, the subjects were capable of identifying
the relevant pages. Because we used PCAT in both experiments, no subject from
experiment I participated in experiment II. We summarize some demographic
characteristics of the subjects in Tables 7 through 9.
To compare the two studied systems for each domain, we divided the subjects into
two groups, such that subjects in one group were as closely equivalent to the subjects in
the other as possible with respect to their self-reported search performance, weekly
browsing and searching time, and educational status. We computed the mean log search
time for a task by averaging the log search times for each group.
Table 7.
Educational Status of Subjects

Experiment  Domain\Status  Undergraduate  Graduate  Professional  Total
I           Computer       3              7         4             14
            Finance        4              16        0             20
II          Computer       3              11        2             16
            Finance        0              20        0             20
Table 8.
Self-reported Performance on Search within a Domain

Experiment  Domain\Performance  Slow  Normal  Fast
I           Computer            0     8       6
            Finance             2     15      3
II          Computer            1     8       7
            Finance             2     11      7
Table 9.
Self-reported Time (Hours) Spent Searching and Browsing Per Week

Experiment  Domain\Time (hours)  [0, 7)  [7, 14)  [14+)
I           Computer             1       9        4
            Finance              5       10       5
II          Computer             2       7        7
            Finance              2       11       7
4.4 Experiment Process
In experiment I, all subjects used both PCAT and LIST and searched for the same
demo and nondemo tasks. As we show in Table 10, the program automatically switched
between PCAT and LIST according to the task numbers and the group identified by user
id, so users in different groups always used different systems for the same task. The same
system-switching mechanism was adopted in experiment II to switch between PCAT and
CAT.
Table 10.
Distribution of System Uses by Tasks and User Groups

Group      First-half demo tasks  Second-half demo tasks  Nondemo tasks 1–13  Nondemo tasks 14–26
Group one  PCAT                   LIST                    PCAT                LIST
Group two  LIST                   PCAT                    LIST                PCAT
CHAPTER 5
EVALUATIONS
In this chapter, we compare two pairs of systems (PCAT vs. LIST, PCAT vs.
CAT) on the basis of the mean log search time along two dimensions: query length and
type of task. We also test five hypotheses using the responses to a postexperiment
questionnaire provided to the subjects. Finally, we demonstrate the differences in the
indices of the relevant results across all tasks for the two pairs of systems.
5.1 Comparing Mean Log Search Time by Query Length
We first compared the two systems by different query lengths. Tables 11 and 12
contain the average mean log search times across tasks with the same query length, with
± 1 standard error, for different systems in the two experiments (lower values are better).
The last column of each table provides the average mean log search time across all 26
search tasks and ± 1 standard error. For most of the comparisons between PCAT vs.
LIST (Table 11) or PCAT vs. CAT (Table 12), for a given domain and query length,
PCAT has lower average mean log search times. We conducted two-tailed t-tests to
determine whether PCAT was significantly faster than LIST or CAT for different
domains and query lengths. Table 13 shows the degrees of freedom and p-values for
the t-tests. The numbers in bold in Tables 11 and 12 highlight the systems with
statistically significant differences (p < 0.05) in average mean log search times.

Table 11.
Average Mean Log Search Time across Tasks Associated with Four Types of Query
(PCAT vs. LIST)

Experiment I (PCAT vs. LIST)
Domain-System   One-word      Two-word   Three-word   Free-form   Total
Computer-PCAT
Computer-LIST
Finance-PCAT    3.97 ± 0.34
Finance-LIST    5.10 ± 0.26

Table 12.
Average Mean Log Search Time across Tasks Associated with Four Types of Query
(PCAT vs. CAT)

Experiment II (PCAT vs. CAT)
Domain-System   One-word      Two-word      Three-word
Computer-PCAT   4.14 ± 0.26   3.88 ± 0.19   4.30 ± 0.15
Computer-CAT    4.96 ± 0.26   4.94 ± 0.34   5.17 ± 0.17
Finance-PCAT    4.10 ± 0.35   4.46 ± 0.14
Finance-CAT     5.10 ± 0.25   5.11 ± 0.16

Table 13.
The t-test Comparisons (degrees of freedom, p-values)

Experiment           Domain    One-word   Two-word   Three-word   Free-form   Total
I (PCAT vs. LIST)    Computer  10, 0.058  10, 0.137  6, 0.517     18, 0.796   50, 0.116
                     Finance   14, 0.015  10, 0.370  10, 0.752    10, 0.829   50, 0.096
II (PCAT vs. CAT)    Computer  10, 0.147  10, 0.050  6, 0.309     18, 0.013   50, 0.001
                     Finance   14, 0.193  10, 0.152  10, 0.237    10, 0.041   50, 0.003
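The reported tests can be sketched as a two-sample t statistic with pooled variance. The per-task mean log search times below are illustrative, not the experimental values; the degrees of freedom follow n1 + n2 - 2, so six one-word computer tasks per system give df = 10, matching Table 13.

```python
import math

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance, plus its
    degrees of freedom (n1 + n2 - 2). Illustrative sketch; the exact
    test variant used in the experiments may differ in detail."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)  # sum of squared deviations
    ss2 = sum((v - m2) ** 2 for v in y)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

# Hypothetical mean log search times for six one-word computer tasks:
pcat_times = [3.9, 4.1, 4.4, 4.0, 4.3, 4.2]
list_times = [5.0, 4.8, 5.3, 5.1, 4.9, 5.2]
t_stat, df = two_sample_t(pcat_times, list_times)  # df = 10
```

A negative t statistic here corresponds to PCAT having the lower (faster) mean log search time.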
In Table 13, for both computer and finance domains, PCAT has a lower mean log
search time than LIST for one-word query tasks with greater than 90% statistical
significance. The two systems are not statistically significantly different for tasks
associated with two-word, three-word, or free-form queries. Compared with a long query,
a one-word query may be more vague or incomplete, so a search engine may not provide
relevant pages in its top results, whereas PCAT may show the relevant result at the top of
a user interest. The user therefore could directly jump to the right category in PCAT and
locate the relevant document quickly.
Compared with CAT, PCAT has a significantly lower mean log search time for
free-form queries (p < 0.05). The better performance of PCAT can be attributed to two
main factors. First, the number of categories in the result area for CAT is often large
(about 20), so even if the categorization is accurate, the user must still commit additional
search effort to sift through the various categories. Second, the categorization of CAT
might not be as accurate as that of PCAT because of the much larger number (8,547) of
potential categories which can be expected to be less helpful in disambiguating a vague
or incomplete query. The fact that category labels in CAT are longer than those in PCAT
may also have a marginal effect on the time needed for scanning them.
For all 26 search tasks, PCAT has a lower mean log search time than LIST or
CAT with 90% or higher statistical significance, except for the computer domain in
experiment I, which has a p-value of 0.116. When computing the p-values across all
tasks, we notice that the result depends on the distribution of different query lengths and
types of tasks. Therefore, it is important to drill down into the systems' performance for
each type of task.
For reference, Table 14 illustrates the systems’ performance in terms of the
number of tasks that had a lower mean log search time for each type of query length. For
example, the table entry 4 vs. 2 for one-word query in the computer domain of
experiment I indicates that four out of the six one-word query tasks had lower mean log
search time with PCAT, whereas two had a lower mean log search time with LIST.
5.2 Comparing Mean Log Search Time for Information Gathering Tasks
According to Sellen et al. [2002], during information gathering, a user finds
multiple pages to answer a set of questions. Figure 13 compares the mean log search
times of the ten search tasks in the computer domain in experiment I that required the
user to find two relevant results for each task. We sorted the tasks by the differences in
their mean log search times between PCAT and LIST. On average, PCAT allowed the
users to finish eight of ten Information Gathering tasks more quickly than LIST (t(18),
p = 0.005), possibly because PCAT already groups similar results into a given category.
Therefore, if one page in a category is relevant, the other results in that category are
likely to be relevant as well. This spatial localization of relevant results enables PCAT to
perform this type of task faster than LIST. For the computer domain, experiment II has a
similar result in that PCAT is faster than CAT (t(18), p = 0.007). Since the finance
domain contains only two Information Gathering tasks (too few to make a statistically
robust argument), we only report the mean log search times for these tasks in Table 15.
We observe that the general trend of the results for the finance domain is the same as for
the computer domain (i.e., PCAT has lower search times than LIST or CAT).

Table 14.
Numbers of Tasks with a Lower Mean Log Search Time

Experiment           Domain \ Query length   One-word   Two-word   Three-word   Free-form   Total
I (PCAT vs. LIST)    Computer                4 vs. 2    6 vs. 0    3 vs. 1      6 vs. 4     19 vs. 7
                     Finance                 6 vs. 2    5 vs. 1    3 vs. 3      3 vs. 3     17 vs. 9
II (PCAT vs. CAT)    Computer                4 vs. 2    5 vs. 1    3 vs. 1      10 vs. 0    22 vs. 4
                     Finance                 6 vs. 2    6 vs. 0    5 vs. 1      6 vs. 0     23 vs. 3
Figure 13. Mean Log Search Time by Task for the Ten Information Gathering Tasks in the Computer Domain (PCAT vs. LIST)
Table 15.
Mean Log Search Times for Information Gathering Tasks (Finance Domain)

                                Experiment I        Experiment II
                                PCAT     LIST       PCAT     CAT
Information Gathering task 1    6.33     6.96       6.23     7.64
Information Gathering task 2    4.62     5.13       4.72     5.61
5.3 Comparing Mean Log Search Time for Site Finding Tasks
In the computer domain, there were six tasks related to finding particular sites,
such as “Find the home page for the University of Arizona AI Lab.” All six tasks were
associated with free-form queries, and we note that the queries from all subjects
contained site names. Therefore, according to Craswell et al. [2001], those tasks were
Site Finding tasks. Table 16 shows the average mean log search times for the Site Finding
tasks and ± 1 standard error. There is no significant difference (t(10), p = 0.508) between
PCAT and LIST, as shown in Table 17. This result seems reasonable because for this
type of search task, LIST normally shows the desired result at the top of the first result
page when the site name is in the query. Even if PCAT tended to rank it at the top of a
certain category, users often found the relevant result faster with the LIST layout,
possibly because with PCAT the users had to move to a proper category first and then
looked for the relevant result. However, there is a significant difference between PCAT
and CAT (t(10), p = 0.019); again, the larger number of output categories in CAT may
have required more time for a user to find the relevant site, given that both CAT and
PCAT arrange the output categories alphabetically.
Table 16.
Average Mean Log Search Times for Six Site Finding Tasks in Computer Domain

Experiment           System  Average mean log search time
I (PCAT vs. LIST)    PCAT
                     LIST
II (PCAT vs. CAT)    PCAT    3.51 ± 0.12
                     CAT     4.46 ± 0.32
5.4 Comparing Mean Log Search Time for Finding Tasks
As Table 17 shows, for 16 Finding tasks in the computer domain, we do not
observe a statistically significant difference in the mean log search time between PCAT
and LIST (t(30), p = 0.592), but the difference between PCAT and CAT is significant
(t(30), p = 0.013). However, PCAT has lower average mean log search time than both
LIST and CAT. Similarly, for 24 Finding tasks in the finance domain, PCAT achieves a
lower mean log search time than both LIST (t(46), p = 0.101) and CAT (t(46), p = 0.002).
Of the 16 Finding tasks in the computer domain, 6 are Site Finding tasks, whereas the
finance domain has only 2 (of 24). To a certain extent, this situation confirms our
observations about Finding tasks in the computer domain. We conclude that PCAT had a
lower mean log search time for Finding tasks than CAT but not LIST.
5.5 Questionnaire and Hypotheses
After a subject finished the search tasks with the two systems, he or she filled out
a questionnaire with five multiple-choice questions designed to compare the two systems
Table 17.
The t-tests for Finding Tasks

Experiment           Domain    Type of task                       Degrees of freedom, p-value
I (PCAT vs. LIST)    Computer  Site Finding                       10, 0.508
                     Computer  Finding (including Site Finding)   30, 0.592
                     Finance   Finding (including Site Finding)   46, 0.101
II (PCAT vs. CAT)    Computer  Site Finding                       10, 0.019
                     Computer  Finding (including Site Finding)   30, 0.013
                     Finance   Finding (including Site Finding)   46, 0.002
in terms of their usefulness and ease of use. We use their answers to test several
hypotheses relating to the two systems.
5.5.1 Questionnaire
Subjects completed a five-item, seven-point questionnaire in which their
responses could range from (1) strongly disagree to (7) strongly agree. (The phrase
system B was replaced by system C in experiment II. As explained in footnote 12,
systems A, B, and C refer to PCAT, LIST, and CAT, respectively.)
Q1. System A allows me to identify relevant documents more easily than system B.
Q2. System B allows me to identify relevant documents more quickly than system A.
Q3. I can finish search tasks faster with system A than with system B.
Q4. It’s easier to identify one relevant document with system B than with system A.
Q5. Overall I prefer to use system A over system B.
5.5.2 Hypotheses
We developed five hypotheses corresponding to these five questions. (The phrase
system B was replaced by system C for experiment II.)
H1. System A allows users to identify relevant documents more easily than system B.
H2. System B allows users to identify relevant documents more quickly than system A.
H3. Users can finish search tasks more quickly with system A than with system B.
H4. It is easier to identify one relevant document with system B than with system A.
H5. Overall, users prefer to use system A over system B.
5.6 Hypothesis Test Based on Questionnaire
Table 18 shows the mean responses to each question in the questionnaire. Based on the
seven-point scale described in Section 5.5, we compute the numbers in this table by
coding strongly disagree as 1, strongly agree as 7, and so on.
Because each question in Section 5.5.1 corresponds to a hypothesis in Section 5.5.2, we
conducted a two-tailed t-test on subjects' responses to each question to test the
hypotheses. We calculated p-values by comparing the subjects' responses with the neutral
midpoint, neither agree nor disagree, which has a value of 4. The table shows that for both computer
and finance domains, H1, H3, and H5 are supported with at least 95% significance, and
H2 and H4 are not supported.17 The only exception to these results is that we find only
90% significance (p = 0.083) for H1 in the finance domain of experiment I. According to
17 For example, the mean choice in the computer domain for H2 was 2.36 with p < 0.001. According to our scale, 2 means disagree and 3 means mildly disagree, so a score of 2.36 indicates subjects did not quite agree with H2. Hence, we claim that H2 is not supported. The same is true for H4.
Table 18. Mean Responses to Questionnaire Items.
Degrees of Freedom: 13 for Computer and 19 for Finance in Experiment I; 15 for
Computer and 19 for Finance in Experiment II.

Experiment           Domain    Q1       Q2       Q3       Q4       Q5
I (PCAT vs. LIST)    Computer  6.21***  2.36***  5.43*    2.71*    5.57**
                     Finance   5.25     3.65*    5.45***  3.65**   5.40**
II (PCAT vs. CAT)    Computer  6.25***  2.00***  6.06***  2.50***  6.31***
                     Finance   6.20***  1.90***  6.20***  2.65*    6.50***
*** p < 0.001, ** p < 0.01, * p < 0.05.
these responses on the questionnaire, we conclude that users perceive PCAT as a system
that allows them to identify relevant documents more easily and quickly than LIST or
CAT.
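The one-sample test against the neutral midpoint (4) described above can be sketched in a few lines of Python. The response vector below is invented for illustration, not the study's actual data, and a p-value would additionally require a t-distribution CDF (e.g., from scipy.stats):

```python
import math

def one_sample_t(responses, mu=4.0):
    """t statistic for testing whether the mean response differs from mu."""
    n = len(responses)
    mean = sum(responses) / n
    # Sample variance with n - 1 degrees of freedom
    var = sum((x - mean) ** 2 for x in responses) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)

# Hypothetical 7-point responses of 14 subjects to one questionnaire item
responses = [6, 7, 5, 6, 7, 6, 5, 6, 7, 6, 5, 6, 6, 7]
t = one_sample_t(responses)   # a large positive t favors "agree"
df = len(responses) - 1       # 13, as in the computer domain of experiment I
```

A large |t| relative to the t-distribution with the given degrees of freedom yields the small p-values reported in Table 18.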
Several results reported in a recent work [Käki 2005] are similar to our findings.
In particular,
- Categories are helpful when document ranking in a list interface fails, which fits with our explanation of why PCAT is faster than LIST for short queries.
- When desired results are found at the top of the list, the list interface is faster, in line with our result and analysis pertaining to Site Finding tasks.
- Categories make it easier to access multiple results, consistent with our report for the Information Gathering tasks.
However, the categorization employed in Käki [2005] does not use examples to
build a classifier. The author simply identifies some frequent words and phrases in search
result summaries and uses them as category labels. Hence, each frequent word or phrase
becomes a category (label). A search result is assigned to a category if the result’s
summary contains the category label. Käki [2005] also does not analyze or compare the
two interfaces according to different types of tasks. Moreover, Käki [2005: Figure 4]
shows, though without explicit explanations, that categorization is always slower than a
list. This result contradicts our findings and several prior studies [e.g., Dumais and Chen
2001]. We note that the system described by Käki [2005] shows the search results in a
list interface by default, so a user may always look for a desired page in the list
interface first and switch to the category interface only if he or she does not find it
within a reasonable time.
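The labeling scheme attributed to Käki [2005] above can be sketched roughly as follows; the summaries, the frequency threshold, and the function name are invented for illustration:

```python
from collections import Counter

def frequent_word_categories(summaries, min_count=2):
    """Treat words occurring in at least min_count summaries as category labels,
    then assign each result to every category whose label its summary contains."""
    words = Counter()
    for s in summaries:
        words.update(set(s.lower().split()))   # count document frequency
    labels = [w for w, c in words.items() if c >= min_count]
    return {label: [i for i, s in enumerate(summaries)
                    if label in s.lower().split()]
            for label in labels}

# Hypothetical search-result summaries
summaries = ["java compiler tutorial", "java virtual machine", "python tutorial"]
cats = frequent_word_categories(summaries)
# "java" labels results 0 and 1; "tutorial" labels results 0 and 2
```

Unlike PCAT, no classifier is trained: a result lands in a category whenever its summary happens to contain the label.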
5.7 Comparing Indices of Relevant Results
To better understand why PCAT was perceived as faster and easier to use by the
subjects as compared with LIST or CAT, we looked at the indices of relevant results in
the different systems. An expert from each domain completed all search tasks using
PCAT and LIST. Using the relevant results identified by them, we compare the indices of
the relevant search results for the two systems, as we show in Figures 14 and 15.
We sort the tasks by the index differences between LIST and PCAT in ascending
order. Thus, the task numbers on the x-axis are not necessarily the original task numbers
in our experiments. Because PCAT organizes the search results into different categories
(interests), the index of a result reflects the relative position of that result under a
category. In LIST, a relevant result’s index number equals its relative position on the
particular page on which it appears plus ten (i.e., the number of results per page) times
the number of preceding pages. Thus, a result that appears in the fourth position on the
third page would have an index number of 24 (4 + 10 × 2). If users had to find two
relevant results for a task, we took the average of the indices. In Figure 14, PCAT and
LIST share the same indices in 10 of 26 tasks, and PCAT has lower indices than LIST in
15 tasks. In Figure 15, PCAT and LIST share the same indices in 7 of 26 tasks, and
PCAT has smaller indices than LIST in 18 tasks.
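The LIST index rule described above (relative position on the page plus ten times the number of preceding pages) reduces to a one-line function; the function name is ours:

```python
def list_index(page, position, results_per_page=10):
    """Index of a result at a 1-based position on a 1-based page in LIST."""
    return position + results_per_page * (page - 1)

# The example from the text: fourth position on the third page
assert list_index(page=3, position=4) == 24
```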
Similarly, Figures 16 and 17 show the indices of the relevant search results for PCAT
and CAT in experiment II. The data for PCAT in Figures 16 and 17 are the same as those in
Figures 14 and 15, and we sort the tasks by the index differences between PCAT and CAT
in ascending order. In Figure 16 for the computer domain, PCAT and CAT share the same
indices in 15 of 26 tasks, and CAT has lower indices in 6 tasks. In Figure 17 for the
finance domain, the two systems share the same indices in 10 of 26 tasks, and CAT has
lower indices in 14 of 26 tasks.
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-35); series: PCAT (Computer), LIST (Computer)]
Figure 14. Indices of Relevant Results in PCAT and LIST (Computer Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-45); series: PCAT (Finance), LIST (Finance)]
Figure 15. Indices of Relevant Results in PCAT and LIST (Finance Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-4.5); series: PCAT (Computer), CAT (Computer)]
Figure 16. Indices of Relevant Results in PCAT and CAT (Computer Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-9); series: PCAT (Finance), CAT (Finance)]
Figure 17. Indices of Relevant Results in PCAT and CAT (Finance Domain)
The indices for PCAT in Figures 14, 15, 16, and 17, and CAT in Figures 16 and
17 reflect an assumption that a user first jumps to the right category and then finds a
relevant page by looking through the results under that category. This assumption may
not always hold, so Figures 14 and 15 may be optimistic in favor of PCAT. However, if
the time taken to locate the right category is small (as is probably the case for PCAT),
the figures provide a possible explanation for some of the results we observe, such as
the lower search times for PCAT with one-word queries and Information Gathering tasks in
Experiment I. However, CAT has smaller index numbers for relevant results than PCAT,
which may seem to contradict the better performance (lower search time) for PCAT in
experiment II. We note that due to its nonpersonalized nature, CAT has a much larger
number of potential categories as compared to PCAT. Therefore, a user can be expected
to take a longer time to locate the right category (before jumping to the relevant result in
it) as compared to PCAT.
5.8 Discussion
This part of the dissertation presents an automatic approach to personalizing Web searches given a
set of user interests. The approach is well suited for a workplace setting where
information about professional interests and skills can be obtained automatically from an
employee’s resume or a database using an IE tool or database queries. We present a
variety of mapping methods which we combine into an interest-to-taxonomy mapping
framework. The mapping framework automatically maps and resolves a set of user
interests with a group of categories in the ODP taxonomy. Our approach then uses data
from ODP to build text classifiers to automatically categorize search results according to
various user interests. This approach has several advantages, in that it does not (1) collect
a user’s browsing or search history, (2) ask a user to provide explicit or implicit feedback
about the search results, or (3) require a user to manually specify the mappings between
his or her interests and taxonomy categories. In addition to mapping interests into
categories in a Web directory, our mapping framework can be applied to other types of
data, such as queries, documents, and emails. Moreover, the use of taxonomy is
transparent to the user.
We implemented three search systems: A (personalized categorization system,
PCAT), B (list interface system, LIST), and C (nonpersonalized categorization system,
CAT). PCAT followed our proposed approach and categorized search results according
to a user’s interests, whereas LIST simply displayed search results in a page-by-page list,
similar to conventional search engines, and CAT categorized search results using a large
number of ODP categories without personalization. We experimentally compared two
pairs of systems with different interfaces (PCAT vs. LIST and PCAT vs. CAT) in two
domains, computer and finance. We recruited 14 subjects for the computer domain and
20 subjects for the finance domain to compare PCAT with LIST in experiment I, and 16
in the computer domain and 20 in finance to compare PCAT with CAT in experiment II.
There was no common subject across the experiments. Based on the mean log search
times obtained from our experiments, we examined search tasks associated with four
types of queries. We also considered different types of search tasks to tease out the
relative performances of the compared systems as the nature of task varied.
We find that PCAT outperforms LIST for searches with short queries (especially
one-word queries) and for Information Gathering tasks; by providing personalized
categorization results, PCAT also is better than CAT for searches with free-form queries
and for both Information Gathering and Finding tasks. From subjects’ responses to five
questionnaire items, we conclude that, overall, users identify PCAT as a system that
allows them to find relevant pages more easily and quickly than LIST or CAT.
Considering the fact that most users (even noncasual users) often cannot issue appropriate
queries or provide query terms to fully disambiguate what they are looking for, a PCAT
approach could help users find relevant pages with less time and effort. In comparing two
pairs of search systems with different presentation interfaces, we realize that no system
with a particular interface is universally more efficient than the other, and the
performance of a search system depends on parameters such as the type of search task
and the query length.
5.9 Limitations and Future Directions
Our search tasks were generated on the basis of user interests. We recognize some
limitations of this experimental setup in adequately capturing the workplace scenario.
The first limitation is that some of the user interests may not be known in a real-world
application, and hence some search tasks may not correspond to known user interests.
Second, a worker may search for information that is unrelated to his or her job. In
both of these cases, tasks may not match up with any of the known interests. However,
these limitations reflect a general fact: personalization can benefit only from what
is known about the user. A future direction of research is to model the dynamics of user
interests over time.
For the purposes of a comparative study, we carefully separated the personalized
system (PCAT) from the nonpersonalized (CAT) one by maintaining a low overlap
between the two systems. This allows us to understand the merits of personalization
alone. However, we can envision a new system that is a combination of the current CAT
and PCAT systems.
In particular, the new system replaces the Other category in PCAT by adding
categories of ODP that match the results that are currently placed in the Other category.
A study of such a PCAT+CAT system could be a future direction for this research. An
interesting and related direction is a smart system that can automatically choose a proper
interface (e.g., categorization, clustering, list) to display search results on the basis of the
nature of the query, the search results, and the user interest profile (context).
As shown in Figures 9 and 12, for PCAT in experiments I and II and CAT in
experiment II, we rank the categories alphabetically but always leave the Other category
at the end.18 There are various alternatives for the order in which categories are displayed
such as by the number of (relevant) results in each category or by the total relevance of
results under each category. We recognize that choosing different methods may provide
different individual and relative performances. Also, CAT tends to show more categories
on the main page than PCAT. On one hand, more categories on a page may be a negative
factor for locating a relevant result. On the other hand, more categories provide more
results on the same page, which may speed up the discovery of a relevant result as
compared to clicking a More link to open another window (as in the PCAT system). We
think that the issues of category ordering and number of categories on a page deserve
further examination.
From the subjects’ log files we observed that when some of the subjects could not
find a relevant document under a relevant category due to result misclassification, they
moved to another category or tried a new query. Such a situation can be expected to
increase the search time for categorization-based systems. Thus, another direction of
future research is to compare different result classification techniques based on their
effect on mean log search time.
It would be worthwhile to study the performance of result categorization using
other types of data, such as titles and snippets (from search engine results), instead of
page content, which would save the time spent fetching Web pages. In addition, it may be
interesting to examine how a user could improve his or her performance in Internet
searches in a collaborative (e.g., intranet) environment. In particular, we would like to
measure the benefit the user can derive from the search experiences of other people with
similar interests and skills in a workplace setting.
18 For the computer domain in experiment I, PCAT shows C++ and Java before other alphabetically ordered interests, and the “Other” category is at the end.
CHAPTER 6
INTRODUCTION AND LITERATURE REVIEW
6.1 Introduction
Business news contains rich and current information about companies and the
relationships among them. Online business news from media companies (e.g., Reuters),
content providers (e.g., Yahoo!), and company Web sites offer readers timely
assessments of dynamic company relationships. Reading news, however, is time consuming
and requires a reader to possess certain skills, the most basic of which is a good
understanding of the language in which the news is written. Moreover, the huge volume of
news stories makes manual identification of relationships among a large number of
companies, without automated news analysis, nontrivial and unscalable.
For professional or personal finance-related interests, many people regularly spend
significant amounts of time scanning the news to monitor companies' recent financial
milestones. For tasks such as investment or market research, researchers often need to
compare a pair of companies or identify top-performing companies on the basis of
revenue. Company revenue relationships are dynamic, and information about them
may not be readily or continuously available. Public companies typically update their
earnings or balance sheet data on a quarterly basis, whereas the availability of private,
initial public offering (IPO), or foreign companies' financials is more limited overall.
Scanning the competitive environment of a company or a group of companies is
essential for supply chain, marketing, investment and strategic partnership
management. Once its competitors have been identified, a company can look for their
product lines, marketing strategies, directions of R&D, key personnel, customers, and
suppliers, and so on to potentially improve its competitive advantage. Analysts and
managers may resort to various options for discovering and monitoring competitor
relationships. These options may include: asking business associates (e.g., customers
or suppliers), reading news, searching on the Web, attending business conventions,
and looking through company profile resources such as Hoover’s19 and Mergent.20
While the availability of company profiling resources has reduced the search effort
and made some business relationship information easily accessible, the other
above-mentioned approaches, due to their largely manual nature, are still time
consuming and limited in scale. In addition, because they may use different criteria
in collecting and identifying information, businesses that provide company profiles
also suffer from a scalability problem due to limited resources, manpower, and budget,
leading to incomplete and inconsistent information. For example, Hoover's considers
Interchange Corp. a competitor of Google, while Mergent does not specify this
relationship. In contrast, Mergent includes Tercica Inc. as a competitor of
GlaxoSmithKline plc, while Hoover's does not. Therefore, it is important to explore
approaches to automatically discover important business relationships that can
19 Hoover’s, Inc., http://www.hoovers.com.20 Mergent Inc., http://www.mergentonline.com.
complement and extend existing time-consuming efforts. An automated approach also
allows for timely updates of business relationships, thus avoiding the information
staleness that can mar manual approaches.
Social network analysis (SNA) refers to a set of research procedures for
identifying and quantifying structures in a social network on the basis of relationships
among the nodes [Richards and Barnett 1993]. A social network consists of a set of
nodes, such as individuals or organizations, connected through edges that represent
various relationships (e.g., friendship, affiliation) [Wasserman and Faust 1994]; such
relationships tend to be simple to identify yet voluminous to analyze. Analyzing
quantitative measures of the information represented by the nodes and edges of social
networks has proved a feasible and effective way to discover network structures in
diverse fields, such as social and behavioral science, anthropology, psychology
[Scott 2000], and information science.
In this study, we present an approach that applies SNA and machine learning
techniques for automated discovery of business relationships. In particular, we study two
different relationships, CRR and competitor relationships, as two illustrative examples of
our approach. Figure 18 illustrates the main steps for discovery of the two relationships at
a high level. First, starting with a collection of news stories organized by company,
and given that a news story pertaining to a company often cites one or more other
companies, we identify company citations in the news stories, treat them as links from
the focal (source) companies to the cited (target) companies, and construct a directed,
weighted intercompany network. Further, we identify four types of network attributes
based on network topology. The four types of attributes differ in their coverage of the
intercompany network. Finally, we feed these identified attributes to classification
methods to predict the CRR and discover the competitor relationship between two companies.
Figure 18. A High-Level Process View for Studying CRR and Competitor Relationship
This approach is effective and scalable for business relationship screening, and can be
extended for automated discovery of a broad range of business relationships. Moreover,
the approach is language neutral (i.e., we do not analyze the vocabulary or grammar in
news stories to find relationships). This last feature of the approach can help extend it to
news written in languages other than English.
6.2 Literature Review
Many researchers in areas such as organization behavior and sociology have
investigated the nature and implications of social networks created by business
relationships. For example, Levine [1972], using a network of interlocked directorates
between major banks and large industrial companies, constructs a map of the sphere of
influence that provides a quick (though approximate) overview of the relations (e.g.,
well-linked bank–company ties) in the network. Walker et al. [1997] examine an
interfirm network on the basis of cooperative relationships from a commercial directory
of biotechnology firms. Using regression techniques with ten independent variables, they
demonstrate that network structure strongly influences the choices of a biotechnology
startup in terms of establishing new relationships (licensing, joint venture, and R&D
partnership) with other companies. Uzzi [1999] investigates how social relationships and
networks affect a firm’s acquisition and cost of capital. Gulati and Gargiulo [1999]
demonstrate that an existing interorganizational network structure affects the formation of
new alliances, which eventually modifies the existing network. A major difference
between those prior studies and ours is that the prior works construct a social network
using explicitly given relationships from gold-standard data sources, whereas we try to
predict a business relationship, i.e., CRR, between two companies using structural
attributes derived from a citation-based intercompany network.
Research in information retrieval and bibliometrics has previously exploited SNA
and graph-theoretic techniques on networks of documents. These works consider implicit
signals, such as URL links, email communications, or article citations, as links between
nodes and study problems such as identifying the importance of individual nodes in the
network [e.g., Brin and Page 1998; Kleinberg 1999; Garfield 1979] and communities on the
Web [e.g., Kautz et al. 1997; Gibson et al. 1998], rather than discovering business
relationships between companies.
For example, articles such as scholarly publications can be considered to be
connected with one another through citations. A citation index indexes the citations
among such articles [Garfield 1979]. Using a citation index, a researcher can find not
only articles that a given article cites but also articles that cite the given article. CiteSeer
[Giles et al. 1998] is an example of an autonomous citation indexing system that
retrieves, indexes, and builds bibliographic and citation databases from research articles
on the Web. Furthermore, analyses of the networks created by citations have led to
various measures of prestige and the impact of published articles and the journals in
which they appear. Some measures closely resemble measurements of Web page
popularity [Brin and Page 1998] used by Web search engines such as Google.
Park [2003] identifies hyperlink network analysis as a subset of SNA, in which
nodes are Web sites and the relationships are URL links among sites. In such a network,
the linkages among sites reflect the authority, prestige, or trust of the sites [Kleinberg
1999; Palmer et al. 2000]. Brin and Page [1998] propose the PageRank algorithm to rank
the nodes (pages) of the WWW network, with directed URL links among pages, and use the
resulting page ranks to order search results. Kleinberg [1999] presents the
Hyperlink-Induced Topic Search (HITS) algorithm to compute hub and authority importance
measures for each node (page), also based on the link structure of the WWW.
Bernstein et al. [2002] apply a commercial information extraction system to
extract company entities from Yahoo! business news and posit that two companies have a
relationship (link) if they appear in the same piece of news (cooccurrence approach). The
network, which consists of 1,790 identified companies and in which links between two
companies are undirected and unweighted (binary weight), illustrates some central
industry players. They further filter out nodes in the network to produce a smaller
network with 315 companies and 1,047 links, which they use to count how many other
companies are connected with each company, rank all companies by the counts, and
indicate that some of the 30 top-ranked companies in the computer industry are also
Fortune 1000 companies. Hence, their result indicates that companies with high revenues
tend to be linked to many other companies in a network derived purely from news stories.
Their work is somewhat similar to our study, in that they use online business news to
construct an intercompany network. However, unlike Bernstein et al. [2002], we qualify
links in the constructed network by both direction and weight. Furthermore, unlike the
above-mentioned research, we employ various graph-based metrics to predict the CRR
between any pair of companies linked in the network, which contains tens of thousands
of such company pairs.
CHAPTER 7
NETWORK-BASED ATTRIBUTES AND DATA
In this chapter, we first introduce relevant notation for directed graphs, followed by
notation for directed, weighted graphs. We then describe the data and data processing
procedures. To provide statistical insights into the data, we report distributions of the
various network attributes. Hereafter, we use the following pairs of terms
interchangeably: network and graph, node and company, and link and company pair (or pair
of companies).
7.1 Notation in Directed Graphs
Figure 19 presents a directed graph (digraph) that consists of four nodes joined by
eight directed links. More formally, a digraph Gd = (N, L) consists of a set of nodes N and
a set of links L, where
N = {n1, n2, …, nm} and
L = {l1, l2, …, lk}, where li = <nsource, ntarget>.
The node indegree, NID(ni), in a digraph is the number of nodes linked to ni; the
node outdegree, NOD(ni), is the number of nodes linked from ni [Wasserman and Faust
1994]. Node indegree, or a metric based on it, has often been used to represent authority
and prestige in many prior works [e.g., Brin and Page 1998; Kleinberg 1999]. In this
figure, NID(n1) and NOD(n1) are 3 and 2, respectively, and NID(n4) and NOD(n4) are 1 and 2.
Figure 19. Directed Graph
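These degree definitions can be computed directly from an edge list. Figure 19's exact links are not enumerated in the text, so the eight links below are a hypothetical set chosen only to be consistent with the reported degrees (NID(n1) = 3, NOD(n1) = 2, NID(n4) = 1, NOD(n4) = 2):

```python
# Hypothetical 8-link digraph on nodes n1..n4, consistent with the
# degrees reported for Figure 19
links = [("n2", "n1"), ("n3", "n1"), ("n4", "n1"),
         ("n1", "n2"), ("n1", "n3"),
         ("n4", "n2"), ("n3", "n4"), ("n2", "n3")]

def nid(node, links):
    """Node indegree: number of links pointing to the node."""
    return sum(1 for _, tgt in links if tgt == node)

def nod(node, links):
    """Node outdegree: number of links originating from the node."""
    return sum(1 for src, _ in links if src == node)

assert (nid("n1", links), nod("n1", links)) == (3, 2)
assert (nid("n4", links), nod("n4", links)) == (1, 2)
```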
7.2 Notation in Directed, Weighted Graphs
Web portals such as Yahoo! Finance21 and Google Finance22 provide news stories
arranged by company. A news story pertaining to a company (source company) often
cites one or more other companies, referred to as target companies. We consider each
company citation a directed link (outlink) from the source company to a target company,
and each citation adds one unit of weight to the link. Finally, the link weight between
the two companies is the accumulated citation count across a set of news stories.
21 http://finance.yahoo.com.
22 http://finance.google.com/finance.
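Accumulating per-citation unit weights into link weights, as described above, is a simple counting exercise; the citation events below are invented for illustration:

```python
from collections import Counter

# Hypothetical citation events: (source company, cited company), one per citation
citations = [("YHOO", "GOOG"), ("YHOO", "GOOG"), ("GOOG", "YHOO"),
             ("YHOO", "DELL"), ("DELL", "YHOO")]

# Link weight = accumulated citation count for each directed company pair
weights = Counter(citations)
assert weights[("YHOO", "GOOG")] == 2   # two citations give weight 2
assert weights[("GOOG", "YHOO")] == 1
```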
Figure 20 depicts a digraph in which each link carries a weight. It is a very small
portion of the intercompany network that consists of five companies/nodes joined by 15
directed and weighted links. More formally, a weighted digraph Gwd = (N, L, W) includes
N, L, and a weight vector W associated with the set of links, where W = (w1, w2, …, wk).
Figure 20. Directed, Weighted Graph
DELL: Dell Inc., INCX: Interchange Corp., GOOG: Google Inc., JPM: JP Morgan Chase
& Co., YHOO: Yahoo! Inc.
We derive from the intercompany network various attributes that characterize
either a node (one value for each node) or a pair of nodes (one value for each pair). We
divide the attributes into four types on the basis of the range of the network covered
in computing them and describe these attributes as follows.
7.2.1 Dyadic and Node Degree-based Attributes
We first introduce a group of dyadic degree-based attributes as follows.
Dyadic weighted indegree (DWID): DWID(ni, nj) is the weight of the link from nj to ni.
In Figure 20, DWID(YHOO, GOOG) is 478.
Dyadic weighted outdegree (DWOD): DWOD(ni, nj) is the weight of the link from ni to nj.
Again, based on Figure 20, DWOD(YHOO, GOOG) is 512. We note that DWID(YHOO,
GOOG) and DWOD(YHOO, GOOG) are both large (as compared to other pairs) and
almost equal. News stories about two competing companies can be expected to
frequently cite the other company, and the volume of citations for each company
can be expected to be almost equal when there is no absolute winner (e.g., a
monopoly).
Dyadic weighted netdegree (DWND)
DWND(ni, nj) = DWOD(ni, nj) – DWID(ni, nj) (1)
Hence, DWND(YHOO, GOOG) = 512 – 478 = 34 shows a net flow of citations toward
GOOG for the pair <YHOO, GOOG>. The positive net flow to GOOG may indicate its
slight dominance as reflected by news citations.
Dyadic weighted inoutdegree (DWIOD)
DWIOD(ni, nj) = DWOD(ni, nj) + DWID(ni, nj) (2)
Again, DWIOD(YHOO, GOOG) = 990, which is relatively large compared to the other
links in the example network. A large DWIOD value may indicate a strong
relationship between the given pair of companies.
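Using the link weights of Figure 20 that are reported in the text (512 from YHOO to GOOG and 478 in the other direction), the four dyadic attributes can be computed as follows; the helper names are ours:

```python
# Directed link weights reported in the text for Figure 20
w = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478}

def dwod(ni, nj):
    """Dyadic weighted outdegree: weight of the link from ni to nj."""
    return w.get((ni, nj), 0)

def dwid(ni, nj):
    """Dyadic weighted indegree: weight of the link from nj to ni."""
    return w.get((nj, ni), 0)

def dwnd(ni, nj):
    """Dyadic weighted netdegree, Equation (1): net flow of citations."""
    return dwod(ni, nj) - dwid(ni, nj)

def dwiod(ni, nj):
    """Dyadic weighted inoutdegree, Equation (2): total flow of citations."""
    return dwod(ni, nj) + dwid(ni, nj)

assert dwnd("YHOO", "GOOG") == 34     # net flow toward GOOG
assert dwiod("YHOO", "GOOG") == 990   # large value: strong pairwise relationship
```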
The dyadic nature of these attributes captures the flow of citations and hence
potential relationships between a pair of companies. However, dyadic attributes consider
only a pair of connected nodes. To take into account a given node’s neighbors, we
consider the following node degree-based attributes.
Node weighted indegree (NWID)
NWID(ni) = Σj≠i DWID(ni, nj) (3)
This measures the flow of citations from all companies in the network to the
given company. We expect “important” companies to possibly draw a large total
number of citations in news from other companies.
Node weighted outdegree (NWOD)
NWOD(ni) = Σj≠i DWOD(ni, nj) (4)
This measures the flow of citations from the given company to all other
companies in the network.
Node weighted inoutdegree (NWIOD)
NWIOD(ni) = NWID(ni) + NWOD(ni) (5)
This measures the overall flow of citations both to and from the given company
(ni). In essence, this attribute measures the overall connectivity of the given company
and all neighbor companies in the network independent of the direction of citations.
In Figure 20, for node n1 (YHOO), the NWID, NWOD, and NWIOD values are
513, 541, and 1054, respectively. If a pair of companies has a large DWIOD value as
well as large individual NWIOD values, it may suggest that the two companies have a
strong relationship and are both important players.
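The node degree-based attributes sum the dyadic weights over all neighbors. Only YHOO's totals (NWID = 513, NWOD = 541) and its link pair with GOOG (512/478) are reported in the text, so the remaining weights below are invented purely to make those totals come out:

```python
# Hypothetical link weights; only the YHOO<->GOOG pair (512/478) and YHOO's
# totals (NWID = 513, NWOD = 541) come from the text
w = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478,
     ("YHOO", "DELL"): 15,  ("DELL", "YHOO"): 20,
     ("YHOO", "INCX"): 8,   ("INCX", "YHOO"): 10,
     ("YHOO", "JPM"): 6,    ("JPM", "YHOO"): 5}

def nwid(n):
    """Equation (3): total citation flow into company n."""
    return sum(wt for (src, tgt), wt in w.items() if tgt == n)

def nwod(n):
    """Equation (4): total citation flow out of company n."""
    return sum(wt for (src, tgt), wt in w.items() if src == n)

def nwiod(n):
    """Equation (5): overall connectivity of company n."""
    return nwid(n) + nwod(n)

assert (nwid("YHOO"), nwod("YHOO"), nwiod("YHOO")) == (513, 541, 1054)
```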
7.2.2 Centrality-based Attributes
In addition to the dyadic and node degree-based measurements, we also use a
network analysis package [O'Madadhain 2006] to compute scores on the basis of three
different centrality/importance measuring schemas: PageRank [Brin and Page 1998],
HITS [Kleinberg 1999], and betweenness centrality [Brandes 2001]. These schemas
extend beyond immediate neighbors to compute the importance or centrality of a given
node in the whole network. The PageRank algorithm computes a popularity score for
each Web page on the basis of the probability that a “random surfer” will visit the page
[Brin and Page 1998]. The HITS algorithm generates a pair of scores, “hub” and
“authority,” for each page. Both HITS and PageRank compute principal eigenvectors of
matrices derived from graph representations of the Web [Kleinberg 1999], so our use of
them for a graph whose nodes are companies differs from their original use. As a node
centrality measurement, betweenness measures the extent to which a node lies between
the shortest paths of other nodes in the graph [Freeman 1979]. The three schemas do not
consider link weights. JUNG [2006] provides the node authority scores for HITS and
ignores the link direction when computing betweenness centrality. The intuition behind
these global centrality attributes is the same as that for the node degree-based
attributes, but the former are more informative because they consider the entire network
rather than focusing on immediate neighbors.
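The study computes these scores with the JUNG package; as an illustration of one of the three schemes, a minimal unweighted PageRank by power iteration (the form introduced by Brin and Page [1998]) looks roughly like the sketch below. The three-node graph is hypothetical:

```python
def pagerank(nodes, links, d=0.85, iters=50):
    """Unweighted PageRank by power iteration; link weights are ignored,
    consistent with the text's note that the centrality schemas do not use them."""
    out = {n: [t for s, t in links if s == n] for n in nodes}
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:                       # spread rank along outlinks
                share = d * pr[n] / len(out[n])
                for t in out[n]:
                    nxt[t] += share
            else:                            # dangling node: spread uniformly
                for t in nodes:
                    nxt[t] += d * pr[n] / len(nodes)
        pr = nxt
    return pr

# Hypothetical intercompany links: YHOO and DELL cite GOOG, GOOG cites YHOO
nodes = ["YHOO", "GOOG", "DELL"]
links = [("YHOO", "GOOG"), ("DELL", "GOOG"), ("GOOG", "YHOO")]
pr = pagerank(nodes, links)   # GOOG, cited by two companies, ranks highest
```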
7.2.3 Structural Equivalence (SE)-based Attributes
Lorrain and White [1971] define two nodes as structurally equivalent if they
have the same links to and from other nodes in the network. Because it is unlikely that
two nodes will be exactly structurally equivalent in our intercompany network, we use a
similarity metric to measure the degree to which two nodes are structurally equivalent.
The intercompany network is represented as a weighted N×N adjacency matrix, where N
is the number of nodes. The SE similarity between two nodes is the normalized dot
product (i.e., cosine similarity) of the two corresponding rows in the matrix, where a
matrix element can be a DWID, DWOD, or DWIOD value, thereby producing
DWID-, DWOD-, or DWIOD-based SE similarity, respectively. Intuitively, the DWID-based SE
similarity between company A and company B captures the overlap between companies
whose news stories cite A and companies whose news stories cite B (analogous to co-
citation [Small 1973]); the DWOD-based SE similarity reflects the overlap between
companies that news stories of A and B cite (analogous to bibliometric coupling [Kessler
1963]). A high overlap between neighbors of two nodes in our intercompany network
may be reflective of the overlap in their businesses or markets. Intuitively, this
phenomenon may indicate a competitor relationship. For example, for the sample graph
of Figure 20, the DWID-based SE similarity between n1 and n3, or YHOO and GOOG, is 0.98
out of a maximum possible value of 1.
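The SE similarity can be sketched as a cosine similarity over rows or columns of the adjacency matrix. This is a minimal illustration, not the dissertation's implementation: the mapping of DWID-based SE to in-link columns and DWOD-based SE to out-link rows is my reading of the text, and the toy matrix below is unrelated to Figure 20 (so the 0.98 value is not reproduced here).

```python
import numpy as np

def se_similarity(W, i, j, mode="in"):
    """Cosine similarity of two nodes' weighted link vectors.

    W is the NxN weighted adjacency matrix (W[a, b] = weight of the
    link a -> b).  mode="in" compares in-link patterns (DWID-based
    SE), mode="out" compares out-link patterns (DWOD-based SE), and
    mode="inout" concatenates both (DWIOD-based SE).
    """
    if mode == "in":
        u, v = W[:, i].astype(float), W[:, j].astype(float)
    elif mode == "out":
        u, v = W[i, :].astype(float), W[j, :].astype(float)
    else:
        u = np.concatenate([W[:, i], W[i, :]]).astype(float)
        v = np.concatenate([W[:, j], W[j, :]]).astype(float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Nodes 1 and 2 are cited with identical weights by the same company,
# so they are perfectly structurally equivalent on the in-link side.
W = np.array([[0, 2, 2],
              [0, 0, 0],
              [0, 0, 0]])
print(se_similarity(W, 1, 2, mode="in"))  # 1.0
```

A high in-link SE similarity between two companies means largely the same set of companies cites both, which is the co-citation intuition behind the competitor signal.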
For classifying whether a pair of companies are competitors, we use the above-described attributes. As noted earlier, some of the attributes have one value for a pair of
nodes (DWID, DWOD, DWIOD, and the three SE similarities), while others have a
value for each node in the pair (NWID, NWOD, NWIOD, pagerank, hits, and betweenness).
Hence, we use a total of 18 attributes for classifying the competitor relationship for a
company pair. Table 19 summarizes the four types of attributes and the range of the
network each covers.
7.3 Raw Data
Now we describe the source and nature of the raw data (news stories) and the
process by which we constructed the intercompany network from them. The first data set
consists of eight months (July 2005–February 2006) of business news for all companies
on Yahoo! Finance. Both Chapter 8 (predicting CRRs) and Chapter 9 (Discovering
Competitor Relationships) use this data set. In addition in Chapter 8 we use three more
months’ (October–December 2005) news stories from the first data set as a second data
set to validate the major results obtained from the first, but with the second data set we
Table 19.
Four Types of Network Attributes

Attribute Type        | Attributes                              | Range of Network Covered
Dyadic degree-based   | DWID, DWOD, DWIOD                       | A given node and only one directly connected node
Node degree-based     | NWID, NWOD, NWIOD                       | A given node and all directly connected nodes
Node centrality-based | pagerank, hits, betweenness             | Whole network
SE-based              | DWID-, DWOD-, DWIOD-based SE similarity | Any two nodes and their directly connected nodes in the whole network
study CRRs on the basis of quarterly revenues. In Section 9.2 we describe three smaller
data sets sampled from the first data set for discovering competitors.
7.4 Preliminary Data Processing
Yahoo! Finance organizes business news stories by company and date. The news
stories are not limited to those available from yahoo.com but also include those from
other news sources, such as forbes.com, thestreet.com, and businessweek.com. In other
words, URL links corresponding to news titles that have been organized under a company
in Yahoo! Finance may point to Web pages located at several domains. Taking advantage
of this organizing mechanism provided by Yahoo!, we consider that news stories
organized under a company belong to the company and identify all news pertaining to a
given company within a period of time. For example, for news belonging to Google and
dated February 28, 2006, a page containing both all news titles and their URLs linking to
news content is at http://finance.yahoo.com/q/h?s=GOOG&t=2006-02-28, where GOOG
is the stock ticker of Google Inc. We automatically construct similar URLs to gather links
of news stories for each company in Yahoo! Finance across the eight-month period. We
then programmatically fetch news stories corresponding to the links. Yahoo! may
organize the same piece of news under different companies; we treat such a news story as
belonging to each of the companies that Yahoo! identifies.
7.5 Node and Link Identification
A news story identifies a company according to its stock ticker on NYSE,
NASDAQ, or AMEX. If a piece of news pertaining to a company ni mentions another
company nj, we consider that there is a directed link from ni to nj, denoted as <ni, nj>. If
company nj is cited several times in the same piece of news, each citation adds to the
accumulated weight for the directed link. We aggregate citation frequency across all
news stories in a data set. Furthermore, we do not count self-references; therefore, we
ignore citations to company ni if they appear in a news story belonging to ni. For
example, if a news story pertaining to company n1 mentions the companies in the
sequence [n2, n1, n3, n4, n4, n2, n5], we derive the set of links and the weight vector as (<n1,
n2>, <n1, n3>, <n1, n4>, <n1, n5>) and (2, 1, 2, 1), respectively. We filter out news stories
that do not mention any other company. After we collected the annual revenues and news
stories for all companies across all nine sectors in Yahoo! Finance, we obtained a
total of 6,428 companies and 60,532 news stories. For the first data set, we note that the
early months (i.e., July–September 2005) included fewer news stories than later months,
because Yahoo! does not archive as many historical news stories as recent ones. In Table
20, we provide company and news distribution across the nine sectors in the first data set.
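The link-derivation rule above can be captured in a few lines. This hypothetical helper reproduces the worked example from the text: a story of n1 mentioning [n2, n1, n3, n4, n4, n2, n5] yields the weight vector (2, 1, 2, 1).

```python
from collections import Counter

def extract_links(story_company, mentions):
    """Derive weighted directed links from one news story.

    A story belonging to `story_company` contributes a link
    <story_company, m> for every mentioned company m, skipping
    self-references, with weight equal to the mention count.
    """
    counts = Counter(m for m in mentions if m != story_company)
    return {(story_company, m): c for m, c in counts.items()}

links = extract_links("n1", ["n2", "n1", "n3", "n4", "n4", "n2", "n5"])
print(links)
# {('n1', 'n2'): 2, ('n1', 'n3'): 1, ('n1', 'n4'): 2, ('n1', 'n5'): 1}
```

Aggregating the dictionaries produced per story (summing weights on repeated edges) then gives the accumulated link weights of the full intercompany network.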
7.6 Attribute Distributions
Several variables derived from social phenomena and networks, such as the Pareto
distribution of wealth and the frequency of word usage in the English language [Adamic
2002], follow a power law distribution. Recent research shows that several aspects of
Table 20.
Company and News Distribution across Sectors

Sector           | Number of Companies | Percentage of Companies | Number of News Stories | Percentage of News Stories
Basic materials  |   522 |  8.12% |  4398 |  7.27%
Conglomerates    |    30 |  0.47% |  1004 |  1.66%
Consumer goods   |   496 |  7.72% |  4947 |  8.17%
Financial        |  1402 | 21.81% |  5512 |  9.11%
Healthcare       |   706 | 10.98% |  7481 | 12.36%
Industrial goods |   423 |  6.58% |  2677 |  4.42%
Services         |  1334 | 20.75% | 13144 | 21.71%
Technology       |  1386 | 21.56% | 20723 | 34.23%
Utilities        |   129 |  2.00% |   646 |  1.07%
Total            |  6428 |   100% | 60532 |   100%
digital networks such as the Internet follow power law distributions as well. For example,
the rank and frequency of the outdegrees of Internet domains [Faloutsos et al. 1999] and
the indegree and outdegree of Web page links [Barabási et al. 2000, Broder et al. 2000,
Kumar et al. 1999] reflect power law distributions. With the directed, weighted
intercompany network, we observe similar power law distributions for various node
degree measurements (NID, NOD, NWID, and NWOD) and link weight. All logarithms
used in the distributions are base 10.
7.6.1 Node Indegree Distribution
Figure 21 shows that the distribution of node indegree (NID) follows a power law
distribution with a Pearson correlation at 0.945 (negative sign ignored). The distribution
indicates a few nodes (companies) attract most of the citations, similar to social
phenomena such as the distribution of wealth (Pareto distribution) [Adamic 2002]. We
observe similar power law distributions for other node degree measurements, such as
NOD, NWID, and NWOD. For brevity, we do not show their distribution plots herein.
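The power law check used here amounts to fitting a line on the log-log scale and reporting the Pearson correlation. A minimal sketch follows; the synthetic data obey an exact power law (count ∝ degree⁻²), so the log-log correlation is exactly -1, whereas the dissertation's empirical data yield magnitudes around 0.94-0.95.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def loglog_pearson(degree_counts):
    """Correlation of log10(degree) with log10(count); values near -1
    on this scale are consistent with a power law."""
    pts = [(math.log10(d), math.log10(c))
           for d, c in degree_counts if d > 0 and c > 0]
    return pearson([p[0] for p in pts], [p[1] for p in pts])

# Synthetic counts following count ~ degree^-2 exactly.
data = [(d, 10000 / d ** 2) for d in range(1, 50)]
print(round(loglog_pearson(data), 3))  # -1.0 for an exact power law
```

All logarithms are base 10, matching the convention stated for the distributions in this chapter.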
7.6.2 Link Weight Distribution
Figure 22 shows the link weight distribution in our intercompany network. The link
weight also follows the power law distribution with a Pearson correlation at 0.944. The
power law distribution of link weights indicates there are a few very strong links and
many weak ones.
Figure 22. Link Weight Distribution (log-log plot of Log(Weight) versus Log(Count))
7.6.3 Revenue Distribution
We choose one million dollars as the unit in which to record the revenue for each
company, group companies with similar logged revenues, and obtain the histogram in
Figure 23, which shows that the (logged) revenues across the 6,428 companies
approximately follow a normal distribution.
Figure 23. Revenue Distribution (histogram of Log(Revenue) versus Count)
7.6.4 Revenue Node Weighted Indegree Distribution
Figure 24 represents a plot of the logged revenues and logged node NWID of all
nodes, with a Pearson correlation of 0.534. Unlike in the prior three subsections, we find no
clear pattern for the two variables. In addition, we observe similar distributions for
logged revenue with NID, NOD, and NWOD.
Figure 24. Scatter Plot of Revenue and NWID (Log(Revenue) versus Log(NWID))
CHAPTER 8
PREDICTING COMPANY REVENUE RELATIONS
As explained in Section 7.5, in our approach nodes in an intercompany network
consist of companies mentioned in business news stories. When determining a link
between two nodes, unlike traditional SNA that uses explicit social relationships (e.g.,
common directorship [Levine 1972], cooperative business relationships [Walker et al.
1997]), we assume a directed link from company A to company B if a news story
pertaining to company A mentions (cites) company B. Moreover, a link from
company A to company B carries a weight that equals the total number of citations for
company B in a set of news stories belonging to company A. The direction and weight
should provide additional information about the flow and strength of business
relationships in the constructed network. Also, by noting the direction, we can examine
the effects of links coming into a node and those going away from it separately. The
weights in our network reflect the accumulated citations between a pair of companies and
enable us to quantitatively identify a relationship between two companies over time. We
identify a “netdegree” measurement (DWND) that combines the direction and weights to
provide an overall view of the relationship between a pair of companies. Hence, this
approach is more comprehensive than prior related literature on several dimensions,
including a richer network (with weights and direction), a new degree-based metric,
larger data sets, and various analyses related to business relationship prediction.
To illustrate business relationship prediction, in this chapter we focus on
predicting a (positive or negative) CRR between any pair of linked companies and further
estimate whether a company’s revenue is in the top-N (where N varies from 100 to 1000)
companies on the basis of the network structure. Before we present our research
questions in detail, we first describe how we measure CRR.
8.1 Measurements for CRR
As we mentioned in the introduction, a positive or negative revenue relation exists
between a pair of companies. However, when the two companies come from different
sectors, their (absolute) revenue values may not be comparable. Therefore, besides a
direct comparison of revenues in dollars, we derive the following three metrics to
determine a positive or negative CRR by taking the size of a sector into consideration:
Revenue rank, or the rank of the company's revenue in its sector, namely,
revenue rank(ni) ∈ [1, |sector(ni)|], where revenue rank(ni) is company ni's rank order
in its sector by revenue and |sector(ni)| is the total number of companies in the sector
to which company ni belongs.

Normalized revenue rank(ni) = revenue rank(ni) / |sector(ni)|   (6)

Revenue share(ni) = revenue(ni) / Σ nj∈sector(ni) revenue(nj)   (7)

where revenue(ni) is company ni's revenue value (in dollars).
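The sector-aware metrics can be sketched as follows. The normalization by sector size in (6) and by total sector revenue in (7) reflects my reading of those equations; the companies and revenue figures below are purely illustrative, and ties in revenue are ranked arbitrarily here.

```python
def sector_metrics(revenues):
    """Compute revenue rank, normalized revenue rank, and revenue
    share for every company in one sector.

    `revenues` maps company -> revenue in dollars; rank 1 is the
    largest revenue in the sector.
    """
    total = sum(revenues.values())
    size = len(revenues)
    ordered = sorted(revenues, key=revenues.get, reverse=True)
    out = {}
    for rank, company in enumerate(ordered, start=1):
        out[company] = {
            "rank": rank,                      # revenue rank
            "normalized_rank": rank / size,    # equation (6)
            "share": revenues[company] / total # equation (7)
        }
    return out

m = sector_metrics({"A": 500.0, "B": 300.0, "C": 200.0})
print(m["A"])  # rank 1, normalized_rank ~0.333, share 0.5
```

Because both (6) and (7) are relative to the sector, a CRR derived from them remains comparable for company pairs drawn from sectors of very different sizes.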
In Section 8.4, we report the detailed results measured by normalized revenue
ranks. The results measured by the other three metrics are similar and therefore are
not included herein.
8.2 Research Questions
We want to explore the broad hypothesis that attributes derived from a network
constructed from news stories can indicate meaningful business relationships (in
particular, CRR and top-N by revenue). Therefore, we identify attributes that capture the
pairwise relationships between companies (dyadic degree-based) or estimate the
individual importance of each company (node degree-based and node centrality-based).
In each case, the attributes are computed purely from weighted and directed links formed
by citations in news stories. In turn, based on the problem described previously and the
identified network-based attributes, we ask the following specific research questions:
(1) Is DWND, which captures the net flow of citations between a pair of companies,
an effective indicator of positive CRR?
(2) How well can the attributes derived purely from network structure, as shown in
Table 19 in Section 7.2, predict CRR for a pair of companies in the network?
(3) How does CRR prediction performance differ among the three groups of
attributes, which represent different amounts of network covered?
(4) Which of the network structure-based attributes (when combined linearly) are
significant in distinguishing positive and negative CRRs?
(5) How well can CRRs for pairs that flip their revenue relations at different time
periods be predicted?
(6) How well can individual importance measures of each company, such as node
degree- and centrality-based attributes, predict top-N revenue companies?
8.3 Research Methods
With Figure 25 we introduce the specific procedures and methods we use to
address our research questions. For our analysis with pairs of companies, we use DWND
to identify the source and target and ensure each pair is selected only once: if (ni, nj) is
identified as a pair, (nj, ni) cannot be selected. We sort all the links by their DWND
values in descending order and consider only those links whose DWND values are
greater than or equal to 0. For any link <ni, nj> in the network with a DWND value of 0,
we ignore the opposite link <nj, ni>. We identify 87,340 company pairs from the first data
set and use them to predict CRR. With this data set we also predict the top-N companies
by revenue and note that the range of netdegree values is 0–49.
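The pair-selection rule can be sketched as follows. One assumption is labeled explicitly: the netdegree is taken as DWND(<i, j>) = w(i → j) − w(j → i), which matches the "net flow of citations" description but is not spelled out as a formula in this excerpt. The edge list is illustrative.

```python
def select_pairs(links):
    """Select each company pair once, oriented by netdegree.

    Assumes DWND(<i, j>) = w(i -> j) - w(j -> i).  Only orientations
    with DWND >= 0 are kept; when DWND == 0 exactly one of the two
    symmetric links is retained.
    """
    pairs = {}
    for (i, j), w in links.items():
        dwnd = w - links.get((j, i), 0)
        if dwnd > 0 or (dwnd == 0 and (j, i) not in pairs):
            pairs[(i, j)] = dwnd
    return pairs

links = {("A", "B"): 5, ("B", "A"): 2,
         ("C", "D"): 3,
         ("E", "F"): 1, ("F", "E"): 1}
print(select_pairs(links))  # {('A', 'B'): 3, ('C', 'D'): 3, ('E', 'F'): 0}
```

Each unordered pair thus appears exactly once, oriented from the company with the larger outgoing citation flow, which is what allows source- and target-specific attributes to be defined consistently.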
8.3.1 Classification Methods
Using Weka [Witten and Frank 2005] as a data analysis tool, we employ two
classification methods to evaluate the CRR prediction performance for company pairs.
For our classification methods, we select logistic regression and C4.5 [Quinlan 1993]
Figure 25. Diagram of Methodology and Analysis Approaches
decision tree (i.e., J48 classifier in Weka). Logistic regression is frequently used in
business research for problems with a binary class label (as for our CRR prediction
problem); decision tree is one of the commonly used classifiers in data mining, because it
is highly accurate for binary classification problems, it does not impose assumptions
about the distribution of data, and its results are well suited for human interpretation
[Padmanabhan et al. 2006]. We use two different methods so we may compare their
performances for our applications. For each of the classification methods, we employ and
report results on the basis of 10-fold cross-validation. In line with standard metrics used
in data mining and information retrieval, we report precision, recall, and accuracy to
evaluate the performance of the predictive models, where TP, FP, TN, and FN denote
the numbers of true positives, false positives, true negatives, and false negatives,
respectively:

Precision = TP / (TP + FP)   (8)

Recall = TP / (TP + FN)   (9)

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (10)
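The evaluation setup described above can be sketched with scikit-learn in place of Weka; this is an analogy, not the dissertation's tooling, so LogisticRegression and DecisionTreeClassifier stand in for Weka's logistic regression and J48 (C4.5), and the feature matrix here is random stand-in data rather than the 12 network attributes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, accuracy_score

rng = np.random.default_rng(0)
# Stand-in feature matrix: 12 network attributes per company pair.
X = rng.normal(size=(500, 12))
# Synthetic binary class label driven by the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    pred = cross_val_predict(model, X, y, cv=10)  # 10-fold cross-validation
    print(type(model).__name__,
          round(precision_score(y, pred), 3),
          round(recall_score(y, pred), 3),
          round(accuracy_score(y, pred), 3))
```

Computing the metrics from out-of-fold predictions, as above, mirrors the 10-fold cross-validated precision, recall, and accuracy reported throughout this chapter.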
8.3.2 Discriminant Analysis with Logistic Regression
The main purpose of this chapter is to explore the power of structural attributes in
predicting CRR. However, we would also like to investigate the significance (if any) of
individual IVs in discriminating between positive and negative CRRs. Therefore we
perform a discriminant analysis using logistic regression. The linear manner in which
attributes are combined in logistic regression allows for a straightforward understanding of
their individual significance. In particular, from the 87,340 pairs in the first data set we
randomly select 1000 pairs such that each company in the chosen pairs is distinct. As a
result, there are 2000 unique companies in the 1000 pairs and hence these pairs are
considered independent. With 12 IVs (DWND and DWOD, NWID and NWOD for
source and target, and pagerank, hits, and betweenness scores for source and target) and
CRR as the dependent variable (DV), we employ binary logistic regression in SPSS
(version 12.0) to find the discriminant variables. In particular, we start with a base model
that uses the mean of the DV and does not include any IVs. Then, from a list of candidate
IVs that have statistically significant differences between the two DV groups, we add
one IV at each step by choosing the IV with the largest score statistic (method
"Forward: LR" in SPSS) until the stepwise estimation procedure stops (e.g., no remaining
IV is significant) [Hair et al. 2006].
8.3.3 CRR Flips
CRR flip refers to a CRR change (negative to positive or vice-versa) over two different
time periods. We would like to measure the prediction performance of our approach on
CRR flips that represent a more interesting subset of the data since they capture the
dynamics of CRR. We note that for this subset of data, a naïve approach of assuming that
CRR does not change would result in a precision, recall, and accuracy of 0%. We analyze
how well we predict the CRR among the flip pairs based on annual and quarterly revenue
data. For the first data set, we collect annual revenues for the year 2004. The flips are
identified by comparing 2004's annual revenues with revenues of the four quarters ending in April
2006. From the 87,340 pairs in the first data set, we find a total of 75,709 pairs that have
annual revenues in both time periods. For the four different CRR measurements (see
Section 8.1), about 4% of pairs flipped. With the two classifiers from Weka we run 10-
fold cross validation and report the prediction performance of CRR on all the 75,709
pairs and all the flip pairs.
With the second data set, we identify quarterly revenues for Q4 2005, Q1 and Q2
2006 from Yahoo! Finance to derive CRRs. We then identify flip pairs for time periods
of Q4–Q1 and Q4–Q2. For the four different CRR measurements, the percentage of flip
pairs is about 5%.
8.4 Results and Analyses
With the first data set, we first explore how DWND is associated with positive
CRR by determining whether the net flow of news citations between a pair of companies
indicates the relative size of their revenues. Then we report how well the various
attributes derived from network structure predict CRRs for company pairs. To tease out
the effects of the three different groups of attributes (dyadic degree-based, node
degree-based, and node centrality-based), we repeat the prediction experiment with each set of
attributes separately. Using logistic regression as discriminant analysis we report what
IVs are significant in distinguishing CRRs. With the CRR prediction results we further
examine the classification performance for flip pairs. For the second data set, we briefly
report results similar to those obtained by the first data set. In particular, we provide
prediction performance of CRR on the basis of Q4 2005. When analyzing CRR flips,
instead of using revenues from a previous time period, we compare revenues of Q4 2005
with those in the next two time periods (i.e., Q1 and Q2 2006), respectively. Then we
examine how well data collected in Q4 2005 can classify flip pairs identified at different
future time periods.
8.4.1 Positive CRR and Top Links
We sort all of the links in the network by their DWND values (in descending
order). Using a set of the top few links from the sorted list, we compute the percentage
that correctly reflects positive CRR. We then successively increase the number of top
links (T); in Table 21, we provide the number and percentage of the top links (where T
varies from 20 to a few hundred) that follow the positive CRR. We measure the
significance of the percentages in Table 21 through a binomial test. Finally, we note that
if the DWND were independent of CRR, the percentages would be close to 50%. When
the DWND values are relatively high, DWND seems to be a good indicator of positive
revenue relations.
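The significance test on these percentages can be reproduced directly: under the null hypothesis that DWND is independent of CRR, each top link follows positive CRR with probability 0.5, so the first row of Table 21 is a two-sided binomial test of 16 successes in 20 trials.

```python
from scipy.stats import binomtest

# Top T = 20 links: 16 of 20 follow positive CRR; null p = 0.5.
result = binomtest(16, n=20, p=0.5, alternative="two-sided")
print(round(result.pvalue, 4))  # ~0.0118, significant at p < 0.05
```

This matches the single asterisk (p < 0.05) reported for the T = 20 row; the larger rows, with more links at similarly high percentages, reach p < 0.001.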
Table 21.
Positive CRR for Top-N Links

Top Links (T) | DWND Range | Number of Links Following Positive CRR | Percentage of Links Following Positive CRR
 20 | [24, 49] |  16 | 80.0% *
 37 | [19, 49] |  31 | 83.8% ***
 64 | [16, 49] |  50 | 78.1% ***
 79 | [14, 49] |  58 | 73.4% ***
114 | [12, 49] |  80 | 70.2% ***
135 | [11, 49] |  92 | 68.2% ***
175 | [10, 49] | 115 | 65.7% ***
217 | [9, 49]  | 134 | 61.8% ***
289 | [8, 49]  | 172 | 59.5% ***
* p < 0.05, *** p < 0.001 (two-tailed).
8.4.2 Positive CRR and All Links
As the DWND value decreases, so does the signal indicating the positive CRR
between a pair of companies. To examine this observation further, we segment the links
in the intercompany network into baskets, such that links in each basket have the same
DWND, and combine links with different DWND values into one basket only if the
basket contains fewer than 20 links. In Table 22, we provide the percentages of links
following positive CRR in each basket.
When DWND values are small (e.g., less than 10), links in the same baskets do
not display a clear trend toward a positive CRR. In other words, for company pairs in
those baskets, pointing to a company with the same or higher revenue rank is about as
likely as pointing to one with lower revenue rank. However, as the DWND values
increase, positive CRR becomes more salient.
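The bucketing described above can be sketched as follows. The merge rule here, combining adjacent DWND values until a basket reaches 20 links, is my reading of the text's description, and the toy DWND values are illustrative.

```python
from collections import Counter

def make_baskets(dwnd_values, min_size=20):
    """Group links into baskets of equal DWND, merging adjacent DWND
    values whenever a basket is still smaller than `min_size`.

    Returns (low DWND, high DWND, number of links) per basket.
    """
    counts = Counter(dwnd_values)
    baskets, current, size = [], [], 0
    for value in sorted(counts):           # ascending DWND
        current.append(value)
        size += counts[value]
        if size >= min_size:
            baskets.append((current[0], current[-1], size))
            current, size = [], 0
    if current:                            # leftover small basket
        baskets.append((current[0], current[-1], size))
    return baskets

# Toy DWND values: plenty of small values, sparse large ones.
values = [1] * 30 + [2] * 25 + [3] * 8 + [4] * 7 + [5] * 9
print(make_baskets(values, min_size=20))
# [(1, 1, 30), (2, 2, 25), (3, 5, 24)]
```

As in Table 22, small DWND values are frequent enough to form singleton baskets, while sparse large values get merged into ranges such as [11, 12] and [13, 17].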
In summary, DWND can be an indicator of positive CRR for top links, i.e., links
with large DWND values. Overall, 48% of the 87,340 pairs whose DWND values are
nonnegative follow positive CRR, suggesting that the indication provided by DWND
disappears when considering all the pairs.

Table 22.
Positive CRR for All Links with the Same or Similar DWND

Basket No. | DWND     | Percentage of Links Following Positive CRR
 1 | 1        | 46.5%
 2 | 2        | 48.8%
 3 | 3        | 46.8%
 4 | 4        | 51.9%
 5 | 5        | 51.8%
 6 | 6        | 57.1%
 7 | 7        | 56.3%
 8 | 8        | 52.8%
 9 | 9        | 45.2%
10 | 10       | 57.5%
11 | [11, 12] | 55.6%
12 | [13, 17] | 62.5%
13 | [18, 23] | 86.7% ***
14 | [24, 49] | 80.0% *
* p < 0.05, *** p < 0.001 (two-tailed, binomial test).
8.4.3 Predicting CRR with Annual Revenues
For the first data set we first predict CRR using three groups of attributes
identified in Section 7.2, then use each individual group of attributes separately and
observe its predictive power. Moreover, we conduct discriminant analysis to identify
what IVs are significant in discriminating CRRs.
8.4.3.1 All Three Groups of Attributes
To predict the CRR for each pair of companies, we use a total of 12 attributes (2
dyadic degree-based, 4 node degree-based, and 6 node centrality-based). For the node
degree-based and node centrality-based measures, we employ a pair of attributes for the
source and target companies of each link. Of the dyadic degree-based attributes, we do
not use DWID because it can be derived directly from DWND and DWOD. Table 23
shows the results of the two classification methods for the first data set (87,340 company
pairs).
From Table 23 we observe that using attributes derived from a network without
resorting to any information about a company’s sector or revenue, we achieve reasonable
precision, recall, and accuracy of approximately 70–80% in predicting the CRR between
companies, given our data set consists of an almost equal number of positive and
negative CRR instances (see the third column in Table 23). In addition we divide the
87,340 pairs into two subsets: (1) all pairs in which both companies in a pair belong to
the same sector and (2) the remaining pairs (different sectors). We examine the prediction
performance for each subset separately, and again, the precision, recall, and accuracy fall
around the 70–80% range, similar to those in Table 23. Using the ten accuracy values
generated through the 10-fold cross-validation, we find that the average accuracies of the
logistic regression and decision tree differ significantly (two-tailed t-test, p < 0.001), with
decision tree proving to be a superior method.
Table 23.
Classification Results of CRR with 12 Attributes (First Data Set)

Classification Method | Class Label (CRR) | Number (Percentage) of Pairs | Precision | Recall | Accuracy
Logistic regression   | 0 | 45907 (52.6%) | 74.8% | 77.1% | 74.3%
                      | 1 | 41433 (47.4%) | 73.7% | 71.2% |
Decision tree         | 0 | 45907 (52.6%) | 80.5% | 81.1% | 79.7%
                      | 1 | 41433 (47.4%) | 78.9% | 78.2% |
Notes: Attributes are DWND, DWOD, source NWID, source NWOD, target NWID, target NWOD, source pagerank, source hits, source betweenness, target pagerank, target hits, and target betweenness.
8.4.3.2 Each Individual Group of Attributes
We are also interested in comparing the performances of individual groups of
attributes separately; in Tables 24, 25, and 26, we provide the associated results for the
first data set.
The two dyadic degree-based attributes, DWND and DWOD, fail to predict
revenue relations well, whereas the four node degree-based and six node centrality-based
attributes produce results nearly as good as those from using all 12 attributes together.
The poor performance of dyadic degree-based attributes may be due to their
reliance on the local (pairwise) flow of citations between the two companies. This
localized property of the dyadic attributes may fail to capture the relative importance of
the two companies, which is formed by all the citations they receive from or provide to
many other nodes in the network. The more global node degree- and node centrality-
based measures therefore better predict CRR.
Table 24.
Classification Results of CRR Using DWND and DWOD

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 52.6% | 99.2% | 52.6%
                      | 1 | 54.5% |  1.1% |
Decision tree         | 0 | 52.6% | 97.1% | 52.5%
                      | 1 | 49.1% |  3.1% |
Table 25.
Classification Results of CRR Using Source NWID, Source NWOD, Target NWID, and Target NWOD

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 71.3% | 84.1% | 73.8%
                      | 1 | 78.0% | 62.4% |
Decision tree         | 0 | 80.1% | 80.9% | 79.4%
                      | 1 | 78.6% | 77.7% |
Table 26.
Classification Results of CRR Using Source Pagerank, Source Hits, Source Betweenness, Target Pagerank, Target Hits, and Target Betweenness

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 74.6% | 77.6% | 74.3%
                      | 1 | 74.0% | 70.7% |
Decision tree         | 0 | 80.2% | 80.0% | 79.1%
                      | 1 | 77.9% | 78.1% |

8.4.3.3 Discriminant Variate
At the first step of the discriminant analysis using the 1000 pairs with 2000
unique companies, before adding the first IV into the model, we find that ten IVs (four
node degree-based and six centrality-based) are significant (with significance equal to or
less than 0.05) and the two dyadic degree-based IVs are not. The result for dyadic
degree-based IVs is consistent with what we see in Table 24: those IVs produce very
poor prediction results. The first IV included in the discriminant model is the source_hits
score, as it has the largest score statistic. After including source_hits and repeating
the evaluation procedure, the second IV to be added is target_hits. At this step, the
remaining eight IVs that were significant before the first IV was included become
insignificant due to high multicollinearity among the IVs (i.e., hits, pagerank,
betweenness, NWID, and NWOD). This high multicollinearity also explains the similar
performance achieved by the different sets of IVs in Tables 25 and 26. The coefficient β
for source_hits is negative (-1863.7) and that for target_hits is positive (1627.5), which
indicates that an increase in source_hits decreases the likelihood of positive CRR,
whereas an increase in target_hits increases it. In other words, the global HITS-based
centrality of the source or target company is indicative of its relative revenue. Hence,
the global centrality-based hits metrics for the source and target companies constitute
the discriminant variate that can significantly discriminate between positive and
negative CRRs. The prediction results obtained using the discriminant model (with a
constant and the two IVs, source_hits and target_hits) are shown in Table 27.
Compared with Tables 23, 25, and 26, Table 27 shows inferior results, indicating
that adding more IVs can improve prediction performance (the main focus of this chapter).
Table 27.
Prediction Results for Discriminant Model with Two IVs

Discriminant Model  | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 69.4% | 54.9% | 66.8%
                    | 1 | 64.2% | 68.3% |
8.4.4 Predicting CRR with Quarterly Revenues
With the second data set, we also report the CRR prediction performance on the
basis of quarterly revenues. We present the CRR prediction results in Table 28; the
CRRs are determined by revenues of Q4 2005. The prediction performance is very
similar to that in Table 23, which is generated on the basis of annual revenues.
8.4.5 Predicting Top-N Companies by Revenue
We now consider the related problem of predicting whether a company will fall
within the set of top-N companies by revenue (in dollars). Because we are no longer
interested in the direct relation between a pair of companies, we do not use the dyadic
attributes in these predictive methods. We employ five node-level attributes for each
company in the network (listed in the caption of Figure 26). The class label to be
predicted takes a value of 1 if the company is a top-N company by revenue and 0
otherwise. Again, we base all performance measurements on 10-fold cross-validation.
Figures 26 and 27 show the performances of the two classification methods as N varies
from 100 to 1000 with a step size of 100.
Table 28.
Classification Results of CRR with 12 Attributes (Second Data Set)

Classification Method | Class Label (CRR) | Precision | Recall | Accuracy
Logistic regression   | 0 | 75.0% | 80.1% | 75.5%
                      | 1 | 76.1% | 70.4% |
Decision tree         | 0 | 76.4% | 76.2% | 75.4%
                      | 1 | 74.3% | 74.6% |
The two classification methods produce similar results. Performance for predicting the
negatives (i.e., a company is not in the set of top-N companies) is high, with precision
and recall (for both methods) in the range of 89–99%. However, precision for predicting
the positives is in the range of 57–75%, and recall is substantially lower (24–36%). We
observe similar results with the second data set; for the negatives, both precision and
recall are between 88% and 99%, whereas for the positives, precision is 65–76% and
recall is 22–35%. Although these positive prediction performances may seem rather low,
they should be judged with the knowledge that the top-N companies, where N varies
from 100 to 1000, constitute only 1.6–16% of the total number of companies in the two
data sets. That is, the problem of correctly identifying a company in the set of top-N
companies by revenue is particularly hard, whereas identifying a company that is not in
the top-N is easier because most companies fall into this category. Given the high prior
probability of negatives, our results for this problem are encouraging.
Figure 26. Precision and Recall for Logistic Regression in Predicting Top-N Companies (precision and recall for classes 0 and 1, N = 100 to 1000)
Figure 27. Precision and Recall for Decision Tree in Predicting Top-N Companies (precision and recall for classes 0 and 1, N = 100 to 1000)
8.4.6 Analysis for CRR Flips
8.4.6.1 Analysis for CRR Flips on the Basis of Annual Revenues
Table 29 shows that for annual revenue-based CRRs, precision, recall, and accuracy are
in the 70–80% range for all 75,709 pairs and around 60% for flip pairs. Given that about 4% of all
pairs experienced CRR flips, a naïve technique that assumes current year’s (t) CRRs to be
the same as last year’s (t-1) will achieve an accuracy of 96%. However, such a high
accuracy would be at the cost of failure to detect any CRR flips (i.e., 0% precision, recall,
and accuracy among the flip pairs). In contrast, our approach is able to achieve precision,
recall, and accuracy of about 60% on the flip pairs. The flip pairs, due to their more
dynamic nature, constitute the more interesting part of the data set. Moreover, it is
important to note that our approach does not resort to any financial data when predicting
the CRR. This is a desirable property that would allow the approach to be easily extended
to private and/or foreign companies where it is harder or impossible to find accurate
financial data. Another naïve approach, which classifies company pairs as positive or
negative CRR randomly with 50% probability, would achieve 50% accuracy on flip
pairs as well as on all pairs. Our approach clearly performs better than such a random
approach on both flip pairs and all pairs.
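The contrast between overall accuracy and flip-pair performance can be made concrete with a small sketch; the labels below are invented for illustration (with a 4% flip rate, as in the text) and are not drawn from the data set:

```python
# Illustrative sketch (not the dissertation's code): a carry-forward baseline
# that predicts CRR(t) = CRR(t-1) scores high overall accuracy yet detects no
# flips. The 100 labels are invented, with 4 of 100 pairs flipping.

def accuracy(pred, true):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

crr_prev = [0] * 50 + [1] * 50        # CRR labels at time t-1
crr_now = list(crr_prev)              # CRR labels at time t
flip_idx = [0, 1, 50, 51]             # the 4 pairs that flip between t-1 and t
for i in flip_idx:
    crr_now[i] = 1 - crr_now[i]

naive = crr_prev                      # carry last period's CRR forward
print(accuracy(naive, crr_now))       # 0.96 on all pairs ...

flips_pred = [naive[i] for i in flip_idx]
flips_true = [crr_now[i] for i in flip_idx]
print(accuracy(flips_pred, flips_true))  # ... but 0.0 on the flip pairs
```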
Table 30 lists three sample flip pairs with annual revenues at times t-1 and t. The
first two pairs’ CRRs flip from 0 to 1, and the third pair demonstrates a flip from 1 to 0.
Table 29.
Classification Results of All Pairs and the Flip Pairs (with Annual Revenues)

                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive*   79.2%    75.1%     57.8%    64.0%
Recall for positive       78.6%    71.5%     55.7%    56.5%
Precision for negative    80.9%    75.4%     58.2%    61.9%
Recall for negative       81.5%    78.7%     60.3%    68.9%
Accuracy                  80.1%    75.3%     58.0%    62.8%
* Positive means that the CRR flip is from 0 to 1.
Table 30.
Sample Flip Pairs

Pair (tickers)  Company1 (Sector1)                              Revenue1* at t-1  Revenue1 at t  Company2 (Sector2)         Revenue2 at t-1  Revenue2 at t  Flip (t-1, t)
MRGE, NGPS      Merge Technologies Inc. (Technology)            37.0              72.1           NovAtel Inc. (Technology)  44.8             54.6           0->1
JNPR, XLNX      Juniper Networks, Inc. (Technology)             1336              2060           Xilinx Inc. (Technology)   1573             1640           0->1
MSO, CTRN       Martha Stewart Living Omnimedia Inc. (Service)  187.4             209.5          Citi Trends (Service)      157.2            289.8          1->0
* Revenue in million dollars.
8.4.6.2 Analysis for CRR Flips on the Basis of Quarterly Revenues
Table 31 shows that for quarterly revenue-based CRRs (Q4 2005 as current time t
and Q1 2006 as future time t+1), the precision, recall, and accuracy are around 80% for
all pairs and close to 60% for flip pairs by DT. Compared with DT, LR produces slightly
inferior results. The results are consistent with those seen in Table 29 for annual
revenue-based CRRs.
When measuring prediction performance on CRR flips using Q4 2005 as t and Q2
2006 as t+1, the results are shown in Table 32. Compared with results in Table 31, we
find that the prediction performance for flip pairs in Table 32 drops. This may be
explained by the fact that as the difference in time between news and target CRR (i.e.,
CRR to be predicted) increases, the power to predict CRR among flip pairs decreases.
Table 31.
Classification Results of All Pairs and the Flip Pairs (with Quarterly Revenues of Q4
2005 and Q1 2006)
                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive    80.5%    75.2%     54.2%    47.5%
Recall for positive       81.0%    78.1%     57.0%    55.9%
Precision for negative    81.3%    77.8%     60.9%    54.9%
Recall for negative       80.8%    74.9%     58.1%    46.5%
Accuracy                  80.9%    76.4%     57.6%    50.9%
Table 32.
Classification Results of All Pairs and the Flip Pairs (with Quarterly Revenues of Q4
2005 and Q2 2006)
                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive    79.0%    74.8%     50.7%    45.5%
Recall for positive       78.7%    74.2%     56.3%    47.0%
Precision for negative    80.1%    76.0%     57.1%    51.6%
Recall for negative       80.4%    76.6%     51.4%    50.1%
Accuracy                  79.6%    75.4%     53.7%    48.6%
8.5 Discussions
We propose a news-driven, SNA-based business relationship discovery approach
to explore the predictive value of business news in discerning revenue relationships
between companies. Our approach uses citations in news stories to understand the
direction and strength of the relative importance between a pair of companies. In our
intercompany network, nodes are companies, and links are directed and weighted on the
basis of the direction and frequency of citations in news stories. We identify and quantify
various attributes of the network using standard network analysis metrics and suggest
modified or new metrics as needed (e.g., DWND). We then use these attributes to predict
the (future) relative revenue relation between a pair of companies as an example of
business relationships the approach might predict. We also examine the prediction
performance for flip pairs and investigate whether we can predict if a given company
falls into the set of top-N companies by revenue. We process and employ two sets of
multimonth data from the online business news available at Yahoo! Finance. Both data
sets reaffirm the robustness of our findings on the basis of annual and quarterly revenues.
Applying discriminant analysis we identify a set of significant IVs. Moreover, our
approach is intrinsically language independent and can be extended to news in various
languages.
Similar to many other networks constructed from the Internet, we find that
various attributes of our network, such as NID, NOD, NWID, NWOD, and link weight,
follow the power law distribution. By exploring the relation between DWND and positive
CRR, we find that company pairs with large DWND tend to be associated with positive
CRR. Hence, as expected, the DWND metric (at least for large values) captures the
overall flow of revenue (importance) between a pair of companies.
We study the CRR prediction problem by using three groups of attributes
together, as well as individual groups separately. Different groups of attributes vary in the
range of the network covered for their computations. More global measures, such as node
degree- and node centrality-based attributes, are better predictors of CRR than are the
dyadic degree-based attributes that concentrate only on pairwise relationships and ignore
the rest of the network. In terms of CRR prediction performance, the precision, recall,
and accuracy are in the range of 70–80% for all pairs and are about 60% for flip pairs.
With regard to predicting whether a company’s revenue falls among the top-N,
the precision for predicting the positives (top-N) is much higher than the recall. These
results may seem humble until we consider them in the context of the prior distributions
in the data sets. Considering that only a small percentage of companies fall into the set of
top-N companies by revenue, a precision value in the range of 57–75%, as we achieve, is
encouraging. If our predictive models randomly assign companies to the top-N, the
precision for predicting positives should not exceed 16%.
Our approach thus can not only serve as a data filtering step for analysts but also
be useful for tracing and monitoring the dynamics of revenue relations for many
companies over time. We plan to further validate our approach with a variety of business
relationships, news from different languages (and countries), various types of companies
(e.g., private versus public), and over time. Further research might also attempt to derive
and evaluate additional graph attributes that synthesize the global and dyadic measures
that represent more effective predictors of business relationships between a pair of
companies.
CHAPTER 9
DISCOVERING COMPETITOR RELATIONSHIPS
9.1 Approach Outline and Research Questions
Figure 28 outlines the five main steps of our approach to competitor discovery.
The first two steps have been explained in Figure 18 in Section 6.1. In step 3, as a
preliminary investigation, we first examine the citation-based intercompany network for
both its competitor coverage (coverage of known competitors) and competitor density
(the likelihood of finding competitors among the linked company pairs in the network).
We benchmark this preliminary investigation against an exhaustive as well as a random
search to provide a comparative analysis of a citation-based intercompany network in
terms of search cost. We find that competitor relationship discovery is especially
challenging in portions of our data set where the number of non-competitor pairs
overwhelms the number of competitor pairs. We use a combination of data from Hoover’s
and Mergent as our gold standards for evaluation purposes.
This study focuses on the following two research questions:
1. How well can we discover competitor relationships between companies using four
types of attributes derived from the intercompany network? Using special
classification techniques, we report the classification performance for an
imbalanced data set where the number of noncompetitor pairs overwhelms the
number of competitor pairs.

Figure 28. Process View of the Competitor Discovery Approach
2. To what extent can a gold standard cover the set of all competitors, and to what
extent does the proposed approach extend the knowledge covered by a gold
standard? We use Hoover’s and Mergent as gold standards for identifying
competitors, though we are keenly aware that these data sets are incomplete and
inconsistent, as we have illustrated. Therefore, we estimate their coverage on all
competitor pairs and propose metrics to estimate the extension offered by our
approach for each gold standard data source.
9.2 Data Sets
In the following two subsections, we introduce two data sets that will be used to
evaluate competitor classification performance. The first data set represents a whole set
of pairs in the network, and the second is created to represent the imbalanced part of the
whole data set.
9.2.1 Data Set I
We first use DWND (net flow of citations between a pair of companies) to
identify all distinct (linked) company pairs in the network; namely, we include only pairs
with non-negative DWND values, and for any link <ni, nj> with a DWND value of 0, we
ignore the opposite link <nj, ni>. In other words, all distinct company pairs in the
intercompany network that have any citations between them are identified. With this
method, we would identify a total of eight links in Figure 19 in Section 7.1. For the entire
intercompany network, we identify a total of 87,340 company pairs. Next, we sort the
pairs by their DWIOD values, which range from 1 to 990, in descending order, because
DWIOD captures the total volume of citations between two companies in news.
Therefore, more citations in news stories should increase the likelihood that two
companies have a business relationship. In terms of DWIOD values, the data set is
skewed; most company pairs have small DWIOD values. To examine competitor
relationships, we group company pairs with the same or similar DWIOD values by
dividing them into baskets, such that links with different DWIOD values do not appear in
the same basket unless the basket contains fewer than 200 pairs. This procedure results in
21 baskets associated with different DWIOD values. We randomly choose 40 pairs from
each basket, and the 840 pairs (40 × 21) constitute data set I, which we use to examine
the classification performance of the individual baskets in Section 9.4.
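The basket-construction rule described above can be sketched as follows; the 200-pair minimum follows the text, while the helper name and toy data are assumptions for illustration:

```python
# Hedged sketch of the basketing step: sort pairs by DWIOD in descending
# order and keep pairs with equal DWIOD together, merging consecutive DWIOD
# groups until a basket reaches the minimum size.
from itertools import groupby

def make_baskets(pairs_with_dwiod, min_size=200):
    """pairs_with_dwiod: iterable of (pair, dwiod); returns a list of baskets."""
    ordered = sorted(pairs_with_dwiod, key=lambda x: -x[1])
    baskets, current = [], []
    for _, group in groupby(ordered, key=lambda x: x[1]):
        current.extend(group)             # a whole DWIOD group goes in together
        if len(current) >= min_size:      # close the basket once large enough
            baskets.append(current)
            current = []
    if current:                           # leftover small groups form the tail
        baskets.append(current)
    return baskets

# Toy run with a 3-pair minimum instead of 200:
toy = [("a", 5), ("b", 5), ("c", 4), ("d", 4), ("e", 3), ("f", 2)]
print([len(b) for b in make_baskets(toy, min_size=3)])   # [4, 2]
```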
We manually determine whether each of the 840 company pairs in the 21 sample
baskets is a competitor pair using the Hoover’s and Mergent sources. If we find a
competitor relationship between the two companies according to either Hoover’s or
Mergent, we assign the pair a class label of 1 (positive instance); otherwise, it receives a
class label of 0 (negative instance). In Table 33, we show the DWIOD range and size of
each basket, as well as the number and percentage of competitor pairs in the 21 sample
baskets. As this table illustrates, higher DWIOD values tend to be associated with a
higher percentage of competitor pairs in a sample basket, in line with our intuition that as
the overall volume of citations between a pair of companies increases, the likelihood that
the companies have a business relationship (e.g., competitors) increases.
Table 33.
Distribution of Competitor Pairs in 21 Sample Baskets

Positives are reported as number (percent) in each 40-pair sample basket.

Basket  DWIOD range  Basket size  By Hoover's  By Mergent  By union    By intersection
1       [69, 990]    200          26 (65.0%)   11 (27.5%)  26 (65.0%)  11 (27.5%)
2       [44, 68]     209          19 (47.5%)    9 (22.5%)  19 (47.5%)   9 (22.5%)
3       [32, 43]     224          17 (42.5%)    6 (15.0%)  17 (42.5%)   6 (15.0%)
4       [26, 31]     239          14 (35.0%)    4 (10.0%)  15 (37.5%)   3 (7.5%)
5       [22, 25]     212          14 (35.0%)    8 (20.0%)  15 (37.5%)   7 (17.5%)
6       [19, 21]     235          17 (42.0%)    6 (15.0%)  18 (45.0%)   5 (12.5%)
7       [17, 18]     224           8 (20.0%)    5 (12.5%)  11 (27.5%)   2 (5.0%)
8       [15, 16]     281          13 (32.5%)    6 (15.0%)  13 (32.5%)   6 (15.0%)
9       [13, 14]     389          10 (25.0%)    4 (10.0%)  10 (25.0%)   4 (10.0%)
10      12           263          16 (40.0%)    3 (7.5%)   17 (42.5%)   2 (5.0%)
11      11           330           8 (20.0%)    4 (10.0%)   9 (22.5%)   3 (7.5%)
12      10           410           8 (20.0%)    2 (5.0%)    8 (20.0%)   2 (5.0%)
13      9            470           8 (20.0%)    3 (7.5%)    8 (20.0%)   3 (7.5%)
14      8            622          13 (32.5%)    6 (15.0%)  13 (32.5%)   6 (15.0%)
15      7            769          10 (25.0%)    3 (7.5%)   11 (27.5%)   2 (5.0%)
16      6            1,390         5 (12.5%)    3 (7.5%)    6 (15.0%)   2 (5.0%)
17      5            1,543         5 (12.5%)    2 (5.0%)    5 (12.5%)   2 (5.0%)
18      4            4,142         4 (10.0%)    0 (0.0%)    4 (10.0%)   0 (0.0%)
19      3            4,972         2 (5.0%)     2 (5.0%)    4 (10.0%)   0 (0.0%)
20      2            29,603        1 (2.5%)     0 (0.0%)    1 (2.5%)    0 (0.0%)
21      1            40,613        0 (0.0%)     0 (0.0%)    0 (0.0%)    0 (0.0%)
Total                87,340       218           87         230          75
9.2.2 Data Sets II and III
In an imbalanced data set, most instances occur in one class, whereas the minority
is labeled as the other class, and the latter typically is the more important class [Kotsiantis
et al. 2006].
According to Table 33, several sample baskets have low percentages of positives
and therefore can be considered imbalanced data sets. As prior research [e.g., Weiss and
Provost 2003], as well as our results in Section 9.4, empirically show, typical
classification methods fail to detect the minority in an imbalanced data set and
generate poor precision and recall (e.g., close to 0%) for positives, which in this study
mean the competitor pairs. The main reason for this poor performance is that the
classifiers, by default, maximize accuracy and therefore give more weight to majority
classes than minority ones [Kotsiantis et al. 2006]. For example, for a data set with 1%
positives, simply assigning every instance a negative label and not detecting any positives
achieves an accuracy of 99%. To handle the imbalanced data set problem, we first create
a larger data set, data set II, by proportionally (according to basket size) sampling a total
of 2000 pairs from the four imbalanced baskets (18, 19, 20, and 21) with the lowest ratio
of positives (≤10%). We manually label the 2000 pairs using Hoover’s and Mergent. The
numbers and percentages of competitors according to the different gold standards appear
in Table 34.
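The proportional allocation can be checked with a quick sketch; the four basket sizes come from Table 33, and simple rounding is an assumed allocation scheme:

```python
# Sketch of sampling 2000 pairs from baskets 18-21 in proportion to basket
# size (sizes from Table 33); the rounding scheme is an assumption.
sizes = {18: 4142, 19: 4972, 20: 29603, 21: 40613}
total = sum(sizes.values())                       # 79,330 pairs overall
alloc = {b: round(2000 * n / total) for b, n in sizes.items()}
print(alloc)   # {18: 104, 19: 125, 20: 746, 21: 1024}
```

The resulting allocation (104, 125, 746, 1024) agrees with the sample basket sizes reported in Table 34 (104, 125, 747, 1,024) to within one pair of rounding drift.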
For further future analysis, in addition to data sets I and II, we also use 17 baskets
(1–17) in data set I and all pairs in data set II to produce estimated overall performance
results. For convenience, we call this combination of the two data sets data set III, which
contains 18 baskets, and data set II provides the 18th sample basket.
Table 34.
Number (Percentage) of Positive Pairs in Data Set II

Positives are reported as number (percent) in each sample basket.

DWIOD  Sample basket size  By Hoover's  By Mergent  By union     By intersection
1      1,024               22 (2.1%)    15 (1.5%)    29 (2.8%)    8 (0.8%)
2      747                 30 (4.0%)    13 (1.7%)    39 (5.2%)    4 (0.5%)
3      125                 12 (9.6%)     3 (2.4%)    14 (11.2%)   1 (0.8%)
4      104                 15 (14.4%)    7 (6.7%)    18 (17.3%)   4 (3.8%)
Total  2,000               79 (4.0%)    38 (1.9%)   100 (5.0%)   17 (0.9%)
9.3 Examining Competitor Coverage and Density of the Intercompany Network
In this section, we examine two issues: the completeness of the intercompany
network in its coverage of competitor pairs (i.e., competitor coverage), and the likelihood
of competitor pairs being linked in the intercompany network (i.e., competitor density).
These issues clarify the extent to which “competitor semantics” are embedded in the links
of the constructed network. Greater competitor coverage and competitor density in the
intercompany network lower the cost of searching for (and classifying) competitors by
using the network. Because we lack an ideal benchmark of intercompany networks from
other approaches, we benchmark the competitor coverage of the intercompany network
against that of an exhaustive network (clique) in which all nodes link to one another and
compare the competitor density of the intercompany network with that of a random
network having the same numbers of nodes and links as those of the intercompany
network. Table 35 includes notation we use to examine competitor coverage and
competitor density.
Table 35.
Notation for Competitor Coverage and Competitor Density

Notation                   Interpretation
K                          Number of unique companies in a sample basket that has 40 company pairs.
CL                         Citation-based links among the K companies in the intercompany network.
EL                         Exhaustive links among the K companies.
CP(CL)                     Number of competitor pairs (CP) present in CL.
CP(EL)                     Number of competitor pairs present in EL.
Competitor coverage ratio  = CP(CL)/CP(EL), or the proportion of all known competitor pairs that are present as links in a citation-based intercompany network.
CP40(CL)                   Number of competitor pairs present in 40 links from a sample basket.
RL                         Randomly generated company links from the K companies.
CP40(RL)                   Number of competitor pairs present in 40 randomly generated links.
CD40(CL)                   = CP40(CL)/40, or competitor density for a small citation-based network that consists of the 40 links from a sample basket.
CD40(RL)                   = CP40(RL)/40, or competitor density for a random network that consists of 40 random links.
CD(EL)                     = CP(EL)/(K*(K-1)), or competitor density for an exhaustive network (clique) that consists of the exhaustive links.
9.3.1 Examining the Competitor Coverage
From 40 company pairs in each sample basket in data set I, we identify K and EL.
From the whole intercompany network, we further find CL. In addition, we identify
CP(CL) and CP(EL) through the union of the Hoover’s and Mergent data. In Figure 29,
we depict the competitor coverage ratio for the intercompany network across the 21
sample baskets; it is always greater than 66% and typically in the range of 87–100%
across the sample baskets. We also note that CL is a fraction of EL, ranging from 15% to
84% across the sample baskets. In other words, while our citation-based intercompany
network covers most of the competitor pairs found in an exhaustive network, for most
sample baskets it is much smaller as compared to the exhaustive network. Therefore, our
[Chart: competitor coverage ratio across baskets 1 to 21]
Figure 29. Competitor Coverage Ratio
classification models (in Section 9.4) explore a small subspace of all possible
relationships by using the intercompany network, and the subspace covers most of the
competitor pairs.
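The Table 35 quantities reduce to simple ratios; a minimal sketch follows, in which K, CP(EL), and CP(CL) are invented for illustration rather than taken from any sample basket:

```python
# Sketch of the coverage and density quantities from Table 35; all counts
# below are assumptions for illustration.

def coverage_ratio(cp_cl, cp_el):
    """CP(CL)/CP(EL): share of known competitor pairs that appear as links."""
    return cp_cl / cp_el

def density_exhaustive(cp_el, k):
    """CD(EL) = CP(EL)/(K*(K-1)): competitor density of a clique on K nodes."""
    return cp_el / (k * (k - 1))

K = 60       # unique companies in a sample basket (assumed)
CP_EL = 25   # competitor pairs among the exhaustive links (assumed)
CP_CL = 22   # of those, pairs also linked by citations (assumed)

print(coverage_ratio(CP_CL, CP_EL))      # 0.88: high coverage
print(density_exhaustive(CP_EL, K))      # ~0.007: sparse in the clique
```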
9.3.2 Examining the Competitor Density
Using the union of data from Hoover’s and Mergent, we label 40 company pairs
in each sample basket to find CP40(CL). Given K, we randomly generate 40 links from
the K unique companies and find CP40(RL). We repeat the random link generation and
link labeling procedures four times to obtain an average CP40(RL). Then, we compute the
competitor density CD40(CL) and average CD40(RL) for all sample baskets. Moreover,
because we know CP(EL), we can calculate CD(EL). Figure 30 provides the competitor
density for the citation-based intercompany network, random network, and exhaustive
[Chart: CD40(CL), CD40(RL), and CD(EL) across the sample baskets]
Figure 30. Probability of Being a Competitor Pair
network across the 21 sample baskets. The curve for the average CD40(RL) is very close
to that of CD(EL), which indicates that the probability of finding a competitor pair in the
randomly generated 40 pairs is consistent with that in the exhaustive links. Moreover,
CD40(CL) is much higher than the average CD40(RL) and CD(EL) in 20 of the 21 sample
baskets. The difference in these probabilities suggests that pairs in the intercompany
network for most baskets are much more likely to be competitor pairs than those in the
random links. The high competitor density in the intercompany network for most sample
baskets therefore would benefit the classifiers in a competitor classification.
The results in Sections 9.3.1 and 9.3.2 show that the citation-based intercompany
network has high competitor coverage and density and therefore can alleviate the
problems associated with searching for competitors in an exhaustive or random space of
potential relationships. The results also confirm our intuition that links in the citation-
based intercompany network contain signals about competitor relationships instead of
being random.
9.4 Competitor Discovery
Our competitor classification models use four types of attributes to classify a
company pair as competitors or noncompetitors. Because the class label (dependent
variable) in the models is binary by nature, we can apply a variety of standard binary
classification models. As is common in machine learning, we use part of the data set for
training and leave a disjoint testing set to evaluate the discriminating power of the
models. We repeat this training–testing process several times with different data splits
(cross-validation) to ensure the robustness of observed results. Using several standard
metrics, which we describe next, we evaluate the discriminating power of the models.
9.4.1 Evaluation Metrics
Table 36 is the confusion matrix containing the actual and classified classes for a
classification problem with two class labels. TP refers to the number of true positives, TN
is the number of true negatives, FP is the number of false positives, and FN represents
the number of false negatives.
Table 36.
Confusion Matrix
                      Classified class label
Actual class label    Positive    Negative
Positive              TP          FN
Negative              FP          TN
Using the confusion matrix we introduce the common metrics for evaluating and
comparing classification performance as follows:
Precision = TP / (TP + FP)                                                (11)

Recall (TP rate) = TP / (TP + FN)                                         (12)

FP rate = FP / (FP + TN)                                                  (13)

F-measure = ((1 + α) × Precision × Recall) / (α × Precision + Recall)     (14)

Accuracy = (TP + TN) / (TP + TN + FP + FN)                                (15)
In most classification problems, precision and recall present a trade-off, because
when a model prioritizes a conservative approach to boost the precision, it misses some
competitors, which reduces its recall. An F-measure is based on both precision and recall,
and the parameter α denotes the relative importance of recall versus precision. F1 is the
harmonic mean of precision and recall.
One of the most common metrics to evaluate classifiers for an imbalanced data set
is the receiver operating characteristics (ROC) curve [Kotsiantis et al. 2006], a two-
dimensional curve with TP rate (recall) on the y-axis and FP rate on the x-axis (for
specific examples, see Figure 33 in Section 9.4.5). Thus, a ROC curve can address an
important tradeoff—namely, the number of correctly identified positives increases at the
expense of introducing additional false positives. The area under the ROC curve,
called AUC, offers another evaluation metric.
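These metrics can be written out directly from the confusion matrix; a minimal sketch with invented counts:

```python
# Standard confusion-matrix metrics; the counts are invented for illustration.

def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # TP rate, the y-axis of an ROC curve
    fp_rate = fp / (fp + tn)         # the x-axis of an ROC curve
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, fp_rate, f1, accuracy

p, r, fpr, f1, acc = metrics(tp=30, fn=20, fp=10, tn=940)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
# 0.75 0.6 0.667 0.97
```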
9.4.2 Competitor Classification with Data Set I
Using the publicly available Weka API [Witten and Frank 2005], we employ four
classification methods: artificial neural network (ANN), Bayes net (BN), C4.5 decision
tree (DT), and logistic regression (LR) to classify company pairs. Models based on ANN,
BN, and DT are common classifiers in data mining, and LR frequently appears in
business research to address problems with a binary class label (as in our competitor
classification problem). For each sample basket, except for 21, which does not contain
any competitor pairs (we address this basket, together with three other baskets as the
imbalanced data set II, in the next subsection), we report the average precision and recall
generated by 10-fold cross-validation for each classification method. We use different
classification methods to compare their performances for our application.
9.4.3 Competitor Classification with Data Set II
9.4.3.1 Background on Handling Imbalanced Data Set
Solutions to handling imbalanced data sets for classification problems exist at
both data and algorithmic levels. Several data-level solutions use different resampling
approaches, such as undersampling majority, oversampling minority, or oversampling
minority by creating a synthetic minority [Chawla et al. 2002], which changes the prior
distribution of the original data set [Kotsiantis et al. 2006] before learning from the data
set. Another approach at the data level segments the whole data into disjoint regions, such
that the data in certain region(s) are no longer imbalanced [Weiss 2004].
Some popular solutions at the algorithmic level include the following:
Decision threshold adjustment (DTA), which, given a (normalized) probability of
an instance being positive (or negative), changes the probability threshold used to
determine the class label of the instance [Kotsiantis et al. 2006].
Cost-sensitive learning (CSL), which assigns fixed and unequal costs to different
misclassifications, such as cost(false negative) > cost(false positive), to minimize
the misclassifications of positives [Pazzani et al. 1994].
Recognition-based learning (RBL), which, unlike a two-class classification
method that learns rules for both positive and negative classes, is a one-class
learning method and learns only rules that classify the minority [Weiss 2004;
Kotsiantis et al. 2006].
We employ several of these techniques to address our imbalanced data set.
Specifically, we divide the whole data set into 21 baskets on the basis of DWIOD, and
many of these turn out to be more “balanced” than the entire data set, so it matches the
segment data approach [Weiss 2004] for handling imbalanced data sets. For the few
imbalanced baskets, we sample more instances to form our imbalanced data set II. Next
we apply two different approaches, the simple DTA approach and an undersampling-ensemble
(UE) method (explained in subsection 9.4.3.3), to address the imbalanced data set
problem. We do not choose the CSL approach, mostly because we do not know the right
ratio for the cost of FN versus the cost of FP in the context of our competitor
classification problem. However, we consider DTA and CSL to be very similar, in that
they both create a bias toward positive classifications. For data set II, we report various
performance metrics suited for an imbalanced data set, including F1, precision, TP rate,
FP rate, ROC, AUC, and accuracy. We introduce the two approaches (DTA and UE) for
dealing with classification of imbalanced data in detail next.
9.4.3.2 DTA Approach
With this approach, we simply adjust the decision threshold used by a classifier to
determine whether to classify an instance as positive or negative, given its (normalized)
probability of being positive. For example, given that Pr(x is positive) = 0.3, the instance
x is labeled negative when the decision threshold is 0.5. However, when the threshold is
adjusted to 0.2, x is classified as positive.
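The threshold adjustment itself is a one-line rule; a sketch with invented probabilities:

```python
# DTA sketch: relabel instances by comparing their probability of being
# positive against an adjustable threshold. Scores are invented.

def classify(prob_positive, threshold=0.5):
    return [1 if p >= threshold else 0 for p in prob_positive]

scores = [0.72, 0.30, 0.45, 0.10, 0.55]      # Pr(pair is a competitor)

print(classify(scores))                      # [1, 0, 0, 0, 1] at the default 0.5
print(classify(scores, threshold=0.2))       # [1, 1, 1, 0, 1]: more positives
```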
For training and testing, we follow strict tuning procedures suggested in [Salzberg
1997]. In particular, we randomly select 1500 instances as a training set from the
imbalanced data set and the remaining 500 as the testing set. For each classification
method, we use 10-fold cross validation and tune the input parameters to observe the best
performance on the F1 measure with just the training set. Finally, we apply each trained
classifier with its respective “best” parameter setting to the testing set for evaluation
purposes. Moreover, to determine robustness, we randomly divide the 2000 pairs into
four disjoint sets of equal size, which form four different pairs of training and testing sets.
We then apply the training–tuning–testing procedures to the four pairs of training and
testing sets and report the average results (see the formulas in Section 9.4.5). In each
case, training and parameter tuning relies solely on the training data set, whereas our
evaluation of the trained and tuned classifier uses only the testing data set. For ANN, we
tune the learning rate from 0.1 to 1.0 and momentum from 0.1 to 0.3; for BN, we choose
K2 [Cooper and Herskovitz 1992] and TAN [Friedman et al. 1997] as algorithms for the
search network structure; for DT, we change the minimum leaf size from 2 to 10; and we
require no parameter tuning for LR. For all other parameters, we accept the default from
Weka. We apply the same tuning procedures throughout the study whenever we use
parameter tuning.
9.4.3.3 UE Approach
From the original imbalanced data set II, we generate multiple, smaller, more
balanced subdata sets by duplicating all minority (positive) instances in each subset and
then evenly splitting the majority into those subsets, as we depict in Figure 31. We build
a classifier from each subset and use an ensemble approach [Estabrooks and Japkowicz
2001] to generate the final classification result. Chan and Stolfo [1998] adopt a similar
undersampling method.

Figure 31. Generating More Balanced Subdata Sets

We choose the majority vote as the ensemble approach, and for
the majority vote, we use the binary output (0 or 1) of each classifier and the probability
output (a value between 0 and 1) of each classifier, denoted as the majority vote by count
(MVC) and majority vote by probability (MVP), respectively.
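The subset construction and the MVC vote can be sketched as follows; the splitting rule and toy instances are illustrative, not the exact Weka-based procedure:

```python
# Hedged UE sketch: each subset keeps all minority (positive) instances plus
# an even slice of the majority; a simple majority vote by count (MVC)
# combines the per-subset classifiers.

def make_subsets(positives, negatives, n_subsets):
    """Duplicate all positives into every subset; split negatives evenly."""
    subsets = []
    for i in range(n_subsets):
        chunk = negatives[i::n_subsets]      # every n_subsets-th negative
        subsets.append(positives + chunk)
    return subsets

def majority_vote_by_count(votes):
    """votes: binary (0/1) outputs from the ensemble for one instance."""
    return 1 if sum(votes) > len(votes) / 2 else 0

pos = [("p1", 1), ("p2", 1)]
neg = [("n%d" % i, 0) for i in range(8)]
subs = make_subsets(pos, neg, n_subsets=4)
print([len(s) for s in subs])                # [4, 4, 4, 4]: 2 pos + 2 neg each
print(majority_vote_by_count([1, 0, 1, 1]))  # 1
```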
During the training phase, starting from the initial ratio of positives in the subsets, we
tune the parameters for each classifier (except for LR) and record its performance in an
output file. We repeat this procedure with different ratios of positives, which change from
0.05 to 0.60 with a step size of 0.05. From all output files, on the basis of the best
performance on the F1 measure, we determine a set of best parameters for a classifier and
a best ratio of positives. Finally, we apply the trained classifiers with their best parameter
settings and best ratios of positives to the testing set for evaluation. As in Section 9.4.3.2,
we divide the 2000 pairs into four disjoint sets of equal size, generate results separately
for the four pairs of training and testing sets, and report the average results.
9.4.4 Classification Performance for Data Set I
In Figure 32 we provide the precision and recall achieved by ANN for individual
sample baskets in data set I. For comparison, we also include the prior distribution of
positives in each sample basket. The precision curve is almost always above the prior
probability, except for the last two sample baskets with the lowest prior distributions
(5.0% and 2.5%). As Figure 32 shows, though for most baskets ANN’s classification
performance is reasonably good, it weakens when DWIOD values are very small (last
few baskets). This result highlights the inherent challenge of accurately classifying the
minority class for imbalanced data sets (the last few baskets). The other three
classification methods (BN, DT, and LR) show similar performance patterns but poorer
performance overall. We provide the results of applying special techniques to imbalanced
parts of the data set in the next subsection.
[Chart: precision, recall, and prior distribution of positives across baskets 1 to 20]
Figure 32. Precision and Recall of Data Set I by ANN and Prior Distribution
9.4.5 Classification Performance for Data Set II
In Table 37 we report precision, TP rate (recall), FP rate, F1, accuracy, and
AUC on training and testing sets for each classification method using the DTA
approach. Each bold number in the table indicates the best performance for a
measurement across the four classification models for the testing set. Since we have
four pairs of training (1500 instances) and testing (500 instances) sets, we generate and
report overall performance with the following equations, which are based on the
definitions in equations 11 to 15.
Table 37.
Classification Performance of Data Set II by DTA Approach

                        Without sector information    With sector information**
Data set   Performance  ANN    BN     DT     LR       ANN    BN     DT     LR
Training*  Precision    0.280  0.142  0.119  0.353    0.361  0.277  0.318  0.398
           Recall       0.227  0.277  0.467  0.220    0.443  0.520  0.403  0.410
           FP rate      0.031  0.088  0.182  0.021    0.041  0.071  0.045  0.033
           F1           0.250  0.188  0.190  0.271    0.398  0.362  0.356  0.404
           Accuracy     0.932  0.880  0.801  0.941    0.933  0.908  0.927  0.940
           AUC          0.753  0.703  0.656  0.756    0.870  0.863  0.740  0.865
Test       Precision    0.268  0.125  0.090  0.322    0.372  0.262  0.283  0.380
           Recall       0.220  0.240  0.400  0.190    0.420  0.430  0.360  0.380
           FP rate      0.032  0.088  0.213  0.021    0.037  0.064  0.048  0.033
           F1           0.242  0.164  0.147  0.239    0.394  0.326  0.317  0.380
           Accuracy     0.931  0.878  0.768  0.940    0.936  0.911  0.923  0.938
           AUC          0.736  0.672  0.610  0.723    0.858  0.853  0.741  0.834
* Results of the training set are based on the best performance on F1 with parameter tuning.
** Company’s sector used in Yahoo! Finance is included as an attribute.
Precision = Σ_i TP_i / (Σ_i TP_i + Σ_i FP_i)   (16)

Recall (TP rate) = Σ_i TP_i / (Σ_i TP_i + Σ_i FN_i)   (17)

FP rate = Σ_i FP_i / (Σ_i FP_i + Σ_i TN_i)   (18)

F1 = 2 × Precision × Recall / (Precision + Recall)   (19)

Accuracy = (Σ_i TP_i + Σ_i TN_i) / Σ_i (TP_i + TN_i + FP_i + FN_i)   (20)
In these equations, the definitions of TP, TN, FP, and FN are the same as those
in Section 9.4.1, and the subscript i represents a number between 1 and 4 to denote the
four disjoint testing sets from data set II.
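The pooling in equations 16–20 (sum the confusion counts over the four disjoint test sets, then compute each measure once) can be sketched as follows; the counts are illustrative:

```python
def pooled_metrics(folds):
    """Pool TP/FP/FN/TN counts over disjoint test sets, then compute each
    measure once from the pooled counts (equations 16-20 style)."""
    TP = sum(f["tp"] for f in folds)
    FP = sum(f["fp"] for f in folds)
    FN = sum(f["fn"] for f in folds)
    TN = sum(f["tn"] for f in folds)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)          # TP rate
    fp_rate = FP / (FP + TN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (TP + TN) / (TP + FP + FN + TN)
    return precision, recall, fp_rate, f1, accuracy

# Four hypothetical 500-instance test sets
folds = [dict(tp=10, fp=5, fn=10, tn=475) for _ in range(4)]
p, r, fpr, f1, acc = pooled_metrics(folds)
```

Pooling counts first (rather than averaging per-set measures) keeps the overall measures consistent with their definitions on the combined 2,000 instances.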
Table 37 also contains results for the same data set with and without sector
information (sector encoded as a categorical variable with nine values). Using sector
information greatly improves the classification performance for data set II across the
four classifiers; for example, the maximum F1 measures (both produced by ANN)
increase by 63%. In contrast, for data set I we do not observe a significant difference
in the F1 measure across its 20 baskets when sector information is added (two-tailed
t-test, p = 0.827), which indicates that sector information is more helpful for
imbalanced data sets than for more balanced ones. We find that, for all 316 competitor
pairs in data set III (216 in the
17 sample baskets of data set I and 100 in data set II), a total of 282 (89.2%) pairs are
in the same sector and 34 (10.8%) are not.
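Because classifiers such as ANN and LR need numeric inputs, a nine-valued sector attribute is typically one-hot encoded; the sector labels below are placeholders, not the exact Yahoo! Finance names:

```python
def one_hot(value, vocabulary):
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1 if value == v else 0 for v in vocabulary]

# Placeholder sector labels (nine values, as in the text)
SECTORS = ["basic_materials", "conglomerates", "consumer_goods",
           "financial", "healthcare", "industrial_goods",
           "services", "technology", "utilities"]

vec = one_hot("technology", SECTORS)

# Pair-level variant: since most competitor pairs share a sector,
# a same-sector indicator is a natural attribute for a company pair.
def same_sector(sector_a, sector_b):
    return 1 if sector_a == sector_b else 0

flag = same_sector("technology", "technology")
```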
The UE approach with MVC and MVP produces results similar to those in
Table 37. For example, with MVC, the maximum values of the F1 measure are 0.381
and 0.204 with and without sector information, respectively. Although the UE
approach is more complex than the simple DTA approach, in that it requires
undersampling the majority class to form multiple smaller data sets and adjusting the
ratios of positives in these smaller data sets, the two methods show similar classification
performance. Thus, in Section 9.5, when estimating the extent to which our approach
extends beyond the gold standard, we use the results from the DTA approach.
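A minimal sketch of the UE idea just described (undersampling the majority class into multiple training sets and combining predictions by majority vote); the classifier-fitting step is omitted, and all sizes are illustrative:

```python
import random

def undersample_ensemble(positives, negatives, ratio=1.0, n_models=5, seed=0):
    """UE-style sketch: build several training sets by undersampling the
    majority (negative) class; a real run would fit one classifier per set
    and combine their labels by majority vote (MVC)."""
    rng = random.Random(seed)
    n_neg = int(len(positives) * ratio)  # adjustable positive:negative ratio
    training_sets = []
    for _ in range(n_models):
        training_sets.append(positives + rng.sample(negatives, n_neg))
    return training_sets

def majority_vote(labels):
    """MVC: label an instance positive if most ensemble members do."""
    return 1 if sum(labels) > len(labels) / 2 else 0

sets = undersample_ensemble(list(range(10)), list(range(100, 200)),
                            ratio=1.0, n_models=3)
```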
Finally, Table 37 shows that ANN achieves the largest AUC values. In Figure
33, we illustrate the ROC curves for the four classifiers using sector information; the
curves for ANN, BN, and LR are close, and ANN and LR slightly outperform the DT
curve. The diagonal line represents random labeling of instances with different
likelihoods. For example, when the classifier randomly assigns an instance to the
positive class 10% of the time, it should find 10% of the positives correctly, producing
a TP rate of 0.1. At the same time, it identifies 90% of the negatives correctly, leading
to an FP rate of 0.1 (1 − 0.9). Thus, the process of guessing the positive class 10% of the
time yields the point (0.1, 0.1) in the ROC space, and random guesses with all different
likelihoods generate the diagonal line. Hence, our classification methods (the curves
above the diagonal line) identify the signals (i.e., competitor relationships) much more
effectively than a random assignment.

[Figure: ROC curves (TP rate vs. FP rate) for ANN, BN, DT, and LR]
Figure 33. ROC Curves of Data Set II for Four Classification Methods
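The diagonal-line argument can be checked directly: a classifier that guesses positive with probability q, independent of the data, lands at ROC point (q, q):

```python
def random_guess_roc_point(q):
    """A classifier that labels each instance positive with probability q,
    ignoring the data, finds q of the positives (TP rate = q) and mislabels
    q of the negatives, so its ROC point lies on the diagonal."""
    tp_rate = q                # fraction of positives found
    fp_rate = 1 - (1 - q)      # 1 - specificity, which equals q
    return fp_rate, tp_rate

fp, tp = random_guess_roc_point(0.1)
```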
9.4.6 Estimated Overall Classification Performance on the Basis of Data Set III
Our classification performance measurements thus far compute values for each
sample basket. Because sample baskets consist of random samples of the original (larger)
baskets, these performance results represent the performance on the original baskets.
However, we also want to estimate the classification performance for all of the baskets
combined, or the whole data set with its 87,340 pairs. This estimation requires that we
extrapolate the performance observed from the sample baskets to the entire original
basket. Therefore, we adapt equations 16–20 to estimate overall precision, TP rate
(recall), FP rate, F1, and accuracy using data set III. For the 17 sample baskets from data
set I, the classification results are based on 10-fold cross validation, whereas for the
eighteenth sample basket, we combine and use the results generated from the four disjoint
testing sets (each with 500 instances). We present the estimated overall measurements in
the following equations:
Let w_i = B_i/S_i, where B_i is the size of basket i and S_i is the size of sample basket i.
Then:

Precision = Σ_i w_i TP_i / (Σ_i w_i TP_i + Σ_i w_i FP_i)   (21)

Recall (TP rate) = Σ_i w_i TP_i / (Σ_i w_i TP_i + Σ_i w_i FN_i)   (22)

FP rate = Σ_i w_i FP_i / (Σ_i w_i FP_i + Σ_i w_i TN_i)   (23)

F1 = 2 × Precision × Recall / (Precision + Recall)   (24)

Accuracy = (Σ_i w_i TP_i + Σ_i w_i TN_i) / Σ_i w_i (TP_i + TN_i + FP_i + FN_i)   (25)
With these equations, we estimate the overall classification performance by
extending performance measurements for a sample basket to the corresponding full
basket and then combining the measures across the 18 baskets in data set III. For
example, if the sample basket S_i, which represents the original basket B_i, contains m
instances that are classified as positives by a model, we expect the original basket B_i to
contain (B_i/S_i)·m instances that would be classified as positives by the same model. We
note that equations 21–25 estimate the overall classification performance for the whole
data set of 87,340 pairs, so the resulting estimation indicates the performance of an
ensemble of 18 classifiers (one for each basket), all using a given classification method.
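The basket-to-population extrapolation can be sketched for one measure (precision); basket sizes and confusion counts below are illustrative:

```python
def estimate_overall_precision(baskets):
    """Extrapolate each sample basket's confusion counts to its original
    basket with weight B_i / S_i, then pool the weighted counts
    (equations 21-25 style, shown here for precision only)."""
    wtp = sum(b["B"] / b["S"] * b["tp"] for b in baskets)
    wfp = sum(b["B"] / b["S"] * b["fp"] for b in baskets)
    return wtp / (wtp + wfp)

# Two hypothetical baskets: weights B/S of 10 and 2
baskets = [
    {"B": 1000, "S": 100, "tp": 8,  "fp": 4},
    {"B": 500,  "S": 250, "tp": 20, "fp": 10},
]
prec = estimate_overall_precision(baskets)
```

Counts from heavily sampled baskets carry proportionally larger weight, so the pooled estimate reflects the full 87,340-pair population rather than the samples alone.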
The estimated overall prior probability for positives is 11.8% (approximately 1 in 9 pairs
in the original data set is a competitor pair). We note that compared to this low estimated
prior, Table 38 shows that our competitor discovery approach can achieve reasonably
good estimated classification performance. ANN achieves the best performance on more
metrics than the other three methods; however, unlike ANN, DT, and BN, LR does not
require any parameter tuning and still produces comparably good results. We
highlight the best performance value for each measurement in Table 38.
Table 38.
Estimated Overall Performances

      Without sector information               With sector information
      Precision Recall FP rate F1    Accuracy  Precision Recall FP rate F1    Accuracy
ANN   0.419     0.378  0.046   0.397 0.907     0.450     0.513  0.055   0.479 0.910
BN    0.238     0.354  0.095   0.284 0.863     0.388     0.514  0.071   0.442 0.895
DT    0.167     0.463  0.203   0.245 0.770     0.432     0.457  0.053   0.444 0.907
LR    0.388     0.330  0.046   0.357 0.904     0.382     0.437  0.062   0.407 0.897
9.5 Competitor Extension
In the introduction, we use an anecdote to note that gold standards tend to be
incomplete. Now we suggest metrics to estimate (1) the coverage of competitor pairs by
a gold standard and (2) the extent to which our approach extends each gold standard.
9.5.1 Estimating the Coverage of a Gold Standard
We require the following notation, illustrated in Figure 34, to describe the estimation
procedure:
C: (unknown) complete set of competitor pairs
H: set of competitor pairs covered by Hoover’s
M: set of competitor pairs covered by Mergent
J_HM = H ∩ M, the intersection of H and M
Following an idea proposed in a widely cited study [Lawrence and Giles 1998] to
estimate the coverage of search engines, we assume H and M are independent subsets of
C and thus estimate the extent to which H covers C, according to how much of H covers
M (i.e., J_HM) and the size of M. We therefore define the coverage of the entire competitor
set C by Hoover's, Cov(H), and Mergent, Cov(M), as follows:

Cov(H) = |J_HM| / |M|   (26)

Cov(M) = |J_HM| / |H|   (27)

Figure 34. Competitors Covered by Two Gold Standards
If H and M are not completely independent, the value of J_HM (their intersection) is
expected to be larger than when they are independent. In that case, this coverage
estimation provides an upper bound on true coverage.
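A toy illustration of the overlap-based coverage estimate in equations 26 and 27, with hypothetical company pairs:

```python
def coverage_estimates(h_pairs, m_pairs):
    """Lawrence-Giles-style overlap estimate: if H and M sample the unknown
    complete set C independently, the fraction of M that also appears in H
    estimates H's coverage of C, and vice versa."""
    h, m = set(h_pairs), set(m_pairs)
    j = h & m                     # J_HM, the intersection
    cov_h = len(j) / len(m)       # equation 26
    cov_m = len(j) / len(h)       # equation 27
    return cov_h, cov_m

# Hypothetical competitor pairs in two gold standards
H = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
M = {("a", "b"), ("c", "d"), ("d", "e")}
cov_h, cov_m = coverage_estimates(H, M)
```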
We previously labeled the positive instances according to Hoover’s and Mergent
for each sample basket, which enables us to compute the number of competitor pairs
identified by Hoover's (H_i) and Mergent (M_i) separately, as well as the intersection of
Hoover's and Mergent (J_HM,i) for the ith sample basket. Similar to our approach to
defining equation 11, we estimate the number of positives (for Hoover’s, Mergent, and
their intersection) in each original basket by multiplying the number of positives in the
sample basket by the ratio of the basket size to the sample basket size. Then, using
equations 26 and 27, we calculate the coverage of Hoover’s and Mergent as follows:
Cov(H) = [Σ_i (B_i/S_i) J_HM,i] / [Σ_i (B_i/S_i) M_i]   (28)

Cov(M) = [Σ_i (B_i/S_i) J_HM,i] / [Σ_i (B_i/S_i) H_i]   (29)
We find that the estimated coverage of Hoover’s and Mergent is 46.0% and
24.9%, respectively. So both data sources individually cover less than 50% of all
competitor pairs. This quantifies and confirms our initial anecdote about incompleteness
of these industry-strength data sources.
9.5.2 Estimating the Extension of One Gold Standard to Another
As Figure 34 shows, M − J_HM represents the competitors covered by
Mergent but not by Hoover's. With the same assumption and logic described in
Section 9.5.1, we define the extension of Mergent to Hoover's and the extension of
Hoover's to Mergent as follows:
Ext(M, H) = |M − J_HM| / |H|   (30)

Ext(H, M) = |H − J_HM| / |M|   (31)
9.5.3 Estimating the Extension of Our Approach to a Gold Standard
We now present a procedure to estimate how much our automated approach might
extend a gold standard (i.e., identify competitor pairs that are not covered by the gold
standard). Our estimation procedure uses the following notation:
O: the set of competitor pairs classified by our approach
Ō = C − O
H̄ = C − H
M̄ = C − M
J_HMO = H ∩ M ∩ O
J_H̄MO = H̄ ∩ M ∩ O
J_HM̄O = H ∩ M̄ ∩ O
J_H̄M̄O = H̄ ∩ M̄ ∩ O
Figure 35. Competitors Covered by Two Gold Standards and Our Approach
Thus, J_H̄MO is a subset of competitor pairs that our approach classifies as positive
and that Mergent confirms as positive but that Hoover's does not identify as competitors.
Given that the competitor pairs in Mergent are a subset of all competitor pairs, we estimate
the extent to which our approach extends Hoover's (Ext(O, H)) as follows:
Ext(O, H) = |J_H̄MO| / |H|   (32)
Similarly, we estimate the extent to which our approach extends Mergent (Ext(O,
M)) as follows:
Ext(O, M) = |J_HM̄O| / |M|   (33)
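The extension metric can likewise be illustrated with toy sets; the pairs in `O`, `H`, and `M` below are hypothetical:

```python
def extension_estimate(ours, gold, other_gold):
    """Sketch of the extension idea: count pairs our approach (O) finds that
    the other gold standard confirms but this gold standard misses, relative
    to the size of this gold standard."""
    confirmed_new = (set(ours) & set(other_gold)) - set(gold)
    return len(confirmed_new) / len(gold)

# Hypothetical competitor pairs
O = {(1, 2), (2, 3), (3, 4), (4, 5)}   # classified positive by our approach
H = {(1, 2), (5, 6)}                   # covered by Hoover's
M = {(2, 3), (3, 4), (5, 6)}           # covered by Mergent

ext_o_h = extension_estimate(O, H, M)  # pairs confirmed by M, missed by H
```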
Since we use one gold standard to examine the extension of our approach to the
other, our extension is bounded by the extension of one gold standard to the other, such
that Ext(O, H) ≤ Ext(M, H) and Ext(O, M) ≤ Ext(H, M). On the basis of
equations (32) and (33), we compute the extension of our approach to each gold standard
using results from data set III with the following equations:

Ext(O, H) = [Σ_i (B_i/S_i) J_H̄MO,i] / [Σ_i (B_i/S_i) H_i]   (34)

Ext(O, M) = [Σ_i (B_i/S_i) J_HM̄O,i] / [Σ_i (B_i/S_i) M_i]   (35)
We show in Table 39 the estimation of how much our approach extends the
knowledge available from each of the gold standards, for the different classification
methods (with and without sector information). Using the sector information and any
classification method, our approach extends Hoover’s and Mergent by more than 10%
and 32%, respectively. We base these extension values on classification results generated
from a set of input parameters and classification methods. As the ROC curves in Figure
33 illustrate, we could achieve a higher TP rate (recall) by adjusting some parameters,
and therefore obtain higher values for our extensions, but at the cost of a higher FP rate,
which lowers precision. The results in Table 39 are associated with the estimated overall
performance in Table 38. For example, for ANN, the extensions offered by our approach
to Hoover's (12.1%) and Mergent (33.8%) are associated with precision, recall, and FP
rate of 0.450, 0.513, and 0.055, respectively.

Table 39.
Extensions to a Gold Standard

            Upper    Without sector information     With sector information
            bound    ANN    BN     DT     LR        ANN    BN     DT     LR
Ext(O, H)   35.0%    5.9%   7.3%   15.3%  5.0%      12.1%  11.3%  10.1%  10.5%
Ext(O, M)   71.2%    28.7%  23.4%  37.2%  24.3%     33.8%  37.1%  35.8%  32.9%
9.6 Explorations of Competitors vs. Noncompetitor Pairs
In next two subsections, we report more exploration results on structural
equivalence similarity between competitor and noncompetitor pairs, and on company
annual revenues between competitor pairs with high and low DWIOD values.
9.6.1 SE Similarity Comparison between Competitor and Noncompetitor Pairs
For each sample basket of data set III, we compute and compare the average SE
similarities for competitor and noncompetitor pairs. Figure 36 compares the DWID-based
SE similarities of the 18 sample baskets in data set III. Except for the last basket, which
has the smallest DWIOD values, the average SE similarities for competitor pairs are
greater than those for noncompetitor pairs (two-tailed t-test, p = 0.003), which indicates
that, on average, competitor companies are more structurally equivalent than
noncompetitors. Similar patterns are observed for DWOD- and DWIOD-based SE
similarities (two-tailed t-tests, p = 0.008 and p = 0.001, respectively).
[Figure: average DWID-based SE similarity for competitor vs. noncompetitor pairs, baskets 1–18]
Figure 36. Average DWID-based SE Similarity Comparison
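The per-basket comparison above can be tested with a paired t statistic over basket-level means; the similarity values below are made up for illustration:

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_statistic(x, y):
    """Paired t statistic for per-basket mean SE similarities of competitor
    (x) and noncompetitor (y) pairs; compare |t| with a t table at n-1
    degrees of freedom for the two-tailed p-value."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))

# Hypothetical basket-level average SE similarities
competitor    = [0.50, 0.45, 0.40, 0.38, 0.35]
noncompetitor = [0.30, 0.28, 0.27, 0.25, 0.24]
t = paired_t_statistic(competitor, noncompetitor)
```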
9.6.2 Comparing Annual Revenues between Competitor Pairs with High and Low
DWIODs
We observe that the average revenue of company pairs with low DWIOD values
(100 competitor pairs in data set II) is significantly lower (two-tailed t-test, p < 0.001)
than the average revenue of company pairs with high DWIOD values (92 competitor
pairs in the first five sample baskets of data set I).
9.7 Discussion
We propose and evaluate an approach that exploits company citations in online
news articles to create an intercompany network whose structural attributes can identify
competitor relationships between a pair of companies. In addition to using standard
metrics to evaluate the classification performance of our approach, we suggest several
problem-specific metrics that can measure the degree to which our approach extends a
couple of industry-strength data sources. Our evaluations prompt three broad
observations. First, the intercompany network reduces the search cost of finding
competitors compared with that associated with an exhaustive network while avoiding
the poor competitor density of a random network. In other words, the intercompany
network can capture signals about competitor relationships effectively and efficiently.
Second, the structural attributes of our intercompany network, when combined in various
types of classification models, effectively discover competitor relationships, though for
imbalanced portions of the data, we require more advanced modeling techniques (e.g.,
data segmentation, DTA) to achieve reasonable performance. Third, we quantify the
degree to which two commercial data sources are incomplete in their coverage of
competitor relationships and measure the extent to which our approach extends them
while still maintaining adequate precision.
Because our approach is language neutral, it can employ news stories in various
languages and from different countries, as long as there is a mechanism to identify
company citations. We plan to test our approach with a non-English language news
source. Furthermore, this approach may provide a means to discover business
relationships other than competitors. In fact, in parallel research, we also have applied our
approach successfully to identify the relative size of company revenues. We note that
company citations can be noisy, and we exploit the large volume of freely available
online news sources to aggregate the signals and thus reduce the noise. However, it
would be interesting to investigate the effect of volume (number of news stories) on the
classification performance of our approach. Also, in continuing empirical studies, it
would be worthwhile to explore whether the intercompany network can predict future
competitor relationships, and if so, how far into the future.
In summary, we present a data mining approach to discovering business
relationships from online news. Because of its design, our approach is scalable along
several dimensions, such as news quantity, language, and type of business relationship.
CHAPTER 10
CONCLUSIONS
This dissertation explores two related topics – personalized search and business
relationship discovery – both of which follow the process of KDW. To conclude, I
summarize the two topics, highlight the main findings and contributions, and outline
directions for future research.
Web search engines typically provide search results without considering a user's
interests or context. We propose a personalized search approach that can easily extend a
conventional search engine on the client side. Our mapping framework automatically
maps a set of known user interests onto a group of categories in the Open Directory
Project (ODP) and takes advantage of manually edited data available in ODP to train
text classifiers that correspond to user interests and therefore categorize and personalize
search results accordingly. In two sets of controlled experiments in two disjoint domains,
we compare our personalized categorization system (PCAT) with a list interface system
(LIST) that mimics a typical search engine and with a nonpersonalized categorization
system (CAT). In both experiments, we analyze system performances on the basis of the
type of task and query length and identify conditions under which our system
outperforms a baseline system. In particular, we find that PCAT is preferable to LIST for
information gathering types of tasks and for searches with short queries, and PCAT
outperforms CAT in both information gathering and finding types of tasks, as well as for
searches associated with free-form queries. From the subjects' answers to a
questionnaire, we find that PCAT is perceived as a system that can find relevant Web
pages more quickly and easily than LIST and CAT.
Potential future research along this line includes:
(1) On the basis of the conditions identified in this study, an interesting and related
direction is to study a smart system that can automatically choose a proper
interface (e.g., categorization, clustering, list) to display search results on the basis
of the nature of the query, the search results, and the user interest profile
(context).
(2) As mentioned in Section 2.2, some prior works [e.g., Leroy et al. 2003; Gauch
et al. 2003; Shen et al. 2005b] use a user's search and/or browsing activities to
learn his or her profile and further personalize search. Thus, it would be interesting
to build a user profile based not only on the given user's activities but also on the
behaviors of many other people who are known to have the same or similar
interests. In other words, a personalized search system (that extends our current
system) could try to improve a user's Web search in a collaborative (e.g., intranet)
environment by considering the search activities of other people who have the
same or similar interest profiles.
(3) In this study we assume that a user's interests are given, in that they can be
automatically extracted from his or her resume in digital form or from a database.
Thus, how to capture and model the dynamics of users' interests can be an
extension of this research, because interests may not be known in advance and
normally change over time, even long-term ones. A user's interests can be
modeled by his or her behaviors, such as searched and browsed pages (online
behaviors) and composed or read documents and emails (offline behaviors)
[Teevan et al. 2005].
(4) When classifying search results under a user's interests, we use page content up to
10KB, and the page-fetching process is time consuming. Therefore, it would be
worthwhile to study the performance of result categorization using other types of
data, such as titles and snippets from search engine results, instead of page
content, which would save the time spent fetching Web pages.
In the second topic, we present a news-driven, SNA-based business relationship
discovery framework and study two different business relationships, CRR and the
competitor relationship, to illustrate the effectiveness of our approach. By taking
advantage of the fact that content providers, such as Yahoo! Finance, organize news by
company, we consider news stories organized under a company to belong to that
company (i.e., the source). We first identify company citations (from sources to targets)
in news and then construct a directed and weighted intercompany network. Using SNA
techniques, we further identify four types of attributes (dyadic degree-, node degree-,
node centrality-, and structural equivalence-based) from the network structure. Then we
apply different classification methods with these attributes to discover the CRRs and
competitor relationships for a large number of links (company pairs) in the network.
For the CRR study, besides reporting annual and quarterly revenue-based CRR
prediction, we also show that our approach achieves better performance for flip pairs
than two alternative methods. Further, with annual revenue-based CRR, we examine the
prediction performance using each individual group of attributes and apply discriminant
analysis to identify two IVs that are significant in distinguishing positive and negative
CRRs. For the related problem of finding whether a company falls into a set of top-N
companies by revenue, we obtain 57–75% precision with substantially lower recall
(24–36%) for N between 100 and 1000.
For the competitor study, we first demonstrate the high competitor coverage and
density of our citation-based network to justify its use before presenting the
classification performance. With two company profile data sources, Hoover's and
Mergent, as gold standards, we estimate to what extent a gold standard covers the
(unknown) complete competitor space. More important, we propose metrics to estimate
how much our approach extends the knowledge available in each of the gold
standards.
Our approach is scalable and language-neutral. Thus it can not only serve as a
data filtering step but also be useful for tracing and monitoring the dynamics of business
relationships for many companies over time. The following research directions can serve
as extensions to our current work:
(1) It would be interesting to validate our approach with a variety of different
business relationships (e.g., supplier and customer relationship), news from
different languages and countries, various types of companies (e.g., private versus
public), and over time.
(2) Beyond the four types of network attributes we identify, it would be desirable, in
order to improve the classification, to derive and evaluate additional graph-based
attributes that synthesize the global and dyadic measures and represent more
effective predictors of business relationships between a pair of companies.
(3) We have seen that sector information, which is at a higher level than industry,
greatly improves classification for competitor relationships. Thus, we may use
some industry-related attributes, such as an industry taxonomy, to improve the
performance. Moreover, with the taxonomy we can group companies under
the same industry or industry subcategory into a super node to form a smaller, but
more abstract, network. We can then examine the link patterns between those
super nodes.
(4) A broader future research direction is to study new business questions beyond
the above business relationships, with or without using the current news-citation-
based intercompany network.
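The super-node idea in item (3) can be sketched as an aggregation of citation weights by industry; the tickers and industry labels below are hypothetical:

```python
from collections import defaultdict

def collapse_to_super_nodes(edges, industry_of):
    """Collapse the company-level citation network into industry 'super
    nodes' by summing directed citation weights between industries;
    self-loops keep within-industry citations."""
    super_edges = defaultdict(int)
    for (src, dst), weight in edges.items():
        super_edges[(industry_of[src], industry_of[dst])] += weight
    return dict(super_edges)

# Hypothetical tickers and industries
industry_of = {"AAA": "tech", "BBB": "tech", "CCC": "auto"}
edges = {("AAA", "BBB"): 3, ("AAA", "CCC"): 2, ("BBB", "CCC"): 1}
agg = collapse_to_super_nodes(edges, industry_of)
```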
In this dissertation, with three essays under two topics in the area of KDW, we
propose novel ideas, address relevant business problems, evaluate our approaches, verify
their effectiveness, justify their usefulness to businesses, and indicate broader
applications and future research based on our general approaches.
REFERENCES
Adamic, L. A. 2002. Zipf, power-laws, and Pareto - a ranking tutorial. http://ginger.hpl.hp.com/shl/papers/ranking/ranking.html.
Barabási, A. L., R. Albert, H. Jeong. 2000. Scale-free characteristics of random networks: the topology of the World Wide Web. Physica A, 281 69–77.
Bernstein, A., S. Clearwater, S. Hill, F. Provost. 2002. Discovering knowledge from relational data extracted from business news. In Proceedings of the KDD 2002 Workshop on Multi-Relational Data Mining, Edmonton, Alberta, Canada.
Brandes, U. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2) 163–177.
Brin, S., L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7) 107–117.
Broder, A. 2002. A taxonomy of Web search. ACM SIGIR Forum, 36(2) 3–10.
Broder, A., R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. L. Wiener. 2000. Graph structure in the Web. In Proceedings of the 9th World Wide Web Conference, 309–320.
Budzik, J., K. Hammond. 2000. User interactions with everyday applications as context for just-in-time information access. In Proceedings of the 5th International Conference on Intelligent User Interfaces, New Orleans, LA, 44–51.
Butler, D. 2000. Souped-up search engines. Nature, 405 112–115.
Carroll, J., M. B. Rosson. 1987. The paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, J.M. Carroll, Ed. MIT Press, Cambridge, MA.
Chan, P., S. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York City, NY, 164–168.
Chakrabarti, S., B. E. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J. Kleinberg. 1999. Mining the Web's link structure. Computer, 32(8) 60–67.
Chawla, N. V., K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16 321–357.
Chirita, P.A., W. Nejdl, R. Paiu, C. Kohlschütter. 2005. Using ODP metadata to personalize search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 178–185.
Cooley, R., B. Mobasher, J. Srivastava. 1997. Web mining: information and pattern discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, USA, 558–567.
Cooper, G., E. Herskovitz. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4) 309–347.
Craswell, N., D. Hawking, S. Robertson. 2001. Effective site finding using link information. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, LA, 250–257.
Cutting, D.R., D.R. Karger, J.O. Pedersen, J.W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, Copenhagen, Denmark, 318–329.
Deerwester, S., S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6) 391–407.
Dietterich, T.G. 1997. Machine learning research: four current directions. AI Magazine, 18(4) 97–136.
Dreilinger, D., A. E. Howe. 1997. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3) 195–222.
Dumais S., H. Chen. 2001. Optimizing search by showing results in context. In Proceedings of Computer-Human Interaction, Seattle, WA, 277–284.
Eirinaki, M., M. Vazirgiannis. 2003. Web mining for Web personalization. ACM Transactions on Internet Technology, 3(1) 1–27.
Estabrooks, A., N. Japkowicz. 2001. A mixture-of-experts framework for learning from unbalanced data sets. In Proceedings of the 4th International Symposium on Intelligent Data Analysis. Lisbon, Portugal, 34–43.
Faloutsos, M., P. Faloutsos, C. Faloutsos. 1999. On power-law relationships of the internet topology. In Proceedings ACM SIGCOMM, 251–262.
Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth. 1996. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press, Menlo Park, California, 1–30.
Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems, 20(1) 116–131.
Freeman, L. C. 1979. Centrality in social networks: conceptual clarification. Social Networks, 1 215–239.
Friedman, N., D. Geiger, M. Goldszmidt. 1997. Bayesian network classifiers. Machine Learning, 29(2–3) 131–163.
Garfield, E. 1979. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York.
Gauch, S., J. Chaffee, A. Pretschner. 2003. Ontology-based personalized search and browsing. Web Intelligence & Agent Systems, 1(3/4) 219–234.
Gibson, D., J. Kleinberg, P. Raghavan. 1998. Inferring Web communities from link topology. In Proceedings of 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, USA, 225–234.
Giles, C. L., K. Bollacker, S. Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM Conference on Digital Libraries, Pittsburgh, PA, USA, 89–98.
Glover, E., S. Lawrence, W. Brimingham, C. L. Giles. 1999. Architecture of a metasearch engine that supports user information needs. In Proceedings of the 8th International Conference on Information Knowledge Management, Kansas City, MO, 210–216.
Gulati, R., M. Gargiulo. 1999. Where do interorganizational networks come from? American Journal of Sociology, 104(5) 1439–1493.
Hafri, Y., C. Djeraba. 2004. Dominos: a new Web crawler’s design. In Proceedings of the 4th International Web Archiving Workshop (IWAW), Bath, UK.
Hair, J. F., W. C. Black, B. J. Babin, R. E. Anderson, R. L. Tatham. 2006. Multivariate Data Analysis. 6th edition, Prentice Hall.
Harris, Z. 1985. Distributional structure. In The Philosophy of Linguistics. Katz, J.J., Ed. Oxford University Press, New York, 26–47.
Haveliwala, T.H. 2003. Topic-sensitive PageRank. IEEE Transactions on Knowledge and Data Engineering, 15(4) 784–796.
He, B., K. C. C. Chang. 2003. Statistical schema matching across Web query interfaces. In Proceedings of the ACM SIGMOD International Conference on management of Data, San Diego, CA, USA, 217–228.
Hu, M., B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 168–177.
Jansen, B.J., A. Spink, J. Bateman, T. Saracevic. 1998. Real life information retrieval: a study of user queries on the Web. ACM SIGIR Forum. 32(1) 5–17.
Jansen, B. J., A. Spink, T. Saracevic. 2000. Real life, real users, and real needs: a study and analysis of user queries on the Web. Information Processing and Management, 36(2) 207–227.
Jansen, B. J., A. Spink, J. Pedersen. 2005. A temporal comparison of AltaVista Web searching. Journal of the American Society for Information Science and Technology, 56(6) 559–570.
Jansen, B. J., A. Spink. 2005. An analysis of Web searching by European AlltheWeb.com users. Information Processing and Management, 41 361–381.
Jeh, G., J. Widom. 2003. Scaling personalized Web search. In Proceedings of the 12th international conference on World Wide Web, Budapest, Hungary, 271–279.
Kalfoglou, Y., M. Schorlemmer. 2003. Ontology mapping: the state of the art. The Knowledge Engineering Review Journal, 18(1) 1–31.
Käki, M. 2005. Findex: search result categories help users when document ranking fails. In Proceedings of the SIGCHI conference on Human factors in computing systems, Portland, OR, 131–140.
Kautz, H., B. Selman, M. Shah. 1997. The hidden Web. AI Magazine, 18(2) 27–36.
Kessler, M. M. 1963. Bibliographic coupling between scientific papers. American Documentation, 14 10–25.
Kleinberg, J. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5) 604–632.
Kotsiantis, S., D. Kanellopoulos, P. Pintelas. 2006. Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering 30(1).
Kraft, R., F. Maghoul, C. C. Chang. 2005. Y!Q: contextual search at the point of inspiration. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 816–823.
Kumar, R., P. Raghavan, S. Rajagopalan, A. Tomkins. 1999. Trawling the Web for emerging cyber-communities. Computer Networks, 31(11–16) 1481–1493.
Lawrence, S., C. L. Giles. 1998. Searching the World Wide Web. Science, 280(3) 98–100.
Lawrence, S. 2000. Context in Web search. IEEE Data Engineering Bulletin, 23(3) 25–32.
Leroy, G., A. M. Lally, H. Chen. 2003. The use of dynamic contexts to improve casual Internet searching. ACM Transactions on Information Systems, 21(3) 229–253.
Levine, J. H. 1972. The Sphere of Influence. American Sociological Review, 37(1) 14–27.
Liu, B. 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 1st edition, Springer.
Liu, F., C. Yu, W. Meng. 2004. Personalized Web search for improving retrieval effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16(1) 28–40.
Lorrain, F., H. C. White. 1971. Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1 49–80.
Maltz, D., K. Ehrlich. 1995. Pointing the way: active collaborative filtering. In Proceedings of the Conference on Computer-Human Interaction, Denver, CO, 202–209.
Menczer, F., G. Pant, P. Srinivasan. 2004. Topical Web crawlers: evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4(4) 378–419.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, K. J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4) 235–244.
Mitchell, T. M. 1997. Machine Learning. WCB/McGraw-Hill.
Najork, M., A. Heydon. 2001. High-performance Web crawling. In Handbook of Massive Data Sets, J. Abello, P. Pardalos, M. Resende, Eds. Kluwer Academic Publishers, 25–45.
O'Madadhain, J., D. Fisher, S. White, Y. B. Boey. 2006. JUNG: the Java universal network/graph framework (ver. 1.7.4). http://jung.sourceforge.net.
Oyama, S., T. Kokubo, T. Ishida. 2004. Domain-specific Web search with keyword spices. IEEE Transactions on Knowledge and Data Engineering, 16(1) 17–27.
Padmanabhan, B., Z. Zheng, S. Kimbrough. 2006. An empirical analysis of the value of complete information for eCRM models. MIS Quarterly, 30(2) 247–267.
Palmer, J. W., J. P. Bailey, S. Faraj. 2000. The role of intermediaries in the development of trust on the WWW: the use and prominence of trusted third parties and privacy statements. Journal of Computer-Mediated Communication, 5(3).
Pang, B., L. Lee, S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA, 79–86.
Pant, G., P. Srinivasan. 2006. Link contexts in classifier-guided topical crawlers. IEEE Transactions on Knowledge and Data Engineering, 18(1) 107–122.
Park, H. W. 2003. Hyperlink network analysis: a new method for the study of social structure on the Web. Connections, 25(1) 49–61.
Pazzani, M., C. Merz, P. Murphy. 1994. Reducing misclassification costs. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 217–225.
Pitkow, J., H. Schutze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, T. Breuel. 2002. Personalized search. Communications of the ACM, 45(9) 50–55.
Porter, M. 1980. An algorithm for suffix stripping. Program, 14(3) 130–137.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Richards, W. D., G. A. Barnett (Eds.) 1993. Progress in Communication Science, 12, Ablex Pub. Corp., Norwood, NJ.
Riloff, E., J. Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence, RI, 117–124.
Salton, G., M. J. McGill. 1986. Introduction to Modern Information Retrieval, McGraw-Hill, New York.
Salzberg, S. 1997. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1 317–327.
Schapire, R. E. 1999. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1401–1406.
Scott, J. 2000. Social Network Analysis: A Handbook, 2nd ed., Sage Publications, London.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1) 1–47.
Sellen, A.J., R. Murphy, K. L. Shaw. 2002. How knowledge workers use the Web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our World, Changing Ourselves. Minneapolis, MN, 227–234.
Shakes, J., M. Langheinrich, O. Etzioni. 1997. Dynamic reference sifting: a case study in the homepage domain. In Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA, 189–200.
Shen, D., Z. Chen, Q. Yang, H. Zeng, B. Zhang, Y. Lu, W. Ma. 2004. Web-page classification through summarization. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, South Yorkshire, UK, 242–249.
Shen, X., B. Tan, C. X. Zhai. 2005a. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, Salvador, Brazil, 43–50.
Shen, X., B. Tan, C. X. Zhai. 2005b. Implicit user modeling for personalized search. In Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 824–831.
Small, H. 1973. Co-citation in the scientific literature: a new measurement of the relationship between two documents. Journal of the American Society for Information Science, 24(4) 265–269.
Speretta, M., S. Gauch. 2005. Personalizing search based on user search histories. In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence, Compiegne University of Technology, France, 622–628.
Srinivasan, P., F. Menczer, G. Pant. 2005. A general evaluation framework for topical crawlers. Information Retrieval, 8(3) 417–447.
Srivastava, J., R. Cooley, M. Deshpande, P. Tan. 2000. Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2) 12–23.
Sugiyama, K., K. Hatano, M. Yoshikawa. 2004. Adaptive Web search based on user profile constructed without any effort from users. In Proceedings of the 13th International Conference on World Wide Web, New York, NY, 675–684.
Sullivan, D. 2000. NPD search and portal site study. http://searchenginewatch.com/sereport/article.php/2162791.
Tan, A. H. 2002. Personalized information management for Web intelligence. In Proceedings of World Congress on Computational Intelligence, Honolulu, HI, 1045–1050.
Tan, A. H., C. Teo. 1998. Learning user profiles for personalized information dissemination. In Proceedings of International Joint Conference on Neural Network, Anchorage, AK, 183–188.
Teevan, J., S. T. Dumais, E. Horvitz. 2005. Personalizing search via automated analysis of interests and activities. In Proceedings of 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 449–456.
Uzzi, B. 1999. Embeddedness in the making of financial capital: how social relations and networks benefit firms seeking financing. American Sociological Review, 64 481–505.
Walker, G., B. Kogut, W. Shan. 1997. Social capital, structural holes and the formation of an industry network. Organization Science, 8(2) 109–125.
Wasserman, S., K. Faust. 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK.
Weiss, G. M. 2004. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1) 7–19.
Weiss, G. M., F. Provost. 2003. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19 315–354.
Wen, J.R., J. Y. Nie, H. J. Zhang. 2002. Query clustering using user logs. ACM Transactions on Information Systems, 20(1) 59–81.
Witten, I. H., E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., Morgan Kaufmann, San Francisco.
Xu, J., W.B. Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 4–11.
Yang, Y., X. Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, CA, 42–49.
Zaïane, O. R., M. Xin, J. Han. 1998. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proceedings of Advances in Digital Libraries, Santa Barbara, CA, 19–29.
Zamir, O., O. Etzioni. 1999. Grouper: A dynamic clustering interface to Web search results. Computer Networks: The International Journal of Computer and Telecommunications Networking, 31(11–16) 1361–1374.