WEB MINING FOR KNOWLEDGE DISCOVERY
by
Zhongming Ma
A dissertation submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy
in
Business Administration
David Eccles School of Business
The University of Utah
May 2007
ABSTRACT
The Web has become an unprecedented world-wide repository of knowledge. It
contains valuable information for managers, analysts, and all types of knowledge
workers; yet the Web is dynamic and noisy. Hence, knowledge discovery from the Web,
while challenging, is an essential tool for the knowledge economy. This
dissertation covers two related topics – personalized search and business relationship
discovery – in the area of knowledge discovery from the Web.
In Part I, we propose an automatic personalized search approach that categorizes
search results under a user’s interests by first mapping a user’s known interests to Open
Directory Project (ODP) categories. In two sets of controlled experiments, we compare
our personalized categorization system (PCAT) with two baseline systems, a list interface
system (LIST) and a nonpersonalized categorization system (CAT). We analyze system
performances on the basis of the type of task and query length and identify conditions
under which our system outperforms a baseline system.
In Part II, we present a news-driven, social network analysis (SNA)-based business
relationship discovery framework and study two different business relationships,
the company revenue relation (CRR) and the competitor relationship, to illustrate
the effectiveness of our approach. As a news story pertaining to a company often cites
several other companies, we construct an intercompany network using such citations,
employ SNA techniques to identify a set of attributes from the network structure, and use
the attributes to predict CRRs and discover the competitor relationships. We find that, for
the two business relationships studied, the structural attributes of the intercompany
network are valuable in predicting the business relationships. Also, our news-driven,
SNA-based business relationship discovery framework is scalable (as compared to
manual approaches) and language-neutral. While we validate our approach with data for
public companies in the U.S., the approach can be easily extended to discover business
relationships for private and foreign companies, for which such data are either
unavailable or hard to collect.
TABLE OF CONTENTS
ABSTRACT.......................................................................................................................iv
ACKNOWLEDGMENTS..................................................................................................ix
Chapter
1 INTRODUCTION.......................................................................................................1
1.1 Knowledge Discovery on the Web..................................................................1
1.2 Personalized Search.........................................................................................4
1.3 Business Relationship Discovery....................................................................7
1.4 Overview of Dissertation...............................................................................10
PART I PERSONALIZED SEARCH.........................................................................11
2 INTRODUCTION AND LITERATURE REVIEW.................................................12
2.1 Introduction....................................................................................................12
2.2 Related Literature..........................................................................................16
3 OUR APPROACH.....................................................................................................25
3.1 Step 1: Obtaining an Interest Profile.............................................................26
3.2 Step 2: Generating Category Profiles............................................................26
3.3 Step 3: Mapping Interests to ODP Categories...............................................28
3.4 Step 4: Resolving Mapped Categories...........................................................31
3.5 Step 5: Categorizing Search Results..............................................................36
3.6 Implementation..............................................................................................38
4 EXPERIMENTS........................................................................................................46
4.1 Studied Domains and Domain Experts..........................................................47
4.2 Professional Interests, Search Tasks, and Query Length...............................47
4.3 Subjects..........................................................................................................51
4.4 Experiment Process.......................................................................................54
5 EVALUATIONS AND DISCUSSIONS...................................................................55
5.1 Comparing Mean Log Search Time by Query Length..................................55
5.2 Comparing Mean Log Search Time for Information Gathering Tasks.........58
5.3 Comparing Mean Log Search Time for Site Finding Tasks..........................60
5.4 Comparing Mean Log Search Time for Finding Tasks.................................61
5.5 Questionnaire and Hypotheses......................................................................61
5.6 Hypothesis Test Based on Questionnaire......................................................63
5.7 Comparing Indices of Relevant Results........................................................65
5.8 Discussions....................................................................................................69
5.9 Limitations and Future Directions.................................................................71
PART II BUSINESS RELATIONSHIP DISCOVERY...............................................74
6 INTRODUCTION AND LITERATURE REVIEW.................................................75
6.1 Introduction....................................................................................................75
6.2 Literature Review..........................................................................................78
7 NETWORK-BASED ATTRIBUTES AND DATA..................................................82
7.1 Notation in Directed Graphs..........................................................................82
7.2 Notation in Directed, Weighted Graphs........................................................83
7.3 Raw Data.......................................................................................................89
7.4 Preliminary Data Processing..........................................................................90
7.5 Node and Link Identification.........................................................................91
7.6 Attribute Distributions...................................................................................91
8 PREDICTING COMPANY REVENUE RELATIONS............................................97
8.1 Measurements of CRR...................................................................................98
8.2 Research Questions........................................................................................99
8.3 Research Methods........................................................................................100
8.4 Results and Analyses...................................................................................103
8.5 Discussions..................................................................................................117
9 DISCOVERING COMPETITOR RELATIONSHIPS............................................120
9.1 Approach Outline and Research Questions.................................................120
9.2 Data Sets......................................................................................................121
9.3 Examining Competitor Coverage and Density of the Intercompany Network....125
9.4 Competitor Discovery..................................................................................129
9.5 Competitor Extension..................................................................................143
9.6 Explorations on Competitors vs. Noncompetitor Pairs................................149
9.7 Discussions..................................................................................................150
10 CONCLUSIONS.....................................................................................................153
REFERENCES................................................................................................................158
ACKNOWLEDGMENTS
I would like to first thank my advisors, Dr. Gautam Pant and Dr. Olivia R. Liu
Sheng, for their great efforts in consistently improving my research ideas, which led to
the three essays in this dissertation. I also thank Dr. Ellen Riloff, Dr. Paul Hu, and Dr.
Wei Gao for providing constructive comments for my dissertation.
I am sincerely grateful to the David Eccles School of Business for four years of
financial support during my Ph.D. study. I also appreciate the generous support from Dr.
Olivia R. Liu Sheng, Dr. David Plumlee (former department head), and Dr. Robert D.
Allen (department head) for covering some expenses incurred in my research and my
fifth year’s tuition. I am thankful to the eBusiness Center at Pennsylvania State
University for the award funding that supported my research project in personalized search.
I should not forget that Dr. Olivia R. Liu Sheng brought me into this program.
And finally, I would like to give my special thanks to my parents and my brother for
their continuous and unconditional support, no matter where I am or what I am doing,
through ups and downs.
CHAPTER 1
INTRODUCTION
1.1 Knowledge Discovery on the Web
Knowledge discovery from databases (KDD) refers to “the nontrivial process of
identifying valid, novel, potentially useful, and ultimately understandable patterns in
data” [Fayyad et al. 1996]. KDD has achieved a broad range of applications including
pattern recognition and predictive analytics in many different areas, such as engineering,
business, and science. Knowledge discovery has two types of goals, verification and
discovery. In general the former goal refers to verifying a user’s hypothesis and the latter
can be further divided into prediction (i.e., predicting unknown or future values) and
description (i.e., presenting identified results such as patterns in a human-understandable
form) [Fayyad et al. 1996].
The Web has become a universal repository with a tremendous amount of data that
can be accessed from anywhere in the world, and it has experienced continuous growth in
both content and users. Therefore, the Web presents immense opportunities for
discovering knowledge. However, unlike conventional databases, the data on the Web are
mostly semistructured or unstructured. This situation makes knowledge discovery from
the Web (KDW) challenging as compared to KDD. The KDW process requires considerable
effort on identifying, selecting, and processing Web data possibly from multiple sources
and in different (often free-form text) formats. Manual analysis that turns such large and
heterogeneous Web data into knowledge is impractical, and thus KDW becomes an
attempt to address the accentuated problem of data overload on the Web. We adapt the
KDD process presented in [Fayyad et al. 1996] for the Web context and present the
process of KDW in Figure 1.
Web mining is a step in the KDW process that aims to analyze data and
discover knowledge from the Web. The Web data include all kinds of Web documents,
hyperlinks among Web pages, and Web usage logs. Depending on the type of Web data
being mined, Web mining can be broadly divided into three categories: Web content
mining, Web structure mining, and Web usage mining [Srivastava et al. 2000].
Web content mining is the process of discovering knowledge from Web page content
(i.e., often text), and it often uses techniques based on data mining and text mining.
According to [Liu 2007], important Web content mining problems include Web
crawling [e.g., Brin and Page 1998; Pant and Srinivasan 2006], Web search [e.g., Brin
and Page 1998], processing (e.g., clustering or categorizing) of search results
according to page content [e.g., Zamir and Etzioni 1999; Dumais and Chen 2001],
extraction of Web information such as online opinions [e.g., Peng et al. 2002; Hu and
Liu 2004], Web information integration [e.g., Kalfoglou and Schorlemmer 2003; He
and Chang 2003], etc.

Figure 1. Process of Knowledge Discovery from the Web
Web structure mining tries to discover useful information such as importance of
pages from the structure of hyperlinks on the basis of social network analysis (SNA)
techniques and graph theory. Its research topics cover ranking pages [e.g., Brin and
Page 1998; Chakrabarti et al. 1999], finding Web communities [e.g., Gibson et al.
1998], etc.
Web usage mining is the automatic discovery of user access patterns from Web logs
[Cooley et al. 1997]. The identified visit patterns can help in understanding the
overall access patterns and trends for all users [e.g., Zaïane et al. 1998] and allow for
Web site design to be responsive to business goals and customer needs, such as user-
level customization [e.g., Eirinaki and Vazirgiannis 2003].
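The log-aggregation step that underlies Web usage mining can be sketched minimally as follows; the sample log lines, field layout, and use of client IP as a user identifier are simplifying assumptions for illustration only:

```python
from collections import Counter, defaultdict

def visits_per_user(log_lines):
    """Aggregate page-request counts per user (approximated here by
    client IP) from Common Log Format entries."""
    counts = defaultdict(Counter)
    for line in log_lines:
        parts = line.split()
        if len(parts) < 8:
            continue  # skip malformed lines
        ip, path = parts[0], parts[6]  # request path is the 7th token
        counts[ip][path] += 1
    return counts

# Fabricated sample entries in Common Log Format
log = [
    '1.2.3.4 - - [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '1.2.3.4 - - [10/Oct/2000:13:56:01 -0700] "GET /products.html HTTP/1.0" 200 4510',
    '5.6.7.8 - - [10/Oct/2000:13:57:12 -0700] "GET /index.html HTTP/1.0" 200 2326',
]
stats = visits_per_user(log)
```

Real usage-mining systems would go on to sessionize these counts and mine access patterns from them.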
My dissertation consists of two related topics/parts: personalized (online) search
and business relationship discovery, both of which are in the area of KDW. The first
topic presents and evaluates an automatic personalized search framework that categorizes
search results under a user’s interests, in order to examine whether the proposed personalized
search approach outperforms noncategorized and nonpersonalized baseline systems. This
research falls under Web content mining. The second topic proposes an approach to identifying
an intercompany network using company citations from Web content (more specifically,
online news stories) and discovers business relationships between companies from the
network on the basis of SNA and machine learning techniques. Therefore the second
topic covers both Web content mining and Web structure mining. The main research
question we explore is whether structural attributes derived from the intercompany
network, which in turn is derived from company citations in online news, can identify
business relationships. As shown in Figure 2, at a high level, the first topic connects Web
content to people, and the second uses Web content to discover relationships between
companies. Thus the two topics are connected through mining of Web content. However,
the two topics generate different types of knowledge – categorized and personalized
search results versus company relationships – and hence entail diverse adoptions of Web
data, processing, and Web mining. In the next two sections we briefly introduce the two
topics.
1.2 Personalized Search
Most search engines, including the popular ones such as Google and Yahoo!, ignore
users’ search context, such as users’ interests. As a result the same query from different
Figure 2. Process View of the Two Topics of the Dissertation
users with different information needs retrieves the same search results displayed in the
same way. Hence, they use a “one size fits all” [Lawrence 2000] approach. We note that
currently Google is attempting to address this problem with some level of voluntary
personalization. Personalization techniques that consider users’ context during search can
improve search efficiency [Pitkow et al. 2002]. We propose and implement an automatic
approach to categorizing search results according to a user’s interests, to help users find
relevant information more quickly. Our approach is particularly well suited for a
workplace scenario where much of the information, needed by the proposed system,
about professional interests and skills of knowledge workers is available to the employer.
Personalizing based on such information within an organization can be expected to raise
fewer privacy concerns than a general-purpose search engine gathering data on
user interests. Moreover, unlike other approaches, our approach does not impose any
burden of implicit or explicit feedback from the user.
We customize the general process of KDW in Figure 1 and present the process of
interest-based personalized search for knowledge discovery in Figure 3, where the
processes spanned by the horizontal double-arrow lines correspond to their equivalents
in Figure 1. The proposed approach includes a mapping framework that automatically maps
user interests into a group of categories from the Open Directory Project (ODP) taxonomy.
A text classifier is built from the content of the mapped ODP categories and later is used
at query-time to categorize search results under user interests. For a workplace scenario
where the employees’ professional interests and skills can be automatically extracted
from their resumes or the company’s database, this approach is fully automatic in that users
do not need to provide implicit or explicit feedback during the search. Also, the use of
ODP is transparent to the users because the mapping between interests and ODP
categories is automatically generated. The lack of explicit or implicit feedback and the
use of the ODP taxonomy without a user’s awareness of it differentiate this work from many
others, such as [Gauch et al. 2003; Liu et al. 2004; Chirita et al. 2005]. In addition, we
study three search systems with different interfaces for displaying search results. The first
system (LIST) shows search results in a page-by-page list. The second (CAT) categorizes
and displays results under certain ODP categories. The third (PCAT) is what we propose,
and PCAT categorizes and displays results under user interests. We compare PCAT
with LIST and PCAT with CAT on the basis of different query lengths and different
types of search tasks.
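The interest-to-category mapping and result categorization described above can be sketched, very roughly, as follows. This is an illustrative term-overlap (cosine similarity) toy, not the actual mapping framework detailed in Chapter 3; the category names and profile texts are fabricated stand-ins for ODP category content:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def map_interest(interest, category_profiles, k=2):
    """Rank taxonomy categories by textual similarity to an interest phrase."""
    q = Counter(interest.lower().split())
    scored = [(cosine(q, Counter(text.lower().split())), cat)
              for cat, text in category_profiles.items()]
    return [cat for score, cat in sorted(scored, reverse=True)[:k] if score > 0]

# Hypothetical category profiles standing in for ODP category content
profiles = {
    "Computers/Data_Mining": "data mining knowledge discovery patterns",
    "Business/Marketing": "marketing advertising brand customers",
    "Computers/Internet/Searching": "search engines web query retrieval",
}
result = map_interest("web search engines", profiles)
# result → ['Computers/Internet/Searching']
```

In the actual system, data under the mapped categories would then train text classifiers that sort search results under each interest at query time.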
Contributions of this research are that we present an automatic approach to
personalizing Web search given a set of user interests and compare our proposed
approach with each of two baseline systems to identify boundary conditions under
which our system outperforms a baseline system. The main findings include that (1) PCAT
is better than LIST for one-word queries and for Information Gathering tasks,
and (2) PCAT outperforms CAT for free-form queries and for both Information Gathering
and Finding types of tasks, in terms of the time spent on finding relevant results. We
conclude that no system is universally better than the others – the performance of a
system depends on parameters such as query length and type of task.

Figure 3. Knowledge Discovery Process for Interest-Based Personalized Search
1.3 Business Relationship Discovery
Business news contains rich and current information about companies and the
relationships among them. However, reading news is very time-consuming and requires a
reader to possess certain skills, the most basic of which is a good understanding of the language in
which the news is written. The huge volume of news stories makes the manual
identification of relationships among a large number of companies nontrivial and
unscalable. The previous literature using news to automatically discover business
relationships among companies is sparse. Many researchers in areas such as organization
behavior and sociology employ SNA techniques to investigate the nature and
implications of business relationships on the basis of explicitly given company
relationships provided by reliable data sources [e.g., Levine 1972; Walker et al. 1997;
Uzzi 1999; Gulati and Gargiulo 1999]. In contrast, researchers in bibliometrics and
computer science tend to identify links between nodes using implicit signals, such as
article citations, URL links, and email communications, derived from large and noisy
data sources. They study problems such as identifying importance of individual nodes
(e.g., Web pages, journal articles) in a network [e.g., Garfield 1979; Brin and Page 1998;
Kleinberg 1999] and finding communities on the Web [e.g., Kautz et al. 1997; Gibson et
al. 1998], instead of discovering business relationships between companies. We present
an approach of automatic discovery of company relationships from online business news
using machine learning and SNA techniques. Figure 4 illustrates the knowledge
discovery process for business relationship discovery from Web data (i.e., online news).
Given that a news story pertaining to a company often cites one or more other
companies, we construct a directed and weighted intercompany network on the basis of
citations from a large amount of online news by considering company citations as
directed links from the focal companies to the cited companies. Further we identify four
types of attributes from the network structure using SNA techniques. More specifically,
they are dyadic degree-based, node degree-based, node centrality-based, and structural
equivalence-based attributes. These attributes differ in their coverage of the network.
With those network attributes, we study two types of company relationships using
classification methods. This news-driven, SNA-based business relationship discovery
approach is scalable and language-neutral. Research along this line consists of two
studies that differ in their target business relationships and we describe them as follows.
Figure 4. KDW Process for Business Relationship Discovery
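The network construction and attribute extraction described above can be sketched minimally as follows. The company names and citation pairs are fabricated for illustration, and only two simple attributes are shown; the dissertation's actual attribute set, defined in Chapter 7, is richer:

```python
from collections import Counter

# Each tuple: (focal company of a news story, company cited in that story).
# These pairs are fabricated for illustration.
citations = [
    ("AcmeCorp", "BetaInc"), ("AcmeCorp", "BetaInc"),
    ("BetaInc", "AcmeCorp"), ("AcmeCorp", "GammaLtd"),
]

# Directed, weighted links: weight = number of citing stories
weights = Counter(citations)

def out_degree(node):
    """Weighted out-degree: total citations made by a company's news."""
    return sum(w for (src, _), w in weights.items() if src == node)

def in_degree(node):
    """Weighted in-degree: total citations received from others' news."""
    return sum(w for (_, dst), w in weights.items() if dst == node)

def dyadic_weight(a, b):
    """Dyadic attribute: citation weight in each direction of a pair."""
    return weights[(a, b)], weights[(b, a)]
```

Such per-node and per-dyad quantities are the kind of structural attributes later fed to the classifiers.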
The first one concentrates on predicting a company revenue relation (CRR).
Given a pair of companies, CRR refers to the relative size of two companies’ annual
revenues. We find that degree-based and centrality-based attributes derived from network
structure can predict CRR with reasonable precision, recall, and accuracy (all above 70%)
for all directly linked company pairs in the network. Contributions of this study are that
(1) our approach can serve as a data filtering step for studying the revenue relations
among a very large number of companies; (2) since the revenue information for public
companies is available only quarterly, our approach can be used as a prediction tool for
revenues; and (3) our approach can be applied to discover the revenue relations for
private or foreign companies as well.
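As a toy stand-in for the classification methods actually used in Chapter 8, the sketch below predicts the CRR of a pair from a single network attribute (weighted in-degree) and scores the heuristic against ground-truth labels; all names, in-degrees, and labels are fabricated for illustration:

```python
def predict_larger(pair, in_degree):
    """Toy heuristic: the company with the higher weighted in-degree is
    predicted to have the larger annual revenue."""
    a, b = pair
    return a if in_degree.get(a, 0) >= in_degree.get(b, 0) else b

# Fabricated in-degrees and ground-truth labels for illustration
in_deg = {"AcmeCorp": 40, "BetaInc": 12, "GammaLtd": 25}
labeled_pairs = [  # (pair, company with larger actual revenue)
    (("AcmeCorp", "BetaInc"), "AcmeCorp"),
    (("BetaInc", "GammaLtd"), "GammaLtd"),
    (("AcmeCorp", "GammaLtd"), "GammaLtd"),
]
correct = sum(predict_larger(p, in_deg) == truth for p, truth in labeled_pairs)
accuracy = correct / len(labeled_pairs)
```

The real study replaces this single-attribute rule with trained classifiers over the full set of degree-based and centrality-based attributes.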
In the second work we study the competitor relationship between companies. We
discover the competitor relationship between a pair of connected companies in the
intercompany network on the basis of the four types of attributes. In particular, we
study the classification of company pairs for an imbalanced data set, where the number of
competitor pairs is much smaller than that of noncompetitor pairs. We use two gold
standards, Hoovers.com and Mergentonline.com, which are professional company profile
websites containing manually identified competitors for each company, to evaluate the
classification performance of our approach. Given that neither of the gold standards is
complete in the coverage of competitors, we estimate the coverage of each gold standard.
Finally we present metrics to estimate how much our approach can extend each of the
gold standards. Contributions of this work include that we present an automatic
approach to discovering competitor relationships between companies. Our approach is
particularly useful to serve as an initial data filtering step to identify a group of potential
competitors for each of many companies. We study an imbalanced data set problem and
report the classification performance for competitor pairs in both the imbalanced data set
and the whole data set. Most important, we report the estimated extension of our
approach to each of two gold standards.
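The imbalanced-data issue noted above is why per-class metrics matter. The sketch below, with fabricated labels, shows how a degenerate classifier that never predicts the minority class can still reach high accuracy while recalling no competitor pairs at all:

```python
def minority_metrics(y_true, y_pred, positive="competitor"):
    """Precision, recall, and accuracy, with the rare class as positive."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, acc

# 2 competitor pairs among 10; a classifier that always answers "non"
# reaches 80% accuracy yet identifies no competitors.
y_true = ["competitor"] * 2 + ["non"] * 8
y_pred = ["non"] * 10
p, r, a = minority_metrics(y_true, y_pred)
```

This is why the classification performance for competitor pairs is reported separately for the imbalanced data set and the whole data set.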
1.4 Overview of Dissertation
At a high level the dissertation consists of two parts. Part I, which consists of
chapters 2 to 5, covers the first topic of the dissertation: Interest-Based Personalized
Search. Part II, which includes chapters 6 to 9, covers the two related studies in business
relationship discovery. More specifically we highlight each chapter as follows.
Chapter 2 introduces the research on personalized search and reviews related prior
work. We detail our approach of personalized search in Chapter 3. Experiments are
covered in Chapter 4 and result analyses and conclusions are discussed in Chapter 5. We
introduce the topic of business relationship discovery and review prior literature in
Chapter 6. Chapter 7 describes how to identify attributes from the network structure and
explains the data and data processing procedures. We concentrate on predicting CRR in
Chapter 8 and on discovering competitor relationships in Chapter 9. Finally we conclude
the dissertation in Chapter 10.
CHAPTER 2
INTRODUCTION AND LITERATURE REVIEW
2.1 Introduction
The Web provides an extremely large and dynamic source of information, and the
continuous creation and updating of Web pages magnifies information overload on the
Web. Both casual and noncasual users (e.g., knowledge workers) often use search
engines to find a needle in this constantly growing “haystack.” Sellen et al. [2002], who
define a knowledge worker as someone “whose paid work involves significant time spent
in gathering, finding, analyzing, creating, producing or archiving information,” report
that 59% of the tasks performed on the Web by a sample of knowledge workers fall into
the categories of Information Gathering and Finding, which require an active use of Web
search engines.
Most existing Web search engines return a list of search results based on a user’s
query but ignore the user’s specific interests and/or search context. Therefore, the
identical query from different users or in different contexts will generate the same set of
results displayed in the same way for all users, a so-called one-size-fits-all [Lawrence
2000] approach. Furthermore, the number of search results returned by a search engine is
often so large that the results must be partitioned into multiple result pages. In addition,
individual differences in information needs, polysemy (multiple meanings of the same
word), and synonymy (multiple words with the same meaning) pose problems [Deerwester et
al. 1990] in that a user may have to go through many irrelevant results or try several
queries before finding the desired information. Problems encountered in searching are
exacerbated further when search engine users employ short queries [Jansen et al.
1998]. However, personalization techniques that put a search in the context of the user’s
interests may alleviate some of these issues.
In this study, which focuses on knowledge workers’ search for information online
in a workplace setting, we assume that some information about the knowledge workers,
such as their professional interests and skills, is known to the employing organization and
can be extracted automatically with an information extraction (IE) tool or with database
queries. The organization can then use such information as an input to a system based on
our proposed approach and provide knowledge workers with a personalized search tool
that will reduce their search time and boost their productivity.
For a given query, a personalized search can provide different results for different
users or organize the same results differently for each user. It can be implemented on
either the server side (search engine) or the client side (organization’s intranet or user’s
computer). Personalized search implemented on the server side is computationally
expensive when millions of users are using the search engine, and it also raises privacy
concerns when information about users is stored on the server. A personalized search on
the client side can be achieved by query expansion and/or result processing [Pitkow et al.
2002]. By adding extra query terms associated with user interests or search context, the
query expansion approach can retrieve different sets of results. The result processing
includes result filtering, such as removal of some results, and reorganizing, such as
reranking, clustering, and categorizing the results.
Our proposed approach is a form of client-side personalization based on an
interest-to-taxonomy mapping framework and result categorization. It piggybacks on a
standard search engine such as Google1 and categorizes and displays search results on the
basis of known user interests. As a novel feature of our approach, the mapping
framework automatically maps the known user interests onto a set of categories in a Web
directory, such as the Open Directory Project2 (ODP) or Yahoo!3 directory. An advantage
of this mapping framework is that, after user interests have been mapped onto the
categories, a large amount of manually edited data under these categories is freely
available to be used to build text classifiers that correspond to these user interests. The
text classifiers then can categorize search results according to the user’s various interests
at query time. The same text classifiers may be used to categorize emails and other digital
documents, which suggests that our approach may be extended to a broader domain of
content management.
The main research questions that we explore are as follows: (1) What is an
appropriate framework for mapping a user’s professional interests and skills onto a group
of concepts in a taxonomy such as a Web directory? (2) How does a personalized
1 http://www.google.com.
2 http://www.dmoz.com.
3 http://www.yahoo.com.
categorization system (PCAT) based on our proposed approach perform differently from
a list interface system (LIST), similar to a conventional search engine? (3) How does
PCAT perform differently from a nonpersonalized categorization system (CAT) that
categorizes results without any personalization? The third question attempts to separate
the effect of categorization from the effect of personalization in the proposed system. We
explore the second and third questions along two dimensions, type of task and query
length.
Figure 5 illustrates the input and output of these three systems. LIST requires two
inputs: a search query and a search engine, and its output, similar to what a conventional
search engine adopts, is a page-by-page list of search results. Using a large taxonomy
(ODP Web directory), CAT classifies search results and displays them under some
taxonomy categories; in other words, it uses the ODP taxonomy as an additional input.
Finally, PCAT adds another input, namely, a set of user interests. The mapping
framework in PCAT automatically identifies a group of categories from the ODP
taxonomy as relevant to the user’s interests. Using data from these relevant categories,
the system generates text classifiers to categorize search results under the user’s various
interests at query time.
We compare PCAT with LIST and with CAT in two sets of controlled
experiments. Compared with LIST, PCAT works better for searches with short queries
and for Information Gathering tasks. In addition, PCAT outperforms CAT for both
Information Gathering and Finding tasks and for searches with free-form queries.
Subjects indicate that PCAT enables them to identify relevant results and complete given
tasks more quickly and easily than does LIST or CAT.
Figure 5. Input and Output of the Three Systems
2.2 Related Literature
This section reviews prior studies pertaining to personalized search. We also
consider several studies using the ODP taxonomy to represent a search context, review
studies on the taxonomy of Web activities, and end by briefly discussing text
categorization.
According to Lawrence [2000], next-generation search engines will increasingly
use context information. Pitkow et al. [2002] also suggest that a contextual computing
approach that enhances user interactions through a greater understanding of the user, the
context, and the applications may prove a breakthrough in personalized search efficiency.
They further identify two primary ways to personalize search, query expansion and result
processing [Pitkow et al. 2002], which can complement each other.
2.2.1 Query Expansion
We use an approach similar to query expansion for finding terms related to user
interests in our interest mapping framework. Query expansion refers to the process of
augmenting a query from a user with other words or phrases in order to improve search
effectiveness. It originally was applied in information retrieval (IR) to solve the problem
of word mismatch that arises when search engine users employ different terms than those
used by content authors to describe the same concept [Xu and Croft 1996]. Because the
word mismatch problem can be reduced through the use of longer queries, query
expansion may offer a solution [Xu and Croft 1996].
In line with query expansion, current literature provides various definitions of
context. In the Inquirus 2 project [Glover et al. 1999], a user manually chooses a context
in the form of a category, such as research papers or organizational homepages, before
starting a search. Y!Q,4 a large-scale contextual search system, allows a user to choose a
context in the form of a few words or a whole article through three methods: a novel
information widget executed in the user’s Web browser, Yahoo! Toolbar,5 or Yahoo!
Messenger6 [Kraft et al. 2005]. In the Watson project, Budzik and Hammond [2000]
derive context information from the whole document a user views. Instead of using a
whole document, Finkelstein et al. [2002] limit the context to the text surrounding a user-
marked query term(s) in the document. That text is part of the whole document, so their
query expansion is based on a local context analysis approach [Xu and Croft 1996].
4 http://yq.search.yahoo.com.
5 http://toolbar.yahoo.com.
6 http://beta.messenger.yahoo.com.
Leroy et al. [2003] define context as the combination of titles and descriptions of clicked
search results after an initial query. In all these studies, queries get expanded on the basis
of the context information, and results are generated according to the expanded queries.
2.2.2 Result Processing
Relatively few studies deal with result processing, which includes result filtering
and reorganizing. Domain filtering eliminates documents irrelevant to given domains
from the search results [Oyama et al. 2004]. For example, Ahoy!, a homepage finder
system, uses domain-specific filtering to eliminate most results returned by one or more
search engines but retain the few pages that are likely to be personal homepages [Shakes
et al. 1997]. Tan and Teo [1998] propose a system that filters out news items that may not
be of interest to a given user according to that user’s explicit (e.g., satisfaction ratings)
and implicit (e.g., viewing order, duration) feedback to create personalized news.
Another approach to result processing is to reorganize, which involves reranking,
clustering, and categorizing search results. For example, Teevan et al. [2005] construct a
user profile (context) over time with rich resources including issued queries, visited Web
pages, composed or read documents and emails. When the user sends a query, the system
reranks the search results on the basis of the learned profile. Shen et al. [2005a] use
previous queries and summaries of clicked results in the current session to rerank results
for a given query. Similarly, UCAIR [Shen et al. 2005b], a client-side personalized
search agent, employs both query expansion on the basis of the immediately preceding
query and result reranking on the basis of summaries of viewed results. Other works also
consider reranking according to a user profile [Gauch et al. 2003; Sugiyama et al. 2004;
Speretta and Gauch 2005; Chirita et al. 2005; Kraft et al. 2005]. Gauch et al. [2003] and
Sugiyama et al. [2004] learn a user’s profile from his or her browsing history, whereas
Speretta and Gauch [2005] build the profile on the basis of search history, and Chirita et
al. [2005] require the user to specify the profile entries manually.
Scatter/Gather [Cutting et al. 1992] is one of the first systems to present
documents in clusters. Another system, Grouper [Zamir and Etzioni 1999], uses snippets
of search engine results to cluster the results. Tan [2002] presents a user-configurable
clustering approach that clusters search results using titles and snippets of search results
and the user can manually modify these clusters.
Finally, in comparing seven interfaces that display search results, Dumais and
Chen [2001] report that all interfaces that group results into categories are more effective
than conventional interfaces that display results as a list. They also conclude that the best
performance occurs when both category names and individual page titles and summaries
are presented. We closely follow these recommendations for the two categorization
systems we study (PCAT and CAT). In recent work, Käki [2005] also finds that result
categorization is helpful when the search engine fails to provide relevant results at the top
of the list.
2.2.3 Representing Context Using Taxonomy
In our approach, we map user interests to categories in the ODP taxonomy. Figure
6 shows a portion of the ODP taxonomy in which Computers is a depth-one category, and
C++ and Java are categories at depth four. We refer to Computers/Programming/
Languages as the parent category of category C++ or Java. Hence various concepts
(categories) are related through a hierarchy in the taxonomy. Currently, the ODP is a
manually edited directory of 4.6 million URLs that have been categorized into 787,774
categories by 68,983 human editors. The ODP taxonomy has been applied to
personalization of Web search in some prior studies [Pitkow et al. 2002; Gauch et al.
2003; Liu et al. 2004; Chirita et al. 2005].
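For illustration, such category paths can be manipulated directly as strings; the following sketch is ours and the helper names are not part of ODP:

```python
# Minimal helpers for working with ODP-style category paths as in Figure 6.
# The function names are illustrative, not part of the ODP data format.

def depth(category: str) -> int:
    """Depth of a category, counted from the root (a depth-one category
    such as Computers has a single path component)."""
    return len(category.split("/"))

def parent(category: str) -> str:
    """Parent category, obtained by dropping the last path component."""
    return "/".join(category.split("/")[:-1])

cpp = "Computers/Programming/Languages/C++"
assert depth(cpp) == 4
assert parent(cpp) == "Computers/Programming/Languages"
assert depth("Computers") == 1
```

Representing categories as plain paths also makes the later truncation step (resolving a depth-four category to its depth-three parent) a one-line operation.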
For example, the Outride personalized search system (acquired by Google)
performs both query modification and result processing. It builds a user profile (context)
on the basis of a set of personal favorite links, the user’s last 1000 unique clicks, and the
ODP taxonomy, then modifies queries according to that profile. It also reranks search
results on the basis of usage and the user profile. The main focus of the Outride system is
capturing a user’s profile through his or her search and browsing behaviors [Pitkow et al.
2002]. The OBIWAN system [Gauch et al. 2003] automatically learns a user’s interest
profile from his or her browsing history and represents those interests with concepts in
the Magellan taxonomy. It maps each visited Web page into the five taxonomy concepts with the
Figure 6. ODP Taxonomy
highest similarities; thus, the user profile consists of accumulated categories generated
over a collection of visited pages. Liu et al. [2004] also build a user profile that consists
of previous search query terms and five words that surround each query term in each
Web page clicked after the query is issued. The user profile then is used to map the user’s
search query onto three depth-two ODP categories. In contrast, Chirita et al. [2005] use a
system in which a user manually selects ODP categories as entries in his or her profile.
When reranking search results, they measure the similarity between a search result and
the user profile using the node distance in a taxonomy concept tree, which means the
search result must associate with an ODP category. A difficulty in their study is that
many parameters’ values have been set without explanations. The current Google
personalized search7 also explicitly asks users to specify their interests through the
Google directory.
Similar to Gauch et al. [2003], we represent user interests with taxonomy
concepts, but we do not need to collect browsing history. Unlike Liu et al. [2004], we do
not need to gather previous search history, such as search queries and clicked pages, or
know the ODP categories corresponding to the clicked pages. Whereas Gauch et al.
[2003] map a visited page onto five ODP categories and Liu et al. [2004] map a search
query onto three categories, we automatically map a user interest onto an ODP category.
A difference between Chirita et al. [2005] and our approach is that when mapping a
user’s interest onto a taxonomy concept, we employ text, that is, page titles and
summaries associated with the concept in taxonomy, while they use the taxonomy
category title and its position in the concept tree when computing the tree-node distance.
7 http://labs.google.com/personalized.
Also, in contrast to UCAIR [Shen et al. 2005b] that uses contextual information in the
current session (short-term context) to personalize search, our approach personalizes
search according to a user’s long-term interests, which may be extracted from his or her
resume.
Haveliwala [2002] and Jeh and Widom [2003] extend the PageRank algorithm
[Brin and Page 1998] to generate personalized ranks. Using 16 depth-one categories in
ODP, Haveliwala [2002] computes a set of topic-sensitive PageRank scores. The original
PageRank is a global measure of the query- or topic-insensitive popularity of Web pages
measured solely by a linkage graph derived from a large part of the Web. Haveliwala’s
experiments indicate that, compared with the original PageRank, a topic-sensitive
PageRank achieves greater precision in top-ten search results. Topic-sensitive PageRank
also can be used for personalization after a user’s interests have been mapped onto
appropriate depth-one categories of the ODP, which can be achieved through our
proposed mapping framework. Jeh and Widom [2003] present a scalable personalized
PageRank method in which they identify a linear relationship between basis vectors and
the corresponding personalized PageRank vectors. At query time, their method constructs
an approximation to the personalized PageRank vector from the precomputed basis
vectors.
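Neither implementation is described in code in these papers; the following is only a generic power-iteration sketch of PageRank with a teleport (personalization) vector, the core idea behind the topic-sensitive variant. The graph, damping factor, and function name are illustrative:

```python
# Generic sketch of PageRank with a personalization (teleport) vector v:
# r = d * M r + (1 - d) * v, iterated to a fixed point. In the topic-sensitive
# formulation, v places mass only on pages in a chosen topic (e.g., an ODP
# depth-one category). Illustration only, not Haveliwala's or Jeh and
# Widom's actual code.

def personalized_pagerank(links, v, d=0.85, iters=50):
    """links: dict node -> list of out-neighbors; v: teleport distribution."""
    nodes = list(links)
    r = {n: v.get(n, 0.0) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) * v.get(n, 0.0) for n in nodes}
        for n, outs in links.items():
            if not outs:                 # dangling node: redistribute via v
                for m in nodes:
                    nxt[m] += d * r[n] * v.get(m, 0.0)
            else:
                share = d * r[n] / len(outs)
                for m in outs:
                    nxt[m] += share
        r = nxt
    return r

graph = {"a": ["b"], "b": ["a", "c"], "c": ["a"]}
topic_v = {"a": 1.0}                     # teleport only to topic page "a"
ranks = personalized_pagerank(graph, topic_v)
```

With the teleport mass concentrated on "a", the scores remain a probability distribution, and pages close to the topic page receive higher rank than distant ones.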
2.2.4 Taxonomy of Web Activities
We study the performance of the three systems (described in Section 2.1) by
considering different types of Web activities. Sellen et al. [2002] categorize Web
activities into six categories: Finding (locate something specific), Information Gathering
(answer a set of questions; less specific than Finding), Browsing (visit sites without
explicit goals), Transacting (execute a transaction), Communicating (participate in chat
rooms or discussion groups), and Housekeeping (check the accuracy and functionality of
Web resources). As Craswell et al. [2001] define a Site Finding task specifically as "one
where the user wants to find a particular site, and their query names the site," we consider
it a type of Finding task. It should be noted that some Web activities, especially
Information Gathering, can involve several searches. On the basis of the intent behind
Web queries, Broder [2002] classifies Web searches into three classes: Navigational
(reach a particular site), Informational (acquire information from one or more Web
pages), and Transactional (perform some Web-mediated activities). As the taxonomy of
search activities suggested by Sellen et al. [2002] is broader than that of Broder [2002],
in this article we choose to study the two major types of activities from Sellen et al.
[2002].
2.2.5 Text Categorization
In our study, CAT and PCAT systems employ text classifiers to categorize search
results. Text categorization (TC) is a supervised learning task that classifies new
documents into a set of predefined categories [Yang and Liu 1999]. As a joint discipline
of machine learning and IR, TC has been studied extensively, and many different
classification algorithms (classifiers) have been introduced and tested, including the
Rocchio method, naïve Bayes, decision tree, neural networks, and support vector
machines [Sebastiani 2002]. A standard information retrieval metric, cosine similarity
[Salton and McGill 1986], computes the cosine angle between vector representations of
two text fragments or documents. In TC, a document can be assigned to the category with
the highest similarity score. Due to its simplicity and effectiveness, cosine similarity has
been used by many studies for TC [e.g., Yang and Liu 1999; Sugiyama et al. 2004; Liu et
al. 2004].
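As a toy illustration (our own sketch, not code from the studies cited above), this cosine-based assignment can be written in a few lines:

```python
import math
from collections import Counter

# Toy sketch: assign a document to the category profile with the highest
# cosine similarity between term-frequency vectors. The profiles below are
# invented examples, not real ODP category profiles.

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def classify(doc: str, profiles: dict) -> str:
    d = Counter(doc.lower().split())
    return max(profiles,
               key=lambda c: cosine(d, Counter(profiles[c].lower().split())))

profiles = {
    "Java": "java virtual machine class bytecode sun",
    "C++": "c++ template pointer compiler stl",
}
assert classify("sun java bytecode tutorial", profiles) == "Java"
```

The same similarity function reappears in Steps 2 and 3 of our approach, where the compared vectors are a user interest and an ODP category profile.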
In summary, to generate user profiles for personalized search, previous studies
have asked users for explicit feedback, such as ratings and preferences, or collected
implicit feedback, such as search and browsing history. However, users are unwilling to
provide explicit feedback even when they anticipate a long-run benefit [Carroll and
Rosson 1987]. Implicit feedback has shown promising results for personalizing search
using short-term context [Leroy et al. 2003; Shen et al. 2005b]. However, generating user
profiles for long-term context through implicit feedback will take time and may raise
privacy concerns. In addition, a user profile generated from implicit feedback may
contain noise because the user preferences have been estimated from behaviors and not
explicitly specified. In our approach, two user-related inputs, a search query and the user’s
professional interests and skills, are explicitly given to a system, so some prior work
[Leroy et al. 2003; Gauch et al. 2003; Liu et al. 2004; Sugiyama et al. 2004; Kraft et al.
2005] that relies on modeling user interests through searching or browsing behavior is not
readily applicable.
CHAPTER 3
OUR APPROACH
Our approach begins with the assumption that some user interests are known and
therefore is well suited for a workplace setting in which employees’ resumes often are
maintained in a digital form or information about users’ professional interests and skills
is stored in a database. An IE tool or database queries can extract such information as
input to complement the search query, search engine, and contents of the ODP taxonomy.
However, we do not include such an IE program in this study and assume instead that the
interests have already been given. Our interest-category mapping framework tries to
automatically identify an ODP category associated with each of the given user interests.
Then our system uses URLs organized under those categories as training examples to
classify search results into various user interests at query time. We expect the result
categorization to help the user quickly focus on results of interest and decrease total time
spent in searching. The result categorization may also lead to the discovery of
serendipitous connections between the concepts being searched and the user’s other
interests. This form of personalization therefore should reduce search effort and possibly
provide interesting and useful resources the user would not notice otherwise. We focus on
work-related search performance, but our approach could be easily extended to include
personal interests as well. We illustrate a process view of our proposed approach in
Figure 7 and present our approach in five steps. Steps 3 and 4 cover the mapping
framework.
3.1 Step 1: Obtaining an Interest Profile
Step 1 (Figure 7) pertains to how the user interests can be extracted from a
resume. Our study assumes that user interests are available to our personalized search
system in the form of a set of words and phrases, which we call a user’s interest profile.
3.2 Step 2: Generating Category Profiles
As we explained previously, ODP is a manually edited Web directory with
millions of URLs placed under different categories. Each ODP category contains URLs
that point to external Web pages that human editors consider relevant to the category.
Figure 7. Process View of Proposed Approach
Those URLs are accompanied by manually composed titles and summaries that we
believe accurately represent the corresponding Web page content. The category profile of
an ODP category thus is built by concatenating the titles and summaries of the URLs
listed under the category. The constructed category profiles provide a solution to the
cold-start problem, which arises from the difficulty of creating a profile for a new user
from scratch [Maltz and Ehrlich 1995], and they later serve to categorize the search
results. Gauch et al. [2003], Menczer et al. [2004], and Srinivasan et al. [2005] use
similar concatenation to build topic profiles. In our study, we combine up to 30 pairs of
manually composed titles and summaries of URL links under an ODP category as the
category profile.8 In support of this approach, Shen et al. [2004] report that classification
using manually composed summaries in the LookSmart Web directory achieves
higher accuracy than the use of the content of Web pages. For building the category
profile, we pick the first 30 URLs based on the sequence in which they are provided by
8 A category profile does not include titles or summaries of its child (subcategory) URLs.
ODP. We note that ODP can have more than 30 URLs listed under a category. In order to
use similar amounts of information for creating profiles for different ODP categories, we
only use the titles and summaries of the first 30 URLs. When generating profiles for
categories in Magellan taxonomy, Gauch et al. [2003] show that a number of documents
between 5 and 60 provide reasonably accurate classification.
At depth-one, ODP contains 17 categories (for a depth-one category, Computers,
see Figure 6). We select five of these (Business, Computers, Games, Reference, and
Science) that are likely to be relevant to our subjects and their interests. These five broad
categories comprise a total of 8,257 categories between depths one and four. We generate
category profiles by removing stop words and applying Porter stemming9 [Porter 1980].
We also filter out any terms that appear only once in a profile to avoid noise and remove
any profiles that contain fewer than two terms. Finally, the category profile is represented
as a term vector [Salton and McGill 1986] with term frequencies (tf) as weights. Shen et
al. [2004] also use a tf-based weighting scheme, representing a Web page by its manually
composed summary in the LookSmart Web directory.
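A minimal sketch of this profile construction follows; the stop-word list is abbreviated and Porter stemming is omitted for brevity, so it only approximates the actual preprocessing:

```python
from collections import Counter

# Sketch of Step 2: build a tf vector for one ODP category from the titles
# and summaries of its first 30 URLs. The stop-word list is abbreviated, and
# Porter stemming (used in the dissertation) is omitted here.

STOP_WORDS = {"the", "a", "an", "of", "and", "to", "for", "on"}

def build_category_profile(entries, max_urls=30):
    """entries: list of (title, summary) pairs for URLs under one category.
    Returns a term-frequency vector, or None if the profile is too small."""
    text = " ".join(t + " " + s for t, s in entries[:max_urls])
    terms = [w for w in text.lower().split() if w not in STOP_WORDS]
    tf = Counter(terms)
    tf = Counter({t: f for t, f in tf.items() if f > 1})  # drop singletons (noise)
    return tf if len(tf) >= 2 else None                   # drop tiny profiles

entries = [("Java Tutorial", "a tutorial on the Java language"),
           ("Java FAQ", "answers to common Java language questions")]
profile = build_category_profile(entries)
```

On these two invented entries, the resulting profile keeps only terms occurring at least twice ("java", "tutorial", "language"), mirroring the noise-filtering rule described above.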
3.3 Step 3: Mapping Interests to ODP Categories
Next, we need a framework to map a user’s interests onto appropriate ODP
categories. The framework then can identify category profiles for building text classifiers
that correspond to the user’s interests. Some prior studies [Pitkow et al. 2002; Liu et al.
2004] and the existing Google personalized search use ODP categories with a few
hundred categories up to depth two, but for our study, categories up to depth two may
9 http://www.dcs.gla.ac.uk/idom/ir_resources/linguistic_utils/porter.java.
lack sufficient specificity. For example, Programming, a depth-two category, is too broad
to map a user interest in specific programming languages such as C++, Java, or Perl.
Therefore, we map user interests to ODP categories up to depth four. As we mentioned in
Step 2, a total of 8,257 such categories can be used for interest mapping. We employ four
different mapping methods to evaluate the mapping performance by testing and
comparing them individually as well as in different combinations. When generating an
output category, a mapping method includes the parent category of the mapped category;
for example, if the mapped category is C++, the output will be Computers/Programming/
Languages/C++.
3.3.1 Mapping Method 1 (m1-category-label):
Simple Term Match
The first method uses a string comparison to find a match between an interest and
the label of the category in ODP. If an interest is the same as a category label, the
category is considered a match to the interest. Plural forms of terms are transformed to
their singular forms by a software tool from the National Library of Medicine.10
Therefore, the interest of search engine is matched with the ODP category Search
Engines, and the output category is Computers/Internet/Searching/Search Engines.
10 http://umlslex.nlm.nih.gov/nlsRepository/nlp/doc/userDoc/index.html.
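A minimal sketch of this exact-match strategy, with a naive singularization stub standing in for the NLM lexical tool:

```python
# Sketch of m1-category-label: exact string match between an interest and a
# category label after singularization. The real system uses an NLM lexical
# tool; singularize() below is a crude stand-in for illustration only.

def singularize(word: str) -> str:
    return word[:-1] if word.endswith("s") and len(word) > 3 else word

def normalize(phrase: str) -> str:
    return " ".join(singularize(w) for w in phrase.lower().split())

def m1_matches(interest: str, categories: list) -> list:
    """Return every category whose label (last path component) matches."""
    target = normalize(interest)
    return [c for c in categories
            if normalize(c.split("/")[-1]) == target]

cats = ["Computers/Internet/Searching/Search Engines",
        "Computers/Programming/Languages/Java"]
assert m1_matches("search engine", cats) == [cats[0]]
```

Because a label can recur in several places in the taxonomy, the function returns a list; Step 4 later resolves such multiple matches.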
3.3.2 Mapping Method 2 (m2-category-profile):
Most Similar Category Profile
The cosine similarities between an interest and each of the category profiles are
computed, and the ODP category with the highest similarity is selected as the
output.
3.3.3 Mapping Method 3 (m3-category-profile-noun): Most Similar
Category Profile While Augmenting Interest
With Potentially Related Nouns
Methods m1-category-label and m2-category-profile will fail if the category labels and
profiles do not contain any of the words that form a given interest, so it may be
worthwhile to augment the interest concept by adding a few semantically similar or
related terms. According to Harris [1985], terms in a language do not occur arbitrarily but
appear at a certain position relative to other terms. On the basis of the concept of
cooccurrence, Riloff and Shepherd [1997] present a corpus-based bootstrapping
algorithm that starts with a few given seed words that belong to a specific domain and
discovers more domain-specific semantically-related lexicons from a corpus. Similar to
query expansion, it is desirable to augment the original interest with a few semantically
similar or related terms.
For m3-category-profile-noun, one of our programs conducts a search on Google
using an interest as a search query and finds the N nouns that most frequently cooccur in
the top ten search results (page titles and snippets). We find cooccurring nouns because
most terms in interest profiles are nouns (for terms from some sample user interests, see
Table 1). Terms semantically similar or related to those of the original interest thus can
be obtained without having to ask a user for input such as feedback or a corpus. A noun is
identified by looking up the word in a lexical reference system,11 WordNet [Miller et al.
1990], to determine whether the word has the part-of-speech tag of noun. The similarities
between a concatenated text (a combination of the interest and N most frequently
cooccurring nouns) and each of the category profiles then are computed to determine the
category with the highest similarity as the output of this method.
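The noun-harvesting step can be sketched as follows; the fixed NOUNS set stands in for the WordNet part-of-speech lookup, and the snippets stand in for live results from the top ten Google hits:

```python
from collections import Counter

# Sketch of m3's augmentation step: find the N nouns that most frequently
# cooccur with the interest in result titles/snippets. NOUNS is a stand-in
# for a WordNet part-of-speech lookup; the snippets are invented examples.

NOUNS = {"tutorial", "sun", "language", "class", "machine"}

def top_cooccurring_nouns(snippets, n=2):
    counts = Counter(w for s in snippets
                     for w in s.lower().split() if w in NOUNS)
    return [w for w, _ in counts.most_common(n)]

snippets = ["Java tutorial from Sun", "The Java language tutorial",
            "Sun Java class tutorial"]
augmented = "java " + " ".join(top_cooccurring_nouns(snippets))
```

The augmented string ("java tutorial sun" for these snippets) then replaces the bare interest when computing similarities against the category profiles, exactly as in m2.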
3.3.4 Mapping Method 4 (m4-category-profile-np): Most Similar
Category Profile While Augmenting Interest With
Potentially Related Noun Phrases
Although similar to m3-category-profile-noun, this method finds the M most
frequently cooccurring noun phrases on the first result page from up to ten search results.
We developed a shallow parser program to parse sentences in the search results into NPs
(noun phrases), VPs (verb phrases), and PPs (prepositional phrases), where a NP can
appear in different forms, such as a single noun, a concatenation of multiple nouns, an
article followed by a noun, or any number of adjectives followed by a noun.
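As an illustration only, a toy chunker with a tiny hand-made lexicon can recognize the NP forms listed above (the actual shallow parser also handles VPs and PPs and uses real part-of-speech tagging):

```python
import re
from collections import Counter

# Toy stand-in for the shallow parser: extract NPs of the forms mentioned in
# the text (optional article, any adjectives, one or more nouns), using a
# tiny hand-made part-of-speech lexicon instead of a real tagger.

LEXICON = {"the": "DET", "a": "DET", "general": "ADJ", "fast": "ADJ",
           "c": "N", "compiler": "N", "language": "N", "code": "N"}

NP_PATTERN = re.compile(r"(DET )?(ADJ )*(N )+")

def noun_phrases(sentence):
    words = sentence.lower().split()
    tags = " ".join(LEXICON.get(w, "X") for w in words) + " "
    phrases = []
    for m in NP_PATTERN.finditer(tags):
        start = tags[:m.start()].count(" ")   # word index where the NP begins
        end = tags[:m.end()].count(" ")       # word index just past the NP
        phrases.append(" ".join(words[start:end]))
    return phrases

nps = noun_phrases("the general c compiler generates fast code")
```

On this invented sentence, the pattern yields "the general c compiler" (article + adjective + nouns) and "fast code" (adjective + noun); their frequencies across snippets would then be counted as in m3.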
Table 1 lists some examples of frequently cooccurring nouns and NPs identified
by m3-category-profile-noun and m4-category-profile-np. Certain single-noun NPs
generated by m4-category-profile-np differ from the individual nouns identified by
m3-category-profile-noun because a noun identified by m3-category-profile-noun may
combine with other terms to form a phrase and therefore not appear as a single noun in
the result generated by m4-category-profile-np.
11 http://wordnet.princeton.edu/.
3.4 Step 4: Resolving Mapped Categories
For a given interest, each mapping method in Step 3 may generate a different
mapped ODP category, and m1-category-label may generate multiple ODP categories for
the same interest because the same category label sometimes is repeated in the ODP
taxonomy. For example, the category Databases appears in several different places in the
hierarchy of the taxonomy, such as Computers/Programming/Databases and
Computers/Programming/Internet/Databases.
Using 56 professional interests in the computer domain that were manually
extracted from several resumes of professionals collected from ODP (eight interests are
shown in the first column of Table 1), Table 2 compares the performances of each
individual mapping method. After verification by a domain expert, m1-category-label
generated mapped categories for 29 of 56 interests, and only two did not contain the right
category. We note that m1-category-label has much higher precision than the other three
methods, but it generates the fewest mapped interests. Machine learning research [e.g.,
Dietterich 1997] has shown that an ensemble of classifiers can outperform each classifier
in that ensemble. Since the mapping methods can be viewed as classification techniques
that classify interests into ODP categories, a combination of the mapping methods may
outperform any one method.
Table 1.
Frequently Cooccurring Nouns and NPs

Domain    Interest                      Two cooccurring nouns    Cooccurring NP
Computer  C++                           programme, resource      general c
          IBM DB2                       database, software       database
          Java                          tutorial, sun            sun
          Machine Learning              information, game        ai topic
          Natural Language Processing   intelligence, speech     intelligence
          Object Oriented Programming   concept, link            data
          Text Mining                   information, data        text mine tool
          UML                           model, tool              acceptance *
          Web Site Design               html, development        library resource web development
Finance   Bonds                         saving, rate             saving bond
          Day Trading                   resource, article        book
          Derivatives                   trade, international     gold
          Mutual Funds                  news, stock              account
          Offshore Banking              company, formation       bank account
          Risk Management               open, source *           software risk evaluation *
          Stocks Exchange               trade, information       official site
          Technical Analysis            market, chart            market pullback
          Trading Cost                  service, cap             product
* Some cooccurring nouns or NPs may not be semantically similar or related.
Table 2.
Individual Mapping Method Comparison (Based on 56 Computer Interests)

Mapping method                              m1      m2      m3      m4
Number of correctly mapped interests        27      29      25      19
Number of incorrectly mapped interests       2      25      30      36
Number of total mapped interests            29      54      55      55
Precision (correct / total mapped)       93.0%   53.7%   45.5%   34.5%
Recall (correct / 56)                    48.2%   51.8%   44.6%   33.9%
F1                                       63.5%   52.7%   45.0%   34.2%
Figure 8 lists the detailed pseudocode of the procedure used to automatically
resolve a final set of categories for an interest profile with the four mapping methods. M1
represents the set of mapped categories generated by m1-category-label, as do M2,
M3, and M4. Because of its high precision, we prioritize the category/categories
generated by m1-category-label as shown in Step (2); if a category generated by m1-
category-label is the same as, or a parent category of, a category generated by any other
method, we include the category generated by m1-category-label in the list of final
resolved categories. Because m1-category-label uses an exact match strategy, it does not
always generate a category for a given interest. In Step (3), if methods m2-category-
profile, m3-category-profile-noun, and m4-category-profile-np generate the same mapped
category, we select that category, irrespective of whether m1-category-label generates
one. Steps (2) and (3) attempt to produce a category for an interest by considering
overlapping categories from different methods. If no such overlap is found, we look for
overlapping categories generated for different interests in Step (6) because if more than
one interest is mapped to the same category, it is likely to be of interest. In Step (8), we
try to represent all remaining categories at a depth of three or less by truncating the
category at depth four, thereby hoping to find overlapping categories through the parent
categories. Step (9) is similar to Step (5) except that all remaining categories are at the
depth of three or less.
(1) For each interest i in the interest profile
      Given i, the four mapping methods generate M1, M2, M3, and M4
(2)   For each category c in M1
        If c is the same as, or a parent of, a category in M2, M3, or M4,
          add c to the list of final categories, then go to Step (1)
      End For
(3)   If M2, M3, and M4 contain the same category c,
        add c to the list of final categories, then go to Step (1)
(4)   Put any category c in M1, M2, M3, and M4 into a list of candidate categories
    End For
(5) For each category c in candidate categories
      Count the frequency of c
    End For
(6) For each depth-four category c in candidate categories
      If frequency of c >= threshold, add c to the list of final categories
      (We chose the threshold equal to the number of mapping methods – 1, which
      was three in our tests because we used four mapping methods. A frequency of
      three or larger means the candidate category overlaps between at least two
      different interests, so we choose the overlapping candidate category to
      represent these interests.)
    End For
(7) Remove all candidate categories for the interests mapped in Step (6)
(8) Resolve all remaining depth-four categories to depth three by truncating the
    category at depth four. For example, after truncating to depth three,
    reference/knowledge management/publications/articles is resolved as
    reference/knowledge management/publications
(9) For each category c in candidate categories
      Count the frequency of c
    End For
(10) For each depth-three category c in candidate categories
       If frequency of c >= threshold, add c to the list of final categories
     End For

Figure 8. Category Resolving Procedures
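A simplified Python rendering of Steps (1) through (6) of the procedure, with categories represented as path strings (our own condensation of Figure 8; Steps (7)-(10) are omitted):

```python
# Simplified sketch of the resolving procedure in Figure 8, Steps (1)-(6).
# Each mapping result Mi is a set of category path strings.

def is_parent(p, c):
    return c.startswith(p + "/")

def resolve(interest_maps, threshold=3):
    """interest_maps: list of (M1, M2, M3, M4) tuples, one per interest."""
    final, candidates = [], []
    for m1, m2, m3, m4 in interest_maps:
        others = m2 | m3 | m4
        hit = next((c for c in m1
                    if c in others or any(is_parent(c, o) for o in others)),
                   None)
        if hit:                        # Step (2): trust m1 when others agree
            final.append(hit)
        elif m2 & m3 & m4:             # Step (3): m2, m3, m4 unanimous
            final.append(next(iter(m2 & m3 & m4)))
        else:                          # Step (4): defer to frequency counting
            candidates.extend(m1 | m2 | m3 | m4)
    counts = {}                        # Steps (5)-(6): cross-interest overlap
    for c in candidates:
        counts[c] = counts.get(c, 0) + 1
    final.extend(c for c, n in counts.items() if n >= threshold)
    return final

maps = [({"Computers/Programming/Languages"},
         {"Computers/Programming/Languages/C++"}, set(), set())]
resolved = resolve(maps)
```

In this invented example, m1's category is a parent of m2's, so Step (2) accepts the m1 category, matching the priority the procedure gives to m1-category-label.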
To determine appropriate values for N (number of nouns) and M (number of NPs)
for m3-category-profile-noun and m4-category-profile-np, we tested different
combinations of values ranging from 1 to 3 with the 56 computer interests. According to
the number of correctly mapped interests, choosing the two most frequently cooccurring
nouns and one most frequently cooccurring NP offers the best mapping result (see Table
1 for some examples of identified nouns and NPs). With the 56 interests, Table 3
compares the number of correctly mapped interests when different mapping methods are
combined. Using all four mapping methods provides the best results; 39 of the 56
interests were correctly mapped onto ODP categories. The resolving procedures in Figure
8 thus are based on four mapping methods. When using three methods, we adjusted the
procedures accordingly, such as setting the thresholds in Steps (6) and (10) to two instead
of three.
Table 4 lists mapped and resolved categories for some interests in computer and
finance domains.
After the automatic resolving procedures, mapped categories for some interests
may not be resolved because different mapping methods generate different categories.
Table 3.
Comparison of Combined Mapping Methods

Combination of mapping methods        m1+m2+m3  m1+m2+m4  m1+m3+m4  m1+m2+m3+m4
Number of correctly mapped interests      34        35        32         39
Precision*                             60.7%     62.5%     57.1%      69.6%
* Recall and F1 were the same as precision because the number of mapped interests was 56.
Unresolved interests can be handled by having the user manually map them onto the ODP
taxonomy. An alternative approach could use an unresolved user interest as a query to a
search engine (in a manner similar to m3-category-profile-noun and m4-category-profile-
np), then combine the search results, such as page titles and snippets, to compose an ad
hoc category profile for the interest. Such a profile could flexibly represent any interest
and avoid the limitation that a taxonomy contains only a finite set of categories. It would
be worthwhile to examine the effectiveness of such ad hoc category profiles in a future
study. In this article, user interests are fully mapped and resolved to ODP categories.
These four steps are performed just once for each user, possibly during a software
installation phase, unless the user’s interest profile changes. To reflect such a change in
interests, our system can automatically update the mapping periodically or allow a user to
request an update from the system. As shown in Figure 7, the first four steps can be
performed in a client-side server, such as a machine on the organization’s intranet, and
the category profiles can be shared by each user’s machine.
Finally, user interests, even long-term professional ones, are dynamic in nature. In
the future, we will explore more techniques to learn about and fine-tune interest mapping
and handle the dynamics of user interests.
3.5 Step 5: Categorizing Search Results
When a user submits a query, our system obtains search results from Google and
downloads the content of up to the top-50 results which correspond to the first five result
pages.

Table 4.
Resolved Categories

Domain    Interest                       ODP category
Computer  C++                            computers/programming/languages/c++
Computer  IBM DB2                        computers/software/databases/ibm db2
Computer  Java                           computers/programming/languages/java
Computer  Machine Learning               computers/artificial intelligence/machine learning
Computer  Natural Language Processing    computers/artificial intelligence/natural language
Computer  Object Oriented Programming    computers/software/object-oriented
Computer  Text Mining                    reference/knowledge management/knowledge discovery/text mining
Computer  UML                            computers/software/data administration *
Computer  Web Site Design                computers/internet/web design and development
Finance   Bonds                          business/investing/stocks and bonds/bonds
Finance   Day Trading                    business/investing/day trading
Finance   Derivatives                    business/investing/derivatives
Finance   Mutual Funds                   business/investing/mutual funds
Finance   Offshore Banking               business/financial services/offshore services
Finance   Risk Management                business/management/software *
Finance   Stocks Exchange                business/investing/stocks and bonds/exchanges
Finance   Technical Analysis             business/investing/research and analysis/technical analysis
Finance   Trading Cost                   business/investing/derivatives/brokerages
* Because the mapping and resolving steps are automatic, some resolved categories are erroneous.

The average number of result pages viewed by a typical user for a query is 2.35
[Jansen et al. 2000], and a more recent study [Jansen et al. 2005] reports that about
85–92% of users view no more than two result pages. Hence, our system covers
approximately double the number of results normally viewed by a search engine user. On
the basis of page content, the system categorizes the results into various user interests. In
PCAT, we employ a user’s original interests as class labels rather than the ODP category
labels because the mapped and resolved ODP categories are associated with user
interests. Therefore, the use of ODP (or any other Web directory) is transparent to the
user. A Web page that corresponds to a search result is categorized by (1) computing the
cosine similarity between the page content and each of the category profiles of the
mapped and resolved ODP categories that correspond to user interests and (2) assigning
the page to the category with the maximum similarity if the similarity is greater than a
threshold. If a search result does not fall into any of the resolved user interests, it is
assigned to the Other category.
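The two-step assignment above can be sketched as follows. The profile representation is simplified to raw term counts, and the toy profiles are illustrative; the actual system builds its profiles from ODP category data and applies stemming and stop-word removal first.

```python
import math
from collections import Counter

SIM_THRESHOLD = 0.1  # the threshold value chosen in Section 3.6

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (dicts)."""
    dot = sum(w * b.get(t, 0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def categorize(page_terms, interest_profiles):
    """Assign a result page to the interest whose category profile is
    most similar, or to 'Other' if no similarity exceeds the threshold."""
    best, best_sim = "Other", SIM_THRESHOLD
    for interest, profile in interest_profiles.items():
        sim = cosine(page_terms, profile)
        if sim > best_sim:
            best, best_sim = interest, sim
    return best

# Toy profiles standing in for real category profiles:
profiles = {
    "Java": Counter("java class jvm bytecode java".split()),
    "Bonds": Counter("bond yield coupon treasury".split()),
}
page = Counter("java tutorial class example".split())
```

Here `categorize(page, profiles)` returns "Java", while a page whose terms overlap with no profile falls into the Other category.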
The focus of our study is to explore the use of PCAT, an implementation based on
the proposed approach, and compare it with LIST and CAT. With regard to interest
mapping and result categorization (classification problems), we choose the simple and
effective cosine similarity instead of comparing different classification algorithms and
selecting the best one.
3.6 Implementation
We developed three search systems12 with different interfaces to display search
results, and the online searching portion was implemented as a wrapper on Google search
engine using the Google Web API.13 Although the current implementation of our
approach uses a single search engine (Google), following the metasearch approach
[Dreilinger and Howe 1997], it can be extended to handle results from multiple engines.
Because Google has become the most popular search engine,14 we use Google’s search
12 In experiments, we named the systems A, B, or C; in this article, we call them PCAT, LIST, or CAT, respectively.
13 http://www.google.com/apis/.
14 http://www.comscore.com/press/release.asp?press=873.
results to feed the three systems. That is, the systems have the same set of search results
for the same query; recall that LIST can be considered very similar to Google. For
simplicity, we limit the search results in each system to Web pages in HTML format. In
addition, for a given query, each of the systems retrieves up to 50 search results.
PCAT and CAT download the contents of Web pages that correspond to search
results and categorize them according to user interests and ODP categories, respectively.
For faster processing, the systems use multithreading for simultaneous HTTP connections
and download up to 10KB of text for each page. It took our program about five seconds
to fetch 50 pages. We note that our page-fetching program is not an industry-strength
module, and much better concurrent download speeds have been reported in other works
[Hafri and Djeraba 2004; Najork and Heydon 2001]. Hence, we feel that our page-
fetching time can be greatly reduced in a production implementation. After fetching the
pages, the systems remove stop words and perform word stemming before computing the
cosine similarity between each page content and a category profile. Each Web page is
assigned to the category (and its associated interest for PCAT) with the greatest cosine
similarity. However, if the similarity is not greater than a similarity threshold, the page is
assigned to the Other category. We determined the similarity threshold by testing query
terms from “irrelevant” domains (not relevant to any of the user’s interests). For example,
given that our user interests are related to computer and finance, we tested ten irrelevant
queries, such as NFL, Seinfeld, allergy, and golden retriever. For these irrelevant queries,
when we set the threshold at 0.1, at least 90% (often 96% or higher) of retrieved results
were categorized under the Other category. Thus we chose 0.1 as our similarity threshold.
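The threshold calibration can be sketched roughly as follows. The maximum-similarity scores and candidate thresholds below are hypothetical, standing in for the results retrieved by the ten irrelevant queries.

```python
def fraction_other(max_similarities, threshold):
    """Fraction of retrieved results that would fall into 'Other', where
    each result is represented by its maximum cosine similarity to any
    interest profile (a result goes to Other when that maximum does not
    exceed the threshold). Illustrative sketch only."""
    return sum(1 for s in max_similarities if s <= threshold) / len(max_similarities)

# Hypothetical maximum similarities for one irrelevant query's results:
max_sims = [0.02, 0.05, 0.03, 0.11, 0.04, 0.06, 0.01, 0.08, 0.07, 0.02]

# Pick the smallest candidate threshold that routes at least 90% of
# irrelevant results into the Other category:
candidates = [0.05, 0.1, 0.2]
chosen = next(t for t in candidates if fraction_other(max_sims, t) >= 0.9)
```

With these hypothetical scores the procedure selects 0.1, mirroring the calibration described in the text.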
The time for classifying results according to user interests in PCAT is negligible (tens of
milliseconds). However, the time for CAT is three orders of magnitude greater than that
for PCAT because the number of potential categories for CAT is 8,547, whereas the
number of interests in PCAT is fewer than 8.
Figure 9 displays a sample output from PCAT for the query of regular expression.
Once a user logs in with his or her unique identification, PCAT displays a list of the
user’s interests on top of the GUI. After a query is issued, search results are categorized
into various interests and displayed in the result area, as shown in Figure 9. A number
next to the interest indicates how many search results are classified under that interest; if
there is no classified search result, the interest will not be displayed in the result area.
Under each interest (category), PCAT (CAT) shows no more than three results on the
main page. If more than three results occur under an interest or category, a More link
appears next to the number of results. (In Figure 9, there is a More link for the interest of
Java.) Upon clicking this link, the user sees all of the results under that interest in a new
window as shown in Figure 10.
Figure 9. Sample Output of PCAT. Category titles are user interests mapped and
resolved to ODP categories
Figure 11 displays a sample output of LIST for the same query of regular
expression and shows all search results in the result area as a page-by-page list. Clicking
a page number causes a result page with up to ten results to appear in the result area of
the same window. For the search task in Figure 11, the first relevant document is shown
as the sixth result on page 2 in LIST.
Figure 11. Sample Output of LIST
Figure 12 displays a sample output for CAT in which the category labels in the
result area are ODP category names sorted alphabetically such that output categories
under business are displayed before those under computers.
We now describe some of the features of the implemented systems that would not
appear in a production system but are meant only for experimental use. We predefined a
set of search tasks that specified what information, and how many Web pages, needed to
be found; the subjects used these tasks to conduct searches during the experiments.
(Section 4.2.2 describes the search tasks in more detail.)

Figure 12. Sample Output of CAT. Category labels are ODP category titles.

Each search result consists of a page title, snippet, URL, and a link called relevant15
next to the title. Except for the relevant link, the
items are the same as those found in typical search engines. A subject can click the
hyperlinked page title to open the page in a regular Web browser, such as Internet
Explorer. The subject determines whether a result is relevant to a search task by looking
at the page title, snippet, URL, and/or the content of the page.
Many of our search tasks require subjects to find one relevant Web page for a task,
but some require two. In Figure 9, the task requires finding two Web pages, which is also
indicated by the number 2 at the end of the task description. Once the user finds enough
relevant pages, he or she can click the Next button to proceed to the next task; clicking on
Next before enough relevant page(s) have been found prompts a warning message, which
allows the user to either give up or continue the current search task.
We record search time, or the time spent on a task, as the difference between the
time that the search results appear in the result area and the time that the user finds the
required number of relevant result(s).
15 When a user clicks on the relevant link, the corresponding search result is treated as the answer or solution for the current search task. This clicked result is considered as relevant, and is not necessarily the most relevant among all search results.
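The timing rule above can be sketched as follows; the class and method names are illustrative rather than taken from the actual implementation.

```python
import time

class TaskTimer:
    """Sketch of the search-time measurement: the clock starts when the
    results appear in the result area and stops once the required number
    of relevant results has been marked."""
    def __init__(self, required_relevant):
        self.required = required_relevant
        self.start = None
        self.marked = 0
        self.elapsed = None

    def results_displayed(self):
        # Called when the result area is populated for the current task.
        self.start = time.monotonic()

    def mark_relevant(self):
        # Called each time the subject clicks a relevant link.
        self.marked += 1
        if self.marked >= self.required and self.elapsed is None:
            self.elapsed = time.monotonic() - self.start
```

A task requiring two relevant pages would only record its elapsed time after the second relevant click.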
CHAPTER 4
EXPERIMENTS
We conducted two sets of controlled experiments to examine the effects of
personalization and categorization. In experiment I, we compare PCAT with LIST, that
is, a personalized system that uses categorization versus a system similar to a typical
search engine. Experiment II compares PCAT with CAT in order to study the difference
between personalization and nonpersonalization, given that categorization is common to
both systems. These experiments were designed to examine whether subjects’ mean log
search time16 for different types of search tasks and query lengths varied between the
compared systems. The metric evaluates the efficiency of each system because all three
systems return the same set of search results for the same query. Before experiment I, we
conducted a preliminary experiment comparing PCAT and LIST with several subjects
who did not later participate in either experiment I or II. The preliminary experiment
16 Mean log search time is the average log-transformed search time for a task across a group of subjects using the same system. We transformed the original search times (measured in seconds) with base 2 log to make the log search times closer to a normal distribution. In addition, taking the average makes the mean log search times more normally distributed.
helped us make decisions relating to experiment and system design. Next we introduce
our experiments I and II in detail.
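As a small illustration of the metric defined in footnote 16, with hypothetical per-subject search times for one task:

```python
import math

def mean_log_search_time(times_seconds):
    """Average of the base-2 log-transformed search times for one task
    across the subjects in a group, as defined in footnote 16."""
    return sum(math.log2(t) for t in times_seconds) / len(times_seconds)

# Hypothetical search times (in seconds) for one task in one group:
times = [16, 32, 64, 8]
m = mean_log_search_time(times)  # (4 + 5 + 6 + 3) / 4 = 4.5
```

The log transform compresses occasional very long search times, which is why the transformed values are closer to normally distributed than the raw seconds.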
4.1 Studied Domains and Domain Experts
Because we were interested in personalizing search according to a user’s
professional interests, we chose two representative professional domains, computer and
finance, that appear largely disjoint.
For the computer domain, two of the authors, who are researchers in the area of
information systems, served as the domain experts. Both experts also have industrial
experience related to computer science. For the finance domain, one expert has a
doctoral degree and the other has a master’s degree in finance.
4.2 Professional Interests, Search Tasks, and Query Length
4.2.1 Professional Interests (Interest Profiles)
For each domain, the two domain experts manually chose several interests and
skills that could be considered fundamental, which enabled us to form a generic interest
profile that would be shared by all subjects within the domain. Moreover, the
fundamental nature of these interests allows us to recruit more subjects, leading to greater
statistical significance in our results. By defining some fundamental skills in the
computer domain, such as programming language, operating system, database, and
applications, the two computer domain experts identified six professional interests:
algorithms, artificial intelligence, C++, Java, Oracle, and Unix. Similarly, the two finance
experts provided seven fundamental professional interests: bonds, corporate finance, day
trading, derivatives, investment banking, mutual funds, and stock exchange.
4.2.2 Search Tasks
The domain experts generated search tasks on the basis of the chosen interest
areas but also considered different types of tasks, that is, Finding and Information
Gathering. The content of those search tasks includes finding a software tool, locating a
person’s or organization’s homepage, finding pages to learn about a certain concept or
technique, collecting information from multiple pages, and so forth. Our domain experts
predefined 26 nondemo search tasks for each domain as well as 8 and 6 demo tasks for
the computer and finance domains, respectively. The demo tasks were similar, but not
identical, to the nondemo tasks and therefore offered subjects some familiarity with both
systems before they started to work on the nondemo tasks. Nondemo tasks are used in
post-experiment analysis, while demo tasks are not. All demo and nondemo search tasks
belong to the categories of Finding and Information Gathering [Sellen et al. 2002] as
discussed in Section 2.2.4, and within the finding tasks, we included some Site Finding
tasks [Craswell et al. 2001].
4.2.3 Query Length
Using different query lengths, we specified four types of queries for search tasks
in each domain:
(1) One-word query (e.g., jsp, underinvestment)
(2) Two-word query (e.g., neural network, security line)
(3) Three-word query (e.g., social network analysis)
(4) Free-form query, which had no limitations on the number of words used
For a given task a user was free to enter any query word(s) of his or her own
choice that conformed to the associated query-length requirement, and the user could
issue multiple queries for the same task. For example, Table 5 shows some sample search
tasks, types of search tasks, and their associated query lengths.
Table 6 lists the distributions of search tasks and their associated query lengths.
For each domain, we divided the 26 nondemo search tasks and demo tasks into two
groups such that the two groups have the same number of tasks and distribution of query
lengths. During each experiment, subjects searched for the first group of tasks using one
system, and the second group of tasks using the other.
Table 5.
Examples of Search Tasks, Types of Tasks, and Query Lengths

Domain    Search task                                                 Type of search task     Query length
Computer  You need an open source IDE (Integrated Development         Finding                 one-word
          Environment) for C++. Find a page that provides any
          details about such an IDE.
Computer  You need to provide a Web service to your clients. Find     Information Gathering   two-word
          two pages that describe Web services support using Java
          technology.
Finance   Find a portfolio management spreadsheet program.            Finding                 three-word
Finance   Find the homepage of New York Stock Exchange.               Site Finding            free-form
Table 6.
Distribution of Search Tasks and their Associated Query Lengths

Experiment  Domain\Query length  One-word  Two-word  Three-word  Free-form  Total tasks
I & II      Computer             6         6         4           10         26
            Finance              8         6         6           6          26
We chose these different query lengths for several reasons. First, numerous
studies show that users tend to submit short Web queries with an average length of two
words. A survey by the NEC Research Institute in Princeton reports that up to 70% of
users typically issue a query with one word in Web searches, and nearly half of the
Institute’s staff—who should be Web-savvy (knowledge workers and researchers)—fail
to define their searches precisely with query terms [Butler 2000]. By collecting search
histories for a two-month period from 16 faculty members across various disciplines at a
university, Käki [2005] found that the average query length was 2.1 words. Similarly,
Jansen et al. [1998] find through their analysis of transaction logs on Excite that, on
average, a query contains 2.35 words. In yet another study, Jansen et al. [2000] report that
the average length of a search query is 2.21 words. From their analysis of users’ logs in
the Encarta encyclopedia, Wen et al. [2002] report that the average length of Web queries
is less than 2 words.
Second, we chose different query lengths to simulate different types of Web
queries and examine how these different types affect system performance. A prior study
follows a similar approach; in comparing the IntelliZap system with four popular search
engines, Finkelstein et al. [2002] set the length of queries to one, two, and three words
and allow users to type in their own query terms.
Third, in practice, queries are often incomplete or may not incorporate enough
contextual information, which leads to many irrelevant results and/or relevant results that
do not appear at the top of the list. A user then has two obvious options: enter a different
query to start a new search session or go through the long result list page-by-page, both
of which consume time and effort. From a study with 33,000 respondents, Sullivan
[2000] finds that 76% of users employ the same search engine and engage in multiple
search sessions on the same topic. To investigate this problem of incomplete or vague
queries, we associate search tasks with different query lengths to simulate the real-world
problem of incomplete or vague queries. We believe that categorization will present
results in such a way to help disambiguate such queries. Unlike Leroy et al. [2003], who
extract extra query terms from users’ behaviors during consecutive searches, we do not
modify users’ queries but rather observe how a result-processing approach (personalized
categorization of search results) can improve search performance.
4.3 Subjects
Prior to the experiments, we sent emails to students in the business school and the
computer science department of our university, as well as to some professionals in the
computer industry, to solicit their participation. In these emails, we explicitly listed the
predefined interests and skills we expected potential subjects to have. We also asked
several questions, including the following two self-reported ones:
(1) When searching online for topics in the computer or finance domain, what do you
think of your search performance (with a search engine) in general?
(a) slow (b) normal (c) fast
(2) How many hours do you spend on online browsing and searching per week (not
limited to your major)?
(a) [0, 7) (b) [7, 14) (c) [14+)
We verified their responses to ensure each subject possessed the predefined skills
and interests. After the experiments we did not manually verify the correctness of
subject-selected relevant documents. However, in our preliminary experiment with
different subjects, we manually examined all of the relevant documents chosen by
subjects, and we confirmed that, on average, nearly 90% of their choices were correct.
We assume that, with sufficient background, the subjects were capable of identifying
the relevant pages. Because we used PCAT in both experiments, no subject from
experiment I participated in experiment II. We summarize some demographic
characteristics of the subjects in Tables 7 through 9.
To compare the two studied systems for each domain, we divided the subjects into
two groups, such that subjects in one group were as closely equivalent to the subjects in
the other as possible with respect to their self-reported search performance, weekly
browsing and searching time, and educational status. We computed the mean log search
time for a task by averaging the log search times for each group.
Table 7.
Educational Status of Subjects

Experiment  Domain\Status  Undergraduate  Graduate  Professional  Total
I           Computer       3              7         4             14
            Finance        4              16        0             20
II          Computer       3              11        2             16
            Finance        0              20        0             20
Table 8.
Self-reported Performance on Search within a Domain

Experiment  Domain\Performance  Slow  Normal  Fast
I           Computer            0     8       6
            Finance             2     15      3
II          Computer            1     8       7
            Finance             2     11      7
Table 9.
Self-reported Time (Hours) Spent Searching and Browsing Per Week

Experiment  Domain\Time (hours)  [0, 7)  [7, 14)  [14+)
I           Computer             1       9        4
            Finance              5       10       5
II          Computer             2       7        7
            Finance              2       11       7
4.4 Experiment Process
In experiment I, all subjects used both PCAT and LIST and searched for the same
demo and nondemo tasks. As we show in Table 10, the program automatically switched
between PCAT and LIST according to the task numbers and the group identified by user
id, so users in different groups always used different systems for the same task. The same
system-switching mechanism was adopted in experiment II to switch between PCAT and
CAT.
Table 10.
Distribution of System Uses by Tasks and User Groups

Group      First-half demo tasks  Second-half demo tasks  Nondemo tasks 1–13  Nondemo tasks 14–26
Group one  PCAT                   LIST                    PCAT                LIST
Group two  LIST                   PCAT                    LIST                PCAT
CHAPTER 5
EVALUATIONS
In this chapter, we compare two pairs of systems (PCAT vs. LIST, PCAT vs.
CAT) on the basis of the mean log search time along two dimensions: query length and
type of task. We also test five hypotheses using the responses to a postexperiment
questionnaire provided to the subjects. Finally, we demonstrate the differences in the
indices of the relevant results across all tasks for the two pairs of systems.
5.1 Comparing Mean Log Search Time by Query Length
We first compared the two systems by different query lengths. Tables 11 and 12
contain the average mean log search times across tasks with the same query length, with
± 1 standard error, for different systems in the two experiments (lower values are better).
The last column of each table provides the average mean log search time across all 26
search tasks and ± 1 standard error. For most of the comparisons between PCAT vs.
LIST (Table 11) or PCAT vs. CAT (Table 12), for a given domain and query length,
PCAT has lower average mean log search times. We conducted two-tailed t-tests to
determine whether PCAT was significantly faster than LIST or CAT for different
domains and query lengths. Table 13 shows the degrees of freedom and p-values for
the t-tests. The numbers in bold in Tables 11 and 12 highlight the systems with
statistically significant differences (p < 0.05) in average mean log search times.

Table 11.
Average Mean Log Search Time across Tasks Associated with Four Types of Query
(PCAT vs. LIST)

Experiment I (PCAT vs. LIST)
Domain-System   One-word      Two-word   Three-word   Free-form   Total
Computer-PCAT
Computer-LIST
Finance-PCAT    3.97 ± 0.34
Finance-LIST    5.10 ± 0.26

Table 12.
Average Mean Log Search Time across Tasks Associated with Four Types of Query
(PCAT vs. CAT)

Experiment II (PCAT vs. CAT)
Domain-System   One-word      Two-word      Three-word
Computer-PCAT   4.14 ± 0.26   3.88 ± 0.19   4.30 ± 0.15
Computer-CAT    4.96 ± 0.26   4.94 ± 0.34   5.17 ± 0.17
Finance-PCAT    4.10 ± 0.35   4.46 ± 0.14
Finance-CAT     5.10 ± 0.25   5.11 ± 0.16

Table 13.
The t-test Comparisons (degrees of freedom, p-values)

Experiment           Domain    One-word   Two-word   Three-word   Free-form   Total
I (PCAT vs. LIST)    Computer  10, 0.058  10, 0.137  6, 0.517     18, 0.796   50, 0.116
                     Finance   14, 0.015  10, 0.370  10, 0.752    10, 0.829   50, 0.096
II (PCAT vs. CAT)    Computer  10, 0.147  10, 0.050  6, 0.309     18, 0.013   50, 0.001
                     Finance   14, 0.193  10, 0.152  10, 0.237    10, 0.041   50, 0.003
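The reported tests can be sketched as a two-sample t statistic with pooled variance. The per-task mean log search times below are illustrative, not the experimental values; the degrees of freedom follow n1 + n2 - 2, so six one-word computer tasks per system give df = 10, matching Table 13.

```python
import math

def two_sample_t(x, y):
    """Student's two-sample t statistic with pooled variance, plus its
    degrees of freedom (n1 + n2 - 2). Illustrative sketch; the exact
    test variant used in the experiments may differ in detail."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)  # sum of squared deviations
    ss2 = sum((v - m2) ** 2 for v in y)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (m1 - m2) / se, df

# Hypothetical mean log search times for six one-word computer tasks:
pcat_times = [3.9, 4.1, 4.4, 4.0, 4.3, 4.2]
list_times = [5.0, 4.8, 5.3, 5.1, 4.9, 5.2]
t_stat, df = two_sample_t(pcat_times, list_times)  # df = 10
```

A negative t statistic here corresponds to PCAT having the lower (faster) mean log search time.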
In Table 13, for both computer and finance domains, PCAT has a lower mean log
search time than LIST for one-word query tasks with greater than 90% statistical
significance. The two systems are not statistically significantly different for tasks
associated with two-word, three-word, or free-form queries. Compared with a long query,
a one-word query may be more vague or incomplete, so a search engine may not provide
relevant pages in its top results, whereas PCAT may show the relevant result at the top of
a user interest. The user therefore could directly jump to the right category in PCAT and
locate the relevant document quickly.
Compared with CAT, PCAT has a significantly lower mean log search time for
free-form queries (p < 0.05). The better performance of PCAT can be attributed to two
main factors. First, the number of categories in the result area for CAT is often large
(about 20), so even if the categorization is accurate, the user must still commit additional
search effort to sift through the various categories. Second, the categorization of CAT
might not be as accurate as that of PCAT because of the much larger number (8,547) of
potential categories which can be expected to be less helpful in disambiguating a vague
or incomplete query. The fact that category labels in CAT are longer than those in PCAT
may also have a marginal effect on the time needed for scanning them.
For all 26 search tasks, PCAT has a lower mean log search time than LIST or
CAT with 90% or higher statistical significance, except for the computer domain in
experiment I, which has a p-value of 0.116. When computing the p-values across all
tasks, we notice that the result depends on the distribution of different query lengths and
types of tasks. Therefore, it is important to drill down into the systems' performance for
each type of task.
For reference, Table 14 illustrates the systems’ performance in terms of the
number of tasks that had a lower mean log search time for each type of query length. For
example, the table entry 4 vs. 2 for one-word query in the computer domain of
experiment I indicates that four out of the six one-word query tasks had lower mean log
search time with PCAT, whereas two had a lower mean log search time with LIST.
5.2 Comparing Mean Log Search Time for Information Gathering Tasks
According to Sellen et al. [2002], during information gathering, a user finds
multiple pages to answer a set of questions. Figure 13 compares the mean log search
times of the ten search tasks in the computer domain in experiment I that required the
user to find two relevant results for each task. We sorted the tasks by the differences in
their mean log search times between PCAT and LIST. On average, PCAT allowed the
users to finish eight of ten Information Gathering tasks more quickly than LIST (t(18),
p = 0.005), possibly because PCAT already groups similar results into a given category.
Therefore, if one page in a category is relevant, the other results in that category are
likely to be relevant as well. This spatial localization of relevant results enables PCAT to
perform this type of task faster than LIST. For the computer domain, experiment II has a
similar result in that PCAT is faster than CAT (t(18), p = 0.007). Since the finance
domain contains only two Information Gathering tasks (too few to make a statistically
robust argument), we only report the mean log search times for these tasks in Table 15.
We observe that the general trend of the results for the finance domain is the same as for
the computer domain (i.e., PCAT has lower search times than LIST or CAT).

Table 14.
Numbers of Tasks with a Lower Mean Log Search Time

Experiment           Domain \ Query length   One-word   Two-word   Three-word   Free-form   Total
I (PCAT vs. LIST)    Computer                4 vs. 2    6 vs. 0    3 vs. 1      6 vs. 4     19 vs. 7
                     Finance                 6 vs. 2    5 vs. 1    3 vs. 3      3 vs. 3     17 vs. 9
II (PCAT vs. CAT)    Computer                4 vs. 2    5 vs. 1    3 vs. 1      10 vs. 0    22 vs. 4
                     Finance                 6 vs. 2    6 vs. 0    5 vs. 1      6 vs. 0     23 vs. 3
Figure 13. Mean Log Search Time by Task for the Ten Information Gathering Tasks in the Computer Domain (PCAT vs. LIST)
Table 15.
Mean Log Search Times for Information Gathering Tasks (Finance Domain)

                                Experiment I        Experiment II
                                PCAT     LIST       PCAT     CAT
Information Gathering task 1    6.33     6.96       6.23     7.64
Information Gathering task 2    4.62     5.13       4.72     5.61
5.3 Comparing Mean Log Search Time for Site Finding Tasks
In the computer domain, there were six tasks related to finding particular sites,
such as “Find the home page for the University of Arizona AI Lab.” All six tasks were
associated with free-form queries, and we note that the queries from all subjects
contained site names. Therefore, according to Craswell et al. [2001], those tasks were
Site Finding tasks. Table 16 shows the average mean log search times for the Site Finding
tasks and ± 1 standard error. There is no significant difference (t(10), p = 0.508) between
PCAT and LIST, as shown in Table 17. This result seems reasonable because for this
type of search task, LIST normally shows the desired result at the top of the first result
page when the site name is in the query. Even if PCAT tended to rank it at the top of a
certain category, users often found the relevant result faster with the LIST layout,
possibly because with PCAT the users had to move to a proper category first and then
looked for the relevant result. However, there is a significant difference between PCAT
and CAT (t(10), p = 0.019); again, the larger number of output categories in CAT may
have required more time for a user to find the relevant site, given that both CAT and
PCAT arrange the output categories alphabetically.
Table 16.
Average Mean Log Search Times for Six Site Finding Tasks in Computer Domain

Experiment           System  Average mean log search time
I (PCAT vs. LIST)    PCAT
                     LIST
II (PCAT vs. CAT)    PCAT    3.51 ± 0.12
                     CAT     4.46 ± 0.32
5.4 Comparing Mean Log Search Time for Finding Tasks
As Table 17 shows, for 16 Finding tasks in the computer domain, we do not
observe a statistically significant difference in the mean log search time between PCAT
and LIST (t(30), p = 0.592), but the difference between PCAT and CAT is significant
(t(30), p = 0.013). However, PCAT has lower average mean log search time than both
LIST and CAT. Similarly, for 24 Finding tasks in the finance domain, PCAT achieves a
lower mean log search time than both LIST (t(46), p = 0.101) and CAT (t(46), p = 0.002).
Of the 16 Finding tasks in the computer domain, 6 are Site Finding tasks, whereas the
finance domain has only 2 (of 24). To a certain extent, this situation confirms our
observations about Finding tasks in the computer domain. We conclude that PCAT had a
lower mean log search time for Finding tasks than CAT but not LIST.
5.5 Questionnaire and Hypotheses
After a subject finished the search tasks with the two systems, he or she filled out
a questionnaire with five multiple-choice questions designed to compare the two systems
Table 17.
The t-tests for Finding Tasks

Experiment           Domain    Type of task                       Degrees of freedom, p-value
I (PCAT vs. LIST)    Computer  Site Finding                       10, 0.508
                     Computer  Finding (including Site Finding)   30, 0.592
                     Finance   Finding (including Site Finding)   46, 0.101
II (PCAT vs. CAT)    Computer  Site Finding                       10, 0.019
                     Computer  Finding (including Site Finding)   30, 0.013
                     Finance   Finding (including Site Finding)   46, 0.002
in terms of their usefulness and ease of use. We use their answers to test several
hypotheses relating to the two systems.
5.5.1 Questionnaire
Subjects completed a five-item, seven-point questionnaire in which their
responses could range from (1) strongly disagree to (7) strongly agree. (The phrase
system B was replaced by system C in experiment II. As explained in footnote 12,
systems A, B, and C refer to PCAT, LIST, and CAT, respectively.)
Q1. System A allows me to identify relevant documents more easily than system B.
Q2. System B allows me to identify relevant documents more quickly than system A.
Q3. I can finish search tasks faster with system A than with system B.
Q4. It’s easier to identify one relevant document with system B than with system A.
Q5. Overall I prefer to use system A over system B.
5.5.2 Hypotheses
We developed five hypotheses corresponding to these five questions. (The phrase
system B was replaced by system C for experiment II.)
H1. System A allows users to identify relevant documents more easily than system B.
H2. System B allows users to identify relevant documents more quickly than system A.
H3. Users can finish search tasks more quickly with system A than with system B.
H4. It is easier to identify one relevant document with system B than with system A.
H5. Overall, users prefer to use system A over system B.
5.6 Hypothesis Test Based on Questionnaire
Table 18 shows the mean responses to each question in the questionnaire. Based on the
seven-point scale described in Section 5.5, we compute the numbers in this table by
coding strongly disagree as 1, strongly agree as 7, and so on.
Because each question in Section 5.5.1 corresponds to a hypothesis in Section 5.5.2, we
conducted a two-tailed t-test on subjects' responses to each question to test the
hypotheses. We calculated p-values by comparing the subjects' responses with the neutral
midpoint, neither agree nor disagree, which has a value of 4. The table shows that for both computer
and finance domains, H1, H3, and H5 are supported with at least 95% significance, and
H2 and H4 are not supported.17 The only exception to these results is that we find only
90% significance (p = 0.083) for H1 in the finance domain of experiment I. According to
17 For example, the mean choice in the computer domain for H2 was 2.36 with p < 0.001. According to our scale, 2 means disagree and 3 means mildly disagree, so a score of 2.36 indicates subjects did not quite agree with H2. Hence, we claim that H2 is not supported. The same is true for H4.
Table 18. Mean Responses to Questionnaire Items.
Degrees of Freedom: 13 for Computer and 19 for Finance in Experiment I; 15 for
Computer and 19 for Finance in Experiment II.

Experiment           Domain    Q1       Q2       Q3       Q4       Q5
I (PCAT vs. LIST)    Computer  6.21***  2.36***  5.43*    2.71*    5.57**
                     Finance   5.25     3.65*    5.45***  3.65**   5.40**
II (PCAT vs. CAT)    Computer  6.25***  2.00***  6.06***  2.50***  6.31***
                     Finance   6.20***  1.90***  6.20***  2.65*    6.50***
*** p < 0.001, ** p < 0.01, * p < 0.05.
these responses on the questionnaire, we conclude that users perceive PCAT as a system
that allows them to identify relevant documents more easily and quickly than LIST or
CAT.
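The one-sample test against the neutral midpoint (4) described above can be sketched in a few lines of Python. The response vector below is invented for illustration, not the study's actual data, and a p-value would additionally require a t-distribution CDF (e.g., from scipy.stats):

```python
import math

def one_sample_t(responses, mu=4.0):
    """t statistic for testing whether the mean response differs from mu."""
    n = len(responses)
    mean = sum(responses) / n
    # Sample variance with n - 1 degrees of freedom
    var = sum((x - mean) ** 2 for x in responses) / (n - 1)
    return (mean - mu) / math.sqrt(var / n)

# Hypothetical 7-point responses of 14 subjects to one questionnaire item
responses = [6, 7, 5, 6, 7, 6, 5, 6, 7, 6, 5, 6, 6, 7]
t = one_sample_t(responses)   # a large positive t favors "agree"
df = len(responses) - 1       # 13, as in the computer domain of experiment I
```

A large |t| relative to the t-distribution with the given degrees of freedom yields the small p-values reported in Table 18.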
Several results reported in a recent work [Käki 2005] are similar to our findings.
In particular,
- Categories are helpful when document ranking in a list interface fails, which fits with our explanation of why PCAT is faster than LIST for short queries.
- When desired results are found at the top of the list, the list interface is faster, in line with our result and analysis pertaining to Site Finding tasks.
- Categories make it easier to access multiple results, consistent with our report for the Information Gathering tasks.
However, the categorization employed in Käki [2005] does not use examples to
build a classifier. The author simply identifies some frequent words and phrases in search
result summaries and uses them as category labels. Hence, each frequent word or phrase
becomes a category (label). A search result is assigned to a category if the result’s
summary contains the category label. Käki [2005] also does not analyze or compare the
two interfaces according to different types of tasks. Moreover, Käki [2005: Figure 4]
shows, though without explicit explanations, that categorization is always slower than a
list. This result contradicts our findings and several prior studies [e.g., Dumais and Chen
2001]. We note that the system described by Käki [2005] shows the search results in a
list interface by default, so a user may always look for a desired page in the list
interface first and switch to the category interface only if he or she does not find it
within a reasonable time.
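The labeling scheme attributed to Käki [2005] above can be sketched roughly as follows; the summaries, the frequency threshold, and the function name are invented for illustration:

```python
from collections import Counter

def frequent_word_categories(summaries, min_count=2):
    """Treat words occurring in at least min_count summaries as category labels,
    then assign each result to every category whose label its summary contains."""
    words = Counter()
    for s in summaries:
        words.update(set(s.lower().split()))   # count document frequency
    labels = [w for w, c in words.items() if c >= min_count]
    return {label: [i for i, s in enumerate(summaries)
                    if label in s.lower().split()]
            for label in labels}

# Hypothetical search-result summaries
summaries = ["java compiler tutorial", "java virtual machine", "python tutorial"]
cats = frequent_word_categories(summaries)
# "java" labels results 0 and 1; "tutorial" labels results 0 and 2
```

Unlike PCAT, no classifier is trained: a result lands in a category whenever its summary happens to contain the label.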
5.7 Comparing Indices of Relevant Results
To better understand why PCAT was perceived as faster and easier to use by the
subjects as compared with LIST or CAT, we looked at the indices of relevant results in
the different systems. An expert from each domain completed all search tasks using
PCAT and LIST. Using the relevant results identified by them, we compare the indices of
the relevant search results for the two systems, as we show in Figures 14 and 15.
We sort the tasks by the index differences between LIST and PCAT in ascending
order. Thus, the task numbers on the x-axis are not necessarily the original task numbers
in our experiments. Because PCAT organizes the search results into different categories
(interests), the index of a result reflects the relative position of that result under a
category. In LIST, a relevant result’s index number equals its relative position on the
particular page on which it appears plus ten (i.e., the number of results per page) times
the number of preceding pages. Thus, a result that appears in the fourth position on the
third page would have an index number of 24 (4 + 10 × 2). If users had to find two
relevant results for a task, we took the average of the indices. In Figure 14, PCAT and
LIST share the same indices in 10 of 26 tasks, and PCAT has lower indices than LIST in
15 tasks. In Figure 15, PCAT and LIST share the same indices in 7 of 26 tasks, and
PCAT has smaller indices than LIST in 18 tasks.
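The LIST index rule described above (relative position on the page plus ten times the number of preceding pages) reduces to a one-line function; the function name is ours:

```python
def list_index(page, position, results_per_page=10):
    """Index of a result at a 1-based position on a 1-based page in LIST."""
    return position + results_per_page * (page - 1)

# The example from the text: fourth position on the third page
assert list_index(page=3, position=4) == 24
```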
Similarly, Figures 16 and 17 show the indices of the relevant search results for PCAT
and CAT in experiment II. The data for PCAT in Figures 16 and 17 are the same as those in
Figures 14 and 15, and we sort the tasks by the index differences between PCAT and CAT
in ascending order. In Figure 16 for the computer domain, PCAT and CAT share the same
indices in 15 of 26 tasks, and CAT has lower indices in 6 tasks. In Figure 17 for the
finance domain, the two systems share the same indices in 10 of 26 tasks, and CAT has
lower indices in 14 of 26 tasks.
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-35); series: PCAT (Computer), LIST (Computer)]
Figure 14. Indices of Relevant Results in PCAT and LIST (Computer Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-45); series: PCAT (Finance), LIST (Finance)]
Figure 15. Indices of Relevant Results in PCAT and LIST (Finance Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-4.5); series: PCAT (Computer), CAT (Computer)]
Figure 16. Indices of Relevant Results in PCAT and CAT (Computer Domain)
[Line chart: Task (x-axis, 1-26) vs. Index (y-axis, 0-9); series: PCAT (Finance), CAT (Finance)]
Figure 17. Indices of Relevant Results in PCAT and CAT (Finance Domain)
The indices for PCAT in Figures 14, 15, 16, and 17, and CAT in Figures 16 and
17 reflect an assumption that a user first jumps to the right category and then finds a
relevant page by looking through the results under that category. This assumption may
not always hold, so Figures 14 and 15 may be optimistic in favor of PCAT. However, if
the time taken to locate the right category is small (as is probably the case for PCAT),
the figures provide a possible explanation for some of the results we observe, such as
the lower search times for PCAT with one-word queries and Information Gathering tasks in
Experiment I. However, CAT has smaller index numbers for relevant results than PCAT,
which may seem to contradict the better performance (lower search time) for PCAT in
experiment II. We note that due to its nonpersonalized nature, CAT has a much larger
number of potential categories as compared to PCAT. Therefore, a user can be expected
to take a longer time to locate the right category (before jumping to the relevant result in
it) as compared to PCAT.
5.8 Discussion
This part of the dissertation presents an automatic approach to personalizing Web searches given a
set of user interests. The approach is well suited for a workplace setting where
information about professional interests and skills can be obtained automatically from an
employee’s resume or a database using an IE tool or database queries. We present a
variety of mapping methods which we combine into an interest-to-taxonomy mapping
framework. The mapping framework automatically maps and resolves a set of user
interests with a group of categories in the ODP taxonomy. Our approach then uses data
from ODP to build text classifiers to automatically categorize search results according to
various user interests. This approach has several advantages, in that it does not (1) collect
a user’s browsing or search history, (2) ask a user to provide explicit or implicit feedback
about the search results, or (3) require a user to manually specify the mappings between
his or her interests and taxonomy categories. In addition to mapping interests into
categories in a Web directory, our mapping framework can be applied to other types of
data, such as queries, documents, and emails. Moreover, the use of taxonomy is
transparent to the user.
We implemented three search systems: A (personalized categorization system,
PCAT), B (list interface system, LIST), and C (nonpersonalized categorization system,
CAT). PCAT followed our proposed approach and categorized search results according
to a user’s interests, whereas LIST simply displayed search results in a page-by-page list,
similar to conventional search engines, and CAT categorized search results using a large
number of ODP categories without personalization. We experimentally compared two
pairs of systems with different interfaces (PCAT vs. LIST and PCAT vs. CAT) in two
domains, computer and finance. We recruited 14 subjects for the computer domain and
20 subjects for the finance domain to compare PCAT with LIST in experiment I, and 16
in the computer domain and 20 in finance to compare PCAT with CAT in experiment II.
There was no common subject across the experiments. Based on the mean log search
times obtained from our experiments, we examined search tasks associated with four
types of queries. We also considered different types of search tasks to tease out the
relative performances of the compared systems as the nature of task varied.
We find that PCAT outperforms LIST for searches with short queries (especially
one-word queries) and for Information Gathering tasks; by providing personalized
categorization results, PCAT also is better than CAT for searches with free-form queries
and for both Information Gathering and Finding tasks. From subjects’ responses to five
questionnaire items, we conclude that, overall, users identify PCAT as a system that
allows them to find relevant pages more easily and quickly than LIST or CAT.
Considering the fact that most users (even noncasual users) often cannot issue appropriate
queries or provide query terms to fully disambiguate what they are looking for, a PCAT
approach could help users find relevant pages with less time and effort. In comparing two
pairs of search systems with different presentation interfaces, we realize that no system
with a particular interface is universally more efficient than the other, and the
performance of a search system depends on parameters such as the type of search task
and the query length.
5.9 Limitations and Future Directions
Our search tasks were generated on the basis of user interests. We recognize some
limitations of this experimental setup in adequately capturing the workplace scenario.
The first limitation is that some of the user interests may not be known in a real-world
application, and hence some search tasks may not correspond to known user interests.
Second, a worker may search for information that is unrelated to his or her job. In
both of these cases, tasks may not match up with any of the known interests. However,
these limitations reflect a general fact: personalization can benefit only from what
is known about the user. A future direction of research is to model the dynamics of user
interests over time.
For the purposes of a comparative study, we carefully separated the personalized
system (PCAT) from the nonpersonalized (CAT) one by maintaining a low overlap
between the two systems. This allows us to understand the merits of personalization
alone. However, we can envision a new system that is a combination of the current CAT
and PCAT systems.
In particular, the new system replaces the Other category in PCAT by adding
categories of ODP that match the results that are currently placed in the Other category.
A study of such a PCAT+CAT system could be a future direction for this research. An
interesting and related direction is a smart system that can automatically choose a proper
interface (e.g., categorization, clustering, list) to display search results on the basis of the
nature of the query, the search results, and the user interest profile (context).
As shown in Figures 9 and 12, for PCAT in experiments I and II and CAT in
experiment II, we rank the categories alphabetically but always leave the Other category
at the end.18 There are various alternatives for the order in which categories are displayed
such as by the number of (relevant) results in each category or by the total relevance of
results under each category. We recognize that choosing different methods may provide
different individual and relative performances. Also, CAT tends to show more categories
on the main page than PCAT. On one hand, more categories on a page may be a negative
factor for locating a relevant result. On the other hand, more categories provide more
results on the same page, which may speed up the discovery of a relevant result as
compared to clicking a More link to open another window (as in the PCAT system). We
think that the issues of category ordering and number of categories on a page deserve
further examination.
From the subjects’ log files we observed that when some of the subjects could not
find a relevant document under a relevant category due to result misclassification, they
moved to another category or tried a new query. Such a situation can be expected to
increase the search time for categorization-based systems. Thus, another direction of
future research is to compare different result classification techniques based on their
effect on mean log search time.
It would be worthwhile to study the performance of result categorization using
other types of data, such as titles and snippets (from search engine results), instead of
page content, which would save the time spent fetching Web pages. In addition, it may be
interesting to examine how a user could improve his or her performance in Internet
searches in a collaborative (e.g., intranet) environment. In particular, we would like to
measure the benefit the user can derive from the search experiences of other people with
similar interests and skills in a workplace setting.
18 For the computer domain in experiment I, PCAT shows C++ and Java before other alphabetically ordered interests, and the “Other” category is at the end.
CHAPTER 6
INTRODUCTION AND LITERATURE REVIEW
6.1 Introduction
Business news contains rich and current information about companies and the
relationships among them. Online business news from media companies (e.g., Reuters),
content providers (e.g., Yahoo!), and company Web sites offer readers timely
assessments of dynamic company relationships. Reading news, however, is time consuming
and requires a reader to possess certain skills, the most basic of which is a good
understanding of the language in which the news is written. Moreover, the huge volume of
news stories makes manual identification of relationships among a large number of
companies, without automated news analysis, nontrivial and unscalable.
For professional or personal finance-related interests, many people regularly spend
significant amounts of time scanning the news to monitor companies' recent financial
milestones. For tasks such as investment or market research, researchers often need to
compare a pair of companies or identify top-performing companies on the basis of
revenue. Company revenue relationships are dynamic, and information about them
may not be readily or continuously available. Public companies typically update their
earnings or balance sheet data on a quarterly basis, whereas the availability of private,
initial public offering (IPO), or foreign companies' financials is more limited overall.
Scanning the competitive environment of a company or a group of companies is
essential for supply chain, marketing, investment and strategic partnership
management. Once its competitors have been identified, a company can look for their
product lines, marketing strategies, directions of R&D, key personnel, customers, and
suppliers, and so on to potentially improve its competitive advantage. Analysts and
managers may resort to various options for discovering and monitoring competitor
relationships. These options may include: asking business associates (e.g., customers
or suppliers), reading news, searching on the Web, attending business conventions,
and looking through company profile resources such as Hoover’s19 and Mergent.20
While the availability of company profiling resources has reduced the search effort
and made some business relationship information easily accessible, the other
above-mentioned approaches, due to their largely manual nature, are still time
consuming and limited in scale. In addition, because they may use different criteria
in collecting and identifying information, businesses that provide company profiles
also suffer from a scalability problem due to limited resources, manpower, and budget,
leading to incomplete and inconsistent information. For example, Hoover's considers
Interchange Corp. a competitor of Google, while Mergent does not specify this
relationship. In contrast, Mergent includes Tercica Inc. as a competitor of
GlaxoSmithKline plc, while Hoover's does not. Therefore, it is important to explore
approaches to automatically discover important business relationships that can
19 Hoover’s, Inc., http://www.hoovers.com.20 Mergent Inc., http://www.mergentonline.com.
complement and extend existing time-consuming efforts. An automated approach also
allows for timely updates of business relationships, thus avoiding the information
staleness that can mar manual approaches.
Social network analysis (SNA) refers to a set of research procedures for
identifying and quantifying structures in a social network on the basis of relationships
among the nodes [Richards and Barnett 1993]. A social network consists of a set of
nodes, such as individuals or organizations, connected through edges that represent
various relationships (e.g., friendship, affiliation) [Wasserman and Faust 1994]; such
relationships tend to be simple to identify yet voluminous to analyze. Analyzing
quantitative measures of the information represented by the nodes and edges of social
networks has proved a feasible and effective way to discover network structures in
diverse fields, such as social and behavioral science, anthropology, psychology
[Scott 2000], and information science.
In this study, we present an approach that applies SNA and machine learning
techniques for automated discovery of business relationships. In particular, we study two
different relationships, CRR and competitor relationships, as two illustrative examples of
our approach. Figure 18 illustrates the main steps for discovery of the two relationships at
a high level. First, starting with a collection of news stories organized by company,
and given that a news story pertaining to a company often cites one or more other
companies, we identify company citations in the news stories, treat them as links from
the focal (source) companies to the cited (target) companies, and construct a directed,
weighted intercompany network. Further, we identify four types of network attributes
based on network topology. The four types of attributes differ in their coverage of the
intercompany network. Finally, we feed these identified attributes to classification
methods to predict the CRR and discover the competitor relationship between two companies.
Figure 18. A High-Level Process View for Studying CRR and Competitor Relationship
This approach is effective and scalable for business relationship screening, and can be
extended for automated discovery of a broad range of business relationships. Moreover,
the approach is language neutral (i.e., we do not analyze the vocabulary or grammar in
news stories to find relationships). This last feature of the approach can help extend it to
news written in languages other than English.
6.2 Literature Review
Many researchers in areas such as organization behavior and sociology have
investigated the nature and implications of social networks created by business
relationships. For example, Levine [1972], using a network of interlocked directorates
between major banks and large industrial companies, constructs a map of the sphere of
influence that provides a quick (though approximate) overview of the relations (e.g.,
well-linked bank–company ties) in the network. Walker et al. [1997] examine an
interfirm network on the basis of cooperative relationships from a commercial directory
of biotechnology firms. Using regression techniques with ten independent variables, they
demonstrate that network structure strongly influences the choices of a biotechnology
startup in terms of establishing new relationships (licensing, joint venture, and R&D
partnership) with other companies. Uzzi [1999] investigates how social relationships and
networks affect a firm’s acquisition and cost of capital. Gulati and Gargiulo [1999]
demonstrate that an existing interorganizational network structure affects the formation of
new alliances, which eventually modifies the existing network. A major difference
between those prior studies and ours is that the prior works construct a social network
using explicitly given relationships from gold-standard data sources, whereas we try to
predict a business relationship, i.e., CRR, between two companies using structural
attributes derived from a citation-based intercompany network.
Research in information retrieval and bibliometrics has previously exploited SNA
and graph-theoretic techniques on networks of documents. These works consider implicit
signals, such as URL links, email communications, or article citations, as links between
nodes and study problems such as identifying the importance of individual nodes in the
network [e.g., Brin and Page 1998; Kleinberg 1999; Garfield 1979] and communities on the
Web [e.g., Kautz et al. 1997; Gibson et al. 1998], rather than discovering business
relationships between companies.
For example, articles such as scholarly publications can be considered to be
connected with one another through citations. A citation index indexes the citations
among such articles [Garfield 1979]. Using a citation index, a researcher can find not
only articles that a given article cites but also articles that cite the given article. CiteSeer
[Giles et al. 1998] is an example of an autonomous citation indexing system that
retrieves, indexes, and builds bibliographic and citation databases from research articles
on the Web. Furthermore, analyses of the networks created by citations have led to
various measures of prestige and the impact of published articles and the journals in
which they appear. Some measures closely resemble measurements of Web page
popularity [Brin and Page 1998] used by Web search engines such as Google.
Park [2003] identifies hyperlink network analysis as a subset of SNA, in which
nodes are Web sites and the relationships are URL links among sites. In such a network,
the linkages among sites reflect the authority, prestige, or trust of the sites [Kleinberg
1999; Palmer et al. 2000]. Brin and Page [1998] propose the PageRank algorithm to rank
the nodes (pages) of the WWW network, with directed URL links among pages, and use the
resulting page ranks to order search results. Kleinberg [1999] presents the
Hyperlink-Induced Topic Search (HITS) algorithm to compute hub and authority importance
measures for each node (page), also based on the link structure of the WWW.
Bernstein et al. [2002] apply a commercial information extraction system to
extract company entities from Yahoo! business news and posit that two companies have a
relationship (link) if they appear in the same piece of news (cooccurrence approach). The
network, which consists of 1,790 identified companies and in which links between two
companies are undirected and unweighted (binary weight), illustrates some central
industry players. They further filter out nodes in the network to produce a smaller
network with 315 companies and 1,047 links, which they use to count how many other
companies are connected with each company, rank all companies by the counts, and
indicate that some of the 30 top-ranked companies in the computer industry are also
Fortune 1000 companies. Hence, their result indicates that companies with high revenues
tend to be linked to many other companies in a network derived purely from news stories.
Their work is somewhat similar to our study, in that they use online business news to
construct an intercompany network. However, unlike Bernstein et al. [2002], we qualify
links in the constructed network by both direction and weight. Furthermore, unlike the
above-mentioned research, we employ various graph-based metrics to predict the CRR
between any pair of companies linked in the network, which contains tens of thousands
of such company pairs.
CHAPTER 7
NETWORK-BASED ATTRIBUTES AND DATA
In this chapter, we first introduce relevant notation for directed graphs, followed by
notation for directed, weighted graphs. We then describe the data and data processing
procedures. To provide statistical insights into the data, we report distributions of the
various network attributes. Hereafter, we use the following pairs of terms
interchangeably: network and graph, node and company, and link and company pair (or pair
of companies).
7.1 Notation in Directed Graphs
Figure 19 presents a directed graph (digraph) that consists of four nodes joined by
eight directed links. More formally, a digraph Gd = (N, L) consists of a set of nodes N and
a set of links L, where
N = {n1, n2, …, nm} and
L = {l1, l2, …, lk}, where li = <nsource, ntarget>.
The node indegree, NID(ni), in a digraph is the number of nodes linked to ni; the
node outdegree, NOD(ni), is the number of nodes linked from ni [Wasserman and Faust
1994]. Node indegree, or a metric based on it, has often been used to represent authority
and prestige in many prior works [e.g., Brin and Page 1998; Kleinberg 1999]. In this
figure, NID(n1) and NOD(n1) are 3 and 2, respectively, and NID(n4) and NOD(n4) are 1 and 2.
Figure 19. Directed Graph
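These degree definitions can be computed directly from an edge list. Figure 19's exact links are not enumerated in the text, so the eight links below are a hypothetical set chosen only to be consistent with the reported degrees (NID(n1) = 3, NOD(n1) = 2, NID(n4) = 1, NOD(n4) = 2):

```python
# Hypothetical 8-link digraph on nodes n1..n4, consistent with the
# degrees reported for Figure 19
links = [("n2", "n1"), ("n3", "n1"), ("n4", "n1"),
         ("n1", "n2"), ("n1", "n3"),
         ("n4", "n2"), ("n3", "n4"), ("n2", "n3")]

def nid(node, links):
    """Node indegree: number of links pointing to the node."""
    return sum(1 for _, tgt in links if tgt == node)

def nod(node, links):
    """Node outdegree: number of links originating from the node."""
    return sum(1 for src, _ in links if src == node)

assert (nid("n1", links), nod("n1", links)) == (3, 2)
assert (nid("n4", links), nod("n4", links)) == (1, 2)
```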
7.2 Notation in Directed, Weighted Graphs
Web portals such as Yahoo! Finance21 and Google Finance22 provide news stories
arranged by company. A news story pertaining to a company (source company) often
cites one or more other companies, referred to as target companies. We consider each
company citation a directed link (outlink) from the source company to a target company,
and each citation adds one unit of weight to the link. Finally, the link weight between
the two companies is the accumulated citation count across a set of news stories.
21 http://finance.yahoo.com.
22 http://finance.google.com/finance.
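Accumulating per-citation unit weights into link weights, as described above, is a simple counting exercise; the citation events below are invented for illustration:

```python
from collections import Counter

# Hypothetical citation events: (source company, cited company), one per citation
citations = [("YHOO", "GOOG"), ("YHOO", "GOOG"), ("GOOG", "YHOO"),
             ("YHOO", "DELL"), ("DELL", "YHOO")]

# Link weight = accumulated citation count for each directed company pair
weights = Counter(citations)
assert weights[("YHOO", "GOOG")] == 2   # two citations give weight 2
assert weights[("GOOG", "YHOO")] == 1
```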
Figure 20 depicts a digraph in which each link carries a weight. It is a very small
portion of the intercompany network that consists of five companies/nodes joined by 15
directed and weighted links. More formally, a weighted digraph Gwd = (N, L, W) includes
N, L, and a weight vector W associated with the set of links, where W = (w1, w2, …, wk).
Figure 20. Directed, Weighted Graph
DELL: Dell Inc., INCX: Interchange Corp., GOOG: Google Inc., JPM: JP Morgan Chase
& Co., YHOO: Yahoo! Inc.
We derive from the intercompany network various attributes that characterize
either a node (one value for each node) or a pair of nodes (one value for each pair). We
divide the attributes into four types on the basis of the range of the network covered
in computing them and describe these attributes as follows.
7.2.1 Dyadic and Node Degree-based Attributes
We first introduce a group of dyadic degree-based attributes as follows.
Dyadic weighted indegree (DWID): DWID(ni, nj) is the weight of the link from nj to ni.
In Figure 20, DWID(YHOO, GOOG) is 478.
Dyadic weighted outdegree (DWOD): DWOD(ni, nj) is the weight of the link from ni to nj.
Again, based on Figure 20, DWOD(YHOO, GOOG) is 512. We note that DWID(YHOO,
GOOG) and DWOD(YHOO, GOOG) are both large (as compared to other pairs) and
almost equal. News stories about two competing companies can be expected to
frequently cite the other company, and the volume of citations for each company
can be expected to be almost equal when there is no absolute winner (e.g., a
monopoly).
Dyadic weighted netdegree (DWND)
DWND(ni, nj) = DWOD(ni, nj) – DWID(ni, nj) (1)
Hence, DWND(YHOO, GOOG) = 512 – 478 = 34 shows a net flow of citations toward
GOOG for the pair <YHOO, GOOG>. The positive net flow to GOOG may indicate its
slight dominance as reflected by news citations.
Dyadic weighted inoutdegree (DWIOD)
DWIOD(ni, nj) = DWOD(ni, nj) + DWID(ni, nj) (2)
Again, DWIOD(YHOO, GOOG) = 990, which is relatively large compared to the other
links in the example network. A large DWIOD value may indicate a strong
relationship between the given pair of companies.
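Using the link weights of Figure 20 that are reported in the text (512 from YHOO to GOOG and 478 in the other direction), the four dyadic attributes can be computed as follows; the helper names are ours:

```python
# Directed link weights reported in the text for Figure 20
w = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478}

def dwod(ni, nj):
    """Dyadic weighted outdegree: weight of the link from ni to nj."""
    return w.get((ni, nj), 0)

def dwid(ni, nj):
    """Dyadic weighted indegree: weight of the link from nj to ni."""
    return w.get((nj, ni), 0)

def dwnd(ni, nj):
    """Dyadic weighted netdegree, Equation (1): net flow of citations."""
    return dwod(ni, nj) - dwid(ni, nj)

def dwiod(ni, nj):
    """Dyadic weighted inoutdegree, Equation (2): total flow of citations."""
    return dwod(ni, nj) + dwid(ni, nj)

assert dwnd("YHOO", "GOOG") == 34     # net flow toward GOOG
assert dwiod("YHOO", "GOOG") == 990   # large value: strong pairwise relationship
```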
The dyadic nature of these attributes captures the flow of citations and hence
potential relationships between a pair of companies. However, dyadic attributes consider
only a pair of connected nodes. To take into account a given node’s neighbors, we
consider the following node degree-based attributes.
Node weighted indegree (NWID)
NWID(ni) = Σj≠i DWID(ni, nj) (3)
This measures the flow of citations from all companies in the network to the
given company. We expect “important” companies to possibly draw a large total
number of citations in news from other companies.
Node weighted outdegree (NWOD)
NWOD(ni) = Σj≠i DWOD(ni, nj) (4)
This measures the flow of citations from the given company to all other
companies in the network.
Node weighted inoutdegree (NWIOD)
NWIOD(ni) = NWID(ni) + NWOD(ni) (5)
This measures the overall flow of citations both to and from the given company
(ni). In essence, this attribute measures the overall connectivity of the given company
and all neighbor companies in the network independent of the direction of citations.
In Figure 20, for node n1 (YHOO), the NWID, NWOD, and NWIOD values are
513, 541, and 1054, respectively. If a pair of companies has a large DWIOD value as
well as large individual NWIOD values, it may suggest that the two companies have a
strong relationship and are both important players.
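The node degree-based attributes sum the dyadic weights over all neighbors. Only YHOO's totals (NWID = 513, NWOD = 541) and its link pair with GOOG (512/478) are reported in the text, so the remaining weights below are invented purely to make those totals come out:

```python
# Hypothetical link weights; only the YHOO<->GOOG pair (512/478) and YHOO's
# totals (NWID = 513, NWOD = 541) come from the text
w = {("YHOO", "GOOG"): 512, ("GOOG", "YHOO"): 478,
     ("YHOO", "DELL"): 15,  ("DELL", "YHOO"): 20,
     ("YHOO", "INCX"): 8,   ("INCX", "YHOO"): 10,
     ("YHOO", "JPM"): 6,    ("JPM", "YHOO"): 5}

def nwid(n):
    """Equation (3): total citation flow into company n."""
    return sum(wt for (src, tgt), wt in w.items() if tgt == n)

def nwod(n):
    """Equation (4): total citation flow out of company n."""
    return sum(wt for (src, tgt), wt in w.items() if src == n)

def nwiod(n):
    """Equation (5): overall connectivity of company n."""
    return nwid(n) + nwod(n)

assert (nwid("YHOO"), nwod("YHOO"), nwiod("YHOO")) == (513, 541, 1054)
```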
7.2.2 Centrality-based Attributes
In addition to the dyadic and node degree-based measurements, we also use a
network analysis package [O'Madadhain 2006] to compute scores on the basis of three
different centrality/importance measuring schemas: PageRank [Brin and Page 1998],
HITS [Kleinberg 1999], and betweenness centrality [Brandes 2001]. These schemas
extend beyond immediate neighbors to compute the importance or centrality of a given
node in the whole network. The PageRank algorithm computes a popularity score for
each Web page on the basis of the probability that a “random surfer” will visit the page
[Brin and Page 1998]. The HITS algorithm generates a pair of scores, “hub” and
“authority,” for each page. Both HITS and PageRank compute principal eigenvectors of
matrices derived from graph representations of the Web [Kleinberg 1999], so our use of
them for a graph whose nodes are companies differs from their original use. As a node
centrality measurement, betweenness measures the extent to which a node lies between
the shortest paths of other nodes in the graph [Freeman 1979]. The three schemas do not
consider link weights. JUNG [2006] provides the node authority scores for HITS and
ignores the link direction when computing betweenness centrality. The intuition behind
these global centrality attributes is the same as that for the node degree-based
attributes, but the former are more informative because they consider the entire network
rather than focusing on immediate neighbors.
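The study computes these scores with the JUNG package; as an illustration of one of the three schemes, a minimal unweighted PageRank by power iteration (the form introduced by Brin and Page [1998]) looks roughly like the sketch below. The three-node graph is hypothetical:

```python
def pagerank(nodes, links, d=0.85, iters=50):
    """Unweighted PageRank by power iteration; link weights are ignored,
    consistent with the text's note that the centrality schemas do not use them."""
    out = {n: [t for s, t in links if s == n] for n in nodes}
    pr = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1 - d) / len(nodes) for n in nodes}
        for n in nodes:
            if out[n]:                       # spread rank along outlinks
                share = d * pr[n] / len(out[n])
                for t in out[n]:
                    nxt[t] += share
            else:                            # dangling node: spread uniformly
                for t in nodes:
                    nxt[t] += d * pr[n] / len(nodes)
        pr = nxt
    return pr

# Hypothetical intercompany links: YHOO and DELL cite GOOG, GOOG cites YHOO
nodes = ["YHOO", "GOOG", "DELL"]
links = [("YHOO", "GOOG"), ("DELL", "GOOG"), ("GOOG", "YHOO")]
pr = pagerank(nodes, links)   # GOOG, cited by two companies, ranks highest
```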
7.2.3 Structural Equivalence (SE)-based Attributes
Lorrain and White [1971] define two nodes as structurally equivalent if they
have the same links to and from other nodes in the network. Because it is unlikely that
two nodes will be exactly structurally equivalent in our intercompany network, we use a
similarity metric to measure the degree to which two nodes are structurally equivalent.
The intercompany network is represented as a weighted N×N adjacency matrix, where N
is the number of nodes. The SE similarity between two nodes is the normalized dot
product (i.e., cosine similarity) of the two corresponding rows in the matrix, where a
matrix element can be a DWID, DWOD, or DWIOD value, thereby producing
DWID-, DWOD-, or DWIOD-based SE similarity, respectively. Intuitively, the DWID-based SE
similarity between company A and company B captures the overlap between companies
whose news stories cite A and companies whose news stories cite B (analogous to co-
citation [Small 1973]); the DWOD-based SE similarity reflects the overlap between
companies that news stories of A and B cite (analogous to bibliometric coupling [Kessler
1963]). A high overlap between neighbors of two nodes in our intercompany network
may be reflective of the overlap in their businesses or markets. Intuitively, this
phenomenon may indicate a competitor relationship. For example, for the sample graph
of Figure 20, the DWID-based SE similarity between n1 and n3, or YHOO and GOOG, is 0.98
out of a maximum possible value of 1.
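The SE similarity can be sketched as a cosine similarity over rows or columns of the adjacency matrix. This is a minimal illustration, not the dissertation's implementation: the mapping of DWID-based SE to in-link columns and DWOD-based SE to out-link rows is my reading of the text, and the toy matrix below is unrelated to Figure 20 (so the 0.98 value is not reproduced here).

```python
import numpy as np

def se_similarity(W, i, j, mode="in"):
    """Cosine similarity of two nodes' weighted link vectors.

    W is the NxN weighted adjacency matrix (W[a, b] = weight of the
    link a -> b).  mode="in" compares in-link patterns (DWID-based
    SE), mode="out" compares out-link patterns (DWOD-based SE), and
    mode="inout" concatenates both (DWIOD-based SE).
    """
    if mode == "in":
        u, v = W[:, i].astype(float), W[:, j].astype(float)
    elif mode == "out":
        u, v = W[i, :].astype(float), W[j, :].astype(float)
    else:
        u = np.concatenate([W[:, i], W[i, :]]).astype(float)
        v = np.concatenate([W[:, j], W[j, :]]).astype(float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Nodes 1 and 2 are cited with identical weights by the same company,
# so they are perfectly structurally equivalent on the in-link side.
W = np.array([[0, 2, 2],
              [0, 0, 0],
              [0, 0, 0]])
print(se_similarity(W, 1, 2, mode="in"))  # 1.0
```

A high in-link SE similarity between two companies means largely the same set of companies cites both, which is the co-citation intuition behind the competitor signal.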
For classifying whether a pair of companies are competitors, we use the above-described attributes. As noted earlier, some of the attributes have one value for a pair of
nodes (DWID, DWOD, DWIOD, and the three SE similarities), while others have a
value for each node in the pair (NWID, NWOD, NWIOD, pagerank, hits, and betweenness).
Hence, we use a total of 18 attributes for classifying the competitor relationship for a
company pair. Table 19 summarizes the four types of attributes and the range of the
network each covers.
7.3 Raw Data
Now we describe the source and nature of the raw data (news stories) and the
process by which we constructed the intercompany network from them. The first data set
consists of eight months (July 2005–February 2006) of business news for all companies
on Yahoo! Finance. Both Chapter 8 (predicting CRRs) and Chapter 9 (Discovering
Competitor Relationships) use this data set. In addition in Chapter 8 we use three more
months’ (October–December 2005) news stories from the first data set as a second data
set to validate the major results obtained from the first, but with the second data set we
Table 19.
Four Types of Network Attributes

Attribute Type        | Attributes                              | Range of Network Covered
Dyadic degree-based   | DWID, DWOD, DWIOD                       | A given node and only one directly connected node
Node degree-based     | NWID, NWOD, NWIOD                       | A given node and all directly connected nodes
Node centrality-based | pagerank, hits, betweenness             | Whole network
SE-based              | DWID-, DWOD-, DWIOD-based SE similarity | Any two nodes and their directly connected nodes in the whole network
study CRRs on the basis of quarterly revenues. In Section 9.2 we describe three smaller
data sets sampled from the first data set for discovering competitors.
7.4 Preliminary Data Processing
Yahoo! Finance organizes business news stories by company and date. The news
stories are not limited to those available from yahoo.com but also include those from
other news sources, such as forbes.com, thestreet.com, and businessweek.com. In other
words, URL links corresponding to news titles that have been organized under a company
in Yahoo! Finance may point to Web pages located at several domains. Taking advantage
of this organizing mechanism provided by Yahoo!, we consider that news stories
organized under a company belong to the company and identify all news pertaining to a
given company within a period of time. For example, for news belonging to Google and
dated February 28, 2006, a page containing both all news titles and their URLs linking to
news content is at http://finance.yahoo.com/q/h?s=GOOG&t=2006-02-28, where GOOG
is the stock ticker of Google Inc. We automatically construct similar URLs to gather links
of news stories for each company in Yahoo! Finance across the eight-month period. We
then programmatically fetch news stories corresponding to the links. Yahoo! may
organize the same piece of news under different companies; we treat such a news story as
belonging to each of the companies that Yahoo! identifies.
7.5 Node and Link Identification
A news story identifies a company according to its stock ticker on NYSE,
NASDAQ, or AMEX. If a piece of news pertaining to a company ni mentions another
company nj, we consider that there is a directed link from ni to nj, denoted as <ni, nj>. If
company nj is cited several times in the same piece of news, each citation adds to the
accumulated weight for the directed link. We aggregate citation frequency across all
news stories in a data set. Furthermore, we do not count self-references; therefore, we
ignore citations to company ni if they appear in a news story belonging to ni. For
example, if a news story pertaining to company n1 mentions the companies in the
sequence [n2, n1, n3, n4, n4, n2, n5], we derive the set of links and the weight vector as (<n1,
n2>, <n1, n3>, <n1, n4>, <n1, n5>) and (2, 1, 2, 1), respectively. We filter out news stories
that do not mention any other company. After we collected the annual revenues and news
stories for all companies across all nine sectors in Yahoo! Finance, we obtained a
total of 6,428 companies and 60,532 news stories. For the first data set, we note that the
early months (i.e., July–September 2005) included fewer news stories than later months,
because Yahoo! does not archive as many historical news stories as recent ones. In Table
20, we provide company and news distribution across the nine sectors in the first data set.
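The link-derivation rule above can be captured in a few lines. This hypothetical helper reproduces the worked example from the text: a story of n1 mentioning [n2, n1, n3, n4, n4, n2, n5] yields the weight vector (2, 1, 2, 1).

```python
from collections import Counter

def extract_links(story_company, mentions):
    """Derive weighted directed links from one news story.

    A story belonging to `story_company` contributes a link
    <story_company, m> for every mentioned company m, skipping
    self-references, with weight equal to the mention count.
    """
    counts = Counter(m for m in mentions if m != story_company)
    return {(story_company, m): c for m, c in counts.items()}

links = extract_links("n1", ["n2", "n1", "n3", "n4", "n4", "n2", "n5"])
print(links)
# {('n1', 'n2'): 2, ('n1', 'n3'): 1, ('n1', 'n4'): 2, ('n1', 'n5'): 1}
```

Aggregating the dictionaries produced per story (summing weights on repeated edges) then gives the accumulated link weights of the full intercompany network.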
7.6 Attribute Distributions
Several variables derived from social phenomena and networks, such as the Pareto
distribution of wealth and the frequency of word usage in the English language [Adamic
2002], follow a power law distribution. Recent research shows that several aspects of
Table 20.
Company and News Distribution across Sectors

Sector           | Number of Companies | Percentage of Companies | Number of News Stories | Percentage of News Stories
Basic materials  |   522 |  8.12% |  4398 |  7.27%
Conglomerates    |    30 |  0.47% |  1004 |  1.66%
Consumer goods   |   496 |  7.72% |  4947 |  8.17%
Financial        |  1402 | 21.81% |  5512 |  9.11%
Healthcare       |   706 | 10.98% |  7481 | 12.36%
Industrial goods |   423 |  6.58% |  2677 |  4.42%
Services         |  1334 | 20.75% | 13144 | 21.71%
Technology       |  1386 | 21.56% | 20723 | 34.23%
Utilities        |   129 |  2.00% |   646 |  1.07%
Total            |  6428 |   100% | 60532 |   100%
digital networks such as the Internet follow power law distributions as well. For example,
the rank and frequency of the outdegrees of Internet domains [Faloutsos et al. 1999] and
the indegree and outdegree of Web page links [Barabási et al. 2000, Broder et al. 2000,
Kumar et al. 1999] reflect power law distributions. With the directed, weighted
intercompany network, we observe similar power law distributions for various node
degree measurements (NID, NOD, NWID, and NWOD) and link weight. All logarithms
used in the distributions are base 10.
7.6.1 Node Indegree Distribution
Figure 21 shows that the distribution of node indegree (NID) follows a power law
distribution with a Pearson correlation at 0.945 (negative sign ignored). The distribution
indicates a few nodes (companies) attract most of the citations, similar to social
phenomena such as the distribution of wealth (Pareto distribution) [Adamic 2002]. We
observe similar power law distributions for other node degree measurements, such as
NOD, NWID, and NWOD. For brevity, we do not show their distribution plots herein.
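The power law check used here amounts to fitting a line on the log-log scale and reporting the Pearson correlation. A minimal sketch follows; the synthetic data obey an exact power law (count ∝ degree⁻²), so the log-log correlation is exactly -1, whereas the dissertation's empirical data yield magnitudes around 0.94-0.95.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def loglog_pearson(degree_counts):
    """Correlation of log10(degree) with log10(count); values near -1
    on this scale are consistent with a power law."""
    pts = [(math.log10(d), math.log10(c))
           for d, c in degree_counts if d > 0 and c > 0]
    return pearson([p[0] for p in pts], [p[1] for p in pts])

# Synthetic counts following count ~ degree^-2 exactly.
data = [(d, 10000 / d ** 2) for d in range(1, 50)]
print(round(loglog_pearson(data), 3))  # -1.0 for an exact power law
```

All logarithms are base 10, matching the convention stated for the distributions in this chapter.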
7.6.2 Link Weight Distribution
Figure 22 shows the link weight distribution in our intercompany network. The link
weight also follows the power law distribution with a Pearson correlation at 0.944. The
power law distribution of link weights indicates there are a few very strong links and
many weak ones.
Figure 22. Link Weight Distribution (log-log plot of Log(Weight) versus Log(Count))
7.6.3 Revenue Distribution
We choose one million dollars as the unit in which to record the revenue for each
company, group companies with similar logged revenues, and obtain the histogram in
Figure 23, which shows that the (logged) revenues across the 6,428 companies
approximately follow a normal distribution.
Figure 23. Revenue Distribution (histogram of Log(Revenue) versus Count)
7.6.4 Revenue Node Weighted Indegree Distribution
Figure 24 represents a plot of the logged revenues and logged node NWID of all
nodes, with a Pearson correlation of 0.534. Unlike in the prior three subsections, we find no
clear pattern for the two variables. In addition, we observe similar distributions for
logged revenue with NID, NOD, and NWOD.
Figure 24. Scatter Plot of Revenue and NWID (Log(Revenue) versus Log(NWID))
CHAPTER 8
PREDICTING COMPANY REVENUE RELATIONS
As explained in Section 7.5, in our approach nodes in an intercompany network
consist of companies mentioned in business news stories. When determining a link
between two nodes, unlike traditional SNA that uses explicit social relationships (e.g.,
common directorship [Levine 1972], cooperative business relationships [Walker et al.
1997]), we assume a directed link from company A to company B if a news story
pertaining to company A mentions (cites) company B. Moreover, a link from
company A to company B carries a weight that equals the total number of citations for
company B in a set of news stories belonging to company A. The direction and weight
should provide additional information about the flow and strength of business
relationships in the constructed network. Also, by noting the direction, we can examine
the effects of links coming into a node and those going away from it separately. The
weights in our network reflect the accumulated citations between a pair of companies and
enable us to quantitatively identify a relationship between two companies over time. We
identify a “netdegree” measurement (DWND) that combines the direction and weights to
provide an overall view of the relationship between a pair of companies. Hence, this
approach is more comprehensive than prior related literature on several dimensions,
including a richer network (with weights and direction), a new degree-based metric,
larger data sets, and various analyses related to business relationship prediction.
To illustrate business relationship prediction, in this chapter we focus on
predicting a (positive or negative) CRR between any pair of linked companies and further
estimate whether a company’s revenue is in the top-N (where N varies from 100 to 1000)
companies on the basis of the network structure. Before we present our research
questions in detail, we first describe how we measure CRR.
8.1 Measurements for CRR
As we mentioned in the introduction, a positive or negative revenue relation exists
between a pair of companies. However, when the two companies come from different
sectors, their (absolute) revenue values may not be comparable. Therefore, besides a
direct comparison of revenues in dollars, we derive the following three metrics to
determine a positive or negative CRR by taking the size of a sector into consideration:
Revenue rank, or the rank of the company's revenue in its sector, namely,
revenue rank(ni) ∈ [1, |sector(ni)|], where revenue rank(ni) is company ni's rank order
in its sector by revenue and |sector(ni)| is the total number of companies in the sector
to which company ni belongs.

Normalized revenue rank(ni) = revenue rank(ni) / |sector(ni)|   (6)

Revenue share(ni) = revenue(ni) / Σ nj∈sector(ni) revenue(nj)   (7)

where revenue(ni) is company ni's revenue value (in dollars).
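The sector-aware metrics can be sketched as follows. The normalization by sector size in (6) and by total sector revenue in (7) reflects my reading of those equations; the companies and revenue figures below are purely illustrative, and ties in revenue are ranked arbitrarily here.

```python
def sector_metrics(revenues):
    """Compute revenue rank, normalized revenue rank, and revenue
    share for every company in one sector.

    `revenues` maps company -> revenue in dollars; rank 1 is the
    largest revenue in the sector.
    """
    total = sum(revenues.values())
    size = len(revenues)
    ordered = sorted(revenues, key=revenues.get, reverse=True)
    out = {}
    for rank, company in enumerate(ordered, start=1):
        out[company] = {
            "rank": rank,                      # revenue rank
            "normalized_rank": rank / size,    # equation (6)
            "share": revenues[company] / total # equation (7)
        }
    return out

m = sector_metrics({"A": 500.0, "B": 300.0, "C": 200.0})
print(m["A"])  # rank 1, normalized_rank ~0.333, share 0.5
```

Because both (6) and (7) are relative to the sector, a CRR derived from them remains comparable for company pairs drawn from sectors of very different sizes.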
In Section 8.4, we report the detailed results measured by normalized revenue
ranks. The results measured by the other three metrics are similar and therefore are
not included herein.
8.2 Research Questions
We want to explore the broad hypothesis that attributes derived from a network
constructed from news stories can indicate meaningful business relationships (in
particular, CRR and top-N by revenue). Therefore, we identify attributes that capture the
pairwise relationships between companies (dyadic degree-based) or estimate the
individual importance of each company (node degree-based and node centrality-based).
In each case, the attributes are computed purely from weighted and directed links formed
by citations in news stories. In turn, based on the problem described previously and the
identified network-based attributes, we ask the following specific research questions:
(1) Is DWND, which captures the net flow of citations between a pair of companies,
an effective indicator of positive CRR?
(2) How well can the attributes derived purely from network structure, as shown in
Table 19 in Section 7.2, predict CRR for a pair of companies in the network?
(3) How does CRR prediction performance differ among the three groups of
attributes, which represent different amounts of network covered?
(4) Which of the network structure-based attributes (when combined linearly) are
significant in distinguishing positive and negative CRRs?
(5) How well can CRRs for pairs that flip their revenue relations at different time
periods be predicted?
(6) How well can individual importance measures of each company, such as node
degree- and centrality-based attributes, predict top-N revenue companies?
8.3 Research Methods
With Figure 25 we introduce the specific procedures and methods we use to
address our research questions. For our analysis with pairs of companies, we use DWND
to identify the source and target and ensure each pair is selected only once: if (ni, nj) is
identified as a pair, (nj, ni) cannot be selected. We sort all the links by their DWND
values in descending order and consider only those links whose DWND values are
greater than or equal to 0. For any link <ni, nj> in the network with a DWND value of 0,
we ignore the opposite link <nj, ni>. We identify 87,340 company pairs from the first data
set and use them to predict CRR. With this data set we also predict the top-N companies
by revenue and note that the range of netdegree values is 0–49.
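The pair-selection rule can be sketched as follows. One assumption is labeled explicitly: the netdegree is taken as DWND(<i, j>) = w(i → j) − w(j → i), which matches the "net flow of citations" description but is not spelled out as a formula in this excerpt. The edge list is illustrative.

```python
def select_pairs(links):
    """Select each company pair once, oriented by netdegree.

    Assumes DWND(<i, j>) = w(i -> j) - w(j -> i).  Only orientations
    with DWND >= 0 are kept; when DWND == 0 exactly one of the two
    symmetric links is retained.
    """
    pairs = {}
    for (i, j), w in links.items():
        dwnd = w - links.get((j, i), 0)
        if dwnd > 0 or (dwnd == 0 and (j, i) not in pairs):
            pairs[(i, j)] = dwnd
    return pairs

links = {("A", "B"): 5, ("B", "A"): 2,
         ("C", "D"): 3,
         ("E", "F"): 1, ("F", "E"): 1}
print(select_pairs(links))  # {('A', 'B'): 3, ('C', 'D'): 3, ('E', 'F'): 0}
```

Each unordered pair thus appears exactly once, oriented from the company with the larger outgoing citation flow, which is what allows source- and target-specific attributes to be defined consistently.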
8.3.1 Classification Methods
Using Weka [Witten and Frank 2005] as a data analysis tool, we employ two
classification methods to evaluate the CRR prediction performance for company pairs.
For our classification methods, we select logistic regression and C4.5 [Quinlan 1993]
Figure 25. Diagram of Methodology and Analysis Approaches
decision tree (i.e., J48 classifier in Weka). Logistic regression is frequently used in
business research for problems with a binary class label (as for our CRR prediction
problem); decision tree is one of the commonly used classifiers in data mining, because it
is highly accurate for binary classification problems, it does not impose assumptions
about the distribution of data, and its results are well suited for human interpretation
[Padmanabhan et al. 2006]. We use two different methods so we may compare their
performances for our applications. For each of the classification methods, we employ and
report results on the basis of 10-fold cross-validation. In line with standard metrics used
in data mining and information retrieval, we report precision, recall, and accuracy to
evaluate the performance of the predictive models, where TP, FP, TN, and FN denote
the numbers of true positives, false positives, true negatives, and false negatives,
respectively:

Precision = TP / (TP + FP)   (8)

Recall = TP / (TP + FN)   (9)

Accuracy = (TP + TN) / (TP + TN + FP + FN)   (10)
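The evaluation setup described above can be sketched with scikit-learn in place of Weka; this is an analogy, not the dissertation's tooling, so LogisticRegression and DecisionTreeClassifier stand in for Weka's logistic regression and J48 (C4.5), and the feature matrix here is random stand-in data rather than the 12 network attributes.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import precision_score, recall_score, accuracy_score

rng = np.random.default_rng(0)
# Stand-in feature matrix: 12 network attributes per company pair.
X = rng.normal(size=(500, 12))
# Synthetic binary class label driven by the first two features.
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

for model in (LogisticRegression(max_iter=1000), DecisionTreeClassifier()):
    pred = cross_val_predict(model, X, y, cv=10)  # 10-fold cross-validation
    print(type(model).__name__,
          round(precision_score(y, pred), 3),
          round(recall_score(y, pred), 3),
          round(accuracy_score(y, pred), 3))
```

Computing the metrics from out-of-fold predictions, as above, mirrors the 10-fold cross-validated precision, recall, and accuracy reported throughout this chapter.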
8.3.2 Discriminant Analysis with Logistic Regression
The main purpose of this chapter is to explore the power of structural attributes in
predicting CRR. However, we would also like to investigate the significance (if any) of
individual IVs in discriminating between positive and negative CRRs. Therefore we
perform a discriminant analysis using logistic regression. The linear manner in which
attributes are combined in logistic regression allows for a straightforward understanding of
their individual significance. In particular, from the 87,340 pairs in the first data set we
randomly select 1000 pairs such that each company in the chosen pairs is distinct. As a
result, there are 2000 unique companies in the 1000 pairs and hence these pairs are
considered independent. With 12 IVs (DWND and DWOD, NWID and NWOD for
source and target, and pagerank, hits, and betweenness scores for source and target) and
CRR as the dependent variable (DV), we employ binary logistic regression in SPSS
(version 12.0) to find the discriminant variables. In particular, we start with a base model
that uses the mean of the DV and does not include any IVs. Then, from a list of candidate
IVs that have statistically significant differences between the two DV groups, we add
one IV at each step by choosing the IV with the largest score statistic (method
"Forward: LR" in SPSS) until the stepwise estimation procedure stops (e.g., no remaining
IV is significant) [Hair et al. 2006].
8.3.3 CRR Flips
CRR flip refers to a CRR change (negative to positive or vice-versa) over two different
time periods. We would like to measure the prediction performance of our approach on
CRR flips that represent a more interesting subset of the data since they capture the
dynamics of CRR. We note that for this subset of data, a naïve approach of assuming that
CRR does not change would result in a precision, recall, and accuracy of 0%. We analyze
how well we predict the CRR among the flip pairs based on annual and quarterly revenue
data. For the first data set, we collect annual revenues for the year 2004. The flips are
identified by comparing 2004's annual revenues with revenues of the four quarters ending in April
2006. From the 87,340 pairs in the first data set, we find a total of 75,709 pairs that have
annual revenues in both time periods. For the four different CRR measurements (see
Section 8.1), about 4% of pairs flipped. With the two classifiers from Weka we run 10-
fold cross validation and report the prediction performance of CRR on all the 75,709
pairs and all the flip pairs.
With the second data set, we identify quarterly revenues for Q4 2005, Q1 and Q2
2006 from Yahoo! Finance to derive CRRs. We then identify flip pairs for time periods
of Q4–Q1 and Q4–Q2. For the four different CRR measurements, the percentage of flip
pairs is about 5%.
8.4 Results and Analyses
With the first data set, we first explore how DWND is associated with positive
CRR by determining whether the net flow of news citations between a pair of companies
indicates the relative size of their revenues. Then we report how well the various
attributes derived from network structure predict CRRs for company pairs. To tease out
the effects of the three different groups of attributes (dyadic degree-based, node
degree-based, and node centrality-based), we repeat the prediction experiment with each set of
attributes separately. Using logistic regression as discriminant analysis we report what
IVs are significant in distinguishing CRRs. With the CRR prediction results we further
examine the classification performance for flip pairs. For the second data set, we briefly
report results similar to those obtained by the first data set. In particular, we provide
prediction performance of CRR on the basis of Q4 2005. When analyzing CRR flips,
instead of using revenues from a previous time period, we compare revenues of Q4 2005
with those in the next two time periods (i.e., Q1 and Q2 2006), respectively. Then we
examine how well data collected in Q4 2005 can classify flip pairs identified at different
future time periods.
8.4.1 Positive CRR and Top Links
We sort all of the links in the network by their DWND values (in descending
order). Using a set of the top few links from the sorted list, we compute the percentage
that correctly reflects positive CRR. We then successively increase the number of top
links (T); in Table 21, we provide the number and percentage of the top links (where T
varies from 20 to a few hundred) that follow the positive CRR. We measure the
significance of the percentages in Table 21 through a binomial test. Finally, we note that
if the DWND were independent of CRR, the percentages would be close to 50%. When
the DWND values are relatively high, DWND seems to be a good indicator of positive
revenue relations.
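The significance test on these percentages can be reproduced directly: under the null hypothesis that DWND is independent of CRR, each top link follows positive CRR with probability 0.5, so the first row of Table 21 is a two-sided binomial test of 16 successes in 20 trials.

```python
from scipy.stats import binomtest

# Top T = 20 links: 16 of 20 follow positive CRR; null p = 0.5.
result = binomtest(16, n=20, p=0.5, alternative="two-sided")
print(round(result.pvalue, 4))  # ~0.0118, significant at p < 0.05
```

This matches the single asterisk (p < 0.05) reported for the T = 20 row; the larger rows, with more links at similarly high percentages, reach p < 0.001.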
Table 21.
Positive CRR for Top-N Links

Top Links (T) | DWND Range | Number of Links Following Positive CRR | Percentage of Links Following Positive CRR
 20 | [24, 49] |  16 | 80.0% *
 37 | [19, 49] |  31 | 83.8% ***
 64 | [16, 49] |  50 | 78.1% ***
 79 | [14, 49] |  58 | 73.4% ***
114 | [12, 49] |  80 | 70.2% ***
135 | [11, 49] |  92 | 68.2% ***
175 | [10, 49] | 115 | 65.7% ***
217 | [9, 49]  | 134 | 61.8% ***
289 | [8, 49]  | 172 | 59.5% ***
* p < 0.05, *** p < 0.001 (two-tailed).
8.4.2 Positive CRR and All Links
As the DWND value decreases, so does the signal indicating the positive CRR
between a pair of companies. To examine this observation further, we segment the links
in the intercompany network into baskets, such that links in each basket have the same
DWND, and combine links with different DWND values into one basket only if the
basket contains fewer than 20 links. In Table 22, we provide the percentages of links
following positive CRR in each basket.
When DWND values are small (e.g., less than 10), links in the same baskets do
not display a clear trend toward a positive CRR. In other words, for company pairs in
those baskets, pointing to a company with the same or higher revenue rank is about as
likely as pointing to one with lower revenue rank. However, as the DWND values
increase, positive CRR becomes more salient.
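The bucketing described above can be sketched as follows. The merge rule here, combining adjacent DWND values until a basket reaches 20 links, is my reading of the text's description, and the toy DWND values are illustrative.

```python
from collections import Counter

def make_baskets(dwnd_values, min_size=20):
    """Group links into baskets of equal DWND, merging adjacent DWND
    values whenever a basket is still smaller than `min_size`.

    Returns (low DWND, high DWND, number of links) per basket.
    """
    counts = Counter(dwnd_values)
    baskets, current, size = [], [], 0
    for value in sorted(counts):           # ascending DWND
        current.append(value)
        size += counts[value]
        if size >= min_size:
            baskets.append((current[0], current[-1], size))
            current, size = [], 0
    if current:                            # leftover small basket
        baskets.append((current[0], current[-1], size))
    return baskets

# Toy DWND values: plenty of small values, sparse large ones.
values = [1] * 30 + [2] * 25 + [3] * 8 + [4] * 7 + [5] * 9
print(make_baskets(values, min_size=20))
# [(1, 1, 30), (2, 2, 25), (3, 5, 24)]
```

As in Table 22, small DWND values are frequent enough to form singleton baskets, while sparse large values get merged into ranges such as [11, 12] and [13, 17].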
In summary, DWND can be an indicator of positive CRR for top links, i.e., links
with large DWND values. Overall, 48% of the 87,340 pairs whose DWND values are
nonnegative follow positive CRR, suggesting that the indication provided by DWND
disappears when considering all the pairs.

Table 22.
Positive CRR for All Links with the Same or Similar DWND

Basket No. | DWND     | Percentage of Links Following Positive CRR
 1 | 1        | 46.5%
 2 | 2        | 48.8%
 3 | 3        | 46.8%
 4 | 4        | 51.9%
 5 | 5        | 51.8%
 6 | 6        | 57.1%
 7 | 7        | 56.3%
 8 | 8        | 52.8%
 9 | 9        | 45.2%
10 | 10       | 57.5%
11 | [11, 12] | 55.6%
12 | [13, 17] | 62.5%
13 | [18, 23] | 86.7% ***
14 | [24, 49] | 80.0% *
* p < 0.05, *** p < 0.001 (two-tailed, binomial test).
8.4.3 Predicting CRR with Annual Revenues
For the first data set we first predict CRR using three groups of attributes
identified in Section 7.2, then use each individual group of attributes separately and
observe its predictive power. Moreover, we conduct discriminant analysis to identify
what IVs are significant in discriminating CRRs.
8.4.3.1 All Three Groups of Attributes
To predict the CRR for each pair of companies, we use a total of 12 attributes (2
dyadic degree-based, 4 node degree-based, and 6 node centrality-based). For the node
degree-based and node centrality-based measures, we employ a pair of attributes for the
source and target companies of each link. Of the dyadic degree-based attributes, we do
not use DWID because it can be derived directly from DWND and DWOD. Table 23
shows the results of the two classification methods for the first data set (87,340 company
pairs).
From Table 23 we observe that using attributes derived from a network without
resorting to any information about a company’s sector or revenue, we achieve reasonable
precision, recall, and accuracy of approximately 70–80% in predicting the CRR between
companies, given our data set consists of an almost equal number of positive and
negative CRR instances (see the third column in Table 23). In addition we divide the
87,340 pairs into two subsets: (1) all pairs in which both companies in a pair belong to
the same sector and (2) the remaining pairs (different sectors). We examine the prediction
performance for each subset separately, and again, the precision, recall, and accuracy fall
around the 70–80% range, similar to those in Table 23. Using the ten accuracy values
generated through the 10-fold cross-validation, we find that the average accuracies of the
logistic regression and decision tree differ significantly (two-tailed t-test, p < 0.001), with
decision tree proving to be a superior method.
Table 23.
Classification Results of CRR with 12 Attributes (First Data Set)

Classification Method | Class Label (CRR) | Number (Percentage) of Pairs | Precision | Recall | Accuracy
Logistic regression   | 0 | 45907 (52.6%) | 74.8% | 77.1% | 74.3%
                      | 1 | 41433 (47.4%) | 73.7% | 71.2% |
Decision tree         | 0 | 45907 (52.6%) | 80.5% | 81.1% | 79.7%
                      | 1 | 41433 (47.4%) | 78.9% | 78.2% |
Notes: Attributes are DWND, DWOD, source NWID, source NWOD, target NWID, target NWOD, source pagerank, source hits, source betweenness, target pagerank, target hits, and target betweenness.
8.4.3.2 Each Individual Group of Attributes
We are also interested in comparing the performances of individual groups of
attributes separately; in Tables 24, 25, and 26, we provide the associated results for the
first data set.
The two dyadic degree-based attributes, DWND and DWOD, fail to predict
revenue relations well, whereas the four node degree-based and six node centrality-based
attributes produce results nearly as good as those from using all 12 attributes together.
The poor performance of dyadic degree-based attributes may be due to their
reliance on the local (pairwise) flow of citations between the two companies. This
localized property of the dyadic attributes may fail to capture the relative importance of
the two companies, which is formed by all the citations they receive from or provide to
many other nodes in the network. The more global node degree- and node centrality-
based measures therefore better predict CRR.
Table 24.
Classification Results of CRR Using DWND and DWOD

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 52.6% | 99.2% | 52.6%
                      | 1 | 54.5% |  1.1% |
Decision tree         | 0 | 52.6% | 97.1% | 52.5%
                      | 1 | 49.1% |  3.1% |
Table 25.
Classification Results of CRR Using Source NWID, Source NWOD, Target NWID, and Target NWOD

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 71.3% | 84.1% | 73.8%
                      | 1 | 78.0% | 62.4% |
Decision tree         | 0 | 80.1% | 80.9% | 79.4%
                      | 1 | 78.6% | 77.7% |
Table 26.
Classification Results of CRR Using Source Pagerank, Source Hits, Source Betweenness, Target Pagerank, Target Hits, and Target Betweenness

Classification Method | Revenue Relation | Precision | Recall | Accuracy
Logistic regression   | 0 | 74.6% | 77.6% | 74.3%
                      | 1 | 74.0% | 70.7% |
Decision tree         | 0 | 80.2% | 80.0% | 79.1%
                      | 1 | 77.9% | 78.1% |

8.4.3.3 Discriminant Variate
At the first step of the discriminant analysis using the 1000 pairs with 2000
unique companies, before adding the first IV into the model, we find that ten IVs (four
node degree-based and six centrality-based) are significant (with significance equal to or
less than 0.05) and the two dyadic degree-based IVs are not. The result for dyadic
degree-based IVs is consistent with what we see in Table 24: those IVs produce very
poor prediction results. The first IV included in the discriminant model is the source_hits
score, as it has the largest score statistic. After including source_hits and repeating
the evaluation procedure, the second IV to be added is target_hits. At this step, the
remaining eight IVs that were significant before the first IV was included become
insignificant due to high multicollinearity among the IVs (i.e., hits, pagerank,
betweenness, NWID, and NWOD). This high multicollinearity also explains the similar
performance achieved by the different sets of IVs in Tables 25 and 26. The coefficient β
for source_hits is negative (-1863.7) and that for target_hits is positive (1627.5), which
indicates that an increase in source_hits decreases the likelihood of positive CRR,
whereas an increase in target_hits increases it. In other words, the global HITS-based
centrality of the source or target company is indicative of its relative revenue. Hence,
the global centrality-based hits metrics for the source and target companies constitute
the discriminant variate that can significantly discriminate between positive and
negative CRRs. The prediction results obtained using the discriminant model (with a
constant and the two IVs, source_hits and target_hits) are shown in Table 27.
Compared with Tables 23, 25, and 26, Table 27 shows inferior results, indicating
that adding more IVs can improve prediction performance (the main focus of this chapter).
Table 27.
Prediction Results for Discriminant Model with Two IVs

Discriminant Model  | Revenue Relation | Precision | Recall | Accuracy
Logistic regression | 0 | 69.4% | 54.9% | 66.8%
                    | 1 | 64.2% | 68.3% |
8.4.4 Predicting CRR with Quarterly Revenues
With the second data set, we also report the CRR prediction performance on the
basis of quarterly revenues. We present the CRR prediction results in Table 28; the
CRRs are determined by revenues of Q4 2005. The prediction performance is very
similar to that in Table 23, which is generated on the basis of annual revenues.
8.4.5 Predicting Top-N Companies by Revenue
We now consider the related problem of predicting whether a company will fall
within the set of top-N companies by revenue (in dollars). Because we are no longer
interested in the direct relation between a pair of companies, we do not use the dyadic
attributes in these predictive methods. We employ five node-level attributes for each
company in the network (listed in the caption of Figure 26). The class label to be
predicted takes a value of 1 if the company is a top-N company by revenue and 0
otherwise. Again, we base all performance measurements on 10-fold cross-validation.
Figures 26 and 27 show the performances of the two classification methods as N varies
from 100 to 1000 with a step size of 100.
Table 28.
Classification Results of CRR with 12 Attributes (Second Data Set)

Classification Method | Class Label (CRR) | Precision | Recall | Accuracy
Logistic regression   | 0 | 75.0% | 80.1% | 75.5%
                      | 1 | 76.1% | 70.4% |
Decision tree         | 0 | 76.4% | 76.2% | 75.4%
                      | 1 | 74.3% | 74.6% |
The two classification methods produce similar results. Performance for predicting the
negatives (i.e., a company is not in the set of top-N companies) is high, with precision
and recall (for both methods) in the range of 89–99%. However, precision for predicting
the positives is in the range of 57–75%, and recall is substantially lower (24–36%). We
observe similar results with the second data set; for the negatives, both precision and
recall are between 88% and 99%, whereas for the positives, precision is 65–76% and
recall is 22–35%. Although these positive prediction performances may seem rather low,
they should be judged with the knowledge that the top-N companies, where N varies
from 100 to 1000, constitute only 1.6–16% of the total number of companies in the two
data sets. That is, the problem of correctly identifying a company in the set of top-N
companies by revenue is particularly hard, whereas identifying a company that is not in
the top-N is easier because most companies fall into this category. Given the high prior
probability of negatives, our results for this problem are encouraging.
Figure 26. Precision and Recall for Logistic Regression in Predicting Top-N Companies (precision and recall for classes 0 and 1, N = 100 to 1000)
Figure 27. Precision and Recall for Decision Tree in Predicting Top-N Companies (precision and recall for classes 0 and 1, N = 100 to 1000)
8.4.6 Analysis for CRR Flips
8.4.6.1 Analysis for CRR Flips on the Basis of Annual Revenues
Table 29 shows that for annual revenue-based CRRs, precision, recall, and accuracy are
in the 70–80% range for all 75,709 pairs and around 60% for flip pairs. Given that about 4% of all
pairs experienced CRR flips, a naïve technique that assumes current year’s (t) CRRs to be
the same as last year’s (t-1) will achieve an accuracy of 96%. However, such a high
accuracy would be at the cost of failure to detect any CRR flips (i.e., 0% precision, recall,
and accuracy among the flip pairs). In contrast, our approach is able to achieve precision,
recall, and accuracy of about 60% on the flip pairs. The flip pairs, due to their more
dynamic nature, constitute the more interesting part of the data set. Moreover, it is
important to note that our approach does not resort to any financial data when predicting
the CRR. This is a desirable property that would allow the approach to be easily extended
to private and/or foreign companies where it is harder or impossible to find accurate
financial data. Another naïve approach, which classifies company pairs as positive or
negative CRR randomly with 50% probability, would achieve 50% accuracy on flip
pairs as well as on all pairs. Our approach clearly performs better than such a random
approach on both flip pairs and all pairs.
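The contrast between overall accuracy and flip-pair performance can be made concrete with a small sketch; the labels below are invented for illustration (with a 4% flip rate, as in the text) and are not drawn from the data set:

```python
# Illustrative sketch (not the dissertation's code): a carry-forward baseline
# that predicts CRR(t) = CRR(t-1) scores high overall accuracy yet detects no
# flips. The 100 labels are invented, with 4 of 100 pairs flipping.

def accuracy(pred, true):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, true)) / len(true)

crr_prev = [0] * 50 + [1] * 50        # CRR labels at time t-1
crr_now = list(crr_prev)              # CRR labels at time t
flip_idx = [0, 1, 50, 51]             # the 4 pairs that flip between t-1 and t
for i in flip_idx:
    crr_now[i] = 1 - crr_now[i]

naive = crr_prev                      # carry last period's CRR forward
print(accuracy(naive, crr_now))       # 0.96 on all pairs ...

flips_pred = [naive[i] for i in flip_idx]
flips_true = [crr_now[i] for i in flip_idx]
print(accuracy(flips_pred, flips_true))  # ... but 0.0 on the flip pairs
```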
Table 30 lists three sample flip pairs with annual revenues at times t-1 and t. The
first two pairs’ CRRs flip from 0 to 1, and the third pair demonstrates a flip from 1 to 0.
Table 29.
Classification Results of All Pairs and the Flip Pairs (with Annual Revenues)

                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive*   79.2%    75.1%     57.8%    64.0%
Recall for positive       78.6%    71.5%     55.7%    56.5%
Precision for negative    80.9%    75.4%     58.2%    61.9%
Recall for negative       81.5%    78.7%     60.3%    68.9%
Accuracy                  80.1%    75.3%     58.0%    62.8%
* Positive means that the CRR flip is from 0 to 1.
Table 30.
Sample Flip Pairs

Pair (tickers)  Company1 (Sector1)                              Revenue1* at t-1  Revenue1 at t  Company2 (Sector2)         Revenue2 at t-1  Revenue2 at t  Flip (t-1, t)
MRGE, NGPS      Merge Technologies Inc. (Technology)            37.0              72.1           NovAtel Inc. (Technology)  44.8             54.6           0->1
JNPR, XLNX      Juniper Networks, Inc. (Technology)             1336              2060           Xilinx Inc. (Technology)   1573             1640           0->1
MSO, CTRN       Martha Stewart Living Omnimedia Inc. (Service)  187.4             209.5          Citi Trends (Service)      157.2            289.8          1->0
* Revenue in million dollars.
8.4.6.2 Analysis for CRR Flips on the Basis of Quarterly Revenues
Table 31 shows that for quarterly revenue-based CRRs (Q4 2005 as current time t
and Q1 2006 as future time t+1), the precision, recall, and accuracy are around 80% for
all pairs and close to 60% for flip pairs by DT. Compared with DT, LR produces slightly
inferior results. The results are consistent with those seen in Table 29 for annual
revenue-based CRRs.
When measuring prediction performance on CRR flips using Q4 2005 as t and Q2
2006 as t+1, the results are shown in Table 32. Compared with results in Table 31, we
find that the prediction performance for flip pairs in Table 32 drops. This may be
explained by the fact that as the difference in time between news and target CRR (i.e.,
CRR to be predicted) increases, the power to predict CRR among flip pairs decreases.
Table 31.
Classification Results of All Pairs and the Flip Pairs (with Quarterly Revenues of Q4
2005 and Q1 2006)
                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive    80.5%    75.2%     54.2%    47.5%
Recall for positive       81.0%    78.1%     57.0%    55.9%
Precision for negative    81.3%    77.8%     60.9%    54.9%
Recall for negative       80.8%    74.9%     58.1%    46.5%
Accuracy                  80.9%    76.4%     57.6%    50.9%
Table 32.
Classification Results of All Pairs and the Flip Pairs (with Quarterly Revenues of Q4
2005 and Q2 2006)
                          All pairs          Flip pairs
Performance\Classifier    DT       LR        DT       LR
Precision for positive    79.0%    74.8%     50.7%    45.5%
Recall for positive       78.7%    74.2%     56.3%    47.0%
Precision for negative    80.1%    76.0%     57.1%    51.6%
Recall for negative       80.4%    76.6%     51.4%    50.1%
Accuracy                  79.6%    75.4%     53.7%    48.6%
8.5 Discussions
We propose a news-driven, SNA-based business relationship discovery approach
to explore the predictive value of business news in discerning revenue relationships
between companies. Our approach uses citations in news stories to understand the
direction and strength of the relative importance between a pair of companies. In our
intercompany network, nodes are companies, and links are directed and weighted on the
basis of the direction and frequency of citations in news stories. We identify and quantify
various attributes of the network using standard network analysis metrics and suggest
modified or new metrics as needed (e.g., DWND). We then use these attributes to predict
the (future) relative revenue relation between a pair of companies as an example of
business relationships the approach might predict. We also examine the prediction
performance for flip pairs and investigate whether we can predict if a given company
falls into the set of top-N companies by revenue. We process and employ two sets of
multimonth data from the online business news available at Yahoo! Finance. Both data
sets reaffirm the robustness of our findings on the basis of annual and quarterly revenues.
Applying discriminant analysis we identify a set of significant IVs. Moreover, our
approach is intrinsically language independent and can be extended to news in various
languages.
Similar to many other networks constructed from the Internet, we find that
various attributes of our network, such as NID, NOD, NWID, NWOD, and link weight,
follow the power law distribution. By exploring the relation between DWND and positive
CRR, we find that company pairs with large DWND tend to be associated with positive
CRR. Hence, as expected, the DWND metric (at least for large values) captures the
overall flow of revenue (importance) between a pair of companies.
We study the CRR prediction problem by using three groups of attributes
together, as well as individual groups separately. Different groups of attributes vary in the
range of the network covered for their computations. More global measures, such as node
degree- and node centrality-based attributes, are better predictors of CRR than are the
dyadic degree-based attributes that concentrate only on pairwise relationships and ignore
the rest of the network. In terms of CRR prediction performance, the precision, recall,
and accuracy are in the range of 70–80% for all pairs and are about 60% for flip pairs.
With regard to predicting whether a company’s revenue falls among the top-N,
the precision for predicting the positives (top-N) is much higher than the recall. These
results may seem humble until we consider them in the context of the prior distributions
in the data sets. Considering that only a small percentage of companies fall into the set of
top-N companies by revenue, a precision value in the range of 57–75%, as we achieve, is
encouraging. If our predictive models randomly assign companies to the top-N, the
precision for predicting positives should not exceed 16%.
Our approach thus can not only serve as a data filtering step for analysts but also
be useful for tracing and monitoring the dynamics of revenue relations for many
companies over time. We plan to further validate our approach with a variety of business
relationships, news from different languages (and countries), various types of companies
(e.g., private versus public), and over time. Further research might also attempt to derive
and evaluate additional graph attributes that synthesize the global and dyadic measures
that represent more effective predictors of business relationships between a pair of
companies.
CHAPTER 9
DISCOVERING COMPETITOR RELATIONSHIPS
9.1 Approach Outline and Research Questions
Figure 28 outlines the five main steps of our approach to competitor discovery.
The first two steps have been explained in Figure 18 in Section 6.1. In step 3, as a
preliminary investigation, we first examine the citation-based intercompany network for
both its competitor coverage (coverage of known competitors) and competitor density
(the likelihood of finding competitors among the linked company pairs in the network).
We benchmark this preliminary investigation against an exhaustive as well as a random
search to provide a comparative analysis of a citation-based intercompany network in
terms of search cost. We find that competitor relationship discovery is especially
challenging in portions of our data set where the number of non-competitor pairs
overwhelms the number of competitor pairs. We use a combination of data from Hoover’s
and Mergent as our gold standards for evaluation purposes.
This study focuses on the following two research questions:
1. How well can we discover competitor relationships between companies using four
types of attributes derived from the intercompany network? Using special
classification techniques, we report the classification performance for an
imbalanced data set where the number of noncompetitor pairs overwhelms the
number of competitor pairs.

Figure 28. Process View of the Competitor Discovery Approach
2. To what extent can a gold standard cover the set of all competitors, and to what
extent does the proposed approach extend the knowledge covered by a gold
standard? We use Hoover’s and Mergent as gold standards for identifying
competitors, though we are keenly aware that these data sets are incomplete and
inconsistent, as we have illustrated. Therefore, we estimate their coverage on all
competitor pairs and propose metrics to estimate the extension offered by our
approach for each gold standard data source.
9.2 Data Sets
In the following two subsections, we introduce two data sets that will be used to
evaluate competitor classification performance. The first data set represents a whole set
of pairs in the network, and the second is created to represent the imbalanced part of the
whole data set.
9.2.1 Data Set I
We first use DWND (net flow of citations between a pair of companies) to
identify all distinct (linked) company pairs in the network; namely, we include only pairs
with non-negative DWND values, and for any link <ni, nj> with a DWND value of 0, we
ignore the opposite link <nj, ni>. In other words, all distinct company pairs in the
intercompany network that have any citations between them are identified. With this
method, we would identify a total of eight links in Figure 19 in Section 7.1. For the entire
intercompany network, we identify a total of 87,340 company pairs. Next, we sort the
pairs by their DWIOD values, which range from 1 to 990, in descending order, because
DWIOD captures the total volume of citations between two companies in news.
Therefore, more citations in news stories should increase the likelihood that two
companies have a business relationship. In terms of DWIOD values, the data set is
skewed; most company pairs have small DWIOD values. To examine competitor
relationships, we group company pairs with the same or similar DWIOD values by
dividing them into baskets, such that links with different DWIOD values do not appear in
the same basket unless the basket contains fewer than 200 pairs. This procedure results in
21 baskets associated with different DWIOD values. We randomly choose 40 pairs from
each basket, and the 840 pairs (40 × 21) constitute data set I, which we use to examine
the classification performance of the individual baskets in Section 9.4.
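The basket-construction rule described above can be sketched as follows; the 200-pair minimum follows the text, while the helper name and toy data are assumptions for illustration:

```python
# Hedged sketch of the basketing step: sort pairs by DWIOD in descending
# order and keep pairs with equal DWIOD together, merging consecutive DWIOD
# groups until a basket reaches the minimum size.
from itertools import groupby

def make_baskets(pairs_with_dwiod, min_size=200):
    """pairs_with_dwiod: iterable of (pair, dwiod); returns a list of baskets."""
    ordered = sorted(pairs_with_dwiod, key=lambda x: -x[1])
    baskets, current = [], []
    for _, group in groupby(ordered, key=lambda x: x[1]):
        current.extend(group)             # a whole DWIOD group goes in together
        if len(current) >= min_size:      # close the basket once large enough
            baskets.append(current)
            current = []
    if current:                           # leftover small groups form the tail
        baskets.append(current)
    return baskets

# Toy run with a 3-pair minimum instead of 200:
toy = [("a", 5), ("b", 5), ("c", 4), ("d", 4), ("e", 3), ("f", 2)]
print([len(b) for b in make_baskets(toy, min_size=3)])   # [4, 2]
```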
We manually determine whether each of the 840 company pairs in the 21 sample
baskets is a competitor pair using the Hoover’s and Mergent sources. If we find a
competitor relationship between the two companies according to either Hoover’s or
Mergent, we assign the pair a class label of 1 (positive instance); otherwise, it receives a
class label of 0 (negative instance). In Table 33, we show the DWIOD range and size of
each basket, as well as the number and percentage of competitor pairs in the 21 sample
baskets. As this table illustrates, higher DWIOD values tend to be associated with a
higher percentage of competitor pairs in a sample basket, in line with our intuition that as
the overall volume of citations between a pair of companies increases, the likelihood that
the companies have a business relationship (e.g., competitors) increases.
Table 33.
Distribution of Competitor Pairs in 21 Sample Baskets

Positives are reported as number (percent) in each 40-pair sample basket.

Basket  DWIOD range  Basket size  By Hoover's  By Mergent  By union    By intersection
1       [69, 990]    200          26 (65.0%)   11 (27.5%)  26 (65.0%)  11 (27.5%)
2       [44, 68]     209          19 (47.5%)    9 (22.5%)  19 (47.5%)   9 (22.5%)
3       [32, 43]     224          17 (42.5%)    6 (15.0%)  17 (42.5%)   6 (15.0%)
4       [26, 31]     239          14 (35.0%)    4 (10.0%)  15 (37.5%)   3 (7.5%)
5       [22, 25]     212          14 (35.0%)    8 (20.0%)  15 (37.5%)   7 (17.5%)
6       [19, 21]     235          17 (42.0%)    6 (15.0%)  18 (45.0%)   5 (12.5%)
7       [17, 18]     224           8 (20.0%)    5 (12.5%)  11 (27.5%)   2 (5.0%)
8       [15, 16]     281          13 (32.5%)    6 (15.0%)  13 (32.5%)   6 (15.0%)
9       [13, 14]     389          10 (25.0%)    4 (10.0%)  10 (25.0%)   4 (10.0%)
10      12           263          16 (40.0%)    3 (7.5%)   17 (42.5%)   2 (5.0%)
11      11           330           8 (20.0%)    4 (10.0%)   9 (22.5%)   3 (7.5%)
12      10           410           8 (20.0%)    2 (5.0%)    8 (20.0%)   2 (5.0%)
13      9            470           8 (20.0%)    3 (7.5%)    8 (20.0%)   3 (7.5%)
14      8            622          13 (32.5%)    6 (15.0%)  13 (32.5%)   6 (15.0%)
15      7            769          10 (25.0%)    3 (7.5%)   11 (27.5%)   2 (5.0%)
16      6            1,390         5 (12.5%)    3 (7.5%)    6 (15.0%)   2 (5.0%)
17      5            1,543         5 (12.5%)    2 (5.0%)    5 (12.5%)   2 (5.0%)
18      4            4,142         4 (10.0%)    0 (0.0%)    4 (10.0%)   0 (0.0%)
19      3            4,972         2 (5.0%)     2 (5.0%)    4 (10.0%)   0 (0.0%)
20      2            29,603        1 (2.5%)     0 (0.0%)    1 (2.5%)    0 (0.0%)
21      1            40,613        0 (0.0%)     0 (0.0%)    0 (0.0%)    0 (0.0%)
Total                87,340       218           87         230          75
9.2.2 Data Sets II and III
In an imbalanced data set, most instances occur in one class, whereas the minority
is labeled as the other class, and the latter typically is the more important class [Kotsiantis
et al. 2006].
According to Table 33, several sample baskets have low percentages of positives
and therefore can be considered imbalanced data sets. As prior research [e.g., Weiss and
Provost 2003], as well as our results in Section 9.4, empirically show, typical
classification methods fail to detect the minority in an imbalanced data set and
generate poor precision and recall (e.g., close to 0%) for positives, which in this study
mean the competitor pairs. The main reason for this poor performance is that the
classifiers, by default, maximize accuracy and therefore give more weight to majority
classes than minority ones [Kotsiantis et al. 2006]. For example, for a data set with 1%
positives, simply assigning every instance a negative label and not detecting any positives
achieves an accuracy of 99%. To handle the imbalanced data set problem, we first create
a larger data set, data set II, by proportionally (according to basket size) sampling a total
of 2000 pairs from the four imbalanced baskets (18, 19, 20, and 21) with the lowest ratio
of positives (≤10%). We manually label the 2000 pairs using Hoover’s and Mergent. The
numbers and percentages of competitors according to the different gold standards appear
in Table 34.
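The proportional allocation can be checked with a quick sketch; the four basket sizes come from Table 33, and simple rounding is an assumed allocation scheme:

```python
# Sketch of sampling 2000 pairs from baskets 18-21 in proportion to basket
# size (sizes from Table 33); the rounding scheme is an assumption.
sizes = {18: 4142, 19: 4972, 20: 29603, 21: 40613}
total = sum(sizes.values())                       # 79,330 pairs overall
alloc = {b: round(2000 * n / total) for b, n in sizes.items()}
print(alloc)   # {18: 104, 19: 125, 20: 746, 21: 1024}
```

The resulting allocation (104, 125, 746, 1024) agrees with the sample basket sizes reported in Table 34 (104, 125, 747, 1,024) to within one pair of rounding drift.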
For further future analysis, in addition to data sets I and II, we also use 17 baskets
(1–17) in data set I and all pairs in data set II to produce estimated overall performance
results. For convenience, we call this combination of the two data sets data set III, which
contains 18 baskets, and data set II provides the 18th sample basket.
Table 34.
Number (Percentage) of Positive Pairs in Data Set II

Positives are reported as number (percent) in each sample basket.

DWIOD  Sample basket size  By Hoover's  By Mergent  By union     By intersection
1      1,024               22 (2.1%)    15 (1.5%)    29 (2.8%)    8 (0.8%)
2      747                 30 (4.0%)    13 (1.7%)    39 (5.2%)    4 (0.5%)
3      125                 12 (9.6%)     3 (2.4%)    14 (11.2%)   1 (0.8%)
4      104                 15 (14.4%)    7 (6.7%)    18 (17.3%)   4 (3.8%)
Total  2,000               79 (4.0%)    38 (1.9%)   100 (5.0%)   17 (0.9%)
9.3 Examining Competitor Coverage and Density of the Intercompany Network
In this section, we examine two issues: the completeness of the intercompany
network in its coverage of competitor pairs (i.e., competitor coverage), and the likelihood
of competitor pairs being linked in the intercompany network (i.e., competitor density).
These issues clarify the extent to which “competitor semantics” are embedded in the links
of the constructed network. Greater competitor coverage and competitor density in the
intercompany network lower the cost of searching for (and classifying) competitors by
using the network. Because we lack an ideal benchmark of intercompany networks from
other approaches, we benchmark the competitor coverage of the intercompany network
against that of an exhaustive network (clique) in which all nodes link to one another and
compare the competitor density of the intercompany network with that of a random
network having the same numbers of nodes and links as those of the intercompany
network. Table 35 includes notation we use to examine competitor coverage and
competitor density.
Table 35.
Notation for Competitor Coverage and Competitor Density

Notation                   Interpretation
K                          Number of unique companies in a sample basket that has 40 company pairs.
CL                         Citation-based links among the K companies in the intercompany network.
EL                         Exhaustive links among the K companies.
CP(CL)                     Number of competitor pairs (CP) present in CL.
CP(EL)                     Number of competitor pairs present in EL.
Competitor coverage ratio  = CP(CL)/CP(EL), or the proportion of all known competitor pairs that are present as links in a citation-based intercompany network.
CP40(CL)                   Number of competitor pairs present in 40 links from a sample basket.
RL                         Randomly generated company links from the K companies.
CP40(RL)                   Number of competitor pairs present in 40 randomly generated links.
CD40(CL)                   = CP40(CL)/40, or competitor density for a small citation-based network that consists of the 40 links from a sample basket.
CD40(RL)                   = CP40(RL)/40, or competitor density for a random network that consists of 40 random links.
CD(EL)                     = CP(EL)/(K*(K-1)), or competitor density for an exhaustive network (clique) that consists of the exhaustive links.
9.3.1 Examining the Competitor Coverage
From 40 company pairs in each sample basket in data set I, we identify K and EL.
From the whole intercompany network, we further find CL. In addition, we identify
CP(CL) and CP(EL) through the union of the Hoover’s and Mergent data. In Figure 29,
we depict the competitor coverage ratio for the intercompany network across the 21
sample baskets; it is always greater than 66% and typically in the range of 87–100%
across the sample baskets. We also note that CL is a fraction of EL, ranging from 15% to
84% across the sample baskets. In other words, while our citation-based intercompany
network covers most of the competitor pairs found in an exhaustive network, for most
sample baskets it is much smaller as compared to the exhaustive network. Therefore, our
[Chart: competitor coverage ratio across baskets 1 to 21]
Figure 29. Competitor Coverage Ratio
classification models (in Section 9.4) explore a small subspace of all possible
relationships by using the intercompany network, and the subspace covers most of the
competitor pairs.
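The Table 35 quantities reduce to simple ratios; a minimal sketch follows, in which K, CP(EL), and CP(CL) are invented for illustration rather than taken from any sample basket:

```python
# Sketch of the coverage and density quantities from Table 35; all counts
# below are assumptions for illustration.

def coverage_ratio(cp_cl, cp_el):
    """CP(CL)/CP(EL): share of known competitor pairs that appear as links."""
    return cp_cl / cp_el

def density_exhaustive(cp_el, k):
    """CD(EL) = CP(EL)/(K*(K-1)): competitor density of a clique on K nodes."""
    return cp_el / (k * (k - 1))

K = 60       # unique companies in a sample basket (assumed)
CP_EL = 25   # competitor pairs among the exhaustive links (assumed)
CP_CL = 22   # of those, pairs also linked by citations (assumed)

print(coverage_ratio(CP_CL, CP_EL))      # 0.88: high coverage
print(density_exhaustive(CP_EL, K))      # ~0.007: sparse in the clique
```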
9.3.2 Examining the Competitor Density
Using the union of data from Hoover’s and Mergent, we label 40 company pairs
in each sample basket to find CP40(CL). Given K, we randomly generate 40 links from
the K unique companies and find CP40(RL). We repeat the random link generation and
link labeling procedures four times to obtain an average CP40(RL). Then, we compute the
competitor density CD40(CL) and average CD40(RL) for all sample baskets. Moreover,
because we know CP(EL), we can calculate CD(EL). Figure 30 provides the competitor
density for the citation-based intercompany network, random network, and exhaustive
[Chart: CD40(CL), CD40(RL), and CD(EL) across the sample baskets]
Figure 30. Probability of Being a Competitor Pair
network across the 21 sample baskets. The curve for the average CD40(RL) is very close
to that of CD(EL), which indicates that the probability of finding a competitor pair in the
randomly generated 40 pairs is consistent with that in the exhaustive links. Moreover,
CD40(CL) is much higher than the average CD40(RL) and CD(EL) in 20 of the 21 sample
baskets. The difference in these probabilities suggests that pairs in the intercompany
network for most baskets are much more likely to be competitor pairs than those in the
random links. The high competitor density in the intercompany network for most sample
baskets therefore would benefit the classifiers in a competitor classification.
The results in Sections 9.3.1 and 9.3.2 show that the citation-based intercompany
network has high competitor coverage and density and therefore can alleviate the
problems associated with searching for competitors in an exhaustive or random space of
potential relationships. The results also confirm our intuition that links in the citation-
based intercompany network contain signals about competitor relationships instead of
being random.
9.4 Competitor Discovery
Our competitor classification models use four types of attributes to classify a
company pair as competitors or noncompetitors. Because the class label (dependent
variable) in the models is binary by nature, we can apply a variety of standard binary
classification models. As is common in machine learning, we use part of the data set for
training and leave a disjoint testing set to evaluate the discriminating power of the
models. We repeat this training–testing process several times with different data splits
(cross-validation) to ensure the robustness of observed results. Using several standard
metrics, which we describe next, we evaluate the discriminating power of the models.
9.4.1 Evaluation Metrics
Table 36 is the confusion matrix containing the actual and classified classes for a
classification problem with two class labels. TP refers to the number of true positives, TN
is the number of true negatives, FP is the number of false positives, and FN represents
the number of false negatives.
Table 36.
Confusion Matrix
                      Classified class label
Actual class label    Positive    Negative
Positive              TP          FN
Negative              FP          TN
Using the confusion matrix we introduce the common metrics for evaluating and
comparing classification performance as follows:
Precision = TP / (TP + FP)                                                (11)

Recall (TP rate) = TP / (TP + FN)                                         (12)

FP rate = FP / (FP + TN)                                                  (13)

F-measure = ((1 + α) × Precision × Recall) / (α × Precision + Recall)     (14)

Accuracy = (TP + TN) / (TP + TN + FP + FN)                                (15)
In most classification problems, precision and recall present a trade-off, because
when a model prioritizes a conservative approach to boost the precision, it misses some
competitors, which reduces its recall. An F-measure is based on both precision and recall,
and the parameter α denotes the relative importance of recall versus precision. F1 is the
harmonic mean of precision and recall.
One of the most common metrics to evaluate classifiers for an imbalanced data set
is the receiver operating characteristics (ROC) curve [Kotsiantis et al. 2006], a two-
dimensional curve with TP rate (recall) on the y-axis and FP rate on the x-axis (for
specific examples, see Figure 33 in Section 9.4.5). Thus, a ROC curve can address an
important tradeoff—namely, the number of correctly identified positives increases at the
expense of introducing additional false positives. The area under the ROC curve,
called AUC, offers another evaluation metric.
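These metrics can be written out directly from the confusion matrix; a minimal sketch with invented counts:

```python
# Standard confusion-matrix metrics; the counts are invented for illustration.

def metrics(tp, fn, fp, tn):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # TP rate, the y-axis of an ROC curve
    fp_rate = fp / (fp + tn)         # the x-axis of an ROC curve
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, fp_rate, f1, accuracy

p, r, fpr, f1, acc = metrics(tp=30, fn=20, fp=10, tn=940)
print(round(p, 3), round(r, 3), round(f1, 3), round(acc, 3))
# 0.75 0.6 0.667 0.97
```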
9.4.2 Competitor Classification with Data Set I
Using the publicly available Weka API [Witten and Frank 2005], we employ four
classification methods: artificial neural network (ANN), Bayes net (BN), C4.5 decision
tree (DT), and logistic regression (LR) to classify company pairs. Models based on ANN,
BN, and DT are common classifiers in data mining, and LR frequently appears in
business research to address problems with a binary class label (as in our competitor
classification problem). For each sample basket, except for 21, which does not contain
any competitor pairs (we address this basket, together with three other baskets as the
imbalanced data set II, in the next subsection), we report the average precision and recall
generated by 10-fold cross-validation for each classification method. We use different
classification methods to compare their performances for our application.
9.4.3 Competitor Classification with Data Set II
9.4.3.1 Background on Handling Imbalanced Data Set
Solutions to handling imbalanced data sets for classification problems exist at
both data and algorithmic levels. Several data-level solutions use different resampling
approaches, such as undersampling majority, oversampling minority, or oversampling
minority by creating a synthetic minority [Chawla et al. 2002], which changes the prior
distribution of the original data set [Kotsiantis et al. 2006] before learning from the data
set. Another approach at the data level segments the whole data into disjoint regions, such
that the data in certain region(s) are no longer imbalanced [Weiss 2004].
Some popular solutions at the algorithmic level include the following:
Decision threshold adjustment (DTA), which, given a (normalized) probability of
an instance being positive (or negative), changes the probability threshold used to
determine the class label of the instance [Kotsiantis et al. 2006].
Cost-sensitive learning (CSL), which assigns fixed and unequal costs to different
misclassifications, such as cost(false negative) > cost(false positive), to minimize
the misclassifications of positives [Pazzani et al. 1994].
Recognition-based learning (RBL), which, unlike a two-class classification
method that learns rules for both positive and negative classes, is a one-class
learning method and learns only rules that classify the minority [Weiss 2004;
Kotsiantis et al. 2006].
We employ several of these techniques to address our imbalanced data set.
Specifically, we divide the whole data set into 21 baskets on the basis of DWIOD, and
many of these turn out to be more “balanced” than the entire data set, so it matches the
segment data approach [Weiss 2004] for handling imbalanced data sets. For the few
imbalanced baskets, we sample more instances to form our imbalanced data set II. Next
we apply two different approaches, the simple DTA approach and an undersampling-ensemble
(UE) method (explained in subsection 9.4.3.3), to address the imbalanced data set
problem. We do not choose the CSL approach, mostly because we do not know the right
ratio for the cost of FN versus the cost of FP in the context of our competitor
classification problem. However, we consider DTA and CSL to be very similar, in that
they both create a bias toward positive classifications. For data set II, we report various
performance metrics suited for an imbalanced data set, including F1, precision, TP rate,
FP rate, ROC, AUC, and accuracy. We introduce the two approaches (DTA and UE) for
dealing with classification of imbalanced data in detail next.
9.4.3.2 DTA Approach
With this approach, we simply adjust the decision threshold used by a classifier to
determine whether to classify an instance as positive or negative, given its (normalized)
probability of being positive. For example, given that Pr(x is positive) = 0.3, the instance
x is labeled negative when the decision threshold is 0.5. However, when the threshold is
adjusted to 0.2, x is classified as positive.
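The threshold adjustment itself is a one-line rule; a sketch with invented probabilities:

```python
# DTA sketch: relabel instances by comparing their probability of being
# positive against an adjustable threshold. Scores are invented.

def classify(prob_positive, threshold=0.5):
    return [1 if p >= threshold else 0 for p in prob_positive]

scores = [0.72, 0.30, 0.45, 0.10, 0.55]      # Pr(pair is a competitor)

print(classify(scores))                      # [1, 0, 0, 0, 1] at the default 0.5
print(classify(scores, threshold=0.2))       # [1, 1, 1, 0, 1]: more positives
```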
For training and testing, we follow strict tuning procedures suggested in [Salzberg
1997]. In particular, we randomly select 1500 instances as a training set from the
imbalanced data set and the remaining 500 as the testing set. For each classification
method, we use 10-fold cross validation and tune the input parameters to observe the best
performance on the F1 measure with just the training set. Finally, we apply each trained
classifier with its respective “best” parameter setting to the testing set for evaluation
purposes. Moreover, to determine robustness, we randomly divide the 2000 pairs into
four disjoint sets of equal size, which form four different pairs of training and testing sets.
We then apply the training–tuning–testing procedures to the four pairs of training and
testing sets and report the average results (see the formulas in Section 9.4.5). In each
case, training and parameter tuning relies solely on the training data set, whereas our
evaluation of the trained and tuned classifier uses only the testing data set. For ANN, we
tune the learning rate from 0.1 to 1.0 and momentum from 0.1 to 0.3; for BN, we choose
K2 [Cooper and Herskovitz 1992] and TAN [Friedman et al. 1997] as algorithms for the
search network structure; for DT, we change the minimum leaf size from 2 to 10; and we
require no parameter tuning for LR. For all other parameters, we accept the default from
Weka. We apply the same tuning procedures throughout the study whenever we use
parameter tuning.
9.4.3.3 UE Approach
From the original imbalanced data set II, we generate multiple, smaller, more
balanced subdata sets by duplicating all minority (positive) instances in each subset and
then evenly splitting the majority into those subsets, as we depict in Figure 31. We build
a classifier from each subset and use an ensemble approach [Estabrooks and Japkowicz
2001] to generate the final classification result. Chan and Stolfo [1998] adopt a similar
undersampling method.

Figure 31. Generating More Balanced Subdata Sets

We choose the majority vote as the ensemble approach, and for
the majority vote, we use the binary output (0 or 1) of each classifier and the probability
output (a value between 0 and 1) of each classifier, denoted as the majority vote by count
(MVC) and majority vote by probability (MVP), respectively.
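The subset construction and the MVC vote can be sketched as follows; the splitting rule and toy instances are illustrative, not the exact Weka-based procedure:

```python
# Hedged UE sketch: each subset keeps all minority (positive) instances plus
# an even slice of the majority; a simple majority vote by count (MVC)
# combines the per-subset classifiers.

def make_subsets(positives, negatives, n_subsets):
    """Duplicate all positives into every subset; split negatives evenly."""
    subsets = []
    for i in range(n_subsets):
        chunk = negatives[i::n_subsets]      # every n_subsets-th negative
        subsets.append(positives + chunk)
    return subsets

def majority_vote_by_count(votes):
    """votes: binary (0/1) outputs from the ensemble for one instance."""
    return 1 if sum(votes) > len(votes) / 2 else 0

pos = [("p1", 1), ("p2", 1)]
neg = [("n%d" % i, 0) for i in range(8)]
subs = make_subsets(pos, neg, n_subsets=4)
print([len(s) for s in subs])                # [4, 4, 4, 4]: 2 pos + 2 neg each
print(majority_vote_by_count([1, 0, 1, 1]))  # 1
```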
During the training phase, starting from the initial ratio of positives in the subsets, we
tune the parameters for each classifier (except for LR) and record its performance in an
output file. We repeat this procedure with different ratios of positives, which change from
0.05 to 0.60 with a step size of 0.05. From all output files, on the basis of the best
performance on the F1 measure, we determine a set of best parameters for a classifier and
a best ratio of positives. Finally, we apply the trained classifiers with their best parameter
settings and best ratios of positives to the testing set for evaluation. As in Section 9.4.3.2,
we divide the 2000 pairs into four disjoint sets of equal size, generate results separately
for the four pairs of training and testing sets, and report the average results.
9.4.4 Classification Performance for Data Set I
In Figure 32 we provide the precision and recall achieved by ANN for individual
sample baskets in data set I. For comparison, we also include the prior distribution of
positives in each sample basket. The precision curve is almost always above the prior
probability, except for the last two sample baskets with the lowest prior distributions
(5.0% and 2.5%). As Figure 32 shows, though for most baskets ANN’s classification
performance is reasonably good, it weakens when DWIOD values are very small (last
few baskets). This result highlights the inherent challenge of accurately classifying the
minority class for imbalanced data sets (the last few baskets). The other three
classification methods (BN, DT, and LR) show similar performance patterns but poorer
performance overall. We provide the results of applying special techniques to imbalanced
parts of the data set in the next subsection.
[Chart: precision, recall, and prior distribution of positives across baskets 1 to 20]
Figure 32. Precision and Recall of Data Set I by ANN and Prior Distribution
9.4.5 Classification Performance for Data Set II
In Table 37 we report precision, TP rate (recall), FP rate, F1, accuracy, and
AUC on training and testing sets for each classification method using the DTA
approach. Each bold number in the table indicates the best performance for a
measurement across the four classification models for the testing set. Since we have
four pairs of training (1500 instances) and testing (500 instances) sets, we generate and
report overall performance with the following equations, which are based on the
definitions in equations 11 to 15.
Table 37.
Classification Performance of Data Set II by DTA Approach

                        Without sector information    With sector information**
Data set   Performance  ANN    BN     DT     LR       ANN    BN     DT     LR
Training*  Precision    0.280  0.142  0.119  0.353    0.361  0.277  0.318  0.398
           Recall       0.227  0.277  0.467  0.220    0.443  0.520  0.403  0.410
           FP rate      0.031  0.088  0.182  0.021    0.041  0.071  0.045  0.033
           F1           0.250  0.188  0.190  0.271    0.398  0.362  0.356  0.404
           Accuracy     0.932  0.880  0.801  0.941    0.933  0.908  0.927  0.940
           AUC          0.753  0.703  0.656  0.756    0.870  0.863  0.740  0.865
Test       Precision    0.268  0.125  0.090  0.322    0.372  0.262  0.283  0.380
           Recall       0.220  0.240  0.400  0.190    0.420  0.430  0.360  0.380
           FP rate      0.032  0.088  0.213  0.021    0.037  0.064  0.048  0.033
           F1           0.242  0.164  0.147  0.239    0.394  0.326  0.317  0.380
           Accuracy     0.931  0.878  0.768  0.940    0.936  0.911  0.923  0.938
           AUC          0.736  0.672  0.610  0.723    0.858  0.853  0.741  0.834
* Results of the training set are based on the best performance on F1 with parameter tuning.
** Company’s sector used in Yahoo! Finance is included as an attribute.
Precision = Σ_i TP_i / (Σ_i TP_i + Σ_i FP_i)   (16)

Recall (TP rate) = Σ_i TP_i / (Σ_i TP_i + Σ_i FN_i)   (17)

FP rate = Σ_i FP_i / (Σ_i FP_i + Σ_i TN_i)   (18)

F1 = 2 × Precision × Recall / (Precision + Recall)   (19)

Accuracy = (Σ_i TP_i + Σ_i TN_i) / Σ_i (TP_i + TN_i + FP_i + FN_i)   (20)
In these equations, the definitions of TP, TN, FP, and FN are the same as those
in Section 9.4.1, and the subscript i represents a number between 1 and 4 to denote the
four disjoint testing sets from data set II.
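The pooling in equations 16–20 (sum the confusion counts over the four disjoint test sets, then compute each measure once) can be sketched as follows; the counts are illustrative:

```python
def pooled_metrics(folds):
    """Pool TP/FP/FN/TN counts over disjoint test sets, then compute each
    measure once from the pooled counts (equations 16-20 style)."""
    TP = sum(f["tp"] for f in folds)
    FP = sum(f["fp"] for f in folds)
    FN = sum(f["fn"] for f in folds)
    TN = sum(f["tn"] for f in folds)
    precision = TP / (TP + FP)
    recall = TP / (TP + FN)          # TP rate
    fp_rate = FP / (FP + TN)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (TP + TN) / (TP + FP + FN + TN)
    return precision, recall, fp_rate, f1, accuracy

# Four hypothetical 500-instance test sets
folds = [dict(tp=10, fp=5, fn=10, tn=475) for _ in range(4)]
p, r, fpr, f1, acc = pooled_metrics(folds)
```

Pooling counts first (rather than averaging per-set measures) keeps the overall measures consistent with their definitions on the combined 2,000 instances.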
Table 37 also contains results for the same data set with and without sector
information (sector encoded as a categorical variable with nine values). Using sector
information greatly improves the classification performance for data set II across the
four classifiers; for example, the maximum F1 measures (both produced by ANN)
increase by 63%. In contrast, for data set I we do not observe a significant difference
in the F1 measure across its 20 baskets when sector information is added (two-tailed
t-test, p = 0.827), which indicates that sector information is more helpful for
imbalanced data sets than for more balanced ones. We find that, for all 316 competitor
pairs in data set III (216 in the
17 sample baskets of data set I and 100 in data set II), a total of 282 (89.2%) pairs are
in the same sector and 34 (10.8%) are not.
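Because classifiers such as ANN and LR need numeric inputs, a nine-valued sector attribute is typically one-hot encoded; the sector labels below are placeholders, not the exact Yahoo! Finance names:

```python
def one_hot(value, vocabulary):
    """Encode one categorical value as a 0/1 indicator vector."""
    return [1 if value == v else 0 for v in vocabulary]

# Placeholder sector labels (nine values, as in the text)
SECTORS = ["basic_materials", "conglomerates", "consumer_goods",
           "financial", "healthcare", "industrial_goods",
           "services", "technology", "utilities"]

vec = one_hot("technology", SECTORS)

# Pair-level variant: since most competitor pairs share a sector,
# a same-sector indicator is a natural attribute for a company pair.
def same_sector(sector_a, sector_b):
    return 1 if sector_a == sector_b else 0

flag = same_sector("technology", "technology")
```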
The UE approach with MVC and MVP produces results similar to those in
Table 37. For example, with MVC, the maximum values of the F1 measure are 0.381
and 0.204 with and without sector information, respectively. Although the UE
approach is more complex than the simple DTA approach, in that it requires
undersampling the majority class to form multiple smaller data sets and adjusting the
ratios of positives in these smaller data sets, the two methods show similar classification
performance. Thus, in Section 9.5, when estimating the extent to which our approach
extends beyond the gold standard, we use the results from the DTA approach.
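A minimal sketch of the UE idea just described (undersampling the majority class into multiple training sets and combining predictions by majority vote); the classifier-fitting step is omitted, and all sizes are illustrative:

```python
import random

def undersample_ensemble(positives, negatives, ratio=1.0, n_models=5, seed=0):
    """UE-style sketch: build several training sets by undersampling the
    majority (negative) class; a real run would fit one classifier per set
    and combine their labels by majority vote (MVC)."""
    rng = random.Random(seed)
    n_neg = int(len(positives) * ratio)  # adjustable positive:negative ratio
    training_sets = []
    for _ in range(n_models):
        training_sets.append(positives + rng.sample(negatives, n_neg))
    return training_sets

def majority_vote(labels):
    """MVC: label an instance positive if most ensemble members do."""
    return 1 if sum(labels) > len(labels) / 2 else 0

sets = undersample_ensemble(list(range(10)), list(range(100, 200)),
                            ratio=1.0, n_models=3)
```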
Finally, Table 37 shows that ANN achieves the largest AUC values. In Figure
33, we illustrate the ROC curves for the four classifiers using sector information; the
curves for ANN, BN, and LR are close, and ANN and LR slightly outperform the DT
curve. The diagonal line represents random labeling of instances with different
likelihoods. For example, when the classifier randomly assigns an instance to the
positive class 10% of the time, it should find 10% of the positives correctly, producing
a TP rate of 0.1. At the same time, it identifies 90% of the negatives correctly, leading
to an FP rate of 0.1 (1 − 0.9). Thus, the process of guessing the positive class 10% of the
time yields the point (0.1, 0.1) in the ROC space, and random guesses with all different
likelihoods generate the diagonal line. Hence, our classification methods (the curves
above the diagonal line) identify the signals (i.e., competitor relationships) much more
effectively than a random assignment.

[Figure: ROC curves (TP rate vs. FP rate) for ANN, BN, DT, and LR]
Figure 33. ROC Curves of Data Set II for Four Classification Methods
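The diagonal-line argument can be checked directly: a classifier that guesses positive with probability q, independent of the data, lands at ROC point (q, q):

```python
def random_guess_roc_point(q):
    """A classifier that labels each instance positive with probability q,
    ignoring the data, finds q of the positives (TP rate = q) and mislabels
    q of the negatives, so its ROC point lies on the diagonal."""
    tp_rate = q                # fraction of positives found
    fp_rate = 1 - (1 - q)      # 1 - specificity, which equals q
    return fp_rate, tp_rate

fp, tp = random_guess_roc_point(0.1)
```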
9.4.6 Estimated Overall Classification Performance on the Basis of Data Set III
Our classification performance measurements thus far compute values for each
sample basket. Because sample baskets consist of random samples of the original (larger)
baskets, these performance results represent the performance on the original baskets.
However, we also want to estimate the classification performance for all of the baskets
combined, or the whole data set with its 87,340 pairs. This estimation requires that we
extrapolate the performance observed from the sample baskets to the entire original
basket. Therefore, we adapt equations 16–20 to estimate overall precision, TP rate
(recall), FP rate, F1, and accuracy using data set III. For the 17 sample baskets from data
set I, the classification results are based on 10-fold cross validation, whereas for the
eighteenth sample basket, we combine and use the results generated from the four disjoint
testing sets (each with 500 instances). We present the estimated overall measurements in
the following equations:
Let w_i = B_i/S_i, where B_i is the size of basket i and S_i is the size of sample basket i.
Then:

Precision = Σ_i w_i TP_i / (Σ_i w_i TP_i + Σ_i w_i FP_i)   (21)

Recall (TP rate) = Σ_i w_i TP_i / (Σ_i w_i TP_i + Σ_i w_i FN_i)   (22)

FP rate = Σ_i w_i FP_i / (Σ_i w_i FP_i + Σ_i w_i TN_i)   (23)

F1 = 2 × Precision × Recall / (Precision + Recall)   (24)

Accuracy = (Σ_i w_i TP_i + Σ_i w_i TN_i) / Σ_i w_i (TP_i + TN_i + FP_i + FN_i)   (25)
With these equations, we estimate the overall classification performance by
extending performance measurements for a sample basket to the corresponding full
basket and then combining the measures across the 18 baskets in data set III. For
example, if the sample basket S_i, which represents the original basket B_i, contains m
instances that are classified as positives by a model, we expect the original basket B_i to
contain (B_i/S_i)·m instances that would be classified as positives by the same model. We
note that equations 21–25 estimate the overall classification performance for the whole
data set of 87,340 pairs, so the resulting estimation indicates the performance of an
ensemble of 18 classifiers (one for each basket), all using a given classification method.
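The basket-to-population extrapolation can be sketched for one measure (precision); basket sizes and confusion counts below are illustrative:

```python
def estimate_overall_precision(baskets):
    """Extrapolate each sample basket's confusion counts to its original
    basket with weight B_i / S_i, then pool the weighted counts
    (equations 21-25 style, shown here for precision only)."""
    wtp = sum(b["B"] / b["S"] * b["tp"] for b in baskets)
    wfp = sum(b["B"] / b["S"] * b["fp"] for b in baskets)
    return wtp / (wtp + wfp)

# Two hypothetical baskets: weights B/S of 10 and 2
baskets = [
    {"B": 1000, "S": 100, "tp": 8,  "fp": 4},
    {"B": 500,  "S": 250, "tp": 20, "fp": 10},
]
prec = estimate_overall_precision(baskets)
```

Counts from heavily sampled baskets carry proportionally larger weight, so the pooled estimate reflects the full 87,340-pair population rather than the samples alone.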
The estimated overall prior probability for positives is 11.8% (approximately 1 in 9 pairs
in the original data set is a competitor pair). We note that compared to this low estimated
prior, Table 38 shows that our competitor discovery approach can achieve reasonably
good estimated classification performance. ANN achieves the best performance on more
metrics than the other three methods; however, unlike ANN, DT, and BN, LR does not
require any parameter tuning and still produces comparably good results. We
highlight the best performance value for each measurement in Table 38.
Table 38.
Estimated Overall Performances

      Without sector information               With sector information
      Precision Recall FP rate F1    Accuracy  Precision Recall FP rate F1    Accuracy
ANN   0.419     0.378  0.046   0.397 0.907     0.450     0.513  0.055   0.479 0.910
BN    0.238     0.354  0.095   0.284 0.863     0.388     0.514  0.071   0.442 0.895
DT    0.167     0.463  0.203   0.245 0.770     0.432     0.457  0.053   0.444 0.907
LR    0.388     0.330  0.046   0.357 0.904     0.382     0.437  0.062   0.407 0.897
9.5 Competitor Extension
In the introduction, we use an anecdote to note that gold standards tend to be
incomplete. Now we suggest metrics to estimate (1) the coverage of competitor pairs by
a gold standard and (2) the extent to which our approach extends each gold standard.
9.5.1 Estimating the Coverage of a Gold Standard
We require the following notation, illustrated in Figure 34, to describe the estimation
procedure:
C: (unknown) complete set of competitor pairs
H: set of competitor pairs covered by Hoover’s
M: set of competitor pairs covered by Mergent
J_HM = H ∩ M, the intersection of H and M
Following an idea proposed in a widely cited study [Lawrence and Giles 1998] to
estimate the coverage of search engines, we assume H and M are independent subsets of
C and thus estimate the extent to which H covers C, according to how much of H covers
M (i.e., J_HM) and the size of M. We therefore define the coverage of the entire competitor
set C by Hoover's, Cov(H), and Mergent, Cov(M), as follows:

Cov(H) = |J_HM| / |M|   (26)

Cov(M) = |J_HM| / |H|   (27)

Figure 34. Competitors Covered by Two Gold Standards
If H and M are not completely independent, the value of J_HM (their intersection) is
expected to be larger than when they are independent. In that case, this coverage
estimation provides an upper bound on true coverage.
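A toy illustration of the overlap-based coverage estimate in equations 26 and 27, with hypothetical company pairs:

```python
def coverage_estimates(h_pairs, m_pairs):
    """Lawrence-Giles-style overlap estimate: if H and M sample the unknown
    complete set C independently, the fraction of M that also appears in H
    estimates H's coverage of C, and vice versa."""
    h, m = set(h_pairs), set(m_pairs)
    j = h & m                     # J_HM, the intersection
    cov_h = len(j) / len(m)       # equation 26
    cov_m = len(j) / len(h)       # equation 27
    return cov_h, cov_m

# Hypothetical competitor pairs in two gold standards
H = {("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")}
M = {("a", "b"), ("c", "d"), ("d", "e")}
cov_h, cov_m = coverage_estimates(H, M)
```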
We previously labeled the positive instances according to Hoover’s and Mergent
for each sample basket, which enables us to compute the number of competitor pairs
identified by Hoover's (H_i) and Mergent (M_i) separately, as well as the intersection of
Hoover's and Mergent (J_HM,i) for the ith sample basket. Similar to our approach to
defining equation 11, we estimate the number of positives (for Hoover’s, Mergent, and
their intersection) in each original basket by multiplying the number of positives in the
sample basket by the ratio of the basket size to the sample basket size. Then, using
equations 26 and 27, we calculate the coverage of Hoover’s and Mergent as follows:
Cov(H) = [Σ_i (B_i/S_i) J_HM,i] / [Σ_i (B_i/S_i) M_i]   (28)

Cov(M) = [Σ_i (B_i/S_i) J_HM,i] / [Σ_i (B_i/S_i) H_i]   (29)
We find that the estimated coverage of Hoover’s and Mergent is 46.0% and
24.9%, respectively. So both data sources individually cover less than 50% of all
competitor pairs. This quantifies and confirms our initial anecdote about incompleteness
of these industry-strength data sources.
9.5.2 Estimating the Extension of One Gold Standard to Another
As Figure 34 shows, M − J_HM represents the competitors covered by
Mergent but not by Hoover's. With the same assumption and logic described in
Section 9.5.1, we define the extension of Mergent to Hoover's and the extension of
Hoover's to Mergent as follows:
Ext(M, H) = |M − J_HM| / |H|   (30)

Ext(H, M) = |H − J_HM| / |M|   (31)
9.5.3 Estimating the Extension of Our Approach to a Gold Standard
We now present a procedure to estimate how much our automated approach might
extend a gold standard (i.e., identify competitor pairs that are not covered by the gold
standard). Our estimation procedure uses the following notation:
O: the set of competitor pairs classified by our approach
Ō = C − O
H̄ = C − H
M̄ = C − M
J_HMO = H ∩ M ∩ O
J_H̄MO = H̄ ∩ M ∩ O
J_HM̄O = H ∩ M̄ ∩ O
J_H̄M̄O = H̄ ∩ M̄ ∩ O
Figure 35. Competitors Covered by Two Gold Standards and Our Approach
Thus, J_H̄MO is a subset of competitor pairs that our approach classifies as positive
and that Mergent confirms as positive but that Hoover's does not identify as competitors.
Given that the competitor pairs in Mergent are a subset of all competitor pairs, we estimate
the extent to which our approach extends Hoover's (Ext(O, H)) as follows:
Ext(O, H) = |J_H̄MO| / |H|   (32)
Similarly, we estimate the extent to which our approach extends Mergent (Ext(O,
M)) as follows:
Ext(O, M) = |J_HM̄O| / |M|   (33)
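The extension metric can likewise be illustrated with toy sets; the pairs in `O`, `H`, and `M` below are hypothetical:

```python
def extension_estimate(ours, gold, other_gold):
    """Sketch of the extension idea: count pairs our approach (O) finds that
    the other gold standard confirms but this gold standard misses, relative
    to the size of this gold standard."""
    confirmed_new = (set(ours) & set(other_gold)) - set(gold)
    return len(confirmed_new) / len(gold)

# Hypothetical competitor pairs
O = {(1, 2), (2, 3), (3, 4), (4, 5)}   # classified positive by our approach
H = {(1, 2), (5, 6)}                   # covered by Hoover's
M = {(2, 3), (3, 4), (5, 6)}           # covered by Mergent

ext_o_h = extension_estimate(O, H, M)  # pairs confirmed by M, missed by H
```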
Since we use one gold standard to examine the extension of our approach to the
other, our extension is bounded by the extension of one gold standard to the other, such
that Ext(O, H) ≤ Ext(M, H) and Ext(O, M) ≤ Ext(H, M). On the basis of
equations (32) and (33), we compute the extension of our approach to each gold standard
using results from data set III with the following equations:

Ext(O, H) = [Σ_i (B_i/S_i) J_H̄MO,i] / [Σ_i (B_i/S_i) H_i]   (34)

Ext(O, M) = [Σ_i (B_i/S_i) J_HM̄O,i] / [Σ_i (B_i/S_i) M_i]   (35)
We show in Table 39 the estimation of how much our approach extends the
knowledge available from each of the gold standards, for the different classification
methods (with and without sector information). Using the sector information and any
classification method, our approach extends Hoover’s and Mergent by more than 10%
and 32%, respectively. We base these extension values on classification results generated
from a set of input parameters and classification methods. As the ROC curves in Figure
33 illustrate, we could achieve a higher TP rate (recall) by adjusting some parameters,
and therefore obtain higher values for our extensions, but at the cost of a higher FP rate,
which lowers precision. The results in Table 39 are associated with the estimated overall
performance in Table 38. For example, for ANN, the extensions offered by our approach
to Hoover's (12.1%) and Mergent (33.8%) are associated with precision, recall, and FP
rate of 0.450, 0.513, and 0.055, respectively.

Table 39.
Extensions to a Gold Standard

            Upper    Without sector information     With sector information
            bound    ANN    BN     DT     LR        ANN    BN     DT     LR
Ext(O, H)   35.0%    5.9%   7.3%   15.3%  5.0%      12.1%  11.3%  10.1%  10.5%
Ext(O, M)   71.2%    28.7%  23.4%  37.2%  24.3%     33.8%  37.1%  35.8%  32.9%
9.6 Explorations of Competitors vs. Noncompetitor Pairs
In next two subsections, we report more exploration results on structural
equivalence similarity between competitor and noncompetitor pairs, and on company
annual revenues between competitor pairs with high and low DWIOD values.
9.6.1 SE Similarity Comparison between Competitor and Noncompetitor Pairs
For each sample basket of data set III, we compute and compare the average SE
similarities for competitor and noncompetitor pairs. Figure 36 compares the DWID-based
SE similarities of the 18 sample baskets in data set III. Except for the last basket, which
has the smallest DWIOD values, the average SE similarities for competitor pairs are
greater than those for noncompetitor pairs (two-tailed t-test, p = 0.003), which indicates
that, on average, competitor companies are more structurally equivalent than
noncompetitors. Similar patterns are observed for DWOD- and DWIOD-based SE
similarities (two-tailed t-tests, p = 0.008 and p = 0.001, respectively).
[Figure: average DWID-based SE similarity for competitor vs. noncompetitor pairs, baskets 1–18]
Figure 36. Average DWID-based SE Similarity Comparison
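The per-basket comparison above can be tested with a paired t statistic over basket-level means; the similarity values below are made up for illustration:

```python
from statistics import mean, stdev
from math import sqrt

def paired_t_statistic(x, y):
    """Paired t statistic for per-basket mean SE similarities of competitor
    (x) and noncompetitor (y) pairs; compare |t| with a t table at n-1
    degrees of freedom for the two-tailed p-value."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return mean(d) / (stdev(d) / sqrt(n))

# Hypothetical basket-level average SE similarities
competitor    = [0.50, 0.45, 0.40, 0.38, 0.35]
noncompetitor = [0.30, 0.28, 0.27, 0.25, 0.24]
t = paired_t_statistic(competitor, noncompetitor)
```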
9.6.2 Comparing Annual Revenues between Competitor Pairs with High and Low
DWIODs
We observe that the average revenue of company pairs with low DWIOD values
(100 competitor pairs in data set II) is significantly lower (two-tailed t-test, p < 0.001)
than the average revenue of company pairs with high DWIOD values (92 competitor
pairs in the first five sample baskets of data set I).
9.7 Discussion
We propose and evaluate an approach that exploits company citations in online
news articles to create an intercompany network whose structural attributes can identify
competitor relationships between a pair of companies. In addition to using standard
metrics to evaluate the classification performance of our approach, we suggest several
problem-specific metrics that can measure the degree to which our approach extends a
couple of industry-strength data sources. Our evaluations prompt three broad
observations. First, the intercompany network reduces the search cost of finding
competitors compared with that associated with an exhaustive network while avoiding
the poor competitor density of a random network. In other words, the intercompany
network can capture signals about competitor relationships effectively and efficiently.
Second, the structural attributes of our intercompany network, when combined in various
types of classification models, effectively discover competitor relationships, though for
imbalanced portions of the data, we require more advanced modeling techniques (e.g.,
data segmentation, DTA) to achieve reasonable performance. Third, we quantify the
degree to which two commercial data sources are incomplete in their coverage of
competitor relationships and measure the extent to which our approach extends them
while still maintaining adequate precision.
Because our approach is language neutral, it can employ news stories in various
languages and from different countries, as long as there is a mechanism to identify
company citations. We plan to test our approach with a non-English language news
source. Furthermore, this approach may provide a means to discover business
relationships other than competitors. In fact, in parallel research, we also have applied our
approach successfully to identify the relative size of company revenues. We note that
company citations can be noisy, and we exploit the large volume of freely available
online news sources to aggregate the signals and thus reduce the noise. However, it
would be interesting to investigate the effect of volume (number of news stories) on the
classification performance of our approach. Also, in continuing empirical studies, it
would be worthwhile to explore whether the intercompany network can predict future
competitor relationships, and if so, how far into the future.
In summary, we present a data mining approach to discovering business
relationships from online news. Because of its design, our approach is scalable along
several dimensions, such as news quantity, language, and type of business relationship.
CHAPTER 10
CONCLUSIONS
This dissertation explores two related topics – personalized search and business
relationship discovery – both of which follow the process of KDW. To conclude, I
summarize the two topics, highlight the main findings and contributions, and outline
directions for future research.
Web search engines typically provide search results without considering a user's
interests or context. We propose a personalized search approach that can easily extend a
conventional search engine on the client side. Our mapping framework automatically
maps a set of known user interests onto a group of categories in the Open Directory
Project (ODP) and takes advantage of manually edited data available in ODP to train
text classifiers that correspond to user interests and therefore categorize and personalize
search results accordingly. In two sets of controlled experiments in two disjoint domains,
we compare our personalized categorization system (PCAT) with a list interface system
(LIST) that mimics a typical search engine and with a nonpersonalized categorization
system (CAT). In both experiments, we analyze system performances on the basis of the
type of task and query length and identify conditions under which our system
outperforms a baseline system. In particular, we find that PCAT is preferable to LIST for
information gathering types of tasks and for searches with short queries, and PCAT
outperforms CAT in both information gathering and finding types of tasks, as well as for
searches associated with free-form queries. From the subjects' answers to a
questionnaire, we find that PCAT is perceived as a system that can find relevant Web
pages more quickly and easily than LIST and CAT.
Potential future research along this line includes:
(1) On the basis of the conditions identified in this study, an interesting and related
direction is to study a smart system that can automatically choose a proper
interface (e.g., categorization, clustering, list) to display search results on the basis
of the nature of the query, the search results, and the user interest profile
(context).
(2) As mentioned in Section 2.2, some prior works [e.g., Leroy et al. 2003; Gauch
et al. 2003; Shen et al. 2005b] use a user's search and/or browsing activities to
learn his or her profile and further personalize search. Thus, it would be interesting
to build a user profile based not only on the given user's activities but also on the
behaviors of many other people who are known to have the same or similar
interests. In other words, a personalized search system (that extends our current
system) could try to improve a user's Web search in a collaborative (e.g., intranet)
environment by considering the search activities of other people who have the
same or similar interest profiles.
(3) In this study we assume that a user's interests are given, in that they can be
automatically extracted from his or her resume in digital form or from a database.
Thus, how to capture and model the dynamics of users' interests can be an
extension of this research, because interests may not be known in advance and
normally change over time, even long-term ones. A user's interests can be
modeled by his or her behaviors, such as searched and browsed pages (online
behaviors) and composed or read documents and emails (offline behaviors)
[Teevan et al. 2005].
(4) When classifying search results under a user's interests, we use page content up to
10KB, and the page-fetching process is time consuming. Therefore, it would be
worthwhile to study the performance of result categorization using other types of
data, such as titles and snippets from search engine results, instead of page
content, which would save the time spent fetching Web pages.
In the second topic, we present a news-driven, SNA-based business relationship
discovery framework and study two different business relationships, CRR and the
competitor relationship, to illustrate the effectiveness of our approach. By taking
advantage of the fact that content providers, such as Yahoo! Finance, organize news by
company, we consider news stories organized under a company to belong to that
company (i.e., the source). We first identify company citations (from sources to targets)
in news and then construct a directed and weighted intercompany network. Using SNA
techniques, we further identify four types of attributes (dyadic degree-, node degree-,
node centrality-, and structural equivalence-based) from the network structure. Then we
apply different classification methods with these attributes to discover the CRRs and
competitor relationships for a large number of links (company pairs) in the network.
For the CRR study, besides reporting annual and quarterly revenue-based CRR
prediction, we also show that our approach achieves better performance for flip pairs
than two alternative methods. Further, with annual revenue-based CRR, we examine the
prediction performance using each individual group of attributes and apply discriminant
analysis to identify two IVs that are significant in distinguishing positive and negative
CRRs. For the related problem of finding whether a company falls into a set of top-N
companies by revenue, we obtain 57–75% precision with substantially lower recall
(24–36%) for N between 100 and 1000.
For the competitor study, we first demonstrate the high competitor coverage and
density of our citation-based network to justify its use before presenting the
classification performance. With two company profile data sources, Hoover's and
Mergent, as gold standards, we estimate to what extent a gold standard covers the
(unknown) complete competitor space. More important, we propose metrics to estimate
how much our approach extends the knowledge available in each of the gold
standards.
Our approach is scalable and language-neutral. Thus it can not only serve as a
data filtering step but also be useful for tracing and monitoring the dynamics of business
relationships for many companies over time. The following research directions can serve
as extensions to our current work:
(1) It would be interesting to validate our approach with a variety of different
business relationships (e.g., supplier and customer relationship), news from
different languages and countries, various types of companies (e.g., private versus
public), and over time.
(2) Beyond the four types of network attributes we identify, it would be desirable, in
order to improve the classification, to derive and evaluate additional graph-based
attributes that synthesize the global and dyadic measures and represent more
effective predictors of business relationships between a pair of companies.
(3) We have seen that sector information, which is at a higher level than industry,
greatly improves classification for competitor relationships. Thus, we may use
some industry-related attributes, such as an industry taxonomy, to improve the
performance. Moreover, with the taxonomy we can group companies under
the same industry or industry subcategory into a super node to form a smaller, but
more abstract, network. We can then examine the link patterns between those
super nodes.
(4) A broader future research direction is to study new business questions beyond
the above business relationships, with or without using the current news-citation-
based intercompany network.
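The super-node idea in item (3) can be sketched as an aggregation of citation weights by industry; the tickers and industry labels below are hypothetical:

```python
from collections import defaultdict

def collapse_to_super_nodes(edges, industry_of):
    """Collapse the company-level citation network into industry 'super
    nodes' by summing directed citation weights between industries;
    self-loops keep within-industry citations."""
    super_edges = defaultdict(int)
    for (src, dst), weight in edges.items():
        super_edges[(industry_of[src], industry_of[dst])] += weight
    return dict(super_edges)

# Hypothetical tickers and industries
industry_of = {"AAA": "tech", "BBB": "tech", "CCC": "auto"}
edges = {("AAA", "BBB"): 3, ("AAA", "CCC"): 2, ("BBB", "CCC"): 1}
agg = collapse_to_super_nodes(edges, industry_of)
```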
In this dissertation, with three essays under two topics in the area of KDW, we
propose novel ideas, address relevant business problems, evaluate our approaches, verify
their effectiveness, justify their usefulness to businesses, and indicate broader
applications and future research based on our general approaches.
REFERENCES
Adamic, L. A. 2002. Zipf, power-laws, and Pareto - a ranking tutorial. http://ginger.hpl.hp.com/shl/papers/ranking/ranking.html.
Barabási, A. L., R. Albert, H. Jeong. 2000. Scale-free characteristics of random networks: the topology of the World Wide Web. Physica A, 281 69–77.
Bernstein, A., S. Clearwater, S. Hill, F. Provost. 2002. Discovering knowledge from relational data extracted from business news. In Proceedings of the KDD 2002 Workshop on Multi-Relational Data Mining, Edmonton, Alberta, Canada.
Brandes, U. 2001. A faster algorithm for betweenness centrality. Journal of Mathematical Sociology, 25(2) 163–177.
Brin, S., L. Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1–7) 107–117.
Broder, A. 2002. A taxonomy of Web search. ACM SIGIR Forum, 36(2) 3–10.
Broder, A., R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, J. L. Wiener. 2000. Graph structure in the Web. In Proceedings of the 9th World Wide Web Conference, 309–320.
Budzik, J., K. Hammond. 2000. User interactions with everyday applications as context for just-in-time information access. In Proceedings of the 5th International Conference on Intelligent User Interfaces, New Orleans, LA, 44–51.
Butler, D. 2000. Souped-up search engines. Nature, 405 112–115.
Carroll, J., M. B. Rosson. 1987. The paradox of the active user. In Interfacing Thought: Cognitive Aspects of Human-Computer Interaction, J.M. Carroll, Ed. MIT Press, Cambridge, MA.
Chan, P., S. Stolfo. 1998. Toward scalable learning with non-uniform class and cost distributions: a case study in credit card fraud detection. In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining, New York City, NY, 164–168.
Chakrabarti, S., B. E. Dom, S.R. Kumar, P. Raghavan, S. Rajagopalan, A. Tomkins, D. Gibson, J. Kleinberg. 1999. Mining the Web's link structure. Computer, 32(8) 60–67.
Chawla, N. V., K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer. 2002. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16 321–357.
Chirita, P.A., W. Nejdl, R. Paiu, C. Kohlschütter. 2005. Using ODP metadata to personalize search. In Proceedings of the 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 178–185.
Cooley, R., B. Mobasher, J. Srivastava. 1997. Web mining: information and pattern discovery on the World Wide Web. In Proceedings of the 9th IEEE International Conference on Tools with Artificial Intelligence, Newport Beach, CA, USA, 558–567.
Cooper, G., E. Herskovitz. 1992. A Bayesian method for the induction of probabilistic networks from data. Machine Learning 9(4) 309–347.
Craswell, N., D. Hawking, S. Robertson. 2001. Effective site finding using link information. In Proceedings of the 24th annual international ACM SIGIR conference on Research and development in information retrieval, New Orleans, LA, 250–257.
Cutting, D.R., D.R. Karger, J.O. Pedersen, J.W. Tukey. 1992. Scatter/Gather: a cluster-based approach to browsing large document collections. In Proceedings of the 15th Annual International ACM SIGIR conference on Research and Development in Information Retrieval, Copenhagen, Denmark, 318–329.
Deerwester, S., S.T. Dumais, G.W. Furnas, T.K. Landauer, R. Harshman. 1990. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6) 391–407.
Dietterich, T.G. 1997. Machine learning research: four current directions. AI Magazine, 18(4) 97–136.
Dreilinger, D., A. E. Howe. 1997. Experiences with selecting search engines using metasearch. ACM Transactions on Information Systems, 15(3) 195–222.
Dumais S., H. Chen. 2001. Optimizing search by showing results in context. In Proceedings of Computer-Human Interaction, Seattle, WA, 277–284.
Eirinaki, M., M. Vazirgiannis. 2003. Web mining for Web personalization. ACM Transactions on Internet Technology, 3(1) 1–27.
Estabrooks, A., N. Japkowicz. 2001. A mixture-of-experts framework for learning from unbalanced data sets. In Proceedings of the 4th International Symposium on Intelligent Data Analysis. Lisbon, Portugal, 34–43.
Faloutsos, M., P. Faloutsos, C. Faloutsos. 1999. On power-law relationships of the internet topology. In Proceedings ACM SIGCOMM, 251–262.
Fayyad, U. M., G. Piatetsky-Shapiro, P. Smyth. 1996. From data mining to knowledge discovery: an overview. In Advances in Knowledge Discovery and Data Mining, Eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, AAAI Press, Menlo Park, California, 1–30.
Finkelstein, L., E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan, G. Wolfman, E. Ruppin. 2002. Placing search in context: the concept revisited. ACM Transactions on Information Systems, 20(1) 116–131.
Freeman, L. C. 1979. Centrality in social networks: conceptual clarification. Social Networks, 1 215–239.
Friedman, N., D. Geiger, M. Goldszmidt. 1997. Bayesian network classifiers. Machine Learning, 29(2–3) 131–163.
Garfield, E. 1979. Citation Indexing: Its Theory and Application in Science, Technology, and Humanities. Wiley, New York.
Gauch, S., J. Chaffee, A. Pretschner. 2003. Ontology-based personalized search and browsing. Web Intelligence & Agent Systems, 1(3/4) 219–234.
Gibson, D., J. Kleinberg, P. Raghavan. 1998. Inferring Web communities from link topology. In Proceedings of 9th ACM Conference on Hypertext and Hypermedia, Pittsburgh, USA, 225–234.
Giles, C. L., K. Bollacker, S. Lawrence. 1998. CiteSeer: An automatic citation indexing system. In Proceedings of the 3rd ACM Conference on Digital Libraries, Pittsburgh, PA, USA, 89–98.
Glover, E., S. Lawrence, W. Brimingham, C. L. Giles. 1999. Architecture of a metasearch engine that supports user information needs. In Proceedings of the 8th International Conference on Information Knowledge Management, Kansas City, MO, 210–216.
Gulati, R., M. Gargiulo. 1999. Where do interorganizational networks come from? American Journal of Sociology, 104(5) 1439–1493.
Hafri, Y., C. Djeraba. 2004. Dominos: a new Web crawler’s design. In Proceedings of the 4th International Web Archiving Workshop (IWAW), Bath, UK.
Hair, J. F., W. C. Black, B. J. Babin, R. E. Anderson, R. L. Tatham. 2006. Multivariate Data Analysis. 6th edition, Prentice Hall.
Harris, Z. 1985. Distributional structure. In The Philosophy of Linguistics. Katz, J.J., Ed. Oxford University Press, New York, 26–47.
Haveliwala, T.H. 2003. Topic-sensitive PageRank. IEEE Transactions on Knowledge and Data Engineering, 15(4) 784–796.
He, B., K. C. C. Chang. 2003. Statistical schema matching across Web query interfaces. In Proceedings of the ACM SIGMOD International Conference on management of Data, San Diego, CA, USA, 217–228.
Hu, M., B. Liu. 2004. Mining and summarizing customer reviews. In Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Seattle, WA, USA, 168–177.
Jansen, B.J., A. Spink, J. Bateman, T. Saracevic. 1998. Real life information retrieval: a study of user queries on the Web. ACM SIGIR Forum. 32(1) 5–17.
Jansen, B. J., A. Spink, T. Saracevic. 2000. Real life, real users, and real needs: a study and analysis of user queries on the Web. Information Processing and Management, 36(2) 207–227.
Jansen, B. J., A. Spink, J. Pedersen. 2005. A temporal comparison of AltaVista Web searching. Journal of the American Society for Information Science and Technology, 56(6) 559–570.
Jansen, B. J., A. Spink. 2005. An analysis of Web searching by European AlltheWeb.com users. Information Processing and Management, 41 361–381.
Jeh, G., J. Widom. 2003. Scaling personalized Web search. In Proceedings of the 12th international conference on World Wide Web, Budapest, Hungary, 271–279.
Kalfoglou, Y., M. Schorlemmer. 2003. Ontology mapping: the state of the art. The Knowledge Engineering Review Journal, 18(1) 1–31.
Käki, M. 2005. Findex: search result categories help users when document ranking fails. In Proceedings of the SIGCHI conference on Human factors in computing systems, Portland, OR, 131–140.
Kautz, H., B. Selman, M. Shah. 1997. The hidden Web. AI Magazine, 18(2) 27–36.
Kessler, M. M. 1963. Bibliographic coupling between scientific papers. American Documentation, 14 10–25.
Kleinberg, J. 1999. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5) 604–632.
Kotsiantis, S., D. Kanellopoulos, P. Pintelas. 2006. Handling imbalanced datasets: a review. GESTS International Transactions on Computer Science and Engineering 30(1).
Kraft, R., F. Maghoul, C. C. Chang. 2005. Y!Q: contextual search at the point of inspiration. In Proceedings of the 14th ACM International Conference on Information and Knowledge Management, Bremen, Germany, 816–823.
Kumar, R., P. Raghavan, S. Rajagopalan, A. Tomkins. 1999. Trawling the Web for emerging cyber-communities. Computer Networks, 31(11–16) 1481–1493.
Lawrence, S., C. L. Giles. 1998. Searching the World Wide Web. Science, 280(3) 98–100.
Lawrence, S. 2000. Context in Web search. IEEE Data Engineering Bulletin, 23(3) 25–32.
Leroy, G., A. M. Lally, H. Chen. 2003. The use of dynamic contexts to improve casual Internet searching. ACM Transactions on Information Systems, 21(3) 229–253.
Levine, J. H. 1972. The Sphere of Influence. American Sociological Review, 37(1) 14–27.
Liu, B. 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. 1st edition, Springer.
Liu, F., C. Yu, W. Meng. 2004. Personalized Web search for improving retrieval effectiveness. IEEE Transactions on Knowledge and Data Engineering, 16(1) 28–40.
Lorrain, F., H. C. White. 1971. Structural equivalence of individuals in social networks. Journal of Mathematical Sociology, 1 49–80.
Maltz, D., K. Ehrlich. 1995. Pointing the way: active collaborative filtering. In Proceedings of the Conference on Computer-Human Interaction, Denver, CO, 202–209.
Menczer, F., G. Pant, P. Srinivasan. 2004. Topical Web crawlers: evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4(4) 378–419.
Miller, G. A., R. Beckwith, C. Fellbaum, D. Gross, K. J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography, 3(4) 235–244.
Mitchell, T. M. 1997. Machine Learning. WCB/McGraw-Hill.
Najork, M., A. Heydon. 2001. High-performance Web crawling. In Handbook of Massive Data Sets, J. Abello, P. Pardalos, M. Resende, Eds. Kluwer Academic Publishers, 25–45.
O'Madadhain, J., D. Fisher, S. White, Y. B. Boey. 2006. JUNG: the Java universal network/graph framework (ver. 1.7.4). http://jung.sourceforge.net.
Oyama, S., T. Kokubo, T. Ishida. 2004. Domain-specific Web search with keyword spices. IEEE Transactions on Knowledge and Data Engineering, 16(1) 17–27.
Padmanabhan, B., Z. Zheng, S. Kimbrough. 2006. An empirical analysis of the value of complete information for eCRM models. MIS Quarterly, 30(2) 247–267.
Palmer, J. W., J. P. Bailey, S. Faraj. 2000. The role of intermediaries in the development of trust on the WWW: the use and prominence of trusted third parties and privacy statements. Journal of Computer-Mediated Communication, 5(3).
Pang, B., L. Lee, S. Vaithyanathan. 2002. Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, Philadelphia, PA, USA, 79–86.
Pant, G., P. Srinivasan. 2006. Link contexts in classifier-guided topical crawlers. IEEE Transactions on Knowledge and Data Engineering, 18(1) 107–122.
Park, H. W. 2003. Hyperlink network analysis: a new method for the study of social structure on the Web. Connections, 25(1) 49–61.
Pazzani, M., C. Merz, P. Murphy. 1994. Reducing misclassification costs. In Proceedings of the 11th International Conference on Machine Learning, New Brunswick, NJ, USA, 217–225.
Pitkow, J., H. Schutze, T. Cass, R. Cooley, D. Turnbull, A. Edmonds, E. Adar, T. Breuel. 2002. Personalized search. Communications of the ACM, 45(9) 50–55.
Porter, M. 1980. An algorithm for suffix stripping. Program, 14(3) 130–137.
Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA.
Richards, W. D., G. A. Barnett (Eds.) 1993. Progress in Communication Science, 12, Ablex Pub. Corp., Norwood, NJ.
Riloff, E., J. Shepherd. 1997. A corpus-based approach for building semantic lexicons. In Proceedings of the 2nd Conference on Empirical Methods in Natural Language Processing, Providence, RI, 117–124.
Salton, G., M. J. McGill. 1986. Introduction to Modern Information Retrieval, McGraw-Hill, New York.
Salzberg, S. 1997. On comparing classifiers: pitfalls to avoid and a recommended approach. Data Mining and Knowledge Discovery, 1 317–327.
Schapire, R. E. 1999. A brief introduction to boosting. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 1401–1406.
Scott, J. 2000. Social Network Analysis: A Handbook, 2nd ed., Sage Publications, London.
Sebastiani, F. 2002. Machine learning in automated text categorization. ACM Computing Surveys, 34(1) 1–47.
Sellen, A.J., R. Murphy, K. L. Shaw. 2002. How knowledge workers use the Web. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems: Changing our World, Changing Ourselves. Minneapolis, MN, 227–234.
Shakes, J., M. Langheinrich, O. Etzioni. 1997. Dynamic reference sifting: a case study in the homepage domain. In Proceedings of the 6th International World Wide Web Conference, Santa Clara, CA, 189–200.
Shen, D., Z. Chen, Q. Yang, H. Zeng, B. Zhang, Y. Lu, W. Ma. 2004. Web-page classification through summarization. In Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, Sheffield, South Yorkshire, UK, 242–249.
Shen, X., B. Tan, C. X. Zhai. 2005a. Context-sensitive information retrieval using implicit feedback. In Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, Salvador, Brazil, 43–50.
Shen, X., B. Tan, C. X. Zhai. 2005b. Implicit user modeling for personalized search. In Proceedings of the 14th ACM international conference on Information and knowledge management, Bremen, Germany, 824–831.
Small, H. 1973. Co-citation in the scientific literature: a new measurement of the relationship between two documents. Journal of the American Society for Information Science, 24(4) 265–269.
Speretta, M., S. Gauch. 2005. Personalizing search based on user search histories. In Proceedings of IEEE/WIC/ACM International Conference on Web Intelligence, Compiegne University of Technology, France, 622–628.
Srinivasan, P., F. Menczer, G. Pant. 2005. A general evaluation framework for topical crawlers. Information Retrieval, 8(3) 417–447.
Srivastava, J., R. Cooley, M. Deshpande, P. Tan. 2000. Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explorations, 1(2) 12–23.
Sugiyama, K., K. Hatano, M. Yoshikawa. 2004. Adaptive Web search based on user profile constructed without any effort from users. In Proceedings of the 13th International Conference on World Wide Web, New York, NY, 675–684.
Sullivan, D. 2000. NPD search and portal site study. http://searchenginewatch.com/sereport/article.php/2162791.
Tan, A. H. 2002. Personalized information management for Web intelligence. In Proceedings of World Congress on Computational Intelligence, Honolulu, HI, 1045–1050.
Tan, A. H., C. Teo. 1998. Learning user profiles for personalized information dissemination. In Proceedings of International Joint Conference on Neural Network, Anchorage, AK, 183–188.
Teevan, J., S. T. Dumais, E. Horvitz. 2005. Personalizing search via automated analysis of interests and activities. In Proceedings of 28th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Salvador, Brazil, 449–456.
Uzzi, B. 1999. Embeddedness in the making of financial capital: how social relations and networks benefit firms seeking financing. American Sociological Review, 64 481–505.
Walker, G., B. Kogut, W. Shan. 1997. Social capital, structural holes and the formation of an industry network. Organization Science, 8(2) 109–125.
Wasserman, S., K. Faust. 1994. Social Network Analysis: Methods and Applications. Cambridge University Press, Cambridge, UK.
Weiss, G. M. 2004. Mining with rarity: a unifying framework. SIGKDD Explorations, 6(1) 7–19.
Weiss, G. M., F. Provost. 2003. Learning when training data are costly: the effect of class distribution on tree induction. Journal of Artificial Intelligence Research 19 315–354.
Wen, J.R., J. Y. Nie, H. J. Zhang. 2002. Query clustering using user logs. ACM Transactions on Information Systems, 20(1) 59–81.
Witten, I. H., E. Frank. 2005. Data Mining: Practical Machine Learning Tools and Techniques. 2nd ed., Morgan Kaufmann, San Francisco.
Xu, J., W.B. Croft. 1996. Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Zurich, Switzerland, 4–11.
Yang, Y., X. Liu. 1999. A re-examination of text categorization methods. In Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval, Berkeley, CA, 42–49.
Zaïane, O. R., M. Xin, J. Han. 1998. Discovering Web access patterns and trends by applying OLAP and data mining technology on Web logs. In Proceedings of Advances in Digital Libraries, Santa Barbara, CA, 19–29.
Zamir, O., O. Etzioni. 1999. Grouper: A dynamic clustering interface to Web search results. Computer Networks: The International Journal of Computer and Telecommunications Networking, 31(11–16) 1361–1374.