
GHENT UNIVERSITY
FACULTY OF ECONOMICS AND BUSINESS ADMINISTRATION
ACADEMIC YEAR 2015 – 2016

Web Crawling in R: Predicting Leads

Master thesis presented to obtain the degree of
Master in Applied Economic Sciences: Business Engineer

Arno Liseune

under the guidance of
Prof. Dirk Van den Poel, Jeroen D’Haen & Tijl Carpels


PERMISSION

Ondergetekende verklaart dat de inhoud van deze masterproef mag geraadpleegd en/of gereproduceerd worden, mits bronvermelding.

Undersigned declares that the content of this master thesis may be consulted and/or reproduced on condition that the source is quoted.

Arno Liseune


PREFACE

Writing this master thesis has been made possible thanks to many different people.

First of all, I would like to thank my promoter, Professor Dirk Van den Poel, who facilitated this research.

I would also like to express my gratitude to Jeroen D’Haen, who provided me with guidance, suggestions and valuable insights throughout this research project.

Further, I would like to thank Tijl Carpels for the help he offered me during the final stage of this master thesis.

In addition, I would also like to thank the anonymous Belgian energy supplier for providing me with the data that made this research possible.

Finally, I would like to address a special thanks to my family and friends who helped me and gave me moral support throughout the entire duration of this study.


TABLE OF CONTENTS

PREFACE
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
SAMENVATTING
ABSTRACT
1 INTRODUCTION
2 METHODOLOGY
  2.1 Web mining
    2.1.1 Identification of corporate websites
    2.1.2 Data collection
  2.2 Text mining
    2.2.1 Text preparation
    2.2.2 Text representation
    2.2.3 Dimensionality reduction
  2.3 Incorporating expert knowledge
  2.4 Predictive modeling
    2.4.1 Regularized logistic regression
    2.4.2 Random forest
    2.4.3 Rotation forest
    2.4.4 AdaBoost
    2.4.5 Support Vector Machine
  2.5 Model evaluation criteria
3 EMPIRICAL VERIFICATION
  3.1 Research data
  3.2 Optimal dimensionality and model selection
  3.3 Results
4 CONCLUSION
5 LIMITATIONS AND FURTHER RESEARCH
ACKNOWLEDGEMENTS
BIBLIOGRAPHY


LIST OF ABBREVIATIONS

B2B   Business-to-Business
CRM   Customer Relationship Management
XML   Extensible Markup Language
HTML  HyperText Markup Language
URL   Uniform Resource Locator
PCA   Principal Component Analysis
PC    Principal Component
SVM   Support Vector Machine
ROC   Receiver Operating Characteristic
AUC   Area Under the receiver operating characteristic Curve
TP    True Positives
TN    True Negatives
FP    False Positives
FN    False Negatives


LIST OF TABLES

Table 1: Variables used in research
Table 2: Characteristics of the marketing data
Table 3: AUC and top-decile lift


LIST OF FIGURES

Figure 1: Methodology
Figure 2: Corporate website identification
Figure 3: Text mining stages
Figure 4: Term filtering
Figure 5: Principal Component Analysis
Figure 6: Hybrid ensemble
Figure 7: Random forest
Figure 8: AdaBoost
Figure 9: Support Vector Machine
Figure 10: Model performance as a function of dimensionality
Figure 11: Cumulative lift curves
Figure 12: ROC curves


SAMENVATTING

In this study, we investigated whether textual information found on corporate websites can be used to identify promising leads in a Business-to-Business (B2B) context. In particular, we demonstrated how several web and text mining techniques can be applied to collect and organize these unstructured textual data. We also examined how Principal Component Analysis can help transform this content into a set of characteristics associated with these websites. Using a hybrid ensemble, we established that these characteristics can improve the identification of promising leads. Moreover, we showed that augmenting the characteristics with variables derived from domain expertise led to even better results. Consequently, the framework proposed in this research can be used by B2B sales representatives, as it enables them to rank leads according to their probability of conversion. The result is a more targeted marketing approach that is especially beneficial for companies facing low conversion rates or constrained by limited marketing budgets.


ABSTRACT

In this research, we investigated whether textual information extracted from corporate websites could be used to identify promising leads in a Business-to-Business (B2B) environment. In particular, we showed how several web and text mining techniques can be applied to extract and organize this unstructured information, and how principal component analysis (PCA) may help to transform this content into a set of corporate website characteristics. By means of a hybrid ensemble, we found that these characteristics can facilitate the identification of promising leads. Additionally, we showed that augmenting the data with variables constructed through expert knowledge rendered even better results. Hence, the framework presented in this research can be used by B2B marketers, as it allows them to rank leads according to their predicted conversion probabilities. The result is a more targeted marketing approach, which can be especially beneficial for businesses confronted with low conversion ratios or constrained by limited marketing budgets.

Keywords: Acquisition, B2B, Web Mining, Text Mining, PCA, Machine Learning

1 INTRODUCTION

In the beginning of the 20th century, mass production dominated the business landscape. For many decades following the industrial revolution, producers manufactured large amounts of standardized products, striving for economies of scale. As a result, marketing activities were solely focused on covering a large market share, ignoring different customer segments (Bauer, Grether, & Leach, 2002). Traditional advertising media such as radio and television enabled firms to spread their message, trying to reach as many people as possible. As time passed, advancements in technology shifted the marketing focus to a more targeted approach, as direct mailing and telemarketing facilitated direct communication with the customer (Ling & Yen, 2001). Instead of providing the whole market with a single offer, marketers were now able to present products that were relevant to specific customer segments (Petrison, Blattberg, & Wang, 1997). Still, little was known about true individual customer needs, as firms lacked personal interactions under this one-way communication strategy.

In recent years, however, improvements in information systems and technologies introduced a paradigm shift in marketing. Throughout the 1980s and 1990s, major innovations in database technology allowed firms to store and analyze data (Petrison et al., 1997). The emergence of the internet offered the opportunity to collect vast amounts of customer information through various interactions over time. As computing power rapidly increased, sophisticated analysis of these data became more attractive. Moreover, as customers became aware of competitive offers with the advent of the World Wide Web, many firms began to recognize the value of this information to cope with the increasingly competitive environment (Rygielski, Wang, & Yen, 2002; Shaw, Subramaniam, Tan, & Welge, 2001). In order to remain successful, customer information was leveraged to produce customized products and services in an effort to create superior value and to build long-term relationships (Kothandaraman & Wilson, 2000; Ulaga & Chacour, 2001). Nowadays, this relational marketing approach is known as Customer Relationship Management (CRM). According to Ling and Yen (2001), “CRM is a concept whereby an organization takes a comprehensive view of its customers to maximize customer’s relationship with an organization and the customer’s profitability for the company” (pp. 82-83). Whereas operational CRM provides support to business processes, analytical CRM focuses on the analysis of customer characteristics as well as behavior in order to maximize marketing effectiveness (Ngai, Xiu, & Chau, 2009). Data mining tools are often used to analyze these customer data, as they offer advanced algorithms to extract hidden knowledge from corporate data warehouses (Bose & Mahapatra, 2001; Ngai et al., 2009; Rygielski et al., 2002). With the help of these techniques, firms are able to improve customer acquisition, retention and development, hence achieving higher marketing success rates (Baecke & Van den Poel, 2010). Generally, CRM focuses on keeping and satisfying existing customers, since this strategy is considerably more profitable than acquiring new ones on a regular basis (Isaac & Tooker, 2001; Reinartz & Kumar, 2003; Wilson, 2006). At some point in time, however, customer relationships will dissolve (Dwyer, Schurr, & Oh, 1987). Therefore, identifying new customers remains critical to the viability of today’s organizations (Thorleuchter, Van den Poel, & Prinzie, 2012; Wilson, 2006). Whereas customer retention is a relatively easy process, relying on existing customer data, companies are much less familiar with the acquisition of new customers, which requires the search for unknown information. Traditionally, firms use external data purchased from commercial vendors as input for acquisition models (Hill, 1999; Wilson, 2006). Even so, these lists tend to be very expensive and of poor quality, as they often contain many missing values (D’Haen, Van den Poel, & Thorleuchter, 2013; Shankaranarayanan & Cai, 2005).

Today, web data could be a valuable alternative for acquisition purposes in a Business-to-Business (B2B) context, as modern organizations use their websites to keep customers informed (Thorleuchter et al., 2012). However, this does not come without challenges. Firstly, web data is so unstructured that only humans can understand it. Secondly, the huge amount of data requires machines to process it (Stumme, Hotho, & Berendt, 2006). As a result, firms seldom use this valuable source of information for marketing activities (Coussement & Van den Poel, 2009). Nonetheless, new data mining techniques are emerging to take on these challenges. According to Kosala and Blockeel (2000), “Web mining is the use of data mining techniques to automatically discover and extract information from Web documents and services” (p. 2). Hence, web mining applications deliver tools that could help marketers extract information from the web (Thorleuchter et al., 2012). Even so, the result can be highly unstructured. Fortunately, text mining algorithms exist to extract hidden knowledge from unstructured information (Hotho, Nürnberger, & Paaß, 2005). Since approximately 80% of all information is stored in textual form (Gentsch & Hänlein, 1999), the need to master these techniques increases.

In this study, web, text and data mining techniques will be applied in order to identify promising leads in a B2B context. Based upon the data of a firm’s previous marketing campaign, we attempt to determine the relationship between a lead’s website characteristics and the success rate of customer conversion for this particular firm. Information from corporate websites will be crawled in order to uncover these hidden characteristics. Nowadays, vector space models are often used to represent unstructured text in a way machines can process it (Silva & Ribeiro, 2003). Nevertheless, the feature space can still be high-dimensional, as text collections often comprise thousands of terms (Yang & Pedersen, 1997). This high dimensionality may be problematic for many classifiers when the number of terms is much higher than the number of documents included in the text corpus. Therefore, dimensionality reduction techniques need to be applied in order to reduce the feature set to a more manageable form (Sebastiani, 2002; Silva et al., 2003; Yang et al., 1997). Several methods exist to achieve this purpose. Feature selection techniques aim at the retrieval of the most informative terms in a text collection, resulting in a subset of the original corpus. An alternative to this approach is to reduce dimensionality by some linear or nonlinear projection of the high-dimensional space onto a lower one, also known as feature extraction (Tang, Shepherd, Milios, & Heywood, 2005). After the dimensionality reduction phase, a prediction model is built upon the new feature set by means of a combination of several machine learning algorithms. The discovery of a pattern between a lead’s characteristics and the success rate of customer conversion assists sales representatives in two ways. On the one hand, characteristics of converted leads could be used in the search for new ones. On the other hand, the model comprising the uncovered relationship can be applied to websites of potential customers, resulting in conversion probabilities. Consequently, sales representatives are able to better identify interesting leads as well as to allocate marketing resources towards those leads with high conversion probabilities.

This study contributes to the literature in several ways. Firstly, a multilayer web mining approach is presented to extract the right corporate websites (see Sect. 2.1). Secondly, we demonstrate the ability of principal component analysis (PCA) to construct corporate website characteristics that relate to the success rate of customer conversion (see Sect. 2.2.3). Thirdly, we add expert knowledge to the reduced feature space in order to cope with the information loss induced by dimensionality reduction (see Sect. 2.3). Finally, a hybrid ensemble is presented in an effort to optimally approximate the underlying relationship (see Sect. 2.4).


2 METHODOLOGY

Corporate websites are identified by means of a multilayer web crawling algorithm. Additionally, unstructured textual information from these websites is extracted and, after a text preparation phase, represented as a vector space model. This high-dimensional structure is projected onto a lower-dimensional space through the application of PCA. Next, expert knowledge is added through the construction of new predictors in order to compensate for information loss. Finally, a hybrid ensemble is built upon this new feature set in order to uncover a pattern between a lead’s website characteristics and the probability of customer conversion. Fig. 1 shows the methodology of this approach.

Figure 1: Methodology


2.1 Web mining

In order to retrieve companies’ websites, a web mining approach is applied that allows automated data collection. Based on an actual data set containing companies’ names and locations, the internet is crawled in search of the corresponding websites. The web crawling technique comprises the parsing of HTML files found on the web, resulting in structural HTML trees. These hierarchical representations of web documents allow web miners to query the content by means of XPath expressions, enabling them to find the desired pieces of information.
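
The exact R implementation is not reported in this section, but a minimal sketch of this parse-and-query step, for instance with the XML package (the URL below is a placeholder), could look as follows:

    library(XML)

    # Parse a web page into a hierarchical HTML tree
    # (placeholder URL; any corporate page would do)
    url <- "http://www.example-company.be/contact.html"
    doc <- htmlParse(url, encoding = "UTF-8")

    # Query the tree with XPath expressions
    hyperlinks <- xpathSApply(doc, "//a", xmlGetAttr, "href")   # all link targets
    body_text  <- xpathSApply(doc, "//body//text()", xmlValue)  # raw text nodes

    head(hyperlinks)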

2.1.1 Identification of corporate websites The identification of a company’s website depends upon two consecutive stages.

Firstly, a pool of plausible websites is generated by means of the company’s business

name. Secondly, the city of the firm’s establishment is used to identify and collect the

right corporate website. This principle is applied in three successive search

approaches and is illustrated in Fig. 2.

As a first attempt to identify the website of a company, alternative URL’s are generated

based on its name since a firm’s business name usually corresponds with the firm’s

domain name (e.g. www.company-name.com). Therefore, the company’s business

name and several meaningful variations to this are converted into a set of multiple

URL’s. Existing websites are crawled and collected if the city or corresponding postal

code of the firm’s establishment is identified in webpages that potentially contain this

information (e.g. ‘contact’, ‘sitemap’, ‘location’ etc.).
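
The helper functions below sketch this first search approach; they are illustrative only and do not reproduce the exact name variations or matching rules used in this research:

    library(XML)

    # Turn a business name into a few candidate domain names
    candidate_urls <- function(name, tlds = c(".be", ".com")) {
      clean    <- gsub("[^a-z0-9 ]", "", tolower(name))
      variants <- c(gsub(" ", "", clean), gsub(" ", "-", clean))
      paste0("http://www.", rep(variants, each = length(tlds)), tlds)
    }

    # Accept a candidate site if the firm's city or postal code appears in its text
    matches_location <- function(url, city, postal_code) {
      doc <- tryCatch(htmlParse(url, encoding = "UTF-8"), error = function(e) NULL)
      if (is.null(doc)) return(FALSE)
      txt <- tolower(paste(xpathSApply(doc, "//text()", xmlValue), collapse = " "))
      grepl(tolower(city), txt, fixed = TRUE) || grepl(postal_code, txt, fixed = TRUE)
    }

    candidate_urls("Example Company")
    # "http://www.examplecompany.be"  "http://www.examplecompany.com"
    # "http://www.example-company.be" "http://www.example-company.com"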

Companies whose websites were not identified are then passed to a Google query generator. The first two corresponding Google search results pages are collected and parsed, followed by the extraction of the hyperlinks by means of an XPath expression. Links to information pages such as Gouden Gids, Kompass, Trendstop Knack and Infobel are gathered and crawled, since these pages could contain the URL of the requested corporate website. Because each of these websites has its own specific and constant HTML structure, XPath queries can be constructed and repeatedly used to extract the relevant data presented on these information pages. As a result, the retrieved information webpages can be queried for the company’s name, city, postal code and URL. Finally, this URL is extracted when the firm’s name and city or postal code correspond to the name and city or postal code found on the information page.

A final approach consists of crawling the highest-ranked Google search results, as these are likely to contain the link to the correct company website. Whereas the Google query in the previous search approach was constructed by concatenating the company’s name and city, search results are now generated by a query solely comprising the business name. This eliminates the presence of websites that only show a relationship with the firm’s city but not with the firm itself. Next, the results are filtered against several predefined irrelevant websites (e.g. ‘facebook’, ‘youtube’, ‘jobat’ etc.) as an extra measure to avoid the collection of faulty websites. Finally, a search result is extracted if the firm’s city or postal code is included in its crawled content.

Figure 2: Corporate website identification


2.1.2 Data collection

Once the correct corporate website is identified, all its subdirectories are downloaded and saved as HTML files. Next, each file is parsed, allowing the extraction of textual content free of markup tags and other irrelevant HTML objects. The collected information from all subdirectories is then bundled into one plain text document, which represents the entire textual content of a firm’s website. After this process is repeated for all websites, a text corpus is created, containing all the firms’ text documents. This structured collection of textual data facilitates further text mining operations.
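
A hedged sketch of this bundling step is given below; the structure of the downloaded files and the XPath filter are assumptions made for illustration:

    library(XML)
    library(tm)

    # Toy input: one locally saved page for a single firm
    tmp <- tempfile(fileext = ".html")
    writeLines("<html><body><p>Wij leveren groene energie. Contact: info@example.be</p></body></html>", tmp)
    site_files <- list(example_firm = tmp)   # in practice: all subdirectories per firm

    # Extract the visible text of every page and bundle it into one document per firm
    extract_site_text <- function(html_files) {
      pages <- sapply(html_files, function(f) {
        doc <- htmlParse(f, encoding = "UTF-8")
        txt <- xpathSApply(doc,
                           "//body//text()[not(ancestor::script) and not(ancestor::style)]",
                           xmlValue)
        paste(txt, collapse = " ")
      })
      paste(pages, collapse = " ")
    }

    firm_texts <- vapply(site_files, extract_site_text, character(1))
    corpus     <- VCorpus(VectorSource(firm_texts))   # text corpus of all firms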

2.2 Text mining

Despite having a structured set of text documents, further operations are needed to deal with the unstructured nature of the documents themselves. Text mining applications provide the techniques to automatically extract relevant information from unstructured written resources (Gupta & Lehal, 2009). In particular, they allow the transformation of a text corpus into a more meaningful text representation, suitable for statistical analysis. The text mining techniques used in this study are discussed in the next sections and are illustrated in Fig. 3.

2.2.1 Text preparation

In a first stage, several text cleansing procedures are conducted to prepare the text for the subsequent text representation stage. First, raw text cleansing is performed, encompassing the removal of numbers, punctuation, whitespace and special characters, as these bear no content information. Additionally, the text is converted into lower case, which avoids the occurrence of duplicate terms in the final text collection. The last step consists of removing extremely common words, also known as stop words, as they have little or no discriminative power with respect to the response variable (Thorleuchter, Van den Poel, & Prinzie, 2010). Several predefined lists of language-specific stop words exist and are used in this study for the stop word elimination process.
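
With the tm package, these cleansing steps could be sketched as follows (the Dutch stop word list below is an assumption, given the Belgian setting):

    library(tm)

    # Toy corpus standing in for the collection of firm documents
    corpus <- VCorpus(VectorSource(c(
      "Wij leveren 100% groene energie! Contacteer ons op 09 123 45 67.",
      "Een familiebedrijf met meer dan 25 jaar ervaring in de sector."
    )))

    corpus <- tm_map(corpus, content_transformer(tolower))    # lower case
    corpus <- tm_map(corpus, removeNumbers)                   # drop numbers
    corpus <- tm_map(corpus, removePunctuation)               # drop punctuation and special characters
    corpus <- tm_map(corpus, removeWords, stopwords("dutch")) # language-specific stop words
    corpus <- tm_map(corpus, stripWhitespace)                 # collapse whitespace

    as.character(corpus[[1]])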


Figure 3: Text mining stages

2.2.2 Text representation

After the text preparation stage, the remaining text has to be transformed into a representation that can be processed by computers. This is generally accomplished by employing the bag-of-words approach. This process converts a text collection into a vector space model where each document is represented as a vector with an entry for each term that occurs in the whole text collection. The values of the entries are determined by the number of times the terms appear in the specific document (Silva et al., 2003). Despite its simplicity, several experiments found that this approach does not perform worse than more sophisticated representation techniques (Apté, Damerau, & Weiss, 1994; Dumais, Platt, Heckerman, & Sahami, 1998; Lewis, 1992). Therefore, this study uses this method for the text representation process, though with the addition of a more advanced weighting scheme for the term values, as this significantly improves classification performance (Sparck Jones, 1972). The idea is that terms should be weighted according to their importance in the whole text collection, rather than to their occurrence in a single document. The term frequency is hence multiplied by the inverse document frequency (i.e. a coefficient that expresses the uniqueness of the term in the document collection). Finally, a normalization factor is added to ensure that each document has an equal chance to be retrieved regardless of its length. The term weighting formula is:

w = \frac{tf \times idf}{N} \qquad (1)

where tf is the term frequency, idf is the inverse document frequency, and N is the normalization factor (Salton & Buckley, 1988). The result is a weighted document-by-term matrix wherein each row represents a firm’s website and each column a term occurring in the cleansed textual content.

2.2.3 Dimensionality reduction

The weighted document-by-term matrix is a structured representation of the unstructured textual content, allowing machines to process it. However, in this case, the feature set first needs to be reduced, since the number of terms is too large to derive a pattern from the data. In order to solve this problem automatically, dimensionality reduction techniques can be applied to the original data set. The new feature space facilitates the construction of a much simpler model, improving the classifier’s performance and reducing the learning time (Eyheramendy & Madigan, 2005; Silva et al., 2003; Yang et al., 1997; Tang et al., 2005). In practice, two major types of dimensionality reduction techniques are commonly used.

Feature selection techniques aim at finding a subset of the most descriptive instances (Eyheramendy et al., 2005). For this research, a term filter is used to remove sparse terms. These are words exceeding a specific sparsity percentage, i.e. the percentage of documents in which the word does not occur. Because these terms bear little to no information with respect to the entire document collection, they are removed from the data, resulting in a subset of the most relevant features (see Fig. 4).

Figure 4: Term filtering
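
This sparse-term filter corresponds to tm’s removeSparseTerms function; the sparsity threshold used in this research is not specified here, so the 0.99 below is only an example:

    library(tm)

    # 'dtm' is the weighted document-by-term matrix from the previous sketch.
    # Drop every term that is absent from more than 99% of the documents
    # (illustrative threshold).
    dtm_filtered <- removeSparseTerms(dtm, sparse = 0.99)

    dim(dtm)            # dimensions before filtering
    dim(dtm_filtered)   # dimensions after filtering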

A more sophisticated dimensionality reduction technique exists that transforms the high-dimensional space into a subspace by means of a linear or nonlinear combination of the original features. This approach, also known as feature extraction, thus results in a set of newly created features (Tang et al., 2005). The dimensionality reduction technique used in this study is PCA, which orthogonally transforms the native feature space into a set of new variables that are the closest fit to the observations, hence maximizing the variance in the data (Wold, Esbensen, & Geladi, 1987). These new features are called principal components (PC) and are linear combinations of the original variables and their loadings, which describe the directions along which the data varies the most (see Fig. 5). The first principal component explains more variance in the data than the second principal component, and so forth, under the constraint that they are all orthogonal and thus uncorrelated. The values in the new feature space are obtained by the orthogonal projection of the original observations onto the principal components (Abdi & Williams, 2010). The score z for an observation i on principal component k is then:

z_{ik} = \sum_{j=1}^{p} \theta_{jk} x_{ij} \qquad (2)

where \theta_{jk} is the loading of PC_k on variable j and x_{ij} is the value of observation i for variable j.
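
In R, this projection is available through the base function prcomp; a sketch, continuing from the filtered document-by-term matrix of the previous snippet:

    # Continuing from the filtered document-by-term matrix of the previous sketch
    X <- as.matrix(dtm_filtered)

    # Centre the data and compute the principal components
    pca <- prcomp(X, center = TRUE, scale. = FALSE)

    scores   <- pca$x          # z_ik: projections of the firm websites onto the PCs
    loadings <- pca$rotation   # theta_jk: loadings of the PCs on the terms

    # The reduced feature space: the first k components (k = 19 is selected in Sect. 3.2)
    k <- 19
    reduced <- scores[, 1:min(k, ncol(scores)), drop = FALSE]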


Figure 5: Principal Component Analysis

The combination of term filtering and PCA results in the reduced feature set that is used to build the prediction model. This subspace is composed of a specific number of principal components together with the corresponding scores. Since the principal components group together related terms, they actually describe different concepts. The decision regarding the number of principal components is therefore very important, as these concepts represent the corporate website characteristics in this particular study. Too many principal components would result in the incorporation of irrelevant characteristics, while important characteristics might not be considered with too few principal components. The optimal number of principal components is determined by building and evaluating prediction models for several dimensions. This procedure is explained in Sect. 3.2.

2.3 Incorporating expert knowledge

The disadvantage that accompanies dimensionality reduction is the potential loss of information (Sebastiani, 2002). This can be countered by the incorporation of domain-specific expert knowledge (Baesens, Mues, Martens, & Vanthienen, 2009). In particular, business expertise can be translated into the construction of new variables that are expected to provide predictive power with respect to the response variable. The knowledge fusion of reduced data and domain expertise thus results in increased model accuracy (Martens et al., 2006).

In this research, the reduced feature space is augmented with predictors that are expected to relate to the success rate of customer conversion. Firstly, a dummy variable is created indicating whether or not contact information such as a telephone number, an email address or a contact form can be found on the firm’s website. This information is very valuable for sales representatives, as it allows them to contact their leads regarding the marketing offer. Secondly, a social media activity variable is constructed to indicate the extent to which companies are open to being approached by external parties. Firms whose websites contain hyperlinks to social media pages such as Facebook, Twitter or LinkedIn are marked as active on social media. Lastly, the region where the firms are established is extracted, as some firms prefer to do business within their own region. Table 1 gives an overview of all the variables used in this research.

Table 1: Variables used in research

Variable name   Description

Dependent variable
  Target        Binary variable indicating whether the company has successfully converted into a customer

Independent variables
  PC 1 … PC k   Principal components representing corporate website characteristics
  Contact       Binary variable indicating whether the company is contactable through its website
  Social media  Binary variable indicating whether the company is active on social media
  Region        Binary variable indicating whether the company is located in Flanders
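
A hedged sketch of how these three expert-knowledge variables could be derived from the crawled website text with simple pattern matching is given below; the exact rules applied in this research are not reported, so the patterns and city list are purely illustrative:

    # Illustrative inputs: one plain-text document and one city per firm
    firm_texts <- c("Contacteer ons via info@example.be of volg ons op facebook",
                    "Nous livrons de l'energie dans toute la Wallonie")
    firm_city  <- c("Gent", "Namur")

    # Partial, illustrative list of Flemish cities
    flanders_cities <- c("gent", "antwerpen", "brugge", "leuven", "hasselt")

    contact <- as.integer(grepl("contact|@|telefoon|tel\\.", firm_texts, ignore.case = TRUE))
    social  <- as.integer(grepl("facebook|twitter|linkedin", firm_texts, ignore.case = TRUE))
    region  <- as.integer(tolower(firm_city) %in% flanders_cities)   # 1 = located in Flanders

    expert_vars <- data.frame(Contact = contact, Social_media = social, Region = region)
    expert_vars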


2.4 Predictive modeling

The final stage consists of approximating the underlying relationship between a company’s website features and the probability of its conversion. This pattern can be used to classify new leads, improving marketing effectiveness. In this study, several prediction models are fit on the final data set and are combined into a hybrid ensemble. This technique uses a set of different learners and combines their predictions in order to classify new instances. Ensembles generally yield better performance than single classifiers for several reasons. First of all, the predictions of the classifiers in the ensemble are aggregated, which reduces the risk of misclassification. Secondly, different representable functions are combined, which improves the approximation of the true underlying function. Finally, several fitting procedures are applied, reducing the chance of getting stuck in local optima (Dietterich, 2000). In practice, there are two main strategies for creating ensembles. The first strategy is data-induced and consists of fitting a model on several manipulations of the original data. The obtained learners are combined in an aggregated model which classifies unseen observations based on the average predictions. This technique reduces the variance of a model, resulting in more reliable results. The algorithm-induced strategy focuses on the variation of the learning algorithms instead, and hence increases the diversity of the ensemble. The idea is that classifiers producing wrong votes are compensated by those that do make the right decisions (Banfield, Hall, Bowyer, & Kegelmeyer, 2005). This study combines both strategies into a hybrid ensemble in an effort to optimize the model fit (see Fig. 6). The different machine learning algorithms used to construct the hybrid ensemble are briefly discussed in the next sections.


Figure 6: Hybrid ensemble

2.4.1 Regularized logistic regression

Logistic regression is a parametric supervised learning technique especially suited for binary classification. The model makes use of the logit function to obtain predicted probabilities with respect to the response variable and is given by:

P(y = 1 \mid x) = \frac{e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}}{1 + e^{\beta_0 + \beta_1 x_1 + \dots + \beta_p x_p}} \qquad (3)

where \beta_0, \dots, \beta_p are the regression coefficients and x_1, \dots, x_p the predictor variables. The regression coefficients are estimated by maximum-likelihood estimation (Allison, 1999). In practice, logistic regression is very popular as it is a fairly easy, quick and robust modelling technique (DeLong, DeLong, & Clarke-Pearson, 1988; Greiff, 1998). In the case of a large number of predictors, however, logistic regression tends to overfit the data, describing the sample’s random noise instead of approximating the underlying relationship. In order to avoid this, the model can be regularized by shrinking the coefficients towards zero through the incorporation of a complexity penalty parameter (Le Cessie & Van Houwelingen, 1992). The result is a model comprising fewer predictors, which reduces complexity and avoids overfitting.
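
The glmnet package provides such penalized logistic regressions; the sketch below uses toy data and a ridge penalty (alpha = 0), which is an assumption, as the penalty type used in this research is not specified:

    library(glmnet)

    # Toy data standing in for the reduced feature set (principal components
    # plus expert-knowledge dummies) and the binary conversion target
    set.seed(1)
    x <- matrix(rnorm(200 * 22), nrow = 200)
    y <- rbinom(200, 1, 0.12)

    # Penalized logistic regression; the penalty strength lambda is tuned
    # by cross-validation
    cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)

    # Predicted conversion probabilities
    p_logit <- as.numeric(predict(cv_fit, newx = x, s = "lambda.min", type = "response"))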


2.4.2 Random forest

Random forest is a non-parametric modelling technique in which several decision trees are built upon multiple bootstrap samples (Breiman, 2001). This process creates a collection of similar training sets obtained by taking random samples with replacement from the original data. Each bootstrap sample is then used to fit a decision tree, which is grown by considering only a random subset of the available features at each splitting node. The result is a large set of decorrelated decision trees whose predictions are averaged. These characteristics make random forest a relatively stable and diverse ensemble (Breiman, 1996). Fig. 7 illustrates this modelling technique.

Figure 7: Random forest
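
The randomForest package implements this technique; a sketch on the same toy data as before:

    library(randomForest)

    set.seed(1)
    x <- matrix(rnorm(200 * 22), nrow = 200)
    y <- factor(rbinom(200, 1, 0.12))   # binary conversion target as a factor

    # 500 trees grown on bootstrap samples, with a random feature subset per split
    rf_fit <- randomForest(x = x, y = y, ntree = 500)

    # Class probabilities averaged over the trees
    p_rf <- predict(rf_fit, newdata = x, type = "prob")[, "1"]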

2.4.3 Rotation forest

Like random forest, rotation forest also creates an ensemble based on bootstrap aggregation and random feature selection. The algorithm generates a set of random feature subspaces and draws bootstrap samples from each of these subspaces. Next, PCA is applied to each bootstrap sample and decision trees are trained on these reconstructed feature sets. The resulting classifiers are combined into a final ensemble which is both diverse and accurate (Rodriguez, Kuncheva, & Alonso, 2006).


2.4.4 AdaBoost

Like the two previously explained methods, AdaBoost relies on data manipulation to generate predictions. The algorithm builds a weak learner (e.g. a decision tree) on the original data and iteratively improves its performance by fitting the learner on the same data multiple times, but with increased weights for incorrectly classified examples. This forces subsequent learners to focus on getting the more difficult cases right. The final model is a weighted sum of all the constructed classifiers, with their performances as weighting criteria (Freund & Schapire, 1999). Although this technique is quite slow in comparison with other classifiers, it manages to achieve significant performance improvements. In fact, Breiman (1996) called AdaBoost one of the best performing classifiers in the world. The procedure is shown in Fig. 8.

Figure 8: AdaBoost

2.4.5 Support Vector Machine

The last model included in the hybrid ensemble is the Support Vector Machine (SVM) (Vapnik, 1995; Vapnik, 1998a; Vapnik, 1998b). This algorithm searches for a hyperplane that optimally separates the data according to their class labels. Often, however, observations are not linearly separable. In that case, the observations can be mapped onto a higher-dimensional space which does allow linear separation. This dimensionality transformation is accomplished by applying a kernel function to the original data and is followed by the training of an SVM in the newly obtained feature space (Lodhi, Saunders, Shawe-Taylor, Cristianini, & Watkins, 2002). The result is an optimal hyperplane that maximizes the distance between the different groups of observations and can be reused for classifying new data that underwent the same transformation. Fig. 9 demonstrates this approach.

Figure 9: Support Vector Machine
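
The e1071 package offers an SVM implementation; the radial basis kernel in the sketch below is an assumption, as the kernel used in this research is not specified:

    library(e1071)

    set.seed(1)
    x <- matrix(rnorm(200 * 22), nrow = 200)
    y <- factor(rbinom(200, 1, 0.12))

    # The kernel function implicitly maps the data to a higher-dimensional space
    svm_fit <- svm(x = x, y = y, kernel = "radial", probability = TRUE)

    pred  <- predict(svm_fit, newdata = x, probability = TRUE)
    p_svm <- attr(pred, "probabilities")[, "1"]   # predicted conversion probabilities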

Finally, the different models are joined into a hybrid ensemble and combined with a specific combination rule. Predictions are often weighted according to the performances of the corresponding classifiers on a validation set. The higher a classifier’s validation performance, the higher its weight in future predictions. This performance-based voting can lead to a significant improvement of the ensemble’s accuracy, as it increases the influence of the models which best approximate the underlying relationship. With a small data set, however, one needs to be careful with applying such heuristics. The possibility exists that some models accidentally perform extremely well on the validation set but fail in general. In order to avoid this, a simpler combination rule can be applied in which each classifier is assigned an equal weight. Given the limited size of the extracted data set in this research, the single classifiers’ predictions are combined by a simple average.
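
This equal-weight combination rule amounts to averaging the probability vectors of the individual classifiers; a sketch, reusing the predictions from the previous snippets (an AdaBoost and a rotation forest member would be added in the same way):

    # Probability vectors of the single classifiers (see the previous sketches)
    member_probs <- cbind(logit = p_logit, rf = p_rf, svm = p_svm)

    # Simple average: every classifier receives an equal weight
    p_ensemble <- rowMeans(member_probs)

    # Rank leads by their predicted conversion probability
    ranked_leads <- order(p_ensemble, decreasing = TRUE)
    head(ranked_leads)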


2.5 Model evaluation criteria

Once the hybrid ensemble is constructed on training examples, it needs to be evaluated in order to determine whether company characteristics, extracted from corporate websites and uncovered by PCA, can be used to predict customer conversion probabilities. In this study, the performance of the prediction model is assessed by the area under the receiver operating characteristic curve (AUC) and the lift, two performance measures that are commonly used in practice.

The AUC is a measure that reduces the receiver operating characteristic (ROC) curve to a single figure (Hanley & McNeil, 1982). This curve visualizes the performance of a model by plotting the sensitivity versus (1 − specificity) for the entire range of decision thresholds. These performance measures can be derived from the confusion matrix, which contains the model’s TP (true positives), TN (true negatives), FP (false positives) and FN (false negatives). The sensitivity (TP/(TP+FN)), also called the true positive rate, is the proportion of positive examples that are correctly classified by the model. The specificity (TN/(TN+FP)), also called the true negative rate, is the proportion of negative examples that are correctly classified by the model. Since a model’s output comprises a list of examples ranked according to their predicted class probabilities, decision makers need to decide which group to target for further investigation. However, as this decision boundary varies, the previously mentioned performance measures vary as well. This means that a model’s performance cannot be determined for a single threshold, since it is unknown how these measures will evolve as the threshold is changed (Bradley, 1997). For this reason, evaluation measures need to be aggregated over all possible operating points in order to obtain a model’s overall accuracy. The AUC does this by representing the surface underneath the ROC curve and ranges from 0.5 to 1, with 0.5 corresponding to a random model and 1 to a perfect model (Hanley & McNeil, 1982). The actual meaning of this number is the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example. As the AUC reduces a model’s overall performance to a single figure, it can be used for model comparison and selection (Bradley, 1997). This research used the AUC for selecting the optimal dimension as well as for constructing a fine-tuned hybrid ensemble. These procedures are explained in detail in Sect. 3.2.
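
The pROC package, for instance, computes the ROC curve and the AUC directly from the predicted probabilities and the observed outcomes (toy data below; the package choice is an assumption):

    library(pROC)

    # Toy labels and scores standing in for the test set and the ensemble output
    set.seed(1)
    labels <- rbinom(382, 1, 0.12)
    scores <- runif(382) + 0.3 * labels   # scores mildly related to the labels

    roc_obj <- roc(response = labels, predictor = scores)
    auc(roc_obj)    # area under the ROC curve
    plot(roc_obj)   # sensitivity versus 1 - specificity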


Another criterion commonly used in practice is the lift, which is a measure of model effectiveness. In particular, the lift measures the ratio between a model’s response rate for a specific target group and the average response rate of the entire population. Suppose, for example, that 10% of an entire customer base are churners and that, within a particular segment selected by a model, 50% of the customers are churners. The model then yields a lift of 5 (50%/10%) for this segment. If this exercise is performed for every percentage of the population targeted, the cumulative lift curve is obtained, which indicates how well a model performs compared to the baseline. This can be particularly useful for marketing purposes, as it allows decision makers to segment their market and target those groups with a high density of positive responders. In this research, the cumulative lift curve represents the hybrid ensemble’s effectiveness in identifying the right leads for several decision thresholds (see Sect. 3.3).
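
A short sketch of the cumulative lift for a targeted fraction of the population, reusing the toy scores and labels from the previous snippet:

    # Cumulative lift for a targeted top fraction, given scores and 0/1 labels
    cumulative_lift <- function(scores, labels, fraction) {
      n_target <- ceiling(fraction * length(scores))
      top      <- order(scores, decreasing = TRUE)[1:n_target]
      mean(labels[top]) / mean(labels)   # response rate in target / average response rate
    }

    # The worked example from the text: 50% response rate against a 10% base rate
    0.5 / 0.1                                        # lift of 5

    # Top-decile lift on the toy data
    cumulative_lift(scores, labels, fraction = 0.10)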

3 EMPIRICAL VERIFICATION

3.1 Research data

In this paper, we used data obtained from an anonymous Belgian energy supplier. The data set was created based upon the results of a marketing campaign and contains the targeted leads’ business names and the locations of their establishments, as well as their binary success rates of customer conversion. In order to discover a relationship between these leads’ characteristics and their customer conversion rates, textual content from corporate websites was extracted. Of the 3507 targeted leads, 1284 corresponding corporate websites were identified, of which 1272 could be used for further research. The collection of unstructured web content was cleansed, transformed into a structured website-by-term matrix and augmented with predictors constructed through expert knowledge. These data were then randomly split into a training, validation and test set. The optimal dimension and fine-tuned model parameters were derived using the training and validation sets (see Sect. 3.2). The test set was used to estimate the performance of the final ensemble (see Sect. 3.3). The characteristics of the different data sets are summarized in Table 2.


                         Number of leads    Relative percentage
Training set
  Converted leads              62                12.18%
  Not-converted leads         447                87.82%
  Total                       509               100%
Validation set
  Converted leads              49                12.86%
  Not-converted leads         332                87.14%
  Total                       381               100%
Test set
  Converted leads              47                12.30%
  Not-converted leads         335                87.70%
  Total                       382               100%

Table 2: Characteristics of the marketing data

3.2 Optimal dimensionality and model selection

Once the websites were collected, they were transformed into a high-dimensional website-by-term matrix by means of the text preparation and representation phases. Subsequently, PCA was applied in an effort to construct a set of principal components describing corporate website characteristics. The final data set was eventually obtained by augmenting the PCA subspace with variables indicating whether companies were approachable by external parties through their websites or social media channels, along with the region of their establishments. For each dimension, a validation process was applied in which each model discussed in Sect. 2.4 was trained on the training set and evaluated on the validation set for several parameter configurations. The parameters that yielded the highest model performance on the validation set were retained together with the corresponding AUC. The result was a selection of optimized models per dimension, along with their validation AUCs. The averages of these AUCs were used to compare the dimensions and select the optimal number of principal components.
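
Schematically, this per-dimension validation loop could look as follows; the sketch uses toy data and a single penalized logistic regression as a stand-in for the full set of tuned classifiers of Sect. 2.4:

    library(glmnet)
    library(pROC)

    set.seed(1)
    n <- 400
    scores_all <- matrix(rnorm(n * 60), nrow = n)   # stand-in for the PCA scores
    y_all      <- rbinom(n, 1, 0.12)                # stand-in conversion labels
    train_idx  <- 1:250                             # arbitrary training/validation split

    candidate_k <- seq(5, 40, by = 5)
    val_auc <- sapply(candidate_k, function(k) {
      fit   <- cv.glmnet(scores_all[train_idx, 1:k], y_all[train_idx],
                         family = "binomial", alpha = 0)
      p_val <- predict(fit, newx = scores_all[-train_idx, 1:k],
                       s = "lambda.min", type = "response")
      as.numeric(auc(roc(y_all[-train_idx], as.numeric(p_val))))
    })

    candidate_k[which.max(val_auc)]   # dimension with the highest validation AUC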


As illustrated in Fig. 10, the model performance strongly increased up to 19 principal components. At that point, the performance reached an average validation AUC of 0.63. In the range of 20 to 24 principal components, the AUC dropped to approximately 0.59. From 25 principal components onwards, the performance recovered and started to fluctuate around an AUC of 0.63 for the remaining dimensions. A maximal AUC of 0.64 was reached at 34 principal components. However, as dimensionality increased, model complexity increased as well. Compared to the model built on a subspace containing 19 principal components, this model was much more complex while it hardly achieved a higher AUC. In general, a simpler model yielding approximately the same performance as a more complex one will perform better on future data. Furthermore, model complexity reduces the readability and the interpretability of the model (Baesens et al., 2009). Therefore, 19 principal components were used to represent the customer conversion related characteristics.

Figure 10: Model performance (average validation AUC) as a function of the number of principal components (company characteristics)


3.3 Results

The optimized parameters and the most discriminative company characteristics obtained in the dimensionality selection stage were used to build the final hybrid ensembles. A basic model was built upon the feature space solely containing the corporate website characteristics. An extended model was built upon the same feature space, but augmented with the predictors that resulted from expert knowledge. Both ensembles were fit on the fusion of the training and validation sets and were evaluated on the test examples by comparing the companies’ predicted conversion probabilities against their actual success rates of customer conversion. Finally, both models were compared with each other in order to assess the predictive leverage of the expert knowledge variables.

Fig. 11 and Fig. 12 illustrate that the two models perform better than the random model, as both figures delineate model performance curves that are situated above the baseline. These improvements were significant (Z = 4.4085 and p < 0.001 for the basic model, Z = 4.5953 and p < 0.001 for the extended model). This means that both models succeeded in uncovering a relationship between a company’s website characteristics and whether or not it responded positively to the marketing campaign. Both figures also show that the model built on the augmented feature space outperformed the model that was exclusively fit on corporate website characteristics. The addition of the expert knowledge predictors to the PCA subspace increased the AUC from 0.656 to 0.694, which translates into an ROC curve located further from the random model. The extended model’s overall ability to distinguish promising leads from low-potential leads is thus higher than that of the model without the self-constructed features. This is especially the case for the first five deciles, where the extended model’s cumulative lift curve lies far above the basic model’s curve. In the first decile, the lift increased from 1.283 to 2.139, meaning that the hybrid ensemble built on a feature space comprising corporate website characteristics and expert knowledge predictors is able to identify approximately 21% of the positive responders in the top 10 percent of the entire population. Targeting the top 30 percent would result in the identification of approximately 53% of all positive responders, as the extended ensemble’s lift reaches a value of 1.77 in the third decile.


Figure 11: Cumulative lift curves (cumulative lift versus targeted population bucket, 10%–90%, for the model with and the model without expert knowledge)

Figure 12: ROC curves (sensitivity versus 1 − specificity for the model with expert knowledge, the model without expert knowledge and the random model)


                  Model with expert knowledge    Model without expert knowledge
AUC                          0.694                           0.656
Top-decile lift              2.139                           1.283

Table 3: AUC and top-decile lift

4 CONCLUSION

Customer acquisition is a time-consuming and cost-intensive process, as only certain leads will actually convert (Cooper & Budd, 2007; Patterson, 2007; Yu & Cai, 2007). Consequently, companies often focus on doing business with existing customers rather than searching for new ones on a regular basis (Rygielski et al., 2002). Today, this marketing strategy may no longer suffice. On the one hand, the increasingly competitive environment provides customers with valuable alternatives to fulfil their sophisticated needs (Shaw et al., 2001). On the other hand, customers become more informed about these competitive offerings due to the flourishing of the World Wide Web (Rygielski et al., 2002). Attracting new customers thus becomes a critical success factor for modern organizations (Thorleuchter et al., 2012; Wilson, 2006). In a B2B context, web crawling activities could facilitate the acquisition process, as companies nowadays often provide information concerning their specific businesses on their websites (Thorleuchter et al., 2012). This information could be valuable to determine whether a company would be a suitable target for further acquisition efforts.

In this study, we set out to determine (1) whether textual content extracted from corporate websites can be used to identify promising leads and (2) whether incorporating expert knowledge improves the acquisition model. The data we used concerned a Belgian company's marketing campaign in a B2B context. This enabled us to crawl the companies' websites and to analyze whether features extracted from their web content related to their actual campaign responses. The websites were collected by means of a multilayer web crawling algorithm. Text mining steps such as text preparation and representation were used to transform the unstructured web content into a structured and more manageable form. PCA was subsequently applied in a dimension reduction phase, resulting in a feature subspace that represents groups of frequently co-occurring terms. Lastly, we incorporated expert knowledge by investigating whether companies were approachable through their website or social media, as this may facilitate acquisition activities. In addition, we determined the region of the leads' establishments, as this could influence their preferences towards certain business relations.
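
To make this pipeline concrete, the sketch below shows one way the text-preparation and dimension-reduction steps could look in R with the tm package and base R's prcomp. The object names, the stop word language and the number of retained components are illustrative assumptions, not the exact settings used in this research.

    # A minimal sketch, assuming `pages` holds one string of crawled text per
    # company; names, the stop word list and the number of components are
    # illustrative.
    library(tm)

    corpus <- VCorpus(VectorSource(pages))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, removeWords, stopwords("dutch"))
    corpus <- tm_map(corpus, stripWhitespace)

    # Structured representation: tf-idf weighted document-term matrix,
    # dropping terms that occur in almost no documents
    dtm <- DocumentTermMatrix(corpus, control = list(weighting = weightTfIdf))
    dtm <- removeSparseTerms(dtm, 0.99)

    # Dimension reduction with PCA (base R); keep the leading components
    pca    <- prcomp(as.matrix(dtm))
    scores <- pca$x[, 1:50]   # illustrative number of retained components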

This research showed that a well-chosen set of company characteristics, derived from textual information extracted from corporate websites, does provide discriminative power with respect to the success rate of customer conversion. The underlying relationship was uncovered by a hybrid ensemble constructed by combining several diverse machine learning algorithms. The ensemble's ability to identify promising leads improved further when the companies' characteristics were augmented with predictors derived from domain expertise. The extended ensemble succeeded in detecting 53% of all positive responders by targeting 30% of the entire population. As a result, the framework presented in this research could assist B2B sales representatives in identifying promising leads. It enables marketers to run a targeted marketing approach that is more effective and efficient, which could be especially beneficial for businesses confronted with low conversion rates or constrained by limited marketing budgets.
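
As an illustration of what combining diverse learners into a hybrid ensemble can look like, the sketch below averages the predicted probabilities of two of the model families used in this research, a random forest and a regularized logistic regression. The data frames train and test, the response column y and the simple averaging rule are assumptions for illustration; they are not the exact member models or combination scheme of the thesis.

    # A minimal sketch, assuming `train` and `test` contain the retained
    # principal components plus a 0/1 response column `y`; the member models
    # and the averaging rule are illustrative.
    library(randomForest)
    library(glmnet)

    x_train <- as.matrix(train[, setdiff(names(train), "y")])
    x_test  <- as.matrix(test[,  setdiff(names(test),  "y")])

    rf <- randomForest(x_train, as.factor(train$y), ntree = 500)
    lr <- cv.glmnet(x_train, train$y, family = "binomial")

    p_rf <- predict(rf, x_test, type = "prob")[, 2]                  # P(y = 1)
    p_lr <- as.numeric(predict(lr, x_test, s = "lambda.min",
                               type = "response"))

    # Hybrid ensemble: average the members' predicted probabilities
    p_ensemble <- (p_rf + p_lr) / 2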

5 LIMITATIONS AND FURTHER RESEARCH

This research was conducted in a specific B2B context, based on the data of a Belgian energy supplier. Similar analyses should be carried out in different market settings in order to generalize the findings of this study. Additionally, we augmented the PCA subspace with three variables that were expected to relate to a company's conversion probability. Further research could investigate whether more discriminative variables exist to predict a lead's marketing response.

Furthermore, the machine learning algorithms used to construct the hybrid ensemble are not restricted to the ones applied in this research. Other experiments may give more insight into which combination of machine learners and parameter settings yields the best performance. We would also like to stress that we used principal component analysis for the dimension reduction step. Other research could investigate whether alternative feature extraction techniques, such as latent semantic indexing, yield better results.
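
For readers who want to explore the latent semantic indexing alternative, the sketch below applies a truncated singular value decomposition to the document-term matrix using base R only. The matrix dtm and the number of latent dimensions are assumptions carried over from the earlier sketch, not part of this study's own implementation.

    # A minimal sketch of latent semantic indexing, assuming `dtm` is the
    # (tf-idf weighted) document-term matrix built earlier; `k` is an
    # illustrative number of latent dimensions.
    X   <- as.matrix(dtm)
    k   <- 50
    dec <- svd(X, nu = k, nv = k)

    # Document coordinates in the latent semantic space: U_k %*% Sigma_k
    lsi_scores <- dec$u %*% diag(dec$d[1:k])

These scores could then replace the PCA components as inputs to the same ensemble, allowing a direct comparison of the two feature extraction techniques.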

ACKNOWLEDGEMENTS

We would like to thank the anonymous Belgian energy supplier for providing us with the data set that made this research possible. In addition, we would like to thank Prof. Dirk Van den Poel, Jeroen D'Haen and Tijl Carpels for their support and suggestions during this study. For this research, we used R as programming language and software environment, as it is one of today's leading tools for data analysis. Furthermore, R is freely available, platform-independent and open source, and it has a large community of users, resulting in an extensive range of packages contributed by experts in their respective fields. The RCurl package was used in combination with the XML package for the web crawling activities: the former allows one to retrieve information resources from the web, whereas the latter provides the functionality to parse the retrieved web documents. The tm package was used for several text mining purposes such as text cleansing, representation and term filtering. Principal component analysis was performed with functions available in base R. The hybrid ensemble was built from models implemented in the packages ada, glmnet, randomForest, rotationForest and e1071. Finally, model performance was evaluated by means of the AUC, lift and pROC packages.
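
As a small illustration of how the RCurl and XML packages interact, the sketch below retrieves and parses a single page. The URL is a placeholder, and the multilayer link-following logic of the actual crawler is omitted.

    # A minimal sketch of fetching and parsing one page with RCurl and XML;
    # the URL is a placeholder and the multilayer crawl is omitted.
    library(RCurl)
    library(XML)

    url  <- "http://www.example.com"            # placeholder address
    html <- getURL(url, followlocation = TRUE)  # RCurl: download the raw HTML
    doc  <- htmlParse(html, asText = TRUE)      # XML: build a parse tree

    # Visible text for the text mining step, and links for the next crawl layer
    page_text <- xpathSApply(doc, "//body//text()", xmlValue)
    links     <- xpathSApply(doc, "//a/@href")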

BIBLIOGRAPHY

Abdi, H., & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433-459.
Allison, P. D. (1999). Logistic Regression Using the SAS System: Theory and Application. Cary: SAS Institute Inc.
Apté, C., Damerau, F., & Weiss, S. M. (1994). Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS), 12(3), 233-251.
Baecke, P., & Van den Poel, D. (2010). Improving purchasing behavior predictions by data augmentation with situational variables. International Journal of Information Technology & Decision Making, 9(6), 853-872.
Baesens, B., Mues, C., Martens, D., & Vanthienen, J. (2009). 50 years of data mining and OR: upcoming trends and challenges. Journal of the Operational Research Society, S16-S23.
Banfield, R. E., Hall, L. O., Bowyer, K. W., & Kegelmeyer, W. P. (2005). Ensemble diversity measures and their application to thinning. Information Fusion, 6(1), 49-62.
Bauer, H. H., Grether, M., & Leach, M. (2002). Building customer relations over the Internet. Industrial Marketing Management, 31(2), 155-163.
Bose, I., & Mahapatra, R. K. (2001). Business data mining—a machine learning perspective. Information & Management, 39(3), 211-225.
Bradley, A. P. (1997). The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recognition, 30(7), 1145-1159.
Breiman, L. (1996). Bagging predictors. Machine Learning, 24(2), 123-140.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.
Cooper, M. J., & Budd, C. S. (2007). Tying the pieces together: A normative framework for integrating sales and project operations. Industrial Marketing Management, 36(2), 173-182.
Coussement, K., & Van den Poel, D. (2009). Improving customer attrition prediction by integrating emotions from client/company interaction emails and evaluating multiple classifiers. Expert Systems with Applications, 36(3), 6127-6134.
DeLong, E. R., DeLong, D. M., & Clarke-Pearson, D. L. (1988). Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. Biometrics, 44(1), 837-845.

D'Haen, J., Van den Poel, D., & Thorleuchter, D. (2013). Predicting customer profitability during acquisition: Finding the optimal combination of data source and data mining technique. Expert Systems with Applications, 40(6), 2007-2012.
Dietterich, T. G. (2000). Ensemble methods in machine learning. In Kittler, J., & Roli, F. (Eds.), Proceedings of the First International Workshop on Multiple Classifier Systems (pp. 1-15). Berlin: Springer.
Dumais, S., Platt, J., Heckerman, D., & Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In Makki, K., & Bouganim, L. (Eds.), Proceedings of the Seventh International Conference on Information and Knowledge Management (pp. 148-155). New York: ACM Press.
Dwyer, F. R., Schurr, P. H., & Oh, S. (1987). Developing buyer-seller relationships. The Journal of Marketing, 51(1), 11-27.
Eyheramendy, S., & Madigan, D. (2005). A novel feature selection score for text categorization. In Proceedings of the Workshop on Feature Selection for Data Mining, in conjunction with the 2005 SIAM International Conference on Data Mining (pp. 1-8).
Freund, Y., & Schapire, R. (1999). A short introduction to boosting. Journal of Japanese Society for Artificial Intelligence, 14(5), 771-780.
Gentsch, P., & Hänlein, M. (1999). Text mining. Das Wirtschaftsstudium (WiSu), 28(1), 1646-1653.
Greiff, W. R. (1998). A theory of term weighting based on exploratory data analysis. In Croft, W. B., Moffat, A., van Rijsbergen, C. J., Wilkinson, R., & Zobel, J. (Eds.), Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 11-19). New York: ACM Press.
Gupta, V., & Lehal, G. S. (2009). A survey of text mining techniques and applications. Journal of Emerging Technologies in Web Intelligence, 1(1), 60-76.
Hanley, J. A., & McNeil, B. J. (1982). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1), 29-36.
Hill, L. (1999). CRM: easier said than done. Intelligent Enterprise, 2(18), 53-55.
Hotho, A., Nürnberger, A., & Paaß, G. (2005). A brief survey of text mining. Journal for Computational Linguistics and Language Technology, 20(1), 19-62.
Isaac, S., & Tooker, R. N. (2001). The many faces of CRM. LIMRA's MarketFacts Quarterly, 20(1), 84-88.
Kosala, R., & Blockeel, H. (2000). Web mining research: A survey. SIGKDD Explorations, 2(1), 1-15.

Kothandaraman, P., & Wilson, D. T. (2000). Implementing relationship strategy. Industrial Marketing Management, 29(4), 339-349.
Le Cessie, S., & Van Houwelingen, J. C. (1992). Ridge estimators in logistic regression. Applied Statistics, 41(1), 191-201.
Lewis, D. D. (1992). An evaluation of phrasal and clustered representations on a text categorization task. In Proceedings of SIGIR-92, 15th ACM International Conference on Research and Development in Information Retrieval (pp. 37-50). New York: ACM Press.
Ling, R., & Yen, D. C. (2001). Customer relationship management: An analysis framework and implementation strategies. The Journal of Computer Information Systems, 41(3), 82-97.
Lodhi, H., Saunders, C., Shawe-Taylor, J., Cristianini, N., & Watkins, C. (2002). Text classification using string kernels. The Journal of Machine Learning Research, 2(1), 419-444.
Martens, D., De Backer, M., Haesen, R., Baesens, B., Mues, C., & Vanthienen, J. (2006). Ant-based approach to the knowledge fusion problem. In Dorigo, M., Gambardella, L., Birattari, M., Martinoli, A., Poli, R., & Stützle, T. (Eds.), Ant Colony Optimization and Swarm Intelligence, Fifth International Workshop (pp. 84-95). Berlin: Springer.
Ngai, E. W., Xiu, L., & Chau, D. C. (2009). Application of data mining techniques in customer relationship management: A literature review and classification. Expert Systems with Applications, 36(2), 2592-2602.
Patterson, L. (2007). Marketing and sales alignment for improved effectiveness. Journal of Digital Asset Management, 3(4), 185-189.
Petrison, L. A., Blattberg, R. C., & Wang, P. (1997). Database marketing: Past, present, and future. Journal of Direct Marketing, 11(4), 109-125.
Reinartz, W. J., & Kumar, V. (2003). The impact of customer relationship characteristics on profitable lifetime duration. Journal of Marketing, 67(1), 77-99.
Rodriguez, J. J., Kuncheva, L. I., & Alonso, C. J. (2006). Rotation forest: A new classifier ensemble method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1619-1630.
Rygielski, C., Wang, J. C., & Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in Society, 24(4), 483-502.
Salton, G., & Buckley, C. (1988). Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5), 513-523.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34(1), 1-47.

Shankaranarayanan, G., & Cai, Y. (2005). A web services application for the data quality management in the B2B networked environment. In Proceedings of the 38th Hawaii International Conference on System Sciences (pp. 1-10).
Shaw, M. J., Subramaniam, C., Tan, G. W., & Welge, M. E. (2001). Knowledge management and data mining for marketing. Decision Support Systems, 31(1), 127-137.
Silva, C., & Ribeiro, B. (2003). The importance of stop word removal on recall values in text categorization. In Proceedings of the International Joint Conference on Neural Networks (pp. 1661-1666).
Sparck Jones, K. (1972). A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation, 28(1), 11-21.
Stumme, G., Hotho, A., & Berendt, B. (2006). Semantic web mining: State of the art and future directions. Web Semantics: Science, Services and Agents on the World Wide Web, 4(2), 124-143.
Tang, B., Shepherd, M., Milios, E., & Heywood, M. I. (2005). Comparing and combining dimension reduction techniques for efficient text clustering. In Proceedings of the SIAM International Workshop on Feature Selection for Data Mining (pp. 17-26).
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2010). Mining innovative ideas to support new product research and development. In Locarek-Junge, H., & Weihs, C. (Eds.), Classification as a Tool for Research (pp. 587-594). Berlin: Springer.
Thorleuchter, D., Van den Poel, D., & Prinzie, A. (2012). Analyzing existing customers' websites to improve the customer acquisition process as well as the profitability prediction in B-to-B marketing. Expert Systems with Applications, 39(3), 2597-2605.
Ulaga, W., & Chacour, S. (2001). Measuring customer-perceived value in business markets: a prerequisite for marketing strategy development and implementation. Industrial Marketing Management, 30(6), 525-540.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Vapnik, V. (1998a). Statistical Learning Theory (Vol. 1). New York: Wiley.
Vapnik, V. (1998b). The support vector method of function estimation. In Suykens, J. A. K., & Vandewalle, J. (Eds.), Nonlinear Modeling: Advanced Black-box Techniques (pp. 55-85). Boston: Kluwer Academic Publishers.
Wilson, R. D. (2006). Developing new business strategies in B2B markets by combining CRM concepts and online databases. Competitiveness Review: An International Business Journal, 16(1), 38-43.

Wold, S., Esbensen, K., & Geladi, P. (1987). Principal component analysis. Chemometrics and Intelligent Laboratory Systems, 2(1-3), 37-52.
Yang, Y., & Pedersen, J. O. (1997). A comparative study on feature selection in text categorization. In Kaufmann, M. (Ed.), 14th International Conference on Machine Learning (pp. 412-420).
Yu, Y. P., & Cai, S. Q. (2007). A new approach to customer targeting under conditions of information shortage. Marketing Intelligence & Planning, 25(4), 343-359.