Stephen J. Stose April 18, 2011 IST 565: Final Project
Web classification of Digital Libraries using GATE Machine Learning

Introduction

Text mining is considered by some to be a form of data mining that operates on unstructured and semi-structured texts. It applies natural language processing models to analyze textual content in order to extract and generate actionable (i.e., potentially useful) knowledge from the information inherent in words, sentences, paragraphs and documents (Witten, 2005). However, many of the linguistic patterns that are easy for humans to comprehend and reproduce end up being astonishingly complicated for machines to process. For instance, machines struggle to interpret natural language forms that are quite simple for most humans, such as metaphor, misspellings, irregular forms, slang, irony, verbal tense and aspect, anaphora and ellipsis, and the context that frames meaning. On the other hand, humans lack a computer's ability to process large volumes of data at high speed. The key to successful text mining is to combine these assets into a single technology.

There are many uses of this new interdisciplinary effort at mining unstructured texts for the discovery of new knowledge. For instance, some techniques attempt to extract structure to fill out templates (e.g., address forms) or to extract key phrases as a form of document metadata. Others attempt to summarize the content of a document, identify a document's language, classify the document into a pre-established taxonomy, or cluster it with similar documents based on token or sentence similarity (see Witten, 2005 for others). Further techniques include concept linkage, whereby concepts across swathes of scientific research articles can be linked to elucidate new hypotheses that would not otherwise occur to humans, as well as topic tracking and question answering (Fan et al., 2006).

Consider the implications of being able to automatically classify text documents.
Given the massive size of the World Wide Web and all it contains (e.g., news feeds, e-mail, medical and corporate records, digital libraries, journal and magazine articles, blogs), imagine the practical consequences of training machines to automatically categorize this content. Indeed, text classification algorithms have already had moderate success in cataloging news articles (Joachims, 1998) and web pages (Nigam, McCallum, Thrun & Mitchell, 1999). Some text mining systems have even been incorporated into digital library systems (e.g., the Greenstone DAM), such that users benefit from digital library items automatically co-referenced by means of semantic annotations (Witten, Don, Dewsnip & Tablan, 2004).

Natural language pre-processing for text and document classification

Text and document classification make use of natural language processing (NLP) technology to pre-process, encode and store linguistic features of texts and documents, and then to process selected features using Machine Learning (ML) algorithms that are subsequently applied to a new set of texts and documents. The first step in this process usually involves tokenization, which removes punctuation marks, tabs, and other non-textual characters and replaces them with white space. This produces a stream of word tokens that forms the data set upon which further processing occurs. From this stream,
a filter is usually applied to remove from this set of tokens all stop-words (e.g., prepositions, articles, conjunctions) that otherwise provide little if any meaning.

In a related vein, tokens are not always the same as words per se. Tokenization may wrongly insert white space inside two- and three-word tokens: "New York" should be considered one token, not two ("New" and "York"). Hyphens and apostrophes present difficult challenges. Often words like "don't" are tokenized into two separate words, "do" and "n't", the latter of which is later transduced as "n't" = "not". When considering all the continually changing conventions used to display words as text, one begins to appreciate the multitude of problems.

Often, pre-processing can stop here, as many text and document classification methods rely on simple tokenization, such that each token represents one term amongst a bag of other words occurring within each document and across all documents in the corpus. One common approach to determining word importance within a bag-of-words is term frequency-inverse document frequency (tf-idf). In this approach, each document is represented as a vector of terms, each term encoded in binary form (1 if the term occurs, 0 if it does not), upon which weighting schemes give more weight to terms occurring frequently within relevant documents but infrequently across all documents considered together. In a corpus of documents about political parties, for instance, the word "political" may occur often in relevant documents, but its weight would be low given that it also occurs frequently in all the other documents within the corpus. This renders the term rather meaningless when trying to distinguish relevant from non-relevant documents, as they are all about something political.
If the word "suffrage" occurs frequently in relevant documents, on the other hand, but rarely across the corpus, its specificity and hence its weight for determining document type is considered much greater. This is the reason tf is balanced with (i.e., multiplied by) idf, a factor that diminishes the weight of frequent terms and increases the weight of rare ones (for the mathematics of such an approach, see Hotho, Nurnberger & Paass, 2005). In this way a set of documents can be mined for keywords. If all of the documents within our corpus relate to political parties, the word "political" hardly qualifies as a keyword. Words that occur frequently within only a subset of documents serve to categorize content. As such, if the word "suffrage" occurs frequently in some documents but not in all of them, it qualifies as a good candidate keyword for classifying the relevant text. A good text-mining program utilizing the tf-idf weighting scheme would be able to extract this term and present it to a human as a possible keyword.

These weighting schemes are applied within vector space models in order to retrieve, filter and index terms occurring in documents (Salton, Wong & Yang, 1975). Such models form the basis of many search and indexing engines (e.g., Apache Lucene) insofar as the HTML content of each Web page is crawled and indexed to determine its relevance based on words and phrases occurring within the <title> and <heading> elements, among other ways (see Chau & Chen, 2008). Still, a bag-of-words approach to text and document mining can be improved upon by incorporating domain knowledge from experts into the analysis. For instance, experts can identify domain-specific words, phrases and/or rules. If a document or Web page is checked against a dictionary of these listed features, those documents containing the features will be deemed more relevant to the search.
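To make the tf-idf idea concrete, here is a minimal stdlib Python sketch; the three-document corpus and the stop-word list are invented for illustration and echo the "political"/"suffrage" example:

```python
import math
import re

# Hypothetical three-document corpus; every document mentions "political",
# but only one mentions "suffrage".
corpus = [
    "the political party held a political rally",
    "a political debate about suffrage and suffrage reform",
    "political news from the capital",
]
STOP_WORDS = {"the", "a", "and", "about", "from"}

def tokenize(text):
    """Lowercase, strip punctuation, drop stop-words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

docs = [tokenize(d) for d in corpus]

def tf_idf(term, doc, docs):
    """Raw term frequency multiplied by inverse document frequency."""
    tf = doc.count(term)
    df = sum(1 for d in docs if term in d)
    idf = math.log(len(docs) / df)
    return tf * idf

# "political" occurs in every document, so idf = log(3/3) = 0 and its weight
# vanishes; "suffrage" is concentrated in one document and gets positive weight.
```

This is only the simplest raw-count variant; production systems normalize term frequency and smooth the idf factor in various ways.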
Such dictionary checking often occurs after tokenization in many kinds of NLP software (e.g., GATE). That is, tokenized words are mapped to an internal
gazetteer (an internal dictionary), which operates as a sort of pre-classification, such that commonly occurring or well-known entities are extracted and annotated as such. For instance, a gazetteer might by default be outfitted to recognize common first names and surnames (Noam or Bradley; Chomsky or Manning), organizations (UN, United Nations, OPEC, White House, Planned Parenthood) or date formats (02/10/1973 or February 10, 1973). The selection of these kinds of annotations thus constrains the set of words chosen to represent documents in vector space models. If we want to ensure a domain-specific vocabulary is annotated as relevant to text or document classification, we might create a separate space for those terms and annotate each term as belonging to a particular category. As described later, we created a gazetteer of terms most likely to occur on Web sites functioning as digital libraries, such that a random Web site containing these terms would with higher likelihood be classified as relevant.

Other forms of linguistic pre-processing exist which may or may not enhance document and text classification algorithms, depending on the nature and specificity of the task. For instance, sentence splitters chunk tokens into sentence spans when phrases are an important feature in classification. At times, tagging each term within a document with its part-of-speech (POS tagging) is important; for instance, it allows for the classification of documents into language groups (e.g., Spanish vs. English vs. German) or sentence types. Given that language is full of ambiguity, of which we will only scratch the surface here, Named-Entity (NE) transducers ease the confusion by contextualizing certain tokens. For instance, General Motors can be recognized as a company, and not as the name of a military officer (e.g., General Lee). Likewise, "May 10" is a date, "May Day" is a holiday, "May I leave the room" is a request, and "Sallie May Jones" is a person.
That is, the transducer disambiguates homographs, homonyms and other such linguistic confusions. Another common problem in pre-processing is co-reference matching. Often, the same entity is known in different ways or by different spellings: "center" is the same as "centre"; NATO is the same entity as the North Atlantic Treaty Organization; and Mr. Smith is the same person as Joachim Smith is the same person as "he" or "him" (e.g., "Joachim Smith went to town. Everyone greeted him as Mr. Smith and he didn't care for that"). This matters when computing frequency weights in vector space models, as two different tokens referencing the same entity should be co-referenced as one entity occurring with frequency = 2, not two entities each with frequency = 1.

Basic classification models

Most classification models are forms of supervised learning in that each input value (e.g., a word vector) is paired with an expected discrete output value (i.e., the pre-defined category). The supervised algorithm in training analyzes these pairings to produce an inferred classifier function, and thereby in testing is able to predict the output value (i.e., the correct classification) for any new valid input. One common use in document classification is training a classifier to automatically sort Web pages into a pre-established taxonomy of categories (e.g., sports, politics, art, design, poetry, automobiles). The accuracy of the trained function in correctly classifying the test set is then computed as a performance measure, each document falling within the expected class to some degree. Herein we establish a trade-off between recall and precision. High
precision implies a high threshold for allowing membership into a class. In this way, the algorithm refuses to accept many false positives, but in doing so sacrifices its ability to recall an otherwise larger set of documents, and thus risks missing many relevant documents (i.e., they become false negatives). On the other hand, if a threshold permitting high recall is set, we risk lowering our rate of precision and thus allow into the set many documents not relevant to the category (i.e., false positives). The F1-score serves as a statistical compromise (the harmonic mean) between recall and precision.

For the mathematical details of many of the classification algorithms, we defer to Hotho, Nurnberger and Paass (2005), but here outline the rudimentary basics of the four most common algorithms: Naïve Bayes, k-nearest neighbor, decision trees, and support vector machines (SVM). Naïve Bayes combines class priors with the conditional probability that each term t1,…,tn of document d occurs in a given class, P(ti|classj), assuming the terms are independent. Documents whose probability reaches a pre-established threshold are deemed as belonging to the category. Instead of building a probability model, the k-nearest neighbor method of classification is an instance-based approach that operates on the similarity of a document's k nearest neighbors. Using word-vectors stored as document attributes and document labels as the class, most computation occurs at testing time, whereby class labels are assigned based on the k most frequent training samples nearest to the document to be classified. Decision trees (e.g., C4.5) operate by recursively building a hierarchy of word selections. From labeled documents, term t is selected as the best predictor of the class according to the amount of information gain.
The tree splits into subsets, one branch of documents containing the term and the other without, only to then find the next term to split on; this is applied recursively until all documents in a subset belong to the same class. Support vector machines (SVM) represent each document as a weighted vector td1,…,tdn based on word frequencies within each document. SVM determines a maximum-margin hyperplane that separates positive (+1) class examples from negative (-1) class examples in the training set. Only a small fraction of documents serve as support vectors, and any new document is classified as belonging to the class if its decision value is greater than 0, and as not belonging to the class if it is less than 0. SVMs can be used with linear or polynomial kernels, the latter transforming the space to ensure the classes can be separated linearly. While the performance of each of these classifiers depends on the kind of classification task, the SVM algorithm most reliably outperforms other kinds of algorithms on document classification (Sebastiani, 2002), and thus will be utilized with priority in the study that follows.

Goals and objectives of current study

For our own purposes, we focus on the domain of web classification in order to achieve a twofold purpose: 1) to learn and teach my colleagues about the natural language processing
suite known as GATE (General Architecture for Text Engineering), especially with regard to its Machine Learning (ML) capabilities; and 2) to utilize the GATE architecture to classify web documents into two groups: those sites that function as digital library sites (DL) as distinguished from all other, non-digital library sites (non-DL). The purpose of such an exercise is to identify, from amongst the millions of websites, only those sites that operate as digital library sites. Assuming digital library sites are identifiable through certain characteristic earmarks that distinguish them as containing searchable digital collections, the goal is to develop a set of annotations that, by way of an ML algorithm, can be applied as part of a web crawler in order to extract the URL of each site that qualifies as belonging to the DL group, while omitting those that do not. While we are somewhat confident that strong recall over many of the relevant sites is possible, sites that merely seem relevant (e.g., those about digital libraries rather than digital libraries themselves) are of greater concern; these false positives would erode precision.

The current author is developing as a prototype a website (www.digitallibrarycentral.com) that seeks to operate as a digital library of all digital library websites: a sort of one-stop visual reference library that points to the collection of all digital libraries. Achieving the goal outlined here would serve to populate this site. Before such grand ideals can be implemented, however, the current paper will outline some of the first steps in applying the GATE ML architecture towards this objective. Of immediate concern is understanding the GATE architecture and how it functions in natural language processing tasks, so that we can properly pre-process and annotate our target corpora before carrying out ML algorithms on them. We turn now to an explanation of the GATE architecture.
The GATE architecture and text annotation

GATE (General Architecture for Text Engineering) is a set of Java tools developed at the University of Sheffield for natural language processing and text engineering tasks in various languages. At its core is an information extraction system called ANNIE (A Nearly-New Information Extraction System), a set of functions that operates on individual documents (including XML, TXT, DOC, PDF, database and HTML formats) and across the corpora to which many documents can belong. These functions include a tokenizer, a gazetteer, a sentence splitter, a part-of-speech tagger, a named-entity transducer, and co-reference tagging, among others. GATE also boasts extensive tools for RDF and OWL metadata annotation for creating ontologies for use within the Semantic Web.

Most of these language processes operate seamlessly within GATE Developer's integrated development environment (IDE) and graphical user interface (GUI), the latter allowing users to visualize these functions within a user-friendly environment. For instance, a left-sidebar resource tree displays the Language Resources panel, where documents and document sets (the corpus) reside. Below that, it displays the ANNIE Processing Resources (PR), the natural language processing functions mentioned above that form part of an ordered pipeline to linguistically pre-process the documents. A right sidebar illustrates the resulting color-coded annotation lists after pipeline processing. Additionally, a bottom table exposes the various resulting annotation attributes, as well as a popup annotation editor that allows one to edit and classify (i.e., provide values to) these
annotation sets for training, prototyping, and/or analysis. Figure 1 below shows all of these elements in action.
Figure 1.

These tools complete much of the gritty text-engineering work of document pre-processing so that useful research can be quickly deployed, in a way that is visually explicit and apparent to those less initiated in these common natural language preprocessing tasks, and in a way that allows for editing these functions as well as introducing various pre-processing plugins and other scripts developed for individual text-mining applications. Figure 1 displays four open documents uploaded directly by entering their URLs: Newsweek and Reuters (news sites), and JohnJayPapers and DigitalScriptorium (digital libraries). These, along with 10 other news sites and 11 other digital libraries, all belong to the corpus named "DL_eval_2" above (which will serve as Sample 1 later, our first test of DL discrimination). This provides a testing sample to ensure the pre-processing pipeline and Machine Learning (ML) functions operate correctly on our soon-to-be annotated documents. Just by uploading URLs, GATE by default automatically annotates the HTML markup, as can be seen within the bottom right-sidebar where the <a>, <body> and <br> tags are located.
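As a rough analogue of this automatic markup annotation, the following sketch uses only Python's standard-library html.parser to record each HTML element name, much as GATE records original markup alongside the text; the page fragment is invented, not taken from the report's corpus:

```python
from html.parser import HTMLParser

class MarkupAnnotator(HTMLParser):
    """Collect element names as rudimentary markup annotations,
    and keep the visible text separately."""
    def __init__(self):
        super().__init__()
        self.annotations = []   # tag names, roughly GATE's markup annotations
        self.text_parts = []    # the document content itself

    def handle_starttag(self, tag, attrs):
        self.annotations.append(tag)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())

# Hypothetical page fragment for illustration only.
page = "<body><a href='x.html'>Digital Collections</a><br>Browse the archive</body>"
annotator = MarkupAnnotator()
annotator.feed(page)
```

GATE additionally records character offsets for each tag so the markup can be reconstructed; this sketch keeps only the tag names.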
After running the PR pipeline over the "DL_eval_2" corpus, the upper right-sidebar shows the annotations that result from running the tokenizer, gazetteer, sentence splitter, POS tagger, NE transducer and co-referencing orthomatcher. Organization is checked and highlighted in green, for instance, and by clicking on "White House" (one instantiation of Organization), we learn about GATE's {Type.feature=value} syntax, which in the case of "White House" is represented as {Organization.orgType=government}. This syntax operates as the core annotation engine and allows for the scripting and manipulation of annotation strings. The ANNIE PRs in this case provide automatic annotations that serve as a rudimentary start upon which to build any text engineering project. There are many other plugins and PR functions we will not discuss within this review.

For our own purposes, we want to call attention to two annotation types ANNIE generates: 1) Token, and 2) Lookup. A few examples of the Type.feature syntax for the Token type are: the kind of token {Token.kind=word}; the token character length {Token.length=12}; the token POS {Token.category=NN}; the token orthography {Token.orth=lowercase}; and the content of the token string {Token.string=painter}. Our interest is in analyzing string content: determining whether a particular document is an instance of a digital library or not will require an ML analysis of the unigram strings comprising both DL sites and non-DL sites. We can either use all tokens (after removing stop-words) to analyze the tf-idf weighting of the documents in question, or we can constrain the kinds of tokens analyzed within the documents by making further specifications. The ANNIE annotation schema provides many default annotations (e.g., Person, Organization, Money, Date, Job Title) to constrain the kinds of words chosen for analysis, as can be seen in Figure 1 in the upper right-sidebar.
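The {Type.feature=value} records can be mimicked with plain dictionaries; this hypothetical Python sketch illustrates only the syntax, not GATE's actual data structures:

```python
# Minimal stand-in for GATE's {Type.feature=value} annotation records.
def make_annotation(ann_type, **features):
    return {"type": ann_type, "features": features}

annotations = [
    make_annotation("Token", kind="word", length=12, category="NN",
                    orth="lowercase", string="painter"),
    make_annotation("Organization", orgType="government"),  # e.g. "White House"
]

def select(annotations, ann_type, feature, value):
    """Return annotations matching {Type.feature=value}."""
    return [a for a in annotations
            if a["type"] == ann_type and a["features"].get(feature) == value]
```

A query like select(annotations, "Token", "string", "painter") then plays the role of the {Token.string=painter} pattern discussed above.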
Additionally, the Gazetteer provides many other kinds of dictionary lookup entries (60,000 arranged in 80 lists) above and beyond the ANNIE default annotations. For instance, the list named "city" has as dictionary entries a list of worldwide cities, such that by mapping these onto the text, a new annotation of the kind {Lookup.minorType=city} is created, annotating each instance of a city with this markup. The lookup uses a set-subset hierarchy we will not describe, except to say that {Lookup.majorType} is a parent of {Lookup.minorType}. Thus, there are different kinds of locations, for instance city and country; city and country are minorTypes (children) of the {Lookup.majorType=location}.

Classification with GATE Machine Learning

Given that GATE Developer extracts and annotates training documents, several processing plugins that operate at the end of a document pre-processing pipeline serve Machine Learning (ML) functions. The Batch Learning PR has three functions: chunk recognition, relation extraction and classification. This paper is interested in applying supervised ML processes to classify web documents as instances of digital libraries (DL) or not (non-DL). Supervised ML requires two phases: learning and application. The first phase requires building a data model from instances within a document that has already been correctly
classified. In our case, it requires giving value to certain sets of annotations that, as a whole, represent the document instance (i.e., the website) as either a hit (DL) or a miss (non-DL). The point is to develop a training set D = (d1,…,dn) of correctly classified DL website documents (d) to build a classification model able to discriminate any future website d as being either a true DL or some other website (non-DL). The first task requires annotating each document as a whole, and in doing so assigning it to the dependent DL or non-DL class. Up until now, annotations have referred to parts of a document (tokens, sentences, dates etc.). To annotate a whole document, we begin by creating a new {Type.feature=value} term. To do so, we demarcate the entire text within each document and create a new annotation type called "Mention," a feature called "type" (not to be confused with the annotation type itself) and two distinct values: {Mention.type=dl} and {Mention.type=nondl}. The attributes used to predict class membership are the two annotation types we highlighted above: 1) Token {Token.string}, and 2) Lookup {Lookup.majorType}. To take full advantage of the Gazetteer, we added a list entry named "dlwords" (i.e., digital library words) containing terms commonly found on many digital library websites. This list of words is reproduced below1:
Advanced Search, Archive(s), Browse, Catalog, Collection(s), Digital, Digital Archive(s), Digital Collection(s), Digital Content, Digital Library(ies), Digitization, Digitisation, Image(s), Image Collection(s), Keyword(s), Library(ies), Manuscript(s), Repository(ies), Search, Search Tip(s), Special Collection(s), University(ies), University Library(ies)
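A toy version of the gazetteer lookup, using a handful of the "dlwords" entries: real GATE gazetteers match listed variants exactly, which is simplified here to case-insensitive substring matching, and both the entry subset and the sample sentence are illustrative:

```python
# A few of the "dlwords" entries, with plural variants listed explicitly.
DLWORDS = {"digital library", "digital libraries", "advanced search",
           "browse", "manuscripts", "special collections"}

def gazetteer_lookup(text, entries=DLWORDS, major_type="dlwords"):
    """Return a {Lookup.majorType=...} style record for each entry found."""
    found = []
    lowered = text.lower()
    for entry in sorted(entries):
        if entry in lowered:
            found.append({"type": "Lookup", "majorType": major_type,
                          "entry": entry})
    return found

hits = gazetteer_lookup("Browse the Special Collections of our digital library")
```

Each hit corresponds to one {Lookup.majorType=dlwords} annotation that the ML attribute set can later draw on.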
All of our analyses will operate using the bag-of-words approach, which by GATE default applies tf-idf weighting schemes to a specified n-gram (we will be using only unigrams). Two attribute annotations, each representing a slightly different bag-of-words, will be used to predict DL or non-DL class membership:
1. When the {Token.string} attribute is chosen to predict {Mention.type} class membership, the bag-of-words includes all non-stop-word tokens within its attribute set.
2. When the Gazetteer is used and "dlwords" are included as part of its internal dictionary, the attribute {Lookup.majorType=dlwords}, along with all the other 60,000 entries, will serve to constrain the set of tokens predicting {Mention.type} class membership.
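The two attribute choices above can be sketched as two feature extractors; the stop-word and gazetteer lists below are small invented stand-ins for GATE's much larger ones:

```python
# Sketch of the two attribute choices: (1) every non-stop-word token,
# (2) only tokens that hit the gazetteer. Word lists are illustrative.
STOP_WORDS = {"the", "of", "our", "and", "a"}
DLWORDS = {"browse", "archive", "collection", "manuscripts", "repository"}

def all_token_features(text):
    """Attribute set 1: the full bag of non-stop-word tokens."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]

def gazetteer_features(text):
    """Attribute set 2: only tokens present in the gazetteer lists."""
    return [t for t in text.lower().split() if t in DLWORDS]

page = "browse the new archive of rare manuscripts today"
```

On this sample page the first extractor keeps six tokens while the second keeps only the three gazetteer hits, which is exactly the sense in which the Lookup attribute constrains the bag-of-words.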
The GATE Batch Learning PR requires an XML configuration file specifying the ML parameters and the attribute-class annotation sets2. We will only discuss a few essential settings here. For starters, we set the evaluation method as "holdout" with a .66/.33 training-to-test ratio. The main algorithm we will be using is SVM (in GATE, SVMLibSvmJava). The following parameter settings were varied, with the ones reported below providing the best results:

1 Note: the Gazetteer performs no stemming and matches exactly; thus plural and uppercase variations of these words were provided but are not reproduced here.
2 For a list of all parameter setting possibilities, see http://gate.ac.uk/sale/tao/splitch17.html#x22-43500017.
-t: kernel: 0 (linear (0) vs. polynomial (1))
-c: cost: 0.7 (lower values allow softer margins, which can generalize better)
-tau: uneven margins: 0.5 (varies the positive-to-negative instance ratio)
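GATE delegates the actual training to an SVM library; purely to illustrate the linear-separation idea behind these settings, here is a from-scratch toy sketch in Python (the cost and uneven-margins parameters are not modeled, and the 2-D points are invented stand-ins for document vectors):

```python
# Toy data standing in for document vectors: +1 = DL, -1 = non-DL.
POINTS = [(2, 2), (3, 1), (2.5, 3), (-2, -1), (-1, -3), (-3, -2)]
LABELS = [1, 1, 1, -1, -1, -1]

def train_linear_svm(points, labels, lr=0.1, epochs=100):
    """Hinge-loss stochastic updates toward a separating hyperplane."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), y in zip(points, labels):
            # Update only when the point violates the unit functional margin.
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                w[0] += lr * y * x1
                w[1] += lr * y * x2
                b += lr * y
    return w, b

def classify(w, b, point):
    """Sign of the decision value: > 0 means the positive (DL) class."""
    score = w[0] * point[0] + w[1] * point[1] + b
    return 1 if score > 0 else -1

w, b = train_linear_svm(POINTS, LABELS)
```

A real SVM additionally maximizes the margin under the cost penalty -c and, with -tau, shifts the hyperplane toward the rarer class; this sketch only finds some separating hyperplane.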
The XML configuration file is reproduced below, in case others are interested in getting started using GATE ML for basic document classification:
<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="false"/>
  <PARAMETER name="thresholdProbabilityClassification" value="0.5"/>
  <multiClassification2Binary method="one-vs-another"/>
  <EVALUATION method="holdout" ratio="0.66"/>
  <FILTERING ratio="0.0" dis="near"/>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava" options=" -c 0.7 -t 0 -m 100 -tau 0.4 "/>
  <DATASET>
    <INSTANCE-TYPE>Mention</INSTANCE-TYPE>
    <NGRAM>
      <NAME>ngram</NAME>
      <NUMBER>1</NUMBER>
      <CONSNUM>1</CONSNUM>
      <CONS-1>
        <TYPE>Token</TYPE>
        <FEATURE>string</FEATURE>
      </CONS-1>
    </NGRAM>
    <ATTRIBUTE>
      <NAME>Class</NAME>
      <SEMTYPE>NOMINAL</SEMTYPE>
      <TYPE>Mention</TYPE>
      <FEATURE>type</FEATURE>
      <POSITION>0</POSITION>
      <CLASS/>
    </ATTRIBUTE>
  </DATASET>
</ML-CONFIG>
The ngram is set to 1 (unigram), and the <CLASS/> tag within the <ATTRIBUTE> tag indicates that this attribute is the class being predicted, with <TYPE>Mention</TYPE> and <FEATURE>type</FEATURE> as {Mention.type=dl or nondl}. These are changed to <TYPE>Lookup</TYPE> and <FEATURE>majorType</FEATURE> to accommodate the other attribute value. Thus, after running the ANNIE PR pipeline over the "DL_eval_2" corpus, the ML Batch Learning PR is placed alone in the pipeline to run over the annotated set of documents in Evaluation mode. The Batch Learning PR can also operate in Training-Application mode on two separate corpora: one for training and another for application (i.e., testing). The results below reflect only a holdout 0.66 evaluation run over one corpus; the current report does not utilize the Training-Application mode.

Sample corpora and results

The Web now boasts over 8 billion indexable pages (Chau & Chen, 2008). Training an ML algorithm to pick out the estimated few thousand digital libraries will therefore not be a simple matter. Assuming there are 5 thousand library-standard digital libraries (which may be a high estimate), some of which reside within umbrella Digital Asset Management portals, discriminating these will be cherry picking at a ratio of, on average, 5 digital libraries per every 8 million Web sites. Spiders (or Web crawlers) can curtail this number greatly by
crawling only to a specified argument depth from the starting URL. Unfortunately, librarians are not always good at applying search engine optimization (SEO) standards, and many well-known DLs are deeply embedded in arguments or unusual ports (the University of Wyoming's uses port 8180) or within site subdomains. Thus, curtailing this argument space too much will result in decreased recall. Additionally, there are many non-DL websites that use language quite similar to DL websites. For instance, many websites operate as librarian blogs or digital library magazines that serve as discussion spaces regarding DLs, but are not DLs themselves. These false positives will prove daunting to exclude. We seek only DLs or DL portals that boast archival collections that have been digitized, and as such serve as electronic resources that are co-referenced, searchable, browsable, and catalogued according to some taxonomy or ontology. One suggested way of narrowing down to only these kinds of resources might be to tap into the <meta content> elements in which librarians often apply conventions such as Dublin Core to demarcate these spaces as digital collection spaces. This is an avenue for further research, and is possible within GATE by utilizing a {Meta.content} attribute. On quick pre-testing, however, it provided no worthy results. In what follows are two samples of data we evaluated for the ML classification of DL and non-DL websites.

Sample 1

This sample mostly ensured that the GATE ML software and configuration files were operating correctly given the kinds of document-level attributions made. As mentioned, the first corpus we tested was called "DL_eval_2," which contained 25 websites: 13 DL sites (from Columbia University Digital Collections) and 12 distinct news sites, listed below: Reuters, Newsweek, National Review, LA Times, The Guardian, CS Monitor, CNN, Chicago Tribune, Boston Globe, Bloomberg, BBC, Wall Street Journal.
Using both {Token.string} and {Lookup.majorType} as attributes, the results of the classification of {Mention.type} as either DL or non-DL follow. These results correspond to the ML configuration file found above and utilize the SVMLibSvmJava ML engine at .66 holdout evaluation. The training set thus included 16/25 websites (.66) and the ML algorithm was tested on the remaining 9/25 sites (.33). {Token.string} misclassified only one instance: Bloomberg News was falsely classified as belonging to {Mention.type=dl}. Nothing in the text of Bloomberg's front page gave any indication as to why this was the case. Thus, precision, recall and the F1 value for the set were each 0.89. {Lookup.majorType} comprises the Gazetteer, but also includes the digital library terms ("dlwords") we added. Thus, it is a more constrained bag-of-words, smaller than the set of all tokens. Classification improved to 100% using the "dlwords"-enhanced Gazetteer. Given that this is such a small sample, we cannot conclude very much, except to say that there is something about DL content, when compared to ordinary mainstream news sites, that allows for their discrimination.
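The 0.89 figure can be recovered from confusion counts; the counts below are a reconstruction assuming a micro-average over the nine test documents with one error (eight correct predictions):

```python
# Micro-averaged confusion counts for 9 test documents with 1 error:
# the single mistake contributes one false positive to one class and
# one false negative to the other, so tp=8, fp=1, fn=1 overall.
def precision_recall_f1(tp, fp, fn):
    """Standard definitions; F1 is the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

p, r, f1 = precision_recall_f1(tp=8, fp=1, fn=1)
```

All three measures come out to 8/9 ≈ 0.89, matching the reported value.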
Sample 2

To train the ML algorithm to make our target discrimination and to allow for generalizable conclusions, we increased the sample size. Sample 2 consists of 181 non-DLs and 62 DLs. Each set was chosen in the following way:

Non-DL set

A random website generator was used to generate 181 websites that were not digital libraries, were English-language only, and had at least some text (websites consisting only of images and the like were excluded).
• http://www.whatsmyip.org/random_websites/
DL set

A set of 62 university digital libraries was chosen, mostly from across three main university DL portals:
• Harvard University Digital Collections
o http://digitalcollections.harvard.edu/
• Cornell University Libraries “Windows on the Past”
o http://cdl.library.cornell.edu/
• Columbia University Digital Collections
o http://www.columbia.edu/cu/lweb/digital/collections/index.html
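The per-class precision, recall and F1 figures reported for Sample 2 all follow from raw confusion counts. As a sketch of that standard arithmetic, the helper below reproduces the {Lookup.majorType} DL-class figures from the counts the report implies (19 true positives, 3 false negatives, 1 false positive):

```python
def prf(tp: int, fp: int, fn: int) -> tuple:
    """Precision, recall and F1 from confusion counts for one class."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# {Lookup.majorType}, DL class: 3 of 22 DLs missed, 1 non-DL wrongly accepted.
p, r, f1 = prf(tp=19, fp=1, fn=3)
# rounds to precision=0.95, recall=0.86, F1=0.90
```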
These websites are slightly more representative, but still fall well short of the precision that will be needed to crawl the web as a whole. The results bode well, nevertheless.

Again, using both {Token.string} and {Lookup.majorType} as attributes, the results of classifying {Mention.type} as either DL or nonDL follow. The .66/.33 holdout training/test split of the data was 160/83 of 243 websites: 40/22 of the 62 DLs and 120/61 of the 181 non-DLs. Naïve Bayes and C4.5 misclassified all 22/22 DL websites with both sets of attributes, achieving a total F1 of only 0.73; it is not clear why this is the case. Given that SVM is well known as the best-performing classifier for texts (Sebastiani, 2002), we stick with it for our purposes.

{Token.string} performed slightly better than {Lookup.majorType}. In both cases there were very few misclassifications, and most of these were false negatives (DLs misclassified as non-DL; see Figure 2). When only the Gazetteer entry words, including "dlwords," were taken into account ({Lookup.majorType}), 3/22 DLs were misclassified as non-DL (precision=0.95; recall=0.86; F1=0.90) and 1/61 non-DLs was misclassified as DL (precision=0.95; recall=0.98; F1=0.97). When all tokens were entered into the bag-of-words ({Token.string}), precision was perfect for DL classification and recall was perfect for nonDL classification. That is, 3/22 DLs were still misclassified as nonDL (precision=1.0; recall=0.86; F1=0.93). All 61/61 of the non-DLs were
classified correctly, resulting in perfect recall but imperfect precision, insofar as 64 total websites were classified as non-DL: the 61 expected, plus the 3 that should have been classified as DL (precision=0.95; recall=1.0; F1=0.98). Thus, overall, using all tokens achieved slightly higher rates of precision and recall for the discrimination of DL websites from all websites, based on this small and still very non-proportional sample. Total F1 values were 0.96 for {Token.string} and 0.95 for {Lookup.majorType}. The question remains whether both attributes were misclassifying the exact same websites; it turns out that two of the three websites misclassified under each attribute were the same. Figure 2 below illustrates the breakdown of these statistics per attribute.

{Token.string}: Precision=1.0, Recall=0.86
False negatives (misclassified as nonDL):
• Digital Scriptorium
o www.scriptorium.columbia.edu
• Holocaust Rescue & Relief (Andover-Harvard Theological)
o www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
o www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL): none

{Lookup.majorType}: Precision=0.95, Recall=0.86
False negatives (misclassified as nonDL):
• Harvard Business Education for Women (1937-1970)
o http://www.library.hbs.edu/hc/daring/intro.html#nav-intro
• Holocaust Rescue & Relief (Andover-Harvard Theological)
o www.hds.harvard.edu/library/collections/digital/service_committee.html
• Joseph Urban Stage Design Collection
o www.columbia.edu/cu/lweb/eresources/archives/rbml/urban/
False positives (misclassified as DL):
• www.spi-poker.sourceforge.net
Figure 2. Breakdown of misclassified websites per attribute.

Conclusion

In this paper we discussed how GATE (General Architecture for Text Engineering) employs Machine Learning to classify documents from the web into two categories: websites that operate as digital library sites, and websites that do not. This exercise was completed firstly in order to learn about GATE, and secondarily in the hope of providing a solution for populating a site the current author is creating for digital libraries (www.digitallibrarycentral.com). No current directory exists as a single-stop, go-to resource for digital libraries; as it is, digital libraries are difficult to find and hence often un- or under-utilized by the ordinary web user. By creating a digital library of all digital libraries, we hope to bring the ordinary user to the plethora of digitized resources available, and to categorize these digital collections according to a taxonomy that allows for the collation of similar kinds and types of digital libraries. Indeed, once these digital resources are all
collected, GATE Machine Learning might provide a solution for automatically classifying these resources into the supervised taxonomy. As it is, we first seek to locate these resources using Machine Learning. If the web were made up of three ordinary non-DL websites for every one DL website, the current classifier we trained would have a very easy time locating all of the DLs (with 96% accuracy). As it is, however, of the 8 billion websites in existence today, we reckon that only 3-6 thousand operate as digital libraries in some form or another. Thus, a lot of work still needs to be done to find the DL needle in the haystack of all websites online today.

References

Chau, M. and Chen, H. (2008). A machine learning approach to web page filtering using content and structure analysis. Decision Support Systems, 44(2), 482-494.

Hotho, A., Nürnberger, A. and Paass, G. (2005). A brief survey of text mining. LDV Forum – GLDV Journal for Computational Linguistics and Language Technology.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. Proceedings of the European Conference on Machine Learning (ECML), Springer.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (2000). Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3), 103-134.

Salton, G., Wong, A., and Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Sebastiani, F. (2002). Machine learning in automated text categorization. ACM Computing Surveys, 34, 1-47.

Witten, I. H. (2005). Text mining. In Practical handbook of internet computing, ed. M. P. Singh. Chapman & Hall/CRC Press, Boca Raton, Florida.

Witten, I. H., Don, K. J., Dewsnip, M. and Tablan, V. (2004). Text mining in a digital library. Journal of Digital Libraries, 4(1), 56-59.