

Big data meets nature conservation: Automatic tools for information extraction

Ricardo José Ribeiro Pereira

Thesis to obtain the Master of Science Degree in:

Information Systems and Computer Engineering

Supervisors: Prof. Pável Pereira Calado
Dr. Gonçalo José Monteiro Marques

Examination Committee
Chairperson: Prof. Luís Manuel Antunes Veiga
Supervisor: Prof. Pável Pereira Calado
Members of the Committee: Prof. Bruno Emanuel da Graça Martins

October 2018


Abstract

Biodiversity has been declining globally, while the scientific community has been devising and developing models to understand and halt this decline. Over the years, the number of scientific articles containing data collected by scientists has been increasing and, due to the dispersion of this information, it has become almost impossible to gather all the data on a high-level taxonomic group.

In this work we present a system designed to answer the following questions: “Is it possible to build a tool capable of extracting the information available in scientific articles and selecting the information that may correspond to the selected physiological characteristics?” and “Is it possible to take advantage of the user’s knowledge to improve the effectiveness of this tool?”. The system receives scientific articles as input, extracts data about the physiological characteristics of the species being studied from those articles, and classifies it using regular expressions and machine learning techniques.

Keywords: Biology, Machine Learning, Regular Expressions, Information Extraction


Resumo

Biodiversity has been in global decline and, because of that, the scientific community has been devising and developing models to understand and halt this decline. Over the years, the number of scientific articles with data collected by scientists has been increasing and, due to the dispersion of information, it has become impossible to gather information about a high-level taxonomic group, such as birds.

In this work we present a system capable of answering the following questions: “Is it possible to build a tool capable of extracting the information available in scientific articles and selecting the information that corresponds to certain physiological characteristics?” and “Is it possible to take advantage of the user’s knowledge to improve the effectiveness of this tool?”. The system receives scientific articles, extracts data about the physiological characteristics of the species under study from those articles, and classifies the data using regular expressions and machine learning techniques.

Keywords: Biology, Machine Learning, Regular Expressions, Information Extraction


Acknowledgment

First of all, I would like to thank my parents, who gave me all the support and conditions to complete my academic life successfully, so that my future would also benefit.

Special thanks also to all my colleagues and friends who have helped me overcome the challenges posed to me over the years.

I also thank Professor Gonçalo Marques and Carlos Teixeira for their participation in various discussions about the work developed, and for the feedback that allowed me to learn and progress over the last months.

Last but not least, I would like to thank Professor Pável Calado, who helped me during the development of this work and guided me in the search for solutions to the challenges that arose until the presentation of the final version.


Contents

Abstract iv

Resumo vi

Acknowledgment viii

1 Introduction 3

1.1 Motivation and Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

1.2 Hypothesis and Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1.4 Document Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

2 Concepts 7

2.1 Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.1 IE Strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7

2.1.2 Architecture and Components of IE Systems . . . . . . . . . . . . . . . . . 9

2.2 Machine Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

2.2.2 Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.3 Semi-Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.2.4 Online Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12

2.3 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16


3 Related Work 19

3.1 Dictionary-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.2 Rules-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Machine-Learning-Based . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.4 Hybrid Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

4 Extraction Tool: Implementation 27

4.1 Architecture and Knowledge Base . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.2 Extraction System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

4.3 Learning Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30

4.4 Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5 Tests and Results 37

5.1 Evaluation Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.1.1 Effectiveness Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.1.2 Learning Rate Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2.1 Classification Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.2.2 Learning Rate Assessment . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.3 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

6 Conclusion 47

6.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

Bibliography 50


List of Tables

4.1 Categories and Related Words (Gomes, 2016). . . . . . . . . . . . . . . . . . . . . 28

4.2 Normalization Rules (Gomes, 2016). . . . . . . . . . . . . . . . . . . . . . . . . . . 29

4.3 Example of output passed to the server. . . . . . . . . . . . . . . . . . . . . . . . . 30

5.4 Dataset size by class. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37


List of Figures

2.1 Typical architecture of an information extraction system (Piskorski & Yangarber,

2013). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.2 The process of a supervised machine learning algorithm (Kotsiantis, 2007). . . . . 11

2.3 Online Machine Learning algorithms representation. . . . . . . . . . . . . . . . . . 13

2.4 Perceptron Algorithm (Ben-David & Shalev-Shwartz, 2014). . . . . . . . . . . . . . 14

2.5 Confusion matrix (Fawcett, 2006). . . . . . . . . . . . . . . . . . . . . . . . . . . . 15

3.6 FAT Classification Process (G. Sautter & Agosti, 2006). . . . . . . . . . . . . . . 21

3.7 Overview of the Bootstrapping-Based Unsupervised Machine Learning Algorithm (Oakleaf, 2009). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.8 Example of a template (D. Corney & Jones, 2004). . . . . . . . . . . . . . . . . . . 24

4.9 Extraction Tool’s Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

4.10 Process Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31

4.11 Interface Homepage. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32

4.12 Interface Document’s Page. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

5.13 Effectiveness Assessment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.14 Bernoulli Learning Curve - Leave-One-Out. . . . . . . . . . . . . . . . . . . . . . . 41

5.15 Multinomial Learning Curve - Leave-One-Out. . . . . . . . . . . . . . . . . . . . . . 42

5.16 Bernoulli Learning Curve - Incremental Training Set. . . . . . . . . . . . . . . . . . 43

5.17 Multinomial Learning Curve - Incremental Training Set. . . . . . . . . . . . . . . . . 44


Chapter 1

Introduction

Biodiversity is an increasingly important issue for both the scientific community and the general public. Life is what sets our planet apart from all the others we know, and diversity plays a key role in maintaining it (Marjorie L Reaka-kudla & Henry, 1997). However, biodiversity has been affected by several kinds of problems and has been suffering a serious decline. These problems, of both human and natural origin, have led to the extinction of many species. This phenomenon is what makes it so important to collect information about living beings and to develop models that can guarantee their protection (Tilman, 2000).

1.1 Motivation and Problem

To prevent the decline of biodiversity, scientists have conducted research and recorded the results of their work in scientific articles. However, the number of articles has been increasing exponentially over the years (P. Larsen & von Ins, 2010), which creates a problem: biology embraces several disciplines, such as zoology, genetics, botany, parasitology, and many more, making it difficult for researchers to collect and gather data when they want to study a given species. With the creation of computers and the development of computer science, problems such as obtaining all the available information about an animal species, or even an entire taxonomic group, can be solved, since all this information can be stored in structures such as a database and later be made available just a click away.

However, the real question is not “How do we store the information?” but “How do we obtain it?”. To obtain the relevant information, the computer must be able to read the contents of an article and process it, so that it can recognize the data needed to build a database capable of accurately expressing the characteristics of the species being studied. This is a problem of Information Extraction.

There are already some tools that use information extraction techniques to extract data from documents and web sites. Running a brief Internet search, we can find some of them, such as GATE (General Architecture for Text Engineering) (H. Cunningham & Tablan, 2001), which includes an information extraction system called ANNIE (A Nearly-New Information Extraction System), or polyglot1.

As far as biology is concerned, there are some databases that keep online information about species, such as ARKive2, which shows not only information about each animal but also photos and videos, and the Encyclopedia of Life3, a free collaborative encyclopedia that mostly shares information about physiognomy and habitat.

The problem, however, is that scientists do not have quick and effective access to the data stored in scientific articles, given that collecting this information usually involves reading part or all of an article; taking into account the number of existing articles, this process becomes impossible or extremely time-consuming.

1.2 Hypothesis and Methodology

That said, the main goal of this project is to answer the following questions: “Is it possible to build a tool capable of extracting the information available in scientific articles and selecting the information that may correspond to the selected physiological characteristics?” and “Is it possible to take advantage of the user’s knowledge to improve the effectiveness of this tool?”.

In the following document, we explain the development of a tool capable of using existing models to extract data from scientific articles, present such data to the user, and receive feedback from the user concerning the accuracy of the extracted data. In addition, the tool is complemented with a machine learning algorithm that uses the feedback provided by users to improve the classification process. The extraction model is based on natural language processing techniques and regular expressions.

1 http://polyglot.readthedocs.io/en/latest/
2 http://www.arkive.org
3 http://www.eol.org


1.3 Contributions

The most innovative contributions of this work are the possibility of simplifying the work of researchers in the collection of biological data and, most importantly, the ability of the tool to learn from users’ input.

1.4 Document Structure

This document is organized as follows. Chapter 2 explains some concepts needed to understand the work developed. Chapter 3 describes some closely related works by different authors. Chapter 4 describes how our solution was implemented, including the architecture and the most important parts of the system. Chapter 5 outlines how we evaluated some of the system components. Finally, in Chapter 6 we reflect on the work done and on the results obtained, and we also present some suggestions for improvements that may be made in the future.


Chapter 2

Concepts

This chapter explains some of the concepts needed to understand this proposal: in particular, the definition of Information Extraction and a brief explanation of the strategies used to perform it, the typical architecture of an Information Extraction system, and the basic concepts of Machine Learning, including four types of learning. Some evaluation metrics are also described at the end of the chapter.

2.1 Information Extraction

Information Extraction is the process of extracting structured information, such as entities, relationships between entities, and attributes describing entities, from unstructured sources (Sarawagi, 2007). Having originated in the Natural Language Processing (NLP) community, it is now a topic that involves many more disciplines, such as machine learning, information retrieval, databases, the Web, and document analysis. Consequently, it has undergone a great evolution over the years: the first tasks were based only on the identification of named entities, while nowadays it contemplates a set of new tasks such as establishing relationships between these entities (Sarawagi, 2007). With the increase in the number of possible applications for this technique, it is now possible to apply it to biology, or to science in general.

2.1.1 IE Strategies

An information extraction system can be developed based on different strategies. There are four main strategies according to M. Krallinger & Valencia (2005): the rule-based strategy, the dictionary-based strategy, the machine-learning strategy, and hybrid approaches that take advantage of different techniques.

Rule-Based Strategy

Rule-based methods usually work by establishing a set of rules, either manually or through automatic learning, and applying them to a text (A. Thessen & Mozzherin, 2012). A rule consists of a pattern and an action. The pattern is defined, for example, by a regular expression. When a pattern matches a sequence of tokens in the text under consideration, the corresponding action is performed (Aggarwal & Zhai, 2013).

The manual creation of rules requires the participation of specialized personnel and much effort. Automatic methods contemplate two approaches: top-down and bottom-up. The top-down approach requires the rules to be defined first, so that they can be applied to the maximum number of training instances, after which the system learns and specifies more rules ”by taking the intersections of the more general rules”. In the bottom-up approach, on the other hand, rules are defined based on training instances and then generalized (Aggarwal & Zhai, 2013).

The main disadvantages of this strategy are that building the rule set manually may take a large amount of work and that the resulting set only applies to a given domain. In addition, it is practically impossible to create a completely effective rule set. On the other hand, the strength of this strategy is that, compared with a dictionary-based strategy, it is capable of handling variations in word order or in sentence structure (A. Thessen & Mozzherin, 2012).
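The pattern/action pair described above can be sketched as a minimal Python rule; the measurement pattern and the example sentence are invented for illustration, not taken from the thesis’s rule set:

```python
import re

# Hypothetical rule: the pattern matches a body-mass measurement
# (a number followed by a mass unit); the action records (value, unit).
PATTERN = re.compile(r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>g|kg)\b")

def apply_rule(sentence):
    """Apply the rule to a sentence: for every pattern match,
    perform the action of extracting the matched value and unit."""
    return [(float(m.group("value")), m.group("unit"))
            for m in PATTERN.finditer(sentence)]

print(apply_rule("Adults of this species weigh 15.2 g on average."))
# → [(15.2, 'g')]
```

A sentence with no matching token sequence simply triggers no action, which illustrates why an incomplete rule set silently misses information.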

Dictionary-Based Strategy

Dictionary-based strategies consist of one or more lists of terms that are matched against a text; the result is the list of terms that appear in both. The advantage over rule-based systems is that the dictionary can list references to other knowledge sources. Nevertheless, it is not easy to make a list with all the necessary terms when the knowledge base is too large or is constantly updated (M. Krauthammer & Friedman, 2000).
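A minimal sketch of this matching process follows; the “dictionary” of species names is invented for illustration:

```python
# Dictionary-based matching: the dictionary is a set of known terms
# (illustrative species names), and the result is the subset of terms
# that also appear in the text.
DICTIONARY = {"Parus major", "Turdus merula", "Passer domesticus"}

def match_terms(text, dictionary=DICTIONARY):
    """Return, in sorted order, the dictionary terms found in the text."""
    return sorted(term for term in dictionary if term in text)

print(match_terms("We measured Parus major and Turdus merula individuals."))
# → ['Parus major', 'Turdus merula']
```

Note how a term absent from the dictionary (or a spelling variant of one) is never found, which is exactly the weakness, compared with rules, noted above.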

Machine Learning Strategy

This type of strategy aims to create an algorithm that learns to extract information more effectively than the previous strategies. The algorithm may (supervised learning) or may not (unsupervised learning) receive a training set, and there are also tools that combine these two approaches (Tanabe & Wilbur, 2002). The learning is done through the establishment of rules using statistical procedures (A. Thessen & Mozzherin, 2012). The supervised approach has a major disadvantage: collecting a quality training set can be hard, especially in some areas, such as biology, for the reasons mentioned above (Tanabe & Wilbur, 2002).

Hybrid Strategy

Some tools use a hybrid strategy; in other words, they combine two or more of the previous techniques, hoping to benefit from the advantages of each (Tanabe & Wilbur, 2002). Examples are BioRat (D. Corney & Jones, 2004), which uses dictionaries and rules to perform the information extraction, and Caramba (Grouin & AB, 2010), which uses rules to locate trigger terms and machine learning techniques for the classification task.

2.1.2 Architecture and Components of IE Systems

According to Piskorski & Yangarber (2013), most Information Extraction systems have some components in common. These are called domain-independent components and usually carry out the linguistic analysis, with the following steps:

• Meta-data analysis - extraction of elements like the title, the body and its structure, and the document’s date.

• Tokenization - segmentation of the text into tokens and their respective type classification.

• Morphological analysis - extraction of the morphological information of each token.

• Sentence/utterance boundary detection - segmentation of the text into sentences (sequences of lexical items together with their features).

• Named-entity extraction - detection of named entities like organizations, currencies, geographical references, etc.

• Phrase recognition - recognition of local structures such as noun phrases, acronyms, abbreviations, etc.

• Syntactic analysis - analysis of the syntactic function of each token in a sentence. It can be deep, i.e., covering all possible interpretations and grammatical relations within the sentence, or shallow, i.e., restricted to non-recursive linguistic structures and phenomena like ambiguities.
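Two of the steps above, sentence boundary detection and tokenization, can be sketched with deliberately naive stand-ins (the splitting rules here are simplifications for illustration, not those of any particular IE system):

```python
import re

def split_sentences(text):
    """Naive sentence boundary detection: split on '.', '!' or '?'
    followed by whitespace. A period inside a number is not followed
    by whitespace, so decimals survive."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tokenize(sentence):
    """Naive tokenization: numbers and words become tokens, each paired
    with a coarse type (a toy version of type classification)."""
    tokens = re.findall(r"\d+(?:\.\d+)?|\w+", sentence)
    return [(t, "NUMBER" if t[0].isdigit() else "WORD") for t in tokens]

text = "Body mass was 15.2 g. Wing length varied."
for sentence in split_sentences(text):
    print(tokenize(sentence))
```

Real pipelines replace each of these toy functions with a far more robust component, but the data flow (text into sentences, sentences into typed tokens) is the same.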


All these steps, as well as the two components pointed out by the authors, are represented in Figure 2.1 as part of a typical architecture of an information extraction system.

Figure 2.1: Typical architecture of an information extraction system (Piskorski & Yangarber, 2013).

2.2 Machine Learning

Learning is the result of transforming experience into some kind of knowledge, and this is what machine learning consists of: the creation of an algorithm capable of learning by building a model from example inputs, with the goal of making predictions or decisions based on such data (Ben-David & Shalev-Shwartz, 2014).

At first glance, this topic may be seen as another branch of artificial intelligence, but the goal is not to create something that imitates human intelligence, rather something that complements it (Ben-David & Shalev-Shwartz, 2014).

Being something created to complement human intelligence, it is naturally used to solve problems that are beyond human capabilities, such as the analysis of very large and/or complex datasets.

Another of its uses is the possibility of computing tasks that we perform without thinking and cannot translate into code, like speech recognition or image understanding, because these are tasks that can be learned effectively if there are enough examples (Ben-David & Shalev-Shwartz, 2014).


Over the years, several types of learning have been developed, four of which are described in this document: Supervised Learning, Unsupervised Learning, Semi-Supervised Learning, and Online Learning.

2.2.1 Supervised Learning

Supervised machine learning is based on algorithms that depend on externally supplied instances, the training set, to create models that make it possible to make predictions about other instances. This makes the construction of a concise model of the distribution of class labels, in terms of predictor features, the main goal of supervised learning. The resulting model is then used to assign class labels to new instances, the test set, in which all the values are known except the class label (Kotsiantis, 2007).

Kotsiantis outlined the process of applying a supervised machine learning algorithm in Figure 2.2 (Kotsiantis, 2007).

Figure 2.2: The process of a supervised machine learning algorithm (Kotsiantis, 2007).

According to the author, the initial part, prior to the choice of the algorithm, is very important, because it is what will truly define the effectiveness of the prediction. This first part includes the collection of the data that will form the dataset under study and the handling of this data, which may contain noise or missing values. This data treatment will preferably be carried out by specialists using suitable methods. Besides that, it is also important to choose the features that are relevant for the prediction, enabling the process to be faster and more efficient later on (Yu & Liu, 2004). Then, the training set and the test set are defined and used by the chosen algorithm (Kotsiantis, 2007).

Some of the most used algorithms of this type of learning are Naïve Bayes, Support Vector Machines, K-Nearest Neighbors, and Random Forests (Ben-David & Shalev-Shwartz, 2014).
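The supervised workflow (train on labeled instances, then predict the label of an unseen one) can be sketched with a tiny Bernoulli Naïve Bayes text classifier, one of the algorithms named above. This is a self-contained sketch, not the thesis’s implementation, and the labeled token sets are invented for illustration:

```python
import math

def train_bernoulli_nb(docs, labels, alpha=1.0):
    """Train a Bernoulli Naive Bayes model: class priors plus, for each
    class, the smoothed probability that each vocabulary word is present."""
    vocab = sorted({w for d in docs for w in d})
    model = {}
    for c in sorted(set(labels)):
        class_docs = [d for d, l in zip(docs, labels) if l == c]
        prior = len(class_docs) / len(docs)
        cond = {w: (sum(w in d for d in class_docs) + alpha)
                   / (len(class_docs) + 2 * alpha)
                for w in vocab}
        model[c] = (prior, cond)
    return vocab, model

def predict(vocab, model, doc):
    """Assign the class with the highest log-probability; Bernoulli NB
    also scores the words that are absent from the document."""
    best, best_score = None, float("-inf")
    for c, (prior, cond) in model.items():
        score = math.log(prior)
        for w in vocab:
            p = cond[w]
            score += math.log(p if w in doc else 1.0 - p)
        if score > best_score:
            best, best_score = c, score
    return best

# Invented training data: token sets labeled "mass" vs. "other".
docs = [{"body", "mass", "g"}, {"mass", "kg"},
        {"wing", "length"}, {"habitat", "forest"}]
labels = ["mass", "mass", "other", "other"]
vocab, model = train_bernoulli_nb(docs, labels)
print(predict(vocab, model, {"mass", "g"}))
# → mass
```

The Laplace smoothing constant alpha prevents zero probabilities for words never seen in a class, which would otherwise drive the log-score to minus infinity.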

2.2.2 Unsupervised Learning

Unsupervised learning is based on a process similar to the one shown in Figure 2.2, but with one difference: when dealing with an unsupervised machine learning algorithm, we do not need to establish a training set or a test set; the learner processes the whole dataset in order to come up with a summary of the input data (Ben-David & Shalev-Shwartz, 2014). Clustering the dataset into groups of similar objects is a typical way to do it (D. Greene & Mayer, 2008).

Some of the most used algorithms in this type of learning are k-Means Clustering and Hierarchical Clustering (D. Greene & Mayer, 2008).
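The clustering idea can be sketched with a minimal one-dimensional k-means; the data points are invented for illustration and no labels are supplied, only the number of clusters:

```python
import random

def kmeans_1d(points, k, iters=20, seed=0):
    """Minimal 1-D k-means: assign each point to the nearest centroid,
    then move each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its old centroid.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

print(kmeans_1d([1.0, 1.2, 0.9, 10.0, 10.5, 9.8], k=2))
```

With two well-separated groups of values, the two centroids converge to the group means, the “summary of the input data” mentioned above.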

2.2.3 Semi-Supervised Learning

This type of learning is a hybrid between the previous two and can be applied in different ways. One way is to provide the algorithm with unlabeled input data, but also some supervision information for some of the examples. It can also be treated as unsupervised learning guided by some constraints, or be seen as supervised learning with additional information on the distribution of the examples (C. Olivier & Zien, 2006).

Some of the most used algorithms in this type of learning are the Transductive Support Vector Machine algorithm and the Local and Global Consistency algorithm (Y. Guo & Zhang, 2010).

2.2.4 Online Learning

The last type is Online Machine Learning. This approach aims to make a series of predictions having, as prior knowledge, previous correct answers to other tasks and, possibly, additional information (Shalev-Shwartz, 2011).

In each round, the learner receives an instance and its task is to predict a label. Afterwards, the learner receives the correct answer and uses it as knowledge for the next round, to improve the prediction accuracy, making this type of learning perfect for an interactive tool (Ben-David & Shalev-Shwartz, 2014). The generality of online machine learning algorithms can be represented by Figure 2.3.

Figure 2.3: Online Machine Learning algorithms representation.

There are numerous algorithms used in this type of learning; in this document we describe two of them: the Perceptron and the Naïve Bayesian Classifier.

Online Perceptron Algorithm

The Perceptron is a classic learning algorithm, invented by Rosenblatt (1957), for binary classification and, according to Ben-David & Shalev-Shwartz (2014), its online version is the following:

Consider Υ = {−1, 1} and a weight vector wt, with w1 = 0. On round t the learner receives a vector xt and predicts pt = sign(⟨wt, xt⟩). Then, it receives the correct answer, yt ∈ Υ, and suffers a loss of 1 if pt ≠ yt and 0 otherwise.

At each round the weight vector is updated taking into account the accuracy of the prediction. If the predicted value is correct, the weights remain the same; if not, the product of yt and xt is added, as shown in the following equation:

wt+1 = wt             if yt⟨wt, xt⟩ > 0
wt+1 = wt + yt·xt     otherwise                    (2.1)

The pseudo-code for the online Perceptron algorithm is shown in Figure 2.4.


Figure 2.4: Perceptron Algorithm (Ben-David & Shalev-Shwartz, 2014).
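The update rule in Equation 2.1 can be sketched in a few lines of Python. This is a minimal illustration, not the pseudo-code of Figure 2.4; the tiny data stream below is invented.

```python
# Minimal sketch of the online Perceptron update (Equation 2.1).
# Labels are in {-1, +1}; the weight vector starts at zero.

def perceptron_round(w, x, y):
    """One online round: predict, then update w only on a mistake."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin > 0:                       # correct prediction: keep w unchanged
        return w, 0                      # second element: 1 if a mistake was made
    # mistake: w_{t+1} = w_t + y_t * x_t
    return [wi + y * xi for wi, xi in zip(w, x)], 1

# Run the learner over a tiny, invented, linearly separable stream.
stream = [([1.0, 1.0], 1), ([-1.0, -1.0], -1), ([2.0, 1.0], 1)]
w, mistakes = [0.0, 0.0], 0
for x, y in stream:
    w, m = perceptron_round(w, x, y)
    mistakes += m
```

On this stream the learner errs only on the very first round (the zero vector predicts nothing useful) and the single update already separates the remaining points.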

Naïve Bayesian Classifier

Bayesian classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given sample belongs to a particular class, based on Bayes' theorem. These classifiers are called "naïve" because they assume that the effect of an attribute value on a given class is independent of the values of the other attributes.

According to Rish (2001), a Naïve Bayesian Classifier works as follows: given a training set T with one class label per sample and k classes (C1, C2, ..., Ck), where each sample is represented by a vector X of n attributes, the classifier predicts the class of a given sample X according to the following formula:

P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ k, j ≠ i                    (2.2)

i.e. it finds the class that maximizes this posterior probability. Equation 2.2 is calculated using Bayes' theorem (see Equation 2.3).

P(Ci | X) = P(X | Ci) P(Ci) / P(X)                              (2.3)

For data sets with many attributes, the assumption of class conditional independence is made: the values of the attributes are presumed to be conditionally independent given the class label of the sample. Therefore, the calculated equation is the one represented by Equation 2.4.

P(X | Ci) = ∏ (k = 1 to n) P(xk | Ci)                           (2.4)
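The decision rule of Equations 2.2-2.4 can be sketched as follows. The toy training samples and category names are invented for illustration, and add-one smoothing (not discussed above) is used to avoid zero probabilities.

```python
from collections import Counter, defaultdict
from math import log

# Toy naive Bayes classifier implementing Equations 2.2-2.4:
# pick the class C_i maximizing P(C_i) * prod_k P(x_k | C_i).
# The training data below is invented purely for illustration.
train = [
    (["incubation", "days"], "Incubation"),
    (["incubation", "period", "days"], "Incubation"),
    (["wet", "mass", "grams"], "Body Mass"),
    (["adult", "mass", "grams"], "Body Mass"),
]

prior = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
for words, label in train:
    word_counts[label].update(words)

def predict(words):
    vocab = {w for ws, _ in train for w in ws}
    best, best_score = None, float("-inf")
    for c in prior:
        total = sum(word_counts[c].values())
        # log P(C_i) + sum_k log P(x_k | C_i), with add-one smoothing
        score = log(prior[c] / len(train))
        for w in words:
            score += log((word_counts[c][w] + 1) / (total + len(vocab)))
        if score > best_score:
            best, best_score = c, score
    return best
```

Working in log space turns the product of Equation 2.4 into a sum, which avoids numerical underflow for long attribute vectors.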


2.3 Evaluation

There are several metrics capable of evaluating the performance of Information Extraction Systems, among them Precision, Recall, F1 score and Accuracy. These metrics can be calculated from the elements of the Confusion Matrix illustrated in Figure 2.5 (Fawcett, 2006).

Figure 2.5: Confusion matrix (Fawcett, 2006).

Precision - calculates the True Positive Accuracy (tpa), or confidence, obtained as the percentage of true positives among the predicted positives (Powers & D.M.W., 2011).

Precision = Confidence = tpa = tp / (tp + fp)                   (2.5)

Recall - evaluates the True Positive Rate (tpr), or sensitivity, obtained by measuring the percentage of true positives against the real positives (Powers & D.M.W., 2011).

Recall = Sensitivity = tpr = tp / (tp + fn)                     (2.6)

F1 score - the weighted harmonic mean of precision and recall, where β is a non-negative value used to adjust their relative weighting (Piskorski & Yangarber, 2013).

F = (β² + 1) · precision · recall / (β² · precision + recall)   (2.7)

Accuracy - measures the percentage of correct predictions over the total set of possibilities (Powers & D.M.W., 2011).

Accuracy = (tp + tn) / (tp + fp + fn + tn)                      (2.8)
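These four metrics can be computed directly from the confusion-matrix counts; the counts below are made up for illustration.

```python
# Metrics from the confusion-matrix counts (Equations 2.5-2.8),
# illustrated with invented counts.
def metrics(tp, fp, fn, tn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F-score: weighted harmonic mean of precision and recall
    f = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, recall, f, accuracy

p, r, f1, acc = metrics(tp=8, fp=2, fn=2, tn=8)
```

With β = 1 the F-score weights precision and recall equally, which is the F1 score used in the tests of Chapter 5.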


2.4 Summary

In this chapter we explained some concepts needed to understand the remainder of the work we have developed. We covered information extraction, including the four main strategies and the typical architecture of a system that performs it, and machine learning, for which we explained the main learning paradigms as well as typical algorithms. Finally, we described the evaluation metrics used during the testing phase.


Chapter 3

Related Work

Several information extraction systems have been created over the years, aiming to collect

data related to biology from scientific articles.

Although scientific articles usually have a well defined global structure (title, authors, keywords, references, etc.) with little variation from document to document, the main difficulty in processing them is the form of the content, which may consist only of text or may also contain images and tables. There are further problems related to the subject matter itself, and documents dealing with biology are no exception (Sarawagi, 2007).

Therefore, some characteristics have already been recognized as obstacles to the correct processing of biology-related documents, such as:

• The language used is always changing, due to new discoveries that alter the scientific community's understanding (L. Hirschman & Yeh, 2002).

• Processing the contents of tables or figure captions (H. Dai & Hsu, 2009).

• Multiple words referring to the same entity, including acronyms or even pronouns like "it" (H. Dai & Hsu, 2009).

• The linking of structures by words like "and" or "or" (Cohen & Hunter, 2004).

• Many names are easily mistaken for common words (D. Corney & Jones, 2004).

As previously explained, there are several strategies for extracting information, namely approaches

using dictionaries, rules, machine learning algorithms and hybrid strategies, which merge two or

more approaches into a single system.


In this chapter we describe some existing systems, grouped according to the strategy used for extracting information from scientific articles within the scope of biology.

3.1 Dictionary-Based

Information extraction systems developed within the scope of biology that use dictionary-based algorithms essentially serve to recognize named entities (NER) or taxonomic names (TNR) (P. R. Leary & Sarkar, 2007).

The TaxonFinder1 system is a TNR tool that uses this type of algorithm: it recognizes taxonomic names in documents by comparing all the words in a document with several word lists from a version of NameBank2.

The system splits the document into words and compares them with the words in the lists. Whenever the system finds a capitalized word, it checks whether the word is in the genus list or in the above-genus list. If it is in the above-genus list, it can be returned as a name. If it is in the genus list, the system checks the "species-or-below" name list: if the following word is in that list, the next words are analyzed until a complete polynomial is returned; if the next word is not in the list, the name is returned as a genus (A. Thessen & Mozzherin, 2012).

The major disadvantage of this system is that the dictionary has to be constantly updated, running the risk of not finding new names, although it can discover new combinations of known names.
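The lookup logic described above can be sketched roughly as follows, with tiny invented word lists standing in for the NameBank-derived dictionaries; this is an illustration of the idea, not TaxonFinder's actual code.

```python
# Sketch of a TaxonFinder-style dictionary lookup. The word lists are
# invented; the real tool uses lists derived from NameBank.
GENUS = {"puffinus", "larus"}
SPECIES_OR_BELOW = {"puffinus", "argentatus"}

def find_taxon_names(text):
    """Return genus or genus+species matches starting at capitalized words."""
    words = text.replace(".", "").split()
    names, i = [], 0
    while i < len(words):
        w = words[i]
        if w[0].isupper() and w.lower() in GENUS:
            # capitalized genus found: try to extend with species epithets
            j = i + 1
            while j < len(words) and words[j].lower() in SPECIES_OR_BELOW:
                j += 1
            names.append(" ".join(words[i:j]))
            i = j
        else:
            i += 1
    return names

found = find_taxon_names("The shearwater Puffinus puffinus breeds in burrows.")
```

Note how the dictionary dependence shows: a genus absent from the lists is silently skipped, which is exactly the update problem mentioned above.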

3.2 Rules-Based

There are few information extraction systems that rely exclusively on rules to perform their task (Gomes, 2016). Instead, this approach is more often used to improve the results of other tools, by combining it with other methods of extracting information. An example of this is the FAT (Find All Taxon names) system, which builds on another system, TaxonGrab (D. Koning & Moritz, 2005), and tries to improve it using rules (G. Sautter & Agosti, 2006). Figure 3.6 represents the classification process of this system.

According to G. Sautter & Agosti (2006), the idea of their approach is to pick up the parts of the text already classified as taxonomic names (precision rules) and as not being taxonomic names (recall rules), and to use those already classified parts to build lexica and statistics that are then used to classify the rest of the text.

1 http://taxonfinder.org/
2 http://www.ubio.org/index.php?pagename=namebank

In a first pass, the system uses the precision rules to detect all sequences of words matching them. The second step does the same using the recall rules. Then, the results of these two steps are used to build further lexica, which are applied to the text that did not match in either of the first two steps. The words, or phrases, that remain with an uncertain classification are then analyzed by a word-level language recognizer to be classified.

Figure 3.6: FAT Classification Process (G. Sautter & Agosti, 2006).

The structure of taxonomic names is represented by several regular expressions built to match

any sequence of words that conforms to the Linnaean rules of nomenclature (de Queiroz, 1997).
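As an illustration of this kind of rule, the regular expression below matches a simple binomial pattern (a capitalized genus, or its abbreviation, followed by a lowercase epithet). It is a deliberate simplification, not one of FAT's actual patterns.

```python
import re

# Illustrative regular expression for a binomial name under Linnaean
# conventions: a capitalized genus (or an abbreviation such as "L.")
# followed by a lowercase specific epithet. Real systems use far more
# elaborate patterns covering subspecies, authorities, etc.
BINOMIAL = re.compile(r"\b([A-Z][a-z]+|[A-Z]\.)\s([a-z]{3,})\b")

text = "Both Larus argentatus and L. fuscus were observed."
matches = [" ".join(m) for m in BINOMIAL.findall(text)]
```

The alternation in the first group is what lets the same rule capture both the full genus and its abbreviated form.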

The system Protein Active Site Template Acquisition (PASTA) (R. Gaizauskas & Willett, 2003) is

another example of a system that uses this type of approach to extract information, in this case,

about the roles of amino acid residues in protein molecules.

This system exploits basic templates to extract information from the text. These templates store information about an entity, a relation between two entities, or a scenario, and contain one or more slots of information (R. Gaizauskas & Willett, 2003).

The system then executes five major tasks: text preprocessing, terminological processing, syntactic and semantic analysis, discourse interpretation, and template extraction (R. Gaizauskas & Willett, 2003).


3.3 Machine-Learning-Based

The inclusion of machine learning algorithms in this type of system is more recent than the other approaches, and only since the mid-1990s has this technique become dominant over the others (Oakleaf, 2009).

The system developed by Cui, Boufford and Selden is an example of the use of unsupervised machine learning for semantic annotation within the scope of biology, in this case, the annotation of morphological descriptions of whole organisms (Oakleaf, 2009).

These types of systems are very closely related to information extraction systems, since they are

both based on the discovery of the semantic role that a word plays in a given text.

The developed system makes use of a Bootstrapping-Based Unsupervised Machine Learning Algorithm (see Figure 3.7), a process that begins with a small group of known items which are used iteratively to learn new items. The only external resource used is WordNet, an online lexical reference system (G. A. Miller & Miller, 1990).

Figure 3.7: Overview of the Bootstrapping-Based Unsupervised Machine Learning Algorithm (Oakleaf, 2009).

The first steps of the algorithm include normalizing the text and segmenting the documents into

clauses. The normalization involves the conversion of the uppercase letters to lowercase and the

standardization of the use of hyphens.

The next steps are a preparation for the core bootstrapping modules, in which sets of tokens (like stop words) and known words are loaded, as well as their semantic roles. The module also learns nouns, in their plural and singular forms, and annotates clauses with distinct patterns.

In the core bootstrapping modules, inferences are made between subjects and boundary words.

The remaining unknown words are tagged according to conventions determined by experience in this type of task. Then the subject is used to annotate each clause.

The secondary modules deal with more complex language features in the biosystematics literature, such as the use of an adjective as a noun subject, or conjunctions like "and" or "or" forming a compound subject. Finally, the new knowledge is used to annotate other clauses.

The knowledge base is constantly updated with the information obtained during execution, allowing the algorithm to learn by itself. After completing these three steps, the algorithm should have learned enough to directly annotate the rest of the clauses, which is what it does during the post-bootstrapping modules.

NetiNeti (L. M. Akella & Miller, 2012) is another system that uses machine learning during information extraction. In this case, a supervised machine learning algorithm is used, involving the probabilistic classifiers Naïve Bayes and Maximum Entropy to estimate the probability of a label given a word and its context.

About 5,000 names were used as the training set, taken from a BHL1 book, MEDLINE2 abstracts and some contents from EOL, segmented using NLTK (L. M. Akella & Miller, 2012).

3.4 Hybrid Systems

Lately, in order to improve the effectiveness of information extraction systems, it was concluded that the way forward was to mix the various existing approaches, so that their strengths could complement each other. This hybrid approach is now the most used in the creation of this type of system (Sarawagi, 2007).

BioRat, developed by D. Corney & Jones (2004), is an example of one of these systems, since

it uses both dictionary-based and rules-based approaches to perform biomedical information

extraction.

Based on the GATE toolbox (H. Cunningham & Tablan, 2001), this system uses it to label words according to their parts of speech, but with small changes to the gazetteers and templates.

1 https://www.biodiversitylibrary.org
2 https://www.nlm.nih.gov/bsd/pmresources.html

Gazetteers are lists of words (dictionaries) used for Named Entity Recognition. BioRat uses gazetteers from three different sources: MeSH1, Swiss-Prot2 and hand-made lists, the latter created with the help of domain experts.

The templates are a representation of a pattern to be matched against the text, which allows the system to extract information automatically. In short, they are predefined slots that the system tries to match with the text.

Figure 3.8: Example of a template (D. Corney & Jones, 2004).

Figure 3.8 contains one example of these templates, in which "EXPRESSION" represents a word from the gazetteers relating to a protein expression or interaction, the elements with a question mark are optional words, and the "PROTEINS" elements match protein names.

The output is given in both XML and HTML, either to be used in databases or to be seen in web

browsers or spreadsheets.

Another example of one of these systems is Caramba (Grouin & AB, 2010). This system is

divided into three tasks: Concept Extraction, Assertion Annotation and Relation Annotation.

The first task is based on a machine-learning method that depends on a linguistic analysis, whose

output is represented in the form of n-gram tokens, typographic clues and semantic and syntactic

tags for each token. Then, with the training set, a model is created using a machine learning

tool called CRF++3. MetaMap (Aronson, 2001) is then used to locate medical terms and their concepts and semantic types; its output is then enhanced by segmenting it into noun phrases with treetagger-chunker and by searching for the located terms in pre-compiled lists.

For the second task two systems were developed, one using machine learning techniques and

the other using hand-made rules. The first system is a Support Vector Machine (SVM) trained

with the libsvm tool4, focusing on three types of features: contextual lexical features, trigger-

based features and target concept internal features. The second system is an extension of the NegEx (W. W. Chapman & Buchanan, 2001) algorithm, used to locate trigger terms indicating a negation or a probability and to determine whether the concepts are within the scope of the trigger.

1 http://www.nlm.nih.gov/mesh/
2 http://www.expasy.org/
3 http://chasen.org/taku/software/crf++/
4 https://www.csie.ntu.edu.tw/~cjlin/libsvm/


The final task is a classification task, in which eight relation types were considered, using a hybrid approach based on a trained SVM and manually constructed linguistic patterns.

As has been said, this is a growing approach within the world of information extraction systems, and many other systems could be described, such as TaxonGrab (D. Koning & Moritz, 2005), which identifies taxonomic names using a combination of nomenclature rules and a dictionary of non-taxonomic terms.

3.5 Summary

In this chapter, we presented some systems that perform tasks similar to ours, organized according to the strategies explained in Chapter 2.


Chapter 4

Extraction Tool: Implementation

This chapter presents the proposed information extraction tool, describing each of its components. First, we give an overview of the architecture, the knowledge base, the process of user interaction with the system, and its internal functioning. Afterwards, we describe how the main components of the tool work.

4.1 Architecture and Knowledge Base

Figure 4.9: Extraction Tool’s Architecture.



Figure 4.9 shows the overall architecture of the tool. The system follows a hybrid approach, complementing a rules-based approach with an Online Machine Learning Algorithm. The rules-based approach is present in the extraction system and is an adaptation of the work developed by Gomes (2016), described in Section 4.2. This adaptation was made so that it could communicate with the server. In addition to the candidate extraction module, the knowledge base of the same project was also used. Table 4.1 contains the list of categories and words used in the extraction process.

categories and words used on the extraction process.

Category          | Related Words
Body Mass         | wet weight; dry weight; wet mass; dry mass; at birth; hatching; at fledging; fledgling; adult; grams; kilograms
Body Temperature  | chick; adult; body; temperature; Celsius
Egg Temperature   | incubation; egg; temperature; Celsius
Fledging          | fledging; leaves the nest; days
Incubation        | incubation; hatching; days
Total Body Water  | total; body; water; content; percentage

Table 4.1: Categories and Related Words (Gomes, 2016).

The process begins with the user making a folder of scientific articles available. Then, for each PDF file, the text is analyzed morphologically using Natural Language Processing methods, such as sentence splitting. The next step is to apply rules, in the form of regular expressions, to the text in order to extract only the interesting parts according to the fields defined a priori, i.e. those listed in the category column of Table 4.1. Finally, an online learning algorithm is applied to improve the effectiveness of the tool over time, depending on the input given by the users through the response acceptance and rejection buttons.

In the interface, the user can always consult the document under analysis, the fields studied, the possible answers, and the phrase in which each answer appears, in order to verify the context of the answer and make a better decision about its correctness. The user can also write a comment to be associated with the data being classified.

4.2 Extraction System

The extraction process consists of seven steps, three related to the morphological analysis and the other four corresponding to the application of regular expressions.


The first step is to convert the PDF file into an equivalent .txt file. To achieve this, a tool called PDFMiner.six1 was used, which makes it possible to manipulate the data present in the article in order to obtain the expected results.

Then, the text is divided into sentences using the Natural Language Toolkit2 (NLTK) library, with due treatment given to words split by hyphens or line breaks. The next step uses the StanfordPOSTagger3 library to go through all the sentences in the document and keep only those containing numbers.

After filtering the relevant sentences of the document, the process enters the regular expressions phase. At this stage, regular expressions are used to find phrases that contain words referring to the previously identified metrics (grams, Celsius, etc.) and the words related to each category (see Table 4.1).
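The two filtering stages just described (keep only numeric sentences, then match category words) can be sketched as follows. The sentences are invented and only two of the categories of Table 4.1 are shown; this is an illustration, not the tool's actual code.

```python
import re

# Sketch of the sentence-filtering stage: keep only sentences that
# contain a number plus a keyword of some category (subset of Table 4.1;
# the example sentences are invented).
CATEGORY_WORDS = {
    "Incubation": ["incubation", "hatching", "days"],
    "Body Mass": ["wet mass", "dry mass", "grams", "kilograms"],
}

def candidate_sentences(sentences):
    hits = []
    for s in sentences:
        if not re.search(r"\d", s):      # keep numeric sentences only
            continue
        for category, words in CATEGORY_WORDS.items():
            if any(w in s.lower() for w in words):
                hits.append((s, category))
                break                    # first matching category wins
    return hits

sents = [
    "Incubation lasted 52 days on average.",
    "The colony is located on a small island.",
    "Adults reached a wet mass of 420 grams.",
]
cands = candidate_sentences(sents)
```

The second sentence is discarded because it contains no digits, even though it is grammatical text about the species.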

For each phrase found in the previous step, the values are extracted and normalized. Values written out in words are converted to digits using the word2number4 module. The normalization is done as in the project developed by Gomes (2016): the values are converted taking into account their units, and values presented as intervals are converted according to the value of each number, as shown in Table 4.2.

Case                        | Normalization | Example
Bigger Number, Small Number | Sum           | 20 and 3 = 23
Small Number, Bigger Number | Median        | 5 and 15 = 10

Table 4.2: Normalization Rules (Gomes, 2016).

At the end of this process, a list is given to the server containing, for each identified phrase, the values, the context in which they appear, and the class to which they belong. Values are classified by default as belonging to the class OTHER, which is later replaced by the class assigned by the machine learning algorithm (see Table 4.3).

The result of this process is cached using a module called pickle5 that allows us to serialize

and de-serialize data structures in order to use them later, making it unnecessary to repeat the

procedure for the same document.
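A minimal sketch of this kind of pickle cache follows; the extract function and document identifier are hypothetical stand-ins for the tool's extraction pipeline.

```python
import os
import pickle
import tempfile

# Sketch of the pickle-based cache: the extraction result for a document
# is serialized once and reloaded on later runs instead of being recomputed.
def extract_with_cache(doc_id, extract, cache_dir):
    path = os.path.join(cache_dir, doc_id + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:      # cache hit: skip re-extraction
            return pickle.load(f)
    result = extract(doc_id)             # cache miss: run the extractor
    with open(path, "wb") as f:
        pickle.dump(result, f)
    return result

calls = []
def fake_extract(doc_id):
    calls.append(doc_id)                 # record how often extraction ran
    return [("3 days", "Incubation")]

with tempfile.TemporaryDirectory() as d:
    first = extract_with_cache("article1", fake_extract, d)
    second = extract_with_cache("article1", fake_extract, d)
```

The second call returns the same result without invoking the extractor again, which is the behavior described above.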

1 https://github.com/pdfminer/pdfminer.six
2 https://www.nltk.org
3 https://nlp.stanford.edu/software/tagger.shtml
4 https://pypi.org/project/word2number/
5 https://docs.python.org/3/library/pickle.html


Phrase: On 27 May the nest had two chicks 1 and 3 days old, a natural egg which subsequently hatched, and the radioegg; the radioegg and antenna were removed from the nest.
  Value: 1 | Context: 1 and 3 days | Class: OTHER
  Value: 3 | Context: 1 and 3 days | Class: OTHER

Phrase: We have excluded the final 3 days because of the presence of the hatched chicks, and there were about 64 hours for which there is no record because at some egg positions the radio signal was not picked up by the antenna.
  Value: 3 | Context: 3 days | Class: OTHER

Table 4.3: Example of output passed to the server.

4.3 Learning Algorithm

The classification system, which includes the online learning algorithm, aims to assign the right class to each value extracted by the system. For this, the system uses some positive examples (extracted and classified manually), so that the algorithm does not begin the classification process without any knowledge, but it relies mainly on knowledge passed by the user through the interface.

The features analyzed in each value were the ones defined in the work developed by Gomes

(2016):

• Number of vocabulary words: identification of the words contained in the phrase that are part of the previously known vocabulary (the one generated with all the training examples);

• Number of words: sentence size;

• Distance between value and specific words: recognition of the words contained in the phrase that are part of the previously known vocabulary, and calculation of the distance between them and the values;

• Interval verification with parameterized numbers: knowing the mean and median values of the various categories, we check whether the analyzed value is close to those values, returning 1 if it is and 0 if it is not;

• Distance to parameterized numbers: numerical distance of the value under analysis from the mean and median values.
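A rough sketch of such a feature vector follows. The vocabulary and the category mean are invented (the real ones come from the training data), and only the mean, not the median, is used here for brevity.

```python
# Sketch of the feature vector described above. VOCAB and CATEGORY_MEAN
# are invented stand-ins for the statistics learned from training data.
VOCAB = {"incubation", "days", "mass", "grams"}
CATEGORY_MEAN = {"Incubation": 45.0}

def features(value, sentence, category):
    words = sentence.lower().replace(".", "").split()
    in_vocab = [w for w in words if w in VOCAB]
    # token distance from the value to the nearest vocabulary word
    vi = words.index(str(value)) if str(value) in words else -1
    dist = min((abs(words.index(w) - vi) for w in in_vocab), default=-1)
    mean = CATEGORY_MEAN.get(category, 0.0)
    return {
        "n_vocab_words": len(in_vocab),
        "n_words": len(words),
        "dist_to_vocab": dist,
        "near_mean": 1 if abs(value - mean) <= 10 else 0,  # interval check
        "dist_to_mean": abs(value - mean),
    }

f = features(52, "Incubation lasted 52 days on average.", "Incubation")
```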

Initially, the algorithm is trained with about 70 positive cases, collected and classified manually,

and is fitted to the vocabulary resulting from the phrases extracted from the set of documents

uploaded by the user. Once created, the algorithm is saved and is only deleted if the user so decides.


Once this is done, the algorithm is only called on two occasions: every time the extraction process for a PDF finishes, to classify the extracted values, and when the user clicks the 'next' button, to be trained with the changes made and to reclassify the remaining values of the document being analyzed, so that the user always sees data updated according to the latest version of the algorithm. All data changed by the user is added to the file containing the positive cases, making it possible to use them to train a new algorithm. Figure 4.10 represents the whole process described above.

Figure 4.10: Flow chart with an overview of the whole process.


4.4 Interface

The interface is what allows the system to obtain the user's knowledge and, consequently, to develop its classification ability. It consists of two main pages and was tested on Firefox 62.0 and Opera 57.0.3072.0. The first page is the homepage, where the user can upload files and see which are already available for review. The second is the page that shows the data extracted from each document, where the user can see the data, the sentences in which they appear, and the entire document, and assign a class to the data. Figures 4.11 and 4.12 are screenshots of these pages, followed by a description of the functionality of each of their elements:

Figure 4.11: Interface Homepage.

1. Button to find the folder with the documents to analyze.

2. Button to upload the folder.

3. Button to reset the classifier and the vocabulary.

4. List of documents ready for analysis.

There is another page that only appears when no classifier has been created yet. This page only shows a progress bar, making the user wait until all the uploaded documents are processed by the extraction system to create the vocabulary and the classifier.


Figure 4.12: Interface Document’s Page.

1. Document’s PDF (with features like search or go to).

2. Export button to export the data to a .csv file.

3. Next and Previous buttons to change the sentence under analysis.

4. Sentence in which the data appears.

5. Values extracted from the sentence.

6. Possible classes for these values.

7. Acceptance and Rejection buttons (green to confirm that the class is correct, red to reject the value as relevant data).

8. Comments box.


4.5 Summary

In this chapter we presented our information extraction tool, including its architecture and the

description of its main components, namely the interface, the extraction system and the learning

algorithm.


Chapter 5

Tests and Results

In this chapter we present some tests that were made while the system was being implemented. These tests were useful for making some decisions about the system, such as choosing the learning algorithm. We also made a prediction of the learning curve of the algorithm and an analysis of the results obtained by the system.

5.1 Evaluation Methodology

These tests used about 50 articles which, after the extraction process, yielded more than 500 values. The algorithms tested are those recommended by the scikit-learn1 documentation as suitable for the incremental learning we wanted to carry out: they implement a partial fit function and are therefore able to learn from mini-batches of instances.
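As a minimal sketch (with made-up binary features and class labels, not the thesis code or data), this incremental training with scikit-learn's partial_fit looks as follows:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical class labels standing in for the thesis classes
# ("Body Mass", "Incubation", "Other", ...).
classes = np.array([0, 1, 2])
rng = np.random.RandomState(0)

clf = BernoulliNB()
for _ in range(5):  # five mini-batches of 20 made-up instances
    X_batch = rng.randint(0, 2, size=(20, 8))  # mostly binary features
    y_batch = rng.choice(classes, size=20)
    # partial_fit requires the full list of classes on the first call,
    # since later mini-batches may not contain every class
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(np.zeros((1, 8), dtype=int)))
```

Each call to partial_fit updates the model without revisiting earlier batches, which is what allows the tool to keep learning from user feedback as it arrives.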

Class               Group Size
Body Mass                   12
Body Temperature            22
Egg Temperature             30
Fledging                    20
Incubation                  24
Total Body Water            12
Other                      409
Total                      529

Table 5.4: Dataset size by class.

As we can see in Table 5.4, the group of values classified as "OTHER" was disproportionately large, so we decided to test it with different group sizes. First we tested without

1 http://scikit-learn.org/stable/modules/scaling_strategies.html


changing the data resulting from the extraction process, then with the "OTHER" group having the same size as the sum of the other groups, and finally without any data classified as belonging to this group.

5.1.1 Effectiveness Assessment

The first tests aimed to collect information about the performance of each of the algorithms with

respect to the classification of all extracted data.

We followed the leave-one-out approach, using the LeaveOneOut1 class from scikit-learn: in each iteration one value was classified using all the remaining ones as training values, until every value had been classified. We then calculated the average accuracy, precision, f1-score and recall for each of the algorithms.
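The leave-one-out procedure just described can be sketched as follows; the data here is random stand-in data, not the ~529 extracted values, and the classifier choice is illustrative:

```python
import numpy as np
from sklearn.model_selection import LeaveOneOut
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Made-up stand-in data: 40 values, 6 binary features, 3 classes.
rng = np.random.RandomState(1)
X = rng.randint(0, 2, size=(40, 6))
y = rng.randint(0, 3, size=40)

# Classify each value using all the others as training data.
y_pred = np.empty_like(y)
for train_idx, test_idx in LeaveOneOut().split(X):
    model = BernoulliNB().fit(X[train_idx], y[train_idx])
    y_pred[test_idx] = model.predict(X[test_idx])

acc = accuracy_score(y, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(
    y, y_pred, average="macro", zero_division=0)
print(f"accuracy={acc:.2f} precision={prec:.2f} recall={rec:.2f} f1={f1:.2f}")
```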

5.1.2 Learning Rate Assessment

We also measured the learning curve of each algorithm, to see whether any of them learned noticeably faster than the others, the goal being to minimise the period during which the tool has a low success rate when used for the first time.

First we tested the evolution of the results at each iteration of the leave-one-out approach. Then we defined the first third of the extracted values as training values and the rest as test values, starting from a training set with only one value and adding a new value to this set at each iteration, also making a new prediction at each iteration. To avoid results that depend on luck or chance, the reported results were obtained by repeating this test as many times as the size of the training set and averaging. The training set was randomly defined using scikit-learn's train test split2 function.
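A sketch of this incremental-training-set protocol, under the assumption of random stand-in data (the real test used the extracted values, and also averaged over repeated random splits):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

# Made-up data: 60 values with 6 binary features, 2 classes.
rng = np.random.RandomState(2)
X = rng.randint(0, 2, size=(60, 6))
y = rng.randint(0, 2, size=60)

# A randomly chosen third of the values acts as the training pool,
# the remaining two thirds as a fixed test set.
X_pool, X_test, y_pool, y_test = train_test_split(
    X, y, train_size=1 / 3, random_state=0)

# Grow the training set one value at a time, predicting the whole
# test set at each step to trace the learning curve.
curve = []
for n in range(1, len(X_pool) + 1):
    model = BernoulliNB().fit(X_pool[:n], y_pool[:n])
    curve.append(accuracy_score(y_test, model.predict(X_test)))
```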

5.2 Results

5.2.1 Classification Assessment

As we can see in Figure 5.13, both the Perceptron and the Passive-Aggressive classifiers do not seem to learn, ending up classifying the vast majority of the data always in the same way. This, in some

1 http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.LeaveOneOut.html

2 http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


Figure 5.13: Classification Assessment. (a) No data change. (b) Number of OTHERs equal to the sum of the remaining. (c) No data classified as OTHER.


situations, results in a high accuracy, as in Figure 5.13a, because the class these algorithms always choose holds the majority of the elements. The SGD and MLP classifiers also seem to lose effectiveness as the distribution of the test data classes becomes more uniform. The behaviour of these two classifiers may be due to the type of features, given that the data provided is mostly discrete and some features are even binary. This type of data seems to favour the Bernoulli Naïve Bayes and Multinomial Naïve Bayes algorithms; the former, in particular, shows a very good overall performance, which improves as the number of OTHERs decreases.
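Why binary features favour Bernoulli Naïve Bayes can be illustrated with a minimal sketch (toy data, not the thesis dataset): Bernoulli NB models each feature as present or absent, so the explicit absence of a feature is itself evidence for a class, whereas Multinomial NB only counts occurrences, so a zero contributes nothing to its likelihood.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Toy binary data: class 0 lacks feature 0, class 1 has it;
# feature 1 is always present and thus uninformative.
X = np.array([[0, 1], [0, 1], [1, 1], [1, 1]])
y = np.array([0, 0, 1, 1])

bern = BernoulliNB().fit(X, y)
mult = MultinomialNB().fit(X, y)

# Bernoulli NB can exploit the absence of feature 0 directly;
# Multinomial NB must infer the classes from the counts alone.
print("Bernoulli:", bern.predict(X), "Multinomial:", mult.predict(X))
```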

5.2.2 Learning Rate Assessment

The results shown below correspond only to the two algorithms with the best results in the previous tests, Bernoulli Naïve Bayes and Multinomial Naïve Bayes, keeping the three dataset variants already described.

5.2.2.1 Leave-One-Out Approach

In Figures 5.14 and 5.15 we can see that the Bernoulli curve shows a more or less constant growth and reaches a maximum of around 80% accuracy. The Multinomial algorithm, on the other hand, shows a large and fast initial growth, but from then on the value remains constant at an accuracy of about 60%. We can also see that the Multinomial classifier is much more affected by the presence of data classified as OTHER, showing very low values in Figures 5.15a and 5.15b. This presence also affects the Bernoulli algorithm, although mostly in its behaviour rather than in the values obtained.

5.2.2.2 Incremental Training Set Approach

As we can see in Figures 5.16 and 5.17, the Bernoulli algorithm once again obtains better results than the Multinomial, not only in the accuracy values themselves but also in the evolution of this measure as the number of examples grows over the test, being the only algorithm that shows a clearly positive evolution over time. We can also conclude that the high number of values classified as OTHER is indeed a problem: in the tests where this condition was present, the variation in accuracy is zero or practically null, giving the impression that the algorithms did not learn. The Multinomial algorithm rises faster initially, but its accuracy then remains constant at only 45%.


Figure 5.14: Bernoulli learning curve following the leave-one-out approach. (a) No data change. (b) Number of OTHERs equal to the sum of the remaining. (c) No data classified as OTHER.


Figure 5.15: Multinomial learning curve following the leave-one-out approach. (a) No data change. (b) Number of OTHERs equal to the sum of the remaining. (c) No data classified as OTHER.


Figure 5.16: Bernoulli learning curve following the incremental training set approach. (a) No data change. (b) Number of OTHERs equal to the sum of the remaining. (c) No data classified as OTHER.


Figure 5.17: Multinomial learning curve following the incremental training set approach. (a) No data change. (b) Number of OTHERs equal to the sum of the remaining. (c) No data classified as OTHER.


5.3 Summary and Discussion

These tests allowed us to conclude that the Bernoulli Naïve Bayes algorithm clearly showed the best results, both in terms of absolute values, as shown in Section 5.2.1, and in terms of learning rate, as shown in Section 5.2.2. These results are not entirely surprising, since this classifier was known in advance to be well suited for discrete data, which is the main type of data present in the features under study.

We also concluded that these tests were negatively influenced by the excessive amount of data classified as belonging to the OTHER class. Users of the tool should therefore avoid this class as much as possible when classifying the extracted data, something that can also be mitigated by increasing the number of themes covered by the tool itself. This influence was felt not only in the behaviour of the classifiers but also in the logistics of the tests: as we can see in Figures 5.16c and 5.17c, the training set ends up with only about 40 elements.


Chapter 6

Conclusion

This work arose with the purpose of helping the scientific community to obtain information about the physiology of the species that constitute the taxonomic group of birds, with the main objective of answering the following questions: "Is it possible to build a tool capable of extracting the information available in scientific articles and selecting the information that may correspond to the selected physiological characteristics?" and "Is it possible to take advantage of the user's knowledge to improve the effectiveness of this tool?".

In order to answer these questions, we built a tool capable of receiving scientific articles, extracting the data it considers relevant from those articles, classifying that data, presenting it to the user, and taking advantage of the user's feedback to improve the classification process. The system follows a hybrid approach, complementing a rule-based approach with an online machine learning algorithm.

The tool is composed of three modules: the extraction system, which performs the morphological analysis and applies regular expressions to extract data from the articles; the learning algorithm, which classifies this data; and the interface, which allows the user to evaluate the work done by the algorithm and correct it when necessary.

Two phases of tests were performed. The first essentially served to assess the effectiveness of each algorithm: the accuracy, precision, recall and f1-score of six algorithms were measured. This test allowed us to eliminate four of the candidate algorithms, leaving Bernoulli Naïve Bayes and Multinomial Naïve Bayes as the best options. The second phase of tests was based on the learning curve and allowed us to assess the level of initial knowledge each algorithm needs in order to perform well; it also served to determine which algorithm had the greatest learning ability over time, given the data at our disposal. We came to the


conclusion that the Bernoulli algorithm was the most capable of accomplishing what was proposed.

Given the test results, we recommend that the user provide about 80 manually classified examples, so that the algorithm starts classifying effectively as soon as possible.

With the work carried out, we believe we have built a tool capable of simplifying the scientists' work when it comes to collecting information, and also able to constantly improve its own performance using something as simple as a click from whoever uses it.

6.1 Future Work

This work fulfilled its purpose of easing researchers' access to the data present in scientific articles; still, there are changes that would not only make the tool more useful but also improve its performance.

In terms of performance, the biggest problem identified is the time the system takes to extract the data from the files, which is essentially due to the way the regular expressions are written. In order to obtain faster access to the data, future work should reformulate this part of the code.

The utility of the tool can be extended by increasing the number of fields the system is capable of analysing, which would require additional regular expressions capable of capturing the data belonging to these new fields. Ideally, the user would participate in this process by introducing new fields and their keywords to help the extraction. It would also be interesting to offer the user the possibility of choosing which fields to extract from the documents to be analysed.

The interface can also be improved by adding the ability to locate the sentence in the embedded PDF viewer; we tried to implement this functionality but did not find a solution.
