Mário Rodrigues · António Teixeira
Advanced Applications of Natural Language Processing for Performing Information Extraction
SpringerBriefs in Electrical and Computer Engineering: Speech Technology



ISSN 2191-8112    ISSN 2191-8120 (electronic)
SpringerBriefs in Electrical and Computer Engineering
ISBN 978-3-319-15562-3    ISBN 978-3-319-15563-0 (eBook)
DOI 10.1007/978-3-319-15563-0

Library of Congress Control Number: 2015935192

Springer Cham Heidelberg New York Dordrecht London

© The Authors 2015
This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper

Springer International Publishing AG Switzerland is part of Springer Science+Business Media (www.springer.com)

Mário Rodrigues
ESTGA/IEETA, University of Aveiro, Portugal

António Teixeira
DETI/IEETA, University of Aveiro, Portugal

Preface

The amount of content available in natural language (English, Italian, Portuguese, etc.) increases every day. This book provides a timely contribution on how to create information extraction (IE) applications that are able to tap the vast amount of relevant information available in natural language sources: web pages, official documents (such as laws and regulations, books and newspapers), and the social web.

Trends such as Open Data and Big Data show that there is value to be added by effectively processing large amounts of available data. Natural language sources are usually stored in digital format, searched using keyword-based methods, displayed as they were stored, and interpreted by the end users. However, it is not common to have software that manipulates these sources in order to present information in an adequate manner to fit users' context and needs. If such sources had structured and formal representations (relational and/or with some markup language, etc.), computer systems would be able to effectively manipulate that data to meet end users' expectations: summarize data, present graphics, etc.

The research community has been very active in producing software tools to support the development of information extraction systems for several natural languages. These tools are now mature enough to be tested in production systems. To stimulate the adoption of those technologies by the broad community of software developers, it is necessary to show their potential and how they can be used. Readers are introduced to the problem of IE and its current challenges and limitations, all supported with examples. The book discusses the need to fill the gap between data/documents/people and provides a broad overview of the state-of-the-art technology in IE.

This book presents a description of a generic architecture for developing systems that are able to learn how to extract relevant information from natural language documents, and assign semantic meaning to it. We also illustrate how to implement a working system using, in most parts, state-of-the-art and freely available software for several languages. Some concrete examples of systems/applications are provided to illustrate how applications can deliver information to end users.

Aveiro, Portugal
December 2014

Mário Rodrigues
António Teixeira


Contents

1 Introduction .................................................................... 1
   1.1 Document Society ........................................................ 1
   1.2 Problems ................................................................ 2
   1.3 Semantics and Knowledge Representation .................................. 3
   1.4 Natural Language Processing ............................................ 4
   1.5 Information Extraction ................................................. 5
       1.5.1 Main Challenges in Information Extraction ....................... 5
       1.5.2 Approaches to Information Extraction ............................ 6
       1.5.3 Performance Measures ............................................ 7
       1.5.4 General Architecture for Information Extraction ................. 8
   1.6 Book Structure ......................................................... 8
   References ................................................................. 10

2 Data Gathering, Preparation and Enrichment ..................................... 13
   2.1 Process Overview ....................................................... 13
   2.2 Tokenization and Sentence Boundary Detection ........................... 15
       2.2.1 Tools ........................................................... 15
       2.2.2 Representative Tools: Punkt and iSentenizer ..................... 16
   2.3 Morphological Analysis and Part-of-Speech Tagging ...................... 17
       2.3.1 Tools ........................................................... 18
       2.3.2 Representative Tools: Stanford POS Tagger, SVMTool,
             and TreeTagger .................................................. 19
   2.4 Syntactic Parsing ...................................................... 20
       2.4.1 Representative Tools: Epic, StanfordParser, MaltParser,
             TurboParser ..................................................... 21
   2.5 Representative Software Suites ......................................... 23
       2.5.1 Stanford NLP .................................................... 23
       2.5.2 Natural Language Toolkit (NLTK) ................................. 24
       2.5.3 GATE ............................................................ 24
   References ................................................................. 24

3 Identifying Things, Relations, and Semantizing Data ........................... 27
   3.1 Identifying the Who, the Where, and the When ........................... 27
   3.2 Relating Who, What, When, and Where .................................... 30
   3.3 Getting Everything Together ............................................ 32
       3.3.1 Ontology ........................................................ 32
       3.3.2 Ontology-Based Information Extraction (OBIE) .................... 33
   References ................................................................. 34

4 Extracting Relevant Information Using a Given Semantic ........................ 37
   4.1 Introduction ........................................................... 37
   4.2 Defining How and What Information Will Be Extracted .................... 38
   4.3 Architecture ........................................................... 39
   4.4 Implementation of a Prototype Using State-of-the-Art Tools ............. 40
       4.4.1 Natural Language Processing ..................................... 41
       4.4.2 Domain Representation ........................................... 44
       4.4.3 Semantic Extraction and Integration ............................. 45
   References ................................................................. 49

5 Application Examples .......................................................... 51
   5.1 A Tutorial Example ..................................................... 51
       5.1.1 Selecting and Obtaining Software Tools .......................... 53
       5.1.2 Tools Setup ..................................................... 53
       5.1.3 Processing the Target Document .................................. 54
       5.1.4 Using for Other Languages and for Syntactic Parsing ............. 58
   5.2 Application Example 2: IE Applied to Electronic Government ............. 58
       5.2.1 Goals ........................................................... 58
       5.2.2 Documents ....................................................... 59
       5.2.3 Obtaining the Documents ......................................... 59
       5.2.4 Application Setup ............................................... 61
       5.2.5 Making Available Extracted Information Using a Map .............. 65
       5.2.6 Conducting Semantic Information Queries ......................... 67
   References ................................................................. 68

6 Conclusion .................................................................... 71

Index ........................................................................... 73


Chapter 1
Introduction

Abstract Chapter 1 introduces the problem of extracting information from natural language unstructured documents, which is becoming more and more relevant in our document society. Despite the many useful applications that the information in these documents can potentiate, it is harder and harder to obtain the wanted information. Major problems result from the fact that many of the documents are in a format not usable by humans or machines. There is the need to create ways to extract relevant information from the vast amount of natural language sources.

After this, the chapter presents, briefly, background information on semantics, knowledge representation and Natural Language Processing, to support the presentation of the area of Information Extraction [IE, the analysis of unstructured text in order to extract information about pre-specified types of events, entities or relationships, such as the relationship between disease and genes or disease and food items; in so doing, value and insight are added to the data (Text mining of web-based medical content, Berlin, p 50)], its challenges, different approaches and general architecture, which is organized as a processing pipeline including domain independent components (tokenization, morphological analysis, part-of-speech tagging, syntactic parsing) and domain specific IE components (named entity recognition and co-reference resolution, relation identification, information fusion, among others).

Keywords Document society · Unstructured documents · Natural language · Semantics · Ontologies · Information extraction · Natural language processing · NLP · Knowledge representation

    1.1 Document Society

Our society is a document society (Buckland 2013). Documents have become the glue that enables societies to cohere. Documents have increasingly become the means for monitoring, influencing, and negotiating relationships with others (Buckland 2013). With the advent of the web and other technologies, the concept of document evolved to range from classical books and reports to complex online multimedia information incorporating hyperlinks.

The number of such documents and their rate of increase are overwhelming. Some examples: governments produce large amounts of documents at several levels (local, central) and of many types (laws, regulations, public minutes of meetings, etc.); information in companies' intranets is increasing; more and more exams, reports and other medical documents are stored in servers by health institutions. Our personal documents grow day by day in number and size. Health research is one of the most active areas, resulting in a steady flow of documents (e.g. medical journals and masters and doctoral theses) reporting on new findings and results. There are also many portals and web sites with health information, such as the example presented in Fig. 1.1.

Much of the information that would be of interest to citizens, researchers, and professionals is found in unstructured documents. Despite the increasing use of tables, images, graphs and movies, a relevant part of these documents relies, at least partially, on written natural language. The amount of content available in natural language (English, Portuguese, Chinese, Spanish, etc.) increases every day. This is particularly noticeable on the web.

    1.2 Problems

Despite the many useful applications that the information in these documents can potentiate, it is harder and harder to obtain the wanted information. This huge and increasing amount of documents, available on the web and companies' intranets and accumulated by most of us in our computers and online services, potentiates many applications but also poses several challenges to make those documents really useful.

Fig. 1.1 An example of a website providing health information (www.womenshealth.gov)

A major problem results from the fact that many of the documents/data are in a format not usable by machines. Hence, there is the need to create ways to extract relevant information from the vast amount of natural language sources. Natural language is the most comprehensive tool for humans to encode knowledge (Santos 1992), but creating tools to decode this knowledge is far from simple.

The second problem that needs to be solved is how to represent and store the extracted information. One must also make this information usable by machines. Regarding the discovery of information, general search engines do not allow the end user to obtain a clear and organized presentation of the available information. Instead, any given search yields a more or less hit-or-miss, random return of information. Efficient access to this information implies the development of semantic search systems (Guha et al. 2003) capable of taking into consideration the concepts involved and not just the words.

Semantic search has some advantages over search that directly indexes text words (Teixeira et al. 2014): (1) it produces smaller sets of results, by being capable of identifying and removing duplicated or irrelevant results; (2) it can integrate related information scattered across documents; (3) it can produce relevant results even when the question and answer have no common words; and (4) it makes possible complex and more natural queries.

To make possible semantic search and other applications based on semantic information, we need to add semantics to the documents or create semantic descriptions representing or summarizing the original documents. This semantic information must be derived from the documents, and this can be done using techniques from the Information Extraction (IE) and Natural Language Processing (NLP) fields, as will be described and exemplified in this book. In general, to make IE possible, texts are first pre-processed (e.g. separated into sentences and words) and enriched (e.g. words marked as nouns or verbs) by applying several NLP methods.

    1.3 Semantics and Knowledge Representation

As argued in the previous section, there is the need to extract semantic information from natural language documents to make possible new semantic-based applications, and semantic search, on the information nowadays hidden in natural language documents. In this section, some background is given on the foundational concepts of semantics, ontologies and knowledge representation.

    Semantics is the study of meaning of linguistic expressions, including the relations between signifiers, such as words, phrases, signs and symbols, and their meaning. The language can be an artificial language (e.g. a computer programming language) or a natural language, such as English or Portuguese. The second kind is directly related to the topic of this book. Computational semantics addresses the automation of the processes of constructing representations of meaning and reasoning with them.


Knowledge representation (KR) addresses how to represent information about the world in forms that are usable by computer systems to solve complex tasks. Research in KR includes studying how to use symbols to represent a set of facts within a knowledge domain. As defined by Sowa (2000), knowledge representation is the application of logic and ontology to the task of constructing computable models for some domain. In general, KR implies creating surrogates that represent real-world entities, and endowing them with properties and interactions that represent real-world properties and interactions. Examples of knowledge representation formalisms are logic representations, semantic networks, rules, frames, and ontologies.

Ontology is a central concept in KR. An ontology is formally defined as an explicit specification of a shared conceptualization (Gruber 1993). It describes a hierarchy of concepts related by subsumption relationships, and can include axioms to express other relationships between concepts and to constrain their intended interpretation. From the computer science point of view, using an ontology to explicitly define the application domain brings large benefits regarding information accessibility, maintainability, and interoperability. The ontology formalizes and allows making public the application's view of the world (Guarino 1998).

Ontologies allow specifying knowledge in machine-processable formats since they can be specified using languages with well-defined syntax, such as the Resource Description Framework Schema (RDFS) and the Web Ontology Language (OWL). As ontology specification languages have well-defined semantics, specifying knowledge using ontologies prevents the meaning of the knowledge from being open to subjective intuitions and different interpretations (Antoniou and van Harmelen 2009).
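As a toy illustration of the subsumption hierarchies that ontologies describe, the sketch below encodes a handful of invented concepts as a parent map and checks "is-a" relationships by walking the hierarchy. Real systems would express this in RDFS or OWL (e.g. via a library such as rdflib); this is only a minimal stand-in for the idea.

```python
# Minimal sketch of a concept hierarchy with subsumption ("is-a") reasoning.
# Each concept maps to its direct parent concept; None marks the root.
hierarchy = {
    "Thing": None,
    "Agent": "Thing",
    "Person": "Agent",
    "Organization": "Agent",
    "Company": "Organization",
}

def is_a(concept, ancestor):
    """Return True if `concept` is subsumed by `ancestor` (transitively)."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = hierarchy.get(concept)
    return False

print(is_a("Company", "Agent"))   # Company -> Organization -> Agent
print(is_a("Person", "Company"))  # a Person is not a Company
```

An axiom-capable language such as OWL would additionally let us constrain interpretations (e.g. declare Person and Company disjoint), which a bare parent map cannot express.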

    1.4 Natural Language Processing

Allen (2000) defines NLP as "computer systems that analyze, attempt to understand, or produce one or more human languages, such as English, Japanese, Italian, or Russian. The input might be text, spoken language, or keyboard input. The task might be to translate to another language, to comprehend and represent the content of text, to build a database or generate summaries, or to maintain a dialogue with a user as part of an interface for database/information retrieval."

The area of NLP can be divided into several subareas, such as Computational Linguistics, Information Extraction, Information Retrieval, Language Understanding and Language Generation (Jurafsky and Martin 2008).

From the many tasks integrated in NLP, here is a list of those that are particularly relevant for this book:

- Sentence breaking: find the sentence boundaries;
- Part-of-speech tagging: given a sentence, determine the part of speech (morphosyntactic role) of each word;
- Named Entity Recognition (NER): determine which items in the text map to entities such as people, places or dates;
- Parsing: grammatical analysis of a sentence;
- Information Extraction: to be described in the next section.
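To make the first three tasks concrete, here is a deliberately naive sketch: regex-based sentence breaking and tokenization, plus a toy NER that merely picks capitalized tokens. It is not how production systems work (the trained tools of Chap. 2 are), but it shows the shape of each task's input and output.

```python
import re

def split_sentences(text):
    # Naive sentence breaking: split after ., ! or ? followed by whitespace.
    return re.split(r"(?<=[.!?])\s+", text.strip())

def tokenize(sentence):
    # Naive tokenization: runs of word characters, or single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", sentence)

def naive_ner(tokens):
    # Toy NER: capitalized tokens after the first are taken as entity parts.
    # (The sentence-initial token is skipped: any word is capitalized there.)
    return [t for t in tokens[1:] if t[:1].isupper()]

text = "Gina Torretta took the helm at BNC Holdings. She succeeds Nick Andrews."
for sentence in split_sentences(text):
    print(naive_ner(tokenize(sentence)))
```

The toy NER already exhibits typical failure modes: it misses sentence-initial names ("Gina") and the naive splitter would break on abbreviations such as "Ms.", which is precisely why trained sentence breakers and entity recognizers exist.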

    1.5 Information Extraction

Information extraction is a subarea of Natural Language Processing dedicated to the general problem of detecting entities referred to in natural language texts, the relations between them and the events they participate in. Informally, the goal is to detect elements such as who did what to whom, when and where (Màrquez et al. 2008). Natural language texts can be unstructured, plain texts, and/or semi-structured machine-readable documents, with some kind of markup.

As Gaizauskas and Wilks (1998) observed, IE may be seen as populating structured information sources from unstructured, free text, information sources.

IE differs from information retrieval (IR), the task of locating relevant documents in large document sets usually performed by current search engines such as Google or Bing, as its purpose is to acquire relevant information that can be later manipulated as needed. IE aims to extract relevant information from documents, while information retrieval aims to retrieve relevant documents from collections. In IR, after querying search engines, users must read each document of the result set to learn the facts reported. Systems featuring IE would be capable of merging related information scattered across different documents, producing summaries of facts reported in large amounts of documents, presenting facts in tables, etc.

Early extraction tasks concentrated on the identification of named entities, like people and company names, and relationships among them, from natural language text (Piskorski and Yangarber 2013). With the developments of recent years, which made online access to both structured and unstructured data easier, new applications of IE appeared and, to address the needs of these new applications, the techniques of structure extraction have evolved considerably over the last decades (Piskorski and Yangarber 2013).

    1.5.1 Main Challenges in Information Extraction

Two important challenges exist in IE. One derives from the variety of ways of expressing the same fact. As illustrated by McNaught and Black (2006), the next statements all inform that a woman named Torretta is the new chairperson of a company named BNC Holdings:

    BNC Holdings Inc. named Ms. G. Torretta to succeed Mr. N. Andrews as its new chair-person.

    Nicholas Andrews was succeeded by Gina Torretta as chair-person of BNC Holdings Inc.

    Ms. Gina Torretta took the helm at BNC Holdings Inc. She succeeds Nick Andrews.


To extract the relevant information from each of these alternative formulations, linguistic analysis is required to cope with grammatical variation (active/passive), lexical variation ("named to"/"took the helm"), and anaphora resolution for cross-sentence references ("Ms. Gina Torretta" ... "She").

The other challenge, shared by almost all NLP tasks, derives from the high expressiveness of natural languages, which can have ambiguous structure and meaning. Lee (2004) exemplifies this phenomenon with a McDonnell-Douglas ad from 1985: "At last, a computer that understands you like your mother." This sentence can be interpreted in at least three different ways: (1) the computer understands you as well as your mother understands you; (2) the computer understands that you like your mother; (3) the computer understands you as well as it understands your mother.

    1.5.2 Approaches to Information Extraction

Over the years, several different approaches have been proposed to solve the challenges of IE. They have been classified along different dimensions. Some classifications are relative to the type of input documents (Muslea 1999), others to the type of technology used (Piskorski and Yangarber 2013; Chiticariu et al. 2013), and others to the degree of automation of the system (Hsu and Dung 1998; Chang et al. 2003). The distinct classification schemes reflect the variety of concerns of the proposing authors and also the evolution of IE over time.

Regarding the type of input documents, the methods developed to extract information from unstructured texts differ from the approaches employed when documents have some kind of markup such as XML. The methods to extract information from unstructured sources tend to rely more on deep NLP. The lack of structure in the data implies that one of the most suitable ways to discriminate the different concepts involved in texts is to analyze them as thoroughly as possible. However, it is also possible to use superficial patterns targeted at information that is expressed in a reduced set of sentences, such as "X was born in Y" or "X is a Y-born", or targeted at information with well-defined formats such as e-mail addresses, dates, and money amounts.
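Superficial patterns of this kind can be implemented directly with regular expressions. The sketch below captures "X was born in Y" statements and e-mail addresses; the patterns and the sample sentence are invented for illustration and are far simpler than production-quality extractors.

```python
import re

# Simplified surface patterns: a capitalized name, the literal phrase,
# and a capitalized place; plus a rough e-mail pattern.
BORN_IN = re.compile(
    r"(?P<person>[A-Z]\w+(?: [A-Z]\w+)*) was born in (?P<place>[A-Z]\w+)"
)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[A-Za-z]{2,}")

text = "John Smith was born in Lisbon. Contact him at john.smith@example.org."

match = BORN_IN.search(text)
print(match.group("person"), "->", match.group("place"))
print(EMAIL.findall(text))
```

Such patterns trade recall for precision: they only fire on the exact formulations they encode, which is why the text recommends them mainly for facts with well-defined surface forms.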

When information sources have markup such as XML and/or are machine-generated content based on templates, IE methods can take advantage of the markup and the structure of the document, since they provide clues about the type of content. Markup can occur embedded in the text, e.g. John was born on 14th March 1959, or in special places such as Wikipedia's page infoboxes. Methods that extract information from such contents tend to rely on the markup and the document structure, since these were produced by the information publisher and thus should be accurate. It is also common to use such information as seed examples for training and improving the accuracy of methods that look for information originally found in unstructured data (Suchanek et al. 2007; Kasneci et al. 2008).
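When the markup is available, extraction can lean on document structure instead of linguistic analysis. A minimal sketch using Python's standard XML parser; the snippet and its tag names are invented for the example, standing in for whatever schema a publisher actually uses.

```python
import xml.etree.ElementTree as ET

# Hypothetical marked-up fragment: the tags type the content directly,
# so no linguistic analysis is needed to recover the facts.
snippet = "<doc><p><name>John</name> was born on <date>14th March 1959</date>.</p></doc>"
root = ET.fromstring(snippet)

record = {
    "name": root.find(".//name").text,
    "birthdate": root.find(".//date").text,
}
print(record)
```

Records extracted this way are exactly the kind of high-precision seed examples the text mentions for bootstrapping extractors over unstructured data.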


Relative to the technology used, earlier IE systems were essentially rule-based approaches, also called knowledge engineered approaches. This type of technology is still used in modern approaches, at least partially. It uses hard-coded rules created by human experts that encode linguistic knowledge by matching patterns over a variety of structures: text strings, part-of-speech tags, dictionary entries. The rules are usually targeted at specific languages and domains, and these systems are generically very accurate and ready to use out of the box (Andersen et al. 1992; Appelt et al. 1993; Lehnert et al. 1993). As manual coding of the rules can become a time-consuming task, and also because rules rarely remain unchanged when porting to other languages and/or domains, some implementations introduced algorithms for automatically learning rules from examples (Soderland 1999; Califf and Mooney 1999; Ciravegna 2001).

The success of IE motivated the broadening of its scope to include more unstructured and noisy sources and, as a result, statistical learning algorithms were introduced. Among the most successful approaches are the ones based on Hidden Markov Models, conditional random fields, and maximum entropy models (Ratnaparkhi 1999; Lafferty et al. 2001). Later, more holistic analyses of the document were developed, including techniques for grammar construction and ontology-based IE (Viola and Narasimhan 2005; Wimalasuriya and Dou 2010).

Hybrid approaches, which use a mix of the previous two, combine the best features of each kind of approach: the accuracy of rule-based approaches with the coverage and adaptability of machine learning approaches.

Some IE approaches use ontologies to store and guide the IE process. The success of these approaches motivated the creation of the term Ontology-Based Information Extraction (OBIE). These approaches will be described in Chap. 4 of this book.

Despite the different approaches, there is no clear winner. The advent of the Semantic Web and Open Data made ontology-based IE (OBIE) one of the most popular trends in the field. However, OBIE includes other IE algorithms and is not an alternative method but rather an approach that processes natural language text through a mechanism guided by ontologies and presents the output using ontologies (Wimalasuriya and Dou 2010).

Comprehensive overviews of IE approaches are provided in (Sarawagi 2008; Piskorski and Yangarber 2013).

    1.5.3 Performance Measures

The metrics commonly used in the evaluation of IE systems are precision, recall and F-measure (Makhoul et al. 1999). Precision is the ratio between the number of correct or relevant findings and the number of all findings of the system; recall is the ratio between the number of correct or relevant findings and the number of expected findings, which is the total amount of relevant facts that exist in the documents.


F-measure is the weighted harmonic mean of precision and recall, commonly calculated as F1, which is the F-measure when β is equal to 1. These definitions can be expressed as formulas as follows:

    Precision = number of correct findings / number of findings

    Recall = number of correct findings / number of expected findings

    F-measure = (1 + β²) · (Precision · Recall) / (β² · Precision + Recall)

    F1 = 2 · (Precision · Recall) / (Precision + Recall)

A difficulty when computing these measures is that it is necessary to know all the relevant findings in the documents, specifically when calculating recall and, thus, F-measure. This implies having someone read all documents and annotate the relevant parts of the texts, which is a time-consuming task. Ideally, the annotation should be performed by more than one person and followed by group consensus about which annotations are the correct ones. It is possible to find some sets of documents already annotated, named golden collections.
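Given a system's findings and a gold-standard annotation, the three measures follow directly from the formulas above. A small sketch, assuming findings can be compared as plain strings (the example entities are invented):

```python
def evaluate(found, expected):
    """Compute precision, recall, and F1 from system and gold findings."""
    found, expected = set(found), set(expected)
    correct = found & expected
    precision = len(correct) / len(found) if found else 0.0
    recall = len(correct) / len(expected) if expected else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if precision + recall else 0.0
    return precision, recall, f1

# 4 findings, 3 of them correct, 6 expected in the gold annotation:
# precision = 3/4 = 0.75, recall = 3/6 = 0.5, F1 = 0.6
p, r, f1 = evaluate(
    ["Torretta", "Andrews", "BNC Holdings", "helm"],
    ["Torretta", "Andrews", "BNC Holdings", "chair-person", "succession", "merger"],
)
print(round(p, 2), round(r, 2), round(f1, 2))
```

Note that `expected` is exactly the gold annotation whose construction the paragraph above describes as costly: without it, recall (and hence F1) cannot be computed.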

    1.5.4 General Architecture for Information Extraction

Although IE approaches differ significantly, the core process is usually organized as a processing pipeline that includes domain independent components (tokenization, morphological analysis, part-of-speech tagging, syntactic parsing) and domain specific IE components (named entity recognition and co-reference resolution, relation identification, information fusion, among others). This general pipeline is illustrated in Fig. 1.2. Taking documents as input, the sequence of domain independent and domain specific processing modules extracts information (or knowledge) that is made available to applications, humans or further processing.
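The pipeline organization can be sketched as a chain of stages, each reading and extending a shared document structure. The stage bodies below are stubs invented for illustration (real systems plug in the trained tools of the next chapters); the point is the ordering: domain independent stages first, domain specific ones after.

```python
# Skeleton of the general IE pipeline: every stage takes the running
# document dictionary and returns it enriched with one more layer.
def sentence_split(doc):
    doc["sentences"] = [s for s in doc["text"].split(". ") if s]
    return doc

def tokenize(doc):
    doc["tokens"] = [s.split() for s in doc["sentences"]]
    return doc

def pos_tag(doc):
    # Stub tagger: labels every token NOUN, just to show where tags would go.
    doc["pos"] = [[(tok, "NOUN") for tok in sent] for sent in doc["tokens"]]
    return doc

def named_entity_recognition(doc):
    # Stub domain-specific stage: capitalized tokens stand in for entities.
    doc["entities"] = [t for sent in doc["tokens"] for t in sent if t[:1].isupper()]
    return doc

DOMAIN_INDEPENDENT = [sentence_split, tokenize, pos_tag]
DOMAIN_SPECIFIC = [named_entity_recognition]

def run_pipeline(text):
    doc = {"text": text}
    for stage in DOMAIN_INDEPENDENT + DOMAIN_SPECIFIC:
        doc = stage(doc)
    return doc

result = run_pipeline("Gina Torretta leads BNC Holdings")
print(result["entities"])
```

Because every stage shares the same document interface, swapping a stub for a real tool (a trained tagger, a co-reference resolver) changes one entry in a stage list, not the pipeline itself.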

    1.6 Book Structure

In this first chapter, readers are introduced to the area of IE and its current challenges. The chapter starts by introducing the need to fill the gap between documents and people, and ends with the presentation of a generic architecture for developing systems that are able to learn how to extract relevant information from natural language documents and assign semantic meaning to it. The chapter also includes some background information on semantics, ontologies, knowledge representation and Natural Language Processing.

The two main groups of processing modules of the generic architecture are the subject of the following two chapters. First, Chap. 2 presents the domain independent modules that, in general, split the text into relevant units (sentences and tokens) and enrich the document by adding morphological and syntactic information. The third chapter presents information on how to extract entities and relations and create a semantic representation with the extracted information.

    As OBIE is a very important trend, a complete chapter, the fourth, is dedicated to presenting a proposal of a software architecture for performing OBIE using an arbitrary ontology and to describing a system developed based on that architecture.

    As this book aims at including real applications, Chap. 5 illustrates how to implement working systems. The chapter presents two systems: the first is a tutorial system (that we challenge all readers to build) developed by almost direct use of freely available tools and documents; the second, more complex and for a language other than English, illustrates a state-of-the-art system and how it can deliver information to end users.

    The book ends with some comments on what was selected as content for the book and some considerations regarding the future.

    Fig. 1.2 The general processing pipeline of information extraction systems. Documents are first processed by the domain-independent modules (sentence splitting, tokenization, morphological analysis, POS tagging, and syntactic parsing; see Chap. 2) and then by the domain-specific modules (named entity recognition, co-reference resolution, relation identification, and information fusion; see Chaps. 3 and 4), producing information/knowledge for applications, humans, or further processing.


    References

    Allen JF (2000) Natural language processing. In: Ralston A, Reilly ED, Hemmendinger D (eds) Encyclopedia of computer science, 4th edn. Wiley, Chichester, pp 1218–1222

    Andersen PM et al (1992) Automatic extraction of facts from press releases to generate news stories. In: Proceedings of the third conference on applied natural language processing, pp 170–177

    Antoniou G, van Harmelen F (2009) Web Ontology Language: OWL. In: Staab S, Studer R (eds) Handbook on ontologies, 2nd edn. International handbooks on information systems. Springer, Berlin, pp 91–110

    Appelt DE et al (1993) FASTUS: a finite-state processor for information extraction from real-world text. In: IJCAI, pp 1172–1178

    Buckland M (2013) The quality of information in the web. BiD: textos universitaris de biblioteconomia i documentació (31)

    Califf ME, Mooney RJ (1999) Relational learning of pattern-match rules for information extraction. In: AAAI/IAAI, pp 328–334

    Chang C-H, Hsu C-N, Lui S-C (2003) Automatic information extraction from semi-structured web pages by pattern discovery. Decis Support Syst 35(1):129–147

    Chiticariu L, Li Y, Reiss FR (2013) Rule-based information extraction is dead! Long live rule-based information extraction systems! In: EMNLP, pp 827–832

    Ciravegna F (2001) Adaptive information extraction from text by rule induction and generalisation. In: International joint conference on artificial intelligence, pp 1251–1256

    Gaizauskas R, Wilks Y (1998) Information extraction: beyond document retrieval. J Doc 54(1):70–105

    Gruber TR (1993) A translation approach to portable ontology specifications. Knowl Acquis 5(2):199–220

    Guarino N (1998) Formal ontology and information systems. In: FOIS '98: proceedings of the international conference on formal ontology in information systems. IOS Press, Amsterdam, pp 3–15

    Guha R, McCool R, Miller E (2003) Semantic search. In: The twelfth international World Wide Web conference (WWW), Budapest, p 779

    Hsu C-N, Dung M-T (1998) Generating finite-state transducers for semi-structured data extraction from the web. Inf Syst 23(8):521–538

    Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall, New York

    Kasneci G et al (2008) The YAGO-NAGA approach to knowledge discovery. ACM SIGMOD 37:7

    Lafferty J, McCallum A, Pereira F (2001) Conditional random fields: probabilistic models for segmenting and labeling sequence data. In: Proceedings of the international conference on machine learning (ICML-2001)

    Lee L (2004) "I'm sorry Dave, I'm afraid I can't do that": linguistics, statistics, and natural language processing circa 2001. In: Committee on the Fundamentals of Computer Science: Challenges and Opportunities, Computer Science and Telecommunications Board, National Research Council (ed) Computer science: reflections on the field, reflections from the field. The National Academies Press, Washington, pp 111–118

    Lehnert W et al (1993) UMass/Hughes: description of the CIRCUS system used for Tipster text. In: Proceedings of TIPSTER '93, 19–23 September 1993, pp 241–256

    Makhoul J et al (1999) Performance measures for information extraction. In: Proceedings of DARPA broadcast news workshop, pp 249–252

    Màrquez L et al (2008) Semantic role labeling: an introduction to the special issue. Comput Linguist 34(2):145–159

    McNaught J, Black W (2006) Information extraction. In: Ananiadou S, McNaught J (eds) Text mining for biology and biomedicine. Artech House, Boston

    Muslea I (1999) Extraction patterns for information extraction tasks: a survey. In: Proceedings of the AAAI 99 workshop on machine learning for information extraction, Orlando, July 1999, pp 1–6

    Neustein A et al (2014) Application of text mining to biomedical knowledge extraction: analyzing clinical narratives and medical literature. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, p 50

    Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Multi-source, multilingual information extraction and summarization. Springer, Berlin, pp 23–49

    Ratnaparkhi A (1999) Learning to parse natural language with maximum entropy models. Mach Learn 34(1–3):151–175

    Santos D (1992) Natural language and knowledge representation. In: Proceedings of the ERCIM workshop on theoretical and experimental aspects of knowledge representation, pp 195–197

    Sarawagi S (2008) Information extraction. Found Trends Database 1(3):261–377

    Soderland S (1999) Learning information extraction rules for semi-structured and free text. Mach Learn 34(1–3):233–272

    Sowa JF (2000) Knowledge representation: logical, philosophical, and computational foundations. Brooks Cole, Pacific Grove

    Suchanek F, Kasneci G, Weikum G (2007) Yago: a core of semantic knowledge. In: Proceedings of the 16th international conference on World Wide Web. ACM Press, New York, p 697

    Teixeira A, Ferreira L, Rodrigues M (2014) Online health information semantic search and exploration: reporting on two prototypes for performing extraction on both a hospital intranet and the world wide web. In: Neustein A (ed) Text mining of web-based medical content. De Gruyter, Berlin, p 50

    Viola P, Narasimhan M (2005) Learning to extract information from semi-structured text using a discriminative context free grammar. In: Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval, pp 330–337

    Wimalasuriya DC, Dou D (2010) Ontology-based information extraction: an introduction and a survey of current approaches. J Inf Sci 36(3):306–323


    Chapter 2 Data Gathering, Preparation and Enrichment

    Abstract This chapter presents the domain-independent part of the general architecture of Information Extraction (IE) systems. This first part aims at preparing documents through the application of several Natural Language Processing tasks that enrich the documents with morphological and syntactic information. This is done in successive processing steps, which start by making contents uniform and end by identifying the roles of the words and how they are arranged.

    The most common steps are described here: sentence boundary detection, tokenization, part-of-speech tagging, and syntactic parsing. The description includes information on a selection of relevant tools available to implement each step.

    The chapter ends with the presentation of three very representative software suites that ease the integration of the several steps described.

    Keywords Information extraction · Tokenization · Sentence splitting · Morphological analysis · Part-of-speech · POS · Syntactic parsing · Tools

    2.1 Process Overview

    The IE process usually starts by identifying and associating morphosyntactic features to natural language contents that would otherwise be quite indistinguishable character strings. The process is composed of successive NLP steps, starting by making contents uniform and ending with the identification of the roles of the words and how they are arranged. The first steps are usually tokenization and sentence boundary detection. Their purpose is to break contents into sentences and define the limits of each token: word, punctuation mark, or other character clusters such as currencies. Afterwards, all processing is usually conducted in a per-sentence fashion and tokens are considered atomic. Then, morphological analysis makes tokens uniform by determining word lemmata (see "win" and "won" in Fig. 2.1), and part-of-speech tagging assigns a part of speech to each token, visible after the slashes. The final step is usually syntactic parsing, which can be done using significantly different formalisms. These NLP steps prepare the textual contents for the subsequent identification and extraction of relevant information.
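    To make the data flow concrete, the steps above can be sketched as a chain of functions where each stage consumes the previous stage's output. The function bodies below are hypothetical stubs, not real annotators (actual tools are presented in the following sections); only the shape of the pipeline is the point:

    ```python
    def split_sentences(text):
        # sentence boundary detection (stub: naive split on periods)
        return [s.strip() + "." for s in text.split(".") if s.strip()]

    def tokenize(sentence):
        # tokenization (stub: detach the final period, split on spaces)
        return sentence.replace(".", " .").split()

    def tag(tokens):
        # morphological analysis + POS tagging (stub: capitalized -> NNP)
        return [(tok, "NNP" if tok[:1].isupper() else "X") for tok in tokens]

    def pipeline(text):
        # per-sentence processing: each step's output feeds the next step
        return [tag(tokenize(sent)) for sent in split_sentences(text)]

    print(pipeline("John won. Maria won."))
    ```

    A real system would replace each stub with one of the tools discussed below while keeping this same per-sentence data flow.
    
    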


    Figure 2.1 depicts the successive processing steps mentioned and their effect on data. The processing steps are on the left-hand side and their effect on data is visible on the right-hand side. The output of one step is the input of the next one. The effects shown are representative: they provide a real example of what can be done but are not the only possible formalism or solution. The syntactic parsing result in Fig. 2.1 corresponds to dependency parsing and is depicted as a graph for simplicity.

    Fig. 2.1 Representative example of the NLP steps for morphosyntactic data generation relative to plain text natural language sentences. Input text:

    John Bardeen is the only laureate to win the Nobel Prize in Physics twice – in 1956 and 1972. Maria Curie also won two Nobel Prizes, for physics in 1903 and chemistry in 1911.

    After sentence boundary detection + tokenization:

    [John] [Bardeen] [is] [the] [only] [laureate] [to] [win] [the] [Nobel] [Prize] [in] [Physics] [twice] [–] [in] [1956] [and] [1972] [.]
    [Maria] [Curie] [also] [won] [two] [Nobel] [Prizes] [,] [for] [physics] [in] [1903] [and] [chemistry] [in] [1911] [.]

    After morphological analysis + part-of-speech tagging:

    [John/NNP] [Bardeen/NNP] [be/VBZ] [the/DT] [only/JJ] [laureate/NN] [to/TO] [win/VB] [the/DT] [Nobel/NNP] [Prize/NNP] [in/IN] [Physics/NNP] [twice/RB] [–/:] [in/IN] [1956/CD] [and/CC] [1972/CD] [./.]
    [Maria/NNP] [Curie/NNP] [also/RB] [win/VBD] [two/CD] [Nobel/NNP] [Prizes/NNS] [,/,] [for/IN] [physics/NN] [in/IN] [1903/CD] [and/CC] [chemistry/NN] [in/IN] [1911/CD] [./.]

    The final step, (dependency) syntactic parsing, links each word to its headword and is depicted in the figure as a graph.


    In the following sections, each of these major steps is described and representative tools are briefly presented. A bias towards languages written with alphabets is assumed but, whenever possible, some information is provided on other languages, such as Arabic and Chinese. Representative tools, in general used later in the book, are given some additional attention. They are described in some detail, and relevant information, such as how to obtain each tool and the languages supported out of the box, is presented in tabular form at the end of each section.

    2.2 Tokenization and Sentence Boundary Detection

    Document processing usually starts by separating document texts into their atomic units. Breaking a stream of text into tokens (words, numbers, and symbols) is known as tokenization (McNamee and Mayfield 2004). It is a quite straightforward process for languages that use spaces between words, such as most languages using the Latin alphabet. Tokenizers often rely on simple heuristics: (1) all contiguous strings of alphabetic characters are part of one token, and the same applies to numbers; (2) tokens are separated by whitespace characters (space and line break) or by punctuation characters that are not included in abbreviations. For languages that do not use whitespace between tokens, such as Chinese, this process can be particularly challenging (Chang and Manning 2014; Huang et al. 2007).
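    A minimal sketch of heuristics (1) and (2) for Latin-alphabet text, using a single regular expression; real tokenizers also handle abbreviations, clitics, and multi-character symbols:

    ```python
    import re

    # Runs of letters form one token, runs of digits form one token,
    # and any other non-whitespace character becomes a token of its own.
    TOKEN_RE = re.compile(r"[A-Za-z]+|\d+|[^\w\s]")

    def tokenize(text):
        return TOKEN_RE.findall(text)

    print(tokenize("Maria Curie won two Nobel Prizes, in 1903 and 1911."))
    # ['Maria', 'Curie', 'won', 'two', 'Nobel', 'Prizes', ',', 'in',
    #  '1903', 'and', '1911', '.']
    ```
    
    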

    Sentence boundary detection, as its name suggests, addresses the problem of finding sentence boundaries. The concept of sentence is central in several natural language processing tasks, since sentences are standard textual units that confine a variety of linguistic phenomena such as collocations and variable binding. However, finding these boundaries is not a trivial task, since end-of-sentence punctuation marks are ambiguous in many languages. The period is often used as a sentence boundary marker, but it also appears in ordinal numbers, initials, and abbreviations, including abbreviations at the end of sentences. Like the period, other punctuation marks such as exclamation points and question marks can mark the end of sentences but can also occur within quotations or parentheses in the middle of sentences (Kiss and Strunk 2006; Palmer and Hearst 1997; Reynar and Ratnaparkhi 1997).

    2.2.1 Tools

    Tools for tokenizing texts are found in software suites such as Freeling (Padró and Stanilovsky 2012), NLTK (Bird et al. 2009), OpenNLP (Apache 2014), or StanfordNLP (Manning et al. 2014). There are no specialized tools exclusively dedicated to this problem, since tokenization can be done reasonably well using regular expressions (regex) when processing languages that use the Latin alphabet. For languages not using the Latin alphabet there are fewer tools. The tokenizer Stanford Word Segmenter¹ has models able to handle Arabic and Chinese (Chang et al. 2008; Monroe et al. 2014).

    Several systems addressing the sentence boundary detection problem have been proposed, with good results. Here we focus on two proposals that achieved good results when tested with distinct natural languages: Punkt (Kiss and Strunk 2006) and iSentenizer (Wong et al. 2014).

    2.2.2 Representative Tools: Punkt and iSentenizer

    Punkt is included in the Natural Language Toolkit (NLTK), a software suite in Python that provides tools for handling natural languages (see Sect. 2.5.2). The Punkt implementation follows the tokenizer interface defined by NLTK in order to be seamlessly integrated programmatically into an NLP pipeline. It is provided with source code and, alongside the execution method, the software also includes methods for training new sentence boundary detection models from corpora (see tested languages in Table 2.1).

    The Punkt approach is based on unsupervised machine learning. The method assumes that most end-of-sentence ambiguities can be solved if abbreviations are identified, as the remaining periods would mark ends of sentences (Kiss and Strunk 2006). It operates in two steps. The first step detects abbreviations by assuming that they are collocations of a truncated word and a final period, that they are short, and that they often contain internal periods. These assumptions are used to estimate the likelihood of a given period being part of an abbreviation. The second step evaluates whether the decisions of the first step should be corrected. The evaluation is based on the word immediately to the right of the period: it is checked whether that word is a frequent sentence starter, whether it is capitalized, and whether the two tokens surrounding the period form a frequent collocation. Periods are considered sentence boundary markers if they are not part of abbreviations.
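    The two-step idea can be illustrated with a toy splitter. This is not Punkt itself: the hard-coded abbreviation list and the capitalization check below merely stand in for the statistics that Punkt learns without supervision:

    ```python
    import re

    # Hypothetical abbreviation list; Punkt learns this from the corpus.
    ABBREVIATIONS = {"dr", "mr", "mrs", "prof", "etc"}

    def split_sentences(text):
        sentences, start = [], 0
        for match in re.finditer(r"[.!?]", text):
            end = match.end()
            words_before = text[:match.start()].split()
            before = words_before[-1].lower().rstrip(".") if words_before else ""
            after = text[end:].lstrip()
            if match.group() == "." and before in ABBREVIATIONS:
                continue                      # step 1: likely an abbreviation
            if after and not after[0].isupper():
                continue                      # step 2: unlikely sentence starter
            sentences.append(text[start:end].strip())
            start = end
        if text[start:].strip():
            sentences.append(text[start:].strip())
        return sentences

    print(split_sentences("Dr. Smith arrived. He spoke."))
    # ['Dr. Smith arrived.', 'He spoke.']
    ```
    
    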

    iSentenizer is provided with a Visual C++ application programming interface (API) and a standalone tool featuring a graphical user interface (GUI). Having these two interfaces makes the tool easier to use: the GUI can be used to easily and conveniently construct and verify a sentence boundary detection system for a specific language, and the API allows later integration of the constructed model into larger software systems using Visual C++.

    1 http://nlp.stanford.edu/software/segmenter.shtml

    Table 2.1 Main features of Punkt
    Name: Punkt
    Task: Sentence boundary detection
    URL: http://www.nltk.org/_modules/nltk/tokenize/punkt.html
    Languages tested: Dutch, English, Estonian, French, German, Italian, Norwegian, Portuguese, Spanish, Swedish, and Turkish
    Performance: F1 above 0.95 for most of the 11 tested languages


    iSentenizer is based on an algorithm, named i+Learning, that constructs a decision tree in two steps (Wong et al. 2014). The first step constructs a decision tree in a top-down approach based on the training corpus. The second step increments the tree whenever a new instance or attribute is detected, revising the tree model by incorporating new knowledge instead of retraining it from scratch. The features used in tree construction are the words immediately preceding and following the potential boundary punctuation marks: period, exclamation mark, colon, semicolon, question mark, quotation marks, brackets, and dash. More punctuation marks than the usual sentence boundaries (period, exclamation mark, and question mark) are included because those marks may also denote a sentence boundary depending on the text genre. Features are encoded in a way that is independent of corpus and alphabet to maximize the adaptability of the system to different languages and text genres (see tested languages in Table 2.2).

    2.3 Morphological Analysis and Part-of-Speech Tagging

    Having texts separated into tokens, the next step is usually morphosyntactic analysis, in order to identify characteristics such as word lemma and part of speech (Marantz 1997). It is important to distinguish two concepts: lexeme and word form. The difference is well illustrated with two examples: (1) the words "book" and "books" refer to the same concept and thus have the same lexeme but different word forms; (2) the words "book" and "bookshelf" have different word forms and different lexemes, as they refer to two different concepts (Marantz 1997). The form chosen to conventionally represent the canonical form of a lexeme is called its lemma. Finding word lemmata brings the advantage of having a single form for all words that have similar meanings. For example, the words "connect", "connected", "connecting", "connection", and "connections" roughly refer to the same concept and have the same lemma. This process also reduces the total number of terms to handle, which is advantageous from a computer processing point of view, as it reduces the size and complexity of data in the system (Porter 1980). The complexity of the task depends on the target natural language. For languages with simple inflectional morphology, such as English, the task is more straightforward than for languages with more complex inflectional morphology, such as German (Appelt 1999).

    Table 2.2 Main features of iSentenizer
    Name: iSentenizer
    Task: Sentence boundary detection
    URL: http://nlp2ct.cis.umac.mo/views/utility.html
    Languages tested: Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish, Swedish
    Performance: Detects sentence boundaries of a mixture of different text genres and languages with high accuracy; F1 above 0.95 for most of the 11 tested languages using the Europarl corpus


    The process of determining the word lemma is called lemmatization. Another method, called word stemming, is common due to its simplicity. Word stemming reduces words to a base form by removing suffixes. The remaining form is not necessarily a valid root, but it is usually sufficient that related words map to the same stem, or to a reduced set of stems if the words are irregular. For example, the words "mice" and "mouse" have the lemma "mouse", but some stemmers produce "mic" and "mous", respectively (Hotho et al. 2005).
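    A naive suffix-stripping stemmer makes the contrast with lemmatization visible; the suffix list below is an illustrative fragment, far simpler than a real stemmer such as Porter's:

    ```python
    # Strip the first matching suffix, keeping a stem of at least 3 letters.
    SUFFIXES = ["ions", "ion", "ing", "ed", "s"]

    def stem(word):
        for suffix in SUFFIXES:
            if word.endswith(suffix) and len(word) - len(suffix) >= 3:
                return word[: -len(suffix)]
        return word

    # Every inflected form of "connect" maps to the same stem.
    print({w: stem(w) for w in ["connect", "connected", "connecting",
                                "connection", "connections"]})
    ```

    Irregular words such as "mice"/"mouse" are left untouched by this sketch and keep distinct stems, which is precisely why lemmatization is preferred when a single canonical form matters.
    
    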

    Other important features for characterizing a word are its morphosyntactic category, or part of speech (POS), such as noun, adjective, verb, preposition, etc., alongside other properties that depend on the POS. For example, verbs have features such as tense and person that are not applicable to nouns (Piskorski and Yangarber 2013). Finding the part of speech is known as POS tagging, and the systems developed for this task usually include algorithms for word lemmatization or stemming before determining the POS tag.

    POS tagging has two main challenges. One challenge is dealing with part-of-speech ambiguity, as words can often have distinct parts of speech depending on their context in sentences. The other challenge is the assignment of POS tags to words about which the system has no knowledge (Aluísio et al. 2003). To solve both problems, the context around the target word within a sentence is typically taken into account, and the most probable tag is selected using information provided by the word and its context (Güngör 2010). POS tag information is commonly taken into consideration in syntactic parsing, a subsequent processing stage at the sentence level. POS information is relevant in syntactic parsing since morphosyntactic categories group words that occur with the same syntactic distribution (Brants 1995). This implies that replacing a token by another one with the same category does not affect the sentence's grammaticality. Considering the next example, it is possible to form 24 (2 × 4 × 3) sentences by picking one word from each of the three groups between brackets. More sentences are possible if more words are added to the groups.

    [the | a] [fast | slow | red | pretty] [car | bicycle | plane] passed by.
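    The substitution claim is easy to verify by enumerating the combinations:

    ```python
    from itertools import product

    # The three word groups from the example above; any choice of one word
    # per group yields a grammatical sentence.
    determiners = ["the", "a"]
    adjectives = ["fast", "slow", "red", "pretty"]
    nouns = ["car", "bicycle", "plane"]

    sentences = [f"{d} {a} {n} passed by."
                 for d, a, n in product(determiners, adjectives, nouns)]
    print(len(sentences))  # 24
    ```
    
    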

    POS tagging is a step common to most natural language processing (NLP) tasks and an extensively researched subject. As a result, it is often considered a solved task, with baseline precision around 90 % and state-of-the-art systems achieving values around 97 %. However, these values are disputed, as precision is measured on uniform text genres and on a per-word basis. If results are measured in terms of full sentences, i.e. considering the proportion of sentences without a single tagging error, the precision values drop to around 55–57 % (Giesbrecht and Evert 2009; Manning 2011).
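    The drop is roughly what compounding per-token errors would predict. Assuming (unrealistically) independent errors and an average sentence length of 20 tokens, both figures being assumptions for this back-of-the-envelope check:

    ```python
    per_token_accuracy = 0.97
    tokens_per_sentence = 20                     # assumed average length
    sentence_accuracy = per_token_accuracy ** tokens_per_sentence
    print(round(sentence_accuracy, 2))           # 0.54, near the reported 55-57 %
    ```
    
    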

    2.3.1 Tools

    Several approaches have been proposed over the years. Available implementations are commonly developed for English and trained and evaluated using Penn Treebank data. Nevertheless, most have the potential to be used for tagging other languages. Here we privileged implementations that have proven good results with several natural languages, are provided with methods to train a tagger model for other languages (given POS-annotated training text for that language), and are not part of larger software suites. The only exception is the Stanford POS tagger, from the StanfordNLP suite, because it is provided with tagger models for six different languages, making it very relevant even if the rest of the suite is not used.

    2.3.2 Representative Tools: Stanford POS Tagger, SVMTool, and TreeTagger

    Three tools were selected as they represent, respectively, POS tagging implementations using models based on maximum entropy, support vector machines (SVM), and Markov models. All tools include the POS tagger and methods to create new tagger models given training data.

    The Stanford POS Tagger includes components for command-line invocation, for running as a server, and for integration into software projects through a Java API. The full download version contains tagger models for six different languages (see the language list in Table 2.3). It is based on a bidirectional maximum entropy model that decides the POS tag of a token taking into consideration the preceding and following tags, and broad lexical features such as joint conditioning of multiple consecutive words. The tagger achieved a precision value above 0.97 on the Penn Treebank Wall Street Journal (WSJ) corpus (Toutanova et al. 2003).

    SVMTool supports standard input and output pipelining, which eases its integration into larger systems. It is also provided with a C++ API to support embedded usage. The algorithm is based on support vector machine classifiers and uses a rich set of features, including word and POS bigrams and trigrams, and surface patterns such as prefixes, suffixes, letter capitalization, word length, and sentence punctuation. Tagging decisions can be made using a reduced context or at the sentence level. The tagger achieved accuracy above 0.97 on the English Wall Street Journal corpus and above 0.98 on the Spanish LEXEP corpus (Giménez and Màrquez 2004). Table 2.4 presents the highlights of SVMTool.

    Table 2.3 Main features of Stanford POS tagger
    Name: Stanford POS tagger
    Task: Part of speech tagging
    URL: http://nlp.stanford.edu/software/tagger.shtml
    Languages tested: Arabic, Chinese, English, French, German, and Spanish
    Performance: Accuracy of 0.9724 for English

    TreeTagger can be run from the command line or using a GUI, and is provided as a binary package for Intel Macs, Linux, or Windows operating systems. The project website includes ready-to-use models for 16 languages (see the language list in Table 2.5). The TreeTagger algorithm is based on n-gram Markov models whose transition probabilities are estimated using a binary decision tree. Compared to other algorithms using Markov models, this technique needs less data to obtain reliable transition probabilities, as binary decision trees have relatively few parameters to estimate. This feature mitigates the sparse data problem (Schmid 1994).

    2.4 Syntactic Parsing

    Syntactic parsing is usually a computationally intensive task and is not as often used in IE systems as tokenization, sentence boundary detection, or POS tagging. When information sources are (semi-)structured or machine generated, or when the output is coarse grained, less computationally intensive methods such as locating textual patterns can provide similar results (Feldman and Sanger 2007; Huffman 1996).

    The goal of syntactic parsing is to analyze sentences in order to produce structures representing how words are arranged in them (Langacker 1997). Structures are produced with respect to a given formal grammar, and over the years different formalisms have been proposed, reflecting both linguistic and computational concerns. In a broad sense, grammars can follow two structural formalisms: constituency and dependency (Jurafsky and Martin 2008; Nugues 2006).

    A constituent is a unit within a hierarchical structure that is composed of a word or a group of words. Although, in a strict formal sense, constituent structures can be observed in dependency grammars, constituency is usually associated with phrase structure grammars, as these are based only on the constituency relation. Phrase structure grammars are composed of sets of syntactic rules that fractionate a phrase into sub-phrases and hence describe a sentence's composition in terms of phrase structure (Chomsky 2002). Figure 2.2 presents a possible parse of the sentence "This book has two authors." using a phrase structure grammar.
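    The parse of Fig. 2.2 can be encoded as nested tuples, a common lightweight representation of constituency trees (the exact attachment of the final period is one of several possibilities):

    ```python
    # Each node is (label, children...); a pre-terminal is (POS, word).
    tree = ("S",
            ("NounPhrase", ("DT", "This"), ("NN", "book")),
            ("VerbPhrase", ("AUX", "has"),
             ("NounPhrase", ("CD", "two"), ("NNS", "authors")),
             (".", ".")))

    def leaves(node):
        """Collect the terminal words, left to right."""
        if isinstance(node[1], str):          # pre-terminal: (POS, word)
            return [node[1]]
        return [w for child in node[1:] for w in leaves(child)]

    print(" ".join(leaves(tree)))  # This book has two authors .
    ```
    
    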

    Table 2.4 Main features of SVMTool
    Name: SVMTool
    Task: Part of speech tagging
    URL: http://www.lsi.upc.edu/~nlp/SVMTool/
    Languages tested: Catalan, English, and Spanish
    Performance: Accuracy of 0.9739 for English and 0.9808 for Spanish

    Table 2.5 Main features of TreeTagger
    Name: TreeTagger
    Task: Part of speech tagging
    URL: http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
    Languages tested: Bulgarian, Dutch, English, Estonian, Finnish, French, Galician, German, Italian, Portuguese, Mongolian, Polish, Russian, Slovak, Spanish, and Swahili
    Performance: Accuracy above 0.95 for most languages

    Dependency grammars describe sentence structures in terms of links between words. Each link reflects a relation of dominance/dependence between a headword and a dependent word. The original work of Tesnière (1959) later received formal mathematical definitions, thus becoming suitable for automatic processing. As a result, sentence dependencies form graphs that have a single head and usually have three properties: acyclicity, connectivity, and projectivity (Nivre 2005). Dependency grammars often prove more efficient for parsing texts. Figure 2.3 presents a possible parse of the same example sentence using a dependency grammar.
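    A dependency parse such as the one in Fig. 2.3 is compactly stored as a head vector, and the single-head and acyclicity properties are then easy to check:

    ```python
    # heads[i] gives the index of the word that word i depends on; -1 is root.
    words = ["This", "book", "has", "two", "authors", "."]
    heads = [1, 2, -1, 4, 2, 2]
    labels = ["det", "nsubj", "root", "num", "dobj", "punct"]

    for w, h, lab in zip(words, heads, labels):
        print(f"{lab}({'ROOT' if h == -1 else words[h]}, {w})")

    def is_single_rooted_and_acyclic(heads):
        if heads.count(-1) != 1:
            return False
        for i in range(len(heads)):
            seen, j = set(), i            # follow head links upward
            while j != -1:
                if j in seen:
                    return False          # cycle detected
                seen.add(j)
                j = heads[j]
        return True

    print(is_single_rooted_and_acyclic(heads))  # True
    ```
    
    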

    Nugues (2006) provides a comprehensive discussion of the syntax theories and parsing techniques that have been proposed over the years. Here the focus is on tools that have proven adaptable to different languages without the need to rewrite grammars, which is a difficult task and requires some expertise in language models. The first two parsers presented, Epic and StanfordParser, use phrase structure grammars; the other two, MaltParser and TurboParser, use dependency grammars.

    Fig. 2.2 Possible constituency grammar tree for the sentence "This book has two authors.": S is composed of a NounPhrase (DT "This", NN "book") and a VerbPhrase (AUX "has" and a NounPhrase with CD "two" and NNS "authors"), followed by the final punctuation.

    Fig. 2.3 Possible dependency grammar graph for the sentence "This book has two authors.": the root is the verb "has", which dominates "book" (nsubj), "authors" (dobj), and the final period (punct); "This" depends on "book" (det) and "two" on "authors" (num).

    2.4.1 Representative Tools: Epic, StanfordParser, MaltParser, TurboParser

    Epic is a probabilistic context-free grammar (PCFG) parser that can be used from the command line or programmatically through a Scala API. Its algorithm uses surface patterns to reduce the propagation of information through the grammar structure, thus avoiding having too many features in the grammar structure. Having a simpler structural backbone improves the adaptation to new languages (Hall et al. 2014). The Epic parser provides ready-to-use parser models for eight languages and was tested with three more languages, achieving accuracy results over 0.78 (see Table 2.6).

    StanfordParser is also a PCFG parser, provided with a command-line interface as well as a Java API for programmatic usage. It uses an unlexicalized grammar at its core. An unlexicalized PCFG is a grammar that relies on word categories, such as POS categories, that can be more or less broad, and does not systematically specify rules down to the lexical level. However, some categories can represent a single word. This brings the advantage of producing compact and robust grammar representations, as there is no need for large structures to store lexicalized probabilities (Klein and Manning 2003). StanfordParser is provided with models for five languages and has also been used with Bulgarian, Italian, and Portuguese (see Table 2.7).

    MaltParser is provided as a JAR package for command-line usage, and with Java source code for integration into larger software projects. MaltParser is a data-driven dependency parsing system able to induce parsing models from treebank data. The parsing model builds dependency graphs in one left-to-right pass over the input, using a stack to store partially processed tokens and a history-based feature model to predict the next parser action (Hall et al. 2010; Nivre et al. 2007). There are ready-to-use parsing models for 4 languages; the parser was tested with 14 other languages, and results showed an accuracy of around 0.75 or more (see Table 2.8).
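    The left-to-right, stack-based strategy can be illustrated with an arc-standard oracle parse. This is a sketch of the general transition-based idea, not MaltParser's exact algorithm: gold heads stand in for the learned action predictor, and only projective trees are handled:

    ```python
    # SHIFT moves the next input token onto the stack; LEFT-ARC and RIGHT-ARC
    # attach the top two stack items. head[i] is the gold head of word i
    # (-1 marks the root); the tree is assumed projective.
    def oracle_parse(n_words, head):
        stack, buffer, arcs = [], list(range(n_words)), set()
        n_deps = [sum(1 for h in head if h == i) for i in range(n_words)]
        attached = [0] * n_words          # dependents collected so far
        while buffer or len(stack) > 1:
            if len(stack) >= 2:
                s0, s1 = stack[-1], stack[-2]
                if head[s1] == s0:                                  # LEFT-ARC
                    arcs.add((s0, s1)); attached[s0] += 1
                    stack.pop(-2); continue
                if head[s0] == s1 and attached[s0] == n_deps[s0]:   # RIGHT-ARC
                    arcs.add((s1, s0)); attached[s1] += 1
                    stack.pop(); continue
            stack.append(buffer.pop(0))                             # SHIFT
        return arcs

    # "This book has two authors ." with the heads of Fig. 2.3
    print(oracle_parse(6, [1, 2, -1, 4, 2, 2]))
    ```

    In a real data-driven parser, the oracle consultations of `head` are replaced by a classifier trained on treebank data.
    
    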

    TurboParser is provided as C++ source code ready to be compiled on systems complying with the Portable Operating System Interface (POSIX), and also on Windows. The approach followed formulates the problem of non-projective dependency parsing as an integer linear programming optimization problem of polynomial size. The model supports expert knowledge in the form of constraints, and training data is used to automatically learn soft constraints. Having a model requiring a polynomial number of constraints as a function of the sentence length, instead of

    Table 2.6 Main features of Epic parser

    Name: Epic
    Task: Syntactic parsing (phrase structure grammar)
    URL: http://www.scalanlp.org/
    Languages tested: ready-to-use models for Basque, English, French, German, Hungarian, Korean, Polish, and Swedish; other languages tested with accuracy over 0.78: Arabic, Basque, and Hebrew

    Table 2.7 Main features of StanfordParser

    Name: StanfordParser
    Task: Syntactic parsing (phrase structure grammar)
    URL: http://nlp.stanford.edu/software/lex-parser.shtml
    Languages tested: ready-to-use models for Arabic, Chinese, English, French, and German; other languages tested with accuracy over 0.75: Bulgarian, Italian, and Portuguese

    2 Data Gathering, Preparation and Enrichment


    the exponential constraints of previous linear programming approaches, eliminates the need for incremental procedures and has a positive impact on accuracy and processing speed (Martins et al. 2009). The parser is provided with models for five languages and was tested with six more languages (see Table 2.9).

    2.5 Representative Software Suites

    NLP software suites make it easier to integrate all tasks in a processing pipeline. They combine several tools using a coherent data representation, designed so that the output of one step can be used directly as the input of the following one. The list of available suites includes Apache OpenNLP, Freeling, GATE, LingPipe, Natural Language Toolkit (NLTK), and StanfordNLP, among others. Three are described here: Stanford NLP, as it is used in a tutorial example in Chap. 5; NLTK, as it is very well documented and uses a programming language distinct from StanfordNLP's; and GATE, for historical reasons, as it was one of the first mature suites available.
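The pipeline idea, in which each stage consumes the previous stage's output, can be sketched with plain Python functions standing in for suite components. The tokenizer and tagger below are deliberately naive placeholders, not any suite's real implementation:

```python
# A minimal pipeline: each stage consumes the previous stage's output.
# Both stages are naive stand-ins for real suite components.
def tokenize(text):
    # Split on whitespace, detaching a final period as its own token
    return text.replace(".", " .").split()

def tag(tokens):
    # Toy tagger: capitalized words -> NNP, "." -> punctuation, rest -> X
    def tag_one(tok):
        if tok == ".":
            return "."
        return "NNP" if tok[0].isupper() else "X"
    return [(tok, tag_one(tok)) for tok in tokens]

def pipeline(text, stages):
    data = text
    for stage in stages:  # output of one stage feeds the next
        data = stage(data)
    return data

result = pipeline("John Bardeen won twice .", [tokenize, tag])
print(result)
```

Real suites make this chaining work by agreeing on shared data structures (annotated documents, token lists with attributes), so any conforming component can be swapped into the pipeline.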

    2.5.1 Stanford NLP

    Stanford NLP (Manning et al. 2014) is a machine learning based toolkit for processing natural language text. It includes software for several NLP tasks, such as tokenization, sentence segmentation, part-of-speech tagging, named entity recognition, parsing, coreference resolution, and relation extraction, that can be incorporated into applications with human language technology needs.

    Table 2.8 Main features of MaltParser

    Name: MaltParser
    Task: Syntactic parsing (dependency grammar)
    URL: http://www.maltparser.org/
    Languages tested: ready-to-use models for English, French, Spanish, and Swedish; other languages tested with accuracy around 0.75 or above: Arabic, Basque, Catalan, Chinese, Czech, Danish, Dutch, German, Greek, Hungarian, Italian, Japanese, Portuguese, and Turkish

    Table 2.9 Main features of TurboParser

    Name: TurboParser
    Task: Syntactic parsing (dependency grammar)
    URL: http://www.ark.cs.cmu.edu/TurboParser/
    Languages tested: ready-to-use models for Arabic, English, Farsi, Kinyarwanda, and Malagasy; other languages tested with accuracy above 0.75: Danish, Dutch, Portuguese, Slovene, Swedish, and Turkish


    The suite is developed in the Java programming language, although it is possible to find bindings or translations for other programming languages such as .NET languages, Perl, Python, and Ruby. All tools include methods for training new models from corpora.

    2.5.2 Natural Language Toolkit (NLTK)

    NLTK (Bird et al. 2009) supports a wide range of text processing libraries, including text classification, tokenization, stemming, tagging, chunking, parsing, and semantic reasoning. It also provides intuitive interfaces to more than 50 corpora and lexical resources, including WordNet. It is well documented with tutorials, animated algorithms, and problem sets, and is thoroughly discussed in a comprehensive book by Bird et al. (2009). The suite is developed in the Python programming language, and an active community also creates Python wrappers for state-of-the-art tools, respecting the NLTK interfaces. For instance, there is a Python wrapper to use MaltParser in NLTK.

    2.5.3 GATE

    GATE (Cunningham et al. 2011) is a development environment for the creation of software components designed to process natural languages. More than providing the end algorithm, it provides specialized data structures and a set of intuitive tools to assist the development of the algorithm. The tools include document annotation mechanisms, a collocation viewer, finite state machines, support vector machines, and text extractors for documents in PDF, RTF, and XML. GATE is over 15 years old and is in active use.

    References

    Aluísio S, Pelizzoni J, Marchi AR, de Oliveira L, Manenti R, Marquiafável V (2003) An account of the challenge of tagging a reference corpus for Brazilian Portuguese. In: Computational processing of the Portuguese language. Springer, Berlin, pp 110–117

    Apache OpenNLP Development Community (2014) Apache OpenNLP developer documentation. www.openlp.apache.org

    Appelt DE (1999) Introduction to information extraction. Artif Intell Commun 12:161–172

    Bird S, Klein E, Loper E (2009) Natural language processing with Python. O'Reilly, Sebastopol

    Brants T (1995) Tagset reduction without information loss. In: Proceedings of the 33rd annual meeting on Association for Computational Linguistics. pp 287–289

    Chang AX, Manning CD (2014) TokensRegex: defining cascaded regular expressions over tokens. Technical report CSTR 2014-02. Department of Computer Science, Stanford University, Stanford

    Chang P, Galley M, Manning CD (2008) Optimizing Chinese word segmentation for machine translation performance. In: Proceedings of the third workshop on statistical machine translation. pp 224–232

    Chomsky N (2002) Syntactic structures. Walter de Gruyter, New York

    Cunningham H, Maynard D, Bontcheva K (2011) Text processing with GATE. Gateway Press, Murphys, CA

    Feldman R, Sanger J (2007) The text mining handbook: advanced approaches in analyzing unstructured data. Cambridge University Press, Cambridge

    Giesbrecht E, Evert S (2009) Is part-of-speech tagging a solved task? An evaluation of POS taggers for the German Web as Corpus. In: Proceedings of the fifth Web as Corpus workshop. pp 27–35

    Giménez J, Màrquez L (2004) SVMTool: a general POS tagger generator based on support vector machines. In: Proceedings of the 4th international conference on Language Resources and Evaluation (LREC'04). Lisbon

    Güngör T (2010) Part-of-speech tagging. In: Indurkhya N, Damerau FJ (eds) Handbook of natural language processing, 2nd edn. CRC/Taylor and Francis Group, Boca Raton

    Hall J, Nilsson J, Nivre J (2010) Single malt or blended? A study in multilingual parser optimization. In: Trends in parsing technology. Springer, Berlin, pp 19–33

    Hall D, Durrett G, Klein D (2014) Less grammar, more features. In: Proceedings of ACL. Baltimore, pp 228–237

    Hotho A, Nürnberger A, Paaß G (2005) A brief survey of text mining. LDV Forum 20:19–62

    Huang C-R, Šimon P, Hsieh S-K, Prévot L (2007) Rethinking Chinese word segmentation: tokenization, character classification, or wordbreak identification. In: Proceedings of the 45th annual meeting of the ACL on interactive poster and demonstration sessions. pp 69–72

    Huffman SB (1996) Learning information extraction patterns from examples. In: Wermter S, Riloff E, Scheler G (eds) Connectionist, statistical and symbolic approaches to learning for natural language processing. Springer, Berlin, pp 246–260

    Jurafsky D, Martin JH (2008) Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition, 2nd edn. Prentice Hall, New York

    Kiss T, Strunk J (2006) Unsupervised multilingual sentence boundary detection. Comput Linguist 32:485–525

    Klein D, Manning CD (2003) Accurate unlexicalized parsing. In: Proceedings of the 41st annual meeting on Association for Computational Linguistics, vol 1. pp 423–430

    Langacker RW (1997) Constituency, dependency, and conceptual grouping. Cogn Linguist 8:1–32

    Manning CD (2011) Part-of-speech tagging from 97% to 100%: is it time for some linguistics? In: Gelbukh A (ed) Computational linguistics and intelligent text processing, 12th international conference CICLing. Lecture notes in computer science. Springer, Berlin, pp 171–189

    Manning CD, Surdeanu M, Bauer J, Finkel J, Bethard SJ, McClosky D (2014) The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the Association for Computational Linguistics: system demonstrations. pp 55–60

    Marantz A (1997) No escape from syntax: don't try morphological analysis in the privacy of your own lexicon. University of Pennsylvania working papers in linguistics 4, p 14

    Martins AFT, Smith NA, Xing EP (2009) Concise integer linear programming formulations for dependency parsing. In: Proceedings of the joint conference of the 47th annual meeting of the ACL and the 4th international joint conference on natural language processing of the AFNLP, vol 1. pp 342–350

    McNamee P, Mayfield J (2004) Character n-gram tokenization for European language text retrieval. Inf Retr 7:73–97

    Monroe W, Green S, Manning CD (2014) Word segmentation of informal Arabic with domain adaptation. In: Proceedings of the 52nd annual meeting of the Association for Computational Linguistics, vol 2 (short papers). ACL, Baltimore, pp 206–211

    Nivre J (2005) Dependency grammar and dependency parsing. MSI report 5133. pp 1–32

    Nivre J, Hall J, Nilsson J, Chanev A, Eryigit G, Kübler S, Marinov S, Marsi E (2007) MaltParser: a language-independent system for data-driven dependency parsing. Nat Lang Eng 13:95–135

    Nugues PM (2006) Syntactic formalisms. In: Nugues PM (ed) An introduction to language processing with Perl and Prolog. Springer, Berlin, pp 243–275

    Padró L, Stanilovsky E (2012) FreeLing 3.0: towards wider multilinguality. In: Proceedings of the Language Resources and Evaluation Conference (LREC 2012). Istanbul, pp 2473–2479

    Palmer DD, Hearst MA (1997) Adaptive multilingual sentence boundary disambiguation. Comput Linguist 23:241–267

    Piskorski J, Yangarber R (2013) Information extraction: past, present and future. In: Poibeau T, Saggion H, Piskorski J, Yangarber R (eds) Multi-source, multilingual information extraction and summarization. Springer, Berlin, pp 23–49

    Porter MF (1980) An algorithm for suffix stripping. Program Electron Libr Inf Syst 14:130–137

    Reynar JC, Ratnaparkhi A (1997) A maximum entropy approach to identifying sentence boundaries. In: Proceedings of the fifth conference on applied natural language processing, ANLC '97. ACL, Stroudsburg, pp 16–19

    Schmid H (1994) Probabilistic part-of-speech tagging using decision trees. In: Proceedings of the international conference on new methods in language processing. Manchester

    Tesnière L (1959) Éléments de syntaxe structurale. Librairie C. Klincksieck, Paris

    Toutanova K, Klein D, Manning CD, Singer Y (2003) Feature-rich part-of-speech tagging with a cyclic dependency network. In: Proceedings of the 2003 conference of the North American chapter of the Association for Computational Linguistics on human language technology, vol 1. pp 173–180

    Wong DF, Chao LS, Zeng X (2014) iSentenizer-μ: multilingual sentence boundary detection model. ScientificWorldJournal 2014. doi:10.1155/2014/196574

    © The Authors 2015
    M. Rodrigues, A. Teixeira, Advanced Applications of Natural Language Processing for Performing Information Extraction, SpringerBriefs in Electrical and Computer Engineering, DOI 10.1007/978-3-319-15563-0_3

    Chapter 3 Identifying Things, Relations, and Semantizing Data

    Abstract This chapter concludes the presentation of the generic pipelined architecture of Information Extraction (IE) systems by presenting its domain-dependent part.

    After preparation and enrichment, the documents' contents are characterized and suitable to be processed to locate and extract information. This chapter explains how this can be performed, addressing both the extraction of entities and of relations between entities.

    Identifying entities mentioned in texts is a pervasive task in IE. It is called Named Entity Recognition (NER) and seeks to locate and classify textual mentions that refer to specific types of entities such as, for example, persons, organizations, addresses, and dates.

    The chapter also dedicates attention to how to store the extracted information and how to take advantage of semantics to improve the information extraction process, presenting the basis of Ontology-Based Information Extraction (OBIE) systems.

    Keywords Information extraction · Entities · Relations · Named entity recognition · NER · Parse tree · Dependencies · Ontology-based information extraction · OBIE

    3.1 Identifying the Who, the Where, and the When

    After preparation and enrichment, the documents' contents are characterized and suitable to be processed by algorithms that will locate and extract information (Ratinov and Roth 2009). The type of information to be extracted depends on the purpose of the application, and can range from the detection of a defined set of relevant entities to an attempt to extract arbitrary information at Web scale, or something in between.

    The goal is to identify entities in texts and the relations they participate in, which informally translates to discovering who did what to whom, when, and why (Màrquez et al. 2008). Entities to locate include people, organizations, locations, and dates, while relations can be physical (near, part), personal or social (son, friend, business), or membership (staff, member-of-group) (Bontcheva et al. 2009).


    Identifying entities mentioned in texts is a pervasive task in IE, known as named entity recognition (NER). Named entity recognition seeks to locate and classify textual mentions that refer to specific types of individuals, such as persons and organizations, and can also cover references to addresses and dates (Nadeau and Sekine 2007; Tjong Kim Sang and De Meulder 2003). Named entities are often composed of a sequence of nouns referring to a single entity, e.g. "Ban Ki-moon" or "The Secretary General of the United Nations". Named entity recognition is usually an early step that prepares further processing, and is also a relevant task by itself, as many applications just need to detect the entities referred to in the documents.

    To illustrate the utility of recognizing named entities, consider a website gathering contributions from several authors (think of Wikipedia or a news website) that wants to link each author's name to a page with a short biography, or with information about professional interests. If done manually, this task is error prone and time consuming. A method to automatically detect authors can be quite straightforward and advantageous.

    Another possibility would be having each person referred to in the articles, not just the author, tracked across the website pages, providing a way to navigate through related topics and pointing readers to historical data about that person, such as a politician and how he has performed recently in the polls, an athlete and his latest scores and achievements, or the latest gossip concerning a public figure. Other benefits would be using such data, and also data about locations or products mentioned in articles, to improve website visibility by introducing those entities automatically as page metadata, or having advertisements associated with specific types of entities.

    Named entity recognition is also a relevant preprocessing step for language analyses other than IE. For instance, in machine translation it is known that names translate differently than regular text, and thus it is important to detect them so that distinct procedures can be applied (Babych and Hartley 2003; Koehn et al. 2007). The same applies to question answering systems: as questions are usually about specific domains, names help to discover the domain, since it is possible to detect whether a name represents a person, a government organization, a sports organization, a location, etc. (Grishman 1997).

    Named entity recognition is often considered a two-step procedure: first the boundaries of entities are detected, and then each entity is assigned a predefined category such as person, organization, location, or date. Boundary detection methods, whether using hand-crafted rules or some probabilistic approach, usually rely on features such as part-of-speech tags, word capitalization, and lexical features such as the values of the preceding, current, and following words (Nadeau and Sekine 2007). For instance, if a word has the value "Mr." the following word(s) likely denote a person's name. In addition to these methods, it is also common to use gazetteers of common entities, including people's names and well-known companies. In the case of entities with well-defined shapes, like dates, email addresses, and phone numbers, a widespread technique is to match their patterns using regular expressions.
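For entities with well-defined shapes, the regular expression technique can be sketched as follows. The patterns are deliberately simplified illustrations (real-world date, email, and phone formats are far more varied):

```python
import re

# Simplified, illustrative patterns for pattern-shaped entities.
PATTERNS = {
    "DATE":  re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\b\d{3}-\d{3}-\d{4}\b"),
}

def find_shaped_entities(text):
    """Return (category, matched text) pairs for every pattern match."""
    found = []
    for category, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((category, match.group()))
    return found

text = "Contact jbardeen@example.org or 555-867-5309 before 01/12/1956."
print(find_shaped_entities(text))
```

Because these entities have rigid surface forms, a handful of patterns can reach high precision without any training data, which is why the technique remains widespread for dates, addresses, and identifiers.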

    3 Identifying Things, Relations, and Semantizing Data

  • 29

    An example for locating people, using part-of-speech tags and word capitalization, is setting boundaries around sequences of proper nouns. Considering the example depicted in Fig. 2.1, this simple method would allow isolating candidate entities for people's names in the sentence:

    Table 3.1 Wikipedia categories found for each candidate entity in the example presented in Fig. 2.1 (the categories relevant for the example are presented first)

    NE candidate: John Bardeen
    Wikipedia categories: People from Madison, Wisconsin | American people of Russian descent | 1908 births | 1991 deaths | American agnostics | American electrical engineers | American Nobel laureates | American physicists | Foreign Members of the Royal Society | Nobel laureates in Physics | Nobel laureates with multiple Nobel awards | Oliver E. Buckley Condensed Matter Prize winners | Princeton University alumni | Quantum physicists | University of Wisconsin–Madison alumni

    NE candidate: Nobel Prize
    Wikipedia categories: Academic awards | Awards established in 1895 | International awards | Science and engineering awards | Organizations based in Sweden | Nobel Prize

    John Bardeen is the only laureate to win the Nobel Prize in physics twice, in 1956 and 1972.
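The proper-noun grouping heuristic can be sketched as follows; the POS tags are supplied by hand here, standing in for a tagger's output:

```python
# Group maximal runs of proper-noun (NNP) tags into candidate named entities.
def candidate_entities(tagged_tokens):
    candidates, current = [], []
    for token, tag in tagged_tokens:
        if tag == "NNP":
            current.append(token)
        elif current:
            candidates.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sentence
        candidates.append(" ".join(current))
    return candidates

# Hand-tagged version of the example sentence (tags are illustrative)
sentence = [("John", "NNP"), ("Bardeen", "NNP"), ("is", "VBZ"), ("the", "DT"),
            ("only", "JJ"), ("laureate", "NN"), ("to", "TO"), ("win", "VB"),
            ("the", "DT"), ("Nobel", "NNP"), ("Prize", "NNP"), ("in", "IN"),
            ("physics", "NN"), ("twice", "RB")]
print(candidate_entities(sentence))  # ['John Bardeen', 'Nobel Prize']
```

Note that this heuristic only proposes boundaries; it says nothing yet about whether "John Bardeen" is a person or "Nobel Prize" an award, which is the job of the classification step described next.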

    After having the candidates, the next step is to assign a category to each candidate. Considering the example, the goal of this classification is to discriminate the type of "John Bardeen" as person and "Nobel Prize" as an award. Classification methods for NEs include textual patterns for detecting elements such as addresses and dates, the use of gazetteers, and algorithms exploring information sources such as Wikipedia or Google (Whitelaw et al. 2008).

    Although gazetteers can be used to detect boundaries and classify entities, modern approaches avoid relying too much on them, as compiling such lists is a time consuming process that often needs to be redone when changing language and/or application domain, and the lists are rapidly proven incomplete. Some recent approaches replace gazetteers with information sources such as Wikipedia (Bizer et al. 2009; Suchanek et al. 2007; Wu et al. 2008). Wikipedia brings the advantage of being updated daily, with the possibility of querying it online or downloading and using freely available snapshots offline.

    Considering the example, a possible classification algorithm using Wikipedia can be based on querying the page of each named entity candidate and, if found, evaluating whether its Wikipedia categories include one of the predefined categories of the application. If the application includes categories for people and awards, it would be possible to classify "John Bardeen" as people, given that people appears in its Wikipedia categories, and to classify "Nobel Prize" as award for the same reason. Table 3.1 presents the Wikipedia categories found for our example.
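This classification step can be sketched as follows, with a hard-coded stand-in for the Wikipedia category lookup (a real implementation would query Wikipedia's API or an offline snapshot, and the keyword lists are illustrative assumptions):

```python
# Stand-in for a Wikipedia category lookup; a real system would query the
# live API or a downloaded snapshot instead of this hard-coded dict.
WIKI_CATEGORIES = {
    "John Bardeen": ["American physicists", "Nobel laureates in Physics",
                     "People from Madison, Wisconsin"],
    "Nobel Prize": ["Academic awards", "International awards"],
}

# Application-defined types, each with keywords to look for in category names
APP_TYPES = {"person": ["people", "physicists", "laureates"],
             "award": ["awards"]}

def classify(candidate):
    categories = WIKI_CATEGORIES.get(candidate, [])
    for app_type, keywords in APP_TYPES.items():
        for category in categories:
            if any(kw in category.lower() for kw in keywords):
                return app_type
    return None  # page not found or no matching category

print(classify("John Bardeen"), classify("Nobel Prize"))
```

The scheme inherits Wikipedia's coverage and freshness: any entity with a page and sensible categories can be classified without maintaining a hand-built gazetteer.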

    Nadeau and Sekine ( 2007 ) and Mohit ( 2014 ) provide comprehensive surveys of the methods proposed for NER.


    The recognition of generic named entities, such as people, locations, and dates, can be done using suites such as OpenNLP, NLTK, or StanfordNLP, presented in Chap. 2. For named entities in more specialized domains, it can be difficult to find a ready-to-use software package. One exception is the biomedical domain, for which it is possible to find named entity recognizers. Becas (Nunes et al. 2013) and KLEIO (Nobata et al. 2008) are two relevant examples of such tools.

    3.2 Relating Who, What, When, and Where

    Named entity recognition identifies the entities referred to in the documents but, by itself, does not inform in what kind of events those entities were involved, the reason why they were mentioned in the first place. For that, it is necessary to know what actions they are involved in, which is to say, the relations they establish with other entities (Banko and Etzioni 2008; Schutz and Buitelaar 2005). This is an important task for applications wishing to have a formal structure for parts of the content of the document. Considering, again, the example of Fig. 2.1, detecting and classifying the entities "John Bardeen" and "Nobel Prize" is not enough to know whether both entities are related and, if they are, how. Already knowing that John Bardeen is a person and that the Nobel Prize is an award, possible relations would be having John Bardeen as the winner, the sponsor, a jury member, or someone that attended the ceremony of the award "Nobel Prize".

    A relation is a predication about a pair of entities. Examples of common relations include relations of the types: (1) physical: located, near, part, etc.; (2) personal or social: business, family, friend, etc.; (3) employment or membership: member-of, employee, staff, etc.; (4) agent to artifact: user, owner, inventor, etc.; and (5) affiliation: citizen, resident, ideology, ethnicity, etc. In the example of John Bardeen and the Nobel Prize, a relation between the entities is "John Bardeen winner_of Nobel Prize". Differently from named entity recognition, relation extraction is not a process of annotating a sequence of tokens of the original document. Relationships express associations between two entities represented by distinct text segments (Sarawagi 2008). Relations involving two or more objects and subjects are known as events.
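A relation can thus be represented as a typed triple linking two previously recognized entities. The sketch below extracts a winner_of relation with a naive trigger-word pattern; it is purely illustrative (the trigger list and the subject-before-object assumption are simplifications, not a method from the systems cited here):

```python
# Represent relations as (subject, predicate, object) triples over entities
# already found by NER. The trigger-word pattern is deliberately naive.
TRIGGERS = {"win": "winner_of", "won": "winner_of"}

def extract_relation(tokens, entities):
    """entities: list of (surface string, type) pairs produced by NER."""
    surface = " ".join(tokens)
    spans = [e for e, _ in entities if e in surface]
    for word in tokens:
        predicate = TRIGGERS.get(word.lower())
        if predicate and len(spans) >= 2:
            # Naive assumption: first entity is the subject, second the object
            return (spans[0], predicate, spans[1])
    return None

tokens = "John Bardeen is the only laureate to win the Nobel Prize".split()
entities = [("John Bardeen", "person"), ("Nobel Prize", "award")]
print(extract_relation(tokens, entities))
```

Real relation extractors replace the trigger list with learned patterns over syntactic structure (e.g. dependency paths between the two entity mentions), but the output representation, a typed triple over entity mentions, is the same.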

    Approaches to relation extraction tend to steer away from using corpus-annotated data, due to the cost of creating such resources, and because there are other sources available that, while not having the quality of an annotated corpus, can provide high