
CE807 Lab 5: Named Entity Recognition

February 16

In this lab we are going to experiment with Named Entity Recognition (NER), first using Python libraries, then using tools outside Python, in particular CRF++. All the materials for this lab, including this script, can be found at

http://csee.essex.ac.uk/staff/poesio/Teach/807/Labs/Lab5

Copy all the material in that folder, including the data sub-folder, into a directory in your file space.

1. Off-the-shelf NE Recognizers: The Stanford NER

There are several publicly available NER systems, some of them available as part of complete pipelines.

One of the most widely used such NER systems is the Stanford NER system, part of the Stanford CoreNLP information extraction pipeline, available from (and described at)

http://stanfordnlp.github.io/CoreNLP/

Stanford CoreNLP includes a POS tagger, a parser, a NER system, a coreference resolver, and a sentiment analyser, and supports Arabic, Chinese, English, French, German and Spanish.

ToDo: read the description of the Stanford CoreNLP pipeline at the page above.

The Stanford NER system is described at

http://stanfordnlp.github.io/CoreNLP/ner.html

Like most modern NER systems, it is based on the Conditional Random Fields (CRF) sequence learning model. It is written in Java and supports English, Chinese, and German.

ToDo: read the description of the Stanford NER system at the page above.

The system can be downloaded from a number of sites, including the github site above and from


http://nlp.stanford.edu/software/CRF-NER.shtml

Three different pre-trained models are included with the package:

1. english.all.3class.distsim.crf.ser.gz: a 3-class NER tagger that can label Person, Organization, and Location entities. It is trained on data from CoNLL, MUC-6, MUC-7, and ACE.

2. english.conll.4class.caseless.distsim.crf.ser.gz: a 4-class NER tagger that can label Person, Organization, Location, and Misc entities. It is trained on the CoNLL 2003 Shared Task data.

3. english.muc.7class.caseless.distsim.crf.ser.gz: a 7-class NER tagger, trained on data from MUC, that distinguishes between the classes Person, Organization, Location, Date, Time, Money, and Percent.

One of these models has to be selected when using the Stanford NER tagger, either from the command line or from within Python / NLTK (see below).
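For reference, on the command line the model is passed with the -loadClassifier flag. A typical invocation looks like the following (the input file name sample.txt is just an illustration; see the README included in the download for details):

java -cp stanford-ner.jar edu.stanford.nlp.ie.crf.CRFClassifier -loadClassifier english.all.3class.distsim.crf.ser.gz -textFile sample.txt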

There is a demo of the system at:

http://corenlp.run/ (To see only NER, you can unselect the other types of annotations.)

[Screenshot: sample output of the corenlp.run demo.]

ToDo: experiment with the demo, trying to break it. Try first in English, then try other languages.

In the next section, we will see one way of using the Stanford NER from Python, via the NLTK interface.

Page 3: 1. Off-the-shelf NE Recognizers: The Stanford Web viewThe NLTK Python library includes a number of NE taggers and interfaces to several ... try different examples. ... and then try

2. NER in NLTK

The NLTK Python library includes a number of NE taggers and interfaces to several others, including the Stanford tagger.

2.1 The standard NLTK NE tagger

The standard NER in NLTK is the function ne_chunk(). The example in the script ne_chunk_ex.py shows how to use it:

import nltk
import re
import time

exampleArray = ['Prime Minister Theresa May travelled to Washington in 2017.']

## let the fun begin! ##
def nerInNLTK():
    try:
        for item in exampleArray:
            tokenized = nltk.word_tokenize(item)
            pos_tagged = nltk.pos_tag(tokenized)
            print(pos_tagged)
            namedEnt = nltk.ne_chunk(pos_tagged)
            namedEnt.draw()
            time.sleep(1)
    except Exception as e:
        print(e)

nerInNLTK()

Note: this example uses Python 3 syntax. Also, in order to run it you’ll need to have the resources punkt and averaged_perceptron_tagger installed (ne_chunk() additionally relies on the maxent_ne_chunker and words resources). If they are not installed, run nltk.download() and look under Models.

ToDo: try different examples. Make sure you understand the syntax of the output.
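If you would rather inspect the output programmatically than with draw(), you can walk the nltk.Tree that ne_chunk() returns. A minimal sketch (the helper name extract_entities is ours, not part of NLTK):

# Flatten the Tree returned by nltk.ne_chunk() into (entity, label) pairs.
# Named entities are the subtrees whose label is not the sentence label 'S'.
def extract_entities(tree):
    return [(' '.join(token for token, pos in subtree.leaves()), subtree.label())
            for subtree in tree.subtrees()
            if subtree.label() != 'S']

For the example sentence, extract_entities(nltk.ne_chunk(pos_tagged)) returns a list of (entity string, label) pairs.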

2.2 The Stanford Tagger in NLTK

An interface to the Stanford taggers is included in NLTK as part of the nltk.tag.stanford module, described at:

Page 4: 1. Off-the-shelf NE Recognizers: The Stanford Web viewThe NLTK Python library includes a number of NE taggers and interfaces to several ... try different examples. ... and then try

http://www.nltk.org/api/nltk.tag.html#module-nltk.tag.stanford

In particular, the Stanford NER tagger can be used by invoking the function nltk.tag.stanford.StanfordNERTagger(), which requires as its obligatory first argument the name of the model to be used. The second argument is the location of the NER executable, stanford-ner.jar: you can either specify that in the CLASSPATH variable or pass it explicitly (which is what we will do). In order to run the tagger, you need (i) to make sure that Java is installed, and (ii) to download the appropriate models and the archive stanford-ner.jar, either from this lab’s page or by downloading a version of the Stanford NER system from

http://nlp.stanford.edu/software/CRF-NER.shtml#Download

The latest version of the package (version 3.7) requires Java 8. To run on a machine with Java 6 or 7, you need version 3.4.1. On the module page you can find files for both versions.

ToDo: download the model english.all.3class.distsim.crf.ser.gz (3 classes: Person, Organization, and Location) and the archive stanford-ner.jar, put them in the folder you are using as your working directory in Python, and then try to get the following code to run (script stanford_NER_NLTK.py):

from nltk.tag.stanford import StanfordNERTagger

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz', './stanford-ner.jar')

print(st.tag('Prime Minister Theresa May travelled to Washington in 2017'.split()))

ToDo: try to replace the model used in this script with a different model from those distributed with the package.
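For example, assuming you have also downloaded the 7-class MUC model file listed in Section 1, only the first argument changes:

st = StanfordNERTagger('english.muc.7class.caseless.distsim.crf.ser.gz', './stanford-ner.jar')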

3. Building specialized classifiers with CRF++

Ready-made (pre-trained) models like those used in the previous parts of this lab have a fixed set of entity types that may not be what you require, and they perform poorly on domains different from the one on which they were trained. The solution is to train new models for the new domains. This involves:

- Preparing training and test datasets for those new domains
- Training models
- Using them to classify new data

We’ll do each step in turn.

3.1 The dataset


As an example, we will use the University of Essex corpus collected by Maha Althobaiti in 2011 and containing information about three types of entities: person names, course codes, and room numbers. The dataset can be downloaded from the lab’s page as UEC.zip

ToDo: do that.

The University of Essex Corpus is in CoNLL IOB format:

- A token on each line
- Columns separated by a single space, one for each feature (token annotation)
- An empty line after each sentence
- As the last column, the NE tag for the token, in IOB format:
  o O for tokens that do not belong to a NE
  o B-PER: the beginning of the name of a person
  o I-PER: the continuation (inside) of the name of a person
  o B-COR: the beginning of a course code
  o I-COR: the continuation (inside) of a course code
  o B-ROM: the beginning of a room number
  o I-ROM: the continuation (inside) of a room number

For example, the following sentence

Dean of the graduate school Dr Pam Cox said: “This is a great result for Essex”

is represented as:
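(A plausible rendering is the following, on the assumption that the honorific Dr is left outside the PER span and that Essex, being a location, is tagged O in this corpus, which only annotates persons, course codes, and room numbers:)

Dean O
of O
the O
graduate O
school O
Dr O
Pam B-PER
Cox I-PER
said O
: O
“ O
This O
is O
a O
great O
result O
for O
Essex O
” O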

NB: the dataset does not provide any additional features for the tokens other than the token itself (the first column), but you can compute additional features and store them in additional columns. For instance, suppose you also want to use a binary feature Capitalized, which is 1 if a token is capitalized and 0 if it isn’t. You can do this by writing a script that goes over the training and test files line by line, determines whether the token is capitalized, and stores the feature in a new second column. The output should be something like this:

Dean 1 O
of 0 O
the 0 O
Graduate 1 O
School 1 O
…

ToDo: write a Python script that does this.
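A minimal sketch of such a script (the input and output file names here are assumptions; use whatever names the files in UEC.zip actually have):

# Add a binary 'Capitalized' feature as a new second column to a file in
# the corpus format: space-separated columns, one token per line, and an
# empty line between sentences (which must be preserved).
def add_capitalized_column(in_path, out_path):
    with open(in_path) as fin, open(out_path, 'w') as fout:
        for line in fin:
            line = line.rstrip('\n')
            if not line:                  # sentence boundary: copy as-is
                fout.write('\n')
                continue
            cols = line.split(' ')
            cap = '1' if cols[0][:1].isupper() else '0'
            # token, new feature, then the remaining columns (ending in the tag)
            fout.write(' '.join([cols[0], cap] + cols[1:]) + '\n')

add_capitalized_column('UniTrainingSet', 'UniTrainingSet.cap')
add_capitalized_column('UniTestingSet', 'UniTestingSet.cap')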

3.2 CRF++

CRF++ is possibly the most widely used implementation of CRFs. It can be downloaded from the following page:

https://taku910.github.io/crfpp/

ToDo: read the documentation at the page above; it’s pretty self-explanatory.

ToDo:

- Download and unzip the binary package for Windows (or download the sources and compile them for your computer, e.g. using brew for macOS or make for Linux)
- Keep the training and test datasets in the distribution directory

3.3 Training

To create a model called Uni_model from training data UniTrainingSet with CRF++, the following command should be used:

>> crf_learn template UniTrainingSet Uni_model

A number of parameters can also be specified:

-a CRF-L2 or CRF-L1: changes the regularization algorithm. The default setting is L2.

-c float: with this option, you can change the hyper-parameter for the CRFs. With a larger c value, CRF++ tends to overfit the given training corpus. This parameter trades off overfitting against underfitting.

-f NUM: this parameter sets the cut-off threshold for the features. CRF++ uses only the features that occur no fewer than NUM times in the given training data. The default value is 1.

-p NUM: if the PC has multiple CPUs, you can make training faster by using multi-threading. NUM is the number of threads.

For example, you could try:

>> crf_learn -f 3 -c 1.5 template UniTrainingSet Uni_model

(but wait until you learn how the template works)

3.4 The template

The template tells CRF++ which features should be used. In particular, the syntax used in the template makes it easy to pick up information from the previous or following elements in the sequence. Suppose your dataset looks as follows, where ‘the’ in the first column (column 0) is the current token, and the following columns 1, 2, etc. contain features that have been created as discussed in Section 3.1.
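For illustration, a fragment in this format (the feature values in column 1 are made up) might look like this:

He 17 O
teaches 03 O
the 31 O
course 05 O
CE807 99 B-COR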

Then the following template:
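A minimal template producing exactly the features listed below uses CRF++’s %x[row,column] macros, where row is relative to the current line and column is absolute (this is a sketch; see the CRF++ documentation for the full syntax):

U00:%x[0,0]
U01:%x[0,1]
U02:%x[-1,0]
U03:%x[-2,1]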

Tells CRF++ to use the following features for token ‘the’:

- The item in the current line (row 0), column 0, i.e. the token itself (‘the’)
- The feature in the current line (row 0), column 1 (feature 31)
- The item in column 0 of the previous line
- The item in column 1 of the line 2 back
- Etc.

As follows:
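(Using the illustrative fragment above, with ‘the’ as the current token, the macros expand to:)

U00:%x[0,0] => the
U01:%x[0,1] => 31
U02:%x[-1,0] => teaches
U03:%x[-2,1] => 17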


ToDo: write a template for the University of Essex corpus, and run crf_learn using it.

3.5 Testing

To run a model created by crf_learn called Uni_model over test data UniTestingSet, and to save the output in a file called testResult, the following command should be used:

>> crf_test -m Uni_model UniTestingSet > testResult

Important note: the test data have to be in the same format as the training data, i.e., they have to include the same features!!

3.6 Evaluation

The result file produced by crf_test, testResult, will contain in each line an additional column with the tag predicted by the trained model. At this point you have two options to get a quantitative assessment of the result:

• Write your own code to compute Precision, Recall, and F-measure from the testResult file (a minimal sketch is given after this list).

• Use the evaluation script used in the CoNLL shared task, which is available from:

http://www.cnts.ua.ac.be/conll2002/ner/bin/conlleval.txt

After downloading, rename the file from conlleval.txt to conlleval.pl. The script is written in Perl, so it requires Perl to be installed on your machine.
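The sketch below computes token-level Precision, Recall, and F-measure, assuming (as in crf_test’s output) that the last two columns of each line are the gold tag and the predicted tag. Note that conlleval scores at the phrase level instead, so its numbers will generally differ:

# Token-level P/R/F over a crf_test result file. Lines with fewer than
# two columns (sentence boundaries) are skipped.
def evaluate(path):
    tp = fp = fn = 0
    with open(path) as f:
        for line in f:
            cols = line.split()
            if len(cols) < 2:
                continue
            gold, pred = cols[-2], cols[-1]
            if pred != 'O' and pred == gold:
                tp += 1                    # entity tag predicted correctly
            elif pred != 'O':
                fp += 1                    # entity tag predicted wrongly
            if gold != 'O' and pred != gold:
                fn += 1                    # gold entity tag missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    print('Precision=%.3f Recall=%.3f F=%.3f' % (p, r, f1))

evaluate('testResult')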

4. CRF++ in Python

There are a number of packages that provide an interface between CRF implementations and Python. Two such packages are:

- python-crfsuite (https://pypi.python.org/pypi/python-crfsuite)
- sklearn-crfsuite, which provides an interface similar to that of sklearn (https://pypi.python.org/pypi/sklearn-crfsuite)
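To give the flavour of the sklearn-crfsuite interface, here is a minimal sketch with a tiny made-up training set (real feature extraction is covered in the tutorial below):

import sklearn_crfsuite

# Each sentence is a list of per-token feature dicts; tags are parallel lists.
X_train = [[{'word': 'Pam', 'capitalized': True},
            {'word': 'Cox', 'capitalized': True}],
           [{'word': 'the', 'capitalized': False},
            {'word': 'course', 'capitalized': False}]]
y_train = [['B-PER', 'I-PER'], ['O', 'O']]

crf = sklearn_crfsuite.CRF(algorithm='lbfgs', c1=0.1, c2=0.1, max_iterations=50)
crf.fit(X_train, y_train)
print(crf.predict(X_train))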


Mikhail Korobov, one of the authors of these packages, provides an excellent online tutorial, with a Jupyter notebook, at:

https://media.readthedocs.org/pdf/sklearn-crfsuite/latest/sklearn-crfsuite.pdf

ToDo: follow the tutorial (you can find the code on the lab’s page as sklearn_crfsuite_tutorial.py).

References

- The Stanford CoreNLP documentation at http://stanfordnlp.github.io/CoreNLP/

- The Stanford CRF NER description at: http://nlp.stanford.edu/software/CRF-NER.shtml

- The CRF++ documentation at https://taku910.github.io/crfpp/