

Page 1

©2012 Paula Matuszek. GATE information based on http://gate.ac.uk/sale/tao/splitch18.html

CSC 9010: Text Mining Applications:

GATE Machine Learning

Dr. Paula Matuszek

[email protected]

[email protected]

(610) 647-9789

Page 2

Information Extraction in GATE

In the last assignment we looked at how to set up GATE to do information extraction “by hand”:
– Gazetteers
– JAPE rules

We modified the system to add UPenn and Villanova to the universities to be extracted.

Not hard for two entities, but this can get tricky. Is there a better way?

Page 3

Machine Learning

We would like to:
– give GATE examples of universities
– let it figure out for itself how to identify them

This is basically a classification problem: given an entity, is it a university?

This sounds familiar! The classification methods we looked at in NLTK can be considered forms of supervised machine learning.

Page 4

Machine Learning in GATE

GATE has machine learning processing resources: the CREOLE Learning plugin.

These are the same classification algorithms we saw in NLTK, but both the features to be used and the class to be learned can be richer, using PRs such as ANNIE.

Page 5

University of Sheffield NLP

gate.ac.uk/sale/talks/gate-course-july09/ml.ppt

Machine Learning

We have data items comprising labels and features. E.g. an instance of “cat” has features “whiskers=1” and “fur=1”; a “stone” has “whiskers=0” and “fur=0”.

The machine learning algorithm learns a relationship between the features and the labels, e.g. “if whiskers=1 then cat”.

This is used to label new data: we have a new instance with features “whiskers=1” and “fur=1”. Is it a cat or not?
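The whiskers/fur example above can be sketched as a tiny "decision stump" learner. This is illustrative Python, not GATE code; the data and function names are made up for the example.

```python
# Minimal sketch of the cat/stone example: learn per-feature-value rules
# like "if whiskers=1 then cat" from labeled instances.

def learn_stump(data):
    """For each (feature, value) pair, remember the majority label."""
    rules = {}
    for features, label in data:
        for feat, val in features.items():
            rules.setdefault((feat, val), {}).setdefault(label, 0)
            rules[(feat, val)][label] += 1
    return {fv: max(counts, key=counts.get) for fv, counts in rules.items()}

def classify(model, features):
    """Label a new instance by majority vote over its feature values."""
    votes = [model[(f, v)] for f, v in features.items() if (f, v) in model]
    return max(set(votes), key=votes.count) if votes else None

training = [
    ({"whiskers": 1, "fur": 1}, "cat"),
    ({"whiskers": 0, "fur": 0}, "stone"),
]
model = learn_stump(training)
print(classify(model, {"whiskers": 1, "fur": 1}))  # cat
```

Real learners generalize this idea to many features and noisy labels, but the shape is the same: features in, a learned feature-to-label mapping out.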

Page 6

ML in Information Extraction

We have annotations (classes)

We have features (words, context, word features etc.)

Can we learn how features match classes using ML?

Once obtained, the ML representation can do our annotation for us based on features in the text:
– Pre-annotation
– Automated systems

Possibly a good alternative to knowledge-engineering approaches: no need to write the rules.

However, we need to prepare training data.

Page 7

ML in Information Extraction

Central to ML work is evaluation: we need to try different methods and different parameters to obtain good results.

Precision: How many of the annotations we identified are correct?

Recall: How many of the annotations we should have identified did we?

F-Score:

F = 2 · (precision · recall) / (precision + recall)
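As a quick worked example of these metrics (the counts are illustrative, not from the slides):

```python
# Precision, recall, and F-score from annotation counts.

def f_score(true_positives, false_positives, false_negatives):
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f = 2 * precision * recall / (precision + recall)
    return precision, recall, f

# Say we proposed 50 University annotations, of which 40 were correct,
# and we missed 10 that we should have found.
p, r, f = f_score(true_positives=40, false_positives=10, false_negatives=10)
print(p, r, f)  # 0.8 0.8 0.8
```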

Testing requires an unseen test set.
– Hold out a test set: a simple approach, but data may be scarce.
– Cross-validation: split the training data into e.g. 10 sections, take turns using each “fold” as a test set, and average the score across the 10.
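The cross-validation loop can be sketched in a few lines of Python. The `evaluate` callback here is a hypothetical stand-in for training a model on the training folds and scoring it on the test fold:

```python
# Hand-rolled 10-fold cross-validation over item indices.

def kfold_indices(n_items, k=10):
    """Yield (train, test) index lists, using each fold as the test set."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i, test in enumerate(folds):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

def cross_validate(n_items, evaluate, k=10):
    scores = [evaluate(train, test) for train, test in kfold_indices(n_items, k)]
    return sum(scores) / len(scores)  # average score across the k folds

# Dummy evaluator so the sketch runs: score is len(test)/10.
avg = cross_validate(100, lambda train, test: len(test) / 10)
print(avg)  # 1.0
```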

Page 8

More on SVMs

The primary machine learning engine in GATE is the Support Vector Machine.
– Handles very large feature sets well
– Has been shown empirically to be effective in a variety of unstructured text learning tasks

We covered this briefly earlier; more detail will help with using SVMs in GATE.

Page 9

Basic Idea Underlying SVMs

Find a line, a plane, or a hyperplane that separates our classes cleanly.
– This is the same concept as we have seen in regression.

Do this by finding the greatest margin separating the classes.

Page 10

Borrowed heavily from Andrew Moore's tutorials: http://www.cs.cmu.edu/~awm/tutorials. Copyright © 2001, 2003, Andrew W. Moore

Linear Classifiers

(figure: scatter of +1 and −1 points)

How would you classify this data?

Page 11

Linear Classifiers

(figure: scatter of +1 and −1 points)

How would you classify this data?

Page 12

Linear Classifiers

(figure: several candidate separating lines through the +1 and −1 points)

Any of these would be fine..

..but which is best?

Page 13

Classifier Margin

(figure: a separating line and its margin)

Define the margin of a linear classifier as the width by which the boundary could be increased before hitting a datapoint.

Page 14

Maximum Margin

(figure: the maximum-margin separator)

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is called the Linear Support Vector Machine (SVM).

Page 15

Maximum Margin

(figure: the maximum-margin separator, with support vectors highlighted)

The maximum margin linear classifier is the linear classifier with the maximum margin.

This is called the Linear Support Vector Machine (SVM).

Support vectors are those datapoints that the margin pushes up against.
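In one dimension the maximum-margin idea reduces to simple arithmetic: the best threshold sits midway between the closest opposite-class points, and those two points are the support vectors. A sketch with made-up data:

```python
# Maximum margin in 1-D: the support vectors are the innermost point of
# each class, and the optimal boundary is the midpoint between them.

pos = [2.5, 3.0, 4.2]   # class +1
neg = [-1.0, 0.2, 1.1]  # class -1

sv_pos, sv_neg = min(pos), max(neg)  # the two support vectors
threshold = (sv_pos + sv_neg) / 2    # maximum-margin boundary
margin = sv_pos - sv_neg             # width of the margin

print(sv_neg, sv_pos, threshold, margin)
```

Moving the boundary anywhere else shrinks the distance to one of the two support vectors, which is exactly why only those points matter to the classifier.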

Page 16

Messy Data

This is all good so far. But suppose our data aren't that neat:

Page 17

Soft Margins

Intuitively, it still looks like we can make a decent separation here.
– Can’t make a clean margin
– But can almost do so, if we allow some errors

A soft margin is one which lets us make some errors in order to get a wider margin.

There is a tradeoff between a wide margin and classification errors.

Page 18

Messy Data

Page 19

Slack Variables and Cost

In order to find a soft margin, we allow slack variables, which measure the degree of misclassification.
– Takes into account the number of misclassified instances and their distance from the margin

We then modify this by a cost (C) for these misclassified instances.
– High cost: narrow margins, won’t generalize well
– Low cost: broader margins, but misclassify more data

How much it should cost to misclassify instances depends on our domain: what we are trying to do.
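The tradeoff can be made concrete by evaluating the soft-margin objective, ||w||²/2 + C·Σ slackᵢ, for two candidate separators. This is a hand-rolled sketch with made-up 1-D data, not an SVM trainer (which would optimize w and b jointly):

```python
# Soft-margin objective for a 1-D separator f(x) = w*x + b.
# slack_i = max(0, 1 - y_i * f(x_i)) measures how badly point i
# violates the margin; C prices those violations.

def objective(w, b, points, C):
    slack = sum(max(0.0, 1 - y * (w * x + b)) for x, y in points)
    return 0.5 * w * w + C * slack

# One point (x=0.5, labeled -1) sits on the wrong side of any sensible boundary.
points = [(-2, -1), (-1, -1), (0.5, -1), (1, 1), (2, 1)]

wide = objective(w=1.0, b=0.0, points=points, C=0.1)    # wide margin, more slack
narrow = objective(w=5.0, b=0.0, points=points, C=0.1)  # narrow margin, less slack per point
print(wide, narrow)
```

With a low C, the wide-margin solution scores better despite the misclassified point; raising C shifts the balance toward separators that fight harder for every training point.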

Page 20

Non-Linearly-Separable Data

Suppose that even with a soft margin we can’t do a good linear separation of our data?

Allowing non-linearity will give us much better modeling of many data sets.

In SVMs, we do this by using a kernel. A kernel is a function which maps our data into a higher-order feature space where we can find a separating hyperplane.

Page 21

Hard 1-dimensional dataset

(figure: one class of points clustered around x=0, the other on either side)

What can be done about this?

Page 22

Hard 1-dimensional dataset

(figure)

Page 23

Hard 1-dimensional dataset

(figure)

Page 24

The Kernel Trick!

If we can’t do a linear separation:
– add higher-order features
– and combinations of features

Combinations of features may allow separation where either feature alone will not.

This is known as the kernel trick. Typical kernels are polynomial and RBF (Radial Basis Function, i.e. Gaussian). GATE supports linear and polynomial kernels.
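The hard 1-D dataset from the earlier slides illustrates why this works: no single threshold on x separates the classes, but after mapping x → (x, x²) a line in the new space does. A small self-contained check with made-up data:

```python
# Kernel-style feature mapping: 1-D data that is not linearly separable
# becomes separable after adding a squared feature.

xs = [-3, -2, -0.5, 0, 0.5, 2, 3]
ys = [-1, -1, 1, 1, 1, -1, -1]  # the inner points are class +1

# No threshold t on x puts all +1 points on one side and all -1 on the other.
separable_1d = any(
    all((x < t) == (y == 1) for x, y in zip(xs, ys)) or
    all((x > t) == (y == 1) for x, y in zip(xs, ys))
    for t in xs
)

# In (x, x**2) space, the horizontal line x2 = 2 separates them:
# a point is +1 exactly when its squared feature is below 2.
separable_2d = all((x * x < 2) == (y == 1) for x, y in zip(xs, ys))

print(separable_1d, separable_2d)  # False True
```

A real kernel never computes the mapped features explicitly; it only needs dot products in the mapped space, which is what makes high-dimensional (even infinite-dimensional, for RBF) mappings tractable.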

Page 25

(figure from Andrew Moore's tutorials)

Page 26

(figure from Andrew Moore's tutorials)

Page 27

What If We Want Three Classes?

Suppose our task involves more than two classes: universities, cities, businesses?

Reduce the multiple-class problem to multiple binary-class problems:
– one-versus-all (one vs. others)
– one-versus-one (one vs. another)

GATE will do this automatically if there are more than two classes.
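The label construction behind these reductions can be sketched in Python (hypothetical labels; GATE handles this internally):

```python
# Reducing a three-class labeling problem to binary problems.
from itertools import combinations

labels = ["University", "City", "Business", "City", "University"]
classes = sorted(set(labels))

# one-versus-all: one binary problem per class (that class vs. everything else)
one_vs_others = {c: [1 if l == c else -1 for l in labels] for c in classes}

# one-versus-one: one binary problem per unordered pair of classes,
# using only the instances belonging to that pair
one_vs_one = {
    (a, b): [(l, 1 if l == a else -1) for l in labels if l in (a, b)]
    for a, b in combinations(classes, 2)
}

print(one_vs_others["City"])  # [-1, 1, -1, 1, -1]
print(len(one_vs_one))        # 3 pairwise problems
```

With k classes, one-versus-all builds k classifiers while one-versus-one builds k·(k−1)/2; the slides' LOC/PERS/NULL example is the k = 3 case.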

Page 28

Doing this in GATE

GATE has a Learning plugin available in CREOLE. The PR available from it is Batch Learning.
– Newest machine learning PR
– Focused on chunk recognition, text classification, relation extraction
– Primary algorithm is an SVM
– Can also interface to WEKA for Naive Bayes, KNN and decision trees

There is also an older Machine_Learning plugin, which contains wrappers for several different external learning systems.

Page 29

Using the Batch Learning PR

The Batch Learning PR basically learns new annotations (the class) from previous annotations (the features).

It needs three things:
– annotated training documents
– possibly preprocessed to get the annotations we want to learn from
– an XML configuration file (which is external to the IDE)

The PR treats the parent directory of the config file as its working directory.

Page 30

ML applications in GATE

• Batch Learning PR: Evaluation, Training, Application

• Runs after all other PRs (must be the last PR)

• Configured via an XML file

• A single directory holds generated features, models, and the config file

Page 31

Instances, attributes, classes

Example: “California Governor Arnold Schwarzenegger proposes deep cuts.”
(annotated with Sentence and Token annotations, Entity.type=Location on “California”, and Entity.type=Person on “Arnold Schwarzenegger”)

Instances: any annotation. Tokens are often convenient.

Attributes: any annotation feature relative to instances, e.g. Token.String, Token.category (POS), Sentence.length.

Class: the thing we want to learn; a feature on an annotation.

Page 32

Surround mode

• A learned class may cover more than one instance, e.g. Entity.type=Person spanning the two tokens in “Arnold Schwarzenegger”

• Begin / End boundary learning

• Dealt with by the API (surround mode); transparent to the user

Example: “California Governor Arnold Schwarzenegger proposes deep cuts.”

Page 33

Multi-class to binary

Example: “California Governor Arnold Schwarzenegger proposes deep cuts.”, with Entity.type=Person and Entity.type=Location annotations.

• Three classes, including null

• Many algorithms are binary classifiers

• One against all (one against others):
LOC vs PERS+NULL / PERS vs LOC+NULL / NULL vs LOC+PERS

• One against one (one against another one):
LOC vs PERS / LOC vs NULL / PERS vs NULL

• Dealt with by the API (multClassification2Binary); transparent to the user

Page 34

The XML File

The root element is ML-CONFIG. It contains two required elements:
– DATASET
– ENGINE

and some optional settings, described in the manual; the most important are covered next.
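Putting the elements from the following slides together, a skeletal configuration file might look like this. The element names are the ones shown in the slides; the ordering and attribute values here are illustrative, so check the GATE manual for the exact schema:

```xml
<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="true"/>
  <multiClassification2Binary method="one-vs-others"/>
  <EVALUATION method="kfold" runs="10"/>
  <DATASET>
    <!-- instance annotation, class, and attribute mappings go here -->
  </DATASET>
  <ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
          options="-c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>
</ML-CONFIG>
```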

Page 35

The configuration file

• Verbosity: 0,1,2

• Surround mode: set true for entities, false for relations

• Filtering: e.g. remove instances distant from the hyperplane

<?xml version="1.0"?>
<ML-CONFIG>
  <VERBOSITY level="1"/>
  <SURROUND value="true"/>
  <FILTERING ratio="0.0" dis="near"/>

Page 36

Thresholds

• Control selection of boundaries and classes in post processing

• The defaults we give will work

• Experiment

• See the documentation

<PARAMETER name="thresholdProbabilityEntity" value="0.3"/>
<PARAMETER name="thresholdProbabilityBoundary" value="0.5"/>
<PARAMETER name="thresholdProbabilityClassification" value="0.5"/>

Page 37

Multiclass and evaluation

• Multi-class: one-vs-others or one-vs-another

• Evaluation: kfold (runs gives the number of folds) or holdout (ratio gives the training/test split)

<multiClassification2Binary method="one-vs-others"/>
<EVALUATION method="kfold" runs="10"/>

Page 38

The learning engine

• Learning algorithm and implementation specific

• SVM: Java implementation of LibSVM
– Uneven margins set with -tau

<ENGINE nickname="SVM" implementationName="SVMLibSvmJava"
        options="-c 0.7 -t 1 -d 3 -m 100 -tau 0.6"/>

<ENGINE nickname="NB" implementationName="NaiveBayesWeka"/>

<ENGINE nickname="C45" implementationName="C4.5Weka"/>

Page 39

The dataset

• Defines: the instance annotation, the class, and the mapping from annotation features to instance attributes

<DATASET>

</DATASET>

Page 40

Parameters in the IDE

Path to the config file: set when you create the new PR.

Run-time parameters:
– corpus
– input annotation set name: features and class name
– output annotation set name: where results go (same as input when evaluating)
– learningMode

Page 41

LearningMode

A run-time parameter; the default is TRAINING. The primary modes are:
– TRAINING: the PR learns from data and saves models into learnedModels.save
– APPLICATION: reads the model from learnedModels.save and applies it to data
– EVALUATION: k-fold or hold-out evaluation; output goes to GATE Developer

There are additional modes for incremental (adaptive) training and for displaying outcomes.

Page 42

Running the Batch Learning PR

Multiple PRs:
– Training and evaluation modes update the model and other data files: don’t run more than one in the same working directory.

Processing order:
– For training and evaluation modes the PR needs to go last; if there is post-processing to be done, make a separate application.
– For application mode, order is controlled by BATCH-APP-INTERVAL:
– set to 1: acts like a normal PR
– set higher: must go last (may be more efficient)

Page 43

Summary

GATE supports a very rich set of classifier-type machine learning capabilities.
– Current: the Learning plugin has the Batch Learner
– Older: the Machine_Learning plugin has wrappers for various ML systems

Both the class to be learned and the features to learn from are document annotations.

These are the same algorithms as we saw in NLTK, but their use can be very different:
– classifying chunks of text within documents
– an alternative to engineering JAPE rules by hand