Fex Feature Extractor - v2
Topics
• Vocabulary
• Syntax of scripting language
– Feature functions
– Operators
• Examples
– POS tagging
• Input Formats
Vocabulary
• example – A list of active records for which Fex produces a single SNoW example. Usually a sentence.
• record – A single position in an example (sentence). Contains a list of fields, each of which holds a different piece of info, e.g. NLP: word, tag; vision: color, etc.
• Raw input to Fex – A list of valid examples (raw sentences, tagged corpora, etc.)
• Fex's output – Lexical features written to the lexicon file. Their corresponding numeric IDs are written to the example file.
• feature function – A relation among one or more records.
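The vocabulary above can be pictured with a minimal Python sketch; the function names are illustrative and not part of Fex itself:

```python
# Minimal sketch of the Fex vocabulary (illustrative names, not Fex's API).

def make_record(word, tag):
    """A record: a single position in an example, holding its fields."""
    return {"word": word, "tag": tag}

def make_example(tagged_sentence):
    """An example: the list of active records for one sentence."""
    return [make_record(w, t) for t, w in tagged_sentence]

# The running sentence from the later slides:
example = make_example([("DET", "The"), ("NN", "dog"), ("V", "is"), ("JJ", "mad")])
```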
Example: Feature Functions
Script Syntax
• A Fex script file contains a list of definitions, each of which rewrites the given observation into a set of active features.
• Definition format, terms in ()'s optional:
target (inc) (loc): FeatureFunc ([left, right])
• target - Target index or word. To treat each record in the observation as a target, use -1. This is a macro for “all words”.
• inc - Include target word instead of placeholder (*) in some features.
• loc - Generate features with location relative to target.
• FeatureFunc - A feature function defined in terms of certain unary and n-ary relations, and operators.
• left - Left offset of scope for generating features. Negative values are left of the target, positive to the right.
• right - Right offset of scope.
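As a rough illustration of this definition format, a small parser can split a definition line into its parts. The regular expression and field names below are assumptions for the sketch, not Fex's actual parser:

```python
import re

# Hedged sketch: parse one Fex definition line of the form
#   target (inc) (loc): FeatureFunc ([left, right])

DEF_RE = re.compile(
    r"^\s*(?P<target>\S+?)"         # target index or word (-1 = all records)
    r"(?P<inc>\s+inc)?"             # optional: keep target word, not '*'
    r"(?P<loc>\s+loc)?"             # optional: add location info
    r"\s*:\s*(?P<func>[^\[\s]+)"    # feature function, e.g. w, t, coloc(w,t)
    r"(?:\s*\[(?P<left>-?\d+)\s*,\s*(?P<right>-?\d+)\])?"  # optional scope
)

def parse_def(line):
    d = DEF_RE.match(line).groupdict()
    return {
        "target": d["target"],
        "inc": d["inc"] is not None,
        "loc": d["loc"] is not None,
        "func": d["func"],
        "scope": (int(d["left"]), int(d["right"])) if d["left"] else None,
    }
```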
Basic Feature Functions

Type       Def   Interpretation                                              Output to Lexicon
Label      lab   Produces a label feature                                    lab[target word]; lab(t) gives lab[target tag]
Word       w     Active if word(s) in the current record are within scope    w[current word]
Tag (pos)  t     Active if tag(s) in the current record are within scope     t[current tag]
Vowel      v     Active if the word(s) in the current record begin with      v[initial vowel]
                 a vowel
Prefix     pre   Active if the word(s) in the current record begin with      pre[active prefix]
                 a prefix in a given list
Suffix     suf   Active if the word(s) in the current record end with        suf[the active suffix]
                 a suffix in a given list
Baseline   base  Active if a baseline tag from a prepared list exists        base[baseline tag]
                 for the word(s) in the current record
Lemma      lem   Active if a lemma from the WordNet database exists for      lem[active lemma]
                 the word(s) in the current record
Example
• Sentence = “(DET The) (NN dog) (V is) (JJ mad)”

Method 1:
Script Def      Output to lexicon   Output to example file
dog: w [-1,1]   10001 w[The]        10001, 10002, 10003, 10004:
                10002 w[is]
dog: t [1,2]    10003 t[V]
                10004 t[JJ]

Method 2:
Script Def      Output to lexicon   Output to example file
-1: lab         10001 w[The]        1, 10001, 10002, 10003, 10004:
-1: w [-1,1]    10002 w[is]
-1: t [1,2]     10003 t[V]
                10004 t[JJ]
Operators & Complex Functions
• (X) operator – Indicates that a feature is active without any specific instantiation.
Script Def          Output to Lexicon
dog: v(X) [-1,1]    10001 v[]
• (x=y) operator – Creates an active feature iff the active instantiation matches the given argument.
Script Def          Output to Lexicon
dog: w(x=is)        10001 w[is]
Sentence = “(DET The) (NN dog) (V is) (JJ mad)”
• & operator – conjoins two features, producing a new feature which is active iff the record fulfills both constituent features.
Script Def Output to Lexicon
dog: w&t [-1,-1] 10001 w[The]&t[DET]
• | operator – disjunction of two features: outputs a feature for each term of the disjunction that is active in the current record.
Script Def          Output to Lexicon
dog: w|t [-1,-1]    10001 w[The]
                    10002 t[DET]
Sentence = “(DET The) (NN dog) (V is) (JJ mad)”
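A minimal Python sketch of the two operators, assuming each feature function has already produced its active feature for the record:

```python
# Hedged sketch of the '&' and '|' operators. 'feats' maps a feature-function
# name to its active feature for one record; names are illustrative.

def conj(f1, f2, feats):
    """'&': one new feature, active iff both constituents are active."""
    if f1 in feats and f2 in feats:
        return ["%s&%s" % (feats[f1], feats[f2])]
    return []

def disj(f1, f2, feats):
    """'|': one output feature per active term of the disjunction."""
    return [feats[f] for f in (f1, f2) if f in feats]

# The record at offset [-1,-1] from 'dog' in the slide's sentence:
record = {"w": "w[The]", "t": "t[DET]"}
```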
Operators & Complex Functions
• coloc function – Consecutive feature function: takes two or more features as arguments to produce a consecutive collocation over two or more records. The order of the arguments is preserved in the active feature.
Script Def                Output to Lexicon
mad: coloc(w,t) [-3,-1]   10001 w[The]-t[NN]
                          10002 w[dog]-t[V]

• scoloc function – Sparse consecutive feature function: operates like coloc, except that active collocations need not be consecutive. However, the order of the arguments is still preserved in determining whether a feature is active.
Script Def                Output to Lexicon
mad: scoloc(w,t) [-3,-1]  10001 w[The]-t[NN]
                          10002 w[dog]-t[V]
                          10003 w[The]-t[V]
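The difference between the two collocation functions can be sketched with `itertools.combinations`. This is a simplified model of the `coloc(w, t)` example above, not Fex code:

```python
from itertools import combinations

# Hedged sketch: coloc applies funcs[j] to record i+j for consecutive i;
# scoloc allows any order-preserving choice of k records.

def coloc(records, funcs):
    """Consecutive collocation over len(funcs) adjacent records."""
    k = len(funcs)
    return ["-".join(funcs[j](records[i + j]) for j in range(k))
            for i in range(len(records) - k + 1)]

def scoloc(records, funcs):
    """Sparse version: any order-preserving k-tuple of records."""
    k = len(funcs)
    return ["-".join(funcs[j](records[idx[j]]) for j in range(k))
            for idx in combinations(range(len(records)), k)]

# Records in scope [-3,-1] of 'mad' in the slide's sentence:
recs = [("The", "DET"), ("dog", "NN"), ("is", "V")]
w = lambda r: "w[" + r[0] + "]"
t = lambda r: "t[" + r[1] + "]"
```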
Example: POS tagging
• Useful features for POS tagging:– The preceding word is tagged c.
– The following word is tagged c.
– The word two before is tagged c.
– The word two after is tagged c.
– The preceding word is tagged c and the following word is tagged t.
– The preceding word is tagged c and the word two before is tagged t.
– The following word is tagged c and the word two after is tagged t.
– The current word is w.
– The most probable part of speech for the current word is c.
• Given the sentence:– (t1 The) (t2 dog) (t3 ran) (t4 very) (t5 quickly)
• The following Fex script will produce the features from the last slide.
-1: lab(t)
-1 loc: t [-2,2]
-1: coloc(t,t,t) [-2,2]
-1 inc: w [0,0]
-1: base [0,0]
• To do POS tagging, an example needs to be generated for each word in observation.
• For the third word, “ran”, the script produces the following output:

Script:                   Lexicon Output:
-1: lab(t)                1 lab[t3]
-1 loc: t [-2,2]          10001 t[t1_*]
                          10002 t[t2*]
                          10003 t[*t4]
                          10004 t[*_t5]
-1: coloc(t,t,t) [-2,2]   10005 t[t1]-t[t2]-*
                          10006 t[t2]-*-t[t4]
                          10007 *-t[t4]-t[t5]
-1 inc: w [0,0]           10008 w[ran]
-1: base [0,0]            10009 base[V]

• And an example in the example file:
1, 10001, 10002, 10003, 10004, 10005, 10006, 10007, 10008, 10009:
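The lexicon and example-file bookkeeping shown above can be sketched in Python. This is a simplified model; Fex's real ID assignment may differ, and the 10001 base is taken from the slides:

```python
# Hedged sketch: assign fresh lexicon IDs to unseen features and emit one
# example-file line of the form "label, id, id, ...:".

def build_example_line(label_id, features, lexicon):
    for f in features:
        if f not in lexicon:
            lexicon[f] = 10001 + len(lexicon)   # next free numeric ID
    ids = [str(label_id)] + [str(lexicon[f]) for f in features]
    return ", ".join(ids) + ":"

lex = {}
line = build_example_line(1, ["t[t1_*]", "t[t2*]", "w[ran]"], lex)
```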
Input Formats
• Fex can presently accept data in the following formats:
– w1 w2 w3 w4 …
– (t1 w1) (t2 w2) (t3 w3) (t4 w4) …
– w1 (t2 w2) (t3 t3a; w3) (t4; w4 w4a) …
Using Fex (command line)
fex [options] script-file lexicon-file corpus-file example-file
Options:
• -t: target file
– One target per line
– Do not leave any empty lines in the file!
• -r: test mode
– Does not create new features
• -h, -I
– Creates a histogram of active features
Using Fex (command line)
• Target file = targ:
  dog
  cat
• Script file = script:
  -1 : lab
  -1 : w [-1,-1]
  -1 : t [-1,-1]
• Corpus file = corpus:
  (DET The) (NN dog) (V is) (JJ mad)
• Lexicon file = lexicon
• Example file = example

fex -t targ script lexicon corpus example
SNoW
Word representation
[Figure: SNoW network diagram – target nodes (e.g. “join”, “say”) linked by weighted edges (e.g. 0.75, 1.5, 2, 0.25, 1.25) to input feature nodes such as as, will, the, NOUN_, VERB-to, _modal]
Restrictions on the learning approach
• Multi-class
• Variable number of features– per class– per example
• Efficient learning
• Efficient evaluation
SNoW
• Network of threshold gates
• Target nodes represent class labels
• Input nodes (features) and links are allocated in a data-driven way (on the order of 10^5 input features for many target nodes)
• Each sub-network (target node) is learned autonomously as a function of the features
• An example is presented as positive to one sub-network and negative to the others (depending on the algorithm)
• Allocation of nodes (features) and links is data-driven (a link between feature fi and target tj is created only when fi was active with target tj)
Word prediction using SNoW
• Target nodes – each word in the set of candidate words is a target node
• Input nodes – an input node for feature fi is allocated only if fi was active with some target
• Decision task – we need to choose one target among all possible candidates
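The decision task can be sketched as a weighted sum per target node followed by an argmax. This is a simplified model of SNoW's evaluation; the weights below are made up for illustration:

```python
# Hedged sketch: each target node sums the weights of its links to the
# active features; the prediction is the target with the highest activation.

def predict(weights, active):
    """weights: {target: {feature_id: weight}}; active: iterable of IDs."""
    scores = {t: sum(w.get(f, 0.0) for f in active) for t, w in weights.items()}
    return max(scores, key=scores.get), scores

# Two target nodes with links to three (made-up) feature IDs:
weights = {"join": {1: 0.75, 2: 1.5}, "say": {2: 0.25, 3: 1.25}}
```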
SNoW (Command line)
snow -train -I inputfile -F networkfile [-ABcdePrsTvW]
snow -test -I inputfile -F networkfile [-bEloRvw]

Architecture:
Winnow:     -W <alpha, beta, threshold, init weight> :targets
Perceptron: -P <learning rate, threshold, init weight> :targets
NB:         -B :targets
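The Winnow architecture's parameters correspond to a multiplicative, mistake-driven update, which can be sketched as follows. This is a textbook Winnow step, written under the assumption that the four -W parameters are promotion alpha, demotion beta, threshold, and initial weight:

```python
# Hedged sketch of one Winnow step for a single target node.

def winnow_update(w, active, positive, alpha=1.1, beta=0.9, theta=1.0, init=1.0):
    for f in active:                  # links are allocated data-driven
        w.setdefault(f, init)
    activation = sum(w[f] for f in active)
    if positive and activation <= theta:        # missed positive: promote
        for f in active:
            w[f] *= alpha
    elif not positive and activation > theta:   # false positive: demote
        for f in active:
            w[f] *= beta
    return w
```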
SNoW parameters (training)
-d <none | abs:<k> | rel> : discarding method
-e <i> : eligibility threshold
-r <i> : number of cycles

Output modes:
-c <i> : interval for network snapshots
-v <off | min | med | max> : level of detail for screen output
SNoW parameters (testing)
-b <k> : smoothing for NB
-w <k> : smoothing for Winnow, Perceptron

Output modes:
-E : error file
-o <accuracy | winners | allpred | allact | allboth> : level of detail for the output
-R : results file (stdout)
File Format (Example file)
6, 10034, 10141, 10151, 10158, 10179:
177, 10034, 10035, 10047:
With weights:
6, 10034(1), 10141(1.5), 10151(0.4), 10158(2), 10179(0.1):
177, 10034(2), 10035(4), 10047(0.6):
Only active features appear in an example!
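A small Python reader for this example-file format might look like this (a sketch, assuming a missing weight defaults to 1):

```python
import re

# Hedged sketch: parse one example-file line of the form
#   "label, id, id(weight), ...:"  into (label, [(id, weight), ...]).

def parse_example(line):
    head, *feats = line.rstrip().rstrip(":").split(",")
    parsed = []
    for tok in feats:
        m = re.match(r"\s*(\d+)(?:\((\S+)\))?\s*$", tok)
        parsed.append((int(m.group(1)),
                       float(m.group(2)) if m.group(2) else 1.0))
    return int(head), parsed
```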
File Format (Network file)
NB
target 111 0 1 135 1 naivebayes 0 0.1 0.5
111 : 0 : 10020 : 4 0 -3.518980417
111 : 0 : 10021 : 1 0 -4.905274778
Winnow
target 111 1 1 135 1562 winnow 0 1.1 0.9 15 1
111 : 0 : 10020 : 4 1 1.1
111 : 0 : 10021 : 1 0 1
Perceptron
target 111 2 1 270 1 perceptron 0 0.1 4 0.2
111 : 0 : 10020 : 4 1 0.3
111 : 0 : 10021 : 1 0 0.2
File Format (Error file)
Algorithms: Perceptron: (1, 30, 0.05)  Targets: 3, 53, 73

Ex: 8  Prediction: 3  Label: 53
  3: 0.5866
  53: 0.2592*
  73: 0.1192

Ex: 15  Prediction: 3  Label: 73
  3: 0.5987
  73: 0.001229*
  53: 0.0002248