Decision List LING 572 Fei Xia 1/18/06


Page 1: Decision List LING 572 Fei Xia 1/18/06. Outline Basic concepts and properties Case study

Decision List

LING 572

Fei Xia

1/18/06

Page 2

Outline

• Basic concepts and properties

• Case study

Page 3

Definitions

• A decision list (DL) is an ordered list of conjunctive rules.
  – Rules can overlap, so the order is important.

• A decision list determines an example’s class by using the first matched rule.

Page 4

An example

A simple DL: x=(f1, f2, f3)

1. If f1=v11 && f2=v21 then c1

2. If f2=v21 && f3=v34 then c2

Classify an example (v11,v21,v34)

c1 or c2?
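The first-matched-rule semantics can be sketched in a few lines of Python; the rule representation and function names below are illustrative, not from the slides:

```python
# Minimal sketch of first-match decision-list classification.
# Feature names (f1, f2, f3) and values follow the slide's example.

def classify(rules, example, default=None):
    """Return the class of the first rule whose conditions all match."""
    for conditions, label in rules:
        if all(example.get(feat) == val for feat, val in conditions.items()):
            return label
    return default

dl = [
    ({"f1": "v11", "f2": "v21"}, "c1"),   # 1. if f1=v11 && f2=v21 then c1
    ({"f2": "v21", "f3": "v34"}, "c2"),   # 2. if f2=v21 && f3=v34 then c2
]

# (v11, v21, v34) matches both rules, but rule 1 comes first
print(classify(dl, {"f1": "v11", "f2": "v21", "f3": "v34"}))  # c1
```

Because both rules match the example, the order decides: the answer is c1.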

Page 5

Decision list

• A decision list is a list of pairs

  (t1, v1), …, (tr, vr)

  where the ti are terms and tr = true.

• A “term” in this context is a conjunction of literals:
  – f1=v11 is a literal.
  – “f1=v11 && f2=v21” is a term.

Page 6

How to build a decision list

• Decision tree → decision list conversion

• Greedy, iterative algorithm that builds DLs directly.

Page 7

Decision tree → decision list

(Figure: a one-node decision tree that tests Income; the high branch leads to Nothing, the low branch to Respond. Each root-to-leaf path becomes one rule of the list.)

Page 8

The greedy algorithm

• RuleList = [ ], E = training_data

• Repeat until E is empty or the gain is small:
  – t = Find_best_term(E)
  – Let E’ be the examples covered by t
  – Let c be the most common class in E’
  – Add (t, c) to RuleList
  – E ← E – E’
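A minimal runnable sketch of this greedy loop, assuming single-literal terms and a simple class-purity score in place of the slide's unspecified gain measure (the "gain is small" stopping criterion is omitted; all names are illustrative):

```python
from collections import Counter

def find_best_term(examples):
    """Pick the (feature, value) literal whose covered examples are most
    class-pure -- a simple stand-in for the slide's Find_best_term."""
    best, best_score = None, -1.0
    literals = sorted({(f, v) for x, _ in examples for f, v in x.items()})
    for f, v in literals:
        covered = [c for x, c in examples if x.get(f) == v]
        purity = Counter(covered).most_common(1)[0][1] / len(covered)
        if purity > best_score:
            best, best_score = (f, v), purity
    return best

def learn_decision_list(examples):
    """Greedy loop from the slide: find a term, take the majority class of
    the covered examples, append the rule, remove the covered examples."""
    rules, remaining = [], list(examples)
    while remaining:
        f, v = find_best_term(remaining)
        covered = [c for x, c in remaining if x.get(f) == v]
        majority = Counter(covered).most_common(1)[0][0]
        rules.append(((f, v), majority))
        remaining = [(x, c) for x, c in remaining if x.get(f) != v]
    return rules

def apply_dl(rules, x, default=None):
    """Classify by the first matching rule."""
    for (f, v), c in rules:
        if x.get(f) == v:
            return c
    return default
```

Each iteration removes the covered examples, so later rules are interpreted relative to the examples the earlier rules left behind.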

Page 9

Problem of greedy algorithm

• The interpretation of rules depends on preceding rules.

• Each iteration reduces the number of training examples.

• Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL.

Several papers propose alternative algorithms.

Page 10

Algorithms for building DL

• AQ algorithm (Michalski, 1969)

• CN2 algorithm (Clark and Niblett, 1989)

• Segal and Etzioni (1994)

• Goodman (2002)

• …

Page 11

Probabilistic DL

• DL: a rule is (t, v)

• Probabilistic DL: a rule is

(t, c1/p1 c2/p2 … cn/pn)
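A probabilistic DL can be sketched the same way, with each rule returning a class distribution instead of a single label (the rule representation and names are illustrative):

```python
# Sketch of a probabilistic decision list: the first matching rule
# contributes a distribution over classes rather than one class.

def predict_dist(rules, example, default_dist=None):
    """Return the class distribution of the first matching rule."""
    for conditions, dist in rules:
        if all(example.get(f) == v for f, v in conditions.items()):
            return dist
    return default_dist

pdl = [
    ({"A": 1, "B": 1}, {"c1": 0.8, "c2": 0.2}),  # if A & B then (c1,0.8)(c2,0.2)
    ({}, {"c1": 0.5, "c2": 0.5}),                # default rule: always matches
]
```

The empty-condition final rule plays the role of the (true, v) default pair in the ordinary DL definition.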

Page 12

Case study (Yarowsky, 1994)

Page 13

Case study: accent restoration

• Task: restore accents in Spanish and French text (a special case of WSD)

• Ex: ambiguous de-accented forms:
  – cesse → cesse, cessé
  – cote → côté, côte, cote, coté

• Algorithm: build a DL for each ambiguous de-accented form, e.g., one for cesse and another for cote

• Attributes: words within a window

Page 14

The algorithm

• Training:
  – Find the list of de-accented forms that are ambiguous.
  – For each ambiguous form, build a decision list.

• Testing: check each word in a sentence
  – if it is ambiguous, then restore the accented form according to the DL
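The test-time loop might look like the following sketch, assuming one learned DL per ambiguous form and a three-words-each-side context window (the `classify` callback and all names are hypothetical):

```python
# Hypothetical sketch of the test-time loop: one decision list per
# ambiguous de-accented form; unambiguous words pass through unchanged.

def restore_accents(sentence, dls, classify):
    """sentence: list of de-accented tokens; dls: form -> decision list;
    classify(dl, context) -> accented form chosen by the DL."""
    restored = []
    for i, word in enumerate(sentence):
        if word in dls:  # ambiguous de-accented form
            context = sentence[max(0, i - 3):i] + sentence[i + 1:i + 4]
            restored.append(classify(dls[word], context))
        else:
            restored.append(word)
    return restored
```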

Page 15

Algorithm for building DLs

• Select feature templates

• Build attribute-value table

• Find the feature ft that maximizes the log-likelihood ratio | log( P(c1 | ft) / P(c2 | ft) ) |

• Split the data and iterate.
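The ranking step can be sketched as follows; the add-alpha smoothing is an assumed stand-in for the paper's actual estimator, and the feature names are invented:

```python
import math

def rank_rules(counts, alpha=0.1):
    """Rank (feature -> class) rules by the absolute log-likelihood ratio
    |log P(c1|f) / P(c2|f)|.  counts maps a feature to its (class-1 count,
    class-2 count); alpha is an illustrative smoothing constant."""
    ranked = []
    for feat, (n1, n2) in counts.items():
        p1 = (n1 + alpha) / (n1 + n2 + 2 * alpha)   # smoothed P(c1 | feat)
        ll = abs(math.log(p1 / (1 - p1)))           # strength of the rule
        label = "c1" if n1 >= n2 else "c2"
        ranked.append((ll, feat, label))
    ranked.sort(reverse=True)                       # strongest evidence first
    return [(feat, label) for _, feat, label in ranked]
```

A feature seen 98 times with class 1 and twice with class 2 outranks one with a 60/40 split, which is exactly the ordering the decision list needs.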

Page 16

In this paper

• Binary classification problem: each form has only two possible accent patterns.

• Each rule tests only one feature.

• Very high baseline: 98.7%

• Notation:
  – Accent pattern: label/target/y
  – Collocation: feature

Page 17

Step 1: Identify forms that are ambiguous

Page 18

Step 2: Collecting training context

Context: the previous three and the next three words.
Strip the accents from the data. Why?

Page 19

Step 3: Measure collocational distributions

Feature types are pre-defined.

Page 20

Collocations (a.k.a. features)

Page 21

Step 4: Rank decision rules by log-likelihood


There are many alternatives.

Page 22

Step 5: Pruning DLs

• Pruning:
  – Cross-validation
  – Remove redundant rules: a “WEEKDAY” rule precedes the “domingo” rule.

Page 23

Summary of the algorithm

• For a de-accented form w, find all possible accented forms.

• Collect training contexts:
  – collect k words on each side of w
  – strip the accents from the data

• Measure collocational distributions:
  – use pre-defined attribute combinations
  – Ex: “-1w”, “+1w, +2w”

• Rank decision rules by log-likelihood.

• Optional pruning and interpolation.

Page 24

Experiments

Prior (baseline): choose the most common form.

Page 25

Global probabilities vs. Residual probabilities

• Two ways to calculate the log-likelihood log P(ci | ft):
  – Global probabilities: estimated from the full data set
  – Residual probabilities: estimated from the residual training data
    • More relevant, but less data and more expensive to compute

• Interpolation: use both

• In practice, global probabilities work better.

Page 26

Combining vs. Not combining evidence

• Each decision is based on a single piece of evidence (i.e., one feature).
  – Run-time efficiency and easy modeling
  – It works well, at least for this task, but why?

• Combining all available evidence rarely produces a different result.

• “The gross exaggeration of probability from combining all of these non-independent log-likelihoods is avoided” (cf. Naïve Bayes)

Page 27

Summary of case study

• It allows a wider context (compared to n-gram methods)

• It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods)

kitchen-sink approach of the best kind (at that time)

Page 28

Summary of decision list

• Rules are easily understood by humans (but remember the order factor).

• DLs tend to be relatively small, and fast and easy to apply in practice.

• Learning: greedy algorithm and other improved algorithms

• Extension: probabilistic DL– Ex: if A & B then (c1, 0.8) (c2, 0.2)

• DL is related to DT, CNF, DNF, and TBL (see “additional slides”).

Page 29

Additional slides

Page 30

Rivest’s paper

• It assumes that all attributes (including goal attribute) are binary.

• It shows that DLs are easily learnable from examples.

Page 31

Assignment and formula

• Input attributes: x1, …, xn

• An assignment gives each input attribute a value (1 or 0): e.g., 10001

• A boolean formula (function) maps each assignment to a value (1 or 0):
  Ex: f = x1x2 ∨ x2x3

Page 32

• Two formulae are equivalent if they give the same value for every input.

• Total number of distinct formulae over n attributes: 2^(2^n)

Classification problem: learn a formula given a partial truth table
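There are 2^n assignments over n binary attributes and each formula is determined by its truth table, so there are 2^(2^n) distinct formulae; a brute-force enumeration confirms the count for small n:

```python
from itertools import product

def count_boolean_functions(n):
    """Count distinct boolean functions of n binary inputs by enumerating
    all possible truth tables (one output bit per assignment)."""
    assignments = list(product([0, 1], repeat=n))          # 2^n assignments
    tables = set(product([0, 1], repeat=len(assignments))) # 2^(2^n) tables
    return len(tables)

print(count_boolean_functions(2))  # 16 = 2^(2^2)
```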

Page 33

CNF and DNF

• Literal: xi or its negation ¬xi
• Term: conjunction (“and”) of literals
• Clause: disjunction (“or”) of literals

• CNF (conjunctive normal form): a conjunction of clauses.
  Ex: (x1 ∨ x4 ∨ x5)(x2 ∨ x3)

• DNF (disjunctive normal form): a disjunction of terms.
  Ex: x1x4x5 ∨ x2x3

• k-CNF and k-DNF: each clause/term has at most k literals.

Page 34

A slightly different definition of DT

• A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1.

• k-DT: the depth of a DT is at most k.

• A DT defines a boolean formula: look at the paths whose leaf node is 1.


Page 35

Decision list

• A decision list is a list of pairs

  (f1, v1), …, (fr, vr)

  where the fi are terms and fr = true.

• A decision list defines a boolean function:

given an assignment x, DL(x)=vj, where j is the least index s.t. fj(x)=1.

Page 36

Relations among different representations

• CNF, DNF, DT, DL

• k-CNF, k-DNF, k-DT, k-DL
  – For any k < n, k-DL is a proper superset of the other three.
  – Compared to DT, a DL has a simple structure, but the complexity of the decisions allowed at each node is greater.

Page 37

k-CNF and k-DNF are proper subsets of k-DL

• k-DNF is a subset of k-DL:
  – Each term t of a DNF is converted into a decision rule (t, 1).
  – Ex: x1x2 ∨ ¬x1¬x2 becomes (x1x2, 1), (¬x1¬x2, 1), (true, 0)

• k-CNF is a subset of k-DL:
  – Every k-CNF is the complement of a k-DNF: k-CNF and k-DNF are duals of each other.
  – The complement of a k-DL is also a k-DL.
  – Ex: (x1x2, 0), (¬x1¬x2, 0), (true, 1)

• Neither k-CNF nor k-DNF is a subset of the other.
  – Ex: the 1-DNF x1 ∨ x2 ∨ … ∨ xn is not a 1-CNF.
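The term-by-term DNF → DL construction can be checked mechanically; the XNOR-style DNF below is an illustrative example, not necessarily the slide's:

```python
from itertools import product

def dnf_to_dl(terms):
    """Convert a DNF (list of terms, each a dict var -> required value)
    into an equivalent decision list: (t, 1) per term, then (true, 0)."""
    return [(t, 1) for t in terms] + [({}, 0)]

def eval_dl(rules, assignment):
    """Value of the first rule whose term is satisfied."""
    for cond, v in rules:
        if all(assignment[var] == val for var, val in cond.items()):
            return v
    raise ValueError("no rule matched")

def eval_dnf(terms, assignment):
    """1 iff some term of the DNF is satisfied."""
    return int(any(all(assignment[var] == val for var, val in t.items())
                   for t in terms))

# Example 2-DNF: x1 x2  OR  (not x1)(not x2)
terms = [{"x1": 1, "x2": 1}, {"x1": 0, "x2": 0}]
dl = dnf_to_dl(terms)
```

Evaluating both representations on all four assignments shows they agree, which is the content of the k-DNF ⊆ k-DL claim.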

Page 38

k-DT is a proper subset of k-DL

• k-DT is a subset of k-DNF:
  – Each leaf labeled with “1” maps to a term in k-DNF.

• k-DT is a subset of k-CNF:
  – Each leaf labeled with “0” maps to a clause in k-CNF.

k-DT is a subset of k-CNF ∩ k-DNF

Page 39

k-DT, k-CNF, k-DNF and k-DL

(Venn diagram: k-CNF, k-DNF and k-DT all lie inside k-DL, with k-DT inside the intersection of k-CNF and k-DNF.)

Page 40

Learnability

• Positive examples vs. negative examples of the concept being learned.
  – In some domains, positive examples are easier to collect.

• A sample is a set of examples.

• A boolean function is consistent with a sample if it does not contradict any example in the sample.

Page 41

Two properties of a learning algorithm

• A learning algorithm is economical if it requires few examples to identify the correct concept.

• A learning algorithm is efficient if it requires little computational effort to identify the correct concept.

We prefer algorithms that are both economical and efficient.

Page 42

Hypothesis space

• Hypothesis space F: a set of concepts that are being considered.

• Ideally, the concept being learned is in the hypothesis space of the learning algorithm.

• The goal of a learning algorithm is to select the right concept from F given the training data.

Page 43

• Discrepancy between two functions f and g:

  Δ(f, g) = Pn( { x | f(x) ≠ g(x) } )

• Ideally, we want Δ(f, g) to be as small as possible.

• To deal with ‘bad luck’ in drawing examples according to Pn, we define a confidence parameter δ and require

  P( Δ(f, g) ≤ ε ) ≥ 1 − δ

Page 44

“Polynomially learnable”

• A set of Boolean functions F is polynomially learnable if there exists an algorithm A and a polynomial function m(n, log |Fn|, 1/ε, 1/δ) s.t. for all n, all f in Fn, and all distributions Pn on Xn:

  when given a sample of f of size m drawn according to Pn, A will with probability at least 1 − δ output a g in Fn s.t. Δ(f, g) ≤ ε.

  Furthermore, A’s running time is polynomially bounded in n and m.

• k-DL is polynomially learnable.

Page 45

The algorithm in (Rivest, 1987)

1. If the example set S is empty, halt.

2. Examine each term of length at most k until a term t is found s.t. all examples in S that make t true are of the same type v.

3. Add (t, v) to decision list and remove those examples from S.

4. Repeat 1-3.
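A direct, illustrative implementation of these four steps for binary attributes (the term-enumeration order and all names are assumptions, not Rivest's exact presentation):

```python
from itertools import combinations, product

def _terms(n, k):
    """Yield every term of length at most k over n binary attributes,
    represented as a dict attr_index -> required bit."""
    for size in range(k + 1):
        for attrs in combinations(range(n), size):
            for bits in product([0, 1], repeat=size):
                yield dict(zip(attrs, bits))

def learn_k_dl(samples, n, k):
    """Rivest-style learner: repeatedly find a term all of whose covered
    examples share one label, emit the rule, remove those examples.
    samples: list of (assignment tuple, label)."""
    rules, remaining = [], list(samples)
    while remaining:
        for term in _terms(n, k):
            covered = [(x, y) for x, y in remaining
                       if all(x[i] == b for i, b in term.items())]
            labels = {y for _, y in covered}
            if covered and len(labels) == 1:
                rules.append((term, labels.pop()))
                remaining = [s for s in remaining if s not in covered]
                break
        else:  # step 2 failed: the sample is consistent with no k-DL
            raise ValueError("sample is not consistent with any %d-DL" % k)
    return rules

def apply_kdl(rules, x):
    """Classify by the first rule whose term x satisfies."""
    for term, y in rules:
        if all(x[i] == b for i, b in term.items()):
            return y
```

On the AND function over two attributes, for example, the learner emits pure rules until the sample is exhausted and the resulting list reproduces every training label.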

Page 46

Summary of (Rivest, 1987)

• Formal definition of DL

• Show the relation between k-DT, k-CNF, k-DNF and k-DL.

• Prove that k-DL is polynomially learnable.

• Give a simple greedy algorithm to build k-DL.

Page 47

In practice

• Input attributes and the goal are not necessarily binary.
  – Ex: the previous word

• A term → a feature (it is not necessarily a conjunction of literals).
  – Ex: the word appears in a k-word window

• Only some feature types are considered, instead of all possible features.
  – Ex: previous word and next word

• Greedy algorithm: quality measure
  – Ex: a feature with minimum entropy