Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
Department of Systems and Computer Science, Università di Firenze, Italy
http://www.dsi.unifi.it/neural/

Massimiliano Pontil
Department of Computer Science, University College London, UK


TRANSCRIPT

Page 1

Comparing Convolution Kernels and Recursive Neural Networks for Learning Preferences on Structured Data

Sauro Menchetti, Fabrizio Costa, Paolo Frasconi
Department of Systems and Computer Science
Università di Firenze, Italy
http://www.dsi.unifi.it/neural/

Massimiliano Pontil
Department of Computer Science
University College London, UK

Page 2

ANNPR 2003, Florence 12-13 September 2003

Structured Data

In many applications…

… it is useful to represent the objects of the domain as structured data (trees, graphs, …)

… which better capture important relationships between the sub-parts that compose an object

Page 3

Natural Language: Parse Trees

[Figure: parse tree of an example sentence (words "He was vice previous president ."), with constituent nodes S, NP, VP, ADVP and POS tags PRP, VBD, RB, NN, NN, "."]

Page 4

Structural Genomics: Protein Contact Maps

Page 5

Document Processing: XY-Trees

[Figure: XY-tree recursive decomposition of a document page; each node carries a cut direction and the normalized coordinates of its region, e.g. "1 0.00 0.23 0.00 1.00".]

Page 6

Predictive Toxicology, QSAR: Chemical Compounds as Graphs

[Figure: a small alkane represented as a tree of atom groups, CH3(CH(CH3,CH2(CH2(CH3)))), and its encoding with vector node labels: [-1,-1,-1,1]([-1,1,-1,-1]([-1,-1,-1,1],[-1,-1,1,-1]([-1,-1,1,-1]([-1,-1,-1,1]))))]

Page 7

Ranking vs. Preference

Ranking

Preference

[Figure: ranking assigns a position (1, 2, 3, 4, 5) to every alternative; a preference only singles out the best one.]

Page 8

Preference on Structured Data

Page 9

Classification, Regression and Ranking

Supervised learning task f : X → Y

Classification: Y is a finite unordered set

Regression: Y is a metric space (the reals)

Ranking and Preference: Y is a finite ordered set; Y is a non-metric space

The Target Space
[Figure: the target space Y for classification (unordered), regression (metric space), and ranking/preference (finite ordered, non-metric).]

Page 10

Learning on Structured Data

Learning algorithms on discrete structures often derive from vector-based methods

Both Kernel Machines and RNNs are suitable for learning on structured domains

[Figure: two routes (1, 2) from structured data to conventional learning algorithms.]

Page 11

Kernels vs. RNNs

Kernel Machines
Very high-dimensional feature space
How to choose the kernel? Prior knowledge, fixed representation
Minimize a convex functional (SVM)

Recursive Neural Networks
Low-dimensional space
Task-driven: representation depends on the specific learning task
Learn an implicit encoding of relevant information
Problem of local minima

Page 12

A Kernel for Labeled Trees

Feature Space
Set of all tree fragments (subtrees), with the only constraint that a parent cannot be separated from its children

Φn(t) = # occurrences of tree fragment n in t

Bag of "something": a tree is represented by
Φ(t) = [Φ1(t), Φ2(t), Φ3(t), …]

K(t,s) = Φ(t)∙Φ(s) is computed efficiently by dynamic programming (Collins & Duffy, NIPS 2001)

[Figure: a labeled tree over the alphabet {A, B, C} and the bag of tree fragments it is mapped to by Φ.]
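As a sketch of how K(t,s) can be computed by the Collins & Duffy dynamic program, here is a minimal illustrative Python implementation (the `Tree` class and function names are mine, not from the slides): C(n1,n2) counts the common fragments rooted at a pair of nodes, and K sums it over all node pairs.

```python
from itertools import product

class Tree:
    """A labeled ordered tree: Tree("A", [Tree("B"), Tree("C")])."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def nodes(t):
    yield t
    for c in t.children:
        yield from nodes(c)

def production(n):
    # a node's "production": its label plus the ordered labels of its children
    return (n.label, tuple(c.label for c in n.children))

def common_fragments(n1, n2, memo):
    # C(n1, n2): number of tree fragments rooted at both n1 and n2
    if not n1.children or not n2.children or production(n1) != production(n2):
        return 0
    key = (id(n1), id(n2))
    if key not in memo:
        r = 1
        for c1, c2 in zip(n1.children, n2.children):
            r *= 1 + common_fragments(c1, c2, memo)
        memo[key] = r
    return memo[key]

def tree_kernel(t, s):
    # K(t, s) = sum over all node pairs of C(n1, n2), via dynamic programming
    memo = {}
    return sum(common_fragments(n1, n2, memo)
               for n1, n2 in product(nodes(t), nodes(s)))
```

For t = S(A(B, C)) the fragments t shares with itself are A(B,C), S(A) and S(A(B,C)), so tree_kernel(t, t) = 3.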

Page 13

Recursive Neural Networks

Composition of two adaptive functions:
φ transition function
o output function
φw : X → R^n, ow′ : R^n → O

The φ and o functions are implemented by feedforward NNs

Both the RNN parameters and the representation vectors are found by maximizing the likelihood of the training data

[Figure: labeled trees (nodes A, B, C, D) are recursively mapped into the output space.]
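The recursive encoding described above can be sketched as follows. This is only an illustrative sketch: the dimensions, the tanh transition network and the linear output unit are my assumptions, not the architecture actually used in the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM_LABEL, DIM_STATE, MAX_CHILDREN = 4, 8, 2   # illustrative sizes

# transition network phi: (node label, children states) -> state in R^n
W = rng.normal(0.0, 0.1, (DIM_STATE, DIM_LABEL + MAX_CHILDREN * DIM_STATE))
# output network o: state in R^n -> scalar utility (a single linear unit here)
v = rng.normal(0.0, 0.1, DIM_STATE)

def phi(label_vec, child_states):
    # pad missing children with the frontier (all-zero) state
    states = list(child_states)
    states += [np.zeros(DIM_STATE)] * (MAX_CHILDREN - len(states))
    return np.tanh(W @ np.concatenate([label_vec, *states]))

def encode(tree):
    """tree = (label_vec, [subtrees]); bottom-up recursive encoding."""
    label_vec, children = tree
    return phi(label_vec, [encode(c) for c in children])

def utility(tree):
    # U(x) = o(phi(x))
    return float(v @ encode(tree))
```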

Page 14

Recursive Neural Networks

Labeled Tree, Network Unfolding, Prediction Phase, Error Correction

[Figure: a labeled tree (nodes A, B, C, D, E) is unfolded into a recursive network; after the prediction phase the error is corrected through the output network.]

Page 15

Preference Models

Kernel Preference Model: binary classification of pairwise differences between instances

RNN Preference Model: probabilistic model to find the best alternative

Both models use a utility function to evaluate the importance of an element

Page 16

Utility Function Approach

Modelling the importance of an object
Utility function U : X → R, with x > z ↔ U(x) > U(z)

If U is linear: U(x) > U(z) ↔ wᵀx > wᵀz

U can also be modelled by a neural network

Ranking and preference problems: learn U, then sort by U(x)

[Figure: two objects with U(z) = 3 and U(x) = 11, so x is preferred to z.]
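A minimal sketch of the linear utility approach, reproducing the U(z) = 3, U(x) = 11 example from the slide (the weight and feature values are my own choice, picked only to hit those utilities):

```python
import numpy as np

def rank_by_utility(w, alternatives):
    """Sort alternatives (feature vectors) by the linear utility U(x) = w.T @ x."""
    utilities = [float(w @ x) for x in alternatives]
    order = sorted(range(len(alternatives)), key=lambda i: -utilities[i])
    return order, utilities

w = np.array([1.0, 2.0])
z, x, y = np.array([1.0, 1.0]), np.array([3.0, 4.0]), np.array([0.0, 2.0])
order, us = rank_by_utility(w, [z, x, y])
# U(z) = 3, U(x) = 11, U(y) = 4, so x comes first: order == [1, 2, 0]
```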

Page 17

Kernel Preference Model

x1 = best of (x1,…,xr)

Create a set of pairs between x1 and x2,…,xr

Set of constraints if U is linear:
U(x1) > U(xj) ↔ wᵀx1 > wᵀxj ↔ wᵀ(x1 − xj) > 0 for j = 2,…,r

x1 − xj can be seen as a positive example

Binary classification of differences between instances; with x → Φ(x) the process can be easily kernelized

Note: this model does not take all the alternatives into consideration together, but only two at a time
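The construction of pairwise training examples can be sketched like this (the function name is mine; in the kernelized version the differences live in feature space, Φ(x1) − Φ(xj)):

```python
import numpy as np

def preference_pairs(alternatives):
    """Turn one preference set into binary training examples.

    alternatives[0] is assumed to be the correct (best) element;
    each difference x1 - xj is a positive example and xj - x1 a
    negative one.
    """
    x1 = alternatives[0]
    examples = []
    for xj in alternatives[1:]:
        examples.append((x1 - xj, +1))
        examples.append((xj - x1, -1))
    return examples
```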

Page 18

RNNs Preference Model

Set of alternatives (x1,x2,…,xr)

U modelled by a recursive neural network architecture

Compute U(xi) = o(φ(xi)) for i = 1,…,r

Softmax function:
oi = e^U(xi) / Σj=1…r e^U(xj)

The error (yi − oi) is backpropagated through the whole network

Note: the softmax function compares all the alternatives together at once
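A sketch of the softmax preference model over the utilities of one set of alternatives (the function names are mine; subtracting the maximum is a standard numerical-stability trick, not something stated in the slides):

```python
import numpy as np

def preference_probabilities(utilities):
    """o_i = e^U(x_i) / sum_j e^U(x_j): softmax over one set of alternatives."""
    u = np.asarray(utilities, dtype=float)
    e = np.exp(u - u.max())   # shift by the max for numerical stability
    return e / e.sum()

def preference_loss(utilities, best):
    """Negative log-probability that the correct alternative (index `best`) wins."""
    return float(-np.log(preference_probabilities(utilities)[best]))
```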

Page 19

Learning Problems

First Pass Attachment

Modelling of a psycholinguistic phenomenon

Reranking Task

Reranking the parse trees output by a statistical parser

Page 20

First Pass Attachment (FPA)

The grammar introduces some ambiguities: a set of alternatives for each word, but only one is correct

The first pass attachment can be modelled as a preference problem

[Figure: incremental parse of the sentence "It has no bearing on …": the partial tree for "It has no bearing" (S, NP, VP, with tags PRP, VBZ, DT, NN) offers several numbered attachment sites (1, 2, 3, 4) for the next word "on" (IN), yielding alternatives such as PP, ADVP, ADJP, QP, SBAR, PRN; only one attachment is correct.]

Page 21

Heuristics for Prediction Enhancement

Specializing the FPA prediction for each class of word:
group the words into 10 classes (verbs, articles, …)

Learn a different classifier for each class of words

Removing nodes from the parse tree that aren’t important for choosing between different alternatives

Tree reduction

Evaluation Measure = (# correct trees ranked in first position) / (total number of sets)
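The evaluation measure can be sketched directly (assuming, as in the slides' preference sets, that index 0 of each set holds the correct tree):

```python
def first_position_accuracy(utility_sets):
    """Fraction of preference sets whose correct tree is ranked first.

    utility_sets: one list of predicted utilities per set; index 0 is
    assumed to hold the correct tree.
    """
    correct = sum(
        1 for u in utility_sets
        if max(range(len(u)), key=u.__getitem__) == 0
    )
    return correct / len(utility_sets)
```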

Page 22

Experimental Setup

Wall Street Journal (WSJ) section of the Penn Treebank
Realistic corpus of natural language

40,000 sentences, 1 million words
Average sentence length: 25 words

Standard benchmark in Computational Linguistics
Training on sections 2-21, test on section 23 and validation on section 24

Page 23

Voted Perceptron (VP)

FPA + WSJ = 100 million trees for training
Voted Perceptron instead of SVM (Freund & Schapire, Machine Learning 1999)

Online algorithm for binary classification of instances based on the perceptron algorithm (simple and efficient)
Prediction value: weighted sum of all training weight vectors
Performance comparable to maximal-margin classifiers (SVM)
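A compact sketch of the voted perceptron of Freund & Schapire (variable names are mine): training stores every intermediate weight vector together with its survival count, and prediction takes their count-weighted vote.

```python
import numpy as np

def voted_perceptron_train(X, y, epochs=10):
    """Voted perceptron: keep each intermediate weight vector together with
    its survival count c (number of examples it classified correctly)."""
    w, c, survivors = np.zeros(X.shape[1]), 1, []
    for _ in range(epochs):
        for x, label in zip(X, y):
            if label * (w @ x) <= 0:          # mistake: retire w, start a new vector
                survivors.append((w.copy(), c))
                w, c = w + label * x, 1
            else:
                c += 1
    survivors.append((w.copy(), c))
    return survivors

def voted_predict(survivors, x):
    # prediction: sign of the survival-count-weighted vote of all vectors
    vote = sum(c * np.sign(w @ x) for w, c in survivors)
    return 1 if vote >= 0 else -1
```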

Page 24

Kernel VP vs. RNNs

[Figure: accuracy vs. training-set size (100, 500, 2000, 10000, 40000 sentences) for VP and RNN on four word classes: Noun (33%), Verb (13.4%), Preposition (12.6%), Article (12.5%).]

Page 25

Kernel VP vs. RNNs

[Figure: accuracy vs. training-set size (100, 500, 2000, 10000, 40000 sentences) for VP and RNN on four more word classes: Punctuation (11.7%), Adjective (7.5%), Adverb (4.3%), Conjunction (2.3%).]

Page 26

Kernel VP vs. RNNs: Modularization

[Figure: overall learning curve, accuracy vs. training-set size (100 to 40,000 sentences), for VP and RNN.]

Page 27

Small Datasets, No Modularization

[Figure: accuracy on 5 independent splits of 100 sentences. VP average: 75.4; RNN average: 77.]

Page 28

Complexity Comparison

VP does not scale linearly with the number of training examples as the RNNs do

Computational cost

Small datasets:
5 splits of 100 sentences ~ a week @ 2GHz CPU
CPU(VP) ~ CPU(RNN)

Large datasets (all 40,000 sentences):
VP took over 2 months to complete an epoch @ 2GHz CPU
RNN learns in 1-2 epochs ~ 3 days @ 2GHz CPU

VP is smooth with respect to training iterations

Page 29

Reranking Task

Reranking problem: rerank the parse trees generated by a statistical parser

Same problem setting as FPA (preference on forests)

1 forest/sentence vs. 1 forest/word (less computational cost involved)

[Figure: a statistical parser produces a forest of candidate parse trees, which are then reranked.]

Page 30

Evaluation: Parseval Measures

Standard evaluation measure

Labeled Precision (LP)

Labeled Recall (LR)

Crossing Brackets (CBs)

Compare the parse produced by a parser with a hand-annotated parse of the sentence

Page 31

Reranking Task

[Figure: overall accuracy on all sentences for VP and RNN.]

≤ 40 Words (2245 sentences)
Model  LR    LP    CBs   0 CBs  2 CBs
VP     89.1  89.4  0.85  69.3   88.2
RNN    89.2  89.5  0.84  67.9   88.4

≤ 100 Words (2416 sentences)
Model  LR    LP    CBs   0 CBs  2 CBs
VP     88.6  88.9  0.99  66.5   86.3
RNN    88.6  88.9  0.98  64.8   86.3

Page 32

Why do RNNs outperform Kernel VP?

Hypothesis 1
Kernel function: feature space not focused on the specific learning task

Hypothesis 2
Kernel preference model worse than the RNN preference model

Page 33

Linear VP on RNN Representation

Checking hypothesis 1

Train VP on the RNN representation

The tree kernel is replaced by a linear kernel

The state-vector representation of the parse trees generated by the RNN is used as input: linear VP is trained on the RNN state vectors

Page 34

Linear VP on RNN Representation

[Figure: accuracy on 5 independent splits of 100 sentences. VP average: 75.4; RNN average: 77; VP on RNN state average: 74.7.]

Page 35

Conclusions

RNNs show better generalization properties…
… also on small datasets
… at smaller computational cost

The problem is…
… neither the kernel function
… nor the VP algorithm
Evidence: the linear VP on RNN representation experiment

The problem is…
… the preference model!
Reason: the kernel preference model does not take all the alternatives into consideration together, but only two at a time, as opposed to the RNN model

Page 36

Acknowledgements

Thanks to:

Alessio Ceroni Alessandro Vullo

Andrea Passerini Giovanni Soda