TRANSCRIPT
From POS tagging to question answering: State-of-the-art NLP results from
simple deep learning models
Christopher Manning, Stanford University
@chrmanning ❀ @stanfordnlp
DL4NLP summer school 2017
I am a student <EOS> Je suis étudiant
Je suis étudiant <EOS>
1. RNN encoder-decoder networks
Encoder Decoder
h_t = tanh(W[x_t] + U h_{t−1} + b)
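As a concrete reference, here is a minimal NumPy sketch of that recurrent update, reading W[x_t] as picking out the embedding row of W for word x_t (equivalently, W applied to a one-hot input); the sizes and toy word ids are illustrative assumptions:

import numpy as np

d, vocab = 5, 10000                     # hidden size and vocabulary size (illustrative)
W = np.random.randn(vocab, d) * 0.1     # W[x_t]: row lookup = W times a one-hot input
U = np.random.randn(d, d) * 0.1         # recurrent weights
b = np.zeros(d)

def rnn_step(x_t, h_prev):
    # h_t = tanh(W[x_t] + U h_{t-1} + b)
    return np.tanh(W[x_t] + U @ h_prev + b)

h = np.zeros(d)                         # h_0
for x_t in [12, 7, 431]:                # a toy sequence of word ids
    h = rnn_step(x_t, h)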
Gated Recurrent Unit [Cho et al., EMNLP 2014; Chung, Gulcehre, Cho, Bengio, DLUFL 2014]
Long Short-Term Memory [Hochreiter & Schmidhuber, NC 1997; Gers, Thesis 2001]
Gated Recurrent Units ≈ “LSTMs”

GRU:
h_t = u_t ⊙ h̃_t + (1 − u_t) ⊙ h_{t−1}
h̃_t = tanh(W[x_t] + U(r_t ⊙ h_{t−1}) + b)
u_t = σ(W_u[x_t] + U_u h_{t−1} + b_u)
r_t = σ(W_r[x_t] + U_r h_{t−1} + b_r)

LSTM:
h_t = o_t ⊙ tanh(c_t)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
c̃_t = tanh(W_c[x_t] + U_c h_{t−1} + b_c)
o_t = σ(W_o[x_t] + U_o h_{t−1} + b_o)
i_t = σ(W_i[x_t] + U_i h_{t−1} + b_i)
f_t = σ(W_f[x_t] + U_f h_{t−1} + b_f)

Equations of the two most widely used gated recurrent units
Basic update to the memory cell (GRU h̃ = LSTM c̃) is via a standard neural net layer:
h̃_t = tanh(W[x_t] + U(r_t ⊙ h_{t−1}) + b)
Summing previous & new candidate hidden states gives direct gradient flow & more effective memory
Bernoulli variable “gates” control how much history is kept & input is attended to
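A minimal NumPy sketch transcribing the two cells' equations above; x_emb stands for the embedded input [x_t], and the sizes and initialization are illustrative assumptions:

import numpy as np

rng = np.random.default_rng(0)
d = 5                                                  # hidden size (illustrative)
def mat(): return rng.normal(scale=0.1, size=(d, d))
def sigmoid(z): return 1.0 / (1.0 + np.exp(-z))

# GRU parameters (W*, U* are d x d, b* are length d), matching the equations above
Wu, Uu, bu = mat(), mat(), np.zeros(d)
Wr, Ur, br = mat(), mat(), np.zeros(d)
W,  U,  b  = mat(), mat(), np.zeros(d)

def gru_step(x_emb, h_prev):
    u = sigmoid(Wu @ x_emb + Uu @ h_prev + bu)          # update gate u_t
    r = sigmoid(Wr @ x_emb + Ur @ h_prev + br)          # reset gate r_t
    h_cand = np.tanh(W @ x_emb + U @ (r * h_prev) + b)  # candidate state h~_t
    return u * h_cand + (1 - u) * h_prev                # interpolate new candidate & old state

# LSTM parameters for the input, forget, output gates and the cell candidate
Wi, Ui, bi = mat(), mat(), np.zeros(d)
Wf, Uf, bf = mat(), mat(), np.zeros(d)
Wo, Uo, bo = mat(), mat(), np.zeros(d)
Wc, Uc, bc = mat(), mat(), np.zeros(d)

def lstm_step(x_emb, h_prev, c_prev):
    i = sigmoid(Wi @ x_emb + Ui @ h_prev + bi)          # input gate i_t
    f = sigmoid(Wf @ x_emb + Uf @ h_prev + bf)          # forget gate f_t
    o = sigmoid(Wo @ x_emb + Uo @ h_prev + bo)          # output gate o_t
    c_cand = np.tanh(Wc @ x_emb + Uc @ h_prev + bc)     # cell candidate c~_t
    c = f * c_prev + i * c_cand                         # additive memory-cell update
    return o * np.tanh(c), c                            # h_t, c_t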
Die Proteste waren am Wochenende eskaliert <EOS> The protests escalated over the weekend
The protests escalated over the weekend <EOS>
An LSTM encoder-decoder MT net [Sutskever et al. 2014]
Encoder: Builds up sentence meaning
Source sentence
Translation generated
Feeding in last word
Decoder
Bottleneck
I am a student <EOS> Je suis étudiant
Je suis étudiant <EOS>
A BiLSTM encoder and LSTM-with-attention decoder [Luong et al. 2015]
Encoder Decoder
Bilinear attention
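For reference, a minimal NumPy sketch of the bilinear (“general”) attention of Luong et al. 2015, score(h_t, h_s) = h_t^T W_a h_s, with a softmax over source positions; shapes are illustrative:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def bilinear_attention(h_dec, H_enc, W_a):
    # h_dec: (d,) current decoder state; H_enc: (n_src, d) encoder states; W_a: (d, d)
    scores = H_enc @ (W_a @ h_dec)      # one bilinear score per source position
    alpha = softmax(scores)             # attention distribution over the source
    context = alpha @ H_enc             # attention-weighted sum of encoder states
    return context, alpha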
Progress in Machine Translation [Edinburgh En-De WMT newstest2013 Cased BLEU; NMT 2015 from U. Montréal]
[Chart: BLEU (0 to 25) by year, 2013 to 2016, for phrase-based SMT, syntax-based SMT, and neural MT]
From [Sennrich 2016, http://www.meta-net.eu/events/meta-forum-2016/slides/09_sennrich.pdf]
IWSLT 2015, TED talk MT, English-German [Luong and Manning 2015]
[Chart: HTER on the HE set (lower is better): Stanford 16.16, Edinburgh 21.84, Karlsruhe 22.67, Heidelberg 23.42, PJAIT 28.18; a 26% relative error reduction for the Stanford system over the next best]
Four big wins of Neural MT
1. End-to-end training
All parameters are simultaneously optimized to minimize a loss function on the network’s output
2. Distributed representations share strength
Better exploitation of word and phrase similarities
3. Better exploitation of context
NMT can use a much bigger context – both source and partial target text – to translate more accurately
4. More fluent text generation
Deep learning text generation is much higher quality
Enormous success!
From first modern research attempts in 2014, neural MT has seen rapid and significant success
2017: Almost everyone is now using Neural MT in production, at least for many language pairs
The BiLSTM Hegemony
To a first approximation, the de facto consensus in NLP in 2017 is that no matter what the task, you throw a BiLSTM at it, with attention if you need information flow, and get great performance!
Simplicity
We still have a poor understanding of how to make good neural networks
So, it is scientifically better to see how far we can get with very simple models
Talk outline
1. The BiLSTM (with attention) hegemony
2. Question answering: The Stanford Attentive Reader
3. Effective human-machine dialog: copying & memory
4. State-of-the-art dependency parsing
5. More complex futures
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
[Chen, Bolton, & Manning 2016]
• Demonstrated a simple, highly successful architecture for question answering and reading comprehension
• Known as the Stanford Attentive Reader
Stanford Attentive Reader
[Figure: the question Q ("characters in " @placeholder " movies have gradually become more diverse") and the passage P are each encoded with bidirectional LSTMs; attention from the question representation over the passage states picks out the answer entity (entity6)]
Stanford Attentive Reader
A very simple model:
• Learned word embeddings feed into
• Bi-directional shallow 128d GRUs for passage and question
• Question representation used for soft attention over passage with the same simple bilinear attention function
• A final softmax layer predicts the answer entity
• SGD, dropout (0.2), batch size = 32, hidden size = 128, …
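Putting those bullets together, here is a hedged NumPy sketch of the reader's core computation; the passage states, question vector, and candidate-entity output layer are assumed to be given:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def attentive_reader(P_states, q, W_bilin, W_out):
    # P_states: (n, d) passage BiRNN states; q: (d,) question vector
    # W_bilin: (d, d) bilinear attention weights; W_out: (n_candidates, d) output layer
    alpha = softmax(P_states @ (W_bilin @ q))   # soft attention over passage positions
    o = alpha @ P_states                        # attention-weighted passage summary
    return softmax(W_out @ o)                   # distribution over candidate answer entities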
Lots of complex models; lots of results
Nothing does much better than LSTM+Attn
Model                                  CNN Dev  CNN Test  Daily Mail Dev  Daily Mail Test
(Hermann et al., 2015) NIPS’15           61.8     63.8        69.0            68.0
(Hill et al., 2016) ICLR’16              63.4     66.8        N/A             N/A
(Kobayashi et al., 2016) NAACL’16        71.3     72.9        N/A             N/A
(Kadlec et al., 2016) ACL’16             68.6     69.5        75.0            73.9
(Dhingra et al., 2016) 2016/6/5          73.0     73.8        76.7            75.7
(Sordoni et al., 2016) 2016/6/7          72.6     73.3        N/A             N/A
(Trischler et al., 2016) 2016/6/7        73.4     74.0        N/A             N/A
(Weissenborn, 2016) 2016/7/12            N/A      73.6        N/A             77.2
(Cui et al., 2016) 2016/7/15             73.1     74.4        N/A             N/A
Ours: neural net, ACL’16                 73.8     73.6        77.6            76.6
Ours: neural net (ensemble), ACL’16      77.2     77.6        80.2            79.2
WebQuestions (Berant et al., 2013). Q: What part of the atom did Chadwick discover? A: neutron
TREC. Q: What U.S. state’s motto is “Live free or Die”? A: New Hampshire
WikiMovies (Miller et al., 2016). Q: Who wrote the film Gigli? A: Martin Brest
SQuAD. Q: How many of Warsaw's inhabitants spoke Polish in 1933? A: 833,500
Open-domain Question Answering
Document Reader
Document Retriever
833,500
Q: How many of Warsaw's inhabitants spoke Polish in 1933?
Document Retriever
For 70–86% of questions, the answer segment appears in the top 5 retrieved articles
Traditional tf.idf inverted index + efficient bigram hashing (sketched below)
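A rough sketch of that retrieval idea: hash unigrams and bigrams into a fixed-size feature space and rank articles by a tf.idf-weighted dot product. The bucket count, weighting scheme, and function names here are illustrative assumptions, not the actual retriever code:

import math
from collections import Counter

NUM_BUCKETS = 2 ** 20   # hashed feature space (illustrative size)

def features(tokens):
    # unigrams plus bigrams, each hashed into a fixed number of buckets
    grams = tokens + [a + " " + b for a, b in zip(tokens, tokens[1:])]
    return Counter(hash(g) % NUM_BUCKETS for g in grams)

def tfidf(counts, doc_freq, n_docs):
    # simple tf * idf weighting over the hashed features
    return {h: tf * math.log((n_docs + 1) / (doc_freq.get(h, 0) + 1))
            for h, tf in counts.items()}

def score(question_vec, article_vec):
    # rank articles by dot product with the question vector
    return sum(w * article_vec.get(h, 0.0) for h, w in question_vec.items())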
Stanford Attentive Reader++
Q: Who did Genghis Khan unite before he began conquering the rest of Eurasia?
[Figure: question Q and passage P encoded with bidirectional LSTMs; one attention over the passage predicts the answer-span start token and a second attention predicts the end token]
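A sketch of that two-attention span prediction; the names and the maximum span length are illustrative assumptions:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def predict_span(P_states, q, W_start, W_end, max_len=15):
    # Two bilinear attentions over the passage states predict start and end tokens
    p_start = softmax(P_states @ (W_start @ q))
    p_end = softmax(P_states @ (W_end @ q))
    # choose the span (i, j) with j - i <= max_len maximizing p_start[i] * p_end[j]
    n = len(P_states)
    best, span = -1.0, (0, 0)
    for i in range(n):
        for j in range(i, min(i + max_len + 1, n)):
            if p_start[i] * p_end[j] > best:
                best, span = p_start[i] * p_end[j], (i, j)
    return span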
SQuAD Results (single model)
F1
Logistic regression 51.0
Fine-Grained Gating (Carnegie Mellon U) 73.3
Match-LSTM (Singapore Management U) 73.7
DCN (Salesforce) 75.9
BiDAF (UW & Allen Institute) 77.3
Multi-Perspective Matching (IBM) 78.7
ReasoNet (MSR Redmond) 79.4
DrQA (Chen et al. 2017) 79.4
r-net (MSR Asia) [Wang et al., ACL 2017] 79.7
r-net (MSR Asia) [Jul 2017] 83.5
Human performance 91.2
3. Effective human-machine dialog [Eric and Manning, EACL 2017; Eric and Manning, SIGDIAL 2017]
Voice interfaces are starting to appear everywhere
But they are often frustrating to use! How can we get them to understand us the way we talk?
Hi, what can I help you with?
I’m feeling like Italian
There are 3 Italian restaurants nearby
Hmm … actually any froyo around?
No froyo, sadly, but there is a gelato cafe
Cool! Where is it?
New: Copy-Augmented Seq2Seq
• Add copy mechanism as an option to the generator
• A second attention distribution can copy a context word (a sketch of this copy mixture follows below)
• Known to improve machine translation, text summarization
• Give the system extra information about the entity type of items (restaurant, address, etc.)
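To make the copy option concrete, here is a generic copy/pointer mixture over the output vocabulary and the context words; this illustrates the general mechanism rather than the paper's exact model, and p_gen and the score names are assumptions:

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def copy_augmented_distribution(vocab_logits, copy_scores, src_token_ids, p_gen):
    # vocab_logits: (V,) generator scores; copy_scores: (src_len,) attention scores
    # over context tokens; src_token_ids: vocabulary id of each context token
    p_vocab = p_gen * softmax(vocab_logits)          # generate from the vocabulary
    p_copy = (1.0 - p_gen) * softmax(copy_scores)    # copy a context word
    out = p_vocab.copy()
    for pos, tok_id in enumerate(src_token_ids):     # add copy mass to the copied word's id
        out[tok_id] += p_copy[pos]
    return out                                       # final next-word distribution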
Experiment
• Data: Dialogue State Tracking Challenge 2 (DSTC2); restaurant reservation domain; 1600 dialogues
• Train: Seq2seq model predicts the system turn from the Wizard-of-Oz dialogues
Model                                Per-response Acc.
Memory Net (Bordes & Weston 2016)          41.1
Our model (Copy + EntType)                 48.0
QRN (Salesforce)                           50.7
System output: seq2seq plus copying
User: cheap restaurant in east part of town
System: api call r cuisine east cheap
User: <silence>
System: the missing sock is a nice place in the east of town and the prices are cheap
User: address
System: sure, the missing sock is on [Finders Place, Newmarket Rd]
User: phone number
System: the phone number of the missing sock is [1223 812660]
User: thank you good bye
System: you are welcome
But the only knowledge is what the LSTM learned during training!
What time is my doctor appt.?
Your appt. is at 3pm on Thursday
Natural Language Understanding
Knowledge Base
Natural language generation
Background
Representing the KB

Event    Time   Date       Party   Agenda
dinner   8pm    the 13th   Ana     –

The same record as (Subject, Relation, Object) triples:

Subject   Relation   Object
dinner    time       8pm
dinner    date       the 13th
dinner    party      Ana
dinner    agenda     –

Key = (subject, relation); Value = object
Cf. [Miller et al. 2016, Key-Value Memory Networks]
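A minimal sketch of the key-value attention idea over such triples (generic, in the spirit of Miller et al. 2016, not the exact model of the paper; the key and value encodings are assumed to be given):

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def kv_attention(query, key_vectors, value_vectors):
    # query: (d,) dialogue/decoder state
    # key_vectors: (n_triples, d) encodings of the (subject, relation) keys
    # value_vectors: (n_triples, d) embeddings of the object values
    alpha = softmax(key_vectors @ query)   # attend over the KB keys
    return alpha @ value_vectors           # soft retrieval of the corresponding values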
Automatic evaluation scores
Data
• 3000 dialogues collected from 3 domains: calendar, weather, navigation
Baselines
• Rule-based system
  • Intent and state tracking to KB
  • Template-based NLG
• Copy-augmented Seq2Seq
  • Encoder-decoder LSTM
  • Attention model
  • Attention-based hard-copy

Model          BLEU   Entity F1
Rule-based      6.6     43.8
Attn Seq2Seq   10.2     30.0
KV Retr. Net   13.2     48.2
Human          13.5     60.7
• 120 distinct scenarios, 3 dialogue domains
• AMT workers paired real-time with either our model or another worker
• Assess for: Fluency, Cooperative, Human-like (1–5)
Human evaluation
Model          Fluency   Cooperative   Human-like
Rule-based       3.2        3.4           2.9
Copy net         2.3        2.4           2.0
KV Retr. Net     3.4        3.4           3.1
Human            4.0        4.0           4.0
Sample Dialogue
POI              Category         Traffic Info
Civic Center     Parking garage   Car collision
Valero           Gas station      Road block
Webster Garage   Parking garage   Car collision
Trader Joes      Grocery Store    Heavy
Mandarin Roots   Chinese rest.    Moderate
Driver: I am looking for a gas station near me with the shortest route from me.
Car: The closest gas station is Valero, but there is a road block nearby.
Driver: What is the next nearest gas station?
Car: Valero is the only nearby gas station that I can find
Driver: Thanks
Car: Here to serve
Dialogue using sequence-to-sequence models
• Still not commercial quality
• Currently too small domain and fragile
• Commercial systems are still using simple machine learning for understanding user queries
  • Followed by “templated generation” (each generated sentence is hand-written)
  • Tiny exception: Google Auto-reply
• Nonetheless, a very exciting research direction!!
• Aim is to give not only good task completion but human-like naturalness and friendliness
Deep Biaffine Attention for Neural Dependency Parsing
Dozat and Manning (ICLR 2017, CoNLL 2017)
A simple, carefully tuned graph-based dependency parser that gives the state of the art in parsing performance.
Dependency parsing
Sentence structure is shown by indicating for each word what it is a modifier or argument of, via typed dependency edges
A dependency tree can be used to guide natural language understanding – it’s almost a semantic network
Methods of Dependency Parsing
1. Dynamic programming: Eisner (1996) gives a clever O(n³) algorithm, by producing parse items with heads at the ends rather than in the middle
2. Graph algorithms: Each edge is scored – McDonald et al.’s (2005) MSTParser scored dependencies independently using an ML classifier. You create a maximum spanning tree for a sentence
3. Transition-based dependency parsing: Maintain stack and buffer; make greedy choices of shift/reduce transition actions (e.g., MaltParser, Nivre et al. 2004)
Graph-based Parsing
Arc probability (rows = dependent, columns = head):

Dependent   root   Casey   hugged   Kim
Casey       0.10    −       0.85    0.05
hugged      0.80   0.15      −      0.05
Kim         0.05   0.05     0.90     −

Score putting an arc between each word pair (+ root as head)
Highest scoring tree under root is found with a maximum spanning tree (MST) algorithm
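A toy illustration of decoding from the arc-score table above; greedy head selection is shown for brevity, while guaranteeing a well-formed tree requires a maximum spanning tree algorithm such as Chu-Liu-Edmonds over the same scores:

import numpy as np

# rows = dependents (Casey, hugged, Kim); columns = candidate heads (root, Casey, hugged, Kim)
NEG = float("-inf")            # forbid a word from being its own head
scores = np.array([
    [0.10,  NEG, 0.85, 0.05],  # heads for "Casey"
    [0.80, 0.15,  NEG, 0.05],  # heads for "hugged"
    [0.05, 0.05, 0.90,  NEG],  # heads for "Kim"
])

# Greedy decoding: each dependent independently picks its best-scoring head.
heads = scores.argmax(axis=1)  # array([2, 0, 2]): Casey <- hugged, hugged <- root, Kim <- hugged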
Deep Biaffine Attention for Neural Dependency Parsing
Dozat and Manning (ICLR 2017)
• We use an LSTM recurrent neural network with word/tag embedding vectors as the parsing input
• Similar to Kiperwasser & Goldberg (2016), however:
  • Their feedforward scorer/classifier is somewhat unintuitive and needlessly complex
  • Their representations don’t distinguish heads/dependents
  • Their model is relatively small and unregularized
  • Maybe they just didn’t tune their model very well?
Our Approach
A similar, more carefully tuned, parser with a simpler and more intuitive scorer/classifier
• Idea 1: biaffine classifiers
  • Arguably simpler than MLPs
  • More natural in this context
• Idea 2: MLP layers produce head and dependent representations between LSTM and biaffine classifier
  • Applying a nonlinearity helps the network remove irrelevant information, helping to avoid overfitting
  • Reducing dimensionality gives speed without reducing representational capacity of the recurrent network
Unlabeled arc parser
• Bidirectional LSTM over word/tag embeddings
• Two separate FC ReLU layers
  • One representing each token as a dependent trying to find (attend to) its head
  • One representing each token as a head trying to find (be attended to by) its dependents
Dependencies: Biaffine Self-Attention
• H is the stacked matrix of LSTM vectors
• W is the weight matrix
• S represents the n × n matrix of arc scores
In matrix form: S = (H^(arc-head) ⊕ 1) (W ⊕ b) (H^(arc-dep))^T, where ⊕ 1 appends a column of ones to H so that the bias b folds into the weight matrix
Our Approach: Biaffine models
• Typical fixed-class classification: given input vector x, do an affine transformation to get a vector of scores: s = Wx + b
  • b provides the prior probability of each class, P(class = c)
  • Wx provides the likelihood of each class given the input, P(class = c | x)
• We need to extend this to variable-class classification
  • We don’t know how many words will be in a sentence
  • We don’t know what those words will be
Our Approach: Biaffine models
• Given words i and j with LSTM vectors r_i and r_j, what function captures the prior probability P(head_i = j | r_j) and the likelihood P(head_i = j | r_i, r_j) in the same way?
• Answer: a biaffine transformation
  s_ij = r_j^T W r_i + r_j^T w
  • r_j^T w provides the prior probability P(head_i = j | r_j)
    • Function words never take dependents; verbs frequently do
  • r_j^T W r_i provides the likelihood P(head_i = j | r_i, r_j)
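A minimal NumPy sketch of that biaffine arc scorer (illustrative shapes; the weight matrix is written on the dependent side here, which is the same model up to a transpose of W):

import numpy as np

def biaffine_arc_scores(H_dep, H_head, W, w):
    # H_dep: (n, d) dependent representations; H_head: (n, d) head representations
    # W: (d, d) bilinear weights; w: (d,) prior weights
    # Returns S: (n, n) with S[i, j] = score for word j being the head of word i,
    # i.e. (h_i^dep)^T W (h_j^head) + (h_j^head)^T w
    bilinear = H_dep @ W @ H_head.T   # likelihood term: depends on both words
    prior = H_head @ w                # prior term: depends only on the candidate head
    return bilinear + prior[None, :]  # broadcast the prior across all dependents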
Labeler: LSTM
• Take the topmost BiLSTM vectors used for the unlabeled parser
• Two more separate FC ReLU layers:
  • One representing each token as a dependent trying to determine its label
  • One representing each token as a head trying to determine its dependents’ labels
Labeler: Classifier
• Biaffine layer scores possible relations for each best-head/dependent pair
• Train with softmax cross-entropy, added to the loss of the unlabeled parser
Initial evaluation
• Penn TreeBank converted to Stanford Dependencies, version 3.3.0
• Predicted POS tags, ignoring punctuation
• Evaluation metrics
  • Unlabeled attachment score (UAS)
  • Labeled attachment score (LAS)
Parsing performance on PTB WSJ SD 3.3

Type               Feat   Model                               LAS
Transition-based   symb   MaltParser (Nivre et al. 2006)      87.2
                   neur   Chen & Manning 2014                 89.7
                   neur   Weiss et al. 2015                   92.05
                   neur   Andor et al. 2016                   92.79
Graph-based        symb   MSTParser (McDonald et al. 2005)    88.1
                   symb   TurboParser (Martins et al. 2013)   89.6
                   neur   Kiperwasser and Goldberg 2016       91.9
                   neur   Dozat and Manning 2017              94.08
Universal Dependencies
Part-of-speech tags
Morphological features
Dependency relations
[Figure: parallel dependency trees for “En katt jagar råttor och möss” (Swedish), “En kat jager rotter og mus” (Danish), and “A cat chases rats and mice” (English), annotated with relations such as det, nsubj, dobj, cc, and conj]
[Figure: full UD annotation of the French sentence “Toutefois, les filles adorent les desserts.” (“However, the girls love desserts.”): UPOS tags (ADV PUNCT DET NOUN VERB DET NOUN PUNCT), morphological features such as Definite, Gender, Number, Person, Tense, and dependency relations advmod, punct, det, nsubj, root, dobj]
http://universaldependencies.org
UD v2.0 parsing was a shared task at CoNLL 2017
Universal Guidelines Group
Release and Documentation Task Force
Chief Cat Herder
Part-of-Speech Tagger: LSTM
• A possible weakness of our system was POS tagging
• Can parsing be improved with a better tagger?
• Uses a BiLSTM (distinct from the parser BiLSTM!)
• Two separate FC ReLU layers:
  • One for (universal) UPOS tags
  • One for (language particular) XPOS tags
Tagger: Classifiers
• Affine layers to score possible tags for each word
• Train jointly by adding softmax cross-entropy
• When using in the main parser, add UPOS and XPOS embeddings together (eltwise)
Tagger: Experiment
• Use of our tagger outperformed both the baseline tagger (Straka et al., 2015) (p < .05) and using no tagger (p < .05)
• Parser performance correlated with tagger performance (ours vs. baseline) (p < .05)
Character-level model: Motivation
• Many languages have complex morphology
• Grammatical functions indicated more by word form than relative location
• Rare words with highly predictive suffixes won’t be attested in the frequent-word embedding matrix
• Extreme sparsity may yield low-quality pretrained embeddings
• Idea: Compose word embeddings orthographically with a character-based embedding model
• Question: Does this improve accuracy on inflectionally rich languages?
Character model: LSTM
• Unidirectional LSTM over character embeddings
• Concatenate two sources of information:
  • Linear attention over top hidden states (Cao and Rei, 2016)
  • Final cell state (Ballesteros et al., 2015)
Character model: Embedding
• Linearly transform to the desired size
• When using in the parser/tagger, add with pretrained and frequent-token embeddings (eltwise)
Character model: Experiment
• Systems trained with a character model outperformed models trained without (p < .05)
• Improvement correlated with morphological complexity (p < .05)
Details: Dropout
• Lots of dropout: keep_prob is .67 throughout the whole network
• Embedding dropout (sketched below)
  • Drop token/tag embeddings independently
  • When one is dropped, the other is scaled up to compensate
  • When both are dropped, replace with zeros
  • Seems to work better than random vector/UNK replacement
• Same-mask recurrent dropout (Gal and Ghahramani, 2016)
  • Drop input connections and recurrent connections
  • Drop the same connections at each recurrent timestep
  • Seems to work better than traditional dropout/zoneout (Krueger et al., 2017)
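A per-token sketch of the embedding-dropout scheme just described; the factor of two when only one input survives is an assumption consistent with “scaled up to compensate”:

import numpy as np

def embedding_dropout(word_emb, tag_emb, keep_prob=0.67, rng=np.random):
    # Drop the word and tag embeddings of a token independently
    keep_word = rng.random() < keep_prob
    keep_tag = rng.random() < keep_prob
    n_kept = int(keep_word) + int(keep_tag)
    if n_kept == 0:                              # both dropped: replace with zeros
        return np.zeros_like(word_emb), np.zeros_like(tag_emb)
    scale = 2.0 / n_kept                         # the survivor is scaled up to compensate
    return (word_emb * scale if keep_word else np.zeros_like(word_emb),
            tag_emb * scale if keep_tag else np.zeros_like(tag_emb))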
Optimizer: Adam
• Adam optimizer (Kingma and Ba 2015) with β1 = β2 = .9
• For embedding matrices, only decay the m and v accumulators for tokens used in the minibatch
  • I.e., for words that are seen in the minibatch, we apply Adam’s accumulator update rule:
    m_t = β1 m_{t−1} + (1 − β1) g_t
    v_t = β2 v_{t−1} + (1 − β2) g_t²
  • But for words that aren’t, we don’t update the accumulators, preventing uncommon words from decaying down to zero (sketched below)
• Note: this is not the behavior of most DL toolkits!
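A sketch of that “lazy” accumulator behaviour for an embedding matrix; the learning rate and names are illustrative, and Adam's bias correction is omitted for brevity:

import numpy as np

def lazy_adam_embedding_update(E, m, v, grads, rows, lr=2e-3, b1=0.9, b2=0.9, eps=1e-8):
    # E, m, v: (vocab, d) embedding matrix and its first/second-moment accumulators
    # grads: (len(rows), d) gradients for the embedding rows listed in `rows`
    # Rows not in the minibatch keep their accumulators untouched, so unseen
    # words are not decayed toward zero.
    for g, r in zip(grads, rows):
        m[r] = b1 * m[r] + (1 - b1) * g          # m_t = beta1 m_{t-1} + (1 - beta1) g_t
        v[r] = b2 * v[r] + (1 - b2) * g * g      # v_t = beta2 v_{t-1} + (1 - beta2) g_t^2
        E[r] -= lr * m[r] / (np.sqrt(v[r]) + eps)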
Hyperparameters
• Initialization
  • Preference for initializing to zero wherever possible
    • Bias terms
    • Final linear layers (character model, output layers)
    • Word/POS embeddings (other than pretrained)
  • Otherwise, we use orthonormal initialization (Saxe et al., 2014)
• Recurrent cells
  • LSTMs vastly outperformed GRUs and slightly outperformed coupled input-forget LSTMs (Greff et al., 2016)
  • Adding a forget bias hurts performance
Nonprojectivity
• Our system outperforms UDPipe v1.1 (transition-based) by a larger margin on treebanks with many crossing arcs (p < .05)
• Stronger correlation for treebanks with more crossing arcs in the test set than in the training set (p < .05)
Is there anything more?
• I believe the answer is yes!
• Structured memories
• Structured sentences
• But we’re still working to prove out those ideas
Envoi
• Deep learning – distributed representations, end-to-end training, and richer modeling of state – has brought great gains to NLP
• At the moment, neural sequence models with attention are often the sweet spot of good performance, simplicity and speed
• However, I do believe that we need more structure and modularity for language, memory, knowledge, and planning; it’ll just take some time