Using Perception to Supervise Language Learning and Language to Supervise Perception

Ray Mooney
Department of Computer Sciences
University of Texas at Austin

Joint work with David Chen, Sonal Gupta, Joohyun Kim, Rohit Kate, Kristen Grauman
Learning for Language and Vision
• Natural Language Processing (NLP) and Computer Vision (CV) are both very challenging problems.
• Machine Learning (ML) is now extensively used to automate the construction of both effective NLP and CV systems.
• Generally uses supervised ML, which requires difficult and expensive human annotation of large text or image/video corpora for training.
Cross-Supervision of Language and Vision
• Use naturally co-occurring perceptual input to supervise language learning.
• Use naturally co-occurring linguistic input to supervise visual learning.
[Diagram: a Language Learner and a Vision Learner each receive input from one modality and supervision from the other; example scene description: "Blue cylinder on top of a red cube."]
Using Perception to Supervise Language: Learning to Sportscast
(Chen & Mooney, ICML-08)
Semantic Parsing
• A semantic parser maps a natural-language sentence to a complete, detailed semantic representation: logical form or meaning representation (MR).
• For many applications, the desired output is immediately executable by another program.
• Sample test application:
  – CLang: RoboCup Coach Language
CLang: RoboCup Coach Language
• In the RoboCup Coach competition, teams compete to coach simulated soccer players on a simulated soccer field.
• The coaching instructions are given in a formal language called CLang.

Example of semantic parsing into CLang:

Coach: If the ball is in our penalty area, then all our players except player 4 should stay in our half.

CLang: ((bpos (penalty-area our))
        (do (player-except our{4}) (pos (half our))))
Learning Semantic Parsers
• Manually programming robust semantic parsers is difficult due to the complexity of the task.
• Semantic parsers can be learned automatically from sentences paired with their logical form.
[Diagram: NL-MR training examples (natural language paired with meaning representations) feed a semantic-parser learner, which produces a semantic parser.]
Our Semantic-Parser Learners
• CHILL+WOLFIE (Zelle & Mooney, 1996; Thompson & Mooney, 1999, 2003)
  – Separates parser learning and semantic-lexicon learning.
  – Learns a deterministic parser using ILP techniques.
• COCKTAIL (Tang & Mooney, 2001)
  – Improved ILP algorithm for CHILL.
• SILT (Kate, Wong & Mooney, 2005)
  – Learns symbolic transformation rules for mapping directly from NL to MR.
• SCISSOR (Ge & Mooney, 2005)
  – Integrates semantic interpretation into Collins' statistical syntactic parser.
• WASP (Wong & Mooney, 2006; 2007)
  – Uses syntax-based statistical machine translation methods.
• KRISP (Kate & Mooney, 2006)
  – Uses a series of SVM classifiers employing a string kernel to iteratively build semantic representations.
WASP: A Machine Translation Approach to Semantic Parsing

• Uses the latest statistical machine translation techniques:
  – Synchronous context-free grammars (SCFG) (Wu, 1997; Melamed, 2004; Chiang, 2005)
  – Statistical word alignment (Brown et al., 1993; Och & Ney, 2003)
• SCFG supports both:
  – Semantic parsing: NL → MR
  – Tactical generation: MR → NL
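To make the bidirectional use of an SCFG concrete, here is a toy sketch (the rule names and templates are illustrative, not WASP's actual grammar format): a single synchronous rule pairs an NL template with an MR template, so one derivation step yields both a generation target and a parsing target.

```python
# Toy synchronous rules: each pairs an NL template with an MR template.
# Rule names and templates are illustrative, not WASP's real grammar.
RULES = {
    "PASS": ("{p} passes the ball to {q}", "pass({p},{q})"),
    "KICK": ("{p} kicks", "kick({p})"),
}

def instantiate(name, **bindings):
    """Apply one synchronous rule: the same derivation step produces
    both the NL side (tactical generation) and the MR side (parsing)."""
    nl_tmpl, mr_tmpl = RULES[name]
    return nl_tmpl.format(**bindings), mr_tmpl.format(**bindings)
```

For example, `instantiate("PASS", p="Pink8", q="Pink11")` yields the sentence "Pink8 passes the ball to Pink11" together with the MR `pass(Pink8,Pink11)`, illustrating why one learned grammar can drive both directions.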
KRISP: A String Kernel/SVM Approach to Semantic Parsing

• Productions in the formal grammar defining the MR are treated like semantic concepts.
• An SVM classifier is trained for each production using a string subsequence kernel (Lodhi et al., 2002) to recognize phrases that refer to that concept.
• The resulting set of string classifiers is used with a version of Earley's CFG parser to compositionally build the most probable MR for a sentence.
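The string subsequence kernel can be sketched with the standard dynamic-programming recursion from Lodhi et al. (2002); this is a minimal reference implementation for illustration, not KRISP's optimized code.

```python
from functools import lru_cache

def subseq_kernel(s, t, n, lam=0.5):
    """String subsequence kernel K_n(s, t): sums lam**(total span length)
    over all pairs of occurrences of common subsequences of length n."""

    @lru_cache(maxsize=None)
    def k_prime(i, ls, lt):
        # Auxiliary K'_i on prefixes s[:ls], t[:lt].
        if i == 0:
            return 1.0
        if min(ls, lt) < i:
            return 0.0
        x = s[ls - 1]
        total = lam * k_prime(i, ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(i - 1, ls - 1, j - 1) * lam ** (lt - j + 2)
        return total

    @lru_cache(maxsize=None)
    def k(ls, lt):
        # K_n on prefixes: common subsequences ending at matched characters.
        if min(ls, lt) < n:
            return 0.0
        x = s[ls - 1]
        total = k(ls - 1, lt)
        for j in range(1, lt + 1):
            if t[j - 1] == x:
                total += k_prime(n - 1, ls - 1, j - 1) * lam ** 2
        return total

    return k(len(s), len(t))
```

With the decay `lam` set to 1 the kernel simply counts pairs of shared subsequence occurrences; with `lam` below 1, gappy matches are penalized, which is what lets KRISP reward partial phrase matches.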
Learning Language from Perceptual Context
• Children do not learn language from annotated corpora.
• Neither do they learn language from just reading the newspaper, surfing the web, or listening to the radio.
  – Unsupervised language learning
  – DARPA Learning by Reading Program
• The natural way to learn language is to perceive language in the context of its use in the physical and social world.
• This requires inferring the meaning of utterances from their perceptual context.
Ambiguous Supervision for Learning Semantic Parsers
• A computer system simultaneously exposed to perceptual contexts and natural language utterances should be able to learn the underlying language semantics.
• We consider ambiguous training data: sentences associated with multiple potential MRs.
  – Siskind (1996) uses this type of "referentially uncertain" training data to learn the meanings of words.
• Extracting meaning representations from perceptual data is a difficult, unsolved problem.
  – Our system works directly with symbolic MRs.
Tractable Challenge Problem: Learning to Be a Sportscaster
• Goal: Learn from realistic data of natural language used in a representative context while avoiding difficult issues in computer perception (i.e. speech and vision).
• Solution: Learn from textually annotated traces of activity in a simulated environment.
• Example: Traces of games in the Robocup simulator paired with textual sportscaster commentary.
Grounded Language Learning in RoboCup

[Diagram: the RoboCup simulator provides simulated perception (perceived facts); the sportscaster provides commentary ("Score!!!!"); a grounded language learner induces an SCFG shared by a semantic parser and a language generator.]
RoboCup Sportscaster Trace

Natural Language Commentary:
• Purple goalie turns the ball over to Pink8
• Pink11 looks around for a teammate
• Pink8 passes the ball to Pink11
• Purple team is very sloppy today
• Pink11 makes a long pass to Pink8
• Pink8 passes back to Pink11

Meaning Representations (events detected in the same interval):
• badPass ( Purple1, Pink8 )
• turnover ( Purple1, Pink8 )
• pass ( Pink11, Pink8 )
• pass ( Pink8, Pink11 )
• ballstopped
• pass ( Pink8, Pink11 )
• kick ( Pink11 )
• kick ( Pink8 )
• kick ( Pink11 )
• kick ( Pink11 )
• kick ( Pink8 )
RoboCup Sportscaster Trace (meaning symbols anonymized)

Natural Language Commentary:
• Purple goalie turns the ball over to Pink8
• Pink11 looks around for a teammate
• Pink8 passes the ball to Pink11
• Purple team is very sloppy today
• Pink11 makes a long pass to Pink8
• Pink8 passes back to Pink11

Meaning Representations:
• P6 ( C1, C19 )
• P5 ( C1, C19 )
• P2 ( C22, C19 )
• P2 ( C19, C22 )
• P0
• P2 ( C19, C22 )
• P1 ( C22 )
• P1 ( C19 )
• P1 ( C22 )
• P1 ( C22 )
• P1 ( C19 )
Sportscasting Data
• Collected human textual commentary for the 4 RoboCup championship games from 2001-2004.
  – Avg # events/game = 2,613
  – Avg # sentences/game = 509
• Each sentence matched to all events within the previous 5 seconds.
  – Avg # MRs/sentence = 2.5 (min 1, max 12)
• Manually annotated with correct matchings of sentences to MRs (for evaluation purposes only).
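The 5-second matching window described above can be sketched as follows (timestamps and MR strings are illustrative):

```python
def candidate_mrs(sentences, events, window=5.0):
    """Pair each commentary sentence with every event (MR) extracted in
    the preceding `window` seconds -- this produces the ambiguous
    supervision (multiple candidate MRs per sentence) the learner resolves.
    sentences: list of (time, text); events: list of (time, mr)."""
    return [
        (text, [mr for t_e, mr in events if 0.0 <= t_s - t_e <= window])
        for t_s, text in sentences
    ]
```

For instance, a sentence at t=5s collects events at t=1s and t=3s but not one at t=10s, which matches the reported 2.5 average candidate MRs per sentence in spirit.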
KRISPER: KRISP with EM-like Retraining
• Extension of KRISP that learns from ambiguous supervision (Kate & Mooney, AAAI-07).
• Uses an iterative EM-like self-training method to gradually converge on a correct meaning for each sentence.
KRISPER's Training Algorithm

1. Assume every possible meaning for a sentence is correct.

Sentences:
• Daisy gave the clock to the mouse.
• Mommy saw that Mary gave the hammer to the dog.
• The dog broke the box.
• John gave the bag to the mouse.
• The dog threw the ball.

Candidate MRs:
• saw(john, walks(man, dog))
• ate(mouse, orange)
• gave(daisy, clock, mouse)
• ate(dog, apple)
• saw(mother, gave(mary, dog, hammer))
• broke(dog, box)
• gave(woman, toy, mouse)
• gave(john, bag, mouse)
• threw(dog, ball)
• runs(dog)

[Figure: a bipartite graph links each sentence to all of its candidate MRs.]
2. Resulting NL-MR pairs are weighted uniformly per sentence (e.g. 1/2, 1/4, 1/5 over each sentence's candidates) and given to KRISP.
3. Estimate the confidence of each NL-MR pair using the resulting trained parser.
4. Use maximum weighted matching on a bipartite graph to find the best NL-MR pairs [Munkres, 1957].
5. Give the best pairs to KRISP in the next iteration, and repeat until convergence.
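The steps above can be sketched as one retraining loop. The bipartite matching here is brute force over permutations for clarity (Munkres' Hungarian algorithm is what scales), and `train`/`confidence` are placeholder interfaces standing in for KRISP's actual parser training and pair scoring.

```python
from itertools import permutations

def best_matching(scores):
    """Maximum-weight one-to-one sentence-to-MR assignment.
    Brute-force stand-in for the Munkres (Hungarian) algorithm;
    assumes a rectangular score matrix with #MRs >= #sentences."""
    best_total, best = float("-inf"), None
    for perm in permutations(range(len(scores[0])), len(scores)):
        total = sum(scores[i][j] for i, j in enumerate(perm))
        if total > best_total:
            best_total, best = total, list(enumerate(perm))
    return best

def krisper_loop(data, train, confidence, iters=3):
    """EM-like retraining sketch of steps 1-5. `data` is a list of
    (sentence, candidate-MR list); `train` and `confidence` are
    hypothetical hooks for parser training and pair scoring."""
    # Step 1: every candidate pairing is assumed correct initially.
    pairs = [(s, m) for s, cands in data for m in cands]
    for _ in range(iters):
        parser = train(pairs)                               # steps 2/5: (re)train
        scores = [[confidence(parser, s, m) for m in cands]
                  for s, cands in data]                     # step 3: rescore
        match = best_matching(scores)                       # step 4: best matching
        pairs = [(data[i][0], data[i][1][j]) for i, j in match]
    return pairs
```

Note that the global matching can override locally greedy choices: if one sentence scores 0.9 against an MR that another sentence needs at 0.8, the assignment maximizing total confidence may still give that MR to the second sentence.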
WASPER
• WASP with EM-like retraining to handle ambiguous training data.
• Same augmentation as added to KRISP to create KRISPER.
KRISPER-WASP
• First iteration of EM-like training produces very noisy training data (> 50% errors).
• KRISP is better than WASP at handling noisy training data.
  – SVM prevents overfitting.
  – String kernel allows partial matching.
• But KRISP does not support language generation.
• First train KRISPER just to determine the best NL→MR matchings.
• Then train WASP on the resulting unambiguously supervised data.
WASPER-GEN
• In KRISPER and WASPER, the correct MR for each sentence is chosen based on maximizing the confidence of semantic parsing (NL→MR).
• Instead, WASPER-GEN determines the best matching based on generation (MR→NL).
• Score each potential NL/MR pair by using the currently trained WASP-1 generator.
• Compute NIST MT score between the generated sentence and the potential matching sentence.
Strategic Generation
• Generation requires not only knowing how to say something (tactical generation) but also what to say (strategic generation).
• For automated sportscasting, one must be able to effectively choose which events to describe.
Example of Strategic Generation
pass ( purple7 , purple6 )
ballstopped
kick ( purple6 )
pass ( purple6 , purple2 )
ballstopped
kick ( purple2 )
pass ( purple2 , purple3 )
kick ( purple3 )
badPass ( purple3 , pink9 )
turnover ( purple3 , pink9 )
Learning for Strategic Generation
• For each event type (e.g. pass, kick) estimate the probability that it is described by the sportscaster.
• Requires an NL/MR matching indicating which events were described, but this is not provided in the ambiguous training data.
  – Use the estimated matching computed by KRISPER, WASPER, or WASPER-GEN.
  – Use a version of EM to determine the probability of mentioning each event type based on strategic information alone.
Iterative Generation Strategy Learning (IGSL)
• Directly estimates the likelihood of commenting on each event type from the ambiguous training data.
• Uses self-training iterations to improve estimates (à la EM).
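A simplified version of the strategic estimate: given an inferred NL/MR matching, the probability of commenting on each event type can be taken as a relative frequency. This sketch shows only that base case; the actual systems refine the estimate with EM/self-training iterations.

```python
from collections import Counter

def mention_probs(occurred, described):
    """occurred: every event type seen in the game traces;
    described: the event types that the inferred NL/MR matching says
    were commented on. Returns P(mention | event type) as a
    relative frequency per event type."""
    occ, men = Counter(occurred), Counter(described)
    return {etype: men[etype] / occ[etype] for etype in occ}
```

At generation time, the sportscaster can then comment on each perceived event with its estimated probability, rarely mentioning frequent but uninteresting events like individual kicks.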
Demo
• Game clip commentated using WASPER-GEN with EM-based strategic generation, since this gave the best results for generation.
• FreeTTS was used to synthesize speech from textual output.
• Also trained for Korean to illustrate language independence.
Experimental Evaluation
• Generated learning curves by training on all combinations of 1 to 3 games and testing on all games not used for training.
• Baselines:
  – Random Matching: WASP trained on a random choice of the possible MRs for each comment.
  – Gold Matching: WASP trained on the correct matching of MRs for each comment.
• Metrics:
  – Precision: % of system's annotations that are correct
  – Recall: % of gold-standard annotations correctly produced
  – F-measure: harmonic mean of precision and recall
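The three metrics written out directly (a set-based sketch; the "annotations" are whatever units are being scored, e.g. sentence-MR matchings):

```python
def precision_recall_f(system, gold):
    """system, gold: sets of annotations."""
    tp = len(system & gold)                      # correct system outputs
    p = tp / len(system) if system else 0.0      # precision
    r = tp / len(gold) if gold else 0.0          # recall
    f = 2 * p * r / (p + r) if p + r else 0.0    # harmonic mean (F-measure)
    return p, r, f
```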
Evaluating Semantic Parsing
• Measure how accurately learned parser maps sentences to their correct meanings in the test games.
• Use the gold-standard matches to determine the correct MR for each sentence that has one.
• Generated MR must exactly match gold-standard to count as correct.
Results on Semantic Parsing
Evaluating Tactical Generation
• Measure how accurately NL generator produces English sentences for chosen MRs in the test games.
• Use gold-standard matches to determine the correct sentence for each MR that has one.
• Use NIST score to compare generated sentence to the one in the gold-standard.
Results on Tactical Generation
Evaluating Strategic Generation
• In the test games, measure how accurately the system determines which perceived events to comment on.
• Compare the subset of events chosen by the system to the subset chosen by the human annotator (as given by the gold-standard matching).
Results on Strategic Generation
[Chart: average F-measure (y-axis 0 to 0.8) on leave-one-game-out cross-validation, comparing strategies inferred from WASP, KRISPER, WASPER, WASPER-GEN, IGSL, and the gold matching.]
Human Evaluation (Quasi Turing Test)
• Asked 4 fluent English speakers to evaluate the overall quality of the sportscasts.
• Randomly picked a 2-minute segment from each of the 4 games.
• Each human judge evaluated 8 commented game clips: each of the 4 segments commented once by a human and once by the machine when tested on that game (and trained on the 3 other games).
• The 8 clips were presented to each judge in random, counter-balanced order.
• Judges were not told which clips were human- or machine-generated.
Human Evaluation Metrics
Score   English Fluency   Semantic Correctness   Sportscasting Ability
5       Flawless          Always                 Excellent
4       Good              Usually                Good
3       Non-native        Sometimes              Average
2       Disfluent         Rarely                 Bad
1       Gibberish         Never                  Terrible
Results on Human Evaluation
Commentator   English Fluency   Semantic Correctness   Sportscasting Ability
Human         3.94              4.25                   3.63
Machine       3.44              3.56                   2.94
Difference    0.50              0.69                   0.69
Co-Training with Visual and Textual Views
(Gupta, Kim, Grauman & Mooney, ECML-08)
Semi-Supervised Multi-Modal Image Classification
• Use both images or videos and their textual captions for classification.
• Use semi-supervised learning to exploit unlabeled training data in addition to labeled training data.
• How?: Co-training (Blum and Mitchell, 1998) using visual and textual views.
• Illustrates both language supervising vision and vision supervising language.
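The co-training procedure (Blum & Mitchell, 1998) can be sketched as below. `train` and `predict` are placeholder interfaces for the per-view classifiers; the toy single-feature threshold learner at the end is purely illustrative and not the classifiers used in the actual experiments.

```python
def co_train(labeled, pool, train, predict, rounds=10, per_view=2):
    """Co-training sketch. labeled: [((x_text, x_visual), y)];
    pool: unlabeled [(x_text, x_visual)]. Placeholder interfaces:
    train(view_examples) -> classifier,
    predict(classifier, view_value) -> (label, confidence)."""
    labeled, pool = list(labeled), list(pool)
    for _ in range(rounds):
        if not pool:
            break
        # One classifier per view, trained on that view of the labeled data.
        clfs = [train([(x[v], y) for x, y in labeled]) for v in (0, 1)]
        for v, clf in enumerate(clfs):
            # Each view labels the pool items it is most confident about,
            # growing the training set for the other view.
            ranked = sorted(pool, key=lambda x: -predict(clf, x[v])[1])
            for x in ranked[:per_view]:
                labeled.append((x, predict(clf, x[v])[0]))
                pool.remove(x)
    return labeled

# Toy per-view "classifier": threshold at the midpoint of the class means,
# with confidence = distance from the threshold (illustrative only).
def toy_train(examples):
    m0 = sum(v for v, y in examples if y == 0) / sum(1 for _, y in examples if y == 0)
    m1 = sum(v for v, y in examples if y == 1) / sum(1 for _, y in examples if y == 1)
    return (m0 + m1) / 2.0

def toy_predict(threshold, value):
    return (1 if value > threshold else 0, abs(value - threshold))
```

The key design point is that the two views must be individually sufficient: each classifier's confident predictions serve as new labeled data for the other view, which is exactly how caption text can supervise the visual classifier and vice versa.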
Sample Classified Captioned Images
Example captioned images:
• "Cultivating farming at Nabataean Ruins of the Ancient Avdat"
• "Bedouin Leads His Donkey That Carries Load Of Straw"
• "Ibex Eating In The Nature"
• "Entrance To Mikveh Israel Agricultural School"
Classes: Desert, Trees
Co-training
• Semi-supervised learning paradigm that exploits two mutually independent and sufficient views of the data.
• The features of the dataset divide into two sets:
  – The instance space: X = X1 × X2
  – Each example: x = (x1, x2)
• Proven effective in several domains:
  – Web page classification (content and hyperlinks)
  – E-mail classification (header and body)
![Page 53: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/53.jpg)
53
Co-training
[Diagram: initially labeled instances (+/−), each with a text view and a visual view, feeding a Text Classifier and a Visual Classifier.]
![Page 54: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/54.jpg)
54
Co-training
[Diagram: supervised learning step — the text classifier is trained on the text views and the visual classifier on the visual views of the initially labeled instances.]
![Page 55: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/55.jpg)
55
Co-training
[Diagram: unlabeled instances, each with a text view and a visual view, are presented to both classifiers.]
![Page 56: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/56.jpg)
56
Co-training
[Diagram: each classifier labels the unlabeled instances it is most confident about, producing partially labeled instances.]
![Page 57: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/57.jpg)
57
Co-training
[Diagram: the confident labels are applied to all views of each instance, yielding classifier-labeled instances.]
![Page 58: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/58.jpg)
58
Co-training
[Diagram: both classifiers are retrained on the enlarged labeled set.]
![Page 59: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/59.jpg)
59
Co-training
[Diagram: to label a new instance, the text and visual classifiers each predict from their own view and their outputs are combined.]
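The loop illustrated on the preceding slides can be sketched in a few lines. This is a toy illustration, not code from the talk: each view is reduced to a single number, and a midpoint-threshold classifier stands in for the per-view SVMs used in the experiments.

```python
def train_view(examples):
    """Fit a 1-D midpoint-threshold classifier on (value, label) pairs;
    a toy stand-in for a per-view SVM. Returns predict(v) -> (label, confidence)."""
    c0 = [v for v, y in examples if y == 0]
    c1 = [v for v, y in examples if y == 1]
    m0, m1 = sum(c0) / len(c0), sum(c1) / len(c1)
    thr, sign = (m0 + m1) / 2.0, 1 if m1 > m0 else -1
    return lambda v: (1 if (v - thr) * sign > 0 else 0, abs(v - thr))


def co_train(labeled, unlabeled, rounds=5):
    """Blum & Mitchell-style co-training over two views.
    labeled: [((x_text, x_visual), y)]; unlabeled: [(x_text, x_visual)]."""
    unlabeled = list(unlabeled)
    for _ in range(rounds):
        if not unlabeled:
            break
        # Retrain one classifier per view on the current labeled set.
        f = [train_view([(x[v], y) for x, y in labeled]) for v in (0, 1)]
        # Each classifier labels its single most confident unlabeled instance;
        # the pseudo-labeled instance then trains BOTH views next round.
        for v in (0, 1):
            if not unlabeled:
                break
            best = max(unlabeled, key=lambda x: f[v](x[v])[1])
            labeled.append((best, f[v](best[v])[0]))
            unlabeled.remove(best)
    return [train_view([(x[v], y) for x, y in labeled]) for v in (0, 1)]


# Two well-separated classes, visible in both views (hypothetical data).
f_text, f_visual = co_train(
    labeled=[((0.0, 1.0), 0), ((9.0, 10.0), 1)],
    unlabeled=[(1.0, 0.0), (2.0, 1.0), (8.0, 9.0), (10.0, 9.0)],
)
```

Either final classifier can now label a new instance from its own view alone, which is exactly the payoff the slides illustrate.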
![Page 60: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/60.jpg)
60
Baseline - Individual Views
• Image/video view: only image/video features are used.
• Text view: only textual features are used.
![Page 61: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/61.jpg)
61
Baseline - Early Fusion
[Diagram: concatenate the visual and textual features of each training instance into a single vector; train one classifier on the concatenation, and at test time classify the concatenated features of a new instance.]
![Page 62: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/62.jpg)
62
Baseline - Late Fusion
[Diagram: train a text classifier and a visual classifier separately on their own views; to label a new instance, combine the two classifiers' individual predictions.]
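The two fusion baselines differ only in where the views are combined. A minimal sketch, with hypothetical feature vectors and probabilities:

```python
def early_fusion(text_features, visual_features):
    """Early fusion: concatenate the two views into one feature vector,
    then train a single classifier on the result."""
    return list(text_features) + list(visual_features)


def late_fusion(p_text, p_visual, w=0.5):
    """Late fusion: combine the per-view classifiers' probability estimates.
    A simple weighted average is shown; w is a hypothetical weight."""
    return w * p_text + (1 - w) * p_visual


combined = early_fusion([1, 0, 2], [0.3, 0.7])   # one 5-dimensional vector
p = late_fusion(0.9, 0.6)                        # fused probability
```

Early fusion lets the classifier learn cross-view interactions; late fusion keeps the views independent, which is also what co-training requires.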
![Page 63: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/63.jpg)
63
Image Dataset
• Our captioned image data is taken from Bekkerman & Jeon (CVPR '07), collected from www.israelimages.com.
• Consists of images with short text captions.
• Two classes were used: Desert and Trees.
• 362 instances in total.
![Page 64: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/64.jpg)
Text and Visual Features
• Text view: standard bag of words.
• Image view: standard bag of visual words that capture texture and color information.
64
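Both views reduce to the same representation: a histogram of token counts over a fixed vocabulary. A minimal sketch (the vocabulary here is hypothetical; for the image view the tokens would be visual-word ids obtained by quantizing local texture/color descriptors):

```python
from collections import Counter


def bow_vector(tokens, vocab):
    """Count-vector over a fixed vocabulary: a 'bag of words' for text,
    or a 'bag of visual words' if tokens are quantized descriptor ids."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]


vocab = ["desert", "tree", "camel"]  # hypothetical vocabulary
vec = bow_vector("a camel in the desert near a tree".split(), vocab)
```

Out-of-vocabulary tokens are simply ignored, so both views yield fixed-length vectors suitable for an SVM.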
![Page 65: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/65.jpg)
65
Experimental Methodology
• Test set is disjoint from both labeled and unlabeled training set.
• For plotting learning curves, vary the percentage of training examples that are labeled; the rest are used as unlabeled data for co-training.
• SVM with RBF kernel is used as base classifier for both visual and text classifiers.
• All experiments are evaluated with 10 iterations of 10-fold cross-validation.
![Page 66: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/66.jpg)
Learning Curves for Israel Images
66
![Page 67: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/67.jpg)
Using Closed Captions to Supervise Activity Recognition in Videos (Gupta & Mooney, VCL-09)
67
![Page 68: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/68.jpg)
Activity Recognition in Video
• Recognizing activities in video generally uses supervised learning trained on human-labeled video clips.
• Linguistic information in closed captions (CCs) can be used as “weak supervision” for training activity recognizers.
• Automatically trained activity recognizers can be used to improve precision of video retrieval.
68
![Page 69: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/69.jpg)
Sample Soccer Videos
Kick:
• "I do not think there is any real intent, just trying to make sure he gets his body across, but it was a free kick."
• "Lovely kick."
• "Goal kick."
Save:
• "Good save as well."
• "I think Brown made a wonderful fingertip save there."
• "And it is a really chopped save."
![Page 70: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/70.jpg)
Throw:
• "If you are defending a lead, your throw back takes it that far up the pitch and gets a throw-in."
• "And Carlos Tevez has won the throw."
• "Another shot for a throw."
Touch:
• "When they are going to pass it in the back, it is a really pure touch."
• "Look at that, Henry, again, he had time on the ball to take another touch and prepare that ball properly."
• "All it needed was a touch."
![Page 71: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/71.jpg)
71
Using Video Closed-Captions
• CCs contain both relevant and irrelevant information:
  – "Beautiful pull-back." (relevant)
  – "They scored in the last kick of the game against the Czech Republic." (irrelevant)
  – "That is a fairly good tackle." (relevant)
  – "Turkey can be well-pleased with the way they started." (irrelevant)
• Use a novel caption classifier to rank the retrieved video clips by relevance.
![Page 72: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/72.jpg)
72
System Overview
[Diagram: at training time, a caption-based video retriever extracts clips from captioned training videos and automatically labels them; the automatically labeled clips train a video classifier, while manually labeled captions train a caption classifier. At test time, the retriever pulls clips from a captioned video for a query, and the video ranker combines the video and caption classifiers to produce a ranked list of video clips.]
![Page 73: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/73.jpg)
![Page 74: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/74.jpg)
74
Retrieving and Labeling Data
• Identify all closed-caption sentences that contain exactly one of the activity keywords: kick, save, throw, touch.
• Extract a clip of 8 seconds around the corresponding time.
• Label each clip with the corresponding class.
Example: "…What a nice kick!…" → an 8-second clip labeled kick.
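The labeling procedure above amounts to a keyword scan over time-stamped captions. A sketch under assumed data shapes (a list of time-stamped sentences; the punctuation handling is deliberately simplistic):

```python
KEYWORDS = {"kick", "save", "throw", "touch"}


def label_clips(captions, window=8.0):
    """captions: [(time_in_seconds, sentence)].
    Returns (start, end, label) for every sentence that mentions exactly
    one activity keyword; sentences with zero or several are skipped."""
    clips = []
    for t, sentence in captions:
        words = sentence.lower().replace("!", " ").replace(".", " ").split()
        hits = {w for w in words if w in KEYWORDS}
        if len(hits) == 1:
            label = hits.pop()
            clips.append((max(0.0, t - window / 2), t + window / 2, label))
    return clips


clips = label_clips([(100.0, "What a nice kick!"),
                     (200.0, "A goal kick and then a save.")])
```

The second caption mentions two keywords and so yields no clip, which is the "exactly one keyword" filter from the slide.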
![Page 75: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/75.jpg)
![Page 76: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/76.jpg)
76
Video Classifier
• Extract visual features from clips:
  – Histograms of oriented gradients and optical flow in a space-time volume (Laptev et al., ICCV 07; CVPR 08).
  – Represent each clip as a "bag of visual words."
• Use the automatically labeled video clips to train an activity classifier.
• Use DECORATE (Melville and Mooney, IJCAI 03):
  – An ensemble-based classifier.
  – Works well with noisy and limited training data.
![Page 77: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/77.jpg)
![Page 78: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/78.jpg)
78
Caption Classifier
• Sportscasters talk about events on the field as well as other information:
  – 69% of the captions in our dataset are irrelevant to the current events.
• Classifies captions as relevant vs. irrelevant:
  – Independent of the query classes.
• Uses an SVM string classifier:
  – A subsequence kernel measures how many subsequences are shared by two strings (Lodhi et al. 02; Bunescu and Mooney 05).
  – More accurate than a "bag of words" classifier since it takes word order into account.
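The subsequence kernel can be computed with the standard gap-weighted dynamic program (after Lodhi et al.). The sketch below works over word tokens, with a decay factor lam penalizing gaps; it follows the textbook recursion, not necessarily the exact variant used in the talk.

```python
def subseq_kernel(s, t, p, lam):
    """Gap-weighted subsequence kernel: weighted count of length-p
    subsequences shared by token sequences s and t, where each
    occurrence is discounted by lam ** (length of its span)."""
    n, m = len(s), len(t)
    if min(n, m) < p:
        return 0.0
    # DPS[i][j]: contribution of shared subsequences of the current
    # length ending exactly at s[i-1] and t[j-1].
    DPS = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                DPS[i][j] = lam * lam
    for _ in range(2, p + 1):
        # DP accumulates prefix sums with gap decay.
        DP = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                DP[i][j] = (DPS[i][j] + lam * DP[i - 1][j]
                            + lam * DP[i][j - 1]
                            - lam * lam * DP[i - 1][j - 1])
        nxt = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    nxt[i][j] = lam * lam * DP[i - 1][j - 1]
        DPS = nxt
    return sum(sum(row) for row in DPS)


k = subseq_kernel("what a save".split(), "good save".split(), 1, 0.5)
```

Because matches are weighted by span length, "good save" scores higher against "great save" than against "good … save" with intervening words, which is how word order enters the classifier.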
![Page 79: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/79.jpg)
79
Retrieving and Ranking Videos
• Videos are retrieved using captions, the same way as before.
• Two ways of ranking:
  – Probabilities given by the video classifier (VIDEO).
  – Probabilities given by the caption classifier (CAPTION).
• Aggregating the rankings with a weighted late fusion of VIDEO and CAPTION:
  P(label | clip-with-caption) = α · P(label | clip) + (1 − α) · P(relevant | clip-caption)
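In code, the weighted fusion and ranking step is just a convex combination of the two probability estimates. The weight and the clip probabilities below are made-up values for illustration:

```python
def fused_score(p_video, p_caption_relevant, alpha=0.5):
    """Weighted late fusion of the video classifier's and the caption
    classifier's probability estimates; alpha is a tunable weight."""
    return alpha * p_video + (1 - alpha) * p_caption_relevant


# (clip id, P(label | clip) from VIDEO, P(relevant | caption) from CAPTION)
clips = [("clip1", 0.9, 0.20), ("clip2", 0.4, 0.95), ("clip3", 0.1, 0.10)]
ranked = sorted(clips, key=lambda c: fused_score(c[1], c[2]), reverse=True)
```

Setting alpha to 1 or 0 recovers the pure VIDEO or pure CAPTION rankings.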
![Page 80: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/80.jpg)
80
Experiment
• Dataset:
  – 23 soccer games recorded from TV broadcast.
  – Avg. length: 1 hr 50 min.
  – Avg. number of captions: 1,246.
• Caption classifier: trained on 4 separate hand-labeled games.
• Metric: MAP (Mean Average Precision).
• Methodology: leave-one-game-out cross-validation.
• Baseline: ranking clips randomly.
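The MAP metric averages, over queries, the average precision of each ranked list. A self-contained sketch of the computation:

```python
def average_precision(relevance):
    """relevance: 1/0 judgments of one ranked list, best-ranked first.
    AP = mean of precision@k taken at each relevant position."""
    hits, total = 0, 0.0
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(rankings):
    """MAP: mean of the per-query average precisions."""
    return sum(average_precision(r) for r in rankings) / len(rankings)


ap = average_precision([1, 0, 1, 0])   # (1/1 + 2/3) / 2 = 5/6
```

A random ranking gives low AP because relevant clips land late in the list, which is why random ranking serves as the baseline.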
![Page 81: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/81.jpg)
81
Dataset Statistics

| Query | # Total | # Correct | % Noise |
|-------|---------|-----------|---------|
| Kick  | 303     | 120       | 60.39   |
| Save  | 80      | 47        | 41.25   |
| Throw | 58      | 26        | 55.17   |
| Touch | 183     | 122       | 33.33   |
![Page 82: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/82.jpg)
82
Retrieval Results
Mean Average Precision (MAP):

| Method              | MAP    |
|---------------------|--------|
| Baseline            | 65.68  |
| VIDEO               | 70.749 |
| CAPTION             | 72.11  |
| VIDEO+CAPTION       | 70.53  |
| Gold VIDEO+CAPTION  | 70.747 |
![Page 83: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/83.jpg)
Future Work
• Use real (not simulated) visual context to supervise language learning.
• Use more sophisticated linguistic analysis to supervise visual learning.
83
![Page 84: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/84.jpg)
84
Conclusions
• Current language and visual learning uses expensive, unrealistic training data.
• Naturally occurring perceptual context can be used to supervise language learning:
  – Learning to sportscast simulated RoboCup games.
• Naturally occurring linguistic context can be used to supervise learning for computer vision:
  – Using multi-modal co-training to improve classification of captioned images and videos.
  – Using closed captions to automatically train activity recognizers and improve video retrieval.
![Page 85: 1 Using Perception to Supervise Language Learning and Language to Supervise Perception Ray Mooney Department of Computer Sciences University of Texas at](https://reader035.vdocument.in/reader035/viewer/2022062802/56649ec15503460f94bcd9b8/html5/thumbnails/85.jpg)
Questions?
Relevant papers at: http://www.cs.utexas.edu/users/ml/publication/clamp.html
85