data-driven generation of image descriptions
TRANSCRIPT
Data-driven Generation of Image Descriptions
Vicente Ordonez-Roman
The State University of New YorkPreviously:
Advisor: Tamara Berg
What most Computer Vision systems aim to say about a picture
skytreeswaterbuildingbridgerivertree
Computer Vision
What we are able to say about a picture
One of the many stone bridges in town that carry the gravel carriage roads.
An old bridge over dirty green water.
A stone bridge over a peaceful river.
Our Goal
Let’s just borrow captions from similar images!
Im2Text: Describing Images Using 1 Million Captioned Photographs.Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
Harness the Web!
Smallest house in paris between red (on right) and beige (on left).
Bridge to temple in Hoan Kiem lake.
The water is clear enough to see fish swimming around in it.
A walk around the lake near our house with Abby.
Hangzhou bridge in West lake.
The daintree river by boat.
. . .
Images + Captionsfrom the Web
Transfer Caption(s)
Matching using Global Image Features(GIST + Color)
e.g. “The water is clear enough to see fish swimming around in it.”
Use the web to collect images + captions
6, 000, 000, 000 photographs! (*)A lot of them with captions(lots of them publicly available )
90, 000, 000, 000 pictures~!! (**)A lot of them with captions(a lot of them not publicy available )
(*) http://blog.flickr.net/en/2011/08/04/6000000000/(**) http://www.quora.com/How-many-photos-are-uploaded-to-Facebook-each-day
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
cat in a sink
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
Dog with a ball in its mouth running around like crazy on the green grass.
cat in a sink
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Flickr images + captions
A 10-kg cat called Hercules.. and got caught in a pet door when trying to sneak into another house to steal dog food. 'Nuff said
Solution: Collect hundreds of millions of captions
Filter them out
We found “good captions” have visual concepts and relation words “by”, “in”, “over”, “beside”,
“on top of”
~1 “good caption” for every 1000 “bad captions”
Im2Text: Describing Images Using 1 Million Captioned Photographs.Vicente Ordonez, Girish Kulkarni, Tamara L. Berg.
Advances in Neural Information Processing Systems. NIPS 2011.
SBU Captioned Photo Dataset
Our dog Zoe in her bed
Interior design of modern white and brown living room furniture against white wall with a lamp hanging.
The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon
Man sits in a rusted car buried in the sand on Waitarere beach
Emma in her hat looking super cute
Little girl and her dog in northern Thailand. They both seemed interested in what we were doing
1 million captioned
photos!1 m
illion
captioned
photos!
Results
(1) while walking by the water(2) plane flying over the sun(3) shot this in a moving car at the nkve highway(4) sunset over creve coeur lake and the page bridge(5) sunset on 12th sep 2009 as seen from the field polder near my house(6) window over yellow door(7) sunset over capitol hill as seen from the roof of my building(8) an orange sky over the irish sea(9) beautiful golden sunset reflected in the waves of the ocean(10) red sky probably caused by volcanic ash from iceland(11) a view of sunset over river brahmaputa from koliyabhumura bridge(12) red sky in the morning
Results
(1) burnt wooden door in derelict building portugal(2) peterborough cathedral norman door in south wall(3) amazing wooden door with wider light above(4) door in wall(5) girl looking in a classroom window(6) a interesting cross in a window of an ancient city(7) this mirror decorated with fruit painting was left behind by theprevious owners(8) unusual exterior wall postbox at st albans post office in st peters street al1(9) door in oxford uk in black and white(10) 19 plate behind glass in brass mat and preserver(11) this is some of the window decoration external on the house justover the porch 0364(12) cat in a window
Results
(1) img8783 ginger in the red chair(2) red sky in the morning(3) the cat is in the bag and the bag is in the river quot(4) the light in the kitchen made everythin glow my little girl is growing up (5) my cat in a box that is far too small for her(6) one of the towel animals in the cabin edno ot jivotnite napraveno ot havlieni karpi v kabinata(7) baby in her later years turned from green to red but she never went fully red all over(8) if you take pictures through the hole in the bottom of a flower pot the whole of the eldritch world is revealed(9) glazed ceramic poop form in orange wooden box(10) rock garden in library(11) it s funny to capture the preciousest cat in the house at his most devillicious(12) the pink will get replaced by orange and blue in the fall
Results
(1) starfish from the book toys to knitdashing dachs superwash sock yarn in goldfishbacking is orange fabricstuffing is pillow stuffing(2) mural of birds and trees in the crypt of wat ratburana ayutthaya(3) carvings in the rock wall(4) acrylic on paper scarlet macaws communicate in the color red withyellow and blue as visual grammar(5) epsom and table salt crystals growing in concentrated green tea solution(6) the hops dried to a golden green in a matter of a few days almosttoo pretty to bag up(7) after staring at the gorgeous colors of the leaves claes discoveredthat there were about 100 birds sleeping in the tree(8) you know you re in wisconsin when the beach has pine needles inthe sand(9) i was walking down the sidewalk and i saw this glove craft droppedin the dirt it seemed really unusual(10) made by fusing plastic bags(11) bark pattern from a ponderosa pine tree in grand canyon national park(12) the peasant that found a statue of the black virgin on a rock in ariver
What to do next?
Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street.
The Daintree river by boat. Bridge over Cacapon river.
Iron bridge over the Duck river.
. . .
Transfer Caption(s)
e.g. “The bridge over the lake on Suzhou Street.”
Some success…
Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.
Strange cloud formation literally flowing through the sky like a river in relation to the other clouds out there.
Fresh fruit and vegetables at the market in Port Louis Mauritius.
Tree with red leaves in the field in autumn.
A female mallard duck in the lake at Luukki Espoo
The sun was coming through the trees while I was sitting in my chair by the river
Under the sky of burning clouds. Stained glass
window in Eusebius church.
Still far from perfect
Kentucky cows in a field.
The cat in the window.
Incorrect objects
Still far from perfect
The sky is blue over the Gherkin.
The boat ended up a kilometre from the water in the middle of the airstrip.
Tree beside the river.
Water over the road.
Incorrect context
Completely wrong
How to Evaluate?
• “Ground truth”: The car is parked next to the train station besides a building.
• Candidates: “There is car parked in front of an office building”“This is the building that hosted the ceremony”“A vehicle stopped next to my house”
Similar to evaluation on Machine Translation
Method BLEU score
Global matching (1k) 0.0774
Global matching (10k) 0.0909
Global matching (100k) 0.0917
Global matching (1million) 0.1177
Global + Content matching (linear regression)
0.1215
Global + Content matching (linear SVM)
0.1259
BLEU score evaluation against Human Captions
Human Visual Verification
View overlooking Kuala Lumpur from my office building
Please choose the image that better corresponds to the given caption:
Human Visual Verification
View overlooking Kuala Lumpur from my office building
Please choose the image that better corresponds to the given caption:
Caption from Flickr
Random image
Human Visual Verification
View overlooking Kuala Lumpur from my office building
Please choose the image that better corresponds to the given caption:
Caption from Flickr
Random image
Caption used Success rate
Original human caption 96.0%
Top caption 66.7%
Best from our top 4 captions 92.7%
Human Visual Evaluation
Caption used Success rate
Original human caption 96.0%
Top caption 66.7%
Best from our top 4 captions 92.7%
The view from the 13th floor of an apartment building in Nakano awesome.
Please choose the image that better corresponds to the given caption:
Caption produced by our system
Random image
Human Visual Evaluation
Caption used Success rate
Original human caption 96.0%
Top caption 66.7%
Best from our top 4 captions 92.7%
The view from the 13th floor of an apartment building in Nakano awesome.
Please choose the image that better corresponds to the given caption:
Caption produced by our system
Random image
What to do next?
Let’s not borrow captions from other images, let’s just borrow short phrases!
Collective Generation of Natural Image Descriptions.Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg, Yejin Choi.
Association for Computational Linguistics. ACL 2012.
Large Scale Retrieval for Image Description Generation Vicente Ordonez, Xufeng Han, Polina Kuznetsova, Girish Kulkarni, Margaret Mitchell,Kota Yamaguchi, Karl Stratos, Amit Goyal, Jesse Dodge, Alyssa Mensch, Hal Daume III,
Alexander C. Berg, Yejin Choi, Tamara L. BergOn Submission to IJCV special issue on Big Data.
Retrieving noun phrases from similar object detections
this dog was laying in the middle of the road on a back street in jaco
Closeup of my dog sleeping under my desk.
Detect: dog
Find matching dog detections by visual similarity
Peruvian dog sleeping on city street in the city of Cusco, (Peru)
Contented dog just laying on the edge of the road in front of a house..
Retrieving verb phrases from similar
object detections
Find matching region detections using appearance + arrangement
Mini Nike soccer ball all alone in the grass
Comfy chair under a tree.
I positioned the chairs around the lemon tree -- it's like a shrine
Object: car
Cordoba - lonely elephant under an orange tree...
Retrieving prepositional phrases from region +
detection matches
Retrieving prepositional phrases from scene matches
View from our B&B in this photo
Extract scene descriptor
Find matching images by scene similarity
Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere
I'm about to blow the building across the street over with my massive lung power.
Only in Paris will you find a bottle of wine on a table outside a bookstore
Data Processing
1 million images:– Run object detectors– Run region based stuff detectors (e.g.
grass, sky, etc)– Run global scene classifiers– Parse captions associated with images
and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene).
Recognition, aka Vision is hardDetecting one hundred objects
Sometimes you can make it (a little) better
Detecting “mentioned” objects
The background is a vintage paint by number painting I have and the fabulous forest dress is by candyjunky!
Kevin’s mom, so punxrawk in Kev’s black flag hat
Look in the mountain for a lion face
Ecuador, amazon basin, near coca, rain forest, passion fruit flower
Everything together
bird in waterin Lincoln City Oregon coast
Objects
Actions
Scene
Stuff
looking for food
Everything together
Retrieved phrases
bird in wateron the beach
bird in waterin Lincoln City Oregon coast
bird in waterin Atlantic Citylooking for
food
looking for food
looking for food
Binary Integer Linear Programming
Phrase sij Position kPhrase spq Position k+1
Phrase sij Position k
Phrase VisionConfidence
Pairwise phrase
cohesion
Ngram cohesion
Head words co-
occurrence+=
Composing Descriptions
Compose descriptions from phrases with ILP approach
• Linguistic constraints – Allow only one phrase of each type– Enforce plural/singular agreement between NP and VP
• Discourse constraints– Prevent inclusion of repeated phrasing
• Phrase cohesion constraints– n-gram statistics between phrases– Co-occurrence statistics between head words of phrases (last
word or main verb) to encourage longer range cohesion
Good Results
This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings.
Taken in front of my cat sitting in a shoe box. Cat likes hanging around in my recliner.
This is a brass viking boat moored on beach in Tobago by the ocean.
Bad Results
This is a shoulder bag with a blended rainbow effect One of the most shirt in the wall of
the house.Here you can see a cross by the frog in the sky.
Not relevantGrammatically incorrect. Cognitive absurdity.
Method BLEU scoreHMM (using cognitive phrases) 0.111HMM (without using cognitive phrases) 0.114ILP (using cognitive phrases) 0.114ILP (without using cognitive phrases) 0.116
BLEU score evaluation
Human Forced Choice Evaluation
Caption used ILP Selection
ILP vs. HMM (no images, no cognitive phrases) 67.2%
ILP vs. HMM (no images, with cognitive phrases) 66.3%
ILP vs. HMM (with images, no cognitive phrases) 53.17%
ILP vs. HMM (with images, with cognitive phrases) 54.5%
ILP vs. NIPS 2011 (Global matching 1M) 71.8%
ILP vs. HUMAN 16%
Visual Turing Test
In some cases (16%), ILP generated captions were preferred over human written ones!
Us vs Original Human Written Caption
What’s next?
Meaning from large-scale computer vision
To be presented at ICCV 2013
Images with the word “house” Images recognized as more likely to produce the word “house”
Meaning from large-scale computer vision
Images with the word “girl” Images recognized as more likely to produce the word “girl”
To be presented at ICCV 2013
Meaning from large-scale computer vision
Mammals Birds InstrumentsStructuresPlantsOther
Weights learned to recognize images with “desk” in caption
Top weighted classifier outputs
Weights learned over outputs of ~8k classifiers
To be presented at ICCV 2013
Meaning from large-scale computer vision
MammalsBirds InstrumentsStructuresPlantsOther
Weights learned to recognize images with “tree” in caption
Top weighted classifier outputs
Weights learned over outputs of ~8k classifiers
To be presented at ICCV 2013
Meaning from large-scale computer vision
MammalsBirds InstrumentsStructuresPlantsOther
Weights learned to recognize images with “tree” in caption
Top weighted classifier outputs
Weights learned over outputs of ~8k classifiers
Questions?