natural language driven image generation

Natural Language driven Image Generation Prepared by: Shreya Agarwal Guide: Mrs. Nirali Nanavati


Post on 04-Jan-2016


TRANSCRIPT

Page 1: Natural Language driven Image Generation

Natural Language driven Image Generation

Prepared by: Shreya Agarwal
Guide: Mrs. Nirali Nanavati

Page 2: Natural Language driven Image Generation

Introduction

• Natural Language driven Image Generation, as the name suggests, refers to the task of mapping a natural language text to a scene.

• The general processes involved in achieving this task are:
  • Natural Language Understanding
  • Image Retrieval and Positioning

Page 3: Natural Language driven Image Generation

Natural Language Understanding

Natural Languages are those used by humans to communicate with each other on a daily basis. Example: English

Computers cannot understand Natural Language unless it is parsed and represented in a predefined template-like form.

Page 4: Natural Language driven Image Generation

Image Retrieval and Positioning

This part of the process involves retrieving images from the local database or the internet relating to the text.

The final task is to position the images in a manner such that all elements are in their correct places in accordance with the natural language text.

Page 5: Natural Language driven Image Generation

Systems and Techniques

NALIG (NAtural Language driven Image Generation) [1]

Text-to-Picture Synthesis Tool [2]

WordsEye [3]

Carsim [4]

Suggested Technique

Page 6: Natural Language driven Image Generation

NALIG

• Generates images of static scenes

• Proposes a theory for equilibrium and stability

• Works on descriptions given as a phrase of the following form: <subject> <preposition> <object> [ <reference> ]
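The phrase form above can be illustrated with a small parser. This is a minimal sketch, not NALIG's actual parser: the preposition list, the fixed article "the", and the treatment of the optional reference as free trailing text are all assumptions made for illustration.

```python
import re

# Hypothetical preposition list; the source does not enumerate what NALIG accepts.
PREPOSITIONS = ("on", "in", "under", "near", "behind")

# <subject> <preposition> <object> [ <reference> ], with a fixed article "the"
# (an assumption made to keep the pattern short).
PHRASE = re.compile(
    r"^the (?P<subject>\w+) (?P<preposition>" + "|".join(PREPOSITIONS) + r")"
    r" the (?P<object>\w+)(?: (?P<reference>.+))?$"
)


def parse_phrase(text):
    """Split a phrase into its slots; 'reference' is None when it is absent."""
    match = PHRASE.match(text.strip().lower())
    if match is None:
        raise ValueError(f"phrase does not fit the expected form: {text!r}")
    return match.groupdict()


print(parse_phrase("The book on the table"))
```

Anything that does not fit the pattern is rejected, which mirrors the restriction to a predefined phrase form.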

Page 7: Natural Language driven Image Generation

NALIG: Object Taxonomy and Spatial Primitives

• Defines “primitive relationships”. Example: H_SUPPORT(a,b)

• Attributes like FLYING, REPOSITORY, etc. are associated with each object

• Conditions like CANFLY are used. Example: “the airplane on the desert” vs. “the airplane on the runway”
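One plausible reading of the airplane example is that the same preposition maps to different primitive relationships depending on the objects' attributes. The sketch below assumes that reading. The table contents, the rule itself, and the argument order of H_SUPPORT (read here as "a is horizontally supported by b") are assumptions, not taken from the NALIG paper.

```python
# Hypothetical taxonomy entries. The names follow the slide; the contents are made up.
CANFLY = {"airplane", "bird"}
REPOSITORY = {("airplane", "runway"), ("book", "table")}  # (object, usual resting place)


def interpret_on(a, b):
    """Choose a primitive relationship for the phrase 'a on b'."""
    # Assumed rule: an object that can fly and is not at a usual resting
    # place is taken to be flying over b rather than resting on it.
    if a in CANFLY and (a, b) not in REPOSITORY:
        return ("FLYING", a, b)
    return ("H_SUPPORT", a, b)


print(interpret_on("airplane", "desert"))
print(interpret_on("airplane", "runway"))
```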

Page 8: Natural Language driven Image Generation

NALIG: Object Instantiation

All objects mentioned in natural language text are initialized.

If existence of an object depends on another one, it is also instantiated.

Such dependence is stored in relation HAS(a,b) which defines the strict relationship.

Example: “branch blocking the window”
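Instantiating implied objects can be sketched as a closure over the HAS relation. The pairs below are hypothetical: reading HAS(a,b) as "a has b" and assuming that a branch implies a tree and a window implies a house are illustrative guesses, not statements from the source.

```python
# Hypothetical HAS(owner, part) pairs, read as "owner has part".
HAS = {("tree", "branch"), ("house", "window")}


def instantiate(mentioned):
    """Return the mentioned objects plus every object they depend on."""
    objects = list(mentioned)
    pending = list(mentioned)
    while pending:
        part = pending.pop()
        for owner, owned in sorted(HAS):
            if owned == part and owner not in objects:
                objects.append(owner)
                pending.append(owner)  # the owner may itself depend on something
    return objects


print(instantiate(["branch", "window"]))
```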

Page 9: Natural Language driven Image Generation

NALIG: Consistency Checking and Qualitative Reasoning

• Rules known as “naïve statics” are defined to check for equilibrium and stability. Example: the law of gravity is checked.

• Space conditions are checked (object positioning). Example: “The book is on the table”.

Page 10: Natural Language driven Image Generation

NALIG: Advantages

• Successful for limited static scene generation

• Checks equilibrium, space and stability conditions

• Instantiates implied objects

Page 11: Natural Language driven Image Generation

NALIG: Limitations

• Works for predefined form of phrases. Not suitable for full-blown natural language texts

• Fails to construct dynamic scenes

• Low success rate for complex scenes

Page 12: Natural Language driven Image Generation

Text-to-Picture Synthesis Tool

The technique has the following processes:
  • Selecting Keyphrases
  • Selecting Images
  • Picture Layout

Example

Page 13: Natural Language driven Image Generation

Selecting Keyphrases

Uses keyword-based text summarization

Keywords and Phrases extracted based on lexicosyntactic rules

Unsupervised learning approach based on TextRank algorithm

Stationary Distribution of Random walk used to determine relative importance of words.
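The random-walk idea can be sketched as a power iteration over a word co-occurrence graph. This is a minimal sketch under stated assumptions: the window size, the damping factor, and the iteration count are defaults borrowed from the usual PageRank-style formulation of TextRank, not values given in the source, and the lexicosyntactic filtering step is omitted.

```python
from collections import defaultdict


def textrank(words, window=2, damping=0.85, iterations=50):
    """Score words by the stationary distribution of a damped random walk."""
    # Undirected co-occurrence graph: words within `window` positions are linked.
    neighbours = defaultdict(set)
    for i, word in enumerate(words):
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                neighbours[word].add(other)
                neighbours[other].add(word)

    nodes = list(neighbours)
    score = {word: 1.0 / len(nodes) for word in nodes}
    for _ in range(iterations):
        updated = {}
        for word in nodes:
            # Each neighbour passes on an equal share of its current score.
            incoming = sum(score[v] / len(neighbours[v]) for v in neighbours[word])
            updated[word] = (1 - damping) / len(nodes) + damping * incoming
        score = updated
    return score


scores = textrank("cat chased mouse cat ate cheese cat sat mat".split())
print(max(scores, key=scores.get))
```

Words with many distinct co-occurring neighbours accumulate the most score, which is what makes them candidates for keyphrases.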

Page 14: Natural Language driven Image Generation

Selecting Images

Two sources are used in the search for images for the selected keyphrases:
  • Local database of images
  • Internet-based image search engine

Fifteen images are retrieved, and image processing is used to pick the correct image among them.

Page 15: Natural Language driven Image Generation

Picture Layout

The technique aims to convey the gist of the text. Hence, a good layout is characterized as having:
  • Minimum Overlap
  • Centrality
  • Closeness

A Monte Carlo randomized algorithm is used to solve this highly non-convex optimization problem.
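A randomized search of this kind can be sketched as follows. This is an illustration only, not the tool's actual objective: the cost combines pairwise overlap with a centrality penalty using a made-up weight, the closeness term is left out for brevity, and the canvas size and sample count are arbitrary.

```python
import random


def overlap(a, b):
    """Overlapping area of two axis-aligned boxes given as (x, y, w, h)."""
    dx = min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0])
    dy = min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1])
    return max(dx, 0) * max(dy, 0)


def cost(boxes, canvas):
    """Lower is better: little overlap, box centres near the canvas centre."""
    cx, cy = canvas[0] / 2, canvas[1] / 2
    total = 0.0
    for i, a in enumerate(boxes):
        ax, ay = a[0] + a[2] / 2, a[1] + a[3] / 2
        total += 0.01 * ((ax - cx) ** 2 + (ay - cy) ** 2)  # hypothetical weight
        for b in boxes[i + 1 :]:
            total += overlap(a, b)
    return total


def monte_carlo_layout(sizes, canvas=(400, 300), samples=2000, seed=0):
    """Sample random placements and keep the cheapest one found."""
    rng = random.Random(seed)
    best, best_cost = None, float("inf")
    for _ in range(samples):
        boxes = [
            (rng.uniform(0, canvas[0] - w), rng.uniform(0, canvas[1] - h), w, h)
            for w, h in sizes
        ]
        current = cost(boxes, canvas)
        if current < best_cost:
            best, best_cost = boxes, current
    return best, best_cost


layout, layout_cost = monte_carlo_layout([(100, 80), (60, 60), (120, 40)])
print(layout_cost)
```

Random sampling needs no gradient and does not get stuck in a single local minimum, which is why it suits a non-convex objective, at the price of many evaluations.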

Page 16: Natural Language driven Image Generation

Advantages

Successfully conveys the gist of the natural language text

Searches for images online, thus delivering an output for every natural language input

Capable of processing complex sentences

Fit to represent action sequences

Page 17: Natural Language driven Image Generation

Limitations

Does not render a cohesive image

Does not work well for all inputs without a healthy internet connection

Slower than other methods, as it spends time on generating a TextRank graph and a co-occurrence matrix

Page 18: Natural Language driven Image Generation

WordsEye

This system generates a high quality 3D image from a natural language description.

It utilizes a large database of 3D models and poses.

Page 19: Natural Language driven Image Generation

WordsEye: Linguistic Analysis

Utilizes a Part-of-Speech (POS) tagger and a statistical parser to generate a Dependency Representation of the input text. For Example,

Page 20: Natural Language driven Image Generation

WordsEye: Linguistic Analysis

This Dependency Representation is then converted into a Semantic Representation.

It describes the entities in the scene and the relations between them.

Page 21: Natural Language driven Image Generation

WordsEye: Semantic Representation

WordNet is used to find relations between different words.

Personal names are mapped to male/female humanoid bodies.

Spatial prepositions are handled by semantic functions which look at the dependents of the preposition and generate the semantic representation accordingly.

Page 22: Natural Language driven Image Generation

WordsEye: Depictors

Depictors are low-level graphical specifications used to specify scenes.

They control 3D object visibility, size, position, orientation, surface color and transparency.

They are also used to specify poses, control Inverse Kinematics (IK) and modify vertex displacements for facial expression.

Page 23: Natural Language driven Image Generation

WordsEye: Models

Models are stored in the database and have the following associated information:
  • Skeletons
  • Shape Displacements
  • Parts
  • Color Parts
  • Opacity Parts
  • Default Size
  • Functional Properties
  • Spatial Tags

Page 24: Natural Language driven Image Generation

WordsEye: Prepositions denote the layout

If we say “The daisy is in the test tube”, the system finds the cup tag for the test tube and the stem tag for the daisy. Hence, it puts the stem into the cupped opening of the test tube.
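The tag lookup in the daisy example can be sketched with a small table of models. This is a hypothetical illustration: the tag names "cup" and "stem" come from the slide, but the data layout, the coordinates, and the function name `place_in` are invented here and do not reflect WordsEye's internal format.

```python
# Hypothetical spatial tags with made-up 3D anchor points (x, y, z).
MODELS = {
    "test tube": {"cup": {"opening": (0.0, 10.0, 0.0)}},
    "daisy": {"stem": {"base": (0.0, 0.0, 0.0)}},
}


def place_in(figure, container):
    """Return the translation that moves the figure's stem base to the cup opening."""
    stem = MODELS[figure].get("stem")
    cup = MODELS[container].get("cup")
    if stem is None or cup is None:
        raise ValueError(f"cannot depict {figure!r} in {container!r} with these tags")
    return tuple(c - s for c, s in zip(cup["opening"], stem["base"]))


print(place_in("daisy", "test tube"))
```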

Page 25: Natural Language driven Image Generation

WordsEye: Poses

Poses are used to depict a character in a configuration which suggests a particular action being performed.

They are categorized here as:
  • Standalone pose
  • Specialized Usage pose
  • Generic Usage pose
  • Grip pose
  • Bodywear pose

Page 26: Natural Language driven Image Generation

WordsEye: Pose examples

Specialized Usage pose (Cycling)

Grip pose (hold wine bottle)

Generic Usage pose (throw small object)

Page 27: Natural Language driven Image Generation

WordsEye: Depiction Process

Process to convert high level semantic representation into low-level depictors.

Consists of the following tasks:

Convert the semantic representation from the node structure to a list of typed semantic elements where all references have been resolved.

Interpret the semantic representation.

Assign depictors to each semantic element.

Page 28: Natural Language driven Image Generation

WordsEye: Depiction Process

Resolve implicit and conflicting constraints of depictors.

Read in the referenced 3D models.

Apply each assigned depictor to incrementally build up the scene while maintaining constraints.

Add background environment, ground plane, lights.

Adjust the camera (automatically or by hand)

Render

Page 29: Natural Language driven Image Generation

WordsEye: Depiction Rules

Many constraints and conditions are applied so as to generate a coherent scene.

Constraints are explicit and implicit.

Sentences which cannot be depicted are handled by using one of Textualization, Emblematization, Characterization, Conventional Icons or Literalization.

Page 30: Natural Language driven Image Generation

WordsEye: Advantages

Generates high quality 3D images.

The ability to read poses and grips, the use of constraints and the use of IK make the picture coherent.

Depiction rules help in mapping linguistically analyzed text to exact depictors.

Semantic representation lets the depiction process truly understand what is being conveyed.

Page 31: Natural Language driven Image Generation

WordsEye: Limitations

Works on high quality 3D models and hence requires a lot of memory and a fast searching algorithm.

Because of its restriction to its own database, the system does not guarantee an output for all natural language text inputs.

Page 32: Natural Language driven Image Generation

Carsim

Developed to convert text descriptions of road accidents into 3D scenes

Two-tier architecture in which the tiers communicate through a formal representation of the accident.

Page 33: Natural Language driven Image Generation

Carsim: Formalism

The tabular structure generated after parsing the natural language text has the following information:
  • Location of accident and configuration of roads
  • List of road objects
  • Event chains for objects and movements
  • Collision description

Page 34: Natural Language driven Image Generation

Carsim: Information Extraction Module

Utilizes tokenizing, part-of-speech tagging, splitting into sentences, and detecting noun groups, named entities, non-recursive clauses and domain-specific multiwords for:
  • Detecting the participants
  • Marking the events
  • Detecting the roads

Page 35: Natural Language driven Image Generation

Carsim: Scene Synthesis and Visualization

The previously generated template is taken as input.

Rule-based modules are used to check consistency of the scene.

A planner is used to generate vehicle trajectories.

A temporal module is used to assign time intervals to all segments of these trajectories.
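Assigning time intervals to trajectory segments can be sketched as a running sum of segment durations. The representation of a segment as a (length, speed) pair and the constant speed per segment are assumptions made for illustration, not details of Carsim's temporal module.

```python
def assign_intervals(segments, start=0.0):
    """Give each (length_m, speed_m_per_s) segment a (t_begin, t_end) interval."""
    intervals = []
    t = start
    for length, speed in segments:
        duration = length / speed  # constant speed within a segment (assumption)
        intervals.append((t, t + duration))
        t += duration
    return intervals


# A 20 m approach at 10 m/s followed by a 5 m turn at 2.5 m/s.
print(assign_intervals([(20.0, 10.0), (5.0, 2.5)]))
```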

Page 36: Natural Language driven Image Generation

Suggested Technique

This technique is a hybrid of the techniques we have seen so far along with a few additions.

It is a theoretical technique and has not been implemented yet.

Page 37: Natural Language driven Image Generation

Natural Language Understanding

Words of interest will be categorized into the following groups using a part-of-speech (POS) tagger and a named entity recognizer (NER):
  • OBJECT
  • STATE
  • SIZE
  • RELATIVITY
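The grouping step can be sketched with a plain lookup table standing in for the POS tagger and NER, which are not reproduced here. The lexicon entries and the reading of STATE as a descriptive adjective are assumptions; the four group names are the ones listed above.

```python
# Hypothetical lexicon used in place of a POS tagger and named entity recognizer.
LEXICON = {
    "ball": "OBJECT", "table": "OBJECT",
    "broken": "STATE", "open": "STATE",
    "big": "SIZE", "small": "SIZE",
    "on": "RELATIVITY", "under": "RELATIVITY",
}


def categorize(sentence):
    """Sort the words of interest in a sentence into the four groups."""
    groups = {"OBJECT": [], "STATE": [], "SIZE": [], "RELATIVITY": []}
    for token in sentence.lower().replace(".", " ").split():
        group = LEXICON.get(token)
        if group is not None:  # words outside the lexicon are ignored
            groups[group].append(token)
    return groups


print(categorize("The big ball is on the table."))
```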

Page 38: Natural Language driven Image Generation

The template and the co-relation matrix

A co-relation matrix specifies position of each object in the scene with respect to every other object.

The template for each object in the list of objects to be instantiated contains the following information:
  • Size
  • Co-ordinates
  • Image Location
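The two structures described above could be represented as follows. This is a sketch under stated assumptions: the field types, the relation names, and the table of inverse relations are invented for illustration, since the technique is described only in outline.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ObjectTemplate:
    """Per-object template with the three fields named on the slide."""
    name: str
    size: Optional[Tuple[int, int]] = None         # width, height in pixels
    coordinates: Optional[Tuple[int, int]] = None  # filled in by the position determiner
    image_location: Optional[str] = None           # path or URL of the chosen image


# Hypothetical inverse relations, so both directions of a pair stay consistent.
INVERSE = {"on": "under", "under": "on", "left_of": "right_of", "right_of": "left_of"}


def build_matrix(objects, relations):
    """Build matrix[a][b] = position of a with respect to b (None if unknown)."""
    matrix = {a: {b: None for b in objects if b != a} for a in objects}
    for a, relation, b in relations:
        matrix[a][b] = relation
        matrix[b][a] = INVERSE.get(relation)
    return matrix


matrix = build_matrix(["ball", "table", "lamp"], [("ball", "on", "table")])
print(matrix["table"]["ball"])
```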

Page 39: Natural Language driven Image Generation

Image Selection Module

This module finds images using two sources:
  • Internal database of images
  • Internet-based image search engines

The first 10 images are retrieved.

Image processing is used to find the correct image.

This image is stored in the database for future use.

Page 40: Natural Language driven Image Generation

Position Determiner and Synthesis Module

The Position Determiner computes the co-ordinates of each and every image that is to be placed based on the input template (which has the image size and location paths).

The synthesis module resizes all images and places them at the co-ordinates in the template (supplied by the position determiner module)
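As one example of what the Position Determiner might compute, the sketch below places one image on top of another. The coordinate convention (origin at the top left, y growing downward), the horizontal centring, and the function name `position_on` are assumptions; the source does not specify how positions are derived.

```python
def position_on(upper_size, lower_box):
    """Return the top-left (x, y) that rests the upper image on the lower one.

    upper_size is (width, height); lower_box is (x, y, width, height).
    """
    lower_x, lower_y, lower_w, _lower_h = lower_box
    upper_w, upper_h = upper_size
    x = lower_x + (lower_w - upper_w) / 2  # centre horizontally over the support
    y = lower_y - upper_h                  # bottom edge touches the support's top edge
    return (x, y)


# A 20x10 book placed on a table whose box starts at (100, 200) and is 80 wide.
print(position_on((20, 10), (100, 200, 80, 40)))
```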

Page 41: Natural Language driven Image Generation

Introducing Machine Learning

The aim is to finally make a computer think like a human. We can greatly enhance our system by using the techniques of Machine Learning.

The system can be made to learn the objects through unsupervised learning (clustering).

The system can be feedback-controlled, letting the user point out the meanings of terms (SIZE, RELATIVITY, STATE) not previously known.

Page 42: Natural Language driven Image Generation

Advantages

Linguistic analysis is efficient since no statistical or rule-based parser is used.

Searching for images on the internet ensures that an image is generated for every natural language input.

Introducing machine learning makes the system coachable (through user feedback and instant adaptation).

Page 43: Natural Language driven Image Generation

Limitations

It might not generate coherent images for complex sentences since we do not make use of an advanced NLU technique.

It depends on internet availability for finding images not within its local database.

Page 44: Natural Language driven Image Generation

Summary

The methods developed to date for tackling the problem have been explained.

A technique combining the strengths of the existing techniques with some additions has been specified.

A lot of research is still required to make a computer achieve this task as simply as a human brain does.

Page 45: Natural Language driven Image Generation

References

[1] ACL 1984 Proceedings of the 10th International Conference on Computational Linguistics, Natural Language driven Image Generation, Giovanni Adorni, Mauro Di Manzo, Fausto Giunchiglia, University of Genoa

[2] A Text-to-Picture Synthesis System for Augmenting Communication, Xiaojin Zhu, Andrew B. Goldberg, Mohamed Eldawy, Charles R. Dyer, Bradley Strock, University of Wisconsin, Madison

[3] Proceedings of the 28th annual conference on Computer Graphics and interactive techniques 2001, WordsEye: An Automatic Text-to-Scene Conversion System, Bob Coyne, Richard Sproat, AT&T Labs (Research)

[4] Converting Texts of Road Accidents into 3D Scenes, Richard Johansson, David Williams, Pierre Nugues, 2004 TextMean Proceedings

Page 46: Natural Language driven Image Generation

Thank You!