

Beyond Narrative Description: Generating Poetry from Images by Multi-Adversarial Training

Bei Liu∗ (Kyoto University)    Jianlong Fu (Microsoft Research Asia)
[email protected]    [email protected]

Makoto P. Kato (Kyoto University)    Masatoshi Yoshikawa (Kyoto University)
[email protected]    [email protected]

March 22, 2022

Abstract

Automatic generation of natural language from images has attracted extensive attention. In this paper, we take one step further to investigate the generation of poetic language (with multiple lines) for an image, for the purpose of automatic poetry creation. This task involves multiple challenges, including discovering poetic clues from the image (e.g., hope from green) and generating poems that satisfy both relevance to the image and poeticness at the language level. To solve these challenges, we formulate the task of poem generation as two correlated sub-tasks by multi-adversarial training via policy gradient, through which cross-modal relevance and poetic language style can be ensured. To extract poetic clues from images, we propose to learn a deep coupled visual-poetic embedding, in which the poetic representations of objects, sentiments¹ and scenes in an image can be jointly learned. Two discriminative networks are further introduced to guide the poem generation: a multi-modal discriminator and a poem-style discriminator. To facilitate the research, we have collected two poem datasets by human annotators.

∗ This work was performed when Bei Liu was visiting Microsoft Research Asia as a research intern.

¹ In this paper, we consider both adjectives and verbs as sentiment words, which can usually express many types of emotions and feelings in a poem.

The datasets have two distinct properties: 1) the first human-annotated image-to-poem pair dataset (with 8,292 pairs in total), and 2) to date the largest public English poem corpus dataset (with 92,265 different poems in total). Extensive experiments are conducted in which our model generates poems for 8K images, among which 1.5K images are randomly picked for evaluation. Both objective and subjective evaluations show superior performance against the state-of-the-art methods for poem generation from images. A Turing test carried out with over 500 human subjects, among which 30 evaluators are poetry experts, demonstrates the effectiveness of our approach.

1 Introduction

Research that involves both vision and language has attracted great attention recently, as witnessed by the burst of work on image description, such as image captioning and paragraphing [1, 4, 16, 27]. Image description research aims to generate sentence(s) that describe facts in images in human-level language. In this paper, we take one step further to tackle a more cognitive task: generating poetic language for an image for the purpose of poetry creation, which has attracted tremendous interest in both research and industry.

In the natural language processing field, problems related to poem generation



Figure 1: Example of a human-written description and poem for the same image. Description: "A falcon is eating during sunset. The falcon is standing on earth." Poem: "Like a falcon by the night / Hunting as the black knight / Waiting to take over the fight / With all of it's mind and might." We can see a significant difference from words of the same color in these two forms. Instead of describing facts in the image, the poem tends to capture deeper meaning and poetic symbols from objects, scenes and sentiments in the image (such as knight from falcon, hunting and fight from eating, and waiting from standing).

have been studied. For instance, the authors of [11, 32] mainly focused on the quality of style and rhythm. The works in [7, 32, 37] take one more step to generate poems from topics. In industry, Facebook has proposed to generate English rhythmic poetry with neural networks [11], and Microsoft has developed a system called XiaoIce, in which poem generation is one of the most important features. Nevertheless, generating poems from images in an end-to-end fashion remains a new topic with grand challenges.

Compared with image captioning and paragraphing, which focus on generating descriptive sentences about an image, generating poetic language is a more challenging problem: there is a larger gap between visual representations and the poetic symbols that can be inspired by images and that facilitate better poem generation. For example, a "man" detected in image captioning can further indicate "hope" together with "bright sunshine" and "opening arms", or "loneliness" with "empty chairs" and a "dark" background in poem creation. Figure 1 shows a concrete example of the differences between a description and a poem for the same image.

In particular, to generate a poem from an image, we face the following three challenges. First of all, it is a cross-modality problem, unlike poem generation from topics. An intuitive way to generate a poem from an image is to first extract keywords or captions from the image and then use them as seeds for poem

generation, as topic-based poem generation does. However, keywords or captions miss a lot of information in images, not to mention the poetic clues that are important for poem generation [7, 37]. Secondly, compared with image captioning and image paragraphing, poem generation from images is a more subjective task: an image can be relevant to several poems from various aspects, while image captioning/paragraphing is more about describing facts in the image and results in similar sentences. Thirdly, the form and style of poem sentences differ from those of narrative sentences. In this research, we mainly focus on free verse, an open form of poetry. Although we do not require meter, rhyme or other traditional poetic techniques, a poem still retains some sense of poetic structure and poetic language style. We define this quality of a poem as poeticness in this research. For example, poems are usually not very long, specific words are preferred in poems compared with image descriptions, and the sentences in one poem should be consistent with one topic.

To address the above challenges, we propose to collect two poem datasets by human annotators and to conduct research on poetry creation by integrating retrieval and generation techniques in one system. Specifically, to better learn poetic clues from images for poem generation, we first learn a deep coupled visual-poetic embedding model with CNN features of images and skip-thought vector features [15] of poems, using a multi-modal poem dataset (namely "MultiM-Poem") that consists of thousands of image and poem pairs. This embedding model is then used to retrieve relevant and diverse poems for images from a larger uni-modal poem corpus (namely "UniM-Poem"). The images with these retrieved poems, together with MultiM-Poem, form an enlarged image-poem pair dataset (namely "MultiM-Poem (Ex)"). We further leverage state-of-the-art sequential learning techniques to train an end-to-end poem generation model on the MultiM-Poem (Ex) dataset. Such a framework ensures that substantial poetic clues, which are significant for poem generation, can be discovered and modeled from the extended pairs.

To avoid the exposure bias problem caused by the long sequence length (all poem lines together) and the problem that no specific loss is available to score a generated poem, we propose a recurrent neural network (RNN) for poem generation with multi-adversarial training, and


further optimize it by policy gradient. Two discriminative networks are used to provide rewards in terms of the generated poem's relevance to the given image and the poeticness of the generated poem. We conduct experiments on MultiM-Poem, UniM-Poem and MultiM-Poem (Ex) to generate poems for images. The generated poems are evaluated in both automatic and manual ways. We define automatic evaluation metrics concerning relevance, novelty and translative consistency, and conduct user studies about relevance, coherence and imaginativeness to compare our generated poems with those generated by baseline methods. The contributions of this research are summarized as follows:

• We propose to generate poems (English free verse) from images in an automatic fashion. To the best of our knowledge, this is the first attempt to study the image-inspired poem generation problem in a holistic framework, which enables a machine to approach human capability in cognitive tasks.

• We incorporate a deep coupled visual-poetic embedding model and an RNN-based generator for joint learning, in which two discriminators provide rewards for measuring cross-modality relevance and poeticness by multi-adversarial training.

• We collect the first paired dataset of images and poems annotated by human annotators and the largest public poem corpus dataset. Extensive experiments demonstrate the effectiveness of our approach compared with several baselines, using both automatic and manual evaluation metrics, including a Turing test with more than 500 human subjects. To better promote research on poetry generation from images, we will release these datasets in the near future.

2 Related Work

2.1 Poetry Generation

Traditional approaches for poetry generation include template- and grammar-based methods [19, 20, 21], generative summarization under constrained optimization [32], and statistical machine translation models [10, 12]. With the application of deep learning in recent years, research on poetry generation has entered a new stage. Recurrent neural networks are widely used to generate poems that can even prevent readers from telling them apart from poems written by human poets [7, 8, 11, 33, 37]. Previous works on poem generation mainly focus on the style and rhythmic qualities of poems [11, 32], while recent studies introduce a topic as a condition for poem generation [7, 8, 32, 37]. For a poem, a topic is still a rather abstract concept without a specific scenario. Inspired by the fact that many poems were created when poets were in a particular scenario and viewing specific scenes, we take one step further to tackle the problem of generating poems inspired by a visual scenario. Compared with previous research, our work faces more challenges, especially in terms of the multi-modal problems involved.

2.2 Image Description

Image captioning was first regarded as a retrieval problem that aims to search captions in a dataset for a given image [5, 13], and hence cannot provide accurate and proper descriptions for all images. To overcome this problem, methods such as template filling [17] and paradigms integrating convolutional neural networks (CNN) and recurrent neural networks (RNN) [2, 27, 34] were proposed to generate readable human-level sentences. Recently, generative adversarial networks (GAN) have been applied to generate captions under different problem settings [1, 35]. Image paragraphing has followed a similar path: recent research mainly focuses on region detection and hierarchical structures for the generated sentences [16, 18, 23]. However, as we have noted, image captioning and paragraphing aim to generate descriptive sentences that tell the facts in images, while poem generation tackles a more advanced linguistic form that requires poeticness and language-style constraints.

3 Approach

In this research, we aim to generate poems from images such that the generated poems are relevant to the input images and satisfy poeticness. For this purpose, we cast our problem in a multi-adversarial procedure [9] and further optimize it with policy gradient [30, 36]. A CNN-RNN generative model acts as an agent. The parameters of this agent define a policy whose execution decides which word


Figure 2: The framework of poetry generation with multi-adversarial training. We first use image and poem pairs (a) from the human-annotated paired image and poem dataset (MultiM-Poem) to train a deep coupled visual-poetic embedding model (e). The image features (b) are poetic multi-CNN features obtained by fine-tuning CNNs with poetic symbols (e.g., objects, scenes and sentiments) extracted from poems by a POS parser (the Stanford NLP tool). The sentence features (d) of poems are extracted from a skip-thought model (c) trained on the largest public poem corpus (UniM-Poem). An RNN-based sentence generator (f) is trained as the agent, and two discriminators, providing multi-modal (g) and poem-style (h) critiques of a generated poem for a given image, provide rewards to the policy gradient (i). The POS parser extracts part-of-speech words from poems.

to be picked as an action. When the agent has picked all words in a poem, it observes a reward. We define two discriminative networks to serve as rewards, concerning whether the generated poem is paired with the input image and whether the generated poem is poetic. The goal of our poem generation model is to generate a sequence of words as a poem for an image so as to maximize the expected end reward. This policy-gradient method has shown significant effectiveness for many tasks with non-differentiable metrics [1, 24, 35].

As shown in Figure 2, the framework consists of two parts: (1) a deep coupled visual-poetic embedding model (e) to learn poetic representations from images, and (2) a multi-adversarial training procedure optimized by policy gradient. An RNN-based generator (f) serves as the agent, and two discriminative networks (g and h) provide rewards to the policy gradient.

3.1 Deep Coupled Visual-Poetic Embedding

The goal of the visual-poetic embedding model [6, 14] is to learn an embedding space onto which points of different modalities, e.g., images and sentences, can be projected. Similarly to the image captioning problem, we assume that a paired image and poem share similar poetic semantics, which makes the embedding space learnable. By embedding both images and poems into the same feature space, we can directly compute the relevance between a poem and an image via their poetic vector representations. Moreover, the embedding feature can be further utilized to initialize an optimized representation of poetic clues for poem generation.

The structure of our deep coupled visual-poetic embedding model is shown in the left part of Figure 2. For the image input, we leverage deep convolutional neural networks (CNN) covering three aspects that indicate important


poetic clues in images, namely object (v_1), scene (v_2) and sentiment (v_3), chosen after conducting a prior user study about important factors for poem creation from images. We observed that concepts in poems are often imaginative and poetic, while concepts in the classification datasets we use to train our CNN models are concrete and common. To narrow the semantic gap between the visual representation of images and the textual representation of poems, we propose to fine-tune these three networks on the MultiM-Poem dataset. Specifically, frequently used keywords about objects, sentiments and scenes in the poems are picked as the label vocabulary, and we then build three multi-label datasets based on the MultiM-Poem dataset for object, sentiment and scene detection, respectively. Once the multi-label datasets are built, we fine-tune the pre-trained CNN models on the three datasets independently, optimizing the sigmoid cross-entropy loss shown in Eqn. (1). After that, we take the D-dimensional deep features for each aspect from the penultimate fully-connected layer of the CNN models and obtain a concatenated N-dimensional (N = D × 3) feature vector v ∈ R^N as input to the visual-poetic embedding for each image:

loss = -(1/N) Σ_{n=1}^{N} [ t_n log p_n + (1 - t_n) log(1 - p_n) ],    (1)

v_1 = f_Object(I),  v_2 = f_Scene(I),  v_3 = f_Sentiment(I),  v = (v_1, v_2, v_3),    (2)

in which we use the "FC7" layer outputs as the features for v_1, v_2, v_3. The visual-poetic embedding vector x is a K-dimensional vector representing the image embedding, obtained by a linear mapping from the image features:

x = W_v · v + b_v ∈ R^K,    (3)

where W_v ∈ R^{K×N} is the image embedding matrix and b_v ∈ R^K is the image bias vector. Meanwhile, the representation feature vector of a poem is computed as the average of its sentences' skip-thought vectors [15]. The combine-skip vector, an M-dimensional vector denoted by t ∈ R^M, is used since it demonstrates better performance in [15]. The skip-thought model is trained on the UniM-Poem dataset. Similarly to the image embedding, the poem embedding is denoted as:

m = W_t · t + b_t ∈ R^K,    (4)

where W_t ∈ R^{K×M} is the poem embedding matrix and b_t ∈ R^K is the poem bias vector. Finally, the image and poem are embedded together by minimizing a pairwise ranking loss with dot-product similarity:

L = Σ_x Σ_k max(0, α - x·m + x·m_k) + Σ_m Σ_k max(0, α - m·x + m·x_k),    (5)

where m_k is a contrastive (irrelevant, unpaired) poem for image embedding x, and vice versa for x_k; α denotes the contrastive margin. As a result, the trained model produces higher cosine similarity (consistent with dot-product similarity) between embedding features of original image-poem pairs than between those of randomly generated pairs.
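To make the two mappings and the ranking loss concrete, the following is a minimal PyTorch sketch of how Eqns. (3)-(5) could be implemented. The feature sizes and the margin follow the paper (D = 4096 per CNN, M = 2048, α = 0.2), while the embedding size K, the class names and the use of in-batch negatives are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn


class VisualPoeticEmbedding(nn.Module):
    """Sketch of the deep coupled visual-poetic embedding (Section 3.1)."""

    def __init__(self, img_dim=4096 * 3, poem_dim=2048, embed_dim=512):
        super().__init__()
        # Linear mappings of Eqns. (3) and (4): x = W_v*v + b_v, m = W_t*t + b_t.
        self.img_proj = nn.Linear(img_dim, embed_dim)
        self.poem_proj = nn.Linear(poem_dim, embed_dim)

    def forward(self, v, t):
        # v: concatenated object/scene/sentiment CNN features, shape (B, 3D)
        # t: mean-pooled skip-thought vectors of a poem's lines, shape (B, M)
        return self.img_proj(v), self.poem_proj(t)


def pairwise_ranking_loss(x, m, alpha=0.2):
    """Eqn. (5): hinge loss over contrastive pairs with dot-product similarity.

    x, m: embedded images and poems, shape (B, K); row i of x is paired with
    row i of m, and every other row serves as a contrastive (unpaired) sample.
    """
    scores = x @ m.t()                      # (B, B) matrix of similarities
    positive = scores.diag().unsqueeze(1)   # x_i . m_i
    # max(0, alpha - x.m + x.m_k) over contrastive poems m_k ...
    cost_poem = (alpha - positive + scores).clamp(min=0)
    # ... and max(0, alpha - m.x + m.x_k) over contrastive images x_k.
    cost_img = (alpha - positive.t() + scores).clamp(min=0)
    # Remove the diagonal terms (the paired samples themselves).
    eye = torch.eye(scores.size(0), device=scores.device).bool()
    return cost_poem.masked_fill(eye, 0).sum() + cost_img.masked_fill(eye, 0).sum()
```

In training, x and m would come from a mini-batch of image-poem pairs; the paper instead re-samples 127 contrastive poems per image each epoch (Section 4.5), which the in-batch negatives above only approximate.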

3.2 Poem Generator as an Agent

A conventional CNN-RNN model for image captioning serves as the agent in our approach. Instead of using the hierarchical methods recently applied to generating multiple sentences in image paragraphing [16], we use a non-hierarchical recurrent model that treats the end-of-sentence token as a word in the vocabulary. The reason is that poems often consist of fewer words than paragraphs. Moreover, the hierarchy between sentences in the training poems is less consistent, which makes it much harder to learn. We also conduct an experiment with a hierarchical recurrent language model as a baseline and report the result in the experiment section.

The generative model includes a CNN image encoder and an RNN poem decoder. In this research, we apply Gated Recurrent Units (GRUs) [3] for the decoder. We use the image-embedding features learned by the deep coupled visual-poetic embedding model explained in Section 3.1 as the input of the image encoder. Suppose θ denotes the parameters of the model. Traditionally, our target is to learn θ by maximizing the likelihood of the observed sentence


y = y_{1:T} ∈ Y*, where T is the maximum length of a generated sentence (including <BOS> for start of sentence, <EOS> for end of sentence, and line breaks) and Y* denotes the space of all sequences of selected words.
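Before turning to the rewards, the following is a minimal sketch of such a non-hierarchical GRU decoder conditioned on the visual-poetic embedding. The class name, sizes, and the way the image embedding initializes the hidden state are illustrative assumptions; the key point is that <EOS> and line breaks are ordinary vocabulary items, so the poem is one flat word sequence.

```python
import torch
import torch.nn as nn


class PoemGenerator(nn.Module):
    """Sketch of the CNN-RNN agent: image embedding in, one flat word sequence out."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)
        # Project the visual-poetic embedding x (Section 3.1) to the initial GRU state.
        self.init_state = nn.Linear(embed_dim, hidden_dim)

    def forward(self, image_embedding, tokens):
        # tokens: (B, T) word ids starting with <BOS>; line breaks and <EOS> are
        # ordinary vocabulary entries, so no sentence-level hierarchy is needed.
        h0 = torch.tanh(self.init_state(image_embedding)).unsqueeze(0)  # (1, B, H)
        emb = self.word_emb(tokens)                                     # (B, T, E)
        hidden, _ = self.gru(emb, h0)
        return self.out(hidden)  # (B, T, vocab) logits for p_theta(y_t | y_{1:t-1})
```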

Let r(y_{1:t}) denote the reward achieved at time t, and let R(y_{1:T}) be the cumulative reward, namely R(y_{k:T}) = Σ_{t=k}^{T} r(y_{1:t}). Let p_θ(y_t | y_{1:(t-1)}) be the parametric conditional probability of selecting y_t at time step t given all the previous words y_{1:(t-1)}; p_θ is defined as a parametric function of the policy θ. The objective of policy gradient for each batch is the expected future reward, a sum over all sequences of valid actions. Iterating over all possible action sequences is exponential, but we can write the objective as an expectation so that it can be approximated with an unbiased estimator:

J(θ) = Σ_{y_{1:T} ∈ Y*} p_θ(y_{1:T}) R(y_{1:T}) = E_{y_{1:T} ~ p_θ} [ Σ_{t=1}^{T} r(y_{1:t}) ].    (6)

We aim to maximize J(θ) by following its gradient:

∇_θ J(θ) = E_{y_{1:T} ~ p_θ} [ ( Σ_{t=1}^{T} ∇_θ log p_θ(y_t | y_{1:(t-1)}) ) ( Σ_{t=1}^{T} r(y_{1:t}) ) ].    (7)

In practice, the expected gradient can be approximated with a Monte Carlo sample by sequentially sampling each y_t from the model distribution p_θ(y_t | y_{1:(t-1)}) for t from 1 to T. As discussed in [24], a baseline b can be introduced to reduce the variance of the gradient estimate without changing its expectation. Thus, the expected gradient with a single sample is approximated as follows:

∇_θ J(θ) ≈ Σ_{t=1}^{T} ∇_θ log p_θ(y_t | y_{1:(t-1)}) · Σ_{t=1}^{T} (r(y_{1:t}) - b_t).    (8)
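A minimal sketch of the single-sample estimator of Eqn. (8) follows, under simplifying assumptions: the reward is terminal (observed once the whole poem is generated, as described at the start of Section 3), the baseline is a per-batch scalar or vector supplied by the caller, and all names are illustrative rather than the authors' implementation.

```python
import torch.nn.functional as F


def policy_gradient_loss(logits, sampled_tokens, reward, baseline):
    """Single-sample REINFORCE loss whose gradient follows Eqn. (8).

    logits:         (B, T, vocab) generator outputs on its own samples
    sampled_tokens: (B, T) token ids y_t drawn sequentially from p_theta
    reward:         (B,) end reward R(y_{1:T}) from the discriminators (Eqn. 13)
    baseline:       (B,) or scalar variance-reducing baseline b
    """
    log_probs = F.log_softmax(logits, dim=-1)
    # sum_t log p_theta(y_t | y_{1:t-1}) for the tokens actually sampled
    seq_log_prob = log_probs.gather(-1, sampled_tokens.unsqueeze(-1)).squeeze(-1).sum(1)
    advantage = (reward - baseline).detach()
    # Minimizing this loss ascends the gradient of J(theta) for a terminal reward.
    return -(seq_log_prob * advantage).mean()
```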

3.3 Discriminators as Rewards

A good poem for an image has to satisfy at least two criteria: the poem (1) is relevant to the image, and (2) has some sense of poeticness concerning proper length, the poem's language style and consistency between sentences. Based on these two requirements, we propose two discriminative networks to guide the generated poems: a multi-modal discriminator and a poem-style discriminator. Deep discriminative networks have shown great effectiveness in text classification tasks [1, 35], especially for tasks for which good loss functions cannot be established. In this work, both discriminators we propose have several classes, including one positive class and several negative classes.

Multi-Modal Discriminator. To check whether the generated poem y is paired with the input image x, we train a multi-modal discriminator (D_m) to classify (x, y) as paired, unpaired or generated. D_m includes a multi-modal encoder, a modality fusion layer and a classifier with a softmax function:

c = LSTM_ρ(y),    (9)
f = tanh(W_x · x + b_x) ⊙ tanh(W_c · c + b_c),    (10)
C_m = softmax(W_m · f + b_m),    (11)

where ρ, W_x, b_x, W_c, b_c, W_m, b_m are parameters to be learned, ⊙ is element-wise multiplication, and C_m denotes the probabilities over the three classes of the multi-modal discriminator. We utilize an LSTM-based sentence encoder for discriminator training. Equation (11) gives the probability of (x, y) being classified into each class, denoted by C_m(c | x, y) where c ∈ {paired, unpaired, generated}.
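The following is a small sketch of how Eqns. (9)-(11) could be realized; the layer sizes and names are illustrative assumptions, and the three output columns stand for {paired, unpaired, generated}.

```python
import torch
import torch.nn as nn


class MultiModalDiscriminator(nn.Module):
    """Sketch of D_m: classifies an (image, poem) pair as paired / unpaired / generated."""

    def __init__(self, vocab_size, img_dim=512, embed_dim=256, hidden_dim=256):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)  # LSTM_rho
        self.img_fc = nn.Linear(img_dim, hidden_dim)      # W_x, b_x
        self.poem_fc = nn.Linear(hidden_dim, hidden_dim)  # W_c, b_c
        self.classifier = nn.Linear(hidden_dim, 3)        # W_m, b_m -> 3 classes

    def forward(self, image_embedding, poem_tokens):
        # Eqn. (9): encode the poem with an LSTM and keep the final hidden state c.
        _, (h_n, _) = self.encoder(self.word_emb(poem_tokens))
        c = h_n[-1]                                        # (B, hidden_dim)
        # Eqn. (10): fuse the two modalities by element-wise multiplication.
        f = torch.tanh(self.img_fc(image_embedding)) * torch.tanh(self.poem_fc(c))
        # Eqn. (11): probabilities C_m over {paired, unpaired, generated}.
        return torch.softmax(self.classifier(f), dim=-1)
```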

Poem-Style Discriminator. In contrast with most poem generation research that emphasizes meter, rhyme or other traditional poetic techniques, we focus on free verse, an open form of poetry. Even so, we require our generated poems to have the quality of poeticness as defined in Section 1. Without making specific templates or rules for poems, we propose a poem-style discriminator (D_p) to guide generated poems towards human-written poems. In D_p, generated poems are classified into four classes: poetic, disordered, paragraphic and generated.

The class poetic serves as the positive example of poems that satisfy poeticness; the other three classes are all regarded as negative examples. The disordered class concerns the inner structure and coherence between the sentences of a poem, and the paragraphic class uses paragraph sentences as negative examples. In D_p, we use UniM-Poem as the positive poetic samples. To construct disordered poems, we first build a poem-sentence pool by splitting all poems in UniM-Poem; examples of the disordered class are then poems reconstructed from sentences randomly picked from this pool with a reasonable number of lines. The paragraph dataset provided by [16]


is used for the paragraphic examples.

A complete generated poem y is encoded by an LSTM and passed to a fully connected layer, and the probability of it falling into the four classes is computed by a softmax function:

C_p = softmax(W_p · LSTM_η(y) + b_p),    (12)

where η, W_p, b_p are parameters to be learned. The probability of classifying a generated poem y into a class c is formulated as C_p(c | y), where c ∈ {poetic, disordered, paragraphic, generated}.

Reward Function. We define the reward function for policy gradient as a linear combination of the probabilities of classifying the generated poem y for an input image x into the positive classes (paired for the multi-modal discriminator D_m and poetic for the poem-style discriminator D_p), weighted by a tradeoff parameter λ:

R(y | ·) = λ C_m(c = paired | x, y) + (1 - λ) C_p(c = poetic | y).    (13)
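D_p has the same LSTM-plus-softmax structure as Eqn. (12), without the fusion step and with four output classes, so a sketch analogous to the one above applies. A small sketch of how the two positive-class probabilities could be combined into Eqn. (13) follows, assuming column 0 holds "paired" in D_m and "poetic" in D_p, and λ = 0.8 as in Section 4.5; the column ordering is an assumption for illustration.

```python
def poem_reward(cm_probs, cp_probs, lam=0.8):
    """Eqn. (13): R = lam * C_m(paired | x, y) + (1 - lam) * C_p(poetic | y).

    cm_probs: (B, 3) softmax output of D_m; column 0 assumed to be 'paired'
    cp_probs: (B, 4) softmax output of D_p; column 0 assumed to be 'poetic'
    """
    return lam * cm_probs[:, 0] + (1.0 - lam) * cp_probs[:, 0]
```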

3.4 Multi-Adversarial Training

Before adversarial training, we pre-train a generator based on the image captioning method of [27], which provides a better policy initialization for the generator. The generator and discriminators are then iteratively updated in an adversarial way. The generator aims to generate poems that receive higher rewards from both discriminators so that they can fool the discriminators, while the discriminators are trained to distinguish the generated poems from paired and poetic poems. The probabilities of classifying a generated poem into the positive classes of both discriminators are used as rewards for the policy gradient, as explained above.

The multiple discriminators (two in this work) are trained with positive examples from the real data (paired poems in D_m and the poem corpus in D_p) and negative examples from poems generated by the generator, as well as other negative forms of real data (unpaired poems in D_m, paragraphs and disordered poems in D_p). Meanwhile, by employing policy gradient and Monte Carlo sampling, the generator is updated based on the expected rewards from the multiple discriminators. Since we have two discriminators, we apply a multi-adversarial training method that trains the two discriminators in parallel.
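As a rough sketch of the discriminator side of one such iteration, the losses could be formed as below; the class indices, helper names and the use of probability outputs with a negative log-likelihood are illustrative assumptions, and details such as Monte Carlo rollouts for intermediate rewards are omitted. The generator step would then score freshly sampled poems with the reward of Eqn. (13) and apply the policy-gradient loss sketched in Section 3.2.

```python
import torch
import torch.nn.functional as F

# assumed class indices for the two discriminators
PAIRED, UNPAIRED, GENERATED_M = 0, 1, 2
POETIC, DISORDERED, PARAGRAPHIC, GENERATED_P = 0, 1, 2, 3


def discriminator_losses(d_multi, d_style, images, poems, unpaired,
                         fake, disordered, paragraphs):
    """Cross-entropy losses for D_m and D_p on one batch (Section 3.4).

    `fake` are poems sampled from the current generator; `disordered` are poems
    rebuilt from randomly picked UniM-Poem lines; `paragraphs` come from [16].
    """
    def ce(probs, cls):
        # The discriminators output softmax probabilities, so take the NLL
        # of the target class for every example in the batch.
        target = torch.full((probs.size(0),), cls, dtype=torch.long)
        return F.nll_loss(torch.log(probs + 1e-8), target)

    dm_loss = ce(d_multi(images, poems), PAIRED) \
        + ce(d_multi(images, unpaired), UNPAIRED) \
        + ce(d_multi(images, fake), GENERATED_M)
    dp_loss = ce(d_style(poems), POETIC) \
        + ce(d_style(disordered), DISORDERED) \
        + ce(d_style(paragraphs), PARAGRAPHIC) \
        + ce(d_style(fake), GENERATED_P)
    return dm_loss, dp_loss  # minimized in parallel for the two discriminators
```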


Figure 3: Examples in two datasets: UniM-Poem and MultiM-Poem.

Name              #Poem    #Line/poem   #Word/line
MultiM-Poem       8,292    7.2          5.7
UniM-Poem         93,265   5.7          6.2
MultiM-Poem (Ex)  26,161   5.4          5.9

Table 1: Detailed information about the three datasets. The first two datasets were collected by ourselves, and the third one is extended by VPE.

4 Experiments

4.1 Datasets

To facilitate research on poetry generation from images, we collected two poem datasets: one consists of image and poem pairs, namely the Multi-Modal Poem dataset (MultiM-Poem), and the other is a large poem corpus, namely the Uni-Modal Poem dataset (UniM-Poem). Using the embedding model we have trained, the image and poem pairs are extended by adding the three nearest-neighbor poems from the poem corpus without redundancy, yielding an extended image-poem pair dataset denoted MultiM-Poem (Ex). Detailed information about these datasets is listed in Table 1, and examples of the two collected datasets are shown in Figure 3. To better promote research on poetry generation from images, we will release these datasets in the near future.

For the MultiM-Poem dataset, we first crawled 34,847


image-poem pairs from Flickr groups that aim to use images to illustrate poems written by humans. Five human assessors majoring in English literature were then asked to label each pair as relevant or irrelevant by judging whether the image could exactly inspire the poem, considering the associations of objects, sentiments and scenes. We filtered out pairs labeled as irrelevant and kept the remaining 8,292 pairs to construct the MultiM-Poem dataset.

UniM-Poem is crawled from several public online poetry websites, such as Poetry Foundation², PoetrySoup³, best-poem.net and poets.org. To achieve robust model training, a pre-processing procedure is conducted to filter out poems with too many lines (> 10) or too few lines (< 3). We also remove poems with strange characters, poems in languages other than English, and duplicate poems.

4.2 Compared Methods

To investigate the effectiveness of the proposed method, we compare it with four baseline models under different settings. Show-and-tell [27] and SeqGAN [35] are selected for their state-of-the-art results in image captioning, and a competitive image paragraphing model is selected for its strong capability of modeling diverse image content. Note that all methods use MultiM-Poem (Ex) as the training dataset and can generate multiple lines as poems. The methods and experimental settings are as follows:

Show and tell (1CNN): a CNN-RNN model trained with only the object CNN (VGG-16).

Show and tell (3CNNs): a CNN-RNN model trained with the three CNN features (VGG-16).

SeqGAN: a CNN-RNN model optimized with one discriminator that is trained to distinguish generated poems from ground-truth poems.

Regions-Hierarchical: a hierarchical paragraph generation model based on [16]. To better align with the poem distribution, we restrict poems to at most 10 lines with up to 10 words per line in this experiment.

Our Model: To demonstrate the effectiveness of the two discriminators, we train our model (Image to Poem

² https://www.poetryfoundation.org/
³ https://www.poetrysoup.com/

with GAN, I2P-GAN) in four settings: the pre-trained model without discriminators (I2P-GAN w/o discriminator), with the multi-modal discriminator only (I2P-GAN w/ Dm), with the poem-style discriminator only (I2P-GAN w/ Dp), and with both discriminators (I2P-GAN).

4.3 Automatic Evaluation Metrics

Evaluating poems is generally a difficult task, and there are no established metrics in existing works, not to mention for the new task of generating poems from images. To better assess the quality of the poems, we propose to evaluate them in both automatic and manual ways.

We employ three evaluation metrics for automatic measurement, i.e., BLEU, novelty and relevance. An overall score is then summarized from the three metrics after normalization.

BLEU. We first use the Bilingual Evaluation Understudy (BLEU) [22] score to examine how well the generated poems approximate the ground-truth ones, as is commonly done in image captioning and paragraphing research; it has also been used in some poem generation works [32]. For each image, we only use the human-written poems as ground truth.

Novelty. By introducing the discriminator D_p, the generator is expected to introduce words or phrases from the UniM-Poem dataset, resulting in words or phrases that are not very frequent in MultiM-Poem (Ex). We use novelty, as proposed by [31], to measure the number of infrequent words or phrases observed in the generated poems. Two scales of n-gram are explored, i.e., bigram and trigram, giving Novelty-2 and Novelty-3. We first rank the n-grams that occur in the MultiM-Poem (Ex) training set and take the top 2,000 as frequent ones. Novelty is then computed as the proportion of n-grams in the generated poem that occur in the training dataset but are not among the frequent ones.
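A small sketch of how Novelty-n could be computed as described, assuming poems are given as tokenized word lists; the frequent-set size of 2,000 follows the text, while averaging over poems is an assumption.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novelty(generated_poems, training_poems, n=2, top_k=2000):
    """Novelty-n: share of a generated poem's n-grams that occur in the training
    corpus but are NOT among its top_k most frequent n-grams (Section 4.3)."""
    counts = Counter(g for poem in training_poems for g in ngrams(poem, n))
    training_set = set(counts)
    frequent = {g for g, _ in counts.most_common(top_k)}

    scores = []
    for poem in generated_poems:
        grams = ngrams(poem, n)
        if not grams:
            continue
        infrequent = [g for g in grams if g in training_set and g not in frequent]
        scores.append(len(infrequent) / len(grams))
    return sum(scores) / len(scores) if scores else 0.0
```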

Relevance. Different from poem generation research that puts no or only weak constraints on poem content, we consider the relevance of the generated poem to the given image an important measurement in this research. However, unlike captions, which concern factual descriptions of images, different poems can be relevant to the same image from various aspects. Thus, instead of computing the relevance between a generated poem and the ground-truth poems, we define the relevance between a poem and an image using our learned deep coupled visual-poetic embedding


model. After mapping the image and the poem to the same space through our embedding model, cosine similarity is used to measure their relevance. Although our embedding model can approximately quantify the relevance between images and poems, we also leverage subjective evaluation to better explore the effectiveness of our generated poems at the human level.
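A minimal sketch of this embedding-based relevance score, reusing the embedding sketch from Section 3.1; the function name and batching are illustrative.

```python
import torch.nn.functional as F


def embedding_relevance(model, image_features, poem_features):
    """Cosine similarity between an image and a poem in the learned embedding space."""
    x, m = model(image_features, poem_features)   # project both modalities (Eqns. 3-4)
    return F.cosine_similarity(x, m, dim=-1)      # one relevance score per pair
```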

Overall. We compute an overall score based on the above three metrics. Each value a_i among all values of one metric a is first normalized as:

a' = (a - min(a)) / min(a).    (14)

After that, we average the normalized values for BLEU (i.e., BLEU-1, BLEU-2 and BLEU-3) and for novelty (i.e., Novelty-2 and Novelty-3). A final score is computed by averaging the normalized values, ensuring an equal contribution from the different metrics.
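A small sketch of the overall score, assuming the normalization of Eqn. (14) is applied per metric across the compared methods, with BLEU and novelty already averaged over their n-gram variants; the function names are our own.

```python
def normalize(values):
    """Eqn. (14): a' = (a - min(a)) / min(a), applied across methods for one metric."""
    lo = min(values)
    return [(a - lo) / lo for a in values]


def overall_scores(bleu_avgs, novelty_avgs, relevances):
    """Average of the three normalized metrics, one overall score per compared method."""
    cols = [normalize(bleu_avgs), normalize(novelty_avgs), normalize(relevances)]
    return [sum(vals) / len(vals) for vals in zip(*cols)]
```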

However, in such an open-ended task, there are no particularly suitable metrics that can perfectly evaluate the quality of generated poems; the automatic metrics we use can only be regarded as guidance to some extent. To better illustrate the quality of the poems from the perspective of human perception, we further conduct extensive user studies as follows.

4.4 Human Evaluation

We conducted human evaluation on Amazon Mechanical Turk (AMT). In particular, three types of tasks were assigned to AMT workers as follows:

Task 1: to explore the effectiveness of our deep coupled visual-poetic embedding model, annotators were requested to give a score on a 0-10 scale to a poem given an image, considering their relevance in terms of content, emotion and scene.

Task 2: the aim of this task is to compare the poems generated by different methods (four baseline methods and our four model settings) for one image on different aspects. Given an image, the annotators were asked to rate a poem on a 0-10 scale with respect to four criteria: relevance (to the image), coherence (whether the poem is coherent across lines), imaginativeness (how imaginative and creative the poem is for the given image) and overall impression.

Task 3: a Turing test was conducted by asking annotators to pick the human-written poem from a mix of human-written and generated poems. Note that the Turing test was implemented in two settings, i.e., poems with images and poems only.

For each task, we randomly picked 1K images, and each task was assigned to three assessors. As poetry is a form of literature, we also asked 30 annotators whose majors are related to English literature (among whom ten are native English speakers) to serve as expert users for the Turing test.

4.5 Training Details

In the deep coupled visual-poetic embedding model, we use D = 4,096-dimensional features for each CNN. Object features are extracted from VGG-16 [26] trained on ImageNet [25], scene features from the Places205-VGGNet model [29], and sentiment features from the sentiment model of [28].

To better extract visual features for poetic symbols, we first collect the nouns, verbs and adjectives with a frequency of at least five in the UniM-Poem dataset. We then manually picked adjectives and verbs for sentiment (328 labels), nouns for objects (604 labels) and scenes (125 labels). For poem features, we extract a combined skip-thought vector with M = 2,048 dimensions (in which each 1,024-dimensional half represents the uni-directional and bi-directional encodings, respectively) for each sentence, and obtain the poem features by mean pooling. The margin α is set to 0.2 based on the empirical experiments in [14]. We randomly select 127 poems as unpaired poems for an image and use them as contrastive poems (m_k and x_k in Equation (5)), re-sampling them in each epoch. We empirically set the tradeoff parameter λ = 0.8 by comparing automatic evaluation results for values from 0.1 to 0.9.
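For reference, the hyper-parameters above can be gathered into a single configuration dictionary; the grouping and key names below are our own and purely illustrative.

```python
EMBEDDING_CONFIG = {
    "cnn_feature_dim_D": 4096,           # per-aspect CNN feature size (object/scene/sentiment)
    "image_feature_dim_N": 4096 * 3,     # concatenated visual feature v
    "skip_thought_dim_M": 2048,          # combine-skip vector (1024 uni + 1024 bi)
    "label_counts": {"sentiment": 328, "object": 604, "scene": 125},
    "ranking_margin_alpha": 0.2,         # margin in Eqn. (5), following [14]
    "contrastive_poems_per_image": 127,  # re-sampled every epoch
    "reward_tradeoff_lambda": 0.8,       # lambda in Eqn. (13)
}
```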

4.6 Evaluations

Retrieved Poems. We compare three kinds of poems in terms of their relevance to images: ground-truth poems, poems retrieved with VPE and image features before fine-tuning (VPE w/o FT), and poems retrieved with VPE and fine-tuned image features (VPE w/ FT). Table 2 shows a comparison of these three types of poems on a scale



Figure 4: Examples of poems generated by six methods for an image.

            Ground-Truth   VPE w/o FT   VPE w/ FT
Relevance   7.22           6.32         5.82

Table 2: Average relevance score to images for three types of human-written poems on a 0-10 scale (0 = irrelevant, 10 = relevant). One-way ANOVA revealed that the evaluation of these poems is statistically significant (F(2, 9) = 130.58, p < 1e-10).

of 0-10 (0 means irrelevant and 10 means the most relevant). We can see that, by using the proposed visual-poetic embedding model, the retrieved poems achieve a relevance score above the average score (i.e., a score of five), and image features fine-tuned with poetic symbols improve the relevance significantly.

Generated Poems. Table 3 shows the automatic evaluation results of the proposed model in its four settings, as well as the four baselines from previous works. Comparing the results of the caption model with one CNN and with three CNNs, we can see that multiple CNNs indeed help to generate poems that are more relevant to images. The Regions-Hierarchical model emphasizes topic coherence between sentences, while many human-written poems cover several topics or use different symbols for one topic. SeqGAN shows the advantage of applying adversarial training for poem generation compared with the plain CNN-RNN caption models, but lacks the ability to generate novel concepts in poems. The better performance of our pre-trained model with VPE compared with the caption model demonstrates the effectiveness of VPE in extracting poetic features from images for better poem generation. Our three models outperform the others on most metrics, with each performing better on one aspect. The model with only the multi-modal discriminator (I2P-GAN w/ Dm) guides generation towards the ground-truth poems and thus obtains the highest BLEU scores, which emphasize n-gram similarity in a translative way. The poem-style discriminator (Dp) is designed to make the generated poems more poetic in language style, and the highest novelty score of I2P-GAN w/ Dp shows that Dp helps to provide more novel and imaginative words in the generated poems. Overall, I2P-GAN combines the advantages of both discriminators, with intermediate BLEU and novelty scores, while still outperforming the other generation models. Moreover, our model with both discriminators generates poems with the highest relevance on our embedding-based relevance metric.

Comparison of the human evaluation results is shown in Table 4. Different from the automatic evaluation, where Regions-Hierarchical does not perform well, it obtains a slightly better result than the caption model here, since sentences that are all about the same topic tend to leave better impressions on users. Our three models outperform the four baseline methods on all metrics. The two discriminators promote human-level comprehension of the poems compared with the pre-trained model, and the model with both discriminators generates better poems from images in terms of relevance, coherence and imaginativeness. Figure 4 shows one example of poems generated with three baselines and our methods for a given image; more examples generated by our approach can be seen in Figure 5.

Turing Test. For the Turing test with AMT annotators, we hired 548 workers, with 10.9 tasks per worker on average. For the experts, 15 people were asked to judge human-written poems with images and another 15 with poems only; each expert was assigned 20 images, so in total 600 tasks were conducted by expert users. Table 5 shows the probability



Figure 5: Examples of poems generated by our approach, I2P-GAN.

Method                       Relevance   Novelty-2   Novelty-3   BLEU-1   BLEU-2   BLEU-3   Overall
Show and Tell (1CNN) [27]    1.79        43.66       76.76       11.88    3.35     0.76     5.61
Show and Tell (3CNNs) [27]   1.91        48.09       81.37       12.64    3.34     0.8      11.89
SeqGAN [35]                  2.03        47.52       82.32       13.40    3.72     0.76     15.86
Regions-Hierarchical [16]    1.81        46.75       79.90       11.64    2.5      0.67     2.35
I2P-GAN w/o discriminator    1.94        45.25       80.13       13.35    3.69     0.88     14.65
I2P-GAN w/ Dm                2.07        43.37       78.98       15.15    4.13     1.02     22.09
I2P-GAN w/ Dp                1.90        60.66       89.74       12.91    3.05     0.72     16.00
I2P-GAN                      2.25        54.32       85.37       14.25    3.84     0.94     27.57

Table 3: Automatic evaluation. Note that BLEU scores are computed in comparison with human-annotated ground-truth poems (one poem per image). The overall score is computed as an average of the three metrics after normalization (Equation 14). All scores are reported as percentages (%).

Method                       Rel    Col    Imag   Overall
Show and Tell (1CNN) [27]    6.31   6.52   6.57   6.67
Show and Tell (3CNNs) [27]   6.41   6.59   6.63   6.75
SeqGAN [35]                  6.13   6.43   6.50   6.63
Regions-Hierarchical [16]    6.35   6.54   6.63   6.78
I2P-GAN w/o discriminator    6.44   6.64   6.77   6.85
I2P-GAN w/ Dm                6.59   6.83   6.94   7.06
I2P-GAN w/ Dp                6.53   6.75   6.80   6.93
I2P-GAN                      6.83   6.95   7.05   7.18
Ground-Truth                 7.10   7.26   7.23   7.37

Table 4: Human evaluation results of six methods on four criteria: relevance (Rel), coherence (Col), imaginativeness (Imag) and Overall. All criteria are evaluated on a 0-10 scale (0 = bad, 10 = good).

Data             Users    Ground-Truth   Generated
Poem w/ Image    AMT      0.51           0.49
                 Expert   0.60           0.40
Poem w/o Image   AMT      0.55           0.45
                 Expert   0.57           0.43

Table 5: Accuracy of the Turing test for AMT users and expert users on poems with and without images.

of different poems being selected as human-written poems for a given image. As we can see, the generated poems cause competitive confusion for both ordinary annotators and experts, although experts can identify the human-written poem more accurately than ordinary people. One


interesting observation is that experts are better at identifying the correct poems when images are provided, while AMT workers do better with poems only.

5 Conclusion

As the first work on poetry (English free verse) generation from images, we propose a novel approach that models the problem by incorporating a deep coupled visual-poetic embedding model and RNN-based adversarial training with multiple discriminators as rewards for policy gradient. Furthermore, we introduce the first image and poem pair dataset (MultiM-Poem) and a large poem corpus (UniM-Poem) to advance research on poem generation, especially from images. Extensive experiments have demonstrated that our embedding model can approximately learn a rational visual-poetic embedding space. Automatic and manual evaluation results demonstrate the effectiveness of our poem generation model.

References

[1] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. ICCV, 2017.

[2] X. Chen and C. Lawrence Zitnick. Mind's eye: A recurrent visual representation for image caption generation. In CVPR, pages 2422-2431, 2015.

[3] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. NIPS, 2014.

[4] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollar, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From captions to visual concepts and back. In CVPR, pages 1473-1482, 2015.

[5] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In ECCV, pages 15-29, 2010.

[6] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. DeViSE: A deep visual-semantic embedding model. In NIPS, pages 2121-2129, 2013.

[7] M. Ghazvininejad, X. Shi, Y. Choi, and K. Knight. Generating topical poetry. In EMNLP, pages 1183-1191, 2016.

[8] M. Ghazvininejad, X. Shi, J. Priyadarshi, and K. Knight. Hafez: an interactive poetry generation system. ACL, pages 43-48, 2017.

[9] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672-2680, 2014.

[10] J. He, M. Zhou, and L. Jiang. Generating Chinese classical poems with statistical machine translation models. In AAAI, 2012.

[11] J. Hopkins and D. Kiela. Automatically generating rhythmic verse with neural networks. In ACL, volume 1, pages 168-178, 2017.

[12] L. Jiang and M. Zhou. Generating Chinese couplets using a statistical MT approach. In COLING, pages 377-384, 2008.

[13] A. Karpathy, A. Joulin, and F. F. F. Li. Deep fragment embeddings for bidirectional image sentence mapping. In NIPS, pages 1889-1897, 2014.

[14] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.

[15] R. Kiros, Y. Zhu, R. R. Salakhutdinov, R. Zemel, R. Urtasun, A. Torralba, and S. Fidler. Skip-thought vectors. In NIPS, pages 3294-3302, 2015.

[16] J. Krause, J. Johnson, R. Krishna, and L. Fei-Fei. A hierarchical approach for generating descriptive image paragraphs. CVPR, 2017.

[17] G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. C. Berg, and T. L. Berg. Baby talk: Understanding and generating image descriptions. In CVPR, 2011.

[18] Y. Liu, J. Fu, T. Mei, and C. W. Chen. Let your photos talk: Generating narrative paragraph for photo stream via bidirectional attention recurrent neural networks. In AAAI, 2017.


[19] H. M. Manurung. A chart generator for rhythm patterned text. In Proceedings of the First International Workshop on Literature in Cognition and Computer, pages 15-19, 1999.

[20] H. Oliveira. Automatic generation of poetry: an overview. Universidade de Coimbra, 2009.

[21] H. G. Oliveira. PoeTryMe: a versatile platform for poetry generation. Computational Creativity, Concept Invention, and General Intelligence, 1:21, 2012.

[22] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In ACL, pages 311-318, 2002.

[23] C. C. Park and G. Kim. Expressing an image stream with a sequence of natural sentences. In NIPS, pages 73-81, 2015.

[24] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. arXiv preprint arXiv:1612.00563, 2016.

[25] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. ImageNet large scale visual recognition challenge. IJCV, 115(3):211-252, 2015.

[26] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In CVPR, pages 3156-3164, 2015.

[28] J. Wang, J. Fu, Y. Xu, and T. Mei. Beyond object recognition: Visual sentiment analysis with deep coupled adjective and noun neural networks. In IJCAI, pages 3484-3490, 2016.

[29] L. Wang, S. Guo, W. Huang, and Y. Qiao. Places205-VGGNet models for scene recognition. arXiv preprint arXiv:1508.01667, 2015.

[30] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229-256, 1992.

[31] Z. Xu, B. Liu, B. Wang, S. Chengjie, X. Wang, Z. Wang, and C. Qi. Neural response generation via GAN with an approximate embedding layer. In EMNLP, pages 628-637, 2017.

[32] R. Yan, H. Jiang, M. Lapata, S.-D. Lin, X. Lv, and X. Li. i, Poet: Automatic Chinese poetry composition through a generative summarization framework under constrained optimization. In IJCAI, pages 2197-2203, 2013.

[33] X. Yi, R. Li, and M. Sun. Generating Chinese classical poems with RNN encoder-decoder. In Chinese Computational Linguistics and Natural Language Processing Based on Naturally Annotated Big Data, pages 211-223. Springer, 2017.

[34] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In CVPR, pages 4651-4659, 2016.

[35] L. Yu, W. Zhang, J. Wang, and Y. Yu. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852-2858, 2017.

[36] W. Zaremba and I. Sutskever. Reinforcement learning neural Turing machines - revised. arXiv preprint arXiv:1505.00521, 2015.

[37] X. Zhang and M. Lapata. Chinese poetry generation with recurrent neural networks. In EMNLP, pages 670-680, 2014.
