
On Mental Imagery in Lexical Processing: Computational Modeling of the Visual Load Associated to Concepts

Daniele P. Radicioni (χ), Francesca Garbarini (ψ), Fabrizio Calzavarini (φ),
Monica Biggio (ψ), Antonio Lieto (χ), Katiuscia Sacco (ψ), Diego Marconi (φ)

([email protected])

(χ) Department of Computer Science, Turin University – Turin, Italy
(φ) Department of Philosophy, Turin University – Turin, Italy
(ψ) Department of Psychology, Turin University – Turin, Italy

Abstract

This paper investigates the notion of visual load, an estimate of a lexical item’s efficacy in activating mental images associated with the concept it refers to. We elaborate on the centrality of this notion, which is deeply and variously connected to lexical processing. A computational model of visual load is introduced that builds on a few low-level features and on the dependency structure of sentences. The system implementing the proposed model has been experimentally assessed and shown to reasonably approximate human responses.

Keywords: Visual imagery; Computational modeling; Natural Language Processing.

Introduction

Ordinary experience suggests that lexical competence, i.e., the ability to use words, includes both the ability to relate words to the external world as accessed through perception (referential tasks) and the ability to relate words to other words in inferential tasks of several kinds (Marconi, 1997). There is evidence from both traditional neuropsychology and more recent neuroimaging research that the two aspects of lexical competence may be implemented by partly different brain processes. However, some very recent experiments appear to show that typically visual areas are also engaged by purely inferential tasks not involving visual perception of objects or pictures (Marconi et al., 2013). The present work can be considered a preliminary investigation aimed at verifying this main hypothesis, by examining the following issues: i) to what extent the visual load associated with concepts can be assessed, and what sort of agreement exists among humans about the visual load associated with concepts; ii) which features underlie the visual load associated with concepts; and iii) whether the notion of visual load can be grasped and encapsulated in a computational model.

As is widely acknowledged, one main visual correlate of language is imageability, that is, the property of a particular word or sentence to produce an experience of imagery. In the following we focus on visual imagery (thus disregarding acoustic, olfactory and tactile imagery), which we denote as visual load. Visual load is related to the ease of producing visual imagery when an external linguistic stimulus is processed.

Intuitively, words like ‘dog’ or ‘apple’ refer to concrete entities and are associated with a high visual load, implying that these terms immediately generate a mental image. Conversely, words like ‘algebra’ or ‘idempotence’ are hardly accompanied by the production of vivid images. Although the construct of visual load is closely related to that of concreteness, the two can clearly dissociate: i) some concrete nouns have been rated low in visual load (Paivio, Yuille, & Madigan, 1968); and, conversely, ii) some abstract words, such as ‘bisection’, are associated with a high visual load.

The notion of visual load is relevant to many disciplines, in that it contributes to shedding light on a wide variety of cognitive and linguistic tasks and helps explain a plethora of phenomena observed in both impaired and normal subjects. In the next Section we survey a multidisciplinary literature showing how mental imagery affects memory, learning and comprehension; we consider how imagery is characterized at the neural level; and we show how visual information is exploited in state-of-the-art Natural Language Processing research. In the subsequent Section we illustrate the proposed computational model for providing concepts with their visual load characterization. We then describe the experiments designed to assess the model through an implemented system, and report and discuss the obtained results. The Conclusions summarize the work done and provide an outlook on future work.

Related Work

As regards linguistic competence, it is generally accepted that visual load facilitates cognitive performance (Bergen, Lindsay, Matlock, & Narayanan, 2007), leading to faster lexical decisions than for non-visually loaded concepts (Cortese & Khanna, 2007). For example, nouns with high visual load ratings are remembered better than those with low visual load ratings in long-term memory tests (Paivio et al., 1968). Moreover, visually loaded terms are easier to recognize for subjects with deep dyslexia, and individuals respond more quickly and accurately when making judgments about visually loaded sentences (Kiran & Tuchtenhagen, 2005). Neuropsychological research has shown that many aphasic patients perform better with linguistic items that more easily elicit visual imagery (Coltheart, 1980), although the opposite pattern has also been documented (Cipolotti & Warrington, 1995).


Figure 1: The (simplified) dependency tree corresponding to the sentence ‘The animal that eats bananas on a tree is the monkey’.

Visual imageability of concepts evoked by words and sentences is commonly known to affect brain activity. While visuosemantic processing regions, such as the left inferior temporal gyrus and the fusiform gyrus, show greater involvement during the comprehension of highly imageable words and sentences (Bookheimer et al., 1998; Mellet, Tzourio, Denis, & Mazoyer, 1998), other semantic brain regions (i.e., the superior and middle temporal cortex) are selectively activated by low-imageable sentences (Mellet et al., 1998; Just, Newman, Keller, McEleney, & Carpenter, 2004). Furthermore, a growing number of studies suggests that words encoding different visual properties (such as color, shape, motion, etc.) are processed in cortical areas that overlap with some of the areas activated during visual perception of those properties (Kemmerer, 2010).

Investigating the visual features associated with linguistic input can be useful for building semantic resources designed to deal with Natural Language Processing (NLP) problems, such as identifying verb subcategorization frames (Bergsma & Goebel, 2011), or enriching the traditional extraction of distributional semantics from text with a multimodal approach that integrates textual features with visual ones (Bruni, Tran, & Baroni, 2014). Finally, visual attributes are at the basis of the development of annotated corpora and resources that can be used to extend text-based distributional semantics by grounding word meanings on visual features as well (Silberer, Ferrari, & Lapata, 2013).

Model

Although much work has been invested in different areas in investigating imageability in general and visual imagery in particular, to the best of our knowledge no attempt has been made to formally characterize visual load, and no computational model has been devised to compute how visually loaded sentences and the lexicalized concepts therein are. We propose a model that relies on a simple hypothesis, additively combining a few low-level features and refining the result by exploiting syntactic information.

The notion of visual load, in fact, is used in the literature with different meanings, thus giving rise to different levels of ambiguity. We define visual load as a direct indicator (a numeric value) of the efficacy with which a lexical item activates mental images associated with the concept it refers to. We expect that visual load also represents an indirect measure of the probability of activation of brain areas devoted to visual processing.

We conjecture that visual load is primarily associated with concepts, although lexical phenomena like term availability (implying that the most frequently used terms are easier to recognize than those encountered less often; Tversky & Kahneman, 1973) can also affect it.

Based on the work by Kemmerer (2010), we explore the hypothesis that a limited number of primitive elements can be used to characterize and evaluate the visual load associated with concepts. Namely, Kemmerer’s Simulation Framework makes it possible to grasp information about a wide variety of concepts and properties used to denote objects, events and spatial relations. Three main visual semantic components have been identified that, in our opinion, are also suitable to be used as different dimensions along which to characterize the concept of visual load: color properties, shape properties, and motion properties. The perception of these properties is expected to occur in an immediate way, such that “during our ordinary observation of the world, these three attributes of objects are tightly bound together in unified conscious images” (Kemmerer, 2010). We added a further perceptual component related to size. More precisely, our assumption is that information about the size of a given concept can also contribute, as an adjoint factor rather than a primitive one, to the computation of the visual load value for the considered concept.

In this setting, we represent each concept/property as a record containing its lemma, morphological information on its POS (part of speech), and a boolean-valued vector of four elements encoding whether the considered concept/property conveys information about color, shape, motion and size.¹ For example, this piece of information

table,Noun,1,1,0,1 (1)

can be used to indicate that the concept table (associated with a Noun, and differing, e.g., from that associated with a Verb) conveys information about color, shape and size, but not about motion. In the following, these are referred to as the visual features φ· associated with the given concept.

¹We adopt here a simplification, since we are assuming that the pair 〈lemma, POS〉 is sufficient to identify a concept/property, and that in general we can access items while disregarding the word sense disambiguation problem, which is a well-known open problem in the field of NLP.



Figure 2: The pipeline to compute the VL score according to the proposed computational model.

We have then built a dictionary by extracting it from the set of stimuli (illustrated hereafter), composed of simple sentences each describing a concept; next, we manually annotated the visual features associated with each concept. The automatic annotation of the visual properties associated with concepts is deferred to future work: it can be addressed either through a classical Information Extraction approach building on statistics, or in a more semantically principled way.

Different weighting schemes w⃗ = {α, β, γ} have been tested in order to set the features’ contribution to the visual load associated with a concept c, which results from computing

VL(c, w⃗) = ∑ᵢ wᵢ φᵢ = α(φcol + φsha) + β φmot + γ φsiz.   (2)

For the experimentation we set α to 1.35, β to 1.1 and γ to 0.9: these assignments reflect the fact that color and shape information is considered more important in the computation of VL.
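As a worked example, the following minimal sketch (ours, in Python, not the authors’ implementation) applies Eq. 2 with these weights to the record shown in (1):

```python
# Eq. 2 with the weights used in the experimentation; the feature values
# are the boolean visual features of the record "table,Noun,1,1,0,1".
ALPHA, BETA, GAMMA = 1.35, 1.1, 0.9

def vl(col, sha, mot, siz):
    """Visual load of a single concept (Eq. 2)."""
    return ALPHA * (col + sha) + BETA * mot + GAMMA * siz

print(vl(1, 1, 0, 1))  # 1.35*(1+1) + 1.1*0 + 0.9*1, i.e. approx. 3.6
```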

To combine the contributions of the concepts in a sentence s into the overall VL score for s, we adopted the following additive schema: VL(s) = ∑c∈s VL(c).

The computation of the VL score also accounts for the dependency structure of the input sentences. The syntactic structure of sentences is computed by the Turin University Parser (TUP) in the dependency format (Lesmo, 2007). Dependency formalisms represent syntactic relations by connecting a dominant word, the head (e.g., the verb ‘fly’ in the sentence The eagle flies), and a dominated word, the dependent (e.g., the noun ‘eagle’ in the same sentence). The connection between these two words is usually represented by labeled directed edges (e.g., subject): the collection of all dependency relations of a sentence forms a tree, rooted in the main verb (see the parse tree illustrated in Figure 1). The dependency structure is relevant in our approach because we assume that a reinforcement effect may apply in cases where both a word and its dependent(s) (or governor(s)) are associated with visual features. For example, a phrase such as ‘with black stripes’ is expected to evoke mental images in a more vivid way than its elements taken in isolation (that is, ‘black’ and ‘stripes’); moreover, its visual load is expected to grow further if we add a coordinated term, as in ‘with yellow and black stripes’. The VL would (recursively) grow again if we added a governor term, as in ‘fur with yellow and black stripes’. We then introduced a parameter ξ to control the contribution of the aforementioned features in case the corresponding terms are linked in the parse tree by a modifier/argument relation (denoted as mod and arg in Equation 3):

VL(ci) = ξ·VL(ci)   if ∃ cj s.t. mod(ci, cj) ∨ arg(ci, cj)
VL(ci) = VL(ci)     otherwise.   (3)

In the experimentation ξ was set to 1.2.
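The following minimal sketch (ours; the base scores and the edge representation are placeholder assumptions, not the paper’s data) illustrates the Eq. 3 update on the ‘with yellow and black stripes’ example:

```python
# Eq. 3: a concept's VL contribution is multiplied by xi whenever the
# dependency tree links it to another concept through a modifier (mod)
# or argument (arg) relation. Base VL scores below are placeholders.
XI = 1.2
base_vl = {"yellow": 1.35, "black": 1.35, "stripes": 2.7}
mod_arg = {("stripes", "yellow"), ("stripes", "black")}  # (head, dependent)

reinforced = {c: (XI * v if any(c in e for e in mod_arg) else v)
              for c, v in base_vl.items()}
print(sum(reinforced.values()))  # approx. 6.48: every term is reinforced
```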

The stimuli in the dataset are pairs consisting of a definition d and a target T (st = 〈d, T〉). For instance, in the stimulus ‘The big carnivore with yellow and black stripes is the . . . tiger’, the definition d is ‘The big carnivore with yellow and black stripes is the . . . ’ and the target T is ‘tiger’. The visual load associated with the components of st, given the weighting scheme w⃗, is then computed as follows:

VL(d, w⃗) = ∑c∈d VL(c)   (4)
VL(T, w⃗) = VL(T).   (5)

The whole pipeline, from the parsing of the input to the computation of the VL for the considered stimulus, has been implemented as a computer program; its main steps include the parsing of the stimulus, the extraction of the (lexicalized) concepts by exploiting the output of the morphological analysis, and the traversal of the dependency tree resulting from the parsing step. The morphological analyzer has been preliminarily fed with the whole set of stimuli, and its output has been annotated with the visual features and stored in a dictionary. At run time, the dictionary is accessed based on morphological information, which is then used to retrieve the values of the features associated with the concepts in the stimulus. The output of the proposed model has been compared with the results obtained in a behavioral experiment, as described below.
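The following self-contained sketch (ours, not the authors’ implementation) ties the pieces together: dictionary lookup on 〈lemma, POS〉 pairs, Eq. 2 scoring, the Eq. 3 reinforcement, and the additive schemas of Eqs. 4–5. The TUP parser is not reproduced: we assume its output has already been reduced to lemma/POS pairs plus mod/arg edges, and all dictionary entries and feature values are illustrative.

```python
ALPHA, BETA, GAMMA, XI = 1.35, 1.1, 0.9, 1.2

DICTIONARY = {  # (lemma, POS) -> (phi_col, phi_sha, phi_mot, phi_siz)
    ("big", "Adj"): (0, 0, 0, 1),
    ("carnivore", "Noun"): (0, 1, 1, 1),
    ("yellow", "Adj"): (1, 0, 0, 0),
    ("black", "Adj"): (1, 0, 0, 0),
    ("stripe", "Noun"): (1, 1, 0, 0),
    ("tiger", "Noun"): (1, 1, 1, 1),
}

def vl_concept(lemma, pos):
    """Eq. 2: weighted sum of the concept's boolean visual features."""
    col, sha, mot, siz = DICTIONARY.get((lemma, pos), (0, 0, 0, 0))
    return ALPHA * (col + sha) + BETA * mot + GAMMA * siz

def vl_definition(tokens, mod_arg_edges):
    """Eq. 4 with the Eq. 3 reinforcement: a concept's score is scaled
    by xi when it takes part in a modifier/argument relation."""
    total = 0.0
    for lemma, pos in tokens:
        base = vl_concept(lemma, pos)
        if any(lemma in edge for edge in mod_arg_edges):
            base *= XI
        total += base
    return total

# 'The big carnivore with yellow and black stripes is the ...'
tokens = [("big", "Adj"), ("carnivore", "Noun"), ("yellow", "Adj"),
          ("black", "Adj"), ("stripe", "Noun")]
edges = [("carnivore", "big"), ("stripe", "yellow"), ("stripe", "black")]
print(vl_definition(tokens, edges))   # VL(d), Eq. 4: approx. 11.58
print(vl_concept("tiger", "Noun"))    # VL(T), Eq. 5: approx. 4.7
```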

Experimentation

Materials and Methods

Thirty healthy volunteers, all native Italian speakers (16 females and 14 males), 19 to 52 years of age (mean ± sd = 25.7 ± 5.1), were recruited for the experiment. None of the subjects had a history of psychiatric or neurological disorders. All participants gave their written informed consent before participating in the experimental procedure, which was approved by the ethical committee of the University of Turin, in accordance with the Declaration of Helsinki (World Medical Association, 1991). Participants were all naïve to the experimental procedure and to the aims of the study.

Experimental design and procedure. Participants were asked to perform an inferential “Naming from definition” task. During the task a sentence was pronounced, and the subjects were instructed to listen to the stimulus given in the headphones and to overtly name, as accurately and as quickly as possible, the target word corresponding to the definition, using a microphone connected to a response box. Auditory stimuli were presented through the E-Prime software, which was also used to record data on accuracy and reaction times. Furthermore, at the end of the experimental session, the subjects were administered a questionnaire: they had to rate on a 1–7 Likert scale the intensity of the visual load they perceived as related to each target and to each definition.

The factorial design of the study included two within-subjects factors, in which the visual load of both the target and the definition was manipulated. The resulting four experimental conditions were as follows:

VV Visual Target—Visual Definition (e.g., ‘The bird of prey with great wings flying over the mountains is the . . . eagle’);

VNV Visual Target—Non-Visual Definition (e.g., ‘The hottest of the four elements of the ancients is . . . fire’);

NVV Non-Visual Target—Visual Definition (e.g., ‘The nose of Pinocchio stretched when he told a . . . lie’);

NVNV Non-Visual Target—Non-Visual Definition (e.g., ‘The quality of people who easily solve difficult problems is called . . . intelligence’).

For each condition there were 48 sentences, 192 sentences overall. Each experimental session lasted about 30 minutes. The number of words (nouns and adjectives), their balancing across stimuli, and the (syntactic dependency) structure of the considered sentences were uniform within conditions, so that the most relevant variables were controlled. The same set of stimuli used for the human experiment was given as input to the system implementing the computational model.

Data analysis

The participants’ performance in the “Naming from definition” task was evaluated by recording, for each response, the reaction time (RT), in milliseconds, and the accuracy (AC), computed as the percentage of correct answers. Answers were considered correct if the target word was plausibly matched with the definition. Then, for each subject, RT and AC were combined into the Inverse Efficiency Score (IES), using the formula IES = (RT/AC) · 100. IES is a metric commonly used to aggregate reaction time and accuracy and to summarize them (Townsend & Ashby, 1978). The mean IES value was used as the dependent variable and entered into a 2 × 2 repeated-measures ANOVA with ‘target’ (two levels: ‘visual’ and ‘non-visual’) and ‘definition’ (two levels: ‘visual’ and ‘non-visual’) as within-subjects factors. Post hoc comparisons were performed using the Duncan test.
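As a quick illustration of the formula (a sketch, not the study’s analysis code):

```python
# Inverse Efficiency Score: reaction time (ms) divided by accuracy (%),
# scaled by 100; lower values indicate better performance.
def ies(rt_ms: float, ac_pct: float) -> float:
    return (rt_ms / ac_pct) * 100

print(ies(900.0, 90.0))  # 900 ms at 90% correct -> 1000.0
```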

The scores obtained by the participants in the visual load questionnaire were analyzed using two-tailed unpaired t-tests. Two comparisons were performed: for visual versus non-visual targets, and for visual versus non-visual definitions. The computational model results were analyzed in the same way, again comparing visual versus non-visual targets and visual versus non-visual definitions.

Correlations between IES, computational model and visual load questionnaire. We also explored the existence of correlations between the IES, the visual load questionnaire and the computational model output by using linear regressions. For both the IES values and the questionnaire scores, we computed for each item the mean of the 30 subjects’ responses. In a first model, we used the visual load questionnaire scores as the independent variable to predict the participants’ performance (with the IES as the dependent variable); in a second model, we used the computational data as the independent variable to predict the participants’ visual load evaluation (with the questionnaire scores as the dependent variable). In order to verify the consistency of the correlation effects, we also performed linear regressions controlling for three covariates: the number of words, their balancing across stimuli, and the syntactic dependency structure.


Figure 3: The graph shows, for each condition, the mean IES with standard error.

Results

The ANOVA showed significant effects of the within-subjects factors “target” (F(1,29) = 14.4; p < 0.001), with IES values significantly lower for visual than for non-visual targets, and “definition” (F(1,29) = 32.78; p < 0.001), with IES values significantly lower for visual than for non-visual definitions. This means that, for both the target and the definition, the participants’ performance was significantly faster and more accurate in the visual than in the non-visual condition. We also found a significant “target × definition” interaction (F(1,29) = 7.54; p = 0.01). Based on the Duncan post hoc comparison, we verified that this interaction was explained by the effect of the visual definitions of the visual targets (VV condition), in which the participants’ performance was significantly faster and more accurate than in all the other conditions (VNV, NVV, NVNV), as shown in Figure 3.

By comparing the questionnaire scores for visual (mean ± sd = 5.69 ± 0.55) and non-visual (mean ± sd = 4.73 ± 0.71) definitions we found a significant difference (p < 0.001; two-tailed unpaired t-test). By comparing the questionnaire scores for visual (mean ± sd = 6.32 ± 0.4) and non-visual (mean ± sd = 4.23 ± 0.9) targets we found a significant difference (p < 0.001). This suggests that our a priori categorization of the sentences into the four conditions was supported by the general agreement of the subjects. By comparing the computational model scores for visual (mean ± sd = 4.0 ± 2.4) and non-visual (mean ± sd = 2.9 ± 2.0) definitions we found a significant difference (p < 0.001; two-tailed unpaired t-test). By comparing the computational model scores for visual (mean ± sd = 2.53 ± 1.29) and non-visual (mean ± sd = 0.26 ± 0.64) targets we found a significant difference (p < 0.001). This suggests that we were able to computationally model the visual load of both targets and definitions, describing it as a linear combination of different low-level features: color, shape, motion and size.

Figure 4: Linear regression “Inverse Efficiency Score (IES) by Visual Load Questionnaire”. The mean score in the Visual Load Questionnaire, reported on a 1–7 Likert scale, was used as an independent variable to predict the subjects’ performance, as quantified by the IES.

Results correlations. By using the visual load questionnaire scores as the independent variable we were able to significantly (R² = 0.4; p < 0.001) predict the participants’ performance (that is, their IES values), as illustrated in Figure 4. This means that the higher the participants’ visual score for a definition, the better the participants’ performance in giving the correct response (or, alternatively, the lower the IES value).

By using the computational data as the independent variable we were able to significantly (R² = 0.44; p < 0.001) predict the participants’ visual load evaluation (their questionnaire scores), as shown in Figure 5. This means that a correlation exists between the computational prediction of the visual load of the definitions and the participants’ visual load evaluation: the higher the computational model result, the higher the participants’ visual score in the questionnaire. We also found that these effects were still significant in regression models where the number of words, their balancing across stimuli, and the syntactic dependency structure were controlled for.
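For illustration, here is a sketch (ours) of the first regression analysis on synthetic data of the same shape as the real per-item means (192 stimuli); all numbers are placeholders, not the study’s data. The second model is analogous, with the computational scores as the predictor.

```python
# scipy's linregress yields the correlation coefficient r, from which
# R^2 is derived, together with the p-value of the fitted slope.
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(0)
ratings = rng.uniform(1, 7, 192)                      # toy per-item mean VL ratings
ies = 2000 - 150 * ratings + rng.normal(0, 100, 192)  # toy per-item mean IES

fit = linregress(ratings, ies)  # model 1: questionnaire scores predict IES
print(f"R^2 = {fit.rvalue ** 2:.2f}, p = {fit.pvalue:.2g}")
```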


Figure 5: Linear regression “Visual Load Questionnaire by Computational Model”. The mean value obtained by the computational model was used as an independent variable to predict the subjects’ scores on the Visual Load Questionnaire, reported on a 1–7 Likert scale.

Conclusions

In the near future we plan to extend the representation of the conceptual information by grounding it on a hybrid representation composed of conceptual spaces and ontologies (Lieto, Minieri, Piana, & Radicioni, 2015; Lieto, Radicioni, & Rho, 2015). Additionally, we plan to integrate the current model in the context of cognitive architectures.

Acknowledgments

This work has been partly supported by the project The Role of the Visual Imagery in Lexical Processing, grant TO-call03-2012-0046, funded by Università degli Studi di Torino and Compagnia di San Paolo.

References

Bergen, B. K., Lindsay, S., Matlock, T., & Narayanan, S. (2007). Spatial and linguistic aspects of visual imagery in sentence comprehension. Cognitive Science, 31(5), 733–764.

Bergsma, S., & Goebel, R. (2011). Using visual information to predict lexical preference. In Proceedings of RANLP (pp. 399–405).

Bookheimer, S., Zeffiro, T., Blaxton, T., Gaillard, W., Malow, B., & Theodore, W. (1998). Regional cerebral blood flow during auditory responsive naming: evidence for cross-modality neural activation. Neuroreport, 9(10), 2409–2413.

Bruni, E., Tran, N.-K., & Baroni, M. (2014). Multimodal distributional semantics. Journal of Artificial Intelligence Research, 49, 1–47.

Cipolotti, L., & Warrington, E. K. (1995). Semantic memory and reading abilities: A case report. Journal of the International Neuropsychological Society, 1(1), 104–110.

Coltheart, M. (1980). Deep dyslexia: A right hemisphere hypothesis. In Deep dyslexia (pp. 326–380).

Cortese, M. J., & Khanna, M. M. (2007). Age of acquisition predicts naming and lexical-decision performance above and beyond 22 other predictor variables: An analysis of 2,342 words. Quarterly Journal of Experimental Psychology, 60(8), 1072–1082.

Just, M. A., Newman, S. D., Keller, T. A., McEleney, A., & Carpenter, P. A. (2004). Imagery in sentence comprehension: an fMRI study. NeuroImage, 21(1), 112–124.

Kemmerer, D. (2010). How words capture visual experience: The perspective from cognitive neuroscience. In B. Malt & P. Wolff (Eds.), Words and the mind: How words capture human experience. Oxford Scholarship Online.

Kiran, S., & Tuchtenhagen, J. (2005). Imageability effects in normal Spanish–English bilingual adults and in aphasia: Evidence from naming to definition and semantic priming tasks. Aphasiology, 19(3–5), 315–327.

Lesmo, L. (2007, June). The rule-based parser of the NLP group of the University of Torino. Intelligenza Artificiale, 2(4), 46–47.

Lieto, A., Minieri, A., Piana, A., & Radicioni, D. P. (2015). A knowledge-based system for prototypical reasoning. Connection Science, 27(2), 137–152.

Lieto, A., Radicioni, D. P., & Rho, V. (2015, July). A common-sense conceptual categorization system integrating heterogeneous proxytypes and the dual process of reasoning. In Proceedings of IJCAI 2015. Buenos Aires, Argentina: AAAI Press.

Marconi, D. (1997). Lexical competence. MIT Press.

Marconi, D., Manenti, R., Catricalà, E., Della Rosa, P. A., Siri, S., & Cappa, S. F. (2013). The neural substrates of inferential and referential semantic processing. Cortex, 49(8), 2055–2066.

Mellet, E., Tzourio, N., Denis, M., & Mazoyer, B. (1998). Cortical anatomy of mental imagery of concrete nouns based on their dictionary definition. Neuroreport, 9(5), 803–808.

Paivio, A., Yuille, J. C., & Madigan, S. A. (1968). Concreteness, imagery, and meaningfulness values for 925 nouns. Journal of Experimental Psychology, 76(1, Pt. 2), 1–25.

Silberer, C., Ferrari, V., & Lapata, M. (2013). Models of semantic representation with visual attributes. In Proceedings of ACL 2013 (pp. 572–582).

Townsend, J. T., & Ashby, F. G. (1978). Methods of modeling capacity in simple processing systems. Cognitive Theory, 3, 200–239.

Tversky, A., & Kahneman, D. (1973). Availability: A heuristic for judging frequency and probability. Cognitive Psychology, 5(2), 207–232.

World Medical Association. (1991). Code of Ethics: Declaration of Helsinki. BMJ, 302, 1194.
