1
REVEAL THIS and the COSMOROE cross-media relations framework
Katerina Pastra
Language Technology Applications Department
Institute for Language and Speech Processing (ILSP)
“Athena” Research Centre, Athens, Greece
2
ILSP/LTA Goals
• Basic and applied research in the field of Natural Language Processing, focusing on the design of computational models for natural language recognition and understanding, with application to three interwoven tracks:
- information processing, extraction & retrieval
- multilingual information processing (multilingual applications & translation systems)
- multimedia information processing (fusion of language with other modalities)
4
Research and Development directions
• Aiming at enhancing the capacity of processing multilingual multimedia content
• Enabling fusion of unimodal (text, speech, image) processing results in order to better understand the workings of language, information access and communication phenomena
• Preparing for the important role of language technologies in the forthcoming full-fledged convergence of information and edutainment channels (TV, radio, web)
5
Research and Development projects (1)
• Multilingual Information Processing
- Use parallel corpora for automatically acquiring bilingual lexica in EN–EL
- Employ contextual information for lexical transfer selection
- Use the annotated parallel corpus and the automatically extracted lexica to build a statistical machine translation infrastructure
- TRAID translation memory – Machine Translation Toolkit
6
Research and Development projects (2)
• Improving machine-assisted subtitling in a universal access framework
- Investigate the cognitive models underlying human subtitling and implement the appropriate computational architectures
- Integrate image processing to improve video segmentation and recognise subtitle units
- Investigate the extent to which existing subtitle generation methods are portable and can be parameterised across special classes of viewers, e.g. children
- Projects: MUSA/IST
7
Research and Development projects (3)
• Multimedia indexing and retrieval
- Augment the content of multimedia documents with high-level semantic indexical information (e.g. names of entities, terms, topics, facts)
- Develop cross-media and cross-language representations to enable linking of topically relevant video programmes, web texts and images
- Build high-level functionalities like semantic search, retrieval, filtering, categorization, translation, summarization
- Projects: CIMWOS/IST, REVEAL THIS/IST
8
Research Focus on Multimedia
• Multimedia discourse relations (the COSMOROE framework); applications: cross-media indexing and retrieval, segmentation of audiovisual data, multimedia summarization
• Sensorimotor & Symbolic Integration Resources: ongoing work on building an extensible computational resource which associates symbolic representations (words/concepts) with corresponding sensorimotor representations, enriched with patterns of combinations among these representations for forming conceptual structures at different levels of abstraction; focus on human action and interaction in everyday life.
Going bottom-up in the resource (from sensorimotor representations to concepts) one gets a hierarchical composition of human behaviour, while going top-down (from concepts to sensorimotor representations) one gets intentionally-laden interpretations of those structures.
9
Cross-Media Decision Mechanisms
Mechanisms that decide on the relation that holds between medium-specific pieces of information:
- across documents (Boll et al. 1999)
- within documents (Pastra 2006)
The mechanisms decide whether medium-specific pieces of information within the same multimedia document are associated (multimedia integration), complementary, or semantically compatible/incompatible – i.e. whether the relation is one of equivalence, complementarity, or independence.
10
Cross-media Relation Examples
Equivalence: “the yellow taxi-boats…”
Essential complementarity: “…[pollution has taken its toll] on that..”
Non-essential complementarity: “…we are heading to Patmos…”
Independence: “…I have finally found a place that’s not overrun by tourists…”
11
Cross-media relations
• Equivalence: info expressed by different media refers to the same entity (object, state, event or property)
• Complementarity: info in one medium is an (essential or non-essential) complement of the info expressed in another. Essential complementarity is usually indicated through association signals (e.g. indexicals); in non-essential complementarity, info in one medium is a modifier/adjunct of info expressed in another
• Independence: each medium carries an independent (but coherent) part of the multimedia message
• Incoherence: due to errors in medium-specific processing, or for artistic/editorial reasons
12
Application example: a cross-media indexer’s decisions
Equivalence: “the yellow taxi-boat…”
Essential complementarity: “…[pollution has taken its toll] on this..”
Non-essential complementarity: “…we are heading to Patmos…”
Independence: “…I have finally found a place that’s not overrun by tourists…”
[Diagram: the indexer combines the relations via and/or/choice operators to derive index terms, e.g. 1) Landscape–sea/coast, 2) Landscape–people]
13
Cross-Media Interaction Relations
• Intelligent multimedia systems (IMMS) need mechanisms for analysing and generating semantic links between different modalities (André and Rist 1994, Feiner and McKeown 1993, Green 2002, Gut et al. 2002, Martin and Kipp 2004, etc.)
• Focus: either image–language or gesture–language
• Semiotics: seminal analyses by Barthes 1984 (image–text) and Kendon 2004 (gesture–language)
• Automation of relation identification restricted to equivalence/association relations (cf. e.g. Barnard et al. 2003), mainly between images and text
• Criticism: beyond different wording, different perspectives, and different (or lack of clear) criteria, all attempts to define cross-media relations incorporate a qualitative notion of the “contribution” of each medium to the message; some of them employ Rhetorical Structure Theory (Mann & Thompson 1987)
14
The case against RST for describing multimedia discourse (1)
Inappropriate nucleus vs. satellite distinction (and the related notion of “contribution”), because:
• it relies on a single, unique message-reading directionality:
- language manifests itself linearly in time and space, vs.
- dynamic multimedia that are parallel in space and time (cf. AV data), vs.
- static multimedia that are perceived linearly, but not in a strictly pre-determined, unique order (cf. illustrated documents)
• its identification usually relies on lexical cues and syntactic patterns; such subtle cues are abundant in language for denoting relations between text segments, but only very few denote relations between language and other modalities
• it presumes that segments are comparable in size; interacting modality units are not comparable (e.g. sentence vs. image region, word vs. sequence of frames, etc.)
Example 1: “I got around the island by driving a moped”
Example 2: “I drove a moped for getting around the island”

RST Relation | Nucleus                 | Satellite
Means        | I got around the island | by driving a moped
Purpose      | I drove a moped         | for getting around the island
17
The case against RST for describing multimedia discourse (2)
• No compliance with media characteristics:
- image characteristics: specificity, lack of subtle focus/salience indicators and explicit abstraction mechanisms
- language characteristics: abstraction, meta-language functions, lack of direct access to sensorimotor entities
cf. the RST relation “Elaboration” = the satellite presents additional detail about the content of the nucleus (e.g. a member of a set, an instance of an abstraction, an attribute of an object, something specific in a generalisation)
• Lack of descriptive power and computational applicability:
- mutual exclusiveness of RST relation categories is inappropriate for capturing intentionality (Moore & Pollack 1992)
- fuzzy definitions of relations make manual annotation of data for training systems to identify the relations automatically problematic (low inter-annotator agreement – cf. Carlson et al. 2003)
But images always present more details…
COSMOROE Cross-Media Interaction Relations – refining the relation set
• Equivalence
- Literal: Token–Token, Type–Token
- Figurative: Metonymy, Metaphor
• Complementarity
- Essential: Exophora, Equivalence Signal, Defining Apposition
- Non-Essential: Non-Defining Apposition, Adjunct
• Independence: Contradiction, Symbiosis, Meta-Information
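For computational use (e.g. validating annotation labels against the relation set), the taxonomy above can be sketched as a nested structure. This is an illustrative encoding, not part of the original framework's tooling:

```python
# The COSMOROE relation taxonomy as a nested structure (illustrative).
# Top level: relation families; second level: literal/figurative or
# essential/non-essential subtypes; leaves: the concrete relation labels.
COSMOROE = {
    "Equivalence": {
        "Literal": ["Token-Token", "Type-Token"],
        "Figurative": ["Metonymy", "Metaphor"],
    },
    "Complementarity": {
        "Essential": ["Exophora", "Equivalence Signal", "Defining Apposition"],
        "Non-Essential": ["Non-Defining Apposition", "Adjunct"],
    },
    "Independence": ["Contradiction", "Symbiosis", "Meta-Information"],
}

def leaf_relations(tree=COSMOROE):
    """Flatten the taxonomy into its leaf relation labels."""
    leaves = []
    for subtree in tree.values():
        if isinstance(subtree, dict):
            for labels in subtree.values():
                leaves.extend(labels)
        else:
            leaves.extend(subtree)
    return leaves
```

A flat list such as `leaf_relations()` is what a manual annotator (or a classifier) would actually choose labels from, while the nesting preserves the family each label belongs to.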
“… helmet for safety...”
Token-Token Equivalence: semantic equivalence in which one modality refers to exactly the same entity that the other also refers to.
“…the ever increasing population of Athens has the city bursting at the seams and has created a vast concrete sprawl of housing ...”
Type-Token Equivalence: semantic equivalence in which one modality refers to a class of entities and the other to one or more representative members of the class.
“The city, of course, is Athens, and it is here that I will begin my exploration of modern Greece.”
Metonymy: the two referents come from the same domain and have the same array of associations; there is no transfer of qualities from one to another. The two modalities refer to different entities, but the user intends the two modalities to be considered semantically equivalent.
“It’s very serene…”
Metaphor: the two modalities refer to different entities from different domains; the user intends the two modalities to be considered semantically equivalent – there is a transfer of qualities.
“Do you see the black at the top of the ceiling there?”
Equivalence Signal: equivalence signals present in discourse indicate that one modality is essentially complemented by the other.
24
Defining Apposition: one modality provides extra information to another, information that identifies or describes something/someone and which – when vital for the clear comprehension of the message – is defining.
Non-Defining Apposition: one modality provides extra information to another, information that identifies or describes something/someone and which is not vital for the clear comprehension of the message.
Note: Apposition is different from the Equivalence Type-Token relation! (e.g. Bush is an instance of a president not generally, but at a certain time and place)
“the president…”
“the deceased handcuffed”
Apart from a type-token equivalence relation (“deceased” – image of the victim; only part of the image/body shown here), one may identify an apposition relation too: e.g. the tattoo on the hand of the man is extra, descriptive information, complementary to the textual discourse, but not necessarily vital for comprehension. By nature, images will usually give such info; some applications rely on identifying extra info that seemed originally unimportant (and therefore not present in the textual discourse) but was then considered significant, e.g. crime scene investigation applications.
“…The city is a jumble of the ancient and the modern ...”
Exophora: a pragmatic “anaphora” case
OCR: “Acropolis”
“…we are heading to Patmos…”
Adjunct: non-essential complementarity – one modality functions as an adjunct to the other (place-position, place-direction, time, manner)
Athens has been described as the last city in the West and the first city in the East. It's a place that is rich and spectacular in its history.
Many empires have held it in their sway: Romans, Venetians, Turks and Byzantines, and the result is a cosmopolitan city of three and a half million people.
Symbiosis: each modality expresses different pieces of information, the conjunction (in time) of which serves phatic communication (visual fillers ~ speech fillers)
28
Meta-Information: one modality expresses information that comments on aspects of what the other expresses, going beyond the message communicated to creation-related comments (who created the message, when, why, how – cf. typical archival metadata)
“an aerial view of Athens….”
29
Annotating Corpora with COSMOROE (2)
• Tool: ANVIL (Kipp 2000)
• Levels of association (local context – different granularity)
• Annotation levels:
- Audiovisual Topic
- Transcript (manual SR, subtitles, manual OCR)
- Body movement (indication of body part: hands, head, legs, whole body; type: deictic, iconic, emblem, beat, metaphoric)
- Images: frame sequence (foreground, background, both); keyframe region (bounding box, free-text label, moving vs. static object indication, corresponding frame sequence)
- Relations binding AnchorText entities with movement(s), image(s), etc.
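A single annotation produced under the scheme above ties a transcript span to a visual element via a relation label. The sketch below is a hypothetical record layout for illustration only; the field names are invented and do not reflect ANVIL's actual track specification:

```python
# Hypothetical COSMOROE annotation record (illustrative field names,
# not ANVIL's real schema): a keyframe region plus the relation that
# binds it to a transcript span.
from dataclasses import dataclass

@dataclass
class KeyframeRegion:
    label: str    # free-text label for the region
    bbox: tuple   # bounding box: (x, y, width, height)
    moving: bool  # moving vs. static object indication

@dataclass
class CrossMediaRelation:
    relation: str     # e.g. "Token-Token Equivalence", "Adjunct"
    anchor_text: str  # the transcript span being related
    visual: KeyframeRegion

# Example: the taxi-boat equivalence case from the earlier slides
rel = CrossMediaRelation(
    relation="Token-Token Equivalence",
    anchor_text="the yellow taxi-boat",
    visual=KeyframeRegion(label="taxi-boat", bbox=(40, 60, 120, 80), moving=True),
)
```

Keeping the relation label, text anchor, and visual anchor in one record is what later makes the annotations usable as supervised training data for automatic relation identification.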
30
31
Annotation Objectives
• To test the theory for coverage and applicability
• To answer questions on the semantics of multimedia discourse, e.g.:
- In which cases is “what one sees” not “what one hears” in discourse?
- Which concepts are usually visualised in accompanying images or expressed through gestures in discourse, and what is their level of abstraction?
- How is the interaction between modalities signalled?
- Which concepts are usually complemented with visual or gestural adjuncts? Could one predict the selectional restrictions for the arguments of a predicate when knowing its visual/gestural complements (and vice versa)?
- How is exophora realised? Could one use anaphora resolution mechanisms to resolve exophora?
• To use machine learning (ML) for automating relation identification for different applications
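The ML objective above could, under heavy simplification, be framed as supervised classification over annotated (text span, visual element) pairs. The toy sketch below uses invented data and two illustrative cues (lexical overlap, presence of an indexical word); it is not the project's actual system:

```python
# Toy sketch of COSMOROE relation identification as supervised
# classification (invented data, illustrative features only).
from collections import Counter

def features(text_span, visual_label):
    """Two simple cues: lexical overlap with the visual label,
    and presence of an indexical word (a signal of essential
    complementarity in the framework)."""
    text_words = set(text_span.lower().split())
    overlap = len(text_words & set(visual_label.lower().split()))
    has_indexical = any(w in text_words for w in ("this", "that", "here", "there"))
    return (overlap > 0, has_indexical)

# Tiny invented training set of (text span, visual label) -> relation
train = [
    (("taxi-boat", "taxi-boat"), "Equivalence"),
    (("pollution has taken its toll on that", "polluted coast"), "Essential Complementarity"),
    (("a place not overrun by tourists", "beach"), "Independence"),
]

def predict(text_span, visual_label):
    """Majority vote among training items with identical features."""
    f = features(text_span, visual_label)
    votes = [rel for (t, v), rel in train if features(t, v) == f]
    return Counter(votes).most_common(1)[0][0] if votes else "Independence"
```

A real system would of course need richer features (image regions, gesture types, discourse context) and a proper learner, but the framing — annotated relations as labels, cross-media cues as features — is the same.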
32
Future Work
• First phase: 5 hours Greek – 5 hours English (to be reached by July 2007)
• Investigation of the phenomenon of “entailment” in multimedia discourse in the above dataset – internal collaboration with Stelios Piperidis, ILSP
• Cognitive experimentation (on coherence relations in multimedia discourse – the notion of degree of fit between modalities as an indicator of coherence – collaboration with Trinity College Dublin, Carl Vogel)
• Machine learning for automatic identification of relations for indexing of audiovisual files in an extended dataset