1
REVEAL THIS and the COSMOROE cross-media relations framework
Katerina Pastra
Language Technology Applications Department
Institute for Language and Speech Processing (ILSP)
“Athena” Research Centre, Athens, Greece
2
ILSP/LTA Goals
• Basic and applied research in the field of Natural Language Processing, focusing on the design of computational models for natural language recognition and understanding, with application to three interwoven tracks:
- information processing, extraction & retrieval
- multilingual information processing (multilingual applications & translation systems)
- multimedia information processing (fusion of language with other modalities)
4
Research and Development directions
• Aiming at enhancing the capacity of processing multilingual multimedia content
• Enabling fusion of unimodal (text, speech, image) processing results in order to better understand the workings of language, information access and communication phenomena
• Preparing for the important role of language technologies in the forthcoming full-fledged convergence of information and edutainment channels (TV, radio, web)
5
Research and Development projects (1)
• Multilingual Information Processing
- Use parallel corpora for automatically acquiring bilingual lexica in EN–EL
- Employ contextual information for lexical transfer selection
- Use the annotated parallel corpus and the automatically extracted lexica to build a statistical machine translation infrastructure
- TRAID translation memory – Machine Translation Toolkit
6
Research and Development projects (2)
• Improving machine-assisted subtitling in a universal access framework
- Investigate the cognitive models underlying human subtitling and implement the appropriate computational architectures
- Integrate image processing to improve video segmentation and recognise subtitle units
- Investigate the extent to which existing subtitle generation methods are portable and can be parameterised across special classes of viewers, e.g. children
- Projects: MUSA/IST
7
Research and Development projects (3)
• Multimedia indexing and retrieval
- Augment the content of multimedia documents with high-level semantic indexical information (e.g. names of entities, terms, topics, facts)
- Develop cross-media and cross-language representations to enable linking of topically relevant video programmes, web texts and images
- Build high-level functionalities like semantic search, retrieval, filtering, categorization, translation, summarization
- Projects: CIMWOS/IST, REVEAL THIS/IST
8
Research Focus on Multimedia
• Multimedia discourse relations (the COSMOROE framework); applications: cross-media indexing and retrieval, segmentation of audiovisual data, multimedia summarization
• Sensorimotor & Symbolic Integration Resources: ongoing work on building an extensible computational resource which associates symbolic representations (words/concepts) with corresponding sensorimotor representations, enriched with patterns of combinations among these representations for forming conceptual structures at different levels of abstraction; focus on human action and interaction in everyday life.
Going bottom-up in the resource (from sensorimotor representations to concepts) one gets a hierarchical composition of human behaviour, while going top-down (from concepts to sensorimotor representations) one gets intentionally-laden interpretations of those structures.
9
Cross-Media Decision Mechanisms
Mechanisms that decide on the relation that holds between medium-specific pieces of information:
- across documents (Boll et al. 1999)
- within documents (Pastra 2006)
The mechanisms decide whether medium-specific pieces of information within the same multimedia document are associated (multimedia integration), complementary, or semantically compatible/incompatible – i.e. whether the relation is one of equivalence, complementarity, or independence.
10
Cross-media Relation Examples
Equivalence: “the yellow taxi-boats…”
Essential complementarity: “…[pollution has taken its toll] on that..”
Non-essential complementarity: “…we are heading to Patmos…”
Independence: “…I have finally found a place that’s not overrun by tourists…”
11
Cross-media relations
• Equivalence: info expressed by different media refers to the same entity (object, state, event or property)
• Complementarity: info in one medium is an (essential or non-essential) complement of the info expressed in another. Essential complementarity is usually indicated through association signals (e.g. indexicals); in non-essential complementarity, info in one medium is a modifier/adjunct of info expressed in another
• Independence: each medium carries an independent (but coherent) part of the multimedia message
• Incoherence: due to errors in medium-specific processing, or for artistic/editorial reasons
12
Application example: a cross-media indexer’s decisions
Equivalence: “the yellow taxi-boat…”
Essential complementarity: “…[pollution has taken its toll] on this..”
Non-essential complementarity: “…we are heading to Patmos…”
Independence: “…I have finally found a place that’s not overrun by tourists…”
[Diagram: the indexer combines the relations via and/or/choice operators to derive index terms, e.g. 1) Landscape–sea/coast, 2) Landscape–people]
13
Cross-Media Interaction Relations
• Intelligent multimedia systems (IMMS) need mechanisms for analysing and generating semantic links between different modalities (André and Rist 1994, Feiner and McKeown 1993, Green 2002, Gut et al. 2002, Martin and Kipp 2004, etc.)
• Focus: either image–language or gesture–language
• Semiotics: seminal analyses by Barthes 1984 (image–text) and Kendon 2004 (gesture–language)
• Automation of relation identification restricted to equivalence/association relations (cf. e.g. Barnard et al. 2003), mainly between images and text
• Criticism: beyond different wording, different perspectives, and different (or lack of clear) criteria, all attempts to define cross-media relations incorporate a qualitative notion of the “contribution” of each medium to the message; some of them employ Rhetorical Structure Theory (Mann & Thompson 1987)
14
The case against RST for describing multimedia discourse (1)
Inappropriate nucleus vs. satellite distinction (and the related notion of “contribution”), because:
• it relies on a single, unique message-reading directionality:
- language manifests itself linearly in time and space, vs.
- dynamic multimedia that are parallel in space and time (cf. AV data), vs.
- static multimedia that are perceived linearly, but not in a strictly pre-determined, unique order (cf. illustrated documents)
• its identification usually relies on lexical cues and syntactic patterns; such subtle cues are abundant in language for denoting relations between text segments, but only very few denote relations between language and other modalities
• it presumes that segments are comparable in size; interacting modality units are not comparable (e.g. sentence vs. image region, word vs. sequence of frames, etc.)
Example 1: “I got around the island by driving a moped”
Example 2: “I drove a moped for getting around the island”

RST Relation | Nucleus                 | Satellite
Means        | I got around the island | by driving a moped
Purpose      | I drove a moped         | for getting around the island
17
The case against RST for describing multimedia discourse (2)
• No compliance with media characteristics:
- image characteristics: specificity, lack of subtle focus/salience indicators and explicit abstraction mechanisms
- language characteristics: abstraction, meta-language functions, lack of direct access to sensorimotor entities
cf. the RST relation “Elaboration” = the satellite presents additional detail about the content of the nucleus (e.g. a member of a set, an instance of an abstraction, an attribute of an object, something specific in a generalisation)
• Lack of descriptive power and computational applicability:
- mutual exclusiveness of RST relation categories is inappropriate for capturing intentionality (Moore & Pollack 1992)
- fuzzy definitions of relations make manual annotation of data for training systems to identify the relations automatically problematic (low inter-annotator agreement – cf. Carlson et al. 2003)
But images always present more details…
COSMOROE Cross-Media Interaction Relations – refining the relation set
• Equivalence
- Literal: Token–Token, Type–Token
- Figurative: Metonymy, Metaphor
• Complementarity
- Essential: Exophora, Equivalence Signal, Defining Apposition
- Non-Essential: Non-Defining Apposition, Adjunct
• Independence: Contradiction, Symbiosis, Meta-Information
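For computational use (e.g. validating annotation labels against the relation set), the taxonomy above can be sketched as a nested structure. This is an illustrative encoding, not part of the original framework's tooling:

```python
# The COSMOROE relation taxonomy as a nested structure (illustrative).
# Top level: relation families; second level: literal/figurative or
# essential/non-essential subtypes; leaves: the concrete relation labels.
COSMOROE = {
    "Equivalence": {
        "Literal": ["Token-Token", "Type-Token"],
        "Figurative": ["Metonymy", "Metaphor"],
    },
    "Complementarity": {
        "Essential": ["Exophora", "Equivalence Signal", "Defining Apposition"],
        "Non-Essential": ["Non-Defining Apposition", "Adjunct"],
    },
    "Independence": ["Contradiction", "Symbiosis", "Meta-Information"],
}

def leaf_relations(tree=COSMOROE):
    """Flatten the taxonomy into its leaf relation labels."""
    leaves = []
    for subtree in tree.values():
        if isinstance(subtree, dict):
            for labels in subtree.values():
                leaves.extend(labels)
        else:
            leaves.extend(subtree)
    return leaves
```

A flat list such as `leaf_relations()` is what a manual annotator (or a classifier) would actually choose labels from, while the nesting preserves the family each label belongs to.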
“… helmet for safety...”
Token-Token Equivalence: semantic equivalence in which one modality refers to exactly the same entity that the other also refers to.
“…the ever increasing population of Athens has the city bursting at the seams and has created a vast concrete sprawl of housing ...”
Type-Token Equivalence: semantic equivalence in which one modality refers to a class of entities and the other to one or more representative members of the class.
“The city, of course, is Athens, and it is here that I will begin my exploration of modern Greece.”
Metonymy: the two referents come from the same domain and have the same array of associations; there is no transfer of qualities from one to another. The two modalities refer to different entities, but the user intends the two modalities to be considered semantically equivalent.
“It’s very serene…”
Metaphor: the two modalities refer to different entities from different domains; the user intends the two modalities to be considered semantically equivalent – there is a transfer of qualities.
“Do you see the black at the top of the ceiling there?”
Equivalence Signal: equivalence signals present in discourse indicate that one modality is essentially complemented by the other.
24
Defining Apposition: one modality provides extra information to another, information that identifies or describes something/someone and which – when vital for the clear comprehension of the message – is defining.
Non-Defining Apposition: one modality provides extra information to another, information that identifies or describes something/someone and which is not vital for the clear comprehension of the message.
Note: Apposition is different from the Equivalence Type-Token relation! (e.g. Bush is an instance of a president not generally, but at a certain time and place)
“the president…”
“the deceased handcuffed”
Apart from a type-token equivalence relation (“deceased” – image of the victim; only part of the image/body shown here), one may identify an apposition relation too: e.g. the tattoo on the hand of the man is extra, descriptive information, complementary to the textual discourse, but not necessarily vital for comprehension. By nature, images will usually give such info; some applications rely on identifying extra info that seemed originally unimportant (and therefore not present in the textual discourse) but was then considered significant, e.g. crime scene investigation applications.
“…The city is a jumble of the ancient and the modern ...”
Exophora: a pragmatic “anaphora” case
OCR: “Acropolis”
“…we are heading to Patmos…”
Adjunct: non-essential complementarity – one modality functions as an adjunct to the other (place-position, place-direction, time, manner)
Athens has been described as the last city in the West and the first city in the East. It's a place that is rich and spectacular in its history.
Many empires have held it in their sway: Romans, Venetians, Turks and Byzantines, and the result is a cosmopolitan city of three and a half million people.
Symbiosis: each modality expresses different pieces of information, the conjunction (in time) of which serves phatic communication (visual fillers ~ speech fillers)
28
Meta-Information: one modality expresses information that comments on aspects of what the other expresses, going beyond the message communicated to creation-related comments (who created the message, when, why, how – cf. typical archival metadata)
“an aerial view of Athens….”
29
Annotating Corpora with COSMOROE (2)
• Tool: ANVIL (Kipp 2000)
• Levels of association (local context – different granularity)
• Annotation levels:
- Audiovisual Topic
- Transcript (manual SR, subtitles, manual OCR)
- Body movement (indication of body part: hands, head, legs, whole body; type: deictic, iconic, emblem, beat, metaphoric)
- Images: frame sequence (foreground, background, both); keyframe region (bounding box, free-text label, moving vs. static object indication, corresponding frame sequence)
- Relations binding AnchorText entities with movement(s), image(s), etc.
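A single annotation produced under the scheme above ties a transcript span to a visual element via a relation label. The sketch below is a hypothetical record layout for illustration only; the field names are invented and do not reflect ANVIL's actual track specification:

```python
# Hypothetical COSMOROE annotation record (illustrative field names,
# not ANVIL's real schema): a keyframe region plus the relation that
# binds it to a transcript span.
from dataclasses import dataclass

@dataclass
class KeyframeRegion:
    label: str    # free-text label for the region
    bbox: tuple   # bounding box: (x, y, width, height)
    moving: bool  # moving vs. static object indication

@dataclass
class CrossMediaRelation:
    relation: str     # e.g. "Token-Token Equivalence", "Adjunct"
    anchor_text: str  # the transcript span being related
    visual: KeyframeRegion

# Example: the taxi-boat equivalence case from the earlier slides
rel = CrossMediaRelation(
    relation="Token-Token Equivalence",
    anchor_text="the yellow taxi-boat",
    visual=KeyframeRegion(label="taxi-boat", bbox=(40, 60, 120, 80), moving=True),
)
```

Keeping the relation label, text anchor, and visual anchor in one record is what later makes the annotations usable as supervised training data for automatic relation identification.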
30
31
Annotation Objectives
• To test the theory for coverage and applicability
• To answer questions on the semantics of multimedia discourse, e.g.:
- In which cases is “what one sees” not “what one hears” in discourse?
- Which concepts are usually visualised in accompanying images or expressed through gestures in discourse, and what is their level of abstraction?
- How is the interaction between modalities signalled?
- Which concepts are usually complemented with visual or gestural adjuncts? Could one predict the selectional restrictions for the arguments of a predicate when knowing its visual/gestural complements (and vice versa)?
- How is exophora realised? Could one use anaphora resolution mechanisms to resolve exophora?
• To use machine learning (ML) for automating relation identification for different applications
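The ML objective above could, under heavy simplification, be framed as supervised classification over annotated (text span, visual element) pairs. The toy sketch below uses invented data and two illustrative cues (lexical overlap, presence of an indexical word); it is not the project's actual system:

```python
# Toy sketch of COSMOROE relation identification as supervised
# classification (invented data, illustrative features only).
from collections import Counter

def features(text_span, visual_label):
    """Two simple cues: lexical overlap with the visual label,
    and presence of an indexical word (a signal of essential
    complementarity in the framework)."""
    text_words = set(text_span.lower().split())
    overlap = len(text_words & set(visual_label.lower().split()))
    has_indexical = any(w in text_words for w in ("this", "that", "here", "there"))
    return (overlap > 0, has_indexical)

# Tiny invented training set of (text span, visual label) -> relation
train = [
    (("taxi-boat", "taxi-boat"), "Equivalence"),
    (("pollution has taken its toll on that", "polluted coast"), "Essential Complementarity"),
    (("a place not overrun by tourists", "beach"), "Independence"),
]

def predict(text_span, visual_label):
    """Majority vote among training items with identical features."""
    f = features(text_span, visual_label)
    votes = [rel for (t, v), rel in train if features(t, v) == f]
    return Counter(votes).most_common(1)[0][0] if votes else "Independence"
```

A real system would of course need richer features (image regions, gesture types, discourse context) and a proper learner, but the framing — annotated relations as labels, cross-media cues as features — is the same.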
32
Future Work
• First phase: 5 hours Greek – 5 hours English (to be reached by July 2007)
• Investigation of the phenomenon of “entailment” in multimedia discourse in the above dataset – internal collaboration with Stelios Piperidis, ILSP
• Cognitive experimentation (on coherence relations in multimedia discourse – the notion of degree of fit between modalities as an indicator of coherence – collaboration with Trinity College Dublin, Carl Vogel)
• Machine learning for automatic identification of relations for indexing of audiovisual files in an extended dataset