SeMEX: Enabling Exploratory Video Search by Semantic Video Analysis
DESCRIPTION
Presentation slides from the LWA 2011 in Magdeburg, 30 Sep 2011
http://lwa2011.cs.uni-magdeburg.de/
TRANSCRIPT
Enabling Exploratory Video Search by Semantic Video Analysis
LWA 2011, Magdeburg, 30. Sep. 2011
Dr. Harald Sack, Hasso-Plattner-Institut for IT-Systems Engineering, University of Potsdam
Freitag, 30. September 11
Harald Sack, Hasso-Plattner-Institute for IT-Systems Engineering, LWA 2011, Magdeburg, 30. Sep. 2011
■ HPI was founded in October 1998 as a Public-Private Partnership
■ HPI research and teaching are focused on IT Systems Engineering
■ 10 professors and 100 scientific coworkers
■ 450 Bachelor / Master students
■ HPI is winner of the CHE Ranking 2010
http://hpi.uni-potsdam.de/
Semantic Technologies & Multimedia Retrieval
■ Research Topics
□ Semantic Web Technologies
□ Ontological Engineering
□ Information Retrieval
□ Multimedia Analysis & Retrieval
□ Social Networking
□ Data/Information Visualization
■ Research Projects
Overview
(1) Searching Audiovisual Data
(2) Semantic Multimedia Analysis
(3) Explorative Semantic Search
(4) SeMEX - Semantic Multimedia Explorer
SeMEX - Enabling Exploratory Video Search by Semantic Video Analysis, LWA 2011, Magdeburg, 30. Sep 2011
The Google Challenge...
Harald Sack, Hasso-Plattner-Institute for IT-Systems Engineering, Workshop ,Corporate Semantic Web‘, XInnovations 2011, Berlin, 19. Sep. 2011
Google Multimedia Search
How does Google find Multimedia?
...<a href="/mission_pages/shuttle/shuttlemissions/sts134/multimedia/index.html">
<IMG WIDTH="100" ALT="Close-up view of Endeavour's crew cabin prior to docking with the International Space Station" TITLE="Close-up view of Endeavour's crew cabin prior to docking with the International Space Station" SRC="/images/content/549665main_2011-05-18_1600_100-75.jpg" HEIGHT="75" ALIGN="Bottom" BORDER="0" /></a><p><a href="/mission_pages/shuttle/shuttlemissions/sts134/multimedia/index.html">STS-134 Multimedia</a></p>
...
‣Google Multimedia Search relies on link context
How to Search in Multimedia Archives?
Step 1: Digitization of analog data
Step 2: Annotation with (text-based) metadata
• manual annotation with text-based descriptive metadata
...how to extract metadata in an automated way?
Automated Audiovisual Analysis
• Genre Analysis - Classification: studio, indoor, news show
• Face Detection
• Overlay text / scene text
• Logo Detection
• Audio Mining: structural analysis, automated speech recognition, speaker identification
Automated Audiovisual Analysis
• Visual Analysis
  • Structural Analysis
  • Intelligent Character Recognition (ICR): character/logo detection, character filtering, character recognition
  • Genre Analysis & Categorization
  • Face / Body / Object: detection, tracking, clustering
• Audio Analysis
  • Structural Analysis
  • Speaker Detection
  • Automated Speech Recognition (ASR)
Structural Analysis
• Decomposition of time-based media into meaningful media fragments of coherent content that can be used as basic elements for indexing and classification
• Decomposition hierarchy: video → scenes → shots → subshots → frames / key frames
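The decomposition hierarchy above can be modeled as a simple containment structure. A minimal sketch; the class and field names are illustrative, not taken from the slides, and key frames are attached at the subshot level as one possible design:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class SubShot:
    start_frame: int                                      # first frame index (inclusive)
    end_frame: int                                        # last frame index (inclusive)
    key_frames: List[int] = field(default_factory=list)   # representative frames

@dataclass
class Shot:
    subshots: List[SubShot] = field(default_factory=list)

@dataclass
class Scene:
    shots: List[Shot] = field(default_factory=list)

@dataclass
class Video:
    scenes: List[Scene] = field(default_factory=list)

    def all_key_frames(self) -> List[int]:
        """Collect the key frames of every subshot, in timeline order."""
        return [kf for scene in self.scenes for shot in scene.shots
                for sub in shot.subshots for kf in sub.key_frames]
```

The key frames collected this way are the basic elements that the later analysis steps index and classify.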
Structural Analysis
• Shot Boundary Detection - automated identification of:
  • Hard cuts
  • Defects, e.g., drop outs, white outs
  • Soft cuts, e.g., fade-in/out, dissolve, wipe, cross-fade
• Automated structural analysis based on:
  • Analytical shot boundary detection
  • Machine-learning-based shot detection
Structural Analysis
• Shot Boundary Detection - automated identification of hard cuts based on:
  • Luminance/chrominance histogram differences & derivatives
  • Edge distribution/density
[Figure: consecutive frames 573-578 around a hard cut]
Structural Analysis
• Adaptive threshold: each frame is decomposed into a = 4 subregions; a sliding window of size 4 (W = 2) is used.

th_a(i) = \alpha \cdot \left[ \left( \sum_{k=i-W}^{i+W-1} D_a(k, k-1) \right) - D_a(i, i-1) \right] + \beta

where D_a(i, i-1) is the histogram difference (L2 norm) between frames i and i-1 of subregion a, and th_a(i) is the adaptive threshold for frame i of subregion a.

• Hard cut at frame i if, for all subregions a:

D_a(i, i-1) > th_a(i)   and   D_a(i+1, i) < th_a(i)

[Figure: frames i-3 ... i+2, each decomposed into 4 subregions]
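The adaptive-threshold rule above can be sketched as follows, assuming precomputed per-subregion histogram-difference series and illustrative values for α, β and W (the slides do not give concrete parameter values):

```python
import numpy as np

def hist_diff(region_a, region_b, bins=32):
    """D_a: L2-norm difference between the luminance histograms of the
    same subregion a in two consecutive frames."""
    ha, _ = np.histogram(region_a, bins=bins, range=(0, 256), density=True)
    hb, _ = np.histogram(region_b, bins=bins, range=(0, 256), density=True)
    return float(np.linalg.norm(ha - hb))

def adaptive_threshold(diffs, i, W=2, alpha=1.0, beta=0.05):
    """th_a(i) = alpha * ((sum_{k=i-W}^{i+W-1} D_a(k,k-1)) - D_a(i,i-1)) + beta
    where diffs[k] holds D_a(k, k-1) for one subregion a."""
    window_sum = sum(diffs[k] for k in range(i - W, i + W))  # k = i-W .. i+W-1
    return alpha * (window_sum - diffs[i]) + beta

def is_hard_cut(diffs_per_region, i, W=2, alpha=1.0, beta=0.05):
    """Hard cut at frame i iff, in every subregion a, D_a(i,i-1) exceeds
    the adaptive threshold while D_a(i+1,i) stays below it."""
    for diffs in diffs_per_region:
        th = adaptive_threshold(diffs, i, W, alpha, beta)
        if not (diffs[i] > th and diffs[i + 1] < th):
            return False
    return True
```

Because the window sum excludes the candidate difference itself, a single isolated spike stands out against quiet neighbors, while gradual transitions raise the threshold and are not reported as hard cuts.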
Structural Analysis
• Shot Boundary Detection - automated identification of defects, e.g., drop outs / white outs
  • Drop out: histogram/chrominance difference analysis
  • Flashlight / white out: histogram/chrominance difference analysis
  • Based on luminance/chrominance histogram differences & derivatives
[Figure: frame sequence i ... i+13 around a drop out]
Structural Analysis
• Shot Boundary Detection - automated identification of soft cuts, e.g., fade out / fade in
[Figure: example frame sequences of a fade out and a fade in]
Structural Analysis
• Shot Boundary Detection - automated identification of soft cuts, e.g., fade out / fade in
• Features applied for machine learning:
  • Luminance histogram (fade in / fade out): luminance average Yμ and luminance variance Yσ² follow distinct patterns
  • Image decomposition: component-based analysis (e.g., a 3x3 grid of subregions) to distinguish regional and global changes in image content
  • Entropy
  • Motion vectors
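The distinct Yμ/Yσ² pattern during a fade can be checked directly. This is only an illustrative heuristic under invented thresholds, not the machine-learning classifier the slides describe: during a fade-out to black, both the luminance mean and the variance shrink monotonically toward zero.

```python
import numpy as np

def luminance_stats(frames):
    """Per-frame luminance mean Y_mu and variance Y_sigma^2."""
    return [(float(f.mean()), float(f.var())) for f in frames]

def looks_like_fade_out(frames, eps=1e-3):
    """Heuristic: a fade-out shows monotonically decreasing luminance mean
    and variance, ending near a black (flat) frame. Thresholds are made up."""
    stats = luminance_stats(frames)
    means = [m for m, _ in stats]
    varis = [v for _, v in stats]
    monotone = all(a >= b for a, b in zip(means, means[1:])) and \
               all(a >= b for a, b in zip(varis, varis[1:]))
    return monotone and means[-1] < 8.0 and varis[-1] < eps * 255 ** 2

# Synthetic fade-out: the same frame scaled step by step toward black.
rng = np.random.default_rng(0)
base = rng.integers(0, 256, size=(48, 64)).astype(float)
fade = [base * s for s in (1.0, 0.75, 0.5, 0.25, 0.0)]
```

A fade-in is the mirrored pattern (both statistics growing from zero); in practice such per-frame statistics would be two of several features fed into the classifier rather than a decision rule on their own.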
Structural Analysis
• Shot Boundary Detection - automated identification of soft cuts, e.g., fade out / fade in
• Features deployed for machine learning:
  • Luminance/chrominance histogram
  • Entropy
  • Motion vectors:
    • Image decomposition (e.g., 2x2 subregions)
    • Compute average motion vectors for all areas
    • Identify camera movements (zoom, pan, etc.) and moving objects
Automated Audiovisual Analysis
• Visual Analysis
  • Structural Analysis
  • Intelligent Character Recognition (ICR): character/logo detection, character filtering, character recognition
  • Genre Analysis & Categorization
  • Face / Body / Object: detection, tracking, clustering
• Audio Analysis
  • Structural Analysis
  • Speaker Detection
  • Automated Speech Recognition (ASR)
Intelligent Character Recognition
• Preprocessing
  • Character identification
  • Text preprocessing: text filtering, adaptation of script geometry (deskew), image quality enhancement
• Optical Character Recognition (OCR)
  • Standard OCR software (OCRopus)
• Postprocessing
  • Lexical analysis
  • Statistical / context-based filtering
[Example overlay text: "Ermittlungen nach Bombenfunden"]
Intelligent Character Recognition
• Character identification
  • Robust filter to extract text candidate frames
  • 25 fps results in 90,000 frames per 60 min - too expensive for single-frame preprocessing & OCR
  • Therefore: fast and robust text identification for preprocessing
• Features used for text identification:
  • Edge detection: DCT / Fourier transformation, Sobel / Canny edge filter, horizontal and vertical edge distribution
  • Local Binary Patterns (LBP)
  • Histogram of Oriented Gradients (HOG)
  • Stroke width analysis
Intelligent Character Recognition
• Stroke Width Transformation
  • Based on edge filtering as a preprocessing step
  • For each edge pixel a stroke is projected along its gradient direction until another edge pixel is hit
  • All pixels along the stroke receive the same stroke width value (color)
  • Connected component analysis groups pixels with similar stroke width values
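A heavily simplified sketch of the stroke-width idea: instead of casting rays along the edge gradient, it measures horizontal runs on a binary text mask and accepts regions whose stroke widths are nearly constant. The `max_cv` cutoff is an invented parameter, not from the slides:

```python
import numpy as np

def horizontal_stroke_widths(mask):
    """Simplified SWT: each foreground pixel receives the length of the
    horizontal run it belongs to (stroke width along one direction only;
    the real SWT projects along the local edge-gradient direction)."""
    widths = np.zeros(mask.shape, dtype=int)
    for r, row in enumerate(mask):
        c = 0
        while c < len(row):
            if row[c]:
                start = c
                while c < len(row) and row[c]:
                    c += 1
                widths[r, start:c] = c - start  # whole run gets the same value
            else:
                c += 1
    return widths

def text_like(mask, max_cv=0.5):
    """Characters tend to have near-constant stroke width, so accept a
    region when the coefficient of variation of its widths is small."""
    w = horizontal_stroke_widths(np.asarray(mask, dtype=bool))
    vals = w[w > 0]
    if vals.size == 0:
        return False
    return float(vals.std()) / float(vals.mean()) < max_cv
```

The constant-width assumption is exactly what separates text from arbitrary blobs: letter strokes cluster around one width, while natural regions mix thin and thick structures.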
Intelligent Character Recognition
• Preprocessing: text preprocessing - text filtering
[Figure: original image vs. detected text bounding boxes]
Intelligent Character Recognition
• Preprocessing: text preprocessing - quality enhancement
[Figure: advanced image enhancement]
Intelligent Character Recognition
• Optical Character Recognition (OCR): standard OCR software (OCRopus)
[Figure: standard OCR (OCRopus) output]
Intelligent Character Recognition
• Postprocessing: lexical analysis, statistical / context-based filtering
[Figure: context-based spell correction]
Automated Audiovisual Analysis - Metadata Extraction
• Result: multimedia data with spatiotemporal annotations
Metadata (e.g., MPEG-7):
...
<Video>
  <TemporalDecomposition>
    <VideoSegment>
      <TextAnnotation>
        <KeywordAnnotation>
          <Keyword>Astronaut</Keyword>
        </KeywordAnnotation>
      </TextAnnotation>
      <MediaTime>
        <MediaTimePoint>T00:05:05:0F25</MediaTimePoint>
        <MediaDuration>PT00H00M31S0N25F</MediaDuration>
      </MediaTime>
      ...
    </VideoSegment>
  </TemporalDecomposition>
</Video>
...
Automated Audiovisual Analysis - Metadata Extraction
• Result: multimedia data with spatiotemporal annotations
Metadata (e.g., MPEG-7):
...
<SpatialDecomposition>
  <TextAnnotation>
    <KeywordAnnotation>
      <Keyword>Astronaut</Keyword>
    </KeywordAnnotation>
  </TextAnnotation>
  <SpatialMask>
    <SubRegion>
      <Polygon>
        <Coords>480 150 620 480</Coords>
      </Polygon>
    </SubRegion>
  </SpatialMask>
  ...
</SpatialDecomposition>
...
But what about semantic metadata..?
Multimedia Ontologies
• MPEG-7 has been re-engineered to become an OWL-DL ontology (2007: Arndt et al., COMM model)
• Localize a region → draw a bounding box
• Annotate the content → interpret the content → tag 'Astronaut'
Multimedia Ontologies
Example: Tagging with an MPEG-7 Ontology
[RDF graph: an mpeg7:image depicting "Man on the Moon" has an mpeg7:spatial_decomposition into Reg1; Reg1 has rdf:type mpeg7:StillRegion; Reg1 mpeg7:depicts dbpedia:Astronaut; Reg1 is localized by an mpeg7:SpatialMask with mpeg7:polygon / mpeg7:Coords]
Named Entity Recognition
"locating and classifying atomic elements ... into predefined categories such as names, persons, organizations, locations, expressions of time, quantities, monetary values, etc." - C.J. van Rijsbergen, Information Retrieval (1979)
[Graph: the entity Neil Armstrong linked via "is a" relations to the classes Astronaut, Person, Science, Occupation, Employment]
Semantic Multimedia Analysis
• Video analysis / metadata extraction: metadata fragments anchored along the video timeline
Semantic Multimedia Analysis
• Video analysis / metadata extraction: metadata fragments anchored along the video timeline
• Entity recognition / mapping: metadata is mapped to semantic entities, e.g., person xy, location yz, event abc
• Entities contribute, e.g., bibliographical data, geographical data, encyclopedic data, ...
Semantic Multimedia Analysis
Named Entity Recognition
• Mapping keyterms (text) to semantic entities
• Context analysis and disambiguation
Example: keyterm / user tag "Jaguar" → which semantic entity? Jaguar (Car)? Jaguar (Cat)? Jaguar (OS)? Jaguar (Aircraft)?
RDF graph to find relations between entities co-occurring in a text, maintaining the hypothesis that disambiguation of co-occurring elements in a text can be obtained by finding connected elements in an RDF graph [7]. In order to regard the special compilation of non-textual data, static and user-generated metadata in audio-visual content, our novel approach combines the use of semantic technologies and Linked Data with linguistic methods.

III. METHOD

According to a study about the structure and characteristics of folksonomy tags [8], an average of 83% of user-generated tags are single terms. Also, an average of 82% of the reviewed tags are nouns. Based on these study results, we ignore tag practices such as camel case ("barackObama") and treat tags as subjects or categories describing a resource. As a tag could also be part of a group of nouns representing an entity or a name ("flying machine", "albert einstein"), the tags, stored as single words without any given order, have to be combined into term groups of two or more terms to find all appropriate entities. Hence, every tag or group of tags within a given context may represent a distinct entity. The term combination process and the subsequent mapping of terms and term groups to entities are described in sect. III-B.

To disambiguate ambiguous terms we combine two methods: a co-occurrence analysis of the terms of the context in Wikipedia articles, and an analysis of the page link graph of the Wikipedia articles of entity candidates. The scores of both analysis steps are combined into a total score.
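The combination of the two analysis scores can be sketched as a weighted sum. The equal weights and the candidate scores below are hypothetical, since the text only states that both scores are combined into a total score:

```python
def total_score(cooc_score, link_score, w_cooc=0.5, w_link=0.5):
    """Combine the co-occurrence score and the page-link-graph score
    (both assumed normalised to [0, 1]) into one total score."""
    return w_cooc * cooc_score + w_link * link_score

def best_candidate(candidates):
    """candidates maps an entity URI to its (cooc_score, link_score) pair;
    the candidate with the highest total score wins the disambiguation."""
    return max(candidates, key=lambda uri: total_score(*candidates[uri]))

# Hypothetical scores for the ambiguous tag "jaguar" in a car-related context:
jaguar = {
    "dbpedia:Jaguar_Cars":   (0.8, 0.6),
    "dbpedia:Jaguar":        (0.3, 0.2),  # the cat
    "dbpedia:Mac_OS_X_10.2": (0.1, 0.1),
}
```

Using two independent evidence sources keeps one weak signal (e.g., a sparsely linked Wikipedia article) from dominating the decision.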
A. Context Definition

Metadata exists in a certain context and has to be interpreted according to this context. For tags of audio-visual content we identified two dimensions:
• temporal dimension
• user-centered dimension
In the temporal dimension a context can be defined as the entire video, a segment, or a single timestamp in the video. The user-centered dimension classifies a context by how many users created the concerning metadata: only tags by a certain user, or all tags regardless of which user. Fig. 1 shows the combinations of the two dimensions of contexts for metadata in audio-visual content and the interpretation regarding the significance of a context.

Audio-visual content also provides the opportunity to supply spatial information. Thus, tags in the same region of a video frame are considered as related to each other. In the current approach we did not consider this context dimension.

To describe our approach we use a sample context from our test set (see sect. IV). This sample context is composed of tags by only one user at a certain timestamp in the video. The video containing this sample context is a presentation by Dr. Garik Israelian at the TED conference3 entitled "How spectroscopy could reveal alien life"4. Our sample context consists of the tags "hubble", "spitzer", "carbon", "dioxide", "methan", "co2", and "water".

Figure 1. Dimensions of context definition in audio-visual content

B. Preprocessing
Term Combination: Our combination algorithm takes all tags of a specified spatio-temporal context (at a certain timestamp / in a certain segment of a video, of a single URL/image) and generates every possible combination of at most three terms of the context in every possible order. In that way we make sure to rectify groups of single terms that belong together. We chose to generate combinations of up to three words to make sure to also hit named entities consisting of more than two words, such as "public key cryptography" or "alberto santos dumont". About 90% of the DBpedia [9] labels consist of at most three words, but less than 5% consist of 4 words. Due to these numbers and performance issues we decided to limit the number of terms to be combined to three. Subsequently in this paper, by terms we refer to single terms as well as generated term groups. The number c of combinations for n tags is calculated by c = \sum_{k=1}^{j} \frac{n!}{(n-k)!}.
For our sample context containing 7 tags and at most3 terms in a combination (j = 3), 259 combinations aregenerated.
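The combination step can be reproduced with ordered selections of distinct tags; joining each group with a single space is an assumption about how term groups are matched against labels, not stated in the text:

```python
from itertools import permutations

def term_combinations(tags, j=3):
    """Generate every ordered group of 1..j distinct tags; each group is one
    candidate term for the label lookup (c = sum_{k=1}^{j} n!/(n-k)!)."""
    return [" ".join(group) for k in range(1, j + 1)
            for group in permutations(tags, k)]

tags = ["hubble", "spitzer", "carbon", "dioxide", "methan", "co2", "water"]
combos = term_combinations(tags)
```

For n = 7 and j = 3 this yields 7 + 42 + 210 = 259 candidates, matching the count given above, and it includes multi-word groups such as "carbon dioxide".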
Term Mapping: The terms then have to be mapped to semantic entities. For our approach we use entities of the Linked Open Data Cloud [10], in particular of DBpedia, version 3.5.1.

DBpedia provides labels for the identification of distinct entities in 92 languages. We use English and German as well as Finnish labels, as we noticed that neither the English nor the German labels contain important acronyms as labels, but the Finnish language version does. As tagging users prefer to keep it simple and short [2], resources dealing with "Domain Name System" would rather be tagged with "DNS" than "Domain Name System".

After simple string matching of the terms of the context to DBpedia URIs, the URIs are revised for redirects and

3 http://www.ted.com
4 http://yovisto.com/play/14415
Semantic Multimedia Analysis
Context Analysis and Disambiguation - what defines a context in AV data?
• Temporal coherence
• Spatial coherence
• Provenance
Freitag, 30. September 11
Harald Sack, Hasso-Plattner-Institute for IT-Systems Engineering, LDW 2011, Magdeburg, 30. Sep. 2011
RDF graph to find relations between entities co-occurringin a text maintaining the hypothesis that disambiguationof co-occurring elements in a text can be obtained byfinding connected elements in an RDF graph [7]. In orderto regard the special compilation of non-textual data, staticand user-genrated metadata in audio-visual content our novelapproach combines the use of semantic technologies andLinked Data with linguistic methods.
III. METHOD
According to a study about structure and characteristicsof folksonomy tags [8] an average of 83% of user-generatedtags are single terms. Also, an average of 82% of thereviewed tags are nouns. Based on these study results, weignore tag practices, such as camel case (”barackObama”)and treat tags as subjects or categories describing a resource.As a tag could also be part of a group of nouns representingan entity or a name (”flying machine”,”albert einstein”) thetags stored as single words without any given order have tobe combined in term groups of two or more terms to findall appropriate entities. Hence, every tag or group of tagswithin a given context may represent a distinct entity. Theterm combination process and subsequent mapping of termsand term groups to entities are described in sect. III-B.
To disambiguate ambiguous terms we combine two meth-ods: a co-occurences analysis of the terms in the context inWikipedia articles and an analysis of the page link graph ofthe Wikipedia articles of entity candidates. The scores forboth analysis steps are calculated to a total score.
A. Context Definition
Metadata exists in a certain context and has to be inter-preted according to this context. For tags of audio-visualcontent we identified two dimensions:
• temporal dimension• user-centered dimensionIn the temporal dimension a context can be defined as the
entire video, a segment or a single timestamp in the video.The user-centered dimension classifies a context by howmany users created the concerning metadata - only tags by acertain user or all tags regardless of which user. Fig. 1 showsthe combinations of the two dimensions of contexts formetadata in audio-visual content the interpretation regardingthe significance of a context.
Audio-visual content also provides the opportunity tosupply spatial information. Thus, tags in the same regionof a video frame are considered as related to each other.In the current approach we did not consider this contextdimension.
Context Analysis and Disambiguation
What defines a context in AV data?
• Temporal Coherence
• Spatial Coherence
• Provenance
Semantic Multimedia Analysis
Temporal Dimension
Spatial Dimension
User-centered Dimension
Friday, 30 September 2011
Harald Sack, Hasso-Plattner-Institute for IT-Systems Engineering, LWA 2011, Magdeburg, 30 Sep 2011
RDF graph to find relations between entities co-occurring in a text, maintaining the hypothesis that disambiguation of co-occurring elements in a text can be obtained by finding connected elements in an RDF graph [7]. In order to take into account the special combination of non-textual data, static metadata, and user-generated metadata in audio-visual content, our novel approach combines the use of semantic technologies and Linked Data with linguistic methods.
III. METHOD
According to a study about the structure and characteristics of folksonomy tags [8], an average of 83% of user-generated tags are single terms. Also, an average of 82% of the reviewed tags are nouns. Based on these study results, we ignore tag practices such as camel case ("barackObama") and treat tags as subjects or categories describing a resource. As a tag could also be part of a group of nouns representing an entity or a name ("flying machine", "albert einstein"), the tags, stored as single words without any given order, have to be combined into term groups of two or more terms to find all appropriate entities. Hence, every tag or group of tags within a given context may represent a distinct entity. The term combination process and the subsequent mapping of terms and term groups to entities are described in sect. III-B.
To disambiguate ambiguous terms we combine two methods: a co-occurrence analysis of the terms of the context in Wikipedia articles, and an analysis of the page link graph of the Wikipedia articles of the entity candidates. The scores of both analysis steps are combined into a total score.
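The combination of the two scores can be sketched as a weighted sum over entity candidates; the weighting `alpha` and the example scores are assumptions for illustration, not values from the paper:

```python
def total_score(cooc_score, graph_score, alpha=0.5):
    """Weighted combination of the co-occurrence score and the
    page-link-graph score. alpha is an assumed weighting."""
    return alpha * cooc_score + (1 - alpha) * graph_score

def rank_candidates(candidates, alpha=0.5):
    """candidates: {entity: (cooc_score, graph_score)} -> entities, best first."""
    return sorted(candidates,
                  key=lambda e: total_score(*candidates[e], alpha),
                  reverse=True)

# Toy scores for two candidate entities of an ambiguous term.
scores = {"Jaguar_(car)": (0.8, 0.7), "Jaguar_(animal)": (0.3, 0.1)}
print(rank_candidates(scores))  # ['Jaguar_(car)', 'Jaguar_(animal)']
```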
A. Context Definition
Metadata exists in a certain context and has to be interpreted according to this context. For tags of audio-visual content we identified two dimensions:
• temporal dimension
• user-centered dimension
In the temporal dimension a context can be defined as the entire video, a segment, or a single timestamp in the video. The user-centered dimension classifies a context by how many users created the metadata in question: only the tags of a certain user, or all tags regardless of user. Fig. 1 shows the combinations of the two dimensions of context for metadata in audio-visual content and their interpretation regarding the significance of a context.
Audio-visual content also provides the opportunity to supply spatial information. Thus, tags in the same region of a video frame are considered as related to each other. In the current approach we did not consider this context dimension.
To describe our approach we use a sample context from our test set (see sect. IV). This sample context is composed of tags by only one user at a certain timestamp in the video. The video containing this sample context is a presentation
Figure 1. Dimensions of context definition in audio-visual content
by Dr. Garik Israelian at the TED conference3 entitled "How spectroscopy could reveal alien life"4. Our sample context consists of the tags "hubble", "spitzer", "carbon", "dioxide", "methan", "co2", and "water".
B. Preprocessing
Term Combination: Our combination algorithm takes all tags of a specified spatio-temporal context (at a certain timestamp or in a certain segment of a video, of a single URL/image) and generates every possible combination of at most three terms of the context in every possible order. In that way we make sure to identify groups of single terms that belong together. We chose to generate combinations of three words to make sure to also hit named entities consisting of more than two words, such as "public key cryptography" or "alberto santos dumont". About 90% of the DBpedia [9] labels consist of at most three words, but less than 5% consist of 4 words. Due to these numbers and performance issues we decided to limit the number of terms to be combined to three. Subsequently in this paper, by terms we will refer to single terms as well as generated term groups. The number c of combinations is calculated by

c = Σ_{k=1}^{j} n! / (n − k)!
For our sample context containing 7 tags and at most 3 terms in a combination (j = 3), 259 combinations are generated.
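A minimal sketch of this combination step, using ordered k-permutations of the tags; the function name is hypothetical, but the count reproduces the formula above (7 + 42 + 210 = 259 for the sample context):

```python
from itertools import permutations

def term_combinations(tags, j=3):
    """Generate every ordered combination of at most j tags,
    joined into candidate term groups."""
    combos = []
    for k in range(1, j + 1):
        combos.extend(" ".join(p) for p in permutations(tags, k))
    return combos

tags = ["hubble", "spitzer", "carbon", "dioxide", "methan", "co2", "water"]
print(len(term_combinations(tags)))  # 259
```

Note that "carbon dioxide" appears among the generated term groups, which is exactly the kind of multi-word entity the single tags alone would miss.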
Term Mapping: The terms then have to be mapped to semantic entities. For our approach we use entities of the Linked Open Data Cloud [10], in particular of DBpedia, version 3.5.1.
DBpedia provides labels for the identification of distinct entities in 92 languages. We use English, German, and Finnish labels, as we noticed that neither the English nor the German labels contain important acronyms as labels, but the Finnish language version does. As tagging users prefer to keep it simple and short [2], resources dealing with the "Domain Name System" would rather be tagged with "DNS" than "Domain Name System".
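The mapping step can be sketched as a lookup in a multilingual label index; the entries below are illustrative stand-ins, not taken from an actual DBpedia dump:

```python
# Toy label index standing in for DBpedia labels in several languages.
LABELS = {
    "domain name system": "dbpedia:Domain_Name_System",
    "dns": "dbpedia:Domain_Name_System",  # acronym label, e.g. from the Finnish set
    "hubble space telescope": "dbpedia:Hubble_Space_Telescope",
}

def map_terms(terms):
    """Simple case-insensitive string matching of terms to entity URIs."""
    return {t: LABELS[t.lower()] for t in terms if t.lower() in LABELS}

print(map_terms(["DNS", "water"]))  # only "DNS" resolves in this toy index
```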
After simple string matching of the terms of the context to DBpedia URIs, the URIs are checked for redirects and
3 http://www.ted.com
4 http://yovisto.com/play/14415
Semantic Graph Analysis
Keyterm / user tag: "jaguar"
Context: jaguar, 1956, Steve McQueen, rim, wheel
LOD Cloud candidates: Jaguar (Car), Jaguar (Cat), Jaguar (OS)
Context entities in the graph: Steve McQueen, 1956
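The graph-based disambiguation on this slide can be sketched as picking the candidate entity that shares the most links with the context; the link sets below are illustrative, not actual Wikipedia page links:

```python
# Toy page-link graph; the edges are illustrative.
LINKS = {
    "Jaguar_(car)":    {"Steve_McQueen", "1956", "Wheel", "Rim"},
    "Jaguar_(animal)": {"Big_cat", "South_America"},
    "Jaguar_(OS)":     {"Apple_Inc.", "Mac_OS_X"},
}

def disambiguate(candidates, context):
    """Pick the candidate entity sharing the most graph links with the context."""
    return max(candidates, key=lambda c: len(LINKS.get(c, set()) & context))

context = {"Steve_McQueen", "1956", "Rim", "Wheel"}
print(disambiguate(list(LINKS), context))  # Jaguar_(car)
```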
Overview
(1) Searching Audiovisual Data
(2) Semantic Multimedia Analysis
(3) Explorative Semantic Search
(4) SeMEX - Semantic Multimedia Explorer
SEMEX - Enabling Exploratory Video Search by Semantic Video Analysis, LWA 2011, Magdeburg, 30 Sep 2011
Searching is not always just searching
a simple example:
"I'm looking for a book by Ernest Hemingway with the title 'For Whom the Bell Tolls' in the first German edition..."
Wem die Stunde schlägt. - Ernest HEMINGWAY. (Stockholm etc., Bermann-Fischer Verlag, 1941), 560 pages, octavo
II 1, 2506, 34548
...but what if...
"I really liked the book 'For Whom the Bell Tolls' but I have no idea what I should read next..."
Exploratory Search
• What if the user does not know which query string to use?
• What if the user is looking for complex answers?
• What if the user does not know the domain he/she is looking for?
• What if the user wants to know all(!) about a specific topic?
• ...'Browsing' instead of 'Searching'
• ...to find something by chance -> serendipity
• ...to get an overview
• ...enable content-based navigation
Exploratory Multimedia Search
Video Analysis / Metadata Extraction: metadata is attached along the video timeline
Entity Recognition / Mapping: e.g., person xy, location yz, event abc
Entities link to, e.g., bibliographical data, geographical data, encyclopedic data, ...
http://linkeddata.org/
Data is a precious thing and will last longer than the systems themselves. (Tim Berners-Lee)
The Web of Data - The Semantic Web
dbpedia:For_Whom_the_Bell_Tolls
What facts about dbpedia:For_Whom_the_Bell_Tolls are relevant?
http://dbpedia.org/page/For_Whom_the_Bell_Tolls
DBpedia - the Semantic Wikipedia
...use heuristics
Exploratory Multimedia Search
dbpedia:For_Whom_the_Bell_Tolls --dbpedia-owl:author--> dbpedia:Ernest_Hemingway
further dbpedia-owl:author edges connect Hemingway's other works to the same entity
Exploratory Multimedia Search
dbpedia:For_Whom_the_Bell_Tolls --dbpedia-owl:author--> dbpedia:Ernest_Hemingway
dbpedia:Raymond_Carver --dbpedia-owl:influenced_by--> dbpedia:Ernest_Hemingway
dbpedia:Jack_Kerouac --dbpedia-owl:influenced_by--> dbpedia:Ernest_Hemingway
dbpedia:Jerome_D._Salinger --dbpedia-owl:influenced_by--> dbpedia:Ernest_Hemingway
Exploratory Multimedia Search
dbpedia:Jack_Kerouac, dbpedia:Raymond_Carver, dbpedia:Jerome_D._Salinger
dbpedia-owl:notableWork edges lead from each of these authors to their notable works
Overview
(1) Searching Audiovisual Data
(2) Semantic Multimedia Analysis
(3) Explorative Semantic Search
(4) SeMEX - Semantic Multimedia Explorer
SEMEX - Enabling Exploratory Video Search by Semantic Video Analysis, LWA 2011, Magdeburg, 30 Sep 2011
http://bit.ly/SeMEX
http://mediaglobe.yovisto.com:8080
Overview
(1) Searching Audiovisual Data
(2) Semantic Multimedia Analysis
(3) Explorative Semantic Search
(4) SeMEX - Semantic Multimedia Explorer
SEMEX - Enabling Exploratory Video Search by Semantic Video Analysis, LWA 2011, Magdeburg, 30 Sep 2011
Contact:
Dr. Harald Sack
Hasso-Plattner-Institut für Softwaresystemtechnik
Universität Potsdam
Prof.-Dr.-Helmert-Str. 2-3
D-14482 Potsdam
Homepage: http://www.hpi.uni-potsdam.de/meinel/team/sack.html | http://www.yovisto.com/
Blog: http://moresemantic.blogspot.com/
E-Mail: [email protected] | [email protected]
Twitter: lysander07 / biblionomicon / yovisto
Thank you very much for your attention!