fiat 20080921 results pisa
In the research project PISA we investigated how powerful search engines can be built, given a library of audiovisual material that has been analysed objectively and intelligently.
medialab
PISA – Proof of Concept
Production, Indexing and Search of Audiovisual Material
PISA - Positioning
• VRT-Medialab (medialab.vrt.be) – technical R&D
• IBBT (www.ibbt.be) – Interdisciplinary Research Institute
• PISA – Research Project on Production and Indexing of Audiovisual Media
• 21 man-years
• Computer Assisted Manufacturing
• Unsupervised Feature Extraction
• Search Engine Technology
Context - Digital Media Production
[Diagram: Production Platform for production and distribution; Suprastructure – Metadata Mgmt; Infrastructure – Networks and Storage; workflow stages: Ingest, Media Asset Mgmt, Editing, Playout, Mastering]
Digital Asset Management, Content Management…
[Diagram: Production Platform for production and distribution; Suprastructure – Metadata Mgmt; Infrastructure – Networks and Storage]
User Expectations
[Diagram labels: Production Platform; Data General (repeated six times); Metadata; Communication (Information); Suprastructure – Metadata Mgmt; Infrastructure – Networks and Storage; Production and distribution]
Assumptions:
• An item is relevant or it is not
• A “scene” is the logical unit of search
The ideal search engine:
• retrieves all relevant items (recall 100%)
• without false positives (precision 100%)
• enables instant access to digital media
• with respect for intellectual property.
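The two targets above can be made concrete with a minimal sketch, assuming binary relevance and a "scene" as the unit of search, as the slide states. The scene identifiers are hypothetical, and returning 1.0 for an empty result set or an empty relevant set is a convention chosen here, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query under binary relevance."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Convention chosen for this sketch: an empty set counts as a perfect score.
    precision = len(hits) / len(retrieved) if retrieved else 1.0
    recall = len(hits) / len(relevant) if relevant else 1.0
    return precision, recall


# Hypothetical scene identifiers, for illustration only.
relevant = {"scene-01", "scene-02", "scene-03", "scene-04"}
retrieved = ["scene-01", "scene-02", "scene-09"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three retrieved scenes are relevant and two of the four relevant scenes are found. The ideal engine on the slide would score 1.0 on both.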
Archiving – Disclosure, Annotation,…
archive number : ALG 20010813 1
fragment number : 1
series : 1000 ZONNEN EN GARNALEN
tape number : E03024404
format : DBCM
fragment title : 1000 ZONNEN & GARNALEN
picture : KL/PALPLUS
fragment duration : 18 20
text : 0'00" TOURIST REPORTAGE MAGAZINE, OVERVIEW OF TOPICS;
OPENING TITLES TOURIST REPORTAGE MAGAZINE,
OVERVIEW OF TOPICS
0'50" TODAY : ARTIST LUC HOFKENS DESIGNED AN OASIS
ON HIS ROOF TERRACE IN BORGERHOUT THAT IS REMINISCENT OF THE
GRAND CANYON; INTERVIEW WITH LUC AND HIS WIFE
MARILOU; EXTERIOR SHOTS: ROOF WITH SURROUNDINGS, OUTSIDE OF
WORKER'S HOUSE, PAN ACROSS ROCK WALLS, CRATERS WITH WATER,
PLANTING, PHOTO ALBUM SHOWING PROGRESS OF THE WORKS
4'00" JUNIOR : KLAARTJE ALAERTS, 13 YEARS OLD, WANTS TO BECOME AN
ASTRONAUT; SHE VISITS THE EURO SPACE CENTER WITH SPACE SHUTTLES,
ROCKETS; SIMULATION IN SPACE SHUTTLE, INTERVIEW, HAS SEEN A
UFO, BUILDS A SMALL ROCKET HERSELF, LAUNCHES IT
7'50" DE SCHEURKALENDER : ARCHIVE ADVERTISING FILM IBM,
INTERVIEW MAURICE DE WILDE, FIRST PERSONAL COMPUTER
keywords : BELGIUM; BORGERHOUT; ARTIST; OASIS; ART; GRAND
CANYON (NATURE AREA); ROOF; TERRACE; INTERVIEW; EURO
SPACE CENTER; SPACE TRAVEL; PC; BOAT TRIP; WEALTH;
PASSENGER; GASTRONOMY; RESTAURANT; STAFF;
HOLIDAY; INTERIOR SHOT; SHIP; BECKERS LEEN; VRT;
LOTTO; FEMALE RADIO ANNOUNCER; SOUND STUDIO; INVENTION;
BARBECUE; CONCRETE MIXER; IBM; COMMERCIAL
rights holder : VRT
Search screen FILM Set: 16 Count: 1
page 1 of 3
keywords: ibm and vrt
archive number: -
broadcast year: month: day:
fragment number: fragment duration:
series:
format: tape number:
episode: episode number:
programme: broadcast date:
fragment title:
text:
category:
recording date: recording number:
journalist: rights holder:
SETS
The strings required for the operation are not defined
F11 F12 F13 F14 F17 F18 F19 F20 Ent
Exit Sets Refset Show Previous Next/Clear Thesaurus Command Search
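The query `ibm and vrt` on the search screen above is a boolean AND over the manually assigned keyword field. A minimal sketch of that matching follows, assuming a semicolon-separated keyword list as in the record shown. The function names are hypothetical and this is not the legacy system's actual query syntax or implementation.

```python
def parse_keywords(field):
    """Split a semicolon-separated keyword field into a normalized set."""
    return {kw.strip().lower() for kw in field.split(";") if kw.strip()}


def matches_all(query, keywords):
    """Evaluate a query of the form 'a and b and c' against a keyword set."""
    terms = [t.strip() for t in query.lower().split(" and ")]
    return all(term in keywords for term in terms if term)


# A subset of the keywords from the archive record above.
keywords = parse_keywords("BORGERHOUT; EURO SPACE CENTER; PC; VRT; LOTTO; IBM")

print(matches_all("ibm and vrt", keywords))
print(matches_all("ibm and nasa", keywords))
```

A record is only found if the annotator happened to assign exactly the keywords the searcher types, which is the limitation the following slides address.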
Aha - The Search Engine!
Issues – Catch-22
-> Automated processing of information is a key
discriminator, but it requires correct and
structured metadata
-> “Annotation” of rich media requires semantic
awareness and interpretation, and thus it is at
best an approximation
-> Product Engineering is the source of structured
and meaningful information, but creative staff
are not receptive to technology
Objectives - Proof of Concept
• One Set of Numbers(!)
• Model Driven Development
• Computer Assisted Manufacturing
• Unsupervised Feature Extraction
• Efficient Search and Retrieval
Develop an extensible data model and a consistent application
framework, accessible via an intuitive user interface
(Digitizing analogue and disintegrated information flows)
Milestone 1 – Search Engine
• Search federation by system integration
• Faceted search
• Integrated application of keywords
• Intuitive and structured presentation of results
• Direct access to audiovisual material
[Diagram: Search Client (Custom Development); Search Engine (Lucene/SOLR); Media Asset Management System (Ardome); Legacy Video Library (Basisplus); Current news items (Ardome); Raw Material (EBU Superpop); <NewsML-G2>]
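Faceted search over federated sources can be illustrated with a small in-memory sketch. The records and field names below are hypothetical, and the counting is done in plain Python for readability; in the project itself the search engine (Lucene/SOLR) would handle this, and its configuration is not shown in the slides.

```python
from collections import Counter

# Hypothetical records from the three sources named on the slide.
records = [
    {"title": "IBM commercial", "source": "Basisplus", "keywords": ["ibm", "vrt"]},
    {"title": "Space centre visit", "source": "Basisplus", "keywords": ["space", "vrt"]},
    {"title": "Evening news item", "source": "Ardome", "keywords": ["vrt"]},
    {"title": "Agency raw footage", "source": "EBU Superpop", "keywords": ["space"]},
]


def search(records, keyword):
    """Return all records carrying the given keyword, whatever their source."""
    return [r for r in records if keyword in r["keywords"]]


def facet_counts(results, field):
    """Count how many results fall under each value of a facet field."""
    return Counter(r[field] for r in results)


results = search(records, "vrt")
facets = facet_counts(results, "source")
print(dict(facets))

# Selecting a facet value narrows the result set without a new query.
narrowed = [r for r in results if r["source"] == "Basisplus"]
print([r["title"] for r in narrowed])
```

The facet counts give the structured presentation of results: one query, grouped per library, with a click-through to narrow.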
Shot Segmentation and Scene Recognition
Character Recognition
Video copy detection
• Identify duplicates
• Generation tracking
• Grouping of search results
• Intellectual Property Protection
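The slides do not say which copy-detection technique the project uses. As a stand-in, the sketch below groups near-identical frames with a simple average hash and Hamming distance, a common textbook approach. The tiny 2x2 grayscale frames, the names and the distance threshold are all hypothetical.

```python
def average_hash(frame):
    """Bit tuple: 1 where a pixel is brighter than the frame's mean."""
    pixels = [p for row in frame for p in row]
    mean = sum(pixels) / len(pixels)
    return tuple(1 if p > mean else 0 for p in pixels)


def hamming(a, b):
    """Number of positions at which two equal-length hashes differ."""
    return sum(x != y for x, y in zip(a, b))


def group_duplicates(frames, max_distance=1):
    """Greedy grouping: a frame joins the first group whose
    representative hash is within max_distance bits."""
    groups = []  # list of (representative_hash, [names])
    for name, frame in frames.items():
        h = average_hash(frame)
        for rep, members in groups:
            if hamming(rep, h) <= max_distance:
                members.append(name)
                break
        else:
            groups.append((h, [name]))
    return [members for _, members in groups]


# Hypothetical 2x2 grayscale keyframes.
frames = {
    "master": [[10, 200], [220, 30]],
    "re-encode": [[14, 190], [215, 35]],
    "other": [[200, 10], [20, 230]],
}
groups = group_duplicates(frames)
print(groups)
```

Grouping of this kind supports the uses listed above: duplicates collapse into one search result, and membership of a group hints at which copies derive from the same master.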
Milestone 2 – Feature Extraction
• Time-coded properties and indexing allow random access to material fragments:
  • Shot segmentation and keyframe extraction
  • Subtitle processing and speech recognition
  • Taxonomy-driven topic detection
  • Face recognition
  • Scene recognition
  • Copy detection
[Diagram: Media Production; Shot Segmentation; Speech Recognition; Face Detection; Topic Detection; Media Asset Management System (Ardome); Search Engine (Lucene/SOLR); Legacy Video Library (Basisplus); Current news items (Ardome); Raw Material (EBU Superpop); <NewsML-G2>]
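A time-coded index can be sketched as an inverted index whose postings are time spans rather than word positions. The segments below are hypothetical (times in seconds), standing in for output from subtitle processing or speech recognition; the project's actual data model is not shown in the slides.

```python
from collections import defaultdict


def build_timecode_index(segments):
    """Map each lower-cased word to the (asset, start, end) spans where it occurs."""
    index = defaultdict(list)
    for asset_id, start, end, text in segments:
        for word in set(text.lower().split()):
            index[word].append((asset_id, start, end))
    return index


# Hypothetical time-coded segments: (asset id, start, end, recognized text).
segments = [
    ("item-A", 0.0, 50.0, "overview of topics"),
    ("item-A", 50.0, 240.0, "artist builds oasis on roof terrace"),
    ("item-A", 470.0, 1100.0, "archive commercial first personal computer"),
    ("item-B", 12.0, 31.5, "rocket launch at space center"),
]

index = build_timecode_index(segments)
for asset_id, start, end in index["computer"]:
    print(f"{asset_id}: play from {start:.0f}s to {end:.0f}s")
```

A hit therefore points at a playable fragment, which is what makes random access to the material possible.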
Work in Progress (due Q4 2008)
• Multi-lingual
• Access control and Intellectual Property Protection
• Audio segmentation and classification
• Music transcription
• Fractal-based visual indexing
• …
[Diagram label: Media Production]
Conclusion
• Enterprise search – structured metadata, limited number of libraries, limited number of records per library, dependencies between objects
• Intelligent search federation is aware of the media production process: scripts, webpages, subtitles and formal annotation may represent the same editorial object
• Random access to audiovisual material requires an index that is based on timecode and not on « word position in a document »
• Ontology-driven application logic is essential to create semantic awareness, i.e. resolving synonyms and disambiguating homonyms
• The perfect search engine is not for sale yet and requires design and development from the ground up.
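Resolving synonyms and disambiguating homonyms can be sketched with a toy ontology that maps concepts to labels. Everything below (the concept identifiers, the labels, the "jaguar" example and the context hints) is hypothetical and only illustrates the idea, not the project's ontology or logic.

```python
# Hypothetical mini-ontology: concept id -> labels that may denote it.
ONTOLOGY = {
    "computer/pc": {"pc", "personal computer", "computer"},
    "animal/jaguar": {"jaguar"},
    "car/jaguar": {"jaguar"},
}

# Hypothetical context words that favour one reading of a homonym.
CONTEXT_HINTS = {
    "animal/jaguar": {"zoo", "wildlife"},
    "car/jaguar": {"engine", "road"},
}


def concepts_for(term):
    """All concepts that list the term as a label."""
    term = term.lower()
    return sorted(c for c, labels in ONTOLOGY.items() if term in labels)


def resolve(term, context=()):
    """Pick the concepts supported by the context; keep all if none is."""
    candidates = concepts_for(term)
    if len(candidates) <= 1:
        return candidates
    context = {w.lower() for w in context}
    supported = [c for c in candidates if CONTEXT_HINTS.get(c, set()) & context]
    return supported or candidates


def expand(term, context=()):
    """Synonym expansion: all labels of the resolved concepts."""
    labels = set()
    for concept in resolve(term, context):
        labels |= ONTOLOGY[concept]
    return sorted(labels)


print(expand("PC"))
print(resolve("jaguar", context=["road", "test"]))
print(resolve("jaguar"))
```

A query for "PC" then also finds material annotated "personal computer", while an ambiguous term stays ambiguous until the context supports one reading.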
Future Work - From « Metadata » to CAD/CAM
?