
R1 PIPELINES FOR MULTIMODAL CONTENT ANALYSIS & ENRICHMENT
Human-enhanced time-aware multimedia search

CUbRIK Project IST-287704
Deliverable D5.1 WP5
Deliverable Version 1.0 – 31 August 2012
Document ref.: cubrik.D51.POLMI.WP5.V1.0


Programme Name: IST
Project Number: 287704
Project Title: CUbRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INN, HOM, CVCE, EIPCM
Document Number: cubrik.D51.POLMI.WP5.V1.0
Work-Package: WP5
Deliverable Type: Document
Contractual Date of Delivery: 31 August 2012
Actual Date of Delivery: 31 August 2012
Title of Document: R1 Pipelines for Multimodal Content Analysis & Enrichment
Author(s): Catallo, Ciceri, Tagliasacchi, Bozzon, Martinenghi, Fraternali (POLMI)
Approval of this report:
Summary of this report:
History:
Keyword List:
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License.

This work is partially funded by the EU under grant IST-FP7-287704


Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704.

Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein.

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704.

The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.


Table of Contents

Executive Summary
1. Human-Enhanced Logo Detection
2. Architecture
   2.1 Data Model
   2.2 Data Service
   2.3 Jobs
       2.3.1 Logo Retrieval
       2.3.2 Logo Processing
       2.3.3 Video Processing
       2.3.4 Key-frame/Logo Matching
       2.3.5 Logo Validation
       2.3.6 Match Validation
   2.4 Search Servlet
3. User Interface
   3.1 Running Example


Executive Summary

This deliverable contains a detailed description of the content analysis pipeline included in the CUbRIK release R1. The work is mainly the outcome of Task 5.1 (Pipelines for feature extraction and multimodal metadata fusion). Specifically, this deliverable illustrates the “human-powered logo detection pipeline”, which combines different components (cf. D8.1) to perform the analysis of multimedia content.


1. Human-enhanced logo detection

This section provides a high-level specification of the implemented pipeline.

Purpose: Identify occurrences of logos in video clips through keyword-based queries.

Description: A professional user wants to retrieve all the occurrences of logos in a large collection of video clips.

Query formulation: The user query is represented by means of a keyword, which indicates the brand of interest (e.g. “Apple”, “Audi”, etc.).

Data collection: A set of video clips. In each frame there might be one or multiple occurrences of the desired logo. For a given brand, different versions of the same logo might exist. The visual appearance of logos might be affected by variable illumination, perspective, non-rigid warping, etc.

Result: The output of the system is a comprehensive analytics report, which indicates, e.g., the number of distinct logo occurrences and temporal statistics related to the timing of the occurrences.

Indexing pipeline: The system might operate on video content that was not indexed. However, different forms of indexing might be used to:

- Perform shot detection and key-frame extraction. Only key-frames might be processed at query time to reduce the computational burden.

- Identify the frames containing logos and the corresponding spatial region of interest.

In all cases, indexing could be either automatic or human-powered (or both).

Query processing pipeline

Pipeline A – logos are represented as images

• The user enters a keyword query (e.g. “Apple”)

• A set of images related to the keyword is retrieved (e.g. using Google Images)

• Retrieved images need to be:
   - Validated: only a subset of the retrieved images contain the logo of the queried brand
   - Segmented: the region of interest around the logo needs to be segmented and isolated from the background
   - Annotated: optionally, it might be possible to determine additional attributes related to the identified logo, e.g. the period of time during which the logo was used

• Each of the segmented logos represents an exemplar to be used for content-based image retrieval

• Visual matches are sought in each frame (or in key-frames only)

• For each match found, a confidence value is returned

• Successful matches might be:
   - Validated, if the confidence is low
   - Used as further exemplars, if the confidence is high

Extensions

Pipeline B – logos are represented as typeset text

1) The user enters a keyword query
2) For each frame, a textual transcription is obtained (e.g. by means of VideoOCR)
3) Textual matches are sought in each frame
4) For each match found, a confidence value is returned
5) Successful matches might be validated, if the confidence is low

Off-the-shelf components:

- Text-to-images: Google Image Search

- CBIR system: OpenCV/SIFT + custom spatial verification

- Temporal segmentation: IDMT Temporal segmentation

Human-computing components: Humans can be involved to accomplish the following tasks:

- Validate the set of images retrieved by means of a keyword query
- Identify frames containing logos
- Identify the ROI in frames containing logos
- Validate visual matches characterized by low confidence
- Annotate logos with auxiliary information


2. Architecture

The architecture depicted in Figure 1 includes:

• A data service component to store resources (logo images and videos);

• Seven workflows (logo retrieval, logo processing, logo index update, matching, video processing, match index update, search index);

• A servlet that manages logo search;

• A conflict resolution manager (CrowdSearcher), which validates logos and matches.

Figure 1: Diagram of the software architecture implementing the “human-enhanced logo detection pipeline”


2.1 Data Model

The data model is illustrated in Figure 2. Below, details are provided for each of the entities.

Brand metadata:

• brand ID

• brand name: the brand name stemmed and normalized for indexing purposes (e.g. coca_cola)

• brand query: the actual query sent by the user (e.g. Coca Cola)

• high confidence threshold: the value to be used after the matching phase to distinguish high confidence matches from other matches

• low confidence threshold: the value to be used after the matching phase to distinguish “junk” matches from other matches

Logo image metadata comprise:

• ID: the unique identifier obtained from the Data Service (it also corresponds to the path on the Data Service)

• URL: the URL of the image obtained by Google Images.

• confidence: computed as googleWeight * googleConfidence + crowdWeight * crowdConfidence, where googleConfidence = (32 − image_position_in_top_32) / 32; if the logo has not yet been evaluated by the crowd, crowdConfidence = 0 (see the sketch after this list)

• SIFT descriptors local URI: the local URI of the SIFT descriptors file
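A minimal Java sketch of the confidence computation above. The 0.5 weights mirror the “googleContribution” and “crowdContribution” parameters of the Logo Retrieval workflow (Section 2.3.1); class and method names are illustrative, and the rank convention (0- or 1-based) is not specified in the deliverable.

public class LogoConfidence {

    static final double GOOGLE_WEIGHT = 0.5; // cf. "googleContribution" in the LogoRetrieval workflow
    static final double CROWD_WEIGHT = 0.5;  // cf. "crowdContribution" in the LogoRetrieval workflow

    // rank: position of the image in the Google top-32 (here assumed 0 = first result);
    // crowdConfidence: crowd value in [0, 1], or 0 if the logo is not yet validated.
    static double confidence(int rank, double crowdConfidence) {
        double googleConfidence = (32.0 - rank) / 32.0;
        return GOOGLE_WEIGHT * googleConfidence + CROWD_WEIGHT * crowdConfidence;
    }

    public static void main(String[] args) {
        System.out.println(confidence(0, 0.0)); // top result, not yet validated -> 0.5
        System.out.println(confidence(4, 0.8)); // 0.5 * 0.875 + 0.5 * 0.8 = 0.8375
    }
}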

Video metadata:

• video ID

• name: the name of the video file

• path: the path of the video file on the local file system

• URI (on the Data Service)

Key-frame metadata:

• key-frame ID

• video ID

• number: the number of the key-frame in the sequence of the video frames

• time instant: the time instant at which the key-frame appears in the video

• image URI (on the Data Service)

• image local URI (on the file system)

• SIFT descriptors local URI: the path to the SIFT descriptors file on the local file system

Match metadata:

• key-frame ID

• logo ID

• bounding-box coordinates

Events

In order to display the result changes along the time dimension we need to track the events that occur during the video processing and logo processing phases, as well as the events that depend on crowd interactions. Crowd related events (e.g., validation of a logo image) might cause updates (i.e., change of the status) on the index, therefore on the final result set.

Page 10: CUbRIK R1 Multimodal Content Analysis & Enrich

CUbRIK R1 Pipelines for Multimodal Content Analysis & Enrichment Page 6 D5.1 Version 1.0

When these updates occur, we do not want to lose the previous status of the result set; rather, we want to “append” the new status to the prior version of the result set.

The entities that may be affected by events are:

• logo images: one of the top-32 Google Images for a brand.

• matches: a match is the relationship between a key-frame and a logo instance when the SIFT matcher confidence is greater than 0.

The list of the system events to be tracked comprises:

• new processed video: SIFT descriptors for all video frames are computed and frames are uploaded to the Data Service

• new user query: a new brand is searched by the end user and its logo images are sent to the crowd in order to be validated

• validated logo: a logo image for a brand is validated by the crowd and its confidence is sent back to the system

• new added logo image: a new logo image for a brand is sent to the system by the crowd, in order to be processed. The initial value of the logo confidence, before the crowd validation, is proportional to its Google Images rank in the top-32 logo images.

• processed logo image: SIFT descriptors for a set of logo images are computed, logo images are uploaded on the Data Service, and logo metadata are added to the logo index

• found matches: a set of relevant matches for a new frame, or a new logo image, is found. The initial value of the match confidence is computed by the matching component.

• validated matches: a set of matches for a new frame, or a new logo image, is validated by the crowd.

When a new event occurs, the following information is stored in an internal data structure:

• event ID

• timestamp

• description

• confidence
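A minimal in-memory sketch of such an event record; the field names follow the list above, while the class name and the ordering helper are illustrative assumptions.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SystemEvent {
    final String eventId;
    final long timestamp;    // milliseconds, as used by the storyboard timeline
    final String description;
    final Double confidence; // null until the confidence is computed

    SystemEvent(String eventId, long timestamp, String description, Double confidence) {
        this.eventId = eventId;
        this.timestamp = timestamp;
        this.description = description;
        this.confidence = confidence;
    }

    // Events are replayed in chronological order, so each new status is
    // "appended" to the prior version of the result set rather than replacing it.
    static List<SystemEvent> ordered(List<SystemEvent> events) {
        List<SystemEvent> sorted = new ArrayList<>(events);
        sorted.sort(Comparator.comparingLong(e -> e.timestamp));
        return sorted;
    }
}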

Figure 2: Diagram of the data model adopted for the “human-enhanced logo detection pipeline”


For more details about the CUbRIK data model, see Section 1 of deliverable D2.1 “Data Models – Human-enhanced time-aware multimedia search”.

2.2 Data Service

The Data Service is a Web service that allows resources (i.e., videos, key-frames and logo images) to be stored persistently. Once a new resource is uploaded, the service returns a unique identifier, the URI. By querying the Data Service with a URI, the related resource can be retrieved.

Code Example: uploading function for a video file

public String upload(String fileURL, String fileName, String fileType,
                     String serverAddress, String user_ID, String password,
                     String owner) throws IOException {
    // Initialize the SSL context (helper defined elsewhere in the class)
    initSSLContent();

    URL url = new URL(serverAddress);
    ClientHttpRequest ap =
            new ClientHttpRequest((HttpsURLConnection) url.openConnection());

    // If the path points to an existing local file, stream it;
    // otherwise pass the string on as a remote URL.
    File file = new File(fileURL);
    if (file.exists()) {
        InputStream fis = new FileInputStream(fileURL);
        String nameOfFile = file.getName();
        ap.setParameter("source_file", nameOfFile, fis);
    } else {
        ap.setParameter("source_url", fileURL);
    }
    ap.setParameter("source_type", "video");
    ap.setParameter("source_name", fileName);

    // Submit via POST and read the response
    InputStream ris = ap.post();

    HashMap<String, String> attributes = new HashMap<String, String>();
    attributes.put("url", null);
    parser.setAttributes(attributes); // 'parser' is a field of the enclosing class
    parser.parse(ris);

    // The Data Service answers with the URI of the stored resource
    String fileURI = parser.getResponseAttributes().get(0).get("url");
    return fileURI;
}
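A hypothetical invocation of the function above; the server address, credentials and file path are placeholders, and fileType, user_ID, password and owner are accepted but not used in the excerpt.

// Hypothetical call with placeholder values
String videoUri = upload(
        "/data/videos/Shelf_1.avi",               // existing local file: streamed as source_file
        "Shelf_1",                                // name stored with the resource
        "video",
        "https://dataservice.example.org/upload", // placeholder Data Service endpoint
        "demo_user", "demo_password", "POLMI");
System.out.println("Resource stored at: " + videoUri);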


2.3 Jobs

2.3.1 Logo retrieval

Starting from a list of brand names (off-line) or from a single brand name (at query time, when a new brand is searched by the end user), the system queries Google Images in order to retrieve the top-32 logo images for each brand name.

Then the logo images (with their URLs and Google rank positions) are uploaded to the Data Service, and a unique identifier is attached to the logo image metadata. The logo images are forwarded to the Logo Processing job. The Logo Retrieval job also starts the crowd validation by creating a new task on the Conflict Resolution Manager and populating it with the logo images to be validated.

Figure 3: Sequence Diagram of the Logo Retrieval Job for the “human-enhanced logo detection pipeline”


Code Example: workflow description for the LogoRetrieval workflow

{
  "name": "LogoInstancesRetrievalWorkflow",
  "modes": [
    "standard"
  ],
  "startAction": {
    "worker": "bulkbuilder",
    "output": {
      "insertedRecords": "importBucket"
    }
  },
  "actions": [
    {
      "worker": "pipeletProcessor",
      "parameters": {
        "pipeletName": "eu.CUbRIKprj.pipelet.polmi.RetrieveLogoInstance.RetrieveLogoInstancesFromGooglePipelet",
        "googleContribution": "0.5",
        "crowdContribution": "0.5"
      },
      "input": {
        "input": "importBucket"
      },
      "output": {
        "output": "logoURLsBucket"
      }
    },
    {
      "worker": "SubmitTaskToCrowdWorker",
      "input": {
        "input": "logoURLsBucket"
      }
    }
  ]
}


2.3.2 Logo Processing

For each logo URL received as input, the Logo Processing job fetches the original file and saves it in the Data Service. The job then computes the SIFT descriptors, and the result is stored in a logo index, in order to keep track of which brands have already been indexed.

The logo description is then updated with the local URI (the URI of the image on the file system), the SIFT descriptors local URI, and the updated list of events.

Logo images and their metadata are then forwarded to the matching pipeline.

Figure 4: Sequence Diagram of the Logo Processing Job for the “human-enhanced logo detection pipeline”

2.3.3 Video processing

The videos to be indexed are stored in a directory on the file system, and the system retrieves them using a file system crawler. Once a new video is fed into the system, a pipeline converts it from .avi format into HTML5-compliant formats (.ogv, .mp4, .webm) and then uploads all formats to the Data Service.

The system then identifies the key-frames within each video. For each detected key-frame, the SIFT descriptors are computed and stored locally on the file system.

Finally, each key-frame is uploaded to the Data Service and obtains a unique identifier, which can later be used to retrieve the key-frame. The SIFT descriptors computed for all the key-frames are also stored in the Data Service. All key-frames are then sent to the Matching job.


Figure 5: Sequence Diagram of the Video Processing Job for the “human-enhanced logo detection pipeline”

Code Example: File Crawling Workflow

{
  "name": "fileCrawling",
  "modes": [
    "runOnce"
  ],
  "startAction": {
    "worker": "fileCrawler",
    "input": {
      "directoriesToCrawl": "dirsToCrawlBucket"
    },
    "output": {
      "directoriesToCrawl": "dirsToCrawlBucket",
      "filesToCrawl": "filesToCrawlBucket"
    }
  },
  "actions": [
    {
      "worker": "deltaChecker",
      "input": {
        "recordsToCheck": "filesToCrawlBucket"
      },
      "output": {
        "updatedRecords": "filesToFetchBucket"
      }
    },
    {
      "worker": "fileFetcher",
      "input": {
        "filesToFetch": "filesToFetchBucket"
      },
      "output": {
        "files": "filesToPushBucket"
      }
    },
    {
      "worker": "updatePusher",
      "input": {
        "recordsToPush": "filesToPushBucket"
      }
    }
  ]
}

Code Example: Video Processing Workflow

{
  "name": "VideoProcessingWorkflow",
  "modes": [
    "standard"
  ],
  "parameters": {
    "pipelineRunBulkSize": "20"
  },
  "startAction": {
    "worker": "bulkbuilder",
    "output": {
      "insertedRecords": "importBucket"
    }
  },
  "actions": [
    {
      "worker": "pipelineProcessor",
      "parameters": {
        "pipelineName": "VideoProcessingPipeline"
      },
      "input": {
        "input": "importBucket"
      },
      "output": {
        "output": "imagesBucket"
      }
    }
  ]
}

2.3.4 Key-frame/logo matching

Anytime a new logo or a new key-frame is processed, the matching job is activated. Each new key-frame is matched against all processed logo images; the SIFT descriptors of key-frames and logos are compared in order to compute their similarity.

The list of processed logo images, as well as the list of processed key-frames, can be retrieved from the Data Service.

Similarly, when a new logo image is fed to the matching job, it is matched against all the processed key-frames.

When a match is found, it results in a numerical match confidence value (between 0 and 1) and the coordinates of the bounding box of the logo image within the key-frame.
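A hedged sketch of how such a similarity could be computed with Lowe’s ratio test over SIFT descriptors, assuming descriptors are available as float[128] arrays. The actual pipeline uses OpenCV/SIFT plus a custom spatial verification step (see Section 1); the confidence normalization below is illustrative only, and the bounding-box estimation (e.g., via a fitted homography) is omitted.

static double matchConfidence(float[][] logoDescriptors, float[][] frameDescriptors) {
    if (logoDescriptors.length == 0 || frameDescriptors.length < 2) {
        return 0.0;
    }
    int good = 0;
    for (float[] q : logoDescriptors) {
        double best = Double.MAX_VALUE;
        double second = Double.MAX_VALUE;
        for (float[] t : frameDescriptors) {
            double d = 0.0;
            for (int i = 0; i < q.length; i++) {
                double diff = q[i] - t[i];
                d += diff * diff;
            }
            if (d < best) { second = best; best = d; }
            else if (d < second) { second = d; }
        }
        // Lowe's ratio test on squared distances (0.75^2 = 0.5625)
        if (best < 0.5625 * second) {
            good++;
        }
    }
    // Fraction of logo descriptors with a reliable nearest neighbour, in [0, 1]
    return (double) good / logoDescriptors.length;
}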

Once the match confidence value is computed, matches are distinguished into:

• high confidence matches: the match confidence is greater than or equal to the high confidence threshold of the brand of the logo instance

• low confidence matches: the match confidence is smaller than the high confidence threshold of the brand of the logo instance, but greater than the low confidence threshold

• junk matches: the match confidence is smaller than or equal to the low confidence threshold of the brand of the logo instance

Only low confidence matches are sent to the crowd to be validated.
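As a minimal illustration of this decision rule: the two thresholds are the per-brand values stored in the brand metadata (Section 2.1); class and method names are otherwise illustrative.

enum MatchClass { HIGH_CONFIDENCE, LOW_CONFIDENCE, JUNK }

static MatchClass classify(double matchConfidence,
                           double highConfidenceThreshold,
                           double lowConfidenceThreshold) {
    if (matchConfidence >= highConfidenceThreshold) {
        return MatchClass.HIGH_CONFIDENCE;
    }
    if (matchConfidence > lowConfidenceThreshold) {
        return MatchClass.LOW_CONFIDENCE; // the only class sent to crowd validation
    }
    return MatchClass.JUNK;
}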


Figure 6: Sequence Diagram of the Matching Job for the “human-enhanced logo detection pipeline”

2.3.5 Logo validation

The first task performed by the crowd is the validation of the logo images related to a brand name. The crowd is asked to vote for the most relevant logo images for that brand name. User responses are aggregated in order to compute a normalized “logo confidence” (e.g., between 0 and 1), proportional to the number of votes collected by each logo image.
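A possible aggregation, sketched below under the assumption that each logo image’s confidence is its share of the collected votes; the exact normalization applied by the Conflict Resolution Manager is not detailed in this deliverable.

import java.util.HashMap;
import java.util.Map;

static Map<String, Double> logoConfidences(Map<String, Integer> votesPerLogo) {
    int totalVotes = 0;
    for (int v : votesPerLogo.values()) {
        totalVotes += v;
    }
    Map<String, Double> confidences = new HashMap<>();
    for (Map.Entry<String, Integer> e : votesPerLogo.entrySet()) {
        // Confidence in [0, 1], proportional to the votes the image collected
        confidences.put(e.getKey(),
                totalVotes == 0 ? 0.0 : (double) e.getValue() / totalVotes);
    }
    return confidences;
}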

Human computation results are sent back to the system with a REST invocation to the Logo Index Update job, and the corresponding logo metadata are updated.

Example of Json Input for the REST invocation:

[
  {
    "ID": "id1",
    "URL": "http://googleimages/myfile.jpeg",
    "confidence": 0.6
  },
  {
    "ID": "id2",
    "URL": "http://googleimages/myfile2.jpeg",
    "confidence": 0.8
  }
]

Once all the logos have been validated, the job starts a new task on the crowd to provide new logo images for the brand. The result of this task is a set of URLs, sent by the Conflict Resolution Manager to the Logo Processing job with a REST invocation. Such logo instances are given the highest confidence (e.g., 1).

Figure 7: Sequence Diagram of the Logo Validation Job for the “human-enhanced logo detection pipeline”

2.3.6 Match validation

The task performed by the crowd is to vote on a set of key-frames that may contain a logo image. The user classifies the images as either Good (i.e., relevant) or Bad (i.e., not relevant). For each match, crowd judgments are aggregated and a majority-voting rule is applied. Crowd judgments (Good/Bad match) are sent back to the Match Validation job in order to update the match metadata.
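A minimal sketch of the majority-voting rule for a single match: each judgment is true (Good/relevant) or false (Bad/not relevant). The tie-breaking policy is an assumption, not specified in the deliverable.

import java.util.List;

static boolean isRelevant(List<Boolean> judgments) {
    long good = judgments.stream().filter(j -> j).count();
    return good > judgments.size() - good; // strict majority of Good votes
}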


2.4 Search Servlet

The architecture exposes a Java servlet to perform logo search.

Request

The user query is sent to a servlet, /SMILA/logodetectionrecordsearch, through an AJAX POST request. The request has a single parameter, “brandName”, which represents the brand name used to query the Solr index. Here is the AJAX snippet:

var brandName = document.getElementById("brandName");
var xmlhttp = new XMLHttpRequest(); // declaration missing in the original excerpt
xmlhttp.open("POST", "/SMILA/logodetectionrecordsearch", true);
xmlhttp.onreadystatechange = handleServerResponse; // callback defined elsewhere
xmlhttp.setRequestHeader('Content-Type', 'application/x-www-form-urlencoded');
xmlhttp.send("brandName=" + brandName.value); // posting brandName to the servlet

Response

Once the request has been sent, the system returns a JSON response. The outcome of the request is described by the parameter “outcome”.

If outcome is equal to “brand not found”, the queried brand name has not been processed yet; a dedicated result page is shown to the user, including a button that allows the user to start the processing of the brand name.

If the outcome is equal to “no matches found”, a simple message is displayed.

Finally, if outcome is equal to “matches found”, the JSON includes the list of all the matches for the query brand name and some parameters for the UI, such as the high confidence threshold and the low confidence threshold used to arrange the matches into the right group (i.e. “Good”, “Uncertain” and “Rejected”), as described in the specification of the logo detection demo. Another parameter included in the JSON response is the default value of the logos’ rank, “k”.
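A hedged sketch of the outcome selection on the servlet side; logoIndexed and matchCount stand in for the actual SMILA/Solr lookups, and a real “matches found” response would also carry the thresholds, the default k and the match records.

import java.io.IOException;
import javax.servlet.http.HttpServletResponse;

static void writeOutcome(HttpServletResponse resp, boolean logoIndexed, int matchCount)
        throws IOException {
    String outcome;
    if (!logoIndexed) {
        outcome = "brand not found";   // UI offers a button to start indexing the brand
    } else if (matchCount == 0) {
        outcome = "no matches found";  // UI displays a simple message
    } else {
        outcome = "matches found";     // UI arranges matches into Good/Uncertain/Rejected
    }
    resp.setContentType("application/json");
    resp.getWriter().write("{\"outcome\": \"" + outcome + "\"}");
}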

The result JSON also includes some other parameters, which are default response attributes of the SMILA search servlet:

• recordid: the id of the search

• query: the query sent by the user;

• count: the number of matches found;

• maxcount: the maximum number of matches that can be retrieved;

• runtime: the elapsed time during the search process;

• offset: the number of matches which, starting from the top, should be skipped in the search result;

• pipeline: the name of the SMILA pipeline that produced the results;

• resultAttributes: the list of attributes to be retrieved from each match;

Result matches are listed in a sequence element called “records” (the name “record” is inherited from the SMILA naming convention); each match record is a set of key-value pairs.

A match record has the following metadata:

• matchID: the unique identifier of the match (up to now, simply the logoURI appended to the frameURI);

• frameURI: the URI where the frame image can be retrieved;

• frameInstant: the instant in which the frame appears in the video;

• videoURI: the URI where the video, the keyframe belongs to, can be retrieved;

• videoOgv: the video in .ogv format;

• videoMp4: the video in .mp4 format;


• videoWebm: the video in .webm format;

• videoTitle: the title of the video;

• videoPreview: the URI of the frame to be used as video preview image;

• logoURI: the URI where the logo image can be retrieved;

• logoName: the name of the logo instance;

• boundingBoxCoordinates: the four coordinates (xMin, xMax, yMin, yMax) of the bounding box to be superimposed on the frame image; it should surround the logo instance within the frame;

• events: the list of events related to the match or to its dependencies (video, frame and logo). Each event is a map itself, including the following metadata:

o ID: the unique identifier of the event;
o timestamp: the millisecond in which the event occurred;
o type: the type of the event. An event can be of one of these types: “New searched brand”, “Processed video”, “Processed logos”, “Validated logo”, “Found matches”, “Validated match”;
o target: the target entity of the event (i.e., video, logo, brand or match);
o description: a textual description of the event;
o matchEvent: this boolean metadata allows us to distinguish between events that are navigable on the storyboard timeline and events that are not;
o matchConfidence: the confidence of the match. This metadata is present in all the match events subsequent to the “Found matches” event. The field is void until the match confidence is computed;
o logoConfidence: the confidence of the logo related to the match. The logo confidence is computed according to the position of the logo instance in the top-32 Google Images results and the crowd judgments on the logo. This metadata is present in all the match events subsequent to the “New searched brand” event. The field is void until the confidence is computed;
o isRelevant: the crowd judgment on the match (TRUE/FALSE). This metadata is set for all the match events subsequent to the “Match judged by the crowd” event. The field is void until the crowd judgment on the match is received.


3. User Interface

The starting page includes a text-based search form, to cover the case of keyword-based queries.

If the brand name searched by the user is not associated with any of the logos in the logo index, the user is asked whether to start the indexing of the new brand by clicking a confirmation button. If the user confirms, the system notifies the user that the process takes a while to complete, displaying a message on the page (Figure 8).

Figure 8: User Interface for missing logo

Otherwise, if the searched brand has already been indexed, the corresponding matches are shown in the result page (Figure 9).

The results page is composed of the following sections:

1. search form: the user can resubmit a new query.

2. statistics: summarizing some statistics about the search process (e.g., the number of times a brand has been found in the video collection).

3. logos: this section contains the top-k logo instances for the query brand. The logos are ranked according to their logo confidence. The value of k can be modified by changing the value of a numeric stepper.

4. matches: this section includes the actual results of the user query. The videos in which the top-k logo instances for the query brand have been detected are listed here. For each video, the set of key-frames in which a match has been found is shown. Key-frames are divided into three groups (high confidence, low confidence and junk) according to their score, the high confidence threshold, the low confidence threshold and the crowd judgment. A match is classified as “Good” if its score is above the high confidence threshold for the query brand or the crowd has judged it as “Relevant”. A match is classified as “Bad” if its score is between the low confidence threshold and the high confidence threshold for the query brand or the crowd has judged it as “Not relevant”. A match is classified as “Junk” if its score is below the low confidence threshold for the query brand. Junk matches are not displayed in the result page, but a link (e.g., the waste bin icon) is provided in order to show them on a different page. When the mouse pointer passes over a frame image, the contained logo instance should be highlighted. Similarly, when the pointer passes over a logo instance, the set of key-frames that contain that instance should be highlighted. When the user clicks on the time instant reported under the key-frame image, the video starts playing from that time instant (see cuepoint.org). When the user clicks on the key-frame image, a magnified version of the key-frame is shown in a superimposed frame, also showing the bounding box shape around the logo instance.

5. storyboard: a set of related events is associated to each result match (e.g. new added video, new indexed logos, new found matches, etc.). From the entire result set we can retrieve all the events that are relevant for the query. The timestamp of each event is known, therefore the events can be ordered on a timeline. For each timestamp, matches are displayed in the status (i.e., logo confidence and match confidence) they had at that timestamp; if a match has no event for that particular timestamp, the status associated with it is the one the match had at the greatest of the previous timestamps (see the lookup sketch below). The timeline is also enriched with “milestone” timestamps, which correspond to the main system events (e.g., a new added video). Thus, the user can navigate through the result formation history using the timeline. Timestamps at which there are updates in the result set are highlighted with a cue on which the user can click in order to move to that timestamp; on the contrary, “milestone” timestamps cannot be navigated by the users. The set of matches shown to the user in the “matches” section is the one corresponding to the latest status, and the corresponding cue on the timeline is highlighted (for instance colored in green as in the mockup).
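A sketch of the status lookup described in point 5: the status of a match at timeline position t is the one recorded at the greatest event timestamp not exceeding t (TreeMap.floorEntry). MatchStatus is an illustrative placeholder for the pair of logo confidence and match confidence.

import java.util.Map;
import java.util.TreeMap;

class MatchStatus {
    final double logoConfidence;
    final double matchConfidence;
    MatchStatus(double logoConfidence, double matchConfidence) {
        this.logoConfidence = logoConfidence;
        this.matchConfidence = matchConfidence;
    }
}

static MatchStatus statusAt(TreeMap<Long, MatchStatus> statusByTimestamp, long t) {
    Map.Entry<Long, MatchStatus> e = statusByTimestamp.floorEntry(t);
    return e == null ? null : e.getValue(); // null: the match did not exist yet at t
}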


Figure 9: Results User Interface


3.1 Running example

At system startup, there are no indexed videos and no indexed logos. At a certain time instant, t0 (20:35), a new video, Shelf_1, is fed into the system and indexed. After a while, at instant t1 (20:39), a user submits a query, e.g., the brand hefty. At instant t2 (20:40), the set of logos for the brand hefty is indexed. At instant t3 (20:41, Figure 10), the matches between Shelf_1 frames and the logos for brand hefty are computed. Instant t3 is the first navigable cue in the storyboard.

Figure 10: New match found User Interface


By changing the value of the stepper, we can change the maximum length of the logos list, filtering out or adding matches to the result.

At instant t4 (22:14, Figure 11), the system receives the crowd response on the logo validation. Logo HEFTY_LOGO reaches the top of the logos list.

Figure 11: User Interface after a logo validation


At instant t5 (23:14), a new video, Shelf_3, is fed into the system.

At instant t6 (22:22, Figure 12), the system receives the crowd response on the match validation; thus one of the matches previously classified as Bad is now in the “Good matches” group.

Figure 12: User Interface after a match validation


Finally, at instant t7 (23:14, Figure 13) the matches found in Shelf_3 frames are added to the index and displayed in the result page.

Notice that the addition of a new video may produce new matches between its key-frames and logo instances that did not match any of the key-frames of the previous videos. This means that the addition of a new video may result in changes in the top-k list of logos.

Figure 13: New video found User Interface


By clicking on a key-frame image (Figure 14), a superimposed frame is shown. The frame contains the key-frame image and a bounding box over the area around the logo.

Figure 14: Match details User Interface