the browser evaluation test a proposal pierre wellner, mike flynn idiap, september 2003

The Browser Evaluation Test: A Proposal

Pierre Wellner, Mike Flynn
IDIAP, September 2003

Ricoh “MuVIE”, Lee et al.
Video editing, key frames, transcript search, embedded web browser, slides, whiteboard, minutes, perspective & panoramic views, speaker location, visual & audio activity

NOT TESTED ON HUMANS

Microsoft “Distributed Meetings”, Cutler et al.
Panoramic video, person-tracking, audio source localisation & beam-forming, speaker clustering & change, whiteboard camera, PC capture

SUBJECTIVELY TESTED

The Problem

• No assessment, or...
• Assessed by unique scheme
• Often very subjective

[from Cutler et al, “Distributed Meetings: A Meeting Capture and Broadcasting System”, ACM Multimedia, 2002]

– “I was able to get the information I needed […]”
– “I would use this system again if I had to miss a meeting.”
– “I would recommend the use of this system to my peers.”

• No standard browsing task

→ Objective comparison not possible ←

Aims of the BET

• Performance, not judgment
• Independent of experimenter perception
• Directly comparable numeric scores
• Replicable

The Browsing Task

Find a maximum number of observations of interest in a minimum amount of time.

But what is an “observation of interest”?

BET Overview

[Diagram. Labels: meeting participants, recording system, corpus, playback system, observers, observations, sampling, test, media browser, subjects, answers, scoring, scores]

People

• Participants

• Observers
  – Observer selection
  – Many diverse interests
  – Interesting for participants or absentees?

• Subjects
  – Subject selection

Data

• Corpus
  – Discussion, Presentation, Decision, Status…
  – Normal meetings, if possible
  – Reflect common distribution

• Observations
  – Pairs of statements, one true, one false

Tests & Scores

• Test: sample of observations

• Subjects must decide on truth
  – using the browser

• Score is correct minus incorrect answers

• Control scores established:
  – Educated guesses, no media
  – Same software as observers
  – Well-known basic applications
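
The scoring rule above (correct minus incorrect answers) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `bet_score`, the encoding of answers as `True`/`False`/`None`, and the choice to give unanswered items zero points are all assumptions made here.

```python
from typing import Iterable, Optional


def bet_score(answers: Iterable[Optional[bool]]) -> int:
    """Return the number of correct answers minus the number of incorrect ones.

    Each item is True (answered correctly), False (answered incorrectly),
    or None (unanswered; assumed here to contribute nothing to the score).
    """
    score = 0
    for answer in answers:
        if answer is True:
            score += 1
        elif answer is False:
            score -= 1
    return score


# Hypothetical subject: 9 correct, 1 incorrect, 2 unanswered.
example = [True] * 9 + [False] + [None] * 2
print(bet_score(example))  # 8
```

Under this rule a subject who guesses at random between the two statements of each pair is expected to score about zero, which is one reason a no-media control score is a useful baseline.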

Illustration

• Corpus
  – 20 meetings @ ~40 minutes ≈ 13 hrs 20 mins of recordings

• Observations
  – 60 observers
  – 3 observers watch each meeting @ 18 observation-pairs/hour
  – 6× real-time ≈ 240 hours observation time
  – 216 observation-pairs/meeting, or 4,320 observation-pairs total

• Testing
  – 10 subjects each watch 8 meetings, in 2 hours 40 mins per subject
  – 4 subjects watch each meeting, 26 hours 40 mins total subject time
  – 1 answer per minute, 160 answers/subject ≈ 1,600 answers total

• Significance
  – Assume: binomial distribution of results, 90% answered correctly
  – Confidence interval: 88.2% to 91.6%, with 95% confidence level
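
The arithmetic behind the illustration can be checked with a short script. This is a sketch under stated assumptions: the variable names are invented here, each observer is assumed to watch exactly one meeting (which follows from 60 observers, 3 per meeting, 20 meetings), and the interval is computed with the simple normal (Wald) approximation to the binomial, since the slides do not say which method was used. With 1,600 answers and 90% correct, this approximation gives roughly 88.5% to 91.5%, close to but not identical to the quoted interval.

```python
import math

# Figures taken from the illustration.
meetings = 20
meeting_minutes = 40
observers_per_meeting = 3
pairs_per_observer_hour = 18
observation_slowdown = 6       # observers work at 6x real time
subjects = 10
meetings_per_subject = 8
subject_minutes = 160          # 2 hours 40 minutes per subject
answers_per_minute = 1

# Corpus size.
recording_minutes = meetings * meeting_minutes

# Observation effort (assumes each observer watches one meeting).
observers = meetings * observers_per_meeting
hours_per_observer = meeting_minutes * observation_slowdown / 60
observation_hours = observers * hours_per_observer
pairs_per_meeting = (observers_per_meeting * hours_per_observer
                     * pairs_per_observer_hour)
total_pairs = pairs_per_meeting * meetings

# Testing effort.
subjects_per_meeting = subjects * meetings_per_subject / meetings
total_subject_minutes = subjects * subject_minutes
total_answers = total_subject_minutes * answers_per_minute


def wald_interval(p: float, n: float, z: float = 1.96) -> tuple:
    """Normal-approximation confidence interval for a binomial proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = wald_interval(0.90, total_answers)

print(recording_minutes, observation_hours, total_pairs, total_answers)
print(f"{low:.1%} to {high:.1%}")
```

Other interval methods (for example Wilson or exact Clopper-Pearson) give slightly different bounds for the same inputs.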

Summary

• Performance, not judgment
  – Subjects are measured in performance of tasks

• Independent of experimenter perception
  – Observers indirectly decide the tasks

• Directly comparable numeric scores
  – Standard methods, standard scores

• Replicable
  – Publicly accessible Web-site
  – All media available for download
  – Tests and scoring on-line

Questions…?

• Is this a good method?

• Do you recognise the problem?

• Would you use this method?

• Do you have a browser to test?

• Do you know of an existing MM corpus?

• …

