The Browser Evaluation Test: A Proposal
Pierre Wellner, Mike Flynn
IDIAP, September 2003
Ricoh “MuVIE”, Lee et al.
Video editing, key frames, transcript search, embedded web browser, slides, whiteboard, minutes, perspective & panoramic views, speaker location, visual & audio activity
NOT TESTED ON HUMANS
Microsoft “Distributed Meetings”, Cutler et al.
Panoramic video, person-tracking, audio source localisation & beam-forming, speaker clustering & change, whiteboard camera, PC capture
SUBJECTIVELY TESTED
The Problem
• No assessment, or...
• Assessed by unique scheme
• Often very subjective
– “I was able to get the information I needed […]”
– “I would use this system again if I had to miss a meeting.”
– “I would recommend the use of this system to my peers.”
[from Cutler et al., “Distributed Meetings: A Meeting Capture and Broadcasting System”, ACM Multimedia, 2002]
• No standard Browsing task
→ Objective comparison not possible ←
Aims of the BET
• Performance, not judgment
• Independent of experimenter perception
• Directly comparable numeric scores
• Replicable
The Browsing Task
Find a maximum number of
observations of interest
in a minimum amount of time.
But what is an “observation of interest”?
BET Overview
[Diagram. Labels: meeting participants, recording system, corpus, playback system, observers, observations, sampling, test, subjects, media browser, answers, scoring, scores]
People
• Participants
• Observers
– Observer selection
– Many diverse interests
– Interesting for participants or absentees?
• Subjects
– Subject selection
Data
• Corpus
– Discussion, Presentation, Decision, Status…
– Normal meetings, if possible
– Reflect common distribution
• Observations
– Pairs of statements, one true, one false
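One way to picture an observation pair is the minimal Python sketch below. The class, field, and method names (`ObservationPair`, `true_statement`, `false_statement`, `presented`) and the example statements are hypothetical, not taken from the BET itself. The only idea carried over from the slide is that each observation is a pair of statements of which exactly one is true.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ObservationPair:
    """One observation: two statements about a meeting, exactly one of them true."""
    meeting_id: str
    true_statement: str
    false_statement: str

    def presented(self, rng: random.Random) -> Tuple[List[str], int]:
        """Return both statements in random order, plus the index of the true one."""
        statements = [self.true_statement, self.false_statement]
        rng.shuffle(statements)
        return statements, statements.index(self.true_statement)


# Made-up example data, for illustration only.
pair = ObservationPair(
    meeting_id="m01",
    true_statement="The budget was approved.",
    false_statement="The budget was rejected.",
)
shown, true_index = pair.presented(random.Random(0))
```

Shuffling the order at presentation time is an assumption of this sketch; it keeps the position of a statement from hinting at which one is true.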
Tests & Scores
• Test: sample of observations
• Subjects must decide on truth
– using the browser
• Score is correct minus incorrect answers
• Control scores established:
– Educated guesses, no media
– Same software as observers
– Well-known basic applications
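The sampling and scoring rules above can be sketched in a few lines of Python. This is an illustration under assumptions, not the authors' implementation: the function names `sample_test` and `bet_score`, the dictionary layout, and the treatment of unanswered observations as neutral are all hypothetical.

```python
import random


def sample_test(observation_ids, size, rng):
    """Draw a test: a random sample of observations, without replacement."""
    return rng.sample(list(observation_ids), size)


def bet_score(answers, truth):
    """Return the number of correct answers minus the number of incorrect ones.

    `answers` maps observation id -> the subject's choice.
    `truth` maps observation id -> the correct choice.
    Observations the subject did not answer are assumed to count for nothing.
    """
    correct = sum(1 for oid, choice in answers.items() if truth[oid] == choice)
    incorrect = len(answers) - correct
    return correct - incorrect


# Made-up data: four observations, the subject answers three of them.
truth = {"o1": "A", "o2": "B", "o3": "A", "o4": "B"}
answers = {"o1": "A", "o2": "B", "o3": "B"}
score = bet_score(answers, truth)  # 2 correct - 1 incorrect = 1
```

Under this rule a subject who guesses at random between the two statements of each pair has an expected score of zero, which gives the control scores a natural baseline.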
Illustration
• Corpus
– 20 meetings @ ~40 minutes ≈ 13 hrs 20 mins of recordings
• Observations
– 60 observers
– 3 observers watch each meeting @ 18 observation-pairs/hour
– 6 × real-time ≈ 240 hours observation time
– 216 observation-pairs/meeting, or 4,320 observation-pairs total
• Testing
– 10 subjects each watch 8 meetings, in 2 hours 40 mins per subject
– 4 subjects watch each meeting, 26 hours 40 mins total subject time
– 1 answer per minute, 160 answers/subject ≈ 1,600 answers total
• Significance
– Assume: binomial distribution of results, 90% answered correctly
– Confidence interval: 88.2% to 91.6%, with 95% confidence level
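The figures in this illustration can be re-derived with a short Python sketch. All variable names are hypothetical. The "6 real-time" figure is read as observers spending six times the meeting length, which is the reading that reproduces the 240 hours. The interval uses the normal approximation to the binomial, which is an assumption, since the slide does not say which interval method was used.

```python
import math

# Corpus and observation workload
meetings = 20
meeting_minutes = 40
observers_per_meeting = 3
pairs_per_observer_hour = 18
real_time_factor = 6  # assumed: each observer spends 6x the meeting length

corpus_minutes = meetings * meeting_minutes                    # 800 min = 13 h 20 min
observers = meetings * observers_per_meeting                   # 60
observer_hours_each = meeting_minutes * real_time_factor / 60  # 4 hours per observer
observation_hours = observers * observer_hours_each            # 240 hours
pairs_per_meeting = (
    observers_per_meeting * observer_hours_each * pairs_per_observer_hour
)                                                              # 216
total_pairs = pairs_per_meeting * meetings                     # 4,320

# Testing workload
subjects = 10
meetings_per_subject = 8
subject_minutes_each = 160                                     # 2 h 40 min
subjects_per_meeting = subjects * meetings_per_subject / meetings  # 4
total_subject_minutes = subjects * subject_minutes_each        # 1,600 min = 26 h 40 min
total_answers = total_subject_minutes * 1                      # 1 answer per minute


def normal_approx_interval(p, n, z=1.96):
    """Two-sided interval for a binomial proportion (normal approximation)."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = normal_approx_interval(0.90, total_answers)
print(f"{low:.1%} to {high:.1%}")
```

With 1,600 answers and 90% correct, the normal approximation gives roughly 88.5% to 91.5%, close to but slightly narrower than the interval quoted on the slide.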
Summary
• Performance, not judgment
– Subjects are measured in performance of tasks
• Independent of experimenter perception
– Observers indirectly decide the tasks
• Directly comparable numeric scores
– Standard methods, standard scores
• Replicable
– Publicly accessible Web-site
– All media available for download
– Tests and scoring on-line
Questions…?
• Is this a good method?
• Do you recognise the problem?
• Would you use this method?
• Do you have a browser to test?
• Do you know of an existing MM corpus?
• …