The Browser Evaluation Test A Proposal Pierre Wellner, Mike Flynn IDIAP, September 2003

Page 1

The Browser Evaluation Test: A Proposal

Pierre Wellner, Mike Flynn
IDIAP, September 2003

Page 2

Ricoh “MuVIE”, Lee et al
Video editing, key frames, transcript search, embedded web browser, slides, whiteboard, minutes, perspective & panoramic views, speaker location, visual & audio activity

NOT TESTED ON HUMANS

Microsoft “Distributed Meetings”, Cutler et al
Panoramic video, person-tracking, audio source localisation & beam-forming, speaker clustering & change, whiteboard camera, PC capture

SUBJECTIVELY TESTED

Page 3

The Problem

• No assessment, or...
• Assessed by unique scheme
• Often very subjective
  [from Cutler et al, “Distributed Meetings: A Meeting Capture and Broadcasting System”, ACM Multimedia, 2002]
  – “I was able to get the information I needed […]”
  – “I would use this system again if I had to miss a meeting.”
  – “I would recommend the use of this system to my peers.”
• No standard Browsing task

→ Objective comparison not possible ←

Page 4

Aims of the BET

• Performance, not judgment
• Independent of experimenter perception
• Directly comparable numeric scores
• Replicable

Page 5

The Browsing Task

Find a maximum number of observations of interest in a minimum amount of time.

But what is an “observation of interest”?

Page 6

BET Overview

[Diagram. Labels: meeting participants, recording system, corpus, playback system, observers, observations, sampling, test, subjects, media browser, answers, scoring, scores]

Page 7

People

• Participants

• Observers
  – Observer selection
  – Many diverse interests
  – Interesting for participants or absentees?

• Subjects
  – Subject selection

Page 8

Data

• Corpus
  – Discussion, Presentation, Decision, Status…
  – Normal meetings, if possible
  – Reflect common distribution

• Observations
  – Pairs of statements, one true, one false

Page 9

Tests & Scores

• Test: sample of observations

• Subjects must decide on truth
  – using the browser

• Score is correct minus incorrect answers

• Control scores established:
  – Educated guesses, no media
  – Same software as observers
  – Well-known basic applications
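The test and scoring rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `ObservationPair`, `sample_test` and `bet_score` are hypothetical, the test is drawn by simple random sampling, and a pair the subject never reaches counts as neither correct nor incorrect. The slides do not specify any of these details.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ObservationPair:
    """One observation: a true statement and its false counterpart."""
    true_statement: str
    false_statement: str


def sample_test(pairs, k, seed=None):
    """Draw k observation pairs at random to form one test (assumed sampling scheme)."""
    rng = random.Random(seed)
    return rng.sample(pairs, k)


def bet_score(test, answers):
    """Return correct minus incorrect answers.

    `answers` maps a pair to the statement the subject judged to be true.
    Pairs missing from `answers` are treated as unanswered (an assumption)
    and affect neither count.
    """
    correct = sum(1 for p in test if answers.get(p) == p.true_statement)
    incorrect = sum(
        1 for p in test if p in answers and answers[p] != p.true_statement
    )
    return correct - incorrect


# Placeholder statements stand in for real observations.
pairs = [ObservationPair(f"true {i}", f"false {i}") for i in range(10)]
test = sample_test(pairs, k=4, seed=0)
answers = {
    test[0]: test[0].true_statement,   # correct
    test[1]: test[1].true_statement,   # correct
    test[2]: test[2].false_statement,  # incorrect
    # test[3] left unanswered
}
print(bet_score(test, answers))  # 2 correct - 1 incorrect = 1
```

With two statements per pair, pure guessing has an expected score of zero under this rule, which gives the control conditions a natural baseline to compare against.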

Page 10

Illustration

• Corpus
  20 meetings @ ~40 minutes ≈ 13 hrs 20 mins of recordings

• Observations
  60 observers
  3 observers watch each meeting @ 18 observation-pairs/hour
  6 × real-time ≈ 240 hours observation time
  216 observation-pairs/meeting, or 4,320 observation-pairs total

• Testing
  10 subjects each watch 8 meetings, in 2 hours 40 mins per subject
  4 subjects watch each meeting, 26 hours 40 mins total subject time
  1 answer per minute, 160 answers/subject ≈ 1,600 answers total

• Significance
  Assume: binomial distribution of results, 90% answered correctly
  Confidence interval: 88.2% to 91.6%, with 95% confidence level
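The figures in this illustration can be re-derived with a short script. It is a sketch, not the authors' calculation: the variable names are made up, and the factor of 6 on the observation line is read as six times real time, an assumption that the slide's totals are consistent with. The slide does not say which interval method was used for the significance estimate. The normal approximation used here gives roughly 88.5% to 91.5%, slightly narrower than the quoted 88.2% to 91.6%.

```python
import math

# Inputs taken from the slide.
MEETINGS = 20
MEETING_MINUTES = 40
OBSERVERS_PER_MEETING = 3
PAIRS_PER_OBSERVER_HOUR = 18
OBSERVER_SLOWDOWN = 6        # assumed reading: observers take 6 x real time
SUBJECTS = 10
MEETINGS_PER_SUBJECT = 8
MINUTES_PER_SUBJECT = 160    # 2 hours 40 mins
ANSWERS_PER_MINUTE = 1

# Corpus
corpus_minutes = MEETINGS * MEETING_MINUTES                             # 800 = 13 h 20 min

# Observations
observers = MEETINGS * OBSERVERS_PER_MEETING                            # 60
observer_minutes_each = MEETING_MINUTES * OBSERVER_SLOWDOWN             # 240 min = 4 h
observation_hours = observers * observer_minutes_each // 60             # 240
pairs_per_observer = PAIRS_PER_OBSERVER_HOUR * observer_minutes_each // 60  # 72
pairs_per_meeting = pairs_per_observer * OBSERVERS_PER_MEETING          # 216
total_pairs = pairs_per_meeting * MEETINGS                              # 4320

# Testing
subjects_per_meeting = SUBJECTS * MEETINGS_PER_SUBJECT // MEETINGS      # 4
subject_minutes_total = SUBJECTS * MINUTES_PER_SUBJECT                  # 1600 = 26 h 40 min
total_answers = subject_minutes_total * ANSWERS_PER_MINUTE              # 1600


def normal_ci(p, n, z=1.96):
    """Normal-approximation confidence interval for a binomial proportion."""
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


low, high = normal_ci(0.90, total_answers)
print(f"{total_pairs} observation pairs, {total_answers} answers")
print(f"95% CI (normal approximation): {low:.1%} to {high:.1%}")
```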

Page 11

Summary

• Performance, not judgment
  – Subjects are measured in performance of tasks

• Independent of experimenter perception
  – Observers indirectly decide the tasks

• Directly comparable numeric scores
  – Standard methods, standard scores

• Replicable
  – Publicly accessible Web-site
  – All media available for download
  – Tests and scoring on-line

Page 12

Questions…?

• Is this a good method?

• Do you recognise the problem?

• Would you use this method?

• Do you have a browser to test?

• Do you know of an existing MM corpus?

• …