RADAR EVALUATION: Goals, Targets, Review & Discussion
Jaime Carbonell
& soon the full SRI/CMU/IET RADAR Team
1-February-2005
School of Computer Science, Carnegie Mellon University
Supported by DARPA IPTO
PAL Program: “Personalized Assistant That Learns”
Outline: Radar Evaluation
• Brief Review of Radar Challenge Task
• Evaluation Objectives: Obligation and Desiderata
• Evaluation Components: Radar Tasks
• Radar Metrics: Tasks → Meaningful Measures
• Putting it all together: Tin-man formula proposal
The original plan has been disrupted: Conference Wing A is no longer available, and other rooms may be affected.
The crisis resolver needs to replan: gather information, commandeer other rooms, change schedules, post to websites, and inform participants.
Test: Radar will assist a conference planner in a crisis situation.
The test will be evaluated on the quality and completeness of the new plan and on the successful completion of related tasks.
[Diagram: the Crisis Resolver works through RADAR, whose modules include NLP, Planning & Scheduling, a Handler, Learning, and a Knowledge Base; RADAR in turn connects to conference participants, the conference website, conference organizers, and rooms in Wings A and B.]
Conference Re-planning Tasks
• Situation Assessment
  – Which resources have become unavailable
  – What alternative resources exist, and at what price
• Tentative re-planning of conference schedule
  – Elicit and satisfy as many preferences as possible
• Validating conference schedule & resource allocation
  – Securing buy-in from key stakeholders (requires meeting)
  – Awaiting external confirmations (or default assumptions)
  – Modifying plan as/when needed
• Informing all stakeholders
  – Briefings to VIPs; update website for participants
• Cope with background tasks (time permitting)
Scoring Criteria (Adapted from Garvey)
• Task Realism
  – Must reflect RADAR challenge performance
• Sensitive to Learning
  – Must allow headroom beyond Y2 (no low ceiling)
  – Must include measurement of learning effects
• Auditable with Pride
  – Objective, simple, clear, transparent, statistically sound, replicable, …
• Comprehensive & Research-Useful
  – All RADAR modules included, albeit differentially
  – Responsive to RADAR scientific objectives
Evaluation Components
• All RADAR Modules
  – Time-Space Planning (TSP): schedule quality
  – Meeting Scheduling (CMRadar): meetings, bumps
  – Webmaster + Briefing Assistant (VIO)
  – Email + NLP: other tasks completed (background)
• Additional Learning Targets (?)
  – Relevant facts & preferences acquired
  – Strategic knowledge (when/how to apply K)
• Combination Function (Utility-like)
  – Linear weighted sum with +/− terms
Example: Schedule Quality Metric
$$\mathrm{Score}(\mathrm{conf}) \;=\; \sum_{S_j \in \text{sessions}} \;\sum_{f_{jk} \in \text{factors}(S_j)} w_{jk} \cdot \max\bigl(0,\; 1 - p(f_{jk})\bigr)$$

w = weight = importance of the session (e.g. keynote > posters)
p = penalty for distance from ideal (e.g. room smaller than target), linear or step fn
f = factors of sessions (e.g. room size, duration, equipment, …)
r = resource (e.g. ballroom at Flagstaff)

$$\mathrm{Score}_{\mathrm{cost}} \;=\; \sum_{r \in \text{oldSched}} \$(r) \;-\; \sum_{r \in \text{newSched}} \$(r)$$
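A minimal Python sketch of these two scores, assuming each session carries a list of weighted factors; the Factor structure, the linear_penalty helper, and all other names are illustrative assumptions, not part of RADAR:

```python
from dataclasses import dataclass

@dataclass
class Factor:
    """One factor f_jk of a session, e.g. room size, duration, equipment."""
    weight: float   # w_jk: importance of the session/factor (keynote > posters)
    penalty: float  # p(f_jk): distance from ideal; 0 means a perfect fit

def linear_penalty(actual: float, target: float) -> float:
    """Hypothetical linear penalty, e.g. a room smaller than its target:
    0 when the room fits, growing with the relative shortfall."""
    return max(0.0, (target - actual) / target)

def schedule_quality(sessions: list[list[Factor]]) -> float:
    """Score(conf) = sum over sessions S_j and factors f_jk of
    w_jk * max(0, 1 - p(f_jk))."""
    return sum(f.weight * max(0.0, 1.0 - f.penalty)
               for factors in sessions
               for f in factors)

def cost_score(old_sched_costs: list[float], new_sched_costs: list[float]) -> float:
    """Score_cost = total $ of resources in the old schedule minus the
    total in the new one (assumed sign: positive = new plan is cheaper)."""
    return sum(old_sched_costs) - sum(new_sched_costs)

# Example: a keynote needing a 200-seat room ends up in a 150-seat room.
keynote = [Factor(weight=10.0, penalty=linear_penalty(actual=150, target=200))]
posters = [Factor(weight=2.0, penalty=0.0)]   # posters fit their room exactly
print(schedule_quality([keynote, posters]))   # 10*(1-0.25) + 2*1.0 = 9.5
```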
Putting It All Together
Normalizing components:

$$\hat{S}_i \;=\; \frac{S_{i,\mathrm{RADAR}} - S_{i,\mathrm{MIN}}}{S_{i,\mathrm{MAX}} - S_{i,\mathrm{MIN}}} \times 100\%$$

Summing:

$$\hat{S}_{\mathrm{total}} \;=\; \frac{\sum_{i \in C} w_i\,\hat{S}_i}{\sum_{i \in C} w_i} \qquad\text{or}\qquad \hat{S}_{\mathrm{total}} \;=\; \sum_{i \in C} w_i\,\hat{S}_i$$
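A minimal sketch of this normalize-then-combine step in Python (names are illustrative; whether to divide by the weight total is exactly the "or" between the two summing formulas):

```python
def normalize(s_radar: float, s_min: float, s_max: float) -> float:
    """S_hat_i = (S_i,RADAR - S_i,MIN) / (S_i,MAX - S_i,MIN) * 100%."""
    return 100.0 * (s_radar - s_min) / (s_max - s_min)

def total_score(weights, scores, weighted_average=True):
    """Combine normalized component scores S_hat_i with weights w_i:
    either the weighted average (first formula) or the plain
    weighted sum (second formula)."""
    num = sum(w * s for w, s in zip(weights, scores))
    return num / sum(weights) if weighted_average else num

# Example: three components (schedule quality, meetings, briefings),
# each normalized to 0-100%, combined with weights summing to 1.
s_hat = [normalize(9.5, 0.0, 12.0), normalize(3, 0, 5), normalize(1, 0, 2)]
print(total_score([0.5, 0.3, 0.2], s_hat))  # one overall 0-100% score
```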
Next Steps for Evaluation Metrics
• Metrics for other components
• Metrics for Learning Boost
• Discuss/Refine/Redo Combination
  – True open-ended scale?
  – Something other than weighted sum?
  – Quality metric w/o penalties (+'s only)
• Test in a full walk-through scenario
  – Refine the details
  – Don't lose sight of objectives