RADAR EVALUATION: Goals, Targets, Review & Discussion
Jaime Carbonell
& soon the full SRI/CMU/IET RADAR Team
1-February-2005
School of Computer Science, Carnegie Mellon University
Supported by DARPA IPTO
PAL Program: “Personalized Assistant That Learns”
Outline: Radar Evaluation
• Brief Review of Radar Challenge Task
• Evaluation Objectives: Obligation and Desiderata
• Evaluation Components: Radar Tasks
• Radar Metrics: Tasks → Meaningful Measures
• Putting it all together: Tin-man formula proposal
The original plan has been disrupted: Conference Wing A is no longer available, and other rooms may be affected.
The crisis resolver needs to replan: gather information, commandeer other rooms, change schedules, post to websites, and inform participants.
Test: Radar will assist a conference planner in a crisis situation.
The test will be evaluated on the quality and completeness of the new plan and on the successful completion of related tasks.
[Diagram: the Crisis Resolver works through RADAR, whose modules include NLP, Planning & Scheduling, a Handler, Learning, and a Knowledge Base; RADAR in turn connects to conference participants, the conference website, conference organizers, and rooms in Wings A and B.]
Conference Re-planning Tasks
• Situation Assessment
  – Which resources have become unavailable
  – What alternative resources exist, and at what price
• Tentative re-planning of conference schedule
  – Elicit and satisfy as many preferences as possible
• Validating conference schedule & resource allocation
  – Securing buy-in from key stakeholders (requires meeting)
  – Awaiting external confirmations (or default assumptions)
  – Modifying plan as/when needed
• Informing all stakeholders
  – Briefings to VIPs; update website for participants
• Cope with background tasks (time permitting)
Scoring Criteria (Adapted from Garvey)
• Task Realism
  – Must reflect RADAR challenge performance
• Sensitive to Learning
  – Must allow headroom beyond Y2 (no low ceiling)
  – Must include measurement of learning effects
• Auditable with Pride
  – Objective, simple, clear, transparent, statistically sound, replicable, …
• Comprehensive & Research-Useful
  – All RADAR modules included, albeit differentially
  – Responsive to RADAR scientific objectives
Evaluation Components
• All RADAR Modules
  – Time-Space Planning (TSP): schedule quality
  – Meeting Scheduling (CMRadar): meetings, bumps
  – Webmaster + Briefing Assistant (VIO)
  – Email + NLP: other tasks completed (background)
• Additional Learning Targets (?)
  – Relevant facts & preferences acquired
  – Strategic knowledge (when/how to apply K)
• Combination Function (Utility-like)
  – Linear weighted sum with +/− terms
Example: Schedule Quality Metric
$$\mathrm{Score}(\mathrm{conf}) \;=\; \sum_{S_j \in \text{sessions}} \;\sum_{f_{jk} \in \text{factors}(S_j)} w_{jk} \cdot \max\bigl(0,\; 1 - p(f_{jk})\bigr)$$

w = weight = importance of the session (e.g. keynote > posters)
p = penalty for distance from ideal (e.g. room smaller than target), linear or step fn
f = factors of sessions (e.g. room size, duration, equipment, …)
r = resource (e.g. ballroom at Flagstaff)

$$\mathrm{Score}_{\mathrm{cost}} \;=\; \sum_{r \in \text{oldSched}} \$(r) \;-\; \sum_{r \in \text{newSched}} \$(r)$$
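A minimal Python sketch of these two scores, assuming each session carries a list of weighted factors; the Factor structure, the linear_penalty helper, and all other names are illustrative assumptions, not part of RADAR:

```python
from dataclasses import dataclass

@dataclass
class Factor:
    """One factor f_jk of a session, e.g. room size, duration, equipment."""
    weight: float   # w_jk: importance of the session/factor (keynote > posters)
    penalty: float  # p(f_jk): distance from ideal; 0 means a perfect fit

def linear_penalty(actual: float, target: float) -> float:
    """Hypothetical linear penalty, e.g. a room smaller than its target:
    0 when the room fits, growing with the relative shortfall."""
    return max(0.0, (target - actual) / target)

def schedule_quality(sessions: list[list[Factor]]) -> float:
    """Score(conf) = sum over sessions S_j and factors f_jk of
    w_jk * max(0, 1 - p(f_jk))."""
    return sum(f.weight * max(0.0, 1.0 - f.penalty)
               for factors in sessions
               for f in factors)

def cost_score(old_sched_costs: list[float], new_sched_costs: list[float]) -> float:
    """Score_cost = total $ of resources in the old schedule minus the
    total in the new one (assumed sign: positive = new plan is cheaper)."""
    return sum(old_sched_costs) - sum(new_sched_costs)

# Example: a keynote needing a 200-seat room ends up in a 150-seat room.
keynote = [Factor(weight=10.0, penalty=linear_penalty(actual=150, target=200))]
posters = [Factor(weight=2.0, penalty=0.0)]   # posters fit their room exactly
print(schedule_quality([keynote, posters]))   # 10*(1-0.25) + 2*1.0 = 9.5
```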
Putting It All Together
Normalizing components:

$$\hat{S}_i \;=\; \frac{S_{i,\mathrm{RADAR}} - S_{i,\mathrm{MIN}}}{S_{i,\mathrm{MAX}} - S_{i,\mathrm{MIN}}} \times 100\%$$

Summing:

$$\hat{S}_{\mathrm{total}} \;=\; \frac{\sum_{i \in C} w_i\,\hat{S}_i}{\sum_{i \in C} w_i} \qquad\text{or}\qquad \hat{S}_{\mathrm{total}} \;=\; \sum_{i \in C} w_i\,\hat{S}_i$$
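A minimal sketch of this normalize-then-combine step in Python (names are illustrative; whether to divide by the weight total is exactly the "or" between the two summing formulas):

```python
def normalize(s_radar: float, s_min: float, s_max: float) -> float:
    """S_hat_i = (S_i,RADAR - S_i,MIN) / (S_i,MAX - S_i,MIN) * 100%."""
    return 100.0 * (s_radar - s_min) / (s_max - s_min)

def total_score(weights, scores, weighted_average=True):
    """Combine normalized component scores S_hat_i with weights w_i:
    either the weighted average (first formula) or the plain
    weighted sum (second formula)."""
    num = sum(w * s for w, s in zip(weights, scores))
    return num / sum(weights) if weighted_average else num

# Example: three components (schedule quality, meetings, briefings),
# each normalized to 0-100%, combined with weights summing to 1.
s_hat = [normalize(9.5, 0.0, 12.0), normalize(3, 0, 5), normalize(1, 0, 2)]
print(total_score([0.5, 0.3, 0.2], s_hat))  # one overall 0-100% score
```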
Next Steps for Evaluation Metrics
• Metrics for other components
• Metrics for Learning Boost
• Discuss/Refine/Redo Combination
  – True open-ended scale?
  – Something other than weighted sum?
  – Quality metric w/o penalties (+'s only)
• Test in a full walk-through scenario
  – Refine the details
  – Don't lose sight of objectives