Evaluating the Performance of
Self-Paced Brain-Computer
Interface Technology
Revision: 1.0 (draft) Date: May 19, 2006
Steven Mason Brain-Interface Laboratory Neil Squire Society, Vancouver, Canada, [email protected]
Julien Kronegg Computer Vision and Multimedia Laboratory University of Geneva, Geneva, Switzerland [email protected]
Jane Huggins Depts. Phys. Med. and Rehab.& Biomed. Engineering University of Michigan, Ann Arbor, U.S.A. [email protected]
Mehrdad Fatourechi Dept. of Electrical and Computer Engineering Univ. of British Columbia, Vancouver, Canada [email protected]
Alois Schlögl Laboratory of Brain-Computer Interfaces Graz University of Technology, Graz, Austria [email protected]
© 2006 Steven Mason, Julien Kronegg, Jane Huggins, Mehrdad Fatourechi and Alois Schlögl.
All rights reserved. No part of this work may be reproduced or used in any form or by any means
without the prior written permission of the authors.
Contents
1. Introduction ........................................................................................................................... 1
1.1. This Version ................................................................................................................ 1
1.2. Conventions ................................................................................................................. 2
2. Review of BCI Technology, Evaluation Concepts and Metrics ........................................... 3
2.1. Review of BCI Technology ......................................................................................... 3
2.1.1. BCI System Definition ...................................................................................... 3
2.1.2. BCI Transducers ................................................................................................ 4
2.1.2.1. Abstract BCI Model ........................................................................... 4
2.1.2.2. Definition of No Control (NC) support ............................................. 5
2.1.2.3. Event-Driven versus State-Driven Designs ....................................... 6
2.1.3. BCI Control Paradigms ..................................................................................... 7
2.2. Review of Performance Evaluation Concepts and Methods ....................................... 8
2.2.1. Online vs offline evaluation ............................................................................ 10
2.2.2. Continuous versus periodic analysis ............................................................... 10
2.2.3. Where to measure the performance? ................................................................ 10
2.2.4. User Intent ....................................................................................................... 10
2.3. Paced vs. Self-Paced / Guided versus Self-Guided Testing Protocols: ..................... 11
3. Challenges with Self-Paced BCI System Evaluation .......................................................... 13
3.1. Requirements and Issues ........................................................................................... 13
3.2. Availability of Reference Data .................................................................................. 13
3.3. Issues with Existing Performance Metrics ................................................................ 14
3.3.1. Evaluation of Synchronized and Self-Paced BCIs .......................................... 14
3.3.2. The NC State ................................................................................................... 14
3.3.3. Unequal Probability of Classes ....................................................................... 14
3.3.4. Implicit Assumption of Error-Free NC ........................................................... 14
3.3.5. Decision Rate .................................................................................................. 15
3.3.6. Application versus Perspective ....................................................................... 15
3.4. Application of Current Performance Metrics to Self-paced BCI systems ................ 15
4. Determining a Quality Reference for Self-Paced Evaluation ............................................. 17
4.1. Determining a Quality Reference in the presence of Observable Phenomenon and
Real-time Monitors .................................................................................................... 17
4.2. Determining a Quality Reference in the absence of an observable Phenomenon ..... 18
4.2.1. Paced Test Environments I .............................................................................. 18
4.2.2. True Self-Paced Test Environments ................................................................ 18
4.2.3. System Paced Test Environments ................................................................... 19
4.2.4. Computationally Intensive Event Detection ................................................... 19
4.3. Examples ................................................................................................................... 19
4.3.1. Step 1: running experiment ............................................................................. 19
4.3.2. Step 2. Generating the Estimated User Intent. ................................................ 21
5. Transducer Output Performance Mark Up ......................................................................... 23
5.1. Inherent Uncertainty in the Estimated User Intent .................................................... 23
5.2. General Performance Mark Up Algorithm ................................................................ 24
5.2.1. The Internal Comparison Method (ICM) ........................................................ 26
5.2.1.1. Example 1 ........................................................................................ 27
5.2.1.2. Example 2 ........................................................................................ 27
5.2.1.3. Other ICM examples ........................................................................ 28
6. Metrics for Self-Paced Evaluation ...................................................................................... 31
6.1. Definition of Terms and Symbols ............................................................................. 31
6.1.1. Observation Time ............................................................................................ 31
6.1.2. NC Time Periods ............................................................................................. 31
6.1.3. Inter-FA Time Periods .................................................................................... 31
6.1.4. Response Time ................................................................................................ 31
6.1.5. Hold Periods / Hold Time ............................................................................... 32
6.1.6. Glitch ............................................................................................................... 32
6.2. Error metrics definitions ............................................................................................ 32
6.2.1. Confusion Matrix ............................................................................................ 32
6.2.2. Other metrics ................................................................................................... 34
6.3. Temporal Characterization of Transducer Output ..................................................... 36
6.3.1. Response-Time Characterization .................................................................... 36
6.3.2. NC Period Characterization ............................................................................ 36
6.3.3. Inter-FA Period Characterization .................................................................... 36
6.3.4. Hold-Time Period Characterization ................................................................ 37
6.4. Summary .................................................................................................................... 37
6.4.1. Returning to the Research Questions .............................................................. 37
7. Reporting Practices ............................................................................................................. 39
7.1. Ideal needs for their target application ........................................................................ 39
7.1.1. Specifying the Target Needs: ............................................................................ 39
7.1.2. Acceptable Error Rates: .................................................................................... 39
7.1.3. Acceptable Timing Characteristics: ................................................................. 39
7.2. Transducer's Characteristics ....................................................................................... 40
7.2.1. Transducer's Output Rate ................................................................................. 40
7.2.2. Temporal Characteristics: ................................................................................. 40
7.2.3. Offline vs. Online Analysis: ............................................................................. 40
7.2.4. Robustness of the Algorithm ............................................................................ 40
7.3. Transducer Output Mark Up Method ......................................................................... 40
7.4. Basic Performance Information .................................................................................. 40
7.5. Application-Specific High Level Metrics ................................................................... 40
Appendix A Glossary.......................................................................................................... 43
Appendix B Transition-based Performance Markup Algorithm ........................................ 47
Appendix C References ...................................................................................................... 51
Appendix D Outstanding Issues ......................................................................................... 53
1. Introduction
There is a growing awareness that for a Brain Computer Interface (BCI) to be most useful
for people with severe motor disabilities it must support self-paced operation. Self-paced
operation implies two things: first, when a BCI system is on, it is always available for control, and
second, the technology is able to recognize periods when no commands are generated by the user
and it does not produce false responses during those times. As such, the evaluation of self-paced
technology poses some unique challenges; the most notable is the lack of a single metric to
quantify performance.
Several groups are working in this area and dealing with these challenges (e.g., in the
laboratories of Birch, Levine, Millan, Inbar, and Pfurtscheller) although the terminology and
methods used for describing and testing these approaches are inconsistent. The term
asynchronous, in particular, has been used inconsistently as researchers have attempted to
describe self-paced BCI system designs (Mason, Bashashati et al. 2005).
This report is our effort to elucidate the concepts and methods related to the evaluation of self-
paced BCI operation. As such we define key terms and concepts, outline specific challenges and
issues in this sub-field, and summarize methods and metrics for comparing technology designs.
The overriding purpose of this report is to provide a common reference for designing and
evaluating self-paced BCI technologies. Our hope is that through efforts like these, disparate
perspectives can be aligned and the community brought together on terminology, testing strategies
and reporting related to self-paced operation.
Fundamentally, the report is a living document that will grow and change as the field matures
and our thinking is refined. In its current form, it contains the fundamental self-paced concepts and
key terms that we have agreed on and itemizes several existing terms that we would like to see
discontinued. The basic concepts and terminology are reviewed in the next chapter. As these
concepts and terms are used throughout the document, we recommend that Chapter 2 is read prior
to reading other chapters. As stated above, self-paced BCI evaluation has several unique
challenges. We have itemized these in Chapter 3. One of the most critical challenges – deriving a
suitable signal reference for comparison – is detailed further in Chapter 4. In Chapter 5 and 6 we
outline methodological approaches to BCI evaluation and describe various metrics for
summarizing experimental findings. We close the report in Chapter 7 with some guidelines for
reporting self-paced BCI system evaluation. We also have included several appendices, of which
the most noteworthy are the first and last. The first is a glossary of all key terms and the last
summarizes issues which we have not resolved as of the time of publication.
1.1. This Version 1
The material presented herein is the product of a small, closed discussion group of researchers
representing laboratories that have been actively working with self-paced BCI technology. The
project was initiated at an informal gathering at the 3rd International Brain Computer Interface
Meeting, Rensselaerville, NY, USA in June 2005 where we assembled to discuss the evaluation of
self-paced BCIs. Given the diversity in perspectives and terminology that we encountered within
our small group (during and after that meeting), we chose to limit the participation in the
discussion aiming to first reach agreement on the most basic terms, concepts and metrics before
we approach the rest of the community for their input.
1 The most recent version of this report is posted on the www.BCI-info.org website under Research Info | Documents
| Articles. We recommend that you get the latest version before reading further.
In order to limit the scope of the discussion, we chose to focus the report primarily on BCI
transducers that have discrete output. (For reference, “BCI transducer” refers to the conceptual
component of a BCI system that translates brain activity into basic control signals.) Self-paced
operation of BCI transducers with continuous output or spatial reference output will not be
discussed here. Future versions or alternate reports may deal with the evaluation of these
transducer designs.
As future versions of this document will reflect community opinion, we invite you to comment
on what we have written. If you are interested, please join the online discussion held in the
newsgroup named gmane.science.neuroengineering.bci-info.general on the news server
news.gmane.org.2 Our plan is to update this document based on comments seen in this newsgroup.
1.2. Conventions
Below are the style conventions used in this document:
<New Term> definition of key term related to self-paced BCI technology and operation
(see entry in glossary – Appendix A). Usually the first occurrence of the
term in the document.
<Key Term> subsequent use of a key term when the term is not obvious from context
<Old Term> existing terminology that we would like to see discontinued
<text> in superscript identifies an unresolved issue. Refer to item n in the last
appendix entitled Outstanding Issues for a description of the issue.
Note all highlighted terms are hyperlinked to glossary entries in the first appendix.
2 Instructions for accessing the newsgroup: If you use MS Outlook, activate the built-in news reader with View | Go to |
News... Then use Tools | Accounts and add news.gmane.org to the accounts list. Then right-click on
news.gmane.org in the Folders panel and select Open. Search for and select the "bci-info" newsgroup (named above)
from the listed newsgroups. If you use Thunderbird, you could do the following: In File | New | Account select
Newsgroup account. Enter "news.gmane.org" in the newsgroup server field. Then you can select the "bci-info"
newsgroup from the listed newsgroups. If you have difficulties accessing this newsgroup or have other questions
regarding the online discussion, please contact one of the authors.
2. Review of BCI Technology, Evaluation Concepts and Metrics
2.1. Review of BCI Technology
In this chapter, we review the concept of a Brain Computer Interface (BCI), a BCI
transducer and different types of BCI transducers, with a focus on self-paced BCI systems.
2.1.1. BCI System Definition
A BCI System is a set of sensors and signal processing components (sometimes including
displays and sensory stimulators) that translates a person's brain activity directly into useful
control or communication signals. From an assistive technology perspective, Figure 1 depicts a
BCI System as an assistive technology (AT) that bridges an ability gap between a person and their
environment. For example, such a system can enable a person with severe motor disabilities to
control objects in their environment such as a light switch, a TV set or an object on the computer
screen. Figure 2 depicts two common AT architectures discussed in the BCI literature 3. (See BCI
design web site (Mason, Bashashati et al. 2005) for alternative system architectures).
Figure 1. Conceptual model of BCI technology used as an assistive technology.
Figure 2. a) Functional model of 2-component BCI System; and b) functional model of 3-component BCI System.
The Assistive Device component represents the apparatus that interacts directly with objects or people in the
environment. Examples would be displays, speech synthesizers, infrared remote controllers, an FES system or a
wheelchair. The Control Interface is a component that is added to a transducer that produces a relatively low
dimensional output in order to expand the control dimensionality to a level required by an Assistive Device. Most
commonly, the Control Interface is some form of electronic menu on a display.
3 Note, this model also applies to any augmentative technology that extends a person's abilities beyond their
inherent functional limitations.
2.1.2. BCI Transducers
The BCI Transducer (depicted in Figure 2) represents a collection of sensors and signal
processing components required to translate a person's brain activity into usable control signals as
detailed in Figure 3. As seen in Figure 3, the brain activity is first measured by a series of sensors
and then amplified. Next, non-physiological artifacts (such as power line noise) and physiological
artifacts (such as those caused by EOG activity) are removed from the brain signals (or
periods of brain signals contaminated with such artifacts are excluded from analysis). It should be
noted that some BCI systems do not have a separate “Artifact Processor” component as shown in
Figure 3. Then the Feature Extractor extracts useful features for discrimination of control signal
from background brain activity. Finally, the Feature Translator translates these features into
control signals which are sent to an Assistive Device or Control
Interface.
Figure 3. Functional model of a BCI Transducer illustrating the series of components used to translate brain activity
into control signals.
In general, transducers can be designed to produce discrete, continuous or spatial reference
outputs. For this report, only discrete transducers (non-ordered, state-based output) are
discussed.
Figure 4. Discrete state-based output versus continuous output (values from ordered set).
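As a minimal illustration of this distinction (our own sketch, not from the report), a discrete transducer output is a sequence of symbols from a non-ordered set, whereas a continuous output takes values from an ordered range:

```python
# Hypothetical sample-by-sample outputs for the two transducer types in
# Figure 4 (illustrative values only).
discrete_output = ["B", "A", "C", "C", "B", "A"]      # non-ordered state labels
continuous_output = [0.3, -0.8, 1.2, 1.1, 0.4, -0.7]  # values from an ordered set

# Only equality is meaningful for discrete state labels...
assert discrete_output[2] == discrete_output[3]
# ...whereas continuous values can also be compared by magnitude.
assert continuous_output[2] > continuous_output[1]
```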
2.1.2.1. Abstract BCI Model
As depicted in Figure 5, the BCI System can be represented abstractly as the user's brain
connected to a BCI Transducer followed by other signal processing components that translate the
basic transducer output into higher-dimensional (more complex) data or commands. In this model,
the user's intent generates the appropriate brain state to do some activity (i.e., produce some
output) in their environment. As with any other tool, the user's experience will dictate how much
attention must be focused on correct operation of the tool or if tool use is primarily transparent,
with the user only consciously focusing on the activity being performed. For an experienced user,
the production of the brain state and transducer output has been automated and hence the user's
intent is usually related to some high-level output in this “system” (somewhere along the chain of
outputs). For inexperienced users, the production of the required brain states and transducer output
has not been automated and the user has to consciously focus on generating a particular transducer
output, so the intent is more closely tied to generating a particular transducer output.
Figure 5. Generic signal processing model.
2.1.2.2. Definition of No Control (NC) support
When operating a BCI, users modulate their brain activity in order to generate desired
transducer outputs. For discrete BCI Transducers, the brain states related to Intentional Control
are denoted as ICA, ICB, ICC, ... with corresponding transducer outputs A, B, C,... When there is
no Intentional Control, e.g., during periods of thinking, composing, monitoring or daydreaming,
the user's brain state is considered to be in a No Control state, which is denoted by NC. The
appropriate transducer response to NC would be a neutral, or N, output. We refer to this ability
as NC support. NC support is necessary for most types of machine or device interactions where
frequent actions are spaced by periods of inaction. For most physical interfaces (such as a
keyboard), NC support is not an issue as intent is related to voluntary movement. One of the
challenges of quantifying BCI performance is determining appropriate criteria to quantify
imperfect NC support. This ability to remain neutral has been compared to a car engine that idles
when no gas is applied. Thus some researchers have referred to the NC state using the term idle
and NC support as Idle Support. However, the word idle implies passivity, which is only one
possible type of NC, so these terms are not ideal and we recommend that researchers use NC and
NC support instead.
Few BCI Transducers have been specifically designed to support the NC brain states. For this
work, these transducers will be called BCI transducers with NC support. Transducers that do
not support NC brain states are referred to as BCI transducers without NC support 4.
NC support must handle the diversity of activity and thought that may make up the NC state
and must operate effectively during NC periods ranging from a few seconds used to check a
written source, to a few minutes of staring out the window, to a few hours of watching a favorite
movie. Thus it is unlikely that one can model all possible NC states related to a target application.
In general, this problem can be constrained by including some mechanism to turn the interface off
4 We have explicitly avoided using the terms asynchronous and synchronous in order to avoid misinterpretation
and confusion as these adjectives are used inconsistently throughout the field. As a reference, we have (re)defined an
asynchronous BCI transducer as synonymous with a BCI transducer with NC support and a synchronous BCI
transducer as synonymous with a BCI transducer without NC support. We do not support the use of asynchronous
BCI or asynchronous BCI system as these terms are too vague and bound to cause more confusion in the
community. Instead we prefer the term self-paced BCI system.
when not in use and back on when desired. Applying this approach, some interface designs
propose to handle long periods of inactivity by having the BCI be “put to sleep” by the user or
programmatically “go to sleep” (such as when battery-powered laptops enter standby mode) and
then are turned back on (via some mechanism) when desired I1. While a sleep mechanism is a
useful mode to avoid false responses during long periods of NC, it is not practical during periods
of interaction which contain frequent short pauses for thought, composition or response
monitoring. Further, for a BCI to programmatically put itself to sleep implies that the BCI can
recognize a long period of NC and take action accordingly. Therefore, while a sleep mechanism
may form a portion of a BCI's NC support strategy, a sleep mechanism by itself is not sufficient
for NC support.
Some researchers have designed cued interfaces that, in addition to producing IC-related
outputs, also produce an unknown state output when there is not enough confidence in the
classifier to choose one of the other IC states. This unknown state has in some works been used to
represent the neutral output, N, but there is no evidence that the NC state will fall into the
unknown state in these designs I2. Therefore, the existence of an unknown state is not necessarily
sufficient to define NC support.
Ultimately, NC support will likely be provided by a combination of the methods described
above along with strategies yet to be developed. Only practical experimentation with real BCIs
and real users will reveal the nature of effective NC support in a real-world BCI.
2.1.2.3. Event-Driven versus State-Driven Designs
Through our discussion we realized that researchers design discrete self-paced BCI transducers
in two different ways. We have named these designs state-driven (discrete) BCI transducers
and event-driven (discrete) BCI transducers.
In event-driven discrete control, the user has the ability to initiate a state change (the event),
but cannot hold this state for a given time because the underlying neurological phenomenon is
event-based. Note that in this case the return to the NC state is performed automatically by the brain (see
Figure 6). P300 waves typically produce event-driven discrete control.
In state-driven discrete control, the user has the ability to initiate a state change, hold and
release a given state, and switch between brain states, as shown in Figure 6. This is done in
transducers where the underlying neurological phenomenon is a thresholded continuous value.
Thresholded mu activity is an example of state-driven discrete control.
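The state-driven case can be sketched as simple thresholding of a continuous feature. This is a hypothetical illustration with an arbitrary decision boundary; real transducers involve feature extraction and classifier design not shown here:

```python
def state_driven_output(feature_values, boundary=1.0):
    """Map a continuous feature (e.g., thresholded mu activity) to a
    state-driven discrete output: 'A' while the feature is beyond the
    decision boundary, neutral 'N' otherwise. The user can initiate,
    hold, and release the A state by sustaining the underlying activity."""
    return ["A" if v > boundary else "N" for v in feature_values]

# The user initiates the A state, holds it for two more samples, then releases it.
assert state_driven_output([0.2, 1.4, 1.6, 1.3, 0.3]) == ["N", "A", "A", "A", "N"]
```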
Figure 6. a) Event-driven discrete control: transducer output is based on a transient brain activity (an event) such as a
Movement Related Potential (MRP). b) State-driven discrete control: the user has ability to initiate, maintain (hold),
and release an intentional control (IC) state.
The main point in differentiating these two paradigms is that for state-driven control we
should report activation, hold, release, and NC support capabilities, while for event-driven
control we only measure activation and NC support capabilities as the user has no control over
release (which is automatic) or holding.
Another difference between event-driven and state-driven discrete control is that a transition
from one IC state to another IC state cannot occur in event-driven control because of the
automatic release from an IC state to NC.
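By contrast, an event-driven output can be sketched as a detector whose activation lasts a fixed "on" time determined by the classifier design, with the release back to N happening automatically. Again, this is a hypothetical sketch; the event detection itself is not modeled:

```python
def event_driven_output(event_samples, n_samples, on_time=2):
    """Event-driven discrete output: each detected event (e.g., a P300 or
    MRP) switches the output to 'A' for a fixed number of samples, after
    which it returns to the neutral 'N' automatically; the user cannot
    hold or release the state."""
    out = ["N"] * n_samples
    for t in event_samples:
        for k in range(t, min(t + on_time, n_samples)):
            out[k] = "A"
    return out

# A single event at sample 3 produces a fixed-duration activation.
assert event_driven_output([3], 8) == ["N", "N", "N", "A", "A", "N", "N", "N"]
```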
2.1.3. BCI Control Paradigms
While our focus in this report is on the self-paced operation of BCI systems, we found it necessary
to clearly delineate the self-paced control paradigm from other types of control paradigms seen in
the BCI literature. From our perspective, there are four primary control paradigms based on NC
support and system availability as depicted in Figure 7 (Mason and Birch 2005).
1) self-paced control: BCI system is continuously available to the user when it is on/awake and
it supports NC
2) system-paced I3 control: system is periodically available to the user when it is on/awake (i.e.,
it requires a cuing mechanism) and it supports NC
3) synchronized I4 control: system is periodically available to the user when it is on/awake (i.e.,
it requires a cuing mechanism) and it does not support NC
4) constantly-engaged control: system is continuously available to the user when it is on/awake
and it does not support NC (not a practical mode of control)
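The four paradigms follow directly from the two design properties (availability and NC support), which can be expressed as a small lookup (our sketch of the taxonomy above):

```python
def control_paradigm(continuously_available: bool, supports_nc: bool) -> str:
    """Name the BCI control paradigm from the two properties in Figure 7:
    whether the system is continuously available when on/awake, and
    whether it supports the No Control (NC) state."""
    if continuously_available:
        return "self-paced" if supports_nc else "constantly-engaged"
    return "system-paced" if supports_nc else "synchronized"

assert control_paradigm(True, True) == "self-paced"
assert control_paradigm(False, True) == "system-paced"
assert control_paradigm(False, False) == "synchronized"
assert control_paradigm(True, False) == "constantly-engaged"
```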
Figure 7. System availability during different temporal control paradigms. Reproduced from (Mason and Birch
2005).
As a result, one would find four different categories of BCI system design: self-paced, system-
paced, synchronized, and constantly engaged.
There have been numerous reports that equate cuing mechanisms with synchronous (or
synchronized) control. This is inaccurate and misleading. While synchronized control (and
system-paced control) require a cuing mechanism in their design, the presence of a cuing
mechanism in an experimental protocol does not imply that the system operates in a synchronized
or system-paced control paradigm. Cues are an essential part of the system design in synchronized
BCI systems and system-paced BCI systems. They let the user know when the system is about to
start interpreting their data as control. For synchronized BCI systems, the cues are generally used
to say "get ready to start controlling" (i.e., get into an IC state, or start a series of IC states, as per
Millan et al.). For system-paced BCI systems, they indicate that a control period will be starting
soon if they want to control the system at that time (not a requirement as in synchronized systems).
Cues are also used as experimental constraints (i.e., not part of the BCI system design). As
experimental constraints, cues are used to guide the user into some state, such as IC or NC. In this
way, one could set up an experimental system with a user operating a self-paced BCI system
(design) and a separate cuing mechanism to "force" the user to control the self-paced system when
desired by the experimenters. Such a setup would not imply a cued transducer, but instead would
indicate a tightly constrained experimental setup.
2.2. Review of Performance Evaluation Concepts and Methods
Before we proceed to discuss performance metrics and methods, we require a common
perspective and language. This chapter outlines the principal concepts related to the evaluation of
self-paced BCI technologies that we have agreed on.
Figure 8 and Figure 9 delineate the main components and data flow of data recording and
analysis for online and offline studies, respectively.
Figure 8. Simplified on-line experimental system. a) recording of BCI system operation data while the user attempts
to perform some activity through the BCI; b) schematic of data analysis (could be done in real time or after the data
recording has been completed). Note, the collection of “soft” usability metrics, like “user satisfaction” (typically
recorded via questionnaires), is not depicted in these diagrams. I5
For an online experimental system, after a user attempts to perform some activity through the
BCI system, the data is recorded on a storage device (see Figure 8.a). “Real-time monitors” such
as a camera ensure that the user's actions are properly documented (including proper execution of the
experiment and monitoring of artifacts). I6
Figure 8.b shows the general schematic of data analysis. The first step is to codify the user's
intent in a machine-readable format. This is a critical step in the design, since the exact time that
the user has intended to control the BCI is usually unknown and must be estimated from the
available reference information. We call such an estimation of the intent of the user the Estimated
User Intent and we discuss its generation in detail in Chapter 4. Once the Estimated User Intent
9
is generated, the transducer output can be marked up and the performance evaluation metrics can
be calculated.
[Figure 9 diagram. Panel a (performance evaluation): stored brain signals drive a new transducer
design to produce a new transducer output; the Intended Output is generated from knowledge of
the experimental protocol and reference information, the transducer output is marked up, and
error statistics and timing characteristics are calculated. Panel b (recording): user, environment,
activity, real-time monitors, and storage of the Recorded Data.]
Figure 9. Simplified off-line experimental system. a) Prerecorded brain wave data (either from a previous on-line
recording as depicted in Figure 8a or from a recording of a user performing a specific activity as depicted in b) is used
to drive a new transducer design.
Figure 9.a shows the setup for an offline experimental system. The difference between this setup
and the on-line analysis is that here the user's feedback does not exist. The brain waves and other
experiment-related signals, such as EMG and EOG activity, from a previous BCI experiment or
recorded from a user performing a specific task, are stored for offline analysis.
The recorded brain signals are then fed into a new transducer design (see Figure 9.a) and the
new transducer output is generated. The rest of the process of labeling and calculating the
performance criteria is similar to the online experiment.
Depending on the transducer design and experimental set-up, the key issues in performance
evaluation are:
1. online/offline evaluation: depending on the type of analysis (online vs. offline), the
performance criteria may be different.
2. continuous versus periodic analysis
3. where should the performance be measured?
4. the guidance and pacing of experimental tasks and self reported data
5. methods to determine a quality signal reference for comparison
6. metrics: what to use and how to calculate them
The first four issues are discussed in the following subsections. The last two will be
discussed in more detail in Chapters 4, 5 and 6.
2.2.1. Online vs. Offline Evaluation
The performance evaluation depends on whether the experimental system is offline or online.
In an online experimental system, the performance of the system is evaluated by two sets of
metrics: the transducer's performance metrics (as explained in Chapter 6) and the system's
usability metrics, such as the degree of subject satisfaction or dissatisfaction with the system.
2.2.2. Continuous versus Periodic Analysis
Periodic analysis, as we have defined it, is when an experimenter records continuous data
(including NC and IC) but evaluates the BCI technology only during the periods of IC (similar to
synchronized control systems). Continuous analysis is when the analysis evaluates all of the
data.
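The difference can be made concrete with a small sketch (our own illustration; the label names and data are invented). Scoring the same output periodically and continuously can tell different stories, because periodic analysis never sees false activations that occur during NC periods:

```python
# Illustrative sketch of continuous vs. periodic evaluation of a
# sample-by-sample transducer output. "NC" marks no-control periods;
# "A" and "B" are intentional-control (IC) states.

def accuracy(reference, output, ic_only=False):
    """Fraction of samples where the output matches the reference.

    ic_only=True restricts scoring to IC periods (periodic analysis);
    ic_only=False scores every sample (continuous analysis).
    """
    pairs = list(zip(reference, output))
    if ic_only:
        pairs = [(r, o) for r, o in pairs if r != "NC"]
    return sum(r == o for r, o in pairs) / len(pairs)

reference = ["NC", "NC", "A", "A", "NC", "B", "B", "NC"]
output    = ["NC", "A",  "A", "A", "NC", "B", "B", "NC"]

continuous = accuracy(reference, output)              # 7/8 = 0.875
periodic = accuracy(reference, output, ic_only=True)  # 4/4 = 1.0
```

Here periodic analysis reports perfect control, while continuous analysis reveals the false activation during the second NC sample.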
2.2.3. Where to Measure the Performance?
To provide context for the interplay between IC and NC states, a simple operating model shown in
Figure 10 was adopted. In this model, the user is equipped with a BCI transducer and the
transducer output is connected to an assistive device.
There are multiple places where one can test a self-paced BCI system, represented by various
test points TPα, TPβ and TPγ in Figure 10. Given the wide variety of commercially available
assistive devices, we felt that BCI users would be primarily interested in the performance of BCI
transducers as this characterization would enable them to select the best transducer to control their
existing assistive devices. To simplify our discussion (and best meet user needs), we chose to
focus our conversation on the output of the transducer, TPα, although much of what is presented
below may also apply to other test points.
[Figure 10 diagram: a BCI Transducer connected to an Assistive Device, with test points TPα (at
the transducer output), TPβ and TPγ marked along the signal path.]
Figure 10. Basic operating model illustrating possible test points for evaluating function in multi-component BCIs.
For BCI technology evaluation, we are treating the BCI transducer operation as a black box,
i.e. the evaluation metrics are blind to any considerations about how the transducer output is
produced. The performance metrics merely quantify the difference from the user's viewpoint
between the user's intent when controlling the system (intended output) and the actual output of
the BCI transducer.
2.2.4. User Intent
The purpose of performance metrics is to quantify desire versus ability. We refer to the
discrete representation of the user's desire as the intended transducer output or Estimated User
Intent. One of the major issues is how to compute the Estimated User Intent for a self-paced
experiment so that the analysis is valid (refer to Chapter 4 for discussion).
Deviations of the observed output from the intended output as observed at the test point may be
caused either by limitations in the user's ability to control their brain state or by deficits in the BCI
transducer's ability to interpret their brain activity. Experiments may be configured to examine
one or the other of these sources of error or to allow both simultaneously.
2.3. Paced vs. Self-Paced / Guided vs. Self-Guided Testing Protocols
Testing protocols were broadly classified by two factors, pacing and guidance. A guided,
paced protocol can constrain the subject in such a way that the investigator can determine what the
subject intended to do (assuming subjects follow task directions) and when they attempted control
of the interface. Any protocol that utilizes self-guided or self-paced interaction will require self
report. For self-directed tasks, investigators require extra methods to label their self-directed data.
The investigator also needs to control self report error in these protocols.
Test protocol design is also independent of the BCI transducer design. What was not initially
recognized by all members of the group was that self-paced BCI transducers that support NC can
be tested in guided and paced protocols as well as with self-paced methods. For example,
guided and paced protocols may use cues to initially customize and train the subjects, but true
testing of self-paced BCI technology for individuals with severe motor disabilities will require
self-paced actions and timing. Ideally, a self-guided, self-paced test protocol is desired although
this type of protocol involves many issues related to how to generate the estimated intended output
(see Chapter 4). Experience with this type of testing indicated that subject training or BCI
transducer customization or calibration in controlled environments does not always map well to
self-determined environments. This makes it difficult to predict the true usability of self-paced
BCI technology for certain subject groups where self report is not possible.
3. Challenges with Self-Paced BCI System Evaluation
3.1. Requirements and Issues
Comparison of different BCI systems, transducers and algorithms requires shared performance
metrics. The most common performance metric is perhaps the error rate or classification accuracy
(Blankertz, Müller et al. 2004), (Blankertz, Vaughan et al. 2003), (Blankertz, Schalk et al. 2005).
However, several other metrics have also been proposed such as Cohen's Kappa coefficient (Bortz
and Lienert 1998), (Cohen 1960), (Kraemer 1982), (Schlögl, Lee et al. 2005), mutual information
and information transfer (Kronegg, Voloshynovskiy et al. 2005), (Kronegg and Pun 2005),
(Nykopp 2001), (Pierce 1980), (Schlögl, Neuper et al. 2002), (Schlögl, Keinrath et al. 2003),
(Wolpaw, Ramoser et al. 1998), receiver operating characteristics (ROC) and the area under the
ROC curve (AUC) (Lal, Schröder et al. 2005), (Schlögl, Anderer et al. 1999a), (Schlögl, Anderer
et al. 1999b), correlation coefficient (Gao, Black et al. 2003), (Wu, Gao et al. 2006) and mean
square error (MSE) (Gao, Black et al. 2003), (Wu, Gao et al. 2006) (for details see (Schlögl,
Huggins et al. accepted for publication)). These criteria have been applied mostly in synchronized
BCI systems, on a trial-by-trial basis. In the simplest case, a single classification result is obtained
from each trial and these single trial results are used to calculate the classification accuracy.
Sometimes, instead of the classification value, a discriminant value is used, taking into account not
only the classification but also the magnitude (or confidence level of classification). In more
advanced evaluation methods, instead of a single value per trial, the result of each time-point
within the trial is analyzed. Accordingly, the time-course of the performance metric is used,
enabling an estimate of the time delay of the data processing methods. These evaluation methods
require that the Estimated User Intent (class labels, target information, reference data) be accurate
and precise, which is a simple matter for the synchronized BCI experiments to which they have
been applied. Synchronized experiments also allow experiments to be structured so that the
assumptions necessary for the application of these performance evaluation methods are generally
met. However, for self-paced BCIs and for most real-world applications, these underlying
assumptions are not met and the Estimated User Intent contains some uncertainty. Therefore,
these performance evaluation metrics cannot be readily applied to self-paced BCIs.
3.2. Availability of Reference Data
In order to evaluate performance, the Estimated User Intent must be known and available for
comparison with the transducer output. When evaluating self-paced BCIs, obtaining the Estimated
User Intent can be a major challenge because self-paced BCI experiments generally do not provide
rigorous reference information. Sample-by-sample class labels that are commonly employed by
existing performance metrics are especially difficult to obtain from self-paced experiments where
the point at which brain activity for a particular task begins is entirely up to the user. However, in
many cases, it is sufficient to use reasonable reference information, without requiring 100%
accuracy. This strategy is often used in medical informatics, where experts provide scorings based
on their best knowledge. Although different experts do not agree (they exhibit inter-scorer
variability) and the same expert does not always produce the same scoring results, expert scoring
is often still useful and serves as the "gold standard" (examples are sleep stage scoring
(Danker-Hopfe, Kunz et al. 2004) or the diagnosis of mammograms). Obtaining Estimated User
Intent information for self-paced BCIs is discussed in detail in Chapter 4.
Since the Estimated User Intent information is imperfect, the performance metric is limited by
the inaccuracies of the reference labels. Nevertheless, the metric can be used for comparing
different data processing methods. As long as the reference information has been obtained
independently (a priori) from the transducer output, the obtained metric can reliably compare the
performance of the two systems on the same data. However, matters become more complex when
the desired task is to compare the performance of two systems that were tested on different data.
3.3. Issues with Existing Performance Metrics
3.3.1. Evaluation of Synchronized and Self-Paced BCIs
The evaluation of self-paced BCIs requires the comparison of the Estimated User Intent (our
signal reference) to the transducer output. Compared to synchronized BCIs, where only certain
windows of time are evaluated, a self-paced BCI output is analyzed at every output value. This
has far-reaching implications for the application of performance metrics.
3.3.2. The NC State
As the unique characteristic of self-paced BCIs, the NC state presents the greatest difference
between synchronized and self-paced BCIs and is therefore an important consideration for
appropriate performance evaluation. While the NC state could be treated as simply an additional
state, this ignores the variability of the underlying brain activity and hides the importance of
error-free NC periods. Alternatively, separate statistics could be calculated for the NC state and the IC
states. This complicates the comparison of methods, but may also capture important aspects of the
performance.
3.3.3. Unequal Probability of Classes
The operation of self-paced BCIs typically results in long periods of NC interspersed with brief
instances of IC, or periods of increased IC. Regardless of the pattern of NC and IC, the NC state
often occurs with a much higher probability than the IC state(s). This unequal a priori probability
of the various classes violates the underlying assumption of equal a priori probability for a variety
of traditional performance metrics including Wolpaw's Mutual Information (Wolpaw, Birbaumer
et al. 2000) and the classification accuracy (ACC) or error rate (ERR). While unequal
probabilities do not violate the assumptions of other methods, they can present problems for
interpretation. For example, Receiver Operator Characteristics (ROC) and the metrics derived
from them can be used to present the performance metrics for a self-paced BCI. However, when
calculated in a traditional sample-by-sample manner, the overwhelming likelihood of the NC state
means that most BCI transducers will produce what looks like a perfect ROC curve. But even an
apparently low false positive percentage of 0.01% could indicate more than one false positive per
minute if the sample rate was 200 Hz. Thus, the area of interest on a ROC curve is so narrow that
traditional ROC analysis is impractical.
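The arithmetic behind this example is easy to check; a small helper (ours, purely illustrative) converts a sample-by-sample false-positive rate into false positives per minute:

```python
def false_positives_per_minute(fp_rate, decision_rate_hz):
    """Expected false positives per minute, given the fraction of NC
    samples misclassified (fp_rate) and the decision rate in Hz."""
    return fp_rate * decision_rate_hz * 60.0

# An apparently negligible 0.01% false-positive rate at 200 Hz still
# yields more than one false activation per minute:
fp_per_min = false_positives_per_minute(0.0001, 200)  # 1.2
```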
3.3.4. Implicit Assumption of Error-Free NC
One of the keys to useful performance of self-paced BCIs is the availability of long periods of
error-free function. However, some performance evaluation methods may be used to compute the
performance over the IC states only. This corresponds to making the implicit assumption of error-
free periods of NC (Schlögl, Huggins et al. accepted for publication). Performance metrics that
do not incorporate the existence of false positives are of limited use for the evaluation of self-
paced BCIs because they ignore such an important aspect of self-paced BCI operation. Methods
may also ignore false positives by requiring experimental setups that remove or limit the
opportunity for false positives to occur. For example, to produce a high information transfer rate,
events must occur close together. Using such short intervals increases the ITR, but also limits the
time between events, artificially reducing the opportunity for false positives to occur.
Consequently, while a self-paced BCI with a high ITR is desirable, such a description is
incomplete since it does not show how robust the BCI is against false positives.
3.3.5. Decision Rate
Another challenge for the use of performance metrics to compare self-paced BCIs is the large
variation in transducer output rates. Some BCIs produce decisions at a rate identical to the sample
rate (e.g. 200 Hz, (Huggins, Levine et al. 1999)) while others produce decisions at a dramatically
reduced rate (e.g. 20 Hz, (Kübler, Nijboer et al. 2005)). In synchronized BCIs, this has not
been an issue, because most BCIs use a per-trial decision method with a decision rate of about 10-
15 trials/minute.
Some performance metrics, such as the kappa coefficient (Schlögl, Huggins et al. accepted for
publication), are normalized with respect to the number of samples, and would therefore not be
directly dependent on the decision rate. Some performance metrics can be normalized with respect
to the number of samples or with respect to the experiment duration. Other metrics (such as the
HF-difference) ignore time altogether, so that the metrics are only comparable when determined
from test data of the same length. However, performance metrics that are dependent on the
decision rate would be useless for comparison of BCI performance with this order of magnitude
difference in the decision rate. Evaluation of self-paced BCIs requires a metric that can be used to
compare BCIs with different decision rates.
Using a high transducer output decision rate also raises the question of the useful information
transmitted. For example, does a 100 Hz decision rate BCI produce 10 times more useful
information than a 10 Hz decision rate BCI? Probably not, because the useful information is
primarily contained in the transitions within the Estimated User Intent, which do not occur at this
frequency.
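To illustrate how a chance-corrected, sample-normalized metric behaves under these conditions, here is a generic sketch of Cohen's kappa (a textbook implementation, not the one from the cited work). On data dominated by NC, a classifier that always outputs NC achieves high accuracy but zero kappa:

```python
from collections import Counter

def cohens_kappa(reference, output):
    """Cohen's kappa: observed agreement corrected for the chance
    agreement implied by the class frequencies of both sequences."""
    n = len(reference)
    p_observed = sum(r == o for r, o in zip(reference, output)) / n
    ref_freq = Counter(reference)
    out_freq = Counter(output)
    p_chance = sum(ref_freq[c] * out_freq.get(c, 0)
                   for c in ref_freq) / n**2
    return (p_observed - p_chance) / (1.0 - p_chance)

# 90% of the reference samples are NC; always answering "NC" gives
# 90% accuracy but carries no information about the IC events.
reference = ["NC"] * 9 + ["IC"]
always_nc = ["NC"] * 10
kappa = cohens_kappa(reference, always_nc)  # 0.0
```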
3.3.6. Application versus Perspective
The field of BCI research involves engineering and clinical considerations for BCI
performance. Too often, these diverse fields fail to communicate regarding BCI design
requirements. Researchers from a pure engineering background tend to focus on accuracy and
information transfer rates while researchers from a clinical background focus on response time and
robustness to false positives. However, both types of metrics are important for the production of a
clinically useful BCI. Ultimately, the user's perception of the BCI performance will control its
success as a clinical intervention, so user-centric methods should be considered in all stages of the
design. However, for some research and development tasks, more engineering-focused metrics
may provide additional insight into the choice of algorithms.
3.4. Application of Current Performance Metrics to Self-paced BCI systems
Based on the arguments in Chapter 3.3, most metrics seem to be inappropriate for use with
self-paced BCIs. An initial consideration is whether the transducer output is discrete or
continuous and whether the reference information is discrete or continuous. An example of
discrete reference information would be a synchronized BCI experiment with different target cues
(Schlögl, Neuper et al. 2002), (Schlögl, Keinrath et al. 2003); an example of a continuous
reference signal would be the position of a ball on a screen (Gao, Black et al. 2003), (Wu, Black et
al. 2004), (Wu, Gao et al. 2006). The metric to apply depends on the type of transducer output, the
experimental design (which affects the probability of the classes in the data), and the desired
application for the BCI. An overview of the requirements for various metrics is provided in Table
1.
Table 1: Requirements for various criteria

Metric                        | Reference information        | Transducer output | Handles unequal class probabilities | Incorporates NC
------------------------------|------------------------------|-------------------|-------------------------------------|----------------
Error rate, Accuracy          | Discrete                     | Discrete          | No                                  |
Cohen's kappa coefficient     | Discrete                     | Discrete          | Yes                                 |
Wolpaw's MI                   | Discrete                     | Discrete          | No                                  |
Nykopp's MI                   | Discrete                     | Discrete          | Yes                                 |
Continuous MI                 | Discrete                     | Continuous        |                                     |
AUC                           | Discrete                     | Continuous        |                                     |
MSE                           | Continuous                   | Continuous        |                                     |
Correlation coefficient       | 1-D (discrete or continuous) | Continuous        |                                     |
HF-difference                 | Discrete                     | Discrete          | Yes                                 | No
Sensitivity, Specificity, Precision, Recall, F1, a', d-prime | Discrete | Discrete |              |
4. Determining a Quality Reference for Self-Paced Evaluation
As discussed in Chapter 3, one of the most problematic issues in the evaluation of a self-paced
BCI is the ability to determine the user's intent and thus generate a quality reference signal to
compare with the actual transducer output. This chapter focuses on this issue, illustrating various
methods depending on the amount and quality of the experimental information.
As an overview, the ability to determine (or estimate) the subject's intent depends on the task
the subjects performed and on the experimental equipment. For example, the subject may be
performing a task that is observable (e.g. moving their finger) or one that is not (e.g. imagining
moving their finger). If they are moving their finger, then this may be directly measured with
some form of monitor, such as a finger switch or data glove. If, however, there is no observable
phenomenon, then alternative methods are required to determine the subjects' intent. This is
discussed below.
Any method for generating the Estimated User Intent relies on some controlled or measured
experimental variable, such as a finger switch activation, EMG onset, or an experimental cue.
In the remainder of this report we will refer to these variables as Intent Related Measures or
IRMs.
4.1. Determining a Quality Reference in the Presence of an Observable Phenomenon and Real-Time Monitors
If the experiment has an observable correlate to the subjects' intent, such as an actual
movement, then a real-time monitor can be used to record it with a certain spatiotemporal
resolution (e.g. if the subjects control the BCI by moving their finger, then a data glove can be
used to record the movements). In this scenario, the movement information provides a good
approximation of the subjects' intent, although the reader should note that even these "concrete"
observations are still only an approximation. Because brain activity begins before movement
onset, the recorded information will be delayed compared to the true subject intent.
Pros:
- easy to implement
- observations during specific times in the experimental protocol are highly
correlated with the subjects' intent, making the analysis method more direct
- useful for proof of concept
Cons:
- requires observable phenomenon – this often limits these types of studies to able-
bodied individuals and rules out individuals with severe disabilities. It also rules out
those BCI technologies that use motor or other imagery as a control source. This is
a serious limitation.
Even with this approach, the performance analysis is not necessarily a simple comparison of
the monitor output and the transducer output. The delay between the onset of brain activity and
movement onset must be accommodated and additionally, the transducer may introduce
constraints on the output that prevent simple comparison. For example, a finger switch may
record a momentary press but the transducer may have a debounce mechanism that holds it active
for ¼ of a second before releasing. In this example, the switch on/off may be of varying durations,
say 1/32 to 1/8 second, thus producing a pulse of various lengths, whereas the transducer output is
always ¼ second long. As such, the comparison of this self-paced data is not straightforward and
may require some heuristics based on knowledge of the experimental setup and the transducer
characteristics for a meaningful analysis. Performance labels could be assigned to each sample
based on the presence or absence of an intent measure such as EMG. The brain activity in some
time window extending prior to EMG onset would also be included in the active labeled class (or
alternatively in a preparatory labeled class) in order to also label the brain activity that produces
the movement. Once these reference labels are available, there is greater opportunity to apply the
traditional evaluation criteria (as listed above). This principle can also be extended to more than
one class, by using more than one switch. Thus switches for left and right hand movement, foot
movement, tongue movement etc. could be used. The major drawback of this approach is its
reliance on actual movements when the purpose of a BCI is to provide an interface that can be
operated without these movements.
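The labeling scheme sketched above might look as follows; the window lengths, function name and class labels are illustrative assumptions rather than values from any cited study:

```python
def label_samples(n_samples, onsets, fs, pre_window_s=0.5, hold_s=0.25):
    """Assign per-sample reference labels from IRM onsets.

    onsets: list of (sample_index, class_label) pairs, e.g. EMG or
    switch onsets for different movements. Samples from pre_window_s
    before each onset to hold_s after it are labeled with the onset's
    class (capturing pre-movement brain activity); all other samples
    are labeled "NC".
    """
    labels = ["NC"] * n_samples
    pre = int(pre_window_s * fs)
    hold = int(hold_s * fs)
    for idx, cls in onsets:
        for i in range(max(0, idx - pre), min(n_samples, idx + hold)):
            labels[i] = cls
    return labels

fs = 8  # unrealistically low rate, chosen to keep the example readable
labels = label_samples(32, [(8, "left"), (24, "right")], fs)
```

With these settings, samples 4 through 9 are labeled "left" and samples 20 through 25 "right"; all remaining samples stay NC.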
4.2. Determining a Quality Reference in the Absence of an Observable Phenomenon
In cases where there are no observable phenomena, other strategies are required. Four main
approaches are detailed below.
4.2.1. Paced Test Environments
One can approximate self-paced test environments by using a paced and guided environment
with timing cues. This method can, for example, be used for the transducer training phase, where
the pacing and guidance information is used only as reference information and not used by the
BCI transducer to produce an output. The subject will attempt to activate the transducer
according to the guidance at the timing cues. If the subject has the ability to hold and release the
transducer as well (i.e., state-driven transducers), then cues can also be used to indicate desired
release times. These types of experimental protocols provide a gross estimate of when the
subjects intended to activate and release the transducer output and for what purpose, but this
approach has much more temporal uncertainty compared with the technique described in Section
4.1, as there is no manner in which to observe how accurately the subject responded to the timing
cues.
In order for the results to be generalized (for the study to have reasonable external validity),
care must be taken to distribute the timing cues in a configuration that resembles the timing of the
application targeted for BCI operation. An unanswered question regarding the validity of this
approach is how well brain activity related to monitoring and responding to cues relates to actual
self-paced activities since some studies have shown differences in particular types of brain activity
during self-paced and cued movements (Libet, Wright et al. 1982).
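One way to distribute timing cues so that the schedule resembles the target application is to draw inter-cue intervals from a distribution chosen to match that application's timing; the sketch below (our own, with a uniform distribution as a placeholder) illustrates the idea:

```python
import random

def cue_schedule(n_cues, mean_interval_s, jitter_s, seed=None):
    """Generate cue onset times with randomized inter-cue intervals.

    Intervals are drawn uniformly from
    [mean_interval_s - jitter_s, mean_interval_s + jitter_s]; a real
    study would fit this distribution to the timing of the targeted
    application.
    """
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(n_cues):
        t += rng.uniform(mean_interval_s - jitter_s,
                         mean_interval_s + jitter_s)
        times.append(t)
    return times

cues = cue_schedule(10, mean_interval_s=12.0, jitter_s=4.0, seed=1)
```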
4.2.2. True Self-Paced Test Environments
In true self-paced test environments there are no timing cues and the subjects determine when
to act on their own. As such, self report is the only form of reference information available, with
the subject reporting state errors and possibly timing/response information as well, or by reporting
which mental task was/will be performed.
There are several known issues with self report, including reporting bias and temporal accuracy
of the report. For example, as brain activity has been shown to precede a subject's conscious
awareness of an intention to act (Libet, Gleason et al. 1983), even the most meticulous self-report
does not necessarily provide an accurate reflection of the timing of the brain states. Although this
difficulty can be addressed to varying degrees, one of the most critical issues with self-report is
that the reporting itself is a mental activity which needs to be controlled for so that it does not
directly interfere with the experiment outcomes.
The closer the self-reporting activity is to error-reporting in the target application, the higher
the external validity of these studies. For example, if the subject is only reporting activation
errors in their self-report, then this parallels a keyboard user recognizing errors and pressing the
backspace or delete key. The more complicated the self-report, the less likely the study will
have reasonable external validity.
4.2.3. System Paced Test Environments
As an approximation to true self-paced environments, a system-paced environment can be used to
limit times where a subject can operate the BCI. Thus with this extra information, one can reduce
the amount of self-report needed. This approach may be considered a hybrid between
synchronized and true self-paced testing.
4.2.4. Computationally Intensive Event Detection
An option for producing reference data for self-paced experiments would utilize
computationally intensive or iterative analysis methods which produce accurate results, but cannot
be performed in real-time or on single trials. While these methods would be inappropriate as the
basis of a BCI, they might be useful to produce reference data from unlabeled records of brain
activity. Most other options for producing reference data during self-paced BCI experiments
involve limiting the freedom of the user to self-direct the action.
4.3. Examples
To illustrate the problems in determining the Estimated User Intent, several diagrams are
presented, starting from data collection.
4.3.1. Step 1: Running the Experiment
Depending on whether the experiment is online or offline, data are collected differently. For an
online study, the data are recorded directly. Recorded data include brain activity (e.g. EEG),
Intent-Related Measures (IRMs, such as EMG onset, finger switch onset, timing cues or subject
self-report), and the output of the transducer under test (TO); see Figure 11.
[Figure 11 diagram: recorded brain activity (EEG); IRM trace showing EMG or finger switch
onsets for acts A and B; actual transducer output switching among states N, A and B.]
Figure 11. Example signals from an online experiment (with event-driven transducer using testing protocol
with EMG or finger switch observable phenomena)
For an offline study, data are retrieved from prerecorded data, which include brain activity
(e.g. EEG) and Intent-Related Measures (IRMs, such as EMG onset, finger switch onset, timing
cues or subject self-report). The transducer outputs are generated from the prerecorded brain
activity for each transducer under test (TOi). The offline study is illustrated in Figure 12.
[Figure 12 diagram: pre-recorded brain activity (EEG); IRM trace showing EMG or finger switch
onsets for acts A and B; actual outputs of transducer A and transducer B switching among states
N, A and B with different timing.]
Figure 12. Example signals from an offline experiment (with multiple event-driven transducers using a testing
protocol with EMG or finger switch observable phenomena). In offline experiments, the transducer outputs
are produced after the brain activity and IRMs are recorded. This example emphasizes the differences in
transducer outputs (i.e., Transducer B has a longer signal processing delay than Transducer A) to illustrate the
possible differences in transducer output timing.
4.3.2. Step 2: Generating the Estimated User Intent Based on Knowledge about the IRMs and the Experimental Protocol
Depending on the type of experiment run (referring now to Chapter 4), the User's Intent
sequence can be estimated in different ways. For example, for experiments without self-report, the
UI can be estimated directly from the Intent-Related Measure (IRM), whether that is a switch
output or a cue, as shown in Figure 13.
[Figure 13 diagram: IRM trace (recorded EMG or finger switch onsets for acts A and B) and the
Estimated User's Intent derived from the IRM (a one-time, fixed reference) switching among
states N, A and B.]
Figure 13. Example of the estimated User's Intent for non-self-reported studies. This estimation utilizes
physiological and/or experimental knowledge about how the IRM relates to the User's Intent.
For experiments with self-reported errors, the User Intent can only be estimated relative to the
observed transducer output(s), as shown in Figure 14.
[Figure 14 diagram: IRM trace of self-reported errors (relating to negative intent) for acts A and
B; actual transducer output switching among states N, A and B; Estimated User's Intent derived
from the self-reported errors and the actual transducer output (a one-time, fixed reference).]
Figure 14. Example of estimated User's Intent for self-reported studies, noting that the actual transducer
output is required to estimate the reference in this case.
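The construction in Figure 14 can be sketched as follows: take the actual transducer output as the initial estimate and relabel the segments the subject reported as errors. The function name and the rule of relabeling reported errors as NC are our simplifying assumptions:

```python
def estimate_intent(transducer_output, reported_error_segments):
    """Estimate per-sample user intent from the actual transducer
    output and self-reported error segments.

    reported_error_segments: (start, end) sample ranges (end
    exclusive) that the subject reported as erroneous activations;
    those samples are relabeled "NC", i.e. the activation was not
    intended.
    """
    intent = list(transducer_output)
    for start, end in reported_error_segments:
        for i in range(start, end):
            intent[i] = "NC"
    return intent

output = ["NC", "A", "A", "NC", "B", "B", "NC", "A", "NC"]
# The subject reports that the activation at samples 4-5 was an error:
intent = estimate_intent(output, [(4, 6)])
```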
5. Transducer Output Performance Mark Up
The inherent uncertainty in the timing of the Estimated User Intent (EUI), introduced in
Chapter 4, precludes a direct comparison of the EUI with the output(s) of the transducers under
test. As such, after generating an EUI reference, we need a heuristic method of identifying and
marking correct and incorrect responses. Once we have those labels, we can calculate summary
statistics based on specific performance metrics (as discussed in Chapter 6).
In this chapter, we delineate heuristic methods, based on the EUI (our reference), for marking
up the transducer output.
5.1. Inherent Uncertainty in the Estimated User Intent
In Chapter 4 we presented several approaches for conducting self-paced experiments (or
studies that approximate self-paced operation). The main points we wish to stress are that there
are various approaches, and each approach corresponds to a different degree of temporal
uncertainty in the estimate of the timing of the user's intended output. Thus, when comparing the
Estimated User Intent reference signal to the actual transducer output, data interpretation
heuristics are required to manage this uncertainty. These heuristics control how data in the areas
of uncertainty are interpreted.
Let's look at an example:
[Figure 15 graphic: aligned time series of the true subject intent, the finger switch (monitor), the transducer output, and the expected response window, on a time axis from 0 to 1 second.]
Figure 15. Illustration of temporal uncertainty in self-paced data evaluation of an event-driven transducer.
In Figure 15, the first line represents the timing of the subject's intent to activate an (event-
driven) BCI transducer. This assumes that the subject actually moved his/her finger when trying
to drive the BCI transducer and that this movement was recorded by a finger switch (second line).
The actual transducer output is shown in the third line and often contains signal processing delays
that cause the actual output to trail the subject's intent (at least for novice users; more experienced
users may adapt to this delay). To accommodate the uncertainty in the monitoring equipment
(finger switch) and the signal processing delay in the transducer, heuristics are used to translate
the finger switch output into a window of time in which actual transducer output activations will
be related to the observed intent.
In the most general case, we do not necessarily have a finger switch, but rather some Intent-
Related Measure generated from real-time monitors (such as a finger switch), experimental
constraints and/or self-report.
5.2. General Performance Mark Up Algorithm
After much discussion, we have distilled the mark-up process into a general algorithm.
Researchers may implement different "subroutines" within this algorithm, but what we present
below is the agreed-upon approach to performance mark up.
Step 1 - Define an Expected Response Window (ERW): All performance mark-up algorithms
that we could envision are based around an Expected Response Window, or ERW, so defining the
ERW is the first step. The ERW defines the time period around state transitions in the Estimated
User Intent where the researcher expects a transducer output response to the intended state
change. For example, we propose using two parameters, ERWstart and ERWend, to define the
ERWs. These parameters can be positive or negative, defining times after or before each
Estimated User Intent transition. Figure 16 illustrates three different cases for an event-driven
paradigm. In the first case (Figure 16a), the window is located after the transition (both
parameters are positive). In the second case (Figure 16b), the window is before the transition
(both parameters are negative), and in the last case (Figure 16c), the window includes the
transition (one parameter (ERWend) is positive and the other (ERWstart) is negative). Negative
values would often be seen, for example, in the case of self-report, where the subject reports their
intent after the transducer output has been seen and interpreted, and in the case of actual
movements, where the brain activity is expected to start before the movement.
Figure 16. Illustration of ERW definition based on the Estimated User Intent (EUI) and the ERWstart and
ERWend parameters. Please note that in case (b), the values of ERWstart and ERWend are both negative, and in
case (c), only the value of ERWstart is negative.
As the temporal uncertainty increases, the widths of the ERWs are bound to increase. Also,
experimenters may use different ERW sizes for different task conditions, or within an experiment,
to quantify the timing characteristics of the transducers under test. An extension of this method
might be to use weighted ERWs, but that would add another factor to the analysis, which we want
to avoid for the present.
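As a concrete illustration, the ERW construction of Step 1 can be sketched as a small function. This is a minimal sketch, not the authors' implementation; the function name and the representation of EUI transitions as a list of times are our own assumptions.

```python
def expected_response_windows(transition_times, erw_start, erw_end):
    """Build an Expected Response Window around each EUI state transition.

    transition_times : times (in seconds) of Estimated User Intent transitions
    erw_start, erw_end : window offsets (s) relative to each transition; either
        may be negative, placing the window partly or wholly before the
        transition (cases (b) and (c) in Figure 16).
    Returns a list of (window_start, window_end) pairs.
    """
    if erw_end <= erw_start:
        raise ValueError("ERWend must be later than ERWstart")
    return [(t + erw_start, t + erw_end) for t in transition_times]
```

For example, with ERWstart = -0.5 s and ERWend = 1.0 s (case (c) in Figure 16), a transition at t = 10 s yields the window (9.5, 11.0).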
Step 2 - Run the Mark-Up Algorithm: Define and run a Transducer Output Mark-Up Algorithm
to label the correct and incorrect responses in the transducer output. This algorithm has the
following specification:
a) inputs:
a. Estimated User Intent (or the IRMs if they represent the EUI) (described in
Chapter 4)
b. ERW definition, e.g., ERWstart and ERWend parameter definitions (described above)
c. the output(s) of the transducer(s) under test (discussed in Chapter 2-4)
b) Internal Comparison Method (ICM)
a. the method to mark up samples within the ERW (detailed below with examples)
b. the method to mark up samples outside of the ERWs
c. the method to determine the response time
c) output:
a. performance mark up labels that indicate correct, incorrect or unknown responses
for each time point in the transducer output (examples below)
As such, these components (including detailed descriptions) are what we would expect a
researcher to report in a self-paced BCI evaluation. Note that there are many possible algorithms
for generating the performance mark up.
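The specification above can be read as a small algorithmic skeleton into which the three ICM methods are plugged as interchangeable subroutines. The sketch below is our illustration of that structure, not a prescribed implementation; sequences are per-sample state lists and ERWs are (start, end) sample-index pairs.

```python
def mark_up_transducer_output(eui, to, erws, inside_fn, outside_fn, rt_fn):
    """General performance mark-up skeleton (Step 2).

    eui, to    : per-sample Estimated User Intent and transducer output states
    erws       : list of (start_idx, end_idx) Expected Response Windows
    inside_fn  : (eui, to, window) -> {sample_index: label} inside the ERW
    outside_fn : (eui, to, index) -> label for samples outside all ERWs
    rt_fn      : (eui, to, window) -> response time, or None
    Returns (labels, response_times).
    """
    labels = [None] * len(to)
    covered = set()
    response_times = []
    for window in erws:
        start, end = window
        covered.update(range(start, end + 1))
        for i, label in inside_fn(eui, to, window).items():
            labels[i] = label                      # ICM method for inside the ERW
        rt = rt_fn(eui, to, window)                # ICM method for response time
        if rt is not None:
            response_times.append(rt)
    for i in range(len(to)):
        if i not in covered:
            labels[i] = outside_fn(eui, to, i)     # ICM method for outside the ERWs
    return labels, response_times
```

A researcher's choice of the three subroutines fully determines the mark-up, which is why we ask that each be reported explicitly.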
5.2.1. The Internal Comparison Method (ICM)
Conceptually, the first step of the Internal Comparison Method (ICM) is to produce an
Intermediate Estimated User Intent. We have attempted to illustrate this in Figure 17.
[Figure 17 graphic: the IRM (recorded EMG onset or finger switch onset), the Estimated User Intent derived from the IRM (one-time, fixed reference), the expected response windows (ERWs) with their RTstart and RTend bounds, and the resulting Intermediate Estimated User Intent (IEUI) sequence with fuzzy labels (Af, Bf, N).]
Figure 17. Illustration of an Intermediate Estimated User Intent sequence in relation to the Estimated User
Intent and the ERW definition. If there are multiple transducers under test in an offline study, different ERWs,
and thus different Intermediate Estimated User Intent sequences, may be required. The Intermediate Estimated
User Intent (IEUI) is a conceptual sequence that represents the expected response with "fuzzy" intent labels.
(Note: periods marked Af (or Bf) in the IEUI sequence represent windows of time where at least one A (or B)
response is expected in the transducer output.)
Most of us can relate to the first three time series in Figure 17. The last time series is just a
conceptual representation of the expected output values, where periods marked Af (or Bf) in the
Intermediate Estimated User Intent (IEUI) sequence represent windows of time where at least
one A (or B) response is expected in the transducer output. It will be up to the fuzzy comparison
block to interpret the transducer output (TO) in these time periods.
Note that a comparison algorithm does not necessarily generate the Intermediate Estimated User
Intent sequence explicitly; instead, it may infer it from the EUI and the ERW definition during
actual computation. Regardless, the concept is useful for illustrative purposes.
The second step of the ICM is to compare the Intermediate Estimated User Intent, IEUI,
(whether explicit or implied) to the actual transducer output(s) and generate performance labels
(see examples below). This is a "fuzzy" comparison in that there can be significant uncertainty in
the timing of the actual response relative to the user's intent. We realize that there are many
approaches to doing this fuzzy comparison, so we will try to illustrate the method through
examples.
5.2.1.1. Example 1
As a first example, let us assume a researcher has reported the following strategy for marking
up the output of event-driven transducers:
"Within the ERWs, if there are one or more correct responses in the ERW (TOstate =
IEUIstate), label the first as a correct response ("a hit") for the desired state. Set the
response time of the hit as the time between the EUI and the first correct response in the
ERW. Ignore other samples in the ERW; that is, treat them as if their corresponding intent
was "unknown" and do not include them in summary statistics. Otherwise, create an
incorrect response ("a miss") label centered in the middle of the ERW. For non-fuzzy
IEUI values (outside the ERWs), do a direct sample-by-sample comparison and label any
incorrect responses (i.e., not N) as 'spontaneous errors'."
This process is complete, as it defines all the aspects of our ICM specification:
1) The method to mark up samples within the ERW: if there was a correct response in
the ERW, the mark-up state at the time of the first correct response is 'correct response' (or
'hit') and all other samples are labeled 'unknown'. If there is no correct response, then
mark the sample at the center of the ERW as 'missed response' and, as in the other case,
label all other samples 'unknown'.
2) The method to determine the response time: for correct responses, the time from the
EUI to the first correct response. For no responses, the time from the EUI to the middle of
the ERW.
3) The method to mark up samples outside of the ERWs: non-N responses are marked as
'spontaneous errors' on a sample-by-sample basis.
This example also states that the output samples labeled 'unknown' were not included in the
summary statistics. This issue will be addressed in the next chapter.
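A sketch of how Example 1's rules could be implemented follows. This is our own rendering of the quoted strategy, working in sample indices; the function name, label strings and argument layout are assumptions.

```python
def mark_up_example1(eui, to, transitions, erw_start, erw_end):
    """Sketch of Example 1's mark-up rules for an event-driven transducer.

    eui, to     : per-sample intent / transducer-output states ('A', 'B', 'N')
    transitions : sample indices of EUI transitions into an IC state
    erw_start, erw_end : ERW bounds in samples, relative to each transition
    Returns (labels, response_times_in_samples).
    """
    labels = [None] * len(to)
    in_erw = set()
    rts = []
    for t in transitions:
        lo, hi = max(0, t + erw_start), min(len(to) - 1, t + erw_end)
        in_erw.update(range(lo, hi + 1))
        desired = eui[t]
        # first correct response inside the ERW, if any
        hit = next((i for i in range(lo, hi + 1) if to[i] == desired), None)
        for i in range(lo, hi + 1):
            labels[i] = 'unknown'          # ignored in summary statistics
        if hit is not None:
            labels[hit] = 'hit'
            rts.append(hit - t)            # time from EUI to first correct response
        else:
            labels[(lo + hi) // 2] = 'miss'
            rts.append((lo + hi) // 2 - t)
    # outside the ERWs: direct sample-by-sample comparison against N
    for i in range(len(to)):
        if i not in in_erw:
            labels[i] = 'correct' if to[i] == 'N' else 'spontaneous error'
    return labels, rts
```

Note how the three ICM components appear explicitly: the within-ERW rule, the response-time rule, and the outside-ERW rule.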
5.2.1.2. Example 2
As another example, a second researcher might use this strategy:
"For samples within the ERWs, label the first correct response in the ERW
(TOstate(t) = IEUIstate(t)) with a label of the form TOstate(t)-IEUIstate(t), e.g., "A-A" if the
desired state was 'A'. If there were no correct responses, label the first incorrect response
as TOstate(t)-IEUIstate(t). Set the label of all other points in the ERW as TOstate(t)-'N';
in other words, treat all other points as if the corresponding user's intended output was
'N'. For samples outside the ERWs, label all samples as TOstate(t)-IEUIstate(t), which
may lead to labels such as 'N-N' or 'A-N'."
This is incomplete, as it defines all the aspects of our ICM specification except the response time:
1) The method to mark up samples within the ERW: if there was a correct response in
the ERW, mark that sample as TOstate(t)-IEUIstate(t), e.g., "A-A" if the desired state was
'A'. Otherwise, if there was an incorrect response, say 'B' when 'A' was desired, label the
first incorrect response 'B-A'. Label all other points in the ERW as TOstate(t)-'N', e.g.,
'A-N', 'B-N' or 'N-N'.
2) The method to determine the response time: not defined!
3) The method to mark up samples outside of the ERWs: all samples are labeled
TOstate(t)-IEUIstate(t), which may lead to labels such as 'N-N' or 'A-N'.
5.2.1.3. Other ICM examples
There are many other ways one could do this fuzzy comparison. To illustrate the point
further, Figure 18 shows examples of the different types of TO mark-up labels produced by
various mark-up strategies. In this figure, the first mark-up strategy is to "find the first correct
response and ignore the others." The second strategy is to "find the first correct response and
treat the others as 'N', again outputting basic label pairs." The third strategy is similar to the first
except that it outputs semantic labels such as TP, FP, and so on.
[Figure 18 graphic: a sample transducer output sequence and its Intermediate Estimated User Intent sequence (expected response with "fuzzy" intent labels), marked up under three strategies: (A) find the first correct response and ignore its neighbors within the ERW, labeling them U for "unknown intent" and excluding them from the summary statistics; (B) find the first correct response and treat the neighbors as N, outputting basic label pairs; and (C) the same as (A) but with semantic labels TP (True Positive), TN (True Negative), FP (False Positive), FN (False Negative) and UN (unknown).]
Figure 18. Example illustrating different Internal Comparison Methods. Three different markup strategies
are illustrated.
To summarize, researchers have to select and justify
1. their IRM (as described in Chapter 4)
2. their method of estimating the EUI (as described in Chapter 4)
3. their ERW definition (i.e., their ERWstart and ERWend choices relative to their IRM,
possibly different for different transducers)
4. the three components of their Internal Comparison Method:
a. the method to mark up samples within the ERW
b. the method to determine the response time
c. the method to mark up samples outside of the ERWs
These are all that study designers or manuscript reviewers have to examine to estimate the
internal and external validity of the study. A full ICM example is provided in Appendix B.
Please note that the examples used herein depicted only event-driven transducers. The method
applies equally to state-driven transducers.
6. Metrics for Self-Paced Evaluation
In the previous two chapters we delineated various approaches to analyzing data collected
from the operation of a self-paced, discrete-output BCI transducer. In this chapter we discuss the
performance metrics that can be used to summarize the performance labels generated during the
self-paced analysis.
The selection of a metric depends on the research question(s) being asked. For instance, the
following questions may be inquiries into how well someone can operate the transducer:
- How does it respond to transitions to an IC state from an NC state? What type and
frequency of errors would the user expect?
- What type and frequency of errors does one see when the user is in an NC state?
- How quickly does the transducer respond? Are there delays? Are these delays consistent?
- How long can a person hold it in a particular state? (for state-driven transducers)
- How well can a person release it from a held state? (for state-driven transducers)
- How well can someone switch between IC states? (for multi-IC state-driven transducers)
In these questions the reader will see that our bias in this chapter is towards metrics that provide
usability information from the user's perspective. These examples also capture the two general
categories of metrics: error metrics and timing-characterization metrics. Before presenting the
error and timing metric methods, we define a few terms and symbols that underlie the metrics.
6.1. Definition of Terms and Symbols
6.1.1. Observation Time
The observation time is the duration of the experiment over which the transducer output
labeling is made. It is denoted as T and measured in seconds/minutes/hours.
6.1.2. NC Time Periods
The NC time periods are the periods between the IC states where the user is known
(assumed) to be in the NC state. These are denoted as TNCi.
6.1.3. Inter-FA Time Periods
The Inter-False-Activation time periods (or IFA Periods) are the periods within the NC time
periods and are denoted as TIFAi. Summarizing IFA period lengths characterizes the distribution of
false activations during NC time periods.
6.1.4. Response Time
The response time is the amount of time between when the user initiates an activation or
release and the corresponding response in the transducer output. For event-driven transducers
these are collectively denoted as RTAi. For state-driven transducers, where there are two basic
transition types, these are collectively denoted as RTAi and RTRi for activation RT and release RT,
respectively.
6.1.5. Hold Periods / Hold Time
The hold period is the period during which the transducer output enters and stays in a specific
IC state. The hold period ends when the output changes to another state (see the glitch definition
below). The lengths of these hold periods are referred to as hold times. They are denoted as THi
and exist only in state-driven paradigms. Hold times are used to determine whether the user is able
to perform a specific operation, such as mouse-like drag-and-drop operations. Examples of hold
periods are presented in Figure 19.
6.1.6. Glitch
We have introduced the term glitch to refer to a short state transition away from a held state that is
viewed by the user as a temporary deviation. From the user's perspective, the hold time is not
interrupted by a glitch, even though the transducer output has spontaneously changed. The hold is
considered interrupted only if the glitch duration is longer than the "maximum glitch duration",
defined as the maximum time that a glitch is allowed to last and denoted Tglitch. Any change of
state longer than this duration is considered the end of the hold time and not a glitch. It is up to the
experimenter to choose this value according to their beliefs about the user's perspective. A
threshold of zero means that any state change will mark the end of the hold time; the default
maximum glitch duration is zero, i.e., no glitches are allowed. A separate maximum glitch
duration can be set per IC state, for example a high maximum glitch duration for mouse-pointing
tasks and a low one for mouse-clicking tasks.
Figure 19. Basic hold-time period examples: one hold-time period (left), two hold-time periods (middle), three
hold-time periods (right). The middle and right examples can also be considered as only one hold-time period
if the glitches are shorter than the maximum glitch duration.
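The glitch-tolerant hold-time computation described above can be sketched as follows; the function and argument names are our own, and the glitch duration is expressed in samples rather than seconds.

```python
def hold_times(output, ic_state, t_glitch):
    """Hold-time period lengths (in samples) for one IC state.

    output   : per-sample transducer output states
    ic_state : the held state of interest, e.g. 'A'
    t_glitch : maximum glitch duration in samples; excursions away from
               ic_state lasting no longer than this do not end the hold.
    Returns a list of hold-period lengths.
    """
    periods = []
    i, n = 0, len(output)
    while i < n:
        if output[i] != ic_state:
            i += 1
            continue
        start = end = i
        j = i + 1
        while j < n:
            if output[j] == ic_state:
                end = j
                j += 1
            else:
                k = j                      # measure the excursion away
                while k < n and output[k] != ic_state:
                    k += 1
                if k < n and (k - j) <= t_glitch:
                    j = k                  # short glitch: hold continues
                else:
                    break                  # too long (or record ends): hold ends
        periods.append(end - start + 1)
        i = end + 1
    return periods
```

With Tglitch = 0 the middle example of Figure 19 yields two hold periods; with a tolerance covering the excursion, it collapses into one, and the glitch samples count toward the hold (matching the user's-perspective definition).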
6.2. Error Metric Definitions
In general, the authors agreed that the most rudimentary measure for summarizing error is
the confusion matrix, which is presented in the next subsection. Other error-related metrics
can be derived from that representation and are discussed afterward.
6.2.1. Confusion Matrix
The concept of a confusion matrix (CM) is used in communication and coding theory to
characterize a communication channel. In these contexts, the confusion matrix is a table that
summarizes the states sent (desired) versus the states received. A multi-state example is drawn in
Figure 20.
                 actual output
desired        A      B      N
   A          OAA    OAB    OAN
   B          OBA    OBB    OBN
   N          ONA    ONB    ONN
Figure 20. A confusion matrix for two IC states (with corresponding outputs A and B) and an NC state (with
output N). OXY represents the number of performance labels observed when state X was desired and Y was
actually output.
For transducers that produce only two states (an IC state and an NC state), the confusion matrix
reduces to the following diagrams. The table on the left is the general case and the one on the
right imposes 2-state statistical labels.

a)               actual output
   desired        A      N
      A          OAA    OAN
      N          ONA    ONN

b)               actual output
   desired        A          N
      A       TP (hit)   FN (miss)
      N          FP         TN
Figure 21. a) A confusion matrix for one IC state (with corresponding output A) and an NC state (with output
N). b) The same matrix with 2-state statistical labels: true positive (TP), false positive (FP), false negative
(FN) and true negative (TN), noting that TPs and FNs have also been referred to in some published works as
'hits' and 'misses'.
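Tallying a confusion matrix from per-sample (desired, actual) label pairs, such as the 'A-A' and 'A-N' pairs of Example 2, can be sketched in a few lines. This is our own illustration; the nested-dict return format is an assumption.

```python
from collections import Counter

def confusion_matrix(desired, actual, states=('A', 'B', 'N')):
    """Tally O[X][Y]: how often state X was desired while Y was output.

    desired, actual : equal-length per-sample state sequences
    Returns a nested dict, e.g. cm['A']['N'] gives O_AN.
    """
    counts = Counter(zip(desired, actual))
    return {d: {a: counts[(d, a)] for a in states} for d in states}
```

For the 2-state case of Figure 21, cm['A']['A'] would be the TP (hit) count and cm['N']['A'] the FP count.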
Some researchers have proposed synchronized transducer designs that generate an "unknown"
output when there is not enough internal evidence to select one of the other classes. This approach
can also be applied to self-paced designs. In terms of the CM, it can be captured in the modified
multi-state confusion matrix depicted in Figure 22.
                 actual output
desired        A      B      N    unknown
   A          OAA    OAB    OAN    OA?
   B          OBA    OBB    OBN    OB?
   N          ONA    ONB    ONN    ON?
Figure 22. A confusion matrix for two IC states (with corresponding outputs A and B), an NC state (with
output N) and an unknown output state.
Researchers working with state-driven transducers may also be interested in reporting how
well their transducers respond from all possible states. For this they could report multiple
confusion matrices, as shown in Figure 23, where each CM characterizes the responses given a
different previous state. (As above, these matrices could be extended to include an "unknown"
state output.) During our discussions, we came up with several semantic terms to describe the
possible transitions (and non-transitions) observed in the multiple-CM case. For example, entries
of the form OXYY, OXXX, OXXY, OXYX and OXYZ were referred to as Correct Transitions (CTXY),
Correct Maintained States (CMSX), Spontaneous Transitions (STXY), Missed Transitions (MTXY)
and Incorrect Transitions (ITXYZ), respectively.
CM when changing state from A:
                 actual output
desired        A       B       N
   A          OAAA    OAAB    OAAN
   B          OABA    OABB    OABN
   N          OANA    OANB    OANN

CM when changing state from B:
                 actual output
desired        A       B       N
   A          OBAA    OBAB    OBAN
   B          OBBA    OBBB    OBBN
   N          OBNA    OBNB    OBNN

CM when changing state from N:
                 actual output
desired        A       B       N
   A          ONAA    ONAB    ONAN
   B          ONBA    ONBB    ONBN
   N          ONNA    ONNB    ONNN
Figure 23. Multiple Confusion Matrices for two IC states and one NC state (and no “unknown” state).
To summarize, the confusion matrix provides a basic form for summarizing performance
labels. Depending on the mark-up algorithm, though, it may not capture all observed data.
Specifically, there may be samples labelled "ignored", "unknown" or "do not care" within the
ERWs. These samples need to be accounted for in the analysis.
6.2.2. Other metrics
Although we have agreed on summarizing observations in confusion matrices, the search for a
single, meaningful performance metric has been elusive. As previously discussed in Chapter 3, a
few general, higher-level metrics such as the HF difference or overall classification accuracy have
been used, but these are not meaningful on their own.
The primary issue is that we have a multi-dimensional error space: errors related to the IC and
NC states. Inherently there is an error trade-off when calibrating a transducer: IC errors are
decreased at the expense of false activations in NC.
For the 2-state output case, ROC curves have been useful representations of the two-
dimensional error space in other fields. However, ROC curves can only be generated offline, that
is, they are not appropriate for real-time (online) evaluation, and, realistically, only a narrow
portion of the curve (the area with low NC error percentages) is meaningful. As a default,
researchers have summarized each error dimension separately. The most common practice is the
reporting of true positive (TP) and false positive (FP) percentages (what some have called
"rates"). The percentage of TPs is measured as the number of successful IC-related activations
relative to the number of attempted IC states. The percentage of FPs is measured as the number of
false activations during NC relative to the total number of samples in the NC periods. One issue
with this practice is that the reported percentages are related to the transducer output rate and not
normalized to time, so it is difficult to interpret these results from a user-centric perspective. For
example, an FP percentage of 1% for a transducer that generates an output once every 1/10 of a
second corresponds to an expected FP every 10 seconds, which is not useful for most self-paced
applications. In contrast, if the transducer generated an output every second, the same percentage
would reflect an expected error every 100 seconds, which may suit more applications. So it would
be preferable to have these types of percentages normalized to time; thus it would be more useful
to express the expected FPs as a temporal rate relative to Σ TNCi.
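The time normalization argued for here amounts to scaling the sample-based FP percentage by the transducer's output rate. A minimal sketch (our own function, not from the text):

```python
def fp_per_minute(fp_percentage, output_rate_hz):
    """Convert a sample-based FP percentage into a time-normalized rate.

    fp_percentage  : false activations as a % of NC-period samples
    output_rate_hz : transducer outputs per second
    Returns the expected number of false positives per minute of NC time.
    """
    return (fp_percentage / 100.0) * output_rate_hz * 60.0
```

This reproduces the example in the text: 1% at 10 outputs/s gives 6 FPs per minute (one every 10 s), while 1% at 1 output/s gives 0.6 per minute (one every 100 s).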
To overcome the problem of multiple (conflicting) metrics (e.g., TP and FP), some single-
number metrics have been discussed in Chapter 3 and in (Schlögl, Huggins et al., accepted for
publication). Specifically, the overall accuracy, the error rate, the area under the ROC curve
(AUC), A-prime, d-prime, the F1 metric (i.e., the harmonic mean of precision and recall) and the
HF-diff, as well as Cohen's Kappa coefficient and the mutual information of the discrete output,
are described there. The overall accuracy (and error rate) is discouraged because it gives the states
with more samples a larger weight. The AUC, A-prime, d-prime, F1 metric and HF-diff are
defined only for two states. Thus, only the mutual information and Cohen's Kappa coefficient can
be used for systems with more than two states. Specifically, the Kappa coefficient weights each
state equally and measures the separability between the classes independently of their sample
sizes. This makes the mutual information and the Kappa coefficient possible options for
summarizing the performance in a single metric.
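As an illustration, Cohen's Kappa can be computed directly from a confusion matrix. This sketch is our own; it takes the matrix as a list of rows, with rows as desired states and columns as actual outputs.

```python
def cohens_kappa(cm):
    """Cohen's Kappa from a square confusion matrix (list of rows).

    Chance-corrected agreement between desired states (rows) and
    actual outputs (columns).
    """
    n = float(sum(sum(row) for row in cm))
    # observed agreement: fraction of samples on the diagonal
    p_obs = sum(cm[i][i] for i in range(len(cm))) / n
    # expected chance agreement from the row and column marginals
    p_exp = sum((sum(cm[i]) / n) * (sum(row[i] for row in cm) / n)
                for i in range(len(cm)))
    if p_exp == 1.0:
        return 1.0                 # degenerate case: all mass in one cell
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Kappa is 1 for perfect agreement and 0 for agreement at chance level, which is what makes it usable for self-paced data with a dominant NC state.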
The requirements for a metric depend on the research question being asked, which is generally
related to the usability of the BCI for a particular (target) application. For example, in an
application evaluating point-and-click control with a mouse, random mouse movements may be
considered less serious than random mouse clicks.
To be considered as a metric for self-paced BCI systems, a metric must fulfil the following
requirements:
- the metric must be derivable from basic performance information or other metrics;
- the metric must be useful for at least one application/task;
- the metric must include the NC state, because this state is specific to self-paced BCIs.
6.3. Temporal Characterization of Transducer Output
The confusion matrices and higher-level metrics presented above summarize the overall error
percentages seen in the experimental data. In this section, we discuss the temporal
characteristics of the transducer output.
6.3.1. Response-Time Characterization
Response times are most generally summarized in histograms such as those shown in Figure 24.
Alternatively, these curves can be statistically modelled and represented by statistics such as the
mean and variance. Given our comparison methodology, response times will always be bounded
by the ERWstart and ERWend of the Expected Response Window; thus, for initial RT
characterization, the ERW should be generously wide.
The response-time metrics (histogram or statistical model parameters) are used to determine
whether a BCI design is suited to a particular application or task. For example, a BCI design can
produce a response that is much too late for a particular application (see Figure 24). The response-
time histogram should be the most frequently used form. For specific uses, the histogram can be
refined, e.g., by using only the responses to a specific state Z (CTYZ, i.e., the OYZZ entries). This
can be used to determine whether one of the states is poorly suited to the application.
Figure 24. Examples of BCI-design response-time histograms that match (A) and don't match (B) the
application's RT requirements.
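Summarizing a list of response times into the mean, standard deviation and a simple fixed-width-bin histogram might look like the sketch below; the bin width and return format are our own choices.

```python
def rt_summary(rts, bin_width=0.1):
    """Summarize response times (s): mean, standard deviation, histogram.

    rts : non-empty list of response times in seconds
    Returns (mean, std, hist), where hist maps each bin's start time to the
    number of response times falling in [start, start + bin_width).
    """
    n = len(rts)
    mean = sum(rts) / n
    std = (sum((x - mean) ** 2 for x in rts) / n) ** 0.5
    hist = {}
    for x in rts:
        bin_start = round(int(x / bin_width) * bin_width, 10)
        hist[bin_start] = hist.get(bin_start, 0) + 1
    return mean, std, hist
```

Comparing such a histogram against the application's RT requirement band (as in Figure 24) shows at a glance whether the design responds quickly and consistently enough.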
6.3.2. NC Period Characterization
Like the RT metrics, summarizing NC period lengths in the form of a histogram or statistical
parameters is useful for characterizing the experimental conditions. From it, one can determine
how frequently a subject attempted control, which can be used to determine whether the results
are applicable to specific target applications. For instance, if the NC period lengths have a mean
of 5 seconds and a standard deviation of 1 second, then the reported results are not appropriate for
the control of a wheelchair, where the application's NC periods are generally much longer and
more widely distributed in time.
6.3.3. Inter-FA Period Characterization
Summarizing Inter-FA period lengths is useful for determining how false activations are
distributed throughout the NC periods. If this metric, which can be expressed as a histogram or
statistical parameters as above, is uniformly distributed, then one can assume that the false
activations are random. If the distribution is biased towards zero, then the false activations tend to
appear in patches.
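Given a per-sample mark-up of one NC period, the inter-false-activation period lengths are simply the index gaps between successive false-activation labels. A minimal sketch (the label string is our assumption):

```python
def inter_fa_periods(labels, fa_label='spontaneous error'):
    """Lengths (in samples) of the gaps between successive false
    activations within one NC period's mark-up labels."""
    fa_indices = [i for i, lab in enumerate(labels) if lab == fa_label]
    return [b - a for a, b in zip(fa_indices, fa_indices[1:])]
```

A histogram of these gaps, pooled over all NC periods, gives the Inter-FA characterization described above.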
6.3.4. Hold-Time Period Characterization
As above, the hold-time period lengths can be characterized as a histogram or statistical
parameters. Again, this metric allows one to rapidly identify whether the tested BCI design is
appropriate for a specific application.
6.4. Summary
In this chapter, we defined performance measures based on error and temporal response. To
summarize the chapter, we propose a metric dependence tree (see Figure 25). This tree allows
the researcher to determine which metrics are needed in order to compute the metric of interest.
Figure 25. Metric dependence tree.
6.4.1. Returning to the Research Questions
Selecting the metric that suits one's needs depends on the target application and the research
questions being asked. What follows are examples of research questions paired with possible
metrics.
Research Question: How does it respond to transitions to an IC state from an NC state? What type of errors would the user expect?
Metric: CM, plus a single or multiple error metric

Research Question: What type and frequency of errors does one see when the user is in an NC state?
Metric: CM, plus a single or multiple error metric; histogram of error-free period lengths; histogram of times between false activations

Research Question: How quickly does the transducer respond? Are there delays? Are these delays consistent?
Metric: histogram of response times

Research Question: How well can a user hold (sustain) a state? (for state-driven transducers)
Metric: histogram of hold times
[Figure 25 graphic: the metric dependence tree, branching from the performance labels into error measures (CM; % TP/TN/FP/FN; % CT/IT/CMS/MT/ST; HF difference; ROC curve; kappa; other single and multiple metrics) and timing measures (histograms / statistical models of RT, NC period, IFA period and hold time).]
7. Reporting Practices
In this chapter, some recommendations about the information that should be reported in the
self-paced BCI literature are presented. If accepted and implemented by the BCI community, these
recommendations can lead to better reproducibility of future work. They can also make it possible
to compare different designs. We divide these guidelines into five categories:
1. Ideal needs for the target application
2. Transducer characteristics
3. Transducer output mark-up method
4. Basic performance information
5. Application-specific high-level metrics
We now address each category in more detail:
7.1. Ideal Needs for the Target Application
It is recommended that researchers report the ideal needs for their target application(s).
Once these ideal needs are specified, it becomes possible to compare the results with the ideals
and to analyze how well the goals are achieved. The ideal needs can be further divided into the
following sub-categories:
7.1.1. Specifying the Target Needs:
Specify acceptable error characteristics for the target application (this includes the target
population, target activity and target operating environment). A specific error rate or response
time may be acceptable for one application but completely unacceptable for another. It is
important that the target application be expressed clearly.
7.1.2. Acceptable Error Rates:
Once the target application is specified, acceptable error rates should be specified. While
zero error rates are ideally desired, that is usually not achievable in practice. Thus researchers
should specifically state what level of error is acceptable in their design. This level may differ
from one design or application to another. We do not know of any published work that has
specified this information.
7.1.3. Acceptable Timing Characteristics:
Many factors introduce delays into the response time of a BCI transducer (such as filter
delays, post-processing, etc.). As with determining an acceptable error rate, it is recommended
that authors determine the acceptable response time for their particular application.
7.2. Transducer's Characteristics
Here are some recommendations for reporting the transducer's characteristics:
7.2.1. Transducer's Output Rate
Specifying the transducer's output rate is important for determining the applicability of a
proposed design. A false positive rate of 1% for a BCI transducer with an output rate of 10
samples per second means an average of one error every 10 seconds. For a transducer with an
output rate of 1 sample per second, the same false positive rate means an average of one error
every 100 seconds. Clearly there is a significant difference between these two designs (assuming
they have the same hit detection rate).
Ideally, we would prefer to see performance results normalized to time, so that the results
do not depend on the transducer's output rate.
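The arithmetic above reduces to a one-line normalization. As a minimal sketch (the function name and units are ours, not part of any standard BCI toolkit), the mean time between false positives follows directly from the per-decision false-positive rate and the output rate:

```python
def mean_time_between_false_positives(fp_rate, output_rate_hz):
    """Mean seconds between false positives for a transducer that emits
    output_rate_hz decisions per second, each with a false-positive
    probability of fp_rate (e.g. 0.01 for 1%)."""
    errors_per_second = fp_rate * output_rate_hz
    return 1.0 / errors_per_second

# 1% false positives at 10 decisions/s: one error every 10 s on average.
print(mean_time_between_false_positives(0.01, 10))
# 1% false positives at 1 decision/s: one error every 100 s on average.
print(mean_time_between_false_positives(0.01, 1))
```

Reporting time-normalized figures of this kind makes designs with different output rates directly comparable.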
7.2.2. Temporal Characteristics:
It is necessary to report the temporal characteristics of a switch design, such as Response Time
and Refractory Period, so that other researchers can test these designs accurately.
7.2.3. Offline vs. Online Analysis:
Whether the analysis is carried out offline or online should be clearly stated.
In particular, it is recommended that the performance of the system during periods of bad
data I29 (for example, when anomalies are present) is reported, regardless of whether such periods
are included in the analysis.
7.2.4. Robustness of the Algorithm
It is recommended that whether or not the performance of a transducer is considered in the
presence of artifacts is also reported. Especially for online analysis, it is important to know how
robust a particular transducer is to the presence of artifact or how it can handle artifacts.
7.3. Transducer Output Markup Method
The procedure used for marking up the transducer's output should be clearly stated. This is
crucial for reproducing the evaluation of any transducer.
7.4. Basic Performance Information
It is recommended that all the basic performance metrics (for example, the elements of the
confusion matrix) are reported clearly.
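For a two-class self-paced case, a minimal illustration of reporting the confusion-matrix elements and the basic rates derived from them might look like the following (the counts are hypothetical, chosen only to show the layout):

```python
# Hypothetical per-decision counts for a self-paced (IC vs. NC) transducer.
tp, fn = 45, 5    # intended IC decisions: detected / missed
fp, tn = 20, 930  # intended NC decisions: false activations / correct inactivity

tpr = tp / (tp + fn)   # hit (true positive) rate
fpr = fp / (fp + tn)   # false positive rate

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Reporting all four counts (not just an accuracy figure) lets readers recompute any derived metric for the heavily skewed class distributions typical of self-paced operation.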
7.5. Application-Specific High-Level Metrics
As stated in Chapter 6, depending on the target application, the basic performance metrics may
be combined to generate high-level metrics. It is recommended that all high-level performance
metrics are reported clearly, along with the rationale for using them.
43
Appendix A Glossary (entries list Term, Category, Definition, and Comments)
Activation Activate
Transducer Output Discrete transducer output change from NC to ICi, where i = A, B, C, ... (see Discrete Transducer Output)
Used as a shortcut term.
Activation Response Time Measured Timing Characteristics
The time it takes for the transducer output to reflect the intended onset of control. This can be represented as a mean time or another statistical summary such as a histogram
Asynchronous BCI design classification
Generally synonymous with self-paced operation, although usage is inconsistent. The use of this term is not recommended.
BCI Acronym for Brain Computer Interface
BCI System BCI design classification
A system of components that translates brain activity into useful communication or control signals
BCI Transducer BCI design classification
The primary component of a BCI system which translates brain activity into basic control signals.
Brain Computer Interface A technology that translates activity measured directly from the brain into useful communication or control. Also known as Brain Interface, Direct Brain Interface and Brain Machine Interface.
Brain Activity
Continuous Output Transducer Output Continuous (ordered) values that correspond to the user’s brain state. For example, changes in the user’s intentionally controlled (IC) brain state would map onto changes in the continuous transducer output. An NC brain state would ideally produce no change in the transducer output.
Deactivation Deactivate
Transducer Output Synonymous with Release; see Release. The preferred term is Release (SM)
Deactivation Response Time
Measured Timing Characteristics
The time it takes for the transducer output to reflect the ceasing of control. Only reported for state-driven discrete transducers
Discrete Output Transducer Output Discrete state (non-ordered) values that correspond to the user’s brain state. For example, brain states ICA, ICB, ICC, and NC would ideally produce transducer outputs A, B, C, and N.
Estimated User Intent An estimation of the user’s intended transducer output in terms of timing and state.
Event-Driven (Discrete) Transducer
Transducer Design A transducer that is driven by a transient event in the brain state, e.g., a movement related potential. In this case, the transducer output can be considered “instantaneous”, on for a brief period of time then off. There is no ability to hold and release the output. See State-Driven Discrete Transducer.
Expected Response Window
Analysis A time period used to compensate for the unknown timing of an intended output.
Glitch Transducer Output A closely spaced pairing of a spontaneous transition and a spontaneous correction, where the duration between the transition and the correction does not exceed a reported maximum glitch duration. From a user's perspective, the hold time is not interrupted by a glitch, even though the transducer output changes spontaneously; it is considered interrupted only if the glitch lasts longer than the maximum glitch duration.
Hold Transducer Output An ability to maintain a transducer output in a particular state.
Hold Time Measured Timing Characteristics
The time the transducer output can be held in a particular state. Only reported for state-driven discrete transducers
Idle User Brain State A term used to describe No Control. Use of this term is not recommended; the preferred term is No Control (NC).
Idle Support User Brain State A term used to describe NC Support. Use of this term is not recommended; the preferred term is NC Support.
IC User Brain State Acronym for Intentional Control
44
Intentional Control User Brain State User brain state during which the user is intentionally trying to perform some action using the BCI Transducer. Abbreviated ICi where i = A, B, C, ...
Jitter The undesired toggling between states that occurs during activation or deactivation when a feature vector repeatedly crosses the decision boundary(ies).
Jitter Reduction Method Transducer Design A method used to reduce Jitter (see Jitter, Hysteresis, and Debounce).
Maximum glitch duration Transducer Output The maximum time that a glitch is allowed to last. Any change of state longer than this threshold will be considered the end of the hold time and not a glitch. It is up to the experimenter to choose this value according to their assessment of the user's perspective. A threshold of zero means that any state change will mark the end of the hold time.
NC User Brain State Acronym for No Control
NC Support Transducer Output The ability of a BCI transducer to recognize a user’s NC state and generate an inactive (N) state output
No Control User Brain State User brain state during which the user is not trying to perform some action using the BCI Transducer. The user may be monitoring, resting, thinking but not engaged in control through the BCI transducer. Abbreviated NC.
Refractory Period
Measured Timing Characteristics
The minimum time after a discrete transducer has been activated before it is ready to be reactivated.
Response Time Measured Timing Characteristics
See Activation Response Time
Release Transducer Output Discrete transducer output change from ICi to NC, where i = A, B, C, ... (see Discrete Transducer Output)
Sleep Mechanism Transducer Design A mechanism by which a BCI system is placed into a restricted response mode to avoid false responses during long periods of No Control.
Spatial Reference Output Transducer Output Output that refers to a particular location on a screen or keyboard.
State-Driven (Discrete) Transducer
Transducer Design A discrete transducer that is driven by a continuously controlled brain state, e.g., alpha power. In this case, the transducer output can be turned on, held on for a period of time, then released. See Event-Driven Discrete Transducer.
Self-Paced Operation Operating Paradigm
Synchronized Operation Operating Paradigm Usage is inconsistent/mixed – no single specific definition.
Synchronous Operating Paradigm Generally describes BCI systems that are operated in a periodic, system-driven manner. Usage is inconsistent, and the use of this term is not recommended; synchronized or system-paced are the preferred terms.
System-Paced Operation Operating Paradigm
Transducer BCI design classification
See BCI Transducer
Unknown State Transducer Output A special transducer output state used (in some transducer designs) to represent when there is not enough confidence in the classifier to choose one of the other IC states.
Jitter Reduction Terms
Debounce Jitter Reduction Method A mechanism to reduce output jitter (see Jitter) that locks the transducer output into a state for a fixed time period (the Debounce Time) after activation or deactivation.
Debounce Time Jitter Reduction Method
The length of time the debounce mechanism is active. May differ for activation and deactivation.
Dwell Time Jitter Reduction Method Duration of the settling time for the detection of an activation. The use of this term is not recommended; "activation settling time" is the recommended term.
45
Hysteresis Jitter Reduction Method
A mechanism to reduce output jitter (see Jitter) that uses a complex decision boundary.
Refractory Period
Jitter Reduction Method
1. The minimum time after a discrete transducer has been activated before it is ready to be reactivated. 2. The duration of the settling time for the detection of a deactivation; not a refractory period in the strict sense of the first definition. The second usage is not recommended and should be replaced by "deactivation settling time".
Settling Jitter Reduction Method
A mechanism to reduce output jitter where an activation or deactivation does not occur until a decision boundary has been crossed for a specific period of time (the Transducer Setup Time).
Transducer Setup Time Jitter Reduction Method
Minimal amount of time the feature vector must be "held" in a specific state in order to produce a transition on the transducer output. The setup time may differ between states, e.g., for NC vs. IC states.
Correct Maintained State Transition label Transducer output state label used when the transducer output is identical to the intended output and there is no transition on either output.
Correct Transition Transition label Transducer output transition label when the transducer output follows the user intention.
Incorrect Transition Transition label A transducer output transition to a state other than the intended output state.
Missed Transition Transition label A desired transition on the intended output that is not recognized by the transducer and therefore does not change the transducer output.
Spontaneous Correction Transition label A transition that occurs after a missed, spontaneous, or incorrect transition and corrects the error. After the correction, the transducer output state is identical to the intended output.
Spontaneous Transition Transition label A transition that occurs on the transducer output when no transition was desired by the user.
47
Appendix B Transition-based Performance Markup Algorithm I30 This section proposes a Transducer Output Performance Markup Algorithm for a
sample-based BCI transducer, following the requirements defined in Chapter 5. It allows
some of the metrics from Chapter 6 to be computed. This section should be considered an example
rather than a reference implementation.
We saw in Chapter 6.2.2 that metrics should be independent of the transducer output rate.
Therefore, the current performance markup algorithm is based on the transitions between two
states of the User Intent Estimate. The algorithm assumes that the researcher already has a User
Intent Estimate signal, coded as a sequence of samples. As this performance markup algorithm
was not designed for a specific application, the Expected Response Window parameters ERWstart and
ERWend are left undefined. Thus, Chapter 5.2's step one (defining the ERW) is not specified here.
For Chapter 5.2's step two (define and run the Transducer Output Markup Algorithm), we
define a complete ICM in the sense of Chapter 5.2.1, because it defines the markup of samples
within the ERW, allows the response time to be computed, and defines the markup of samples outside
the ERW. Note that the ICM described here produces transducer output performance markup
labels without using an Intermediate Estimate User Intent.
B.1 Types of transitions I31
Transitions from one EUI sample to the next can be characterized by five terms: Correct
Transition (CT), Incorrect Transition (IT), Missed Transition (MT), Spontaneous Transition (ST),
and Spontaneous Correction (SC). Note that we do not consider here the cases where no transition
occurs between two EUI samples. Each transition is characterized by three indices I32, e.g.
ITijk: the transducer output state before the transition, the desired EUI state after the transition,
and the actual transducer output state after the transition. Depending on the transition, duplicate
indices can be removed. For example, in a Correct Transition, the transducer output state after the
transition is the same as the desired state, so the last index (k) can be removed. The goal of the
Internal Comparison Method is to produce a list of recognized transitions.
A Correct Transition (CT) occurs when the transducer output state changes to the new EUI
state within the ERW. An Incorrect Transition (IT) occurs when the transducer output state
changes, within the ERW, to a state other than the new EUI state. In both cases, the response time
is the duration between the EUI transition time and the transducer output transition time. A Missed
Transition (MT) occurs when the transducer output does not change during the ERW.
A Spontaneous Transition (ST) occurs when the transducer produces a transition on its output
independently of the user's intention. To identify this kind of transition, the concept of an
Expected Cause Window is introduced (see Figure 26). This is the window during which a
transition on the Estimate User Intent should have occurred for the transducer output transition
to be considered a Correct or Incorrect Transition. If no EUI transition occurs in this window, the
transducer output transition must be considered a Spontaneous Transition.
48
Figure 26. Expected Response Window (A) and Expected Cause Window (B). The small rounded arrows show
which transition each window refers to.
The concept of Spontaneous Correction (SC) had to be introduced to avoid counting errors
when the transducer spontaneously corrects an Incorrect, Missed, or Spontaneous Transition.
Otherwise, such a correction would itself be counted as a Spontaneous Transition. Examples of the
five types of transitions are given in Figure 27.
Figure 27. Examples of Correct Transition (A), Incorrect Transition (B), Missed Transition (C), Spontaneous
Transition (D), and Spontaneous Correction (E). The small rounded arrow shows which transition the decision
window refers to.
B.2 Internal Comparison Method
Only Correct and Incorrect Transitions are marked up inside the Expected Response Window.
The transducer response time is computed only for these two types of transition. Missed
Transitions, Spontaneous Transitions and Spontaneous Corrections are marked up only outside the
Expected Response Window. For these transitions the transducer response time cannot be
computed. The comparison method can thus be summarized by the following pseudo-code:
49
for each transition of the Estimate User Intent occurring at time tEUI
    if the transducer output contains a transition during the ERW [tEUI+ERWstart .. tEUI+ERWend]
        tTO = transducer output transition time
        if TransducerOutput(tTO+) = EstimateUserIntent(tTO+)
            label a CT at the transducer output transition time tTO,
            with a transducer response time = tTO - tEUI
        else
            label an IT at the transducer output transition time tTO,
            with a transducer response time = tTO - tEUI
        end if
    else
        label an MT at the intended output transition time tEUI
    end if
end for

for each transition of the transducer output occurring at time tTO
    if the intended output contains no transition during the ERW [tTO-ERWend .. tTO-ERWstart]
        if TransducerOutput(tTO+) = EstimateUserIntent(tTO+)
            label an SC at the transducer output transition time tTO
        else
            label an ST at the transducer output transition time tTO
        end if
    end if
end for
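As an illustration only, the pseudo-code above can be sketched in Python as follows. The data structures and function names are ours and are not taken from the BioSig implementation; state sequences are per-decision-sample lists, and the ERW bounds are given in samples:

```python
def transitions(seq):
    """Return the sample indices at which the state sequence changes value."""
    return [i for i in range(1, len(seq)) if seq[i] != seq[i - 1]]

def mark_up(eui, out, erw_start, erw_end):
    """Label transitions as CT/IT/MT/ST/SC following the pseudo-code above.

    eui, out : per-sample state sequences (Estimate User Intent and
               transducer output); the ERW is inclusive, in samples.
    Returns a list of (label, time, response_time) tuples.
    """
    labels = []
    eui_t, out_t = transitions(eui), transitions(out)
    for t in eui_t:
        in_window = [to for to in out_t if t + erw_start <= to <= t + erw_end]
        if in_window:
            to = in_window[0]  # first output transition inside the ERW
            kind = 'CT' if out[to] == eui[to] else 'IT'
            labels.append((kind, to, to - t))
        else:
            labels.append(('MT', t, None))
    for to in out_t:
        # No EUI transition inside the Expected Cause Window?
        if not any(to - erw_end <= t <= to - erw_start for t in eui_t):
            kind = 'SC' if out[to] == eui[to] else 'ST'
            labels.append((kind, to, None))
    return labels

# Hypothetical example: activation (N -> A) and release (A -> N), each
# detected one sample late, with ERW = [0, 2] samples.
eui = ['N', 'N', 'A', 'A', 'A', 'N', 'N', 'N']
out = ['N', 'N', 'N', 'A', 'A', 'A', 'N', 'N']
print(mark_up(eui, out, 0, 2))
# -> [('CT', 3, 1), ('CT', 6, 1)]
```

Both EUI transitions fall inside an ERW that contains a matching output transition, so each is labelled a CT with a one-sample response time, and no output transition is left unexplained.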
This performance markup algorithm has been implemented in the BioSig open-source library
(BioSig). An example of this processing is shown in Figure 28.
Figure 28. Example of performance markup (state-driven data labelling with RTmin = 1 and RTmax = 3; input
signal and transducer output shown against time in decision samples). Transitions are shown with a red vertical
bar; spontaneous transitions are shown with a green bar. The Expected Response Windows are shown as light
gray areas.
B.3 Metrics for self-paced evaluation
We present here the computation of the hold-time periods as defined in Chapter 6.1.5,
including the notion of a glitch (Section 6.1.6). A hold-time period starts:
- after a CT, if the current/new transducer output state is an IC state and the previous one is NC; or
- after an SC, if the current/new transducer output state is an IC state and the previous one is NC; or
- after a CT, if the current/new transducer output state is an IC state and the previous one is also an IC state (in this case, the existing hold-time period is stopped and a new one is created).
A hold-time period stops:
- after an ST, if this ST is not followed by an SC within the next Tglitch seconds; or
- after a CT (into NC); or
- after any IT, CT, SC, or ST transition following an MT.
Note that we consider here a single maximum glitch duration Tglitch. The above rules are
summarized in the state machine of Figure 29, and hold-time examples are given in
Figure 30.
Figure 29. Hold-time state machine. Plain text labels are preconditions; rectangles are actions; arrows are
state-machine transitions; circles are states. HTP = Hold-Time Period; Tglitch = maximum glitch duration;
"!CT" means "no CT transition"; "i" is the transition index; "ti" is the time of the i-th transition; multi-line
conditions are ORed preconditions.
Figure 30. Case studies for stopping the Hold-Time Period.
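A simplified sketch of the start/stop rules above, assuming the markup algorithm of Appendix B has already produced a time-ordered list of (label, time, new output state) transitions. For brevity it implements only the CT start/stop rules and the glitch rule, omitting the SC-start and MT-related cases; all names are ours:

```python
def hold_time_periods(events, t_glitch):
    """events: time-ordered list of (label, time, new_output_state) tuples,
    with 'NC' denoting the no-control state.
    Returns a list of (start, stop) hold-time periods."""
    periods, start = [], None
    for i, (label, t, state) in enumerate(events):
        if label == 'CT' and state != 'NC':
            if start is not None:              # IC -> IC: stop and restart
                periods.append((start, t))
            start = t
        elif label == 'CT' and state == 'NC' and start is not None:
            periods.append((start, t))         # CT into NC ends the hold
            start = None
        elif label == 'ST' and start is not None:
            nxt = events[i + 1] if i + 1 < len(events) else None
            glitch = (nxt is not None and nxt[0] == 'SC'
                      and nxt[1] - t < t_glitch)
            if not glitch:                     # uncorrected ST ends the hold
                periods.append((start, t))
                start = None
    return periods

# Hypothetical run: activation at t=2, a spontaneous drop at t=5 that is
# corrected at t=6, and a correct release at t=10.
ev = [('CT', 2, 'A'), ('ST', 5, 'NC'), ('SC', 6, 'A'), ('CT', 10, 'NC')]
print(hold_time_periods(ev, t_glitch=2))   # drop treated as a glitch
print(hold_time_periods(ev, t_glitch=1))   # drop ends the hold at t=5
```

With Tglitch = 2 the ST/SC pair is a glitch and the hold runs from 2 to 10; with Tglitch = 1 the same drop terminates the hold at t = 5.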
51
Appendix C References BioSig (2003-2005). BIOSIG - an open source software library for biomedical signal processing.
Blankertz, B., K.-R. Müller, et al. (2004). "The BCI Competition 2003: Progress and Perspectives in Detection and
Discrimination of EEG Single Trials." IEEE Transactions on Biomedical Engineering 51(6): 1044-1051.
Blankertz, B., G. Schalk, et al. (2005). "BCI competition III." from
http://ida.first.fraunhofer.de/projects/bci/competition_iii/.
Blankertz, B., T. M. Vaughan, et al. (2003). "BCI competition 2003." from
http://ida.first.fraunhofer.de/projects/bci/competition/.
Bortz, H. and G. A. Lienert (1998). Kurzgefasste Statistik für die klassische Forschung. Ubereinstimmungsmasze fuer
subjektive Merkmalsurteile. Springer. Berlin Heidelberg: 265-270.
Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement 20:
37-46.
Danker-Hopfe, H., D. Kunz, et al. (2004). "Interrater reliability between scorers from eight European sleep
laboratories in subjects with different sleep disorders." Journal of Sleep Research 13(1): 63-69.
Gao, Y., M. J. Black, et al. (2003). A quantitative comparison of linear and non-linear models of motor cortical
activity for the encoding and decoding of arm motions. International IEEE EMBS Conference on Neural
Engineering.
Huggins, J. E., S. P. Levine, et al. (1999). "Detection of Event-Related Potentials for Development of a Direct Brain
Interface." Journal of Clinical Neurophysiology 16(5): 448-455.
Kraemer, H. C. (1982). Kappa coefficient. Encyclopedia of Statistical Sciences. K. S. a. J. N. L. (Eds.). New York,
John Wiley & Sons.
Kronegg, J. and T. Pun (2005). Measuring the performance of brain-computer interfaces using the information transfer
rate. BCI 2005, Brain-Computer Interface Technology: Third International Meeting, Rensselaerville, NY,
USA.
Kronegg, J., S. Voloshynovskiy, et al. (2005). Analysis of bit-rate definitions for Brain-Computer Interfaces. Int.
Conf. on Human-computer Interaction (HCI'05), Las Vegas, Nevada, USA, CSREA Press.
Kübler, A., F. Nijboer, et al. (2005). "Patients with ALS can use sensorimotor rhythms to operate a brain-computer
interface." Neurology 64(10): 1775-1777.
Lal, T., M. Schröder, et al. (2005). A Brain Computer Interface with Online Feedback based on
Magnetoencephalography. International Conference on Machine Learning.
Libet, B., C. Gleason, et al. (1983). "Time of conscious intention to act in relation to onset of cerebral activity
(readiness-potential). The unconscious initiation of a freely voluntary act." Brain 106(3): 623-642.
Libet, B., E. J. Wright, et al. (1982). "Readiness-potentials preceding unrestricted 'spontaneous' vs. pre-planned
voluntary acts." Electroencephalography and clinical Neurophysiology 54(3): 322-335.
Mason, S. G., A. Bashashati, et al. (2005). "A Comprehensive Survey of Brain Interface Technology Designs." Annals
of Biomedical Engineering (submitted for publication).
Mason, S. G. and G. E. Birch (2005). Temporal Control Paradigms for Direct Brain Interfaces – Rethinking the
Definition of Asynchronous and Synchronous. HCI International, Las Vegas, Nevada, USA.
Nykopp, T. (2001). Statistical Modelling Issues for The Adaptive Brain Interface. Department of Electrical and
Communications Engineering. Helsinki, Helsinki University of Technology. M.Sc.
Pierce, J. R. (1980). An Introduction to Information Theory: Symbols, Signals and Noise, Dover Publications.
Schlögl, A., P. Anderer, et al. (1999a). Artifact processing of the sleep EEG in the "SIESTA"-project. EMBEC,
Vienna, Austria.
Schlögl, A., P. Anderer, et al. (1999b). Artefact detection in sleep EEG by the use of Kalman filtering. EMBEC,
Vienna, Austria.
52
Schlögl, A., J. E. Huggins, et al. (accepted for publication). Evaluation criteria in BCI research. Towards Brain-
Computer Interfacing. G. Dornhege, J. d. R. Millán, T. Hinterberger, D. J. McFarland and K.-R. Müller.
Cambridge, MA, The MIT Press.
Schlögl, A., C. Keinrath, et al. (2003). Information transfer of an EEG-based Brain-computer interface. International
IEEE EMBS Conference on Neural Engineering, Capri, Italy.
Schlögl, A., F. Y. Lee, et al. (2005). "Characterization of Four-Class Motor Imagery EEG Data for the BCI-
Competition 2005." Journal of Neural Engineering 2(4): 14-22.
Schlögl, A., C. Neuper, et al. (2002). "Estimating the Mutual Information of an EEG-based Brain-Computer
Interface." Biomedizinische Technik 47(1-2): 3-8.
Townsend, G., B. Graimann, et al. (2004). "Continuous EEG classification during motor imagery-simulation of an
asynchronous BCI." IEEE Transactions on Neural Systems and Rehabilitation Engineering 12(2): 258-265.
Wolpaw, J. R., N. Birbaumer, et al. (2000). "Brain-Computer Interface Technology: A Review of the First
International Meeting." IEEE Transactions on Rehabilitation Engineering 8(2): 164-173.
Wolpaw, J. R., H. Ramoser, et al. (1998). "EEG-Based Communication: Improved Accuracy by Response
Verification." IEEE Transactions on Rehabilitation Engineering 6(3): 326-333.
Wu, W., M. J. Black, et al. (2004). "Modeling and decoding motor cortical activity using a switching Kalman filter."
IEEE Transactions on Biomedical Engineering 51(6): 933-942.
Wu, W., Y. Gao, et al. (2006). "Bayesian population decoding of motor cortical activity using a Kalman filter." Neural
Computation 18(1): 80-118.
53
Appendix D Outstanding Issues
1 Issue: existing examples? Could we add references here?
2 Issue: existing examples? Does this apply to anyone else's work other than Millan's?
3 Issue: with terminology. The line between self-paced and system-paced BCI transducers can
become blurred as the rate of pacing of a system-paced BCI increases. It is possible that a
system-paced BCI (such as one based on the P300) would actually produce a higher decision
rate than a self-paced BCI utilizing something like slow cortical potentials. Therefore, the
user would feel that the P300 interface was continuously available because they were not
conscious of waiting for a period of control to occur. I think that it is worth discussing this
difference and perhaps defining a pacing threshold beyond which the user does not feel that
they have to wait for the system to become available. Or perhaps it is just an addition to the
definition of a system-paced BCI that says that it is only considered a system paced BCI if
the user is conscious of the system being unavailable? (Or is that heresy?)
4 Issue: with terminology. The term “synchronized” control is not very descriptive and
possibly confusing with synchronous and synchronization of EEG. Is there a better term?
5 Issue: use of transducer output in the generation of the signal reference. Alois: You are
using a block "transducer output labelling" which has inputs of the transducer output
AND the "INTENDED OUTPUT" (!!!). The "intended output" is also the reference against
which we compare the transducer output. This is a problem. I think we have been discussing this in
the past. One might use "the intended output" to generate the output labeling – then no
evaluation criterion is valid anymore. Instead, the "intended output" and the "transducer
output" must go into the block "calculate error statistic". I suggest also changing the term
"calculate error statistic" to the more general term "calculate performance criteria" or
"calculate evaluation criteria". Moreover, do we really need an extra block for "timing
characteristics"? Can we not include the timing metrics in "performance metric"? I suggest
changing Figure 8 and replacing the three leftmost blocks with one block called "calculate
performance metric".
Mehrdad: I suggest that we keep "transducer output labeling" and then add the "labeled
transducer output" after this block. With some explanation, I think we can avoid
confusion for readers. I agree with the second part of Alois' suggestions that we should
rename the two rightmost blocks and have a single block instead.
6 Issue: clarification needed. Jane: I think this could use further clarification. I'm not sure
what it means.
7 Issue: how many BCI systems are evaluated like this? Mehrdad: references?
8 Issue: with wording. Julien: Self-paced BCIs can be trained using synchronized protocols.
In this case, the EUI can be estimated the same way as in synchronized BCIs. Should we
describe that issue? Steve: Synchronized protocols do not estimate EUI. Most "average"
observations over a fixed window, so an estimate of exact timing is not required. So I don't
know if we can say something that's relevant here.
9 Issue: use of existing metrics. Alois: I do not agree that the existing performance metrics are
not useful at all. Once the reference information is available, these metrics are applicable.
Jane: I don't say that they are not useful at all, but I have had a terrible time trying to figure
out how to apply them. Most metrics seem to have hidden assumptions that make applying
them to self-paced data problematic. It is not a simple matter of getting the reference
information right and then plugging it into formulas. That doesn't produce useful results
because underlying assumptions are violated. Steve: I agree with Jane.
10 Issue: accuracy. Steve: I don't feel that this is accurate. Those who use ITR and mutual
information are not studying NC and thus are not "assuming error-free NC". They are
simply ignoring it. To say that they are assuming error-free NC is to imply that the metric is
somehow incorporating data related to NC, which it is not! So I have an issue with this
whole subsection.
11 Issue: clarification needed. Mehrdad: Is this referenced work self-paced? Julien: Not self-
paced, because it is based on trials. I would cite some papers from Mason et al., where the decision
rate is about 16 Hz. Jane: Go ahead and change the number and the reference if it is a
better illustration.
12 Issue: use of existing metrics. Alois: some can still be useful. Mehrdad, Julien, Steve,
Jane: Disagree. See Issue 9 for related comments.
13 Issue: table is incomplete.
14 Issue: with categories. Julien: are these titles/groupings the most appropriate?
15 Issue: unclear on point being described. Mehrdad: can you give some examples? Steve: I
find this approach confusing. Can you give some examples, Jane? Julien: Not very clear.
Do you mean using another transducer which decomposes the self-paced data into "periods" of a
specific state (NC or ICi)?
16 Issue: with terminology. Julien: When it comes to the abbreviation, EUI puts the
emphasis on the Estimate. Maybe User Intent Estimate (UIE) would be better because it
puts the emphasis on the User Intent.
17 Issue: with wording. Julien: It's not "non-deterministic methods". From my point of view,
it's more deterministic methods with fuzzy matching in time.
18 Issue: only depicts event-driven transducers. Julien: The EUI in Figure 16 only shows event-
based EUI. We should add an EUI in states to show the transition. Ideally, we should give two
explanations: one for event-driven paradigms and one for state-driven paradigms.
19 Issue: appendix material possibly confuses. Steve: I think the material in the appendix focuses
too much on a transition-based interpretation, which doesn't align with the rest of the
material in this section. As it stands, I think it will confuse more than elucidate the general
approach. Thus I think it still needs quite a bit of work before it is a good reference example
to accompany our specification.
20 Issue: with terminology. Julien: Referring to activation and release pushes towards
single-IC-state transducers. Maybe it should be changed to refer to multi-IC-state
transducers (so speak about transitions in general, and no longer about activation and release).
21 Issue: practical implementation. What is the significance of the hold-time period when the
error rate is high?
22 Issue: with terminology. This definition is only partially agreed upon by JH, JK, SM.
55
23 Issue: definition of Multiple Confusion Matrix. Alois: The MCM is not needed. Julien: MCMs
are useful and necessary for state-driven analysis. Steve: I realized upon reflection that the
MCM concept really is just a group of single CMs. They are not necessary - the number of
CMs a researcher reports depends on what questions the researchers want to answer.
Someone using a state-driven transducer may only report one CM if they only want to
comment on the overall ability to activate and release. So I've revised the original MCM
presentation to reflect this.
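A minimal Python sketch of this point, with made-up state labels and one hypothetical choice of conditioning (by the preceding intended state): an "MCM" is nothing more than a collection of single CMs, and summing the group recovers the single overall CM.

```python
import numpy as np

# Hypothetical per-sample labels. States: 0 = NC (no control), 1 = IC1, 2 = IC2.
intended = [0, 0, 1, 1, 0, 2, 2, 0, 1, 0]
actual   = [0, 1, 1, 0, 0, 2, 0, 0, 1, 0]

def confusion_matrix(true_labels, pred_labels, n_states=3):
    """Square confusion matrix: rows = intended state, columns = actual output."""
    cm = np.zeros((n_states, n_states), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

# One overall CM ...
overall = confusion_matrix(intended, actual)

# ... and an "MCM" as a dict of single CMs, here keyed by the intended state
# of the preceding sample (one possible conditioning among many).
mcm = {s: confusion_matrix(
           [t for i, t in enumerate(intended[1:]) if intended[i] == s],
           [p for i, p in enumerate(actual[1:]) if intended[i] == s])
       for s in (0, 1, 2)}

# Each sample falls into exactly one group, so the group CMs sum to the
# overall CM over the same samples.
assert (sum(mcm.values()) == confusion_matrix(intended[1:], actual[1:])).all()
print(overall)
```

Which (and how many) of these CMs to report is then purely a question of what the researcher wants to comment on.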
24 Issue: proposal for methods??
25 Issue: could use an example/reference.
26 Issue: lack of a single metric. Steve: I'd like to have a stronger recommendation for a
method, but I'm not convinced we have one. I like Kappa but have concerns. I started
experimenting with Kappa, found problems with it, and after researching a bit
more realized that these problems are well known. The one issue that bothered me about
Kappa was its sensitivity to disproportionate (skewed) class probabilities, like we often have
with self-paced evaluations. Lantz and Nebenzahl (J Clin Epidemiol 49(4),
attached) and Byrt et al. (J Clin Epidemiol 46(5)) have proposed formulations to report the
bias, but I don't see any formulation that corrects for it.
27 Issue: incomplete. Steve: This seems to be where our closed-group discussion ended last.
It is still incomplete – what are the other general research questions related to self-paced
evaluation? After we publish the first draft, I'd like to focus part of the discussion here. I
think coming up with metrics related to general research questions will be a useful
contribution.
28 Issue: with definitions. There are still some disagreements on which information belongs to
which group (e.g. whether the histogram of temporal accuracy belongs to low-level performance
information or to high-level information), but these are mainly due to vocabulary divergences
(JK, SM).
29 Issue: omission. Steve: We have not yet discussed "bad data" (data related to protocol
anomalies) in this version.
30 Issue: is this material needed? Julien: Yes; in this appendix, we describe the transition
markup method, which addresses these requirements and for which some Matlab code has been
implemented. I think it is a good idea to provide such a method. Maybe not to say "take it, this
is the best one", but to provide a detailed example of how such methods can be designed.
Steve: Useful, but in its current form I think it is hard to follow or interpret as an example
that I can relate to. Also, some of this material is out of date and does not reflect our latest
thinking.
31 Issue: transition-based versus state-based perspective. Steve: We spent quite a bit of time
discussing a transition-based analysis approach, but I now see it as only a special case of the
state-based interpretation, and we've dropped most of the transition-specific discussion from
the document. Thus I don't think we have provided enough background for the reader to
understand the transition-based perspective.
32 Issue: how to name these three indices? Julien: The mid-October summary asked whether
current/desired/actual state refers to the "transducer output" or to the "intended output".
The following descriptions were agreed upon (SM, JK):
"current state": transducer output before the transition
"desired state": intended output after the transition
"actual state": transducer output after the transition
However, the original terms (August 22) were not well chosen, and there was some discussion
about more appropriate terms. It was agreed that the terms must include "transducer output"
or "intended output", but the word indicating the time information ("before"/"after" in the
above descriptions) was debated. JK proposed "current"/"next". SM proposed
"previous"/"current". JH finds "current" confusing and proposed "old"/"new" or
"before"/"after". JK argued that "current" is not well chosen because we are referring to the
transition time, and "current" would indicate that a state is associated with the transition,
which is obviously not the case (the transition has a "0-width" state). He also argued that
"old"/"new" is not appropriate because older/newer refer to two comparable things (which is
not the case here, as we would be comparing a state and a transition).
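Whatever terms are finally chosen, the three indices could be carried per transition event as a simple record; the Python sketch below uses purely illustrative field names that spell out "transducer output" versus "intended output" as agreed:

```python
from dataclasses import dataclass

# Hypothetical record for one transition event. Field names are illustrative,
# mapping onto the agreed descriptions of the three indices.
@dataclass
class TransitionRecord:
    output_before: str   # "current state":  transducer output before the transition
    intended_after: str  # "desired state":  intended output after the transition
    output_after: str    # "actual state":   transducer output after the transition

    def is_correct(self) -> bool:
        """A transition succeeds if the transducer lands in the intended state."""
        return self.output_after == self.intended_after

r = TransitionRecord(output_before="NC", intended_after="IC1", output_after="IC1")
print(r.is_correct())  # True
```

Spelling the indices out this way keeps the time reference ("before"/"after" the transition) unambiguous regardless of which shorthand terms the text settles on.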