Evaluating the Performance of
Self-Paced Brain-Computer
Interface Technology
Revision: 1.0 (draft) Date: May 19, 2006
Steven Mason Brain-Interface Laboratory Neil Squire Society, Vancouver, Canada, [email protected]
Julien Kronegg Computer Vision and Multimedia Laboratory University of Geneva, Geneva, Switzerland [email protected]
Jane Huggins Depts. Phys. Med. and Rehab.& Biomed. Engineering University of Michigan, Ann Arbor, U.S.A. [email protected]
Mehrdad Fatourechi Dept. of Electrical and Computer Engineering Univ. of British Columbia, Vancouver, Canada [email protected]
Alois Schlögl Laboratory of Brain-Computer Interfaces Graz University of Technology, Graz, Austria [email protected]
© 2006 Steven Mason, Julien Kronegg, Jane Huggins, Mehrdad Fatourechi and Alois Schlögl.
All rights reserved. No part of this work may be reproduced or used in any form or by any means
without the prior written permission of the authors.
Contents
1. Introduction ........................................................................................................................... 1
1.1. This Version ................................................................................................................ 1
1.2. Conventions ................................................................................................................. 2
2. Review of BCI Technology, Evaluation Concepts and Metrics ........................................... 3
2.1. Review of BCI Technology ......................................................................................... 3
2.1.1. BCI System Definition ...................................................................................... 3
2.1.2. BCI Transducers ................................................................................................ 4
2.1.2.1. Abstract BCI Model ........................................................................... 4
2.1.2.2. Definition of No Control (NC) support ............................................. 5
2.1.2.3. Event-Driven versus State-Driven Designs ....................................... 6
2.1.3. BCI Control Paradigms ..................................................................................... 7
2.2. Review of Performance Evaluation Concepts and Methods ....................................... 8
2.2.1. Online vs offline evaluation ............................................................................ 10
2.2.2. Continuous versus periodic analysis ............................................................... 10
2.2.3. Where to measure the performance? ................................................................ 10
2.2.4. User Intent ....................................................................................................... 10
2.3. Paced vs. Self-Paced / Guided versus Self-Guided Testing Protocols: ..................... 11
3. Challenges with Self-Paced BCI System Evaluation .......................................................... 13
3.1. Requirements and Issues ........................................................................................... 13
3.2. Availability of Reference Data .................................................................................. 13
3.3. Issues with Existing Performance Metrics ................................................................ 14
3.3.1. Evaluation of Synchronized and Self-Paced BCIs .......................................... 14
3.3.2. The NC State ................................................................................................... 14
3.3.3. Unequal Probability of Classes ....................................................................... 14
3.3.4. Implicit Assumption of Error-Free NC ........................................................... 14
3.3.5. Decision Rate .................................................................................................. 15
3.3.6. Application versus Perspective ....................................................................... 15
3.4. Application of Current Performance Metrics to Self-paced BCI systems ................ 15
4. Determining a Quality Reference for Self-Paced Evaluation ............................................. 17
4.1. Determining a Quality Reference in the presence of Observable Phenomenon and
Real-time Monitors .................................................................................................... 17
4.2. Determining a Quality Reference in the absence of an observable Phenomenon ..... 18
4.2.1. Paced Test Environments I .............................................................................. 18
4.2.2. True Self-Paced Test Environments ................................................................ 18
4.2.3. System Paced Test Environments ................................................................... 19
4.2.4. Computationally Intensive Event Detection ................................................... 19
4.3. Examples ................................................................................................................... 19
4.3.1. Step 1: running experiment ............................................................................. 19
4.3.2. Step 2. Generating the Estimated User Intent. ................................................ 21
5. Transducer Output Performance Mark Up ......................................................................... 23
5.1. Inherent Uncertainty in the Estimated User Intent .................................................... 23
5.2. General Performance Mark Up Algorithm ................................................................ 24
5.2.1. The Internal Comparison Method (ICM) ........................................................ 26
5.2.1.1. Example 1 ........................................................................................ 27
5.2.1.2. Example 2 ........................................................................................ 27
5.2.1.3. Other ICM examples ........................................................................ 28
6. Metrics for Self-Paced Evaluation ...................................................................................... 31
6.1. Definition of Terms and Symbols ............................................................................. 31
6.1.1. Observation Time ............................................................................................ 31
6.1.2. NC Time Periods ............................................................................................. 31
6.1.3. Inter-FA Time Periods .................................................................................... 31
6.1.4. Response Time ................................................................................................ 31
6.1.5. Hold Periods / Hold Time ............................................................................... 32
6.1.6. Glitch ............................................................................................................... 32
6.2. Error metrics definitions ............................................................................................ 32
6.2.1. Confusion Matrix ............................................................................................ 32
6.2.2. Other metrics ................................................................................................... 34
6.3. Temporal Characterization of Transducer Output ..................................................... 36
6.3.1. Response-Time Characterization .................................................................... 36
6.3.2. NC Period Characterization ............................................................................ 36
6.3.3. Inter-FA Period Characterization .................................................................... 36
6.3.4. Hold-Time Period Characterization ................................................................ 37
6.4. Summary .................................................................................................................... 37
6.4.1. Returning to the Research Questions .............................................................. 37
7. Reporting Practices ............................................................................................................. 39
7.1. Ideal needs for their target application ........................................................................ 39
7.1.1. Specifying the Target Needs: ............................................................................ 39
7.1.2. Acceptable Error Rates: .................................................................................... 39
7.1.3. Acceptable Timing Characteristics: ................................................................. 39
7.2. Transducer's Characteristics ....................................................................................... 40
7.2.1. Transducer's Output Rate ................................................................................. 40
7.2.2. Temporal Characteristics: ................................................................................. 40
7.2.3. Offline vs. Online Analysis: ............................................................................. 40
7.2.4. Robustness of the Algorithm ............................................................................ 40
7.3. Transducer Output Mark Up Method ......................................................................... 40
7.4. Basic Performance Information .................................................................................. 40
7.5. Application-Specific High Level Metrics ................................................................... 40
Appendix A Glossary.......................................................................................................... 43
Appendix B Transition-based Performance Markup Algorithm ........................................ 47
Appendix C References ...................................................................................................... 51
Appendix D Outstanding Issues ......................................................................................... 53
1. Introduction
There is a growing awareness that for a Brain Computer Interface (BCI) to be most useful
for people with severe motor disabilities it must support self-paced operation. Self-paced
operation implies two things: first, when a BCI system is on, it is always available for control, and
second, the technology is able to recognize periods when no commands are generated by the user
and it does not produce false responses during those times. As such, the evaluation of self-paced
technology poses some unique challenges; the most notable is the lack of a single metric to
quantify performance.
Several groups are working in this area and dealing with these challenges (e.g., in the
laboratories of Birch, Levine, Millan, Inbar, and Pfurtscheller) although the terminology and
methods used for describing and testing these approaches are inconsistent. The term
asynchronous, in particular, has been used inconsistently as researchers have attempted to
describe self-paced BCI system designs (Mason, Bashashati et al. 2005).
This report is our effort to elucidate the concepts and methods related to the evaluation of self-
paced BCI operation. As such we define key terms and concepts, outline specific challenges and
issues in this sub-field, and summarize methods and metrics for comparing technology designs.
The overriding purpose of this report is to provide a common reference for designing and
evaluating self-paced BCI technologies. Our hope is that through efforts like these, disparate
perspectives can be aligned and the community brought together on terminology, testing strategies
and reporting related to self-paced operation.
Fundamentally, the report is a living document that will grow and change as the field matures
and our thinking is refined. In its current form, it contains the fundamental self-paced concepts and
key terms that we have agreed on and itemizes several existing terms that we would like to see
discontinued. The basic concepts and terminology are reviewed in the next chapter. As these
concepts and terms are used throughout the document, we recommend that Chapter 2 is read prior
to reading other chapters. As stated above, self-paced BCI evaluation has several unique
challenges. We have itemized these in Chapter 3. One of the most critical challenges – deriving a
suitable signal reference for comparison – is detailed further in Chapter 4. In Chapter 5 and 6 we
outline methodological approaches to BCI evaluation and describe various metrics for
summarizing experimental findings. We close the report in Chapter 7 with some guidelines for
reporting self-paced BCI system evaluation. We also have included several appendices, of which
the most noteworthy are the first and last. The first is a glossary of all key terms and the last
summarizes issues which we have not resolved as of the time of publication.
1.1. This Version 1
The material presented herein is the product of a small, closed discussion group of researchers
representing laboratories that have been actively working with self-paced BCI technology. The
project was initiated at an informal gathering at the 3rd International Brain Computer Interface
Meeting, Rensselaerville, NY, USA in June 2005 where we assembled to discuss the evaluation of
self-paced BCIs. Given the diversity in perspectives and terminology that we encountered within
our small group (during and after that meeting), we chose to limit the participation in the
discussion aiming to first reach agreement on the most basic terms, concepts and metrics before
we approach the rest of the community for their input.
1 The most recent version of this report is posted on the www.BCI-info.org website under Research Info | Documents
| Articles. We recommend that you get the latest version before reading further.
In order to limit the scope of the discussion, we chose to focus the report primarily on BCI
transducers that have discrete output. (For reference, “BCI transducer” refers to the conceptual
component of a BCI system that translates brain activity into basic control signals.) Self-paced
operation of BCI transducers with continuous output or spatial reference output will not be
discussed here. Future versions or alternate reports may deal with the evaluation of these
transducer designs.
As future versions of this document will reflect community opinion, we invite you to comment
on what we have written. If you are interested, please join the online discussion held in the
newsgroup named gmane.science.neuroengineering.bci-info.general on the news server
news.gmane.org.2 Our plan is to update this document based on comments seen in this newsgroup.
1.2. Conventions
Below are the style conventions used in this document:
<New Term> definition of key term related to self-paced BCI technology and operation
(see entry in glossary – Appendix A). Usually the first occurrence of the
term in the document.
<Key Term> subsequent use of a key term when the term is not obvious from context
<Old Term> existing terminology that we would like to see discontinued
<text> in superscript identifies an unresolved issue. Refer to item n in the last
appendix entitled Outstanding Issues for a description of the issue.
Note all highlighted terms are hyperlinked to glossary entries in the first appendix.
2 Instructions for accessing the newsgroup: If you use MS Outlook, activate the built-in news reader with View | Go to |
News... Then use Tools | Accounts and add news.gmane.org to the accounts list. Then right-click on
news.gmane.org in the Folders panel and select Open. Search for and select the "bci-info" newsgroup (named above)
from the listed newsgroups. If you use Thunderbird, you could do the following: In File | New | Account select
Newsgroup account. Enter "news.gmane.org" in the newsgroup server field. Then you can select the "bci-info"
newsgroup from the listed newsgroups. If you have difficulties accessing this newsgroup or have other questions
regarding the online discussion, please contact one of the authors.
2. Review of BCI Technology, Evaluation Concepts and Metrics
2.1. Review of BCI Technology
In this chapter, we review the concept of a Brain Computer Interface (BCI), a BCI
transducer and different types of BCI transducers, with a focus on self-paced BCI systems.
2.1.1. BCI System Definition
A BCI System is a set of sensors and signal processing components (sometimes including
displays and sensory stimulators) that translates a person's brain activity directly into useful
control or communication signals. From an assistive technology perspective, Figure 1 depicts a
BCI System as an assistive technology (AT) that bridges an ability gap between a person and their
environment. For example, such a system can enable a person with severe motor disabilities to
control objects in their environment such as a light switch, a TV set or an object on the computer
screen. Figure 2 depicts two common AT architectures discussed in the BCI literature 3. (See BCI
design web site (Mason, Bashashati et al. 2005) for alternative system architectures).
Figure 1. Conceptual model of BCI technology used as an assistive technology.
Figure 2. a) Functional model of 2-component BCI System; and b) functional model of 3-component BCI System.
The Assistive Device component represents the apparatus that interacts directly with objects or people in the
environment. Examples would be displays, speech synthesizers, infrared remote controllers, an FES system or a
wheelchair. The Control Interface is a component that is added to a transducer that produces a relatively low
dimensional output in order to expand the control dimensionality to a level required by an Assistive Device. Most
commonly, the Control Interface is some form of electronic menu on a display.
3 Note, this model also applies to any augmentative technology that extends a person's abilities beyond their
inherent functional limitations.
2.1.2. BCI Transducers
The BCI Transducer (depicted in Figure 2) represents a collection of sensors and signal
processing components required to translate a person's brain activity into usable control signals as
detailed in Figure 3. As seen in Figure 3, the brain activity is first measured by a series of sensors
and then amplified. Next, non-physiological artifacts (such as power line noise) and physiological
artifacts (such as those caused by EOG activity) are removed from the brain signals (or
periods of brain signals contaminated with such artifacts are excluded from analysis). It should be
noted that some BCI systems do not have a separate “Artifact Processor” component as shown in
Figure 3. Then the Feature Extractor extracts useful features for discrimination of control signal
from background brain activity. Finally, the Feature Translator translates these features into
control signals which are sent to an Assistive Device or Control
Interface.
Figure 3. Functional model of a BCI Transducer illustrating the series of components used to translate brain activity
into control signals.
In general, transducers can be designed to produce discrete, continuous or spatial reference
outputs. For this report, only discrete transducers (non-ordered, state-based output) are
discussed.
Figure 4. Discrete state-based output versus continuous output (values from ordered set).
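As a minimal illustration of this distinction (our own sketch, not from the report), a discrete transducer output is a sequence of symbols from a non-ordered set, whereas a continuous output takes values from an ordered range:

```python
# Hypothetical sample-by-sample outputs for the two transducer types in
# Figure 4 (illustrative values only).
discrete_output = ["B", "A", "C", "C", "B", "A"]      # non-ordered state labels
continuous_output = [0.3, -0.8, 1.2, 1.1, 0.4, -0.7]  # values from an ordered set

# Only equality is meaningful for discrete state labels...
assert discrete_output[2] == discrete_output[3]
# ...whereas continuous values can also be compared by magnitude.
assert continuous_output[2] > continuous_output[1]
```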
2.1.2.1. Abstract BCI Model
As depicted in Figure 5, the BCI System can be represented abstractly as the user's brain
connected to a BCI Transducer followed by other signal processing components that translate the
basic transducer output into higher-dimensional (more complex) data or commands. In this model,
the user's intent generates the appropriate brain state to do some activity (i.e., produce some
output) in their environment. As with any other tool, the user's experience will dictate how much
attention must be focused on correct operation of the tool or if tool use is primarily transparent,
with the user only consciously focusing on the activity being performed. For an experienced user,
the production of the brain state and transducer output has been automated and hence the user's
intent is usually related to some high-level output in this “system” (somewhere along the chain of
outputs). For inexperienced users, the production of the required brain states and transducer output
has not been automated and the user has to consciously focus on generating a particular transducer
output, so the intent is more closely tied to generating a particular transducer output.
Figure 5. Generic signal processing model.
2.1.2.2. Definition of No Control (NC) support
When operating a BCI, users modulate their brain activity in order to generate desired
transducer outputs. For discrete BCI Transducers, the brain states related to Intentional Control
are denoted as ICA, ICB, ICC, ... with corresponding transducer outputs A, B, C,... When there is
no Intentional Control, e.g., during periods of thinking, composing, monitoring or daydreaming,
the user's brain state is considered to be in a No Control state, which is denoted by NC. The
appropriate transducer response to NC would be a neutral, or N, output. We refer to this ability
as NC support. NC support is necessary for most types of machine or device interactions where
frequent actions are spaced by periods of inaction. For most physical interfaces (such as a
keyboard), NC support is not an issue as intent is related to voluntary movement. One of the
challenges of quantifying BCI performance is determining appropriate criteria to quantify
imperfect NC support. This ability to remain neutral has been compared to a car engine that idles
when no gas is applied. Thus some researchers have referred to the NC state using the term idle
and NC support as Idle Support. However, the word idle implies passivity, which is only one
possible type of NC, so these terms are not ideal and we recommend that researchers use NC and
NC support instead.
Few BCI Transducers have been specifically designed to support the NC brain states. For this
work, these transducers will be called BCI transducers with NC support. Transducers that do
not support NC brain states are referred to as BCI transducers without NC support 4.
NC support must handle the diversity of activity and thought that may make up the NC state
and must operate effectively during NC periods ranging from a few seconds used to check a
written source, to a few minutes of staring out the window, to a few hours of watching a favorite
movie. Thus it is unlikely that one can model all possible NC states related to a target application.
In general, this problem can be constrained by including some mechanism to turn the interface off
4 We have explicitly avoided using the terms asynchronous and synchronous in order to avoid misinterpretation
and confusion as these adjectives are used inconsistently throughout the field. As a reference, we have (re)defined an
asynchronous BCI transducer as synonymous with a BCI transducer with NC support and a synchronous BCI
transducer as synonymous with a BCI transducer without NC support. We do not support the use of asynchronous
BCI or asynchronous BCI system as these terms are too vague and bound to cause more confusion in the
community. Instead we prefer the term self-paced BCI system.
when not in use and back on when desired. Applying this approach, some interface designs
propose to handle long periods of inactivity by having the BCI be “put to sleep” by the user or
programmatically “go to sleep” (such as when battery-powered laptops enter standby mode) and
then are turned back on (via some mechanism) when desired I1. While a sleep mechanism is a
useful mode to avoid false responses during long periods of NC, it is not practical during periods
of interaction which contain frequent short pauses for thought, composition or response
monitoring. Further, for a BCI to programmatically put itself to sleep implies that the BCI can
recognize a long period of NC and take action accordingly. Therefore, while a sleep mechanism
may form a portion of a BCI's NC support strategy, a sleep mechanism by itself is not sufficient
for NC support.
Some researchers have designed cued interfaces that, in addition to producing IC-related
outputs, also produce an unknown state output when there is not enough confidence in the
classifier to choose one of the other IC states. This unknown state has in some works been used to
represent the neutral output, N, but there is no evidence that the NC state will fall into the
unknown state in these designs I2. Therefore, the existence of an unknown state is not necessarily
sufficient to define NC support.
Ultimately, NC support will likely be provided by a combination of the methods described
above along with strategies yet to be developed. Only practical experimentation with real BCIs
and real users will reveal the nature of effective NC support in a real-world BCI.
2.1.2.3. Event-Driven versus State-Driven Designs
Through our discussion we realized that researchers design discrete self-paced BCI transducers
in two different ways. We have named these designs state-driven (discrete) BCI transducers
and event-driven (discrete) BCI transducers.
In event-driven discrete control, the user has the ability to initiate a state change (the event),
but cannot hold this state for a given time because the underlying neurological phenomenon is
event-based. Note that in this case the return to the NC state is performed automatically by the brain (see
Figure 6). P300 waves typically produce event-driven discrete control.
In state-driven discrete control, the user has the ability to initiate a state change, hold and
release a given state, and switch between brain states, as shown in Figure 6. This is done in
transducers where the underlying neurological phenomenon is a thresholded continuous value.
Thresholded mu activity is an example of state-driven discrete control.
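The state-driven case can be sketched as simple thresholding of a continuous feature. This is a hypothetical illustration with an arbitrary decision boundary; real transducers involve feature extraction and classifier design not shown here:

```python
def state_driven_output(feature_values, boundary=1.0):
    """Map a continuous feature (e.g., thresholded mu activity) to a
    state-driven discrete output: 'A' while the feature is beyond the
    decision boundary, neutral 'N' otherwise. The user can initiate,
    hold, and release the A state by sustaining the underlying activity."""
    return ["A" if v > boundary else "N" for v in feature_values]

# The user initiates the A state, holds it for two more samples, then releases it.
assert state_driven_output([0.2, 1.4, 1.6, 1.3, 0.3]) == ["N", "A", "A", "A", "N"]
```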
Figure 6. a) Event-driven discrete control: transducer output is based on a transient brain activity (an event) such as a
Movement Related Potential (MRP). b) State-driven discrete control: the user has ability to initiate, maintain (hold),
and release an intentional control (IC) state.
The main point in differentiating these two paradigms is that for state-driven control we
should report activation, hold, release, and NC support capabilities, while for event-driven
control we only measure activation and NC support capabilities as the user has no control over
release (which is automatic) or holding.
Another difference between event-driven and state-driven discrete control is that a transition
from one IC state to another IC state cannot occur in event-driven control because of the
automatic release from an IC state to NC.
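By contrast, an event-driven output can be sketched as a detector whose activation lasts a fixed "on" time determined by the classifier design, with the release back to N happening automatically. Again, this is a hypothetical sketch; the event detection itself is not modeled:

```python
def event_driven_output(event_samples, n_samples, on_time=2):
    """Event-driven discrete output: each detected event (e.g., a P300 or
    MRP) switches the output to 'A' for a fixed number of samples, after
    which it returns to the neutral 'N' automatically; the user cannot
    hold or release the state."""
    out = ["N"] * n_samples
    for t in event_samples:
        for k in range(t, min(t + on_time, n_samples)):
            out[k] = "A"
    return out

# A single event at sample 3 produces a fixed-duration activation.
assert event_driven_output([3], 8) == ["N", "N", "N", "A", "A", "N", "N", "N"]
```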
2.1.3. BCI Control Paradigms
While our focus in this report is on the self-paced operation of BCI systems, we found it necessary
to clearly delineate the self-paced control paradigm from other types of control paradigms seen in
the BCI literature. From our perspective, there are four primary control paradigms based on NC
support and system availability as depicted in Figure 7 (Mason and Birch 2005).
1) self-paced control: BCI system is continuously available to the user when it is on/awake and
it supports NC
2) system-paced I3 control: system is periodically available to the user when it is on/awake (i.e.,
it requires a cuing mechanism) and it supports NC
3) synchronized I4 control: system is periodically available to the user when it is on/awake (i.e.,
it requires a cuing mechanism) and it does not support NC
4) constantly-engaged control: system is continuously available to the user when it is on/awake
and it does not support NC (not a practical mode of control)
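The four paradigms follow directly from the two design properties (availability and NC support), which can be expressed as a small lookup (our sketch of the taxonomy above):

```python
def control_paradigm(continuously_available: bool, supports_nc: bool) -> str:
    """Name the BCI control paradigm from the two properties in Figure 7:
    whether the system is continuously available when on/awake, and
    whether it supports the No Control (NC) state."""
    if continuously_available:
        return "self-paced" if supports_nc else "constantly-engaged"
    return "system-paced" if supports_nc else "synchronized"

assert control_paradigm(True, True) == "self-paced"
assert control_paradigm(False, True) == "system-paced"
assert control_paradigm(False, False) == "synchronized"
assert control_paradigm(True, False) == "constantly-engaged"
```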
Figure 7. System availability during different temporal control paradigms. Reproduced from (Mason and Birch
2005).
As a result, one would find four different categories of BCI system design: self-paced, system-
paced, synchronized, and constantly engaged.
There have been numerous reports that equate cuing mechanisms with synchronous (or
synchronized) control. This is inaccurate and misleading. While synchronized control (and
system-paced control) require a cuing mechanism in their design, the presence of a cuing
mechanism in an experimental protocol does not imply that the system operates in a synchronized
or system-paced control paradigm. Cues are an essential part of the system design in synchronized
BCI systems and system-paced BCI systems. They let the user know when the system is about to
start interpreting their data as control. For synchronized BCI systems, the cues are generally used
to say "get ready to start controlling" (i.e., get into an IC state, or start a series of IC states, as per
Millan et al.). For system-paced BCI systems, they indicate that a control period will be starting
soon if they want to control the system at that time (not a requirement as in synchronized systems).
Cues are also used as experimental constraints (i.e., not part of the BCI system design). As
experimental constraints, cues are used to guide the user into some state, such as IC or NC. In this
way, one could set up an experimental system with a user operating a self-paced BCI system
(design) and a separate cuing mechanism to "force" the user to control the self-paced system when
desired by the experimenters. Such a setup would not imply a cued transducer, but instead would
indicate a tightly constrained experimental setup.
2.2. Review of Performance Evaluation Concepts and Methods
Before we proceed to discuss performance metrics and methods, we require a common
perspective and language. This chapter outlines the principal concepts related to the evaluation of
self-paced BCI technologies that we have agreed on.
Figure 8 and Figure 9 delineate the main components and data flow of data recording and
analysis for online and offline studies, respectively.
Figure 8. Simplified on-line experimental system. a) recording of BCI system operation data while the user attempts
to perform some activity through the BCI; b) schematic of data analysis (could be done in real time or after the data
recording has been completed). Note, the collection of “soft” usability metrics, like “user satisfaction” (typically
recorded via questionnaires), is not depicted in these diagrams. I5
For an online experimental system, after a user attempts to perform some activity through the
BCI system, the data is recorded on a storage device (see Figure 8.a). “Real-time monitors” such
as a camera ensure that the user's actions are properly documented (including proper execution of the
experiment and monitoring of artifacts). I6
Figure 8.b shows the general schematic of data analysis. The first step is to codify the user's
intent in a machine-readable format. This is a critical step in the design, since the exact time that
the user has intended to control the BCI is usually unknown and must be estimated from the
available reference information. We call such an estimation of the intent of the user the Estimated
User Intent and we discuss its generation in detail in Chapter 4. Once the Estimated User Intent
9
is generated, the transducer output can be marked up and the performance evaluation metrics can
be calculated.
[Figure 9 diagram. Panel a (performance evaluation): stored brain signals drive a new transducer
design to produce a new transducer output; the Intended Output is generated from knowledge of
the experimental protocol and reference information, the transducer output is marked up, and
error statistics and timing characteristics are calculated. Panel b (recording): user, environment,
activity, real-time monitors, and storage of the Recorded Data.]
Figure 9. Simplified off-line experimental system. a) Prerecorded brain wave data (either from a previous on-line
recording as depicted in Figure 8a or from a recording of a user performing a specific activity as depicted in b) is used
to drive a new transducer design.
Figure 9.a shows the setup for an offline experimental system. The difference between this setup
and the on-line analysis is that here the user's feedback does not exist. The brain waves and other
experiment-related signals, such as EMG and EOG activity, from a previous BCI experiment or
recorded from a user performing a specific task, are stored for offline analysis.
The recorded brain signals are then fed into a new transducer design (see Figure 9.a) and the
new transducer output is generated. The rest of the process of labeling and calculating the
performance criteria is similar to the online experiment.
Depending on the transducer design and experimental set-up, the key issues in performance
evaluation are:
1. online/offline evaluation: depending on the type of analysis (online vs. offline), the
performance criteria may be different.
2. continuous versus periodic analysis
3. where should the performance be measured?
4. the guidance and pacing of experimental tasks and self reported data
5. methods to determine a quality signal reference for comparison
6. metrics: what to use and how to calculate them
The first four issues are discussed in the following subsections. The last two will be
discussed in more detail in Chapters 4, 5 and 6.
2.2.1. Online vs. Offline Evaluation
The performance evaluation depends on whether the experimental system is offline or online.
In an online experimental system, the performance of the system is evaluated by two sets of
metrics: the transducer's performance metrics (as explained in Chapter 6) and the system's
usability metrics, such as the degree of subject satisfaction or dissatisfaction with the system.
2.2.2. Continuous versus Periodic Analysis
Periodic analysis, as we have defined it, is when an experimenter records continuous data
(including NC and IC) but evaluates the BCI technology only during the periods of IC (similar to
synchronized control systems). Continuous analysis is when the analysis evaluates all of the
data.
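The difference can be made concrete with a small sketch (our own illustration; the label names and data are invented). Scoring the same output periodically and continuously can tell different stories, because periodic analysis never sees false activations that occur during NC periods:

```python
# Illustrative sketch of continuous vs. periodic evaluation of a
# sample-by-sample transducer output. "NC" marks no-control periods;
# "A" and "B" are intentional-control (IC) states.

def accuracy(reference, output, ic_only=False):
    """Fraction of samples where the output matches the reference.

    ic_only=True restricts scoring to IC periods (periodic analysis);
    ic_only=False scores every sample (continuous analysis).
    """
    pairs = list(zip(reference, output))
    if ic_only:
        pairs = [(r, o) for r, o in pairs if r != "NC"]
    return sum(r == o for r, o in pairs) / len(pairs)

reference = ["NC", "NC", "A", "A", "NC", "B", "B", "NC"]
output    = ["NC", "A",  "A", "A", "NC", "B", "B", "NC"]

continuous = accuracy(reference, output)              # 7/8 = 0.875
periodic = accuracy(reference, output, ic_only=True)  # 4/4 = 1.0
```

Here periodic analysis reports perfect control, while continuous analysis reveals the false activation during the second NC sample.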
2.2.3. Where to Measure the Performance?
To provide context for the interplay between IC and NC states, a simple operating model shown in
Figure 10 was adopted. In this model, the user is equipped with a BCI transducer and the
transducer output is connected to an assistive device.
There are multiple places where one can test a self-paced BCI system, represented by various
test points TPα, TPβ and TPγ in Figure 10. Given the wide variety of commercially available
assistive devices, we felt that BCI users would be primarily interested in the performance of BCI
transducers as this characterization would enable them to select the best transducer to control their
existing assistive devices. To simplify our discussion (and best meet user needs), we chose to
focus our conversation on the output of the transducer, TPα, although much of what is presented
below may also apply to other test points.
[Figure 10 diagram: a BCI Transducer connected to an Assistive Device, with test points TPα (at
the transducer output), TPβ and TPγ marked along the signal path.]
Figure 10. Basic operating model illustrating possible test points for evaluating function in multi-component BCIs.
For BCI technology evaluation, we are treating the BCI transducer operation as a black box,
i.e. the evaluation metrics are blind to any considerations about how the transducer output is
produced. The performance metrics merely quantify the difference from the user's viewpoint
between the user's intent when controlling the system (intended output) and the actual output of
the BCI transducer.
2.2.4. User Intent
The purpose of performance metrics is to quantify desire versus ability. We refer to the
discrete representation of the user's desire as the intended transducer output or Estimated User
Intent. One of the major issues is how to compute the Estimated User Intent for a self-paced
experiment so that the analysis is valid (refer to Chapter 4 for discussion).
Deviations of the observed output from the intended output as observed at the test point may be
caused either by limitations in the user's ability to control their brain state or by deficits in the BCI
transducer's ability to interpret their brain activity. Experiments may be configured to examine
one or the other of these sources of error or to allow both simultaneously.
2.3. Paced vs. Self-Paced / Guided vs. Self-Guided Testing Protocols
Testing protocols were broadly classified by two factors, pacing and guidance. A guided,
paced protocol can constrain the subject in such a way that the investigator can determine what the
subject intended to do (assuming subjects follow task directions) and when they attempted control
of the interface. Any protocol that utilizes self-guided or self-paced interaction will require self
report. For self-directed tasks, investigators require extra methods to label their self-directed data.
The investigator also needs to control self report error in these protocols.
Test protocol design is also independent of the BCI transducer design. What was not initially
recognized by all members of the group was that self-paced BCI transducers that support NC can
be tested in guided and paced protocols as well as with self-paced methods. For example,
guided and paced protocols may use cues to initially customize and train the subjects, but true
testing of self-paced BCI technology for individuals with severe motor disabilities will require
self-paced actions and timing. Ideally, a self-guided, self-paced test protocol is desired although
this type of protocol involves many issues related to how to generate the estimated intended output
(see Chapter 4). Experience with this type of testing indicated that subject training or BCI
transducer customization or calibration in controlled environments does not always map well to
self-determined environments. This makes it difficult to predict the true usability of self-paced
BCI technology for certain subject groups where self report is not possible.
3. Challenges with Self-Paced BCI System Evaluation
3.1. Requirements and Issues
Comparison of different BCI systems, transducers and algorithms requires shared performance
metrics. The most common performance metric is perhaps the error rate or classification accuracy
(Blankertz, Müller et al. 2004), (Blankertz, Vaughan et al. 2003), (Blankertz, Schalk et al. 2005).
However, several other metrics have also been proposed such as Cohen's Kappa coefficient (Bortz
and Lienert 1998), (Cohen 1960), (Kraemer 1982), (Schlögl, Lee et al. 2005), mutual information
and information transfer (Kronegg, Voloshynovskiy et al. 2005), (Kronegg and Pun 2005),
(Nykopp 2001), (Pierce 1980), (Schlögl, Neuper et al. 2002), (Schlögl, Keinrath et al. 2003),
(Wolpaw, Ramoser et al. 1998), receiver operating characteristics (ROC) and the area under the
ROC curve (AUC) (Lal, Schröder et al. 2005), (Schlögl, Anderer et al. 1999a), (Schlögl, Anderer
et al. 1999b), correlation coefficient (Gao, Black et al. 2003), (Wu, Gao et al. 2006) and mean
square error (MSE) (Gao, Black et al. 2003), (Wu, Gao et al. 2006) (for details see (Schlögl,
Huggins et al. accepted for publication)). These criteria have been applied mostly in synchronized
BCI systems, on a trial-by-trial basis. In the simplest case, a single classification result is obtained
from each trial and these single trial results are used to calculate the classification accuracy.
Sometimes, instead of the classification value, a discriminant value is used, taking into account not
only the classification but also the magnitude (or confidence level of classification). In more
advanced evaluation methods, instead of a single value per trial, the result of each time-point
within the trial is analyzed. Accordingly, the time-course of the performance metric is used,
enabling an estimate of the time delay of the data processing methods. These evaluation methods
require that the Estimated User Intent (class labels, target information, reference data) be accurate
and precise, which is a simple matter for the synchronized BCI experiments to which they have
been applied. Synchronized experiments also allow experiments to be structured so that the
assumptions necessary for the application of these performance evaluation methods are generally
met. However, for self-paced BCIs and for most real-world applications, these underlying
assumptions are not met and the Estimated User Intent contains some uncertainty. Therefore,
these performance evaluation metrics cannot be readily applied to self-paced BCIs.
3.2. Availability of Reference Data
In order to evaluate performance, the Estimated User Intent must be known and available for
comparison with the transducer output. When evaluating self-paced BCIs, obtaining the Estimated
User Intent can be a major challenge because self-paced BCI experiments generally do not provide
rigorous reference information. Sample-by-sample class labels that are commonly employed by
existing performance metrics are especially difficult to obtain from self-paced experiments where
the point at which brain activity for a particular task begins is entirely up to the user. However, in
many cases, it is sufficient to use reasonable reference information, without requiring 100%
accuracy. This strategy is often used in medical informatics, where experts provide scorings based
on their best knowledge. Although different experts do not agree (they exhibit inter-scorer
variability) and the same expert does not always produce the same scoring results, expert scoring
is often still useful and serves as the "gold standard" (examples are sleep stage scoring
(Danker-Hopfe, Kunz et al. 2004) or the diagnosis of mammograms). Obtaining Estimated User
Intent information for self-paced BCIs is discussed in detail in Chapter 4.
Since the Estimated User Intent information is imperfect, the performance metric is limited by
the inaccuracies of the reference labels. Nevertheless, the metric can be used for comparing
different data processing methods. As long as the reference information has been obtained
independently (a priori) from the transducer output, the obtained metric can reliably compare the
performance of the two systems on the same data. However, matters become more complex when
the desired task is to compare the performance of two systems that were tested on different data.
3.3. Issues with Existing Performance Metrics
3.3.1. Evaluation of Synchronized and Self-Paced BCIs
The evaluation of self-paced BCIs requires the comparison of the Estimated User Intent (our
signal reference) to the transducer output. Compared to synchronized BCIs, where only certain
windows of time are evaluated, a self-paced BCI output is analyzed at every output value. This
has far-reaching implications for the application of performance metrics.
3.3.2. The NC State
As the unique characteristic of self-paced BCIs, the NC state presents the greatest difference
between synchronized and self-paced BCIs and is therefore an important consideration for
appropriate performance evaluation. While the NC state could be treated as simply an additional
state, this ignores the variability of the underlying brain activity and hides the importance of
error-free NC periods. Alternatively, separate statistics could be calculated for the NC state and the IC
states. This complicates the comparison of methods, but may also capture important aspects of the
performance.
3.3.3. Unequal Probability of Classes
The operation of self-paced BCIs typically results in long periods of NC interspersed with brief
instances of IC, or periods of increased IC. Regardless of the pattern of NC and IC, the NC state
often occurs with a much higher probability than the IC state(s). This unequal a priori probability
of the various classes violates the underlying assumption of equal a priori probability for a variety
of traditional performance metrics including Wolpaw's Mutual Information (Wolpaw, Birbaumer
et al. 2000) and the classification accuracy (ACC) or error rate (ERR). While unequal
probabilities do not violate the assumptions of other methods, they can present problems for
interpretation. For example, Receiver Operator Characteristics (ROC) and the metrics derived
from them can be used to present the performance metrics for a self-paced BCI. However, when
calculated in a traditional sample-by-sample manner, the overwhelming likelihood of the NC state
means that most BCI transducers will produce what looks like a perfect ROC curve. But even an
apparently low false positive percentage of 0.01% could indicate more than one false positive per
minute if the sample rate was 200 Hz. Thus, the area of interest on a ROC curve is so narrow that
traditional ROC analysis is impractical.
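The arithmetic behind this example is easy to check; a small helper (ours, purely illustrative) converts a sample-by-sample false-positive rate into false positives per minute:

```python
def false_positives_per_minute(fp_rate, decision_rate_hz):
    """Expected false positives per minute, given the fraction of NC
    samples misclassified (fp_rate) and the decision rate in Hz."""
    return fp_rate * decision_rate_hz * 60.0

# An apparently negligible 0.01% false-positive rate at 200 Hz still
# yields more than one false activation per minute:
fp_per_min = false_positives_per_minute(0.0001, 200)  # 1.2
```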
3.3.4. Implicit Assumption of Error-Free NC
One of the keys to useful performance of self-paced BCIs is the availability of long periods of
error-free function. However, some performance evaluation methods may be used to compute the
performance over the IC states only. This corresponds to making the implicit assumption of error-
free periods of NC (Schlögl, Huggins et al. accepted for publication). Performance metrics that
do not incorporate the existence of false positives are of limited use for the evaluation of self-
paced BCIs because they ignore such an important aspect of self-paced BCI operation. Methods
may also ignore false positives by requiring experimental setups that remove or limit the
opportunity for false positives to occur. For example, to produce a high information transfer rate,
events must occur close together. Using such short intervals increases the ITR, but also limits the
time between events, artificially reducing the opportunity for false positives to occur.
Consequently, while a self-paced BCI with a high ITR is desirable, such a description is
incomplete since it does not show how robust the BCI is against false positives.
3.3.5. Decision Rate
Another challenge for the use of performance metrics to compare self-paced BCIs is the large
variation in transducer output rates. Some BCIs produce decisions at a rate identical to the sample
rate (e.g. 200 Hz, (Huggins, Levine et al. 1999)) while others produce decisions at a dramatically
reduced rate (e.g. 20 Hz, (Kübler, Nijboer et al. 2005)). In synchronized BCIs, this has not
been an issue, because most BCIs use a per-trial decision method with a decision rate of about 10-
15 trials/minute.
Some performance metrics, such as the kappa coefficient (Schlögl, Huggins et al. accepted for
publication), are normalized with respect to the number of samples, and would therefore not be
directly dependent on the decision rate. Some performance metrics can be normalized with respect
to the number of samples or with respect to the experiment duration. Other metrics (such as the
HF-difference) ignore time altogether, so that the metrics are only comparable when determined
from test data of the same length. However, performance metrics that are dependent on the
decision rate would be useless for comparison of BCI performance with this order of magnitude
difference in the decision rate. Evaluation of self-paced BCIs requires a metric that can be used to
compare BCIs with different decision rates.
Using a high transducer output decision rate also raises the question of the useful information
transmitted. For example, does a 100 Hz decision rate BCI produce 10 times more useful
information than a 10 Hz decision rate BCI? Probably not, because the useful information is
primarily contained in the transitions within the Estimated User Intent, which do not occur at this
frequency.
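To illustrate how a chance-corrected, sample-normalized metric behaves under these conditions, here is a generic sketch of Cohen's kappa (a textbook implementation, not the one from the cited work). On data dominated by NC, a classifier that always outputs NC achieves high accuracy but zero kappa:

```python
from collections import Counter

def cohens_kappa(reference, output):
    """Cohen's kappa: observed agreement corrected for the chance
    agreement implied by the class frequencies of both sequences."""
    n = len(reference)
    p_observed = sum(r == o for r, o in zip(reference, output)) / n
    ref_freq = Counter(reference)
    out_freq = Counter(output)
    p_chance = sum(ref_freq[c] * out_freq.get(c, 0)
                   for c in ref_freq) / n**2
    return (p_observed - p_chance) / (1.0 - p_chance)

# 90% of the reference samples are NC; always answering "NC" gives
# 90% accuracy but carries no information about the IC events.
reference = ["NC"] * 9 + ["IC"]
always_nc = ["NC"] * 10
kappa = cohens_kappa(reference, always_nc)  # 0.0
```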
3.3.6. Application versus Perspective
The field of BCI research involves engineering and clinical considerations for BCI
performance. Too often, these diverse fields fail to communicate regarding BCI design
requirements. Researchers from a pure engineering background tend to focus on accuracy and
information transfer rates while researchers from a clinical background focus on response time and
robustness to false positives. However, both types of metrics are important for the production of a
clinically useful BCI. Ultimately, the user's perception of the BCI performance will control its
success as a clinical intervention, so user-centric methods should be considered in all stages of the
design. However, for some research and development tasks, more engineering-focused metrics
may provide additional insight into the choice of algorithms.
3.4. Application of Current Performance Metrics to Self-paced BCI systems
Based on the arguments in Chapter 3.3, most metrics seem to be inappropriate for use with
self-paced BCIs. An initial consideration is whether the transducer output is discrete or
continuous and whether the reference information is discrete or continuous. An example of
discrete reference information would be a synchronized BCI experiment with different target cues
(Schlögl, Neuper et al. 2002), (Schlögl, Keinrath et al. 2003); an example of a continuous
reference signal would be the position of a ball on a screen (Gao, Black et al. 2003), (Wu, Black et
al. 2004), (Wu, Gao et al. 2006). The metric to apply depends on the type of transducer output, the
experimental design (which affects the probability of the classes in the data), and the desired
application for the BCI. An overview of the requirements for various metrics is provided in Table
1.
Table 1: Requirements for various criteria

Metric                        | Reference information        | Transducer output | Handles unequal class probabilities | Incorporates NC
------------------------------|------------------------------|-------------------|-------------------------------------|----------------
Error rate, Accuracy          | Discrete                     | Discrete          | No                                  |
Cohen's kappa coefficient     | Discrete                     | Discrete          | Yes                                 |
Wolpaw's MI                   | Discrete                     | Discrete          | No                                  |
Nykopp's MI                   | Discrete                     | Discrete          | Yes                                 |
Continuous MI                 | Discrete                     | Continuous        |                                     |
AUC                           | Discrete                     | Continuous        |                                     |
MSE                           | Continuous                   | Continuous        |                                     |
Correlation coefficient       | 1-D (discrete or continuous) | Continuous        |                                     |
HF-difference                 | Discrete                     | Discrete          | Yes                                 | No
Sensitivity, Specificity, Precision, Recall, F1, a', d-prime | Discrete | Discrete |              |
4. Determining a Quality Reference for Self-Paced Evaluation
As discussed in Chapter 3, one of the most problematic issues in the evaluation of a self-paced
BCI is the ability to determine the user's intent and thus generate a quality reference signal to
compare with the actual transducer output. This chapter focuses on this issue, illustrating various
methods depending on the amount and quality of the experimental information.
As an overview, the ability to determine (or estimate) the subject's intent depends on the task
the subjects performed and on the experimental equipment. For example, the subject may be
performing a task that is observable (e.g. moving their finger) or one that is not (e.g. imagining
moving their finger). If they are moving their finger, then this may be directly measured with
some form of monitor, such as a finger switch or data glove. If, however, there is no observable
phenomenon, then alternative methods are required to determine the subjects' intent. This is
discussed below.
Any method for generating the Estimated User Intent relies on some controlled or measured
experimental variable, such as a finger switch activation, EMG onset, or an experimental cue.
In the remainder of this report we will refer to these variables as Intent Related Measures or
IRMs.
4.1. Determining a Quality Reference in the Presence of an Observable Phenomenon and Real-Time Monitors
If the experiment has an observable correlate to the subjects' intent, such as an actual
movement, then a real-time monitor can be used to record it with a certain spatiotemporal
resolution (e.g. if the subjects control the BCI by moving their finger, then a data glove can be
used to record the movements). In this scenario, the movement information provides a good
approximation of the subjects' intent, although the reader should note that even these "concrete"
observations are still only an approximation. Because brain activity begins before movement
onset, the recorded information will be delayed compared to the true subject intent.
Pros:
- easy to implement
- observations during specific times in the experimental protocol are highly
correlated with the subjects' intent, making the analysis method more direct
- useful for proof of concept
Cons:
- requires observable phenomenon – this often limits these types of studies to able-
bodied individuals and rules out individuals with severe disabilities. It also rules out
those BCI technologies that use motor or other imagery as a control source. This is
a serious limitation.
Even with this approach, the performance analysis is not necessarily a simple comparison of
the monitor output and the transducer output. The delay between the onset of brain activity and
movement onset must be accommodated and additionally, the transducer may introduce
constraints on the output that prevent simple comparison. For example, a finger switch may
record a momentary press but the transducer may have a debounce mechanism that holds it active
for ¼ of a second before releasing. In this example, the switch on/off may be of varying durations,
say 1/32 to 1/8 second, thus producing a pulse of various lengths, whereas the transducer output is
always ¼ second long. As such, the comparison of this self-paced data is not straightforward and
may require some heuristics based on knowledge of the experimental setup and the transducer
characteristics for a meaningful analysis. Performance labels could be assigned to each sample
based on the presence or absence of an intent measure such as EMG. The brain activity in some
time window extending prior to EMG onset would also be included in the active labeled class (or
alternatively in a preparatory labeled class) in order to also label the brain activity that produces
the movement. Once these reference labels are available, there is greater opportunity to apply the
traditional evaluation criteria (as listed above). This principle can also be extended to more than
one class, by using more than one switch. Thus switches for left and right hand movement, foot
movement, tongue movement etc. could be used. The major drawback of this approach is its
reliance on actual movements when the purpose of a BCI is to provide an interface that can be
operated without these movements.
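The labeling scheme sketched above might look as follows; the window lengths, function name and class labels are illustrative assumptions rather than values from any cited study:

```python
def label_samples(n_samples, onsets, fs, pre_window_s=0.5, hold_s=0.25):
    """Assign per-sample reference labels from IRM onsets.

    onsets: list of (sample_index, class_label) pairs, e.g. EMG or
    switch onsets for different movements. Samples from pre_window_s
    before each onset to hold_s after it are labeled with the onset's
    class (capturing pre-movement brain activity); all other samples
    are labeled "NC".
    """
    labels = ["NC"] * n_samples
    pre = int(pre_window_s * fs)
    hold = int(hold_s * fs)
    for idx, cls in onsets:
        for i in range(max(0, idx - pre), min(n_samples, idx + hold)):
            labels[i] = cls
    return labels

fs = 8  # unrealistically low rate, chosen to keep the example readable
labels = label_samples(32, [(8, "left"), (24, "right")], fs)
```

With these settings, samples 4 through 9 are labeled "left" and samples 20 through 25 "right"; all remaining samples stay NC.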
4.2. Determining a Quality Reference in the Absence of an Observable Phenomenon
In cases where there are no observable phenomena, other strategies are required. Four main
approaches are detailed below.
4.2.1. Paced Test Environments
One can approximate self-paced test environments by using a paced and guided environment
with timing cues. This method can, for example, be used for the transducer training phase, where
the pacing and guidance information is used only as reference information and not used by the
BCI transducer to produce an output. The subject will attempt to activate the transducer
according to the guidance at the timing cues. If the subject has the ability to hold and release the
transducer as well (i.e., state-driven transducers), then cues can also be used to indicate desired
release times. These types of experimental protocols provide a gross estimate of when the
subjects intended to activate and release the transducer output and for what purpose, but this
approach has much more temporal uncertainty compared with the technique described in Section
4.1, as there is no manner in which to observe how accurately the subject responded to the timing
cues.
In order for the results to be generalized (for the study to have reasonable external validity),
care must be taken to distribute the timing cues in a configuration that resembles the timing of the
application targeted for BCI operation. An unanswered question regarding the validity of this
approach is how well brain activity related to monitoring and responding to cues relates to actual
self-paced activities since some studies have shown differences in particular types of brain activity
during self-paced and cued movements (Libet, Wright et al. 1982).
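One way to distribute timing cues so that the schedule resembles the target application is to draw inter-cue intervals from a distribution chosen to match that application's timing; the sketch below (our own, with a uniform distribution as a placeholder) illustrates the idea:

```python
import random

def cue_schedule(n_cues, mean_interval_s, jitter_s, seed=None):
    """Generate cue onset times with randomized inter-cue intervals.

    Intervals are drawn uniformly from
    [mean_interval_s - jitter_s, mean_interval_s + jitter_s]; a real
    study would fit this distribution to the timing of the targeted
    application.
    """
    rng = random.Random(seed)
    times, t = [], 0.0
    for _ in range(n_cues):
        t += rng.uniform(mean_interval_s - jitter_s,
                         mean_interval_s + jitter_s)
        times.append(t)
    return times

cues = cue_schedule(10, mean_interval_s=12.0, jitter_s=4.0, seed=1)
```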
4.2.2. True Self-Paced Test Environments
In true self-paced test environments there are no timing cues and the subjects determine when
to act on their own. As such, self report is the only form of reference information available, with
the subject reporting state errors and possibly timing/response information as well, or by reporting
which mental task was/will be performed.
There are several known issues with self report, including reporting bias and temporal accuracy
of the report. For example, as brain activity has been shown to precede a subject's conscious
awareness of an intention to act (Libet, Gleason et al. 1983), even the most meticulous self-report
does not necessarily provide an accurate reflection of the timing of the brain states. Although this
difficulty can be addressed to varying degrees, one of the most critical issues with self-report is
that the reporting itself is a mental activity which needs to be controlled for so that it does not
directly interfere with the experiment outcomes.
The closer the self-reporting activity is to error-reporting in the target application, the higher
the external validity of these studies. For example, if the subject is only reporting activation
errors in their self-report, then this parallels a keyboard user recognizing errors and pressing the
backspace or delete key. The more complicated the self-report, the less likely the study will
have reasonable external validity.
4.2.3. System Paced Test Environments
As an approximation to true self-paced environments, a system-paced environment can be used to
limit times where a subject can operate the BCI. Thus with this extra information, one can reduce
the amount of self-report needed. This approach may be considered a hybrid between
synchronized and true self-paced testing.
4.2.4. Computationally Intensive Event Detection
An option for producing reference data for self-paced experiments would utilize
computationally intensive or iterative analysis methods which produce accurate results, but cannot
be performed in real-time or on single trials. While these methods would be inappropriate as the
basis of a BCI, they might be useful to produce reference data from unlabeled records of brain
activity. Most other options for producing reference data during self-paced BCI experiments
involve limiting the freedom of the user to self-direct the action.
4.3. Examples
To illustrate the problems in determining the Estimated User Intent, several diagrams are
presented, starting from data collection.
4.3.1. Step 1: Running the Experiment
Depending on whether the experiment is online or offline, data are collected differently. For an
online study, the data are recorded directly. Recorded data include brain activity (e.g. EEG),
Intent-Related Measures (IRMs, such as EMG onset, finger switch onset, timing cues or subject
self-report), and the output of the transducer under test (TO); see Figure 11.
[Figure 11 diagram: recorded brain activity (EEG); IRM trace showing EMG or finger switch
onsets for acts A and B; actual transducer output switching among states N, A and B.]
Figure 11. Example signals from an online experiment (with event-driven transducer using testing protocol
with EMG or finger switch observable phenomena)
For an offline study, data are retrieved from prerecorded data, which include brain activity
(e.g. EEG) and Intent-Related Measures (IRMs, such as EMG onset, finger switch onset, timing
cues or subject self-report). The transducer outputs are generated from the prerecorded brain
activity for each transducer under test (TOi). The offline study is illustrated in Figure 12.
[Figure 12 diagram: pre-recorded brain activity (EEG); IRM trace showing EMG or finger switch
onsets for acts A and B; actual outputs of transducer A and transducer B switching among states
N, A and B with different timing.]
Figure 12. Example signals from an offline experiment (with multiple event-driven transducers using a testing
protocol with EMG or finger switch observable phenomena). In offline experiments, the transducer outputs
are produced after the brain activity and IRMs are recorded. This example emphasizes the differences in
transducer outputs (i.e., Transducer B has a longer signal processing delay than Transducer A) to illustrate the
possible differences in transducer output timing.
4.3.2. Step 2: Generating the Estimated User Intent Based on Knowledge about the IRMs and the Experimental Protocol
Depending on the type of experiment run (referring now to Chapter 4), the User's Intent
sequence can be estimated in different ways. For example, for experiments without self-report, the
UI can be estimated directly from the Intent-Related Measure (IRM), whether that is a switch
output or a cue, as shown in Figure 13.
[Figure 13 diagram: IRM trace (recorded EMG or finger switch onsets for acts A and B) and the
Estimated User's Intent derived from the IRM (a one-time, fixed reference) switching among
states N, A and B.]
Figure 13. Example of the estimated User's Intent for non-self-reported studies. This estimation utilizes
physiological and/or experimental knowledge about how the IRM relates to the User's Intent.
For experiments with self-reported errors, the User Intent can only be estimated relative to the
observed transducer output(s), as shown in Figure 14.
[Figure 14 diagram: IRM trace of self-reported errors (relating to negative intent) for acts A and
B; actual transducer output switching among states N, A and B; Estimated User's Intent derived
from the self-reported errors and the actual transducer output (a one-time, fixed reference).]
Figure 14. Example of estimated User's Intent for self-reported studies, noting that the actual transducer
output is required to estimate the reference in this case.
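The construction in Figure 14 can be sketched as follows: take the actual transducer output as the initial estimate and relabel the segments the subject reported as errors. The function name and the rule of relabeling reported errors as NC are our simplifying assumptions:

```python
def estimate_intent(transducer_output, reported_error_segments):
    """Estimate per-sample user intent from the actual transducer
    output and self-reported error segments.

    reported_error_segments: (start, end) sample ranges (end
    exclusive) that the subject reported as erroneous activations;
    those samples are relabeled "NC", i.e. the activation was not
    intended.
    """
    intent = list(transducer_output)
    for start, end in reported_error_segments:
        for i in range(start, end):
            intent[i] = "NC"
    return intent

output = ["NC", "A", "A", "NC", "B", "B", "NC", "A", "NC"]
# The subject reports that the activation at samples 4-5 was an error:
intent = estimate_intent(output, [(4, 6)])
```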
5. Transducer Output Performance Mark Up
The inherent uncertainty in the timing of the Estimated User Intent (EUI), introduced in
Chapter 4, precludes a direct comparison of the EUI with the output(s) of the transducers under
test. As such, after generating an EUI reference, we need a heuristic method of identifying and
marking correct and incorrect responses. Once we have those labels, we can calculate summary
statistics based on specific performance metrics (as discussed in Chapter 6).
In this chapter, we delineate heuristic methods, based on the EUI (our reference), for marking
up the transducer output.
5.1. Inherent Uncertainty in the Estimated User Intent
In Chapter 4 we presented several approaches for conducting self-paced experiments (or
studies that approximate self-paced operation). The main points we wish to stress are that there
are various approaches, and each approach corresponds to a different degree of temporal
uncertainty in the estimate of the timing of the user's intended output. Thus, when comparing the
Estimated User Intent reference signal to the actual transducer output, data interpretation
heuristics are required to manage this uncertainty. These heuristics control how data in the areas
of uncertainty are interpreted.
Let's look at an example:
[Figure 15 graphic: aligned time series of the true subject intent, the finger switch (monitor), the transducer output, and the expected response window, on a time axis from 0 to 1 second.]
Figure 15. Illustration of temporal uncertainty in self-paced data evaluation of an event-driven transducer.
In Figure 15, the first line represents the timing of the subject's intent to activate an (event-
driven) BCI transducer. This assumes that the subject actually moved his/her finger when trying
to drive the BCI transducer and that this movement was recorded by a finger switch (second line).
The actual transducer output is shown in the third line and often contains signal processing delays
that cause the actual output to trail the subject's intent (at least for novice users; more experienced
users may adapt to this delay). To accommodate the uncertainty in the monitoring equipment
(finger switch) and the signal processing delay in the transducer, heuristics are used to translate
the finger switch output into a window of time in which actual transducer output activations will
be related to the observed intent.
In the most general case, we do not necessarily have a finger switch, but rather some Intent-
Related Measure generated from real-time monitors (such as a finger switch), experimental
constraints and/or self-report.
5.2. General Performance Mark Up Algorithm
After much discussion, we have distilled the mark-up process into a general algorithm.
Researchers may implement different "subroutines" within this algorithm, but what we present
below is the agreed-upon approach to performance mark up.
Step 1 - Define an Expected Response Window (ERW): All performance mark-up algorithms
that we could envision are based around an Expected Response Window, or ERW, so defining the
ERW is the first step. The ERW defines the time period around state transitions in the Estimated
User Intent where the researcher expects a transducer output response to the intended state
change. For example, we propose using two parameters, ERWstart and ERWend, to define the
ERWs. These parameters can be positive or negative, defining times after or before each
Estimated User Intent transition. Figure 16 illustrates three different cases for an event-driven
paradigm. In the first case (Figure 16a), the window is located after the transition (both
parameters are positive). In the second case (Figure 16b), the window is before the transition
(both parameters are negative), and in the last case (Figure 16c), the window includes the
transition (one parameter (ERWend) is positive and the other (ERWstart) is negative). Negative
values would often be seen, for example, in the case of self-report, where the subject reports their
intent after the transducer output has been seen and interpreted, and in the case of actual
movements, where the brain activity is expected to start before the movement.
Figure 16. Illustration of ERW definition based on the Estimated User Intent (EUI) and the ERWstart and
ERWend parameters. Please note that in case (b), the values of ERWstart and ERWend are both negative, and in
case (c), only the value of ERWstart is negative.
As the temporal uncertainty increases, the widths of the ERWs are bound to increase. Also,
experimenters may use different ERW sizes for different task conditions, or within an experiment,
to quantify the timing characteristics of the transducers under test. An extension of this method
might be to use weighted ERWs, but that would add another factor to the analysis, which we want
to avoid for the present.
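As a concrete illustration, the ERW construction of Step 1 can be sketched as a small function. This is a minimal sketch, not the authors' implementation; the function name and the representation of EUI transitions as a list of times are our own assumptions.

```python
def expected_response_windows(transition_times, erw_start, erw_end):
    """Build an Expected Response Window around each EUI state transition.

    transition_times : times (in seconds) of Estimated User Intent transitions
    erw_start, erw_end : window offsets (s) relative to each transition; either
        may be negative, placing the window partly or wholly before the
        transition (cases (b) and (c) in Figure 16).
    Returns a list of (window_start, window_end) pairs.
    """
    if erw_end <= erw_start:
        raise ValueError("ERWend must be later than ERWstart")
    return [(t + erw_start, t + erw_end) for t in transition_times]
```

For example, with ERWstart = -0.5 s and ERWend = 1.0 s (case (c) in Figure 16), a transition at t = 10 s yields the window (9.5, 11.0).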
Step 2 - Run the Mark-Up Algorithm: Define and run a Transducer Output Mark-Up Algorithm
to label the correct and incorrect responses in the transducer output. This algorithm has the
following specification:
a) inputs:
a. Estimated User Intent (or the IRMs if they represent the EUI) (described in
Chapter 4)
b. ERW definition, e.g., ERWstart and ERWend parameter definitions (described above)
c. the output(s) of the transducer(s) under test (discussed in Chapter 2-4)
b) Internal Comparison Method (ICM)
a. the method to mark up samples within the ERW (detailed below with examples)
b. the method to mark up samples outside of the ERWs
c. the method to determine the response time
c) output:
a. performance mark up labels that indicate correct, incorrect or unknown responses
for each time point in the transducer output (examples below)
As such, these components (including detailed descriptions) are what we would expect a
researcher to report in a self-paced BCI evaluation. Note that there are many possible algorithms
for generating the performance mark up.
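The specification above can be read as a small algorithmic skeleton into which the three ICM methods are plugged as interchangeable subroutines. The sketch below is our illustration of that structure, not a prescribed implementation; sequences are per-sample state lists and ERWs are (start, end) sample-index pairs.

```python
def mark_up_transducer_output(eui, to, erws, inside_fn, outside_fn, rt_fn):
    """General performance mark-up skeleton (Step 2).

    eui, to    : per-sample Estimated User Intent and transducer output states
    erws       : list of (start_idx, end_idx) Expected Response Windows
    inside_fn  : (eui, to, window) -> {sample_index: label} inside the ERW
    outside_fn : (eui, to, index) -> label for samples outside all ERWs
    rt_fn      : (eui, to, window) -> response time, or None
    Returns (labels, response_times).
    """
    labels = [None] * len(to)
    covered = set()
    response_times = []
    for window in erws:
        start, end = window
        covered.update(range(start, end + 1))
        for i, label in inside_fn(eui, to, window).items():
            labels[i] = label                      # ICM method for inside the ERW
        rt = rt_fn(eui, to, window)                # ICM method for response time
        if rt is not None:
            response_times.append(rt)
    for i in range(len(to)):
        if i not in covered:
            labels[i] = outside_fn(eui, to, i)     # ICM method for outside the ERWs
    return labels, response_times
```

A researcher's choice of the three subroutines fully determines the mark-up, which is why we ask that each be reported explicitly.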
5.2.1. The Internal Comparison Method (ICM)
Conceptually, the first step of the Internal Comparison Method (ICM) is to produce an
Intermediate Estimated User Intent. We have attempted to illustrate this in Figure 17.
[Figure 17 graphic: the IRM (recorded EMG onset or finger switch onset), the Estimated User Intent derived from the IRM (one-time, fixed reference), the expected response windows (ERWs) with their RTstart and RTend bounds, and the resulting Intermediate Estimated User Intent (IEUI) sequence with fuzzy labels (Af, Bf, N).]
Figure 17. Illustration of an Intermediate Estimated User Intent sequence in relation to the Estimated User
Intent and the ERW definition. If there are multiple transducers under test in an offline study, different ERWs,
and thus different Intermediate Estimated User Intent sequences, may be required. The Intermediate Estimated
User Intent (IEUI) is a conceptual sequence that represents the expected response with "fuzzy" intent labels.
(Note: periods marked Af (or Bf) in the IEUI sequence represent windows of time where at least one A (or B)
response is expected in the transducer output.)
Most of us can relate to the first three time series in Figure 17. The last time series is just a
conceptual representation of the expected output values, where periods marked Af (or Bf) in the
Intermediate Estimated User Intent (IEUI) sequence represent windows of time where at least
one A (or B) response is expected in the transducer output. It will be up to the fuzzy comparison
block to interpret the transducer output (TO) in these time periods.
Note that a comparison algorithm does not necessarily generate the Intermediate Estimated User
Intent sequence explicitly; instead, it may infer it from the EUI and the ERW definition during
actual computation. Regardless, the concept is useful for illustrative purposes.
The second step of the ICM is to compare the Intermediate Estimated User Intent, IEUI,
(whether explicit or implied) to the actual transducer output(s) and generate performance labels
(see examples below). This is a "fuzzy" comparison in that there can be significant uncertainty in
the timing of the actual response relative to the user's intent. We realize that there are many
approaches to doing this fuzzy comparison, so we will try to illustrate the method through
examples.
5.2.1.1. Example 1
As a first example, let us assume a researcher has reported the following strategy for marking
up the output of event-driven transducers:
"Within the ERWs, if there are one or more correct responses in the ERW (TOstate =
IEUIstate), label the first as a correct response ("a hit") for the desired state. Set the
response time of the hit as the time between the EUI and the first correct response in the
ERW. Ignore other samples in the ERW; that is, treat them as if their corresponding intent
was "unknown" and do not include them in summary statistics. Otherwise, create an
incorrect response ("a miss") label centered in the middle of the ERW. For non-fuzzy
IEUI values (outside the ERWs), do a direct sample-by-sample comparison and label any
incorrect responses (i.e., not N) as 'spontaneous errors'."
This process is complete, as it defines all the aspects of our ICM specification:
1) The method to mark up samples within the ERW: if there was a correct response in
the ERW, the mark-up state at the time of the first correct response is 'correct response' (or
'hit') and all other samples are labeled 'unknown'. If there is no correct response, then
mark the sample at the center of the ERW as 'missed response' and, as in the other case,
label all other samples 'unknown'.
2) The method to determine the response time: for correct responses, the time from the
EUI to the first correct response. For no responses, the time from the EUI to the middle of
the ERW.
3) The method to mark up samples outside of the ERWs: non-N responses are marked as
'spontaneous errors' on a sample-by-sample basis.
This example also states that the output samples labeled 'unknown' were not included in the
summary statistics. This issue will be addressed in the next chapter.
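A sketch of how Example 1's rules could be implemented follows. This is our own rendering of the quoted strategy, working in sample indices; the function name, label strings and argument layout are assumptions.

```python
def mark_up_example1(eui, to, transitions, erw_start, erw_end):
    """Sketch of Example 1's mark-up rules for an event-driven transducer.

    eui, to     : per-sample intent / transducer-output states ('A', 'B', 'N')
    transitions : sample indices of EUI transitions into an IC state
    erw_start, erw_end : ERW bounds in samples, relative to each transition
    Returns (labels, response_times_in_samples).
    """
    labels = [None] * len(to)
    in_erw = set()
    rts = []
    for t in transitions:
        lo, hi = max(0, t + erw_start), min(len(to) - 1, t + erw_end)
        in_erw.update(range(lo, hi + 1))
        desired = eui[t]
        # first correct response inside the ERW, if any
        hit = next((i for i in range(lo, hi + 1) if to[i] == desired), None)
        for i in range(lo, hi + 1):
            labels[i] = 'unknown'          # ignored in summary statistics
        if hit is not None:
            labels[hit] = 'hit'
            rts.append(hit - t)            # time from EUI to first correct response
        else:
            labels[(lo + hi) // 2] = 'miss'
            rts.append((lo + hi) // 2 - t)
    # outside the ERWs: direct sample-by-sample comparison against N
    for i in range(len(to)):
        if i not in in_erw:
            labels[i] = 'correct' if to[i] == 'N' else 'spontaneous error'
    return labels, rts
```

Note how the three ICM components appear explicitly: the within-ERW rule, the response-time rule, and the outside-ERW rule.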
5.2.1.2. Example 2
As another example, a second researcher might use this strategy:
"For samples within the ERWs, label the first correct response in the ERW
(TOstate(t) = IEUIstate(t)) with a label of the form TOstate(t)-IEUIstate(t), e.g., "A-A" if the
desired state was 'A'. If there were no correct responses, label the first incorrect response
as TOstate(t)-IEUIstate(t). Set the label of all other points in the ERW as TOstate(t)-'N';
in other words, treat all other points as if the corresponding user's intended output was
'N'. For samples outside the ERWs, label all samples as TOstate(t)-IEUIstate(t), which
may lead to labels such as 'N-N' or 'A-N'."
This is incomplete, as it defines all the aspects of our ICM specification except the response time:
1) The method to mark up samples within the ERW: if there was a correct response in
the ERW, mark that sample as TOstate(t)-IEUIstate(t), e.g., "A-A" if the desired state was
'A'. Otherwise, if there was an incorrect response, say 'B' when 'A' was desired, label the
first incorrect response 'B-A'. Label all other points in the ERW as TOstate(t)-'N', e.g.,
'A-N', 'B-N' or 'N-N'.
2) The method to determine the response time: not defined!
3) The method to mark up samples outside of the ERWs: all samples are labeled
TOstate(t)-IEUIstate(t), which may lead to labels such as 'N-N' or 'A-N'.
5.2.1.3. Other ICM examples
There are many other ways one could do this fuzzy comparison. To illustrate the point
further, Figure 18 shows examples of the different types of TO mark-up labels produced by
various mark-up strategies. In this figure, the first mark-up strategy is to "find the first correct
response and ignore the others." The second strategy is to "find the first correct response and
treat the others as 'N', again outputting basic label pairs." The third strategy is similar to the first
except that it outputs semantic labels such as TP, FP, and so on.
[Figure 18 graphic: a sample transducer output sequence and its Intermediate Estimated User Intent sequence (expected response with "fuzzy" intent labels), marked up under three strategies: (A) find the first correct response and ignore its neighbors within the ERW, labeling them U for "unknown intent" and excluding them from the summary statistics; (B) find the first correct response and treat the neighbors as N, outputting basic label pairs; and (C) the same as (A) but with semantic labels TP (True Positive), TN (True Negative), FP (False Positive), FN (False Negative) and UN (unknown).]
Figure 18. Example illustrating different Internal Comparison Methods. Three different markup strategies
are illustrated.
To summarize, researchers have to select and justify
1. their IRM (as described in Chapter 4)
2. their method of estimating the EUI (as described in Chapter 4)
3. their ERW definition (i.e., their ERWstart and ERWend choices relative to their IRM,
possibly different for different transducers)
4. the three components of their Internal Comparison Method:
a. the method to mark up samples within the ERW
b. the method to determine the response time
c. the method to mark up samples outside of the ERWs
These are all that study designers or manuscript reviewers have to examine to estimate the
internal and external validity of the study. A full ICM example is provided in Appendix B.
Please note that the examples used herein depicted only event-driven transducers. The method
applies equally to state-driven transducers.
6. Metrics for Self-Paced Evaluation
In the previous two chapters we delineated various approaches to analyzing data collected
from the operation of a self-paced, discrete-output BCI transducer. In this chapter we discuss the
performance metrics that can be used to summarize the performance labels generated during the
self-paced analysis.
The selection of a metric depends on the research question(s) being asked. For instance, the
following questions may be inquiries into how well someone can operate the transducer:
- How does it respond to transitions to an IC state from an NC state? What type and
frequency of errors would the user expect?
- What type and frequency of errors does one see when the user is in an NC state?
- How quickly does the transducer respond? Are there delays? Are these delays consistent?
- How long can a person hold it in a particular state? (for state-driven transducers)
- How well can a person release it from a held state? (for state-driven transducers)
- How well can someone switch between IC states? (for multi-IC state-driven transducers)
In these questions the reader will see that our bias in this chapter is towards metrics that provide
usability information from the user's perspective. These examples also capture the two general
categories of metrics: error metrics and timing-characterization metrics. Before presenting the
error and timing metric methods, we define a few terms and symbols that underlie the metrics.
6.1. Definition of Terms and Symbols
6.1.1. Observation Time
The observation time is the duration of the experiment over which the transducer output
labeling is made. It is denoted as T and measured in seconds/minutes/hours.
6.1.2. NC Time Periods
The NC time periods are the periods between the IC states where the user is known
(assumed) to be in the NC state. These are denoted as TNCi.
6.1.3. Inter-FA Time Periods
The Inter-False-Activation time periods (or IFA Periods) are the periods within the NC time
periods and are denoted as TIFAi. Summarizing IFA period lengths characterizes the distribution of
false activations during NC time periods.
6.1.4. Response Time
The response time is the amount of time between when the user initiates an activation or
release and the corresponding response in the transducer output. For event-driven transducers
these are collectively denoted as RTAi. For state-driven transducers, where there are two basic
transition types, these are collectively denoted as RTAi and RTRi for activation RT and release RT,
respectively.
6.1.5. Hold Periods / Hold Time
The hold period is the period during which the transducer output enters and stays in a specific
IC state. The hold period ends when the output changes to another state (see the glitch definition
below). The lengths of these hold periods are referred to as hold times. They are denoted as THi
and exist only in state-driven paradigms. Hold times are used to determine whether the user is able
to perform a specific operation, such as mouse-like drag-and-drop operations. Examples of hold
periods are presented in Figure 19.
6.1.6. Glitch
We have introduced the term glitch to refer to a short state transition away from a held state that is
viewed by the user as a temporary deviation. From the user's perspective, the hold time is not
interrupted by a glitch, even though the transducer output has spontaneously changed. The hold is
considered interrupted only if the glitch duration is longer than the "maximum glitch duration",
defined as the maximum time that a glitch is allowed to last and denoted Tglitch. Any change of
state longer than this duration is considered the end of the hold time and not a glitch. It is up to the
experimenter to choose this value according to their beliefs about the user's perspective. A
threshold of zero means that any state change will mark the end of the hold time; the default
maximum glitch duration is zero, i.e., no glitches are allowed. A separate maximum glitch
duration can be set per IC state, for example a high maximum glitch duration for mouse-pointing
tasks and a low one for mouse-clicking tasks.
Figure 19. Basic hold-time period examples: one hold-time period (left), two hold-time periods (middle), three
hold-time periods (right). The middle and right examples can also be considered as only one hold-time period
if the glitches are shorter than the maximum glitch duration.
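The glitch-tolerant hold-time computation described above can be sketched as follows; the function and argument names are our own, and the glitch duration is expressed in samples rather than seconds.

```python
def hold_times(output, ic_state, t_glitch):
    """Hold-time period lengths (in samples) for one IC state.

    output   : per-sample transducer output states
    ic_state : the held state of interest, e.g. 'A'
    t_glitch : maximum glitch duration in samples; excursions away from
               ic_state lasting no longer than this do not end the hold.
    Returns a list of hold-period lengths.
    """
    periods = []
    i, n = 0, len(output)
    while i < n:
        if output[i] != ic_state:
            i += 1
            continue
        start = end = i
        j = i + 1
        while j < n:
            if output[j] == ic_state:
                end = j
                j += 1
            else:
                k = j                      # measure the excursion away
                while k < n and output[k] != ic_state:
                    k += 1
                if k < n and (k - j) <= t_glitch:
                    j = k                  # short glitch: hold continues
                else:
                    break                  # too long (or record ends): hold ends
        periods.append(end - start + 1)
        i = end + 1
    return periods
```

With Tglitch = 0 the middle example of Figure 19 yields two hold periods; with a tolerance covering the excursion, it collapses into one, and the glitch samples count toward the hold (matching the user's-perspective definition).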
6.2. Error Metric Definitions
In general, the authors agreed that the most rudimentary measure for summarizing error is
the confusion matrix, which is presented in the next subsection. Other error-related metrics
can be derived from that representation and are discussed afterward.
6.2.1. Confusion Matrix
The concept of a confusion matrix (CM) is used in communication and coding theory to
characterize a communication channel. In these contexts, the confusion matrix is a table that
summarizes the states sent (desired) versus the states received. A multi-state example is drawn in
Figure 20.
                 actual output
desired        A      B      N
   A          OAA    OAB    OAN
   B          OBA    OBB    OBN
   N          ONA    ONB    ONN
Figure 20. A confusion matrix for two IC states (with corresponding outputs A and B) and an NC state (with
output N). OXY represents the number of performance labels observed when state X was desired and Y was
actually output.
For transducers that produce only two states (an IC state and an NC state), the confusion matrix
reduces to the following diagrams. The table on the left is the general case and the one on the
right imposes 2-state statistical labels.

a)               actual output
   desired        A      N
      A          OAA    OAN
      N          ONA    ONN

b)               actual output
   desired        A          N
      A       TP (hit)   FN (miss)
      N          FP         TN
Figure 21. a) A confusion matrix for one IC state (with corresponding output A) and an NC state (with output
N). b) The same matrix with 2-state statistical labels: true positive (TP), false positive (FP), false negative
(FN) and true negative (TN), noting that TPs and FNs have also been referred to in some published works as
'hits' and 'misses'.
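Tallying a confusion matrix from per-sample (desired, actual) label pairs, such as the 'A-A' and 'A-N' pairs of Example 2, can be sketched in a few lines. This is our own illustration; the nested-dict return format is an assumption.

```python
from collections import Counter

def confusion_matrix(desired, actual, states=('A', 'B', 'N')):
    """Tally O[X][Y]: how often state X was desired while Y was output.

    desired, actual : equal-length per-sample state sequences
    Returns a nested dict, e.g. cm['A']['N'] gives O_AN.
    """
    counts = Counter(zip(desired, actual))
    return {d: {a: counts[(d, a)] for a in states} for d in states}
```

For the 2-state case of Figure 21, cm['A']['A'] would be the TP (hit) count and cm['N']['A'] the FP count.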
Some researchers have proposed synchronized transducer designs that generate an "unknown"
output when there is not enough internal evidence to select one of the other classes. This approach
can also be applied to self-paced designs. In terms of the CM, it can be captured in the modified
multi-state confusion matrix depicted in Figure 22.
                 actual output
desired        A      B      N    unknown
   A          OAA    OAB    OAN    OA?
   B          OBA    OBB    OBN    OB?
   N          ONA    ONB    ONN    ON?
Figure 22. A confusion matrix for two IC states (with corresponding outputs A and B), an NC state (with
output N) and an unknown output state.
Researchers working with state-driven transducers may also be interested in reporting how
well their transducers respond from all possible states. For this they could report multiple
confusion matrices, as shown in Figure 23, where each CM characterizes the responses given a
different previous state. (As above, these matrices could be extended to include an "unknown"
state output.) During our discussions, we came up with several semantic terms to describe the
possible transitions (and non-transitions) observed in the multiple-CM case. For example, entries
of the form OXYY, OXXX, OXXY, OXYX and OXYZ were referred to as Correct Transitions (CTXY),
Correct Maintained States (CMSX), Spontaneous Transitions (STXY), Missed Transitions (MTXY)
and Incorrect Transitions (ITXYZ), respectively.
CM when changing state from A:
                 actual output
desired        A       B       N
   A          OAAA    OAAB    OAAN
   B          OABA    OABB    OABN
   N          OANA    OANB    OANN

CM when changing state from B:
                 actual output
desired        A       B       N
   A          OBAA    OBAB    OBAN
   B          OBBA    OBBB    OBBN
   N          OBNA    OBNB    OBNN

CM when changing state from N:
                 actual output
desired        A       B       N
   A          ONAA    ONAB    ONAN
   B          ONBA    ONBB    ONBN
   N          ONNA    ONNB    ONNN
Figure 23. Multiple Confusion Matrices for two IC states and one NC state (and no “unknown” state).
To summarize, the confusion matrix provides a basic form for summarizing performance
labels. Depending on the mark-up algorithm, though, it may not capture all observed data.
Specifically, there may be samples labelled "ignored", "unknown" or "do not care" within the
ERWs. These samples need to be accounted for in the analysis.
6.2.2. Other metrics
Although we have agreed on summarizing observations in confusion matrices, the search for a
single, meaningful performance metric has been elusive. As previously discussed in Chapter 3, a
few general, higher-level metrics such as the HF difference or overall classification accuracy have
been used, but these are not meaningful on their own.
The primary issue is that we have a multi-dimensional error space: errors related to the IC and
NC states. Inherently there is an error trade-off when calibrating a transducer: IC errors are
decreased at the expense of false activations in NC.
For the 2-state output case, ROC curves have been useful representations of the two-
dimensional error space in other fields. However, ROC curves can only be generated offline, that
is, they are not appropriate for real-time (online) evaluation, and, realistically, only a narrow
portion of the curve (the area with low NC error percentages) is meaningful. As a default,
researchers have summarized each error dimension separately. The most common practice is the
reporting of true positive (TP) and false positive (FP) percentages (what some have called
"rates"). The percentage of TPs is measured as the number of successful IC-related activations
relative to the number of attempted IC states. The percentage of FPs is measured as the number of
false activations during NC relative to the total number of samples in the NC periods. One issue
with this practice is that the reported percentages are related to the transducer output rate and not
normalized to time, so it is difficult to interpret these results from a user-centric perspective. For
example, an FP percentage of 1% for a transducer that generates an output once every 1/10 of a
second corresponds to an expected FP every 10 seconds, which is not useful for most self-paced
applications. In contrast, if the transducer generated an output every second, the same percentage
would reflect an expected error every 100 seconds, which may suit more applications. So it would
be preferable to have these types of percentages normalized to time; thus it would be more useful
to express the expected FPs as a temporal rate relative to Σ TNCi.
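The time normalization argued for here amounts to scaling the sample-based FP percentage by the transducer's output rate. A minimal sketch (our own function, not from the text):

```python
def fp_per_minute(fp_percentage, output_rate_hz):
    """Convert a sample-based FP percentage into a time-normalized rate.

    fp_percentage  : false activations as a % of NC-period samples
    output_rate_hz : transducer outputs per second
    Returns the expected number of false positives per minute of NC time.
    """
    return (fp_percentage / 100.0) * output_rate_hz * 60.0
```

This reproduces the example in the text: 1% at 10 outputs/s gives 6 FPs per minute (one every 10 s), while 1% at 1 output/s gives 0.6 per minute (one every 100 s).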
To overcome the problem of multiple (conflicting) metrics (e.g., TP and FP), some single-
number metrics have been discussed in Chapter 3 and in (Schlögl, Huggins et al., accepted for
publication). Specifically, the overall accuracy, the error rate, the area under the ROC curve
(AUC), A-prime, d-prime, the F1 metric (i.e., the harmonic mean of precision and recall) and the
HF-diff, as well as Cohen's Kappa coefficient and the mutual information of the discrete output,
are described there. The overall accuracy (and error rate) is discouraged because it gives the states
with more samples a larger weight. The AUC, A-prime, d-prime, F1 metric and HF-diff are
defined only for two states. Thus, only the mutual information and Cohen's Kappa coefficient can
be used for systems with more than two states. Specifically, the Kappa coefficient weights each
state equally and measures the separability between the classes independently of their sample
sizes. This makes the mutual information and the Kappa coefficient possible options for
summarizing the performance in a single metric.
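As an illustration, Cohen's Kappa can be computed directly from a confusion matrix. This sketch is our own; it takes the matrix as a list of rows, with rows as desired states and columns as actual outputs.

```python
def cohens_kappa(cm):
    """Cohen's Kappa from a square confusion matrix (list of rows).

    Chance-corrected agreement between desired states (rows) and
    actual outputs (columns).
    """
    n = float(sum(sum(row) for row in cm))
    # observed agreement: fraction of samples on the diagonal
    p_obs = sum(cm[i][i] for i in range(len(cm))) / n
    # expected chance agreement from the row and column marginals
    p_exp = sum((sum(cm[i]) / n) * (sum(row[i] for row in cm) / n)
                for i in range(len(cm)))
    if p_exp == 1.0:
        return 1.0                 # degenerate case: all mass in one cell
    return (p_obs - p_exp) / (1.0 - p_exp)
```

Kappa is 1 for perfect agreement and 0 for agreement at chance level, which is what makes it usable for self-paced data with a dominant NC state.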
The requirements for a metric depend on the research question being asked, which is generally
related to the usability of the BCI for a particular (target) application. For example, in an
application evaluating point-and-click control with a mouse, random mouse movements may be
considered less serious than random mouse clicks.
To be considered as a metric for self-paced BCI systems, a metric must fulfil the following
requirements:
- the metric must be derivable from basic performance information or other metrics;
- the metric must be useful for at least one application/task;
- the metric must include the NC state, because this state is specific to self-paced BCIs.
6.3. Temporal Characterization of Transducer Output
The confusion matrices and higher-level metrics presented above summarize the overall error
percentages seen in the experimental data. In this section, we discuss the temporal
characteristics of the transducer output.
6.3.1. Response-Time Characterization
Response times are most generally summarized in histograms such as those shown in Figure 24.
Alternatively, these curves can be statistically modelled and represented by statistics such as the
mean and variance. Given our comparison methodology, response times will always be bounded
by the ERWstart and ERWend of the Expected Response Window; thus, for initial RT
characterization, the ERW should be generously wide.
The response-time metrics (histogram or statistical model parameters) are used to determine
whether a BCI design is suited to a particular application or task. For example, a BCI design can
produce a response that is much too late for a particular application (see Figure 24). The response-
time histogram should be the most frequently used form. For specific uses, the histogram can be
refined, e.g., by using only the responses to a specific state Z (CTYZ, i.e., the OYZZ entries). This
can be used to determine whether one of the states is poorly suited to the application.
Figure 24. Examples of BCI-design response-time histograms that match (A) and don't match (B) the
application's RT requirements.
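Summarizing a list of response times into the mean, standard deviation and a simple fixed-width-bin histogram might look like the sketch below; the bin width and return format are our own choices.

```python
def rt_summary(rts, bin_width=0.1):
    """Summarize response times (s): mean, standard deviation, histogram.

    rts : non-empty list of response times in seconds
    Returns (mean, std, hist), where hist maps each bin's start time to the
    number of response times falling in [start, start + bin_width).
    """
    n = len(rts)
    mean = sum(rts) / n
    std = (sum((x - mean) ** 2 for x in rts) / n) ** 0.5
    hist = {}
    for x in rts:
        bin_start = round(int(x / bin_width) * bin_width, 10)
        hist[bin_start] = hist.get(bin_start, 0) + 1
    return mean, std, hist
```

Comparing such a histogram against the application's RT requirement band (as in Figure 24) shows at a glance whether the design responds quickly and consistently enough.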
6.3.2. NC Period Characterization
Like the RT metrics, summarizing NC period lengths in the form of a histogram or statistical
parameters is useful for characterizing the experimental conditions. From it, one can determine
how frequently a subject attempted control, which can be used to determine whether the results
are applicable to specific target applications. For instance, if the NC period lengths have a mean
of 5 seconds and a standard deviation of 1 second, then the reported results are not appropriate for
the control of a wheelchair, where the application's NC periods are generally much longer and
more widely distributed in time.
6.3.3. Inter-FA Period Characterization
Summarizing Inter-FA period lengths is useful for determining how false activations are
distributed throughout the NC periods. If this metric, which can be expressed as a histogram or
statistical parameters as above, is uniformly distributed, then one can assume that the false
activations are random. If the distribution is biased towards zero, then the false activations tend to
appear in patches.
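Given a per-sample mark-up of one NC period, the inter-false-activation period lengths are simply the index gaps between successive false-activation labels. A minimal sketch (the label string is our assumption):

```python
def inter_fa_periods(labels, fa_label='spontaneous error'):
    """Lengths (in samples) of the gaps between successive false
    activations within one NC period's mark-up labels."""
    fa_indices = [i for i, lab in enumerate(labels) if lab == fa_label]
    return [b - a for a, b in zip(fa_indices, fa_indices[1:])]
```

A histogram of these gaps, pooled over all NC periods, gives the Inter-FA characterization described above.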
6.3.4. Hold-Time Period Characterization
As above, the hold-time period lengths can be characterized as a histogram or statistical
parameters. Again, this metric allows one to rapidly identify whether the tested BCI design is
appropriate for a specific application.
6.4. Summary
In this chapter, we defined performance measures based on error and temporal response. To
summarize the chapter, we propose a metric dependence tree (see Figure 25). This tree allows
the researcher to determine which metrics are needed in order to compute the metric of interest.
Figure 25. Metric dependence tree.
6.4.1. Returning to the Research Questions
Selecting the metric that suits one's needs depends on the target application and the research
questions being asked. What follows are examples of research questions paired with possible
metrics.
Research Question: How does it respond to transitions to an IC state from an NC state? What type of errors would the user expect?
Metric: CM, plus a single or multiple error metric

Research Question: What type and frequency of errors does one see when the user is in an NC state?
Metric: CM, plus a single or multiple error metric; histogram of error-free period lengths; histogram of times between false activations

Research Question: How quickly does the transducer respond? Are there delays? Are these delays consistent?
Metric: histogram of response times

Research Question: How well can a user hold (sustain) a state? (for state-driven transducers)
Metric: histogram of hold times
[Figure 25 graphic: the metric dependence tree, branching from the performance labels into error measures (CM; % TP/TN/FP/FN; % CT/IT/CMS/MT/ST; HF difference; ROC curve; kappa; other single and multiple metrics) and timing measures (histograms / statistical models of RT, NC period, IFA period and hold time).]
7. Reporting Practices
In this chapter, some recommendations about the information that should be reported in the
self-paced BCI literature are presented. If accepted and implemented by the BCI community, these
recommendations can lead to better reproducibility of future work. They can also make it possible
to compare different designs. We divide these guidelines into five categories:
1. Ideal needs for the target application
2. Transducer characteristics
3. Transducer output mark-up method
4. Basic performance information
5. Application-specific high-level metrics
We now address each category in more detail:
7.1. Ideal Needs for the Target Application
It is recommended that researchers report the ideal needs for their target application(s).
Once these ideal needs are specified, it becomes possible to compare the results with the ideals
and to analyze how well the goals are achieved. The ideal needs can be further divided into the
following sub-categories:
7.1.1. Specifying the Target Needs:
Specify acceptable error characteristics for the target application (this includes the target
population, target activity and target operating environment). A specific error rate or response
time may be acceptable for one application but completely unacceptable for another. It is
important that the target application be expressed clearly.
7.1.2. Acceptable Error Rates:
Once the target application is specified, acceptable error rates should be specified. While
zero error rates are ideally desired, that is usually not achievable in practice. Thus researchers
should specifically state what level of error is acceptable in their design. This level may differ
from one design or application to another. We do not know of any published work that has
specified this information.
7.1.3. Acceptable Timing Characteristics:
Many factors introduce delays into the response time of a BCI transducer (such as filter
delays, post-processing, etc.). As with determining an acceptable error rate, it is recommended
that authors determine the acceptable response time for their particular application.
7.2. Transducer's Characteristics
Here are some recommendations for reporting the transducer's characteristics:
7.2.1. Transducer's Output Rate
Specifying the transducer's output rate is important for determining the applicability of a
proposed design. A false positive rate of 1% for a BCI transducer with an output rate of 10
samples per second means an average of one error every 10 seconds. For a transducer with an
output rate of 1 sample per second, the same false positive rate means an average of one error
every 100 seconds. Clearly there is a significant difference between these two designs (assuming
they have the same hit detection rate).
Ideally, we would prefer to see performance results normalized to time, so that the results
do not depend on the transducer's output rate.
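The arithmetic above reduces to a one-line normalization. As a minimal sketch (the function name and units are ours, not part of any standard BCI toolkit), the mean time between false positives follows directly from the per-decision false-positive rate and the output rate:

```python
def mean_time_between_false_positives(fp_rate, output_rate_hz):
    """Mean seconds between false positives for a transducer that emits
    output_rate_hz decisions per second, each with a false-positive
    probability of fp_rate (e.g. 0.01 for 1%)."""
    errors_per_second = fp_rate * output_rate_hz
    return 1.0 / errors_per_second

# 1% false positives at 10 decisions/s: one error every 10 s on average.
print(mean_time_between_false_positives(0.01, 10))
# 1% false positives at 1 decision/s: one error every 100 s on average.
print(mean_time_between_false_positives(0.01, 1))
```

Reporting time-normalized figures of this kind makes designs with different output rates directly comparable.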
7.2.2. Temporal Characteristics:
It is necessary to report the temporal characteristics of a switch design, such as Response Time
and Refractory Period, so that other researchers can test these designs accurately.
7.2.3. Offline vs. Online Analysis:
Whether the analysis is carried out offline or online should be clearly stated.
In particular, it is recommended that the performance of the system during periods of bad
data I29 (for example, when anomalies are present) is reported, regardless of whether such periods
are included in the analysis.
7.2.4. Robustness of the Algorithm
It is recommended that whether or not the performance of a transducer is considered in the
presence of artifacts is also reported. Especially for online analysis, it is important to know how
robust a particular transducer is to the presence of artifact or how it can handle artifacts.
7.3. Transducer Output Markup Method
The procedure used for marking up the transducer's output should be clearly stated. This is
crucial for reproducing the evaluation of any transducer.
7.4. Basic Performance Information
It is recommended that all the basic performance metrics (for example, the elements of the
confusion matrix) are reported clearly.
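For a two-class self-paced case, a minimal illustration of reporting the confusion-matrix elements and the basic rates derived from them might look like the following (the counts are hypothetical, chosen only to show the layout):

```python
# Hypothetical per-decision counts for a self-paced (IC vs. NC) transducer.
tp, fn = 45, 5    # intended IC decisions: detected / missed
fp, tn = 20, 930  # intended NC decisions: false activations / correct inactivity

tpr = tp / (tp + fn)   # hit (true positive) rate
fpr = fp / (fp + tn)   # false positive rate

print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
print(f"TPR={tpr:.3f}  FPR={fpr:.3f}")
```

Reporting all four counts (not just an accuracy figure) lets readers recompute any derived metric for the heavily skewed class distributions typical of self-paced operation.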
7.5. Application-Specific High-Level Metrics
As stated in Chapter 6, depending on the target application, the basic performance metrics may
be combined to generate high-level metrics. It is recommended that all high-level performance
metrics are reported clearly, along with the rationale for using them.
43
Appendix A Glossary (entries list Term, Category, Definition, and Comments)
Activation Activate
Transducer Output Discrete transducer output change from NC to ICi, where i = A, B, C, ... (see Discrete Transducer Output)
Used as a shortcut term.
Activation Response Time Measured Timing Characteristics
The time it takes for the transducer output to reflect the intended onset of control. This can be represented as a mean time or another statistical summary such as a histogram
Asynchronous BCI design classification
Generally synonymous with self-paced operation, although usage is inconsistent. The use of this term is not recommended.
BCI Acronym for Brain Computer Interface
BCI System BCI design classification
A system of components that translates brain activity into useful communication or control signals
BCI Transducer BCI design classification
The primary component of a BCI system which translates brain activity into basic control signals.
Brain Computer Interface A technology that translates activity measured directly from the brain into useful communication or control. Also known as Brain Interface, Direct Brain Interface and Brain Machine Interface.
Brain Activity
Continuous Output Transducer Output Continuous (ordered) values that correspond to the user’s brain state. For example, changes in the user’s intentionally controlled (IC) brain state would map onto changes in the continuous transducer output. An NC brain state would ideally produce no change in the transducer output.
Deactivation Deactivate
Transducer Output Synonymous with Release; see Release. The preferred term is Release (SM)
Deactivation Response Time
Measured Timing Characteristics
The time it takes for the transducer output to reflect the ceasing of control. Only reported for state-driven discrete transducers
Discrete Output Transducer Output Discrete state (non-ordered) values that correspond to the user’s brain state. For example, brain states ICA, ICB, ICC, and NC would ideally produce transducer outputs A, B, C, and N.
Estimated User Intent An estimation of the user’s intended transducer output in terms of timing and state.
Event-Driven (Discrete) Transducer
Transducer Design A transducer that is driven by a transient event in the brain state, e.g., a movement related potential. In this case, the transducer output can be considered “instantaneous”, on for a brief period of time then off. There is no ability to hold and release the output. See State-Driven Discrete Transducer.
Expected Response Window
Analysis A time period used to compensate for the unknown timing of an intended output.
Glitch Transducer Output A closely spaced pairing of a spontaneous transition and a spontaneous correction, where the duration between the transition and the correction does not exceed a reported maximum glitch duration. From a user's perspective, the hold time is not interrupted by a glitch, even though the transducer output changes spontaneously; it is considered interrupted only if the glitch lasts longer than the maximum glitch duration.
Hold Transducer Output An ability to maintain a transducer output in a particular state.
Hold Time Measured Timing Characteristics
The time the transducer output can be held in a particular state. Only reported for state-driven discrete transducers
Idle User Brain State A term used to describe No Control. Use of this term is not recommended; the preferred term is No Control (NC).
Idle Support User Brain State A term used to describe NC Support. Use of this term is not recommended; the preferred term is NC Support.
IC User Brain State Acronym for Intentional Control
44
Intentional Control User Brain State User brain state during which the user is intentionally trying to perform some action using the BCI Transducer. Abbreviated ICi where i = A, B, C, ...
Jitter The undesired toggling between states that occurs during activation or deactivation when a feature vector repeatedly crosses the decision boundary(ies).
Jitter Reduction Method Transducer Design A method used to reduce Jitter (see Jitter, Hysteresis, and Debounce).
Maximum glitch duration Transducer Output The maximum time that a glitch is allowed to last. Any change of state longer than this threshold will be considered the end of the hold time and not a glitch. It is up to the experimenter to choose this value according to their assessment of the user's perspective. A threshold of zero means that any state change will mark the end of the hold time.
NC User Brain State Acronym for No Control
NC Support Transducer Output The ability of a BCI transducer to recognize a user’s NC state and generate an inactive (N) state output
No Control User Brain State User brain state during which the user is not trying to perform some action using the BCI Transducer. The user may be monitoring, resting, thinking but not engaged in control through the BCI transducer. Abbreviated NC.
Refractory Period
Measured Timing Characteristics
The minimum time after a discrete transducer has been activated before it is ready to be reactivated.
Response Time Measured Timing Characteristics
See Activation Response Time
Release Transducer Output Discrete transducer output change from ICi to NC, where i = A, B, C, ... (see Discrete Transducer Output)
Sleep Mechanism Transducer Design A mechanism by which a BCI system is placed into a restricted response mode to avoid false responses during long periods of No Control.
Spatial Reference Output Transducer Output Output that refers to a particular location on a screen or keyboard.
State-Driven (Discrete) Transducer
Transducer Design A discrete transducer that is driven by a continuously controlled brain state, e.g., alpha power. In this case, the transducer output can be turned on, held on for a period of time, then released. See Event-Driven Discrete Transducer.
Self-Paced Operation Operating Paradigm
Synchronized Operation Operating Paradigm Usage is inconsistent/mixed – no single specific definition.
Synchronous Operating Paradigm Generally describes BCI systems that are operated in a periodic, system-driven manner. Usage is inconsistent, and the use of this term is not recommended; synchronized or system-paced are the preferred terms.
System-Paced Operation Operating Paradigm
Transducer BCI design classification
See BCI Transducer
Unknown State Transducer Output A special transducer output state used (in some transducer designs) to represent when there is not enough confidence in the classifier to choose one of the other IC states.
Jitter Reduction Terms
Debounce Jitter Reduction Method A mechanism to reduce output jitter (see Jitter) that locks the transducer output into a state for a fixed time period (the Debounce Time) after activation or deactivation.
Debounce Time Jitter Reduction Method
The length of time the debounce mechanism is active. May differ for activation and deactivation.
Dwell Time Jitter Reduction Method Duration of the settling time for the detection of an activation. The use of this term is not recommended; "activation settling time" is the recommended term.
45
Hysteresis Jitter Reduction Method
A mechanism to reduce output jitter (see Jitter) that uses a complex decision boundary.
Refractory Period
Jitter Reduction Method
1. The minimum time after a discrete transducer has been activated before it is ready to be reactivated. 2. The duration of the settling time for the detection of a deactivation; not a refractory period in the strict sense of the first definition. The second usage is not recommended and should be replaced by "deactivation settling time".
Settling Jitter Reduction Method
A mechanism to reduce output jitter where an activation or deactivation does not occur until a decision boundary has been crossed for a specific period of time (the Transducer Setup Time).
Transducer Setup Time Jitter Reduction Method
Minimal amount of time the feature vector must be "held" in a specific state in order to produce a transition on the transducer output. The setup time may differ between states, e.g., for NC vs. IC states.
Correct Maintained State Transition label Transducer output state label used when the transducer output is identical to the intended output and there is no transition on either output.
Correct Transition Transition label Transducer output transition label when the transducer output follows the user intention.
Incorrect Transition Transition label A transducer output transition to a state other than the intended output state.
Missed Transition Transition label A desired transition on the intended output that is not recognized by the transducer and therefore does not change the transducer output.
Spontaneous Correction Transition label A transition that occurs after a missed, spontaneous, or incorrect transition and corrects the error. After the correction, the transducer output state is identical to the intended output.
Spontaneous Transition Transition label A transition that occurs on the transducer output when no transition was desired by the user.
47
Appendix B Transition-based Performance Markup Algorithm I30 This section proposes a Transducer Output Performance Markup Algorithm for a
sample-based BCI transducer, following the requirements defined in Chapter 5. It allows
some of the metrics from Chapter 6 to be computed. This section should be considered an example
rather than a reference implementation.
We saw in Chapter 6.2.2 that metrics should be independent of the transducer output rate.
Therefore, the current performance markup algorithm is based on the transitions between two
states of the User Intent Estimate. The algorithm assumes that the researcher already has a User
Intent Estimate signal, coded as a sequence of samples. As this performance markup algorithm
was not designed for a specific application, the Expected Response Window parameters ERWstart and
ERWend are left undefined. Thus, Chapter 5.2's step one (defining the ERW) is not specified here.
For Chapter 5.2's step two (define and run the Transducer Output Markup Algorithm), we
define a complete ICM in the sense of Chapter 5.2.1, because it defines the markup of samples
within the ERW, allows the response time to be computed, and defines the markup of samples outside
the ERW. Note that the ICM described here produces transducer output performance markup
labels without using an Intermediate Estimate User Intent.
B.1 Types of transitions I31
Transitions from one EUI sample to the next can be characterized by five terms: Correct
Transition (CT), Incorrect Transition (IT), Missed Transition (MT), Spontaneous Transition (ST),
and Spontaneous Correction (SC). Note that we do not consider here the cases where no transition
occurs between two EUI samples. Each transition is characterized by three indices I32, e.g.
ITijk: the transducer output state before the transition, the desired EUI state after the transition,
and the actual transducer output state after the transition. Depending on the transition, duplicate
indices can be removed. For example, in a Correct Transition, the transducer output state after the
transition is the same as the desired state, so the last index (k) can be removed. The goal of the
Internal Comparison Method is to produce a list of recognized transitions.
A Correct Transition (CT) occurs when the transducer output state changes to the new EUI
state within the ERW. An Incorrect Transition (IT) occurs when the transducer output state
changes, within the ERW, to a state other than the new EUI state. In both cases, the response time
is the duration between the EUI transition time and the transducer output transition time. A Missed
Transition (MT) occurs when the transducer output does not change during the ERW.
A Spontaneous Transition (ST) occurs when the transducer produces a transition on its output
independently of the user's intention. To identify this kind of transition, the concept of an
Expected Cause Window is introduced (see Figure 26). This is the window during which a
transition on the Estimate User Intent should have occurred for the transducer output transition
to be considered a Correct or Incorrect Transition. If no EUI transition occurs in this window, the
transducer output transition must be considered a Spontaneous Transition.
48
Figure 26. Expected Response Window (A) and Expected Cause Window (B). The small rounded arrows show
which transition each window refers to.
The concept of Spontaneous Correction (SC) had to be introduced to avoid counting errors
when the transducer spontaneously corrects an Incorrect, Missed, or Spontaneous Transition.
Otherwise, such a correction would itself be counted as a Spontaneous Transition. Examples of the
five types of transitions are given in Figure 27.
Figure 27. Examples of Correct Transition (A), Incorrect Transition (B), Missed Transition (C), Spontaneous
Transition (D), and Spontaneous Correction (E). The small rounded arrow shows which transition the decision
window refers to.
B.2 Internal Comparison Method
Only Correct and Incorrect Transitions are marked up inside the Expected Response Window.
The transducer response time is computed only for these two types of transition. Missed
Transitions, Spontaneous Transitions and Spontaneous Corrections are marked up only outside the
Expected Response Window. For these transitions the transducer response time cannot be
computed. The comparison method can thus be summarized by the following pseudo-code:
49
for each transition of the Estimate User Intent occurring at time tEUI
    if the transducer output contains a transition during the ERW [tEUI+ERWstart .. tEUI+ERWend]
        tTO = transducer output transition time
        if TransducerOutput(tTO+) = EstimateUserIntent(tTO+)
            label a CT at the transducer output transition time tTO,
            with a transducer response time = tTO - tEUI
        else
            label an IT at the transducer output transition time tTO,
            with a transducer response time = tTO - tEUI
        end if
    else
        label an MT at the intended output transition time tEUI
    end if
end for

for each transition of the transducer output occurring at time tTO
    if the intended output contains no transition during the ERW [tTO-ERWend .. tTO-ERWstart]
        if TransducerOutput(tTO+) = EstimateUserIntent(tTO+)
            label an SC at the transducer output transition time tTO
        else
            label an ST at the transducer output transition time tTO
        end if
    end if
end for
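As an illustration only, the pseudo-code above can be sketched in Python as follows. The data structures and function names are ours and are not taken from the BioSig implementation; state sequences are per-decision-sample lists, and the ERW bounds are given in samples:

```python
def transitions(seq):
    """Return the sample indices at which the state sequence changes value."""
    return [i for i in range(1, len(seq)) if seq[i] != seq[i - 1]]

def mark_up(eui, out, erw_start, erw_end):
    """Label transitions as CT/IT/MT/ST/SC following the pseudo-code above.

    eui, out : per-sample state sequences (Estimate User Intent and
               transducer output); the ERW is inclusive, in samples.
    Returns a list of (label, time, response_time) tuples.
    """
    labels = []
    eui_t, out_t = transitions(eui), transitions(out)
    for t in eui_t:
        in_window = [to for to in out_t if t + erw_start <= to <= t + erw_end]
        if in_window:
            to = in_window[0]  # first output transition inside the ERW
            kind = 'CT' if out[to] == eui[to] else 'IT'
            labels.append((kind, to, to - t))
        else:
            labels.append(('MT', t, None))
    for to in out_t:
        # No EUI transition inside the Expected Cause Window?
        if not any(to - erw_end <= t <= to - erw_start for t in eui_t):
            kind = 'SC' if out[to] == eui[to] else 'ST'
            labels.append((kind, to, None))
    return labels

# Hypothetical example: activation (N -> A) and release (A -> N), each
# detected one sample late, with ERW = [0, 2] samples.
eui = ['N', 'N', 'A', 'A', 'A', 'N', 'N', 'N']
out = ['N', 'N', 'N', 'A', 'A', 'A', 'N', 'N']
print(mark_up(eui, out, 0, 2))
# -> [('CT', 3, 1), ('CT', 6, 1)]
```

Both EUI transitions fall inside an ERW that contains a matching output transition, so each is labelled a CT with a one-sample response time, and no output transition is left unexplained.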
This performance markup algorithm has been implemented in the BioSig open-source library
(BioSig). An example of this processing is shown in Figure 28.
Figure 28. Example of performance markup (state-driven data labelling with RTmin = 1 and RTmax = 3; input
signal and transducer output shown against time in decision samples). Transitions are shown with a red vertical
bar; spontaneous transitions are shown with a green bar. The Expected Response Windows are shown as light
gray areas.
B.3 Metrics for self-paced evaluation
We present here the computation of the hold-time periods as defined in Chapter 6.1.5,
including the notion of a glitch (Section 6.1.6). A hold-time period starts:
- after a CT, if the current/new transducer output state is an IC state and the previous one is NC; or
- after an SC, if the current/new transducer output state is an IC state and the previous one is NC; or
- after a CT, if the current/new transducer output state is an IC state and the previous one is also an IC state (in this case, the existing hold-time period is stopped and a new one is created).
A hold-time period stops:
- after an ST, if this ST is not followed by an SC within the next Tglitch seconds; or
- after a CT (into NC); or
- after any IT, CT, SC, or ST transition following an MT.
Note that we consider here a single maximum glitch duration Tglitch. The above rules are
summarized in the state machine of Figure 29, and hold-time examples are given in
Figure 30.
Figure 29. Hold-time state machine. Plain text labels are preconditions; rectangles are actions; arrows are
state-machine transitions; circles are states. HTP = Hold-Time Period; Tglitch = maximum glitch duration;
"!CT" means "no CT transition"; "i" is the transition index; "ti" is the time of the i-th transition; multi-line
conditions are ORed preconditions.
Figure 30. Case studies for stopping the Hold-Time Period.
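A simplified sketch of the start/stop rules above, assuming the markup algorithm of Appendix B has already produced a time-ordered list of (label, time, new output state) transitions. For brevity it implements only the CT start/stop rules and the glitch rule, omitting the SC-start and MT-related cases; all names are ours:

```python
def hold_time_periods(events, t_glitch):
    """events: time-ordered list of (label, time, new_output_state) tuples,
    with 'NC' denoting the no-control state.
    Returns a list of (start, stop) hold-time periods."""
    periods, start = [], None
    for i, (label, t, state) in enumerate(events):
        if label == 'CT' and state != 'NC':
            if start is not None:              # IC -> IC: stop and restart
                periods.append((start, t))
            start = t
        elif label == 'CT' and state == 'NC' and start is not None:
            periods.append((start, t))         # CT into NC ends the hold
            start = None
        elif label == 'ST' and start is not None:
            nxt = events[i + 1] if i + 1 < len(events) else None
            glitch = (nxt is not None and nxt[0] == 'SC'
                      and nxt[1] - t < t_glitch)
            if not glitch:                     # uncorrected ST ends the hold
                periods.append((start, t))
                start = None
    return periods

# Hypothetical run: activation at t=2, a spontaneous drop at t=5 that is
# corrected at t=6, and a correct release at t=10.
ev = [('CT', 2, 'A'), ('ST', 5, 'NC'), ('SC', 6, 'A'), ('CT', 10, 'NC')]
print(hold_time_periods(ev, t_glitch=2))   # drop treated as a glitch
print(hold_time_periods(ev, t_glitch=1))   # drop ends the hold at t=5
```

With Tglitch = 2 the ST/SC pair is a glitch and the hold runs from 2 to 10; with Tglitch = 1 the same drop terminates the hold at t = 5.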
51
Appendix C References BioSig (2003-2005). BIOSIG - an open source software library for biomedical signal processing.
Blankertz, B., K.-R. Müller, et al. (2004). "The BCI Competition 2003: Progress and Perspectives in Detection and
Discrimination of EEG Single Trials." IEEE Transactions on Biomedical Engineering 51(6): 1044-1051.
Blankertz, B., G. Schalk, et al. (2005). "BCI competition III." from
http://ida.first.fraunhofer.de/projects/bci/competition_iii/.
Blankertz, B., T. M. Vaughan, et al. (2003). "BCI competition 2003." from
http://ida.first.fraunhofer.de/projects/bci/competition/.
Bortz, H. and G. A. Lienert (1998). Kurzgefasste Statistik für die klassische Forschung. Ubereinstimmungsmasze fuer
subjektive Merkmalsurteile. Springer. Berlin Heidelberg: 265-270.
Cohen, J. (1960). "A coefficient of agreement for nominal scales." Educational and Psychological Measurement 20:
37-46.
Danker-Hopfe, H., D. Kunz, et al. (2004). "Interrater reliability between scorers from eight European sleep
laboratories in subjects with different sleep disorders." Journal of Sleep Research 13(1): 63-69.
Gao, Y., M. J. Black, et al. (2003). A quantitative comparison of linear and non-linear models of motor cortical
activity for the encoding and decoding of arm motions. International IEEE EMBS Conference on Neural
Engineering.
Huggins, J. E., S. P. Levine, et al. (1999). "Detection of Event-Related Potentials for Development of a Direct Brain
Interface." Journal of Clinical Neurophysiology 16(5): 448-455.
Kraemer, H. C. (1982). Kappa coefficient. Encyclopedia of Statistical Sciences. K. S. a. J. N. L. (Eds.). New York,
John Wiley & Sons.
Kronegg, J. and T. Pun (2005). Measuring the performance of brain-computer interfaces using the information transfer
rate. BCI 2005, Brain-Computer Interface Technology: Third International Meeting, Rensselaerville, NY,
USA.
Kronegg, J., S. Voloshynovskiy, et al. (2005). Analysis of bit-rate definitions for Brain-Computer Interfaces. Int.
Conf. on Human-computer Interaction (HCI'05), Las Vegas, Nevada, USA, CSREA Press.
Kübler, A., F. Nijboer, et al. (2005). "Patients with ALS can use sensorimotor rhythms to operate a brain-computer
interface." Neurology 64(10): 1775-1777.
Lal, T., M. Schröder, et al. (2005). A Brain Computer Interface with Online Feedback based on
Magnetoencephalography. International Conference on Machine Learning.
Libet, B., C. Gleason, et al. (1983). "Time of conscious intention to act in relation to onset of cerebral activity
(readiness-potential). The unconscious initiation of a freely voluntary act." Brain 106(3): 623-642.
Libet, B., E. J. Wright, et al. (1982). "Readiness-potentials preceding unrestricted 'spontaneous' vs. pre-planned
voluntary acts." Electroencephalography and clinical Neurophysiology 54(3): 322-335.
Mason, S. G., A. Bashashati, et al. (2005). "A Comprehensive Survey of Brain Interface Technology Designs." Annals
of Biomedical Engineering (submitted for publication).
Mason, S. G. and G. E. Birch (2005). Temporal Control Paradigms for Direct Brain Interfaces – Rethinking the
Definition of Asynchronous and Synchronous. HCI International, Las Vegas, Nevada, USA.
Nykopp, T. (2001). Statistical Modelling Issues for The Adaptive Brain Interface. Department of Electrical and
Communications Engineering. Helsinki, Helsinki University of Technology. M.Sc.
Pierce, J. R. (1980). An Introduction to Information Theory: Symbols, Signals and Noise, Dover Publications.
Schlögl, A., P. Anderer, et al. (1999a). Artifact processing of the sleep EEG in the "SIESTA"-project. EMBEC,
Vienna, Austria.
Schlögl, A., P. Anderer, et al. (1999b). Artefact detection in sleep EEG by the use of Kalman filtering. EMBEC,
Vienna, Austria.
52
Schlögl, A., J. E. Huggins, et al. (accepted for publication). Evaluation criteria in BCI research. Towards Brain-
Computer Interfacing. G. Dornhege, J. d. R. Millán, T. Hinterberger, D. J. McFarland and K.-R. Müller.
Cambridge, MA, The MIT Press.
Schlögl, A., C. Keinrath, et al. (2003). Information transfer of an EEG-based Brain-computer interface. International
IEEE EMBS Conference on Neural Engineering, Capri, Italy.
Schlögl, A., F. Y. Lee, et al. (2005). "Characterization of Four-Class Motor Imagery EEG Data for the BCI-
Competition 2005." Journal of Neural Engineering 2(4): 14-22.
Schlögl, A., C. Neuper, et al. (2002). "Estimating the Mutual Information of an EEG-based Brain-Computer
Interface." Biomedizinische Technik 47(1-2): 3-8.
Townsend, G., B. Graimann, et al. (2004). "Continuous EEG classification during motor imagery-simulation of an
asynchronous BCI." IEEE Transactions on Neural Systems and Rehabilitation Engineering 12(2): 258-265.
Wolpaw, J. R., N. Birbaumer, et al. (2000). "Brain-Computer Interface Technology: A Review of the First
International Meeting." IEEE Transactions on Rehabilitation Engineering 8(2): 164-173.
Wolpaw, J. R., H. Ramoser, et al. (1998). "EEG-Based Communication: Improved Accuracy by Response
Verification." IEEE Transactions on Rehabilitation Engineering 6(3): 326-333.
Wu, W., M. J. Black, et al. (2004). "Modeling and decoding motor cortical activity using a switching Kalman filter."
IEEE Transactions on Biomedical Engineering 51(6): 933-942.
Wu, W., Y. Gao, et al. (2006). "Bayesian population decoding of motor cortical activity using a Kalman filter." Neural
Computation 18(1): 80-118.
53
Appendix D Outstanding Issues
1 Issue: existing examples? Could we add references here?
2 Issue: existing examples? Does this apply to anyone else's work other than Millan's?
3 Issue: with terminology. The line between self-paced and system-paced BCI transducers can
become blurred as the rate of pacing of a system-paced BCI increases. It is possible that a
system-paced BCI (such as one based on the P300) would actually produce a higher decision
rate than a self-paced BCI utilizing something like slow cortical potentials. Therefore, the
user would feel that the P300 interface was continuously available because they were not
conscious of waiting for a period of control to occur. I think that it is worth discussing this
difference and perhaps defining a pacing threshold beyond which the user does not feel that
they have to wait for the system to become available. Or perhaps it is just an addition to the
definition of a system-paced BCI that says that it is only considered a system paced BCI if
the user is conscious of the system being unavailable? (Or is that heresy?)
4 Issue: with terminology. The term “synchronized” control is not very descriptive and
possibly confusing with synchronous and synchronization of EEG. Is there a better term?
5 Issue: use of transducer output in the generation of the signal reference. Alois: You are
using a block "transducer output labelling" which has inputs of the transducer output
AND the "INTENDED OUTPUT" (!!!). The "intended output" is also the reference against
which we compare the transducer output. This is a problem. I think we have been discussing this in
the past. One might use "the intended output" to generate the output labeling – then no
evaluation criterion is valid anymore. Instead, the "intended output" and the "transducer
output" must go into the block "calculate error statistic". I suggest also changing the term
"calculate error statistic" to the more general term "calculate performance criteria" or
"calculate evaluation criteria". Moreover, do we really need an extra block for "timing
characteristics"? Can we not include the timing metrics in "performance metric"? I suggest
changing Figure 8 and replacing the three leftmost blocks with one block called "calculate
performance metric".
Mehrdad: I suggest that we keep "transducer output labeling" and then add the "labeled
transducer output" after this block. With some explanation, I think we can avoid
confusion for readers. I agree with the second part of Alois' suggestions that we should
rename the two rightmost blocks and have a single block instead.
6 Issue: clarification needed. Jane: I think this could use further clarification. I'm not sure
what it means.
7 Issue: how many BCI systems are evaluated like this? Mehrdad: references?
8 Issue: with wording. Julien: Self-paced BCIs can be trained using synchronized protocols.
In this case, the EUI can be estimated the same way as in synchronized BCIs. Should we
describe that issue? Steve: Synchronized protocols do not estimate EUI. Most "average"
observations over a fixed window, so an estimate of exact timing is not required. So I don't
know if we can say something that's relevant here.
9 Issue: use of existing metrics. Alois: I do not agree that the existing performance metrics are
not useful at all. Once the reference information is available, these metrics are applicable.
Jane: I don't say that they are not useful at all, but I have had a terrible time trying to figure
out how to apply them. Most metrics seem to have hidden assumptions that make applying
them to self-paced data problematic. It is not a simple matter of getting the reference
information right and then plugging it into formulas. That doesn't produce useful results
because underlying assumptions are violated. Steve: I agree with Jane.
10 Issue: accuracy. Steve: I don't feel that this is accurate. Those who use ITR and mutual
information are not studying NC and thus are not "assuming error-free NC". They are
simply ignoring it. To say that they are assuming error-free NC is to imply that the metric is
somehow incorporating data related to NC, which it is not! So I have an issue with this
whole subsection.
11 Issue: clarification needed. Mehrdad: Is this referenced work self-paced? Julien: Not self-
paced, because it is based on trials. I would cite some papers from Mason et al., where the decision
rate is about 16 Hz. Jane: Go ahead and change the number and the reference if it is a
better illustration.
12 Issue: use of existing metrics. Alois: some can still be useful. Mehrdad, Julien, Steve,
Jane: Disagree. See Issue 9 for related comments.
13 Issue: table is incomplete.
14 Issue: with categories. Julien: are these titles/groupings the most appropriate?
15 Issue: unclear on point being described. Mehrdad: can you give some examples? Steve: I
find this approach confusing. Can you give some examples, Jane? Julien: Not very clear.
Do you mean using another transducer which decomposes the self-paced data into "periods" of a
specific state (NC or ICi)?
16 Issue: with terminology. Julien: When it comes to the abbreviation, EUI puts the
emphasis on the Estimate. Maybe User Intent Estimate (UIE) would be better because it
puts the emphasis on the User Intent.
17 Issue: with wording. Julien: It's not "non-deterministic methods". From my point of view,
it's more deterministic methods with fuzzy matching in time.
18 Issue: only depicts event-driven transducers. Julien: The EUI in Figure 16 only shows event-
based EUI. We should add an EUI in states to show the transition. Ideally, we should give two
explanations: one for event-driven paradigms and one for state-driven paradigms.
19 Issue: appendix material possibly confuses. Steve: I think the material in the appendix focuses
too much on a transition-based interpretation, which doesn't align with the rest of the
material in this section. As it stands, I think it will confuse more than elucidate the general
approach. Thus I think it still needs quite a bit of work before it is a good reference example
to accompany our specification.
20 Issue: with terminology. Julien: Referring to activation and release pushes towards
single-IC-state transducers. Maybe it should be changed to refer to multi-IC-state
transducers (so speak about transitions in general, and no longer about activation and release).
21 Issue: practical implementation. What is the significance of the hold-time period when the
error rate is high?
22 Issue: with terminology. This definition is only partially agreed upon by JH, JK, SM.
55
23 Issue: definition of Multiple Confusion Matrix. Alois: The MCM is not needed. Julien: MCMs
are useful and necessary for state-driven analysis. Steve: I realized upon reflection that the
MCM concept really is just a group of single CMs. They are not necessary - the number of
CMs a researcher reports depends on what questions the researchers want to answer.
Someone using a state-driven transducer may only report one CM if they only want to
comment on the overall ability to activate and release. So I've revised the original MCM
presentation to reflect this.
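A minimal Python sketch of this point, with made-up state labels and one hypothetical choice of conditioning (by the preceding intended state): an "MCM" is nothing more than a collection of single CMs, and summing the group recovers the single overall CM.

```python
import numpy as np

# Hypothetical per-sample labels. States: 0 = NC (no control), 1 = IC1, 2 = IC2.
intended = [0, 0, 1, 1, 0, 2, 2, 0, 1, 0]
actual   = [0, 1, 1, 0, 0, 2, 0, 0, 1, 0]

def confusion_matrix(true_labels, pred_labels, n_states=3):
    """Square confusion matrix: rows = intended state, columns = actual output."""
    cm = np.zeros((n_states, n_states), dtype=int)
    for t, p in zip(true_labels, pred_labels):
        cm[t, p] += 1
    return cm

# One overall CM ...
overall = confusion_matrix(intended, actual)

# ... and an "MCM" as a dict of single CMs, here keyed by the intended state
# of the preceding sample (one possible conditioning among many).
mcm = {s: confusion_matrix(
           [t for i, t in enumerate(intended[1:]) if intended[i] == s],
           [p for i, p in enumerate(actual[1:]) if intended[i] == s])
       for s in (0, 1, 2)}

# Each sample falls into exactly one group, so the group CMs sum to the
# overall CM over the same samples.
assert (sum(mcm.values()) == confusion_matrix(intended[1:], actual[1:])).all()
print(overall)
```

Which (and how many) of these CMs to report is then purely a question of what the researcher wants to comment on.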
24 Issue: proposal for methods??
25 Issue: could use an example/reference.
26 Issue: lack of a single metric. Steve: I'd like to have a stronger recommendation for a
method, but I'm not convinced we have one. I like Kappa but have concerns. I started
experimenting with Kappa, found problems with it, and after researching a bit
more realized that these problems are well known. The one issue that bothered me about
Kappa was its sensitivity to disproportionate (skewed) class probabilities, like we often have
with self-paced evaluations. Lantz and Nebenzahl (J Clin Epidemiol 49(4),
attached) and Byrt et al. (J Clin Epidemiol 46(5)) have proposed formulations to report the
bias, but I don't see any formulation that corrects for it.
27 Issue: incomplete. Steve: This seems to be where our closed-group discussion ended last.
It is still incomplete – what are the other general research questions related to self-paced
evaluation? After we publish the first draft, I'd like to focus part of the discussion here. I
think coming up with metrics related to general research questions will be a useful
contribution.
28 Issue: with definitions. There are still some disagreements on which information belongs to
which group (e.g. whether the histogram of temporal accuracy belongs to low-level performance
information or to high-level information), but these are mainly due to vocabulary divergences
(JK, SM).
29 Issue: omission. Steve: We have not yet discussed "bad data" (data related to protocol
anomalies) in this version.
30 Issue: is this material needed? Julien: Yes; in this appendix, we describe the transition
markup method, which addresses these requirements and for which some Matlab code has been
implemented. I think it is a good idea to provide such a method. Maybe not to say "take it, this
is the best one", but to provide a detailed example of how such methods can be designed.
Steve: Useful, but in its current form I think it is hard to follow or interpret as an example
that I can relate to. Also, some of this material is out of date and does not reflect our latest
thinking.
31 Issue: transition-based versus state-based perspective. Steve: We spent quite a bit of time
discussing a transition-based analysis approach, but I now see it as only a special case of the
state-based interpretation, and we've dropped most of the transition-specific discussion from
the document. Thus I don't think we have provided enough background for the reader to
understand the transition-based perspective.
32 Issue: how to name these three indices? Julien: The mid-October summary asked whether
current/desired/actual state refers to the "transducer output" or to the "intended output".
The following descriptions were agreed upon (SM, JK):
"current state": transducer output before the transition
"desired state": intended output after the transition
"actual state": transducer output after the transition
However, the original terms (August 22) were not well chosen, and there was some discussion
about more appropriate terms. It was agreed that the terms must include "transducer output"
or "intended output", but the word indicating the time information ("before"/"after" in the
above descriptions) was debated. JK proposed "current"/"next". SM proposed
"previous"/"current". JH finds "current" confusing and proposed "old"/"new" or
"before"/"after". JK argued that "current" is not well chosen because we are referring to the
transition time, and "current" would indicate that a state is associated with the transition,
which is obviously not the case (the transition has a "0-width" state). He also argued that
"old"/"new" is not appropriate because older/newer refer to two comparable things (which is
not the case here, as we would be comparing a state and a transition).
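Whatever terms are finally chosen, the three indices could be carried per transition event as a simple record; the Python sketch below uses purely illustrative field names that spell out "transducer output" versus "intended output" as agreed:

```python
from dataclasses import dataclass

# Hypothetical record for one transition event. Field names are illustrative,
# mapping onto the agreed descriptions of the three indices.
@dataclass
class TransitionRecord:
    output_before: str   # "current state":  transducer output before the transition
    intended_after: str  # "desired state":  intended output after the transition
    output_after: str    # "actual state":   transducer output after the transition

    def is_correct(self) -> bool:
        """A transition succeeds if the transducer lands in the intended state."""
        return self.output_after == self.intended_after

r = TransitionRecord(output_before="NC", intended_after="IC1", output_after="IC1")
print(r.is_correct())  # True
```

Spelling the indices out this way keeps the time reference ("before"/"after" the transition) unambiguous regardless of which shorthand terms the text settles on.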