
Page 1:

Speech and Sound UIs Adv. IID: Fall 2006

Findings
• Failure to respond to a request implies a lack of understanding
• The first part of an adjacency pair constrains the second and makes it conditionally relevant
• The “preferred” response to a question–answer pair is usually immediate and overlaps the question
• A “dispreferred” response to a question–answer pair is usually delayed and indirect, involves prefaces (well, but, uh), and includes some kind of explanation for the response
• Ending a conversation abruptly can pose a threat to the positive face of the other person

Structure of Dialogue – Holtgraves
Source: Holtgraves, T., Language as Social Action: Social Psychology and Language Use, Lawrence Erlbaum Associates, Mahwah, New Jersey, pp. 89–119.

Ben Koh

Page 2:

Structure of Dialogue – Holtgraves, cont’d

Design Implications
• A SUI could use a lack of response as a cue to provide help
• SUIs can elicit targeted responses from users if the first part of an adjacency pair is sufficiently constraining
• Allowing utterances to occur in the middle of conversation creates a more natural experience
• SUIs can use cues from conversation to detect dispreferred responses and propose other options
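The first implication above can be sketched as a simple timeout rule: silence past a threshold is treated as a sign the request was not understood, and the system re-prompts with help. This is a minimal illustration; the function name, timeout value, and help text are all invented.

```python
# Treat prolonged silence as a cue to offer help (illustrative sketch).

def next_prompt(last_prompt: str, seconds_silent: float, timeout: float = 5.0) -> str:
    """Return a help re-prompt once the user has been silent past the timeout."""
    if seconds_silent < timeout:
        return ""  # still within normal response time: keep listening
    # No response: assume the request was not understood and rephrase with an example.
    return f"Sorry, let me rephrase. {last_prompt} For example, you can say 'check my balance'."

print(next_prompt("What would you like to do?", 6.2))
```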

Ben Koh

Page 3:

Structure of Dialogue – Beveridge
Source: Beveridge, M., Milward, D., Ontologies and the Structure of Dialogue, CATALOG ’04.

Findings
• During a conversation, responses sometimes include insufficient detail or too much detail
• Conversations are more natural when questions on related topics follow one another

Design Implications
• Prompt the user when more detail is required, but do not ask redundant questions when the user has already provided detailed information
• A system that is flexible in the order of questioning will produce conversation that flows more naturally from topic to topic
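Both implications map naturally onto slot-filling: the system tracks which details it still needs and asks only about those, in whatever order the conversation reaches them. The sketch below is an assumed minimal form, not Beveridge and Milward's implementation; the slot names and question texts are invented.

```python
# Flexible slot-filling: never re-ask for details the user already supplied.

def missing_slots(required, filled):
    """Slots still needed, in the system's preferred (but not mandatory) order."""
    return [s for s in required if s not in filled]

def choose_question(required, filled, questions):
    """Pick the next question, skipping slots the user has already filled."""
    for slot in missing_slots(required, filled):
        return questions[slot]
    return "Anything else?"  # all slots filled

required = ["city", "date"]
questions = {"city": "Which city?", "date": "What date?"}
# The user volunteered the city up front, so the system asks only about the date.
print(choose_question(required, {"city": "Boston"}, questions))
```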

Ben Koh

Page 4:

Case studies - Emacspeak

Emacspeak

An audio desktop for the visually impaired built on a text-to-speech AUI.

Features
• Unlike screen readers that speak the contents of a visual display, Emacspeak speaks the underlying information.

Key Findings
Pros: Intelligent audio formatting and audio icons
Cons: Errors

Design Implications
• Provides a groundwork for speech-enabling conversational interfaces: e.g., accessing the wealth of information on the Internet via a mobile telephone or while driving.
• A more “human” computer, one the user can talk to, may make educational and entertainment applications seem more friendly and realistic.

Other Speech Navigation Tools
• VoiceXML
• SALT

Lily Cho

Page 5:

Case studies- Evaluating a Spoken Language Interface to Email

Case Study: Background

Stigma on SLIs
• Limited for delivering information
• Require the user to learn the language the system can understand
• Hide available command options
• Lead to unrealistic expectations of capabilities

Experiment
• Two dialogue strategies for an SLI for accessing email (ELVIS) by phone:
– Mixed-initiative dialogue style, in which users can flexibly control the dialogue
– System-initiative dialogue style, in which the system controls the dialogue

Results
• The mixed-initiative system is more efficient (measured by number of turns, or elapsed time to complete a set of email tasks)
• Surprisingly, users preferred the system-initiative interface:
– Easier to learn
– More predictable

Conclusions
• Perhaps, if the study had progressed and users had become experts, they would have preferred the mixed-initiative system.

Lily Cho

Page 6:

Common Grounding
Clark, H., Brennan, S. (1991), “Grounding in Communication”

Common ground: mutual knowledge, mutual beliefs, and mutual assumptions.

In asking a question, it must be established that the respondent has understood what the questioner meant.

There are two phases to establishing grounding:
Presentation Phase: A presents an utterance to B. If B gives evidence, then A can believe that B understands what he means.
Acceptance Phase: B accepts the utterance from A by giving evidence that she believes she understands what A meant. B will also believe that once A registers this evidence, he will also believe that B understands.

The system in a SUI should be sure that the user understands its utterances and give the user a way of getting clarification if they do not understand.
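The present/accept cycle can be sketched as a small loop: the system presents an utterance, then checks the reply for positive evidence of understanding before treating the utterance as grounded. The evidence words and function names below are mine, not Clark and Brennan's, and a real system would use far richer signals.

```python
# Toy grounding check: require positive evidence before moving on.

def gives_evidence(user_reply: str) -> bool:
    """Acceptance phase (simplified): did the reply signal understanding?"""
    evidence = ("yes", "ok", "okay", "right", "got it")
    return user_reply.strip().lower() in evidence

def respond(system_utterance: str, user_reply: str) -> str:
    if gives_evidence(user_reply):
        return "Great, moving on."  # utterance is grounded
    # No evidence of understanding: offer clarification instead of pressing on.
    return f"To clarify: {system_utterance} Say 'help' for more detail."

print(respond("Your flight leaves at 9am.", "huh?"))
```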

Melissa Ludowise

Page 7:

Common Grounding
Kiesler, S. (2005), “Fostering Common Ground in Human-Robot Interaction”

• People attribute knowledge to robotic systems.
• If a robot has many human-like qualities, people will expect it to act like a real person. Robots may be more explicit than a human would be (example of a robotic security guard giving directions).
• Stereotypes can also be applied to robotic systems. For example, when participants were told a robot originated from China, they believed it to have more knowledge of Chinese landmarks.

• SUI systems should not be explicit unless necessary.
• Users will apply stereotypes to characters used in SUI systems.

Melissa Ludowise

Page 8:

Common Grounding
Patterson, E. S., Watts-Perotti, J., Woods, D. D., “Voice loops as coordination aids in space shuttle mission control,” Computer Supported Cooperative Work: The Journal of Collaborative Computing, 8(4), 353–371 (1999).

Finding: Constant monitoring of multiple auditory channels allows mission control operators to attend to many stimuli in their operational periphery. Subsequently, when tasks arise that require cooperation between heterogeneous groups of specialists, those specialists need less time to coordinate the task. This is due to their shared grounding: each understands the issues that other team members and groups have been struggling with.

Design Implication: When it is possible to output a constant, low-level stream of relevant background contextual information to the user, users will require less time and information to make a decision.

Jason Cornwel

Page 9:

Common Grounding
Monk, A., “Common ground in electronically mediated communication: Clark’s theory of language use,” Chapter 10 in J. Carroll (ed.), HCI Models, Theories and Frameworks, Morgan Kaufmann, San Francisco, 2003, pp. 265–290.

Finding: When conversational systems ignore one or more of Clark’s constraints on grounding (included in the notes for this slide), they incur additional costs in orienting their participants. One example from the paper is Cognoter, a shared electronic whiteboard system that allowed meeting participants to create and annotate “items” on the board either in parallel or in advance of the meeting. The system was unsuccessful because people did not have enough of a shared understanding at the outset of a meeting to make independent ideation efficient. The other example cited involved a doctor, patient, and observer communicating via videoconference. When the observer knew that the others were aware of him/her (copresence), the observer was more willing to interrupt to ask questions.

Design Implication: Design for as many of Clark’s constraints as possible to increase conversational efficiency.

Jason Cornwel

Page 10:

SUI malfunction may cause changes to a person’s emotional state, altering both acoustic properties and choice of words.

Initial strategy: prevent an angry reaction. Contingencies should be in place to detect and de-escalate emotional situations:
• Train for and adapt to emotional changes
• Initiate clarification dialogues
• Explain or apologize for system shortcomings
• Sum up the current state of the system
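One of Fischer's cues, repeats and reformulations, lends itself to a simple sketch: if a user's utterance largely repeats the previous rejected one, the system switches to an apologize-and-clarify strategy. The word-overlap similarity test and the strategy labels below are invented stand-ins for real emotion detection.

```python
# De-escalation trigger based on near-repeated utterances (illustrative only).

def is_repeat(prev: str, current: str) -> bool:
    """Crude similarity: do the two utterances share most of their words?"""
    a, b = set(prev.lower().split()), set(current.lower().split())
    overlap = len(a & b) / max(len(a | b), 1)
    return overlap > 0.6

def pick_strategy(prev_utterance: str, utterance: str) -> str:
    if prev_utterance and is_repeat(prev_utterance, utterance):
        # The user is repeating themselves: apologize and clarify instead of re-prompting.
        return "apologize_and_clarify"
    return "normal_prompt"

print(pick_strategy("call John Smith", "call John Smith please"))
```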

K. Fischer (1999), “Repeats, reformulations, and emotional speech: Evidence for the design of human-computer speech interfaces,” Human-Computer Interaction: Ergonomics and User Interfaces, Volume 1 of the Proceedings of the 8th International Conference on Human-Computer Interaction, Munich, Germany. Lawrence Erlbaum Associates, London, pp. 560–565.

Simon King

Emotion and conversation - Emotional reactions to system interaction

Page 11:

Small talk is an important mechanism for managing both the channel of communication and interpersonal distance between a user and an interface agent.

It can help maintain an open channel of communication and establish a social bond.

Simple acts can keep the conversation flowing:
• Always respond with more information than was explicitly asked for
• Use idling behavior such as “Mhm” or “Yeah”

Timothy Bickmore (1999), “A Computational Model of Small Talk,” MAS 962: Discourse and Dialog for Interactive Systems. http://web.media.mit.edu/~bickmore/Mas962b/

Simon King

Emotion and conversation - Small talk and social intelligence

Page 12:

“Matching” emotional states between the user and the SUI has the following positive effects:
• Performance increase: fewer errors in tasks requiring high attention, quicker response times, greater attention spans
• Increased communication: encouraging continued communication can help task completion

The research identified “matching” as:
• Use an energetic voice when the user is happy
• Use a subdued voice when the user is upset

» (The implications of this paper suit the design of systems where a voice-based interface operates in a multi-tasking environment where safety is important.)
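The matching rule above can be sketched as a mapping from detected user mood to synthesizer settings. The mood labels and parameter values below are invented for illustration; a real system would drive its own TTS engine's controls.

```python
# Map user mood to voice style per the "matching" rule (values are assumptions).

def voice_style(user_mood: str) -> dict:
    """Energetic voice for a happy user, subdued voice for an upset one."""
    if user_mood == "happy":
        return {"rate": 1.15, "pitch_var": "high", "volume": 0.9}   # energetic
    if user_mood == "upset":
        return {"rate": 0.9, "pitch_var": "low", "volume": 0.6}     # subdued
    return {"rate": 1.0, "pitch_var": "medium", "volume": 0.8}      # neutral default

print(voice_style("upset"))
```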

Ray Su

Emotion and conversation - Matching Emotional States and Attention/Performance

Page 13:

Emotion and conversation - Augmenting Conversation with Dual-purpose Speech

A dual-purpose speech interaction is one where the speech serves two roles:
1. It is socially appropriate and meaningful in the context of a human-to-human conversation.
2. It provides useful input to a computer.

Much of office work involves interpersonal conversation. Dual-purpose speech allows users to interact with SUI systems without interrupting work and normal conversations with others.

Application of dual-purpose speech: use keywords in normal human-to-human conversation as commands to the speech user interface.
• “Lemme check the email sent by Donna last week”
• “email,” “Donna,” and “last week” are action prompts for the SUI to navigate through the email app.

Current limitations in speech recognition technology require the user to “push to prompt” the input of keywords to the speech user interface. Future designs could remove this limitation to allow smoother interaction with the system.
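The keyword-spotting idea in the example above can be sketched as follows. The tiny vocabularies and the cue names are invented for illustration; real dual-purpose speech systems use full speech recognition, not substring matching.

```python
# Toy keyword spotter for dual-purpose speech (vocabulary is illustrative).

KNOWN_SENDERS = {"donna", "fred"}
TIME_PHRASES = {"last week", "yesterday", "today"}

def spot_cues(utterance: str) -> dict:
    """Pull action cues out of an ordinary human-to-human sentence."""
    text = utterance.lower()
    cues = {}
    if "email" in text:
        cues["action"] = "open_email"   # "email" triggers the email app
    for sender in KNOWN_SENDERS:
        if sender in text:
            cues["sender"] = sender     # filter by sender name
    for phrase in TIME_PHRASES:
        if phrase in text:
            cues["when"] = phrase       # filter by time phrase
    return cues

print(spot_cues("Lemme check the email sent by Donna last week"))
```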

Ray Su

Page 14:

Annie Ha

D. E. Kieras, et al., “An EPIC computational model of verbal working memory,” University of Michigan.

Experiment on word recall:
• Phonologically distinct words yield higher recall performance than similar words.
• Lists of short words show significantly better recall performance than lists of long words.

Design: short lists, short words, phonological distinctness.
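One hedged way to apply the design rule above: filter candidate menu words to keep a short list of short words that are not too similar to each other. Spelling edit distance is used here as a crude stand-in for phonological similarity, which is an assumption of this sketch, not the paper's method.

```python
# Pick a short list of short, mutually distinct words for a voice menu.

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def pick_menu_words(candidates, max_items=5, max_len=6, min_distance=2):
    chosen = []
    for word in candidates:
        if len(word) > max_len:
            continue  # prefer short words
        if all(edit_distance(word, c) >= min_distance for c in chosen):
            chosen.append(word)  # distinct enough from everything picked so far
        if len(chosen) == max_items:
            break  # keep the list short
    return chosen

# "mat" is dropped (too close to "map"); "navigation" is dropped (too long).
print(pick_menu_words(["map", "mat", "stop", "start", "navigation"]))
```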

Cognition and Auditory Working Memory

Page 15:

A. Baddeley, “The episodic buffer: a new component of working memory?” (Nov. 2000), Trends in Cognitive Sciences, vol. 4, no. 11.

Phonological similarity effect: words or letters with similar pronunciations are harder to remember.

Word-length effect and articulatory suppression: irrelevant sound (e.g., repeating “the”) decreases performance.

Usually, when the words are unrelated, people start to make errors once the number of words exceeds 6; when the words are related, a span of 16 words is possible.

Phonological Loop

Annie Ha

Page 16:

TalkBack: a conversational answering machine

Make email responses more conversational

• Messages are segmented by detected speech gaps or topic shifts
• The system pauses briefly at each gap to allow for a response; users stop recording (or let playback continue) by not speaking
• Users can record at any time by interrupting playback with speech

Good things
• Users found it easier to provide responses than with regular email
• Replies are like inline email replies
• Users said it would be nice to get an overview of the message before replying (a summary?)
• No touching required to begin recording

Bad things
• Pauses that stopped reply recording were annoying; users pause while talking to think. Users suggested that pressing something to stop might be better
• Users wanted to be able to jump between segments

Jeff Wong

Source: Vidya Lakshmipathy, Chris Schmandt, and Natalia Marmasse, “TalkBack: a conversational answering machine,” UIST 2003, Vancouver, BC, Canada, pp. 41–50.

Page 17:

TalkBack Design Implications
• Inline replying reduces the memory load of replying
• Automatic segmentation can enable some within-message navigation
• Interruption by speech is easier than interruption by button press (compared with an early prototype)
• Don’t use pauses to end a recording
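Gap-based segmentation of incoming messages, the kind that enables within-message navigation, can be sketched as splitting a stream of audio frames wherever a run of silence exceeds a threshold. The frame energies, silence level, and gap length below are invented; TalkBack's actual segmenter also used topic shifts.

```python
# Hypothetical gap-based segmentation of an audio frame stream.

def segment(frame_energies, silence_level=0.1, gap_frames=3):
    """Split frames into segments at runs of silence >= gap_frames long."""
    segments, current, silent_run = [], [], 0
    for e in frame_energies:
        if e < silence_level:
            silent_run += 1
            if silent_run >= gap_frames and current:
                segments.append(current)  # gap long enough: close the segment
                current = []
        else:
            silent_run = 0
            current.append(e)
    if current:
        segments.append(current)
    return segments

# Two bursts of speech separated by a long pause become two segments.
print(len(segment([0.8, 0.9, 0.0, 0.0, 0.0, 0.7, 0.6])))  # → 2
```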

Jeff Wong

TalkBack: a conversational answering machine

Page 18:

Using spatial audio to indicate location in a document is BAD:
• Uncomfortable to hear in one ear for a long time
• Requires surround sound or headphones to work

Design features:
• Take-off and landing sounds indicate navigation skipping
• Different gender voices and spatial locations announce position in the document

Jeff Wong

A 3D audio only interactive web browser

Source: Goose, S. and Müller, C. (1999), “A 3D audio only interactive Web browser: using spatialization to convey hypermedia document structure,” MULTIMEDIA ’99, ACM Press, New York, NY, pp. 363–371.

Page 19:

The Findings
The Skip and Scan paper describes an automated interface that people interact with over the phone. It compares the current system of menus and number prompts (i.e., “For accounting, press 1…”) to a new way to interact with the system.

In this new interaction, the user navigates through the prompts using the 7 key to go backward, the 9 key to go forward, and the 1 key to select the current prompt. This allows for larger menus.

This system allows users to browse through the menus at will.

The new interaction took some time to get used to; the younger the group, the faster they got it. Once users got the hang of it, it was quicker than listening to all the prompts in a menu.
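The key bindings described above (7 backward, 9 forward, 1 select) can be sketched as a tiny navigator over a prompt list. Everything beyond the three keys, such as clamping at the ends of the menu, is an assumption of this sketch.

```python
# Skip-and-Scan style keypad navigation over a list of prompts.

def navigate(prompts, keys):
    """Process a sequence of keypresses; return the selected prompt, if any."""
    pos = 0
    for key in keys:
        if key == "9":
            pos = min(pos + 1, len(prompts) - 1)  # forward one prompt
        elif key == "7":
            pos = max(pos - 1, 0)                 # backward one prompt
        elif key == "1":
            return prompts[pos]                   # select the current prompt
    return None  # nothing selected yet

menu = ["Accounting", "Sales", "Support"]
print(navigate(menu, ["9", "9", "7", "1"]))  # → Sales
```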

Skip and Scan

Resnick, Paul and Robert A. Virzi (1992), “Skip and Scan: Cleaning Up Telephone Interfaces,” Proceedings of CHI ’92, New York: ACM Press, pp. 419–426.

James Soracco

Page 20:

The very old do not want to interrupt the system, so they are not much faster with this method.

The vast majority of users from every group said that they liked this new method of interaction. One of the main reasons given was that it put them in control of the system instead of waiting for the system.

Design Implications
Most users prefer to have control over navigating their data and can navigate it faster than if it were read to them. It is important to create a system where the user can navigate the data quickly (both forward and backward) and interrupt the system when they recognize that the message being read is not what they wanted, so they can move on to the next item in the list.

James Soracco

Skip and Scan


Page 21:

Machine with human-like qualities- Julie and other voice systems

• Julie is Amtrak’s phone answering system (1-800-USA-RAIL)
• She has received strong positive reviews
• Surveys have found that callers give Julie a 90 percent approval rating
• She handles 25% of Amtrak’s calls, about five million, at a savings of $13 million
• The interface is informal, with quotes such as: “OK, let’s get started”; “You’ll want paper handy”; “Got it!”; “Sorry, I didn’t get that.” She apologizes when wrong.

Julie Stinneford, the voice of Amtrak’s Julie.

Craig Borchardt

Page 22:

Machine with human-like qualities - Other noted voice systems

• Tom, United Airlines
• Jenni McDermott, Yahoo: comes with a photo and a four-page biography, which describes that she graduated from Berkeley with an art history degree in 2001. Quotes are “Got it!”, “Cool”, and “Wow, you’re popular” (for callers with crowded e-mail boxes).
• Claire, Sprint: a failed system no longer used by Sprint. Online blogs still carry comments about the system. Claire could not recover when words were mispronounced and could not deal with background noise. One blog complained that she could not be interrupted; another said she was the reason for switching to AT&T. She was attractive and friendly, but navigation was a maze. She sounded happy even when customers were angry and frustrated.
• Mercedes-Benz: had to change the on-board software in some cars because men complained that they did not want to take orders from a female voice.
• Brokerage houses: found that callers responded favorably to female voices but want to deal with a man when making a trade.

Craig Borchardt

Page 23:

Machine with human-like qualities - Voice systems and interaction

Design Implications
• A virtual operator should be able to sense if you’re flustered by the length of pauses and number of “uhs,” as well as from fluctuations in voice inflection.
• Matching voice, gender, and persona to the user is important.

Source: New York Times, November 24, 2004; Chicago Tribune, March 28, 2005.

Craig Borchardt

Page 24:

• “I was involved in the dancing paper clip… I think they are right to despise him for many reasons.
• The single best reason is a fundamentally social one. When you ask people why they hate that paper clip so much, one of the first things they say is, ‘Well, every time I write “Dear Fred,” the thing pops up and says, “Oh, I see you’re writing a letter.”’ And they dismiss it. The first time it was okay, it was helpful.
• The second time it was at least trying. But the forty-seventh time, it was clearly being at best passive-aggressive and at worst downright hostile, implying that I couldn’t make a decision about the right thing to do.” Well, we know what we do with people like that: we hate them. In fact, the first rule in the Dale Carnegie course “How to Win Friends and Influence People” is to remember things about people.

Machine with human-like qualities - Clifford Nass on Clippy the Microsoft Paper Clip:

Craig Borchardt

Page 25:

Machine with human-like qualities - Nass and Microsoft Paperclip

• The paper clip doesn’t do that. Also it manifests a particular personality style that’s not very popular; it’s a rather dominant, unfriendly personality style.

• There are characters in there that lots and lots of people like – unfortunately the interface was designed so that you couldn’t discover them…research shows that almost everyone finds a character that they like.”

• Image from Joe Tullio

Craig Borchardt

Page 26:

Design Implications
• The interface should learn about the user’s actions.
• Match the personality of the interface to the user.
• Let the user choose which interface character to interact with.

Source: “Conversations with Clement Mok and Jakob Nielsen, and with Bill Buxton and Clifford Nass,” Interactions, Volume 7, Issue 1, January 2000.

Craig Borchardt

Machine with human-like qualities - Nass and Microsoft Paperclip

Page 27:

Machine with human-like qualities - Can computer personalities be human personalities?

Clifford Nass, et al., CHI 1995.
• The study examined the theory that computer personalities can be created with a small set of cues, and that people respond to these personalities the same way they would respond to human personalities.
• In this experiment, dominant and submissive computer personalities were created and paired with people who were determined to have dominant or submissive personalities.
• The experiment concluded that when a person was paired with a computer with a similar personality, higher affiliation and competence ratings resulted.

Craig Borchardt

Page 28:

• Design Implication: Create multiple interface characters so that users can choose a match for themselves.

• Source: Nass, C., et al. (May 1995). “Can computer personalities be human personalities?” Paper presented to the CHI ’95 conference of the ACM/SIGCHI, Denver, CO.

Craig Borchardt

Machine with human-like qualities - Can computer personalities be human personalities?

Page 29:

• The study looked at how gender in computer speech affected the user’s perception of the computer.
• It found that a male voice exerted greater influence on the user’s decision than a female voice, and was seen as more socially attractive and trustworthy.
• Gendered synthesized speech triggered social identification processes: female subjects conformed more to female-voiced computers, and males conformed more to male-voiced computers.

Machine with human-like qualities - Can computer-generated speech have gender? An experimental test of gender stereotype.

Craig Borchardt

Page 30:

• A speech interface should consider the gender of its voice and consider presenting the user with the option to choose a voice.

• Source: Lee, Eun-Ju, et al. (April 2000). Paper presented to the CHI ’00 conference of the ACM/SIGCHI.

Craig Borchardt

Machine with human-like qualities - Can computer-generated speech have gender? An experimental test of gender stereotype.

Page 31:

Machine with human-like qualities - Kismet

Source: Sociable Machines Project homepage, MIT: http://www.ai.mit.edu/projects/humanoid-robotics-group/kismet/kismet.html

Overview
• An autonomous robot designed for social interactions with humans.
• Perceives natural social cues from visual and auditory channels, and delivers social signals to the human through gaze direction, facial expression, and vocal babbles.

Jaewon Kang

Page 32:

Findings
• Instead of trying to achieve realism, the project team focuses on sharing emotions through communication.
• To recognize and respond affectively to intents such as praise, prohibition, attention, and comfort, Kismet identifies differences in speech rate, pitch, intensity, etc.
• Kismet is designed to look like a very young child, so that people naturally exaggerate the way they speak, which delivers a very characteristic tone of voice.

Machine with human-like qualities - Kismet

Jaewon Kang

Page 33:

Findings
• Using a voice synthesizer, Kismet generates sound with pitch accents in response to the speaker’s communicative intent.
• Even though there is no grammatical structure, its manner of vocal expression gives understandable responses and contributes to Kismet’s personality.

Implications
• Responses/feedback should be intuitively understandable
• Simplifying human emotions into subsets is useful for understanding the speaker’s intent
• Create conditions in which people naturally exaggerate their voice, yielding more recognizable input
• A strong system personality can offset somewhat unnatural and slow machine responses in communication

Related movie clip:

http://www.ai.mit.edu/projects/sociable/movies/expression-examples.mov

Machine with human-like qualities - Kismet

Jaewon Kang

Page 34:

Overview
• A portable voice-interactive device that bridges the gap between human observations and computer databases by allowing inspectors to input findings directly into a computer system.
• Consists of two parts: a central processing unit with battery pack (3 lb.) and a speech-recognition headset.
• Works in environments of up to 100 decibels; sound variances won’t affect voice recognition.
• Can be tailored to specific jobs, so in many cases the program looks totally different from the generic format.

[Images: wearable computer and headset]

Machine with human-like qualities - VoCollect

Source: “Machine to human: can we talk?” Ward’s Auto World, May 1992, by Stephen E. Plumb. Vocollect homepage: http://www.vocollect.com/global/web.php/en/

Jaewon Kang

Page 35:

Findings
Process
• The host coordinates operational data and sends assignments
• Assignments are converted into speech and completed
• Vocollect voice software sends real-time status on assignments
• The host updates its data

Benefits
• Boosts productivity: 30% faster
• Improves accuracy
• Cuts training time
• Lowers operating costs

Machine with human-like qualities - VoCollect

Jaewon Kang

Page 36:

Implications
• Give users the ability to tailor the device to meet individual needs
• Voice recognition is effective for real-time data updates
• Try to reduce the time needed to train the user
– Vocollect keeps the structure of the voice messages consistent

Machine with human-like qualities - VoCollect

Jaewon Kang

Findings Application

• Material handling, shipping, and receiving verification
• Order selection, replenishment, put-aways, and transfers
• Linked with a bar-code scanner so inspectors don't have to read 17-digit identification numbers

Speech and Sound UIs Hwi Kyoung Lee Adv. IID: Fall 2006

Non-verbal voice command
Non-verbal features in speech:
• Continuous voicing as an on/off button
• Increasing pitch as an accelerator
• Tonguing as a discrete controller
• Different vowel qualities as a direction indicator

Pros: immediate, continuous control
Cons: unnatural way of using the voice
• Can complement other speech recognition interfaces and visual interfaces
• For example, to adjust system parameters
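A minimal sketch of how these non-verbal features could drive a controller, assuming voicing, pitch, and vowel identity are extracted upstream by some recognizer; the vowel-to-direction mapping and pitch range are invented examples:

```python
# Assumed mapping from vowel quality to a direction (illustrative only).
VOWEL_DIRECTIONS = {"a": "left", "i": "right", "u": "forward"}

def pitch_to_speed(pitch_hz, base_hz=120.0, max_hz=400.0):
    """Map rising pitch onto a 0.0-1.0 accelerator value."""
    span = max_hz - base_hz
    return min(max((pitch_hz - base_hz) / span, 0.0), 1.0)

def voice_command(is_voicing, pitch_hz, vowel):
    """Turn one frame of voice features into a control command."""
    if not is_voicing:
        # Continuous voicing acts as the on/off button.
        return {"active": False, "speed": 0.0, "direction": None}
    return {
        "active": True,
        "speed": round(pitch_to_speed(pitch_hz), 2),   # pitch as accelerator
        "direction": VOWEL_DIRECTIONS.get(vowel),      # vowel as direction
    }

print(voice_command(True, 260.0, "i"))
```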

Speech and Sound UIs Rachel Glaves Adv. IID: Fall 2006

Auditory Icons

Tested 83 students for sound recognition:
• Less than 15% correctly identified the exact sound
• 80% partially identified the material or function
• Sounds were identified as either objects or actions: tearing, ripping, winding vs. camera, door, zipper

Sounds can represent objects and actions, but their use should not require the user to interpret them too specifically.

Source:

Mynatt, Elizabeth. (1994) "Designing With Auditory Icons: How Well Do We Identify Auditory Cues?" Conference on Human Factors in Computing Systems. Boston, MA: ACM Press, pp. 269-270.

Speech and Sound UIs Jared Cole Adv. IID: Fall 2006

Adaptivity in SUI: Speech User Interface

Adaptivity is typically used to make an interaction more efficient or more comfortable for users.

Adaptive systems need to be aware of User differences (characteristics/behaviour/preferences).

Adaptive systems must be aware of Context differences (and that context may change during the interaction). Context also plays a key role in User behaviour.

Adaptive systems must be self-aware enough to let Users know if something has changed within the system.

Primary notes of interest:• Error/contingency planning• Adaptive system (information) architecture• System awareness (User/Context/Self)

Speech and Sound UIs Ricardo Marquez Adv. IID: Fall 2006

Language and Culture (1)

Conversational interaction that includes expanding and giving synonyms, when a second-language speaker is involved in the conversation, improves the NNS's (non-native speaker's) accuracy while executing the intended task (or conversation).

Recasting, the act of repeating a phrase while correcting structural mistakes the NNS makes when formulating a statement, improves the NNS's understanding of the overall context, meaning, and syntax of the conversation. (The big trade-off is the emotional response to hearing the interlocutor correct what you say.)

It is possible to find patterns in speech (such as the length of sounds and silences) across languages. For instance, when an NNS does not understand a word in a conversation, he or she can still infer at least the basic notion of the answer from emotion and tone, but also from the structure of sound and silence.

Findings

Speech and Sound UIs Ricardo Marquez Adv. IID: Fall 2006

Language and Culture (1) Design Implications

Synonyms and other equivalents for the actual terms used throughout the SUI, or in the help section, can aid the non-native-speaking user in completing his or her task successfully.

Keeping commands and answers to very short phrases is the tradition, and the logical structure, for SUIs. However, a little more feedback or context helps the user immensely if he or she does not understand the meaning of the words in the feedback.

Repeating terms when necessary can help the NNS user associate these words with their actual meanings, providing language-performance tools for NNS power users.
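The synonym idea can be sketched as a small command resolver that maps equivalent terms onto a canonical vocabulary; the word lists here are toy examples, not drawn from the paper:

```python
# Toy synonym table: equivalent terms map to a canonical command word.
SYNONYMS = {
    "erase": "delete", "remove": "delete",
    "begin": "start", "launch": "start",
}

# The SUI's actual command vocabulary (illustrative).
COMMANDS = {"delete", "start", "stop"}

def resolve(utterance):
    """Map each word through the synonym table and return known commands."""
    words = [SYNONYMS.get(w, w) for w in utterance.lower().split()]
    return [w for w in words if w in COMMANDS]

print(resolve("please erase the message"))  # -> ['delete']
```

The same table can run in reverse to generate help text ("you can also say: erase, remove") so the NNS user hears the equivalences, not just the canonical terms.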

Based on: Mackey, A., Philp, J. "Conversational Interaction and Second Language Development: Recasts, Responses, and Red Herrings?" The Modern Language Journal, Vol. 82, No. 3, Special Issue: The Role of Input and Interaction in Second Language Acquisition (Autumn 1998), pp. 338-356.

Speech and Sound UIs Ricardo Marquez Adv. IID: Fall 2006

Language and Culture (2)

It is hard for people over 12 years old to reach native-speaker performance in a second language, independently of how long they have practiced it.

The elements that constitute the "foreign" accent can be broken down into components with equivalents in pitch, tone, and speed of speech.

Once the foreign language has been identified, elements such as rhythm and speed can be measured and compared against the same elements of the first language in which the conversation takes place.

Findings
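The rhythm and speed measurements above can be sketched from a list of timed speech/silence segments; the segmentation itself is assumed to come from a voice-activity detector, and the function name is an invention for illustration:

```python
def speech_stats(segments, syllable_count):
    """Return speaking rate and pause statistics for one utterance.

    segments: list of (start_sec, end_sec, kind) tuples,
              where kind is "speech" or "silence".
    """
    speech = sum(e - s for s, e, kind in segments if kind == "speech")
    pauses = [e - s for s, e, kind in segments if kind == "silence"]
    return {
        "speech_time": speech,
        "rate_syll_per_sec": syllable_count / speech if speech else 0.0,
        "mean_pause": sum(pauses) / len(pauses) if pauses else 0.0,
    }

segs = [(0.0, 1.0, "speech"), (1.0, 1.5, "silence"), (1.5, 3.0, "speech")]
print(speech_stats(segs, syllable_count=10))
```

Comparing these numbers against reference values for the conversation's first language is one concrete way to quantify the rhythm-and-speed differences the finding describes.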

Speech and Sound UIs Ricardo Marquez Adv. IID: Fall 2006

Language and Culture (2) Design Implications

Machines with speech-recognition capabilities can learn to identify "foreign accents" and correct their interpretation of words based on the analyzed sample of speech.

Depending on which elements of speech are analyzed, common speech patterns can be used not only to structure the menus of a SUI but also to structure the feedback given to the NNS user.

Based on: Van Els, T., De Bot, K. "The Role of Intonation in Foreign Accent." The Modern Language Journal, Vol. 71, No. 2 (1987), pp. 147-155.

Speech and Sound UIs Carl Angiolillo Adv. IID: Fall 2006

It is possible not only to place a virtual sound at a precise point in three-dimensional space, but also to modify the size and shape of the room in which the virtual sound is contained.
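One of the binaural cues used to place a virtual sound in space can be sketched with the Woodworth approximation of the interaural time difference (ITD); this illustrates the general principle, not the rendering engine described in the source:

```python
import math

HEAD_RADIUS_M = 0.0875     # average adult head radius in meters
SPEED_OF_SOUND = 343.0     # m/s at room temperature

def itd_seconds(azimuth_deg):
    """Woodworth ITD approximation for a far-field source at a given azimuth.

    The sound reaches the near ear earlier than the far ear; the brain
    uses this delay to locate the source in the horizontal plane.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND) * (theta + math.sin(theta))

# A source 90 degrees to the side arrives about 0.66 ms earlier at the
# near ear; a source straight ahead (0 degrees) has no delay.
print(round(itd_seconds(90.0) * 1000, 2))
```

Room size and shape enter separately, through the pattern and timing of simulated early reflections and reverberation layered on top of cues like this one.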

3D Virtual Interactive Acoustic Environments

Source:

Noisternig, M., Musil, T., Sontacchi, A., Holdrich, R. (2003) "A 3D Real Time Rendering Engine for Binaural Sound Reproduction." Proceedings of the 2003 International Conference on Auditory Display, Boston, Massachusetts.

Speech and Sound UIs Carl Angiolillo Adv. IID: Fall 2006

3D Virtual Interactive Acoustic Environments

In a comparison of sighted and blind children, blind children could successfully map out a fully three-dimensional space using only sound cues, while sighted children had much more trouble: their mental maps included many incorrect elements and left out many subtle ones.

Source:

Sanchez, J., Lumbreras, M. (2000) "Usability and Cognitive Impact of the Interaction with 3D Virtual Interactive Acoustic Environments by Blind Children." Proc. 3rd Intl Conf. on Disability, Virtual Reality & Assoc. Tech., Alghero, Italy.