Where's Jarvis? The Future of Voice Recognition and Natural Language User Interfaces (UXPA 2016)


Upload: crispin-reedy

Post on 16-Apr-2017


Where's Jarvis?
The Future of Voice Recognition and Natural Language User Interfaces
Crispin Reedy, Versay Solutions
@crispinTX
crispinreedy.com
#UXPA2016

#UXPA2016 Session Survey: http://www.uxpa2016.org/sessionsurvey?sessionid=321
2016 Versay Solutions

Voice User Interface Designer
10 years in the field
English major, former coder; got interested in UX
President of the Association for Voice Interaction Design
Consultant for Versay Solutions
2 weeks in a row for conferences


From the session description
What is voice recognition?
What is natural language understanding?
What are the common technologies in the market today?
How does this fit with IoT?
What are design considerations / methods to evaluate these types of interfaces?
Implied: Should I speech-enable my ___?
Bonus Q: Why doesn't it work the way we want it to, and when will it?


Should I Speech-Enable My ___?


Iron Man 2: Marvel Studios, Paramount Pictures

Jarvis:
Audio and gestural
Perfect recognition. No error recovery needed
Great voice quality
Connected to vast amounts of data
Understands all the parts of the model: "Lose the landscape."
Context-sensitive. Aware of the space around him
Sense of humor. "Am I to include the Belgian Waffle stands?"
Takes initiative. "What is it you're trying to achieve, sir?"


Star Trek Voyager: Paramount Television

Replicator:
Good recognition
No error recovery needed
Good voice quality, understandable
Connected to data, perhaps too much so?
Context sensitive, but was this enough?
A design failure (not a tech failure)
Specifically around excessive disambiguation


A Better Replicator Conversation
User: "Tomato soup."
Replicator: "Tomato soup. OK, what kind?"
User: "Just plain."
Replicator: "Coming right up!"
Implicit confirmation
Second-level, open-ended prompting
Cultural context: plain = hot

Terms & Technologies
Speech Recognition
Natural Language Understanding
Voice Verification (Biometrics)
Text to Speech

Speech Recognition (ASR)
"See the cat."
Speech to Text
Spoken language to machine-readable format

Natural Language Understanding
Extracting meaning from natural text
"Hello, yes, I'd like to pay my water bill. Can you help me with that?"
Intent = BillPay
Entity (Bill Type) = Water
Not necessarily tied to speech recognition
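As a rough illustration of the intent and entity idea on this slide, here is a minimal keyword-matching sketch. The keyword lists, the `Balance` intent, and the `understand` function are assumptions made up for this example; production NLU engines are typically trained on tagged data rather than hand-written keyword lists.

```python
import re

# Hypothetical vocabularies, invented for this sketch only.
INTENT_KEYWORDS = {
    "BillPay": ["pay", "payment"],
    "Balance": ["balance", "owe"],
}
BILL_TYPES = ["water", "electric", "gas"]


def understand(utterance):
    """Return the first matching intent and bill type (None when nothing matches)."""
    words = re.findall(r"[a-z']+", utterance.lower())
    intent = next(
        (name for name, keys in INTENT_KEYWORDS.items() if any(k in words for k in keys)),
        None,
    )
    bill_type = next((b for b in BILL_TYPES if b in words), None)
    return {"intent": intent, "bill_type": bill_type}


result = understand("Hello, yes, I'd like to pay my water bill. Can you help me with that?")
print(result)
```

Note that the input here is plain text, which matches the slide's point that NLU is not necessarily tied to speech recognition.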

Voice Verification

User: "My voice is my password."
System: "Authenticated. Welcome, Mr. Smith."

Also called voiceprints, biometrics, voice authentication, etc.
Not going to discuss this one in a lot of detail today, but it's important that you understand the difference between these technologies.
Recognizes a person, not necessarily what they are saying.
You can have ASR without Voice Verification, and vice versa.

Text To Speech

Human voice talent
Hundreds of hours of recording
Digitized
Phonemes: concatenated speech synthesis

What Is Good TTS?
Phonemes change based on location
  Cat
  Alligator
Elision
  "I'm. Awaiting. You."
  "I'm awaiting you."
Intonation
  "Do you want coffee?"
  "Do you want soda, tea, or coffee?"
Most TTS isn't movie quality

IMDB

Dynamic Speech Synthesis
Many commercial products are available
  API-based
  Downloadable
Quality varies
If possible, record audio
  TTS has improved considerably, but is still noticeable
  High quality TTS may not be available in all situations
If you have a lot of dynamic data, TTS is useful
  You can mix recorded audio and TTS
You may have to use TTS
  Voice Agent (Alexa, Cortana, etc.)
  API-based
  Some of them do let you mark up your TTS with SSML

More phonemes = higher quality voice
Also means a bigger download and install (if on device)
Exceptions (addresses, names) can be iffy
May require a lot of work to handle well
"St. James St." -> "Saint James Street"
Punctuation
Your data needs to be clean and ready to voice back
Acronyms, incomplete sentences will not sound good
It is possible to build a custom voice
But it takes a lot of work!
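To make the "St. James St." example concrete, here is a small text-normalization sketch of the kind of cleanup data may need before it is voiced back. The rule used (a "St." directly before a capitalized word is "Saint", otherwise "Street") is an assumed heuristic for illustration; it would misread inputs such as "Main St. Apt 4", which is one reason this work can take real effort.

```python
import re


def normalize_address(text):
    """Expand the abbreviation "St." before handing text to a TTS engine (naive heuristic)."""
    # Assumption: "St." followed by a capitalized word means "Saint".
    text = re.sub(r"\bSt\.\s+(?=[A-Z])", "Saint ", text)
    # Any "St." still left is treated as "Street".
    text = re.sub(r"\bSt\.", "Street", text)
    return text


spoken = normalize_address("12 St. James St.")
print(spoken)
```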

SSML Example

SSML

Speech Synthesis Markup Language
XML-based W3C standard
Not universally supported
Tags which allow you to produce more natural-quality output:
Emphasis
Break
Voice
Prosody
Pitch
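The original SSML example slide did not survive extraction, so the following is only a sketch of what marked-up output might look like, using the `emphasis`, `break`, and `prosody` elements from the W3C specification. The prompt wording is invented, and since SSML is not universally supported, any given TTS engine or agent may accept only a subset of these tags. The Python wrapper just checks that the markup is well-formed XML.

```python
import xml.etree.ElementTree as ET

# Illustrative SSML document; the prompt text is made up for this sketch.
ssml = (
    '<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    'Your balance is <emphasis level="strong">fifty dollars</emphasis>.'
    '<break time="500ms"/>'
    '<prosody rate="slow" pitch="low">Is there anything else?</prosody>'
    '</speak>'
)

# Parsing raises ParseError if the markup is not well-formed.
root = ET.fromstring(ssml)
# Strip the namespace prefix to list the SSML elements that were used.
tags = [element.tag.split("}")[1] for element in root]
print(tags)
```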

Speech Recognition
  Hands-free command / control
  Dictation
  Input text
  Small form factor device, etc.
Text To Speech
  Output text dynamically
  Respond to input
  Useful when no display is available
Natural Language Understanding
  Necessary for all language-based input
  Extract meaning
  Parse large volumes of text
Voice Verification
  Security


[Diagram: components of a voice application. Labels: ASR, NLU, TTS, Verification, Voiceprints, Application, Data, Sign-In, Interaction, Request, Action, Meaning, Access Data, Output]


[Diagram: components of a multimodal application. Labels: ASR, NLU, TTS, Verification, Voiceprints, Application, Data, Sign-In, Interaction, Request, Action, Meaning, Access Data, Output, Touch, Keyboard, Visual, Manage I/O Modality, Determine Meaning in Context, Context!]


ASR


[Diagram: levels of language processing. Labels: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics; Linguistics, Physiology; Concepts, Phrases, Words, Phonemes, Sounds; Speaking / Listening, ASR, NLU]

World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what is a car, etc. This is the level before language.
Semantics: The first level of language. Knowledge can be represented in structured, meaningful elements. Example: semantics of a party invitation.
Syntax: The rules that govern putting words together to form meaningful units.
Lexicon: What words mean.
Morphology: How words change their form to perform differently in a language, e.g. horse / horses.
Phonetics: Phonemes and how words are built.
Acoustics: What phonemes sound like and how to create them.

Speech is ambiguous

Speech is never stationary
Coarticulation
Noisy environments
Accents
Different speakers have voices with different acoustic qualities
Goats
Challenges vary depending on what you are going to recognize
Spelling (short utterances) can be difficult even for humans
Phonetic alphabet (Military)

Language is ambiguous

Humans can deduce meaning from context and unknown words

System: "How can I help you?"
User: "I'm having a problem with my account."

"I'd like that one. No, not the green one, the red one."

"Time flies like an arrow."
"Fruit flies like a banana."

Everything is ambiguous

All modern speech recognition is probabilistic
GUI: Button clicked? true / false
VUI: There is an 85% chance that button was clicked
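One common way a voice application copes with a probabilistic result is to branch on the recognizer's confidence score. The sketch below assumes a score between 0 and 1 and two thresholds; the threshold values and the `next_step` function are hypothetical, and in practice such values would be tuned against real usage data.

```python
# Hypothetical thresholds, chosen only for illustration.
ACCEPT_AT = 0.80
CONFIRM_AT = 0.45


def next_step(hypothesis, confidence):
    """Decide how the dialog should proceed for one recognition result."""
    if confidence >= ACCEPT_AT:
        # High confidence: act on it (possibly with an implicit confirmation).
        return "accept: " + hypothesis
    if confidence >= CONFIRM_AT:
        # Medium confidence: ask the user to confirm before acting.
        return "confirm: did you say '" + hypothesis + "'?"
    # Low confidence: treat as a no-match and reprompt.
    return "reject: sorry, I didn't catch that"


for score in (0.85, 0.60, 0.20):
    print(score, next_step("tomato soup", score))
```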

Three Dimensions of Speech Problems
Speaker independence: Speaker Dependent, Multiple Speakers, Speaker Independent
Speaking style: Isolated Words, Connected Words, Natural Speech
Vocabulary size: 10 words, 1000 words, 100,000 words, Unlimited
Humanlike

AUDREY: Automatic Digit Recognizer
Bell Labs, 1952
Davis, Biddulph, and Balashek
Analog
Isolated digit recognition
Pause between digits
Speaker-dependent
Speech recognition with vacuum tubes. How very steampunk.
Her name was AUDREY. Let that sink in a minute.

[Figure: hidden Markov model. X: states; y: possible observations; a: state transition probabilities; b: output probabilities. "HiddenMarkovModel" by Tdunning, vectorization: Wikimedia]

1980s: The Power of Statistics
The recognition of connected speech becomes a search for the best path in a large network
  Problem of finding the probabilities
Statistical Language Models
  Not all sequences of words are equally probable
  Rank all permissible sentences in terms of probability
  Correct grammar is not applicable
  Restricted by domain
Hidden Markov Models (HMM)
  Unified probabilistic model for speech
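The "search for the best path" through a hidden Markov model is usually done with the Viterbi algorithm. The sketch below is a toy version under stated assumptions: two made-up states ("vowel", "consonant"), three made-up acoustic observation labels, and invented probabilities. A real recognizer works on acoustic feature vectors with far larger models.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden state sequence."""
    # best[s] holds the best (probability, path) among paths ending in state s.
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        step = {}
        for s in states:
            prob, path = max(
                (best[prev][0] * trans_p[prev][s] * emit_p[s][obs], best[prev][1])
                for prev in states
            )
            step[s] = (prob, path + [s])
        best = step
    return max(best.values())


# Toy model: every number below is invented for illustration.
states = ["vowel", "consonant"]
start_p = {"vowel": 0.6, "consonant": 0.4}
trans_p = {
    "vowel": {"vowel": 0.7, "consonant": 0.3},
    "consonant": {"vowel": 0.4, "consonant": 0.6},
}
emit_p = {
    "vowel": {"low": 0.5, "mid": 0.4, "high": 0.1},
    "consonant": {"low": 0.1, "mid": 0.3, "high": 0.6},
}

prob, path = viterbi(["low", "mid", "high"], states, start_p, trans_p, emit_p)
print(path, prob)
```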

[Diagram labels: Training, Speech Recognition Engine, Acoustic Model, SLM and/or Grammar, Pronunciation Model]

You're Only As Good As What You're Trained On
Corpora
  Collection of speech used to train a recognizer
Acoustic and/or Pronunciation Model
  Associates sounds with symbols and words
  Created from a general speech corpus and a phonetic and orthographic transcription
Statistical Language Model (SLM)
  A probability distribution over sequences of words
  Created from a domain-specific speech corpus and a tagged transcription to extract meaning
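To show what "a probability distribution over sequences of words" can mean in practice, here is a minimal bigram language model sketch. The three-sentence corpus is invented, and there is no smoothing, so any unseen word pair gets probability zero; real SLMs are trained on large domain-specific corpora and handle unseen sequences more gracefully.

```python
from collections import Counter

# Tiny made-up "domain corpus" for illustration only.
corpus = [
    "pay my water bill",
    "pay my electric bill",
    "check my account balance",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    # <s> and </s> mark the start and end of each sentence.
    words = ["<s>"] + sentence.split() + ["</s>"]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))


def bigram_prob(prev, word):
    """Estimate P(word | prev) from raw counts (no smoothing)."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


def sentence_prob(sentence):
    """Multiply bigram probabilities across the whole sentence."""
    words = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(words, words[1:]):
        prob *= bigram_prob(prev, word)
    return prob


print(sentence_prob("pay my water bill"))
print(sentence_prob("bill water my pay"))
```

The in-domain word order scores well above zero while the scrambled version scores zero, which is the ranking behavior the slide describes.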

[Diagram: recognition flow. Labels: Utterance, Noise Levels?, Barge-In?, Feature Extraction, Endpointing, Speech Recognition Engine, Grammar or SLM, Probabilities, n-best list, Literal return, Tokens, Recognition Event]
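The n-best list, literal return, and tokens named in this flow can be pictured as a ranked list of hypotheses that the application filters. The sketch below assumes each hypothesis is a (literal text, grammar token, confidence) tuple; that shape, the token names, and the scores are all hypothetical, since each recognizer defines its own result format.

```python
# Hypothetical n-best list: (literal return, token, confidence).
n_best = [
    ("austin", "CITY_AUSTIN", 0.62),
    ("boston", "CITY_BOSTON", 0.58),
    ("lost in", None, 0.31),
]


def pick(hypotheses, allowed_tokens):
    """Return the most confident hypothesis whose token makes sense right now."""
    ranked = sorted(hypotheses, key=lambda h: h[2], reverse=True)
    for literal, token, confidence in ranked:
        if token in allowed_tokens:
            return literal, token, confidence
    return None


# Suppose the application only serves Boston at this point in the dialog.
choice = pick(n_best, {"CITY_BOSTON"})
print(choice)
```

Using the second-best hypothesis when the top one is impossible in context is one reason applications often ask for more than a single result.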

Early Commercial Adoptions
Interactive Voice Response
  "Those Phone Menus"
  Server-based ASR
  Nuance
  Microsoft
Voice-Enabled Handheld Devices
  Industrial / Productivity applications
  Device-based ASR
  Network not needed
Note: Call center is still an important customer touchpoint!

Today's Speech Agents vs. APIs
Siri / Apple APIs
Cortana / Cortana APIs
Google Now / Google Voice Actions
Amazon Echo (Alexa) / AVS API
Jibo
Ubi / Ubi Kit
Assistant.ai / Api.ai

Speech Agent: The Person who
Distributed speech recognition
  Collection and compression of speech is on the device
  The language models are typically on the network
Phone can be speaker-dependent
  Trains itself on your voice and on the acoustic environments you are in most often
Many companies are providing APIs to use their speech recognition

Alexa Skill vs. Alexa Voice Service

Amazon.com

"Alexa, ask Capital One, what's my current credit card balance?"

Alexa Skill Example

Amazon.com


Amazon.com


CapitalOne.com

NLU

Natural Language Understanding
Parsing input to extract meaning
Covers a large field
  Commands
  Automatic classification of emails
  Newspaper articles, large chunks of text
Bots
  Conversational agents
  Messaging apps
  Personal assistants
Input could be via speech or via text

Levels of Meaning
System: "How can I help you?"
Too broad / ambiguous: "I'm having a problem with my account."
Too much: "Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw"
Just right: "I'm seeing an unusual charge on my bill."

NLU Tasks
http://www.conversational-technologies.com/nldemos/nlDemos.html

Intents and Entities
"I'd like to transfer $50 from my checking account to my savings account."
ACTION = Transfer (Intent)
FROM_ACCOUNT = Checking (Entity)
TO_ACCOUNT = Savings (Entity)
AMOUNT = $50 (Entity)
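The slot names on this slide can be filled with something as simple as a regular expression when the phrasing is predictable. The pattern below is a hand-written assumption that covers only this one phrasing; the NLU services discussed in this talk are generally trained on many example utterances instead.

```python
import re

# Hypothetical pattern for a single way of phrasing a transfer request.
TRANSFER = re.compile(
    r"transfer \$(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<src>\w+) account to my (?P<dst>\w+) account",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the intent and entities for a transfer request, or None if no match."""
    match = TRANSFER.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("src").capitalize(),
        "TO_ACCOUNT": match.group("dst").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }


parsed = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(parsed)
```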

NLU APIs
API.ai
Alexa
Microsoft LUIS
Wit.ai
Google Voice Actions
Etc.

Today's NLU APIs
Microsoft LUIS (part of Project Oxford)

Microsoft.com


Today's NLU APIs
API.ai

The Future Is Here
DNN (Deep Neural Networks)
Being applied to both ASR and NLU problems
Requires large amounts of data to train the models

What's The Glue Here?
Consistency across contexts?
Omnichannel CX
Data is everywhere
State Chart XML?

ASR vs. NLU: Wrap Up
ASR
  Spoken aloud
  Requires some NLU, even if it's hand-crafted (tagging)
  Useful in hands-free, eyes-free contexts
NLU
  Focuses on meaning extraction
  Could be used for chat bots, etc.
  Machine learning to train models

Design Considerations
What are you trying to build?
What's your platform?
Existing guidelines / research
User testing is key
  Especially if you're trying to do something complicated


Should I Speech-Enable My ___?

What's Your ASR/NLU Platform?
Write an app (skill) for an agent such as Cortana / Alexa
Use cloud APIs to add ASR / NLU to your app / device / page / gadget
Download software and use full-featured capabilities for more robust recognition on a specific device
Build your own

Network Availability
Simply irritating or totally unusable?
User: "What's on my calendar today?"
System: "Sorry, I can't complete that request right now."
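One possible way to soften a network outage is to fall back from a cloud recognizer to a smaller on-device one. The sketch below assumes both recognizers are plain callables and that a network failure surfaces as `ConnectionError`; the function names and stand-in recognizers are hypothetical, not any vendor's API.

```python
def recognize_with_fallback(audio, cloud_asr, device_asr):
    """Try the cloud recognizer first; use the on-device one if the network fails."""
    try:
        return cloud_asr(audio), "cloud"
    except ConnectionError:
        return device_asr(audio), "device"


# Hypothetical stand-ins for real recognizers, used only to exercise the fallback.
def offline_cloud_asr(audio):
    raise ConnectionError("network unavailable")


def small_device_asr(audio):
    return "what's on my calendar today"


text, source = recognize_with_fallback(b"fake-audio", offline_cloud_asr, small_device_asr)
print(source, "->", text)
```

Whether a fallback is worthwhile depends on the question the slide asks: if the request itself needs network data, a local recognizer only lets the system fail more gracefully.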

Appropriate Modality?
Voice only? Voice + display?
Is it possible for the user to switch modalities?
Or would switching potentially be dangerous?
User: "How long is the flight from Dallas to Seattle?"
System: "I've got a few results to show you."

Is State Maintained?
Does your platform support a multiple-stage interaction?
Does it remember what you did previously?
User: "Who is Barack Obama?"
System: "Barack Obama is the 44th president of the United States."
User: "How old is he?"
System: "I'm sorry, I don't understand your question."
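The failure in this exchange comes from not carrying context from one turn to the next. Below is a minimal sketch of one way to maintain state: remember the last entity mentioned and substitute it for a pronoun in the follow-up. The `DialogState` class, the pronoun list, and the entity list are assumptions for illustration; real platforms handle reference resolution in more sophisticated ways, if they handle it at all.

```python
PRONOUNS = {"he", "she", "it", "they"}


class DialogState:
    """Remembers the most recently mentioned entity across turns."""

    def __init__(self):
        self.last_entity = None

    def resolve(self, question, known_entities):
        # If the question names an entity, remember it and leave the text alone.
        for entity in known_entities:
            if entity.lower() in question.lower():
                self.last_entity = entity
                return question
        if self.last_entity is None:
            return question
        # Otherwise replace any pronoun with the remembered entity.
        words = question.rstrip("?").split()
        resolved = [self.last_entity if w.lower() in PRONOUNS else w for w in words]
        return " ".join(resolved) + "?"


state = DialogState()
first = state.resolve("Who is Barack Obama?", ["Barack Obama"])
second = state.resolve("How old is he?", ["Barack Obama"])
print(second)
```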

Wake-Up Words
How many of these Agents will we be talking to?
"Jibo, take a picture."
"Alexa, play music."
"OK Google, set the temperature to 77 degrees."

System Personality
Are you writing for an Agent who has an existing style?
What if your skill or app doesn't match that style?
If not, should you create one?
"Hi, I'm Julie!"

Context
Real-world context
Digital context
How much does your app know about where you are and what it can do?
User: "When I get home, remind me to take out the trash."
System: "I'm sorry, your calendar doesn't support location-based reminders."

What Are You Trying To Recognize?
Long utterances work better than short ones
Letter names require extra work
User: "Start a session."
System: "Got it."

And So Much More.
What will you do when the recognizer just can't get it?
User: "I want my... BARK BARK BARK Timmy STOP THAT NOW GET DOWN!"
System: "????"

Existing Guidelines / Research
Caveat: Best practices evolved in one modality (e.g. voice-only) may not apply the same way in another (e.g. combined voice + touch)
  But they could be adapted
Association for Voice Interaction Design (AVIxD.org)
  Wiki
  Peer-Reviewed Journal
  Virtual Brown Bags
Academic Sources, Books

AVIxD.org

CUI Working Group is actively recruiting!

Specific Example: Help
VoiceXML Standard (2004): Help should be a global command
AVIxD Wiki (2014): Stop using Help as a global
Agent API Doc (2015): Offer Help

Specific Example: Help
Designers who tune applications have seen that the word "help" is a known "False Attractor"
  Other things that you say which are short get recognized as "help"
People don't voluntarily come up with "help" unless they are prompted
Give callers a context-specific command only where help may truly be needed, and call it something besides "help"
System: "Say or enter your account number, or say, 'where do I find it.'"

Special Case: Car
"Distracted Driver" is a hot topic!
Richard Young, Wayne State University
Paper: Safe Interaction For Drivers
Visual-Manual Mode: What we do today
Auditory-Vocal Mode: Speech only. NO GUI.
Mixed Mode: Speech and GUI being used together
Finding: If you give someone a graphic interface, they're going to look at it
  And take their eyes off the road

Design Documents

Observations to make:
Represents the entirety of a VUI experience
Placement of Spanish prompt would vary depending on type of call.
Confirmation is variable
Confirmation prompt is general

Usability Studies / Research
Special Challenges
Technical setup
Phone tap / Recording both sides


Warner Bros.

Early Stage Voice Only Prototype


Should I Speech-Enable My ___?

What's the Use Case?
Enabling application
  User can't do it any other way
  New tasks
Enhancing application
  User can do it now
  But speech makes it better
  Faster
  Safer


[Diagram labels: API-Based, Device-Based, Roll Your Own / Open-Source; Flexibility, Power, Customization; Time, Difficulty]

What do you need it for?
What kind of device will you be running it on?
Connectivity? Can you use cloud-based ASR?
How much control do you need over the application / user interface?

Cloud vs. Downloadable / Embedded
Cloud:
  Easy to get started
  Lightweight
  Not much specialized knowledge
Downloadable / Embedded:
  Customizable
  Probably better recognition
  Can be device-specific
  More features
  Higher powered
  May require specialized knowledge
  Speech scientist

Open Source ASR
CMU Sphinx
  pocketsphinx
Kaldi
  http://kaldi-asr.org/
  GitHub
  New updates include some pretty interesting stuff (DNN)
Requires:
  Corpus
  Tech know-how


Should I Speech-Enable My ___?

Should I Speech-Enable My ___?
Maybe


Iron Man 2: Marvel Studios, Paramount Pictures
Where's Jarvis?

Jarvis:
Audio and gestural
Perfect recognition. No error recovery needed
Great voice quality
Connected to vast amounts of data
Understands all the parts of the model: "Lose the landscape."
Context-sensitive. Aware of the space around him
Sense of humor. "Am I to include the Belgian Waffle stands?"
Takes initiative. "What is it you're trying to achieve, sir?"

Where's Jarvis?
Gesture Based Interface
Artificial Intelligence
Voice Based Interface

Where's Jarvis?
ASR
NLU
Voice Design
Context

Resources
Handout / Web page
