
Voice Recognition and Natural Language: Dallas Tech Fest 2016

Voice Recognition and Natural Language
Dallas TechFest, January 29, 2016
Crispin Reedy @crispinTX #DallasTechFest16


2016 Versay Solutions LLC

Voice User Interface Designer
- 10 years in the field
- Former coder; got interested in UX
- President of the Association for Voice Interaction Design
- Consultant for Versay [email protected]

DO NOT FORGET TO BRING THE MINI-SPEAKERS!!!

Siri, Alexa, Cortana: Voice recognition is hot. This session will give an overview of the voice recognition and natural language ecosystem and the technologies behind the experience. For example: What is the difference between voice recognition and natural language understanding? What are some of the common technologies in the market today? What are design considerations around these types of interfaces? This is an introductory session designed for people interested in exploring the possibilities of voice and conversational user interfaces, especially when considering the Internet of Things.

Disclaimers
This Session Is About:
- What is speech recognition anyway?
- Should I speech-enable X? How?
- In general, how does it work?
- What technologies should I consider?
- What skills are important?
- What are the design considerations?
It's NOT About:
- Detailed code
- In-depth how-tos
- Deep technical knowledge
- Advanced ASR


Should I Speech-Enable X?

What IS X?

- Computers: apps and webpages
- Consoles: gaming / connectivity
- Mobile and tablet
- Industrial devices, especially something task-driven
- Gadgets
- Cars
- The phone

How does this new modality enable or enhance what I want to do on this platform?


Essentially, what we're coming to terms with here is a new input modality. It's one that doesn't always work very well, for reasons we'll get into later. But it can be a very powerful one when it does work well. It's also a lot harder to figure out how to properly combine speech with everything else that is going on in your environment.

Terms & Technologies
- Speech Recognition
- Natural Language Understanding
- Text to Speech
- Voice Verification (Biometrics)

Speech Recognition
- Also known as ASR
- Speech to Text
- Spoken language ("See the cat.") to machine-readable format

Natural Language Understanding
- Extracting meaning from natural text
- Not necessarily tied to speech recognition

"Hello, yes, I'd like to pay my water bill. Can you help me with that?"
Action = BillPay
BillType = Water
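
As a rough illustration of the bill-pay example, here is a minimal Python sketch that pulls Action and BillType out of text by keyword matching. The keyword tables and the function name extract_meaning are hypothetical and were not part of the talk; a real NLU engine would typically rely on a trained model rather than a lookup like this.

```python
import re

# Hypothetical keyword tables, invented for illustration only.
ACTION_KEYWORDS = {"BillPay": ["pay", "payment"]}
BILL_TYPE_KEYWORDS = {"Water": ["water"], "Electric": ["electric", "power"]}


def extract_meaning(text):
    """Return the slots found in the text using simple keyword matching."""
    words = re.findall(r"[a-z']+", text.lower())
    meaning = {}
    for action, keywords in ACTION_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            meaning["Action"] = action
    for bill_type, keywords in BILL_TYPE_KEYWORDS.items():
        if any(keyword in words for keyword in keywords):
            meaning["BillType"] = bill_type
    return meaning


utterance = "Hello, yes, I'd like to pay my water bill. Can you help me with that?"
print(extract_meaning(utterance))
```

Note that the input here is plain text, which is the point of the slide: the same extraction step would work on typed input as well as on ASR output.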

Text to Speech
- Speech Synthesis
- Used to convert text to spoken words

Voice Verification
- Also called voiceprints, biometrics, voice authentication, etc.
- Recognizes a person, not necessarily what they are saying.
- You can have ASR without Voice Verification, and vice versa.

"My voice is my password."
"Authenticated. Welcome, Mr. Smith."

Not going to discuss this one in a lot of detail today, but it's important that you understand the difference between these technologies.

Uses: Separate Applications
Speech Recognition
- Hands-free command / control
- Dictation
- Input text
- Small form factor device, etc.
Text To Speech
- Output text dynamically
- Respond to input
- Useful when no display is available
Natural Language Understanding
- Necessary at some level for all language-based input
- Also used to parse large volumes of text
Voice Verification
- Security

Uses: Combined
[Diagram: ASR, NLU, TTS, and voiceprint verification connected to an application and its data. Labels: Sign-In, Interaction, Request, Action, Meaning, Access Data, Output.]

True Multimodality
[Diagram: the same components as "Uses: Combined" (ASR, NLU, TTS, voiceprint verification, application, data), plus Touch, Keyboard, and Visual. Additional labels: Manage I/O Modality, Determine Meaning in Context, Context!]
Credit: Jon Bloom

Let's Talk Speech!

Output: Text to Speech
- (Somewhat) mature technology
- (Fairly) easy to understand and use
- Note: Creating TTS audio is not the same as having a TTS engine

How it Works
- Human voice talent
- Hundreds of hours of recording
- Digitized
- Phonemes: concatenated speech synthesis

TTS Engine
- Text in, speech out
- May do some text pre-processing
  - "St. James St." becomes "Saint James Street"
  - Punctuation
  - If it doesn't do this, you'll have to do it yourself.
- Grapheme to phoneme transcription
- Identify intonation patterns
- Assign the correct lexical stress to the words
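
If the engine does not do this pre-processing, a minimal Python sketch of the "St." case might look like the following. The position-based rule and the function name expand_st are assumptions made for illustration; real text normalization needs many more rules, and this heuristic will get some sentences wrong (for example when "St." ends a sentence and the next one starts with a capital letter).

```python
import re


def expand_st(text):
    """Expand the abbreviation 'St.' as 'Saint' or 'Street' based on position."""
    # Assumption: 'St.' followed by a capitalized word means 'Saint'.
    text = re.sub(r"\bSt\.(?=\s+[A-Z])", "Saint", text)
    # Assumption: any remaining 'St.' means 'Street'.
    text = re.sub(r"\bSt\.", "Street", text)
    return text


print(expand_st("St. James St."))
```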

What Makes Good TTS?
- Phonemes change based on location
  - Cat
  - Alligator
- Elision
  - "I'm. Awaiting. You."
  - "I'm awaiting you."
- Intonation
  - "Do you want coffee?"
  - "Do you want soda, tea, or coffee?"

SSML
- XML-based W3C standard for Speech Synthesis Markup
- Not universally supported by vendors
- Tags for marking up text to produce a more natural quality output:
  - Emphasis
  - Break
  - Voice
  - Prosody
  - Pitch

SSML Example
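
The slide's original example did not survive as text. As a stand-in, this Python sketch builds a small SSML fragment using the emphasis, break, and prosody elements. The sentence content is invented, and the namespace and language attributes that a strict SSML document carries on the speak element are omitted for brevity; since support varies by vendor, treat the output as something to check against your engine's documentation.

```python
import xml.etree.ElementTree as ET

# Root element of an SSML document.
speak = ET.Element("speak", {"version": "1.0"})
speak.text = "Your order is "

# Stress a single word.
emphasis = ET.SubElement(speak, "emphasis", {"level": "strong"})
emphasis.text = "ready"
emphasis.tail = "."

# Insert a half-second pause.
ET.SubElement(speak, "break", {"time": "500ms"})

# Change speaking rate and pitch for the closing sentence.
prosody = ET.SubElement(speak, "prosody", {"rate": "slow", "pitch": "high"})
prosody.text = "Thank you for waiting."

ssml = ET.tostring(speak, encoding="unicode")
print(ssml)
```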

When To Use It
- When high quality audio is not a consideration
  - TTS has improved considerably, but is still noticeable
- When you have a lot of dynamic data
  - If you just need to say a few things, it may be overkill

Other Considerations
- More phonemes = higher quality voice
  - Also means a bigger download and install (if on device)
- Exceptions (addresses, names) can be iffy
  - May require a lot of work to handle well
- Your data needs to be clean and ready to voice back
  - Acronyms, incomplete sentences will not sound good
- Some applications may have other acoustic limitations
  - Telephony
- It is possible to build a custom voice
  - But it takes a lot of work!

Where To Find It
- Many commercial products available
  - Most languages and dialects, e.g. American English, British English, etc.
  - Many different voices
  - Nuance, Cepstral, Inova
- Some open source
- Some APIs
  - Chrome: https://developer.chrome.com/apps/tts

ASR and NLU

ASR and NLU: Topics
- Complications of speech
  - Why is it so hard?
- How it works: overview
- Early commercial adoptions
  - IVR
- Design considerations
- Speech today
  - Different vendors
- Should I voice-enable X?


(The Speech Chain, Bell Labs, 1963)

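[Diagram, from The Voice in the Machine (Pieraccini). Linguistics levels: World Knowledge, Semantics, Syntax, Lexicon, Morphology, Phonetics, Acoustics. Physiology: Speaking / Listening. Units: Concepts, Phrases, Words, Phonemes, Sounds. The diagram also marks the portions covered by ASR and NLU.]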

- World Knowledge: Concepts of the world around us, e.g. tables have four legs, what is left and right, what is a car, etc. This is the level before language.
- Semantics: The first level of language. Knowledge can be represented in structured meaningful elements. Example: the semantics of a party invitation.
- Syntax: The rules that govern putting words together to form meaningful units.
- Lexicon: What words mean.
- Morphology: How words change their form to perform differently in a language, e.g. horse / horses.
- Phonetics: Phonemes and how words are built.
- Acoustics: What phonemes sound like and how to create them.

Speech Is Ambiguous
- Speech is never stationary
  - Coarticulation
  - Noisy environments
  - Accents
  - Different speakers have voices with different acoustic qualities
  - Goats
- Challenges vary depending on what you are going to recognize
  - Spelling (short utterances) can be difficult even for humans
  - Phonetic alphabet (Military)

Language Is Ambiguous
- Humans can deduce meaning from context and unknown words

"How can I help you?"
"I'm having a problem with my account."

"I'd like that one. No, not the green one, the red one."

"Time flies like an arrow."
"Fruit flies like a banana."


Everything Is Ambiguous
- All modern speech recognition is probabilistic
- GUI: Button clicked? true / false
- VUI: There is an 85% chance that button was clicked
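
One common way to act on a probabilistic result is to branch on the recognizer's confidence score. The Python sketch below assumes two thresholds; the numbers and the function name next_step are hypothetical, and real thresholds are tuned per application.

```python
# Hypothetical thresholds, for illustration only.
ACCEPT_THRESHOLD = 0.80
CONFIRM_THRESHOLD = 0.45


def next_step(confidence):
    """Decide what the dialog does with a recognition result."""
    if confidence >= ACCEPT_THRESHOLD:
        return "accept"    # act on the result directly
    if confidence >= CONFIRM_THRESHOLD:
        return "confirm"   # ask the user to confirm what was heard
    return "reprompt"      # too uncertain; ask again


for score in (0.85, 0.60, 0.20):
    print(score, next_step(score))
```

A GUI never needs the middle branch; a VUI almost always does.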

Three Dimensions of Speech Problems
(The Voice in the Machine: Pieraccini)
- Speaker Independence: Speaker Dependent, Multiple Speakers, Speaker Independent
- Speaking Style: Isolated Words, Connected Words, Natural Speech
- Vocabulary Size: 10 words, 1000 words, 100,000 words, Unlimited
- Humanlike

History of Speech Recognition
AUDREY: Davis, Biddulph, and Balashek - Bell Labs, 1952
- Analog
- Isolated digit recognition
- Pause between digits
- Speaker-dependent

Sampling
- The start of being able to digitally manipulate audio


Spectrogram vs. Waveform
[Figure: axis labels "0 dB" and "frequency".]
- Waveforms show the variation in overall intensity (decibels) over time.
- Spectrograms show the variation of individual frequency components.
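
To make the waveform / spectrogram distinction concrete, this Python sketch samples a pure tone (the waveform) and then computes the frequency content of one frame with a naive discrete Fourier transform. The sample rate, frame size, and test tone are assumed values chosen so the arithmetic comes out cleanly; a spectrogram is essentially this computation repeated over successive frames.

```python
import cmath
import math

SAMPLE_RATE = 8000  # samples per second (assumed)
FRAME_SIZE = 400    # samples in one analysis frame (50 ms)
TONE_HZ = 440       # a pure test tone standing in for speech

# Waveform: one amplitude value per sample, i.e. intensity over time.
frame = [
    math.sin(2 * math.pi * TONE_HZ * n / SAMPLE_RATE) for n in range(FRAME_SIZE)
]


def magnitude_spectrum(samples):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(samples)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * i / n)
                for i, s in enumerate(samples)))
        for k in range(n // 2)
    ]


spectrum = magnitude_spectrum(frame)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME_SIZE
print(peak_hz)
```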

1970s: Template Matching
- Template matching approach
  - Brute force model
  - Quantized spectrograms
- What about duration?
  - Dynamic time warping
- Endpoint detection
  - Difficult to do
- Feature extraction
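
Dynamic time warping answers the duration question by letting one sequence stretch or compress to align with another. The Python sketch below is the textbook algorithm applied to made-up one-dimensional "feature" values; a real recognizer of that era would compare vectors of spectral features rather than single numbers.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of numbers."""
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # a advances, b repeats
                cost[i][j - 1],      # b advances, a repeats
                cost[i - 1][j - 1],  # both advance
            )
    return cost[len(a)][len(b)]


template = [1, 3, 4, 3, 1]                  # stored template (invented values)
slow = [1, 1, 3, 3, 4, 4, 3, 3, 1, 1]       # same shape, spoken at half speed
other = [4, 4, 1, 1, 4, 4]                  # a different shape

print(dtw_distance(template, slow))
print(dtw_distance(template, other))
```

The slow version matches the template with zero cost even though it is twice as long, which is exactly what plain sample-by-sample comparison cannot do.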

1980s: The Power of Statistics
- The recognition of connected speech becomes a search for the best path in a large network
  - Problem of finding the probabilities
- Statistical Language Models
  - Not all sequences of words are equally probable
  - Rank all permissible sentences in terms of probability
  - Correct grammar is not applicable
  - Restricted by domain
- Hidden Markov Models (HMM)
  - Unified probabilistic model for speech
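
A bigram model is one of the simplest statistical language models and shows how word sequences get ranked by probability. The Python sketch below trains on a three-sentence corpus invented for illustration; it applies no smoothing, so any unseen word pair gets probability zero, which a production model would avoid.

```python
from collections import Counter

# Tiny invented, domain-restricted corpus.
corpus = [
    "pay my water bill",
    "pay my electric bill",
    "check my balance",
]

context_counts = Counter()  # how often each word appears as the preceding word
bigram_counts = Counter()   # how often each (previous, word) pair appears
for sentence in corpus:
    words = ["<s>"] + sentence.split()  # <s> marks the sentence start
    for prev, word in zip(words, words[1:]):
        context_counts[prev] += 1
        bigram_counts[(prev, word)] += 1


def sentence_probability(sentence):
    """Probability of a sentence under the unsmoothed bigram model."""
    words = ["<s>"] + sentence.split()
    probability = 1.0
    for prev, word in zip(words, words[1:]):
        if context_counts[prev] == 0:
            return 0.0
        probability *= bigram_counts[(prev, word)] / context_counts[prev]
    return probability


print(sentence_probability("pay my water bill"))
print(sentence_probability("bill my pay water"))
```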

Hidden Markov Model Example
Image: "HiddenMarkovModel" by Tdunning, vectorization (Wikimedia)
- x: states
- y: possible observations
- a: state transition probabilities
- b: output probabilities
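
Using the same letters as the figure (states, observations, a, b), the Python sketch below runs the Viterbi algorithm to find the most likely hidden state sequence for a series of observations. The two states, three observation symbols, and every probability are invented for illustration and do not come from the talk or from any real acoustic model.

```python
def viterbi(observations, states, start_p, a, b):
    """Return (probability, path) of the most likely hidden state sequence."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start_p[s] * b[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            prob, path = max(
                ((best[prev][0] * a[prev][s], best[prev][1]) for prev in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * b[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda item: item[0])


# All values below are hypothetical.
states = ["consonant", "vowel"]
start_p = {"consonant": 0.6, "vowel": 0.4}
a = {  # state transition probabilities
    "consonant": {"consonant": 0.7, "vowel": 0.3},
    "vowel": {"consonant": 0.4, "vowel": 0.6},
}
b = {  # output probabilities
    "consonant": {"loud": 0.1, "medium": 0.4, "quiet": 0.5},
    "vowel": {"loud": 0.6, "medium": 0.3, "quiet": 0.1},
}

probability, path = viterbi(["loud", "medium", "quiet"], states, start_p, a, b)
print(path, probability)
```

This is the "search for the best path" idea in miniature: the states are never observed directly, only inferred from what they emit.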

You're Only As Good As What You're Trained On
- Corpora
  - Collection of speech used to train a recognizer
- Acoustic and/or Pronunciation Model
  - Associates sounds with symbols and words
  - Created from a general speech corpus and a phonetic and orthographic transcription
- Statistical Language Model (SLM)
  - A probability distribution over sequences of words
  - Created from a domain-specific speech corpus and a tagged transcription to extract meaning

Training
[Diagram: Speech Recognition Engine, with Acoustic Model, Pronunciation Model, and SLM and/or Grammar.]

Language Model vs. Grammar
SLM
- Has to be trained against collected utterances
- Large potential set of what the caller can say
- Tagged with the meanings of what they can say
Grammar (GrXML)
- More tightly constrained than an SLM
- Easier to create
- Not trained in the same way
- System will only recognize what is in the grammar
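
The last point, that a grammar only recognizes what is in it, can be illustrated with a lookup table. This Python sketch is a deliberate simplification: a real GrXML grammar constrains the recognizer itself rather than filtering text afterwards, and the phrases, token names, and function name match_grammar are all hypothetical.

```python
# Hypothetical grammar: each allowed phrase maps to a token.
GRAMMAR = {
    "checking": "CHECKING",
    "checking account": "CHECKING",
    "savings": "SAVINGS",
    "savings account": "SAVINGS",
    "money market": "MONEY_MARKET",
}


def match_grammar(transcript):
    """Return the token for an in-grammar phrase, or None for a no-match."""
    return GRAMMAR.get(transcript.lower().strip())


print(match_grammar("Savings account"))
print(match_grammar("the one with my paycheck"))
```

An SLM trades this predictability for coverage: it can handle the second utterance, but only after being trained and tagged on collected examples.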



[Diagram: Utterance (Noise levels? Barge-in?), Feature Extraction, Endpointing, Speech Recognition Engine, Grammar or SLM, Probabilities, Recognition Event (N-best list, literal return, tokens).]

Natural Language Understanding
- Parsing input to extract meaning
- Covers a large field
  - Commands
  - Automatic classification of emails
  - Newspaper articles, large chunks of text
- Lexicon
- Parser
- Grammar rules
- New tools / APIs

Levels of Meaning
Prompt: "How can I help you?"
- Too Broad / Ambiguous: "I'm having a problem with my account."
- Too Much: "Well, I was looking at my bill, because I do that every week, and I was reviewing everything on there, and I saw"
- Just Right: "I'm seeing an unusual charge on my bill."

Multi-Token Utterances
"I'd like to transfer $50 from my checking account to my savings account."
ACTION = Transfer
FROM_ACCOUNT = Checking
TO_ACCOUNT = Savings
AMOUNT = $50
Unfortunately, people don't often naturally produce these kinds of utterances.
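
For a sense of what extracting several tokens from one utterance involves, here is a Python sketch that uses a single regular expression. The pattern and the function name parse_transfer are assumptions made for illustration; the pattern only handles this exact phrasing, which is one reason real systems use grammars or trained models instead.

```python
import re

# Hypothetical pattern covering only the phrasing used on the slide.
TRANSFER_PATTERN = re.compile(
    r"transfer \$(?P<amount>\d+(?:\.\d{2})?) "
    r"from my (?P<from_account>\w+) account "
    r"to my (?P<to_account>\w+) account",
    re.IGNORECASE,
)


def parse_transfer(utterance):
    """Return the transfer tokens found in the utterance, or None."""
    match = TRANSFER_PATTERN.search(utterance)
    if match is None:
        return None
    return {
        "ACTION": "Transfer",
        "FROM_ACCOUNT": match.group("from_account").capitalize(),
        "TO_ACCOUNT": match.group("to_account").capitalize(),
        "AMOUNT": "$" + match.group("amount"),
    }


tokens = parse_transfer(
    "I'd like to transfer $50 from my checking account to my savings account."
)
print(tokens)
```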

Early Commercial Adoption
- IVR
  - Touchtone / DTMF: "For checking, press 1. For savings, press 2."
  - Directed Dialog (Grammar-based ASR): "Which account? Just say checking, savings, or money market."
  - Natural Language (SLM-based ASR): "From which account?"
- SpeechWorks / Nuance technology
- Voice XML / GrXML



Typical IVR Architecture
[Diagram: PSTN / VOIP, Voice Browser (VUI, VXML), HTTP to the App Server / Data Connection and Data, SIP and MRCP to the ASR Server and TTS Server.]

Anatomy of a VUI + NLU project
Voice User Interface Design
- High level design
  - Design style, sound and feel, IA
- Detailed design
  - Prompts (recorded)
  - Grammars for directed dialog states
  - Data I/O
SLM Creation
- Utterance capture
- Transcription
- Tagging
- Compiling and deployment


Observations to make:
- Represents the entirety of a VUI experience
- Placement of Spanish prompt would vary depending on type of call
- Confirmation is variable
- Confirmation prompt is general

VUI Design Doc Detailed Example

Corpora Documentation Example

Design Considerations
- Types of Speech User Interfaces
  - Command and Control
  - Dictation
  - Dialog-based
- Speech is a linear, time-based interface
- Multimodality introduces additional complications

Design Considerations
- If the recognizer doesn't get something, you have to reprompt. Don't say sorry.

"Where are you traveling today?"
"I'm going to."
"What city was that?"


Design Considerations
- Speech is interruptible

"Main Menu: Choose from: Beverages, Sandwiches, Sides, Salads, or Alcoholic Drinks."


Design Considerations
- Prompts imply more than choices

"Would you like chocolate or vanilla?"
"Yes"
"Both"


Design Considerations
- Input must be limited *after* it is provided
  - Can't check the box on the client side to only allow input of valid amounts

"Sorry, you're only allowed to transfer up to $500."
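
Because a caller can say any amount, the limit has to be enforced after recognition. A minimal Python sketch, assuming the $500 limit from the prompt above; the function name check_transfer_amount and the None-means-valid convention are hypothetical.

```python
TRANSFER_LIMIT = 500  # dollars, taken from the example prompt


def check_transfer_amount(amount):
    """Return a prompt to play if the amount is invalid, or None if it is fine."""
    if amount > TRANSFER_LIMIT:
        return "Sorry, you're only allowed to transfer up to $%d." % TRANSFER_LIMIT
    return None


print(check_transfer_amount(750))
print(check_transfer_amount(50))
```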


Design Considerations
- Avoid using the word "Help" as a global command.
- Instead, if there is a need to give additional information, supply it in the first or second reprompts.
- Or use specific keywords other than "help":
  - "You can also say 'instructions.'"
  - "Or, say 'It's something else.'"


User Centered Design Techniques
- A set of techniques designed to keep the focus on the user during the design process
- May include but are not limited to:
  - Conversations: specific to VUI design
  - Read Aloud: specific to VUI design
  - Card Sorts: used to construct an IA
  - Personas: used in all modalities
  - Usability Testing: used in all modalities
  - A/B Testing: useful for applications that are already in production

Usability Testing

Should I Speech-Enable X?


What's the Use Case For Speech?
- Enabling application
  - User can't do it any other way
  - New tasks
- Enhancing application
  - User can do it now
  - But speech makes it better
    - Faster
    - Safer
Credit: Bruce Ballentine, EIG

How Hard Is It To Do?
- What do you need it for?
- What kind of device will you be running it on?
- Connectivity? Can you use cloud-based ASR?
- Do you have to download it? If so, how much space do you have?
- How much control do you need over the application / user interface?

Possibilities
- Write an app (skill) for an agent such as Cortana / Alexa
- Use cloud APIs to add ASR to your app / device / page / gadget
- Download an ASR and use full-featured capabilities for more robust recognition
- Build your own

Distributed: Today's Speech Agents
- Siri
- Cortana
- Google Now
- Amazon Echo (Alexa)

Today's Cloud-Based Speech APIs
- Distributed speech recognition
  - Collection and compression of speech is on the device
  - The language models are typically on the network
- Phone can be speaker-dependent
  - Trains itself on your voice and on the acoustic environments you are in most often
- Many companies are providing APIs to use their speech recognition


AVS vs. Amazon Echo
- Could use AVS with the Amazon Echo, or with your own device

Speech API Example: Alexa Voice Services

Alexa Skill Example


Alexa Skills
- "Alexa, ask Yelp to find me a restaurant."
- Cortana has similar integration
- Register your skill with Amazon and publish it

Cloud vs. Downloadable / Embedded
Cloud
- Microsoft
  - Cortana integration
  - Project Oxford API
- Google API
- Amazon
- Several new recent startups
  - Api.ai, Capio.ai, Speechmatics, iSpeech
Downloadable / Embedded
- Microsoft
  - Windows 10 Speech APIs
  - Microsoft Speech Server
- Nuance
  - The 800 pound gorilla in the room
- Interactions
- IBM Watson

Cloud vs. Downloadable / Embedded
Cloud
- Easy to get started
- Lightweight
- Not much specialized knowledge
Downloadable / Embedded
- Customizable
- Probably better recognition
- Can be device-specific
- More features
- Higher powered
- Will require specialized knowledge
  - Speech scientist

Today's NLU APIs
- Microsoft LUIS (part of Project Oxford)
- Api.ai


Open Source ASR
- CMU Sphinx
  - pocketsphinx
- Kaldi
  - http://kaldi-asr.org/
  - Github
  - New updates include some pretty interesting stuff (DNN)
- Requires:
  - Corpus
  - Tech know-how


Who May You Need On Your Team
- Speech Scientist
- VUI Designer

Should I Speech-Enable X?
- Desktop App / Website
  - Easy to get started with API-based ASR
  - But the use case may not be as powerful
- Tablet / Mobile
  - Stronger use case
  - But will the network be available for APIs?
- Industrial Device
  - Great use case, esp. with multimodal
  - But this is harder to do and probably will be custom
- Gadget
  - Decent use case
  - APIs are tailored for this
  - Will they do everything you need?
  - Will the extra modality be a plus or just a silly add-on?
- Car
  - Safety considerations are high here
  - Need better user interfaces & more robust
- IVR
  - Touchtone can still be good for a lot of applications
  - Speech is good for complex call routing and input

Resources
- The Voice in the Machine: Building Computers that Understand Speech, Roberto Pieraccini
- YouTube video: Open the Pod Bay Doors, Siri
- Best Practices in VUI Design: AVIxD Wiki
  - http://videsign.wikispaces.com/
- AVIxD: Quarterly Brown Bags


Thanks!

@[email protected]

