Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)


TRANSCRIPT

  • Slide 1
  • Slide 2
  • Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)
  • Slide 3
  • Slide 4
  • Slide 5
  • Degree of difficulty: some queries are harder than others. More bits? (Long tail: long/infrequent.) Or fewer bits? (Big fat head: short/frequent.) Query refinement / coping mechanisms: if a query doesn't work, should you make it longer or shorter? From solitaire to a multi-player game: readers, writers, market makers, etc. Good guys & bad guys: advertisers & spammers.
  • Slide 6
  • Spoken Web Search is Easy Compared to Dictation
  • Slide 7
  • Entropy of Search Logs: How big is the Web? How hard is search? With personalization? With backoff? Qiaozhu Mei, Kenneth Church (University of Illinois at Urbana-Champaign, Microsoft Research)
  • Slide 8
  • How Big is the Web? 5B pages? 20B? More? Less? What if a small cache of millions of pages could capture much of the value of billions? Could a big bet on a cluster in the clouds turn into a big liability? Examples of big bets: computer centers & clusters, capital (hardware), expense (power), dev (MapReduce, GFS, BigTable, etc.). Sales & marketing >> production & distribution. Goal: maximize sales (not inventory).
  • Slide 9
  • Millions (Not Billions)
  • Slide 10
  • Population Bound: with all the talk about the Long Tail, you'd think that the Web was astronomical (Carl Sagan: billions and billions; lower distribution $$ → sell less of more). But there are limits to this process. Netflix: 55k movies (not even millions). Amazon: 8M products. Vanity searches (personal home pages): infinite???
  • Conclusions: Millions (not billions). How big is the Web? Upper bound: O(population), not billions, not infinite. Shannon >> Chomsky. How hard is search? Query suggestions? Personalization? Cluster in the cloud ($$$$) vs. walk in the park ($). Goal: maximize sales (not inventory). Entropy is a great hammer (see the sketch below).
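A minimal sketch of the "entropy is a great hammer" idea, assuming the query log is just a list of query strings: the plug-in entropy in bits per query, where 2^H gives a rough effective size of the query space. The function name and toy data are illustrative, not from the talk.

```python
from collections import Counter
from math import log2

def query_log_entropy(queries):
    """Plug-in (maximum-likelihood) entropy of a query log, in bits per query.
    2**H is a rough effective size of the query space: millions, not billions."""
    counts = Counter(queries)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy usage: a big fat head ("weather") plus a long tail of rare queries.
log = ["weather"] * 6 + ["facebook"] * 3 + ["wingardium leviosa"]
print(f"{query_log_entropy(log):.2f} bits/query")
```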
  • Slide 21
  • Search companies have massive resources (logs). Unfair advantage: logs are the crown jewels. Search companies care so much about data collection that the business case for the Toolbar is all about data collection. The $64B question: what (if anything) can we do without resources?
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)
  • Slide 26
  • Definitions. Towards: not there yet. Zero resources: nothing at all (no knowledge of the language/domain; no training data, no dictionaries, no models, no linguistics). The next crisis will be where we are least prepared. Motivate a new task: Spoken Term Discovery. Spoken Term Detection (word spotting) is the standard task: find instances of a spoken phrase in a spoken document (input: spoken phrase + spoken document). Spoken Term Discovery is the non-standard task: input is only the spoken document (no spoken phrase); output is spoken phrases (interesting intervals in the document). The contrast is sketched below.
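The contrast between the two tasks can be written down as function signatures; this is only an illustrative sketch (the names, the `Audio` type, and the interval representation are assumptions, not from the talk):

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) within a spoken document

def spoken_term_detection(spoken_phrase: "Audio", spoken_document: "Audio") -> List[Interval]:
    """Standard task (word spotting): given an example of the phrase,
    return the intervals of the document where that phrase occurs."""
    ...

def spoken_term_discovery(spoken_document: "Audio") -> List[Interval]:
    """Non-standard task: no query is given; return the 'interesting'
    intervals (candidate terms) discovered in the document itself."""
    ...
```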
  • Slide 27
  • The next crisis will be where we are least prepared.
  • Slide 28
  • Slide 29
  • RECIPE FOR REVOLUTION: 2 cups of a long-standing leader; 1 cup of an aggravated population; 1 cup of police brutality; a teaspoon of video. Bake in YouTube for 2-3 months. Flambé with Facebook and sprinkle on some Twitter.
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Longer is Easier (figure: the vowel /iy/ vs. the word "encyclopedias", male and female speakers)
  • Slide 34
  • What makes an interval of speech interesting? Cues from text processing: long (~1 sec, such as "University of Pennsylvania"), repeated, and bursty (tf * IDF: lots of repetitions within a document, relatively few repetitions across documents); a sketch of such a score follows. Cues unique to speech processing: given-new (the first mention is articulated more carefully than subsequent ones), and dialog between two parties as a multi-player game (A utters an important phrase; B: "what?"; A repeats the important phrase).
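A hedged sketch of combining the text-processing cues (length, repetition, tf*IDF burstiness) into a score for a discovered interval. The function, its weights, and the ~1 second length-bonus cap are assumptions for illustration, not the talk's actual formula.

```python
import math
from collections import Counter

def interestingness(term, doc_terms, doc_freq, n_docs, duration_sec):
    """Score a discovered interval: longer, repeated, and bursty (tf * IDF) is better.

    term         -- pseudo-term id of the interval
    doc_terms    -- all pseudo-terms discovered in this document
    doc_freq     -- dict: pseudo-term -> number of documents containing it
    n_docs       -- total number of documents in the collection
    duration_sec -- length of the interval in seconds
    """
    tf = Counter(doc_terms)[term]                                # repetitions within the document
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))   # rare across documents
    length_bonus = min(duration_sec, 1.0)                        # ~1 sec ("University of Pennsylvania")
    return tf * idf * length_bonus
```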
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • n² time & space (update: ASRU-2011), but the constants are attractive. Sparsity: the algorithms were redesigned to take advantage of sparsity (median filtering, Hough transform, line segment search), with approximate nearest neighbors to follow (ASRU-2012). A dotplot sketch is given below.
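The matching step behind these numbers is an acoustic dotplot: a frame-by-frame similarity matrix in which a repeated phrase shows up as a near-diagonal run of high-similarity cells. A minimal dense sketch, assuming `frames_a`/`frames_b` are MFCC-like feature matrices (rows = frames); the real system exploits sparsity, median filtering, and a Hough/line-segment search rather than this naive n² scan:

```python
import numpy as np

def dotplot(frames_a, frames_b, threshold=0.9):
    """Cosine-similarity dotplot between two sequences of feature frames.
    Repeated phrases appear as diagonal runs of above-threshold cells."""
    a = frames_a / np.linalg.norm(frames_a, axis=1, keepdims=True)
    b = frames_b / np.linalg.norm(frames_b, axis=1, keepdims=True)
    sim = a @ b.T                  # dense: n^2 time and space
    return sim > threshold         # thresholding makes the matrix sparse

def diagonal_runs(mask, min_frames=50):
    """Crude line-segment search: keep diagonals with at least min_frames hits
    (>= 0.5 s of repeated signal at 100 frames per second)."""
    n, m = mask.shape
    return [d for d in range(-n + 1, m) if np.diag(mask, d).sum() >= min_frames]
```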
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • NLP on Spoken Documents Without Automatic Speech Recognition. Mark Dredze, Aren Jansen, Glen Coppersmith, Ken Church (EMNLP-2010)
  • Slide 51
  • WE DON'T NEED SPEECH RECOGNITION TO PROCESS SPEECH (at least not for all NLP tasks)
  • Slide 52
  • Speech Processing Chain (diagram): speech collection → speech recognition (manual/full transcripts) → text processing (information retrieval, corpus organization, information extraction, sentiment analysis). This talk: a bag-of-words representation is good enough for many tasks.
  • Slide 53
  • NLP on Speech. Full speech recognition to obtain transcripts requires an acoustic-phonetic model, a sequence model and decoder, and a language model. NLP requirements: a bag of words is sufficient for many NLP tasks (document classification, clustering, etc.). Do we really need the whole speech recognition pipeline?
  • Slide 54
  • Our Goal (diagram): replace the speech recognition + full transcripts path with two steps: find long segments (>0.5 s) of repeated acoustic signal, then extract features directly from the repeated segments (Jansen et al., Interspeech 2010).
  • Slide 55
  • Creating Pseudo-Terms (figure: matched segment pairs grouped into pseudo-terms P1, P2, P3)
  • Slide 56
  • Example Pseudo-Terms (table): discovered pseudo-term IDs (term_5, term_6, term_63, term_113 through term_122) alongside the phrases they correspond to, e.g. our_life_insurance, term, life_insurance, how_much_we, long_term, budget_for, budget, end_of_the_month, stay_within_a_certain, you_know, have_to, certain_budget.
  • Slide 57
  • Graph-Based Clustering. Nodes: each matched audio segment. Edges: an edge between two segments if their fractional overlap exceeds a threshold. Extract the connected components of the graph. This work: one pseudo-term for each connected component (a minimal sketch follows). Future work: better graph clustering algorithms. (Figure example: Pseudo-term 1 = {newspapers, newspaper, a paper}; Pseudo-term 2 = {keep track of, keep track}.)
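A minimal sketch of the clustering step described on this slide, using union-find to extract connected components: matched segments are nodes, and segments whose fractional time overlap exceeds a threshold are joined by an edge. The segment representation `(file_id, start_sec, end_sec)`, the helper names, and the 0.5 threshold are assumptions for illustration.

```python
def overlap_fraction(seg_a, seg_b):
    """Fractional time overlap of two (file_id, start_sec, end_sec) segments.
    Segments from different recordings never overlap."""
    if seg_a[0] != seg_b[0]:
        return 0.0
    start = max(seg_a[1], seg_b[1])
    end = min(seg_a[2], seg_b[2])
    shorter = min(seg_a[2] - seg_a[1], seg_b[2] - seg_b[1])
    return max(0.0, end - start) / shorter

def pseudo_terms(segments, threshold=0.5):
    """Union-find over matched segments; each connected component is one pseudo-term.
    Returns a component (pseudo-term) id for every segment."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if overlap_fraction(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(segments))]
```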
  • Slide 58
  • Bags of Words → Bags of Pseudo-Terms: are pseudo-terms good enough? (Figure example: "Four score and seven years is a lot of years" becomes a count vector over words such as four, score, seven, years; the same utterance as audio becomes a count vector over pseudo-terms such as term_12 and term_5; see the sketch below.)
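A hedged sketch of turning the discovered pseudo-term ids for one conversation side into the fixed-length count vector this slide suggests; the function name and toy numbers are illustrative.

```python
from collections import Counter

def bag_of_pseudo_terms(doc_term_ids, n_terms):
    """doc_term_ids: pseudo-term ids discovered in one conversation side.
    Returns a count vector usable wherever a bag of words would be."""
    counts = Counter(doc_term_ids)
    return [counts.get(t, 0) for t in range(n_terms)]

# Toy usage mirroring the slide: term_12 occurs twice, term_5 once.
print(bag_of_pseudo_terms([12, 5, 12], n_terms=16))
```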
  • Slide 59
  • Evaluation data: Switchboard telephone speech corpus; 600 conversation sides, 6 topics, 60+ hours of audio. Topics: family life, news media, public education, exercise, pets, taxes. Identify all pairs of matched regions, then graph clustering to produce pseudo-terms. O(n²) on 60+ hours is a lot! With efficient algorithms & sparsity it is not as bad as you think: a 500-terapixel dotplot from 60+ hours of speech, compute time 5 hours on 100 cores.
  • Slide 60
  • Clustering Results: "Every time I fire an ASR guy, my performance doesn't go down as much as you would have thought."
  • Slide 61
  • What We've Achieved (anything you can do with a BOW, we can do with a BOP): speech collection → pseudo-term discovery → text processing (information retrieval, corpus organization, information extraction, sentiment analysis).
  • Slide 62
  • BACKUP
  • Slide 63
  • The Answer to All Questions. Kenneth Church, ((almost) former) president
  • Slide 64
  • Overview: a taxonomy of questions. Easy questions ("How do I improve my performance?"): Google Q&A, where the answer to all questions is a URL; Siri, where "Will you marry me?" gets "I love you" or "Hello". Harder questions: "What does a professor do?" Deeper questions (linguistics & philosophy): "What is the meaning of a lexical item?"
  • Slide 65
  • Q&A: The Answer to All Questions. "What is the highest point on Earth?" "When was Bill Gates born?"
  • Slide 66
  • Siri: Real-World Q&A (not all questions can be answered with a URL). Siri, "Will you marry me?": "I love you"; "Hello".
  • Slide 67
  • Slide 68
  • Spoken Web Queries: you can ask any question you want, as long as the answer is a URL, or a joke, or an action on the phone (open an app, call mom). Easy (high frequency / low entropy): "Wikipedia", "Google". Harder (OOVs): "Wingardium Leviosa" (recognized as "when guardian love llosa"). Non-speech: water running → "or" (the universal acceptor).
  • Slide 69
  • Coping Strategies: when users don't get satisfaction, they try various coping strategies, which are rarely effective. Examples: repeat the query; over-articulation (+ screaming & clipping); spell mode.
  • Slide 70
  • Pseudo-Truth (self-training): train on ASR output. Negative feedback loop (with = 1?). Obviously, pseudo-truth ≠ real truth, especially for universal acceptors: if the ASR output is "or" (the universal acceptor), the pseudo-truth is wrong.
  • Slide 71
  • Pseudo-Truth & Repetition: average WER is about 15%, but the average case is not the typical case. Usually right: easy queries ("Wikipedia"). Usually wrong: hard queries (OOVs) and universal acceptors ("or"). Pseudo-truth is more credible when there are lots of queries (& clicks) across users (speaker independent); less credible when a single user repeats the previous query (speaker dependent).
  • Slide 72
  • Zero Resource Apps: detect repetitions (with DTW; a sketch follows). If a user repeats the previous query, the pseudo-truth is probably wrong (speaker dependent). If lots of users issue the same query, the pseudo-truth is probably right (speaker independent). Need to distinguish "google" from "or": both are common, but most instances of "google" will match one another, unlike "or" (the universal acceptor). We probably can't use ASR output to detect repetitions ("Wingardium Leviosa" yields lots of different, wrong ASR outputs), and probably can't use ASR output to distinguish "google" from "or". DTW >> ASR output for detecting repetition & improving pseudo-truth.
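A hedged sketch of the repetition detector: compare the current query's feature frames against the previous query with plain DTW and flag a repeat when the length-normalized distance is small. The threshold, the Euclidean frame cost, and the function names are assumptions; a real system would use better features and path constraints.

```python
import numpy as np

def dtw_distance(x, y):
    """Plain dynamic time warping between two feature-frame arrays
    (Euclidean frame cost), normalized by the combined length."""
    n, m = len(x), len(y)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

def is_repeat(query_frames, prev_query_frames, threshold=1.0):
    """Same speaker repeating the previous query => pseudo-truth is suspect."""
    return dtw_distance(query_frames, prev_query_frames) < threshold
```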
  • Slide 73
  • Spoken Web Search ≠ Dictation. Spoken web search: queries are short; a big fat head & a long tail (OOVs). Large cohorts: lots of "google", "or" & "wingardium leviosa". OOVs are not one-offs: lots of instances of "wingardium leviosa". Small samples are good enough for labeling (via Mechanical Turk?) and calibration (via Mechanical Turk?).
  • Slide 74
  • Spelling Correction. Products: Microsoft Office and Bing. Microsoft Office: a dictionary of general vocabulary is essential. Bing (web search): a dictionary is essentially useless; general vocabulary is more important in documents than in web queries. Pre-web survey: Kukich (1992). Cucerzan and Brill (2004) propose an iterative process that is more appropriate for web queries.
  • Slide 75
  • Spelling Correction: Bing ≠ Office
  • Slide 76
  • Context is helpful: it is easier to correct MWEs (names) than single words (Cucerzan and Brill, 2004).
  • Slide 77
  • ASR without a Dictionary: more like did-you-mean than spelling correction for Microsoft Office. Use zero-resource methods to cluster OOVs; label/calibrate small samples with Mechanical Turk. Wisdom of the crowds: crowds are smarter than all the lexicographers in the world. How many words/MWEs do people know? 20k? 400k? 1M? 13M? Is there a bound, or does V (vocabulary) increase with N (corpus size)? Is V a measure of intelligence or just experience? Is Google smarter than we are, or just more experienced?
  • Slide 78
  • Conclusions: Opportunities for Zero Resource Methods. Bing ≠ Office. Spoken Web Queries ≠ Dictation. Pseudo-Truth ≠ Truth (…)