Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)


TRANSCRIPT

  • Slide 1
  • Slide 2
  • Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)
  • Slide 3
  • Slide 4
  • Slide 5
  • Degree of difficulty: some queries are harder than others. More bits? (Long tail: long/infrequent.) Or fewer bits? (Big fat head: short/frequent.) Query refinement / coping mechanisms: if a query doesn't work, should you make it longer or shorter? From solitaire to a multi-player game: readers, writers, market makers, etc. Good guys & bad guys: advertisers & spammers.
  • Slide 6
  • Spoken Web Search is Easy Compared to Dictation
  • Slide 7
  • Entropy of Search Logs: How big is the Web? How hard is search? With personalization? With backoff? Qiaozhu Mei, Kenneth Church (University of Illinois at Urbana-Champaign, Microsoft Research)
  • Slide 8
  • How Big is the Web? 5B pages? 20B? More? Less? What if a small cache of millions of pages could capture much of the value of billions? Could a big bet on a cluster in the clouds turn into a big liability? Examples of big bets: computer centers & clusters, capital (hardware), expense (power), dev (MapReduce, GFS, BigTable, etc.). Sales & marketing >> production & distribution. Goal: maximize sales (not inventory).
  • Slide 9
  • Millions (Not Billions)
  • Slide 10
  • Population Bound: with all the talk about the Long Tail, you'd think that the Web was astronomical (Carl Sagan: billions and billions; lower distribution $$ → sell less of more). But there are limits to this process. Netflix: 55k movies (not even millions). Amazon: 8M products. Vanity searches (personal home pages): infinite???
  • Conclusions: Millions (not billions). How big is the Web? Upper bound: O(population), not billions, not infinite. Shannon >> Chomsky. How hard is search? Query suggestions? Personalization? Cluster in the cloud ($$$$) vs. walk in the park ($). Goal: maximize sales (not inventory). Entropy is a great hammer (see the sketch below).
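A minimal sketch of the "entropy is a great hammer" idea, assuming the query log is just a list of query strings: the plug-in entropy in bits per query, where 2^H gives a rough effective size of the query space. The function name and toy data are illustrative, not from the talk.

```python
from collections import Counter
from math import log2

def query_log_entropy(queries):
    """Plug-in (maximum-likelihood) entropy of a query log, in bits per query.
    2**H is a rough effective size of the query space: millions, not billions."""
    counts = Counter(queries)
    total = sum(counts.values())
    return -sum((c / total) * log2(c / total) for c in counts.values())

# Toy usage: a big fat head ("weather") plus a long tail of rare queries.
log = ["weather"] * 6 + ["facebook"] * 3 + ["wingardium leviosa"]
print(f"{query_log_entropy(log):.2f} bits/query")
```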
  • Slide 21
  • Search companies have massive resources (logs). Unfair advantage: logs are the crown jewels. Search companies care so much about data collection that the business case for the Toolbar is all about data collection. The $64B question: what (if anything) can we do without resources?
  • Slide 22
  • Slide 23
  • Slide 24
  • Slide 25
  • Towards Google-like Search on Spoken Documents with Zero Resources: How to get something from nothing in a language that you've never heard of (Language/CS Meeting / Speech/EE Meeting)
  • Slide 26
  • Definitions. Towards: not there yet. Zero resources: nothing at all (no knowledge of the language/domain; no training data, no dictionaries, no models, no linguistics). The next crisis will be where we are least prepared. Motivate a new task: Spoken Term Discovery. Spoken Term Detection (word spotting) is the standard task: find instances of a spoken phrase in a spoken document (input: spoken phrase + spoken document). Spoken Term Discovery is the non-standard task: input is only the spoken document (no spoken phrase); output is spoken phrases (interesting intervals in the document). The contrast is sketched below.
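The contrast between the two tasks can be written down as function signatures; this is only an illustrative sketch (the names, the `Audio` type, and the interval representation are assumptions, not from the talk):

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) within a spoken document

def spoken_term_detection(spoken_phrase: "Audio", spoken_document: "Audio") -> List[Interval]:
    """Standard task (word spotting): given an example of the phrase,
    return the intervals of the document where that phrase occurs."""
    ...

def spoken_term_discovery(spoken_document: "Audio") -> List[Interval]:
    """Non-standard task: no query is given; return the 'interesting'
    intervals (candidate terms) discovered in the document itself."""
    ...
```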
  • Slide 27
  • The next crisis will be where we are least prepared.
  • Slide 28
  • Slide 29
  • RECIPE FOR REVOLUTION: 2 cups of a long-standing leader; 1 cup of an aggravated population; 1 cup of police brutality; a teaspoon of video. Bake in YouTube for 2-3 months. Flambé with Facebook and sprinkle on some Twitter.
  • Slide 30
  • Slide 31
  • Slide 32
  • Slide 33
  • Longer is Easier (figure: the vowel /iy/ vs. the word "encyclopedias", male and female speakers)
  • Slide 34
  • What makes an interval of speech interesting? Cues from text processing: long (~1 sec, such as "University of Pennsylvania"), repeated, and bursty (tf * IDF: lots of repetitions within a document, relatively few repetitions across documents); a sketch of such a score follows. Cues unique to speech processing: given-new (the first mention is articulated more carefully than subsequent ones), and dialog between two parties as a multi-player game (A utters an important phrase; B: "what?"; A repeats the important phrase).
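A hedged sketch of combining the text-processing cues (length, repetition, tf*IDF burstiness) into a score for a discovered interval. The function, its weights, and the ~1 second length-bonus cap are assumptions for illustration, not the talk's actual formula.

```python
import math
from collections import Counter

def interestingness(term, doc_terms, doc_freq, n_docs, duration_sec):
    """Score a discovered interval: longer, repeated, and bursty (tf * IDF) is better.

    term         -- pseudo-term id of the interval
    doc_terms    -- all pseudo-terms discovered in this document
    doc_freq     -- dict: pseudo-term -> number of documents containing it
    n_docs       -- total number of documents in the collection
    duration_sec -- length of the interval in seconds
    """
    tf = Counter(doc_terms)[term]                                # repetitions within the document
    idf = math.log((1 + n_docs) / (1 + doc_freq.get(term, 0)))   # rare across documents
    length_bonus = min(duration_sec, 1.0)                        # ~1 sec ("University of Pennsylvania")
    return tf * idf * length_bonus
```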
  • Slide 35
  • Slide 36
  • Slide 37
  • Slide 38
  • Slide 39
  • Slide 40
  • Slide 41
  • Slide 42
  • Slide 43
  • n² time & space (update: ASRU-2011), but the constants are attractive. Sparsity: the algorithms were redesigned to take advantage of sparsity (median filtering, Hough transform, line segment search), with approximate nearest neighbors to follow (ASRU-2012). A dotplot sketch is given below.
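The matching step behind these numbers is an acoustic dotplot: a frame-by-frame similarity matrix in which a repeated phrase shows up as a near-diagonal run of high-similarity cells. A minimal dense sketch, assuming `frames_a`/`frames_b` are MFCC-like feature matrices (rows = frames); the real system exploits sparsity, median filtering, and a Hough/line-segment search rather than this naive n² scan:

```python
import numpy as np

def dotplot(frames_a, frames_b, threshold=0.9):
    """Cosine-similarity dotplot between two sequences of feature frames.
    Repeated phrases appear as diagonal runs of above-threshold cells."""
    a = frames_a / np.linalg.norm(frames_a, axis=1, keepdims=True)
    b = frames_b / np.linalg.norm(frames_b, axis=1, keepdims=True)
    sim = a @ b.T                  # dense: n^2 time and space
    return sim > threshold         # thresholding makes the matrix sparse

def diagonal_runs(mask, min_frames=50):
    """Crude line-segment search: keep diagonals with at least min_frames hits
    (>= 0.5 s of repeated signal at 100 frames per second)."""
    n, m = mask.shape
    return [d for d in range(-n + 1, m) if np.diag(mask, d).sum() >= min_frames]
```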
  • Slide 44
  • Slide 45
  • Slide 46
  • Slide 47
  • Slide 48
  • Slide 49
  • Slide 50
  • NLP on Spoken Documents Without Automatic Speech Recognition. Mark Dredze, Aren Jansen, Glen Coppersmith, Ken Church (EMNLP-2010)
  • Slide 51
  • WE DON'T NEED SPEECH RECOGNITION TO PROCESS SPEECH (at least not for all NLP tasks)
  • Slide 52
  • Speech Processing Chain (diagram): speech collection → speech recognition (manual/full transcripts) → text processing (information retrieval, corpus organization, information extraction, sentiment analysis). This talk: a bag-of-words representation is good enough for many tasks.
  • Slide 53
  • NLP on Speech. Full speech recognition to obtain transcripts requires an acoustic-phonetic model, a sequence model and decoder, and a language model. NLP requirements: a bag of words is sufficient for many NLP tasks (document classification, clustering, etc.). Do we really need the whole speech recognition pipeline?
  • Slide 54
  • Our Goal (diagram): replace the speech recognition + full transcripts path with two steps: find long segments (>0.5 s) of repeated acoustic signal, then extract features directly from the repeated segments (Jansen et al., Interspeech 2010).
  • Slide 55
  • Creating Pseudo-Terms (figure: matched segment pairs grouped into pseudo-terms P1, P2, P3)
  • Slide 56
  • Example Pseudo-Terms (table): discovered pseudo-term IDs (term_5, term_6, term_63, term_113 through term_122) alongside the phrases they correspond to, e.g. our_life_insurance, term, life_insurance, how_much_we, long_term, budget_for, budget, end_of_the_month, stay_within_a_certain, you_know, have_to, certain_budget.
  • Slide 57
  • Graph-Based Clustering. Nodes: each matched audio segment. Edges: an edge between two segments if their fractional overlap exceeds a threshold. Extract the connected components of the graph. This work: one pseudo-term for each connected component (a minimal sketch follows). Future work: better graph clustering algorithms. (Figure example: Pseudo-term 1 = {newspapers, newspaper, a paper}; Pseudo-term 2 = {keep track of, keep track}.)
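A minimal sketch of the clustering step described on this slide, using union-find to extract connected components: matched segments are nodes, and segments whose fractional time overlap exceeds a threshold are joined by an edge. The segment representation `(file_id, start_sec, end_sec)`, the helper names, and the 0.5 threshold are assumptions for illustration.

```python
def overlap_fraction(seg_a, seg_b):
    """Fractional time overlap of two (file_id, start_sec, end_sec) segments.
    Segments from different recordings never overlap."""
    if seg_a[0] != seg_b[0]:
        return 0.0
    start = max(seg_a[1], seg_b[1])
    end = min(seg_a[2], seg_b[2])
    shorter = min(seg_a[2] - seg_a[1], seg_b[2] - seg_b[1])
    return max(0.0, end - start) / shorter

def pseudo_terms(segments, threshold=0.5):
    """Union-find over matched segments; each connected component is one pseudo-term.
    Returns a component (pseudo-term) id for every segment."""
    parent = list(range(len(segments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if overlap_fraction(segments[i], segments[j]) > threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(segments))]
```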
  • Slide 58
  • Bags of Words → Bags of Pseudo-Terms: are pseudo-terms good enough? (Figure example: "Four score and seven years is a lot of years" becomes a count vector over words such as four, score, seven, years; the same utterance as audio becomes a count vector over pseudo-terms such as term_12 and term_5; see the sketch below.)
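A hedged sketch of turning the discovered pseudo-term ids for one conversation side into the fixed-length count vector this slide suggests; the function name and toy numbers are illustrative.

```python
from collections import Counter

def bag_of_pseudo_terms(doc_term_ids, n_terms):
    """doc_term_ids: pseudo-term ids discovered in one conversation side.
    Returns a count vector usable wherever a bag of words would be."""
    counts = Counter(doc_term_ids)
    return [counts.get(t, 0) for t in range(n_terms)]

# Toy usage mirroring the slide: term_12 occurs twice, term_5 once.
print(bag_of_pseudo_terms([12, 5, 12], n_terms=16))
```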
  • Slide 59
  • Evaluation data: Switchboard telephone speech corpus; 600 conversation sides, 6 topics, 60+ hours of audio. Topics: family life, news media, public education, exercise, pets, taxes. Identify all pairs of matched regions, then graph clustering to produce pseudo-terms. O(n²) on 60+ hours is a lot! With efficient algorithms & sparsity it is not as bad as you think: a 500-terapixel dotplot from 60+ hours of speech, compute time 5 hours on 100 cores.
  • Slide 60
  • Clustering Results: "Every time I fire an ASR guy, my performance doesn't go down as much as you would have thought."
  • Slide 61
  • What We've Achieved (anything you can do with a BOW, we can do with a BOP): speech collection → pseudo-term discovery → text processing (information retrieval, corpus organization, information extraction, sentiment analysis).
  • Slide 62
  • BACKUP
  • Slide 63
  • The Answer to All Questions. Kenneth Church, ((almost) former) president
  • Slide 64
  • Overview: a taxonomy of questions. Easy questions ("How do I improve my performance?"): Google Q&A, where the answer to all questions is a URL; Siri, where "Will you marry me?" gets "I love you" or "Hello". Harder questions: "What does a professor do?" Deeper questions (linguistics & philosophy): "What is the meaning of a lexical item?"
  • Slide 65
  • Q&A: The Answer to All Questions. "What is the highest point on Earth?" "When was Bill Gates born?"
  • Slide 66
  • Siri: Real-World Q&A (not all questions can be answered with a URL). Siri, "Will you marry me?": "I love you"; "Hello".
  • Slide 67
  • Slide 68
  • Spoken Web Queries: you can ask any question you want, as long as the answer is a URL, or a joke, or an action on the phone (open an app, call mom). Easy (high frequency / low entropy): "Wikipedia", "Google". Harder (OOVs): "Wingardium Leviosa" (recognized as "when guardian love llosa"). Non-speech: water running → "or" (the universal acceptor).
  • Slide 69
  • Coping Strategies: when users don't get satisfaction, they try various coping strategies, which are rarely effective. Examples: repeat the query; over-articulation (+ screaming & clipping); spell mode.
  • Slide 70
  • Pseudo-Truth (self-training): train on ASR output. Negative feedback loop (with = 1?). Obviously, pseudo-truth ≠ real truth, especially for universal acceptors: if the ASR output is "or" (the universal acceptor), the pseudo-truth is wrong.
  • Slide 71
  • Pseudo-Truth & Repetition: average WER is about 15%, but the average case is not the typical case. Usually right: easy queries ("Wikipedia"). Usually wrong: hard queries (OOVs) and universal acceptors ("or"). Pseudo-truth is more credible when there are lots of queries (& clicks) across users (speaker independent); less credible when a single user repeats the previous query (speaker dependent).
  • Slide 72
  • Zero Resource Apps: detect repetitions (with DTW; a sketch follows). If a user repeats the previous query, the pseudo-truth is probably wrong (speaker dependent). If lots of users issue the same query, the pseudo-truth is probably right (speaker independent). Need to distinguish "google" from "or": both are common, but most instances of "google" will match one another, unlike "or" (the universal acceptor). We probably can't use ASR output to detect repetitions ("Wingardium Leviosa" yields lots of different, wrong ASR outputs), and probably can't use ASR output to distinguish "google" from "or". DTW >> ASR output for detecting repetition & improving pseudo-truth.
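A hedged sketch of the repetition detector: compare the current query's feature frames against the previous query with plain DTW and flag a repeat when the length-normalized distance is small. The threshold, the Euclidean frame cost, and the function names are assumptions; a real system would use better features and path constraints.

```python
import numpy as np

def dtw_distance(x, y):
    """Plain dynamic time warping between two feature-frame arrays
    (Euclidean frame cost), normalized by the combined length."""
    n, m = len(x), len(y)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(x[i - 1] - y[j - 1])
            d[i, j] = cost + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    return d[n, m] / (n + m)

def is_repeat(query_frames, prev_query_frames, threshold=1.0):
    """Same speaker repeating the previous query => pseudo-truth is suspect."""
    return dtw_distance(query_frames, prev_query_frames) < threshold
```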
  • Slide 73
  • Spoken Web Search ≠ Dictation. Spoken web search: queries are short; a big fat head & a long tail (OOVs). Large cohorts: lots of "google", "or" & "wingardium leviosa". OOVs are not one-offs: lots of instances of "wingardium leviosa". Small samples are good enough for labeling (via Mechanical Turk?) and calibration (via Mechanical Turk?).
  • Slide 74
  • Spelling Correction. Products: Microsoft Office and Bing. Microsoft Office: a dictionary of general vocabulary is essential. Bing (web search): a dictionary is essentially useless; general vocabulary is more important in documents than in web queries. Pre-web survey: Kukich (1992). Cucerzan and Brill (2004) propose an iterative process that is more appropriate for web queries.
  • Slide 75
  • Spelling Correction: Bing ≠ Office
  • Slide 76
  • Context is helpful: it is easier to correct MWEs (names) than single words (Cucerzan and Brill, 2004).
  • Slide 77
  • ASR without a Dictionary: more like did-you-mean than spelling correction for Microsoft Office. Use zero-resource methods to cluster OOVs; label/calibrate small samples with Mechanical Turk. Wisdom of the crowds: crowds are smarter than all the lexicographers in the world. How many words/MWEs do people know? 20k? 400k? 1M? 13M? Is there a bound, or does V (vocabulary) increase with N (corpus size)? Is V a measure of intelligence or just experience? Is Google smarter than we are, or just more experienced?
  • Slide 78
  • Conclusions: Opportunities for Zero Resource Methods. Bing ≠ Office. Spoken Web Queries ≠ Dictation. Pseudo-Truth ≠ Truth (…)