the l2f spoken web search system for mediaeval 2012

13
1 Alberto Abad and Ramón F. Astudillo L 2 F Spoken Language Systems Lab INESCID Lisboa, Portugal [email protected] Mediaeval 2012 Workshop Pisa, October 4, 2012 The L 2 F Spoken Web Search system for Mediaeval 2012

Upload: mediaeval2012

Post on 05-Dec-2014

671 views

Category:

Technology


0 download

DESCRIPTION

 

TRANSCRIPT

Page 1: The L2F Spoken Web Search system for Mediaeval 2012

1  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Alberto  Abad  and  Ramón  F.  Astudillo  L2F  -­‐  Spoken  Language  Systems  Lab    

INESC-­‐ID  Lisboa,  Portugal  [email protected]­‐id.pt  

Mediaeval  2012  Workshop  Pisa,  October  4,  2012  

The  L2F  Spoken  Web  Search  system  for  Mediaeval  2012  

Page 2: The L2F Spoken Web Search system for Mediaeval 2012

2  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

IntroducJon  

  In  this  first  L2F/INESC-­‐ID  parJcipaJon  on  SWS  our  main  objecJves  were:  •  To  learn  

•  To  have  fun  •  To  build  a  reasonable  system  (in  an  unreasonable  reduced  amount  of  Jme)  

  The  submiVed  L2F  SWS  system  exploits  hybrid  ANN/HMM  connecJonist  methods  for  both  query  tokenizaJon  and  acousJc  keyword  search  •  Composed  by  the  fusion  of  4  phoneJc-­‐based  SWS  sub-­‐systems  

o  Based  on  our  in-­‐house  ASR  system  named  AUDIMUS  

•  Each  sub-­‐system  uses  different  language-­‐dependent  acousJc  models:  o  European  Portuguese  (pt),  Brazilian  Portuguese  (br),  European  Spanish  (es)  and  

American  English  (en)  

•  Different  detecJon  score  normalizaJon  and  fusion  methods  invesJgated  o  SubmiVed  system  applies  per-­‐query  score  normalizaJon  (Q-­‐norm)  and  majority  voJng  

(MV)  fusion  

Page 3: The L2F Spoken Web Search system for Mediaeval 2012

3  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

The  baseline  speech  recognizer  

  AUDIMUS  is  our  in-­‐house  hybrid  HMM/MLP  speech  recognizer  

•  Feature  extrac@on  MulJ-­‐stream  26  PLP,  26  logRASTA-­‐PLP,  28  MSG  and  39  ETSI  (only  8kHz)  •  MLP  Several  context  input  frames  (13-­‐15),  2  hidden-­‐layers  (500  units)  and  1  output  layer.  

•  Output  layer  size  39  for  pt,  40  for  br,  30  for  es,  41  for  en.  •  Data  pt  115  hours  (57  BN+58  telephone);    br  13  hours  of  BN  data;  es  57hours  (36  BN

+21  telephone);  en  142  hours  of  BN  data  (HUB-­‐4  96  &  97)  •  HMM  topology  Single-­‐state  phonemes  (3  frames  minimum  duraJon)  •  Decoder  Uses  Weighted  Finite-­‐State  Transducer  (WFST)  approach  

Page 4: The L2F Spoken Web Search system for Mediaeval 2012

4  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Spoken  Query  TokenizaJon    

  For  each  sub-­‐system,  obtain  a  phoneJc  tokenizaJon  of  the  queries  •  Similar  to  LID  parallel  phonotacJc  

approaches  

  Use  a  phone-­‐loop  grammar  with  phoneme  minimum  duraJon  of  3  frames  to  obtain  1-­‐best  phoneme  chain  •  AlternaJve  n-­‐best  hypothesis  for  

charactering  each  query  were  explored  with  unsaJsfactory  results  o  Other  possibiliJes  not  explored  (yet):  lakce,  CN,  etc…    

•  Influence  of  the  word  inserJon  penalty  (wip)  parameter  on  the  tokenizaJon  result    

Page 5: The L2F Spoken Web Search system for Mediaeval 2012

5  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Spoken  Query  Search  

  Spoken  query  search  based  on  AKWS  with  our  hybrid  speech  recognizer.      Search  window  of  5  seconds  (2.5  seconds  Jme  shio)  

•  Convert  the  problem  in  a  verificaJon  task  (originally  for  with  forced  alignment)  

•  Convenient  for  fusion  purposes    Equally-­‐likely  1-­‐gram  LM  with  target  Query  and  Background  word:  

•  Query  word    o  Described  by  the  phoneJc  units  obtained  in  the  previous  tokenizaJon  stage  

•  Background  word    o  Described  by  the  special  phoneJc  class  background/filler  o  Minimum  duraJon  set  to  250  msec.    

  DetecJon  score  of  detected  candidates  computed  as  the  average  phoneJc  log-­‐likelihood  raJos  of  the  query  term    

Page 6: The L2F Spoken Web Search system for Mediaeval 2012

6  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Spoken  Query  Search  BG  modelling  with  HMM/MLP  

  Possible  approaches  •  Re-­‐train  MLP  ✖  •  Compute  posterior  probability  of  the  background  class  depending  on  the  other  

classes  ✔  o  Mean  probability  of  the  top-­‐N  most  likely  outputs  

  For  the  SWS  system,    •  We  compute  the  average  in  the  likelihood  domain  (top-­‐6)  

o  The  decoder  operates  in  the  likelihood  domain,  so  there  is  not  need  for  and  add-­‐hoc  esJmaJon  of  the  BG  class  prior  

•  We  use  a  background  scale  term  β  (exponenJal  in  the  likelihood  domain)  to  control  the  weight  of  the  BG  model  vs.  Query  

•  This  β  scale  together  with  the  wip  term  strongly  affects  searching  results  o  Adjusted  following  a  non-­‐exhausJve  greedy  search  

Page 7: The L2F Spoken Web Search system for Mediaeval 2012

7  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Score  normalizaJon,  fusion  and  calibraJon  

  Score  normaliza@on  schemes  explored:  •  Q-­‐norm  Assume  that  the  scores  are  dependent  of  the  queries  and  

apply  a  by-­‐query  normalizaJon  •  F-­‐norm  Assume  that  the  scores  are  dependent  of  the  data  file  (of  

the  collecJon)  and  apply  a  by-­‐file  normalizaJon  •  CombinaJons  (QF-­‐norm,  FQ-­‐norm)  

  Fusion  schemes  explored:  •  Candidate  detecJons  from  the  4  parallel  sub-­‐systems  are  kept  (or  

rejected)  according  to  simple  combinaJon  rules:  o  AND  (all),  OR  (at  least  1  sys)  and  MV  (at  least  2  sys)  

•  Final  detecJon  score  is  the  mean  score    

  Decision  threshold  set  according  to  maxATWV  in  dev-­‐dev  

Page 8: The L2F Spoken Web Search system for Mediaeval 2012

8  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Development  experiments  Sub-­‐systems  performance  

Page 9: The L2F Spoken Web Search system for Mediaeval 2012

9  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Development  experiments  Score  normalizaJon  strategies  

Page 10: The L2F Spoken Web Search system for Mediaeval 2012

10  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Development  experiments  Fusion  strategies  (aoer  Q-­‐norm)  

Page 11: The L2F Spoken Web Search system for Mediaeval 2012

11  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Official  L2F  SWS2012  results  

5

10

20

40

60

80

90

95

98

.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40

Mis

s pr

obab

ility

(in %

)

False Alarm probability (in %)

Combined DET Plot

Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.633 Scr=-0.435

Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.633 Scr=-0.435

5

10

20

40

60

80

90

95

98

.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40

Mis

s pr

obab

ility

(in %

)

False Alarm probability (in %)

Combined DET Plot

Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.486 Scr=-0.135

Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.486 Scr=-0.135

5

10

20

40

60

80

90

95

98

.0001 .001 .004 .01.02 .05 .1 .2 .5 1 2 5 10 20 40

Mis

s pr

obab

ility

(in %

)

False Alarm probability (in %)

Combined DET Plot

Random PerformanceTerm Wtd. p-phonetic4_fusion_mv : ALL Data Max Val=0.523 Scr=-0.362

Term Wtd. p-phonetic4_fusion_mv: CTS Subset Max Val=0.523 Scr=-0.362

dev-­‐eval   eval-­‐dev   eval-­‐eval  

Page 12: The L2F Spoken Web Search system for Mediaeval 2012

12  The L2F SWS system for Mediaeval 2012 Pisa, October 4, 2012

Conclusions  

  The  L2F  Spoken  Web  Search  system  fully  exploits  hybrid  ANN/HMM  speech  recogniJon:  •  Query  tokenizaJon  based  on  1-­‐best  phoneJc  decoding  •  Query  search  based  on  AKWS  (with  no  need  for  AM  re-­‐training)  

  The  submiVed  system  is  formed  by  the  fusion  of  four  language-­‐dependent  sub-­‐systems:  •  Q-­‐norm  score  normalizaJon  is  applied  to  each  individual  sub-­‐system  •  Fusion  is  done  following  a  majority  voJng  strategy  

  The  system  achieves  an  actual  ATWV  score  of  0.5195  in  the  eval-­‐eval    •  Promising  given  the  simplicity  of  the  proposed  system  •  Robust  to  query  and  collecJon  sets  

o  Best  performance  is  achieved  in  a  mismatched  condiJon!!  •  Reasonably  well-­‐calibrated  

  Future  work  Focused  in  improved  Query  tokenizaJon,  fusion  with  other  type  of  approaches  (DTW)    

Page 13: The L2F Spoken Web Search system for Mediaeval 2012

Instituto de Engenharia de Sistemas e Computadores Investigação e Desenvolvimento em Lisboa

technology from seed

L2 F - Spoken Language Systems Laboratory 13

technology from seed

L2 F - Spoken Language Systems Laboratory