
Speech Recognition Tools

Mark Hasegawa-Johnson

July 17, 2002

1 Bash, Sed, Awk

1.1 Installation

If you are on a unix system, bash, sed, gawk, and perl are probably already installed. If not, ask your system administrator.

If you are on Windows, download the Cygwin setup program from http://www.cygwin.com. Cygwin installation can be run as many times as you like; anything already installed on your PC will not be re-installed. In the screen that asks you which pieces of the OS you want to install, be sure to select (DOC)→(man), (Interpreters)→(gawk,perl), and (TEXT)→(less). I also recommend (Math)→(bc), a simple text-based calculator, and (Network)→(inetutils,openssh). You can also install a complete X-windows server and set of clients from (XFree86)→(fvwm,lesstif,XFree86-base,XFree86-startup,etc.), allowing you to (1) install X-based programs on your PC, and (2) run X-based programs on any unix computer on the network, with I/O coming from windows on your PC. Setting up X requires a little extra work; see http://xfree86.cygwin.com.

If cygwin is installed on your computer in the directory c:/cygwin, it will create a stump of a unix hierarchy starting in that directory. For example, the directory c:/cygwin/usr/local/bin is available under cygwin as /usr/local/bin. Arbitrary directories elsewhere on your c: drive are available as /cygdrive/c/....

In order to use cygwin effectively, you need to set the environment variables HOME (to specify your home directory), DISPLAY (if you are using X-windows), and most importantly, PATH (to specify directories that should be searched for useful programs; this list should include at least /bin, /usr/bin, /usr/local/bin, and /usr/X11R6/bin).
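For example, the relevant lines in your ~/.bashrc might look like the following (a sketch only; the home directory and DISPLAY value are invented for illustration, and should be adjusted for your own machine):

export HOME=/home/yourname              # your cygwin home directory (hypothetical)
export DISPLAY=localhost:0.0            # only needed if you use X-windows
export PATH=/bin:/usr/bin:/usr/local/bin:/usr/X11R6/bin:${PATH}   # bash separates PATH entries with colons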

In order to use bash, sed, awk, and perl, you will also need a good ASCII text editor. You can download XEmacs from http://www.xemacs.org.

1.2 Reading Assignments

Manual pages are available, among other places, at http://www.gnu.org/manual. If your computer is set up correctly, you can also read the bash man page by typing 'man bash' at the cygwin/bash prompt.

• Read the bash manual page, sections: (Basic Shell Features)→(Shell Syntax, Shell Commands, Shell Parameters, Shell Expansions), and (Shell Builtins)→(Bourne Shell Builtins, Bash Conditional Expressions, Shell Arithmetic, Shell Scripts). Alternatively, you can try reading the tutorial chapter in the O'Reilly bash book.

• Read the sed manual page, or the sed tutorial chapter in the O'Reilly 'sed and awk' book.

• You may eventually want to learn gawk, but it's not required. The section of the gawk man page called 'Getting Started with awk' is pretty good. So are the tutorial chapters in the O'Reilly 'sed and awk' book.

1.3 A bash/sed example

Why not use C all the time? The answer is that some tasks are easier to perform with other programming languages:

• Manipulate file hierarchies: use bash and sed.


• (Simple manipulation of tabular text files: gawk)

• Manipulate text files: use perl.

• Manipulate non-text files: use C.

perl can do any of these things, but isn't very efficient for numerical calculations. C can also do any of the things listed, but perl has many builtin tools for string manipulation, so it's worthwhile to learn perl. gawk is easier than perl for simple manipulation of tabular text; it's up to you whether or not you want to try learning it.

bash is a POSIX-compliant command interpreter, meaning that, like the DOS shell, you can type in a program name, and the program will run. Unlike the DOS shell, bash is also a pretty good programming language (not as good as BASIC or perl, but better than the DOS shell or tcsh).

For example, suppose you want to search through the entire ${HOME}/timit/train hierarchy [1], apply the C program "extract" to all WAV files in order to create MFC files, create a file with extension TRP containing only the third column of each PHN file, and then move all of the resulting files to a directory hierarchy under ${HOME}/newfiles/train (but the new directory hierarchy doesn't exist yet). You could do all that by entering the following, either at the bash command prompt or in a shell script:

if [ ! -e ${HOME}/newfiles/train ]; then
  mkdir ${HOME}/newfiles;
  mkdir ${HOME}/newfiles/train;
fi
for dr in dr{1,2,3,4,5,6,7,8}; do
  if [ ! -e ${HOME}/newfiles/train/${dr} ]; then
    echo mkdir ${HOME}/newfiles/train/${dr};
    mkdir ${HOME}/newfiles/train/${dr};
  fi
  for spkr in `ls ${HOME}/timit/train/${dr}`; do
    if [ ! -e ${HOME}/newfiles/train/${dr}/${spkr} ]; then
      echo mkdir ${HOME}/newfiles/train/${dr}/${spkr};
      mkdir ${HOME}/newfiles/train/${dr}/${spkr};
    fi
    cd ${HOME}/timit/train/${dr}/${spkr};
    for file in `ls`; do
      case ${file} in
        *.wav | *.WAV )
          MFCfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/wav/mfc/;s/WAV/MFC/'`;
          echo Copying file ${PWD}/${file} into file ${MFCfile};
          extract ${file} ${MFCfile};;
        *.phn | *.PHN )
          TRPfile=${HOME}/newfiles/train/${dr}/${spkr}/`echo ${file} | sed 's/phn/trp/;s/PHN/TRP/'`;
          echo Extracting third column of file ${PWD}/${file} into file ${TRPfile};
          gawk '{print $3}' ${file} > ${TRPfile};;
      esac;
    done;
  done;
done

Once you have created the entire new hierarchy, you can list the whole hierarchy using

ls -R ${HOME}/newfiles | less

[1] There are several copies of TIMIT floating around the lab. You can also buy your own copy for $100 from http://www.ldc.upenn.edu, or download individual files from that web site for free.


You may have noticed by now that bash suffers from cryptic syntax. bash inherits syntax from 'sh', a command interpreter written at AT&T in the days when every ASCII character had to be chiseled on stone tablets in triplicate; thus bash uses characters economically. Three rules will help you to use bash effectively:

1. Keep in mind that ', `, and " mean very different things. { and ${ mean very different things. [ standing alone is a synonym for the command 'test'.

2. When trying to figure out how bash parses a line, you need to follow the seven steps of command expansion in the same order that bash follows them: brace expansion, tilde expansion, variable expansion, command substitution, arithmetic expansion, word splitting, and filename expansion, in that order. No, really, I'm serious. Trying to read bash as pseudo-English leads only to frustration.

3. When writing your own bash scripts, trial and error is usually the fastest method. Use the 'echo' command frequently, with appropriate variable expansions at each level, so you can see what bash thinks it is doing.
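For example, here is the kind of echo experiment that rule 3 recommends (a hypothetical session; the variable values are invented for illustration):

dr=dr1;
echo dr{1,2,3}                          # brace expansion: prints "dr1 dr2 dr3"
echo ${HOME}/newfiles/train/${dr}       # variable expansion: the path bash will really use
echo `ls ${HOME}/timit/train/${dr}`     # command substitution: the list bash will loop over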

About gawk: the only thing you absolutely need to know about gawk is that if you type the following command into the bash prompt, the file foo.txt will contain the M'th, N'th, and P'th columns from the file bar.txt (where M, N, and P should be any single digits):

gawk '{printf("%s\t%s\t%s\n",$M,$N,$P)}' bar.txt > foo.txt

1.4 Homework Assignment

Create a bash shell script that does the following things. Note that if you don't have the Switchboard transcription files, you can download them from http://www.isip.msstate.edu/projects/switchboard/.

• Find the 'word' transcription file corresponding to the 'A' side of every conversation in Switchboard. Use gawk to copy the 2nd, 3rd, and 4th columns of that file to a new file that has the same basic filename, but resides in the directory $HOME/switchboard/A. Do NOT create subdirectories under this directory — your goal is to get all of the side-A word transcriptions into the same directory.

• Do the same thing to the side-B transcriptions. Put them into $HOME/switchboard/B.
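If you get stuck, a skeleton for the side-A half might look like the following (a sketch only: the directory ${HOME}/swb_transcripts and the filename pattern are guesses about your local copy of the corpus, and side B is handled identically with A replaced by B):

mkdir ${HOME}/switchboard ${HOME}/switchboard/A;
for file in `find ${HOME}/swb_transcripts -name '*A-word*'`; do
  gawk '{print $2, $3, $4}' ${file} > ${HOME}/switchboard/A/`basename ${file}`;
done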


2 Perl

2.1 Reading

• Chapter 1 of Programming Perl by Larry Wall and Randal Schwartz (this book is sometimes available on-line at perl.com; in the perl community, it is called "The Camel Book" in order to distinguish it from all of the other perl books available). Larry Wall (the author of perl, as well as the author of the book) is an occasional linguist with a good sense of humor. This chapter is possibly the best written introduction to perl data structures and control flow, and contains better documentation on blocks, loops, and control flow statements (if, unless, while, until) than the man pages.

• The manual pages are available in HTML format on-line at http://www.perl.com/doc/. Download the gzipped tar file, and unpack it on your own PC, so that you can keep it open while you program. You can read the manual pages using "man" under cygwin, but it is much easier to navigate this complicated document set using HTML. Before programming, you should read Chapter 1 of the Camel Book, plus the perldata man page and the first halves of the perlref and perlsub manual pages. While programming, you should have an HTML browser open so that you can easily look for useful information in the three manual pages just listed, and also in the perlsyn, perlop, perlfunc, and perlre pages.

• The perl motto is "There is more than one way to do it." It is easy to write cryptic perl; it is somewhat more difficult to write legible perl. The perlstyle file contains a few suggestions useful in the quest to make your code more readable.

• Language Modeling. Chen and Goodman (Computer Speech and Language, 1998) performed extensive experiments using a variety of N-gram interpolation methods, and developed an improved method based on their experiments. Chen and Rosenfeld (IEEE Trans. SAP, 2000) contains a very readable review of the Chen-Goodman interpolation method, followed by an enlightening comparison to a new maximum-entropy method.

2.2 An Example

The following file, timit_durations.pl, computes the mean and variance of the durations of every phoneme in the TIMIT database.

#!/usr/bin/perl
#
# Compute mean and variance of phoneme durations
# and log durations in TIMIT.
#
# Usage (in a bash shell):
#   timit_durations.pl d:/timit/TIMIT > timit_durations.txt
#
# Creates a text table in file timit_durations.txt
# showing, for each phoneme, the number of times it was seen,
# the mean and standard deviation of the duration in milliseconds,
# and the mean and standard deviation of the log duration (in log ms).
#
# Status messages are printed to STDERR (usually the terminal).
#
# Mark Hasegawa-Johnson, 6/18/2002
#

###############################################
# Subroutine to peruse a directory tree $_[0]
#
sub peruse_tree {
    # Top directory is whatever was given as $_[0].
    # If I can't open it, die with an error message.
    opendir(TOPDIR, $_[0]) || die "Can't opendir $_[0]: $!";
    print STDERR "Reading from directory ", $_[0], "\n";
    my(@filelist) = readdir(TOPDIR);
    closedir(TOPDIR);

    # Read entries of TOPDIR
    foreach $filename (@filelist) {
        # If last character of the filename is ., ignore it
        if ( $filename =~ /\.$/ ) { }

        # If it's any other directory, call peruse_tree on it
        elsif ( -d "$_[0]/$filename" ) {
            peruse_tree("$_[0]/$filename");
        }

        # If the filename ends in s[ix]\d+\.phn (case-insensitive),
        # then call the function read_phones
        elsif ( $filename =~ /s[ix]\d+\.phn/i ) {
            read_phones("$_[0]/$filename");
        }
    }
}

###################################################
# Subroutine to convert TIMIT phone labels into Switchboard labels
#
sub timit2switchboard {
    my(@data) = @_;

    # Process each record, while there are records left
    for ( my($n) = 0; $n <= $#data; $n++ ) {

        # Merge some TIMIT labels into Switchboard superclasses
        if ( $data[$n][2] eq 'ax-h' ) { $data[$n][2] = 'ax'; }
        if ( $data[$n][2] eq 'axr' )  { $data[$n][2] = 'er'; }
        if ( $data[$n][2] eq 'ux' )   { $data[$n][2] = 'uw'; }
        if ( $data[$n][2] eq 'ix' )   { $data[$n][2] = 'ih'; }
        if ( $data[$n][2] eq 'dx' )   { $data[$n][2] = 't'; }
        if ( $data[$n][2] eq 'nx' )   { $data[$n][2] = 'n'; }
        if ( $data[$n][2] eq 'hv' )   { $data[$n][2] = 'hh'; }

        # If this segment is a closure, look to see if it is
        # followed by a release.  If so, combine the two segments.
        if ( my($stop) = ($data[$n][2] =~ /([bdgptk])cl/) ) {

            # If next segment is the right stop release,
            # or if next segment is jh or ch and this segment is tcl or dcl,
            # set label of this segment equal to next segment,
            # set end time of this segment equal to end of next segment,
            # and delete the next segment.
            if ( ($data[$n+1][2] eq $stop) ||
                 (($data[$n+1][2] =~ /[cj]h/) && ($stop =~ /[td]/)) ) {
                $data[$n][2] = $data[$n+1][2];
                $data[$n][1] = $data[$n+1][1];
                splice(@data, $n+1, 1);
            }
            # Otherwise, this must be an unreleased stop,
            # so best thing to do is just fix the phoneme label
            else {
                $data[$n][2] = $stop;
            }
        }
    }

    # Return the modified @data array
    return(@data);
}

###################################################
# Subroutine to read phoneme data
#
sub read_phones {
    # Initialize @data to null
    my(@data) = ();

    # Open the INPUTFILE or die with an error message
    open(INPUTFILE, $_[0]) || die "Unable to open input file $_[0]: $!";

    # Read in all lines from the INPUTFILE
    foreach $_ (<INPUTFILE>) {
        # Strip the newline, and split the line into a private array
        chomp;
        my(@record) = split;

        # Push a reference to this new record onto the @data list
        push(@data, \@record);
    }
    close(INPUTFILE);

    # Convert phone labels into Switchboard labels
    @data = timit2switchboard(@data);

    # Process each record separately
    foreach $record (@data) {
        my($label) = $$record[2];

        # Compute duration in milliseconds, assuming 16kHz sampling rate
        my($duration) = ($$record[1] - $$record[0]) / 16;
        my($logd) = log($duration);

        # Increment the global counters $PHONES_SEEN and $ACC{$label}{'n'}
        $PHONES_SEEN++;
        $ACC{$label}{'n'}++;

        # Add duration, square, logd, and logd^2 to accumulators.
        # Note that these accumulators are global.
        # If this particular label has never before been seen,
        # perl automagically creates $ACC{$label}{'sum'}, and gives
        # it an initial value of zero.  Very convenient.
        # After that, the values keep on accumulating until the
        # top-level script is finished.
        $ACC{$label}{'sum'} += $duration;
        $ACC{$label}{'sumsq'} += ( $duration * $duration );
        $ACC{$label}{'sumlog'} += $logd;
        $ACC{$label}{'sumsqlog'} += ( $logd * $logd );
    }
}

##########################################
# Main Program
#

# Accumulate duration information
# from all directories specified
# on the command line
#
foreach $arg (@ARGV) {
    peruse_tree($arg);
}

# When finished, print out a table.
# Print the header of the phoneme table.
print "LABEL\tN\tMEAN\tSTD\tMEANLOG\tSTDLOG\n";

# Phonemes are sorted in alphabetical order
foreach $label ( sort keys(%ACC) ) {
    # Get the hash reference contained in $ACC{$label}
    $hr = $ACC{$label};

    # $n is the number of examples of this phoneme observed.
    # Mean is sum of durations divided by number of tokens seen.
    # Mean log is sum of log durations divided by number of tokens seen.
    # Std is sqrt( (sumsq of durations - mean*sum) / (n-1) ).
    # Stdlog is same as above, but with logs.
    $n = $$hr{'n'};
    $mean = $$hr{'sum'} / $n;
    $meanlog = $$hr{'sumlog'} / $n;
    $std = sqrt( ($$hr{'sumsq'} - $$hr{'sum'} * $mean) / ($n-1) );
    $stdlog = sqrt( ($$hr{'sumsqlog'} - $$hr{'sumlog'} * $meanlog) / ($n-1) );

    # Print a line of the output table
    printf "%s\t%6d\t%6.0f\t%6.0f\t%6.2f\t%6.3f\n",
           $label, $n, $mean, $std, $meanlog, $stdlog;
}


2.3 Language Modeling

A probabilistic grammar of language L may be considered useful if it satisfies one of the following two objectives:

1. Specifies the probability of observing any particular string of words, W = [w_1, \ldots, w_M], in language L.

2. Specifies the various ways in which the meanings of words [w_1, \ldots, w_M] may be combined in order to compute a sentence meaning, and specifies the probability that any one of the acceptable sentence meanings is what the talker was actually trying to say.

An N-gram grammar is a stochastic automaton designed to satisfy grammar objective number 1 in the most efficient manner possible:

    p(W) = \prod_{m=1}^{M} p(w_m | w_{m-N+1}, \ldots, w_{m-1})        (1)

where words with indices m \le 0 are defined to be the special symbol "SENTENCE START." If the length of the N-gram, N, is larger than the length of the sentence, M, a correct N-gram specifies the probability of the sentence exactly. In practice, most N-grams are either bigrams (N = 2) or trigrams (N = 3), although a few sites have experimented with variable-length N-grams.

The maximum-likelihood estimate of the bigram probability p(w_m | w_{m-1}) given any training corpus is

    p_{ML}(w_m | w_{m-1}) = \frac{C(w_{m-1}, w_m)}{\sum_{w_m} C(w_{m-1}, w_m)}        (2)

where the "count" C(w_{m-1}, w_m) is the number of times that the given word sequence was observed in the training corpus.

Because of the infinite productivity of human language, there are always an infinite number of perfectly reasonable word sequences that will not be observed in any finite-sized training corpus (typical language model training corpora contain 250,000,000 words). In order to allow the model to generalize to new observations, higher-order N-grams may be interpolated with lower-order N-grams. There are a number of ways to do this; one method is using an arbitrary fixed reduction of the word count, as follows:

    p_I(w_m | w_{m-1}) =
      \begin{cases}
        \frac{C(w_{m-1},w_m) - D}{C(w_{m-1})} + \frac{D N_{1+}(w_{m-1}\bullet)}{C(w_{m-1})} p_I(w_m) & C(w_{m-1},w_m) \ge 1 \\
        \frac{D N_{1+}(w_{m-1}\bullet)}{C(w_{m-1})} p_I(w_m) & C(w_{m-1},w_m) = 0
      \end{cases}        (3)

where D \le 1 is an adjustable parameter, and p_I(w_m) is any valid unigram probability estimate. Typical interpolation methods either use the maximum likelihood estimate p_{ML}(w_m) = C(w_m) / \sum_w C(w), or interpolate between p_{ML}(w_m) and a "0-gram" distribution that assumes all words to be equally likely. The term N_{1+}(w_{m-1}\bullet) is the number of distinct words that may follow w_{m-1}. This term is necessary to make sure that

    1 = \sum_{w_m} p(w_m | w_{m-1})        (4)

Kneser and Ney (1995) demonstrated that equation 3 gives best results if the lower-order probability p_I(w_m) is chosen so that p_I(w_m | w_{m-1}) satisfies the following equation:

    C(w_m) = \sum_{w_{m-1}} p_I(w_m | w_{m-1}) C(w_{m-1})        (5)

Equation 5 says that the higher-order interpolated N-gram p_I(w_m | w_{m-1}) should be designed so that the database count C(w_m) is equal to its expected value given the count C(w_{m-1}). Kneser and Ney demonstrated that one interpolation formula that satisfies equation 5 is

    p_I(w_m | w_{m-1}) =
      \begin{cases}
        \frac{C(w_{m-1},w_m) - D}{C(w_{m-1})} + \frac{D N_{1+}(w_{m-1}\bullet)}{C(w_{m-1})} \frac{N_{1+}(\bullet w_m)}{N_{1+}(\bullet\bullet)} & C(w_{m-1},w_m) \ge 1 \\
        \frac{D N_{1+}(w_{m-1}\bullet)}{C(w_{m-1})} \frac{N_{1+}(\bullet w_m)}{N_{1+}(\bullet\bullet)} & C(w_{m-1},w_m) = 0
      \end{cases}        (6)


where N_{1+}(\bullet\bullet) is the total number of lexicographically distinct bigrams observed in the training data, i.e.

    N_{1+}(\bullet\bullet) = \sum_{w_{m-1}} N_{1+}(w_{m-1}\bullet) = \sum_{w_m} N_{1+}(\bullet w_m)        (7)

Chen and Goodman demonstrated two improvements to the Kneser-Ney algorithm. First, they showed that the Kneser-Ney probabilities may be interpolated down to the 0-gram probability. Second, they showed that the discount parameter D should depend on the database count C(w_{m-1}, w_m), i.e.

    D(w_{m-1}, w_m) =
      \begin{cases}
        0      & C(w_{m-1}, w_m) = 0 \\
        D_1    & C(w_{m-1}, w_m) = 1 \\
        D_2    & C(w_{m-1}, w_m) = 2 \\
        D_{3+} & C(w_{m-1}, w_m) \ge 3
      \end{cases}        (8)

Chen and Goodman suggest several empirical and theoretical methods for choosing the parameters D_1, D_2, D_{3+}; their figure 11 suggests that for a small corpus (1 million words), the best values are approximately D_1 = 0.6, D_2 = 1.0, D_{3+} = 1.4.

The top-level Chen-Goodman-Kneser-Ney probability p(w_m | w_{m-1}) is calculated according to

    p_{CGKN}(w_m | w_{m-1}) = \left( \frac{C(w_{m-1}, w_m) - D(w_{m-1}, w_m)}{C(w_{m-1})} \right)        (9)
        + \left( \frac{D_1 N_1(w_{m-1}\bullet) + D_2 N_2(w_{m-1}\bullet) + D_{3+} N_{3+}(w_{m-1}\bullet)}{C(w_{m-1})} \right) p_{CGKN}(w_m)        (10)

where N_1(w_{m-1}\bullet) is the number of distinct words that follow w_{m-1} exactly once in the training data, N_2(w_{m-1}\bullet) is the number that follow exactly twice, and N_{3+}(w_{m-1}\bullet) is the number that follow three or more times. The lower-level probability p_{CGKN}(w_m) is based on N_{1+}(\bullet w_m), just as in the Kneser-Ney probability formula, but Chen and Goodman showed that this lower-level probability may in turn be interpolated, thus

    p_{CGKN}(w_m) = \left( \frac{N_{1+}(\bullet w_m) - D(w_m)}{N_{1+}(\bullet\bullet)} \right)        (11)
        + \left( \frac{D_1 N_1(\bullet) + D_2 N_2(\bullet) + D_{3+} N_{3+}(\bullet)}{N_{1+}(\bullet\bullet)} \right) p_{CGKN}(\bullet)        (12)

where N_1(\bullet) is the number of words that appear exactly once in the training data, and p_{CGKN}(\bullet) is the 0-gram probability (all words equally likely).

Equations 10 and 12 are relatively complicated, but notice that all terms in these two equations can be computed from the bigram counts C(w_{m-1}, w_m). In order to estimate the Chen-Goodman-Kneser-Ney probability, then, it is sufficient to find the count C(w_{m-1}, w_m) of every bigram pair that appears in the training database, and then add up those numbers in appropriate ways.
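Since everything reduces to bigram counts, the counting step itself is easy to prototype. The following gawk fragment is a minimal sketch (it assumes one sentence per line in a hypothetical cleaned-up transcription file corpus.txt, and uses <s> as the SENTENCE START symbol):

gawk '{prev="<s>"; for(i=1;i<=NF;i++){count[prev " " $i]++; prev=$i}}
      END{for(b in count) print count[b], b}' corpus.txt | sort -rn > bigram_counts.txt

Your perl version can store the same counts in a hash of hashes, $C{$wprev}{$w}, and then make the additional passes needed to compute N_{1+}(w_{m-1}\bullet), N_1, N_2, and N_{3+}.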

2.4 Homework

Use perl to accumulate sufficient statistics from the Switchboard corpus for the estimation of a Chen-Goodman-smoothed bigram language model. Your code should have two distinct sections: (1) first, find the bigram counts C(w_{m-1}, w_m) for every possible word-pair in the training data, and then (2) manipulate the bigram counts in order to calculate p_{CGKN}(w_m | w_{m-1}).

"Words" that begin and end with square brackets (e.g., "[laughter]," "[silence]") should be merged into a single category (perhaps "[silence]"). Partial-word utterances that use square brackets and possibly a dash should be converted into the full-word code before being entered into your database count (e.g. "[com]puter" becomes "computer", "[be]cau[se]-" becomes "because", "-[a]bout" becomes "about").

Notice that there are more than 30,000 distinct words in Switchboard. If you try to represent C(w_{m-1}, w_m) or p(w_m | w_{m-1}) as a fully enumerated table, you will wind up with a table of size 900M. Don't do that. Instead, C(w_{m-1}, w_m) should include entries for only the bigram pairs that have a nonzero count in the database (about 2M entries). The bigram probability p(w_m | w_{m-1}) of any bigram with zero count is composed


of two terms: p_{CGKN}(w_m) (depends only on w_m), and a term that depends only on w_{m-1}. Store these two terms as separate output tables, with about 30K entries for each of these two tables.

If you have extra time, consider training your language model on part of the Switchboard corpus, and then testing it on the remaining part. Check to see whether your cross-entropy measure is comparable to the cross-entropy measures that Chen and Goodman obtained on Switchboard (Figs. 3 and 4 show bigram and trigram Jelinek-Mercer smoothing; Figs. 5 and 6 show the advantage relative to Jelinek-Mercer of a number of different algorithms).


3 Training Monophone Models Using HTK

3.1 Installation

Download HTK from http://htk.eng.cam.ac.uk/. You should download the standard distribution (gzipped tar file). You may also wish to download the samples.tar.gz file, which contains a demo you can run to test your installation. You may also wish to download the pre-compiled PDF copy of the HTKBook.

Compile HTK as specified in the README file. Under Windows, you will need to use a DOS prompt to compile, because the VCVARS32.bat file will not run under cygwin.

Add the bin.win32 directory to your path (or the appropriate other bin directory, if you are on unix). In order to test your distribution, move to the samples/HTKDemo directory, and (assuming you are in a cygwin window by now) type ./runDemo.

3.2 Readings

1. Primary readings are from the HTKBook. Before you begin, read sections 3.1.5-3.4 and 6.2. Before you start creating acoustic features, read sections 5.1-5.2, 5.4, 5.6, 5.8-5.11, and 5.16. Before you start training your HMMs, read sections 7.1-7.2 and 8.1-8.5.

2. Those who do not already know HMMs may wish to read either HTKBook chapter 1, or read Rabiner (IEEE ASSP Magazine, January 1986) and Juang et al. (IEEE Trans. Information Theory 32(2):307-309, 1986). Even those who already know HMMs may be interested in the discussion of HTK's token-passing algorithm in section 1.6 of the HTKBook.

3.3 Creating Label and Script Files

A script file in HTK is a list of speech or feature files to be processed. HTK's feature conversion program, HCopy, expects an ordered list of pairs of input and output files. HTK's training and test programs, including HCompV, HInit, HRest, HERest, and HVite, all expect a single-column ordered list of acoustic feature files. For example, if the file TRAIN2.scp contains

d:/timit/TIMIT/TRAIN/DR8/MTCS0/SI1972.WAV data/MTCS0SI1972.MFC
d:/timit/TIMIT/TRAIN/DR8/MTCS0/SI2265.WAV data/MTCS0SI2265.MFC
...

then the command line "HCopy -S TRAIN2.scp ..." will convert SI1972.WAV and put the result into data/MTCS0SI1972.MFC (assuming that the "data" directory already exists). Likewise, the command "HInit -S TRAIN1.scp ..." works if TRAIN1.scp contains

data/MTCS0SI1972.MFC
data/MTCS0SI2265.MFC
...

The long names of files in the "data" directory are necessary because TIMIT files are not fully specified by the sentence number. The sentence SX3.PHN, for example, was uttered by talkers FAJW0, FMBG0, FPLS0, MILB0, MEGJ0, MBSB0, and MWRP0. If you concatenate talker name and sentence number, as shown above, the resulting filename is sufficient to uniquely specify the TIMIT sentence of interest.

A master label file (MLF) in HTK contains information about the order and possibly the time alignment of all training files or all test files. The MLF must start with the seven characters "#!MLF!#" followed by a newline. After the global header line comes the name of the first file, enclosed in double-quote characters ("); the filename should have extension .lab, and the path should be replaced by "*". The next several lines give the phonemes from the first file, and the first file entry ends with a period by itself on a line. For example:

#!MLF!#
"*/SI1972.lab"
0 1362500 sil
1362500 1950000 p


21479375 22500000 sil
.
"*/SI1823.lab"
...

In order to use the initialization programs HInit and HRest, the start time and end time of each phoneme must be specified in units of 100ns (10 million per second). In TIMIT, the start times and end times are specified in units of samples (16,000 per second), so the TIMIT PHN files need to be converted. The times shown above in 100ns increments, for example, correspond to the following sample times in file SI1972.PHN:

0 2180 h#
2180 3120 p
...

Notice that the "h#" symbol in SI1972.PHN has been changed into "sil". TIMIT phoneme labels are too specific; for example, it is impossible to distinguish "pau" (pause) from "h#" (sentence-initial silence) or from "tcl" (/t/ stop closure) on the basis of short-time acoustics alone. For this reason, when converting .PHN label files into entries in an MLF, you should also change phoneme labels as necessary in order to eliminate non-acoustic distinctions. Some possible label substitutions are pau:sil (silence), h#:sil, tcl:sil, pcl:sil, kcl:sil, bcl:vcl (voiced closure), dcl:vcl, gcl:vcl, ax-h:axh, axr:er, ix:ih, ux:uw, nx:n, hv:hh. The segments /q/ (glottal stop) and /epi/ (epenthetic stop) can be deleted entirely.
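The time conversion itself is just multiplication: one sample at 16,000 samples/second lasts 10,000,000/16,000 = 625 units of 100ns, so each time column gets multiplied by 625 (2180 samples × 625 = 1362500, matching the example above). A one-line sketch of the conversion, with the label substitution handled by a following sed stage (the substitution list here is abbreviated):

gawk '{printf("%d %d %s\n", 625*$1, 625*$2, $3)}' SI1972.PHN | sed 's/ h#$/ sil/;s/ pau$/ sil/'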

All of the conversions described above can be done using a single perl script that searches through the TIMIT/TRAIN hierarchy. Every time it finds a file that matches the pattern S[IX]\d+.PHN (note: this means it should ignore files SA1.PHN and SA2.PHN), it should add the necessary entries to the files TRAIN1.scp, TRAIN2.scp, and TRAIN.mlf, as shown above. When the program is done searching the TIMIT/TRAIN hierarchy, it should search TIMIT/TEST, creating the files TEST1.scp, TEST2.scp, and TEST.mlf.

Finally, just in case you are not sure what phoneme labels you wound up with after all of that conversion, the TRAIN.mlf file can be parsed as follows to get your phoneme set:

awk '/[\.!]/{next;}{print $3}' TRAIN.mlf | sort | uniq > monophones

The first block of awk code skips over any line containing a period or exclamation point. The second block of awk code looks at remaining lines, and prints out the third column of any such lines. The unix sort and uniq commands sort the resulting phoneme stream, and throw away duplicates.

3.4 Creating Acoustic Feature Files

Create a configuration file similar to the one on page 32 of the HTKBook. Add the modifier SOURCEFORMAT=NIST in order to tell HTK that the TIMIT waveforms are in NIST format.

I also recommend a few changes to the output features, as follows. First, compute the real energy (MFCC_E) instead of the cepstral pseudo-energy (MFCC_0). Second, set ENORMALISE to T (or just delete the ENORMALISE entry).

Third, because the TIMIT sampling rate (16kHz) is higher than the sampling rate considered in Chapter 3 (probably 8kHz, though it is never specified), you should use more mel-frequency channels, a longer lifter, and a longer cepstral feature vector. How many more? Well, the human auditory system distinguishes about 26 critical bands below 4kHz, but only about 6 more critical bands between 4kHz and 8kHz; since MFCC warps the frequency axis to imitate human hearing, you only need to increase NUMCHANS from 26 to about 32. Increasing NUMCHANS causes an increase in the pseudo-temporal resolution of the cepstral vector, so you should increase all of the parameters NUMCHANS, CEPLIFTER and NUMCEPS by about the same percentage.
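Putting these recommendations together, your HCopy configuration file might look something like this (a sketch only; the scaled-up values follow the ~23% argument above, and other reasonable choices exist):

# config for HCopy: 16kHz TIMIT waveforms -> MFCC_E features
SOURCEFORMAT = NIST
TARGETKIND = MFCC_E
TARGETRATE = 100000.0    # 10ms frame shift, in 100ns units
WINDOWSIZE = 250000.0    # 25ms analysis window
USEHAMMING = T
PREEMCOEF = 0.97
NUMCHANS = 32            # up from 26, for the 4-8kHz octave
CEPLIFTER = 27           # 22 scaled up by about the same percentage
NUMCEPS = 15             # 12 scaled up by about the same percentage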

Use HCopy to convert TIMIT waveform files into MFCC, as specified on page 33 of the HTKBook. Convert both the TRAIN and TEST corpora of TIMIT.

3.5 HMM Training

Use a text editor to create a prototype HMM with three emitting states (five states total), and with three mixtures per emitting state (see Fig. 7.3). Be sure that your mean and variance vectors contain the right number of acoustic features: three times the number of cepstral coefficients, plus three energy coefficients.


Change your configuration file: eliminate the SOURCEFORMAT specifier, and change TARGETKIND to MFCC_E_D_A. Use HCompV as specified on page 34 to create the files hmm0/proto and hmm0/vFloors. Next, use your text editor to separate hmm0/macros (as shown in Fig. 3.7) from the rest of the file hmm0/proto (the first line of hmm0/proto should now read ~h "proto").

Because your .lab files specify the start and end times of each phoneme in TIMIT, you can use HInit and HRest to initialize your HMMs before running HERest. Generally, the better you initialize an HMM, the better it will perform, so it is often a good idea to use HInit and HRest if you have relevant labeled training data. Run HInit as shown on page 120, i.e., if $phn is the name of some phoneme, type something like

mkdir hmm1;
HInit -I TRAIN.mlf -S TRAIN1.scp -H hmm0/macros -C config -T 1 -M hmm1 -l $phn hmm0/proto;
sed "s/proto/$phn/" hmm1/proto > hmm1/$phn;

Hint: once you have the lines above working for one phoneme label, put them inside a for loop to do the other phonemes.
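For example (a sketch, assuming your phoneme list from section 3.3 is in the file monophones):

mkdir hmm1;
for phn in `cat monophones`; do
  HInit -I TRAIN.mlf -S TRAIN1.scp -H hmm0/macros -C config -T 1 -M hmm1 -l $phn hmm0/proto;
  sed "s/proto/$phn/" hmm1/proto > hmm1/$phn;
done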

Re-estimate the phonemes using HRest, as shown on page 123. Again, once you have the function working for one phoneme, put it inside a for loop. HRest will iterate until the log likelihood converges (use the -T 1 option if you want to see a running tally of the log likelihood), or until it has attempted 20 training iterations in a row without convergence. If you want to allow HRest to iterate more than 20 times per phoneme (and if you have enough time), specify the -i option (I used -i 100).

Once you have used HRest, you may wish to combine all of the trained phoneme files into a single master macro file (MMF). Assuming that all of your phoneme filenames are 1-3 characters in length, and that the newest versions are in the directory hmm2, they can be combined by typing

cat hmm2/? hmm2/?? hmm2/??? > hmm2/hmmdefs

Now run the embedded re-estimation function HERest to update all of the phoneme files at once. HERest improves on HRest because it allows for the possibility that transcribed phoneme boundaries may not be precisely correct. HERest can also be used to train a recognizer even if the start and end times of individual phonemes are not known.

Unfortunately, HERest only performs one training iteration each time the program is called, so it is wise to run HERest several times in a row. Try running it ten times in a row (moving from directory hmm2 to hmm3, then hmm3 to hmm4, and so on up to hmm12). Hint: put this inside a for loop.
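A sketch of that loop (the HERest options shown follow the HTKBook tutorial; hmm2/macros and hmm2/hmmdefs are the files created above):

for i in 2 3 4 5 6 7 8 9 10 11; do
  j=`expr $i + 1`;
  mkdir hmm${j};
  HERest -I TRAIN.mlf -S TRAIN1.scp -C config -H hmm${i}/macros -H hmm${i}/hmmdefs \
    -M hmm${j} monophones;
done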

3.6 Testing

In order to use HVite and HResults to test your recognizer, you first need to create a "dictionary" and a "grammar."

For now, the “grammar” can just specify that a sentence may contain any number of phonemes:

$phone = aa | ae | ... | zh ;
( <$phone> )

Parse your grammar using HParse as specified on page 27.

The "dictionary" essentially specifies that each phoneme equals itself:

aa aa
ae ae
...

Because the dictionary is so simple, you don't need to parse it using HDMan. You can ignore all of the text associated with Fig. 3.3 in the book.

Run HVite as specified in section 3.4.1 of the book; instead of "tiedlist," you should use your own list of phonemes (perhaps you called it "monophones"). You may have to specify -C config, so that HVite knows to compute delta-cepstra and accelerations. The -p option specifies the bonus that HVite gives itself each time it inserts a new word. Start with a value of -p 0. Use -T 1 to force HVite to show you the words it is recognizing as it recognizes them. If there are too many deletions, increase -p; if there are too many insertions, decrease -p.

When you are done, use HResults to analyze the results:


HResults -I TEST.mlf monophones recout.mlf

You should get roughly 55-60% correct, and your recognition accuracy should be somewhere in the range 40-60%. These terms are defined as follows:

    \mathrm{CORRECTNESS} = 100 \times \frac{\mathrm{NREF} - \mathrm{SUBSTITUTIONS} - \mathrm{DELETIONS}}{\mathrm{NREF}}

    \mathrm{ACCURACY} = 100 \times \frac{\mathrm{NREF} - \mathrm{SUBSTITUTIONS} - \mathrm{DELETIONS} - \mathrm{INSERTIONS}}{\mathrm{NREF}}

Correctness is equal to the percentage of the reference labels (NREF) that were correctly recognized. Correctness does not penalize for insertion errors. Accuracy is a more comprehensive measure of recognizer quality, but it has many counter-intuitive properties: for example, Accuracy is not always between 0 and 100 percent. Recent papers often use the terms Precision and Recall instead, where Recall is defined to equal Correctness, and Precision is the percentage of the recognized labels that are correct, i.e.,

    \mathrm{PRECISION} = 100 \times \frac{\mathrm{NRECOGNIZED} - \mathrm{SUBSTITUTIONS} - \mathrm{INSERTIONS}}{\mathrm{NRECOGNIZED}}

    \mathrm{NRECOGNIZED} = \mathrm{NREF} - \mathrm{DELETIONS} + \mathrm{INSERTIONS}


4 Words and Triphones

In this section, you will use the TIMIT monophone HMMs trained in the previous lecture as the starting point for a clustered triphone recognizer designed to transcribe the words spoken by a talker in the BU Radio News corpus.

4.1 Readings

The HTK Book, chapters 10, 12, and sections from 14 about HBuild, HLStats, HHEd and HLEd.

4.2 Cepstral Mean Subtraction; Single-Pass Retraining

If \vec{x}_t is a log-spectral vector or a cepstral vector, the frequency response of the microphone and the room will influence only the average value of \vec{x}_t. It is possible to reduce the dependence of your recognizer on any particular microphone by subtracting the average value of \vec{x}_t, averaged over an entire utterance, before training or testing the recognizer, i.e.,

    \vec{y}_t = \vec{x}_t - \frac{1}{T} \sum_{t=1}^{T} \vec{x}_t        (13)

Equation 13 is called cepstral mean subtraction, or CMS. In HTK, CMS is implemented automatically if you append "_Z" to the feature specification. For example, you can save the features as type MFCC_E, then use a configuration file during training and testing that specifies a feature vector of type MFCC_E_D_A_Z.

HTK offers a method called "one-pass retraining" (HTKBook section 8.X) that uses models trained with one type of feature vector (for example, MFCC_E_D_A) in order to rapidly train models with a different feature vector type (for example, MFCC_E_D_A_Z). In theory, you need to have available files of both data types, but since HTK can implement CMS on the fly when opening each feature file, there is no need to regenerate the training data. Just create a script file with two columns — the "old" and "new" feature files, which in this case are the same file:

data/SI1972.MFC data/SI1972.MFC
...

Then create a configuration file with entries HPARM1 and HPARM2, as specified in section 8.X of the HTKBook, and call HERest with the -r option, exactly as specified in that section. Compare the hmmdefs files for the old and new file types. You should notice that the feature file type listed at the top of each file has changed. You should also notice that the mean vectors of each Gaussian mixture have changed a lot, but the variance vectors have not changed as much.

4.3 Dictionaries

In order to recognize words using sub-word recognition models, you need a pronunciation dictionary. Pronunciation dictionaries for talker F1A in the Radio News corpus are provided in F1A/RADIO/F1A.PRN and F1A/LABNEWS/F1ALAB.PRN.

These dictionaries contain a number of diacritics that will be useful later, but are not useful now. Use sed, awk, or perl to get rid of the characters * and — (syllable markers), and the notation "+1" or "+2" in any transcription line. In order to reduce the number of homonyms, you may also wish to convert all capital letters to lower-case (so that "Rob" and "rob" are not distinct), and also eliminate apostrophes (so that "judges" and "judges'" are not distinct). You will also wish to make a few label substitutions in order to map radio news phonemes into the TIMIT phonemes defined last week: axr becomes er, pau and h# become sil, and every stop consonant (b,d,g,p,t,k) gets split into two consecutive TIMIT-style phones: a closure followed by a stop release.

As an example, the radio news dictionaries might contain the following entry for "Birdbrain's":

Birdbrain’s b axr+1 d * b r ey n z


Assuming that your TIMIT-based phoneme set includes er but not axr, you would wish to automatically translate this entry to read

birdbrains vcl b er vcl d vcl b r ey n z

or, by adding a rule that deletes the stop release when the following segment is another consonant, you might get

birdbrains vcl b er vcl vcl b r ey n z

Notice that there is another alternative: instead of modifying the dictionary to match your HMM definitions, you could modify your HMM definitions to match the dictionary. Specifically, er could be relabeled as axr, sil could be relabeled as h#, and you could concatenate your stop closure and stop release states in order to create new stop consonant models. You could even create models of '*' and '—' with no emitting states.

Once you have converted your dictionaries, you should concatenate them together, then apply the unix utilities sort and uniq to the result, e.g., convert_dict.pl F1A/RADIO/F1A.PRN F1A/LABNEWS/F1ALAB.PRN | sort | uniq > F1A.dict. HTK utilities will not work unless the words in the dictionary are orthographically sorted (alphabetic, all capitalized words before all diminutive words).
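A fragment of such a conversion, written as a bash/sed pipeline rather than perl, might look like this (a sketch only; it handles the diacritics, case, and apostrophes described above, but not the stop-splitting rule, which is easier to write in perl):

sed 's/[*—]//g; s/+[12]//g; s/ axr/ er/g; s/ pau/ sil/g; s/ h#/ sil/g' F1A/RADIO/F1A.PRN \
  | tr -d "'" | tr 'A-Z' 'a-z' | sort | uniq > F1A.dict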

4.4 Transcriptions

Create master label files using almost the same perl script that you used for TIMIT, but with .WRD-file inputs instead of .PHN-file inputs. Also, every waveform file in RADIO NEWS is uniquely named, so you don't need to concatenate the directory and filename. The resulting master label files should look something like this, although the start times and end times are completely optional:

#!MLF!#
"*/F1AS01P1.lab"
0 2100000 a
2100000 4700000 cape
4700000 7600000 cod

In order to train the grammar, you need a word-level master label file, as shown above. In order to train the HMMs, though, you need a phoneme-level master label file. The phone-level MLF can be computed from the dictionary + word-level-MLF using the HLEd command (see section 12.8 in the HTKBook). Create a file called expand.hled that contains just one command,

EX

Then type

HLEd -d F1A.dict -l '*' -i phone_level.mlf expand.hled word_level.mlf

If HLEd fails, the most likely cause is that your master label file contains entries with times but no words. HLEd will, unfortunately, not tell you where those entries are. Try printing out all lines that have fewer than three columns using a command like

gawk 'NF<3{print}' word_level.mlf

Scan the output to make sure that you don't have phantom "words" with start times and end times but no word labels.

4.5 Creation of MFCCs

Create a two-column and a one-column script file for your training data, and the same for your test data, just as you did for TIMIT. The two-column script file will look something like:

d:/radio_news/F1A/RADIO/S01/F1AS01P1.SPH data/F1AS01P1.MFC
d:/radio_news/F1A/RADIO/S01/F1AS01P2.SPH data/F1AS01P2.MFC


You may use any subset of the data for training, and any other subset for test. I trained speaker-dependent HMMs using the F1A/RADIO directory, and tested using the F1A/LABNEWS directory. You may get better recognition results if your training and test set both include part RADIO data and part LABNEWS data.

Use HCopy to convert waveforms to MFCCs.

4.6 Bigram Grammar

Construct a list of your entire vocabulary, including both training and test sets, using

awk ’{print $1}’ F1A.dict | sort | uniq > wordlist

Seeding your grammar with words from the test set is cheating, but for datasets this small, it may be the only way to avoid huge numbers of out-of-vocabulary errors.

Given a master label file for your entire training data, the command HLStats will compute a backed-off bigram language model for you, and HBuild will convert the bigram file into a format that can be used by other HTK tools. See sections 12.4 and 12.5 in the HTKBook for examples; note that you will need to specify both the -I and -S options to HLStats.
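The pair of commands will look something like this (a sketch; wordlist is the vocabulary file built above, the other filenames are invented for illustration, and the exact options are spelled out in HTKBook sections 12.4-12.5):

HLStats -b bigram.txt -o -I word_level.mlf -S train_labels.scp wordlist
HBuild -n bigram.txt wordlist wordnet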

4.7 Monophone HMMs

If your dictionary matches the labels on your TIMIT monophone models, you should be able to use the TIMIT models now to perform recognition on the radio news corpus. Try it:

HVite -C config_recog -H timit/macros -H timit/hmmdefs -S (one-column test script) \
  -l '*' -i recout1.mlf -t 250.0 -w (HBuild output file) \
  -p 5 -s 3 F1A.dict monophones

HResults -I (test MLF) monophones recout1.mlf

The -p and -s options set the word insertion penalty and the grammar weight, respectively. These parameters are described in section 13.3 of the HTKBook. Adjusting these parameters can cause huge changes in your recognition performance; 5 might or might not be a good value.

In any case, your results will probably be pretty horrible. Radio news was recorded using different microphones than TIMIT, by different talkers. You can account for these differences by adapting the models (using HEAdapt) or by re-estimating them (using HERest) — you probably have enough data to use re-estimation instead of adaptation.

Re-estimate your models using HERest, and then run HVite again. Your results should improve somewhat, but may still be disappointing. How can you improve your results still further?

4.8 Word-Internal Triphones

In order to use word-internal triphones, you need to augment your transcriptions using a special word-boundary "phoneme" label. The sp (short pause) phoneme is intended to represent zero or more frames of silence between words.

Add sp to the end of every entry in your dictionary using awk or perl (see the sketch below). After you have added sp to the end of every entry, add another entry of the form

silence sil
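A sketch of this dictionary edit in awk (hypothetical filenames; the silence entry is appended separately so that it does not receive an sp, and the result is re-sorted for HTK):

awk '{print $0, "sp"}' F1A.dict > tmp.dict
echo "silence sil" >> tmp.dict
sort tmp.dict | uniq > F1A_sp.dict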

The "silence" model must not end with an sp.

Now you need to augment your HMM definitions, exactly as listed in sections 3.2.2 and 3.2.3 of the HTKBook. This consists of four steps. First, add sp to the end of your monophones list file. Second, edit your hmmdefs file with a text editor, in order to create the sp model by copying the middle state of the sil model. Third, use HHEd with the script given in 3.2.2. Finally, use HVite in forced-alignment mode, in order to create a new reference transcription of the training data. Be sure to use the -b silence option to add silences to the beginning and end of each transcription; otherwise your sentence will end with an sp model, and that will cause HERest to fail.


Now that you have your word-boundary marker, you are ready to create word-internal triphones. Use HLEd exactly as in section 3.3.1 of the HTKBook. Because of the small size of this database, the test set may contain triphones missing from the training data. In order to accommodate missing triphones, concatenate the monophone and triphone files, so that any missing triphones can at least be modeled using monophones:

sort monophones triphones | uniq > allphones

Finally, use HHEd as in section 3.3.1 of the HTKBook, but use the allphones list instead of the triphones list to specify your set of output phones.

Re-estimate your triphone models a few times using HERest. HERest will complain that some triphones are observed only one or two times in the training data. I guess we need a larger training database.

Test the result using HVite. The presence of monophones in your phoneme list will confuse HVite. In order to force the use of triphones whenever possible, your config file should contain the entries

FORCECXTEXP = T
ALLOWXWRDEXP = F

Your recognition performance with triphones should be better than it was with monophones.

4.9 Tied-State Triphones

Because of the sparsity of the training data, many triphone models are not well trained. The problem can be alleviated somewhat by using the same parameters in multiple recognition models. This process is called "parameter tying."

Chapter 10 of the HTKBook describes many, many different methods of parameter tying, all of which are frequently used in practical recognition systems. I suggest using the data-driven clustering method for the current exercise (section 10.4), although tree-based clustering (section 10.5) might work almost as well.

Run HERest with the -s option, in order to generate a file called stats_file. Then create an HHEd script that starts with the command RO (threshold) stats_file, where (threshold) specifies the minimum expected number of times a state should be visited in order to count for parameter tying (I used 20).

Use perl, awk, or even just bash to add commands of the following form to your HHEd script:

TC 100.0 "aaS2" {(aa,*-aa,aa+*,*-aa+*).state[2]}TC 100.0 "aaS3" {(aa,*-aa,aa+*,*-aa+*).state[3]}TC 100.0 "aaS4" {(aa,*-aa,aa+*,*-aa+*).state[4]}

You can be more general, if you like. For example, the following command would allow HTK to consider tying together the first state of aa with the last state of any phoneme that precedes aa:

TC 100.0 "aaS2" {(aa,*-aa,aa+*,*-aa+*).state[2],(*-*+aa,*+aa).state[4]}

Run HHEd in order to perform data-based tying (use the -T option to see what HHEd is doing). Use HERest to re-estimate the models a few times, then test using HVite and HResults. Your performance may still not be wonderful, but it should be better than you obtained without parameter tying.

For reference, the NIST Hub-4 and Hub-5 competitions (1997 and 1998, respectively) used large databases of broadcast news training data similar to the Radio News corpus. Typical performance of the competition systems was in the range of 30-40% word error rate (word error rate = 100 - accuracy). The winning system was trained using HTK, plus a large amount of external code.


5 Prosody

Recent papers talk about three aspects of prosody that might be modeled by a speech recognition system:

• Lexical stress: Lexically unstressed vowels may be transcribed and modeled as a type of schwa (ax, ix, or axr), or as some type of full vowel. The status of /ax/ as a distinct vowel has much empirical support. /ix/ and /ax/ may not be distinct in practical systems. It is possible to argue that /er/ is always reduced, so that /er/ and /axr/ are not really distinct.

Several studies have examined the distinction between unreduced unstressed vowels and stressed vowels. Greenberg (1999) found that stressed vowels are longer and have higher energy than unstressed vowels with the same phoneme label, but he did not control for accent placement. van Kuijk et al. (1999) compared accented stressed, unaccented stressed, unreduced unstressed, and reduced vowels, and found no acoustic difference between the middle two categories.

Consonant reduction has apparently never been studied in speech recognition.

• Pitch accent. When a pitch accent is placed on a word, it is usually placed on or near the lexically stressed syllable. The duration, energy, or spectral distinctiveness of the lexically stressed syllable may then be increased. Articulatory studies (Fougeron and Keating, 1997) indicate that consonants in accented syllables are produced more distinctively than consonants in unaccented syllables.

• Phrase boundaries. The rhyme of the final syllable of a word preceding an intermediate or intonational phrase boundary is lengthened relative to comparable phonemes in the sentence (Wightman et al., 1992). In the Switchboard database, the duration histogram of words preceding a silence or disfluency has a mode that is one standard deviation higher than the duration histograms of other words in the database. This increased duration may be accompanied by a decrease in energy, and possibly by other spectral changes.

5.1 Reading

• Wightman, Shattuck-Hufnagel, Ostendorf, and Price, "Segmental durations in the vicinity of prosodic phrase boundaries." J. Acoust. Soc. Am. 91(3):1707-1717, 1992.

• Fougeron and Keating, "Articulatory strengthening at edges of prosodic domains." J. Acoust. Soc. Am. 101:3728-3740, 1997.

• Steven Greenberg and Leah Hitchcock, "Stress-Accent and Vowel Quality in the Switchboard Corpus." NIST Large Vocabulary Continuous Speech Recognition Workshop, May 2001.

5.2 Prosody-Dependent Transcriptions

Write a perl script that reads in WRD transcription files and either TON or BRK files from one talker's data in the radio news corpus. Your script should create an HTK master label file containing either break-index-dependent or accent-dependent transcriptions. For example, a break-index-dependent transcription might append the number "4" after every word with a break index of at least 4, e.g.

#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 a
1700000 6600000 nineteen
6600000 12500000 eighteen4
12500000 15600000 state
15600000 24300000 constitutional
24300000 30400000 amendment4

An accent-dependent transcription might append an exclamation point after every word containing a pitch accent, e.g.


#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 a
1700000 6600000 nineteen!
6600000 12500000 eighteen!
12500000 15600000 state!
15600000 24300000 constitutional
24300000 30400000 amendment!

F2B has about 170 sentences transcribed with prosody, while F1A and M1B have only about 75 sentences transcribed. Choose a talker with as many sentences as possible transcribed for prosody.

Create a prosody-independent MLF by stripping out the prosodic symbols from your prosody-dependent MLF. The prosody-independent recognizer will serve as a reference model, so that you can tell whether any advantage is obtained by modeling prosody.
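The stripping can be a one-line sed command (a sketch; it assumes the prosodic mark is always the final character of a label line, and the MLF filenames are invented for illustration):

sed 's/[4!]$//' pd_word_level.mlf > pi_word_level.mlf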

5.3 Prosody-Dependent Dictionary

Write a perl script that reads in the dictionaries provided in the Radio News corpus, and produces an HTK-format dictionary with both prosody-dependent and prosody-independent versions of each word.

If you are studying break indices, the phonemes in the final syllable of each pre-boundary word should be special pre-boundary phonemes, e.g.,

abilities ax b ih l ax t iy z sp
abilities4 [abilities] ax b ih l ax t4 iy4 z4 sp

If you are studying accent, the phonemes in the lexically stressed syllable of each lexically stressed word should be special, e.g.,

abilities ax b ih l ax t iy z sp
abilities! [abilities] ax b! ih! l! ax t iy z sp

Notice that in both cases, the output of HVite should be specified to be independent of prosody, using the square-bracket notation.

Use HLEd, together with your dictionary, to create prosody-dependent and prosody-independent monophone-level MLF files.

5.4 Prosody-Dependent HMMs

Train a set of monophone models including the 'sp' model (or copy models from the last section).

Create an HHEd script that duplicates your monophone models to create prosody-dependent models. If you are studying break indices, your HHEd script might contain the line

DP "" 1 "4"

If you are studying accent, your script might contain

DP "" 1 "!"

Train both prosody-dependent and prosody-independent monophone models by running HERest 3-5 times on each set of models.

Use HLEd to split the prosody-dependent and prosody-independent monophone MLF files into triphone MLFs. Use HHEd to split the trained HMM macro files. Train both prosody-dependent and prosody-independent triphone models by running HERest 3-5 times on each model set.

Use HHEd to perform data-driven tying on the triphone models. For example, if you are studying accent, your HHEd script for tying the prosody-dependent HMMs might contain commands of the form

TC 100.0 "aaS2" {(aa,aa!,*-aa,*-aa!,aa+*,aa!+*,*-aa+*,*-aa!+*).state[2]}

Train both prosody-dependent and prosody-independent clustered triphone HMMs by running HERest 3-5 times on each model set.


5.5 Prosody-Dependent Speech Recognition

Train a prosody-dependent backoff bigram model by running HLStats on your prosody-dependent transcription, and convert the result into a wordnet using HBuild. The result is a language model that combines information about both word sequence and the sequence of stresses or phrase boundaries. For example, if you are studying accent, the trained bigram file might contain entries of the following form. As shown in these examples, you may find that accented words are more likely to follow unaccented words, and vice versa.

-2.5682 a state
-2.0911 a state!
-1.3424 boston! city
-1.8195 boston! city!

Train a prosody-independent backoff bigram model in the usual way.

You should now have two language models (prosody-dependent and prosody-independent) and six sets of HMMs (PD and PI versions of monophone, triphone, and clustered triphone models). Run HVite six times.

After running HVite using prosody-dependent models, be sure to check the output transcription. Because of the way the dictionary was defined, prosodic markings should not show up in the output transcription. Use HResults to compare all six recognition transcripts with the true prosody-independent transcription. Does the recognizer's knowledge of prosody help it to achieve better word recognition accuracy?

Here is a different experiment that you can run using the same models: try to determine how well the recognition models track just the prosody of the utterance. Create another dictionary with output symbols set to show just the prosody, and not the word content, e.g.

abilities [0] ax b ih l ax t iy z sp
abilities! [!] ax b! ih! l! ax t iy z sp

Process your prosody-dependent MLF in order to show the same accented/unaccented distinction, e.g.

#!MLF!#
"*/F2BS01P1.lab"
0 0 endsil
0 1700000 0
1700000 6600000 !
6600000 12500000 !
12500000 15600000 !
15600000 24300000 0
24300000 30400000 !

Run HVite again using the new dictionary, then run HResults using the new transcription file. How well does your recognizer identify pitch accent placement? (Recognition results for break index may be better than recognition results for pitch accent placement, just because pitch is not part of the acoustic feature vector...)
