K. Marasek, 05.07.2005, Multimedia Department
TRANSCRIPT

Acoustic modeling
Hidden Markov Models
Acoustic models (AMs) are stochastic models used together with language models and other models to make decisions based on incomplete or uncertain knowledge. Given a sequence of feature vectors X extracted from the speech signal by a front end, the purpose of the AM is to compute the probability that a particular linguistic event (word, sentence, etc.) has generated the sequence.
AMs have to be flexible, accurate and efficient -> HMMs are! They are also easy to train.
First used by Markov (1913) to analyse letter sequences in a text; an efficient training method was proposed by Baum et al. (1960s); application to ASR: Jelinek (1975); applications also in other areas: pattern recognition, linguistic analysis (stochastic parsing).
HMM Theory
An HMM can be defined as a pair of discrete-time stochastic processes (I, X). The process I takes values from a finite set I, whose elements are called the states of the model, while X takes values in a space X, called the observation space, which can be either discrete or continuous depending on the nature of the data sequences to be modeled.
The processes satisfy the following relations, in which the right-hand probabilities are independent of the time t:
- the history before time t has no influence on the future evolution of the process if the present state is specified;
- neither the evolution of I nor past observations influence the present observation if the last two states are specified; output probabilities at time t are conditioned on the states of I at times t-1 and t, i.e. on the transition at time t.
The random variables of the process X represent the variability of the realization of the acoustic events, while the process I models the various possibilities in the temporal concatenation of these events.
First-order Markov hypothesis:
$$\Pr(I_t = i_t \mid I_0^{t-1} = i_0^{t-1}) = \Pr(I_t = i_t \mid I_{t-1} = i_{t-1})$$

Output independence hypothesis:
$$\Pr(X_t = x_t \mid X_0^{t-1} = x_0^{t-1},\ I_0^{t} = i_0^{t}) = \Pr(X_t = x_t \mid I_{t-1} = i_{t-1},\ I_t = i_t)$$
HMM Theory 2
Properties of HMMs:
For 0 <= s <= t <= T and h > 0, the probability of every finite sequence X_1^T of observable random variables can be decomposed as shown below. From this it follows that an HMM can be defined by specifying the parameter set V = (p, A, B), where
- p_i = Pr(I_0 = i) is the initial state density,
- a_ij = Pr(I_t = j | I_{t-1} = i) is the transition probability matrix A,
- b_ij(x) = Pr(X_t = x | I_{t-1} = i, I_t = j) is the output density matrix B.
The parameters satisfy the relations below; the model parameters are thus sufficient for computing the probability of a sequence of observations (but usually a faster formula is used).
For $0 \le s \le t \le T$ and $h > 0$ the hypotheses imply
$$\Pr(i_t \mid i_{t-h}^{t-1}) = \Pr(i_t \mid i_{t-1}), \qquad \Pr(x_t \mid x_s^{t-1},\ i_{t-h}^{t}) = \Pr(x_t \mid i_{t-1}, i_t)$$

so the probability of every finite sequence decomposes as
$$\Pr(x_1^T) = \sum_{i_0^T \in I^{T+1}} \Pr(i_0) \prod_{t=1}^{T} \Pr(i_t \mid i_{t-1})\, \Pr(x_t \mid i_{t-1}, i_t)$$

The parameters satisfy
$$p_i \in [0,1],\ \sum_{i \in I} p_i = 1; \qquad a_{ij} \in [0,1],\ \sum_{j \in I} a_{ij} = 1$$

and the probability of a sequence of observations becomes
$$\Pr(x_1^T) = \sum_{i_0^T} p_{i_0} \prod_{t=1}^{T} a_{i_{t-1} i_t}\, b_{i_{t-1} i_t}(x_t)$$
HMMs as probabilistic automata
Nodes of the graph correspond to states of the Markov chain, while directed arcs correspond to allowed transitions a_ij.
A sequence of observations is regarded as an emission of the system, which at each time instant makes a transition from one node to another, randomly chosen according to a node-specific probability density, and generates a random vector according to an arc-specific probability density. The number of states and the set of arcs is usually called the model topology.
In ASR it is common to have left-to-right topologies, in which a_ij = 0 for j < i.
Usually the first and last states are non-emitting, i.e. the source and final states serve to set initial and final probabilities; empty transitions need a slight modification of the algorithms.
[Figure: typical phone-model HMM topology]
An HMM is described by:
Observation sequence O = O1...OT
State sequence Q = q1...qT
HMM states S = {S1, ..., SN}
Symbols (emissions) V = {v1, ..., vM}
Transition probabilities A
Emission probabilities B
Initial state density π
Parameter set λ = (A, B, π)
$$A = \begin{bmatrix} a_{11} & \cdots & a_{1N} \\ \vdots & \ddots & \vdots \\ a_{N1} & \cdots & a_{NN} \end{bmatrix}, \qquad a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i]$$

$$B = \{b_j(k)\},\ j = 1, \dots, N,\ k = 1, \dots, M; \qquad b_j(k) = P[O_t = v_k \mid q_t = S_j]$$

$$\pi = \{\pi_i\},\ i = 1, \dots, N; \qquad \pi_i = P[q_1 = S_i]$$
Example: weather
States: S1 = rain, S2 = clouds, S3 = sunny, observed only through the temperature T (the density function is different for different states). What is the probability of the observed temperature sequence (20, 22, 18, 15)? Which sequence of states (rain, clouds, sunny) is the most probable?
O = {O1 = 20°, O2 = 22°, O3 = 18°, O4 = 15°} and the start is sunny.
The following state sequences are possible:
Q1 = {q1=S3, q2=S3, q3=S3, q4=S1},
Q2 = {q1=S3, q2=S3, q3=S3, q4=S3},
Q3 = {q1=S3, q2=S3, q3=S1, q4=S1},
etc.
[Figure: three-state model s1, s2, s3 with transition arcs a_ij]

Emission probabilities B (densities evaluated per temperature):
b1(T=10) = P(T=10 | q_t = S1)
b2(T=10) = P(T=10 | q_t = S2)
b3(T=10) = P(T=10 | q_t = S3)
b1(T=11), b2(T=11), b3(T=11)
...
b1(T=40), b2(T=40), b3(T=40)

Transition probabilities A:
a11 = 0.4   a12 = 0.3   a13 = 0.3
a21 = 0.2   a22 = 0.6   a23 = 0.2
a31 = 0.1   a32 = 0.1   a33 = 0.8
Weather II
For each state sequence the conditional probability, which depends on the observation sequence, can be found, assuming O = {O1 = 20°, O2 = 22°, O3 = 18°, O4 = 15°} and that the start is sunny:
Q1={q1=S3, q2=S3,q3=S3,q4=S1},
Q2={q1=S3, q2=S3,q3=S3,q4=S3},
Q3={q1=S3, q2=S3,q3=S1,q4=S1},
etc.
$$P(O, Q_1 \mid \lambda) = b_3(O_1{=}20)\, a_{33}\, b_3(22)\, a_{33}\, b_3(18)\, a_{31}\, b_1(15)$$
$$P(O, Q_2 \mid \lambda) = b_3(O_1{=}20)\, a_{33}\, b_3(22)\, a_{33}\, b_3(18)\, a_{33}\, b_3(15)$$
$$P(O, Q_3 \mid \lambda) = b_3(O_1{=}20)\, a_{33}\, b_3(22)\, a_{31}\, b_1(18)\, a_{11}\, b_1(15)$$

Generally, the observed temperature sequence O can be generated by many state sequences, which are not observable. The probability of the temperature sequence given the model is
$$P(O \mid \lambda) = \sum_{\text{all state sequences } k} P(O, Q_k \mid \lambda)$$
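As a sketch, this sum over state sequences can be computed by brute-force enumeration. The transition matrix below is the one from the slide; the emission values b_i(O_t) are hypothetical placeholders, since the lecture leaves the temperature densities unspecified.

```python
import itertools

# Transition matrix A from the slide: states 0 = rain, 1 = clouds, 2 = sunny
A = [[0.4, 0.3, 0.3],
     [0.2, 0.6, 0.2],
     [0.1, 0.1, 0.8]]

# Hypothetical values of the emission densities b_i(O_t) for the observed
# temperatures (20, 22, 18, 15); B[i][t] is b_i evaluated at observation t.
B = [[0.1, 0.05, 0.3, 0.5],   # rain
     [0.3, 0.25, 0.4, 0.3],   # clouds
     [0.6, 0.70, 0.3, 0.2]]   # sunny

def brute_force_prob(A, B, T, start_state):
    """P(O | model) = sum over all T-long state sequences of P(O, Q | model),
    with the first state fixed (the example starts in 'sunny')."""
    total = 0.0
    for q in itertools.product(range(len(A)), repeat=T):
        if q[0] != start_state:
            continue
        p = B[q[0]][0]
        for t in range(1, T):
            p *= A[q[t-1]][q[t]] * B[q[t]][t]
        total += p
    return total

print(brute_force_prob(A, B, 4, start_state=2))
```

This direct enumeration visits N^(T-1) sequences and is only feasible for toy examples; the trellis on the next slides avoids the exponential cost.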
Trellis for weather
[Figure: trellis for the weather model over observations O1..O5; at each node the incoming scores weighted by a_ij are summed and multiplied by b_j(O_t)]

Forward probability:
$$\alpha_t(i) = P[O_1, \dots, O_t,\ q_t = S_i \mid \lambda], \qquad i = 1, \dots, N;\ t = 1, \dots, T$$

1. Init:
$$\alpha_1(i) = \pi_i\, b_i(O_1), \qquad 1 \le i \le N$$
2. Iterations:
$$\alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(O_{t+1}), \qquad 1 \le t \le T-1,\ 1 \le j \le N$$
3. Final step:
$$P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)$$
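The three-step recursion above can be sketched as follows (the toy model numbers are hypothetical, not from the lecture):

```python
def forward(pi, A, B, obs):
    """Forward algorithm for a discrete HMM.
    alpha_t(i) = P(O_1..O_t, q_t = S_i | model); returns P(O | model).
    B[i][k] is the emission probability of symbol k in state i."""
    N = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]            # 1. init
    for o in obs[1:]:                                           # 2. iterations
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    return sum(alpha)                                           # 3. final step

# toy two-state model with hypothetical numbers
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
print(forward(pi, A, B, [0, 1, 0]))
```

The cost is O(N^2 T) instead of the exponential cost of enumerating state sequences.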
Left-right models: SR
For speech recognition: left-to-right models. Find the best path in the trellis: instead of summation, take the max; the resulting path gives the best sequence of states, i.e. the most probable state sequence.
An additional pointer array is needed to store the best paths.
Backpointers show the optimal path.
1. Init:
$$V_1(i) = \pi_i\, b_i(O_1), \qquad \pi_1 = 1,\ \pi_i = 0 \text{ for } i > 1 \text{ (left-to-right model)}$$
2. Iterations:
$$V_t(j) = \max_{1 \le i \le N} \big[V_{t-1}(i)\, a_{ij}\big]\, b_j(O_t), \qquad 2 \le t \le T,\ 1 \le j \le N$$
with backpointers $\psi_t(j) = \arg\max_{1 \le i \le N} V_{t-1}(i)\, a_{ij}$
3. Final:
$$\hat{P}(O \mid \lambda) = \max_{1 \le i \le N} V_T(i)$$
Training: estimation of model parameters
Count the frequency of occurrences to estimate b_j(k) and the transition probabilities.
Assumption: we can observe the states of the HMM, which is not always possible; the solution is forward-backward training.
$$b_j(k) = P[O_t = v_k \mid q_t = S_j]$$
For the weather example:
$$b_{\text{sunny}}(17) = P[O_t = 17^\circ \mid q_t = S_{\text{sunny}}] = \frac{\text{number of sunny days with } O_t = 17^\circ}{\text{number of sunny days}}$$
$$a_{ij} = P[q_t = S_j \mid q_{t-1} = S_i] = \frac{\text{number of transitions from } i \text{ to } j}{\text{number of transitions from } i}$$
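A minimal sketch of this counting estimator, assuming fully observable state sequences (the function name and data below are illustrative, not from the lecture):

```python
from collections import Counter

def estimate_supervised(state_seqs, obs_seqs, N, M):
    """Relative-frequency (ML) estimates when state sequences are observable:
    a_ij = #(transitions i->j) / #(transitions from i),
    b_j(k) = #(symbol k emitted in j) / #(tacts in j)."""
    trans, emit = Counter(), Counter()
    for states, obs in zip(state_seqs, obs_seqs):
        for t in range(len(states) - 1):
            trans[(states[t], states[t + 1])] += 1
        for s, o in zip(states, obs):
            emit[(s, o)] += 1
    A = [[trans[(i, j)] / max(1, sum(trans[(i, jj)] for jj in range(N)))
          for j in range(N)] for i in range(N)]
    B = [[emit[(j, k)] / max(1, sum(emit[(j, kk)] for kk in range(M)))
          for k in range(M)] for j in range(N)]
    return A, B

A, B = estimate_supervised([[0, 0, 1, 1]], [[0, 1, 1, 1]], N=2, M=2)
```

When the states cannot be observed, these counts are replaced by expected counts, which is exactly what forward-backward training computes.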
Forward/Backward training
We cannot observe state sequences, but we can compute expected values depending on the model parameters; this gives an iterative estimation (new parameters are marked with a bar):
$$\bar{a}_{ij} = P[q_t = S_j \mid q_{t-1} = S_i] = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$
$$\bar{b}_j(k) = P[O_t = v_k \mid q_t = S_j] = \frac{\text{expected number of symbols } v_k \text{ observed in state } j}{\text{expected number of tacts in state } j}$$
Forward-Backward 2
[Figure: trellis fragment showing α_t(i) at state S_i, the arc weight a_ij b_j(O_{t+1}), and β_{t+1}(j) at state S_j]

Forward probability:
$$\alpha_t(i) = P[O_1, O_2, \dots, O_t,\ q_t = S_i \mid \lambda]$$

Backward probability, i.e. the probability of $O_{t+1} \dots O_T$ given that the process is in state $S_i$ at time t:
$$\beta_t(i) = P[O_{t+1}, O_{t+2}, \dots, O_T \mid q_t = S_i, \lambda]$$

Iterative computation of the backward probability:
1. Initialization:
$$\beta_T(i) = 1, \qquad 1 \le i \le N$$
2. Iteration:
$$\beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j), \qquad t = T{-}1, \dots, 1;\ 1 \le i \le N$$
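The backward recursion can be sketched as follows; the final line checks the standard identity P(O) = Σ_i π_i b_i(O_1) β_1(i). The toy parameters are hypothetical.

```python
def backward(A, B, obs):
    """Backward probabilities for a discrete HMM.
    beta_t(i) = P(O_{t+1}..O_T | q_t = S_i); returns beta_1(i)."""
    N = len(A)
    beta = [1.0] * N                      # 1. initialization: beta_T(i) = 1
    for o in reversed(obs[1:]):           # 2. iterate t = T-1 down to 1
        beta = [sum(A[i][j] * B[j][o] * beta[j] for j in range(N))
                for i in range(N)]
    return beta

# toy two-state model with hypothetical numbers
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 0]
beta1 = backward(A, B, obs)
# consistency check: P(O) = sum_i pi_i * b_i(O_1) * beta_1(i)
print(sum(pi[i] * B[i][obs[0]] * beta1[i] for i in range(2)))
```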
FB 3
Now compute the probability that the model at tact t is in state S_i.
The formula gives the probability of being in state i at tact t, but we additionally need the expected number of tacts spent in state i and the expected number of transitions.
For ergodic processes (independent of time), assume a sequence X = x1, x2, ..., x_i, ..., x_T with only discrete values (e.g. {a, b, c}).
$$\gamma_t(i) = P[q_t = S_i \mid O, \lambda] = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j=1}^{N} \alpha_t(j)\,\beta_t(j)} = \frac{P[O,\ q_t = S_i \mid \lambda]}{P[O \mid \lambda]}$$

since
$$\alpha_t(i)\,\beta_t(i) = P[O_1, \dots, O_t,\ q_t = S_i \mid \lambda]\; P[O_{t+1}, \dots, O_T \mid q_t = S_i, \lambda]$$

For counting, define the indicator function $g_k(x_i) = 1$ if $x_i = v_k$ and $g_k(x_i) = 0$ otherwise; then
$$E\Big[\sum_{i=1}^{T} g_k(x_i)\Big] = \sum_{i=1}^{T} P[x_i = v_k]$$
i.e. the expected number of occurrences of the value $v_k$ in the sequence.
FB 4
The probability that two known states are visited in two consecutive tacts:
$$\xi_t(i,j) = P[q_t = S_i,\ q_{t+1} = S_j \mid O, \lambda] = \frac{\alpha_t(i)\, a_{ij}\, b_j(O_{t+1})\, \beta_{t+1}(j)}{P[O \mid \lambda]}$$

Re-estimation:
$$\bar{a}_{ij} = P[q_t = S_j \mid q_{t-1} = S_i] = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \gamma_t(i)} = \frac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$$

$$\bar{b}_j(k) = P[O_t = v_k \mid q_t = S_j] = \frac{\sum_{t:\, O_t = v_k} \gamma_t(j)}{\sum_{t=1}^{T} \gamma_t(j)} = \frac{\text{expected number of symbols } v_k \text{ observed in state } j}{\text{expected number of tacts in state } j}$$
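A compact sketch of how the γ and ξ quantities can be computed from full α/β tables of a discrete HMM (the toy parameters are hypothetical):

```python
def gammas_xis(pi, A, B, obs):
    """Compute gamma_t(i) = P(q_t=i | O) and xi_t(i,j) = P(q_t=i, q_{t+1}=j | O)
    from full forward (alpha) and backward (beta) tables."""
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    beta = [[1.0] * N for _ in range(T)]
    for i in range(N):                                   # forward init
        alpha[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                                # forward pass
        for j in range(N):
            alpha[t][j] = sum(alpha[t-1][i] * A[i][j] for i in range(N)) * B[j][obs[t]]
    for t in range(T - 2, -1, -1):                       # backward pass
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t+1]] * beta[t+1][j] for j in range(N))
    PO = sum(alpha[T-1][i] for i in range(N))            # P(O | model)
    gamma = [[alpha[t][i] * beta[t][i] / PO for i in range(N)] for t in range(T)]
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t+1]] * beta[t+1][j] / PO
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    return gamma, xi

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
gamma, xi = gammas_xis(pi, A, B, [0, 1, 0])
```

Summing the γ and ξ tables over time gives exactly the expected counts used in the re-estimation formulas above.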
FB 5
The estimation procedure is applied iteratively, so that $P(O \mid \bar{\lambda}) \ge P(O \mid \lambda)$. The previous equations assume a single observation sequence; what about multiple sequences? Let O = {O^(1), O^(2), ..., O^(M)} be the training examples; the {O^(m)} are statistically independent. This can be handled by introducing a fictive observation in which all observations are concatenated together; then we have the re-estimation formulas below.
We are looking for parameters maximizing
$$P = \prod_{m=1}^{M} p(O^{(m)} \mid \lambda)$$

The re-estimation sums the expected counts over all training sequences, each weighted by $1/P_m$ with $P_m = P(O^{(m)} \mid \lambda)$:
$$\bar{a}_{ij} = \frac{\sum_{m=1}^{M} \frac{1}{P_m} \sum_{t=1}^{T_m - 1} \alpha_t^{(m)}(i)\, a_{ij}\, b_j(O_{t+1}^{(m)})\, \beta_{t+1}^{(m)}(j)}{\sum_{m=1}^{M} \frac{1}{P_m} \sum_{t=1}^{T_m - 1} \alpha_t^{(m)}(i)\, \beta_t^{(m)}(i)}$$
HMM: forward and backward coefficients
Additional probabilities have been defined to save computational load, serving two goals: model parameter estimation and decoding (search, recognition).
The forward probability is the probability that X emits the partial sequence x_1^t and the process I is in state i at time t; it can be computed iteratively.
The backward probability is the probability that X emits the partial sequence x_{t+1}^T given that the process I is in state i at time t.
The best-path probability is the maximum joint probability of the partial sequence x_1^t and a state sequence ending in state i at time t.
$$\alpha(t, i) = \begin{cases} p_i & t = 0 \\ \Pr(x_1^t,\ I_t = i) & 0 < t \le T \end{cases}
\qquad
\alpha(t, j) = \sum_{i \in I} \alpha(t-1, i)\, a_{ij}\, b_{ij}(x_t),\ \ 0 < t \le T$$

$$\beta(t, i) = \begin{cases} 1 & t = T \\ \Pr(x_{t+1}^T \mid I_t = i) & 0 \le t < T \end{cases}
\qquad
\beta(t, i) = \sum_{j \in I} a_{ij}\, b_{ij}(x_{t+1})\, \beta(t+1, j),\ \ 0 \le t < T$$

$$v(t, i) = \begin{cases} p_i & t = 0 \\ \max_{i_0^{t-1}} \Pr(x_1^t,\ I_0^{t-1} = i_0^{t-1},\ I_t = i) & t > 0 \end{cases}
\qquad
v(t, j) = \max_{i \in I} v(t-1, i)\, a_{ij}\, b_{ij}(x_t),\ \ t > 0$$
Total probability: Trellis
The total probability of an observation sequence can be computed as shown below, or using v, which measures the probability along the path giving the highest contribution to the summation.
These algorithms have a time complexity O(M·T), where M is the number of transitions with non-zero probability (it depends on the number of states N in the system) and T is the length of the input sequence.
The computation of these probabilities is performed in a data structure called a trellis, which corresponds to an unfolding of the time axis of the graph structure.
$$\Pr(x_1^T) = \sum_{i \in I} \alpha(T, i) = \sum_{i \in I} p_i\, \beta(0, i) = \sum_{i \in I} \alpha(t, i)\, \beta(t, i) \ \text{ for any } t$$
$$\hat{\Pr}(x_1^T) = \max_{i \in I} v(T, i)$$

[Figure: trellis. Dashed arrows mark paths whose scores are added to obtain a probability (dotted for β); v corresponds to the highest-scoring path among the dashed ones.]
Trellis 2
Nodes in the trellis are pairs (t, i), with t the time index and i the model state; arcs represent model transitions composing possible paths in the model. For a given observation x_1^T, each arc (t-1, i) -> (t, j) carries a "weight" given by a_ij b_ij(x_t).
Each path can then be assigned a score corresponding to the product of the weights of the arcs traversed by the path. This score is the probability of emission of the observed sequence along the path, given the current set of model parameters.
The recurrent computation of α, β and v corresponds to an appropriate combination, at each trellis node, of the scores of paths ending or starting at that node.
The computation proceeds in a column-wise manner, synchronously with the appearance of observations. At every frame the scores of the nodes in a column are updated using recursion formulas which involve the values of an adjacent column, the transition probabilities of the models and the values of the output densities for the current observation.
For α and v the computation starts from the left column, whose values are initialized by p, and ends at the outermost right column, where the final value is computed. For β the computations go in the opposite direction.
Output probabilities
If the observation sequences are composed of symbols drawn from a finite alphabet of O symbols, then a density is a real-valued vector [b(x)], x = 1..O, with a probability entry for each possible symbol, under the constraints below.
Observations may also be composed of tuples of symbols, usually considered to be mutually statistically independent. Then the output density can be represented by the product of Q independent densities. Such models are called discrete HMMs.
Discrete HMMs are simpler (only an array access is needed to find b(x)) but imprecise; current implementations rather use continuous densities.
To reduce memory requirements, parametric representations are used. The most popular choice is the multivariate Gaussian density below, where D is the dimension of the vector space (the length of a feature vector). The parameters of a Gaussian density are the mean vector μ (location parameter) and the symmetric covariance matrix Σ (spread of values around μ).
Gaussians are widespread in statistics and their parameters are easy to estimate.
$$b(x) \ge 0,\ x = 1, \dots, O; \qquad \sum_{x=1}^{O} b(x) = 1$$
$$b(x) = \prod_{h=1}^{Q} b_h(x_h)$$
$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D \det(\Sigma)}}\; e^{-\frac{1}{2}(x-\mu)^{*}\, \Sigma^{-1} (x-\mu)}$$
Forward algorithm
To calculate the probability (likelihood) P(X|λ) of the observation sequence X = (X1, X2, ..., XT) given the HMM, the most intuitive way is to sum the probabilities of all state sequences.
In other words, we first enumerate all possible state sequences S of length T that generate the observation X and sum all their probabilities. The probability of each path S is the product of the state-sequence probability and the joint output probability along the path, using the output-independence assumption.
So finally we get the sum below. We enumerate all possible state sequences of length T+1; for any given state sequence we go through all transitions and states in the sequence until we reach the last transition. This requires the generation of O(N^T) state sequences, i.e. exponential computational complexity.
$$P(X \mid S, \lambda) = \prod_{t=1}^{T} P(X_t \mid s_t, \lambda) = b_{s_1}(X_1)\, b_{s_2}(X_2) \cdots b_{s_T}(X_T)$$
$$P(X \mid \lambda) = \sum_{\text{all } S} P(S \mid \lambda)\, P(X \mid S, \lambda)$$
$$P(X \mid \lambda) = \sum_{\text{all } S} a_{s_0 s_1} b_{s_1}(X_1)\, a_{s_1 s_2} b_{s_2}(X_2) \cdots a_{s_{T-1} s_T} b_{s_T}(X_T)$$
Forward algorithm II: Trellis
Based on the HMM assumptions that P(s_t | s_1^{t-1}, λ) and P(X_t | s_t, λ) involve only s_{t-1} and s_t, P(X|λ) can be computed recursively using the so-called forward probability α_t(i) = P(X_1^t, s_t = i | λ), denoting the partial probability that the HMM is in state i having generated the partial observation X_1^t (i.e. X1..Xt).
This can be illustrated by the trellis: an arrow is a transition from state to state, and the number within a circle denotes α_t(i). We start the cells from t = 0 with initial probabilities; the other cells are computed time-synchronously from left to right, where each column is completely computed before proceeding to time t+1. When the states in the last column have been computed, the sum of all probabilities in the final column is the probability of generating the observation sequence.
Gaussians
Disadvantage: Gaussian densities are unimodal; to overcome this, Gaussian mixtures are used (weighted sums of Gaussians).
Mixtures are capable of approximating other densities given an appropriate number of components.
A D-dimensional Gaussian mixture with K components can be described using K[1 + D + D(D+1)/2] real numbers (for D = 39 and K = 20 this is 16400 real numbers).
Further reduction: diagonal covariance matrix (components mutually independent). The joint density is then the product of one-dimensional Gaussian densities corresponding to the individual vector elements: 2D parameters per component.
Diagonal-covariance Gaussians are widely used in ASR. To reduce the number of Gaussians, distribution tying or sharing is often used: imposing that different transitions of different models share the same output density. The tying scheme exploits a priori knowledge, e.g. sharing densities among allophones or sound classes (described in detail further on).
There have been attempts to use other densities known from the literature (Laplacians, lambda densities for duration), but Gaussians dominate.
$$M(x) = \sum_{k=1}^{K} w_k\, N(x; \mu_k, \Sigma_k)$$
$$N(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D \prod_{h=1}^{D} \sigma_h^2}}\; e^{-\frac{1}{2} \sum_{h=1}^{D} \frac{(x_h - \mu_h)^2}{\sigma_h^2}} \quad \text{(diagonal covariance)}$$
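A minimal sketch of a diagonal-covariance Gaussian mixture evaluated in the log domain (all parameter values below are hypothetical):

```python
import math

def log_gauss_diag(x, mean, var):
    """log N(x; mu, diag(var)): sum of independent 1-D Gaussian log-densities,
    i.e. the diagonal-covariance case described above (2D parameters)."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def log_mixture(x, weights, means, variances):
    """log M(x) for a K-component diagonal Gaussian mixture, using the
    log-sum-exp trick to stay numerically stable."""
    logs = [math.log(w) + log_gauss_diag(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    mx = max(logs)
    return mx + math.log(sum(math.exp(l - mx) for l in logs))

# toy 2-D, 2-component mixture (hypothetical parameters)
print(log_mixture([0.0, 0.0],
                  [0.5, 0.5],
                  [[0.0, 0.0], [1.0, 1.0]],
                  [[1.0, 1.0], [2.0, 2.0]]))
```

Working with log-densities avoids the underflow that raw products of small likelihoods would cause, which matters for D = 39 feature vectors.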
HMM composition
Probabilistic decoding: ASR with stochastic models means choosing, from the set of possible linguistic events, the one that corresponds to the observed data with the highest probability.
In ASR an observation often does not correspond to the utterance of a single word, but to a sequence of words. If the language has a limited set of sentences, then it is possible to have a model for each utterance, but what if the number is unlimited? Too many models are also not easy to handle (how to train them?), and it would be impossible to recognize items not observed in the training material.
Solution: concatenation of units from a list of manageable size, which also makes training tractable.
[Figure: model linking]
DTW
Warp two speech templates x1..xN and y1..yM with minimal distortion: to find the optimal path between the starting point (1,1) and the end point (N,M) we need to compute the optimal accumulated distance D(N,M) based on the local distances d(i,j). Since the optimal path must be built on the optimal path of the previous step, the minimum distance must satisfy the equation
$$D(i,j) = d(i,j) + \min\big[D(i-1,j),\ D(i-1,j-1),\ D(i,j-1)\big]$$
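The accumulated-distance recursion can be sketched as follows (the local distance function is left as a parameter; the example data is illustrative):

```python
def dtw(x, y, d):
    """Accumulated distance D(N, M) with the classic DTW recursion
    D(i,j) = d(i,j) + min(D(i-1,j), D(i-1,j-1), D(i,j-1))."""
    INF = float("inf")
    N, M = len(x), len(y)
    D = [[INF] * (M + 1) for _ in range(N + 1)]
    D[0][0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            D[i][j] = d(x[i-1], y[j-1]) + min(D[i-1][j], D[i-1][j-1], D[i][j-1])
    return D[N][M]

# toy scalar "templates" with an absolute-difference local distance
print(dtw([1, 2, 3], [1, 2, 2, 3], lambda a, b: abs(a - b)))
```

In real DTW-based recognition, x and y are sequences of feature vectors and d is typically a Euclidean distance between frames.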
Dynamic programming algorithm
We need to consider and keep only the best move for each pair; although there are M possible moves, DTW can be computed recursively.
We can identify the optimal match y_j with respect to x_i and save the index in a backpointer table B(i,j).
Viterbi algorithm
We are looking for the state sequence S = (s1, s2, ..., sT) that maximizes P(S, X | λ). The computation is very similar to the dynamic programming used for the forward probabilities: instead of summing up probabilities from different paths coming to the same destination state, the Viterbi algorithm picks and remembers the best path. The best-path probability is defined as
$$V_t(i) = \max_{s_1^{t-1}} P(X_1^t,\ s_1^{t-1},\ s_t = i \mid \lambda)$$
V_t(i) is the probability of the most likely state sequence at time t which has generated the observations X up to time t and ends in state i.
Viterbi algorithm
The algorithm for computing the v probabilities is an application of dynamic programming for finding the best-scoring path in a directed graph with weighted arcs, i.e. in the trellis. It is one of the most important algorithms in current computer science and uses a recursive formula.
When the whole observation sequence x_1^T has been processed, the score of the best path can be found by computing the maximum below. The identity of the states can be obtained using backpointers φ(t, i); this allows finding the optimal state sequence, which constitutes a time alignment of the input speech frames and allows locating occurrences of SR units (phonemes).
Construction of the recognition model: first the recognition language is represented as a network of words ('finite-state automata'). The connections between words are empty transitions, but they can have probabilities assigned (LM, n-gram). Each word is replaced by a sequence (or network) of phonemes according to lexical rules. Phonemes are replaced by instances of the appropriate HMMs. Special labels are assigned to word-ending states (this simplifies retrieving the word sequence).
$$v(t, j) = \begin{cases} p_j & t = 0 \\ \max_{i \in I} v(t-1, i)\, a_{ij}\, b_{ij}(x_t) & t > 0 \end{cases}$$
$$\hat{\Pr}(x_1^T) = \max_{i \in I} v(T, i)$$
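A sketch of the Viterbi recursion with backpointers (state-observation notation as in the earlier slides; the toy parameters are hypothetical):

```python
def viterbi(pi, A, B, obs):
    """Best state sequence for a discrete HMM:
    V_t(j) = max_i V_{t-1}(i) * a_ij * b_j(O_t),
    with backpointers stored to recover the optimal path."""
    N = len(pi)
    V = [pi[i] * B[i][obs[0]] for i in range(N)]        # init
    back = []
    for o in obs[1:]:                                   # iterations
        prev = V
        V, ptr = [], []
        for j in range(N):
            i_best = max(range(N), key=lambda i: prev[i] * A[i][j])
            V.append(prev[i_best] * A[i_best][j] * B[j][o])
            ptr.append(i_best)
        back.append(ptr)
    path = [max(range(N), key=lambda i: V[i])]          # final: best end state
    for ptr in reversed(back):                          # follow backpointers
        path.append(ptr[path[-1]])
    path.reverse()
    return max(V), path

pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
p, path = viterbi(pi, A, B, [0, 1, 0])
```

The returned path is the time alignment mentioned above: each entry tells which state (and hence which SR unit) the frame was assigned to.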
Viterbi movie
The Viterbi algorithm is an efficient way to find the shortest route through a type of graph called a trellis. The algorithm uses a technique called 'forward dynamic programming', which relies on the property that the cost of a path can be expressed as a sum of incremental or transition costs between nodes adjacent in time in the trellis. The demo shows the evolution of the Viterbi algorithm over 6 time instants (states on the vertical axis). At each time the shortest path to each node at the next time instant is determined. Paths that do not survive to the next time instant are deleted. By time k+2, the shortest path (track) to time k has been determined unambiguously. This is called 'merging of paths'.
Compound model
Viterbi pseudocode
Note the use of a stack.
Model choice in ASR
Identification of basic units is complicated by various natural-language effects: reduction, different pronunciation depending on context, etc. Thus phonemes are sometimes not an appropriate representation.
Better: use context-dependent units (allophones). Triphones: the context is made up of the previous and following phonemes (monophones). Phoneme models can have a left-to-right topology with 3 groups of states: onset, body and coda.
Note the huge number of possible triphones: 40^3 = 64000 models! Of course not all occur, due to phonotactic rules, but: how to train them? How to manage them?
Other attempts: half-syllables, diphones, microsegments, etc., but all these methods of unit selection are based on a priori phonetic knowledge.
A totally different approach: automatic unsupervised clustering of frames. The corresponding centroids are taken as starting distributions for a set of basic simple units called fenones. A maximum-likelihood decoding of utterances in terms of fenones is generated (dictionary), fenones are then combined to build word models, and the models are then trained.
The shared parameter (i.e., the output distribution) associated with a cluster of similar states is called a senone because of its state dependency. The phonetic models that share senones are shared-distribution models (SDMs).
Parameter tying
Parameter tying is a good trade-off between resolution and precision of the models: an equivalence relation is imposed between different components of the parameter set of a model or components of different models.
The definition of the tying relation involves a decision about every parameter of the model set. A priori knowledge-based equivalence relations:
- semi-continuous HMMs (SCHMMs): a set of output density mixtures which share the same set of basic Gaussian components; they differ only by the weights
- phonetically tied mixtures: a set of context-dependent HMMs in which the mixtures of all allophones of a phoneme share a phoneme-dependent codebook
- state tying: clustering of states based on the similarity of Gaussians (Young & Woodland, 94) and retraining
- phonetic decision trees (Bahl et al., 91): a binary decision tree which has a question and a set of HMM densities attached to each node; the questions generally reflect phonetic context, e.g. "is the left context a plosive?"
- genones: automatically determined SCHMMs:
  1. Mixtures of allophones are clustered; mixtures with common components are identified
  2. The most likely elements of the clusters are selected: genones
  3. The system is retrained
[Figure: phonetic decision tree with yes/no branches; this approach is mostly used]
Implementation issues
Overflow and underflow may occur during computations, since the probabilities become very small, especially for long sentences; to overcome this, log-probabilities are used.
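In the log domain, products of probabilities become sums of logs, and the sums of the forward recursion need a stable log-add; a minimal sketch:

```python
import math

LOG_ZERO = float("-inf")  # log of probability 0

def log_add(a, b):
    """Stable log(e^a + e^b): replaces the additions of the forward algorithm
    when probabilities are kept in the log domain to avoid underflow."""
    if a == LOG_ZERO:
        return b
    if b == LOG_ZERO:
        return a
    m = max(a, b)
    return m + math.log1p(math.exp(min(a, b) - m))

print(log_add(math.log(0.3), math.log(0.7)))   # log(0.3 + 0.7) = log(1.0) ~ 0.0
```

Subtracting the maximum before exponentiating keeps the argument of exp in [-inf, 0], so no overflow can occur; Viterbi needs only max over log-scores, which is even simpler.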