lecture 1: introduction & dsp - columbia universitydpwe/e6820/lectures/l01-intro+dsp.pdf ·...
TRANSCRIPT
EE E6820: Speech & Audio Processing & Recognition
Lecture 1:Introduction & DSP
Dan Ellis <[email protected]>Mike Mandel <[email protected]>
Columbia University Dept. of Electrical Engineeringhttp://www.ee.columbia.edu/∼dpwe/e6820
January 22, 2009
1 Sound and information
2 Course Structure
3 DSP review: Timescale modification
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 1 / 33
Outline
1 Sound and information
2 Course Structure
3 DSP review: Timescale modification
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 2 / 33
Sound and information
Sound is air pressure variation
Mechanical vibration
Pressure waves in air
Motion of sensor
Time-varying voltage
+ + + +
t
v(t)
Transducers convert air pressure ↔ voltage
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 3 / 33
What use is sound?
Footsteps examples:
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5
0
0.5
0 0.5 1 1.5 2 2.5 3 3.5 4 4.5 5-0.5
0
0.5
time / s
Hearing confers an evolutionary advantage
useful information, complements vision
. . . at a distance, in the dark, around corners
listeners are highly adapted to ‘natural sounds’ (includingspeech)
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 4 / 33
The acoustic communication chain
message signal channel receiver decoder
!
synthesis audio processing recognition
Sound is an information bearer
Received sound reflects source(s)plus effect of environment (channel)
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 6 / 33
Levels of abstraction
Much processing concerns shifting between levels of abstraction
sound p(t)
representation (e.g. t-f energy)
‘information’
abstract
concrete
An
alys
is
Syn
thesis
Different representations serve different tasks
separating aspects, making things explicit, . . .
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 7 / 33
Outline
1 Sound and information
2 Course Structure
3 DSP review: Timescale modification
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 8 / 33
Source structure
GoalsI survey topics in sound analysis & processingI develop and intuition for sound signalsI learn some specific technologies
Course structureI weekly assignments (25%)I midterm event (25%)I final project (50%)
Text
Speech and Audio Signal ProcessingBen Gold & Nelson MorganWiley, 2000ISBN: 0-471-35154-7
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 9 / 33
Web-basedCourse website:
http://www.ee.columbia.edu/∼dpwe/e6820/for lecture notes, problem sets, examples, . . .
+ student web pages for homework, etc.
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 10 / 33
Course outline
Fundamentals
L1:DSP
L2:Acoustics
L3:Pattern
recognition
L4:Auditory
perception
Audio processing
L5:Signalmodels
L6:Music
analysis/synthesis
L7:Audio
compression
L8:Spatial sound& rendering
Applications
L9:Speech
recognition
L10:Music
retrieval
L11:Signal
separation
L12:Multimediaindexing
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 11 / 33
Weekly assignments
Research papersI journal & conference publicationsI summarize & discuss in classI written summaries on web page + Courseworks discussion
Practical experimentsI Matlab-based (+ Signal Processing Toolbox)I direct experience of sound processingI skills for project
Book sections
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 12 / 33
Final project
Most significant part of course (50%) of grade
Oral proposals mid-semester;Presentations in final class+ website
ScopeI practical (Matlab recommended)I identify a problem; try some solutionsI evaluation
TopicI few restrictions within world of audioI investigate other resourcesI develop in discussion with me
Citation & plagiarism
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 13 / 33
Examples of past projects
Automatic prosody classification Model-based note transcription
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 14 / 33
Outline
1 Sound and information
2 Course Structure
3 DSP review: Timescale modification
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 15 / 33
DSP review: digital signals
time
xd[n] = Q( xc(nT ) )
Discrete-time samplinglimits bandwidth
Discrete-levelquantization
limitsdynamic range
Tε
sampling interval T
sampling frequency ΩT = 2πT
quantizer Q(y) = ε⌊ yε
⌋Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 16 / 33
The speech signal: time domain
Speech is a sequence of different sound types
-0.2
-0.1
0
0.1
0.2
1.38 1.4 1.42.1
0
.1
1.52 1.54 1.56 1.58-0.1
0
0.1
1.86 1.88 1.921.9-0.05
0
0.05
2.42 2.44 2.46 2.4
-0.02
0
0.02
1.4 1.6 1.8 2 2.2 2.4 2.6 time/s
watch thin as a dimeahas
Vowel: periodic “has”
Fricative: aperiodic “watch”
Glide: smooth transition “watch”
Stop burst: transient “dime”
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 17 / 33
Timescale modification (TSM)Can we modify a sound to make it ‘slower’?i.e. speech pronounced more slowly
e.g. to help comprehension, analysis
or more quickly for ‘speed listening’?
Why not just slow it down?
xs(t) = xo( tr ), r = slowdown factor (> 1→ slower)
equivalent to playback at a different sampling rate
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
2.35 2.4 2.45 2.5 2.55 2.6-0.1
-0.05
0
0.05
0.1
time/s
Original
2x slower
r = 2
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 18 / 33
Time-domain TSM
Problem: want to preserve local time structurebut alter global time structure
Repeat segmentsI but: artifacts from abrupt edges
Cross-fade & overlap
ym[mL + n] = ym−1[mL + n] + w [n] · x[⌊m
r
⌋L + n
]
2.35 2.4 2.45 2.5 2.55 2.6-0.1
0
0.1
4.7 4.75 4.8 4.85 4.9 4.95-0.1
0
0.1
1
1
1 1 2 2 3 3 4 4 5 5 6
2
2
3
3
4
4
5 6
6
5
time / s
time / s
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 19 / 33
Synchronous overlap-add (SOLA)
Idea: allow some leeway in placing window to optimize alignmentof waveforms
1
2
Km maximizes alignment of 1 and 2
Hence,
ym[mL + n] = ym−1[mL + n] + w [n] · x[⌊m
r
⌋L + n + Km
]Where Km chosen by cross-correlation:
Km = argmax0≤K≤Ku
∑Novn=0 ym−1[mL + n] · x
[⌊mr
⌋L + n + K
]√∑(ym−1[mL + n])2
∑(x[⌊
mr
⌋L + n + K
])2
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 20 / 33
The Fourier domain
Fourier Series (periodic continuous x)
Ω0 =2π
T
x(t) =∑k
cke jkΩ0t
ck =1
2πT
∫ T/2
−T/2x(t)e−jkΩ0tdt k1 2 3 5 6 74
|ck|1.0
1.5 1 0.5 0 0.5 1 1.5 1
0.5
0
0.5
t
x(t)
Fourier Transform (aperiodic continuous x)
x(t) =1
2π
∫X (jΩ)e jΩtdΩ
X (jΩ) =
∫x(t)e−jΩtdt
0 0.002 0.004 0.006 0.008time / sec
level / dB
-0.01
0
0.01
0.02
x(t)
0 2000 4000 6000 8000freq / Hz
-80
-60
-40
-20 |X(jΩ)|
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 21 / 33
Discrete-time Fourier
DT Fourier Transform (aperiodic sampled x)
x [n] =1
2π
∫ π
−πX (e jω)e jωndω
X (e jω) =∑
x [n]e−jωn
n-1 1 2 3 4 5 6 7
0 π
|X(ejω)|
ω2π 3π 4π 5π
1
2
3
x [n]
Discrete Fourier Transform (N-point x)
x [n] =∑k
X [k]e j 2πknN
X [k] =∑n
x [n]e−j 2πknN
k
|X(ejω)||X[k]|
k=1...
n1 2 3 4 5 6 7
x [n]
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 22 / 33
Sampling and aliasingDiscrete-time signals equal the continuous time signal at discretesampling instants:
xd [n] = xc(nT )
Sampling cannot represent rapid fluctuations
0 1 2 3 4 5 6 7 8 9 10 1
0.5
0
0.5
1
sin
((ΩM +
2π
T
)Tn
)= sin(ΩMTn) ∀n ∈ Z
Nyquist limit (ΩT/2) from periodic spectrum:
ΩΩM−ΩT ΩT −ΩM ΩT - ΩM−ΩT + ΩM
Gp(jΩ)Ga(jΩ) “alias” of “baseband”
signal
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 23 / 33
Speech sounds in the Fourier domain
1.52 1.54 1.56 1.58-0.1
0
0.1
2.42 2.44 2.46 2.48
-0.02
0
0.02
0 1000 2000 3000 4000-100
-80
-60
-40
0 1000 2000 3000 4000-100
-80
-60
1.37 1.38 1.39 1.4 1.41 1.42-0.1
0
0.1
0 1000 2000 3000-100
-80
-60
-40
1.86 1.87 1.88 1.89 1.9 1.91-0.05
0
0.05
0 1000 2000 3000 4000-100
-80
-60
Vowel: periodic “has”
Fricative: aperiodic “watch”
Glide: transition “watch”
Stop: transient “dime”
time domain frequency domain
time / s freq / Hz
ener
gy
/ dB
dB = 20 log10(amplitude) = 10 log10(power)Voiced spectrum has pitch + formants
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 24 / 33
Short-time Fourier TransformWant to localize energy in time and frequency
break sound into short-time pieces
calculate DFT of each one
2.35 2.4 2.45
0
4000
3000
2000
1000
2.5 2.55 2.6-0.1
0
0.1
time / s
freq
/ H
z
k →
short-timewindow
DFT
m = 0 m = 1 m = 2 m = 3
L 2L 3L
Mathematically,
X [k ,m] =N−1∑n=0
x [n] w [n −mL] exp
(−j
2πk(n −mL)
N
)Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 25 / 33
The Spectrogram
Plot STFT X [k ,m] as a gray-scale image
time / s
time / s
freq
/ H
zintensity / dB
2.35 2.4 2.45 2.5 2.55 2.60
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.1
-50
-40
-30
-20
-10
0
10
0 0.5 1 1.5 2 2.5
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 26 / 33
Time-frequency tradeoffLonger window w [n] gains frequency resolution at cost of timeresolution
1.4 1.6 1.8 2 2.2 2.4 2.6
freq
/ H
z
time / s
level/ dB
0
1000
2000
3000
4000
freq
/ H
z
0
1000
2000
3000
4000
0
0.2W
indo
w =
256
pt
ÒNar
row
ban
dÓ
Win
dow
= 4
8 pt
ÒWid
eban
dÓ
-50
-40
-30
-20
-10
0
10
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 27 / 33
Speech sounds on the Spectrogram
Most popular speech visualization
freq
/ H
z
0
1000
2000
3000
4000
1.4 1.6 1.8 2 2.2 2.4 2.6 time/s
watch thin as a dimeahas
Vow
el:
perio
dic
“has
”
Fric
've:
ap
erio
dic
“wat
ch”
Glid
e: tr
ansi
tion
“wat
ch”
Sto
p: tr
ansi
ent
“dim
e”
Wideband (short window) better than narrowband (long window)to see formants
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 28 / 33
TSM with the Spectrogram
Just stretch out the spectrogram?
Time
Fre
quen
cy
0 0.2 0.4 0.6 0.8 1 1.2 1.40
1000
2000
3000
4000
Time
Fre
quen
cy
0 0.2 0.4 0.6 0.8 1 1.2 1.40
1000
2000
3000
4000
how to resynthesize?spectrogram is only |Y [k,m]|
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 29 / 33
The Phase VocoderTimescale modification in the STFT domain
Magnitude from ‘stretched’ spectrogram:
|Y [k,m]| =∣∣∣X [k , m
r
]∣∣∣I e.g. by linear interpolation
But preserve phase increment between slices:
θY [k,m] = θX
[k,
m
r
]I e.g. by discrete differentiator
Does right thing for single sinusoidI keeps overlapped parts of sinusoid aligned
time
θ = ∆θ∆T
.
∆θ' = θ·2∆T .
∆T
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 30 / 33
General issues in TSM
Time windowI stretching a narrowband spectrogram
Malleability of different soundsI vowels stretch well, stops lose nature
Not a well-formed problem?I want to alter time without frequency
. . . but time and frequency are not separate!I ‘satisfying’ result is a subjective judgment⇒ solution depends on auditory perception. . .
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 31 / 33
Summary
Information in soundI lots of it, multiple levels of abstraction
Course overviewI survey of audio processing topicsI practicals, readings, project
DSP reviewI digital signals, time domainI Fourier domain, STFT
Timescale modificationI properties of the speech signalI time-domainI phase vocoder
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 32 / 33
References
J. L. Flanagan and R. M. Golden. Phase vocoder. Bell System Technical Journal,pages 1493–1509, 1966.
M. Dolson. The Phase Vocoder: A Tutorial. Computer Music Journal, 10(4):14–27,1986.
M. Puckette. Phase-locked vocoder. In Proc. IEEE Workshop on Applications ofSignal Processing to Audio and Acoustics (WASPAA), pages 222–225, 1995.
A. T. Cemgil and S. J. Godsill. Probabilistic Phase Vocoder and its application toInterpolation of Missing Values in Audio Signals. In 13th European SignalProcessing Conference, Antalya, Turkey, 2005.
Dan Ellis (Ellis & Mandel) Intro & DSP January 22, 2009 33 / 33