lecture 1: introduction & dspdsp.pdfe6820 sapr - dan ellis l01 - 2002-01-28 - 1 ee e6820: speech &...

-28 - 1

EE E6820: Speech & Audio Processing & Recognition

ring

E6820 SAPR - Dan Ellis L01 - 2002-01

Lecture 1: Introduction & DSP

Sound and information

Course structure

DSP review: Timescale modification

Dan Ellis http://www.ee.columbia.edu/~dpwe/e6820/

Columbia University Dept. of Electrical EngineeSpring 2002

1

2

3

-28 - 2

Sound and information

voltage

1

n

ir

e

E6820 SAPR - Dan Ellis L01 - 2002-01

• Sound is air pressure variation

• Transducers convert air pressure ↔↔↔↔

Mechanical vibratio

Pressure waves in a

Motion of sensor

Time-varying voltag

+ + + +

t

v(t)

-28 - 3

What use is sound?

antage

ionornersal sounds’

5

5

E6820 SAPR - Dan Ellis L01 - 2002-01

• Footsteps examples:

• Hearing confers an evolutionary adv- useful information, complements vis- ...at a distance, in the dark, around c- listeners are highly adapted to ‘natur

(including speech)

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5-0.5

0

0.5

0 0.5 1 1.5 2 2.5 3 3.5 4 4.5-0.5

0

0.5

time / s

-28 - 4

The scope of audio processing

abstract

E6820 SAPR - Dan Ellis L01 - 2002-01

AUDIO

PROCESSING

natural

man-made

simple

-28 - 5

The acoustic communication chain

lus effect

r decoder

!

ion

E6820 SAPR - Dan Ellis L01 - 2002-01

• Sound is an information bearer

• Received sound reflects source(s) pof environment (channel)

message signal channel receive

synthesis audioprocessing recognit

-28 - 6

Levels of abstraction

between

rent tasks

xplicit ...

E6820 SAPR - Dan Ellis L01 - 2002-01

• Much processing concerns shiftinglevels of abstraction

• Different representations serve diffe- separating aspects, making things e

sound p(t)

representation(e.g. t-f energy)

‘information’

abstract

concrete

An

alys

is

Syn

thesis

-28 - 7

Course structure

ocessingsp. ASR)

2

E6820 SAPR - Dan Ellis L01 - 2002-01

• Goals:- survey topics in sound analysis & pr- develop an intuition for sound signal- learn some specific technologies (es

• Course structure:- weekly assignments (25%)- midterm exam (25%)- final project (50%)

• Text:Speech and Audio Signal ProcessingBen Gold & Nelson Morgan, Wiley, 2000 ISBN: 0-471-35154-7

-28 - 8

Web-based

820/

ples, ...

etc.

E6820 SAPR - Dan Ellis L01 - 2002-01

• Course website:http://www.ee.columbia.edu/~dpwe/e6

for lecture notes, problem sets, exam

• + student web pages for homework

-28 - 9

Course outline

ryion

L10:Sequenceecognition

L12:Systems &pplications

E6820 SAPR - Dan Ellis L01 - 2002-01

Fundamentals

L1:DSP

L2:Acoustics

L3:Pattern

recognition

L4:Audito

percept

Audio processing

L5:Signalmodels

L6:Music

analysis/synthesis

L7:Audio

compression

L8:Spatial sound& rendering

Speech recognition

L9:Speechfeatures r

L11:Recognizer

training a

28 - 10

Weekly Assignments

g Toolbox)g

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Research papers- journal & conference publications- summarize & discuss in class- written summaries on web page

• Practical experiments- MATLAB-based (+ Signal Processin- direct experience of sound processin- skills for project

• Book sections+ questions from book

28 - 11

Final Project

of grade)

s

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Most significant part of course (50%

• Oral proposals mid-semester; Presentations in final class+ website

• Scope- practical (Matlab recommended)- identify a problem; try some solution- evaluation

• Topic- few restrictions within world of audio- investigate other resources- develop in discussion with me

28 - 12

Examples of past projects

ll video

undtrack

usic

s etc.

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Detecting sound events in basketba- classifying ‘cheers’ in sport video so

• The S-Matrix: A novel approach to msegment detection- finding breaks between verse, choru

dream2mSEG.wavdream2mSEG.wav

28 - 13

DSP review

3

time

g

ε

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Digital signals:

- sampling interval T,

sampling frequency

- quantizer

xd[n] = Q( xc(nT ) )

Discrete-time samplinlimits bandwidth

Discrete-levelquantization

limits dynamic range

Ωs2πT------=

Q y( ) ε y ε⁄⋅=

28 - 14

The speech signal: time domain

und types:

1.92

2.46 2.48

2.6 time/s

dime

transiente”

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Speech is a sequence of different so

-0.2

-0.1

0

0.1

0.2

1.38 1.4 1.42-0.1

0

0.1

1.52 1.54 1.56 1.58-0.1

0

0.1

1.86 1.88 1.9-0.05

0

0.05

2.42 2.44

-0.02

0

0.02

1.4 1.6 1.8 2 2.2 2.4

watch thin as aahas

Vowel: periodic“has”

Fricative: aperiodic“watch”

Glide: smooth transition“watch”

Stop burst:“dim

28 - 15

Timescale modification (TSM)

lower’?

r

pling rate

time/s

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Can we modify a sound to make it ‘si.e. speech pronounced more slowly- e.g. to help comprehension, analysis- or more quickly for ‘speed listening’?

• Why not just slow it down?

- , r = slowdown facto

- equiv. to playback at a different sam

xs t( ) xotr--

=

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

2.35 2.4 2.45 2.5 2.55 2.6-0.1

-0.05

0

0.05

0.1

Original

2x slower

28 - 16

Time-domain TSM

e structure

r--- L n+

e / s

e / s

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Problem: want to preserve local timbut alter global time structure

• Repeat segments- but: artefacts from abrupt edges

• Cross-fade & overlap

ym

mL n+[ ] ym 1– mL n+[ ] w n[ ] x m-⋅+=

2.35 2.4 2.45 2.5 2.55 2.6-0.1

0

0.1

4.7 4.75 4.8 4.85 4.9 4.95-0.1

0

0.1

1

1

1 1 2 2 3 3 4 4 5 5 6

2

2

3

3

4

4

5 6

6

5

tim

tim

28 - 17

Synchronous Overlap-Add (SOLA)

window to

on:

L n Km+ +

r--- L n K+ +

-- L n K+ + 2

-----------------------------------------

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Idea: Allow some leeway in placing optimize alignment of waveforms

• Hence,

where Km chosen by cross-correlati

1

2

Km maximizes alignment of 1 and 2

ym

mL n+[ ] ym 1– mL n+[ ] w n[ ] x mr----⋅+=

Km

ym 1–

mL n+[ ] x m-⋅n 0=Nov∑

ym 1–

mL n+[ ]( )2

∑ x mr--

∑

------------------------------------------------------------------------------0 K KU ≤ ≤

argmax =

28 - 18

The Fourier domain

s x)

E6820 SAPR - Dan Ellis L01 - 2002-01-

Fourier Series (periodic continuous x)

Fourier Transform (aperiodic continuou

x t( ) ck ejkΩ0t⋅

k∑=

ck1

2πT---------- x t( ) e

jkΩ0– t⋅ tdT 2⁄–

T 2⁄∫=

x t( ) 12π------ X jΩ( ) e jΩt⋅ Ωd∫=

X jΩ( ) x t( ) e jΩt–⋅ td∫=

28 - 19

Discrete-time Fourier

ed x)

E6820 SAPR - Dan Ellis L01 - 2002-01-

DT Fourier Transform (aperiodic sampl

Discrete Fourier Transform (N-point x)

x n[ ] 12π------ X e

jω( ) e jωn⋅ ωdπ–

π∫=

X ejω( ) x n[ ] e jωn–⋅∑=

x n[ ] X k[ ] ej2πkN

---------⋅

k∑=

X k[ ] x n[ ] ej2πkN

---------–⋅

n∑=

28 - 20

Sampling and aliasing

tinuous tants:

nt rapid

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Discrete-time signals equal the contime signal at discrete sampling ins

• Infrequent sampling cannot represefluctuations

• Nyquist limit (Fsamp/2) emerges from periodic spectrum:

xd n[ ] xc nT( )=

28 - 21

Speech sounds in the Fourier domain

(power)

ts

000 3000 4000

000 3000 4000

0.1

000 3000

-40

000 3000 4000

time domain frequency domain

freq / Hz

B

E6820 SAPR - Dan Ellis L01 - 2002-01-

- dB = 20.log10(amplitude) = 10.log10

• Voiced spectrum has pitch + forman

1.52 1.54 1.56 1.58-0.1

0

0.1

2.42 2.44 2.46 2.48

-0.02

0

0.02

0 1000 2-100

-80

-60

-40

0 1000 2-100

-80

-60

1.37 1.38 1.39 1.4 1.41 1.42-0.1

0

0 1000 2-100

-80

-60

1.86 1.87 1.88 1.89 1.9 1.91-0.05

0

0.05

0 1000 2-100

-80

-60

Vowel: periodic“has”

Fricative: aperiodic“watch”

Glide: transition“watch”

Stop: transient“dime”

time / s

ener

gy

/ d

28 - 22

Short-time Fourier Transform

and freq

time / s

2πk n mL–( )N

--------------------------------)

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Want to localize energy in both time→break sound into short-time pieces

calculate DFT of each one

• Mathematically:

2.35 2.4 2.45

0

4000

3000

2000

1000

2.5 2.55 2.6-0.1

0

0.1

freq

/ H

z

short-timewindow

DFT

L 2L 3L

X k m,[ ] x n[ ] w n mL–[ ] j(–exp⋅ ⋅n 0=N 1–∑=

28 - 23

The Spectrogram

age:

time / s

time / s

intensity / dB

-50

-40

-30

-20

-10

0

10

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Plot STFT as a grayscale imX k m,[ ]fr

eq /

Hz

2.35 2.4 2.45 2.5 2.55 2.60

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.1

0 0.5 1 1.5 2 2.5

28 - 24

Time-frequency tradeoff

ncy

2.6 time / s

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Longer of window w[n] gains frequeresolution at cost of time resolution

1.4 1.6 1.8 2 2.2 2.4

freq

/ H

z

0

1000

2000

3000

4000

freq

/ H

z

0

1000

2000

3000

4000

0

0.2W

ind

ow

= 2

56 p

t“N

arro

wb

and

”W

ind

ow

= 3

2 p

t“W

ideb

and

”

28 - 25

Speech sounds on the Spectrogram

n rmants

time/s

dime

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Most popular speech visualization

• Wideband (short window) better thanarrowband (long window) to see fo

freq

/ H

z

0

1000

2000

3000

4000

1.4 1.6 1.8 2 2.2 2.4 2.6

watch thin as aahas

Vo

wel

: pe

riodi

c“h

as”

Fri

c've

: ap

erio

dic

“wat

ch”

Glid

e: tr

ansi

tion

“wat

ch”

Sto

p:

tran

sien

t“ d

ime”

28 - 26

TSM with the Spectrogram

1.4

1.4

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Just stretch out the spectrogram?

- how to resynthesize?

spectrogram is only

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.20

1000

2000

3000

4000

Time

Fre

quen

cy

0 0.2 0.4 0.6 0.8 1 1.20

1000

2000

3000

4000

Y k m,[ ]

28 - 27

The Phase Vocoder

domain

ram:

een slices:

ligned

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Timescale modification in the STFT

• Magnitude from ‘stretched’ spectrog

- e.g. by linear interpolation

• But preserve phase increment betw

- e.g. by discrete differentiator

• Does right thing for single sinusoid- keeps overlapped parts of sinusoid a

Y k m,[ ] X k mr----,=

θ̇Y k m,[ ] θ̇ X kmr----,=

28 - 28

General issues in TSM

e

parate!ementtion...

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Time window- stretching a narrowband spectrogram

• Malleability of different sounds- vowels stretch well, stops lose natur

• Not a well-formed problem?- want to alter time without frequency

... but time and frequency are not se- ‘satisfying’ result is a subjective judg→solution depends on auditory percep

28 - 29

Summary

n

E6820 SAPR - Dan Ellis L01 - 2002-01-

• Information in sound- lots of it, multiple levels of abstractio

• Course overview- survey of audio processing topics- practicals, readings, project

• DSP review- digital signals, time domain- Fourier domain, STFT

• Timescale modification- properties of the speech signal- time-domain- phase vocoder

EE E6820: Speech & Audio Processing & RecognitionLecture 1: Introduction & DSPSound and informationWhat use is sound?The scope of audio processingThe acoustic communication chainLevels of abstractionCourse structureWeb-basedCourse outlineWeekly AssignmentsFinal ProjectExamples of past projectsDSP reviewThe speech signal: time domainTimescale modification (TSM)Time-domain TSMSynchronous Overlap-Add (SOLA)The Fourier domainDiscrete-time FourierSampling and aliasingSpeech sounds in the Fourier domainShort-time Fourier TransformThe SpectrogramTime-frequency tradeoffSpeech sounds on the SpectrogramTSM with the SpectrogramThe Phase VocoderGeneral issues in TSMSummary

lecture 1: introduction & dspdsp.pdfe6820 sapr - dan ellis l01 - 2002-01-28 - 1 ee e6820: speech &...

Documents