lecture 1: introduction & dspdsp.pdfe6820 sapr - dan ellis l01 - 2002-01-28 - 1 ee e6820: speech &...

29
E6820 SAPR - Dan Ellis L01 - 2002-01-28 - 1 EE E6820: Speech & Audio Processing & Recognition Lecture 1: Introduction & DSP Sound and information Course structure DSP review: Timescale modification Dan Ellis <[email protected]> http://www.ee.columbia.edu/~dpwe/e6820/ Columbia University Dept. of Electrical Engineering Spring 2002 1 2 3

Upload: others

Post on 29-Jan-2021

3 views

Category:

Documents


0 download

TRANSCRIPT

  • -28 - 1

    EE E6820: Speech & Audio Processing & Recognition

    ring

    E6820 SAPR - Dan Ellis L01 - 2002-01

    Lecture 1: Introduction & DSP

    Sound and information

    Course structure

    DSP review: Timescale modification

    Dan Ellis http://www.ee.columbia.edu/~dpwe/e6820/

    Columbia University Dept. of Electrical EngineeSpring 2002

    1

    2

    3

  • -28 - 2

    Sound and information

    voltage

    1

    n

    ir

    e

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Sound is air pressure variation

    • Transducers convert air pressure ↔↔↔↔

    Mechanical vibratio

    Pressure waves in a

    Motion of sensor

    Time-varying voltag

    + + + +

    t

    v(t)

  • -28 - 3

    What use is sound?

    antage

    ionornersal sounds’

    5

    5

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Footsteps examples:

    • Hearing confers an evolutionary adv- useful information, complements vis- ...at a distance, in the dark, around c- listeners are highly adapted to ‘natur

    (including speech)

    0 0.5 1 1.5 2 2.5 3 3.5 4 4.5-0.5

    0

    0.5

    0 0.5 1 1.5 2 2.5 3 3.5 4 4.5-0.5

    0

    0.5

    time / s

  • -28 - 4

    The scope of audio processing

    abstract

    E6820 SAPR - Dan Ellis L01 - 2002-01

    AUDIO

    PROCESSING

    natural

    man-made

    simple

  • -28 - 5

    The acoustic communication chain

    lus effect

    r decoder

    !

    ion

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Sound is an information bearer

    • Received sound reflects source(s) pof environment (channel)

    message signal channel receive

    synthesis audioprocessing recognit

  • -28 - 6

    Levels of abstraction

    between

    rent tasks

    xplicit ...

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Much processing concerns shiftinglevels of abstraction

    • Different representations serve diffe- separating aspects, making things e

    sound p(t)

    representation(e.g. t-f energy)

    ‘information’

    abstract

    concrete

    An

    alys

    is

    Syn

    thesis

  • -28 - 7

    Course structure

    ocessingsp. ASR)

    2

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Goals:- survey topics in sound analysis & pr- develop an intuition for sound signal- learn some specific technologies (es

    • Course structure:- weekly assignments (25%)- midterm exam (25%)- final project (50%)

    • Text:Speech and Audio Signal ProcessingBen Gold & Nelson Morgan, Wiley, 2000 ISBN: 0-471-35154-7

  • -28 - 8

    Web-based

    820/

    ples, ...

    etc.

    E6820 SAPR - Dan Ellis L01 - 2002-01

    • Course website:http://www.ee.columbia.edu/~dpwe/e6

    for lecture notes, problem sets, exam

    • + student web pages for homework

  • -28 - 9

    Course outline

    ryion

    L10:Sequenceecognition

    L12:Systems &pplications

    E6820 SAPR - Dan Ellis L01 - 2002-01

    Fundamentals

    L1:DSP

    L2:Acoustics

    L3:Pattern

    recognition

    L4:Audito

    percept

    Audio processing

    L5:Signalmodels

    L6:Music

    analysis/synthesis

    L7:Audio

    compression

    L8:Spatial sound& rendering

    Speech recognition

    L9:Speechfeatures r

    L11:Recognizer

    training a

  • 28 - 10

    Weekly Assignments

    g Toolbox)g

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Research papers- journal & conference publications- summarize & discuss in class- written summaries on web page

    • Practical experiments- MATLAB-based (+ Signal Processin- direct experience of sound processin- skills for project

    • Book sections+ questions from book

  • 28 - 11

    Final Project

    of grade)

    s

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Most significant part of course (50%

    • Oral proposals mid-semester; Presentations in final class+ website

    • Scope- practical (Matlab recommended)- identify a problem; try some solution- evaluation

    • Topic- few restrictions within world of audio- investigate other resources- develop in discussion with me

  • 28 - 12

    Examples of past projects

    ll video

    undtrack

    usic

    s etc.

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Detecting sound events in basketba- classifying ‘cheers’ in sport video so

    • The S-Matrix: A novel approach to msegment detection- finding breaks between verse, choru

    dream2mSEG.wavdream2mSEG.wav

  • 28 - 13

    DSP review

    3

    time

    g

    ε

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Digital signals:

    - sampling interval T,

    sampling frequency

    - quantizer

    xd[n] = Q( xc(nT ) )

    Discrete-time samplinlimits bandwidth

    Discrete-levelquantization

    limits dynamic range

    Ωs2πT------=

    Q y( ) ε y ε⁄⋅=

  • 28 - 14

    The speech signal: time domain

    und types:

    1.92

    2.46 2.48

    2.6 time/s

    dime

    transiente”

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Speech is a sequence of different so

    -0.2

    -0.1

    0

    0.1

    0.2

    1.38 1.4 1.42-0.1

    0

    0.1

    1.52 1.54 1.56 1.58-0.1

    0

    0.1

    1.86 1.88 1.9-0.05

    0

    0.05

    2.42 2.44

    -0.02

    0

    0.02

    1.4 1.6 1.8 2 2.2 2.4

    watch thin as aahas

    Vowel: periodic“has”

    Fricative: aperiodic“watch”

    Glide: smooth transition“watch”

    Stop burst:“dim

  • 28 - 15

    Timescale modification (TSM)

    lower’?

    r

    pling rate

    time/s

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Can we modify a sound to make it ‘si.e. speech pronounced more slowly- e.g. to help comprehension, analysis- or more quickly for ‘speed listening’?

    • Why not just slow it down?

    - , r = slowdown facto

    - equiv. to playback at a different sam

    xs t( ) xotr--

    =

    2.35 2.4 2.45 2.5 2.55 2.6-0.1

    -0.05

    0

    0.05

    0.1

    2.35 2.4 2.45 2.5 2.55 2.6-0.1

    -0.05

    0

    0.05

    0.1

    Original

    2x slower

  • 28 - 16

    Time-domain TSM

    e structure

    r--- L n+

    e / s

    e / s

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Problem: want to preserve local timbut alter global time structure

    • Repeat segments- but: artefacts from abrupt edges

    • Cross-fade & overlap

    ym

    mL n+[ ] ym 1– mL n+[ ] w n[ ] x m-⋅+=

    2.35 2.4 2.45 2.5 2.55 2.6-0.1

    0

    0.1

    4.7 4.75 4.8 4.85 4.9 4.95-0.1

    0

    0.1

    1

    1

    1 1 2 2 3 3 4 4 5 5 6

    2

    2

    3

    3

    4

    4

    5 6

    6

    5

    tim

    tim

  • 28 - 17

    Synchronous Overlap-Add (SOLA)

    window to

    on:

    L n Km+ +

    r--- L n K+ +

    -- L n K+ + 2

    -----------------------------------------

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Idea: Allow some leeway in placing optimize alignment of waveforms

    • Hence,

    where Km chosen by cross-correlati

    1

    2

    Km maximizes alignment of 1 and 2

    ym

    mL n+[ ] ym 1– mL n+[ ] w n[ ] x mr----⋅+=

    Km

    ym 1–

    mL n+[ ] x m-⋅n 0=Nov∑

    ym 1–

    mL n+[ ]( )2

    ∑ x mr--

    ------------------------------------------------------------------------------0 K KU ≤ ≤

    argmax =

  • 28 - 18

    The Fourier domain

    s x)

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    Fourier Series (periodic continuous x)

    Fourier Transform (aperiodic continuou

    x t( ) ck ejkΩ0t⋅

    k∑=

    ck1

    2πT---------- x t( ) e

    jkΩ0– t⋅ tdT 2⁄–

    T 2⁄∫=

    x t( ) 12π------ X jΩ( ) e jΩt⋅ Ωd∫=

    X jΩ( ) x t( ) e jΩt–⋅ td∫=

  • 28 - 19

    Discrete-time Fourier

    ed x)

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    DT Fourier Transform (aperiodic sampl

    Discrete Fourier Transform (N-point x)

    x n[ ] 12π------ X e

    jω( ) e jωn⋅ ωdπ–

    π∫=

    X ejω( ) x n[ ] e jωn–⋅∑=

    x n[ ] X k[ ] ej2πkN

    ---------⋅

    k∑=

    X k[ ] x n[ ] ej2πkN

    ---------–⋅

    n∑=

  • 28 - 20

    Sampling and aliasing

    tinuous tants:

    nt rapid

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Discrete-time signals equal the contime signal at discrete sampling ins

    • Infrequent sampling cannot represefluctuations

    • Nyquist limit (Fsamp/2) emerges from periodic spectrum:

    xd n[ ] xc nT( )=

  • 28 - 21

    Speech sounds in the Fourier domain

    (power)

    ts

    000 3000 4000

    000 3000 4000

    0.1

    000 3000

    -40

    000 3000 4000

    time domain frequency domain

    freq / Hz

    B

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    - dB = 20.log10(amplitude) = 10.log10

    • Voiced spectrum has pitch + forman

    1.52 1.54 1.56 1.58-0.1

    0

    0.1

    2.42 2.44 2.46 2.48

    -0.02

    0

    0.02

    0 1000 2-100

    -80

    -60

    -40

    0 1000 2-100

    -80

    -60

    1.37 1.38 1.39 1.4 1.41 1.42-0.1

    0

    0 1000 2-100

    -80

    -60

    1.86 1.87 1.88 1.89 1.9 1.91-0.05

    0

    0.05

    0 1000 2-100

    -80

    -60

    Vowel: periodic“has”

    Fricative: aperiodic“watch”

    Glide: transition“watch”

    Stop: transient“dime”

    time / s

    ener

    gy

    / d

  • 28 - 22

    Short-time Fourier Transform

    and freq

    time / s

    2πk n mL–( )N

    --------------------------------)

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Want to localize energy in both time→break sound into short-time pieces

    calculate DFT of each one

    • Mathematically:

    2.35 2.4 2.45

    0

    4000

    3000

    2000

    1000

    2.5 2.55 2.6-0.1

    0

    0.1

    freq

    / H

    z

    short-timewindow

    DFT

    L 2L 3L

    X k m,[ ] x n[ ] w n mL–[ ] j(–exp⋅ ⋅n 0=N 1–∑=

  • 28 - 23

    The Spectrogram

    age:

    time / s

    time / s

    intensity / dB

    -50

    -40

    -30

    -20

    -10

    0

    10

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Plot STFT as a grayscale imX k m,[ ]fr

    eq /

    Hz

    2.35 2.4 2.45 2.5 2.55 2.60

    1000

    2000

    3000

    4000

    freq

    / H

    z

    0

    1000

    2000

    3000

    4000

    0

    0.1

    0 0.5 1 1.5 2 2.5

  • 28 - 24

    Time-frequency tradeoff

    ncy

    2.6 time / s

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Longer of window w[n] gains frequeresolution at cost of time resolution

    1.4 1.6 1.8 2 2.2 2.4

    freq

    / H

    z

    0

    1000

    2000

    3000

    4000

    freq

    / H

    z

    0

    1000

    2000

    3000

    4000

    0

    0.2W

    ind

    ow

    = 2

    56 p

    t“N

    arro

    wb

    and

    ”W

    ind

    ow

    = 3

    2 p

    t“W

    ideb

    and

  • 28 - 25

    Speech sounds on the Spectrogram

    n rmants

    time/s

    dime

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Most popular speech visualization

    • Wideband (short window) better thanarrowband (long window) to see fo

    freq

    / H

    z

    0

    1000

    2000

    3000

    4000

    1.4 1.6 1.8 2 2.2 2.4 2.6

    watch thin as aahas

    Vo

    wel

    : pe

    riodi

    c“h

    as”

    Fri

    c've

    : ap

    erio

    dic

    “wat

    ch”

    Glid

    e: tr

    ansi

    tion

    “wat

    ch”

    Sto

    p:

    tran

    sien

    t“ d

    ime”

  • 28 - 26

    TSM with the Spectrogram

    1.4

    1.4

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Just stretch out the spectrogram?

    - how to resynthesize?

    spectrogram is only

    Time

    Fre

    quen

    cy

    0 0.2 0.4 0.6 0.8 1 1.20

    1000

    2000

    3000

    4000

    Time

    Fre

    quen

    cy

    0 0.2 0.4 0.6 0.8 1 1.20

    1000

    2000

    3000

    4000

    Y k m,[ ]

  • 28 - 27

    The Phase Vocoder

    domain

    ram:

    een slices:

    ligned

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Timescale modification in the STFT

    • Magnitude from ‘stretched’ spectrog

    - e.g. by linear interpolation

    • But preserve phase increment betw

    - e.g. by discrete differentiator

    • Does right thing for single sinusoid- keeps overlapped parts of sinusoid a

    Y k m,[ ] X k mr----,=

    θ̇Y k m,[ ] θ̇ X kmr----,=

  • 28 - 28

    General issues in TSM

    e

    parate!ementtion...

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Time window- stretching a narrowband spectrogram

    • Malleability of different sounds- vowels stretch well, stops lose natur

    • Not a well-formed problem?- want to alter time without frequency

    ... but time and frequency are not se- ‘satisfying’ result is a subjective judg→solution depends on auditory percep

  • 28 - 29

    Summary

    n

    E6820 SAPR - Dan Ellis L01 - 2002-01-

    • Information in sound- lots of it, multiple levels of abstractio

    • Course overview- survey of audio processing topics- practicals, readings, project

    • DSP review- digital signals, time domain- Fourier domain, STFT

    • Timescale modification- properties of the speech signal- time-domain- phase vocoder

    EE E6820: Speech & Audio Processing & RecognitionLecture 1: Introduction & DSPSound and informationWhat use is sound?The scope of audio processingThe acoustic communication chainLevels of abstractionCourse structureWeb-basedCourse outlineWeekly AssignmentsFinal ProjectExamples of past projectsDSP reviewThe speech signal: time domainTimescale modification (TSM)Time-domain TSMSynchronous Overlap-Add (SOLA)The Fourier domainDiscrete-time FourierSampling and aliasingSpeech sounds in the Fourier domainShort-time Fourier TransformThe SpectrogramTime-frequency tradeoffSpeech sounds on the SpectrogramTSM with the SpectrogramThe Phase VocoderGeneral issues in TSMSummary