rhythm analysis in music - northwestern engineeringpardo/courses/eecs352/lectures/mpm12... · music...

Post on 03-May-2018

224 Views

Category:

Documents

4 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Time-frequency Masking

EECS 352: Machine Perception of Music & Audio

Zafar RAFII, Spring 2012 1

Real spectrogram

time

fre

qu

en

cy

• The Short-Time Fourier Transform (STFT) is a succession of local Fourier Transforms (FT)

STFT

STFT

Zafar RAFII, Spring 2012 2

+ j* Imaginary spectrogram

time

fre

qu

en

cy

frame i frame i

Time signal

timewindow i

Imaginary spectrum

frequency

Real spectrum

frequency

• If we used a window of N samples, the FT has N values (from 0 to N-1); let’s suppose N = 8…

Time signal

timewindow i

FT

STFT

Zafar RAFII, Spring 2012 3

+ j* window i frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

N frequency values N frequency values

Imaginary spectrum

frequency

Real spectrum

frequency

• Frequency index 0 is the DC component; it is always real (it is the sum of the time values)

Time signal

timewindow i

FT

STFT

Zafar RAFII, Spring 2012 4

+ j* window i frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Imaginary spectrum

frequency

Real spectrum

frequency

• Frequency indices from 1 to floor(N/2) are the “unique” complex values (a + j*b)

Time signal

timewindow i

FT

STFT

Zafar RAFII, Spring 2012 5

+ j* window i frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Imaginary spectrum

frequency

Real spectrum

frequency

• Frequency indices from floor(N/2) to N-1 are the “mirrored” complex conjugates (a - j*b)

Time signal

timewindow i

FT

STFT

Zafar RAFII, Spring 2012 6

+ j* window i frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Imaginary spectrum

frequency

Real spectrum

frequency

• If N is even, there is also a pivot frequency at frequency index N/2; it is always real

Time signal

timewindow i

FT

STFT

Zafar RAFII, Spring 2012 7

+ j* window i frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Real spectrogram

time

fre

qu

en

cy

• Summary of the frequency indices and values in the STFT (in colors!)

STFT

Zafar RAFII, Spring 2012 8

Imaginary spectrogram

time

fre

qu

en

cy

Frequency 0 = DC component (always real)

Frequency 1 to floor(N/2) = “unique” complex values

Frequency N/2 = “pivot” frequency (always real)

Frequency floor(N/2) to N-1 = “mirrored” complex conjugates

N frequency values = frequency 0 to N-1

+ j*

• The (magnitude) spectrogram is the magnitude (absolute value) of the STFT

Magnitude spectrogram

time

fre

qu

en

cy

Real spectrogram

time

fre

qu

en

cy

Spectrogram

Zafar RAFII, Spring 2012 9

Imaginary spectrogram

time

fre

qu

en

cy

abs

frame i frame i frame i

+ j*

• For a complex number 𝑎 + 𝑗 ∗ 𝑏, the absolute

value is 𝑎 + 𝑗 ∗ 𝑏 = 𝑎2 + 𝑏2

abs

Spectrogram

Zafar RAFII, Spring 2012 10

Imaginary spectrum

frequency

Real spectrum

frequency

+ j* frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Magnitude spectrum

frequency

=

frame i

1.4 1.4 1.4 1.4 3 0 2 0

1.4 1.4 1.4

• All the N frequency values (frequency indices from 0 to N-1) are real and positive (abs!)

abs

Spectrogram

Zafar RAFII, Spring 2012 11

Imaginary spectrum

frequency

Real spectrum

frequency

+ j* frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Magnitude spectrum

frequency

0 1 2 3 4 5 6 7

= 1.4 3 0 2 0

frame i

N frequency values

1.4 1.4 1.4

• Frequency indices from 0 to floor(N/2) are the unique frequency values (with DC and pivot)

abs

Spectrogram

Zafar RAFII, Spring 2012 12

Imaginary spectrum

frequency

Real spectrum

frequency

+ j* frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Magnitude spectrum

frequency

0 1 2 3 4 5 6 7

= 1.4 3 0 2 0

frame i

1.4 1.4 1.4

• Frequency indices from floor(N/2)+1 to N-1 are the mirrored frequency values

abs

Spectrogram

Zafar RAFII, Spring 2012 13

Imaginary spectrum

frequency

Real spectrum

frequency

+ j* frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Magnitude spectrum

frequency

0 1 2 3 4 5 6 7

= 1.4 3 0 2 0

frame i

1.4 1.4 1.4

• Since there are redundant, we can discard the frequency values from floor(N/2)+1 to N-1

abs

Spectrogram

Zafar RAFII, Spring 2012 14

Imaginary spectrum

frequency

Real spectrum

frequency

+ j* frame i frame i

-3 -1 0 1 2 1 0 -1 0 -1 0 1 0 -1 0 1 + j* 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7

Magnitude spectrum

frequency

0 1 2 3 4 5 6 7

= 1.4 3 0 2 0

frame i

floor(N/2)+1 unique frequency values

• The spectrogram has therefore floor(N/2)+1 unique frequency values (with DC and pivot)

Magnitude spectrogram

time

fre

qu

en

cy

Real spectrogram

time

fre

qu

en

cy

Spectrogram

Zafar RAFII, Spring 2012 15

Imaginary spectrogram

time

fre

qu

en

cy

frame i frame i frame i

abs + j*

Spectrogram

• Why the magnitude spectrogram?

– Easy to visualize (compare with the STFT)

– Magnitude information more important

– Human ear less sensitive to phase

Zafar RAFII, Spring 2012 16

Time signal

time

Magnitude spectrogram

time

fre

qu

en

cy +

0

Spectrogram

• When you display a spectrogram in Matlab…

– imagesc: data is scaled to use the full colormap

– 10*log10(V): magnitude spectrogram V in dB

– set(gca,’YDir’,’normal’): y-axis from bottom to top

Zafar RAFII, Spring 2012 17

Time signal

time

Magnitude spectrogram

time

fre

qu

en

cy +

0

• The signal cannot be reconstructed from the spectrogram (phase information is missing!)

Spectrogram

Zafar RAFII, Spring 2012 18

Time signal

time

Imaginary spectrogram

time

fre

qu

en

cy

Real spectrogram

time

fre

qu

en

cy

Magnitude spectrogram

time

fre

qu

en

cy iSTFT ???

???

• Suppose we have a mixture of two sources: a music signal and a voice signal

Time-frequency Masking

Zafar RAFII, Spring 2012 19

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

+ 0

+ 0

+ 0

• We assume that the sources are sparse = most of the time-frequency bins have null energy

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 20

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

• We assume that the sources are sparse = most of the time-frequency bins have null energy

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 21

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mostly low energy bins

Mostly low energy bins Mixture spectrogram

timefr

eq

ue

ncy

• We assume that the sources are disjoint = their time-frequency bins do not overlap

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 22

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

• We assume that the sources are disjoint = their time-frequency bins do not overlap

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 23

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Not a lot of overlapping

• Assuming sparseness and disjointness, we can discriminate the bins between mixed sources

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 24

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Music signal

time

Mixture signal

time

Voice signal

time

• Assuming sparseness and disjointness, we can discriminate the bins between mixed sources

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 25

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Music signal

time

Mixture signal

time

Voice signal

time

Source 1 = bright Source 2 = dark

• Bins that are likely to belong to one source are assigned to 1, the rest to 0 = binary masking!

Binary mask

time

fre

qu

en

cy

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 26

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Binary mask

time

fre

qu

en

cy 1

0

Music signal

time

Voice signal

time

Source of interest Interfering source

• By multiplying the binary mask to the mixture spectrogram, we could “preview” the estimate

Time-frequency Masking

Zafar RAFII, Spring 2012 27

Binary mask

time

fre

qu

en

cy .x

Mixture spectrogram

time

fre

qu

en

cy

Masked spectrogram

time

fre

qu

en

cy

+ 0

+ 0

1 0

• However, we cannot derive the estimate itself because we cannot invert a spectrogram!

Time-frequency Masking

Zafar RAFII, Spring 2012 28

Binary mask

time

fre

qu

en

cy .x

Mixture spectrogram

time

fre

qu

en

cy

Masked spectrogram

time

fre

qu

en

cy

+ 0

+ 0

Music estimate

time

1 0

• We mirror the redundant frequencies from the unique frequencies (without DC and pivot)

Time-frequency Masking

Zafar RAFII, Spring 2012 29

Binary mask

time

fre

qu

en

cy

Binary mask

time

fre

qu

en

cy

Binary mask

time

frequency

1 0

1 0

• And the we apply this full binary mask to the STFT using a element-wise multiplication

Time-frequency Masking

Zafar RAFII, Spring 2012 30

Binary mask

time

fre

qu

en

cy

Binary mask

time

fre

qu

en

cy

Binary mask

time

frequency

Imaginary spectrogram

time

fre

qu

en

cy

Real spectrogram

time

fre

qu

en

cy

1 0

1 0

.x

• The estimate signal can now be reconstructed via inverse STFT

Time-frequency Masking

Zafar RAFII, Spring 2012 31

Binary mask

time

fre

qu

en

cy

1 0

Masked imaginary

time

fre

qu

en

cy

Masked real

time

fre

qu

en

cy

Music estimate

time

iSTFT

• Sources are not really sparse or disjoint in time-frequency in the mixture

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 32

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

• Sources are not really sparse or disjoint in time-frequency in the mixture

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 33

Music signal

time

Mixture signal

time

Voice signal

time

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Soft mask

time

fre

qu

en

cy

• Bins that are likely to belong to one source are close to 1, the rest close to 0 = soft masking!

Music spectrogram

time

fre

qu

en

cy

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Time-frequency Masking

Zafar RAFII, Spring 2012 34

Music spectrogram

time

fre

qu

en

cy +

0

+ 0

+ 0

Voice spectrogram

time

fre

qu

en

cy

Mixture spectrogram

timefr

eq

ue

ncy

Music signal

time

Voice signal

time

Source of interest Interfering source

1 0

Soft mask

time

fre

qu

en

cy

• Let’s listen to the results!

Time-frequency Masking

Zafar RAFII, Spring 2012 35

Music estimate

time

Music signal

time

demix mix

Mixture signal

time

• How can we efficiently model a binary/soft time-frequency mask for source separation?...

• To be continued…

Question…

Zafar RAFII, Spring 2012 36

Mixture spectrogram

time

fre

qu

en

cy

+ 0

Soft mask

time

fre

qu

en

cy

1 0 ???

top related