music signal processing
TRANSCRIPT
Tutorial
Meinard Müller
Saarland University and MPI Informatik [email protected]
Music Signal Processing
Anssi Klapuri
Queen Mary University of London [email protected]
Overview
Pitch and Harmony Tempo and Beat Timbre Melody
Part I:
Part II:
Part III:
Part IV:
Coffee Break
Overview
Pitch and Harmony
Part I:
§ Motivating application: Music synchronization
§ Pitch features § Chroma features § Chroma variants § Further applications
Pitch and Frequency § Pitch is an auditory sensation in which a listener assigns
musical tones to positions on a musical scale. § Pitch is related to frequency, but are not equivalent. § Frequency is an objective concept § Pitch is subjective. § It takes the human brain to map the internal quality of
pitch. § Pitches are usually quantified as frequencies by
comparing them with pure tones.
4
Bernstein
Karajan
Scherbakov (piano)
MIDI (piano)
Music Synchronization: Audio-Audio
Beethoven’s Fifth
Various interpretations
Music Synchronization: Audio-Audio
Karajan
Scherbakov
Time (seconds)
Beethoven’s Fifth
[Müller, Springer-Monograph 2007]
Karajan
Scherbakov
Music Synchronization: Audio-Audio
Synchronization: Karajan → Scherbakov
Time (seconds)
Beethoven’s Fifth
[Müller, Springer-Monograph 2007]
Music Synchronization: Audio-Audio
Feature extraction: chroma features
Time (seconds) Time (seconds) Time (seconds) Time (seconds)
Karajan Scherbakov
Inte
nsity
(nor
mal
ized
)
[Müller, Springer-Monograph 2007]
Music Synchronization: Audio-Audio
Cost matrix K
araj
an
Scherbakov
Time (seconds)
Tim
e (s
econ
ds)
Cos
t val
ue
[Müller, Springer-Monograph 2007]
Music Synchronization: Audio-Audio
Cost-minimizing warping path
Kar
ajan
Tim
e (s
econ
ds)
Scherbakov
Time (seconds)
Cos
t val
ue
[Müller, Springer-Monograph 2007]
Short Time Fourier Transform
Time (seconds)
Frequency (Hertz)
Intensity (dB)
Example: Chirp signal
Spectrogram
(only magnitudes are shown)
Playing a single note on an instrument may result in a complex superposition of different frequencies.
Short Time Fourier Transform
§ Fundamental frequency § Harmonics (overtones) § Transients (e.g. non-harmonic attack phase) § Room acoustics (e.g. reverberation) § Vibrato (periodic variations in frequency) § Tremolo (periodic variations in amplitude) § …
Pitch and frequency are two different concepts!
Pitch Features
Model assumption: Equal-tempered scale § MIDI pitches: § Piano notes: § Concert pitch: § Center frequency: Hz
Pitch Features
Logarithmic frequency distribution Octave: doubling of frequency
A2 110 Hz
A3 220 Hz
A4 440 Hz
Pitch Features
Idea: Binning of Fourier coefficients
Divide up the frequency axis into logarithmically spaced “pitch regions” and combine spectral coefficients of each
region to form a single pitch coefficient.
Pitch Features Time-frequency representation
Windowing in the time domain
Windowing in the
frequency domain
Pitch Features Details: § Let be a spectral vector obtained from a
spectrogram w.r.t. a sampling rate and a window length N. The spectral coefficient corresponds to the frequency
§ Let
be the set of coefficients assigned to a pitch Then the pitch coefficient is defined as
Pitch Features
Example: A4, p = 69 § Center frequency: § Lower bound: § Upper bound: § STFT with ,
S(p = 69)
Pitch Features
Note MIDI pitch
Center [Hz] frequency
Left [Hz] boundary
Right [Hz] boundary
Width [Hz]
A3 57 220.0 213.7 226.4 12.7 A#3 58 233.1 226.4 239.9 13.5 B3 59 246.9 239.9 254.2 14.3 C4 60 261.6 254.2 269.3 15.1 C#4 61 277.2 269.3 285.3 16.0 D4 62 293.7 285.3 302.3 17.0 D#4 63 311.1 302.3 320.2 18.0 E4 64 329.6 320.2 339.3 19.0 F4 65 349.2 339.3 359.5 20.2 F#4 66 370.0 359.5 380.8 21.4 G4 67 392.0 380.8 403.5 22.6 G#4 68 415.3 403.5 427.5 24.0 A4 69 440.0 427.5 452.9 25.4
Pitch Features
Note: § § For some pitches, S(p) may be empty. This
particularly holds for low notes corresponding to narrow frequency bands.
→ Linear frequency sampling is problematic! Solution:
Multi-resolution spectrograms or multirate filterbanks
Pitch Features
Time (seconds)
C4: 261 Hz C5: 523 Hz
C6: 1046 Hz
C7: 2093 Hz
C8: 4186 Hz
Inte
nsity
Spectrogram
Pitch Features
MID
I pitc
h
Inte
nsity
(dB
)
Time (seconds)
C4
C5
C6
C7
C8
Log-frequency spectrogram
Pitch Features
Example: Chromatic scale
Time (seconds)
Freq
uenc
y (H
z)
Inte
nsity
(dB
)
Spectrogram
Pitch Features
Example: Chromatic scale
MID
I pitc
h
Inte
nsity
(dB
)
Log-frequency spectrogram
Time (seconds)
Chroma Features
§ Pitches are perceived as related (harmonically similar) if they differ by an octave
§ Idea: throw away information which is difficult to estimate and not so important for harmonic analysis
§ Separation of pitch into two components: tone height (octave number) and chroma
§ Chroma: 12 traditional pitch classes of the equal-tempered scale. For example: Chroma C
§ Convert pitch features into chroma features by adding up all values that belong to the same pitch class
§ Result: 12-dimensional chroma vector
Chroma Features
Shepard‘s helix of pitch perception Chromatic circle
http://en.wikipedia.org/wiki/Pitch_class_space
[Bartsch/Wakefield, IEEE-TMM 2005] [Gómez, PhD 2006]
Chroma Features
Example: Chromatic scale
MID
I pitc
h
Inte
nsity
(dB
)
Log-frequency spectrogram
Time (seconds)
Chroma Features
Example: Chromatic scale
Chr
oma
Inte
nsity
(dB
)
Chroma representation
Time (seconds)
Chroma Features
Example: Chromatic scale
Inte
nsity
(nor
mal
ized
)
Chroma representation (normalized, Euclidean)
Time (seconds)
Chr
oma
Chroma Features M
IDI p
itch
Inte
nsity
(dB
)
Time (seconds)
Log-frequency spectrogram
Example: Friedrich Burgmüller, Op. 100, No. 2
Chroma Features
Chroma representation
Inte
nsity
(dB
)
Chr
oma
Time (seconds)
Example: Friedrich Burgmüller, Op. 100, No. 2
Chroma Features
Chroma representation (normalized)
Inte
nsity
(nor
mal
ized
)
Chr
oma
Time (seconds)
Example: Friedrich Burgmüller, Op. 100, No. 2
§ Sequence of chroma vectors correlates to the harmonic progression
§ Normalization makes features invariant to
changes in dynamics
§ Further quantization and smoothing
§ Taking logarithm before adding up pitch coefficients accounts for logarithmic sensation of intensity
Chroma Features
Chroma Features
Karajan
Time (seconds) Time (seconds)
Feature sampling rate: 10 Hz
Example: Beethoven‘s Fifth Chroma representation
Scherbakov
Chroma Features
Time (seconds) Time (seconds) Time (seconds) Time (seconds)
Example: Beethoven‘s Fifth
Karajan Scherbakov
Feature sampling rate: 10 Hz Chroma representation (normalized)
Chroma Features
Time (seconds) Time (seconds) Time (seconds) Time (seconds)
Example: Beethoven‘s Fifth
Karajan Scherbakov
Feature sampling rate: 2 Hz Chroma representation (normalized)
Chroma Features
Time (seconds) Time (seconds) Time (seconds) Time (seconds)
Example: Beethoven‘s Fifth
Karajan Scherbakov
Feature sampling rate: 1 Hz Chroma representation (normalized)
Chroma Features Example: Zager & Evans “In The Year 2525”
How to deal with transpositions?
[Goto, ICASSP 2003]
Application: Audio Matching
Given: Large music database containing several – recordings of the same piece of music – interpretations by various musicians – arrangements in different instrumentations
Goal: Given a short query audio clip, identify all corresponding audio clips of similar musical content
– irrespective of the specific interpretation and instrumentation
– automatically and efficiently
Query-by-Example paradigm
[Kurth/Müller, IEEE-TASLP 2008]
Application: Audio Matching
§ Allamanche/Herre/Fröba/Cremer (AES 2001) § Cano/Battle/Kalker/Haitsma (IEEE MMSP 2002) § Kurth/Clausen/Ribbrock (AES 2002) § Wang (ISMIR 2003) § Shrestha/Kalker (ISMIR 2004)
Audio Identification/Fingerprinting
Audio/Music Matching § Pickens et al. (ISMIR 2002) § Müller/Kurth/Clausen (ISMIR 2005) § Kurth/Müller (IEEE-TASLP 2008) § Müller/Ewert (IEEE-TASLP 2010)
Cover Song Identification § Casey/Slaney (ISMIR 2006) § Ellis/Poliner (ICASSP 2007) § Serrà/Gómez/Serra (IEEE-TASLP 2008) § Serrà/Serra/Andrzehak (New J. Pysics 2009)
Application: Audio Matching
§ Allamanche/Herre/Fröba/Cremer (AES 2001) § Cano/Battle/Kalker/Haitsma (IEEE MMSP 2002) § Kurth/Clausen/Ribbrock (AES 2002) § Wang (ISMIR 2003) § Shrestha/Kalker (ISMIR 2004)
Audio Identification/Fingerprinting
Audio/Music Matching § Pickens et al. (ISMIR 2002) § Müller/Kurth/Clausen (ISMIR 2005) § Kurth/Müller (IEEE-TASLP 2008) § Müller/Ewert (IEEE-TASLP 2010)
Cover Song Identification § Casey/Slaney (ISMIR 2006) § Ellis/Poliner (ICASSP 2007) § Serrà/Gómez/Serra (IEEE-TASLP 2008) § Serrà/Serra/Andrzehak (New J. Pysics 2009)
High specificity
Low specificity
Application: Audio Matching
§ Allamanche/Herre/Fröba/Cremer (AES 2001) § Cano/Battle/Kalker/Haitsma (IEEE MMSP 2002) § Kurth/Clausen/Ribbrock (AES 2002) § Wang (ISMIR 2003) § Shrestha/Kalker (ISMIR 2004)
Audio Identification/Fingerprinting
Audio/Music Matching § Pickens et al. (ISMIR 2002) § Müller/Kurth/Clausen (ISMIR 2005) § Kurth/Müller (IEEE-TASLP 2008) § Müller/Ewert (IEEE-TASLP 2010)
Cover Song Identification § Casey/Slaney (ISMIR 2006) § Ellis/Poliner (ICASSP 2007) § Serrà/Gómez/Serra (IEEE-TASLP 2008) § Serrà/Serra/Andrzehak (New J. Pysics 2009)
High specificity
Low specificity
Fragment level
Document level
Application: Audio Matching
Four occurrences of the main theme
Third occurrence First occurrence
1 2 3 4
Time (seconds)
[Kurth/Müller, IEEE-TASLP 2008]
Chroma Features
First occurrence Third occurrence
How to make chroma features more robust to timbre changes?
Time (seconds) Time (seconds)
Chr
oma
scal
e
Chroma Features
First occurrence Third occurrence
How to make chroma features more robust to timbre changes? Idea: Discard timbre-related information
Time (seconds) Time (seconds)
Chr
oma
scal
e
[Müller/Ewert, IEEE-TASLP 2010]
MFCC Features and Timbre
Lower MFCCs Timbre
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
MFC
C c
oeffi
cien
t
MFCC Features and Timbre
Idea: Discard lower MFCCs to achieve timbre invariance
Lower MFCCs Timbre
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
MFC
C c
oeffi
cien
t
Enhancing Timbre Invariance
Short-Time Pitch Energy
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram Steps:
Enhancing Timbre Invariance
Log Short-Time Pitch Energy
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude)
Steps:
Enhancing Timbre Invariance
PFCC
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT
Steps:
Enhancing Timbre Invariance
PFCC
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT 4. Discard lower coefficients
[1:n-1]
Steps:
Enhancing Timbre Invariance
PFCC
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT 4. Keep upper coefficients
[n:120]
Steps:
Enhancing Timbre Invariance
Pitc
h sc
ale
Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT 4. Keep upper coefficients
[n:120] 5. Inverse DCT
Steps:
Enhancing Timbre Invariance
Chr
oma
scal
e
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT 4. Keep upper coefficients
[n:120] 5. Inverse DCT 6. Chroma & Normalization
Steps:
Enhancing Timbre Invariance
1. Log-frequency spectrogram 2. Log (amplitude) 3. DCT 4. Keep upper coefficients
[n:120] 5. Inverse DCT 6. Chroma & Normalization
Steps:
Chroma DCT-Reduced Log-Pitch
CRP(n)
Time (seconds)
Chr
oma
scal
e
[Müller/Ewert, IEEE-TASLP 2010]
Chroma versus CRP Shostakovich Waltz
Third occurrence First occurrence
Chroma
Time (seconds) Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Chroma versus CRP Shostakovich Waltz
Third occurrence First occurrence
Chroma
CRP(55) n = 55
Time (seconds) Time (seconds) [Müller/Ewert, IEEE-TASLP 2010]
Application: Audio Matching
0 50 100 150 200 250 300 350 400 4500
0.1
0.2
0.3
0.4
0.5
0.6
Query
DB
Toccata (Bach) Waltz (Shostakovich) Yesterday (The Beatles)
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Application: Audio Matching
0 50 100 150 200 250 300 350 400 4500
0.1
0.2
0.3
0.4
0.5
0.6
Query
DB
Toccata (Bach) Waltz (Shostakovich) Yesterday (The Beatles)
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Application: Audio Matching
0 50 100 150 200 250 300 350 400 4500
0.1
0.2
0.3
0.4
0.5
0.6
Query
DB
Toccata (Bach) Waltz (Shostakovich) Yesterday (The Beatles)
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Application: Audio Matching
0 50 100 150 200 250 300 350 400 4500
0.1
0.2
0.3
0.4
0.5
0.6
Query
DB
Toccata (Bach) Waltz (Shostakovich) Yesterday (The Beatles)
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Matching Quality
Example: Shostakovich Waltz
Time (seconds)
Standard Chroma (Chroma Pitch)
CRP(55)
[Müller/Ewert, IEEE-TASLP 2010]
Matching Quality
Standard Chroma (Chroma Pitch)
CRP(55)
Example: Shostakovich Waltz
Time (seconds)
[Müller/Ewert, IEEE-TASLP 2010]
Matching Quality
Example: Free in you (Indigo Girls)
Time (seconds)
Standard Chroma (Chroma Pitch)
CRP(55)
[Müller/Ewert, IEEE-TASLP 2010]
Rank Piece Position 1 Shostakovich/Chailly 0 - 21 2 Shostakovich/Chailly 41- 60 3 Shostakovich/Chailly 180 - 198 4 Shostakovich/Yablonsky 1 - 19 5 Shostakovich/Yablonsky 36 - 52 6 Shostakovich/Yablonsky 156 - 174 7 Shostakovich/Chailly 144 - 162 8 Bach BWV 582/Chorzempa 358 - 373 9 Beethoven Op. 37,1/Toscanini 12 - 28
10 Beethoven Op. 37,1/Pollini 202 - 218
Matching Quality
Query: Shostakovich, Waltz/Chailly, first 21 seconds
Application: Audio Structure Analysis
Given: CD recording
Goal: Automatic extraction of the repetitive structure (or of the musical form) Example: Brahms Hungarian Dance No. 5 (Ormandy)
[Dannenberg/Hu, ISMIR 2002] [Paulus et al., ISMIR 2010]
Application: Chord Recognition
[Sheh/Ellis, ISMIR 2003] [Ueda et al., ICASSP 2010] [Mauch/Dixon, IEEE-TASLP 2010]
Application: Chord Recognition
[Sheh/Ellis, ISMIR 2003] [Ueda et al., ICASSP 2010] [Mauch/Dixon, IEEE-TASLP 2010]
Correct
Ground truth
False
Application: Chord Recognition
[Sheh/Ellis, ISMIR 2003] [Ueda et al., ICASSP 2010] [Mauch/Dixon, IEEE-TASLP 2010]
Correct
Ground truth
False
Application: Chord Recognition
Dependency of recognition rates on chroma type
Feature type
F-m
easu
re
Template-based method using binary templates Template-based method using average templates HMM-based method
Application: Cover Song Identification
Goal: Given a music recording of a song or piece of music, find all corresponding music recordings within a huge collection that can be regarded as a kind of version, interpretation, or cover song.
Instance of document-based retrieval!
§ Live versions § Versions adapted to particular country/region/language § Contemporary versions of an old song § Radically different interpretations of a musical piece
[Serrà et al., IEEE-TASLP 2009] [Ellis/Poliner, ICASSP 2007]
Application: Cover Song Identification Query: Bob Dylan – Knockin’ on Heaven’s Door Retrieval result:
Rank Recording Score
1. Guns and Roses: Knockin‘ On Heaven’s Door 94.2
2. Avril Lavigne: Knockin‘ On Heaven’s Door 86.6 3. Wyclef Jean: Knockin‘ On Heaven’s Door 83.8 4. Bob Dylan: Not For You 65.4 5. Guns and Roses: Patience 61.8 6. Bob Dylan: Like A Rolling Stone 57.2 7.-14. …
[Serrà et al., IEEE-TASLP 2009] [Ellis/Poliner, ICASSP 2007]
§ Chroma features capture harmonic information § High robustness to changes in timbre and
instrumentation § Many chroma variants with different properties
§ Various implementations publically available
Conclusions