GCT634: Musical Applications of Machine Learning
Tonal Analysis
Hidden Markov Model
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
• Introduction
- Tonality
- Perceptual Distance of Two Tones
- Chords and Scales
• Tonal Analysis
- Key Estimation
- Chord Recognition
• Hidden Markov Model
Tonality
• Tonal music has a tonal center called the key
- 12 keys (C, C#, D, …, B)
• Tonal music uses a major or minor scale built on the key, and the notes have different roles
• Notes in tonal music are harmonized by chords
(C major scale)
Tonality
• A sequence of notes or a chord progression provides a certain degree of stability or instability
- E.g., cadence (V-I, IV-I), tension (sus2, sus4)
• How is tonality formed?
- In other words, how do we perceive different degrees of stability or tension from notes?
Tonality
• Consonance and dissonance
- If two sinusoidal tones are within 3 semitones (a minor 3rd) of each other in frequency, they sound dissonant
- Dissonance is greatest when the tones are about one quarter of the critical band apart
- Critical bands become wider below 500 Hz, so two low notes can sound dissonant (e.g. two piano notes in the low register)
• Consonance of two harmonic tones
- Determined by how many closely located overtones of the two tones fall within the same critical band
Consonance Rating of Intervals in Music
• The perceptual distance between two notes is different from the semitone distance between them.
Chords
• The basic units of tonal harmony
- Triads, 7ths, 9ths, 11ths, …
• Triads are formed by choosing three notes that make the most consonant (or "most harmonized") sound
- This amounts to stacking major or minor 3rds
- 7th and 9th chords are obtained by stacking further 3rds
• The quality of consonance becomes more sophisticated as more notes are added
- Music theory is largely about how to create tension and resolve it with different qualities of consonance
Scales in Tonal Harmony
• Major scale
- Formed by spreading out the notes of three major chords
• Minor scale
- Formed by spreading out the notes of three minor chords (natural minor scale)
- The harmonic or melodic minor scale can be formed by using both minor and major chords
Automatic Chord Recognition
• Identifying the chord progression of tonal music
• It is a challenging task (even for humans)
- Chords are not explicit in the music
- Non-chord or passing notes
- Key changes and chromaticism require in-depth knowledge of music theory
- In audio, multiple musical instruments are mixed
- Relevant: harmonically arranged notes
- Irrelevant: percussive sounds (though they can help detect chord changes)
• What kind of audio features can be extracted to recognize chords robustly?
Chroma Features: FFT-based approach
• Compute a spectrogram and a mapping matrix
- Convert frequency to the musical pitch scale and take the pitch class
- Set the corresponding pitch-class entry to one, and all others to zero
- Adjust the non-zero values so that low-frequency content has more weight
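The mapping above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the slides' exact implementation: the function names, the 55-2000 Hz range, and the 1/f weighting (as a stand-in for "more weight on low-frequency content") are all assumptions.

```python
import numpy as np

def chroma_map(n_fft, sr, fmin=55.0, fmax=2000.0):
    """Build a (12 x n_bins) mapping matrix from FFT bins to pitch classes.
    Each bin is assigned to the pitch class of its nearest note; the 1/f
    weight is one way to emphasize low-frequency content."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft
    M = np.zeros((12, n_bins))
    for k, f in enumerate(freqs):
        if f < fmin or f > fmax:
            continue
        midi = 69 + 12 * np.log2(f / 440.0)   # frequency -> MIDI pitch number
        pc = int(round(midi)) % 12            # pitch class 0..11 (0 = C)
        M[pc, k] = 1.0 / f                    # low frequencies get more weight
    return M

def chroma(spec, M):
    """spec: (n_bins x n_frames) magnitude spectrogram.
    Returns a (12 x n_frames) chromagram, normalized per frame."""
    C = M @ spec
    return C / np.maximum(C.max(axis=0, keepdims=True), 1e-12)
```

For a 440 Hz sine tone, the resulting chroma vector peaks at pitch class 9 (A).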
Chroma Features: Filter-bank approach
• A filter bank can be used to obtain a log-scale time-frequency representation
- Center frequencies are arranged over the 88 piano notes
- Bandwidths are set to be constant-Q and robust to ±25 cents of detuning
• The outputs that belong to the same pitch class are wrapped and summed.
(Müller, 2011)
Beat-Synchronous Chroma Features
• Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)
(From Ellis’ slides)
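Beat-synchronous averaging is simple once beat positions are known. A minimal sketch, assuming the beats come from some external beat tracker as frame indices (the function name is illustrative):

```python
import numpy as np

def beat_sync_chroma(C, beat_frames):
    """Average a (12 x n_frames) chromagram within each beat interval.
    beat_frames: sorted frame indices of detected beats."""
    bounds = np.concatenate([[0], beat_frames, [C.shape[1]]]).astype(int)
    segs = [C[:, s:e].mean(axis=1)              # one mean vector per interval
            for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    return np.stack(segs, axis=1)               # (12 x number_of_intervals)
```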
Key Estimation Overview
• Estimate the musical key from music data
- One of 24 keys: 12 pitch classes (C, C#, D, …, B) × major/minor
• General Framework (Gómez, 2006)
(Diagram: Chroma Features → Average → Similarity Measure against each Key Template → Key Strength → estimated key, e.g. "G major")
• Probe tone profile (Krumhansl and Kessler, 1982)
- Relative stability or weight of tones
- Listeners rated which tones best completed the first seven notes of a major scale
- For example, in the key of C major: C, D, E, F, G, A, B, … what?
Probe Tone Profile - Relative Pitch Ranking
Key Estimation
• Compute the similarity by cross-correlating the chroma features with the key templates
• Pick the key that produces the maximum correlation
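The whole pipeline (average chroma → correlate with 24 rotated templates → pick the maximum) fits in a short function. The profile values below are the Krumhansl-Kessler probe-tone ratings as commonly reported; treat the exact numbers as transcribed rather than verified, and the function name as illustrative:

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles (values as commonly reported)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
KEYS = [n + m for m in ('', 'm') for n in
        ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']]

def estimate_key(avg_chroma):
    """Correlate a 12-dim average chroma vector with the 24 rotated key
    templates and return the best-matching key name."""
    scores = []
    for profile in (MAJOR, MINOR):
        for shift in range(12):              # rotate the template to each tonic
            scores.append(np.corrcoef(avg_chroma, np.roll(profile, shift))[0, 1])
    return KEYS[int(np.argmax(scores))]
```

Feeding the C-major profile itself as "chroma" returns C; rotating it up seven semitones returns G.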
Chord Recognition
• Estimate chords from music data
- Typically one of 24 chords: 12 pitch classes × major/minor
- Often diminished chords are added (36 chords)
• General Framework
(Diagram: Audio → Transform → Chroma Features → Decision Making → Chords, where the decision step matches against chord templates or models: template matching, HMM, SVM)
Template-Based Approach
• Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches
• Chord Templates
(from Bello’s Slides)
Template-Based Approach
• Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value at each frame
(from Bello’s Slides)
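A minimal version of this matching step: binary triad templates are built by setting the root, third, and fifth to one, and each frame is labeled with the best-correlated template. Function names are illustrative:

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """24 binary triad templates: root, third, and fifth set to one."""
    names, T = [], []
    for quality, third in (('', 4), ('m', 3)):   # major 3rd = 4 ST, minor = 3 ST
        for root in range(12):
            v = np.zeros(12)
            v[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            names.append(NOTES[root] + quality)
            T.append(v)
    return names, np.stack(T)

def recognize(chromagram):
    """Frame-wise chord labels by maximum normalized correlation."""
    names, T = chord_templates()
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = chromagram / np.maximum(np.linalg.norm(chromagram, axis=0), 1e-12)
    return [names[i] for i in (Tn @ Cn).argmax(axis=0)]
```

A frame with only C, E, G active is labeled C; one with A, C, E is labeled Am.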
Review
• The template approach is too simplistic
- The binary templates are hard assignments
• We can use a multi-class classifier instead
- The output is one of the target chords
- However, the local estimates tend not to be temporally smooth
• We need an algorithm that considers the temporal dependency between chords
- The majority of tonal music follows certain types of chord progressions
Hidden Markov Model (HMM)
• A probabilistic model for time-series data
- Speech, gestures, DNA sequences, financial data, weather data, …
• Assumes the time-series data are generated from hidden states, and the hidden states follow a Markov model
• Learning-based approach
- Needs training data annotated with labels
- The labels usually correspond to the hidden states
Markov Model
• A random variable q has N states (S_1, S_2, …, S_N), and at each time step one of the states is chosen at random: q_t ∈ {S_1, S_2, …, S_N}
• The probability distribution of the current state is determined by the previous state(s)
- First-order: P(q_t | q_1, q_2, …, q_{t-1}) = P(q_t | q_{t-1})
- Second-order: P(q_t | q_1, q_2, …, q_{t-1}) = P(q_t | q_{t-1}, q_{t-2})
• The first-order Markov model is widely used for simplicity
Markov Model
• Example: chord progression
- q_t ∈ {C, F, G}
- The transition probability matrix is 3 × 3
(State diagram: start (St) → {C, F, G} → End)
P(q_t = C | q_{t-1} = C) = 0.7
P(q_t = F | q_{t-1} = C) = 0.1
P(q_t = G | q_{t-1} = C) = 0.2
P(q_t = C | q_{t-1} = F) = 0.2
P(q_t = F | q_{t-1} = F) = 0.6
P(q_t = G | q_{t-1} = F) = 0.2
P(q_t = C | q_{t-1} = G) = 0.3
P(q_t = F | q_{t-1} = G) = 0.1
P(q_t = G | q_{t-1} = G) = 0.6
Markov Model
• With the Markov model, the joint probability of a sequence of states is simple to compute:
P(q_1, q_2, …, q_t) = P(q_1, q_2, …, q_{t-1}) P(q_t | q_1, q_2, …, q_{t-1})
= P(q_1, q_2, …, q_{t-1}) P(q_t | q_{t-1})
= P(q_1, q_2, …, q_{t-2}) P(q_{t-1} | q_1, q_2, …, q_{t-2}) P(q_t | q_{t-1})
= P(q_1, q_2, …, q_{t-2}) P(q_{t-1} | q_{t-2}) P(q_t | q_{t-1})
= P(q_1) P(q_2 | q_1) … P(q_{t-1} | q_{t-2}) P(q_t | q_{t-1})
What Can We Do with the Markov Model?
• Generate a chord sequence
- e.g., C – C – C – C – F – F – C – C – G – G – C – C – …
- We can also generate melody if we define a transition probability matrix over notes
• Evaluate whether a specific chord progression is more likely than others
- For example, C-G-C is more likely than C-F-C (assuming P(q_1 = C) = 1):
P(q = C, G, C) = P(q_1 = C) P(q_2 = G | q_1 = C) P(q_3 = C | q_2 = G) = 0.2 × 0.3 = 0.06
P(q = C, F, C) = P(q_1 = C) P(q_2 = F | q_1 = C) P(q_3 = C | q_2 = F) = 0.1 × 0.2 = 0.02
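The comparison above can be checked directly in code. A small sketch using the transition matrix from the earlier example, assuming P(q_1 = C) = 1:

```python
import numpy as np

STATES = ['C', 'F', 'G']
# A[i, j] = P(q_t = STATES[j] | q_{t-1} = STATES[i]), from the example slide
A = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.1, 0.6]])

def seq_prob(chords, start='C'):
    """Probability of a chord sequence under the first-order Markov model."""
    idx = [STATES.index(c) for c in chords]
    p = 1.0 if chords[0] == start else 0.0   # P(q_1 = start) = 1
    for i, j in zip(idx[:-1], idx[1:]):
        p *= A[i, j]                         # multiply the transition terms
    return p
```

As on the slide, C-G-C scores 0.06 and C-F-C scores 0.02.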
What Can We Do with a Markov Model?
• Compute the probability that the chord at time T is C (or F, or G)
- Naïve method: sum over all paths that end with a C chord at time T: exponential!
- Clever method: use recursive induction:
P(q_T = C) = P(q_T = C | q_{T-1} = C) P(q_{T-1} = C)
+ P(q_T = C | q_{T-1} = F) P(q_{T-1} = F)
+ P(q_T = C | q_{T-1} = G) P(q_{T-1} = G)
- Repeat this for P(q_i = C), P(q_i = F), P(q_i = G) for i = T − 1, T − 2, …, 1
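The recursive induction amounts to repeatedly multiplying a probability vector by the transition matrix. A sketch with the same example matrix (names are illustrative):

```python
import numpy as np

A = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.1, 0.6]])   # example C/F/G transition matrix

def state_marginal(T, p1=(1.0, 0.0, 0.0)):
    """P(q_T = each state), propagating the distribution forward step by step
    instead of summing over all 3**(T-1) paths."""
    p = np.asarray(p1, dtype=float)
    for _ in range(T - 1):
        p = p @ A   # p_new[j] = sum_i p[i] * A[i, j], the induction step
    return p
```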
Chord Recognition from Audio
• What we observe are not chords but audio features (e.g. chroma)
• We want to infer the chord sequence from the audio feature sequence
Hidden states: q_1, q_2, …, q_T
Observations: O_1, O_2, …, O_T
Hidden Markov Model (HMM)
• The hidden states follow the Markov model
• Given a state, the corresponding observation distribution is independent of the previous states and observations
- Each state has an emission distribution
(Graphical model: … → q_{t-1} → q_t → q_{t+1} → …, with each state emitting an observation O_{t-1}, O_t, O_{t+1})
(For the chord example, the emission distributions are P(O | q_t = C), P(O | q_t = F), P(O | q_t = G))
Hidden Markov Model (HMM)
• Model parameters
- Initial state probabilities: P(q_1) → π_i
- Transition probability matrix: P(q_t | q_{t-1}) → a_ij
- Observation distribution given a state: P(O | q_j) → b_j (e.g. Gaussian)
• How can we learn the parameters from data?
Training HMM for Chord Recognition
• If chord labels are time-aligned with the audio, estimate the parameters directly from the data
- Initial state probabilities and transition probability matrix: count chords and chord-to-chord transitions
- Observation distribution: fit a Gaussian model to the audio features separately for each chord
- Easy to train, but time-aligned data are very expensive to obtain
• If chord labels are not aligned with the audio, we must use maximum-likelihood estimation
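For the aligned case, the counting and Gaussian fitting described above can be sketched as follows. This is a single-sequence simplification with illustrative names; the initial-state estimate here is just the first label:

```python
import numpy as np

def train_supervised(features, labels, n_states):
    """Estimate HMM parameters from time-aligned (feature, chord-label) frames.
    features: (T x D) array, labels: length-T state indices in [0, n_states)."""
    labels = np.asarray(labels)
    # Initial state: one-hot on the first frame's label (single sequence)
    pi = np.bincount(labels[:1], minlength=n_states).astype(float)
    # Count chord-to-chord transitions; a small floor avoids zero rows
    A = np.full((n_states, n_states), 1e-6)
    for i, j in zip(labels[:-1], labels[1:]):
        A[i, j] += 1
    A /= A.sum(axis=1, keepdims=True)
    # One Gaussian (mean + diagonal variance) per chord state
    means = np.stack([features[labels == s].mean(axis=0) for s in range(n_states)])
    varis = np.stack([features[labels == s].var(axis=0) + 1e-6 for s in range(n_states)])
    return pi, A, means, varis
```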
Training HMM: EM algorithm
• If chord labels are not aligned with the audio, use the EM algorithm (the Baum-Welch method)
• E-step: evaluate the probability of transitioning from state S_i at time t to state S_j at time t + 1, given the observations:
ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
- Then the probability of being in state S_i at time t can also be derived:
γ_t(i) = P(q_t = S_i | O, λ) = Σ_{j=1}^{N} ξ_t(i, j)
Training HMM: EM algorithm
• M-step: update the parameters so that they maximize the expected log-likelihood given the E-step posteriors
- Σ_{t=1}^{T-1} γ_t(i): expected number of transitions from S_i (i.e. how many times state S_i is visited from t = 1 to T − 1)
- Σ_{t=1}^{T-1} ξ_t(i, j): expected number of transitions from S_i to S_j
• We can use the labels to constrain the model (e.g. for initialization)
π_i = γ_1(i) = expected frequency in state S_i at time t = 1
a_ij = (Σ_{t=1}^{T-1} ξ_t(i, j)) / (Σ_{t=1}^{T-1} γ_t(i)) = expected number of transitions from S_i to S_j / expected number of transitions from S_i
b_j(k) = (Σ_{t: O_t = v_k} γ_t(j)) / (Σ_{t=1}^{T} γ_t(j)) = expected number of times in state S_j observing v_k / expected number of times in state S_j
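Given the E-step posteriors γ and ξ, the M-step updates are direct ratios. A sketch for the discrete-observation case (the slides use Gaussians for audio; discrete symbols keep the b_j(k) update simple, and the function name is illustrative):

```python
import numpy as np

def m_step(gamma, xi, obs, n_symbols):
    """Re-estimate discrete-HMM parameters from E-step posteriors.
    gamma: (T x N) state posteriors, xi: (T-1 x N x N) pair posteriors,
    obs: length-T symbol indices."""
    pi = gamma[0]                                            # pi_i = gamma_1(i)
    # a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i), sums over t = 1..T-1
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_j(k) = expected count of emitting v_k in state j / expected time in j
    B = np.zeros((gamma.shape[1], n_symbols))
    for t, o in enumerate(obs):
        B[:, o] += gamma[t]
    B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```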
Evaluating HMM for Chord Recognition
• Find the most likely sequence of hidden states given the observations and the HMM parameters
• Viterbi algorithm
- Define a probability variable:
δ_t(i) = max_{q_1, q_2, …, q_{t-1}} P(q_1, q_2, …, q_t = S_i, O_1, O_2, …, O_t | λ)
- Initialization (from the start state):
δ_1(i) = π_i b_i(O_1), ψ_1(i) = 0, 1 ≤ i ≤ N
- Recursion:
δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij] b_j(O_t)
ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
- Termination (to the end state):
P* = max_{1 ≤ i ≤ N} δ_T(i)
q_T* = argmax_{1 ≤ i ≤ N} δ_T(i)
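The Viterbi recursion maps directly to code. A log-domain sketch (log probabilities avoid underflow for long sequences; the interface, taking precomputed log observation likelihoods, is an assumption):

```python
import numpy as np

def viterbi(pi, A, logB):
    """Most likely state path.
    pi: (N,) initial probabilities, A: (N x N) transition matrix,
    logB: (T x N) log observation likelihoods log b_j(O_t)."""
    T, N = logB.shape
    logA = np.log(A + 1e-300)
    delta = np.log(pi + 1e-300) + logB[0]      # initialization: pi_i b_i(O_1)
    psi = np.zeros((T, N), dtype=int)          # backpointers
    for t in range(1, T):
        cand = delta[:, None] + logA           # cand[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[t]     # recursion step
    path = [int(delta.argmax())]               # termination: best final state
    for t in range(T - 1, 0, -1):              # backtrack through psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With sticky transitions and observations that switch preference halfway through, the decoded path changes state exactly once.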
The Viterbi Trellis
• Recall dynamic programming!
(Figure: the Viterbi trellis. Columns t = 1, 2, 3, …, T − 1, T each contain the states C, F, G, connected from a start node (St) to an End node; the values v_1(j), v_2(j), …, v_T(j) are propagated left to right along the best incoming paths.)
Chord Recognition Result
• The HMM produces a smoother chord recognition output
(From Ellis’ E4896 practicals)
References
• P. R. Cook (Editor), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
• C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
• M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
• E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive Versus Machine Learning Modeling Strategies", 2004
• M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
• T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999
• A. Sheh and D. Ellis, "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models", 2003
• L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", 1989