GCT634: Musical Applications of Machine Learning
Tonal Analysis
Hidden Markov Model
Graduate School of Culture Technology, KAIST
Juhan Nam
Outline
• Introduction
- Tonality
- Perceptual Distance of Two Tones
- Chords and Scales
• Tonal Analysis
- Key Estimation
- Chord Recognition
• Hidden Markov Model
Tonality
• Tonal music has a tonal center called the key
- 12 keys (C, C#, D, …, B)
• Tonal music uses a major or minor scale built on the key, and the notes have different roles
• Notes in tonal music are harmonized by chords
(C major scale)
Tonality
• A sequence of notes or a chord progression provides a certain degree of stability or instability
- E.g., cadence (V-I, IV-I), tension (sus2, sus4)
• How is tonality formed?
- In other words, how do we perceive different degrees of stability or tension from notes?
Tonality
• Consonance and dissonance
- If two sinusoidal tones are within 3 semitones (a minor 3rd) of each other in frequency, they sound dissonant
- Dissonance is greatest when the tones are about one quarter of the critical band apart
- Critical bands become wider below 500 Hz, so two low notes can sound dissonant (e.g. two piano notes in the low register)
• Consonance of two harmonic tones
- Determined by how many closely located overtones of the two tones fall within the same critical band
Consonance Rating of Intervals in Music
• The perceptual distance between two notes is different from the semitone distance between them.
Chords
• The basic units of tonal harmony
- Triads, 7ths, 9ths, 11ths, …
• Triads are formed by choosing three notes that make the most consonant (or "most harmonized") sound
- This amounts to stacking major or minor 3rds
- 7th and 9th chords are obtained by stacking further 3rds
• The quality of consonance becomes more sophisticated as more notes are added
- Music theory is largely about how to create tension and resolve it with different qualities of consonance
Scales in Tonal Harmony
• Major scale
- Formed by spreading out the notes of three major chords
• Minor scale
- Formed by spreading out the notes of three minor chords (natural minor scale)
- The harmonic or melodic minor scale can be formed by using both minor and major chords
Automatic Chord Recognition
• Identifying the chord progression of tonal music
• It is a challenging task (even for humans)
- Chords are not explicit in the music
- Non-chord or passing notes
- Key changes and chromaticism require in-depth knowledge of music theory
- In audio, multiple musical instruments are mixed
- Relevant: harmonically arranged notes
- Irrelevant: percussive sounds (though they can help detect chord changes)
• What kind of audio features can be extracted to recognize chords robustly?
Chroma Features: FFT-based approach
• Compute a spectrogram and a mapping matrix
- Convert frequency to the musical pitch scale and take the pitch class
- Set the corresponding pitch-class entry to one, and all others to zero
- Adjust the non-zero values so that low-frequency content has more weight
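The mapping above can be sketched in a few lines of NumPy. This is an illustrative reconstruction, not the slides' exact implementation: the function names, the 55-2000 Hz range, and the 1/f weighting (as a stand-in for "more weight on low-frequency content") are all assumptions.

```python
import numpy as np

def chroma_map(n_fft, sr, fmin=55.0, fmax=2000.0):
    """Build a (12 x n_bins) mapping matrix from FFT bins to pitch classes.
    Each bin is assigned to the pitch class of its nearest note; the 1/f
    weight is one way to emphasize low-frequency content."""
    n_bins = n_fft // 2 + 1
    freqs = np.arange(n_bins) * sr / n_fft
    M = np.zeros((12, n_bins))
    for k, f in enumerate(freqs):
        if f < fmin or f > fmax:
            continue
        midi = 69 + 12 * np.log2(f / 440.0)   # frequency -> MIDI pitch number
        pc = int(round(midi)) % 12            # pitch class 0..11 (0 = C)
        M[pc, k] = 1.0 / f                    # low frequencies get more weight
    return M

def chroma(spec, M):
    """spec: (n_bins x n_frames) magnitude spectrogram.
    Returns a (12 x n_frames) chromagram, normalized per frame."""
    C = M @ spec
    return C / np.maximum(C.max(axis=0, keepdims=True), 1e-12)
```

For a 440 Hz sine tone, the resulting chroma vector peaks at pitch class 9 (A).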
Chroma Features: Filter-bank approach
• A filter bank can be used to obtain a log-scale time-frequency representation
- Center frequencies are arranged over the 88 piano notes
- Bandwidths are set to be constant-Q and robust to ±25 cents of detuning
• The outputs that belong to the same pitch class are wrapped and summed.
(Müller, 2011)
Beat-Synchronous Chroma Features
• Make chroma features homogeneous within a beat (Bartsch and Wakefield, 2001)
(From Ellis’ slides)
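Beat-synchronous averaging is simple once beat positions are known. A minimal sketch, assuming the beats come from some external beat tracker as frame indices (the function name is illustrative):

```python
import numpy as np

def beat_sync_chroma(C, beat_frames):
    """Average a (12 x n_frames) chromagram within each beat interval.
    beat_frames: sorted frame indices of detected beats."""
    bounds = np.concatenate([[0], beat_frames, [C.shape[1]]]).astype(int)
    segs = [C[:, s:e].mean(axis=1)              # one mean vector per interval
            for s, e in zip(bounds[:-1], bounds[1:]) if e > s]
    return np.stack(segs, axis=1)               # (12 x number_of_intervals)
```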
Key Estimation Overview
• Estimate the musical key from music data
- One of 24 keys: 12 pitch classes (C, C#, D, …, B) × major/minor
• General Framework (Gómez, 2006)
(Diagram: Chroma Features → Average → Similarity Measure against each Key Template → Key Strength → estimated key, e.g. "G major")
• Probe tone profile (Krumhansl and Kessler, 1982)
- Relative stability or weight of tones
- Listeners rated which tones best completed the first seven notes of a major scale
- For example, in the key of C major: C, D, E, F, G, A, B, … what?
Probe Tone Profile - Relative Pitch Ranking
Key Estimation
• Compute the similarity by cross-correlating the chroma features with the key templates
• Pick the key that produces the maximum correlation
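The whole pipeline (average chroma → correlate with 24 rotated templates → pick the maximum) fits in a short function. The profile values below are the Krumhansl-Kessler probe-tone ratings as commonly reported; treat the exact numbers as transcribed rather than verified, and the function name as illustrative:

```python
import numpy as np

# Krumhansl-Kessler probe-tone profiles (values as commonly reported)
MAJOR = np.array([6.35, 2.23, 3.48, 2.33, 4.38, 4.09,
                  2.52, 5.19, 2.39, 3.66, 2.29, 2.88])
MINOR = np.array([6.33, 2.68, 3.52, 5.38, 2.60, 3.53,
                  2.54, 4.75, 3.98, 2.69, 3.34, 3.17])
KEYS = [n + m for m in ('', 'm') for n in
        ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']]

def estimate_key(avg_chroma):
    """Correlate a 12-dim average chroma vector with the 24 rotated key
    templates and return the best-matching key name."""
    scores = []
    for profile in (MAJOR, MINOR):
        for shift in range(12):              # rotate the template to each tonic
            scores.append(np.corrcoef(avg_chroma, np.roll(profile, shift))[0, 1])
    return KEYS[int(np.argmax(scores))]
```

Feeding the C-major profile itself as "chroma" returns C; rotating it up seven semitones returns G.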
Chord Recognition
• Estimate chords from music data
- Typically one of 24 chords: 12 pitch classes × major/minor
- Often diminished chords are added (36 chords)
• General Framework
(Diagram: Audio → Transform → Chroma Features → Decision Making → Chords, where the decision step matches against chord templates or models: template matching, HMM, SVM)
Template-Based Approach
• Use chord templates (Fujishima, 1999; Harte and Sandler, 2005) and find the best matches
• Chord Templates
(from Bello’s Slides)
Template-Based Approach
• Compute the cross-correlation between the chroma features and the chord templates, and select the chord with the maximum value at each frame
(from Bello’s Slides)
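A minimal version of this matching step: binary triad templates are built by setting the root, third, and fifth to one, and each frame is labeled with the best-correlated template. Function names are illustrative:

```python
import numpy as np

NOTES = ['C', 'C#', 'D', 'D#', 'E', 'F', 'F#', 'G', 'G#', 'A', 'A#', 'B']

def chord_templates():
    """24 binary triad templates: root, third, and fifth set to one."""
    names, T = [], []
    for quality, third in (('', 4), ('m', 3)):   # major 3rd = 4 ST, minor = 3 ST
        for root in range(12):
            v = np.zeros(12)
            v[[root, (root + third) % 12, (root + 7) % 12]] = 1.0
            names.append(NOTES[root] + quality)
            T.append(v)
    return names, np.stack(T)

def recognize(chromagram):
    """Frame-wise chord labels by maximum normalized correlation."""
    names, T = chord_templates()
    Tn = T / np.linalg.norm(T, axis=1, keepdims=True)
    Cn = chromagram / np.maximum(np.linalg.norm(chromagram, axis=0), 1e-12)
    return [names[i] for i in (Tn @ Cn).argmax(axis=0)]
```

A frame with only C, E, G active is labeled C; one with A, C, E is labeled Am.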
Review
• The template approach is too simplistic
- The binary templates are hard assignments
• We can use a multi-class classifier instead
- The output is one of the target chords
- However, the local estimates tend not to be temporally smooth
• We need an algorithm that considers the temporal dependency between chords
- The majority of tonal music follows certain types of chord progressions
Hidden Markov Model (HMM)
• A probabilistic model for time-series data
- Speech, gestures, DNA sequences, financial data, weather data, …
• Assumes the time-series data are generated from hidden states, and the hidden states follow a Markov model
• Learning-based approach
- Needs training data annotated with labels
- The labels usually correspond to the hidden states
Markov Model
• A random variable q has N states (S_1, S_2, …, S_N), and at each time step one of the states is chosen at random: q_t ∈ {S_1, S_2, …, S_N}
• The probability distribution of the current state is determined by the previous state(s)
- First-order: P(q_t | q_1, q_2, …, q_{t-1}) = P(q_t | q_{t-1})
- Second-order: P(q_t | q_1, q_2, …, q_{t-1}) = P(q_t | q_{t-1}, q_{t-2})
• The first-order Markov model is widely used for simplicity
Markov Model
• Example: chord progression
- q_t ∈ {C, F, G}
- The transition probability matrix is 3 × 3
(State diagram: start (St) → {C, F, G} → End)
P(q_t = C | q_{t-1} = C) = 0.7
P(q_t = F | q_{t-1} = C) = 0.1
P(q_t = G | q_{t-1} = C) = 0.2
P(q_t = C | q_{t-1} = F) = 0.2
P(q_t = F | q_{t-1} = F) = 0.6
P(q_t = G | q_{t-1} = F) = 0.2
P(q_t = C | q_{t-1} = G) = 0.3
P(q_t = F | q_{t-1} = G) = 0.1
P(q_t = G | q_{t-1} = G) = 0.6
Markov Model
• With the Markov model, the joint probability of a sequence of states is simple to compute:
P(q_1, q_2, …, q_t) = P(q_1, q_2, …, q_{t-1}) P(q_t | q_1, q_2, …, q_{t-1})
= P(q_1, q_2, …, q_{t-1}) P(q_t | q_{t-1})
= P(q_1, q_2, …, q_{t-2}) P(q_{t-1} | q_1, q_2, …, q_{t-2}) P(q_t | q_{t-1})
= P(q_1, q_2, …, q_{t-2}) P(q_{t-1} | q_{t-2}) P(q_t | q_{t-1})
= P(q_1) P(q_2 | q_1) … P(q_{t-1} | q_{t-2}) P(q_t | q_{t-1})
What Can We Do with the Markov Model?
• Generate a chord sequence
- e.g., C – C – C – C – F – F – C – C – G – G – C – C – …
- We can also generate melody if we define a transition probability matrix over notes
• Evaluate whether a specific chord progression is more likely than others
- For example, C-G-C is more likely than C-F-C (assuming P(q_1 = C) = 1):
P(q = C, G, C) = P(q_1 = C) P(q_2 = G | q_1 = C) P(q_3 = C | q_2 = G) = 0.2 × 0.3 = 0.06
P(q = C, F, C) = P(q_1 = C) P(q_2 = F | q_1 = C) P(q_3 = C | q_2 = F) = 0.1 × 0.2 = 0.02
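The comparison above can be checked directly in code. A small sketch using the transition matrix from the earlier example, assuming P(q_1 = C) = 1:

```python
import numpy as np

STATES = ['C', 'F', 'G']
# A[i, j] = P(q_t = STATES[j] | q_{t-1} = STATES[i]), from the example slide
A = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.1, 0.6]])

def seq_prob(chords, start='C'):
    """Probability of a chord sequence under the first-order Markov model."""
    idx = [STATES.index(c) for c in chords]
    p = 1.0 if chords[0] == start else 0.0   # P(q_1 = start) = 1
    for i, j in zip(idx[:-1], idx[1:]):
        p *= A[i, j]                         # multiply the transition terms
    return p
```

As on the slide, C-G-C scores 0.06 and C-F-C scores 0.02.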
What Can We Do with a Markov Model?
• Compute the probability that the chord at time T is C (or F, or G)
- Naïve method: sum over all paths that end with a C chord at time T: exponential!
- Clever method: use recursive induction:
P(q_T = C) = P(q_T = C | q_{T-1} = C) P(q_{T-1} = C)
+ P(q_T = C | q_{T-1} = F) P(q_{T-1} = F)
+ P(q_T = C | q_{T-1} = G) P(q_{T-1} = G)
- Repeat this for P(q_i = C), P(q_i = F), P(q_i = G) for i = T − 1, T − 2, …, 1
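The recursive induction amounts to repeatedly multiplying a probability vector by the transition matrix. A sketch with the same example matrix (names are illustrative):

```python
import numpy as np

A = np.array([[0.7, 0.1, 0.2],
              [0.2, 0.6, 0.2],
              [0.3, 0.1, 0.6]])   # example C/F/G transition matrix

def state_marginal(T, p1=(1.0, 0.0, 0.0)):
    """P(q_T = each state), propagating the distribution forward step by step
    instead of summing over all 3**(T-1) paths."""
    p = np.asarray(p1, dtype=float)
    for _ in range(T - 1):
        p = p @ A   # p_new[j] = sum_i p[i] * A[i, j], the induction step
    return p
```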
Chord Recognition from Audio
• What we observe are not chords but audio features (e.g. chroma)
• We want to infer the chord sequence from the audio feature sequence
Hidden states: q_1, q_2, …, q_T
Observations: O_1, O_2, …, O_T
Hidden Markov Model (HMM)
• The hidden states follow the Markov model
• Given a state, the corresponding observation distribution is independent of the previous states and observations
- Each state has an emission distribution
(Graphical model: … → q_{t-1} → q_t → q_{t+1} → …, with each state emitting an observation O_{t-1}, O_t, O_{t+1})
(For the chord example, the emission distributions are P(O | q_t = C), P(O | q_t = F), P(O | q_t = G))
Hidden Markov Model (HMM)
• Model parameters
- Initial state probabilities: P(q_1) → π_i
- Transition probability matrix: P(q_t | q_{t-1}) → a_ij
- Observation distribution given a state: P(O | q_j) → b_j (e.g. Gaussian)
• How can we learn the parameters from data?
Training HMM for Chord Recognition
• If chord labels are time-aligned with the audio, estimate the parameters directly from the data
- Initial state probabilities and transition probability matrix: count chords and chord-to-chord transitions
- Observation distribution: fit a Gaussian model to the audio features separately for each chord
- Easy to train, but time-aligned data are very expensive to obtain
• If chord labels are not aligned with the audio, we must use maximum-likelihood estimation
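For the aligned case, the counting and Gaussian fitting described above can be sketched as follows. This is a single-sequence simplification with illustrative names; the initial-state estimate here is just the first label:

```python
import numpy as np

def train_supervised(features, labels, n_states):
    """Estimate HMM parameters from time-aligned (feature, chord-label) frames.
    features: (T x D) array, labels: length-T state indices in [0, n_states)."""
    labels = np.asarray(labels)
    # Initial state: one-hot on the first frame's label (single sequence)
    pi = np.bincount(labels[:1], minlength=n_states).astype(float)
    # Count chord-to-chord transitions; a small floor avoids zero rows
    A = np.full((n_states, n_states), 1e-6)
    for i, j in zip(labels[:-1], labels[1:]):
        A[i, j] += 1
    A /= A.sum(axis=1, keepdims=True)
    # One Gaussian (mean + diagonal variance) per chord state
    means = np.stack([features[labels == s].mean(axis=0) for s in range(n_states)])
    varis = np.stack([features[labels == s].var(axis=0) + 1e-6 for s in range(n_states)])
    return pi, A, means, varis
```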
Training HMM: EM algorithm
• If chord labels are not aligned with the audio, use the EM algorithm (the Baum-Welch method)
• E-step: evaluate the probability of transitioning from state S_i at time t to state S_j at time t + 1, given the observations:
ξ_t(i, j) = P(q_t = S_i, q_{t+1} = S_j | O, λ)
- Then the probability of being in state S_i at time t can also be derived:
γ_t(i) = P(q_t = S_i | O, λ) = Σ_{j=1}^{N} ξ_t(i, j)
Training HMM: EM algorithm
• M-step: update the parameters so that they maximize the expected log-likelihood given the E-step posteriors
- Σ_{t=1}^{T-1} γ_t(i): expected number of transitions from S_i (i.e. how many times state S_i is visited from t = 1 to T − 1)
- Σ_{t=1}^{T-1} ξ_t(i, j): expected number of transitions from S_i to S_j
• We can use the labels to constrain the model (e.g. for initialization)
π_i = γ_1(i) = expected frequency in state S_i at time t = 1
a_ij = (Σ_{t=1}^{T-1} ξ_t(i, j)) / (Σ_{t=1}^{T-1} γ_t(i)) = expected number of transitions from S_i to S_j / expected number of transitions from S_i
b_j(k) = (Σ_{t: O_t = v_k} γ_t(j)) / (Σ_{t=1}^{T} γ_t(j)) = expected number of times in state S_j observing v_k / expected number of times in state S_j
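Given the E-step posteriors γ and ξ, the M-step updates are direct ratios. A sketch for the discrete-observation case (the slides use Gaussians for audio; discrete symbols keep the b_j(k) update simple, and the function name is illustrative):

```python
import numpy as np

def m_step(gamma, xi, obs, n_symbols):
    """Re-estimate discrete-HMM parameters from E-step posteriors.
    gamma: (T x N) state posteriors, xi: (T-1 x N x N) pair posteriors,
    obs: length-T symbol indices."""
    pi = gamma[0]                                            # pi_i = gamma_1(i)
    # a_ij = sum_t xi_t(i,j) / sum_t gamma_t(i), sums over t = 1..T-1
    A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    # b_j(k) = expected count of emitting v_k in state j / expected time in j
    B = np.zeros((gamma.shape[1], n_symbols))
    for t, o in enumerate(obs):
        B[:, o] += gamma[t]
    B /= gamma.sum(axis=0)[:, None]
    return pi, A, B
```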
Evaluating HMM for Chord Recognition
• Find the most likely sequence of hidden states given the observations and the HMM parameters
• Viterbi algorithm
- Define a probability variable:
δ_t(i) = max_{q_1, q_2, …, q_{t-1}} P(q_1, q_2, …, q_t = S_i, O_1, O_2, …, O_t | λ)
- Initialization (from the start state):
δ_1(i) = π_i b_i(O_1), ψ_1(i) = 0, 1 ≤ i ≤ N
- Recursion:
δ_t(j) = max_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij] b_j(O_t)
ψ_t(j) = argmax_{1 ≤ i ≤ N} [δ_{t-1}(i) a_ij], 2 ≤ t ≤ T, 1 ≤ j ≤ N
- Termination (to the end state):
P* = max_{1 ≤ i ≤ N} δ_T(i)
q_T* = argmax_{1 ≤ i ≤ N} δ_T(i)
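The Viterbi recursion maps directly to code. A log-domain sketch (log probabilities avoid underflow for long sequences; the interface, taking precomputed log observation likelihoods, is an assumption):

```python
import numpy as np

def viterbi(pi, A, logB):
    """Most likely state path.
    pi: (N,) initial probabilities, A: (N x N) transition matrix,
    logB: (T x N) log observation likelihoods log b_j(O_t)."""
    T, N = logB.shape
    logA = np.log(A + 1e-300)
    delta = np.log(pi + 1e-300) + logB[0]      # initialization: pi_i b_i(O_1)
    psi = np.zeros((T, N), dtype=int)          # backpointers
    for t in range(1, T):
        cand = delta[:, None] + logA           # cand[i, j] = delta_{t-1}(i) + log a_ij
        psi[t] = cand.argmax(axis=0)
        delta = cand.max(axis=0) + logB[t]     # recursion step
    path = [int(delta.argmax())]               # termination: best final state
    for t in range(T - 1, 0, -1):              # backtrack through psi
        path.append(int(psi[t][path[-1]]))
    return path[::-1]
```

With sticky transitions and observations that switch preference halfway through, the decoded path changes state exactly once.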
The Viterbi Trellis
• Recall dynamic programming!
(Figure: the Viterbi trellis. Columns t = 1, 2, 3, …, T − 1, T each contain the states C, F, G, connected from a start node (St) to an End node; the values v_1(j), v_2(j), …, v_T(j) are propagated left to right along the best incoming paths.)
Chord Recognition Result
• The HMM produces a smoother chord recognition output
(From Ellis’ E4896 practicals)
References
• P. R. Cook (Editor), "Music, Cognition, and Computerized Sound: An Introduction to Psychoacoustics", 2001
• C. Krumhansl, "Cognitive Foundations of Musical Pitch", 1990
• M. A. Bartsch and G. H. Wakefield, "To Catch a Chorus: Using Chroma-Based Representations for Audio Thumbnailing", 2001
• E. Gómez and P. Herrera, "Estimating the Tonality of Polyphonic Audio Files: Cognitive Versus Machine Learning Modeling Strategies", 2004
• M. Müller and S. Ewert, "Chroma Toolbox: MATLAB Implementations for Extracting Variants of Chroma-Based Audio Features", 2011
• T. Fujishima, "Real-Time Chord Recognition of Musical Sound: A System Using Common Lisp Music", 1999
• A. Sheh and D. Ellis, "Chord Segmentation and Recognition Using EM-Trained Hidden Markov Models", 2003
• L. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition", 1989