-
GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Music Generation: Part 1
Juhan Nam
-
Introduction
● We have focused on analyzing the input audio and extracting certain information or sources
○ Audio-to-label: music genre/mood classification and tagging
○ Audio-to-score/MIDI: note transcription, chord recognition, beat tracking
○ Audio-to-audio: source separation and audio style transfer
-
Introduction
● Now we have a very different problem: generating new musical content from scratch or from a given condition
○ Label-to-score: music composition and arrangement
○ Score-to-MIDI: expressive performance
○ MIDI-to-audio: sound synthesis
[Figure: the three stages correspond to the composer, the performer, and the musical instrument]
-
Introduction
● But the generative process can be conducted directly on performance MIDI or audio
○ Label-to-MIDI
○ Label-to-Audio
PerformanceRNN, Music Transformer (trained with the MAESTRO dataset)
WaveNet, WaveGAN (trained with raw audio waveforms)
-
Music Generation
● Symbolic music generation
○ Generate music in the form of a music score (but mostly MIDI)
○ Take a 1D sequence (note events) as input
○ Focus on sequential note generation based on a musical language model
○ Leverage advances in natural language processing: RNNs, Transformers
● Audio generation
○ Generate waveforms or spectrograms
○ Take the spectrogram as a 2D image or the waveform as a 1D sequence
○ Focus on natural sound synthesis
○ Leverage high-quality image generation models such as GANs
-
Symbolic Music Generation
● Language model in natural language processing
○ Predict what comes next in a sentence
● Language model in music
○ Predict what comes next in a note sequence
$$p(x_t \mid x_1, \ldots, x_{t-1})$$
$x_t$: input representation vector
[Figure: language modeling predicts the next word, e.g., "The sky is so" → "blue" / "dark" / "beautiful"]
-
Symbolic Generation
● Once the language model is trained, the joint probability of a sequence can be calculated
○ For a sequence $X = (x_1, x_2, \ldots, x_{T-1}, x_T)$
● Therefore, we can figure out which sequence is more likely than others
○ This is used in speech recognition/automatic music transcription to find the most sensible sentences/note sequences among the candidates from acoustic models
$$P(X) = P(x_1, x_2, \ldots, x_T) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2, x_1)\cdots P(x_T \mid x_{T-1}, \ldots, x_1) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$$
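As a minimal sketch of this factorization: the function below scores a sequence by summing per-step log-probabilities, where `next_token_probs` is a hypothetical stand-in for any trained language model.

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Chain rule: log P(X) = sum_t log P(x_t | x_1, ..., x_{t-1}).
    `next_token_probs(prefix)` (hypothetical) returns a dict mapping
    each candidate next token to its probability."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        log_p += math.log(next_token_probs(tokens[:t])[token])
    return log_p

# In rescoring (ASR or music transcription), the candidate sequence
# with the highest log P(X) is kept as the most "sensible" one.
```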
-
What’s different in music?
● Music is polyphonic
○ Melody and accompaniment
○ A music score is a 1D sequence with a 2D nature: how do we handle simultaneous notes in the input representation?
-
What’s different in music?
● Music is structured in scale, rhythm, and harmony
○ Given a key, notes on the scale are more likely to be played than other notes
○ Simultaneous notes are arranged in harmony with a chord
○ Successive notes are placed according to a rhythm pattern
[Figure: examples of scale, harmony, and rhythm; the metrical hierarchy of measure, beat, and tick]
-
What’s different in music?
● The majority of music pieces have a form
○ Repetitions and variations
○ AABA or intro-verse-chorus-outro
○ 16-bar blues
○ Sonata, rondo
● Learning the long-term structure (long-term dependency) is a challenge in music generation!
○ Likewise in NLP
Li et al., "The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models", 2015
-
Symbolic Input Representations
● Piano roll: a 2D image or a 1D sequence of multi-hot vectors
○ Easy to understand: visualizes music intuitively
○ Easy to handle polyphony
○ A note is a line of pixels: generative models handle the pixels, not the note
■ The generated output will be musically noisy
○ Too much redundancy in time
■ Time quantization (e.g., to 16th-note steps) can reduce the redundancy (MusicVAE)
■ But the quantization is applicable to score MIDI only
[Figure: a frame-level piano roll and its quantization]
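A minimal numpy sketch of both representations, assuming a fixed tempo and 4/4 meter (real score MIDI would follow the file's tempo map):

```python
import numpy as np

def notes_to_piano_roll(notes, fps=100):
    """Render (onset_sec, offset_sec, pitch) notes into a binary
    piano roll of shape (128, num_frames) at `fps` frames/second."""
    end = max(off for _, off, _ in notes)
    roll = np.zeros((128, int(np.ceil(end * fps))), dtype=np.uint8)
    for on, off, pitch in notes:
        roll[pitch, int(on * fps):int(off * fps)] = 1
    return roll

def quantize(roll, fps=100, bpm=100, steps_per_beat=4):
    """Downsample to one column per 16th note (4 steps/beat in 4/4),
    removing the heavy temporal redundancy of the frame-level roll."""
    frames_per_step = int(fps * 60 / bpm / steps_per_beat)
    n_steps = roll.shape[1] // frames_per_step
    trimmed = roll[:, :n_steps * frames_per_step]
    # A pitch is active in a step if it is active in any of its frames.
    return trimmed.reshape(128, n_steps, frames_per_step).max(axis=2)
```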
-
Symbolic Input Representations
● MIDI events (the Magenta format)
○ Event types
■ Note-on: 128 MIDI pitches
■ Note-off: 128 MIDI pitches
■ Set-velocity: 32 quantized velocities
■ Time-shift: 100 shifts (10 ms to 1 s)
○ The time-shift event compresses sustained note states into a single event
■ Greatly reduces the time redundancy
○ Easy to handle polyphony
○ Fits performance MIDI but is hard to combine with score information
○ All events are encoded as 388-dimensional one-hot vectors
■ A typical 30-sec clip might contain about 1200 such one-hot vectors
■ No semantic meaning (as opposed to word embeddings)
Oore et al., "This Time with Feeling: Learning Expressive Musical Performance", 2018
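A sketch of how such a 388-way vocabulary can be laid out; the offsets below are illustrative, not Magenta's exact ordering:

```python
# 388 = 128 note-ons + 128 note-offs + 32 velocities + 100 time-shifts.
NOTE_ON, NOTE_OFF, VELOCITY, TIME_SHIFT = 0, 128, 256, 288

def event_to_index(kind, value):
    """Map an event to its index in the 388-dim one-hot vector."""
    if kind == "note_on":      # value: MIDI pitch 0-127
        return NOTE_ON + value
    if kind == "note_off":     # value: MIDI pitch 0-127
        return NOTE_OFF + value
    if kind == "velocity":     # value: quantized bin 0-31
        return VELOCITY + value
    if kind == "time_shift":   # value: shift bin 0-99 (10 ms to 1 s)
        return TIME_SHIFT + value
    raise ValueError(kind)
```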
-
Symbolic Input Representations
● Music notation parsing
○ A structured text sequence
■ Bar, chord, tempo, note, and so on
○ This rich information can be useful for generating more musical output
■ But it may need manual annotations
○ There is no standard method
[Figure: a score and its REMI token sequence]
Huang & Yang, "Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions", 2020
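For intuition only, a REMI-like stream interleaves metric and note tokens roughly as below; the token names are approximate, not the paper's exact vocabulary:

```python
# Illustrative REMI-style token stream (token names are approximate):
tokens = [
    "Bar",
    "Position_1/16", "Chord_C:maj", "Tempo_110",
    "Position_1/16", "Velocity_20", "NoteOn_60", "Duration_8",
    "Position_9/16", "Velocity_18", "NoteOn_64", "Duration_4",
    "Bar",
]
```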
-
Musical Language Model Using RNN
● PerformanceRNN
○ Uses performance MIDI files from the e-Piano competition dataset (the early version of the MAESTRO dataset)
○ Data augmentation: tempo change and key transposition
○ The event-based MIDI representation (one-hot vectors)
■ $x_t \in$ {note-on, note-off, set-velocity, time-shift}
○ Trained with three layers of LSTMs and a softmax output
■ The loss function is the cross-entropy between the softmax output and the one-hot target
■ Teacher forcing: the ground-truth output is used as the next input instead of the predicted output during training
Oore et al., "This Time with Feeling: Learning Expressive Musical Performance", 2018
$$p(x_{t+1} \mid x_1, \ldots, x_t)$$
[Figure: the RNN unrolled in time; at each step $x_t$ is fed in and $x_{t+1}$ is predicted]
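A minimal PyTorch sketch of this training setup (sizes and hyperparameters are illustrative; embedding the event index is equivalent to multiplying the one-hot vector by a weight matrix):

```python
import torch
import torch.nn as nn

VOCAB = 388  # the 388-way event vocabulary

class EventLM(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, VOCAB)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.proj(h), state   # logits; softmax lives inside the loss

model, loss_fn = EventLM(), nn.CrossEntropyLoss()
seq = torch.randint(VOCAB, (8, 256))       # a batch of event index sequences
logits, _ = model(seq[:, :-1])             # teacher forcing: ground truth in,
loss = loss_fn(logits.reshape(-1, VOCAB),  # next ground-truth event as target
               seq[:, 1:].reshape(-1))
loss.backward()
```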
-
Music Generation Using Musical Language Model
● Generating the output from the trained MLM
○ Sample from the softmax distribution
○ The sampled output is used as the input at the next step
● Softmax temperature
○ $\tau > 1$: $P_\tau$ becomes more uniform
■ Thus more diverse outputs are generated
○ $\tau < 1$: $P_\tau$ becomes more spiky
■ Thus less diverse outputs are generated

$$P_\tau(w) = \frac{\exp(S_w / \tau)}{\sum_{w'} \exp(S_{w'} / \tau)}$$

[Figure: sampled events $\hat{x}_t$ fed back as inputs; softmax output distributions at $\tau > 1$, $\tau = 1$, and $\tau < 1$]
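A minimal sketch of temperature sampling over the model's output logits $S_w$:

```python
import numpy as np

def sample_with_temperature(logits, tau=1.0):
    """Sample the next event from softmax(logits / tau):
    tau > 1 flattens the distribution (more diverse output),
    tau < 1 sharpens it (more predictable output)."""
    scaled = (logits - logits.max()) / tau   # max-shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)
```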
A funny animation about auto-regressive models: https://twitter.com/i/status/1327775912352493568
-
Evaluating Musical Language Model
● Objective evaluation
○ Perplexity (PPL): the inverse probability of the corpus, normalized by its length
○ Equal to the exponential of the cross-entropy loss
○ Lower PPL is better in NLP, but is it in music too?
● Listening test
○ Demo: https://magenta.tensorflow.org/performance-rnn
○ The result sounds natural over short spans, but the note patterns are not coherent and keep diverging: the long-term dependency issue!
○ Need better models capable of learning a wider musical context
$$PPL(X) = \prod_{t=1}^{T} \left( \frac{1}{P_{LM}(x_t \mid x_{<t})} \right)^{1/T}$$
(“more predictable” might be “less creative”?)
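The formula reduces to the exponential of the average negative log-likelihood, as this small sketch shows:

```python
import numpy as np

def perplexity(step_log_probs):
    """PPL = exp(mean negative log-likelihood), i.e. the T-th root of
    the inverse sequence probability. `step_log_probs` holds
    log P_LM(x_t | x_<t) for each of the T steps."""
    return float(np.exp(-np.mean(step_log_probs)))

# A model that assigns probability 1/4 to every event has PPL 4:
assert round(perplexity(np.log([0.25] * 100))) == 4
```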
-
Generative Model
● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
○ Density estimation: a type of unsupervised learning
○ Remember that the GMM is a generative model
Training data $\sim p_{data}(X)$; generated samples $\sim p_{model}(X)$
$$p_{GMM}(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
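As a reminder of how generation from a fitted GMM works, a minimal sketch:

```python
import numpy as np

def sample_gmm(weights, means, covs, n):
    """Ancestral sampling from p(x) = sum_k pi_k N(x | mu_k, Sigma_k):
    pick a component by its mixing weight pi_k, then draw from that
    component's Gaussian."""
    rng = np.random.default_rng()
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])
```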
-
Generative Model
● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
○ If the data is high-dimensional (image, audio, or sequence), we need more representational power, so we use deep neural networks
Training data $\sim p_{data}(X)$; generated samples $\sim p_{model}(X)$
-
Auto-Encoder
● The auto-encoder is an unsupervised learning model that can learn structure within high-dimensional input
○ Using an encoder-decoder CNN or an encoder-decoder RNN
○ The latent vector can be reconstructed into the high-dimensional data
● But can we use the AE as a generative model?
○ Randomly sample a vector in the latent space and generate data from it?
[Figure: encoder-decoder CNN and encoder-decoder RNN; can we sample from their latent spaces?]
-
Auto-Encoder
● It can reconstruct the input, but the latent space may not be continuous
○ The distribution is not dense: there are gaps between the clusters
○ Output generated from the gaps (by sampling or interpolation) will be unrealistic
Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
-
Variational Auto Encoder (VAE)
● Model the latent space using randomly sampled latent vectors with a probabilistic model
○ Make the encoder yield two vectors: a mean and a standard deviation
○ Randomly sample a latent vector using the mean and standard deviation
○ Reconstruct the input from the random vector
[Figure: the encoder outputs the mean $\mu$ and standard deviation $\sigma$; the generator reconstructs the input from the random sample $\mu + \sigma z$, $z \sim \mathcal{N}(0, I)$]
-
Variational Auto Encoder (VAE)
● Optimize the network using maximum likelihood estimation
○ The estimation is intractable, so an approximate method is used:
■ Maximize the lower bound of the log-likelihood
○ This ends up minimizing two terms: the reconstruction error and the KL divergence between the Gaussian distributions
$$l(W; x) = \lVert x - \hat{x} \rVert^2 + KL\big(\mathcal{N}(\mu(x), \sigma(x)) \,\|\, \mathcal{N}(0, I)\big)$$
Reconstruction error + KL divergence: the KL term makes the distribution of latent vectors have zero mean and unit variance
Kingma & Welling, "Auto-Encoding Variational Bayes", 2014
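A minimal PyTorch sketch of this objective together with the re-parameterization trick from the next slide (a Gaussian decoder is assumed, so the reconstruction term is a squared error):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO for a Gaussian-latent VAE: reconstruction error
    plus KL(N(mu, sigma^2) || N(0, I)) in its closed form."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), so the sampling step
    stays differentiable with respect to the encoder outputs."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```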
-
Variational Auto Encoder (VAE)
● Re-parameterization
○ Enables gradient flow by moving the sampling step out of the backpropagation path
[Figure: the re-parameterized encoder; $z \sim \mathcal{N}(0, I)$ is sampled externally and combined as $\mu + \sigma \times z$, so gradients flow through $\mu$ and $\sigma$]
-
Variational Auto Encoder (VAE)
● Distribution in the latent space
○ By using both the KL divergence and the reconstruction error, the space becomes discriminative as well as continuous
[Figure: latent spaces trained with the reconstruction error only, the KL divergence only, and both]
Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
-
Variational Auto Encoder (VAE)
● Generate data by taking a random vector from the unit Gaussian
○ The data manifold is traversed by varying $z$
[Figure: the generator maps $z \sim \mathcal{N}(0, I)$ to data; in generation from a 2-D latent space, $z_1$/$z_2$ smoothly control attributes such as circle shape and tilt (digits) or smile and pose (faces)]
Kingma & Welling, "Auto-Encoding Variational Bayes", 2014
-
Variational Auto Encoder (VAE)
● A recurrent VAE is also possible
○ The language model in the decoder (generator) is conditioned on the latent vector, which captures dependencies within the entire sentence
[Figure: the encoder RNN reads "I love you" into the mean $\mu$ and standard deviation $\sigma$; the decoder RNN reconstructs the sentence from the sampled latent vector]
Bowman et al., "Generating Sentences from a Continuous Space", 2016
-
MusicVAE
● Uses the encoder-decoder RNN architecture
● Encoder: bidirectional RNN
○ The two hidden states at both ends are concatenated
● Decoder: hierarchical RNN
○ Conductor RNN: learns high-level dependencies at the measure level
○ Language-model RNN: the condition from the conductor RNN is concatenated with the previous output as the input at the next step (a sketch follows below)
Roberts et al., "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music", 2018
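A compact PyTorch sketch of the two-level decoder (sizes are illustrative, and the one-step input shift for next-event prediction is omitted for brevity):

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Conductor RNN emits one embedding per measure from the latent z;
    a bottom-level language-model RNN decodes each measure with that
    embedding concatenated to the previous event (teacher forcing)."""
    def __init__(self, z_dim=512, hid=1024, vocab=130, steps=16):
        super().__init__()
        self.steps = steps                        # 16th-note steps per bar
        self.conductor = nn.LSTM(z_dim, hid, batch_first=True)
        self.bottom = nn.LSTM(vocab + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, z, targets):                # targets: (B, bars, steps, vocab)
        bars = targets.shape[1]
        cond, _ = self.conductor(z.unsqueeze(1).repeat(1, bars, 1))
        logits = []
        for b in range(bars):                     # decode one measure at a time
            c = cond[:, b:b + 1].repeat(1, self.steps, 1)
            h, _ = self.bottom(torch.cat([targets[:, b], c], dim=-1))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)         # (B, bars, steps, vocab)
```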
-
MusicVAE
● Training dataset
○ The Lakh MIDI dataset: multi-track score MIDI
○ Uses a piano-roll representation with notes quantized to 16th-note events
○ One event is a 130-dimensional vector: 128 pitches, note-off, rest
○ The RNN input length ($T$) is 256, which corresponds to 16 measures (bars)
[Figure: quantized score MIDI]
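A sketch of the 130-way event encoding on the 16th-note grid (the index layout is illustrative, and the actual converter's sustain/retrigger semantics may differ):

```python
# 130-way vocabulary: indices 0-127 = MIDI pitches, 128 = note-off,
# 129 = rest. One symbol per 16th-note step.
NOTE_OFF, REST = 128, 129

def melody_to_events(steps):
    """`steps` holds, per 16th note, a MIDI pitch, "off", or None (rest)."""
    return [REST if s is None else NOTE_OFF if s == "off" else s
            for s in steps]

print(melody_to_events([60, "off", 64, "off", None, None]))
# -> [60, 128, 64, 128, 129, 129]
```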
-
MusicVAE
● Learning the latent space of long-term music sequences
○ A latent vector corresponds to a "well-structured" music segment
○ Beat Blender
■ A continuous move in the latent space generates a gradually changing music sequence
○ Melody Mixer
■ Interpolates between two different melodies
● Demo
○ https://magenta.tensorflow.org/music-vae
[Figure: Melody Mixer (interpolation) and Beat Blender (latent-space exploration)]
-
Issues with RNN
● Sequential computation inhibits parallelization (unlike CNNs)
● No explicit modeling of long- and short-range dependencies
● Information bottleneck in the encoder
[Figure: encoder-decoder RNN translating "I love you"; all input information passes through a single bottleneck state, and long- and short-range dependencies have no direct connections]
-
Attention Mechanism
● Direct connections between words in the encoder and decoder
○ A weighted sum of the input is concatenated to each of the output words
○ The weights are computed from the one-to-one correspondences
○ The alignment between words in the encoder and decoder is obtained for free
[Figure: attention for Korean-to-English translation of "난 네가 진짜 좋아" ("I really like you"): dot product → softmax → weighted sum → concatenation, plus the resulting word-alignment matrix]
-
Self-Attention
● Direct connections can be made between elements within a sequence
○ Each input element is transformed into a key, query, and value via linear transforms
[Figure: a self-attention layer over $x_1 \ldots x_4$: each element's query is dotted with all keys, softmaxed, and used for a weighted sum of the values]
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
($\sqrt{d_k}$: scaling factor)
Vaswani et al., "Attention Is All You Need", 2017
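A minimal numpy sketch of a single-head self-attention layer, where Wq, Wk, and Wv stand in for the learned linear transforms:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x of shape
    (T, d_model): every position attends to every other position."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # key/query/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # weighted sum of values
```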
-
Self-Attention
● Multi-head attention
○ Multiple independent keys, queries, and values capture different types of dependencies in the sequence (a head-splitting sketch follows below)
[Figure: multi-head self-attention over $x_1 \ldots x_4$; each head attends independently and the head outputs are concatenated]
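A sketch of the head split, assuming each head then runs the `self_attention` routine from the previous slide on its own slice of the model dimension:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (T, d_model) into (n_heads, T, d_head): each head runs
    scaled dot-product attention in its own subspace."""
    T, d_model = x.shape
    return x.reshape(T, n_heads, d_model // n_heads).transpose(1, 0, 2)

def merge_heads(heads):
    """Inverse of split_heads: concatenate the head outputs back along
    the feature axis to (T, d_model)."""
    n_heads, T, d_head = heads.shape
    return heads.transpose(1, 0, 2).reshape(T, n_heads * d_head)
```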
-
Self-Attention
● “Re-representation” of the input
○ Based on interactions between input elements
● Constant “path length” between any two positions (unlike RNNs)
○ Permutation-invariant
○ Need to add positional information for sequence modeling
● Trivial to parallelize
○ Effective use of GPUs
Vaswani et al., "Attention Is All You Need", 2017
-
Transformer
● Position encoding is added to the input
○ Self-attention is permutation-invariant
● A single module is composed of
○ A multi-head attention layer
○ A position-wise feed-forward layer
○ Skip connections and normalization layers
■ The skip connection carries the position information
● Masking is added to the attention in the decoder
○ For causal self-attention (a mask sketch follows below)
Vaswani et al., "Attention Is All You Need", 2017
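A sketch of the causal mask used in the decoder's self-attention:

```python
import numpy as np

def causal_mask(T):
    """Upper-triangular -inf mask: added to the attention scores before
    the softmax, it gives zero weight to future positions, so position
    t can only attend to positions <= t (auto-regressive decoding)."""
    return np.triu(np.full((T, T), -np.inf), k=1)

# usage: weights = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(T))
```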
-
Transformer
● Used in the state-of-the-art models in natural language processing
○ Machine translation
○ Language modeling
○ …
● Used in computer vision as well
○ Image classification
○ Image generation
○ …
[Figure: the Image Transformer and the Vision Transformer]
-
Music Transformer
● How about applying the Transformer to music generation?
[Figure: continuations of a primer ("initial input") generated by PerformanceRNN and by the vanilla Transformer]
Source: https://magenta.tensorflow.org/music-transformer
-
Music Transformer
● What’s wrong?
[Figure: the vanilla Transformer’s continuation of the primer: the model was trained on sequences of one fixed length, and at the “unseen” positions beyond that length the output goes completely wrong!]
Source: https://magenta.tensorflow.org/music-transformer
-
Music Transformer
● Use relative positions instead of absolute positions
○ Musical patterns can be translation-invariant (like convolution)
● The relative position encoding
○ Needs pair-wise distances (2D, compared to 1D for absolute positions)
○ A 3D tensor is necessary for the positional encoding: too much memory!
Huang et al., "Music Transformer", 2019
[Figure: relative position encoding]
-
Music Transformer
● Skewing to reduce the relative-position memory (a sketch follows below)
Huang et al., "Music Transformer", 2019
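A numpy sketch of the skewing trick, assuming the relative embeddings $E$ are ordered from distance $-(T-1)$ up to $0$; entries above the diagonal end up as garbage but are discarded by the causal mask:

```python
import numpy as np

def skew(rel_logits):
    """Align the (T, T) matrix Q @ E.T so that entry (i, j) holds
    q_i . e_(j - i), without materializing the (T, T, d) tensor of
    gathered relative embeddings: pad one dummy column on the left,
    reshape to (T + 1, T), and drop the first row."""
    T = rel_logits.shape[0]
    padded = np.pad(rel_logits, [(0, 0), (1, 0)])
    return padded.reshape(T + 1, T)[1:]

# attention logits = (Q @ K.T + skew(Q @ E.T)) / np.sqrt(d_k)
```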
-
Music Transformer
● Consistent generation!
Source: https://magenta.tensorflow.org/music-transformer
[Figure: continuations of the primer ("initial input") by the vanilla Transformer and by the Music Transformer]
← More music examples and a great visualization of real-time self-attention!