-
GCT634/AI613: Musical Applications of Machine Learning (Fall 2020)
Music Generation: Part 1
Juhan Nam
-
Introduction
● We have focused on analyzing the input audio and extracting certain information or sources
○ Audio-to-label: music genre/mood classification and tagging
○ Audio-to-score/MIDI: note transcription, chord recognition, beat tracking
○ Audio-to-audio: source separation and audio style transfer
-
Introduction
● Now we have a very different problem: generating new musical content from scratch or from a given condition
○ Label-to-score: music composition and arrangement
○ Score-to-MIDI: expressive performance
○ MIDI-to-audio: sound synthesis
[Figure: the three stages correspond to the composer, the performer, and the musical instrument]
-
Introduction
● But the generative process can be conducted directly on performance MIDI or audio
○ Label-to-MIDI
○ Label-to-Audio
PerformanceRNN, Music Transformer (trained with the MAESTRO dataset)
WaveNet, WaveGAN (trained with raw audio waveforms)
-
Music Generation
● Symbolic music generation
○ Generate music in the form of a music score (but mostly MIDI)
○ Take a 1D sequence (note events) as input
○ Focus on sequential note generation based on a musical language model
○ Leverage advances in natural language processing: RNNs, Transformers
● Audio generation
○ Generate waveforms or spectrograms
○ Take the spectrogram as a 2D image or the waveform as a 1D sequence
○ Focus on natural sound synthesis
○ Leverage high-quality image generation models such as GANs
-
Symbolic Music Generation
● Language model in natural language processing
○ Predict what comes next in a sentence
● Language model in music
○ Predict what comes next in a note sequence
$$p(x_t \mid x_1, \ldots, x_{t-1})$$
$x_t$: input representation vector
[Figure: language modeling predicts the next word, e.g., "The sky is so" → "blue" / "dark" / "beautiful"]
-
Symbolic Generation
● Once the language model is trained, the joint probability of a sequence can be calculated
○ For a sequence $X = (x_1, x_2, \ldots, x_{T-1}, x_T)$
● Therefore, we can figure out which sequence is more likely than others
○ This is used in speech recognition/automatic music transcription to find the most sensible sentences/note sequences among the candidates from acoustic models
$$P(X) = P(x_1, x_2, \ldots, x_T) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_2, x_1)\cdots P(x_T \mid x_{T-1}, \ldots, x_1) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$$
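As a minimal sketch of this factorization: the function below scores a sequence by summing per-step log-probabilities, where `next_token_probs` is a hypothetical stand-in for any trained language model.

```python
import math

def sequence_log_prob(tokens, next_token_probs):
    """Chain rule: log P(X) = sum_t log P(x_t | x_1, ..., x_{t-1}).
    `next_token_probs(prefix)` (hypothetical) returns a dict mapping
    each candidate next token to its probability."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        log_p += math.log(next_token_probs(tokens[:t])[token])
    return log_p

# In rescoring (ASR or music transcription), the candidate sequence
# with the highest log P(X) is kept as the most "sensible" one.
```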
-
What’s different in music?
● Music is polyphonic
○ Melody and accompaniment
○ A music score is a 1D sequence with a 2D nature: how do we handle simultaneous notes in the input representation?
-
What’s different in music?
● Music is structured in scale, rhythm, and harmony
○ Given a key, notes on the scale are more likely to be played than other notes
○ Simultaneous notes are arranged in harmony with a chord
○ Successive notes are placed according to a rhythm pattern
[Figure: examples of scale, harmony, and rhythm; the metrical hierarchy of measure, beat, and tick]
-
What’s different in music?
● The majority of music pieces have a form
○ Repetitions and variations
○ AABA or intro-verse-chorus-outro
○ 16-bar blues
○ Sonata, rondo
● Learning the long-term structure (long-term dependency) is a challenge in music generation!
○ Likewise in NLP
Li et al., "The Clustering of Expressive Timing Within a Phrase in Classical Piano Performances by Gaussian Mixture Models", 2015
-
Symbolic Input Representations
● Piano roll: a 2D image or a 1D sequence of multi-hot vectors
○ Easy to understand: visualizes music intuitively
○ Easy to handle polyphony
○ A note is a line of pixels: generative models handle the pixels, not the note
■ The generated output will be musically noisy
○ Too much redundancy in time
■ Time quantization (e.g., to 16th-note steps) can reduce the redundancy (MusicVAE)
■ But the quantization is applicable to score MIDI only
[Figure: a frame-level piano roll and its quantization]
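A minimal numpy sketch of both representations, assuming a fixed tempo and 4/4 meter (real score MIDI would follow the file's tempo map):

```python
import numpy as np

def notes_to_piano_roll(notes, fps=100):
    """Render (onset_sec, offset_sec, pitch) notes into a binary
    piano roll of shape (128, num_frames) at `fps` frames/second."""
    end = max(off for _, off, _ in notes)
    roll = np.zeros((128, int(np.ceil(end * fps))), dtype=np.uint8)
    for on, off, pitch in notes:
        roll[pitch, int(on * fps):int(off * fps)] = 1
    return roll

def quantize(roll, fps=100, bpm=100, steps_per_beat=4):
    """Downsample to one column per 16th note (4 steps/beat in 4/4),
    removing the heavy temporal redundancy of the frame-level roll."""
    frames_per_step = int(fps * 60 / bpm / steps_per_beat)
    n_steps = roll.shape[1] // frames_per_step
    trimmed = roll[:, :n_steps * frames_per_step]
    # A pitch is active in a step if it is active in any of its frames.
    return trimmed.reshape(128, n_steps, frames_per_step).max(axis=2)
```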
-
Symbolic Input Representations
● MIDI events (the Magenta format)
○ Event types
■ Note-on: 128 MIDI pitches
■ Note-off: 128 MIDI pitches
■ Set-velocity: 32 quantized velocities
■ Time-shift: 100 shifts (10 ms to 1 s)
○ The time-shift event compresses sustained note states into a single event
■ Greatly reduces the time redundancy
○ Easy to handle polyphony
○ Fits performance MIDI but is hard to combine with score information
○ All events are encoded as 388-dimensional one-hot vectors
■ A typical 30-sec clip might contain about 1200 such one-hot vectors
■ No semantic meaning (as opposed to word embeddings)
Oore et al., "This Time with Feeling: Learning Expressive Musical Performance", 2018
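A sketch of how such a 388-way vocabulary can be laid out; the offsets below are illustrative, not Magenta's exact ordering:

```python
# 388 = 128 note-ons + 128 note-offs + 32 velocities + 100 time-shifts.
NOTE_ON, NOTE_OFF, VELOCITY, TIME_SHIFT = 0, 128, 256, 288

def event_to_index(kind, value):
    """Map an event to its index in the 388-dim one-hot vector."""
    if kind == "note_on":      # value: MIDI pitch 0-127
        return NOTE_ON + value
    if kind == "note_off":     # value: MIDI pitch 0-127
        return NOTE_OFF + value
    if kind == "velocity":     # value: quantized bin 0-31
        return VELOCITY + value
    if kind == "time_shift":   # value: shift bin 0-99 (10 ms to 1 s)
        return TIME_SHIFT + value
    raise ValueError(kind)
```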
-
Symbolic Input Representations
● Music notation parsing
○ A structured text sequence
■ Bar, chord, tempo, note, and so on
○ This rich information can be useful for generating more musical output
■ But it may need manual annotations
○ There is no standard method
[Figure: a score and its REMI token sequence]
Huang & Yang, "Pop Music Transformer: Beat-based Modeling and Generation of Expressive Pop Piano Compositions", 2020
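For intuition only, a REMI-like stream interleaves metric and note tokens roughly as below; the token names are approximate, not the paper's exact vocabulary:

```python
# Illustrative REMI-style token stream (token names are approximate):
tokens = [
    "Bar",
    "Position_1/16", "Chord_C:maj", "Tempo_110",
    "Position_1/16", "Velocity_20", "NoteOn_60", "Duration_8",
    "Position_9/16", "Velocity_18", "NoteOn_64", "Duration_4",
    "Bar",
]
```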
-
Musical Language Model Using RNN
● PerformanceRNN
○ Uses performance MIDI files from the e-Piano competition dataset (the early version of the MAESTRO dataset)
○ Data augmentation: tempo change and key transposition
○ The event-based MIDI representation (one-hot vectors)
■ $x_t \in$ {note-on, note-off, set-velocity, time-shift}
○ Trained with three layers of LSTMs and a softmax output
■ The loss function is the cross-entropy between the softmax output and the one-hot target
■ Teacher forcing: the ground-truth output is used as the next input instead of the predicted output during training
Oore et al., "This Time with Feeling: Learning Expressive Musical Performance", 2018
$$p(x_{t+1} \mid x_1, \ldots, x_t)$$
[Figure: the RNN unrolled in time; at each step $x_t$ is fed in and $x_{t+1}$ is predicted]
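A minimal PyTorch sketch of this training setup (sizes and hyperparameters are illustrative; embedding the event index is equivalent to multiplying the one-hot vector by a weight matrix):

```python
import torch
import torch.nn as nn

VOCAB = 388  # the 388-way event vocabulary

class EventLM(nn.Module):
    def __init__(self, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, hidden)
        self.lstm = nn.LSTM(hidden, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, VOCAB)

    def forward(self, x, state=None):
        h, state = self.lstm(self.embed(x), state)
        return self.proj(h), state   # logits; softmax lives inside the loss

model, loss_fn = EventLM(), nn.CrossEntropyLoss()
seq = torch.randint(VOCAB, (8, 256))       # a batch of event index sequences
logits, _ = model(seq[:, :-1])             # teacher forcing: ground truth in,
loss = loss_fn(logits.reshape(-1, VOCAB),  # next ground-truth event as target
               seq[:, 1:].reshape(-1))
loss.backward()
```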
-
Music Generation Using Musical Language Model
● Generating the output from the trained MLM
○ Sample from the softmax distribution
○ The sampled output is used as the input at the next step
● Softmax temperature
○ $\tau > 1$: $P_\tau$ becomes more uniform
■ Thus more diverse outputs are generated
○ $\tau < 1$: $P_\tau$ becomes more spiky
■ Thus less diverse outputs are generated

$$P_\tau(w) = \frac{\exp(S_w / \tau)}{\sum_{w'} \exp(S_{w'} / \tau)}$$

[Figure: sampled events $\hat{x}_t$ fed back as inputs; softmax output distributions at $\tau > 1$, $\tau = 1$, and $\tau < 1$]
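A minimal sketch of temperature sampling over the model's output logits $S_w$:

```python
import numpy as np

def sample_with_temperature(logits, tau=1.0):
    """Sample the next event from softmax(logits / tau):
    tau > 1 flattens the distribution (more diverse output),
    tau < 1 sharpens it (more predictable output)."""
    scaled = (logits - logits.max()) / tau   # max-shift for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return np.random.choice(len(probs), p=probs)
```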
A funny animation about auto-regressive models: https://twitter.com/i/status/1327775912352493568
-
Evaluating Musical Language Model
● Objective evaluation
○ Perplexity (PPL): the inverse probability of the corpus, normalized by its length
○ Equal to the exponential of the cross-entropy loss
○ Lower PPL is better in NLP, but is it in music too?
● Listening test
○ Demo: https://magenta.tensorflow.org/performance-rnn
○ The result sounds natural over short spans, but the note patterns are not coherent and keep diverging: the long-term dependency issue!
○ Need better models capable of learning a wider musical context
$$PPL(X) = \prod_{t=1}^{T} \left( \frac{1}{P_{LM}(x_t \mid x_{<t})} \right)^{1/T}$$
(“more predictable” might be “less creative”?)
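The formula reduces to the exponential of the average negative log-likelihood, as this small sketch shows:

```python
import numpy as np

def perplexity(step_log_probs):
    """PPL = exp(mean negative log-likelihood), i.e. the T-th root of
    the inverse sequence probability. `step_log_probs` holds
    log P_LM(x_t | x_<t) for each of the T steps."""
    return float(np.exp(-np.mean(step_log_probs)))

# A model that assigns probability 1/4 to every event has PPL 4:
assert round(perplexity(np.log([0.25] * 100))) == 4
```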
-
Generative Model
● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
○ Density estimation: a type of unsupervised learning
○ Remember that the GMM is a generative model
Training data $\sim p_{data}(X)$; generated samples $\sim p_{model}(X)$
$$p_{GMM}(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)$$
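As a reminder of how generation from a fitted GMM works, a minimal sketch:

```python
import numpy as np

def sample_gmm(weights, means, covs, n):
    """Ancestral sampling from p(x) = sum_k pi_k N(x | mu_k, Sigma_k):
    pick a component by its mixing weight pi_k, then draw from that
    component's Gaussian."""
    rng = np.random.default_rng()
    ks = rng.choice(len(weights), size=n, p=weights)
    return np.stack([rng.multivariate_normal(means[k], covs[k]) for k in ks])
```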
-
Generative Model
● Given a dataset of examples $X = \{x_i\}$, estimate $p(X)$ and generate new samples from $p(X)$
○ If the data is high-dimensional (image, audio, or sequence), we need more representational power, so we use deep neural networks
Training data $\sim p_{data}(X)$; generated samples $\sim p_{model}(X)$
-
Auto-Encoder
● The auto-encoder is an unsupervised learning model that can learn structure within high-dimensional input
○ Using an encoder-decoder CNN or an encoder-decoder RNN
○ The latent vector can be reconstructed into the high-dimensional data
● But can we use the AE as a generative model?
○ Randomly sample a vector in the latent space and generate data from it?
[Figure: encoder-decoder CNN and encoder-decoder RNN; can we sample from their latent spaces?]
-
Auto-Encoder
● It can reconstruct the input, but the latent space may not be continuous
○ The distribution is not dense: there are gaps between the clusters
○ Output generated from the gaps (by sampling or interpolation) will be unrealistic
Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
-
Variational Auto Encoder (VAE)
● Model the latent space using randomly sampled latent vectors with a probabilistic model
○ Make the encoder yield two vectors: a mean and a standard deviation
○ Randomly sample a latent vector using the mean and standard deviation
○ Reconstruct the input from the random vector
[Figure: the encoder outputs the mean $\mu$ and standard deviation $\sigma$; the generator reconstructs the input from the random sample $\mu + \sigma z$, $z \sim \mathcal{N}(0, I)$]
-
Variational Auto Encoder (VAE)
● Optimize the network using maximum likelihood estimation
○ The estimation is intractable, so an approximate method is used:
■ Maximize the lower bound of the log-likelihood
○ This ends up minimizing two terms: the reconstruction error and the KL divergence between the Gaussian distributions
$$l(W; x) = \lVert x - \hat{x} \rVert^2 + KL\big(\mathcal{N}(\mu(x), \sigma(x)) \,\|\, \mathcal{N}(0, I)\big)$$
Reconstruction error + KL divergence: the KL term makes the distribution of latent vectors have zero mean and unit variance
Kingma & Welling, "Auto-Encoding Variational Bayes", 2014
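A minimal PyTorch sketch of this objective together with the re-parameterization trick from the next slide (a Gaussian decoder is assumed, so the reconstruction term is a squared error):

```python
import torch
import torch.nn.functional as F

def vae_loss(x, x_hat, mu, log_var):
    """Negative ELBO for a Gaussian-latent VAE: reconstruction error
    plus KL(N(mu, sigma^2) || N(0, I)) in its closed form."""
    recon = F.mse_loss(x_hat, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + kl

def reparameterize(mu, log_var):
    """z = mu + sigma * eps with eps ~ N(0, I), so the sampling step
    stays differentiable with respect to the encoder outputs."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps
```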
-
Variational Auto Encoder (VAE)
● Re-parameterization
○ Enables gradient flow by moving the sampling step out of the backpropagation path
[Figure: the re-parameterized encoder; $z \sim \mathcal{N}(0, I)$ is sampled externally and combined as $\mu + \sigma \times z$, so gradients flow through $\mu$ and $\sigma$]
-
Variational Auto Encoder (VAE)
● Distribution in the latent space
○ By using both the KL divergence and the reconstruction error, the space becomes discriminative as well as continuous
[Figure: latent spaces trained with the reconstruction error only, the KL divergence only, and both]
Source: https://towardsdatascience.com/intuitively-understanding-variational-autoencoders-1bfe67eb5daf
-
Variational Auto Encoder (VAE)
● Generate data by taking a random vector from the unit Gaussian
○ The data manifold is traversed by varying $z$
[Figure: the generator maps $z \sim \mathcal{N}(0, I)$ to data; in generation from a 2-D latent space, $z_1$/$z_2$ smoothly control attributes such as circle shape and tilt (digits) or smile and pose (faces)]
Kingma & Welling, "Auto-Encoding Variational Bayes", 2014
-
Variational Auto Encoder (VAE)
● A recurrent VAE is also possible
○ The language model in the decoder (generator) is conditioned on the latent vector, which captures dependencies within the entire sentence
[Figure: the encoder RNN reads "I love you" into the mean $\mu$ and standard deviation $\sigma$; the decoder RNN reconstructs the sentence from the sampled latent vector]
Bowman et al., "Generating Sentences from a Continuous Space", 2016
-
MusicVAE
● Uses the encoder-decoder RNN architecture
● Encoder: bidirectional RNN
○ The two hidden states at both ends are concatenated
● Decoder: hierarchical RNN
○ Conductor RNN: learns high-level dependencies at the measure level
○ Language-model RNN: the condition from the conductor RNN is concatenated with the previous output as the input at the next step (a sketch follows below)
Roberts et al., "A Hierarchical Latent Vector Model for Learning Long-Term Structure in Music", 2018
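A compact PyTorch sketch of the two-level decoder (sizes are illustrative, and the one-step input shift for next-event prediction is omitted for brevity):

```python
import torch
import torch.nn as nn

class HierarchicalDecoder(nn.Module):
    """Conductor RNN emits one embedding per measure from the latent z;
    a bottom-level language-model RNN decodes each measure with that
    embedding concatenated to the previous event (teacher forcing)."""
    def __init__(self, z_dim=512, hid=1024, vocab=130, steps=16):
        super().__init__()
        self.steps = steps                        # 16th-note steps per bar
        self.conductor = nn.LSTM(z_dim, hid, batch_first=True)
        self.bottom = nn.LSTM(vocab + hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, z, targets):                # targets: (B, bars, steps, vocab)
        bars = targets.shape[1]
        cond, _ = self.conductor(z.unsqueeze(1).repeat(1, bars, 1))
        logits = []
        for b in range(bars):                     # decode one measure at a time
            c = cond[:, b:b + 1].repeat(1, self.steps, 1)
            h, _ = self.bottom(torch.cat([targets[:, b], c], dim=-1))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)         # (B, bars, steps, vocab)
```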
-
MusicVAE
● Training dataset
○ The Lakh MIDI dataset: multi-track score MIDI
○ Uses a piano-roll representation with notes quantized to 16th-note events
○ One event is a 130-dimensional vector: 128 pitches, note-off, rest
○ The RNN input length ($T$) is 256, which corresponds to 16 measures (bars)
[Figure: quantized score MIDI]
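A sketch of the 130-way event encoding on the 16th-note grid (the index layout is illustrative, and the actual converter's sustain/retrigger semantics may differ):

```python
# 130-way vocabulary: indices 0-127 = MIDI pitches, 128 = note-off,
# 129 = rest. One symbol per 16th-note step.
NOTE_OFF, REST = 128, 129

def melody_to_events(steps):
    """`steps` holds, per 16th note, a MIDI pitch, "off", or None (rest)."""
    return [REST if s is None else NOTE_OFF if s == "off" else s
            for s in steps]

print(melody_to_events([60, "off", 64, "off", None, None]))
# -> [60, 128, 64, 128, 129, 129]
```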
-
MusicVAE
● Learning the latent space of long-term music sequences
○ A latent vector corresponds to a "well-structured" music segment
○ Beat Blender
■ A continuous move in the latent space generates a gradually changing music sequence
○ Melody Mixer
■ Interpolates between two different melodies
● Demo
○ https://magenta.tensorflow.org/music-vae
[Figure: Melody Mixer (interpolation) and Beat Blender (latent-space exploration)]
-
Issues with RNN
● Sequential computation inhibits parallelization (unlike CNNs)
● No explicit modeling of long- and short-range dependencies
● Information bottleneck in the encoder
[Figure: encoder-decoder RNN translating "I love you"; all input information passes through a single bottleneck state, and long- and short-range dependencies have no direct connections]
-
Attention Mechanism
● Direct connections between words in the encoder and decoder
○ A weighted sum of the input is concatenated to each of the output words
○ The weights are computed from the one-to-one correspondences
○ The alignment between words in the encoder and decoder is obtained for free
[Figure: attention for Korean-to-English translation of "난 네가 진짜 좋아" ("I really like you"): dot product → softmax → weighted sum → concatenation, plus the resulting word-alignment matrix]
-
Self-Attention
● Direct connections can be made between elements within a sequence
○ Each input element is transformed into a key, query, and value via linear transforms
[Figure: a self-attention layer over $x_1 \ldots x_4$: each element's query is dotted with all keys, softmaxed, and used for a weighted sum of the values]
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$
($\sqrt{d_k}$: scaling factor)
Vaswani et al., "Attention Is All You Need", 2017
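A minimal numpy sketch of a single-head self-attention layer, where Wq, Wk, and Wv stand in for the learned linear transforms:

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a sequence x of shape
    (T, d_model): every position attends to every other position."""
    Q, K, V = x @ Wq, x @ Wk, x @ Wv          # key/query/value projections
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # pairwise dot products, scaled
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                        # weighted sum of values
```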
-
Self-Attention
● Multi-head attention
○ Multiple independent keys, queries, and values capture different types of dependencies in the sequence (a head-splitting sketch follows below)
[Figure: multi-head self-attention over $x_1 \ldots x_4$; each head attends independently and the head outputs are concatenated]
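A sketch of the head split, assuming each head then runs the `self_attention` routine from the previous slide on its own slice of the model dimension:

```python
import numpy as np

def split_heads(x, n_heads):
    """Reshape (T, d_model) into (n_heads, T, d_head): each head runs
    scaled dot-product attention in its own subspace."""
    T, d_model = x.shape
    return x.reshape(T, n_heads, d_model // n_heads).transpose(1, 0, 2)

def merge_heads(heads):
    """Inverse of split_heads: concatenate the head outputs back along
    the feature axis to (T, d_model)."""
    n_heads, T, d_head = heads.shape
    return heads.transpose(1, 0, 2).reshape(T, n_heads * d_head)
```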
-
Self-Attention
● “Re-representation” of the input
○ Based on interactions between input elements
● Constant “path length” between any two positions (unlike RNNs)
○ Permutation-invariant
○ Need to add positional information for sequence modeling
● Trivial to parallelize
○ Effective use of GPUs
Vaswani et al., "Attention Is All You Need", 2017
-
Transformer
● Position encoding is added to the input
○ Self-attention is permutation-invariant
● A single module is composed of
○ A multi-head attention layer
○ A position-wise feed-forward layer
○ Skip connections and normalization layers
■ The skip connection carries the position information
● Masking is added to the attention in the decoder
○ For causal self-attention (a mask sketch follows below)
Vaswani et al., "Attention Is All You Need", 2017
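A sketch of the causal mask used in the decoder's self-attention:

```python
import numpy as np

def causal_mask(T):
    """Upper-triangular -inf mask: added to the attention scores before
    the softmax, it gives zero weight to future positions, so position
    t can only attend to positions <= t (auto-regressive decoding)."""
    return np.triu(np.full((T, T), -np.inf), k=1)

# usage: weights = softmax(Q @ K.T / np.sqrt(d_k) + causal_mask(T))
```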
-
Transformer
● Used in the state-of-the-art models in natural language processing
○ Machine translation
○ Language modeling
○ …
● Used in computer vision as well
○ Image classification
○ Image generation
○ …
[Figure: the Image Transformer and the Vision Transformer]
-
Music Transformer
● How about applying the Transformer to music generation?
[Figure: continuations of a primer ("initial input") generated by PerformanceRNN and by the vanilla Transformer]
Source: https://magenta.tensorflow.org/music-transformer
-
Music Transformer
● What’s wrong?
[Figure: the vanilla Transformer’s continuation of the primer: the model was trained on sequences of one fixed length, and at the “unseen” positions beyond that length the output goes completely wrong!]
Source: https://magenta.tensorflow.org/music-transformer
-
Music Transformer
● Use relative positions instead of absolute positions
○ Musical patterns can be translation-invariant (like convolution)
● The relative position encoding
○ Needs pair-wise distances (2D, compared to 1D for absolute positions)
○ A 3D tensor is necessary for the positional encoding: too much memory!
Huang et al., "Music Transformer", 2019
[Figure: relative position encoding]
-
Music Transformer
● Skewing to reduce the relative-position memory (a sketch follows below)
Huang et al., "Music Transformer", 2019
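A numpy sketch of the skewing trick, assuming the relative embeddings $E$ are ordered from distance $-(T-1)$ up to $0$; entries above the diagonal end up as garbage but are discarded by the causal mask:

```python
import numpy as np

def skew(rel_logits):
    """Align the (T, T) matrix Q @ E.T so that entry (i, j) holds
    q_i . e_(j - i), without materializing the (T, T, d) tensor of
    gathered relative embeddings: pad one dummy column on the left,
    reshape to (T + 1, T), and drop the first row."""
    T = rel_logits.shape[0]
    padded = np.pad(rel_logits, [(0, 0), (1, 0)])
    return padded.reshape(T + 1, T)[1:]

# attention logits = (Q @ K.T + skew(Q @ E.T)) / np.sqrt(d_k)
```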
-
Music Transformer
● Consistent generation!
Source: https://magenta.tensorflow.org/music-transformer
[Figure: continuations of the primer ("initial input") by the vanilla Transformer and by the Music Transformer]
← More music examples and a great visualization of real-time self-attention!