

[Figures: train/validation perplexity vs. training steps (k) for the hidden size experiment (hidden sizes 64, 128, 256), the batch size experiment (batch sizes 10, 20, 50), the dropout rate experiment (rates 0.0, 0.25, 0.5), and overall training progress; referenced in Sections 7-9 below.]

Automatic Generation of Lyrics in Bob Dylan’s Style

1. Introduction

Dongzhuo Li ([email protected]), Chao Liang ([email protected]), Tianze Liu ([email protected])

Bob Dylan was recently awarded the Nobel Prize in Literature “for having created new poetic expressions within the great American song tradition.” It is interesting to see whether a machine can learn his poetic style from his lyrics. In this project, we use N-grams and a Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) to model Dylan’s lyrics, and then use both models to generate samples of lyrics in Bob Dylan’s style.

2. Data

The data consist of the lyrics of 465 songs downloaded from Bob Dylan’s official website (728 KB in total). To make the data easier for machines to process, we preprocessed them by lowercasing all words and isolating punctuation from words.
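A minimal sketch of this preprocessing step is shown below; exactly which punctuation marks are isolated (apostrophes inside contractions are left attached here) is an assumption rather than the precise recipe used for the poster.

```python
import re

def preprocess(text):
    # Lowercase the lyrics and surround punctuation with spaces so that
    # each punctuation mark becomes its own token.
    text = text.lower()
    text = re.sub(r'([.,!?;:"()])', r' \1 ', text)
    return re.sub(r'\s+', ' ', text).strip()

print(preprocess("How does it feel, To be on your own?"))
# -> how does it feel , to be on your own ?
```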

3. N-grams

Example: in the phrase “John likes apple”, the three words form a 3-gram.

Training: find the conditional distribution P(w_i | w_{i-N+1}, ..., w_{i-1}) that maximizes the likelihood of the training corpus; the maximum-likelihood estimate is the ratio of observed N-gram counts.

Prediction: randomly sample the next word from that distribution.
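A count-based sketch of this training-and-sampling procedure (the toy token list and function names are illustrative, not the code behind the poster):

```python
import random
from collections import Counter, defaultdict

def train_ngrams(tokens, n=3):
    # Maximum-likelihood N-gram model: count how often each word follows
    # each (n-1)-word context.
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i:i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def sample_next(counts, context):
    # Randomly sample the next word from the estimated distribution.
    options = counts[tuple(context)]
    words, freqs = zip(*options.items())
    return random.choices(words, weights=freqs, k=1)[0]

tokens = "john likes apple and john likes music".split()
model = train_ngrams(tokens, n=3)
print(sample_next(model, ["john", "likes"]))  # 'apple' or 'music'
```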

4. RNN with LSTM

Architecture: stacked LSTM cells (the illustration uses 2 layers with a hidden layer size of 3) followed by a softmax regression; at each step the softmax outputs a probability distribution over the next token (e.g., after “John likes” the model should assign high probability to “apple”).

Training: update the weights and biases in each cell with back-propagation.

Prediction: forward propagation through the neural network.

LSTM: memory that allows the RNN to learn long-term dependencies. Example: “I am from China. I came to the US, but I still like Chinese food.” (predicting “Chinese” requires remembering “China” from the earlier sentence).
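The sketch below is a minimal tf.keras reconstruction of such a character-level stacked-LSTM model, not the TensorFlow char-rnn code used for the poster; the vocabulary size, the embedding layer and the optimizer are assumptions.

```python
import tensorflow as tf

def build_char_rnn(vocab_size=100, hidden=256, num_layers=2):
    # Stacked LSTM layers followed by a softmax over the character vocabulary.
    # vocab_size=100 is a placeholder; it should equal the number of distinct
    # characters in the corpus.
    model = tf.keras.Sequential()
    model.add(tf.keras.Input(shape=(None,)))
    model.add(tf.keras.layers.Embedding(vocab_size, hidden))
    for _ in range(num_layers):
        model.add(tf.keras.layers.LSTM(hidden, return_sequences=True))
    model.add(tf.keras.layers.Dense(vocab_size, activation="softmax"))
    return model

model = build_char_rnn()
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
model.summary()
```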

5. Word vs. Character level RNN

• In practice, RNNs work at either the word level or the character level.

• To make word-level RNNs more efficient, words are mapped to high-dimensional vectors in which semantically similar words are close to each other (word embedding) before being fed to the RNN. In the word embeddings from Global Vectors for Word Representation (GloVe), for example, “he” lies near “she”, and “school”, “college” and “university” cluster together.

• In our experiments, character-level RNNs perform significantly better than word-level RNNs, likely due to the small number of distinct words our data contain (the short sketch after this list illustrates the gap in vocabulary size).

• In the following sections all results are generated by character-level RNNs.
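A tiny sketch of why the character-level vocabulary is so much smaller than the word-level one; the corpus file name is hypothetical, and any preprocessed text can be substituted.

```python
def vocab_sizes(text):
    # Distinct word tokens vs. distinct characters: the vocabulary size sets
    # the width of the softmax layer and how much data each output class sees.
    return len(set(text.split())), len(set(text))

# lyrics = open("dylan_lyrics.txt").read()   # hypothetical preprocessed corpus
lyrics = "how does it feel , to be on your own ?"
word_vocab, char_vocab = vocab_sizes(lyrics)
print(f"word vocabulary: {word_vocab}, character vocabulary: {char_vocab}")
```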

6. Metrics for RNN Performance

We use perplexity as the metric for RNN performance, defined as PP = exp(-(1/N) * Σ_{i=1..N} log p_i), where p_i is the probability of the i-th word (or character) output by the softmax regression following the RNN, and N is the number of tokens.
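A direct implementation of this definition (the toy probability list is illustrative):

```python
import math

def perplexity(token_probs):
    # PP = exp(-(1/N) * sum(log p_i)), where p_i is the probability the model
    # assigned to the token that actually occurred. Lower is better.
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

# A model that assigns probability 0.25 to every observed token is as
# "confused" as a uniform choice among 4 options: perplexity 4.
print(perplexity([0.25] * 10))  # ~4.0
```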

7. Training RNN

Hyperparameters: hidden size 256, 2 layers, drop-out rate 0.5, batch size 20, 20 unrolling steps, 50 epochs, learning rate from 1 to 0.002.

Sample output at different stages of training (see the “Training progress” figure panel):

Epoch 1: When, my prace, ana-peeviof Plaeved Jos a beyf pays rome! Thack Jpeps frays and her forbake roses...

Epoch 5: When, you play up to the whole laughed Everything is she's crazy growid in me If you one time for anythe frien...

Epoch 10: When, I say, 'That say I poor you, Well, I got anythin' for ev'ryone You think and I could fine...

8. Effects of Batch Size and Hidden Layer Size

• Without drop-out, a larger hidden layer size reduces training perplexity but increases validation perplexity, which indicates overfitting (see the “hidden size experiment” panel).

• A larger batch size speeds up training but leads to earlier overfitting (see the “batch size experiment” panel).

9. Regularization by Drop-out

• Drop-out randomly disables a certain fraction of neurons in each iteration in order to avoid over-fitting; a sketch of where drop-out sits in the model follows this section.

• In our experiments, a non-zero drop-out rate is very effective in preventing over-fitting (see the “dropout rate experiment” panel).
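A sketch of where drop-out would sit in the stacked-LSTM model from Section 4 (again a tf.keras reconstruction rather than the code behind the poster; the vocabulary size is a placeholder):

```python
import tensorflow as tf

def build_char_rnn_with_dropout(vocab_size=100, hidden=256, rate=0.5):
    # Identical to the earlier sketch, but a Dropout layer after each LSTM
    # zeroes a fraction `rate` of activations at every training step.
    return tf.keras.Sequential([
        tf.keras.Input(shape=(None,)),
        tf.keras.layers.Embedding(vocab_size, hidden),
        tf.keras.layers.LSTM(hidden, return_sequences=True),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.LSTM(hidden, return_sequences=True),
        tf.keras.layers.Dropout(rate),
        tf.keras.layers.Dense(vocab_size, activation="softmax"),
    ])
```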

10. Sample lyrics

N-grams:

N=1: You weary, shine. road. Hell's dead fate me stand That a. winds With name?

N=3: One by one, until there were none. Two by two, they stepped into the night. Drinkin' white rum in a pie. Let the bird sing, let the bird fly.

Well, I want to hear my money from the river. Yes, and the next from the ground. With the wind blown of the day

CharRNN:

I am going one. I’m in the moon. Something is the wind. I can’t be seen
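Samples like the CharRNN lines above are produced by feeding the network its own output one character at a time. Below is a minimal sampling loop built on the tf.keras sketch from Section 4; the char_to_id / id_to_char lookup tables, the seed string, and the use of a temperature parameter are assumptions.

```python
import numpy as np

def sample_lyrics(model, char_to_id, id_to_char, seed="when ", length=200, temperature=1.0):
    # Feed the sequence generated so far through the network, take the softmax
    # distribution for the next character, and sample from it.
    ids = [char_to_id[c] for c in seed]
    for _ in range(length):
        probs = model.predict(np.array([ids]), verbose=0)[0, -1]
        logits = np.log(probs + 1e-9) / temperature  # temperature < 1 gives more conservative output
        probs = np.exp(logits) / np.sum(np.exp(logits))
        ids.append(int(np.random.choice(len(probs), p=probs)))
    return "".join(id_to_char[i] for i in ids)
```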

11. Discussion

N-grams:

1. Low training perplexity (N=2: 32.3; N=3: 2) but very high test perplexity (~589902). This may be because lyrics of different songs are largely unrelated and the dataset is small.

2. The generated lyrics make more sense as N increases, but with a higher probability of duplication.

CharRNN:

1. A larger hidden size carries a risk of overfitting, which can be regularized using drop-out.

2. The RNN seems good at capturing the grammar (syntax) of the sentences.

3. A larger dataset is needed to train a better-performing character-level RNN.

12. References

Andrej Karpathy, http://karpathy.github.io/2015/05/21/rnn-effectiveness/, last visited 10/20/2016.
Chen Liang, https://github.com/crazydonkey200/tensorflow-char-rnn, last visited 12/12/2016.
Bob Dylan, http://bobdylan.com, last visited 10/20/2016.
Goodfellow I., Bengio Y., and Courville A., Deep Learning, book in preparation for MIT Press, 2016, http://www.deeplearningbook.org.
Michael A. Nielsen, Neural Networks and Deep Learning, Determination Press, 2015.
Jurafsky D. and Martin J., Speech and Language Processing, Prentice Hall, 2008.