Automatic Voice Title Generation by Extractive Summarization of Long Product Titles using Bi-LSTM

Snehasish Mukherjee
Stanford University

[email protected]

Phaniram Sayapaneni
WalmartLabs, Sunnyvale

[email protected]

Abstract

The proliferation of voice enabled smart assistants has created a huge market for voice commerce, but the user experience suffers due to the lack of short, voice friendly titles. In this paper we investigate supervised techniques for the automatic generation of short titles given the long ones. We show that even with a small parallel dataset of around 19,000 title compression examples, LSTM based architectures work well without over-fitting. To improve the fluency of the generated short titles we propose and implement a modified Kneser-Ney based trigram language model trained on 1 Million grocery search queries. We also fine-tuned SOTA architectures like BERT, GPT-1 and XLNet and applied them to our problem, but found that relatively simpler architectures outperformed them. Our self-attentive BiLSTM + CRF model achieved a ROUGE-1 F1 score of 0.85, which beat BERT by a huge margin.

1 Introduction

Voice Commerce is already a 2 Billion dollar industry and still growing very fast. In voice commerce, the product titles need to be read out to the user at various stages of the user journey like search, browse, add to cart, etc. This presents a unique problem since e-commerce product titles are long and not suitable for voice based interaction. Reading out a title like “Puffs Plus Lotion Facial Tissue, 4 Mega Cubes, 72 Facial Tissues Per Cube (288 tissues total)” takes too long and degrades the user experience, which is a major hindrance to the adoption of this technology. The e-commerce industry gets around this problem by manually generating voice friendly titles, or voice titles for short, for each product. This is not scalable for large assortment sizes of 50 Million items or more, and hence the majority of these products do not have voice friendly titles.

In this project we aim to solve this problem by leveraging Deep Learning techniques to automatically compress a long product title into the corresponding short title. Given a long product title string, we feed it as the input to an RNN based neural network that predicts the output short title. For example, given the long title “Puffs Plus Lotion Facial Tissue, 4 Mega Cubes, 72 Facial Tissues Per Cube (288 tissues total)”, the network should generate the short title “facial tissue”.

Formally, we want to build a model M that, given the long title L = (L_i)_{i=1}^n containing n words L_1, ..., L_n as input, can generate the corresponding short title S as the output, so that

S = M(L)

where S = (S_i)_{i=1}^m has m words S_1, ..., S_m, with m ≤ n and S_i := L_{k(i)} for a strictly increasing index map k, i.e., S is a subsequence of L. Additionally, S should be reasonably descriptive and retain the root product identity, i.e., the short title for “Almond milk” should be “milk” and not “Almond”.
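To make the extractive formulation concrete, the following minimal sketch (the mask values here are illustrative, not model predictions) shows how a binary keep/delete mask over the long title's tokens yields the short title as a subsequence:

```python
# Minimal illustration of extractive title compression as a 0/1 keep-mask.
long_title = ("puffs plus lotion facial tissue , 4 mega cubes , "
              "72 facial tissues per cube ( 288 tissues total )").split()

# 1 = keep the token in the short title, 0 = delete it (hypothetical mask).
keep_mask = [0, 0, 0, 1, 1] + [0] * (len(long_title) - 5)

short_title = " ".join(tok for tok, keep in zip(long_title, keep_mask) if keep)
print(short_title)  # -> "facial tissue"
```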

CS230: Deep Learning, Winter 2018, Stanford University, CA. (LaTeX template borrowed from NIPS 2017.)


2 Related work

Our work can be classified under the broad category of text summarization. This is an area of active research and has evolved quite a bit with the advent of RNNs. [21, 7, 2, 12] present comprehensive surveys of neural as well as classical text summarization techniques. While document level summarization [5, 4, 19] deals with the problem of generating document level summaries of the content, our work is best described as sentence compression [27, 26, 6, 16, 30, 9, 10], which involves summarizing long sentences to shorter ones while preserving the core intent.

There are 2 distinct flavors of summarization independent of the source granularity: Abstractive and Extractive. Abstractive summarization [6, 20, 25, 4] can produce summaries consisting of words not present in the input, while Extractive summarization [10, 9, 30, 26] aims to generate summaries using words or sentences extracted from the original input. We will focus on neural approaches to the extractive flavor of sentence compression for the remainder of this section.

A popular approach to extractive sentence compression is to model the problem as a sequence-to-sequence learning problem using an encoder-decoder based architecture [10, 9, 16]. The encoder learns a distributed representation of the input sentence which is then fed to the decoder, which is trained to produce a binary 0/1 label for each input word denoting whether to delete or keep that word in the output summary. Beam search can be used [10, 27, 28] to sample from the decoder output probability distribution to generate the most likely compression.

[26] and [30] present 2 approaches different from the encoder-decoder based ones. [30] uses Reinforcement Learning with KEEP and DELETE policies. Words are kept or deleted based on the policy while a pre-trained language model provides feedback on the generated summary. This results in the system learning the optimal policy over time. Like all the other deletion based compression models, [26] still tags words in the input sequence with a 0/1 label, but treats it as a sequence labelling problem instead of a sequence generation problem. Hence it does away with the decoder and employs stacked BiLSTM layers with a final CRF layer at the output that does the 0/1 classification. Our work in this project closely follows this approach with minor modifications.

Most of the techniques discussed here, with the exception of [26], require a considerable amount of training data. While [16, 9, 30] are trained on the Gigaword corpus, [10] is trained on 2 Million news headline and summary pairs. Though these 2 Million examples were synthetically generated following the method in [11], it is based on the syntactic structure of the English language and its parse trees. [9] also proposes an unsupervised approach involving generating training data by adding noise, but the results lag behind supervised approaches. Since the distribution of the words in the product titles, and in general the vocabulary of our problem domain, is different from general English, data sparsity and the unavailability of e-commerce specific embeddings pose a challenge for us. Interestingly, [26] reports state of the art performance with only 10,000 pairs of original and compressed sentences, which is the reason we chose to implement a modified version of this approach to solve our problem of product title compression.

3 Dataset and Features

We use 2 different private datasets from WalmartLabs. The first dataset consists of 19,254 human generated short product titles and their original long counterparts. These (long title, short title) pairs correspond to the top selling grocery items and are the most heavily used in voice shopping. We call this the “compression dataset”. The second dataset consists of around 12 Million anonymous online grocery search queries and their frequencies taken over a 5 month period. Each (query, frequency) pair is what was observed in a day, hence multiple rows exist for the same query, one for each day. We call this the “search query dataset”.

Normalization: We normalize the product titles and the search queries by converting to lower case, removing all non-ASCII characters, padding commas with spaces, replacing “&” with “and”, squeezing consecutive spaces, and removing all non-alphanumeric characters with the exception of dot (“.”), percent (“%”), single quote (“'”) and comma (“,”). The commas were padded with spaces so that they can be extracted as tokens. Next we filter out all those samples from the compression dataset which are not purely extractive. After the last step we were left with 15,856 parallel title compression examples. From the search query dataset we remove all queries with total frequency less than 5. This leaves us with 1,108,043 top searched queries.
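A minimal sketch of this normalization step is shown below; the regular expressions and the helper name are illustrative assumptions rather than our exact code:

```python
import re

def normalize(text):
    """Illustrative normalization roughly matching the steps described above."""
    text = text.lower()
    text = text.encode("ascii", "ignore").decode()   # drop non-ASCII characters
    text = text.replace("&", " and ")
    text = text.replace(",", " , ")                  # pad commas so they become tokens
    text = re.sub(r"[^a-z0-9.%',\s]", " ", text)     # keep only the allowed characters
    text = re.sub(r"\s+", " ", text).strip()         # squeeze consecutive spaces
    return text

print(normalize("Great Value Organic Extra Virgin Olive Oil, 25.5 oz"))
# -> "great value organic extra virgin olive oil , 25.5 oz"
```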


[Figure 1: Distribution of query lengths and short title lengths look similar. (a) Distribution of query lengths; (b) Distribution of short title lengths.]

Training/Validation/Test: We use a 90/10 split of the compression dataset for the training and test sets respectively. Additionally, before every epoch another randomly chosen 10% of the data is set aside as the validation set. The search query dataset is not directly used for training. Table 1 lists a few statistics of the datasets, while Figure 1 shows that the short title lengths and the search query lengths are similarly distributed with a mode around 2. Table 2 lists a few examples of long-short title pairs from the compression dataset.
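A minimal sketch of this split is shown below; the scikit-learn helper is standard, but the per-epoch validation sampling shown here is only an illustrative assumption about how it can be wired up:

```python
from sklearn.model_selection import train_test_split

# `pairs` stands in for the 15,856 (long title, short title) examples of the compression dataset.
pairs = [("great value organic extra virgin olive oil , 25.5 oz", "olive oil")] * 100

# 90/10 split into training and test sets.
train_pairs, test_pairs = train_test_split(pairs, test_size=0.10, random_state=42)

# Before every epoch, a fresh randomly chosen 10% of the training data is held out for validation.
for epoch in range(3):
    fit_pairs, val_pairs = train_test_split(train_pairs, test_size=0.10)
    # model.fit(...) would be called here on fit_pairs, with val_pairs as validation data.
```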

Table 1: Dataset Statistics

Statistic                  Long Titles    Short Titles    Search Queries
No. of samples             15,856         15,856          1,108,043
Vocabulary size            8,032          1,833           159,994
Median length              10             2               3
95th percentile length     15             5               6

Table 2: Examples from compression dataset

Long Title                                                               Short Title
great value organic extra virgin olive oil , 25.5 oz                    olive oil
welch's bottle 100% juice coconut water tropical grape berry , 64 oz    grape berry juice
huy fong foods chili garlic sauce , 18 oz                               chili garlic sauce
great value crunchy granola bars , oats and honey , 1.4 oz , 6 count    granola bars

4 Methods

We model the problem of generating voice friendly titles as a supervised sequence labelling problem where we are given a long product title and we have to label each token with a 0 or 1. The words tagged 1 by the model will be a part of the summary. Formally, given the input training pairs (X, Y), the model tries to learn the optimal parameter θ′ such that

θ′ = argmax_θ Σ log p(Y | X; θ)        (1)

where Y_i ∈ {0, 1} and X_i ∈ R^d is an embedding of a word in a d-dimensional vector space. Additionally, in our problem we keep |X| = |Y| and use padding to guarantee this.

Since our inputs and outputs are sequential, we use RNNs, and since our input and output lengths are the same we choose not to use any encoder-decoder based architecture. Instead we use an architecture template where the input first passes through an Embedding layer, followed by a stack of RNNs, which exit into a Dense layer with Sigmoid activation. This lets each time step output a 0 or 1, which is precisely what we need. We experiment with various choices while staying within this broad framework.
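A minimal Keras sketch of this architecture template is shown below; the layer widths, embedding size and padded sequence length are illustrative assumptions rather than our tuned hyperparameters:

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 8032 + 1   # long-title vocabulary from Table 1, plus one index reserved for padding
MAX_LEN = 20            # padded sequence length (95th percentile long-title length is 15)
EMBED_DIM = 100         # illustrative embedding size

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM, mask_zero=True)(inputs)
# Three stacked BiLSTM layers; return_sequences=True keeps one output per time step,
# matching the per-token 0/1 labels.
for _ in range(3):
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
# One sigmoid unit per time step: the probability of keeping that token.
outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```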

To overcome the problem of vanishing gradients in RNNs, we use Long Short Term Memory (LSTM) networks instead. LSTMs solve this problem by using gates which can learn to remember states and propagate them through the network. The equations for the activation, update, forget and output gates along with the current and previous states are shown in equation 2, where symbols have their usual meaning.

Γ_f^(t) = σ(W_f [a^(t−1), x^(t)] + b_f)
Γ_u^(t) = σ(W_u [a^(t−1), x^(t)] + b_u)
c̃^(t) = tanh(W_C [a^(t−1), x^(t)] + b_C)
c^(t) = Γ_f^(t) ∘ c^(t−1) + Γ_u^(t) ∘ c̃^(t)
Γ_o^(t) = σ(W_o [a^(t−1), x^(t)] + b_o)
a^(t) = Γ_o^(t) ∘ tanh(c^(t))        (2)
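For concreteness, a single LSTM time step implementing equation 2 can be sketched in plain NumPy as follows; the parameter names mirror the gate symbols above, and the shapes are illustrative assumptions (in practice the Keras LSTM layer handles this internally):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, a_prev, c_prev, params):
    """One LSTM time step following equation 2.

    x_t: input embedding at time t, shape (d,)
    a_prev, c_prev: previous hidden and cell states, shape (h,)
    params: dict of weight matrices W_* with shape (h, h + d) and biases b_* with shape (h,)
    """
    concat = np.concatenate([a_prev, x_t])                       # [a^(t-1), x^(t)]
    gamma_f = sigmoid(params["W_f"] @ concat + params["b_f"])    # forget gate
    gamma_u = sigmoid(params["W_u"] @ concat + params["b_u"])    # update gate
    c_tilde = np.tanh(params["W_C"] @ concat + params["b_C"])    # candidate cell state
    c_t = gamma_f * c_prev + gamma_u * c_tilde                   # new cell state
    gamma_o = sigmoid(params["W_o"] @ concat + params["b_o"])    # output gate
    a_t = gamma_o * np.tanh(c_t)                                 # new hidden state
    return a_t, c_t
```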

We also experiment with Bi-Directional LSTMs (BiLSTMs), which are an improvement over LSTMs since they allow the output at a time step to be generated from both the left and the right contexts. BiLSTMs use the same LSTM nodes but require the input to be passed both from left to right as well as from right to left, so the output of each time step can have context from both the left and right sides of the sequence. Conditional Random Fields, or CRFs, have been successfully used for sequence labelling [13, 26], and in some of our models we add a CRF layer on top of the BiLSTM stack to do the final 0/1 classification.
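Our self-attentive variants additionally apply an attention step over the BiLSTM outputs. Since the exact attention formulation is an implementation detail not spelled out above, the snippet below is only a hedged sketch of one plausible wiring, using Keras' built-in multi-head attention between the BiLSTM stack and the sigmoid output layer (padding masks are omitted for brevity, and the residual connection is an assumption):

```python
from tensorflow.keras import layers, models

VOCAB_SIZE = 8032 + 1
MAX_LEN = 20
EMBED_DIM = 100

inputs = layers.Input(shape=(MAX_LEN,))
x = layers.Embedding(VOCAB_SIZE, EMBED_DIM)(inputs)   # mask handling omitted for brevity
for _ in range(3):
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)

# Self-attention: every time step attends over all BiLSTM outputs of the same title.
attn = layers.MultiHeadAttention(num_heads=4, key_dim=64)(x, x)
x = layers.LayerNormalization()(layers.Add()([x, attn]))   # residual + norm (an assumption)

outputs = layers.TimeDistributed(layers.Dense(1, activation="sigmoid"))(x)
model = models.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```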

We use binary cross-entropy loss (equation 3) to train our model. Since we are using zero padding to ensure all sequences have the same length, the distribution of the labels is heavily skewed towards zero, so the model might learn to always predict 0 and yet get a good accuracy. We get around this class imbalance problem by using a weighted binary cross-entropy scheme that penalises errors on the 1 labels more heavily.

H_p(q) = −(1/N) Σ_{i=1}^{N} [ y_i · log(p(y_i)) + (1 − y_i) · log(1 − p(y_i)) ]        (3)
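As a hedged sketch (the positive-class weight below is an illustrative value, not our tuned setting), such a weighted binary cross-entropy can be written as a custom Keras loss:

```python
import tensorflow as tf

def weighted_binary_crossentropy(pos_weight=5.0):
    """Binary cross-entropy that penalises errors on the rare 1 (keep) labels more heavily."""
    def loss(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        per_token = -(pos_weight * y_true * tf.math.log(y_pred)
                      + (1.0 - y_true) * tf.math.log(1.0 - y_pred))
        return tf.reduce_mean(per_token)
    return loss

# model.compile(optimizer="adam", loss=weighted_binary_crossentropy(pos_weight=5.0))
```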

Finally, as in machine translation, we hypothesize that the fluency of the generated compressions can be greatly improved by using beam search guided by a language model; here we use a modified Kneser-Ney smoothed trigram model trained on the search query dataset. This can also get around the problem of the trained model sometimes predicting all 0s.
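A hedged sketch of this idea is shown below, using NLTK's interpolated Kneser-Ney model as a stand-in for the modified Kneser-Ney trigram model described above, and simplifying the beam search to rescoring a handful of complete candidate compressions:

```python
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline, pad_both_ends
from nltk.util import ngrams

# Toy stand-in for the ~1.1M normalized grocery search queries.
queries = [q.split() for q in [
    "olive oil", "extra virgin olive oil", "grape juice",
    "grape berry juice", "facial tissue", "chili garlic sauce",
]]

ORDER = 3
train_ngrams, vocab = padded_everygram_pipeline(ORDER, queries)
lm = KneserNeyInterpolated(ORDER)          # interpolated Kneser-Ney smoothed trigram LM
lm.fit(train_ngrams, vocab)

def fluency(tokens):
    """Sum of trigram log-probabilities of a candidate short title (higher = more fluent)."""
    padded = list(pad_both_ends(tokens, n=ORDER))
    return sum(lm.logscore(w, tuple(ctx)) for *ctx, w in ngrams(padded, ORDER))

# Rescore a few candidate compressions of a long title and keep the most fluent one.
candidates = [c.split() for c in ["facial tissue", "facial", "tissue"]]
print(max(candidates, key=fluency))
```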

A broad level view of our model architecture thus follows the template described above: a fixed length input passes through an embedding layer into a stack of (Bi)LSTM layers, and exits through a time distributed sigmoid output layer, optionally with self-attention over the (Bi)LSTM outputs and a CRF layer before the final classification.

Finally, since we are tied down by the lack of parallel training examples from our specific domain, we try out transfer learning by fine-tuning 3 state-of-the-art models, namely BERT [8], GPT-1 and XLNet [29].
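As a hedged sketch, this fine-tuning can be framed as token classification with two labels (keep/delete); the snippet below uses the Hugging Face transformers library with illustrative labels and hyperparameters, not our exact setup:

```python
import torch
from transformers import BertTokenizerFast, BertForTokenClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForTokenClassification.from_pretrained("bert-base-uncased", num_labels=2)

long_title = "puffs plus lotion facial tissue , 4 mega cubes"
# 1 = keep, 0 = delete, one label per whitespace token (illustrative labels).
word_labels = [0, 0, 0, 1, 1, 0, 0, 0, 0]

enc = tokenizer(long_title.split(), is_split_into_words=True,
                return_tensors="pt", padding="max_length", max_length=32, truncation=True)

# Align word-level labels to word-piece tokens; special/padding pieces get -100 and are ignored.
labels = [-100 if wid is None else word_labels[wid] for wid in enc.word_ids(0)]
enc["labels"] = torch.tensor([labels])

outputs = model(**enc)      # returns the loss and per-token logits
outputs.loss.backward()     # one illustrative gradient step would follow
```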

5 Experiments/Results/Discussion

We use the 15,856 rows of data for training the BiLSTM based neural networks and for fine-tuning the state-of-the-art pre-trained models. We use ROUGE-1 [14] scores for precision, recall and F1 as our evaluation metrics.
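ROUGE-1 measures unigram overlap between the generated and the reference short titles. A minimal sketch of the per-example precision, recall and F1 is shown below (a simplified illustration, not the official ROUGE implementation of [14]):

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap ROUGE-1 precision, recall and F1 for a single example."""
    cand, ref = Counter(candidate.split()), Counter(reference.split())
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

print(rouge1("facial tissue", "facial tissue"))        # (1.0, 1.0, 1.0)
print(rouge1("puffs facial tissue", "facial tissue"))  # precision 2/3, recall 1.0
```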


[Figure 2: Bi-LSTM learning metrics. (a) Loss; (b) Accuracy.]

Additionally, we also look at the number of unseen test cases where the model outputs a summary of all 0s. Figure 2 shows the learning curves (loss and accuracy) for the Bi-LSTM training, and Figure 3 shows the corresponding curves for the transfer learning experiments. We train various combinations of the components described in section 4 (stacked LSTM/BiLSTM layers, self-attention, Kneser-Ney based rescoring, and a CRF output layer) and also fine-tune BERT, GPT-1 and XLNet; the results for all of these model architectures are presented in Table 3.

Table 3: Results

Model                Precision    Recall    F1       Num all-0
3-LSTM               0.831        0.849     0.816    43
3-BiLSTM             0.883        0.876     0.862    25
3-BiLSTM+SA          0.878        0.882     0.862    23
3-BiLSTM+SA+KN       -            -         0.835    23
3-BiLSTM+SA+CRF      0.869        0.872     0.852    8
BERT                 0.797        0.780     0.750    -
Open AI GPT-1        0.730        0.656     0.656    -
XLNet                0.690        0.655     0.640    -

6 Conclusion/Future Work

We started out with the problem of automatically generating voice friendly product titles under the constraint of a small training dataset, and showed that many different variations of LSTM based network architectures work well for this problem. Additionally, we also used pre-trained SOTA models like BERT, GPT-1 and XLNet to solve our problem. However, contrary to our expectations, their results lagged behind relatively simpler techniques. This needs further investigation. We also found that our solution can sometimes be prone to returning all 0s. Our work did not investigate this problem in detail and we will take this up in future work. Finally, we will try to scale up this solution to deploy it for hundreds of millions of product titles.


[Figure 3: Transfer learning metrics. (a) Loss; (b) Accuracy.]

7 Contributions

Both members contributed significantly towards the successful completion of the project.

Snehasish (SUNetID: mukherji) led the project. Snehasish came up with the project idea, did the literature review, collected, cleaned and pre-processed the training data, tested some warm-up models on the Google Research news dataset, designed and trained all the BiLSTM, Attention and CRF based models, built the Kneser-Ney smoothing based beam search, and deployed custom evaluation metrics for Keras.

Phaniram trained a word2vec skip-gram based word embedding, obtained the t-SNE visualization, implemented a synthetic training data generation model, and researched and tried out several state-of-the-art models for our problem including BERT, GPT-1, and XLNet.

Code: The code for all the ideas implemented here, and all other associated experiments, can be found at the GitHub URL: https://github.com/isnehasish/title_sum. Detailed instructions on how to use the notebooks to reproduce the results will be uploaded in the README.md. Please note that the dataset used for this project is not public.

References

[1] “Sentence compression dataset from Google Research,” https://github.com/google-research-datasets/sentence-compression.

[2] M. Allahyari, S. A. Pouriyeh, M. Assefi, S. Safaei, E. D. Trippe, J. B. Gutierrez, and K. Kochut, “Text summarization techniques: A brief survey,” CoRR, vol. abs/1707.02268, 2017. [Online]. Available: http://arxiv.org/abs/1707.02268

[3] K. Arjun, M. Hariharan, P. Anand, V. Pradeep, R. G. Raj, and A. Mohan, “Extractive text summarization using deep auto-encoders,” 2018.

[4] L. Bing, P. Li, Y. Liao, W. Lam, W. Guo, and R. J. Passonneau, “Abstractive multi-document summarization via phrase selection and merging,” in Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing of the Asian Federation of Natural Language Processing, ACL 2015, July 26-31, 2015, Beijing, China, Volume 1: Long Papers, 2015, pp. 1587–1597. [Online]. Available: https://www.aclweb.org/anthology/P15-1153/

[5] J. Cheng and M. Lapata, “Neural summarization by extracting sentences and words,” in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers, 2016. [Online]. Available: https://www.aclweb.org/anthology/P16-1046/

[6] S. Chopra, M. Auli, and A. M. Rush, “Abstractive sentence summarization with attentive recurrent neural networks,” in NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, USA, June 12-17, 2016, 2016, pp. 93–98. [Online]. Available: https://www.aclweb.org/anthology/N16-1012/

[7] V. Dalal and L. G. Malik, “A survey of extractive and abstractive text summarization techniques,” in 6th International Conference on Emerging Trends in Engineering and Technology, ICETET 2013, 16-18 December, 2013, Nagpur, India, 2013, pp. 109–110. [Online]. Available: https://doi.org/10.1109/ICETET.2013.31

[8] J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), 2019, pp. 4171–4186. [Online]. Available: https://www.aclweb.org/anthology/N19-1423/

[9] T. Févry and J. Phang, “Unsupervised sentence compression using denoising auto-encoders,” in Proceedings of the 22nd Conference on Computational Natural Language Learning, CoNLL 2018, Brussels, Belgium, October 31 - November 1, 2018, 2018, pp. 413–422. [Online]. Available: https://www.aclweb.org/anthology/K18-1040/

[10] K. Filippova, E. Alfonseca, C. A. Colmenares, L. Kaiser, and O. Vinyals, “Sentence compression by deletion with LSTMs,” in Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, 2015, pp. 360–368. [Online]. Available: https://www.aclweb.org/anthology/D15-1042/

[11] K. Filippova and Y. Altun, “Overcoming the lack of parallel data in sentence compression,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, 2013, pp. 1481–1491. [Online]. Available: https://www.aclweb.org/anthology/D13-1155/

[12] M. Gambhir and V. Gupta, “Recent automatic text summarization techniques: a survey,” Artif. Intell. Rev., vol. 47, no. 1, pp. 1–66, 2017. [Online]. Available: https://doi.org/10.1007/s10462-016-9475-9

[13] S. S. Gross, O. Russakovsky, C. B. Do, and S. Batzoglou, “Training conditional random fields for maximum labelwise accuracy,” in Advances in Neural Information Processing Systems 19, Proceedings of the Twentieth Annual Conference on Neural Information Processing Systems, Vancouver, British Columbia, Canada, December 4-7, 2006, 2006, pp. 529–536. [Online]. Available: http://papers.nips.cc/paper/2992-training-conditional-random-fields-for-maximum-labelwise-accuracy

[14] C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in ACL 2004, 2004.

[15] Y. Liu, “Fine-tune BERT for extractive summarization,” CoRR, vol. abs/1903.10318, 2019. [Online]. Available: http://arxiv.org/abs/1903.10318

[16] Y. Miao and P. Blunsom, “Language as a latent variable: Discrete generative models for sentence compression,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 319–328. [Online]. Available: https://www.aclweb.org/anthology/D16-1031/

[17] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” in 1st International Conference on Learning Representations, ICLR 2013, Scottsdale, Arizona, USA, May 2-4, 2013, Workshop Track Proceedings, 2013. [Online]. Available: http://arxiv.org/abs/1301.3781

[18] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, 2013, pp. 3111–3119. [Online]. Available: http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality


[19] R. Nallapati, F. Zhai, and B. Zhou, “SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents,” in Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, 2017, pp. 3075–3081. [Online]. Available: http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14636

[20] R. Nallapati, B. Zhou, C. N. dos Santos, Ç. Gülçehre, and B. Xiang, “Abstractive text summarization using sequence-to-sequence RNNs and beyond,” in Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016, 2016, pp. 280–290. [Online]. Available: https://www.aclweb.org/anthology/K16-1028/

[21] A. Nenkova and K. R. McKeown, “A survey of text summarization techniques,” in Mining Text Data, 2012, pp. 43–76. [Online]. Available: https://doi.org/10.1007/978-1-4614-3223-4_3

[22] K. Papineni, S. Roukos, T. Ward, and W. Zhu, “BLEU: a method for automatic evaluation of machine translation,” in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA, 2002, pp. 311–318. [Online]. Available: https://www.aclweb.org/anthology/P02-1040/

[23] J. Pennington, R. Socher, and C. D. Manning, “GloVe: Global vectors for word representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, 2014, pp. 1532–1543. [Online]. Available: https://www.aclweb.org/anthology/D14-1162/

[24] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” in Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), 2018, pp. 2227–2237. [Online]. Available: https://www.aclweb.org/anthology/N18-1202/

[25] T. Shi, Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy, “Neural abstractive text summarization with sequence-to-sequence models,” CoRR, vol. abs/1812.02303, 2018. [Online]. Available: http://arxiv.org/abs/1812.02303

[26] L. D. Viet, N. T. Son, and L. M. Nguyen, “Deletion-based sentence compression using bi-enc-dec LSTM,” in Computational Linguistics - 15th International Conference of the Pacific Association for Computational Linguistics, PACLING 2017, Yangon, Myanmar, August 16-18, 2017, Revised Selected Papers, 2017, pp. 249–260. [Online]. Available: https://doi.org/10.1007/978-981-10-8438-6_20

[27] L. Wang, H. Raghavan, V. Castelli, R. Florian, and C. Cardie, “A sentence compression based framework to query-focused multi-document summarization,” in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers, 2013, pp. 1384–1394. [Online]. Available: https://www.aclweb.org/anthology/P13-1136/

[28] S. Wiseman and A. M. Rush, “Sequence-to-sequence learning as beam-search optimization,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016, 2016, pp. 1296–1306. [Online]. Available: https://www.aclweb.org/anthology/D16-1137/

[29] Z. Yang, Z. Dai, Y. Yang, J. G. Carbonell, R. Salakhutdinov, and Q. V. Le, “XLNet: Generalized autoregressive pretraining for language understanding,” CoRR, vol. abs/1906.08237, 2019. [Online]. Available: http://arxiv.org/abs/1906.08237

[30] Y. Zhao, Z. Luo, and A. Aizawa, “A language model based evaluator for sentence compression,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, 2018, pp. 170–175. [Online]. Available: https://www.aclweb.org/anthology/P18-2028/
