
Profiling Resource Bottlenecks in Sentiment Analysis

Jeff Burge and Matthew Trefilek

February 24, 2020

1 Introduction

There has been much research conducted on optimizing and streamlining machine learning frameworks on distributed systems. Stragglers are known to significantly contribute to poor performance of model training on distributed systems [1]. While methods exist for identifying and handling stragglers in these systems, the underlying causes of these stragglers are still not well understood [3] [6]. A primary reason that stragglers occur is that workers experience performance bottlenecks during execution [5]. We identify CPU workloads, disk usage, memory limitations, and network bandwidth as four potential resource bottlenecks that can throttle the performance of a given worker.

In this paper, we develop a system that allows specific machine learning problems to be more closely investigated with respect to resource workloads across workers in a training cluster. By collecting metrics on a spectrum of different resources during the training of machine learning frameworks, we can identify whether, and which, resource may be acting as a performance bottleneck on a given worker.

1.1 Motivation

Machine Learning as a field has demonstrated the benefits of fine-tuned implementations for specific problems. One example of this is how new hardware has been developed specifically to address the computational requirements of common machine learning systems, most notably with the advent of TPUs [4]. In this line of thinking, it is helpful to know what workloads are experienced when training specific machine learning problems in order to address these unique workloads more directly. By characterizing resource loads in different systems for a specific problem, straggler occurrences in specific settings can be identified and either avoided or carefully handled.

1.2 Problem Formulation

Different machine learning model designs allow shared state and parallelism to be leveraged differently during training. With these considerations in mind, we are interested in exploring how resource usage changes depending on the model being trained and the environment that the model is trained in. Specifically, we look at how CPU usage, disk reads and writes, memory usage, and network receives and sends change for these models across environment settings. We draw general conclusions about how resource usage changes over several main 'axes':

• How does resource usage for training a model change as hyper-parameters shift on a fixed environment?

• How does resource usage for training a model change for different models on a fixed environment?

• How does resource usage change for training a model across different environments?

With regards to resource usage in a given system, there are several different areas that could be throttling an environment's ability to effectively train a model. Specifically, we identify CPU usage, disk reading and writing, memory usage, and network bandwidth as potential bottlenecks for a given system. We seek to design a system that allows metrics to be recorded for each of these bottleneck areas. Once the metrics are collected, bottlenecks can be identified in specific training runs and addressed appropriately.

Exploring how resource usage in this system changes as different datasets are used for model training is an interesting problem, but it greatly increases the complexity of the project. Because we are already exploring how resource usage changes over many different axes, we have decided, for the scope of this project, to work on a fixed dataset. We have chosen a dataset that is a pedagogical example for LSTM training: performing sentiment analysis on tweets. Using the Twitter Sentiment Dataset from Sentiment140 [2], we are able to formulate a standard NLP task as a simple binary classification problem. Additionally, the nature of the training problem causes the models to be trained on extremely wide data, letting us explore resource utilization under a more specialized scenario.

In general, it would be interesting to perform the analysis of this paper on other standard prediction problems.


In this paper, we have set up a system that allows resource metrics from other datasets and prediction problems to be easily obtained.

2 Related Work

A large amount of the inspiration for this project comes from Ousterhout et al., who, in monitoring a distributed Apache Spark system, set out to identify bottlenecks and determine the reasons for stragglers [5]. In a similar fashion, we plan to break the monitoring of overall system performance into logging CPU, disk, memory, and network usage. We extend the example of that paper by declaring and testing the following individual hypotheses for each worker:

1. CPU is a bottleneck.

2. Disk is a bottleneck.

3. Memory is a bottleneck.

4. Network is a bottleneck.

Isolating and testing each of these hypotheses will give insight into which resource needs are not being met within the systems. It will also give guidance in gathering data for a second hypothesis: that resource allocation is responsible for stragglers within the network.

Specifically, we will analyze the different bottlenecks and the prevalence of stragglers as we test a number of different hyper-parameters and cluster scaling parameters when training LSTM networks, SVMs, and Random Forests. We hypothesize that the degree to which these different bottlenecks affect runtime should vary significantly from the testing performed in the Ousterhout paper. These differences are expected to be caused by the lower need for maintaining state, owing to the smaller parameter-space that results from the nature of recurrent networks. We will verify that the differences we see are in line with our expectations for different ML frameworks, and consider what architectural decisions should generally be made when mitigating these problems while training ML frameworks on distributed systems.

The dominant technique for straggler mitigation is speculation, where a system launches multiple speculative copies of a slow task and allows the copies to race to completion. While this general process is good for avoiding slowdowns due to stragglers, it does not address why these stragglers occur in the first place. Some stragglers are caused by problems with the hardware of the machine itself, but others may be caused simply by resource capabilities that are not well suited to the workload placed on the worker. Spinning up more speculative workers does not really avoid this problem, which is why we propose that a smarter strategy for choosing resource parameters for workers be used when executing processes.

3 Design

In this project, we have built a system for collecting resource metrics as well as tools for analyzing the results. Given the focus on the general problem of NLP sentiment analysis in this paper, we have executed our system on the Twitter Sentiment Dataset [2]. By automating the data visualization process, we allow conclusions to be drawn for each of the main problems detailed in the problem statement.

3.1 Data Preparation

The data used for training all models in this project is the Twitter Sentiment Dataset [2]. This dataset consists of 160,000 tweets that include exactly one emoticon. Due to the nature of tweets, there are constraints on the number of words that can appear in each tweet.

We process the tweets by first assigning each tweet a label for positive or negative sentiment depending on which emoticon appears in the tweet. These emoticons are then removed from the dataset. We finish cleaning the data by removing bad characters (including punctuation), transforming all the text to lowercase, and removing a set of standard English stopwords (as provided by nltk).

We then tokenize the data and left pad the tweets with 'null words' to make each item in our dataset the same length. Afterwards, we initialize a dictionary of all the words in our dataset to create a mapping of words to unique keys. For each index location in our tweets, we convert these keys to a one-hot encoding in order to turn our categorical data into integer data that can be fed into training our different models. Note that because we are performing one-hot encoding on each index location in our tweets, our data is as wide as the length of our tweets times the size of our dictionary. While we are interested in seeing how our models behave given wide, sparse input, if it is necessary to reduce the width of the data, it is possible to collapse the one-hot encoding by hashing word-keys into a smaller number of buckets before applying the one-hot encoding.
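
The following is a minimal sketch of this preprocessing path, assuming nltk stopwords and numpy; the helper names (clean_tweet, encode_tweets) and the exact padding and truncation choices are illustrative assumptions rather than the project's actual code.

import re
import string

import numpy as np
from nltk.corpus import stopwords   # requires nltk.download("stopwords") once

STOPWORDS = set(stopwords.words("english"))

def clean_tweet(text):
    # Lowercase, strip punctuation, and drop standard English stopwords.
    text = text.lower()
    text = re.sub("[" + re.escape(string.punctuation) + "]", " ", text)
    return [w for w in text.split() if w not in STOPWORDS]

def encode_tweets(tweets, max_len):
    # Left-pad token lists to max_len and one-hot encode each index location.
    vocab = {w: i for i, w in enumerate(sorted({w for t in tweets for w in t}), start=1)}
    width = len(vocab) + 1                       # index 0 is the 'null word'
    X = np.zeros((len(tweets), max_len, width), dtype=np.float32)
    for row, tokens in enumerate(tweets):
        keys = [vocab[w] for w in tokens[:max_len]]
        padded = [0] * (max_len - len(keys)) + keys
        for pos, key in enumerate(padded):
            X[row, pos, key] = 1.0
    # Flatten so each tweet is one row of width max_len * (vocab size + 1).
    return X.reshape(len(tweets), -1), vocab

Flattening each tweet gives the wide, sparse rows described above; for the hashing variant, the word keys would be reduced modulo a bucket count before the one-hot step.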

3.2 Model Training

We have created implementations of LSTM Recurrent Neural Networks, Support Vector Machines (SVMs), and Random Forests. The LSTM implementation utilizes the Keras library on top of TensorFlow, and the SVM and Random Forest implementations make use of the Scikit-Learn library. For each model, we detail a set of hyper-parameters to be swept.

Figure 1: Metrics Collected During Model Training
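
As a rough illustration, the three model constructors might look like the sketch below. It assumes the Keras and Scikit-Learn APIs named above and feeds the LSTM integer token ids through an embedding ("embed") layer; the layer sizes and default hyper-parameter values are placeholders, not the values actually swept.

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def build_lstm(vocab_size, embed_dim=64, lstm_units=64):
    # Keras LSTM over integer token ids via an embedding layer.
    model = Sequential([
        Embedding(input_dim=vocab_size, output_dim=embed_dim),
        LSTM(lstm_units),
        Dense(1, activation="sigmoid"),   # binary sentiment output
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model

def build_svm(C=1.0, kernel="rbf"):
    # Scikit-Learn support vector classifier.
    return SVC(C=C, kernel=kernel)

def build_random_forest(n_estimators=25, max_depth=None, n_jobs=-1):
    # Scikit-Learn random forest; n_jobs controls worker parallelism.
    return RandomForestClassifier(n_estimators=n_estimators,
                                  max_depth=max_depth, n_jobs=n_jobs)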

3.3 Metric Collection

Our system takes a dataset, a model, and a set of hyper-parameters and executes model training as a sweep over all hyper-parameters. For each run, metrics are collected for resource usage on a per-second basis, as shown in Figure 1.

The system takes a Python file and a list of hyper-parameters and proceeds to iterate across them. To have a basis for measurement, we elected to scan hyper-parameters one at a time, leaving the rest fixed as a form of experimental control. Upon initialization, the system sends the models to run on each node of the cluster, begins running a metrics logging tool, and then collects the results. In order to collect the results, the system kills all subprocesses, collates the data from the metrics and from the model, and then uploads the files to S3. We chose S3 for the stateless nature of the object store. As we were collecting data from ephemeral cloud servers, upload times were fast and could happen concurrently.
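
A minimal sketch of the per-second logger and the S3 upload follows, assuming psutil for the resource counters and boto3 for the upload; the file names, bucket, and CSV schema are illustrative, not the project's actual ones.

import csv
import time

import boto3
import psutil

def log_metrics(path, stop_flag, interval=1.0):
    # Append one row of CPU/disk/memory/network counters per interval.
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["ts", "cpu_pct", "mem_pct",
                         "disk_read", "disk_write", "net_recv", "net_sent"])
        while not stop_flag.is_set():
            disk = psutil.disk_io_counters()
            net = psutil.net_io_counters()
            writer.writerow([time.time(),
                             psutil.cpu_percent(interval=None),
                             psutil.virtual_memory().percent,
                             disk.read_bytes, disk.write_bytes,
                             net.bytes_recv, net.bytes_sent])
            f.flush()
            time.sleep(interval)

def upload_run(path, bucket, key):
    # Push the finished metrics file to S3 once the training run ends.
    boto3.client("s3").upload_file(path, bucket, key)

In practice the logger would run on a background thread, with stop_flag being a threading.Event that the driver sets once the training subprocess exits.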

Over the course of experimentation, we collected over 500 individual metric files, recording runs lasting from a handful of seconds to over two hours.

3.4 Post-Processing

As a preliminary step to interpreting the metrics data collected during model training, we have built functionality to load the collected data into a dictionary of pandas dataframes, putting the data in an accessible format for post-processing. One component of this post-processing is a tool that generates graphs visualizing resource usage across different hyper-parameter sweeps.
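
A minimal sketch of this loader and plotting helper, assuming each run was saved as a CSV following the logger sketch above; the directory layout and function names are illustrative assumptions.

import glob
import os

import matplotlib.pyplot as plt
import pandas as pd

def load_runs(metrics_dir):
    # Return {run_name: DataFrame} for every metrics CSV in the directory.
    return {os.path.splitext(os.path.basename(p))[0]: pd.read_csv(p)
            for p in glob.glob(os.path.join(metrics_dir, "*.csv"))}

def plot_metric(runs, column, ax=None):
    # Overlay one resource metric across all runs of a hyper-parameter sweep.
    ax = ax or plt.gca()
    for name, df in runs.items():
        ax.plot(df["ts"] - df["ts"].iloc[0], df[column], label=name)
    ax.set_xlabel("seconds since start")
    ax.set_ylabel(column)
    ax.legend()
    return ax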

3.5 Environment Setup

To ensure low-latency connections between the clustered machines and high transfer speeds to Amazon S3, we spun up multiple servers in an AWS virtual network. The machines are placed locally within the available hardware and were installed with fresh Amazon Linux 2 operating systems. A configuration file was loaded onto each machine in the cluster to ensure all dependencies remained the same. Most tests were done on a single machine, with the LSTM being tested on both a two-machine cluster and a single machine. The chosen machines were the r5.2xlarge instances provided by Amazon, which gave us 64 GB of RAM and 8 virtual CPUs clocked up to 3.1 GHz.

One consideration with the results of metrics collection in the cloud is the presence of external actors. While we have general control over hardware and placement, hardware is only guaranteed to an extent. Some machines may be in a rack with newer, faster RAM or newer CPUs whose clocks may be slightly faster than the guaranteed minimum. And while we were communicating over a private subnet, inter-rack communication can still be bottlenecked due to oversubscription.

4 Evaluation

Our work has created a system where data can easily be collected and analyzed to gain insights into the resource usage of workers in a given environment. General resource trends can be identified and resource bottlenecks can be flagged. Additionally, higher-level analysis can be applied to the data to gain deeper insight into the nature of resource workloads across different model implementations.

As a case study, we return to the Twitter Sentiment Analysis problem. Having collected results for LSTM, SVM, and Random Forest implementations across hyper-parameter sweeps on different cluster deployments, we analyze the collected resource metrics and do some high-level analysis of implications for this problem space from our data. This exploration of NLP sentiment analysis serves as a model for how other ML problems could be explored using the system created in this project.

A representative sampling of graphs for different collected metrics across different model hyper-parameter runs can be found in Figure 2. These models have been trained on a single machine on AWS to guarantee an isolated training environment.

4.1 Intra-Model Findings

One avenue of analysis is to consider how changing different hyper-parameters affected resource usage while keeping all other hyper-parameters fixed. The following are a sampling of inferences we make from analyzing graphs across these hyper-parameter sweeps. A large amount of data was collected, and making inferences over all of it will require, as a next step, more sophisticated analysis methods, perhaps unsupervised learning techniques.

In Figure 3, we see confirmation of how the number of jobs allowed for each worker affects both the maximum CPU usage allowed and the overall runtime of the job.

Figure 2: Representative Sample of Metrics Collected

Comparing SVM CPU usage and SVM memory usage side-by-side yields some interesting results. We see in Figure 4 how CPU usage and memory usage thrash between full usage and very low usage. This indicates there is some type of loading and unloading of the model during training that is slowing down the process. An exploration of running the SVM with different environment parameters, or even changing the algorithm used in Scikit-Learn for SVM training, could help mitigate this thrashing and improve performance.

A common way to change the nature of a neural network is to change the number of layers in the network and the number of nodes in each layer. Increasing the number of layers and increasing the width of layers both amount to increasing the number of trainable parameters in the system. In Figure 5 we see how the neural network's resource usage during training changes as we alter the size of the embed layer of the recurrent neural network. As the dimension of the embed layer increases, CPU usage slowly increases, the number of disk writes increases, and memory usage increases (albeit by relatively little; this is an artifact of the fact that we are already capping memory usage, so RAM does not increase as it would if more memory were available). This supports our expectation that bigger networks (more trainable parameters) use more of every resource. The next step would be to determine what type of relationship the number of trainable parameters has with each resource type and to compare with the model accuracy for each number of trainable parameters. Once this relationship is established, we can turn this into an optimization problem by balancing model accuracy against the cost of the resources necessary to train a model with n trainable parameters.
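
The link between embed dimension and trainable parameter count can be checked directly; the snippet below is a quick check using the hypothetical build_lstm sketch from Section 3.2, with an assumed vocabulary size of 20,000 and padded tweet length of 40.

# Quick check of how the embed dimension drives trainable parameter count.
for embed_dim in (32, 64, 128, 256):
    model = build_lstm(vocab_size=20000, embed_dim=embed_dim)
    model.build(input_shape=(None, 40))   # 40 = assumed padded tweet length
    print(embed_dim, model.count_params())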

Analysis was also performed across hyper-parameter sweeps for Random Forests (changing the number of leaves, branching, and depth, among other parameters), but there were relatively small changes across training runs as hyper-parameters were swept, as seen in Figure 6. This suggests that, at least for our problem of sentiment analysis on tweets, altering most hyper-parameters of Random Forests does not significantly affect resource usage during training.

4.2 Inter-Model Findings

Referring to Figure 2 provides a general landscape of how the different machine learning model frameworks behave in comparison to each other. Random Forest executes by far the fastest, in 6 seconds; LSTM executes in about 35 seconds; and SVM takes about a minute to train. We see that Random Forest models require substantially less CPU processing than the other models, reaffirming our prior notion that the Random Forest is a cheap, relatively lightweight model to train. With the current environment set-up and data being trained upon, the only time that the Random Forest needs to write to disk is when we increase the number of estimators in the model to 25. This contrasts significantly with both the LSTM and SVM models, which require many disk writes, especially as parameters change to require more stored parameters (especially in the case of LSTM). We see that both the Random Forest and LSTM models experience plateauing memory usage, while SVM experiences the memory thrashing explored previously.

4.3 Cluster Deployment Findings

In Figure 10 we see how resource usage in LSTM training differs between training on a single machine and training on a cluster of two machines. We note that training on a cluster greatly increases training time for our model, up to around two minutes. We suspect that, because of the construction of our LSTM and the data being provided, the costs of communication far outweigh any potential gains from increased CPU capacity and memory storage. We note that in the cluster a significant amount of data is communicated between the nodes and that each node makes liberal use of disk writes. We hypothesize that as we scale the data and the size of the model, and it becomes infeasible to load the data into memory on a single machine, the benefits of a distributed system would become evident.

4.4 Scale-Up vs. Scale-Out

Scaling up vs. scaling out is a commonly debated topic in the field of Big Data. Scaling out requires extensive modifications to code to allow for proper asynchronous or synchronous operations. Our approach to this problem favored scaling up: when memory became a significant bottleneck, we opted for a larger machine instead of a cluster. We found our model could train in reasonable time with comparable accuracy when scaling up.

The other reason scaling up may be preferable is that cloud machines are cheap. We architected our tests to be relatively stateless: after each test is run, the next test starts again from scratch, and no data needs to be shared between runs. This allows us to utilize AWS Spot Instances to save up to 90% of the hourly cost. The downside is that spot instances can be killed when resources are scarce. However, if you are training a batched model that can fit within the bounds of a cloud instance, you can usually allow for it to be killed and brought back up, and still finish after starting from scratch in an acceptable timeframe.

Figure 3: CPU Usage across Jobs for SVMs

Figure 4: CPU and Memory Thrashing in SVM Training

Figure 5: Resource Usage in LSTM

Figure 6: Resource Usage in Random Forest

Figure 7: Solitary Machine Execution

Figure 8: Cluster Execution: Worker 0

Figure 9: Cluster Execution: Worker 1

Figure 10: LSTM Resource Usage Across Environments

5 Future Work

In this paper, we have built a system that collects metrics on resource usage for training machine learning models in different environments. There are many routes of future work that can be performed on top of the groundwork we have laid out.

5.1 Investigating Other Problems

We have investigated the problem of sentiment analysis for NLP. There are countless other common problems that could be processed using this system in order to better understand common bottlenecks for different models. Some examples include image processing, cluster analysis, anomaly detection, and time series problems.

5.2 Generalizing Model Selection

In our current implementation, we have built separate model files for each model, which are then given to the metrics collection system. One way that our metrics collection system could be enhanced is to build out its functionality to dynamically take data and deploy training for a variety of models in a more streamlined fashion. The way we would move to integrate a wide library of models would be to harness common libraries such as Keras and Scikit-Learn.

5.3 Extending Metric Analysis

Our system allows collected metrics to be easily loaded into dictionaries of pandas dataframes in Python. While there is already functionality for visualizing much of the data, it would be interesting to build out functionality to more deeply analyze the data itself. One preliminary line of analysis would be to apply unsupervised learning to the collected metrics by performing cluster analysis and anomaly detection. This would allow training runs to be better classified (by clustering) and overarching trends to be more easily identified. Additionally, anomaly detection would allow particularly good (or bad) runs to be quickly identified, enabling more focused analysis of these unique training runs.
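
As a sketch of what this analysis might look like (assuming each run is first summarized into a fixed-length feature vector of per-column means, and reusing the hypothetical load_runs helper from Section 3.4), Scikit-Learn's clustering and anomaly detection tools could be applied directly:

import pandas as pd
from sklearn.cluster import KMeans
from sklearn.ensemble import IsolationForest

def summarize(runs):
    # One row per run: the mean of every numeric metric column.
    return pd.DataFrame({name: df.drop(columns=["ts"]).mean()
                         for name, df in runs.items()}).T

features = summarize(load_runs("metrics/"))
clusters = KMeans(n_clusters=3, n_init=10).fit_predict(features)   # group similar runs
outliers = IsolationForest(random_state=0).fit_predict(features)   # -1 marks anomalous runs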

5.4 Automatic Bottleneck Detection

This is closely related to extending metric analysis, but it is perhaps the most important analysis to perform on the collected metrics. Automatically detecting resource bottlenecks will allow the question of why certain resources become bottlenecks to be more closely addressed. This detection also sets the stage for the most important direct application of resource metric collection: dynamic resource spin-up.
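
A first cut at such detection could be rule-based; the sketch below flags a resource when its utilization stays above a threshold for most of a run, using the column names from the logger sketch in Section 3.3 (the 90% threshold and the 50% saturation fraction are illustrative assumptions).

def flag_bottlenecks(df, threshold=90.0, fraction=0.5):
    # Flag a resource if its utilization exceeds `threshold` percent for
    # more than `fraction` of the sampled seconds in the run.
    flagged = []
    for column in ("cpu_pct", "mem_pct"):
        if (df[column] > threshold).mean() > fraction:
            flagged.append(column)
    return flagged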

5.5 Dynamic Resource Spin-Up

At the moment, our collection tool simply batches the metrics while running and sends them to S3 upon completion. Our ideal end state would be a streaming system that incorporates automatic setup of a new AWS instance when a CPU or memory threshold is hit. If the network is a bottleneck for clusters with large amounts of synchronous data, there are special "n"-suffix instances that can be launched in AWS, allowing for larger network bandwidth. If we have a machine or stateless job that can analyze the streaming information, then when a threshold is hit, the next size instance can be spun up to replace the machine or cluster.

To achieve this goal, the test would be to run models that we know stress specific resources, hit the bottlenecks on smaller instances, and immediately bring up a new instance. With spinning up new instances often taking less than a minute, we could then perform a time vs. cost analysis as both machines or clusters finish.
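
A minimal sketch of the threshold trigger, assuming boto3 for launching instances and psutil for the local utilization check; the AMI id, the thresholds, and the 'next size up' table are illustrative assumptions rather than part of the system described above.

import boto3
import psutil

NEXT_SIZE = {"r5.xlarge": "r5.2xlarge", "r5.2xlarge": "r5.4xlarge"}

def maybe_scale_up(current_type, ami_id, cpu_threshold=90.0, mem_threshold=90.0):
    # Launch the next-larger instance if CPU or memory crosses a threshold.
    if (psutil.cpu_percent(interval=1.0) < cpu_threshold and
            psutil.virtual_memory().percent < mem_threshold):
        return None
    ec2 = boto3.client("ec2")
    response = ec2.run_instances(ImageId=ami_id,
                                 InstanceType=NEXT_SIZE[current_type],
                                 MinCount=1, MaxCount=1)
    return response["Instances"][0]["InstanceId"]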

6 Conclusion

We have designed a system for profiling resource usage metrics during the training of machine learning models. This system is robust to deployment on different environments, whether single machine or distributed. Our system can run a variety of machine learning frameworks and will perform hyper-parameter sweeps as designated. Collected metrics can easily be loaded into dictionaries of pandas dataframes to be used for higher-level analysis or data visualization. We have demonstrated the use of this system by performing a case study on sentiment analysis for a Twitter dataset, a standard application in NLP. Future work includes continuing to generalize this system and automatically spinning up additional resources when bottlenecks for certain resources are detected in a training run.

Acknowledgements

Special thanks to Dr. Shivaram Venkataraman for instructing CS744 at UW-Madison in Fall 2019.


Code

All code used in this project is available at https://gitlab.com/tref95/cs744-proj/.

References

[1] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI '04: Sixth Symposium on Operating System Design and Implementation, pages 137-150, San Francisco, CA, 2004.

[2] A. Go, R. Bhayani, and L. Huang. Sentiment140.

[3] Z. Guo and G. Fox. Improving MapReduce performance in heterogeneous network environments and resource utilization. In Proceedings of the 2012 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid 2012), pages 714-716, Washington, DC, USA, 2012. IEEE Computer Society.

[4] N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. Jaworski, A. Kaplan, H. Khaitan, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon. In-datacenter performance analysis of a tensor processing unit, 2017.

[5] K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making sense of performance in data analytics frameworks. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), pages 293-307, Oakland, CA, May 2015. USENIX Association.

[6] Y. Yu, M. Abadi, P. Barham, E. Brevdo, M. Burrows, A. Davis, J. Dean, S. Ghemawat, T. Harley, P. Hawkins, M. Isard, M. Kudlur, R. Monga, D. Murray, and X. Zheng. Dynamic control flow in large-scale machine learning. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, pages 18:1-18:15, New York, NY, USA, 2018. ACM.
