
Capítulo 1

Discover Spark in 4 hours

Edward Pacheco, Rodrigo Senra, Vinícius Gottin, Bruno Costa, Wagner Vieira, and Angelo Ciarlini

Abstract

We present an introduction to the Apache Spark platform. The subjects covered by the chapter correspond to an introductory course with an estimated duration of 4 hours. We cover the motivation and development history of the Apache Spark project and related components, the data and execution models of Spark applications, the deployment of the Spark environment, and provide examples of Spark ML and Spark Streaming applications. Finally, we discuss the impact of memory usage on processing time, as well as opportunities for the optimization of Spark jobs.

Resumo

Apresentamos uma introdução à plataforma Apache Spark. Os assuntos cobertos neste capítulo correspondem a um curso introdutório com duração estimada de 4 horas. Cobrimos a motivação e histórico do desenvolvimento do projeto Apache Spark e seus componentes, os modelos de dados e de execução em aplicações Spark, a estruturação de ambientes Spark, e demonstramos aplicações de Spark ML e Spark Streaming. Finalmente, discutimos o impacto do uso de memória no tempo de processamento e oportunidades para a otimização de jobs Spark.

2.1. Introduction

Researchers at the University of California, Berkeley's RAD Lab [1] (later renamed the AMPLab) created Spark with the objective of increasing the performance of Apache Hadoop [2] jobs. Their motivation was founded on two aspects: Hadoop's inefficiency for interactive and iterative jobs, and the decreasing price of computer memory. Spark was then designed to take advantage of in-memory processing, significantly speeding up interactive queries and iterative algorithms. Later, Spark was donated to the Apache Software Foundation, which currently maintains and develops it.

Apache Spark is now an open source framework for large-scale data processing, which enables and supports the scheduling and monitoring of many computational tasks in a computer cluster, typically composed of many worker machines. Besides speed, Spark was developed with a focus on ease of use and generality [3].

In this section, we discuss terms related to the topics of Big Data Analytics and Data Science.

Hadoop was developed following the map-reduce paradigm as an engine to solve Big Data problems. Later, the open source community developed an ecosystem of related projects, and Hadoop became a full software stack for solving Big Data problems. Apache Spark inherits that characteristic of Hadoop.

Traditional Big Data analysis tools were built to be used on host machines. The typical workflow of such an analysis was to have specialist users, with knowledge and expertise in data analysis, start working on a small piece of data in order to validate their models. Only after a set of scripts implementing the validated models was available was it necessary to run them over large data sets. This posed a problem, however, as the code needed to be adapted to execute in clusters in order to deliver results in reasonable time.

In recent years, however, distributed systems have enabled new capabilities in terms of scalability, fault tolerance, and elasticity. These features now allow the development of new analysis tools, reducing the effort to process both small and large data sets. Modern applications in multiple domains - such as social networks, search engines, oil and gas, energy, and biology - are generating large quantities of data every day. For example, Facebook reportedly generates 60TB of logs daily, Google indexes more than 10 PB, and the Genome Project reaches more than 200TB [4].

Data Science is one of the most relevant roles in data analysis. In simple terms, the field of data science is a combination of Hacking Skills, Math & Statistics Knowledge, and Substantive Expertise. Figure 1.1 [5] shows the intersections of these three topics. Combining hacking skills and math & statistics knowledge, we arrive at the machine learning area: for example, someone who develops an algorithm, applies it to a large dataset, and obtains results with a good understanding of what those results mean. Combining math & statistics knowledge and substantive expertise, we get traditional research: for example, a specialist in a domain area analyzing a large dataset and drawing conclusions from it. Between substantive expertise and hacking skills we find a danger zone: this intersection includes people with access to a large domain dataset who run algorithms developed by someone else, understanding the results only partially. Finally, a data scientist is someone with substantive expertise in a specific domain, hacking skills, and math & statistics knowledge. Data scientists are able to process large datasets and to gain a deep understanding of both the algorithms and the results.

Figure 1.1 Data Science Venn Diagram [5].

2.2. Spark Essentials

Spark has seen wide adoption across a range of industries and in the open source development community, currently standing as the top open-source Big Data project [6]. In this section we present the essentials of the framework, the components that compose the Spark ecosystem, and a technical overview of programming in Spark. As all of these topics are widely supported by existing documentation [3], the discussion is only as detailed as needed for the understanding of the subsequent sections. We start by discussing the advantages offered by Spark over alternative frameworks: Spark is a fast, general and easy-to-use engine for large-scale data processing.

Spark is easy to use in two senses. The first is that Spark applications can be written using an API that offers high-level operators in unmodified versions of well-known programming languages, such as Scala, Python, Java and R. Spark can also be used interactively from the Scala, Python and R shells. In a second sense, Spark is easy to use because it seamlessly supports workloads that previously required separate distributed systems, like batch applications, iterative algorithms, interactive queries and streaming [7].

Spark is also general, powering a stack of free libraries for working with structured data (Spark SQL), scalable machine learning (MLlib), graph-parallel computation (GraphX) and fault-tolerant streaming applications (Spark Streaming). We detail these libraries in the next section.

Spark is also general in the sense that it integrates with other Big Data tools and runs over a variety of cluster and resource managers, like Mesos, Hadoop YARN, or the standalone Spark cluster mode. Spark can run both on commodity servers and in the cloud, with hardware requirements similar to those of Hadoop MapReduce [8]; Spark clusters can scale up to thousands of nodes. Finally, Spark can access data in multiple formats, in HDFS, Cassandra, HBase, Hive, Tachyon and any Hadoop data source (Hadoop InputFormat).

As for speed, a main motivation for the development of the framework, Spark is reportedly capable of achieving a speed-up of up to 100x when compared to Hadoop MapReduce. Apache Spark's focus on speed refers to its usage of memory as a substitute for I/O operations. In Hadoop workflows, the results of map-reduce operations are persisted to disk, which implies many read and write operations. Besides the comparatively low performance of I/O operations, this also results in waiting times, as idle resources wait for data to be available for processing. Spark, on the other hand, aims at performing operations in-memory.

Spark offers the concept of a fault-tolerant parallelized collection, called the Resilient Distributed Dataset (RDD). An RDD can be created based on existing collections in the driver program, or in reference to a dataset in an external storage system like HDFS.

The RDD is the main abstraction of Spark. It can be thought of as a collection of elements that are partitioned across the nodes of the cluster, over which transformations and actions operate in parallel. Spark provides the functionality of persisting RDDs in memory through explicit caching operations, allowing them to be reused efficiently across multiple operations. RDDs are resilient in the sense that Spark provides automatic recovery from computational failures. More details about the RDD abstraction are given in the next section, when describing the Spark context and the Spark programming model, as well as the RDD operations (transformations and actions).

2.2.1. Spark Components

One of the features of Spark is the stack of free libraries that compose the Apache Spark

Ecosystem, built over the Spark Core API, the underlying execution engine for the platform. A representation of the Spark Ecosystem can be seen in Figure 2.1.

Figure 2.1 The Apache Spark Ecosystem [9].

2.2.1.1. MLlib

The first such component of the Spark Ecosystem is MLlib, a scalable machine learning library that "delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and blazing speed (up to 100x faster than MapReduce)" [3]. It consists of several machine learning algorithms, as well as utilities for featurization, persistence, linear algebra, and other tasks. The library is usable in Java, Python and Scala.

MLlib provides implementations of machine learning algorithms that perform classification and regression (Linear SVM, Logistic Regression, Decision Trees, Random Forests, Naïve Bayes and Gradient-Boosted Trees), clustering (k-means), collaborative filtering (ALS), and dimensionality reduction (SVD, PCA). Support for featurization tasks includes feature extraction and transformation, dimensionality reduction and feature selection. Finally, MLlib also offers tools for constructing, evaluating, and tuning machine learning workflows, referred to as pipelines. As machine learning is not a one-shot activity, but rather an interactive and iterative process, pipelines are very useful to combine separate steps, such as data pre-processing, model learning, and post-processing of the results, into a sequence of algorithms. In Section 2.6, we discuss Spark ML, an alternative machine learning package in the Spark ecosystem, through an example use case.

2.2.1.2. Spark SQL

Spark SQL is a Spark module which allows relational queries expressed in SQL, HiveQL or Scala to be executed using Spark. Up to version 1.2 [10] of the Spark SQL module, it operated over a type of collection called SchemaRDD, derived from the RDD. Since version 1.3, however, it operates over Datasets and Dataframes, new abstractions introduced in later versions of Spark. More details on Datasets and Dataframes are given in Section 2.3.

Regardless of the version, Spark SQL is used for structured data processing, with powerful integration with the Spark ecosystem, essential for combining SQL querying and machine learning tasks.

2.2.1.3. GraphX

Numerous graph-parallel systems were recently developed based on the growing relevance of problems in domains that are naturally modeled as graphs, e.g. social

network analysis and language processing, as well as in specialized domains, e.g. genome sequencing [11, 12].

GraphX is a graph processing framework built upon Apache Spark, extending the RDD

as a composable graph abstraction (Graph) that is “sufficient to express existing graph APIs, yet can be implemented using only a few basic dataflow operators (e.g., join,

map, group-by)” [13]. GraphX offers great performance gains over the base dataflow framework – an order of magnitude – through several optimizations in both the distributed computational model and in the representation of data types [14].

2.2.1.4. Spark Streaming

Spark Streaming is an API that enables processing of live data streams in a scalable way. It is an extension of the core Spark API, reusing many of its components but also modifying and adding components to enable scalable, high-throughput, fault-tolerant stream processing [15].

We discuss Spark Streaming in detail in Section 2.4 with an example use case.

2.2.2. Spark Cluster Mode

Spark can be deployed in cluster mode, as shown in Figure 2.2. The main components are the driver program, the cluster manager, and the worker nodes. The driver program instantiates the Spark context, defines a global namespace, and coordinates the application lifecycle. The cluster manager allocates computing resources. The worker nodes manage the creation of executors. The Spark context schedules tasks to the executors. The tasks are threads that process chunks of the data, previously loaded into memory as partitions.

As mentioned before, the cluster manager is a service for acquiring computing resources. The cluster manager acquires executor processes to get access to these computing resources. This design enables deployment on other cluster managers such as Mesos or Hadoop/YARN. Section 2.5 describes this type of Spark deployment in more detail.

Figure 2.2. Spark Cluster Mode Overview.

2.2.3. Programming in Spark

To program in Spark, it is necessary to understand some concepts defined by Spark: the Spark context, an abstraction named the Resilient Distributed Dataset (RDD), transformations, and actions. Before reviewing these Spark fundamentals, we describe how to access Spark.

Spark offers two access modes. The Spark shell is the interactive mode to access and write Spark code, which also facilitates testing and debugging tasks. On the other hand, we have the spark-submit tool to launch packaged applications. The spark-submit tool supports the different Spark deployments. Since Spark was implemented in Scala, the default Spark shell is also based on Scala. Spark also comes with a Python Spark shell called pyspark. To get access to the Spark shell, one simply executes $SPARK_HOME/bin/spark-shell or $SPARK_HOME/bin/pyspark for access in Scala and Python, respectively.

As introduced in the previous section, the driver program instantiates the SparkContext object. This context is the central point of access for Spark applications. The Spark shell creates a default Spark context named "sc"; we can start creating RDDs using this variable.

A Resilient Distributed Dataset (RDD) is a data structure defined as an immutable and fault-tolerant collection of elements that can be operated on in parallel. It is immutable because it cannot be modified once created. It is fault tolerant because it can be rebuilt, in case of any failure, using its lineage. RDDs are placed into logical partitions distributed across the cluster. There are two ways to create RDDs: parallelizing an existing collection or referencing a stored dataset, as sketched in the example below.
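The following minimal pyspark sketch illustrates both creation paths. It assumes a standalone script where no context exists yet (in the Spark shells, sc is already provided), and the file name README.md is only an example.

from pyspark import SparkContext

sc = SparkContext(appName="RDDCreationExample")  # in the shells, sc already exists

# (1) Parallelize an existing collection from the driver program
numbers = sc.parallelize([1, 2, 3, 4, 5], 2)  # distributed over 2 partitions

# (2) Reference a dataset in external storage (a local path or an HDFS URI)
lines = sc.textFile("README.md")

print(numbers.count())  # 5
print(lines.count())    # number of lines in the file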

Spark defines transformations as lazy operations over RDDs. This means transformations are only executed when the first action is encountered, creating one or more new RDDs. Table 2.1 shows a list of transformations according to the Spark documentation [3].

Table 2.1. List of Transformations.

Transformations

map(func)
filter(func)
flatMap(func)
mapPartitions(func)
mapPartitionsWithIndex(func)
sample(withReplacement, fraction, seed)
union(otherDataset)
intersection(otherDataset)
distinct([numTasks])
groupByKey([numTasks])
reduceByKey(func, [numTasks])
aggregateByKey(zeroValue)(seqOp, combOp, [numTasks])
sortByKey([ascending], [numTasks])
join(otherDataset, [numTasks])
cogroup(otherDataset, [numTasks])
cartesian(otherDataset)
pipe(command, [envVars])
coalesce(numPartitions)
repartition(numPartitions)
repartitionAndSortWithinPartitions(partitioner)

Spark actions are RDD operations that produce non-RDD values. An action materializes a sequence of transformations, producing a value. Table 2.2 shows a list of actions provided by the Spark documentation [3].

Table 2.2. List of Actions.

Actions

reduce(func)
collect()
count()
first()
take(n)
takeSample(withReplacement, num, [seed])
takeOrdered(n, [ordering])
saveAsTextFile(path)
saveAsSequenceFile(path) (Java and Scala)
saveAsObjectFile(path) (Java and Scala)
countByKey()
foreach(func)

Table 2.3 shows a minimal Spark application that demonstrates the use of the Spark context, an RDD, transformations and actions. We use the Spark context variable to define the RDD textFile. Next, we use a filter transformation to retrieve the lines containing the string "Spark". Finally, we apply two actions, count and first. Since the RDDs are not cached, each action re-evaluates them.

Table 2.3. Spark Example.

scala> val textFile = sc.textFile("README.md")

scala> val linesWithSpark = textFile.filter(line => line.contains("Spark")) // Filter lines containing Spark

scala> linesWithSpark.count() // Number of items in this RDD

scala> textFile.first() // First item in this RDD
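The same example can be written in the pyspark shell, where the sc variable is also predefined. The sketch below additionally caches the filtered RDD, so the second action reuses the in-memory result instead of re-evaluating the lineage.

text_file = sc.textFile("README.md")
lines_with_spark = text_file.filter(lambda line: "Spark" in line)  # lazy transformation
lines_with_spark.cache()   # keep the filtered RDD in memory once it is computed
lines_with_spark.count()   # first action: triggers the computation
lines_with_spark.first()   # second action: served from the cached partitions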

2.3. Dataframe and Datasets

Spark defines the RDD as the core component to store data, with properties such as immutability, fault tolerance, and distribution. The RDD API provides the methods to manipulate RDDs; we showed a basic example in Section 2.2.3. In addition to RDDs, two new options have been released: the Dataframe and Dataset APIs. The Dataframe API sets a schema for the data, organizing it as named columns. Dataframes are transferred in binary format to an off-heap memory space. Also, an optimized relational query plan is created for queries by an execution engine optimizer called Catalyst. The Dataframe API is available in Scala, Java, Python, and R. Table 3.1 shows an example of a basic use of Dataframes [10].

Table 3.1. Dataframe Example.

val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame

val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Select people older than 21

df.filter(df("age") > 21).show()
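For reference, a minimal sketch of the same example in Python, using the SQLContext-based API of the Spark 1.x documentation and the example file shipped with Spark:

# Python version of the Dataframe example above
from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)

# Create the DataFrame
df = sqlContext.read.json("examples/src/main/resources/people.json")

# Select people older than 21
df.filter(df["age"] > 21).show()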

The Dataset API combines the advantages of RDDs with the Catalyst query optimizer. A preview of this API was released in Spark 1.6, supporting the Scala and Java languages. The Dataset API follows an object-oriented programming style. Table 3.2 displays a basic example using Datasets [10].

Table 3.2. Dataset Example.

// Encoders for most common types are automatically provided by importing sqlContext.implicits._

val ds = Seq(1, 2, 3).toDS()

ds.map(_ + 1).collect() // Returns: Array(2, 3, 4)

// Encoders are also created for case classes.

case class Person(name: String, age: Long)

val ds = Seq(Person("Andy", 32)).toDS()

// DataFrames can be converted to a Dataset by providing a class. Mapping will be done by name.

val path = "examples/src/main/resources/people.json"

val people = sqlContext.read.json(path).as[Person]

2.4. Spark Streaming

The goal of the Spark Streaming API is to unify stream processing and batch jobs under

a language-integrated API. It is possible to join streams against historical data, or run ad-hoc queries on stream state.

Internally, Spark Streaming splits live input data streams into batches, which are

converted to RDDs to be processed by regular Spark core functionality. There are input connectors for many data providers, as depicted in Figure 4.1 [16]. In general, data coming from TCP sockets can be streamed into the core Spark Engine to be processed

by high-level functions like map, reduce, join and window.

Figure 4.1. Stream Connectivity [16].

The basic abstraction is the Discretized Stream, or DStream, representing a continuous stream of data. Each DStream is represented by a sequence of RDDs, which are subject to the standard Spark API. Figure 4.2 [16] illustrates the data pipeline.

Figure 4.2. Streaming Pipeline [16].

We now present an example in which data - temperature measurements in the city of San Francisco - are served over a socket connection to a Spark Streaming application, which performs a simple periodic consolidation.

The code in Table 4.1 consists of the server-side application, implemented in the Python programming language. It reads data from a comma-separated value (CSV) file 'sf-2008-with-head.csv' and pushes data rows into a socket stream at regular intervals. The arguments for the execution of the Python script are the server IP address, the port, and the interval at which data are served.

Table 4.1. Python script for a server application in a Spark Streaming example.

import socket
import sys
import time
import os

import pandas as pd

PATH = os.path.join('..', '..', 'data', 'load', 'sf-2008-with-head.csv')

def initialize_socket(*args):
    # Create a TCP/IP socket
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Connect the socket to the port where the server is listening
    sock.bind(args)
    sock.listen(1)
    return sock

def main(server, port, interval):
    df = pd.read_csv(PATH)
    while True:
        sock = initialize_socket(server, port)
        connection, client_address = sock.accept()
        try:
            try:
                for temperature in df.temperature.values:
                    connection.sendall("{0:3.2f}\n".format(temperature))
                    print("temperature: {0:3.2f}".format(temperature))
                    time.sleep(interval)
            except socket.error:
                print("client disconnected")
        finally:
            connection.close()
            sock.close()

if __name__ == "__main__":
    main(sys.argv[1], int(sys.argv[2]), int(sys.argv[3]))

Table 4.2 shows the code, also implemented in Python, for the receiving application. This application defines a time window operation that runs at regular intervals defined by two parameters: the window length (the first argument to 'stream_object.window') and the sliding interval, the interval at which the consolidation operation is performed (the second argument). In the code of Table 4.2, these intervals are constant multiples of the 'batch_interval' argument of the script.

Table 4.2. Python script for a client application in a Spark Streaming example.

import sys

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

def main():
    if len(sys.argv) != 4:
        print("Usage: spark-submit network_wordcount.py "
              "<hostname> <port> <batch_interval>", file=sys.stderr)
        exit(-1)

    server = sys.argv[1]
    port = int(sys.argv[2])
    batch_interval = int(sys.argv[3])

    print("Connecting to server {0}:{1}".format(server, port))
    sc = SparkContext(appName="PythonStreamingNetworkWordCount")
    ssc = StreamingContext(sc, batch_interval)
    stream_object = ssc.socketTextStream(server, port)
    windowed_stream = stream_object.window(batch_interval*3,
                                           batch_interval*2)

    def consolidate_cut(rdd):
        sigma = rdd.map(lambda x: float(x.strip())).reduce(lambda x, y: x + y)
        all_values = str(rdd.collect())
        average = sigma / rdd.count()
        open('output.txt', 'a').write('{0:3.2f} <-(avg) {1:s} \n'
                                      .format(average, all_values))

    windowed_stream.foreachRDD(consolidate_cut)

    ssc.start()
    ssc.awaitTermination()

if __name__ == "__main__":
    main()

The relationship between these two arguments is exemplified by Figure 4.3. In that Figure, the operation is applied over the last 3 time units of data, and slides by 2 time units [17]. That is, every time 2 RDDs in the original DStream are received (for

example the RDDs in time 4 and time 5), the last 3 are operated upon to generate an RDD in the windowed DStream (the bottom RDD in time 5).

Figure 4.3 Windowed computation of a data stream [17].

In the Spark application of Table 4.2, the RDDs served by the server application consist of a single value each - a temperature measurement from the source CSV file. Every

‘batch_interval*2’ seconds, the application consolidates the last ‘batch_interval*3’

measurements by performing a simple average. Figure 4.4 contains a graph showing the

results of such an application.

Figure 4.4 Graph with the results of the execution of the Spark Streaming application.

The graph of Figure 4.4 represents the execution of the example applications over a period of nearly 40 seconds. Each second, an RDD containing a temperature

measurement is served by the server application. Each dot represents one such measurement (Y-axis, in ºF) over time (X-axis, in seconds). The dashed lines represent the average computed for each window of 3 samples.

2.5. Spark Deployment

Spark offers local and cluster modes for its deployment. In local mode, Spark is deployed locally on one machine, making use of the available Java environment. In cluster mode, Spark allows distributed deployment, supporting three options as cluster managers: standalone, Mesos, and Hadoop/YARN. The standalone deployment executes a built-in distributed version of Spark autonomously, without the need for an external cluster manager. This option is useful when the infrastructure is dedicated exclusively to Spark applications.

Spark also supports deployment into Hadoop/YARN and Mesos environments. Both options allow users to explore and evaluate Spark while taking advantage of their current infrastructure. In either local or cluster mode, users have access to a command shell to interact with Spark. This shell is a complete interface suited for learning about, developing, and testing Spark transformations and actions.

In the following sections we summarize the most relevant topics related to the

configuration, building, monitoring and instrumentation of the Spark environment, as well as tuning and configuration settings that we empirically found to be the most relevant for optimization of Spark jobs.

2.5.1. Spark Configuration

Spark defines a full list of properties to specify application settings. The properties cover configurable variables for applications, environment, user interface, memory management, networking, scheduling and security. Properties are set on a SparkConf object, passed as parameters to the command-line tools, or specified in the Spark default configuration file (spark-defaults.conf). These mechanisms are listed in decreasing order of precedence. Table 5.1 shows an example of Spark properties in spark-defaults.conf.

Table 5.1. Spark properties.

Property Value

spark.master spark://192.168.1.100:7077

spark.executor.memory 4g

spark.eventLog.enabled true

spark.serializer org.apache.spark.serializer.KryoSerializer
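The same properties can also be set programmatically on a SparkConf object before the Spark context is created. The sketch below (in Python) simply mirrors the illustrative values of Table 5.1; the application name is arbitrary.

from pyspark import SparkConf, SparkContext

# Programmatic equivalent of the spark-defaults.conf entries in Table 5.1
conf = (SparkConf()
        .setMaster("spark://192.168.1.100:7077")
        .setAppName("ConfiguredApplication")
        .set("spark.executor.memory", "4g")
        .set("spark.eventLog.enabled", "true")
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer"))

sc = SparkContext(conf=conf)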

We empirically found the configuration file to be the most flexible way to define Spark properties. It allows us to test a single Spark application package with different configurations and environments. By setting parameters properly, we can set up a cluster efficiently, as well as identify possible optimization points in our applications.

2.5.2. Building Spark

Sometimes users need to compile Spark packages to add custom functionality, activate a non-default Spark feature, or enable support for a third-party tool. Spark supports SBT and Maven as automated build tools. The main steps to build Spark are to download the source code - or clone its repository from https://github.com/apache/spark.git - and to obtain a recent version of the Java JDK; any extra dependencies can be resolved automatically over an Internet connection. Other Spark build options include building submodules individually, running the Spark unit tests, and making a full Spark package.

Spark uses Maven build profiles to define different settings - for example, to enable Hive support we specify the profile -Phive. Table 5.3 shows a complete command line to build Spark enabling Hadoop 2.7 and Ganglia support. We suggest checking the main documentation (https://spark.apache.org/docs/2.0.0/building-spark.html) in order to ensure that the correct compile parameters are set.

Table 5.3. Compiling Spark with Ganglia support.

$ build/mvn -Pspark-ganglia-lgpl -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.1

-DskipTests clean package

2.5.3. Monitoring and Instrumentation

Spark features several ways of monitoring applications. The web-based tool provides a friendly user interface, displaying information about the currently running applications, an overview of the cluster components' status, and a history log of previously deployed applications.

The web tool shows the lifecycle of Spark jobs. It is possible to see a list of running applications and to get a detailed view of them, including their stages and tasks. The web tool also shows details of the workload of the driver and executors. Lastly, Spark stores information about all executed applications for later analysis.

For example, a simple Spark application containing an action such as count will start the execution of a Spark job, and the dashboard will display the number of tasks, partitions, executors, failed/completed jobs, etc. Figure 5.1 shows a view of the monitoring dashboard while running Spark jobs [18].

Figure 5.1. Spark Monitoring dashboard.

In addition to the web interface, a REST API is available as a resource to create or extend monitoring tools. Furthermore, Spark has a metrics system which can be configured to report metrics to a variety of sinks such as the console, HTTP, JMX, and CSV files. The Spark metrics module collects data from the master, worker, executor, driver, and application processes.
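As a small illustration of the REST API, the sketch below queries the endpoints exposed by a running driver, which listens on port 4040 by default; the host and port are assumptions that depend on the actual deployment, and the third-party requests package is used only for convenience.

import requests  # third-party HTTP client

# A running application's driver exposes the monitoring API under /api/v1
base_url = "http://localhost:4040/api/v1"

for app in requests.get(base_url + "/applications").json():
    jobs = requests.get("{0}/applications/{1}/jobs".format(base_url, app["id"])).json()
    print(app["id"], app["name"], "jobs so far:", len(jobs))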

2.5.4. Tuning

Spark applications might suffer from bottlenecks caused by an intensive use of CPU, memory, and network resources. These problems create an opportunity to identify issues and propose improvements to our Spark applications or to the Spark deployment. Spark highlights data serialization and memory management as tuning opportunities. Additionally, there are parameters related to data locality that can be tweaked for optimization.

Since Spark applications involve network transfer, serialization is a way to reduce the size of the data transferred between Spark compute nodes. Spark sets Java serialization as the default option. However, we can use the Kryo library for faster serialization than is provided by the Java serialization library. Even though the serialization and deserialization processes consume CPU time, data serialization is typically the first step when tuning Spark applications.
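Switching to Kryo is essentially a configuration change, as sketched below; the optional class registration is shown with a hypothetical class name, and registration itself happens on the JVM side.

from pyspark import SparkConf

conf = (SparkConf()
        # Use Kryo instead of the default Java serialization
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Optionally pre-register application classes to avoid writing full class names
        # ("com.example.SensorReading" is a placeholder for an actual application class)
        .set("spark.kryo.classesToRegister", "com.example.SensorReading"))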

Spark was developed with a conscious effort to provide flexible and robust memory management. Spark takes spark.executor.memory (the Java heap) as the full size to be instantiated by each executor. This size is divided into three main parts: reserved memory, user memory, and Spark memory.

First, reserved memory is hardcoded to 300MB. This space is reserved by the system and is not used by Spark applications. Second, user memory is set by default to 25% of the difference between the Java heap and the reserved memory. User memory is a pool for storing user-defined data structures and internal metadata, for example the data structures used in a hand-written aggregation function. Finally, Spark memory covers the remainder of the memory. This means that 75% of the difference between the Java heap and the reserved memory is Spark memory.

Spark memory is divided into two sections: storage and execution memory. Storage memory is used for serialized data, cached data, and broadcast variables. Execution memory is used for the computation of shuffles, joins, sorts, and aggregations. The property spark.memory.fraction defines how much of the usable heap becomes Spark memory, while spark.memory.storageFraction defines the share of Spark memory initially devoted to storage, set to 50% by default, with execution memory occupying the other 50%. We can change this split by modifying the spark.memory.storageFraction property. Figure 5.2 [19] shows how the executor memory is distributed, and a worked example follows the figure.

Figure 5.2. Memory management in Spark.
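As a worked example of this split, the sketch below assumes the default fractions described above and a hypothetical 4 GB executor heap, as in Table 5.1.

# How a 4 GB executor heap is divided under the defaults described above
heap_mb = 4 * 1024                    # spark.executor.memory = 4g
reserved_mb = 300                     # hardcoded reserved memory
usable_mb = heap_mb - reserved_mb     # 3796 MB left for user + Spark memory

user_mb = 0.25 * usable_mb            # ~949 MB of user memory
spark_mb = 0.75 * usable_mb           # ~2847 MB of Spark memory

storage_mb = 0.5 * spark_mb           # ~1423 MB initially reserved for storage
execution_mb = spark_mb - storage_mb  # ~1423 MB for execution

print(user_mb, spark_mb, storage_mb, execution_mb)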

2.6. Spark ML

Due to the advances in data collection and storage, huge online archives of data are currently available in many organizations, and these archives may contain as yet unknown useful information (i.e., interpretable information). A direct consequence of this is that the field of machine learning has become very popular in the Digital Age, since its aim is to automatically discover useful information in (large) databases. As previously described in Section 2.2.1.1, Spark also provides a machine learning component called MLlib, implemented as the spark.mllib package, which provides various utilities and functionalities.

As of Spark 2.0, the RDD-based APIs in the spark.mllib package have entered maintenance mode. The primary machine learning API for Spark is now the Dataframe-based API in the spark.ml package [3]. The main reasons for switching from RDDs to Dataframes are that (i) Dataframes provide a more user-friendly API than RDDs and (ii) Dataframes facilitate practical machine learning pipelines, particularly when considering feature engineering.

2.6.1. Example

The proposed workflow, described in Figure 6.1, aims to find a model that classifies a product based on the review text written by the customer. The model produced by the workflow takes the review of the product as input and outputs the binary classification 1 (good) or 0 (bad) for the product. In the following, we describe each workflow activity, in the order in which they are performed, together with their input and output data; a code sketch of the workflow is given after Figure 6.1.

Pre-processing activity: this activity takes as input the data frame df, which is a tabular structure with the product names (name attribute), the respective customer comments (review attribute), and the product rating varying from 1 to 5 (rating attribute). This activity performs a pre-processing step on df. The pre-processing initially excludes the df records having a rating equal to 3, because such ratings are considered neutral in the range from 1 to 5. Next, the label attribute is added to df, stating whether the product is good (value 1) or bad (value 0) according to the rating value: for ratings above 3 the label receives value 1, and for ratings below 3 the label receives value 0.

Split activity: in this activity, two other data frames are created from df: df_test and df_train, containing respectively 20% and 80% of the df records. The data frame df_train is used in the following activities of the present workflow, while df_test is reserved for a separate workflow, not described here, responsible for testing the classification models.

Tokenizer activity: using df_train as input, the Tokenizer is responsible for producing a new data frame called df_words_all. The review attribute (text type) is transformed into the words_all attribute (string array type), containing the review words as array elements.

Removing neutral words activity: responsible for generating the data frame df_filtered from df_words_all. For this, the neutral words (e.g., conjunctions and pronouns) are removed from the words_all arrays, creating the new attribute called words_filtered.

Hashing activity: responsible for creating the data frame df_features, using the words_filtered attribute to create the features attribute. The features attribute is a dictionary that maps the hash code of each word to its number of occurrences in the words_filtered array.

Training activity: in this activity, the logistic regression technique is applied to carry out the training and generate the classification model. The training input is the data frame df_features, and only the label and features attributes are actually required for model generation.

Figure 6.1. Example of a Spark ML workflow.
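The following sketch shows how the workflow of Figure 6.1 could be assembled with the spark.ml Pipeline API in Python. It is only an illustration: the input file name reviews.json, the session name and the split seed are assumptions, and parameters such as the number of hash features or the regression hyperparameters would need tuning in practice.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ReviewClassification").getOrCreate()

# Input data frame df with the name, review and rating attributes (hypothetical file)
df = spark.read.json("reviews.json")

# Pre-processing: drop neutral ratings and derive the binary label
df = (df.filter(col("rating") != 3)
        .withColumn("label", when(col("rating") > 3, 1.0).otherwise(0.0)))

# Split: 80% for training, 20% reserved for the separate test workflow
df_train, df_test = df.randomSplit([0.8, 0.2], seed=42)

# Tokenization, neutral-word removal, hashing and logistic regression as a pipeline
tokenizer = Tokenizer(inputCol="review", outputCol="words_all")
remover = StopWordsRemover(inputCol="words_all", outputCol="words_filtered")
hashing = HashingTF(inputCol="words_filtered", outputCol="features")
lr = LogisticRegression(labelCol="label", featuresCol="features")

pipeline = Pipeline(stages=[tokenizer, remover, hashing, lr])
model = pipeline.fit(df_train)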

2.7. References

[1] AMPLab, "Algorithms, Machines and People Lab (AMPLab)," [Online]. Available: https://amplab.cs.berkeley.edu/.

[2] Apache, "Hadoop Documentation," [Online]. Available: https://hadoop.apache.org/docs/current/.

[3] Apache, "Spark Documentation," [Online]. Available: spark.apache.org.

[4] Databricks, "Introduction to Big Data with Apache Spark," [Online].

[5] D. Conway, "The Data Science Venn Diagram - Drew Conway," [Online]. Available: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram.

[6] A. Woodie, "Apache Spark Adoption by the Numbers," [Online]. Available:

https://www.datanami.com/2016/06/08/apache-spark-adoption-numbers/.

[7] H. Karau, A. Konwinski, P. Wendell and M. Zaharia, Learning Spark: Lightning-Fast Big Data Analytics, 2015.

[8] A. Spark, "Hardware Provisioning - Spark," [Online]. Available: https://spark.apache.org/docs/0.9.1/hardware-provisioning.html.

[9] Databricks, "Spark About," [Online]. Available: https://databricks.com/spark/about.

[10] A. Spark, "Spark SQL Documentation," [Online]. Available:

http://spark.apache.org/docs/latest/sql-programming-guide.html.

[11] R. S. Xin, D. Crankshaw, A. Dave, J. E. Gonzalez, M. J. Franklin and I. Stoica, "GraphX: Unifying data-parallel and graph-parallel analytics," 2014.

[12] A. Abu-Doleh and Ü. V. Çatalyürek, "Spaler: Spark and GraphX based de novo genome assembler," 2015.

[13] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin and I. Stoica, "GraphX: Graph Processing in a Distributed Dataflow Framework," in 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14),

2014.

[14] GraphX, "GraphX Programming Guide," [Online]. Available:

http://spark.apache.org/docs/latest/graphx-programming-guide.html.

[15] M. Zaharia et al., "Discretized streams: Fault-tolerant streaming computation at scale," in Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, 2013.

[16] Hortonworks, "Introduction to Spark Streaming," [Online]. Available: http://hortonworks.com/hadoop-tutorial/introduction-spark-streaming/.

[17] A. Spark, "Apache Spark Streaming Programming Guide," [Online]. Available: http://spark.apache.org/docs/latest/streaming-programming-guide.html.

[18] J. Laskowski, "Spark Application’s web UI," [Online]. Available:

https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-webui.html.

[19] A. Grishchenko, "Distributed Systems Architecture," [Online]. Available:

https://0x0fff.com/spark-memory-management/.

[20] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation, 2012.

[21] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, "Spark: Cluster computing with working sets," in HotCloud, 2010.

[22] X. Meng, J. Bradley, B. Yavuz, E. Sparks, S. Venkataraman, D. Liu et al., "MLlib: Machine learning in Apache Spark," in JMLR, 2016.

[23] M. Armbrust, R. S. Xin, C. Lian, Y. Huai, D. Liu, J. K. Bradley and M. Zaharia, "Spark SQL: Relational data processing in Spark," in Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015.