Upload: juergenurbanski
Post on 21-Nov-2015

  • Memo from Analytix.Marketing

    Spark Streaming versus Storm: Comparing Systems for Processing Fast and Large Streams of Data in Real Time

    One of the most popular topics at the recent Spark Summit in San Francisco was Spark Streaming, a system for processing fast and large streams of data in real time. This blog post highlights Spark Streaming's core capabilities and architectural design. We conclude by offering some advice for readers who need to select a streaming system, contrasting the capabilities of Spark Streaming with those of Storm.

    Core Capabilities of Spark Streaming

    Spark Streaming is an extension of the core Spark API that enables high-throughput, fault-tolerant processing of live data streams. It ingests data from many sources, including Kafka, Flume, Twitter, ZeroMQ and plain old TCP sockets. Spark Streaming then processes that data using complex algorithms expressed with high-level functions. Finally, the processed data can be stored in file systems (including HDFS), databases (including HBase) and live dashboards.

    The core innovation behind Spark Streaming is to treat a streaming computation as a series of deterministic batch computations over small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset (also called a "micro batch") for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce, join, window and group-by, to produce new datasets representing program outputs or intermediate state.

    Architecture

    The micro batch, called D-Stream in the original paper [1], provides an elegant solution to three challenges that arise in large-scale distributed computing environments: fault tolerance, consistency and a unified programming model across batch and real time.

    [1] http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf
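    The micro-batch idea can be illustrated with a small, self-contained sketch. This is plain Python rather than the Spark API, and the event list and interval length are invented for illustration: events are grouped into per-interval input datasets, and each dataset is then handed to a deterministic operation.

```python
# Conceptual sketch (not actual Spark code): streaming treated as a
# series of deterministic batch computations over small time intervals.

def micro_batches(events, interval):
    """Group (timestamp, value) events into per-interval input datasets."""
    batches = {}
    for ts, value in events:
        batches.setdefault(ts // interval, []).append(value)
    return [batch for _, batch in sorted(batches.items())]

def process(batch):
    """A deterministic parallel operation, here a simple map + reduce."""
    return sum(v * v for v in batch)

events = [(0, 1), (1, 2), (2, 3), (3, 4)]   # (timestamp, value) pairs
results = [process(b) for b in micro_batches(events, interval=2)]
```

    Because `process` is deterministic, re-running it on the same interval's input always yields the same output, which is what makes the recovery and consistency properties discussed below possible.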


    Spark provides a unified programming model across all its processing engines. This yields four benefits:

    o Faster learning curve: It allows users to write one analytic job, which then executes equally well on both batch and streaming data. This obviates the need to learn about the different interfaces and specific APIs of batch versus streaming systems.

    o Higher developer productivity: On a related note, machine learning libraries, statistical functions, and complex algorithms such as graph processing that are available in Spark can be put to use on streaming data as well, saving developers time.

    o Better decisions: Moreover, the unified programming model also makes it much easier to combine arriving real-time data with historical data in one analysis, for instance to make a decision on the basis of comparing new data with old data.

    o Ease of operations: Spark provides a unified run time across different processing engines. Therefore, one physical cluster and one set of operational processes can cover the full gamut of use cases.
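    The "write one job, run it on batch and stream" point can be sketched in a few lines. This is conceptual Python, not Spark's APIs, with made-up data: the same transform function runs unchanged on a historical batch and on each arriving micro-batch.

```python
# Conceptual sketch (plain Python, not the Spark API): one analytic job
# applied identically to batch data and to streaming micro-batches.

def transform(records):
    """One analytic job: count occurrences per key."""
    counts = {}
    for key in records:
        counts[key] = counts.get(key, 0) + 1
    return counts

historical = ["a", "b", "a"]              # batch data
stream = [["a", "c"], ["b", "a"]]         # stream as micro-batches

batch_result = transform(historical)
stream_results = [transform(micro) for micro in stream]
```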

    Consistency / statefulness. Spark holds state in immutable datasets kept in a cluster-wide in-memory cache. This enables exactly-once semantics and supports use cases where statefulness is important.

    Fault tolerance. Input batches are replicated in memory across worker nodes. If a worker node fails, the batches on the failed node are recomputed in parallel across several nodes to ensure fast recovery. Used in conjunction with ZooKeeper, fault tolerance also extends to the master node.
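    The recovery mechanism can be sketched as follows (conceptual Python, not Spark internals; the input batch and operation are invented): because each result is a deterministic function of its input, a result lost with a failed worker can simply be recomputed from a surviving replica of the input batch.

```python
# Conceptual sketch: recovery by deterministic recomputation from a
# replicated input batch, rather than by replicating computed results.

def deterministic_op(batch):
    """A deterministic operation; here, the maximum value of the batch."""
    return sorted(batch)[-1]

input_batch = [3, 1, 4, 1, 5]         # replicated in memory across workers
result = deterministic_op(input_batch)

# Simulate losing the computed result with a failed node ...
result = None
# ... and recovering by recomputing from the surviving input copy.
recovered = deterministic_op(input_batch)
```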

    It is this innovative design that has given rise to the broad interest in Spark Streaming we see today.

    Advice for Selecting a Streaming System

    A short blog post cannot do justice to the large variety of use cases that call for streaming capabilities. However, we can offer some guidance on where Spark Streaming would be a good fit:

    Ease of use is important, manifested in a quick learning curve for developers, data scientists, analysts and IT operations. Users who look at streaming from an application or business perspective find the higher abstraction level of Spark's declarative APIs particularly compelling. It allows them to work at the level of the actual business logic and data pipeline, specifying what has to happen; Spark then figures out how it happens, coordinating tasks such as data movement and recovery. Users are spared having to worry about which nodes execute which computations for a specific job.

    Real-time decisioning for the business is important. Spark combines statefulness and persistence with high throughput. Many organizations have evolved from exploratory, discovery-type big data use cases to use cases that require reasoning on data as it arrives, in order to make a near-real-time decision that is pushed to the front line of the organization, for instance in a sales, service or production context. Users need certainty on questions such as the exact number of frauds, emergencies or outages occurring today, and data loss is not acceptable. These business-critical use cases call for the "exactly once" semantics that Spark Streaming provides. Storm provides exactly-once processing only in conjunction with Trident, which achieves it via a transaction ID, limiting the throughput that can be achieved.
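    The transaction-ID technique can be sketched in a few lines (conceptual Python, not the Trident API; the class and batches are invented for illustration): the state stores the ID of the last batch it applied, so a batch replayed after a failure is detected and skipped rather than double-counted.

```python
# Conceptual sketch of exactly-once state updates via transaction IDs
# (plain Python, not the Trident API).

class TxState:
    def __init__(self):
        self.count = 0
        self.last_txid = None     # ID of the last batch applied

    def apply(self, txid, batch):
        if txid == self.last_txid:
            return                # replayed batch: skip, don't double-count
        self.count += len(batch)
        self.last_txid = txid

state = TxState()
state.apply(1, ["a", "b"])
state.apply(2, ["c"])
state.apply(2, ["c"])             # replay after a failure is ignored
```

    The check-before-update step is also why this approach costs throughput: every state write must be serialized against the stored transaction ID.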

    Your big data vendor of choice supports Spark Streaming. Currently, Hortonworks, Cloudera, Pivotal and MapR provide commercial support for Spark, and the vendor ecosystem around Spark is expanding quickly.

    Storm enjoys more awareness in the market as of this writing, which explains why some believe that Storm is more mature. Reference implementations exist, including at Twitter, Storm's original sponsor. Storm's low-level programming model can be an advantage for highly advanced users implementing highly specialized, unusual processing logic. Low latency is often cited as one of Storm's biggest benefits. While it is true that Storm achieves latencies as low as milliseconds to tens of milliseconds, the difference is immaterial to the vast majority of commercially relevant use cases (exceptions include algorithmic trading). In the telecommunications industry, for instance, data streams from network probes arrive with an intrinsic latency of 15 minutes. For most users, therefore, latency will not outweigh the other benefits of Spark Streaming.

    An apples-to-apples comparison with Spark Streaming would also have to consider Trident. Trident is an extension of Storm that provides higher-level declarative / functional APIs similar to Pig or Cascading: joins, aggregations, grouping, functions, filters, etc. This allows Trident to persist state-versioning information to an external database, which is then used to ensure exactly-once semantics. Relying on transaction IDs to update state has to be implemented by the user (it is not a matter of just pressing a button) and degrades performance and throughput.

    As technologies are evolving very quickly, readers might find our approach to comparing different streaming systems helpful. For each dimension, we list the criteria to evaluate:

    o Market traction: speed of innovation, partner ecosystem, enterprise adoption
    o Developer productivity: programming model & APIs, integration of batch & real time
    o Data integration: data ingestion, data persistence
    o Data processing: processing framework, state management, throughput & latency
    o Operations: native management, choice of resource managers, fault tolerance (FT), multi-tenancy