Big Data in the Real World – An Opallios White Paper
Analyzing network traffic with big data technologies
COPYRIGHT 2015 OPALLIOS www.opallios.com (888) 205-4058
The potential of big data is enormous, but
many companies find that the challenge is in
extracting real business value from that data. In
2013, an Infochimps survey of over 300 IT
employees found that fifty-five percent of big
data projects fail. Can massive amounts of data
really be tamed?
With the right expertise, the answer is a
resounding yes. Many projects fail because of a
lack of skill in handling large data files, but in
knowledgeable hands the same data can reveal
unexpected insights. Opallios has the
experience to know what big-data tools to use
and how to use them, making even the most
complex data useful. The rewards of utilizing data properly are significant: according to a 2013 McKinsey study, harnessing the power of big data and analytics increases productivity and profitability by 5 to 6%.
This white paper describes a recent
Opallios client project that involved the analysis
of network traffic using Big Data technologies.
A Big Data Conundrum
The Client provides an end-to-end cyber
security solution through an appliance that
captures raw packets of network traffic and
then categorizes the traffic by application using
patented parsing technology. The Client wanted
to expand the product line to provide additional
behavioral analytics on the captured data to
customers.
However, the Client had a big-data
dilemma. The appliance couldn’t be scaled
economically by adding additional hardware to
handle the required magnitude of data. A
solution was needed that:
• Could support up to 10MB/sec of input rate per instance with a given hardware specification,
• Provided high reliability with no data loss,
• Could store up to six months (around 20 TB) of data,
• Would support distributed search on near real-time data within seconds,
• Would include an analytical dashboard to show trends, patterns, and top talkers,
• Supplied an alerting mechanism to detect anomalies, and
• Provided extensibility for future releases.
A Big Data Solution
Opallios evaluated several solutions before settling on ELSA (enterprise-log-search-and-archive), a centralized syslog framework built on Syslog-NG, MySQL, and Sphinx full-text search. ELSA excels at high-volume receiving and indexing: a single node can sustain more than 30K events/sec, roughly 5MB/sec at 150 bytes per event. That fell short of the Client's sustained 10MB/sec requirement, but the framework was promising enough for Opallios to invest in it.
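The throughput figures quoted above are easy to sanity-check. A minimal sketch, assuming decimal megabytes (1 MB = 1e6 bytes) and the 150-byte event size stated in the text:

```python
def mb_per_sec(events_per_sec: int, event_bytes: int) -> float:
    """Convert an event rate into MB/sec (1 MB = 1e6 bytes)."""
    return events_per_sec * event_bytes / 1e6

# A single ELSA node sustaining 30K events/sec of 150-byte events:
elsa_rate = mb_per_sec(30_000, 150)   # 4.5 MB/sec -- i.e. "around 5MB/sec"

# Event rate needed to reach the Client's 10MB/sec requirement:
required_eps = 10e6 / 150             # ~66,667 events/sec

print(f"ELSA single node: {elsa_rate} MB/sec")
print(f"Needed for 10MB/sec: {required_eps:.0f} events/sec")
```

The gap between 4.5MB/sec and 10MB/sec is why a stock ELSA deployment alone could not meet the requirement.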
Opallios proposed a solution called the Aggregator, composed of separate components that each perform a specific task. Its pluggable architecture would make the system easy to pivot and highly extensible.
[Figure: The proposed Aggregator architecture]
Opallios found Syslog-NG unreliable at high volume: once its buffers filled, data could be dropped. Streaming over TCP dropped 15-20% of events at an input rate of 8MB/sec.
Speed and Reliability Achieved
As reliability was a key factor, Opallios
wrote a collector that could ingest at the rate of
10MB/sec without losing records. The custom
collector allowed the parsing of nested JSON
data and was flexible enough to support any
JSON structural changes without code
changes. The collector performed well and met the requirement of ingesting data at 10MB/sec with headroom to spare.
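The white paper does not show the collector's parsing code, but the schema-tolerant handling of nested JSON it describes can be sketched as a recursive flattener; the record shape and field names below are hypothetical:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON into dotted keys so that new or
    rearranged fields are picked up without any code changes."""
    out = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            out.update(flatten(value, prefix + key + "."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            out.update(flatten(value, prefix + str(i) + "."))
    else:
        out[prefix.rstrip(".")] = obj
    return out

record = json.loads('{"src": {"ip": "10.0.0.1", "port": 443}, "app": "tls"}')
print(flatten(record))
# {'src.ip': '10.0.0.1', 'src.port': 443, 'app': 'tls'}
```

Because the flattener never hard-codes field names, a structural change in the incoming JSON simply yields different dotted keys rather than a parse failure.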
The Client needed ELSA to index at double the speed its creator had benchmarked. ELSA uses Sphinx as its full-text search engine, so Opallios added a data abstraction layer to its custom middleware to search the Sphinx data. The resulting REST APIs let the Client's UI run faceted searches over multiple dimensions. Opallios also tuned the ELSA code itself. ELSA's built-in distributed search then allowed the middleware to run queries across multiple instances with very low latency.
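A faceted search of the kind described can be illustrated with a minimal in-memory sketch; the dimension and field names here are hypothetical, not taken from the actual API:

```python
from collections import Counter

def facet_counts(events, dimensions):
    """Count events per value along each requested dimension, the way a
    faceted search API summarizes a result set."""
    facets = {dim: Counter() for dim in dimensions}
    for event in events:
        for dim in dimensions:
            if dim in event:
                facets[dim][event[dim]] += 1
    return {dim: dict(counts) for dim, counts in facets.items()}

events = [
    {"app": "http", "src": "10.0.0.1"},
    {"app": "http", "src": "10.0.0.2"},
    {"app": "dns",  "src": "10.0.0.1"},
]
print(facet_counts(events, ["app", "src"]))
# {'app': {'http': 2, 'dns': 1}, 'src': {'10.0.0.1': 2, '10.0.0.2': 1}}
```

In production the counting happens inside Sphinx rather than in application code; the sketch only shows the shape of a faceted response.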
Massive Amounts of Data in a
Compact Space
With ELSA, old data is stored in compressed form in MySQL at roughly 10:1 compression, which lets a single ELSA instance store a large amount of data. Search queries on archived data are slow, but they are meant to run only in batch mode as scheduled reports. Near real-time queries run directly against the Sphinx index and return within a second.
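A rough retention calculation under the stated 10:1 compression. The 180-day window, the assumption of sustained 10MB/sec input, and the use of decimal terabytes are all assumptions for illustration, not figures from the project:

```python
SECONDS_PER_DAY = 86_400

ingest_bytes_per_sec = 10e6   # the Client's 10MB/sec input requirement
retention_days = 180          # "up to six months" (assumed as 180 days)
compression_ratio = 10        # ELSA's approximately 10:1 archive compression

raw_tb = ingest_bytes_per_sec * SECONDS_PER_DAY * retention_days / 1e12
compressed_tb = raw_tb / compression_ratio

print(f"raw: {raw_tb:.1f} TB, compressed: {compressed_tb:.1f} TB")
# raw: 155.5 TB, compressed: 15.6 TB
```

The compressed figure lands in the same ballpark as the roughly 20 TB cited in the requirements, which is what makes single-instance archival feasible at all.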
Opallios built a web-based user interface
from scratch for the Client. This UI, with its
intuitive easy-to-use layout, was based on
Angular JS and Google Visualizations. It
allowed users to customize their dashboards,
save search queries, and schedule reports. The search interface could slice and dice data over multiple dimensions, and search results could be used for descriptive and diagnostic analytics.
Readily Available Information
Another key feature required by the Client
was the availability of analytics that could
combine behavioral pattern matching with
anomaly detection to identify security threats.
Opallios chose Apache Storm as the best
solution. Storm is an open-source platform for real-time event-stream processing that provides continuous, low-latency processing
for very high frequency streaming data.
Apache Storm permitted the detection of
data anomalies based on pre-defined patterns
and rules. Each anomaly generated an alert
that was passed on to the middleware for
processing. Storm made it possible to process
over 100K events per second per node.
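The actual patterns and rules belong to the Client, but a threshold rule of the kind a Storm bolt might evaluate per event can be sketched as follows; the threshold, field names, and time bucketing are hypothetical:

```python
from collections import defaultdict

THRESHOLD_BYTES = 1_000_000   # hypothetical per-source, per-minute limit

def detect_anomalies(events, threshold=THRESHOLD_BYTES):
    """Aggregate bytes per (source, minute) and emit one alert per key
    the first time its running total crosses the threshold."""
    totals = defaultdict(int)
    alerted = set()
    alerts = []
    for event in events:
        key = (event["src"], event["ts"] // 60)
        totals[key] += event["bytes"]
        if totals[key] > threshold and key not in alerted:
            alerted.add(key)
            alerts.append({"src": event["src"], "minute": key[1]})
    return alerts

events = [
    {"src": "10.0.0.1", "ts": 5,  "bytes": 600_000},
    {"src": "10.0.0.1", "ts": 30, "bytes": 600_000},  # crosses the limit
    {"src": "10.0.0.2", "ts": 40, "bytes": 100_000},
]
print(detect_anomalies(events))
# [{'src': '10.0.0.1', 'minute': 0}]
```

In Storm this logic would run inside a bolt, with each alert emitted downstream to the middleware rather than collected in a list.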
Opallios also wrote a Spring-managed
Java middleware application. This included
cluster management, a configuration manager,
a data access layer, a communication manager
using JMS and SFTP, user management,
RESTful APIs, an alert manager, and a system
health manager.
The communication manager provided data
encryption and security. The health manager
provided basic APIs for reporting system stats
and troubleshooting. The Aggregator solution is
currently in its first phase and there are plans to
add a rich set of new features, including
predictive analytics.
An Economical & Usable Package
All technology was thoroughly tested. A log
generator was written that allowed the
simulation of data generation at rates above the
production requirements. The log generator
was highly configurable, permitting control of
the data generation rates and the type of data
generated. System resource usage and query performance were also measured.
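A configurable log generator of the kind described might look like the following sketch; the log fields, rates, and application names are hypothetical:

```python
import itertools
import json
import random
import time

def generate_logs(rate_per_sec, duration_sec, apps=("http", "dns", "tls")):
    """Yield synthetic JSON log lines at roughly rate_per_sec for
    duration_sec seconds; both the rate and the data mix are configurable."""
    interval = 1.0 / rate_per_sec
    deadline = time.monotonic() + duration_sec
    for seq in itertools.count():
        if time.monotonic() >= deadline:
            break
        yield json.dumps({
            "seq": seq,
            "app": random.choice(apps),
            "bytes": random.randint(40, 1500),
        })
        time.sleep(interval)

for line in generate_logs(rate_per_sec=100, duration_sec=0.05):
    print(line)
```

Driving the collector with rates above the 10MB/sec production requirement is what makes the "with headroom to spare" claim testable.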
The Aggregator ran on an Ubuntu 14.04 platform with 12 cores, 16GB of RAM, and 10TB of RAID 0 15K-rpm SCSI drives. This hardware
was well within the Client’s budget.
Opallios solutions are delivered properly
packaged, allowing easy installs and upgrades.
The out-of-the-box experience is intuitive
enough for users to maintain the system
themselves.
About Opallios
Opallios is an innovative technology
consulting firm with a unique combination of big
data, Salesforce, strategy, and product
development expertise.
We work with clients in many vertical
industries, including technology, healthcare,
financial services, and hospitality. We help our
clients leverage the cloud and Salesforce to
realize business value from the data revolution.