
Big Data in the Real World – An Opallios White Paper

The potential of big data is enormous, but many companies find that the challenge is in extracting real business value from that data. In 2013, an Infochimps survey of over 300 IT employees found that fifty-five percent of big data projects fail. Can massive amounts of data really be tamed?

With the right expertise, the answer is a resounding yes. Many projects fail because of a lack of skill in handling large data files, but in knowledgeable hands the same data can reveal unexpected insights. Opallios has the experience to know which big-data tools to use and how to use them, making even the most complex data useful. The rewards of utilizing data properly are significant: according to a 2013 McKinsey study, harnessing the power of big data and analytics increases productivity and profitability by 5 to 6%.

This white paper describes a recent Opallios client project that involved the analysis of network traffic using big data technologies.

A Big Data Conundrum

The Client provides an end-to-end cyber security solution through an appliance that captures raw packets of network traffic and then categorizes the traffic by application using patented parsing technology. The Client wanted to expand the product line to provide additional behavioral analytics on the captured data to customers.

However, the Client had a big-data dilemma. The appliance couldn't be scaled economically by adding additional hardware to handle the required magnitude of data. A solution was needed that:

- Could support up to 10MB/sec of input rate per instance with a given hardware specification,
- Provided high reliability with no data loss,
- Could store up to six months (around 20 TB) of data,
- Would support distributed search on near real-time data within seconds,
- Would include an analytical dashboard to show trends, patterns, and top talkers,
- Supplied an alerting mechanism to detect anomalies, and
- Provided extensibility for future releases.

A Big Data Solution

Opallios evaluated several solutions before settling on ELSA (Enterprise Log Search and Archive). ELSA is a centralized syslog framework built on Syslog-NG, MySQL, and Sphinx full-text search. It is best suited to high-volume receiving and indexing of data, where a single node can receive more than 30K events/sec sustained, or around 5MB/sec for 150-byte events. That fell short of the Client's requirement of a sustained 10MB/sec, but Opallios liked ELSA enough to invest time in it.
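As a quick sanity check on the figures quoted above (this arithmetic is our illustration, not part of the original benchmark):

```python
# ELSA's quoted sustained rate: ~30K events/sec at ~150 bytes/event.
events_per_sec = 30_000
bytes_per_event = 150

throughput_mb = events_per_sec * bytes_per_event / 1_000_000  # decimal MB
print(f"{throughput_mb:.1f} MB/sec")  # 4.5 MB/sec, i.e. roughly 5MB/sec

# Against the Client's sustained 10MB/sec requirement, a single stock
# ELSA node falls short by roughly a factor of two.
required_mb = 10
print(f"shortfall factor: {required_mb / throughput_mb:.1f}x")  # 2.2x
```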

Opallios proposed a solution called an Aggregator, which comprised different components, each doing a specific task. Its pluggable architecture would make it easy to pivot and provide high extensibility.

[Figure: The proposed Aggregator architecture]

Opallios found Syslog-NG not quite reliable at high volume, with the danger of data being dropped once its buffers filled up. Streaming over TCP caused drops of 15-20% at an input rate of 8MB/sec.

Speed and Reliability Achieved

As reliability was a key factor, Opallios wrote a collector that could ingest at a rate of 10MB/sec without losing records. The custom collector allowed the parsing of nested JSON data and was flexible enough to support any JSON structural changes without code changes. The collector performed well, meeting the requirement of ingesting data at 10MB/sec with headroom to spare.
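The Client's collector itself is proprietary Java, but the schema-agnostic parsing described above can be sketched with a generic recursive flattener: because the traversal never hard-codes field names, new or rearranged JSON fields are picked up without code changes. This Python sketch (including the sample record) is purely illustrative:

```python
import json

def flatten(obj, prefix=""):
    """Recursively flatten nested JSON into dotted key/value pairs.

    The traversal is generic, so structural changes in the incoming
    JSON are absorbed automatically, with no parser changes.
    """
    flat = {}
    if isinstance(obj, dict):
        for key, value in obj.items():
            flat.update(flatten(value, f"{prefix}{key}."))
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            flat.update(flatten(value, f"{prefix}{i}."))
    else:
        flat[prefix.rstrip(".")] = obj
    return flat

# Hypothetical traffic record, for illustration only.
record = json.loads(
    '{"src": {"ip": "10.0.0.1", "port": 443},'
    ' "app": "https", "bytes": [120, 80]}'
)
print(flatten(record))
# {'src.ip': '10.0.0.1', 'src.port': 443, 'app': 'https',
#  'bytes.0': 120, 'bytes.1': 80}
```

Flattened key/value pairs like these are also a convenient shape to hand to a full-text indexer downstream.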

The Client needed ELSA to perform indexing at double the throughput its creator had benchmarked. ELSA uses Sphinx as its full-text search engine. Opallios added a data abstraction layer in its custom middleware solution to search the Sphinx data. These custom REST APIs allowed the solution's own UI to run faceted searches over multiple dimensions. Opallios also tweaked the ELSA code itself. ELSA's built-in distributed search capability allowed the middleware to run search queries on multiple instances with very low latency.
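The paper doesn't show the middleware's REST layer or its Sphinx queries, but conceptually a faceted search reduces to counting matching records per value along each requested dimension. A minimal sketch of that idea, with hypothetical field names:

```python
from collections import Counter

def facet(records, dimensions):
    """Count records per value for each requested dimension:
    the core of a faceted-search response."""
    return {dim: Counter(r[dim] for r in records if dim in r)
            for dim in dimensions}

# Hypothetical categorized-traffic events, for illustration only.
events = [
    {"app": "https", "src_ip": "10.0.0.1"},
    {"app": "https", "src_ip": "10.0.0.2"},
    {"app": "dns",   "src_ip": "10.0.0.1"},
]
print(facet(events, ["app", "src_ip"]))
# {'app': Counter({'https': 2, 'dns': 1}),
#  'src_ip': Counter({'10.0.0.1': 2, '10.0.0.2': 1})}
```

In production the counting would of course be pushed down to the search engine rather than done in application code; the sketch only shows the shape of the result a faceted REST endpoint returns.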

Massive Amounts of Data in a Compact Space

With ELSA, old data is stored in compressed form in MySQL, which allows a single instance of ELSA to store a large amount of data; the compression ratio is approximately 10:1. Search queries are slow on archived data, but those are meant to be run only in batch mode as scheduled reports. Near real-time queries run directly on the Sphinx index and return within a second.

Opallios built a web-based user interface from scratch for the Client. This UI, with its intuitive, easy-to-use layout, was based on AngularJS and Google Visualizations. It allowed users to customize their dashboards, save search queries, and schedule reports. The search interface could slice and dice data over multiple dimensions, and search results could be used for descriptive and diagnostic analytics.

Readily Available Information

Another key feature required by the Client was analytics that could combine behavioral pattern matching with anomaly detection to identify security threats. Opallios chose Apache Storm as the best solution. Storm is an open-source real-time event-stream processing platform that provides continuous, low-latency processing for very high-frequency streaming data.

Apache Storm permitted the detection of data anomalies based on pre-defined patterns and rules. Each anomaly generated an alert that was passed on to the middleware for processing. Storm made it possible to process over 100K events per second per node.
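The actual patterns and rules were specific to the Client and aren't disclosed in the paper. As one simple, hypothetical example of the kind of per-event rule a Storm bolt might apply, here is a sliding-window rate check sketched in Python (the real implementation would be a Java bolt):

```python
from collections import deque

class RateAnomalyDetector:
    """Flag an anomaly when the number of events seen in a sliding
    time window exceeds a pre-defined threshold."""

    def __init__(self, window_sec=10, max_events=1000):
        self.window_sec = window_sec
        self.max_events = max_events
        self.timestamps = deque()

    def observe(self, ts):
        """Record one event at time ts; return True if it trips an alert."""
        self.timestamps.append(ts)
        # Drop events that have aged out of the window.
        while self.timestamps[0] <= ts - self.window_sec:
            self.timestamps.popleft()
        return len(self.timestamps) > self.max_events

detector = RateAnomalyDetector(window_sec=10, max_events=3)
alerts = [detector.observe(t) for t in [0, 1, 2, 3, 4]]
print(alerts)  # [False, False, False, True, True]
```

In the Aggregator, each True result would become an alert tuple handed off to the middleware's alert manager for processing.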

Opallios also wrote a Spring-managed Java middleware application. This included cluster management, a configuration manager, a data access layer, a communication manager using JMS and SFTP, user management, RESTful APIs, an alert manager, and a system health manager.

The communication manager provided data encryption and security. The health manager provided basic APIs for reporting system stats and troubleshooting. The Aggregator solution is currently in its first phase, and there are plans to add a rich set of new features, including predictive analytics.

An Economical & Usable Package

All technology was thoroughly tested. A log generator was written that allowed the simulation of data generation at rates above the production requirements. The log generator was highly configurable, permitting control of both the data generation rate and the type of data generated. System resource usage and query performance were also measured.
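The Client's log generator isn't published; as an illustration only, the core idea (a configurable rate and record shape) might be sketched like this, with hypothetical field names and with pacing/throttling omitted for brevity:

```python
import json
import random
import time

def generate_logs(rate_per_sec, duration_sec, apps=("https", "dns", "smtp")):
    """Yield rate_per_sec * duration_sec synthetic traffic records.

    Both the volume (via the two rate parameters) and the shape of the
    data (via the apps tuple and field ranges below) are configurable,
    mirroring the knobs a load generator needs for ingest testing.
    """
    for _ in range(int(rate_per_sec * duration_sec)):
        yield json.dumps({
            "ts": time.time(),
            "app": random.choice(apps),
            "src_ip": f"10.0.{random.randint(0, 255)}.{random.randint(1, 254)}",
            "bytes": random.randint(60, 1500),
        })

batch = list(generate_logs(rate_per_sec=1000, duration_sec=1))
print(len(batch))  # 1000
```

A real generator would additionally pace its output (e.g. sleep between batches) so the configured rate is honored in wall-clock time rather than produced as one burst.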

The Aggregator ran on an Ubuntu 14.04 platform with 12 cores, 16GB of RAM, and 10TB of RAID 0 15K-rpm SCSI drives. This hardware was well within the Client's budget.

Opallios solutions are delivered properly packaged, allowing easy installs and upgrades. The out-of-the-box experience is intuitive enough for users to maintain the system themselves.

About Opallios

Opallios is an innovative technology consulting firm with a unique combination of big data, Salesforce, strategy, and product development expertise.

We work with clients in many vertical industries, including technology, healthcare, financial services, and hospitality. We help our clients leverage the cloud and Salesforce to realize business value from the data revolution.

COPYRIGHT 2015 OPALLIOS www.opallios.com (888) 205-4058