Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Eric Roch, Principal & Ben Hahn, Senior Technical Architect

Upload: perficient-inc

Post on 06-May-2015


Category:

Technology



DESCRIPTION

Most organizations still rely on batch and offline processing of data streams to gain meaningful analysis and insight into their business. However, in our instant-gratification world, real-time computation and analysis of streaming data is crucial to gaining insight into patterns and threats. A trend is emerging toward real-time, instant analysis of live data streams, promoting the value of logs and a move toward functional programming. This shift in technology is not about what data to store or how to store it, but about what we can do with it to see emerging patterns and trends across multiple resources, applications, services and environments. Log data represents a wealth of information, yet it is often sporadic, unstructured, scattered across the enterprise and difficult to track.

These slides provide insights into some of the most helpful Big Data tools used by the largest social media and data-centric organizations for competitive trends, instant analysis and feedback from large-volume data streams. We show how the Big Data tools Storm and Elasticsearch, together with an elastic UI, can turn application logs into real-time analytical views. You will also learn how Big Data:

• Contains data that is elastic, minimally structured, flexible and scalable

• Helps process live streams into meaningful data

• Promotes a move toward functional programming

• Affects the enterprise data architecture

• Works with real-time CEP tools like Storm for functional programming

TRANSCRIPT

Page 1: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Eric Roch, Principal & Ben Hahn, Senior Technical Architect

Page 2: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Perficient is a leading information technology consulting firm serving clients throughout North America.

We help clients implement business-driven technology solutions that integrate business processes, improve worker productivity, increase customer loyalty and create a more agile enterprise to better respond to new business opportunities.

About Perficient

Page 3: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

• Founded in 1997

• Public, NASDAQ: PRFT

• 2013 revenue $373 million

• Major market locations throughout North America: Atlanta, Boston, Charlotte, Chicago, Cincinnati, Columbus, Dallas, Denver, Detroit, Fairfax, Houston, Indianapolis, Los Angeles, Minneapolis, New Orleans, New York City, Northern California, Philadelphia, Southern California, St. Louis, Toronto and Washington, D.C.

• Global delivery centers in China, Europe and India

• >2,100 colleagues

• Dedicated solution practices

• ~90% repeat business rate

• Alliance partnerships with major technology vendors

• Multiple vendor/industry technology and growth awards

Perficient Profile

Page 4: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

BUSINESS SOLUTIONS

• Business Intelligence

• Business Process Management

• Customer Experience and CRM

• Enterprise Performance Management

• Enterprise Resource Planning

• Experience Design (XD)

• Management Consulting

TECHNOLOGY SOLUTIONS

• Business Integration/SOA

• Cloud Services

• Commerce

• Content Management

• Custom Application Development

• Education

• Information Management

• Mobile Platforms

• Platform Integration

• Portal & Social

Our Solutions Expertise

Page 5: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Eric Roch, Principal

Eric leads Perficient's national connected solutions practice

• Includes focus on SOA/integration, cloud, mobile and Big Data

• Author & industry speaker

• 25+ years of experience in various aspects of information technology, including:

  • Executive-level management

  • Enterprise architecture

  • Application development

Speakers

Ben Hahn, Sr. Technical Architect

Ben Hahn is a Sr. Technical Architect

• Includes focus on transactions, logging & exceptions processing

• Author & speaker

• 20+ years of experience in various aspects of information technology, including:

  • Software solutions

  • Enterprise infrastructure

  • Product management

  • Open Source software community contributor

Page 6: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

• Often defined as data that exceeds the capacities of conventional database systems because it is too large and moves too fast for them to handle in an architecturally cohesive way. The three V’s of Big Data are:

• Volume

  • Most companies have 100 TB of data

  • Facebook ingests 500 TB in a single day

  • 40 zettabytes (that’s 43 trillion GB) of data by 2020

• Velocity

  • NYSE captures 4-5 TB of data in a single day

  • A Boeing 737 generates 243 TB in a single flight

  • The Google self-driving car generates 750 MB of data per second!

• Variety

  • Twitter, Clickstreams, Audio, Video

  • GPS, Sensor data, Facebook content

  • Infrastructure and application logs

What is Big Data?

Page 7: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

POLL QUESTION: What is your current adoption level for big data?

• Evaluation

• Prototype

• Production

Page 8: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

But Not Everyone is Google!

Where’s the Big Data coming from?

Page 9: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

POLL QUESTION: Have you used open source software for big data solutions?

• Yes

• No

Page 10: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Machine Data definitely has the three V’s of Big Data

Machine Data is Big Data

Page 11: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

What Can We Gain From Machine Data?

Valuable information can be mined from machine data, including:

• Transaction monitoring• Error detection• Behavior trends• Audit logging• Infrastructure states• Anomaly detection• Geospatial analysis• Network analysis

Page 12: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Log Analysis vs. Business Analytics

• Ingest - Versus ETL

• Big Data - Bidirectional integration with Hadoop

• Query language - MapReduce function on unstructured data

• Drill anywhere - Investigate all the data versus a predefined schema or cube

• Information discovery - Discover relationships based on patterns in the data

• Ad-hoc versus dimensional - Log analysis is not based on a predefined structure derived from a point-in-time set of requirements

• Explicit logging - Versus implicit correlation

Page 13: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

POLL QUESTION: Do you mine machine data for business insights?

• Yes

• No

Page 14: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Innovations From Cloud and OSS

• Hadoop and MapReduce - Derived from Google's MapReduce and Google File System

• Storm – Distributed event processor open sourced by Twitter

• Presto - Facebook has released as open source a SQL query engine built to work with petabyte-sized data warehouses

• Google BigQuery - Run SQL-like queries against terabytes of data in seconds

• Amazon DynamoDB - NoSQL database service to store and retrieve any amount of data, and serve any level of request traffic

• Elasticsearch – Distributed full-text search engine from the OSS community

Page 15: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

POLL QUESTION: Do you plan to use cloud-based solutions for big data?

• Yes

• No

Page 16: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

• 2004 - Google published a paper called "MapReduce: Simplified Data Processing on Large Clusters", characterized by:

  • Map and shuffle key-value data pairs, then aggregate/reduce these intermediate data pairs

  • Origins in the map and reduce primitives of functional languages

  • Massive parallelism and elasticity via commodity hardware

  • Fault tolerance via master-worker nodes

Big Data Processing: MapReduce
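
The map, shuffle and reduce phases can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop's or Google's implementation; the sample log lines and the names `map_phase`, `shuffle`, `reduce_phase` and `count_levels` are assumptions made for the example.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical sample input: one log line per record.
LOG_LINES = [
    "ERROR payment timeout",
    "INFO login ok",
    "ERROR db unreachable",
    "WARN slow query",
    "INFO logout ok",
]

def map_phase(line):
    """Map: emit one (key, value) pair per record."""
    level = line.split()[0]
    return (level, 1)

def shuffle(pairs):
    """Shuffle: group all values that share a key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reduce: aggregate the grouped values for one key."""
    return key, reduce(lambda a, b: a + b, values, 0)

def count_levels(lines):
    pairs = map(map_phase, lines)
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())

result = count_levels(LOG_LINES)
```

In a real cluster the map calls run on many machines and the shuffle moves data over the network; the structure of the computation stays the same.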


Page 17: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

• Based on lambda (λ) calculus

• All computable functions and data can be expressed as a series of functions and predicates of functions

• Declarative rather than imperative

• First-class functions – Functions can be passed as arguments and returned as results, just like values. This also allows currying and partial application.

• Call by name – Function argument expressions are not evaluated until they are actually used.

• Recursion – Functions are defined in terms of themselves; without a base case, this can loop endlessly.

• Immutable state and values – Pure functional programming does not consider variables but rather immutable values as they appear at any moment in time. This has big effects on scalability and concurrency.

• Referential transparency – An expression can be replaced by its value without changing the program's behavior, because functions have no side effects.

• Pattern matching – Data type matching as well as data structure decomposition and deep object type matching

• Examples: Erlang, Haskell, Lisp, Clojure, Scala

What are functional languages?

And MapReduce is Better with Functional Languages
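
Several of these concepts can be illustrated briefly. The sketch below uses Python rather than one of the functional languages listed, so some features are approximations: a generator stands in for call-by-name evaluation, and a frozen dataclass stands in for immutable values. All names (`compose`, `scale`, `Event`, and so on) are hypothetical.

```python
from dataclasses import dataclass, replace
from functools import partial
from itertools import count, islice

# First-class functions: functions are passed as arguments and returned as results.
def compose(f, g):
    return lambda x: f(g(x))

def scale(factor, value):
    return factor * value

# Partial application: fix the first argument to get a one-argument function.
double = partial(scale, 2)
increment = lambda x: x + 1
double_then_increment = compose(increment, double)

# Deferred evaluation: squares are computed only when they are consumed,
# so the infinite sequence is safe to define.
squares = (n * n for n in count(1))
first_three_squares = list(islice(squares, 3))

# Recursion: the function is defined in terms of itself, with a base case.
def total(values):
    if not values:
        return 0
    return values[0] + total(values[1:])

# Immutable values: a frozen dataclass cannot be modified after creation;
# "changing" it produces a new value and leaves the original intact.
@dataclass(frozen=True)
class Event:
    source: str
    level: str

original = Event("web01", "INFO")
escalated = replace(original, level="ERROR")
```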


Page 18: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Imperative Model: Pascal, C, BASIC, etc.

Evolution (or Devolution?) of Databases


Page 19: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Object Oriented Programming Model: Java, C++, C#

Evolution (or Devolution?) of Databases


Page 20: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Functional Programming Model: Scala, Clojure, F#

Evolution (or Devolution?) of Databases


• Because commodity hardware in the cloud is highly elastic, the resources needed to run queries and transactions can be scaled in response to data volumes at the store level.

• Data is stored using the functional programming concept of immutability: data is only appended, as point-in-time values.

• MapReduce functions can be rebalanced and redistributed across machines as nodes fail or new nodes are added.

• First-class functions and call by name allow functions and lambda expressions to be passed into MapReduce calls as arguments, so ad-hoc functionality can be added.

• Pattern matching allows very complex matches on complex structures like XML.

• Transactions use functional expressions like compare-and-swap operations to ensure ACID properties.

• SQL or query expressions can be reduced to MapReduce functions, lambda expressions and/or patterns, and distributed in parallel across the nodes.

• Using recursion, complex structures like XML can be mapped and reduced from a single expression.
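
As a rough illustration of how a query expression reduces to MapReduce steps with lambdas passed as arguments, the sketch below evaluates something like `SELECT host, COUNT(*) ... WHERE level = 'ERROR' GROUP BY host` over two partitions. The partitions, field names and the `run_query` helper are assumptions for the example; no particular database works exactly this way.

```python
from collections import Counter
from functools import reduce

# Hypothetical records, split into two partitions as if stored on two nodes.
PARTITIONS = [
    [{"host": "web01", "level": "ERROR"}, {"host": "web02", "level": "INFO"}],
    [{"host": "web01", "level": "ERROR"}, {"host": "web02", "level": "ERROR"}],
]

def run_query(partitions, where, key):
    """Evaluate a filter-and-count query partition by partition.

    Roughly: SELECT key, COUNT(*) FROM records WHERE where GROUP BY key
    """
    # Map step: each partition is filtered and counted independently,
    # so this work could run in parallel on separate nodes.
    partials = [
        Counter(key(rec) for rec in part if where(rec))
        for part in partitions
    ]
    # Reduce step: merge the per-partition counts into one result.
    return dict(reduce(lambda a, b: a + b, partials, Counter()))

# The WHERE clause and GROUP BY key are ordinary lambdas passed as arguments.
errors_by_host = run_query(
    PARTITIONS,
    where=lambda rec: rec["level"] == "ERROR",
    key=lambda rec: rec["host"],
)
```

Because the records are never modified, each partition can be processed without locks, and a failed partition can simply be recomputed.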

Page 21: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

MapReduce Machine Data: What Do We Need?

• A dynamic process for parsing and mapping unstructured data to structured data in real-time

• Wide range of data formats (text, XML, JSON, CSV, EDI, etc.)

• Need intelligent pattern matching capabilities

• Ability to correlate meaningful transactional data and metrics from disparate data (reducing)

• Machine data is static and immutable. Append-only fast writes with eventual consistency are ideal

• Need fast filter, search, query capabilities to display results
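
The first requirements, parsing unstructured lines into structured records with pattern matching, might look like the following minimal sketch. The log layout (`<timestamp> <LEVEL> key=value ...`), the sample line and the `parse_line` helper are all assumptions; real machine logs vary widely and usually need one pattern per format.

```python
import json
import re

# Assumed line layout: "<ISO timestamp> <LEVEL> <key=value ...>"
LINE_PATTERN = re.compile(
    r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<rest>.*)$"
)
PAIR_PATTERN = re.compile(r"(\w+)=(\S+)")

def parse_line(line):
    """Turn one raw log line into a dict, or None if it does not match."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    record = {
        "timestamp": match.group("timestamp"),
        "level": match.group("level"),
    }
    # Pull any key=value pairs out of the free-text remainder.
    record.update(dict(PAIR_PATTERN.findall(match.group("rest"))))
    return record

raw = "2014-04-01T10:15:00Z ERROR txn=42 service=billing duration_ms=950"
parsed = parse_line(raw)

# The structured record serializes directly to JSON for a document store.
as_json = json.dumps(parsed, sort_keys=True)
```

Fields such as `txn` then give a key on which records from different sources can be correlated in a reduce step.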

Page 22: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Open Source Big Data Landscape

Source: www.bigdata-startup.com

Page 23: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Apache Hadoop: The Elephant in the Room

• What about Apache Hadoop?

• Apache Hadoop comprises HDFS and Hadoop MapReduce, both based on Google’s GFS and MapReduce

• Batch-oriented MapReduce jobs run through Schedulers and JobTrackers

• We require real-time MapReduce processes

• We need to index, query and search data in real time through a well-defined interface

• We can use it for secondary storage of long-term persistent logs – Lambda Architecture (Batch vs. Speed Layer)
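
The Lambda Architecture mentioned in the last bullet keeps a complete but slow batch view alongside a small real-time view and merges the two at query time. The sketch below assumes Hadoop would play the batch role and a stream processor the speed role; here both are plain Python functions, and the function names and sample events are hypothetical.

```python
from collections import Counter

def batch_view(archived_events):
    """Batch layer: recompute counts over the full archive (slow, complete)."""
    return Counter(e["level"] for e in archived_events)

def speed_view(recent_events):
    """Speed layer: count only events that arrived since the last batch run."""
    return Counter(e["level"] for e in recent_events)

def query(batch, speed):
    """Serving step: merge both views to answer with up-to-date totals."""
    return dict(batch + speed)

archived = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "INFO"}]
recent = [{"level": "ERROR"}]
totals = query(batch_view(archived), speed_view(recent))
```

When the next batch run absorbs the recent events, the speed view is discarded and rebuilt from an empty state.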

Page 24: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Apache Storm: Use Real-time MapReduce for Machine Data Streams

• Developed by BackType, which was acquired by Twitter

• Distributed computational framework that allows real-time MapReduce functionality on streams from any data source, using the concepts of Spouts and Bolts

• Read From Any Data Stream using Spouts (Kafka, JMS, HTTP, etc.)

• Transactional and guaranteed message processing

• Parallelism and scalability

• Fault Tolerance (Master-Worker for MapReduce)

• MapReduce Topologies

• Offers Real-time MapReduce jobs (Or Bolts)

• Other tools: Apache Spark
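
The spout-and-bolt idea can be imitated with Python generators. This is a conceptual sketch only: it does not use Storm's API, runs in one process, and provides none of Storm's guarantees. The names `log_spout`, `parse_bolt` and `count_bolt` and the sample lines are hypothetical.

```python
from collections import Counter

def log_spout(lines):
    """Spout: the source of the stream; emits one tuple per log line."""
    for line in lines:
        yield (line,)

def parse_bolt(stream):
    """Bolt: transform each incoming tuple (the map step)."""
    for (line,) in stream:
        level, _, message = line.partition(" ")
        yield (level, message)

def count_bolt(stream):
    """Bolt: keep a running count per level (the reduce step).

    Emits an updated snapshot after every tuple, as a streaming
    aggregation would, instead of waiting for the input to end.
    """
    counts = Counter()
    for level, _message in stream:
        counts[level] += 1
        yield dict(counts)

# Wire the topology: spout -> parse bolt -> count bolt.
lines = ["ERROR disk full", "INFO started", "ERROR timeout"]
topology = count_bolt(parse_bolt(log_spout(lines)))
snapshots = list(topology)
```

The key difference from batch MapReduce is visible in `count_bolt`: results are available after each tuple rather than only when the whole input has been read.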

Page 25: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Apache Storm: Use Real-time MapReduce for Machine Data Streams

MapReduce - The declarative style and simplicity of functional languages within Storm

Page 26: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Elasticsearch: Distributed Document Search

• Distributed search server engine using Apache Lucene

• It’s a schema-less document store using JSON as its document format. New fields can be added dynamically. All fields are indexed by default

• Uses index shards to distribute queries and searches across clusters. Queries and searches are run in parallel

• A cluster can host multiple indexes, which can be queried as a group or singly. Index aliases allow indexes to be added or dropped dynamically

• Append-only model using versioning. Writes are very fast, depending on the wait model (wait for all shards to be written, a quorum, or none)

• Well-defined RESTful API interface. Very powerful query features

• Other tools: Apache Solr
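
To make the sharding idea concrete, here is a toy model of a schema-less document store that routes documents to shards and scatters each search across all of them. It is not how Elasticsearch or Lucene are implemented, and it queries shards sequentially rather than in parallel; the `Shard` and `ToyCluster` classes and the sample documents are hypothetical.

```python
from collections import defaultdict
from zlib import crc32

class Shard:
    """One shard: stores documents and an inverted index over their words."""

    def __init__(self):
        self.docs = {}
        self.index = defaultdict(set)

    def add(self, doc_id, doc):
        self.docs[doc_id] = doc
        # Index every field, since no schema is declared up front.
        for value in doc.values():
            for word in str(value).lower().split():
                self.index[word].add(doc_id)

    def search(self, word):
        return [self.docs[i] for i in sorted(self.index.get(word.lower(), ()))]

class ToyCluster:
    def __init__(self, shard_count=3):
        self.shards = [Shard() for _ in range(shard_count)]

    def index(self, doc_id, doc):
        # Route each document to a shard by hashing its id.
        shard = self.shards[crc32(doc_id.encode()) % len(self.shards)]
        shard.add(doc_id, doc)

    def search(self, word):
        # Scatter the query to every shard, then gather the hits.
        hits = []
        for shard in self.shards:
            hits.extend(shard.search(word))
        return hits

cluster = ToyCluster()
cluster.index("1", {"level": "ERROR", "message": "Payment timeout"})
cluster.index("2", {"level": "INFO", "message": "login ok", "user": "alice"})
cluster.index("3", {"level": "ERROR", "message": "db timeout"})
timeout_hits = cluster.search("timeout")
```

Note that document "2" carries a `user` field the others lack; it is indexed like any other field, which is the sense in which the store is schema-less.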

Page 27: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Elasticsearch: Distributed Document Search

Elasticsearch: Distributed queries and searches using index shards and replicas

Page 28: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

A Really Cool UI to Show This Off

• Kibana – Works seamlessly with Elasticsearch, queries Elasticsearch directly from JavaScript

• Everything is user driven, very little coding except some configuration settings in YAML

• Very dynamic screen interface

• Screen layout, queries, filters, graphs, histograms are saved directly to Elasticsearch

• Great design and user interface

Page 29: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Putting It In Action: Demo

Page 30: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

As a reminder, please submit your questions in the chat box

We will get to as many as possible!

4/1/2014

Page 31: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Daily unique content about content management, user experience, portals and other enterprise information technology solutions across a variety of industries.

Perficient.com/SocialMedia

Facebook.com/Perficient

Twitter.com/Perficient

Page 32: Big Data Open Source Tools and Trends: Enable Real-Time Business Intelligence from Machine Logs

Thank you for your participation today. Please fill out the survey at the close of this session.
