Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Worlds

Srinath Perera
Director, Research, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Member, Apache Software Foundation
Research Scientist, Lanka Software Foundation

Upload: wso2

Post on 14-Jul-2015



What Can We Do with Big Data?

§  Optimize (the world is inefficient)
   o  30% of food is wasted from farm to plate
   o  GE 1% initiative (http://goo.gl/eYC0QE)
      -  A 1% saving in trains can save 2B/year
      -  A 1% saving in US healthcare is 20B/year
      -  In contrast, Sri Lanka's total exports are 9B/year
§  Save lives
   o  Weather, disease identification, personalized treatment
§  Technology advancement
   o  Most high-tech research is done via simulations

Big Data Architecture

Big Data Processing Technologies

WSO2 Analytics Platform

Big Data Analytics Offering

Combined Power

§  Users can send events to both BAM and CEP via the same APIs
§  CEP can combine output from batch processing and data from various storage (e.g. databases) with real-time processing
   o  e.g. implementing the Lambda Architecture
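The idea of combining batch output with real-time processing can be illustrated with a minimal Python sketch of a Lambda-style query. This is not WSO2 code: the names `batch_view`, `SpeedLayer`, and `query` are hypothetical, and the sketch assumes a simple per-key count as the metric. The batch layer recomputes counts over historical events, the real-time layer counts events that arrived since the last batch run, and a query merges the two.

```python
from collections import Counter


def batch_view(events):
    """Batch layer: recompute per-key counts over the full historical data set."""
    return Counter(e["key"] for e in events)


class SpeedLayer:
    """Real-time layer: incrementally count events seen since the last batch run."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["key"]] += 1


def query(key, batch, speed):
    """Serving step: merge the precomputed batch view with the real-time view."""
    return batch.get(key, 0) + speed.counts.get(key, 0)


# Hypothetical data: three historical events, then two fresh ones.
historical = [{"key": "serviceA"}, {"key": "serviceA"}, {"key": "serviceB"}]
batch = batch_view(historical)

speed = SpeedLayer()
for event in [{"key": "serviceA"}, {"key": "serviceC"}]:
    speed.on_event(event)

print(query("serviceA", batch, speed))  # 3
```

In a real deployment the batch view would typically be produced by the batch engine and the speed layer by the event processor; the merge step is the part this sketch is meant to show.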


Highly Pluggable Architecture

WSO2 CEP

WSO2 BAM

●  Powered by Apache Hadoop, with management and queries using Apache Hive
●  Parallel, distributed processing based on the MapReduce programming model
●  Runs on a local Hadoop node or can be delegated to a cluster of Hadoop nodes
●  Scalable, script-based analytics written using an easy-to-learn, SQL-like query language

Diagram: Analyzer Engine, Hadoop Cluster, Data Store (Cassandra/RDBMS)
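For readers new to the MapReduce programming model mentioned above, the following is a minimal single-process Python sketch of its three phases. It is illustrative only and does not use Hadoop or Hive; the function names and the `serviceID` field are assumptions chosen for the example. Conceptually it computes what a SQL-like query of the form `SELECT serviceID, COUNT(*) ... GROUP BY serviceID` would.

```python
from collections import defaultdict


def map_phase(record):
    """Map: emit one (serviceID, 1) pair per request record."""
    yield record["serviceID"], 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(records):
    """Run map, shuffle, and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical input records.
records = [{"serviceID": "login"}, {"serviceID": "search"}, {"serviceID": "login"}]
print(run_job(records))  # {'login': 2, 'search': 1}
```

On a cluster, the map and reduce calls run in parallel on different nodes and the shuffle moves data between them; a SQL-like layer hides these phases behind a query.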


High Level Languages

§  For both batch and real-time, we provide structured, SQL-like query languages.
   o  No Java programming is required
§  Lowers the adoption entry point
§  BAM
   o  Relies on Apache Hive
§  CEP
   o  Implemented through our own solution, Siddhi.

CEP Operators with Siddhi

§  Event table: map a database as an event stream
§  Filter: process a single transaction
§  Windows: track a window of events

define stream RequestStream (correlationID string, serviceID string, userID string, tear string, requestTime long, ...);

define table BlacklistedUserTable (userID string, time long, requestCount long);

from RequestStream[tear == 'BRONZE']#window.time(1 min)
select userID, requestTime as time, count(correlationID) as requestCount
group by userID
having requestCount > 5
insert into BlacklistedUserTable;
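To make the semantics of the filter, window, group-by, and having steps concrete, here is a rough Python simulation of the query above. It is a sketch, not how Siddhi is implemented: the class name `BronzeRequestMonitor` is hypothetical, the window is driven by the event's own `requestTime` in milliseconds (an assumption), and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict, deque

WINDOW_MS = 60_000  # mirrors #window.time(1 min); millisecond timestamps assumed
THRESHOLD = 5       # mirrors "having requestCount > 5"


class BronzeRequestMonitor:
    """Sliding one-minute count of BRONZE requests per user."""

    def __init__(self):
        self.window = deque()           # (requestTime, userID), oldest first
        self.counts = defaultdict(int)  # userID -> requests currently in the window
        self.blacklisted = {}           # userID -> (time, requestCount)

    def on_event(self, event):
        # Filter: only BRONZE requests enter the window.
        if event["tear"] != "BRONZE":
            return
        now = event["requestTime"]
        # Window: expire entries that are one minute old or older.
        while self.window and self.window[0][0] <= now - WINDOW_MS:
            _, old_user = self.window.popleft()
            self.counts[old_user] -= 1
        user = event["userID"]
        self.window.append((now, user))
        self.counts[user] += 1
        # Group by / having: record users above the threshold.
        if self.counts[user] > THRESHOLD:
            self.blacklisted[user] = (now, self.counts[user])


monitor = BronzeRequestMonitor()
for i in range(6):  # six requests from one user within six seconds
    monitor.on_event({"userID": "alice", "tear": "BRONZE", "requestTime": i * 1000})
print(monitor.blacklisted)  # {'alice': (5000, 6)}
```

The dictionary `blacklisted` plays the role of `BlacklistedUserTable`; in the real product the event table would be backed by a database.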


Smart Home

§  DEBS (Distributed Event Based Systems) is a premier academic conference, which posts a yearly event processing challenge (http://www.cse.iitb.ac.in/debs2014/?page_id=42)
§  Smart Home electricity data: 2000 sensors, 40 houses, 4 billion events
§  We posted the fastest single-node solution measured (400K events/sec) and close to one million events/sec in distributed throughput
§  The WSO2 CEP-based solution was one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London)
§  The only generic solution to become a finalist

Healthcare Data Monitoring

§  Allows searching, visualizing, and analyzing healthcare records (HL7) across 20 hospitals in Italy
§  Used in combination with WSO2 ESB and BAM
§  Custom toolbox tailored to the customer's requirements (to replace an existing system)

Cloud IDE Analytics

§  Custom solution created in partnership with Codenvy to bring analytics to the Codenvy management team and its customers
§  Developed in less than a month, with a custom plug-in to MongoDB
§  Deployed on the codenvy.com platform

Case Study: Realtime Soccer Analysis

Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM

Additional Customer Use Cases

§  Used in healthcare and parking monitoring
   o  See "Solution patterns based approach to rapidly create IoE solutions across industries", http://us14.wso2con.com/videos/#Coumara-Radja
§  Used by a large-scale IoT system provider for use cases including vehicle tracking, smart city, and building monitoring (CEP)
   o  See "Internet of Big Things: The Story of Pacific Controls", http://us14.wso2con.com/videos/#Sajaad-Chaudry
§  Transaction monitoring in a large bank (CEP)
§  Knowledge mining and tracking prospective customers through natural language data sources (CEP)
§  CEP embedded in edge devices
   o  See "WSO2Con 2013 - Keynote: Emerging Foundations of Next-Generation Business Systems", https://www.youtube.com/watch?v=7CyG3JKUxWw
§  Throttling and anomaly detection by a group of telecom companies

Extensions and Toolboxes

§  Fraud and Anomaly Detection Toolbox (static rules, statistical outliers, Markov chains)
§  Time Series Toolbox
§  Natural Language Processing Plugin (entity extraction, POS tagging, sentiment analysis)
§  GIS Toolbox (geo fencing, tracking, speed alarms)
§  Running machine learning models exported as PMML with CEP (e.g. from R)
§  Video Monitoring with OpenCV
§  For more info, see http://wso2.com/library/articles/2014/08/wso2-cep-in-action-an-analysis-of-use-in-real-world-applications-of-different-domains/
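As an illustration of the "statistical outliers" technique named in the Fraud and Anomaly Detection Toolbox, the sketch below flags values by z-score. The deck does not describe how the toolbox detects outliers, so this is an assumed, commonly used approach rather than the toolbox's method; the function name `find_outliers` and the threshold of 3 standard deviations are hypothetical choices.

```python
from statistics import mean, stdev


def find_outliers(values, threshold=3.0):
    """Return the values whose distance from the mean exceeds
    `threshold` sample standard deviations (a z-score test)."""
    if len(values) < 2:
        return []  # a standard deviation needs at least two points
    mu = mean(values)
    sigma = stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts: twenty ordinary ones and one large one.
amounts = [20] * 20 + [5000]
print(find_outliers(amounts))  # [5000]
```

A streaming version would maintain the mean and deviation over a window of recent events instead of a fixed list.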


Geo Fencing and Tracking Toolbox

SolidCon Demo - http://wso2.com/library/articles/2014/09/demonstration-on-architecture-of-internet-of-things-an-analysis/

IoT Demos and Use Cases

§  IoT Reference Architecture, http://wso2.com/landing/internet-of-things-uk-2014/

§  Internet of Big Things: The Story of Pacific Controls, http://us14.wso2con.com/videos/#Sajaad-Chaudry

§  Federated Identity for IoT with OAuth, http://www.infoq.com/presentations/federated-identity-IoT-OAuth

Sentiment Analysis Demo

Analyzing sentiments for the FIFA Twitter hashtag

Work in Progress


Predictive Analytics


Leveraging Apache Storm in CEP


BAM Enhancements

§  Work is underway to switch to Apache Spark and Shark for SQL-like query support in BAM
   o  Faster queries
   o  Keeps the SQL-like language
§  Use "Hive on Spark" for migration purposes
§  Lower the adoption point of BAM by packaging an RDBMS by default instead of Cassandra
   o  The architecture already scales from small deployments to big data

Questions?


Business Model