Introduction to Big Data Analytics: Batch, Real-Time, and the Best of Both Worlds
Srinath Perera
Director, Research, WSO2 Inc.
Visiting Faculty, University of Moratuwa
Member, Apache Software Foundation
Research Scientist, Lanka Software Foundation
What Can We Do with Big Data?
§ Optimize (the world is inefficient)
  o 30% of food is wasted from farm to plate
  o GE 1% initiative (http://goo.gl/eYC0QE)
    - A 1% saving in trains can save 2B/year
    - A 1% saving in US healthcare is 20B/year
    - In contrast, Sri Lanka's total exports are 9B/year
§ Save lives
  o Weather, disease identification, personalized treatment
§ Technology advancement
  o Most high-tech research is done via simulations
Combined Power
§ Users can send events to both BAM and CEP via the same APIs
§ CEP can combine output from batch processing and data from various storage (e.g. databases) with real-time processing
  o e.g. implementing the Lambda Architecture
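The Lambda Architecture idea above can be sketched as follows: a batch layer periodically recomputes a view over all stored events, a speed layer keeps incremental counts for events that arrived since the last batch run, and a query merges the two. This is a minimal Python illustration, not WSO2 BAM or CEP code; the names `batch_view`, `SpeedLayer`, and `query`, and the event fields, are hypothetical.

```python
from collections import Counter

def batch_view(stored_events):
    """Batch layer: recompute per-user counts from all stored events."""
    return Counter(e["userID"] for e in stored_events)

class SpeedLayer:
    """Speed layer: incremental counts for events not yet covered by a batch run."""
    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["userID"]] += 1

    def reset(self):
        # Called once a new batch view covers these events.
        self.counts.clear()

def query(user_id, batch, speed):
    """Serving layer: merge the batch view with the real-time view."""
    return batch[user_id] + speed.counts[user_id]

# Hypothetical data: three stored events, one event that just arrived.
stored = [{"userID": "alice"}, {"userID": "bob"}, {"userID": "alice"}]
batch = batch_view(stored)
speed = SpeedLayer()
speed.on_event({"userID": "alice"})
total_alice = query("alice", batch, speed)
total_carol = query("carol", batch, speed)
```

The design choice is that the batch view stays simple and recomputable, while the speed layer only has to hold the small amount of state that the batch run has not yet absorbed.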
WSO2 BAM
● Powered by Apache Hadoop with management and queries using Apache Hive
● Parallel, distributed processing based on the MapReduce programming model
● Runs on local Hadoop node or can be delegated to a cluster of Hadoop nodes
● Scalable script-based analytics written using an easy-to-learn, SQL-like query language.
[Diagram: Analyzer Engine, Hadoop Cluster, Data Store (Cassandra/RDBMS)]
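To illustrate the MapReduce programming model mentioned above, the following single-process Python sketch mimics the map, shuffle, and reduce phases for counting requests per service. It is a toy model, not Hadoop or Hive code; the record fields and function names are assumptions made for the example.

```python
from collections import defaultdict

def map_phase(record):
    # Emit (key, value) pairs; here one count per request, keyed by service.
    yield record["serviceID"], 1

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

def run_job(records):
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

# Hypothetical input records.
records = [{"serviceID": "login"}, {"serviceID": "search"}, {"serviceID": "login"}]
request_counts = run_job(records)
```

In Hive, an aggregation like this would typically be written as a SQL-like GROUP BY query, which Hive compiles into MapReduce jobs, so the user never writes the map and reduce functions by hand.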
High Level Languages
§ For both batch and real-time, we provide structured, SQL-like query languages.
  o No Java programming is required
§ Lowers the adoption entry point
§ BAM
  o Relies on Apache Hive
§ CEP
  o Implemented through our own solution, Siddhi.
CEP Operators with Siddhi
§ Event table: maps a database as an event stream
§ Filter: processes a single transaction
§ Windows: track a window of events

define stream RequestStream (correlationID string, serviceID string, userID string, tier string, requestTime long, ...);
define table BlacklistedUserTable (userID string, time long, requestCount long);
from RequestStream[tier == 'BRONZE']#window.time(1 min)
select userID, requestTime as time, count(correlationID) as requestCount
group by userID
having requestCount > 5
insert into BlacklistedUserTable;
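To make the semantics of the query above concrete, here is a rough Python sketch of the same logic: filter BRONZE requests, keep a one-minute sliding window per user, and record users with more than five requests in the window. It is a simplified illustration, not how Siddhi is implemented. In particular, it assumes that expiry is driven by the event's own `requestTime` in milliseconds and that events arrive in timestamp order; the class name `BronzeThrottle` and the dictionary-based blacklist are hypothetical.

```python
from collections import defaultdict, deque

WINDOW_MS = 60_000   # 1 minute, mirroring window.time(1 min)
THRESHOLD = 5        # mirroring "having requestCount > 5"

class BronzeThrottle:
    def __init__(self):
        self.window = defaultdict(deque)   # userID -> request timestamps in window
        self.blacklist = {}                # userID -> (time, requestCount)

    def on_event(self, event):
        if event["tier"] != "BRONZE":      # filter step
            return
        times = self.window[event["userID"]]
        now = event["requestTime"]
        times.append(now)
        # Expire events that have fallen out of the sliding window.
        while times and times[0] <= now - WINDOW_MS:
            times.popleft()
        if len(times) > THRESHOLD:         # having step
            # Plays the role of "insert into BlacklistedUserTable".
            self.blacklist[event["userID"]] = (now, len(times))

throttle = BronzeThrottle()
for i in range(6):
    throttle.on_event({"userID": "u1", "tier": "BRONZE", "requestTime": 1000 + i})
throttle.on_event({"userID": "u2", "tier": "GOLD", "requestTime": 1000})
```

After the six BRONZE requests from `u1` inside one minute, `u1` appears in `throttle.blacklist`, while the GOLD request from `u2` is ignored by the filter.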
Smart Home
§ DEBS (Distributed Event Based Systems) is a premier academic conference that poses a yearly event processing challenge (http://www.cse.iitb.ac.in/debs2014/?page_id=42)
§ Smart Home electricity data: 2000 sensors, 40 houses, 4 billion events
§ We posted the fastest single-node solution measured (400K events/sec) and close to one million events/sec in distributed throughput.
§ The WSO2 CEP based solution is one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London)
§ The only generic solution to become a finalist
Healthcare Data Monitoring
§ Allows users to search, visualize, and analyze healthcare records (HL7) across 20 hospitals in Italy
§ Used in combination with WSO2 ESB and BAM
§ Custom toolbox tailored to the customer's requirements (to replace the existing system)
Cloud IDE Analytics
§ Custom solution created in partnership with Codenvy to bring analytics to the Codenvy management team and its customers
§ Developed in less than a month, with a custom plug-in for MongoDB.
§ Deployed on the codenvy.com platform.
Additional Customer Use Cases
§ Used in Healthcare, Parking Monitoring (see "Solution patterns based approach to rapidly create IoE solutions across industries", http://us14.wso2con.com/videos/#Coumara-Radja)
§ Used by a large-scale IoT system provider for use cases including vehicle tracking, Smart City, and building monitoring (CEP)
  o See "Internet of Big Things: The Story of Pacific Controls", http://us14.wso2con.com/videos/#Sajaad-Chaudry
§ Transaction monitoring in a large bank (CEP)
§ Knowledge mining and tracking prospective customers through natural language data sources (CEP)
§ CEP embedded in edge devices
  o See WSO2Con 2013 - Keynote: Emerging Foundations of Next-Generation Business Systems, https://www.youtube.com/watch?v=7CyG3JKUxWw
§ Throttling and anomaly detection by a group of telecom companies
Extensions and Toolboxes
§ Fraud and Anomaly Detection Toolbox (static rules, statistical outliers, Markov chains)
§ Time Series Toolbox
§ Natural Language Processing Plugin (entity extraction, POS tagging, sentiment analysis)
§ GIS Toolbox (geo-fencing, tracking, speed alarms)
§ Running machine learning models exported as PMML with CEP (e.g. from R)
§ Video monitoring with OpenCV
§ For more info, see http://wso2.com/library/articles/2014/08/wso2-cep-in-action-an-analysis-of-use-in-real-world-applications-of-different-domains/
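As an illustration of the statistical-outlier style of anomaly detection listed above, the sketch below flags a value when it lies more than a chosen number of standard deviations from the mean of recent values. This is a generic z-score technique shown in Python, not the toolbox's actual implementation; the class name `ZScoreDetector`, the window size, and the threshold are assumptions.

```python
from collections import deque
from statistics import mean, pstdev

class ZScoreDetector:
    def __init__(self, window_size=20, threshold=3.0):
        self.values = deque(maxlen=window_size)  # recent history only
        self.threshold = threshold

    def is_outlier(self, value):
        """Return True if value deviates from recent history by more than threshold sigmas."""
        outlier = False
        if len(self.values) >= 2:
            mu = mean(self.values)
            sigma = pstdev(self.values)
            # With zero spread there is no meaningful z-score, so do not flag.
            outlier = sigma > 0 and abs(value - mu) / sigma > self.threshold
        self.values.append(value)
        return outlier

# Hypothetical transaction amounts; the last one is far from the rest.
detector = ZScoreDetector(window_size=10, threshold=3.0)
amounts = [100, 102, 99, 101, 100, 5000]
flags = [detector.is_outlier(a) for a in amounts]
```

Only the final amount is flagged here. A production rule would also need to decide whether flagged values should be kept out of the history, which this sketch does not attempt.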
IoT Demos and Use Cases
§ SolidCon Demo, http://wso2.com/library/articles/2014/09/demonstration-on-architecture-of-internet-of-things-an-analysis/
§ IoT Reference Architecture, http://wso2.com/landing/internet-of-things-uk-2014/
§ Internet of Big Things: The Story of Pacific Controls, http://us14.wso2con.com/videos/#Sajaad-Chaudry
§ Federated Identity for IoT with OAuth, http://www.infoq.com/presentations/federated-identity-IoT-OAuth
BAM Enhancements
§ Work is underway to switch BAM to Apache Spark, with support for Shark SQL-like queries
  o Faster queries
  o Keeping the SQL-like language
§ Use "Hive on Spark" for migration purposes
§ Lower the adoption point of BAM by packaging an RDBMS by default instead of Cassandra.
  o The architecture already scales from small deployments to Big Data