![Page 1: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/1.jpg)
Introduction to Large Scale Data Analysis and WSO2 Analytics Platform

Srinath Perera
Director Research WSO2, Apache Member (@srinath_perera) [email protected]
At Indiana University Bloomington
![Page 2: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/2.jpg)
Who Are We?

- We are an open source middleware company: we build systems upon which others build their systems
- Venture funded: Intel Capital, Cisco, Toba Capital
- 400+ people, with offices in Silicon Valley, Sri Lanka, London, and Bloomington
- Customers include banks, aircraft manufacturers, governments (state and federal), media companies, telcos, retail, healthcare, ...
![Page 3: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/3.jpg)
Outline
- Introduction to Big Data
- The problem we are trying to solve
- WSO2 Big Data Platform
- Next steps
![Page 4: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/4.jpg)
A Day in Your Life

Think about a day in your life:
- What is the best road to take?
- Will there be bad weather?
- How should I invest my money?
- How is my health?

There are many decisions you could make better if only you could access the data and process it.

Photo: http://www.flickr.com/photos/kcolwell/5512461652/ (CC licence)
![Page 5: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/5.jpg)
![Page 6: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/6.jpg)
Internet of Things

- Currently the physical world and the software world are detached
- The Internet of Things promises to bridge this
  - It is about sensors and actuators everywhere
  - In your fridge, in your blanket, in your chair, in your carpet... yes, even in your socks
  - Umbrellas that light up when there is rain, and medicine cups
![Page 7: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/7.jpg)
What Can We Do with Big Data?

- Optimize (the world is inefficient)
  - 30% of food is wasted from farm to plate
  - GE Save 1% initiative (http://goo.gl/eYC0QE)
    - Trains => 2B/year
    - US healthcare => 20B/year
- Save lives
  - Weather, disease identification, personalized treatment
- Technology advancement
  - Most high-tech research is done via simulations
![Page 8: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/8.jpg)
Big Data Architecture
![Page 9: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/9.jpg)
Big data Processing Technologies Landscape
![Page 10: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/10.jpg)
(Batch) Analytics

- Scientists have been doing this for 25 years with MPI (1991) on special hardware
  - OpenMPI is being developed at IU!
- It took off with Google's MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem followed. It was successful, so here we are!
- But processing takes time.
![Page 11: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/11.jpg)
Usecase: Targeted Advertising

- Analytics implemented with MapReduce or queries
  - Min, max, average, correlation, histograms; might join or group data in many ways
  - Heatmaps, temporal trends
- Key Performance Indicators (KPIs)
  - E.g. profit per square foot for retail
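
The aggregations listed above are simple to state in code. The following is a minimal single-machine sketch in plain Python, not a MapReduce job; the function names `summarize` and `profit_per_square_foot` and the bucket width are illustrative assumptions, not part of any WSO2 or Hadoop API.

```python
from collections import Counter
from statistics import mean


def summarize(values, bucket_width):
    """Return min, max, mean, and a histogram keyed by bucket start."""
    # Each value falls into the bucket [k * bucket_width, (k + 1) * bucket_width).
    histogram = Counter((v // bucket_width) * bucket_width for v in values)
    return {
        "min": min(values),
        "max": max(values),
        "mean": mean(values),
        "histogram": dict(sorted(histogram.items())),
    }


def profit_per_square_foot(profit, area_sq_ft):
    """KPI from the retail example: profit divided by floor area."""
    return profit / area_sq_ft
```

At scale the same aggregates would be computed per key by a batch framework; the logic per group stays the same.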
![Page 12: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/12.jpg)
Usecase: Big Data for Development

- Done using CDR (call detail record) data
- People density, noon vs. midnight (red => increased, blue => decreased)
- Urban planning
  - People distribution
  - Mobility
  - Waste management
  - E.g. see http://goo.gl/jPujmM

From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
![Page 13: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/13.jpg)
The Value of Some Insights Degrades Fast!

- For some usecases (e.g. stock markets, traffic, surveillance, patient monitoring), the value of insights degrades very quickly with time.
  - E.g. stock markets and the speed of light
- We need technology that can produce outputs fast
  - Static queries, but with very fast output (alerts, realtime control)
  - Dynamic and interactive queries (data exploration)
![Page 14: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/14.jpg)
![Page 15: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/15.jpg)
Predictive Analytics

- If we know how to solve a problem, that is, if we know a finite set of rules, then we can program it.
- For some problems (e.g. driving a car, character recognition), we do not know a finite, fixed rule set.
- Instead of programming, we give lots of examples and ask the computer to learn (often called Machine Learning).
- Lots of tools
  - R (statistical language)
  - scikit-learn (Python)
  - Apache Spark's MLBase and Apache Mahout (Java)
![Page 16: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/16.jpg)
Usecase: Predictive Maintenance

- The idea is to fix the problem before it happens, avoiding expensive downtime
  - Airplanes, turbines, windmills
  - Construction equipment
  - Cars, golf carts
- How
  - Build a model for normal operation and compare deviations
  - Match against known error patterns
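
One simple way to "build a model for normal operation and compare deviations" is a mean and standard deviation learned from healthy readings, with a z-score test on new readings. This is only a sketch under that assumption; real predictive maintenance models are usually richer, and the names `fit_normal_model` and `is_anomalous` and the threshold of 3 are illustrative choices.

```python
from statistics import mean, stdev


def fit_normal_model(readings):
    """Learn a (mean, standard deviation) model from readings taken in normal operation."""
    return mean(readings), stdev(readings)


def is_anomalous(value, model, threshold=3.0):
    """Flag a reading whose distance from the mean exceeds `threshold` standard deviations."""
    mu, sigma = model
    if sigma == 0:
        # Degenerate model: every training reading was identical.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

For example, a model fitted on readings around 70 would flag a reading of 80 but accept 70.2.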
![Page 17: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/17.jpg)
Problem We Are Trying to Solve!

- Build a platform with which others can build their analytics systems
  - Collect, analyze, communicate
  - End to end: starts from humans and ends with humans
- Different audiences
  - Technical (developers)
  - Non-technical (CXOs, sales, analysts)

"There are two things you need to know about business: make something users love and make more than you spend."
-- Paul Graham (Lisp, Y Combinator)
![Page 18: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/18.jpg)
![Page 19: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/19.jpg)
Running Example
- Monitor temperature and hot airflow across multiple buildings (e.g. central AC)
  - More people => hot
- Analytics
  - Historical behavior of temperature by the hour
  - Alerts if the temperature falls too low or rises too high
  - Modeling and predicting temperature to adjust proactively

Stream definitions:

    define TemperatureStream(ts long, buildingNo long, t double);
    define AirflowStream(ts long, buildingNo long, aflow double, aT);
![Page 20: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/20.jpg)
Collect Data

- One sensor API to publish events
  - REST, Thrift, Java, JMS, Kafka
  - Java clients, JavaScript clients*
- First you define streams (think of a stream as an infinite table in a SQL DB)
- Then send events via the API*
- Challenges (performance, guaranteed delivery, scale)
- Can send to the batch pipeline, the realtime pipeline, or both via configuration!
![Page 21: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/21.jpg)
Collecting Data: Example
- Java example: create and send events
- Events are sent asynchronously

See the client given in http://goo.gl/vIJzqc for more info.

    // Initialize stream
    Agent agent = new Agent(agentConfiguration);
    publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );

    // Define stream
    StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
    definition.addPayloadData("sid", STRING);
    ...
    publisher.addStreamDefinition(definition);
    ...

    // Send events
    Event event = new Event();
    event.setPayloadData(eventData);
    publisher.publish(STREAM_NAME, VERSION, event);
![Page 22: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/22.jpg)
Batch Analytics: Spark
- Two frameworks: Hadoop (http://hadoop.apache.org) and Spark (https://spark.apache.org)
  - Hadoop is a MapReduce implementation
- Spark is faster (30X) and much more flexible.
- Spark set a record at Gray Sort (100TB): 3X faster with 10X fewer machines, http://goo.gl/r5LGvD
- For Hadoop and MapReduce resources, search the web.

Example:

    file = spark.textFile("hdfs://...")
    file.flatMap(tsToHourFunction) \
        .reduceByKey(lambda a, b: a+b)
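
To make the data flow of the Spark snippet concrete, here is a plain-Python sketch of what `flatMap` followed by `reduceByKey` computes. It does not use Spark; `flat_map` and `reduce_by_key` are local stand-ins, and `ts_to_hour` is a hypothetical version of `tsToHourFunction` that assumes each input line looks like `ts_millis,buildingNo,t` and emits `(hour, 1)` so that the reduction counts events per hour.

```python
from datetime import datetime, timezone


def ts_to_hour(line):
    """Hypothetical mapper: parse 'ts_millis,buildingNo,t' and emit one (hour, 1) pair."""
    ts_millis = int(line.split(",")[0])
    hour = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).hour
    return [(hour, 1)]


def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one stream."""
    for item in items:
        yield from func(item)


def reduce_by_key(func, pairs):
    """Combine all values that share a key using the binary function func."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


lines = ["0,1,20.5", "1800000,1,21.0", "3600000,2,19.0"]
counts = reduce_by_key(lambda a, b: a + b, flat_map(ts_to_hour, lines))
```

Spark runs the same two steps, but partitions the input across machines and merges the per-key results.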
![Page 23: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/23.jpg)
SQL-like Queries: Hive

- Apache Hive provides a SQL-like data processing language
- Since many people understand SQL, Hive made large-scale data processing (Big Data) accessible to many
- Expressive, short, and sweet.
- Defines core operations that cover 90% of problems
- Lets experts dig in when they like (via user-defined functions)!
![Page 24: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/24.jpg)
Hourly Temperature Average
Hive compiles the SQL-like query into a set of MapReduce jobs running on Hadoop or Spark (in WSO2 BAM from the 2015 Q2 release).

    insert overwrite table TemperatureHistory
    select getHour(ts) as hour, avg(t) as avgT, buildingNo
    from TemperatureStream
    group by buildingNo, getHour(ts);
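
For readers less familiar with SQL, the grouping in the query can be sketched in plain Python. This is an illustration of the group-by-and-average logic only, assuming events arrive as `(ts_millis, building_no, t)` tuples and that the hour is taken in UTC; `hourly_average` is an illustrative name.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_average(events):
    """Average temperature per (building_no, hour) from (ts_millis, building_no, t) tuples."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [running total, count]
    for ts_millis, building_no, t in events:
        hour = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc).hour
        acc = sums[(building_no, hour)]
        acc[0] += t
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}
```

Keeping a running total and count per key is also what lets a distributed engine merge partial results from many machines.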
![Page 25: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/25.jpg)
Complex Event Processing
![Page 26: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/26.jpg)
Operators: Filters
- Assume a temperature stream
- Here weather:convertFtoC() is a user-defined function. Such functions are used to extend the language.

Query:

    define stream TemperatureStream(ts long, roomNo int, temp double);

    from TemperatureStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
    select roomNo, temp
    insert into HotRoomsStream;

Usecases:
- Alerts, thresholds (e.g. alarm on high temperature)
- Preprocessing: filtering, transformations (e.g. data cleanup)
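
The filter operator can be pictured as a generator that passes through only the matching events. The sketch below mirrors the query in plain Python; it assumes events are dictionaries with `roomNo` and `temp` (Fahrenheit) keys, and `convert_f_to_c` stands in for the user-defined function.

```python
def convert_f_to_c(temp_f):
    """Stand-in for the user-defined function: Fahrenheit to Celsius."""
    return (temp_f - 32.0) * 5.0 / 9.0


def hot_rooms(events, threshold_c=30.0, excluded_room=2043):
    """Yield (roomNo, temp) projections for events above the threshold, skipping one room."""
    for event in events:
        if convert_f_to_c(event["temp"]) > threshold_c and event["roomNo"] != excluded_room:
            yield {"roomNo": event["roomNo"], "temp": event["temp"]}
```

Because it is a generator, events are processed one at a time as they arrive, which is the key difference from a batch query.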
![Page 27: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/27.jpg)
Operators: Windows and Aggregation
- Supports many window types
  - Batch windows, sliding windows, custom windows
- Usecases
  - Simple counting (e.g. failure count)
  - Counting with windows (e.g. failure count every hour)

Query:

    from TemperatureStream#window.time(1 min)
    select roomNo, avg(temp) as avgTemp
    insert into HotRoomsStream;
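
A sliding time window can be sketched with a queue that drops events older than the window each time a new event arrives. This is a simplified illustration, not how the CEP engine is implemented: it assumes events arrive in timestamp order and expires old events only on arrival of a new one. The class name `SlidingTimeAverage` is an assumption.

```python
from collections import deque


class SlidingTimeAverage:
    """Running average over the events seen in the last `window_millis` milliseconds."""

    def __init__(self, window_millis):
        self.window_millis = window_millis
        self.events = deque()  # (ts_millis, value) pairs, oldest first
        self.total = 0.0

    def add(self, ts_millis, value):
        """Add an event, evict expired ones, and return the current window average."""
        self.events.append((ts_millis, value))
        self.total += value
        # Evict events that fell out of the window; the new event is never evicted.
        while self.events[0][0] <= ts_millis - self.window_millis:
            _, old_value = self.events.popleft()
            self.total -= old_value
        return self.total / len(self.events)
```

Keeping a running total means each event costs constant amortized work, regardless of window size.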
![Page 28: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/28.jpg)
Operators: Patterns
- Models a "followed by" relation: e.g. event A followed by event B
- A very powerful tool for tracking and detecting patterns

Query:

    from every (a1 = TemperatureStream) -> a2 = TemperatureStream[temp > a1.temp + 5] within 1 day
    select a2.ts as ts, a2.temp - a1.temp as diff
    insert into HotDayAlertStream;

Usecases:
- Detecting event sequence patterns
- Tracking
- Detecting trends
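
The "followed by" idea can be approximated in plain Python by remembering every earlier event as a pending match and completing it when a later event is more than 5 degrees hotter within one day. This sketch is an approximation of the pattern query's intent, not a faithful reimplementation of the CEP engine's matching semantics; `detect_rises` and its parameters are illustrative.

```python
DAY_MILLIS = 24 * 60 * 60 * 1000


def detect_rises(events, min_rise=5.0, within_millis=DAY_MILLIS):
    """events: (ts_millis, temp) tuples in time order. Returns a list of (ts, diff) alerts."""
    pending = []  # earlier events still waiting for a matching follow-up
    alerts = []
    for ts, temp in events:
        still_pending = []
        for start_ts, start_temp in pending:
            if ts - start_ts > within_millis:
                continue  # the time bound expired, drop this partial match
            if temp > start_temp + min_rise:
                alerts.append((ts, temp - start_temp))  # match completed
            else:
                still_pending.append((start_ts, start_temp))
        still_pending.append((ts, temp))  # every event may start a new match
        pending = still_pending
    return alerts
```

Note that the pending list grows with the event rate within the time bound, which is why pattern queries are usually given a `within` clause.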
![Page 29: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/29.jpg)
Operators: Joins
- Joins two data streams based on a condition and windows
- Usecases
  - Data correlation, detecting missing events, detecting erroneous data
  - Joining event streams

Query:

    from TemperatureStream[temp > 30.0]#window.time(1 min) as T
      join RegulatorStream[isOn == false]#window.length(1) as R
      on T.roomNo == R.roomNo
    select T.roomNo, R.deviceID, 'start' as action
    insert into RegulatorActionStream;
![Page 30: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/30.jpg)
Operators: Access Data from the Disk
- Event tables allow users to map a database to a window and join a data stream with the window
- Usecases
  - Merge with data in a database, collect, update data conditionally

Query:

    define table HistTempTable(day long, avgT double);

    from TemperatureStream#window.length(1) join HistTempTable
      on getDayOfYear(ts) == HistTempTable.day && temp > avgT
    select ts, temp
    insert into PurchaseUserStream;
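
Joining a stream against an event table amounts to a lookup per incoming event. The sketch below assumes the intent of the query is to emit readings that exceed the stored historical average for the same day of the year; it models the table as a dictionary from day of year to average, and `above_historical_average` is an illustrative name.

```python
from datetime import datetime, timezone


def above_historical_average(events, hist_avg_by_day):
    """Yield (ts_millis, temp) readings hotter than the stored average for that day of year."""
    for ts_millis, temp in events:
        moment = datetime.fromtimestamp(ts_millis / 1000, tz=timezone.utc)
        day_of_year = moment.timetuple().tm_yday
        avg = hist_avg_by_day.get(day_of_year)
        # Like an inner join: days with no row in the table produce no output.
        if avg is not None and temp > avg:
            yield ts_millis, temp
```

In a real deployment the table would live in a database, so caching or batching lookups matters for throughput.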
![Page 31: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/31.jpg)
Realtime Analytics Patterns

- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
- Preprocessing: filtering, transformations (e.g. data cleanup)
- Alerts, thresholds (e.g. alarm on high temperature)
- Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
- Joining event streams (e.g. detecting a hit on a soccer ball)
- Merging with data in a database, collecting, updating data conditionally
![Page 32: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/32.jpg)
Realtime Analytics Patterns (contd.)

- Detecting event sequence patterns (e.g. small transaction followed by large transaction)
- Tracking: following some related entity's state in space, time, etc. (e.g. location of airline baggage, vehicles, tracking wildlife)
- Detecting trends: rise, turn, fall, outliers, complex trends like triple bottom, etc. (e.g. algorithmic trading, SLA, load balancing)
- Learning a model (e.g. predictive maintenance)
- Predicting the next value and corrective actions (e.g. automated car)
![Page 33: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/33.jpg)
Predictive Analytics

- Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)
- Build a model using R, export it as PMML, and use it within WSO2 CEP
- Call R scripts from CEP queries
- Regression and anomaly detection operators in CEP
![Page 34: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/34.jpg)
Predictive Analytics

- WSO2 Machine Learner provides a wizard to explore and build models
- E.g. build a model to predict the temperature for the next 15 minutes
  - Trivial option: (historical mean + last 15m mean)/2
  - Better model via ARIMA from time series analysis
- To know more, take an ML class
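
The trivial option is a one-line formula, shown here as a small Python function for concreteness. The function name and argument names are illustrative; an ARIMA model would replace this with a fitted time series model.

```python
from statistics import mean


def predict_next_15m(historical_temps, last_15m_temps):
    """Trivial predictor: average of the historical mean and the mean of the last 15 minutes."""
    return (mean(historical_temps) + mean(last_15m_temps)) / 2
```

For instance, with a historical mean of 21 and a recent mean of 26 the prediction is 23.5, halfway between the long-term and the recent behavior.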
![Page 35: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/35.jpg)
Communicate: Dashboards
- The idea is to give the "overall idea" at a glance (e.g. car dashboard)
- Support for personalization: you can build your own dashboard.
- Also the entry point for drill-down
- How to build?
  - Dashboard via Google Gadget and content via HTML5 + JavaScript
  - Use WSO2 User Engagement Server to build a dashboard (or a JSP or PHP)
  - Use charting libraries like Vega or D3
![Page 36: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/36.jpg)
![Page 37: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/37.jpg)
Communicate: Alerts

- Detecting conditions can be done via CEP queries
- The key is the "last mile"
  - Email
  - SMS
  - Push notifications to a UI
  - Pager
  - Trigger a physical alarm
- How?
  - Select the email sender "Output Adaptor" from CEP, or send from CEP to ESB, which has lots of connectors
![Page 38: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/38.jpg)
Communicate: APIs

- With mobile apps, most data is exposed and shared with end users as APIs (REST/JSON).
- Some of the challenges:
  - Security and permissions
  - API discovery
  - Billing, throttling, quotas
  - SLA enforcement
- How?
  - Write data to a database from CEP event tables
  - Build services via WSO2 Data Service
  - Expose them as APIs via API Manager
![Page 39: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/39.jpg)
Smart Home

- 2015 yearly DEBS (Distributed Event Based Systems) Grand Challenge (http://goo.gl/0htxlj)
- Smart home electricity data: 2000 sensors, 40 houses, 4 billion events
- We posted 400K events/sec, and close to one million events/sec distributed throughput with 4 nodes.
- The WSO2 CEP based solution is one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London)
- The only generic solution to become a finalist
![Page 40: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/40.jpg)
Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
![Page 41: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/41.jpg)
Case Study: TFL Traffic Analysis

Built using TFL (Transport for London) open data feeds.

http://goo.gl/04tX6k
http://goo.gl/9xNiCm
![Page 42: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/42.jpg)
WSO2 Big Data Analytics Platform
![Page 43: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/43.jpg)
Conclusion

- Goal: build a platform with which others can build their analytics systems
  - End to end: starts from humans and ends with humans
- The whole platform is open source under the Apache License
- What can you do with the platform?
  - Solve hard problems and build great apps with the platform
  - Add and contribute extensions to the platform (e.g. GSoC http://goo.gl/QNFP6Y)
  - Fix problems (patches)
- Find us at the [email protected] list or on Stack Overflow (tag wso2)
![Page 44: Introduction to Large Scale Data Analysis with WSO2 Analytics Platform](https://reader033.vdocument.in/reader033/viewer/2022050907/55a50e1e1a28abdf588b48f0/html5/thumbnails/44.jpg)
Questions?