Introduction to Large Scale Data Analysis and WSO2 Analytics Platform
Srinath Perera
Director, Research, WSO2; Apache Member (@srinath_perera) [email protected]
At Indiana University Bloomington
Who Are We?
We are an open source middleware company
- We build systems upon which others build their systems
Venture funded: Intel Capital, Cisco, Toba Capital
400+ people, with offices in Silicon Valley, Sri Lanka, London and Bloomington
Customers include banks, aircraft manufacturers, governments (state and federal), media companies, telcos, retail, healthcare ..
Outline
Introduction to Big Data
The Problem We Are Trying to Solve
WSO2 Big Data Platform
Next steps
A Day in Your Life
Think about a day in your life:
- What is the best road to take?
- Will there be any bad weather?
- How should I invest my money?
- How is my health?
There are many decisions that you could make better if only you could access the data and process it.
http://www.flickr.com/photos/kcolwell/5512461652/ CC licence
Internet of Things
Currently the physical world and the software world are detached
The Internet of Things promises to bridge this
- It is about sensors and actuators everywhere
- In your fridge, in your blanket, in your chair, in your carpet.. Yes, even in your socks
- Umbrellas that light up when there is rain, and medicine cups
What Can We Do with Big Data?
Optimize (the world is inefficient)
- 30% of food is wasted from farm to plate
- GE Save 1% initiative (http://goo.gl/eYC0QE)
- Trains => 2B/year
- US healthcare => 20B/year
Save lives
- Weather, disease identification, personalized treatment
Technology advancement
- Most high-tech research is done via simulations
Big Data Architecture
Big data Processing Technologies Landscape
(Batch) Analytics
Scientists have been doing this for 25 years with MPI (1991) on special hardware
- OpenMPI is being done at IU!
Took off with Google's MapReduce paper (2004); Apache Hadoop, Hive and a whole ecosystem were created.
It was successful, so we are here!!
But processing takes time.
Usecase: Targeted Advertising
Analytics implemented with MapReduce or queries
- Min, max, average, correlation, histograms; might join or group data in many ways
- Heatmaps, temporal trends
Key Performance Indicators (KPIs)
- E.g. profit per square foot for retail
Usecase: Big Data for Development
Done using CDR data
People density, noon vs. midnight (red => increased, blue => decreased)
Urban planning
- People distribution
- Mobility
- Waste management
- E.g. see http://goo.gl/jPujmM
From: http://lirneasia.net/2014/08/what-does-big-data-say-about-sri-lanka/
Value of Some Insights Degrades Fast!
For some usecases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time.
- E.g. stock markets and the speed of light
We need technology that can produce outputs fast
- Static queries, but need very fast output (alerts, realtime control)
- Dynamic and interactive queries (data exploration)
Predictive Analytics
If we know how to solve a problem, that is, if we know a finite set of rules, then we can program it.
For some problems (e.g. driving a car, character recognition), we do not know a finite, fixed rule set.
Instead of programming, we give lots of examples and ask the computer to learn (often called Machine Learning)
Lots of tools
- R (statistical language)
- scikit-learn (Python)
- Apache Spark's MLBase and Apache Mahout (Java)
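To make "give examples and ask the computer to learn" concrete, here is a minimal sketch in plain Python that fits a straight line to example pairs by ordinary least squares and then predicts an unseen case. The data (people in a building vs. temperature) is made up for illustration, and `fit_line` is a hypothetical helper, not part of any of the tools listed above.

```python
def fit_line(xs, ys):
    """Learn slope and intercept from example (x, y) pairs via least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up training examples: number of people -> observed temperature (C)
people = [10, 20, 30, 40]
temps = [21.0, 22.0, 23.0, 24.0]

slope, intercept = fit_line(people, temps)

# Use the learned model on a case that was not in the examples
predicted = slope * 50 + intercept
```

No rule such as "each extra person adds 0.1 degrees" was programmed; the relation is recovered from the examples alone.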
Usecase: Predictive Maintenance
The idea is to fix the problem before it happens, avoiding expensive downtime
- Airplanes, turbines, windmills
- Construction equipment
- Cars, golf carts
How
- Build a model for normal operation and compare deviations
- Match against known error patterns
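The "model for normal operation" idea can be sketched very simply: summarize healthy readings by their mean and standard deviation, then flag new readings that deviate too far. This is only one possible model, chosen for brevity; the sensor values, the function names, and the 3-sigma threshold are all assumptions for illustration.

```python
from statistics import mean, stdev

def build_normal_model(readings):
    """Summarize readings taken during normal operation as (mean, std dev)."""
    return mean(readings), stdev(readings)

def is_anomalous(value, model, max_sigmas=3.0):
    """Flag a reading that deviates more than max_sigmas from the normal mean."""
    mu, sigma = model
    return abs(value - mu) > max_sigmas * sigma

# Made-up vibration readings from a healthy machine
normal_vibration = [0.9, 1.0, 1.1, 1.0, 0.95, 1.05]
model = build_normal_model(normal_vibration)

ok_reading = is_anomalous(1.02, model)    # close to normal behavior
bad_reading = is_anomalous(1.8, model)    # far outside normal behavior
```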
Problem we are trying to Solve!
Build a platform using which others can build their analytics systems
- Collect, analyze, communicate
- End to end, starts from humans and ends with humans
Different audiences
- Technical (developers)
- Non-technical (CXOs, sales, analysts)
There are two things you need to know about business: make something users love and make more than you spend.
--Paul Graham
(Lisp, Y Combinator)
Running Example
Monitor temperature and hot airflow across multiple buildings (e.g. central AC)
- More people => hot
Analytics
- Historical behavior of temperature by the hour
- Alerts if temperature falls too low or rises too high
- Modeling and predicting temperature to adjust proactively
define TemperatureStream(ts long, buildingNo long, t double);
define AirflowStream(ts long, buildingNo long, aflow double, aT);
Collect Data
One Sensor API to publish events
- REST, Thrift, Java, JMS, Kafka
- Java clients, JavaScript clients*
First you define streams (think of a stream as an infinite table in a SQL DB)
Then send events via the API
Challenges (performance, guaranteed delivery, scale)
Can send to the batch pipeline, the realtime pipeline, or both via configuration!
Collecting Data: Example
Java example: create and send events
Events are sent asynchronously
See the client given in http://goo.gl/vIJzqc for more info

// Initialize stream
Agent agent = new Agent(agentConfiguration);
publisher = new AsyncDataPublisher("tcp://hostname:7612", .. );

// Define stream
StreamDefinition definition = new StreamDefinition(STREAM_NAME, VERSION);
definition.addPayloadData("sid", STRING);
...
publisher.addStreamDefinition(definition);
...

// Send events
Event event = new Event();
event.setPayloadData(eventData);
publisher.publish(STREAM_NAME, VERSION, event);
Batch Analytics: Spark
Two frameworks: Hadoop (http://hadoop.apache.org) and Spark (https://spark.apache.org)
- Hadoop is a MapReduce implementation
Spark is faster (30X and ) and much more flexible.
Spark set a record at Gray Sort (100TB): 3X faster with 10X fewer machines, http://goo.gl/r5LGvD
For Hadoop and MapReduce resources, Google it.
file = spark.textFile("hdfs://...")
file.flatMap(tsToHourFunction) \
    .reduceByKey(lambda a, b: a + b)
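To show what the `flatMap` and `reduceByKey` steps compute, here is a single-machine sketch in plain Python that needs no Spark cluster. The helpers `flat_map` and `reduce_by_key` are hypothetical stand-ins that mimic the semantics of the Spark operations, and `ts_to_hour` is an assumed version of `tsToHourFunction`: it assumes each input line is `ts,buildingNo,t` with `ts` in epoch milliseconds and emits one `(hour, 1)` pair per reading.

```python
MS_PER_HOUR = 3_600_000

def ts_to_hour(line):
    """Assumed stand-in for tsToHourFunction: 'ts,buildingNo,t' -> [(hour, 1)]."""
    ts_ms, _building_no, _temp = line.split(",")
    hour_of_day = (int(ts_ms) // MS_PER_HOUR) % 24
    return [(hour_of_day, 1)]

def flat_map(func, items):
    """Apply func to every item and flatten the resulting lists into one list."""
    return [out for item in items for out in func(item)]

def reduce_by_key(func, pairs):
    """Combine all values that share a key using the binary function func."""
    acc = {}
    for key, value in pairs:
        acc[key] = func(acc[key], value) if key in acc else value
    return acc

# Made-up input lines standing in for the HDFS file
lines = ["0,1,21.5", "1800000,1,22.0", "3600000,2,23.5"]
counts = reduce_by_key(lambda a, b: a + b, flat_map(ts_to_hour, lines))
```

Here `counts` maps each hour of the day to the number of readings seen in that hour; Spark performs the same two steps, but partitioned across many machines.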
SQL-like Queries: Hive
Apache Hive provides a SQL-like data processing language
Since many people understand SQL, Hive made large-scale (Big Data) processing accessible to many
Expressive, short, and sweet.
Defines core operations that cover 90% of problems
Lets experts dig in when they like! (via User Defined Functions)
Hourly Temperature Average
Hive compiles the SQL-like query into a set of MapReduce jobs running on Hadoop or Spark (in WSO2 BAM from the 2015 Q2 release)
insert overwrite table TemperatureHistory
select getHour(ts) as hour, avg(t) as avgT, buildingNo
from TemperatureStream
group by buildingNo, getHour(ts);
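The hourly average query can be tried on a laptop with an in-memory SQLite database standing in for Hive. This is only a sketch of the same group-by logic: SQLite is not Hive, the sample rows are made up, `ts` is assumed to be epoch milliseconds, and the integer arithmetic `(ts / 3600000) % 24` replaces the `getHour` user-defined function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TemperatureStream (ts INTEGER, buildingNo INTEGER, t REAL)"
)

# Made-up readings: (timestamp in ms, building number, temperature)
rows = [
    (0, 1, 20.0),
    (1_800_000, 1, 22.0),   # same hour as the first reading
    (3_600_000, 1, 25.0),   # next hour
    (600_000, 2, 18.0),     # a different building
]
conn.executemany("INSERT INTO TemperatureStream VALUES (?, ?, ?)", rows)

# Average temperature per building per hour of day
query = """
SELECT (ts / 3600000) % 24 AS hour, AVG(t) AS avgT, buildingNo
FROM TemperatureStream
GROUP BY buildingNo, (ts / 3600000) % 24
ORDER BY buildingNo, hour
"""
hourly_avg = conn.execute(query).fetchall()
conn.close()
```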
Complex Event Processing
Operators: Filters
Assume a temperature stream
Here weather:convertFtoC() is a user-defined function. Such functions are used to extend the language.
define stream TemperatureStream(ts long, roomNo int, temp double);
from TemperatureStream[weather:convertFtoC(temp) > 30.0 and roomNo != 2043]
select roomNo, temp
insert into HotRoomsStream;
Usecases:
- Alerts, thresholds (e.g. alarm on high temperature)
- Preprocessing: filtering, transformations (e.g. data cleanup)
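The filter operator can be pictured as a predicate applied to each event as it arrives. The following plain-Python sketch mirrors the hot rooms query above; the `TempEvent` type, the field names, and the sample events are assumptions for illustration, and `convert_f_to_c` plays the role of the user-defined function.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple

class TempEvent(NamedTuple):
    ts: int
    room_no: int
    temp_f: float   # temperature in Fahrenheit, as assumed by the query

def convert_f_to_c(temp_f: float) -> float:
    """Plays the role of the weather:convertFtoC() user-defined function."""
    return (temp_f - 32.0) * 5.0 / 9.0

def hot_rooms(events: Iterable[TempEvent],
              threshold_c: float = 30.0,
              excluded_room: int = 2043) -> Iterator[Tuple[int, float]]:
    """Pass through (room_no, temp) only for events matching the filter."""
    for event in events:
        if convert_f_to_c(event.temp_f) > threshold_c and event.room_no != excluded_room:
            yield (event.room_no, event.temp_f)

# Made-up events: one hot room, one excluded room, one cool room
events = [
    TempEvent(1, 101, 95.0),
    TempEvent(2, 2043, 100.0),
    TempEvent(3, 102, 70.0),
]
hot = list(hot_rooms(events))
```

Because `hot_rooms` is a generator, it handles one event at a time and never needs the whole stream in memory, which is the property a CEP filter relies on.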
Operators: Windows and Aggregation
Supports many window types
- Batch windows, sliding windows, custom windows
Usecases
- Simple counting (e.g. failure count)
- Counting with windows (e.g. failure count every hour)
from TemperatureStream#window.time(1 min)
select roomNo, avg(temp) as avgTemp
insert into HotRoomsStream;
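A sliding time window can be sketched with a queue: each new event is appended, events older than the window length are dropped, and the aggregate is recomputed over what remains. The class below is a simplified illustration for a single room, assuming timestamps in milliseconds that arrive in order; it is not how the CEP engine is implemented.

```python
from collections import deque

class SlidingTimeAverage:
    """Running average of temperatures seen in the last window_ms milliseconds."""

    def __init__(self, window_ms: int = 60_000):
        self.window_ms = window_ms
        self.events = deque()   # (ts, temp) pairs currently inside the window
        self.total = 0.0        # sum of the temperatures in self.events

    def add(self, ts: int, temp: float) -> float:
        """Add an event and return the average over the current window."""
        self.events.append((ts, temp))
        self.total += temp
        # Expire events that fell out of the window ending at ts
        while self.events[0][0] <= ts - self.window_ms:
            _, expired_temp = self.events.popleft()
            self.total -= expired_temp
        return self.total / len(self.events)

window = SlidingTimeAverage(window_ms=60_000)
readings = [(0, 20.0), (30_000, 30.0), (70_000, 40.0)]
averages = [window.add(ts, temp) for ts, temp in readings]
```

By the third reading the first one is more than a minute old, so it no longer contributes to the average.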
Operators: Patterns
Models a followed-by relation: e.g. event A followed by event B
A very powerful tool for tracking and detecting patterns
from every (a1 = TemperatureStream) -> a2 = TemperatureStream[temp > a1.temp + 5] within 1 day
select a2.ts as ts, a2.temp - a1.temp as diff
insert into HotDayAlertStream;
Usecases
- Detecting event sequence patterns
- Tracking
- Detecting trends
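One way to picture the followed-by pattern is as a list of partial matches: every event starts a new partial match, and a later event completes any partial match it exceeds by more than 5 degrees, as long as it arrives within a day. The sketch below is a simplified reading of the query's semantics under those assumptions (ordered timestamps in milliseconds, one match per starting event); `hot_day_alerts` is a hypothetical name.

```python
DAY_MS = 24 * 60 * 60 * 1000

def hot_day_alerts(events, rise=5.0, within_ms=DAY_MS):
    """events: (ts, temp) pairs in timestamp order. Returns (ts, diff) alerts."""
    pending = []   # earlier events (a1) still waiting for a matching follower
    alerts = []
    for ts, temp in events:
        still_pending = []
        for start_ts, start_temp in pending:
            if ts - start_ts > within_ms:
                continue   # this partial match expired
            if temp > start_temp + rise:
                alerts.append((ts, temp - start_temp))   # a1 -> a2 matched
            else:
                still_pending.append((start_ts, start_temp))
        still_pending.append((ts, temp))   # "every": each event can start a match
        pending = still_pending
    return alerts

# Made-up readings; the last one arrives more than a day after its predecessor
readings = [
    (0, 20.0),
    (3_600_000, 22.0),
    (7_200_000, 28.0),
    (DAY_MS + 7_200_001, 40.0),
]
alerts = hot_day_alerts(readings)
```

The third reading completes two partial matches (started by the first and second readings), while the fourth arrives too late to match the third.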
Operators: Joins
Join two data streams based on a condition and windows
Usecases
- Data correlation, detecting missing events, detecting erroneous data
- Joining event streams
from TemperatureStream[temp > 30.0]#window.time(1 min) as T
join RegulatorStream[isOn == false]#window.length(1) as R
on T.roomNo == R.roomNo
select T.roomNo, R.deviceID, 'start' as action
insert into RegulatorActionStream;
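A stream join keeps a small amount of state for each side and matches every arriving event against the other side's state. The sketch below illustrates this for the regulator example, with one deliberate simplification that is an assumption rather than the exact window semantics of the query: it remembers the latest "off" regulator per room instead of a single length-1 window. The class and method names are hypothetical.

```python
from collections import deque

class RegulatorJoin:
    """Emit (room_no, device_id, 'start') when a hot room has its regulator off."""

    def __init__(self, window_ms=60_000):
        self.window_ms = window_ms
        self.hot = deque()         # (ts, room_no) hot readings inside the window
        self.off_regulators = {}   # room_no -> device_id of a regulator that is off

    def _expire(self, now):
        # Drop hot readings that are older than the time window
        while self.hot and self.hot[0][0] <= now - self.window_ms:
            self.hot.popleft()

    def on_temperature(self, ts, room_no, temp):
        self._expire(ts)
        if temp <= 30.0:
            return []              # filtered out, like [temp > 30.0]
        self.hot.append((ts, room_no))
        device = self.off_regulators.get(room_no)
        return [(room_no, device, "start")] if device is not None else []

    def on_regulator(self, ts, room_no, device_id, is_on):
        self._expire(ts)
        if is_on:
            self.off_regulators.pop(room_no, None)
            return []
        self.off_regulators[room_no] = device_id
        return [(r, device_id, "start") for _, r in self.hot if r == room_no]

joiner = RegulatorJoin()
actions = []
actions += joiner.on_regulator(0, 7, "reg-7", False)     # regulator in room 7 is off
actions += joiner.on_temperature(1_000, 7, 35.0)         # room 7 is hot: match
actions += joiner.on_temperature(2_000, 8, 36.0)         # room 8 has no known regulator
actions += joiner.on_regulator(3_000, 7, "reg-7", True)  # regulator switched on
actions += joiner.on_temperature(4_000, 7, 37.0)         # hot, but regulator already on
```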
Operators: Access Data from the Disk
Event tables allow users to map a database to a window and join a data stream with the window
Usecases
- Merge with data in a database, collect, update data conditionally
define table HistTempTable(day long, avgT double);
from TemperatureStream#window.length(1) join HistTempTable
on getDayOfYear(ts) == HistTempTable.day and temp > avgT
select ts, temp
insert into PurchaseUserStream;
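Joining a stream with an event table amounts to a lookup per event. In this sketch a Python dictionary stands in for the database-backed `HistTempTable`, and each reading is kept only if it is hotter than the historical average for its day of the year. The table contents and readings are made up, `ts` is assumed to be epoch milliseconds in UTC, and `day_of_year` is an assumed equivalent of `getDayOfYear`.

```python
from datetime import datetime, timezone

# Stand-in for HistTempTable: day of year -> historical average temperature
hist_temp_table = {1: 18.0, 2: 19.5}

def day_of_year(ts_ms):
    """Assumed equivalent of getDayOfYear(ts) for epoch milliseconds in UTC."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc).timetuple().tm_yday

def above_history(events, table):
    """Yield (ts, temp) for readings hotter than the stored average for that day."""
    for ts, temp in events:
        avg_t = table.get(day_of_year(ts))
        if avg_t is not None and temp > avg_t:
            yield (ts, temp)

DAY_MS = 86_400_000
readings = [
    (0, 20.0),             # day 1: hotter than the 18.0 average
    (DAY_MS, 19.0),        # day 2: cooler than the 19.5 average
    (2 * DAY_MS, 30.0),    # day 3: no history in the table
]
hotter_than_usual = list(above_history(readings, hist_temp_table))
```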
Realtime Analytics Patterns
Simple counting (e.g. failure count)
Counting with windows (e.g. failure count every hour)
Preprocessing: filtering, transformations (e.g. data cleanup)
Alerts, thresholds (e.g. alarm on high temperature)
Data correlation, detecting missing events, detecting erroneous data (e.g. detecting failed sensors)
Joining event streams (e.g. detecting a hit on a soccer ball)
Merging with data in a database, collecting, updating data conditionally
Realtime Analytics Patterns (contd.)
Detecting event sequence patterns (e.g. a small transaction followed by a large transaction)
Tracking: following some related entity's state in space, time etc. (e.g. location of airline baggage, vehicles, tracking wildlife)
Detecting trends: rise, turn, fall, outliers, complex trends like triple bottom etc. (e.g. algorithmic trading, SLA, load balancing)
Learning a model (e.g. predictive maintenance)
Predicting the next value and corrective actions (e.g. automated car)
Predictive Analytics
Build models and use them with WSO2 CEP, BAM and ESB using the upcoming WSO2 Machine Learner product (2015 Q2)
Build models using R, export them as PMML, and use them within WSO2 CEP
Call R scripts from CEP queries
Regression and anomaly detection operators in CEP
Predictive Analytics
WSO2 Machine Learner provides a wizard to explore data and build models
E.g. build a model to predict the temperature for the next 15 minutes
- Trivial option: (historical mean + last 15m mean)/2
- Better model via ARIMA from time series analysis
To know more, take an ML class
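The trivial option is simple enough to write out directly. This sketch averages the long-run historical mean with the mean of the last 15 minutes; the function name and the sample readings are made up for illustration.

```python
from statistics import mean

def predict_next_15m(historical_temps, last_15m_temps):
    """Trivial forecast: (historical mean + last 15m mean) / 2."""
    return (mean(historical_temps) + mean(last_15m_temps)) / 2

# Made-up data: long-run readings for this hour, and the last 15 minutes
historical = [20.0, 22.0, 24.0]
recent = [26.0, 28.0]

forecast = predict_next_15m(historical, recent)
```

The forecast is pulled halfway from the historical norm toward current conditions, which is why a time series model such as ARIMA, which learns how strongly recent values should count, usually does better.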
Communicate: Dashboards
The idea is to give the "overall idea" at a glance (e.g. car dashboard)
Support for personalization: you can build your own dashboard.
Also the entry point for drill-down
How to build?
- Dashboard via Google Gadget and content via HTML5 + JavaScript
- Use WSO2 User Engagement Server to build a dashboard (or a JSP or PHP)
- Use charting libraries like Vega or D3
Communicate: Alerts
Detecting conditions can be done via CEP queries
The key is the "last mile"
- Email
- SMS
- Push notifications to a UI
- Pager
- Trigger a physical alarm
How?
- Select the Email sender "Output Adaptor" from CEP, or send from CEP to ESB; the ESB has lots of connectors
Communicate: APIs
With mobile apps, most data is exposed and shared as APIs (REST/JSON) to end users.
Following are some challenges
- Security and permissions
- API discovery
- Billing, throttling, quotas
- SLA enforcement
How?
- Write data to a database from CEP event tables
- Build services via WSO2 Data Services Server
- Expose them as APIs via API Manager
Smart Home
2015 yearly DEBS (Distributed Event Based Systems) Grand Challenge (http://goo.gl/0htxlj)
Smart home electricity data: 2000 sensors, 40 houses, 4 billion events
We posted 400K events/sec, and close to one million events/sec of distributed throughput with 4 nodes.
The WSO2 CEP based solution was one of the four finalists (with Dresden University of Technology, Fraunhofer Institute, and Imperial College London)
The only generic solution to become a finalist
Case Study: Realtime Soccer Analysis
Watch at: https://www.youtube.com/watch?v=nRI6buQ0NOM
Case Study: TFL Traffic Analysis
Built using TFL (Transport for London) open data feeds.
http://goo.gl/04tX6k
http://goo.gl/9xNiCm
WSO2 Big Data Analytics Platform
Conclusion
Goal: build a platform using which others can build their analytics systems
- End to end, starts from humans and ends with humans
The whole platform is open source under the Apache License
What can you do with the platform?
- Solve hard problems, build great apps with the platform
- Add and contribute extensions to the platform (e.g. GSoC http://goo.gl/QNFP6Y)
- Fix problems (patches)
Find us at the [email protected] list or Stack Overflow (tag wso2)
Questions?