kirk borne george mason university, fairfax, va ● @kirkdborne dynamic events in massive data...

31
Kirk Borne George Mason University, Fairfax, VA www.kirkborne.net @KirkDBorne Dynamic Events in Massive Data Streams, from Astrophysics to Marketing Automation

Upload: simon-hunter

Post on 19-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Kirk BorneGeorge Mason University, Fairfax, VA ● www.kirkborne.net @KirkDBorne

Dynamic Events in Massive Data Streams, from Astrophysics to

Marketing Automation

1) Correlation Discovery Finding patterns and dependencies, which reveal

new natural laws or new scientific principles

2) Novelty Discovery Finding new, rare, one-in-a-[million / billion / trillion]

objects and events

3) Class Discovery Finding new classes of objects and behaviors Learning the rules that constrain class boundaries

4) Association Discovery Finding unusual (improbable) co-occurring

associations

Data Science = 4 Types of Discovery(Learning from Data)

This graph says it all …3 Steps to Discovery – Learning from Data

Graphic provided by Professor S. G. Djorgovski, Caltech

• Unsupervised Learning : Cluster Analysis – partition the data items into clusters, without bias, ignoring any initially assigned categories = Class Discovery !

• Supervised Learning : Classification – for each new data item, assign it to a known class (i.e., a known category or cluster) = Predictive Power Discovery !

• Semi-supervised Learning : Outlier/Novelty Detection – identify data items that are outside the bounds of the known classes of behavior = Surprise Discovery ! 3

Goal of Data Science:Take Data to Information to Knowledge

to Insights (and Action!)

From Sensors (Measurement & Data Collection)…

… to Sentinels (Monitoring & Alerts) …

… to Sense-making (Data Science) …

… to Cents-making (ROI)

… Productizing your Big Data

4

EmpoweringData Scientists at Booz-Allen

• The Field Guide to Data Sciencehttp://www.boozallen.com/insights/2013/11/data-science-field-guide

• National Data Science Bowlhttp://www.datasciencebowl.com/

• Explore Data Sciencehttps://exploredatascience.com/

5

The BIG Big Data Challenge:Identifying, characterizing, & responding to

millions of events in real-time streaming data• Astronomy example:

Real-Time Event Mining: deciding which events (out of millions) need follow-up investigation & response (triage for maximum scientific return)

• Web Analytics example: Web Behavior Modeling and Automated System Response (from

online interactions & web browse patterns, personalization, user segmentation, 1-to-1 marketing, advanced analytics discovery,…)

• Many other examples: Health alerts (from EHRs and national health systems) Tsunami alerts (from geo sensors everywhere) Cybersecurity alerts (from network logs) Social event alerts or early warnings (from social media) Preventive Fraud alerts (from financial applications) Predictive Maintenance alerts (from machine / engine sensors) 6

The MIPS modelfor Dynamic Data-Driven Application Systems (DDDAS)• MIPS =

– Measurement – Inference – Prediction – Steering• This applies to any Network of Sensors:

– Web user interactions & actions (web analytics data), Cyber network usage logs, Social network sentiment, Machine logs (of any kind), Manufacturing sensors, Health & Epidemic monitoring systems, Financial transactions, National Security, Utilities and Energy, Remote Sensing, Tsunami warnings, Weather/Climate events, Astronomical sky events, …

• Machine Learning enables the “IP” part of MIPS:– Autonomous (or semi-autonomous) Classification– Intelligent Data Understanding– Rule-based– Model-based– Neural Networks– Markov Models– Bayes Inference Engines

Alert & Response systems:•LSST 10million events•“Mars Rover” anywhere•Automation of any data-driven operational system

7

http://dddas.org

The MIPS modelfor Dynamic Data-Driven Application Systems (DDDAS)• MIPS =

– Measurement – Inference – Prediction – Steering• This applies to any Network of Sensors:

– Web user interactions & actions (web analytics data), Cyber network usage logs, Social network sentiment, Machine logs (of any kind), Manufacturing sensors, Health & Epidemic monitoring systems, Financial transactions, National Security, Utilities and Energy, Remote Sensing, Tsunami warnings, Weather/Climate events, Astronomical sky events, …

• Machine Learning enables the “IP” part of MIPS:– Autonomous (or semi-autonomous) Classification– Intelligent Data Understanding– Rule-based– Model-based– Neural Networks– Markov Models– Bayes Inference Engines

Alert & Response systems:•LSST 10million events•“Mars Rover” anywhere•Automation of any data-driven operational system

8

http://dddas.org

From Sensors to Sentinels to Sense

Creating Value from Big Data : The 3 D2D’s

o Knowledge Discovery– Data-to-Discovery (D2D)

o Data-driven Decision Support– Data-to-Decisions (D2D)

o Big ROI (Return On Innovation) !!!– Data-to-Dollars (D2D) or

Data-to-Dividends

9

Creating Value from Big Data

Knowledge Discovery– LSST (discovery)

o Data-driven Decision Support– Mars Rovers (decisions)

o Big ROI (Return On Innovation) !!!– Innovative Applications of Big Data everywhere

10

Astronomy Big Data Example

The LSST (Large Synoptic Survey Telescope)11

LSST = Large

Synoptic Survey

Telescopehttp://www.lsst.org/

8.4-meter diameterprimary mirror =10 square degrees!

Hello !

(mirror funded by private donors)

12

LSST = Large

Synoptic Survey

Telescopehttp://www.lsst.org/

8.4-meter diameterprimary mirror =10 square degrees!

Hello !

(mirror funded by private donors)

Construction began August 2014

(funded by NSF and DOE)

13

LSST = Large

Synoptic Survey

Telescopehttp://www.lsst.org/

8.4-meter diameterprimary mirror =10 square degrees!

Hello !

(mirror funded by private donors)

–100-200 Petabyte image archive–20-40 Petabyte database catalog

14

LSST in time and space:– When? ~2022-2032– Where? Cerro Pachon, Chile

LSST Key Science Drivers: Mapping the Dynamic Universe– Complete inventory of the Solar System (Near-Earth Objects; killer asteroids???)– Nature of Dark Energy (Cosmology; Supernovae at edge of the known Universe)– Optical transients (10 million daily event notifications sent within 60 seconds)– Digital Milky Way (Dark Matter; Locations and velocities of 20 billion stars!)

Architect’s designof LSST Observatory

15

LSST Summaryhttp://www.lsst.org/

• 3-Gigapixel camera• One 6-Gigabyte image every 20 seconds• 30 Terabytes every night for 10 years• Repeat images of the entire night sky every

3 nights: Celestial Cinematography• 100-Petabyte final image data archive

anticipated – all data are public!!!• 20-Petabyte final database catalog

anticipated• Real-Time Event Mining: ~10 million events

per night, every night, for 10 years!– Follow-up observations required to classify these– Which ones should we follow up? …

… Decisions! Decisions! ( = D2D !) 16

LSST Database = massive complexity challengehttp://www.lsst.org/

• 20-Petabyte final database catalog anticipated• ~20 trillion rows (astronomical sources), 200-300 columns• How can we find descriptive and predictive models of

trillions of astrophysical sources? …• … Soft10ware.com’s fast statistical modeling technology

offers an excellent approach:– Non-parametric, multi-model generation– Rapid model evaluation & ranking

17

But the LSST is not the biggest Big Data Astronomy project

being planned …

18

SKA(starting in 2024)

19

SKA = Square Kilometer Arrayjoint project: Australia and South Africa

http://www.ska.gov.au/

20

http://www.extremetech.com/extreme/124561-ibm-to-build-exascale-supercomputer-for-the-worlds-largest-million-antennae-telescope

21

That’s a data processing job for fast streaming real-time in-memory analytics

= MapR (Apache Spark) is excellent choice

http://www.mapr.com/

Creating Value from Big Data

o Knowledge Discovery– LSST (discovery)

Data-driven Decision Support– Mars Rovers (decisions)

o Big ROI (Return On Innovation) !!!– Innovative Applications of Big Data everywhere

22

TheMars Rover

Metaphor

23

Mars Rover: intelligent data-gatherer, mobile data mining agent, and autonomous decision-support system

Rove around the surface of Mars and take samples of rocks (experimental data type: mass spectroscopy = data histogram = feature vector)

Intelligent Data Operations in Action:• Classification (assign rocks to known classes)• Supervised Learning (search for rocks with known compositions)• Unsupervised Learning (discover what types of rocks are present,

without preconceived biases)• Clustering (find the set of unique classes of rocks)• Association Mining (find unusual associations)• Deviation/Outlier Detection (one-of-kind; interesting?) • On-board Intelligent Data Understanding & Decision Support

Systems (Fuzzy Logic & Decision Trees & Cased-Based Reasoning ) = = Science Goal Monitoring :

– “stay here and do more” ; or else “follow trend to most interesting location”

– “send results to Earth immediately” ; or “send results later” 24

Mars Rover = metaphor for any Decision Science Engine(enabling data-to-decisions = D2D!)

http://legacy.samsi.info/200506/astro/presentations/tut1loredo-7.pdf 25

• New knowledge and insights are acquired by mining actionable data from all streaming inputs (sensors!)

• Decisions are based on the new knowledge mined, prior experience, and the “business” decisioning rules embedded within the pipeline (sentinels!)

• “Smart Sensors” act autonomously in real-time, without human intervention, to learn (= make sense!)

Voyager 1 becomes first human-made object to leave solar systemhttp://www.nasa.gov/mission_pages/voyager/

The Movie :http://youtu.be/L4hf8HyP0LI

Artist’s concept of the long journey into the abyss of deep space:http://bit.ly/13ecxov

The Long Tail of the Data tells the story –– entering a new region of space – – the solar particle flux essentially vanishes!http://bit.ly/QKLHM7

Speed: 61,000 km/hrDistance: 19 billion km from Earth

Creating Value from Big Data

o Knowledge Discovery– LSST (discovery)

o Data-driven Decision Support– Mars Rovers (decisions)

Big ROI (Return On Innovation) !!!– Innovative Applications of Big Data everywhere

27

Cognitive Analytics: for Insights, Innovation, & Decision Support– based on massive amounts of multi-channel information

From Devices…… … Intentions…

… Location, weather, and other geographic attributes…

… Demographics…

28© SYNTASA.com 2015

Enter... Advanced Analytics Automation!

• Learning from Data (Data Science):– Outlier / Anomaly / Novelty / Surprise detection

– Clustering (= New Class discovery, Segmentation)

– Correlation & Association discovery

– Classification, Diagnosis, Prediction

• … for D2D:– Data-to-Discoveries– Data-to-Decisions– Data-to-Dividends

(big ROI = Return on Innovation)29

AutomatingAnalytics

as-as-Service

• Based on the SYNTASA.com Marketing Analytics-as-a-ServiceTM (MAaaS)• Your own “Mars Rover in a box”

– Your business rules determine the goals, decision points, alerts, and responses.– Moving beyond historical hindsight and oversight (Descriptive & Diagnostic

Analytics) to new world of insight and foresight (Predictive & Prescriptive), eventually achieving right sight (Cognitive Analytics = the 360 view, enabling the right action, for the right customer, at the right place, at the right time).

• Mining multi-channel big data streams (across your organization)• Target Marketing, Personalization, and Segmentation (“segment of one”)• Decision Automation in a rich content (Big Data) environment

30Based on Marketing Analytics-as-a-ServiceTM (MAaaS) from http://www.syntasa.com/

DomainModeling& EventResponse

SummaryFrom Sensors to Sense(making)

• Learning from Data:• Data Science (on dynamic data from scientific instruments,

web logs, customer interactions, social media, financial systems, machine / engine logs,…)

• Text Analytics (on structured & unstructured reports)• Causal Analysis (from “all data” = the 360o view = big data)

• Achieving the 3 V’s of Productizing Big Data:• Visualization – Explorability (Telling the data’s story)

• Value – Exploitability (Using the correct data)

• Veracity – Provability (Using the data correctly)