step by step – a process for building analytical insights
DESCRIPTION
The Briefing Room with Dr. Kirk Borne and Actian
Live Webcast February 18, 2014
Watch the archive: https://bloorgroup.webex.com/bloorgroup/lsr.php?RCID=5be6631268cccc1e605b12b31b58ee08
Change is everywhere in the world of analytics these days, and not just due to Big Data. The maturation of parallel processing is transforming how data can be loaded, prepared and processed, which means the window of possibility has widened dramatically in terms of what can be done. This is good news for just about everyone, especially the dedicated business analyst, who can now accomplish in hours what used to take days or weeks. Register for this episode of The Briefing Room to hear Data Science luminary Dr. Kirk Borne of George Mason University, as he describes the changing landscape of analytics. He'll be briefed by John Santaferraro of Actian, who will tout his company's analytical platform. While Santaferraro gives his talk, he'll be accompanied by a data analyst who will walk through a demonstration of the various steps for building analytical solutions. Visit InsideAnalysis.com for more information.
TRANSCRIPT
Grab some coffee and enjoy the pre-show banter before the top of the hour!
The Briefing Room
Step By Step – A Process for Building Analytical Insights
Twitter Tag: #briefr
Mission
• Reveal the essential characteristics of enterprise software, good and bad
• Provide a forum for detailed analysis of today’s innovative technologies
• Give vendors a chance to explain their product to savvy analysts
• Allow audience members to pose serious questions... and get answers!
Topics
This Month: BIG DATA
March: CLOUD
April: BIG DATA
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
Big Data
“In God we trust. All others must bring data.”
~W. Edwards Deming, Statistician
Analyst: Kirk Borne
Kirk Borne is a Transdisciplinary Data Scientist and an Astrophysicist. He is Professor of Astrophysics and Computational Science at George Mason University. He has been at Mason since 2003, where he does research, teaches, and advises students in the Data Science program. Previously, he spent nearly 20 years in positions supporting NASA projects, including an assignment as NASA's Data Archive Project Scientist for the Hubble Space Telescope, and as Project Manager in NASA's Space Science Data Operations Office. He has extensive experience in big data and data science and he is on the editorial boards of several scientific research journals and is an officer in several national and international professional societies devoted to data science, data mining, and statistics. He has published over 200 articles (research papers, conference papers, and book chapters), and given over 200 invited talks at conferences and universities worldwide.
@KirkDBorne http://kirkborne.net
Actian
• Actian is a database and software development company
• The Actian Analytics Platform connects to data and Big Data sources to perform actionable and advanced analytics
• The platform comprises Actian DataFlow (formerly Pervasive DataRush), Actian Matrix (formerly ParAccel) and Actian Vector
Guest: John Santaferraro
John Santaferraro is the Vice President of Product Marketing at Actian. Prior to joining Actian, Santaferraro was an independent industry analyst in the business intelligence and analytics market. Before that he developed and executed a vertical market strategy for Hewlett Packard's BI group, focusing on energy, communications, retail, healthcare and financial services; he was also instrumental in helping establish HP’s new BI business group with a combination of solutions, products and consulting. In 2000, John founded a marketing and sales consulting company, Ferraro Consulting, providing business acceleration strategy for technology companies.
Confidential © 2014 Actian Corporation
Supporting the Data Scientist
Accelerating Big Data 2.0
John Santaferraro – VP of Solutions and Product Marketing
Only the Privileged Excel in Big Data Analytics
Data
Value
The “Moneyball” Effect
• Analytics Go Mainstream
  – Major League Baseball: Hire the best team
  – NSA and Big Data: ???????????????
  – Target and Pregnancy: Predicting pregnancies
What is a Data Scientist?
A data scientist “…incorporates varying elements and builds on techniques and theories from many fields, including mathematics, statistics, data engineering, pattern recognition and learning, advanced computing, visualization, uncertainty modeling, data warehousing, and high performance computing with the goal of extracting meaning from data and creating data products.”
Created by Calvin Andrus, depicts a mash-up of disciplines from which Data Science is derived, 13 July 2012 http://en.wikipedia.org/wiki/Data_science
Data Science Challenges
• Less than 20% of data scientists have the data and compute power they need to do their jobs
• The average data scientist spends 70% of their time finding data, manipulating data, and waiting for queries to run
• More than 60% of all data scientists working with Hadoop are still trying to create a business case
What is a Business Scientist?
“A business scientist is an expert in the science of business, sitting between the business analyst and the data scientist, pulling together cross-functional expertise from data science, analytics, business applications, business processes, and business strategy.”
Business Science Skillset
Understand How Analytics Work
Understand Emerging Data Types
Understand Business Operations & Strategy
Learn Quickly
Think Outside the Box
Tell Compelling Stories
The Tools of the Business Scientist
• Libraries of Analytic Functions Run at Extreme Speed
  – Transformational Analytics
  – Statistical Analytics
  – Machine Learning Analytics
  – Clustering Analytics
  – Discovery Analytics
• Visual Framework for Data Discovery, Preparation and Analytics
  – Drag and Drop Interaction
  – Libraries of Data Preparation Operators
  – Libraries of Analytic Operators
  – High-Performance, Parallel Processing on Hadoop (or other file systems)
The Actian Analytics Platform: Accelerating Big Data 2.0™
[Diagram: the Actian Analytics Platform™ (Connect, Analyze, Act) turning Data into Value, with Extreme Agility, Extreme Scale, and Extreme Performance]
• Actian Analytics Accelerators: Accelerate Hadoop, Accelerate Analytics, Accelerate Business Intelligence
• Data: Enterprise Applications, Data Warehouse, Social, Internet of Things, SaaS, WWW, Machine Data, Mobile, NoSQL, Traditional
• Value: World-Class Risk Management, Competitive Advantage, Customer Delight, Disruptive New Business Models
Actian Analytics Platform: The High Performance Exoskeleton for Hadoop
[Diagram: the Actian Analytics Platform™ surrounding Hadoop]
• Connect to Any Data Source
• Manage Data Flows and Deliver Data Services
• Exchange Data and Workloads
• Move Into a High Performance Analytic Engine for Low Latency
• Select From Libraries of Analytics
• Deliver Analytic Services
• Other labels: Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications
Actian Analytics Platform: The High Performance Exoskeleton for Hadoop
[Diagram labels: Actian Analytics™, Actian DataConnect™, Actian DataFlow™, Actian Matrix™, Actian Vector™, Hadoop, On Demand Integration, On Demand Analytic Services, Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications]
• Visual, drag and drop interface for all data management on Hadoop
• High performance data management and analytics natively on HDFS
• SQL access to Hadoop data for low latency analytics
• High speed data transfer across relational and non-relational
Actian Analytics Platform – High Performance Data Management and Analytics Natively on HDFS
[Diagram labels: Hadoop – Leader Node; Optimize; Query Pipelining; CPU Pipelining; Optimized, On-HDFS Processing]
• Visual Framework: Manage the entire analytic process in a visual framework with no coding required.
• Choose from five sets of operators: Connections, Transformation, Data Quality, Analytics, Data Science
• Reuse and share all components from operators to workflows
• Automatically detect resources, plan optimal utilization, and parallelize all workloads on Hadoop
• Use dual pipeline parallelism to accelerate performance 30X
• Run fully optimized processing directly on the Hadoop node via YARN
• Take processing to where the data lives, runs natively on any Hadoop distribution
Actian Analytics Platform – High Performance, Low Latency Analytics on Hadoop Data
[Diagram: a leader node above a grid of nodes marked H. Labels: Leader Node, On-Demand Integration, Analytic Libraries, Optimizer, Orchestration, On-Demand Analytics]
• 700+ in-database analytic functions
• Massively Parallel, Columnar, Compressed, Compiled, Connected
• Node-to-node, bi-directional sharing of analytics & processes with Hadoop nodes
• Serve up high-performance analytic processing for any app
• Connect to any data source at the point of the query
• Manage data flows across the entire analytic process
• 5 levels of optimization: SQL, Planning, Execution, Communications, Memory
Actian Analytics Platform: High Speed Interaction Between Relational and Non-Relational
[Diagram: platform overview with the Node-to-Node Connection highlighted. Labels: Actian Analytics™, Actian DataConnect™, Actian DataFlow™, Actian Matrix™, Hadoop, On Demand Integration, On Demand Analytic Services, Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications]
Actian Analytics Platform: Deep Integration for High Performance SQL Analytics
[Diagram: platform overview with HCatalog and Hive highlighted. Labels: Actian Analytics™, Actian DataConnect™, Actian DataFlow™, Actian Matrix™, Hadoop, On Demand Integration, On Demand Analytic Services, Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications]
Actian Analytics Platform: Deep Integration for High Performance SQL Analytics
[Diagram: platform overview with SQL, Python, and Java highlighted. Labels: Actian Analytics™, Actian DataConnect™, Actian DataFlow™, Actian Matrix™, Hadoop, On Demand Integration, On Demand Analytic Services, Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications]
Actian Analytics Platform: The High Performance Exoskeleton for Hadoop
[Diagram: the Actian Analytics Platform™ surrounding Hadoop. Labels: Connect to Any Data Source; Manage Data Flows and Deliver Data Services; Exchange Data and Workloads; Move Into a High Performance Analytic Engine for Low Latency; Select From Libraries of Analytics; Deliver Analytic Services; Enterprise Data, Machine Data, Social Data, SaaS Data, Data Warehouse, Amazon Redshift, Business Processes, Users, Machines, Applications]
A Traditional Approach to Churn Analysis
• Connect: CRM (Account Info and Demographics); limited historical data pulls
• Analyze: Group, Derive Fields, Logistic Regression; long model turns, limited model inputs, minimum accuracy
• Act: Customer churn prediction; pre-set visualizations; small increase in existing customer revenue
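The group, derive fields, and logistic regression steps named on this slide can be illustrated with a minimal sketch. It is not Actian's implementation: the customer records, the two derived fields (tenure and support calls), and the function names are all hypothetical, and the model is fitted with plain batch gradient descent in standard-library Python.

```python
import math

# Hypothetical per-customer rows after the "group" and "derive fields" steps:
# (tenure_years, support_calls_last_month, churned)
CUSTOMERS = [
    (5.0, 0, 0), (4.0, 1, 0), (3.5, 0, 0), (6.0, 1, 0),
    (0.5, 4, 1), (1.0, 3, 1), (0.8, 5, 1), (1.5, 4, 1),
]


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, learning_rate=0.1, epochs=2000):
    """Fit weights [bias, w_tenure, w_calls] by batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    n = len(rows)
    for _ in range(epochs):
        gradient = [0.0, 0.0, 0.0]
        for tenure, calls, churned in rows:
            x = (1.0, tenure, calls)
            score = sum(w * xi for w, xi in zip(weights, x))
            error = sigmoid(score) - churned
            for j in range(3):
                gradient[j] += error * x[j]
        # Step against the average gradient of the log-loss.
        weights = [w - learning_rate * g / n for w, g in zip(weights, gradient)]
    return weights


def churn_probability(weights, tenure, calls):
    """Predicted probability that a customer with these fields churns."""
    return sigmoid(weights[0] + weights[1] * tenure + weights[2] * calls)


weights = fit_logistic(CUSTOMERS)
```

With only two inputs from a single source, the sketch mirrors the "limited model inputs" point on the slide: the model can only be as informative as the few fields it is given.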
An Enriched Approach to Churn Analysis
• Connect: CRM (Account Info and Demographics); CDR Logs (Customer and Network Call Quality); CDR Logs (Geospatial Dimensions); Device Data (Mobile Device Mgmt); SFDC Event Filter; File Parsers
• Analyze: Join, Group, Derive Fields, Aggregate, Logistic Regression, Geospatial Network Analysis
• Act: Customer churn prediction with targeted customer contact list; significant increase in existing customer revenue; fast network issue alerts; decreased provider fees; improved industry customer satisfaction scores
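The enrichment this slide describes, aggregating parsed CDR logs and joining them to CRM records, can be sketched in a few lines. This is an illustration only: the table contents, field names, and the `dropped_call_rate` feature are hypothetical and not taken from the presentation.

```python
from collections import defaultdict

# Hypothetical CRM table: customer_id -> account info and demographics.
crm = {
    "c1": {"plan": "basic", "tenure_years": 0.5},
    "c2": {"plan": "premium", "tenure_years": 4.0},
}

# Hypothetical parsed CDR log rows: (customer_id, call_seconds, dropped)
cdr_rows = [
    ("c1", 120, True), ("c1", 30, True), ("c1", 300, False),
    ("c2", 600, False), ("c2", 45, False),
    ("c9", 10, True),  # no CRM record, so the inner join below skips it
]


def aggregate_cdr(rows):
    """Group CDR rows by customer and count total and dropped calls."""
    totals = defaultdict(lambda: {"calls": 0, "dropped": 0})
    for customer_id, _seconds, dropped in rows:
        totals[customer_id]["calls"] += 1
        totals[customer_id]["dropped"] += int(dropped)
    return totals


def join_features(crm_table, cdr_totals):
    """Inner-join CRM records with CDR aggregates and derive a call-quality field."""
    features = {}
    for customer_id, account in crm_table.items():
        if customer_id not in cdr_totals:
            continue
        totals = cdr_totals[customer_id]
        features[customer_id] = {
            **account,
            "calls": totals["calls"],
            "dropped_call_rate": totals["dropped"] / totals["calls"],
        }
    return features


features = join_features(crm, aggregate_cdr(cdr_rows))
```

Each joined row carries both account fields and a call-quality field, which is the kind of wider model input the enriched approach feeds into the churn model.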
An Expanding Approach to Churn Analysis
• Connect: CRM (Account Info and Demographics); CDR Logs (Customer and Network Call Quality); CDR Logs (Geospatial Dimensions); Device Data (Mobile Device Mgmt); SFDC Event Filter; Market Data (Competitive Offerings); File Parsers
• Analyze: Join, Group, Derive Fields, Aggregate, Logistic Regression, Geospatial Network Analysis
• Act: Customer churn prediction with targeted customer contact list; significant increase in existing customer revenue; fast network issue alerts; decreased provider fees; improved customer satisfaction scores
Questions
[Diagram: Actian Analytics Platform™ overview with SQL, Python, and Java highlighted]
Thank You
www.actian.com facebook.com/actiancorp @actiancorp
Disclaimer
This document is for informational purposes only and is subject to change at any time without notice. The information in this document is proprietary to Actian and no part of this document may be reproduced, copied, or transmitted in any form or for any purpose without the express prior written permission of Actian. This document is not intended to be binding upon Actian to any particular course of business, pricing, product strategy, and/or development. Actian assumes no responsibility for errors or omissions in this document. Actian shall have no liability for damages of any kind including without limitation direct, special, indirect, or consequential damages that may result from the use of these materials. Actian does not warrant the accuracy or completeness of the information, text, graphics, links, or other items contained within this material. This document is provided without a warranty of any kind, either express or implied, including but not limited to the implied warranties of merchantability, fitness for a particular purpose, or non-infringement.
Perceptions & Questions
Analyst: Kirk Borne
Kirk Borne @KirkDBorne
School of Physics, Astronomy, & Computational Sciences College of Science, George Mason University, Fairfax, VA
Data Science for Everything
Let us start with a Big Data Quiz …
Complete this sentence: Big Data is …
a) the new oil.
b) the new black.
c) the new bacon.
d) sexy.
e) everything, quantified and tracked!
f) All of the above
Definitions of Big Data
From Wikipedia:
• Big Data refers to any collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications.
My suggestion:
• Big Data refers to “Everything, Quantified and Tracked!”
• Examples:
  – Smart Cities
  – Retail Analytics
  – Personalized Healthcare (myDNA)
  – Cybersecurity
  – National Security
  – Big Data Science Projects
  – Social Networks
  – IoT = Internet of Things
  – M2M = Machine-to-Machine
  – … everything!
Rationale for Big Data Science
• If we collect a thorough set of parameters (high-dimensional data) for a complete set of items within our domain of study, then we would have a “perfect” statistical model for that domain.
• In other words, Big Data becomes the model for a domain X = we call this X-informatics.
• Anything we want to know about that domain is specified and encoded within the data.
• The goal of Big Data Science is to find those encodings, patterns, and knowledge nuggets.
• See article by IBM’s James Kobielus: “Big-Data Vision? Whole-population analytics” at http://bit.ly/QB0uYi
Characterizing and Exposing the Big Data Hype: 3 V’s or ?
• If the only distinguishing characteristic was that we have lots of data, we would call it “Lots of Data” (or Tonnabytes!)
• Big Data characteristics: the 3+n V’s =
  1. Volume (lots of data = “Tonnabytes”)
  2. Variety (complexity, curse of dimensionality, many formats)
  3. Velocity (high rate of data and information flow, real-time, incoming!)
  4. Veracity (necessary & sufficient data to test many hypotheses)
  5. Value
  6. Variability
  7. Venue
  8. Vocabulary
The Data Scientist toolkit
• It is a collection of mathematical, computational, scientific, and domain-specific methods, tools, and algorithms to be applied to Big Data for discovery, decision support, and data-to-knowledge transformation: …
  – Statistics
  – Data Mining (Machine Learning) & Analytics (KDD)
  – Data & Information Visualization
  – Semantics (Natural Language Processing, Ontologies)
  – Data-intensive Computing (e.g., Hadoop, Cloud, …)
  – Modeling & Simulation
  – Metadata for Indexing, Search, & Retrieval
  – Advanced Data Management & Data Structures
  – Domain-Specific Data Analysis Tools
The 6 Commandments of Data Science
(Based on “The 5 Fundamental Concepts of Data Science”: http://www.statisticsviews.com/details/feature/5459931/Five-Fundamental-Concepts-of-Data-Science.html)
1. Begin with the end in mind (= goal-based, data-driven decision making, “knowledge discovery by design”)
2. Data Science is Science (= hypothesis testing, and all that)
3. Know thy data (= data profiling, unsupervised exploration)
4. Love thy data (= including ugly data: skewed distributions, outliers, long & fat tails)
5. Overfitting is a sin (= “models should be as simple as possible, but no simpler” ~ A.Einstein)
6. Honor thy data’s first mile and last mile
   (a) The First Mile is the hardest. (ubiquitous heterogeneous data)
   (b) The Last Mile is the hardest. (actionable intelligence)
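Commandments 3 and 4 (know thy data, love thy data) can be illustrated with a small profiling sketch. It is one possible approach, not a method from the talk: the `profile` function, the 1.5 × IQR outlier rule, and the sample bill amounts are assumptions chosen for illustration, using only Python's standard `statistics` module.

```python
import statistics


def profile(values):
    """Summarize a numeric column: center, spread, and IQR-based outliers."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    # Common rule of thumb: flag points more than 1.5 * IQR outside the quartiles.
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "median": median,
        "stdev": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical monthly bill amounts with one extreme value in the tail.
bills = [20, 22, 21, 23, 19, 24, 22, 21, 20, 250]
report = profile(bills)
```

A mean far above the median, as in this sample, is a quick sign of a skewed distribution, and the flagged value is a candidate for closer inspection rather than automatic removal.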
http://www.datagovernance.com/cartoon_17.html
Questions to Actian Corporation:
1. Most things in the world that are labeled “2.0” typically enable some sort of social experience or social networking characteristic. How is ‘Big Data 2.0’ like that, and how is it different?
2. You talk about Unconstrained Analytics. That sounds like “Data Science Unleashed” – is that a reasonable analogy? How so?
3. How important are visual cues and visual analytics in Actian’s Big Data 2.0 design and implementation? And how have you incorporated them?
4. I/O bottlenecks (for data access and movement) are typically the most severe technological constraints in Big Data. How does Actian manage the big constraints imposed by big data inertia?
5. Data Science is truly science insofar as it involves hypothesis generation, experimental design, testing, analysis, and hypothesis refinement – what are some of the unique ways that Actian empowers and enables a data scientist to perform different steps in this process?
6. One solution to the Big Data and Data Scientist talent gap is to put powerful tools into schools and into the hands of students, and/or to provide financial incentives to students (e.g., scholarships). Is Actian planning any university programs like that?
7. Some say that Big Data 3.0 will be based on the semantics, context, and meaning of data – does Actian have goals or a vision in this direction?
8. What do you see as the next evolutionary step in Big Data Science?
Upcoming Topics
www.insideanalysis.com
2014 Editorial Calendar at www.insideanalysis.com/webcasts/the-briefing-room
This Month: BIG DATA
March: CLOUD
April: BIG DATA
THANK YOU for your
ATTENTION!