TRANSCRIPT
Achieving Faster and More Agile BI and Analytics with Virtual Data Processing
David Stodder
Director of Research for Business Intelligence
TDWI
August 27, 2014
Speakers
David Stodder, Research Director, Business Intelligence, TDWI
Tom Traubitz, Senior Director, Product Strategy, SAP
Agenda
• Agility: Focus of BI and DW today
– Eliminating latency, enabling democratization of data/analytics
• Evolution of BI and data warehouse: expansion of technology approaches
– The issue of ETL
• Virtual data processing, virtualization, and federation explained
• Concluding recommendations
Agility: A Major Focus for BI and DW
• Agility: Ability to sense change, adjust, and take advantage of unexpected opportunities
• Speed and flexibility: competitive advantages for most organizations
• TDWI Research: Most firms regard their BI/DW agility as “average”; one in six say “poor” and just one in ten say “excellent”
• Feeling the heat: BI and data warehousing professionals are under pressure to provide data insights and build the business value of BI projects sooner
Data-Driven Goal: Eliminating Latency
Innovating with data to reduce inefficiencies caused to the business by data access and query delays for:
• Customer service and interaction
• Adjusting automated response to customers’ self-directed actions
• Responding to events in markets, supply chains, and processes
• Using information to guide product and service development
• Monitoring and tracking developing patterns and situations
• Delivering fresh data to decision makers
Data for Democratizing BI and Analytics
• Data is the fuel for “competing on analytics” at all levels
– Expectation that all decisions, whether strategic, tactical, or operational, be data-driven
• Executives, managers, and front-line workers have distinct needs: difficult to satisfy with “one size fits all” BI systems
• Linking performance metrics to analytics: users held accountable need diverse data access (historical, live, etc.)
BI/DW Inflexibility: The Data “Cement”
• Access denied: IT can be too slow to meet user demands for new data; it’s too hard to load and access
• Out of sync: Customer-facing and other business processes are less efficient and effective because the right data is not there at the right time
• Seeking better methods: Focus on reducing the “wait and waste” in BI projects
– Agile method use is hot
Innovation: Responding to Agility Demands
From historical reporting to near-real-time updates:
• Trickle feeds and change data capture, not just the batch window
• Replication and bi-directional data synchronization
• Operational data stores; messaging (ESB); complex event (stream) processing
Changing the database and storage orientation:
• Column-store (“columnar”) databases; Hadoop/MapReduce
• In-memory database computing
Data federation, data virtualization, and virtual data processing:
• Alternatives to continuous ETL development and execution
• Alternatives to loading and moving data; ELT
• Agility and choices to satisfy dynamic business analytics needs
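The first technique above can be sketched in miniature: instead of reloading an entire table in a nightly batch, change data capture moves only the rows that changed since the last sync. This is a hedged sketch, not any vendor's implementation; the `orders` table, the timestamp column, and the two in-memory SQLite databases are illustrative stand-ins for real source and warehouse systems.

```python
import sqlite3

# Two in-memory databases stand in for a source system and a warehouse.
source = sqlite3.connect(":memory:")
warehouse = sqlite3.connect(":memory:")
for db in (source, warehouse):
    db.execute(
        "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at INTEGER)")

source.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                   [(1, 10.0, 100), (2, 25.0, 100), (3, 40.0, 100)])

def capture_changes(since):
    """Trickle feed: fetch only rows modified after the last sync point."""
    return source.execute(
        "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
        (since,)).fetchall()

def apply_changes(rows):
    """Upsert the changed rows instead of reloading the whole table."""
    warehouse.executemany("INSERT OR REPLACE INTO orders VALUES (?, ?, ?)", rows)

apply_changes(capture_changes(0))     # initial sync moves all three rows
source.execute("UPDATE orders SET amount = 99.0, updated_at = 200 WHERE id = 2")
apply_changes(capture_changes(100))   # incremental sync moves only one row
```

In a real pipeline the sync point would be persisted and changes would come from a log or trigger rather than a timestamp column, but the latency contrast with a full batch reload is the same.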
Enter Big Data: Changing the Environment
• Raw and untransformed: Users want to explore detailed data
– Late binding based on analytics, not a priori ETL and models for reporting
• Tapping the always-on data flow: online customer behavior, machine data, etc.
– Semi-structured and unstructured
• Living outside the DW: New technologies and methods, e.g., Hadoop, MapReduce, YARN, and more
• Fishing in the data lake
ETL: Often the Heart of the Problem
• Extraction, transformation, and loading (ETL): Mainstay procedures for most BI/DW
• TDWI Research: How difficult is it to adjust or update BI, analytics, and DW systems when changes are made to ETL and data integration processes?
– Very difficult: 17%
– Somewhat difficult: 47%
– Not very difficult: 29%
– Don’t know: 7%
ETL Challenges: Movement & Complexity
• ETL often the default choice when there are multiple (if not numerous) data sources
• ETL deemed necessary when transformation and mapping are complex
• Long development cycles
– Hand-coding, lack of reuse, and dependence on developers to know the nuances of every data source
• ETL routines usually involve moving data; data always in flight
– Could be hundreds of ETL jobs in production, some for reports that are no longer needed
• Special skills needed outside of database development and administration
• Rising cost and complexity of ETL raising interest in alternatives, e.g., ELT and no ETL in the “data lake”
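The ELT alternative mentioned above differs from ETL mainly in where the transformation runs. A minimal sketch, assuming illustrative table and column names, with SQLite standing in for the target system: in ETL the rows are cleaned in flight before loading, while in ELT the raw rows are loaded as-is and the cleanup lives as a view inside the target.

```python
import sqlite3

# Hypothetical raw records extracted from a source system.
extracted = [("2014-08-27", "  WIDGET ", "19.99"),
             ("2014-08-27", "gadget", "5.00")]

db = sqlite3.connect(":memory:")

# --- ETL: transform in flight, then load only the cleaned rows ---
db.execute("CREATE TABLE sales_etl (day TEXT, product TEXT, amount REAL)")
transformed = [(d, p.strip().lower(), float(a)) for d, p, a in extracted]
db.executemany("INSERT INTO sales_etl VALUES (?, ?, ?)", transformed)

# --- ELT: load the raw data as-is, transform later inside the target ---
db.execute("CREATE TABLE sales_raw (day TEXT, product TEXT, amount TEXT)")
db.executemany("INSERT INTO sales_raw VALUES (?, ?, ?)", extracted)
db.execute("""CREATE VIEW sales_elt AS
              SELECT day, lower(trim(product)) AS product,
                     CAST(amount AS REAL) AS amount
              FROM sales_raw""")
```

With ELT the transformation is just a view definition that can be changed without re-running extraction, which is one reason the slide lists it as an alternative for agility.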
Polling Question #1
In which area is your organization feeling the most pain when it comes to ETL?
• Development
• Performance
• Cost
• Governance
• Alignment with business requirements
• ETL is not a problem for us
Virtual Data Processing: Definitions
• Virtualization: Dealing with the already distributed nature of data
• Dominant trend for memory, storage, processing, and networks, and now data
– Goal: make more efficient use of resources
• A software solution that enables access to multiple, distributed systems through a single layer, so that users do not have to write queries to each system individually

“Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.” – R. van der Lans
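Van der Lans's definition can be made concrete with a small sketch: a single access function hides whether a dataset lives in a relational table or a CSV feed. The source names, the data, and the adapter design here are hypothetical illustrations, not a real product API.

```python
import csv
import io
import sqlite3

# One source is a relational table, another is a CSV feed; the data
# consumer should not need to know which is which.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, region TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "east"), (2, "west")])

csv_feed = "id,region\n3,east\n4,north\n"

def read_sql_source():
    # Adapter hiding the SQL storage and access language.
    return [{"id": i, "region": r}
            for i, r in db.execute("SELECT id, region FROM customers")]

def read_csv_source():
    # Adapter hiding the file format and parsing details.
    return [{"id": int(row["id"]), "region": row["region"]}
            for row in csv.DictReader(io.StringIO(csv_feed))]

# The virtualization layer: one entry point, storage details hidden.
SOURCES = {"customers_db": read_sql_source, "customers_feed": read_csv_source}

def fetch(source_name):
    """Consumers call this; location, API, and storage format stay hidden."""
    return SOURCES[source_name]()
```

Either source can later move, change format, or change API without the consumers of `fetch` noticing, which is the abstraction the quoted definition describes.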
Overcoming Silos & Pursuit of Single View
• Federation or virtualization? The terms are often used interchangeably
– Virtualization: less emphasis on conforming sources to a single data model
• Shared goal: moving toward on-demand fetching and data refreshing, no matter where the data resides
– Reducing the overhead of continuous data capture, transformation, and movement
– Long history of distributed queries and materialized views
• Querying as if it were one: combining result sets from multiple, distributed systems into a single view, delivered when requested
• TDWI Research: 19% currently using data federation or virtualization
– 31% plan to; 32% have no plans to use it
– Source: “Achieving Greater Agility with BI,” TDWI Best Practices Report by D. Stodder, Q1 2013
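"Querying as if it were one" can be sketched as follows: partial results are fetched from each system at request time and merged into a single view, with nothing copied into a central store ahead of time. The two in-memory SQLite databases and the `sales` table are illustrative stand-ins for distributed systems.

```python
import sqlite3

# Two separate databases stand in for distributed systems.
east = sqlite3.connect(":memory:")
west = sqlite3.connect(":memory:")
for db in (east, west):
    db.execute("CREATE TABLE sales (product TEXT, amount REAL)")
east.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("widget", 10.0), ("gadget", 4.0)])
west.executemany("INSERT INTO sales VALUES (?, ?)", [("widget", 6.0)])

def federated_total_by_product():
    """On demand, fetch partial aggregates from each system and merge
    the result sets into one view; no data is pre-loaded centrally."""
    totals = {}
    for db in (east, west):
        for product, subtotal in db.execute(
                "SELECT product, SUM(amount) FROM sales GROUP BY product"):
            totals[product] = totals.get(product, 0.0) + subtotal
    return totals
```

A real federation server would also push filters down to each source and handle type mapping, but the merge-at-request-time pattern is the same.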
The “Keystone” – Business Owns the Data
[Diagram: BI reports, analyst tools, developer tools, and portals access a virtual table through SQL or web services; common metadata spans the virtual layer, batch ETL, and the data warehouse.]
• Role-based tools for business analysts and IT developers
• Common metadata lets analysts and IT collaborate in real time
• Empower business analysts to:
– Define entities and directly access and merge data to create virtual views
– Rapidly profile data sources and logic without more processing
– Quickly find data and rules via a business glossary
– Collaborate, test, validate, and share results
– Eliminate “wait and waste” delays
Source: “Leveraging an Agile Platform for Advanced Analytics Using Data Virtualization,” John Poonnen, presented at TDWI Executive Summit, Boston, July 2014
Virtualization: A Fit with In-Memory
The virtual “database” could be held in memory, supplied with data from the distributed data sources; BI and analytics application users would not need to know about the physical location, making data accessible from a “single place.”
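A minimal sketch of that idea, with the source data and table names invented for illustration: rows from distributed sources are pulled into a single in-memory database, and users then query that one "place" with ordinary SQL, unaware of where each row physically originated.

```python
import sqlite3

# Plain Python lists stand in for remote, distributed sources.
source_a = [("widget", 12.5)]
source_b = [("gadget", 3.5), ("widget", 7.5)]

# The virtual "database" lives in memory and is filled from the sources.
mem = sqlite3.connect(":memory:")
mem.execute("CREATE TABLE sales (product TEXT, amount REAL)")
for source in (source_a, source_b):
    mem.executemany("INSERT INTO sales VALUES (?, ?)", source)

# Users query one place; the physical origin of each row is hidden.
result = dict(mem.execute(
    "SELECT product, SUM(amount) FROM sales GROUP BY product"))
```

Unlike the request-time federation sketch, here the data is staged into memory first, trading freshness for query speed, which is the fit with in-memory analytics the slide describes.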
Using In-Memory to Free Up Analytics
• Reduced pre-processing: Less need to build cubes, aggregations, and other designs for I/O constraints
• Random access to data: Getting beyond the limits of highly structured inquiry; raw data
• Low-latency feeds: Enabling real-time and stream analysis
• Volume and scale: Exploit the potential of massively parallel grid computing platforms
Virtualization and In-Memory: Part of an Emerging Logical or “Hybrid” Architecture
• Addressing the need for a combination of internal and external data
– Not just what’s in the EDW or just what’s in Hadoop; reaching out to all sources
• Virtualization critical to delivering and integrating views of multiple kinds of data
• Cloud, SaaS, or on premises; bringing data sets into memory for faster access
Credit: Enterprise Management Associates
Credit: www.paddyiyer.com
Gartner Logical Data Warehouse: Virtual, Physical, and Distributed Process
Source: “Leveraging an Agile Platform for Advanced Analytics Using Data Virtualization,” John Poonnen, presented at TDWI Executive Summit, Boston, July 2014
Virtualization: Benefits and Challenges
• Abstraction layer benefits: Users see one integrated data set regardless of APIs, access language, location, or storage structure
– Choices of SQL-based software for logically unified access, querying, analytics, and reporting
– Federated servers, enterprise service bus…multiple ways of enabling access depending on users’ needs and the firm’s technology strategy
• Performance: Will response time be adequate? Can it be tuned and managed?
– What about unexpected, ad hoc queries?
– How dependent is the virtual layer on the performance and availability of sources?
• Comprehension: Can users understand what they are receiving?
– If no conforming model is used, is the data delivered with enough context for users?
Additional Benefits and Challenges
• Reducing the ETL workload: ETL is not going away, but can be employed only when absolutely needed
– Expanding the toolset and architecture
– Less data movement can equal reduced business latency
• Speed of access: Some use virtualization to achieve real-time data access
– In-memory for “speed of thought” analytics
• Agile development and user “iteration” with data: Users can access data to see what they need before committing to transformation
– Fits with agile method implementations
• Data storage: Better ability to manage to demand and business requirements rather than potential peak use at all times
• Governance challenges
Conclusion: Aim for User Satisfaction
• Evaluate virtualization to give users access sooner: At the very least, they can ask questions and see what the data looks like
– Set user expectations for data quality and cleansing
• Govern what users see: Virtualization can help organizations protect sensitive data that should not be moved from sources
– Controls can avoid “shadow” data store problems
• Align data access with business objectives: Virtualization offers an opportunity to support better user-IT collaboration
– Toward providing data services, not IT “order takers”
– Align real-time data access with true business needs
• Employ virtualization to reduce the ETL burden: Cost, quality, and reuse challenges can be eased by employing virtualization
Thank You!
David Stodder
Director of Research for Business Intelligence
TDWI (www.tdwi.org)
(415) 859-9933
How will you turn new signals into business value?
Brand sentiment, 360° customer view, product recommendation, propensity to churn, real-time demand/supply forecast, predictive maintenance, fraud detection, network optimization, insider threats, risk mitigation, real-time asset tracking, personalized care, smart vending, smart equipment, smart cities
Simplified, Accelerated, Predictive
Traditional: OLTP and OLAP are separate. Transactions and streams from multiple data sources pass through many separate ETL processes and a staging DB into the EDW, so the data available at 10:00 AM can be up to 48 hours old.
With OLTP + OLAP together in SAP HANA, the data available at 10:00 AM is current, and multiple data sources are available with live access via Smart Data Access.
Reinventing the Data Warehouse with an In-Memory Data Fabric
[Diagram: a data fabric layer with tightly integrated orchestration for management, monitoring, and control. Business applications get results at the speed of memory via SQL or SAP River. Smart Data Access (SDA) links the in-memory platform to petabytes of structured data in column storage, MapReduce/Hive, stream processing of real-time events and machine-generated data, operational RDBMSs, and other sources; ETL and replication provide real-time sync with the load-source databases.]
World’s Largest Data Warehouse – NEW Guinness World Record
http://www.guinnessworldrecords.com/world-records/5000/largest-data-warehouse
Audited record: 12.1 petabytes
Tested configuration:
• 22x HP ProLiant DL580 G7 (4x Intel Xeon E7-4870 @ 2.40 GHz, 1 TB RAM each)
• 20x NetApp E5460 storage arrays (60/120 x 3 TB 7.2K rpm HDDs, 4x Fibre Channel connections)
• SAP IQ 16 (20 nodes)
• SAP HANA (5 nodes)
• BMMsoft Federated EDMT 9 with UCM
• Red Hat Enterprise Linux 6.4 x86-64
Real-time Applications on Business + Context Data Across Data Domains
[Diagram: real-time applications and interactive analysis run on SAP HANA, which provides in-memory processing and persistence with geospatial, predictive, planning/rules, and text/NLP engines. SAP HANA smart data access connects to HDFS / any Hadoop distribution (Spark, Spark Streaming, GraphX, MLlib, Shark, Tachyon) holding streams and columnar data with distributed file persistence. Data sources span SCM, ERP, CRM, text, geospatial, sensor, social media, and logs; data access languages include SQL, MDX, JavaScript, .NET, Java, Scala, Python, NodeJS, and others.]
Real-time insights mean real-time results
• 100% accuracy in early signal detection
• 216x faster DNA results: from 2 days to 20 minutes
• $1.1M in increased revenue with a 1% increase in retention rate
• €500K working capital reduction in 1 week
• 3.2M reclaimed by identifying fraudulent insurance charges
• 50,000 daily sports betting games analyzed in real time
Isn’t It Time To Reinvent Your EDW Strategy?
SIMPLIFY | PREDICT | ACCELERATE
SAP In-Memory Data Fabric: A Complete EDW Architecture for All Your Knowledge Workers
Contact Information
If you have further questions or comments:
David Stodder, TDWI: [email protected]
Tom Traubitz, SAP: [email protected]
Learn More in San Diego!
TDWI World Conference: Managing Agile BI for the Enterprise
San Diego, CA | September 21-26, 2014
Early registration discount until 8/22; use priority code SD1
http://www.tdwi.org/SD2014

TDWI Executive Forum: Strategies for Master Data, Quality, and Governance
San Diego, CA | September 22-23, 2014
Early registration discount until 8/22; use priority code EXEC9
http://www.tdwi.org/SD2014