Achieving Faster and More Agile BI and Analytics with Virtual Data Processing

David Stodder, Director of Research for Business Intelligence, TDWI

August 27, 2014



Sponsor

Speakers

David Stodder, Research Director, Business Intelligence, TDWI

Tom Traubitz, Senior Director, Product Strategy, SAP

Agenda

• Agility: Focus of BI and DW today

– Eliminating latency, enabling democratization of data/analytics

• Evolution of BI and data warehouse: expansion of technology approaches

– The issue of ETL

• Virtual data processing, virtualization, and federation explained

• Concluding recommendations

Agility: A Major Focus for BI and DW

• Agility: Ability to sense change, adjust, and take advantage of unexpected opportunities

• Speed and flexibility: competitive advantages for most organizations

• TDWI Research: Most firms regard their BI/DW agility as “average”; 1 in 6 say “poor” and just 1 in 10 say “excellent”

• Feeling the heat: BI and data warehousing professionals are under pressure to provide data insights and build the business value of BI projects sooner

Data-Driven Goal: Eliminating Latency

Innovating with data to reduce the inefficiencies that data access and query delays cause the business in:

• Customer service and interaction

• Adjusting automated response to customers’ self-directed actions

• Responding to events in markets, supply chains, processes

• Using information to guide product and service development

• Monitoring and tracking developing patterns and situations

• Delivering fresh data to decision makers

Data for Democratizing BI and Analytics

• Data is the fuel for “competing on analytics” at all levels

– Expectation that all decisions, strategic, tactical, and operational, be data-driven

• Executives, managers, and front-line workers have distinct needs: difficult to satisfy with “one size fits all” BI systems

• Linking performance metrics to analytics: users held accountable need diverse data access (historical, live, etc.)

BI/DW Inflexibility: The Data “Cement”

• Access denied: IT can be too slow to meet user demands for new data; it’s too hard to load and access

• Out of sync: Customer-facing and other business processes are less efficient and effective because the right data is not there at the right time

• Seeking better methods: Focus on reducing the “wait and waste” in BI projects

– Agile method use is hot

BI & Data Warehouse: Traditional View

Source: www.h2kinfosys.com

Innovation: Responding to Agility Demands

From historical reporting to near real-time updates:

• Trickle feeds and change data capture; not just batch window

• Replication and bi-directional data synchronization

• Operational data stores; messaging (ESB); complex event (stream) processing

Changing the database and storage orientation:

• Column-store (“columnar”) databases; Hadoop/MapReduce

• In-memory database computing

Data federation, data virtualization, virtual data processing:

• Alternatives to continuous ETL development and execution

• Alternatives to loading and moving data; ELT

• Agility, choices to satisfy dynamic business analytics needs
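
As a rough illustration of the "trickle feeds and change data capture" item above, the sketch below applies a small stream of change events to a target table instead of reloading the table in a batch window. The event format (`op`, key, row) and the function name `apply_changes` are assumptions made for this example; they do not come from the presentation or from any specific product.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical change event: (operation, primary key, new row or None).
ChangeEvent = Tuple[str, int, Optional[dict]]


def apply_changes(target: Dict[int, dict], events: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply insert/update/delete events to an in-memory copy of a table."""
    for op, key, row in events:
        if op in ("insert", "update"):
            target[key] = dict(row)  # upsert the latest version of the row
        elif op == "delete":
            target.pop(key, None)    # ignore deletes for unknown keys
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Target table as last loaded by a batch job (made-up sample data).
orders = {1: {"status": "open"}, 2: {"status": "open"}}

# Changes captured at the source since that load.
captured = [
    ("update", 1, {"status": "shipped"}),
    ("insert", 3, {"status": "open"}),
    ("delete", 2, None),
]

apply_changes(orders, captured)
print(orders)
```

Only the three changed rows are touched, which is the property that lets a trickle feed keep a target closer to current than a periodic full reload.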

Enter Big Data: Changing the Environment

• Raw & untransformed: Users want to explore detailed data

– Late binding based on analytics, not a priori ETL and models for reporting

• Tapping the always-on data flow: online customer behavior, machine data, etc.

– Semi- and unstructured

• Living outside the DW: New technologies and methods, e.g., Hadoop, MapReduce, YARN, and more

• Fishing in the data lake
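
The "late binding" idea on this slide can be sketched as schema on read: raw records are stored untransformed, and structure is applied only when a question is asked. The sample events and the function names below are hypothetical and serve only to illustrate the contrast with a priori ETL.

```python
import json

# Raw, untransformed events as they might land in a data lake (made-up sample data).
RAW_EVENTS = """\
{"user": "a1", "action": "view", "item": "tv"}
{"user": "b2", "action": "buy", "item": "tv", "price": 499.0}
{"user": "a1", "action": "buy", "item": "cable", "price": 9.5}
"""


def read_events(raw: str):
    """Parse each line only when a query asks for it (schema on read)."""
    for line in raw.splitlines():
        if line.strip():
            yield json.loads(line)


def revenue_by_item(raw: str) -> dict:
    """Bind structure at query time: only 'buy' events carry a price."""
    totals = {}
    for event in read_events(raw):
        if event.get("action") == "buy":
            item = event["item"]
            totals[item] = totals.get(item, 0.0) + event.get("price", 0.0)
    return totals


result = revenue_by_item(RAW_EVENTS)
print(result)
```

A different question (for example, views per user) would reuse `read_events` with a different binding, without any change to how the raw data is stored.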

ETL: Often the Heart of the Problem

• Extract, transform, and load (ETL): Mainstay procedures for most BI/DW

• TDWI Research: How difficult is it to adjust or update BI, analytics, and DW systems when changes are made to ETL and data integration processes?

– Very difficult: 17%

– Somewhat difficult: 47%

– Not very difficult: 29%

– Don’t know: 7%

ETL Challenges: Movement & Complexity

• ETL often the default choice when there are multiple (if not numerous) data sources

• ETL deemed necessary when transformation and mapping are complex

• Long development cycles

– Hand-coding, lack of reuse, dependence on developers to know the nuances of every data source

• ETL routines usually involve moving data; data always in flight

– Could be 100s of ETL jobs in production, some for reports that are no longer needed

• Special skills needed outside of database development and admin

• Rising cost and complexity of ETL raising interest in alternatives, e.g., ELT and no ETL in the “data lake”

Polling Question #1

In which area is your organization feeling the most pain when it comes to ETL?

• Development

• Performance

• Cost

• Governance

• Alignment with business requirements

• ETL is not a problem for us

Virtual Data Processing: Definitions

• Virtualization: Dealing with the already distributed nature of data

• Dominant trend: Memory, storage, processing, networks, and now data

– Goal to make more efficient use of resources

• Software solution that enables access to multiple, distributed systems through a single layer so that users do not have to write queries to each system individually

“Data virtualization is the process of offering data consumers a data access interface that hides the technical aspects of stored data, such as location, storage structure, API, access language, and storage technology.” – R. Van der Lans
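
A minimal sketch of the "single layer" idea in these definitions follows. The class `VirtualLayer` and its `register`/`rows` methods are invented for illustration and are not the API of any virtualization product; an in-memory SQLite table and a CSV string stand in for two differently stored sources.

```python
import csv
import io
import sqlite3


class VirtualLayer:
    """Hypothetical access layer: callers ask for a dataset by name only."""

    def __init__(self):
        self._sources = {}

    def register(self, name, fetch):
        # `fetch` is any zero-argument callable returning a list of dict rows.
        self._sources[name] = fetch

    def rows(self, name):
        return self._sources[name]()


# Source 1: a relational table (in-memory SQLite stands in for a remote database).
db = sqlite3.connect(":memory:")
db.row_factory = sqlite3.Row
db.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
db.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])


def fetch_customers():
    return [dict(r) for r in db.execute("SELECT id, name FROM customers ORDER BY id")]


# Source 2: a CSV extract (an in-memory string stands in for a file).
CSV_TEXT = "id,region\n1,West\n2,East\n"


def fetch_regions():
    return [{"id": int(r["id"]), "region": r["region"]}
            for r in csv.DictReader(io.StringIO(CSV_TEXT))]


layer = VirtualLayer()
layer.register("customers", fetch_customers)
layer.register("regions", fetch_regions)

# The caller sees the same row shape regardless of where the data lives.
customers = layer.rows("customers")
regions = layer.rows("regions")
print(customers, regions)
```

The consumer code never mentions SQL, CSV, or a location, which is the hiding of "location, storage structure, API, access language" that the quotation describes.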

Overcoming Silos & Pursuit of Single View

• Federation or virtualization? Terms often used interchangeably

– Virtualization: less emphasis on conforming sources to a single data model

• Shared goal: moving toward on-demand fetching and data refreshing, no matter where it resides

– Reducing overhead of continuous data capture, transformation, and movement

– Long history of distributed queries and materialized views

• Querying as if it were one: combining result sets from multiple, distributed systems into a single view, to be delivered when requested

• TDWI Research: 19% currently using data federation or virtualization

– 31% plan to; 32% have no plans to use it

– Source: “Achieving Greater Agility with BI,” TDWI Best Practices Report by D. Stodder, Q1 2013
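
"Querying as if it were one" can be approximated on a small scale with SQLite, which lets one connection attach several databases and join across them. In this sketch the two attached in-memory databases (`crm` and `billing`) are stand-ins for separate systems, and all table and view names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Each ATTACH creates a separate database; here they stand in for two systems.
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")

conn.execute("CREATE TABLE crm.customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# A TEMP view may reference tables in any attached database.
conn.execute("""
    CREATE TEMP VIEW customer_revenue AS
    SELECT c.name AS name, SUM(i.amount) AS total
    FROM crm.customers AS c
    JOIN billing.invoices AS i ON i.customer_id = c.id
    GROUP BY c.id, c.name
""")

QUERY = "SELECT name, total FROM customer_revenue ORDER BY name"
rows = conn.execute(QUERY).fetchall()
print(rows)

# The view is evaluated on request, so new source data appears without a reload.
conn.execute("INSERT INTO billing.invoices VALUES (2, 25.0)")
refreshed = conn.execute(QUERY).fetchall()
print(refreshed)
```

Nothing is copied into a warehouse table here: the view is computed when requested, which is the on-demand fetching the slide contrasts with continuous capture and movement. A production federation layer would also have to handle remote connections, query pushdown, and source outages, none of which this sketch attempts.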


The “Keystone” – Business Owns the Data

• Role-based tools for Business Analysts & IT Developers

• Common metadata lets Analysts & IT collaborate in real time

• Empower business analysts to:

– Define entities & directly access & merge data to create virtual views

– Rapidly profile data sources & logic without more processing

– Quickly find data & rules via business glossary

– Collaborate, test, validate & share results

– Eliminate wait & waste delays

[Diagram labels: BI Report; Analyst Tool; Developer Tool; Portal; SQL or Web Service; Virtual Table; Common Metadata; Data Warehouse; Batch ETL]

Source: “Leveraging an Agile Platform for Advanced Analytics Using Data Virtualization,” John Poonnen, pres. at TDWI Exec Summit, Boston, July 2014

Virtualization: A Fit with In-Memory

The virtual “database” could be in memory, supplied with data from the distributed data sources. BI and analytics application users would not need to know the data’s physical location, because the data would be accessible from a “single place.”

Using In-Memory to Free Up Analytics

• Reduced pre-processing: Less need to build cubes, aggregations, and other designs for I/O constraints

• Random access to data: getting beyond limits of highly structured inquiry; raw data

• Low latency feeds: Enabling real time & stream analysis

• Volume and scale: Exploit potential of massively parallel grid computing platforms
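
To make the "reduced pre-processing" point concrete, the sketch below computes aggregates directly from detail rows held in memory, for whatever dimensions a user asks about, rather than from a cube built in advance. The sample rows and the `aggregate` helper are hypothetical; real in-memory platforms use columnar storage and parallel execution that this toy example does not model.

```python
from collections import defaultdict

# Detail rows held in memory (made-up sample data).
SALES = [
    {"region": "West", "product": "tv", "amount": 500.0},
    {"region": "West", "product": "cable", "amount": 10.0},
    {"region": "East", "product": "tv", "amount": 450.0},
    {"region": "East", "product": "tv", "amount": 550.0},
]


def aggregate(rows, *dimensions):
    """Sum `amount` by any combination of dimensions, computed on request."""
    totals = defaultdict(float)
    for row in rows:
        key = tuple(row[d] for d in dimensions)
        totals[key] += row["amount"]
    return dict(totals)


by_region = aggregate(SALES, "region")
by_region_product = aggregate(SALES, "region", "product")
print(by_region)
print(by_region_product)
```

Because no grouping is fixed ahead of time, an unanticipated question only needs a new call, not a new cube design.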


Virtualization and In-Memory: Part of Emerging Logical or “Hybrid” Architecture

• Addressing the need for combination of internal and external data

– Not just what’s in the EDW or just what’s in Hadoop; reaching out to all sources

• Virtualization critical to delivering & integrating views of multiple kinds of data

• Cloud, SaaS, or on premise; bringing data sets into memory for faster access

Credit: Enterprise Management Associates

Credit: www.paddyiyer.com

Gartner Logical Data Warehouse

Virtual, Physical, and Distributed Process


Source: “Leveraging an Agile Platform for Advanced Analytics Using Data Virtualization,” John Poonnen, pres. at TDWI Exec Summit, Boston, July 2014

Virtualization: Benefits and Challenges

• Abstraction layer benefits: Users see one integrated data set regardless of APIs, access language, location, or storage structure

– Choices of SQL-based software for logically unified access, querying, analytics, and reporting

– Federated servers, enterprise service bus…multiple ways of enabling access depending on users’ needs and firm’s technology strategy

• Performance: Will response time be adequate? Can it be tuned and managed?

– What about unexpected, ad hoc queries?

– How dependent is the virtual layer on performance and availability of sources?

• Comprehension: Can the user understand what they are receiving?

– If no conforming model is used, is the data delivered with enough context for users?


Additional Benefits and Challenges

• Reducing ETL workload: ETL not going away, but can be employed when absolutely needed

– Expanding the toolset and architecture

– Less data movement can equal reduced business latency

• Speed of access: Some use virtualization to achieve real-time data access

– In-memory for “speed of thought” analytics

• Agile development and user “iteration” with data: Users can access data to see what they need before committing to transformation

– Fits with agile method implementations

• Data storage: Better ability to manage to demand and business requirements rather than potential peak use at all times

• Governance challenges

Conclusion: Aim for User Satisfaction

• Evaluate virtualization to give users access sooner: at very least, they can ask questions and see what data looks like

– Set user expectations for data quality and cleansing

• Govern what users see: Virtualization can help organizations protect sensitive data that should not be moved from sources

– Controls can avoid “shadow” data store problems

• Align data access with business objectives: virtualization offers an opportunity to support better user-IT collaboration

– Toward providing data services, not IT “order takers”

– Align real-time data access with true business needs

• Employ virtualization to reduce ETL burden: cost, quality, and reuse challenges can be eased by employing virtualization
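
The "govern what users see" recommendation can be sketched with a view that exposes only masked or non-sensitive columns while the underlying rows stay in the source. The `patients` table, its columns, and the masking rule are invented for this example. SQLite has no per-user permissions, so a real deployment would also need access grants that restrict users to the view; that part is assumed rather than shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, name TEXT, ssn TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?, ?, ?)",
    [(1, "Ann Lee", "123-45-6789", "Austin"), (2, "Bo Chen", "987-65-4321", "Boston")],
)

# Governed view: the identifier is masked and the name is left out entirely.
conn.execute("""
    CREATE VIEW patients_governed AS
    SELECT id, 'XXX-XX-' || substr(ssn, -4) AS ssn_masked, city
    FROM patients
""")

cursor = conn.execute("SELECT * FROM patients_governed ORDER BY id")
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

Because analysts query the view instead of copying the table, there is no extracted copy of the sensitive columns to turn into a "shadow" data store.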


Thank You!

David Stodder

Director of Research for Business Intelligence

TDWI (www.tdwi.org)

[email protected]

(415) 859-9933

Reinventing the Data Warehouse with Big Data

How will you turn new signals into business value?

• Brand Sentiment

• 360° Customer View

• Product Recommendation

• Propensity to Churn

• Real-time Demand/Supply Forecast

• Predictive Maintenance

• Fraud Detection

• Network Optimization

• Insider Threats

• Risk Mitigation, Real-time

• Asset Tracking

• Personalized Care

• Smart Vending

• Smart Equipment

• Smart Cities

CANCER PATIENTS RECEIVE TREATMENT OPTIMIZED TO THEIR DNA

SAP MAKES BIG DATA REAL

Simplified, Accelerated, Predictive

Traditional: OLTP and OLAP Separate

[Diagram labels: Transactions; Streams; Multiple Data Sources; ETL; Staging DB; EDW; Immediate; 48-hour-old data; 48 Hours; Lots of separate ETL processes!]

OLTP + OLAP in SAP HANA

[Diagram labels: Transactions; Streams; Current Data; 10:00 AM; Multiple Data Sources available with Live Access; Smart Data Access; SAP HANA]

Reinventing the Data Warehouse with an In-Memory Data Fabric

[Diagram labels: Business Applications (SQL or SAP River); Results at the Speed of Memory; In-memory Platform; Data Fabric Layer; Orchestrator; Tightly integrated orchestration for management, monitoring, and control; SDA; ETL & Rep for RT sync; Load Source Databases; Petabytes of Structured Data; Column Storage; Real-time Events/Machine-generated Data; Stream; MapReduce/Hive; Op RDBMS; Other Sources]

World’s Largest Data Warehouse – NEW Guinness World Record

http://www.guinnessworldrecords.com/world-records/5000/largest-data-warehouse

Audited Record: 12.1 Petabytes

Tested Configuration

22x HP ProLiant DL580 G7

• 4x Intel Xeon E7-4870 @ 2.40GHz

• 1TB RAM

20x NetApp Storage Arrays E5460s

• 60/120 x 3TB 7.2Krpm HDD

• 4 x Fibre Channel connections

SAP IQ 16 (20 nodes)

SAP HANA (5 nodes)

BMMsoft Federated EDMT 9 with UCM

Red Hat Enterprise Linux 6.4 x86-64

Real-time Applications on Business + Context Data Across Data Domains

[Diagram labels: Real-time Applications, Interactive Analysis; Data Access: SQL, .NET, Javascript, MDX, Java, Scala, Python, NodeJS, Other; SAP HANA: In-Memory Processing, In-Memory Persistence, Geospatial, Predictive, Planning/Rules, Text/NLP; SAP HANA smart data access; Spark: GraphX, Spark Streaming, MLlib, SHARK; Tachyon; Distributed File Persistence; HDFS / Any Hadoop; Streams; Columnar Data; Data Source: SCM, ERP, CRM, Text, Geospatial, Sensor, Social Media, Logs]

Real-time insights mean real-time results

• 100% accuracy in early signal detection

• 216x faster DNA results from 2 days to 20 minutes

• $1.1M increased revenue with 1% increased retention rate

• 500k Euro working capital reduction with 1 week

• 3.2M reclaimed by identifying fraudulent insurance charges

• 50,000 daily sports betting games analyzed in real-time

Isn’t It Time To Reinvent Your EDW Strategy?

SIMPLIFY PREDICT ACCELERATE

SAP In-Memory Data Fabric: A Complete EDW Architecture for All Your Knowledge Workers

Thank you

Contact information:

Tom Traubitz

[email protected]

Questions?

Contact Information

If you have further questions or comments:

David Stodder, TDWI, [email protected]

Tom Traubitz, SAP, [email protected]

Learn More in San Diego!

TDWI World Conference

Managing Agile BI for the Enterprise

San Diego, CA | September 21-26, 2014

Early registration discount until 8/22; use priority code SD1

http://www.tdwi.org/SD2014

TDWI Executive Forum

Strategies for Master Data, Quality, and Governance

San Diego, CA | September 22-23, 2014

Early registration discount until 8/22; use priority code EXEC9

http://www.tdwi.org/SD2014