Page 1: David Freriks Principal Solution Architect · Big Data Ecosystem –Much More Than Just Hadoop Big Insights & Streams Big Data Appliance HANA Open source Distributed Processing Frameworks

Big Data & QlikView

Democratizing Big Data Analytics

David Freriks – Principal Solution Architect

Page 2:

TDWI – Vancouver Agenda

– What really is Big Data?

– How do we separate hype from reality?

– How does that relate to actually finding useful business information?

– Why is Qlik unique in leading the industry in delivering Big Data solutions?

– Demo

Page 3:

TDWI – Vancouver Agenda

– What really is Big Data?

• Most people think of Hadoop…

– How do we separate hype from reality?

– How does that relate to actually finding useful business information?

– Why is Qlik unique in leading the industry in delivering Big Data solutions?

– Demo

Page 4:

A Brief History of Hadoop

Timeline (2005, 2008, 2011, 2013):

• Cutting joins Yahoo, estimates that a billion-page index will cost $500k plus $30k/month to support

• A 1,400-node Yahoo cluster sorts 500 GB in 59 seconds. Cloudera launches

• Google releases a paper on GFS, which informs the distributed file system of a search platform called Nutch

• Hadoop promoted to top-level Apache project; predictive search index creation time reduced from 12 days to 8 hours

• Yahoo spins its remaining Hadoop folks out into Hortonworks

• Cloudera adds real-time search based on Lucene, also created by Cutting

• 3rd Hadoop World conference attracts 2,300 developers, up from 275 in 2010

Page 5:

Example Apache Hadoop or Next-Gen Components

• HDFS: Hadoop Distributed File System

• MapReduce: Processing framework for writing scalable data applications

• Pig: Procedural language that abstracts lower-level MapReduce

• ZooKeeper: Highly reliable distributed coordination

• Hive: System for querying data on top of HDFS (SQL-like query)

• HBase: Database for random, real-time read/write access

• Mahout: Scalable machine learning libraries

• Spark: In-memory large-scale data processing, up to 100x faster than Hadoop MapReduce

• Shark: SQL engine on top of Spark

• Cassandra: Scalable multi-master database with no single points of failure

And on, and on…
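To make the MapReduce entry in this component list more concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, using word count as the example. It only illustrates the programming model and is not Hadoop's API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by their key."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run in parallel on many nodes and the shuffle would move data across the network; the sketch keeps everything in one process so the three phases are easy to see.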

Page 6:

Big Data: Expanding on 3 fronts

• Data Velocity: Real Time, Near Real Time, Periodic, Batch

• Data Volume: MB, GB, TB, PB

• Data Variety: Table, Database, Web, XML, Audio, Social, Video

Page 7:

What is “Big Data”?

• Big Data is: Nebulous

• Big Data is: Really Big or Not

• Big Data is: Mostly Useless Noise

• Big Data is: Slow

• Big Data is: Difficult

Page 8:

Big Data Ecosystem – Much More Than Just Hadoop

Big Insights & Streams

Big Data Appliance

HANA

Open source Distributed Processing Frameworks

Big Data Analytic Appliances

Massively Parallel Processing Platforms

Big Data Integration

Packaged MapReduce Platforms

Data Visualization, Statistical & In-memory Analytics

Splunk

Page 9:

Some uses of Big Data today

• Telecom

– What: Usage and Location Analysis, Call Detail Records (CDRs), Next Product to Buy (NPTB), Real-time Bandwidth Allocation

– Why: Operational Excellence, Customer Retention, Profitability

• Financial Services

– What: New Account Risk Screens, Fraud Detection, Trading Risk, Real-Time P&L, Portfolio Analysis

– Why: Improve Profit, Minimize Risk

• Utilities

– What: Smart Metering Analysis

– Why: Operational Excellence

• Retail

– What: 360° Customer View, Brand Sentiment Analysis, Up Sell/Cross Sell, Clickstream Analysis

– Why: Increase Revenues, Customer Loyalty, Brand Awareness

• Manufacturing

– What: Supply Chain & Logistics, Assembly Line QA, Proactive Maintenance

– Why: Operational Excellence, Profitability

Source: Gartner “50 Real World Examples of Big Data and Analytics”, 2013

Page 10:

TDWI – Vancouver Agenda

– What really is Big Data?

– How do we separate hype from reality?

– How does that relate to actually finding useful business information?

– Why is Qlik unique in leading the industry in delivering Big Data solutions?

– Demo

Page 11:

Popular “Big Data” Myths

• You need to have Ga-zinga-bytes of data to deploy a Big Data solution

– A typical Cloudera cluster is 15-20 nodes, < 10 TB of data

– Hadoop storage is 3-4x cheaper than an EDW

• Hadoop is all you need

– Hadoop is an enabling technology that provides the foundation for Big Data solutions

– Focus today is on data management

• The RDBMS is dead

– The RDBMS is still critical, but not for high-volume, low-quality analytics

• QlikView can’t handle Big Data

– The reality is that a human can’t handle Big Data

– It’s all about the use case

Page 12:

Big Data vs. Fast Data vs. Right Data

• Big Data is rapidly shifting from how much data you can handle to how quickly you can deliver value

– Volume of data is just one factor, and a less and less critical one

– Context is key and difficult to pinpoint

• Big Data:

– Hadoop is designed to support petabytes and beyond

• Fast Data:

– Teradata, SAP HANA, Netezza, HBase, MongoDB, ParStream, etc.

• Big Data is slow & cheap, Fast Data is neither

• A Big Data solution requires components that address both

– Hadoop is the data system that combines the Fast and Big platforms

– QlikView is the platform that supports both scenarios simultaneously

Page 13:

Where Big Data fits today: The new BI architecture

Diagram group labels: Unstructured/Semi-structured data, Structured data

Diagram source labels: Web data, Docs & text data, Audio/Video data, Machine data, Operational systems

Diagram component labels: Big Data Repository, Data Warehouse???, Data Accelerator???

Page 14:

Big Data comes with big challenges

The Big Data bottleneck

“many organizations lack the skills required to exploit big data”

“most of these skills are in short supply and rare in the market at large”

“data science encompasses hard skills”

Source: Gartner Big Data Hype Cycle Report 2013

Diagram labels: Big Data, Data Scientists, Reports, Business Users

Page 15:

Big Data comes with big challenges

Organizations are challenged in staffing and training

“Organizations have trouble finding qualified professionals to manage big data and providing training to those already on board”

Obstacles to Big Data Analytics

• Staffing: 79%

• Training: 77%

• Real-Time: 67%

• License Cost: 64%

• Integration: 64%

Source: Ventana Research, The Challenge of Big Data Benchmark Research, November 2013

Page 16:

TDWI – Vancouver Agenda

– What really is Big Data?

– How do we separate hype from reality?

– How does that relate to actually finding useful business information?

– Why is Qlik unique in leading the industry in delivering Big Data solutions?

– Demo

Page 17:

Insight Comes from Data, in Context

Diagram labels: Operational systems; Machine data, web data, cloud data; Hadoop cluster; Data warehouse; Google BigQuery

Page 18:

Big Data Business Needs

DATA: Clinical, Claims, Monitoring, others

• Descriptive Analytics: How are we doing?

– Example: How many claims did we pay today?

• Predictive Analytics: What might happen in the future?

– Example: Which of tomorrow’s claims might be requesting an Emergency Room (ER) admission?

• Prescriptive Analytics: Best course of action given objectives, requirements & constraints

– Example: What would be effective steps to reduce the probability of ER admission?

Page 19:

TDWI – Vancouver Agenda

– What really is Big Data?

– How do we separate hype from reality?

– How does that relate to actually finding useful business information?

– Why is Qlik unique in leading the industry in delivering Big Data solutions?

– Demo

Page 20:

Who are we - QlikView

• What Is QlikView?

– QlikView is a Business Discovery platform – user-driven BI supporting the creation and consumption of dynamic apps for analyzing information

– QlikView apps allow non-technical users to explore visual views of information and ask streams of questions, through simple interactions such as clicks and taps

– QlikView’s patented software engine dynamically calculates new views of information, instantly, based on user selections

Page 21:

QlikView - A New Kind of Software Company

• Leader in Business Discovery – user-driven BI

• 28,000+ customers in 100 countries

• 1,500 global partners

• 1,500 employees across 28 offices in 23 countries

• No. 1 fastest-growing enterprise technology company (ZDNet)

• Gartner Magic Quadrant Leader for 3 consecutive years

Broad Base of 28,000 Customers

Page 22:

These are Tools… And this is How BI has been done…

Page 23:

This is a Platform

Page 24:

The Evolution of Business Intelligence

Chart axes: Analytical Quotient vs. Usefulness

Stages: Managed Reporting, Ad-Hoc Reporting, Dashboards / Visualization, OLAP / Analysis, Exploration, Associative / Statistical, Predictive

QlikView’s Sweet Spot

Page 25:

What Makes QlikView Unique?

1) Associative Query Language + Full Search (*not another query tool…)

2) Core Technology: True in-memory, columnar database with built-in visualization, analytics, and ELT in a single product

3) Designed for Heterogeneous & Complex Data (*again, not just another query tool)

4) Application / Mobile Design First (Mobile, Desktop, Tablet… Design once, consume anywhere)

Page 26:

How traditional BI and visualization tools work

• Limited view and access to data

• Forced down linear drill paths

• Need to involve IT to modify

• What-if and on-the-fly analysis is limited

QlikView Natural Analytics™

• Freedom to explore data from any point in analysis in a dynamic, interactive interface

• Answer any question on the fly, in real time

• Easily see connections and disconnects in data

QlikView’s Natural Analytics™ makes data analysis a natural part of every business process – for everyone

Page 27:

The Green, The White and The Gray

Page 28:

The Visualization Bottleneck

Chart axes: Response Time vs. Query Size

Chart labels: Big Data, Tableau, Spotfire, MSTR, Analytics Desktop, Datameer

Page 29:

Connectivity to every Big Data Source

Diagram labels: NoSQL Databases, Real-time, Batch, Hadoop, MPP Warehouse, SAP HANA, BigQuery, Advanced Analytics

Page 30:

The Big Data Value Chain

• Hard Disk Drives (HDD): speed 3,300 s per TB, price $50 per TB

• Solid State Storage (SSD): speed 300-1,000 s per TB, price $500 per TB

• Random Access Memory (RAM): speed 1 s per TB, price $4,500 per TB

• Keep data in memory when the value obtained from processing it is high

• Leave data on disk when it is inactive or the value from processing it is low

Chart axes: Value vs. Size
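The trade-off on this slide can be worked through with simple arithmetic. The sketch below, in Python, takes the per-TB speed and price figures from the slide and scales them to a dataset size. The names `Tier`, `TIERS`, and `estimate` are hypothetical, and the linear scaling is an assumption made for illustration, not a benchmark.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Tier:
    name: str
    seconds_per_tb: Tuple[float, float]  # (best, worst) figures from the slide
    dollars_per_tb: float


# Figures copied from the slide; SSD is given there as a range.
TIERS: Dict[str, Tier] = {
    "HDD": Tier("HDD", (3300.0, 3300.0), 50.0),
    "SSD": Tier("SSD", (300.0, 1000.0), 500.0),
    "RAM": Tier("RAM", (1.0, 1.0), 4500.0),
}


def estimate(tier_name: str, size_tb: float) -> dict:
    """Scale the slide's per-TB figures linearly to a dataset of size_tb."""
    tier = TIERS[tier_name]
    best, worst = tier.seconds_per_tb
    return {
        "tier": tier.name,
        "scan_seconds": (best * size_tb, worst * size_tb),
        "cost_dollars": tier.dollars_per_tb * size_tb,
    }
```

For a 10 TB dataset, these figures give about $45,000 and 10 seconds in RAM versus $500 and 33,000 seconds (roughly nine hours) on hard disk, which is why the slide reserves memory for data whose processing value is high.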

Page 31:

Flexible Big Data deployment models

Direct Discovery

Billions of rows via Direct Discovery

Hundreds of millions of rows into memory

Aggregates / Detail

Page 32:

Combine Big Data and traditional data sources

Combine data sources using pure In-Memory

Aggregates / Detail

EDW Data

Data Warehouse

Page 33:

Today’s challenge:

What to do with Big Data? Who should do it?

IT

What to do with this?

Business

How to define requirements?

QlikView as a catalyst for implementing Big Data

Page 34:

QlikView gives business users, not just data scientists, the ability to discover with Big Data

More Access > More Questions > More Use > Higher ROI of Big Data

IT & Business

QlikView as a catalyst for implementing Big Data

Page 35:

QlikView In-Memory approach

• Loads compressed data into memory

• Enables associative search and analysis

• Supports hundreds of millions to billions of rows of data

In-Memory

Page 36:

QlikView Direct Discovery Approach

• Combines the associative capabilities of the QlikView in-memory dataset with a query model where:

– The aggregated query result is passed back to a QlikView object without being loaded into the QlikView data model

– The result set is still part of the associative experience

– Users can drill to detail records

Diagram labels: QlikView Application, QlikView In-Memory Data Model, Direct Discovery, Batch Load
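The contrast between a batch load and a pushed-down aggregate query can be sketched in a few lines of Python. This is not QlikView's implementation or script syntax. An in-memory SQLite database stands in for the external Big Data source, and the table `sales`, its columns, and the functions `make_source`, `batch_load`, and `direct_aggregate` are all hypothetical names chosen for illustration.

```python
import sqlite3
from typing import Dict, List, Tuple


def make_source() -> sqlite3.Connection:
    """Create a tiny stand-in for the external data source."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [("West", "A", 10.0), ("West", "B", 5.0),
         ("East", "A", 7.0), ("East", "B", 3.0)],
    )
    return conn


def batch_load(conn: sqlite3.Connection) -> List[Tuple[str, str, float]]:
    """In-memory style: copy every detail row out of the source."""
    return conn.execute("SELECT region, product, amount FROM sales").fetchall()


def direct_aggregate(conn: sqlite3.Connection,
                     selections: Dict[str, str]) -> Dict[str, float]:
    """Direct style: push the user's selections and the aggregation down
    to the source, so only the small aggregated result comes back."""
    allowed = {"region", "product"}  # guard column names used in the SQL text
    where, params = [], []
    for column, value in selections.items():
        if column not in allowed:
            raise ValueError(f"unknown column: {column}")
        where.append(f"{column} = ?")
        params.append(value)
    sql = "SELECT region, SUM(amount) FROM sales"
    if where:
        sql += " WHERE " + " AND ".join(where)
    sql += " GROUP BY region"
    return dict(conn.execute(sql, params).fetchall())
```

Calling `direct_aggregate(conn, {"product": "A"})` returns one total per region without the detail rows ever leaving the source, which mirrors the idea that the aggregated result is handed to a chart object rather than loaded into the data model.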

Page 37:

A Hybrid Approach for Tackling Big Data

100% in-memory when:

• All the necessary (i.e. relevant and contextual) data can fit in memory

• Users require only aggregated or summary data (e.g. hourly or daily averages), or record-level detail over a limited time period

• Query performance of the external source is not satisfactory

Direct Discovery when:

• Data cannot fit in memory and document chaining is not sufficient

• Users require access to record-level detail stored in a large fact table that will not fit in memory

• Network bandwidth limits the ability to copy data to the QlikView server

The design of Direct Discovery lets you alternate between these approaches with absolutely no change to the application itself
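The selection criteria on this slide can be read as a small decision rule. The Python sketch below encodes one possible reading. The `Workload` fields, the `choose_approach` function, and the order in which the checks are applied are assumptions made for illustration. The slide's criterion about slow external sources is left out because, under this reading, it never changes the outcome.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    detail_gb: float            # size of the record-level data
    aggregate_gb: float         # size of the summary data
    memory_gb: float            # memory available on the server
    needs_record_detail: bool   # do users need the full record-level detail?
    network_can_copy: bool      # can the data be copied to the server at all?


def choose_approach(w: Workload) -> str:
    """Return 'in-memory' or 'direct discovery' for a workload."""
    # If bandwidth prevents copying the data, it has to stay at the source.
    if not w.network_can_copy:
        return "direct discovery"
    # Only the data users actually need has to fit in memory.
    required_gb = w.detail_gb if w.needs_record_detail else w.aggregate_gb
    if required_gb <= w.memory_gb:
        return "in-memory"
    return "direct discovery"
```

With 5,000 GB of detail, 20 GB of aggregates, and 256 GB of memory, the rule picks in-memory when users need only summaries and Direct Discovery once they need the full record-level detail.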

Page 38:

DEMO