Three Signs Your Architecture Is Too Small for Big Data. CampIT, December 2014



© 2014 Craig Jordan

3 SIGNS YOUR BUSINESS INTELLIGENCE ARCHITECTURE IS too small for BIG DATA
Capabilities, Experiments, and Architecture Patterns

CampIT, December 4, 2014

Agenda

• The three V-s and their impact on a classic BI architecture

• Three capabilities Big Data requires
  • Near-real-time data processing
  • Machine learning
  • Text processing

• Extending the classic BI architecture

• Q&A


Defining Big Data

[Diagram: Big Data at the center of four quadrants]

• Volume: Transactions & master data, Click stream, Sensor, Log, Event, [Scanned] document, Speech, audio, Social media
• Variety: Unstructured, Semi-structured, Structured
• Velocity: Speed of generation, Analysis latency, Decision latency, Time to action
• Veracity: Untrusted, Uncleansed, Master data, Transactions


Classic BI Architecture

Create Acquire Integrate Present Use

[Diagram: op sys data, third party data, and SaaS data → extract process → EDW stage → transform & load → EDW → data marts. Metadata Management and Data Quality Management & Data Governance span the whole flow.]


Classic BI Architecture Capabilities

• Standard reports and dashboards with reliable performance

• Ad hoc analysis through defined dimensions
• Measure-centric analysis
• Flexible representation of change over time
• Manual insight discovery
• Quantifiable quality
• Analytic workload isolated from operations


Big Data 3 V-s Impact

• Volume
  • Increases the size of every “cylinder”
  • Increases the throughput required for every arrow & process
• Variety
  • Detail is discarded during extract & transform
• Velocity
  • Challenges every process to be available and responsive
• Veracity
  • Decisions must be encoded by the processing


Warning?

• Your data warehouse is running out of disk space.
• You don’t store tweets in your data warehouse.
• Your ETL processes still execute in batch.


Why the 3 V-s are Insufficient

Volume
• Database sizes can be increased
• How large is large enough?

Variety
• Which data types are required, nice to have, unnecessary?
• Should an architecture provide techniques for every data type?

Velocity
• ETL processes can be enhanced to execute immediately
• However, doing so for all of them is not feasible
• Which processes must be immediate?
• How do you handle data that is involved in both immediate and batch processing requirements?

Veracity
• Like beauty, [at least some] truth is in the eye of the beholder.


Tomorrow may be too late

Your business intelligence architecture is too small for big data if…

Some of your key business processes complete with undesirable outcomes due to unnecessary data delay.


What processes complete today?

Finding architecture requirements for increased velocity

• Not all processes have the same duration
  • Find the ones that can start and finish in a single day
• What information sources do they use?
  • Find the ones that change during the process and impact its outcome
• What consumers need the information during the process?
  • Find the ones that could take different action to affect the business result
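
A minimal sketch of the first step, finding processes that start and finish in a single day, assuming process executions are available as records with start and finish timestamps. The `ProcessRun` shape and the sample process names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ProcessRun:
    """One observed execution of a business process (hypothetical record shape)."""
    name: str
    started: datetime
    finished: datetime


def completes_same_day(run: ProcessRun) -> bool:
    """True when the run starts and finishes on the same calendar date."""
    return run.started.date() == run.finished.date()


def same_day_processes(runs: list[ProcessRun]) -> set[str]:
    """Names of processes whose every observed run finished the day it started."""
    by_name: dict[str, list[ProcessRun]] = {}
    for run in runs:
        by_name.setdefault(run.name, []).append(run)
    return {
        name
        for name, group in by_name.items()
        if all(completes_same_day(r) for r in group)
    }


# Illustrative data only.
runs = [
    ProcessRun("order fulfilment", datetime(2014, 12, 4, 9), datetime(2014, 12, 4, 16)),
    ProcessRun("order fulfilment", datetime(2014, 12, 5, 10), datetime(2014, 12, 5, 11)),
    ProcessRun("month-end close", datetime(2014, 11, 28, 8), datetime(2014, 12, 3, 17)),
]
candidates = same_day_processes(runs)
print(candidates)
```

The processes in `candidates` are the ones worth checking next for volatile information sources and for consumers who could act differently mid-process.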


Reducing the time to insight

[Diagram, two versions. Labels in the first: Business Process; Nightly ETL Process; volatile data; slowly changing op sys data; data mart; target. The second adds an On Demand “ETL” Process alongside the Nightly ETL Process.]

• Investigate near-real-time data movement aligned with your source technology
  • XML, Avro, JSON
  • JMS
  • SOA
• Minimize the change to the target by including as few volatile sources as possible


Ingest & respond, Accumulate & consolidate

The Lambda Architecture

[Diagram labels: Incoming Data; Batch Layer (All Data, Precomputed Information, Batch Data Views); Speed Layer (Process Stream, Incremental Information, Real Time Views); Serving Layer; Query.]

Batch layer
• Type II Slowly Changing Dimensions
• Accumulating Snapshots
• Batch machine learning calculations

Speed layer
• Micro transformations to make data accessible
• Real-time machine learning calculations
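
To make the layers concrete, here is a small in-memory sketch of the lambda idea: every incoming event goes to both a batch layer and a speed layer, and a query combines the batch view with the real-time view. The class and function names are hypothetical, events are simple `(key, amount)` pairs, and the sketch assumes no events arrive while the batch view is being recomputed. A real deployment has to handle that hand-off more carefully.

```python
from collections import Counter


class BatchLayer:
    """Holds all data and periodically recomputes a complete view."""

    def __init__(self):
        self.all_data = []      # append-only master data set
        self.view = Counter()   # precomputed batch view

    def append(self, event):
        self.all_data.append(event)

    def recompute(self):
        """Rebuild the batch view from scratch (for example, the nightly job)."""
        self.view = Counter()
        for key, amount in self.all_data:
            self.view[key] += amount


class SpeedLayer:
    """Keeps an incremental view of events the batch view has not absorbed yet."""

    def __init__(self):
        self.view = Counter()

    def process(self, event):
        key, amount = event
        self.view[key] += amount

    def reset(self):
        """Discard incremental results once the batch view covers them."""
        self.view = Counter()


def ingest(event, batch, speed):
    """Incoming data is sent to both layers."""
    batch.append(event)
    speed.process(event)


def query(key, batch, speed):
    """Combine the batch view with the real-time view."""
    return batch.view[key] + speed.view[key]


batch, speed = BatchLayer(), SpeedLayer()
ingest(("store-1", 100), batch, speed)
ingest(("store-1", 50), batch, speed)
before = query("store-1", batch, speed)  # answered by the speed layer alone

batch.recompute()   # batch view now covers both events
speed.reset()
ingest(("store-1", 25), batch, speed)
after = query("store-1", batch, speed)   # batch view plus one new event
print(before, after)
```

The point of the design is that the query never waits for the batch job: recent events are visible immediately, and the batch recomputation later replaces the incremental results with a consolidated view.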


Action Steps: Reducing time to insight

• Find the processes that complete in a single day

• Determine the volatile sources that impact the process outcome

• Select near-real-time data acquisition techniques aligned to source

• Minimize impact on the target

• Prepare for a big data architecture by
  • Selecting data formats suitable for big data (Avro & CSV rather than XML and JSON)
  • Understanding the concepts of the lambda architecture
• Experiment with prototype implementations
• Select the architecture that most readily reduces arbitrary delay
• Architect your analytic targets for “accumulation” and “consolidation”
  • Lambda architecture (speed layer, serving layer, consolidation layer)
  • Type-II dimensions & summaries
  • Accumulating snapshots
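
As an illustration of a target built for accumulation, the sketch below applies a Type II slowly changing dimension update: a change never overwrites history, it closes the current row and appends a new version. The `DimensionRow` fields and the customer/segment example are assumptions made for illustration, not part of the talk.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class DimensionRow:
    customer_id: str
    segment: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_change(rows, customer_id, segment, effective):
    """Type II update: expire the current version and append a new one."""
    current = next(
        (r for r in rows if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None:
        if current.segment == segment:
            return rows               # nothing changed; keep history as is
        current.valid_to = effective  # expire the old version
    rows.append(DimensionRow(customer_id, segment, effective))
    return rows


def segment_as_of(rows, customer_id, as_of):
    """Look up the attribute value that was in effect on a given date."""
    for r in rows:
        if r.customer_id != customer_id:
            continue
        if r.valid_from <= as_of and (r.valid_to is None or as_of < r.valid_to):
            return r.segment
    return None


rows = []
apply_change(rows, "C1", "retail", date(2014, 1, 1))
apply_change(rows, "C1", "wholesale", date(2014, 9, 1))
print(segment_as_of(rows, "C1", date(2014, 6, 1)))
print(segment_as_of(rows, "C1", date(2014, 12, 4)))
```

Because each change only appends a row and closes the previous one, small near-real-time updates and a later batch consolidation can both write to the same dimension without losing history.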


Thresholds and indicators are not enough

Your business intelligence architecture is too small for big data if…

Finding insights from your data depends upon the day-to-day work of your analysts.


Predicting the expected, Finding the unexpected

• Statistical analysts use historical observations to
  • Predict future performance
  • Characterize normal capability
  • Identify deviations from the norm

[Figure: control chart over Time with an Upper Control Limit (UCL) and a Lower Control Limit (LCL). From: http://www.bottomlineanalytics.com/]
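
A rough sketch of how control limits can be computed and applied automatically, assuming the common convention of placing UCL and LCL three standard deviations either side of the historical mean. It uses the sample standard deviation from Python's `statistics` module as a simplification; formal individuals charts usually estimate sigma from moving ranges instead. The sample numbers are invented.

```python
from statistics import mean, stdev


def control_limits(baseline, sigmas=3.0):
    """Return (LCL, UCL) from historical observations of normal behaviour."""
    center = mean(baseline)
    spread = stdev(baseline)
    return center - sigmas * spread, center + sigmas * spread


def deviations(observations, lcl, ucl):
    """Return (index, value) pairs that fall outside the control limits."""
    return [(i, x) for i, x in enumerate(observations) if x < lcl or x > ucl]


# Illustrative data only: historical baseline, then new observations to screen.
baseline = [50, 52, 49, 51, 50, 48, 52, 50, 49, 51]
lcl, ucl = control_limits(baseline)
new_points = [50, 53, 61, 49, 42]
flagged = deviations(new_points, lcl, ucl)
print(round(lcl, 2), round(ucl, 2), flagged)
```

Once the limits are encoded this way, the check can run on every new observation instead of waiting for an analyst to look at a chart.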


Machine learning categories & their applications


Experiment first

• Begin with machine learning by enabling qualified analysts through appropriate tools
  • R
  • Python
  • SAS/EM

[Diagram: Interactive Advanced Analytic Platform. Labels: Enterprise Data Asset; Analytic Workbench (python, R, SAS, …); M; cache; Data scientist.]


Extend to the operational case

• Confirm through data analyst involvement

[Diagram: Interactive Advanced Analytic Platform. Labels: Enterprise Data Asset; Analytic Workbench (python, R, SAS, …); cache; BI Tool; Data analyst; Data scientist. Callout: Access model through SQL extensions & R integration. Legend: Q = Query, Report, Etc.; M = Analytic Model.]


Extend to the operational case, cont

• Fully operationalize by materializing model output for online access; or by enabling near-real-time execution

[Diagram: three environments. Isolated Exploration Environment: Enterprise Data Asset; Extract; On demand; Private Data; Interactive query; Desktop BI Tool; cache; Data analyst; Business Leader. Live Interactive Query: Desktop BI Tool; Mobile BI Tool; BI Tool; Data analyst; Developer. Interactive Advanced Analytic Platform: Analytic Workbench (python, R, SAS, …); cache; Data analyst; Data scientist.]
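
The two operationalization options on this slide, materializing model output for online access and enabling near-real-time execution, can be sketched with Python's built-in `sqlite3` module standing in for the enterprise database. The `score` function is a hypothetical placeholder for a trained model, and the table and column names are invented for the example.

```python
import sqlite3


def score(recency_days, orders):
    """Hypothetical stand-in for a trained model: higher means more at risk."""
    return round(min(1.0, recency_days / 365) * (1 / (1 + orders)), 4)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id TEXT PRIMARY KEY, recency_days INTEGER, orders INTEGER)"
)
conn.executemany(
    "INSERT INTO customer VALUES (?, ?, ?)",
    [("C1", 10, 12), ("C2", 400, 0), ("C3", 180, 1)],
)

# Option 1: materialize model output into a table that BI tools can query.
conn.execute("CREATE TABLE customer_score (id TEXT PRIMARY KEY, churn_risk REAL)")
rows = conn.execute("SELECT id, recency_days, orders FROM customer").fetchall()
conn.executemany(
    "INSERT INTO customer_score VALUES (?, ?)",
    [(cid, score(rec, orders)) for cid, rec, orders in rows],
)

# Option 2: expose the model as a SQL function so it runs at query time.
conn.create_function("churn_risk", 2, score)

materialized = conn.execute(
    "SELECT id FROM customer_score ORDER BY churn_risk DESC LIMIT 1"
).fetchone()
on_demand = conn.execute(
    "SELECT id FROM customer ORDER BY churn_risk(recency_days, orders) DESC LIMIT 1"
).fetchone()
print(materialized, on_demand)
```

Materialized scores are cheap to read but only as fresh as the last scoring run; the SQL function is always current but pays the model's cost on every query.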


Action Steps: Enhancing automatic insight

• Find statistical experts who already
  • Classify normal and special cause variation
  • Cluster customers, sales, products, etc.
  • Forecast future performance
• Experiment with machine learning algorithms
• Confirm machine learning models through extended advanced analytic environment
• Operationalize machine learning
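
As one possible first experiment, the sketch below clusters customers with a plain k-means loop written against the standard library only. The customer features (orders per year, average order value) are invented, and the starting centroids are simply the first k points so the run is repeatable. A real experiment would scale the features and would more likely use R, SAS, or a Python library than hand-written code.

```python
from math import dist


def kmeans(points, k, iterations=20):
    """Plain k-means; centroids start at the first k points for repeatability."""
    centroids = [tuple(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return assignment, centroids


# Hypothetical customers: (orders per year, average order value).
customers = [(2, 20.0), (40, 310.0), (3, 25.0), (38, 290.0), (1, 18.0), (42, 305.0)]
labels, centers = kmeans(customers, k=2)
print(labels, centers)
```

Here the two groups separate into occasional low-value buyers and frequent high-value buyers, the kind of segmentation an analyst would otherwise produce by hand.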


Text doesn’t fit in your architecture


Your business intelligence architecture is too small for big data if…

Your analysts can’t find textual information that matters.


Finding text to analyze

• Select text directly related to your key processes
  • Email
  • Notes
  • Customer surveys
  • Phone call transcripts
  • Web chat transcripts


What happens to text in a classic BI architecture?

[Diagram labels: Load; EDW]


Text processing techniques

• Investigate text processing techniques and the outputs they can create
  • Entity extraction (Creating relationships)
  • Polarity (aka Sentiment)
  • Search-based sets
    • Keyword identification / facets
    • Relevancy scores
• Example Technologies
  • Python
    • Standard Library Text Processing Services
    • Natural Language Toolkit (NLTK)
    • OpenNLP
  • R
    • qdap (Oct 2014)
    • tm (June 2014)
  • DB
    • GPText (SOLR and MADlib)
    • MarkLogic (free-text search)
• Expose as UDFs through SQL
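
To show what these outputs look like as data, here is a toy sketch of polarity, keyword identification, and a relevancy score using only Python's standard library. The word lists are a hypothetical lexicon invented for the example, and the scoring formulas are deliberately naive; NLTK or a search engine would replace them in practice.

```python
import re
from collections import Counter

# Hypothetical toy lexicon; a real project would use a curated one.
POSITIVE = {"great", "helpful", "fast", "love"}
NEGATIVE = {"late", "broken", "slow", "refund"}
STOPWORDS = {"the", "a", "an", "was", "is", "and", "but", "my", "i", "to", "for"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text):
    """Score in [-1, 1]: (positive - negative) / matched words, 0 if none match."""
    tokens = tokenize(text)
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)


def keywords(text, top=3):
    """Most frequent non-stopword tokens, usable as facets."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top)]


def relevancy(text, query):
    """Fraction of tokens in the text that match a query term."""
    tokens = tokenize(text)
    terms = set(tokenize(query))
    return sum(t in terms for t in tokens) / len(tokens) if tokens else 0.0


note = "The delivery was late and the box was broken, but support was helpful."
print(polarity(note), keywords(note, top=6), relevancy(note, "late delivery"))
```

Each function turns free text into a number or a short list, which is the form a warehouse can store next to the transaction the text describes.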


Extend your architecture to include text

[Diagram labels: op sys data; third party data; SaaS data; Extract/Transform/Load Processes; Replication Processes; NoSQL DB; Hadoop Cluster; EDW; data marts; Metadata Management; Data Quality Management & Data Governance.]


Richly describe your core processes


Action Steps: Deriving insights from text sources

• Find the text sources that relate to your core processes
• Experiment with text processing
  • Entity extraction
  • Polarity
  • Faceting
  • Relevancy
• Experiment with navigation
  • KPI-based
    • Sentiment
    • Relevance
  • Search-based
  • Facet-based


Classic BI Architecture


Extended for near real-time processing

[Diagram: the classic BI architecture, annotated with:]
• Experiment with speed layer techniques
• Near real-time ingestion technologies


Extended for machine learning

[Diagram: the classic BI architecture, annotated with:]
• Add ML algorithms
• Add analytic workbench for experimentation


Extended for text processing

[Diagram: the classic BI architecture, annotated with:]
• Add analytic workbench for experimentation
• Add natural language processing & faceted search
• Add flexible storage engines for unstructured data (NoSQL DB, Hadoop Cluster)


Summary, Q&A

Craig Jordan
https://www.linkedin.com/in/crjordan

~ Thank You ~