Data Lake, Virtual Database, or Data Hub - How to Choose?

© COPYRIGHT 2016 MARKLOGIC CORPORATION. ALL RIGHTS RESERVED. Damon Feldman, Ph.D @damon.feldman http://www.marklogic.com/blog/author/dfeldman / Data Lake, Virtual Database, or Data Hub How to Choose?

Upload: dataversity

Post on 08-Jan-2017



SLIDE: 2

Who am I?

• Solutions Director at MarkLogic

• About 8 years in the Big Data and Data Integration space

• Previously, in OOP, JEE worlds

• Focus on Data Hub and Customer or Person-360° systems

SLIDE: 3

But Why?

• Data Silos

• Usually work well for a single, operational purpose

• Turn any cross-line-of-business question into a data integration effort

SLIDE: 4

How about EDW?

• For a while, Enterprise Data Warehouses were the go-to solution for silos

• One master schema to rule them

• Data Modeler’s Dream!

• Implementer’s Nightmare!

• BMUF (Big Modeling Up Front)

• Rigid and tightly coupled

SLIDE: 5

Data Incompatibilities

• Three forms of data incompatibilities

• Naming is the simplest

• firstName vs. GIVEN_NAME

• Structural is somewhat harder

• Semantic differences are the most challenging

• Status: {in cart, ordered, shipped, delivered}

• Status: {selected, paid, complete}

Structural example, normalized layout:

PERSON: PERS_ID, DOB, FNAME, LNAME

PERS_ADDR_REL: PERS_ID, ADDR_ID

ADDRESS: ADDR_ID, LINE1, CITY, ZIP, TYPE: {US, UK}

Structural example, flattened layout:

PERSON: PERS_ID, DOB, FNAME, LNAME, ADDR_L1, ADDR_CITY, ADDR_ZIP, ADDR_MAILING_L1, ADDR_MAILING_ZIP
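
The naming and structural differences between the two PERSON layouts above can be sketched in code. This is a minimal illustration only: the sample rows, the common field names (`id`, `first_name`, `addresses`, and so on), and the `kind` label on each address are assumptions made for the example, not part of the talk.

```python
# Hypothetical rows shaped like the two PERSON layouts; every value is made up.
normalized = {
    "PERSON": [{"PERS_ID": 1, "DOB": "1980-02-03", "FNAME": "Ann", "LNAME": "Lee"}],
    "PERS_ADDR_REL": [{"PERS_ID": 1, "ADDR_ID": 10}],
    "ADDRESS": [{"ADDR_ID": 10, "LINE1": "1 Main St", "CITY": "Springfield",
                 "ZIP": "12345", "TYPE": "US"}],
}
flattened = [{"PERS_ID": 7, "DOB": "1975-06-01", "FNAME": "Bo", "LNAME": "Diaz",
              "ADDR_L1": "9 Elm Rd", "ADDR_CITY": "Shelbyville", "ADDR_ZIP": "54321",
              "ADDR_MAILING_L1": "PO Box 4", "ADDR_MAILING_ZIP": "54322"}]


def person(pers_id, dob, first, last, addresses):
    """Build one record in the (assumed) common structure."""
    return {"id": pers_id, "dob": dob, "first_name": first,
            "last_name": last, "addresses": addresses}


def from_normalized(db):
    """Follow PERS_ADDR_REL to attach each person's ADDRESS rows."""
    by_id = {a["ADDR_ID"]: a for a in db["ADDRESS"]}
    out = []
    for p in db["PERSON"]:
        addrs = [by_id[r["ADDR_ID"]] for r in db["PERS_ADDR_REL"]
                 if r["PERS_ID"] == p["PERS_ID"]]
        out.append(person(p["PERS_ID"], p["DOB"], p["FNAME"], p["LNAME"],
                          [{"kind": "primary", "line1": a["LINE1"],
                            "city": a["CITY"], "zip": a["ZIP"]} for a in addrs]))
    return out


def from_flattened(rows):
    """Split the ADDR_* and ADDR_MAILING_* columns into two address entries."""
    out = []
    for r in rows:
        addrs = [
            {"kind": "primary", "line1": r["ADDR_L1"],
             "city": r["ADDR_CITY"], "zip": r["ADDR_ZIP"]},
            # The flattened layout has no mailing city column, so it stays unknown.
            {"kind": "mailing", "line1": r["ADDR_MAILING_L1"],
             "city": None, "zip": r["ADDR_MAILING_ZIP"]},
        ]
        out.append(person(r["PERS_ID"], r["DOB"], r["FNAME"], r["LNAME"], addrs))
    return out


# Both sources now share one shape and can be processed together.
people = from_normalized(normalized) + from_flattened(flattened)
```

Semantic differences, such as the two Status vocabularies, are not covered here: mapping them needs business knowledge that a renaming or restructuring function cannot supply.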

SLIDE: 6

Three New Approaches

• Data Lakes

• Put it all somewhere else

• Virtual Databases (AKA Federated Databases)

• Pretend it is somewhere else

• Data Hubs

• Put it all somewhere else, Harmonize, and Index it for operational use

And a Framework to understand and choose approaches

SLIDE: 7

A Use Case

Consider a customer churn use case

• Review high-value customers

• ... who are at risk

• ... particularly if they are dropping or cancelling services

• Proactively address their trouble tickets or complaints

[Figure: three source systems: Customer Lifetime Value, Customer Support, and Order/Change/Drop, with sample customer comments such as “Need more … please upgrade…” and “Abysmal…dissatisfied”]

SLIDE: 8

Data Lakes

• Copy the data to a new infrastructure

• Typically Hadoop, but perhaps MarkLogic or other NoSQL

• Difficult with SQL because of the many sources

• Load “as-is”

• Operational Separation

[Diagram: a copy process moves Support, CLV, and Orders data into the DATA LAKE, which feeds BI/Analytics. Data is Moved to one place, but still in varied structures.]

SLIDE: 9

Virtual Database

• Query everything in real time

• Transparent to the caller

• True real-time

• Data is not Moved or Harmonized (except in memory during processing)

[Diagram: Support, CLV, and Orders each sit behind a Query Transform; a Query Conversion and Data Harmonization layer serves Retain/intervene, Churn Analysis, and Reporting. Data Remains in source systems.]
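
The virtual database flow above can be sketched as a function that calls every source at query time and harmonizes the rows only in memory. The three source functions, their field names, and the sample values are hypothetical stand-ins, not a real federation product's API.

```python
# Hypothetical stand-ins for the live Support, CLV, and Orders systems.
# Each returns rows in its own naming convention; the values are invented.
def support_source(customer_id):
    return [{"CUST": customer_id, "OPEN_TICKETS": 2}]


def clv_source(customer_id):
    return [{"customerId": customer_id, "lifetimeValue": 5000}]


def orders_source(customer_id):
    return [{"cust_no": customer_id, "last_action": "drop"}]


# One (source, transform) pair per federate: the transform renames fields
# into the common view at query time.
FEDERATES = {
    "support": (support_source, lambda r: {"open_tickets": r["OPEN_TICKETS"]}),
    "clv": (clv_source, lambda r: {"lifetime_value": r["lifetimeValue"]}),
    "orders": (orders_source, lambda r: {"last_order_action": r["last_action"]}),
}


def federated_query(customer_id):
    """Query every source live and harmonize the rows in memory only."""
    view = {"customer": customer_id}
    for source, transform in FEDERATES.values():
        for row in source(customer_id):  # this call is load on the source system
            view.update(transform(row))
    return view  # nothing is stored; the next call repeats all of the work


print(federated_query("c-42"))
```

Because nothing is persisted, every consumer query repeats the source calls and the transforms, which is why the load lands on the source systems.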

SLIDE: 10

Data Hubs

• Copy as with a Data Lake

• Harmonize and Index

• Regular structures for analytics, reporting, consumption

• Indexes atop the common structures

[Diagram: Support, CLV, and Orders are copied into the DATA HUB, harmonized, and served to BI/Analytics and other Consumers. Data is Moved to one place and also Harmonized and Indexed.]

SLIDE: 11

Beneath and Beyond the Terms

The terms are useful, but vague, and don’t tell us what works for our next project

Consider all these approaches in terms of:

• Movement

• Harmonization

• Indexing

SLIDE: 12

Data Movement

• Data Movement is copying data to new, physical storage so it can be accessed via new servers and processes

• Operational Separation

• Organizational Separation

[Diagram labels: Orders System; Sales Department IT; Retain/Intervene; Churn Analysis; Reporting]

SLIDE: 13

Data Movement and the Three Approaches

• Data Lakes are all but defined by Movement

• Operational and Organizational separation

• Virtual Databases - unique in not Moving data

• Load is pushed to the source systems

• Backup, HA/DR, Security implemented on all source systems

• Data Hubs also Move data

SLIDE: 14

Data Harmonization

• Recall: Three forms of data incompatibility

• Naming

• Structural

• Semantic

Normalized layout:

PERSON: PERS_ID, DOB, FNAME, LNAME

PERS_ADDR_REL: PERS_ID, ADDR_ID

ADDRESS: ADDR_ID, LINE1, CITY, ZIP, TYPE: {US, UK}

Flattened layout:

PERSON: PERS_ID, DOB, FNAME, LNAME, ADDR_L1, ADDR_CITY, ADDR_ZIP, ADDR_MAILING_L1, ADDR_MAILING_ZIP

SLIDE: 15

Data Harmonization

• Harmonization is mapping into a common structure for key data elements

• Eventually, data must be consumed, aggregated and analyzed in a common form

Orders System:

• $1400 equipment order

• £270/month, 36-month contract

• Exchange Rate: 1.28

Maintenance/trouble tickets:

• Network upgrade needed

• Projected cost $3,000

Customer: Expected Net Revenue

Oren Wilkins: $4,280

Sarah Ravnick: $17,200

David Perez: …
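
To see why a common form is needed before aggregating, here is a worked calculation over the order and ticket figures above. It is a sketch under two assumptions that the slide does not state: the exchange rate is read as US dollars per pound, and expected net revenue is taken as order revenue minus projected cost. The result is not claimed to match any row in the slide's customer table.

```python
from decimal import Decimal

# Assumption: the slide's "Exchange Rate: 1.28" means 1 GBP = 1.28 USD.
USD_PER_GBP = Decimal("1.28")

equipment_usd = Decimal("1400")        # one-off equipment order, already in USD
contract_gbp = Decimal("270") * 36     # 270 GBP per month for 36 months
contract_usd = contract_gbp * USD_PER_GBP
projected_cost_usd = Decimal("3000")   # network upgrade from the trouble ticket

# Assumption: net revenue = order revenue - projected maintenance cost.
expected_net_usd = equipment_usd + contract_usd - projected_cost_usd

print(contract_usd)       # 12441.60
print(expected_net_usd)   # 10841.60
```

The pound and dollar amounts cannot be summed until one is converted, and the ticket cost lives in a different system from the order, so the number only exists once both sources are mapped into one structure.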

SLIDE: 16

Data Harmonization

• Harmonization is the “value add” in the process

• The earlier the better for maximum use

• Store it

• Index it

• Yet BMUF fails often

• Progressive Harmonization

[Diagram: progressive harmonization of Person records]

Source records:

Person: Fname, Lname, BIRTH, PHYSATTR, PHYSATTR

Person: Given-name, Family-name, Eye-color, Demographics, DOB

Iteration 1:

Person. Harmonized: Name, Address, DoB. Source: Eye color, Height, Credit Risk

Iteration 2:

Person. Harmonized: Name, Address, DoB, EyeColor, Height. Source: Credit Risk
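
Progressive Harmonization can be sketched as a mapping that grows between iterations while the stored record keeps everything not yet mapped. The `harmonize` function, the envelope keys, and the sample record below are illustrative assumptions that follow the shape of the diagram, not MarkLogic's actual Data Hub API.

```python
def harmonize(source_record, mapping):
    """Wrap a record in an envelope: mapped fields are renamed into the
    harmonized section, everything else is kept as-is under "source"."""
    harmonized = {target: source_record[field]
                  for field, target in mapping.items() if field in source_record}
    remaining = {field: value for field, value in source_record.items()
                 if field not in mapping}
    return {"harmonized": harmonized, "source": remaining}


# Hypothetical source record using the second source's field names.
record = {"Given-name": "Ann", "Family-name": "Lee",
          "DOB": "1980-02-03", "Eye-color": "brown"}

# Iteration 1: harmonize only the fields the first use case needs.
ITERATION_1 = {"Given-name": "GivenName", "Family-name": "FamilyName", "DOB": "DoB"}

# Iteration 2: extend the same mapping; nothing is remodeled.
ITERATION_2 = {**ITERATION_1, "Eye-color": "EyeColor"}

envelope_1 = harmonize(record, ITERATION_1)
envelope_2 = harmonize(record, ITERATION_2)
```

In iteration 1 the eye color is still available under `source`; iteration 2 promotes it to the harmonized section without touching the fields that were already mapped.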

SLIDE: 17

Harmonization and the Approaches

• Data Lakes don’t Harmonize

• Harmonization is pushed downstream, or implicit in the jobs

• Often ETL copies from format to format (particularly in Hadoop)

• Virtual Databases Harmonize in real time

• Each source query and result is harmonized in memory

• Pushes the load to the source systems

• Data Hubs Harmonize and Persist

• Explicit storage and management of Harmonized data

• Governable

[Diagram labels: Data Lake, with Job 1 and Job 2; Query, over Silo 1 and Silo 2; Data Lake and Data Hub]

SLIDE: 19

Indexing

“Who Said Databases Weren’t a Good Idea?”

- Ken Krupa, Enterprise CTO, MarkLogic

• Indexing is a decision to make something fast

Finding, totaling, sorting, grouping, correlating, analyzing

Sometimes also accessing

• Less obviously

Caching and memory use

Reference data usage

SLIDE: 20

Indexing Benefits

• Advance from Batch to Operational

• Micro-service or SOA architectures

• Find the latest address

• A 360° summary record of a customer

• Human Services: reviewing FSA recipients – interactive dashboard

• “Run your business”
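
The "find the latest address" case shows what an index buys. The sketch below contrasts a full scan with a lookup structure built once at load time; a plain dictionary stands in for a database index, and the record fields and values are hypothetical.

```python
# Hypothetical harmonized address records; "updated" uses ISO dates, which sort as text.
records = [
    {"customer": "c-1", "address": "1 Main St", "updated": "2015-03-01"},
    {"customer": "c-2", "address": "5 Oak Ave", "updated": "2016-01-15"},
    {"customer": "c-1", "address": "9 Elm Rd", "updated": "2016-07-20"},
]


def latest_address_scan(records, customer):
    """No index: every lookup reads every record."""
    matches = [r for r in records if r["customer"] == customer]
    return max(matches, key=lambda r: r["updated"])["address"]


def build_latest_address_index(records):
    """Pay once, at load time, to keep only the newest address per customer."""
    index = {}
    for r in records:
        best = index.get(r["customer"])
        if best is None or r["updated"] > best["updated"]:
            index[r["customer"]] = r
    return index


index = build_latest_address_index(records)
print(index["c-1"]["address"])  # same answer as the scan, via one dictionary lookup
```

The index only works because the records already share one field name for the customer and one for the date, which is the sense in which indexing builds on harmonization.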

SLIDE: 21

Three Approaches Revisited – Virtual Databases

Issues

• Least-common-denominator Query

• Paradox: more systems = less power

• Coupling to source systems – schema change = broken DB

• Weakest link problem - HA/DR, overload

• Complexity

• Paging, sorting, relevance, dealing with a down federate

Benefit

• Real Time is easy

• May be ok for small or initial systems
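
The paging, sorting, and down-federate issues listed above can be made concrete with a small sketch. It assumes each federate can return its rows already sorted on the requested key and signals an outage with `ConnectionError`; the federate names and rows are hypothetical.

```python
import heapq
from itertools import islice


def federated_page(federates, key, offset, limit):
    """Merge already-sorted results from each federate and return one page,
    plus the names of any federates that could not be reached."""
    streams, down = [], []
    for name, fetch in federates.items():
        try:
            streams.append(fetch())
        except ConnectionError:
            down.append(name)  # the page is silently incomplete without this source
    merged = heapq.merge(*streams, key=key)
    return list(islice(merged, offset, offset + limit)), down


def unavailable():
    raise ConnectionError("federate is down")


federates = {
    "orders": lambda: [{"name": "Ann"}, {"name": "Cy"}],
    "support": lambda: [{"name": "Bo"}],
    "clv": unavailable,
}

page, down = federated_page(federates, key=lambda r: r["name"], offset=1, limit=2)
```

Even in this toy version, producing page two means asking every federate for its rows again and merging from the start, and the caller has to decide what a result means when one source is missing.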

SLIDE: 22

Three Approaches Revisited - Data Lakes

Issues

• Still need to Harmonize the data

• Typically in every batch job, ETL (Pig/Hive) job, query, analysis

• Risk of the “Data Swamp”

• Batch focus

• In-memory helps, but still batch

• Frankenbeast workarounds create more silos, rather than solving the problem

Benefit

• The data is moved

• Storage is cheap

• One team and process to add functionality

SLIDE: 23

Three Approaches Revisited – Data Hubs

Data Hubs - Advantages

• Most powerful solution – all of: Movement, Harmonization, Indexing

• “Run your business”

• Indexing builds on Harmonization

• Harmonization is the value add, so index it!

• Grow by regularizing, not by complicating

• More data sources to the Harmonized form

• Progressive Harmonization to increase the Harmonized data elements

• HA/DR, scale, security, query power, batch efficiency, governance

Tradeoffs

• Dedicated hardware

• Change detection or data push needed for real-time

SLIDE: 24

Data Lake vs Data Hub

“The fact is, you don't put everything into a datastore and then go looking for something to do.”

- Ted Dunning, MapR Chief Applications Architect

Data Hubs are Operational and “Purpose-driven”

Use case, API, Progressive Harmonization, Data Integration

They do not merely have Harmonized data and Indexes; they are about serving Harmonized data and Indexes to drive them.

SLIDE: 25

Value Over Time

[Chart: ROI (vertical axis) against Time, Evolution, Range of Data (horizontal axis), with curves for Data Lake, Data Hub, and Virtual Database]

SLIDE: 26

Evaluating MarkLogic with the Three Criteria

SLIDE: 27

MarkLogic Operational Data Hub Pattern

Some say: “A Data Lake and EDW are better together”

Translation: “This Data Lake is not doing a very good job, and never will”

MarkLogic brings database/data warehouse functions into the Data Lake, making it “Operational” and a “Data Hub” by virtue of Harmonization and Indexing, but not by trying to build a (smaller) EDW

SLIDE: 28

MarkLogic for Operational Data Hubs

• MarkLogic supports all three paradigms

• Our product direction, consulting team, experience are focused on Data Hubs

• MarkLogic is a database

• Allowing an “Operational Data Hub”

• Run your business AND observe your business

• One place for the latest data – address, income, account status, health

• Integrated data for 360° views

SLIDE: 29

MarkLogic ODH Features - Movement

• Ingest data “as-is”

• Native support for JSON, XML, Binary, RDF, Text, SQL, Geo

• Data Loading tools for MPP batch ingest

• Index latent structure in each

• Commodity hardware, commodity disk

• Tiered storage for cost-effective storage

SLIDE: 30

Operational Data Hub Pattern in MarkLogic

[Diagram: Data Flow from Source Systems through INGEST, HARMONIZE, and SERVE to Consuming Applications]

Source Systems: RDBMS, Message Bus, Content Feed

Staging (Raw, As-is data): Source 1 Documents, Source 2 Documents, …, Source N Documents

Final (Harmonized, Indexed data): Enveloped Documents (Entity 1), Enveloped Documents (Entity 2), …, Enveloped Documents (Entity N)

Consuming Applications: Operational Apps, Analysis/BI, Data Feeds

Discovery, Harmonization, Indexes, Query, Services

SLIDE: 31

MarkLogic ODH Features - Harmonization

• Best-in-class data Transform capabilities

• XSLT, XQuery implemented to spec from the ground up

• JavaScript via V8 engine

• Triggers, data extraction from binaries, MPP processing

• Multi-modal processing of many data formats

• Ontology processing – RDFS, OWL

SLIDE: 32

MarkLogic ODH Features - Indexing

• MarkLogic is built on the “Universal Index”

• Text, document structure, fields, text and security in one index

• Columnar range indexes for analysis and SQL processing

• Triple index for RDF, SPARQL and semantic query

• Geospatial index

• Projection operations to expose one structure (e.g. JSON or XML) as SQL or RDF

• Operational vs. purely analytical. You can run your business on MarkLogic

SLIDE: 33

Summary

• Data Lakes and Hubs are on a continuum

• Primarily distinguished by level of indexing

• Virtual databases are a very different animal – and not usually in a good way

• Within each pattern, Movement, Harmonization and Indexing are knobs to turn

• Movement – for isolation and data access

• Harmonization – for micro-services and value-add

• Indexing – for speed and operational use cases

• Consider your goals and requirements, and plan accordingly

SLIDE: 34

More Info

MarkLogic Data Hub Framework (quick start): https://marklogic.github.io/marklogic-data-hub/

MarkLogic Data Hub information: http://www.marklogic.com/solutions/operational-data-hub/

Damon’s blog on data lakes: http://www.marklogic.com/blog/data-lakes-data-hubs-federation-one-best/

Follow Damon on Twitter: https://twitter.com/damonfeldman