Discovering Big Data in the Fog: Why Catalogs Matter

Upload: eric-kavanagh

Post on 15-Apr-2017

Category:

Technology


TRANSCRIPT

Page 1: Discovering Big Data in the Fog: Why Catalogs Matter

Enjoy the pre-show banter!

Page 2: Discovering Big Data in the Fog: Why Catalogs Matter

Connecting the Right People to the Right Data

Waterline Overview Presentation

Page 3: Discovering Big Data in the Fog: Why Catalogs Matter

If you don’t know where the data is located, you can’t affect business outcomes

Where do I find the data I need to complete my business analysis?

How do I organize all the data the business analysts need?

Where is sensitive data? Where did it come from? Who should have access?

Where is redundant data located and how much can I eliminate?

Risk Management & Compliance | Self Service Data | Data Rationalization

Page 4: Discovering Big Data in the Fog: Why Catalogs Matter

Didn’t “Big Data” Solve These Problems?

Data Wrangling

Data Visualization

None of these technologies tell you:

Where do I find the data?

How do I organize it?

Where did it come from?

What is in the data?

Who can use it?

Page 5: Discovering Big Data in the Fog: Why Catalogs Matter

Waterline Smart Data Catalog: Answers the Key Questions

• Automatically discovers, organizes, tags and curates data & makes it available to search and find

• Consolidates and tracks data lineage

• Profiles data, presents statistics, presents crowdsourced ratings

• Identifies sensitive data and enables tag based access control

Where do I find the data?

Where did it come from?

What is in the data?

Who can use the data?

Connect the Right People to the Right Data

Data Professionals: Discover | Organize | Curate

Business Professionals: Search | Rate | Collaborate
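
The "tag based access control" bullet on this slide can be illustrated with a small sketch. This is not Waterline's implementation or API; the `Dataset` and `User` classes, the `SENSITIVE_TAGS` set, and the rule itself (a user must be cleared for every sensitive tag a data set carries) are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    tags: set[str] = field(default_factory=set)


@dataclass
class User:
    name: str
    allowed_tags: set[str] = field(default_factory=set)


# Hypothetical policy: tags in this set mark a data set as sensitive.
SENSITIVE_TAGS = {"pii", "financial"}


def can_access(user: User, dataset: Dataset) -> bool:
    """Allow access only if the user is cleared for every sensitive tag on the data set."""
    required = dataset.tags & SENSITIVE_TAGS
    return required <= user.allowed_tags


def visible_datasets(user: User, catalog: list[Dataset]) -> list[str]:
    """Names of the data sets this user may see, in catalog order."""
    return [d.name for d in catalog if can_access(user, d)]


# Example data (invented for illustration).
catalog = [
    Dataset("customers", {"pii", "crm"}),
    Dataset("web_logs", {"clickstream"}),
]
analyst = User("analyst")
steward = User("steward", {"pii"})
print(visible_datasets(analyst, catalog))
print(visible_datasets(steward, catalog))
```

The design point is that the policy is written once against tags, not once per data set, so newly discovered data is covered as soon as it is tagged.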

Page 6: Discovering Big Data in the Fog: Why Catalogs Matter

Inventory data and enable users to search, find and use data across all data sets

Demonstrate auditable data lineage. Get compliant data quickly into use. React quickly to regulatory requests.

Identify redundant databases, marts, tables & schemas to eliminate or move

What We Do

Data Lifecycle Management

Discover • Catalog • Analyze • Secure • Rationalize

Risk Management & Compliance | Self Service Data | Data Rationalization
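
As a rough illustration of the data rationalization idea on this slide (identifying redundant tables), the sketch below groups tables whose contents are identical regardless of row order. It is an assumption-laden toy, not the product's method: the function names and the sample table names are hypothetical, and real redundancy detection would also need to handle partial overlap, renamed columns, and scale.

```python
import hashlib
from collections import defaultdict


def content_digest(rows):
    """Order-insensitive digest of a table's rows."""
    row_hashes = sorted(
        hashlib.sha256(repr(tuple(row)).encode("utf-8")).hexdigest() for row in rows
    )
    return hashlib.sha256("".join(row_hashes).encode("utf-8")).hexdigest()


def find_redundant(tables):
    """Group table names that share an identical content digest."""
    groups = defaultdict(list)
    for name, rows in tables.items():
        groups[content_digest(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]


# Example data (invented for illustration).
tables = {
    "mart_a.customers": [(1, "Ann"), (2, "Bob")],
    "mart_b.customers_copy": [(2, "Bob"), (1, "Ann")],
    "mart_a.orders": [(10, 1)],
}
print(find_redundant(tables))
```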

Page 7: Discovering Big Data in the Fog: Why Catalogs Matter

Waterline Data

• Initiative:
  • Optimize the quality and cost of customer credit score services

• Challenges:
  • Local control of data across 11 countries with little centralized leverage or visibility
  • Inaccurate data costs money and jeopardizes customer loyalty
  • Need to keep up with real-time changes (e.g., a court case is settled and affects a company’s rating)

• Why Waterline:
  • Only vendor that truly automates data tagging and lineage discovery
  • Works across multiple data sources for tagging and search
  • Very fast and scalable: did in hours what previously took months
  • Easy to try (free download). Easy to use.
  • Integrates via APIs into existing business processes

Creditsafe is the world's most used provider of business credit reports and maintains the largest owned database with over 240 million companies worldwide.

Page 8: Discovering Big Data in the Fog: Why Catalogs Matter

How Waterline Data Catalog Works

Page 9: Discovering Big Data in the Fog: Why Catalogs Matter

Organize Curate

• Accept or reject tags

• Search and use data through the GUI and integration with 3rd party applications

• Collaborate and share “tribal data knowledge” through crowdsource ratings

• Automates data access control via tag based security

Data Professionals

A Unique Combination of Automation and Crowdsourcing

• Automatically and incrementally “fingerprint” data at scale by analyzing actual data

• Automatically tag (match) data fingerprints to glossary terms

• Match the unmatched terms through crowdsourcing

Discover

Business Professionals

Search | Rate | Collaborate

System learns and fine tunes matching algorithm
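
The slide describes a three-step loop: fingerprint the actual data, match fingerprints to glossary terms, and send whatever stays unmatched to people. The deck does not say how the fingerprints are computed, so the following is only a minimal sketch under stated assumptions: the `GLOSSARY` of regular expressions, the pattern-match-rate "fingerprint", and the 0.8 threshold are all hypothetical stand-ins.

```python
import re

# Hypothetical glossary: term -> pattern that values of that term tend to match.
GLOSSARY = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "us_zip_code": re.compile(r"^\d{5}(-\d{4})?$"),
}


def fingerprint(values):
    """Summarize a column as the fraction of non-empty values matching each pattern."""
    non_empty = [v for v in values if v]
    if not non_empty:
        return {}
    return {
        term: sum(1 for v in non_empty if pattern.match(v)) / len(non_empty)
        for term, pattern in GLOSSARY.items()
    }


def suggest_tag(values, threshold=0.8):
    """Return the best-matching glossary term, or None to defer to human review."""
    scores = fingerprint(values)
    if not scores:
        return None
    term, score = max(scores.items(), key=lambda item: item[1])
    return term if score >= threshold else None


# Columns that get None would go to the crowdsourcing step for manual tagging.
print(suggest_tag(["a@x.com", "b@y.org", "c@z.net"]))
print(suggest_tag(["hello", "world"]))
```

In this sketch, an accepted or rejected suggestion could be fed back to adjust the threshold or the patterns, which is one simple way to read "system learns and fine tunes".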

Page 10: Discovering Big Data in the Fog: Why Catalogs Matter

Open, Extensible Architecture

Data Sources

Teradata

Oracle

MySQL

Other Relational

Spark

HDFS/Hive

Amazon S3

Microsoft Azure

Relational Plugin Arch

JDBC (Profiling)

Logs (Usage)

Security (Permissions)

Security (Masking)

Business Glossary

REST API

Business Metadata

ETL Lineage

Data Security

Data Tags

Discover | Catalog | Search

Execution Environments

Smart Data Catalog

Analytics Environments

BI/Analytics Wrangling Other Apps

Search REST API

Page 11: Discovering Big Data in the Fog: Why Catalogs Matter

Demo

Page 12: Discovering Big Data in the Fog: Why Catalogs Matter

Waterline Data

Smart Data Catalog

• Automate Discovery & Search for Analytics

• Mitigate Data Compliance Penalties

• Reduce costs due to data redundancy

Data Professionals: Discover | Organize | Curate

Business Professionals: Search | Rate | Collaborate

Page 13: Discovering Big Data in the Fog: Why Catalogs Matter

THANK YOU

Page 14: Discovering Big Data in the Fog: Why Catalogs Matter

MakeBigDataWork.org

• Educational webinars on the things you need to do to “Make Big Data Work”

• Next 3 webinars are:
  • Data Cataloging
  • Data Wrangling
  • Big Data Intelligence

Page 15: Discovering Big Data in the Fog: Why Catalogs Matter

Why On Earth Would You Want a Catalog?

Robin Bloor, Ph.D.

Page 16: Discovering Big Data in the Fog: Why Catalogs Matter

The Catalog Idea

We find data and we process data.

Without a catalog, you cannot find the data.

Page 17: Discovering Big Data in the Fog: Why Catalogs Matter

Catalogs are Maps

• File Systems & Hierarchies
• Database Schemas
• Data Dictionaries
• MDM Glossaries
• The DNS
• Ontologies

The difference between these catalogs has a lot to do with who the user is and how they are trying to use the data.

Page 18: Discovering Big Data in the Fog: Why Catalogs Matter

Here is a Map

Page 19: Discovering Big Data in the Fog: Why Catalogs Matter

The Data Lake Idea

• Static data and data streams
• Real-time data ingest
• Data Governance
• Data Lake Mgt
• Analytics & BI
• Extracts

The data lake becomes the system of record

Analytics or BI Apps

Data Governance

Data Lake Mgt

Static Data Sources

Data Streams

To Database Engines

Data Marts

Other Apps

ETL

Data Lake

Ingest

Page 20: Discovering Big Data in the Fog: Why Catalogs Matter

The Full Picture (Logical Data Lake)

Data Cleansing

Data Security

Ingest

Metadata Mgt

Real-Time Apps

Transform & Aggregate

Search & Query

BI, Visual'n & Analytics

Other Apps

Data Lake Mgt

Data Governance

DATA LAKE

Archive

Life Cycle Mgt

Extracts

Servers, Desktops, Mobile, Network Devices, Embedded Chips, RFID, IoT, The Cloud, OSes, VMs, Log Files, Sys Mgt Apps, ESBs, Web Services, SaaS, Business Apps, Office Apps, BI Apps, Workflow, Data Streams, Social...

To Database Engines & Other Apps

Page 21: Discovering Big Data in the Fog: Why Catalogs Matter

Data Governance

System of record

Data provenance & lineage

Data cleansing

Data security

Data compliance

Data integrity

Data audit record

Data life-cycle mgt

Data meaning

Data Governance is a perpetual process

Page 22: Discovering Big Data in the Fog: Why Catalogs Matter

The MetaData Layer

There will always be a metadata layer of some kind. Without it there would be no processing at all.

The question is whether it will be organized.

Page 23: Discovering Big Data in the Fog: Why Catalogs Matter

• Is “the data lake” the source of most of your business?

• How semantic is your catalog? (And how semantic could it be?)

• How does Waterline integrate with data streams? (Lambda/Kappa architectures?) In other words, is this a streaming technology?

• What is the most ambitious project you’re involved in?

Page 24: Discovering Big Data in the Fog: Why Catalogs Matter

• How well do you play with others?

• Metadata management is (in my view) the most important aspect of data governance. What is the Waterline strategy for “global metadata management”?

• You’ve presented a broad software vision. Which type of customers are adopting this?