Discovering Big Data in the Fog: Why Catalogs Matter
Enjoy the pre-show banter!
Connecting the Right People to the Right Data
Waterline Overview Presentation
If you don’t know where the data is located, you can’t affect business outcomes
Where do I find the data I need to complete my business analysis?
How do I organize all the data the business analysts need?
Where is sensitive data? Where did it come from? Who should have access?
Where is redundant data located and how much can I eliminate?
Risk Management & Compliance | Self Service Data | Data Rationalization
Didn’t “Big Data” Solve These Problems?
Data Wrangling
Data Visualization
None of these technologies tell you:
Where do I find the data?
How do I organize it?
Where did it come from?
What is in the data?
Who can use it?
Waterline Smart Data Catalog: Answers the Key Questions
• Automatically discovers, organizes, tags and curates data & makes it available to search and find
• Consolidates and tracks data lineage
• Profiles data, presents statistics, presents crowdsourced ratings
• Identifies sensitive data and enables tag based access control
Where do I find the data?
Where did it come from?
What is in the data?
Who can use the data?
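To make "profiles data, presents statistics" concrete, here is a minimal sketch of column profiling. It is an illustration only, not Waterline's implementation; the function name `profile_column` and the sample values are hypothetical.

```python
from collections import Counter

def profile_column(values):
    """Return simple profile statistics for one column of raw values."""
    # Treat None and empty strings as missing values.
    non_null = [v for v in values if v not in (None, "")]
    counts = Counter(non_null)
    return {
        "rows": len(values),
        "nulls": len(values) - len(non_null),
        "distinct": len(counts),
        "most_common": counts.most_common(1)[0][0] if counts else None,
    }

# Hypothetical sample column of country codes.
sample = ["DE", "FR", "DE", None, "", "UK"]
profile = profile_column(sample)
print(profile)
```

A catalog would typically store statistics like these next to each field so that users can judge a data set before opening it.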
Connect the Right People to the Right Data
Data Professionals: Discover | Organize | Curate
Business Professionals: Search | Rate | Collaborate
Inventory data and enable users to search, find and use data across all data sets
Demonstrate auditable data lineage. Get compliant data quickly into use. React quickly to regulatory requests.
Identify redundant databases, marts, tables & schemas to eliminate or move
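One simple way to spot redundant tables is to hash their contents and group tables whose hashes match. The sketch below assumes small in-memory tables and exact duplicates; the table names and the helpers `table_digest` and `find_redundant` are hypothetical, and a real catalog would likely rely on sampling or fingerprints rather than full scans.

```python
import hashlib
from collections import defaultdict

def table_digest(rows):
    """Hash a table's contents independent of row order."""
    h = hashlib.sha256()
    for row in sorted(repr(tuple(r)) for r in rows):
        h.update(row.encode("utf-8"))
        h.update(b"\n")  # separator so adjacent rows cannot run together
    return h.hexdigest()

def find_redundant(tables):
    """Group table names whose contents hash to the same digest."""
    groups = defaultdict(list)
    for name, rows in tables.items():
        groups[table_digest(rows)].append(name)
    return [sorted(names) for names in groups.values() if len(names) > 1]

# Hypothetical tables: the second is a reordered copy of the first.
tables = {
    "sales.customers": [(1, "Acme"), (2, "Globex")],
    "mart.customers_copy": [(2, "Globex"), (1, "Acme")],
    "sales.orders": [(10, 1), (11, 2)],
}
duplicates = find_redundant(tables)
print(duplicates)
```

Each group in the result is a candidate set of copies to eliminate or move.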
What We Do
Data Lifecycle Management
Discover • Catalog • Analyze • Secure • Rationalize
Risk Management & Compliance | Self Service Data | Data Rationalization
Waterline Data
• Initiative:
  • Optimize the quality and cost of customer credit score services
• Challenges:
  • Local control of data across 11 countries with little centralized leverage or visibility
  • Inaccurate data costs money and jeopardizes customer loyalty
  • Need to keep up with real-time changes (e.g., a court case is settled and affects a company’s rating)
• Why Waterline:
  • Only vendor that truly automates data tagging and lineage discovery
  • Works across multiple data sources for tagging and search
  • Very fast and scalable: did in hours what previously took months
  • Easy to try (free download). Easy to use.
  • Integrates via APIs into existing business processes
Creditsafe is the world's most used provider of business credit reports and maintains the largest owned database with over 240 million companies worldwide.
How Waterline Data Catalog Works
Organize | Curate
• Accept or reject tags
• Search and use data through GUI and integration to 3rd party applications
• Collaborate and share “tribal data knowledge” through crowdsource ratings
• Automates data access control via tag based security
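Tag based security can be pictured as a policy keyed on tags rather than on individual tables or columns. The following is a minimal sketch under that assumption; the tags, roles, column names, and the helpers `can_read` and `readable_columns` are all hypothetical and do not describe Waterline's actual policy model.

```python
# Hypothetical policy: which roles may read data carrying a given tag.
TAG_POLICY = {
    "PII": {"compliance_officer"},
    "Financial": {"compliance_officer", "analyst"},
}

# Hypothetical catalog entries: column name -> tags attached to it.
catalog = {
    "customers.ssn": {"PII"},
    "customers.revenue": {"Financial"},
    "customers.country": set(),
}

def can_read(role, column_tags):
    """Allow access only if the role is permitted for every restricted tag."""
    return all(role in TAG_POLICY[tag] for tag in column_tags if tag in TAG_POLICY)

def readable_columns(role):
    """List the catalog columns a role may read, in sorted order."""
    return sorted(col for col, tags in catalog.items() if can_read(role, tags))

print(readable_columns("analyst"))
print(readable_columns("compliance_officer"))
```

Because the policy refers only to tags, tagging a newly discovered column as sensitive restricts it without anyone writing a new rule for that column.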
Data Professionals
A Unique Combination of Automation and Crowdsourcing
• Automatically and incrementally “fingerprint” data at scale by analyzing actual data
• Automatically tag (match) data fingerprints to glossary terms
• Match the unmatched terms through crowdsourcing
Discover
Business Professionals
Search | Rate | Collaborate
System learns and fine tunes matching algorithm
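The fingerprint, match, and crowdsource loop described on this slide can be sketched as a toy example. This is a stand-in technique (character-shape fingerprints compared with Jaccard similarity), not Waterline's algorithm; the glossary terms, the threshold, and every function name here are assumptions made for illustration.

```python
import re

def shape(value):
    """Reduce a value to its character shape, e.g. '415-555-0100' -> '9-9-9'."""
    s = re.sub(r"[0-9]+", "9", str(value))   # any run of digits becomes one '9'
    return re.sub(r"[A-Za-z]+", "A", s)       # any run of letters becomes one 'A'

def fingerprint(values):
    """A column's fingerprint is the set of shapes found in its values."""
    return frozenset(shape(v) for v in values)

def jaccard(a, b):
    """Set overlap between two fingerprints, from 0.0 to 1.0."""
    return len(a & b) / len(a | b) if a | b else 0.0

def suggest_tag(values, glossary, threshold=0.5):
    """Return (term, score) for the best match, or (None, score) if too weak."""
    fp = fingerprint(values)
    best_term, best_score = None, 0.0
    for term, term_fp in glossary.items():
        score = jaccard(fp, term_fp)
        if score > best_score:
            best_term, best_score = term, score
    return (best_term, best_score) if best_score >= threshold else (None, best_score)

def accept_tag(term, values, glossary):
    """A user confirms a tag: fold the column's shapes into the term's fingerprint."""
    glossary[term] = glossary.get(term, frozenset()) | fingerprint(values)

# Hypothetical glossary seeded from example values.
glossary = {
    "US Phone Number": fingerprint(["555-123-4567"]),
    "Email Address": fingerprint(["ab@cd.com"]),
}

known = ["415-555-0100", "212-555-0199"]
unknown = ["(415) 555-0100"]

first = suggest_tag(known, glossary)          # matched automatically
before = suggest_tag(unknown, glossary)       # unmatched: goes to the crowd
accept_tag("US Phone Number", unknown, glossary)
after = suggest_tag(unknown, glossary)        # matched after user feedback
print(first, before, after)
```

The last three calls mirror the slide: an automatic match, an unmatched column routed to users, and a matcher that improves once a user accepts a tag.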
Open, Extensible Architecture
[Architecture diagram]
Data Sources: Teradata | Oracle | MySQL | Other Relational | Spark | HDFS/Hive | Amazon S3 | Microsoft Azure
Relational | Plugin Arch | JDBC (Profiling) | Logs (Usage) | Security (Permissions) | Security (Masking)
Business Glossary | REST API | Business Metadata | ETL | Lineage | Data Security | Data Tags
Discover | Catalog | Search
Execution Environments
Smart Data Catalog
Analytics Environments: BI/Analytics | Wrangling | Other Apps
Search REST API
Demo
Waterline Data
Smart Data Catalog
• Automate Discovery & Search for Analytics
• Mitigate Data Compliance Penalties
• Reduce costs due to data redundancy
Data Professionals: Discover | Organize | Curate
Business Professionals: Search | Rate | Collaborate
THANK YOU
MakeBigDataWork.org
• Educational webinars on the things you need to do to “Make Big Data Work”
• Next 3 webinars are:
  • Data Cataloging
  • Data Wrangling
  • Big Data Intelligence
Why On Earth Would You Want a Catalog?
Robin Bloor, Ph.D.
The Catalog Idea
We find data and we process data.
Without a catalog, you cannot find the data.
Catalogs are Maps
• File Systems & Hierarchies
• Database Schemas
• Data Dictionaries
• MDM Glossaries
• The DNS
• Ontologies
The difference between these catalogs has a lot to do with who the user is and how they are trying to use the data.
Here is a Map
The Data Lake Idea
• Static data and data streams
• Real-time data ingest
• Data Governance
• Data Lake Mgt
• Analytics & BI
• Extracts
The data lake becomes the system of record
[Diagram: Static Data Sources | Data Streams | Ingest | Data Lake | Data Governance | Data Lake Mgt | ETL | Analytics or BI Apps | To Database Engines | Data Marts | Other Apps]
The Full Picture (Logical Data Lake)
[Diagram: DATA LAKE | Ingest | Data Cleansing | Data Security | Metadata Mgt | Data Governance | Data Lake Mgt | Life Cycle Mgt | Archive | Extracts | Transform & Aggregate | Search & Query | Real-Time Apps | BI, Visual'n & Analytics | Other Apps | To Database Engines & Other Apps]
Servers, Desktops, Mobile, Network Devices, Embedded Chips, RFID, IoT, The Cloud, OSes, VMs, Log Files, Sys Mgt Apps, ESBs, Web Services, SaaS, Business Apps, Office Apps, BI Apps, Workflow, Data Streams, Social...
Data Governance
System of record
Data provenance & lineage
Data cleansing
Data security
Data compliance
Data integrity
Data audit record
Data life-cycle mgt
Data meaning
Data Governance is a perpetual process
The MetaData Layer
There will always be a metadata layer of some kind. Without it there would be no processing at all.
The question is whether it will be organized.
• Is “the data lake” the source of most of your business?
• How semantic is your catalog? (And how semantic could it be?)
• How does Waterline integrate with data streams (Lambda/Kappa architectures)? In other words, is this a streaming technology?
• What is the most ambitious project you’re involved in?
• How well do you play with others?
• Metadata management is (in my view) the most important aspect of data governance. What is the Waterline strategy for “global metadata management”?
• You’ve presented a broad software vision. Which types of customers are adopting it?