TRANSCRIPT
Exposition of concept and context; discussion of repercussions
SCOTT LEE Data Principal EMC² Global Professional Services +1-312-497-8853 | [email protected]
DAMA Chicago, 15 April 2015
Copyright © 2015 Scott Lee. All rights reserved.
Introductions, assumptions
Problem statement
Background & definitions
Contrasting Data-aaS and EDW
– Delivery models
– Processes
– Actors / motivations
What this means for you
Q&A
SEE MORE COMPLETELY
“I want to see all of the data – not just monthly rolling aggregates.”
ANALYZE MORE DEEPLY
“I want to understand deep correlations and be able to see what might happen if I change something.”
ACT MORE PRECISELY
“I want to surgically target my client actions to produce the most value possible.”
OPEN • Connect, leverage, provision • Exposed data catalogue
SEMANTIC • Conceptual granularity • Emphasizes meaning, not model
CURIOUS • Data discovery, on-boarding • Automated stewardship tasking
MAGNETIC • Simple walk-up data provisioning • Easy ingest attracts best data
AGILE • Embracing constant change • Avoid brittle infrastructure
Scale
Exponential data growth cannot be matched with linear personnel growth; the gap must be addressed
Complexity
External data, metadata, unstructured data, and ever increasing use cases and contexts
Increasing Expectations
Executives and LOB information workers becoming more savvy, more demanding
BI and DW patterns are Artisanal; Big Data needs to be Industrial…
[Figure: workloads plotted by accuracy / veracity (50% "best guess" through 80%, 90%, 95%, 99% to 100% "no error tolerance") against speed / latency (1 μs to 100M s, with 1 sec, 1 minute, 1 hour, 1 day, 1 week, 1 month, and 1 year marked). Workloads shown: Analytic Modeling, Data Warehouse, Information Research & Development, Stream & Transaction Processing, Continuous Event Processing, Operational Reporting, Executive Dashboard, Financial Reporting, Strategic Planning, Long-term Trending, Business Activity Monitoring.]
Key Observation
I would often rather have a directionally correct answer in five minutes than a guaranteed correct answer in six months.
• Rigid model makes change complex
• Lost context and business meaning for sophisticated analytics
• Integration of unstructured data
• Time to move large data volumes
• High cost of redundant infrastructure
• Expensive infrastructure and software
A self-contained data platform; provided on-demand; bundling data and software for access and interpretation in a single package
Ease: Simplicity of data access into a single model without requiring knowledge of underlying data objects, integration, etc.
Agility: Immediate access yields shorter prototyping time and faster solution time-to-market
Cost-effectiveness: Offsets cost of managing and housing complex data sets separately / redundantly
Quality: Single point of update, collaborative data management from business & IT
[Figure: DaaS package diagram. Labels: Other data, Data Adapter, Data Container, Apps, Analytics, Models, Users.]
In DaaS, the unit of service is leased access rights for a specific Data Asset
• Traceability through data lineage across all data processing steps and stages
• Profiled and baselined data quality, by field; vetted actively by knowledgeable librarian
• Contextualized and presented with metadata sufficient for business understanding, leverage
• Searchable: indexed upon ingestion, maintained in business data catalog; as easy to find as any retail product on Amazon.com
• Secured against all inappropriate and wrongful access through trust policies, masking, encryption, ACLs, and audit
Image credit: Knowledgent infographics
Traditional BI / DW: 3-6+ months
Req’s & design
• Specify data required in Report
• Document timing, delivery, lag requirements
• Map fields to DW data model
• Identify data gaps
Build solution
• Back-trace gaps to data sources
• Extend DW data model to house new elements
• Build ETL to load sources into DW
• Create report showing required data using SQL
Deliver BI report

Data as-a-Service (DaaS): 2 months … 2 minutes
Research
• Log in to Data Catalog
• Search or browse for relevant Data Assets
• Initiate Lease for selected Data Assets
• Provision leased data to private Analytic Sandbox
Explore data
• Use any tools to access data
• Navigate metadata schema details
• Link data together, filter, aggregate
• Develop analytical models
Find answer; iterate
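The DaaS research steps (search the catalog, lease an asset, provision it to a private sandbox) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names, the `provision` helper, the in-memory dictionary standing in for a sandbox, and the sample assets are all hypothetical and do not come from the talk.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass
class DataAsset:
    name: str
    description: str
    rows: list[dict]


@dataclass
class Lease:
    asset: DataAsset
    lessee: str
    expires: date


@dataclass
class Catalog:
    assets: list[DataAsset] = field(default_factory=list)

    def search(self, term: str) -> list[DataAsset]:
        """Return assets whose name or description mentions the term."""
        needle = term.lower()
        return [a for a in self.assets
                if needle in a.name.lower() or needle in a.description.lower()]

    def lease(self, asset: DataAsset, lessee: str, expires: date) -> Lease:
        # Assumption: the lease is granted immediately. A real catalog would
        # route the request to the asset owner or steward for approval.
        return Lease(asset, lessee, expires)


def provision(lease: Lease, sandbox: dict[str, list[dict]]) -> None:
    """Copy the leased rows into the sandbox (a plain dict in this sketch)."""
    sandbox[lease.asset.name] = [dict(row) for row in lease.asset.rows]


# Illustrative data only.
catalog = Catalog([
    DataAsset("vendor_credit", "Vendor credit terms by region",
              [{"vendor": "Acme", "limit": 5000}]),
    DataAsset("sales_pipeline", "Open opportunities by quarter", []),
])
hits = catalog.search("vendor")
sandbox: dict[str, list[dict]] = {}
provision(catalog.lease(hits[0], "robert", date(2015, 5, 1)), sandbox)
```

The sketch uses copy-style provisioning; federated access would hand the sandbox a reference to the asset instead of a copy.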
Ingest
• Discovery
• Classification
• Minimum Viable Product
Curate
• Roles & Governance
• Metadata Stewardship
• Asset Enhancement
Consume
• Data Catalog / Info Architecture
• Provisioning & Lease
• Sandbox / Tools Management
Apply standards; ensure Governance readiness; measure, assess quality
EXTERNAL
How data assets are found and nominated for inclusion in the Data Catalog; Consumption should drive Discovery-Ingest prioritization
[Figure: Discovery. NETWORK: search across network for likely data containers (FILE, TABLE, XML). Each container is examined for a. Structure, b. Context, c. Semantics, d. Contents. Other labels: Discovery Engine, LIBRARIAN / OPERATOR, DATA LAKE, CATALOG, entry, ASSET with METADATA, Metadata & DQ.]
Bringing raw data into the Catalog; refinement into Data Assets; metadata standards, data enhancement, and “just enough” data governance
[Figure: Ingest and curation. Stages: read-copy, decompose, parse, ingest, steward, publish. Curation steps: enrich • profile • link • quality • semantics. Metadata levels: data object (# records, size (MB)); data instance (row); data element (column: field name, description, data type, length, precision, classifications …). Other labels: DATA (FILE, TABLE, XML), DATA LAKE, CATALOG, folio, entry, DATA ASSET with METADATA, data request, asset curation. Roles: catalog librarian, data custodian.]
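The profiling step can be illustrated with a small Python sketch that derives object-level and element-level metadata (# records, field name, data type, length) from raw rows. The function and class names are hypothetical, the rows are assumed to arrive as dictionaries, and "length" is taken to mean the longest value when rendered as text. The null count is an extra quality measure added for illustration and is not named on the slide.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ElementMetadata:
    field_name: str
    data_type: str
    length: int       # assumption: longest value when rendered as text
    null_count: int   # illustrative quality measure, not from the slide


@dataclass
class ObjectMetadata:
    name: str
    record_count: int
    elements: list[ElementMetadata]


def profile(name: str, rows: list[dict]) -> ObjectMetadata:
    """Build catalog metadata for one data object from its raw rows."""
    # Collect field names in first-seen order.
    field_names: list[str] = []
    for row in rows:
        for key in row:
            if key not in field_names:
                field_names.append(key)

    elements = []
    for fname in field_names:
        values = [row.get(fname) for row in rows]
        present = [v for v in values if v is not None]
        types = sorted({type(v).__name__ for v in present})
        elements.append(ElementMetadata(
            field_name=fname,
            data_type="/".join(types) if types else "unknown",
            length=max((len(str(v)) for v in present), default=0),
            null_count=len(values) - len(present),
        ))
    return ObjectMetadata(name, len(rows), elements)
```

A librarian or steward would then add what profiling cannot infer, such as descriptions and classifications.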
User walk-up to Catalog, choose Data Asset (Fields), request Data Lease from Owner / Steward, Lease approval, Provision (copy or federated access) into generated Sandbox
[Figure: Provisioning and lease. Labels: DATA LAKE, CATALOG, DATA ASSET with METADATA, DATA, SANDBOX, LIFECYCLE MANAGEMENT. Steward and asset owner actions: publish, grant, revoke, delist. Data seeker actions: search, browse, choose, lease, access.]
DATA LEASE
• Access expires on 5/1
• Max 5 inquiries per hour
• Max 3GB transfer per day
• Only use for purposes of internal non-collaborative research
• Combining >2 PII fields disallowed
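The machine-checkable lease terms listed above (expiry, inquiries per hour, transfer per day, PII combination) could be enforced along the following lines. This is a hedged sketch, not the talk's implementation: the `DataLease` class and its `check` method are hypothetical, "per hour" and "per day" are treated as rolling windows, and 3GB is taken as 3 × 1024³ bytes. The purpose-of-use term is contractual and is not modeled.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class DataLease:
    expires: datetime
    pii_fields: frozenset = frozenset()
    max_inquiries_per_hour: int = 5
    max_bytes_per_day: int = 3 * 1024**3   # assumption: 3GB read as 3 GiB
    max_pii_fields: int = 2
    # (timestamp, bytes) for every inquiry that was allowed
    _log: list = field(default_factory=list)

    def check(self, now: datetime, requested_fields, est_bytes: int):
        """Return (allowed, reason); record the inquiry when it is allowed."""
        if now >= self.expires:
            return False, "lease expired"
        pii = self.pii_fields & set(requested_fields)
        if len(pii) > self.max_pii_fields:
            return False, "too many PII fields combined"
        # Rolling one-hour window of previously allowed inquiries.
        last_hour = [t for t, _ in self._log if now - t < timedelta(hours=1)]
        if len(last_hour) >= self.max_inquiries_per_hour:
            return False, "hourly inquiry limit reached"
        # Rolling 24-hour window of transferred bytes.
        bytes_today = sum(b for t, b in self._log
                          if now - t < timedelta(days=1))
        if bytes_today + est_bytes > self.max_bytes_per_day:
            return False, "daily transfer limit reached"
        self._log.append((now, est_bytes))
        return True, "ok"
```

Checking expiry and PII rules before the usage counters means a rejected inquiry never consumes quota.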
Comparing Business Intelligence with advanced / predictive analytics (Data Science)
[Figure: BUSINESS VALUE (Low to High) plotted against TIME (Past to Future), with regions labeled Business Intelligence and Data Science.]

Predictive Analytics and Data Mining (Data Science)
Typical Techniques and Data Types
• Optimization, predictive modeling, forecasting, statistical analysis
• Structured/unstructured data, any types of sources, very large data sets
Common Questions
• What if…?
• What’s the optimal scenario for our business?
• What will happen next? What if these trends continue? Why is this happening?

Business Intelligence
Typical Techniques and Data Types
• Standard and ad hoc reporting, dashboards, alerts, queries, details on demand
• Structured data, traditional sources, manageable data sets
Common Questions
• What happened last quarter?
• How many did we sell?
• Where is the problem? In which situations?
Robert X.
Role: Sales Analyst
Tools: Desktop (Primary), mobile
Goals
Behaviors
“My focus is getting reliable and relevant data fast. People depend on me to deliver accurate reports showing how our planned strategies will increase revenue.”
• Well-versed in data, expert in business intelligence and analytic tools
• Knows how to search, retrieve and assemble data in many forms
• Recently hired into his role at Blue Data Technologies
• Strong knowledge of the technology industry
• Studied sales and marketing, worked as a sales associate at a software vendor prior to Blue Data
• Some familiarity with querying and languages to facilitate data discovery
• Looking to move into a more senior role of advanced modeling and predictive analytics
• Working on his first project to get vendor data for a vendor credit project
General
• I am a thought leader and analyst.
• I support the business planning functions
Function and Role
• Spends a considerable amount of time retrieving & assembling data from different sources
• Performs a range of data blending and preparation tasks to create dashboards and data visualizations
• Uses spreadsheets & PowerPoint to provide interpretation & facilitate discussions
Collaboration / Communication
• I keep on top of things through industry conferences, vendor briefings, and the internet
• I communicate with sales management, helping them to understand the meaning of data patterns and predict future outcomes
• Spends majority of time with business intelligence and analytic tools
• Leans heavily on internal searches and his department and team to figure out where to find things
• Persistent in tracking down the data assets and sources needed for reporting
Background
What if…
– Business stakeholders didn’t need to engage IT to update reports every time source data fields are changed?
– Data acquisition costs could be easily calculated and shared back to the LOBs using the information most?
– Analysts could quickly explore and mash up data assets, then share their results with peers in other BUs?
from 800 to eighty THOUSAND
Staging, raw, “landing zone”, metadata focused

Sandbox: “place to do exploratory analytics”
Analytic Sandbox
• Exploratory, ad hoc
• Unpredictable loads
• Experimental, iterative
• Loosely governed
• Bring your own tools

BI: “place to do at-scale data delivery”
EDW
• Production
• Predictable load
• SLA-driven
• Heavily governed
• Standard tools

[Figure labels: ERP, CRM, EXTERNAL, ETL, MDM, DQ]
Data Lake
Data Shopping Cart UX Prototype
[Figure: Analytic Sandbox Provisioning Process Flow. Data Sources: ERP, Apps, Files, SaaS, Social Media, Relational DB, Legacy EDW. Other labels: Real Time, Data Feed, Schema, Batch; Data Management, Data Virtualization, Transformation; Data Lake, Raw Data, HDFS; Data Catalog | Metadata | DG & Stewardship; Data Catalog View; Sandboxes, Data Provisioning, Provisioned Schema, Query Time Join; Public Cloud, Private Cloud; Approve Request; BI Report, Analytics.]
A real DaaS implementation from EMC IT
[Figure: architecture diagram. Components: SFDC, SFDC Adapter, RabbitMQ, Global IDs, Attivio, Greenplum, Activity BPM, Firewall, Data catalog, Index, External Indexes, External table, Physical schema, Provisioned Schema, Provisioned Sandboxes, Data analytics, Incremental Data Update. Data & schema flow: Schema, Queue Data. Provisioning work flow: Create data access request, Approve data access, Provision data access. Actors: Requestor, Approver.]
Provisioning Process Workflow
Data Catalog
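The three provisioning workflow steps named in the diagram (create data access request, approve data access, provision data access) can be read as a small state machine. The Python sketch below is illustrative only: the class, the state names, the rule that a requestor cannot approve their own request, and the sandbox schema naming are assumptions, not details of the EMC IT system.

```python
from enum import Enum


class State(Enum):
    REQUESTED = "requested"
    APPROVED = "approved"
    PROVISIONED = "provisioned"


class AccessRequest:
    """One data access request moving through request, approval, provisioning."""

    def __init__(self, requestor: str, asset: str):
        self.requestor = requestor
        self.asset = asset
        self.approver = None
        self.state = State.REQUESTED

    def approve(self, approver: str) -> None:
        if self.state is not State.REQUESTED:
            raise ValueError(f"cannot approve a request in state {self.state.value}")
        # Assumption: Requestor and Approver must be different people.
        if approver == self.requestor:
            raise PermissionError("requestor cannot approve their own request")
        self.approver = approver
        self.state = State.APPROVED

    def provision(self) -> str:
        if self.state is not State.APPROVED:
            raise ValueError(f"cannot provision a request in state {self.state.value}")
        self.state = State.PROVISIONED
        # Hypothetical naming scheme for the provisioned sandbox schema.
        return f"sandbox_{self.requestor}_{self.asset}"
```

Keeping the transitions explicit makes it impossible to provision a sandbox for a request that was never approved.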
Data as-a-Service…
…is a 21st century alternative to the traditional Data Warehouse information delivery model.
…is the “killer app” for the Data Lake.™
…is a rich, automated, trustworthy, and just-in-time mechanism to quickly answer business questions.
…drives self-service analytics beyond data scientists to all stakeholders.
• Rapid access to data securely enables monetization
• Cost-savings over traditional data mart builds
• Access near real time data: detect trends
• Better management of data assets
• Shorter cycle-time to deploy enabler technologies
• Better utilization of compute resources
• Charge-back based on utilization