The Role of Data Wrangling in Driving Hadoop Adoption


Upload: inside-analysis

Post on 14-Apr-2017


TRANSCRIPT

Grab some coffee and enjoy the pre-show banter before the top of the hour!

The Briefing Room

The Role of Data Wrangling in Driving Hadoop Adoption

Twitter Tag: #briefr The Briefing Room

Welcome

Host: Eric Kavanagh

@eric_kavanagh


Mission

- Reveal the essential characteristics of enterprise software, good and bad
- Provide a forum for detailed analysis of today’s innovative technologies
- Give vendors a chance to explain their product to savvy analysts
- Allow audience members to pose serious questions... and get answers!


Topics

September: HADOOP 2.0

October: DATA MANAGEMENT

November: ANALYTICS


The Great Divide

- Close the Gap
- Empower Business Users
- Shift Focus of IT
- Developers are Third Leg


Analyst: Mark Madsen

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net


Trifacta

- Trifacta offers a platform for data transformation and preparation
- The interface is rich in visualization and provides a productive data wrangling capability
- The platform also includes access to raw data in Hadoop, providing analysts and data scientists with secure, governed data


Guests:

Will Davis, Director of Product Marketing, Trifacta

Alon Bartur, Principal Product Manager, Trifacta

Trifacta: The Role of Data Wrangling In Driving Hadoop Adoption

Variety = Data is Messy

When Data is Messy… Analysis is More Complicated

Question → Analysis → Insight

Messy Data Requires Data Wrangling

Question → Analyze → Insight

Data Wrangling: Discover, Structure, Clean, Enrich, Distill
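
The wrangling steps named above (Discover, Structure, Clean, Enrich, Distill) can be sketched in a few lines of Python. This is a minimal illustration, not Trifacta's implementation: the pipe-delimited record format, the field names, and the `REGION_BY_STORE` lookup are all hypothetical.

```python
from collections import Counter

# Hypothetical messy input: pipe-delimited sales records with inconsistent
# casing, stray whitespace, and one non-numeric amount.
RAW_LINES = [
    "2014-09-01| store-7 |  19.99",
    "2014-09-01|STORE-7|5.00",
    "2014-09-02|store-9|n/a",
    "2014-09-02|store-9| 12.50 ",
]

# Hypothetical reference data used in the enrich step.
REGION_BY_STORE = {"store-7": "west", "store-9": "east"}


def discover(lines):
    """Discover: profile the raw data, here by counting fields per line."""
    return Counter(len(line.split("|")) for line in lines)


def structure(lines):
    """Structure: split each delimited line into named fields."""
    names = ("date", "store", "amount")
    return [dict(zip(names, line.split("|"))) for line in lines]


def clean(rows):
    """Clean: normalize text and drop rows whose amount is not numeric."""
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"].strip())
        except ValueError:
            continue  # skip malformed amounts such as "n/a"
        cleaned.append({
            "date": row["date"].strip(),
            "store": row["store"].strip().lower(),
            "amount": amount,
        })
    return cleaned


def enrich(rows):
    """Enrich: add a region column from the reference data."""
    return [
        {**row, "region": REGION_BY_STORE.get(row["store"], "unknown")}
        for row in rows
    ]


def distill(rows):
    """Distill: aggregate down to the figures the analysis needs."""
    totals = {}
    for row in rows:
        region = row["region"]
        totals[region] = round(totals.get(region, 0.0) + row["amount"], 2)
    return totals


profile = discover(RAW_LINES)
totals = distill(enrich(clean(structure(RAW_LINES))))
```

Each step is a plain function over rows, so the same chain can be rerun unchanged whenever new raw lines arrive.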

The Bottleneck

DATA PRODUCT Simplicity

DATA SOURCE Complexity

The Bottleneck on Hadoop

IT: Ingestion, Storage, Processing

LOB: Analysis & Consumption

Data sources: Business System Data, Machine Generated Data, Third Party Data

Tools: Java, Python, R, Pig, etc.

How do you move from here?

To here?

80% of the work in any data project is preparing the data for analysis

Breakdown of Communication Between IT & LOB

LOB: How can I access the data in Hadoop?

IT: What do you want to analyze?

LOB: I can’t tell you until I see the data – let me see the data first.

IT: I can’t just point you to the raw data – you’ll need to tell me.

Conventional Approaches Inhibit User Empowerment

Hand-Coding

Technical Workflow Mapping

Bringing Hadoop to an Analyst’s Fingertips

“I want direct access to the raw data so I can actually see the content of different datasets to define my analytic requirements.”

JOHN, DATA ANALYST

Wrangle Data Using This?


Empowering Analysts Requires a New User Experience

It’s All About The Experience

Interact, Predict, Preview


Demo

Analyst Workflow on Hadoop


1. Register Hadoop Data Sets in Trifacta (HDFS)
2. Visualize, Interact & Define Transformation Script (HDFS)
3. Execute Script on Entirety of Data Set at Scale in Hadoop (HDFS; execution in Pig or Spark)
4. Select Transformation Output Format & Location
   Hadoop: HDFS (Parquet or Avro), Table in HCatalog
   Analytic Tools: Tableau, R, etc.
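
The four steps above amount to designing a transformation on a sample and then replaying it over the full data set. The sketch below illustrates that pattern in plain Python. It is not Trifacta's script language or API: the step functions, the sample size, the in-memory data set, and the CSV output are all assumptions standing in for a Pig or Spark job that reads from and writes to HDFS.

```python
import csv
import io
import itertools


def take_sample(records, n=3):
    """Step 2: work interactively on the first n records only."""
    return list(itertools.islice(records, n))


# A "transformation script" here is just an ordered list of row functions.
# Each function returns a new row, or None to drop the row.
def split_name(row):
    first, _, last = row["name"].partition(" ")
    return {**row, "first": first, "last": last}


def drop_missing_city(row):
    return row if row.get("city") else None


SCRIPT = [split_name, drop_missing_city]


def run_script(script, records):
    """Step 3: replay the same script over every record, lazily."""
    for row in records:
        for step in script:
            row = step(row)
            if row is None:
                break
        if row is not None:
            yield row


def write_csv(rows, fields):
    """Step 4: pick an output format; CSV stands in for Parquet or Avro."""
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=fields, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Step 1: a registered data set, represented by an in-memory list.
FULL_DATA = [
    {"name": "Ada Lovelace", "city": "London"},
    {"name": "Alan Turing", "city": ""},
    {"name": "Grace Hopper", "city": "New York"},
    {"name": "Edgar Codd", "city": "San Jose"},
]

# Preview the script on a sample, then run it over everything.
preview = list(run_script(SCRIPT, take_sample(iter(FULL_DATA))))
result = write_csv(run_script(SCRIPT, FULL_DATA), ["first", "last", "city"])
```

Keeping the script as data, separate from the engine that runs it, is what lets the same steps be previewed on a sample and later executed at scale.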

QUESTIONS?

SIGN UP FOR A FREE TRIAL AT TRIFACTA.COM/TRIAL

THANK YOU!


Perceptions & Questions

Analyst: Mark Madsen

© Third Nature Inc.

Analyst comments and questions

Ideas about how we make data available are changing

Making data available is not the same as enabling its use

From scarcity to abundance

All the data

Common, typed, tabular data

The bottleneck is us

The old problem was access, the new problem is analysis

Changed design assumption: analysis isn’t read-only

The results of analysis can, and often do, feed back into the system from which they originate.

Much of the data is being read, written and processed in real time.

Our design point in IT was not changing tables and ephemeral patterns.

In a reporting world data and processing are bounded

Schema

No consideration for feedback loops and change

Processing only happens here

Carefully controlled SQL-only access

Nobody creates new information

Sources few and well understood

Complex DI is controlled by IT

Schemas are few and designed

Tools are authorized, few in number and kind

One-way flow

In an analysis world flow is unbounded and continuous

Feedback loops allowed

End-of-analysis dataset may be start of a BI dataset

Continuous data integration and delivery

Files are back as both input and storage

Minimal barrier of / control on collection

Areas of provisioned data

Any shape in, rectangles out
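
"Any shape in, rectangles out" can be illustrated with a small flattening routine. This is a minimal sketch, assuming nested JSON events as input; the event layout and the dotted column naming are hypothetical choices, not something the slides specify.

```python
import json


def flatten(value, prefix=""):
    """Flatten nested dicts into one dict with dotted column names."""
    flat = {}
    for key, item in value.items():
        name = f"{prefix}{key}"
        if isinstance(item, dict):
            flat.update(flatten(item, prefix=f"{name}."))
        else:
            flat[name] = item
    return flat


def to_rectangle(documents):
    """Return (columns, rows) where every row has the same columns."""
    flat_docs = [flatten(doc) for doc in documents]
    columns = sorted({col for doc in flat_docs for col in doc})
    # Missing values become None so that every row has the same width.
    rows = [[doc.get(col) for col in columns] for doc in flat_docs]
    return columns, rows


# Hypothetical input: one JSON document per line, with differing shapes.
RAW = """
{"user": {"id": 1, "geo": {"country": "US"}}, "event": "click"}
{"user": {"id": 2}, "event": "view", "ms": 340}
"""

docs = [json.loads(line) for line in RAW.strip().splitlines()]
columns, rows = to_rectangle(docs)
```

The output is a fixed set of columns plus equally wide rows, which is the tabular form that SQL and BI tools expect.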


The model and reality of ETL: one-way pipes

DI BI

Our methods tell us that data integration and analysis are separate, and schema comes first as the point of synchronization between them.

Schema


Schema

Data isn’t just source or target, it’s a continuum

Unusable data that needs engineering: ETL

Data that can be used: BI

Fuzzy areas of data that need engineering and / or composing: exploration, blending & discovery

Food supply chain: an analogy for data

Multiple contexts of use, differing quality levels

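Tools were designed with data model assumptions

[Figure: quadrant chart. Vertical axis: source data model complexity (simple to complex). Horizontal axis: target data model complexity (simple to complex). The four quadrants are described below.]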

Blending

Selectively linking and changing data, producing a simpler data model as output

ETL

Multiple complex source models, large complex target model

Application integration

Basic movement of data from one place to another, minimal changes to data

Processing & Analytics

Deriving new data from a relatively simple dataset (like an event stream)
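
Blending, as described above, links selected parts of several sources and emits a simpler model than any of its inputs. The sketch below shows one way to read that in plain Python; the two data sets, their field names, and the choice of output columns are hypothetical.

```python
# Hypothetical source data sets with different shapes.
ORDERS = [
    {"order_id": 1, "customer_id": "c1", "total": 40.0, "channel": "web"},
    {"order_id": 2, "customer_id": "c2", "total": 15.0, "channel": "store"},
    {"order_id": 3, "customer_id": "c1", "total": 25.0, "channel": "web"},
]
CUSTOMERS = [
    {"id": "c1", "name": "Acme", "segment": "enterprise",
     "address": {"city": "Austin"}},
    {"id": "c2", "name": "Bolt", "segment": "smb",
     "address": {"city": "Boise"}},
]


def blend(orders, customers):
    """Link orders to customers and keep only a small, flat output model."""
    by_id = {customer["id"]: customer for customer in customers}
    spend = {}
    for order in orders:
        customer = by_id.get(order["customer_id"])
        if customer is None:
            continue  # selective linking: unmatched orders are left out
        row = spend.setdefault(customer["id"], {
            "name": customer["name"],
            "segment": customer["segment"],
            "total_spend": 0.0,
        })
        row["total_spend"] += order["total"]
    return list(spend.values())


blended = blend(ORDERS, CUSTOMERS)
```

The result has three flat columns per customer, while the nested address and the per-order detail in the sources are deliberately dropped.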


Some questions to start discussion

1. Who is this product aimed at: end users, analysts or the people who get and manage data for others?
2. Can you get data from places other than Hadoop?
3. How do you deal with WYSIWYG data preparation when the dataset is very large?
4. How well does it handle small datasets?
5. How do you take something from one-time process to a repeatably executed process in a production environment?
6. What analysis tool integration is available?
7. What maintenance features are available?

CC Image Attributions

Thanks to the people who supplied the creative commons licensed images used in this presentation:

Tokyo forum - http://flickr.com/photos/fukagawa/2004106475/
klein_bottle_red.jpg - http://flickr.com/photos/sveinhal/2081201200/
donuts_4_views.jpg - http://www.flickr.com/photos/le_hibou/76718773/

About the Presenter

Mark Madsen is president of Third Nature, a technology research and consulting firm focused on business intelligence, data integration and data management. Mark is an award-winning author, architect and CTO whose work has been featured in numerous industry publications. Over the past ten years Mark received awards for his work from the American Productivity & Quality Center, TDWI, and the Smithsonian Institute. He is an international speaker, a contributor to Forbes Online and on the O’Reilly Strata program committee. For more information or to contact Mark, follow @markmadsen on Twitter or visit http://ThirdNature.net

About Third Nature

Third Nature is a research and consulting firm focused on new and emerging technology and practices in analytics, business intelligence, information strategy and data management. If your question is related to data, analytics, information strategy and technology infrastructure then you’re at the right place.

Our goal is to help organizations solve problems using data. We offer education, consulting and research services to support business and IT organizations as well as technology vendors.

We fill the gap between what the industry analyst firms cover and what IT needs. We specialize in product and technology analysis, so we look at emerging technologies and markets, evaluating technology and how it is applied rather than vendor market positions.


Upcoming Topics

www.insideanalysis.com

September: HADOOP 2.0

October: DATA MANAGEMENT

November: ANALYTICS


THANK YOU for your ATTENTION!

Some images provided courtesy of Wikimedia Commons and "Grand Canyon view from Pima Point 2010" by Chensiyuan - Own work. Licensed under GFDL via Commons - https://commons.wikimedia.org/wiki/File:Grand_Canyon_view_from_Pima_Point_2010.jpg#/media/File:Grand_Canyon_view_from_Pima_Point_2010.jpg