Big Data Pitfalls April 8, 2015

Upload: alex-meadows

Post on 15-Jul-2015

Big Data Introduction

So What is it?

● Misnomer and marketing speak

● “Unstructured” data

– Text heavy

– Without obvious/clear structure

● Comes from many places, in many styles

Where It Comes From

Building Your Data Lake

A Common Evolution

Hadoop to the Rescue!

You Have a Data Lake!

Hadoop to the Rescue

● Cross-system analytics?

● Data quality confidence?

● Source of truth?

● Tool chain support?

● Giant yellow elephants?

If any are ignored...

You have a Data Swamp!

Don't worry, even the Jedi had a Data Swamp...

Goal is to build a Data Reservoir

Reservoirs...

● Contain data that is...

– Managed

– Transformed

– Filtered

– Secured

– Portable

– Fit for purpose

Source: Gartner

Pitfalls

Data Warehouse Models

● Traditional models don't cover semi-structured data

● Modern models are hybrids that cross the structured/semi-structured boundary

Data Vault

● Developed by Dan Linstedt

● Tie technical keys across structured and semi-structured data sources

● Semi-structured data can be made more structured and loaded into a relational data vault

● Tools have to support crossing sources

● More details: http://www.tdan.com/view-articles/5054/
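
As a rough illustration of tying technical keys across sources, the sketch below derives the same key from a business key found in a relational row and in a semi-structured record. Hashing a normalized business key is only one common way to do this and is an assumption here, not something the slides prescribe; the names hub_key, customer_hk, and the sample records are all hypothetical.

```python
import hashlib


def hub_key(business_key: str) -> str:
    """Hash a normalized business key into a technical key (illustrative only)."""
    normalized = business_key.strip().upper()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


# Structured source: a row from a relational customer table (made-up data).
crm_row = {"customer_no": "C-1001", "name": "Acme Corp"}

# Semi-structured source: a parsed support ticket (made-up data).
ticket = {"customer": " c-1001 ", "body": "The car is hot."}

# Hub row: one entry per business key, carrying the technical key.
hub_customer = {
    "customer_hk": hub_key(crm_row["customer_no"]),
    "customer_no": crm_row["customer_no"],
}

# Satellite-style row: the ticket text hangs off the same technical key.
sat_ticket = {
    "customer_hk": hub_key(ticket["customer"]),
    "body": ticket["body"],
}

# Both sources resolve to the same customer despite formatting differences.
same_customer = hub_customer["customer_hk"] == sat_ticket["customer_hk"]
```

Normalizing before hashing is what lets the loosely formatted value in the ticket line up with the clean value from the relational table.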

Anchor

● Developed by Lars Rönnbäck

● 6th normal form data warehouse

● Have to transform semi-structured data to match the anchor model

● Provides flexible model that should be able to have marts built upon it

● More details: http://www.anchormodeling.com/
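
To give a feel for what transforming semi-structured data to match the model can involve, here is a heavily simplified sketch: each entity gets one anchor row, and each attribute lands in its own narrow table. Real anchor models also cover ties, knots, and historization, which are left out; the function load, the in-memory "tables", and the sample records are hypothetical.

```python
from itertools import count

_ids = count(1)

anchor_customer = []    # one row per entity: (customer_id,)
attribute_tables = {}   # attribute name -> list of (customer_id, value)


def load(record: dict) -> int:
    """Split one semi-structured record into an anchor row plus attribute rows."""
    customer_id = next(_ids)
    anchor_customer.append((customer_id,))
    for name, value in record.items():
        if value is not None:
            # A missing attribute simply produces no row; no NULL columns needed.
            attribute_tables.setdefault(name, []).append((customer_id, value))
    return customer_id


# Two records with different shapes (made-up data).
first = load({"name": "Acme", "city": "Raleigh"})
second = load({"name": "Globex", "twitter": "@globex"})
```

A record with a previously unseen field only adds a new attribute table, which hints at why the model is described as flexible.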

Textual Disambiguation

● Developed by Bill Inmon

● Breaking semi-structured data down by context

● Converts the data into structured format, consumable by tools

● Store data within the data warehouse – 8th/9th normal form

● White papers and more details are on Bill's website: http://www.forestrimtech.com/

Source: http://www.slideshare.net/Roenbaeck/anchor-modeling-8140128

Working With “Unstructured” Data

● Most data tools require structure (Database schema, clear-cut data formatting)

● Business and technical knowledge required

– Business to provide the pattern (“the grammar or syntax”)

– Technical to provide the “how”

“The car is hot.”

Identifying Context

● It's a really nice car.

● Its internal temperature requires adjustment

● It's hot to the touch

● It's on fire
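
The four readings above can be told apart only by the surrounding words. A minimal sketch of that idea, assuming a hand-written set of cue words per reading (the rule table CONTEXT_RULES, the sense labels, and the function classify are hypothetical, and a real business grammar would be far richer):

```python
import re

# Hypothetical cue words supplied by the business for each reading of "hot".
CONTEXT_RULES = {
    "desirable": {"nice", "sporty", "style", "looks"},
    "cabin_temperature": {"ac", "air", "conditioning", "inside", "cabin"},
    "surface_temperature": {"touch", "hood", "sun", "paint"},
    "fire": {"fire", "smoke", "flames", "burning"},
}


def classify(text: str) -> str:
    """Pick the reading whose cue words overlap most with the text."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {sense: len(words & cues) for sense, cues in CONTEXT_RULES.items()}
    best = max(scores, key=scores.get)
    # With no cue words present, the sentence stays ambiguous.
    return best if scores[best] > 0 else "ambiguous"


bare = classify("The car is hot.")
with_context = classify("The car is hot, there is smoke and flames!")
```

On its own, "The car is hot." matches no cue words and stays ambiguous; only the added context resolves it.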

How to Implement

● Map/Reduce code, Hive queries, data integration tools (Pentaho, Talend)

● Have to create the grammar/syntax rules for particular business

● MDM is _not_ the solution

● Best to have a data warehouse based on subject/relationships

– Data Vault

– Anchor

– Textual Disambiguation

Data Symbiosis

● Data in the data lake can't stand on its own

– Ties back to rest of the structured data

– Requires firm understanding of business rules/logic

● Provides richer data sets

● Difficult to do even before data lakes; adding a data lake magnifies the problems

– But so do the rewards!

Data Quality

● Not just a problem for Data Warehouses!

● Measuring “fit for purpose”

● Same rules used for data warehouses apply to big data

Principles of Data Quality

● Consistency

● Correctness

● Timeliness

● Precision

● Unambiguous

● Completeness

● Reliability

● Accuracy

● Objectivity

● Conciseness

● Usefulness

● Usability

● Relevance

● Quantity

Source: Data Quality Fundamentals, The Data Warehouse Institute

Why Data Quality?

● Main way to control/tame your data problems

● Carries the most hidden costs, because quality problems are the hardest to fix

● Target upstream for problem solutions

How to Implement

● Data integration tools

● Custom coding (Map/Reduce, etc.)

● Data Profiling

● MDM (as central “dictionary”/“grammar” handler)
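
Data profiling can start very small. The sketch below scores two of the quality principles, completeness and consistency, over a handful of records. The field names, the allowed-value set VALID_STATES, and the sample rows are hypothetical, and the scoring rules are one simple interpretation rather than a standard definition.

```python
# Made-up records with typical quality problems.
records = [
    {"id": 1, "state": "NC", "email": "a@example.com"},
    {"id": 2, "state": "nc", "email": None},
    {"id": 3, "state": "North Carolina", "email": "c@example.com"},
    {"id": 4, "state": "VA", "email": ""},
]

# Hypothetical reference list, e.g. maintained in an MDM "dictionary".
VALID_STATES = {"NC", "VA"}


def completeness(rows, field):
    """Share of rows where the field is present and non-empty."""
    filled = sum(1 for r in rows if r.get(field) not in (None, ""))
    return filled / len(rows)


def consistency(rows, field, allowed):
    """Share of rows whose value matches the agreed representation."""
    ok = sum(1 for r in rows if r.get(field) in allowed)
    return ok / len(rows)


email_completeness = completeness(records, "email")
state_consistency = consistency(records, "state", VALID_STATES)
```

Scores like these give a repeatable way to measure "fit for purpose" and to check whether upstream fixes are working.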

Tooling

Does Your Tool Chain...

● Support Hadoop?

● Interface with non-traditional database solutions (i.e. not an RDBMS)?

● Allow for integration across disparate sources?

● Support data quality?

If Not...

Hadoop Ecosystem

● Bridges some of the gaps

– Hive: SQL-to-Hadoop interface (JDBC support)

● Provides even more power

https://hadoopecosystemtable.github.io/

Plus dozens of others... and growing

Sources

● http://en.wikipedia.org/wiki/File:Pitfall!_Coverart.png

● http://www.networkcomputing.com/big-data-defined/d/d-id/1204588

● http://www.appliedi.net/

● http://imgbuddy.com/internet-of-things-icon.asp

● http://www.smashingapps.com/, et al.

● http://www.colleenkerriganphotographs.com/p663330184/h217016CE#h217016ce