TOUG Big Data Challenge and Impact

Ian Abramson EPAM Systems, Canada January 2014 Getting Started with Big Data and Making an Impact @TOUG Jan. 22, 2014 Confidential

Post on 19-Oct-2014


Category:

Technology



DESCRIPTION

Presented to Toronto Oracle Users Group members on Jan 22, 2014 by Ian Abramson

TRANSCRIPT

Page 1: TOUG Big Data Challenge and Impact

Ian Abramson

EPAM Systems, Canada

January 2014

Getting Started with Big Data

and Making an Impact

@TOUG Jan. 22, 2014

Confidential

Page 2: TOUG Big Data Challenge and Impact

Big Data ... The Silver Bullet?

Page 3: TOUG Big Data Challenge and Impact

Agenda


Introductions and Goals

What is Big Data

Technology Choices

Making an Impact with Data Science

Use Cases

Page 4: TOUG Big Data Challenge and Impact

About Me


• Degree in Applied Mathematics

• Over 20 years with Oracle software

• Over 10 years with data warehouses

• Big Data Analyst

• Author of numerous Oracle books

• Blogger: http://ians-oracle.blogspot.com/

• Oracle ACE

• IOUG Past-President

• TOUG Board Member

• Toronto based

• Twitter: @iabramson

Page 5: TOUG Big Data Challenge and Impact

WHERE IS BIG DATA?


Page 6: TOUG Big Data Challenge and Impact

Why Big data?

• New data sources

• Unprecedented volume

• Real World Issues

– Data systems are reaching capacity, requiring high-cost alternatives

– Archive data is too far offline

– Organizations require cost-effective options

– Retain all data for future analysis

Page 7: TOUG Big Data Challenge and Impact

Big Data Defined

Wikipedia: Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization.

IDC: Big Data is a term/concept used as a generic name for a “generation of technologies and architectures designed to extract value economically from very large volumes of a wide variety of data by enabling high-velocity capture, discovery, and/or analysis”.

Gartner: Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.

Roger Magoulas (O’Reilly Research): “Data becomes ‘Big Data’ when the size of the data becomes a part of the problem”

Page 8: TOUG Big Data Challenge and Impact

The Attributes of Big Data

• Classic Data Attributes:

– Volume

– Velocity

– Variety

• Big Data Technical Attributes

– Massively parallel computing environment

– Highly scalable computing clusters, including cloud

• Three main technical requirements

– A storage medium that accommodates large volumes of data, both stored and streaming

– Computing horsepower and an architecture that process the data where it resides, rather than extracting it and processing it elsewhere

– A programming model that performs computations in a highly parallel and scalable environment
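The third requirement can be illustrated with a toy map/shuffle/reduce job. This is a minimal single-process sketch in Python, not Hadoop code: the function names (map_phase, shuffle, reduce_phase, run_job) and the sample partitions are assumptions made for illustration. On a real cluster the map step would run on the node that already holds each partition.

```python
from collections import defaultdict


def map_phase(record):
    """Emit a (word, 1) pair for every word in one input record."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def run_job(partitions):
    # Every record is mapped independently of the others, which is what
    # makes the map step easy to spread across many machines.
    mapped = [
        pair
        for partition in partitions
        for record in partition
        for pair in map_phase(record)
    ]
    grouped = shuffle(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


# Hypothetical input, already split into two partitions.
partitions = [
    ["big data big impact"],
    ["data science", "big questions"],
]
counts = run_job(partitions)
print(counts)
```

Because the map and reduce functions only see one record or one key at a time, the same logic can scale out by adding partitions and workers rather than by buying a bigger server.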


Page 9: TOUG Big Data Challenge and Impact

Challenges for Big Data


http://tdwi.org/blogs/fern-halper/2013/10/four-big-data-challenges.aspx

Page 10: TOUG Big Data Challenge and Impact

Big Data and Data Warehouse – war or peaceful coexistence?

• The problem: different uses call for different schemas and different partitioning. In most cases the requirements are orthogonal, so no single partitioning/indexing scheme can be optimal for everybody

• The ideal goal: acquire and store data “as is” and access it using multiple models. This needs a powerful artificial-intelligence knowledge base and data access code generators

• Storage will never be optimal for everybody without huge redundancy

• The problems are less painful when most of the data is read anyway: good for analytics, not good for OLTP

• Eventually Big Data platforms will become DW platforms with well-developed access interfaces

• Until then: acquire and store, then distribute on demand to conventional DW and data marts

Page 11: TOUG Big Data Challenge and Impact

The New Data Architecture

[Diagram labels: Operational Systems; Enterprise Data; Social & Clickstream; Sensor Generated; Public Data; Historical Data; Other New Sources; Big Data (Hadoop: HDFS, Map/Reduce); Data Archive; ODS; Data Warehouse/BI/Analytics]

Page 12: TOUG Big Data Challenge and Impact

TECHNOLOGY CHOICES


Page 13: TOUG Big Data Challenge and Impact

The Choices for Your Data

RDBMS

– High concurrency

– TB storage

– Indexed reads

– Efficient updates

– Caching

– Highly secure

Analytic Appliances

– Scalable

– Medium concurrency

– High-volume processing (Postgres)

– No indexes

– TB+: Netezza (128 TB/rack), Oracle (300 TB/rack)

NoSQL

– Highly scalable

– High concurrency

– Storage options

– Updates

– Real-time capable

– Rudimentary indexes

– TB+ capacity

Hadoop

– Highly scalable

– Low concurrency

– Distributed storage

– Complex access

– Security (TBD)

Page 14: TOUG Big Data Challenge and Impact

The Open Source/Big Data Landscape


http://www.bigdata-startups.com/open-source-tools/

Page 15: TOUG Big Data Challenge and Impact

Hadoop In Detail


Reference: http://blog.blazeclan.com/252/

Page 16: TOUG Big Data Challenge and Impact

Hadoop Distributions


Page 17: TOUG Big Data Challenge and Impact

For example, if you choose Cloudera…

Page 18: TOUG Big Data Challenge and Impact

Comparing Hadoop Distributions

http://www.infoworld.com/d/business-intelligence/enterprise-hadoop-big-data-processing-made-easier-184330?page=0,5

Page 19: TOUG Big Data Challenge and Impact

Big Data’s Technical Challenges

• Disaster recovery

• Security

• Data consistency

• Workload management

• Reprocessing

• Troubleshooting

• Performance

Page 20: TOUG Big Data Challenge and Impact

DATA SCIENCE


Page 21: TOUG Big Data Challenge and Impact

Big Data vs. BI presentation viewpoint


IMPACT

Page 22: TOUG Big Data Challenge and Impact

Questions for BI and Big Data

• Sample questions for BI

– What is my sales volume by time, by region, by store, by season?

– What is the average review rating by product category and by product? How do reviews change over time, and what are the trends?

• Sample questions for Big Data/Data Science

– How do changes in review ratings impact sales?

– What is the time lag between a change in review ratings and a change in sales volume?

– Which products are purchased together, and can I improve product recommendations?
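One way to approach the time-lag question is to correlate review ratings with sales shifted by 0, 1, 2, ... periods and see which shift fits best. The sketch below is an illustration only: the function names, the weekly ratings, and the sales series (built here so that sales follow ratings by two periods) are all assumptions, and a real analysis would also need to control for seasonality and promotions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def best_lag(ratings, sales, max_lag):
    """Return the lag whose shifted series correlate most strongly."""
    scores = {}
    for lag in range(max_lag + 1):
        # Pair the rating in period t with the sales in period t + lag.
        xs = ratings[: len(ratings) - lag]
        ys = sales[lag:]
        scores[lag] = pearson(xs, ys)
    return max(scores, key=scores.get), scores


# Hypothetical weekly average ratings for one product.
ratings = [4.0, 4.2, 3.5, 3.6, 4.5, 4.4, 3.8, 4.1, 4.6, 3.9]
# Hypothetical sales, constructed to follow the ratings two weeks later.
sales = [300.0, 310.0] + [100.0 + 50.0 * r for r in ratings[:-2]]

lag, scores = best_lag(ratings, sales, max_lag=4)
print("best lag:", lag)
for candidate, score in scores.items():
    print(candidate, round(score, 3))
```

This kind of question is what separates the two lists above: the BI questions are answered by aggregating along known dimensions, while the lag question needs a hypothesis and a test.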

Page 23: TOUG Big Data Challenge and Impact

DATA SCIENCE

Data Science


Skills Science

• Purpose: State the problem

• Research: Discover information about the topic

• Hypothesis: Predict the outcome

• Experiment: Develop a process to test the hypothesis

• Analysis: Record the results

• Conclusion: Compare hypothesis and results

Page 24: TOUG Big Data Challenge and Impact

Data Science Team

Each team would include:

• Data Science Analyst – excellent communication skills; science and analytical background

• Data Science Researcher/Solution Architect – good communication skills; good statistics/math; working knowledge of two of the following data science tools: Mahout (or any other machine learning library), RHadoop, R, SAS, SPSS

• Data Science Technologist – acceptable communication skills; good developer; working knowledge of Big Data and related technologies; 25% deployable to the client site (at minimum a few should be deployable, others can be offshore)

• Data Science Presentation Engineer – knowledge of BI and presentation tools

Nordstrom’s Big Data Team Mission: “Delighting Customers through data-driven products”

Page 25: TOUG Big Data Challenge and Impact

USING BIG DATA


Page 26: TOUG Big Data Challenge and Impact

Data Science Sample use cases


Page 27: TOUG Big Data Challenge and Impact

Top 10 Use Cases (2013 Computerworld)

http://www.computerworld.com.sg/resource/storage/iiis-2013-technical-workshops/?page=2

1. Modeling Risk

2. Customer Churn Analysis

3. Recommendation Engines

4. Ad Targeting

5. POS Transaction Analysis

6. Analysis of network data to predict future failures

7. Threat Analysis

8. Trade Surveillance

9. Search Quality

10. Data Sandbox

Page 28: TOUG Big Data Challenge and Impact

The Big Data of Dating

• From analysis of match.com dating patterns:

• 21+ million members

• 100+ million hits per month

– January 2nd is the busiest day for people to sign up on dating sites

– Women get 60% more attention if their photo is taken indoors

– Men get 19% more attention if theirs is taken outside

– Full-body photos boost both sexes’ success by 203%

– Posing with animals or your best friends might seem cute, but it actually reduces your popularity by 53% (men) and 42% (women)

– Men get 8% fewer messages if they put up selfies

– Mentioning words like “divorce” and “separated” gets men 52% more messages

– Women who are more forward, using words like “dinner”, “drinks” or “lunch” in the first message, get 73% more replies, while men should play it cooler: those who mention the same words in their opening message get 35% fewer replies

Page 29: TOUG Big Data Challenge and Impact

Use Case Development

• Business Questions

• Business Stakeholders

• Identify Business Value

• Define Success Criteria

• Develop Hypothesis and Identify Data Sources

• Iterate results and develop data for goals

Page 30: TOUG Big Data Challenge and Impact

Use Case Checklist

• Title - An active description which identifies the goals of the primary actor

• Characteristics:

– Primary actor

– Goal in Context

– Scope

– Level

– Stakeholders and Interests

– Precondition

• Success criteria

– Precondition

– Minimal Guarantees

– Success Guarantees

– Trigger

– Main Success Scenario

– Extensions

• Technology & Data Variations List

• Related Information

Reference: Alistair Cockburn

Page 31: TOUG Big Data Challenge and Impact

Archive Use Case

EXPEDIA CASE STUDY

• 1.5 petabytes of continuously ingested data

• One of the largest Hadoop clusters in the world

• 80% open source EDW

Customer Benefits

• Avoided the massive cost of new DW infrastructure

• Able to keep and analyze historical transactions

• Reduced risk of DW replacement

• Able to scale on demand using low-cost servers

Transaction Volume

• > 500 GB daily increase from all sources: transaction, social, contact center

[Diagram labels: Call Center and Online data; Staging and Historical Analysis; Analytic Infrastructure; Informatica transformation & aggregation]

Page 32: TOUG Big Data Challenge and Impact

Use Case: Sales Analysis

Sales per sq. ft.: Changes Over Time

• Fitting a no-intercept line to the scatter of sales against sales floor area gives a visual baseline Sales-per-Sq.Ft. (SpSF) for each year

Mathematically, the SpSF measure is given by the slope coefficient of the trend line: 392.51 [CAD/Sq.Ft.] in 2011 vs. 373.76 [CAD/Sq.Ft.] in 2012

[Chart labels: SpSF 417 in 2011; 417 in 2012]
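For a line forced through the origin, the least-squares slope has a closed form: sum(area × sales) / sum(area²). The sketch below applies it to a few made-up stores; the function name and the floor areas and sales figures are hypothetical and are not the data behind the slide.

```python
def slope_through_origin(areas, sales):
    """Least-squares slope of sales = slope * area, with no intercept."""
    numerator = sum(a * s for a, s in zip(areas, sales))
    denominator = sum(a * a for a in areas)
    return numerator / denominator


# Hypothetical stores: floor area in sq. ft. and annual sales in CAD.
areas = [1000.0, 2000.0, 3000.0]
sales = [410000.0, 790000.0, 1200000.0]

spsf = slope_through_origin(areas, sales)
print(f"baseline SpSF: {spsf:.2f} CAD per sq. ft.")
```

Fitting without an intercept encodes the assumption that a store with no floor space has no sales, so the slope can be read directly as sales per square foot. Running the same fit per year gives the year-over-year comparison described above.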

Page 33: TOUG Big Data Challenge and Impact

Looking for Patterns and Anomalies

This chart tells us that most of the stores have their highest sales on Saturday. But Store X peaks on Friday and is also doing well on Mondays. Why?

[Chart: sales by day of week (THU, FRI, SAT, SUN, MON, TUE, WED); vertical axis from 0 to 10,000,000]
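The same check can be automated across many stores: find each store's peak day, take the most common peak as the typical pattern, and flag the stores that differ. This is a minimal sketch with made-up numbers; the store names, sales figures, and function names are assumptions for illustration.

```python
from collections import Counter


def peak_day(daily_sales):
    """Return the day of week with the highest sales for one store."""
    return max(daily_sales, key=daily_sales.get)


def unusual_stores(sales_by_store):
    """Return the typical peak day and the stores that peak on another day."""
    peaks = {store: peak_day(days) for store, days in sales_by_store.items()}
    typical, _ = Counter(peaks.values()).most_common(1)[0]
    outliers = {store: day for store, day in peaks.items() if day != typical}
    return typical, outliers


# Hypothetical weekly sales per store, keyed by day of week.
sales_by_store = {
    "Store A": {"THU": 5.1e6, "FRI": 6.0e6, "SAT": 8.9e6, "SUN": 6.2e6,
                "MON": 3.9e6, "TUE": 3.8e6, "WED": 4.0e6},
    "Store B": {"THU": 4.8e6, "FRI": 5.7e6, "SAT": 8.1e6, "SUN": 5.9e6,
                "MON": 3.5e6, "TUE": 3.6e6, "WED": 3.7e6},
    "Store C": {"THU": 5.0e6, "FRI": 6.1e6, "SAT": 9.2e6, "SUN": 6.5e6,
                "MON": 4.1e6, "TUE": 3.9e6, "WED": 4.2e6},
    "Store X": {"THU": 5.2e6, "FRI": 8.4e6, "SAT": 7.0e6, "SUN": 5.5e6,
                "MON": 6.8e6, "TUE": 3.7e6, "WED": 3.9e6},
}

typical, outliers = unusual_stores(sales_by_store)
print("typical peak day:", typical)
print("stores to investigate:", outliers)
```

The code only surfaces the anomaly. Answering the "Why?" still takes the research and hypothesis steps of the data science process.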

Page 34: TOUG Big Data Challenge and Impact

Affinity Analysis Use Case

Build a model that provides the foundation for analyzing and understanding the factors that influence year-over-year change in store performance

• Affinity Analysis is an input to:

– Identify products purchased in tandem

– Provide guidance and recommendations for upsell and cross-sell

– Redesign stores, layouts and planograms

– Discount plans and promotions

• Identifying customer baskets in different times and geographies

• Investigating patterns at fine-line and product levels

• Ranking customer baskets by number of times bought together and by revenue contributed
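The "number of times bought together" ranking can be sketched as a pair count over baskets. The example below is a simplified illustration with hypothetical baskets and a hypothetical function name. It counts product pairs only, and a fuller version would also sum the revenue each pair contributes.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of products."""
    counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so each pair has one canonical form.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical baskets, one list of products per transaction.
baskets = [
    ["snow scraper", "washer fluid"],
    ["snow scraper", "washer fluid", "motor oil"],
    ["washer fluid", "motor oil"],
    ["snow scraper"],
    ["washer fluid", "snow scraper"],
]

counts = pair_counts(baskets)
for pair, times in counts.most_common():
    print(pair, times)
```

At scale the same counting maps naturally onto a map/reduce job: each basket emits its pairs, and the reduce step sums the counts per pair.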

Page 35: TOUG Big Data Challenge and Impact

Clustering of Products


Page 36: TOUG Big Data Challenge and Impact

Snow Scrapers and Washer Fluid


Page 37: TOUG Big Data Challenge and Impact

Related Baskets

1. Potted annuals/plants, Cell-packs/annual plants

2. Potted annuals/plants, vegetables/plants

3. Potted annuals/plants, Outdoor soils/outdoor lawn & plant care

4. Cell-packs/annual plants, vegetables/annual plants

The size of each circle shows how often the basket has been purchased

Season: 2012-05-16 - 2012-08-28

This kind of analysis can be used for spotting driver products

Page 38: TOUG Big Data Challenge and Impact

Big Data is Evolving

• The industry is evolving

• Hadoop is now about 8 years old, having started in 2006 at Yahoo

• CDH 5 recently released

• $2.5B in venture capital in the space

• Hadoop is now considered a standard

• HBase is an example of a project which has not found a standard

• Many tools today. What will there be 5 years from now?

• How to avoid the big data pitfalls?

• 50% of big data projects fail

• Those who succeed drive it through focus

• Insight vs. Impact

• Find one problem and fix it

• Data Science

• Change how you do analysis… scientific methods

• New and exciting

• Build a hybrid team to develop data solutions

• The team can program, knows math and statistics, and can communicate

Page 39: TOUG Big Data Challenge and Impact

The Big Data Adventure

Page 40: TOUG Big Data Challenge and Impact

Thank You and Questions

Ian Abramson

EPAM Systems

Toronto, Canada

GMT -5

Mobile phone: +1 (416) 254-9286

Skype: ian.abramson

E-mail: [email protected]