Paradigm4 Research Report: Leaving Data on the Table


Upload: paradigm4

Post on 01-Nov-2014



TRANSCRIPT

Page 1: Paradigm4 Research Report: Leaving Data on the table

Leaving Data on the Table

Data Scientists Reveal Obstacles to Big Data Analytics


Paradigm4 Data Scientist Survey 2

While Big Data enjoys widespread media coverage, not enough attention has been paid to what practitioners think — data scientists who manage and analyze massive volumes of data.

We wanted to know, so Paradigm4 teamed up with Innovation Enterprise to ask over 100 data scientists for their help separating Big Data hype from reality. What we learned is that data scientists face multiple challenges achieving their company’s analytical aspirations. The upshot is that businesses are leaving data — and money — on the table.

We asked data scientists questions such as:

What obstacles prevent them from gaining insights into their data?

How many use Hadoop, and which limitations have they encountered when attempting to use Hadoop for complex analytics?

What data types and sources would they like to leverage more effectively?

Will they adopt complex analytics solutions (see below), and how quickly?

Their experiences should help inform businesses on what to look for as they investigate options to expand their analytics infrastructure.

For insight on the issues and obstacles facing data scientists, read on.

This survey uses the terms “complex analytics” and “basic analytics,” for which respondents were given these definitions:

“Complex analytics” means math functions like covariance, clustering, machine learning, principal components analysis and graph operations.

“Basic analytics” means business intelligence reporting such as sums, counts and aggregates.

This distinction is important because basic analytics are “embarrassingly parallel” whereas complex analytics are not. Here’s what we mean. “Embarrassingly parallel” (sometimes referred to as “data parallel”) refers to problems that can be separated into multiple independent sub-problems that can run in parallel and do not require access to all the data at once. This is the divide-and-conquer approach used by MapReduce/Hadoop. In contrast, “non-embarrassingly parallel” problems require using and sharing all the data at once and communicating intermediate results among processes. Matrix multiplication on matrices too large to fit on one server is an example of a non-embarrassingly parallel function.
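The difference between embarrassingly parallel and non-embarrassingly parallel work described in this section can be illustrated with a small sketch. It is not taken from the report: the function names, the chunk size and the row-band layout are assumptions chosen for illustration. A sum splits into chunks that never need to see each other, while a matrix product split across workers forces each worker to read rows that another worker holds.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, chunk_size=4):
    """Embarrassingly parallel: each chunk is summed without seeing any other chunk."""
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(sum, chunks))
    # The only coordination is one cheap combine step at the end.
    return sum(partials)


def distributed_matmul(a, b, workers=2):
    """Multiply square matrices whose rows are split into bands across workers.

    Hypothetical layout: worker w owns one band of rows of A and the same
    band of rows of B. Returns the product and the number of B elements
    that had to be read from other workers.
    """
    n = len(a)
    band = max(1, -(-n // workers))  # ceiling division, at least one row per band
    bands = [range(start, min(start + band, n)) for start in range(0, n, band)]
    product = [[0] * n for _ in range(n)]
    remote_reads = 0
    for w, my_rows in enumerate(bands):
        for owner, b_rows in enumerate(bands):
            if owner != w:
                # Rows of B held elsewhere must be shipped to this worker.
                remote_reads += len(b_rows) * n
            for i in my_rows:
                for k in b_rows:
                    for j in range(n):
                        product[i][j] += a[i][k] * b[k][j]
    return product, remote_reads
```

In this sketch `parallel_sum` moves nothing between chunks, whereas `distributed_matmul` reports a non-zero `remote_reads` as soon as there is more than one worker, because every band of the result depends on every row of B.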

The Big Takeaways

1. We’ve all heard how hard it is to analyze massive and rapidly growing data volumes. But data scientists say variety presents a bigger challenge. They are at times leaving data out of their analyses as they wrestle with how to integrate and analyze more types of data such as time-stamped sensor, location, image and behavioral data as well as network data.

2. Data scientists are turning to large-scale complex analytics both for unbiased data-driven exploration and to wrest more value from their data.

3. For complex analytics, data scientists are forced to move large volumes of data from existing data stores to dedicated mathematical and statistical computing software. This time-consuming and coding-intensive step adds no analytical value and impedes productivity.

4. While Hadoop has garnered widespread media coverage, 76 percent of the data scientists who have used it have encountered serious limitations. Hadoop is well suited for embarrassingly parallel problems but falls short for large-scale complex analytics.

5. Incorporating diverse data types into analytical workflows is a major pain point for data scientists using traditional relational database software.

6. For data scientists, Big Data means Big Stress. Thirty-nine percent say it’s made their job more stressful.

Data Variety Is Proving to Be More Important Than Volume

The overwhelming volume of corporate and organizational data continues to generate headlines, but it’s the diverse types of data that pose a bigger challenge. Nearly three-quarters of data scientists (71 percent) said Big Data had made their analytics more difficult and that data variety, not just volume, was the challenge.

My Analytics Are Becoming More Difficult Because of the Variety and Types of Data Sources (Not Just the Volume)

TRUE: 71%
FALSE: 29%

What Is The Biggest Problem You Face In Gaining Insights From Your Big Data?

I struggle with managing new types and sources of data: 40%
I know how to get the answer but it takes too long (my data is too big to move to a math/analytics software package): 36%
I don’t know what questions to ask of my data: 24%
I know what I want to ask but don’t know how to get the answers: 18%
I know how to get the answer but my analysis runs out of memory: 17%

Which types of data do you anticipate using in the next year?

Time-series: 66%
Business transaction: 66%
Geospatial / Location: 55%
Graph (network): 46%
Clickstream: 35%
Health records: 25%
Sensor: 17%
Image: 13%
Genomic: 7%

What It Means: The ability to effectively use diverse data sources is proving to be a competitive differentiator in many industries.

The trend toward hyper-personalization and precision targeting illustrates this well. Recommendations, search results and ads are becoming ever more relevant and micro-targeted as they tap more and diverse data like social networks, current location, and browsing and purchasing history. Personalized insurance offerings are augmenting sensor data about driver behavior to incorporate contextual data like time-of-day and road congestion. Precision medicine providers are gaining a more refined understanding of what works for whom by integrating molecular data with clinical, behavioral, electronic health records and environmental data. But the ability to use diverse data types poses a serious challenge. (For more on this topic, see “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” by Thomas Davenport, Chapter 1: “Why Big Data is Important to you and your Organization.”)

Data Scientists Are Turning to Complex Analytics to Analyze Their Big Data

When will your company begin to use complex analytics on your Big Data?

[Chart. Response options: We use it now; In the next 3 years; More than 3 years down the road; No plans to use complex analytics; In the next 2 years; We plan to use it in the next year. Percentages shown: 59%, 1%, 4%, 4%, 16%, 15%.]

The point is not to be dazzled by the volume of data, but rather to analyze it — to convert it into insights, innovations, and business value.

— Thomas Davenport, “Big Data at Work: Dispelling the Myths, Uncovering the Opportunities,” page 2.


What It Means: The “low hanging fruit” of Big Data has been exploited.

Many new analytical uses require significantly more powerful algorithms and computational approaches than what’s possible in Hadoop or relational databases. Data scientists increasingly need to leverage all data sources in novel ways, using tools and analytical infrastructures suitable for the task. As we have already seen in this survey, organizations are moving from simple SQL aggregates and summary statistics to next-generation analytics such as machine learning, clustering, correlation, and principal components analysis on moderately sized data sets. The move from simple to complex analytics on Big Data presages an emerging need for analytics that scale beyond single server memory limits and handle sparsity, missing values and mixed sampling frequencies appropriately. These complex analytics methods can also provide data scientists with unsupervised and assumption-free approaches, letting all the data speak for itself.

Moving Big Data Poses Difficult Challenges to Data Scientists

Data scientists face another growing challenge: conventional analytic workflows require them to move data to mathematical and statistical computing software. This workflow made sense with small or sampled data but is either woefully inefficient or breaks with even moderately large data volumes.

78% of data scientists utilize software capable of complex analytics in addition to their data management software

36% of data scientists say it takes too long to get insights from their data because it is too big to move to their analytics software

What It Means: The size and diversity of today’s data sets pose a significant hurdle to doing more sophisticated analytics because so much time is lost moving data from files or from a database to analysis tools.

This forces data scientists to make compromises, analyzing samples instead of the whole data set, leaving data and money on the table. Data scientists risk missing rare events, weak signals or important anomalies when restricted to working with samples or computing on subsets independently. (For more on this topic, see “Scaling Big Data Mining Infrastructure: The Twitter Experience,” by Twitter Engineering Manager Dmitriy Ryaboy and University of Maryland Associate Professor Jimmy Lin.) What’s needed are tools capable of conducting complex analytics over massive data volumes efficiently, without sampling and without moving the data.
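The risk of missing rare events when working from a sample can be made concrete with a short calculation. The sketch below is an illustration only: the function name and the figures in the usage note are hypothetical, not survey data. It computes the chance that a simple random sample, drawn without replacement, contains none of the rare records in a data set.

```python
def prob_sample_misses_all(population, rare, sample):
    """Probability that a simple random sample drawn without replacement
    contains none of the `rare` records (the hypergeometric P(X = 0))."""
    if sample > population - rare:
        return 0.0  # the sample is too large to avoid every rare record
    p = 1.0
    for j in range(rare):
        # Chance that the j-th rare record falls outside the sample,
        # given that the earlier rare records also fell outside it.
        p *= (population - sample - j) / (population - j)
    return p
```

Under these assumed numbers, `prob_sample_misses_all(1_000_000, 5, 10_000)` is roughly 0.95: a 1 percent sample of a million records would usually contain none of five rare events.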

Hadoop Only Takes You So Far

While the Hadoop software platform garners significant media attention, Hadoop is not a viable solution for many use cases, especially those that require complex analytics. Fewer than half of data scientists surveyed (48 percent) have used Hadoop or Spark, and of those, 76 percent cited significant limitations to its use.

Among the 76% reporting problems, what are the limitations of Hadoop / Spark?

It takes too much effort to program: 39%
It’s too slow for interactive, ad-hoc queries: 37%
It’s too slow for real-time analytics: 30%
It’s not well-suited for my analytics (not embarrassingly parallel): 22%

35% of data scientists who tried Hadoop or Spark have stopped using it

What It Means: Hadoop was unrealistically hyped as a universal and disruptive Big Data solution.

But even Hadoop vendors have recognized the limitations. They are adding SQL functionality to their products to accommodate data scientists’ preference for a higher-level query language instead of programming languages like Java and to address the limitations of MapReduce. (E.g., Cloudera has abandoned MapReduce and is offering Impala to provide SQL on HDFS.) A growing number of complex analytics use cases are proving to be unworkable in Hadoop. First-wave Hadoop adopters like Google, Facebook and LinkedIn required a small army of developers to program and maintain Hadoop. But many organizations either don’t have the required staff or face complex analytics challenges that can’t be readily solved with Hadoop. This presents a real challenge for the Hadoop infrastructure that has to address these shortcomings or risk being replaced.

Existing relational database management systems are inadequate for analyzing the variety of data sources

Given the growing diversification of data types and sources coupled with the limitations of existing relational databases, it’s no surprise that many data scientists are frustrated when leveraging these data sources in their analytical workflows.

I am finding it harder to fit my data into relational database tables

TRUE: 49%
FALSE: 51%

What It Means: Relational databases were built for storing and querying densely populated transactional data such as business purchases and customer information.

By comparison, temporal, spatial and network data may be quite sparse (containing large amounts of missing values), have mixed sampling frequencies and a natural order. Relational databases require predefined access patterns for each line of inquiry, an obvious non-starter for data scientists doing ad hoc data exploration.
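To make “mixed sampling frequencies” and “natural order” concrete, here is a minimal sketch of aligning one ordered series onto the timestamps of another. The function name and the sample values are hypothetical and are not drawn from the survey. It relies on the data being sorted by time and reports a gap explicitly as `None` where no earlier reading exists.

```python
from bisect import bisect_right


def align_latest(times, values, query_times):
    """For each query time, return the most recent value observed at or
    before it, or None when no observation exists yet.

    `times` must be sorted in ascending order, which is the "natural
    order" that time-stamped data already has.
    """
    aligned = []
    for t in query_times:
        idx = bisect_right(times, t) - 1  # last observation at or before t
        aligned.append(values[idx] if idx >= 0 else None)
    return aligned


# Hypothetical sensor sampled every 10 seconds, queried every 4 seconds.
sensor_times = [0, 10, 20]
sensor_values = [1.0, 2.0, 3.0]
aligned = align_latest(sensor_times, sensor_values, [-4, 0, 4, 8, 12, 16, 20])
```

Carrying the last observation forward is only one possible alignment policy; interpolation or leaving the gaps empty may suit other analyses better.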

Big Data Means Big Stress for Data Scientists

There’s another side of the Big Data story: 39 percent of data scientists say their job has become more stressful with the growth of Big Data. That’s nearly four times the number who say it’s made their job less stressful.

39% of data scientists say the growth of Big Data has made their job more stressful in the last year

24% say they don’t know which questions to ask of their Big Data

Quotes from data scientists:

“My biggest problem is linking various data sources.”

“The data is just too big.”

“The biggest problem is putting multiple sources of data together.”

What It Means: Driven in part by media hype, organizations have developed inflated expectations around the value they’ll get out of Big Data.

Fulfilling those expectations falls on the data scientist. But outdated software approaches better suited to traditional transactional data, not today’s diverse data sources and rapidly growing volumes, often make it impossible to fulfill these expectations. It’s a recipe for stress. Deriving business value from organizational data starts with ad hoc analysis. Tools and workflows need to enable data scientists to conduct analysis quickly and efficiently, making data scientists more productive and lowering stress levels as a result.

So What?

Data scientists play a pivotal role helping organizations unlock the potential of their Big Data. But current software tools fall short in some areas, as indicated in the survey. Hype has exceeded reality and data scientists are forced to compromise, sometimes leaving data on the table. Choosing the right software solution is key, but don’t expect to get there by browsing vendors’ websites. The fact that so many data scientists identified shortcomings in their infrastructure suggests that the only way to tell which solution is best suited to your organization is to do a pilot project using your data and your use cases.

About the Survey

The Paradigm4 Data Scientist Survey was fielded by Innovation Enterprise, an independent research firm, from March 27 to April 23, 2014. The responses were generated from a survey of 111 data scientists in the U.S.

About Paradigm4

Paradigm4 is the creator of SciDB, a computational database management system used to solve large-scale, complex analytics challenges on Big and Diverse Data. Led by industry visionaries and veterans Michael Stonebraker, Marilyn Matz, Paul Brown and Bryan Lewis, Paradigm4 enables data-obsessed organizations in life sciences, e-commerce, finance, and manufacturing to answer harder questions faster.

For more information, visit www.paradigm4.com