
rediscoveringBI
After the Big Data Party
04 April 2013 | Issue 7
A Radiant Advisors Publication

The Big Data Honeymoon: Over Already?
BI and Big Data: Bringing Them Together
Big Data vs. Data Management: BI's Big Question
A Zero-Sum Scenario: Has the Bubble Burst?


Editor in Chief: Lindy Ryan ([email protected])
Contributor: Dr. Barry Devlin ([email protected])
Contributor: Krish Krishnan ([email protected])
Contributor: John O'Brien ([email protected])
Distinguished Writer: Stephen Swoyer ([email protected])
Art Director: Brendan ([email protected])

For more information: [email protected]

FROM THE EDITOR

"Big data" is being routinely paired with proportionally "big" descriptors: innovative, revolutionary, and even (in this editor's humble opinion, the mother of all big descriptors) transformative. With the inexorable momentum of cyber-journalism keeping it affixed atop industry headlines, big data has indeed earned itself quite the reputation, complete with a stalwart following of industry pundits, papers, and conferences. In fact, the whole big-data thing has mutated into a sort of larger-than-life caricature of promise and possibility – and, of course, power.

Let's face it: big data is the Incredible Hulk of BI -- gargantuan, brilliant, and, yes, sometimes even a bit hyper-aggressive. For the many mild-mannered, Bruce Banner-esque data analysts out there, big data is, for better or worse, the remarkably regenerative, impulsive alter ego of the industry, eager to show with brute force just how much we can really do with all our data – what the tangible extent of all that big data power is, so to speak.

Yet, as with the debut of Stan Lee's destructive antihero in The Incredible Hulk #1, in early 2013 we still haven't begun to really see what big data can do. Not even close.

In this month's edition of RediscoveringBI, authors Dr. Barry Devlin, Krish Krishnan, Stephen Swoyer, and John O'Brien each explore different facets of that very construct, asking: has the big data bubble actually burst, or is its honeymoon phase just over? How much is just hype? And how much is only a precursor to what we'll continue to see buzzing around and reinventing the industry?

What's next for BI's Incredible Hulk antihero, big data?

Lindy Ryan
Editor in Chief
Radiant Advisors

rediscoveringBI | April 2013, Issue 7

SPOTLIGHT

[P4] The Honeymoon is Over for Big Data: big data, it turns out, means precisely nothing and imprecisely anything you want it to mean. [By Dr. Barry Devlin]

FEATURES

[P8] Has the Big Data Bubble Burst? The BI industry is abuzz with one new question: is big data done? [By Krish Krishnan]

[P10] Twilight of the (DM) Idols: big data is already a disruptive force: at once democratizing, reconfiguring, and destructive. [By Stephen Swoyer]

[P16] Bringing BI and Big Data Together: three things that make a "big" difference when implementing big data. [By John O'Brien]

EDITOR'S PICK

[P7] Big Data Revolution: are we becoming no more than sentient founts of data? Mayer-Schönberger and Cukier put the pulse back in the big data conversation. [By Lindy Ryan]

SIDEBARS

[P13] Big Impact: Big Data: two use cases for big data are having a big impact, at least from a data management perspective. [By Stephen Swoyer]

[P15] A Kludge Too Far? [By Stephen Swoyer]


OPINION | LETTERS TO THE EDITOR

On: Time for An Architectural Reckoning

Evolution vs. Revolution

This is an excellent, thought-provoking article. I believe that you are correct in the assertion that an Architectural Reckoning is underway. In fact, I believe it has been underway for at least 10 years.

To focus on technology in general and Hadoop in particular is, however, to miss the point. The Reckoning is being driven by the intersection of business needs and technology advances. Both sides can be summed up as "faster and smarter" – and they are mutually reinforcing. I call this the "biz-tech ecosystem." On the technology side, Hadoop is and will be part of it. So will a range of other data management technologies, including relational databases, for sure. And I believe that the various approaches – column, in-memory, and more – will be combined into a hybrid approach more powerful than any RDBMS we have today. And we will need that, because I am certain that the Data Warehouse – in a new, more circumscribed role, but central to consistency and reliability for information that must be of high quality – will continue to thrive. (And many thanks for the historical positioning of Paul's and my paper from 1988!) I call this new role "Core Business Information."

As you also pointed out, it's not just about data management. What is happening is also changing application development, as well as process and business modeling and implementation. Collaborative and social computing are also vital components of the mix. So, yes, an interdisciplinary approach will be needed – not just within IT but across the business-IT divide.

We are also in somewhat of a positive feedback loop – and as anyone who has ever put a microphone in front of speakers knows, the result rapidly becomes very unpleasant. So we do need to step back from the hype of big data and recognize the dangers as well as the opportunities.

My bottom line: yes, we are in a time of Architectural Reckoning (is this the same as a Paradigm Shift?) but continuity of thinking and a mindset of evolution rather than revolution are vital. I'm trying to capture this in my long-awaited (by me, anyway) second book.

- Dr. Barry Devlin (Editor's Note: See The Honeymoon is Over for Big Data by Dr. Barry Devlin in this month's issue)

Augmentation of Traditional DWs

I totally agree with Claudia that "not all analytics now belong inside the BI architecture" and that we are in a "very disruptive period of a lot of new technologies flooding in to business intelligence." I also am not actually that far away from the position Scott Davis takes: I agree that "Hadoop is a hugely transformational technology." I just think for the short- to medium-term Hadoop et al. are going to augment, rather than replace, traditional data warehouses. Will Hadoop replace a traditional data warehouse database in the long term? Only if it adds a lot of database-like features, and then the argument becomes a lot less interesting – something akin to the "Will Ingres/Informix/Sybase replace Oracle?" debate of yesteryear.

My main concern is how customers are going to embrace this new data landscape, rather than if they are going to. How are organizations going to build a data landscape that includes Teradata, Aster, and Hadoop? How are they going to manage Analysis Services cubes and a smattering of legacy Oracle data warehouses?

Data warehouses currently take too long to build and are too hard to change. The new architectural changes are going to make things worse, not better.

Yes, WhereScape does have a stake in the game – although not in the status quo. Regardless of the platform, design, and technology, the need to deliver quickly without compromise remains the same. Who wants to manually build out a multiple-platform data warehouse? A data warehouse automation environment (such as WhereScape RED) helps simplify the approach, and I believe is a key piece of the new architecture.

- Michael Whitehead (Editor's Note: Michael Whitehead is the CEO and Founder of WhereScape)

Have something to say? Send your letters to the editor at [email protected]

UPCOMING WEBINAR | Inside Analysis: Modern Data Platforms
With Dr. Robin Bloor and John O'Brien, hosted by Eric Kavanagh
April 17, 3:00 PM CST. Register now.

Agile and flexible -- those might well be the mantras of Modern Data Platforms. As organizations look to harness the latest advances in analytics and integration technologies, the focus turns quite sharply to architecture: the right data platform can empower companies to harness everything from big data to real-time, all without sacrificing data quality and governance.

Register for this free webcast to catch a preview of SPARK!: Modern Data Platforms, a three-day seminar series to be held in Austin, TX, from April 29 - May 1. The seminar will feature a tag-team of experts from Radiant Advisors and The Bloor Group, who will provide detailed instruction on the range of activities associated with modernizing and evolving robust data platforms. John O'Brien of Radiant will focus on Rediscovering BI, while Dr. Robin Bloor of The Bloor Group will discuss the Event-Driven Architecture. Attendees of the webcast will receive a discount code for $150 off the in-person seminar.

Follow the conversation: #sparkevent

[Cover image: rediscoveringBI, Issue 6, March 2013: "Shifting Gears with Modern BI Architectures"]

Don't miss Radiant Advisors' John O'Brien as he keynotes the upcoming Big Data Boot Camp, May 21-22 at the New York Hilton. John will offer perspective into the dynamics and current issues being encountered in today's big data analytic implementations, as well as the most important and strategic technologies currently emerging to meet the needs of the "Big Data Paradigm."

Join John and other big data experts as they converge upon New York, and be sure to save an extra $100 off the early bird rate by registering at http://www.bigdatabootcamp.net. Early bird registration ends April 19.



SPOTLIGHT

The Honeymoon is Over for Big Data

[Big data, it turns out, means precisely nothing and imprecisely anything you want it to mean.]

Dr. Barry Devlin

BIG DATA IS tumbling into the "Trough of Disillusionment," according to Gartner's Svetlana Sicular. If you fear that this means the end of the road for big data, see Mark Beyer's (co-lead for Gartner big data research) remedial education on the meaning of Gartner's Hype Cycle curve -- although they might have chosen a less alarmist phrase!

Let me put it another way: the big data honeymoon is over. Let's quickly review the history of the romance before looking to the future of the relationship.

For commercial computing, big data "dating" really began in the mid-2000s, when technical people in the burgeoning web business began to consider new ways to handle the exploding amounts and types of data generated by web usage. Before then, big data had been the dream -- or nightmare, actually -- of the scientific community where, from genetics to astrophysics, instrumentation was spewing data. In early 2008, the commercial romance of big data really began to get serious when Hadoop, the yellow poster elephant child of big data, was accepted as a top-level Apache project. The marketing paparazzi began stalking the couple soon after and, true to paparazzi nature, have been publishing a stream of outrageous claims and pictures ever since. By 2012, a shotgun wedding with business was hastily arranged. By then the gloss had begun to wear off, and the honeymoon was washed out in a brief trip to Atlantic City at the height of a superstorm.

Enough of the Past: Let's Look Forward!

Big data does offer real and realizable business benefits, but there is one major issue: what actually is big data? The "volume, variety, and velocity" nomenclature, claimed by Doug Laney from a 2001 Meta Group research note, is useful shorthand at best. In reality, each attribute opens up a question of how far on any scale data must be in order to be called big -- how vast, how disparate, how fast? Furthermore, what combination of these three factors should be used in making a call? Big data, it turns out, means precisely nothing and imprecisely anything the Mad Men want it to mean. And, with the various additional "v-words" vaunted by vendors, the value vanishes. (Oops, I veered into the v-v-verge there!)

The extent of this terminology problem was made clear in a big data survey conducted last fall by EMA and myself. Participants were those who declared they were investigating or implementing big data projects, yet almost a third of respondents classed the data source for their projects as "process-mediated data" -- data originating from traditional operational systems. My conclusion: the term big data has passed its use-by date.

Big data and "small data" are conceptually one and the same: just data, all data. Or, to be more semantically correct, all information, as I'll explain in a new book later this year. (Editor's Note: Business Unintelligence: Via Analytics, Big Data and Collaboration to Innovative Business Insight will be published in Q3 2013 by Technics Publications.)

To be clear, I don't consider that big data has taken us into a dead end. Rather, it has usefully exposed the fact that our traditional business intelligence (BI) view of the information available to and used by business is woefully inadequate. It has caused me to revisit many underlying assumptions about information, and I now see that there exist three domains of information that future business intelligence/analytics must handle, as shown in the accompanying figure: human-sourced information, process-mediated data, and machine-generated data. These domains are fundamentally different in their usage characteristics and in the demands they place on technology. The terms are largely self-explanatory, but more information can be found in my white paper. (See Barry Devlin's The Big Data Zoo - Taming the Beasts: The need for an integrated platform for enterprise information.)

The bottom line is that we need a new architecture for information -- all of it and its entire life cycle in business.

The Biz-Tech Ecosystem

Both challenges and opportunities emerge as we shift the view from IT to business.

The biggest challenge in the big data/analytics scene is the alleged dearth of so-called "data scientists." How different are data scientists from the power users we've known in BI for decades? Arguably, the only substantive difference is deep statistical skill. The other characteristics mentioned -- data munging, business acumen, and storytelling -- are all common to power users. Statistics, however, is a very specialized skill that should, in principle, be tightly supervised to ensure valid and proper application. The phrase "lies, damn lies, and statistics" indicates the problem: statistics are far too easy to misuse -- deliberately or otherwise. Moreover, we seem to have blindly accepted an assertion that the exponential growth in data volumes implies a similar growth in hidden nuggets of useful business knowledge. This is unlikely to be true. Most of the good examples of business value coming from big data illustrate this: real value emerges from a new type or new combination of data; growth in volumes leads to incremental increases in value, at best.


These challenges aside, a focus on novel (big) data use does drive opportunities for new businesses, business models, or, simply, ways to compete. A useful, cross-industry categorization (courtesy of IBM) of these opportunities is:

• Big Data Exploration: analyze "big data" to identify new business opportunities
• Enhanced 360° View of the Customer: incorporate human-sourced information, such as call center logs and social media, into traditional CRM approaches
• Security and Intelligence Extension: lower risk, detect fraud, and monitor cyber security in real-time, machine-generated data
• Operations Analysis: analyze and use machine-generated data to drive immediate business results
• Data Warehouse Augmentation: increase operational efficiency by integrating big data with BI

This focus on (big) data is but the latest stage in the evolution of what I call the biz-tech ecosystem -- the symbiotic relationship between business and IT that drives all successful, modern businesses. Every business advance worth mentioning in the past twenty years has had technology, and almost always information technology, at its core. On the other hand, many of the advances in IT have been driven by business demands. The relative roles of business and IT people may change as the process evolves, but that process is set to continue. And, at its heart are the collection, creation, and use of information, as opposed to data -- big or small -- as mandatory, core competencies of modern business.

Dr. Barry Devlin is Founder and Principal of 9sight Consulting, and is among the foremost authorities on business insight and big data. He is a widely respected analyst, consultant, lecturer, and author.

EDITOR'S PICK

Big Data Revolution

Lindy Ryan

WHILE IT'S INARGUABLE that the phenomenon known as "big data" is rapidly reknitting the very fabric of our lives, what we are just now beginning to see and to understand – to appreciate – is how.

Yet, so often our conversations about big data focus on these "how's" in the abstract – on its benefits, potentials, and opportunities, and likewise, its risks, challenges, and implications – that we overlook the simpler, more primordial question: what's not changing?

It's a simple question that requires a simple answer: us. Sure, we can assert that we're becoming more data-dependent. We generate more data: last month, social media giant Twitter blogged[1] that its over 200 million active users generate over 400 million tweets per day. We consume more data: a now-outdated University of California report[2] calculated that American households collectively consumed 3.6 zettabytes of information in 2008. Are we – the data-generating organisms that we are – becoming no more than sentient founts of data?

In Big Data: A Revolution That Will Transform How We Live, Work, and Think, authors Viktor Mayer-Schönberger and Kenneth Cukier effectively put the pulse back in the Big Data Conversation: "big data is not an ice-cold world of algorithms and automatons. . .we [must] carve out a place for the human: to reserve space for intuition, common sense, and serendipity to ensure that they are not crowded out by data and machine-made answers."

In our brief email exchange, Mayer-Schönberger elaborated a bit more on this idea. "[We] try to understand the (human) dimension between input and output," he noted. "Not through the jargon-laden sociology of big data, but through what we believe is the flesh and blood of big data as it is done right now."

With the elegance of an Oxford University professor and The Economist's data editor – Mayer-Schönberger and Cukier, respectively – Big Data's authors remind us that it is our human traits of "creativity, intuition, and intellectual ambition" that should be fostered in this brave new world of big data. That the inevitable "messiness" of big data can be directly correlated to the inherent "messiness" of being human. And, most important, that the evolution of big data as a resource and tool derives from (is a function of) the distinctly human capacities of instinct, accident, and error, which manifest, even if unpredictably, in greatness. In that greatness is progress.

That – progress – is the intrinsic value of big data. It's what's so compelling about Big Data (both the book and the thing itself): it's not always about the inputs or outputs, but the space – or, what Mayer-Schönberger calls the "black box" – of in-between.

[1] http://blog.twitter.com/2013/03/celebrating-twitter7.html
[2] How Much Information? http://hmi.ucsd.edu/howmuchinfo.php

Lindy Ryan is Editor in Chief of Radiant Advisors.

Big Data is available on Amazon and the Radiant Advisors eBookshelf: www.radiantadvisors.com/ebookshelf



FEATURES

Has the Big Data Bubble Burst?

[The BI industry is abuzz with one new question: is big data done?]

Krish Krishnan

RECENT ARTICLES IN leading business publications, a hype-cycle presentation by Gartner, and a number of blogs have all startled the world of big data by asking one "big" question: are we done? Did the big data bubble burst even quicker than the "dot com" bubble? Has the big data bubble burst?

The answer is: not really. If anything, the market for infrastructure is booming, with more vendors distributing commercial versions of open source software (like Hadoop and NoSQL). We are seeing the evolution of new consulting practices focused on analytics and – perhaps most important – traditional database vendors have all either embraced or announced support for big data platforms. So, what is the basis of this notion of failure or disappointment around the big data space?

The Promised Land

In 2004, Google's publication of its MapReduce and Google File System papers started a flurry of activity building platforms aimed at solving scalability problems. One of these projects was "Nutch," a parallel search engine built on the open source platform. The team at Nutch succeeded in building the infrastructure that attracted Yahoo to sponsor and incubate the project under its commercial name: Hadoop. Submitted to open source in 2009, Hadoop quickly gained renown as the panacea for all data scalability problems. Since then it has become a viable platform for large-scale computing needs and has been adopted as a data storage and processing platform at many companies across the world. Subsequently, the last four years have also seen the evolution of NoSQL databases and multiple additional technologies on the Hadoop framework.

The Reality

Hadoop's early adopters did not fully understand the complexities of the platform until they began implementing the technology, and this lack of understanding inevitably has spurred a sense of failure (or disappointment). Among the potential gaps not understood clearly by adopters:

One size does not fit all: Big data technologies were developed to solve the problems of extreme scalability and sustained performance. While these technologies have certainly overcome the traditional limitations of database-oriented data processing, the same techniques cannot simply be extended to every other problem in the same realm.

MapReduce skill availability: To effectively use most of the big data platforms, one has to be able to write some amount of MapReduce code; however, this is an area where skills are evolving and (still) scarce.

Programming dependence: Many corporations are unable to adjust to the idea of having teams design and develop code (or data processing) – much like application software development. Standardization of programming techniques for big data is still maturing.

Business case: Most early adopters did not have a robust business case, or, in many cases, the right business case to implement on these platforms. The lack of an end-state solution -- or usage and ROI expectations -- has led to longer development and implementation cycles.

Hype: Continued hype about the technology has caused unrest amongst executives, line of business owners, IT, and business users, leading to often misunderstood capabilities of the platform as well as incorrect ROI or TCO expectations.

But wait: it is not "all over" when we talk about big data; rather, we have come to the point in time where the reality of the platform – and how to drive its adoption within corporations – has started settling down. The big data bubble is alive and well; in fact, it's even progressing in the right direction.

How to Integrate Big Data

As corporations begin to see beyond the hype of big data, everyone from the executive sponsor to the implementation team is beginning to recognize the need to dig a better foundation for integrating big data. There are a few subtle yet invaluable pointers in this process:

1. Build the business case and keep it simple
2. Create a data discovery environment that can be used by line of business experts
3. Identify the data and patterns that are needed to create a robust foundation for analytics
4. Create the initial analytics based on the data discovery
5. Visualize the data in a mash-up platform using semantic data integration techniques
6. Get the business users to use the outcomes
7. Gain adoption of the users
8. Create a roadmap for the larger program

While the overall process of big data integration seems closely aligned to the integration of any other project, there are key differences that can define the success of the big data bubble in your corporation: data discovery, data analysis, and data visualization. These three integral pillars will clearly identify the basis of how to implement big data and monetize such an exercise.

The Future

Several technology providers have announced their support of big data platforms, including DataStax (Cassandra), Intel, Microsoft, EMC, and HP (Hadoop), 10gen (MongoDB), and Cray (YarcData graph analytics). These vendors -- along with existing vendors -- will undoubtedly continue to provide more options and solution platforms for deploying and integrating big data technologies within the enterprise platform.

The big data bubble has not burst; it is still only beginning and will be reaching various levels of maturity over the following years. There are many layers of complexities and intricacies that need to be defined and formalized, but this is where the evolution and opportunities exist.

Krish Krishnan is a globally recognized expert in the strategy, architecture, and implementation of big data. His new book Data Warehousing in the Age of Big Data will be released in August 2013.


[Big Data vs. Data Management]

Twilight of the (DM) Idols

Stephen Swoyer

SOME IN THE INDUSTRY are already writing epitaphs for big data. Others – a prominent market watcher comes to mind – argue that big data, like so many technologies or trends before it, is simply conforming to well-established patterns: following a period of hype, it's undergoing a correction. It's regressing toward a mean.

That was fast.

This doesn't concern us. Big data is an epistemic shift. It's going to transform how we know and understand — how we perceive — the world. What's meant by the term "big data" is a force for destabilizing and reordering existing configurations – much as the Bubonic Plague, or Black Death, was for the Europe of the late-medieval period. It's an unsettling analogy, but it underscores an important point: the phenomenon of big data, like that of the Black Death, is indifferent to the hopes, prayers, expectations, or innumerate prognostications of human actors. It's inevitable. It's going to happen. It's going to change everything.

Even as the epitaphs are flying, the magic quadrants being plotted, and the opinions mongering, big data is changing (chiefly by challenging) the status quo. This is particularly the case with respect to the domain of data management (DM) and its status quo. Here, big data is already a disruptive force: at once democratizing, reconfiguring, and destructive. We'll consider its reordering effect through the prism of Hadoop, which, in the software development and data management worlds, has to a real degree become synonymous with what's meant by "big data."

The Citadel of Data Management

Big data has been described as a wake-up call for data management (DM) practitioners. If we're grasping for analogies, the big data phenomenon seems less like a wake-up call than...a grim tableau straight out of 14th-century France.

This was the time of the Black Death, which was to function as an enormous force for social destabilization and reordering. It was also the time of the Hundred Years War, which was fought between England and France on French soil. The manpower shortage of the invading English was exacerbated by the virulence of the Plague, which historians estimate killed between one- and two-thirds of the European population. Outmanned – and outwomaned, for that matter, once Joan of Arc erupted onto the scene – the English resorted to a time-tested tactic: the chevauchée. The logic of the chevauchée is fiendishly simple: Edward III's English forces were resource-constrained; they enjoyed neither the manpower nor the defensive advantages – e.g., castles, towers, or city walls – that accrued (by default) to the French. The English achieved their best outcomes in pitched battle; the French, on the other hand, were understandably reluctant to relinquish their fortifications, fixed or otherwise.

The challenge for the English was to draw them out to fight. Enter the chevauchée. It describes the "tactic" of rampaging and pillaging – among other, far more horrific practices – in the comparatively defenseless French countryside. Left unchecked, the depredations of the chevauchée could ultimately comprise a threat to a ruler's hegemony: fealty counts for little if it doesn't at least afford one protection from other would-be conquerors. As a tactical tool, the chevauchée succeeded by challenging the legitimacy of a ruling power.

Hadoop has had a similar effect. For the last two decades, the data management (DM) or data warehousing (DW) Powers That Be have been holed up in their fortified castles, dictating terms of access – dictating terms of ingest; dictating timetables and schedules, almost always to the frustration of the line of business, to say nothing of other IT stakeholders. Though Hadoop wasn't conceived tactically, its adoption and growth have had a tactical aspect. By running amok in the countryside, pillaging, burning, and destroying stuff – or, by offering an alternative to the data warehouse-driven BI model – the Hottentots of Hadoop have managed to drag the Lords of DM into open battle.

At last year's Strata + Hadoop World confab in New York, NY, a representative of a prominent data integration (DI) vendor shared the story of a frustrated customer that, he says, had developed – perforce – an especially ambitious project focusing on Hadoop. The salient point, this vendor representative indicated, was that the business and IT stakeholders behind the project saw in Hadoop an opportunity to upend the power and authority of the rival DM team. "It's almost like a coup d'etat for them," he said, explaining that both business stakeholders and software developers were exasperated by the glacial pace of the DM team's responsiveness. "[T]hey asked how long it would take to get source connectivity [for a proposed application and] they were told nine months. Now they just want to go around them [i.e., the data management group]," this representative said. "[T]hey basically want Hadoop to be their new massive data warehouse."

The Zero-Sum Scenario

This zero-sum scenario sets up a struggle for information management supremacy. It proposes to isolate DM altogether; eventually it would starve the DM group out of existence. It views DM not as a potential partner for compromise, but as a zero-sum adversary.

It's an extremist position, to be sure; it nevertheless brings into focus the primary antagonism that exists between software-development and data-management stakeholders. This antagonism must be seen as a factor in the promotion of Hadoop as a general-purpose platform for enterprise data management. Hadoop was created to address the unprecedented challenges associated with developing and managing data-intensive distributed applications. The impetus and momentum behind Hadoop originated with Web or distributed application developers. To some extent, Hadoop and other big data technology projects are still largely programmer-driven efforts. This has implications for their use on an enterprise-wide scale, because software developers and data management practitioners have very different worldviews. Both groups are accustomed to talking past one another. Each suspects the other of giving short shrift to its concerns or requirements.


In short, both groups resent one another. This resentment isn't symmetrical, however; there's a power imbalance. For a quarter century now, the DM group hasn't just managed data -- it's been able to dictate the terms and conditions of access to the data that it manages. In this capacity, it's been able to impose its will on multiple internal constituencies: not only on software developers, but on line-of-business stakeholders, too. The irony is that the perceived inflexibility and unresponsiveness – the seeming indifference – of DM stakeholders has helped to bring together two other nominally antagonistic camps; in their resentment of DM, software developers and the line of business have been able to find common cause.

Few would deny that stakeholders jealously guard their fiefdoms. This is as true of software developers and the line of business as it is of their counterparts in the DM world. Part of the problem is that DM is viewed as an unreasonable or uncompromising stakeholder: e.g., DM practitioners have been unable to meaningfully communicate the logic of their policies; they've likewise been reluctant – or in some cases, unwilling – to revise these policies to address changing business requirements. In addition, they've been slow to adopt technologies or methods that promise to reduce latencies or which propose to empower line-of-business users. Finally, DM practitioners are fundamentally uncomfortable with practices – such as analytic discovery, with its preference for less-than-consistent data – which don't comport with data management best practices.

Hadoop and Big Data in Context

That's where the zero-sum animus comes from. It explains why some in business and IT champion Hadoop as a technology to replace – or at the very least, to displace – the DM status quo. There's a much more pragmatic way of looking at what's going on, however. This is to see Hadoop in context – i.e., at the nexus of two related trends: viz., a decade-plus, bottom-up insurgency, and a sweeping (if still coalescing) big data epistemic shift.

The two are related. Think back to the Bubonic Plague, which had a destabilizing effect on the late-Medieval social order. The depredations of the Plague effectively wiped out many of the practices, customs, and (not to put too fine a point on it) human stakeholders that might otherwise have contested destabilization. The Plague, then, cleared away the ante-status quo, creating the conditions for change and transformation. Big data has had a similar effect in data management – chiefly by raising questions about the warehouse's ability to accommodate disruptions (e.g., new kinds of data and new analytic use cases) for which it wasn't designed. Simply by claiming to be Something New, big data raised questions about the DM status quo.

This challenge was exploited by well-established insurgent currents inside both the line of business and IT. The former has been fighting an insurgency against IT for decades; however, in an age of pervasive mobility, BYOD, social collaboration, and (specific to the DM space) analytic discovery, this insurgency has taken on new force and urgency.

IT, for its part, has grappled with insurgency in its own ranks: the agile movement, which most in DM associate with project management, began as a software development initiative; it explicitly borrowed from the language of political revolution – the seminal agile document is the "Manifesto for Agile Software Development," published in 2001 by Kent Beck and his co-authors – in championing an alternative to software development's top-down, deterministic status quo.

Agility and insurgency have been slower to catch on in DM. Nevertheless, insurgent pressure from both the line of business and IT is forcing DM stakeholders (and the vendors who nominally service them) to reassess both their strategies and their positions. However far-fetched, the possibility of a Hadoop-led chevauchée in the very heart of its enterprise fiefdom – with aid and comfort from a line-of-business class that DM has too often treated more as peasants than as enfranchised citizens – snagged the attention of data management practitioners. Big time.

Reinvention

The Hadoop chevauchée got the attention of DM practitioners for another reason. In its current state, Hadoop is no more suited for use as a general-purpose, all-in-one platform for reporting, discovery, and analysis than is the data warehouse. (See sidebar: A Kludge Too Far?) Given the maturity of the DW, Hadoop is arguably much less suited for this role. For all of its shortcomings, the data warehouse is an inescapably pragmatic solution; DM practitioners learned what works chiefly by figuring out (Continued on p. 21)

SPARK!: Modern Data Platforms
At the Omni Downtown in Austin | Austin, TX | April 29 - May 1 | #sparkevent

Featured keynotes by John O'Brien (Founder and CEO, Radiant Advisors) and Dr. Robin Bloor (Co-Founder and Principal Analyst, The Bloor Group).

Day One | Designing Modern Data Platforms: These sessions provide an approach to confidently assess and make architecture changes, beginning with an understanding of how data warehouse architectures evolve and mature over time, balancing technical and strategic value delivery. We break down best practices into principles for creating new data platforms.

Day Two | Modern Data Integration: These sessions provide the knowledge needed for understanding and modeling data integration frameworks to make confident decisions to approach, design, and manage evolving data integration blueprints that leverage agile techniques. We recognize data integration patterns for refactoring into optimized engines.

Day Three | Databases for Analytics: These sessions review several of the most significant trends in analytic databases challenging BI architects today. Cutting through the definitions and hype of big data in the market, NoSQL databases offer a solution for a variety of data warehouse requirements.

Register now at: http://www.radiantadvisors.com

CAN'T MAKE IT? Catch us in San Francisco from May 28-30. Registration opens April 22nd. Use the priority code ReBI to save $150.

SIDEBAR: Big Impact: Big Data
[Stephen Swoyer]

The most common big data use cases tend to be less sexy than mundane. In fact, two use cases for which big data is today having a Big Impact have decidedly sexy implications, at least from a data management (DM) perspective. Both use cases address long-standing DM problems; both likewise anticipate issues specific to the age of big data. The first involves using big data technologies to supercharge ETL; the second, Hadoop as a landing zone – i.e., a general-purpose virtual storage locker – for all kinds of information.

Of the two, the first is the more mature: IT technologists have been talking up the potential of supercharged ETL almost from the beginning.

Back then, this was framed largely in terms of MapReduce, the mega-scale parallel processing algorithm popularized by Google. Five years on, the emphasis has shifted to Hadoop itself as a platform for massively parallel ETL processing. The rub is that performing stuff other than map and reduce operations across a Hadoop cluster is kind of a kludge. (See sidebar: A Kludge Too Far?)

However, because ETL processing can be broken down into sequential map and reduce operations, data integration (DI) vendors have managed to make it work. Some DI players – e.g., Informatica, Pervasive Software, SyncSort, and Talend, among others – market ETL products for Hadoop. Both Informatica and Talend – along with analytic specialist Pentaho Inc. – use Hadoop MapReduce to perform ETL operations. Pervasive and SyncSort, on the other hand, tout libraries that they say can be used as MapReduce replacements. The result, both vendors claim, is ETL processing that's (a) faster than vanilla Hadoop MapReduce and (b) orders of magnitude faster than traditional enterprise ETL.

This stuff is available now. In the last 12 calendar months, both Informatica and Talend announced "big data" versions of their ETL technologies for Hadoop MapReduce; Pervasive and SyncSort have marketed Hadoop-able versions of their own ETL tools (DataRush and DMExpress, respectively) for slightly longer. In every case, big data ETL tools abstract the complexity of Hadoop: ETL workflows are designed in a GUI design studio; the tools themselves generate jobs in the form of Java code, which can be fed into Hadoop.
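To make the ETL-as-MapReduce idea concrete, here is a minimal, hand-written sketch of the kind of Java job those GUI tools generate behind the scenes. It assumes a hypothetical comma-delimited order feed (customer ID, order ID, amount); the class and field names are illustrative only. The map step extracts and cleanses records; the reduce step aggregates per customer.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class OrderEtlJob {

  // "Extract" and "transform": parse raw lines, drop malformed records,
  // and emit (customerId, orderAmount) pairs.
  public static class CleanseMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split(",");
      if (fields.length < 3) return;                  // discard malformed rows
      String customerId = fields[0].trim().toUpperCase();
      try {
        double amount = Double.parseDouble(fields[2].trim());
        ctx.write(new Text(customerId), new DoubleWritable(amount));
      } catch (NumberFormatException e) {
        // skip rows with unparseable amounts
      }
    }
  }

  // Aggregate step: total order value per customer.
  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text customerId, Iterable<DoubleWritable> amounts,
                          Context ctx) throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable amount : amounts) total += amount.get();
      ctx.write(customerId, new DoubleWritable(total));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "order-etl");
    job.setJarByClass(OrderEtlJob.class);
    job.setMapperClass(CleanseMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // raw input data
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // cleansed output
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The "load" step is whatever consumes the output directory afterward, such as a bulk load into the warehouse.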

Just because the technology's available doesn't mean there's demand for it. Parallel processing ETL technologies have been available for decades; not everybody needs or can afford them, however. David Inbar, senior director of big data products with Pervasive, concedes that demand for mega-scale ETL processing used to be specialized.

At the same time, he says, usage patterns are changing; analytic practices and methods are changing. So, too, is the concept of analytic scale: scaling from gigabyte-sized data sets to dozens or hundreds of terabytes – to say nothing of petabytes – is an increase of several orders of magnitude. In the emerging model, rapid iteration is the thing; this means being able to rapidly prepare and crunch data sets for analysis.


Nor is analysis a one-and-done affair, says Inbar: it's iterative. "What really matters is not so much if it uses MapReduce code or if it uses some other code; what really matters is does it perform and does it save you operational money – and can you actually iterate and discover patterns in the first place faster than you would be able to otherwise?" he asks. "It's always possible to write custom code to get stuff done. Ultimately it's a relatively straightforward [proposition]: [manually] stringing together SQL code [for traditional ETL] or Java code [for Hadoop] can work, but it's not going to carry you forward."

However, one of the data warehouse's (DW) biggest selling points is also its biggest limiting factor. The DW is a schema-mandatory platform. It's most comfortable speaking SQL. It uses a kludge – i.e., the binary large object (BLOB) – to accommodate unstructured, semi-structured, or non-traditional data types. Hadoop, by contrast, is a schema-optional platform. For this reason, many in DM conceive of Hadoop as a virtual storage locker for big data.

"You can drop any old piece of data on it without having to do any of the upfront work of modeling the data and transforming it [to conform to] your data model," explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel. "You can do that [i.e., transform and conform] as you move the data over."

At a recent industry event, several vendors – viz., Hortonworks, ParAccel, and Teradata – touted Hadoop as a point of ingest for all kinds of information. This "landing zone" scenario is something that customers are adopting right now, says Pervasive's Inbar; it has the potential to be the most common use case for Hadoop in the enterprise. "Before you can do all of the amazing/glamorous/groundbreaking analytical work … and innovation, you do actually have to land and ingest and provision the data," he argues. "Hadoop and HDFS are wonderful in that they let you [store data] without having predefined what it is you think you're going to get out of it. Traditionally, the data warehouse requires you to predefine what you think you're going to get out of it in the first place."
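For a rough illustration of the landing-zone pattern itself, here is a minimal sketch against the standard HDFS Java API; the /landing directory layout is a hypothetical convention, not something the vendors above prescribe. Raw files land exactly as they arrive, with modeling and transformation deferred to read time.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LandingZoneIngest {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // fs.defaultFS should point at the cluster, e.g. hdfs://namenode:8020
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical layout: one directory per source and arrival date.
    // No schema is declared up front; files land byte-for-byte as received.
    Path landingDir = new Path("/landing/weblogs/2013-04-04");
    if (!fs.exists(landingDir)) {
      fs.mkdirs(landingDir);
    }

    // Copy a raw local file (clickstream, call-center log, image, whatever)
    // into the landing zone. Schema-on-read tooling decides later what it "is."
    Path localFile = new Path(args[0]);
    fs.copyFromLocalFile(localFile, landingDir);

    System.out.println("Landed " + localFile + " in " + landingDir);
  }
}
```

The transform-and-conform work Glick describes then happens as data moves from this landing directory into the warehouse, not before ingest.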

SIDEBAR: A Kludge Too Far?
[Stephen Swoyer]

The problem with MapReduce – to invoke a shopworn cliché – is that it's a hammer. From its perspective, any and every distributed processing task wants and needs to be nailed. If Hadoop is to be a useful platform for general-purpose parallel processing, it must be able to perform operations other than synchronous map and reduce jobs. The problem is that MapReduce and Hadoop are tightly coupled: the former has historically functioned as parallel processing yin to the Hadoop Distributed File System's storage yang.

Enter the still-incubating Apache YARN project (YARN is a backronym for "Yet Another Resource Negotiator"), which aims to decouple Hadoop from MapReduce. Right now, Hadoop's JobTracker facility performs two functions: resource management and job scheduling; YARN breaks the JobTracker into two discrete daemons. From a DM perspective, this will make it possible to perform asynchronous operations in Hadoop; it will also enable pipelining, which – to the extent it's possible in Hadoop today – is typically supported by vendor-specific libraries.

YARN's been a long time coming, however: it's part of the Hadoop 2.0 framework, which is still in development. Given what's involved, some in DM say YARN's going to need seasoning before it can be used to manage mission-critical, production workloads. That said, YARN is hugely important to Hadoop. It has support from all of the Hadoop heavies: Cloudera, EMC, Hortonworks, Intel, MapR, and others.

"It feels like it's been coming for quite a while," concedes David Inbar, senior director of big data products with data integration specialist Pervasive Software. "All of the players … are in favor of it. Customers are going to need it. If as a sysadmin you don't have a unified view of everything that's running and consum[ing] resources in your environment, that's going to be suboptimal," Inbar continues. "So YARN is a mechanism that's going to make it easier to manage [Hadoop clusters]. It's also going to open up the Hadoop distributed data and processing framework to a wider range of compute engines and paradigms."


3 Ways to Bring BI and Big Data Together

[Three things that make a "big" difference when implementing big data.]

John O'Brien

PERHAPS YOUR ORGANIZATION IS hearing the buzz about big data and business analytics creating value, transforming businesses, and gaining new insights. Or perhaps you've spent some time and resources during the past year reading publications or attending industry events, or even launched a small-scale "big data pilot" experiment. In any case, if you're at the early stages of your company's journey into big data, there are some important conversations to keep in mind as you continue your path to bringing business intelligence (BI) and your company's big data together.

1. Big Data and the Business Intelligence Program

For the most part, big data environments are those that adopt Apache's Hadoop or one of its variants (like Cloudera, MapR, or Hortonworks) or the NoSQL databases (like MongoDB, Cassandra, or HBase with Hadoop). These data stores offer massive scalability and unstructured data flexibility at the best price. No longer reserved for the biggest IT shops, the democratization of big data comes from Hadoop's ability to enable any company to affordably and easily exploit big data sets, and sometimes go even further with cloud implementations. Gleaning insights from these vast data sets requires a completely different type of data platform and programming framework for creating insightful analytic routines.

Analytics is not new to BI: the ability to execute statistical models and identify hidden patterns and clusters of data has long allowed for better business decision-making and predictions. What these new BI analytic capabilities have in common is that they work beyond the capabilities of the SQL statements that govern relational database management systems to execute embedded algorithms. No longer are we constrained to sample data sets; advanced analytic tools can now execute their algorithms in parallel at the data layer. For many years, data has been extracted from data warehouses into flat files to be executed outside the RDBMS by data mining software packages (like SPSS, SAS, and Statistica). Both traditional capabilities -- reporting and dimensional analysis -- have always been needed, along with what is now being called "analytics" in today's BI programs.

Big data analytics is another one of the several BI capabilities required by the business. And even when big data is not so "big," there are other reasons why Hadoop and NoSQL are better solutions than RDBMSs or cubes. Most common is when working with the data is beyond the capabilities of SQL and tends to be more programmatic. The second most common is when the data being captured is constantly changing or has an unknown structure, such that a database schema is difficult to maintain. In this scenario, schema-less Hadoop and key-value data stores are a clear solution. Another is when the data needs to be stored in various data types, such as documents, images, videos, sounds, or other non-record-like data (think, for example, about the metadata to be extracted from a photo image, like date, time, geo-coding, technical photography data, meta-tags, and perhaps even names of people from facial recognition). Most company big data environments today are less than ten terabytes and fewer than eight nodes in the Hadoop cluster because of the other "non-bigness" requirements.
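As a minimal sketch of that schema-less flexibility, assuming the MongoDB Java driver of the period (the collection and field names here are hypothetical), the following stores two photo-metadata documents with entirely different shapes in the same collection, with no schema migration in between.

```java
import java.util.Arrays;

import com.mongodb.BasicDBObject;
import com.mongodb.DB;
import com.mongodb.DBCollection;
import com.mongodb.MongoClient;

public class PhotoMetadataStore {
  public static void main(String[] args) throws Exception {
    MongoClient mongo = new MongoClient("localhost", 27017);
    DB db = mongo.getDB("media");
    DBCollection photos = db.getCollection("photos");

    // One photo with EXIF-style technical data and geo-coding.
    BasicDBObject photo1 = new BasicDBObject("file", "IMG_0412.jpg")
        .append("takenAt", "2013-04-04T09:15:00Z")
        .append("geo", new BasicDBObject("lat", 30.2672).append("lon", -97.7431))
        .append("camera", new BasicDBObject("make", "Canon").append("iso", 200));

    // Another photo with a completely different shape: tags and
    // facial-recognition names. No schema change is required.
    BasicDBObject photo2 = new BasicDBObject("file", "IMG_0413.jpg")
        .append("tags", Arrays.asList("conference", "austin"))
        .append("people", Arrays.asList("Lindy", "John"));

    photos.insert(photo1);
    photos.insert(photo2);
    mongo.close();
  }
}
```

A relational schema would force both records into one table design up front; here, each document simply carries whatever metadata the photo happens to have.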

2. Data Platform = Big Data + Data Warehouse

You might have already discussed what to do now that you have both a Hadoop and a data warehouse system. Should the data warehouse be moved into Hadoop, or should you link them? Do you provide a semantic layer over both of them for users, or between the data stores?

Most companies are moving forward recognizing that both environments serve different purposes, but are part of a complete BI data platform. The traditional hub-and-spoke architecture of data warehouses and data marts is evolving into a modern data platform of three tiers: big data Hadoop, analytic databases, and the traditional RDBMS. Industry analysts are contemplating whether this is a two-tier or three-tier data platform, especially given the expected maturing of Hadoop in the coming years; however, it is safe to say that analytic databases will be the cornerstone of modern BI data platforms for years to come.

The analytic database tier is really for highly-optimized or highly-specialized workloads -- such as columnar, MPP, and in-memory (or vector-based) engines for analytic performance, or text analytics and graph databases for highly-specialized analytic capabilities. Big data governance and analytic lifecycles would encompass semantic and analytic discoveries made in Hadoop, combined with traditional reference data, and then be migrated and productionized in a more controlled, monitored -- and accessible -- analytics tier.

Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade.

Twilight of the (DM) Idols (continued)

what doesn’t work. The genealogy of the data warehouse is

encoded in a double-helix of intertwined lineages: the first is

a lineage of failure; the second, a lineage of success born of

this failure. The latter has been won – at considerable cost –

at the expense of the former. A common DM-centric critique

of Hadoop (and of big data in general) is that some of its sup-

porters want to throw out the old order and start from scratch.

As with the chevauchée – which entailed the destruction of

infrastructure, agricultural sustenance, and formative social

institutions – many in DM (rightly) see in this a challenge to

an entrenched order or configuration.

They likewise see the inevitability of avoidable mistakes –

particularly to the extent that Hadoop developers are con-

temptuous of or indifferent to the finely-honed techniques,

methods, and best practices of data management.

“Reinvention is exactly it, … [but] they aren’t inventing data management technology. They don’t understand data management at all,” argues industry veteran Mark Madsen, a principal with information management consultancy Third Nature Inc. Madsen is by no means a Hadoop hater; he notes that, as a schema-optional platform, Hadoop seems tailor-made for the age of big data: it can function as a virtual warehouse – i.e., as a general-purpose storage area – for information of any and every kind.

The DW is schema-mandatory; its design is predicated on a pair of best-of-all-possible-worlds assumptions: firstly, that data and requirements can be known and modeled in advance; secondly, that requirements won’t significantly change. For this very reason, the data warehouse will never be a good general-purpose storage area. Madsen takes issue with Hadoop’s promotion as an information management platform-of-all-trades.

Proponents who tout such a vision “understand data processing. They get code, not data,” he argues. “They write code and focus on that, despite the data being important. Their ethos is around data as the expendable item. They think [that] code [is greater than or more important than] data, or maybe [they] believe that [even though they say] the opposite. So they do not understand managing data, data quality, why some data is more important than other data at all times, while other data is variable and/or contextual. They build systems that presume data, simply source and store it, then whack away at it.”

The New Pragmatism

Initially, interest in Hadoop took the form of dismissive assessments.

A later move was to co-opt some of the key technologies associated with Hadoop and big data: almost five years ago, for example, Aster Data Systems Inc. and Greenplum Software (both companies have since been acquired by Teradata and EMC, respectively) introduced in-database support for MapReduce, the parallel processing algorithm that search giant Google had first helped to popularize, and which Yahoo helped to democratize – in the guise of Hadoop. Aster and Greenplum effectively excised MapReduce from Hadoop and implemented it (as one algorithm among others) inside their massively parallel processing (MPP) database engines; this gave them the ability to perform mapping/reducing operations across their MPP clusters, on top of their own file systems. Hadoop and its Hadoop Distributed File System (HDFS) were nowhere in the mix.
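To make the invocation style concrete, here is a minimal sketch, loosely modeled on the SQL/MapReduce approach Aster popularized rather than quoted from any vendor’s documentation; the tokenize() function and the documents table are hypothetical:

    -- Hypothetical SQL/MapReduce call: tokenize() stands in for a
    -- user-installed map function that the MPP engine executes in
    -- parallel over its own storage; no Hadoop or HDFS is involved.
    SELECT token, COUNT(*) AS occurrences
    FROM tokenize(
           ON documents          -- input table streamed to the function
           PARTITION BY doc_id)  -- spreads the work across the cluster
    GROUP BY token
    ORDER BY occurrences DESC;

The point of the pattern is locational: the map/reduce logic travels to the database, rather than the data traveling to Hadoop.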

It was, however, a big part of the backstory. Let’s turn the clock back just a bit more, to early 2008, when Greenplum made a move which hinted at what was to come – announcing API-level support for Hadoop and HDFS. In this way, Greenplum positioned its MPP appliance as a kind of choreographer for external MapReduce jobs: by writing to its Hadoop API, developers could schedule MapReduce jobs to run on Hadoop and HDFS. The resulting data, data sets, or analysis could then be recirculated back to the Greenplum RDBMS.

Today, this is one of the schemes by which many in DM would like to accommodate Hadoop and big data. The difference, at least relative to half a decade ago, is a kind of frank acceptance of the inevitability – and, to some extent, of the desirability – of platform heterogeneity. Part of this has to do with the “big” in big data: as volumes scale into the double- or triple-digit terabyte – or even the petabyte – range, technologists in every IT domain must reassess what they’re doing and where they’re doing it, along with just how they expect to do it in a timely and cost-effective manner. Bound up with this is acceptance of the fact that DM can no longer simply dictate terms: that it must become more responsive to the concerns and requirements of line-of-business stakeholders, as well as to those of its IT peers; that it must open itself up to new types of data, new kinds of analytics, new ways of doing things.

“The overall strategy is one of cooperative computing,” explains Rick Glick, vice president of technology and architecture with analytic discovery specialist ParAccel Inc. “When you’re dealing with terabytes or petabytes [of data], the challenge is that you want to move as little of it as possible. If you’ve got these other [data processing] platforms, you inevitably say, ‘Where is the cheapest place to do it?’” This means proactively adopting technologies or methods that help to promote agility, reduce latency, and empower line-of-business users. It means running the “right” workloads in the “right” place, with “right” understood as a function of both timeliness and cost-effectiveness.

Share your comments >
Stephen Swoyer is a technology writer with more than 15 years of experience. His writing has focused on business intelligence and data warehousing for almost a decade.

(Continued from p12)

3. Determining Access

Apache Hive is sometimes called “the data warehouse application on top of Hadoop” because it enables a more generalized access capability for everyday users with its familiar HiveQL format, which SQL-literate users can understand. Hive provides a semantic layer that allows for the definition of familiar tables and columns mapped to key-value pairs found in Hadoop. With virtual tables and columns in place, Hive users can write HiveQL to access data within the Hadoop environment.
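As a minimal sketch (the table, columns, and HDFS path here are hypothetical), a Hive user might lay a virtual table over raw files already sitting in Hadoop and then query it with everyday SQL idioms:

    -- Define a virtual table over files already in HDFS (schema-on-read);
    -- nothing is copied or converted at definition time.
    CREATE EXTERNAL TABLE web_clicks (
      user_id    STRING,
      page_url   STRING,
      click_time STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
    LOCATION '/data/raw/web_clicks';

    -- Familiar SQL-style access; Hive compiles the query into
    -- MapReduce jobs behind the scenes.
    SELECT page_url, COUNT(*) AS views
    FROM web_clicks
    GROUP BY page_url;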

More recent is the release of “HCatalog,” which is making its way into the Apache Hadoop project. HCatalog is a semantic layer component similar to Hive’s, and allows for the definition of virtual tables and columns for communication with any application, not just HiveQL. Last summer, the data visualization tool Tableau let users work with and visualize Hadoop data for the first time, via HCatalog. Today, many analytic databases allow users to work with tables that are views onto HCatalog and Hadoop data. Some vendors also choose to access Hadoop data through Hive itself, leveraging its semantic layer and converting user SQL statements into HiveQL statements. Expect more BI vendors to follow suit and enable their own connectivity to Hadoop.
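The practical payoff is define-once, read-anywhere metadata. In the sketch below (names hypothetical), a summary table is created and populated once through Hive; because HCatalog exposes the same table definitions, Pig scripts, MapReduce jobs, and BI connectors can then read it by name rather than by hard-coded file path and format:

    -- Defined once via Hive DDL; HCatalog publishes the schema so that
    -- other tools can reference "clicks_by_day" without knowing its
    -- storage layout on HDFS.
    CREATE TABLE clicks_by_day (
      day      STRING,
      page_url STRING,
      views    BIGINT
    )
    STORED AS RCFILE;

    INSERT OVERWRITE TABLE clicks_by_day
    SELECT TO_DATE(click_time), page_url, COUNT(*)
    FROM web_clicks
    GROUP BY TO_DATE(click_time), page_url;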

New agile analytic development methodologies and processes are emerging that enable the iterative nature of analytics in big data environments for discovery, and then couple that discovery with data governance procedures to properly move the analytic models to a faster analytic database with operational controls and access. In this model, companies can store big data cheaply until its value can be determined, and then move it to the appropriate production data platform tier. This could be a MapReduce extract to a relational database data mart (or cube), or it could mean executing the analytic program in an MPP, columnar, or in-memory high-performance database.
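As a sketch of that promotion step (the path, table, and threshold are hypothetical), the validated discovery result can be written out of Hadoop as a flat extract that the warehouse’s regular bulk loader or ETL job then picks up:

    -- Export the governed, validated result set to a flat extract on
    -- HDFS; a downstream ETL or bulk-load job moves it into the data
    -- mart on the analytic tier.
    INSERT OVERWRITE DIRECTORY '/data/extracts/clicks_mart'
    SELECT day, page_url, views
    FROM clicks_by_day
    WHERE views > 100;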

More to Come

While big data has come a long way in just a short amount of time, it still has a long road ahead as an industry, as a maturing technology, and as best practices are realized and shared. Don’t compare your company with the mega e-commerce companies (like Yahoo, Facebook, Google, or LinkedIn) that have lived and breathed big data as part of their mission-critical core business functions for years already. Rather, think of your company as one of the other 99% of companies -- small and large, found in every industry -- exploring opportunities to unlock the hidden value in big data on their own. These companies typically already have a BI program underway, but now must grapple with the challenge of maintaining BI delivery from structured operational data while newly integrating big data platforms for business analysts, customers, and internal consumers.

Share your comments >

John O’Brien is the Principal and CEO of Radiant Advisors, a strategic advisory and research firm that delivers innovative thought-leadership, publications, and industry news.




About Radiant Advisors

Research… Advise… Develop…

Radiant Advisors is a strategic advisory and research firm that networks with industry experts to deliver innovative thought-leadership, cutting-edge publications, and in-depth industry research.

Visit www.radiantadvisors.com
Follow us on Twitter! @radiantadvisors