
Computer Weekly Buyer’s Guide

A CW buyer’s guide to big data infrastructure

Organisations are beginning to realise the potential of real-time big data analysis, but there is also a growing realisation that the full benefit will remain unexploited without the infrastructure to expedite it. In this 10-page buyer’s guide, Computer Weekly looks at the mindset and technology businesses need to analyse various forms of data, the low-cost solid state memory powering datastreams from social network feeds and the industrial internet, and a revision of the traditional approach of matching back-end infrastructure to application requirements.

In this guide:

› Choosing a platform to manage the big data mix – big data requires a change in mindset and technology as businesses seek to analyse various forms of data from myriad sources

› Process big data at speed – low-cost solid state memory is powering high-speed analytics of big data streaming from social network feeds and the industrial internet

› Storage struggles to keep up with data growth explosion – big data analysis needs a split from the traditional approach of matching back-end infrastructure to application requirements

These articles were originally published in the Computer Weekly ezine.


Buyer’s guide: infrastructure for big data

Choosing a platform to manage the big data mix

Big data requires a change in both mindset and technology as businesses seek to analyse various forms of data from myriad sources. Clive Longbottom reports

It seems that 2012 was the year of big data – at least as far as the main software and hardware suppliers were concerned. Nearly every supplier in the market had something that could be marketed as part of a solution for big data. This raft of misplaced technologies has led to misconceptions about what is really needed for organisations to draw value from escalating volumes of data.

But 2013 could be the year of big data rationality. Systems now being offered by many suppliers appear to have been built to deal with the many facets of the big data problem, rather than being a mish-mash of existing offerings quickly bundled together to hit the market at the same time as everyone else.

Challenges of big data

So, what constitutes a big data issue? What springs to mind for many is “volume”. Surely, if I have lots of data, then this is a big data issue? Well, it could be – or it may not be.


It may just be a “lot of data” issue – one that only needs a scalable database from the likes of Oracle, IBM or Microsoft, with a good business analytics solution from these or from independent suppliers, such as the SAS Institute, layered on top of it.

No, size is not everything. Big data brings with it a lot more variables, in the guise of a series of “Vs”.

First, the variety of the information sources has to be taken into account. No longer can a stream of information just be regarded as a series of ones and zeros, with a belief that all data is equal. Not everything will be held within a formal database.

Files held in office automation formats, information scraped from search engines crawling the web, and other data sources all need to be included in what is used as the basic raw materials. IBM, Oracle, Teradata and EMC are all now building systems that incorporate a mix of technologies into single boxes that can deal with mixed data – or can offer matched systems that ensure each type of data is managed and analysed in an optimised and coherent manner.

The rise in the use of images, voice, video and non-standard data sources, such as real-time production-line data, smart-building and environmental systems, means data has to be dealt with in context. This requires networking equipment that can deal with prioritised data streams and is capable of analysing those streams either at line speed, or by copying enough of the stream and dealing with it as required outside the live stream. The majority of networking equipment from the likes of Cisco, Juniper, Dell and HP will be able to deal with 802.1p/Q priority and quality of service settings and, externally, with packets tagged using Multiprotocol Label Switching (MPLS).
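At the application level, a data stream can be marked so that QoS-aware equipment of this kind prioritises it. The following Python sketch sets the IP ToS/DSCP byte on a socket; the DSCP value, collector address and payload are illustrative assumptions rather than anything recommended in the article, and IP_TOS is only honoured on some platforms.

```python
import socket

# Illustrative example: tag a telemetry stream with DSCP "Expedited Forwarding"
# (DSCP 46 -> ToS byte 0xb8) so QoS-aware switches and routers can prioritise it.
# Note: socket.IP_TOS is platform-dependent (available on Linux).
DSCP_EF_TOS = 0xb8                    # DSCP 46 shifted left by two bits
COLLECTOR = ("10.0.0.50", 5140)       # invented analytics collector address

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, DSCP_EF_TOS)

# Send one (fictitious) sensor reading; in practice this would be a live feed.
sock.sendto(b'{"sensor": "line-3", "vibration": 0.87}', COLLECTOR)
sock.close()
```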

Next is velocity. A retail operation needing to analyse transactions on a daily basis to organise its supply operations for the following day does not need real-time data analysis. However, an investment bank making investment decisions against incoming live data on rapidly changing information, such as commodities pricing, will need analysis to be carried out as near to real time as possible. Similarly, government security systems, online fraud detection and anti-malware systems need large amounts of data to be analysed in near real time to be effective. This is where the latest systems from the likes of IBM with PureData, Oracle with Exadata, and Teradata are providing solutions designed to deal with masses of data in real time.

Data accuracy

The veracity of the data also has to be considered. This has two components. One is the quality of the data under the control of the organisation. Information that is specific about a person – names, addresses, telephone numbers – can be dealt with through data cleansing from companies such as UKChanges, PCA or Equifax; other information, such as mapping data, can be dealt with through cloud services such as Google or Bing Maps – thus outsourcing the problem of ensuring that data is accurate and up to date to organisations that can afford to be experts in the field.

The other aspect of veracity is around data that is not under the direct control of the organisation. Information drawn from external sources needs to be evaluated to determine the level of trust that can be put in the source. For named sources, such as those mentioned above, trust can be explicit.


For other sources, cross-referencing may be required to see how many other people have quoted the source, whether the individual or organisation associated with the information is known and trusted by others, and so on. Here, the likes of Wolfram Alpha, LexisNexis, Reuters and others can provide corroborated information that can be regarded as more trustworthy than just direct internet traffic.
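To make the cross-referencing idea concrete, here is a deliberately naive Python sketch, not anything the article prescribes: an external claim is scored by how many already-trusted sources corroborate it, with the source list, claims and threshold all invented for illustration.

```python
# Naive corroboration scoring for externally sourced claims (illustrative only).
TRUSTED_SOURCES = {"Reuters", "Wolfram Alpha", "LexisNexis"}

def trust_score(claim: str, citations_by_source: dict[str, set[str]]) -> float:
    """Fraction of trusted sources that also carry this claim."""
    corroborating = {src for src, claims in citations_by_source.items()
                     if src in TRUSTED_SOURCES and claim in claims}
    return len(corroborating) / len(TRUSTED_SOURCES)

citations = {
    "Reuters": {"Company X opens new plant"},
    "LexisNexis": {"Company X opens new plant"},
    "random-blog.example": {"Company X opens new plant", "Company X to be acquired"},
}

for claim in ("Company X opens new plant", "Company X to be acquired"):
    score = trust_score(claim, citations)
    print(f"{claim!r}: corroboration {score:.2f}",
          "-> usable" if score >= 0.5 else "-> needs review")
```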

Next is value. Two things need to be considered here – the upstream value and the downstream value.

Whereas most organisations have focused on internal data, there is now a need to reach outside the organisation (upstream) to other data sources. For example, a pharmaceutical company researching a new chemical or molecular entity needs to keep an eye on what its competitors are up to by monitoring the web, but it must also filter out all the crank stuff that has little bearing on what it is doing.

The downstream side of the value is explicit – there is little point in providing analysis of data to a person if it is of little use to them.

Merging relational and NoSQL

Overall, big data is not just a change in mindset. It also requires a change in the technologies used to deal with the different types of data in play, and in the way they are analysed and reported.

The archetypal databases from Oracle, IBM, Microsoft and others are currently useful for dealing with structured data held in rows and columns, but struggle with less structured data, which they have to hold as binary large objects – or blobs.

The rise of the NoSQL, schema-less database market, exemplified by the likes of 10gen with MongoDB, Couchbase, Cassandra and others, is showing how less structured data can be held in a manner that makes it easier for the data to be analysed and reported on. It is in pulling such technologies together that IBM, Teradata, EMC and others are beginning to create systems that are true big data engines.
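As a minimal sketch of what “schema-less” means in practice, the snippet below stores two differently shaped records in the same MongoDB collection using pymongo and then queries across them; the database and collection names, and a MongoDB instance running locally, are assumptions made for the example.

```python
from pymongo import MongoClient

# Assumes a MongoDB instance on localhost; names are purely illustrative.
client = MongoClient("mongodb://localhost:27017/")
events = client["bigdata_demo"]["events"]

# Two records with different shapes live in the same collection:
# no table definition or ALTER TABLE is needed before inserting either one.
events.insert_one({"type": "tweet", "user": "@example", "text": "New product!",
                   "hashtags": ["launch"]})
events.insert_one({"type": "sensor", "line": 3, "vibration": 0.87,
                   "ts": "2013-02-01T10:15:00Z"})

# Query across the mixed records by a shared field.
for doc in events.find({"type": "tweet"}):
    print(doc["user"], doc["text"])
```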

However, there is still the need to combine the structured database and the less structured systems, and to make sure that all the various data ends up in the right place. This is where something like Hadoop tends to fit in – using MapReduce, it can act as a filter against incoming data streams, and the outputs can then be placed in either a structured or an unstructured data store. For those who are heavily involved with SAP, HANA can be used in much the same manner.
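The filtering role described above can be made concrete with Hadoop Streaming, which lets a plain script act as the mapper. The sketch below is illustrative only: the field names and the “structured”/“unstructured” routing labels are assumptions, not anything prescribed in the article.

```python
#!/usr/bin/env python3
"""Hadoop Streaming mapper: classify incoming records so a later step can
route them to a relational store or a document/blob store.
Intended to run as the -mapper of a Hadoop Streaming job; job submission
details are omitted here."""
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    try:
        record = json.loads(line)
    except ValueError:
        record = None
    if isinstance(record, dict) and {"order_id", "amount"} <= record.keys():
        print("structured\t" + line)    # well-formed rows for the relational store
    else:
        print("unstructured\t" + line)  # documents, logs, social posts, free text
```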

Over the top of the data infrastructure has to be the analysis and reporting capability. Although those selling the hardware and databases tend to have their own systems – for example, Oracle has Hyperion, IBM has Cognos and SPSS, and SAP has BusinessObjects – there are plenty of choices outside of these suppliers. SAS Institute remains the independent 800-pound gorilla, but newer incomers such as QlikTech, Birst, Panopticon, Pentaho and Splunk are showing great promise in being able to provide deep insights across mixed data sources.

Although 2012 was filled with big data hype, that does not mean big data is unimportant. The capability to effectively analyse a broader mix of data, in a manner that enables true knowledge to be extracted, will be a powerful driver of future success for businesses. It is better to start planning for an effective platform now, rather than waiting and watching your competition beat you to it.


Clive Longbottom is founder of Quocirca


Buyer’s guide: infrastructure for big data

Process big data at speed

Low-cost solid state memory is powering high-speed analytics of big data streaming from social network feeds and the industrial internet. Tony Baer reports

There is little question that big data has broken through, not only to the enterprise IT agenda, but also to the public imagination. Barely a day passes without mention of new capabilities for mining the internet to yield new insights on customers, optimise search, or personalise web experiences for communities comprising tens or hundreds of millions of members.

Ovum considers big data to be a problem that requires powerful alternatives beyond traditional SQL database technology. The attributes of big data include “the three Vs” – volume, variety (structured, along with variably structured data) and velocity. Ovum believes that a fourth V – value – must also be part of the equation.

A variety of platforms have emerged to process big data, including advanced SQL (sometimes called NewSQL) databases that adapt SQL to handle larger volumes of structured data with greater speed, and NoSQL platforms that may range from file systems to document or columnar data stores, which typically dispense with the need for modelling data. Most of the early implementations of big data, especially with NoSQL platforms such as Hadoop, have focused more on volume and variety, with results delivered through batch processing.

Behind the scenes, there is a growing range of use cases that also emphasise speed. Some of them consist of new applications that take advantage not only of powerful back-end data platforms, but also of the growth in bandwidth and mobility. Examples include mobile applications such as Waze, which harness sensory data from smartphones and GPS devices to provide real-time pictures of traffic conditions.
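A toy version of that kind of sensory aggregation (with invented road-segment identifiers and speed readings, not real Waze data) rolls incoming GPS reports up into per-segment average speeds over a short window:

```python
from collections import defaultdict, deque
from statistics import mean
from typing import Optional
import time

WINDOW_SECONDS = 120  # only very recent reports describe "current" traffic

# road segment id -> deque of (timestamp, speed_kmh); all values are invented
reports = defaultdict(deque)

def ingest(segment_id: str, speed_kmh: float, ts: Optional[float] = None) -> None:
    """Record one GPS speed report and drop anything older than the window."""
    ts = time.time() if ts is None else ts
    window = reports[segment_id]
    window.append((ts, speed_kmh))
    while window and window[0][0] < ts - WINDOW_SECONDS:
        window.popleft()

def current_speed(segment_id: str) -> Optional[float]:
    """Near real-time picture: average reported speed on a segment right now."""
    window = reports[segment_id]
    return mean(s for _, s in window) if window else None

# Simulated phone reports for a fictitious stretch of road
for speed in (62, 55, 18, 12, 15):
    ingest("A40-junction-7", speed)
print(current_speed("A40-junction-7"))  # low average suggests congestion forming
```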

On the horizon, there are opportunities for mobile carriers to track caller behaviour in real time to target ads, offer location-based services or otherwise engage their customers, as well as opportunities for online retailers to exploit next-best offers to maximise customer revenue.


Conversely, existing applications are being made more accurate, responsive and effective as smart sensors add more data points, intelligence and adaptive control. These are as diverse as optimising supply chain inventories, regulating public utility and infrastructure networks, or providing real-time alerts for homeland security. The list of potential opportunities for fast processing of big data is limited only by the imagination.

Silicon-based storage

Fast data is the subset of big data implementations that require velocity. It enables instant analytics or closed-loop operational support with data that is either not persisted, or is persisted in a manner optimised for instant, ad hoc access. Fast data applications are typically driven by rules, complex logic or algorithms.

Regarding persistence, the data is either processed immediately and not persisted, such as through extreme low-latency event processing, or it is persisted in an optimised manner. This is typically accomplished with silicon-based flash or memory storage, and the data is either lightly indexed (to reduce scanning overhead) or not indexed at all. The rationale is that the speed of silicon either eliminates the need for sophisticated indexing, or allows customised data views to be generated dynamically.
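The trade-off between heavy indexing and raw memory speed can be sketched in a few lines of Python; this is a toy stand-in for the flash- or memory-backed stores described above, not a description of any particular product:

```python
from collections import defaultdict

class FastEventStore:
    """Toy in-memory event store: one light index, everything else is a scan."""

    def __init__(self):
        self._events = []                    # contiguous in-memory event log
        self._by_symbol = defaultdict(list)  # the single "light" index

    def append(self, event: dict) -> None:
        self._events.append(event)
        self._by_symbol[event["symbol"]].append(len(self._events) - 1)

    def for_symbol(self, symbol: str) -> list:
        """Indexed access path for the one hot lookup pattern."""
        return [self._events[i] for i in self._by_symbol[symbol]]

    def ad_hoc(self, predicate) -> list:
        """Ad hoc queries simply scan: cheap enough when the data sits in RAM."""
        return [e for e in self._events if predicate(e)]

store = FastEventStore()
store.append({"symbol": "XYZ", "price": 101.2, "qty": 500})
store.append({"symbol": "ABC", "price": 44.7, "qty": 1200})
print(store.for_symbol("XYZ"))
print(store.ad_hoc(lambda e: e["qty"] > 1000))
```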

Connectivity is also critical. While ultra-low-latency messaging links are not mandatory (they are typically only used by securities trading firms), optimising connectivity through high-speed internal buses, such as InfiniBand, is essential for computation over large blocks of data. Direct links to high-speed wide area network (WAN) backbones are key for fast data applications digesting data from external sources.

What fast data is not

Fast transaction systems that can be updated interactively, but do not automatically close the loop on how an organisation responds to events (for example, the systems are read manually or generate reports), are not considered applications of fast data.

Conventional online transaction processing (OLTP) databases are typically designed with some nominal degree of optimisation, such as locating hot (frequently or recently used) data on the most accessible parts of a disk (or sharded across multiple disks in a storage array), more elaborate indexes, and/or table designs that reduce the need for joins. In these cases, the goal is optimising the interactive response for frequent, routine queries or updates. These systems are not, however, designed for processing inordinately large volumes or varieties of data in real time.
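As a small, generic illustration of the sharding idea mentioned above (the shard count and customer keys are invented), a hash of the record key can deterministically pick which disk or node holds a hot row:

```python
import hashlib

NUM_SHARDS = 8  # invented: e.g. eight disks in a storage array, or eight nodes

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Deterministically map a record key to a shard."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

# Rows for the same customer always land on the same shard, so frequent,
# routine lookups touch only one disk or node.
for customer_id in ("cust-1001", "cust-1002", "cust-1003"):
    print(customer_id, "-> shard", shard_for(customer_id))
```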

Fast data is not new

Real-time databases are not new. Capital markets firms have long relied on databases capable of ingesting and analysing “tick” data – a task that requires not only speed but also the ability to process fairly complex analytics algorithms.

Examples of fast data applications

› Sensory applications that provide snapshots of phenomena or events from a variety of data points that are aggregated and processed in real time or near real time.

› Stream-processing applications that process high-speed data feeds with embedded, rules-driven logic, either to alert people to make decisions or to trigger automated closed-loop operational responses (a minimal sketch of this pattern follows the list).

› High-speed, low-latency series of events that, as with stream processing, generate alerts or closed-loop automated responses based on embedded rules or business logic, such as high-frequency trading (HFT).

› Real-time or near real-time analytics.

› Real-time transactional or interactive processing applications involving large, multi-terabyte, internet-scale data sets.
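A stripped-down example of the stream-processing pattern referenced in the list above might look like the following Python sketch; the sensor names, thresholds and the rule itself are invented for illustration:

```python
from collections import deque
from statistics import mean

# Invented rule: a sustained high rolling-average vibration triggers an
# automated response; a single spike only raises an alert for a person.
THRESHOLD = 0.8
WINDOW = 5

def automated_response(sensor: str) -> None:
    print(f"[ACTION] throttling equipment monitored by {sensor}")

def alert_operator(sensor: str, value: float) -> None:
    print(f"[ALERT] {sensor} spiked to {value:.2f} - operator review needed")

def process_stream(readings):
    """readings: iterable of (sensor_id, vibration) tuples arriving in order."""
    windows: dict[str, deque] = {}
    for sensor, value in readings:
        window = windows.setdefault(sensor, deque(maxlen=WINDOW))
        window.append(value)
        if mean(window) > THRESHOLD:   # sustained problem -> close the loop
            automated_response(sensor)
        elif value > THRESHOLD:        # one-off spike -> tell a person
            alert_operator(sensor, value)

process_stream([("turbine-7", 0.4), ("turbine-7", 0.95), ("turbine-7", 0.9),
                ("turbine-7", 0.92), ("turbine-7", 0.96)])
```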


Traditionally, capital markets firms have looked to niche suppliers with highly specialised engines that could keep pace, and in some cases these platforms have been backed by Kove’s in-memory storage appliance.

In-memory data stores have been around for roughly 20 years, but their cost has typically restricted them to highly specialised applications. These include: extreme high-value niches, such as event-streaming systems developed by investment banks for conducting high-speed trading triggered by patterns in real-time capital markets feeds; small-footprint subsystems, such as directories of router addresses for embedded networking systems; and deterministic, also known as “hard”, real-time systems based on the need to ensure that the system responds within a specific interval.

These systems often have extremely small footprints, sized in kilobytes or megabytes, and are deployed as embedded controllers, typically as firmware burned into application-specific integrated circuit (ASIC) chips, for applications such as avionics navigation or industrial machine control.

The common threads are that real-time systems for messaging, closed-loop analytics and transaction processing have been restricted to specific niches because of their cost, and because the amount of memory necessary for many of these applications is fairly modest.

What has changed?

It should sound familiar: the continuing technology price/performance curve, fed by trends such as Moore’s law for processors and corollaries for other parts of the IT infrastructure, especially storage, is making fast data technologies and solutions economical for a wider cross-section of use cases.

The declining cost of storage has become the most important inflection point. Silicon-based storage, either flash or memory, has become cheap enough to be deployed at sufficient scale not only to cache input/output (I/O), but also to store significant portions of an entire database.

Enabling technology has inflated user expectations accordingly. A recent Jaspersoft big data survey of its customer base provided a snapshot of demand. Business intelligence customers, who have grown increasingly accustomed to near real-time or interactive ad hoc querying and reporting, are carrying the same expectations over to big data. Jaspersoft’s survey revealed that over 60% of respondents have deployed or are planning to deploy big data analytics within 18 months, with nearly 50% expecting results in real time or near real time.

Speed is being embraced by mainstream enterprise software players and start-ups alike. Oracle and SAP are commercialising the former niche market of in-memory databases, and Tibco is promoting the ability to deliver a “two-second advantage” – just enough information, in context, to make snap decisions, based on a combination of messaging, in-memory data grid, rules and event-processing systems.

Core SQL platforms are being reinvented to raise limits on speed and scale. For instance, the latest models of Oracle Exadata engineered appliances pack up to 75% more memory, and up to four times more flash memory, at similar price points compared with previous models.


Tony Baer is a research director at Ovum. This article is an extract from Ovum’s report What is fast data? Download the full report here.


Buyer’s guide: infrastructure for big data

Storage struggles to keep up with data growth explosion

Big data analysis needs a split from the traditional approach of matching back-end infrastructure to application requirements. Cliff Saran reports

Traditional approaches to building storage infrastructure may be wholly unsuitable for the analysis of large, real-time datasets. Enterprise storage can be very application-focused: it deploys storage area network (SAN) storage for transactional systems, or network-attached storage (NAS) for file storage. Businesses usually think about their applications first, and the back-end storage comes afterwards.

Big data needs a different approach, because of the large volumes of data involved. Ovum senior analyst Tim Stammers warns: “There is no clear consensus in the industry on what to sell customers.” Some suppliers are offering object storage, clustered scalable NAS or block-level SANs. “All have their own advantages, but it all depends on your environment,” he adds.

Suppliers sell big data appliances with integrated storage, which improves performance, but may also cause businesses issues when the data needs to be shared.


Storage for Hadoop

Hadoop, the Apache open-source implementation of Google’s MapReduce algorithm, takes a different approach to processing data from that of the relational databases used to power transactional systems.

Hadoop processes data in parallel: data is effectively split across multiple nodes in a large computer cluster, allowing big data to be analysed across a large number of low-cost computing nodes. The cluster can be on-premise or hosted somewhere such as the Amazon cloud.

“It maps data to store on computer nodes in a cluster and reduces the amount of data transferred to the cluster,” says Gartner research director Jie Zhang. “Traditionally, IT infrastructure is siloed and is very vertical; big data uses a scale-out architecture.”

Server farms for big data

Hadoop effectively splits datasets into smaller pieces, known as blocks, through its file system, the Hadoop Distributed File System (HDFS).

Such a cluster puts a heavy load on the network. According to IBM, for Hadoop deployments using a SAN or NAS, the extra network communication overhead can cause performance bottlenecks, especially for larger clusters. So NAS and SAN-based storage is out of the question.

Ovum principal analyst Tony Baer has been looking at how to extend the performance and enterprise-readiness of Hadoop.

Given that Hadoop relies on large numbers of low-cost disks, rather than enterprise-grade disk drives, factors such as the mean time between failures quoted by disk manufacturers become significant.

In 2010, Facebook had the largest deployment of Hadoop, with a 30PB database. Now consider using 30,000 1TB drives for that storage.

For simplicity, assume the installation was built all in one go. If a typical drive has a mean time between failures (MTBF) of 300,000 hours, each drive will run 8,766 hours in a year. The total number of drive-hours the 30PB system will accumulate in a year is 263 million (8,766 x 30,000). That means about 877 drives will fail in a year, or 2.4 disk drive failures a day.
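The same back-of-the-envelope calculation applies to any large cluster; a quick Python check using only the figures above confirms the failure rate:

```python
# Back-of-the-envelope drive failure rate for the 30PB example above.
mtbf_hours = 300_000            # quoted mean time between failures per drive
drives = 30_000                 # 30,000 x 1TB drives
hours_per_year = 8_766          # 365.25 days x 24 hours

drive_hours_per_year = drives * hours_per_year          # ~263 million drive-hours
failures_per_year = drive_hours_per_year / mtbf_hours   # ~877 failed drives
failures_per_day = failures_per_year / 365.25           # ~2.4 failures a day

print(f"{drive_hours_per_year:,} drive-hours -> "
      f"{failures_per_year:.0f} failures/year, {failures_per_day:.1f}/day")
```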

Luckily, HDFS has built-in redundancy, so disk failure does not result in data loss. But one must feel a little sorry for the technician whose job it is to locate and replace the failed drives, even if it becomes part of a regular maintenance routine.

However, in spite of its redundancy capabilities, Baer notes in his latest research paper, Big Data Storage in Hadoop, that HDFS lacks many data protection, security, access and performance optimisation features associated with mature commercial file and data storage subsystems. Many of these features are not essential for existing Hadoop analytic processing patterns. He says: “Because big data analytics is a moving target, the Hadoop platform may have to accommodate features currently offered by more mature file and storage systems, such as support of snapshots, if it is to become accepted as an enterprise analytics platform.”

Fast data

Big data arriving in real time cannot be processed by Hadoop alone, as it is batch-based. Bill Ruh, vice-president for the software centre at GE, explains the problem:


“The amount of data generated by sensor networks on heavy equipment is astounding. A day’s worth of real-time feeds on Twitter amounts to 80GB. One sensor on a blade of a turbine generates 520GB per day, and you have 20 of them.”

These sensors produce real-time big data that enables GE to manage the efficiency of each blade of every turbine in a wind farm. The performance of each blade is influenced not only by the weather, but also by the turbulence caused by the turbines in front of it. GE has developed its own software to take on some of the big data processing, but Hadoop is also used.
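Taking the figures in Ruh’s quote at face value, the per-turbine volume is easy to check with a couple of lines of Python (reading “20 of them” as 20 such sensors per turbine is an assumption):

```python
# Scale check using only the figures quoted above.
gb_per_sensor_per_day = 520
sensors_per_turbine = 20          # assumed reading of "you have 20 of them"
twitter_gb_per_day = 80           # quoted size of a day of real-time Twitter feeds

turbine_gb_per_day = gb_per_sensor_per_day * sensors_per_turbine
print(f"One turbine: {turbine_gb_per_day:,} GB/day "
      f"(~{turbine_gb_per_day / 1000:.1f} TB), or "
      f"{turbine_gb_per_day / twitter_gb_per_day:.0f}x a day of Twitter")
```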

Gartner’s Zhang says disk drives will become performance bottlenecks. “The hottest trend is SSD [solid-state disk] drives, to eliminate mechanical disk drives,” she explains.

Zhang says Hadoop can be configured to mix and match SSD with hard disk drives: “In your disk array, not all the data is accessed all the time. The really important data at any given moment is not a large dataset. This hot data needs to be quickly accessible, so it can be migrated to SSD.” Just as with traditional disk tiering, more historical data will be pushed down to cheaper, mechanical disk drives.
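The tiering decision Zhang describes can be sketched in a few lines; the access counts, threshold and tier names below are invented for illustration, and a real deployment would rely on the storage layer’s own tiering (or, in later Hadoop releases, HDFS storage policies) rather than hand-rolled code:

```python
# Toy hot/cold placement: frequently accessed blocks go to the SSD tier,
# the long tail stays on (or is demoted to) cheaper mechanical disk.
HOT_ACCESS_THRESHOLD = 100   # invented: accesses per day that make a block "hot"

access_counts = {            # block id -> accesses in the last day (invented data)
    "blk_0001": 1250,
    "blk_0002": 3,
    "blk_0003": 480,
    "blk_0004": 0,
}

def plan_placement(counts, threshold=HOT_ACCESS_THRESHOLD):
    return {blk: ("ssd" if n >= threshold else "hdd") for blk, n in counts.items()}

for block, tier in plan_placement(access_counts).items():
    print(f"{block}: {tier}")
```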

The major suppliers are also addressing the real-time aspects of big data analysis, through vertically integrated appliances and in-memory databases such as HANA from SAP. Clive Longbottom, founder of analyst Quocirca, says: “New systems from the likes of IBM with PureData, Oracle with Exadata, and Teradata are providing architected solutions designed to deal with masses of data in real time.”

Cloud-based big data

Mandhir Gidda is UK technical director at Razorfish, a global interactive digital agency. When the company was set up, it built a complex datacentre infrastructure to run a service called Atlas. This uses a single cookie on a browser and JavaScript to allow agencies to see what sites a user is visiting, the products they put in their baskets and what they ultimately buy online.

He says it was a challenge for the company that, as consumers moved to multi-channel ecommerce – with the attendant growth in social data – its own infrastructure could no longer cope.

“We were one of the biggest datacentres with Atlas, but we struggled to meet client deadlines,” he says.

Investment catches up with growth

The company needed to invest £500,000 in upgrades. Gidda says the company probably needed to invest a further £500,000 three months later just to keep up with the data growth: “We needed an IT organisation the size of the whole business.”

In 2009, Razorfish decided to move to Amazon. “We use Elastic MapReduce on Amazon and Cascading, a tool to upload 500GB per day.” This represents a trillion impressions, clicks and actions per day. Processing this amount of data on its old infrastructure used to take three days. “It now takes four hours,” he adds.
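Razorfish’s 2009 set-up predates today’s tooling, but for a rough sense of how a comparable Elastic MapReduce job is launched programmatically now, here is a hedged boto3 sketch; the bucket names, instance types, step arguments, release label and IAM role names are placeholders, not details from the case study:

```python
import boto3

# Placeholder names throughout; this only sketches the shape of an EMR launch.
emr = boto3.client("emr", region_name="eu-west-1")

response = emr.run_job_flow(
    Name="daily-clickstream-aggregation",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 10,
        "KeepJobFlowAliveWhenNoSteps": False,
    },
    Steps=[{
        "Name": "aggregate-impressions",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["hadoop-streaming",
                     "-input", "s3://example-bucket/raw/2013-02-01/",
                     "-output", "s3://example-bucket/aggregated/2013-02-01/",
                     "-mapper", "mapper.py", "-reducer", "reducer.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Started cluster:", response["JobFlowId"])
```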

However, cloud-based processing of big data is not for everyone. GE’s Ruh explains: “Cloud computing in its present form is not wholly suitable for machine-to-machine (M2M) interactions at GE. We are seeing more processing running on machines, due to latency.”

The technology is based around in-memory database systems. Since the datasets are extremely large, Ruh says GE uses NoSQL and Hadoop.

“We have also developed our own database for time series analysis,” he says. But GE is also working with Microsoft Azure and Amazon Web Services to investigate how to offload processing to the cloud.
