
Sector Roadmap: Hadoop/Data Warehouse Interoperability

George Gilbert

01/29/2015

Table of Contents

1. Executive Summary

2. Introduction and Methodology

3. Usage Scenarios

4. Disruption Vectors

5. Company Analysis

6. Key Takeaways

7. About George Gilbert

8. About Gigaom Research

9. Copyright

1 Executive Summary

SQL-on-Hadoop capabilities played a key role in the big data market in 2013. In 2014, their importance only grew, as did their ubiquity, making possible new use cases for big data. Now, with virtually every Hadoop distribution vendor and incumbent database vendor offering SQL-on-Hadoop solutions, the key factor in the market is no longer mere SQL query capability; it is the quality and economics of the resulting integration between Hadoop and data warehouse technology.

This Sector Roadmap™ examines that integration, reviewing SQL-on-Hadoop solutions on offer from the three major Hadoop vendors: Cloudera, Hortonworks, and MapR; incumbent data warehouse vendor Teradata; relational-database juggernaut Oracle; and Hadoop/data warehouse hybrid vendor Pivotal. With this analysis, key usage scenarios made possible by these solutions are identified, as are the architectural distinctions between them.

Vendor solutions are evaluated over six Disruption Vectors: schema flexibility, data engine interoperability, pricing model, enterprise manageability, workload role optimization, and query engine maturity. These vectors collectively measure not just how well a SQL-on-Hadoop solution can facilitate Hadoop-data warehouse integration, but how successfully it does so with respect to the emerging usage patterns discussed in this report.

Key findings in our analysis include:

• In addition to the widely discussed data lake, the adjunct data warehouse is a key concept, one with greater near-term relevance to pragmatist customers.

• The adjunct data warehouse provides for production ETL, reporting, and BI on the data sources first explored in the data lake. It also offloads production ETL from the core data warehouse in order to avoid costly capacity additions on proprietary platforms that carry a 10- to 30-times cost premium.

• MapR fared best in our comparison due to the integration powers of Apache Drill's technology. It would have fared better still were Drill not in such a relatively early phase of development.

• Hortonworks, given its enhancements to Apache Hive, and Cloudera, with its dominant Impala SQL-on-Hadoop engine, follow closely behind MapR.

• Despite their conventional data warehouse pedigrees, Teradata, Pivotal, and Oracle are very much in the game as they make their comprehensive SQL languages available as a query interface over data in Hadoop.

Sector Roadmap company comparison chart. Key: the number indicates a company's relative strength across all vectors; the size of each ball indicates the company's relative strength along an individual vector. Source: Gigaom Research

2 Introduction and Methodology

Traditionally monolithic in features and architecture, with correspondingly high costs of acquisition and operation, database and data warehouse systems are becoming disaggregated. Thanks to big data, and Hadoop in particular, the data manipulation engine, data catalog, and storage engine are more componentized and less tightly bound in today's analytic data management world.

This looser federation of data processing functionality makes, and will continue to make, the building of analytics pipelines easier. Where customers had been limited to what was built into the core DBMS, this pipeline approach lets them mix and match more specialized and, ideally, more capable engines. This phenomenon is changing the database industry, forcing a previously staid ecosystem to undergo rapid, fundamental transition, innovation, and market-share disruption.

Traditional DBMS Analytic Data Pipeline vs. Hadoop SQL Equivalent

Source: Gigaom Research

Just as critical as this architectural and price disruption is the need to make business users more self-sufficient in their consumption of data. With traditional data warehouses, IT and its database designers were responsible for anticipating substantially all the questions the business would ask. Asking new questions later in the process meant changing the data pipeline and the design of the analytic database, a critical bottleneck in the process.

It is now the era of the data lake, a data management practice that utilizes Hadoop as a collecting point for large volumes of relatively raw data. Here, IT is responsible only for provisioning and cataloging the data placed there. New Hadoop-based technologies allow the business to ask and answer those new questions in an iterative fashion. Those same users also have to perform analytic tasks in order to inform many more types of decision-making in a range of timeframes.

Methodology

For our analysis, we have identified and assessed the relative importance of six Disruption Vectors. These are the key technologies in which players will strive to gain advantage in the sector. Tech buyers can also use the Disruption Vector analysis to aid them in picking products that best suit their own situation.

The “Disruption Vectors” section of this report features a visualization of the relative importance of each of the key Disruption Vectors that Gigaom Research has identified for the Hadoop/data warehouse interoperability marketplace. We have weighted the Disruption Vectors in terms of their relative importance to one another.

Gigaom Research's analysis process also assigns a 1-to-5 score to each company for each vector. The combination of those scores and the relative weighting and importance of the vectors drives the company index across all vectors. That produces the Sector Roadmap chart in the company analysis section.

3 Usage Scenarios

For this report, Gigaom Research distills the definition of three key analytics usage scenarios. Two are well-established; the third is a novel one stemming from our own observations (and nomenclature).

Core Data Warehouse. With highly curated data tuned for performance, the role of the data warehouse, which provides production reporting on business operations, remains largely unchanged. The highly curated nature of data in the warehouse facilitates easy navigation and optimized query response times, but it comes at a cost. Asking questions that weren't anticipated requires going back to IT in order to modify or redesign the ETL pipeline. As a result, “time-to-insight” can be weeks to months.

Data Lake. Facilitated by Hadoop's open infrastructure and the engine agnosticism of its YARN resource manager, the data lake is a low-cost collecting point for large volumes of data from many sources, providing a sandbox for exploratory business intelligence (BI) and ETL processing. The data lake's flexibility enables more agile, iterative exploration of new questions and answers because users can start with raw data and progressively add the structure to analyze it on their own. IT often drives this usage scenario for two reasons: It provides a low-risk proof-of-concept, and it offloads exploratory analysis from core data warehouses tuned for production workloads.

Adjunct Data Warehouse. This architecture provides for production ETL, reporting, and BI on the large volumes and variety of data sources first explored in the data lake, promoting those ad hoc explorations to production processes. It also offloads production ETL from the core data warehouse in order to avoid costly capacity additions on proprietary platforms that carry a 10- to 30-times cost premium.

Adjunct Data Warehouse Pipeline

Source: Gigaom Research

The enterprise customers of the core data warehouse are feeling acute pain from the cost of adding capacity to handle the volumes of data they encounter from a great variety of sources. The CapEx burden of expanding the core data warehouse relative to minimal budget growth means that a cheaper storage platform must be utilized. As such, we see the adjunct data warehouse crossing the adoption threshold and going mainstream with greater speed in 2015. In fact, leading-edge customers are already adopting such a model.

The data lake is the ideal platform to land, store, and explore 10 to 30 times the volume of data held in the core data warehouse. Today, the structure and organization of much incoming data are largely unknown in advance. There is minimal need for service-level agreements (SLAs) around this data, as its exploratory value means it's used for non-production work. On the other hand, manipulation of this data has to be open to any processing engine in order to support a wide range of exploratory analysis.

The adjunct data warehouse features the same data sources as the data lake, but production ETL refines the data so that its structure and organization are known in advance. Production ETL, reporting, and BI jobs impose stronger requirements for SLA and concurrency management. There is the same need to be open to additional data processing engines as with a data lake, because the adjunct data warehouse represents the production deployment platform for what was discovered in the data lake. However, the manageability required for production means total cost of ownership becomes more important than with a data lake, though less critical than with a core data warehouse.

Comparison of Hadoop and Core Data Warehouse Usage Scenarios

Workloads. Data Lake: land large volumes of data from many sources at low cost; sandbox for exploratory BI/ETL. Adjunct Data Warehouse: production ETL, reporting, and BI on sources first explored in the data lake; also adds context to data in the core DW. Core Data Warehouse: production reporting on business operations; ad hoc BI analysis.

Data volume. Data Lake: 10-100x the core DW. Adjunct Data Warehouse: 10-100x the core DW; level of detail, history, and sources that can't be cost-effectively captured in the core DW. Core Data Warehouse: data resolution at the aggregation level; summary of transaction data; limited detail because of cost.

Data types. Data Lake: mostly time-series events: clickstreams, Web logs, sensors, social media, NoSQL Web/mobile-facing operational data. Adjunct Data Warehouse: same as the data lake. Core Data Warehouse: business transactions that record legal interactions with customers, suppliers, employees, products, and accounts.

Data model. Data Lake: irregular event data in JSON, delimited text files, log files, time series, unstructured data. Adjunct Data Warehouse: consistent structure, but not a star schema. Core Data Warehouse: highly refined star schema.

Production SLA support. Data Lake: limited. Adjunct Data Warehouse: strong. Core Data Warehouse: very strong.

Fault tolerance. Data Lake: limited. Adjunct Data Warehouse: required. Core Data Warehouse: critical.

Data structure. Data Lake: schema-on-read for well-known sources; dynamic schema for exploratory sources. Adjunct Data Warehouse: embedded schema and/or schema-on-write; data is well-known when it arrives. Core Data Warehouse: schema-on-write, designed before data capture with the questions known in advance.

Data quality. Data Lake: raw, partially refined through discovery. Adjunct Data Warehouse: formal ETL and reporting provide strong audit trails. Core Data Warehouse: highest quality, through production ETL, provenance and governance, and DBA curation.

Data manipulation workloads. Data Lake: open: SQL analytics, scripted transformation, statistical analysis, machine learning. Adjunct Data Warehouse: same as the data lake. Core Data Warehouse: limited to functions built into the SQL DBMS.

Performance optimization (schema, statistics). Data Lake: limited. Adjunct Data Warehouse: capable of high-performance optimization via schema and detailed statistics. Core Data Warehouse: highly tuned through detailed statistics on data, indexes, and OLAP cubes.

Cost (CapEx). Data Lake: $1.5K-$3.5K/TB (hardware plus open-source software). Adjunct Data Warehouse: $1.5K-$5K/TB (hardware plus MPP DBMS). Core Data Warehouse: $35K/TB with an appliance.

Source: Gigaom Research

4 Disruption Vectors

The use of Hadoop as a companion to the enterprise data warehouse has been growing over the last two years, but new factors are accelerating adoption of this usage scenario and truly disrupting the data analytics space. Below, we discuss what we feel are the six most important vectors of this disruption, which, broadly speaking, enable the processing of more kinds of data, in a more efficient manner than was previously possible, and in a fashion that is making Hadoop much more enterprise-ready than it had been before.

The six vectors we have identified are:

• Schema flexibility

• Data engine interoperability

• Pricing model

• Enterprise manageability

• Workload role optimization

• Query engine maturity

Disruption Vector weighting chart. Key: vector weightings sum to 100 percent. Source: Gigaom Research

Schema Flexibility

The analytic data pipeline is undergoing much change. In the unbundled database layers described in this report, the catalog must be able to keep track of a data schema that is less structured, because it is working with that data earlier in the pipeline. But the data access engine itself also has to be more flexible in order to get at this data through the catalog. Because of all this, we need more flexible approaches to accommodating the data.

The data lake changes how the data pipeline works and requires more flexibility in the database. In this usage scenario, IT is responsible only for collecting as much data as possible, and for provisioning and cataloging it in the data lake.

Business analysts and data scientists take it from there, wading through this uncurated repository in order to answer their own questions by iteratively adding structure to the data. New analytic database engines make this possible. They must be able to make sense of the data through catalogs that support more flexible schemas.

A summary of the different levels of schema formality includes:

• Schema-on-write. As described in the section on usage scenarios, with traditional data warehouses the questions were known in advance. In technical terms, IT had to design the database schema before writing any data.

• Embedded schema. With this approach, data has a formal structure that is described in a “mini catalog” embedded in the data file. Unlike schema-on-write approaches, the target database doesn't have to be designed up front, before loading the data. However, the data itself must be organized and loaded into this file structure. Examples include the Parquet and ORC file formats, which are columnar formats optimized for analytic performance, and Avro, which is better for nested data.

• Schema-on-read. With this approach, semi-structured data is written to a database on an as-is basis; there is no description of its structure. Each item can have more or fewer columns than other items. It is not until “query time,” when that data is actually read, that an ad hoc, context-dependent schema is decided upon. This approach enables analysts to explore raw data. The sketch following this list contrasts the three approaches.
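To make the distinction concrete, here is a minimal sketch in HiveQL-style SQL. The table names, columns, and paths are hypothetical, and the exact DDL varies by engine; the point is simply where and when the schema gets defined.

    -- Schema-on-write: the target structure is designed before any data is loaded,
    -- and incoming data must be transformed to fit it.
    CREATE TABLE clicks_dw (ts TIMESTAMP, user_id BIGINT, url STRING);

    -- Embedded schema: the structure travels inside the file format itself (Parquet here),
    -- so the files are self-describing once written.
    CREATE TABLE clicks_parquet STORED AS PARQUET
      AS SELECT ts, user_id, url FROM clicks_dw;

    -- Schema-on-read: raw files stay exactly as they arrived; a structure is asserted
    -- only at query time, over data left in place.
    CREATE EXTERNAL TABLE clicks_raw (line STRING)
      LOCATION '/data/raw/clicks/';
    SELECT count(*) FROM clicks_raw;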

Data Engine Interoperability

Data engine interoperability is a critical Disruption Vector because the new analytic data pipeline is neither as simple nor as uniform as traditional ones. The new pipeline needs to call on a wide range of modular services that customers can mix and match. This can only be achieved through strong interoperability between a Hadoop integration platform and these various engines.

At the most basic level, a database that offers a SQL API to HDFS would qualify as interoperable. But satisfying this need for greater customization of the pipeline means deconstructing traditionally monolithic database engines into three layers of services:

• The storage engine, which includes HDFS, and the file formats for exchanging data

• The catalog or data dictionary that keeps track of how the data is organized and integrated, such as HCatalog

• The data manipulation engine, of which SQL is just the most likely starting point for enterprise customers.

Underneath everything is YARN, which for now enables multiple workloads to share access to Hadoop's cluster hardware resources.
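One way to picture the unbundling is a table registered once in the shared catalog layer and then queried by more than one engine. The sketch below assumes a hypothetical web_events table and a cluster where Hive and Impala share the same metastore; Impala's INVALIDATE METADATA statement simply refreshes its cached view of that catalog.

    -- Catalog layer: register the data once (Hive metastore/HCatalog), pointing at
    -- files that live in the storage layer (HDFS).
    CREATE EXTERNAL TABLE web_events (ts TIMESTAMP, user_id BIGINT, url STRING)
      STORED AS PARQUET
      LOCATION '/data/web_events/';

    -- Data manipulation layer: a different engine reading the same catalog can now
    -- query the same files in place. In Impala, for example:
    INVALIDATE METADATA web_events;
    SELECT url, count(*) AS hits FROM web_events GROUP BY url;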

Pricing Model

The growth of the Hadoop ecosystem in the enterprise is partly a classic low-end disruption of traditional SQL databases. A Hadoop cluster with hardware and software can cost $1,500 per terabyte of data, while a traditional data warehouse running on a hardware appliance can cost as much as $35K/TB. As a result, Hadoop is opening up markets that traditional data warehouses had trouble serving in the past. These lower absolute price levels represent greater value in our ranking for this report.

The challenge for incumbent vendors such as Oracle is that they can't pursue the new use cases involving data volumes in the petabyte range with the same pricing levels and models they use to serve conventional use cases, which tend to top out in the tens of TBs.

Because of Hadoop's clustered architecture and its typical commodity-hardware deployment scenario, Hadoop distribution vendors tend to price their enterprise licenses and support based on the number of nodes in the cluster. Data warehouse vendors, meanwhile, usually sell their products (including their in-built Hadoop distributions) on appliance hardware, and tend to price their licenses based on storage capacity. This can make it difficult to compare the relative pricing value of Hadoop vendors versus their DW competitors.

A question arises, then: How much storage capacity does the average commodity Hadoop node provide? Furthermore, since appliance-based products include the hardware by definition, we must also know how much the hardware for an average Hadoop node costs. Coming up with these numbers is a somewhat subjective exercise, but we have done so based on a set of conservative assumptions.
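In rough terms (our framing, not a figure supplied by any vendor), a node-priced Hadoop offering can be put on the same footing as capacity-priced appliances by computing an effective price per terabyte: the per-node license or subscription plus the per-node hardware cost, divided by the node's usable capacity, where usable capacity is approximately raw disk capacity divided by the HDFS replication factor (three by default), before any compression.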

Enterprise Manageability

The Hadoop world has thus far favored innovation in the form of raw features over the fit and finish necessary to bring Hadoop's manageability in line with that of enterprise database management systems. So, as is also the case with the Query Engine Maturity Disruption Vector below, a product that excels in the relatively mundane facet of manageability is in fact disruptive relative to the standards of the Hadoop world.

Where this argument becomes concrete is in the realm of total cost of ownership. Hadoop has garnered a reputation for being inexpensive due to its open-source availability and its compatibility with commodity servers and storage. However, the relatively austere facilities for automating Hadoop provisioning, cluster management, and monitoring have made operational costs relatively high. This, in turn, has limited Hadoop's broad deployment and adoption.

Advances in this sphere, whether in terms of tools, APIs, or both, will catalyze mainstream Hadoop deployment by reducing operational costs. That much is already happening, so this vector's disruptive powers should not be underestimated.

Workload Role Optimization

The three most important workload roles considered here are BI or interactive queries, production reporting, and ETL. Versatility across these workloads was desirable in conventional database systems. By design, native Hadoop products aren't trying to have workload coverage as broad as the traditional SQL vendors' products. That's the point of deconstructing a DBMS into three layers: it's easier to bring additional engines to bear on the data (ideally with the data in place, so there is no latency from moving it) than to have a single monolithic engine that tries to do it all.

Too much generality can be detrimental to performance and usability. Taken to extremes, it really means more overhead. For example, the optimizations and technology required to support online transaction processing (OLTP) are huge and orthogonal to BI, production reporting, and ETL.

Query Engine Maturity

The norm in the Hadoop world is for vendors' SQL engines to be new, especially when compared to incumbent SQL relational database products. So a vendor finding a way to integrate a mature SQL engine with Hadoop constitutes a disruption, even if that might be counterintuitive. This fact puts the incumbent vendors in this Sector Roadmap at an advantage for this Disruption Vector, but it also favors companies that have found a way to hybridize a seasoned query engine with Hadoop's file system.

Product maturity is very important to customers putting any enterprise technology into production to support a strategic application. It's also subjective. However, there are some useful proxies that indicate how much hardening a product has accumulated. Any vendor working with customers in production over time will have the best source of feedback in terms of finding issues that affect stability and performance.

But in recent years, the open-source community has grown into an alternate source of maturity. The process of crowdsourcing code reviews and bug fixes has accelerated the traditional product-testing cycle. And the great proliferation of open-source projects has made it immensely easier to build in layers on the work of others.

5 Company Analysis

The companies examined here provide a representative cross-section of the SQL-on-Hadoop market. Hortonworks is a Hadoop distribution vendor with a 100-percent open-source business model. Cloudera and MapR employ “open-core” models toward the Hadoop distribution business. Oracle is a relational database management system (RDBMS) stalwart. Meanwhile, Teradata is a data warehouse pioneer and Pivotal is a hybrid player that combines elements of the DW and Hadoop distribution business models. Detailed descriptions of our six scored vendors follow, with discussion of how each fares across the key Disruption Vectors.

Sector Roadmap company comparison chart. Key: the number indicates a company's relative strength across all vectors; the size of each ball indicates the company's relative strength along an individual vector. Source: Gigaom Research

MapR, Drill

MapR uses an annual per-node subscription with a multi-year commitment that includes support. The Enterprise Edition comes with the company's own, binary-compatible implementations of HDFS and HBase, Apache Drill, and the complete Apache Hadoop stack, including Spark.

Pricing is approximately $7K/node. An annual subscription is required in order to obtain this per-node pricing.

MapR's approach to Hadoop is intriguing because the company chose to create new, more production-ready (in its view) implementations of HDFS and HBase, and will in the future provide a JavaScript Object Notation (JSON) NoSQL document database modeled on MongoDB. The keystone that makes everything fit together is a SQL data access engine, Apache Drill, that is so flexible it can unify access to all these databases, as well as just about any other, under a single API.

MapR is implicitly saying that it needs to control the layered implementation of these services. By controlling them, it should be able to offer a more hardened, production level of stability that would be far more difficult to achieve by mixing and matching the equivalent services from multiple vendors.

The Apache Drill project, of which MapR is the chief backer, is due to reach version 1.0 status in Q1 of 2015, so it is certainly the least mature of the SQL-on-Hadoop integration platforms reviewed here. Of note, however, is that MapR's Hadoop distribution includes Hive and Impala (both discussed later in this report), allowing customers to use those engines in addition to, or in place of, Apache Drill. This gives customers an alternative toolset while Drill matures.

Apache Drill looks like it will be a major advance in the state of the art for schema flexibility. There are a few key open issues, however:

• Other vendors use embedded schema formats because their structures, which can include indexes, greatly accelerate performance. We don't yet know how Drill will solve this problem.

• ODBC drivers can't yet enumerate the variable structure of downstream data sources, which means that when Drill encounters items with a new structure, its ODBC driver must do the equivalent of inserting a page break and sending the items as a new collection. Applications such as Tableau that connect to Drill via an ODBC driver will have to be modified to accommodate this behavior.

MapR has always delivered unique value by virtue of its proprietary implementations of HDFS, HBase, and other services, which deliver resilience and production-readiness that we view as being ahead of the company's competition. The underlying foundation offers high availability, data protection, recoverability, and advanced monitoring. MapR even provided a form of YARN-like service preemption before YARN itself was introduced.

MapR's challenge in the future will be to ensure that its unique branch of Hadoop foundation technology continues to remain ahead of competitive functionality coming from the mainstream open-source community.

In terms of workload role optimization, Drill is the wildcard. Like other vendors' offerings, Drill handles BI and interactive queries with great aplomb, but it is designed to serve these workloads with data complexity that goes well beyond the flat, structured data that other SQL-on-Hadoop systems deal with. Drill can handle complex, nested JSON documents that vary from one record to the next as easily as other vendors' SQL-on-Hadoop solutions handle Parquet files, where the schema is embedded in the file along with the data. It can also query a range of data sources such as MongoDB, HBase, flat files stored in HDFS, and any database with an ODBC driver, and can join related data across these databases.
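A short, hedged sketch of what that looks like in Drill's SQL dialect: the file path, field names, and the customers table are hypothetical, and the dfs and hive prefixes assume storage plugins configured under those default names.

    -- Query raw JSON files in place; no schema is declared anywhere in advance,
    -- and nested fields are addressed with dotted paths.
    SELECT t.user_id, t.event.page AS page
    FROM dfs.`/data/clickstream/2015-01.json` t
    WHERE t.event.category = 'purchase';

    -- Join that same JSON data to a table registered in the Hive catalog.
    SELECT c.name, count(*) AS purchases
    FROM dfs.`/data/clickstream/2015-01.json` t
    JOIN hive.customers c ON c.id = t.user_id
    GROUP BY c.name;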

But Drill's performance will need to extend well beyond simple BI queries that involve relatively small result sets. MapR claims to have architected the product with the same three performance characteristics common to columnar databases. In order to know Drill's true performance profile, however, it will need to work under heavy, real-world-scale demand.

Cloudera, Impala

Cloudera pricing at the same level of specificity as MapR's isn't available, but Gigaom Research's finding is that it is comparable in total cost.

Like Cloudera's CDH Hadoop distribution itself, Cloudera's Impala SQL-on-Hadoop layer dominates the market. Impala implements its own massively parallel processing (MPP) engine over HDFS, and leverages the widely supported Parquet file format along with the HCatalog data dictionary that was at the heart of the original Hive project. In contrast with MapR's proprietary technologies, Cloudera's use of standards has clear value, even when judged in terms of competitive offerings. For example, Oracle uses HCatalog to query Hadoop-based data.

Most MPP data warehouse products are based on the open-source PostgreSQL database engine at the node level. That Cloudera decided to build its own distributed SQL query engine was an audacious decision, if not a risky one. A little over two years on the market now, Impala is still a relatively young database engine. Whether online service providers who are accustomed to working with and contributing to open-source projects or traditionally leading-edge enterprises such as telcos, customers who have put Impala into production tend to be technically very sophisticated.

All customers, especially less technically sophisticated ones, take some risk in adopting the product. The rather broad support of Impala from BI and even other Hadoop vendors, however, does mitigate this risk.

Impala gravitates towards an embedded-schema architecture. Nonetheless, Cloudera documents Impala as schema-on-read capable when reading files in comma- or tab-separated value (CSV or TSV) formats. But Impala's benchmark results are premised on the use of Parquet files, which feature embedded schema. This leaves the impression that embedded-schema scenarios are the engine's strong suit and primary use case.

Cloudera Manager is a proprietary system management tool providing a console user interface and an application programming interface. Cloudera Manager delivers a consistent way to wrap services in order to manage their lifecycle, collecting metrics and monitoring each service in a way that's relevant to each administrative persona.

Impala was designed as an open-source MPP analytic SQL DBMS, making interactive BI its natural workload. Early versions of Impala focused more on very low-latency performance over relatively small queries, at the expense of acceptable performance when querying larger data volumes. A lot of work has gone into later versions to change that. Cloudera is also putting a lot of effort into simplifying the mechanical aspects of ETL, including the movement of data between tables and even between databases.
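As an illustration of that kind of ETL step (the database and table names here are hypothetical), a single Impala statement can aggregate data explored in a staging database and promote the result into a reporting database as a Parquet-backed table:

    -- Promote explored data into a production reporting database in one statement.
    CREATE TABLE reporting.daily_clicks STORED AS PARQUET AS
      SELECT to_date(ts) AS day, url, count(*) AS hits
      FROM staging.raw_clicks
      GROUP BY to_date(ts), url;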

Even more important is work underway to allow Impala to handle complex nested data and moderately semi-structured data, and to query that data without flattening it. This work will make Impala more competitive with Drill. Another boost in that regard would be the ability to manage long-running jobs with fault tolerance and recovery, enabling much more robust production ETL.

Hortonworks, Stinger/Hive

Unlike all the other vendors in this Sector Roadmap, Hortonworks includes no proprietary components in its Hadoop distribution, so its pricing model dispenses with any licensing component. Hortonworks derives all software-related revenue through annual support subscriptions whose prices are in line with those of Cloudera and MapR.

Some vendors have criticized Hortonworks' model as unsustainable. The reality is that traditional enterprise software business models that feature perpetual, up-front licenses are not radically more profitable when they reach maturity. Traditional licenses only cover the very high cost of making the initial sale. At maturity, most of the revenue and all of the profit comes from maintenance and support for existing customers. Although customers use Hortonworks' technology primarily on-premises, its business model is akin to that of a SaaS vendor, where up-front revenue is much lower but long-term revenue and profitability are much closer to traditional, licensed software models.

Because Hortonworks' business model is premised on using components that are open source and widely adopted, it logically saw fit to add interactive query capabilities to Apache Hive rather than build its own distinct SQL-on-Hadoop platform. The ramifications this has for Hortonworks' SQL-on-Hadoop showings in schema flexibility, query engine maturity, and workload role optimization are significant.

Under its Stinger initiative, whose main thrust was to enable a mode in which Hive operates independently of MapReduce, Hortonworks effectively migrated Hive from being an exclusively batch-oriented engine to one that can undertake interactive query workloads. Stinger also added vectorized (i.e., highly parallelized) query processing and corresponding support for columnar storage through a new persistence file format. That format, called ORC, supports embedded schema similarly to the way Impala's Parquet format does. ACID transactions for updates to dimension and fact data and a cost-based optimizer have been added to Hive under the auspices of Stinger as well.
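A hedged sketch of what those Stinger-era capabilities look like to a Hive user; the table and column names are hypothetical, and the configuration property names reflect Hive 0.13/0.14 (defaults vary by distribution and release).

    -- Run Hive on Tez rather than MapReduce, with vectorized query execution.
    SET hive.execution.engine=tez;
    SET hive.vectorized.execution.enabled=true;

    -- An ORC-backed table embeds its schema in the files and enables columnar,
    -- vectorized reads.
    CREATE TABLE sales_orc (sale_id BIGINT, amount DECIMAL(10,2), region STRING)
      STORED AS ORC;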

Hive 0.13 and 0.14, which are, effectively, the Stinger releases, make their best showings with larger queries rather than with the smaller, low-latency queries that are Impala's forte. This makes Hive perhaps more versatile in terms of workloads yet less optimized for the short, iterative queries that are the bread and butter of self-service data discovery. Stinger is an ongoing initiative, and its current, second phase is focused on closing that gap by adding a buffer cache to Hive.

Hive's Stinger features are early in their lifecycle, creating liabilities on the maturity front. However, Hive's universal presence in all Hadoop distributions assures broad adoption, bug reporting, and resolution. Additionally, features like vectorized processing were developed in conjunction with Microsoft's SQL Server team, which previously implemented them in its own commercial, enterprise database product.

Hive's use of the ORC file format gives it an unmistakable embedded-schema orientation. However, HiveQL's CREATE EXTERNAL TABLE command allows schema to be defined on the fly, in the context of ad hoc analyses, while the data's native structure, whatever it might be, is left intact.

Hortonworks supports Apache Ambari, an open-source framework that delivers a consistent way to wrap services in order to manage their lifecycle. Ambari provides REST-based APIs for Hadoop cluster provisioning, management, monitoring, and integration with enterprise data center management tools like Microsoft System Center. Pivotal, whose offering we cover below, uses Ambari as well.

Teradata, QueryGrid

Teradata, a decades-old data warehouse appliance vendor, offers a technology called QueryGrid that can federate data in Hadoop with data in its own core data warehouse store. It also offers the option to include major Hadoop distributions (including those from Hortonworks and Cloudera) on its appliance products. As described earlier, its ballpark pricing for core data warehouse capacity is on the order of $35,000/TB.

Teradata's Hadoop appliance may in fact be the ideal solution for Teradata customers who need to expand capacity without breaking the bank. With a tightly integrated Hadoop appliance, customers can choose to tier their data and store less frequently accessed data on Hadoop without the complex setup of piecing together a multi-vendor solution.

QueryGrid captures the data pipeline ethos of interoperating with multiple data access engines and accessing different technology layers. QueryGrid orchestrates the federation of queries across other databases, including Hadoop and Oracle, with plans to add Aster (a database that is now owned by Teradata), MongoDB, and others.

This approach is appealing in that it can leverage the unique SQL access, data dictionaries, and storage engines of other databases. The challenge is that the Teradata SQL access layer can't map 1:1 with the access layers of the other databases. But managing that potential mismatch is inherent in any federated database approach.

The combination of Teradata's battle-tested query engine with Hadoop's compelling economics makes for a very attractive solution for existing Teradata customers. These buyers can maximize the return on their significant investment in the Teradata platform without having to invest significantly more capital in order to cope with today's larger data volumes.

Teradata, as a traditional data warehouse vendor, depends on schema-on-write for performance and accessibility by analysts. While this may prove a disadvantage for Teradata in comparison to its competitors, the company can compensate in other realms.

For example, Teradata's integrated hardware and software stack eliminates the constant need for customers to troubleshoot problems and then find, apply, and test patches and upgrades. It may well be the most effective way to reduce TCO-related OpEx, but at the expense of higher CapEx for the proprietary hardware.

And with Teradata measuring its industry tenure in decades, it can claim support for the three key workloads (BI, production reporting, and ETL) without breaking a sweat.

Pivotal, HAWQ

Pivotal Software was formed in 2013 as a spinoff from storage vendor EMC, anchored by the products and personnel derived from EMC's acquisition of MPP data warehouse vendor Greenplum and professional services boutique Pivotal Labs. Various other properties from both EMC and affiliated company VMware were transferred to Pivotal as well, resulting in a full stack of database, cloud, and developer products and platforms.

Most germane to this Sector Roadmap is the release, roughly simultaneous with the company's launch, of Pivotal's own Hadoop distribution (Pivotal HD) and its accompanying SQL-on-Hadoop component, HAWQ. The latter is effectively an implementation of the Greenplum MPP product using HDFS in Pivotal HD as its file system.

Thus HAWQ, as an MPP engine over HDFS, is architecturally comparable to Cloudera's Impala. But in contrast to Impala's green-field development, HAWQ is based on a 10-year-old MPP product which, at the node level, is in turn based on the well-established PostgreSQL database engine. This provides HAWQ with a modern architecture coupled with a core engine that is seasoned and tested.

Pivotal's pricing model is also a hybrid of old and new. Pivotal HD is available for a nominal charge on an unlimited number of nodes. However, HAWQ, as well as Pivotal's other data manipulation engines such as its GemFire XD in-memory component, is licensed on a per-node basis at rates comparable to those of the other Hadoop distribution vendors covered so far. Annual support is offered on similarly comparable terms.

Ultimately, Pivotal's license fees are correlated with processing power, not storage capacity. Annual support, however, is premised on node usage, making Pivotal's pricing model a hybrid of Hortonworks' support-oriented treatment and the Cloudera/MapR license-per-node approach. This pricing model is ideal for the new pipeline, which separates storage from processing.

Pivotal's strategy toward interoperability keeps pace with its competitors', but its roadmap is well ahead of its current products. When it comes to data manipulation engines, Pivotal offers a variety of products, starting with HAWQ and extending to GemFire XD, a transactional, SQL-based, in-memory data grid. Pivotal has other engines as well, which customers can mix and match around Hadoop and HAWQ.

But Pivotal needs to make progress in moving its interoperability implementation more toward Hadoop standards. For example, while HAWQ can read and write widely adopted formats such as Parquet, it apparently performs several times faster in benchmarks when using its own Greenplum-derived file format. We expect HAWQ and GemFire XD will eventually align more closely with existing and emerging Hadoop community standards, for disk- as well as memory-based storage and processing.

HAWQ is essentially a schema-on-write-based engine. The company has schema-on-read on its release calendar for the first quarter of 2015.

Pivotal Command Center provides a management dashboard analogous to Cloudera Manager's. Pivotal's manageability layer builds on the same Apache Ambari technology favored and backed by Hortonworks.

Pivotal's HAWQ can claim support for the three key workloads without a problem. However, it still has to master some of the tricks that, for example, Oracle finds easy, including highly variable user processing loads.

Oracle, Big Data SQL

Oracle's Big Data SQL offering is available when purchasing both the company's Exadata product and its Big Data Appliance; the latter runs the Cloudera distribution of Hadoop. While the upfront CapEx may be high (with pricing comparable to Teradata's), Oracle is optimizing for low TCO. The TCO advantage is that patching and upgrades on appliances with a full stack of software are as painless as possible, offering cloud-like convenience with on-premises operation.

As a traditional DBMS vendor, Oracle is somewhat challenged in the interoperability category. Its approach is to present the familiar Oracle SQL API to developers and database administrators and let them interact with Hadoop-based data through that uniform interface. This simplifies life for those trained in Oracle technologies, but it will be harder to keep up with innovations happening at the different layers of Hadoop technologies.

Underneath the covers, Oracle uses HCatalog to understand Hadoop data's structure, relying on a schema-on-write approach and using its own query engine to plan the work for its own side and the Hadoop side. It also fetches the Hadoop-based data using its own code running on each node in the Hadoop cluster.
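A hedged sketch of the pattern this implies on the Oracle side: an external table declared with the ORACLE_HIVE access driver maps onto a table already defined in the Hadoop catalog, after which ordinary Oracle SQL runs against the Hadoop-resident data. The table, column, and directory names below are illustrative, and the exact access parameters vary by release.

    -- Oracle-side external table over a Hive/HCatalog-defined table of the same name.
    CREATE TABLE web_events (
      ts       TIMESTAMP,
      user_id  NUMBER,
      url      VARCHAR2(4000)
    )
    ORGANIZATION EXTERNAL (
      TYPE ORACLE_HIVE
      DEFAULT DIRECTORY DEFAULT_DIR
    )
    REJECT LIMIT UNLIMITED;

    -- Familiar Oracle SQL, including joins to local tables, now reaches the Hadoop data.
    SELECT count(*) FROM web_events WHERE url LIKE '%/checkout%';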

As for other data-manipulation technologies such as statistics packages, graph databases, and machine learning, Oracle prefers to build them into its core DBMS, making it rather difficult to call out to other, third-party engines. The organizing assumption appears to be that the only work that needs to be farmed out is reading tables defined in HCatalog. All of this makes for a low score along the data engine interoperability Disruption Vector.

On the query engine maturity front, Oracle is quick to point out that its query optimizer has benefited from 35-plus years of development. It's not just the gold standard for speed, stability, and scalability (at a high cost); it can also handle issues that trip up other vendors.

Oracle products are exemplars of enterprise manageability, both in terms of tools offered by Oracle itself and via a broad third-party ecosystem.

Oracle stands in a class of its own in terms of workload generality. Its recent addition of in-memory columnar support is so strong that it can scan billions of items per second per processor core, a performance level that would have been unthinkable for a traditional DBMS several years ago. But in the context of workload role optimization, its OLTP support, graph processing, text mining, and other capabilities all count as overhead. External workload-optimized engines cannot work as cooperatively with Oracle as with the other solutions evaluated in this Sector Roadmap.

6 Key Takeaways

MapR comes out on top by a nose. Cloudera and Hortonworks may be top-of-mind for some where Hadoop is concerned, but MapR's layered data access interoperability matches its native Hadoop peers. And even though it earns the lowest rating for maturity, since Drill hasn't quite reached 1.0 status, its demonstrated ability to deliver true schema-on-read access merits the top rank in schema flexibility.

Cloudera and Hortonworks are formidable competitors, however. Impala and the newest versions of Hive are very capable, and both companies continue to enhance and improve their SQL-on-Hadoop platforms.

Pivotal may shine even brighter here for customers open to adding a new vendor. For existing Teradata and Oracle customers, the QueryGrid and Big Data SQL products offer access to Hadoop's economic advantages with relatively low OpEx and high leverage of existing skill sets.

Other Takeaways

• For those looking beyond provisioning a sandbox for exploratory BI and ETL, the adjunct data warehouse may be the ideal usage scenario to address the high cost of core data warehouse capacity expansion. The opportunity to offload ETL workloads from a core data warehouse at $35K/TB to a Hadoop-based solution at $1,750/TB should be attractive.

• Depending on their needs, some customers may choose to weight the Disruption Vectors differently than we have here, emphasizing manageability and TCO, workload role optimization, and maturity.

• In the future, YARN will grow into a workload manager that mediates access to layered software services. That will likely take years; implementing it involves non-trivial technology.

• Apache Spark, like Drill, may reshape assumptions about how to build a state-of-the-art data analytics infrastructure. Spark works with data stored across nodes, passing data back and forth between engines. This contrasts significantly with the sequential pipeline paradigm explored in this roadmap. If the prevailing paradigm shifts, so too may leadership in the Hadoop integration space.

7 About George Gilbert

George Gilbert is an Analyst for Gigaom Research and the co-founder and Partner of TechAlpha, a management consulting and research firm that advises clients in the technology, media, and telecommunications industries.

Gilbert is recognized as a thought leader on the future of cloud computing, data center automation, and SaaS economics, including contributions to The Economist as well as on his blog.

Previously, George was the lead enterprise software analyst for Credit Suisse First Boston, one of the leading investment banks to the technology sector. Prior to being an analyst, George worked at Microsoft as a product manager on Windows Server and spent four years in product management and marketing at Lotus Development.

8 About Gigaom Research

Gigaom Research gives you insider access to expert industry insights on emerging markets. Focused on delivering highly relevant and timely research to the people who need it most, our analysis, reports, and original research come from the most respected voices in the industry. Whether you're beginning to learn about a new market or are an industry insider, Gigaom Research addresses the need for relevant, illuminating insights into the industry's most dynamic markets.

Visit us at: research.gigaom.com.

© Giga Omni Media 2015. "Sector Roadmap: Hadoop/Data Warehouse Interoperability" is a trademark of Giga Omni Media. For permission to reproduce this report, please contact [email protected].
