
Data warehousing with HP

A White Paper by Bloor Research
Author: Philip Howard
Publish date: April 2009

“…the HP Oracle products and HP Neoview have fundamentally different architectures and, as a result, are aimed at very different segments of the overall data warehousing space”

Philip Howard


Executive summary

HP has recently collaborated with Oracle to introduce the HP Oracle Database Machine and the HP Oracle Exadata Storage Server as solutions for the data warehousing market. At the same time, HP offers its own Neoview product, which addresses that same market.

From one perspective this collaboration can be seen as a good illustration of HP’s partnership model: HP’s role in the development and provision of the Exadata Storage Server and the Database Machine is based on its leading position as a provider of the platform technology involved.

However, this may also appear confusing, since it might suggest that HP is directly competing with itself through its partnership with Oracle. We do not believe this to be the case: the HP Oracle products and HP Neoview have fundamentally different architectures and, as a result, are aimed at very different segments of the overall data warehousing space. Neoview, in particular, is aimed at the enterprise data warehousing space, where the focus is on mixed workload environments that combine simple, complex and real-time query processing within a single environment. Oracle, because of its architecture, does not serve this mixed workload environment well. In this paper we explore the differences between the various options now available from HP and explain how the company’s partnership with Oracle strengthens its position within the data warehousing market.

The HP Oracle Database Machine and Exadata essentials

HP and Oracle have recently announced the HP Oracle Database Machine. Although not explicitly described as an appliance, it has been positioned by Oracle as a competitor to the various vendors of data warehouse appliances. The HP Oracle Database Machine is enabled through a second product, the HP Oracle Exadata Storage Server, which can be added to existing instances of Oracle 11g to improve performance in data warehousing environments.

The HP Oracle Exadata Storage Server

The HP Oracle Database Machine is an appliance-like data warehousing package based on the Oracle database. Specifically, it is a pre-configured, but not pre-implemented, package of software, servers and storage. From Oracle’s standpoint, the database machine itself is based on a new approach to storage, instantiated in the HP Oracle Exadata Storage Server, which runs on HP ProLiant servers. This architecture is illustrated in Figure 1 and it is this that represents most of the “secret sauce” within the database machine. Exadata servers will be available separately, thereby allowing existing Oracle data warehouse users (initially restricted to Oracle 11g RAC on Oracle Enterprise Linux) to upgrade to the new environment.

To understand how the Exadata server improves the performance of queries running on an Oracle database, we need to compare what happens when using Exadata with what happens in a traditional Oracle environment. In the latter case you read the data from disk and then process it in the database. The Exadata server, however, introduces an additional step that filters, but does not join or aggregate, the data after reading it from disk, so that only the information the database is interested in is passed to the database for processing. Benefits accrue because the filtering is done close to the disk, which makes the table scanning portion of query processing much faster than doing it in the database, and because the database has less work to do in answering queries, since it no longer has to do the filtering itself. Oracle claims, though it has not published its figures, that in benchmarks the use of Exadata servers improves performance by up to ten times for appropriate queries (most likely those involving large-scale table scans). It is noteworthy that many of the queries typically run in data marts require just such table scanning.
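To make the data flow described above concrete, the short Python sketch below is purely illustrative and is in no way Oracle or Exadata code: the table, predicate and row counts are invented. It contrasts a traditional scan, where every row is shipped from storage to the database tier before filtering, with an offloaded scan, where the storage tier applies the predicate first and ships only the matching rows.

# Illustrative only: contrasts "filter in the database" with "filter at the storage tier".
# The data, predicate and sizes are invented; this is not Oracle or Exadata code.
import random

def storage_rows(n=100_000):
    """Simulate rows coming off disk: (order_id, region, amount)."""
    rng = random.Random(0)            # fixed seed so both scans see identical data
    for i in range(n):
        yield (i, rng.choice(["EMEA", "APAC", "AMER"]), rng.random() * 100)

def traditional_scan(predicate):
    """Every row is shipped to the database tier, which then filters."""
    shipped, matches = 0, []
    for row in storage_rows():
        shipped += 1                  # full row crosses the storage/database boundary
        if predicate(row):
            matches.append(row)
    return shipped, matches

def offloaded_scan(predicate):
    """The storage tier filters first; only matching rows are shipped."""
    shipped, matches = 0, []
    for row in storage_rows():
        if predicate(row):            # filtering happens close to the disk
            shipped += 1
            matches.append(row)
    return shipped, matches

if __name__ == "__main__":
    pred = lambda r: r[1] == "EMEA" and r[2] > 95.0
    t_shipped, t_rows = traditional_scan(pred)
    o_shipped, o_rows = offloaded_scan(pred)
    print(f"traditional: {t_shipped} rows shipped, {len(t_rows)} matched")
    print(f"offloaded:   {o_shipped} rows shipped, {len(o_rows)} matched")

Both scans return the same answer; the difference is how many rows cross the boundary between storage and database, which is where the claimed performance gain comes from.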

The HP Oracle Database Machine

The HP Oracle Database Machine (HODM) is being touted as a ‘complete package of software, servers, and storage’. By that Oracle means that the operating system (today only Oracle Enterprise Linux) and database software are pre-installed prior to delivery to the customer, that the hardware has been pre-configured, and so on. It is appliance-like, rather than a true appliance, because the software is installed by Oracle rather than pre-installed at the HP factory. This is by design, at the request of Oracle, as it wanted flexibility for site licensing.

This is certainly convenient and will reduce implementation times. However, this is not an appliance in the same sense that a domestic refrigerator is an appliance, as there will still be the administrative and tuning overhead that one would normally associate with maintaining an Oracle database, not to mention the complexity involved with Real Application Clusters (RAC). This also, of course, applies when upgrading an existing implementation by means of Exadata (and, again, there are restrictions on what front-end RAC configurations can be linked to Exadata).


Figure 1: HP Oracle Database Machine architecture


Figure 2 shows the key differences between a RAC implementation and one based on the HP Oracle Database Machine.

Another consequence of the fact that this is still an Oracle database is that you will require significantly greater disk capacity than the raw data itself. A typical Oracle data warehouse has to be much larger than the raw data it contains because of the number of indexes, materialised views, temporary scratch areas for sorting and query temp tables, and so forth, that you need to define in order to maintain performance. Certainly, the overall size of the data warehouse will be reduced through the use of compression but as this is a common feature across data warehouse products it is still likely that the Oracle database will be significantly larger than when using other products, including HP Neoview.

Finally, perhaps 20% or 25% (an educated guess) of queries involve whole table scans. Many others, of course, consist only of simple enquiries and reports. However, there will still be some queries involving multi-way table joins where performance is an issue because of the complexity involved rather than because table scans are needed, and in these cases HP Neoview is likely to provide better performance than the HP Oracle Database Machine. In such circumstances faster scans are of no major benefit and, even where scans are part of the requirement, for more complex queries on large amounts of data the performance benefits of the parallelism and partitioning of an MPP platform will outweigh the advantages of the filtering scheme.


Figure 2: Differences between traditional RAC and Exadata architectures

HP Neoview essentials

If we consider large table scans, then the way that Neoview and, indeed, any product with a massively parallel shared-nothing architecture works is essentially the same. That is, it processes the data immediately as it comes off disk, using disks local to the processing nodes. However, whereas with the Database Machine the filtered data is passed to the database, with Neoview the data continues to be processed close to the disk: in this example, the relevant data is extracted and sorted at the node level, and only the results are passed up to be combined with those from the other processing nodes. As Oracle has stated, the reason it has introduced the Exadata server and Database Machine is that it is faster to process the data as close to the disk as possible. However, in the HP Oracle solution this is only done to filter out unwanted data. HP Neoview, on the other hand, was designed from the outset to put as much processing as possible close to the disk, so one would expect better performance from Neoview even for whole table scans, at least when any significant further processing (such as sorting or joining with another table) is required.
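The sketch below illustrates the general shared-nothing pattern just described; it is plain Python, not Neoview code, and the node count, partitioning and data are invented. Each “node” scans, filters and sorts only its own local partition, and the coordinating layer merely merges the already-sorted partial results.

# Conceptual shared-nothing MPP pattern: not Neoview's actual implementation.
import heapq
from multiprocessing import Pool

def node_work(partition):
    """Runs 'close to the disk' on one node: scan, filter and sort the local partition."""
    relevant = [row for row in partition if row["amount"] > 90]   # local filter
    return sorted(relevant, key=lambda r: r["amount"])            # local sort

def coordinator(partitions, workers=4):
    """Fan the work out to the nodes and merge their sorted partial results."""
    with Pool(workers) as pool:
        partial_results = pool.map(node_work, partitions)
    # Only the (small) sorted partial results cross the interconnect; the merge is cheap.
    return list(heapq.merge(*partial_results, key=lambda r: r["amount"]))

if __name__ == "__main__":
    # Invented data, distributed across four 'nodes'.
    rows = [{"id": i, "amount": (i * 37) % 100} for i in range(10_000)]
    partitions = [rows[n::4] for n in range(4)]
    result = coordinator(partitions)
    print(len(result), "qualifying rows, globally sorted")

The point of the pattern is that the heavy work (scan, filter, sort) never leaves the node that owns the data; only small, already-reduced results travel to the coordinating layer.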

Further, because Neoview is massively parallel (see Figure 3) it has intrinsic performance characteristics that mean that it does not require all of the various constructs (indexes, materialised views and so on) that Oracle does and thus is more efficient (and less expensive) in terms of storage requirement; and, of course, this will have a beneficial knock-on effect when it comes to data centre costs. Note that you can use indexes and materialised views in Neoview when it is appropriate but it will not be necessary all the time as would typically be the case with Oracle.


Figure 3: Neoview architecture


Key dimensions to consider

There are, essentially, seven key dimensions by which one can look at the data warehousing market, namely:

1. the types of queries you want to run,

2. storage scalability: the amount of data you have to manage,

3. the number of users and query volumes you need to support,

4. memory and processor scalability,

5. mixed workload support,

6. availability,

7. deployment options.

We will consider each of these in turn. In addition, it is important to understand the different architectural options that are available, as this will also influence deployment choices, so we will also discuss this aspect of the data warehousing market.

Types of query

There are a number of different types of query, analytics and other capabilities that you may want to include within a data warehousing environment. These include:

•Standard reports and queries: typical examples would be daily, weekly, monthly or quarterly reports.

•Ad hoc reports and queries: these can vary widely between simple, parameterised versions of standard reports to the completely unexpected. Most systems will cater to the former but may struggle with the latter because the database has not been optimised for them. For example, there may be no relevant indexes or materialised views.

•Query drilldown: also known as slice and dice, this provides the ability to explore data with characteristics such as how many, what colour and where. Typically, this functionality is supported by some form of on-line analytic processing (OLAP). Note that OLAP supports ad hoc queries only in so far as the relevant dimensions and hierarchies have been defined. Further, OLAP works against aggregated data rather than at the level of individual transactions, though it may be possible to drill down to that level.

•Embedded queries (in business applications or processes): the ability to embed and support query capabilities that are built into business processes whereby the business process automatically queries the data warehouse as and when required. Call centre operations that need to reference the warehouse are a good example.

•Real-time monitoring: the ability to monitor events and transactions in (near) real-time through the use of dashboards or via event-based capabilities and business rules. This is often combined with alerts that send appropriate messages via email or instant messaging to relevant individuals. In some environments this may be extended to the automated instigation of relevant processes.

•Analytics: this comprises a range of capabilities including statistical analysis, data and text mining, web analytics, forecasting, predictive modelling and business optimisation. While some analytic queries are relatively simple, many are complex, involving multi-way joins and aggregations.

There are several points to note here. The first is that when it comes to merchant databases (that is, those general-purpose products that are used for transaction processing as well as data warehousing) as opposed to pure play data warehouses, you need to define indexes, and possibly other constructs, against all the data you are supporting within your query environment, otherwise performance will suffer. As a result, the size of the database will grow significantly. You cannot, however, index everything, both because of this growth in the size of the warehouse and because performance ultimately suffers as the warehouse grows. Conversely, some specialised data warehousing products eschew indexes altogether, some use them only occasionally and others offer them as options. They can afford to do this because their technical architecture has been specifically designed for massively parallel query processing rather than for transaction processing. The widespread adoption of MPP architectures to overcome the weakness of merchant databases in handling complex analytics, regardless of how many indexes and materialised views you build, is a case in point.

A second point is the need to aggregate data in order to support query drilldown. This is the second biggest administrative headache after index maintenance and tuning (which is a major issue). Most specialised data warehousing products do not require pre-aggregation of the data (because the system is fast enough to obviate the necessity).

Thirdly, consider the impact of implementing process-centric BI and, for that matter, operational BI. The use of data warehousing to support these functions can hugely increase the number of users addressing the warehouse, thereby putting much more strain on the system. In addition, these sorts of queries tend to be simple, short queries, whereas more traditional reporting and analytics are more likely to be long and complex. There is therefore a requirement for mixed workload management capabilities designed to ensure that all these varied query types are responded to within reasonable timeframes, which are quite often governed by strict SLAs.

Fourthly, analytics, especially complex analytics, is not handled well by merchant databases. This is precisely because such queries address data that has not been indexed and/or combine data in complicated ways across many tables that simply cannot be readily pre-determined. Certainly you can build materialised views that will speed up the processing of analytics that are repeated on a regular basis, but it is in the nature of analytics that this is frequently not the case.



Essentially, the choice between Neoview and HODM/Exadata depends on the expected mix of query complexity that one needs to support. In a nutshell, Neoview is suited to complex queries, where the performance boost given by filtering is likely to be overshadowed by the performance of a massively parallel platform. In addition, Neoview has mixed query workload capabilities (see below) that can provide substantial additional benefits.

Storage scalability

Data warehouses vary in size from hundreds of gigabytes to hundreds of terabytes and everything in between. It is therefore necessary to appreciate where the different vendors fit into this landscape. Before we examine that, it is important to understand what we mean by storage scalability. You will hear vendor x claim to have a “120 terabyte” implementation, for example. What this means is that it has a customer with 120 terabytes of disk capacity. Part of this will be unused to allow for future growth. Other parts will be devoted to indexes, materialised views and other constructs that are needed to help the database perform adequately. It is not untypical that these constructs will take up two, three or even four times as much space as the data itself. The only real measure of a data warehouse’s size is its raw data capacity.

In addition, most database products now include support for compression, which further affects the relationship between the size of raw data and data warehouse size.

It is also worth bearing in mind that we have heard vendors claim to support particular database sizes without mentioning that this is actually distributed across multiple warehouses or marts. The issue for scalability is raw disk on a single database instance.
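Pulling the points in this subsection together, the short calculation below is a rough worked example of backing out raw data capacity from a quoted disk figure. Every number in it is an invented assumption for illustration, not a vendor benchmark.

# Back-of-envelope sizing: every figure here is an assumption, not a measured value.
quoted_disk_tb   = 120    # what the vendor quotes: total disk capacity
free_space_ratio = 0.30   # assume 30% kept free for growth and scratch space
overhead_factor  = 3.0    # assume indexes/materialised views/temp take 3x the raw data
compression      = 2.0    # assume 2:1 compression on everything stored

usable_tb = quoted_disk_tb * (1 - free_space_ratio)
# usable space holds (raw + overhead) / compression = raw * (1 + overhead_factor) / compression
raw_data_tb = usable_tb * compression / (1 + overhead_factor)

print(f"Quoted capacity : {quoted_disk_tb} TB")
print(f"Usable capacity : {usable_tb:.0f} TB")
print(f"Raw user data   : {raw_data_tb:.0f} TB")   # roughly 42 TB under these assumptions

Under these (illustrative) assumptions a “120 terabyte implementation” holds only around 42 terabytes of raw user data, which is why raw data capacity is the only meaningful measure.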

User and query scalability

This is a measure of how many users can be supported. However, it is not simply a matter of counting numbers but also of considering how many queries each user runs, how often and of what complexity. While it is easy to count users, queries and frequency, the problem is measuring complexity. Supporting 5,000 users running simple look-up and similar queries is completely different from 5,000 users doing data mining, but unfortunately we do not know of any way of measuring the difference accurately, so vendor claims about the number of users supported or the number of queries run are not very useful. A better gauge would be the number of queries run per day within each of the query categories identified previously.

Understanding competing claims for user support is further complicated by the fact that you need mixed workload management capability in the database in order to ensure that you meet relevant service level agreements for, in particular, operational and process-centric BI. We say “in particular” with reference to these types of queries because they tend to be the sorts of queries that support call centre and similar operations requiring the most stringent levels of service. On the other hand, many of these queries simply look up customer reference data (lifetime profitability, say) and do not require the full power of the parallelism of a solution such as HP Neoview, so the workload management software is not just about ensuring performance but also about not assigning unnecessary resources to queries that do not need them.

The bottom line then is that user scalability is not a very useful metric unless combined with query complexity information. You could then categorise vendors by whether they only support a simple, median or rich mix of users and query types. HP Neoview is clearly targeted at supporting a rich mix whereas Oracle environments would tend to be at the median level at best.

Memory and processor scalability

More data tends to mean larger queries and more users will mean more queries. In order to process more and larger queries your solution will also need to be able to scale in terms of memory capacity and processor power. In the Neoview architecture, the processing power of the system should scale linearly with data size. In general terms this should not need further discussion though it is worth commenting that pipeline parallelism (which HP Neoview has) will be useful here, because it enables more efficient use of memory. The ability to utilise pipeline parallelism with large memory and eschew physical I/O enables Neoview to execute large, complex queries significantly faster than other data warehouse platforms that rely on intermediate answer sets on physical storage.
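The generator pipeline below is a loose illustration, in plain Python rather than anything resembling Neoview's executor, of why pipeline parallelism matters: each operator consumes rows as its predecessor produces them, so no intermediate answer set is ever materialised in memory or written to disk. The operators and data are invented.

# Illustrative operator pipeline: rows stream through scan -> filter -> project -> aggregate
# without any intermediate result set being materialised.

def scan(n):
    """Source operator: emit rows one at a time."""
    for i in range(n):
        yield {"id": i, "region": "EMEA" if i % 3 == 0 else "AMER", "amount": i % 50}

def filter_rows(rows, region):
    for row in rows:
        if row["region"] == region:
            yield row

def project(rows):
    for row in rows:
        yield row["amount"]

def aggregate(values):
    total = count = 0
    for v in values:
        total += v
        count += 1
    return total / count if count else 0.0

if __name__ == "__main__":
    # Each stage pulls rows from the one before it; memory use stays flat
    # no matter how many rows scan() produces.
    avg = aggregate(project(filter_rows(scan(1_000_000), "EMEA")))
    print(f"average EMEA amount: {avg:.2f}")

A non-pipelined plan would instead write each stage's full output somewhere before the next stage could start, which is exactly the reliance on intermediate answer sets on physical storage that the paragraph above refers to.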

Mixed workload support

There is a growing requirement to support embedded and real-time BI capabilities from the data warehouse, in addition to its more familiar functions. However, as we have noted, these tend to generate very large numbers of short queries that need to be supported in a timely fashion alongside more complex queries and analytics. Managing this gamut of requirements necessitates what is known as mixed workload support, in order to ensure that SLAs are met on the one hand and that resource utilisation is maximised on the other.

In addition, this issue is exacerbated by the fact that these operational requirements often require the ability to load data in real-time. So there are two issues: loading the data and managing the queries.

•The ability to load data in a timely manner is obviously fundamental and, as the size of data warehouses and data marts increases, this becomes more challenging. In order to support large volumes you really need full parallel loading functionality. Fortunately, most vendors, including HP, have recognised this fact and have implemented relevant facilities. However, there are two additional factors relevant to real-time query and monitoring: you need to be able to ingest data in real-time, and you have to be able to load and query the same data. In other words, rapid, real-time data ingestion should be available on the same tables that are being used to support real-time queries, which is what Neoview provides.

•Neoview has been designed from the ground up to handle a dynamic mix of embedded, real-time and analytical workloads. Conversely, the HP Oracle Database Machine on Exadata is targeted at existing Oracle 11g customers who want to extend the capabilities of their Oracle implementation, but it will not provide additional mixed workload support.
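To illustrate the kind of behaviour mixed workload management implies, the sketch below admits queries into separate service classes and caps the resources that long-running analytics may consume, so that short operational queries are never starved. It is a conceptual sketch only: the service classes, costs and limits are invented and it does not describe Neoview's or Oracle's actual schedulers.

# Conceptual mixed-workload admission control; classes, costs and limits are invented.
from collections import deque

SERVICE_CLASSES = {
    # class name: concurrency cap and per-query resource 'slot' cost
    "operational": {"max_concurrent": 50, "slot_cost": 1},
    "reporting":   {"max_concurrent": 10, "slot_cost": 4},
    "analytics":   {"max_concurrent": 2,  "slot_cost": 20},
}
TOTAL_SLOTS = 100

class WorkloadManager:
    def __init__(self):
        self.running = {name: 0 for name in SERVICE_CLASSES}
        self.used_slots = 0
        self.queues = {name: deque() for name in SERVICE_CLASSES}

    def submit(self, query_id, service_class):
        rules = SERVICE_CLASSES[service_class]
        fits = (self.running[service_class] < rules["max_concurrent"]
                and self.used_slots + rules["slot_cost"] <= TOTAL_SLOTS)
        if fits:
            self.running[service_class] += 1
            self.used_slots += rules["slot_cost"]
            return f"{query_id}: admitted ({service_class})"
        self.queues[service_class].append(query_id)    # wait rather than starve others
        return f"{query_id}: queued ({service_class})"

if __name__ == "__main__":
    wm = WorkloadManager()
    print(wm.submit("Q1", "analytics"))
    print(wm.submit("Q2", "analytics"))
    print(wm.submit("Q3", "analytics"))     # queued: analytics concurrency cap reached
    print(wm.submit("Q4", "operational"))   # still admitted: short queries keep flowing

The essential idea is the one made in this section: a heavy analytic query is held back rather than being allowed to consume resources that short, SLA-bound operational queries need.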

Availability

Historically, data warehousing was not regarded as mission critical and, as a result, traditional warehousing did not place as much emphasis on the availability of the platform. However, with the increase in user base, and the operational workloads going against the warehouse, warehouses are increasingly critical to the business with the result that availability is becoming more important for many customers. Neoview has a fault-tolerant hardware architecture as well as patented process-pair technology for software fault tolerance that relies on a takeover strategy for ensuring continuous availability. Oracle’s availability is based on a more traditional failover approach. When a single failure occurs, the takeover strategy seamlessly continues, whereas the impact of the more traditional approach will be seen by clients while the failover is taking place.
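A much-simplified sketch of the takeover versus failover distinction follows. It is generic Python, not HP's process-pair technology, and the recovery mechanics and timings are invented: with a process pair the backup already holds checkpointed state and simply takes over, whereas a failover approach must start and recover a replacement before clients can continue.

# Simplified contrast between 'takeover' (hot process pair) and 'failover' (cold restart).
# Not HP's process-pair implementation; recovery times are invented for illustration.
import time

class ProcessPair:
    """Primary checkpoints every state change to an already-running backup."""
    def __init__(self):
        self.primary_state = {}
        self.backup_state = {}                   # kept in lock-step via checkpoints

    def write(self, key, value):
        self.primary_state[key] = value
        self.backup_state[key] = value           # checkpoint to the backup

    def primary_fails(self):
        start = time.perf_counter()
        self.primary_state = self.backup_state   # backup simply takes over
        return time.perf_counter() - start

class FailoverInstance:
    """State must be recovered from a log before a replacement can serve clients."""
    def __init__(self):
        self.state = {}
        self.log = []

    def write(self, key, value):
        self.state[key] = value
        self.log.append((key, value))

    def primary_fails(self):
        start = time.perf_counter()
        time.sleep(0.5)                          # stand-in for restart/mount time
        recovered = {}
        for key, value in self.log:              # replay the log
            recovered[key] = value
        self.state = recovered
        return time.perf_counter() - start

if __name__ == "__main__":
    for system in (ProcessPair(), FailoverInstance()):
        for i in range(10_000):
            system.write(f"k{i}", i)
        outage = system.primary_fails()
        print(f"{type(system).__name__}: clients blocked for ~{outage * 1000:.1f} ms")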

Deployment options

The traditional approach to data warehousing has historically been that the data warehouse per se was used as the system of record and it supported all the query types detailed above with the exception of process-centric BI and operational BI. For this reason, there were relatively few users of the data warehouse, mostly business analysts, statisticians and data miners. Nevertheless, in some instances this central data warehouse could not provide adequate performance for complex analytics and unexpected ad hoc queries so departments concerned with getting this information in a more timely fashion resorted to the use of specialised subject-specific data marts to speed up this processing. Historically, most (though not all) data marts have been implemented using the same database type as the main data warehouse, even where that was from a merchant database vendor whose product was not well suited to this role.

More recently, appliance vendors have taken an increasing share of the data mart segment of the market and we should also say that we have seen a few companies experimenting with the use of federated data marts with no centralised data warehouse. However, there is a big difference between the controlled support for a few specialised data marts and a proliferation of data marts that can easily get out of IT’s control.

However, the role of the data warehouse is expanding. While some companies still prefer a traditional approach, an increasing number of organisations want to include support for process-centric and operational BI within their data warehouses. This requires significantly greater capability because of the number of users, the diversity of query types and the requirement for mixed workload management, as already discussed. While in principle this sort of environment may be augmented with data marts, in practice that is usually not the case, or is done only to meet specific requirements.

Lastly, we must bear in mind that deployment choices are not simply a function of what you want to do but also of what you already have in place. If you have no data warehouse, or what you have is hopelessly inadequate in terms of performance and/or scalability, then choosing between modern and traditional approaches to data warehousing, and whether or not to deploy data marts, may make sense. However, where you simply have an existing system that is creaking at the seams, then upgrading (in the case of Oracle and Exadata) rather than replacement may be the preferred option.

To contrast the HP Oracle Database Machine with HP Neoview: the former addresses the traditional data warehouse and data mart markets. Neoview is capable of addressing these requirements too, but it has extensive mixed workload management capabilities and is primarily targeted at the more modern data warehouse environments outlined above, where there is typically a need to support all of the types of queries identified previously, all running simultaneously.


Summary comparison and positioning

To summarise the various products:

•The HP Oracle Exadata Server (when it is available) is targeted at existing users of Oracle 11g who, for performance reasons, might otherwise consider a solution provided by one of the appliance or specialist data warehouse vendors.

•The HP Oracle Database Machine is aimed at new and replacement data warehouses and data marts for conventional data warehouse style processing, where it makes sense to leverage the ‘filtering’ nature of the Exadata Storage subsystem.

•As with the Exadata Server this is probably limited to around 30 terabytes of raw user data per database instance, though this will depend on workload.

•HP Neoview is targeted at data warehouse environments where there are potentially thousands or tens of thousands of users and processes targeting the data warehouse with a variety of query types that range from the very short and simple to the very long and complex. It overlaps with the HP Oracle offering in the low tens of terabytes and with relatively simple characteristics (primarily analytics), but otherwise should scale much further than the Oracle-based solution.


•More generally, HP Neoview has been designed to specifically target environments where there are:

» Multiple, inter-dependent dimensions of scalability such as number of users, database size and query workload.

» High complexity requirements whether in the queries themselves and/or in the database schema.

» Mission critical availability requirements with respect to the platform, database, loading times, and database maintenance.

» Requirements to support high concurrency.

» Mixed workload environments where there are large numbers of short operational and process-centric queries combined with analytics, conventional and ad hoc reporting and queries. The product has a number of features designed specifically to assist in this mixed workload environment, notably Adaptive Segmentation, which parallelises queries while assigning the optimal number of nodes to each query depending on its complexity, its size and the available resources; and Skew Buster, which recognises skewed data and deals with it differently and in an optimal manner, so that joins and aggregations are skew-insensitive when it comes to performance (a generic illustration of skew-aware partitioning follows below).
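Skew Buster's actual mechanism is not described here; the sketch below simply illustrates one generic way a parallel system can reduce skew sensitivity, by detecting heavy-hitter join keys and spreading their rows across all nodes rather than hashing them to a single node. All thresholds and data are invented.

# Generic illustration of skew-aware partitioning for a parallel join.
# This is NOT a description of Neoview's Skew Buster; thresholds and data are invented.
from collections import Counter

NODES = 4
HEAVY_HITTER_FRACTION = 0.2   # a key owning >20% of rows counts as skewed

def plan_partitions(rows, key):
    """Decide, per join key, whether to hash-route rows or spread them across all nodes."""
    counts = Counter(r[key] for r in rows)
    total = len(rows)
    heavy = {k for k, c in counts.items() if c / total > HEAVY_HITTER_FRACTION}

    partitions = [[] for _ in range(NODES)]
    spread = 0
    for row in rows:
        if row[key] in heavy:
            partitions[spread % NODES].append(row)          # round-robin the skewed key
            spread += 1
        else:
            partitions[hash(row[key]) % NODES].append(row)  # normal hash partitioning
    # Note: the other join input's rows for the heavy keys would then need to be
    # replicated to every node so that the join still finds all matches.
    return partitions, heavy

if __name__ == "__main__":
    # 70% of rows share one customer id -- badly skewed input.
    rows = [{"customer": "BIGCO", "amount": i} for i in range(7_000)]
    rows += [{"customer": f"c{i}", "amount": i} for i in range(3_000)]
    partitions, heavy = plan_partitions(rows, "customer")
    print("skewed keys:", heavy)
    print("rows per node:", [len(p) for p in partitions])

With plain hash partitioning the node that owns the "BIGCO" key would receive the bulk of the work; spreading the skewed key keeps the per-node row counts roughly equal, which is the performance property the bullet above describes.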

Conclusion

In general terms, the introduction of the HP Oracle Database Machine and the HP Oracle Exadata Server is a classic case of HP’s approach to working with partners, which is by no means exclusive to Oracle. The company has a long-standing relationship with Microsoft, for example, with Microsoft recently announcing reference architectures for implementing SQL Server-based data warehouses on HP hardware. In this sense the company’s focus is on providing world-class technology to support specific solutions in conjunction with its partners, of which this particular Oracle partnership is just another instance.

To re-iterate:

•The HP Oracle Database Machine is a conventional merchant database with a data pre-filter that improves performance for certain types of queries;

•The HP Oracle Exadata Server is an upgrade for (some) existing Oracle database users to reach the same point;

•HP Neoview is a massively parallel database system designed to scale to hundreds of terabytes while providing optimal performance for all types of queries, regardless of their simplicity or complexity and whether or not they need real-time responses, all at the same time.


Put succinctly, HP Neoview scales further than Oracle and has the mixed workload capabilities that Oracle lacks. As a result the target markets for the HP Oracle offerings and HP Neoview are significantly different. While there may be some overlap in particular cases, we can generalise by saying that Oracle will be most suitable for users deploying conventional data warehousing and data mart solutions, whereas HP Neoview is targeted at the new breed of data warehousing environments where extreme levels of both storage and user scalability are required alongside a flexible, mixed workload environment. What this means in practice is that HP Oracle’s competitors are Microsoft and the appliance vendors, while HP Neoview’s principal rivals are IBM and Teradata. In other words, these are completely different constituencies. There will, of course, be occasions when this is not the case, but we think these will be rare: we see the introduction of the HP Oracle offerings as decidedly complementary to the positioning of HP Neoview.

Further Information

Further information about this subject is available from http://www.BloorResearch.com/update/1021


Bloor Research overview

Bloor Research has spent the last decade developing what is recognised as Europe’s leading independent IT research organisation. With its core research activities underpinning a range of services, from research and consulting to events and publishing, Bloor Research is committed to turning knowledge into client value across all of its products and engagements. Our objectives are:

• Save clients’ time by providing comparison and analysis that is clear and succinct.

• Update clients’ expertise, enabling them to have a clear understanding of IT issues and facts and validate existing technology strategies.

• Bring an independent perspective, minimising the inherent risks of product selection and decision-making.

• Communicate our visionary perspective of the future of IT.

Founded in 1989, Bloor Research is one of the world’s leading IT research, analysis and consultancy organisations—distributing research and analysis to IT user and vendor organisations throughout the world via online subscriptions, tailored research services and consultancy projects.

About the author

Philip Howard, Research Director - Data Management

Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR.

After a quarter of a century of not being his own boss Philip set up his own company in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director focussed on Data Management. Data Management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management, and log and event management. Philip also tracks spreadsheet management and complex event processing.

In addition to the numerous reports Philip has written on behalf of Bloor Research, he also contributes regularly to www.IT-Director.com and www.IT-Analysis.com and was previously the editor of both “Application Development News” and “Operating System News” on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and authored a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.

Away from work, Philip’s primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master), dining out and walking Benji the dog.


Copyright & disclaimer

This document is copyright © 2009 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research.

Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research’s intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner’s copyright.

Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions.


2nd Floor, 145–157 St John Street

LONDON, EC1V 4PY, United Kingdom

Tel: +44 (0)207 043 9750 Fax: +44 (0)207 043 9748

Web: www.BloorResearch.com email: [email protected]