an elastic platform for business analytics · adopted in sap sybase iq. in developing this report,...

W I N T E R C O R P O R A T I O N

WH

ITE

PA

PE

R

S P O N S O R E D R E S E A R C H P R O G R A M

Th

e L

a rg e

Sc a l e

Da t a M

a n a g e m e n t E x p e r t s

SAP SybASe IQ 15.4

An Elastic Platform for Business Analytics

©2012 Winter Corporation, Cambridge, MA. All rights reserved.

245 FIRsT sTREET, suITE 1800CAmbRIdgE mA 02145

617-695-1800

v i s i t us a t www.wintercorp .com

W I N T E R C O R P O R A T I O N

SAP SybASe IQ 15.4

An Elastic Platform for Business Analytics

APRIL 2012

SAP SybASe IQ 15.4: An Elastic Platform for Business Analytics

A W I N T E R C O R P O R A T I O N W H I T E P A P E R

Copyright © 2012, WINTER CORPORATION, Cambridge, MA. All rights reserved.

3

Executive Summary xECutivES AROuND tHE wORlD are intensely focusing on business analytics. They see an analytical approach to business decisions—an approach based on more abundant data and mathematical analysis of that data—as the cornerstone of new strategies for profitable operation, profitable growth, new product development and customer engagement. The opportunity to benefit from business analytics is especially large right now in part because businesses have access to “big data”—enormous, previously unavailable volumes of data on the actions, interests and sentiment of customers; on the movement of products, components and raw materials through the supply chain and the distribution chain; and, on many other aspects of the operation of businesses and their market environment. Perhaps surprisingly, the challenge of “big data” is not only the data volume. It is also that much of the new data is less structured and less regular than the tabular corporate data that has been the focus of data warehousing in the past. The new big data comes from new or greatly expanded sources: social media, rapidly proliferating smart mobile devices, from vehicles and a dizzying array of new sensors and intelligent products.Even beyond the challenges of big data, there are other obstacles to success with business analytics: data analysis can be a cumbersome, slow, frustrating and expensive process. First you have to find the data you want. Then you have to get it loaded into a repository where it is accessible. Then you need to cleanse it, organize it and integrate it with other data of interest. Then you have to conduct the analysis…every step bedeviled by many practical difficulties, not least of which is often the difficulty of getting help from people with the right skills.New open source technology has emerged and is being deployed for “big data”; new vocabulary includes terms such as “Hadoop clusters” and “mapReduce.” This technology brings new benefits for certain types of information and analysis. However, it also creates one more data silo in a world in which there are already too many silos. The complete analytical process thus gets enhanced in some areas but also becomes more fragmented: to get analytic results and business solutions, stakeholders must contend with a yet more complex environment with net new skill requirements.The new, highly analytical business strategies place a particular emphasis on prediction. Knowing what happened yesterday isn’t enough—you need to predict which of the business actions in front of you is likely to produce the best result. And, as well as judgment, you need facts, data and analysis to back that decision. And, you must take into account the new data sources—the customer sentiment expressed on social media; the customer behavior evident from new data sources and devices; the subtle patterns that can be seen in purchase behavior, web browsing and many other sources; and, the supply chain realities now visible as parts, components, goods and materials move around the world and are affected by weather, catastrophes and human events.Often, to the decision maker, the unfortunate reality is that predictions of which profitable customers are at risk may indeed be extremely valuable, but getting such predictions before it’s too late is easier said than done.

For many enterprises, then, the key to the analytic opportunity is finding a way to make the entire analytic process work smoothly, conveniently, responsively and cost effectively—whether the analysis focuses on the tabular data most frequently used for the past 25 years; on newer data sources, such as sentiment expressed in social media; or, both.

In response to this challenge, sAP has introduced a new version of its flagship analytic dbms product—sAP sybase IQ 15.4—as a platform and an integrated environment to support and facilitate the customer’s entire analytic process.

E




In addition to a greatly enhanced dbms engine for data warehousing, sybase IQ 15.4 features significant new capabilities for business analytics and big data. Highlights are: • A new analytic services layer that supports the use of mapReduce and many other analytic

functions on data within sybase IQ itself;• Parallel interaction between sybase IQ and Hadoop;• support of R, the open source language for statistical analysis;• support of new third party sQL-callable functions for data mining and predictive analytics;• An expanded eco-system for the support of third-party applications for information

lifecycle management, business intelligence and data integration, predictive analytics and system/data administration.

At the core of sybase IQ 15.4 is the most mature column store dbms for data warehousing on the market, with sophisticated capabilities for data compression, query processing and query optimization—an engine with a long record of exceptional query performance and efficiency. While column storage and column-oriented data compression have been “hot trends” for the last few years, sybase IQ was built from day one with these capabilities: its users have been benefitting from them for more than a decade. And, they contribute significantly to the efficiency of sybase IQ for analytics.In addition to the remarkably efficient storage and query processing technology at its core, sybase IQ 15.4 features PlexQ™ technology, a distinctive, elastic design that supports highly parallel query processing and data loading along with independent scaling for data growth and workload growth. WinterCorp, an independent expert in analytic data management and big data, has been invited by sybase Inc, an sAP company, to review its new product, sAP sybase IQ 15.4. To conduct its review, WinterCorp, reviewed product designs and documentation; and, engaged in technical discussions of the product architecture with key employees at sAP/sybase and with independent parties. This White Paper, sponsored by sybase Inc, an sAP company, presents WinterCorp’s views and findings from that review.

4




Table of ContentsExecutive summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Table of Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

2 Architecture of sAP sybase IQ 15.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

2.1 A Platform For business Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9

3 The sAP sybase IQ15.4 Core data management Infrastructure . . . . . . . . . . . . . . . . . . . . . . . . . .11

3.1 data Load Performance and scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .11

3.2 Column-store storage Efficiency, Indexing, and Compression . . . . . . . . . . . . . . . . . . . . . . . 13

3.3 Query Processing Performance and scalability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16

3.4 Very Large database (VLdb) management and backup . . . . . . . . . . . . . . . . . . . . . . . . . . . . .16

3.5 In-database Analytics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.6 Text search and Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .18

4 The Application services Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19

4.1 “mPP Enabled” user defined Functions (udF) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .19

4.2 Protected JAVA udF’s . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.3 In-database mapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

4.4 simulation for In-database development . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.5 Hadoop Interface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

4.6 geospatial/geometric data & Query support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

4.7 Free Express Edition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24

5 The Ecosystem Layer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.1 sAP businessObjects Portfolio support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.2 “R” Language support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

5.3 mapReduce-Enabled data mining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.4 social Network Analysis modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.5 sybase Powerdesigner 16 Architecture Recommender. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26

5.6 In-database PmmL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

6 Conclusions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

5




IntroductionThis paper examines the architecture and capabilities of sAP sybase IQ 15.4 with a particular focus on demanding new requirements for business analytics and big data.

Business Analytics. People who have been involved with data warehousing for the last decade or more—especially those with a technical background in the field—are often puzzled by the new wave of executive interest in “business analytics.” A common question is, “Aren’t we doing that already?” surely, the reason all that data has been modeled, cleansed, integrated and stored in data warehouses for the last ten or twenty years is so that it can be analyzed!

Certainly there has been analysis going with data warehouse data. but, from the perspective of the business manager or business end user, data warehousing and business intelligence in practice has too often meant little more than ‘routine-ized’ reporting; extraction to other applications and systems; and, the occasional ad hoc query. sure, business intelligence tools have steadily improved; data may be delivered on nicer looking, more functional electronic reports and dashboards; data access may be more interactive; and, data may even be available on mobile devices. All of these advances add some value.

but most end users will still tell you the same thing: most of what they have been doing with the data warehouse has been “looking in the rear view mirror.” Often, business users learn what has happened from the data warehouse. They learn which products have been selling; which customers have been buying; which suppliers have consistently delivered on time… these insights are treasured when good information was not previously available as a basis for decision making.

The problem is that the practice of business management has moved on from that point. Looking in the rear view mirror is no longer enough.

Increasingly, operating and strategic decisions must be based on forward looking analysis with a mathematically sound foundation. The analytical approaches to business exemplified in Competing on Analytics1 and a series of subsequent books—and in the best selling popular book and recent hit movie, Moneyball 2, have influenced business culture. These accounts and many others have shown how business performance can undergo radical improvement when the decision making process looks forward with analytics. At the heart of this revolutionary analysis is better prediction: whether of the performance of a baseball player, of a product, of a service—or the behavior of a customer.

This WinterCorp Executive Report describes two trends: business analytics and “big data”

—and the approach to them adopted in SAP Sybase IQ.

In developing this report, WinterCorp drew on its own independent research and experience; interviewed SAP Sybase IQ employees; and, reviewed SAP Sybase IQ product materials.

In its capacity as the sponsor of this report, Sybase Inc, an SAP company, was provided an opportunity to comment on the paper with respect to facts.

WinterCorp has final editorial control over the content of this publication and is solely responsible for any opinions expressed.

Methodology & Sponsorhip

1

6

1Competing on Analytics, The New Science of Winning, Thomas davenport and Jeanne Harris, Harvard business school Press, 2007 (www.tomdavenportbooks.com) 2Moneyball, The Art of Winning an Unfair Game, michael Lewis, W.W. Norton & Company, 2003




7

And, while you may feel that your data warehouse already has the capabilities to support these analytics, there is more to the story.

Big Data. As predictive analytics have been gaining ever more significance in business circles, another trend—big data—has made a profound impact on business and data strategies.

“big data” is a broad phenomenon encompassing the rise of social media; the seemingly sudden proliferation of machine generated data; the worldwide spread of mobile intelligent devices, including smart phones and tablets; the widespread use of gPs data, which attaches a location to many events in daily life; and, rapid decreases in cost associated with capturing, delivering and storing a wide range of previously costly varieties of data, including voice, image, video, etc.

Taking all of these phenomena together, we are witnessing an enormous explosion of data which is many times larger and faster growing than what we have seen in data warehouses over the last decade. While the transactional information about customers, products, stores and the like is still uniquely valuable—and plays a central role in understanding any business—there is now new and unprecedented information available that can provide business, engineering, scientific and medical insights never before available.

To provide one example, a useful technique in customer retention is to observe when a profitable customer’s activity with a credit card begins to decline and then react quickly to retain the customer before the account is cancelled. When this technique works it is much more efficient than acquiring a new customer that is equally or more profitable.

but what if you could know earlier—before the usage declined—that the customer was at risk? Perhaps the retention rate would become yet higher and the retention cost lower, particularly if you could discover the reason that the customer relationship was threatened. If you knew the reason, then your actions to deal with it could be yet more efficiently directed at the root cause.

but how could you know earlier? One possibility is social media. If you are engaged with your customers on social media, they may tell you what they are thinking: that they like the service or the incentives or the prices offered by a competitor; that they don’t like your call center or your fees. Or, if they have opted into your social media program, they may let you see what they are saying to others about your product or service.

The enormous flood of data pouring out of social media is one of many examples of big data. data is also pouring out of a growing tide of products that we use every day, and to the extent that we opt in, manufacturers can gain precious knowledge about how, when, and where we use products—and what problems we have with them. This is clearly the case today with smartphones and tablets. Vehicles are becoming more intelligent and more connected and will increasingly provide similar capabilities (more expensive commercial vehicles, such as helicopters, already provide telemetry data that is used to optimize safety and maintenance). The trend will spread to many other products that we use every day, in every case generating yet more “big data” for analysis.

New tools and technologies. The concurrent rise of predictive analytics and big data has generated interest in new tools and technologies for several reasons.

First, much of the big data does not fit closely with the relational database model. much of the significance of the data is not revealed by fitting it into a tabular structure. social media data has textual, image, audio, video and other components that must be analyzed primarily by specialized or procedural functions—sQL solves a relatively small part of the problem here. Embedded in the data is a social graph which is most readily analyzed outside of sQL.




In general, a significant element of the new, more predictive analysis—especially of the newly varied and highly voluminous “big data”—is best attacked with tools other than sQL. In connection with this, interest has grown in mapReduce, a parallel data analysis framework, and Hadoop, an open source engine for running mapReduce jobs.

some data analysis jobs can be readily performed in a Hadoop cluster. Others may require the services of a data warehouse, such as sAP sybase IQ. Yet others may best be handled with a combination of the two.

Regardless of where the data is stored, interest has also grown rapidly in other analysis tools, such as the open source statistical analysis language, R. In general, the new business analytics will use sQL and the data warehouse, but will also create a strong demand for other tools.

Data Strategies. As enterprises grapple with this rapidly changing world of big data, they need a data infrastructure that will enable them to implement analytic business strategies. Especially with regulatory and governance requirements enforcing longer periods of data retention, enterprises need a convenient, flexible, cost effective process for solving analytic data problems from beginning to end.

sybase seeks to address that customer need—for a comprehensive approach to business analytics—through its new capabilities in sAP sybase IQ 15.4.

8




Architecture of SAP Sybase IQ 15.4software relational database engines have been commercially available since the 1970s. To this day, most of these products were originally conceived as row storage engines for online transaction processing. A notable exception is sAP sybase IQ. Conceived from its earliest days as a column-storage, analytical dbms, sybase IQ was in many ways ahead of its time. It was the first commercial column storage engine; the first to put a major emphasis on data compression; and, one of the earliest to place a strong emphasis on complex queries and analytics, rather than on online transaction processing.

sybase IQ has come into substantially widespread use, with thousands of customer installations, and thus has developed into a reliable, highly usable, comprehensive product for data warehousing and business intelligence.

but, with sybase IQ 15.4, that distinctive engine architecture has been expanded into something more: a platform for large scale business analytics. This section will discuss the new capabilities of sybase 15.4 and describe how they support and enable analytics for the data warehouse and for the newer phenomenon of big data.

2.1 A PLATFORM FOR BUSINESS ANALYTICS

With the introduction of sybase IQ 15.4, sAP has expanded its IQ product line from data warehouse engine to business analytics platform, as shown in Figure 1.

The core data management infrastructure, represented by the innermost layer in Figure 1, is a high performance column storage analytic database engine. In recent releases, the core data management infrastructure has been enhanced with sAP’s patented PlexQ™ technology, which sAP characterizes as massively parallel shared everything architecture. The combination of the relatively new PlexQ™ technology and sybase IQ’s previously developed grid structure results in an elastic architecture—on which capacity is readily added or removed. The underlying database engine, a distinctive design with sophisticated column storage, compression and indexing techniques, has long established advantages in query performance. In sybase IQ 15.4, the core data management infrastructure is further enhanced with new capabilities for large object compression and high performance bulk inserts via the industry standard OdbC and JdbC interfaces. The core infrastructure has several other noteworthy features, highlights of which are discussed in section 3.

Figure 1: Sybase IQ 15.4 as a Platform for Business Analytics

9

2




The Application Services Layer, shown in Figure 1, is a greatly expanded set of services designed specifically to for the development and support of analytic applications. It also provides facilities for users and partners to develop and use their own analytic functions that sybase IQ will run in parallel against the database. This layer provides major new services, including an implementation of native mapReduce that runs in parallel against the database and also provides connectivity with Hadoop. The Application services Layer is described further in section 4.The Ecosystem Layer, represented by the outermost layer in Figure 1, is an environment in which sAP and its partners can provide and support analytic applications and tools, as well as the business intelligence tools that have long been available with sybase IQ. some key elements of this layer that are new in sybase IQ 15.4 include support for:• Expansion to support all major business Intelligence and data Integration tools including

optimizations for sAP businessObjects products;• the R language, an open source language for statistical analysis;• a library of mapReduce enabled data mining functions that will run in parallel against data in

sybase IQ; • a set of social network analysis modules; and,• packaged applications for analytics and data lifecycle management.The Ecosystem layer is yet another significant enhancement to the analytic capabilities of sybase IQ and a third major element of sAP’s initiative to make sybase IQ a major platform for business analytics. The Ecosystem Layer is described in section 5.While sybase IQ has long enjoyed a respected presence in data warehousing, increasing its customer base over the last few years from about 2,000 to over 4,500 installations, sybase IQ 15.4 is clearly something new and different from what sybase has offered before. As well as significant continuing enhancement to its core dbms engine for data warehousing, sAP is now offering an array of capabilities for business analytics with sybase IQ.

10




The SAP Sybase IQ 15.4 Core Data Management Infrastructure

the core infrastructure of SAP Sybase iQ has been enhanced significantly over the last three releases with the implementation of elastic PlexQ™ grids for highly parallel processing.

The elastic PlexQ™ grid preserves the advantages of the earlier sybase IQ architecture—a sophisticated form of shared data clustering—while adding scale out processing for queries, loads and other large data warehouse operations. In prior releases, sybase IQ could run queries and loads in parallel across a single node. In sybase IQ 15.4, with PlexQ™, the system can run an individual query or load in parallel across multiple nodes. This ability to scale out for individual queries and loads enables sybase IQ 15.4 to handle significantly larger scale data warehouses and analyses. In addition, as in prior releases, sybase IQ 15.4 can spread the work of multiple users across the nodes of the grid. Also, grid nodes can be grouped and assigned to specific workloads or user populations, making it possible to dedicate a chosen set of nodes to a particular purpose. New nodes can be added to the cluster as the workload grows, providing an elastic character to the system. Figure 2 below provides an overview of the core data management infrastructure:

Figure 2: SAP Sybase IQ Core Infrastructure with PlexQ™ Technology

11

3

source: Adapted from a diagram by sAP Inc.

sybase IQ runs on Red Hat and susE Linux 64/32 bit systems, Windows 64/32 bit, AIX 64 bit, sun solaris 64 bit, and HP-uX 64 bit systems, providing for customers to independently optimize storage, caching, processors, memory, threading, and load distribution.

3.1 DATA LOAD PERFORMANCE AND SCALABILITYsybase IQ data load performance and scalability depend primarily on seven factors:

1. PlexQ™ technology, making it possible to spread the work of a load job across multiple nodes of an elastic PlexQ™ grid.




2. in a new feature for Sybase iQ 15.4, highly efficient bulk inserts via ODBC and JDBC are supported. this means that many third party tools and applications that load via industry standard interfaces will load large data volumes much more rapidly. in some practical examples, for example when third party Etl tools are used, speeds up of 100 times have been measured.3

3. Fast, flexible load processing built into the engine at the most fundamental level.4. Versioning to minimize contention between data-load and query processing.5. Automated, flexible remote loads.6. “Near-real-time” “Trickle-feed” loads.7. sybase’s ETL (extract, transform, load) utility.

Fast load Processing. sybase IQ provides specific features for speeding column-store data loading. In the batch case, a column-store approach allows loads to be in “flat schema” (or “semi-normalized”) format—that is, users can avoid the added space and complexity of storing the data as multiple tables. sybase IQ’s architecture allows parallelism in loading, including parallel feeds from distributed clients (the “grid”) to multiple servers and parallelism by using multiple processors for parallel storage of individual tables and columns in the target data-warehouse database. sybase IQ loads only those columns that have changed in a given row (or, of course, in the entire data store)—this typically allows sybase IQ to create loads a fraction of the size of the comparable row-store relational approach.versioning. As the changed columns are loaded, they do not replace old columns. Rather, new versions are created and old ones maintained while needed by ongoing queries. Within a new column version, only changed pages create new storage. Thus, sybase IQ querying is not interrupted during data loading, data loading is not blocked by ongoing querying, and additional storage for versioning is minimized. Automated, Flexible Remote loads. sybase IQ allows scale-out loading across its grid architecture. data can be pulled from the clients, or “pushed” by the client to the server via OdbC. The utility also enables data loading from sAP sybase AsE, microsoft sQL server and Oracle data stores.Near-Real-time loads. sybase IQ supports “micro-bursts” of “microbatched” incremental data loads (i.e., not the constant stream of updates of an OLTP database, but column changes accumulated over a minute or two, loaded at once). For example, Replication server—Real Time Loading Edition 15.5 allows delivery of changed data to the sybase IQ data store within minutes of a data change elsewhere. This ensures “near-real-time” up-to-dateness of data. Combined with versioning, it allows up-to-dateness without interruption of ongoing queries.Sybase infoPrimer Etl tool. This coordinates data loading, including data cleansing as necessary. It takes advantage of the features described in 1-4, and operates multi-threaded, for a high degree of concurrency and/or parallelism. InfoPrimer ETL combines loading and indexing—a chunk of data and its indexes are treated as a single object item—for additional ETL speedup. A sAP utility automates data loading from sAP sybase AsE, microsoft sQL server and Oracle data stores. sybase IQ also supports sAP businessObjects data services ETL tool and other third-party ETL tools such as those of Informatica, syncsort, and data stage. Note also that sybase IQ supports “Extract Load Transform” schemes, in which database functionality or stored procedures are used to speed some forms of data transformation, as well as “change data capture” via Replication server.

12

3Note that bulk inserts were efficiently implemented in prior releases for the native application interface.




3.2 COLUMN-STORE STORAGE EFFICIENCY, INDEXING, AND COMPRESSION

A key differentiator for sybase IQ is its ability to store data in the minimum amount of space on disk or in main memory, which has a dramatic positive effect on performance and scalability. Relational data stores in row format, by and large, already minimize duplication of records (rows). However, relational row stores duplicate columns within a row even when there is no data in the column, and store the same value in a column multiple times. sybase IQ’s columnar-data-store approach does not store non-existent column data, and stores each distinct value only once (Figure 4). For example, where a relational row store may store the “married” value (or any other value) in the customer-marital-status field in every row, the columnar approach stores a pointer to one central instance of each value in the field.

Figure 3: SAP Sybase IQ’s Columnar Data Storage

13

source: sAP Inc.

many queries in bI, complex or otherwise, analyze data using only a few fields in a record, or only a few columns in a row. For queries involving analysis of many rows, this means that a row-based query engine will retrieve much more data than necessary, slowing performance, while a column-based query engine like sybase IQ will retrieve only those efficiently-stored columns applicable to the analysis. Add sybase IQ’s ability to partition data according to columns and thus avoid some indexing performance overhead (discussed in VLdb management, below), and the more that sybase IQ scales, the greater the frequency and size of its performance advantage.

Note that other queries may favor a row-based approach—for example, those that access a small number of rows and a large number of columns. The design philosophy of sybase IQ argues that such queries typically comprise a modest fraction of the workload in an analytic database. Therefore the gains from a column-store approach will dominate the performance tradeoffs. While sybase IQ was alone in advancing this argument ten and more years ago, many several products have since incorporated some column storage features or capabilities in response. However, few products have been designed with a column storage approach from the ground up—and sybase IQ remains the most mature of these.

To improve storage efficiency, sybase IQ’s column-store architecture adds data compression, leveraging its storage of a single data type per column per data page. Aside from the standard methods of compressing individual word strings, sybase IQ offers bit-mapped indexing (in which low-cardinality column data values are represented as bit strings, and query operations can be




carried out as bit operations, for two-orders-of-magnitude performance speedup) where appropriate. In fact, sybase IQ provides compression not only of the data, but also of its indexes (Figure 4).

In sybase IQ 15.4, data compression is enhanced further for large data objects, providing a critical new capability for unstructured data. The enhanced data compression applies to variable length and fixed length character and binary large objects (VARCHAR, VARbINARY, CHAR and bINARY). In early use of these features, data has compressed from 3 times to 16 times more than with prior releases of sybase IQ. This enhanced compression means fewer disk I/O operations to read and write the same data, thus enhancing performance. Large objects are especially prevalent in the new “big data” arena, where unstructured and semi-structured data accounts for most of the increased volume.

Figure 4: SAP Sybase IQ Data Compression

14

source: sAP Inc.

many relational databases “retrofit” compression into their database engine by decompressing the data before processing it. sybase IQ designed in query processing without decompression, so that all operations use the compressed data, and the only time data is decompressed is when processing is finished and the data is being sent to an end user to read in a report. Also, sybase IQ performs “perfect prefetch of pages,” because it knows from its bitmaps exactly which pages have to be fetched in sequence. The result is an increase in the amount of data that can be stored in main memory, allowing in-memory-database-like performance plus scalability beyond an in-memory database.

sybase IQ’s indexing schemes complement its columnar storage and compression approaches. In particular, sybase IQ offers a wide range of indexing schemes that allow columns with different characteristics to be stored in less space (Figure 5).




Figure 5: Forms of Indexing Supported by SAP Sybase IQ

15

Index Name Type of Data Useful For Type of Query Operation Useful For Data Type

Fast Project (Default) all columns with < 16m unique values

Projections with scalar aggregates All

High Group high cardinality columns Large joins, gROuP bYs All except, bITs/CHARs > 255

High Non Group high cardinality columns Range searches mainly for integers and CHARs < 255

low Fast columns with < 1000 unique values requiring fast lookup

Projections, joins, scalar aggregates

All except, bITs/CHARs > 255

Date, time, Datetime Columns with dATE, TImE, dATETImE data types

Queries with dates, times, timestamps ranges/compares

dATE, dATETImE, TImE only

Comparetwo columns with

identical data types (for comparison operations)

<, >, = compares mainly for integers and CHARs

word data types involving strings and words dictionary Lookup CHARs, VARCHARs only

text data types involving strings and words

Complex text terms/phrase searches including boolean,

proximity, and scoringCHARs, VARCHARs only

This broad range of indexing techniques is partly baked in (that is, data loading will automatically index data in a compressed form for storage efficiency), but also allows the customer further flexibility to create additional indexes to deliver performance for the customer’s unique querying patterns. because indexes are highly compressed, users can create a multitude of them up front in anticipation of future ad hoc queries. An “index advisor” built into the query optimizer assists the user by suggesting indexes that will improve query performance. sybase IQ’s column store architecture aggressively encourages usage of indexes—in many cases multiple indexes per column—on which predicates are applied to obtain speed up. Figure 6 shows how sybase IQ’s data-storage approach can minimize I/Os.Note also that sAP sybase IQ can fetch data in large page sizes (typically 64K), which can reduce disk I/O significantly.

Figure 6: SAP Sybase IQ Query I/O ReductionEXAMPLE:

select sum(sales) from customerswhere state = ’NY’ and class = ‘A’

Sybase IQ will use the LF indexes to filter rows and then apply to HNG to compute the sum.Minimal amount of data is read to resolve the query.

Note also that Sybase IQ can fetch data in large page sizes (typically 64K), which can reduce disk I/O significantly.

source: sAP Inc.




16

3.3 QUERY PROCESSING PERFORMANCE AND SCALABILITY

The sybase IQ query-processing engine is built to take advantage of all of sybase IQ’s storage, shared Everything PlexQ™ architecture, and versioning capabilities. The cost-based optimizer can load-balance a query across processors and systems, while constantly updating its “sense” of the relative load on each processor/system. The optimizer also factors in the size of the compressed/indexed data and its presence or absence in main memory, ensuring quicker data access and processing. sybase IQ can dynamically adjust its query execution plan based on concurrent workload, after having started the execution of the query. sybase IQ 15.4 rebalances query resources—threads, processors, and cache—every several seconds, to maintain query performance for both long-running/larger and short-time-period/smaller queries. Note that the intelligence of the cost-based optimizer allows users to flexibly deploy heterogeneous small-scale servers if needed, each with its own sLA (service level agreement).

Once the query is optimized, the engine carries out pipelining of operations within queries as well as parallelism within and across queries. That is, a query that may involve an initial load and sort followed by a join might begin the join operation for one column value immediately, without waiting for all data to have been sorted. When one processor is finished sorting a column value, it might move to sort the next, passing the value to the “join” processor. multiple pipelines may operate in parallel for different sets of data within a query. In the case of joins, in particular, sybase IQ provides two levels of parallelism, in which parts of data to be joined may be “grouped” initially for separate, parallel processing, and then the groups may be joined together in a second step.

In the case of column data that uses bit-mapped storage and indexing, the engine takes an additional step. It combines (performs bit operations) early, in order to reduce the number of times that the engine needs to actually “touch” a data item. In this case, sybase IQ never needs to do a table scan.

3.4 VERY LARGE DATABASE (VLDB) MANAGEMENT AND BACKUP

The larger sybase IQ implementations typically manage hundreds of terabytes of data; a few sybase IQ systems manage petabytes of data, according to sAP.

moreover, sybase IQ allows administrators to bind tables, indexes, and columns to particular storage structures—thus placing less fresh groups of data on more price-performant storage (offline disk or nearline tape) without significant diminution of performance. Logical “groups” can be moved (e.g., from disk to tape) with simple commands, as when “aging” data becomes ready for archiving. sybase Powerdesigner (also part of the sybase Workspace IdE) enables creation of programs that generate reports based on the data’s logical “age.” To complement logical “data age” partitioning, sybase IQ supports physical range partitioning of columns/tables based on the values in a “date created” or “date last modified” field. Older data can be marked “read-only,” avoiding the need for further backup (see Figure 7).

Adding and removing data can have significantly less impact on performance (and hence the need for retuning) than in row-based systems. specifically, if a field needs to be added to or removed from the data, it does not require reallocation of each row in storage or immediate redefinition of all affected row indexes, and does not “lock up” rows during the addition/removal process. moreover, efficient data and column representation means quicker field addition or removal.




Figure 7: Data Partitioning Allows the Placement of Older Data on Lower Cost Storage

source: sAP Inc.

In general, sybase IQ emphasizes ease of tool use by administrators. They can perform most needed operations via the sybase Central guI (graphical user interface), and sAP anticipates releasing a Web version of administrative tools with the same functionality within the next 12 months. Parameters that administrators may tweak include modeling the data, and ETL. At the same time, sybase IQ automates job load balancing within a node, as well as ETL-based data-load balancing. sybase IQ supports active-passive disaster recovery, with manual failover of a single failed node. sybase IQ’s Virtual backup integrates with the storage subsystem to create and periodically resynchronize shadow data-device copies online, with delayed logged writing of updates to the shadow. Effectively, this means that during normal processing, backup overhead is minimal, and “virtual restore” involves only roll-forward of changes not yet applied to the shadow—often a matter of seconds.Note that sybase IQ reduces the amount of pre-aggregation/materialization and index creation work required of the typical data-warehouse administrator. sybase IQ’s columnar approach effectively aggregates data according to columns and values in a column; index compression is carried out during data loading; indexes can be created automatically “on the fly” by the query engine, and can be based on usage patterns rather than pre-defined by the administrator.security schemes involve both data communications (e.g., RsA, FIPs 140-2, Kerberos) and data storage. data storage encryption is applied to the entire database and to particular columns (using sybase IQ’s AEs 128-bit encryption or an optional FIPs 140-2 certified version of the encryption.

3.5 IN-DATABASE ANALYTICSusing stored procedures or user-defined functions compiled and optimized within the database engine’s process is a time-honored way to improve performance of key query types. sAP extends the notion to encompass not only built-in math functions and sQL OLAP operators but also sAs/sPss-type complex operations such as clustering, simulations, and classifications. And sybase IQ specifically opens this capability (e.g., via C++ plug-ins) to third parties such as Fuzzy Logix and Visual Numerics.sybase IQ 15.4 introduced a major expansion of the user defined Function (udF) and other analytic capabilities. This is described in section 4 on Analytic services.

17




3.6 TEXT SEARCH AND ANALYSIS

sybase IQ allows full (semi-structured) text data search in combination with traditional relational (structured) data analysis. For example, users can find all instances of a word or phrase in a set of text fields stored in sybase IQ’s data store, without having to scan table rows or having to know which column the word or phrase is stored in. specialized text indexes that store positional information for terms in the indexed column(s) speed up complex text search and analysis. moreover, sybase IQ’s -in-database capabilities (outlined earlier) include plug-ins for third-party C++ Text Analytics/mining libraries.

18




The Application Services LayerThe Application Services Layer, represented by the middle layer in the sybase IQ 15.4 analytic platform architecture, is a greatly expanded set of services designed specifically for the development and operation of analytic applications. This layer provides several new services, including an implementation of mapReduce that runs in parallel against the database. The Application services Layer also provides facilities for users and partners to develop and use their own analytic functions that sybase IQ 15.4 will perform in parallel against the database.

Figure 8: The SAP Sybase IQ 15.4 Application Services Layer

4

source: sAP Inc.

Additional key elements of the Application services Layer include protected “out of process’ Java udFs, spatial/geometric data and query support and simulation for in-database application development and testing.

4.1 “MPP ENABLED” USER DEFINED FUNCTIONS (UDF) several of the advanced capabilities of Application services Layer are possible because of the new forms of user defined functions (udF) supported in sybase IQ 15.4. sAP characterizes these new udFs as “mPP Enabled,” meaning that sybase IQ will run them in highly parallel fashion, including spreading the work of a single function call across multiple nodes of the PlexQ™ grid.These are functions written in C or C++ (and for some types, may be written in JAVA); and, are callable from sQL. because such functions are enabled for execution in parallel across multiple nodes, they are key enablers for business analytics and big data.udFs are a convenient mechanism for the advanced users or database professionals in an enterprise to codify certain calculations or analytical techniques specific to a business—and then make them available for use throughout the enterprise.Though the industry term for this capability is “user defined function”— and while sybase IQ customers will certainly write them—a substantial library of such functions is provided by sAP and its partners. udFs also provide a mechanism whereby a software vendor or data service provider can develop proprietary techniques; and, make them available for use by customers; but, without necessarily disclosing the algorithm or its implementation.

19




Four classes of udFs are supported:• scalar functions operate on individual data items, returning a single value;• Aggregate functions operate on sets of values, returning a single value; several aggregation

operations are built into the sQL language—for example sum, COuNT and AVERAgE—but aggregate udFs provide an opportunity for users to create their own aggregation functions, which may incorporate techniques specific to an industry, company or analytical discipline;

• Table functions produce bulk data (that is, a table) as output and may be written in C/C++ and/or JAVA;

• Table parameterized functions both accept bulk data as input and return bulk data as output and may be written in C/C++.

Taken together, considered in light of their enablement for highly parallel execution, these udFs provide a potent new analytical capability for sAP customers and partners.

4.2 PROTECTED JAVA UDF’S

Prior to sybase IQ 15.4, customers have been able to write udFs in C and C++. such functions had to be tested and certified before they could be run as part of a production system. They ran in the sybase IQ kernel.

udFs can now also be written in JAVA. In addition, they are run in a “protected” mode. This means that they are executed in a separate process that runs on a database server (that is, it runs on a node of the PlexQ™ grid). This prevents an error in the udF from interfering in the operation of either the core infrastructure of sybase IQ 15.4 or in the operation of any other udF or user process. The result is therefore more reliable and consistently available data and analytical services.

4.3 IN-DATABASE MAPREDUCE

“mapReduce is a software framework for distributed processing of large data sets on compute clusters,” as described on the website of the Apache Foundation (http://hadoop.apache.org/mapreduce/). In the mapReduce framework, data analysis tasks can be broken into functional pieces—called mappers and reducers—each of which performs a portion of the analysis, reading an incoming set of (key, value) pairs and writing an outgoing set of (key, value) pairs. When mappers and reducers are run in the correct sequence, the complete analysis task is accomplished.

The mapReduce framework is especially interesting when a large volume of data is to be processed because it is designed—and mapReduce functions are written—so that many copies of each mapper and each reducer can be run at the same time in a parallel architecture. Thus, if there is a terabyte of data to be analyzed and one runs 100 copies of a mapper, then each mapper needs to analyze only 10 gb of data (assuming that there is a readily available way to partition the data into 100 roughly equal parts). This concept of scalable, highly parallel analysis is similar to the concept of parallel query processing used in a data warehouse, though there are important differences between the two.

Prior to the development of mapReduce, procedural programs to analyze data—written outside of the context of a parallel database system—had to deal with all the complexity of parallel programming. so, data could be analyzed serially—a very slow process with large data volumes—or the programmer could get involved in the very complex and error prone process of specifying manually how the data was to be:• partitioned; • fed to many separate copies of the analysis process; and,

20




• analyzed;

and, then, how the many separate results were to be• recombined; and, • delivered.

In a complex analysis there are many successive stages of parallel analysis, with the data passing between them in complex patterns, and the difficulty of the programming task escalates rapidly.

The mapReduce framework relieves the programmer of explicitly dealing with the parallel aspect of the analysis, freeing him or her to concentrate on the data and the analytical logic.

The mapReduce framework has been popularized in connection with Hadoop, an open software system that implements mapReduce on compute clusters (typically, clusters of low cost servers and low cost storage). Hundreds, if not thousands, of companies are now using or experimenting with Hadoop clusters in part so that they can have an environment for storing large amounts of data—the so-called “big data”—and analyzing it with mapReduce and other tools.

While a Hadoop cluster provides a repository for the storage and analysis of big data, it has different advantages and limitations than a data warehouse. WinterCorp believes that most enterprises will have an analytical environment in which at least one data warehouse and at least one Hadoop cluster will be present. section 5.x provides more information about Hadoop clusters and describes the facilities in sybase IQ 15.4 for interfacing to them and interacting with data stored in them.

meanwhile mapReduce as a programming framework has come to be widely viewed as a standard method for interfacing the procedural program—written in Java, Python or some other popular language for data analysis—to a large volume of data in storage. This is because programs and functions written using mapReduce can be executed in highly parallel architectures that speed up the large scale analysis.

in Sybase iQ 15.4, a facility is provided for running C++ applications that use the MapReduce framework and run within the Sybase iQ PlexQ™ elastic grid. They can run against data stored in the Sybase IQ database or against externally stored data. the data can be structured or unstructured, as Sybase iQ 15.4 is capable of storing either. And the mappers and reducers are stackable.

Note carefully that, in the sybase IQ 15.4 context, such programs need not have anything to do with Hadoop. The data that they are analyzing can be data previously stored in sybase IQ and that could be analyzed with sQL queries or any other tool that works with sybase IQ. but, because of the popularity of mapReduce this facility is likely to be valuable, because:•many libraries of analytical functions will be implemented for other environments using

mapReduce; such libraries can then be ported to sybase IQ 15.4;•many programmers, data scientists and other data specialists will gain familiarity with

mapReduce and may prefer to program using it; and,• sybase IQ customers may want to build their own libraries of functions that can be used both

on data in sybase IQ and on data in other environments such as Hadoop; these customers will therefore be able to use mapReduce for this purpose.

As described in section 5.2, at least one sybase IQ partner has already leveraged this facility to provide data mining functions to sybase IQ customers using mapReduce.

21




in addition, using MapReduce on Sybase iQ data will typically be simpler than accomplishing the same task on data in a Hadoop cluster. this is because the data to be analyzed can be selected and partitioned using SQl; the results returned by the analysis can be stored back in Sybase iQ using SQl; and, the definition of the data to be analyzed can be maintained in SQl. Each of these simplifies some aspect of the analytical process. Also, data that is stored in Sybase iQ is managed by Sybase iQ. it benefits from all of the services provided in the Sybase iQ environment for other data. For example, it can be incorporated in a routine backup schedule; it can be made recoverable; it can be secured via access controls; and, so on.

4.4 SIMULATION FOR IN-DATABASE DEVELOPMENT

Analyzing data within a udF—rather than transferring the data to an external system for analysis—has important advantages for a user of sybase IQ 15.4. First, it takes time and system resources to transfer the data elsewhere. second, the moment the transfer begins, the data starts growing stale. If the analysis is delayed for some reason it becomes even more stale. And, the larger the volume of data to be analyzed, the higher is the overhead of first moving it elsewhere. second, sybase IQ 15.4 is capable of running udFs in parallel across multiple nodes. If the data is transferred to another system for analysis—and if that system is not able to analyze data with an equal or greater degree of parallelism—then there will be yet more delay. Third, if the results of the analysis are substantial and are to be retained for later use, it will be efficient to write them in parallel back within sybase IQ, rather than having to transfer them from another system.

These reasons—and others—provide incentives to analyze data in place within udFs in sybase IQ. but, there are some issues to address. As udFs are being developed, they may contain errors. A udF under test could have unintended—and undesirable—effects on the production environment if run there.

In some environments, the production data is extremely sensitive and it is not practical to have a copy of it in a separate test environment.

To address such issues, sybase IQ 15.4 provides facilities for testing of udFs and applications that are intended to perform in-database analysis. These facilitate the process of creating realistic simulated data in a large scale test database. As a result, the development of in-database analytics is far more streamlined and udFs can be more completely tested before they are used with production data.

4.5 HADOOP INTERFACE

many enterprises will create—or have already created—an analytic environment in which there are multiple data repositories, some on data warehouse platforms and others in Hadoop clusters. In this situation, which WinterCorp believes will become nearly universal within the next several years, it will be common to have analytic processes which leverage data from multiple sources.

In response to this emerging requirement, sybase IQ 15.4 has four mechanisms in its Application services framework for connection between sybase IQ and Hadoop. These are:

a. Client Side Federation. The Quest TOAd data query tool (certified with sybase IQ and Hadoop)can retrieve data from each source and bring it together at the client; this can be a good solution when the volumes of data returned are not very large;

b. Analysis in the Sybase iQ Environment that includes Data Extracted from Hadoop. ETL Hadoop data into sybase IQ via Apache sQOOP, an open source tool for bulk data transfers between Hadoop and relational databases; sQOOP stands for “sQL-to-Hadoop”; this is a

22




particularly attractive solution when the data extracted from Hadoop is to be joined or aggregated with data that resides in sybase IQ; performing the work in the sybase IQ environment brings all the benefits of a mature relational database environment, including security, compliance, backup/recovery, query optimization, and, defined and controlled data semantics; this is a bulk data transfer which can be highly parallel on both the sybase IQ side and the Hadoop side.

c. incorporation of Hadoop Data into Sybase iQ Queries. When the data access is to be more frequent, and when the data volumes to be transferred are not very large; data can be retrieved from Hadoop using sybase IQ table functions; these retrievals, while not instant because Hadoop is fundamentally a batch environment, can nonetheless be incorporated into sQL queries as they are executing;

d. Coordinate Hadoop Job(s) with Sybase iQ Query. In this case, a Hadoop mapReduce job runs separately from a query but is designed to feed data to it; the query and the Hadoop job interface by means of parameterized table functions in the sybase IQ query; though similar to case (c) above, the emphasis here on coordinating and integrating analysis that is occurring in two jobs in two separate environments.

With these four capabilities, sybase IQ customers can deal effectively with a range of situations in which a Hadoop repository is to be used in conjunction with an sybase IQ data warehouse to meet an analytic requirement.

4.6 GEOSPATIAL/GEOMETRIC DATA & QUERY SUPPORT

Trends have increased the prevalence and the significance of location data and geometric data in the analytical environment.

First, gPs enabled devices such as smartphones and tablets are in widespread use and proliferating rapidly. There are hundreds of millions of such devices in use today and projections are that there will soon be billions. such devices frequently communicate their location via the internet and such location data ends up in many commercially valuable databases.

second, many other types of gPs-enabled electronic devices are being created and they also communicate their location with increasing frequency. Examples include vehicles, medical devices, surveillance devices used for traffic analysis, weather devices and others too numerous to mention here.

data on the location of devices, once too expensive or impractical to obtain, now shows up in more databases every day. Analysis of the location aspects of this data is central to the timely understanding and management of public safety, public health, the commercial supply chain, customer purchase patterns and many commercial resources.

A similar trend exists with respect to geometric data significant to the design and manufacture of products; the maintenance of buildings, highways and bridges; the management of energy use; and, so on.

In both cases, it is important for the data to be defined, captured and stored in as standardized, easily specified and easily used a fashion as possible. It is also essential for the database system to facilitate the specification of queries that exploit geographic and geometric data. And, it is essential for the database system to perform such queries efficiently, especially when the data volumes involved are large.

sybase presently addresses these requirements in the embedded row store dbms inside sybase IQ - sQL Anywhere, a very efficient small footprint dbms that serves as a catalog store for sybase

23




IQ’s engine. However, users can also spawn a separate instance of sQL Anywhere from sybase IQ to store geo spatial data. sybase IQ then provides facilities for federated query of the geospatial/geometric data stored in sQL Anywhere and the main analytical column data store in sybase IQ 15.4 to enable high performance geospatial analysis

4.7 FREE EXPRESS EDITION

As with any software platform, it is important for developers to be encouraged to develop tools and applications for sybase IQ. The robust and rapidly growing community of millions of sybase IQ end users— using the product at over 4,500 installations worldwide—is certainly an incentive to developers.

but it is still important to remove obstacles from the path of any developer interested in providing new capabilities to those users.

To this end, sAP is providing a free Express Edition of sybase IQ 15.4. Anyone developing for sybase IQ (utilizing the rich Application services described in this section) using sybase IQ or thinking about such an activity can download the product from http://www.sybase.com/iqexpressedition.

24




The Ecosystem LayerThe Ecosystem Layer, represented by the outermost layer in the sybase IQ 15.4 analytic platform architecture, is an environment in which sAP and its partners can provide and support analytic applications and tools, as well as the business intelligence tools that have long been available with sybase IQ.

Figure 9: The SAP Sybase IQ 15.4 Application Enablement Layer

5

source: sAP Inc.

Key elements of the Ecosystem Layer include support for sAP businessObjects and the “R” statistical language; more efficient and scalable data mining functions written to exploit the new in-database mapReduce; social network analysis modules; support for PmmL, for mathematical modeling; new capabilities for Powerdesigner and system administration and monitoring; applications for big data information lifecycle management;. Highlights are described in the following sections

5.1 SAP BUSINESSOBJECTS PORTFOLIO SUPPORTAs part of sAP, sybase IQ is now well integrated with sAP’s market leading tools for business Intelligence and data Integration from its businessObjects portfolio. The sAP businessObjects bI Platform is not only certified with every new version of sybase IQ, including sybase IQ 15.4, but it is also being optimized to support sybase IQ focused optimized sQL generation. similarly, sAP businessObjects data services is being certified and optimized to load and transform data into sybase IQ in a very efficient manner.

5.2 “R” LANGUAGE SUPPORT As described at www.r-project.org,

R is a language and environment for statistical computing and graphics. …R provides a wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering, etc) and graphical techniques, and is highly extensible.

R is available as Free software under the terms of the Free software Foundation’s gNu general Public License in source code form. It compiles and runs on a wide variety of uNIX platforms and similar systems (including Freebsd and Linux), Windows and macOs.

25




sybase IQ 15.4 provides support for the R language in two ways. First, R applications can fetch data sets stored in IQ for analysis in the R environment through RJdbC. second, calls to models written in R can be embedded in a sybase IQ table udF written in C++. Then sQL queries submitted to sybase IQ can call the udF, thereby causing the model to be invoked in an R server process.

5.3 MAPREDUCE-ENABLED DATA MININGsince Version 15.1, the Fuzzy Logix library of data mining and analytic functions has been available with sybase IQ. With sybase IQ 15.4, this library of over 250 functions has been:• Re-implemented using sybase IQ’s in-database mapReduce API; and, • Extended with additional new functions.most significantly, by using sybase IQ’s in-database mapReduce API, the new implementation leverages the sybase IQ table and table parameterized functions (thus using bulk data input and bulk data output to gain efficiency) and exploits the elastic PlexQ™ grid to execute the functions with much higher parallelism.

5.4 SOCIAL NETWORK ANALYSIS MODULESKXEN’s InfiniteInsight social network analysis and predictive analytic toolset has been certified with sybase IQ 15.4 to run on data stored in the database. With sybase IQ 15.4, KXEN does its scoring directly in the database, and reports that it realizes large performance benefits both from the column storage model and the in-database analytic support.

5.5 SYBASE POWERDESIGNER 16 ARCHITECTURE RECOMMENDERsybase Powerdesigner is a widely used application and database design product that has long been available and integrated with sybase IQ.Powerdesigner 16 and sybase IQ 15.4 are now jointly enhanced and integrated to provide a new capability of recommending the architecture for a sybase IQ solution. The user provides Powerdesigner with:• database design• Expected data volumes & growth• Expected workload• Performance requirements• Hardware preferences (e.g., Intel or Power)Powerdesigner will then generate an estimate of the configuration required and a bill of materials based on sybase IQ reference architectures developed in cooperation with system partners. Where pre-built appliance-like configurations are available, these can be generated.The user can then vary input assumptions and examine the sensitivity of the configuration to variations.In WinterCorp’s opinion, such estimated configurations would be used only as a starting point in certain capacity planning situations. Particularly in larger and more complex deployments, users would be well advised to seek independent confirmation and measurement. However, a fast path to an initial estimate is often extremely useful in capacity planning and this tool can provide that, along with an indication of sensitivity to various planning assumptions.

26




An area where particular caution is advised is in regard to large databases with complex query requirements. Where query complexity is high and data volumes are large, modest changes to the query workload can produce surprisingly large variations in capacity requirements. In these cases, a certain amount of realistic testing—along with larger allowances for unexpected capacity demands—are in order.However, with this tool, a history of configuration changes can be initiated, estimated, tracked, and maintained that can make sizing and deployments much more “factory like.”

5.6 IN-DATABASE PMML

From http://www.dmg.org/pmml-v3-0.html The Predictive model markup Language (PmmL) is an XmL-based language which provides a way for applications to define statistical and data mining models and to share models between PmmL compliant applications.

PmmL provides applications a vendor-independent method of defining models so that proprietary issues and incompatibilities are no longer a barrier to the exchange of models between applications.

PmmL models can be developed in a variety of data mining and statistical workbench environments available from other parties. However, when PmmL models are actually used in production to score large volumes of data, they must run in a highly parallel environment.In sybase IQ 15.4, users can run PmmL models with a plug-in, developed by Zementis (http://www.zementis.com/in-DB-plugin.htm). With the plug-in, the PmmL model can be run directly against data in sybase IQ. The Zementis plug-in is a sybase IQ udF, leveraging the new JAVA API available in sybase IQ 15.4

besides the various eco-system modules outlined above, sybase IQ supports a substantial variety of packaged analytical applications through its OEm partnerships covering various functional areas. A few examples include Ericsson’s Oss product ENIQ, bmmsoft EdmT, and solix Edms.

27




ConclusionsOver the course of its last five rapid releases in 3 years—from 15.0 through the present 15.4—sAP sybase IQ has been transformed to a platform for large scale data analytics and big data. It has significantly advanced in:• Scalability, with the development of its elastic PlexQ™ grid that adds highly parallel

execution of large queries and loads; previously, such operations could run in parallel over a single node of the grid; now they can run in parallel over multiple nodes; this is a major architectural advance, highly significant for larger data and workload requirements;

• in-database analytics, with a major generalization and extension of the user defined function (udF) facility in sybase IQ; with these new capabilities, udFs can be written in Java as well as C++; they can read and write bulk data in the form of tables and files; they can be run in a protected mode, increasing system reliability and data availability; and, they can be executed in parallel over multiple nodes of the grid;

• in-database MapReduce, enabling end users and partners to run mapReduce routines and libraries against data in place and in a highly parallel fashion in sybase IQ, and opening sybase IQ up to a large range analytic tools and applications from many vendors and sources;

• interface to Hadoop, enabling the many customers who are investing—or will invest—in an open source data repository in a Hadoop cluster—to leverage that investment in combination with data and analysis in sybase IQ;

• Other analytic application services leveraging in-database MapReduce and new, more powerful uDFs; these include an expanded, more efficient and more highly parallel version of the Fuzzy Logix data mining and analytics library; a simulator for testing analytic applications; and, other features.

• Partner Ecosystem - Other analytical, management and business intelligence tools and functions available from partners, certified by sybase IQ and providing analytical solutions and capabilities to customers; these include support for the sAP businessObjects tool set, the R statistical language; a PmmL plug-in for data mining from Zementis; social network analysis from KXEN; query and administration tools from Quest TOAd; and, of other capabilities.

These advances are evidence of a significant reorientation of the product direction and a significant enhancement of the product line to focus on the major drivers of change in business today. Organizations everywhere are grappling with the implications of a much larger volume and variety of data and a much increased focus on business strategies driven by fuller analysis of that data. mobility (tablets, smartphones, other devices), social media and machine generated data are all changing our data environments. sybase IQ now claims more than 4,500 installations of sybase IQ across the globe, following a rapid growth in revenue and a large expansion of the development organization. In addition to the recent advances in releases 15.0 through 15.4 described here, sybase IQ retains its established advantages in column storage, indexing and compression. These features—present since the earliest versions of sybase IQ—work in combination to confer benefits that are unique to sybase IQ. While other products offer column storage and compression, no other product has the sophistication of sybase IQ in integrating these features with advanced indexing and query optimization. The result is that sybase Q is particularly efficient in reducing the amount of data that must be read to satisfy queries. These fundamental strengths are now combined with increased parallelism and other features to deliver product benefits in a wider range of applications, now including those that use advanced analytic methods, including mapReduce and that involve interaction with big data in Hadoop clusters.

6

29

WinterCorp is an independent consulting firm expert in the architecture and scalability of big data and analytic database solutions.

Since our founding in 1992, we have architected solutions to some of the largest scale and most demanding big data and data warehouse requirements, worldwide.

We help technology users define their requirements; architect their solutions; select their platforms; and, engineer their implementations to optimize business value.

We create and conduct benchmarks, proofs-of-concept, pilot programs and system engineering studies that help our clients manage technical risk,

control cost and reach business goals.

Our seminars and structured workshops help client teams establish a shared foundation of knowledge and move forward to meet their challenges in big data and analytic database scalability, performance and availability.

We’re expert with SQL, MapReduce and Hadoop—with structured data, unstructured data, and semi-structured data—with the products, tools

and technologies of data analytics in all its major forms.

With our in-depth knowledge and experience, we deliver unmatched insight into the issues that impede scalability and into the technologies

and practice that enable business success.

©2012 Winter Corporation, Cambridge, MA. All rights reserved.

245 FIRsT sTREET, suITE 1800CAmbRIdgE mA 02145

617-695-1800

v i s i t us a t www.wintercorp .com

an elastic platform for business analytics · adopted in sap sybase iq. in developing this report,...

Documents