uRiKA: An InDetail Paper by Bloor Research. Author: Philip Howard. Publish date: July 2012.



In our view, graph databases offer a better solution for relevant applications than other database technologies and we expect them to make a major impact on the market. Philip Howard


© 2012 Bloor Research. A Bloor InDetail Paper.


Executive summary

uRiKA is an appliance for graph analytics which contains a graph database, but what exactly is a graph database or, for that matter, a graph, and why should you care?

A graph, sometimes also known as a network, consists of nodes and edges where a node represents an entity (a person or thing) and an edge a relationship between entities. Figure 1 is an example of such a graph depicting the 9/11 terrorists and how they were related (courtesy of V. E. Krebs “Uncloaking Terrorist Networks”). Note that relationships (and therefore edges) may be one-way or two-way. For example, if you follow me on Twitter but I don’t follow you then I can influence you but you may have no way to influence me.

Figure 1: A graph showing 9/11 terrorists and their relationships

Different types of relationships can be captured in the same graph. Here are a few examples: the relationships within a family, ownership of property, or influencers. Relationships can be one-to-one (for example, a spousal relationship), or one-to-many (joint ownership of a house, where each spouse has an ownership relationship with the house). Other types of relationships that can be captured are temporal (for example, a chain of events, where one event caused the next), or spatial (for example, relationships between the climate of two different places). Basically, graphs offer a holistic view of all the relationships that an entity participates in.
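As a minimal sketch of the ideas above (and not a description of how uRiKA stores data internally), a graph with typed, directed edges can be held as a set of (source, relationship, target) triples. The node names and relationship labels are hypothetical, chosen only to mirror the examples in the text.

```python
from collections import defaultdict

# Each edge is (source, relationship, target); all names are illustrative.
edges = {
    ("alice", "follows", "bob"),        # one-way: bob does not follow alice
    ("alice", "spouse_of", "carol"),    # a two-way relationship is stored
    ("carol", "spouse_of", "alice"),    # as a pair of directed edges
    ("alice", "owns", "house_1"),       # both spouses have an ownership
    ("carol", "owns", "house_1"),       # relationship with the same house
}

def neighbours(edges, node):
    """Group the outgoing edges of `node` by relationship type."""
    out = defaultdict(set)
    for source, relationship, target in edges:
        if source == node:
            out[relationship].add(target)
    return dict(out)

def is_two_way(edges, a, b, relationship):
    """True if the relationship holds in both directions."""
    return (a, relationship, b) in edges and (b, relationship, a) in edges
```

Calling `neighbours(edges, "alice")` gathers every relationship type that one entity participates in, which is the holistic view described above.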


uRiKA can be viewed as a new class of product: a ‘graph warehouse’, purpose built for relationship analytics, focused on the discovery of relationships (and patterns within those relationships) that were not previously known. Such use cases are common: for example, security services want to discover and understand the relationships that exist between criminals and/or terrorists; similarly, fraudsters (telecoms fraud, benefit fraud, insurance fraud and so on) are frequently associates of one another, as are individuals involved in money laundering; in medicine and life sciences you want to be able to discover associations between treatments and patients, between drugs and allergic reactions, and between different pieces of research to discover new possible treatments. Other possible uses would be in SIEM (security information and event management) to use graphs to identify particular patterns of cyber-attack, in capital markets to discover trading patterns and, of course, for social media analytics.

There remains the question of why a graph database, and uRiKA in particular, should be superior to a relational or other database-based approach. We will discuss this in more detail in due course but suffice it to say for the moment that there are good reasons to suppose that uRiKA will provide very significant performance benefits when compared to more traditional processing methods.

Fast facts

uRiKA is a graph database, announced in March 2012, that has been specifically designed to support real-time analytics against relationship-based data. The product has been developed by YarcData, an operating subsidiary of Cray Inc., and it is shipped as an appliance. That is, it is a combined hardware and software offering with the database pre-installed prior to delivery. Note that the hardware has been specifically designed to support and optimise the uRiKA database.

Key findings

In the opinion of Bloor Research, the following represent the key facts of which prospective users should be aware:

• Graph-related analytic queries represent a class of problem that is not easily or well handled by either traditional data warehouses (row-based or columnar) or by using NoSQL approaches such as Hadoop.

• uRiKA is a purpose-built appliance designed specifically to provide graph-related analytics within a big data context.

• uRiKA does not need to stand alone: it can work collaboratively within a broader data warehousing environment, importing graph datasets and exporting results for further analysis.

• uRiKA runs entirely in memory—from 512GB to 512TB. As a result performance should not be an issue.

• uRiKA not only scales in terms of memory but also in processing capacity. An entry level system has 16 processors and scales up to 8,192 processors, each with 128-way multi-threading.

• Load capacity is also highly scalable: the company claims up to 350TB per hour.

• YarcData’s software stack has been designed for compatibility with the commonly used Jena open source framework, facilitating the migration of existing graph applications onto uRiKA as the dataset size grows, and supporting open standards.

• While the product has only recently been released the company already has some prestigious customers using uRiKA.

The bottom line

uRiKA represents the convergence of supercomputing with big data. It should perform beyond the wildest dreams of those used to conventional data warehousing environments: it has more in-memory capacity, more computing power, faster load rates, more parallelisation. What YarcData has recognised is that if you are serious (which means detailed analytics in real-time) about analysing and discovering relationships amongst very large datasets then neither relational technology nor offerings such as Hadoop are sufficient. As such uRiKA is unique: there are no comparable products within the commercial market. Certainly there are other graph databases but they cannot offer the scale or performance of uRiKA.


Why use a graph database?

Let us consider a use case. Suppose you want to identify three or more people who are connected in some way (directly or indirectly), at least one of whom has rented or bought a truck, one of whom has bought fertiliser even though he doesn’t own or work on a farm, one of whom has visited a website dealing with bomb making, and one of whom has been seen visiting national monuments. Graph analytics allow you to search for this pattern in the sea of what otherwise might appear to be innocuous relationships that, when identified, form a plot. Then, once you have detected these persons of interest, you can graphically visualise the relationships between these people and things and search for more evidence of this possible plot. Moreover, in the case of uRiKA you can do this in real-time, compared to the days or weeks that might be required if using conventional methods.
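The pattern search described above can be sketched in a few lines, under simplifying assumptions: people are grouped by direct or indirect connection, and a group is reported when it has at least three members who between them show all four flagged activities. The people, links and activity labels are hypothetical, and this is an illustration of the idea rather than of how uRiKA evaluates such a query.

```python
from collections import deque

# Hypothetical observations: person -> set of flagged activities.
activities = {
    "p1": {"rented_truck"},
    "p2": {"bought_fertiliser_no_farm"},
    "p3": {"visited_bomb_site", "visited_monuments"},
    "p4": {"rented_truck"},
    "p5": set(),
}
# Hypothetical undirected links between people.
links = [("p1", "p2"), ("p2", "p3"), ("p4", "p5")]

REQUIRED = {"rented_truck", "bought_fertiliser_no_farm",
            "visited_bomb_site", "visited_monuments"}

def connected_groups(people, links):
    """Yield sets of people that are directly or indirectly connected."""
    adjacency = {p: set() for p in people}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen = set()
    for start in sorted(people):
        if start in seen:
            continue
        group, queue = {start}, deque([start])
        while queue:  # breadth-first walk along the edges
            for nxt in adjacency[queue.popleft()] - group:
                group.add(nxt)
                queue.append(nxt)
        seen |= group
        yield group

def suspicious_groups(activities, links, required=REQUIRED, min_size=3):
    """Return groups of at least min_size people who jointly cover `required`."""
    hits = []
    for group in connected_groups(activities, links):
        covered = set().union(*(activities[p] for p in group))
        if len(group) >= min_size and required <= covered:
            hits.append(group)
    return hits
```

With the sample data, only the group containing p1, p2 and p3 qualifies: p4 and p5 are connected to each other but are too few and cover too few activities.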

In addition, it is important to appreciate that when it comes to analytics, the more nodes you have in your graph then the richer the environment becomes and the more information you can get out of it (see box). So, at least for analytics, graph-based data is a big data problem. Further, it is worth considering why we do not believe that other types of database are well-suited to these sorts of queries. Since the arguments differ somewhat between relational databases and NoSQL approaches we will discuss each of these separately.

Relational databases and data warehouses

Relational databases and data warehouses serve different functions. Relational databases are optimised for transaction processing, while warehouses are optimised to support analytic applications and business intelligence. Neither is well suited to graph analytics, however.

Both relational databases and warehouses are based on tables. Each row in a table is called a “tuple”, and the table as a whole is a “relation” (hence, relational database). However, relational databases are not about holding or modelling relationships in the graph analytic sense.

Data in relational databases is split across multiple tables in a process known as “normalisation”. Normalisation enables very high transaction processing rates, taking advantage of the fact that most transactions only update or retrieve a small amount of data in each record. However, there is a cost: queries that require data to be amalgamated from multiple tables require ‘joins’, which are very expensive in performance terms. Even worse are ‘nested joins’, where one or more joins are required for each record returned in response to a query. Herein lies the problem: graph queries typically require multiple joins and often multiple nested joins, making relational databases poorly suited to graph analytics.
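To illustrate why graph queries translate into repeated joins, the sketch below treats the edges as rows in a table and follows a path by joining that table to itself once per hop. The data and function names are hypothetical, and the naive nested-loop join is an assumption made for clarity: a real database would use indexes or hash joins, but each additional hop still adds another join.

```python
# Edge table as it might sit in a relational schema: (source, target) rows.
# The data is illustrative only.
edge_table = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "e"), ("e", "d")]

def self_join(paths, edge_table):
    """One relational self-join: extend every path by one matching edge row."""
    return [path + (target,)
            for path in paths
            for source, target in edge_table
            if source == path[-1]]

def paths_of_length(edge_table, start, hops):
    """Follow `hops` edges from `start`; each hop costs one more join."""
    paths = [(start,)]
    for _ in range(hops):
        paths = self_join(paths, edge_table)
    return paths
```

Here `paths_of_length(edge_table, "a", 3)` performs three joins, and every join scans the whole edge table for every partial path found so far.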

In addition, conventional data warehouses perform best when the details of what queries are to be run are known in advance. While recent developments in data warehousing have improved ad hoc query performance, analytics performed with graph databases usually involve ad hoc and frequently changing queries, as well as the search for relationships between individual entities, none of which are the strong suits of a data warehouse.

NoSQL databases

Technically, any database that doesn’t employ SQL as an access mechanism is a NoSQL database. However, when most people talk about NoSQL databases, they are referring to clustered solutions such as Hadoop (which is based on MapReduce) or Cassandra.

Box: The relationship between information richness and the number of nodes in a graph is a matter for debate: Metcalfe’s Law (which is actually no more than a hypothesis) suggests that growth in value of a network is approximately proportional to the square of the number of nodes (actually n × (n − 1)). However, this has been disputed, not least because some connections (relationships) between nodes are more valuable than others. Other researchers have suggested that n log(n) would be a more appropriate figure. The answer is probably somewhere in between but there seems no doubt that the more information you can collect then the more value you can extract.
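To give a feel for how far apart the two estimates in the box are, the short sketch below evaluates both for a few graph sizes. The function names are hypothetical, and the use of the natural logarithm is an assumption, since the text does not state a base.

```python
import math

def metcalfe_value(n):
    """Metcalfe-style estimate: n * (n - 1), the number of ordered node pairs."""
    return n * (n - 1)

def n_log_n_value(n):
    """Alternative estimate: n * log(n), natural logarithm assumed."""
    return n * math.log(n)

# Print both estimates side by side for a few illustrative graph sizes.
for n in (10, 1_000, 1_000_000):
    print(n, metcalfe_value(n), round(n_log_n_value(n)))
```

Whichever estimate is closer to the truth, both grow faster than the number of nodes itself, which is the point the box makes about collecting more information.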


However, the big problem, which also applies to relational (row or column-based) data warehouses that offer a clustered solution, is that you don’t know how to partition the data in a graph. If you don’t know the relationships that exist between different entities, or the queries that will be run, how do you know which partition (on which cluster node) to put the data in? Whatever you guess will be wrong. As a result, queries will regularly have to access nodes across the cluster, and that will significantly slow down performance. This is why graph databases such as uRiKA scale up rather than scale out.
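The partitioning problem can be made concrete with a small sketch. It assumes that, without knowledge of the relationships, graph nodes are placed on partitions at random, and it measures the fraction of edges whose two endpoints end up on different partitions (each such edge forces a query to cross the cluster). The function names and the ring-shaped sample graph are hypothetical.

```python
import random

def random_assignment(nodes, num_partitions, seed=0):
    """Place each graph node on a partition chosen uniformly at random."""
    rng = random.Random(seed)
    return {node: rng.randrange(num_partitions) for node in nodes}

def cut_fraction(edges, assignment):
    """Fraction of edges whose endpoints sit on different partitions."""
    if not edges:
        return 0.0
    cut = sum(1 for a, b in edges if assignment[a] != assignment[b])
    return cut / len(edges)

# Illustrative ring of 1,000 nodes, each linked to its successor.
nodes = [f"n{i}" for i in range(1000)]
ring = [(nodes[i], nodes[(i + 1) % 1000]) for i in range(1000)]
```

Under uniform random placement across k partitions, the two endpoints of an edge share a partition with probability 1/k, so the expected fraction of cut edges is 1 − 1/k. With 8 partitions that is already about 88% of all edges, and it only worsens as the cluster grows.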

In the case of Hadoop, apart from other shortcomings (such as single points of failure), a further problem is that this is essentially a batch environment that does not lend itself well to ad hoc enquiries whereas uRiKA is aimed at real-time environments, where the whole raison d’être is to support ad hoc queries.


The product


uRiKA, which is provided on a subscription basis, is based on the Cray XMT-2. This was introduced in 2011 as the world’s largest shared memory machine. This is important because uRiKA holds the entire graph in memory. Note that this is not comparable to other databases that offer in-memory capability. With these products what you tend to be offered is a limited amount of memory (typically a few terabytes) and either your application fits within that memory or you effectively use memory as a cache. With uRiKA this is not the case and the entire database is held in memory with a capacity of anything between 512GB and 512TB.

As stated, uRiKA is focused on real-time analytics. In order to support this YarcData has put a major focus on tolerating memory latency. Memory latency is the Achilles’ heel of many graph processing solutions. Memory speed has been increasing much more slowly than processor speed over the past three decades, with the result that achieving reasonable system performance requires the effective use of caching. Graph problems are all about following the edges between nodes and, since the edges can lead anywhere in memory, caches are not effective, and the processor slows to the speed of memory. Cray addresses the issue of memory latency using multi-threading. uRiKA supports up to 128 hardware threads per processor, the basic principle being that there will always be some threads able to continue processing even if others are waiting for data or synchronisation.
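The principle behind latency tolerance can be illustrated with a deliberately simple model, which is an assumption for illustration rather than a description of the Cray hardware: each thread alternates between a short burst of computation and a long wait on memory, and the processor is busy whenever at least one thread is ready. The cycle counts used below are hypothetical.

```python
def utilisation(threads, compute_cycles, memory_latency_cycles):
    """Fraction of time the processor does useful work in a simple model
    where each thread computes, then stalls on one memory access."""
    per_thread = compute_cycles / (compute_cycles + memory_latency_cycles)
    return min(1.0, threads * per_thread)

def threads_to_saturate(compute_cycles, memory_latency_cycles):
    """Smallest thread count at which the model reaches full utilisation.
    Both arguments are expected to be positive integers."""
    total = compute_cycles + memory_latency_cycles
    return -(-total // compute_cycles)  # ceiling division
```

With, say, 5 cycles of work per 500-cycle memory access, a single thread keeps the processor busy about 1% of the time, whereas in this model roughly a hundred threads are enough to hide the latency entirely, which is consistent with the rationale for 128 hardware threads per processor.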

More generally, an entry level system comprises 16 processors, each with 128 threads, and 512GB of memory, and scales up to 8,192 processors and 512TB of memory. YarcData maintains a system balance between I/O and processing power, enabling a load rate that the company claims can scale up to 350TB per hour. This is, frankly, an astonishing figure: data warehousing vendors typically talk in terms of single figures of terabytes per hour. On the other hand it is a plausible figure given the scalability of this architecture. uRiKA uses a Cray-developed interconnect, a 3D Torus network, which the company claims (and the company has been a specialist in this area for some years) is an order of magnitude faster than the fastest commercial interconnects (InfiniBand) used by the majority of data warehousing vendors.
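The figures quoted above can be combined with some simple arithmetic. The sketch below derives the total number of hardware threads at each end of the range and the time needed to fill the largest configuration at the claimed peak load rate. The function names are hypothetical, and the calculation assumes the claimed rate is sustained throughout the load.

```python
THREADS_PER_PROCESSOR = 128  # hardware threads per processor, as quoted

def hardware_threads(processors):
    """Total hardware threads for a given processor count."""
    return processors * THREADS_PER_PROCESSOR

def hours_to_load(data_tb, rate_tb_per_hour=350):
    """Hours needed to load data_tb terabytes at a sustained load rate."""
    return data_tb / rate_tb_per_hour
```

On these numbers an entry level system offers 2,048 hardware threads and the largest offers just over a million, while filling 512TB at 350TB per hour would take roughly an hour and a half.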

The uRiKA software stack is designed to integrate easily into an existing environment, and is easy to extend. The most notable point is that the environment is fully compatible with the open-source Jena platform (an Apache project), making it easy to migrate existing applications requiring greater performance. Adoption is also facilitated by support for industry standards, such as RDF (Resource Description Framework, a W3C standard), and the SPARQL query language (SPARQL is a recursive acronym for “SPARQL Protocol and RDF Query Language”).
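For readers unfamiliar with these standards, RDF represents data as (subject, predicate, object) triples, and a SPARQL query is essentially a set of triple patterns with variables that must match consistently. The sketch below imitates that matching in plain Python. It is not Jena, not a SPARQL engine and not uRiKA code, and the `ex:` identifiers are hypothetical.

```python
# RDF models data as (subject, predicate, object) triples.
# The identifiers below are hypothetical, not taken from any real dataset.
triples = [
    ("ex:drugA", "ex:treats", "ex:condition1"),
    ("ex:drugB", "ex:treats", "ex:condition1"),
    ("ex:drugA", "ex:causes", "ex:rash"),
    ("ex:patient7", "ex:allergicTo", "ex:drugA"),
]

def match(pattern, triple, bindings):
    """Try to unify one triple pattern with one triple; '?x' marks a variable."""
    result = dict(bindings)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against an earlier binding.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def query(patterns, triples):
    """Evaluate a list of triple patterns, joining on shared variables."""
    solutions = [{}]
    for pattern in patterns:
        solutions = [extended
                     for bindings in solutions
                     for triple in triples
                     if (extended := match(pattern, triple, bindings)) is not None]
    return solutions

# Roughly equivalent to the SPARQL pattern:
#   SELECT ?drug ?patient
#   WHERE { ?drug ex:treats ex:condition1 . ?patient ex:allergicTo ?drug }
allergy_risks = query(
    [("?drug", "ex:treats", "ex:condition1"),
     ("?patient", "ex:allergicTo", "?drug")],
    triples,
)
```

The shared variable `?drug` is what ties the two patterns together: only a drug that both treats the condition and appears in an allergy triple survives, which is the kind of drug-and-reaction association mentioned earlier in the paper.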

This adherence to open standards means that you can use the visualisation tool of your choice and there are a number of such technologies available, though none are perfect for all applications. As an example of the sort of facilities that uRiKA provides, Figure 2 shows a screenshot of one particular visualisation, in this case looking at protein pathways that connect gene pairs, using technology from key-lines.com.

Figure 2: Screenshot showing protein pathways


The vendor

YarcData is a subsidiary of Cray Inc. (Yarc is Cray spelled backwards), a public company listed on NASDAQ.

Cray Research, Inc. was founded in 1972 by computer designer Seymour Cray. In 1996, the company was bought by Silicon Graphics (SGI) and Cray Inc. was formed in 2000 when Tera Computer Company purchased the Cray business from SGI and adopted the name of its acquisition. Meanwhile, the Tera Computer Company was a manufacturer of high-performance computing software and hardware, founded in 1987. It was a pioneer in massive multi-threading to deal with memory latency and the company was the first to offer a commercial deployment of massive multi-threading.

Cray’s headquarters are in Seattle where YarcData is also based. Outside North America the company has offices in the UK, Germany, Switzerland, Spain, France, Italy, Japan, Australia, India, Taiwan, South Korea and Hong Kong.

Cray has a long history of working in conjunction with major organisations and government departments, especially the US government, and it was sponsorship of this sort that, in part, led to the development of uRiKA. Despite the fact that uRiKA was only launched in March 2012 the product is already in use at The Mayo Clinic, The Swiss National Supercomputing Centre, the Institute for Systems Biology, and Sandia National Laboratories, amongst others.

Cray web address: www.cray.com

YarcData web address: www.yarcdata.com


Summary


In our view, graph databases offer a better solution for relevant applications than other database technologies and we expect them to make a major impact on the market. This is especially true when it comes to large, complex analytic problems where rapid (real-time) visualisation of results is required. It is this area that is uRiKA’s forte and we have not seen any product to compare with it in this space. Conversely, the difficulty that YarcData faces is that while there are plenty of potential deployments for its technology the whole concept of graph databases, let alone uRiKA, is little known or understood. There is therefore an educational process to go through. This tends to delay decision making and we therefore expect it to be some time before this approach becomes mainstream. Nevertheless we are confident that uRiKA will make a significant impact on the market.

Further Information

Further information about this subject is available from http://www.BloorResearch.com/update/2139


Bloor Research overview

Bloor Research is one of Europe’s leading IT research, analysis and consultancy organisations. We explain how to bring greater Agility to corporate IT systems through the effective governance, management and leverage of Information. We have built a reputation for ‘telling the right story’ with independent, intelligent, well-articulated communications content and publications on all aspects of the ICT industry. We believe the objective of telling the right story is to:

• Describe the technology in context to its business value and the other systems and processes it interacts with.

• Understand how new and innovative technologies fit in with existing ICT investments.

• Look at the whole market and explain all the solutions available and how they can be more effectively evaluated.

• Filter “noise” and make it easier to find the additional information or news that supports both investment and implementation.

• Ensure all our content is available through the most appropriate channel.

Founded in 1989, we have spent over two decades distributing research and analysis to IT user and vendor organisations throughout the world via online subscriptions, tailored research services, events and consultancy projects. We are committed to turning our knowledge into business value for you.

About the author

Philip Howard
Research Director - Data Management

Philip started in the computer industry way back in 1973 and has variously worked as a systems analyst, programmer and salesperson, as well as in marketing and product management, for a variety of companies including GEC Marconi, GPT, Philips Data Systems, Raytheon and NCR.

After a quarter of a century of not being his own boss Philip set up his own company in 1992 and his first client was Bloor Research (then ButlerBloor), with Philip working for the company as an associate analyst. His relationship with Bloor Research has continued since that time and he is now Research Director focused on Data Management.

Data management refers to the management, movement, governance and storage of data and involves diverse technologies that include (but are not limited to) databases and data warehousing, data integration (including ETL, data migration and data federation), data quality, master data management, metadata management and log and event management. Philip also tracks spreadsheet management and complex event processing.

In addition to the numerous reports Philip has written on behalf of Bloor Research, Philip also contributes regularly to IT-Director.com and IT-Analysis.com and was previously editor of both “Application Development News” and “Operating System News” on behalf of Cambridge Market Intelligence (CMI). He has also contributed to various magazines and written a number of reports published by companies such as CMI and The Financial Times. Philip speaks regularly at conferences and other events throughout Europe and North America.

Away from work, Philip’s primary leisure activities are canal boats, skiing, playing Bridge (at which he is a Life Master), dining out and walking Benji the dog.


Copyright & disclaimer

This document is copyright © 2012 Bloor Research. No part of this publication may be reproduced by any method whatsoever without the prior consent of Bloor Research.

Due to the nature of this material, numerous hardware and software products have been mentioned by name. In the majority, if not all, of the cases, these product names are claimed as trademarks by the companies that manufacture the products. It is not Bloor Research’s intent to claim these names or trademarks as our own. Likewise, company logos, graphics or screen shots have been reproduced with the consent of the owner and are subject to that owner’s copyright.

Whilst every care has been taken in the preparation of this document to ensure that the information is correct, the publishers cannot accept responsibility for any errors or omissions.


2nd Floor, 145–157 St John Street

LONDON, EC1V 4PY, United Kingdom

Tel: +44 (0)207 043 9750 Fax: +44 (0)207 043 9748

Web: www.BloorResearch.com email: [email protected]