
WHITEPAPER

Anzo Smart Data Lake™ Enterprise Graph-Based Data Discovery, Analytics and Governance

2 Anzo Smart Data Lake © Copyright 2015, Cambridge Semantics

Introduction

Cambridge Semantics, The Smart Data Company™, is an industry

leader in semantic standards and graph-based technology solutions.

We have combined scalable graph-based database technology with

our proven Anzo Smart Data Platform™.

The result is the Anzo Smart Data Lake, which goes beyond the rigid relational data warehouse and the unwieldy Hadoop-only Data Lake, disrupting the way IT and business alike manage and analyze data at enterprise scale with unprecedented flexibility, insight and speed.

Read this whitepaper to learn how Cambridge Semantics has

changed the game of data discovery, analytics and governance

for the enterprise, to provide:

An unlimited enterprise graph so all enterprise users can “surf” and query all their data intuitively and without specialized data analytics knowledge

A semantic data model that easily captures and delivers the “meaning” of data with all the inherent relationships and attributes

Ad hoc data discovery and analytic tools so business users in any department can get answers to questions, as well as generate questions they didn’t think to ask before

Democratized Big Data, so essentially everyone can now discover and analyze all the data while applying governance, security and flexible policies

A rapidly deployable platform to integrate with existing Hadoop or other Data Lake environments or start from the ground up

Linked and contextualized data, so users are now able to self-help and combine data as needed to support business functions


Current Data Warehouse and Data Lake Approaches to Analytics

Relational data warehouses continue to be the predominant

approach to organizing information for analytics and decision

support. Based on proven technology and governance methodology,

they offer IT a way to deliver solutions with predictable resourcing,

time and cost. Despite their ubiquity and effectiveness, however, as data volumes and diversity grow, the time and cost of the warehouse are becoming prohibitive for the majority of analysis and decision support requests brought to IT from the business. A great deal of valuable insight is left on the table, and time-to-market opportunities suffer.

The ever-increasing variety of alternate approaches is evidence of the

urgency of the industry to arrive at a better solution – as well as the

stark reality that we aren’t even close. Alternatives like Hadoop, NoSQL

and other Big Data approaches are, however, beginning to

converge under the banner of the “Data Lake” - generally defined as

a limitless repository for all data resources, with little up-front

preparation and effort by simply storing the data in its original

format.

To understand their limits, it’s worth pointing out that Hadoop, with

origins in Internet search, was designed to solve narrow, pre-defined

problems on homogeneous data at Web-scale. (Take for example, the

conceptually simple, yet computationally complex problem of

ranking and indexing the world’s Web pages.) It should therefore

not come as a surprise that the technology is not immediately

suitable for the conceptually difficult and broad analytic challenges

enterprises face with their heterogeneous data.

And yet, because of its low cost to scale, Hadoop continues to be the

platform of choice for building a Data Lake - usually taking one of

two forms:

Bespoke search and analytic applications built on Hadoop

Raw data extracts co-located in a Hadoop cluster



The first approach has the potential to provide quality, scalable solutions to a specific problem, but it demands the same custom modeling, ETL and development efforts as its relational brethren, and so remains largely aspirational. The second is more common, and riskier. Even in its aspirational state, this Data Lake has lured CIOs and CDOs into the misunderstanding that having data in one place is a facilitator to broader and more useful analytics leading to better decision-making and better outcomes. While the data may be in one place physically (a Hadoop cluster), in essence all that is created is a collection of data silos, unlinked and not useful in a broader context, reducing the Data Lake to nothing more than a collection of disparate data sources (i.e., a “Data Swamp”).

While IT departments no longer have to spend time developing

models and programming ETL with this Data Lake approach,

the burden of organizing and merging data has been shifted

onto the shoulders of those least equipped to deal with the

problem: data scientists, analysts and subject matter experts.

Even with current approaches, these valuable resources are

spending an estimated 50% to 80% of their time preparing and

organizing their data, leaving as little as 20% of their time for

analyzing it. Any solution that increases this burden is not viable.

Where does this leave us? In the “Old Country” is the Data Warehouse: tried and true, yet with a cost and inflexibility that put the approach out of reach of many business problems, particularly with all the new data sources available. In the “Wild West” is the Data Lake, a disorderly collection of out-of-context data sources loaded into a mismatched technology, bereft of governance or reusability.

The lack of required data preparation (storing data in its original

format) is the cornerstone of industry analysts’ definition and

benefit of the Data Lake. But this very characteristic, the lack of preparation, is what makes Data Lakes difficult to use for deriving insights and, ultimately, value.

Could there be a way to make the Data Lake smarter, retaining its cost benefits while eliminating its shortcomings?

Cambridge Semantics has discovered that by taking a less literal

interpretation of “original format”, the Data Lake can indeed be

made smart enough to deliver exceptional value, on demand.



The Semantic Graph Model Approach to Data Discovery, Analytics and Governance

A semantic model, more formally an OWL ontology, is a

conceptual description of data in an RDF graph, offering users

at all levels a road-map to navigate the data, pose questions

and execute analytics. Semantic models are flexible and are

designed to be conceived and maintained at all organizational

levels:

Industry standards groups

Corporate governance

Departmental best practices

Individual models

Organizations can start small with their semantic models and evolve

them as business needs change or new data sources are required.

RDF, a graph representation of data comprising a network of nodes,

attributes and relationships, is inherently flexible, allowing new data

sources to be integrated without having to redesign the

representation.

The RDF data standard was designed to capture all relationships

and attributes of diverse data sources, to faithfully and completely

represent data.

Together, the RDF graph and the OWL model offer a natural way to

link information from disparate sources without having to know

what types of questions will be asked.
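As a minimal sketch of this idea, the Python below holds a graph as a set of (subject, predicate, object) triples and a toy model as a dictionary. All identifiers (the `crm:` and `news:` prefixes, `worksFor`, `mentions`) are hypothetical and are not taken from Anzo or any published ontology; a real deployment would use an RDF store and an OWL ontology rather than in-memory Python sets.

```python
# Illustrative only: a graph as a set of (subject, predicate, object) triples.
# All names here are hypothetical, not part of Anzo or any real ontology.

# A toy "model": which properties are known and what they connect.
model = {
    "worksFor": ("Person", "Company"),
    "mentions": ("Article", "Company"),
}

# Triples from a first source (for example, a CRM export).
graph = {
    ("crm:alice", "type", "Person"),
    ("crm:acme", "type", "Company"),
    ("crm:alice", "worksFor", "crm:acme"),
}

# A second source is added by a plain set union; no existing triple changes.
news = {
    ("news:a1", "type", "Article"),
    ("news:a1", "mentions", "crm:acme"),
}
graph |= news

# Check that every non-type property used in the data is described by the model.
unknown = [t for t in graph if t[1] != "type" and t[1] not in model]


def match(graph, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# A question nobody planned for: which people work for a company in the news?
in_news = {o for _, _, o in match(graph, p="mentions")}
people = [s for s, _, o in match(graph, p="worksFor") if o in in_news]
print(people)
```

The point of the sketch is that the second source joins the first through a shared identifier, and the cross-source question is answered without any change to how the first source was stored.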

While it may be helpful to use the analogy that RDF is to relational records as OWL is to a relational schema, the semantic approach offers several key advantages over the relational approach:

Flexibility to evolve the model to accommodate changes or new

sources

A conceptual representation for easy consumption by business

users

A unified model spanning all layers of the analytics stack

A framework for sharing standards across organizations


While the Big Data evolution unfolded over the last decade or so,

engineers at Cambridge Semantics have honed the art and science of

applying semantic graph models to real business analytics and data

discovery problems.

Semantic graph models target key challenges of data integration and

analytics:

Flexibly adapting to new data sources and queries requires a data representation and model that can gracefully evolve to accommodate new data, as well as link data from disparate sources

Accurately capturing the full meaning of data with a format that does not lose any of the inherent relationships or attributes of the data

Quickly asking new questions and performing ad hoc analytics without having to engage IT each step of the way

Consider the challenge financial institutions face in tracking down

and investigating potential insider trading activity within the firm.

Looking at the list of an employee’s trades, for example, does not

paint a broad enough picture. Additional data sources such as watch

lists of companies, employee email and IM, research reports, news,

and even location must be combined, analyzed and explored. By

establishing a unified semantic model across these sources, and

bringing the data together in a graph, we can follow the

relationships in any direction without knowing up-front the types of

questions required. We can answer questions such as:


Which employees have made trades in the same location after exchanging emails?

Which securities have been traded within 2 days of a related research report?

Are any deal team members trading off their own watch list?

Now suppose we want to ask the question:

Have any traders engaged with industry experts?

We simply extend the semantic model and load relevant data into

our graph.
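The following Python sketch illustrates the pattern described above under simplified assumptions. The identifiers and property names (`memberOf`, `watches`, `traded`, `contacted`, `IndustryExpert`) are invented for illustration and do not reflect any actual compliance model. It answers the watch-list question, then answers the industry-expert question after new triples are appended, leaving the existing data untouched.

```python
# Illustrative only: hypothetical identifiers and data, not a real schema.
from collections import defaultdict

triples = [
    ("emp:kim", "memberOf", "deal:alpha"),
    ("deal:alpha", "watches", "sec:XYZ"),
    ("emp:kim", "traded", "sec:XYZ"),
    ("emp:lee", "traded", "sec:ABC"),
]


def build_index(triples):
    """Index edges in both directions so links can be followed either way."""
    out, inc = defaultdict(set), defaultdict(set)
    for s, p, o in triples:
        out[(s, p)].add(o)
        inc[(o, p)].add(s)
    return out, inc


def trading_off_watch_list(triples):
    """Employees who traded a security watched by a deal team they belong to."""
    out, _ = build_index(triples)
    employees = {s for s, p, _ in triples if p == "memberOf"}
    hits = set()
    for emp in employees:
        watched = set()
        for deal in out[(emp, "memberOf")]:
            watched |= out[(deal, "watches")]
        if watched & out[(emp, "traded")]:
            hits.add(emp)
    return sorted(hits)


def traders_who_met_experts(triples):
    """Traders with at least one recorded contact with an industry expert."""
    out, inc = build_index(triples)
    experts = inc[("IndustryExpert", "type")]
    traders = {s for s, p, _ in triples if p == "traded"}
    return sorted(t for t in traders if out[(t, "contacted")] & experts)


print(trading_off_watch_list(triples))

# Extending the model: new kinds of triples are appended, nothing is rebuilt.
triples += [
    ("person:dr_ng", "type", "IndustryExpert"),
    ("emp:lee", "contacted", "person:dr_ng"),
]
print(traders_who_met_experts(triples))
```

Because edges are indexed in both directions, the same data supports questions that start from the employee, the security or the expert.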


On the surface these models bear strong resemblance to entity

relationship diagrams or other modeling techniques used in

conjunction with relational data warehouses. So what’s different?

In the relational world, such a model would require translation to a

relational logical model – schema and tables carefully constructed by

database experts with indexes to optimize sets of known or

anticipated questions. Posing such questions requires translation

into SQL queries with joins and optimization – out of the reach of

business users and even most data scientists.

A semantic graph model, on the other hand, requires no such

translation. The data is stored exactly in the way it is modeled – the

way business users think - allowing questions to be asked and new

hypotheses explored on the fly.

Building the Graph

How does this all really work in practice? The semantic graph model is, after all, only a data representation. Building and maintaining graphs can be challenging, particularly when the data sources are multi-structured and diverse, and making this all actionable requires a sophisticated semantics-based platform. Consider the following example from Pharma R&D Intelligence.



A decision maker is trying to track activities of small biotech

companies in his area of expertise. The data sources are a relational

database, a news feed and a CRM system. Creating such graphs

from the source data requires sophisticated technology to build the

model, map to multiple sources and ingest the data.

The Anzo Smart Data Platform, an end-to-end suite for linking and

contextualizing multi-structured data into semantic graphs, is built

following a service-oriented architecture (SOA).

The platform includes tools for:

Modeling and Governance

Managing and versioning models, ontologies

Access control and security

Ingestion

Loading data from disparate sources (ETL)

Linking and transforming content across sources (ELT)

Text analytics

APIs and connection points to integrate with external tools and

systems

Graph-aware Analytics – a new paradigm in data discovery

The incredible benefit of the graph model continues past data

integration and right into the data discovery and the analytics front-

end of the stack – yielding perhaps the greatest differentiator of the

approach.

The BI and analytics landscape is replete with tools, each offering a different set of capabilities for user empowerment and slick visualizations. However, all of these tools require significant data preparation and data movement to work with existing Data Lake approaches.

Rectangular subsets or data frames must be defined and extracted

before these BI tools can be effective. Building these extracts from a

swamp of disconnected raw data is technical, time-consuming, and

error-prone. If the requirements change or new questions are asked,

further extracts must be prepared involving additional work for the

data scientist and IT, often to the point of impracticality.

To understand how the semantic graph model shatters this glass ceiling,

let’s take a look at an actual model used in R&D Intelligence. A clinical

trial has related concepts including disease, phase of development,

organization and country. With this model, decision makers can ask

questions and create visualizations around the current clinical trial

landscape such as:

What Phase II clinical trials are being run in Japan for Ovarian Cancer?

However, what if we want to explore further and discover related information?

Who are the key investigators in a particular region of Japan?

What trials are focusing on injections vs. oral medication?

To answer these questions, additional information is required. With

traditional BI tools, work must be done to discover, join and extract the

appropriate data set. With the graph model, all related data is


immediately available for data discovery and analytics. The data

scientist can explore the entire model to include any connected

information in the analysis.

To deliver this extraordinary potential to end users, Cambridge

Semantics built graph-awareness into Anzo on the Web – the data

discovery and analytics front-end of the Anzo Smart Data Platform.

Instead of relying on rectangular extracts of data, analysts can create

tables, filters, charts and visualizations by intuitively exploring paths

through the full model, applying filters to refine what specific data is

relevant. This approach combines data discovery and analytics with

speed and agility – arriving at answers to new and ad hoc questions

quickly and without requesting support from IT.
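To make the idea of building a table from paths through the model concrete, here is a small Python sketch. It is an assumption-laden illustration rather than a description of how Anzo on the Web works: the trial data, the property names and the `follow` and `table` helpers are all hypothetical.

```python
# Illustrative only: hypothetical trial data and property names.
triples = [
    ("trial:1", "phase", "Phase II"),
    ("trial:1", "disease", "Ovarian Cancer"),
    ("trial:1", "country", "Japan"),
    ("trial:1", "investigator", "person:sato"),
    ("person:sato", "region", "Kansai"),
    ("trial:2", "phase", "Phase II"),
    ("trial:2", "disease", "Ovarian Cancer"),
    ("trial:2", "country", "France"),
    ("trial:2", "investigator", "person:roux"),
    ("person:roux", "region", "Normandy"),
]


def follow(triples, start, path):
    """Follow a list of properties from a start node; return all end values."""
    frontier = {start}
    for prop in path:
        frontier = {o for s, p, o in triples if s in frontier and p == prop}
    return frontier


def table(triples, subjects, columns, filters=None):
    """Build one row per subject; each column is a property path."""
    rows = []
    for subj in sorted(subjects):
        row = {"id": subj}
        for name, path in columns.items():
            row[name] = ", ".join(sorted(follow(triples, subj, path)))
        if all(row[col] == want for col, want in (filters or {}).items()):
            rows.append(row)
    return rows


trials = {s for s, p, _ in triples if p == "phase"}
columns = {
    "phase": ["phase"],
    "country": ["country"],
    # A two-step path: trial -> investigator -> region.
    "investigator_region": ["investigator", "region"],
}
rows = table(triples, trials, columns, {"phase": "Phase II", "country": "Japan"})
print(rows)
```

Adding a column for investigator region is a matter of naming a longer path, not of preparing a new extract.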


Anzo Smart Data Lake Technical Overview

Driven by the success of the Anzo Smart Data Platform, Cambridge Semantics’ customers are increasing the size and scope of their sources, for example by bringing together much larger data sets than can be handled by a single-server architecture. Rising to the challenge, Cambridge Semantics has married Big Data scale with flexible graph-based middleware. The result is the Anzo Smart Data Lake (Anzo SDL), a flexible and scalable knowledge base for data discovery, analytics and governance.

Born from the market’s growing thirst to deploy our proven graph-

based approach at enterprise Data Lake scale - Anzo SDL brings an

authenticity and fresh approach to the theater of Big Data

representation. Anzo SDL stores data with its full original meaning

and context. This requires a bit more preparation on ingest, but

orders of magnitude less effort to derive downstream value.

Anzo SDL introduces three elements of scale to the Anzo Smart Data

Platform (SDP):

Unbounded storage and cataloging of RDF graphs

Parallelizable and rapid ingestion and linking of data sources

An interactive Graph Query Engine

Further, since Anzo SDP is built on a service-oriented architecture, Anzo SDL enables the unbundling of components for distributed deployment.

Cambridge Semantics’ customers are deploying Anzo Smart Data Lake to work with and leverage existing Hadoop Data Lake environments, as well as to build new Data Lakes from scratch.

Graph Storage and Cataloging

Anzo SDL uses highly scalable and available file systems such as

HDFS for storing the graph data at rest. Anzo Smart Data Lake



Server has local transactional graph storage containing the catalog of

models, data sets, mappings, analytics and other configuration used

throughout Anzo SDL.

The server provides:

A power-user workbench for configuring models, ingestion and

linking

Cataloging and metadata management of Anzo SDL graph data

as well as data sources outside Anzo SDL - including Hadoop

data sources

A data scientist/analyst entry-point for data discovery and

analytics

Provisioning and configuration of all other Anzo servers in Anzo

SDL for elastic cloud deployment

Security and access control

High availability and failover


Integration – Ingestion and Linking

Anzo Smart Data Integration is the toolset within the Anzo Smart

Data Platform for mapping and transforming data from all sources

into RDF graphs. Driven by the semantic model, these scalable

servers convert data from all formats, structured and unstructured,

into the RDF graph format. An appropriate number of servers may

be deployed to accommodate the number of sources and total

volume of incoming data, including automatic incremental updates.

Depending on the nature of each of the data sources, one or more of

the following techniques will be applied:

Mapping and transformation of structured or tabular data

Text analytics, converting unstructured data to structured graphs

Custom plugins for data sources with APIs or proprietary

formats

High performance mapping and transforming using Apache

Spark to bring your existing Hadoop data into Anzo SDL
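The first technique in the list, model-driven mapping of tabular data, might be sketched as follows. This is a simplified illustration: the mapping dictionary format, the column names and the `rows_to_triples` helper are assumptions made for the example and are not Anzo's mapping format or API.

```python
# Illustrative only: a hypothetical mapping format, not Anzo's.
import csv
import io

CSV_TEXT = """company_id,name,country
c1,Acme Bio,US
c2,Helix Labs,JP
"""

# The mapping says which column identifies the subject, what class the row
# represents, and which model property each remaining column maps to.
MAPPING = {
    "subject_column": "company_id",
    "subject_prefix": "company:",
    "class": "Company",
    "properties": {"name": "hasName", "country": "locatedIn"},
}


def rows_to_triples(csv_text, mapping):
    """Convert each CSV row into (subject, predicate, object) triples."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = mapping["subject_prefix"] + row[mapping["subject_column"]]
        triples.append((subject, "type", mapping["class"]))
        for column, prop in mapping["properties"].items():
            value = row.get(column)
            if value:  # skip empty cells rather than emit empty facts
                triples.append((subject, prop, value))
    return triples


triples = rows_to_triples(CSV_TEXT, MAPPING)
for t in triples:
    print(t)
```

Keeping the mapping as data rather than code means a new tabular source needs a new mapping, not a new program.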

Maintaining the enterprise semantic graph at scale also presents a

modeling and governance challenge. Anzo SDL must accommodate

all sources, retaining models that are both true to the data, as well as

linked and contextualized to support query and analytics across

sources. Cambridge Semantics has developed methodologies and

tooling for organizing the enterprise graph. One such methodology

is the canonical linking model – graph models that link across

sources and take on configurable characteristics of the sources.

Canonical models also maintain provenance of each source’s

contribution to the canonical representation.

The methodology allows:

A scalable Data Lake with thousands of interconnected data sets.

Multiple canonical models (“versions of the truth”) for different

business applications – democratizing the modeling


Well-described, widely reusable data sets

High performance linking and transformation at scale based on

Apache Spark technology
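One way to picture a canonical linking model with provenance is the Python sketch below. It is only an approximation of the idea under stated assumptions: the sources, the shared linking key, the per-property precedence rule and the `build_canonical` helper are hypothetical and are not drawn from Cambridge Semantics' tooling.

```python
# Illustrative only: hypothetical sources, keys and precedence rules.
SOURCE_RECORDS = [
    {"source": "crm", "key": "ACME-01", "name": "Acme Bio", "city": "Boston"},
    {"source": "news", "key": "ACME-01", "name": "ACME Biotech", "city": None},
    {"source": "crm", "key": "HELX-02", "name": "Helix Labs", "city": "Osaka"},
]


def build_canonical(records, precedence):
    """Merge records that share a linking key into canonical entities.

    precedence maps a property to an ordered list of sources, so each
    canonical model can choose which source wins for which property.
    Every chosen value keeps the name of the source it came from.
    """
    canonical = {}
    for rec in records:
        entity = canonical.setdefault(rec["key"], {})
        src = rec["source"]
        for prop, value in rec.items():
            if prop in ("source", "key") or value is None:
                continue
            order = precedence.get(prop, [])
            rank = order.index(src) if src in order else len(order)
            current = entity.get(prop)
            if current is None or rank < current["rank"]:
                entity[prop] = {"value": value, "source": src, "rank": rank}
    return canonical


# Two canonical models over the same sources, with different preferences.
sales_view = build_canonical(SOURCE_RECORDS, {"name": ["crm", "news"]})
media_view = build_canonical(SOURCE_RECORDS, {"name": ["news", "crm"]})

print(sales_view["ACME-01"]["name"])
print(media_view["ACME-01"]["name"])
```

The two views correspond to the multiple canonical models mentioned above: both are built from the same source records, and each value records which source contributed it.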

With these approaches, new data can be quickly loaded into the Data

Lake, and links can be created across sources. While IT governance is

a key element of maintaining the enterprise graph in Anzo SDL, the

model-driven tooling enables new classes of users including

business analysts and data scientists to become data stewards –

participating in the process of filling the Data Lake.

Data Discovery and Analytics

The Anzo Discovery and Analytics Servers allow users to perform

data discovery and analytics across the large enterprise graph within

Anzo Smart Data Lake. The earlier mentioned Smart Data Lake

Server allows analysts to discover data sets across the enterprise

graph and combine them for interactive analytics in the Anzo

Discovery and Analytics servers. Analytics servers and cluster nodes

may be spun-up and down based on user demand.



A key module of the Anzo Discovery and Analytics servers is Anzo

on the Web, where users can configure search and visualization dashboards

with valuable views, analytics and insights. These configurations are

maintained in the analytics servers while active, but stored centrally

in the catalog for sharing and collaboration.

The Anzo Graph Query Engine is the key element of scale in the

Anzo Smart Data Lake. Based on elastic clustered, in-memory

computing, this component offers interactive ad hoc query and

analytics on datasets with billions of triples. With this powerful layer

over the RDF storage, end users can effect powerful analytic

workflows in a self-service manner.

In a browser-like web interface, the Smart Data Lake catalog can show not only the typical ways different data sets can be linked and joined or are conceptually connected, but can also recommend other datasets or dashboards that you haven’t considered.

When data or a dashboard is selected, the in-memory graph

processing engine is loaded, currently reading in up to six million

“triples” or facts per second from the Anzo Smart Data Lake into

vast in-memory graphs that contain billions of facts available to be

simultaneously queried.

Once loaded, the data in the in-memory graph engine can be

interactively analyzed and traversed in any direction because of the

support for blazingly fast pipelines including numerous joins. That

would be near impossible in a relational database without a great

deal of prior schema structuring and query preparation.

This clean process of discovering and combining data analytics is

near instantaneous when compared with other Data Lake

approaches that require tedious mixing and matching of unprepared

and unlinked data sets for use in BI tools.


Anzo Smart Data Lake provides a unique, graph-aware data

discovery and analytics experience, enabling users to quickly drill

down and analyze large, combined data sets. Results of this analysis

can be visualized and displayed within Anzo on the Web or

exported on-the-fly into external BI and reporting tools using open

protocols including OData and SPARQL.

Anzo Smart Data Lake - Time to Value

The driving force behind enterprise data analytics is the desire to

obtain valuable insights more quickly from large, diverse data sets.

IT groups are now facing a trade-off. The data warehouse has a

lengthy initial implementation, and its lack of flexibility means new

questions cannot be quickly asked nor new insights quickly

discovered. The conventional Data Lake can be deployed quickly,

but the savings in data preparation and modeling are dearly paid for

later when analysts and data scientists approach the system to ask

questions and analyze data - finding they have significant work to

do. Well-conceived and constructed Hadoop-based point solutions

offer a middle ground, but on the same value curve.

The Anzo Smart Data Lake, by introducing a simple, graph-based

data representation, transcends this trade-off curve. Because RDF is a

“lossless” data representation, full data sets need only be loaded

once, regardless of anticipated (or unanticipated) use. For this one-

time cost to load data into the RDF graph representation, data

scientists enjoy self-service, on-demand, immediate reuse and

combination of data for any set of questions or analysis.

The big question is then, how high is the one-time cost of data

modeling and data ingestion? While RDF itself is simple, building and

maintaining an enterprise-scale RDF graph does take effort.

Fortunately, Cambridge Semantics has hundreds of man-years of

research, engineering and field experience creating and linking RDF

from diverse sources. Not only Cambridge Semantics’ teams, but

also our customers and partners are able to use our tools and



methodologies to quickly load data into Anzo Smart Data Lakes and

reap near immediate value.

Governance

The Anzo Smart Data Lake, a disruptive capability that allows

groups to combine and query data from across the enterprise using

ad hoc models, requires organizations to reconsider governance

from a new perspective. A careful program of flexibility and reuse

balanced with methodology and controls will ensure that access

control, security, full data lineage or provenance and data context

are all preserved.

The tooling and methodologies within the Anzo Smart Data

Integration toolkit were designed for this type of governance,

ensuring that proper modeling and linking practices are preserved

without limiting the expressivity of the models. Mappings to source

systems and linkages between data sets are created with provenance

for trust and traceability.

Anzo Smart Data Lake offers a platform on which organization-specific

policies can be layered with appropriate roles for

stewardship and review. Analysts and data scientists rapidly

uncover insights and decision makers have confidence in those

insights – the crucial last step in the realization of value.


Industry Perspective

Companies in the Pharmaceutical, Life Sciences, Financial Services and Retail industries, as well as Government Agencies, are seeking ways to make the full extent of their data more insightful, valuable and actionable.

The following Pharma and Financial Services examples are related to

two different markets noted for complex data requirements.

Pharma

The data problems in Pharma range from the traditional - sales

forecasting and supply chain management, to the deeply scientific –

genome sequencing and assay result analysis. While practitioners on

the ends of this spectrum will find value with the graph-based

approach of Anzo Smart Data Lake as the technology proliferates

within their respective organizations, they are not the early adopters.

As the focus of big pharma has shifted from laboratory and basic

research to strategic partnerships, clinical development and medical

relationships, it is the knowledge management groups who mix

science with business within Pharma R&D who have been first to

adopt the approach. Tasked with combining and analyzing scientifically

rich data and presenting the results for making critical business

decisions, these non-technical, bench-turned-data scientists have

been winning with Anzo Smart Data Platform from its earliest days.

As the size and complexity of data sources have grown, these

customers represent the key drivers behind the scale of Anzo Smart

Data Lake.

Pharma R&D Intelligence

Competitive intelligence professionals combine internal and external

data of all formats to support strategic decisions around licensing IP,

partnering and running clinical trials. The data sources are large and

diverse and rely on accurate linkages using deep taxonomies.



Canonical linking data sets provide the backbone for scaling the

complex models inherent in these solutions. Approaches that do not

support text analytics to combine structured and unstructured data

are ineffective in this space.

Clinical Data Integration

Clinicians and the data scientists who support them require flexible

access to data sets across clinical trials. These users have found

significant value in graph-aware analytics, allowing them to navigate

the wide and complex clinical data models to answer ad hoc

questions without manual data preparation or IT intervention.

Groups are further integrating real-world patient data to assess the

value and success of clinical trials.


Financial Services

Compliance

In the financial services sector, multiple billions of dollars are at

stake for those firms who are unable to effectively manage risk and

compliance to catch wrongdoing early.

Specifically, identifying the potential for misuse of material non-

public information can be extremely difficult. Emails, messages,

trades and the people making them need to be looked at in a holistic

manner. Links and relationships need to be examined in detail, no

matter what the source is. For compliance officers and analysts,

identifying and exploring these relationships is a crucial

component of understanding what, how, why and when information

is shared and whether it is compliant or not.

To magnify the problem, the regulations for compliance are a

moving target making flexibility and ad hoc analytics an essential

feature of any solution.

Using an Anzo Smart Data Lake, Cambridge Semantics and its

partners have developed an investigative approach based on

combining disparate data sources in an interactive model that allows

compliance officers to investigate potential violations. Account activity, web logs, email, phone archives, IM communications and other sources can be linked to uncover potential violations of regulatory requirements as well as of internal policies and procedures. Should regulations change, compliance workers can quickly change the point of attack within the data, without rebuilding.

Visit our website to download the IDC buyer case study

“PricewaterhouseCoopers Helps Clients Manage Financial Risk and

Compliance with Cambridge Semantics’ Anzo Smart Data Platform”


Conclusion

With Anzo Smart Data Lake, the game has changed. IT groups no

longer have to compromise between a data warehouse and data

swamp and the business is able to arrive at insights faster than

anyone believed possible. High performance graph query

technology has unlocked the Anzo Smart Data Platform’s innate

ability to deliver on this promise.

Using the graph-aware tools in Anzo SDP for analytics, ETL, ELT

and graph modeling, our customers work faster and cheaper,

with more flexibility and greater accuracy. The Anzo Smart

Data Lake delivers unprecedented data value, turning data assets

into extreme insight and competitive advantage.


To Learn More

Contact Cambridge Semantics:

[email protected]

http://www.cambridgesemantics.com/

About Cambridge Semantics

Cambridge Semantics Inc., The Smart Data Company™, is an enterprise analytics and data management software company. Our software, the Anzo Smart Data Platform™, allows IT departments and their business users to semantically link, analyze and manage diverse data, whether internal or external, structured or unstructured, with speed, at big data scale and at a fraction of the implementation cost of traditional approaches.

The company is based in Boston, Massachusetts.

For more information visit www.cambridgesemantics.com or follow us on Facebook, LinkedIn and Twitter: @CamSemantics

© Copyright 2015, Cambridge Semantics. All rights reserved.
