
Overview and Architecture Recommendation

Purpose-Built Data Architecture for the Industrial Internet


Contents

Executive Summary
Experience Matters in Industrial "Big Data"
Data Users
Recommended Architecture
Understanding Hadoop and its Limitations
Data Ingestion with MIx Core
Semantic Data Modeling, Data Mapping and Data Federation
Shared Model
Collaborative Design and Model Extensions
Management and Use of the Semantic Data Model
Purpose-built for the Industrial Internet
Collaborative Processing
Integration with Hadoop Infrastructure
Connectors
Integration with HBase
Industrial Data Modeling and Access Methods
Data Mapping
Searching and Fishing
Third Party Application Access
Accessing Data
Conclusion


Executive Summary

Industrial environments, such as those found within large electric and gas utilities where Bit Stew has hardened its purpose-built technology, are producing massive volumes of data in real-time that are overwhelming the infrastructure and traditional ICT architectures. Furthermore, the variety of operational information received is orders of magnitude more complex than what is typically found in ICT environments. With an ever-increasing number of connected devices, sensors and network systems, companies operating in industrial environments often have millions of unstructured data streams from disparate sources, trapped in siloed legacy systems, all in different formats. The data can be a valuable resource for driving operational efficiency and coping with transformational changes, but only if it can be transformed into actionable insights and meaningful intelligence.

For example, utilities are deploying increasingly cost-effective sensor technology on the distribution grid to capture data on voltage, power quality, outages, non-technical losses and other key aspects. However, without the ability to integrate the raw data into a common model and gain a holistic view of the big picture and potential risks in real-time, utilities will continue to struggle to achieve operational and business benefits from the sensor technology and the data integration and management architectures they implement. Traditional data warehousing models, open-source alternatives like Apache Hadoop® and Storm®, and Enterprise Service Bus architectures fail to provide industrial businesses with the real-time access and awareness they need to improve decision-making, identify operational efficiencies or proactively address risks.

The ability to derive actionable insight from large volumes of unstructured, disparate data improves asset performance, lowers operational costs and allows industrial companies to improve productivity and process reliability. As an example, managing low voltage networks for electric utilities is becoming increasingly important as energy efficiency takes center stage and distributed energy resources come online. This is a new operating paradigm for utilities, and visionaries are looking for operational intelligence that can support new business models, decrease operating costs and, in some cases, tap into additional revenue streams. In this scenario, the low voltage network has an upstream impact on medium voltage and high voltage transmission. In the modern world, the electrical grid is no longer isolated or top-down; it is now an interdependent mesh where utilities and regulators must rely on sensors and controls to create an energy balance. This persistent need for contextual visibility across complex industrial operations and interconnected assets is also transforming other industries such as oil and gas, manufacturing, power generation and many others.

Managing extremely large data volumes and data velocities within the industrial environment is an arduous and complicated challenge.


Operational intelligence requires data, which comes from sensor-equipped assets. For utilities, a variety of sensors is required to track power quality, voltage, energy delivered, energy received, reactive power, temperature, acoustics, video, radar and many other types of measurements that can provide the necessary insight. The challenge faced by utilities and other industries is how to leverage all the data, place it in context with other data, and then create intelligent information for operations. It is also important to note that information demands change between normal operating states and abnormal operating states. An abnormal operating state requires more "situational intelligence" that allows operators to respond appropriately and effectively.

Solving this data challenge requires a new thought process and approach within organizations, including a new purpose-built architecture for industrial infrastructure, communications and storage. Although the industry has coined the term "Big Data" to refer to the broad category of massive data volumes and the technologies in the market designed to solve the problem, the issues facing industrial customers are far more intricate.

“Solving this data challenge requires a new thought process and approach within the organizations including a new purpose-built architecture for industrial infrastructure, communications and storage.”

First, industrial organizations deal with much greater volumes of data than non-industrial sectors; they often have numerous, siloed legacy systems producing data in different formats and protocols which do not communicate with each other. Additionally, due to the critical nature of industrial processes, operators need real-time visibility and situational awareness of their entire operations in order to identify and proactively prevent risks, manage event alarms, and ensure uptime. Traditional data warehousing models or relying solely on technologies built on open-source alternatives are not able to support the unique needs of the industrial sector.

This document provides a recommended architecture for dealing with industrial data at volume, at speed and at variety. It is important to note that the variety of data used in the industrial sectors is far greater than non-industrial sectors, especially with the advent of so many sensors attached to physical assets and returning data such as time series, events, channels, registers, wave forms and others, in combination with additional data sources such as transactions, environmental risks and others.


Experience Matters in Industrial "Big Data"

Utilities have been involved in the big data space for many years and have attempted to solve the challenge with a variety of approaches that include:

1. RDBMS tactical databases for storage and analysis
2. NoSQL data stores for scalability, unstructured data support and data analysis
3. Data appliances for high performance and high volume data storage and data analysis
4. Data warehouses for sole source access to data and analytic models
5. Data lakes for scalable, structured, unstructured, raw and processed data storage and data analysis

Recent technologies involving data appliances, NoSQL and data lakes have had some limited success in typical IT applications. However, on the industrial side, success has rarely been achieved due to the velocity, volume and variety challenges associated with operational technology (OT) data. The "consumerization" of industrial technology requires new architectures and a new approach to data stores and data analytics that none of the aforementioned approaches addresses individually. The following sub-sections describe real-world use cases where utilities have taken certain approaches to big data and analytics. In all of these cases there have been significant challenges and lessons learned.

Case 1: Large Canadian Utility and Data Appliance

A large Canadian utility focused their big data and analytics effort on a third party commercial data appliance technology. The solution also involved several technologies that tied into the data appliance for statistical analysis and modeling. The primary purpose of the data appliance approach was to create a central solution for capturing essential operational data and supporting advanced analytics: determining non-technical line loss, performing phase determination, improving the resolution of the connectivity model, optimizing field crews during storms and outage situations, and improving load forecasting metrics.

• Endeavored to use an appliance solution that was based on RDBMS technology adapted to run on an HDFS infrastructure for scalability
• Performance issues were noted in the overall solution, including application-level use of the underlying data store. These issues were associated with the applications, modeling, data structures and translation from RDBMS to the underlying HDFS (map-reduce)
• Because operational data arrived from multiple sources in different formats, analytics did not perform and were severely limited with multi-user access on the system
• Data modeling was constrained to RDBMS structures in support of the database technology, forcing the utility to consider different storage types such as HStore

Due to the technology's limitations – its inability to effectively integrate different data types from numerous sources for real-time decision making – this utility is now considering a move to a Data Lake architecture with a Bit Stew recommended approach to modeling and management of data. The utility's goals, such as being able to optimize field crews during storms and outages, require real-time access and intelligence that simply cannot be provided by the data appliance approach.


Case 2: Large US West Coast Utility and Data Lake

This large utility on the West Coast of the United States took a leading-edge approach to big data and analytics by implementing a Data Lake on the Hadoop platform. This approach was unproven in the utility space but based on Hadoop technology that was widely tested in IT environments in non-industrial sectors such as financial services, advertising and e-commerce. The primary purpose of the Data Lake architecture was to centralize the operational data store and drive more insight through analytics—similar to the previous customer use case. This utility had an immediate focus on eliminating the data noise and improving internal processes for exception management. Furthermore, the utility wanted to create a data solution that allowed operators and analysts to browse through massive data sets and “go fishing” for business insights.

• Sought a scalable solution for native storage of a wide variety of operational data, including time series information that was stored in PI
• Had struggled with scalability and performance using other RDBMS technologies
• The Hadoop Data Lake solution performed well for raw storage of data using HDFS; however, that was only basic raw storage and did not include any data processing or data structuring. Storing raw data in Data Lakes or data warehouses does not enable operators and analysts to gain real-time business insights, which is what this utility desired
• Performance issues were noted when data structuring and processing logic was added to the data sources, both in batch and real-time
• Several management and process issues were noted when using schema-less storage and handling of the structured/unstructured data
• Data integrity and completeness issues were also noted with SQL-like query translation from source into target within HDFS (i.e. the core data handling layers)
• Storage of PI time series information proved problematic depending on the data structuring technology used, and the decision was made to keep that data in PI
• The data ingestion process and data quality remained a consistent issue due to the immense volume of data generated in the utility's operations. Though the Hadoop Data Lake architecture had been widely tested in other industries, most other business sectors simply do not deal with the volume and scale of data that industrial organizations generate

Due to the numerous issues noted with the Hadoop infrastructure, the customer team decided to change to a Bit Stew recommended approach for integration. This approach moved the utility from relying on the native Hadoop ingestion methods to having the Bit Stew MIx Core™ platform perform data ingestion.

This was essentially a move from HIVE to MIx Core for data ingestion and intelligent population of the underlying data store. Because MIx Core can be integrated with the Hadoop architecture, HIVE/HBASE can be intelligently populated by MIx Core so the data is not stored in raw, unstructured formats but, rather, has already had correlation and complex processing logic applied to it. The utility can then use HIVE/HBASE tools to access the intelligently populated data for queries, enabling them to identify more valuable business insights.


Case 3: Large International Utility

This large Australasian utility relied heavily on an Oracle data warehouse as well as several Oracle RDBMS tactical databases for storage of grid and smart grid data, such as channel and register reading information from smart meters. Additionally, critical vendor solutions relied on Oracle RDBMS databases for storage of their source information. This utility had created a data warehouse and tactical database environment for handling the massive amount of information received from its advanced metering infrastructure and was primarily concerned with energy use analytics and supporting a retail market. The data processing functions proved critical to the utility and the retail market, especially around data availability, timeliness and quality. Handling data exceptions was a key focus for the management team.

• Significant performance and scalability issues were noted across both the data warehouse and the tactical databases
• Vendor RDBMS solutions also exhibited significant performance issues due to the massive data load, and access was restricted to minimize disruption to the core business
• Data synchronization between source systems, the data warehouse and the tactical databases remained a problem due to the disk and performance demands of the ETL processes
• The tactical databases were introduced due to limitations in the data warehouse schema for performing analytics required by the business
• The data solution was unable to effectively scale and deal with time series information, channel (i.e. interval) data and register data
• Data scans for analytics proved expensive and extremely slow
• Multi-user access to data was severely restricted due to performance issues

This utility is now planning a move to a Data Lake architecture with MIx Core as a primary platform for analytics and data indexing because it will enable them to perform the real-time energy use analytics and complex event processing they need to support the retail market. The Bit Stew solution also eliminates the challenges the utility faces around data synchronization between source systems due to its ability to integrate data from disparate sources into a common data model.


Case 4: Large US East Coast Utility and Operational Data Store

This large East Coast utility in the United States was focused on an operational data store technology that involved OSIsoft PI. The goal was to provide a common and highly scalable solution for storing operational data associated with their smart grid/smart metering project, such as real-time asset performance and power quality events. As with other utilities, there were many business requirements for centralizing information around an operational data store. This utility's initial focus was on non-technical line loss.

It is worth noting that performance and scalability are separate architectural and technical concerns: performance does not guarantee scalability, and scalability does not guarantee performance.

• The utility currently uses both Oracle and PostgreSQL RDBMS solutions for storage of OT data, including channel and register data from smart meters
• OSIsoft PI is used for time series information received from SCADA systems
• The utility is considering an architecture based on the concept of an Operational Data Store for data types such as time series, channels, registers, events, assets, usage points and other types of OT data
• The decision to move into production has been repeatedly pushed out several years due to concerns about the expense associated with the ODS approach as well as performance
• Scalability issues have not been addressed and remain unproven

This utility is currently looking forward to exercising the full capabilities of the MIx Core NoSQL solution.

In all the use cases described above, the Hadoop Data Lake or other data integration and warehousing solutions used at these utilities were not able to effectively model such a wide variety and massive scale of operational data elements generated by the businesses. In each of these use cases, the utilities have decided to deploy an architecture recommended by Bit Stew, with the use of the MIx Core platform and MIx Director application for modeling and real-time analytics capabilities.


Data Users

Knowing who will be using the data and which systems will access the data is critical to the overall design of the architecture and will impact scalability and performance.

Designing an approach to Big Data must account for the primary users, which include not only data scientists but operators and applications as well. Not all interfaces to the data will be the same, and the solution must support programmatic interfaces (e.g. web services) as well as human-to-machine (HMI) interfaces. When designed properly, data integration and management architectures can serve as a bridge between the information technology (IT) and operational technology (OT) functions by providing accurate, reliable data across an enterprise so operators and engineers have the information and situational awareness they need to do their jobs effectively and make operational improvements.

On the HMI side, operators will require different types of access than data scientists. Operators are performing forensic analysis of information based on issues they face on a day-to-day basis. This is in contrast to data scientists, who are seeking patterns in data and formulating models that can be used for simulations, predictions and other complex analyses. The methods for forensics and science are different and typically require different tools for looking at the data. Furthermore, applications that rely on the data for operational analysis or cause-and-effect analysis use different methods again. For this reason, it is important to design a Big Data architecture with flexibility, providing different users access to the data they need in order to derive the intelligence and insights required for their distinct use cases, whether as IT functions or OT functions.

Queries, searches, analysis and examination of the data will vary depending on the user, use case, business process or system.

Recommended Architecture

From an architecture perspective, the objective is not to paint the business into a corner with preconceived notions of how the solution will be used. Similarly, the objective is not to design a solution based purely on current [perceived] requirements; the solution must be able to evolve to allow for new use cases and future requirements. Best practice is to design an architecture that considers future-proofing for functional scalability as well as system performance and system scalability. Functional scalability is a key concept in data platform designs.

The recommendation for industrial environments is to design and implement a comprehensive, purpose-built architecture that incorporates data stores, data indexing and data analytics. Considerations that went into this recommendation include:

• Primary purpose is storage of data for indexing and analysis, with the ability to provide real-time access and a comprehensive view of critical information so operators can derive actionable intelligence from the data
• Data to be stored will be enormous and contain a high variety of data types, data structures and data relationships
  • Data types are significantly different between IT and OT systems, and the architecture must account for time series, wave forms, images, audio, binary scans, binary tables, telemetry, transactions, static content and other types when dealing with both IT and OT
• Up-front analysis and modeling will be challenging and in most cases incorrect; the data architecture is therefore designed to be adaptive and allow for new approaches
• Data will be provided by source systems; the source systems are the systems of record for the data they generate and retain control over their stored data
  • Therefore, data stored in a data lake (or other central system) is primarily for the purposes of broad-based analysis
  • MIx Core is also a system of record: it generates metadata, information based on analysis, processed information, patterns and other information for which MIx Core remains the system of record


• Data storage/processing decisions will need to account for flexibility and allow for reconfiguration of data and indexes to meet ongoing business requirements, all at massive volume and high velocity
• Traditional database designs that force a particular approach, model or data types are well suited to application requirements but are not well suited to analytics
  • Traditional database designs and data lake architectures that rely on open source tools like Hadoop are especially ill-suited for analytics at the immense scale required in industrial operations, which generate much greater volumes and types of operational data than other business sectors

Enterprise data warehouse solutions were created to support many of the same objectives, including scalable analysis of data, but have traditionally focused on data received from standard ICT applications such as transactional systems or historical data repositories. Enterprise data warehouse appliances and other big-iron solutions have created scalable, high performance solutions based on a traditional ICT view of data. However, experience with these solutions has demonstrated limited support for the dynamic data models, dynamic analysis and on-demand data analysis expected of modern designs based on "elastic" scalability. Furthermore, traditional enterprise data warehouse approaches have not accounted for the complexity, velocity and variety of data received from operational technology such as industrial sensors, nor for the demands of analytics that involve complex OT/IT correlations, cascading impacts and other advanced heuristics.


The following diagram illustrates the recommended architecture for MIx Core layered on top of a Hadoop-based data lake.

[Figure: MIx Core™ and Data Lake Solution. Source data (reads and events; customers, work orders and locations; grid/enterprise data and models; market census data) flows through third party adapters, data mapping and a canonical preprocessor into the Federated & Standardized Information Indexing & Correlation Engine, which draws on shared knowledge artifacts (schemas, rules, dictionary, registry, specifications, metadata, templates, patterns) and is surrounded by the Visualization Framework, Knowledge Framework, Integration Framework and Analytics Framework, the latter providing real-time analytics, a complex event processor, and interactive analysis, algorithms and methods.]

Key Architecture Points

Within the architecture, Bit Stew Systems' MIx Core, together with MIx Director, provides key framework elements for Integration, Analytics, Knowledge and Visualization, as well as a critically important information indexing layer that uses a powerful Semantic Model. The base of the architecture is a Hadoop infrastructure that provides flexibility and scalability for raw and processed data storage. This layer includes an HDFS platform for scaling data storage, as well as data layers such as HBASE and HIVE for structuring and organizing information.

• The Hadoop infrastructure is the base solution for the Data Lake
  • Provides proven scalability, with raw storage of data and flexible layers of structure and data typing on top of the raw storage
  • HBASE is recommended for its wide-column, massive-scale data storage and analysis capabilities, but care is needed with the HBASE data structure and the approach to data retrieval. It is best to use the MIx Core data ingestion method outlined below, along with the powerful indexing methods provided by the MIx Core platform, for fast data retrieval
  • HIVE can be layered in separately, as needed, for more common SQL-like access to data for relational purposes
• The Federated and Standardized Information Indexing layer is provided by MIx Core and utilizes a highly distributed, highly scalable indexing approach based on Elasticsearch and Lucene
  • The indexing allows for rapid access to data within the data lake and is mandatory for fast access during analytics and operations. Without an indexing method, the data lake would rely on limited primary keys or full data scans for data retrieval
  • Beyond the benefits of information indexing, this layer provides an extremely powerful Semantic Data Model that is based on industry standards and normalizes the information for analysis and retrieval (see the Semantic Data Model section for more information)


• The MIx Core platform leverages the Semantic Data Model to efficiently structure the indexes and create a dynamic, adaptive approach to information indexing
  • Indices will vary and change as necessary; this is easily accommodated because indexing is a layer independent of the data storage. Re-indexing is a process that can be layered in as business requirements evolve, rather than brute-forced after the fact
• All of the data mapping, modeling and ingestion methods are based on common semantic models that are managed by the MIx Core
  • Changes and additions to the model are easily managed
• Common data services are also provided by the Federated and Standardized Information Indexing & Correlation Engine and are important for consistent data access
  • Data Services provide access to data using correlation and aggregation, as well as allowing for unstructured "fishing" within the data lake
  • Ad hoc "fishing" activities on the data have to be more intuitive for operators, but will impose a learning curve for complex analytics
  • Data Services are an essential part of the overall architecture and design, as they dramatically impact performance and scalability. Data Services provide common access to data, and continuous access to common data, by leveraging "warm-up" routines and caching
  • Data Services are responsible for putting processed data back into the lake. They can process or pre-process data into correlations, aggregations and calculations that are stored and retrieved as necessary; this is a design goal of CODECs within the MIx Core technology
  • Note that Data Services can provide access to 'raw' data and allow for 'raw' queries
• Query interfaces will be, and must be, different to allow for different data types, different types of correlation and analysis, and different types of models
  • Query interfaces must also differ in order to leverage the scalability and performance characteristics of the data infrastructure
  • SQL was not designed for NoSQL, and translating from one to the other is not obvious; it is necessary to learn new techniques for NoSQL analytics
  • SQL concepts such as normalization, joins and aggregations differ significantly in NoSQL. In some cases it is more beneficial to de-normalize rather than normalize a set of data (see the sketch after this list); management of de-normalized data therefore becomes critical
  • In some cases, models will need to be re-written
  • SQL-to-NoSQL translators such as HIVE will have issues, but may serve well in a limited capacity.
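The de-normalization point above is worth illustrating. The following sketch contrasts a normalized SQL-style layout with a de-normalized NoSQL document; the table and field names are hypothetical.

```python
# Normalized layout: meter attributes live in one "table", readings in
# another, and every query pays for the join.
meters = {"MTR-001042": {"feeder": "F-17", "phase": "B"}}
readings = [("MTR-001042", "2020-05-20T12:00:00Z", 239.7)]

# De-normalized layout: one self-contained document per reading, ready to
# index. The join is paid once at write time instead of on every query,
# but updates to meter attributes must now touch every affected document.
documents = [
    {
        "meter_id": meter_id,
        "timestamp": ts,
        "voltage": voltage,
        **meters[meter_id],  # fold the meter attributes into the reading
    }
    for (meter_id, ts, voltage) in readings
]
print(documents[0])
```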


Understanding Hadoop and its Limitations

As outlined in the previous section, Hadoop is one of the recommended layers for handling massive volumes of data, as it is based on proven scalability within certain industries and use cases, although, to date, industrial use cases have not yet been thoroughly investigated.

This is a fundamental layer for handling large-scale data requests and is based on a map-reduce concept where data is parsed out as small "chunks" that are then load-distributed across multiple systems and disks. This approach to breaking the data down into small, easily managed chunks (i.e. jobs) allows for effective distribution of the data load. The Hadoop platform can then easily farm out the chunks as map-reduce jobs and efficiently scale by improving the I/O capacity of supporting systems and disks.

However, the complexity with Hadoop lies in understanding how to create map-reduce jobs that not only optimize the I/O capacity of the systems but also achieve the scalability and performance requirements for runtime queries. Getting data back to the operator efficiently is necessary but not trivial; it is typically left to the upper data layers to translate queries into map-reduce jobs for storing and then querying data. This step requires advance forethought and design for true big data analytics that delivers real-time or near real-time performance, and is an absolute necessity for moving toward the advantages of future technologies and realizing software-defined operations.

The upper data layers of the Hadoop platform include technologies such as HBASE and HIVE. HBASE is a technology designed for large data sets and is based on a wide-column structure for handling billions of rows and millions of columns. Unlike typical RDBMS systems such as Oracle, HBASE is a NoSQL solution and requires a learning curve for setup and query development, but it does provide for analytics at scale. HIVE, on the other hand, is a SQL-like technology that translates between traditional SQL queries and the underlying map-reduce jobs, thereby allowing access via a familiar language and RDBMS-style storage and queries.

Layering data technologies such as HBASE and HIVE on top of Hadoop is important for structuring, querying and simplifying the complexity involved with map-reduce. However, it is important to understand the underlying architecture of Hadoop and the map-reduce work involved in chunking data and retrieving it for analytics. Hadoop is inherently a batch-oriented system and operates in the background to initiate and manage map-reduce jobs. It is important to recognize the limitations of a batch-oriented system, particularly when the needs increasingly mandate real-time operations. Batch jobs are used to scan data that matches the query, return distributed results and allow a process to feed the results back to the calling system such as HBASE. The underlying batch process can deliver scalability (in incremental steps, each with its own process) but does not necessarily offer the real-time results and performance required for guaranteed results.

For real-time data processing and analytics, Hadoop offers technologies such as Apache Storm that provide streaming capability for ingesting data into the Hadoop system. However, Storm is a framework that exposes hooks in the data stream for developers to access and process data in near real-time; it therefore requires development for data handling and processing, development that takes time and resources to build, test and deploy for each and every process. This additional provisioning becomes particularly burdensome in attempts to scale and automate future processes. It is also important to note that real-time processing in the big data sense is not necessarily processing data in accordance with operational requirements within the industrial environment. For example, real-time processing of events and steering controls within a utility's distribution or transmission environment must be based on sub-second analysis of the events, data and conditions. This level of real-time or near real-time performance eliminates the possibility of consuming large volumes of data into a data lake and then performing scan analysis of the data. Eliminating the consume-and-query model also more effectively addresses real-world challenges such as skills and personnel shortages and already overtaxed experts. Scenarios such as this require an approach similar to complex event processing and streaming analysis, areas where MIx Core was designed to leverage high-performance adaptive stream computing and machine learning.

In basic terms, Hadoop is a good data infrastructure component for handling the I/O requirements of massive data volumes and the demands of large data scans, but it falls short compared to a data architecture that can make complex decisions without reliance on a centralized data lake and can make real-time decisions from processes running at the very edge of the network. This reduces the reliance of Operational Technology upon Information Technology and puts intelligent, actionable decisions directly into the hands of the operator.
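Because the map-reduce model comes up repeatedly above, a minimal sketch may help. The following Hadoop Streaming job in Python computes the peak voltage per meter; the tab-separated input format (meter_id, timestamp, voltage) is an assumption for illustration, and a real deployment would submit the script via the hadoop-streaming jar.

```python
#!/usr/bin/env python3
# job.py -- run as "job.py map" for the mapper, "job.py reduce" for the reducer
import sys

def mapper():
    # Each mapper instance receives one chunk (input split) on stdin
    for line in sys.stdin:
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3:
            meter_id, _timestamp, voltage = parts
            print(f"{meter_id}\t{voltage}")

def reducer():
    # Hadoop sorts mapper output by key, so one meter's readings arrive together
    current, peak = None, float("-inf")
    for line in sys.stdin:
        meter_id, voltage = line.rstrip("\n").split("\t")
        if meter_id != current:
            if current is not None:
                print(f"{current}\t{peak}")
            current, peak = meter_id, float("-inf")
        peak = max(peak, float(voltage))
    if current is not None:
        print(f"{current}\t{peak}")

if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

The batch character described above is visible here: nothing is returned until the scan over every matching chunk completes, which is precisely why this model alone cannot deliver the sub-second event handling discussed in this section.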


Data Ingestion with MIx Core

The recommended process for ingesting data into a Hadoop infrastructure is to leverage the MIx Core capabilities for semantic modeling and fast data integration.

This not only ensures that the data lake is populated intelligently, but also allows for changes and adaptations at a later time due to the data management capabilities of the MIx Core. Many aspects of this approach can benefit industrial utilities:

• Embedded directly in Hadoop for high performance
• Ability to process business rules at speed for data quality checks, data transformation and other data processing requirements
• Semantic modeling and mapping of information from source to target, handling normalization and de-normalization activities automatically

The diagram below illustrates the process flow for ingesting data into the data lake using the MIx Core. In this case, MIx Core provides dynamic configurations to an adapter that runs directly on the Hadoop infrastructure, and the adapter simultaneously populates the data lake and the index.

[Figure: Data Extract, Load and Transform Process, spanning the source system, MIx Core and HDFS. An Adapter pulls CSV extracts from the source system into an HDFS landing zone, applies the mapping and rule definitions supplied by MIx Core to transform the data, reports status and errors back to MIx Core, and loads the result into HDFS/HBASE while an Index Configurator populates the index.]

Note that this diagram shows batch processing of data loads (using CSV files as an example); real-time processing would follow a similar path but would avoid the file and landing zone steps.
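The flow in the diagram reduces to a small sketch: pull a file from the landing zone, apply mapping and rule definitions, then load the store and populate the index in the same pass. The field map and rule below are illustrative assumptions, not the MIx Core implementation.

```python
import csv

# Mapping definition: source CSV column -> canonical field (hypothetical)
FIELD_MAP = {"mtr": "meter_id", "ts": "timestamp", "v": "voltage"}

def transform(row):
    doc = {canonical: row[source] for source, canonical in FIELD_MAP.items()}
    # Example business rule applied at ingestion time: reject bad sensor values
    if not 0 < float(doc["voltage"]) < 1000:
        raise ValueError(f"voltage out of range: {doc}")
    return doc

def ingest(csv_path, store, index):
    """store/index are callables, e.g. an HBase put and an index call."""
    with open(csv_path, newline="") as handle:
        for row in csv.DictReader(handle):
            try:
                doc = transform(row)
            except ValueError as err:
                print("status/error:", err)  # reported back, as in the diagram
                continue
            store(doc)  # populate the data lake
            index(doc)  # populate the index in the same pass
```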


Semantic Data Modeling, Data Mapping and Data Federation

One of the primary elements in the overall design of the MIx Core technology is a Semantic Data Model, a fundamental component in the machine's understanding of data.

Without a Semantic Data Model there is little a machine can use to baseline the data it receives, and it becomes reliant on human interpretation. The human element in this circumstance can not only be inconsistent but also a real, time-consuming hindrance when the ultimate goal for these processes should be increased automation resulting in more efficient operations. The Semantic Data Model thus creates consistency in the interpretation of data and thereby drives consistency throughout all processes and functions that depend on the data. The MIx Core has a rich Semantic Data Model designed around industrial use cases, while providing the basis for highly customizable use cases for unique deployments within a specific enterprise. The structure and management of the Semantic Data Model is critical within the MIx Core, even to the extent of allowing dynamic, on-demand extensions to the model based on machine learning or supervised learning by an operator.

The MIx Core allows the Semantic Data Model to be federated across instances on an as-needed and as-authorized basis, as well as shared, communicated, collaborated on and distributed across use cases. The Semantic Model Management functions are embedded directly within the MIx Core and form a powerful method for machine intelligence. On top of the Semantic Data Model is a Data Mapping component that, as part of the machine's knowledge repository, creates an intelligent mapping between source systems and the Semantic Data Model, enabling current and future models.

The mapping is also dynamic: it covers not only the system connectivity aspects of integration but also the data and, more importantly, the informational aspects of integration. In this way, the MIx Core handles information integration to ensure common understanding in a federated environment, rather than just technical and system-level connectivity.


Shared Model

The MIx Core Semantic Data Model is easily shared with users, applications and external systems through the web-based UI as well as through programmatic interfaces. This not only allows customers to interact with the Semantic Data Model, but also allows it to be consumed by other enterprise applications and leveraged for enterprise-wide designs. As an example, one of the challenging aspects of Enterprise Service Bus (ESB) implementations is the "heavy lift" involved in creating and supporting canonical documents for normalized business services. With the Semantic Data Model, ESB implementations can leverage the model for automated management of canonical document structures and related schemas.

Likewise, it is possible for database systems to leverage the Semantic Data Model to create common representations of information within target systems, and to support the auto-generation of schemas as well as the auto-ingestion capabilities of the MIx Core. This is the same approach that Bit Stew leverages with the Hadoop Data Lake infrastructure.

Collaborative Design and Model Extensions

The Bit Stew Semantic Data Model is derived from the IEC CIM. This model has been adapted and extended over years of experience working with utility customers and industry experts to create a very rich and powerful model for use in production environments. The collaboration aspects are an important part of model design and are incorporated into the process and architecture followed by Bit Stew. Additionally, the MIx Core allows for customer-based extensions to the data model, which permit customization, management and ownership by customers. More importantly, the design allows for extensions that can be developed collaboratively between customers and industry experts, and allows those extensions to be iterated, validated and then shared within the community.

The conventions developed within the Semantic Data Model permit extensions within existing entities (e.g. elements and attributes) as well as the creation of new entities, and these are easily managed through naming conventions. Versioning and version control are supported concepts that allow for more sophisticated collaboration and sharing.

Management and Use of the Semantic Data Model

Customers can not only interact with the Semantic Data Model, they can also easily manage and extend the model through the web-based UI, or through the APIs and templates provided for developers and power users. The Semantic Data Model is based on a set of easily understood descriptors that can be output directly into a format familiar to customers, such as XML or JSON, allowing customers to adapt the data model for their environment.
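As a rough illustration of such descriptor output, the sketch below emits a small model fragment as both JSON and XML. The entity and attribute names are hypothetical, CIM-flavored examples, not the actual MIx Core descriptor format.

```python
import json
import xml.etree.ElementTree as ET

# A hypothetical model fragment: a Meter entity extending EndDevice
model = {
    "entity": "Meter",
    "extends": "EndDevice",
    "attributes": [
        {"name": "serialNumber", "type": "string"},
        {"name": "ratedVoltage", "type": "float", "unit": "V"},
    ],
}

# JSON form of the descriptor
print(json.dumps(model, indent=2))

# Equivalent XML form
root = ET.Element("entity", name=model["entity"], extends=model["extends"])
for attribute in model["attributes"]:
    ET.SubElement(root, "attribute", {k: str(v) for k, v in attribute.items()})
print(ET.tostring(root, encoding="unicode"))
```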



Purpose-built for the Industrial Internet

The following reference architecture illustrates how the end-to-end solution works within the industrial environment.

Other key aspects of this ecosystem drive home architectural concepts that are critical in operational environments:

• Real-time industrial data management requires more sophisticated processes, management and data handling than would be typical in a pure IT environment
• Technologies such as complex event processing, machine learning and dynamic data correlations are required as part of the architecture and solution
• Data services are also needed to support operations and the data workbenches that will be used by operators, data scientists, MIx Apps and third party applications, including external analytic packages and enterprise systems (see the Third Party Application Access section for more information)
• Analytics must be "applied", purpose-fit for requirements such as operations, forecasting, forensics and other applications
  • There are many different types of analytics and analytical models present within industrial environments, and the design allows for new analytics to be developed within MIx Apps or to leverage third party analytical engines and models via two-way communication
• Data is primarily generated by sensors and other equipment at the edge of the network, within substations or even in the field. It is critically important to design for processing intelligence right at the edge of the network to deliver on low-latency requirements as well as to reduce the noise level of the traffic that is transmitted
• The processing of data does not always happen at the data center or within a cloud
  • The architecture and implementation must allow for data processing and data handling where needed by the business; this can be in the Cloud and/or in a Private Cloud and/or in a Secure Data Center (e.g. on premise) and/or at the Edge
  • An architecture based on the above can accommodate the flexibility needed in data processing and the location of data processing, which is fundamental to federated designs


Collaborative Processing

As indicated above, the data processing design is based on federated concepts that help determine where, when and how to process data. In most cases an industrial company will select one or more of these options, and it is equally important to understand that data processing is collaborative and shared. Where an implementation involves Data Center, Cloud and Edge systems, these systems can collaborate on processing such that the Edge pre-processes and filters data to identify and generate high-value information used by a Secure Data Center (i.e. a critical infrastructure domain), and the Secure Data Center uses information shared by Cloud data services such as models and longer-term patterns.
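The edge-side half of that collaboration can be sketched generically (this is not Bit Stew's API): pre-process and filter raw readings at the edge so that only high-value events travel to the secure data center. The nominal voltage and tolerance below are illustrative assumptions.

```python
def edge_filter(readings, nominal=240.0, tolerance=0.05):
    """Yield only readings that deviate enough to be worth transmitting."""
    for reading in readings:
        deviation = abs(reading["voltage"] - nominal) / nominal
        if deviation > tolerance:
            # Enrich before forwarding so the data center receives context
            yield {**reading, "deviation": round(deviation, 4)}

raw = [{"meter_id": "MTR-001042", "voltage": v} for v in (239.8, 252.9, 224.1)]
for event in edge_filter(raw):
    print("forward to data center:", event)  # the 239.8 reading is filtered out
```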

Moreover, the users of the data will vary depending on requirements, and access is available via programmatic interfaces to third party applications and other types of systems and users.

The collaborative data interfaces are designed for two-way communication with data exposed and data received. The MIx Core handles all the data exchange aspects through a Governance Framework that includes authorization, accountability, access, management and registry services.

Integration with Hadoop Infrastructure

The Apache Hadoop infrastructure offers utilities a method for distributing the data processing load across several systems (i.e. clusters) and is becoming a popular method for massive scalability. Hadoop offers innovative techniques for load distribution, improved I/O capacity, fault management, redundancy and raw storage processing capabilities that can meet the scaling demands of large industrial companies. Combined with the many projects that build on top of the Hadoop infrastructure, utilities can create repositories and operational data stores that may scale with the massive influx of data received from smart grid networks, but there are definite challenges with scaling to an optimized level and in real time with Hadoop alone.

The MIx Core platform has been designed utilizing scaling techniques similar to those applied within Hadoop, seamlessly integrating with the Hadoop infrastructure to create massive improvements in data ingestion, data classification, analysis, dynamic event management and derived intelligence.

The platform is built to concurrently manage a network of connected devices and has proven scalability to over one billion end devices, making it an ideal platform for managing the data challenges of the Industrial Internet. MIx Core utilizes the Hadoop infrastructure as a native operational data store where information can be retrieved as well as fed back into the repository for use by other applications. MIx Core has native recognition for many of the interfaces, components and methods applied by Hadoop, including HDFS, Map Reduce, HBase, Hive, Pig and Cascading. Additionally, MIx Core offers bi-directional interfaces for indexing information contained directly within Hadoop and the many supported projects, as well as "pushing" information to the Hadoop infrastructure. MIx Director, built on the MIx Core platform, then allows operators to detect anomalies through analytics. As businesses mature in their usage of the platform, it becomes more intelligent over time by storing what it learns and evoking this knowledge through sophisticated business rules.



The powerful indexing capabilities of MIx Core offer new levels of scalability and performance across massive data sets such as:

• High performance searching and 'fishing' capabilities
• Complex aggregations and filtering
• Logical correlations across structured and unstructured data sets
• Canonical data modeling for portability and standardization
• Forensic and statistical analysis of canonical and raw data
• Powerful business rules processing

These capabilities greatly extend what is possible with raw and processed storage within the native Hadoop infrastructure. With machine-learning intelligence, industrial businesses leverage predictive analytics to automate operations and begin taking advantage of Software Defined Operations to realize value and maximize profit from their big data architecture.

Connectors

MIx Core integrates with the Hadoop infrastructure (e.g. Map Reduce, HBase, Hive, Pig, Cascading) using three primary methods:

• Streaming Connector with Canonical Mapping and Indexing: direct integration through the streaming infrastructure for canonical mapping of the Hadoop structures and indexing within MIx Core
• Streaming Connector with Automatic Indexing: similar to the first method, but the indexing is automatic based on native data models and data types. This supports broad 'fishing' across the Hadoop data repository
• Direct Connect: utilizes direct Java API and/or RESTful interfaces for ad hoc, on-demand operations with the Hadoop data repository

All methods provide two-way interfaces for pushing, pulling and receiving information to and from the Hadoop repository. One of the more common methods for integration is the Streaming Connector with Canonical Mapping and Indexing, as this provides standardized data models across all utilities and is based on IEC CIM. This method uses Integration Adapters to map the native models represented within Hadoop to the canonical model within MIx Core and fully index the information. Once indexed, this information can be rapidly searched, aggregated and correlated across all other data that is indexed or even referenced within the Hadoop repository.
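For the Direct Connect method, access could look like the following RESTful sketch. The endpoint, parameters and response shape are hypothetical illustrations; the actual MIx Core API is not documented here.

```python
import requests

response = requests.get(
    "https://mixcore.example.com/api/query",       # hypothetical endpoint
    params={"entity": "Meter", "feeder": "F-17"},  # hypothetical parameters
    timeout=30,
)
response.raise_for_status()

for record in response.json():  # assumes a JSON array of canonical records
    print(record)
```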

Integration with HBase

Direct integration with HBase, including HBase in the Hadoop infrastructure, is available using the same three methods mentioned previously. MIx Core provides an HBase Streaming Connector and performs automatic or canonical model indexing.

The MIx Core indexing on top of an HBase implementation offers extremely fast and powerful searching, aggregation, filtering and correlation across the data, regardless of the underlying HBase schema. Mapping from the HBase native data structures to the MIx Core canonical model is straightforward and accomplished directly in the MIx templates. This allows for rapid and dynamic changes as well as an iterative approach to integration.

Industrial Data Modeling and Access Methods

Industrial data modeling requirements are significantly more complex than those found within the consumer market, the primary setting in which Hadoop has previously been adopted. This can make designing and scaling a Hadoop implementation more challenging. Even with Hadoop, in industrial environments where IT and OT data are coalescing, the exponential increase in data flows, alarms and alerts can overwhelm workers and hinder their ability to visualize asset and network health or prioritize operational performance problems. Because Hadoop by itself remains unproven in the Industrial Internet, it is important to consider the modeling differences and limitations, as well as the access methods required for the different types of data within the Hadoop implementation:

Sensor data such as asset readings
Network statistical information captured from the head end and network management systems
Time based data such as exceptions and events
Logging information received from systems such as a SIEM and head end components such as a Meter Communications Host
Location and asset management data received from an enterprise system
Spatial data, spatial analysis and location-based data received from a GIS and remote systems
Time series information received from head ends, field systems and SCADA

For each of these cases, it is important to consider which Hadoop project to apply, the data structures and schema, and how the information will be used and accessed by applications. For example, the HBase table partitions, rowkey designs and column families are important design elements for each of the data types mentioned above. Additionally, the HBase data loading, acquisition and streaming configurations are important as data is moved from source systems into the Hadoop repository.
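
For example, a common rowkey pattern for time-series sensor readings combines a salt byte, to spread writes across region servers, with a reversed timestamp, so the newest reading sorts first in a scan. A minimal sketch using the standard HBase client API, with hypothetical table and column names:

    import java.io.IOException;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ReadingWriter {
        // Rowkey = 1-byte salt + asset id + reversed epoch millis.
        static byte[] rowkeyFor(String assetId, long epochMillis) {
            byte salt = (byte) Math.floorMod(assetId.hashCode(), 16); // 16 write buckets
            return Bytes.add(new byte[] { salt },
                             Bytes.toBytes(assetId),
                             Bytes.toBytes(Long.MAX_VALUE - epochMillis));
        }

        static void writeReading(Table table, String assetId, double kwh) throws IOException {
            Put put = new Put(rowkeyFor(assetId, System.currentTimeMillis()));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("kwh"), Bytes.toBytes(kwh));
            table.put(put);
        }
    }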

Bit Stew’s MIx product stack, purpose-built for the huge volumes of data in the Industrial Internet, eases the requirements for strict and/or elaborate designs by automatically discovering information within Hadoop, indexing the information and representing it in canonical form. Additionally, MIx Core’s adaptive integration algorithms provide a high degree of flexibility to the underlying HBase model changes allowing for iterative designs, prototyping and as-needed performance changes within the Hadoop platform.

The out-of-the-box design elements within the MIx Core index are a rapid enabler for industrial companies.


Data Mapping

Mapping of the data from HBase to the canonical data model is an exercise covering three main areas:

Structural mapping from rowkeys, column families, columns and versions to the IEC CIM model.

Data code mapping from internal codes and values to a common term reference. Bit Stew has already built this capability, but definitions such as job codes, work types, status codes, internal asset type references and other “data meanings” will still need to be captured within the knowledge repository.

Business rule mapping for logical processing of information according to business requirements and alignment with common industry understandings.

All of this configuration work is accomplished by the Integration team using the web-based UI for knowledge elements, as well as transformation logic contained in rules and data mapping templates.
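
To illustrate the structural-mapping step only, the sketch below converts an HBase row into a hypothetical stand-in for an IEC CIM MeterReading. In practice this logic lives in MIx templates and rules rather than hand-written code; the rowkey layout is the one assumed in the earlier rowkey example, with an 11-character asset id.

    import java.time.Instant;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.util.Bytes;

    public class CanonicalMapper {
        // Hypothetical stand-in for an IEC CIM MeterReading.
        public static class MeterReading {
            public String meterMrid;    // CIM mRID of the metering point
            public Instant readingTime;
            public double value;
        }

        // Assumed rowkey layout: 1-byte salt + 11-byte asset id + 8-byte reversed timestamp.
        public static MeterReading toCanonical(Result row) {
            byte[] key = row.getRow();
            MeterReading mr = new MeterReading();
            mr.meterMrid = Bytes.toString(key, 1, 11); // code mapping would normalize ids here
            long reversedTs = Bytes.toLong(key, 12);
            mr.readingTime = Instant.ofEpochMilli(Long.MAX_VALUE - reversedTs);
            mr.value = Bytes.toDouble(row.getValue(Bytes.toBytes("d"), Bytes.toBytes("kwh")));
            return mr;
        }
    }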

Searching and Fishing

The integration of MIx Core, Hadoop and HBase provides unprecedented capabilities for searching across all information elements, performing fine-grained filtering, aggregating and correlating diverse sets of information, and forensically analyzing all the data received from the smart grid network. Search is enabled across all the canonical representations as well as the native data structures contained within Hadoop/HBase. MIx Director allows customers to visualize, analyze, and optimize their data to provide a single unified, real-time view of their operations.

Third Party Application Access

The MIx Core and MIx Director technology was designed from the ground up as a dynamic and flexible service-oriented architecture. A key aspect of the design is building around small functional components that can be easily accessed through a number of methods:

Programmatic interfaces are available to third-party applications via web services including REST, JSON and SOAP

Programmatic interfaces are also available to third-party applications via a Java or XML SDK

Operator and Power User interfaces are available for configuration, tailoring, rules processing and macro functions via the web-based UI

Developer interfaces are available for development via a Cloud-based UI as well as through a standalone IDE such as Eclipse or IntelliJ.

This allows direct access to the application, on an authorized basis, by developers and third-party applications. Bit Stew leverages these interfaces for two-way integration with applications such as Tableau, Spotfire, SAS, Hadoop, Trove and many other tools. The same interfaces make it easy to expose data accessible through MIx Core and MIx Director, as well as to add and update data within them.
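
As a usage sketch, a third-party application might push a derived record back into the platform over the REST surface as follows; the endpoint path and JSON fields are hypothetical.

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class ThirdPartyPush {
        public static void main(String[] args) throws Exception {
            // Hypothetical record derived by an external analytics tool.
            String json = "{\"assetId\":\"METER-00042\",\"healthScore\":0.87}";
            HttpClient client = HttpClient.newHttpClient();
            HttpRequest push = HttpRequest.newBuilder()
                    .uri(URI.create("https://mix-core.example.com/api/records")) // placeholder
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(json))
                    .build();
            HttpResponse<String> ack = client.send(push, HttpResponse.BodyHandlers.ofString());
            System.out.println("Status: " + ack.statusCode());
        }
    }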

Accessing Data

Data can easily be accessed using the web services approach described above as well as using low-level data services exposed for batch loading, batch indexing, batch retrieval, streaming input, streaming output, event based triggers, exports and imports. The design principle is that the customer owns the data and therefore ubiquitous access to the data is required within the enterprise. By aggregating data from across the architecture, operators can query and then deploy personal, timely and relevant insights to professionals at their desks, in the control room or working remotely on their rugged handhelds and tablets.


By leveraging the Bit Stew MIx Core platform and MIx Director applications with their Hadoop architecture, industrial organizations can more quickly integrate extremely large volumes of operational data from a variety of disparate sources, correlate that data into a common data model, and apply predictive analytics and machine learning at the edge to derive actionable intelligence in real time. Gaining actionable insights from large volumes of unstructured data enables utilities to improve power distribution, lower operational costs, proactively identify and address risks, and accommodate new distributed energy resources.

Conclusion

The Hadoop Data Lake architecture and other data warehousing models have been touted as a solution to Big Data challenges. However, companies in industrial sectors have found that these approaches simply cannot handle the scale and complexity of industrial data. Additionally, they do not provide the real-time analysis and situational awareness that operators and engineers need in order to make critical operational decisions in the moment.

To learn more about Bit Stew please visit www.bitstew.com or contact us at [email protected].

CANADA - International Headquarters Suite 205 - 7436 Fraser Park Drive, Burnaby, BC V5J 5B9 (604) 568-5999 [email protected]

USA 800 West El Camino Real, Suite 180 Mountain View, CA 94040 (650) [email protected]

Learn more at bitstew.com

AUSTRALIA Rialto South Tower, Level 27, 525 Collins Street Melbourne, Australia 3000 [email protected]

EUROPE Paseo de la Castellana 141 – 8º 28046 Madrid, Spain [email protected]

Follow us on LinkedIn & Twitter: @BitStew

ABOUT THE AUTHOR Alex Clark, Chief Software Architect Alex’s 15-year career has made him a seasoned data architect and leading expert in web service technologies, global class computing, and building high-performance, secure, scalable and distributed architectures. He is responsible for developing the initial software that has evolved into MIx Director and is an expert in real-time systems and data integration.

In his previous role at BC Hydro, Alex was responsible for data architecture relating to the rollout of the utility’s two-million-meter smart grid project. Alex was also Chief Technology Officer with Navio Systems, Inc. and led the technical vision for the company’s award-winning and patented rights-based commerce technology. Prior to Navio, Alex was Chief Software Architect for B3 Security Corp., where he conceptualized and built the company’s secure, distributed and real-time transaction processing system.

Alex studied aerospace engineering at San Jose State University and computer science at West Valley College. He is part of the ZigBee Alliance, has authored numerous publications and patents, and has served as guest speaker on various topics at corporations and universities around the world.

In his current role as Chief Software Architect at Bit Stew, Alex leads the R&D function and collaborates on the evolution of the product roadmap. In this position, he combines his extensive experience in the utility industry with his deep knowledge of software design.