
Implementation of Apache Kafka® with Apache AVRO serialization in Pega

Created by: Pawel Nowak on November 1, 2019


© 2020 Pegasystems Inc., Cambridge, MA

All rights reserved.

Trademarks

For Pegasystems Inc. trademarks and registered trademarks, all rights reserved. All other trademarks or service marks are property of their respective holders.

For information about the third-party software that is delivered with the product, refer to the third-party license file on your installation media that is specific to your release.

Notices

This publication describes and/or represents products and services of Pegasystems Inc. It may contain trade secrets and proprietary information that are protected by various federal, state, and international laws, and distributed under licenses restricting their use, copying, modification, distribution, or transmittal in any form without prior written authorization of Pegasystems Inc.

This publication is current as of the date of publication only. Changes to the publication may be made from time to time at the discretion of Pegasystems Inc. This publication remains the property of Pegasystems Inc. and must be returned to it upon request. This publication does not imply any commitment to offer or deliver the products or services described herein.

This publication may include references to Pegasystems Inc. product features that have not been licensed by you or your company. If you have questions about whether a particular capability is included in your installation, please consult your Pegasystems Inc. services consultant.

Although Pegasystems Inc. strives for accuracy in its publications, any publication may contain inaccuracies or typographical errors, as well as technical inaccuracies. Pegasystems Inc. shall not be liable for technical or editorial errors or omissions contained herein. Pegasystems Inc. may make improvements and/or changes to the publication at any time without notice.

Any references in this publication to non-Pegasystems websites are provided for convenience only and do not serve as an endorsement of these websites. The materials at these websites are not part of the material for Pegasystems products, and use of those websites is at your own risk.

Information concerning non-Pegasystems products was obtained from the suppliers of those products, their publications, or other publicly available sources. Address questions about non-Pegasystems products to the suppliers of those products.

This publication may contain examples used in daily business operations that include the names of people, companies, products, and other third-party publications. Such examples are fictitious, and any similarity to the names or other data used by an actual business enterprise or individual is coincidental.

This document is the property of:

Pegasystems Inc.
One Rogers Street
Cambridge, MA 02142-1209
USA
Phone: (617) 374-9600
Fax: (617) 374-9620
www.pega.com

Document Name: Implementation of the Apache Kafka with the Apache AVRO serialization in Pega
Updated: November 1st, 2019


Use Case

Demonstrate the implementation of the Apache Kafka event streaming platform with Apache AVRO serialization, consuming AVRO messages through a Pega Real-Time Data Flow run.

Apache Kafka®

Apache Kafka® is a distributed streaming platform with three key capabilities:

• publish and subscribe to streams of records, like a message queue or enterprise messaging system
• store streams of records in a fault-tolerant, durable way
• process streams of records as they occur

Kafka is generally used for two broad classes of applications:

• building real-time streaming data pipelines that reliably move data between systems or applications
• building real-time streaming applications that transform or react to streams of data

More about Apache Kafka: https://kafka.apache.org/
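
To make the publish/subscribe capability concrete, the following is a minimal sketch of a plain Java producer that publishes one record to a Kafka topic. The broker address and topic name ("localhost:9092", "claims") are placeholder assumptions, not values from this implementation.

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MinimalProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; replace with your Kafka hosts/ports.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Publish one record to the (hypothetical) "claims" topic.
            producer.send(new ProducerRecord<>("claims", "claim-1", "{\"id\":1}"));
        }
    }
}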

Apache AVRO

“Apache Avro is a row-oriented remote procedure call and data serialization framework developed within Apache's Hadoop project. It uses JSON for defining data types and protocols, and serializes data in a compact binary format.”

More about Apache AVRO: https://avro.apache.org/
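
As a minimal sketch of how AVRO pairs a JSON-defined schema with compact binary data, the following Java fragment parses a hypothetical "Claim" schema and serializes one record. The schema fields (claimId, amount) are illustrative assumptions, not the schema used in this implementation.

import java.io.ByteArrayOutputStream;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.BinaryEncoder;
import org.apache.avro.io.EncoderFactory;

public class AvroSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema defined in JSON; real schemas live in the Schema Registry.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Claim\",\"fields\":["
          + "{\"name\":\"claimId\",\"type\":\"string\"},"
          + "{\"name\":\"amount\",\"type\":\"double\"}]}");

        GenericData.Record record = new GenericData.Record(schema);
        record.put("claimId", "CLM-001");
        record.put("amount", 125.50);

        // Serialize to the compact binary format (no field names on the wire).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericData.Record>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("Encoded " + out.size() + " bytes");
    }
}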

Pega Implementation

The implementation below was created using Pega Platform version 8.2.3 and the Apache AVRO Schema Registry component, installed through Pega Exchange, which is required to consume AVRO messages from an Apache Kafka topic. The Schema Registry component is delivered by the Pega Product Team and will also be available on the Pega Exchange website soon.

The screenshots below show the Pega application definition view with the installed Apache AVRO Schema Registry component.


The installation of the Schema Registry component requires a restart of the Pega environment, because the component contains changes to Java classes that must be applied.

After a successful restart of the Pega environment, a Kafka Data Instance must be created in Pega to establish a connection between Pega and the Kafka cluster. The Kafka Data Instance is available under the Records explorer in the SysAdmin category.

The configuration of the Kafka Data Instance requires the Kafka client details (hosts, ports) and optional security/authentication details (keystore, truststore), as in the example below:


When a connection between Pega and the Kafka cluster is successfully established, clicking the “Test connectivity” button displays a green “Connection established” message above the button.
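
For reference, the following is a minimal sketch of what such a connectivity test amounts to on the client side, using the plain Kafka AdminClient. The broker address and SSL paths are placeholder assumptions; this is not the Pega implementation of the button.

import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;

public class ConnectivityCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092"); // placeholder hosts/ports
        // Optional security details, analogous to the keystore/truststore fields:
        // props.put("security.protocol", "SSL");
        // props.put("ssl.truststore.location", "/path/to/truststore.jks");
        // props.put("ssl.keystore.location", "/path/to/keystore.jks");

        try (AdminClient admin = AdminClient.create(props)) {
            // If the cluster is reachable, this returns the cluster id.
            String clusterId = admin.describeCluster().clusterId().get();
            System.out.println("Connection established, cluster id: " + clusterId);
        }
    }
}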

The next configuration step after a successful Kafka Data Instance configuration is the creation of the Kafka Schema Registry Data Instance in Pega. The Schema Registry provides a serving layer for metadata. It also provides a RESTful interface for storing and retrieving Apache AVRO schemas. It stores a versioned history of all schemas, provides multiple compatibility settings, and allows schemas to evolve according to the configured compatibility setting. It provides serializers that plug into Kafka clients and handle schema storage and retrieval for Kafka messages sent in the AVRO format.

The Kafka Schema Registry Data Instance is available under the Records explorer in the SysAdmin category.
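
The RESTful interface mentioned above typically follows the Confluent Schema Registry API; whether this component's registry exposes exactly that API is an assumption. A minimal sketch of fetching the latest schema for a subject with Java's built-in HTTP client (the registry URL and subject name "claims-value" are placeholders):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegistryLookup {
    public static void main(String[] args) throws Exception {
        // Placeholder registry URL and subject name.
        String url = "http://registry-host:8081/subjects/claims-value/versions/latest";
        HttpResponse<String> response = HttpClient.newHttpClient().send(
            HttpRequest.newBuilder(URI.create(url)).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        // The response is JSON containing the schema id, version, and schema text.
        System.out.println(response.body());
    }
}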


The configuration requires the URL of the Kafka Schema Registry, and it also allows optional authentication to be configured.


Depending on the application requirements, it is good practice to create the AVRO representation as an abstract class in Pega, either in the implementation layer, the framework layer, or the organization layer, so that the Kafka components can be reused. In the example below, the “…-AVRO” abstract class was created as the reusable integration class for the Kafka implementation. The “…-AVRO-Claim” class maps the AVRO attributes into the Pega Clipboard Page.

After the Kafka Data Instance and the Kafka Schema Registry Data Instance are created, the Kafka Data Set can be configured to access the Kafka topic with the Apache AVRO messages. There is a one-to-one relation between a Pega Data Set and a Kafka topic, so multiple Kafka topics can be configured in Pega by creating multiple Data Sets. For example, the Kafka Data Set that consumes the claim messages in AVRO format, together with the Pega Data Flow that contains the process and triggers the consumption of the AVRO messages from the Kafka topic, should reside under the same new AVRO class.

The Kafka Data Set configuration allows you to select the previously created Kafka Data Instance configuration, and it lists all the available Kafka topics for selection; the topic cannot be changed after the first save operation. The Partition Key determines how the data from the Kafka topic is spread between Pega partitions in the Pega Data Flow run, to improve processing time. The “Record format” section is particularly important: it allows the default JSON message format to be replaced with a custom one, such as Apache AVRO, by providing the serialization implementation, as in the example “com.pega.integration.kafka.AvroSchemaRegistrySerde”. In the future, there is a plan to introduce a separate radio button option for the Apache AVRO message format. The custom record format also accepts input parameters, such as the Schema Registry name, supplied through the parameter “schema.registry.config”, where the Key field is the parameter name and the Value field is the name of the Schema Registry Data Instance configured earlier.
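
For background, schema-registry serdes commonly frame messages in the Confluent wire format: one magic byte, a 4-byte schema id, then the AVRO binary payload. The sketch below decodes such a message in plain Java, assuming the matching schema has already been fetched from the registry; it is illustrative only, not the internals of the Pega serde.

import java.io.ByteArrayInputStream;
import java.nio.ByteBuffer;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;

public class WireFormatSketch {
    // Decodes one schema-registry framed message into a generic AVRO record.
    static GenericRecord decode(byte[] message, Schema schema) throws Exception {
        ByteBuffer buffer = ByteBuffer.wrap(message);
        byte magic = buffer.get();      // magic byte, 0 in the Confluent format
        int schemaId = buffer.getInt(); // id used to look the schema up in the registry
        if (magic != 0) throw new IllegalArgumentException("Unknown wire format");

        // The remaining bytes are the plain AVRO binary payload.
        ByteArrayInputStream payload = new ByteArrayInputStream(
            message, buffer.position(), buffer.remaining());
        return new GenericDatumReader<GenericRecord>(schema)
            .read(null, DecoderFactory.get().binaryDecoder(payload, null));
    }
}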


The Kafka Data Set can be executed and tested directly from the Data Set rule via the Run option, as in the example below, which shows the AVRO messages read from the Kafka topic into the Clipboard Page OperationResult, which has the structure of the new AVRO class “…-AVRO-Claim”.


After a successful configuration of the Kafka Data Set, the Pega Data Flow can be created with the Kafka Data Set as the input to the flow. When the Kafka Data Set is the first shape in the Pega Data Flow, the flow is treated as a real-time flow during the Pega Data Flow run execution. In contrast, when a Table (non-stream, such as an SQL or NoSQL) Data Set is used, the flow is treated as a batch flow during the Pega Data Flow run execution.

There is an important read option available in the Data Set shape of the Kafka type (see the sketch after this list for the underlying consumer semantics):

- “Read existing and new records” – reads all the messages from the Kafka topic after a stop or failure of the Real-Time Data Flow run. This does not apply when the Real-Time Data Flow run is paused.
- “Only read new records” – reads only new incoming records from the moment the Real-Time Data Flow run was started, not from the beginning.
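
These two options resemble the standard Kafka consumer offset semantics, where a consumer either rewinds to the earliest available offset or starts from the latest one. A minimal sketch in plain Java follows; the group id, broker address, and topic are placeholder assumptions, and this is not the Pega implementation.

import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ReadOptionSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka-host:9092"); // placeholder
        props.put("group.id", "pega-dataflow");            // placeholder
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        // "earliest" resembles "Read existing and new records";
        // "latest" resembles "Only read new records".
        props.put("auto.offset.reset", "earliest");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("claims"));         // placeholder topic
            ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<byte[], byte[]> record : records) {
                System.out.println("offset=" + record.offset());
            }
        }
    }
}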


When the Data Flow with the Kafka Data Set is successfully configured, the last step in the process is the creation of the Real-Time Data Flow run, which triggers the execution of the Data Flow definition in real time. The Real-Time Data Flow run can be created from the landing page below:

The Real-Time Data Flow run configuration requires the AVRO class and the Data Flow definition to be triggered, created under that class. After a successful save of the Real-Time Data Flow run configuration, the process designed in the Data Flow launches automatically and consumes the Apache AVRO messages from the Kafka topic.


When the Real-Time Data Flow run has the “In progress” status, the Kafka Data Set configuration is correct and the Kafka cluster referenced by the Kafka Data Instance configuration is reachable. When the “# Records processed” count in the Real-Time Data Flow run is greater than 0 and the “# Failures” count equals 0, the Pega Real-Time Data Flow run is successfully executing the Pega Data Flow with the Kafka Data Set, which reads the Apache AVRO messages from the Kafka topic and maps them to the Pega class structure.

Summary

Out-of-the-box AVRO support is challenging, as every customer uses AVRO in a different way. This component has been tested on 8.2.x and sanity checked on 8.3, and it is not currently supported by Pega GCS in production. The component will not be released via Marketplace, primarily due to the very long time it takes to publish a component and keep it up to date. All new versions and fixes will be delivered via GitHub. Tentatively, there is a plan to add native platform AVRO support in 8.6.

This component is published to the Pega GitHub repository:

https://github.com/pegasystems/dataset-integrations/tree/master/kafka-schema-registry

You can find the released version and supporting documentation here:

https://github.com/pegasystems/dataset-integrations/releases