
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

An Evaluation on Using Coarse-grained Events in an Event Sourcing Context and its Effects Compared to Fine-grained Events

BRIAN YE

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

An Evaluation on Using Coarse-grained Events in an Event Sourcing Context and its Effects Compared to Fine-grained Events

2017-06-01

BRIAN YE, [email protected]

Title in Swedish: En utvärdering på användningen av grova händelser i ett eventsourcing-sammanhang och dess konsekvenser jämfört med fina händelser

Master's Thesis in Computer Science
School of Computer Science and Communication (CSC)
Royal Institute of Technology, Stockholm

Examiner: Olle Bälter
Supervisor: Erik Isaksson
Project Provider: Fredrik Westermark at TriOptima AB


    Abstract

Introducing event sourcing to a system that is based on a model following Create, Read, Update and Delete (CRUD) operations can be a challenging task and requires an extensive rework of the current system. By introducing coarse-grained events it is possible to preserve the structure of the data in a CRUD model and still gain the benefits of event sourcing, avoiding an extensive rework of the system. This thesis investigates how large amounts of data can be handled with coarse-grained events while still gaining the benefits of event sourcing, by comparing with the conventional way of using fine-grained events. The data examined is trade data fed into a data warehouse.

Based on research, an event sourcing application is implemented for coarse-grained as well as fine-grained events, to measure the difference between the two event types. The comparison is limited to two metrics: latency and size of storage. The application is verified with an error handler, example data and a profiler to make sure that it does not contain any unnecessary bottlenecks.

The resulting performance of the two cases shows that fine-grained events have considerably higher latency than coarse-grained events in most cases, whereas the size of storage is strictly smaller for fine-grained events.


    Sammanfattning

Att introducera event sourcing i ett system baserat på en modell som använder Create-, Read-, Update- och Delete-operationer (CRUD) kan vara en utmanande uppgift och kräver en omfattande omstrukturering av det nuvarande systemet. Genom att introducera grova händelser är det möjligt att bevara hela strukturen på datan i en CRUD-modell och ändå få fördelarna med event sourcing, för att därigenom undvika en omfattande omarbetning av systemet. Detta arbete undersöker hur stora datamängder kan hanteras genom grova händelser och ändå ge fördelarna med event sourcing, genom att jämföra med det konventionella sättet att använda fina händelser. Datan som undersöks är transaktionsdata på finansiella derivat som matas in i ett datalager.

Baserat på forskning implementeras en event sourcing-applikation för både grova och fina händelser, för att mäta skillnaden mellan dessa två händelsetyper. Skillnaden är avgränsad till latens och lagringsutrymme. Applikationen verifieras genom felhantering, exempeldata och profilering för att säkerställa att den inte har några onödiga flaskhalsar.

Den resulterande prestandan visar att fina händelser har betydligt större latens än grova händelser i de flesta fallen, medan lagringsutrymmet är strikt mindre för fina händelser.


Acknowledgements

I would like to express my gratitude to my supervisor Fredrik Westermark at TriOptima for the useful comments, remarks and engagement throughout the learning process of this thesis. Furthermore, I would like to thank my supervisor at KTH, Erik Isaksson, for providing invaluable feedback on scientific approaches and aiding in outlining the thesis, as well as my examiner Olle Bälter, who made the project possible. Lastly, I would like to thank my friends and family, who have continuously supported me throughout the entire process.

    Yours Sincerely,

Brian Ye
Stockholm, 2017-06-01

Contents

1 Introduction
  1.1 Background
    1.1.1 OTC Derivatives
    1.1.2 TriOptima and triResolve
    1.1.3 Granularity and Event Examples
  1.2 Problem Statement
  1.3 Purpose and Objectives
  1.4 Delimitations
  1.5 Method
  1.6 Contribution
  1.7 Outline

2 Related Work
  2.1 Event Sourcing in Different Contexts
  2.2 Large Object Storage
  2.3 Summary of Related Work

3 Theory
  3.1 Domain-Driven Design
    3.1.1 Concepts
    3.1.2 Building Blocks
  3.2 Event Sourcing
    3.2.1 Event Store
    3.2.2 Event Storming
    3.2.3 Granularity
  3.3 Command Query Responsibility Segregation (CQRS)
    3.3.1 Command Side
    3.3.2 Query Side
  3.4 Database Concepts and Storage Methods
    3.4.1 CAP Theorem and Eventual Consistency
    3.4.2 Handling Large Object Storage

4 Procedure
  4.1 Dataset
  4.2 Choice of Events
  4.3 Choice of Database Technology
  4.4 Simulation
  4.5 Implementation

5 Implementation Details
  5.1 System Overview
  5.2 Simulation Step
  5.3 Chunking of Large Object
  5.4 Event Store Schema in Cassandra
  5.5 Operations
  5.6 Verification of Implementation

6 Evaluation
  6.1 Metrics
  6.2 Measuring Tools
  6.3 Test Case Setup
  6.4 Benchmark Environment

7 Results
  7.1 Test Cases
    7.1.1 Fine-grained
    7.1.2 Coarse-grained
    7.1.3 Comparison
  7.2 Comments

8 Discussion
  8.1 Explaining the Results
  8.2 Analysis
  8.3 Criticism
  8.4 Initial Considerations
  8.5 Ethical Concerns and Sustainability

9 Conclusions
  9.1 Summary
  9.2 Future Work

Bibliography

A Data Examination Results
  A.1 Match View Byte Size over Number of Trades
  A.2 Distribution of Number of Trades for All Match Views

B Source Code
  B.1 Help Functions Creating Simulated Event Payloads
  B.2 Example Usage of cProfile Module
  B.3 Test Case Setup

C Program Output
  C.1 Example Data Output of Coarse-grained Events
  C.2 Example Data Output of Fine-grained Events
  C.3 Python Profiler Output

List of Figures

1.1 Overview of the underlying flow of trade data storage in triResolve
3.1 Composition of all building blocks in a DDD context
3.2 Figure of a stream of events in an append-only manner. Executing these events will yield the current state of the application
3.3 Abstracted visualization of how an event is triggered by a command
3.4 CQRS representation
3.5 Detailed view on Command Side of the CQRS pattern
3.6 Detailed view on Query Side of the CQRS pattern
3.7 Flow chart illustration of how large objects are handled with compression and chunking
5.1 System overview of the event sourcing implementation
7.1 Compares the resulting latency for every test case between coarse-grained (blue) and fine-grained (orange) events. Each test case is denoted with an error bar on the linear scale to show the negative as well as positive deviation of the data point used.
7.2 Compares the resulting size of storage for every test case between coarse-grained (blue) and fine-grained (orange) events
7.3 Plotting size of storage against latency between coarse-grained (blue) and fine-grained (orange) events. Each data point represents the resulting data for each test case.
A.1 Match view length over number of trades per match view
A.2 Distribution of number of trades for all match views observed

List of Tables

3.1 Table showing example of Aggregate/stream group schema
3.2 Table showing example of event stream schema for "Match view"
4.1 General example to illustrate the structure of a match view. Match views have in reality 155 columns
4.2 The events and aggregate used for the event sourcing solutions
4.3 Summary of data for the match_view_created event
4.4 Summary of data for the trade_valued event
6.1 The number of trades and the size of every match view defined for each test case
7.1 Latency and size of storage results for fine-grained events
7.2 Latency and size of storage results for coarse-grained events

Listings

3.1 Simple code example of event sourcing in Python
5.1 Commands that trigger their corresponding events in event sourcing
5.2 Events that are triggered by specific commands
5.3 Setup of simulated data for event payloads
5.4 Presents the chunking function as well as the payload of an event chunk of the match_view_published event
5.5 Schema of the event store including a table of all event streams and a table of the events (the event stream)
5.6 Store events query in CQL for a certain aggregate
5.7 Update stream version query in CQL for a certain aggregate
5.8 Load events query in CQL for a certain aggregate
5.9 Load event stream query in CQL
B.1 Help function that sets up JSON columns with dummy string values
B.2 Example usage of cProfile module for some function call which writes the output to a file
B.3 Setup of data to be tested in performance evaluation
B.4 Running fine-grained events and measuring time
B.5 Running coarse-grained events and measuring time
C.1 Event store output of event streams table for chunked events of a coarse-grained event
C.2 Event store output of events table for chunked events of a coarse-grained event
C.3 Event store output of event streams table for fine-grained events
C.4 Event store output for fine-grained events of events table
C.5 Profiler output using example data for the coarse-grained case. Shows that the total time spent is in execute_and_persist() and that the bottlenecks are only load_stream() and persist_event_as_chunks() (marked in green). Supporting functions do not significantly affect the performance either.
C.6 Profiler output using example data for the fine-grained case. Shows that the total time spent is in execute_and_persist() and that the bottlenecks are only load_stream() and persist_event() (marked in green). Supporting functions do not significantly affect the performance either.

Glossary

aggregate A collection of objects in a domain that are bound together with a certain root object and treated as a single unit.

Cassandra An open-source NoSQL database system providing high availability and no single point of failure, placing high value on performance.

chunking The process of dividing an object into several pieces that can be put together to rebuild the original object.

CQRS Command Query Responsibility Segregation.

CRUD Create, Read, Update and Delete.

DDD Domain-Driven Design.

event An immutable object expressed in past tense capturing a change in the system and its business intent.

event sourcing A way to ensure that all changes in an application state are stored as a sequence of events in chronological order.

event store A data store that stores events generated from event sourcing.

granularity The degree of detail shown in an object. An object is fine-grained if the degree of granularity is high and coarse-grained if it is low.

match view A table containing all matched trades between two financial entities.

Chapter 1

Introduction

This chapter introduces the topic of event sourcing, the problem statement of the thesis as well as a description of the context in which it is carried out.

Within the domain of software development, many applications follow a CRUD (Create, Read, Update, Delete) model, meaning that the application state is modified using regular CRUD operations. Due to its common occurrence, developers tend to take for granted that this is the best practice for storing and persisting data in an application [1]. One essential weakness that has recently been addressed is that such a model persists only the current state of an application. In a dynamic and ever-changing environment, applications change continuously, and whenever an error or bug occurs it requires a lot of manual work to pinpoint the source of the error. Event sourcing is a fairly new concept that addresses this challenge by providing a record of the historical events that lead up to the current state of an application, similar to an audit trail in accounting. This introduces a new way of thinking about the business domain of a software project and is strongly related to an architectural pattern called CQRS. The idea originated from communities within Domain-Driven Design (DDD), around the time Martin Fowler introduced the concept of event sourcing in 2005. Since then it has gained popularity, and some libraries and scientific papers have been implemented and written within this domain. Despite this, it has not yet become an industry standard in software development. One reason is that some practitioners claim that event sourcing is an ineffective way of structuring systems and adds an unnecessary overhead to the development cycle [2]. Furthermore, it is also a challenge to introduce event sourcing to a system that has never used it before. This thesis aims to investigate how the event sourcing pattern can be applied to a system that has never used the pattern before and persists very large data volumes. What this specifically entails is presented later in this chapter.


    1.1 Background

Before diving into the problem statement, the context as well as some key concepts are introduced.

    1.1.1 OTC Derivatives

OTC stands for Over-the-Counter, an expression used to describe off-exchange trading, where no regulating exchange is involved in financial transactions between different parties [3]. In this sense trading can be tailored to each party's needs. A party in this context is an entity involved in financial transactions.

OTC derivatives are the securities traded through an OTC network. A derivative is best described as a type of security whose price depends on the price of an underlying asset. There are also regular, exchange-traded derivatives, but these should not be confused with OTC derivatives.

    1.1.2 TriOptima and triResolve

TriOptima is a Stockholm-based company active in the OTC derivatives market, providing critical post-trade infrastructure and risk management services for trading of OTC derivatives.

triResolve is a service where trade data of OTC derivatives between different parties is compared in a central system [4]. The reason this system exists is that trading of OTC derivatives is done outside the stock market, meaning that there is no regulating actor correcting and adjusting the data being traded. Wrongful data can therefore exist in the trade data between parties, e.g. a trade between Party A and Party B can be valued at 1,111,000 USD but should instead be 1,000,000 USD. The trade data that exists between two parties is described as a match view in the context of triResolve's service, which is a view of all matching trades between two parties in a financial transaction.

The trade data fed into triResolve is stored in a large data warehouse, which in their case is a Vertica database [5]. The flow of the system is described in figure 1.1. A trade file is sent to the system, where it is processed and then stored in a relational database, in this case a MySQL database. When the matching process is done, the match view ID for the corresponding match is sent to another process that initiates the storing procedure to Vertica. The data is queried from the MySQL database for the corresponding match view and is in the end dumped into a temporary file with a certain format and stored in Vertica. Trade data is stored in two ways when fed into Vertica:

1. When a new match view is loaded: Every trade in the match view is valued, meaning that every trade gets a new trade value in the match view.

2. When an existing match view is reloaded: A subset of trades is revalued in the match view, meaning that the trades whose values have changed get a new trade value.

Figure 1.1: Overview of the underlying flow of trade data storage in triResolve

    1.1.3 Granularity and Event Examples

Granularity refers to the degree of detail shown in an object. If an object is fine-grained, the degree of granularity is high, and if it is coarse-grained it is low [6]. An event can be expressed as a coarse-grained event or as fine-grained events; a coarse-grained event is a cluster of fine-grained events [7]. To illustrate this, an example of a coarse-grained event and its corresponding fine-grained events is given (a small code sketch of the two representations follows the list below). A user in a form can be edited through a single field, e.g. User edited: "details = Person 1, Lindstedtsvägen 24, 11428 Stockholm, Sweden".

    The corresponding fine-grained events in this case would be the sub-fields:

    • Name changed: "name: Person 1"

    • Address changed: "address: Lindstedtsvägen 24, 11428"


    • City changed: "City: Stockholm"

    • Country changed: "Country: Sweden"
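To make the distinction concrete, the sketch below shows one way the two representations could look as plain Python dictionaries. It is only an illustration of the granularity difference; the key names and values are taken from the example above and are not part of any real triResolve schema.

# One coarse-grained event: the whole edited field captured as a single payload.
coarse_event = {
    "event": "User edited",
    "details": "Person 1, Lindstedtsvägen 24, 11428 Stockholm, Sweden",
}

# The corresponding fine-grained events: one event per changed sub-field.
fine_events = [
    {"event": "Name changed", "name": "Person 1"},
    {"event": "Address changed", "address": "Lindstedtsvägen 24, 11428"},
    {"event": "City changed", "city": "Stockholm"},
    {"event": "Country changed", "country": "Sweden"},
]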

1.2 Problem Statement

In the triResolve product at TriOptima, the trade data fed into the warehouse is critical for the application and is, among other things, used internally for analytics. Like most systems, the warehouse and the source feeding the warehouse only describe the current state of the system. In TriOptima's case, the dump trade data process (figure 1.1) represents the current state data and is structured in a way fitting their context. If someone suspects that old data in the warehouse is corrupted, it requires excessive amounts of manual work to analyze the data from databases and logs, which brings an overhead to the overall software development cycle. Event sourcing is a way of handling this issue and can reduce this overhead. However, people from the Domain-Driven Design communities argue that fine-grained events are to be preferred [8], which is problematic for TriOptima's case since an extensive rework of their system would be required. The reason fine-grained events are preferred is their ability to capture the business intent in the domain and to show exactly what has been changed in the system [8]. This is a problem commonly faced by many organizations desiring to move towards event sourcing from traditional CRUD operations [1], since the data fed into a system is structured in a way that is not optimal for event sourcing. The main interest lies therefore in understanding what effects event sourcing would have, if it were to be introduced, on TriOptima's current system and how it can be used without performing major changes to the system. The problem statement gives rise to the following research question:

• How can large amounts of data be handled with coarse-grained events and still gain the benefits of event sourcing?

The question entails whether or not structuring large amounts of data with coarse-grained events is a beneficial approach compared to the conventional way of event sourcing with fine-grained events. Using coarse-grained events would allow reusing the current data structure, while using fine-grained events would break down that data structure and initiate an extensive rework of the system. However, a fine-grained event stores only the necessary data change to the system, while a coarse-grained event can contain data that has not necessarily been changed.

1.3 Purpose and Objectives

The overall purpose of this thesis is to investigate how event sourcing can be used, and whether coarse-grained or fine-grained events are more applicable in terms of performance, to handle TriOptima's currently very large dataset. The purpose has been fulfilled when the following three objectives are met:

• An event sourcing application and its operations are implemented correctly following a set of requirements (which are described in the Implementation chapter).

• Differences/similarities in terms of latency and storage size can be shown adequately between coarse-grained and fine-grained events.

• It can be identified when coarse-grained events are beneficial compared to fine-grained events in terms of latency and storage size, and vice versa.

    1.4 Delimitations

The focus of this thesis is to evaluate the performance differences between two types of events, coarse-grained and fine-grained events. Hence, it does not include investigating the best large object storage strategy. Furthermore, the choice of data store will be based on literature studies, meaning that there will not be a quantitative evaluation of the choice of storage mechanism.

    1.5 Method

The thesis starts with a literature review on event sourcing and its applicability, as well as on common ways of storing large objects. Examples of coarse-grained and fine-grained events are also given. This is required to understand how the event sourcing implementation should be conducted. Once the key concepts are understood, a fine-grained event and a coarse-grained event are chosen based on the domain. The event sourcing pattern is implemented for both cases and is then verified using example data with an expected output, to make sure that the implementation works as intended for both cases. Error handling is implemented as well, making sure that the implementation behaves as it should without performing unnecessary operations. Finally, the implementation is profiled with example data to make sure that it does not have any unnecessary bottlenecks.

The use case data is examined and then simulated using linear regression. This data is fed to the event sourcing implementation, to set up the necessary data conditions to mimic the production environment.

The choice of the two events, the coarse-grained and the fine-grained event, and of the database technology is justified as well. A set of operations is then defined in order to fulfil the objectives of the thesis; these operations are used to conduct adequate performance tests between the two cases.


1.6 Contribution

Event sourcing is a fairly new concept and only a few scientific papers have been written in this domain. This thesis will hence be a source of inspiration for those willing to do further research on this topic. Since the mainstream way of processing data is through a CRUD model [9], many organizations face similar challenges as TriOptima does, weighing trade-offs and deciding whether or not a rework of the current system is worth the cost. Using coarse-grained events could be a viable way of reusing the current structure of an application and still gaining the benefits of event sourcing. There is, to our knowledge, little to no work related to structuring coarse-grained events, making this thesis exploratory in this aspect. Furthermore, it adds to the understanding of why or why not events should be fine-grained in an event sourcing context.

1.7 Outline

The thesis is organized into the following parts:

• Chapter 2 introduces and examines prior work related to event sourcing and large object storage, and chapter 3 presents the theory necessary to grasp the area of study the thesis concerns

• Chapter 4 presents the procedure used to answer the research question, including examining the data being used, the choice of events, the choice of database technology and the implementation process

• Chapter 5 presents the implementation details of the application

• Chapter 6 presents the evaluation setup and chapter 7 presents the results of it

• Chapter 8 presents the discussion of the thesis and chapter 9 presents the conclusions of key findings and future work

Chapter 2

    Related Work

This chapter will present prior work within the area of study this thesis focuses on. The purpose is to analyze conclusions made by others, in order to choose a method applicable for this investigation. Since event sourcing is more commonly used with fine-grained events, the chapter is divided into a section about prior work in event sourcing followed by a section about prior work in handling large objects.

    2.1 Event Sourcing in Different Contexts

When initiating the exploration of prior related work within this domain, it was apparent that only a few papers and case studies have been made about event sourcing. The reason is that it is a new way of designing the architecture of a software system and has not yet received much scientific recognition [10]. A common theme among all work related to event sourcing is that it is closely related to domain-driven design. Some papers related to this thesis were nevertheless found.

In a paper by Rothsberg [11], different types of databases are evaluated against each other in an event sourcing context. Rothsberg starts with evaluating what event sourcing is, identifies necessary querying functionalities in event sourcing and then makes an evaluation of different database technologies for storage in event sourcing. More specifically, a comparison was made between an existing event store implementation for a relational database and an own implementation of an event store with a NoSQL database technology. The reason for evaluating NoSQL databases was the technology's emerging popularity and the benefits it brings in complementing the flaws of regular relational databases. Since there exist many different categories of NoSQL databases (key-value stores, column-oriented, document-based and graph-based), Rothsberg evaluated these categories against a set of evaluation criteria defined for the paper. An event store schema was provided for each NoSQL database category, based on the research made on how event sourcing works and how event stores are designed in their simplest form. The key point here was that an event store should be an append-only database, only allowing inserts in a certain order to guarantee the traceability property brought from event sourcing. The graph database Neo4j was finally chosen to be evaluated against a relational database as an event store. The conclusion of this paper was that Neo4j as an event store could easily model event data but had issues with query performance. The final result indicated that the relational database performed better than the implemented graph database. This work supports the view that it is possible to use many different types of storage mechanisms as an event store, where read and write performance is what will be the bottleneck given an evolvable event store.

Previous papers directly related to event sourcing are, as said, very limited, but prior work exists where event sourcing and a larger system pattern called Command Query Responsibility Segregation (CQRS) are combined. In short, CQRS is an approach to separate the write and read sides of a system. Korkmaz et al. justify using event sourcing in combination with CQRS [12]. The scope is however limited to CQRS, and event sourcing is only briefly mentioned. It is noted that event sourcing is a commonly known, but not required, sub-pattern in CQRS that complements the architecture well. Furthermore, the paper points to the benefits which event sourcing can give and establishes a common consensus about its usability among developers.

A case study on the variability gained from CQRS and its sub-patterns was made by Kabbedijk et al. [13], where the authors indicated that event sourcing is one of the sub-patterns that could increase a system's variability. Variability in this context was defined as the ability to modify or change a software product. The case study was conducted at a large software product vendor that designs software products based on CQRS.

Erb et al. also explore the usability of event sourcing, in the context of event simulation and modeling [14]. The authors suggest a way of combining CQRS and event sourcing, similar to what Korkmaz et al. investigated. The key conclusions from this paper are that event sourcing simplifies debugging, improves analytics and enables distributed simulation architectures. The main challenge identified is the overhead of replaying the event log from scratch whenever events are changed. Erb et al.'s work is related to this thesis in that the authors justify the applicability of event sourcing in other contexts. It is nevertheless also noted that event sourcing is not a requirement in CQRS, and can be used independently.

Furthermore, a master's thesis introducing retroactive computing in event sourcing has also been written [15]. Similarly to the aforementioned papers, the author highlights the importance of event sourcing and shows how it can be combined with retroactive modification of historical data. The key point from this work is that the granularity of events is said to depend on context, something that to our knowledge is not discussed in other work related to event sourcing.

Microsoft made a large commitment to exploring the possibilities and limitations of CQRS and event sourcing [1]. It is explored how a system with high availability, scalability and maintainability can be designed. The book is mainly a collection of different journeys applying CQRS and event sourcing in different contexts to provide a common consensus for this domain of knowledge. Furthermore, they describe the distinction between a transaction log and event sourcing, since both of these patterns provide a historical trail of an application. The difference in this respect is that event sourcing captures not only the fact that something happened but also the intent, that is, the user's intention that triggered the event. Furthermore, it is highlighted that event sourcing can be used standalone, where merely providing a data store of application changes is in itself valuable for any organization.

The closest prior implementation project is John Bywater's event sourcing library written in Python [16]. Bywater implements a complete infrastructure and the required objects for a model, including events, commands, an event store supporting different types of databases, and various services. The implementation complies with all prior work in event sourcing reviewed in this thesis, where event replaying, event store snapshotting and fast-forwarding are some desired features within this context. The databases that the implementation supports as event stores are the NoSQL database Apache Cassandra [17] and relational databases. This shows that it is possible to create an event sourcing implementation in Python. This implementation differs from this thesis in that it does not have support for handling large events and provides an infrastructure that stretches beyond the scope of this thesis.

    2.2 Large Object Storage

This section will present some prior work regarding large object storage. Since the pilot study indicated that there is no conventional way of structuring event sourcing using coarse-grained events, this is a topic to be researched as well. Database technologies have an upper size limit on what can be stored [18], [19], and in many cases they have similar ways of handling objects that exceed these limits.

Stonebraker et al. present four implementations for support of large object storage in POSTGRES [20], one of which is relevant to this thesis. The authors call the method f-chunks: it splits a large object into several chunks of a fixed size, and the chunks are stored as records in the database. The object stored by the user is broken into sub-objects of 8,000 bytes and stored in the database in a safe way [20]. The storage mechanism supports compression as well, where each sub-object being stored is compressed and then decompressed whenever the data is loaded. In the end, the authors benchmark the different implementations against each other, indicating how they perform in terms of elapsed time, degree of compression, and data size. The takeaway from this paper is the possibility of splitting a large object into several chunks for storage and the importance of compression. Similarly, an official guide for PostgreSQL states that large objects are handled in this context by splitting them up into chunks which are stored as rows in the database [21]. Each chunk has an associated unique key for correct indexing as well.
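As a rough illustration of the chunking-with-compression idea described above, the sketch below breaks a payload into fixed-size sub-objects, compresses each one and reassembles the original afterwards. The 8,000-byte size is taken from the POSTGRES description; the function names and the in-memory list standing in for database rows are assumptions made for the example.

import zlib

CHUNK_SIZE = 8000  # bytes per sub-object, as in the POSTGRES f-chunks scheme

def split_into_chunks(payload, chunk_size=CHUNK_SIZE):
    """Break the payload into fixed-size sub-objects and compress each one."""
    return [
        (i // chunk_size, zlib.compress(payload[i:i + chunk_size]))
        for i in range(0, len(payload), chunk_size)
    ]

def reassemble(chunks):
    """Decompress the sub-objects in index order and rebuild the original payload."""
    ordered = sorted(chunks, key=lambda pair: pair[0])
    return b"".join(zlib.decompress(chunk) for _, chunk in ordered)

# A large object round-trips through chunking and compression.
large_object = b"trade data " * 10000
rows = split_into_chunks(large_object)  # each pair would be stored as a row keyed by its index
assert reassemble(rows) == large_object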

Among Netflix's open source repositories there is an implementation of a Java client for Apache Cassandra. Cassandra is a highly available column-oriented database [17]; the Java client is called Astyanax and was in use by Netflix [22] until February 16th, 2017. The most relevant part of the client is its support for chunked object storage. Writing, loading and deleting objects in a Cassandra database are implemented. A so-called ObjectWriter, representing the writing functionality, makes sure that a file is broken down into smaller pieces which are then pushed to the Cassandra database. This approach is similar to the one introduced by Stonebraker et al.

In another paper, by Pataki [23], four different chunking methods are evaluated in the context of document chunking. The evaluated methods are the work most closely related to this thesis when it comes to handling large objects. The four methods were evaluated in terms of number of chunks generated, size of database, tuning ability and finding overlapping words. The methods evaluated were (1) sentence chunking, (2) hashed breakpoint chunking, (3) overlapping word chunking and (4) overlapping hashed breakpoint chunking. Method (2) showed the most promising results and is similar to how large objects are handled by Stonebraker et al. [20] and Netflix [22].

2.3 Summary of Related Work

Relevant conclusions and topics introduced in prior related work can be summarized according to the bullet points below:

• Storage for event sourcing is commonly handled with a NoSQL database.

• Events being stored usually have a high degree of granularity, but to our knowledge no further work has been done comparing coarse-grained and fine-grained events.

• Performance tests in terms of latency are usually done, in the context of event sourcing, to measure the total elapsed time to store events.

• Event sourcing is commonly combined with CQRS, but can be implemented as a standalone application.

• Despite not being a requirement in event sourcing, DDD is closely related to it.

• Large objects are usually processed through chunking and compression.

The literature study will be based on the related work, in order to understand the area of knowledge that needs to be explored to adequately answer the research question stated in section 1.2.

Chapter 3

    Theory

This chapter will present the theory necessary to understand the area of knowledge the thesis handles. The purpose of this part is to provide an understanding of what event sourcing is, in order to gain knowledge of how such a system functions. Concepts of DDD and CQRS will be presented as well, since they are often exercised in combination with event sourcing. Furthermore, database storage size limitations will be presented, followed by ways of handling large objects.

    3.1 Domain-Driven Design

As mentioned in the introduction, event sourcing is closely related to DDD. In order to fully grasp the applicability of event sourcing, some key concepts and vocabulary are introduced in this section. Note that the thesis does not provide an implementation following DDD; the concepts are rather used to understand how event sourcing is used and implemented. DDD is an approach to software development where development is connected to an evolving model of the core business of an organization [24]. The requirements for this approach are that the primary focus is placed on the domain or domain logic, that complex designs are based on a model of the domain, and that a creative collaboration is initiated between domain experts and developers to iteratively refine a conceptual domain model that addresses the problems in certain domains [24]. A domain expert is a person who is an authority on a particular topic [25].

    3.1.1 Concepts

To be able to exercise the three stated requirements for DDD above, some concepts need to be introduced. Evans defines four high-level concepts [26]: the context, which is the setting in which the meaning of a certain expression is determined; the domain, which is the subject area to which the user applies a program; the model, which is a system of abstractions that describes parts of a domain; and finally the ubiquitous language, which is a common vocabulary used to structure all activities for all team members in a certain software project. These four concepts together compose the domain model, which is a model designed for and by everyone to understand in a software project [26]. The domain model is expressed in programming code for developers, and captures the relevant domain knowledge for domain experts, as well as enabling the whole team to determine the scope and verify the consistency of the gained knowledge.

3.1.2 Building Blocks

A domain model is composed of a set of building blocks, and understanding each of them is essential in understanding what a domain model is. The building blocks relevant for CQRS and event sourcing are entity, value object, service, aggregate, and domain event [1], [26]. An entity is an object defined by its identity and is kept continuously through time in a domain model; e.g. a trade file in a trade repository can be an entity. The attributes of a trade file can change over time (such as its trade value, name, status, etc.) but within the system each trade file will have a unique identity. A value object is the opposite, being an object with attributes but no identity. Value objects are immutable and only their content is valuable. E.g. when exchanging business cards no distinction is made between different cards; rather, the information printed on the cards is what is valuable. In this context, a business card is a value object. Whenever an operation does not belong to any object in the domain model, it can be implemented as a service, thus removing any unnecessary dependencies on the domain model. A domain event is an object within a domain describing something that has happened which the domain expert cares about [24]. A domain event is often used for communication between different aggregates.

    Figure 3.1: Composition of all building blocks in a DDD context

Entities, value objects and services describe the fundamental building blocks of a domain model. The term aggregate, however, describes the life-cycle of these objects [26]. An aggregate can be defined as a collection of objects that are bound together by a certain root object. This root object is called an aggregate root, but can also be called a root entity. For example, an aggregate could be a car that binds together several other objects (wheels, steering, engine, etc). The car would serve as an aggregate root to the other entities and value objects defined in this context [26]. Aggregates are commonly distinguished by defining a certain type and a unique identity [11], to be able to load the most relevant aggregates and their associated objects for specific cases. In order to access any object bound in an aggregate, external objects must go through the aggregate root, meaning that they are only allowed to hold a reference to the root. This guarantees consistency of changes being made to an aggregate [26]. Figure 3.1 is an illustration of how these building blocks work together for one aggregate type.
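As a small illustration of these building blocks in Python (the classes and fields below are invented for the example and are not taken from the thesis' implementation): a value object is immutable and compared by content, an entity carries an identity that persists over time, and the aggregate root is the only entry point through which the collection is modified.

from dataclasses import dataclass, field
from typing import List
import uuid

@dataclass(frozen=True)
class TradeValue:          # value object: immutable, no identity, compared by content
    amount: float
    currency: str

@dataclass
class Trade:               # entity: identified by trade_id, attributes may change over time
    trade_id: str
    value: TradeValue

@dataclass
class MatchView:           # aggregate root: external code only references this object
    match_view_id: str
    trades: List[Trade] = field(default_factory=list)

    def add_trade(self, value: TradeValue) -> Trade:
        trade = Trade(trade_id=str(uuid.uuid4()), value=value)
        self.trades.append(trade)
        return trade

view = MatchView(match_view_id="123")
view.add_trade(TradeValue(amount=1000000.0, currency="USD"))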

3.2 Event Sourcing

From the DDD communities the concept of event sourcing emerged. Event sourcing is a way to not only know what the current state of an application is, but also how the application reached that state. At its core, event sourcing ensures that all changes in an application state are stored as a chronologically ordered sequence of events [2]. Every change of the application state is captured as an event object and persisted with the same lifetime as the application itself [2]. In order to fully grasp event sourcing, it is first important to understand the idea of an event [1]:

• An event is something that happened in the past, and is written in past tense, e.g. "Trade data was added".

• An event is immutable.

• Events are one-way messages. Only one source publishes the events but several sources can listen to events.

• An event contains information about the event and describes the business intent, e.g. "Trade with ID 1234 was valued for Party A" is described more in line with the organization's business, while "Insert new Trade row with key value 1234 and Party A" would not capture the business intent of the event.

Events in event sourcing are closely related to aggregates and can be mapped to how DDD describes a domain event [27]. Two features of an aggregate give value to an event. First, aggregates define consistency boundaries for groups of related entities, and an event can be raised for an aggregate when there is a change, to notify parties interested in any entity related to the aggregate [1]. Secondly, every aggregate is associated with a unique ID, and the ID can be used to record which aggregate in the system raised the specific event. Event sourcing is not restricted to DDD, and the aggregate can also be known as e.g. an event stream [28]. Each event is appended to its belonging event stream/aggregate in the order in which the events were triggered. The events are triggered through a command, a request made to change the state of the specified aggregate, where the change is saved to a persistent event log. Every command has its own validation model, making sure that the command executed is legal. Figure 3.2 is a visual representation of how the event stream looks and figure 3.3 is a visualization of how an event is generated.

Figure 3.2: Figure of a stream of events in an append-only manner. Executing these events will yield the current state of the application

Figure 3.3: Abstracted visualization of how an event is triggered by a command

A simple code example of how event sourcing could be implemented is presented in listing 3.1. An aggregate Trade, represented as a class, is an AggregateRoot containing the functions for applying events and committing changes. The Trade aggregate has the command change_trade_value and the corresponding event TradeValueChanged. At the end of the listing a command for Trade is executed, the resulting event is saved locally and then appended to the event store, and finally a second Trade aggregate is rebuilt by replaying the stored events. Note that commands should be named in the imperative tense while events are named in the past tense.

class AggregateRoot:  # abstract base class
    def __init__(self):
        self._changes = []

    def apply_change(self, event, is_new=True):
        # dynamically call the handler on the child class to update state based on the event
        getattr(self, event.function_name)(event)
        if is_new:
            self._changes.append(event)

    def loads_from_history(self, events):
        for e in events:
            self.apply_change(e, is_new=False)

    def commit_changes(self):
        # here the collected events would be saved to a database/message queue/etc.
        self._changes = []


class Trade(AggregateRoot):
    def __init__(self):
        super().__init__()
        self.trade_value = None

    def change_trade_value(self, trade_value):  # command, named in imperative tense
        if trade_value:  # some sort of validation
            self.apply_change(TradeValueChanged(trade_value))

    def _trade_value_changed(self, event):  # handler that updates state from the event
        self.trade_value = event.trade_value


class TradeValueChanged(dict):  # event, named in past tense
    function_name = '_trade_value_changed'

    def __init__(self, trade_value):
        self.trade_value = trade_value


trade = Trade()
trade.change_trade_value(26)
events = list(trade._changes)  # keep the new events in memory
trade.commit_changes()         # store the events in the event store

trade2 = Trade()
trade2.loads_from_history(events)  # replay the events to rebuild the state
assert trade2.trade_value == 26

Listing 3.1: Simple code example of event sourcing in Python

As long as the event stream can guarantee the order, any application state can be replayed. It can be argued that event sourcing is an inefficient architecture since every application state can only be accessed through the event log, which would require querying its entire history every time any application state is requested [14]. Hence, event sourcing is commonly augmented with a secondary database that maintains an application state by immediately reading the events from the event log. This is described more thoroughly in section 3.3 in order to understand how event sourcing is used in a CQRS context. Another way of making querying on the event log more efficient is to use snapshotting. A snapshot of the event log is made, and all following events are replayed from this snapshot [29]. This technique is not a requirement for event sourcing but is useful when the event log has a long and increasing lifetime and needs performance optimizations.
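A minimal sketch of the snapshotting idea, under the assumption that a snapshot simply stores the reconstructed state together with the version it was taken at; the helper names are invented for the illustration.

def load_current_state(load_snapshot, load_events_after, apply):
    """Rebuild state from the latest snapshot plus the events recorded after it."""
    snapshot = load_snapshot()                  # e.g. {"version": 2999, "state": {...}} or None
    state = snapshot["state"] if snapshot else {}
    version = snapshot["version"] if snapshot else -1
    for event_version, event in load_events_after(version):
        state = apply(state, event)             # the same event handlers used in a full replay
        version = event_version
    return state, version

# Example usage with trivial stand-ins:
state, version = load_current_state(
    load_snapshot=lambda: {"version": 1, "state": {"trade_value": 26}},
    load_events_after=lambda v: [(2, {"trade_value": 50})],
    apply=lambda s, e: {**s, **e},
)
assert state == {"trade_value": 50} and version == 2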

    3.2.1 Event Store

The event log is called an event store, and is a main feature in event sourcing where all events are persisted. An event store is special in the sense that it is an append-only data store, that is, inserted data is never deleted [10]. Each stored object is, in other words, immutable. In the context of NoSQL and relational databases, the events are most commonly saved either as plain text or as JSON structures [10]. An event store has a structure of three levels [28]:


• Event store: A collection of event streams where every stream has a type and an ID.

• Event streams: Also known as an aggregate; a collection of events that originated from a single source. They are chronologically ordered.

• Events: A state change in the system. Contains at least a type and a payload.

The schema for an event store varies depending on the system in a domain, as long as the schema makes sure that every event stream has a unique version counter for every event that is applied. The schema should be designed to make querying as simple as possible [1]. The most common approach among previous work is to use two tables, one for aggregates and one for events [11], [30]. The following is a suggestion of the attributes for such a schema. The attributes representing the aggregates are:

• Aggregate ID: The unique identifier of the aggregate.

• Aggregate Type: The type associated with the aggregate. Used together with the ID as an identifier. Can also be called a stream group.

• Version: The incrementing version number, showing which event was most recently applied to this aggregate.

Aggregate ID   Aggregate Type   Version
123            "Match view"     3
999            "Trade"          8

Table 3.1: Table showing example of Aggregate/stream group schema

    The attributes representing the events are:

    • Aggregate ID: The unique identifier pointing to the aggregate

    • Aggregate Type: Used together with the ID as an identifier

    • Version: The latest applied event. Incremented for every event.

• Payload: The actual event. Can be structured as JSON or as simple text.

Tables 3.1 and 3.2 are examples of a set of aggregates and one of their corresponding event streams. The event stream table can also have more attributes, such as an event ID for every event or any additional metadata specific to the context event sourcing is used in. The schema should be designed to make querying as simple as possible for the intended user [2].


Aggregate ID   Type           Version   Payload
123            "Match view"   0         "trade_created"
123            "Match view"   1         "value_changed"; trade_value="26"
123            "Match view"   2         "name_changed"; name="Party A"
123            "Match view"   3         "value_changed"; trade_value="50"

    Table 3.2: Table showing example of event stream schema for "Match view"

An event store in its simplest form supports two operations: loading an entire event stream for an aggregate and storing new events in the store [2], [10], [11]. The important requirements when performing these operations are that all events are loaded in the order they were stored, and that every event being stored has an associated version number which is incremented for every newly appended event.
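A small sketch of these two operations, using an in-memory dictionary in place of a real database; the structure mirrors the two-table schema above, but the function and variable names are invented for the illustration rather than taken from the thesis' implementation.

# In-memory stand-in for the event store: one entry per (aggregate ID, aggregate type).
streams = {}  # (aggregate_id, aggregate_type) -> {"version": int, "events": [(version, payload), ...]}

def store_event(aggregate_id, aggregate_type, payload):
    """Append a new event and increment the stream's version counter."""
    key = (aggregate_id, aggregate_type)
    stream = streams.setdefault(key, {"version": -1, "events": []})
    stream["version"] += 1
    stream["events"].append((stream["version"], payload))
    return stream["version"]

def load_events(aggregate_id, aggregate_type):
    """Load the full event stream for an aggregate, in the order it was stored."""
    stream = streams.get((aggregate_id, aggregate_type), {"events": []})
    return list(stream["events"])

store_event(123, "Match view", {"event": "trade_created"})
store_event(123, "Match view", {"event": "value_changed", "trade_value": "26"})
assert [version for version, _ in load_events(123, "Match view")] == [0, 1]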

3.2.2 Event Storming

The key to a functioning event sourced system is to have a plan for how events and aggregates are set up in a domain. A high-level method, called event storming, is commonly applied to identify the domain model of a software project [31]. Event storming is executed through a brainstorming session in a workshop format together with a group of domain experts to quickly gain understanding of the domain, expected aggregates and other relevant objects of the application. This session also results in identifying what a coarse-grained event and a fine-grained event are in a specific domain, since this can vary depending on context [8]. Subsection 3.2.3 describes the implications of granularity in more detail. The event storming procedure is as follows:

1. Explore the domain with events as a starting-point. Let the domain experts give an example of the business process. The goal here is to identify key words and how the business process should be described in the context of software development.

2. Explore the origin of events. Events can be a consequence of a user action, hence triggered by a command. This step finds the commands related to the domain.

3. Identify aggregates. They are found based on the events and commands that emerged in the previous steps.

Event storming can be used to scope certain parts of a system, meaning it does not necessarily entail finding a domain model of the complete system.

3.2.3 Granularity

The granularity becomes apparent when deciding what events to use for a domain model in an event sourcing context. Since an event captures user intent and is based on the system's business domain, coarse-grained as well as fine-grained events can exist within one domain, describing the same data being changed but originating from different intents. The choice of using fine-grained or coarse-grained events is the responsibility of the domain experts, to fit the events to the domain [32]. In section 1.1.3 an example of a coarse-grained event and its corresponding fine-grained events was given. In event sourcing the events can be represented with keys in a JSON structure for the event payload, thus expressing the same example as follows:

• Coarse-grained: payload = {'event' = 'User edited', 'details' = 'Person 1, Lindstedtsvägen 24, 11428 Stockholm, Sweden'}

• Fine-grained: payload = {'event' = 'Name changed', 'name' = 'Person 1'}

The granularity of events depends on their applicability in a domain, which can be determined with event storming [31]. The granularity therefore depends on the context [15], [32]. If, for example, it is more applicable for the domain to have a single event changing the geographic location, the event looks as follows: payload = {'event' = 'Region edited', 'city' = 'Stockholm', 'country' = 'Sweden'}, where the payload can have several keys as well.

3.3 Command Query Responsibility Segregation (CQRS)

Despite being usable as a standalone pattern, event sourcing can be combined with other patterns, most commonly with CQRS [1]. This section describes the importance of event sourcing in a larger context, where key concepts of CQRS are introduced. Briefly, CQRS is the idea of having two different models for writing and reading information [9]. In contrast to the traditional CRUD (Create, Read, Update, Delete) model, systems start to require more than only updating and retrieving information from a model. CQRS is a way of tackling this, at the trade-off of added complexity to the system [9]. CQRS is based on the Command-Query Separation (CQS) principle, which was introduced by Meyer in 1988 [33]. Meyer states that asking a question should not change the answer, meaning that every method should either be a command that performs an action or a query that returns data.

CQS was applied by Greg Young and Udi Dahan, who combined the principles of CQS and DDD and created CQRS [34], [35]. The pattern can be described as two subsystems, the command side and the query side, where the first is responsible for changing the state while the latter is for viewing the state. Each side has its own data store, where the data from the write side is projected to the read side. The subsystems communicate through a, usually asynchronous, message bus over which events are sent from the command side to the query side. The data store on the query side can then be updated accordingly. The pattern is described in figure 3.4.


    Figure 3.4: CQRS representation

One of the main benefits with this pattern is that system upgrades can be done individually for the query and command side, without affecting the other.

When CQRS and event sourcing are combined, the data store on the command side is the event store, as described in subsection 3.2.1. The event sourced information is in that sense used as the message to notify the query side about a change that has been made to the application. The main reasons to apply this combination of patterns are when the system becomes increasingly complex, when scalability is a main concern in the system, and when it is desired to reveal business intent in the system [9].

    3.3.1 Command Side

The command side, as mentioned, is responsible for validating and processing changes to the application state. As described in section 3.2, a command from the application is applied to its designated aggregate and the change is stored in a persistent event store. This handling mechanism is usually called the command handler and makes sure that the command is correctly sent to the correct aggregate [10], [34], [35]. Every command has its own command handler. Figure 3.5 shows an in-depth view of the command side part of the CQRS pattern.

    Figure 3.5: Detailed view on Command Side of the CQRS pattern
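As an illustration of the description above, a command handler can be sketched as follows. The names are hypothetical, and the sketch assumes an event store with the load_stream/append_event operations outlined in subsection 3.2.1; it is not the thesis implementation.

# Sketch of a command and its command handler on the command side of CQRS.
# message_bus is assumed to be any object with a publish() method.
class ValueTradeCommand:
    def __init__(self, aggregate_id, trade_id, trade_value):
        self.aggregate_id = aggregate_id
        self.trade_id = trade_id
        self.trade_value = trade_value


class ValueTradeCommandHandler:
    def __init__(self, event_store, message_bus):
        self.event_store = event_store
        self.message_bus = message_bus

    def handle(self, command):
        # The aggregate's history would be used here to validate the command
        # against the current state before accepting it.
        history = self.event_store.load_stream(command.aggregate_id)
        event = {'event': 'trade_valued',
                 'trade_id': command.trade_id,
                 'trade_value': command.trade_value}
        version = self.event_store.append_event(command.aggregate_id, event)
        # Notify the query side so that its projections can be updated.
        self.message_bus.publish(command.aggregate_id, version, event)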


    3.3.2 Query Side

The query side, responsible for reading the data being persisted in an application, is the other part of the CQRS pattern, shown in figure 3.6. This part consists of projectors, a query store and a query handler [10]. Projections persist the current state of the application, by projecting denormalized views of the application relevant for the intended user. When an event is created and stored in the event store on the command side, it is sent to the query side via the message bus to update the designated projections. This provides a performance optimization, allowing the system to avoid querying the event store whenever data is requested to be read. The query handler handles incoming requests to the query store, where the projections are stored.

    Figure 3.6: Detailed view on Query Side of the CQRS pattern
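Correspondingly, the query side can be sketched as a projector that applies incoming events to a denormalized view, and a query handler that reads from it. This is an illustrative sketch with hypothetical names, not the thesis implementation.

# Sketch of the query side of CQRS: projections are kept up to date from
# events received over the message bus, and reads never touch the event store.
class MatchViewProjector:
    def __init__(self):
        self.views = {}  # match view id -> {trade id: changed header values}

    def apply(self, event):
        if event['event'] == 'match_view_created':
            self.views[event['matchview_id']] = {}
        elif event['event'] == 'trade_valued':
            view = self.views.setdefault(event['matchview_id'], {})
            view[event['trade_id']] = event['changes']


class MatchViewQueryHandler:
    def __init__(self, projector):
        self.projector = projector

    def get_match_view(self, matchview_id):
        # Incoming read requests are served from the projection (query store).
        return self.projector.views.get(matchview_id, {})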

3.4 Database Concepts and Storage Methods

Theoretical concepts of database storage are presented, indicating the challenges faced when persisting data with event sourcing. How large objects are handled, as well as the associated limitations, is presented.

    3.4.1 CAP Theorem and Eventual Consistency

In traditional learning of databases, many have come across ACID (Atomicity, Consistency, Isolation, Durability) [1]. The CAP theorem is another set of properties, applicable to distributed databases and also known as Brewer's theorem [36]. It was first introduced and proven by Eric Brewer, stating that a distributed computer system can only simultaneously provide at most two out of the three following guarantees [36]:

• Consistency (C): Every read query gets the most recent write, or an error is returned

• Availability (A): Every request receives a response from a non-failing node

• Partition tolerance (P): The system still operates when a network partition occurs (when a network is divided into two smaller networks which are unable to communicate with each other)


A distributed database operates over a distributed system, which NoSQL databases in many cases do. Since it is usually more desirable to have a system guaranteeing high availability and partition tolerance, Brewer further proposes a way to retain consistency through eventual consistency: a system with high availability and partition tolerance where all read queries will eventually receive the most recent writes [37]. Instead of seeing the consistency property as a binary condition, it takes a more continuous viewpoint and thus allows eventual consistency.

Event sourcing's largest challenge is to keep the data consistent. By combining it with CQRS it is possible to create an eventually consistent system and thus handle this challenge adequately [1]. If the message bus between the command and query side is asynchronous, it is possible to create projections without waiting for the latest events to update the views, while the system still knows for sure that the views will at some point receive the latest updates.

    3.4.2 Handling Large Object Storage

This section presents system limitations of commonly used databases when it comes to storing large objects, as well as a method commonly used to handle this.

Storing objects in databases can be limiting when it comes to storing them directly in a database field. The NoSQL database Cassandra has a theoretical column-value field limit of 2 GB, but values should be limited to a single-digit MB size in practice [18]. Similarly, the relational database MySQL has a row size limit of approximately 65,000 bytes [19]. In general, this indicates that files of larger sizes need to be handled in a particular way. In the MySQL world, two data types have been defined, TEXT and BLOB (Binary Large Object), which only store a reference to the data in the database row, while the content is stored at another location. The difference between these data types is that BLOBs are treated as binary strings while TEXTs are treated as character strings [38]. In contrast to MySQL, Cassandra does not support random access of BLOB values, which hence needs to be handled by the user [18]. If large objects are not handled, hot spots tend to emerge, which can result in exhausting the memory and crashing the database server [39].

    Figure 3.7: Flow chart illustration of how large objects are handled with compressionand chunking

A way of handling large objects, which is commonly the underlying functionality of BLOB storage in MySQL and PostgreSQL, is file chunking [21], [23], [38]. It has been shown that it comes under many different names, but the underlying functionality is still the same. The input to this method is a file of arbitrary size, which is split into evenly sized chunks. The output is a list containing the chunked data. To optimize large object handling, compression is commonly combined with chunking to decrease the number of chunks that need to be stored [40], by first compressing and then chunking the file while still maintaining the order of the chunks. One of the challenges with chunking is to maintain the order that is needed to restore the correct data sequence. Similarly to event sourcing, the order can be maintained with an incrementing sequence number associated with each chunk. Figure 3.7 illustrates how the chunking mechanism works.
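A compress-then-chunk scheme as described above could, for instance, look like the following sketch (zlib is an arbitrary compression choice and not prescribed by the cited work):

import zlib

def compress_and_chunk(data, chunk_size):
    # data is expected as bytes. Compress first, then split into fixed-size
    # chunks; each chunk carries a sequence number so order can be restored.
    compressed = zlib.compress(data)
    return [(seq, compressed[i:i + chunk_size])
            for seq, i in enumerate(range(0, len(compressed), chunk_size))]

def reassemble(chunks):
    # Sort by sequence number, concatenate and decompress.
    ordered = b''.join(chunk for _, chunk in sorted(chunks))
    return zlib.decompress(ordered)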

Chapter 4

    Procedure

This chapter presents the procedure of this thesis to answer the research question adequately and fulfil the defined objectives. An evaluation can thus be done on how coarse-grained events can be used and still gain the benefits of event sourcing.

The thesis takes an empirical approach in which experiments are conducted for the two cases, coarse-grained events and fine-grained events. To create an appropriate experimental model and in the end answer the research question, the following steps must be completed:

    1. The data to be used is examined to understand how large the dataset is.

2. Appropriate design choices on a fine-grained event and a coarse-grained event are made.

    3. An appropriate database technology is chosen as an event store.

4. Simulated data is generated and prepared based on the examination results of the original data. The reason why data is simulated is the protection of sensitive data, as well as restrictions in the production systems, which disallow evaluations in a production environment.

5. An event sourcing application is implemented for coarse-grained and fine-grained events, including a simulated data generator feeding data to the events in an appropriate way.

6. Performance evaluations are done with a set number of iterations and ten different test cases, each test case being a certain number of trades, for a given benchmark environment.

The data examined is specific to this case, but the approach can be applied to other cases as well, as long as the data can be represented as coarse-grained as well as fine-grained events.



4.1 Dataset

As stated in the introduction, the thesis is scoped to match view data that is fed into Vertica. It is therefore important to first understand the structure of the data and then be able to make appropriate design choices for the event sourcing implementation.

Each match view is structured as a table with 155 different headers. The table consists of rows of trades, where each row represents one trade in a match view. Each header value represents a certain attribute related to the trade. A header value could be a monetary value of the trade, a date, the currency exchange, or any other metadata. Match views can vary from containing one trade to over 420,000 trades, as shown in figure A.1 in the appendix. Figure A.2 shows the distribution of the number of trades for all the observed match views. Table 4.1 shows what a match view looks like.

ID   Trade value   Party 1   Party 2   ...   Currency   col X     col Y
1    123           Party A   Party B   ...   USD        [value]   [value]
2    431           Party A   Party B   ...   EU         [value]   [value]
3    21321         Party A   Party B   ...   SEK        [value]   [value]

Table 4.1: General example to illustrate the structure of a match view. Match views have in reality 155 columns

Note that a match view is in reality more complex than the provided example (table 4.1), which is only used for illustration. The match view data is continuously being changed when a trade day is active, e.g. the currency could be modified for some trades or some trades could be revalued. A change in a match view therefore corresponds to a set of trades in the match view being revalued. All data in match views is expressed as text values. The values in 10 out of the 155 headers are changed when a trade is valued.

4.2 Choice of Events

Relevant events for the business domain are identified. This is done through event storming, since according to previous work and best practices it is a way to obtain domain events [31]. For this reason, event storming is conducted with key stakeholders within the company, limiting the procedure to what events should be used to generate match views. This choice is not only based on event storming being common practice, but also because it can be used to decide what a coarse-grained event and a fine-grained event are (as described in the Introduction chapter) in a specific domain. Three domain experts attended the session. The outcomes of the event storming session are:

    • to know what aggregate(s) and events to use,


    • to know what a coarse-grained event is, and

    • to know what a fine-grained event is in this context.

Table 4.2 summarizes the outcome of the event storming session, where the to-be-used aggregate as well as the events are presented.

Type                       Name                   Description
Aggregate                  Match view             One event stream of match views
Coarse-grained Event       match_view_published   Contains complete match view data
Fine-grained Event (1/2)   trade_valued           Contains the changed trade in a match view
Fine-grained Event (2/2)   match_view_created     Initial event that sets up a match view format

    Table 4.2: The events and aggregate used for the event sourcing solutions

For the case of the coarse-grained event, the data of a complete match view is loaded in the event payload, hence persisting the format of the dump trade data procedure as it is. This applies both when reloading a match view and when loading a new match view. This means that the match_view_published event is triggered whenever any trade valuation is called by the system, and the two cases of a reload and a new load are treated the same way. This payload can be as large as several megabytes, which means the system cannot store everything in a single value field. How this is solved is described in chapter 5.

For the case of the fine-grained events, it was noted during the event storming session that two types of events have to be created. An event initializing the format of a match view is made, called match_view_created, followed by an event recording the change of a single trade, called trade_valued. The events have the following intents:

• trade_valued: Corresponds to the change of a single trade row in a match view. The event payload consists of the ten out of the 155 headers that are changed when a trade is valued in a match view

• match_view_created: Initializes the format of a match view, creating the 155 headers that exist in a match view. This is required for the case when a new match view is loaded, where all headers in a match view are initialized, followed by events valuing every trade in the match view. When an existing match view is reloaded, only a subset of the trades in the match view is valued


The number of trades that exist in one coarse-grained event therefore corresponds to the number of fine-grained events for the corresponding match view, thus representing the same data but with different degrees of granularity. The motivation behind the granularity of the coarse-grained event was to create an event that persists the format of the data, avoiding additional functionality for restoring the format. For the fine-grained events, it is desired to create events that together cluster the data in the coarse-grained event in such a way that every fine-grained event fits the domain it is applied in. The event storming procedure makes sure this is done correctly for the specified domain.
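To make the relation between the two granularities concrete, the following sketch (with hypothetical names, not the thesis code) builds both representations from the same newly loaded match view:

def to_coarse_grained_event(matchview_id, match_view_text):
    # One event carrying the complete match view as a single text payload.
    return {'event': 'match_view_published',
            'matchview_id': matchview_id,
            'data': match_view_text}

def to_fine_grained_events(matchview_id, valued_trades):
    # For a newly loaded match view: one match_view_created event followed by
    # one trade_valued event per trade, i.e. N trades yield N + 1 events.
    events = [{'event': 'match_view_created', 'matchview_id': matchview_id}]
    for trade_id, changed_headers in valued_trades.items():
        events.append({'event': 'trade_valued',
                       'matchview_id': matchview_id,
                       'trade_id': trade_id,
                       'changes': changed_headers})
    return events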

4.3 Choice of Database Technology

The database chosen for the event storage is the Cassandra NoSQL database. The choice is based on the following reasons:

    • Database should be able to handle high volumes of writes

    • Database should be easy to query

    • Database should be scalable

    • Database is commonly used as an event store

Unlike relational databases, NoSQL databases do not have a strict schema to follow, which makes it easier to scale and handle large amounts of data. For the data of this case, and given that the database is going to be an event store, the database is in theory supposed to persist all data in the entire organization. Also, from the literature studies it is noted that NoSQL databases are often used as event stores in an event sourcing context. The reason Cassandra was chosen is its ability for high-speed operations, where it beats many other NoSQL technologies in performance benchmarking tests [41], which is something desired for this case. Furthermore, Cassandra provides the query language CQL [17], similar to SQL, the query language of relational databases, which allows easy querying of data. It is a state-of-the-art technology commonly used in both research and industry, which will give an adequate indication of the performance of event sourcing. The system limitation to be aware of is, as also stated in the literature study, that each column value in a Cassandra database is recommended to be at most a single-digit megabyte in size.
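For illustration, connecting to Cassandra from Python could be done with the DataStax driver (assuming the cassandra-driver package is available; the keyspace name and replication settings are hypothetical):

from cassandra.cluster import Cluster

# Connect to a local Cassandra node and prepare a keyspace for the event store.
cluster = Cluster(['127.0.0.1'])
session = cluster.connect()
session.execute("""
    create keyspace if not exists event_sourcing
    with replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.set_keyspace('event_sourcing')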

4.4 Simulation

To evaluate the event sourcing application, simulated data is first generated by observing the data provided by TriOptima. The data is then fed to every event in an adequate manner to accomplish a close to realistic and comparable event sourcing evaluation. Using simulated data is a known way to test systems and applications in research without exposing sensitive information while still generating the behavior of the original data [42]. For the general case, simulated data is generated by fitting original data to a statistical model, which provides a function that can output additional simulated data. The simulated data for this thesis is produced through linear regression, by analyzing the trades of a single trade day. The resulting metrics are presented in tables 4.3 and 4.4. See appendix A for the complete analysis.

    Coarse-grained Event

The event payload of the coarse-grained event, match_view_published, consists of the entire match view as a text value. This is simulated by generating a text value based on the size of the match view. The size is found through equation A.1, with the number of trades as input x.

    Fine-grained Events

Description                      Value
Average size of a single trade   1256 bytes
Number of headers                155
Average size per header          8 bytes

    Table 4.3: Summary of data for the match_view_created event

For the fine-grained events, the event payload needs to be analyzed further, since the payload is based on the number of headers and the size of each header value in the match view. Table 4.3 shows how the event payload for the match_view_created event should be formatted. The purpose of this event is to initialize all 155 headers of a single trade. The average size of a single trade is therefore calculated by taking the average, over all match views, of the total size of each match view divided by the number of trades in that match view. The size of a single header is then found by dividing this by the total number of headers, in this case 155.
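The derivation can be summarized with a back-of-the-envelope calculation; the observations below are made up for illustration, whereas the real figures come from the trade day analyzed in appendix A:

# Hypothetical observations of (total match view size in bytes, number of trades).
observations = [(1256000, 1000), (628000, 500), (2512000, 2000)]

# Average size of a single trade: mean of (match view size / number of trades).
avg_trade_size = sum(size / trades for size, trades in observations) / len(observations)

# Average size per header: a trade row consists of 155 headers.
avg_header_size = avg_trade_size / 155

print(avg_trade_size)   # 1256.0 for these made-up observations
print(avg_header_size)  # ~8.1, in line with the roughly 8 bytes per header of table 4.3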

Description                        Value
Average size of changing headers   61 bytes
Number of changing headers         10
Average size per value             7 bytes

    Table 4.4: Summary of data for the trade_valued event

Table 4.4, on the other hand, shows the setup for the trade_valued event. The purpose of this event is to store the change when a single trade is valued in a match view. The number of changing headers is, as mentioned earlier, ten. Similarly to the match_view_created event, the average size of the values of these ten headers is calculated, and the size of the value of each changing header can thus be derived. Chapter 5 presents how these are applied in the context of an implementation.

4.5 Implementation

When the setup of the simulated data is finished, the event sourcing application for coarse-grained and fine-grained events can be implemented. This part involves setting up the requirements of an event sourcing implementation, thus fulfilling the objectives of the thesis. The application is developed with an iterative development process. A basic functioning application is first provided, which is gradually developed and refined with additional functionality to finally fulfil the specified requirements.

The resulting prototyped application is then verified in three ways. An error handler is developed to handle illegal operations. The application is executed with example data to make sure it returns the expected output. Profiling is then used to make sure that there are no unnecessary time-consuming bottlenecks in the application, to verify that the two cases are comparable and the right operations are evaluated. Details of the implementation are presented in chapter 5.

All implementation is done in Python; the choice of language is due to its compatibility with the company's software, which is mostly written in Python. The event sourcing implementation is based on how it is done in previous work, from both academic papers and programming libraries, hence following the best practices of implementing such a system, e.g. John Bywater's event sourcing library in Python [16].

Chapter 5

    Implementation Details

This chapter presents the implementation details of the complete simulation setup, from generating simulated data, to applying the data in the event sourcing implementation, and finally evaluating the event sourcing implementation accordingly.

    The following requirements are set for the implementation:

    1. The simulation step does not affect the event sourcing evaluation

    2. A command has to be executed to trigger its corresponding event

3. The events are stored in sequence with a version counter and can be loaded in the order they are stored

4. Chunks of an event are described with the same version and persisted in the order they are stored in the event store

A set of operations is presented as well, which fulfil the requirements and the objective of the thesis adequately.

5.1 System Overview

An overview of the implementation is described in figure 5.1. The main files used to describe the system's flow are events.py, commands.py and eventstorage.py, represented as boxes in figure 5.1. The flow of the system is initiated by first setting up the simulated data (which is done in isolation so as not to affect the conducted evaluation). A command is then executed after validation, initiating the event sourcing application. The command makes sure that it originates from its corresponding aggregate. An event is hence triggered and the system checks whether or not the event should be chunked. If it should be chunked, it is a coarse-grained event and goes through the chunking procedure. For the case of a fine-grained event, this step is skipped and the flow goes directly to the store event procedure. The event is then stored in the event store accordingly (described in section 5.4).



    Figure 5.1: System overview of the event sourcing implementation

A command is represented as a class in the commands.py file, and how it is used is illustrated in listing 5.1. execute_and_persist() is the function used to execute a command, triggering its corresponding event. A more detailed description of the operations is given in section 5.5.

publish_match_view = PublishMatchView(stream_id=stream_id, matchview_id=matchview_id, data=data)
publish_match_view.execute_and_persist()

create_matchview_for_trade = CreateMatchViewForTrade(stream_id=stream_id, matchview_id=matchview_id)
create_matchview_for_trade.execute_and_persist()

value_trade = ValueTrade(stream_id=stream_id, matchview_id=matchview_id)
value_trade.execute_and_persist()

    Listing 5.1: Commands that trigger their corresponding events in event sourcing

An event is represented as a Python dict and sets up the data values for the payload relevant for the event. This is illustrated in listing 5.2. The reason why a dict is used is that it can easily be translated from a JSON structure in Python, and it is the desired structure to use for the event payload.

# Coarse-grained event, chunks data blob
def matchview_published_event():
    ...  # Return dict with coarse-grained event payload


# Fine-grained event
def trade_valued_event():
    ...  # Return dict with fine-grained event payload


# Fine-grained event, setting up columns for match view
def match_view_created():
    ...  # Return dict with fine-grained event payload

    Listing 5.2: Events that are triggered by specific commands

5.2 Simulation Step

The simulated data is fed to the event sourcing application as a preparation step. The data to be simulated is the payload of every event (see section 4.2). The data for the fine-grained events is simulated using the help function get_event_json_with_dummy_data(), setting up the required number of headers and the size of each header value for each event payload (given in tables 4.3 and 4.4). The event payload for the coarse-grained event is set up by using another help function, called get_match_view_size(), that generates the number of characters there are in the corresponding match view (see appendix B.1 for both functions). get_dummy_string() is used to get the actual string values that represent the simulated data. Listing 5.3 shows how the event payload for each event is set up.

# Coarse-grained event payload, data to be chunked
match_view_published_payload = dict(
    event_subject='matchview_published', matchview_id=matchview_id,
    data=get_dummy_string(get_match_view_size(trade_count)))

# Fine-grained event payload
match_view_created_payload = get_event_json_with_dummy_data(
    event_subject='matchview_created', matchview_id=matchview_id,
    number_of_event_columns=155, dummy_string_size=8)

# Fine-grained event payload
trade_valued_payload = get_event_json_with_dummy_data(
    event_subject='trade_valued', matchview_id=matchview_id,
    number_of_event_columns=10, dummy_string_size=7)

    Listing 5.3: Setup of simulated data for event payloads
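The help functions are defined in appendix B.1 and not reproduced here; the following is only a plausible sketch of what they could look like, where the coefficients of equation A.1 are replaced by placeholder values:

import random
import string

# Placeholder coefficients; the real values come from the linear fit of equation A.1.
SLOPE, INTERCEPT = 1256.0, 0.0

def get_match_view_size(trade_count):
    # Approximate total size (in characters) of a match view with trade_count trades.
    return int(SLOPE * trade_count + INTERCEPT)

def get_dummy_string(size):
    # Random string of the requested length, representing simulated text values.
    return ''.join(random.choice(string.ascii_letters) for _ in range(size))

def get_event_json_with_dummy_data(event_subject, matchview_id,
                                   number_of_event_columns, dummy_string_size):
    # Event payload with the requested number of headers, each holding a dummy value.
    payload = dict(event_subject=event_subject, matchview_id=matchview_id)
    for column in range(number_of_event_columns):
        payload['col_%d' % column] = get_dummy_string(dummy_string_size)
    return payload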


5.3 Chunking of Large Objects

The event payload of the coarse-grained event needs to be handled before being stored in the event store. As stated in the theory chapter, chunking is a common way of handling large objects. Even though compression is commonly used in combination with chunking, it is not used in this thesis. Compression is a means to optimize the process of handling large objects, hence it is not necessary to include compression as a step in this process.

1  def chunks(data, chunk_size):
2      return [data[i:chunk_size + i] for i in range(0, len(data), chunk_size)]
3
4  event_chunk = dict(event_subject=subject, matchview_id=mv_id,
5                     total_chunks=total_chunks, chunk=chunk)

Listing 5.4: Presents the chunking function as well as the payload of an event chunk of the match_view_published event

Listing 5.4 (lines 1-2) presents the function that chunks an input string (the event payload) and returns a list of chunks of a defined chunk size. The chunk size is set to 1,000,000 characters due to Cassandra's system limitations (as stated in the theory chapter). This is done because it is desired to store an object in a single column value field without exhausting the data store, thus maximizing the storage and avoiding unnecessary chunks.

A chunk is processed the same way an event is. The significant difference is that the event payload is an event chunk instead. Listing 5.4 (line 4) presents the event chunk as a dict. The number of chunks depends on the size of the dataset. As seen in the chunk payload, the total number of chunks is denoted with the key total_chunks. All chunks can therefore be associated with their corresponding event, thus maintaining correct order and quantity. In addition, all chunks have the same event id and version, since they all represent the same event. As a result, a coarse-grained event is now represented as manageable event chunks and can be safely stored in the event store.
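Reading such an event back is the inverse operation. As an illustrative sketch (not the thesis code), the chunks of one coarse-grained event can be reassembled as follows, assuming each row carries the chunk_number, total_chunks and chunk keys introduced above:

def reassemble_event(chunk_rows):
    # chunk_rows: the event chunks of one coarse-grained event, all sharing the
    # same event id and version; order is given by the incrementing chunk number.
    ordered = sorted(chunk_rows, key=lambda row: row['chunk_number'])
    assert len(ordered) == ordered[0]['total_chunks'], 'missing chunks'
    return ''.join(row['chunk'] for row in ordered)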

5.4 Event Store Schema in Cassandra

This section provides a technical explanation of the Cassandra database schema for the event store, resulting in a schema applicable for both coarse-grained and fine-grained events. Some additional metadata specific to this context is included in the schema in order to make querying as simple as possible.

The schema is presented in listing 5.5 and contains two tables: one summarizing all aggregates/event streams, event_streams, and another, events, that summarizes all events. A keyspace needs to be declared before creating the tables, to set up the environment. The tuple of stream_group and stream_id is the primary key for both tables. Queries against the tables in Cassandra must include the primary key, or else the statement will fail. The clustering order for the events table is ascending, based on the version and the chunk_number. The clustering order guarantees the sequence the events were stored in, as well as the order of the chunks of a single coarse-grained event.

create table if not exists event_streams (
    stream_group text,
    stream_id text,
    current_version int,
    primary key ((stream_group, stream_id))
);

create table if not exists events (
    stream_group text,
    stream_id text,
    version int,
    chunk_number int,
    id timeuuid,
    payload text,
    primary key ((stream_group, stream_id), version, chunk_number)
);

Listing 5.5: Schema of the event store including a table of all event streams and a table of the events (the event stream)

The total number of chunks of one match_view_published event is used to denote the chunks corresponding to that event. In the events table there is an attribute called chunk_number, shown in listing 5.5, which denotes an incrementing number from 0 to total_chunks-1. This makes sure that the order and the number of chunks are persisted correctly within the event. When the data being stored is not supposed to be chunked, a -1 integer is persisted as the value for chunk_number, which applies only to fine-grained events.

5.5 Operations

This section presents and describes the operations of this case, in order to fulfil the defined requirements of the event sourcing implementation as well as conduct the performance evaluations that fulfil the objective of the thesis.

    execute_and_persist()

As shown in listing 5.1, the execute_and_persist() function is the operation used to initiate a command. This also corresponds to the initialization stage in the system overview in figure 5.1. A stream_group and a stream_id are defined for each command class. The operation does the following (a sketch is given after the list):

    • Gets the current version of a stream by calling load_stream().


• Validates that the command is legal and executes it.

    • Returns an event. The event is then stored in the event store with store_event().
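A plausible sketch of how this could look in code is given below; the bodies, the load_stream()/store_event() signatures and the chunking flag are assumptions based only on the descriptions in this section, not the actual implementation:

def load_stream(stream_group, stream_id):
    ...  # placeholder: returns the current version of the stream

def store_event(stream, event, current_version, should_chunk, chunk_size):
    ...  # placeholder: persists the event, chunked if required

class PublishMatchView:
    stream_group = 'matchview'

    def __init__(self, stream_id, matchview_id, data):
        self.stream_id = stream_id
        self.matchview_id = matchview_id
        self.data = data

    def execute_and_persist(self):
        # 1. Get the current version of the stream.
        current_version = load_stream(self.stream_group, self.stream_id)
        # 2. Validate and execute the command, producing the event payload.
        event = dict(event_subject='matchview_published',
                     matchview_id=self.matchview_id, data=self.data)
        # 3. Store the event; the coarse-grained event is stored as chunks.
        store_event((self.stream_group, self.stream_id), event,
                    current_version, should_chunk=True, chunk_size=1000000)
        return event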

    store_event()

The store_event() operation is the underlying functionality for storing an event correctly in the Cassandra event schema. This operation is called within the execute_and_persist() operation. The input is the corresponding stream, the event to be stored, the current version of the stream, a Boolean saying whether it should be chunked, and the chunk size. The operation does the following:

    • Initializes a new stream if it does not exist, with the version -1.

    • If a stream does exist, store the event either as chunks or as one event.

• If the event should be stored as chunks, chunk the object and store every chunk separately as a sub-event on its own row, all having the same version (calling persist_event_as_chunks()). If the event should not be stored as chunks, store the event on a single row (calling persist_event()).

    • Update the current version of the stream with 1 if it