

The Internals of the Data Calculator

Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo

Harvard University

ABSTRACT

Data structures are critical in any data-driven scenario, but they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware, which evolve continuously. We present a design engine, the Data Calculator, which enables interactive and semi-automated design of data structures. It brings two innovations. First, it offers a set of fine-grained design primitives that capture the first principles of data layout design: how data structure nodes lay data out, and how they are positioned relative to each other. This allows for a structured description of the universe of possible data structure designs that can be synthesized as combinations of those primitives. The second innovation is computation of performance using learned cost models. These models are trained on diverse hardware and data profiles and capture the cost properties of fundamental data access primitives (e.g., random access). With these models, we synthesize the performance cost of complex operations on arbitrary data structure designs without having to: 1) implement the data structure, 2) run the workload, or even 3) access the target hardware. We demonstrate that the Data Calculator can assist data structure designers and researchers by accurately answering rich what-if design questions on the order of a few seconds or minutes, i.e., computing how the performance (response time) of a given data structure design is impacted by variations in the: 1) design, 2) hardware, 3) data, and 4) query workloads. This makes it effortless to test numerous designs and ideas before embarking on lengthy implementation, deployment, and hardware acquisition steps. We also demonstrate that the Data Calculator can synthesize entirely new designs, auto-complete partial designs, and detect suboptimal design choices.

Let us calculate. —Gottfried Leibniz

Harvard Data Systems Laboratory: Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo. 2018. The Internals of the Data Calculator. Technical Report, Harvard John Paulson School of Engineering and Computer Science, Cambridge, MA, USA, 36 pages. http://daslab.seas.harvard.edu

This technical report is an extended version of the paper presented at ACM SIGMOD 2018 with title: The Data Calculator: Data Structure Design and Cost Synthesis from First Principles and Learned Cost Models.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. Abstracting with credit is permitted. To copy otherwise requires prior specific permission. Contact the author(s).

Harvard DASlab Tech Report, July 2018, Cambridge, MA, USA
© 2018 Association for Computing Machinery. DASlab Tech Reports 2018 (1). . . $15.00
http://daslab.harvard.seas.edu

Figure 1: The concept of the Data Calculator: computing data access method designs as combinations of a small set of primitives. (Drawing inspired by a figure in the Ph.D. thesis of Gottfried Leibniz, who envisioned an engine that calculates physical laws from a small set of primitives [59].)

1 FROM MANUAL TO INTERACTIVE DESIGN

The Importance of Data Structures. Data structures are at the core of any data-driven software, from relational database systems, NoSQL key-value stores, operating systems, compilers, HCI systems, and scientific data management to any ad-hoc program that deals with ever-growing data. Any operation in any data-driven system/program goes through a data structure whenever it touches data. Any effort to rethink the design of a specific system or to add new functionality typically includes (or even begins with) rethinking how data should be stored and accessed [1, 9, 29, 38, 58, 85, 86]. In this way, the design of data structures has been an active area of research since the onset of computer science, and there is an ever-growing need for alternative designs. This is fueled by 1) the continuous advent of new applications that require tailored storage and access patterns in both industry and science, and 2) new hardware that requires specific storage and access patterns to ensure longevity and maximum utilization. Every year dozens of new data structure designs are published; e.g., more than fifty new designs appeared at ACM SIGMOD, PVLDB, EDBT, and IEEE ICDE in 2017, according to data from DBLP.

A Vast and Complex Design Space. A data structure design consists of 1) a data layout that describes how data is stored, and 2) algorithms that describe how its basic functionality (search, insert, etc.) is achieved over the specific data layout. A data structure can be as simple as an array or arbitrarily complex, using sophisticated combinations of hashing, range and radix partitioning, careful data placement, compression, and encodings. Data structures may also be referred to as “data containers” or “access methods” (in which case the term “structure” applies only to the layout). The data layout design itself may be further broken down into the base data layout and the indexing information that helps navigate the data, i.e., the leaves of a B+tree and its inner nodes, or the buckets of a hash table and the hash-map. We use the term data structure design throughout the paper to refer to the overall design of the data layout, indexing, and the algorithms together as a whole.

∗The name “Calculator” pays tribute to the early works that experimented with the concept of calculating complex objects from a small set of primitives [59].

arXiv:1808.02066v1 [cs.DB] 6 Aug 2018

We define “design” as the set of all decisions that characterize the layout and algorithms of a data structure, e.g., “Should data nodes be sorted?”, “Should they use pointers?”, and “How should we scan them exactly?”. The number of possible valid data structure designs explodes to ≫ 10^32 even if we limit the overall design to only two different kinds of nodes (e.g., as is the case for B+trees). If we allow every node to adopt different design decisions (e.g., based on access patterns), then the number of designs grows to ≫ 10^100.¹ We explain how we derive these numbers in Section 2.

The Problem: Human-Driven Design Only. The design of data structures is a slow process, relying on the expertise and intuition of researchers and engineers who need to mentally navigate the vast design space. For example, consider the following design questions.

(1) We need a data structure for a specific workload: Should we strip down an existing complex data structure? Should we build off a simpler one? Or should we design and build a new one from scratch?

(2) We expect that the workload might shift (e.g., due to new application features): How will performance change? Should we redesign our core data structures?

(3) We add flash drives with more bandwidth and also add more system memory: Should we change the layout of our B-tree nodes? Should we change the size ratio in our LSM-tree?

(4) We want to improve throughput: How beneficial would it be to buy faster disks? More memory? Or should we invest the same budget in redesigning our core data structure?

This complexity leads to a slow design process and has severe cost side-effects [12, 21]. Time to market is of extreme importance, so new data structure design effectively stops when a design “is due” and only rarely when it “is ready”. Thus, the process of design extends beyond the initial design phase to periods of reconsidering the design given bugs or changes in the scenarios it should support. Furthermore, this complexity makes it difficult to predict the impact of design choices, workloads, and hardware on performance. We include two quotes from a systems architect with more than two decades of experience with relational systems and key-value stores.

(1) “I know from experience that getting a new data structure into production takes years. Over several years, assumptions made about the workload and hardware are likely to change, and these changes threaten to reduce the benefit of a data structure. This risk of change makes it hard to commit to multi-year development efforts. We need to reduce the time it takes to get new data structures into production.”

(2) “Another problem is the limited ability we have to iterate. While some changes only require an online schema change, many require a dump and reload for a data service that might be running 24x7. The budget for such changes is limited. We can overcome the limited budget with tools that help us determine the changes most likely to be useful. Decisions today are frequently based on expert opinions, and these experts are in short supply.”

¹For comparison, the estimated number of stars in the universe is 10^24.

Vision Step 1: Design Synthesis from First Principles. We propose a move toward the new design paradigm captured in Figure 1. Our intuition is that most designs (and even inventions) are about combining a small set of fundamental concepts in different ways or tunings. If we can describe the set of the first principles of data structure design, i.e., the core design principles out of which all data structures can be drawn, then we will have a structured way to express all possible designs we may invent, study, and employ as combinations of those principles. An analogy is the periodic table of elements in chemistry. It classifies elements based on their atomic number, electron configuration, and recurring chemical properties. The structure of the table allows one to understand the elements and how they relate to each other, but crucially it also enables arguing about the possible design space; more than one hundred years since the inception of the periodic table in the 19th century, we keep discovering elements that are predicted (synthesized) by the “gaps” in the table, accelerating science.

Our vision is to build the periodic table of data structures so we can express their massive design space. We take the first step in this paper, presenting a set of first principles that can synthesize orders of magnitude more data structure designs than what has been published in the literature. It captures basic hardware-conscious layouts and read operations; future work includes extending the table for additional parts of the design space, such as updates, concurrency, compression, adaptivity, and security.

Vision Step 2: Cost Synthesis from Learned Models. The second step in our vision is to accelerate and automate the design process. Key here is being able to argue about the performance behavior of the massive number of designs so we can rank them. Even with an intuition that a given design is an excellent choice, one has to implement the design, and test it on a given data and query workload and on specific hardware. This process can take weeks at a time and has to be repeated when any part of the environment changes. Can we accelerate this process so we can quickly test alternative designs (or different combinations of hardware, data, and queries) on the order of a few seconds? If this is possible, then we can 1) accelerate design and research of new data structures, and 2) enable new kinds of adaptive systems that can decide core parts of their design, and the right hardware.

Arguing formally about the performance of diverse designs is a notoriously hard problem [14, 66, 82, 85, 87, 88], especially as workload and hardware properties change; even if we can come up with a robust analytical model, it may soon be obsolete [50]. We take a hybrid route using a combination of analytical models, benchmarks, and machine learning for a small set of fundamental access primitives. For example, all pointer-based data structures need to perform random accesses as operations traverse their nodes. All data structures need to perform a write during an update operation, regardless of the exact update strategy. We synthesize the cost of complex operations out of models that describe those simpler, more fundamental operations, inspired by past work on generalized models [66, 85]. In addition, our models start out as analytical models since we know how these primitives will likely behave. However, they are also trained across diverse hardware profiles by running benchmarks that isolate the behavior of those primitives. This way, we learn a set of coefficients for each model that capture the subtle performance details of diverse hardware settings.



Figure 2: The architecture of the Data Calculator: From high-level layout specifications to performance cost calculation.

The Data Calculator: Automated What-if Design. We present a “design engine” – the Data Calculator – that can compute the performance of arbitrary data structure designs as combinations of fundamental design primitives. It is an interactive tool that accelerates the process of design by turning it into an exploration process, improving the productivity of researchers and engineers; it is able to answer what-if data structure design questions to understand how the introduction of new design choices, workloads, and hardware affects the performance (latency) of an existing design. It currently supports read queries for basic hardware-conscious layouts. It allows users to give as input a high-level specification of the layout of a data structure (as a combination of primitives), in addition to workload and hardware specifications. The Data Calculator gives as output a calculation of the latency to run the input workload on the input hardware. The architecture and components of the Data Calculator are captured in Figure 2 (from left to right): (1) a library of fine-grained data layout primitives that can be combined in arbitrary ways to describe data structure layouts; (2) a library of data access primitives that can be combined to generate designs of operations; (3) an operation and cost synthesizer that computes the design of operations and their latency for a given data structure layout specification, a workload, and a hardware profile; and (4) a search component that can traverse part of the design space to supplement a partial data structure specification or inspect an existing one with respect to both the layout and the access design choices.

Inspiration. Our work is inspired by several lines of work across many fields of computer science. John Ousterhout’s project Magic in the area of computer architecture allows for quick verification of transistor designs so that engineers can easily test multiple designs [70]. Leland Wilkinson’s “grammar of graphics” provides structure and formulation on the massive universe of possible graphics one can design [84]. Mike Franklin’s Ph.D. thesis explores the possible client-server architecture designs using caching-based replication as the main design primitive and proposes a taxonomy that produced both published and unpublished (at the time) cache consistency algorithms. Joe Hellerstein’s work on Generalized Search Indexes [6, 7, 38, 54–57] makes it easy to design and test new data structures by providing templates that significantly minimize implementation time. S. Bing Yao’s work on generalized cost models [85] for database organizations, and Stefan Manegold’s work on generalized cost models tailored for the memory hierarchy [65], showed that it is possible to synthesize the costs of database operations from basic access patterns and based on hardware performance properties. Work on data representation synthesis in programming languages [18–20, 22, 35, 36, 64, 77, 79] enables selection and synthesis of representations out of small sets of (3-5) existing data structures. The Data Calculator [46] can be seen as a step toward the Automatic Programmer challenge set by Jim Gray in his Turing award lecture [33], and as a step toward the “calculus of data structures” challenge set by Turing award winner Robert Tarjan [81]: “What makes one data structure better than another for a certain application? The known results cry out for an underlying theory to explain them.”

Contributions. Our contributions are as follows:

(1) We introduce a set of data layout design primitives that capture the first principles of data layouts, including hardware-conscious designs that dictate the relative positioning of data structure nodes (§2).

(2) We show how combinations of the design primitives can describe known data structure designs, including arrays, linked-lists, skip-lists, queues, hash-tables, binary trees and (cache-conscious) b-trees, tries, MassTree, and FAST (§2).

(3) We show that in addition to known designs, the design primitives form a massive space of possible designs that has only been minimally explored in the literature (§2).

(4) We show how to synthesize the latency cost of basic operations (point and range queries, and bulk loading) of arbitrary data structure designs from a small set of access primitives. Access primitives represent fundamental ways to access data and come with learned cost models which are trained on diverse hardware to capture hardware properties (§3).

(5) We show how to use cost synthesis to interactively answer complex what-if design questions, i.e., the impact of changes to design, workload, and hardware (§4).

(6) We introduce a design synthesis algorithm that completes partial layout specifications given a workload and hardware input; it utilizes cost synthesis to rank designs (§4).

(7) We demonstrate that the Data Calculator can accurately compute the performance impact of design choices for state-of-the-art designs and diverse hardware (§5).

(8) We demonstrate that the Data Calculator can accelerate the design process by answering rich design questions in a matter of seconds or minutes (§5).


2 DATA LAYOUT PRIMITIVES AND STRUCTURE SPECIFICATIONS

In this section, we discuss the library of data layout design primitives and how it enables the description of a massive number of both known and unknown data structures.

Data Layout Primitives. The Data Calculator contains a small set of design primitives that represent fundamental design choices when constructing a data structure layout. Each primitive belongs to a class of primitives depending on the high-level design concept it refers to, such as node data organization, partitioning, node physical placement, and node metadata management. Within each class, individual primitives define design choices and allow for alternative tunings. The complete set of primitives we introduce in this paper is shown in Figure 11 in the appendix; they describe basic data layouts and cache-conscious optimizations for reads. For example, “Key Partitioning (none|sorted|k-ary|range|temporal)” defines how data is laid out in a node. Similarly, “Key Retention (none|full|func)” defines whether and how keys are included in a node. In this way, in a B+tree all nodes use “sorted” for order maintenance, while internal nodes use “none” for key retention as they only store fences and pointers, and leaf nodes use “full” for key retention.

The logic we use to generate primitives is that each one should represent a fundamental design concept that does not break down into more useful design choices (otherwise, there would be parts of the design space we could not express). Coming up with the set of primitives is a trial-and-error task of mapping the known space of design concepts to as clean and elegant a set of primitives as possible. Naturally, not all layout primitives can be combined. Most invalid relationships stem from the structure of the primitives, i.e., each primitive combines with every other standalone primitive. Only a few pairs of primitive tunings do not combine, which generates a small set of invalidation rules. These are mentioned in Figure 11.
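To make the shape of the library concrete, the sketch below models a few primitives as typed domains and a node layout as one choice per primitive. The names, the subset of primitives, and the invalidation rule are illustrative assumptions of this sketch, not the tool’s actual API; Figure 11 defines the real library and rules.

// Sketch: layout primitives as typed domains; an element picks one value
// from each domain. Names and the rule below are hypothetical.
enum class KeyPartitioning { None, Sorted, KAry, Range, Temporal };
enum class KeyRetention    { None, Full, Func };

struct ZoneMaps { bool min; bool max; };

struct NodeLayout {
    KeyPartitioning partitioning;
    KeyRetention    retention;
    ZoneMaps        zone_maps;
    // ... the full library has 46 primitives (Figure 11).
};

// A purely hypothetical invalidation rule, to illustrate how a few pairs
// of primitive tunings are excluded from the design space.
bool valid(const NodeLayout& n) {
    return !(n.retention == KeyRetention::Func &&
             n.partitioning == KeyPartitioning::Temporal);
}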

From Layout Primitives to Data Structures. To describe complete data structures, we introduce the concept of elements. An element is a full specification of a single data structure node; it defines the data and access methods used to access the node’s data. An element may be “terminal” or “non-terminal”. That is, an element may be describing a node that further partitions data to more nodes or not. This is done with the “fanout” primitive, whose value represents the maximum number of children that would be generated when a node partitions data. Or it can be set to “terminal”, in which case its value represents the capacity of a terminal node. A data structure specification contains one or more elements. It needs to have at least one terminal element, and it may have zero or more non-terminal elements. Each element has a destination element (except terminal ones) and a source element (except the root). Recursive connections are allowed to the same element.

Examples. A visualization of the primitives can be seen on the left side of Figure 3. It is a flat representation of the primitives shown in Figure 11, which creates an entry for every primitive signature. The radius depicts the domain of each primitive, but different primitives may have different domains, visually depicted via the multiple inner circles in the radar plots of Figure 3. The small radar plots on the right side of Figure 3 depict descriptions of nodes of known data structures as combinations of the base primitives. Even visually it starts to become apparent that state-of-the-art designs which are meant to handle different scenarios are “synthesized from the same pool of design concepts”. For example, using the non-terminal B+tree element and the terminal sorted data page element we can construct a full B+tree specification; data is recursively broken down into internal nodes using the B+tree element until we reach the leaf level, i.e., when partitions reach the terminal node size. Figure 3 also depicts Trie and Skip-list specifications. Figure 11 provides complete specifications of Hash-table, Linked-list, B+tree, Cache-conscious B-tree, and FAST.
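As a concrete illustration of a two-element specification, the sketch below encodes the B+tree example just described: a non-terminal internal element that keeps only fences and pointers, and a terminal sorted data page. Field names and values are assumptions of this sketch, not the Data Calculator’s specification format.

#include <string>
#include <vector>

// One element = a full node specification (one choice per layout primitive;
// only a handful of the 46 primitives are shown here).
struct Element {
    std::string name;
    bool terminal;      // terminal elements hold the key-value pairs
    int  fanout;        // max children (non-terminal) or capacity (terminal)
    bool sorted;        // order maintenance
    bool retains_keys;  // internal B+tree nodes store only fences and pointers
};

// Data is recursively partitioned by the internal element until partitions
// reach the terminal node size, at which point the data page element applies.
std::vector<Element> bplus_tree_spec() {
    return {
        {"B+tree internal",  false, 20, true, false},
        {"sorted data page", true,  64, true, true},
    };
}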

Elements “Without Data”. For flat data structures without an indexing layer, e.g., linked-lists and skip-lists, there need to be elements in the specification that describe the algorithm used to navigate the terminal nodes. Given that this algorithm is effectively a model, it does not rely on any data, and so such elements do not translate to actual nodes; they only affect algorithms that navigate across the terminal nodes. For example, a linked-list element in Figure 11 describes that data is divided into nodes that can only be accessed via following the links that connect terminal nodes. Similarly, one can create complex hierarchies of non-terminal elements that do not store any data but instead their job is to synthesize a collective model of how the keys should be distributed in the data structure, e.g., based on their value or other properties of the workload. These elements may lead to multiple hierarchies of both non-terminal nodes with data and terminal ones, synthesizing data structure designs that treat parts of the data differently. We will see such examples in the experimental analysis.

Recursive Design Through Blocks. A block is a logical portion of the data that we divide into smaller blocks to construct an instance of a data structure specification. The elements in a specification are the “atoms” with which we construct data structure instances by applying them recursively onto blocks. Initially, there is a single block of data: all data. Once all elements have been applied, the original block is broken down into a set of smaller blocks that correspond to the internal nodes (if any) and the terminal nodes of the data structure. Elements without data can be thought of as applying on a logical data block that represents part of the data with a set of specific properties (i.e., all data if this is the first element) and partitioning the data with a particular logic into further logical blocks or physical nodes. This recursive construction is used when we test, cost, and search through multiple possible designs concurrently over the same data for a given workload and hardware, as we will discuss in the next two sections, but it is also helpful to visualize designs as if “data is pushed through the design” based on the elements and logical blocks.
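The recursive construction can be sketched as follows: starting from one block that holds all data, each element either emits a terminal node or splits its block into sub-blocks for its children. The fixed-size splitting below is a simplification; actual partitioning follows each element’s partitioning primitive.

#include <cstddef>
#include <vector>

struct Block { std::size_t num_entries; int depth; };

// Apply elements recursively onto blocks until only node-sized blocks
// remain (fanout must be >= 2 for the split to make progress).
void apply_elements(Block b, std::size_t terminal_capacity, std::size_t fanout,
                    std::vector<Block>& nodes) {
    nodes.push_back(b);                              // this block becomes a node
    if (b.num_entries <= terminal_capacity) return;  // terminal node: stop
    // A non-terminal element splits the block into up to `fanout` sub-blocks.
    std::size_t per_child = (b.num_entries + fanout - 1) / fanout;
    for (std::size_t left = b.num_entries; left > 0;) {
        std::size_t take = left < per_child ? left : per_child;
        apply_elements({take, b.depth + 1}, terminal_capacity, fanout, nodes);
        left -= take;
    }
}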

Figure 3: The data layout primitives and examples of synthesizing node layouts of state-of-the-art data structures.

Cache-Conscious Designs. One critical aspect of data structure design is the relative positioning of its nodes, i.e., how “far” each node is positioned with respect to its predecessors and successors in a query path. This aspect is critical to the overall cost of traversing a data structure. The Data Calculator design space allows dictating explicitly how nodes should be positioned: each non-terminal element defines how its children are positioned physically with respect to each other and with respect to the current node. For example, setting the layout primitive “Sub-block physical layout” to BFS tells the current node that its children are laid out sequentially. In addition, setting the layout primitive “Sub-blocks homogeneous” to true implies that all its children have the same layout (and therefore fixed width), and allows a parent node to access any of its children nodes directly with a single pointer and reference number. This, in turn, makes it possible to fit more data in internal nodes because only one pointer is needed, and thus more fences can be stored within the same storage budget. Such primitives allow specifying designs such as Cache-Conscious B+tree [75] (Figure 11 provides the complete specification), but also the possibility of generalizing the optimizations made there to arbitrary structures.

Similarly, we can describe FAST [51]. First, we set “Sub-block physical location” to inline, specifying that the children nodes are directly after the parent node physically. Second, we set that the children nodes are homogeneous, and finally, we set that the children have a sub-block layout of “BFS Layer List (Page Size / Cache Line Size)”. Here, the BFS layer list specifies that on a higher level, we should have a BFS layout of sub-trees containing Page Size/Cache Line Size layers; however, inside of those sub-trees, pages are laid out in BFS manner by a single level. The combination matches the combined page-level blocking and cache-line-level blocking of FAST. Additionally, the Data Calculator realizes that all child node physical locations can be calculated via offsets, and so eliminates all pointers. Figure 11 provides the complete specification.
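The pointer elimination rests on simple address arithmetic: with homogeneous (fixed-width) children laid out contiguously, the i-th child’s location follows from one base pointer and a reference number. A minimal sketch, with an assumed node size:

#include <cstddef>
#include <cstdint>

constexpr std::size_t kNodeSize = 64;  // homogeneous children: fixed width

// One base pointer + reference number i -> address of the i-th child;
// no per-child pointers need to be stored, leaving room for more fences.
inline const std::uint8_t* child(const std::uint8_t* first_child,
                                 std::size_t i) {
    return first_child + i * kNodeSize;
}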

Size of the Design Space. To help with arguing about the possible design space, we provide formal definitions of the various constructs.

Definition 2.1 (Data Layout Primitive). A primitive p_i belongs to a domain of values P_i and describes a layout aspect of a data structure node.

Definition 2.2 (Data Structure Element). A Data Structure Element E is defined as a set of data layout primitives: E = {p_1, ..., p_n} ∈ P_1 × ... × P_n, that uniquely identify it.

Given Inv(P) invalid combinations, the set of all possible elements ℰ (i.e., node layouts) that can be designed as distinct combinations of data layout primitives has the following cardinality.

|ℰ| = |P_1 × ... × P_n| − Inv(P) = ∏_{∀P_i ∈ E} |P_i| − Inv(P)    (1)

Definition 2.3 (Blocks). Each non-terminal element E ∈ ℰ, applied on a set of data entries D ∈ 𝒟, uses a function B_E(D) = {D_1, ..., D_f} to divide D into f blocks such that D_1 ∪ ... ∪ D_f = D.

A polymorphic design, where every block may be described by a different element, leads to the following recursive formula for the cardinality of all possible designs.

c_poly(D) = |ℰ| + Σ_{∀E ∈ ℰ} Σ_{∀D_i ∈ B_E(D)} c_poly(D_i)    (2)

Example: A Vast Space of Design Opportunities. To get insight into the possible total designs, we make a few simplifying assumptions. Assume the same fanout f across all nodes and terminal node size equal to the page size p_size. Then N = ⌈|D|/p_size⌉ is the total number of pages into which we can divide the data, and h = ⌈log_f(N)⌉ is the height of the hierarchy. We can then approximate the result of Equation 2 by considering that we have |ℰ| possibilities for the root element, and f·|ℰ| possibilities for its resulting partitions, which in turn have f·|ℰ| possibilities each, up to the maximum level of recursion h = ⌈log_f(N)⌉. This leads to the following result.

c_poly(D) ≈ |ℰ| · (f · |ℰ|)^⌈log_f(N)⌉    (3)

Most sophisticated data structure designs use only two distinct elements, each one describing all nodes across groups of levels of the structure; e.g., B-tree designs use one element for all internal nodes and one for all leaves. This gives the following design space for most standard designs.

c_stan(D) ≈ |ℰ|²    (4)

Using Equations 1, 3, and 4 we can get estimates of the possible design space for different kinds of data structure designs. For example, given the existing library of data layout primitives, and by limiting the domain of each primitive as shown in Figure 11 in the appendix, from Equation 1 we get |ℰ| = 10^16, meaning we can describe data structure layouts from a design space of 10^16 possible node elements and their combinations. This number includes only valid combinations of layout primitives, i.e., all invalid combinations as defined by the rules in Figure 11 are excluded. Thus, we have a design space of 10^32 for standard two-element structures (e.g., where B-tree and Trie belong) and 10^48 for three-element structures (e.g., where MassTree [67] and Bounded-Disorder [62] belong). For polymorphic structures, the number of possible designs grows more quickly, and it also depends on the size of the training data used to find a specification; e.g., it is > 10^100 for 10^15 keys.
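To make these magnitudes explicit, the arithmetic behind the two- and three-element numbers is just Equation 4 and its three-element analogue applied to |ℰ| = 10^16 from the text above (a worked sketch):

\[
c_{stan}(D) \approx |\mathcal{E}|^2 = \left(10^{16}\right)^2 = 10^{32},
\qquad
|\mathcal{E}|^3 = \left(10^{16}\right)^3 = 10^{48}.
\]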

The numbers in the above example highlight that data structure design is still a wide-open space with numerous opportunities for innovative designs as data keeps growing, application workloads keep changing, and hardware keeps evolving. Even with hundreds of new data structures manually designed and published each year, this is a slow pace at which to test all possible designs and to be able to argue about how the numerous designs compare. The Data Calculator is a tool that accelerates this process by 1) providing guidance about what the possible design space is, and 2) allowing to quickly test how a given design fits a workload and hardware setting. A more detailed description of the primitives can be found in the appendix.

3 DATA ACCESS PRIMITIVES AND COST SYNTHESIS

We now discuss how the Data Calculator computes the cost (latency) of running a given workload on given hardware for a particular data structure specification. Traditional cost analysis in systems and data structures happens through experiments and the development of analytical cost models. Neither option is scalable when we want to quickly test multiple different parts of the massive design space we define in this paper. They require significant expertise and time, while they are also sensitive to hardware and workload properties. Our intuition is that we can instead synthesize complex operations from their fundamental components, as we do for data layouts in the previous section, and then develop a hybrid way (through both benchmarks and models, but without significant human effort needed) to assign costs to each component individually. The main idea is that we learn a small set of cost models for fine-grained data access patterns out of which we can synthesize the cost of complex dictionary operations for arbitrary designs in the possible design space of data structures.

The middle part of Figure 2 depicts the components of the Data Calculator that make cost synthesis possible: 1) the library of data access primitives, 2) the cost learning module, which trains cost models for each access primitive depending on hardware and data properties, and 3) the operation and cost synthesis module, which synthesizes dictionary operations and their costs from the access primitives and the learned models. Next, we describe the process and components in detail.

Cost Synthesis from Data Access Primitives. Each access primitive characterizes one aspect of how data is accessed. For example, a binary search, a scan, a random read, a sequential read, and a random write are access primitives. The goal is that these primitives should be fundamental enough so that we can use them to synthesize operations over arbitrary designs as sequences of such primitives. There exist two levels of access primitives. Level 1 access primitives are marked with white color in Figure 2, and Level 2 access primitives are nested under Level 1 primitives and marked with gray color. For example, a scan is a Level 1 access primitive used any time an operation needs to search a block of data where there is no order. At the same time, a scan may be designed and implemented in more than one way; this is exactly what Level 2 access primitives represent. For example, a scan may use SIMD instructions for parallelization if keys are nicely packed in vectors, and predication to minimize branch mispredictions within certain selectivity ranges. In the same way, a sorted search may use interpolation search if keys are arranged with uniform distribution. In this way, each Level 1 primitive is a conceptual access pattern, while each Level 2 primitive is an actual implementation that signifies a specific set of design choices. Every Level 1 access primitive has at least one Level 2 primitive and may be extended with any number of additional ones. The complete list of access primitives currently supported by the Data Calculator is shown in Table 1 in the appendix.
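The relationship between the two levels can be sketched as a dispatch: a Level 1 sorted search resolves to one of its Level 2 implementations depending on design and data properties. The selection rule below (uniform keys favor interpolation search) mirrors the example in the text; everything else is an illustrative assumption.

#include <algorithm>
#include <cstddef>
#include <vector>

enum class SortedSearchImpl { BinarySearch, InterpolationSearch };

// Level 1 -> Level 2: pick an implementation from a data property.
SortedSearchImpl choose(bool keys_uniformly_distributed) {
    return keys_uniformly_distributed ? SortedSearchImpl::InterpolationSearch
                                      : SortedSearchImpl::BinarySearch;
}

// Returns the index of the first key >= target (keys.size() if none).
std::size_t sorted_search(const std::vector<int>& keys, int target,
                          SortedSearchImpl impl) {
    if (keys.empty()) return 0;
    if (impl == SortedSearchImpl::BinarySearch)
        return std::lower_bound(keys.begin(), keys.end(), target) -
               keys.begin();
    // Interpolation search: probe where the target should sit if uniform.
    std::size_t lo = 0, hi = keys.size() - 1;
    while (lo < hi && keys[lo] < target) {
        double frac = (double)(target - keys[lo]) / (keys[hi] - keys[lo] + 1);
        std::size_t pos = lo + (std::size_t)(frac * (hi - lo));
        if (pos > hi) pos = hi;  // clamp for targets beyond keys[hi]
        if (keys[pos] < target) lo = pos + 1; else hi = pos;
    }
    return lo;
}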

Learned Cost Models. For every Level 2 primitive, the Data Calculator contains one or more models that describe its performance (latency) behavior. These are not static models; they are trained and fitted for combinations of data and hardware profiles, as both of those factors drastically affect performance. To train a model, each Level 2 primitive includes a minimal implementation that captures the behavior of the primitive, i.e., it isolates the performance effects of performing the specific action. For example, an implementation for a scan primitive simply scans an array, while an implementation for a random access primitive simply tries to access random locations in memory. These implementations are used to run a sequence of benchmarks to collect data for learning a model of the behavior of each primitive. Implementations should be in the target language/environment.

The models are simple parametric models; given the design decision to keep primitives simple (so they can be easily reused), we have the domain expertise to expect what their performance behavior will look like. For example, for scans, we have a strong intuition that they will be linear; for binary searches, that they will be logarithmic; and for random memory accesses, that they will be smoothed-out step functions (based on the probability of caching). These simple models have many advantages: they are interpretable, they train quickly, and they don’t need a lot of data to converge. Through the training process, the Data Calculator learns coefficients of those models that capture hardware properties such as CPU and data movement costs.

Hardware and data profiles hold descriptive information about data and hardware, respectively (e.g., data distribution for data, and CPU, bandwidth, etc. for hardware). When an access primitive is trained on a data profile, it runs on a sample of such data, and when it is trained for a hardware profile, it runs on this exact hardware. Afterward, though, design questions can get accurate cost estimations on arbitrary access method designs without going over the data or having access to the specific machine. Overall, this is an offline process that is done once, and it can be repeated to include new hardware and data profiles or to include new access primitives.

Figure 4: Training and fitting models for Level 2 access primitives and extending the Data Calculator.

Example: Binary Search Model. To give more intuition about how models are constructed, let us consider the case of a Level 2 primitive that binary searches a sorted array, as shown in the upper right part of Figure 4. The primitive contains a code snippet that implements the bare minimum behavior (Step 1 in Figure 4). We observe that the benchmark results (Step 2 in Figure 4) indicate that performance is related to the size of the array by a logarithmic component. As expected, there is also bias, as the relationship for small array sizes (such as just 4 or 8 elements) might not fit a logarithmic function exactly. We additionally add a linear term to capture a small linear dependency on the data size. Thus, the cost of binary searching an array of n elements can be approximated as f(n) = c_1·n + c_2·log n + y_0, where c_1, c_2, and y_0 are coefficients learned through linear regression. The values of these coefficients help us translate the abstract model, f(n) = O(log n), into a realized predictive model which takes into account factors such as CPU speed and the cost of memory accesses across the sorted array for the specific hardware. The resulting fitted model can be seen in Step 3 in the upper right part of Figure 4. The Data Calculator can then use this learned model to query for the performance of binary search within the trained range of data sizes. For example, this would be used when querying a large sorted array as well as a small sorted node of a complex data structure.
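A minimal sketch of the fitting step, assuming measurements (n_i, t_i) collected by the benchmark: ordinary least squares over the features (n, log n, 1) yields the coefficients c_1, c_2, y_0. This illustrates the regression described above; it is not the Data Calculator’s training code.

#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

struct Model { double c1, c2, y0; };  // f(n) = c1*n + c2*log(n) + y0

// Solve the 3x3 normal equations A x = b by Gaussian elimination with
// partial pivoting.
static void solve3(double A[3][3], double b[3], double x[3]) {
    for (int i = 0; i < 3; i++) {
        int p = i;
        for (int r = i + 1; r < 3; r++)
            if (std::fabs(A[r][i]) > std::fabs(A[p][i])) p = r;
        std::swap(b[i], b[p]);
        for (int c = 0; c < 3; c++) std::swap(A[i][c], A[p][c]);
        for (int r = i + 1; r < 3; r++) {
            double m = A[r][i] / A[i][i];
            for (int c = i; c < 3; c++) A[r][c] -= m * A[i][c];
            b[r] -= m * b[i];
        }
    }
    for (int i = 2; i >= 0; i--) {
        x[i] = b[i];
        for (int c = i + 1; c < 3; c++) x[i] -= A[i][c] * x[c];
        x[i] /= A[i][i];
    }
}

// Least-squares fit of benchmark results: sizes n[k] against latencies t[k].
Model fit(const std::vector<double>& n, const std::vector<double>& t) {
    double A[3][3] = {{0}}, b[3] = {0}, x[3];
    for (std::size_t k = 0; k < n.size(); k++) {
        double phi[3] = {n[k], std::log(n[k]), 1.0};  // feature vector
        for (int i = 0; i < 3; i++) {
            for (int j = 0; j < 3; j++) A[i][j] += phi[i] * phi[j];
            b[i] += phi[i] * t[k];
        }
    }
    solve3(A, b, x);
    return {x[0], x[1], x[2]};
}

double predict(const Model& m, double n) {
    return m.c1 * n + m.c2 * std::log(n) + m.y0;
}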

Certain critical aspects of the training process can be automated as part of future research. For example, the data range for which we should train a primitive depends on the memory hierarchy (e.g., size of caches, memory, etc.) of the target machine and on the target setting of the application (i.e., memory only, or also disk/flash, etc.). In turn, this also affects the length of the training process. Overall, such parameters can eventually be handled through high-level knobs, letting the system make the lower-level tuning choices. Furthermore, identification of convergence can also be automated. There exist primitives that require more training than others (e.g., due to more complex code, random access, or sensitivity to outliers), and so the number of benchmarks and data points we collect should not be a fixed decision.

Synthesizing Latency Costs. Given a data layout specification and a workload, the Data Calculator uses Level 1 access primitives to synthesize operations, and subsequently each Level 1 primitive is translated to the appropriate Level 2 primitive to compute the cost of the overall operation. Figure 5 depicts this process and an example specifically for the Get operation. This is an expert system, i.e., a sequence of rules that, based on a given data structure specification, defines how to traverse its nodes.² To read Figure 5, start from the top right corner. The input is a data structure specification, a test data set, and the operation we need to cost, e.g., Get key x. The process simulates populating the data structure with the data to figure out how many nodes exist, the height of the structure, etc. This is because, to accurately estimate the cost of an operation, the Data Calculator needs to take into account the expected state of the data structure at the particular moment in the workload. It does this by recursively dividing the data into blocks given the elements used in the specification.

In the example of Figure 5, the structure contains two elements, one for internal nodes and one for leaves. For every node, the operation synthesis process takes into account the data layout primitives used. For example, if a node is sorted it uses binary search, but if the node is unsorted, it uses a full scan. The rhombuses on the left side of Figure 5 reflect the data layout primitives that operation Get relies on, while the rounded rectangles reflect data access primitives that may be used. For each node, the per-node operation synthesis procedure (starting from the top left side of Figure 5) first checks whether this node is internal by checking whether the node contains keys or values; if not, it proceeds to determine which node it should visit next (left side of the figure), and if yes, it continues to process the data and values (right side of the figure). A non-terminal element leads to the data of this block being split into f new blocks, and the process follows the relevant blocks only, i.e., the blocks that this operation needs to visit to resolve. In the end, the Data Calculator generates an abstract syntax tree with the access patterns of the path it had to go through. This is expressed in terms of Level 1 access primitives (bottom right part of Figure 5). In turn, this is translated to a more detailed abstract syntax tree where all Level 1 access primitives are translated to Level 2 access primitives, along with the estimated cost for each one given the particular data size, hardware input, and any primitive-specific input. The overall cost is then calculated as the sum of all those costs.

²Figure 5 is a subset of the expert system. The complete version can be found in the appendix.
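To ground the flow of Figure 5, here is a compressed sketch of per-node synthesis for Get over a two-element design: each node’s layout primitives choose a Level 1 primitive (sorted search vs. scan), each Level 1 primitive is costed by a learned Level 2 model, and the costs are summed along the root-to-leaf path. The node fields and the placeholder cost models are assumptions of this sketch, not the tool’s real API.

#include <cmath>
#include <cstdio>

struct NodeSpec {
    bool sorted;    // order maintenance primitive
    int  fanout;    // children per internal node
    int  capacity;  // entries per terminal node
};

// Learned Level 2 models (illustrative coefficients for one hardware profile).
double sorted_search_cost(int n) { return 5e-9 * std::log2((double)n) + 2e-8; }
double scan_cost(int n)          { return 1e-9 * n + 1e-8; }
double random_probe_cost()       { return 1e-7; }

// Per level: search the internal node to pick a child, probe the child;
// at the leaf, search the key-value pairs.
double synthesize_get(const NodeSpec& internal, const NodeSpec& leaf,
                      int height) {
    double cost = 0.0;
    for (int level = 0; level < height; level++) {
        cost += internal.sorted ? sorted_search_cost(internal.fanout)
                                : scan_cost(internal.fanout);
        cost += random_probe_cost();  // fetch the chosen child node
    }
    cost += leaf.sorted ? sorted_search_cost(leaf.capacity)
                        : scan_cost(leaf.capacity);
    return cost;
}

int main() {
    NodeSpec internal{true, 20, 0}, leaf{true, 0, 64};
    std::printf("estimated Get latency: %.2e s\n",
                synthesize_get(internal, leaf, /*height=*/3));
}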



Figure 5: Synthesizing the operation and cost for dictionary operation Get, given a data structure specification.

Calculating Random Accesses and Caching Effects. A cru-

cial part in calculating the cost of most data structures is capturing

random memory access costs, e.g., the cost of fetching nodes while

traversing a tree, fetching nodes linked in a hash bucket, etc. If data

is expected to be cold, then this is a rather straightforward case, i.e.,

we may assign the maximum cost a random access is expected to

incur on the target machine. If data may be hot, though, it is a more

involved scenario. For example, in a tree-like structure internal

nodes higher in the tree are much more likely to be at higher levels

of the memory hierarchy during repeated requests. We calculate

such costs using the random memory access primitive, as shown in

the lower right part of Figure 4. Its input is a “region size”, which

is best thought of as the amount of memory we are randomly ac-

cessing into (i.e., we don’t know where in this memory region our

pointer points to). The primitive is trained via benchmarking access

to an increasingly bigger contiguous array (Step 1 in Figure 4). The

results (Step 2 in Figure 4) depict a minor jump from L1 to L2 (we

can see a small bump just after 10^4 elements). The bump from L2 to L3 is much more noticeable, with the cost of accessing memory going from 0.1 × 10^-7 seconds to 0.3 × 10^-7 seconds as the memory size crosses the 128 KB boundary. Similarly, we see a bump from 0.3 × 10^-7 seconds to 1.3 × 10^-7 seconds when we go from L3 to main memory, at the L3 cache size of 16 MB.³ We capture this behavior

as a sum of sigmoid functions, which are essentially smoothed step

functions, using

f(x) = \sum_{i=1}^{k} \frac{c_i}{1 + e^{-k_i(\log x - x_i)}} + y_0

³These numbers are in line with Intel’s VTune.
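As a sketch, the model can be written down directly; the coefficients below are illustrative rather than trained values, and in practice they are learned per machine as described in Section 5.

import math

def random_access_cost(region_size, steps, y0):
    # Sum of sigmoids over log(region size): each (c_i, k_i, x_i) adds a
    # smoothed step of height c_i and steepness k_i at boundary position x_i.
    logx = math.log(region_size)
    return y0 + sum(c / (1.0 + math.exp(-k * (logx - x)))
                    for c, k, x in steps)

# Illustrative steps at the 128 KB (L2->L3) and 16 MB (L3->RAM) boundaries
# observed in Figure 4; the heights and steepness values are made up.
steps = [(0.2e-7, 4.0, math.log(128e3)), (1.0e-7, 4.0, math.log(16e6))]
print(random_access_cost(312, steps, y0=0.1e-7))      # small, cached region
print(random_access_cost(1606552, steps, y0=0.1e-7))  # region spills past L2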

This primitive is used for calculating random access to any physical

or logical region (e.g., a sequence of nodes that may be cached

together). For example, when traversing a tree, the cost synthesis

operation costs random accesses with respect to the amount of

data that may be cached up to this point. That is, for every node

we need to access at Level x of a tree, we account for a region size

that includes all data in all levels of the tree up to Level x. In this

way, accessing a node higher in the tree costs less than a node at

lower levels. The same is true when accessing buckets of a hash

table. We give a detailed step by step example below.

Example: Cache-aware Cost Synthesis. Assume a B-tree-like

specification as follows: two node types, one for internal nodes

and one for leaf nodes. Internal nodes contain fence pointers, are

sorted, balanced, have a fixed fanout of 20 and do not contain any

keys or values. Leaf nodes instead are terminal; they contain both

keys and values, are sorted, have a maximum page size of 250

records, and follow a full columnar format, where keys and values

are stored in independent arrays. The test dataset consists of 10^5

records where keys and values are 8 bytes each. Overall, this means

that we have 400 full data pages, and thus a tree of height 2. The

Data Calculator needs two of its access primitives to calculate the

cost of a Get operation. Every Get query will be routed through

two internal nodes and one leaf node: within each node, it needs

to binary search (through fence pointers for internal nodes and

through keys in leaf nodes) and thus it will make use of the Sorted

Search access primitive. In addition, as a query traverses the tree it

needs to perform a random access for every hop.

Now, let us look in more detail how these two primitives are

used given the exact specification of this data structure. The Sorted

Search primitive takes as input the size of the area over which we


will binary search and the number of keys. The Random Access

primitive takes as input the size of the path so far which allows us

to take into account caching effects. Each query starts by visiting

the root node. The Data Calculator estimates the size of the path

so far to be 312 bytes. This is because the size of the path so far

is in practice equal to the size of the root node, which contains 20 pointers (because the fanout is 20) and 19 fence values, summing up to root = internalnode = 20 × 8 + 19 × 8 = 312 bytes. In this way,

the Data Calculator logs a cost of RandomAccess(312) to access the

root node. Then, it calculates the cost of binary search across 19

fences, thus logging a cost of SortedSearch(RowStore, 19 × 8). It uses the “RowStore” option as fences and pointers are stored as

pairs within each internal node. Now, the access to the root node

is fully accounted for, and the Data Calculator moves on to cost

the access at the next tree level. Now the size of the path so far

is given by accounting for the whole next level in addition to the

root node. This is in total level2 = root + fanout × internalnode = 312 + 20 × 312 = 6552 bytes (due to fanout being 20 we account

for 20 nodes at the next level). Thus to access the next node, the

Data Calculator logs a cost of RandomAccess(6552) and again a

search cost of SortedSearch(RowStore, 19 × 8) to search this node.

The last step is to search the leaf level. Now the size of the path

so far is given by accounting for the whole size of the tree which

is level2 + 400 × (250 × 16) = 1606552 bytes since we have 400 pages at the next level (20 × 20) and each page has 250 records of

key-value pairs (8 bytes each). In this way, the Data Calculator logs

a cost of RandomAccess(1606552) to access the leaf node, followed

by a sorted search of SortedSearch(ColumnStore, 250 × 8) to search the keys. It uses the “ColumnStore” option as keys and values are

stored separately in each leaf in different arrays. Finally, a cost of

RandomAccess(2000) is incurred to access the target value in the

values array (we have 8 × 250 = 2000 bytes of values in each leaf).
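The region-size arithmetic of this example can be replayed with a few lines of Python; the constants simply restate the specification in the text (8-byte keys, values, and pointers, fanout 20, 250 records per leaf, 400 leaves), and the tuple list mirrors the primitives the Calculator logs for one Get.

KEY = VAL = PTR = 8            # bytes per key, value, and pointer
FANOUT = 20                    # internal node fanout
LEAF_RECORDS = 250             # records per leaf page
LEAVES = 400                   # 10^5 records / 250 records per leaf

internal_node = FANOUT * PTR + (FANOUT - 1) * KEY     # 312 bytes
level2 = internal_node + FANOUT * internal_node       # 6552 bytes
tree = level2 + LEAVES * LEAF_RECORDS * (KEY + VAL)   # 1606552 bytes

get_log = [
    ("RandomAccess", internal_node),                    # fetch root
    ("SortedSearch", "RowStore", (FANOUT - 1) * KEY),   # search 19 fences
    ("RandomAccess", level2),                           # fetch level-2 node
    ("SortedSearch", "RowStore", (FANOUT - 1) * KEY),
    ("RandomAccess", tree),                             # fetch leaf
    ("SortedSearch", "ColumnStore", LEAF_RECORDS * KEY),  # search leaf keys
    ("RandomAccess", LEAF_RECORDS * VAL),               # fetch value (2000)
]
for entry in get_log:
    print(entry)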

Sets of Operations. The description above considers a single

operation. The Data Calculator can also compute the latency for a

set of operations concurrently in a single pass. This is effectively the

same process as shown in Figure 5, except that in every recursion we

may follow more than one path and in every step we are computing

the latency for all queries that would visit a given node.

Workload Skew and Caching Effects. Another parameter

that can influence caching effects is workload skew. For exam-

ple, repeatedly accessing the same path of a data structure results

in all nodes in this path being cached with higher probability than

others. The Data Calculator first generates counts of how many

times every node is going to be accessed for a given workload. Us-

ing these counts and the total number of nodes accessed we get a

factor p = count/total that denotes the popularity of a node. Then

to calculate the random access cost to a node for an operation k, a weight w = 1/(p × sid) is used, where sid is the sequence number of

this operation in the workload (refreshed periodically). Frequently

accessed nodes see smaller access costs and vice versa.
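A minimal sketch of this weighting, with our own function and variable names (the Calculator's internals may differ):

from collections import Counter

def access_weights(node_accesses):
    # w = 1 / (p * sid): p is the node's popularity (count / total) and
    # sid is the operation's sequence number in the workload.
    counts = Counter(node_accesses)
    total = len(node_accesses)
    weights = []
    for sid, node in enumerate(node_accesses, start=1):
        p = counts[node] / total
        weights.append(1.0 / (p * sid))   # hot nodes get smaller weights
    return weights

# A skewed workload: the root is on every path and leaf L7 is hot.
print(access_weights(["root", "L7", "root", "L7", "root", "L3"]))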

Training Primitives. All access primitives are trained on warm

caches. This is because they are used to calculate the cost on a node

that is already fetched. The only special case is the Random Access

primitive which is used to calculate the cost of fetching a node.

This is also trained on warm data, though, since the cost synthesis

infrastructure takes care at a higher level to pass the right region

size as discussed; in case this region is big, this can still result in costing a page fault, as large data will not fit in the cache, which is reflected in the Random Access primitive model.

Limitations. For individual queries certain access primitives

are hard to estimate precisely without running the actual code on

an exact data instance. For example, a scan for a point Get may

abort after checking just a few values, or it may need to go all

the way to the end of an array. In this way, while lower or upper

performance bounds can be computed with absolute confidence

for both individual queries and sets of queries, actual performance

estimation works best for sets.

More Operations. The cost of range queries and bulk loading is synthesized as shown in Figure 10 in the appendix.

Extensibility and Cross-pollination. The rationale of having

two Levels of access primitives is threefold. First, it brings a level of

abstraction allowing higher level cost synthesis algorithms to oper-

ate at Level 1 only. Second, it brings extensibility, i.e., we can add

new Level 2 primitives without affecting the overall architecture.

Third, it enhances “cross-pollination” of design concepts captured

by Level 2 primitives across designs. Consider the following ex-

ample. An engineer comes up with a new algorithm to perform

search over a sorted array, e.g., exploiting new hardware instruc-

tions. To test if this can improve performance in her B-tree design,

where she regularly searches over sorted arrays, she codes up a

benchmark for a new sorted search Level 2 primitive and plugs it

in the Calculator as shown in Figure 4. Then the original B-tree

design can be easily tested with and without the new sorted search

across several workloads and hardware profiles without having to

undergo a lengthy implementation phase. At the same time, the

new primitive can now be considered by any data structure design

that contains a sorted array such as an LSM-tree with sorted runs, a

Hash-table with sorted buckets and so on. This allows easy transfer

of ideas and optimizations across designs, a process that usually

requires a full study for each optimization and target design.
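A sketch of the idea in Python; the registry below is a hypothetical stand-in for the Calculator's primitive library, not its real interface.

import math

# Hypothetical registry: several Level 2 implementations (learned cost
# models) behind the single Level 1 operation "SortedSearch".
level2_models = {"SortedSearch": {}}

def register(level1_op, name, cost_model):
    level2_models[level1_op][name] = cost_model

# Trained-by-benchmark models would go here; these values are made up.
register("SortedSearch", "binary_search",
         lambda n: 25e-9 * math.log2(max(n, 2)))
register("SortedSearch", "simd_search",
         lambda n: 15e-9 * math.log2(max(n, 2)))

# Cost the same B-tree node search under either implementation.
for variant in ("binary_search", "simd_search"):
    print(variant, level2_models["SortedSearch"][variant](250))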

4 WHAT-IF DESIGN AND AUTO-COMPLETION

The ability to synthesize the performance cost of arbitrary designs

allows for the development of algorithms that search the possible

design space. We expect there will be numerous opportunities in

this space for techniques that can use this ability to: 1) improve

the productivity of engineers by quickly iterating over designs and

scenarios before committing to an implementation (or hardware),

2) accelerate research by allowing researchers to easily and quickly

test completely new ideas, 3) develop educational tools that allow

for rapid testing of concepts, and 4) develop algorithms for offline

auto-tuning and online adaptive systems that transition between

designs. In this section, we provide two such opportunities for

what-if design and auto-completion of partial designs.

What-if Design. One can form design questions by varying any

one of the input parameters of the Data Calculator: 1) data structure

(layout) specification, 2) hardware profile, and 3) workload (data

and queries). For example, assume one already uses a B-tree-like

design for a given workload and hardware scenario. The Data Cal-

culator can answer design questions such as “What would be the

performance impact if I change my B-tree design by adding a bloom filter in each leaf?”

[Figure 6 here: latency (sec.) vs. number of entries (10^5 to 10^7, log scale), Data Calculator estimates vs. implementations, for Linked-list, Array, Range-partitioned Linked-list, Skip-list, Trie, B+tree, Sorted Array, and Hash-table; rows cover point gets on HW1 (CPU: 64x2.3GHz, L3: 46MB, RAM: 256GB), point gets on HW2 (CPU: 4x2.3GHz, L3: 46MB, RAM: 16GB), and point gets, updates, and range gets on HW3 (CPU: 64x2GHz, L3: 16MB, RAM: 1TB).]

Figure 6: The Data Calculator can accurately compute the latency of arbitrary data structure designs across a diverse set of hardware and for diverse dictionary operations.

The user simply needs to give as input the high-

level specification of the existing design and cost it twice: once

with the original design and once with the bloom filter variation. In

both cases, costing should be done with the original data, queries,

and hardware profile so the results are comparable. In other words,

users can quickly test variations of data structure designs simply

by altering a high-level specification, without having to implement,

debug, and test a new design. Similarly, by altering the hardware or

workload inputs, a given specification can be tested quickly on alter-

native environments without having to actually deploy code to this

new environment. For example, in order to test the impact of new

hardware the Calculator only needs to train its Level 2 primitives

on this hardware, a process that takes a few minutes. Then, one can

test the impact this new hardware would have on arbitrary designs

by running what-if questions without having implementations of

those designs and without accessing the new hardware.
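Operationally, a what-if question boils down to two costing calls over the same workload and hardware; the cost_of function and the specification fields below are hypothetical stand-ins for the Calculator's inputs, shown only to fix the workflow.

def what_if(base_spec, variant_spec, workload, hardware, cost_of):
    # Cost both designs under identical workload and hardware so the
    # results are comparable; a negative delta means the variation helps.
    return (cost_of(variant_spec, workload, hardware)
            - cost_of(base_spec, workload, hardware))

btree = {"leaf": {"sorted": True, "bloomFilters": False}}
btree_blooms = {"leaf": {"sorted": True, "bloomFilters": True}}
# delta = what_if(btree, btree_blooms, workload, hw_profile, cost_of)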

Auto-completion. The Data Calculator can also complete partial

layout specifications given a workload and a hardware profile. The

process is shown in Algorithm 1 in the appendix: The input is a

partial layout specification, data, queries, hardware, and the subset of

the design space that should be considered as part of the solution,

i.e., a list of candidate elements. Starting from the last known point

of the partial specification, the Data Calculator computes the rest

of the missing subtree of the hierarchy of elements. At each step

the algorithm considers a new element as candidate for one of the

nodes of the missing subtree and computes the cost for the different

kinds of dictionary operations present in the workload. This design

is kept only if it is better than all previous ones, otherwise it is

dropped before the next iteration. The algorithm uses a cache to

remember specifications and their costs to avoid recomputation.

This process can also be used to tell if an existing design can be

improved by marking a portion of its specification as “to be tested”.

Solving the search problem completely is an open challenge as the

design space drawn by the Calculator is massive. Here we show a

first step which allows dynamic programming algorithms to select

from a restricted set of elements but numerous algorithmic options

are candidates for future research, e.g., genetic algorithms [44].

5 EXPERIMENTAL ANALYSIS
We now demonstrate the ability of the Data Calculator to help with

rich design questions by accurately synthesizing performance costs.

Implementation. The core implementation of the Data Calcu-

lator is in C++. This includes the expert systems that handle layout

primitives and cost synthesis. A separate module implemented in

Python is responsible for analyzing benchmark results of Level 2

access primitives and generating the learned models. The bench-

marks of Level 2 access primitives are also implemented in C++

such that the learned models can capture performance and hard-

ware characteristics that would affect a full C++ implementation

of a data structure. The learning process for each Level 2 access

primitive is done each time we need to include a new hardware

profile; then, the learned coefficients for each model are passed to

the C++ back-end to be used for cost synthesis during design ques-

tions. For learning we use a standard loss function, i.e., least square

errors, and the actual process is done via standard optimization

libraries, e.g., SciPy’s curve_fit. For models which have non-convex

loss functions, such as the sum of sigmoids model, we algorithmically

set up good initial parameters.
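As an illustration of that training step, the sketch below fits a sum-of-sigmoids model with SciPy's curve_fit on synthetic measurements; sharing one steepness across steps and seeding the step positions near the cache boundaries are simplifications of the initialization the text alludes to, and the data is made up.

import numpy as np
from scipy.optimize import curve_fit

def model(logx, c1, c2, c3, k, x1, x2, x3, y0):
    # Sum of three sigmoids over log(region size), shared steepness k.
    s = lambda c, xc: c / (1.0 + np.exp(-k * (logx - xc)))
    return y0 + s(c1, x1) + s(c2, x2) + s(c3, x3)

# Synthetic benchmark data with steps near 32 KB, 128 KB, and 16 MB.
sizes = np.logspace(3, 8, 60)                      # region sizes in bytes
truth = model(np.log(sizes), 2e-9, 2e-8, 1e-7, 4.0,
              np.log(32e3), np.log(128e3), np.log(16e6), 1e-9)
noisy = truth * np.random.uniform(0.95, 1.05, sizes.size)

# Initialize the step positions at the (known) cache boundaries.
p0 = [1e-9, 1e-8, 1e-7, 2.0,
      np.log(32e3), np.log(128e3), np.log(16e6), 1e-9]
coeffs, _ = curve_fit(model, np.log(sizes), noisy, p0=p0, maxfev=20000)
print(coeffs)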

Accurate Cost Synthesis. In our first experiment we test the abil-

ity to accurately cost arbitrary data structure specifications across

different machines. To do this we compare the cost generated au-

tomatically by the Data Calculator with the cost observed when

testing a full implementation of a data structure. We set up the

experiment as follows. To test with the Data Calculator, we manu-

ally wrote data structure specifications for eight well-known access methods: 1) Array, 2) Sorted Array, 3) Linked-list, 4) Partitioned

Linked-list, 5) Skip-list, 6) Trie, 7) Hash-table, and 8) B+tree. The

Data Calculator was then responsible for generating the design

of operations for each data structure and computing their latency

given a workload. To verify the results against an actual implemen-

tation, we implemented all data structures above. We also imple-

mented algorithms for each of their basic operations: Get, Range Get, Bulk Load, and Update. The first experiment then starts with a data workload of 10^5 uniformly distributed integers and a sequence of 10^2 Get requests, also uniformly distributed.

[Figure 7 here: (a) total latency (sec.) per data structure (Trie, B+tree, Skip-list, Sorted Array, Range-partitioned Linked-list, Hash-table, Linked-list, Array), Data Calculator vs. implementation; (b) training time (sec.) on HW1, HW2, and HW3.]

Figure 7: Computing Bulk-loading cost (left) and Training cost across diverse hardware (right).

[Figure 8 here: (a) latency (microsec.) vs. number of entries (log scale) for CSB+tree on HW1, HW2, and HW3; (b) latency vs. Zipf alpha parameter for B+tree and CSB+tree on HW1; Data Calculator vs. implementation.]

Figure 8: Accurately computing the latency of cache conscious designs in diverse hardware and workloads.

We incrementally insert more data up to a total of 10^7 entries and at each

step we repeat the query workload.

The top row of Figure 6 depicts results using a machine with 64

cores and 256 GB of RAM. It shows the average latency per query

as data grows as computed by the Data Calculator and as observed

when running the actual implementation on this machine. For ease

of presentation results are ranked horizontally from slower to faster

(left to right). The Data Calculator gives an accurate estimation of

the cost across the whole range of data sizes and regardless of the

complexity of the designs. It

can accurately compute the latency of both simple traversals in a

plain array and the latency of more complex access patterns such

as descending a tree and performing random hops in memory.

Diverse Machines and Operations. The rest of the rows in Fig-

ure 6 repeat the same experiment as above using different hardware

in terms of both CPU and memory properties (Rows 2 and 3) and

different operations (Rows 4 and 5). The details of the hardware are

shown on the right side of each row in Figure 6. Regardless of the

machine or operation, the Data Calculator can accurately cost any

design. By training its Level 2 primitives on individual machines

and maintaining a profile for each one of them, it can quickly test

arbitrary designs over arbitrary hardware and operations. Updates

here are simple updates that change the value of a key-value pair

and so they are effectively the same as a point query with an addi-

tional write access. More complex updates that involve restructures

are left for future work both in terms of the design space and cost

synthesis. Finally, Figure 7a) depicts that the Data Calculator can

accurately synthesize the bulk loading costs for all data structures.

Training Access Primitives. Figure 7b) depicts the time needed

to train all Level 2 primitives on a diverse set of machines. Overall,

this is an inexpensive process. It takes merely a few minutes to

train multiple different combinations of data and hardware profiles.

[Figure 9 here: for each scenario, synthesis cost (min.) vs. number of inserts (10^5 to 10^7) and the synthesized hybrid specification; (a) Scenario 1: a hybrid Hash Table/B+tree combining a hash partitioning element, B+tree elements, and data pages, separating the point-get-intensive and write-only parts of the domain; (b) Scenario 2: a hybrid B+tree/Hash Table/Array additionally separating a range-intensive part, with system-page-size data pages for the indexed parts and large-chunk data pages for the write-only part.]

Figure 9: The Data Calculator designs new hybrids of known data structures to match a given workload.

Cache Conscious Designs and Skew. In addition, Figure 8

repeats our base fitting experiment using a cache-conscious design,

Cache Conscious B+tree (CSB). Figure 8a) depicts that the Data

Calculator accurately predicts the performance behavior across

a diverse set of machines, capturing caching effects of growing

data sizes and design patterns where the relative position of nodes

affects tree traversal costs. We use the “Full” design from Cache

Conscious B+tree [75]. Furthermore, Figure 8b) tests the fitting

when the workload exhibits skew. For this experiment Get queries

were generated with a Zipfian distribution: α = {0.5, 1.0, 1.5, 2.0}. Figure 8b) shows that for the implementation results, workload skew improves performance and in fact it improves more for the standard B+tree. This is because the same paths are more likely to be

taken by queries resulting in these nodes being cached more often.

Cache Conscious B+tree improves but at a much slower rate as it

is already optimized for the cache hierarchy. The Data Calculator

is able to synthesize these costs accurately, capturing skew and the

related caching effects.

Rich Design Questions. In our next experiment, we provide in-

sights about the kinds of design questions possible and how long

they may take, working over a B-tree design and a workload of

uniform data and queries: 1 million inserts and 100 point Gets. The

hardware profile used is HW1 (defined in Figure 6). The user asks

"What if we change our hardware to HW3?". It takes the Data Cal-

culator only 20 seconds (all runs are done on HW3) to compute

that the performance would drop. The user then asks: "Is there a

better design for this new hardware and workload if we restrict

search on a specific set of five possible elements?" (from the pool

of elements on the right side of Figure 3). It takes only 47 seconds for

the Data Calculator to compute the best choice. The user then asks

“Would it be beneficial to add a bloom filter in all B-tree leaves?”

The Data Calculator computes in merely 20 seconds that such a

design change would be beneficial for the current workload and

hardware. The next design question is: "What if the query workload

changes to have skew targeting just 0.01% of the key space?" The

Data Calculator computes in 24 seconds that this new workload

would hurt the original design and it computes a better design in

another 47 seconds.

In two of the design phases the user asked “give me a better

design if possible”. We now provide more intuition for this kind of

design question regarding the cost and scalability of computing

such designs as well as the kinds of designs the Data Calculator may

produce to fit a workload. We test two scenarios for a workload

Page 12: The Internals of the Data Calculator · The Internals of the Data Calculator Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo Harvard University ABSTRACT

of mixed reads and writes (uniformly distributed inserts and point

reads) and hardware profile HW3. In the first scenario, all reads

are point queries in 20% of the domain. In the second scenario,

50% of the reads are point reads and touch 10% of the domain,

while the other half are range queries and touch a different (non

intersecting with the point reads) 10% of the domain. We do not

provide the Data Calculator with an initial specification. Given the

composition of the workload, our intuition is that a mix of hashing, B-tree-like indexing (e.g., with quantile nodes and sorted pages), and

a simple log (unsorted pages) might lead to a good design, and so we

instruct the Data Calculator to use those four elements to construct

a design (this is done using Algorithm 1 but starting with an empty specification). Figure 9 depicts the specifications of the resulting data structures. For the first scenario (left side of Figure 9) the Data

Calculator computed a design where a hashing element at the upper

levels of the hierarchy allows us to quickly access data, but then data is split between the write- and read-intensive parts of the domain: simple unsorted pages (like a log) for the writes and B+tree-style indexing for the read-intensive part. For the second scenario (right side of Figure

9), the Data Calculator produces a design which similarly to the

previous one takes care of reads and writes separately but this time

also distinguishes between range and point gets by allowing the

part of the domain that receives point queries to be accessed with

hashing and the rest via B+tree-style indexing. The time needed

for each design question was in the order of a few seconds up

to 30 minutes depending on the size of the sample workload (the

synthesis costs are embedded in Figure 9 for both scenarios). Thus,

the Data Calculator quickly answers complex questions that would

normally take humans days or even weeks to test fully.

6 RELATED WORK
To the best of our knowledge, this is the first work to discuss the

problem of interactive data structure design and to compute the

impact on performance. However, there are numerous areas from

where we draw inspiration and with which we share concepts.

Interactive Design. Conceptually, the work on Magic for layout

on integrated circuits [70] comes closest to our work. Magic uses

a set of design rules to quickly verify transistor designs so they

can be simulated by designers. In other words, a designer may

propose a transistor design and Magic will determine if this is

correct or not. Naturally, this is a huge step especially for hardware

design where actual implementation is extremely costly. The Data

Calculator pushes interactive design one step further to incorporate

cost estimation as part of the design phase by being able to estimate

the cost of adding or removing individual design options which

in turn also allows us to build design algorithms for automatic

discovery of good and bad designs instead of having to build and

test the complete design manually.

Generalized Indexes. One of the stronger connections is the work on Generalized Search Tree Indexes (GiST) [6, 7, 38, 54–57]. GiST

aims to make it easy to extend data structures and tailor them to

specific problems and data with minimal effort. It is a template, an

abstract index definition that allows designers and developers to

implement a large class of indexes. The original proposal focused on

record retrieval only but later work added support for concurrency

[55], a more general API [6], improved performance [54], selectivity

estimation on generated indexes [7] and even visual tools that help

with debugging [56, 57]. While the Data Calculator and GiST share

motivation, they are fundamentally different: GiST is a template to

implement tailored indexes while the Data Calculator is an engine

that computes the performance of a design enabling rich design

questions that compute the impact of design choices before we start

coding, making these two lines of work complementary.

Modular/Extensible Systems and System Synthesizers. A key

part of the Data Calculator is its design library, breaking down a

design space in components and then being able to use any set of

those components as a solution. As such the Data Calculator shares

concepts with the stream of work on modular systems, an idea that

has been explored in many areas of computer science: in databases

for easily adding data types [27, 28, 68, 69, 80] with minimal im-

plementation effort, or plug and play features and whole system

components with clean interfaces [11, 15, 17, 52, 60, 61], as well as

in software engineering [71], computer architecture [70], and net-

works [53]. Since for every area the output and the components are

different, there are particular challenges that have to do with defin-

ing the proper components, interfaces and algorithms. The concept

of modularity is similar in the context of the Data Calculator. The

goal and application of the concept is different though.

Additional Topics. Appendix B discusses additional related topics

such as auto-tuning systems and data representation synthesis in

programming languages.

7 SUMMARY AND NEXT STEPS
Through a new paradigm of first principles of data layouts and

learned cost models, the Data Calculator allows researchers and

engineers to interactively and semi-automatically navigate complex

design decisions when designing or re-designing data structures,

considering new workloads, and hardware. The design space we

presented here includes basic layout primitives and primitives that

enable cache conscious designs by dictating the relative positioning

of nodes, focusing on read only queries. The quest for the first

principles of data structures needs to continue to find the primi-

tives for additional significant classes of designs including updates,

compression, concurrency, adaptivity, graphs, spatial data, version

control management, and replication. Such steps will also require

innovations for cost synthesis. For every design class added (or

even for every single primitive added), the knowledge gained in

terms of the possible data structure designs grows exponentially.

Additional opportunities include full DSLs for data structures, com-

pilers for code generation and eventually certified code [74, 83],

new classes of adaptive systems that can change their core design

on-the-fly, and machine learning algorithms that can search the

whole design space.

8 ACKNOWLEDGMENTS
We thank the reviewers for valuable feedback and direction. Mark

Callaghan provided the quotes on the importance of data structure

design. Harvard DASlab members Yiyou Sun, Mali Akmanalp and

Mo Sun helped with parts of the implementation and the graph-

ics. This work is partially funded by the USA National Science

Foundation project IIS-1452595.


REFERENCES
[1] Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel

Madden. 2013. The Design and Implementation of Modern Column-Oriented

Database Systems. Foundations and Trends in Databases 5, 3 (2013), 197–280.
[2] Dana Van Aken, Andrew Pavlo, Geoffrey J Gordon, and Bohan Zhang. 2017.

Automatic Database Management System Tuning Through Large-scale Machine

Learning. In Proceedings of the ACM SIGMOD International Conference on Man-agement of Data. 1009–1024.

[3] Ioannis Alagiannis, Stratos Idreos, and Anastasia Ailamaki. 2014. H2O: A Hands-

free Adaptive Store. In Proceedings of the ACM SIGMOD International Conferenceon Management of Data. 1103–1114.

[4] Victor Alvarez, Felix Martin Schuhknecht, Jens Dittrich, and Stefan Richter. 2014.

Main Memory Adaptive Indexing for Multi-Core Systems. In Proceedings of the International Workshop on Data Management on New Hardware (DAMON). 3:1–3:10.

[5] Michael R. Anderson, Dolan Antenucci, Victor Bittorf, Matthew Burgess,

Michael J. Cafarella, Arun Kumar, Feng Niu, Yongjoo Park, Christopher Ré, and Ce

Zhang. 2013. Brainwash: A Data System for Feature Engineering. In Proceedingsof the Biennial Conference on Innovative Data Systems Research (CIDR).

[6] Paul M Aoki. 1998. Generalizing "Search" in Generalized Search Trees (Extended

Abstract). In Proceedings of the IEEE International Conference on Data Engineering(ICDE). 380–389.

[7] Paul M Aoki. 1999. How to Avoid Building DataBlades That Know the Value of

Everything and the Cost of Nothing. In Proceedings of the International Conferenceon Scientific and Statistical Database Management (SSDBM). 122–133.

[8] Joy Arulraj, Andrew Pavlo, and PrashanthMenon. 2016. Bridging the Archipelago

between Row-Stores and Column-Stores for Hybrid Workloads. In Proceedings ofthe ACM SIGMOD International Conference on Management of Data.

[9] Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos

Idreos, Anastasia Ailamaki, and Mark Callaghan. 2016. Designing Access Meth-

ods: The RUM Conjecture. In Proceedings of the International Conference onExtending Database Technology (EDBT). 461–466.

[10] Shivnath Babu, Nedyalko Borisov, Songyun Duan, Herodotos Herodotou, and

Vamsidhar Thummala. 2009. Automated Experiment-Driven Management of

(Database) Systems. In Proceedings of the Workshop on Hot Topics in OperatingSystems (HotOS).

[11] Don S Batory, J R Barnett, J F Garza, K P Smith, K Tsukuda, B C Twichell, and

T E Wise. 1988. GENESIS: An Extensible Database Management System. IEEETransactions on Software Engineering (TSE) 14, 11 (1988), 1711–1730.

[12] Philip A. Bernstein and David B. Lomet. 1987. CASE Requirements for Extensible

Database Systems. IEEE Data Engineering Bulletin 10, 2 (1987), 2–9.

[13] Renata Borovica-Gajic, Stratos Idreos, Anastasia Ailamaki, Marcin Zukowski,

and Campbell Fraser. 2015. Smooth Scan: Statistics-Oblivious Access Paths.

In Proceedings of the IEEE International Conference on Data Engineering (ICDE).315–326.

[14] Alfonso F. Cardenas. 1973. Evaluation and Selection of File Organization - A

Model and System. Commun. ACM 16, 9 (1973), 540–548.

[15] Michael J Carey and David J DeWitt. 1987. An Overview of the EXODUS Project.

IEEE Data Engineering Bulletin 10, 2 (1987), 47–54.

[16] Surajit Chaudhuri and Vivek R. Narasayya. 1997. An Efficient Cost-Driven

Index Selection Tool for Microsoft SQL Server. In Proceedings of the InternationalConference on Very Large Data Bases (VLDB). 146–155.

[17] Surajit Chaudhuri and Gerhard Weikum. 2000. Rethinking Database System

Architecture: Towards a Self-Tuning RISC-Style Database System. In Proceedingsof the International Conference on Very Large Data Bases (VLDB). 1–10.

[18] D. Cohen and N. Campbell. 1993. Automating Relational Operations on Data

Structures. IEEE Software 10, 3 (1993), 53–60.
[19] E. Schonberg, J.T. Schwartz and Sharir. 1979. Automatic Data Structure Selection

in SETL. In Proceedings of the ACM Symposium on Principles of ProgrammingLanguages.

[20] E. Schonberg, J.T. Schwartz and Sharir. 1981. An Automatic Technique for

Selection of Data Representations in SETL Programs. ACM Transactions onProgramming Languages and Systems 3, 2 (1981), 126–143.

[21] Alvin Cheung. 2015. Towards Generating Application-Specific Data Management

Systems. In Proceedings of the Biennial Conference on Innovative Data SystemsResearch (CIDR).

[22] Y. Smaragdakis, and D. Batory. 1997. DiSTiL: A Transformation Library for Data

Structures. In Proceedings of the Conference on Domain-Specific Languages (DSL).

[23] Niv Dayan, Manos Athanassoulis, and Stratos Idreos. 2017. Monkey: Optimal

Navigable Key-Value Store. In Proceedings of the ACM SIGMOD InternationalConference on Management of Data. 79–94.

[24] Niv Dayan and Stratos Idreos. 2018. Dostoevsky: Better Space-Time Trade-Offs for

LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging..

In Proceedings of the ACM SIGMOD International Conference on Management ofData.

[25] Martin Dietzfelbinger, Torben Hagerup, Jyrki Katajainen, and Martti Pentto-

nen. 1997. A Reliable Randomized Algorithm for the Closest-Pair Problem. J. Algorithms (1997).

[26] Jens Dittrich and Alekh Jindal. 2011. Towards a One Size Fits All Database

Architecture. In Proceedings of the Biennial Conference on Innovative Data SystemsResearch (CIDR). 195–198.

[27] David Goldhirsch and Jack A Orenstein. 1987. Extensibility in the PROBE Data-

base System. IEEE Data Engineering Bulletin 10, 2 (1987), 24–31.

[28] Goetz Graefe. 1994. Volcano - An Extensible and Parallel Query Evaluation

System. IEEE Transactions on Knowledge and Data Engineering (TKDE) 6, 1 (Feb. 1994), 120–135.

[29] Goetz Graefe. 2011. Modern B-Tree Techniques. Foundations and Trends inDatabases 3, 4 (2011), 203–402.

[30] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi Kuno, and Stefan Manegold.

2012. Concurrency control for adaptive indexing. Proceedings of the VLDBEndowment 5, 7 (2012), 656–667.

[31] Goetz Graefe, Felix Halim, Stratos Idreos, Harumi A Kuno, Stefan Manegold, and

Bernhard Seeger. 2014. Transactional support for adaptive indexing. The VLDBJournal 23, 2 (2014), 303–328.

[32] Goetz Graefe, Stratos Idreos, Harumi Kuno, and Stefan Manegold. 2010. Bench-

marking adaptive indexing. In Proceedings of the TPC Technology Conference onPerformance Evaluation, Measurement and Characterization of Complex Systems(TPCTC). 169–184.

[33] Jim Gray. 2000. What Next? A Few Remaining Problems in Information Technology.

ACM SIGMOD Digital Symposium Collection 2, 2 (2000).

[34] Richard A Hankins and Jignesh M Patel. 2003. Data Morphing: An Adaptive,

Cache-Conscious Storage Technique. In Proceedings of the International Confer-ence on Very Large Data Bases (VLDB). 417–428.

[35] Peter Hawkins, Alex Aiken, Kathleen Fisher, Martin C Rinard, and Mooly Sa-

giv. 2011. Data Representation Synthesis. In Proceedings of the ACM SIGPLANConference on Programming Language Design and Implementation (PLDI). 38–49.

[36] Peter Hawkins, Alex Aiken, Kathleen Fisher, Martin C Rinard, and Mooly Sagiv.

2012. Concurrent data representation synthesis. In Proceedings of the ACMSIGPLANConference on Programming Language Design and Implementation (PLDI).417–428.

[37] MaxHeimel, Martin Kiefer, and VolkerMarkl. 2015. Self-Tuning, GPU-Accelerated

Kernel Density Models for Multidimensional Selectivity Estimation. In Proceed-ings of the ACM SIGMOD International Conference on Management of Data. 1477–1492.

[38] Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. 1995. Generalized

Search Trees for Database Systems. In Proceedings of the International Conferenceon Very Large Data Bases (VLDB). 562–573.

[39] Stratos Idreos. 2010. Database Cracking: Towards Auto-tuning Database Kernels.Ph.D. Dissertation. University of Amsterdam.

[40] Stratos Idreos, Ioannis Alagiannis, Ryan Johnson, and Anastasia Ailamaki. 2011.

Here are my Data Files. Here are my Queries. Where are my Results?. In Pro-ceedings of the Biennial Conference on Innovative Data Systems Research (CIDR).57–68.

[41] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Database Cracking.

In Proceedings of the Biennial Conference on Innovative Data Systems Research(CIDR).

[42] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2007. Updating a Cracked

Database. In Proceedings of the ACM SIGMOD International Conference on Man-agement of Data. 413–424.

[43] Stratos Idreos, Martin L. Kersten, and Stefan Manegold. 2009. Self-organizing

Tuple Reconstruction in Column-Stores. In Proceedings of the ACM SIGMODInternational Conference on Management of Data. 297–308.

[44] Stratos Idreos, Lukas M. Maas, and Mike S. Kester. 2017. Evolutionary Data

Systems. CoRR abs/1706.05714 (2017). arXiv:1706.05714

[45] Stratos Idreos, Stefan Manegold, Harumi Kuno, and Goetz Graefe. 2011. Merging

What’s Cracked, Cracking What’s Merged: Adaptive Indexing in Main-Memory

Column-Stores. Proceedings of the VLDB Endowment 4, 9 (2011), 586–597.
[46] Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, and

Demi Guo. 2018. The Data Calculator: Data Structure Design and Cost Synthe-

sis from First Principles and Learned Cost Models.. In Proceedings of the ACMSIGMOD International Conference on Management of Data.

[47] Yannis E Ioannidis and Eugene Wong. 1987. Query Optimization by Simulated

Annealing. In Proceedings of the ACM SIGMOD International Conference on Man-agement of Data. 9–22.

[48] Panagiotis Karras, Artyom Nikitin, Muhammad Saad, Rudrika Bhatt, Denis An-

tyukhov, and Stratos Idreos. 2016. Adaptive Indexing over Encrypted Numeric

Data. In Proceedings of the ACM SIGMOD International Conference on Managementof Data. 171–183.

[49] Oliver Kennedy and Lukasz Ziarek. 2015. Just-In-Time Data Structures. In Pro-ceedings of the Biennial Conference on Innovative Data Systems Research (CIDR).

[50] Michael S. Kester, Manos Athanassoulis, and Stratos Idreos. 2017. Access Path

Selection in Main-Memory Optimized Data Systems: Should I Scan or Should I Probe?. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 715–730.

[51] Changkyu Kim, Jatin Chhugani, Nadathur Satish, Eric Sedlar, Anthony D Nguyen,

TimKaldewey, VictorW Lee, Scott A Brandt, and Pradeep Dubey. 2010. FAST: Fast

Architecture Sensitive Tree Search on Modern CPUs and GPUs. In Proceedings ofthe ACM SIGMOD International Conference on Management of Data. 339–350.

[52] Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building

Efficient Query Engines in a High-Level Language. Proceedings of the VLDBEndowment 7, 10 (2014), 853–864.

[53] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M Frans Kaashoek.

2000. The Click Modular Router. ACM Transactions on Computer Systems (TOCS)18, 3 (2000), 263–297.

[54] Marcel Kornacker. 1999. High-Performance Extensible Indexing. In Proceedingsof the International Conference on Very Large Data Bases (VLDB). 699–708.

[55] Marcel Kornacker, C Mohan, and Joseph M. Hellerstein. 1997. Concurrency

and Recovery in Generalized Search Trees. In Proceedings of the ACM SIGMODInternational Conference on Management of Data. 62–72.

[56] Marcel Kornacker, Mehul A. Shah, and Joseph M. Hellerstein. 1998. amdb: An

Access Method Debugging Tool. In Proceedings of the ACM SIGMOD InternationalConference on Management of Data. 570–571.

[57] Marcel Kornacker, Mehul A. Shah, and Joseph M. Hellerstein. 2003. Amdb: A

Design Tool for Access Methods. IEEE Data Engineering Bulletin 26, 2 (2003),

3–11.

[58] Tobin J Lehman and Michael J Carey. 1986. A Study of Index Structures for

Main Memory Database Management Systems. In Proceedings of the InternationalConference on Very Large Data Bases (VLDB). 294–303.

[59] Gottfried Wilhelm Leibniz. 1666. Dissertation on the Art of Combinations. PhD Thesis, Leipzig University (1666).

[60] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. LLAMA:

A Cache/Storage Subsystem for Modern Hardware. Proceedings of the VLDBEndowment 6, 10 (2013), 877–888.

[61] Justin J. Levandoski, David B. Lomet, and Sudipta Sengupta. 2013. The Bw-Tree:

A B-tree for New Hardware Platforms. In Proceedings of the IEEE InternationalConference on Data Engineering (ICDE). 302–313.

[62] Witold Litwin and David B. Lomet. 1986. The Bounded Disorder Access Method.

In Proceedings of the IEEE International Conference on Data Engineering (ICDE).38–48.

[63] Zezhou Liu and Stratos Idreos. 2016. Main Memory Adaptive Denormalization.

In Proceedings of the ACM SIGMOD International Conference on Management ofData. 2253–2254.

[64] Calvin Loncaric, Emina Torlak, and Michael D Ernst. 2016. Fast Synthesis of

Fast Collections. In Proceedings of the ACM SIGPLAN Conference on ProgrammingLanguage Design and Implementation (PLDI). 355–368.

[65] Stefan Manegold. 2002. Understanding, modeling, and improving main-memory

database performance. Ph.D. Thesis. University of Amsterdam (2002).

[66] Stefan Manegold, Peter A. Boncz, and Martin L. Kersten. 2002. Generic Database

Cost Models for Hierarchical Memory Systems. In Proceedings of the InternationalConference on Very Large Data Bases (VLDB). 191–202.

[67] Yandong Mao, Eddie Kohler, and Robert Tappan Morris. 2012. Cache Craftiness for

Fast Multicore Key-value Storage. In Proceedings of the ACM European Conferenceon Computer Systems (EuroSys). 183–196.

[68] John McPherson and Hamid Pirahesh. 1987. An Overview of Extensibility in

Starburst. IEEE Data Engineering Bulletin 10, 2 (1987), 32–39.

[69] Sylvia L Orborn. 1987. Extensible Databases and RAD. IEEE Data EngineeringBulletin 10, 2 (1987), 10–15.

[70] John K Ousterhout, Gordon T Hamachi, Robert N Mayo, Walter S Scott, and

George S Taylor. 1984. Magic: A VLSI Layout System. In Proceedings of the DesignAutomation Conference (DAC). 152–159.

[71] David Lorge Parnas. 1979. Designing Software for Ease of Extension and Con-

traction. IEEE Transactions on Software Engineering (TSE) 5, 2 (1979), 128–138.[72] Eleni Petraki, Stratos Idreos, and Stefan Manegold. 2015. Holistic Indexing in

Main-memory Column-stores. In Proceedings of the ACM SIGMOD InternationalConference on Management of Data.

[73] Holger Pirk, Eleni Petraki, Stratos Idreos, Stefan Manegold, and Martin L. Kersten.

2014. Database cracking: fancy scan, not poor man’s sort!. In Proceedings of theInternational Workshop on Data Management on New Hardware (DAMON). 1–8.

[74] Xiaokang Qiu and Armando Solar-Lezama. 2017. Natural synthesis of provably-

correct data-structure manipulations. PACMPL 1 (2017), 65:1–65:28.

[75] Jun Rao and Kenneth A. Ross. 2000. Making B+-trees Cache Conscious in Main

Memory. In Proceedings of the ACM SIGMOD International Conference on Man-agement of Data. 475–486.

[76] Felix Martin Schuhknecht, Alekh Jindal, and Jens Dittrich. 2013. The Uncracked

Pieces in Database Cracking. Proceedings of the VLDB Endowment 7, 2 (2013),97–108.

[77] Ohad Shacham, Martin T Vechev, and Eran Yahav. 2009. Chameleon: Adap-

tive Selection of Collections. In Proceedings of the ACM SIGPLAN Conference onProgramming Language Design and Implementation (PLDI). 408–418.

Function CompleteDesign(Q, E, l, currentPath = [], H)
    if blockReachedMinimumSize(H.page_size) then
        return END_SEARCH
    if !meaningfulPath(currentPath, Q, l) then
        return END_SEARCH
    if (cacheHit = cachedSolution(Q, l, H)) != null then
        return cacheHit
    bestSolution = initializeSolution(cost = ∞)
    for E ∈ E do
        tmpSolution = initializeSolution()
        tmpSolution.cost = synthesizeGroupCost(E, Q)
        updateCost(E, Q, tmpSolution.cost)
        if createsSubBlocks(E) then
            Q′ = createQueryBlocks(Q)
            currentPath.append(E)
            subSolution = CompleteDesign(Q′, E, l + 1, currentPath)
            if subSolution.cost != END_SEARCH then
                tmpSolution.append(subSolution)
        if tmpSolution.cost ≤ bestSolution.cost then
            bestSolution = tmpSolution
    cacheSolution(Q, l, bestSolution)
    return bestSolution

Algorithm 1: Complete a partial data structure layout specification.

[78] Daniel Dominic Sleator and Robert Endre Tarjan. 1985. Self-Adjusting Binary

Search Trees. J. ACM 32, 3 (1985), 652–686.

[79] Michael J Steindorfer and Jurgen J Vinju. 2016. Towards a Software Product

Line of Trie-Based Collections. In Proceedings of the ACM SIGPLAN InternationalConference on Generative Programming: Concepts and Experiences (GPCE). 168–172.

[80] Michael Stonebraker, Jeff Anton, and Michael Hirohama. 1987. Extendability in

POSTGRES. IEEE Data Engineering Bulletin 10, 2 (1987), 16–23.

[81] R.E. Tarjan. 1978. Complexity of combinatorial algorithms. SIAM Rev (1978).

[82] Toby J. Teorey and K. Sundar Das. 1976. Application of an Analytical Model to

Evaluate Storage Structures. In Proceedings of the ACM SIGMOD InternationalConference on Management of Data.

[83] Peng Wang, Di Wang, and Adam Chlipala. 2017. TiML: a functional language for

practical complexity analysis with invariants. PACMPL 1 (2017), 79:1–79:26.

[84] Leland Wilkinson. 2005. The Grammar of Graphics. Springer (2005).
[85] S Bing Yao. 1977. An Attribute Based Model for Database Access Cost Analysis. ACM Transactions on Database Systems (TODS) 2, 1 (1977), 45–67.
[86] S Bing Yao and D DeJong. 1978. Evaluation of Database Access Paths. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 66–77.

[87] S. Bing Yao and Alan G. Merten. 1975. Selection of File Organization Using an

Analytic Model. In Proceedings of the International Conference on Very Large DataBases.

[88] Ming Zhou. 1999. Generalizing Database Access Methods. Ph.D. Thesis. Universityof Waterloo (1999).

[89] Kostas Zoumpatianos, Stratos Idreos, and Themis Palpanas. 2014. Indexing for

interactive exploration of big data series. In Proceedings of the ACM SIGMODInternational Conference on Management of Data. 1555–1566.

A APPENDIX INTRODUCTION
This appendix provides more detailed specifications for internal

components and the design space of the Data Calculator. Section B

discusses additional related work. Sections C and D provide detailed

specifications of the layout and access primitives respectively. In

Section E, we describe the complete expert system which is used

for cost synthesis for the Get operation. In Section F, we provide

the complete specification of all data structures used as input to the

Data Calculator for the experiments and the corresponding output

cost (Section G). Finally, we provide examples of the exact output

which includes a dot and a JSON file that describe the design and

an instance of a resulting data structure.


[Figure 10 here: (a) Range Gets: queries carry a provenience mark (probe, sorted, or unsorted); a sorted element distributes a query only to intersecting sub-blocks (marking them sorted), an unsorted element distributes it to all sub-blocks (marking them unsorted), and intermediate nodes are costed only for the left-most part of a range query when the previous element was unsorted and the current one is sorted (and thus provides structure); (b) Bulk Loading: per-block recursion that accumulates data in multiple distinct blocks, checking the sorted and partitioning-function properties and accounting for function call, memory allocation, memory write, and sort costs.]

Figure 10: Cost synthesis for Range Gets and Bulk Loading.

B ADDITIONAL RELATED AREAS
Auto-tuning and Adaptive Systems. Work on tuning [16, 47]

and adaptive systems is also relevant as conceptually any adaptive

technique tunes along a part of the design space. For example, work

on hybrid data layouts and adaptive indexing automates selection of

the right layout [3, 4, 8, 13, 23, 24, 26, 30–32, 34, 39–43, 45, 48, 49, 63,

72, 73, 76, 78, 89]. Typically, in these lines of work the layout adapts

to incoming requests. Similarly, works on tuning via experiments

[10], learning [5], and tuning via machine learning [2, 37] can

adapt parts of a design using feedback from tests. While there are

shared concepts with these lines of work, they are all restricted

to much smaller design spaces, typically to solve a very specific

systems bottleneck, e.g., incrementally building a specific index or

smoothly transitioning among specific layouts. The Data Calculator,

on the other hand, provides a generic framework to argue about

the whole design space of data layouts. Its capability to quickly test

the potential performance of a design can potentially lead to new

adaptive techniques that will also leverage experience in existing

adaptive systems literature to adapt across the massive space drawn

by the Data Calculator.

Data Representation Synthesis. Data representation synthesis

aims for programming languages that automate data structure se-

lection. SETL [19, 20] was the first language to generate structures

in the 70s as combinations of existing data structures: array, and

linked hash table. A series of works kept providing further function-

ality, and expanding on the components used [18, 22, 35, 36, 64, 77].

Cozy [64] is the latest system; it supports complex retrieval opera-

tions such as disjunctions, negations, and inequalities, and uses

a library of five data structures: array (sorted and plain), linked list,

binary tree, and hash map. These works compose data structure

designs out of a small set of existing data structures. This is parallel

to the tuning and access path selection problem in databases. The

Data Calculator introduces a new vision for what-if design and

focuses on two new dimensions: 1) design out of fine-grained prim-

itives, and 2) calculation of the performance cost given a hardware

profile and a workload. The focus on fine-grained primitives en-

ables exploration of a massive design space. For example, using the

equations of Section 2 for homomorphic two-node designs, a fixed

design space of 5 possible elements can generate 25 designs, while

the Data Calculator can generate 1032

designs. The gap grows for

polymorphic designs, i.e, 2 ∗ 109 for a 5 element library, while the

Data Calculator can generate up to 1.6 ∗ 1055 valid designs (for a

10M dataset and 4K pages). In addition, the focus on cost synthesis

through learned models of fine-grained access primitives means

that we can capture hardware and data properties for arbitrary de-

signs. Array Mapped Tries [79] use fine-grained primitives, but the

focus is only on trie-based collections and without cost synthesis.

C DATA LAYOUT PRIMITIVES

In this section we list and describe in detail all 21 primitives in the design space currently supported by the Data Calculator. A summary of the primitives is also shown in Figure 11.
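To make the list below concrete, the following minimal sketch shows how a single element could be described as an assignment of values to these primitives. The dictionary keys mirror the primitive names of this section, and the values follow those that Figure 11 reports for a classic B+Tree internal node; the spelling of keys and values here is illustrative only, not the Data Calculator's actual input API.

```python
# Illustrative only: a B+Tree-style internal node expressed as an assignment
# of values to the 21 data layout primitives described in this section.
btree_internal_node = {
    "key_retention": "no",                  # internal nodes hold no real key data
    "value_retention": "no",
    "key_value_layout": None,               # n/a: no keys or values retained
    "intra_node_access": "direct",          # each sub-block directly addressable
    "utilization": ">=50%",
    "bloom_filters": "off",
    "zone_map_filters": "min",              # min fence key per sub-block
    "filters_memory_layout": "consolidate",
    "fanout": ("fixed", 20),
    "key_partitioning": ("data-dependent", "sorted"),
    "sub_block_capacity": "balanced",
    "immediate_node_links": "none",
    "skip_node_links": "none",
    "area_links": "none",
    "sub_block_physical_location": "pointed",
    "sub_block_physical_layout": "scatter",
    "sub_blocks_homogeneous": True,
    "sub_block_consolidation": False,
    "sub_block_instantiation": "lazy",
    "links_location": None,                 # n/a: no links used
    "recursion": ("yes", "logn"),           # recurse to depth ~log n, then leaves
}
```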

(1) Key retention. Domain: yes | no | func
Elements can hold key data within them, apart from partitioning it into sub-blocks. This data "retention" can either be full (by setting the value to yes), like in simple arrays, where the keys are stored, or partial (defined by a function), like in tries, where only a part of the key is stored. Alternatively, when "no" is set, elements merely act as partitioners without retaining any data, like in hash-maps.
Examples:
- Data pages and arrays fully store keys.
- A hash map contains no key data.
- Tries retain only a part of the key at every level.

(2) Value retention. Domain: yes | no | func
Elements can hold value data within them, apart from partitioning it into sub-blocks. This data "retention" can either be full (by setting the value to yes), like in simple arrays, where the values are stored, or partial (defined by a function). Alternatively, when "no" is set, elements merely act as partitioners without retaining any data, like in hash-maps.
Examples:
- Data pages and arrays fully store values.
- A hash map contains no value data.

(3) Key-value layout. Domain: columnar | row-wise | col-row-groups
Keys and values can be stored in a columnar fashion, a row-wise fashion, or a row-groups fashion. In columnar storage all keys are stored contiguously and, likewise, all values. In row-wise storage they are stored as key-value pairs. In col-row-groups storage data are grouped in chunks of rows, but within each chunk keys and values are each stored contiguously.
[Illustration: key and value layout — columnar (separate K and V arrays), row-wise (interleaved K V pairs), row-groups, or none.]

(4) Intra-node access. Domain: direct | head-link | tail-link | func
Description: Determines how we can access sub-blocks. Direct access means that we can directly access each distinct sub-block. Head-link means that we can find the first sub-block through a link. Tail-link means that we can find the last sub-block through a link. Function means that we can get a link to one or more of the sub-blocks using a function.
Examples:
- A linked list has head or tail links.
- A B+Tree internal node allows us to directly address each sub-tree.
[Illustration: a logical block with direct access, a head link, or a tail link to its sub-blocks.]

(5) Utilization. Domain: none | >= | func
Utilization constraints specify how much empty space is allowed. They can be given as an arbitrary function or as a "greater than or equal to a fraction of capacity" check.
Examples:
- B+Trees can have a constraint that leaves are at least half full.
[Illustration: a logical block with a >50% sub-block utilization constraint.]

(6) Bloom filters. Domain: off | on(num-hashes, num-bits)
If this is set to on, bloom filters per sub-block are used to filter reads. The parameters specified as arguments to on configure the bloom filters.
Examples:
- LSM-trees use bloom filters per sub-block (run or level) to skip large chunks of data.

(7) Zone map filters. Domain: min | max | both | exact | off
If this is set to min, the minimum key under a sub-block is kept for filtering reads (as a fence pointer). If set to max, the maximum key under a sub-block is kept for filtering reads (as a fence pointer). If set to both, both minimum and maximum keys are kept. If set to exact, the exact key of each sub-block is kept.
Examples:
- B+Trees use zone maps in internal nodes to guide search.
[Illustration: a logical block with bloom filters (BF) or fence pointers (FP) per sub-block.]

(8) Filters memory layout. Domain: consolidate | scatter
Filters like zone maps or bloom filters can either be stored contiguously for an element or scattered and stored per sub-block. For example, all minimum values could be consolidated in a single array, thus allowing binary search. Alternatively, filters can be scattered per sub-block and stored with it.
Examples:
- B+Trees consolidate zone maps.
- A linked list of very large pages could include per-page headers that contain scattered zone maps.
[Illustration: consolidated vs. scattered bloom filters (BF) and fence pointers (FP).]

(9) Fanout/Radix. Domain: fixed(size) | func | unlimited | terminal(capacity)
Description: This primitive allows us to specify the type and number of sub-blocks for this element. Unlimited means any number of sub-blocks is allowed. For fixed, the number of sub-blocks is static. If it is set to function, the number of sub-blocks is determined by the result of a function, while terminal means that there are no sub-blocks and this element absorbs all data up to a certain capacity.
Examples:
- Linked lists have an unlimited number of sub-blocks, as new sub-blocks can always be appended to the structure.
- Traditional B+Tree nodes have a fixed number of sub-blocks, as the fanout of each node is fixed a priori.
- Partitioned arrays with fixed partition size can have a functional fanout determined by the maximum number of entries to be inserted.
[Illustration: a logical block with fixed/functional, unlimited, or terminal (fixed capacity) sub-blocks.]

(10) Key/fence partitioning. Domain: append(fw|bw) | data-dependent(sorted | k-ary | func) | data-independent(range | radix | func) | temporal(size-ratio, policy)
Description: This is used to define whether data are placed in a predefined order. It defines both the order of elements across sub-blocks and the order of data within a terminal node. Such predefined orders decide the location of a new insert and the order of accesses during reads. For example, a data-independent functional partitioning inserts data in the right sub-block after the key of the new tuple is passed through the partitioning function; such arbitrary functions return a block id. Append means that inserts and reads are processed in forward or backward order. Finally, data-dependent partitioning specifies that some kind of order has to be maintained, either in the form of sorting the data or some other kind of functional or k-ary order. K-ary specifies that the sub-blocks are ordered and that the fence keys are stored in a k-ary tree. Each internal node of a k-ary tree has k-1 fence keys and k children; leaf nodes have k keys and k values. At each internal node, label its children ℓ = 1, 2, . . . , k. Then, for a node at position m, its k children are located at positions k·m + ℓ. This assumes that the root is at position 0. For the classic example of a binary tree (k = 2), this gives that a node's left child is at 2m + 1 and its right child is at 2m + 2 (see the sketch after this item). An in-order layout specifies that the sub-blocks are ordered and that the fence keys are laid out via an in-order traversal of a binary tree. In-order layout does not make sense for a tree with k larger than 2, as a node has children which are between its smallest and largest fence keys.

Finally, temporal partitioning partitions blocks by time of insertion into runs and levels, in LSM-tree style. During inserts the most recent run is used; during reads we process one run at a time and one level at a time (top to bottom) until a hit is found.
Examples:
- A hash-map uses a data-independent partitioning function to identify the correct bucket.
- A linked list has no predefined partitioning; data are appended at the end of the list in order of arrival.
- An LSM-tree has temporal/log-structured partitioning.

[Illustration: key partitioning variants — append(fw), append(bw), data-independent (range, radix, or function), and data-dependent (sorted).]
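As a small aid for the k-ary layout above, the following sketch (our illustration; it assumes 0-based positions, the root at position 0, and children labeled ℓ = 1..k as in the text) computes the positions of a node's children and checks the binary-tree case.

```python
def kary_children(m: int, k: int) -> list[int]:
    """Positions of the k children of the node at position m in a k-ary
    (Eytzinger-style) layout with the root at position 0."""
    return [k * m + l for l in range(1, k + 1)]

# Binary tree (k = 2): left child at 2m + 1, right child at 2m + 2.
assert kary_children(0, 2) == [1, 2]
assert kary_children(5, 2) == [11, 12]
# A 4-ary node (as in FAST-style layouts) has four consecutive children.
assert kary_children(1, 4) == [5, 6, 7, 8]
```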

(11) Sub-block capacity. Domain: fixed | balanced | unrestricted | func
Description: This is used to define the capacity of each sub-block, i.e., the maximum number of tuples that can end up under each one of an element's sub-blocks. One can have functional capacity, fixed capacity, unrestricted capacity (where capacity can arbitrarily grow in each sub-block), and balanced capacity, where the total number of records is equally divided across all sub-blocks.
Examples:
- A hash-map uses unrestricted capacity, as the number of elements in each bucket is not known a priori and buckets should be able to grow arbitrarily.
- A linked list has a fixed capacity, as each sub-block can hold one or more elements (in the case of linked lists of pages); when one sub-block overflows, a new sub-block is created.
- A B+Tree node has balanced capacity, as the capacity of each sub-tree of an internal node is equal to that of its siblings.
[Illustration: a logical block with a sub-block capacity of 100.]

(12) Immediate node links. Domain: next | previous | both | none
Description: This indicates that each sub-block includes a pointer to its immediate sibling in the "forward direction" (defined by the structure), in the "backward direction", or both.
Examples:
- A linked list contains sub-blocks that are linked to each other.

(13) Skip node links. Domain: perfect | randomized | func | none
This is set to none when no skip links are used. Skip links allow navigation from one sub-block to a sub-block further away from it. When set to perfect, all links that allow for binary-search-style navigation are materialized. When set to randomized, only a probabilistic set of skip links is included, enough to provide certain performance guarantees. When set to functional, an arbitrary function defines the links that are going to be materialized.
Examples:
- Classic perfect skip-lists contain perfect skip-links.
- Randomized skip-lists contain randomized skip-links.

(14) Area links. Domain: forward | backward | both | none
Each sub-tree can be connected with another sub-tree at the leaf level through area links. This can happen in a forward way, a backward way, or both.
Examples:
- The linked leaves of a B+Tree are connected to each other for efficient range queries.


[Illustration: immediate node links (next, previous, both), skip node links, and area links between logical blocks.]

(15) Sub-block physical location. Domain: inline | pointed | double-pointed | none
Description: This is used to define the physical location of sub-blocks: they are either inlined (i.e., physically contained within the parent element) or pointed to, using pointers to arbitrary locations in memory. Double-pointed means that the sub-block also points back to the parent. None means that sub-blocks are either nonexistent (i.e., the element is terminal) or not stored.
Examples:
- A classic B+Tree uses pointed sub-blocks, where sub-trees are pointed to by their parents.
- A hash-map followed by per-bucket linked lists can contain the linked list head inlined per bucket.
[Illustration: inline, pointed, double-pointed, and none.]

(16) Sub-block physical layout. Domain: BFS | BFS layer grouping | scatter
Description: This represents the physical layout of sub-blocks. Scatter: random placement in memory. BFS: laid out in a breadth-first layout. BFS layer grouping: hierarchical level nesting of BFS layouts.
Examples:
- A cache-conscious B+Tree uses BFS to lay out nodes.
- A linked list can have blocks that are randomly scattered in main memory.
[Illustration: the same tree of blocks B1–B6 laid out contiguously in BFS order vs. scattered in memory.]

(17) Sub-blocks homogeneous. Domain: boolean
This is set to true when all sub-blocks have the same definition.
Examples:
- Data structures like bounded disorder and ART are not homogeneous, as sibling nodes can be of different types.
- A classic B+Tree is homogeneous, as all children of a node are of the same type.
[Illustration: a logical block with identical sub-blocks (A A A, true) vs. heterogeneous sub-blocks (A B C, false).]

(18) Sub-block consolidation. Domain: boolean
If a node has only one sub-block, that node is suppressed (merged with its parent).
[Illustration: a tree where block B5, the single child of B2, is merged into a combined "B2, B5" node.]

(19) Sub-block instantiation. Domain: lazy | eager
If a sub-block is empty, it is represented by a null pointer when set to lazy, or instantiated in any case when set to eager.
[Illustration: the same tree with empty sub-blocks absent (lazy) vs. all sub-blocks instantiated (eager).]

(20) Links location. Domain: consolidate | scatter
In case links are used, this tells us how they are stored. Consolidate means that they are stored as a contiguous array; scatter means that they are stored per partition.
Examples:
- Linked lists have scattered links, where each sub-block contains its links.
- Unix inodes consolidate, within the current inode, the links to blocks or indirect blocks.
[Illustration: links L1–L3 consolidated in one contiguous array vs. scattered across sub-blocks.]

(21) Recursion. Domain: yes(func) | no
If yes, sub-blocks will be subsequently inserted into a node of the same type until a maximum depth (expressed as a function) is reached. Then the terminal node type of this data structure is used.
[Illustration: a logical block recursing into sub-blocks of the same type.]

D DATA ACCESS PRIMITIVES AND LEARNED COST MODELS

We now present in detail the Level 2 access primitives and discuss

how to generate learned cost models to perform cost synthesis.

Each data access primitive has three stages.

Stage 1: In the first stage, C++ code implementing the Level 2 access primitive is run. This implementation usually consists of performing n iterations of the required action in a tight loop; as an example, we might perform n = 10000 binary searches over an array for a set of 10000 values sampled with replacement from the values within the array. After performing the required action n times, the benchmark latency is observed and normalized (i.e., latency = total latency / n).
This implementation benchmark is run multiple times with con-

trolled values for various input parameters such as data size. The

benchmark latency is observed and recorded for each set of input

parameters. Afterwards, these values are collected into 1) a matrix

of the parameter values X and 2) a corresponding array of the per-

formance (latency) Y observed in each benchmark. In the joined

matrix (X |Y ), each row contains a single configuration of param-

eters and a single benchmark latency. A column in X contains all

the configuration values of a parameter, and the Y column is the

benchmark latencies. Below are two examples of benchmarks and

their corresponding data matrices:

Example 1 (Binary Search): The matrix of parameter values is a (d × 1) vector, with d the number of input sizes. A concrete example might be sizes = {128, 256, 512, 1024}. Additionally, a d × 1 vector of benchmark latencies is passed.

Example 2 (Bloom Filter): The matrix of parameter values in this case is d × 2 (recall that variables such as the hash function are chosen via the Level 1 benchmark), with the two variables being data size and number of hash functions. A concrete example might be

$\begin{pmatrix} 128 & 1 \\ 128 & 2 \\ 256 & 1 \\ 256 & 2 \end{pmatrix}$

The set of benchmark latencies is still a d × 1 vector, with the recorded latencies passed in the same order as the rows of the matrix of parameter values (i.e., the first benchmark latency corresponds to the first row of parameters in the matrix of parameter values).
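As a sketch of what this data collection produces, the joined matrix (X|Y) can be assembled as follows; the latency numbers below are made up purely for illustration.

```python
import numpy as np

# Each row of X: (data size, number of hash functions); illustrative values.
X = np.array([[128, 1],
              [128, 2],
              [256, 1],
              [256, 2]], dtype=float)
# One measured, normalized latency per parameter configuration (made up).
Y = np.array([2.1e-8, 3.4e-8, 2.2e-8, 3.5e-8])

XY = np.column_stack([X, Y])  # the joined matrix (X|Y): one benchmark per row
print(XY)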

Stage 2: A model-specific Python function is then called to train

the model. The matrix of parameter values and corresponding ar-

ray of latencies are passed to the function. Depending on the model,

some combination of NumPy, SciPy, and PyTorch is used to train

the models. For simple models this can be as simple as a call to

the least squares regression function; for others, this requires some

pre-processing of data or training via gradient descent.

Example 1 (Interpolation Search): As input, Python is given the required data sizes and benchmark latencies. It performs two basic actions, which essentially provide new features: it computes log(data size) and log(log(data size)). Thus each row in X now contains three input features: data size, log data size, and log(log data size). At this point we run regular linear regression via NumPy. We record the optimal coefficients and return them from the train function back to the main cost synthesis routine of the Data Calculator in C++.
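A minimal sketch of this training step (ordinary least squares over the constructed features; the function and variable names are ours, not the Data Calculator's):

```python
import numpy as np

def train_loglog_linear(sizes, latencies):
    """Fit f(x) = c1*x + c2*log(x) + c3*log(log(x)) + c0 by ordinary least
    squares, mirroring the feature construction described above."""
    x = np.asarray(sizes, dtype=float)
    # Feature matrix: [x, log x, log log x, 1]; requires x > e for real values.
    X = np.column_stack([x, np.log(x), np.log(np.log(x)), np.ones_like(x)])
    coeffs, *_ = np.linalg.lstsq(X, np.asarray(latencies, float), rcond=None)
    return coeffs  # (c1, c2, c3, c0)

def predict_loglog_linear(coeffs, x):
    c1, c2, c3, c0 = coeffs
    return c1 * x + c2 * np.log(x) + c3 * np.log(np.log(x)) + c0
```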

Example 2 (Random Memory Access): The benchmark for a random memory access is fit by a sum-of-sigmoids model:

$f(x) = \sum_{i=1}^{k} \frac{c_i}{1 + e^{-k_i(\log x - x_i)}} + y_0$

with all coefficients learned as usual. This model is no longer convex in its L2 loss and cannot be learned via off-the-shelf techniques such as least squares regression. Instead, we first pre-process the data by calculating rates of change over intervals of some size z. We find the k highest local maxima of this graph of rates of change and set these as our initial guesses for the x_i values. The values of k_i and c_i are initialized to random positive numbers between 0 and 1. Our initial guess for y_0 is simply the first point. We then use SciPy's curve_fit to train the parameters and observe the results.

The learning of the coefficients for the sum-of-sigmoids model requires good initial conditions for the guesses of the x_i; it is not sensitive to the initial guesses for the parameters k_i or c_i. We found that for each of z = 0.1, 0.5, and 1, the initial guesses for the x_i were good enough that the fitted model produces accurate results.

Stage 3: These coefficients are then stored in a model specific

class such as “SimpleLinearModel”. The class then has a prediction

function which uses the coefficients to calculate costs given input

parameters. Thus, even though the training is in Python, the pre-

diction is made in C++ classes.

Example 1 (Interpolation Search): In the case of interpolation search, the corresponding model is the LogLogLinear model, defined by $f(x) = c_1 x + c_2 \log x + c_3 \log\log x + c_0$. Its class constructor takes in the 4 coefficients and stores them. For prediction, it takes as input x and computes f(x) using the stored coefficients. In the case of interpolation search, the input variable x is the data size, and so the model produces a prediction for the search cost given the data size.

Example 2 (Weighted Nearest Neighbors): Nearest neighbors is our only non-parametric model at the moment, although we expect to add more in the future. Unlike the previous example, which is a parametric model (defined by the coefficients c_1, c_2, c_3, c_0), the non-parametric model needs to keep track of its data.


[Figure 11 (table): the 21 data layout primitives with their domains and domain sizes, together with the rules relating them (e.g., key-value layout requires key or value retention; sub-block properties require fanout/radix != terminal), and, as synthesis examples, the primitive values for eight element types: H (Hash), LL (Linked List), LPL (Linked Page-List), UDP (Unordered Data Page), B+ (B+Tree internal node), CSB+ (CSB+Tree internal node), FAST (FAST internal node), ODP (Ordered Data Page); nodes highlighted in gray are terminal leaf nodes. Unless otherwise specified, a reduced default value domain is used: 100 values for integers, 10 values for doubles, and 1 value for functions. Primitives colored in green are used only during search; primitives colored in yellow are not used during search. The total number of property combinations is > 10^18; excluding invalid combinations, ≈ 10^16 remain.]

Figure 11: Data layout primitives and synthesis examples of data structures.


[Figure 12: Example of Benchmark (1) — benchmark data vs. fitted model; time (s) against number of elements.]

Thus, in weighted nearest neighbors, the constructor takes in the matrices X and Y, and predictions are made using these values explicitly via the function:

$f(x) = \frac{1}{\sum_{i=1}^{k} \frac{1}{d(x, x_i)}} \sum_{i=1}^{k} \frac{1}{d(x, x_i)} y_i$
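A small sketch of this prediction function (inverse-distance weighting over the k stored neighbors; the naming is ours):

```python
import numpy as np

def wnn_predict(X, Y, x, k=5, eps=1e-12):
    """Weighted nearest neighbors: inverse-distance-weighted average of the
    latencies Y of the k stored parameter configurations X closest to x."""
    X = np.atleast_2d(X).astype(float)
    d = np.linalg.norm(X - np.asarray(x, float), axis=1) + eps  # avoid 1/0
    nn = np.argsort(d)[:k]          # indices of the k nearest neighbors
    w = 1.0 / d[nn]                 # inverse-distance weights
    return np.sum(w * np.asarray(Y, float)[nn]) / np.sum(w)
```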

D.1 Notes for Presentation

For the example graphs presented next, each fitted model is from the same machine and uses 8-byte key and value sizes. The graphs are given with data size as the independent variable, although the input into the Level 2 primitives can be either the number of elements or the data size (for fixed-size elements). For fixed-size elements, the graphs are more easily read as number of elements. Finally, the following models use a small amount of terminology that makes naming the benchmarks easier. We provide the terms below.

RowStore vs. ColumnStore: This refers to how the data is laid out. In general, RowStore means that we have arrays of key-value pairs, whereas ColumnStore means we have separate arrays, one for keys and one for values. This is the Data Layout option that is passed into higher-level benchmarks (i.e., to cost a scan over data that is stored as a "row store", you pass Data Layout = "row store" into the higher-level benchmark).

Hash Family: For the purpose of benchmarking and cost modeling, we take the general form of the hash function and randomly generate the coefficient values during execution. That is, no particular value of the coefficients is assumed, and we do not (currently) attempt to learn optimal values for these coefficients. As an example, for the multiply-shift scheme, the form is assumed to be $h(x) = (a \cdot x) \gg (w - M)$, where $b = 2^M$ is the number of bins and w is the word size (both fixed). During benchmarking, we conduct multiple runs and generate the value of a randomly during each run.
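For illustration, here is a sketch of sampling and applying the two hash families used in the benchmarks below: multiply-shift over 64-bit words, and the 2-wise independent scheme h(x) = (a·x + b) mod p. The helper names are ours.

```python
import random

W = 64  # word size in bits, as assumed in the text

def sample_multiply_shift(M):
    """Multiply-shift into b = 2**M bins: h(x) = (a*x) >> (W - M),
    with a a randomly drawn odd W-bit integer."""
    a = random.getrandbits(W) | 1  # force a to be odd
    return lambda x: ((a * x) & (2**W - 1)) >> (W - M)

def sample_two_wise(p):
    """2-wise independent hashing into p buckets, p prime:
    h(x) = (a*x + b) mod p, with a and b drawn from [0, p)."""
    a, b = random.randrange(p), random.randrange(p)
    return lambda x: (a * x + b) % p

h1 = sample_multiply_shift(M=10)   # 1024 bins
h2 = sample_two_wise(p=1009)       # 1009 (prime) buckets
print(h1(123456789), h2(123456789))
```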

D.2 Learned Cost Models

(1) Scalar Scan (RowStore, Equal)

Description: This benchmark performs a for loop over an array of key-value pairs. At each array index, we check whether the key at that position matches the desired key.

[Figure 13: Example of Benchmark (2) — benchmark data vs. fitted model; time (s) against number of elements.]

Model: The benchmark is fit by a linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n key-value pairs, search val x
  for i = 1, ..., n do
    if array[i].key == x then
      return array[i].val
    end if
  end for

Example of Learned Cost Model: Figure 12

(2) Scalar Scan (ColumnStore, Equal)
Description: This benchmark performs a for loop over an array of keys. At each array index, we check whether the key at that position matches the desired key. If yes, we return the value at the same index in the values array.
Model: The benchmark is fit by a linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
  for i = 1, ..., n do
    if keys[i] == x then
      return values[i]
    end if
  end for

Page 22: The Internals of the Data Calculator · The Internals of the Data Calculator Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo Harvard University ABSTRACT

The graphs for the scans tend to look very similar, even down to what initially appears to be noise. Each graph is in fact different; for instance, the upper-right two points in this graph differ from the upper-right two points in the prior graph. It also makes sense that the two graphs are very similar: they have almost identical code and the process is compute-bound.

Example of Learned Cost Model: Figure 13

(3) Scalar Scan (RowStore, Range)
Description: This benchmark performs a for loop over an array of key-value pairs. At each array index, we check whether the key at that position is less than the desired key. If the key satisfies the predicate, the corresponding value is added to a set of values to be returned. After looping through the whole array, we return the list of values whose keys satisfied the predicate. Similar benchmarks exist for between predicates.
Model: The benchmark is fit by a linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n key-value pairs, search val x
Ensure: output array of values
  count = 0
  for i = 1, ..., n do
    if array[i].key < x then
      output[count] = array[i].val
      count++
    end if
  end for
  return output

Example of Learned Cost Model: Figure 14

(4) Scalar Scan (ColumnStore, Range)
Description: This benchmark performs a for loop over an array of keys. At each array index, we check whether the key at that position is less than the desired key. If it is, we add the corresponding value in the values array to a set of values to be returned. After looping through all the keys, the set of values is returned. Similar benchmarks exist for between predicates.
Model: The benchmark is fit by a simple linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

[Figure 14: Example of Benchmark (3) — benchmark data vs. fitted model; time (s) against number of elements.]

[Figure 15: Example of Benchmark (4) — benchmark data vs. fitted model; time (s) against number of elements.]

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
Ensure: output array of values
  count = 0
  for i = 1, ..., n do
    if keys[i] < x then
      output[count] = values[i]
      count++
    end if
  end for
  return output

Example of Learned Cost Model: Figure 15

(5) Simd-AVX Scan (ColumnStore, Equal)

Description: This benchmark performs a SIMDified loop

over an array of keys. At each point, we compare a SIMD

bank of keys to the desired value. If a matching key is found,

we break the loop and return the value at the corresponding position.

[Figure 16: Example of Benchmark (5) — benchmark data vs. fitted model; time (s) against number of elements.]

Model: The benchmark is fit by a linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
  {SIMD bank is of width b = 128 / element size.}
  for each SIMD bank do
    load memory into SIMD register
    comp_bank = SIMD compare equal
    if test_not_all_zeros(comp_bank) then
      get mask from comp_bank and use it to pull the value from the values array
      return pulled value
    end if
  end for

Example of Learned Cost Model: Figure 16

(6) Simd-AVX Scan (ColumnStore, Range)
Description: This benchmark performs a SIMDified loop over an array of keys, processing one bank of SIMD values at a time. For each bank, we apply a less-than comparison between the bank of keys and the predicate value. After finding which keys satisfy the predicate, we perform a swizzle instruction to move the values corresponding to the matching keys to the beginning of a separate SIMD bank, and then store the matching values. Similar benchmarks exist for between predicates. More than the other benchmarks, this benchmark depends heavily on the input key size, as the code changes with each key size; in particular, the way in which we perform the swizzle instructions depends on the width of the key.

[Figure 17: Example of Benchmark (6) — benchmark data vs. fitted model; time (s) against number of elements.]

Model: The benchmark is fit by a linear model f(x) = ax + b, with a and b being learned coefficients. Both coefficients are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
Ensure: output array of values
  {SIMD bank is of width b = 128 / element size.}
  count = 0
  for each SIMD bank do
    load memory into SIMD register
    comp_bank = SIMD compare less than
    pull mask from comp_bank
    use mask to swizzle the matching values together
    store values into output[count]
    count += popcnt(mask)
  end for
  return output

Example of Learned Cost Model: Figure 17

(7) Binary Search (RowStore)
Description: This benchmark performs standard binary search over an array of key-value pairs.
Model: The benchmark is fit by a log-linear model f(x) = ax + b log x + c, with a, b, and c being learned coefficients. All of a, b, c are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n key-value pairs, search val x
  low = 0, high = array.size - 1
  middle = (low + high) / 2
  while low < high do
    if array[middle].key < x then
      low = middle + 1
    else
      high = middle
    end if
    middle = (low + high) / 2
  end while
  if array[middle].key == x then
    return array[middle].val
  else
    return NOT_FOUND
  end if

[Figure 18: Example of Benchmark (7) — benchmark data vs. fitted model; time (s, scale 1e-7) against number of elements.]

Example of Learned Cost Model: Figure 18

(8) Binary Search (ColumnStore)
Description: This benchmark performs standard binary search over an array of keys. Afterwards, it returns the corresponding value.
Model: The benchmark is fit by a log-linear model f(x) = ax + b log x + c, with a, b, and c being learned coefficients. All of a, b, c are constrained to be non-negative.

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
  low = 0, high = keys.size - 1
  middle = (low + high) / 2
  while low < high do
    if keys[middle] < x then
      low = middle + 1
    else
      high = middle
    end if
    middle = (low + high) / 2
  end while
  if keys[middle] == x then
    return values[middle]
  else
    return NOT_FOUND
  end if

[Figure 19: Example of Benchmark (8) — benchmark data vs. fitted model; time (s, scale 1e-7) against number of elements.]

Example of Learned Cost Model: Figure 19

(9) Interpolation Search (RowStore)
Description: This benchmark performs interpolation search over an array of key-value pairs, searching by key.
Model: The benchmark is fit by a log + loglog linear model f(x) = ax + b log x + c log log x + d, with a, b, c, and d being learned coefficients.

Benchmarking PseudoCode

Require: array of n key-value pairs, search val x
  low = 0, high = array.size - 1
  key_low = array[low].key, key_high = array[high].key
  diff = key_high - key_low
  while low < high do
    si = low + (high - low) * ((x - key_low) / diff)
    if array[si].key < x then
      low = si + 1
    else if array[si].key == x then
      return array[si].val
    else
      high = si
    end if
  end while
  return NOT_FOUND


[Figure 20: Example of Benchmark (9) — benchmark data vs. fitted model; time (s, scale 1e-7) against number of elements.]

Example of Learned Cost Model: Figure 20

(10) Interpolation Search (ColumnStore)
Description: This benchmark performs interpolation search over an array of keys. Afterwards, it returns the corresponding value.
Model: The benchmark is fit by a log + loglog linear model f(x) = ax + b log x + c log log x + d, with a, b, c, and d being learned coefficients.

Benchmarking PseudoCode

Require: array of n keys, array of n values, search val x
  low = 0, high = keys.size - 1
  key_low = keys[low], key_high = keys[high]
  diff = key_high - key_low
  while low < high do
    si = low + (high - low) * ((x - key_low) / diff)
    if keys[si] < x then
      low = si + 1
    else if keys[si] == x then
      return vals[si]
    else
      high = si
    end if
  end while
  return NOT_FOUND

Example of Learned Cost Model: Figure 21

[Figure 21: Example of Benchmark (10) — benchmark data vs. fitted model; time (s, scale 1e-7) against number of elements.]

(11) Hash Probe (Multiply Shift)
Description: This benchmark computes a hash value and probes the bucket of that hash value. This is used, for example, in hash partitioning. We note that in this model there are no collisions, which is why the benchmark is a hash probe and not a model of a hash table.

We generate an array of length k, where k is the number of elements needed to make a structure of size equal to the input parameter Structure Size. We call this array the probe array. Each element in the array stores a random number. A second array of length n, where n is the number of trials we will perform, is created as well, with each element in that array also random. This array is called the "scan array". A full justification of why two arrays are needed is given in the description of the Random Memory Access benchmark.
To perform the benchmark, we compute the necessary hash value for each trial. The hash value depends on 1) the value stored at the previously probed hash bucket and 2) the value in the scan array for the current trial. 1) is required so that the memory request of one trial depends on the value loaded in the previous trial, preventing the CPU from issuing multiple memory requests in parallel. 2) is required so that cycles do not occur in the access patterns for hash probing (which would throw off the expected cache latencies for large arrays of hash buckets).

buckets).

The previous paragraphs are true for all hash probe bench-

marks. The following is true only for the multiply shift hash

benchmark. 1) The input size must create a number of buck-

ets which is a power of 2. 2) If S is the size of the created

hash array, let s = log2S . The equation for the multiply shift

hash function is

h(x) = a ∗ x >> (64 − s)

where it is assumed that the processor uses 64 bit words. In

the equation, a is a randomly drawn odd integer.

Model: The model for the hash probe can be a sum of sig-

moids or the weighted nearest neighbor model. See the ran-

dom memory access benchmark for more details on training

these models.

Benchmarking PseudoCode

Require: pa: array of k longs, sa: array of n longs, n: numAccesses
  fill each slot of pa with random x = Uniform(0, k-20)
  fill each slot of sa with random x = Uniform(0, 20)
  x = 0, s = log2(k)
  for i = 0, ..., n-1 do
    x = [a * (pa[x] + sa[i])] >> (64 - s)
  end for
  print x

[Figure 22: Example of Benchmark (11) — time (s, scale 1e-7) against region size (bytes).]

Example of Learned Cost Model: Figure 22

(12) Hash Probe (k-wise independent)
Description: For a description of the common setup of all hash benchmarks, see the benchmark above.
The following holds only for the 2-wise independent hash probe benchmark. 1) The input size must create a number of buckets which is a prime number; we denote this number by p. 2) The equation for the 2-wise independent hash function is

$h(x) = (a \cdot x + b) \bmod p$

where it is assumed that the processor uses 64-bit words. In the equation, a and b are randomly drawn integers between 0 and p.

Model: The model for the hash probe can be a sum of sig-

moids or the weighted nearest neighbor model. See the ran-

dom memory access benchmark for more details on training

these models.

Benchmarking PseudoCode

Require: pa: array of k longs, sa: array of n longs, n: numAccesses
  fill each slot of pa with random x = Uniform(0, k-20)
  fill each slot of sa with random x = Uniform(0, 20)
  x = 0
  for i = 0, ..., n-1 do
    x = [a * (pa[x] + sa[i]) + b] % p
  end for
  print x

[Figure 23: Example of Benchmark (12) — time (s, scale 1e-7) against region size (bytes).]

Example of Learned Cost Model: Figure 23

(13) Bloom Filter Probe (Multiply Shift)
Description: This benchmark first builds a bloom filter by computing k hash values for each entry and setting the bits at the k computed positions to 1. It then probes the bloom filter using the same k hash functions. In this benchmark, each of the k hash functions is from the multiply-shift family. There are two inputs to the bloom filter benchmark: the size of the filter and the number of hash functions. For each of the n trials, we randomly generate a value to probe for.
The following describes the multiply-shift hash function. 1) The bloom filter size must be a power of 2. 2) If S is the size of the filter, let $s = \log_2 S$. The equation for the multiply-shift hash function is

$h(x) = (a \cdot x) \gg (64 - s)$

where it is assumed that the processor uses 64-bit words. In the equation, a is a randomly drawn odd integer.
Model: The model for the bloom filter is the sum of sum of sigmoids model or the weighted nearest neighbor model; it is trained similarly to the sum-of-sigmoids model. More details on the model training can be found in the description of the Random Memory Access benchmark.

Benchmarking PseudoCode

Require: pa: array of n longs, n: numAccesses, S: size
  build bloom filter BF of size S, with S a power of 2 (the build is similar to the probe shown below); let s = log2(S)
  fill each slot of pa with random x
  count = 0
  for i = 0, ..., n-1 do
    possiblyInSet = true
    for j = 0, ..., k-1 do
      hashBit = [a[j] * pa[i]] >> (64 - s)
      bitOffset = hashBit & 7
      notInSet = !((1 << bitOffset) & BF[hashBit >> 3])
      if notInSet then
        count += 1
      end if
    end for
  end for
  print count

[Figure 24: Example of Benchmark (13) — on the left, varying k with size constant; on the right, varying size with k constant.]
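The bit manipulation in the probe above (a byte-addressed bit array, where hashBit >> 3 selects the byte and hashBit & 7 the bit within it) can be sketched as follows; this is our illustration, not the benchmark's actual code:

```python
def bit_is_zero(bf: bytearray, hash_bit: int) -> bool:
    """True if bit number hash_bit is 0 in the byte-array bloom filter bf.
    hash_bit >> 3 picks the byte; hash_bit & 7 picks the bit inside it."""
    return (bf[hash_bit >> 3] & (1 << (hash_bit & 7))) == 0

bf = bytearray(16)          # a 128-bit filter, all bits zero
bf[2] |= 1 << 5             # set bit 21 (byte 2, offset 5)
assert not bit_is_zero(bf, 21)
assert bit_is_zero(bf, 22)
```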

Example of Learned Cost Model: The Bloom Filter benchmark needs a two-dimensional model. Figure 24 shows slices where the size is kept constant and where the number of hash functions is kept constant.

(14) Bloom Filter Probe (k-wise independent)
Description: This benchmark first builds a bloom filter by computing k hash values for each entry and setting the bits at the k computed positions to 1. It then probes the bloom filter using the same k hash functions. In this benchmark, each of the k hash functions is from the k-wise independent family. There are two inputs to the bloom filter benchmark: the size of the filter and the number of hash functions. For each of the n trials, we randomly generate a value to probe for.
The following describes the k-wise independent hash function (cf. Benchmark 12). 1) The bloom filter size must be a prime number p. 2) The equation for the hash function is

$h(x) = (a \cdot x + b) \bmod p$

where it is assumed that the processor uses 64-bit words. In the equation, a and b are randomly drawn integers between 0 and p.
Model: The model for the bloom filter is the sum of sum of sigmoids model or the weighted nearest neighbor model; it is trained similarly to the sum-of-sigmoids model. More details on the model training can be found in the description of the Random Memory Access benchmark.

Benchmarking PseudoCode

Require: pa: array of n longs, n: numAccesses, p: size (prime)
  build bloom filter BF of size p, with p prime (the build is similar to the probe shown below)
  fill each slot of pa with random x
  count = 0
  for i = 0, ..., n-1 do
    possiblyInSet = true
    for j = 0, ..., k-1 do
      hashBit = [a[j] * pa[i] + b[j]] % p
      bitOffset = hashBit & 7
      notInSet = !((1 << bitOffset) & BF[hashBit >> 3])
      if notInSet then
        count += 1
      end if
    end for
  end for
  print count

[Figure 25: Example of Benchmark (14) — on the left, varying k with size constant; on the right, varying size with k constant.]

Example of Learned Cost Model: Figure 25 shows slices where the size is kept constant and where the number of hash functions is kept constant.

(15) QuickSort
Description: This benchmark performs QuickSort to sort a list of keys.
Model: The benchmark is fit by an NLogN model: f(x) = ax log x + bx + c, with a, b, and c being learned coefficients.

Benchmarking PseudoCode

QUICKSORT:
Require: array A of n key-value pairs, integers low, high
  if low < high then
    p = partition(A, low, high)
    quicksort(A, low, p - 1)
    quicksort(A, p + 1, high)
  end if

PARTITION:
Require: array A of n key-value pairs, integers low, high
  pivot = A[high]
  i = low - 1
  for j = low, ..., high - 1 do
    if A[j] < pivot then
      i = i + 1
      swap A[i] with A[j]
    end if
  end for
  swap A[i + 1] with A[high]
  return i + 1

[Figure 26: Example of Benchmark (15) — time (s) against data size.]

Example of Learned Cost Model: Figure 26

(16) Random Memory Access
Description: This benchmark performs random memory accesses within a region.
To do this, we first generate an array of length k = region size / sizeof(long). This array will be the "probe array", or "pa" for short. Next, we generate a random number between 0 and k - 20 for each slot in the array. Additionally, we create a "scan array", or "sa", of n longs which contains random numbers between 0 and 20. We will be performing exactly n random memory accesses, and so this second array is generally quite large.
We initialize a pointer value p to 0. At each round i = 0, ..., n-1, we set p = pa[p] + sa[i]. At the end of all rounds, we print p. This makes the compiler unable to optimize away the memory accesses, and is a negligible overhead for the benchmark. The combination of two random numbers, one in [0, k-20] and the other in [0, 20], creates a nearly uniform distribution over [0, k] (the first and last 20 slots have slightly lower chances of being picked, via a combinatorial argument similar to why 7 is the most common dice roll). Embedding one part of the next slot to visit in the probe array pa prevents the processor from fetching the next memory location before the data arrives. Embedding a second random part in the scan array sa prevents cycles in the probing pattern, which would occur if we just used the probe array for the next slot.
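The core pointer-chasing logic can be sketched as below. Note that this is only a logic sketch in Python (the real benchmark is a tight C++ loop, since an interpreted loop would swamp the memory latencies being measured), and the names follow the pseudocode rather than any actual API.

```python
import random

def random_memory_access(region_bytes: int, n: int) -> int:
    """Dependent (serial) pointer chase over a probe array covering a region.
    Each jump depends on the previously loaded value, so requests cannot be
    overlapped; the scan array term breaks cycles in the probe pattern."""
    k = region_bytes // 8  # number of long-sized slots in the region
    pa = [random.randrange(0, k - 20) for _ in range(k)]  # probe array
    sa = [random.randrange(0, 20) for _ in range(n)]      # scan array
    p = 0
    for i in range(n):
        p = pa[p] + sa[i]  # the next slot depends on the current load
    return p  # returned/printed so the chase cannot be optimized away

print(random_memory_access(region_bytes=1 << 20, n=100_000))
```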

Model: The benchmark is fit by a sum-of-sigmoids model:

$f(x) = \sum_{i=1}^{k} \frac{c_i}{1 + e^{-k_i(\log x - x_i)}} + y_0$

with all coefficients learned as usual. This model is no longer convex in its L2 loss and cannot be learned via off-the-shelf techniques such as least squares regression. Instead, we first preprocess the data by calculating rates of change over intervals of some size z. We find the k highest local maxima of this graph of rates of change and set these as our initial guesses for the x_i values. The values of k_i and c_i are initialized to random positive numbers between 0 and 1. Our initial guess for y_0 is simply the first point. We then use SciPy's curve_fit to train the parameters and observe the results.
The learning of the coefficients for the sum-of-sigmoids model requires good initial conditions for the guesses of the x_i; it is not sensitive to the initial guesses for k_i or c_i. We found that for each of z = 0.1, 0.5, and 1, the initial guesses for the x_i were good enough that the fitted model produces accurate results.

Additionally, the user can select a different model for the Random Memory Access benchmark should they so choose. We currently support k-nearest-neighbor regression. In the future, we plan to add support for both locally weighted linear regression and Gaussian Process Regression. All these techniques are non-parametric and so need no training, and all three produce good predictions given enough data. As a tradeoff, these models are significantly slower at producing predictions and are not interpretable, in contrast with the sum-of-sigmoids model, whose x_i values indicate the cache boundaries and whose c_i values give the difference in memory access latency between successive levels of cache.

Benchmarking PseudoCode

Require: pa: array of k longs, sa: array of n longs, n: numAccesses
  fill each slot of pa with random x = Uniform(0, k-20)
  fill each slot of sa with random x = Uniform(0, 20)
  p = 0
  for i = 0, ..., n-1 do
    p = pa[p] + sa[i]
  end for
  print p

Example of Learned Cost Model: Figure 27

(17) Batched Random Memory Access
Description: The batched random memory access benchmark is similar to the random memory access benchmark, except that now there is nothing to prevent the processor from batching memory requests. The result is that the time per operation drops dramatically.

[Figure 27: Example of Benchmark (16) — time (s, scale 1e-7) against region size (bytes).]

Model: The batched random memory access benchmark is

fit by the weighted nearest neighbors model and the sum

of sigmoids model. The training is similar to the random

memory access benchmark.

Benchmarking PseudoCode

Require: pa: array of k longs, sa: array of n longs, n: numAccesses
  fill each slot of pa with random x = Uniform(0, k)
  fill each slot of sa with random x = Uniform(0, k)
  p = 0
  for i = 0, ..., n-1 do
    p += pa[sa[i]]
  end for
  print p

Example of Learned Cost Model: Figure 28

Final Notes: Currently the technical report contains 17 out of the 24 Level 2 access primitives. We will be adding the remaining 7.

[Figure 28: time (s, scale 1e-8) against region size (bytes).]

Figure 28: Example of Benchmark (17).

E COST SYNTHESIS

Given a data layout specification and a workload, the Data Calcu-

lator uses Level 1 access primitives to synthesize operations, and subsequently each Level 1 primitive is translated to the appropriate Level 2 primitive to compute the cost of the overall operation. This process relies on an expert system which captures fundamental properties of how basic layout principles are expected to behave and which (high-level) access algorithms are expected to best utilize each layout. In the main part of the paper, we presented a higher-level view of the expert system; Figure 29 now provides a more detailed look into its internals. Each empty rounded shape in Figure 29 represents a Level 1 access primitive call, each diamond a decision, and each dark round shape a data access primitive. Finally, blue star shapes (marked P1-P8) represent additional processes described at the bottom of Figure 29. Together they form the complete expert system for algorithm and cost synthesis for the Get operation.

F INPUT EXAMPLES

Here we provide more examples of data structure input specifications. The complete set of elements used in the experiments is defined in Table 30. Specifically, we list the definitions of all 8 elements used in the paper: Unordered Data Page (UDP), Ordered Data Page (ODP), Hash partitioning (Hash), Range partitioning (Range), Trie node, B+Tree internal node (B+), Linked List (LL), Skip List (SL).

Then full data structures are defined as hierarchies of those

elements. We use the following notation: A → B, which means that

all sub-blocks of element A are of type B. Dots represent multiple

levels in the hierarchy. Below we describe all full data structures

used in the experiments using this notation.

• Linked List: LL → UDP
• Array: UDP with fixed capacity value equal to the number of puts.
• Range Partitioned Linked List: Range → LL → UDP
• Skip-List: SL → UDP
• Trie: Trie → ... → Trie → UDP
• B+Tree: B+ → ... → B+ → ODP

• Sorted Array: ODP with fixed capacity value equal to the number of puts.
• Hash Table: Hash → LL → UDP (with fixed capacity value 5)

The number of levels an actual instance of a data structure would have depends on the size of the dataset in the case of the B+Tree, and on the size of the keys in the case of the Trie.
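To make the A → B notation concrete, the following small C sketch (our own illustration; the enum and struct are hypothetical and not the Data Calculator's input format) encodes full designs as chains of element types.

/* A -> B means every sub-block of element A is of type B; a full design
   is therefore a chain of element types ending in a terminal data page. */
typedef enum { UDP, ODP, HASH, RANGE, TRIE, BPLUS, LL, SL } ElemType;

typedef struct {
    const char *name;
    ElemType    chain[4];  /* top element first, terminal page last */
    int         depth;     /* -1: repeat the top element until the
                              data (or key length) is exhausted     */
} Design;

static const Design designs[] = {
    { "Linked List", { LL, UDP },       2 },
    { "Hash Table",  { HASH, LL, UDP }, 3 },
    { "B+Tree",      { BPLUS, ODP },   -1 },  /* B+ -> ... -> B+ -> ODP   */
    { "Trie",        { TRIE, UDP },    -1 },  /* Trie -> ... -> Trie -> UDP */
};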

Data Access Primitives and Fitted Models

Data access primitives, Level 1 (required parameters; optional parameters), their model parameters, the Level 2 primitives they map to, and the fitted models (numbers refer to the model list below):

Scan (Element Size, Comparison, Data Layout; None); model parameter: Data Size
  1. Scalar Scan (RowStore, Equal): Linear Model (1)
  2. Scalar Scan (RowStore, Range): Linear Model (1)
  3. Scalar Scan (ColumnStore, Equal): Linear Model (1)
  4. Scalar Scan (ColumnStore, Range): Linear Model (1)
  5. SIMD-AVX Scan (ColumnStore, Equal): Linear Model (1)
  6. SIMD-AVX Scan (ColumnStore, Range): Linear Model (1)
Sorted Search (Element Size, Data Layout; ); model parameter: Data Size
  7. Binary Search (RowStore): Log-Linear Model (2)
  8. Binary Search (ColumnStore): Log-Linear Model (2)
  9. Interpolation Search (RowStore): Log + LogLog Model (3)
  10. Interpolation Search (ColumnStore): Log + LogLog Model (3)
Hash Probe (; Hash Family); model parameter: Structure Size
  11. Linear Probing (Multiply-shift [25]): Sum of Sigmoids (5), Weighted Nearest Neighbors (7)
  12. Linear Probing (k-wise independent, k=2,3,4,5): Sum of Sigmoids (5), Weighted Nearest Neighbors (7)
Bloom Filter Probe (; Hash Family); model parameters: Structure Size, Number of Hash Functions
  13. Bloom Filter Probe (Multiply-shift [25]): Sum of Sum of Sigmoids (6), Weighted Nearest Neighbors (7)
  14. Bloom Filter Probe (k-wise independent, k=2,3,4,5): Sum of Sum of Sigmoids (6), Weighted Nearest Neighbors (7)
Sort (Element Size; Algorithm); model parameter: Data Size
  15. QuickSort: NLogN Model (4)
  16. MergeSort: NLogN Model (4)
  17. ExternalMergeSort: NLogN Model (4)
Random Memory Access; model parameter: Region Size
  18. Random Memory Access: Sum of Sigmoids (5), Weighted Nearest Neighbors (7)
Batched Random Memory Access; model parameter: Region Size
  19. Batched Random Memory Access: Sum of Sigmoids (5), Weighted Nearest Neighbors (7)
Unordered Batch Write (Layout; ); model parameter: Write Data Size
  20. Contiguous Write (RowStore): Linear Model (1)
  21. Contiguous Write (ColumnStore): Linear Model (1)
Ordered Batch Write (Layout; ); model parameters: Write Data Size, Batch Data Size
  22. Batch Ordered Write (RowStore): Linear Model (1)
  23. Batch Ordered Write (ColumnStore): Linear Model (1)
Scattered Batch Write; model parameters: Number of Elements, Region Size
  24. Scattered Batch Write: Sum of Sum of Sigmoids (6), Weighted Nearest Neighbors (7)

Models used for fitting data access primitives:

1. Linear. Fits a simple line through the data.
   $f(v) = w^\top \phi(v) + y_0; \quad \phi(v) = (v)$
2. Log-Linear. Fits a linear model with a basis composed of the identity and logarithmic functions plus a bias.
   $f(v) = w^\top \phi(v) + y_0; \quad \phi(v) = (v, \log v)^\top$
3. Log + LogLog. Fits a model with log, log log, and linear components.
   $f(v) = w^\top \phi(v) + y_0; \quad \phi(v) = (v, \log v, \log\log v)^\top$
4. NLogN. Fits a model with primarily an NLogN and a linear component.
   $f(v) = w^\top \phi(v) + y_0; \quad \phi(v) = (v \log v, v)^\top$
5. Sum of Sigmoids. Fits a model that has k approximate steps, seen as sigmoids in log x, as this fits various cache behaviors nicely.
   $f(x) = \sum_{i=1}^{k} \frac{c_i}{1 + e^{-k_i(\log x - x_i)}} + y_0$
6. Sum of Sum of Sigmoids. Fits a model which has two cost components, both of which have k approximate steps occurring at the same locations.
   $f(x, m) = \sum_{i=1}^{k} \frac{c_{i1}}{1 + e^{-k_i(\log x - x_i)}} + (m - 1)\left(\sum_{i=1}^{k} \frac{c_{i2}}{1 + e^{-k_i(\log x - x_i)}} + y_1\right) + y_0$
7. Weighted Nearest Neighbors. Takes the k nearest neighbors under the l2 norm and computes a weighted average of their outputs. The input x is allowed to be a vector of any size. Let $x_1, \ldots, x_k$ be the nearest neighbors of x with costs $y_1, \ldots, y_k$. Then
   $f(x) = \frac{1}{\sum_{i=1}^{k} \frac{1}{d(x, x_i)}} \sum_{i=1}^{k} \frac{1}{d(x, x_i)}\, y_i$

Notation: f is a function, v is a vector, and x, m are scalars. $\log(v)$ applies the logarithm element by element; i.e., if $v = (v_1, v_2)^\top$, then $\log v = (\log v_1, \log v_2)^\top$. Finally, stacking vectors $v^{(1)}$ and $v^{(2)}$ of lengths n and m on top of each other signifies the (n + m)-length vector $(v^{(1)}_1, \ldots, v^{(1)}_n, v^{(2)}_1, \ldots, v^{(2)}_m)^\top$.

Table 1: Data access primitives and models used for operation cost synthesis.
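As a concrete illustration of model (7), here is a minimal C sketch (our own; the training points are hypothetical measurements) that computes the inverse-distance-weighted prediction, treating the stored points as the k nearest neighbors of the query.

/* Weighted nearest neighbors (model 7): the cost at x is the inverse-
   distance-weighted average of the costs of its k nearest neighbors.
   1-D inputs for brevity, so the l2 norm reduces to fabs(). */
#include <math.h>
#include <stdio.h>

static double wnn(double x, int k, const double *xs, const double *ys) {
    double wsum = 0.0, ysum = 0.0;
    for (int i = 0; i < k; i++) {
        double w = 1.0 / (fabs(x - xs[i]) + 1e-12);  /* avoid divide by 0 */
        wsum += w;
        ysum += w * ys[i];
    }
    return ysum / wsum;
}

int main(void) {
    double xs[] = {1e4, 1e6, 1e8};        /* hypothetical region sizes */
    double ys[] = {2e-9, 1.5e-8, 9e-8};   /* hypothetical costs (s)    */
    printf("predicted cost: %.3e s\n", wnn(5e6, 3, xs, ys));
    return 0;
}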


G OUTPUT EXAMPLES

In this section, we give examples of the raw output of the Data Calculator. This includes both cases where the output is a calculation of the response time and cases where the output is the design of a new data structure.

G.1 Output Performance Breakdown

We first present a representative cost breakdown for performing 1 Get query after 100,000 inserts. These numbers correspond to Figure 6. The cost of each operation is broken down into 3 Level 1 access primitives: P stands for random probe, S for serial scan, and B for binary search. The input of each function is the size of the data to be accessed.

• Linked List: P(782) + 6P(200974) + 5S(256) + S(141) + P(256)
  1 probe for getting the head of the linked list, 6 random probes for fetching the first 6 pages (since the answer is found on the 6th page), 5 full scans of the first 5 pages, 1 scan up to position 141 (the answer), and 1 probe for getting the value (data is stored in a columnar format).
• Array: S(1421) + P(100000)
  1 serial scan of the keys until the record is found (position 1421), and a random probe in the whole array of values (data is stored in a columnar format) to find the value.
• Range Partitioned Linked List: P(842) + P(216394) + S(185) + P(256)
  1 probe for the head of the data structure, 1 probe for the linked list head of the given range, one scan of the first page of the linked list, and one probe for the value in the first page of the linked list (data is stored in a columnar format).
• Skip-List: P(1181) + 201P(391) + P(201373) + B(256) + P(256)
  Various probes as we navigate the skip-list checking the zone maps, one binary search when the correct page is found, followed by a probe for the value.
• Trie: P(256) + P(512) + P(768) + P(1024) + P(1280) + P(1536) + P(2048) + P(102144) + P(228628) + P(32608532) + S(185) + P(256)
  Multiple probes as we navigate the trie, a serial scan of the target page, followed by a probe for getting the value.
• B+Tree: 2B(1400) + P(205640) + P(40) + P(840) + B(256) + P(256)
  Multiple probes as we navigate the tree, followed by a binary search for getting the key, and a probe for getting the value.
• Sorted Array: B(100000) + P(100000)
  Binary search for finding the key, probe for finding the value.
• Hash Table: P(39718) + P(10207526) + S(2) + P(5)
  Probe for the head, probe for the linked list, serial scan of the first page of the linked list, probe for the value.
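To show how such a breakdown turns into a response time, here is a minimal C sketch that sums per-primitive model outputs for the Sorted Array entry above. The model shapes and constants are illustrative assumptions standing in for the fitted models, not the Data Calculator's actual parameters.

#include <math.h>
#include <stdio.h>

static double rma(double region_bytes) {   /* random probe P(x) */
    /* one-step sum of sigmoids: latency rises near a cache boundary */
    return 1e-9 + 9e-8 / (1.0 + exp(-4.0 * (log(region_bytes) - log(8e6))));
}
static double bs(double n) {               /* binary search B(x) */
    return 5e-9 * log2(n < 2 ? 2 : n);     /* log-linear shape */
}

int main(void) {
    /* Sorted Array Get: B(100000) + P(100000) */
    printf("synthesized Get cost: %.3e s\n", bs(100000) + rma(100000));
    return 0;
}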

G.2 Output Design Example

We now discuss design output. In the experiment of Figure 9, the Data Calculator designed new data structures given a workload. For the first scenario (left side of Figure 9), the Data Calculator computed a design where a hashing element at the upper levels of the hierarchy allows quick access to the data, but the data is then split between the write-intensive and read-intensive parts of the domain: simple unsorted pages (like a log) for the write-intensive part and B+Tree-style indexing for the read-intensive part. For the second scenario (right side of Figure 9), the Data Calculator produces a design which, similarly to the previous one, handles reads and writes separately, but this time also distinguishes between range and point gets by allowing the part of the domain that receives point queries to be accessed with hashing and the rest via B+Tree-style indexing.

In both scenarios the following elements were chosen by the Data Calculator: Hash Partitioning, B+Tree Internal, and Ordered Data Page. Below, we list all specifications as generated by the system in JSON format and blended into the full designs shown in Figure 9.

G.2.1 Hash Partitioning Element.

{"external.links.next": false,"external.links.prev": false,"inter-block.blockAccess.direct": true,"inter-block.blockAccess.headLink": false,"inter-block.blockAccess.tailLink": false,"inter-block.fanout.fixedValue": 100,"inter-block.fanout.function": "","inter-block.fanout.type": "fixed","inter-block.orderMaintenance.type": "none","inter-block.partitioning.function": "KeyMod(100)","inter-block.partitioning.logStructured.filtersPerLevel": false,"inter-block.partitioning.logStructured.filtersPerRun": false,"inter-block.partitioning.logStructured.initialRunSize": 0,"inter-block.partitioning.logStructured.maxRunsPerLevel": 0,"inter-block.partitioning.logStructured.mergeFactor": 0,"inter-block.partitioning.type": "function","intra-block.blockProperties.location": "inline","intra-block.blockProperties.layout": "inline","intra-block.blockProperties.homogeneous": true,"intra-block.capacity.function": "","intra-block.capacity.type": "variable","intra-block.capacity.value": 0,"intra-block.dataRetention.keyRetention.compression": "","intra-block.dataRetention.keyRetention.function": "","intra-block.dataRetention.keyRetention.type": "none","intra-block.dataRetention.retainedDataLayout": "","intra-block.dataRetention.valueRetention.compression": "","intra-block.dataRetention.valueRetention.function": "","intra-block.dataRetention.valueRetention.type": "none","intra-block.filters.bloomFilter.active": false,"intra-block.filters.bloomFilter.hashFunctionsNumber": 0,"intra-block.filters.bloomFilter.numberOfBits": 0,"intra-block.filters.filtersMemoryLayout": "scatter","intra-block.filters.zoneMaps.max": false,"intra-block.filters.zoneMaps.min": false,"intra-block.filters.zoneMaps.exact": false,"intra-block.links.linksMemoryLayout": "scatter","intra-block.links.next": false,"intra-block.links.prev": false,"intra-block.links.skipLinks.probability": 0,"intra-block.links.skipLinks.type": "none","intra-block.utilization.constraint": "none","intra-block.utilization.function": "","layout.oneChildCompression": false,"layout.zeroElementNullable": true,"layout.indirectedPointers": false

}

G.2.2 Ordered Datapage.

{"external.links.next": true,"external.links.prev": false,"inter-block.blockAccess.direct": true,"inter-block.blockAccess.headLink": false,"inter-block.blockAccess.tailLink": false,"inter-block.fanout.fixedValue": 256,

Page 32: The Internals of the Data Calculator · The Internals of the Data Calculator Stratos Idreos, Kostas Zoumpatianos, Brian Hentschel, Michael S. Kester, Demi Guo Harvard University ABSTRACT

"inter-block.fanout.function": "","inter-block.fanout.type": "fixed","inter-block.orderMaintenance.type": "sorted","inter-block.partitioning.function": "","inter-block.partitioning.logStructured.filtersPerLevel": false,"inter-block.partitioning.logStructured.filtersPerRun": false,"inter-block.partitioning.logStructured.initialRunSize": 0,"inter-block.partitioning.logStructured.maxRunsPerLevel": 0,"inter-block.partitioning.logStructured.mergeFactor": 0,"inter-block.partitioning.type": "none","intra-block.blockProperties.location": "","intra-block.blockProperties.layout": "","intra-block.blockProperties.homogeneous": true,"intra-block.capacity.function": "","intra-block.capacity.type": "fixed","intra-block.capacity.value": 1,"intra-block.dataRetention.keyRetention.compression": "uncompressed","intra-block.dataRetention.keyRetention.function": "","intra-block.dataRetention.keyRetention.type": "full","intra-block.dataRetention.retainedDataLayout": "columnar","intra-block.dataRetention.valueRetention.compression": "uncompressed","intra-block.dataRetention.valueRetention.function": "","intra-block.dataRetention.valueRetention.type": "full","intra-block.filters.bloomFilter.active": false,"intra-block.filters.bloomFilter.hashFunctionsNumber": 0,"intra-block.filters.bloomFilter.numberOfBits": 0,"intra-block.filters.filtersMemoryLayout": "scatter","intra-block.filters.zoneMaps.max": false,"intra-block.filters.zoneMaps.min": false,"intra-block.filters.zoneMaps.exact": false,"intra-block.links.linksMemoryLayout": "scatter","intra-block.links.next": false,"intra-block.links.prev": false,"intra-block.links.skipLinks.probability": 0,"intra-block.links.skipLinks.type": "none","intra-block.utilization.constraint": "leq_capacity","intra-block.utilization.function": "","layout.oneChildCompression": false,"layout.zeroElementNullable": false,"layout.indirectedPointers": false

}

G.2.3 B+Tree Internal Node Element.

{"external.links.next": false,"external.links.prev": false,"inter-block.blockAccess.direct": true,"inter-block.blockAccess.headLink": false,"inter-block.blockAccess.tailLink": false,"inter-block.fanout.fixedValue": 20,"inter-block.fanout.function": "","inter-block.fanout.type": "fixed","inter-block.orderMaintenance.type": "sorted","inter-block.partitioning.function": "","inter-block.partitioning.logStructured.filtersPerLevel": false,"inter-block.partitioning.logStructured.filtersPerRun": false,"inter-block.partitioning.logStructured.initialRunSize": 0,"inter-block.partitioning.logStructured.maxRunsPerLevel": 0,"inter-block.partitioning.logStructured.mergeFactor": 0,"inter-block.partitioning.type": "none","intra-block.blockProperties.location": "inline","intra-block.blockProperties.layout": "inline","intra-block.blockProperties.homogeneous": true,"intra-block.blockProperties.storage": "pointed","intra-block.capacity.function": "","intra-block.capacity.type": "balanced","intra-block.capacity.value": 0,"intra-block.dataRetention.keyRetention.compression": "","intra-block.dataRetention.keyRetention.function": "","intra-block.dataRetention.keyRetention.type": "none","intra-block.dataRetention.retainedDataLayout": "dump","intra-block.dataRetention.valueRetention.compression": "","intra-block.dataRetention.valueRetention.function": "","intra-block.dataRetention.valueRetention.type": "none","intra-block.filters.bloomFilter.active": false,"intra-block.filters.bloomFilter.hashFunctionsNumber": 0,"intra-block.filters.bloomFilter.numberOfBits": 0,"intra-block.filters.filtersMemoryLayout": "scatter","intra-block.filters.zoneMaps.max": false,"intra-block.filters.zoneMaps.min": true,"intra-block.filters.zoneMaps.exact": false,"intra-block.links.linksMemoryLayout": "scatter","intra-block.links.next": false,"intra-block.links.prev": false,"intra-block.links.skipLinks.probability": 0,"intra-block.links.skipLinks.type": "none","intra-block.utilization.constraint": "none","intra-block.utilization.function": "",

"layout.oneChildCompression": false,"layout.zeroElementNullable": true,"layout.indirectedPointers": false}

G.3 Data Structure Instances

Finally, we present the raw output of the optimal data structure instances for both scenarios in Figure 9. For demonstration purposes we converted the raw data structures to graph structures. Each graph node represents an element and each line represents a connection from a parent element to its sub-blocks. Since we are plotting an instance of the data structure, there is a lot of information; zooming into the plots allows readers to see the type of each node. Intuitively, however, very high fanout elements represent hash nodes, low fanout elements represent B+Tree internal nodes, and no-fanout (terminal) elements represent data pages.

Scenario 1. We have a hash-map followed by either shallow B+Trees or large log-style data pages for write-intensive parts of the domain (corresponding to the design on the left of Figure 9).

Scenario 2. We have a B+Tree followed by either hash-maps for point-get intensive parts of the domain or B+Trees for range-intensive parts, and large log-style data pages for write-intensive parts of the domain (corresponding to the design on the right of Figure 9).

[Figure 29: Expert System for Get cost synthesis. The flowchart synthesizes a Get operation over a node specification: it walks decisions on partitioning type, order maintenance, filters (bloom filters, zone maps), links (head/tail, next/prev, skip links), capacity, and fanout; it invokes data access primitives such as Serial Scan Keys, Sorted Search Keys, Serial Scan Zone Maps, Bloom Filter Access, and Random Probe; and it applies the additional processes P1-P8 (P1: sort; P2: partition with function; P3: divide into sub-blocks of fixed size; P4: divide into a fixed number of sub-blocks; P5: LSM partitioning; P6: identify blocks accessed during skip-link search; P7: estimate random access size; P8: apply partial key or value retention function) before recursing into the sub-blocks and summing the total cost.]

Figure 30: Data Layout Specifications. Definitions of the eight elements used in the experiments: Unordered Data Page (UDP), Ordered Data Page (ODP), Hash partitioning (H), Range partitioning (R), Trie node (T), B+Tree internal node (B+), Linked List (LL), and Skip List (SL). Per-element values follow each primitive in the order UDP, ODP, H, R, T, B+, LL, SL.

1. Key retention. No: the node contains no real key data, e.g., intermediate nodes of B+Trees and linked lists. Yes: contains complete key data, e.g., nodes of B-trees. Function: contains only a subset of the keys. Values: yes, yes, no, no, func, no, no, no.
2. Value retention. No: the node contains no real value data, e.g., intermediate nodes of B+Trees and linked lists. Yes: contains complete value data, e.g., nodes of B-trees and arrays. Function: contains only a subset of the values. Values: yes, yes, no, no, func, no, no, no.
3. Key-value layout. Determines the physical layout of key-value pairs. Rules: requires key retention != no or value retention != no. Values (UDP, ODP): columnar, columnar.
4. Intra-node access. Determines how sub-blocks (one or more keys of this node) can be addressed and retrieved within a node, e.g., with direct links, or a link only to the first or last block. Values: direct, direct, direct, direct, direct, direct, head, head.
5. Utilization. Utilization constraints with regard to capacity; for example, >= 50% denotes that utilization has to be greater than or equal to half the capacity. Values: none for all elements except B+, which is >= 50%.
6. Bloom filters. A node's sub-blocks can be filtered using bloom filters, parameterized by the number of hash functions and the number of bits. Values: off for all elements.
7. Zone map filters. A node's sub-blocks can be filtered using zone maps, e.g., filtering on the min/max keys in each sub-block. Values: off for all elements except B+, which keeps min.
8. Filters memory layout. Filters are stored contiguously in a single area of the node or scattered across the sub-blocks. Rules: requires bloom filters != off or zone map filters != off. Value (B+): scatter.
9. Fanout/Radix. Fanout of the current node in terms of sub-blocks: unlimited (no restriction on the number of sub-blocks), fixed to a number, decided by a function, or terminal (the node instead has a fixed capacity). Values: term(256), term(256), fixed(100), fixed(100), fixed(256), fixed(20), unlimited, unlimited.
10. Key/fence partitioning. Set if there is a pre-defined key partitioning imposed, e.g., the sub-block where a key is located can be dictated by a radix or range partitioning function; alternatively, keys can be temporally partitioned. If partitioning is set to none, keys can be forward or backward appended. Values: append(fw), data-dep sorted, data-ind func(x%100), data-ind range(100), data-ind radix(8), data-dep sorted, append(fw), append(fw).
11. Sub-block capacity. Capacity of each sub-block: fixed to a value, balanced (sub-blocks of the same size), unrestricted, or functional. Rules: requires fanout/radix != terminal.
12. Immediate node links. Whether and how sub-blocks are connected. Values: none for all elements except LL and SL, which use next links.
13. Skip node links. Each sub-block can be connected to another sub-block (not only the next or previous one) with perfect, randomized, or custom skip-links. Values: none for all elements except SL, which uses perfect skip-links.
14. Area-links. Each sub-tree can be connected with another sub-tree at the leaf level through area links; examples include the linked leaves of a B+Tree. Values: forward for UDP; none for all other elements.
15. Sub-block physical location. The physical location of the sub-blocks. Pointed: in the heap. Inline: block physically contained in the parent. Double-pointed: in the heap but with pointers back to the parent. Rules: requires fanout/radix != terminal.
16. Sub-block physical layout. The physical layout of the sub-blocks. Scatter: random placement in memory. BFS: laid out in a breadth-first layout. BFS layer list: hierarchical level nesting of BFS layouts. Rules: requires fanout/radix != terminal.
17. Sub-blocks homogeneous. Set to true if all sub-blocks are of the same type. Rules: requires fanout/radix != terminal.
18. Sub-block consolidation. Single children are merged with their parents. Rules: requires fanout/radix != terminal.
19. Sub-block instantiation. If set to eager, all sub-blocks are initialized; otherwise they are initialized only when data are available (lazy). Rules: requires fanout/radix != terminal.
20. Sub-block links layout. If links exist, they are either all stored in a single array (consolidate) or spread at a per-partition level (scatter). Rules: requires immediate node links != none or skip links != none.
21. Recursion allowed. If yes, sub-blocks are subsequently inserted into a node of the same type until a maximum depth (expressed as a function) is reached; then the terminal node type of this data structure is used. Rules: requires fanout/radix != terminal.

[Figure 31: Real output instance (Scenario 1 of Figure 9). A graph rendering of the synthesized data structure instance; node types are legible only when zooming into the original plot.]

[Figure 32: Real output instance (Scenario 2 of Figure 9). A graph rendering of the synthesized data structure instance; node types are legible only when zooming into the original plot.]