5/28/2018

1/130

Technical Deep-Dive in a Column-Oriented In-Memory Database

Prof. Hasso Plattner
Stephan Müller

Research Group of Prof. Hasso Plattner
Hasso Plattner Institute for Software Engineering, University of Potsdam

TEC215


2/130

Motivation

Prof. Hasso Plattner

Stephan Müller




3/130

All Areas Have to Be Taken into Account

Changed hardware
Advances in data processing (software)
Complex enterprise applications

Our focus


4/130

Why a New Data Management?!

DBMS architecture has not changed over decades
Redesign needed to handle the changes in:
  Hardware trends (CPU/cache/memory)
  Changed workloads
  Data characteristics/amount
  New application requirements
Some academic prototypes: MonetDB, C-Store, HyPer, HYRISE
Several database vendors picked up the idea and have new databases in place (e.g., SAP, Vertica, Greenplum, Oracle)

[Figure: Traditional DBMS Architecture (buffer pool, query engine)]


5/130

Changes in Hardware

... give an opportunity to re-think the assumptions of yesterday because of what is possible today.

Main memory becomes cheaper and larger
Multi-core architecture (96 cores per server)
One blade ~$50,000 = 1 enterprise-class server
Parallel scaling across blades
64-bit address space
2 TB in current servers
25 GB/s per core
Cost-performance ratio rapidly declining
Memory hierarchies


6/130

In the Meantime

Research has come up with several advances in software for processing data:

Column-oriented data organization (the column store)
  Sequential scans allow best bandwidth utilization between CPU cores and memory
  Independence of tuples within columns allows easy partitioning and therefore parallel processing
Lightweight compression
  Reduces the data amount while increasing processing speed through late materialization
And more, e.g., parallel scan/join/aggregation


7/130

Data Management

for



Stephan Müller




8/130

Challenge: Diverse Applications

Transactional Data Entry
  Sources: machines, transactional apps, user interaction, etc.
Text Analytics, Unstructured Data
  Sources: web, social, logs, support systems, etc.
Event Processing, Stream Data
  Sources: machines, sensors, high-volume systems
Real-time Analytics, Structured Data
  Sources: reporting, classical analytics, planning, simulation

Data Management
  CPUs (multi-core + cache)
  Main memory


9/130

OLTP vs. OLAP

Online Transaction Processing (OLTP) vs. Online Analytical Processing (OLAP)

Modern enterprise resource planning (ERP) systems are challenged by mixed workloads, including OLAP-style queries. For example:
  OLTP-style: create sales order, invoice, accounting documents; display customer master data or sales order
  OLAP-style: dunning, available-to-promise, cross selling, operational reporting (list open sales orders)

But: Today's data management systems are optimized either for daily transactional or for analytical workloads (storing their data along rows or columns)


10/130

Drawbacks of any Separation

OLAP/sub system does not have the latest data
OLAP/sub system only has a predefined subset of the data
Cost-intensive ETL process has to sync both systems
There is a lot of redundancy
Different data schemas introduce complexity for applications combining sources


11/130

Workarounds in OLTP

As in OLAP systems, OLTP systems rely on redundant data to overcome shortcomings in today's data management:
  Materialized views
  Materialized aggregates
  Pre-computed and materialized result sets
Since the database has been the bottleneck, complex data processing is done on the application server:
  Simple SQL statements
  Nested loop joins (SELECT ... SELECT SINGLE ... ENDSELECT)
Batch processing leads to:
  Long-running business processes
  Inflexibility (e.g., ATP rescheduling)


12/130

Enterprise Applications Have a Specific Database Footprint

Today's enterprise applications
  Complex processes
  Increased data set (but real-world events driven)
  Separated into transactional (OLTP) and analytical (OLAP) applications
Enterprise data management
  Wide schemas
  Sparse data with limited domain
  Workload consists of complex analytical queries
Workload characteristics
  Set processing
  Read access
  Insert operations instead of updates


13/130

Enterprise Data Characteristics

Enterprise data is sparse data:
  Most tables are empty (~150 important tables)
  Many columns are not used even once (~50%)
  Many columns have a low cardinality of values
  NULL values/default values are dominant
  Sparse distribution facilitates high compression

[Figure: number of tables that fit into each category of number of records (0, 1-100, 100-1000, 1000-10000, 10K-100K, 100K-1M, 1M-10M, >10M); y-axis from 0 to 50000]


14/130

Enterprise Data Characteristics

Enterprise data is sparse data:
  Most tables are empty (~150 important tables)
  Many columns are not used even once (~50%)
  Many columns have a low cardinality of values
  NULL values/default values are dominant
  Sparse distribution facilitates high compression

[Figure: % of columns by number of distinct values (1-32, 33-1023, 1024-100,000,000) for Inventory Management and Financial Accounting; bar labels as extracted: 13%, 9%, 78% and 24%, 12%, 64%]


15/130

Enterprise Workloads are Read-Mostly

[Figure: share of query types (0% to 100%) in the OLTP and OLAP enterprise workloads compared with the TPC-C workload. Read: lookup, table scan, range select (TPC-C: select). Write: insert, modification, delete.]

Enterprise applications have evolved:
  Not just OLAP vs. OLTP
Workload in enterprise applications consists of:
  Mainly read queries (OLTP 83%, OLAP 94%)
  Many queries access large sets of data


16/130

Approach

Change overall data management system assumptions
  In-memory only
  Start with read-optimized data structures
  Transactional features as needed
  Vertically partitioned (column store)
  CPU-cache optimized
  Only one optimization objective: main memory access
Rethink how enterprise application persistence is built
  Single data management system
  No redundant data, no materialized views, cubes
  Computational application logic closer to the database (i.e., complex queries, stored procedures)

[Figure: in-memory column + row store serving OLTP + OLAP + text, with backup]


17/130

In-Memory Data Processing


Stephan Müller




18/130

Recap:

Memory Hierarchy

18


19/130

Recap: Latency Numbers

Operation                                Latency
L1 cache reference (cached data word)    0.5 ns
Branch mispredict                        5 ns
L2 cache reference                       7 ns
Main memory reference                    100 ns (0.1 µs)
Send 2K bytes over 1 Gbps network        20,000 ns (20 µs)
SSD random read                          150,000 ns (150 µs)
Read 1 MB sequentially from memory       250,000 ns (250 µs)
Disk seek                                10,000,000 ns (10 ms)
Send packet CA -> Netherlands -> CA      150,000,000 ns (150 ms)


20/130

In-Memory Data Processing

In a DBMS, on disk as well as in memory, data processing is often:
  Not CPU bound
  But bandwidth bound
  I/O bottleneck: the CPU could process data faster
Memory access:
  Not truly random (in the sense of constant latency)
  Data is read in blocks/cache lines
  Even if only parts of a block are requested
  Potential waste of bandwidth

[Figure: values V1 to V10 laid out across Cache Line 1 and Cache Line 2]


21/130

Data Layout in Main Memory


Stephan Müller




22/130

Basics (1)

Memory in today's computers has a linear address layout: addresses start at 0x0 and go to 0xFFFFFFFFFFFFFFFF for 64 bit
Not every system is 64-bit addressable, e.g., modern Intel systems use only 48 bits, which allows up to 256 TB of RAM on a single machine
Virtual memory allocated by the program can be distributed over this space
Each UNIX process has its own view of the address space
Address translation is done in hardware by the CPU


23/130

Basics (2)

The memory layout is only linear; every higher-dimensional access is mapped to this linear band.
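As a minimal sketch of this mapping, a two-dimensional cell (row, column) can be turned into a linear offset in two ways, depending on whether rows or columns are kept contiguous. The function names are hypothetical and chosen for illustration only.

```cpp
#include <cassert>
#include <cstddef>

// Offset of cell (row, col) when rows are stored one after another.
inline std::size_t row_major_offset(std::size_t row, std::size_t col,
                                    std::size_t num_rows, std::size_t num_cols) {
    assert(row < num_rows && col < num_cols);
    return row * num_cols + col;
}

// Offset of the same cell when columns are stored one after another.
inline std::size_t column_major_offset(std::size_t row, std::size_t col,
                                       std::size_t num_rows, std::size_t num_cols) {
    assert(row < num_rows && col < num_cols);
    return col * num_rows + row;
}
```

For a table with 4 rows and 3 attributes, the second attribute of the third row sits at offset 7 in the row-wise band and at offset 6 in the column-wise band.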

23


24/130

Physical Data Representation

Row store:
  Rows are stored consecutively
  Optimal for row-wise access (e.g., SELECT *)
Column store:
  Columns are stored consecutively
  Optimal for attribute-focused access (e.g., SUM, GROUP BY)
Note: the concept is independent of the storage type
  But only an in-memory implementation allows fast tuple reconstruction in case of a column store

[Figure: a table with columns Doc Num, Doc Date, Sold-To, Value, Status, Sales Org and rows Row1 to Row4, shown as row store and column store]


25/130

Row Data Layout

Data is stored tuple-wise
Leverages co-location of attributes for a single tuple
Low cost for tuple reconstruction, but higher cost for a sequential scan of a single attribute

25


26/130

Columnar Data Layout

Data is stored attribute-wise
Leverages sequential scan speed in main memory for predicate evaluation
Tuple reconstruction is more expensive

26


27/130

Row-oriented storage

Logical table:
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4

Linear memory layout (rows stored consecutively):
A1 B1 C1 A2 B2 C2 A3 B3 C3 A4 B4 C4


32/130

Column-oriented storage

Logical table:
A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4

Linear memory layout (columns stored consecutively):
A1 A2 A3 A4 B1 B2 B3 B4 C1 C2 C3 C4


36/130

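Example: OLTP-Style Query

struct Tuple {
    int a, b, c;
};

Tuple data[4];
fill(data);             // fill() is not shown on the slide; it populates the four tuples

Tuple third = data[2];  // entity-focused access: the third tuple has index 2

A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4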


37/130

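Example: OLTP-Style Query

struct Tuple {
    int a, b, c;
};

Tuple data[4];
fill(data);             // fill() is not shown on the slide; it populates the four tuples

Tuple third = data[2];  // entity-focused access: the third tuple has index 2

A1 B1 C1
A2 B2 C2
A3 B3 C3
A4 B4 C4

[Figure: cache lines touched by this access: 2 in row-oriented storage, 3 in column-oriented storage]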


38/130

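Example: OLAP-Style Query

struct Tuple {
    int a, b, c;
};

Tuple data[4];
fill(data);             // fill() is not shown on the slide; it populates the four tuples

int sum = 0;
for (int i = 0; i < 4; i++)
    sum += data[i].a;   // the loop was cut off in the source; summing attribute a is an assumption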




40/130

Mixed Workloads

[Figure: the same table (A1 to C4) accessed row-wise by OLTP-style queries and column-wise by OLAP-style queries]

Mixed workloads involve attribute- and entity-focused queries


41/130

Mixed Workloads: Choosing the Layout

[Figure: the same table (A1 to C4) accessed by OLTP-style queries and OLAP-style queries]

Layout   OLTP misses   OLAP misses   Mixed
Row      2             3             5
Column   3             1             4
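The two layouts compared above can be sketched as an array of structs (row store) and a struct of arrays (column store). This is an illustrative sketch only: the names RowTable, ColumnTable, sum_a, and get_tuple are assumptions, and the code shows the access patterns rather than measuring cache misses.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kRows = 4;

// Row layout: all attributes of one tuple are adjacent in memory.
struct Tuple {
    int a, b, c;
};
using RowTable = std::array<Tuple, kRows>;

// Column layout: all values of one attribute are adjacent in memory.
struct ColumnTable {
    std::array<int, kRows> a, b, c;
};

// OLAP-style: aggregate one attribute over all tuples.
inline long sum_a(const RowTable& t) {
    long sum = 0;
    for (const Tuple& row : t) sum += row.a;  // strides over b and c
    return sum;
}
inline long sum_a(const ColumnTable& t) {
    long sum = 0;
    for (int v : t.a) sum += v;  // contiguous scan of one column
    return sum;
}

// OLTP-style: fetch one complete tuple.
inline Tuple get_tuple(const RowTable& t, std::size_t i) {
    assert(i < kRows);
    return t[i];  // one contiguous read
}
inline Tuple get_tuple(const ColumnTable& t, std::size_t i) {
    assert(i < kRows);
    return Tuple{t.a[i], t.b[i], t.c[i]};  // one read per column
}
```

Both layouts return the same results; they differ only in which of the two query styles touches contiguous memory.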


42/130

Dictionary Encoding


Stephan Müller




43/130

Motivation

Main memory access is the new bottleneck
Idea: trade CPU time to compress and decompress data
Compression reduces the number of I/O operations to main memory
  Leads to fewer cache misses because more information fits on a cache line
Operation directly on compressed data
Offsetting with bit-encoded fixed-length data types
Based on limited value domain


44/130

Dictionary Encoding

Example: 8 billion humans
Attributes
  First name
  Last name
  Gender
  Country
  City
  Birthday
  200 byte per tuple in total
Each attribute is dictionary encoded


45/130

Sample Data

45

rec ID fname lname gender city country birthday

39 John Smith m Chicago USA 12.03.1964

40 Mary Brown f London UK 12.05.1964

41 Jane Doe f Palo Alto USA 23.04.1976

42 John Doe m Palo Alto USA 17.06.1952

43 Peter Schmidt m Potsdam GER 11.11.1975


46/130

Dictionary Encoding a Column

A column is split into a dictionary and an attribute vector
Dictionary stores all distinct values with implicit value ID
Attribute vector stores value IDs for all entries in the column
Position is stored implicitly
Enables offsetting with bit-encoded fixed-length data types

Column fname
Rec ID   fname
39       John
40       Mary
41       Jane
42       John
43       Peter

Dictionary for fname
Value ID   Value
23         John
24         Mary
25         Jane
26         Peter

Attribute Vector for fname
Position   Value ID
39         23
40         24
41         25
42         23
43         26
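A minimal sketch of splitting a column into a dictionary and an attribute vector follows. It is an illustration under assumptions: the names EncodedColumn, encode, and value_at are hypothetical, value IDs and positions start at 0 rather than at the excerpt values 23 and 39, and the hash map is only a build-time helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// One dictionary-encoded column: distinct values plus one value ID per row.
struct EncodedColumn {
    std::vector<std::string> dictionary;          // value ID = index (implicit)
    std::vector<std::uint32_t> attribute_vector;  // position = index (implicit)
};

// Encodes a column; value IDs are assigned in order of first appearance.
inline EncodedColumn encode(const std::vector<std::string>& column) {
    EncodedColumn out;
    std::unordered_map<std::string, std::uint32_t> ids;  // build-time helper only
    for (const std::string& value : column) {
        auto it = ids.find(value);
        if (it == ids.end()) {
            const auto id = static_cast<std::uint32_t>(out.dictionary.size());
            it = ids.emplace(value, id).first;
            out.dictionary.push_back(value);
        }
        out.attribute_vector.push_back(it->second);
    }
    return out;
}

// Decodes a single position back to its value.
inline const std::string& value_at(const EncodedColumn& col, std::size_t position) {
    assert(position < col.attribute_vector.size());
    return col.dictionary[col.attribute_vector[position]];
}
```

Encoding the five fname entries above yields a dictionary with four values and an attribute vector with five value IDs, where both John entries share one ID.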


47/130

Querying Data using Dictionaries

Search for an attribute value:
  1. Search the dictionary for the value ID of the requested value
  2. Scan the attribute vector for that value ID
  3. Replace value IDs in the result with the corresponding dictionary value

Dictionary for fname
Value ID   Value
23         John
24         Mary
25         Jane
26         Peter

Attribute Vector for fname
Position   Value ID
39         23
40         24
41         25
42         23
43         26
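The three query steps can be sketched as follows for an unsorted dictionary. The function names positions_equal and materialize are assumptions made for this illustration, and value IDs and positions start at 0 here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Steps 1 and 2: look up the value ID, then scan the attribute vector for it.
inline std::vector<std::size_t> positions_equal(
        const std::vector<std::string>& dictionary,
        const std::vector<std::uint32_t>& attribute_vector,
        const std::string& value) {
    std::vector<std::size_t> positions;
    auto it = std::find(dictionary.begin(), dictionary.end(), value);  // O(n) on an unsorted dictionary
    if (it == dictionary.end()) return positions;                      // value does not occur
    const auto id = static_cast<std::uint32_t>(it - dictionary.begin());
    for (std::size_t pos = 0; pos < attribute_vector.size(); ++pos) {
        if (attribute_vector[pos] == id) positions.push_back(pos);     // integer comparison only
    }
    return positions;
}

// Step 3: replace value IDs at the result positions with dictionary values.
inline std::vector<std::string> materialize(
        const std::vector<std::string>& dictionary,
        const std::vector<std::uint32_t>& attribute_vector,
        const std::vector<std::size_t>& positions) {
    std::vector<std::string> values;
    for (std::size_t pos : positions) {
        assert(pos < attribute_vector.size());
        values.push_back(dictionary[attribute_vector[pos]]);
    }
    return values;
}
```

The scan itself only compares integers; strings are touched once during the dictionary lookup and once more when the result is materialized.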


48/130

Sorted Dictionary

Dictionary entries are sorted either by their numeric value or lexicographically
  Dictionary lookup complexity: O(log(n)) instead of O(n)
Dictionary entries can be compressed to reduce the amount of required storage
Selection criteria with ranges are less expensive (order-preserving dictionary)
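A sketch of why an order-preserving dictionary makes range predicates cheap: the value range is translated once into a contiguous range of value IDs via binary search, and the scan then compares integers only. The names IdRange, id_range, and range_select are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Half-open range of value IDs: [first, last_exclusive).
struct IdRange {
    std::uint32_t first;
    std::uint32_t last_exclusive;
};

// Maps the value range [low, high] to value IDs with two binary searches.
inline IdRange id_range(const std::vector<std::string>& sorted_dictionary,
                        const std::string& low, const std::string& high) {
    assert(std::is_sorted(sorted_dictionary.begin(), sorted_dictionary.end()));
    auto lo = std::lower_bound(sorted_dictionary.begin(), sorted_dictionary.end(), low);
    auto hi = std::upper_bound(sorted_dictionary.begin(), sorted_dictionary.end(), high);
    return IdRange{static_cast<std::uint32_t>(lo - sorted_dictionary.begin()),
                   static_cast<std::uint32_t>(hi - sorted_dictionary.begin())};
}

// Scans the attribute vector for value IDs inside the translated range.
inline std::vector<std::size_t> range_select(
        const std::vector<std::string>& sorted_dictionary,
        const std::vector<std::uint32_t>& attribute_vector,
        const std::string& low, const std::string& high) {
    const IdRange r = id_range(sorted_dictionary, low, high);
    std::vector<std::size_t> positions;
    for (std::size_t pos = 0; pos < attribute_vector.size(); ++pos) {
        const std::uint32_t id = attribute_vector[pos];
        if (id >= r.first && id < r.last_exclusive) positions.push_back(pos);
    }
    return positions;
}
```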


49/130

Compression Rate

Depends on cardinality / entropy
Cardinality
  Table cardinality: number of tuples in a relation
  Column cardinality: number of distinct values in a column
Entropy
  Measure for information density
  Entropy = column cardinality / table cardinality
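The size arithmetic behind the compression rate can be sketched as below. The helper names are assumptions; the model simply takes ceil(log2(column cardinality)) bits per value ID and ignores any further dictionary compression.

```cpp
#include <cassert>
#include <cstdint>

// Bits needed to give each of `cardinality` distinct values its own value ID.
inline unsigned bits_per_value_id(std::uint64_t cardinality) {
    assert(cardinality > 0);
    unsigned bits = 1;  // a single-valued column still occupies one bit here
    while (bits < 64 && (std::uint64_t{1} << bits) < cardinality) ++bits;
    return bits;
}

// Entropy as defined above: column cardinality / table cardinality.
inline double entropy(std::uint64_t table_cardinality, std::uint64_t column_cardinality) {
    assert(table_cardinality > 0);
    return static_cast<double>(column_cardinality) / static_cast<double>(table_cardinality);
}

struct ColumnSize {
    double plain_bytes;             // uncompressed column
    double dictionary_bytes;        // distinct values only
    double attribute_vector_bytes;  // bit-packed value IDs
};

inline ColumnSize column_size(std::uint64_t rows, std::uint64_t cardinality,
                              std::uint64_t item_bytes) {
    ColumnSize s;
    s.plain_bytes = static_cast<double>(rows) * static_cast<double>(item_bytes);
    s.dictionary_bytes = static_cast<double>(cardinality) * static_cast<double>(item_bytes);
    s.attribute_vector_bytes =
        static_cast<double>(rows) * bits_per_value_id(cardinality) / 8.0;
    return s;
}
```

For 8 billion rows, 5 million distinct first names, and 49 bytes per name, this gives 23 bits per value ID, 392 x 10^9 bytes uncompressed, and 23 x 10^9 bytes for the attribute vector.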


50/130

Data Size Examples

Column        Cardinality   Bits needed   Entropy          Item size   Plain size               Size with dictionary (dictionary + column)
First names   5 million     23 bit        6.25 x 10^-4     49 byte     365.10 GB (373,840 MB)   234 MB + 21.42 GB (22,168 MB)
Last names    8 million     23 bit        1 x 10^-3        50 byte     372.5 GB (381,470 MB)    381 MB + 21.42 GB (22,316 MB)
Gender        2             1 bit         2.5 x 10^-10     1 byte      7.45 GB (7,630 MB)       2 byte + 0.93 GB (954 MB)
City          1 million     20 bit        1.25 x 10^-4     49 byte     365.08 GB (373,840 MB)   46.73 MB + 18.62 GB (19,120 MB)
Country       200           8 bit         2.5 x 10^-8      49 byte     365.08 GB (373,840 MB)   6.09 KB + 7.45 GB (7,629 MB)
Birthday      40,000        16 bit        5 x 10^-6        2 byte      14.90 GB (15,260 MB)     76.29 KB + 14.90 GB (15,259 MB)


51/130

In-Memory Database

Operations


Stephan Müller




52/130

Scan Performance (1)

8 billion humans
Attributes
  First name
  Last name
  Gender
  Country
  City
  Birthday
  200 byte per tuple in total
Question: How many men/women?
Assumed scan speed: 2 MB/ms/core


53/130


Row Store Layout

Table size = 8 billion tuples x 200 bytes per tuple = 1.6 TB
Scan through all rows with 2 MB/ms/core: 800 seconds with 1 core


54/130


Row Store Full Table Scan

Table size = 8 billion tuples x 200 bytes per tuple = 1.6 TB
Scan through all rows with 2 MB/ms/core: 800 seconds with 1 core


55/130


Row Store Stride Access Gender

8 billion cache accesses x 64 byte = 512 GB
Read with 2 MB/ms/core: 256 seconds with 1 core


56/130


Column Store Layout

Table size
  Attribute vectors: 91 GB
  Dictionaries: 700 MB
  Total: 92 GB
Compression factor: 17


57/130


Column Store Full Column Scan on Gender

Size of attribute vector gender = 8 billion tuples x 1 bit per tuple = 1 GB
Scan through attribute vector with 2 MB/ms/core: 0.5 seconds with 1 core


58/130


Column Store Full Column Scan on Birthday

Size of attribute vector birthday = 8 billion tuples x 2 byte per tuple = 16 GB
Scan through column with 2 MB/ms/core: 8 seconds with 1 core


59/130

Scan Performance Summary

How many women, how many men?

Layout                       Time in seconds   Compared to column store
Column store                 0.5
Row store, full table scan   800               1,600x slower
Row store, stride access     256               512x slower
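The scan times in this summary follow from bytes read divided by scan bandwidth. The sketch below reproduces that arithmetic with the assumed scan speed of 2 MB/ms per core (2 x 10^9 bytes per second) and a 64-byte cache line; the function names are hypothetical.

```cpp
#include <cassert>

constexpr double kScanBytesPerSecond = 2e9;  // 2 MB/ms per core, as assumed in this section
constexpr double kCacheLineBytes = 64.0;

// Time to read `bytes` sequentially with the given number of cores.
inline double scan_seconds(double bytes, unsigned cores = 1) {
    assert(cores > 0);
    return bytes / (kScanBytesPerSecond * cores);
}

// Row store, full table scan: every byte of every tuple is read.
inline double row_store_full_scan_bytes(double tuples, double tuple_bytes) {
    return tuples * tuple_bytes;
}

// Row store, stride access: one cache line is loaded per tuple.
inline double row_store_stride_bytes(double tuples) {
    return tuples * kCacheLineBytes;
}

// Column store: only the bit-packed attribute vector is read.
inline double column_scan_bytes(double tuples, double bits_per_value) {
    return tuples * bits_per_value / 8.0;
}
```

With 8 billion tuples this yields 800 s for the full row scan, 256 s for stride access, 0.5 s for the gender column, and 8 s for the birthday column. Passing a core count greater than 1 models the ideal case of perfectly parallel scans.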


60/130

SELECT Example

SELECT fname, lname
FROM world_population
WHERE country = 'Italy' AND gender = 'm'

id      fname       lname       country       gender
2394    Gianluigi   Buffon      Italy         m
3010    Lena        Gercke      Germany       f
3040    Mario       Balotelli   Italy         m
3949    Manuel      Neuer       Germany       m
4902    Lukas       Podolski    Germany       m
20102   Klaas-Jan   Huntelaar   Netherlands   m


61/130

Query Plan

Predicates are evaluated and generate position lists
Intermediate position lists are logically combined
Final position list is used for materialization


62/130

Query Execution

Dictionary for country
Value ID   Value
0          Algeria
1          France
2          Germany
3          Italy
4          Netherlands

Dictionary for gender
Value ID   Value
0          f
1          m

Dictionary-encoded table (positions are implicit)
Position   id      fname       lname       country   gender
0          2394    Gianluigi   Buffon      3         1
1          3010    Lena        Gercke      2         0
2          3040    Mario       Balotelli   3         1
3          3949    Manuel      Neuer       2         1
4          4902    Lukas       Podolski    2         1
5          20102   Klaas-Jan   Huntelaar   4         1

country = 3 (Italy)   -> position list 0, 2
gender = 1 (m)        -> position list 0, 2, 3, 4, 5
AND                   -> position list 0, 2
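The execution above can be sketched as two column scans that each produce a sorted position list, followed by an intersection for the AND. The names PositionList, scan_equal, and intersect are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <vector>

using PositionList = std::vector<std::size_t>;

// Evaluates one predicate "column = value_id" and returns the matching positions.
inline PositionList scan_equal(const std::vector<std::uint32_t>& attribute_vector,
                               std::uint32_t value_id) {
    PositionList positions;
    for (std::size_t pos = 0; pos < attribute_vector.size(); ++pos) {
        if (attribute_vector[pos] == value_id) positions.push_back(pos);
    }
    return positions;  // ascending by construction
}

// Logical AND of two predicates: intersection of their sorted position lists.
inline PositionList intersect(const PositionList& left, const PositionList& right) {
    assert(std::is_sorted(left.begin(), left.end()));
    assert(std::is_sorted(right.begin(), right.end()));
    PositionList out;
    std::set_intersection(left.begin(), left.end(), right.begin(), right.end(),
                          std::back_inserter(out));
    return out;
}
```

Only the final position list is used to materialize fname and lname, so the string columns are touched for matching tuples only.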


63/130

Tuple Reconstruction


Stephan Müller




64/130

Tuple Reconstruction #1

Accessing a record in a row store
All attributes are stored consecutively
200 byte -> 4 cache accesses x 64 byte = 256 byte
Read with 2 MB/ms/core: 0.128 microseconds with 1 core


65/130


Virtual Record IDs

All attributes are stored in separate columns
Implicit record IDs are used to reconstruct rows


66/130


Virtual Record IDs

1 cache access for each attribute
6 cache accesses x 64 byte = 384 byte
Read with 2 MB/ms/core: 0.192 microseconds with 1 core
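Reconstructing a tuple from separate columns can be sketched as one attribute-vector read plus one dictionary read per column, all at the same implicit record ID. The names Column and reconstruct_tuple are hypothetical, and all attributes are modeled as strings for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// A dictionary-encoded column; the record ID is the index into attribute_vector.
struct Column {
    std::vector<std::string> dictionary;
    std::vector<std::uint32_t> attribute_vector;
};

// Rebuilds one row by reading the same position in every column.
inline std::vector<std::string> reconstruct_tuple(const std::vector<Column>& columns,
                                                  std::size_t record_id) {
    std::vector<std::string> tuple;
    tuple.reserve(columns.size());
    for (const Column& column : columns) {
        assert(record_id < column.attribute_vector.size());
        const std::uint32_t value_id = column.attribute_vector[record_id];  // one access per attribute
        tuple.push_back(column.dictionary[value_id]);
    }
    return tuple;
}
```

Because no explicit key is stored, the position itself ties the attributes of a tuple together; the cost grows with the number of columns requested.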


67/130

Insert


Stephan Müller




68/130

Example

Table world_population

rowID   fname     lname      gender   country   city        birthday
0       Martin    Albrecht   m        GER       Berlin      08-05-1955
1       Michael   Berg       m        GER       Berlin      03-05-1970
2       Hanna     Schulze    f        GER       Hamburg     04-04-1968
3       Anton     Meyer      m        AUT       Innsbruck   10-20-1992
4       Sophie    Schulze    f        GER       Potsdam     09-03-1977
...     ...       ...        ...      ...       ...         ...

INSERT INTO world_population
VALUES ('Karen', 'Schulze', 'f', 'GER', 'Rostock', '06-20-2012')


69/130

INSERT (1) w/o new Dictionary entry

INSERT INTO world_population VALUES ('Karen', 'Schulze', 'f', 'GER', 'Rostock', '06-20-2012')

Column lname before the insert:

Attribute Vector (AV)      Dictionary (D)
Position   Value ID        Value ID   Value
0          0               0          Albrecht
1          1               1          Berg
2          3               2          Meyer
3          2               3          Schulze
4          3

1. Look-up on D: entry found (Schulze has value ID 3)
2. Append value ID to AV: position 5 gets value ID 3


72/130

INSERT (2) with new Dictionary Entry I/II

Column city before the insert:

Attribute Vector (AV)      Dictionary (D)
Position   Value ID        Value ID   Value
0          0               0          Berlin
1          0               1          Hamburg
2          1               2          Innsbruck
3          2               3          Potsdam
4          3

1. Look-up on D: no entry found
2. Append new value to D (no re-sorting necessary): Rostock gets value ID 4
3. Append value ID to AV: position 5 gets value ID 4


76/130

INSERT (2) with new Dictionary Entry II/II

Column fname before the insert:

Attribute Vector (AV)      Dictionary (D)
Position   Value ID        Value ID   Value
0          2               0          Anton
1          3               1          Hanna
2          1               2          Martin
3          0               3          Michael
4          4               4          Sophie

1. Look-up on D: no entry found
2. Append new value to D: Karen gets value ID 5
3. Sort D: D (new) = 0 Anton, 1 Hanna, 2 Karen, 3 Martin, 4 Michael, 5 Sophie
4. Change value IDs in AV: AV (old) = 2, 3, 1, 0, 4 becomes AV (new) = 3, 4, 1, 0, 5
5. Append new value ID to AV: position 5 gets value ID 2


82/130

RESULT

Table world_population

rowID   fname     lname      gender   country   city        birthday
...     ...       ...        ...      ...       ...         ...
4       Sophie    Schulze    f        GER       Potsdam     09-03-1977
5       Karen     Schulze    f        GER       Rostock     06-20-2012
...     ...       ...        ...      ...       ...         ...

INSERT INTO world_population
VALUES ('Karen', 'Schulze', 'f', 'GER', 'Rostock', '06-20-2012')
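The insert cases walked through above can be sketched in one function. As a simplification, the sketch inserts the new value directly at its sorted position instead of appending and re-sorting, which leads to the same dictionary and the same value ID changes. The names SortedColumn and insert_value are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

struct SortedColumn {
    std::vector<std::string> dictionary;          // kept sorted
    std::vector<std::uint32_t> attribute_vector;  // one value ID per row
};

// Appends one value to the column, extending the dictionary if necessary.
inline void insert_value(SortedColumn& col, const std::string& value) {
    assert(std::is_sorted(col.dictionary.begin(), col.dictionary.end()));
    // 1. Look-up on D (binary search on the sorted dictionary).
    auto it = std::lower_bound(col.dictionary.begin(), col.dictionary.end(), value);
    const auto id = static_cast<std::uint32_t>(it - col.dictionary.begin());
    const bool found = it != col.dictionary.end() && *it == value;
    if (!found) {
        // 2./3. Place the new value at its sorted position in D.
        col.dictionary.insert(it, value);
        // 4. Every existing value ID at or after that position moves up by one.
        for (std::uint32_t& existing : col.attribute_vector) {
            if (existing >= id) ++existing;
        }
    }
    // 5. Append the value ID of the new row to AV.
    col.attribute_vector.push_back(id);
}
```

When the new value sorts last (Rostock in the city column), step 4 changes nothing, which matches the case where no re-sorting is necessary. The rewrite of the whole attribute vector in the other case is what makes direct inserts into the compressed structure expensive.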


83/130

Insert-Only


Stephan Müller


84/130

Facts about Insert-Only

Principles
  Never delete any data
  Invalidate outdated tuples instead
  A logical update becomes a technical insert/append
Advantages
  Gap-less time travel possible
  Legal requirements, e.g., for auditability, can easily be met
  Implicit logging
  Snapshot isolation and locking are simplified
Disadvantage: increased memory consumption
  But applied compression reduces the overhead


85/130

Implementation Possibilities

1. Point representation
   Store complete tuple on attribute change
   Save insert timestamp in column valid_from
   Writes faster, reads slower
2. Interval representation
   Store complete tuple on attribute change
   Update replaced tuple, store current timestamp in valid_to
   Same timestamp is stored into valid_from in new tuple
   Reads faster, writes slower
3. History reconstruction on demand
   Update existing tuple on changes (not insert-only any longer)
   Store outdated values in separate history table
Status updates can be done in-place with timestamps
   Timestamps are not compressed


86/130

Insert-Only 1: Point Representation

Introduce one additional column: valid_from
Primary key is composed of id and valid_from
Insert is easy: valid_from = current timestamp
Select, Group By, Join: requires nested join to determine the valid_from timestamp for each object

id         fname       lname        gender   country   city        birthday     valid_from
0          Martin      Albrecht     m        GER       Berlin      08-05-1955   10-11-2011
1          Michael     Berg         m        GER       Berlin      03-05-1970   10-11-2011
2          Hanna       Schulze      f        GER       Hamburg     04-04-1968   10-11-2011
3          Anton       Meyer        m        AUT       Innsbruck   10-20-1992   10-11-2011
4          Ulrike      Schulze      f        GER       Potsdam     09-03-1977   10-11-2011
5          Sophie      Schulze      f        GER       Rostock     06-20-2012   10-11-2011
1          Michael     Berg         m        GER       Potsdam     03-05-1970   07-02-2012
...        ...         ...          ...      ...       ...         ...          ...
8 x 10^9   Zacharias   Perdopolus   m        GRE       Athen       03-12-1979   10-11-2011


87/130

Insert-Only 2: Interval Representation

Introduce 2 additional columns: valid_from and valid_to
Primary key is composed of id and valid_from
Insert requires update of the formerly current tuple
Select, Group By, Join is easy: the WHERE clause eliminates tuples out of range
Finding up-to-date entries can be supported by an additional bit-vector on column valid_to

id         fname       lname        gender   country   city        birthday     valid_from   valid_to
0          Martin      Albrecht     m        GER       Berlin      08-05-1955   10-11-2011
1          Michael     Berg         m        GER       Berlin      03-05-1970   10-11-2011   07-02-2012
2          Hanna       Schulze      f        GER       Hamburg     04-04-1968   10-11-2011
3          Anton       Meyer        m        AUT       Innsbruck   10-20-1992   10-11-2011
4          Ulrike      Schulze      f        GER       Potsdam     09-03-1977   10-11-2011
5          Sophie      Schulze      f        GER       Rostock     06-20-2012   10-11-2011
1          Michael     Berg         m        GER       Potsdam     03-05-1970   07-02-2012
...        ...         ...          ...      ...       ...         ...          ...          ...
8 x 10^9   Zacharias   Perdopolus   m        GRE       Athen       03-12-1979   10-11-2011
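A minimal sketch of the interval representation, reduced to one payload column. The names Version, update_city, and version_at, the sentinel kOpen for an empty valid_to, and the integer timestamps in YYYYMMDD form are all assumptions made for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

constexpr std::int64_t kOpen = -1;  // hypothetical marker for "valid_to not set"

struct Version {
    std::uint64_t id;
    std::string city;
    std::int64_t valid_from;
    std::int64_t valid_to;  // kOpen while the version is current
};

// Logical update: close the current version, then append the new one.
inline void update_city(std::vector<Version>& table, std::uint64_t id,
                        const std::string& city, std::int64_t now) {
    for (Version& v : table) {
        if (v.id == id && v.valid_to == kOpen) {
            assert(v.valid_from <= now);
            v.valid_to = now;  // the only in-place write
        }
    }
    table.push_back(Version{id, city, now, kOpen});
}

// Time travel: the version of `id` that was valid at time t, or nullptr.
inline const Version* version_at(const std::vector<Version>& table,
                                 std::uint64_t id, std::int64_t t) {
    for (const Version& v : table) {
        const bool started = v.valid_from <= t;
        const bool not_ended = v.valid_to == kOpen || t < v.valid_to;
        if (v.id == id && started && not_ended) return &v;
    }
    return nullptr;
}
```

Reads need only a range check per tuple, while every update has to touch the formerly current tuple, which reflects the "reads faster, writes slower" trade-off of this representation.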



88/130

Snapshot Isolation

Snapshot isolation guarantees consistent reads during a transaction: all reads retrieve the values that were active at the moment the transaction started
Conflicts like lost updates may theoretically happen, but are prevented through
  Pre-write checks in the database, or
  Application-level locks

[Figure: timeline with transaction1 (read1, insert1) and transaction2 (read2, insert2). When transaction2 wants to insert, an alert is raised since the read values are no longer valid.]


89/130

Status Updates

When status fields are updated by replacement, do we need to insert a new version of the tuple?
Insert-only would lead to an overhead (e.g., clearing in FI)
Most status fields are binary
Uncompressed in-place updates with row timestamp

[Figure: status changes from Unpaid (t = NULL) to Paid (t = 2009/06/30)]


90/130

Handling Data Modifications


Stephan Müller


91/130

Motivation

Inserting new tuples directly into a compressed structure can be expensive
  New values can require reorganizing the dictionary
  Number of bits required to encode all dictionary values can change, so the attribute vector has to be reorganized
Deletion of tuples is expensive
  All attribute vectors have to be reorganized; value IDs of following tuples have to be moved


92/130

Differential Buffer

New values are written to a dedicated differential buffer (Delta)
Cache Sensitive B+ Tree (CSB+) used for faster search on Delta

[Figure: column fname of table world_population with read and write paths]

Main Store (compressed), 8 billion entries
  Dictionary (value ID, value): 0 Anton, 1 Hanna, 2 Michael, 3 Sophie
  Attribute vector (position, value ID): 0 0, 1 1, 2 1, 3 3, 4 2, 5 1

Differential Buffer / Delta, up to 50,000 entries
  Dictionary with CSB+ tree (value ID, value): 0 Angela, 1 Klaus, 2 Andre
  Attribute vector (position, value ID): 0 0, 1 1, 2 1, 3 2


93/130

Differential Buffer

Inserts of new values are faster, because dictionary and attribute vector do not need to be re-sorted
Range selects on the differential buffer are expensive, since they are based on an unsorted dictionary
Differential buffer requires more memory:
  No attribute vector compression
  Additional CSB+ tree for the dictionary
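The interplay of main store and differential buffer can be sketched as follows. This is an illustration under assumptions: std::map stands in for the CSB+ tree, the names MainStore, DeltaStore, delta_insert, and count_equal are hypothetical, and only a single column is modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct MainStore {
    std::vector<std::string> dictionary;  // sorted, read-optimized
    std::vector<std::uint32_t> attribute_vector;
};

struct DeltaStore {
    std::vector<std::string> dictionary;          // unsorted, append-only
    std::map<std::string, std::uint32_t> index;   // stand-in for the CSB+ tree
    std::vector<std::uint32_t> attribute_vector;
};

// Writes go to the delta only; nothing in the main store is re-sorted.
inline void delta_insert(DeltaStore& delta, const std::string& value) {
    auto it = delta.index.find(value);
    if (it == delta.index.end()) {
        const auto id = static_cast<std::uint32_t>(delta.dictionary.size());
        it = delta.index.emplace(value, id).first;
        delta.dictionary.push_back(value);  // appended, never re-sorted
    }
    delta.attribute_vector.push_back(it->second);
}

// Reads have to combine the results of both stores.
inline std::size_t count_equal(const MainStore& main_store, const DeltaStore& delta,
                               const std::string& value) {
    assert(std::is_sorted(main_store.dictionary.begin(), main_store.dictionary.end()));
    std::size_t hits = 0;
    auto mit = std::lower_bound(main_store.dictionary.begin(),
                                main_store.dictionary.end(), value);
    if (mit != main_store.dictionary.end() && *mit == value) {
        const auto id = static_cast<std::uint32_t>(mit - main_store.dictionary.begin());
        hits += static_cast<std::size_t>(std::count(main_store.attribute_vector.begin(),
                                                    main_store.attribute_vector.end(), id));
    }
    auto dit = delta.index.find(value);
    if (dit != delta.index.end()) {
        hits += static_cast<std::size_t>(std::count(delta.attribute_vector.begin(),
                                                    delta.attribute_vector.end(), dit->second));
    }
    return hits;
}
```

The tree index gives logarithmic lookups on the unsorted delta dictionary, at the price of the extra memory mentioned above.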



94/130

Tuple Lifetime

Main table: world_population (Main Store and Differential Buffer)

recId      fname       lname        gender   country   city        birthday
0          Martin      Albrecht     m        GER       Berlin      08-05-1955
1          Michael     Berg         m        GER       Berlin      03-05-1970
2          Hanna       Schulze      f        GER       Hamburg     04-04-1968
3          Anton       Meyer        m        AUT       Innsbruck   10-20-1992
4          Ulrike      Schulze      f        GER       Potsdam     09-03-1977
5          Sophie      Schulze      f        GER       Rostock     06-20-2012
...        ...         ...          ...      ...       ...         ...
8 x 10^9   Zacharias   Perdopolus   m        GRE       Athen       03-12-1979

Michael moves from Berlin to Potsdam

UPDATE world_population
SET city = 'Potsdam'
WHERE fname = 'Michael' AND lname = 'Berg'




96/130

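Tuple Lifetime

UPDATE world_population SET city = 'Potsdam' ... (Michael moves from Berlin to Potsdam)

Main Store: the existing tuples of world_population
Differential Buffer: the new version of the tuple

recId   fname     lname   gender   country   city      birthday
0       Michael   Berg    m        GER       Potsdam   03-05-1970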


97/130

Tuple Lifetime

Problem: Tuples are now available in Main Store and Differential Buffer
Tuples of a table are marked by a validity vector to reduce the required amount of reorganization steps
  Like an attribute vector for validity
  Invalidated tuples stay in the database table until the next reorganization takes place
  Search results are reduced using the validity vector
  1 bit required per database tuple
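The validity vector can be sketched with one bit per tuple that is cleared when a newer version is written to the differential buffer. The names Partition, append, update, and count_valid are hypothetical, and a single city column stands in for the whole tuple.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct Partition {
    std::vector<std::string> city;  // one column is enough for this sketch
    std::vector<bool> valid;        // validity vector: 1 bit per tuple
};

inline void append(Partition& p, const std::string& city) {
    p.city.push_back(city);
    p.valid.push_back(true);
}

// Logical update: invalidate the old tuple in the main store and
// append the new version to the differential buffer.
inline void update(Partition& main_store, std::size_t position,
                   Partition& delta, const std::string& new_city) {
    assert(position < main_store.valid.size());
    main_store.valid[position] = false;  // the tuple itself stays in place
    append(delta, new_city);
}

// Search results are reduced using the validity vector.
inline std::size_t count_valid(const Partition& p, const std::string& city) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < p.city.size(); ++i) {
        if (p.valid[i] && p.city[i] == city) ++hits;
    }
    return hits;
}
```

Clearing a bit avoids reorganizing the compressed columns on every update; the invalidated tuple is only dropped at the next reorganization.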



98/130

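Tuple Lifetime

UPDATE world_population SET city = 'Potsdam' ... (Michael moves from Berlin to Potsdam)

Main Store
recId      fname       lname        gender   country   city        birthday     valid
0          Martin      Albrecht     m        GER       Berlin      08-05-1955   1
1          Michael     Berg         m        GER       Berlin      03-05-1970   0
2          Hanna       Schulze      f        GER       Hamburg     04-04-1968   1
3          Anton       Meyer        m        AUT       Innsbruck   10-20-1992   1
4          Ulrike      Schulze      f        GER       Potsdam     09-03-1977   1
5          Sophie      Schulze      f        GER       Rostock     06-20-2012   1
...        ...         ...          ...      ...       ...         ...          ...
8 x 10^9   Zacharias   Perdopolus   m        GRE       Athen       03-12-1979   1

Differential Buffer
recId      fname       lname        gender   country   city        birthday     valid
0          Michael     Berg         m        GER       Potsdam     03-05-1970   1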






100/130

Stored Procedures

Prof. Hasso Plattner
Stephan Müller


101/130

Stored Procedures

Basically a procedural program stored in the database
Written in a (vendor-)specific language (e.g., PL/SQL, Java, ...)
  Usually supports constructs like loops, conditions, etc.
Takes parameters as input
Returns a result set
Main usage: set operations that cannot be expressed in SQL (or are very hard to express)
Additional usages:
  Access control
  Data validation
  Data conversion


102/130

Advantages

Performance
  No data transfer between database and application server
Code reduction
  Share stored procedures written in a generic way
  Less code in the application layer
Improved security
  Parameterized stored procedures reduce the risk of SQL injection


103/130

Implications on Application Development


104/130

How does it all come together?

1. Mixed workload combining OLTP and analytic-style queries
   Column stores are best suited for analytic-style queries
   In-memory database enables fast tuple reconstruction
   In-memory column store allows aggregation on-the-fly
2. Sparse enterprise data
   Lightweight compression schemes are optimal
   Increases query execution speed
   Improves feasibility of in-memory database
3. Mostly read workload
   Read-optimized stores provide best throughput, i.e., a compressed in-memory column store
   Write-optimized store as delta partition to handle data changes is sufficient

[Figure: changed hardware; advances in data processing (software); complex enterprise applications; our focus]

An In-Memory Database for


105/130


In-Memory Database (IMDB)

Data resides permanently in main memory
Main memory is the primary persistence
Still: logging to disk / recovery from disk
Main memory access is the new bottleneck
Cache-conscious algorithms/data structures are crucial (locality is king)

[Figure: architecture at blade i. Interface services and session management; query execution, metadata, TA manager; distribution layer; main memory with active data in a main store and a differential store (columns and combined columns, connected by a merge) plus indexes (inverted, object data guide); non-volatile memory with log, snapshots, and passive data (history), used for recovery, logging, time travel, and data aging]

106/130

Simplified Application Development

[Figure: traditional vs. column-oriented stack; layers shown: application cache, database cache, prebuilt aggregates, raw data]

Fewer caches necessary
No redundant data (OLAP/OLTP, LiveCache)
No maintenance of materialized views or aggregates
Minimal index maintenance


107/130

Enterprise Application PoCs




108/130

SAP ERP Financials on In-Memory Technology

In-memory column database for an ERP system
Combined workload (parallel OLTP/OLAP queries)
Leverage in-memory capabilities to
  Reduce amount of data
  Aggregate on-the-fly
  Run analytic-style queries (to replace materialized views)
  Execute stored procedures
Use case: SAP ERP Financials solution
  Post and change documents
  Display open items
  Run dunning job
  Analytical queries, such as balance sheet


109/130

Current Financials Solutions


110/130

The Target Financials Solution

Only base tables, algorithms, and some indexes

111/130

Feasibility of Financials on In-Memory Technology in 2009

Modifications on SAP Financials
  Removed secondary indices, sum tables, and pre-calculated and materialized tables
  Reduced code complexity and simplified locks
  Insert-only to enable history (change document replacement)
  Added stored procedures with business functionality
European division of a retailer
  ERP 2005 ECC 6.0 EhP3
  5.5 TB system database size
  Financials:
    23 million headers / 8 GB in main memory
    252 million items / 50 GB in main memory (including inverted indices for join attributes and insert-only extension)

112/130

In-Memory Financials on SAP ERP

Accounting documents: BKPF, BSEG
Sum tables: LFC1, KNC1, GLT0
Secondary indices: BSAD, BSAK, BSAS, BSID, BSIK, BSIS
Dunning data: MHNK, MHND
Change documents: CDHDR, CDPOS

113/130

In-Memory Financials on SAP ERP

Accounting documents: BKPF, BSEG

114/130

Reduction by a Factor 10

                    DBMS        IMDB
BKPF                8.7 GB      1.5 GB
BSEG                255 GB      50 GB
Secondary indices   255 GB      -
Sum tables          0.55 GB     -
Complete            519.25 GB   51.5 GB
BKPF + BSEG         263.7 GB    51.5 GB

115/130

Booking an Accounting Document

Insert into BKPF and BSEG only
Lack of updates reduces locks


116/130

Dunning Run

Dunning run determines all open and due invoices
Customer-defined queries on 250M records
Current system: 20 min
New logic: 1.5 sec
  In-memory column store
  Parallelized stored procedures
  Simplified Financials


117/130

Why?

Being able to perform the dunning run in such a short time lowers TCO
Add more functionality!
Run other jobs in the meantime! In a multi-tenancy cloud setup, hardware must be used wisely

118/130

Bring Application Logic Closer to the Storage Layer

Select accounts to be dunned, for each:
  Select open account items from BSID, for each:
    Calculate due date
    Select dunning procedure, level and area
  Create MHNK entries
Create and write dunning item tables


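119/130

Bring Application Logic Closer to the Storage Layer

[Figure: the dunning steps, including "Create MHNK entries", annotated with 1 SELECT, 10000 SELECTs, 10000 SELECTs, 31000 entries]


120/130

Bring Application Logic Closer to the Storage Layer

[Figure: the same steps (1 SELECT, 10000 SELECTs, 10000 SELECTs, 31000 entries) replaced by one single stored procedure executed within the new DB]


121/130

Bring Application Logic Closer to the Storage Layer

[Figure: one single stored procedure executed within the new DB; MHNK entries are calculated on-the-fly]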

122/130

Factor: 800x acceleration
Quantity: 250 million items, 380k open, 200k due

#   Operation                          Version 1    Variant 2                           Variant 3
1   Select open items                  0.63 s       1.01 s (incl. T047 & KNB5 join)     0.6 s (incl. T047 & KNB5 join)
2   Due date, dunning level            27 s         Deferred to aggregation             0.5 s
3   Filter 1 (verify dunning levels)   ~19 s        1.1 s                               0.5 s
4   Filter 2 (check last dunning)      ~15 s        0.8 s                               0.4 s
5   Generate MHNK (aggregate)          Done in #1   1.2 s                               Done in #1
6   Generate MHND (execute filters)    Done in #1   140 ms                              Done in #1
    Total                              ~1 minute    ~3.0 s (#3, #4 exec. in parallel)   ~1.5 s (#2, #3, #4 exec. in parallel)

Total of the original version: ~20 minutes
Additional information: hardware with 4 CPUs x 6 cores, 256 GB RAM


123/130

Dunning Application

123

Dunning Application


124/130

Dunning Application

124

Available-to-Promise Check


125/130

Available-to-Promise Check

Can I get enough quantities of a requested product on a desired delivery date?
Goal: analyze and validate the potential of in-memory and highly parallel data processing for Available-to-Promise (ATP)
Challenges
  Dynamic aggregation
  Instant rescheduling in minutes vs. nightly batch runs
  Real-time and historical analytics
Outcome
  Real-time ATP checks without materialized views
  Ad-hoc rescheduling
  No materialized aggregates

126/130

In-Memory Available-to-Promise

127/130

Demand Planning

Flexible analysis of demand planning data
  Zooming to choose granularity
  Filter by certain products or customers
  Browse through time spans
Combination of location-based geo data with planning data in an in-memory database
External factors such as the temperature or the level of cloudiness can be overlaid to incorporate them in planning decisions

128/130

GORFID

HANA for Streaming Data Processing
Use case: In-Memory RFID Data Management
Evaluation of SAP OER
Prototypical implementation of:
  RFID Read Event Repository on HANA
  Discovery Service on HANA (10 billion data records with ca. 3 seconds response time)
  Frontends for iPhone, iPad2
Key findings:
  HANA is suited for streaming data (using bulk inserts)
  Analytics on streaming data is now possible

129/130

GORFID: Near Real-Time as a Concept

Bulk load every 2-3 seconds: > 50,000 inserts/s

Thanks!


130/130

Questions?

Online Lecture (starting Sept 3rd) http://openhpi.de
