
Page 1

Don’t Hold My Data Hostage

Mark Raasveldt, Hannes Mühleisen

A Case For Client Protocol Redesign


Page 2

What is a Client Protocol anyway?

▸ Protocol is how a client communicates with a server

▸ ODBC, JDBC, psql

▸ Every database that supports remote clients has a client protocol

▸ Using this protocol, clients can:

▸ Connect to the database

▸ Query it

▸ Receive the query results


Page 3

What is a Client Protocol anyway?

Page 4

Motivation

▸ Problem: Current protocols were designed for exporting small numbers of rows

▸ Displaying results on screen

▸ OLTP use cases

▸ Exporting aggregates

▸ Exporting large amounts of data using these protocols is slow

▸ Analytical tools (e.g. R/Python)

▸ Data export is a bottleneck when result sets are large!


Page 5

Motivation

▸ Cost of exporting the SF10 lineitem table from TPC-H (7.2GB in CSV format) on localhost

[Bar chart: wall clock time in seconds for the full export, broken down into connection, query execution, and result set serialization + transfer. Baseline: Netcat (10.25s). Systems shown: MySQL, DB2, DBMS X, PostgreSQL, MonetDB, MySQL+C, Hive, MongoDB, with total times ranging from roughly 101s to 687s.]


SELECT * FROM LINEITEM;

Page 6

Motivation

▸ We are not the first ones to notice this problem

▸ A lot of work on in-database processing, UDFs, etc.

▸ However, that work is database-specific, requires adapting existing workflows, and introduces safety issues

▸ Why is exporting large amounts of data from a database so inefficient?

▸ Can we make it more efficient?


Page 7

State of the Art Protocols

▸ We reverse engineered how different databases transfer the following table “on the wire”

▸ Source code/documentation

▸ Decompilation of JDBC Drivers

▸ Wireshark

From the accompanying paper: It is possible that they send explicit confirmation messages from the client to the server to indicate that the client is ready to receive the next batch of data. These messages are cheap with a low latency, but become very costly when the latency increases.

Contrary to our prediction, we find that the performance of all systems is heavily influenced by a high latency. This is because, while the server and client do not explicitly send confirmation messages to each other, the underlying TCP/IP layer does send acknowledgement messages when data is received [27]. TCP packets are sent once the underlying buffer fills up, resulting in an acknowledgement message. As a result, protocols that send more data trigger more acknowledgements and suffer more from a higher latency.

Throughput. Reducing the throughput of a connection adds a variable cost to sending messages depending on the size of the message. Restricted throughput means sending more bytes over the socket becomes more expensive. The more we restrict the throughput, the more protocols that send a lot of data are penalized.

In Figure 4, we can see the influence that lower throughputs have on the different protocols. When the bandwidth is reduced, protocols that send a lot of data start performing worse than protocols that send a lower amount of data. While the PostgreSQL protocol performs well with a high throughput, it starts performing significantly worse than the other protocols with a lower throughput. Meanwhile, we also observe that when the throughput decreases, compression becomes more effective. When the throughput is low, the actual data transfer is the main bottleneck and the cost of (de)compressing the data becomes less significant.

2.3 Result Set Serialization

In order to better understand the differences in time and transferred bytes between the different protocols, we have investigated their data serialization formats.

Table 2: Simple result set table.

INT32        VARCHAR10
100,000,000  OK
NULL         DPFKG

For each of the protocols, we show a hexadecimal representation of Table 2 encoded with each result set format. The bytes used for the actual data are colored green, while any overhead is colored white. For clarity, leading zeroes are colored gray.

[Figure 5: PostgreSQL result set wire format — hex dump of the two rows of Table 2, annotated with message type, total length, field count, and per-field lengths and data.]

PostgreSQL. Figure 5 shows the result set serialization of the widely used PostgreSQL protocol. In the PostgreSQL result set, every single row is transferred in a separate protocol message [35]. Each row includes a total length, the number of fields, and for each field its length (-1 if the value is NULL) followed by the data. We can see that for this result set, the amount of per-row metadata is greater than the actual data in terms of bytes. Furthermore, a lot of information is repetitive and redundant. For example, the number of fields is expected to be constant for an entire result set. Also, from the result set header that precedes these messages, the number of rows in the result set is known, which makes the message type marker redundant. This large amount of redundant information explains why PostgreSQL's client protocol requires so many bytes to transfer the result set in the experiment shown in Table 1. On the other hand, the simplicity of the protocol results in low serialization and deserialization costs. This is reflected in its quick transfer time if the network connection is not a bottleneck.
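To make the per-row overhead concrete, here is a minimal Python sketch of assembling one DataRow ('D') message. The layout (type marker, int32 total length, int16 field count, per-field int32 length with -1 for NULL) follows the documented PostgreSQL wire format; the helper name and the text rendering of the integer are our own illustration:

```python
import struct

def pg_data_row(fields):
    """Encode one row as a PostgreSQL DataRow ('D') message.

    fields: list of bytes (already-serialized values) or None for NULL.
    """
    body = struct.pack("!H", len(fields))             # int16 field count
    for value in fields:
        if value is None:
            body += struct.pack("!i", -1)             # length -1 marks NULL
        else:
            body += struct.pack("!i", len(value)) + value
    # 'D' type marker, then int32 total length that includes itself
    return b"D" + struct.pack("!i", len(body) + 4) + body

# The two rows of Table 2 (the integer is shown in text form for simplicity)
row1 = pg_data_row([b"100000000", b"OK"])
row2 = pg_data_row([None, b"DPFKG"])
```

Even in this tiny example, the marker, lengths, and field count add more bytes per row than the values themselves.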

[Figure 6: MySQL text result set wire format — hex dump of the same rows, annotated with data length, packet number, and per-field lengths and data.]

MySQL. Figure 6 shows MySQL/MariaDB's protocol encoding of the sample result set. The protocol uses binary encoding for metadata, and text for actual field data. The number of fields in a row is constant and defined in the result set header. Each row starts with a three-byte data length. Then, a packet sequence number (0-255, wrapping around) is sent. This is followed by length-prefixed field data. Field lengths are encoded as variable-length integers. NULL values are encoded with a special field length, 0xFB. Field data is transferred in ASCII format. The sequence number is redundant here, as the underlying TCP/Unix socket already guarantees that packets arrive in the same order in which they were sent.
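A matching sketch for one MySQL text-protocol row packet; the 3-byte length plus sequence-number header, the length-encoded integers, and the 0xFB NULL marker follow the documented format, while the helper names are ours:

```python
import struct

def lenenc(n):
    """MySQL length-encoded integer: one byte below 251, longer forms above."""
    if n < 0xFB:
        return bytes([n])
    if n < 1 << 16:
        return b"\xfc" + struct.pack("<H", n)
    if n < 1 << 24:
        return b"\xfd" + struct.pack("<I", n)[:3]
    return b"\xfe" + struct.pack("<Q", n)

def mysql_text_row(fields, seq):
    """Encode one row of a MySQL text result set.

    fields: list of str, or None for NULL (encoded as the 0xFB marker).
    """
    payload = b""
    for value in fields:
        if value is None:
            payload += b"\xfb"
        else:
            data = value.encode("ascii")
            payload += lenenc(len(data)) + data
    # packet header: 3-byte little-endian payload length + 1-byte sequence nr.
    header = struct.pack("<I", len(payload))[:3] + bytes([seq & 0xFF])
    return header + payload

row1 = mysql_text_row(["100000000", "OK"], seq=2)
row2 = mysql_text_row([None, "DPFKG"], seq=3)
```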

[Figure 7: DBMS X result set wire format — hex dump of the same rows, annotated with packet header and per-field lengths and data.]

DBMS X has a very terse protocol. However, it is much more computationally heavy than the protocol used by PostgreSQL. Each row is prefixed by a packet header, followed by the values. Every value is prefixed by its length in bytes. This length, however, is transferred as a variable-length integer. As a result, the length field is only a single byte for small lengths. For NULL values, the length field is 0 and no actual value is transferred. Numeric values are also encoded using a custom format. On a lower layer, DBMS X uses a fixed network message length for batch transfers. This message length is configurable and, according to the documentation, considerably influences performance. We have set it to the largest allowed value, which gave the best performance in our experiments.
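Since DBMS X is anonymized, its exact encoding is not public; as an illustration of the general idea, here is one common variable-length integer scheme (LEB128-style, not necessarily what DBMS X uses), in which small lengths cost a single byte:

```python
def encode_varint(n):
    """Unsigned LEB128-style varint: 7 data bits per byte, high bit set
    on every byte except the last."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

assert encode_varint(5) == b"\x05"        # one byte for small lengths
assert encode_varint(300) == b"\xac\x02"  # two bytes once above 127
```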


Page 8

State of the Art Protocols

▸ Significant per-row overhead

▸ Used by many other systems:

▸ Redshift, HyPer, Greenplum and Vertica

PostgreSQL


Page 9

State of the Art Protocols

▸ ASCII protocol

▸ Every value has a length field

▸ Supports compression with GZIP


MySQL


Page 10

State of the Art Protocols

▸ Columnar protocol

▸ Uses generic Thrift serialisation library

▸ One byte per value in NULL mask

▸ Also used by SparkSQL

Hive

[Figure: Hive/Thrift columnar result set wire format — hex dump annotated with per-column field-begin markers, a start-row offset and column count, one-byte-per-value NULL masks, list-begin markers, and the column data.]
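A minimal sketch of the costly detail called out above: a columnar batch that spends a whole byte per value on its NULL mask. This is hand-rolled for illustration rather than real Thrift, and the framing is assumed, not Hive's exact layout:

```python
import struct

def columnar_batch(int_col):
    """Sketch: one integer column with a one-byte-per-value NULL mask.

    int_col: list of int or None.
    """
    nulls = bytes(1 if v is None else 0 for v in int_col)  # 1 byte per value
    data = b"".join(struct.pack("<i", 0 if v is None else v)
                    for v in int_col)
    return struct.pack("<I", len(int_col)) + nulls + data  # count, mask, data

batch = columnar_batch([100_000_000, None])
```

A bitmask, as in the PostgreSQL prototype later in the deck, would cost one eighth as much.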


Page 11

Protocol Design Space

▸ Implemented prototypes in PostgreSQL and MonetDB

▸ Serialisation Format (ASCII, Custom Binary, Generic)

▸ Custom Binary

▸ Row Major or Column Major

▸ Column-Major (but chunked)

▸ Data Compression Methods

▸ No compression locally, stream compression on remote connections (see the sketch after this list)

▸ Null Handling

▸ Close to native formats
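A sketch of the compression rule from the list above, with zlib standing in for whatever lightweight codec (e.g. Snappy, as in the Netcat+Sy baseline) a real implementation would choose; the 4-byte framing is an assumption for the sketch:

```python
import zlib

def send_chunk(sock, chunk, remote):
    """Send one serialized column chunk; compress only on remote links,
    where the network rather than the CPU is the bottleneck."""
    if remote:
        chunk = zlib.compress(chunk, 1)  # cheap, low-ratio setting
    sock.sendall(len(chunk).to_bytes(4, "little") + chunk)
```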


Page 12

Proposed Protocol in MonetDB

[Figure: proposed MonetDB result set wire format — hex dump annotated with a row count prefix, the INT32 column data, and the length-prefixed VARCHAR column data.]

▸ Columnar, 1MB chunks prefixed with row count

▸ Missing values stored as special values in domain

▸ Variable-length columns prefixed with their length
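A rough Python rendering of this chunk layout, under stated assumptions (little-endian integers, NULLs mapped to an in-domain sentinel, NUL-terminated strings); it mirrors the figure above but is not MonetDB's actual code:

```python
import struct

NULL_INT32 = -2**31  # assumed sentinel for a missing INT32 value

def monetdb_like_chunk(int_col, str_col):
    """Sketch: one columnar chunk prefixed with its row count. Fixed-width
    columns are raw arrays; variable-length columns carry a byte-length
    prefix so the client can slice them without scanning."""
    assert len(int_col) == len(str_col)
    ints = b"".join(struct.pack("<i", NULL_INT32 if v is None else v)
                    for v in int_col)
    strs = b"".join(s.encode() + b"\x00" for s in str_col)
    return (struct.pack("<I", len(int_col))         # row count
            + ints                                  # fixed-width column
            + struct.pack("<I", len(strs)) + strs)  # length-prefixed varchars

chunk = monetdb_like_chunk([100_000_000, None], ["OK", "DPFKG"])
```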


Page 13

Proposed Protocol in PostgreSQL

[Figure: proposed PostgreSQL result set wire format — hex dump annotated with a row count, a per-column NULL mask and column length where present, and the column data.]

▸ Missing values (PostgreSQL)

▸ NULL bitmask for each column, 1 bit per value

▸ Only add mask if column has missing values

▸ Columns with mask have a column-length as well
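A sketch of that NULL handling, with the exact header layout assumed for illustration; the mask costs one bit per value and is emitted only when the column actually contains NULLs:

```python
import struct

def column_with_null_mask(values):
    """Sketch: serialize an INT32 column with an optional NULL bitmask."""
    has_nulls = any(v is None for v in values)
    out = struct.pack("<B", 1 if has_nulls else 0)   # mask-present flag
    data = b"".join(struct.pack("<i", 0 if v is None else v)
                    for v in values)
    if has_nulls:
        mask = bytearray((len(values) + 7) // 8)     # 1 bit per value
        for i, v in enumerate(values):
            if v is None:
                mask[i // 8] |= 1 << (i % 8)
        # columns that carry a mask also carry an explicit column length
        out += struct.pack("<I", len(mask) + len(data)) + bytes(mask)
    return out + data

col = column_with_null_mask([100_000_000, None])
```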


Page 14

Benchmarks

▸ Three different network configurations

▸ Localhost: No network restrictions

▸ LAN: 1000 Mb/s throughput, 0.3ms latency

▸ WAN: 100 Mb/s throughput, 25ms latency

▸ Lineitem: SF10, 60 million rows, 16 columns, 7.2GB in CSV format

▸ 1 hour timeout


Page 15

Benchmarks: Localhost (no network restrictions)

[Plot: result set size in GB (2-10) versus wall clock time in seconds (log scale, 10-1000) for Netcat, Netcat+Sy, Netcat+GZ, MonetDB++, MonetDB++C, Postgres++, Postgres++C, MySQL, MySQL+C, PostgreSQL, DB2, DBMS X, Hive, and MonetDB.]


Page 16

Benchmarks: LAN (1000 Mb/s throughput, 0.3ms latency)

[Plot: result set size in GB (2-10) versus wall clock time in seconds (log scale, 10-1000) for Netcat, Netcat+Sy, Netcat+GZ, MonetDB++, MonetDB++C, Postgres++, Postgres++C, MySQL, MySQL+C, PostgreSQL, DB2, DBMS X, Hive, and MonetDB.]


Page 17

Benchmarks: WAN (100 Mb/s throughput, 25ms latency)

[Plot: result set size in GB (2-10) versus wall clock time in seconds (log scale, 10-1000) for Netcat, Netcat+Sy, Netcat+GZ, MonetDB++, MonetDB++C, Postgres++, Postgres++C, MySQL, MySQL+C, PostgreSQL, and MonetDB.]


Page 18

Conclusions

▸ Exporting data from a database does not have to be so inefficient

▸ State of the art database protocols can be improved for this use case

▸ We show this by implementing prototypes in two databases (MonetDB and PostgreSQL)

▸ Avoid per-row overhead, bulk transfer

▸ Stay close to database native formats

▸ Avoid unnecessary copying and conversion

▸ Lightweight compression on remote

▸ The MonetDB implementation has already been released.

▸ Benchmark information: https://goo.gl/usjfyJ
