transactional sql in apache hive

Transactional Operations in HiveEugene Koifman

June 2017

2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda

Motivations/Goals

End user point of view

Design

Performance Improvements/Results

Roadmap


Motivations

Modifying existing data– INSERT OVERWRITE TABLE Target SELECT * FROM Target WHERE …

• Delete – OK, Update - ?• Concurrency

– Hope for the best (multiple updates)

– ZooKeeper lock manager S/X locks – restrictive

• Expensive to do repeatedly (write side)


Motivations

Continuously adding new data to Hive in the past– ALTER TABLE Target ADD PARTITION (dt=‘2016-06-30’)

• Lots of files – bad for performance

• Fewer files –users wait longer to see latest data– INSERT INTO Target as SELECT FROM Staging


Merge Statement – SQL Standard 2011 (Hive 2.2)

ID State County Value

1 CA LA 19.0

2 MA Norfolk 15.0

7 MA Suffolk 50.15

16 CA Orange 9.1

ID State Value

1 20.0

7 80.0

100 NH 6.0

MERGE INTO TARGET T

USING SOURCE S ON T.ID=S.ID

WHEN MATCHED THEN

UPDATE SET T.Value=S.Value

WHEN NOT MATCHED

INSERT (ID,State,Value)

VALUES(S.ID, S.State, S.Value)


1 CA LA 20.0

2 MA Norfolk 15.0

7 MA Suffolk 80.0

16 CA Orange 9.1

100 NH null 6.0


Goals

Make above use cases easy and efficient

Key Requirement– Long running analytics queries should run concurrently with update commands

NOT OLTP!!!– Support slowly changing tables

– Not for 100s of concurrent queries trying to update the same partition


System at High Level

A new type of table that supports Insert/Update/Delete/Merge SQL operations

Concept of ACID transaction– Atomic, Consistent, Isolated, Durable

Streaming Ingest API– Write a continuous stream of events to Hive in micro batches with transactional semantics


User Point of View

CREATE TABLE T(a int, b int) CLUSTERED BY (b) INTO 8 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');

Not all tables support transactional semantics

Table must be bucketed

Table cannot be sorted

Currently requires ORC File but anything implementing format – AcidInputFormat/AcidOutputFormat

autoCommit=true

Transactions run at Snapshot Isolation– Lock in the state of the DB as of the start of the query for the duration of the query

– Between Serializable and Repeatable Read


Design


Design

Transaction Manager– Begin transaction and obtain a transaction ID

Storage layer enhanced to support MVCC architecture– Each row is tagged with unique ROW_ID (internal)

– Multiple versions of each row to allow concurrent readers and writers

– Result of each write is stored in a new Delta file


How ACID is implemented in Hive?

CREATE TABLE acidtbl (a INT, b STRING) CLUSTERED BY (a) INTO 1 BUCKETS STORED AS ORC TBLPROPERTIES ('transactional'='true');

ACID Metadata Columns original_transaction_idbucket_idrow_idcurrent_transaction_id

User Columns col_1: a : INT

col_2:b : STRING

ACID_PK



INSERT INTO acidtbl (a,b) VALUES (100, “foo”), (200, “xyz”), (300, “bee”);

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

delta_00001_00001/bucket_0000



UPDATE acidTbl SET b = “bar” where a = 300;

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

ACID_PK a b

{ 1, 0, 2 } 300 “bar”

delta_00001_00001/bucket_0000

delta_00002_00002/bucket_0000



DELETE FROM acidTbl where a = 200;

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

ACID_PK a b

{ 1, 0, 2 } 300 “bar”

ACID_PK a b

{ 1, 0, 1 } null null

delta_00001_00001/bucket_0000

delta_00002_00002/bucket_0000 delta_00003_00003/bucket_0000



SELECT * FROM acidtbl;

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

ACID_PK a b

{ 1, 0, 2 } 300 “bar”

ACID_PK a b

{ 1, 0, 1 } null null

delta_00001_00001/bucket_0000 delta_00002_00002/bucket_0000 delta_00003_00003/bucket_0000

{ 1, 0, 0 } 100 “foo” 100 “foo”{ 1, 0, 1 } 200 “xyz”{ 1, 0, 1 } null null{ 1, 0, 2 } 300 “bee”{ 1, 0, 2 } 300 “bar”

300 “bar”


Design - Compactor

More operations = more delta files – make reads more expensive


Design - Compactor

ALTER TABLE acidTbl COMPACT ‘MAJOR’;

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

ACID_PK a b

{ 1, 0, 2 } 300 “bar”

ACID_PK a b

{ 1, 0, 1 } null nulldelta_00001_00001/bucket_0000

delta_00002_00002/bucket_0000

delta_00003_00003/bucket_0000

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 2 } 200 “bar”

base_00003/bucket_0000


Design - Compactor

Compactor rewrites the table in the background– Minor compaction - merges delta files into fewer deltas

– Major compactor merges deltas with base - more expensive

– This amortizes the cost of updates and self tunes the tables

• Makes ORC more efficient - larger stripes, better compression

Compaction can be triggered automatically or on demand– There are various configuration options to control when the process kicks in.

– Compaction itself is a Map-Reduce job

Key design principle is that compactor does not affect readers/writers

Cleaner process – removes obsolete files


Design - Concurrency

Transaction Manager– manages transaction ID assignment

– keeps track of transaction state: open, committed, aborted

Lock Manager– DDL operations acquire eXclusive locks

– Read operations acquire Shared locks

– Also locks non transactional tables – different logic

• hive.txn.strict.locking.mode

State of both persisted in Hive Metastore


Design - Concurrency

Write Set tracking to prevent Write-Write conflicts in concurrent transactions

Note that 2 Inserts are never in conflict since Hive does not enforce unique constraints.

You are allowed to read acid and non-acid tables in same query.

You cannot write to acid and non-acid tables at the same time (multi-insert statement)


Design - Streaming Ingest

Allows you to continuously write events to a hive table– Can commit periodically to make writes durable/visible

– Can also call abort to make writes since last commit/abort invisible.

– Optimized so that it can handle writing micro batches of events - every second.

• Multiple transactions are written to one file

– Only supports adding new data

Streaming tools like NiFi, Storm and Flume rely on this API to ingest data into hive

This API is public so it can be used directly

Data written via Streaming API has the same transactional semantics as SQL side


Merge Statement – SQL Standard 2011 (Hive 2.2)


1 CA LA 19.0

2 MA Norfolk 15.0

7 MA Suffolk 50.15

16 CA Orange 9.1

ID State Value

1 20.0

7 80.0

100 NH 6.0

MERGE INTO TARGET T

USING SOURCE S ON T.ID=S.ID

WHEN MATCHED THEN

UPDATE SET T.Value=S.Value

WHEN NOT MATCHED

INSERT (ID,State,Value)

VALUES(S.ID, S.State, S.Value)


1 CA LA 20.0

2 MA Norfolk 15.0

7 MA Suffolk 80.0

16 CA Orange 9.1

100 NH null 6.0


SQL Merge

Target

Source

ACID_PK ID State

County

Value

{ 1, 0, 1 } 1 CA LA 20.0

{ 1, 0, 3 } 7 MA Suffolk 80.0

ACID_PK ID State County

Value

{ 2, 0, 1 } 100 NH 6.0

delta_00002_00002/bucket_0000

delta_00002_00002_001/bucket_0000

Right Outer Join

ON T.ID=S.ID


W/o MERGE – much less efficient

UPDATE Target set Value= 20.0 where ID = 1;

UPDATE Target set Value = 80.0 where ID = 7;

INSERT INTO Target (ID, State, Value) VALUES(100, ‘NH’, 6.0);


Performance


Work-In-Progress

Split an update into combination of delete and insert

UPDATE acidTbl SET b = “bar” where a = 300;

ACID_PK a b

{ 1, 0, 0 } 100 “foo”

{ 1, 0, 1 } 200 “xyz”

{ 1, 0, 2 } 300 “bee”

ACID_PK a b

{ 1, 0, 2 } 300 “bar”

delta_00001_00001/bucket_0000

delta_00002_00002/bucket_0000

ACID_PK a b

{ 2, 0, 0 } 300 “bar”

ACID_PK a b

{ 1, 0, 2 } null null

delta_00002_00002/bucket_0000 delete_delta_00002_00002/bucket_0000

Enabled PPD

Splits for Delta files


Benefits

Improved PPD

Better Network Utilization

Better Memory Utilization

Full Vectorization of Reads

Updating bucket/partition columns


Performance

TPC-H Benchmark– 10 node cluster at Scale Factor 1000 (1 TB of data)

– 11 delta files with 90 GB data each


Future Work

Multi statement transactions, i.e. BEGIN TRANSACTION/COMMIT/ROLLBACK

Performance– Smarter Compaction

Finer grained concurrency management/conflict detection

Read Committed w/Lock Based scheduling

Better Monitoring/Alerting

LOAD DATA … support

Optional bucketing

SMB support – user defined sort order

transactional sql in apache hive

Technology