Implementing Change Systems in SQL Server 2016

Change Systems Critical Component Series

Upload: douglas-mcclurg

Post on 21-Mar-2017



Doug McClurg, Founder

[email protected]

Data systems engineered to last.

The goal of this series is to give you the tools you need to push analytics forward at your company.

• Explain the nature and importance of change systems in an overall data platform

• Compare and contrast traditional and modern data warehouse architectures

• Discuss a key technology that is core to change systems in the enterprise

• Compare the SQL Server features that enable robust change data capture

Change Systems: Agenda

[Diagram labels: Database Engine; MDF; LDF]

Overview: The Source of Change

• A database engine manages files.
  • Data structures
  • Transaction logs

• Change systems accurately track modifications inside data structures.

• The source of record for change is the transaction log. Using this log directly is a characteristic of passive change systems.

• Active change systems watch the data structure and record observable change.
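As a rough illustration of the active approach, the sketch below uses Python's built-in sqlite3 module rather than SQL Server. The table names Account and AccountChange and the trigger name are hypothetical. A trigger watches the data structure and records each observable update in a change table. A passive system would read the engine's transaction log instead, which this sketch does not attempt.

```python
import sqlite3


def build_demo():
    """Create a base table plus a trigger-fed change table (hypothetical names)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE Account (
            AccountID      INTEGER PRIMARY KEY,
            AccountBalance TEXT NOT NULL
        );
        CREATE TABLE AccountChange (
            ChangeID   INTEGER PRIMARY KEY AUTOINCREMENT,
            AccountID  INTEGER NOT NULL,
            Operation  TEXT NOT NULL,
            OldBalance TEXT,
            NewBalance TEXT
        );
        -- The "active" part: the trigger observes every update on Account.
        CREATE TRIGGER trg_account_update AFTER UPDATE ON Account
        BEGIN
            INSERT INTO AccountChange (AccountID, Operation, OldBalance, NewBalance)
            VALUES (OLD.AccountID, 'UPDATE', OLD.AccountBalance, NEW.AccountBalance);
        END;
        """
    )
    return conn


conn = build_demo()
conn.execute("INSERT INTO Account VALUES (4624572, '5001.01')")
conn.execute(
    "UPDATE Account SET AccountBalance = '5768.01' WHERE AccountID = 4624572"
)
changes = conn.execute(
    "SELECT AccountID, Operation, OldBalance, NewBalance FROM AccountChange"
).fetchall()
print(changes)
```

The trade-off this hints at: the trigger runs inside the writing transaction, so the source system pays for the tracking, whereas a log reader works after the fact.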

Overview: Modeling Change

Table

AccountID  CustomerID  AccountBalance  ModifyDate
4568456    2342        1234758.23      2017-03-11 04:11:05
4624572    9875        5768.01         2017-03-11 04:13:15
4745733    8735        478893.33       2017-03-11 04:13:01

Log

AccountID  CustomerID  Type        Amount      EventDate
4568456    2342        Deposit     1198575.32  2017-03-08 09:09:04
4624572    9875        Deposit     4438.70     2017-03-08 09:10:01
4745733    8735        Deposit     460436.02   2017-03-07 10:13:20
4568456    2342        Deposit     528.11      2017-03-08 06:13:45
4624572    9875        Deposit     1345.23     2017-03-09 10:22:25
4745733    8735        Deposit     635.20      2017-03-08 11:13:01
4568456    2342        Withdrawal  23.21       2017-03-09 12:12:02
4624572    9875        Fee         21.34       2017-03-09 06:13:45
4745733    8735        Withdrawal  42.66       2017-03-10 13:13:12
4568456    2342        Transfer    35678.01    2017-03-11 04:11:05
4624572    9875        Deposit     5.42        2017-03-11 04:13:15
4745733    8735        Deposit     17864.77    2017-03-11 04:13:01

Table = Log*

*Record the CRUD operations to the table and you get a changelog.

The duality is that a table supports data at rest and a log captures change. If you have a log, you can recreate not only the original table but also a myriad of other derived tables. Logs therefore seem to be the more fundamental data structure.
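The table/log duality can be checked against the slide's own numbers. The sketch below replays the log rows to rebuild the account table. Treating "Transfer" as an incoming amount is an assumption, chosen because it makes the replayed balances match the table; the function name replay is hypothetical.

```python
from decimal import Decimal

# Sign applied per event type. "Transfer" as +1 is an assumption that
# happens to reproduce the balances shown in the table.
SIGN = {"Deposit": 1, "Transfer": 1, "Withdrawal": -1, "Fee": -1}

LOG = [
    (4568456, 2342, "Deposit", "1198575.32", "2017-03-08 09:09:04"),
    (4624572, 9875, "Deposit", "4438.70", "2017-03-08 09:10:01"),
    (4745733, 8735, "Deposit", "460436.02", "2017-03-07 10:13:20"),
    (4568456, 2342, "Deposit", "528.11", "2017-03-08 06:13:45"),
    (4624572, 9875, "Deposit", "1345.23", "2017-03-09 10:22:25"),
    (4745733, 8735, "Deposit", "635.20", "2017-03-08 11:13:01"),
    (4568456, 2342, "Withdrawal", "23.21", "2017-03-09 12:12:02"),
    (4624572, 9875, "Fee", "21.34", "2017-03-09 06:13:45"),
    (4745733, 8735, "Withdrawal", "42.66", "2017-03-10 13:13:12"),
    (4568456, 2342, "Transfer", "35678.01", "2017-03-11 04:11:05"),
    (4624572, 9875, "Deposit", "5.42", "2017-03-11 04:13:15"),
    (4745733, 8735, "Deposit", "17864.77", "2017-03-11 04:13:01"),
]


def replay(log):
    """Fold the log into one row per account: balance plus latest event time."""
    table = {}
    for account_id, customer_id, kind, amount, event_date in log:
        row = table.setdefault(
            account_id,
            {
                "CustomerID": customer_id,
                "AccountBalance": Decimal("0"),
                "ModifyDate": event_date,
            },
        )
        row["AccountBalance"] += SIGN[kind] * Decimal(amount)
        # ISO-formatted timestamps sort correctly as plain strings.
        row["ModifyDate"] = max(row["ModifyDate"], event_date)
    return table


table = replay(LOG)
for account_id, row in sorted(table.items()):
    print(account_id, row["CustomerID"], row["AccountBalance"], row["ModifyDate"])
```

The same log could just as easily be folded into other derived tables, such as total fees per customer or deposits per day, which is the point of the duality.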

Overview: Modeling Change

Valid Time

John Doe, who lived in Flat Rock, NC, made his first visit to us on April 1st, 1985 and changed his permanent address during a sale on November 12th, 2005.

Name      Address                                   ValidFrom            ValidTo
John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  1985-04-01 10:00:00  2005-11-12 09:05:00
John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-12 09:06:01  9999-12-31 23:59:59

Transaction Time

Our data warehouse went live on November 1st, 2005. The ETL runs daily at 4 AM.

Name      Address                                   CreateDate           ExpireDate
John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  2005-11-01 09:25:11  2005-11-13 04:54:11
John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-13 04:54:12  9999-12-31 23:59:59
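To make the difference between the two timelines concrete, here is a minimal sketch that asks the same "as of" question of each. The helper names ts and as_of are hypothetical, and the open-ended end date is written as 23:59:59 so that it parses as a real timestamp.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
FLAT_ROCK = "81 Carl Sandberg Ln, Flat Rock, NC 28731"
RALEIGH = "9433 Collingdale Way, Raleigh, NC 27617"

# (address, start, end) per timeline, copied from the two tables above.
VALID_TIME = [
    (FLAT_ROCK, "1985-04-01 10:00:00", "2005-11-12 09:05:00"),
    (RALEIGH, "2005-11-12 09:06:01", "9999-12-31 23:59:59"),
]
TRANSACTION_TIME = [
    (FLAT_ROCK, "2005-11-01 09:25:11", "2005-11-13 04:54:11"),
    (RALEIGH, "2005-11-13 04:54:12", "9999-12-31 23:59:59"),
]


def ts(text):
    """Parse a timestamp in the slide's format."""
    return datetime.strptime(text, FMT)


def as_of(rows, moment):
    """Return the address whose interval contains the moment, else None."""
    for address, start, end in rows:
        if ts(start) <= moment <= ts(end):
            return address
    return None


noon = ts("2005-11-12 12:00:00")
print("Where he actually lived:   ", as_of(VALID_TIME, noon))
print("What the warehouse showed: ", as_of(TRANSACTION_TIME, noon))
```

At noon on the day of the move the two timelines disagree: valid time already points to Raleigh, while the warehouse keeps showing Flat Rock until the next ETL run.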

Overview: Modeling Change

Source

ID     Name      Address                                   ModifyDate
12345  John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  1985-04-01 10:00:00
12345  John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-12 09:06:01

ETL

Target: SCD 2 Dimension

Key  ID     Name      Address                                   ValidFrom            ValidTo              CreateDate           ExpireDate
1    12345  John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  1985-04-01 10:00:00  2005-11-12 09:06:00  2005-11-01 09:25:11  2005-11-13 04:54:11
2    12345  John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-12 09:06:01  9999-12-31 23:59:59  2005-11-13 04:54:12  9999-12-31 23:59:59

[Diagram callouts: "This column creates risk"; "Latency of 1 Day at best"]
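The load from source to SCD 2 dimension can be sketched as "expire the current row, append a new one". This is a minimal illustration, not the author's ETL: the function apply_change is hypothetical, and the one-second gap between intervals is inferred from the timestamps on the slide. The sketch trusts the source's ModifyDate completely, which is one plausible reading of the "this column creates risk" callout.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"
END_OF_TIME = datetime(9999, 12, 31, 23, 59, 59)
ONE_SECOND = timedelta(seconds=1)


def apply_change(dimension, source_row, load_time):
    """Close the current version for this ID (if any), then add the new one."""
    modify_date = datetime.strptime(source_row["ModifyDate"], FMT)
    for row in dimension:
        is_current = row["ExpireDate"] == END_OF_TIME
        if row["ID"] == source_row["ID"] and is_current:
            row["ValidTo"] = modify_date - ONE_SECOND    # valid time ends
            row["ExpireDate"] = load_time - ONE_SECOND   # transaction time ends
    dimension.append(
        {
            "Key": len(dimension) + 1,  # simple surrogate key
            "ID": source_row["ID"],
            "Name": source_row["Name"],
            "Address": source_row["Address"],
            "ValidFrom": modify_date,
            "ValidTo": END_OF_TIME,
            "CreateDate": load_time,
            "ExpireDate": END_OF_TIME,
        }
    )
    return dimension


dimension = []
apply_change(
    dimension,
    {"ID": 12345, "Name": "John Doe",
     "Address": "81 Carl Sandberg Ln, Flat Rock, NC 28731",
     "ModifyDate": "1985-04-01 10:00:00"},
    load_time=datetime(2005, 11, 1, 9, 25, 11),
)
apply_change(
    dimension,
    {"ID": 12345, "Name": "John Doe",
     "Address": "9433 Collingdale Way, Raleigh, NC 27617",
     "ModifyDate": "2005-11-12 09:06:01"},
    load_time=datetime(2005, 11, 13, 4, 54, 12),
)
for row in dimension:
    print(row["Key"], row["Address"], row["ValidFrom"], row["ValidTo"],
          row["CreateDate"], row["ExpireDate"])
```

Because the second load only runs on November 13th, the dimension is wrong for most of a day, which matches the latency callout.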

[Diagram labels: Application Database; SQL; DB2; SQL; Enterprise Data Warehouse; Mart; Mart; Batch ETL Jobs; Storage and Query]

Traditional Architecture: The Pull Method

[Diagram labels repeated from the previous slide]

Traditional Architecture: Focus on the Source

Focus Area One

Friction & Frustration

Data Quality
• Timeliness
  • Latency of change
  • Latency of build
• Consistency
  • Redundant ETL
• Accuracy
  • Filters
  • Logic
  • Source

Lead Time
• Custom ETL
• Manual ETL
• Business case and ceremony
• Domain knowledge

Dependencies
• Business logic
• Redundancy
• Downstream effects
• Team

[Diagram labels: Collect and Route; Archive; Events; Stream; Query | Model | Automate]

Modern Architecture: The Push Method (Lambda)

[Diagram labels: Speed Layer; Batch Layer; Serving Layer; Real-time Views; Batch Views; Events; Stream; Query | Model | Automate]

Modern Architecture: The Push Method (Kappa)

[Diagram labels: Unified Log Storage; Archive; Collect; Derive]

Modern Architecture: The Fungibility of Data

LOG

• Ingest (don’t extract) disparate silos of data

• Store data in its atomic form (no transform)

• Collect changes as if they were events (immutable)

• Run downstream ETL more often (process less data each cycle)

Modern Architecture: Lessons Learned
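The last two lessons, immutable change events and frequent small ETL cycles, fit together naturally. The sketch below is one way to picture it, with a hypothetical Event record and a sequence-number watermark: each cycle reads only the events appended since the previous cycle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)  # frozen: an event cannot be edited once collected
class Event:
    seq: int          # hypothetical, monotonically increasing sequence number
    account_id: int
    kind: str
    amount: str


def micro_batch(events, watermark):
    """Return the events newer than the watermark and the advanced watermark."""
    batch = [e for e in events if e.seq > watermark]
    new_watermark = max((e.seq for e in batch), default=watermark)
    return batch, new_watermark


# Append-only collection of changes, using amounts from the earlier log slide.
events = [
    Event(1, 4568456, "Deposit", "528.11"),
    Event(2, 4624572, "Fee", "21.34"),
]

first_batch, watermark = micro_batch(events, watermark=0)
events.append(Event(3, 4745733, "Withdrawal", "42.66"))
second_batch, watermark = micro_batch(events, watermark)
empty_batch, watermark = micro_batch(events, watermark)

print(len(first_batch), len(second_batch), len(empty_batch), watermark)
```

Running the cycle more often shrinks each batch, so the downstream transform touches less data per run and the worst-case latency drops with it.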

[Diagram labels: Mart; ETL; Application Database; Enriched Source; Mirror Layer; Mart; Storage and Query; Micro-Batch ETL Jobs]

Modern Architecture: Phase 1

Homogenize, Protect, and Standardize

[Diagram legend: icon = database transaction log]

[Diagram labels: Mirror Layer; Analytical Model; Temporary Staging; Source]

Why Have a Mirror Layer?

1. Improve the data structure of a source system (add primary keys, indexes)

2. Hide complexity related to the type of source system (SQL, API, Mainframe)

3. Improve the quality and performance of change tracking

4. Enable data governance programs by homogenizing sources

5. Enable prototyping of new automation solutions without developer support

Risks/Assumptions

This layer must be real-time and simple, close to the metal. The more it looks like another ETL layer, the more the risks will outweigh the benefits.

[Diagram labels: Transform Near Real-time; Intensive Transform]

Mirror Layer: Overview

But all I read is hate for replication on the internets!

Mirror Layer: Replication in Production

[Diagram labels: SaleTransaction; CustomerProfile; Source Database Server; T-LOG; T-LOG; Pub; Sub; Article; Article; Push; Dist; cmd]

• Set up everything in a lower environment and replay production activity to get an idea of load.

• The source database is placed into an Always On availability group so that the database and replication can fail over.

• Distributor and subscriber are moved to their own failover cluster.

• Subscribers connect to an availability group listener so they can find the right server after a failover.

• Database and log backups are still taken regularly to support disaster recovery, but additional preparations are made to enable a smooth restore of replication.

Mirror Layer Demo: Features of SQL Server

Base Table

AccountID  CustomerID  AccountBalance  ModifyDate
4568456    2342        1234758.23      2017-03-11 04:11:05
4624572    9875        5768.01         2017-03-11 04:13:15

Change Table (Internal)

AccountID  Operation  Columns
4568456    INSERT
4624572    UPDATE     AccountBalance
4745733    DELETE

History Table

AccountID  CustomerID  AccountBalance  ModifyDate           CreateDate           ExpireDate
4624572    9875        5001.01         2017-03-10 06:19:01  2017-03-10 06:20:35  2017-03-11 04:14:22
4745733    8735        478893.33       2017-03-11 04:13:01  2017-03-11 04:14:59  2017-03-12 09:01:12

Change Tracking
• Net changes only
• No data
• Internal tables
• Internal functions
• Retention period only

Temporal Tables
• Net changes not automatic
• Data
• Normal tables
• T-SQL language integration
• Full support for archiving
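The contrast between the two features can be mimicked in a few lines of Python. This is a toy model, not SQL Server's implementation: the collapsing rule in net_changes is a simplification, the version numbers are hypothetical, and the insert time for account 4568456 is invented because the slide does not show it. The remaining values come from the base and history tables above.

```python
# (version, operation, AccountID, AccountBalance, timestamp)
OPS = [
    (1, "INSERT", 4624572, "5001.01", "2017-03-10 06:20:35"),
    (2, "UPDATE", 4624572, "5768.01", "2017-03-11 04:14:22"),
    (3, "INSERT", 4745733, "478893.33", "2017-03-11 04:14:59"),
    (4, "INSERT", 4568456, "1234758.23", "2017-03-11 04:15:30"),  # invented time
    (5, "DELETE", 4745733, None, "2017-03-12 09:01:12"),
]


def net_changes(ops, since_version):
    """Change-tracking style: one operation per key and no row data."""
    net = {}
    for version, op, key, _balance, _when in ops:
        if version <= since_version:
            continue
        if op in ("INSERT", "DELETE"):
            net[key] = op
        elif net.get(key) != "INSERT":  # an update after an insert stays INSERT
            net[key] = "UPDATE"
    return net


def apply_with_history(ops):
    """Temporal style: keep each superseded row version, data included."""
    base, history = {}, []
    for _version, op, key, balance, when in ops:
        if op in ("UPDATE", "DELETE"):
            old = base.pop(key)
            history.append(
                {"AccountID": key, "AccountBalance": old["AccountBalance"],
                 "CreateDate": old["CreateDate"], "ExpireDate": when}
            )
        if op in ("INSERT", "UPDATE"):
            base[key] = {"AccountBalance": balance, "CreateDate": when}
    return base, history


net = net_changes(OPS, since_version=1)
base, history = apply_with_history(OPS)
print(net)
print(sorted(base))
for row in history:
    print(row)
```

The net-change view says which keys to re-read but carries no values, so a consumer must join back to the base table. The history view keeps the old values and their time ranges, at the cost of storing every version.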

https://github.com/dpmcclurg/ChangeSystemsDemo

Download the Code!