Implementing Change Systems in SQL Server 2016
The goal of this series is to give you the tools you need to push analytics forward at your company.
• Explain the nature and importance of change systems in an overall data platform
• Compare and contrast traditional and modern data warehouse architectures
• Discuss a key technology that is core to change systems in the enterprise
• Compare the SQL Server features that enable robust change data capture
Change Systems: Agenda
Database Engine
MDF LDF
Overview: The Source of Change
• A database engine manages files.
  • Data structures
  • Transaction logs
• Change systems accurately track modifications inside data structures.
• The source of record for change is the transaction log. Using this log directly is a characteristic of passive change systems.
• Active change systems watch the data structure and record observable change.
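The difference between the two kinds of change system can be sketched in a few lines of Python. This is a toy model, not SQL Server behavior: `ToyEngine`, `passive_changes`, and `active_changes` are hypothetical names, and the "log" is just a list. It illustrates that a passive reader of the log sees every modification, while an active watcher comparing two observations of the data structure sees only what is still observable.

```python
from copy import deepcopy


class ToyEngine:
    """Toy stand-in for a database engine: a data structure plus an append-only log."""

    def __init__(self):
        self.rows = {}   # the data structure (data at rest)
        self.log = []    # the transaction log (every modification, in order)

    def write(self, key, value):
        self.log.append(("WRITE", key, value))
        self.rows[key] = value

    def delete(self, key):
        self.log.append(("DELETE", key, None))
        self.rows.pop(key, None)


def passive_changes(engine, from_position):
    """Passive capture: read the log directly, starting at a saved position."""
    return engine.log[from_position:]


def active_changes(before, after):
    """Active capture: compare two observations of the data structure."""
    changes = []
    for key in sorted(before.keys() | after.keys()):
        if key not in after:
            changes.append(("DELETE", key, None))
        elif key not in before or before[key] != after[key]:
            changes.append(("WRITE", key, after[key]))
    return changes


engine = ToyEngine()
engine.write("A", 100)
snapshot = deepcopy(engine.rows)   # the active watcher's last observation
position = len(engine.log)         # the passive reader's last log position

engine.write("A", 150)   # intermediate value, later overwritten
engine.write("A", 175)   # final value
engine.write("B", 10)
engine.delete("B")       # B appears and disappears between observations

log_view = passive_changes(engine, position)
diff_view = active_changes(snapshot, engine.rows)
print(len(log_view), "changes seen in the log")
print(len(diff_view), "changes seen by comparing observations")
```

In this sketch the log reader reports four modifications, while the watcher reports only the final value of `A` and never learns that `B` existed.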
Overview: Modeling Change
AccountID CustomerID AccountBalance ModifyDate
4568456 2342 1234758.23 2017-03-11 04:11:05
4624572 9875 5768.01 2017-03-11 04:13:15
4745733 8735 478893.33 2017-03-11 04:13:01
AccountID CustomerID Type Amount EventDate
4568456 2342 Deposit 1198575.32 2017-03-08 09:09:04
4624572 9875 Deposit 4438.70 2017-03-08 09:10:01
4745733 8735 Deposit 460436.02 2017-03-07 10:13:20
4568456 2342 Deposit 528.11 2017-03-08 06:13:45
4624572 9875 Deposit 1345.23 2017-03-09 10:22:25
4745733 8735 Deposit 635.20 2017-03-08 11:13:01
4568456 2342 Withdrawal 23.21 2017-03-09 12:12:02
4624572 9875 Fee 21.34 2017-03-09 06:13:45
4745733 8735 Withdrawal 42.66 2017-03-10 13:13:12
4568456 2342 Transfer 35678.01 2017-03-11 04:11:05
4624572 9875 Deposit 5.42 2017-03-11 04:13:15
4745733 8735 Deposit 17864.77 2017-03-11 04:13:01
Table = Log*
* Record the CRUD operations to the table and you get a changelog.
The duality is that a table holds data at rest while a log captures change. If you have the
log, you can rebuild not only the original table but also a myriad of other derived tables.
Logs therefore seem to be the more fundamental data structure.
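The duality can be checked against the figures on this slide: replaying the log rows reproduces the table rows. The following is a minimal Python sketch, assuming that Deposit and Transfer add to a balance while Withdrawal and Fee subtract. The slide does not state the signs; this is the reading under which the totals match. The names `replay` and `totals_by_type` are illustrative only.

```python
from decimal import Decimal

# Rows copied from the log on the slide:
# (AccountID, CustomerID, Type, Amount, EventDate)
LOG = [
    (4568456, 2342, "Deposit", "1198575.32", "2017-03-08 09:09:04"),
    (4624572, 9875, "Deposit", "4438.70", "2017-03-08 09:10:01"),
    (4745733, 8735, "Deposit", "460436.02", "2017-03-07 10:13:20"),
    (4568456, 2342, "Deposit", "528.11", "2017-03-08 06:13:45"),
    (4624572, 9875, "Deposit", "1345.23", "2017-03-09 10:22:25"),
    (4745733, 8735, "Deposit", "635.20", "2017-03-08 11:13:01"),
    (4568456, 2342, "Withdrawal", "23.21", "2017-03-09 12:12:02"),
    (4624572, 9875, "Fee", "21.34", "2017-03-09 06:13:45"),
    (4745733, 8735, "Withdrawal", "42.66", "2017-03-10 13:13:12"),
    (4568456, 2342, "Transfer", "35678.01", "2017-03-11 04:11:05"),
    (4624572, 9875, "Deposit", "5.42", "2017-03-11 04:13:15"),
    (4745733, 8735, "Deposit", "17864.77", "2017-03-11 04:13:01"),
]

# Assumption (not stated on the slide): which event types credit or debit.
SIGN = {"Deposit": 1, "Transfer": 1, "Withdrawal": -1, "Fee": -1}


def replay(log):
    """Rebuild the account table by applying events in EventDate order."""
    table = {}
    # ISO-style timestamps sort correctly as plain strings.
    for account_id, customer_id, kind, amount, event_date in sorted(log, key=lambda e: e[4]):
        row = table.setdefault(
            account_id,
            {"CustomerID": customer_id, "AccountBalance": Decimal("0"), "ModifyDate": None},
        )
        row["AccountBalance"] += SIGN[kind] * Decimal(amount)
        row["ModifyDate"] = event_date
    return table


def totals_by_type(log):
    """A second table derived from the same log: total amount per event type."""
    totals = {}
    for _, _, kind, amount, _ in log:
        totals[kind] = totals.get(kind, Decimal("0")) + Decimal(amount)
    return totals


accounts = replay(LOG)
for account_id, row in sorted(accounts.items()):
    print(account_id, row["CustomerID"], row["AccountBalance"], row["ModifyDate"])
print(totals_by_type(LOG))
```

The same log yields both the balance table and a per-type summary, which is the sense in which the log is the more fundamental structure.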
Overview: Modeling Change
Valid Time
John Doe, who lived in Flat Rock, NC, made his first visit to us on April 1st, 1985, and changed his
permanent address during a sale on November 12th, 2005.
Name      Address                                   ValidFrom            ValidTo
John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  1985-04-01 10:00:00  2005-11-12 09:05:00
John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-12 09:06:01  9999-12-31 23:59:59
Transaction Time
Our data warehouse went live on November 1st, 2005. The ETL runs daily at 4 AM.
Name      Address                                   CreateDate           ExpireDate
John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  2005-11-01 09:25:11  2005-11-13 04:54:11
John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-13 04:54:12  9999-12-31 23:59:59
Overview: Modeling Change
ID Name Address ModifyDate
12345 John Doe 81 Carl Sandberg Ln, Flat Rock, NC 28731 1985-04-01 10:00:00
12345 John Doe 9433 Collingdale Way, Raleigh, NC 27617 2005-11-12 09:06:01
Key  ID     Name      Address                                   ValidFrom            ValidTo              CreateDate           ExpireDate
1    12345  John Doe  81 Carl Sandberg Ln, Flat Rock, NC 28731  1985-04-01 10:00:00  2005-11-12 09:06:00  2005-11-01 09:25:11  2005-11-13 04:54:11
2    12345  John Doe  9433 Collingdale Way, Raleigh, NC 27617   2005-11-12 09:06:01  9999-12-31 23:59:59  2005-11-13 04:54:12  9999-12-31 23:59:59
ETL
Source
Target: SCD 2 Dimension
This column creates risk
Latency of 1 Day at best
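The gap between valid time and transaction time in the SCD 2 dimension above can be made concrete with a short Python sketch. It asks two questions about the same moment, midday on November 12th, 2005: which address was actually valid, and which address the warehouse had on record. The helper names `ts` and `as_of` are hypothetical, the open-ended sentinel is written with a parseable seconds value, and intervals are treated as closed at both ends because the rows on the slide end one second before the next row begins.

```python
from datetime import datetime

OPEN_END = datetime(9999, 12, 31, 23, 59, 59)  # sentinel for "still current"


def ts(text):
    """Parse the timestamp format used on the slides."""
    return datetime.strptime(text, "%Y-%m-%d %H:%M:%S")


# Rows of the SCD 2 dimension from the slide.
DIMENSION = [
    {"Key": 1, "Address": "81 Carl Sandberg Ln, Flat Rock, NC 28731",
     "ValidFrom": ts("1985-04-01 10:00:00"), "ValidTo": ts("2005-11-12 09:06:00"),
     "CreateDate": ts("2005-11-01 09:25:11"), "ExpireDate": ts("2005-11-13 04:54:11")},
    {"Key": 2, "Address": "9433 Collingdale Way, Raleigh, NC 27617",
     "ValidFrom": ts("2005-11-12 09:06:01"), "ValidTo": OPEN_END,
     "CreateDate": ts("2005-11-13 04:54:12"), "ExpireDate": OPEN_END},
]


def as_of(rows, start_column, end_column, moment):
    """Return the rows whose [start, end] interval contains the given moment."""
    return [r for r in rows if r[start_column] <= moment <= r[end_column]]


moment = ts("2005-11-12 12:00:00")

# Valid time: what was true in the real world at that moment.
true_then = as_of(DIMENSION, "ValidFrom", "ValidTo", moment)

# Transaction time: what the warehouse had recorded at that moment.
known_then = as_of(DIMENSION, "CreateDate", "ExpireDate", moment)

print("valid:   ", [r["Address"] for r in true_then])
print("recorded:", [r["Address"] for r in known_then])
```

At that moment the customer already lived in Raleigh, but the warehouse still showed Flat Rock until the next ETL run, which is the latency the slide calls out.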
[Diagram: Application Database sources (SQL, DB2, SQL, …) feed Batch ETL Jobs, which load the Enterprise Data Warehouse and its Marts for Storage and Query.]
Traditional Architecture: The Pull Method
Traditional Architecture: Focus on the Source
Focus Area One
Friction & Frustration
Data Quality
• Timeliness
  • Latency of change
  • Latency of build
• Consistency
  • Redundant ETL
• Accuracy
  • Filters
  • Logic
  • Source
Lead Time
• Custom ETL
• Manual ETL
• Business case and ceremony
• Domain knowledge
Dependencies
• Business logic
• Redundancy
• Downstream effects
• Team
Collect and Route
Archive
Events
Query | Model | Automate
Stream
Modern Architecture: The Push Method (Lambda)
Speed Layer
Batch Layer
Serving Layer
Real-time Views
Batch Views
Events
Query | Model | Automate
Stream
Modern Architecture: The Push Method (Kappa)
Unified Log Storage
Archive
Collect
Derive
• Ingest (don’t extract) disparate silos of data
• Store data in its atomic form (no transform)
• Collect changes as if they were events (immutable)
• Run downstream ETL more often (process less data each cycle)
Modern Architecture: Lessons Learned
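Two of these lessons, treating changes as immutable events and running downstream ETL more often over less data, can be sketched together in Python. This is an illustrative model only: `EventLog` and `MicroBatchConsumer` are hypothetical names, and a real system would persist both the log and the consumer's offset.

```python
from collections import namedtuple

# namedtuple instances cannot be modified after creation, so each event is immutable.
Event = namedtuple("Event", "offset key payload")


class EventLog:
    """Append-only collection of change events."""

    def __init__(self):
        self._events = []

    def append(self, key, payload):
        event = Event(len(self._events), key, payload)
        self._events.append(event)
        return event

    def read_from(self, offset):
        """Return every event at or after the given offset."""
        return tuple(self._events[offset:])


class MicroBatchConsumer:
    """Downstream ETL that remembers its position and processes only new events."""

    def __init__(self, log):
        self.log = log
        self.next_offset = 0
        self.view = {}   # derived table: latest payload per key

    def run_cycle(self):
        batch = self.log.read_from(self.next_offset)
        for event in batch:
            self.view[event.key] = event.payload
        self.next_offset += len(batch)
        return len(batch)   # how much data this cycle had to process


log = EventLog()
consumer = MicroBatchConsumer(log)

log.append("account-1", {"balance": 100})
log.append("account-2", {"balance": 50})
log.append("account-1", {"balance": 120})
first = consumer.run_cycle()    # processes the three events so far

log.append("account-2", {"balance": 75})
second = consumer.run_cycle()   # processes only the one new event
third = consumer.run_cycle()    # nothing new, nothing to process

print(first, second, third)
print(consumer.view)
```

Because the consumer tracks its offset, running it more often makes each cycle smaller rather than repeating earlier work.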
Mart
ETL
Application Database
Enriched Source
Mirror Layer
Mart
Storage and Query
Micro-Batch ETL Jobs
Modern Architecture: Phase 1
Homogenize, Protect, and Standardize
= database transaction log
Mirror Layer
Analytical Model
Temporary Staging
Source
Why Have a Mirror Layer?
1. Improve the data structure of a source system (add primary keys, indexes)
2. Hide complexity related to the type of source system (SQL, API, Mainframe)
3. Improve the quality and performance of change tracking
4. Enable data governance programs by homogenizing sources
5. Enable prototyping of new automation solutions without developer support
Risks/Assumptions
This layer must be real-time and simple, close to the metal. The more it looks like another ETL layer, the more the risks will outweigh the benefits.
Transform Near Real-time
Intensive Transform
Mirror Layer: Overview
Mirror Layer: Replication in Production
SaleTransaction
CustomerProfile
Source Database Server
T-LOG
T-LOG
Pub
Sub
Article
Article
Push
Distcmd
• Set up everything in a lower environment and replay production activity to get an idea of load.
• The source database is placed into an Always On availability group so that the database and replication can fail over.
• Distributor and subscriber are moved to their own failover cluster.
• Subscribers connect to an availability group listener so they can find the right server after a failover.
• Database and log backups are still taken regularly to support disaster recovery, but additional preparations are made to enable a smooth restore of replication.
Mirror Layer Demo: Features of SQL Server
Base Table
AccountID  CustomerID  AccountBalance  ModifyDate
4568456    2342        1234758.23      2017-03-11 04:11:05
4624572    9875        5768.01         2017-03-11 04:13:15
Change Table (Internal)
AccountID  Operation  Columns
4568456    INSERT
4624572    UPDATE     AccountBalance
4745733    DELETE
History Table
AccountID  CustomerID  AccountBalance  ModifyDate           CreateDate           ExpireDate
4624572    9875        5001.01         2017-03-10 06:19:01  2017-03-10 06:20:35  2017-03-11 04:14:22
4745733    8735        478893.33       2017-03-11 04:13:01  2017-03-11 04:14:59  2017-03-12 09:01:12
Change Tracking
• Net changes only
• No data
• Internal tables
• Internal functions
• Retention period only
Temporal Tables
• Net changes not automatic
• Data
• Normal tables
• T-SQL language integration
• Full support for archiving
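The contrast between the two features can be illustrated with a toy Python model that records the same three operations in both styles. This is a simplified sketch and not how SQL Server implements either feature: `ToyTable` and its methods are hypothetical, the net-change rules are reduced to the few cases needed here, and the insert timestamp for account 4568456 is invented for the example. All other values come from the tables on the slide.

```python
class ToyTable:
    """Records changes two ways: net operations per key, and full prior row versions."""

    def __init__(self):
        self.base = {}      # current rows
        self.tracked = {}   # change-tracking style: key -> (operation, changed columns)
        self.history = []   # temporal style: complete prior versions with their period

    def _track(self, key, operation, columns=()):
        previous_op, previous_columns = self.tracked.get(key, (None, []))
        if operation == "UPDATE" and previous_op == "INSERT":
            operation, columns = "INSERT", []   # insert then update is still a net insert
        elif operation == "UPDATE" and previous_op == "UPDATE":
            columns = set(previous_columns) | set(columns)   # merge changed columns
        self.tracked[key] = (operation, sorted(columns))

    def insert(self, key, row, at):
        self.base[key] = dict(row, AccountID=key, CreateDate=at)
        self._track(key, "INSERT")

    def update(self, key, changes, at):
        old = self.base[key]
        self.history.append(dict(old, ExpireDate=at))   # keep the full prior version
        self.base[key] = dict(old, **changes, CreateDate=at)
        self._track(key, "UPDATE", changes.keys())

    def delete(self, key, at):
        old = self.base.pop(key)
        self.history.append(dict(old, ExpireDate=at))
        self._track(key, "DELETE")


table = ToyTable()
# Existing rows, loaded before the point where tracking starts.
table.insert(4624572, {"CustomerID": 9875, "AccountBalance": "5001.01"}, at="2017-03-10 06:20:35")
table.insert(4745733, {"CustomerID": 8735, "AccountBalance": "478893.33"}, at="2017-03-11 04:14:59")
table.tracked.clear()   # pretend a consumer has already synchronized up to here

# The three operations shown in the change table on the slide.
table.insert(4568456, {"CustomerID": 2342, "AccountBalance": "1234758.23"},
             at="2017-03-11 04:12:00")   # hypothetical timestamp, not from the slide
table.update(4624572, {"AccountBalance": "5768.01"}, at="2017-03-11 04:14:22")
table.delete(4745733, at="2017-03-12 09:01:12")

print(table.tracked)
for version in table.history:
    print(version)
```

The tracked view says which keys changed and how, but carries no column values; the history view carries the complete earlier rows, which is why it can answer questions about past states.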