Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read

Jason Pohl, Data Solutions Engineer

Denny Lee, Technology Evangelist

About the speaker: Jason Pohl

Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.

About the moderator: Denny Lee

Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).

We are Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Data Value

Created Databricks on top of Spark to make big data simple.

…

Apache Spark Engine

Spark Core

Spark Streaming

Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries

NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update

Traditional Data Warehousing Pain Points: Inelasticity of compute and storage resources

• Burst workloads require capacity planning for maximum load

• Fixed-size DW = compute and storage must scale linearly together (these are orthogonal problems)

• Expensive conundrum:

  • If your DW is successful, you cannot easily expand

  • If there is overcapacity = idle resources

Traditional Data Warehousing Pain Points: Rigid architecture that’s difficult to change

• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built

• Rigidity = maintaining costly ETL pipelines

• Finite resources are spent continually augmenting pipelines to absorb new data

Traditional Data Warehousing Pain Points: Limited advanced analytics capabilities

• Want more than what business intelligence and data warehousing provide

• More than just counts, aggregates and trends

• Desire forecasting using ML, segmentation, graph processing, etc.

Just-in-Time Data Warehousing: Scale resources on demand

• Scale resources based on query load

• Separate compute and storage to scale either independently

• Easily set up multiple clusters against the same data sources

Just-in-Time Data Warehousing: Direct access to data sources

Change Data Capture: What is it?

• System to automatically capture changes in a source system (e.g. transactional database) and automatically apply those changes to a target system (e.g. data warehouse).

• Important for data warehouses because it allows them to record (and ultimately report) any changes, e.g.:

  • Customer A buys a pair of skis for $250 on 1/2/2015

  • On 1/5/2015, realize that the purchase was $350, not $250

Change Data Capture: Source to Target

Source

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Target

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Change Data Capture: Add new row

Source (before)

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Target (before)

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Source (after row 103 is added)

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

Target (after row 103 is captured)

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

Change Data Capture: Update an existing row

Source

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

Target

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

After the update to row 102

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $350.00
103  1/3/2016  Disc     $15.00
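One way to apply captured inserts and updates to a target is a keyed upsert. The sketch below is only an illustration, not the method used in the webinar notebooks: the function name `apply_changes` and the in-memory dictionary standing in for the target table are assumptions, and the rows are the sample rows from the slides.

```python
from typing import Dict, List

Row = Dict[str, object]


def apply_changes(target: Dict[int, Row], changed_rows: List[Row]) -> Dict[int, Row]:
    """Upsert changed source rows into the target, keyed on ID (hypothetical helper)."""
    for row in changed_rows:
        # A new ID is inserted; an existing ID is overwritten with the new values.
        target[row["ID"]] = dict(row)
    return target


# Target as it looks before any change is captured.
target = {
    101: {"ID": 101, "Date": "1/1/2016", "Product": "Skates", "Price": 80.00},
    102: {"ID": 102, "Date": "1/2/2016", "Product": "Skis", "Price": 250.00},
}

# Changes captured from the source: one new row and one corrected price.
changes = [
    {"ID": 103, "Date": "1/3/2016", "Product": "Disc", "Price": 15.00},
    {"ID": 102, "Date": "1/2/2016", "Product": "Skis", "Price": 350.00},
]

apply_changes(target, changes)
```

An in-place upsert like this keeps the target small but loses the history of the $250 value, which is one reason to track a last-updated column instead.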

Change Data Capture: Update an existing row

Source (before)

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

Source (after the update)

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $350.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016

Target (before)

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

Target (after the change is captured)

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
102  1/2/2016  Skis     $350.00  1/5/2016
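When the target keeps every captured version of a row, a reader has to pick the current version of each ID. The following is a minimal sketch of that step, assuming the rows arrive in the order they were appended and that a later LastUpdated wins, with ties going to the row appended later. The function name `latest_versions` and the tie-breaking rule are assumptions made for illustration.

```python
from datetime import datetime
from typing import Dict, List

Row = Dict[str, object]


def parse_date(text: str) -> datetime:
    """Parse the M/D/YYYY dates used in the sample tables."""
    return datetime.strptime(text, "%m/%d/%Y")


def latest_versions(rows: List[Row]) -> List[Row]:
    """Return one row per ID: the one with the greatest LastUpdated.

    Rows are assumed to be in append order, so on a tie the later row wins.
    """
    latest: Dict[int, Row] = {}
    for row in rows:
        current = latest.get(row["ID"])
        if current is None or parse_date(row["LastUpdated"]) >= parse_date(current["LastUpdated"]):
            latest[row["ID"]] = row
    return [latest[key] for key in sorted(latest)]


# Target rows after the change to ID 102 has been appended.
target_rows = [
    {"ID": 101, "Product": "Skates", "Price": 80.00, "LastUpdated": "1/1/2016"},
    {"ID": 102, "Product": "Skis", "Price": 250.00, "LastUpdated": "1/5/2016"},
    {"ID": 103, "Product": "Disc", "Price": 15.00, "LastUpdated": "1/3/2016"},
    {"ID": 102, "Product": "Skis", "Price": 350.00, "LastUpdated": "1/5/2016"},
]

current_rows = latest_versions(target_rows)
prices = {row["ID"]: row["Price"] for row in current_rows}
```

Keeping the superseded $250 row in the target is what lets the warehouse report on the change as well as the current value.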

Demo: High Watermark with LastUpdatedDate

Stage Data from Employee Database

Update Records in Employee Source Database

UPDATE employees SET last_name = 'Spark' WHERE emp_no = 16894
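A high-watermark job can be sketched as a query that selects only rows changed since the last run. The example below uses an in-memory SQLite database as a stand-in for the employee source database. The `last_updated` column, the sample names, the date values, and the helper `extract_changes` are all assumptions for illustration; only `emp_no = 16894` and the new last name come from the statement above.

```python
import sqlite3

# In-memory stand-in for the source database (assumed schema).
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE employees (emp_no INTEGER PRIMARY KEY, last_name TEXT, last_updated TEXT)"
)
source.executemany(
    "INSERT INTO employees VALUES (?, ?, ?)",
    [(16893, "Smith", "2016-01-01"), (16894, "Jones", "2016-01-02")],  # made-up rows
)


def extract_changes(conn, watermark):
    """Return rows changed after the watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT emp_no, last_name, last_updated FROM employees "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    ).fetchall()
    # ISO dates sort lexicographically, so the last row carries the newest value.
    new_watermark = rows[-1][2] if rows else watermark
    return rows, new_watermark


# First run: an empty watermark stages every row.
initial_rows, watermark = extract_changes(source, "")

# The source record is updated, and its last_updated value moves forward.
source.execute(
    "UPDATE employees SET last_name = 'Spark', last_updated = '2016-01-05' "
    "WHERE emp_no = 16894"
)

# Second run: only the changed row is picked up.
changed_rows, watermark = extract_changes(source, watermark)
```

This only works if the source reliably maintains the last-updated value on every write, which is the main precondition of the high-watermark approach.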

Job to Automate CDC

Source (with the new Tag column)

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

Jobs

Target (before the schema change is picked up)

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

Target (after the schema change is picked up)

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

Add a column to the Departments table

ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)

UPDATE departments SET dept_desc = dept_name

Job to Automate CDC

Source -> Jobs -> Target

Departments columns before the change: dept_no, dept_name

Departments columns after the change: dept_no, dept_name, dept_desc
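Schema on read means the reader reconciles column differences at query time instead of requiring every staged batch to match a fixed schema. The sketch below illustrates the idea with plain Python dictionaries: batches staged before and after `dept_desc` was added are read together, and missing columns come back as `None`. The function `read_with_merged_schema` and the sample department row are assumptions for illustration, not the code from the webinar notebooks.

```python
from typing import Dict, List, Optional, Tuple

Row = Dict[str, Optional[str]]


def read_with_merged_schema(batches: List[List[Row]]) -> Tuple[List[str], List[Row]]:
    """Union the columns of all batches and fill missing values with None."""
    columns: List[str] = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    columns.append(name)  # keep first-seen column order
    rows = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, rows


# Batch staged before the ALTER TABLE: no dept_desc column yet.
old_batch = [{"dept_no": "d001", "dept_name": "Marketing"}]

# Batch staged after the ALTER TABLE and UPDATE: dept_desc equals dept_name.
new_batch = [{"dept_no": "d001", "dept_name": "Marketing", "dept_desc": "Marketing"}]

columns, rows = read_with_merged_schema([old_batch, new_batch])
```

No ETL pipeline had to be rewritten for the new column here; the older batch simply reads as having no description.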

Notebooks

To access the notebooks, please reference the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.

• Stage Data From Employee Database:

  • Notebook that starts the process

  • Defines the ETL process

• Change Schema in Employee Source Database

• Update Records in Employee Source Database

• Validate Departments

Resources

• Just-in-Time Data Warehousing Solution Brief: http://go.databricks.com/data-warehousing-solution-brief

• Building a Turbo-fast Data Warehousing Platform with Databricks: http://go.databricks.com/databricks-webinar-building-a-turbo-fast-data-warehousing-platform-with-databricks-2

• Spark DataFrames: Simple and Fast Analysis of Structured Data: http://go.databricks.com/databricks-webinar-spark-dataframes-simple-and-fast-analysis-of-structured-data-0

• Transitioning from Traditional DW to Spark in OR Predictive Modeling: http://go.databricks.com/transitioning-from-traditional-dw-to-spark-in-or-predictive-modeling

• Advertising Technology Sample Notebook (Part 1): http://go.databricks.com/hubfs/notebooks/Samples/Miscellaneous/AdTech_Sample_Notebook_Part_1.html

More resources

• Databricks Guide

• Apache Spark User Guide: http://spark.apache.org/docs/latest/

• Databricks Community Forum: https://forums.databricks.com/

• Training courses: public classes, MOOCs, & private training: https://databricks.com/spark/training

• Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release! https://databricks.com/blog/2016/02/17/introducing-databricks-community-edition-apache-spark-for-all.html

Thanks!