
Post on 16-Apr-2017


Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read

Jason Pohl, Data Solutions Engineer Denny Lee, Technology Evangelist


About the speaker: Jason Pohl

Jason Pohl is a solutions engineer with Databricks, focused on helping customers become successful with their data initiatives. Jason has spent his career building data-driven products and solutions.


About the moderator: Denny Lee

Denny Lee is a Technology Evangelist with Databricks; he is a hands-on data sciences engineer with more than 15 years of experience developing internet-scale infrastructure, data platforms, and distributed systems for both on-premises and cloud. Prior to joining Databricks, Denny worked as a Senior Director of Data Sciences Engineering at Concur and was part of the incubation team that built Hadoop on Windows and Azure (currently known as HDInsight).


We are Databricks, the company behind Apache Spark

Founded by the creators of Apache Spark in 2013

Share of Spark code contributed by Databricks in 2014: 75%

Data Value

Created Databricks on top of Spark to make big data simple.


Apache Spark Engine

Components: Spark Core, Spark Streaming, Spark SQL, MLlib, GraphX

Unified engine across diverse workloads & environments

Scale out, fault tolerant

Python, Java, Scala, and R APIs

Standard libraries


NOTABLE USERS THAT PRESENTED AT SPARK SUMMIT 2015 SAN FRANCISCO

Source: Slide 5 of Spark Community Update


Traditional Data Warehousing Pain Points: Inelasticity of compute and storage resources

• Burst workloads require capacity planning for maximum load
• Fixed-size DW = compute and storage must scale linearly together (these are orthogonal problems)
• Expensive conundrum:
  • If your DW is successful, you cannot easily expand
  • If there is overcapacity = idle resources


Traditional Data Warehousing Pain Points: Rigid architecture that's difficult to change

• Traditional DWs are schema-on-write, requiring schemas, partitions, and indexes to be pre-built.
• Rigidity = maintaining costly ETL pipelines
• Finite resources are spent continually augmenting pipelines to absorb new data.
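For contrast with schema-on-write, here is a minimal sketch of the schema-on-read idea named in the webinar title: the column list is derived when the data is read, so a record with an unexpected field does not require a pipeline change first. The record contents and the function name `read_with_inferred_schema` are illustrative assumptions, not taken from the webinar notebooks.

```python
import json

# Hypothetical raw records as they might land in storage. The second
# record carries an extra "tag" field that no schema declared up front.
RAW_LINES = [
    '{"id": 101, "product": "Skates", "price": 80.0}',
    '{"id": 102, "product": "Skis", "price": 250.0, "tag": "snow"}',
]


def read_with_inferred_schema(lines):
    """Parse JSON lines and derive the column list at read time."""
    records = [json.loads(line) for line in lines]
    # The schema is the union of all keys seen, in first-seen order.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)
    # Fill fields a record lacks with None instead of rejecting the record.
    rows = [{col: record.get(col) for col in columns} for record in records]
    return columns, rows


columns, rows = read_with_inferred_schema(RAW_LINES)
print(columns)
print(rows)
```

A schema-on-write system would typically need its table altered before the second record could be loaded; here the new field simply shows up as a column, with `None` for older records.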


Traditional Data Warehousing Pain Points: Limited advanced analytics capabilities

• Want more than what business intelligence and data warehousing provide
• More than just counts, aggregates, and trends
• Desire forecasting using ML, segmentation, graph processing, etc.


Just-in-Time Data Warehousing: Scale resources on demand

• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources


Just-in-Time Data Warehousing: Direct access to data sources

• Scale resources based on query load
• Separate compute and storage to scale either independently
• Easily set up multiple clusters against the same data sources


Change Data Capture: What is it?

• System to automatically capture changes in a source system (e.g. transactional database) and automatically apply those changes to a target system (e.g. data warehouse).
• Important for data warehouses because it allows them to record (and ultimately report) any changes, e.g.:
  • Customer A buys a pair of skis for $250 on 1/2/2015
  • On 1/5/2015, realize that the purchase was $350, not $250
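As a rough illustration of the definition above, the sketch below compares a source snapshot with a target snapshot and reports which rows are new and which have changed. Snapshot comparison is only one possible way to detect changes and is an assumption here, not the method used in the webinar demo; the function name `capture_changes` and the sample rows are hypothetical. Deletes are not handled.

```python
def capture_changes(source, target):
    """Return (inserts, updates): rows the target lacks or holds stale."""
    inserts = {}
    updates = {}
    for row_id, row in source.items():
        if row_id not in target:
            inserts[row_id] = row      # row exists only in the source
        elif target[row_id] != row:
            updates[row_id] = row      # row differs between the two
    return inserts, updates


# Illustrative data: the target still holds the original $250 ski purchase.
source = {
    101: ("Skates", 80.00),
    102: ("Skis", 350.00),   # corrected price
    103: ("Disc", 15.00),    # new row
}
target = {
    101: ("Skates", 80.00),
    102: ("Skis", 250.00),
}

inserts, updates = capture_changes(source, target)

# Apply the captured changes so the target matches the source again.
target.update(inserts)
target.update(updates)
print(inserts, updates)
```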


Change Data Capture: Source to Target

Source

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Target

ID   Date      Product  Price

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00


Change Data Capture: Add new row

Source

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

Target

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00


Change Data Capture: Update an existing row

Source

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

Target

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $250.00
103  1/3/2016  Disc     $15.00

ID   Date      Product  Price
101  1/1/2016  Skates   $80.00
102  1/2/2016  Skis     $350.00
103  1/3/2016  Disc     $15.00


Change Data Capture: Update an existing row

Source Target

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $350.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/5/2016
103  1/3/2016  Disc     $15.00   1/3/2016
102  1/2/2016  Skis     $350.00  1/5/2016
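One way to apply a changed row to the target is an upsert keyed on ID, sketched below with SQLite so it runs anywhere. This is an assumption for illustration: the last table above appears to keep both versions of row 102, whereas this sketch overwrites the row in place, which is a different design choice. The table name `target`, the snake_case column names, and the ISO date format are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target ("
    " id INTEGER PRIMARY KEY, date TEXT, product TEXT,"
    " price REAL, last_updated TEXT)"
)

# Target as it looks before the price correction arrives.
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?, ?, ?)",
    [
        (101, "2016-01-01", "Skates", 80.00, "2016-01-01"),
        (102, "2016-01-02", "Skis", 250.00, "2016-01-02"),
        (103, "2016-01-03", "Disc", 15.00, "2016-01-03"),
    ],
)

# Changed rows captured from the source (here: the corrected ski price).
changed_rows = [(102, "2016-01-02", "Skis", 350.00, "2016-01-05")]

# INSERT OR REPLACE overwrites the row whose primary key already exists
# and inserts the row when the key is new.
conn.executemany(
    "INSERT OR REPLACE INTO target VALUES (?, ?, ?, ?, ?)", changed_rows
)
conn.commit()

target_rows = conn.execute(
    "SELECT id, price, last_updated FROM target ORDER BY id"
).fetchall()
print(target_rows)
```

Overwriting keeps the target the same size as the source; appending the new version instead would preserve history at the cost of needing a rule for picking the current row.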


Demo: High Watermark with LastUpdatedDate
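The demo notebooks are not reproduced in this transcript, so the following is only a minimal sketch of the high-watermark idea, assuming a source table with a last-updated column: each run pulls the rows changed since the highest timestamp seen so far, then advances that timestamp. The table name `source`, the function name `extract_changes`, and the ISO date strings (chosen so that string comparison matches date order) are all hypothetical.

```python
import sqlite3


def extract_changes(conn, watermark):
    """Return rows updated after `watermark` and the new watermark."""
    rows = conn.execute(
        "SELECT id, product, price, last_updated FROM source "
        "WHERE last_updated > ? ORDER BY last_updated",
        (watermark,),
    ).fetchall()
    # Advance the watermark to the newest timestamp seen, if any.
    new_watermark = max((row[3] for row in rows), default=watermark)
    return rows, new_watermark


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE source ("
    " id INTEGER PRIMARY KEY, product TEXT, price REAL, last_updated TEXT)"
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?, ?)",
    [
        (101, "Skates", 80.00, "2016-01-01"),
        (102, "Skis", 250.00, "2016-01-02"),
        (103, "Disc", 15.00, "2016-01-03"),
    ],
)

# First run: an empty watermark sorts before every date, so all rows load.
first_batch, watermark = extract_changes(conn, "")

# The source corrects the ski price and stamps the row with a new date.
conn.execute(
    "UPDATE source SET price = 350.00, last_updated = '2016-01-05' "
    "WHERE id = 102"
)

# Second run: only the corrected row is newer than the watermark.
second_batch, watermark = extract_changes(conn, watermark)

# Third run: nothing has changed, so nothing is extracted.
third_batch, watermark = extract_changes(conn, watermark)
print(second_batch, watermark)
```

The strict `>` comparison can miss a row that is written with the same timestamp as the current watermark after an extract has run; a finer-grained timestamp or a small overlap window are common ways to reduce that risk.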


Stage Data from Employee Database


Update Records in Employee Source Database

UPDATE employees SET last_name = 'Spark' WHERE emp_no = 16894


Job to Automate CDC

Source Target

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

Jobs

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016

ID   Date      Product  Tag    Price    LastUpdated
101  1/1/2016  Skates   ice    $80.00   1/1/2016
102  1/2/2016  Skis     snow   $250.00  1/2/2016
103  1/3/2016  Disc     field  $15.00   1/3/2016

ID   Date      Product  Price    LastUpdated
101  1/1/2016  Skates   $80.00   1/1/2016
102  1/2/2016  Skis     $250.00  1/2/2016
103  1/3/2016  Disc     $15.00   1/3/2016
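The tables above show a source that has gained a Tag column alongside a target that lacks it. As a hedged sketch of how a job might absorb such a change, the function below appends any column the target has not seen and fills it from the source rows by ID. The function name `evolve_target` and the dict-based row representation are assumptions for illustration; the actual job logic lives in the webinar notebooks and is not shown in this transcript.

```python
def evolve_target(target_columns, target_rows, source_columns, source_rows,
                  key="ID"):
    """Add columns new in the source to the target and merge rows by key."""
    # Columns the target has not seen yet are appended at the end.
    new_columns = [c for c in source_columns if c not in target_columns]
    merged_columns = target_columns + new_columns

    # Start from the existing target rows, then overlay the source values.
    by_key = {row[key]: dict(row) for row in target_rows}
    for row in source_rows:
        by_key.setdefault(row[key], {}).update(row)

    merged_rows = [
        {col: by_key[k].get(col) for col in merged_columns}
        for k in sorted(by_key)
    ]
    return merged_columns, merged_rows


source_columns = ["ID", "Date", "Product", "Tag", "Price", "LastUpdated"]
target_columns = ["ID", "Date", "Product", "Price", "LastUpdated"]

source_rows = [
    dict(zip(source_columns, values))
    for values in [
        (101, "1/1/2016", "Skates", "ice", 80.00, "1/1/2016"),
        (102, "1/2/2016", "Skis", "snow", 250.00, "1/2/2016"),
        (103, "1/3/2016", "Disc", "field", 15.00, "1/3/2016"),
    ]
]
# The target holds the same rows, minus the Tag column.
target_rows = [
    {col: row[col] for col in target_columns} for row in source_rows
]

merged_columns, merged_rows = evolve_target(
    target_columns, target_rows, source_columns, source_rows
)
print(merged_columns)
```

Appending the new column at the end leaves existing column positions untouched, so the merged column order differs from the source's order.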


Add a column to the Departments table

ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)

UPDATE departments SET dept_desc = dept_name
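To try the two statements above without access to the webinar's source database, they can be run against an in-memory SQLite table. SQLite is used here only so the sketch runs anywhere; the source database engine is not specified in this transcript, and the two sample department rows are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE departments (dept_no TEXT PRIMARY KEY, dept_name TEXT)"
)
# Illustrative sample rows.
conn.executemany(
    "INSERT INTO departments VALUES (?, ?)",
    [("d001", "Marketing"), ("d002", "Finance")],
)

# The two statements from the slide: add the column, then backfill it.
conn.execute("ALTER TABLE departments ADD COLUMN dept_desc VARCHAR(50)")
conn.execute("UPDATE departments SET dept_desc = dept_name")
conn.commit()

# PRAGMA table_info returns one row per column; index 1 is the column name.
column_names = [
    row[1] for row in conn.execute("PRAGMA table_info(departments)")
]
rows = conn.execute(
    "SELECT dept_no, dept_name, dept_desc FROM departments ORDER BY dept_no"
).fetchall()
print(column_names)
print(rows)
```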


Job to Automate CDC

Source Target

Jobs

[Diagram residue: three column lists for the departments table: (dept_no, dept_name), (dept_no, dept_name), and (dept_no, dept_name, dept_desc)]


Notebooks

To access the notebooks, please reference the attachments in the Just-in-Time Data Warehousing on Databricks: Change Data Capture and Schema On Read webinar.

• Stage Data From Employee Database:
  • Notebook that starts the process
  • Defines the ETL process
• Change Schema in Employee Source Database
• Update Records in Employee Source Database
• Validate Departments


More resources

• Databricks Guide
• Apache Spark User Guide
• Databricks Community Forum
• Training courses: public classes, MOOCs, & private training
• Databricks Community Edition: Free hosted Apache Spark. Join the waitlist for the beta release!


Thanks!