best practices data engineering tools & insight sriram baskaran · 2019. 11. 21. · storing...

Data Engineering Tools & Best Practices Sriram Baskaran Insight

Post on 30-Aug-2020

Page 1: Best Practices Data Engineering Tools & Insight Sriram Baskaran · 2019. 11. 21. · Storing Data Database and storage systems are the most underrated tools. Processing hinges on

Data Engineering Tools & Best Practices

Sriram Baskaran

Insight

Sriram Baskaran

Program Director

Data Engineer

● Bachelors in CS, Grad 2013
● Machine Learning Engineer, 2013-2016
● Masters in CS (Data Science), Grad 2018
● Insight, 2018

linkedin.com/[email protected]

apply.insightdatascience.com

Some context

Let’s take an example

[Diagram: App, Backend]

Restaurants
id   rest_name      loc
1    Everest Momo   Sunnyvale
2    Cafe Centro    San Francisco
...  ...            ...

Customers
id   user_name   user_base_loc
101  James       San Jose
102  Mark        San Francisco
...  ...         ...

Why Relational?

● Rows of my tables are accessed together.
  ○ Single row, all columns
  ○ Row-oriented relational databases follow this pattern: Postgres, MySQL, Oracle
  ○ A huge amount of planning is required to design good schemas!
    ■ Little flexibility for schema changes later

Restaurants
id   rest_name      loc
1    Everest Momo   Sunnyvale
2    Cafe Centro    San Francisco
...  ...            ...

Customers
id   user_name   user_base_loc
101  James       San Jose
102  Mark        San Francisco
...  ...         ...

Reviews
id    cust_id  rest_id  rating
1001  101      1        3
1002  102      1        5
...   ...      ...      ...

Backend Databases

● Mostly relational: Postgres and MySQL are popular.
● Based on relational algebra and Codd’s relational model. It’s important to know this!
● Things to know: SQL, ER modeling.
  ○ Crow’s foot notation
● Most of the data for your data pipelines starts here.
  ○ It is important to understand backend databases.
● Binary data such as images is stored separately.
  ○ Caching and Content Delivery Networks

Data Engineering starts here

Data engineering

● Extensions and analytics on backend databases.
● Building pipelines to move data from A to B.
● Ingesting and storing data in efficient storage systems.
● Ability to handle large-scale data processing.
● Automating a large part of ETL work.

Agenda

● Storing / Ingesting Data
● Processing Data
● Visualizing Data
● Scheduling and Monitoring!

Storing Data

● Database and storage systems are the most underrated tools.
● Processing hinges on good storage of data.
● Good storage removes the need for additional transformations in the processing stage.

Normalized: Restaurants, Customers, Ratings

Joins happen every time the data is queried.
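
To make the cost of the normalized layout concrete, here is a minimal sketch using Python's built-in sqlite3 module and the sample rows from the slides. The lowercase table names and the query are illustrative assumptions, not something shown in the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
    CREATE TABLE customers   (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
    CREATE TABLE reviews     (id INTEGER PRIMARY KEY,
                              cust_id INTEGER REFERENCES customers(id),
                              rest_id INTEGER REFERENCES restaurants(id),
                              rating  INTEGER);
    INSERT INTO restaurants VALUES (1, 'Everest Momo', 'Sunnyvale'),
                                   (2, 'Cafe Centro', 'San Francisco');
    INSERT INTO customers   VALUES (101, 'James', 'San Jose'),
                                   (102, 'Mark', 'San Francisco');
    INSERT INTO reviews     VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
""")

# "Who rated which restaurant?" needs two joins, on every single query.
rows = conn.execute("""
    SELECT c.user_name, r.rest_name, v.rating
    FROM reviews v
    JOIN customers   c ON c.id = v.cust_id
    JOIN restaurants r ON r.id = v.rest_id
    ORDER BY v.id
""").fetchall()
print(rows)
```

Every analytical question over reviews repeats these joins, which is the overhead the slide points at.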

Denormalized: All Data

Star Schema (but prod is not optimized for it; let’s fix that in a bit)

Joins don’t happen here.

Denormalized: All Data

Joins don’t happen here, but the load falls on the production database.

Build a warehouse that is independent of your prod database

Transactional Database -> (some way to sync) -> Analytical Database
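
One possible shape for the "some way to sync" step is sketched below with two in-memory SQLite databases standing in for the transactional and analytical systems. The `sync` function, the `review_facts` table, and the full-refresh strategy are assumptions made for illustration; production setups usually sync incrementally.

```python
import sqlite3

def make_prod():
    """Transactional (normalized) database with the sample rows from the slides."""
    prod = sqlite3.connect(":memory:")
    prod.executescript("""
        CREATE TABLE restaurants (id INTEGER PRIMARY KEY, rest_name TEXT, loc TEXT);
        CREATE TABLE customers   (id INTEGER PRIMARY KEY, user_name TEXT, user_base_loc TEXT);
        CREATE TABLE reviews     (id INTEGER PRIMARY KEY, cust_id INTEGER,
                                  rest_id INTEGER, rating INTEGER);
        INSERT INTO restaurants VALUES (1, 'Everest Momo', 'Sunnyvale');
        INSERT INTO customers   VALUES (101, 'James', 'San Jose'), (102, 'Mark', 'San Francisco');
        INSERT INTO reviews     VALUES (1001, 101, 1, 3), (1002, 102, 1, 5);
    """)
    return prod

def sync(prod, warehouse):
    """Full refresh: join once in prod, store one wide table in the warehouse."""
    rows = prod.execute("""
        SELECT v.id, c.user_name, c.user_base_loc, r.rest_name, r.loc, v.rating
        FROM reviews v
        JOIN customers   c ON c.id = v.cust_id
        JOIN restaurants r ON r.id = v.rest_id
    """).fetchall()
    warehouse.execute("DROP TABLE IF EXISTS review_facts")
    warehouse.execute("""
        CREATE TABLE review_facts (review_id INTEGER, user_name TEXT, user_base_loc TEXT,
                                   rest_name TEXT, rest_loc TEXT, rating INTEGER)
    """)
    warehouse.executemany("INSERT INTO review_facts VALUES (?, ?, ?, ?, ?, ?)", rows)
    warehouse.commit()
    return len(rows)

prod = make_prod()
warehouse = sqlite3.connect(":memory:")
copied = sync(prod, warehouse)

# Analytical query: no joins, and no extra load on the production database.
avg = warehouse.execute(
    "SELECT rest_name, AVG(rating) FROM review_facts GROUP BY rest_name"
).fetchall()
```

The join cost is paid once per sync instead of once per analytical query.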

What are our options?

● You will come across
  ○ Postgres
  ○ MySQL
  ○ Oracle
  ○ Druid
  ○ Redshift
  ○ Elasticsearch
  ○ Cassandra
  ○ Memcached
  ○ Redis
  ○ DynamoDB
  ○ Couchbase
  ○ Flat files (S3)

Pick a database only after you know the access patterns.

Analytical in Relational

● OLAP is pretty powerful.
  ○ Use of ROLLUP and CUBE operations
  ○ Star schema and snowflake schema are pretty nice.
  ○ Examples: Postgres, Oracle, SQL Server, MySQL
● Good, but it will not scale well, mainly because of the way the data is stored (row by row).
● The schema is rigid, so changes are very hard.
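
As a rough illustration of what ROLLUP produces, the sketch below emulates `GROUP BY ROLLUP (rest_loc, rest_name)` with `UNION ALL` in SQLite, which has no ROLLUP keyword of its own. The `review_facts` table and the Cafe Centro rating are made-up sample data, not from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE review_facts (rest_loc TEXT, rest_name TEXT, rating INTEGER);
    INSERT INTO review_facts VALUES
        ('Sunnyvale',     'Everest Momo', 3),
        ('Sunnyvale',     'Everest Momo', 5),
        ('San Francisco', 'Cafe Centro',  4);
""")

# ROLLUP (rest_loc, rest_name) = detail rows + per-location subtotals + grand total.
# NULL in a grouping column marks a subtotal row, as it does in real ROLLUP output.
rollup = conn.execute("""
    SELECT rest_loc, rest_name, COUNT(*), AVG(rating)
      FROM review_facts GROUP BY rest_loc, rest_name
    UNION ALL
    SELECT rest_loc, NULL, COUNT(*), AVG(rating)
      FROM review_facts GROUP BY rest_loc
    UNION ALL
    SELECT NULL, NULL, COUNT(*), AVG(rating)
      FROM review_facts
""").fetchall()

for row in rollup:
    print(row)
```

In a database that supports ROLLUP directly, the three SELECTs collapse into a single GROUP BY clause.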

Groupings and Aggregations

● Columnar
  ○ Druid
  ○ Redshift
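
Why columnar layouts suit groupings and aggregations can be sketched in plain Python. This is only a toy model of the idea, using the Reviews rows from the slides; it says nothing about how Druid or Redshift are implemented internally.

```python
# Row-oriented: each record is stored together (good for "fetch one review").
rows = [
    {"id": 1001, "cust_id": 101, "rest_id": 1, "rating": 3},
    {"id": 1002, "cust_id": 102, "rest_id": 1, "rating": 5},
]

# Column-oriented: each column is stored together (good for aggregates).
columns = {name: [row[name] for row in rows] for name in rows[0]}

# The average rating touches a single column instead of every field of every row.
avg_rating = sum(columns["rating"]) / len(columns["rating"])

# Grouping by restaurant reads only two of the four columns.
totals = {}
for rest_id, rating in zip(columns["rest_id"], columns["rating"]):
    count, total = totals.get(rest_id, (0, 0))
    totals[rest_id] = (count + 1, total + rating)
avg_by_restaurant = {rest_id: total / count for rest_id, (count, total) in totals.items()}

print(avg_rating, avg_by_restaurant)
```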

Search through unstructured text

● LIKE with % wildcards in SQL is not efficient for text search.
  ○ SELECT * FROM reviews WHERE review_text LIKE '%great%' (a leading wildcard typically forces a full scan)
  ○ SELECT * FROM reviews WHERE review_text LIKE 'Loved%' (only matches the start of the text)
● Indexing of unstructured text needs to be really good.
  ○ Elasticsearch
  ○ Solr
● E.g., searching the text in a review.
● Each tool relies on a data structure called a “postings list” (an inverted index), which makes search faster.
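
The postings-list idea can be sketched in a few lines of Python: map every term to the sorted list of documents that contain it, then answer a query by intersecting those lists. The review texts and function names are invented for illustration, and real engines add ranking, stemming, and compression on top.

```python
import re
from collections import defaultdict

# Made-up review texts keyed by review id.
reviews = {
    1: "Loved the momos, great service",
    2: "Great coffee but slow service",
    3: "Loved the pasta",
}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Map each term to its postings list: the sorted ids of documents containing it."""
    index = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(tokenize(docs[doc_id])):
            index[term].append(doc_id)
    return index

def search(index, *terms):
    """Return the ids of documents that contain all of the given terms."""
    postings = [set(index.get(term.lower(), [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []

index = build_index(reviews)
print(search(index, "great"))
print(search(index, "loved", "service"))
```

A lookup touches only the postings of the query terms, whereas `LIKE '%great%'` has to read every review.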

Caching

● Temporary in-memory storage
  ○ Redis
  ○ Memcached
● Optimized for fast storage and retrieval. Key-value store (not a document store).
● Use reasonable keys so that the hashing algorithm is not a bottleneck.
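
A minimal sketch of the caching pattern, assuming a plain Python dictionary with expiry times as a stand-in for Redis or Memcached. `TTLCache`, `restaurant_key`, and `get_restaurant` are hypothetical names; neither tool's client API is shown here.

```python
import time

class TTLCache:
    """In-memory key-value store whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value):
        self._data[key] = (self.clock() + self.ttl, value)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._data[key]  # expired: drop it and report a miss
            return default
        return value

def restaurant_key(rest_id):
    """Short, predictable keys keep hashing and lookups cheap."""
    return f"restaurant:{rest_id}"

def get_restaurant(cache, rest_id, load_from_db):
    """Cache-aside read: try the cache first, fall back to the database on a miss."""
    key = restaurant_key(rest_id)
    value = cache.get(key)
    if value is None:
        value = load_from_db(rest_id)
        cache.set(key, value)
    return value
```

The `clock` parameter exists only so that expiry can be exercised without sleeping.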

How to pick one?

● Make educated and reasonable assumptions about:
  ○ Type of data
  ○ Access patterns
  ○ Scaling factor (most databases are designed to scale in their “domain”)
● Read a lot, and never stop reading.
● Use it in a project.
  ○ There are hundreds of large open datasets available.
  ○ Start with GDELT (https://www.gdeltproject.org/data.html)

Complexities of communication

● The more tools you have, the harder it is to communicate between them.
● Keeping databases in sync is one of the main challenges in the industry.
● Kafka may be a solution.
  ○ Acts as a message bus
  ○ Use Kafka Connect to bridge systems
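
The message-bus idea can be sketched as an append-only log per topic, with every consumer tracking its own read position. The `MessageBus` class below is a toy in-memory model for illustration, not the Kafka client API; the consumer names and review 1003 are made up.

```python
from collections import defaultdict

class MessageBus:
    """Append-only log per topic; each consumer keeps its own read offset."""

    def __init__(self):
        self._topics = defaultdict(list)
        self._offsets = defaultdict(int)  # (consumer, topic) -> next index to read

    def publish(self, topic, message):
        self._topics[topic].append(message)

    def poll(self, consumer, topic):
        """Return the messages this consumer has not seen yet and advance its offset."""
        log = self._topics[topic]
        start = self._offsets[(consumer, topic)]
        self._offsets[(consumer, topic)] = len(log)
        return log[start:]

bus = MessageBus()
bus.publish("reviews", {"id": 1001, "rating": 3})
bus.publish("reviews", {"id": 1002, "rating": 5})

# Two downstream systems read the same events independently of each other.
warehouse_batch = bus.poll("warehouse-sync", "reviews")
search_batch = bus.poll("search-indexer", "reviews")

bus.publish("reviews", {"id": 1003, "rating": 4})
warehouse_next = bus.poll("warehouse-sync", "reviews")
```

Because the producer writes once and every store reads at its own pace, adding another database means adding another consumer rather than another point-to-point sync job.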

Remember our Denormalized issue?

Denormalized: All Data

Star Schema (but prod is not optimized for it; we said we would fix that in a bit)

Joins don’t happen here.

We are talking about scale!

● Tackling two problems: time and space
  ○ Data size is greater than the size of your main memory.
  ○ The data cannot fit in memory all at once.
  ○ It takes too long to compute.
● Distributed computing is a popular solution.
  ○ Hadoop, Spark, Presto, Hive
  ○ Kafka is gaining popularity in processing too.
● Example: scrape menu items for each restaurant.
  ○ Go to each restaurant’s website.
  ○ Scrape it.
  ○ Parse the website.
  ○ Find the menu content and process it.

Yelp - update menu items

Yelp’s Database

1. Get URL

2. Get actual content from the internet

3. Process text and store results

Postgres

Yelp - update menu items - 1 million urls!

1. Custom way to get URLs

2. Each script fetches content separately

3. Each script processes the text and stores the results

Yelp’s Database
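
The three steps above can be sketched with a worker pool, since each URL is independent of the others. Everything specific here is an assumption for illustration: the `example.com` URLs, the `menu-item` CSS class, the menu items, and the in-memory `FAKE_WEB` dictionary that replaces real HTTP requests. A thread pool on one machine only shows the shape of the idea; at a million URLs the same map step would be spread over a cluster.

```python
from concurrent.futures import ThreadPoolExecutor
from html.parser import HTMLParser

# Hypothetical pages standing in for real restaurant websites (no network access).
FAKE_WEB = {
    "https://example.com/everest-momo":
        '<ul><li class="menu-item">Momo</li><li class="menu-item">Thukpa</li></ul>',
    "https://example.com/cafe-centro":
        '<ul><li class="menu-item">Espresso</li></ul>',
}

class MenuParser(HTMLParser):
    """Collects the text of every <li class="menu-item"> element."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._in_item = False

    def handle_starttag(self, tag, attrs):
        if tag == "li" and ("class", "menu-item") in attrs:
            self._in_item = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._in_item = False

    def handle_data(self, data):
        if self._in_item and data.strip():
            self.items.append(data.strip())

def fetch(url):
    """Step 2: get the content (a real version would use an HTTP client)."""
    return FAKE_WEB[url]

def scrape_menu(url):
    """Step 3: parse one page and return its menu items."""
    parser = MenuParser()
    parser.feed(fetch(url))
    return url, parser.items

def scrape_all(urls, workers=4):
    """Spread the independent URLs over a pool of workers."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(scrape_menu, urls))

menus = scrape_all(list(FAKE_WEB))
print(menus)
```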

ML Training at Scale

● Use distributed computing to scale your training.
● Compute weights in a fast and efficient manner.
  ○ H2O
  ○ Sparkling Water wrapper (H2O on Spark): https://github.com/h2oai/sparkling-water

What about Speed/Velocity?

● Data can be an unbounded stream of information.
● Example: processing reviews for each restaurant, e.g., running POS (part-of-speech) tagging.

[Figure: a stream of reviews (..., r50, r52, r53, ...) lands in the Reviews table (id, cust_id, rest_id, rating); Batch Processing feeds it to the POS Tagging Model]

● Data can be an unbounded stream of information.
● Streaming needs a robust system.
● Example: processing reviews

[Figure: the stream of reviews (..., r50, r52, r53, ...) goes through Spark Streaming (micro-batches) to the Reviews table (id, cust_id, rest_id, rating) and the POS Tagging Model]
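
The micro-batch idea can be sketched without Spark: cut the stream into small lists and run the existing batch logic on each one. Note the simplifications: Spark Streaming forms batches by time interval, while this sketch batches by count, and `tag_batch` is a placeholder that only counts words where the POS tagging model would run. The review texts are invented.

```python
from itertools import islice

def micro_batches(stream, batch_size):
    """Group a (possibly unbounded) iterator into lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

def tag_batch(batch):
    """Placeholder for the POS tagging model: counts the words in each review."""
    return [(review_id, len(text.split())) for review_id, text in batch]

reviews = [
    ("r50", "Loved the momos"),
    ("r52", "Great coffee"),
    ("r53", "Slow service today"),
]

results = [tag_batch(batch) for batch in micro_batches(reviews, batch_size=2)]
print(results)
```

Because `micro_batches` is a generator, it never needs the whole stream in memory, which is what makes the approach usable on unbounded input.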

Visualize the output data

● It’s like building a software application.
  ○ Consider the end users.
  ○ What is the most intuitive way to see this information?
● Your professor would have given even better examples.
● Do not reinvent the wheel.
  ○ Tableau (education edition)
  ○ Kibana (self-setup)
  ○ Mode (paid)
  ○ Looker (paid)
  ○ Plotly (open source, free)
  ○ Dash (abstraction around Plotly, free)
  ○ Matlab (not used much in industry)

If you are not able to show it in a good way, there was no need to process it!

Putting together a pipeline

[Figure, built up over several slides. Components: App, Backend, Transactional database, Event Store, Spark Streaming (micro-batches), POS Tagging Model]

How to automate the tasks?

Scheduling & Monitoring

● Scheduling tasks in a sequence
● Easy to specify dependencies
● Code-based configuration
● Easy to deploy and manage
● Every batch pipeline needs a scheduler to automate tasks.
● Handling failure
● Also allows backfill.
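
A minimal sketch of code-based scheduling with dependencies and failure handling, using Python's standard `graphlib` module (Python 3.9+). The task names, the retry policy, and `run_pipeline` are assumptions for illustration and do not represent the API of any particular scheduler.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on.
dag = {
    "extract_reviews": set(),
    "extract_restaurants": set(),
    "join_tables": {"extract_reviews", "extract_restaurants"},
    "load_warehouse": {"join_tables"},
}

def run_pipeline(dag, tasks, max_retries=1):
    """Run tasks in dependency order, retrying a failing task up to max_retries times."""
    completed = []
    for name in TopologicalSorter(dag).static_order():
        for attempt in range(max_retries + 1):
            try:
                tasks[name]()
                completed.append(name)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # out of retries: stop the pipeline
    return completed

# A task that fails once and then succeeds, to exercise the retry path.
calls = {"join_tables": 0}

def flaky_join():
    calls["join_tables"] += 1
    if calls["join_tables"] == 1:
        raise RuntimeError("transient failure")

tasks = {name: (lambda: None) for name in dag}
tasks["join_tables"] = flaky_join

order = run_pipeline(dag, tasks)
print(order)
```

Keeping the dependency graph in code means it can be reviewed, versioned, and deployed like the rest of the pipeline.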

Backfill

[Figure: a timeline of events in time with a gap marked "??"; backfill fills in the gap]
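
Backfill can be sketched as "find the scheduled runs that never completed and re-run them". The daily schedule, the dates, and the function names below are hypothetical and chosen only for illustration.

```python
from datetime import date, timedelta

def missing_runs(start, end, completed):
    """Return every daily run date in [start, end] that is not in completed."""
    days = (end - start).days + 1
    candidates = (start + timedelta(days=i) for i in range(days))
    return [run_date for run_date in candidates if run_date not in completed]

def backfill(start, end, completed, run_for_date):
    """Re-run the pipeline for each missing date, oldest first."""
    for run_date in missing_runs(start, end, completed):
        run_for_date(run_date)
        completed.add(run_date)

# Runs for the 18th and the 21st succeeded; the two days in between were missed.
completed = {date(2019, 11, 18), date(2019, 11, 21)}
reran = []
backfill(date(2019, 11, 18), date(2019, 11, 21), completed, reran.append)
print(reran)
```

This only works cleanly if each run is parameterized by its date, so that re-running an old date processes that date's data rather than today's.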

Think ahead, Think smart

● Get all data into one place (know about data warehousing).
● Understand the why behind any tool choice.
● Expect future requests from stakeholders.
● Learn by collaborating; know all the different ways data can be stored, processed, and visualized.
● Constantly learn; know the latest updates in a tool.
  ○ Start with the basics of why the tool was built.
● Learn these five: Kafka, Spark, Cassandra, Postgres (PostGIS), Redshift
● Managed services: Lambda, Redshift, DynamoDB, S3

Start using cloud resources

● Students get $300 in credits on both AWS and GCP. Start using them.
● Spin up compute resources.
● Try out labs for managed services.
● AWS for students
  ○ AWS Lambda
  ○ AWS Redshift
  ○ AWS DynamoDB

Insight

Insight Offerings - Which one to pick

Data Science Program

● PhD in quantitative fields.

● Have worked in analysing data.

● Good problem solving skills

Data Engineering Program

● Engineering background.

● Have built and maintained engineering systems.

● Java/Python

Health Data Science Program

● Postdoctoral researchers, medical doctors.

● Interested in genome sequences, clinical trials.

Artificial Intelligence Program

● Engineering background.

● Have worked on training and deploying ML or NN.

DevOps Engineering Program

● Systems admin and Linux background.

● Problem solver, critical thinker.

● Can understand containerized systems.

New Programs - More focused domains

● Designing security measures

● Building secure applications.

● Blockchain technology

● Smart contract management

● Decentralized architectures

Where are we?

[Map of Insight locations, each marked In Person or Remote: Seattle, Portland, San Francisco, Los Angeles, Austin, Chicago, New York, Boston, Toronto]

Apply to Insight

● 3 sessions a year
● Apply when you are ready for full-time
● Prepare a role-driven resume
● Read our blog posts
● Contact alumni
● Application process:
  ○ Resume + application form
  ○ Interview

Note: The Data Engineering program has a coding challenge before the interview.

Applications open for June 2020 Session!

apply.insightdatascience.com

Sign up for the notifications list