MongoDB Europe 2016 - Warehousing MongoDB Data Using Apache Beam and BigQuery

Sandeep Parikh, Head of Solutions Architecture, Americas East (@crcsmnky)

Posted on 07-Jan-2017

Page 1: MongoDB Europe 2016 - Warehousing MongoDB Data using Apache Beam and BigQuery

Warehousing MongoDB Data Using Apache Beam and BigQuery

Sandeep Parikh

Head of Solutions Architecture, Americas East

@crcsmnky

Page 2

About Me

Page 3

Agenda

MongoDB on Google Cloud Platform

What is Data Warehousing

Tools & Technologies

Example Use Case

Show, Don’t Tell

Page 4

MongoDB on Google Cloud Platform

Page 5

MongoDB on Google Cloud Platform

Page 6

Manually Deploying MongoDB

Page 7

Google Cloud Launcher

Page 8

MongoDB Cloud Manager

Page 9

MongoDB Cloud Manager

How do you automate this?

Page 10

Bootstrapping MongoDB Cloud Manager

Deployment Manager Template

Page 11

Cloud Deployment Manager

Provision and configure your deployment

Configuration as code

Declarative approach to configuration

Template-driven

Supports YAML, Jinja, and Python

Use schemas to constrain parameters

References control order and dependencies

Page 12

Bootstrapping Cloud Manager

Schema, Configuration & Template

Posted on GitHub: https://github.com/GoogleCloudPlatform/mongodb-cloud-manager

Three Compute Engine instances, each with 500 GB PD-SSD

MongoDB Cloud Manager automation agent pre-installed and configured

$ gcloud deployment-manager deployments create mongodb-cloud-manager \
    --config mongodb-cloud-manager.jinja \
    --properties mmsGroupId=MMSGROUPID,mmsApiKey=MMSAPIKEY

Page 13

What’s a Data Warehouse

Page 14

Data warehouses are central repositories of integrated data from one or more disparate sources.

Source: https://en.wikipedia.org/wiki/Data_warehouse

Page 15

Data Warehouse

[Diagram labels: Money, Data, Data, Data, Insights, Profit!]

Page 16

Tools and Technologies

Page 17

Where: BigQuery

Complex, Petabyte-scale data warehousing made simple

Scales automatically; No setup or admin

Foundation for analytics and machine learning

Page 18

[Screenshot: BigQuery UI, RUN QUERY]

Page 19

Page 20

How: Apache Beam (incubating)

[Diagram labels: MapReduce, BigTable, Dremel, Colossus, Flume, Megastore, Spanner, PubSub, Millwheel, Apache Beam, Google Cloud Dataflow]

Page 21

Understand What, Where, When, How

1. Classic Batch

2. Windowed Batch

3. Streaming

4. Streaming + Accumulation
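
The windowed modes listed above group elements by event time rather than processing the whole input as one unit. As a rough illustration in plain Java (this is not the Beam API), the sketch below assigns event timestamps to fixed windows and counts the events in each one. The class name, the five-minute window size, and the sample timestamps are assumptions chosen for illustration.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class FixedWindowSketch {
    // Hypothetical window size; a real pipeline would configure this.
    static final Duration WINDOW = Duration.ofMinutes(5);

    /** Returns the start of the fixed window that contains the event time. */
    public static Instant windowStart(Instant eventTime) {
        long size = WINDOW.toMillis();
        long millis = eventTime.toEpochMilli();
        // floorMod keeps the result correct for timestamps before the epoch.
        return Instant.ofEpochMilli(millis - Math.floorMod(millis, size));
    }

    /** Counts events per window, keyed by window start in time order. */
    public static Map<Instant, Long> countPerWindow(List<Instant> eventTimes) {
        Map<Instant, Long> counts = new TreeMap<>();
        for (Instant t : eventTimes) {
            counts.merge(windowStart(t), 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample timestamps.
        List<Instant> events = List.of(
            Instant.parse("2016-11-15T10:01:00Z"),
            Instant.parse("2016-11-15T10:04:59Z"),
            Instant.parse("2016-11-15T10:05:00Z"));
        System.out.println(countPerWindow(events));
    }
}
```

Because the window is derived from each element's own timestamp, the same grouping logic applies whether the events arrive as a finished file or as an unbounded stream.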

Page 22

Pipelines in Beam

Pipeline p = Pipeline.create();
p.begin()
    // Batch: read text files from Cloud Storage
    .apply(TextIO.Read.from("gs://…"))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.create())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(TextIO.Write.to("gs://…"));
p.run();

Batch to Streaming

Pipeline p = Pipeline.create();
p.begin()
    // Streaming: read from a Pub/Sub topic and apply five-minute fixed windows
    .apply(PubsubIO.Read.from("input_topic"))
    .apply(Window.<Integer>by(FixedWindows.of(5, MINUTES)))
    .apply(ParDo.of(new ExtractTags()))
    .apply(Count.create())
    .apply(ParDo.of(new ExpandPrefixes()))
    .apply(Top.largestPerKey(3))
    .apply(PubsubIO.Write.to("output_topic"));
p.run();
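
The slide does not show what ExtractTags and ExpandPrefixes do. As a hedged sketch, the plain-Java code below (not the Beam API) assumes an autocomplete-style job: pull hashtags out of text lines, count them, key each tag by every one of its prefixes, and keep the three most frequent tags per prefix. The class name, method names, regular expression, and sample input are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TopTagsSketch {
    // Assumed tag syntax: '#' followed by word characters.
    private static final Pattern TAG = Pattern.compile("#(\\w+)");

    /** Stand-in for ExtractTags: collect every hashtag in every line. */
    public static List<String> extractTags(List<String> lines) {
        List<String> tags = new ArrayList<>();
        for (String line : lines) {
            Matcher m = TAG.matcher(line);
            while (m.find()) {
                tags.add(m.group(1));
            }
        }
        return tags;
    }

    /** Stand-in for the Count step: occurrences of each tag. */
    public static Map<String, Long> count(List<String> tags) {
        Map<String, Long> counts = new HashMap<>();
        for (String tag : tags) {
            counts.merge(tag, 1L, Long::sum);
        }
        return counts;
    }

    /** Stand-in for ExpandPrefixes plus Top.largestPerKey(n). */
    public static Map<String, List<String>> topPerPrefix(Map<String, Long> counts, int n) {
        // Key every (tag, count) pair by each prefix of the tag.
        Map<String, List<Map.Entry<String, Long>>> byPrefix = new TreeMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            String tag = e.getKey();
            for (int i = 1; i <= tag.length(); i++) {
                byPrefix.computeIfAbsent(tag.substring(0, i), k -> new ArrayList<>()).add(e);
            }
        }
        Map<String, List<String>> top = new TreeMap<>();
        for (Map.Entry<String, List<Map.Entry<String, Long>>> p : byPrefix.entrySet()) {
            List<Map.Entry<String, Long>> candidates = p.getValue();
            // Highest count first; ties broken alphabetically for determinism.
            candidates.sort((a, b) -> {
                int byCount = Long.compare(b.getValue(), a.getValue());
                return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
            });
            List<String> best = new ArrayList<>();
            for (Map.Entry<String, Long> c : candidates.subList(0, Math.min(n, candidates.size()))) {
                best.add(c.getKey());
            }
            top.put(p.getKey(), best);
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up sample input.
        List<String> lines = List.of("#mongodb and #beam", "#mongodb #bigquery", "#beam #mongodb #big");
        System.out.println(topPerPrefix(count(extractTags(lines)), 3));
    }
}
```

In the pipeline form, each of these steps is a transform applied to a distributed collection, which is why the middle of the pipeline stays the same when only the source, windowing, and sink change between batch and streaming.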

Page 23

Apache Beam Vision

[Diagram labels: Beam Java, Beam Python, Other Languages; Beam Model: Pipeline Construction; Beam Model: Fn Runners; Execution on Apache Flink, Apache Spark, Cloud Dataflow]

Page 24

Running Apache Beam

Cloud Dataflow

Local Runner

Page 25

Cloud Dataflow Service

A great place for executing Beam pipelines, which provides:

● Fully managed, no-ops execution environment

● Integration with Google Cloud Platform

● Java support in GA; Python in Alpha

Page 26

Fully Managed: Worker Lifecycle Management

[Diagram labels: Deploy, Tear Down]

Page 27

Fully Managed: Dynamic Worker Scaling

Page 28

Fully Managed: Dynamic Work Rebalancing

100 mins. vs. 65 mins.

Page 29

Integrated: Monitoring UI

Page 30

Integrated: Distributed Logging

Page 31

Integrated: Part of Google Cloud Platform

[Diagram: stages labeled Capture, Store, Process (Batch, Stream), and Analyze. Components shown: Cloud Logs, Google App Engine, Google Analytics Premium, Cloud Pub/Sub, BigQuery Storage (tables), Cloud Bigtable (NoSQL), Cloud Storage (files), Cloud DataStore, Cloud Dataflow, Cloud Dataproc, BigQuery Analytics (SQL), Cloud Monitoring, Real time analytics and Alerts]

Page 32

Example Use Case

Page 33

Sensor Data

Page 34

Show, Don’t Tell

Page 35

Insert Demo Here