Big Data on Google Cloud
William Vambenepe, Lead Product Manager for Big Data, Google Cloud Platform
@vambenepe / [email protected]
Using Cloud Dataflow, BigQuery, and friends to process data “the Cloud way”
Agenda
1 Big Data at Google
2 Managing data through its lifecycle
3 Google Cloud Dataflow
4 Friends of Cloud Dataflow: BigQuery, Pub/Sub, Hadoop/Spark...
5 Optimizing your time
6 References and follow-up
Building on Google’s infrastructure
1.5 million devices activated every day (over a billion devices)
6 billion hours watched every month (10h uploaded every minute)
20 billion pages crawled every day
Software innovation
[Timeline, 2002 to 2013: GFS, MapReduce, Big Table, Dremel, Colossus, Flume, Pregel, Spanner, MillWheel]
[Cloud products: Cloud Dataflow, BigQuery]
Data lifecycle
[Diagram. Modes: Stream, Batch. Components: Cloud Pub/Sub, Cloud Logs, Google Analytics Premium, Google Cloud Storage, Google App Engine, Cloud Dataflow, BigQuery Storage (tables), Cloud Storage (files), Cloud Dataflow, BigQuery Analytics (SQL), Real time analytics & alerts]
Data usage organization maturity lifecycle
● Descriptive
● Exploratory + Descriptive
● Predictive + Exploratory + Descriptive
● Prescriptive + Predictive + Exploratory + Descriptive
● no administration
● most powerful tools in the easiest way
● constant experimentation with low risks & cost
● easy collaboration across teams and organizations
● low costs without requiring usage commitments
● best performance & virtually unlimited scale
● always on
Supporting organizations with operational ease of use
What is Cloud Dataflow?
Cloud Dataflow is a collection of SDKs for building parallelized data processing pipelines
↳ Download from GitHub: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
Cloud Dataflow is a managed service for executing parallelized data processing pipelines
↳ Use on Google Cloud: https://cloud.google.com/dataflow/
Cloud Dataflow SDK - Logical Model
Pipeline{
Who => Inputs
What => Transforms
Where => Windows
When => Watermarks + Triggers
To => Outputs
}
Unified programming model for both batch & stream processing.
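The logical model above can be made concrete with a toy example. The sketch below is plain Python, not the Dataflow SDK: the names `window_start` and `run_pipeline` and the 60-second window size are assumptions made for illustration. It covers inputs, a transform, fixed event-time windows, and outputs; watermarks and triggers are not modeled.

```python
from collections import defaultdict

def window_start(timestamp, size):
    """Return the start of the fixed window that contains `timestamp`."""
    return timestamp - (timestamp % size)

def run_pipeline(events, transform, window_size):
    """Toy model: inputs -> transform -> fixed windows -> per-window sums.

    `events` is an iterable of (event_time_seconds, value) pairs.
    """
    sums = defaultdict(int)
    for event_time, value in events:                            # inputs
        result = transform(value)                               # transform
        sums[window_start(event_time, window_size)] += result   # window
    return dict(sorted(sums.items()))                           # outputs

# Hypothetical input: (event time in seconds, value)
events = [(5, 1), (30, 2), (61, 3), (119, 4), (120, 5)]
totals = run_pipeline(events, lambda v: v * 10, window_size=60)
print(totals)  # {0: 30, 60: 70, 120: 50}
```

The same function handles a bounded list (batch) or any iterator of events, which is a loose analogy for the unified batch and stream model the slide describes.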
• A Directed Acyclic Graph of data processing transformations
• Can be submitted to the Dataflow Service for optimization and execution, or executed on an alternate runner, e.g. Spark
• May include multiple inputs and multiple outputs
• May encompass many logical MapReduce operations
• PCollections flow through the pipeline
Cloud Dataflow Pipeline
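To illustrate the idea of a pipeline as a directed acyclic graph with several inputs and outputs, here is a minimal sketch in plain Python. It does not use the Dataflow SDK; `run_dag` and all step names are hypothetical, and lists stand in for PCollections. It relies on `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

def run_dag(steps, deps):
    """Run each step once all of its upstream steps have produced output.

    `steps` maps a step name to a function taking the list of upstream outputs.
    `deps` maps a step name to the names of the steps it depends on.
    """
    results = {}
    # static_order() yields every step after all of its dependencies.
    for name in TopologicalSorter(deps).static_order():
        upstream = [results[d] for d in deps.get(name, [])]
        results[name] = steps[name](upstream)
    return results

steps = {
    "read_a": lambda _: [1, 2, 3],                        # first input
    "read_b": lambda _: [4, 5],                           # second input
    "merge": lambda ups: [x for up in ups for x in up],   # flatten both inputs
    "square": lambda ups: [x * x for x in ups[0]],        # element-wise transform
    "total": lambda ups: sum(ups[0]),                     # first output
    "count": lambda ups: len(ups[0]),                     # second output
}
deps = {
    "read_a": [],
    "read_b": [],
    "merge": ["read_a", "read_b"],
    "square": ["merge"],
    "total": ["square"],
    "count": ["merge"],
}
results = run_dag(steps, deps)
print(results["total"], results["count"])  # 55 5
```

A real service would optimize the graph before running it; this sketch only executes the steps in dependency order.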
Life of a Dataflow Pipeline
[Diagram: User Code & SDK; Deploy & Schedule; Google Cloud Platform Managed Service with Job Manager, Work Manager, and Graph optimization; Progress & Logs; Monitoring UI]
Continuous worker scaling for long-lived streaming pipelines
[Chart: load over time: 800 RPS, 1,200 RPS, 5,000 RPS, 50 RPS]
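As a rough illustration of scaling workers with load, the sketch below sizes a worker pool for the four load levels shown on the chart. The per-worker throughput of 250 RPS and the minimum and maximum pool sizes are hypothetical values, and the function `workers_needed` is not part of any Dataflow API. The slides do not describe the service's actual scaling policy, so this is only a simple ceiling-division model.

```python
import math

def workers_needed(rps, per_worker_rps, min_workers=1, max_workers=100):
    """Return the worker count that covers `rps`, clamped to the allowed range."""
    needed = math.ceil(rps / per_worker_rps)
    return max(min_workers, min(max_workers, needed))

# Load levels taken from the chart; 250 RPS per worker is an assumption.
load = [800, 1200, 5000, 50]
plan = [workers_needed(rps, per_worker_rps=250) for rps in load]
print(plan)  # [4, 5, 20, 1]
```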
• Run the same code in multiple modes using different runners
• Direct Runner
  • For local, in-memory execution
  • Great for development and unit tests
• Cloud Dataflow Service Runner
  • Runs on the fully managed Dataflow Service
  • Your code runs distributed across GCE instances
• Community sourced
  • Spark runner @ github.com/cloudera/spark-dataflow
  • Flink runner coming soon from dataArtisans
Portability: Cloud Dataflow Runners
The most productive and portable Data pipeline SDK.
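The runner idea, one pipeline definition executed by interchangeable back ends, can be sketched in a few lines of plain Python. Both classes below are hypothetical: `DirectRunner` only borrows the name from the slide, and `PoolRunner` is a thread-pool stand-in for a distributed runner. Neither reflects the Dataflow SDK's real interfaces.

```python
from concurrent.futures import ThreadPoolExecutor

class DirectRunner:
    """Runs every transform in-process, one element at a time."""

    def run(self, data, transforms):
        for fn in transforms:
            data = [fn(x) for x in data]
        return data

class PoolRunner:
    """Hypothetical stand-in for a distributed runner: maps over a thread pool."""

    def __init__(self, workers=4):
        self.workers = workers

    def run(self, data, transforms):
        with ThreadPoolExecutor(max_workers=self.workers) as pool:
            for fn in transforms:
                # pool.map preserves input order, so results match DirectRunner.
                data = list(pool.map(fn, data))
        return data

# The pipeline definition is independent of the runner that executes it.
transforms = [str.strip, str.upper, len]
lines = ["  big ", "data", " cloud  "]

local = DirectRunner().run(lines, transforms)
pooled = PoolRunner().run(lines, transforms)
print(local, pooled)  # [3, 4, 5] [3, 4, 5]
```

Keeping the transform list separate from the runner is what lets the in-process runner serve for unit tests while another runner handles production scale.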
BigQuery
● Ingest data via streaming (100K rows/second/table) or file loader
● Process interactive SQL queries on TB or PB of data
● Zero administration; just upload data and send queries
● Pay for storage and query separately, based on actual usage
● Non-technical analysts can drive queries on massive datasets using BI tools (e.g. Tableau)
● Highly Available: Data replication in multiple geographies.
● Secure and easy collaboration: access to data is controlled using customer-owned ACLs
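The streaming figure quoted above (100K rows/second/table) gives a quick way to estimate a lower bound on ingestion time. The sketch below is arithmetic only and makes no BigQuery API calls; the helper names `min_streaming_seconds` and `batches` are hypothetical, and spreading rows evenly across tables is an assumption.

```python
import math

STREAMING_LIMIT_ROWS_PER_SEC = 100_000  # per table, figure quoted on the slide

def min_streaming_seconds(total_rows, tables=1):
    """Lower bound on streaming time when rows are spread evenly over `tables`."""
    per_table = math.ceil(total_rows / tables)
    return math.ceil(per_table / STREAMING_LIMIT_ROWS_PER_SEC)

def batches(rows, batch_size):
    """Split `rows` into lists of at most `batch_size` rows each."""
    return [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]

print(min_streaming_seconds(1_000_000))            # 10
print(min_streaming_seconds(1_000_000, tables=4))  # 3
print(batches(list(range(5)), 2))                  # [[0, 1], [2, 3], [4]]
```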
Hadoop and Spark
[Diagram: bdutil orchestration; Master Node with Name Node (optional); Work Nodes with HDFS (optional); storage options: Local SSD, PD SSD, PD standard; Connectors: GCS Connector, BigQuery Connector]
Getting Started
Cloud Dataflow
● Service: https://cloud.google.com/dataflow
● Questions: https://stackoverflow.com/questions/tagged/google-cloud-dataflow
● SDK: https://github.com/GoogleCloudPlatform/DataflowJavaSDK
BigQuery
● https://cloud.google.com/bigquery/
Cloud Pub/Sub
● https://cloud.google.com/pubsub/
Hadoop and Spark
● https://cloud.google.com/hadoop/
Contact me
● Twitter: @vambenepe
● email: [email protected]