google bigquery 101 & what’s new
TRANSCRIPT
Google BigQuery 101 & What’s New
Vadim Solovey - CTO, DoIT International
Google Cloud Developer Expert | Authorized [email protected]
About me
Vadim Solovey - CTO, DoiT International
Google Cloud Developer Expert | AWS Solutions [email protected]
Agenda
1. Google BigQuery 101
2. Partitioned Tables
3. Standard SQL & New DML Statements
4. New Formats
5. Cost Optimization
6. Q & A
BigQuery 101
Google’s Highly Distributed Columnar Database optimized for Analytics
● Fully managed NoOps service
● Multi-petabyte scale & zero sizing required
● Ingestion + Analytics + Storage + API
● No indexes, only full table scans (!) - see the example after this list
● Pre-integrated with other Google Cloud services:
○ Dataproc (Hadoop/Spark)
○ Dataflow (ETL for streaming data)
○ Google Cloud Storage
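Since there are no indexes, every query scans the referenced columns in full. As a quick illustration, a query against the natality sample table used later in this deck; only the column named in the query is scanned (and billed):

#standardSQL
-- Scans just the `state` column, not the entire table
SELECT state, COUNT(*) AS births
FROM `bigquery-public-data.samples.natality`
GROUP BY state;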
BigQuery 101
Continued...
● Supports nested and repeated fields/columns (see the sketch after this list)
● Google’s SQL dialect
● Query results are cached for up to 24 hours (cached reads are free)
● Charged for storage ($10-$20 per TB/month) and for data scanned ($5/TB)
○ No idle costs
○ Highly amenable to cost optimization
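A minimal sketch of querying a repeated field in standard SQL; the events table and its repeated items RECORD are hypothetical, for illustration only:

#standardSQL
-- Flatten the repeated `items` field: one output row per (event, item) pair
SELECT e.event_id, i.sku
FROM mydataset.events AS e, UNNEST(e.items) AS i;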
BigQuery in 60 Seconds
Based on Dremel
[Diagram: a serving tree with a root Mixer 0 fanning out to Mixer 1 nodes, which fan out to Leaf nodes reading from the Google File System (GFS)]
● Long-lived shared tree
● Mixer = Master & Reducer
● Leaf = Mapper
● Partial reduction
● Diskless data flow
Columnar Storage
● Execution independent
● Reduces disk time
Demo
What’s New
New features:
● Table Partitions
● Insert/Update/Delete DML
● Standard ANSI SQL 2011
● Identity and Access Management
● Stackdriver for Monitoring
● New data formats for import/export
Table Partitions
A new way to shard data, minimizing the amount of data scanned by a query:
● Integrated with the Streaming API for easy partition creation and update
● _PARTITIONTIME pseudo column
● Current release supports partitioning by DAY
Creating a partitioned table (using the CLI):
● bq mk --time_partitioning_type=DAY mydataset.table1
● bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2
Accessing partitioned data:
● Query all partitions: SELECT * FROM mydataset.table
● Query a specific partition: SELECT * FROM mydataset.table$20161109
● Query a range: SELECT * FROM mydataset.table WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
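Individual partitions can also be written to directly via the same $YYYYMMDD decorator. A sketch with a placeholder bucket and schema (quote the '$' so the shell doesn't expand it):

# Load a CSV file into the 2016-11-09 partition only
bq load 'mydataset.table1$20161109' gs://my-bucket/events.csv 'ts:TIMESTAMP,value:FLOAT'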
Insert, Update & Delete DML
BigQuery is not append-only anymore ;-)
Data Manipulation Language (DML) supporting these statements:
● INSERT
● UPDATE
● DELETE
Every statement is an implicit transaction; there are no multi-statement transactions yet.
Quotas:
● Maximum UPDATE/DELETE statements per day per table: 48
● Maximum UPDATE/DELETE statements per day per project: 500
● Maximum INSERT statements per day per table: 1,000
● Maximum INSERT statements per day per project: 10,000
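For illustration, one statement of each kind against a hypothetical users table (DML requires the standard SQL dialect, and UPDATE/DELETE require a WHERE clause):

#standardSQL
INSERT mydataset.users (id, name) VALUES (1, 'Alice');

#standardSQL
UPDATE mydataset.users SET name = 'Bob' WHERE id = 1;

#standardSQL
DELETE FROM mydataset.users WHERE id = 1;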
Standard SQL
Full ANSI SQL 2011
● With extensions to support nested and repeated fields
● ‘Legacy SQL’ is still supported
Set the desired dialect using a prefix, i.e.:
● #legacySQL or #standardSQL

#standardSQL
SELECT weight_pounds, state, year, gestation_weeks
FROM `bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;
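The dialects also differ in how tables are referenced; the same query in each, for comparison (square brackets and ':' in legacy vs backticks and '.' in standard):

#legacySQL
SELECT weight_pounds FROM [bigquery-public-data:samples.natality] LIMIT 10;

#standardSQL
SELECT weight_pounds FROM `bigquery-public-data.samples.natality` LIMIT 10;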
Import/Export Formats
Data can be imported into (and exported from) BigQuery in the following formats:
● CSV
● JSON (newline-delimited)
● Avro
● Parquet
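For example, hedged bq CLI sketches with placeholder bucket and table names (the flag values spell the formats the way the CLI expects them):

# Import: Avro files embed their schema, so no schema argument is needed
bq load --source_format=AVRO mydataset.events gs://my-bucket/events-*.avro

# Export a table as newline-delimited JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.events gs://my-bucket/export/events-*.json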
Cost Optimization Tips
Some query optimization strategies:
● Use CONTAINS instead of REGEXP_MATCH(), where possible
● Sometimes a sample of the data is enough; use the HASH() function to sample it (see the sketch after this list)
● Use JSON_EXTRACT() if you have raw, unstructured JSON in your data
● Avoid nondeterministic queries (e.g. NOW()) to improve cache hit rates
● Don’t query the table which you stream data into (its cache is immediately invalidated)
● Keep query results < 128 MB, otherwise they won’t be cached either
● Use the __TABLES__ & __DATASET__ metadata tables for housekeeping tasks (see the sketch after this list)
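Two quick sketches of the sampling and housekeeping tips, in legacy SQL (where HASH() and __TABLES__ are available); mydataset is a placeholder:

#legacySQL
-- Deterministic ~10% sample keyed on the `word` column
SELECT word, word_count FROM [publicdata:samples.shakespeare]
WHERE ABS(HASH(word)) % 10 = 0;

#legacySQL
-- Per-table row counts and sizes for a dataset
SELECT table_id, row_count, ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM [mydataset.__TABLES__];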
Are you paying too much?
BigQuery is a columnar datastore, and maximum performance is achieved on denormalized data sets:
● Pre-filter with a destination table when running many similar queries (in the WHERE clause)
● Use static tables to optimize BigQuery’s cache
○ If streaming/uploading frequently, create daily/hourly ‘snapshots’ and query them instead of the primary table
● Always prefer storage over compute!
● Set TableExpiration on datasets/partitions for automatic data lifecycle management
● Fetch only required columns in your SELECT clause
● Use dryRun & EXPLAIN to find the most cost-efficient query (see the dry-run sketch below)
● Set Cost Controls to cap your BigQuery spending
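A hedged dry-run example with the bq CLI; it reports the bytes a query would process without actually running it (and without charge):

# Prints estimated bytes processed; note that LIMIT does not reduce bytes scanned
bq query --dry_run --use_legacy_sql=false \
  'SELECT state FROM `bigquery-public-data.samples.natality`'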