google bigquery 101 & what’s new
TRANSCRIPT
Google BigQuery 101 & What’s New
Vadim Solovey - CTO, DoIT International
Google Cloud Developer Expert | Authorized [email protected]
About me
Vadim Solovey - CTO, DoiT International
Google Cloud Developer Expert | AWS Solutions [email protected]
Agenda
1. Google BigQuery 101
2. Partitioned Tables
3. Standard SQL & New DML Statements
4. New Formats
5. Cost Optimization
6. Q & A
BigQuery 101
Google’s Highly Distributed Columnar Database optimized for Analytics
● Fully managed NoOps service
● Multi-petabyte scale & zero sizing required
● Ingestion + Analytics + Storage + API
● No indexes, only full table scans (!) - see the example after this list
● Pre-integrated with other Google Cloud services:
○ Dataproc (Hadoop/Spark)
○ Dataflow (ETL for streaming data)
○ Google Cloud Storage
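Since there are no indexes, every query scans the referenced columns in full. As a quick illustration, a query against the natality sample table used later in this deck; only the column named in the query is scanned (and billed):

#standardSQL
-- Scans just the `state` column, not the entire table
SELECT state, COUNT(*) AS births
FROM `bigquery-public-data.samples.natality`
GROUP BY state;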
BigQuery 101
Continued...
● Supports nested and repeated fields/columns (see the sketch after this list)
● Google’s SQL dialect
● Query results are cached for up to 24 hours (cached reads are free)
● Charged for storage ($10-$20 per TB/month) and for data scanned ($5/TB)
○ No idle costs
○ Highly amenable to cost optimization
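A minimal sketch of querying a repeated field in standard SQL; the events table and its repeated items RECORD are hypothetical, for illustration only:

#standardSQL
-- Flatten the repeated `items` field: one output row per (event, item) pair
SELECT e.event_id, i.sku
FROM mydataset.events AS e, UNNEST(e.items) AS i;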
BigQuery in 60 Seconds
Based on Dremel
[Diagram: a serving tree with a root Mixer 0 fanning out to Mixer 1 nodes, which fan out to Leaf nodes reading from the Google File System (GFS)]
● Long-lived shared tree
● Mixer = Master & Reducer
● Leaf = Mapper
● Partial reduction
● Diskless data flow
Columnar Storage
● Execution independent
● Reduces disk time
Demo
What’s New
New features:
● Table Partitions
● Insert/Update/Delete DML
● Standard ANSI SQL 2011
● Identity and Access Management
● Stackdriver for Monitoring
● New data formats for import/export
Table Partitions
A new way to shard data, minimizing the amount of data scanned by a query:
● Integrated with the Streaming API for easy partition creation and update
● _PARTITIONTIME pseudo column
● Current release supports partitioning by DAY
Creating a partitioned table (using the CLI):
● bq mk --time_partitioning_type=DAY mydataset.table1
● bq mk --time_partitioning_type=DAY --time_partitioning_expiration=259200 mydataset.table2
Accessing partitioned data:
● Query all partitions: SELECT * FROM mydataset.table
● Query a specific partition: SELECT * FROM mydataset.table$20161109
● Query a range: SELECT * FROM mydataset.table WHERE _PARTITIONTIME BETWEEN TIMESTAMP('2016-01-01') AND TIMESTAMP('2016-01-02')
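Individual partitions can also be written to directly via the same $YYYYMMDD decorator. A sketch with a placeholder bucket and schema (quote the '$' so the shell doesn't expand it):

# Load a CSV file into the 2016-11-09 partition only
bq load 'mydataset.table1$20161109' gs://my-bucket/events.csv 'ts:TIMESTAMP,value:FLOAT'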
Insert, Update & Delete DML
BigQuery is not append-only anymore ;-)
Data Manipulation Language (DML) supporting these statements:
● INSERT
● UPDATE
● DELETE
Every statement is an implicit transaction; there are no multi-statement transactions yet.
Quotas:
● Maximum UPDATE/DELETE statements per day per table: 48
● Maximum UPDATE/DELETE statements per day per project: 500
● Maximum INSERT statements per day per table: 1,000
● Maximum INSERT statements per day per project: 10,000
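For illustration, one statement of each kind against a hypothetical users table (DML requires the standard SQL dialect, and UPDATE/DELETE require a WHERE clause):

#standardSQL
INSERT mydataset.users (id, name) VALUES (1, 'Alice');

#standardSQL
UPDATE mydataset.users SET name = 'Bob' WHERE id = 1;

#standardSQL
DELETE FROM mydataset.users WHERE id = 1;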
Standard SQL
Full ANSI SQL 2011
● With extensions to support nested and repeated fields
● ‘Legacy SQL’ is still supported
Set the desired dialect using a prefix, i.e.:
● #legacySQL or #standardSQL

#standardSQL
SELECT weight_pounds, state, year, gestation_weeks
FROM `bigquery-public-data.samples.natality`
ORDER BY weight_pounds DESC
LIMIT 10;
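The dialects also differ in how tables are referenced; the same query in each, for comparison (square brackets and ':' in legacy vs backticks and '.' in standard):

#legacySQL
SELECT weight_pounds FROM [bigquery-public-data:samples.natality] LIMIT 10;

#standardSQL
SELECT weight_pounds FROM `bigquery-public-data.samples.natality` LIMIT 10;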
Import/Export Formats
Data can be imported into (and exported from) BigQuery in the following formats:
● CSV
● JSON (newline-delimited)
● Avro
● Parquet
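For example, hedged bq CLI sketches with placeholder bucket and table names (the flag values spell the formats the way the CLI expects them):

# Import: Avro files embed their schema, so no schema argument is needed
bq load --source_format=AVRO mydataset.events gs://my-bucket/events-*.avro

# Export a table as newline-delimited JSON
bq extract --destination_format=NEWLINE_DELIMITED_JSON mydataset.events gs://my-bucket/export/events-*.json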
Cost Optimization Tips
Some query optimization strategies:
● Use CONTAINS instead of REGEXP_MATCH(), where possible
● Sometimes a sample of the data is enough; use the HASH() function to sample it (see the sketch after this list)
● Use JSON_EXTRACT() if you have raw, unstructured JSON in your data
● Avoid nondeterministic queries (e.g. NOW()) to improve cache hit rates
● Don’t query the table which you stream data into (its cache is immediately invalidated)
● Keep query results < 128 MB, otherwise they won’t be cached either
● Use the __TABLES__ & __DATASET__ metadata tables for housekeeping tasks (see the sketch after this list)
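Two quick sketches of the sampling and housekeeping tips, in legacy SQL (where HASH() and __TABLES__ are available); mydataset is a placeholder:

#legacySQL
-- Deterministic ~10% sample keyed on the `word` column
SELECT word, word_count FROM [publicdata:samples.shakespeare]
WHERE ABS(HASH(word)) % 10 = 0;

#legacySQL
-- Per-table row counts and sizes for a dataset
SELECT table_id, row_count, ROUND(size_bytes / POW(1024, 3), 2) AS size_gb
FROM [mydataset.__TABLES__];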
Are you paying too much?
BigQuery is a columnar datastore, and maximum performance is achieved on denormalized data sets:
● Pre-filter with a destination table when running many similar queries (in the WHERE clause)
● Use static tables to optimize BigQuery’s cache
○ If streaming/uploading frequently, create daily/hourly ‘snapshots’ and query them instead of the primary table
● Always prefer storage over compute!
● Set TableExpiration on datasets/partitions for automatic data lifecycle management
● Fetch only required columns in your SELECT clause
● Use dryRun & EXPLAIN to find the most cost-efficient query (see the dry-run sketch below)
● Set Cost Controls to cap your BigQuery spending
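A hedged dry-run example with the bq CLI; it reports the bytes a query would process without actually running it (and without charge):

# Prints estimated bytes processed; note that LIMIT does not reduce bytes scanned
bq query --dry_run --use_legacy_sql=false \
  'SELECT state FROM `bigquery-public-data.samples.natality`'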