(almost) serverless analytics system with bigquery & appengine
Gabriel PREDA @eRadical
(Almost) Serverless Analytics System with BigQuery & AppEngine
Agenda
Going Serverless with AppEngine & Tasks, Pub/Sub, DataStore
BigQuery
- Load: Batch, Streaming Inserts
- Query: UDF
- Export
...some BigQueries...
Aeons
Some years ago...
~500,000 - 2,000,000 events / day (on average)
Some time ago...
~2,000,000 - 22,000,000 events / day
Dec 2014: 57,430,000 events / day
Recomputing 1 day of data » 12 hours
NOW()
22,000,000 - 70,000,000 events / day
AVG » 40,000,000 events / day
Processing ~30GB-70GB / day
Recompute 1 day » 10-20 minutes
serverless?
Desired for: https://www.innertrends.com
other... (almost) serverless products
Cloud Functions (alpha - Node.js)
Cloud DataFlow (Java, Python - beta)
BigQuery - data types
● STRING - UTF-8 (2 bytes + encoded string size)
● BYTES - base64 encoded (except in Avro)
● INTEGER - 64-bit signed (8 bytes)
● FLOAT (8 bytes)
● BOOLEAN - true/false, 1/0 only in CSV (1 byte)
● TIMESTAMP - ex: "2014-08-19 12:41:35.220 UTC" (8 bytes)
● DATE, TIME, DATETIME - limited support in Legacy SQL
● RECORD - a collection of fields (size of fields)
https://cloud.google.com/bigquery/data-types
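To make the types concrete, here is a sketch of a bq-style JSON schema file (all field names are hypothetical) mixing scalar types with a nested, repeated RECORD:

[
  {"name": "user_id",    "type": "INTEGER", "mode": "REQUIRED"},
  {"name": "event_time", "type": "TIMESTAMP"},
  {"name": "payload",    "type": "RECORD",  "mode": "REPEATED", "fields": [
    {"name": "key",   "type": "STRING"},
    {"name": "value", "type": "STRING"}
  ]}
]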
BigQuery -> loadData()
Formats: CSV, JSON (newline delimited), Avro, Parquet (experimental)
Tools: Web UI, bq, API
Sources:
- local files
- Cloud Storage [demo]
- Cloud Datastore (backup files)
- POST requests
- SQL DML*
- Google Sheets - Federated Data Sources
- Streaming Inserts (sketched below)
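Streaming inserts go through the tabledata.insertAll endpoint; a minimal sketch of such a POST request (project, dataset, table and row contents are hypothetical):

POST https://www.googleapis.com/bigquery/v2/projects/my-project/datasets/my_dataset/tables/events/insertAll
{
  "rows": [
    { "insertId": "evt-0001",
      "json": { "user_id": 42, "event_time": "2017-03-01 10:00:00 UTC" } }
  ]
}

The optional insertId lets BigQuery deduplicate retried rows on a best-effort basis.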
BigQuery -> loadData()
bq load ...
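For instance (dataset, bucket and schema file are hypothetical), a batch load of newline-delimited JSON from Cloud Storage could look like:

bq load \
  --source_format=NEWLINE_DELIMITED_JSON \
  my_dataset.events \
  gs://my-bucket/events-2017-03-01.json \
  ./events_schema.json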
BigQuery -> loadData()
Got some rows?
BigQuery -> SELECT … FROM surprise…
query:
  SELECT { * | field_path.* | expression } [ [ AS ] alias ] [ , ... ]
    [ FROM from_body
        [ WHERE bool_expression ]
        [ OMIT RECORD IF bool_expression ]
        [ GROUP [ EACH ] BY [ ROLLUP ] { field_name_or_alias } [ , ... ] ]
        [ HAVING bool_expression ]
        [ ORDER BY field_name_or_alias [ { DESC | ASC } ] [ , ... ] ]
        [ LIMIT n ]
    ];

from_body:
  from_item [, ...] |  # Warning: Comma means UNION ALL here
  from_item [ join_type ] JOIN [ EACH ] from_item [ ON join_predicate ] |
  (FLATTEN({ table_name | (query) }, field_name_or_alias)) |
  table_wildcard_function

from_item:
  { table_name | (query) } [ [ AS ] alias ]

join_type:
  { INNER | [ FULL ] [ OUTER ] | RIGHT [ OUTER ] | LEFT [ OUTER ] | CROSS }
BigQuery -> SELECT … FROM surprise…
Date-Partitioned Tables [demo]
Table Decorators - See the past w/ @
Table Wildcard Functions - TABLE_DATE_RANGE() & TABLE_QUERY()
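A rough sketch of these three features in legacy SQL (dataset and table names are hypothetical):

-- Date-partitioned table: address one partition with the $ decorator
SELECT COUNT(*) FROM [my_dataset.events$20170301]

-- Table decorator: snapshot of the table as it was 1 hour ago (offset in ms)
SELECT COUNT(*) FROM [my_dataset.events@-3600000]

-- Wildcard: union of daily tables events_YYYYMMDD over a date range
SELECT COUNT(*)
FROM TABLE_DATE_RANGE([my_dataset.events_],
                      TIMESTAMP('2017-01-01'),
                      TIMESTAMP('2017-01-31'))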
Interesting functions
- DateTime » UTC_USEC_TO_DAY/HOUR/MONTH/WEEK/YEAR()
  » Shifts a UNIX timestamp in microseconds to the beginning of the period it occurs in.
- JSON_EXTRACT[_SCALAR]()
- URL functions » HOST(), DOMAIN(), TLD()
- REGEXP_MATCH(), REGEXP_EXTRACT()
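Putting a few of them together (table and column names are hypothetical; event_usec is a UNIX timestamp in microseconds):

SELECT
  UTC_USEC_TO_DAY(event_usec) AS day_usec,              -- start of the day the event falls in
  DOMAIN(page_url) AS domain,                           -- domain part of a URL
  JSON_EXTRACT_SCALAR(payload, '$.plan') AS plan,       -- scalar value from a JSON string
  REGEXP_EXTRACT(user_agent, r'^(\w+)/') AS ua_product  -- first token of the user agent
FROM [my_dataset.events]
LIMIT 10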
bigquery.defineFunction(
  'expandAssetLibrary',                        // Name of the function exported to SQL
  ['user_id', 'video_id', 'stage_settings'],   // Names of input columns
  [                                            // Output schema
    {'name': 'user_id', 'type': 'integer'},
    {'name': 'video_id', 'type': 'string'},
    {'name': 'asset', 'type': 'string'}
  ],
  expandAssetLibrary                           // Reference to the JavaScript UDF below
);

function expandAssetLibrary(row, emit) {
  …………………………
  emit({
    user_id: row.user_id,
    video_id: row.video_id,
    asset: ss.url.replace('http://', '')
  });
}
BigQuery -> User Defined Functions
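In legacy SQL the registered UDF is then called like a table function; a minimal sketch, assuming a hypothetical source table with the three input columns:

SELECT user_id, video_id, asset
FROM expandAssetLibrary(
  SELECT user_id, video_id, stage_settings
  FROM [my_dataset.video_events]
)
LIMIT 10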
BigQuery -> DML
Standard SQL only
Maximum UPDATE/DELETE statements per day per table: 48
Maximum UPDATE/DELETE statements per day per project: 500
Maximum INSERT statements per day per table: 1,000
Maximum INSERT statements per day per project: 10,000
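Given those quotas, DML suits occasional corrections rather than row-by-row writes; an illustrative UPDATE (project, dataset, table and columns are hypothetical):

#standardSQL
UPDATE `my_project.my_dataset.events`
SET processed = TRUE
WHERE DATE(event_time) = DATE '2017-03-01'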
BigQuery -> export()
To: Google Cloud Storage
Format: CSV, JSON [.gz], Avro
…a single export file is capped at ~1 GB; bigger tables are sharded across multiple files via a wildcard URI
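A sketch of such an export with the bq CLI (bucket and table names are hypothetical); the * wildcard in the destination URI is what lets BigQuery shard the output:

bq extract \
  --destination_format=NEWLINE_DELIMITED_JSON \
  --compression=GZIP \
  my_dataset.events \
  'gs://my-bucket/exports/events-*.json.gz'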
BigQuery -> some (Big)Queries
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE father_age < 18
GROUP BY year
ORDER BY year
SELECT year, count(1)
FROM [bigquery-public-data:samples.natality]
WHERE mother_age < 18
GROUP BY year
ORDER BY year
SELECT table_id, row_count, CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:ghcn_m.__TABLES__]
ORDER BY gb DESC
BigQuery -> some (Big)Queries
SELECT REGEXP_EXTRACT(path, r'.*\.(.*)$') AS file_extension,
COUNT(1) AS k
FROM [bigquery-public-data:github_repos.files]
GROUP BY file_extension
ORDER BY k DESC
LIMIT 20
SELECT table_id, row_count,
CEIL(size_bytes/POW(1024, 3)) AS gb
FROM [bigquery-public-data:github_repos.__TABLES__]
ORDER BY gb DESC