aws athena vs. google bigquery for interactive sql queries

Put your subtitle here. Feel free to pick from the handful of pretty Google colors available to you.

Make the subtitle something clever. People will think it’s neat.

Welcome!DoIT InternationalPracticing multi-cloud since 2010.

Agenda

1

2

3

4

5

AWS Athena

Google BigQuery

Test Drive

Summary

Q & A

2

DoIT International confidential │ Do not distribute

About me..

Vadim Solovey - CTO // DoiT International



AWS Athena vs Google BigQuery for interactive SQL queries on large datasets (#20/16)

Vadim Solovey - CTO // DoIT InternationalGoogle Cloud Developer Expert | AWS Certified Solutions Architect



Athena (/əˈθiːnə/; Greek:

- the goddess of wisdom, craft,

and war



OR



Will Athena slay BigQuery?Vadim Solovey - CTO // DoIT InternationalGoogle Cloud Developer Expert | AWS Certified Solutions Architect

Section Slide Template Option 2



your mileage may will vary

Warning:


AWS Athena

• Serverless Analytical Columnar Database based on Facebook’s Presto

• Data:• External Tables (*SV, JSON, ORC, PARQUET files in S3 bucket)

• Ingestion:

• Just store files on S3• Convert to columnar/compressed format using EMR

• ANSI SQL 2011

• Priced at $5/TB of scanned data & standard S3 storage/ops costs

• Cost Optimization -converting data into columnar format, partitioning, limit queried columns.


Google BigQuery

• Serverless Analytical Columnar Database based on Google Dremel

• Data:• Native Tables• External Tables (*SV, JSON, AVRO files stored in Google Cloud Storage bucket)

• Ingestion:

• File Imports• Streaming API (up to 100K records/sec per table) • Federated Tables (files in bucket, Bigtable table or Google Spreadsheet)

• ANSI SQL 2011

• Priced at $5/TB of scanned data + storage + streaming (if used)

• Cost Optimization - partitioning, limit queried columns, 24-hour cache, cold data.


Summary

Feature \ Product AWS Athena Google BigQuery

Data Formats *SV, JSON, PARQUET/z, ORC/z External (*SV, JSON, AVRO) / Native

ANSI SQL Support Yes* Yes*

DDL Support Only CREATE/ALTER/DROP CREATE/UPDATE/DELETE (w/ quotas)

Underlying Technology FB Presto Google Dremel

Caching No Yes

Cold Data Pricing S3 Lifecycle Policy 50% discount after 90 days of inactivity

User Defined Functions No Yes

Data Partitioning On Any Key By DAY

Pricing $5/TB (scanned) plus S3 ops $5/TB (scanned) less cached data


How we tested?

• Dataset

• New York Yellow Taxi Public Dataset (https://data.cityofnewyork.us) [130GB, 1.1B rows]• Akamai Log (30GB, 1B rows]

• BigQuery [NY Taxi]

• Import of data into native table• External table on top of 500x uncompressed CSV files in GCS bucket• Caching: off

• AWS Athena [NY Taxi]

• Copied 500x uncompressed CSV files from GCS to S3 bucket• Using EMR 5.2 (HIVE/PRESTO) converted the data into ORC/z and PARQUET/z formats

https://data.cityofnewyork.us


Tables & Formats

BigQuery

• trips_ext (500x CSV files, 490MB each) [245GB in total]• trips_nat (130GB total)

AWS Athena

• trips_csv (500x CSV files, 490MB each)• trips_par (4 files, 3.2GB each)• trips_parz (8 files, 1.7GB each)• trips_orc (8 files, 2GB each)• trips_orcz (8 files, 2.1GB each)


Test Drive Summary

Query Type AWS Athens (GB/time) Google BigQuery (GB/time) t.diff %

[1] LOOKUP 48MB (4.1s) 130GB (2.0s) - 51%

[2] LOOKUP & AGGR 331MB (4.35s) 13.4GB (2.7s) - 48%

[3] GROUP/ORDER BY 5.74GB (8.85s) 8.26GB (5.4s) - 27%

[4] TEXT FUNCTIONS 606MB (11.3s) 13.6GB (2.4s) - 470%

[5] JSON FUNCTIONS 29MB (17.8s) 63.9GB (8.9s) - 100%

[6] REGEX FUNCTIONS (1.3s) 5.45GB (1.9s) + 31%

[7] FEDERATED DATA 133GB (19.4s) 133GB (36.4s) +47%


What Athena does better than BigQuery?

Advantages:

• Can be faster than BigQuery, especially with federated/external tables

• Ability to use regex to define a schema (query files without needing to change the format)

• Can be faster and cheaper than BigQuery when using a partitioned/columnar format

• Tables can be partitioned on any column

Issues:

• It’s not easy to convert data between formats

• Doesn’t support DDL, i.e. no insert/update/delete

• Randomly giving the HIVE_UNKNOWN_ERROR

• No streaming support

• Struggles with really large datasets


What BigQuery does better than Athena?

• It has native table support giving it better performance and more features

• It’s easy to manipulate data, insert/update records and write query results back to a table

• Querying native tables is very fast

• Easy to convert non-columnar formats into a native table for columnar queries

• Supports UDFs, although they will be available in the future for Athena

• Supports nested tables (nested and repeated fields)

• Works well for petabyte scale queries

Section Slide Template Option 2



Questions?


[1] Lookup Query

SELECT *

FROM trips_par

WHERE vendor_id = 'VTS'

LIMIT 10

Back


[2] Lookup & Aggregation

SELECT max(passenger_count)

FROM trips_par

WHERE vendor_id <> 'VTS'

Back


[3] GROUP BY / ORDER BY Query

SELECT substr(string(pickup_datetime),1,7) month,

COUNT(*) trips

FROM [doit-playground:playground.trips_nat]

WHERE substr(string(pickup_datetime),1,4) = '2014'

GROUP BY 1

ORDER BY 1

Back


[4] ‘LIKE’ Functions Query

SELECT

count(*)

FROM

log_par

WHERE

UA LIKE '%AppleWebKit%' OR

Back


[5] JSON Functions Query

SELECT

JSON_EXTRACT(Misc_Fields,'$.network.edgeIP') AS edgeIP, COUNT(*) AS total

FROM

[doit-playground:playground.akamai_errors]

GROUP BY

edgeIP

ORDER BY total DESC

LIMIT 10

Back


[6] Regex Functions Query

SELECT *

FROM log_par

WHERE REGEXP_MATCH(reqPath, r'msn.*\-home') LIMIT 10

Back

aws athena vs. google bigquery for interactive sql queries

Software