Scaling for Big Data at Google

Post on 22-May-2020


Page 1: Scaling for Big Data at Google - Jen's Homepage · Scaling for Big Data at Google. Jen Tong Developer Advocate Google Cloud Platform @MimmingCodes. Agenda Research Bigtable BigQuery

Scaling for Big Data at Google

Jen Tong
Developer Advocate
Google Cloud Platform

@MimmingCodes

Agenda

● Research
● Bigtable
● BigQuery

Google Research Publications

Open Source Implementations

Bigtable

Flume

Dremel

Managed Cloud Versions

• Bigtable - Bigtable

• Flume - Dataflow

• Dremel - BigQuery

Cloud Bigtable

Bigness

Google Internal Bigtable in Numbers

• Storage: 100s of PB

• Throughput: 1,000,000s of QPS

• Bandwidth: 100s of GB/sec

How much is that?

• Several Datas worth (photo credit: jdhancock)

• Millennia of DVD video (photo credit: illinoislibrary)

Engineering

Hundreds of engineer-years worth

Bigtable - The early years

• Ingredients

○ Thousands of commodity servers

○ Many petabytes of data

• Tradeoffs

○ Abandon traditional relational model

• Goals

○ Prototype the service to do its first scaling

○ Focus on batch work

○ Migrate first applications to Bigtable

○ Figure out replication

Bigtable - Stabilized

• Lower latency

○ Fast 99th percentile requests

○ Start serving web traffic

• Polish the Bigtable service

○ React better to abusive usage

○ Mixed media clusters - mixture of SSD + spinning disks

○ Faster tablet server recovery time: ~10 sec to ~800 ms

Data Model

How it works

Life of Bigtable data

Bigtable Architecture

[Diagram, repeated across four build slides. Labels: HBase Client, Bigtable Cell, Master, Tablet server (x5), Tablet (x4), Bloom filter, Memtable, Shared log, Block Cache, Chubby, Colossus]
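
The labels in the architecture diagram (shared log, memtable, Bloom filter, tablets) hint at a log-then-memory write path with immutable flushed files. The sketch below is a toy illustration of that general idea, not Bigtable's actual implementation: `ToyTabletServer`, `flush_threshold`, and the use of a plain Python set in place of a real (probabilistic) Bloom filter are all assumptions made for the example.

```python
class ToyTabletServer:
    """Toy model of a log + memtable + flushed-file write path (hypothetical)."""

    def __init__(self, flush_threshold=2):
        self.shared_log = []   # append-only record of every write
        self.memtable = {}     # recent writes, held in memory
        self.flushed = []      # list of (key_filter, rows) immutable "files"
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.shared_log.append((key, value))  # log first, so writes can be replayed
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        rows = dict(sorted(self.memtable.items()))
        # A real Bloom filter is probabilistic; a set is an exact stand-in here.
        self.flushed.append((set(rows), rows))
        self.memtable = {}

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for key_filter, rows in reversed(self.flushed):  # newest file first
            if key in key_filter:  # skip files that cannot hold the key
                return rows[key]
        return None


server = ToyTabletServer()
server.put("row1", "a")
server.put("row2", "b")  # reaches the threshold, so the memtable is flushed
server.put("row1", "c")  # the newer value stays in the memtable
```

Reads check the memtable first and then only the flushed files whose filter says the key might be present, which is the role a Bloom filter usually plays in this kind of design.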

When it's awesome

Cloud Bigtable Use Cases

• Financial Services - Faster risk analysis, credit card fraud/abuse

• Marketing / Digital Media - User engagement, clickstream analysis, real-time adaptive content

• Internet of Things - Sensor data dashboards and anomaly detection

• Telecommunications - Sampled traffic patterns, metric collection and reporting

• Energy - Oil well sensors, anomaly detection, predictive modeling

• Biomedical - Genomics sequencing data analysis

When not to use it

• Relational joins, like for online transaction processing - Cloud SQL

• Interactive querying - BigQuery

• Blobs over 10MB - Cloud Storage

• ACID transactions - Datastore

• Automatic cross-zone replication - Datastore

• You don't have much data yet - Datastore, Firebase, or Cloud SQL

</Bigtable>

Google BigQuery

Let's count some stuff

Words in Shakespeare

SELECT count(word)
FROM publicdata:samples.shakespeare

Wikipedia hits over 1 hour

SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150212_01]

Wikipedia hits over 1 month

SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]

Several years of Wikipedia data

SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201105],
  [fh-bigquery:wikipedia.pagecounts_201106],
  [fh-bigquery:wikipedia.pagecounts_201107],
  ...

Several years of Wikipedia data

SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')

How about a RegExp

SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
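
To make the two regular expressions in that query concrete, here is a small Python sketch of the same logic on made-up data. The table names and row values in `tables` are hypothetical, and Python's `re.search` is used as an approximation of `REGEXP_MATCH`; it is meant only to show which tables and titles the patterns would select.

```python
import re

# Hypothetical stand-ins for tables in a dataset: name -> rows of (title, requests).
tables = {
    "pagecounts_201412": [("Dinosaur", 7)],
    "pagecounts_201501": [("Dinosaur", 10), ("Cat", 3)],
    "pagecounts_201502": [("Feathered_dinosaur", 4), ("Dog", 8)],
    "pagecounts_20150212_01": [("Dinosaur", 1)],
}

table_pattern = re.compile(r"pagecounts_2015[0-9]{2}$")
title_pattern = re.compile(r".*[dD]inosaur.*")

# Step 1: pick the tables whose name matches (the TABLE_QUERY part).
selected = sorted(name for name in tables if table_pattern.search(name))

# Step 2: keep matching titles and add up their requests (the WHERE and SUM parts).
total = sum(
    requests
    for name in selected
    for title, requests in tables[name]
    if title_pattern.search(title)
)
```

Note that the table pattern as written only matches monthly tables from 2015; the hourly table and the 2014 table are skipped.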

How did it do that? o_O

Qualities of a good RDBMS

● Inserts & locking
● Indexing
● Cache
● Query planning

Storing data

[Diagram labels: Table, Columns, Disks]
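
The Table / Columns / Disks labels suggest a column-oriented layout. As a rough sketch of why that helps a query like `sum(requests)`, the Python below pivots some invented rows into per-column lists and then reads only the one column the query needs. The row data and the helper `to_columns` are assumptions for illustration and say nothing about BigQuery's real storage format.

```python
# Hypothetical rows from a page-count table.
rows = [
    {"title": "Cat", "language": "en", "requests": 3},
    {"title": "Dog", "language": "en", "requests": 5},
    {"title": "Chat", "language": "fr", "requests": 2},
]


def to_columns(rows):
    """Pivot row dictionaries into one list of values per column."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query that sums one column only has to read that column.
columns_read = ["requests"]
total = sum(columns["requests"])
values_touched = sum(len(columns[name]) for name in columns_read)
values_in_table = sum(len(values) for values in columns.values())
```

Here the sum touches 3 of the 9 stored values; on a wide table the saving grows with the number of columns the query ignores.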

Reading data: Life of a BigQuery

SELECT sum(requests) as sum
FROM (
  SELECT requests, title
  FROM [fh-bigquery:wikipedia.pagecounts_201501]
  WHERE (REGEXP_MATCH(title, '[Jj]en.+')) )

Life of a BigQuery

[Diagram, built up over seven slides. Layers: Root Mixer (M), Mixer (M M), Leaf (L L L L), Storage. Annotations added as the slides build: 5.4 Bil at Storage; SELECT requests, title; WHERE (REGEXP_MATCH(title, '[Jj]en.+')) with 5.8 Mil; SELECT sum(requests), shown twice]
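
The Storage / Leaf / Mixer / Root Mixer layers can be read as a tree that pushes the scan and filter down and pulls partial sums up. The following Python is a minimal sketch under that assumption: leaves filter their own shard and return a partial sum, mixers add partial sums, and the root adds the mixer results. The `shards` data and the `leaf` and `mixer` functions are hypothetical.

```python
import re

pattern = re.compile(r"[Jj]en.+")

# Hypothetical shards of (title, requests) rows, one shard per leaf.
shards = [
    [("Jennifer", 5), ("Cat", 9)],
    [("jenga", 2), ("Dog", 1)],
    [("Jen", 100), ("Jenny", 4)],
    [("Bird", 7)],
]


def leaf(shard):
    """Scan one shard, keep matching titles, return a partial sum."""
    return sum(requests for title, requests in shard if pattern.search(title))


def mixer(partials):
    """Combine partial sums from the level below."""
    return sum(partials)


leaf_results = [leaf(shard) for shard in shards]
mixer_results = [mixer(leaf_results[0:2]), mixer(leaf_results[2:4])]
total = mixer(mixer_results)  # the root mixer
```

The title "Jen" is not counted because `.+` requires at least one character after "en". Each level passes up far less data than it read, which fits the drop from billions to millions shown on the slides.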

Open Data

Finding Open Data

reddit.com/r/bigquery/wiki/datasets

Time to explore

GSOD

Find nearby weather data

select name
from [fh-bigquery:weather_gsod.stations]
where state == 'TX' and usaf in (
  select stn from (
    SELECT count(stn) as cnt, stn
    FROM [fh-bigquery:weather_gsod.gsod2015]
    where stn <> '999999'
    group by stn
    order by cnt desc))
group by name
order by name ASC;

Weather in Austin

SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2015]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'AUSTIN CAMP MABRY')
AND max < 200
ORDER BY day;

Weather in Half Moon Bay

SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2013]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'HALF MOON BAY AIRPOR')
AND max < 200
ORDER BY day;

Global high temperatures

SELECT year, max(max) as max
FROM TABLE_QUERY(
  [fh-bigquery:weather_gsod],
  'table_id CONTAINS "gsod"')
where max < 200
group by year
order by year asc

Something useful: Use Wikipedia data to pick a movie

1. Wikipedia edits
2. ???
3. Movie recommendation

Follow the edits

Same editor

Pick a great movie

select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where title contains 'Hackers'
  and title contains '(film)'
  and wp_namespace = 0
group by title, id
order by edits
limit 10

Find edits in common

select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
    select contributor_id
    from [publicdata:samples.wikipedia]
    where id=264176
      and contributor_id is not null
      and is_bot is null
      and wp_namespace = 0
      and title CONTAINS '(film)'
    group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
group each by title, id
order by edits desc
limit 100

Discover the most broadly popular films

select id from (
  select id, count(id) as edits
  from [publicdata:samples.wikipedia]
  where wp_namespace = 0
    and title CONTAINS '(film)'
  group each by id
  order by edits desc
  limit 20)

Edits in common, minus broadly popular

select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
    select contributor_id
    from [publicdata:samples.wikipedia]
    where id=264176
      and contributor_id is not null
      and is_bot is null
      and wp_namespace = 0
      and title CONTAINS '(film)'
    group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
  and id not in (
    select id from (
      select id, count(id) as edits
      from [publicdata:samples.wikipedia]
      where wp_namespace = 0
        and title CONTAINS '(film)'
      group each by id
      order by edits desc
      limit 20))
group each by title, id
order by edits desc
limit 100
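
The three steps in these queries (find the seed film's editors, count what else they edited, drop the most-edited films overall) can be sketched in plain Python. Everything below is illustrative: the `edits` list, the film titles, and the `recommend` function with its `popular_cutoff` parameter are invented for the example, and films are keyed by title rather than by page id.

```python
from collections import Counter

# Hypothetical edit log: (contributor_id, film title).
edits = [
    (1, "Hackers"), (1, "Sneakers"), (1, "Titanic"),
    (2, "Hackers"), (2, "Sneakers"), (2, "Titanic"), (2, "WarGames"),
    (3, "Titanic"), (4, "Titanic"),
]


def recommend(edits, seed, popular_cutoff):
    """Rank films edited by the seed film's editors, minus broadly popular ones."""
    seed_editors = {contributor for contributor, film in edits if film == seed}
    overall = Counter(film for _, film in edits)
    broadly_popular = {film for film, _ in overall.most_common(popular_cutoff)}
    in_common = Counter(
        film
        for contributor, film in edits
        if contributor in seed_editors
        and film != seed
        and film not in broadly_popular
    )
    return in_common.most_common()


recommendations = recommend(edits, seed="Hackers", popular_cutoff=1)
```

Without the popularity filter, "Titanic" would tie for first place simply because everyone edits it; removing the top of the overall ranking leaves the films that are specific to this group of editors.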

</BigQuery>

Conclusion

● Scale comes with tradeoffs
● Simpler problems are easier to scale
● Open data is cool

Thank you!

Jen Tong
Developer Advocate
Google Cloud Platform

@MimmingCodes
little418.com

GDELT

Stories per month - GA

SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code='USGA') GA
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1
ORDER BY 1

Stories per month, normalized

SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code='USGA') / COUNT(*) newsyness
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1
ORDER BY 1
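
The normalized query divides the matching event count by the total event count for each month, so growth in overall coverage does not look like growth in one region. A small Python sketch of that arithmetic, using an invented `events` list and a hypothetical helper `share_by_month`:

```python
from collections import defaultdict

# Hypothetical events: (MonthYear, ActionGeo_ADM1Code).
events = [
    (201501, "USGA"), (201501, "USTX"), (201501, "USGA"), (201501, "USCA"),
    (201502, "USGA"), (201502, "USNY"), (201502, "USNY"), (201502, "USTX"),
]


def share_by_month(events, code):
    """Fraction of each month's events whose region code equals `code`."""
    matches = defaultdict(int)
    totals = defaultdict(int)
    for month, adm1 in events:
        totals[month] += 1
        matches[month] += adm1 == code  # True counts as 1, like SUM(condition)
    return {month: matches[month] / totals[month] for month in sorted(totals)}


shares = share_by_month(events, "USGA")
```

In this toy data the share falls from one half to one quarter between the two months.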

ISB teams with Google for NCI Cancer Genomics Cloud project

https://developers.google.com/genomics/

Genomics

Cost to sequence a genome

1000 Genomes

Genomics

SELECT Sample, SUM(single), SUM(double),
FROM (
  SELECT call.call_set_name AS Sample,
    SOME(call.genotype > 0) AND NOT EVERY(call.genotype > 0) WITHIN call AS single,
    EVERY(call.genotype > 0) WITHIN call AS double,
  FROM [genomics-public-data:1000_genomes.variants]
  OMIT RECORD IF reference_name IN ("X","Y","MT"))
GROUP BY Sample
ORDER BY Sample
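
The SOME and EVERY conditions in the Genomics query classify each call by its genotype values: "single" when some but not all values are above zero, "double" when all of them are. The Python below is a sketch of that classification on invented records; the sample names, the `variants` structure, and the function `count_variant_alleles` are assumptions, and the nested-record semantics of the real query are only approximated with lists.

```python
from collections import defaultdict

# Hypothetical variant records; each call is (call_set_name, genotype values).
variants = [
    {"reference_name": "1", "call": [("HG001", [0, 1]), ("HG002", [1, 1])]},
    {"reference_name": "2", "call": [("HG001", [1, 1]), ("HG002", [0, 0])]},
    {"reference_name": "X", "call": [("HG001", [0, 1]), ("HG002", [0, 1])]},
]


def count_variant_alleles(variants, omit=("X", "Y", "MT")):
    """Per sample, count calls with some (single) or all (double) genotypes > 0."""
    counts = defaultdict(lambda: {"single": 0, "double": 0})
    for variant in variants:
        if variant["reference_name"] in omit:  # mirrors OMIT RECORD IF
            continue
        for sample, genotype in variant["call"]:
            some = any(g > 0 for g in genotype)
            every = all(g > 0 for g in genotype)
            counts[sample]["single"] += some and not every
            counts[sample]["double"] += every
    return {sample: counts[sample] for sample in sorted(counts)}


result = count_variant_alleles(variants)
```

The record on reference "X" is skipped entirely, so it contributes to neither count.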