TRANSCRIPT
Scaling for Big Data at Google
Jen Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes
Agenda
● Research
● Bigtable
● BigQuery
Google Research Publications
Open Source Implementations
Bigtable
Flume
Dremel
Managed Cloud Versions
Bigtable
Flume
Dremel
Bigtable
Dataflow
BigQuery
Cloud Bigtable
Bigness
Google Internal Bigtable in Numbers
• Storage: 100s of PB
• Throughput: 1,000,000s of QPS
• Bandwidth: 100s of GB/sec
How much is that?
Several Datas worth
Photo credit: jdhancock
How much is that?
Millennia of DVD video
Photo credit: illinoislibrary
Engineering
Hundreds of engineer-years worth
Bigtable - The early years
• Ingredients
○ Thousands of commodity servers
○ Many petabytes of data
• Tradeoffs
○ Abandon traditional relational model
• Goals
○ Prototype the service to do its first scaling
○ Focus on batch work
○ Migrate first applications to Bigtable
○ Figure out replication
Bigtable - Stabilized
• Lower latency
○ Fast 99th percentile requests
○ Start serving web traffic
• Polish the Bigtable service
○ React better to abusive usage
○ Mixed media clusters - mixture of SSD + spinning disks
○ Faster tablet server recovery time: ~10 sec to ~800 ms
Data model
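The Bigtable paper describes this data model as a sparse, distributed, persistent, multidimensional sorted map: (row key, column family:qualifier, timestamp) → uninterpreted bytes. A toy in-memory sketch of that contract (the class and method names are hypothetical; this is not the Cloud Bigtable or HBase client API):

```python
class TinyBigtable:
    """Toy sketch of Bigtable's data model: a sparse sorted map from
    (row key, column, timestamp) to an uninterpreted value. Rows stay
    in lexicographic order, which is what makes prefix scans cheap."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._cells.get((row, column))
        return versions[max(versions)] if versions else None

    def scan(self, row_prefix):
        """Yield (row, column, latest value) for rows sharing a prefix."""
        for row, column in sorted(self._cells):
            if row.startswith(row_prefix):
                versions = self._cells[(row, column)]
                yield row, column, versions[max(versions)]


# Row keys like reversed domain names keep related pages adjacent.
t = TinyBigtable()
t.put("com.example/index", "contents:", 1, "<html>v1</html>")
t.put("com.example/index", "contents:", 2, "<html>v2</html>")
t.put("com.example/about", "anchor:home", 2, "About us")
print(t.get("com.example/index", "contents:"))  # newest version: <html>v2</html>
print(len(list(t.scan("com.example/"))))        # 2 cells under the prefix
```

Reads return the newest timestamp by default, and a prefix scan walks only adjacent rows, which is why row-key design matters so much in practice.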
How it works
Life of Bigtable data
Bigtable Architecture
[Diagram: a Bigtable cell. An HBase client talks to a Master and a pool of Tablet servers. Each tablet server hosts several tablets and keeps a memtable, block cache, Bloom filter, and shared log. Chubby provides coordination; tablet data is stored on Colossus.]
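Each per-tablet-server piece in the diagram speeds up a different path: the memtable absorbs recent writes, the block cache serves hot reads, and a Bloom filter lets a read skip on-disk files that definitely don't contain the requested row/column pair. A generic Bloom filter sketch (illustrative only, not Bigtable's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not present' or
    'maybe present'. A tablet server keeps one per on-disk file so
    most reads never touch files that lack the row/column pair."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Never a false negative; occasionally a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bf = BloomFilter()
bf.add("com.example/index:contents:")
print(bf.might_contain("com.example/index:contents:"))  # True
```

The trade is a tiny amount of memory per file for a large reduction in disk seeks on reads of absent keys.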
When it's awesome
Financial Services
Faster risk analysis, credit card fraud/abuse
Marketing / Digital Media
User engagement, clickstream analysis, real-time adaptive content
Internet of Things
Sensor data dashboards and anomaly detection
Telecommunications
Sampled traffic patterns, metric collection and reporting
Energy
Oil well sensors, anomaly detection, predictive modeling
Biomedical
Genomics sequencing data analysis
Cloud Bigtable Use Cases
When not to use it
• Relational joins, like for online transaction processing
• Interactive querying
• Blobs over 10MB
• ACID transactions
• Automatic cross-zone replication
• You don't have much data yet
When not to use it
• Relational joins, like for online transaction processing - Cloud SQL
• Interactive querying - BigQuery
• Blobs over 10MB - Cloud Storage
• ACID transactions - Datastore
• Automatic cross-zone replication - Datastore
• You don't have much data yet - Datastore, Firebase, or Cloud SQL
</Bigtable>
Google BigQuery
Let's count some stuff
SELECT count(word)
FROM [publicdata:samples.shakespeare]
Words in Shakespeare
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150212_01]
Wikipedia hits over 1 hour
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as total
FROM
  [fh-bigquery:wikipedia.pagecounts_201105],
  [fh-bigquery:wikipedia.pagecounts_201106],
  [fh-bigquery:wikipedia.pagecounts_201107],
...
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
How about a RegExp
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
How did it do that?
o_O
Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning
Storing data
[Diagram: a table's rows are decomposed into columns, and each column is stored contiguously across disks.]
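The point of the diagram, following the Dremel paper, is that BigQuery stores each column of a table separately, so a query reads only the columns it names. A tiny illustration with invented data:

```python
# The same table, first row-oriented, then column-oriented.
rows = [
    {"title": "Jen_Tong", "requests": 12, "bytes": 3400},
    {"title": "Dinosaur", "requests": 99, "bytes": 8100},
    {"title": "Jenga",    "requests":  7, "bytes": 1200},
]

# Columnar layout: one contiguous array per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(requests) touches only the 'requests' array; in a real
# columnar store, 'title' and 'bytes' would never leave disk.
total = sum(columns["requests"])
print(total)  # 118
```

Scanning one narrow array instead of every full row is why full-table scans over terabytes stay fast without indexes.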
Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
  SELECT requests, title
  FROM [fh-bigquery:wikipedia.pagecounts_201501]
  WHERE (REGEXP_MATCH(title, '[Jj]en.+'))
)
Life of a BigQuery
[Diagram: the query enters at a Root Mixer, fans out through Mixers to Leaves, and the Leaves read from Storage. The inner SELECT requests, title runs at the leaves against storage; the WHERE (REGEXP_MATCH(title, '[Jj]en.+')) filter cuts roughly 5.4 billion rows down to 5.8 million before they leave the leaves; SELECT sum(requests) then combines partial sums on the way back up through the mixers.]
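The serving-tree flow above can be sketched in a few lines. Everything here is invented for illustration (four shards, two mixer levels, made-up titles and counts); the point is that filtering happens at the leaves and SUM decomposes into partial sums at every level:

```python
# Dremel-style serving tree, sketched: leaves scan and filter their
# shard of storage, then each level of mixers sums partial aggregates.
import re

shards = [
    [("Jenny", 10), ("Dinosaur", 5)],
    [("Jen_Tong", 3), ("Paris", 8)],
    [("jenga", 2), ("Jenkins", 7)],
    [("Tokyo", 1), ("Jennifer", 4)],
]

def leaf(shard):
    # WHERE REGEXP_MATCH(title, '[Jj]en.+') runs at the leaves, so
    # only the matching request counts ever travel up the tree.
    return sum(req for title, req in shard if re.match(r"[Jj]en.+", title))

def mixer(partials):
    # Mixers only combine partial sums; SUM is decomposable this way.
    return sum(partials)

partials = [leaf(s) for s in shards]          # fan-out to leaves
root_total = mixer([mixer(partials[:2]), mixer(partials[2:])])
print(root_total)  # 26
```

Aggregations that decompose like this (SUM, COUNT, MIN, MAX) scale almost for free; ones that don't (exact COUNT DISTINCT, arbitrary joins) are the expensive cases.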
Open Data
Finding Open Data
opendata.stackexchange.com
reddit.com/r/dataisbeautiful
reddit.com/r/bigquery/wiki/datasets
Time to explore
GSOD
Find nearby weather data
select name
from [fh-bigquery:weather_gsod.stations]
where state == 'TX' and usaf in (
  select stn from (
    SELECT count(stn) as cnt, stn
    FROM [fh-bigquery:weather_gsod.gsod2015]
    where stn <> '999999'
    group by stn
    order by cnt desc))
group by name
order by name ASC;
Weather in Austin
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2015]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'AUSTIN CAMP MABRY')
AND max < 200
ORDER BY day;
Weather in Half Moon Bay
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2013]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'HALF MOON BAY AIRPOR')
AND max < 200
ORDER BY day;
Global high temperatures
SELECT year, max(max) as max
FROM TABLE_QUERY(
  [fh-bigquery:weather_gsod],
  'table_id CONTAINS "gsod"')
where max < 200
group by year
order by year asc
Something useful: use Wikipedia data to pick a movie
1. Wikipedia edits
2. ???
3. Movie recommendation
Follow the edits
Same editor
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where title contains 'Hackers'
  and title contains '(film)'
  and wp_namespace = 0
group by title, id
order by edits
limit 10
Pick a great movie
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
  select contributor_id
  from [publicdata:samples.wikipedia]
  where id = 264176
    and contributor_id is not null
    and is_bot is null
    and wp_namespace = 0
    and title CONTAINS '(film)'
  group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
group each by title, id
order by edits desc
limit 100
Find edits in common
Discover the most broadly popular films
select id from (
  select id, count(id) as edits
  from [publicdata:samples.wikipedia]
  where wp_namespace = 0
    and title CONTAINS '(film)'
  group each by id
  order by edits desc
  limit 20)
Edits in common, minus broadly popular
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
  select contributor_id
  from [publicdata:samples.wikipedia]
  where id = 264176
    and contributor_id is not null
    and is_bot is null
    and wp_namespace = 0
    and title CONTAINS '(film)'
  group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
  and id not in (
    select id from (
      select id, count(id) as edits
      from [publicdata:samples.wikipedia]
      where wp_namespace = 0
        and title CONTAINS '(film)'
      group each by id
      order by edits desc
      limit 20))
group each by title, id
order by edits desc
limit 100
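Taken together, the queries implement a simple item-to-item collaborative filter: find the editors of a seed film's article, count what else they edited, then drop the globally most-edited films so the result isn't just blockbusters. A Python sketch of the same logic over a hypothetical edit log (all contributor ids and titles here are made up):

```python
from collections import Counter

# Hypothetical edit log: (contributor_id, film_title) pairs.
edits = [
    (1, "Hackers (film)"), (1, "Sneakers (film)"), (1, "WarGames (film)"),
    (2, "Hackers (film)"), (2, "Sneakers (film)"),
    (3, "Titanic (film)"), (3, "WarGames (film)"),
    (4, "Hackers (film)"), (4, "Titanic (film)"),
]
seed = "Hackers (film)"

# Inner query: everyone who edited the seed article.
fans = {c for c, title in edits if title == seed}

# Outer query: what else those editors touched, counted.
scores = Counter(title for c, title in edits if c in fans and title != seed)

# "Minus broadly popular": drop the globally most-edited film(s) so the
# recommendation isn't dominated by articles everyone edits anyway.
popular = {title for title, _ in Counter(t for _, t in edits).most_common(1)}
recommendations = [(t, n) for t, n in scores.most_common() if t not in popular]
print(recommendations[0][0])  # the film most co-edited with the seed
```

The SQL versions do the same three steps, just pushed into BigQuery so the edit log never leaves the warehouse.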
</BigQuery>
Conclusion
● Scale comes with tradeoffs
● Simpler problems are easier to scale
● Open data is cool
Thank you!
Jen Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes
little418.com
GDELT
Stories per month - GA
SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code = 'USGA') GA
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code = 'USGA') / COUNT(*) newsyness
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
Stories per month, normalized
ISB teams with Google for NCI Cancer Genomics Cloud project
https://developers.google.com/genomics/
Genomics
Cost to sequence a genome
1000 Genomes
Genomics
SELECT Sample, SUM(single), SUM(double),
FROM (
  SELECT
    call.call_set_name AS Sample,
    SOME(call.genotype > 0) AND NOT EVERY(call.genotype > 0)
      WITHIN call AS single,
    EVERY(call.genotype > 0) WITHIN call AS double,
  FROM [genomics-public-data:1000_genomes.variants]
  OMIT RECORD IF reference_name IN ("X", "Y", "MT"))
GROUP BY Sample ORDER BY Sample