TRANSCRIPT
Scaling for Big Data at Google
Jen Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes
Agenda
● Research
● Bigtable
● BigQuery
Google Research Publications
Open Source Implementations
Bigtable
Flume
Dremel
Managed Cloud Versions
Bigtable
Flume
Dremel
Bigtable
Dataflow
BigQuery
Cloud Bigtable
Bigness
Google Internal Bigtable in Numbers
• Storage: 100s of PB
• Throughput: 1,000,000s of QPS
• Bandwidth: 100s of GB/sec
How much is that?
Several Datas worth
Photo credit: jdhancock
How much is that?
Millennia of DVD video
Photo credit: illinoislibrary
Engineering
Hundreds of engineer-years worth
Bigtable - The early years
• Ingredients
○ Thousands of commodity servers
○ Many petabytes of data
• Tradeoffs
○ Abandon traditional relational model
• Goals
○ Prototype the service to do its first scaling
○ Focus on batch work
○ Migrate first applications to Bigtable
○ Figure out replication
Bigtable - Stabilized
• Lower latency
○ Fast 99th percentile requests
○ Start serving web traffic
• Polish the Bigtable service
○ React better to abusive usage
○ Mixed media clusters - mixture of SSD + spinning disks
○ Faster tablet server recovery time: ~10 sec to ~800 ms
Data model
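The Bigtable paper describes this data model as a sparse, distributed, persistent, multidimensional sorted map: (row key, column family:qualifier, timestamp) → uninterpreted bytes. A toy in-memory sketch of that contract (the class and method names are hypothetical; this is not the Cloud Bigtable or HBase client API):

```python
class TinyBigtable:
    """Toy sketch of Bigtable's data model: a sparse sorted map from
    (row key, column, timestamp) to an uninterpreted value. Rows stay
    in lexicographic order, which is what makes prefix scans cheap."""

    def __init__(self):
        self._cells = {}  # (row, column) -> {timestamp: value}

    def put(self, row, column, timestamp, value):
        self._cells.setdefault((row, column), {})[timestamp] = value

    def get(self, row, column):
        """Return the newest version of a cell, or None if absent."""
        versions = self._cells.get((row, column))
        return versions[max(versions)] if versions else None

    def scan(self, row_prefix):
        """Yield (row, column, latest value) for rows sharing a prefix."""
        for row, column in sorted(self._cells):
            if row.startswith(row_prefix):
                versions = self._cells[(row, column)]
                yield row, column, versions[max(versions)]


# Row keys like reversed domain names keep related pages adjacent.
t = TinyBigtable()
t.put("com.example/index", "contents:", 1, "<html>v1</html>")
t.put("com.example/index", "contents:", 2, "<html>v2</html>")
t.put("com.example/about", "anchor:home", 2, "About us")
print(t.get("com.example/index", "contents:"))  # newest version: <html>v2</html>
print(len(list(t.scan("com.example/"))))        # 2 cells under the prefix
```

Reads return the newest timestamp by default, and a prefix scan walks only adjacent rows, which is why row-key design matters so much in practice.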
How it works
Life of Bigtable data
Bigtable Architecture
[Diagram: a Bigtable cell. An HBase client talks to a Master and a pool of Tablet servers. Each tablet server hosts several tablets and keeps a memtable, block cache, Bloom filter, and shared log. Chubby provides coordination; tablet data is stored on Colossus.]
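Each per-tablet-server piece in the diagram speeds up a different path: the memtable absorbs recent writes, the block cache serves hot reads, and a Bloom filter lets a read skip on-disk files that definitely don't contain the requested row/column pair. A generic Bloom filter sketch (illustrative only, not Bigtable's actual implementation):

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: answers 'definitely not present' or
    'maybe present'. A tablet server keeps one per on-disk file so
    most reads never touch files that lack the row/column pair."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a big int used as a bit array

    def _positions(self, key):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # Never a false negative; occasionally a false positive.
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


bf = BloomFilter()
bf.add("com.example/index:contents:")
print(bf.might_contain("com.example/index:contents:"))  # True
```

The trade is a tiny amount of memory per file for a large reduction in disk seeks on reads of absent keys.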
When it's awesome
Financial Services
Faster risk analysis, credit card fraud/abuse
Marketing / Digital Media
User engagement, clickstream analysis, real-time adaptive content
Internet of Things
Sensor data dashboards and anomaly detection
Telecommunications
Sampled traffic patterns, metric collection and reporting
Energy
Oil well sensors, anomaly detection, predictive modeling
Biomedical
Genomics sequencing data analysis
Cloud Bigtable Use Cases
When not to use it
• Relational joins, like for online transaction processing
• Interactive querying
• Blobs over 10MB
• ACID transactions
• Automatic cross-zone replication
• You don't have much data yet
When not to use it
• Relational joins, like for online transaction processing - Cloud SQL
• Interactive querying - BigQuery
• Blobs over 10MB - Cloud Storage
• ACID transactions - Datastore
• Automatic cross-zone replication - Datastore
• You don't have much data yet - Datastore, Firebase, or Cloud SQL
</Bigtable>
Google BigQuery
Let's count some stuff
SELECT count(word)
FROM [publicdata:samples.shakespeare]
Words in Shakespeare
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_20150212_01]
Wikipedia hits over 1 hour
SELECT sum(requests) as total
FROM [fh-bigquery:wikipedia.pagecounts_201505]
Wikipedia hits over 1 month
Several years of Wikipedia data
SELECT sum(requests) as total
FROM
  [fh-bigquery:wikipedia.pagecounts_201105],
  [fh-bigquery:wikipedia.pagecounts_201106],
  [fh-bigquery:wikipedia.pagecounts_201107],
...
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
Several years of Wikipedia data
How about a RegExp
SELECT SUM(requests) AS total
FROM TABLE_QUERY(
  [fh-bigquery:wikipedia],
  'REGEXP_MATCH(table_id, r"pagecounts_2015[0-9]{2}$")')
WHERE (REGEXP_MATCH(title, '.*[dD]inosaur.*'))
How did it do that?
o_O
Qualities of a good RDBMS
● Inserts & locking
● Indexing
● Cache
● Query planning
Storing data
[Diagram: a table's rows are decomposed into columns, and each column is stored contiguously across disks.]
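The point of the diagram, following the Dremel paper, is that BigQuery stores each column of a table separately, so a query reads only the columns it names. A tiny illustration with invented data:

```python
# The same table, first row-oriented, then column-oriented.
rows = [
    {"title": "Jen_Tong", "requests": 12, "bytes": 3400},
    {"title": "Dinosaur", "requests": 99, "bytes": 8100},
    {"title": "Jenga",    "requests":  7, "bytes": 1200},
]

# Columnar layout: one contiguous array per column.
columns = {name: [r[name] for r in rows] for name in rows[0]}

# SELECT SUM(requests) touches only the 'requests' array; in a real
# columnar store, 'title' and 'bytes' would never leave disk.
total = sum(columns["requests"])
print(total)  # 118
```

Scanning one narrow array instead of every full row is why full-table scans over terabytes stay fast without indexes.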
Reading data: Life of a BigQuery
SELECT sum(requests) as sum
FROM (
  SELECT requests, title
  FROM [fh-bigquery:wikipedia.pagecounts_201501]
  WHERE (REGEXP_MATCH(title, '[Jj]en.+'))
)
Life of a BigQuery
[Diagram: the query enters at a Root Mixer, fans out through Mixers to Leaves, and the Leaves read from Storage. The inner SELECT requests, title runs at the leaves against storage; the WHERE (REGEXP_MATCH(title, '[Jj]en.+')) filter cuts roughly 5.4 billion rows down to 5.8 million before they leave the leaves; SELECT sum(requests) then combines partial sums on the way back up through the mixers.]
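The serving-tree flow above can be sketched in a few lines. Everything here is invented for illustration (four shards, two mixer levels, made-up titles and counts); the point is that filtering happens at the leaves and SUM decomposes into partial sums at every level:

```python
# Dremel-style serving tree, sketched: leaves scan and filter their
# shard of storage, then each level of mixers sums partial aggregates.
import re

shards = [
    [("Jenny", 10), ("Dinosaur", 5)],
    [("Jen_Tong", 3), ("Paris", 8)],
    [("jenga", 2), ("Jenkins", 7)],
    [("Tokyo", 1), ("Jennifer", 4)],
]

def leaf(shard):
    # WHERE REGEXP_MATCH(title, '[Jj]en.+') runs at the leaves, so
    # only the matching request counts ever travel up the tree.
    return sum(req for title, req in shard if re.match(r"[Jj]en.+", title))

def mixer(partials):
    # Mixers only combine partial sums; SUM is decomposable this way.
    return sum(partials)

partials = [leaf(s) for s in shards]          # fan-out to leaves
root_total = mixer([mixer(partials[:2]), mixer(partials[2:])])
print(root_total)  # 26
```

Aggregations that decompose like this (SUM, COUNT, MIN, MAX) scale almost for free; ones that don't (exact COUNT DISTINCT, arbitrary joins) are the expensive cases.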
Open Data
Finding Open Data
opendata.stackexchange.com
reddit.com/r/dataisbeautiful
reddit.com/r/bigquery/wiki/datasets
Time to explore
GSOD
Find nearby weather data
select name
from [fh-bigquery:weather_gsod.stations]
where state == 'TX' and usaf in (
  select stn from (
    SELECT count(stn) as cnt, stn
    FROM [fh-bigquery:weather_gsod.gsod2015]
    where stn <> '999999'
    group by stn
    order by cnt desc))
group by name
order by name ASC;
Weather in Austin
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2015]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'AUSTIN CAMP MABRY')
AND max < 200
ORDER BY day;
Weather in Half Moon Bay
SELECT DATE(year+mo+da) day, min, max
FROM [fh-bigquery:weather_gsod.gsod2013]
WHERE stn IN (
  SELECT usaf
  FROM [fh-bigquery:weather_gsod.stations]
  WHERE name = 'HALF MOON BAY AIRPOR')
AND max < 200
ORDER BY day;
Global high temperatures
SELECT year, max(max) as max
FROM TABLE_QUERY(
  [fh-bigquery:weather_gsod],
  'table_id CONTAINS "gsod"')
where max < 200
group by year
order by year asc
Something useful: use Wikipedia data to pick a movie
1. Wikipedia edits
2. ???
3. Movie recommendation
Follow the edits
Same editor
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where title contains 'Hackers'
  and title contains '(film)'
  and wp_namespace = 0
group by title, id
order by edits
limit 10
Pick a great movie
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
  select contributor_id
  from [publicdata:samples.wikipedia]
  where id = 264176
    and contributor_id is not null
    and is_bot is null
    and wp_namespace = 0
    and title CONTAINS '(film)'
  group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
group each by title, id
order by edits desc
limit 100
Find edits in common
Discover the most broadly popular films
select id from (
  select id, count(id) as edits
  from [publicdata:samples.wikipedia]
  where wp_namespace = 0
    and title CONTAINS '(film)'
  group each by id
  order by edits desc
  limit 20)
Edits in common, minus broadly popular
select title, id, count(id) as edits
from [publicdata:samples.wikipedia]
where contributor_id in (
  select contributor_id
  from [publicdata:samples.wikipedia]
  where id = 264176
    and contributor_id is not null
    and is_bot is null
    and wp_namespace = 0
    and title CONTAINS '(film)'
  group by contributor_id)
  and wp_namespace = 0
  and id != 264176
  and title CONTAINS '(film)'
  and id not in (
    select id from (
      select id, count(id) as edits
      from [publicdata:samples.wikipedia]
      where wp_namespace = 0
        and title CONTAINS '(film)'
      group each by id
      order by edits desc
      limit 20))
group each by title, id
order by edits desc
limit 100
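Taken together, the queries implement a simple item-to-item collaborative filter: find the editors of a seed film's article, count what else they edited, then drop the globally most-edited films so the result isn't just blockbusters. A Python sketch of the same logic over a hypothetical edit log (all contributor ids and titles here are made up):

```python
from collections import Counter

# Hypothetical edit log: (contributor_id, film_title) pairs.
edits = [
    (1, "Hackers (film)"), (1, "Sneakers (film)"), (1, "WarGames (film)"),
    (2, "Hackers (film)"), (2, "Sneakers (film)"),
    (3, "Titanic (film)"), (3, "WarGames (film)"),
    (4, "Hackers (film)"), (4, "Titanic (film)"),
]
seed = "Hackers (film)"

# Inner query: everyone who edited the seed article.
fans = {c for c, title in edits if title == seed}

# Outer query: what else those editors touched, counted.
scores = Counter(title for c, title in edits if c in fans and title != seed)

# "Minus broadly popular": drop the globally most-edited film(s) so the
# recommendation isn't dominated by articles everyone edits anyway.
popular = {title for title, _ in Counter(t for _, t in edits).most_common(1)}
recommendations = [(t, n) for t, n in scores.most_common() if t not in popular]
print(recommendations[0][0])  # the film most co-edited with the seed
```

The SQL versions do the same three steps, just pushed into BigQuery so the edit log never leaves the warehouse.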
</BigQuery>
Conclusion
● Scale comes with tradeoffs
● Simpler problems are easier to scale
● Open data is cool
Thank you!
Jen Tong
Developer Advocate
Google Cloud Platform
@MimmingCodes
little418.com
GDELT
Stories per month - GA
SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code = 'USGA') GA
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
SELECT DATE(STRING(MonthYear) + '01') month,
  SUM(ActionGeo_ADM1Code = 'USGA') / COUNT(*) newsyness
FROM [gdelt-bq:full.events]
WHERE MonthYear > 0
GROUP BY 1 ORDER BY 1
Stories per month, normalized
ISB teams with Google for NCI Cancer Genomics Cloud project
https://developers.google.com/genomics/
Genomics
Cost to sequence a genome
1000 Genomes
Genomics
SELECT Sample, SUM(single), SUM(double),
FROM (
  SELECT
    call.call_set_name AS Sample,
    SOME(call.genotype > 0) AND NOT EVERY(call.genotype > 0)
      WITHIN call AS single,
    EVERY(call.genotype > 0) WITHIN call AS double,
  FROM [genomics-public-data:1000_genomes.variants]
  OMIT RECORD IF reference_name IN ("X", "Y", "MT"))
GROUP BY Sample ORDER BY Sample