
© 2009 - 2014 Next Big Sound, Inc.

Building a Data “Development” Platform

Data Evolution In HBase

Eric Czech & Alec Zopf Next Big Sound

HBaseCon - Case Studies Track

May 5, 2014



Intro

• Eric Czech - Chief Architect
  Previously worked for infrastructure team at quantitative hedge fund

• Alec Zopf - Senior Data Engineer
  Previously worked on algorithmic futures and options trading platform


Agenda

• Data & Architecture
• Data Aggregation
  - Why no tools help us
• Data Development (HBlocks)
  - Our platform for making it happen
• A Practical Example


Data Sources

Next Big Sound marries billions of public social data points with customers’ internal transactional data. Public sources include up to 3+ years of historical and competitive data for hundreds of thousands of artists and millions of songs.

Streaming & Social: Facebook, Facebook Insights, Last.fm, Pandora, Rdio, ReverbNation, SoundCloud, Tumblr, Spotify, Twitter, Vevo, Vimeo, YouTube, YouTube Analytics, Deezer, Instagram

Sales: iTunes, Physical Sales, Amazon

Misc: Sitecatalyst, Google Analytics, Wikipedia, Tunesat, Mediabase


Charts Licensed to Billboard

In Billboard’s 118-year history they’ve licensed data from two providers: Nielsen in 1991 and Next Big Sound in 2010.


Architecture & Stats

• Data collected from 60+ sources
• 1M artists, 10M tracks
• 10s of billions of records
• CDH 4.3.0
• 48 node Hadoop cluster for 35TB dataset
• No licensing costs
• Giant counting machine!


Data Aggregation

• HDFS: Stores raw fact tables and copies of dimension tables from MySQL
• Oozie/Pig: Runs incremental joins of fact and dimension tables
• HBase: Stores timeseries aggregations for random access (NOT using counters)




Raw Fact Data (HDFS)

Aggregate Tables (HBase)

Cube/Rollup Operations (Pig)

(and many more...)
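The join and cube/rollup steps in this pipeline run as Pig scripts; the Python sketch below only illustrates the idea on a handful of in-memory rows. The record layout, the names `FACTS`, `ARTIST_DIM`, `join_facts` and `cube`, and all sample numbers are hypothetical and not taken from the NBS codebase.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical fact rows: (artist_id, country, date, plays)
FACTS = [
    (1, "US", "20140101", 10),
    (1, "GB", "20140101", 4),
    (2, "US", "20140101", 7),
    (1, "US", "20140102", 3),
]

# Hypothetical dimension table (a copy of what would live in MySQL)
ARTIST_DIM = {1: "artist_a", 2: "artist_b"}


def join_facts(facts, dim):
    """Attach the dimension attribute to each fact row, dropping unmatched rows."""
    for artist_id, country, date, plays in facts:
        if artist_id in dim:
            yield {"artist": dim[artist_id], "country": country,
                   "date": date, "plays": plays}


def cube(rows, dimensions, measure):
    """Sum `measure` for every subset of `dimensions` (a CUBE-style aggregation)."""
    totals = defaultdict(int)
    for row in rows:
        for size in range(len(dimensions) + 1):
            for subset in combinations(dimensions, size):
                key = tuple((d, row[d]) for d in subset)
                totals[key] += row[measure]
    return dict(totals)


totals = cube(join_facts(FACTS, ARTIST_DIM), ("artist", "country"), "plays")
```

Each key in `totals` is one cell of the cube: the empty tuple holds the grand total, single-dimension keys hold per-artist or per-country totals, and two-dimension keys hold the finest grain.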


Other Solutions

Are there better ways to just count things?

Yes! Lots:

• OpenTSDB (http://opentsdb.net/)
• Summingbird (Twitter) (https://github.com/twitter/summingbird)
• DataFu Hourglass (LinkedIn) (https://engineering.linkedin.com/datafu/datafus-hourglass-incremental-data-processing-hadoop)
• Blueflood (Rackspace) (http://developer.rackspace.com/blog/blueflood-announcement.html)
• Oozie Coordinators (https://oozie.apache.org/docs/3.1.3-incubating/CoordinatorFunctionalSpec.html)
• Apache Accumulo (https://accumulo.apache.org/)
• Hadoop + Voldemort (https://engineering.linkedin.com/voldemort/serving-large-scale-batch-computed-data-project-voldemort)
• MongoDB Incremental MapReduce (http://docs.mongodb.org/manual/tutorial/perform-incremental-map-reduce/)
• TempoDB & InfluxDB (hosted services) (https://tempo-db.com/, http://influxdb.org/)
• KairosDB (originally built on Cassandra) (https://code.google.com/p/kairosdb/)
• Amazon EMR/Redshift (http://aws.amazon.com/elasticmapreduce/, http://aws.amazon.com/redshift/)
• Cassandra/Redis/Riak/HBase Counters

Considerations

• Scalability
• Cost
• Performance
• Client Libraries
• I/O Characteristics
• Optimal Hardware
• Config Overhead
• Language
• Community
• Data Model
• Monitoring/Alerting
• Documentation
• Support
• Learning Curve




One More Thing..

What about mistakes?!

Data “bugs” are nearly impossible to predict and can screw you in unimaginable ways..


Data Bugs

Why are fan counts in Schenectady, NY 1000% higher than everywhere else?
  Data source uses 12345 as default for new users’ locations

Why are radio station play numbers recently all multiples of 2 or 3?
  Data delivered several times and we had no idea

Why is the number of songs sold 3% too high?
  We didn't account for returns

Why are all the page view spikes 8 hours after they should be?
  We assumed UTC timestamps instead of PST

Hundreds of these! ... that we know of
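To make the timestamp bug concrete, here is a minimal sketch of how one wall-clock value turns into two different instants depending on the assumed zone. The sample timestamp is hypothetical, and PST is modeled as a fixed UTC-8 offset, which ignores daylight saving.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
PST = timezone(timedelta(hours=-8), "PST")  # fixed offset, ignores daylight saving

# Hypothetical wall-clock timestamp from a source that does not state its zone
raw = datetime(2014, 1, 20, 12, 0, 0)

as_utc = raw.replace(tzinfo=UTC)  # reading the value as UTC
as_pst = raw.replace(tzinfo=PST)  # reading the same value as PST

# The two readings name instants eight hours apart
shift = as_pst - as_utc
```

Nothing in the raw value reveals which reading is right, which is why this kind of bug tends to surface only as a shifted spike in a chart.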


Minor Data Bugs

Georgia != Georgia


Or maybe not...

Can we just fix the code and re-aggregate?
  NO, there’s no guarantee that the bad data is overwritten.

Can we do the aggregations “on-the-fly”?
  NO, we’re not using a relational model for good reason.

Can we rebuild everything in new tables?
  NO, we’d need 2x storage to fix < .0001% of the data.
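One way a re-aggregation can fail to overwrite bad data is when the fix changes the keys being written. The sketch below uses a plain dict as a stand-in for an HBase table; the key layout, the `aggregate` helper and the sample rows are assumptions made for illustration.

```python
# A dict stands in for an HBase table keyed by (language, page, date).
table = {}


def aggregate(records, normalize):
    """Write page-view sums into `table`, keyed by the normalized language."""
    sums = {}
    for language, page, date, views in records:
        key = (normalize(language), page, date)
        sums[key] = sums.get(key, 0) + views
    table.update(sums)  # puts overwrite only the keys produced by this run


records = [
    ("en", "Apache_Hadoop", "20140131", 200),
    ("En", "Apache_Hadoop", "20140131", 2),
]

aggregate(records, normalize=lambda lang: lang)          # buggy first run
aggregate(records, normalize=lambda lang: lang.lower())  # fixed re-run

# Rows written under the old, un-normalized key are still in the table
stale = [key for key in table if key[0] == "En"]
```

The fixed run corrects the `en` row but never touches the `En` row, so the stale value stays queryable until something removes it explicitly.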


Learning the Hard Way

Fixing data bugs online is terrifying.

“Ad-hoc” updates to production datasets are:

• Dangerous and complicated
• Difficult to generalize
• Time-consuming to test
• A huge database I/O burden


Back To Solutions

What if each dataset had multiple versions?

... and we can focus on small pieces

... with alpha/beta/stable tags

... where users only see what they should

Feels familiar


HBlocks

Our solution for large-scale revision control

• Spans HDFS, Hive, Pig, and HBase
• Arbitrary versioning of data subsets
• Incremental processing, full-scale re-processing, and everything in between
• Append-only model (deletes in background)


The Basics

Each raw file has an ID
  * e.g. “block_1”

Each ID has versions
  * ID & version stored in HBase

Version state used to filter results
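The three ideas above (block IDs, versions per ID, and state-based filtering) can be sketched in a few lines. This is not the HBlocks implementation: the dictionaries, the `visible` helper and the sample values are hypothetical, and the real metadata lives in HBase rather than in memory.

```python
# Hypothetical metadata: (block_id, version) -> state
VERSION_STATES = {
    ("block_1", 1): "HIDDEN",
    ("block_1", 2): "STABLE",
    ("block_2", 1): "ALPHA",
}

# Hypothetical aggregate records, each tagged with the block and version that produced it
RECORDS = [
    {"block_id": "block_1", "version": 1, "value": 283},
    {"block_id": "block_1", "version": 2, "value": 285},
    {"block_id": "block_2", "version": 1, "value": 40},
]


def visible(records, states, allowed):
    """Keep records whose (block_id, version) is in one of the allowed states."""
    return [r for r in records
            if states.get((r["block_id"], r["version"])) in allowed]


public_view = visible(RECORDS, VERSION_STATES, allowed={"STABLE"})
```

Because filtering happens at read time, changing a version's state changes what readers see without rewriting any aggregate rows.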


Data Development

Version “States” control data lifecycle

Birth
  PENDING      New data for ETL pipeline
  PROCESSING   Data currently being processed
  ALPHA        Developers only
  BETA         Privileged users
  STABLE       Everybody
  HIDDEN       Ignored (but still in HBase)
  DELETED      Removed permanently
Death
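A minimal way to model these states and who may read them is an enum plus a visibility map. The audience names and the assumption that visibility is cumulative (developers also see BETA and STABLE data) are illustrative guesses, not something the slides state.

```python
from enum import Enum


class State(Enum):
    PENDING = 1      # new data for the ETL pipeline
    PROCESSING = 2   # currently being processed
    ALPHA = 3        # developers only
    BETA = 4         # privileged users
    STABLE = 5       # everybody
    HIDDEN = 6       # ignored, but still in HBase
    DELETED = 7      # removed permanently


# Assumption: visibility is cumulative, so developers also see BETA and STABLE.
VISIBLE_TO = {
    "developer": {State.ALPHA, State.BETA, State.STABLE},
    "privileged": {State.BETA, State.STABLE},
    "everybody": {State.STABLE},
}


def can_see(audience, state):
    """Return True if the audience should see data in the given state."""
    return state in VISIBLE_TO[audience]
```

For example, `can_see("everybody", State.BETA)` is false, while `can_see("developer", State.BETA)` is true under the cumulative assumption.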


A Practical Example

Tracking the number of English Language Wikipedia page views for Hadoop

So we’ll track this site:
http://en.wikipedia.org/wiki/Apache_Hadoop

Using this data:
http://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/

®


The Datasethttp://dumps.wikimedia.org/other/pagecounts-raw/2014/2014-01/

Contains ~100MB compressed files for each hour

pagecounts-20140101-*.gzAll pageviews for Jan 1, 2014:



File Uploads

Files downloaded anywhere ...

user@host001> ls wikipedia
pagecounts-20140101.gz
pagecounts-20140102.gz
...
pagecounts-20140131.gz

... and uploaded to HDFS

user@host001> for file in `ls wikipedia`
do
  hblocks upload \
    -file $file \
    -source wikipedia
done




Run It!

Now, let’s do some aggregating:

user@host001> hblocks aggregate -source wikipedia

Pig script writes results to HBase:

user@host001> hblocks query -table page_views
+-------------------------------------------------------------------+
| hblock_id | version | language | page          | date     | value |
+-------------------------------------------------------------------+
| 2935      | 1       | en       | Apache_Hadoop | 20140101 | 283   |
...
| 2935      | 1       | En       | Apache_Hadoop | 20140131 | 2     |
| 2935      | 1       | en.mw    | Apache_Hadoop | 20140131 | 3     |

Wtf is this!? (the ‘En’ and ‘en.mw’ rows)


What Happened?

On January 20th:

• “Sub” languages (e.g. ‘en.mw’) introduced
• Capitalized languages (e.g. ‘En’) also added
• Aggregation script starts ignoring small % of records

* fictitious problems - these language values are real but were not introduced in January




Effects Over Time

Aggregation process misses new languages causing slight drop in values


Fix It!

Change the current aggregation code:

String language = line.get("language");

To handle case-sensitivity and use first part before a “.”:

String language = line.get("language")
    .split("\\.")[1]
    .toLowerCase();


Run It Again

Run the same aggregation for new versions:

user@host001> hblocks aggregate -source wikipedia

New results:

We made it even worse!


Revert

Hurry, hide the bad data:

user@host001> hblocks update_versions -source wikipedia \
    -regex '.*201401(2|3).*' -state 'HIDDEN'

Phew, back to where we started ... but what happened?

.split("\\.")[1]

Wrong! Should have been:

.split("\\.")[0]
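The `[1]` versus `[0]` mistake is easy to reproduce. The sketch below is a Python translation of the Java fragment, not the original aggregation code; the function names are hypothetical. Python raises `IndexError` where the Java version would throw `ArrayIndexOutOfBoundsException`.

```python
def normalize_buggy(language):
    # Mirrors .split("\\.")[1]: takes the part AFTER the first dot
    return language.split(".")[1].lower()


def normalize_fixed(language):
    # Mirrors .split("\\.")[0]: takes the part BEFORE the first dot
    return language.split(".")[0].lower()


samples = ["en", "En", "en.mw"]
fixed = [normalize_fixed(s) for s in samples]

try:
    normalize_buggy("en")  # no dot, so there is no element at index 1
    buggy_failed = False
except IndexError:
    buggy_failed = True
```

With the buggy index, plain `en` records fail outright and `en.mw` collapses to `mw`, which is consistent with the results getting worse after the first fix.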


Fix It Again (carefully)

Rebuild aggregations in ‘beta’ state this time:

user@host001> hblocks rebuild -source wikipedia \
    -regex '.*201401(2|3).*' -state 'beta'

After another hblocks aggregate, only developers see:

Looks good!
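To see which blocks the `-regex` argument would select, the same pattern can be tried against the uploaded file names. This sketch assumes the pattern is applied to the file name and must match it in full; the slides do not say what `hblocks` actually matches against.

```python
import re

# Block names as uploaded in the example, one per day of January 2014
blocks = ["pagecounts-201401%02d.gz" % day for day in range(1, 32)]

pattern = re.compile(r".*201401(2|3).*")

# Assumption: the pattern must match the whole name
selected = [name for name in blocks if pattern.fullmatch(name)]
```

Under that assumption the pattern picks out exactly the files for January 20th through 31st, the period affected by the language change.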


HBase Schema

Keys:        Primary Dimensions | HBlock Id | Time 0
Columns:     Secondary Dimensions | Time 1 | HBlock Version Id
Values:      Time 2.0 | Value 0 | ... | Time 2.N | Value N
Timestamps:  Schema # | Insertion Time (secs) | Value Data Type


HBase Keys/Columns

Keys:     Primary Dimensions | HBlock Id | Time 0
Columns:  Secondary Dimensions | Time 1 | HBlock Version Id

• Primary Dimensions: concatenated string ids (artists, tracks & metrics)
• Time 0 / Time 1: times split into offsets, which limits row width
• Secondary Dimensions: queried in bulk (demographics & zip codes)
• HBlock Id / HBlock Version Id: HBlocks metadata determines record “state”
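The layout above can be sketched as two small builder functions. Everything concrete here is an assumption made for illustration: the `|` delimiter, the base-256 time split, the use of hours as the time unit, and the function names. The slides only state which components go into keys and columns.

```python
def split_time(units, width=256):
    """Split an integer time into (time0, time1, time2) offsets."""
    time2 = units % width               # finest offset, fits in a single byte
    time1 = (units // width) % width    # middle offset, stored in the column
    time0 = units // (width * width)    # coarsest part, stored in the row key
    return time0, time1, time2


def row_key(primary_ids, hblock_id, time0):
    """Row key: primary dimensions, then HBlock id, then Time 0."""
    return "|".join(list(primary_ids) + [str(hblock_id), str(time0)]).encode("utf-8")


def column_qualifier(secondary_ids, time1, version_id):
    """Column: secondary dimensions, then Time 1, then HBlock version id."""
    return "|".join(list(secondary_ids) + [str(time1), str(version_id)]).encode("utf-8")


# 385704 is the number of hours from the Unix epoch to 2014-01-01 00:00 UTC
time0, time1, time2 = split_time(385704)
key = row_key(["artist_1", "metric_7"], 2935, time0)
column = column_qualifier(["US"], time1, 1)
```

Splitting time this way means one row covers a bounded time range and one column a smaller range within it, so neither rows nor cells grow without limit.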


HBase Values

Values:  Time 2.0 | Value 0 | ... | Time 2.N | Value N

• Time offsets in values too: fixed width (single byte)
• Values stored as VarInts: can be any width
• Many values per cell: keeps key count lower, reducing MemStore size
  * difficult without an append-only model like ours
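A cell holding many (time offset, value) pairs might be packed as in the sketch below. The slides do not say which VarInt encoding is used, so this assumes an unsigned base-128 encoding with the low seven bits first; the function names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative int as a base-128 varint (low 7 bits first)."""
    if n < 0:
        raise ValueError("only non-negative values are supported in this sketch")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def encode_cell(points):
    """Pack (time_offset, value) pairs: one offset byte, then a varint value."""
    cell = bytearray()
    for offset, value in points:
        cell.append(offset)  # bytearray.append rejects values outside 0..255
        cell += encode_varint(value)
    return bytes(cell)


def decode_cell(cell):
    """Inverse of encode_cell."""
    points, i = [], 0
    while i < len(cell):
        offset = cell[i]
        i += 1
        value, shift = 0, 0
        while True:
            byte = cell[i]
            i += 1
            value |= (byte & 0x7F) << shift
            shift += 7
            if not byte & 0x80:
                break
        points.append((offset, value))
    return points


cell = encode_cell([(0, 283), (1, 2), (2, 300000)])
```

Small counts cost two bytes per point (offset plus one varint byte), and appending a new point is a plain concatenation, which fits the append-only model.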


Links

• Architecture @ NBS - highscalability.com: http://highscalability.com/blog/2014/1/28/how-next-big-sound-tracks-over-a-trillion-song-plays-likes-a.html
• HBlocks White Paper: https://dl.dropboxusercontent.com/u/65158725/hblocks.pdf
• Jobs @ NBS: https://www.nextbigsound.com/about%23join

Alec Zopf, Eric Czech
