web-scale data processing: practical approaches for low-latency and batch

$>whoami

Edward Capriolo

Developer @ dstillery (the company formerly known as m6d, aka media6degrees)

Hive: Project Management Committee

Hadoop'in it since 0.17.2

Cassandra'in it since 0.6.X

Hive'in it since 0.3.X

Incredibly skilled with PowerPoint

Agenda for this talk

Batch processing via Hadoop

Stream processing

Relational Databases and NoSQL

Life lessons, quips, and other perspectives

Before we talk tech...

Let's talk math!

Yay! math fun! (as people start leaving room)
Don't worry. It is only a couple slides.

I wanted to talk about relational algebra since it is the foundation of relational databases

Even in the NoSQL age, relational algebra is alive and well

Relational algebra...
A big slide with many words

Relational algebra received little attention outside of pure mathematics until the publication of E.F. Codd's relational model of data in 1970. Codd proposed such an algebra as a basis for database query languages.

In computer science, relational algebra is an offshoot of first-order logic and of algebra of sets concerned with operations over finitary relations, usually made more convenient to work with by identifying the components of a tuple by a name (called attribute) rather than by a numeric column index, which is called a relation in database terminology.

http://en.wikipedia.org/wiki/Relational_algebra

Operators of Relational algebra:

Projection

SELECT Age, Weight ...

Extended projections

SELECT Age+Weight as X ...

SELECT ROUND(Weight),Age+1 as X ...

Selection

SELECT * FROM Person

SELECT * FROM Person WHERE Age >=34

SELECT * FROM Person WHERE Age = Weight

Joins

SELECT * FROM Car JOIN Boat on (CarPrice >= BoatPrice)

SELECT * FROM Car JOIN Boat on (CarPrice = BoatPrice)

Aggregate

SELECT sum(C) FROM r

SELECT A, sum(C) FROM r GROUP BY A

http://www.cbcb.umd.edu/confcour/CMSC424/Relational_algebra.pdf
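The four operators above can be sketched over plain lists of dictionaries. This is only an illustration of what each SQL statement on the slides computes, not how Hive or any database implements it; the sample rows, the `Name`, `Car`, and `Boat` columns, and the helper function names are all assumptions made up for the example.

```python
# Hypothetical sample relations; every value here is invented for illustration.
person = [
    {"Name": "Ann", "Age": 34, "Weight": 60},
    {"Name": "Bob", "Age": 28, "Weight": 28},
    {"Name": "Cat", "Age": 40, "Weight": 75},
]
car = [{"Car": "sedan", "CarPrice": 20000}, {"Car": "coupe", "CarPrice": 35000}]
boat = [{"Boat": "dinghy", "BoatPrice": 20000}, {"Boat": "yacht", "BoatPrice": 90000}]
r = [{"A": "x", "C": 1}, {"A": "x", "C": 2}, {"A": "y", "C": 5}]


def project(rows, *columns):
    """Projection: SELECT col1, col2 ... keeps only the named attributes."""
    return [{c: row[c] for c in columns} for row in rows]


def extend(rows, **exprs):
    """Extended projection: each output column is computed from the input row."""
    return [{name: fn(row) for name, fn in exprs.items()} for row in rows]


def select(rows, predicate):
    """Selection: SELECT * ... WHERE predicate keeps only matching rows."""
    return [row for row in rows if predicate(row)]


def join(left, right, predicate):
    """Join: pair every left row with every right row, keep pairs passing ON."""
    return [{**l, **rt} for l in left for rt in right if predicate(l, rt)]


def sum_by(rows, group_col, sum_col):
    """Aggregate: SELECT group_col, sum(sum_col) ... GROUP BY group_col."""
    totals = {}
    for row in rows:
        totals[row[group_col]] = totals.get(row[group_col], 0) + row[sum_col]
    return totals


ages = project(person, "Age", "Weight")                       # SELECT Age, Weight
x = extend(person, X=lambda p: p["Age"] + p["Weight"])        # SELECT Age+Weight as X
older = select(person, lambda p: p["Age"] >= 34)              # WHERE Age >= 34
same = select(person, lambda p: p["Age"] == p["Weight"])      # WHERE Age = Weight
ge_join = join(car, boat, lambda c, b: c["CarPrice"] >= b["BoatPrice"])
eq_join = join(car, boat, lambda c, b: c["CarPrice"] == b["BoatPrice"])
totals = sum_by(r, "A", "C")                                  # GROUP BY A
```

The nested-loop `join` is the simplest possible strategy; it is used here because it handles the `>=` condition and the `=` condition with the same code.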

Other Operators

Set operations

Union

Intersection

Cartesian Product

Outer joins

LEFT

RIGHT

FULL

Semi Join / Exists
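A rough sketch of the other operators, again over in-memory data. The relations, column names, and helper functions are hypothetical and exist only to show what each operator returns; RIGHT and FULL outer joins follow the same padding idea as the LEFT case shown here.

```python
from itertools import product

# Hypothetical relations stored as sets of (name, city) tuples.
customers = {("ann", "nyc"), ("bob", "sf")}
prospects = {("bob", "sf"), ("cat", "la")}

union = customers | prospects                    # rows in either relation
intersection = customers & prospects             # rows in both relations
cartesian = set(product(customers, prospects))   # every possible pairing

# Hypothetical relations stored as lists of dictionaries.
people = [{"name": "ann"}, {"name": "bob"}]
orders = [{"customer": "ann", "book": "Programming Hive"}]


def left_outer_join(left, right, left_key, right_key):
    """Keep every left row; pad right-side columns with None (SQL NULL) if unmatched."""
    right_cols = sorted({col for row in right for col in row})
    out = []
    for l in left:
        matches = [rt for rt in right if rt[right_key] == l[left_key]]
        if matches:
            out.extend({**l, **m} for m in matches)
        else:
            out.append({**l, **dict.fromkeys(right_cols)})
    return out


def semi_join(left, right, left_key, right_key):
    """Keep left rows that have at least one match; emit no right-side columns."""
    keys = {rt[right_key] for rt in right}
    return [l for l in left if l[left_key] in keys]


outer = left_outer_join(people, orders, "name", "customer")
buyers = semi_join(people, orders, "name", "customer")
```

The semi join answers "does a match exist?" and never duplicates a left row, which is why it corresponds to an EXISTS subquery rather than to an ordinary join.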

Batch Processing and Big Data

When Hadoop came on the scene it was a game changer because:

Viable implementation of Google's MapReduce white paper

Worked with commodity hardware

Had no exorbitant software fees

Scaled processing and storage with growing companies, typically without needing processes to be redesigned

Archetype Hadoop deployment
(circa Facebook 2009)

[Diagram components: Web Servers, Scribe Writers, Realtime Hadoop Cluster, Hadoop Hive Warehouse, Oracle RAC, MySQL, Scribe MidTier]

http://hadoopblog.blogspot.com/2009/06/hdfs-scribe-integration.html

The Hadoop archetype

Component generating events (web servers)

Component collecting logs into Hadoop (Scribe)

Translation of raw data using Hadoop and Hive

Output of rollups to Oracle and other data systems; feedback loops (MySQL, Hive)

Use case: Book store

Our book store will be named (say it with me!):

Web scale,

Big Data,

No SQL,

Real Time Analytics,

Books!

One more time!

Web scale, Big Data, No SQL, Real Time Analytics, Books

(A buzzword bingo company)

Domain model

{
  "id": "00001",
  "refer": "http://affiliate1.superbooks.com",
  "ip": "209.191.139.200",
  "status": "ACCEPTED",
  "eventTimeInMillis": 1383011801439,
  "credit_hash": "ab45de21",
  "email": "[email protected]",
  "purchases": [
    { "name": "Programming Hive", "cost": 30.0 },
    { "name": "frAgile Software Development", "cost": 0.2 }
  ]
}

Complex serialized payloads

Web logs, in Facebook's case, were NOT always tab-delimited text files

In many cases Scribe was logging complex structures in Thrift format

Hadoop (and Hive) can work with complex records not typical in an RDBMS
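To make the contrast with tab-delimited files concrete, here is a small sketch that flattens one nested record into flat rows. It uses JSON rather than Thrift purely for readability, keeps only a subset of the fields from the domain model, and the `to_tsv_rows` helper is a hypothetical name invented for this example.

```python
import json

# A trimmed-down record modeled on the domain model; JSON stands in for Thrift here.
raw = """
{"id": "00001",
 "refer": "http://affiliate1.superbooks.com",
 "status": "ACCEPTED",
 "purchases": [{"name": "Programming Hive", "cost": 30.0},
               {"name": "frAgile Software Development", "cost": 0.2}]}
"""

record = json.loads(raw)


def to_tsv_rows(rec):
    """Flatten one nested record into one tab-delimited line per purchase.

    The order-level fields (id, refer, status) are repeated on every line,
    which is the price of forcing a nested structure into flat text.
    """
    return [
        "\t".join([rec["id"], rec["refer"], rec["status"], p["name"], str(p["cost"])])
        for p in rec["purchases"]
    ]


rows = to_tsv_rows(record)
for line in rows:
    print(line)
```

Keeping the record nested avoids that duplication and lets the query decide when, and whether, to flatten the array.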

Log collection/ingestion

http://flume.apache.org/FlumeUserGuide.html

Several ingestion approaches

Scribe never took off

Chukwa (hangs around, not sexy)

Log servers write directly with the HDFS API

Duct-taped set of shell scripts

Flume seems to be the most widely used, feature rich, and supported system

Left up to the user...

What format do you want the raw data in?

How should the data be staged in HDFS?

hourly directories

by host

How to monitor?

Semantics of what the pipeline should do if files stop appearing

Application specific sanity checks
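One way these decisions might look in practice is sketched below: a path layout with hourly directories split by host, plus a check for hours in which nothing was staged. The `/logs/raw` base path, the `web01` host name, the directory layout, and both function names are assumptions for illustration; a real pipeline would list directories through the HDFS API rather than receive them as a Python set.

```python
from datetime import datetime, timedelta, timezone


def staging_path(base, host, event_millis):
    """Build an hourly, per-host staging directory from an event timestamp (UTC)."""
    ts = datetime.fromtimestamp(event_millis / 1000, tz=timezone.utc)
    return f"{base}/{ts:%Y/%m/%d/%H}/{host}"


def missing_hours(present_hours, start, end):
    """Return every hour in [start, end) that has no staged directory."""
    missing = []
    hour = start
    while hour < end:
        if hour not in present_hours:
            missing.append(hour)
        hour += timedelta(hours=1)
    return missing


# The timestamp is eventTimeInMillis from the domain model.
path = staging_path("/logs/raw", "web01", 1383011801439)

# Pretend hours 00 and 02 arrived but hour 01 did not.
start = datetime(2013, 10, 29, 0, tzinfo=timezone.utc)
present = {start, start + timedelta(hours=2)}
gaps = missing_hours(present, start, start + timedelta(hours=3))
```

What to do with a non-empty `gaps` list (alert, retry, hold downstream jobs) is exactly the part that is left up to the user.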

Unleash the hounds!

SELECT refer, sum(purchase.cost)
FROM store_transaction
LATERAL VIEW explode(purchases) plist AS purchase
WHERE refer = 'y'
GROUP BY refer
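For readers new to LATERAL VIEW, the query can be read as three steps: explode the array so each purchase becomes its own row, filter on `refer`, then sum per group. The sketch below mimics those steps in plain Python. The sample rows are invented, the array column is assumed to be named `purchases` as in the domain model, and the function names are hypothetical; it says nothing about how Hive actually executes the query.

```python
from collections import defaultdict

# Hypothetical contents of store_transaction, following the domain model.
store_transaction = [
    {"refer": "y", "purchases": [{"name": "Programming Hive", "cost": 30.0},
                                 {"name": "frAgile Software Development", "cost": 0.2}]},
    {"refer": "y", "purchases": [{"name": "Programming Hive", "cost": 30.0}]},
    {"refer": "z", "purchases": [{"name": "Programming Hive", "cost": 30.0}]},
    {"refer": "y", "purchases": []},
]


def lateral_view_explode(rows, array_col, alias):
    """Emit one output row per array element, carrying the parent columns along.

    Rows whose array is empty produce no output, as with a plain LATERAL VIEW.
    """
    for row in rows:
        for element in row[array_col]:
            yield {**row, alias: element}


def revenue_by_refer(rows, refer_filter):
    """Explode purchases, keep rows WHERE refer matches, then sum cost GROUP BY refer."""
    totals = defaultdict(float)
    for row in lateral_view_explode(rows, "purchases", "purchase"):
        if row["refer"] == refer_filter:
            totals[row["refer"]] += row["purchase"]["cost"]
    return dict(totals)


exploded = list(lateral_view_explode(store_transaction, "purchases", "purchase"))
result = revenue_by_refer(store_transaction, "y")
print(result)
```

Note that the filter runs before the grouping, which is why WHERE has to come before GROUP BY in the SQL text as well.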

Hive and relational algebra

Technology