2014.11.25- slide 1is 257 – fall 2014 newsql and voltdb university of california, berkeley school...

2014.11.25- SLIDE 1IS 257 – Fall 2014

NewSQL and VoltDB

University of California, Berkeley

School of Information

IS 257: Database Management

2014.11.25- SLIDE 2

Project Presentation

• Sign up at • http://doodle.com/vtds6huqx45i95x9

IS 257 – Fall 2014

2014.11.25- SLIDE 3

History of the World, Part 1

• Relational Databases – mainstay of business

• Web-based applications caused spikes– Especially true for public-facing e-Commerce

sites• Developers begin to front RDBMS with

memcache or integrate other caching mechanisms within the application (ie. Ehcache)

2014.11.25- SLIDE 4

Scaling Up

• Issues with scaling up when the dataset is just too big

• RDBMS were not designed to be distributed• Began to look at multi-node database solutions• Known as ‘scaling out’ or ‘horizontal scaling’• Different approaches include:

– Master-slave– Sharding

2014.11.25- SLIDE 5

Scaling RDBMS – Master/Slave• Master-Slave

–All writes are written to the master. All reads performed against the replicated slave databases

–Critical reads may be incorrect as writes may not have been propagated down

–Large data sets can pose problems as master needs to duplicate data to slaves

2014.11.25- SLIDE 6

Scaling RDBMS - Sharding

• Partition or sharding–Scales well for both reads and writes–Not transparent, application needs to be

partition-aware–Can no longer have relationships/joins

across partitions–Loss of referential integrity across

shards

2014.11.25- SLIDE 7

Other ways to scale RDBMS

• Multi-Master replication• INSERT only, not UPDATES/DELETES• No JOINs, thereby reducing query time

– This involves de-normalizing data• In-memory databases

2014.11.25- SLIDE 8

NoSQL

• NoSQL databases adopted these approaches to scaling, but lacked ACID transaction and SQL

• At the same time, many Web-based services needed to deal with Big Data (the Three V’s we looked at last time) and created custom approaches to do this

• In particular, MapReduce…

IS 257 – Fall 2014

2014.11.25- SLIDE 9

MapReduce and Hadoop

• MapReduce developed at Google• MapReduce implemented in Nutch

– Doug Cutting at Yahoo! – Became Hadoop (named for Doug’s child’s

stuffed elephant toy)

IS 257 – Fall 2014

2014.11.25- SLIDE 10

Example

• Page 1: the weather is good• Page 2: today is good• Page 3: good weather is good.

From “MapReduce: Simplified data Processing… ”, Jeffrey Dean and Sanjay Ghemawat

IS 257 – Fall 2014

2014.11.25- SLIDE 11

Map output

• Worker 1: – (the 1), (weather 1), (is 1), (good 1).

• Worker 2: – (today 1), (is 1), (good 1).

• Worker 3: – (good 1), (weather 1), (is 1), (good 1).


IS 257 – Fall 2014

2014.11.25- SLIDE 12

Reduce Input

• Worker 1:– (the 1)

• Worker 2:– (is 1), (is 1), (is 1)

• Worker 3:– (weather 1), (weather 1)

• Worker 4:– (today 1)

• Worker 5:– (good 1), (good 1), (good 1), (good 1)


IS 257 – Fall 2014

2014.11.25- SLIDE 13

Reduce Output

• Worker 1:– (the 1)

• Worker 2:– (is 3)

• Worker 3:– (weather 2)

• Worker 4:– (today 1)

• Worker 5:– (good 4)


IS 257 – Fall 2014

2014.11.25- SLIDE 14

But – Raw Hadoop means code• Most people don’t want to write code if

they don’t have to• Various tools layered on top of Hadoop

give different, and more familiar, interfaces

• Hbase – intended to be a NoSQL database abstraction for Hadoop

• Hive and it’s SQL-like language

IS 257 – Fall 2014

2014.11.25- SLIDE 15

Introduction to PigPIG – A data-flow language for MapReduce

2014.11.25- SLIDE 16

Pig Latin

• Data flow language– User specifies a sequence of operations to

process data– More control on the processing, compared

with declarative language• Various data types are supported• ”Schema”s are supported• User-defined functions are supported

16

2014.11.25- SLIDE 17

Motivation by Example

• Suppose we have user data in one file, website data in another file.

• We need to find the top 5 most visited pages by users aged 18-25

17

2014.11.25- SLIDE 18

Hive - SQL on top of Hadoop

IS 257 – Fall 2014

2014.11.25- SLIDE 19

Hive

• A database/data warehouse on top of Hadoop– Rich data types (structs, lists and maps)– Efficient implementations of SQL filters, joins and

group-by’s on top of map reduce

• Allow users to access Hive data without using Hive

• Link:– http://svn.apache.org/repos/asf/hadoop/

hive/trunk/

IS 257 – Fall 2014

2014.11.25- SLIDE 20

Hive Architecture

HDFS

Hive CLIDDLQueriesBrowsing

Map Reduce

SerDe

Thrift Jute JSONThrift API

MetaStore

Web UIMgmt, etc

Hive QL

Planner ExecutionParser Planner

IS 257 – Fall 2014

2014.11.25- SLIDE 21

Overview

• Review– Big Data Technologies– Hadoop, Pig, Hive…

• NewSQL– The concept– VoltDB

IS 257 – Fall 2014

2014.11.25- SLIDE 22

Spark

• One problem with Hadoop/MapReduce is that it is fundamental batch oriented, and everything goes through a read/write on HDFS for every step in a dataflow

• Spark was developed to leverage the main memory of distributed clusters and to, whenever possible, use only memory-to-memory data movement (with other optimizations

• Can give up to 100fold speedup over MRIS 257 – Fall 2014

2014.11.25- SLIDE 23

Meanwhile…

• The database community continues to develop new approaches, some of which try to provide the benefits of SQL and ACID relations to big data

• As we have seen before there is a broad spectrum of database systems and capabilities…

IS 257 – Fall 2014

2014.11.25- SLIDE 24IS 257 – Fall 2014

2014.11.25- SLIDE 25

NewSQL

• NewSQL is a class of modern relational database management systems that seek to provide the same scalable performance of NoSQL systems for online transaction processing (OLTP) read-write workloads while still maintaining the ACID guarantees of a traditional RDBMS

IS 257 – Fall 2014

2014.11.25- SLIDE 26

NewSQL

• NewSQL systems focus on workloads that have large numbers of transactions that are:– Short duration– Touch a small fraction of data in the DB using

index lookups (i.e., no full table scans or large distributed joins)

– Repetitive (i.e. executing the same queries with different inputs)

IS 257 – Fall 2014

2014.11.25- SLIDE 27

NewSQL

• There are many different architectures for NewSQL DBs– E.g., Google Spanner, SAP HANA, Clustrix,

VoltDB, etc.• We are going to look at just one example

(VoltDB) and see how it works for very high-velocity workloads..

IS 257 – Fall 2014

2014.11.25- slide 1is 257 – fall 2014 newsql and voltdb university of california, berkeley school...

Documents

vertical scaling

scaling upissues

hashbased partitioning

rangebased partitioning

based applications

webbased application

single table

master needs