NoSQL, Big Data, and Real Time - qconsp...

©2013 DataStax Confidential. Do not distribute without consent.

Benjamin Coverston
DSE Architect, DataStax Inc.

NoSQL, Big Data, and Real Time

Monday, September 2, 13

Who am I?
• Ben Coverston
• DSE Architect
• DataStax since 2010
• Previous Experience in the Travel Industry
  • Low Cost Airlines / Web Reservations
• Past: HP / Accenture
• Lived in Santa Catarina for a few years.


What is it?

NoSQL


What is NoSQL?

NoSQL is a term coined by Carlo Strozzi and repurposed by Eric Evans to refer to “some” storage systems. The NoSQL term should be used as in the Not-Only-SQL and not as No to SQL or Never SQL.

-- Alex Popescu


What is NoSQL (Cont.)
• It’s not:
  • No to SQL
  • About performance
  • About scaling
  • ACID
  • Eventual consistency
  • Volume
• It is:
  • About choice


Diversity in Data
• Big Data has the 3 (or 4 or 5) V’s
  • Volume
  • Variety
  • Velocity
  • Variability (sometimes)
  • Value (other times)


Diversity in Data
• The V’s don’t cover everything
  • Availability is important
  • Your use case is important too


Is NoSQL Big Data?
• You can store Big Data with an RDBMS
  • Is it easy?
  • Is it cost effective?
  • What kind of compromises do you have to make?


The Problems
• In general there are two classes of data problems
  • OLTP (Real-Time)
  • Analytics (Batch)
• Usually you want both
• No solution is perfect for everyone
• Popularity is no indication of fitness


Use Cases
• OLTP
  • Low Latency
  • High Throughput
  • LOB Applications
• Batch
  • Predictive Models
  • Complex Queries
  • Tomorrow (or precalculated, but now we need OLTP)


Where to put your ‘Stuff’
• Sharded RDBMS
• MPP -- Greenplum, Teradata
• Hadoop
• Key/Value
• Columnar
• Other


Why not just one?
• Analytics
  • Optimize serial IO
  • Limitations in Storage
• OLTP
  • Working Set
  • Distribution
  • Availability
  • Storage Medium


Why Do We Need Something Else?
• ACID semantics are often overkill
• ACID also makes the database layer brittle
• This means you get less Availability (CAP Theorem)


The Application Stack
[Diagram: www.example.com; load balancers LB1, LB2, LB3; web servers ws1 through ws9; cache; DB# 1, 2, 3, 4]



Sharding
• Storage Limitations
• Working Set
• So just make more!


(“The eBay Architecture,” Randy Shoup and Dan Pritchett)


But Sharding
• Is Painful
• Requires ‘something else’
• Most NoSQL solutions auto-shard
• Sharding requires tradeoffs.
• Which means your application will need to change
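One way to see why hand-rolled sharding is painful is a minimal sketch of the common "hash the key, take it modulo the shard count" scheme. The function names and the `user:<n>` key format here are illustrative assumptions, not anything from the talk. The point it shows: when you "just make more" shards, most keys map to a different shard and have to be moved.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index using a stable hash (hypothetical scheme)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def moved_fraction(keys, old_shards: int, new_shards: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(
        1 for k in keys if shard_for(k, old_shards) != shard_for(k, new_shards)
    )
    return moved / len(keys)


# Hypothetical keys; going from 4 to 5 shards relocates most of them.
keys = [f"user:{i}" for i in range(10_000)]
fraction = moved_fraction(keys, 4, 5)
print(f"fraction of keys that move: {fraction:.2f}")
```

A key stays put only when its hash gives the same remainder for both shard counts, which is why the application (or 'something else') ends up owning a data migration every time capacity is added.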


Which should I choose?
• Analytics
  • Hadoop (probably) if your data is big
  • Spark, other (sometimes faster) solutions available now
• NoSQL
  • Let’s talk!


Decisions are about tradeoffs, never a zero-sum game

Fast, Cheap, Good -- Choose Two


CAP Theorem
• More of two, less of one
  • Consistency
  • Availability
  • Partition tolerance
• You have to accept P
  • That leaves C and A


How To Scale Anything
• Partition By Function
• Split Horizontally
• Avoid Distributed Transactions
• Avoid Synchronous Coupling
• Virtualize Everywhere
• Cache Everything


Partition By Function
• Don’t put everything in the same database
• Physical
  • Pools of Machines
  • Geographical Distribution
  • Automatic sharding (look for this)
    • Make sure it works!
• Virtual
  • Logical Tables, Schema
  • Not 100% necessary, but schema is nice


Partitioning (cont.)
• Pros
  • Isolate failure
    • To a region
    • To a service
  • Simplify Failover
• Cons
  • Your DB has to handle multi-region replication
    • If you chose CP (CAP) you’re going to have a bad time
    • AP systems do OK here (Cassandra actually excels)
  • “Relational” part of databases becomes complex
  • Everything gets denormalized


Split Horizontally
• Scaling Vertically is easy
  • To a point, then it gets expensive.. fast..
• Easy if your system has no state to maintain
  • Or if the states are known, and small
  • Sharding over dependent fields complicates design
• Some things distribute themselves easily
  • key/value stores
• Others not so much
  • BTree indexes, foreign keys
• P2P architecture is helpful when splitting
  • In other words, avoid masters
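To make "key/value stores distribute themselves easily" and "avoid masters" concrete, here is a minimal sketch of a consistent-hash ring, a technique widely used by peer-to-peer key/value stores. The `HashRing` class, the node names, and the choice of 64 virtual nodes are assumptions made for illustration; this is not the implementation of any particular database. Every peer can compute a key's owner locally, so no master has to be consulted.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    """Stable 64-bit token derived from a string."""
    return int.from_bytes(hashlib.md5(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    """Consistent-hash ring; each node owns several points (virtual nodes)."""

    def __init__(self, nodes, vnodes: int = 64):
        self._ring = sorted(
            (_token(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._tokens = [token for token, _ in self._ring]

    def node_for(self, key: str) -> str:
        # The owner is the first ring point clockwise from the key's token,
        # wrapping around to the start of the ring when needed.
        idx = bisect.bisect_right(self._tokens, _token(key)) % len(self._ring)
        return self._ring[idx][1]


# Hypothetical cluster: compare key placement before and after adding a node.
ring3 = HashRing(["node-a", "node-b", "node-c"])
ring4 = HashRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"user:{i}" for i in range(10_000)]
moved = [k for k in keys if ring3.node_for(k) != ring4.node_for(k)]
print(f"{len(moved)} of {len(keys)} keys moved, all of them to the new node")
```

Unlike modulo sharding, adding a node only relocates the keys the new node takes over; the existing nodes do not trade keys with each other.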


Split Horizontally (cont)
• Pros
  • Can be as fast or faster than traditional design
  • Can scale up as long as you can afford more machines
  • Scaling is easy if you avoid having masters
  • Replication and failover don’t have to be special cases
• Cons
  • Even logical pieces of your app are distributed over many machines
    • example: your catalog is not all in one place
  • Real time analytics is difficult, or slower


Avoid Distributed Transactions
• Have you tried this?
• Hard to do right
  • Paxos gives us some hope
  • CAS in Cassandra 2.0 looks promising
  • Even then, it’s not good for everything!
• MVCC works for many use cases
• Compensating Mechanisms
  • Customer Service (Amazon, inventory)
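The "compensating mechanisms" bullet can be sketched as follows: accept the order without any cross-service lock, then reconcile against real inventory later and compensate the customers who were oversold. The `Store` class, its fields, and the "refund + apology" action are hypothetical names chosen for this illustration, not a description of how any retailer actually works.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy store that accepts orders optimistically (illustrative only)."""

    stock: int
    orders: list = field(default_factory=list)
    compensations: list = field(default_factory=list)

    def accept_order(self, order_id: str) -> None:
        # Accept immediately: no distributed transaction, no inventory lock.
        self.orders.append(order_id)

    def reconcile(self) -> None:
        # Later, compare accepted orders with the stock that really exists
        # and issue a compensating action for every order beyond it.
        oversold = self.orders[self.stock:]
        self.orders = self.orders[: self.stock]
        for order_id in oversold:
            self.compensations.append((order_id, "refund + apology"))


store = Store(stock=2)
for order_id in ("o1", "o2", "o3"):
    store.accept_order(order_id)
store.reconcile()
print(store.orders, store.compensations)
```

The trade is explicit: the write path stays fast and available, and the rare conflict is handled after the fact by a business process instead of being prevented by a lock.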


Avoid Distributed Transactions
• Pros
  • Consistency in a distributed environment
• Cons
  • Slow
  • Overkill
  • Did I say slow?
  • We chose CP so we get less A
  • What happens when they don’t succeed?
    • Do we shut the whole thing down?


Avoid Synchronous Coupling
• What?
  • A or B can be down
  • A can be down, B continues to work
  • B can suffer, while A continues to work
• If your recommendation engine fails, your customers can still buy stuff!
• Master/Slave failover is a good example of synchronous coupling
  • Master is down, slave needs to take over, but in the meantime.. what happens?
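The recommendation-engine example can be sketched in a few lines. The function names and the page dictionary are hypothetical; the sketch assumes the recommendation service signals failure by raising `ConnectionError` or `TimeoutError`. The idea is simply that the purchase path never depends on the optional service succeeding.

```python
def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a call to a separate recommendation service (assumed)."""
    raise ConnectionError("recommendation service is down")


def render_product_page(user_id: str, product: str,
                        recommender=fetch_recommendations) -> dict:
    """Build the page; recommendations are optional, buying is not."""
    page = {"product": product, "can_buy": True}
    try:
        page["recommendations"] = recommender(user_id)
    except (ConnectionError, TimeoutError):
        # Degrade gracefully: show the page without recommendations.
        page["recommendations"] = []
    return page


page = render_product_page("u1", "book")
print(page)
```

In a real deployment the call would also need a short timeout, otherwise a slow dependency couples the two services just as tightly as a failed one.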


Avoid Synchronous Coupling
• Pros
  • Fewer shared dependencies means less failure
  • Less failure means more total uptime
    • For the whole
  • Less coupling means that your application topology is more modular
  • Introducing new, decoupled services is less risky
• Cons
  • More duplication of your infrastructure
    • e.g. now you have an application stack for each of your services.


Avoid Synchronous Processing Flows
• AKA
  • Blocking Sockets
  • Serialized Processes
  • Locking in General
• Do what is important FIRST
  • Take their money
  • Modify Inventory
• Other less important stuff can be queued
  • Triggers
  • Joins
  • Stored Procedures
  • Consistency Checks
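A minimal sketch of "do what is important first, queue the rest", using Python's standard `queue` module. The in-memory `payments`, `inventory`, and `log` structures and the task names are assumptions for illustration; a production system would use a durable queue and a separate worker process.

```python
import queue

deferred = queue.Queue()       # work that can wait
inventory = {"sku-1": 3}       # hypothetical in-memory inventory
payments = []                  # hypothetical payment ledger
log = []                       # record of deferred work that has run


def checkout(order_id: str, sku: str, amount: int) -> str:
    # Critical path: take their money and modify inventory, nothing else.
    payments.append((order_id, amount))
    inventory[sku] -= 1
    # Nice-to-haves are queued instead of blocking the customer.
    deferred.put(("consistency_check", order_id))
    deferred.put(("send_receipt", order_id))
    return "ok"


def drain() -> None:
    """Process queued tasks; in practice a background worker would do this."""
    while True:
        try:
            task, order_id = deferred.get_nowait()
        except queue.Empty:
            return
        log.append(f"{task}:{order_id}")
        deferred.task_done()


status = checkout("o1", "sku-1", 10)
pending = deferred.qsize()
drain()
print(status, pending, log)
```

The queue depth (`pending` here) is also the natural thing to monitor: a growing backlog points directly at the slow stage.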


Avoid Synchronous Processing Flows
• Pros
  • Critical operations will not block for nice-to-haves
  • Easy to monitor queues and assign priority to tasks
  • Problem areas are easier to identify
• Cons
  • Race conditions
  • More up front development cost


Virtualize Everything
• DON’T
  • Pick your database because it has a sexy API
  • Pick your database because it worked for somebody else
• DO
  • Pick a database that will fit with your use case
  • Virtualize your data model
    • Encourage manipulation of your logical models
    • DO NOT force interaction with your database
  • Good virtualization means that you can change your data store later...
    • And most of your code will still work.
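One common way to virtualize a data model is to put a small interface between application code and the data store. The sketch below assumes a hypothetical `UserRepository` interface with an in-memory implementation; the names are illustrative, and a Cassandra-backed or RDBMS-backed class could be swapped in behind the same interface.

```python
from abc import ABC, abstractmethod
from typing import Optional


class UserRepository(ABC):
    """Logical model boundary: application code depends only on this."""

    @abstractmethod
    def get(self, user_id: str) -> Optional[dict]: ...

    @abstractmethod
    def save(self, user_id: str, profile: dict) -> None: ...


class InMemoryUserRepository(UserRepository):
    """One possible backend; another store could replace it later."""

    def __init__(self) -> None:
        self._rows = {}

    def get(self, user_id: str) -> Optional[dict]:
        row = self._rows.get(user_id)
        return dict(row) if row is not None else None

    def save(self, user_id: str, profile: dict) -> None:
        self._rows[user_id] = dict(profile)


def rename_user(repo: UserRepository, user_id: str, new_name: str) -> dict:
    # Application logic manipulates the logical model, never the database API.
    profile = repo.get(user_id) or {}
    profile["name"] = new_name
    repo.save(user_id, profile)
    return profile


repo = InMemoryUserRepository()
renamed = rename_user(repo, "u1", "Ana")
print(renamed)
```

Because `rename_user` only sees the interface, changing the data store later means writing one new class rather than touching every caller.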


Virtualize Everything
• Virtualization isn’t just for the programmer
• Things fall apart
  • Requests have to be re-routed
  • Parts Replaced
  • APIs change
• Good virtualization means you can make changes w/o impacting availability


Cache Appropriately
• You can’t cache everything
• But you can cache stuff that doesn’t change
• Or is expensive to retrieve


Cache Appropriately
• Pros
  • Cache is fast (compared to traditional RDBMS access)
  • Can give you a performance buffer
• Cons
  • Cache Coherence
  • Cache Dependency
    • Is it a SPOF?
  • What if it all doesn’t fit?
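The coherence concern can be illustrated with a small read-through cache that expires entries after a time-to-live. `TTLCache`, `slow_lookup`, and the 10-second TTL are hypothetical choices for this sketch; it bounds staleness in time but does not solve coherence, and it has no size limit, so the "what if it all doesn't fit?" question still applies.

```python
import time

calls = []  # records every trip to the (hypothetical) slow backing store


def slow_lookup(key: str) -> str:
    """Stand-in for an expensive database read."""
    calls.append(key)
    return key.upper()


class TTLCache:
    """Read-through cache whose entries expire after ttl_seconds."""

    def __init__(self, loader, ttl_seconds: float, clock=time.monotonic):
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (value, expires_at)

    def get(self, key: str):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]  # fresh hit: no trip to the backing store
        value = self._loader(key)  # miss or expired: reload and re-cache
        self._entries[key] = (value, now + self._ttl)
        return value


cache = TTLCache(slow_lookup, ttl_seconds=10)
first = cache.get("a")
second = cache.get("a")  # served from the cache while the entry is fresh
print(first, second, calls)
```

If the application cannot function when the cache is cold or unavailable, the cache has become a dependency rather than a performance buffer, which is the SPOF question on the slide.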


What about NoSQL?
• All of this applies
• Evaluate Products on their Strengths
• If easy things are easy
  • The hard might be impossible
• Pick something that makes the hard things possible


What are the ‘easy’ things?
• Serialization Formats
  • JSON/BSON
• Data Models
• HTTP/REST/JSON APIs
• NodeJS Drivers!
• etc.


The ‘hard’ things
• Automatic Sharding
  • Where does the data go?
  • How do I find it?
  • How do I add another?
• Multi DC
  • Replication
  • No SPOF
• Anti-Entropy
• Continuous Availability
  • Upgrades
  • Failure
• Etc.


What should you use?
• Your decision
• Not every database is a fit for every problem.


DataStax Enterprise
• DSE
  • Cassandra (OLTP)
  • Analytics
  • Search
• The hard things are possible
• We’re making the easy things easier

