No-SQL Databases for High Volume Data
Edward Wijnen
3 November 2014
Target Conference 2014
The New Connected World Needs a Revolutionary New DBMS
“The Internet of Things”
[Diagram: timeline with eras labeled 1970’s, 1990’s and Today; platform labels Mainframe, Client-Server, Mobile, Social, Cloud; connectivity labels Isolated, Semi-Connected, Radically Connected]
©2014 DataStax Confidential. Do not distribute without consent.
Businesses Must Close the Gap…and Fast
[Diagram labels: Your Business, Your Customers, Connected Customers, Connected Partners, Connected Employees, Connected Devices, Connected Products]
If You Started With This
[Diagram labels: Connected Partners, Connected Employees, Connected Devices, Connected Products]
You Would End With This
[Diagram labels: Distributed Transactional Database, Connected Customers]
Apache Cassandra™
• Apache Cassandra™ is a massively scalable, open source, NoSQL, distributed database built for modern, mission-critical online applications
• Written in Java; a hybrid of Amazon Dynamo and Google BigTable
• Masterless with no single point of failure
• Distributed and data centre aware
• 100% uptime
• Predictable scaling
[Diagram labels: Dynamo, BigTable, Cassandra]
BigTable: http://research.google.com/archive/bigtable-osdi06.pdf
Dynamo: http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Apache Cassandra™
You are already using it!
Distributed Transactional Database Advantages
[Diagram, built up over several slides: first C* in The Hague, then C* in both The Hague and London]
Use Case: Recommendations / Personalization
• Delivers 150+ billion content recommendations per month
• Serves content for the largest media brands in the world: Reuters, Wall St Journal, USA Today
• Needed a massively scalable data store
• High velocity of data with 58,000 links to content per second
• Always-on data architecture
Distributed Transactional Database Advantages
[Diagram, built up over several slides: C* in The Hague and London, then C* in The Hague, Groningen and London, ending with four C* icons across the three locations]
Use Case: Personalization
Netflix Delights Customers with Personal Recommendations
• World’s leading streaming media provider with digital revenue $1.5BN+
• Tailors content delivery based on viewing preference data captured in Cassandra
• Increased market cap by 600% since 2012
• Introduction of ‘Profiles’ drove throughput to over 10M transactions per second
• Replaced Oracle in six data centers, worldwide, 100% in the cloud
Cassandra – Always On, No Matter What
[Diagram: four C* icons across The Hague, Groningen and London]
• Transactional Backbone
• Industry Leading Performance
• Predictable Scalability
• Operational Simplicity
• Business Flexibility
Cassandra – Operational Simplicity
CAP Theorem – Oracle vs Cassandra
Cassandra – Tunable Consistency
• Consistency Level (CL)
• Client specifies per read or write
• Handles multi-data center operations
• ALL = All replicas ack
• QUORUM = A majority (more than 50%) of replicas ack
• LOCAL_QUORUM = A majority of replicas in the local DC ack
• ONE = Only one replica acks
• Plus more (see docs)
• Blog: Eventual Consistency != Hopeful Consistency
  http://planetcassandra.org/blog/post/a-netflix-experiment-eventual-consistency-hopeful-consistency-by-christos-kalantzis/
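The consistency levels listed on this slide come down to a count of replica acknowledgements. The following is a minimal sketch of that arithmetic, not Cassandra's own code: the function names are made up for illustration, and it covers only the four levels named on the slide. A quorum is taken as a majority, i.e. `replication_factor // 2 + 1`, and a read is guaranteed to see the latest acknowledged write when read acks plus write acks exceed the replication factor.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def required_acks(level: str, local_rf: int, total_rf: int) -> int:
    """Replica acks needed for the levels named on the slide.

    local_rf: replicas in the coordinator's data center (hypothetical name).
    total_rf: replicas summed over all data centers (hypothetical name).
    """
    if level == "ONE":
        return 1
    if level == "QUORUM":
        return quorum(total_rf)
    if level == "LOCAL_QUORUM":
        return quorum(local_rf)
    if level == "ALL":
        return total_rf
    raise ValueError(f"level not covered by this sketch: {level}")


def read_sees_latest_write(read_acks: int, write_acks: int, rf: int) -> bool:
    """True when every read replica set overlaps every write replica set."""
    return read_acks + write_acks > rf


# One data center, three replicas: QUORUM needs 2 of 3.
print(required_acks("QUORUM", local_rf=3, total_rf=3))        # 2
# Two data centers with three replicas each: QUORUM needs 4 of 6,
# while LOCAL_QUORUM needs only 2 of the 3 local replicas.
print(required_acks("QUORUM", local_rf=3, total_rf=6))        # 4
print(required_acks("LOCAL_QUORUM", local_rf=3, total_rf=6))  # 2
# QUORUM reads + QUORUM writes overlap; ONE + ONE does not.
print(read_sees_latest_write(2, 2, 3))                        # True
print(read_sees_latest_write(1, 1, 3))                        # False
```

This is why LOCAL_QUORUM suits multi-data center deployments: the coordinator waits only on replicas in its own data center instead of on a cross-site majority.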
[Diagram: parallel write at CL=QUORUM across Nodes 1 to 5. Node 1 holds the 1st copy, Node 2 the 2nd copy and Node 3 the 3rd copy. Ack labels: 5 μs, 12 μs, 500 μs, 12 μs]
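The diagram's timings suggest why a quorum write is not held up by one slow replica. The sketch below assumes that the 5 μs, 12 μs and 500 μs labels are the three replica acks, and that the second 12 μs label is the resulting acknowledgement to the client; the slide text does not state this mapping explicitly. The function name is hypothetical.

```python
def write_ack_time(replica_ack_times_us, required_acks):
    """Time at which a parallel write is acknowledged.

    The write is sent to all replicas at once, so it completes when the
    required number of acks has arrived, i.e. at the k-th fastest ack.
    """
    if not 1 <= required_acks <= len(replica_ack_times_us):
        raise ValueError("required_acks must be between 1 and the replica count")
    return sorted(replica_ack_times_us)[required_acks - 1]


acks_us = [5, 12, 500]  # assumed per-replica ack times from the diagram

print(write_ack_time(acks_us, 1))  # ONE    -> 5
print(write_ack_time(acks_us, 2))  # QUORUM -> 12
print(write_ack_time(acks_us, 3))  # ALL    -> 500
```

Under these assumptions the 500 μs replica only affects latency at CL=ALL. At QUORUM the write returns after 12 μs while the slow replica finishes in the background.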
Common Use Cases
• Messaging
• Product Catalogs and Playlists
• Recommendation/Personalization
• Fraud detection
• Internet of Things/Sensor data
Product Catalogs and Playlists
A product catalog is an organized collection of products or services. Playlists refer to user-defined queues of songs, movies, games and lessons. Examples: Shopping carts, gift registries, media playlists.
Challenges Customers
• Rigidity of relational databases
• Increase in volume and diversity of data
• Application must have zero downtime
• Predictable scalability is hard
• Desire to operate in the cloud
Why DataStax?
• Real-time database infrastructure
• Rich analytics for flexible access to information
• Fast search and indexing of data
• Add new features while the application is online
• Multiple data centers to ensure applications and data have 100% uptime
Recommendation Engine
Recommendation and Personalization Engines understand each person’s unique habits and preferences and bring to light products and items that a user may be unaware of and not looking for. Examples: News sites, shopping carts.
Challenges Customers
• Large volumes of user data make accuracy challenging
• Merging real-time and historical information
• Cross-product information
• Response times need to be fast
• Predictable scalability is hard
Why DataStax?
• Rich query language and enterprise search to store, search and analyze user activity data
• Integrations with data lakes allow for the merging of real-time and historical data
• Multi-data center replication ensures applications and data suffer no downtime
• Linear scalability is predictable
Fraud Detection
Fraud detection solutions identify out-of-the-ordinary patterns to prevent malicious attacks on digital and physical assets from unauthorized applications and individuals. Examples: credit card monitoring, application infiltration.
Challenges Customers
• Increasing volume of fraudulent attacks across all industries
• Technology sophistication
• Limited historical and trend information
• Information is stored across multiple channels
• The customer can be the first to spot the fraud
Why DataStax?
• Easy management of high data volumes
• Real-time monitoring across channels, sites and data centers
• Integrations with data lakes allow for the merging of real-time and historical data
• Ease of use in managing and monitoring data
• Multi-data center replication ensures applications and data suffer no downtime
Internet of Things/Sensor Data
IoT refers to the growing number of internet-connected devices that can network and communicate with each other.
Challenges Customers
• Vast and diverse amounts of unstructured data from internet-enabled devices
• Volume of sensors is increasing exponentially
• Fast-changing technology
• Support multiple channels with varying data types
• Predictable scalability is hard
Why DataStax?
• Easy management of high data volumes
• Rich query language and enterprise search to store, search and analyze data
• Dynamic database schema
• Linear scalability offers predictability
Messaging
Messaging facilitates communication, interaction and collaboration between diverse user groups and applications via social networks, cloud services and more. Examples: SMS, email and instant messaging.
Challenges Customers
• Managing large data volumes at a reasonable cost
• Real-time updates and information, getting detailed alerts and notifications
• Predictable scalability is hard
• Information is stored across multiple platforms and systems
• Agility
Why DataStax?
• Easy management of high data volumes
• Real-time monitoring across channels, sites and data centers
• Multi-data center replication ensures applications and data suffer no downtime
• Ease of use in managing and monitoring data
• Dynamic database schema
The Weather Channel on Learning to Use Cassandra
“If you had a look in the past, you may have found Cassandra had a high learning curve and a fair amount of complexity. CQL3, the native drivers, and virtual nodes have changed the game entirely, making Cassandra a much more accessible and friendly platform.
While I have years of experience using Cassandra, my team was mostly new to it; CQL made their transition essentially painless. But where Cassandra really shines is in speed and operational simplicity, and I would say those two points were critical.”
Robbie Strickland, Software Dev Manager
APIs and Drivers
• Application drivers and connectors for all popular developer languages exist for Cassandra and DataStax Enterprise.
• CQL (Cassandra Query Language) is the primary API
• Drivers/connectors include:
  • Java
  • C++
  • Python
  • Ruby
  • PHP
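To give a feel for CQL as the primary API, here is a minimal sketch using the DataStax Python driver (`cassandra-driver`). It assumes the driver is installed and a node is reachable at `127.0.0.1`. The `demo` keyspace, the `playlists` table and the `run_demo` function are hypothetical names chosen for illustration, not anything from the talk. The consistency level is set per statement, as the tunable consistency slide describes.

```python
import uuid

# Hypothetical keyspace and table, used only for illustration.
CREATE_KEYSPACE = """
CREATE KEYSPACE IF NOT EXISTS demo
WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
"""

CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS demo.playlists (
    user_id uuid,
    position int,
    song text,
    PRIMARY KEY (user_id, position)
)
"""

INSERT_SONG = (
    "INSERT INTO demo.playlists (user_id, position, song) VALUES (%s, %s, %s)"
)
SELECT_PLAYLIST = "SELECT position, song FROM demo.playlists WHERE user_id = %s"


def run_demo(contact_points=("127.0.0.1",)):
    """Write one playlist entry and read it back; needs a running node."""
    # Imported here so the definitions above load without the driver installed.
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement

    cluster = Cluster(list(contact_points))
    session = cluster.connect()
    try:
        session.execute(CREATE_KEYSPACE)
        session.execute(CREATE_TABLE)

        user_id = uuid.uuid4()
        # The client picks the consistency level for each read or write.
        insert = SimpleStatement(
            INSERT_SONG, consistency_level=ConsistencyLevel.ONE
        )
        session.execute(insert, (user_id, 1, "Example Song"))

        select = SimpleStatement(
            SELECT_PLAYLIST, consistency_level=ConsistencyLevel.ONE
        )
        rows = session.execute(select, (user_id,))
        return [(row.position, row.song) for row in rows]
    finally:
        cluster.shutdown()
```

Calling `run_demo()` against a local single-node cluster should return the one row just written. With a replication factor of 3 the same statements could use `ConsistencyLevel.QUORUM` or `ConsistencyLevel.LOCAL_QUORUM` instead.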
DataStax – Enabling The Future
[Diagram labels: Connected Partners, Connected Employees, Connected Devices, Connected Products, Distributed Transactional Database, Connected Customers]
Thank you!
@edwardwijnen