Getting Started with DataStax Enterprise from a Technical Perspective
DESCRIPTION
The requirements for building today’s online applications have changed. Implementing legacy technology hinders your ability to innovate, ensure application performance, and meet the demands of your customers. So how do you determine what underlying systems are the right fit for your needs? Join us as we review the following to help you get started with DataStax Enterprise:
- What is Cassandra and why should you care?
- What is DataStax Enterprise and how does it differ from Cassandra?
- What are the steps to evaluating DataStax Enterprise?
- Valuable resources to get up to speed on Cassandra and DataStax Enterprise
TRANSCRIPT
Getting Started with DataStax Enterprise
A Technical Overview
Confidential 1
Agenda
Why Cassandra?
Why DataStax Enterprise?
How to Evaluate?
Why Cassandra?
What is Apache Cassandra?
Apache Cassandra™ is a massively scalable NoSQL database.
• Continuous availability
• High performing writes and reads
• Linear scalability
• Multi-data center support
[Figure: two clients connected to an eight-node ring with tokens 10 through 80]
Replication Factor = 3
We could still retrieve the data from the other 2 nodes
Token  Order_id  Qty  Sale
70     1001      10   100
44     1002      5    50
15     1003      30   200
Node failure or it goes down temporarily
Cassandra is Fault Tolerant
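The fault-tolerance slide can be illustrated with a small, self-contained sketch. It is not Cassandra code: the class and method names are hypothetical, and the placement rule is a simplified assumption (first replica on the node whose token is the smallest one at or above the row token, remaining copies on the next nodes clockwise). It uses the slide's ring tokens and the row with token 44.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;

// Hypothetical illustration of replica placement on a token ring (not Cassandra internals).
class RingSketch {
    // Node tokens taken from the slide's eight-node ring.
    static final int[] TOKENS = {10, 20, 30, 40, 50, 60, 70, 80};

    // Assumed rule: the first replica is the node with the smallest token >= rowToken
    // (wrapping to the first node), followed by the next rf - 1 nodes clockwise.
    static List<Integer> replicas(int rowToken, int rf) {
        int start = 0; // stays 0 when rowToken is above every node token (wrap-around)
        for (int i = 0; i < TOKENS.length; i++) {
            if (TOKENS[i] >= rowToken) {
                start = i;
                break;
            }
        }
        List<Integer> owners = new ArrayList<>();
        for (int k = 0; k < rf; k++) {
            owners.add(TOKENS[(start + k) % TOKENS.length]);
        }
        return owners;
    }

    // Replicas that can still serve the row when some nodes are down.
    static List<Integer> liveReplicas(int rowToken, int rf, Set<Integer> downNodes) {
        List<Integer> live = new ArrayList<>(replicas(rowToken, rf));
        live.removeAll(downNodes);
        return live;
    }

    public static void main(String[] args) {
        System.out.println(replicas(44, 3));                                    // [50, 60, 70]
        System.out.println(liveReplicas(44, 3, Collections.singleton(50)));    // [60, 70]
    }
}
```

With a replication factor of 3, losing node 50 still leaves two copies of order 1002, which is the point the slide makes.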
Source: Netflix Tech Blog
Netflix Cloud Benchmark…
“In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput.”
Source: Solving Big Data Challenges for Enterprise Application Performance Management benchmark paper presented at the Very Large Database Conference, 2013.
End Point Independent NoSQL Benchmark
Highest in throughput…
Lowest in latency…
The NoSQL Performance Leader
Linearly Scalable
[Figure: token rings of increasing size, labeled 100,000 txns per sec, 200,000 txns per sec, and 400,000 txns per sec]
Simply add nodes to double, quadruple performance and capacity
[Figure: clients connected to two eight-node rings, East Data Center and West Data Center]
Data Center Outage Occurs
No interruption to the business
Multi Data Center Support
Built for Modern Online Applications
• Architected for today’s needs
• Linear scalability at lowest cost
• 100% uptime
• Operationally simple
Agenda
Why Cassandra?
• Scale with ease
• Always on
• Deploy across data centers
Why DataStax Enterprise?
DataStax delivers Apache Cassandra to the Enterprise
DataStax supports both the open source community and modern business enterprises.
Why DataStax?
Open Source DataStax Enterprise
Apache Cassandra (Cassandra Chair and 30% of committers)
Community Edition Enterprise Edition (Tested & Certified for Production)
OpsCenter Standard Enterprise (Alerts, Automated Management Services, Cluster Management)
DevCenter
Drivers/Connectors
Online Documentation
Online Training
Mailing Lists and Forums
Security Standard Enterprise (Kerberos Authentication & SSL Encryption)
Built-in Real-time Analytics
Built-in Enterprise Search
In-Memory Database Option
Expert Support (24x7x365)
Consultative Support
Onsite Training
• Visual browser-based UI
• Point-and-click administration
• Visual cluster management
• Proactive alerts
• Built-in external notifications
• Visual backup operations
DataStax OpsCenter
Cassandra Query Language (CQL)
DataStax DevCenter – a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.
Internal Authentication
Internal validation of authorized users
Simple to implement & easy to understand
No learning curve
Object Permission Management
Deep control over who can add/change/delete/read data
Uses familiar GRANT/REVOKE from relational world
No learning curve
Client to Node Encryption
Ensures data cannot be captured/stolen en route to a server
Data is safe both in flight from/to a database and on the database
Complete coverage is ensured
Cassandra Security
External Authentication
External validation of authorized users
Leverages Kerberos & LDAP
Single sign-on to all data domains
Transparent Data Encryption
Protects sensitive data at rest by encrypting it on disk
No changes needed at application level
Encrypt both Cassandra and Hadoop data
Data Auditing
Audit trail of all accesses and changes
Control to audit only what’s needed
Uses log4j interface to ensure performance & efficient audit operations
DataStax Enterprise Security
• Delivers Solr integration
• Very fast performance
• Search indexes span multiple data centers (regular Solr cannot)
• Online scalability via adding new nodes
• Built-in failover; continuously available
Built-in Enterprise Search
[Figure: four nodes, each running C* & Solr]
• Real-time analytics on Cassandra hot data
• MapReduce, Hive, Pig, Sqoop, and Mahout
• No single points of failure
Built-In Enterprise Analytics
Enterprise Analytics
MapReduce, Hive, Pig, More
Continuous availability
Integrated big data platform
[Figure: four nodes, each running C* & Hadoop]
Agenda
Why Cassandra?
• Scale with ease
• Always on
• Deploy across data centers
Why DataStax Enterprise?
• Enterprise-ready capabilities
• 24x7x365 support
How to Evaluate?
Evaluation Process
Download & install binaries or sandbox
Leverage use cases to identify needs
Install DSE/OpsCenter on servers
Design/Modify data model
Implement data model
Load sample data
Stress test servers
Develop application
1) R&D Mode
2) POC Cycle
3) Optimize
Add Nodes (C*, SOLR, and/or Hadoop)
A Typical POC Environment
• Ideally at least 4 nodes, RF=3
• Hardware per node:
  • At least 8 cores
  • At least 16 GB RAM (the more the better)
  • SSD physically attached
  • Linux (ideally 3.x for improved buffered cache)
• Each environment has its own steps/requirements:
  • EC2, Rackspace, Google Compute, other cloud providers
  • In-house servers
  • In-house servers VM
Tailored to Meet Your Needs
FREE Resources
• DSE Sandbox
• DSE for Non-Production
• OpsCenter (Standard)
• DevCenter
• DataStax Academy
• Community Forums
• White Papers & Documentation
PAID Services
• Onsite Consulting
• Remote Consulting
• Onsite Training
• Public Training
PAID Subscription
• Production DSE Pro
• Production DSE Standard
• Non-Production DSE Max
• Non-Production DSE Pro
• Non-Production DSE Standard
• Production DSE Max
PAID Bundles
• Quick Start Enterprise
• Quick Start Standard
Customer Benefits
• Customer Success Manager
• Proactive Guidance
• Free Health Check
• Free Migration Assessment
• Monthly Bulletin Best Practices
The Right Mix of Support Resources
Phases: Education & Training, Planning & Design, Develop & Test, Production Support
Training
• How to use DataStax Enterprise
• Learn DataStax admin features
• How to use integrated search
• How to use integrated analytics
Consulting
• DataStax Enterprise architecture
• Data modeling with DataStax
• Cluster tuning and performance
• Best practices and planning
Support
• Troubleshooting errors
• Experiencing unexpected results
• Clarification on documentation
• Critical issue support
Available Online Resources
• Patrick McFadin’s data modeling series
• CQL/Data modeling on DataStax
• Virtual training
• Java driver sample code
• SOLR documentation and tutorial on DataStax
• Analytics documentation
• Github code samples
• Advanced time series best practices
Massively Scale a DB!
Agenda
Why Cassandra?
• Scale with ease
• Always on
• Deploy across data centers
Why DataStax Enterprise?
• Enterprise-ready capabilities
• 24x7x365 support
How to Evaluate?
• Evaluate efficiently
Q&A and Next Steps
Want to learn more about the evaluation process?
• Contact your account manager or email us at
Want access to more Cassandra resources?
• Visit Planet Cassandra at www.planetcassandra.com
Appendix
EC2 Install Process with Linux AMI’s
• Read through EC2 production planning: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
• Go for i2.2xlarge to i2.4xlarge
• Create security group: http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installAMIsecurity.html
• Pick a reputable, reliable Linux image to start with, preferably an image with the 3.x kernel on it
• Run through the wizard and start the AMIs up
• Install the prerequisites: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installJREJNAabout_c.html
• Install DSE node (depends on OS): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
• Follow the "What's next" steps at the bottom of the installation instructions, including configuring the DSE node for single DC or multi DC (topology should be planned for): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deploySingleDC.html#deploySingleDC or http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deployMultiDC.html#deployMultiDC
• Follow and set recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html
Cassandra Architecture Basics – One Node
Organizes Data in Partitions
Inserted data is written to a Commit Log
As well as a MemTable
MemTables are flushed to disk in an SSTable based on size.
SSTables are immutable
Changes to a partition are written to additional SSTables.
Deletes write tombstones
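The write path on a single node can be sketched as a toy model. Everything here is an assumption made for illustration: the class name, the tombstone marker, and the flush trigger (a key count rather than a size in bytes). Reads simply check the newest data first, which is a simplification of how a real node merges data.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Toy model of one node's write path (hypothetical names, not Cassandra internals).
class WritePathSketch {
    static final String TOMBSTONE = "<tombstone>"; // assumed marker for a deleted key

    final List<String> commitLog = new ArrayList<>();             // append-only log
    TreeMap<String, String> memTable = new TreeMap<>();           // in-memory, sorted by key
    final List<Map<String, String>> ssTables = new ArrayList<>(); // oldest first, never modified
    final int flushThreshold;                                     // assumed: flush after this many keys

    WritePathSketch(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(String key, String value) {
        commitLog.add(key + "=" + value); // 1. record the write in the commit log
        memTable.put(key, value);         // 2. apply it to the MemTable
        if (memTable.size() >= flushThreshold) {
            flush();                      // 3. flush to a new SSTable once the threshold is hit
        }
    }

    // A delete is just a write of a tombstone; older SSTables are left untouched.
    void delete(String key) {
        write(key, TOMBSTONE);
    }

    void flush() {
        if (memTable.isEmpty()) {
            return;
        }
        ssTables.add(Collections.unmodifiableMap(memTable)); // SSTables are immutable
        memTable = new TreeMap<>();
    }

    // Newest data wins: check the MemTable, then SSTables from newest to oldest.
    String read(String key) {
        String value = memTable.get(key);
        for (int i = ssTables.size() - 1; value == null && i >= 0; i--) {
            value = ssTables.get(i).get(key);
        }
        return (value == null || TOMBSTONE.equals(value)) ? null : value;
    }
}
```

Note that updating a key never rewrites an existing SSTable; the change lands in a later SSTable, and the old value stays on disk until it is cleaned up by a process this sketch does not model.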
[Figure: Node 1 holding row data for partition keys 75 and 9]
Background – How Cassandra Stores Data
Model brought from BigTable*
• Partition key and a lot of cells
• Cell names sorted (UTF8, Int, Timestamp, etc.)
• CQL creates timestamp if not specified
[Figure: a partition key followed by cells numbered 1 to 2 billion; each cell has a Cell Name, Cell Value, Timestamp, and TTL]
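A rough sketch of this storage model, assuming hypothetical class and field names: a partition is a sorted map from cell name to a cell carrying a value, a timestamp, and a TTL. The write with the highest timestamp wins. Measuring the TTL from the cell timestamp is a simplification chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Illustrative model of one partition (hypothetical names, not Cassandra internals).
class PartitionSketch {
    static final class Cell {
        final String value;
        final long timestampMicros;
        final int ttlSeconds; // 0 means the cell never expires

        Cell(String value, long timestampMicros, int ttlSeconds) {
            this.value = value;
            this.timestampMicros = timestampMicros;
            this.ttlSeconds = ttlSeconds;
        }

        // Simplification: expiry is measured from the cell timestamp.
        boolean isLive(long nowMicros) {
            return ttlSeconds == 0 || nowMicros < timestampMicros + ttlSeconds * 1_000_000L;
        }
    }

    final String partitionKey;
    final TreeMap<String, Cell> cells = new TreeMap<>(); // cell names kept sorted

    PartitionSketch(String partitionKey) {
        this.partitionKey = partitionKey;
    }

    // Last write wins: a write with an older timestamp does not replace a newer cell.
    void put(String name, String value, long timestampMicros, int ttlSeconds) {
        Cell current = cells.get(name);
        if (current == null || timestampMicros >= current.timestampMicros) {
            cells.put(name, new Cell(value, timestampMicros, ttlSeconds));
        }
    }

    String get(String name, long nowMicros) {
        Cell cell = cells.get(name);
        return (cell != null && cell.isLive(nowMicros)) ? cell.value : null;
    }

    List<String> cellNames() {
        return new ArrayList<>(cells.keySet());
    }
}
```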
©2013 DataStax Confidential. Do not distribute without consent. 33
[Figure: five-node ring (Node 1 to Node 5); each of two rows of data is stored on three nodes]
Cassandra Architecture Basics – Multi Data Center
• Nodes can be arranged in multiple data centers
• Cassandra replicates data efficiently between remote data centers
• Each data center can have a different RF
• Use data centers to segment nodes for different query patterns
[Figure: data centers in Boston and San Francisco, labeled Real Time and Analytics]
Reading Data
/* Demonstrate an easy way to query data. */
// Needs com.datastax.driver.core.* and com.datastax.driver.core.exceptions.*; session is an open Session.
try {
    ResultSet result = session.execute(
        "SELECT password FROM user " +
        "WHERE username = 'user2';");
    if (result.isExhausted())
        return;                       // no matching row
    Row user = result.one();
    System.out.println("Password is: " + user.getString("password"));
} catch (NoHostAvailableException ex) {
    System.out.println("No Host Available");
} catch (QueryExecutionException ex) {
    System.out.println("Requested consistency level not met");
}
Prepared Statements
PreparedStatement statement = session.prepare(
    "INSERT INTO user (username, password) " +
    "VALUES (?, ?);");
BoundStatement boundStatement = new BoundStatement(statement);
try {
    session.execute(boundStatement.bind("user4", "user4password"));
} catch (NoHostAvailableException ex) {
    System.out.println("Host Not Available");
} catch (QueryExecutionException ex) {
    System.out.println("Requested consistency level not met");
} catch (QueryValidationException ex) {
    System.out.println("Syntax error, invalid query, not authorized");
}
Query-Driven Data Modeling
Start by addressing the queries that you will need to answer
• Your data should be able to match it directly
Think about:
• The actions your application needs to perform
• How you want to access the data
• What are the use cases?
• What does the data look like?
Queries (cont)
What are you trying to retrieve?
• Does it need to be ordered?
• Is there any nesting of data?
• Do you need to group data?
• Do you need to filter data?
Does data expire?
Does data need to be retrieved in chronological order?
Relational Concept - Denormalization
• Combine table columns into a single view
• No joins
• All in how you set the data for fast reads
Employees
SELECT First, Last, Dept
FROM employees
WHERE id = '1';
id  First    Last   Dept
1   Edgar    Codd   Engineering
2   Raymond  Boyce  Math
• Examples: medical devices, energy devices/equipment, financial data
• Application for sensors, clickstreams, historical data
• Typically very high volume writes required
• Usually coupled with need to analyze data or search using real-time analytics
• Great fit for DSE Cassandra, SOLR, Analytics Nodes
Time Series – Patterns
StationID  Timestamp          Value/s
FLGAZ101   20130611T01:01:01  74.34
FLGAZ101   20130611T01:01:11  74.28
FLGAZ101   20130611T01:01:21  74.41
(1…N timestamp/value cells per StationID)
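The time series pattern can be mimicked in plain Java as a minimal sketch: one partition per station, with readings kept sorted by timestamp so that a time range can be read as one contiguous slice. The class and method names are hypothetical, and the sketch relies on the slide's timestamp format sorting correctly as text.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Illustrative time series layout (hypothetical names): station -> sorted (timestamp -> value).
class TimeSeriesSketch {
    private final Map<String, TreeMap<String, Double>> partitions = new HashMap<>();

    // Each reading becomes one more cell in the station's partition.
    void record(String stationId, String timestamp, double value) {
        partitions.computeIfAbsent(stationId, k -> new TreeMap<>()).put(timestamp, value);
    }

    // Inclusive range read within one partition; results come back in chronological order.
    SortedMap<String, Double> slice(String stationId, String from, String to) {
        TreeMap<String, Double> row = partitions.get(stationId);
        if (row == null) {
            return new TreeMap<>();
        }
        return row.subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        TimeSeriesSketch series = new TimeSeriesSketch();
        series.record("FLGAZ101", "20130611T01:01:01", 74.34);
        series.record("FLGAZ101", "20130611T01:01:11", 74.28);
        series.record("FLGAZ101", "20130611T01:01:21", 74.41);
        // Prints the two readings taken between 01:01:05 and 01:01:21.
        System.out.println(series.slice("FLGAZ101", "20130611T01:01:05", "20130611T01:01:21"));
    }
}
```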
Hardware
• Ideal node:
  • Processor: CPU 8 cores
  • Memory: RAM 16 - 64 GB, with 8 GB of heap
  • Network: at least a Gigabit card
  • Disks: lots of small disks using JBOD or basic RAID (0 or 10), but prefer SSDs
• Exact needs vary by use case
• Production planning:
  • http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/architecturePlanningHardware_c.html
Cassandra Query Language (CQL)
• Very similar to RDBMS SQL syntax
• Create objects via DDL (e.g. CREATE…)
• Core DML commands supported: INSERT, UPDATE, DELETE
• Query data with SELECT
• Leverage Java drivers to execute queries via PreparedStatements and ResultSets
SELECT * FROM USERS
WHERE STATE = 'TX';
[Figure: a client write goes to the Commit Log and to a MemTable in memory; MemTables are flushed to disk as SSTables]
Cassandra is Durable
Data is organized into Partitions
Inserted data is written to a Commit Log for a node
As well as a MemTable
MemTables are flushed to disk in an SSTable based on size.
SSTables are immutable
Overview of Replication in Cassandra
• Replication is controlled by what is called the replication factor. A replication factor of 1 means there is only one copy of a row in a cluster. A replication factor of 2 means there are two copies of a row stored in a cluster
• Replication is controlled at the keyspace level in Cassandra
[Figure: the original row on one node and a copy of the row on each of two other nodes]
Replication Factor (RF) determines additional nodes that get a copy of the partition, e.g. RF=3
• The schema used in Cassandra is modeled after Google Bigtable. It is a row-oriented, column structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key
ID Name SSN DOB
Portfolio Keyspace
Customer Column Family
Data Model
Tunable Data Consistency
• Choose between strong and eventual consistency (one to all responding) depending on the need
• Can be done on a per-operation basis, and for both reads and writes
• Handles multi-data center operations
Writes
• Any
• One
• Quorum
• Local_Quorum
• Each_Quorum
• All
Reads
• One
• Quorum
• Local_Quorum
• Each_Quorum
• All
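The arithmetic behind these levels can be shown in a few lines. This is a minimal sketch with hypothetical helper names: a quorum is a majority of the replicas, and a read is guaranteed to overlap the latest write when the number of replicas written plus the number read exceeds the replication factor.

```java
// Illustrative consistency arithmetic (hypothetical helper names).
class ConsistencySketch {
    // Quorum is a majority of the replicas: floor(rf / 2) + 1.
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Reads see the latest write when the written and read replica sets must overlap.
    static boolean isStronglyConsistent(int writeReplicas, int readReplicas, int replicationFactor) {
        return writeReplicas + readReplicas > replicationFactor;
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println(q);                                // 2
        System.out.println(isStronglyConsistent(q, q, rf));   // true: Quorum writes + Quorum reads
        System.out.println(isStronglyConsistent(1, 1, rf));   // false: One + One is eventual
        System.out.println(isStronglyConsistent(rf, 1, rf));  // true: All writes + One reads
    }
}
```

With RF=3, Quorum reads and writes each touch two replicas, so one node can be down without losing either availability or strong consistency.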
Thank You