getting started with datastax enterprise from a technical perspective

Getting Started with DataStax Enterprise

A Technical Overview

Confidential 1

Agenda

Why Cassandra?

Why DataStax Enterprise?

How to Evaluate?

Why Cassandra?

What is Apache Cassandra?

Apache Cassandra™ is a massively scalable NoSQL database.

• Continuous availability
• High performing writes and reads
• Linear scalability
• Multi-data center support

[Figure: ring of eight nodes with tokens 10, 20, 30, 40, 50, 60, 70, and 80, accessed by two clients]

Replication Factor = 3

We could still retrieve the data from the other 2 nodes

Token  Order_id  Qty  Sale
70     1001      10   100
44     1002      5    50
15     1003      30   200

Node failure or it goes down temporarily

Cassandra is Fault Tolerant
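
The fault-tolerance slide can be sketched in a few lines of Java. This is a toy model, not Cassandra's implementation: the class `TokenRing` and its methods are hypothetical names, and the placement rule (first node whose token is at or after the row token, then the next nodes clockwise) is a simplified assumption in the spirit of simple ring-based replication. It uses the eight node tokens and the three row tokens shown on the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

/** Toy token ring (hypothetical class, not part of any driver or of Cassandra itself). */
final class TokenRing {
    private final TreeSet<Integer> nodeTokens = new TreeSet<>();

    TokenRing(int... tokens) {
        for (int t : tokens) {
            nodeTokens.add(t);
        }
    }

    /** Tokens of the nodes that hold a copy of the row, walking the ring clockwise. */
    List<Integer> replicasFor(int rowToken, int replicationFactor) {
        List<Integer> replicas = new ArrayList<>();
        if (nodeTokens.isEmpty()) {
            return replicas;
        }
        int copies = Math.min(replicationFactor, nodeTokens.size());
        // Assumed rule: the first replica is the first node whose token is >= the row token.
        Integer current = nodeTokens.ceiling(rowToken);
        if (current == null) {
            current = nodeTokens.first(); // wrap around the ring
        }
        while (replicas.size() < copies) {
            replicas.add(current);
            current = nodeTokens.higher(current);
            if (current == null) {
                current = nodeTokens.first();
            }
        }
        return replicas;
    }

    /** Replicas that can still serve the row while one node is down. */
    List<Integer> survivingReplicas(int rowToken, int replicationFactor, int downNode) {
        List<Integer> alive = new ArrayList<>(replicasFor(rowToken, replicationFactor));
        alive.remove(Integer.valueOf(downNode));
        return alive;
    }

    public static void main(String[] args) {
        TokenRing ring = new TokenRing(10, 20, 30, 40, 50, 60, 70, 80);
        for (int rowToken : new int[] {70, 44, 15}) {
            System.out.println("token " + rowToken + " -> nodes " + ring.replicasFor(rowToken, 3));
        }
        System.out.println("token 44, node 50 down -> " + ring.survivingReplicas(44, 3, 50));
    }
}
```

Under these assumptions the row with token 44 lands on nodes 50, 60, and 70, so losing node 50 still leaves two nodes able to answer, which is the point the slide makes.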

Source: Netflix Tech Blog

Netflix Cloud Benchmark…

“In terms of scalability, there is a clear winner throughout our experiments. Cassandra achieves the highest throughput for the maximum number of nodes in all experiments with a linear increasing throughput.”

Source: Solving Big Data Challenges for Enterprise Application Performance Management, benchmark paper presented at the Very Large Database Conference, 2013.

End Point Independent NoSQL Benchmark

Highest in throughput…

Lowest in latency…

The NoSQL Performance Leader

Linearly Scalable

[Figure: node rings of increasing size, labeled 100,000 txns per sec, 200,000 txns per sec, and 400,000 txns per sec]

Simply add nodes to double, quadruple performance and capacity

[Figure: two eight-node rings, one in the East Data Center and one in the West Data Center, each serving a client]

Data Center Outage Occurs

No interruption to the business

Multi Data Center Support

Built for Modern Online Applications

• Architected for today’s needs
• Linear scalability at lowest cost
• 100% uptime
• Operationally simple

Agenda

Why Cassandra?

• Scale with ease
• Always on
• Deploy across data centers

DataStax delivers Apache Cassandra to the Enterprise

DataStax supports both the open source community and modern business enterprises.

Why DataStax?

Open Source | DataStax Enterprise

Apache Cassandra (Cassandra Chair and 30% of committers)

Community Edition | Enterprise Edition (Tested & Certified for Production)

OpsCenter: Standard | Enterprise (Alerts, Automated Management Services, Cluster Management)

DevCenter

Drivers/Connectors

Online Documentation

Online Training

Mailing Lists and Forums

Security: Standard | Enterprise (Kerberos Authentication & SSL Encryption)

Built-in Real-time Analytics

Built-in Enterprise Search

In-Memory Database Option

Expert Support (24x7x365)

Consultative Support

Onsite Training

• Visual browser-based UI
• Point-and-click administration
• Visual cluster management
• Proactive alerts
• Built-in external notifications
• Visual backup operations

DataStax OpsCenter

Cassandra Query Language (CQL)

DataStax DevCenter – a free, visual query tool for creating and running CQL statements against Cassandra and DataStax Enterprise.

Internal Authentication

Internal validation of authorized users

Simple to implement & easy to understand

No learning curve

Object Permission Management

Deep control over who can add/change/delete/read data

Uses familiar GRANT/REVOKE from relational world

No learning curve

Client to Node Encryption

Ensures data cannot be captured/stolen en route to a server

Data is safe both in flight from/to a database and on the database

Complete coverage is ensured

Cassandra Security

External Authentication

External validation of authorized users

Leverages Kerberos & LDAP

Single sign-on to all data domains

Transparent Data Encryption

Protects sensitive data at rest by encrypting it on disk

No changes needed at application level

Encrypt both Cassandra and Hadoop data

Data Auditing

Audit trail of all accesses and changes

Control to audit only what’s needed

Uses log4j interface to ensure performance & efficient audit operations

DataStax Enterprise Security

• Delivers Solr integration
• Very fast performance
• Search indexes span multiple data centers (regular Solr cannot)
• Online scalability via adding new nodes
• Built-in failover; continuously available

Built-in Enterprise Search

[Figure: four nodes, each running C* & Solr]

• Real-time analytics on Cassandra hot data

• MapReduce, Hive, Pig, Sqoop, and Mahout

• No single points of failure

Built-In Enterprise Analytics

[Figure: four nodes, each running C* & Hadoop. Callouts: Enterprise Analytics (MapReduce, Hive, Pig, More); Continuous availability; Integrated big data platform]

Agenda

Why Cassandra?

Why DataStax Enterprise?

• Enterprise-ready capabilities
• 24x7x365 support

How to Evaluate?

Evaluation Process

Download & install binaries or sandbox

Leverage use cases to identify needs

Install DSE/OpsCenter on servers

Design/Modify data model

Implement data model

Load sample data

Stress test servers

Develop application

1) R&D Mode

2) POC Cycle

3) Optimize

Add Nodes (C*, SOLR, and/or Hadoop)

A Typical POC Environment

• Ideally at least 4 nodes, RF=3
• Hardware per node:
  • At least 8 cores
  • At least 16 GB RAM (the more the better)
  • SSD physically attached
  • Linux (ideally 3.x for improved buffered cache)
• Each environment has its own steps/requirements:
  • EC2, Rackspace, Google Compute, other cloud providers
  • In-house servers
  • In-house server VMs

Tailored to Meet Your Needs

FREE Resources
• DSE Sandbox
• DSE for Non-Production
• OpsCenter (Standard)
• DevCenter
• DataStax Academy
• Community Forums
• White Papers & Documentation

PAID Services
• Onsite Consulting
• Remote Consulting
• Onsite Training
• Public Training

PAID Subscription
• Production DSE Pro
• Production DSE Standard
• Non-Production DSE Max
• Non-Production DSE Pro
• Non-Production DSE Standard
• Production DSE Max

PAID Bundles
• Quick Start Enterprise
• Quick Start Standard

Customer Benefits
• Customer Success Manager
• Proactive Guidance
• Free Health Check
• Free Migration Assessment
• Monthly Bulletin Best Practices

The Right Mix of Support Resources

Phases: Education & Training, Planning & Design, Develop & Test, Production Support

Training
• How to use DataStax Enterprise
• Learn DataStax admin features
• How to use integrated search
• How to use integrated analytics

Consulting
• DataStax Enterprise architecture
• Data modeling with DataStax
• Cluster tuning and performance
• Best practices and planning

Support
• Troubleshooting errors
• Experiencing unexpected results
• Clarification on documentation
• Critical issue support

Available Online Resources

• Patrick McFadin’s data modeling series
• CQL/Data modeling on DataStax
• Virtual training
• Java driver sample code
• SOLR documentation and tutorial on DataStax
• Analytics documentation
• Github code samples
• Advanced time series best practices

Massively Scale a DB!

Agenda

Why Cassandra?

Why DataStax Enterprise?

How to Evaluate?

• Evaluate efficiently

Q&A and Next Steps

Want to learn more about the evaluation process?
• Contact your account manager or email us at [email protected]

Want access to more Cassandra resources?
• Visit Planet Cassandra at www.planetcassandra.com

Appendix

EC2 Install Process with Linux AMIs

• Read through EC2 production planning: http://www.datastax.com/documentation/cassandra/2.0/cassandra/architecture/architecturePlanningEC2_c.html
• Go for i2.2xlarge to i2.4xlarge
• Create a security group: http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installAMIsecurity.html
• Pick a reputable, reliable Linux image to start with, preferably one with the 3.x kernel
• Run through the wizard and start the AMIs up
• Install the prerequisites: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installJREJNAabout_c.html
• Install the DSE node (depends on OS): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/install/installTOC.html
• Follow the "What's next" section at the bottom of the installation instructions, including configuring the DSE node for a single DC or multiple DCs (the topology should be planned for): http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deploySingleDC.html#deploySingleDC or http://www.datastax.com/documentation/datastax_enterprise/4.0/datastax_enterprise/deploy/deployMultiDC.html#deployMultiDC
• Follow and set the recommended production settings: http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

Cassandra Architecture Basics – One Node

Organizes Data in Partitions

Inserted data is written to a Commit Log

As well as a MemTable

MemTables are flushed to disk in an SSTable based on size.

SSTables are immutable

Changes to a partition are written to additional SSTables.

Deletes write tombstones
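
The write path listed above (commit log, MemTable, flush to an immutable SSTable, tombstones for deletes) can be sketched as a toy single-node store. Everything here is an illustrative assumption: `ToyNode`, its methods, and the flush-by-entry-count rule are hypothetical, and a real node flushes by size and stores far richer structures.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.TreeMap;

/** Toy single-node write path (hypothetical class, for illustration only). */
final class ToyNode {
    private final int flushThreshold; // assumed: flush after this many memtable entries
    private final List<String> commitLog = new ArrayList<>();
    // Optional.empty() stands in for a tombstone.
    private TreeMap<Integer, Optional<String>> memtable = new TreeMap<>();
    private final List<TreeMap<Integer, Optional<String>>> sstables = new ArrayList<>(); // newest last

    ToyNode(int flushThreshold) {
        this.flushThreshold = flushThreshold;
    }

    void write(int partitionKey, String value) {
        apply(partitionKey, Optional.of(value));
    }

    /** A delete is just another write: it records a tombstone. */
    void delete(int partitionKey) {
        apply(partitionKey, Optional.empty());
    }

    private void apply(int partitionKey, Optional<String> cell) {
        commitLog.add(partitionKey + "=" + cell.orElse("<tombstone>")); // logged first
        memtable.put(partitionKey, cell);                               // then held in memory
        if (memtable.size() >= flushThreshold) {
            flush();
        }
    }

    private void flush() {
        sstables.add(memtable);     // the flushed table is never modified again
        memtable = new TreeMap<>(); // later changes go to a new memtable
        commitLog.clear();          // flushed data no longer needs the log
    }

    /** Reads check the memtable first, then SSTables from newest to oldest. */
    Optional<String> read(int partitionKey) {
        Optional<String> cell = memtable.get(partitionKey);
        for (int i = sstables.size() - 1; cell == null && i >= 0; i--) {
            cell = sstables.get(i).get(partitionKey);
        }
        return cell == null ? Optional.empty() : cell;
    }

    int sstableCount() {
        return sstables.size();
    }

    public static void main(String[] args) {
        ToyNode node = new ToyNode(2);
        node.write(75, "row A");
        node.write(9, "row B");     // second entry triggers a flush
        node.write(75, "row A v2"); // the change lands in a new memtable
        node.delete(9);             // tombstone; triggers a second flush
        System.out.println(node.read(75) + " " + node.read(9) + " " + node.sstableCount());
    }
}
```

Because flushed tables are never rewritten in this sketch, the update to partition 75 and the tombstone for partition 9 end up in a second SSTable, and reads resolve them by consulting the newest data first.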

[Figure: Node 1 holding two partitions of row data, with partition keys 75 and 9]

Background – How Cassandra Stores Data

Model brought from BigTable*

Partition key and a lot of cells

Cell names sorted (UTF8, Int, Timestamp, etc)
• CQL creates timestamp if not specified

Partition key | Cell Name  | ... | Cell Name
              | Cell Value | ... | Cell Value
              | Timestamp  | ... | Timestamp
              | TTL        | ... | TTL
              (cells 1 ... 2 Billion)
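
As a rough illustration of why each cell carries a timestamp and a TTL, the sketch below models a cell and the "last write wins" rule used to pick between two versions of the same cell. The `Cell` class, its field names, and the tie-breaking choice are assumptions made for this example and are not taken from the slides.

```java
/** Toy cell with a write timestamp and an optional TTL (hypothetical class). */
final class Cell {
    final String name;
    final String value;
    final long timestampMicros; // write time in microseconds
    final int ttlSeconds;       // 0 means the cell never expires

    Cell(String name, String value, long timestampMicros, int ttlSeconds) {
        this.name = name;
        this.value = value;
        this.timestampMicros = timestampMicros;
        this.ttlSeconds = ttlSeconds;
    }

    /** A cell with a TTL stops being readable once the TTL has elapsed. */
    boolean isLive(long nowMicros) {
        return ttlSeconds == 0 || nowMicros < timestampMicros + ttlSeconds * 1_000_000L;
    }

    /** Last write wins: keep the version with the newer timestamp (ties keep the first argument here). */
    static Cell reconcile(Cell a, Cell b) {
        return b.timestampMicros > a.timestampMicros ? b : a;
    }

    public static void main(String[] args) {
        Cell older = new Cell("password", "secret1", 1_000_000L, 0);
        Cell newer = new Cell("password", "secret2", 2_000_000L, 60);
        Cell winner = Cell.reconcile(older, newer);
        System.out.println(winner.value + " live at t=30s: " + winner.isLive(30_000_000L));
        System.out.println(winner.value + " live at t=90s: " + winner.isLive(90_000_000L));
    }
}
```

The timestamp lets copies of a cell written at different times be compared without coordination, and the TTL lets data expire without an explicit delete.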

©2013 DataStax Confidential. Do not distribute without consent. 33

[Figure: ring of five nodes (Node 1 to Node 5); two groups of row data, labeled 2, 3 and 7, 6, each appear on three nodes]

Cassandra Architecture Basics – Multi Data Center

• Nodes can be arranged in multiple data centers

• Cassandra replicates data efficiently between remote data centers

• Each data center can have a different RF

• Use data centers to segment nodes for different query patterns

[Figure: two data centers, Boston and San Francisco, labeled Real Time and Analytics]

Reading Data

/* Demonstrate an easy way to query data. */
try {
    ResultSet result = session.execute(
        "SELECT password FROM user " +
        "WHERE username = 'user2';");
    if (result.isExhausted())
        return;
    Row user = result.one();
    System.out.println("Password is: " + user.getString("password"));
} catch (NoHostAvailableException ex) {
    System.out.println("No Host Available");
} catch (QueryValidationException ex) {
    // Raised for invalid queries: syntax errors or missing permissions
    System.out.println("Syntax error or not authorized");
}


Prepared Statements

PreparedStatement statement = session.prepare(
    "INSERT INTO user (username, password) " +
    "VALUES (?, ?);");

BoundStatement boundStatement = new BoundStatement(statement);

try {
    session.execute(boundStatement.bind("user4", "user4password"));
} catch (NoHostAvailableException ex) {
    System.out.println("Host Not Available");
} catch (QueryExecutionException ex) {
    // Valid query that failed at runtime, e.g. not enough replicas responded
    System.out.println("Requested consistency level not met");
} catch (QueryValidationException ex) {
    // Invalid query: syntax error or missing permissions
    System.out.println("Syntax error or not authorized");
}

Query-Driven Data Modeling

Start by addressing the queries that you will need to answer
• Your data should be able to match them directly

Think about:
• The actions your application needs to perform
• How you want to access the data
• What are the use cases?
• What does the data look like?

Queries (cont)

What are you trying to retrieve?
• Does it need to be ordered?
• Is there any nesting of data?
• Do you need to group data?
• Do you need to filter data?

Does data expire?

Does data need to be retrieved in chronological order?


Relational Concept - Denormalization

• Combine table columns into a single view
• No joins
• All in how you set the data for fast reads

Employees

SELECT First, Last, Dept
FROM employees
WHERE id = '1';

id  First    Last   Dept
1   Edgar    Codd   Engineering
2   Raymond  Boyce  Math


• Examples: medical devices, energy devices/equipment, financial data
• Applications for sensors, clickstreams, historical data
• Typically very high volume writes required
• Usually coupled with a need to analyze or search the data using real-time analytics
• Great fit for DSE Cassandra, SOLR, Analytics Nodes

Time Series – Patterns


StationID | Timestamp         | Timestamp         | Timestamp         (1…N)
          | Value/s           | Value/s           | Value/s
FLGAZ101  | 20130611T01:01:01 | 20130611T01:01:11 | 20130611T01:01:21
          | 74.34             | 74.28             | 74.41
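
The layout above, one partition per station with cells kept in timestamp order, can be mimicked with sorted maps. This is only an in-memory sketch of the pattern: `TimeSeriesStore`, `record`, and `slice` are hypothetical names, and it relies on the assumption that timestamps in the slide's `yyyyMMddTHH:mm:ss` form sort chronologically when compared as strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Toy time-series layout: partition key = station, sorted cells = timestamp -> value. */
final class TimeSeriesStore {
    private final Map<String, NavigableMap<String, Double>> partitions = new HashMap<>();

    /** Appends a reading; cells stay sorted by timestamp inside the station's partition. */
    void record(String stationId, String timestamp, double value) {
        partitions.computeIfAbsent(stationId, k -> new TreeMap<>()).put(timestamp, value);
    }

    /** Returns the readings of one station between two timestamps, both inclusive. */
    NavigableMap<String, Double> slice(String stationId, String from, String to) {
        NavigableMap<String, Double> cells = partitions.get(stationId);
        if (cells == null) {
            return new TreeMap<>();
        }
        return cells.subMap(from, true, to, true);
    }

    public static void main(String[] args) {
        TimeSeriesStore store = new TimeSeriesStore();
        // Inserted out of order on purpose; the partition keeps them sorted.
        store.record("FLGAZ101", "20130611T01:01:21", 74.41);
        store.record("FLGAZ101", "20130611T01:01:01", 74.34);
        store.record("FLGAZ101", "20130611T01:01:11", 74.28);
        System.out.println(store.slice("FLGAZ101", "20130611T01:01:05", "20130611T01:01:30"));
    }
}
```

Keeping every reading for a station in one sorted partition is what makes a chronological range query a single contiguous read in this model.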

Hardware

• Ideal node:
  • Processor: CPU 8 cores
  • Memory: RAM 16 - 64 GB, with 8 GB of heap
  • Network: at least a Gigabit card
  • Disks: lots of small disks using JBOD or basic RAIDs (0 or 10), but prefer SSDs
• Exact needs vary by use case
• Production planning:
  • http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/architecture/architecturePlanningHardware_c.html

Cassandra Query Language (CQL)

• Very similar to RDBMS SQL syntax
• Create objects via DDL (e.g. CREATE…)
• Core DML commands supported: INSERT, UPDATE, DELETE
• Query data with SELECT
• Leverage Java drivers to execute queries via PreparedStatements and ResultSets

SELECT * FROM USERS
WHERE STATE = 'TX';

[Figure: write path from the Client to Memory and the Commit Log, with a Flush to Disk into SSTables]

Cassandra is Durable

Data is organized into Partitions

Inserted data is written to a Commit Log for a node

As well as a MemTable

MemTables are flushed to disk in an SSTable based on size.

SSTables are immutable

Overview of Replication in Cassandra

• Replication is controlled by the replication factor. A replication factor of 1 means there is only one copy of a row in a cluster. A replication factor of 2 means there are two copies of a row stored in a cluster.

• Replication is controlled at the keyspace level in Cassandra

[Figure: an original row and two copies of the row on other nodes]

Replication Factor (RF) determines the additional nodes that get a copy of the partition, e.g. RF=3

• The schema used in Cassandra is modeled after Google Bigtable. It is a row-oriented, column structure
• A keyspace is akin to a database in the RDBMS world
• A column family is similar to an RDBMS table but is more flexible/dynamic
• A row in a column family is indexed by its key

Portfolio Keyspace

Customer Column Family

ID | Name | SSN | DOB

Data Model

Tunable Data Consistency

• Choose between strong and eventual consistency (one to all responding) depending on the need
• Can be done on a per-operation basis, and for both reads and writes
• Handles multi-data center operations

Writes
• Any
• One
• Quorum
• Local_Quorum
• Each_Quorum
• All

Reads
• One
• Quorum
• Local_Quorum
• Each_Quorum
• All
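
The trade-off between strong and eventual consistency comes down to simple arithmetic on the replication factor. The sketch below assumes the usual definitions: a quorum is a majority of replicas, floor(RF / 2) + 1, and a read is guaranteed to see the latest write when the number of replicas read plus the number written exceeds RF. The class and method names are hypothetical.

```java
/** Toy consistency arithmetic (hypothetical helper, not a driver API). */
final class Consistency {
    private Consistency() {}

    /** Number of replicas that make up a quorum (a majority) for a replication factor. */
    static int quorum(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    /** Reads overlap the latest write when readReplicas + writeReplicas > replicationFactor. */
    static boolean isStronglyConsistent(int readReplicas, int writeReplicas, int replicationFactor) {
        return readReplicas + writeReplicas > replicationFactor;
    }

    /** Replicas that may be down while quorum operations still succeed. */
    static int tolerableFailures(int replicationFactor) {
        return replicationFactor - quorum(replicationFactor);
    }

    public static void main(String[] args) {
        int rf = 3;
        int q = quorum(rf);
        System.out.println("RF=" + rf + " quorum=" + q);
        System.out.println("QUORUM read + QUORUM write strong? " + isStronglyConsistent(q, q, rf));
        System.out.println("ONE read + ONE write strong? " + isStronglyConsistent(1, 1, rf));
        System.out.println("ONE read + ALL write strong? " + isStronglyConsistent(1, rf, rf));
        System.out.println("Nodes that can be down at QUORUM: " + tolerableFailures(rf));
    }
}
```

With RF=3, a quorum is two replicas, so Quorum reads paired with Quorum writes stay strongly consistent with one replica down, while One/One favors latency and availability and is only eventually consistent.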

Thank You
