Cassandra 1.0 and the future of big data (Cassandra Tokyo 2011)
Cassandra 1.0 and the future of big data
Jonathan Ellis
Tuesday, October 4, 2011
About me
✤ Project chair, Apache Cassandra
  ✤ Active since Dec 2008
  ✤ First non-Facebook committer
  ✤ Wrote ~30% of committed patches, reviewed ~40% of the rest
✤ Distributed systems background
  ✤ At Mozy, built a multi-petabyte, scalable storage system based on Reed-Solomon encoding
✤ Founder and CTO, DataStax
About DataStax
✤ Founded in April 2010
✤ Commercial leader in Apache Cassandra
✤ 100+ customers
✤ 30+ employees
✤ Home to Apache Cassandra Chair & most committers
✤ Headquartered in San Francisco Bay area, California
✤ Secured $11M in Series B funding in Sep 2011
Job Trends (indeed.com)
“Big Data” trend
Big data
Analytics (Hadoop)
Realtime (“NoSQL”)?
Some Cassandra users
✤ Financial
✤ Social Media
✤ Advertising
✤ Entertainment
✤ Energy
✤ E-tail
✤ Health care
✤ Government
Common use cases
✤ Time series data
✤ Messaging
✤ Ad tracking
✤ Data mining
✤ User activity streams
✤ User sessions
✤ Anything requiring: scalable + performant + highly available
Why people choose Cassandra
✤ Multi-master, multi-DC
✤ Linearly scalable
✤ Larger-than-memory datasets
✤ Best-in-class performance (not just writes!)
✤ Fully durable
✤ Integrated caching
✤ Tuneable consistency
0.7
✤ CREATE COLUMN FAMILY
✤ Expiring columns (TTL)
✤ Secondary (column) indexes
✤ Efficient streaming
✤ Efficient cross-datacenter writes
0.8
✤ CQL
✤ Counters
✤ Automatic memtable tuning
✤ New bulk load interface
1.0
✤ Compression
✤ Read performance
✤ LeveledCompactionStrategy
✤ CQL 2.0
Compression
✤ Rows-per-block or blocks-per-row
Classic size-tiered compaction
Level-based Compaction
✤ SSTables are non-overlapping within a level
✤ Bounds the number that can contain a given row
L2: 1000 MB
L1: 100 MB
L0: newly flushed
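The bound on the number of sstables per row can be illustrated with a small Python sketch. The `SSTable` tuple, the level layout, and `candidates_for_key` are hypothetical names chosen for illustration, not Cassandra internals. The sketch assumes that L1 and above hold disjoint key ranges, so a key can match at most one sstable in each of those levels, while newly flushed L0 sstables may still overlap.

```python
from collections import namedtuple

# Hypothetical model: an sstable covers an inclusive key range [first, last].
SSTable = namedtuple("SSTable", ["name", "first", "last"])

def candidates_for_key(levels, key):
    """Return the sstables whose key range could contain `key`.

    `levels` maps a level number to its list of sstables. Level 0 holds
    newly flushed sstables that may overlap; higher levels are assumed
    to be non-overlapping, so each contributes at most one candidate.
    """
    hits = []
    for level in sorted(levels):
        matching = [t for t in levels[level] if t.first <= key <= t.last]
        if level > 0:
            # More than one match here would break the non-overlap invariant.
            assert len(matching) <= 1, "level %d overlaps" % level
        hits.extend(matching)
    return hits

# Made-up layout for illustration only.
levels = {
    0: [SSTable("f1", "a", "z"), SSTable("f2", "c", "q")],
    1: [SSTable("a1", "a", "h"), SSTable("a2", "i", "p"), SSTable("a3", "q", "z")],
    2: [SSTable("b1", "a", "d"), SSTable("b2", "e", "m"), SSTable("b3", "n", "z")],
}

found = [t.name for t in candidates_for_key(levels, "k")]
print(found)
```

In this layout a read for key `k` touches both L0 sstables but only one sstable from each of L1 and L2, however many sstables those levels contain.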
Read performance: maxtimestamp
✤ Sort sstables by maximum (client-provided) timestamp
✤ Only merge sstables until we have the columns requested
✤ Allows pre-merging highly fragmented rows without waiting for compaction
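A rough Python sketch of this read path follows. The dictionary layout and the `read_columns` helper are hypothetical, and the stopping rule (every requested column already found with a timestamp newer than the next sstable's maximum) is one assumed way to make "until we have the columns requested" safe; the slides do not spell out the exact condition.

```python
def read_columns(sstables, wanted):
    """Collect the newest value for each wanted column.

    `sstables` is a list of dicts with a "max_timestamp" and a "columns"
    map of name -> (timestamp, value). Tables are visited newest first,
    and the scan stops once every wanted column has been found with a
    timestamp newer than anything the remaining tables could hold.
    Returns the values found and the number of sstables actually read.
    """
    wanted = set(wanted)
    ordered = sorted(sstables, key=lambda t: t["max_timestamp"], reverse=True)
    result = {}
    visited = 0
    for table in ordered:
        if len(result) == len(wanted) and all(
            ts > table["max_timestamp"] for ts, _ in result.values()
        ):
            break  # older tables cannot supersede what we already have
        visited += 1
        for name in wanted:
            if name in table["columns"]:
                ts, value = table["columns"][name]
                if name not in result or ts > result[name][0]:
                    result[name] = (ts, value)
    return {name: value for name, (ts, value) in result.items()}, visited

# Made-up sstables for illustration only.
sstables = [
    {"max_timestamp": 10, "columns": {"name": (10, "old")}},
    {"max_timestamp": 30, "columns": {"name": (30, "new"),
                                      "email": (25, "a@example.com")}},
    {"max_timestamp": 20, "columns": {"email": (20, "stale@example.com")}},
]

values, visited = read_columns(sstables, ["name", "email"])
print(values, visited)
```

Here the newest sstable already holds both columns with timestamps above 20, so the two older sstables are never read.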
Results
CQL
cqlsh> SELECT * FROM users WHERE state='UT' AND birth_date > 1970;
KEY | birth_date | full_name | state |
bsanderson | 1975 | Brandon Sanderson | UT |
CQL 2.0
✤ ALTER
✤ Counter support
✤ TTL support
✤ SELECT count(*)
Post-1.0 features
✤ Ease of use
✤ CQL
  ✤ “Native” transport
  ✤ Composite columns
  ✤ Prepared statements
✤ Triggers
✤ Entity groups
✤ Smarter range queries
  ✤ Enables more-efficient analytics
The evolution of Analytics
Analytics + Realtime
The evolution of Analytics
Analytics Realtime
replication
The evolution of Analytics
ETL
Big data
Analytics (Hadoop)
Realtime (Cassandra)
DataStax Enterprise
DataStax Enterprise re-unifies realtime and analytics
Data model: Realtime
Portfolios
row_key | GOOG | LNKD | P | AMZN | AAPL
Portfolio1 | 80 | 20 | 40 | 100 | 20

StockHist
row_key | 2011-01-01 | 2011-01-02 | 2011-01-03
GOOG | $79.85 | $75.23 | $82.11

LiveStocks
row_key | last
GOOG | $95.52
AAPL | $186.10
AMZN | $112.98
Data model: Analytics
HistLoss
row_key | worst_date | loss
Portfolio1 | 2011-07-23 | -$34.81
Portfolio2 | 2011-03-11 | -$11432.24
Portfolio3 | 2011-05-21 | -$1476.93
Data model: Analytics
10dayreturns
ticker | rdate | return
GOOG | 2011-07-25 | $8.23
GOOG | 2011-07-24 | $6.14
GOOG | 2011-07-23 | $7.78
AAPL | 2011-07-25 | $15.32
AAPL | 2011-07-24 | $12.68

INSERT OVERWRITE TABLE 10dayreturns
SELECT a.row_key ticker, b.column_name rdate, b.value - a.value
FROM StockHist a JOIN StockHist b
  ON (a.row_key = b.row_key AND date_add(a.column_name, 10) = b.column_name);
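The self-join in this query pairs each price with the price for the same ticker ten days later. A minimal Python sketch of the same logic is shown below; the `ten_day_returns` helper, the in-memory dictionary layout, and the sample prices are assumptions for illustration and are not part of the demo.

```python
from datetime import date, timedelta

def ten_day_returns(stock_hist, days=10):
    """Mirror the Hive self-join: pair each price with the price `days` later.

    `stock_hist` maps ticker -> {date: closing price}, i.e. one wide row
    per ticker as in StockHist. Returns (ticker, rdate, change) tuples.
    """
    rows = []
    for ticker, prices in stock_hist.items():
        for start, start_price in prices.items():
            end = start + timedelta(days=days)  # date_add(a.column_name, 10)
            if end in prices:                   # inner join drops unmatched dates
                rows.append((ticker, end, round(prices[end] - start_price, 2)))
    return sorted(rows)

# Made-up prices for illustration only.
stock_hist = {
    "GOOG": {
        date(2011, 1, 1): 79.85,
        date(2011, 1, 11): 82.11,
        date(2011, 1, 21): 75.23,
    }
}

returns = ten_day_returns(stock_hist)
print(returns)
```

The last date has no partner ten days later, so it produces no row, just as the inner join would drop it.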
Data model: Analytics
row_key | 2011-01-01 | 2011-01-02 | 2011-01-03
GOOG | $79.85 | $75.23 | $82.11

row_key | column_name | value
GOOG | 2011-01-01 | $8.23
GOOG | 2011-01-02 | $6.14
GOOG | 2011-01-03 | $7.78
Data model: Analytics
portfolio_returns
portfolio | rdate | preturn
Portfolio1 | 2011-07-25 | $118.21
Portfolio1 | 2011-07-24 | $60.78
Portfolio1 | 2011-07-23 | -$34.81
Portfolio2 | 2011-07-25 | $2143.92
Portfolio3 | 2011-07-24 | -$10.19

INSERT OVERWRITE TABLE portfolio_returns
SELECT row_key portfolio, rdate, SUM(b.return)
FROM portfolios a JOIN 10dayreturns b
  ON (a.column_name = b.ticker)
GROUP BY row_key, rdate;
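The join-then-aggregate step can be sketched in Python as follows. The `portfolio_returns` function and the in-memory layout are hypothetical; the sample rows reuse the GOOG and AAPL values from the 10dayreturns slide. As in the query as written, the share count is not used: each held ticker contributes its return once.

```python
from collections import defaultdict

def portfolio_returns(portfolios, returns):
    """Join holdings to per-ticker returns and sum per (portfolio, rdate).

    `portfolios` maps portfolio -> {ticker: shares}; `returns` is a list
    of (ticker, rdate, change) rows.
    """
    totals = defaultdict(float)
    for name, holdings in portfolios.items():
        for ticker, rdate, change in returns:
            if ticker in holdings:  # a.column_name = b.ticker
                totals[(name, rdate)] += change
    return {key: round(total, 2) for key, total in totals.items()}

# Holdings are illustrative; return values are taken from the slide.
portfolios = {"Portfolio1": {"GOOG": 80, "AAPL": 20}}
ticker_returns = [
    ("GOOG", "2011-07-25", 8.23),
    ("AAPL", "2011-07-25", 15.32),
    ("GOOG", "2011-07-24", 6.14),
    ("AAPL", "2011-07-24", 12.68),
]

summed = portfolio_returns(portfolios, ticker_returns)
print(summed)
```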
Data model: Analytics
INSERT OVERWRITE TABLE HistLoss
SELECT a.portfolio, rdate, minp
FROM (
  SELECT portfolio, min(preturn) as minp
  FROM portfolio_returns
  GROUP BY portfolio
) a JOIN portfolio_returns b
  ON (a.portfolio = b.portfolio and a.minp = b.preturn);

HistLoss
row_key | worst_date | loss
Portfolio1 | 2011-07-23 | -$34.81
Portfolio2 | 2011-03-11 | -$11432.24
Portfolio3 | 2011-05-21 | -$1476.93
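The HistLoss query finds each portfolio's minimum return and joins back to recover the date on which it occurred. A single pass that tracks the running minimum gives the same answer when the minimum is unique; the sketch below takes that shortcut. The `hist_loss` helper is hypothetical, and the input rows are the ones shown on the portfolio_returns slide, so the output differs from the full-history HistLoss table above.

```python
def hist_loss(rows):
    """For each portfolio, keep the date and value of its lowest return.

    `rows` is a list of (portfolio, rdate, preturn) tuples. On ties the
    first row seen wins, whereas the Hive join would emit every tied row.
    """
    worst = {}
    for portfolio, rdate, preturn in rows:
        if portfolio not in worst or preturn < worst[portfolio][1]:
            worst[portfolio] = (rdate, preturn)
    return worst

# Rows as shown on the portfolio_returns slide.
rows = [
    ("Portfolio1", "2011-07-25", 118.21),
    ("Portfolio1", "2011-07-24", 60.78),
    ("Portfolio1", "2011-07-23", -34.81),
    ("Portfolio2", "2011-07-25", 2143.92),
    ("Portfolio3", "2011-07-24", -10.19),
]

losses = hist_loss(rows)
print(losses)
```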
Portfolio Demo dataflow
Portfolios
Historical Prices
Intermediate Results
Largest loss
Portfolios
Live Prices for today
Largest loss
Operations
✤ “Vanilla” Hadoop
  ✤ 8+ services to set up, monitor, back up, and recover (NameNode, SecondaryNameNode, DataNode, JobTracker, TaskTracker, Zookeeper, Region Server, ...)
  ✤ Single points of failure
  ✤ Can't separate online and offline processing
✤ DataStax Enterprise
  ✤ Single, simplified component
  ✤ Self-organizes based on workload
  ✤ Peer to peer
  ✤ JobTracker failover
  ✤ No additional Cassandra config
OpsCenter