bulk loading data into cassandra
DESCRIPTION
Whether running load tests or migrating historic data, loading data directly into Cassandra can be very useful to bypass the system’s write path. In this webinar, we will look at how data is stored on disk in sstables, how to generate these structures directly, and how to load this data rapidly into your cluster using sstableloader. We'll also review different use cases for when you should and shouldn't use this method.TRANSCRIPT
Bulk-Loading Data into Cassandra
Patricia Gorla@patriciagorlaCassandra Consultantwww.thelastpickle.com
Planet Cassandra 2014
About Us
• Work with clients to deliver and improve Apache Cassandra services
• Apache Cassandra committer, Datastax MVP, Hector maintainer, Apache Usergrid committer
• Based in New Zealand & USA
Why is bulk loading useful?
• Performance tests
Why is bulk loading useful?
• Performance tests
• Migrating historical data
Why is bulk loading useful?
• Performance tests
• Migrating historical data
• Changing topologies
!
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
Cassandra Write Path write[0]
Cassandra Write Path• Writes written to both the commit log and
memtable.
write[0]
memtablecommitlog
Cassandra Write Path• Writes written to both the commit log and
memtable.
• Memtable is sorted.
write[0]
memtablecommitlog
Cassandra Write Path• Memtable flushed out to sstables.
sstable[0]sstable[1]
sstable[2]
write[0]
memtablecommitlog
Cassandra Write Path• Compaction helps keep the read latency
low.
sstable[0]sstable[1]
sstable[2]
sstable[n]
write[0]
memtablecommitlog
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Contains all data needed to regenerate components
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Index of row keys
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Index summary from Index.db file
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Bloom filter over sstable
Sorted String Tables
mykeyspace-mycf-jb-1-CompressionInfo.db mykeyspace-mycf-jb-1-Data.db mykeyspace-mycf-jb-1-Filter.db mykeyspace-mycf-jb-1-Index.db mykeyspace-mycf-jb-1-Statistics.db mykeyspace-mycf-jb-1-Summary.db mykeyspace-mycf-jb-1-TOC.txt
Table of contents of all components
!
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
Set up keyspace and column family
create keyspace test with placement_strategy = 'org.apache.cassandra.locator.SimpleStrategy' and strategy_options = {replication_factor:1}; !
create column family test with comparator = 'AsciiType' and default_validation_class = 'AsciiType' and key_validation_class = 'AsciiType';
SStableGen.java
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, size_per_sstable_mb );
// subcomparator for super columns
SStableGen.java
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, size_per_sstable_mb );
// subcomparator for super columns
SStableGen.java
AbstractSSTableSimpleWriter writer = new SSTableSimpleUnsortedWriter( directory, partitioner, keyspace, columnFamily, AsciiType.instance, null, size_per_sstable_mb );
// subcomparator for super columns
ByteBuffer randomBytes = ByteBufferUtil.bytes(randomAscii(1024)); KeyGenerator keyGen = new KeyGenerator(); long dataSize = 0; writer = new SSTableSimpleUnsortedWriter(…); while (dataSize < max_data_bytes) { writer.newRow(key); for (int j=0; j<num_cols; j++) { ByteBuffer colName = ByteBufferUtil.bytes("col_" + j); ByteBuffer colValue = ByteBuffer.wrap(new byte[20]); randomBytes.get(colValue.array()); colValue.position(0); writer.addColumn(colName, colValue, timestamp); if (randomBytes.remaining() < colValue.limit()) { randomBytes.position(0); } else { randomBytes.position(randomBytes.position() + colValue.limit()); } } } }
Examining sstable output
patricia@dev:~/../data$ ls -lh mykeyspace/mycf total 64 -rw-r--r-- 1 patricia staff 43B Feb 2 15:31 mykeyspace-mycf-jb-1-CompressionInfo.db -rw-r--r-- 1 patricia staff 79K Feb 2 15:31 mykeyspace-mycf-jb-1-Data.db -rw-r--r-- 1 patricia staff 16B Feb 2 15:31 mykeyspace-mycf-jb-1-Filter.db -rw-r--r-- 1 patricia staff 36B Feb 2 15:31 mykeyspace-mycf-jb-1-Index.db -rw-r--r-- 1 patricia staff 4.3K Feb 2 15:31 mykeyspace-mycf-jb-1-Statistics.db -rw-r--r-- 1 patricia staff 80B Feb 2 15:31 mykeyspace-mycf-jb-1-Summary.db -rw-r--r-- 1 patricia staff 79B Feb 2 15:31 mykeyspace-mycf-jb-1-TOC.txt
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d localhost Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/127.0.0.1] progress: [/127.0.0.1 1/1 (100)] [total: 100 - 0MB/s (avg: 0MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
• Run command on separate server
$ bin/sstableloader Keyspace1/ColFam1
• Run command on separate server
• Throttle command
$ bin/sstableloader Keyspace1/ColFam1
• Run command on separate server
• Throttle command
• Parallelise processes
!
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
// list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); !// assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } !customerOrders.close() orders.close()
// list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); !// assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } !customerOrders.close() orders.close()
// list of orders by user customerOrders = new SSTableSimpleUnsortedWriter(…); // orders by order id orders = new SSTableSimpleUnsortedWriter(…); !// assume orders are in date order for (Order order : oldOrders) { customerOrders.newRow(ByteBufferUtil.bytes(order.customerId)); customerOrders.addColumn(ByteBufferUtil.bytes(order.orderId), ByBufferUtil.EMPTY_BYTE_BUFFER, timestamp); ! orders.newRow(ByteBufferUtil.bytes(order.userId)); orders.addColumn(ByteBufferUtil.bytes(“customer_id), ByteBufferUtil.bytes(order.customerId), timestamp); orders.addColumn(ByteBufferUtil.bytes(“date), ByteBufferUtil.bytes(order.date), timestamp); orders.addColumn(ByteBufferUtil.bytes(“total), ByteBufferUtil.bytes(order.total), timestamp); } !customerOrders.close() orders.close()
!
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \ cass1,cass2,cass3 !Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] !progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \ cass1,cass2,cass3 !Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] !progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/…/cassandra-2.0.4$ bin/sstableloader mykeyspace/mycf -d \ cass1,cass2,cass3 !Streaming relevant part of mykeyspace/mycf/mykeyspace-mycf-ic-1-Data.db to [/cass1,cass2, cass3,cass4,cass5,cass6] !progress: [/cas1 3/3 (100)] [/cas2 0/4 (0)] [/cas3 0/0 (0)] [/cas4 0/0 (0)] [/cas5 0/0 (0)] [/cas6 1/2 (50)] [total: 50 - 0MB/s (avg: 5MB/s)]
$ bin/sstableloader Keyspace1/ColFam1
patricia@dev:~/.../cassandra-2.0.4$ bin/nodetool compactionstats pending tasks: 30 Active compaction remaining time : n/a
!
• How Data is Stored
• Case Studies
- Generating Dummy Data
- Backfilling Historical Data
- Changing Topologies
• Conclusion
CQL: Keep schema consistent
cqlsh> CREATE KEYSPACE "test" WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1 }; !
cqlsh> CREATE COLUMNFAMILY "test" (id text PRIMARY KEY ) ;
CQL3 Considerations
• Uses CompositeType comparator
Q&A
Patricia Gorla@patriciagorlaCassandra Consultantwww.thelastpickle.com
Planet Cassandra 2014