![Page 1: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/1.jpg)
#CassandraEU
Top-k queries in real-time with Cassandra and Intravert
Jonathan Halliday, [email protected]
Rui Vieira, Newcastle [email protected]
![Page 2: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/2.jpg)
What is Top-k?
![Page 3: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/3.jpg)
![Page 4: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/4.jpg)
Top-k queries
• Rank matching results for the term(s)
  – We don't really care about the scoring algorithm
• Application: text search
  – Documents containing the search words
• Application: log analysis
  – Popular URLs in the time period
![Page 5: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/5.jpg)
Yawn?
• SELECT document_id, score FROM data WHERE term='top-k' ORDER BY score DESC, document_id LIMIT 100
• Lunch time!
![Page 6: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/6.jpg)
Not so fast...
• SELECT document_id, score FROM data WHERE term IN ('top-k', 'algorithm') GROUP BY document_id ORDER BY score DESC, document_id LIMIT 100
![Page 7: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/7.jpg)
Distributed Top-k
• We have a lot of data
• It's spread out
• We need to combine a subset efficiently
• Map/Reduce to the rescue!
  – HiveQL, Stinger, Impala, Hawq
• Easy! But not fast
![Page 8: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/8.jpg)
'real-time'
• Web pages, not control systems
• Performance, not Timeliness
• Pre-compute as much as possible
  – Scores for each term
• Assemble pre-computed fragments at query time
  – 'group by'
![Page 9: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/9.jpg)
Naive method
foreach (term in searchTerms) {
    SELECT ... FROM ... WHERE ...
}
• Handle group by in the application code
• Inefficient – transfers ALL the data for each term, even low scores
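A minimal Python sketch of the naive method, assuming an in-memory dictionary stands in for the Cassandra table. `DATA`, `fetch_all` and `naive_top_k` are hypothetical names chosen for illustration, not part of any driver API. The function also counts rows transferred, to show that every row for every term reaches the application however low its score.

```python
import heapq
from collections import defaultdict

# Hypothetical stand-in for the stored data: term -> [(doc_id, score), ...]
DATA = {
    "top-k": [("d1", 90), ("d2", 50), ("d3", 10)],
    "algorithm": [("d2", 80), ("d3", 30), ("d4", 20)],
}


def fetch_all(term):
    """Stand-in for an unbounded per-term SELECT: returns every row."""
    return list(DATA.get(term, []))


def naive_top_k(terms, k):
    """Fetch everything for each term, then do the 'group by' in the client."""
    totals = defaultdict(int)
    rows_transferred = 0
    for term in terms:
        rows = fetch_all(term)
        rows_transferred += len(rows)
        for doc_id, score in rows:
            totals[doc_id] += score
    # Order by summed score descending, then doc_id ascending.
    top = heapq.nsmallest(k, totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return top, rows_transferred
```

With this toy data, asking for the top 2 documents still pulls all 6 rows across the two terms.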
![Page 10: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/10.jpg)
How much data is enough?
• Data is stored keyed (i.e. sorted) by { term, score DESC, doc_id } or { time_period, score DESC, Url }
• Partition keys IN the query params
  – We can filter efficiently
• Can we range limit on score?
  – Avoid going into the long tail
![Page 11: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/11.jpg)
Bring on the clever algorithms
• Smart people thought about this problem already...
• ...but not in quite the same context
  – WAN distributed logs from CDNs
• Identify, adapt and reuse existing solutions
  – Faster and less risky than starting over
![Page 12: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/12.jpg)
Inside a clever algorithm
• Fetch a little bit of data
• Look at it, decide how much more we need
• Fetch some more
• Rinse and repeat
  – But not too many times
![Page 13: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/13.jpg)
Desirable Characteristics
• Fixed number of communication rounds is key
• Generality is good
  – Cope with any distribution of data
• So is flexibility
  – Tune for different use cases
![Page 14: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/14.jpg)
Meet the candidates
• Three-Phase Uniform Threshold (TPUT)
  – 'Efficient Top-K Query Calculation in Distributed Networks', Stanford/Princeton, 2004
• Hybrid Threshold
  – 'Efficient Processing of Distributed Top-k Queries', UCSB, 2005
• KLEE
  – 'KLEE: a framework for distributed top-k query algorithms', Max-Planck Institute, 2005
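To give a feel for the first candidate, here is a simplified in-memory Python sketch of the three TPUT phases as commonly described: local top-k from each peer, a uniform threshold fetch, then exact lookups for the surviving candidates. It assumes non-negative scores and a sum aggregate. `PEERS`, `partial_sums`, `kth_largest` and `tput` are hypothetical names, and each dictionary stands in for one peer (one term or time period).

```python
import heapq
from collections import defaultdict

# Hypothetical peers: each maps item -> non-negative local score.
PEERS = [
    {"a": 10, "b": 8, "c": 1, "d": 1},
    {"b": 9, "c": 7, "a": 2, "e": 1},
    {"c": 9, "d": 6, "b": 1, "a": 1},
]


def partial_sums(reports):
    """Sum the scores reported so far; unreported scores count as 0."""
    totals = defaultdict(int)
    for report in reports:
        for item, score in report.items():
            totals[item] += score
    return dict(totals)


def kth_largest(sums, k):
    top = heapq.nlargest(k, sums.values())
    return top[-1] if len(top) == k else 0


def tput(peers, k):
    m = len(peers)
    # Phase 1: each peer reports its local top-k.
    reports = [dict(heapq.nlargest(k, p.items(), key=lambda kv: kv[1])) for p in peers]
    threshold = kth_largest(partial_sums(reports), k) / m
    # Phase 2: each peer reports every item scoring at least the uniform threshold.
    for peer, report in zip(peers, reports):
        report.update({i: s for i, s in peer.items() if s >= threshold})
    sums = partial_sums(reports)
    tau2 = kth_largest(sums, k)
    # Prune items whose upper bound (missing scores assumed just under threshold) is too low.
    candidates = [
        item for item, total in sums.items()
        if total + threshold * sum(item not in r for r in reports) >= tau2
    ]
    # Phase 3: fetch exact scores for the remaining candidates only.
    exact = {item: sum(p.get(item, 0) for p in peers) for item in candidates}
    return heapq.nsmallest(k, exact.items(), key=lambda kv: (-kv[1], kv[0]))
```

The number of rounds is fixed at three regardless of the data, which is the property the earlier slides ask for.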
![Page 15: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/15.jpg)
Implementation Issues
• Algorithms assume server side code execution
• Limitations of CQL3 add some round trips, increase network I/O
• Previous performance comparisons of algorithms may no longer be valid
![Page 16: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/16.jpg)
Data Transfer vs. k
![Page 17: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/17.jpg)
Execution Time vs. k
![Page 18: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/18.jpg)
Execution Time vs. peers
![Page 19: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/19.jpg)
![Page 20: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/20.jpg)
YMMV
• Test with your own data
• Test with your own hardware
• Hybrid Threshold for exact top-k
  – Intravert optional
• KLEE for tunable approximate top-k
  – Inefficient without Intravert
  – Requires metadata
![Page 21: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/21.jpg)
Intravert
• Cassandra++
  – Embed and extend the existing server
  – Based on Vert.x
• JSON over HTTP, REST API
  – Yup, Virgil did that already
• Multiple commands per call, chain operations with REFs
![Page 22: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/22.jpg)
Intravert
• Server side code execution
  – Groovy (for now; Vert.x is polyglot)
• Filter result sets
• Write path triggers
  – C* 2.0 has CASSANDRA-1311
• Run Groovy scripts on the server
  – Easier than extending the Thrift API
![Page 23: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/23.jpg)
Intravert
• Good trade-off between power and operational complexity
• More complex development cycle
  – Not easy to move code between client and server
• Client not topology aware
  – 'run x on each node' not possible
![Page 24: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/24.jpg)
Back to the clever algorithms
• Intravert server side execution enables cleaner, more efficient implementation
• Reduces network round trips
• Some dev and ops complexity increase
• Less complexity than custom server deployment
  – Reuse existing tools
![Page 25: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/25.jpg)
Pre-aggregation
• For text search, can't predict common term sets
• For time periods, can predict contiguous periods
• Pre-calculate the rollups
  – Hours, days, weeks, months
  – Reduces number of terms (peers) to group at query time
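A small Python sketch of how rollups cut the number of peers, assuming only two levels (hours and days) and time expressed as integer hour indices. `cover` is a hypothetical helper: it covers a contiguous range with whole-day rollups where they align and falls back to hourly buckets at the edges.

```python
def cover(start_hour, end_hour):
    """Cover [start_hour, end_hour) with aligned day rollups where possible, else hours."""
    buckets = []
    hour = start_hour
    while hour < end_hour:
        if hour % 24 == 0 and hour + 24 <= end_hour:
            buckets.append(("day", hour // 24))  # one pre-computed daily rollup
            hour += 24
        else:
            buckets.append(("hour", hour))  # one pre-computed hourly rollup
            hour += 1
    return buckets
```

For the 28-hour range from hour 22 to hour 50, this yields 5 buckets (two hours, one day, two hours) instead of 28 hourly peers to group at query time.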
![Page 26: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/26.jpg)
Really clever algorithms
• Hierarchical node topology
  – Map to Cassandra ring: same node may own multiple keys (peers != nodes)
• Budget constrained approximate top-k
  – Get as close as possible with the allowable time and I/O constraints
• Fault tolerance
  – Approximation given available nodes
![Page 27: C* Summit EU 2013: Top-K Queries in Realtime with Cassandra and Intravert](https://reader034.vdocument.in/reader034/viewer/2022052600/55856986d8b42a3e6f8b4933/html5/thumbnails/27.jpg)
Questions?
Or email us:
Jonathan Halliday, [email protected]
Rui Vieira, Newcastle [email protected]