![Page 1: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/1.jpg)
The Internet in a Database
A Cassandra Use Case
![Page 2: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/2.jpg)
Data on the Web
DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
● 48 billion pages on the Internet
● 56 million GB of data
● Incredibly powerful connections
● 70% of useful data is unstructured
● User generated data + facts
![Page 3: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/3.jpg)
Too Much Data…
![Page 4: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/4.jpg)
Search
● Modern search engines
○ Unstructured data
○ Unconnected data
○ Unnormalized data
![Page 5: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/5.jpg)
● Goals
○ Collect vast amounts of data through web crawling
○ Normalize and deduplicate data
○ Make it searchable and meaningful
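A minimal sketch of the "normalize and deduplicate" goal, assuming crawled records are plain dictionaries with hypothetical `name`, `address`, and `phone` fields. The slides do not describe Datafiniti's actual matching rules; the key-building and merge policy here are illustrative only.

```python
import re
import unicodedata


def normalize_key(name: str, address: str) -> str:
    """Build a comparison key: drop accents, case, punctuation, extra spaces."""
    text = unicodedata.normalize("NFKD", f"{name} {address}")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def deduplicate(records: list[dict]) -> list[dict]:
    """Merge records sharing a normalized key; the first non-empty value wins."""
    merged: dict[str, dict] = {}
    for record in records:
        key = normalize_key(record.get("name", ""), record.get("address", ""))
        target = merged.setdefault(key, {})
        for field, value in record.items():
            if value and not target.get(field):
                target[field] = value
    return list(merged.values())


# Hypothetical crawl output: the first two rows describe the same place.
crawled = [
    {"name": "Café Luna", "address": "12 Main St.", "phone": ""},
    {"name": "CAFE LUNA", "address": "12 main st", "phone": "555-0100"},
    {"name": "Book Nook", "address": "9 Elm Ave", "phone": "555-0199"},
]
unique = deduplicate(crawled)
```

Merging on a normalized key keeps the first spelling seen and fills gaps from later duplicates, so the two "Café Luna" rows collapse into one record that carries the phone number.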
![Page 6: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/6.jpg)
Needs
● Speed
● Scale
● Adaptable
![Page 7: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/7.jpg)
The Solution
● Very fast
○ Log-structured storage
● Easily scalable
○ Decentralized rings
● Completely adaptable
○ Schema-less key/value store
![Page 8: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/8.jpg)
…Almost
● Useful searching was missing
○ Secondary indexes not flexible
○ No free text searches
○ No (reasonable) range queries
![Page 9: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/9.jpg)
What We Needed
● Pros: Full control over indexing
● Cons: Not scalable
![Page 10: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/10.jpg)
Putting It All Together
● Reasons to go with DSE
○ Combines Cassandra and Solr
○ Constant refinements and integrations
○ Support
![Page 11: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/11.jpg)
Our Stack
[Diagram: stack components labeled Web Crawling, Normalization, three Cassandra + Solr nodes, and Load Balancing]
![Page 12: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/12.jpg)
Cassandra / Solr Setup
● 3 column families / 3 cores
○ Locations
○ Products
○ People
● 73,114,909 records
![Page 13: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/13.jpg)
Data: Locations
● 29,818,644 records
● Interesting data
○ Reviews
○ Revenue
○ Contact information
● Businesses vs. Locations
○ Unique key
○ Location specific user data
![Page 14: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/14.jpg)
Data: Products
● 18,470,005 records
● Interesting data
○ Categories
○ Price
○ Reviews
● Challenges
○ Too many unique keys
![Page 15: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/15.jpg)
Data: People
● 24,826,260 records
● Interesting data
○ Work History
○ Education History
○ Location
● Challenges
○ Normalization
○ Identification
![Page 16: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/16.jpg)
Challenges
● Memory
● Speed
● Space
● Representation
![Page 17: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/17.jpg)
Challenges: Memory
● Multi-minute garbage collection
● Exponential increase in frequency
● Virtual memory confusion
● Solr + Cassandra
● Heap Size vs Buffer Cache
● Bash Scripts
![Page 18: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/18.jpg)
Challenges: Speed
● Upgrade
○ Better memory management
○ Smaller index size
● Reduce index size
● Future: Solaris
![Page 19: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/19.jpg)
Challenges: Speed
● Providing a real-time service
● Issues
○ Solr not inherently real time
○ Search speeds
○ I/O
![Page 20: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/20.jpg)
Challenges: Speed
● Solr Solution: DSE integration leverages
○ Cassandra's speed
○ Cassandra's caches
○ Cassandra's distribution
○ Solr caches less useful
![Page 21: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/21.jpg)
Challenges: Speed
● Search complexity solution
○ Text vs String indexing
○ Uniqueness vs Flexibility
○ Leveraging Cassandra
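The trade-off between text and string indexing can be illustrated with a toy inverted index. This is a conceptual sketch in plain Python, not Solr code: a "string" index stores each whole field value as one term (exact match only), while a "text" index stores one term per word (flexible search, more terms). The function names and sample data are hypothetical.

```python
from collections import defaultdict


def build_string_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Exact-match index: the whole field value is a single term."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, value in docs.items():
        index[value].add(doc_id)
    return index


def build_text_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Tokenized index: every lowercased word becomes its own term."""
    index: dict[str, set[str]] = defaultdict(set)
    for doc_id, value in docs.items():
        for token in value.lower().split():
            index[token].add(doc_id)
    return index


# Hypothetical location names keyed by record id.
names = {"loc1": "Joe's Pizza Austin", "loc2": "Austin Pizza Kitchen"}
string_index = build_string_index(names)
text_index = build_text_index(names)

exact_hits = string_index.get("Joe's Pizza Austin", set())  # full value only
word_hits = text_index.get("pizza", set())                  # any record with the word
missed = string_index.get("pizza", set())                   # string index cannot match a word
```

The tokenized index answers word queries but holds more terms than the exact-match one, which is one reason to reserve text indexing for fields that really need free-text search.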
![Page 22: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/22.jpg)
Challenges: Speed
● I/O Solution
○ Cassandra's built in mapping
○ Increase disk access speeds (SSDs)
■ Not cost effective
○ Future: Solaris
![Page 23: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/23.jpg)
Challenges: Space
● Field corruption
○ Caused by improper encoding
○ Exponential growth
○ Fills up Solr index
● Locate, inspect & remove corrupt records
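One way improper encoding can produce exponential field growth is UTF-8 bytes being re-read as Latin-1 on every pass, which doubles each non-ASCII character. The slides do not say this was the exact cause, so the sketch below is an assumption-laden illustration: `mis_decode` simulates that bug, and `looks_corrupt` is a hypothetical heuristic for the "locate" step. Flagged records still need manual inspection, since legitimate text can contain the same characters.

```python
def mis_decode(text: str, rounds: int) -> str:
    """Simulate the bug: UTF-8 bytes wrongly read as Latin-1, repeatedly."""
    for _ in range(rounds):
        text = text.encode("utf-8").decode("latin-1")
    return text


def looks_corrupt(value: str, max_len: int = 200) -> bool:
    """Heuristic: oversized field, or the 'Ã'/'Â' characters typical of mojibake."""
    if len(value) > max_len:
        return True
    return "Ã" in value or "Â" in value


def find_corrupt(records: list[dict], fields: list[str]) -> list[str]:
    """Return ids of records with at least one suspicious field."""
    return [
        record["id"]
        for record in records
        if any(looks_corrupt(record.get(field, "")) for field in fields)
    ]


clean = "Café"
damaged = mis_decode(clean, 3)  # one accented character grows to eight
records = [{"id": "a", "name": clean}, {"id": "b", "name": damaged}]
suspects = find_corrupt(records, ["name"])
```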
![Page 24: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/24.jpg)
Challenges: Space
● Solr index issue
○ No compression (vs Cassandra)
○ Must adjust indexing
● Key things to keep in mind
○ Size of fields
○ Scale vs Flexibility
○ Index as little as possible
![Page 25: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/25.jpg)
Challenges: Representation
● Cassandra is flat
● Actual data is not flat
○ Reviews
○ Price information
● Many different output formats
○ CSV, JSON, XML, etc.
![Page 26: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/26.jpg)
Challenges: Representation
● Solution: Flatten when possible
○ E.g. Address object -> Separate fields
● Internal subgroup representation
○ Composite keys (Occasionally)
■ Known subgroups
■ Non multiple subgroups
○ Dynamic fields
■ Composite field + Dynamic tag
■ E.g. review.text_<tag>
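A rough sketch of the flattening idea, assuming nested records arrive as Python dictionaries. Single sub-objects (such as an address) become dotted columns, and repeating subgroups (such as reviews) become dynamic columns in the `review.text_<tag>` style from the slide. What the tag actually contains is not stated in the slides; the `tag` key and the sample record below are hypothetical.

```python
def flatten(record: dict) -> dict:
    """Turn one nested record into a flat column -> value mapping."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Single known subgroup: address -> address.street, address.city
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Repeating subgroup: one set of columns per item, suffixed by its tag
            for item in value:
                tag = item["tag"]
                for sub_key, sub_value in item.items():
                    if sub_key != "tag":
                        flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            flat[key] = value
    return flat


# Hypothetical location record with one address and two reviews.
location = {
    "name": "Book Nook",
    "address": {"street": "9 Elm Ave", "city": "Austin"},
    "review": [
        {"tag": "r1", "text": "Great selection", "rating": 5},
        {"tag": "r2", "text": "Cozy", "rating": 4},
    ],
}
row = flatten(location)
```

Because every column name is computed from the data, this only works on a store that accepts arbitrary column names per row, which is the schema-less property the earlier slides cite as a reason for choosing Cassandra.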
![Page 27: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/27.jpg)
Challenges: Representation
● Robust and adaptable conversion package
● JSON -> Internal
○ Solr returns JSON
● Internal -> CSV, JSON, XML
○ User defined views
○ Specify field groupings
○ Specify partitioning
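The conversion flow (JSON in, internal rows, several formats out) might look roughly like the following. This is a sketch using only the Python standard library, not Datafiniti's conversion package: `apply_view` stands in for a user-defined view as a simple field list, and the sample Solr response and function names are made up for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def apply_view(rows: list[dict], fields: list[str]) -> list[dict]:
    """User-defined view: keep only the requested fields, in order."""
    return [{field: row.get(field, "") for field in fields} for row in rows]


def to_json(rows: list[dict]) -> str:
    return json.dumps(rows, sort_keys=True)


def to_csv(rows: list[dict], fields: list[str]) -> str:
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_xml(rows: list[dict]) -> str:
    root = ET.Element("records")
    for row in rows:
        record = ET.SubElement(root, "record")
        for name, value in row.items():
            # Field names go in an attribute so dotted or tagged names stay valid XML.
            ET.SubElement(record, "field", name=name).text = str(value)
    return ET.tostring(root, encoding="unicode")


# Made-up Solr JSON response; documents live under response -> docs.
solr_json = (
    '{"response": {"numFound": 1, "docs": '
    '[{"name": "Book Nook", "city": "Austin", "phone": "555-0199"}]}}'
)
rows = json.loads(solr_json)["response"]["docs"]   # JSON -> internal
view = apply_view(rows, ["name", "city"])          # internal -> view
csv_text = to_csv(view, ["name", "city"])
json_text = to_json(view)
xml_text = to_xml(view)
```

Keeping a single internal row format between the Solr response and the writers means a new output format only needs one extra function.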
![Page 28: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/28.jpg)
![Page 29: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/29.jpg)
Future Work
● Memory Usage
● Speed
● Space
● Containers
![Page 30: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/30.jpg)
Future Work: Memory
● Java 7 G1 (Garbage First) Collector
○ Ideal for large heaps
○ Big Data Sets
○ Bursty Workloads
![Page 31: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/31.jpg)
Future Work: Speed
● Solaris Kernel Scheduler > Linux Kernel Scheduler
○ (At large number of cores)
● Drastically increase IOPS
○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
○ Reduce needed size of SSD
■ More smaller SSDs in ZFS pool
○ Fewer moving parts
![Page 32: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/32.jpg)
Future Work: Space
● Caching at PCIe, Storing on SATA III
○ Cheaper larger storage via ZFS pools
○ Easier to grow
● ZFS Compression (LZ4)
○ Replaces Cassandra's Snappy compression
○ Very fast lossless compression (400 MB/s per core)
○ Scales to multiple CPUs
○ Hits the RAM speed limit
![Page 33: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/33.jpg)
Future Work: Containers
● OS Level virtualization
○ Resource control
○ Boundary separation
● More control over Cassandra resources
● Better snapshots (whole machine)
● Hardware abstracted out
○ Many disks represented as single space
○ Easily add or remove hardware
![Page 34: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/34.jpg)
Questions?
https://www.datafiniti.net
http://blog.datafiniti.net
@datafiniti
![Page 35: Datafiniti: The Internet in a Database - Cassandra Use Case](https://reader034.vdocument.in/reader034/viewer/2022052622/558e82a41a28ab87528b45e6/html5/thumbnails/35.jpg)
Addendum 1
ZFS Comparison
| Name | Ratio | Compression (MB/s) | Decompression (MB/s) |
|---|---|---|---|
| LZ4 (r97) | 2.084 | 410 | 1810 |
| LZO 2.06 | 2.106 | 409 | 600 |
| QuickLZ 1.5.1b6 | 2.237 | 373 | 420 |
| Snappy 1.1.0 | 2.091 | 323 | 1070 |
| LZF | 2.077 | 270 | 570 |
| zlib 1.2.8 -1 | 2.730 | 65 | 280 |
| LZ4 HC (r97) | 2.720 | 25 | 2040 |
| zlib 1.2.8 -6 | 3.099 | 21 | 300 |
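The ratio and throughput columns can be reproduced in spirit with a small measurement script. The sketch below uses zlib at levels 1 and 6 (the two zlib rows in the table) because it ships with the Python standard library; LZ4 and Snappy would need third-party packages. The sample data is synthetic and hypothetical, so the numbers it prints will not match the table and depend on the machine.

```python
import time
import zlib


def measure(data: bytes, level: int) -> dict:
    """Compress once at the given zlib level and report ratio and speed."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    if zlib.decompress(packed) != data:
        raise ValueError("round trip failed")  # compression must be lossless
    return {
        "level": level,
        "ratio": len(data) / len(packed),
        "mb_per_s": len(data) / 1e6 / elapsed if elapsed > 0 else float("inf"),
    }


# Synthetic, highly repetitive sample rows (hypothetical field layout).
sample = b"".join(
    b"id=%d;name=Book Nook;city=Austin;rating=5\n" % i for i in range(20_000)
)
results = [measure(sample, level) for level in (1, 6)]
for result in results:
    print(f"zlib -{result['level']}: ratio {result['ratio']:.3f}, "
          f"{result['mb_per_s']:.0f} MB/s")
```

A single timed run is noisy; repeating each measurement and taking the median would give steadier throughput figures.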