Datafiniti: The Internet in a Database - Cassandra Use Case
DESCRIPTION
Austin Cassandra Users Meetup on July 15th, 2013: http://www.meetup.com/Austin-Cassandra-Users/events/125837112/ Datafiniti will present some of the unique and interesting challenges they faced while building out their data search engine, including a detailed use case around their Cassandra data model and other integrated technologies like Solr.
TRANSCRIPT
The Internet in a Database: A Cassandra Use Case
Data on the Web
DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET
● 48 billion pages on the Internet
● 56 million GB of data
● Incredibly powerful connections
● 70% of useful data is unstructured
● User generated data + facts
Too Much Data…
Search
● Modern search engines
○ Unstructured data
○ Unconnected data
○ Unnormalized data
● Goals
○ Collect vast amounts of data through web crawling
○ Normalize and deduplicate data
○ Make it searchable and meaningful
Needs
● Speed
● Scale
● Adaptable
● Very fast
○ Log-structured storage
● Easily scalable
○ Decentralized rings
● Completely adaptable
○ Schema-less key/value store
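The "decentralized rings" bullet above can be illustrated with a toy token ring. This is a minimal sketch under stated assumptions, not Cassandra's actual partitioner: the node names, the MD5-based `token` function, and the `TokenRing` class are all hypothetical and exist only for illustration.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key to a position on the ring (hypothetical MD5-based token)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Toy ring: each node owns the keys up to and including its own token."""

    def __init__(self, nodes):
        # Sort nodes by token so lookups can use binary search.
        self._ring = sorted((token(n), n) for n in nodes)
        self._tokens = [t for t, _ in self._ring]

    def owner(self, key: str) -> str:
        # First node whose token is >= the key's token, wrapping around.
        i = bisect.bisect_left(self._tokens, token(key))
        return self._ring[i % len(self._ring)][1]


ring = TokenRing(["node-a", "node-b", "node-c"])
bigger = TokenRing(["node-a", "node-b", "node-c", "node-d"])
keys = [f"record-{i}" for i in range(1000)]
moved = [k for k in keys if ring.owner(k) != bigger.owner(k)]
print(f"{len(moved)} of {len(keys)} keys change owner when node-d joins")
```

In this sketch no node acts as a coordinator, and the only keys that move when `node-d` joins are the ones it takes over. That is one way to read "easily scalable" on the slide.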
The Solution
…Almost
● Useful searching was missing
○ Secondary indexes not flexible
○ No free text searches
○ No (reasonable) range queries
● Pros: Full control over indexing
● Cons: Not scalable
What We Needed
● Reasons to go with DSE
○ Combines Cassandra and Solr
○ Constant refinements and integrations
○ Support
Putting It All Together
Our Stack
[Diagram: Web Crawling, Normalization, three Cassandra/Solr nodes, Load Balancing]
Cassandra / Solr Setup
● 3 column families / 3 cores
○ Locations
○ Products
○ People
● 73,114,909 records
Data: Locations
● 29,818,644 records
● Interesting data
○ Reviews
○ Revenue
○ Contact information
● Businesses vs. Locations
○ Unique key
○ Location specific user data
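The "Businesses vs. Locations" point suggests one unique key per physical location rather than per business. The slides do not say how that key is built, so the following is only a sketch: the `normalize` rules, the choice of fields, and the SHA-1 hash are assumptions made for illustration.

```python
import hashlib
import re


def normalize(value: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace."""
    value = re.sub(r"[^a-z0-9 ]+", " ", value.lower())
    return " ".join(value.split())


def location_key(name: str, address: str, postal_code: str) -> str:
    """Hypothetical row key: one key per physical location, not per business."""
    parts = [normalize(name), normalize(address), normalize(postal_code)]
    return hashlib.sha1("|".join(parts).encode("utf-8")).hexdigest()


# The same location crawled twice with different formatting.
a = location_key("Joe's Coffee", "100 Main St.", "78701")
b = location_key("JOE'S  COFFEE", "100 Main St", "78701")
# A second location of the same business.
c = location_key("Joe's Coffee", "200 Oak Ave", "78704")

print(a == b)  # duplicates collapse onto one key
print(a == c)  # separate locations keep separate keys
```

With a key like this, location-specific user data such as reviews attaches to the location row, and two crawled copies of the same location deduplicate naturally.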
Data: Products
● 18,470,005 records
● Interesting data
○ Categories
○ Price
○ Reviews
● Challenges
○ Too many unique keys
Data: People
● 24,826,260 records
● Interesting data
○ Work History
○ Education History
○ Location
● Challenges
○ Normalization
○ Identification
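As a quick consistency check, the record counts on the three data slides add up to the 73,114,909 total given on the Cassandra / Solr setup slide. The short script below only restates numbers from the slides; the per-core share it prints is derived, not quoted.

```python
# Record counts as stated on the Locations, Products, and People slides.
records = {
    "Locations": 29_818_644,
    "Products": 18_470_005,
    "People": 24_826_260,
}

total = sum(records.values())
for name, count in records.items():
    print(f"{name:<10} {count:>12,} {count / total:6.1%}")
print(f"{'Total':<10} {total:>12,}")
```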
Challenges
● Memory
● Speed
● Space
● Representation
Challenges: Memory
● Multi-minute garbage collection
● Exponential increase in frequency
● Virtual memory confusion
● Solr + Cassandra
● Heap Size vs Buffer Cache
● Bash Scripts
Challenges: Speed
● Upgrade
○ Better memory management
○ Smaller index size
● Reduce index size
● Future: Solaris
Challenges: Speed
● Providing a real-time service
● Issues
○ Solr not inherently real time
○ Search speeds
○ I/O
Challenges: Speed
● Solr Solution: DSE integration leverages
○ Cassandra's speed
○ Cassandra's caches
○ Cassandra's distribution
○ Solr caches less useful
Challenges: Speed
● Search complexity solution
○ Text vs String indexing
○ Uniqueness vs Flexibility
○ Leveraging Cassandra
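The "Text vs String indexing" trade-off can be shown with a toy inverted index. In Solr, a string field is indexed as a single exact-match term, while a text field is analyzed into one term per word. The sketch below imitates that distinction in plain Python: the two analyzers and the sample documents are hypothetical, and real Solr analyzers do considerably more.

```python
from collections import defaultdict


def string_analyzer(value):
    """String-style field: the whole value is one exact-match term."""
    return [value]


def text_analyzer(value):
    """Text-style field: lowercase and split into one term per word."""
    return value.lower().split()


def build_index(docs, analyzer):
    """Build a toy inverted index mapping term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for term in analyzer(value):
            index[term].add(doc_id)
    return index


docs = {
    1: "Austin Coffee House",
    2: "Coffee Roasters of Austin",
    3: "Austin Coffee House",
}
string_index = build_index(docs, string_analyzer)
text_index = build_index(docs, text_analyzer)

print(len(string_index), "terms in the string index")
print(len(text_index), "terms in the text index")
print("text match for 'coffee':", sorted(text_index["coffee"]))
print("string match for 'coffee':", sorted(string_index.get("coffee", set())))
```

The string index stays small and suits unique identifiers, while the text index supports free-text search at the cost of more terms. That appears to be the "Uniqueness vs Flexibility" choice the slide refers to.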
Challenges: Speed
● I/O Solution
○ Cassandra's built-in mapping
○ Increase disk access speeds (SSDs)
■ Not cost effective
○ Future: Solaris
Challenges: Space
● Field corruption
○ Caused by improper encoding
○ Exponential growth
○ Fills up Solr index
● Locate, inspect & remove corrupt records
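One common way improper encoding produces exponential growth is a field whose UTF-8 bytes are repeatedly read back as Latin-1 and stored again. The slides do not say this is what happened at Datafiniti, so treat the following as an illustrative sketch: `misencode`, `looks_corrupt`, the size limit, and the marker characters are all assumptions, and the marker check is a crude heuristic that can flag legitimate text.

```python
def misencode(text: str) -> str:
    """Simulate one round of corruption: UTF-8 bytes read back as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def looks_corrupt(value: str, max_bytes: int = 1024) -> bool:
    """Hypothetical check: flag oversized values or typical mojibake markers."""
    too_big = len(value.encode("utf-8")) > max_bytes
    mojibake = "Ã" in value or "Â" in value
    return too_big or mojibake


# Track the stored size of one field across repeated bad round trips.
value = "Café"
sizes = []
for _ in range(5):
    sizes.append(len(value.encode("utf-8")))
    value = misencode(value)
print(sizes)  # [5, 7, 11, 19, 35]

# Locate and remove corrupt records from a small batch.
records = {"r1": "Café", "r2": misencode(misencode("Café"))}
clean = {key: text for key, text in records.items() if not looks_corrupt(text)}
print(sorted(clean))
```

The non-ASCII part of the value doubles on every pass, which is why a handful of corrupt fields can fill an index quickly.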
Challenges: Space
● Solr index issue
○ No compression (vs Cassandra)
○ Must adjust indexing
● Key things to keep in mind
○ Size of fields
○ Scale vs Flexibility
○ Index as little as possible
Challenges: Representation
● Cassandra is flat
● Actual data is not flat
○ Reviews
○ Price information
● Many different output formats
○ CSV, JSON, XML, etc.
Challenges: Representation
● Solution: Flatten when possible
○ E.g. Address object -> Separate fields
● Internal subgroup representation
○ Composite keys (Occasionally)
■ Known subgroups
■ Non multiple subgroups
○ Dynamic fields
■ Composite field + Dynamic tag
■ E.g. review.text_<tag>
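The flattening approach above can be sketched in a few lines. Only the `review.text_<tag>` naming pattern comes from the slide. The `flatten` function, the `address.city` style for nested objects, and the use of a `tag` entry on each repeated item are assumptions made for illustration.

```python
def flatten(record: dict) -> dict:
    """Flatten nested objects and tagged subgroups into flat column names."""
    flat = {}
    for field, value in record.items():
        if isinstance(value, dict):
            # Nested object -> separate fields, e.g. address -> address.city
            for sub, sub_value in value.items():
                flat[f"{field}.{sub}"] = sub_value
        elif isinstance(value, list):
            # Repeated subgroup -> composite field + dynamic tag,
            # e.g. review.text_<tag>
            for item in value:
                tag = item["tag"]
                for sub, sub_value in item.items():
                    if sub != "tag":
                        flat[f"{field}.{sub}_{tag}"] = sub_value
        else:
            flat[field] = value
    return flat


record = {
    "name": "Austin Coffee House",
    "address": {"city": "Austin", "state": "TX"},
    "review": [
        {"tag": "r1", "text": "Great espresso", "rating": 5},
        {"tag": "r2", "text": "Slow service", "rating": 2},
    ],
}
flat = flatten(record)
for column in sorted(flat):
    print(column, "=", flat[column])
```

Every value ends up in its own flat column, which fits a flat column family, while the tag keeps the fields of each review grouped so the nesting can be rebuilt on output.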
Challenges: Representation
● Robust and adaptable conversion package
● JSON -> Internal
○ Solr returns JSON
● Internal -> CSV, JSON, XML
○ User defined views
○ Specify field groupings
○ Specify partitioning
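A conversion layer of this shape can be sketched with the Python standard library. This is not Datafiniti's conversion package: the function names, the idea of a view as a list of field names, and the sample record are hypothetical, and field groupings and partitioning are left out.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def apply_view(records, fields):
    """Keep only the fields a (hypothetical) user-defined view asks for."""
    return [{f: r.get(f, "") for f in fields} for r in records]


def to_json(records):
    return json.dumps(records, sort_keys=True)


def to_csv(records, fields):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records, root_tag="records", row_tag="record"):
    root = ET.Element(root_tag)
    for r in records:
        row = ET.SubElement(root, row_tag)
        for field, value in r.items():
            ET.SubElement(row, field).text = str(value)
    return ET.tostring(root, encoding="unicode")


# JSON -> internal representation (a list of dicts in this sketch).
solr_json = '[{"name": "Austin Coffee House", "city": "Austin", "phone": "555-0100"}]'
internal = json.loads(solr_json)

# Internal -> CSV, JSON, XML through one user-defined view.
view = apply_view(internal, ["name", "city"])
print(to_csv(view, ["name", "city"]))
print(to_json(view))
print(to_xml(view))
```

Keeping a single internal representation means each new output format needs only one writer, rather than one converter per pair of formats.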
Future Work
● Memory Usage
● Speed
● Space
● Containers
Future Work: Memory
● Java 7 G1 (Garbage First) Collector
○ Ideal for large heaps
○ Big Data Sets
○ Bursty Workloads
Future Work: Speed
● Solaris Kernel Scheduler > Linux Kernel Scheduler
○ (At large number of cores)
● Drastically increase IOPS
○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)
○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)
○ Reduce needed size of SSD
■ More smaller SSDs in ZFS pool
○ Fewer moving parts
Future Work: Space
● Caching at PCIe, Storing on SATA III
○ Cheaper larger storage via ZFS pools
○ Easier to grow
● ZFS Compression (LZ4)
○ Replaces Cassandra's Snappy compression
○ Very fast lossless compression (400 MB/s per core)
○ Scales to multiple CPUs
○ Hits the RAM speed limit
Future Work: Containers
● OS Level virtualization
○ Resource control
○ Boundary separation
● More control over Cassandra resources
● Better snapshots (whole machine)
● Hardware abstracted out
○ Many disks represented as single space
○ Easily add or remove hardware
Questions?
https://www.datafiniti.net
http://blog.datafiniti.net
@datafiniti
Addendum 1
ZFS Comparison
Name              Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 (r97)         2.084   410                  1810
LZO 2.06          2.106   409                  600
QuickLZ 1.5.1b6   2.237   373                  420
Snappy 1.1.0      2.091   323                  1070
LZF               2.077   270                  570
zlib 1.2.8 -1     2.730   65                   280
LZ4 HC (r97)      2.720   25                   2040
zlib 1.2.8 -6     3.099   21                   300
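The three columns in this comparison (ratio, compression speed, decompression speed) can be measured for any codec with a small harness. Python's standard library ships neither LZ4 nor Snappy, so this sketch uses `zlib` at levels 1 and 6, which correspond to the two zlib rows. The `benchmark` function and the sample payload are hypothetical, and the numbers it prints will not match the table, since they depend on the data and the machine.

```python
import time
import zlib


def benchmark(data: bytes, level: int):
    """Return (ratio, compress MB/s, decompress MB/s) for one zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    compress_s = max(time.perf_counter() - start, 1e-9)

    start = time.perf_counter()
    unpacked = zlib.decompress(packed)
    decompress_s = max(time.perf_counter() - start, 1e-9)

    assert unpacked == data  # lossless round trip
    mb = len(data) / 1_000_000
    return len(data) / len(packed), mb / compress_s, mb / decompress_s


# Hypothetical sample payload; real results depend heavily on the data.
sample = b"".join(
    b'{"name": "record-%d", "city": "Austin"}\n' % i for i in range(50_000)
)
results = {level: benchmark(sample, level) for level in (1, 6)}
for level, (ratio, c_speed, d_speed) in results.items():
    print(f"zlib -{level}: ratio {ratio:.3f}, "
          f"{c_speed:.0f} MB/s compress, {d_speed:.0f} MB/s decompress")
```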