Datafiniti: The Internet in a Database - Cassandra Use Case

Upload: planet-cassandra

Post on 27-Jun-2015

DESCRIPTION

Austin Cassandra Users Meetup on July 15th, 2013: http://www.meetup.com/Austin-Cassandra-Users/events/125837112/ Datafiniti will present some of the unique and interesting challenges they faced while building out their data search engine, including a detailed use case around their Cassandra data model and other integrated technologies such as Solr.

TRANSCRIPT

Page 1: Datafiniti: The Internet in a Database - Cassandra Use Case

The Internet in a Database

A Cassandra Use Case

Page 2: Datafiniti: The Internet in a Database - Cassandra Use Case

Data on the Web

DATAFINITI • THE SEARCH ENGINE FOR DATA • WWW.DATAFINITI.NET

● 48 billion pages on the Internet

● 56 million GB of data

● Incredibly powerful connections

● 70% of useful data is unstructured

● User generated data + facts

Page 3: Datafiniti: The Internet in a Database - Cassandra Use Case

Too Much Data…

Page 4: Datafiniti: The Internet in a Database - Cassandra Use Case

Search

● Modern search engines

○ Unstructured data

○ Unconnected data

○ Unnormalized data

Page 5: Datafiniti: The Internet in a Database - Cassandra Use Case

● Goals

○ Collect vast amounts of data through web crawling

○ Normalize and deduplicate data

○ Make it searchable and meaningful

Page 6: Datafiniti: The Internet in a Database - Cassandra Use Case

Needs

● Speed

● Scale

● Adaptable

Page 7: Datafiniti: The Internet in a Database - Cassandra Use Case

The Solution

● Very fast

○ Log-structured storage

● Easily scalable

○ Decentralized rings

● Completely adaptable

○ Schema-less key/value store

Page 8: Datafiniti: The Internet in a Database - Cassandra Use Case

…Almost

● Useful searching was missing

○ Secondary indexes not flexible

○ No free text searches

○ No (reasonable) range queries
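
For illustration, the two kinds of query listed above look like this in Solr's standard query syntax. This is only a minimal sketch: the host, the core name `products`, and the field names `description` and `price` are hypothetical and do not come from the talk.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and core name, for illustration only.
SOLR_SELECT_URL = "http://localhost:8983/solr/products/select"


def build_solr_query(text, min_price, max_price, rows=10):
    """Build a Solr select URL combining a free-text match and a range filter."""
    params = {
        # Free-text search against a (hypothetical) tokenized field.
        "q": f"description:({text})",
        # Inclusive range query in Solr's standard syntax: field:[low TO high].
        "fq": f"price:[{min_price} TO {max_price}]",
        "rows": rows,
        "wt": "json",
    }
    return f"{SOLR_SELECT_URL}?{urlencode(params)}"


url = build_solr_query("wireless headphones", 10, 50)
print(url)
```

Neither query shape maps onto a plain key lookup, which is the gap the slide describes.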

Page 9: Datafiniti: The Internet in a Database - Cassandra Use Case

What We Needed

● Pros: Full control over indexing

● Cons: Not scalable

Page 10: Datafiniti: The Internet in a Database - Cassandra Use Case

Putting It All Together

● Reasons to go with DSE

○ Combines Cassandra and Solr

○ Constant refinements and integrations

○ Support

Page 11: Datafiniti: The Internet in a Database - Cassandra Use Case

Our Stack

[Diagram labels: Web Crawling, Normalization, three Cassandra / Solr nodes, Load Balancing]

Page 12: Datafiniti: The Internet in a Database - Cassandra Use Case

Cassandra / Solr Setup

● 3 column families / 3 cores

○ Locations

○ Products

○ People

● 73,114,909 records

Page 13: Datafiniti: The Internet in a Database - Cassandra Use Case

Data: Locations

● 29,818,644 records

● Interesting data

○ Reviews

○ Revenue

○ Contact information

● Businesses vs. Locations

○ Unique key

○ Location specific user data

Page 14: Datafiniti: The Internet in a Database - Cassandra Use Case

Data: Products

● 18,470,005 records

● Interesting data

○ Categories

○ Price

○ Reviews

● Challenges

○ Too many unique keys

Page 15: Datafiniti: The Internet in a Database - Cassandra Use Case

Data: People

● 24,826,260 records

● Interesting data

○ Work History

○ Education History

○ Location

● Challenges

○ Normalization

○ Identification

Page 16: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges

● Memory

● Speed

● Space

● Representation

Page 17: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Memory

● Multi-minute garbage collection

● Exponential increase in frequency

● Virtual memory confusion

● Solr + Cassandra

● Heap Size vs Buffer Cache

● Bash Scripts

Page 18: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Speed

● Upgrade

○ Better memory management

○ Smaller index size

● Reduce index size

● Future: Solaris

Page 19: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Speed

● Providing a real-time service

● Issues

○ Solr not inherently real time

○ Search speeds

○ I/O

Page 20: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Speed

● Solr Solution: DSE integration leverages

○ Cassandra's speed

○ Cassandra's caches

○ Cassandra's distribution

○ Solr caches less useful

Page 21: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Speed

● Search complexity solution

○ Text vs String indexing

○ Uniqueness vs Flexibility

○ Leveraging Cassandra

Page 22: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Speed

● I/O Solution

○ Cassandra's built-in mapping

○ Increase disk access speeds (SSDs)

■ Not cost effective

○ Future: Solaris

Page 23: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Space

● Field corruption

○ Caused by improper encoding

○ Exponential growth

○ Fills up Solr index

● Locate, inspect & remove corrupt records
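
One plausible mechanism behind this kind of growth (an assumption, since the slides do not name the exact cause) is text that is repeatedly decoded with the wrong charset and then re-encoded: every round doubles the bytes used by each non-ASCII character. The sketch below reproduces that effect and flags oversized fields for inspection. The `max_bytes` threshold, the helper names, and the record layout are all hypothetical.

```python
def misencode_once(text):
    """Simulate one round of corruption: UTF-8 bytes misread as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def find_oversized_fields(records, max_bytes=1024):
    """Yield (record key, field name, size) for fields above max_bytes.

    records maps a record key to a dict of field name -> string value.
    max_bytes is an illustrative threshold, not a value from the talk.
    """
    for key, fields in records.items():
        for name, value in fields.items():
            size = len(value.encode("utf-8"))
            if size > max_bytes:
                yield key, name, size


# Each round doubles the encoded size of a non-ASCII character.
value = "é"
sizes = []
for _ in range(5):
    sizes.append(len(value.encode("utf-8")))
    value = misencode_once(value)
print(sizes)
```

A size-based scan like this only locates candidates; each flagged record still needs to be inspected before it is removed.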

Page 24: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Space

● Solr index issue

○ No compression (vs Cassandra)

○ Must adjust indexing

● Key things to keep in mind

○ Size of fields

○ Scale vs Flexibility

○ Index as little as possible

Page 25: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Representation

● Cassandra is flat

● Actual data is not flat

○ Reviews

○ Price information

● Many different output formats

○ CSV, JSON, XML, etc.

Page 26: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Representation

● Solution: Flatten when possible

○ E.g. Address object -> Separate fields

● Internal subgroup representation

○ Composite keys (Occasionally)

■ Known subgroups

■ Non multiple subgroups

○ Dynamic fields

■ Composite field + Dynamic tag

■ E.g. review.text_<tag>
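
The flattening described on this slide can be sketched roughly as follows. This is a minimal illustration, not Datafiniti's implementation: the record layout, the `flatten_record` helper, the dotted names for address fields, and the use of a positional index as the dynamic tag are assumptions. Only the `review.text_<tag>` naming pattern comes from the slide.

```python
def flatten_record(record):
    """Flatten a nested record into a single-level dict of column names.

    Nested objects become dotted fields (address.city); lists of
    subgroups become dynamic fields tagged per item (review.text_0).
    """
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Known, non-repeating subgroup: one field per member.
            for sub_key, sub_value in value.items():
                flat[f"{key}.{sub_key}"] = sub_value
        elif isinstance(value, list):
            # Repeating subgroup: composite field name plus a dynamic tag.
            for tag, item in enumerate(value):
                for sub_key, sub_value in item.items():
                    flat[f"{key}.{sub_key}_{tag}"] = sub_value
        else:
            flat[key] = value
    return flat


record = {
    "name": "Example Cafe",
    "address": {"city": "Austin", "state": "TX"},
    "review": [
        {"text": "Great coffee", "rating": 5},
        {"text": "Slow service", "rating": 2},
    ],
}
flat = flatten_record(record)
print(flat)
```

Every value ends up under a single flat column name, which suits a flat store, while the tag suffix keeps the members of each review grouped together.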

Page 27: Datafiniti: The Internet in a Database - Cassandra Use Case

Challenges: Representation

● Robust and adaptable conversion package

● JSON -> Internal

○ Solr returns JSON

● Internal -> CSV, JSON, XML

○ User defined views

○ Specify field groupings

○ Specify partitioning
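
A rough sketch of the output side of such a conversion layer, using only the Python standard library. The function names, the flat-dict internal form, and the `fields` argument standing in for a user-defined view are assumptions made for illustration.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET


def to_json(records, fields):
    """Serialize the selected fields of each record as a JSON array."""
    return json.dumps([{f: r.get(f) for f in fields} for r in records])


def to_csv(records, fields):
    """Serialize the selected fields as CSV with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


def to_xml(records, fields):
    """Serialize the selected fields as <records><record>...</record></records>."""
    root = ET.Element("records")
    for r in records:
        node = ET.SubElement(root, "record")
        for f in fields:
            # Field names are stored as an attribute so that names such as
            # "review.text_0" never have to be valid XML tag names.
            child = ET.SubElement(node, "field", name=f)
            child.text = "" if r.get(f) is None else str(r.get(f))
    return ET.tostring(root, encoding="unicode")


records = [{"name": "Example Cafe", "city": "Austin", "internal_id": 42}]
view = ["name", "city"]  # a user-defined view: which fields to expose
print(to_json(records, view))
print(to_csv(records, view))
print(to_xml(records, view))
```

Keeping one internal form and a separate writer per format means that adding an output format does not touch the code that reads from Solr.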

Page 28: Datafiniti: The Internet in a Database - Cassandra Use Case

Page 29: Datafiniti: The Internet in a Database - Cassandra Use Case

Future Work

● Memory Usage

● Speed

● Space

● Containers

Page 30: Datafiniti: The Internet in a Database - Cassandra Use Case

Future Work: Memory

● Java 7 G1 (Garbage First) Collector

○ Ideal for large heaps

○ Big Data Sets

○ Bursty Workloads

Page 31: Datafiniti: The Internet in a Database - Cassandra Use Case

Future Work: Speed

● Solaris Kernel Scheduler > Linux Kernel Scheduler

○ (At large number of cores)

● Drastically increase IOPS

○ Cache reads (L2ARC) on PCIe SSD (~800 MB/s)

○ Cache writes (ZIL) on PCIe SSD (~800 MB/s)

○ Reduce needed size of SSD

■ More smaller SSDs in ZFS pool

○ Fewer moving parts

Page 32: Datafiniti: The Internet in a Database - Cassandra Use Case

Future Work: Space

● Caching at PCIe, Storing on SATA III

○ Cheaper larger storage via ZFS pools

○ Easier to grow

● ZFS Compression (LZ4)

○ Replaces Cassandra's Snappy compression

○ Very fast lossless compression (400 MB/s per core)

○ Scales to multiple CPUs

○ Hits the RAM speed limit

Page 33: Datafiniti: The Internet in a Database - Cassandra Use Case

Future Work: Containers

● OS Level virtualization

○ Resource control

○ Boundary separation

● More control over Cassandra resources

● Better snapshots (whole machine)

● Hardware abstracted out

○ Many disks represented as single space

○ Easily add or remove hardware

Page 34: Datafiniti: The Internet in a Database - Cassandra Use Case

Questions?

https://www.datafiniti.net

http://blog.datafiniti.net

@datafiniti

Page 35: Datafiniti: The Internet in a Database - Cassandra Use Case

Addendum 1

ZFS Comparison

Name              Ratio   Compression (MB/s)   Decompression (MB/s)
LZ4 (r97)         2.084   410                  1810
LZO 2.06          2.106   409                  600
QuickLZ 1.5.1b6   2.237   373                  420
Snappy 1.1.0      2.091   323                  1070
LZF               2.077   270                  570
zlib 1.2.8 -1     2.730   65                   280
LZ4 HC (r97)      2.720   25                   2040
zlib 1.2.8 -6     3.099   21                   300
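
For readers who want to see how figures like these are obtained, here is a minimal sketch that measures compression ratio (original size divided by compressed size) and compression throughput for the two zlib levels in the table, using Python's standard `zlib` module. The sample data is hypothetical and highly repetitive, so the numbers it prints will not match the table, which was measured on a different corpus and hardware.

```python
import time
import zlib


def measure(data, level):
    """Return (compression ratio, compression throughput in MB/s)."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    ratio = len(data) / len(compressed)
    throughput = len(data) / 1e6 / elapsed if elapsed > 0 else float("inf")
    return ratio, throughput


# Hypothetical sample payload, for illustration only.
sample = b"name=Example Cafe;city=Austin;review=Great coffee\n" * 20000
for level in (1, 6):
    ratio, mb_per_s = measure(sample, level)
    print(f"zlib -{level}: ratio {ratio:.3f}, {mb_per_s:.0f} MB/s")
```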