terark product and technology

Terark: Make Data Smaller and Access Faster
Sean Fu, Remy Trichard

Upload: xinyuan-fu

Post on 21-Feb-2017

Brief Introduction

Terark built a new storage engine for databases and data systems.

Terark technology enables direct search on highly compressed data, with 10x to 500x faster performance and more than 15x storage savings compared to Google’s LevelDB or Facebook’s RocksDB, resulting in greater scalability at lower cost for big data applications.

Terark Confidential

We Are a YC Company

Y Combinator is the world’s leading start-up incubator (total valuation of portfolio companies > $100 billion). Its best-known companies are Airbnb and Dropbox.

Paying Customers

E-Commerce Giant around the Globe: Terark technology supports its business growth through Alibaba Cloud.

Rank 5 in Global Mobile Phone Market: Terark technology gives Xiaomi a competitive edge in the mobile phone market.

Terark technology helps cloud, big data and Internet companies achieve better performance at lower cost.

Proven Results

Terark Compression

TPC-H Dataset: 805 GB
Terark: 47.5 GB

TCO (on the same data size): 1-Year Hardware & Ops Cost

Others (6 servers): $30,000
Terark (1 server): $5,000

Terark Storage Engine

Strong Compression (> 15:1 compression ratio)

- Lift Data Capacity
- Increase Memory Utilization, Reduce Disk I/O
- Save Data Infrastructure Cost

Extreme Performance (10x to 500x the QPS of competitors)

- Lower Latency, Higher Throughput and Concurrency

Rich Features

- Flexible Data Types
- Native Regular Expression Query
- Works with most databases (MySQL, MongoDB, SSDB...)

Performance Report: http://www.terark.com/en/blog/detail/2

Core Technology

● Data as index, index as data: Terark uses automata data structures

● Searchable compression: Terark technology enables direct search on highly compressed data

Our breakthrough technology is protected by 6 patents in the US, China and worldwide.

Thanks

Sean Fu

Mobile & WeChat: (+86) 13911734987
E: [email protected]

Appendix 1: TCO & ROI Details

Hardware cost: 1 server ~ $5,000 a year (based on AWS pricing). Operational cost: ~20% of the hardware cost.

Product | Hardware Cost | Operational Cost
Terark | $5,000 | $1,000
Other Product | $30,000 | $6,000

Appendix 1: TCO & ROI Details (continued)

Year(s) | Cost Savings | Estimated Revenue Lift due to Performance/Scalability Improvement (~20% of Cost Savings)
1 | $30,000 | $6,000
3 | $90,000 | $18,000
5 | $150,000 | $30,000
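
The figures in the two appendix tables follow from three inputs stated on the slides: roughly $5,000 per server per year, operations at about 20% of hardware cost, and a revenue lift of about 20% of the savings. A minimal sketch of that arithmetic is shown here; the constant and function names are illustrative and do not come from the deck.

```python
# Inputs taken from the appendix tables; all names are illustrative.
SERVER_COST_PER_YEAR = 5_000   # USD per server per year
OPS_PERCENT = 20               # operations cost as a % of hardware cost
REV_LIFT_PERCENT = 20          # estimated revenue lift as a % of cost savings


def yearly_cost(servers: int) -> int:
    """Hardware plus operations cost for one year, in USD."""
    hardware = servers * SERVER_COST_PER_YEAR
    return hardware + hardware * OPS_PERCENT // 100


def cost_savings(years: int, terark_servers: int = 1, other_servers: int = 6) -> int:
    """Savings from running 1 Terark server instead of 6 other servers."""
    return years * (yearly_cost(other_servers) - yearly_cost(terark_servers))


def revenue_lift(years: int) -> int:
    """Estimated revenue lift, taken as a fixed share of the cost savings."""
    return cost_savings(years) * REV_LIFT_PERCENT // 100


for years in (1, 3, 5):
    print(years, cost_savings(years), revenue_lift(years))
```

With these inputs the one-year totals are $36,000 for six servers and $6,000 for one, which gives the $30,000 yearly saving listed in the table.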

Appendix 2: Core Technology Detail

• Indexing and index compression

• Value compression

• Succinct data structures

Indexing and Index Compression

Feature | Hash | B+Tree | Terark Nest Succinct Trie
Compression | None | OK | ✔✔✔ Excellent
Searching | ✔✔ Very Fast | OK | ✔ Fast
Exact Searching | ✔ Supported | ✔ Supported | ✔ Supported
Range Searching | Not Supported | ✔ Supported | ✔ Supported
Prefix Searching | Not Supported | ✔ Supported | ✔ Supported
Regex Searching | Not Supported | Not Supported | ✔ Supported
Reverse Searching (id to key) | Not Supported (can be worked around) | Not Supported | ✔ Supported

Dynamic Indexing: Terark Threaded Red Black Tree

Keys can be separated from nodes

Key data can be stored in a separate array and accessed by array index.

For example, this can be used as a DFA’s state transition table.

Use array indexes instead of pointers

Feature | RBTree | B+Tree | Terark RBTree
Memory usage | 4 ptr | ~ 0.75 keylen | 64 bits
Searching | Fast | Very Fast | Very Fast
Data Coupling | Tight | Tight | Loose
Reverse Searching (id to key) | Not Supported | Not Supported | Supported
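
To illustrate the two ideas on this slide (keys kept in their own array, and array indexes used as links instead of pointers), here is a minimal sketch. It is a plain unbalanced binary search tree: the red-black rebalancing and threading of Terark's structure are omitted, and the class and method names are hypothetical.

```python
from array import array

NIL = -1  # index value that plays the role of a null pointer


class IndexTree:
    """Binary search tree whose links are array indexes, not pointers."""

    def __init__(self):
        self.keys = []            # key data lives apart from the link arrays
        self.left = array("i")    # left[i] is the index of node i's left child
        self.right = array("i")   # right[i] is the index of node i's right child
        self.root = NIL

    def insert(self, key):
        """Insert key if absent and return its node id (its array index)."""
        parent, node = NIL, self.root
        while node != NIL:
            if key == self.keys[node]:
                return node
            parent = node
            node = self.left[node] if key < self.keys[node] else self.right[node]
        new_id = len(self.keys)
        self.keys.append(key)
        self.left.append(NIL)
        self.right.append(NIL)
        if parent == NIL:
            self.root = new_id
        elif key < self.keys[parent]:
            self.left[parent] = new_id
        else:
            self.right[parent] = new_id
        return new_id

    def find(self, key):
        """Return the node id for key, or NIL if the key is not present."""
        node = self.root
        while node != NIL:
            if key == self.keys[node]:
                return node
            node = self.left[node] if key < self.keys[node] else self.right[node]
        return NIL

    def key_of(self, node_id):
        """Reverse search (id to key): a single array lookup."""
        return self.keys[node_id]
```

Because a node id is just an array index, the id-to-key lookup needs no extra structure, which is the "loose coupling" the table refers to.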

Data (Value) Compression

Feature | Block-based: LevelDB, RocksDB, WiredTiger… | Short data: Terark Nest Succinct Trie | Long data: Terark Global Compression
Compression ratio | OK | ✔✔✔ Excellent | ✔✔✔ Excellent
Random Read | Slow | ✔ Fast | ✔ Fast
Sequential Read | ✔ Fast | OK | ✔ Fast
Double Cache Problem | YES | NO | NO
Compression Speed | ✔ Fast | Slow | Slow

Index Compression: Succinct Tree

A succinct data structure represents data in space close to the theoretical limit. It uses bitmaps to represent the data and rank/select operations to look data up.

It can tremendously reduce memory usage, but it is very complex to implement. Terark has its own implementations, which achieve much better performance than open-source implementations.

Both encodings use 2 bits per node.

Level-order: LOUDS

101110010000

Simple, fast and small:

Parent(c) = rank0(select1(c))
Child(p, i) = select0(p) - p + i

Pre-order: DFUDS

10110100100

Needs findopen, findclose and enclose, which are much slower than rank/select; rarely used.
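
The two LOUDS formulas can be checked with a small sketch. The conventions below are assumptions, since the slide does not state them: nodes are numbered in level order starting from 0 at the root, each node contributes one `1` per child followed by a `0`, there is no super-root prefix, bit positions are 1-indexed, and `select(…, 0)` returns 0. The rank and select functions here are naive linear scans; a real succinct implementation would answer them in constant time with small auxiliary indexes. The example tree is illustrative and is not the one from the slide.

```python
def build_louds(children):
    """children[i] lists the child ids of node i, with ids in level order."""
    return "".join("1" * len(kids) + "0" for kids in children)


def rank(bits, bit, pos):
    """Count occurrences of `bit` among the first `pos` positions."""
    return bits[:pos].count(bit)


def select(bits, bit, k):
    """1-indexed position of the k-th occurrence of `bit`; 0 when k == 0."""
    if k == 0:
        return 0
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == bit:
            seen += 1
            if seen == k:
                return pos
    raise ValueError("fewer than k occurrences")


def parent(bits, c):
    """Parent(c) = rank0(select1(c)), valid for c >= 1."""
    return rank(bits, "0", select(bits, "1", c))


def child(bits, p, i):
    """Child(p, i) = select0(p) - p + i, with i counted from 1."""
    return select(bits, "0", p) - p + i


def degree(bits, p):
    """Number of children of p: the 1s between the p-th and (p+1)-th 0."""
    return select(bits, "0", p + 1) - select(bits, "0", p) - 1


# Example tree: 0 -> (1, 2, 3), 1 -> (4, 5), 3 -> (6)
tree = [[1, 2, 3], [4, 5], [], [6], [], [], []]
bits = build_louds(tree)
```

For this seven-node tree the encoding takes 13 bits, and every parent/child relation is recovered from the bit string alone.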

Patricia Trie + Nested

Patricia Trie: a compressed trie

Path compression: compress all one-child nodes in a path into a single node

Nested: convert the compressed paths into a new trie

Requirement: the trie needs to support “reverse searching”, meaning extracting the string from a node
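
Path compression can be sketched with nested dictionaries. This is an illustration of the Patricia step only; the nesting step (storing the compressed path labels in a second trie) is not shown, and the function names and the empty-string end marker are choices made for this example.

```python
END = ""  # end-of-key marker; an empty label cannot collide with a character


def build_trie(keys):
    """Plain character trie: one edge per character."""
    root = {}
    for key in keys:
        node = root
        for ch in key:
            node = node.setdefault(ch, {})
        node[END] = {}
    return root


def compress_paths(node):
    """Merge each chain of one-child nodes into a single labelled edge."""
    compressed = {}
    for label, child in node.items():
        # Absorb the chain below this edge while each node has exactly one
        # outgoing edge and that edge is not the end-of-key marker.
        while len(child) == 1 and END not in child:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress_paths(child)
    return compressed


def contains(node, key):
    """Exact-match lookup in the compressed trie."""
    if key == "":
        return END in node
    for label, child in node.items():
        if label != END and key.startswith(label):
            return contains(child, key[len(label):])
    return False


patricia = compress_paths(build_trie(["romane", "romanus", "romulus"]))
```

The three keys share the prefix `rom`, so the compressed trie has a single `rom` edge at the root that branches into `an` and `ulus`.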

Data (Value) Compression

• Global Compression

• Global + Local Dictionary

• Short data friendly (~50 bytes)

• Larger dataset, better compression

• Seekable access (via record id)

• Similar to LZ77
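
The combination of a shared dictionary with per-record seekable access can be illustrated with Python's standard `zlib` module and its preset-dictionary (`zdict`) support. This is a stand-in technique, not Terark's algorithm: each record is compressed on its own against one dictionary shared by all records, so any record can be decompressed by id without touching its neighbours. The sample records, the dictionary contents and the helper names are made up for the example.

```python
import zlib


def compress_records(records, shared_dict):
    """Compress every record independently against one shared dictionary."""
    blobs = []
    for rec in records:
        comp = zlib.compressobj(zdict=shared_dict)
        blobs.append(comp.compress(rec) + comp.flush())
    return blobs


def read_record(blobs, record_id, shared_dict):
    """Seekable access: decompress only the requested record."""
    decomp = zlib.decompressobj(zdict=shared_dict)
    return decomp.decompress(blobs[record_id]) + decomp.flush()


# Short records that share most of their structure.
records = [
    b'{"user": "alice", "city": "Beijing", "plan": "free"}',
    b'{"user": "bob", "city": "Shanghai", "plan": "pro"}',
    b'{"user": "carol", "city": "Beijing", "plan": "pro"}',
]

# Dictionary of substrings expected to recur across the whole dataset.
shared_dict = b'{"user": "", "city": "Beijing", "plan": "free"}'

blobs = compress_records(records, shared_dict)
second = read_record(blobs, 1, shared_dict)
```

Short records compress poorly in isolation because each one has little internal repetition; a dictionary built from the whole dataset gives the compressor material to reference, which is the intuition behind "larger dataset, better compression".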

TerarkDB

TerarkDB uses schema-based tables: each table can define the data types of multiple columns, indexes and features. TerarkDB can be integrated into databases such as MongoDB, MySQL and SSDB.

[Figure: TerarkDB segments. Labels: New Data, Writing Segment, Frozen Writable Segment, Read-Only Segment, Writable, Frozen]

TerocksDB: Compatible with RocksDB

TerocksDB uses Terark’s indexing and compression algorithms to implement RocksDB’s SSTable.

• Much better compression

• Much better random read performance

• Trades compression speed for a higher compression ratio and better read performance

• Uses universal compaction to minimize write amplification
