Terark Product and Technology
Terark has built a new storage engine for databases and data systems.
Terark technology enables direct search on highly compressed data, with 10X~500X faster performance and more than 15X storage savings compared to Google's LevelDB or Facebook's RocksDB, resulting in greater scalability at lower cost for big data applications.
Brief Introduction
Terark Confidential
Y Combinator is the world's leading start-up incubator (total valuation of portfolio companies > $100 billion). Its best-known companies are Airbnb and Dropbox.
We Are a YC Company
Paying Customers
E-Commerce Giant around the Globe
Terark technology supports its business growth through Alibaba Cloud.
Rank 5 in Global Mobile Phone Market
Terark technology gives Xiaomi a competitive edge in the mobile phone market.
Terark technology helps cloud, big data and Internet companies achieve better performance at lower cost.
Proven Results
Terark Compression
TPC-H Dataset: 805G
Terark: 47.5G
TCO (on the same data size), 1-Year Hardware & Ops Cost
Terark (1 server): $5,000
Others (6 servers): $30,000
Strong Compression ( > 15:1 compression ratio)
- Increase Data Capacity
- Increase Memory Utilization, Lower Disk I/O
- Save Data Infrastructure Cost
Extreme Performance (10~500X QPS of Competitors)
- Lower Latency, Higher Throughput and Concurrency
Rich Features
- Flexible Data Types
- Native Regular Expression Query
- Works with most databases (MySQL, MongoDB, SSDB...)
Performance Report: http://www.terark.com/en/blog/detail/2
Terark Storage Engine
Core Technology
● Data as index, index as data: Terark uses automata data structures
● Searchable compression: Terark technology enables direct search on highly compressed data
Our breakthrough technology is protected by 6 patents in the US, China and worldwide.
Appendix 1: TCO & ROI Details
Product | Hardware Cost (1 server ~ $5,000 a year, based on AWS) | Operational Cost (~20% of the hardware cost)
Terark | $5,000 | $1,000
Other Product | $30,000 | $6,000
Appendix 1: TCO & ROI Details
Year(s) | Cost Savings | Estimated Rev Lift due to Performance/Scalability Improvement (~20% of Cost Savings)
1 | $30,000 | $6,000
3 | $90,000 | $18,000
5 | $150,000 | $30,000
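The figures in the two Appendix 1 tables can be reproduced with a few lines of arithmetic. The sketch below assumes the server counts from the TCO slide (1 server for Terark, 6 for others); the function and constant names are illustrative and do not come from Terark.

```python
# Inputs taken from the Appendix 1 tables; names are illustrative only.
HARDWARE_PER_SERVER = 5_000   # USD per server per year
OPS_RATE = 0.20               # operational cost as a share of hardware cost
REV_LIFT_RATE = 0.20          # estimated revenue lift as a share of cost savings

def yearly_cost(servers: int) -> float:
    """Hardware plus operational cost for one year."""
    hardware = servers * HARDWARE_PER_SERVER
    return hardware + hardware * OPS_RATE

def savings(years: int, terark_servers: int = 1, other_servers: int = 6) -> float:
    """Cost saved over `years` by running 1 server instead of 6."""
    return years * (yearly_cost(other_servers) - yearly_cost(terark_servers))

for years in (1, 3, 5):
    saved = savings(years)
    print(years, round(saved), round(saved * REV_LIFT_RATE))
```

With these inputs the yearly saving is ($30,000 + $6,000) - ($5,000 + $1,000) = $30,000, which matches the Cost Savings column.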
• Indexing and index compression
• Value compression
• Succinct data structures
Appendix 2: Core Technology Detail
Feature | Hash | B+Tree | Terark Nest Succinct Trie
Compression | None | OK | ✔✔✔ Excellent
Searching | ✔✔ Very Fast | OK | ✔ Fast
Exact Searching | ✔ Supported | ✔ Supported | ✔ Supported
Range Searching | Not Supported | ✔ Supported | ✔ Supported
Prefix Searching | Not Supported | ✔ Supported | ✔ Supported
Regex Searching | Not Supported | Not Supported | ✔ Supported
Reverse Searching (id to key) | Not Supported (workaround possible) | Not Supported | ✔ Supported
Indexing and Index Compression
Keys can be separated from nodes
Key data can be stored in a separate array and accessed by array index.
For example, it can be used as a DFA's state transition table.
Array indexes are used instead of pointers
Property | RBTree | B+Tree | Terark RBTree
Memory usage | 4 ptr | ~0.75 keylen | 64 bits
Searching | Fast | Very Fast | Very Fast
Data Coupling | Tight | Tight | Loose
Reverse Searching (id to key) | Not Supported | Not Supported | Supported
Dynamic Indexing: Terark Threaded Red Black Tree
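The two ideas on this slide, keys stored apart from nodes and array indexes in place of pointers, can be illustrated with a small sketch. This is not Terark's threaded red-black tree: it is a plain unbalanced binary search tree with no rebalancing and no threading, and the class and method names are hypothetical. It only shows how index-based links make "id to key" lookup a simple array access.

```python
class IndexedBST:
    """Unbalanced binary search tree stored in parallel arrays.

    Node ids are array indexes, so children are referenced by integer
    index instead of pointer, and keys live in their own array.
    """
    NIL = -1

    def __init__(self):
        self.keys = []    # keys[id] -> key, kept apart from the link arrays
        self.left = []    # left[id] -> id of left child, or NIL
        self.right = []   # right[id] -> id of right child, or NIL
        self.root = self.NIL

    def insert(self, key):
        """Insert key and return its id (the existing id if already present)."""
        parent, node = self.NIL, self.root
        while node != self.NIL:
            if key == self.keys[node]:
                return node
            parent = node
            node = self.left[node] if key < self.keys[node] else self.right[node]
        new_id = len(self.keys)
        self.keys.append(key)
        self.left.append(self.NIL)
        self.right.append(self.NIL)
        if parent == self.NIL:
            self.root = new_id
        elif key < self.keys[parent]:
            self.left[parent] = new_id
        else:
            self.right[parent] = new_id
        return new_id

    def find(self, key):
        """Return the id of key, or NIL if it is absent."""
        node = self.root
        while node != self.NIL:
            if key == self.keys[node]:
                return node
            node = self.left[node] if key < self.keys[node] else self.right[node]
        return self.NIL

    def key_of(self, node_id):
        """Reverse search: id to key is a plain array lookup."""
        return self.keys[node_id]


tree = IndexedBST()
ids = [tree.insert(k) for k in ["m", "c", "x", "a"]]
print(ids, tree.find("x"), tree.key_of(3))
```

Because links are integers rather than machine pointers, they can be narrower than a pointer and the whole structure can be stored or memory-mapped without pointer fix-ups.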
Property | Block-based: leveldb, rocksdb, wiredtiger… | Short data: Terark Nest Succinct Trie | Long data: Terark Global Compression
Compression ratio | OK | ✔✔✔ Excellent | ✔✔✔ Excellent
Random Read | Slow | ✔ Fast | ✔ Fast
Sequential Read | ✔ Fast | OK | ✔ Fast
Double Cache Problem | YES | NO | NO
Compression Speed | ✔ Fast | Slow | Slow
Data (Value) Compression
2 bits for a node
DFUDS (Pre-Order): 10110100100
Needs findopen, findclose, enclose, which are much slower than rank/select; rarely used
LOUDS (Level-Order): 101110010000
Simple, fast and small:
Parent(c) = rank0(select1(c))
Child(p, i) = select0(p) - p + i
A succinct data structure represents data in space close to the theoretical limit. It encodes the data as bitmaps and uses rank and select operations to navigate them.
This can greatly reduce memory usage, but it is very complex to implement. Terark has its own implementations, which achieve much better performance than open-source implementations.
Index Compression: Succinct Tree
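The LOUDS formulas above can be checked on a small tree. The sketch below is an illustration under stated assumptions, not Terark's implementation: positions and node ids are 1-based, the bit string starts with a "10" super-root, rank and select are naive linear scans (real implementations use precomputed indexes), and the example tree is hypothetical rather than the one on the slide.

```python
from collections import deque

def louds_encode(children, root):
    """Return (bits, level_order) for a tree given as {node: [children]}."""
    bits = ["10"]                  # super-root with a single child: the root
    order = []                     # order[k - 1] is the node with id k
    queue = deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        kids = children.get(node, [])
        bits.append("1" * len(kids) + "0")   # degree in unary, then a 0
        queue.extend(kids)
    return "".join(bits), order

def select(bits, bit, k):
    """1-based position of the k-th occurrence of `bit` (naive scan)."""
    seen = 0
    for pos, b in enumerate(bits, start=1):
        if b == bit:
            seen += 1
            if seen == k:
                return pos
    raise IndexError("not enough bits")

def rank(bits, bit, pos):
    """Number of occurrences of `bit` in positions 1..pos (naive count)."""
    return bits[:pos].count(bit)

def parent(bits, c):
    # Parent(c) = rank0(select1(c)); 0 means the super-root, so c is the root.
    return rank(bits, "0", select(bits, "1", c))

def degree(bits, p):
    # Node p's descriptor sits between the p-th and (p+1)-th zero.
    return select(bits, "0", p + 1) - select(bits, "0", p) - 1

def child(bits, p, i):
    # Child(p, i) = select0(p) - p + i, valid for 1 <= i <= degree(p).
    if not 1 <= i <= degree(bits, p):
        raise IndexError("no such child")
    return select(bits, "0", p) - p + i

# Hypothetical tree: a has children b, c, d; c has child e.
bits, order = louds_encode({"a": ["b", "c", "d"], "c": ["e"]}, "a")
print(bits)                                       # 10111001000
print([parent(bits, c) for c in range(1, 6)])     # [0, 1, 1, 1, 3]
print(child(bits, 1, 3), child(bits, 3, 1))       # 4 5
```

Only the bit string is needed to navigate the tree; the `order` list is returned just to map ids back to the example's node names.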
Patricia Trie: A Compressed Trie
Path compression: Compress all one-child nodes in a path into a single node
Nested: Convert the compressed path into a new Trie
Requirements: The Trie needs to support "reverse searching", meaning extracting the string from a node
Patricia Trie + Nested
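Path compression can be sketched with nested dictionaries. This is a minimal illustration, not Terark's data structure: the `END` marker and all function names are assumptions, and the nesting step is only hinted at by `edge_labels`, which collects the multi-character labels that a second (nested) trie would store.

```python
END = "$"  # hypothetical end-of-key marker; assumes keys never contain "$"

def build_trie(keys):
    """Plain trie: one dict level per character."""
    root = {}
    for key in keys:
        node = root
        for ch in key + END:
            node = node.setdefault(ch, {})
    return root

def compress(node):
    """Merge every chain of one-child nodes into a single labelled edge."""
    out = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        out[label] = compress(child)
    return out

def contains(node, key):
    """Exact-match lookup in the compressed trie."""
    rest = key + END
    while rest:
        for label, child in node.items():
            if rest.startswith(label):
                rest, node = rest[len(label):], child
                break
        else:
            return False
    return True

def edge_labels(node):
    """Yield multi-character edge labels: the strings a nested trie would hold."""
    for label, child in node.items():
        if len(label) > 1:
            yield label
        yield from edge_labels(child)

trie = compress(build_trie(["romane", "romanus", "romulus"]))
print(trie)
print(contains(trie, "romanus"), contains(trie, "roman"))
print(list(edge_labels(trie)))
```

In a nested design each label would be replaced by the id of a node in the second trie, which is why that trie must support extracting a string from a node id.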
• Global Compression
• Global + Local Dictionary
• Short data friendly (~50 bytes)
• Larger dataset, better compression
• Seekable access (via record id)
• Similar to LZ77
Data (Value) Compression
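The combination of a shared dictionary, short records and access by record id can be approximated with the preset-dictionary feature of Python's standard `zlib` module. This is a stand-in for illustration only and is not Terark's global compression algorithm; the dictionary-building function is deliberately naive and all names are hypothetical.

```python
import zlib

def build_dictionary(samples, max_size=32 * 1024):
    """Naive shared dictionary: concatenated samples, cut to zlib's 32 KiB window."""
    return b"".join(samples)[-max_size:]

def compress_records(records, zdict):
    """Compress each record on its own so it can be read back independently."""
    packed = []
    for rec in records:
        comp = zlib.compressobj(level=9, zdict=zdict)
        packed.append(comp.compress(rec) + comp.flush())
    return packed

def read_record(packed, record_id, zdict):
    """Seekable access: decompress one record without touching the others."""
    decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(packed[record_id]) + decomp.flush()

# Hypothetical short records of roughly 50 bytes each.
records = [
    f'{{"user_id": {1000 + i}, "city": "Beijing", "status": "active"}}'.encode()
    for i in range(100)
]
zdict = build_dictionary(records[:20])
packed = compress_records(records, zdict)

print(read_record(packed, 42, zdict))
print(sum(len(r) for r in records), sum(len(p) for p in packed))
```

Compressing each record separately keeps random reads cheap, and the shared dictionary supplies the context that a short record cannot provide for itself.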
TerarkDB uses schema-based tables: each table can define the data types of multiple columns, indexes and features. TerarkDB can be integrated into databases like MongoDB, MySQL and SSDB.
TerarkDB
[Diagram labels: New Data, Writable, Frozen, Writing Segment, Frozen Writable Segment, Read-Only Segment]
TerocksDB uses Terark's indexing and compression algorithms to implement RocksDB's SSTable.
• Much better compression
• Much better random read performance
• Terark trades off compression speed for high compression ratio and performance
• Use universal compaction to minimize write amplification
TerocksDB: Compatible with RocksDB