Lightweight Indexing of Observational Data in Log-Structured Storage




1. Lightweight Indexing of Observational Data in Log-Structured Storage
   National University of Singapore (Sheng Wang, Beng Chin Ooi)
   Portland State University (David Maier)
   VLDB 2014

2. Outline
   - Background
   - Challenges
   - Contributions
   - CR-Index
   - Index Optimization
   - Experiments

3. Background
   - Huge amounts of data are generated by sensors every day
   - Data are expanding in precision and quantity
   - High write throughput
   - Efficient query

4. Challenges
   - State-of-the-art storage does not take high write throughput into account (CMOP: RDBMS + netCDF)
   - Record-level indexes incur significant index maintenance cost
     - B+-tree: random I/O due to updates
     - LSM-tree: large number of index entries

5. Contributions
   - A schema for storing observational data in LogBase [1] to facilitate indexing
   - A novel, lightweight index structure called CR-Index for range queries, which takes full advantage of observational-data traits
   - Experiments on two real-world observational datasets

   [1] H. T. Vo, S. Wang, D. Agrawal, G. Chen, B. C. Ooi. Int'l Conference on Very Large Data Bases (VLDB), PVLDB 5(10), 2012.
6. Storage
   - High write throughput
   - Traits of observational data
     - no updates
     - continuous change
     - potential discontinuities
   - Log-structured storage
     - append new data to the end of a file
     - avoid random I/O

7. LogBase
   - An unordered, column-oriented, distributed log-store
   - Each node is responsible for one or more partitions of a table
   - Version control and transaction semantics
   - Relational data model
     - each record has a primary key and several attributes
     - each record is decomposed into a set of cells (KEY, ATTRIBUTE, VALUE, TIMESTAMP)

8. Architecture

9. Logical View & Physical View

10. Basic query formats
   - Time range
   - Value range

11. Observational data locality
   - In general, an append-only strategy hurts read performance, but the log-store provides considerable data locality
   - Time-ordered property
     - a time range query becomes a sequential scan
   - Value-correlated property
     - due to the continuity trait, once a record is inside a value range, surrounding records will likely lie in the range too

12. Continuous Range Index (CR-Index)
   - The value-correlated property implies that a seek in the log can potentially yield many results
   - We need not locate qualifying records individually, as long as we identify the regions containing results
   - Group successive records into blocks, each treated as an atomic unit
   - Each block is summarized by a value range using a boundary pair
   - A range query can be transformed into an intersection-checking problem

13. CR-Index structure

14. CR-Index structure

15. Index structures
   - Interval tree: O(n)
   - Stabbing query: O(log n)

16. Solution
   - Partition the result set into two disjoint groups
   - Group A: CR-records that have at least one endpoint inside the query range [a,b]
     - B+-tree
   - Group B: CR-records that completely contain the query range
     - Interval tree

17. Solution
   - For each CR-record, two entries are inserted into the B+-tree, one for each endpoint; the endpoint is the key and the CR-record's reference is the value
   - For each CR-record, its value range is inserted into the interval tree

18. Example

19. Example

20. Example

21. Query example
   - query1: [3,11]
   - result1: 2,3,4,5,3,2,4 2,3,4,5
   - query2: [16,18]
   - result2:

22. Query example
   - query: [16,18]
   - result: 6

23. Optimization
   - Index with delta intervals
   - Boundary pairs in consecutive blocks may overlap; if a query intersects a block, it will probably intersect the following blocks

24. Example
   - query: [3,5]
   - result: 2, 3

25. Length-k delta interval

26. Evaluate range query
   - The value condition can be used at both the interval-index level and the CR-Log level
   - The time condition can be used at the CR-Log via the checkpoint list
   - Hole information can be updated

27. Query steps
   1. Access the interval index to get CR-record ids: Group A from the B+-tree and Group B from the interval tree.
   2. Locate each identified record in the CR-Log. Scan the log for additional CR-records if using delta intervals.
   3. Filter CR-records using the checkpoint list and hole information.
   4. Fetch and scan the data blocks for the remaining CR-records. Extract and return all qualifying results.
   5. For any detected false-positive blocks, track the holes and update the hole information in the CR-records.

28. Experiments
   - Compare with B+-tree and LSM-tree
   - Data sets
     - CMOP Coastal Margin Data (13 million): salinity, temperature, oxygen
     - Real-time Soccer Game Data (25 million): sensor ID, position, speed, velocity, acceleration

29. Experiment Environment
   - A cluster where each machine has a quad-core processor, 8GB, 500GB
   - Java
   - block length: 64
   - delta-interval: 1
   - range query

30. System load time
   - Write time in load data
   - CR-Index: 8%
   - LSM: 45%-77%
   - B+: 78%-124%

31. Index update time
   - 15% LSM-tree
   - 9% B+-tree

32. Index space consumption
   - disk space
   - 10%-12% LSM-tree
   - 4%-6% B+-tree

33. Query response time

34. Index look-up cost

35. Data access cost

36. Over QQQ
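The block-summary and intersection-checking idea from slides 12, 16 and 17 can be illustrated with a small Python sketch. It is not the authors' implementation: the names `CRRecord`, `build_cr_records`, `build_endpoint_index`, `range_query` and `fetch_results` are hypothetical, a sorted list searched with `bisect` stands in for the B+-tree, and a linear scan stands in for the interval tree. Delta intervals, checkpoint lists and hole information are left out.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class CRRecord:
    """Summary of one block: its id and its boundary pair (low, high)."""
    block_id: int
    low: float
    high: float


def build_cr_records(values, block_len=64):
    """Group successive records into blocks and summarize each by (min, max)."""
    records = []
    for start in range(0, len(values), block_len):
        block = values[start:start + block_len]
        records.append(CRRecord(start // block_len, min(block), max(block)))
    return records


def build_endpoint_index(records):
    """Two entries per CR-record, keyed by endpoint (stand-in for the B+-tree)."""
    entries = [(r.low, r.block_id) for r in records]
    entries += [(r.high, r.block_id) for r in records]
    return sorted(entries)


def range_query(records, endpoint_index, a, b):
    """Return ids of blocks whose boundary pair intersects [a, b]."""
    # Group A: at least one endpoint lies inside [a, b].
    lo = bisect_left(endpoint_index, (a, -1))
    hi = bisect_right(endpoint_index, (b, float("inf")))
    group_a = {block_id for _, block_id in endpoint_index[lo:hi]}
    # Group B: the block's range completely contains [a, b]
    # (linear scan here; the slides use an interval tree for this part).
    group_b = {r.block_id for r in records if r.low < a and r.high > b}
    return sorted(group_a | group_b)


def fetch_results(values, block_ids, a, b, block_len=64):
    """Scan only the candidate blocks and keep the values inside [a, b]."""
    results = []
    for block_id in block_ids:
        start = block_id * block_len
        results.extend(v for v in values[start:start + block_len] if a <= v <= b)
    return results


# Toy log of 16 readings, grouped into blocks of 4.
readings = [1, 2, 3, 4, 10, 12, 11, 13, 5, 6, 7, 8, 20, 21, 22, 23]
cr_records = build_cr_records(readings, block_len=4)
endpoints = build_endpoint_index(cr_records)
candidates = range_query(cr_records, endpoints, 3, 11)
matches = fetch_results(readings, candidates, 3, 11, block_len=4)
```

Because a block is summarized only by its boundary pair, a candidate block may turn out to hold no qualifying value. That is why the final scan is still needed, and it is the false-positive case that query step 5 on slide 27 addresses with hole information.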