Lessons learned with laser scanning point cloud management in Hadoop HBase
Prof. Debra Laefer
Center for Urban Science + Progress
New York University
June 2018
Laser scanning data
2015 Dublin point cloud
• Spatial coverage: > 2 km²
• Number of points: > 1.4 billion
• Size on disk: 30 GB in LAS format
• Precision: 3 cm
• Density: 300 points/m² (horizontal)
Open-access: https://geo.nyu.edu/catalog/nyu_2451_38684
HBase – a distributed database
Apache HBase
• Enables random access to data in the Hadoop Distributed File System
• Open-source implementation of Google’s Bigtable
• Is the database behind many Facebook services
• HBase is: distributed, non-relational (aka NoSQL), key-value based, column oriented
HBase’s underlying data structure
(a) Low-level data storage structure in HBase:
SortedMap<RowKey, List<SortedMap<Column, List<Value, Timestamp>>>>
(b) A high-level view of HBase data structure:
(Table, RowKey, Family, Column, Timestamp) → Value
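The nested-map view of HBase storage can be mimicked in a few lines of Python. This is a minimal sketch, not HBase client code: the `put`, `get`, and `scan_rows` helpers, the sample row keys, and the column names are all hypothetical, and plain dictionaries sorted on read stand in for HBase's sorted maps.

```python
from collections import defaultdict

# row key -> column family -> column qualifier -> {timestamp: value}
# (illustrative stand-in for HBase's nested sorted maps)
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))


def put(row, family, column, value, timestamp):
    """Store one cell version under (row, family, column, timestamp)."""
    table[row][family][column][timestamp] = value


def get(row, family, column):
    """Return the newest version of a cell, or None if it does not exist."""
    versions = table.get(row, {}).get(family, {}).get(column, {})
    if not versions:
        return None
    return versions[max(versions)]


def scan_rows():
    """Return row keys in sorted (lexicographic) order."""
    return sorted(table)


# Hypothetical sample data: two versions of one cell, plus a second row.
put(b"row-002", "p", "z", 12.5, timestamp=1)
put(b"row-002", "p", "z", 12.7, timestamp=2)
put(b"row-001", "p", "intensity", 87, timestamp=1)

latest_z = get(b"row-002", "p", "z")  # the newest timestamp wins
row_order = scan_rows()               # rows come back sorted by key
```

The point of the sketch is that a value is addressed by the full coordinate (row key, family, column, timestamp), and that rows are always visited in key order, which is why the choice of row key matters for spatial data.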
Data models for point cloud management in HBase
Expectations:
• Scalability (distributed)
• Flexibility (schema-less)
• Performance (due to parallelism)
4 data models:
• 2 row-key arrangements: Dual Hilbert code and Single Hilbert code
• 2 column structures: Grouped Attributes and Separate Attributes
Figure: 4 data models
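Both row-key arrangements rely on Hilbert codes, which map 2D grid cells to a 1D index while keeping neighbouring cells close together in key order. The sketch below shows one common way to compute such a code and pack it into a byte string that sorts in the same order. It is illustrative only: the function names, the grid `order`, and the 8-byte big-endian key layout are assumptions, and it does not reproduce the Dual or Single Hilbert layouts from the talk.

```python
def hilbert_index(order, x, y):
    """Map grid cell (x, y) to its position along a Hilbert curve.

    The grid has 2**order cells per side; x and y must lie in [0, 2**order).
    """
    n = 1 << order
    d = 0
    s = n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the next level is traversed consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s >>= 1
    return d


def row_key(order, x, y):
    """Encode the Hilbert index as fixed-width big-endian bytes.

    Fixed width and big-endian order make byte-wise (lexicographic)
    sorting agree with numeric sorting of the index.
    """
    return hilbert_index(order, x, y).to_bytes(8, "big")


# Enumerate an 8 x 8 grid and index every cell by its Hilbert code.
order = 3
cells = [(x, y) for x in range(1 << order) for y in range(1 << order)]
by_index = {hilbert_index(order, x, y): (x, y) for x, y in cells}
```

Sorting the cells by `row_key` walks the grid along the curve, so cells that are consecutive in key order are always direct neighbours in space.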
Data ingestion
Figure: Data ingestion workflow
Performance evaluation – Point queries
Point queries:
• Model 3 is slowest; the remaining models are comparable.
• More than 5 times faster than pgPointCloud
• All data models are scalable
• Difference between hot and cold queries is obvious
Figures: Hot point query response times; Cold point query response times (data size: 90M, 365M, 1420M)
Performance evaluation – Range queries
Range queries:
• Model 4 outperforms all other models
• Model 3 is slowest
• Difference with pgPointCloud is less obvious
• All data models are scalable
Figures: Hot range query response times; Cold range query response times (data size: 90M, 365M, 1420M)
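In a store sorted by row key, looking up a single key and scanning a contiguous key interval are the two basic access patterns on which point and range queries can be built. A minimal in-memory sketch of both follows. The `SortedStore` class and its methods are hypothetical stand-ins rather than the HBase client API, integer keys stand in for Hilbert-coded row keys, and the way the talk's queries map onto these operations is an assumption.

```python
import bisect


class SortedStore:
    """Toy key-value store that keeps its keys in sorted order."""

    def __init__(self, rows):
        self._rows = dict(rows)
        self._keys = sorted(self._rows)

    def get(self, key):
        """Point lookup: return the value for one exact key, or None."""
        return self._rows.get(key)

    def scan(self, start, stop):
        """Range scan over start <= key < stop, returned in key order.

        Start-inclusive and stop-exclusive bounds mirror the usual
        HBase scan convention.
        """
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, stop)
        return [(k, self._rows[k]) for k in self._keys[lo:hi]]


# Hypothetical sample rows keyed by integer codes.
store = SortedStore({5: "e", 1: "a", 9: "i", 3: "c"})
point = store.get(3)       # single-key lookup
window = store.scan(2, 9)  # every row with 2 <= key < 9
```

A spatial window would typically translate into several such key intervals, one scan each, so the number of intervals a row-key design produces is one plausible driver of range-query cost.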
Concluding remarks
• 4 data models were investigated for storing, indexing, and querying point clouds in a distributed, non-relational database.
• All HBase data models were scalable, including the flat, one-point-per-row models, which previously hit the scalability wall in relational implementations.
• Separation of point attributes to take advantage of the schemaless feature of HBase introduced some overheads to both data consumption and querying costs.
• Model 4, which resembles Oracle’s SDO_PC and PostgreSQL’s PCPATCH, appears to be the most performant data model. However, Model 4 does not fully utilize HBase’s advantageous features.
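The overhead noted for separated attributes can be pictured by counting cells. In HBase every stored cell carries its own row key, column family, and column qualifier, so splitting one point across several columns repeats that bookkeeping for each attribute. The sketch below is a rough illustration only: the attribute names, the `struct` packing formats, and the byte count (which ignores timestamps and other per-cell fields) are assumptions, not the encoding used in the four models.

```python
import struct

# Assumed attributes and big-endian packing formats, for illustration only.
FORMATS = {"x": ">d", "y": ">d", "z": ">d", "intensity": ">H"}


def grouped_cells(row_key, point):
    """One cell per point: all attributes packed into a single value."""
    value = struct.pack(
        ">dddH", point["x"], point["y"], point["z"], point["intensity"]
    )
    return [(row_key, b"p", b"all", value)]


def separate_cells(row_key, point):
    """One cell per attribute: each attribute lives in its own column."""
    return [
        (row_key, b"p", name.encode(), struct.pack(fmt, point[name]))
        for name, fmt in FORMATS.items()
    ]


def stored_bytes(cells):
    """Sum key and value bytes over cells (timestamps etc. are ignored)."""
    return sum(len(r) + len(f) + len(q) + len(v) for r, f, q, v in cells)


# Hypothetical point and 8-byte row key.
key = (42).to_bytes(8, "big")
pt = {"x": 315000.12, "y": 234000.34, "z": 12.5, "intensity": 87}
grouped = grouped_cells(key, pt)
separate = separate_cells(key, pt)
```

With these assumptions the separate layout stores four cells where the grouped layout stores one, and the repeated row key and qualifiers make it larger per point, while still allowing a single attribute to be read or added without touching the rest.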