The Latest in Spatial & Temporal Search: Presented by David Smiley
The Latest in Spatial & Temporal Search David Smiley
Agenda
Spatial
• Polygons and Accuracy: SerializedDVStrategy
• FlexPrefixTree
• BBoxSpatialStrategy
• Student/Intern contributions, Geodesics
Temporal
• Dates, and Date Ranges
  • Search
  • Faceting
About David Smiley
• Freelance search consultant / developer
  • Expert Lucene/Solr development skills, advice (consulting), training
  • Java (full-stack), Web, Spatial
• Apache Lucene / Solr committer & PMC, Eclipse Locationtech PMC
• Authored 1st book on Solr, plus two editions
• Presented at several conferences & meetups
• Taught several Solr classes, self-developed & LucidWorks
Lucene Spatial Overview
• Multiple approaches to index spatial data: abstract class SpatialStrategy (5+ concrete implementations)
  • RecursivePrefixTreeStrategy (RPT) is most prominent, versatile
    • Grid based
• Uses Spatial4j lib for shapes, distance calculations, and WKT
  • Uses JTS Topology Suite lib for polygons
[Diagram labels: Shape; SpatialPrefixTree / Cell; PrefixTreeStrategy; IntersectsPrefixTreeFilter, Contains…, Within…; Geohash | Quad]
SpatialPrefixTrees and Accuracy
RecursivePrefixTree (RPT) uses Lucene’s index as a PrefixTree
• Thus represents shapes as grid cells of varying precision by prefix
Example, a point shape:
• D, DR, DRT, DRT2, DRT2Y
Example, a polygon shape:
• Too many to list… 508 cells
More details here: http://opensourceconnections.com/blog/2014/04/11/indexing-polygons-in-lucene-with-accuracy/
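To make the prefix idea concrete, here is a minimal sketch of standard geohash encoding in plain Java. It is not Lucene's GeohashPrefixTree; the class and method names are hypothetical. It only illustrates that a point indexed to N levels yields N terms, each a prefix of the next.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a plain geohash encoder, not Lucene's GeohashPrefixTree.
final class GeohashPrefixes {
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    // Encodes a lat/lon pair (degrees) as a geohash of the given length.
    static String encode(double lat, double lon, int length) {
        double minLat = -90, maxLat = 90, minLon = -180, maxLon = 180;
        StringBuilder hash = new StringBuilder(length);
        boolean lonBit = true; // bits alternate, starting with longitude
        int bits = 0, value = 0;
        while (hash.length() < length) {
            if (lonBit) {
                double mid = (minLon + maxLon) / 2;
                if (lon >= mid) { value = (value << 1) | 1; minLon = mid; }
                else            { value = value << 1;       maxLon = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { value = (value << 1) | 1; minLat = mid; }
                else            { value = value << 1;       maxLat = mid; }
            }
            lonBit = !lonBit;
            if (++bits == 5) { // every 5 bits become one base-32 character
                hash.append(BASE32.charAt(value));
                bits = 0;
                value = 0;
            }
        }
        return hash.toString();
    }

    // One term per level: each cell is a prefix of the next, finer cell.
    static List<String> cellsForPoint(double lat, double lon, int levels) {
        String full = encode(lat, lon, levels);
        List<String> cells = new ArrayList<>();
        for (int i = 1; i <= levels; i++) {
            cells.add(full.substring(0, i));
        }
        return cells;
    }

    public static void main(String[] args) {
        // A point near Boston indexed to 5 levels gives 5 terms.
        System.out.println(cellsForPoint(42.36, -71.06, 5));
    }
}
```

A polygon has no such single chain of prefixes: it needs every cell that covers it at each level, which is why the count climbs into the hundreds.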
…continued
• For more accuracy, index more levels (longer prefixes)
  • Points: linear relationship of levels to number of cells :)
  • Non-points: exponential relationship… :(
• RPT applies a distErrPct shape size ratio to non-point shapes to trade accuracy for scalability
  • distErrPct=0.025 (2.5% of the radius, the default):
    • Massachusetts: level 6
    • USA: level 4 (not as precise)
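The way distErrPct picks a level can be sketched roughly as follows. This is a simplified model, not Lucene's code: it assumes a hypothetical grid whose cell width halves at every level, and takes the shape's radius in degrees as given. The level numbers it prints are therefore not comparable to the geohash levels quoted above; the point is only that a bigger shape tolerates a bigger error and so stops at a coarser level.

```java
// Illustrative only: a simplified halving grid, not Lucene's SpatialPrefixTree.
final class DistErrPctSketch {
    static final double WORLD_WIDTH_DEG = 360.0;

    // Allowed error: a fraction of the shape's radius (both in degrees).
    static double distErr(double shapeRadiusDeg, double distErrPct) {
        return shapeRadiusDeg * distErrPct;
    }

    // Coarsest level whose cell width is no larger than the allowed error,
    // assuming cell width halves at every level (quad-tree style).
    static int levelForDistance(double distErrDeg, int maxLevels) {
        double cellWidth = WORLD_WIDTH_DEG;
        for (int level = 1; level <= maxLevels; level++) {
            cellWidth /= 2;
            if (cellWidth <= distErrDeg) {
                return level;
            }
        }
        return maxLevels;
    }

    public static void main(String[] args) {
        double pct = 0.025; // the default named on the slide
        // Hypothetical shape radii in degrees, small to large.
        for (double radius : new double[] {1.0, 10.0, 25.0}) {
            int level = levelForDistance(distErr(radius, pct), 30);
            System.out.println("radius " + radius + " deg: level " + level);
        }
    }
}
```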
SerializedDVStrategy (Lucene 4.7)
• Stores serialized geometry into Lucene BinaryDocValues
  • It’s as accurate as the underlying geometry coordinates/shape
  • But it’s not a spatial index; it’s retrievable on a per-document basis
• Use RPT + SerializedDV for speed and accuracy!
• More to come eventually:
  • Solr adapter: SOLR-5728, ElasticSearch adapter #2361
  • Speed: Skip the serialized geometry check for non-edge cells: LUCENE-5579
Sample Code
SpatialArgs args = new SpatialArgs(SpatialOperation.Intersects, point);
// RPT: fast, grid-based candidate match
treeStrategy = new RecursivePrefixTreeStrategy(grid, "geometry");
// SerializedDV: exact check against the stored geometry
verifyStrategy = new SerializedDVStrategy(ctx, "serialized_geometry");
Query treeQuery = new ConstantScoreQuery(treeStrategy.makeFilter(args));
// Run the tree query first, then verify only the documents it matched
Query combinedQuery = new FilteredQuery(
    treeQuery, verifyStrategy.makeFilter(args),
    FilteredQuery.QUERY_FIRST_FILTER_STRATEGY);
Code is from a related presentation by the Climate Corporation presented at FOSS4G 2014
FlexPrefixTree (Coming to Lucene 5)
• A new SpatialPrefixTree by Varun Shenoy (GSOC 2014)!
  • LUCENE-4922; still needs to be committed. Goal is for 5.0.
• More optimized, more flexible, than Geohash & Quad
  • Configurable sub-cells at each level: 4, 16, 64, 256
    • You choose trade-off between index speed/disk size & search speed
  • Internally uses an integer coordinate system
    • Rectangle searches are particularly fast; minimal floating-point conversion
  • Cells are always squares (equal sides); better for heatmaps
• YMMV: 10% - 100% faster than GeohashPrefixTree
BBoxSpatialStrategy (Lucene 4.10)
• Rectangles (BBox’s) only, one value per field
• Wide predicate support
  • Equals, Intersects, Within, Contains, Disjoint
• Accurate (8-byte double floating point)
• Area overlap relevancy
  • Weight search results by a combination of query shape overlap & index shape overlap ratios
• Solr BBoxField…
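The area overlap relevancy idea can be sketched in a few lines. This is a simplified illustration, not Lucene's implementation: the Rect class, the queryWeight parameter, and the linear blend are all assumptions, and rectangles are assumed not to cross the dateline.

```java
// Illustrative only: a simplified overlap score, not Lucene's implementation.
final class OverlapRatioSketch {
    // Axis-aligned rectangle; assumes it does not cross the dateline.
    static final class Rect {
        final double minX, maxX, minY, maxY;

        Rect(double minX, double maxX, double minY, double maxY) {
            this.minX = minX;
            this.maxX = maxX;
            this.minY = minY;
            this.maxY = maxY;
        }

        double area() {
            return (maxX - minX) * (maxY - minY);
        }
    }

    static double intersectionArea(Rect a, Rect b) {
        double w = Math.min(a.maxX, b.maxX) - Math.max(a.minX, b.minX);
        double h = Math.min(a.maxY, b.maxY) - Math.max(a.minY, b.minY);
        return (w <= 0 || h <= 0) ? 0 : w * h;
    }

    // Blends "how much of the query is covered" with "how much of the
    // indexed box is covered". queryWeight in [0, 1] is a hypothetical knob.
    static double score(Rect query, Rect indexed, double queryWeight) {
        double overlap = intersectionArea(query, indexed);
        if (overlap == 0) {
            return 0;
        }
        double queryRatio = overlap / query.area();
        double indexedRatio = overlap / indexed.area();
        return queryWeight * queryRatio + (1 - queryWeight) * indexedRatio;
    }

    public static void main(String[] args) {
        Rect query = new Rect(0, 10, 0, 10);
        Rect shifted = new Rect(5, 15, 0, 10); // half overlaps the query
        Rect inside = new Rect(0, 5, 0, 5);    // fully inside the query
        System.out.println(score(query, shifted, 0.5));
        System.out.println(score(query, inside, 0.5));
    }
}
```

With an even weight, a small box fully inside the query outscores an equal-sized box that only half overlaps it, which is the kind of ordering an overlap-based score is meant to give.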
Solr BBoxField
• Schema configuration
  <field name="bbox" type="bbox" />
  <fieldType name="bbox" class="solr.BBoxField"
             geo="true" units="degrees" numberType="_bbox_coord" />
  <fieldType name="_bbox_coord" class="solr.TrieDoubleField"
             precisionStep="8" docValues="true" stored="false"/>
• Search with overlap ratio ordering
  &q={!field f=bbox score=overlapRatio}Intersects(ENVELOPE(-10, 20, 15, 10))
  • score can be: overlapRatio, area, area2D
Recent Student/Intern Contributions
• Varun Shenoy via GSOC: summer 2014
  • Lucene spatial: new “FlexPrefixTree”, an optimized grid
• Rebecca Alford via F.B. Open-Academy: winter 2014
  • Spatial4j: geodesic polygons
• Chris Pavlicek via F.B. Open-Academy: winter 2014
  • Spatial4j: geodesic buffered lines
• Evana Gizzi, MITRE intern: winter 2014
  • Spatial4j: geodesic circle polygonizer
• Liviy Ambrose, MITRE intern: fall 2013
  • Lucene spatial: integrated with Lucene’s benchmark module
Temporal/Date Durations or basically any numeric ranges
Approach: Simple Two-field
(as you might do in SQL or any system without native range types)
• A start-time & end-time field pair
• A search window (time span) becomes two range queries
  • details vary by predicate (Intersects, Contains, vs. Within)
• Single-valued only
  • …even though Lucene supports multi-valued fields
  • Theoretically possible but would be a lot of work
    • because Lucene doesn’t store “position” info for numeric fields
    • because numeric range/prefix queries are position-less
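How the predicates differ can be sketched as plain comparisons. This is an illustration under stated assumptions: closed ranges, times as long values, and hypothetical field names "start" and "end". Each predicate is one condition on the start field AND one on the end field, which is where the two range queries come from.

```java
// Illustrative only: the two-field predicates as plain comparisons.
final class TwoFieldRangeSketch {
    // Indexed range [start, end] overlaps the query window [qStart, qEnd].
    static boolean intersects(long start, long end, long qStart, long qEnd) {
        return start <= qEnd && end >= qStart;
    }

    // Indexed range fully covers the query window.
    static boolean contains(long start, long end, long qStart, long qEnd) {
        return start <= qStart && end >= qEnd;
    }

    // Indexed range lies entirely inside the query window.
    static boolean isWithin(long start, long end, long qStart, long qEnd) {
        return start >= qStart && end <= qEnd;
    }

    // The Intersects case written as two range clauses.
    // "start" and "end" are hypothetical field names.
    static String intersectsQuery(long qStart, long qEnd) {
        return "start:[* TO " + qEnd + "] AND end:[" + qStart + " TO *]";
    }

    public static void main(String[] args) {
        System.out.println(intersects(10, 20, 15, 30));
        System.out.println(contains(10, 20, 12, 18));
        System.out.println(isWithin(12, 18, 10, 20));
        System.out.println(intersectsQuery(15, 30));
    }
}
```

With multi-valued fields these checks break down, because nothing ties a given start value to its own end value.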
Approach: 2D Spatial PrefixTree
• Lucene Spatial QuadPrefixTree (2D) with RPT Strategy
• Use ‘x’ for start-time, ‘y’ for end-time
• A search window (time span) becomes a rectangle query
  • details vary by predicate (Intersects, Contains, vs. Within)
• Cool…
  • But floating-point edge issues
  • Only ~50 levels supported; not 64
Details: http://wiki.apache.org/solr/SpatialForTimeDurations
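The mapping from a time window to a rectangle can be sketched like this. It is an illustration only: the class, the array layout {minX, maxX, minY, maxY}, and the min/max world bounds are assumptions, and times are plain long values rather than the floating-point coordinates the real grid uses.

```java
// Illustrative only: a duration as a point (x = start, y = end),
// and each predicate as a rectangle in that space.
final class RangeAsPointSketch {
    // Rectangles are {minX, maxX, minY, maxY}; min and max bound the field's values.

    // start <= qEnd and end >= qStart
    static long[] intersectsRect(long qStart, long qEnd, long min, long max) {
        return new long[] {min, qEnd, qStart, max};
    }

    // start <= qStart and end >= qEnd (indexed duration covers the window)
    static long[] containsRect(long qStart, long qEnd, long min, long max) {
        return new long[] {min, qStart, qEnd, max};
    }

    // start >= qStart and end <= qEnd (indexed duration inside the window)
    static long[] withinRect(long qStart, long qEnd, long min, long max) {
        return new long[] {qStart, max, min, qEnd};
    }

    static boolean pointInRect(long x, long y, long[] r) {
        return x >= r[0] && x <= r[1] && y >= r[2] && y <= r[3];
    }

    public static void main(String[] args) {
        long[] rect = intersectsRect(15, 30, 0, 100);
        // Duration [10, 20] is the point (10, 20); it overlaps [15, 30].
        System.out.println(pointInRect(10, 20, rect));
        // Duration [10, 14] ends before the window starts.
        System.out.println(pointInRect(10, 14, rect));
    }
}
```

Because each duration is one point, a document can hold several of them without the start/end pairing problem of the two-field approach.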
Approach: DateRangePrefixTree (Lucene 5)
• A new 1D SpatialPrefixTree: NumberRangePrefixTree
  • NumberRangePrefixTree w/ DateRangePrefixTree subclass
  • NR-SPT: Configurable sub-cells per level; no level limit
  • Not just for ranges; instances too
• Index/Search with NumberRangePrefixTreeStrategy
  • Indexing, and search predicate code (e.g. Intersects…) completely re-used
• DateRangePrefixTree
  • 9 Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis
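The nine levels can be pictured as a chain of ever-longer prefixes for a single date. The sketch below is an illustration only: the slash-separated term format is hypothetical and is not how DateRangePrefixTree encodes its terms, and it assumes a non-negative year. It shows why a date instance needs at most 9 terms, and fewer when truncated to a coarser precision.

```java
import java.time.LocalDateTime;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: hypothetical term format, not DateRangePrefixTree's encoding.
final class DatePrefixTermsSketch {
    // Levels: 1M years, 1K years, years, months, days, hours, minutes, seconds, millis.
    static List<String> terms(LocalDateTime t, int levels) {
        if (levels < 1 || levels > 9) {
            throw new IllegalArgumentException("levels must be 1..9");
        }
        int year = t.getYear(); // assumes a non-negative year
        int[] parts = {
            year / 1_000_000,        // which million years
            (year / 1_000) % 1_000,  // which thousand years within that
            year % 1_000,            // which year within that
            t.getMonthValue(),
            t.getDayOfMonth(),
            t.getHour(),
            t.getMinute(),
            t.getSecond(),
            t.getNano() / 1_000_000  // milliseconds
        };
        List<String> out = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (int i = 0; i < levels; i++) {
            if (i > 0) {
                prefix.append('/');
            }
            prefix.append(parts[i]);
            out.add(prefix.toString()); // one term per level, each extending the last
        }
        return out;
    }

    public static void main(String[] args) {
        LocalDateTime t = LocalDateTime.of(2014, 5, 21, 12, 0, 0);
        System.out.println(terms(t, 9)); // full millisecond precision: 9 terms
        System.out.println(terms(t, 4)); // truncated to the month: 4 terms
    }
}
```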
…continued…
Trade-offs of N/D-SPT
• Indexing:
  • “Common” date-ranges use ~ <50 terms, but random millisecond ranges use up to ~14K terms
  • All date instances (not a range) <= 9 terms
  • Comparison to 2D SPT: instance or range, always 50
• Search:
  • Query for “common” query ranges faster than uncommon
  • Comparison to 2D SPT:
    • Contains & Within predicates: overlapping values per document get coalesced, can’t be differentiated
Solr DateRangeField
• Configuration in schema.xml:
  <field name="dateRange" type="dateRange" />
  <fieldType name="dateRange" class="solr.DateRangeField" />
• Index field data, examples:
  • 2014-05-21T12:00:00.000Z (same as TrieDate)
  • 2014-05-21T12 (truncated to desired precision)
  • [1990 TO 1995]
• Query, examples:
  • fq=dateRange:[* TO 2014-05-21]
  • fq={!field f=dateRange op=Contains} [2000 TO 2014-05-21]
Visualizing Date Facets
• http://bl.ocks.org/mbostock/4063318
Date Faceting
• Option A: facet.range
  • Not for indexed date-ranges
  • Internally executes one query for each value & caches large bitset
• Option B: facet.interval (Solr 4.10)
  • Not for indexed date-ranges
  • Requires DocValues (more index data)
  • Supports variable/custom intervals
• New work-in-progress option: Facet on DateRangeField
  • Ranges are fixed/pre-determined (months, days, etc.)
  • Optimized for thousands of ranges to count
    • Each value-range is only 1 term!
Future stuff I’m excited about
• Continuing works in-progress
• Spatial heatmaps! Coming in January 2015!
  • Lucene layer & Solr adapter
• Lucene term auto-prefixing LUCENE-5879
  • Brings spatial, date, numeric, indexing/search to the next level!
• More prefix-tree optimizations
  • Inner vs edge leaf cell differentiation for non-point shapes
  • RPT + SerializedDVStrategy; skip accuracy checks for inner cells
  • Don’t index leaf cells twice
That’s all for now; thanks for coming!
Need Lucene/Solr guidance or custom development? Contact me!
Email: [email protected]
LinkedIn: http://www.linkedin.com/in/davidwsmiley
G+: +DavidSmiley
Twitter: @DavidWSmiley
ETA: December 2014