H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smiley LLC

OCTOBER 11-14, 2016 • BOSTON, MA

Upload: lucidworks

Post on 16-Apr-2017

TRANSCRIPT

Page 1: H-Hypermap - Heatmap Analytics at Scale: Presented by David Smiley, D W Smiley LLC

OCTOBER 11-14, 2016 • BOSTON, MA

Page 2

H-Hypermap: Heatmap Analytics at Scale
David Smiley
Freelance Search Developer/Consultant

Page 3

About: David Smiley
• Software Engineer (16 years)
• Search (7 years)
• Java (full-stack), Web, Spatial
• Freelance search consultant / developer
• Apache Lucene / Solr committer & PMC
• Wrote first book on Solr, updated twice

Page 4

Agenda
• About this project
• Architecture
• Solr & time sharding
• Experiences with:
  – Kotlin, Dropwizard, Swagger
  – Kafka
  – Docker, Kontena
• Solr for geo-enrichment
• Solr adapter for Lucene BKD Lat-Lon point search & sort
• Heatmaps
  – Existing functionality (with demo)
  – New functionality

Page 5

H-Hypermap / BOP
• Harvard University, CGA: Center for Geographic Analysis, http://gis.harvard.edu
• Harvard Hypermap Project
  – Managed by Ben Lewis
• BOP: "Billion Object Platform"
  – Funded by the Sloan Foundation

Page 6

BOP Requirements Summary

Provide a proof-of-concept platform designed to lower the barrier for researchers who need to access big streaming spatio-temporal datasets.

• Most recent ~billion geo-tweets
• Realtime search (<5 sec latency)
• Sub-second queries
  – Including heatmaps!
• On the cheap: ~6 mediocre boxes

Page 7

Logical High-Level Architecture

[Diagram labels: Harvesting, Enrichment, Archival, Realtime, "BOP", various clients]

• Data flows via Apache Kafka
• Systems expose HTTP web services

Page 8

[Diagram labels: The BOP Kafka Topic, Ingester, ZooKeeper, Web-Service; shards W51, W52, W53, W54, ..., RT]

Ingester: Kafka Streams
• Create Solr doc
• Routes to shard

Web-Service: REST/JSON API
• Keyword search
• Faceting
• Heatmaps
• CSV export

Page 9

BOP Solr Sharding Architecture

[Diagram: two primary collections, G_North_America and G_Elsewhere, each with time shards T2016_05_20, T2016_05_06, T2016_04_22, T2016_04_08, … covering 4-5 mo.; plus a lone Realtime collection/shard holding 1-25 hrs of data. Copy then delete, at night.]

• Realtime shard is where realtime search happens. No caches, but small.
• Primary collections have useful caches
• Housekeeping tasks:
  – Move data from RT to primary
  – Create new shards; expire old
  – Merge/optimize shards
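The time-shard naming above can be sketched as a small routing function. This is a minimal illustration, not the project's ingester: it assumes each shard covers a fixed 14-day window (inferred from the spacing of the shard names on the slide) and uses a hypothetical alignment date, `ANCHOR`. The function names are made up for the example.

```python
from datetime import date, timedelta
from typing import List

# Assumption: shards cover fixed 14-day windows, inferred from the shard names
# on the slide (T2016_04_08, T2016_04_22, ...). ANCHOR is a hypothetical
# alignment date; the slides do not specify one.
ANCHOR = date(2016, 4, 8)
SHARD_DAYS = 14


def shard_start(day: date) -> date:
    """Return the first day of the 14-day window containing `day`."""
    # Floor division also gives the right window for dates before ANCHOR.
    periods = (day - ANCHOR).days // SHARD_DAYS
    return ANCHOR + timedelta(days=periods * SHARD_DAYS)


def shard_name(day: date) -> str:
    """Format a shard name in the style shown on the slide, e.g. T2016_05_20."""
    return shard_start(day).strftime("T%Y_%m_%d")


def shards_for_range(first: date, last: date) -> List[str]:
    """List every time shard a query over [first, last] would need to touch."""
    names = []
    current = shard_start(first)
    while current <= last:
        names.append(shard_name(current))
        current += timedelta(days=SHARD_DAYS)
    return names


if __name__ == "__main__":
    print(shard_name(date(2016, 5, 25)))
    print(shards_for_range(date(2016, 4, 20), date(2016, 5, 7)))
```

Naming shards by time window lets a query with a date filter skip shards outside the range, and lets housekeeping expire the oldest shard as a unit.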

Page 10

Building a Search Web-Service
• Kotlin language (JVM based)
  – Nullity as first-class language feature
• Dropwizard framework
  – Designed for web-services
• Swagger
  – Dynamically generated dev UI for web-services

Page 11

Apache Kafka
• Kafka: a scalable message/queue platform
• See new Kafka Streams & Kafka Connect APIs
• No back-pressure; can be a challenge
• Non-obvious use:
  – For storage; time partitioning
• Lots of benefits yet serious limitations

Page 12

Docker
• Easy to find/try/use software
  – No installation
  – Simplified configuration (env variables)
  – Common logging
  – Isolated
• Ideal for:
  – Continuous integration servers
  – Trying new software
  – Production advantages
• But "new"

Page 13

Docker in Production
• I use "Kontena"
• Common logging, machine/proc stats, security
  – VPN to secure network; access everything as local
• No longer need to care about:
  – Ansible, Chef, Puppet, etc.
  – Security at network or proxy; not service specific
• Challenges: state & big-data

Page 14

Enrichment

[Diagram labels: Kafka Topic, Enrich, Kafka Topic; Twitter Sentiment Classifier; Geo: Solr with regional polygons & metadata]

Geo: Query Solr via spatial point query; attach related metadata to tweet

Page 15

Solr for Geo Enrichment
• Tweets (docs) can have a geo lat/lon
• Enrich tweet with Country, State/Province, …
  – Gazetteer lookup (point-in-polygon)

Data Set                      Features   Raw size   Index time   Index size
Admin2                          46,311     824 MB      510 min       892 MB
US States                       74,002     747 MB      4.9 min       840 MB
Massachusetts Census Blocks    154,621     152 MB      5.9 min       507 MB
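To make the gazetteer lookup concrete, here is a toy point-in-polygon enrichment in plain Python. It is only a sketch of the idea: it uses a brute-force ray-casting test rather than Solr's spatial index, and the `GAZETTEER` regions, the `lon`/`lat`/`region` field names, and the function names are all hypothetical.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def point_in_polygon(point: Point, ring: List[Point]) -> bool:
    """Ray-casting test: toggle `inside` for each edge crossed by a ray going east."""
    x, y = point
    inside = False
    for i in range(len(ring)):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % len(ring)]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical gazetteer: crude squares, NOT real administrative boundaries.
GAZETTEER: Dict[str, List[Point]] = {
    "Region A": [(-10.0, 0.0), (0.0, 0.0), (0.0, 10.0), (-10.0, 10.0)],
    "Region B": [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)],
}


def enrich(tweet: dict) -> dict:
    """Return a copy of the tweet with a `region` field if some polygon contains it."""
    enriched = dict(tweet)
    point = (tweet["lon"], tweet["lat"])
    for name, ring in GAZETTEER.items():
        if point_in_polygon(point, ring):
            enriched["region"] = name
            break
    return enriched


if __name__ == "__main__":
    print(enrich({"text": "hello", "lon": 5.0, "lat": 5.0}))
```

A linear scan like this does not scale to tens of thousands of detailed polygons, which is why the talk delegates the lookup to an indexed spatial field in Solr.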

Page 16

Fast Point-in-Polygon Tricks

Index/Config
• Optimize to 1 segment
• RptWithGeometrySpatialField
  – precisionModel="floating_single"
  – autoIndex="true"
• <cache name="perSegSpatialFieldCache_WKT" …

Search
• Embed Solr (in-process)
• Use docValues, not stored
  – fl=block:field(GEOID10)

Query like this:
• q={!field cache=false f=WKT}Intersects(POINT($lon $lat))

Sub-Millisecond!
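The query above can be assembled programmatically. The sketch below only builds the request parameters from the slide (`q` and `fl`); the function name is hypothetical, and the URL-encoding step is shown for an HTTP client even though the talk embeds Solr in-process.

```python
from urllib.parse import urlencode


def point_in_polygon_params(lon: float, lat: float) -> dict:
    """Build Solr request parameters for the point query shown on the slide."""
    return {
        # WKT puts x (longitude) before y (latitude).
        "q": "{!field cache=false f=WKT}Intersects(POINT(%s %s))" % (lon, lat),
        # Return the GEOID10 docValues field under the alias "block".
        "fl": "block:field(GEOID10)",
    }


if __name__ == "__main__":
    params = point_in_polygon_params(-71.06, 42.36)
    print(params["q"])
    print(urlencode(params))  # query string for a /select request over HTTP
```

Note the argument order: a common mistake with WKT points is passing latitude first.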

Page 17

Lucene "LatLonPoint"
• Uses new PointValues (BKD index) in Lucene 6
• Fastest: http://home.apache.org/~mikemccand/geobench.html
• Presently in Lucene sandbox module
• Some limitations: WGS84 points only
• Credit to Rob Muir and Mike McCandless

Page 18

Solr Adapter For LatLonPoint
• New Solr FieldType for Lucene LatLonPoint
  – Filter points by circle, rect, polygon
  – Distance sort; but no boosting

Coming soon! Solr 6.4?

Page 19

Heatmaps: Spatial Grid Faceting
• Spatial density summary grid faceting, also useful for point-plotting search results
• Lucene & Solr APIs
• Scalable & fast usually…
• Usually rendered with a gradient radius (pictured on the slide)
• See: http://spacemansteve.github.io/leaflet-solr-heatmap/example/index.html

Page 20

How-to: Heatmaps

• On an RPT field:
    geo="false"
    worldBounds="ENVELOPE(-180, 180, 180, -180)"
    prefixTree="packedQuad"

• Query:
    /select?facet=true
      &facet.heatmap=geo_rpt
      &facet.heatmap.geom=["-180 -90" TO "180 90"]
      &facet.heatmap.format=ints2D or png

    // Normal Solr response...
    "facet_counts":{
      ... // facet response fields
      "facet_heatmaps":{
        "geo_rpt":[
          "gridLevel",2,
          "columns",32,
          "rows",32,
          "minX",-180.0,
          "maxX",180.0,
          "minY",-90.0,
          "maxY",90.0,
          "counts_ints2D", [null, null, [0, 1, ... ]]
          ...
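A client could build this request and unpack the response roughly as follows. This is a hedged sketch: the function names are hypothetical, the sample response is invented for illustration, and the code assumes the `ints2D` rows are ordered top-down with columns left-to-right and that all-zero rows arrive as `null`.

```python
import json
from urllib.parse import urlencode


def heatmap_query(field, min_x, min_y, max_x, max_y):
    """Build the query string for a heatmap facet over a bounding box."""
    return urlencode({
        "q": "*:*",
        "rows": 0,
        "wt": "json",
        "facet": "true",
        "facet.heatmap": field,
        "facet.heatmap.geom": '["%s %s" TO "%s %s"]' % (min_x, min_y, max_x, max_y),
        "facet.heatmap.format": "ints2D",
    })


def heatmap_cells(flat):
    """Yield (center_x, center_y, count) for every non-zero cell.

    `flat` is the alternating name/value list returned for one field.
    Assumes rows run top-down and columns left-to-right, and that an
    all-zero row is returned as null (None in Python).
    """
    hm = dict(zip(flat[::2], flat[1::2]))
    grid = hm["counts_ints2D"]
    if grid is None:  # no matching spatial data at all
        return
    cell_w = (hm["maxX"] - hm["minX"]) / hm["columns"]
    cell_h = (hm["maxY"] - hm["minY"]) / hm["rows"]
    for r, row in enumerate(grid):
        if row is None:
            continue
        for c, count in enumerate(row):
            if count:
                yield (hm["minX"] + (c + 0.5) * cell_w,
                       hm["maxY"] - (r + 0.5) * cell_h,
                       count)


if __name__ == "__main__":
    # Invented 2x2 sample in the same shape as the response on the slide.
    sample = json.loads(
        '["gridLevel",1,"columns",2,"rows",2,"minX",-180.0,"maxX",180.0,'
        '"minY",-90.0,"maxY",90.0,"counts_ints2D",[null,[0,3]]]'
    )
    print(heatmap_query("geo_rpt", -180, -90, 180, 90))
    print(list(heatmap_cells(sample)))
```

Skipping `null` rows and zero cells keeps the client work proportional to the populated cells, which matters when most of the grid is empty ocean.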

Page 21

New HeatmapSpatialField
• Why?
  – With new BKD/PointValues, no "RPT" field to use
  – Scalable for heatmaps; don't worry about search
    • Scalable at all resolutions; many millions of docs/shard
  – Can be specific about grid resolutions

Coming soon! Solr 6.4?

Page 22

Heatmaps with Stats
• Instead of counting docs, calculate a metric
  – Ex: avg(minuteOfDay)
• Will require JSON Facet API
• Inherently slower than just doc counts

Coming soon! Solr 6.4?
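What "a metric instead of a count" means per grid cell can be shown in plain Python. This is a conceptual sketch only, not the planned Solr feature or its API: the function name and the `lon`/`lat` field names are hypothetical, and `minuteOfDay` is taken from the slide's example.

```python
from collections import defaultdict


def grid_average(docs, min_x, min_y, max_x, max_y, columns, rows, metric):
    """Average `metric` per grid cell instead of counting documents.

    Returns {(row, column): average}; rows are numbered top-down to match
    the row order of heatmap counts. Cells without documents are omitted.
    """
    cell_w = (max_x - min_x) / columns
    cell_h = (max_y - min_y) / rows
    sums = defaultdict(float)
    counts = defaultdict(int)
    for doc in docs:
        x, y = doc["lon"], doc["lat"]
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue  # outside the requested region
        # Clamp so points exactly on the max edge fall into the last cell.
        col = min(int((x - min_x) / cell_w), columns - 1)
        row = min(int((max_y - y) / cell_h), rows - 1)
        sums[(row, col)] += doc[metric]
        counts[(row, col)] += 1
    return {cell: sums[cell] / counts[cell] for cell in counts}


if __name__ == "__main__":
    docs = [
        {"lon": -90.0, "lat": 45.0, "minuteOfDay": 600},
        {"lon": -100.0, "lat": 40.0, "minuteOfDay": 720},
        {"lon": 90.0, "lat": -45.0, "minuteOfDay": 60},
    ]
    print(grid_average(docs, -180, -90, 180, 90, 2, 2, "minuteOfDay"))
```

The extra cost the slide mentions is visible here: every matching document's field value must be read and accumulated, whereas plain counts can often be served from the index structure alone.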

Page 23
Page 24
Page 25

Final Remarks
• Open-Source
  – https://github.com/dsmiley/hhypermap-bop
• In-progress
• Improvements to Solr expected to be available before December; officially in Solr 6.4.