Using Apache Solr for Images as Big Data: Presented by Kerry Koitzsch, Wipro Technologies

OCTOBER 11-14, 2016 • BOSTON, MA

Using Apache Solr for Images As Big Data: A Case Study

Kerry Koitzsch

Architect, Wipro Technologies

Overview of this Presentation

•  This quick overview of one of our ongoing projects describes why Lucene and Solr are key parts of our research, development, and client support activities.

•  The presentation highlights areas of research that involve Solr technologies in the “images as big data” arena: an automated microscope slide application prototype, as well as other kinds of data analysis and visualization. The use case described relies heavily on Lucene, Solr, and related “helper libraries” to provide data storage capabilities for the software toolkit, the “Image as Big Data Toolkit” (IABDT).

•  Throughout the presentation we discuss how flexibility, high performance, and the ability to “play well with” other components make Lucene/Solr an essential part of the application described here.


01 Use Case Overview: How Solr Technologies Relate To:

§ ‘Old School’ statistical displays

§ Web-based data visualization

§ ‘Glue Ware’

§ A crime statistic visualization

§ An image as big data visualization


02 Types of Data Visualization

Statistical displays --- ‘old school’ histogram, pie chart, and time series

Tabular displays --- stylized table-based visualization with search, etc.

Notebook based visualization

Map based displays with geo-location

Images with overlays

Constructing data visualizers with Lucene | Solr components


03 “Old School” Statistical Visualization

Histograms, line charts, pie charts and time series displays.

Notebook technologies and built-in visualization capabilities (such as Elasticsearch-Kibana or Apache Mahout visualization) may be used with Cassandra data and with Lucene/Solr.

A standard ETL approach may be used as part of the data pipeline, and intelligent search can be provided by Lucene/Solr.
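As a rough illustration of the kind of aggregation behind an "old school" histogram, the sketch below counts records per category and renders a text bar chart. It is a minimal sketch using only the Python standard library; the three sample rows are an abbreviated, illustrative excerpt of the crime CSV format shown later in the deck, and the function names `count_by_field` and `text_histogram` are hypothetical.

```python
import csv
import io
from collections import Counter

# Abbreviated sample rows (assumption: only three of the original columns are kept).
SAMPLE_CSV = """ID,Primary Type,Year
9955810,NARCOTICS,2015
9955861,BATTERY,2015
9955801,NARCOTICS,2015
"""


def count_by_field(csv_text, field):
    """Extract step: parse CSV text and count rows per value of `field`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return Counter(row[field] for row in reader)


def text_histogram(counts, bar="#"):
    """Display step: one text bar per category, most frequent first."""
    return [f"{label:<10} {bar * n} ({n})" for label, n in counts.most_common()]


if __name__ == "__main__":
    for line in text_histogram(count_by_field(SAMPLE_CSV, "Primary Type")):
        print(line)
```

In a real pipeline the counts would more likely come from a search-side aggregation than from a local loop; the local version is used here only to keep the example self-contained.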


01 “Old School” Statistical Visualization: Standard Plots and Charts


01 “Old School” Visualization of Classifier Results


01 “Old School” Statistical Visualization: Standard Time Series Plots


01 Tabular Display Visualization: Hive Notebook


01 Graph Visualization

§ Leveraging graph databases and graph visualization toolkits with Lucene/Solr-centric systems

§ Giraph, Neo4j, OrientDB, and other graph databases in combination with a Lucene/Solr-centric technology stack

§ For example, Chicago crime data format as CSV:

ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9955810,HY144797,02/08/2015 11:43:40 PM,081XX S COLES AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0422,004,7,46,18,1198273,1851626,2015,02/15/2015 12:43:39 PM,41.747693646,-87.549035389,"(41.747693646, -87.549035389)"
9955861,HY144838,02/08/2015 11:41:42 PM,118XX S STATE ST,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,true,true,0522,005,34,53,08B,1178335,1826581,2015,02/15/2015 12:43:39 PM,41.679442289,-87.622850758,"(41.679442289, -87.622850758)"
9955801,HY144779,02/08/2015 11:30:22 PM,002XX S LARAMIE AVE,2026,NARCOTICS,POSS: PCP,SIDEWALK,true,false,1522,015,29,25,18,1141717,1898581,2015,02/15/2015 12:43:39 PM,41.87777333,-87.755117993,"(41.87777333, -87.755117993)"
9956197,HY144787,02/08/2015 11:30:23 PM,006XX E 67TH ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0321,,6,42,18,,,2015,02/15/2015 12:43:39 PM,,,
9955846,HY144829,02/08/2015 11:30:58 PM,0000X S MAYFIELD AVE,0610,BURGLARY,FORCIBLE ENTRY,APARTMENT,false,false,1513,015,29,25,05,1137239,1899372,2015,02/15/2015 12:4

Graph Visualization in Neo4j

Graph Visualization Example I: Neo4j (Separate Nodes)

Graph Visualization Example: Simple UIs and Hierarchies

Graph Visualization Example II: GoJS Visualization
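One way to think about moving tabular crime records into a graph database is to turn each row into a few nodes and relationships. The sketch below does this in plain Python; the node labels (`Crime`, `CrimeType`, `LocationType`) and relationship names (`OF_TYPE`, `OCCURRED_AT`) are hypothetical choices for illustration, not the schema used in the project, and the rows are abbreviated from the Chicago crime CSV sample.

```python
# Abbreviated rows from the crime CSV sample (only three columns kept).
SAMPLE_ROWS = [
    {"ID": "9955810", "Primary Type": "NARCOTICS", "Location Description": "STREET"},
    {"ID": "9955861", "Primary Type": "BATTERY", "Location Description": "APARTMENT"},
    {"ID": "9955801", "Primary Type": "NARCOTICS", "Location Description": "SIDEWALK"},
]


def rows_to_graph(rows):
    """Turn tabular rows into a set of (label, key) nodes and a list of edges.

    Each edge is a (source_node, relationship_name, target_node) triple.
    Shared values such as a crime type collapse into a single node.
    """
    nodes = set()
    edges = []
    for row in rows:
        crime = ("Crime", row["ID"])
        crime_type = ("CrimeType", row["Primary Type"])
        place = ("LocationType", row["Location Description"])
        nodes.update([crime, crime_type, place])
        edges.append((crime, "OF_TYPE", crime_type))
        edges.append((crime, "OCCURRED_AT", place))
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = rows_to_graph(SAMPLE_ROWS)
    print(len(nodes), "nodes,", len(edges), "edges")
```

The resulting node and edge lists could then be handed to whichever graph database or visualization toolkit is in use; that loading step is deliberately left out here because it depends on the specific product.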

Notebook-Based Visualization

Jupyter or Zeppelin notebook technologies may be used to display Solr-based information and analytics results

These notebook technologies can be used as the display component in a data-pipeline-oriented processing architecture

Solr works well as one element of such a data pipeline

Spring, Spring Data, and Apache Tika may be used as data pipeline components

Simpler data pipelines may be evolved into Complex Event Processors (CEPs)
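The pipeline idea above can be sketched as a chain of small stages, each consuming the output of the previous one, with the last stage standing in for the notebook display component. This is a minimal sketch in plain Python; the stage names and the two-column input format are hypothetical, and a production pipeline would use the components named on the slide rather than local generator functions.

```python
# Hypothetical input: one "crime type,year" record per line.
SAMPLE_LINES = ["NARCOTICS,2015", "BATTERY,2014", "BURGLARY,2015"]


def parse_stage(lines):
    """Ingest stage: split raw lines into dictionaries."""
    for line in lines:
        crime_type, year = line.strip().split(",")
        yield {"type": crime_type, "year": int(year)}


def only_2015(records):
    """Filter stage: keep records from 2015."""
    return (r for r in records if r["year"] == 2015)


def to_display_rows(records):
    """Display stage: format records for a notebook-style table."""
    for r in records:
        yield f'{r["type"].title()} ({r["year"]})'


def run_pipeline(source, stages):
    """Feed the source through each stage in order and collect the result."""
    stream = source
    for stage in stages:
        stream = stage(stream)
    return list(stream)


if __name__ == "__main__":
    print(run_pipeline(SAMPLE_LINES, [parse_stage, only_2015, to_display_rows]))
```

Because each stage only depends on the shape of its input, a stage can be swapped out (for example, replacing the filter with a search query) without touching the others.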

Notebook Visualization: Architecture and Strategy

§ A relatively simple data pipeline system may be built using a Zeppelin notebook to visualize the output results

§ Geolocation data may be visualized as in the following example

Technology components: Hadoop, HBase, NGData Lily, Solr, Lucene, Solandra, Katta, Cassandra, ELK Stack, Kafka, Apache Spark, Mesos, Akka

Notebook Based Visualization: Example: Solr-Zeppelin-Cassandra

Map / Geolocation Visualization

Crime data can easily be imported into Solr

The data may be manipulated and pushed into Elasticsearch or Solr or back to Cassandra

Elasticsearch data can be visualized using Kibana and searched compatibly with Lucene | Solr and the other modules

Logstash, Flume, or any of the many other import frameworks may be used to assist in importing data from “log file analysis” type applications; Apache Tika is especially useful as a support library

Map / Geolocation Data: Crime Data in Solr

§ Technology stack includes the ELK Stack plus Cassandra plus Lucene/Solr/Hadoop

§ CSV crime data files may serve as the original data source

§ Solr can process JSON-based data with geolocation data associated with it, and is especially powerful with Apache Tika
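To make the CSV-to-Solr step concrete, the sketch below converts rows of the crime CSV into JSON documents with a combined "lat,lon" location value, which is the string form Solr's spatial point field types accept. It is a sketch under stated assumptions: the Solr field names (`case_number`, `primary_type`, `location`, and so on) are hypothetical and would need to match a real schema, and nothing is actually sent to a server. The `geofilt` helper only builds the filter-query string for a radius search.

```python
import csv
import io
import json

# Two rows from the crime CSV sample; the second has no coordinates.
CRIME_CSV = '''ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9955810,HY144797,02/08/2015 11:43:40 PM,081XX S COLES AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0422,004,7,46,18,1198273,1851626,2015,02/15/2015 12:43:39 PM,41.747693646,-87.549035389,"(41.747693646, -87.549035389)"
9956197,HY144787,02/08/2015 11:30:23 PM,006XX E 67TH ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0321,,6,42,18,,,2015,02/15/2015 12:43:39 PM,,,
'''


def row_to_solr_doc(row):
    """Map one CSV row to a Solr-style document (field names are assumptions)."""
    doc = {
        "id": row["ID"],
        "case_number": row["Case Number"],
        "primary_type": row["Primary Type"],
        "description": row["Description"],
        "arrest": row["Arrest"] == "true",
    }
    # Only add a location when both coordinates are present.
    if row["Latitude"] and row["Longitude"]:
        doc["location"] = f'{row["Latitude"]},{row["Longitude"]}'
    return doc


def load_docs(csv_text):
    """Parse CSV text and convert every row to a document."""
    return [row_to_solr_doc(r) for r in csv.DictReader(io.StringIO(csv_text))]


def geofilt(field, lat, lon, km):
    """Build a Solr geofilt filter query for points within `km` of (lat, lon)."""
    return f"{{!geofilt sfield={field} pt={lat},{lon} d={km}}}"


if __name__ == "__main__":
    # A JSON array of documents is the shape Solr's JSON update handler accepts.
    print(json.dumps(load_docs(CRIME_CSV), indent=2))
    print(geofilt("location", 41.75, -87.55, 5))
```

Skipping the location field for rows with empty coordinates keeps those records searchable by text while excluding them from spatial filters.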

Map / Geolocation : Crime Data in Kibana

§ Technology stack includes the ELK Stack plus Cassandra plus Lucene/Solr/Hadoop

§ CSV crime data files may serve as the original data source

§ Kibana can process JSON-based data with geolocation data associated with it, as can Lucene/Solr/Tika
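For the Elasticsearch side of the stack, geolocated records are typically sent in the newline-delimited bulk format, with the coordinates stored as an object holding `lat` and `lon`. The sketch below builds such a payload without contacting a server. The index name `crimes` and the field names are hypothetical, and for Kibana map displays the `location` field would also need to be mapped as a `geo_point` in the index, which is not shown here.

```python
import json

# Simplified records derived from the crime CSV sample (field names are assumptions).
SAMPLE_RECORDS = [
    {"id": "9955810", "primary_type": "NARCOTICS", "lat": 41.747693646, "lon": -87.549035389},
    {"id": "9955861", "primary_type": "BATTERY", "lat": 41.679442289, "lon": -87.622850758},
    {"id": "9956197", "primary_type": "NARCOTICS", "lat": None, "lon": None},
]


def to_bulk_ndjson(records, index_name="crimes"):
    """Build a bulk request body: one action line plus one source line per record.

    Records without coordinates are skipped, since they cannot be placed on a map.
    """
    lines = []
    for rec in records:
        if rec.get("lat") is None or rec.get("lon") is None:
            continue
        lines.append(json.dumps({"index": {"_index": index_name, "_id": rec["id"]}}))
        lines.append(json.dumps({
            "primary_type": rec["primary_type"],
            "location": {"lat": rec["lat"], "lon": rec["lon"]},
        }))
    # The bulk format requires a trailing newline after the last line.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_bulk_ndjson(SAMPLE_RECORDS), end="")
```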

Map | Geolocation Visualization: Data to Image

“Image as Big Data” Visualization

A data pipeline with images as a data source

Feature extraction can identify features of interest and write them to Cassandra as feature descriptors, using Lucene/Solr for intelligent search capability

Deep learning and machine learning can enhance the processing pipeline
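As a toy illustration of "image to data", the sketch below thresholds a tiny grayscale grid, groups bright pixels into connected regions, and emits one feature descriptor per region. This is a deliberately simple stand-in technique (4-connected flood fill), not the feature extraction used in the IABDT; the descriptor fields (`area`, `bbox`, `centroid`) and the function name are hypothetical. Descriptors of this shape are the kind of record that could be stored and then indexed for search.

```python
# A tiny grayscale image as nested lists (values 0-255); purely illustrative.
IMAGE = [
    [0,   0,   0, 0,   0],
    [0, 200, 210, 0,   0],
    [0, 220, 205, 0,   0],
    [0,   0,   0, 0, 180],
]


def extract_blobs(image, threshold=128):
    """Find 4-connected regions of pixels >= threshold and describe each one."""
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    descriptors = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] < threshold or seen[r][c]:
                continue
            # Flood fill from this seed pixel using an explicit stack.
            stack = [(r, c)]
            seen[r][c] = True
            pixels = []
            while stack:
                y, x = stack.pop()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and image[ny][nx] >= threshold:
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            descriptors.append({
                "id": f"blob-{len(descriptors)}",
                "area": len(pixels),
                "bbox": [min(xs), min(ys), max(xs), max(ys)],  # x0, y0, x1, y1
                "centroid": [sum(xs) / len(xs), sum(ys) / len(ys)],
            })
    return descriptors


if __name__ == "__main__":
    for d in extract_blobs(IMAGE):
        print(d)
```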

Image as Big Data Analysis (Poggio’s MIT Vision Machine)

Original Images

Color Analyzers, Texture Analyzers, Edge Detectors, Motion Analyzers, Stereo Image Analyzers

Discontinuity Map Generation (Including Line & Continuous Process)

Cooperating Recognition Process

Analysis Result Repository

Intelligent Search with Lucene Solr Centric Architecture
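The flow in this diagram, several independent analyzers feeding one result repository, can be sketched as a registry of analyzer functions whose outputs are merged into a single record per image. The two analyzers below (mean intensity and a crude horizontal edge-strength measure) are simplified, hypothetical stand-ins for the color, texture, and edge analyzers named above, and the dictionary-based repository is only a placeholder for a real store.

```python
def mean_intensity(image):
    """Average pixel value over the whole image."""
    pixels = [p for row in image for p in row]
    return sum(pixels) / len(pixels)


def edge_strength(image):
    """Mean absolute difference between horizontally adjacent pixels."""
    diffs = [abs(row[i + 1] - row[i]) for row in image for i in range(len(row) - 1)]
    return sum(diffs) / len(diffs)


# Registry of analyzers; adding a new analyzer only requires a new entry here.
ANALYZERS = {"mean_intensity": mean_intensity, "edge_strength": edge_strength}


def analyze(image_id, image, analyzers=ANALYZERS):
    """Run every registered analyzer and merge the results into one record."""
    record = {"id": image_id}
    for name, fn in analyzers.items():
        record[name] = fn(image)
    return record


if __name__ == "__main__":
    repository = {}  # placeholder for the analysis result repository
    sample = [[0, 0, 100, 100], [0, 0, 100, 100]]
    repository["img-1"] = analyze("img-1", sample)
    print(repository["img-1"])
```

Keeping each analyzer's output under its own key means the merged record can be indexed field by field, so a search can combine criteria from different analyzers.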

Image “As Big Data” Analytics Visualization: Linear Features

Automated Microscopy : The Original Components

Feature Extraction : Original Electron Microscope Image

Feature Extraction : Image to Data : Ellipses

Feature Extraction : Image to Data : Contours

“Image as Big Data” Visualization: Optical Microscope Hardware

Microscope Control Software, with Data Ingestion

“Image as Big Data” Visualization: Solr Search: Metadata

“Image as Big Data” Visualization: Microscopy UI

Another View of the Data Pipeline

Image and Metadata Input Sources (or “smart sensors”)

Multi-sensor Fusion Software Engine

Short Term Computation Result Repository

Long-Term Result Data Repository

Feature Extraction and Model Builder

Global System Controller

Conclusions and Future Work

A use case was described in which we use a Lucene/Solr-centric technology stack to provide an intelligent search component

Flat files, HDFS files, CSV data, data streams and other data sources may be used, including microscope images of many different formats, resolutions, and metadata content

“Images as big data” is a viable strategy for building image processing applications with Lucene/Solr as an intelligent search component, because of Lucene/Solr’s flexibility and ability to play well with other components

Deep learning, machine learning, data mining, and hybrid techniques can be used to develop Lucene/Solr-centric analytics applications with “intelligent search” capabilities

Your Questions?