October 11-14, 2016 • Boston, MA
Using Apache Solr for Images as Big Data: A Case Study
Kerry Koitzsch
Architect, Wipro Technologies
Overview of this Presentation
• This quick overview of one of our ongoing projects describes why Lucene and Solr are key parts of our ongoing research, development, and client support activities.
• The presentation highlights areas of research which involve Solr technologies in the “images as big data” arena: an automated microscope slide application prototype as well as other kinds of data analysis and visualization. The use case described relies heavily on Lucene, Solr, and related “helper libraries” to provide data storage capabilities for the software toolkit, the “Image as Big Data Toolkit” (IABDT).
• Throughout the presentation we discuss how the flexibility, high performance, and ability to “play well with” other components make Lucene/Solr an essential part of the application described here.
Use Case Overview: How Solr Technologies Relate To:
§ ‘Old School’ statistical displays
§ Web-based data visualization
§ ‘Glue Ware’
§ A crime statistic visualization
§ An image as big data visualization
Types of Data Visualization
Statistical displays --- ‘old school’ histogram, pie chart, and time series
Tabular displays --- stylized table-based visualization with search, etc.
Notebook based visualization
Map based displays with geo-location
Images with overlays
Constructing data visualizers with Lucene | Solr components
“Old School” Statistical Visualization
Histograms, line charts, pie charts and time series displays.
Notebook technologies and built-in visualization capabilities (such as Elasticsearch-Kibana or Apache Mahout visualization) may be used with Cassandra data and with Lucene/Solr.
A standard ETL approach may be used as part of the data pipeline, and intelligent search can be provided by Lucene/Solr.
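As a rough illustration of how Solr can feed an “old school” histogram, the sketch below builds the parameters for a field-facet request and turns a facet response into a text bar chart. The field name `primary_type` and the sample response are assumptions made for illustration; they are not taken from the presentation's schema, and no request is actually sent.

```python
from urllib.parse import urlencode


def facet_params(field, rows=0):
    """Build query parameters for a Solr field-facet request (no documents returned)."""
    return urlencode({
        "q": "*:*",
        "rows": rows,
        "facet": "true",
        "facet.field": field,
        "wt": "json",
    })


def facet_pairs(response, field):
    """Convert Solr's flat [value, count, value, count, ...] list into (value, count) pairs."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


def text_histogram(pairs, width=40):
    """Render (value, count) pairs as text bars scaled to the largest count."""
    top = max((count for _, count in pairs), default=0)
    lines = []
    for value, count in pairs:
        bar = "#" * (round(width * count / top) if top else 0)
        lines.append(f"{value:<12} {bar} {count}")
    return lines


# Hand-written sample response (assumption): shaped like Solr's default JSON facet output.
sample = {
    "facet_counts": {
        "facet_fields": {
            "primary_type": ["NARCOTICS", 3, "BATTERY", 1, "BURGLARY", 1]
        }
    }
}

pairs = facet_pairs(sample, "primary_type")
for line in text_histogram(pairs):
    print(line)
```

The same (value, count) pairs could be handed to any charting library or notebook to draw the pie chart or histogram itself.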
“Old School” Statistical Visualization: Standard Plots and Charts
“Old School” Visualization of Classifier Results
“Old School” Statistical Visualization: Standard Time Series Plots
Tabular Display Visualization: Hive Notebook
Graph Visualization
§ Leveraging Graph databases and graph visualization toolkits with Lucene/Solr-centric systems
§ Giraph, Neo4j, OrientDB, and other graph databases in combination with a Lucene/Solr-centric technology stack
§ For example, Chicago crime data format as CSV:
ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
9955810,HY144797,02/08/2015 11:43:40 PM,081XX S COLES AVE,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0422,004,7,46,18,1198273,1851626,2015,02/15/2015 12:43:39 PM,41.747693646,-87.549035389,"(41.747693646, -87.549035389)"
9955861,HY144838,02/08/2015 11:41:42 PM,118XX S STATE ST,0486,BATTERY,DOMESTIC BATTERY SIMPLE,APARTMENT,true,true,0522,005,34,53,08B,1178335,1826581,2015,02/15/2015 12:43:39 PM,41.679442289,-87.622850758,"(41.679442289, -87.622850758)"
9955801,HY144779,02/08/2015 11:30:22 PM,002XX S LARAMIE AVE,2026,NARCOTICS,POSS: PCP,SIDEWALK,true,false,1522,015,29,25,18,1141717,1898581,2015,02/15/2015 12:43:39 PM,41.87777333,-87.755117993,"(41.87777333, -87.755117993)"
9956197,HY144787,02/08/2015 11:30:23 PM,006XX E 67TH ST,1811,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0321,,6,42,18,,,2015,02/15/2015 12:43:39 PM,,,
9955846,HY144829,02/08/2015 11:30:58 PM,0000X S MAYFIELD AVE,0610,BURGLARY,FORCIBLE ENTRY,APARTMENT,false,false,1513,015,29,25,05,1137239,1899372,2015,02/15/2015 12:4
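Rows in the Chicago crime CSV format shown in this section can be mapped to Solr-style documents with a few lines of code. The following is a minimal sketch, not the toolkit's actual ingestion code: the output field names (`case_number`, `primary_type`, `location`, and so on) are hypothetical, and the sample reuses the header plus two complete rows from the slide.

```python
import csv
import io
import json

# Header and two complete rows copied from the slide's CSV sample.
SAMPLE = "\n".join([
    'ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,'
    'Arrest,Domestic,Beat,District,Ward,Community Area,FBI Code,X Coordinate,'
    'Y Coordinate,Year,Updated On,Latitude,Longitude,Location',
    '9955810,HY144797,02/08/2015 11:43:40 PM,081XX S COLES AVE,1811,NARCOTICS,'
    'POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0422,004,7,46,18,1198273,'
    '1851626,2015,02/15/2015 12:43:39 PM,41.747693646,-87.549035389,'
    '"(41.747693646, -87.549035389)"',
    '9956197,HY144787,02/08/2015 11:30:23 PM,006XX E 67TH ST,1811,NARCOTICS,'
    'POSS: CANNABIS 30GMS OR LESS,STREET,true,false,0321,,6,42,18,,,2015,'
    '02/15/2015 12:43:39 PM,,,',
])


def to_solr_doc(row):
    """Map one CSV row to a flat document; output field names are hypothetical."""
    doc = {
        "id": row["ID"],
        "case_number": row["Case Number"],
        "primary_type": row["Primary Type"],
        "description": row["Description"],
        "arrest": row["Arrest"] == "true",
    }
    # Some rows have no coordinates; only emit a "lat,lon" point when both exist.
    if row["Latitude"] and row["Longitude"]:
        doc["location"] = f'{row["Latitude"]},{row["Longitude"]}'
    return doc


docs = [to_solr_doc(row) for row in csv.DictReader(io.StringIO(SAMPLE))]
print(json.dumps(docs, indent=2))
```

A JSON array like this is the kind of payload Solr's update handler accepts; the collection and schema it would be sent to are left out because they are not described in the slides.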
Graph Visualization in Neo4j
Graph Visualization Example I: Neo4j (Separate Nodes)
Graph Visualization Example: Simple UIs and Hierarchies
Graph Visualization Example II: GoJS Visualization
Notebook-Based Visualization
Jupyter or Zeppelin notebook technologies may be used to display Solr-based information and analytics results
These notebook technologies can be used as the display component in a data-pipeline-oriented processing architecture
Solr works well as one element of such a data pipeline
Spring, Spring Data, and Apache Tika may be used as data pipeline components
Simpler data pipelines may be evolved into Complex Event Processors (CEPs)
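To make the step from a simple pipeline to a complex event processor more concrete, here is a small sketch in plain Python. It chains generator stages (parse, filter) and adds one CEP-style rule that raises an alert when several events of the same kind fall inside a time window. The event format, the stage names, and the thresholds are all assumptions for illustration; a production system would use the frameworks named above rather than this toy code.

```python
from collections import deque


def parse(lines):
    """Stage 1: turn 'timestamp,type' strings into event dictionaries."""
    for line in lines:
        ts, kind = line.split(",")
        yield {"ts": int(ts), "type": kind}


def only(events, kind):
    """Stage 2: keep events of a single type."""
    return (e for e in events if e["type"] == kind)


def bursts(events, count, window):
    """CEP-style rule: alert when `count` events occur within `window` seconds.

    Assumes events arrive in timestamp order.
    """
    recent = deque()
    for e in events:
        recent.append(e["ts"])
        # Drop timestamps that have fallen out of the sliding window.
        while recent[0] < e["ts"] - window:
            recent.popleft()
        if len(recent) >= count:
            yield {"alert": "burst", "at": e["ts"], "events": len(recent)}


# Made-up input stream (timestamps in seconds).
raw = ["0,NARCOTICS", "10,BATTERY", "20,NARCOTICS", "25,NARCOTICS", "400,NARCOTICS"]
alerts = list(bursts(only(parse(raw), "NARCOTICS"), count=3, window=60))
print(alerts)
```

Because each stage is a generator, stages can be added or swapped without changing the others, which is the property that lets a simple pipeline grow incrementally.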
Notebook Visualization: Architecture and Strategy
§ A relatively simple data pipeline system may be built using a Zeppelin notebook to visualize the output results
§ Geolocation data may be visualized as in the following example
Technology components: Hadoop, HBase, NGData Lily, Solr, Lucene, Solandra, Katta, Cassandra, ELK Stack, Kafka, Apache Spark, Mesos, Akka
Notebook Based Visualization: Example: Solr-Zeppelin-Cassandra
Map / Geolocation Visualization
Crime data can easily be imported into Solr
The data may be manipulated and pushed into Elasticsearch or Solr or back to Cassandra
Elasticsearch data can be visualized using Kibana and searched compatibly with Lucene | Solr and the other modules
Logstash, Flume, or any of the many other import frameworks may be used to assist in importing data from “log file analysis” type applications. Apache Tika is especially useful as a support library
Map / Geolocation Data: Crime Data in Solr
§ Technology stack includes the ELK Stack plus Cassandra plus Lucene/Solr/Hadoop
§ Data may use CSV crime data files as an original data source
§ Solr can process JSON based data with geolocation data associated with it, and is especially powerful with Apache Tika
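Once crime records carry a geolocation field, Solr's spatial query support can restrict results to a radius around a point and sort them by distance. The sketch below only builds the query string, using Solr's `geofilt` filter and `geodist()` function; the field name `location` is an assumption, and the coordinates are taken from one row of the crime data sample.

```python
from urllib.parse import urlencode, parse_qs


def geo_query(lat, lon, km, field="location", text="*:*"):
    """Build Solr parameters for 'documents within `km` kilometres of (lat, lon)'.

    `field` is a hypothetical spatial field name in the index.
    """
    return urlencode({
        "q": text,
        "fq": "{!geofilt}",          # spatial filter using sfield / pt / d below
        "sfield": field,             # which spatial field to filter on
        "pt": f"{lat},{lon}",        # centre point as "lat,lon"
        "d": km,                     # radius in kilometres
        "sort": "geodist() asc",     # nearest results first
        "wt": "json",
    })


params = geo_query(41.87777333, -87.755117993, 2)
print(params)
```

The resulting string would be appended to a collection's select URL; the host and collection name are deliberately omitted because the slides do not specify them.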
Map / Geolocation : Crime Data in Kibana
§ Technology stack includes the ELK Stack plus Cassandra plus Lucene/Solr/Hadoop
§ Data may use CSV crime data files as an original data source
§ Kibana can process JSON based data with geolocation data associated with it, as can Lucene/Solr/Tika
Map | Geolocation Visualization: Data to Image
“Image as Big Data” Visualization
A data pipeline with images as a data source
Feature extraction can identify features of interest and write them to Cassandra as feature descriptors, using Lucene/Solr for intelligent search capability
Deep learning and machine learning can enhance the processing pipeline
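The idea of turning an extracted image feature into a searchable record can be sketched as follows. This is an illustrative assumption about what a feature descriptor might look like, not the IABDT data model: the `FeatureDescriptor` fields, the id scheme, and the example query are all hypothetical.

```python
from dataclasses import dataclass, asdict


@dataclass
class FeatureDescriptor:
    """Hypothetical flat record describing one feature found in an image."""
    id: str
    image_id: str
    feature_type: str
    centroid_x: float
    centroid_y: float
    width: float
    height: float


def describe(image_id, index, feature_type, points):
    """Summarize a list of (x, y) contour points as a descriptor.

    The centroid here is the mean of the given points, not an area centroid.
    """
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return FeatureDescriptor(
        id=f"{image_id}-{feature_type}-{index}",
        image_id=image_id,
        feature_type=feature_type,
        centroid_x=sum(xs) / len(xs),
        centroid_y=sum(ys) / len(ys),
        width=max(xs) - min(xs),
        height=max(ys) - min(ys),
    )


# Made-up contour: the four corners of a 20 x 10 rectangle.
contour = [(10, 10), (30, 10), (30, 20), (10, 20)]
feature = describe("slide-001", 0, "contour", contour)

doc = asdict(feature)  # flat dict, suitable for a wide-column row or a search document
query = "feature_type:contour AND width:[15 TO *]"  # Lucene/Solr-style range query

print(doc)
print(query)
```

Keeping the descriptor flat means the same record can be written to a store such as Cassandra and indexed for search without a separate mapping step.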
Image as Big Data Analysis (Poggio’s MIT Vision Machine)
Original Images
Color Analyzers, Texture Analyzers, Edge Detectors, Motion Analyzers, Stereo Image Analyzers
Discontinuity Map Generation (Including Line & Continuous Process)
Cooperating Recognition Process
Analysis Result Repository
Intelligent Search with Lucene Solr Centric Architecture
Image “As Big Data” Analytics Visualization: Linear Features
Automated Microscopy : The Original Components
Feature Extraction : Original Electron Microscope Image
Feature Extraction : Image to Data : Ellipses
Feature Extraction : Image to Data : Contours
“Image as Big Data” Visualization: Optical Microscope Hardware
Microscope Control Software, with Data Ingestion
“Image as Big Data” Visualization: Solr Search: Metadata
“Image as Big Data” Visualization: Microscopy UI
Another View of the Data Pipeline
Image and Metadata Input Sources (or “smart sensors”)
Multi-sensor Fusion Software Engine
Short Term Computation Result Repository
Long-Term Result Data Repository
Feature Extraction and Model Builder
Global System Controller
Conclusions and Future Work
A use case was described in which we use a Lucene/Solr-centric technology stack to provide an intelligent search component
Flat files, HDFS files, CSV data, data streams and other data sources may be used, including microscope images of many different formats, resolutions, and metadata content
“Images as big data” is a viable strategy for building image processing applications with Lucene/Solr as an intelligent search component, because of Lucene/Solr’s flexibility and ability to play well with other components
Deep learning, machine learning, data mining, and hybrid techniques can be used to develop Lucene/Solr-centric analytics applications with “intelligent search” capabilities
Your Questions?