Extending Solr: Packaging Common Sense

Carlos Valcarcel

Solutions Consultant, Lucidworks

• In the beginning: Solr (Classic)
• Solr Cloud
• Solr
• Security
• Content Ingestion
• Fusion
• Architecture
• Connectors
• Pipelines
• Signals
• Visualization

Agenda

In The Beginning…

Solr was (un)officially born August 2004

It became an official Apache project January 3, 2006

Implemented on top of Lucene

The initial architecture was Master/Slave

Blog post: https://lucidworks.com/blog/2016/02/02/happy-10th-birthday-apache-solr/

Solr (Classic)

A natural evolution of Solr

• Zookeeper for property administration and synchronization
• Transparent scaling (just add more replicas)
• Replication rules
• Major pro: New Solr features are distributed!
• Major con: Not all Solr features are distributed! More testing!

Solr Cloud
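
As a rough illustration of "just add more replicas", the sketch below builds a Collections API request URL using the `ADDREPLICA` action. The host, collection name, and shard name are assumptions chosen for the example, and the function only builds the URL; it does not contact a cluster.

```python
from urllib.parse import urlencode


def add_replica_url(base_url: str, collection: str, shard: str) -> str:
    """Build a Collections API URL asking SolrCloud to add one replica to a shard."""
    params = {
        "action": "ADDREPLICA",    # Collections API action name
        "collection": collection,  # hypothetical collection name
        "shard": shard,            # hypothetical shard name
        "wt": "json",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Example values are assumptions, not taken from the slides.
url = add_replica_url("http://localhost:8983/solr", "products", "shard1")
print(url)
```

Sending this URL with any HTTP client would request the new replica; where the replica is placed is left to the cluster.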

- Search for the masses!
- Easy to use
- Full control over how documents are indexed and queried
- DIY
- Open source: you can extend it in any way you like
- Mature search technology, strong underlying libraries
- Used by…well, almost everyone

Awesomeness of Solr

- Solr is not trivial to use
- Consultants in high demand
- DIY == Fix it Yourself
- Full control == Lots of responsibilities
- Open source == you can extend it, but so can everybody else
- Mature search technology == higher-level search abstractions aren’t always implemented
- Large audience == Harder to implement custom features that don’t break in between releases

Awesomeness of Solr…at a price

• Full-text search (information retrieval)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive Scale/Fault tolerance

Solr Key Features
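
To make a few of these features concrete, here is a minimal sketch that builds a `/select` URL combining full-text search, faceting, highlighting, and cursor-based paging. The collection name, the facet field, and the use of `id` as the unique key are assumptions for the example; cursors require a sort that includes the unique key field.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, text, facet_field, cursor_mark="*"):
    """Build a Solr select URL with faceting, highlighting and a cursor."""
    params = [
        ("q", text),                  # full-text query
        ("facet", "true"),            # enable faceting
        ("facet.field", facet_field), # hypothetical field to facet on
        ("hl", "true"),               # enable highlighting
        ("rows", "10"),
        ("sort", "id asc"),           # assumes "id" is the unique key
        ("cursorMark", cursor_mark),  # "*" starts a new cursor
    ]
    return f"{base_url}/{collection}/select?{urlencode(params)}"


url = build_select_url("http://localhost:8983/solr", "products", "laptop", "brand")
print(url)
```

For the next page, the `nextCursorMark` value from the response would be passed back as `cursor_mark`.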

• SSL
• Jetty
• Kerberos
• Solr plug-in
• Document-level security
• SharePoint
• Windows Shares

Reference:
https://wiki.apache.org/solr/SolrSecurity
https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
https://cwiki.apache.org/confluence/display/solr/Kerberos+Authentication+Plugin

Solr Security

Pro: Lots of choices!
Con: Lots of choices!

Basic: Request Handlers
- Structured file types
  - CSV
  - XML
  - JSON
- Binary file types
  - PDF
  - MS Office format

Advanced: External Repositories
- ManifoldCF
- Commercial products

Solr and Content Ingestion
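
As a small example of the "Basic: Request Handlers" path for JSON, the sketch below prepares a POST to a collection's `/update` handler. The host, collection name, and document fields are assumptions, and the request is only constructed here, not sent.

```python
import json
from urllib.request import Request


def build_json_update(base_url, collection, docs, commit=True):
    """Prepare a POST request that sends a list of JSON documents to /update."""
    url = f"{base_url}/{collection}/update"
    if commit:
        url += "?commit=true"  # make the documents searchable right away
    body = json.dumps(docs).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical documents for illustration only.
request = build_json_update(
    "http://localhost:8983/solr",
    "docs",
    [{"id": "1", "title": "Extending Solr"}],
)
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it to a running Solr. Binary formats such as PDF go through a different extraction path and are not covered by this sketch.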

And then along came: Fusion

Solr is so powerful that it needs a front-end

- Administration
- Care and feeding
- Development: REST API w/security
- Allows for control of multiple external Solr Clouds
- Integrates with other open source projects
  - Spark
  - Banana (SiLK)
- First generation

Fusion and the search for World Domination

Lucidworks Fusion Is Search-Driven Everything

Fusion is built on three core principles:

• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance

Fusion Processes

Developer Fusion Deployment

- zk (9983)
- ui (8764): admin UI, authentication
- api (“backend”) (8765): aggregator, index pipelines, query pipelines, scheduler, collection management, recommender, system metrics, spark jobserver
- Solr (8983): 1 replica SolrCloud, embedded Zookeeper (shared with other components)
- connectors (8984): data sources, index pipelines
- spark master (8766)
- spark worker (8769)

Fusion Architecture

- REST API, Admin UI, SECURITY BUILT-IN
- Connectors: Database, Web, File, Logs, Hadoop, Cloud
- Core Services: Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/…
- Apache Spark: Worker, Worker, Cluster Mgr.
- Apache Solr: Shards, Shards, HDFS (Optional)
- Apache Zookeeper (ZK 1 … ZK N): Shared Config Mgmt, Leader Election, Load Balancing

- Database: JDBC, CouchDB, MongoDB
- Filesystem: Box, Dropbox, FTP, GoogleDrive, HDFS, Local, S3, S3H, SolrXML, Windows Share
- Hadoop: Apache Hadoop, Cloudera, Hortonworks, Mapr, Pivotal
- Logging: Logstash
- Social Media: Jive, Slack, Twitter search, Twitter streaming
- Web
- Websphere
- Repository: Alfresco, Azure blob, Azure table, Drupal, GitHub, JIRA, Salesforce, SharePoint, ServiceNow, Solr, Subversion, Zendesk
- Push: Content to a port
- Script: roll your own

Fusion Connectors

Index Pipelines/stages
- Aggregating, Apache Camel, Apache Tika Parser, CSV Parser, Date Parsing, Exclusion Filter, Field Mapper, Find Replace, Fusion Pipeline, Gazetteer Lookup Extractor, HTML Transform, Indexing RPC, Javascript, JDBC, and others

Query Pipelines/stages
- Active Directory Security Trimming, Advanced Boosting, Aggregating, Block Documents, Boost Documents, Facet, Javascript, JDBC, Landing Pages, Logging Query Stage, Recommendation Boosting, Return QueryParams Query Stage, Rollup Aggregator Query Stage, Search Fields Query Stage, and others

Pipelines: preprocess incoming information in a predictable way

Fusion Pipelines and Stages
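
The idea of a pipeline as an ordered list of stages can be sketched in a few lines. This is not Fusion's API: the function names below are hypothetical and only borrow the "Field Mapper" and "Find Replace" stage names from the list above to show how each stage transforms a document before handing it to the next one.

```python
from typing import Callable, Dict, List

Document = Dict[str, str]
Stage = Callable[[Document], Document]


def field_mapper(mapping: Dict[str, str]) -> Stage:
    """Return a stage that renames fields according to `mapping`."""
    def stage(doc: Document) -> Document:
        return {mapping.get(name, name): value for name, value in doc.items()}
    return stage


def find_replace(field: str, old: str, new: str) -> Stage:
    """Return a stage that replaces `old` with `new` in one field."""
    def stage(doc: Document) -> Document:
        out = dict(doc)
        if field in out:
            out[field] = out[field].replace(old, new)
        return out
    return stage


def run_pipeline(stages: List[Stage], doc: Document) -> Document:
    """Apply every stage in order, feeding each stage's output to the next."""
    for stage in stages:
        doc = stage(doc)
    return doc


index_pipeline = [
    field_mapper({"headline": "title"}),
    find_replace("title", "SOLR", "Solr"),
]
result = run_pipeline(index_pipeline, {"id": "1", "headline": "Extending SOLR"})
print(result)
```

Because the stages run in a fixed order, the same input always yields the same output, which is the "predictable way" the slide refers to.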

Fusion Signals

Signals are captured user events that tell us something about what the user is doing:
- page views
- page pings
- clicked links
- custom configured events

Can be used to equate user behavior:
- at different times of day
- in different geographic locations
- during different weather conditions

Reference: http://www.slideshare.net/lucidworks/events-processing-and-data-analysis-with-lucidworks-fusion-presented-by-kiran-chitturi-lucidworks
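
A captured event of this kind could be modelled as a small JSON payload. The sketch below is an assumption-laden illustration: the field names (`type`, `timestamp`, `params`) and the `query`/`docId` parameters are hypothetical and are not confirmed by the slides as the schema Fusion expects.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Dict


@dataclass
class Signal:
    """One captured user event (field names are illustrative assumptions)."""
    type: str
    timestamp: str
    params: Dict[str, str] = field(default_factory=dict)


def make_signal(signal_type, params, when=None):
    """Create a signal stamped with `when`, or the current UTC time."""
    when = when or datetime.now(timezone.utc)
    return Signal(type=signal_type, timestamp=when.isoformat(), params=dict(params))


def to_payload(signals):
    """Serialize a batch of signals as a JSON array."""
    return json.dumps([asdict(s) for s in signals])


click = make_signal("click", {"query": "solr", "docId": "doc-42"})
print(to_payload([click]))
```

Keeping the timestamp on every event is what allows behavior to be compared across times of day later on.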

Fusion Signals

Signals - data flow
- Inputs to the Signals Service: JSON payloads, Snowplow payloads
- Solr collections:
  - Primary collection: test
  - Raw signals collection: test_signals
  - Aggregated signals collection: test_signals_aggr

Fusion Signals

Aggregations - data flow
- Aggregation job: Aggregator Spark Agent, Spark Driver
- Spark: Worker, Worker, Cluster Mgr.
- Fetches raw signals for processing (from the raw signals collection)
- Stores aggregated results (in the aggregated signals collection)
- Primary collection: test
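
The fetch, aggregate, and store flow above can be illustrated without Spark. The sketch below swaps in plain Python for the Spark job and simply counts clicks per (query, document) pair; the signal field names are the same hypothetical ones used for illustration and are not taken from the slides.

```python
from collections import Counter


def aggregate_clicks(raw_signals):
    """Count click signals per (query, docId) pair, most frequent first."""
    counts = Counter()
    for signal in raw_signals:
        if signal.get("type") != "click":
            continue  # only clicks are aggregated in this sketch
        params = signal.get("params", {})
        key = (params.get("query"), params.get("docId"))
        if None in key:
            continue  # skip signals missing a query or a document id
        counts[key] += 1
    return [
        {"query": query, "docId": doc_id, "count": n}
        for (query, doc_id), n in counts.most_common()
    ]


raw = [
    {"type": "click", "params": {"query": "solr", "docId": "d1"}},
    {"type": "click", "params": {"query": "solr", "docId": "d1"}},
    {"type": "click", "params": {"query": "solr", "docId": "d2"}},
    {"type": "view", "params": {"query": "solr", "docId": "d1"}},
]
aggregated = aggregate_clicks(raw)
print(aggregated)
```

In the deployment described on the slide, the equivalent work is done by a Spark job that reads from the raw signals collection and writes to the aggregated signals collection.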


Fusion Signals

Boosting search results using aggregated documents
- User App sends a search query and receives the search response
- Query-pipeline stages shown: Set Params, Query Solr, and the Recommendation Stages
- Recommendation Stages:
  1. Query aggregated documents
  2. Process results
  3. Add parameters to the request
- Collections: primary collection (test), raw signals collection, aggregated signals collection
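
The three recommendation steps can be sketched as a function that looks up aggregated documents for a query, keeps the strongest ones, and adds boost parameters to the request. This is an illustration only: it uses Solr's `bq` boost-query parameter with `edismax`, the `id` field and the aggregated-document shape are assumptions, and the parameters Fusion's own stages add may differ.

```python
def boost_params(query, aggregated_docs, max_boosts=3):
    """Build request parameters that boost documents users clicked for `query`."""
    # 1. Query aggregated documents: keep those recorded for this query.
    matches = [d for d in aggregated_docs if d["query"] == query]
    # 2. Process results: strongest signals first, limited to `max_boosts`.
    matches.sort(key=lambda d: d["count"], reverse=True)
    # 3. Add parameters to the request.
    params = [("q", query), ("defType", "edismax")]
    for d in matches[:max_boosts]:
        params.append(("bq", f'id:"{d["docId"]}"^{d["count"]}'))
    return params


docs = [
    {"query": "solr", "docId": "d2", "count": 1},
    {"query": "solr", "docId": "d1", "count": 2},
    {"query": "spark", "docId": "d9", "count": 5},
]
print(boost_params("solr", docs))
```

Using the raw click count as the boost weight is the simplest possible choice; a real deployment would likely normalize or decay it.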

Fusion Visualization

The Day After Tomorrow…

Major Themes

Apache Solr (6.0)
• Ease of Use
  • Modern, consistent, “introspectable” APIs
• Scalability
  • Cross Data Center Replication
  • Performance improvements
• Analytics and Relevance
  • SQL
  • Graphs

Fusion
• Ease of Use
  • Point and click Time Series indexing
  • Relevance and Taxonomy Mgmt tools
  • Indexing Previews
• Analytics and Relevance
  • Query intent and related machine learning
  • ZoomData integration
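
For the SQL theme listed under Solr 6.0, a request to the `/sql` handler might be prepared as in the sketch below. The host, collection name, and statement are assumptions for illustration, and the request is only built, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_sql_request(base_url, collection, statement):
    """Prepare a POST to a collection's /sql handler with the statement in `stmt`."""
    body = urlencode({"stmt": statement}).encode("utf-8")
    return Request(f"{base_url}/{collection}/sql", data=body, method="POST")


# Hypothetical collection and statement.
request = build_sql_request(
    "http://localhost:8983/solr",
    "logs",
    "SELECT id FROM logs LIMIT 10",
)
print(request.full_url, request.data)
```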

• Improved Spark-Solr data locality integration

• 10x performance improvement!

• Lucene analyzers for Spark data processing

• Easily and simply build and deploy Spark-based Machine Learning with minimal coding

• Leverage best in class libraries like MLlib, Mahout and DL4J

Simplicity on Top, Power Under the Hood

• Standalone Reference Search UI showcasing Fusion best practices (April/May)

• Signals, pipelines, auto-suggest, faceting, search, did you mean

• Built on AngularJS

• Performance improvements in pipelines (Fusion 2.3)

• 30-50% overall increase for all pipelines

• 3x improvement for pipelines using Javascript stages

• Improved DevOps support (plugins, distributed coordination; 2.3 and 2.4)

• Monitoring, Server Management, Deployment

“Too much of a good thing can be wonderful.” - Mae West

You Have Questions…I might have a few answers

2016

OCTOBER 11-14, BOSTON, MA

CALL FOR PAPERS OPEN THROUGH APRIL 30!
lucenerevolution.org

Meetup Discount: 20% off Super Early Bird registration through April 30

- OR -

www.lucenesolr-revolution-2016.eventbrite.com/

Code: NESTMeetup0316

• Greatly simplify the care and feeding of time-based indexes

• Point and click (or single API call) creation of time series shards

• Total control over number of shards and replication

• Easily defined retention and archiving policies (e.g. 30 day retention)

• Intelligent query parsing optimizes shard access

• Ideal for log data and signals

Time Series Done Right (Fusion 2.4)
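
To show what a retention policy over time-based partitions involves, here is a minimal sketch under stated assumptions: one partition per day, a hypothetical `<collection>_YYYY_MM_DD` naming scheme, and a 30-day default retention taken from the example on the slide. It does not reflect how Fusion 2.4 names or manages its shards.

```python
from datetime import date, timedelta


def partition_name(collection, day):
    """Hypothetical naming scheme: one partition per day."""
    return f"{collection}_{day:%Y_%m_%d}"


def partitions_to_keep(collection, today, retention_days=30):
    """Names of the daily partitions that fall inside the retention window."""
    return [
        partition_name(collection, today - timedelta(days=n))
        for n in range(retention_days)
    ]


def partitions_to_drop(existing, collection, today, retention_days=30):
    """Existing partitions that are older than the retention window."""
    keep = set(partitions_to_keep(collection, today, retention_days))
    return sorted(name for name in existing if name not in keep)


existing = ["logs_2016_03_15", "logs_2016_03_14", "logs_2016_03_01"]
print(partitions_to_drop(existing, "logs", date(2016, 3, 15), retention_days=3))
```

The same date arithmetic is what lets a query parser skip partitions outside a requested time range instead of searching every shard.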

Experiments >> Rules

Fusion 2.4 will support:

Experiment management framework for large scale, multi-variate testing

Bandit algorithms for high volume experimentation

Capture and reporting of search (and other) metrics all from within Fusion
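
The slides do not say which bandit algorithms Fusion 2.4 will use. As one simple member of that family, the sketch below implements epsilon-greedy: most of the time it serves the variant with the best observed reward, and with probability `epsilon` it explores a random variant. The class name, variant names, and reward values are hypothetical.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of experiment variants (arms)."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.arms = list(arms)
        self.epsilon = epsilon
        self.pulls = {arm: 0 for arm in self.arms}
        self.rewards = {arm: 0.0 for arm in self.arms}
        self._rng = random.Random(seed)

    def mean(self, arm):
        """Average reward observed for `arm` (0.0 if never tried)."""
        n = self.pulls[arm]
        return self.rewards[arm] / n if n else 0.0

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best arm."""
        if self._rng.random() < self.epsilon:
            return self._rng.choice(self.arms)
        return max(self.arms, key=self.mean)

    def record(self, arm, reward):
        """Record the outcome (for example 1.0 for a click) of serving `arm`."""
        self.pulls[arm] += 1
        self.rewards[arm] += reward


bandit = EpsilonGreedy(["ranking_a", "ranking_b"], epsilon=0.1, seed=7)
bandit.record("ranking_a", 0.0)
bandit.record("ranking_b", 1.0)
print(bandit.choose())
```

Unlike a fixed A/B split, traffic shifts toward the better variant while the experiment is still running, which is why bandits suit high-volume experimentation.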