Extending Solr: Packaging Common Sense
Carlos Valcarcel
Solutions Consultant, Lucidworks
• In the beginning: Solr (Classic)
• Solr Cloud
• Solr
• Security
• Content Ingestion
• Fusion
• Architecture
• Connectors
• Pipelines
• Signals
• Visualization
Agenda
In The Beginning…
Solr was (un)officially born August 2004
It became an official Apache project January 3, 2006
Implemented on top of Lucene
The initial architecture was Master/Slave
Blog post: https://lucidworks.com/blog/2016/02/02/happy-10th-birthday-apache-solr/
Solr (Classic)
A natural evolution of Solr
• Zookeeper for property administration and synchronization
• Transparent scaling (just add more replicas)
• Replication rules
• Major pro: New Solr features are distributed!
• Major con: Not all Solr features are distributed! More testing!
Solr Cloud
- Search for the masses!
- Easy to use
- Full control over how documents are indexed and queried
- DIY
- Open source: you can extend it in any way you like
- Mature search technology, strong underlying libraries
- Used by…well, almost everyone
Awesomeness of Solr
- Solr is not trivial to use
- Consultants in high demand
- DIY == Fix it Yourself
- Full control == Lots of responsibilities
- Open source == you can extend it, but so can everybody else
- Mature search technology == higher-level search abstractions aren’t always implemented
- Large audience == Harder to implement custom features that don’t break in between releases
Awesomeness of Solr…at a price
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Language detection
• Extensible
• Massive Scale/Fault tolerance
Solr Key Features
• Full text search (information retrieval)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• SSL
• Jetty
• Kerberos
• Solr plug-in
• Document-level security
• SharePoint
• Windows Shares
Reference:
https://wiki.apache.org/solr/SolrSecurity
https://cwiki.apache.org/confluence/display/solr/Enabling+SSL
https://cwiki.apache.org/confluence/display/solr/Kerberos+Authentication+Plugin
Solr Security
Pro: Lots of choices!
Con: Lots of choices!
Basic: Request Handlers
- Structured file types
  - CSV
  - XML
  - JSON
- Binary file types
  - PDF
  - MS Office format
Advanced: External Repositories
- ManifoldCF
- Commercial products
Solr and Content Ingestion
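As a rough illustration of the "Basic: Request Handlers" path for JSON, the sketch below builds a POST to Solr's update handler. The host, port, collection name (demo) and document fields are assumptions for illustration, not values from the talk; the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_update_request(base_url, collection, docs, commit=True):
    """Build (but do not send) a POST to Solr's JSON update handler."""
    url = f"{base_url}/solr/{collection}/update"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical collection and document, for illustration only.
docs = [{"id": "doc-1", "title": "Extending Solr"}]
req = build_update_request("http://localhost:8983", "demo", docs)
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it to a running Solr instance.
```

Binary formats such as PDF or MS Office files take a different route, since their text has to be extracted before indexing.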
And then along came: Fusion
Solr is so powerful that it needs a front-end
- Administration
- Care and feeding
- Development: REST API w/security
- Allows for control of multiple external Solr Clouds
- Integrates with other open source projects
  - Spark
  - Banana (SiLK)
- First generation
Fusion and the search for World Domination
Lucidworks Fusion Is Search-Driven Everything
Fusion is built on three core principles:
• Drive next generation relevance via Content, Collaboration and Context
• Harness best in class Open Source: Apache Solr + Spark
• Simplify application development and reduce ongoing maintenance
Fusion Processes
Developer Fusion Deployment
[Diagram: the processes in a developer deployment, with their ports.]
• zk (9983)
• Solr (8983): 1 replica SolrCloud, embedded Zookeeper (shared with other components)
• ui: admin UI, authentication
• api (“backend”): aggregator, index pipelines, query pipelines, scheduler, collection management, recommender, system metrics, spark jobserver
• connectors (8984): data sources, index pipelines
• spark master (8766)
• spark worker (8769)
• The ui and api processes use ports 8764 and 8765.
Fusion Architecture
[Diagram: the components of the Fusion architecture.]
• REST API
• Admin UI
• Connectors for: database, web, file, logs, Hadoop, cloud
• Core Services: Alerting/Messaging, NLP, Pipelines, Blob Storage, Scheduling, Recommenders/…
• Apache Spark: workers and a cluster manager
• Apache Solr: shards
• HDFS (optional)
• Apache Zookeeper (ZK 1 … ZK N): shared config management, leader election, load balancing
• Security built in
Database
- JDBC
- CouchDB
- MongoDB
Filesystem
- Box
- Dropbox
- FTP
- GoogleDrive
- HDFS
- Local
- S3
- S3H
- SolrXML
- Windows Share
Fusion Connectors
Hadoop
- Apache Hadoop
- Cloudera
- Hortonworks
- MapR
- Pivotal
Logging
- Logstash
Social Media
- Jive
- Slack
- Twitter search
- Twitter streaming
Web
- Websphere
Repository
- Alfresco
- Azure blob
- Azure table
- Drupal
- GitHub
- JIRA
- Salesforce
- SharePoint
- ServiceNow
- Solr
- Subversion
- Zendesk
Push
- Content to a port
Script
- roll your own
Index Pipelines/stages
- Aggregating
- Apache Camel
- Apache Tika Parser
- CSV Parser
- Date Parsing
- Exclusion Filter
- Field Mapper
- Find Replace
- Fusion Pipeline
- Gazetteer Lookup Extractor
- HTML Transform
- Indexing RPC
- Javascript
- JDBC
- and others
Fusion Pipelines and Stages
Query Pipelines/stages
- Active Directory Security Trimming
- Advanced Boosting
- Aggregating
- Block Documents
- Boost Documents
- Facet
- Javascript
- JDBC
- Landing Pages
- Logging Query Stage
- Recommendation Boosting
- Return QueryParams Query Stage
- Rollup Aggregator Query Stage
- Search Fields Query Stage
- and others
Pipelines: preprocess incoming information in a predictable way
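One way to picture a pipeline is as an ordered list of stages, each taking a document and returning a changed document (or nothing, to drop it). The sketch below is plain Python, not Fusion's API; the stage functions only borrow the names Field Mapper, Find Replace and Exclusion Filter from the stage list, and their behavior here is an assumption for illustration.

```python
def field_mapper(mapping):
    """Stage that renames fields according to `mapping` (old name -> new name)."""
    def stage(doc):
        return {mapping.get(key, key): value for key, value in doc.items()}
    return stage


def find_replace(field, old, new):
    """Stage that replaces `old` with `new` inside one string field."""
    def stage(doc):
        out = dict(doc)
        if isinstance(out.get(field), str):
            out[field] = out[field].replace(old, new)
        return out
    return stage


def exclusion_filter(field, banned):
    """Stage that drops a document (returns None) when `field` has a banned value."""
    def stage(doc):
        return None if doc.get(field) in banned else doc
    return stage


def run_pipeline(stages, doc):
    """Apply the stages in order; stop early if a stage drops the document."""
    for stage in stages:
        if doc is None:
            break
        doc = stage(doc)
    return doc


pipeline = [
    field_mapper({"headline": "title"}),
    find_replace("title", "SOLR", "Solr"),
    exclusion_filter("status", {"draft"}),
]
kept = run_pipeline(pipeline, {"headline": "Extending SOLR", "status": "published"})
dropped = run_pipeline(pipeline, {"headline": "Notes", "status": "draft"})
print(kept, dropped)
```

Because every document passes through the same stages in the same order, the result is predictable, which is the point the slide makes.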
Fusion Signals
Signals are captured user events that tell us something about what the user is doing
- page views
- page pings
- clicked links
- custom configured events
Can be used to equate user behavior:
- at different times of day
- in different geographic locations
- during different weather conditions
Reference: http://www.slideshare.net/lucidworks/events-processing-and-data-analysis-with-lucidworks-fusion-presented-by-kiran-chitturi-lucidworks
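To make "captured user events" concrete, the sketch below assembles a small batch of signal events as JSON. The field names (type, timestamp, params, query, docId) are assumptions chosen for illustration and are not taken from the slides; check the Fusion signals documentation for the actual schema.

```python
import json
from datetime import datetime, timezone


def make_signal(event_type, query, doc_id=None, when=None):
    """Build one signal event as a plain dict (hypothetical field names)."""
    when = when or datetime.now(timezone.utc)
    params = {"query": query}
    if doc_id is not None:
        # Only events tied to a document (such as a click) carry a document id.
        params["docId"] = doc_id
    return {
        "type": event_type,
        "timestamp": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "params": params,
    }


fixed_time = datetime(2016, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
batch = [
    make_signal("pageview", "solr security", when=fixed_time),
    make_signal("click", "solr security", doc_id="doc-7", when=fixed_time),
]
payload = json.dumps(batch)
print(payload)
```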
Fusion Signals
Signals - data flow
[Diagram: JSON payloads and Snowplow payloads arrive at the Signals Service, which stores them in Solr.]
Collections shown in Solr:
• test: primary collection
• test_signals: raw signals collection
• test_signals_aggr: aggregated signals collection
Fusion Signals
Aggregations - data flow
[Diagram: an aggregation job running on Spark.]
• Components: Aggregator Spark Agent, Spark Driver, Spark (workers and a cluster manager)
• The aggregation job fetches raw signals for processing from the raw signals collection (test_signals)
• It stores aggregated results in the aggregated signals collection (test_signals_aggr)
• The primary collection (test) is also shown
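The aggregation step can be sketched in a few lines: read raw signal events, group them, and emit one aggregated document per group. This is plain single-process Python standing in for the Spark job, and the event fields (type, query, doc_id) and the count-only aggregate are assumptions for illustration.

```python
from collections import Counter


def aggregate_clicks(raw_signals):
    """Count click events per (query, doc_id) pair and emit one document per pair."""
    counts = Counter(
        (signal["query"], signal["doc_id"])
        for signal in raw_signals
        if signal.get("type") == "click"
    )
    return [
        {"query": query, "doc_id": doc_id, "count": count}
        for (query, doc_id), count in sorted(counts.items())
    ]


# Hypothetical raw signals, as they might sit in a raw signals collection.
raw = [
    {"type": "click", "query": "solr", "doc_id": "doc-7"},
    {"type": "click", "query": "solr", "doc_id": "doc-7"},
    {"type": "pageview", "query": "solr", "doc_id": "doc-7"},
    {"type": "click", "query": "solr", "doc_id": "doc-2"},
]
aggregated = aggregate_clicks(raw)
print(aggregated)
```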
Fusion Signals
Boosting search results using aggregated documents
[Diagram: the user app sends a search query through the query-pipeline stages and receives a search response.]
Query-pipeline stages shown include Set Params, Query Solr, and the Recommendation Stages.
The Recommendation Stages:
1. Query aggregated documents
2. Process results
3. Add parameters to the request
Collections shown:
• test: primary collection
• test_signals: raw signals collection
• test_signals_aggr: aggregated signals collection
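The three recommendation steps (query aggregated documents, process results, add parameters to the request) can be sketched as a function that turns aggregated click counts into Solr boost queries. The aggregated document fields (query, doc_id, count) are hypothetical, and bq is the boost-query parameter of Solr's dismax/edismax parsers, used here as one plausible way to apply a boost rather than as what Fusion does internally.

```python
def add_boost_params(params, aggregated_docs, query, top_n=3):
    """Return a copy of `params` with boost queries for the most-clicked documents."""
    # 1. Query aggregated documents: keep those recorded for this query.
    matches = [doc for doc in aggregated_docs if doc["query"] == query]
    # 2. Process results: strongest signal first, keep the top few.
    matches.sort(key=lambda doc: doc["count"], reverse=True)
    boosts = [f'id:"{doc["doc_id"]}"^{doc["count"]}' for doc in matches[:top_n]]
    # 3. Add parameters to the request.
    boosted = dict(params)
    if boosts:
        boosted["bq"] = boosts
    return boosted


# Hypothetical aggregated documents.
aggregated = [
    {"query": "solr", "doc_id": "doc-7", "count": 5},
    {"query": "solr", "doc_id": "doc-2", "count": 9},
    {"query": "spark", "doc_id": "doc-1", "count": 4},
]
boosted = add_boost_params({"q": "solr", "defType": "edismax"}, aggregated, "solr")
print(boosted["bq"])
```

Using the raw click count as the boost factor is the simplest possible weighting; a real deployment would likely normalize or decay it.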
Fusion Visualization
The Day After Tomorrow…
Major Themes
Apache Solr (6.0)
• Ease of Use
• Modern, consistent, “introspectable” APIs
• Scalability
• Cross Data Center Replication
• Performance improvements
• Analytics and Relevance
• SQL
• Graphs
Fusion
• Ease of Use
• Point and click Time Series indexing
• Relevance and Taxonomy Mgmt tools
• Indexing Previews
• Analytics and Relevance
• Query intent and related machine learning
• ZoomData integration
• Improved Spark-Solr data locality integration
• 10x performance improvement!
• Lucene analyzers for Spark data processing
• Easily and simply build and deploy Spark-based Machine Learning with minimal coding
• Leverage best in class libraries like MLLib, Mahout and DL4J
Simplicity on Top, Power Under the Hood
• Standalone Reference Search UI showcasing Fusion best practices (April/May)
• Signals, pipelines, auto-suggest, faceting, search, did you mean
• Built on AngularJS
• Performance improvements in pipelines (Fusion 2.3)
• 30-50% overall increase for all pipelines
• 3x improvement for pipelines using Javascript stages
• Improved Devops support (plugins, distributed coordination — 2.3 and 2.4)
• Monitoring, Server Management, Deployment
“Too much of a good thing can be wonderful.” - Mae West
You Have Questions…I might have a few answers
2016
OCTOBER 11-14
BOSTON, MA
CALL FOR PAPERS OPEN THROUGH APRIL 30!
lucenerevolution.org
Meetup Discount: 20% off Super Early Bird registration through April 30
- OR -
www.lucenesolr-revolution-2016.eventbrite.com/
Code: NESTMeetup0316
• Greatly simplify the care and feeding of time-based indexes
• Point and click (or single API call) creation of time series shards
• Total control over number of shards and replication
• Easily defined retention and archiving policies (e.g. 30 day retention)
• Intelligent query parsing optimizes shard access
• Ideal for log data and signals
Time Series Done Right (Fusion 2.4)
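Two of the ideas above, retention policies and query parsing that touches only the relevant shards, can be sketched with simple date arithmetic. The daily partitioning and the logs_YYYY_MM_DD naming are assumptions for illustration and do not describe how Fusion names or sizes its time series shards.

```python
from datetime import date, timedelta


def partition_name(prefix, day):
    """Name of the daily partition holding data for `day` (hypothetical scheme)."""
    return f"{prefix}_{day:%Y_%m_%d}"


def partitions_for_range(prefix, start, end):
    """Partitions a query over [start, end] must touch; all others are skipped."""
    span = (end - start).days
    return [partition_name(prefix, start + timedelta(days=i)) for i in range(span + 1)]


def expired_partitions(prefix, existing_days, today, retention_days=30):
    """Partitions older than the retention window, ready to archive or delete."""
    cutoff = today - timedelta(days=retention_days)
    return [partition_name(prefix, day) for day in sorted(existing_days) if day < cutoff]


touched = partitions_for_range("logs", date(2016, 3, 30), date(2016, 4, 1))
expired = expired_partitions(
    "logs",
    [date(2016, 3, 30), date(2016, 3, 31), date(2016, 4, 1)],
    today=date(2016, 4, 30),
)
print(touched, expired)
```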
Experiments >> Rules
Fusion 2.4 will support:
• Experiment management framework for large scale, multi-variate testing
• Bandit algorithms for high volume experimentation
• Capture and reporting of search (and other) metrics all from within Fusion
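The slides do not say which bandit algorithms are meant. As one common example, the sketch below implements epsilon-greedy: most of the time it serves the variant with the best observed success rate, and a small fraction of the time it explores a random one. The variant names and the notion of "success" (say, a click on a result) are assumptions for illustration.

```python
import random


class EpsilonGreedy:
    """Epsilon-greedy bandit over a fixed set of experiment variants."""

    def __init__(self, variants, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.trials = {variant: 0 for variant in variants}
        self.successes = {variant: 0 for variant in variants}

    def rate(self, variant):
        """Observed success rate; 0.0 for a variant that has not been tried."""
        tried = self.trials[variant]
        return self.successes[variant] / tried if tried else 0.0

    def choose(self):
        """Explore with probability epsilon, otherwise exploit the best variant."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.trials))
        return max(self.trials, key=self.rate)

    def record(self, variant, success):
        """Store the outcome of serving `variant` once."""
        self.trials[variant] += 1
        self.successes[variant] += int(success)


# Hypothetical experiment with two ranking variants and no exploration.
bandit = EpsilonGreedy(["ranking_a", "ranking_b"], epsilon=0.0)
for variant, success in [("ranking_a", True), ("ranking_a", False),
                         ("ranking_b", True), ("ranking_b", True)]:
    bandit.record(variant, success)
best = bandit.choose()
print(best, bandit.rate("ranking_a"), bandit.rate("ranking_b"))
```

Compared with a fixed split test, a bandit shifts traffic toward the better variant while the experiment is still running, which is why it suits high-volume experimentation.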