![Page 1: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/1.jpg)
Solr for Data Science
Scalable search and analytics in one
Grant Ingersoll, CTO: @gsingers
![Page 2: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/2.jpg)
![Page 4: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/4.jpg)
Solr in a nutshell
Solr is both established & growing
• 8M+ total downloads
• 250,000+ monthly downloads
• Largest community of developers.
• 2500+ open Solr jobs.
Solr is the most widely used search solution on the planet.
Lucidworks: Unmatched Solr expertise.
• 1/3 of the active committers
• 70% of the open source code is committed
Lucene/Solr Revolution: world’s largest open source user conference dedicated to Lucene/Solr.
Solr has tens of thousands of applications in production.
You use Solr every day.
![Page 5: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/5.jpg)
Solr’s Key Features
• Full text search (Info Retr.)
• Facets/Guided Nav galore!
• Lots of data types
• Spelling, auto-complete, highlighting
• Cursors
• More Like This
• De-duplication
• Apache Lucene
• Grouping and Joins
• Stats, expressions, transformations and more
• Lang. Detection
• Extensible
• Massive Scale/Fault tolerance
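Several of the features above (full text search, facets, highlighting, cursors) are switched on through plain request parameters. The following is a minimal sketch of how such a request might be assembled; the host, the `products` collection and the `title`, `category` and `author` fields are assumptions for illustration, not part of the talk.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

def build_select_url(base_url, collection, params):
    """Build a /select URL; doseq=True repeats multi-valued parameters."""
    return f"{base_url}/{collection}/select?" + urlencode(params, doseq=True)

params = {
    "q": "title:solr",                      # full text search (hypothetical field)
    "facet": "true",                        # turn faceting on
    "facet.field": ["category", "author"],  # one facet per field (hypothetical fields)
    "hl": "true",                           # highlighting
    "hl.fl": "title",
    "rows": 10,
    "sort": "score desc, id asc",           # cursors need a sort that ends on the unique key
    "cursorMark": "*",                      # "*" asks for the first page of a cursor
    "wt": "json",
}

# Assumed local Solr instance and collection name.
url = build_select_url("http://localhost:8983/solr", "products", params)

# Decode the query string again to check what would be sent.
decoded = parse_qs(urlsplit(url).query)
print(url)
print(decoded["facet.field"])
```

To page through results, the `nextCursorMark` value returned in each response is passed as `cursorMark` in the next request.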
![Page 6: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/6.jpg)
![Page 7: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/7.jpg)
It is increasingly important to know what is important!
Corollary: The faster you know what is important, the better
![Page 8: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/8.jpg)
Data Exploration
![Page 9: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/9.jpg)
SiLK
• Solr - Logstash - Kibana
• http://lucidworks.com/product/integrations/silk/
• Open source at:
  • https://github.com/LucidWorks/banana
  • https://github.com/LucidWorks/solrlogmanager
![Page 10: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/10.jpg)
![Page 11: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/11.jpg)
Feature Selection and Data Reduction
• Feature Selection
  • Analyzers for all types
  • Easily get weights for terms
  • Term Vectors
• Data Reduction
  • Filters
  • Analyzers
  • Data quality tools
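To make "weights for terms" concrete, here is a small stand-alone sketch of textbook tf-idf weighting followed by a top-k cut as a crude form of data reduction. It is not Lucene's exact scoring formula, and the whitespace tokenizer is only a stand-in for a real analyzer.

```python
import math
from collections import Counter

def tokenize(text):
    """Stand-in for an analyzer: lowercase and split on whitespace."""
    return text.lower().split()

def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * ln(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        weights.append({
            term: count * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return weights

def top_terms(doc_weights, k):
    """Data reduction: keep only the k highest-weighted terms of a document."""
    ranked = sorted(doc_weights.items(), key=lambda item: item[1], reverse=True)
    return [term for term, _ in ranked[:k]]

docs = ["solr search search", "solr analytics"]
weights = tfidf_weights(docs)
print(weights[0])
print(top_terms(weights[0], 1))
```

A term that occurs in every document ("solr" here) gets weight zero, which is why it drops out of the reduced feature set.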
![Page 12: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/12.jpg)
Classification and Clustering
• Quick and dirty:
  • kNN, others
• Carrot^2 integration for search result clustering
• Integration with Mahout
• Lucene provides Bayesian classifiers built on index
• Easily build training and test sets via filter queries
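One way the training/test split via filter queries could look is sketched below. It assumes each document was indexed with a hypothetical integer field `bucket` (0 to 9, for example a hash of the id) and a hypothetical `label` field; neither field name comes from the talk.

```python
def split_queries(bucket_field="bucket", n_buckets=10, train_buckets=8):
    """Return (train_params, test_params) that differ only in their bucket filter."""
    labeled_only = "label:[* TO *]"  # keep only documents that have a label
    train_fq = f"{bucket_field}:[0 TO {train_buckets - 1}]"
    test_fq = f"{bucket_field}:[{train_buckets} TO {n_buckets - 1}]"
    common = {"q": "*:*", "fl": "id,label,text", "rows": 1000}
    train = dict(common, fq=[labeled_only, train_fq])
    test = dict(common, fq=[labeled_only, test_fq])
    return train, test

train_params, test_params = split_queries()
print(train_params["fq"])
print(test_params["fq"])
```

Because the two range filters do not overlap, a document can never land in both sets, and changing `train_buckets` changes the split ratio without reindexing.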
![Page 13: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/13.jpg)
Math
• Built-in expressions, stats, function queries make custom ranking a snap!
• Search is essentially vector * matrix
• Lucene index is a ranking-optimized matrix
• More coming!
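The "vector * matrix" view can be illustrated with a toy example: a dense document-term matrix multiplied by a query vector, next to the same computation over an inverted index that only stores the non-zero cells. The vocabulary and weights are made up, and this is a simplified illustration, not how Lucene lays out its index.

```python
vocab = ["solr", "search", "data", "science"]

# Rows are documents, columns are terms, cells are (made-up) term weights.
doc_term = [
    [2, 1, 0, 0],
    [0, 0, 3, 1],
    [1, 0, 1, 1],
]

def score_dense(query_terms, matrix, vocab):
    """Matrix * query vector: one dot product per document."""
    query_vec = [1 if term in query_terms else 0 for term in vocab]
    return [sum(q * w for q, w in zip(query_vec, row)) for row in matrix]

def build_inverted_index(matrix, vocab):
    """Store only non-zero cells, keyed by term: {term: {doc_id: weight}}."""
    index = {term: {} for term in vocab}
    for doc_id, row in enumerate(matrix):
        for term, weight in zip(vocab, row):
            if weight:
                index[term][doc_id] = weight
    return index

def score_sparse(query_terms, index, n_docs):
    """Same product, but touching only the postings of the query terms."""
    scores = [0] * n_docs
    for term in query_terms:
        for doc_id, weight in index.get(term, {}).items():
            scores[doc_id] += weight
    return scores

query = ["solr", "data"]
dense = score_dense(query, doc_term, vocab)
sparse = score_sparse(query, build_inverted_index(doc_term, vocab), len(doc_term))
print(dense, sparse)
```

Both paths give the same scores; the sparse one skips every term the query does not mention, which is the sense in which an inverted index is a matrix optimized for ranking.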
![Page 14: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/14.jpg)
Signals power relevance
Clicks, tweets, ratings, locations and much more can all be leveraged to provide high quality recommendations to users and deeper insight for data scientists.
• Index Time Enrichment: Increase the findability of documents and records with automatic creation of tags, fields and meta-data
• Result Manipulation: Curate the user experience in your application with artificial result ranking, document injections and obfuscation
• Query Modification: Perform real-time decision making and routing in order to map a user's intention or enterprise policy
![Page 15: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/15.jpg)
Lucidworks Fusion
• http://www.lucidworks.com/products/fusion
• Ships with a built-in Solr-based recommender out of the box, and is easy to extend
• Demo: eCommerce data set
  • ~1.2M products
  • ~4M clicks
![Page 16: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/16.jpg)
Solr and Your Tools
• Data ingest:
  • JSON, CSV, XML, Rich types (PDF, etc.), custom
• Clients for Python, R, Java, .NET and more
  • http://cran.r-project.org/web/packages/solr/index.html, amongst others
• Output formats: JSON, CSV, XML, custom
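As a rough sketch of the ingest and output side from Python, the snippet below prepares a JSON update request and a query URL whose response format is chosen with `wt`. It uses only the standard library and does not send anything; the host and the `products` collection are assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumed local Solr instance and collection name.
SOLR = "http://localhost:8983/solr/products"

def json_update_request(docs):
    """Prepare (but do not send) a POST that indexes a list of JSON documents."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{SOLR}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def select_url(query, wt="json", rows=10):
    """Build a query URL; wt picks the output format (json, csv, xml, ...)."""
    return f"{SOLR}/select?" + urlencode({"q": query, "wt": wt, "rows": rows})

docs = [{"id": "1", "name": "widget"}, {"id": "2", "name": "gadget"}]
req = json_update_request(docs)
csv_url = select_url("*:*", wt="csv")
print(req.get_method(), req.full_url)
print(csv_url)
```

Passing `req` to `urllib.request.urlopen` would perform the update against a running Solr; the CSV URL can be handed directly to tools that read CSV over HTTP.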
![Page 17: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/17.jpg)
Lucene for Data Science
• Vector Space or Probabilistic, it’s your choice!
• Killer FST
• Wicked fast
• Pluggable compression, queries, indexing and more
• Advanced Similarity Models
  • Lang. Modeling, Divergence from Random, more
• Easy to plug in ranking
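To give a feel for the "vector space or probabilistic" choice, here is a sketch that compares a simple tf-idf style term weight with the textbook BM25 term score. These are the standard published formulas with commonly used default parameters (`k1=1.2`, `b=0.75`), not a copy of Lucene's implementation, and the collection statistics are made up.

```python
import math

def tfidf_term_score(tf, n_docs, doc_freq):
    """Simple vector-space style weight: grows linearly with term frequency."""
    return tf * math.log(n_docs / doc_freq)

def bm25_idf(n_docs, doc_freq):
    """Textbook BM25 inverse document frequency (always positive)."""
    return math.log(1.0 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25: term frequency saturates, long documents are penalized."""
    length_norm = k1 * (1.0 - b + b * doc_len / avg_doc_len)
    return bm25_idf(n_docs, doc_freq) * tf * (k1 + 1.0) / (tf + length_norm)

# Made-up statistics: 1000 documents, the term occurs in 50 of them.
N, DF, AVG_LEN = 1000, 50, 100
for tf in (1, 2, 10, 100):
    print(tf,
          round(tfidf_term_score(tf, N, DF), 3),
          round(bm25_term_score(tf, 100, AVG_LEN, N, DF), 3))
```

The tf-idf column keeps growing with the term count, while the BM25 column levels off below `idf * (k1 + 1)`; swapping one function for the other is the kind of change a pluggable similarity model allows.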
![Page 18: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/18.jpg)
But what about?
![Page 19: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/19.jpg)
What’s coming?
• More Facets/Stats
  • Combine pivots, ranges and stats
  • Percentiles via t-digest
  • HyperLogLog
• Deeper Spark integration for Solr
• Custom distributed computation and aggregations/maths
• Advanced schema on read options
• Time series? Trends? Anomaly Detection?
• Learn to rank?
![Page 20: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/20.jpg)
Lucidworks Open Source
• Logstash for Solr:
  • https://github.com/LucidWorks/solrlogmanager
• Banana (Kibana for Solr):
  • https://github.com/LucidWorks/banana
• Effortless AWS deployment and monitoring:
  • http://www.github.com/lucidworks/solr-scale-tk
• Data Quality Toolkit:
  • https://github.com/LucidWorks/data-quality
• Spark Integration:
  • https://github.com/LucidWorks/spark-solr
![Page 21: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/21.jpg)
Resources
• This code: http://github.com/lucidworks/solr-for-datascience
• Company: http://www.lucidworks.com
• Our blog: http://www.lucidworks.com/blog
• Book: http://www.manning.com/ingersoll
• Solr: http://lucene.apache.org/solr
• Fusion: http://www.lucidworks.com/products/fusion
• Twitter: @gsingers
![Page 22: Solr for Data Science](https://reader031.vdocument.in/reader031/viewer/2022020123/55a4e0831a28aba00e8b4729/html5/thumbnails/22.jpg)