Advanced Search with Solr & django-haystack
DESCRIPTION
Search and information discovery are a huge part of almost any modern site. Solr is a powerful search server that lets us quickly add advanced search capabilities such as full-text search, faceting, autocomplete and spelling suggestions to our projects without much effort. We will be using django-haystack to communicate between Django and Solr.
TRANSCRIPT
ADVANCED SEARCH WITH
SOLR + DJANGO-HAYSTACK
MARCEL CHASTAIN
LA DJANGO - 2014-09-30
WHAT WE’LL COVER
1. THE PITCH:
The Problem With Search
The Solution(s)
Overall Architecture of System with Django/Solr/Haystack
2. THE GOOD STUFF:
Indexing Data for Search
Querying the Search Index
Advanced Search Methods
Resources
THE PITCH
OR, “WHY ANY OF THIS MATTERS”
THE PROBLEM
1. Sites with stored information are ONLY as useful as they are at retrieving and displaying that information
2. Users have high expectations of search (thanks, Google)
• Spelling Suggestions
• Hit Highlighting
• “Related Searches”
• Distance/GeoSpatial Search
• Faceting
THE PROBLEM
3. Good search involves lots of challenges
• Stemming:
User Searches For                Word “Stem”
“argue”, “argues”, “argued”      “argu”
“argument”, “arguments”          “argument”
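To make the stemming idea concrete, here is a minimal sketch of a toy suffix-stripping stemmer feeding a tiny inverted index. It is an illustration only: the suffix list, the function names and the sample documents are all assumptions made for this example, and real analyzers such as the ones Solr ships are far more careful.

```python
# Toy suffix-stripping stemmer: an illustration only, not Solr's algorithm.
SUFFIXES = ("es", "ed", "s", "e")  # hypothetical rule list, checked in order


def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_index(documents):
    """Map each stem to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for token in text.split():
            index.setdefault(toy_stem(token), set()).add(doc_id)
    return index


# Hypothetical sample documents.
docs = {
    1: "they argued all night",
    2: "she argues well",
    3: "good arguments win",
}
index = build_index(docs)

# The query term is stemmed the same way as the indexed text, so a search
# for "argue" also finds "argued" and "argues".
print(sorted(index[toy_stem("argue")]))
print(sorted(index[toy_stem("argument")]))
```

The key design point is that the same stemming step runs at index time and at query time, so both sides meet on the stem rather than on the literal word.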
And more!
• Synonyms
• Acronyms
• Non-ASCII characters
• Stop words (“and”, “to”, “a”)
• Calculating relevance
• Performance with millions/billions(!) of documents
THE SOLUTION
“Information Retrieval Systems”, a.k.a. Search Engines
SOLR
THE BACKEND
WHAT IS SOLR?
Open-source enterprise search
Java-based
Created in 2004
Built on Apache Lucene
Most popular enterprise search engine
Apache 2.0 License
Built for millions or billions of documents
WHAT DOES IT DO?
• Full-text search
• Hit highlighting
• Faceted search
• Clustering/replication/sharding
• Database integration
• Rich document (word, pdf, etc) handling
• Geospatial search
• Spelling corrections/suggestions
• … loads and loads more
WHO USES SOLR?
HOW CAN WE USE IT WITH DJANGO?
Haystack
From the homepage:
(http://haystacksearch.org/)
LOOK FAMILIAR?
Query style
Declarative search index definitions
THE GOOD STUFF
INSTALLING, CONFIGURING & USING SOLR/HAYSTACK
WHO DOES WHAT
Solr:
• Provides API for submitting to & querying from index
• Stores actual index data
• Manages fields/data types in XML config (‘schema.xml’)
Haystack:
• Manages connection(s) to Solr
• Provides familiar API for querying
• Uses templates and declarative search index definitions
• Helps generate Solr XML config
• Management commands to index content
• Generic views/forms for common search use-cases
• Hooks into signals to keep data up-to-date
PART 1: LET’S MAKE AN INDEX
0. GITHUB REPO
git clone https://github.com/marcelchastain/haystackdemo
1. SETUP SOLR
(from github repo root)
./solr_download.sh
(or, manually)
wget http://apache.mirrors.pair.com/lucene/solr/4.10.1/solr-4.10.1.tgz
tar -xzvf solr-4.10.1.tgz
ln -s ./solr-4.10.1 ./solr
The one file to care about:
• solr/example/solr/collection1/conf/schema.xml
Stores field definitions and data types. Frequently updated during development
2. RUN SOLR
(from github repo root)
./solr_start.sh
(or, manually)
cd solr/example && java -jar start.jar
Requires Java 1.7+. To install on Debian/Ubuntu:
sudo apt-get install openjdk-7-jre-headless
3. INSTALL HAYSTACK
(CWD haystackdemo/)
apt-get install python-pip python-virtualenv
virtualenv env && source env/bin/activate
(from github repo root)
pip install -r requirements.txt
(or, manually)
pip install Django==1.6.7 django-haystack
4. HAYSTACK SETTINGS
INSTALLED_APPS = [
    # 'django.contrib.admin', etc.
    'haystack',
    # then your usual apps
    'myapp',
]
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr',
    },
}
HAYSTACK_SIGNAL_PROCESSOR = 'haystack.signals.RealtimeSignalProcessor'
5. THE MODEL(S)
6. SYNCDB & INITIAL DATA
(CWD haystackdemo/demo/)
./manage.py syncdb
./manage.py loaddata restaurants
7. DEFINE SEARCH INDEX
myapp/search_indexes.py
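The slide's code is not in the transcript, so the following is only a sketch of what a declarative index in myapp/search_indexes.py might look like. It follows the pattern in Haystack's tutorial and assumes a hypothetical Note model with user and pub_date attributes; the demo repo's actual model and field names may differ.

```python
# myapp/search_indexes.py (sketch; Note and its fields are assumptions)
from haystack import indexes

from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    # The single document=True field holds the main searchable text.
    # use_template=True renders it from a template such as
    # templates/search/indexes/myapp/note_text.txt.
    text = indexes.CharField(document=True, use_template=True)

    # Extra fields are stored alongside the document for filtering/ordering.
    author = indexes.CharField(model_attr='user')
    pub_date = indexes.DateTimeField(model_attr='pub_date')

    def get_model(self):
        return Note

    def index_queryset(self, using=None):
        # Which objects get indexed when the whole index is (re)built.
        return self.get_model().objects.all()
```

Haystack discovers search_indexes.py modules in installed apps, which is why the file name matters.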
7.5 BOOSTING FIELD RELEVANCE
Some fields are simply more relevant!
(Note: changes to field boosts require reindex)
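As a hedged sketch of field boosting, Haystack index fields accept a boost keyword argument. The Note model, the title field and the 1.125 value below are assumptions chosen for illustration, not the values used in the talk.

```python
# myapp/search_indexes.py (sketch; model, field and boost value are assumptions)
from haystack import indexes

from myapp.models import Note


class NoteIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, use_template=True)

    # A boost above 1.0 makes matches in this field count for more
    # when relevance is calculated.
    title = indexes.CharField(model_attr='title', boost=1.125)

    def get_model(self):
        return Note
```

Because the boost is applied when documents are indexed, changing it only takes effect after the content is indexed again.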
8. CREATE A TEMPLATE FOR INDEXED TEXT
templates/search/indexes/myapp/note_text.txt
9. UPDATE SOLR SCHEMA
(CWD: haystackdemo/demo/)
./manage.py build_solr_schema > ../solr/example/solr/collection1/conf/schema.xml
Which adds:
* Restart Solr for changes to take effect
10. REBUILD INDEX
(CWD haystackdemo/demo/)
$ ./manage.py update_index
Indexing 6 notes
PART 2: LET’S GET TO QUERYIN’
SIMPLE SEARCHQUERYSETS
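The slide's examples are missing from the transcript, so here is a rough sketch of simple SearchQuerySet usage from a Django shell. It assumes a configured project, a running Solr with indexed content, and a hypothetical Note model; the search terms are made up.

```python
# Run inside ./manage.py shell (sketch; Note and the terms are assumptions)
from haystack.query import SearchQuerySet

from myapp.models import Note

# Restrict results to one model; without .models() all indexed models match.
sqs = SearchQuerySet().models(Note)

# 'content' is a special lookup that targets the document=True field.
results = sqs.filter(content='taco')
print(results.count())

# auto_query() parses user-style input such as quoted phrases and -exclusions.
results = sqs.auto_query('taco -burrito')

for result in results[:5]:
    # Each result carries a relevance score and lazily loads the model
    # instance through .object.
    print('%s %s' % (result.score, result.object))
```

Like Django QuerySets, SearchQuerySets are chainable and lazy, so the backend is only queried when results are actually needed.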
GREAT, WHAT ABOUT FROM A BROWSER?
EASY MODE
urls.py
templates/search/search.html
Full-document search
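For the "easy mode" route, a minimal urls.py sketch could look like the following. It assumes the Django 1.6-era URL API that matches the version pinned earlier in the talk; the search/ prefix is an arbitrary choice.

```python
# urls.py (sketch for Django 1.6; the 'search/' prefix is an assumption)
from django.conf.urls import include, patterns, url

urlpatterns = patterns(
    '',
    # haystack.urls wires up Haystack's bundled search view, which renders
    # the search/search.html template.
    url(r'^search/', include('haystack.urls')),
)
```

With this in place, the bundled view expects a template at templates/search/search.html to display the form and results.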
HAYSTACK COMPONENTS TO EXTEND
• haystack.forms.SearchForm: Django form with an extendable .search() method. Define additional fields on the form, then incorporate them in the .search() method’s logic
• haystack.views.SearchView: class-based view made to be flexible for common search cases
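As a sketch of extending SearchForm in the way described above, the form below adds an optional date filter. The form name, the start_date field and the indexed pub_date field are assumptions for illustration; the pattern itself follows Haystack's documentation.

```python
# myapp/forms.py (sketch; names and the pub_date index field are assumptions)
from django import forms
from haystack.forms import SearchForm


class DateRangeSearchForm(SearchForm):
    # An extra, optional field on top of the built-in 'q' query field.
    start_date = forms.DateField(required=False)

    def search(self):
        # Let the parent class run the basic keyword search first.
        sqs = super(DateRangeSearchForm, self).search()

        if not self.is_valid():
            return self.no_query_found()

        # Narrow the results only when the user supplied a start date.
        if self.cleaned_data['start_date']:
            sqs = sqs.filter(pub_date__gte=self.cleaned_data['start_date'])

        return sqs
```

Keeping the extra logic inside .search() means the view and template stay unchanged apart from rendering the new field.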
PART 3: FEATURES
HIT HIGHLIGHTING
Instead of referring to a context variable directly, use the {% highlight %} tag
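To show roughly what hit highlighting produces, here is a toy, standalone sketch that wraps matched query terms in a span. It is not Haystack's implementation: the function name, the CSS class and the sample text are assumptions, and it skips extras such as excerpt truncation.

```python
import re
from html import escape


def highlight(text, query, css_class="highlighted"):
    """Wrap whole-word, case-insensitive query matches in a <span>."""
    terms = [re.escape(term) for term in query.split()]
    if not terms:
        return escape(text)

    pattern = re.compile(r"\b(%s)\b" % "|".join(terms), re.IGNORECASE)

    # With one capturing group, re.split() returns non-matching text at even
    # indexes and the matched terms at odd indexes.
    pieces = pattern.split(text)
    out = []
    for position, piece in enumerate(pieces):
        if position % 2:
            out.append('<span class="%s">%s</span>' % (css_class, escape(piece)))
        else:
            out.append(escape(piece))
    return "".join(out)


# Hypothetical sample text and query.
print(highlight("Tasty tacos & more", "tacos"))
```

Escaping each piece separately keeps user-supplied text from injecting markup while still letting the highlight span through.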
SPELLING SUGGESTIONS
Update connection’s settings dictionary + reindex
Use spelling_suggestion() method
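A sketch of the settings change, assuming the same connection shown in the Haystack settings step with the INCLUDE_SPELLING option switched on:

```python
# settings.py (sketch; same connection as before, with spelling enabled)
HAYSTACK_CONNECTIONS = {
    'default': {
        'ENGINE': 'haystack.backends.solr_backend.SolrEngine',
        'URL': 'http://127.0.0.1:8983/solr',
        # Ask Haystack to build and request spelling data.
        'INCLUDE_SPELLING': True,
    },
}
```

After reindexing, something like SearchQuerySet().auto_query('restarant').spelling_suggestion() should return a suggested correction, or None when there is nothing to suggest; the misspelled term here is made up. Depending on the Solr setup, a spellcheck component may also need to be enabled in solrconfig.xml.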
AUTOCOMPLETE
Create another search index field using EdgeNgramField + reindex
Use the .autocomplete() method on a SearchQuerySet
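To illustrate why an edge n-gram field makes autocomplete work, here is a standalone toy sketch that indexes every leading prefix of each word. The function names, the minimum and maximum gram sizes, and the restaurant names are assumptions for this example; Solr does the equivalent work inside its analysis chain.

```python
def edge_ngrams(word, min_size=2, max_size=15):
    """Return the leading prefixes of a word, e.g. 'taco' -> ta, tac, taco."""
    word = word.lower()
    upper = min(len(word), max_size)
    return [word[:size] for size in range(min_size, upper + 1)]


def build_autocomplete_index(titles):
    """Map every edge n-gram of every word to the titles containing it."""
    index = {}
    for title in titles:
        for token in title.split():
            for gram in edge_ngrams(token):
                index.setdefault(gram, set()).add(title)
    return index


def autocomplete(index, prefix):
    """Look up a typed prefix; a plain dictionary hit, no scanning needed."""
    return sorted(index.get(prefix.lower(), set()))


# Hypothetical restaurant names.
index = build_autocomplete_index(["Taco Town", "Tasty Thai", "Burger Barn"])
print(autocomplete(index, "ta"))
print(autocomplete(index, "tac"))
```

The trade-off is a larger index in exchange for prefix lookups that are as cheap as exact matches. In Haystack, the corresponding pieces would be an EdgeNgramField on the index and a call such as SearchQuerySet().autocomplete(content_auto='ta'), where content_auto is a hypothetical field name.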
FACETING
Add faceting to search index definition
Regenerate schema.xml and reindex content
./manage.py build_solr_schema > ../solr/example/solr/collection1/conf/schema.xml
./manage.py update_index
FACETING
From a shell:
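The shell session from the slide is not in the transcript. As a rough sketch, assuming a running Solr, rebuilt schema and index, and a hypothetical author field declared with faceted=True on the search index, a faceting query could look like this:

```python
# Run inside ./manage.py shell (sketch; the 'author' facet is an assumption)
from haystack.query import SearchQuerySet

# Ask the backend to count results per distinct value of the facet field.
sqs = SearchQuerySet().facet('author')

# facet_counts() returns a dict with 'fields', 'dates' and 'queries' keys;
# each field maps to a list of (value, count) pairs.
counts = sqs.facet_counts()
for value, count in counts['fields']['author']:
    print('%s: %d' % (value, count))
```

These counts are what typically drive the "filter by" sidebars seen on search result pages.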
RESOURCES
LET’S SAVE YOU A GOOGLE TRIP
Solr in Action ($45), Apr 2014
Haystack Documentation: http://django-haystack.readthedocs.org/
IRC (freenode): #django, #haystack, #solr