scaling search to a million pages with solr, python, and django
DESCRIPTION
A talk given to DJUGL on the 26th July 2010, describing and introducing Solr, and discussing how we use it at Timetric to drive navigation across over a million dataseries.
TRANSCRIPT
Scaling search to a million pages with Solr, Django and Python
Toby @tow21
1,079,446!!!
Data store
Big Bad Web
Django
Key-Value Store
Filesystem, Berkeley DB
MySQL
} unstructured
structured-
Foreign Key (RDBMS)
SQLite, MySQL, Postgres, Oracle...
related content through JOINs
over structured data
Search Engines
Solr (Lucene), Xapian, (Whoosh)
Denormalized, Inverted Index
over unstructured/semi-structured data
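To make "inverted index" concrete, here is a minimal sketch in plain Python. It is not how Lucene stores its index; the sample titles and the function names `build_inverted_index` and `search` are made up for illustration. The idea it shows is only the core one: map each term to the set of documents that contain it, then answer a query by intersecting those sets.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lower-cased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, *terms):
    """Return ids of documents containing every term (a simple AND query)."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical sample titles, not taken from the talk.
docs = {
    1: "unemployment rate by gender",
    2: "inflation rate by region",
    3: "unemployment benefits by region",
}
index = build_inverted_index(docs)
print(sorted(search(index, "unemployment", "rate")))
```

A lookup touches only the postings for the query terms, which is why this shape suits search over loosely structured text better than scanning rows.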
Other routes to full-text search
http://www.postgresql.org/docs/8.4/static/textsearch.html
http://code.google.com/p/djangosearch/
http://www.sphinxsearch.com/
Solr: HTTP interface to Lucene
Lucene written by Doug Cutting (HADOOP), first release 2001.
Solr in-house CNET project, open-sourced in 2006
Solr + Lucene merged in March 2010
Solr 1.4, Lucene 3.0 released November 2009
Next version - 1.5/3.1/4.0 - not for production use yet.
Solr Index
composed of Documents
ALL DOCUMENTS HAVE THE SAME STRUCTURE
RDBMS Table
composed of Rows
• Optional columns
• Denormalized data
RDBMS tables (diagram)
Person: First name, Last name
Book: Title, ISBN
Magazine: Title, ISSN, Publication Frequency
Links to Person: Author (FK Person), Editor (FK Person), Contributor (M2M Person)
Solr Document (diagram)
Fields: Identifier, Entity type, Title, Pub. Frequency, Associated name, Default Search
Field options: uniqueKey, required, multiValued, default
copyField: Title and Associated Name into Default Search
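A rough sketch of what the denormalization step could look like in Python. The document field names follow the slide (Identifier, Entity type, Title, Associated name, Default Search), but the snake_case spellings, the sample records, and the helper `book_to_document` are assumptions made for illustration. In a real setup the copy into the default search field would be done by Solr's copyField rather than by client code.

```python
# Hypothetical relational-style records; names and values are made up.
people = {
    1: {"first_name": "Ann", "last_name": "Author"},
    2: {"first_name": "Carl", "last_name": "Contributor"},
}
book = {
    "id": 42,
    "title": "An Example Book",
    "author_id": 1,            # foreign key to a person
    "contributor_ids": [2],    # many-to-many link to people
}


def full_name(person):
    return f"{person['first_name']} {person['last_name']}"


def book_to_document(book, people):
    """Flatten a book and its related people into one flat document."""
    names = [full_name(people[book["author_id"]])]
    names += [full_name(people[pid]) for pid in book["contributor_ids"]]
    return {
        "identifier": f"book-{book['id']}",  # would serve as the uniqueKey
        "entity_type": "Book",
        "title": book["title"],
        "associated_name": names,            # multi-valued field
        # Imitates copyField: title and names gathered into one search field.
        "default_search": [book["title"]] + names,
    }


doc = book_to_document(book, people)
print(doc)
```

The JOINs happen once, at indexing time, so a query never has to follow a foreign key.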
There is no update, only overwrite!!!
Solar Enterprise
Search Server
Book
Identifier
Pub. Freq.
David Smiley, Eric Pugh
Solr 1.4 Enterprise
Search Server
Book
Identifier
Pub. Freq.
David Smiley, Eric Pugh
Solr can't overwrite without a uniqueKey
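The overwrite behaviour can be imitated with a dictionary keyed on the uniqueKey field. This is a toy model, not Solr code: `ToyIndex` and the `identifier` value `book-1` are invented for illustration. It shows that re-adding a document with the same key replaces the whole document, so any field left out of the new version is gone.

```python
class ToyIndex:
    """In-memory stand-in for an index keyed on a uniqueKey field."""

    def __init__(self, unique_key="identifier"):
        self.unique_key = unique_key
        self.docs = {}

    def add(self, doc):
        # Same key: the stored document is replaced wholesale, not merged.
        self.docs[doc[self.unique_key]] = dict(doc)


index = ToyIndex()
index.add({
    "identifier": "book-1",
    "title": "Solar Enterprise Search Server",
    "entity_type": "Book",
})
# Correct the title by sending the document again under the same key.
index.add({
    "identifier": "book-1",
    "title": "Solr 1.4 Enterprise Search Server",
})
print(index.docs["book-1"])
```

Note that `entity_type` has disappeared after the second add: to change one field you have to resend every field.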
<field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false"/>
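To show how the attributes on that field definition read, here is a small Python sketch that parses the line and checks a document against it. Solr performs its own validation on the server; `parse_field` and `check_document` are hypothetical client-side helpers, and the defaults assumed for missing attributes are a simplification.

```python
import xml.etree.ElementTree as ET

FIELD_XML = (
    '<field name="title" type="text" indexed="true" stored="true" '
    'required="true" multiValued="false"/>'
)


def parse_field(xml_text):
    """Pull the options we care about out of one <field/> definition."""
    attrs = ET.fromstring(xml_text).attrib
    return {
        "name": attrs["name"],
        "type": attrs["type"],
        "required": attrs.get("required", "false") == "true",
        "multi_valued": attrs.get("multiValued", "false") == "true",
    }


def check_document(doc, field):
    """Return a list of problems with doc for this one field."""
    problems = []
    value = doc.get(field["name"])
    if value is None:
        if field["required"]:
            problems.append(f"missing required field {field['name']!r}")
    elif isinstance(value, (list, tuple)) and not field["multi_valued"]:
        problems.append(f"field {field['name']!r} is not multiValued")
    return problems


title_field = parse_field(FIELD_XML)
print(check_document({"identifier": "book-1"}, title_field))
```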
Schema design
What do you want to search on?
What do you want to do with results?
╳query
text, int, long, float, double, date
Solr
Ingest (HTTP): <xml>, csv
Output (HTTP): <xml>, {json}, exec. python
Query: URL-escaped Lucene query syntax (yuck)
GET http://localhost:8983/solr/select/?q=searchterm
GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
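For a sense of where URLs like this come from, the sketch below builds a simplified version with the standard library's `urllib.parse.urlencode`. The parameter names (`q`, `fq`, `rows`, `start`, `facet`, `facet.field`) are the ones visible in the request above; the query itself is cut down to two clauses for readability.

```python
from urllib.parse import urlencode

BASE_URL = "http://localhost:8983/solr/select/"

# A list of pairs keeps the parameter order stable.
params = [
    ("q", 'tags:"ons:dataseries-fullid=YBUKQA" AND tags:"united kingdom"'),
    ("fq", "private:false"),
    ("facet", "true"),
    ("facet.field", "tags"),
    ("rows", 20),
    ("start", 0),
]
url = BASE_URL + "?" + urlencode(params)
print(url)
```

Every colon, quote and space in the Lucene query has to be escaped, which is what makes hand-written requests so unpleasant to read.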
Need ORM equivalent (OIM?)
http://haystacksearch.org/
(cleaves close to Django, not schema-driven)
http://timetric.com/about/opensource/#sunburnt
Sunburnt:
http://github.com/tow/sunburnt
GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
solr.query(tags="ons:dataseries-fullid=YBUKQA")\
    .query(tags="united kingdom")\
    .filter(private=False)\
    .boost_relevancy(100, is_index=True)\
    .facet_by("tags", mincount=1, limit=20)\
    .paginate(rows=20)
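As a rough illustration of how a chainable interface of this kind can turn method calls into request parameters, here is a toy builder. It is emphatically not sunburnt's implementation: `ToyQuery` is a hypothetical class, it covers only `query`, `filter` and `paginate`, and it does no escaping of special characters inside values.

```python
from urllib.parse import urlencode


class ToyQuery:
    """Hypothetical chainable builder; each call returns a new object."""

    def __init__(self, terms=(), filters=(), rows=None):
        self.terms = tuple(terms)
        self.filters = tuple(filters)
        self.rows = rows

    @staticmethod
    def _clauses(kwargs):
        clauses = []
        for field, value in sorted(kwargs.items()):
            if isinstance(value, bool):
                clauses.append(f"{field}:{str(value).lower()}")
            else:
                clauses.append(f'{field}:"{value}"')
        return tuple(clauses)

    def query(self, **kwargs):
        return ToyQuery(self.terms + self._clauses(kwargs), self.filters, self.rows)

    def filter(self, **kwargs):
        return ToyQuery(self.terms, self.filters + self._clauses(kwargs), self.rows)

    def paginate(self, rows):
        return ToyQuery(self.terms, self.filters, rows)

    def params(self):
        """Render the accumulated state as a URL-encoded query string."""
        pairs = [("q", " AND ".join(self.terms) or "*:*")]
        pairs += [("fq", clause) for clause in self.filters]
        if self.rows is not None:
            pairs.append(("rows", self.rows))
        return urlencode(pairs)


toy = (
    ToyQuery()
    .query(tags="ons:dataseries-fullid=YBUKQA")
    .query(tags="united kingdom")
    .filter(private=False)
    .paginate(rows=20)
)
print(toy.params())
```

Returning a fresh object from every call means a partially built query can be reused as the base for several variants.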
Faceting
MoreLikeThis
Highlighting
Pagination
Sorting
http://wiki.apache.org/solr/FrontPage
http://packtpub.com/solr-1-4-enterprise-search-server
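Of the features listed, faceting is the one that drives navigation, so a small sketch of what a facet count is may help. This is plain Python over an in-memory list, not a Solr call; `facet_counts` and the sample tags are made up, with `mincount` and `limit` named after the `facet_by` arguments shown earlier.

```python
from collections import Counter


def facet_counts(docs, field, mincount=1, limit=20):
    """Count how many matching documents carry each value of a field."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.get(field, []))
    return [
        (value, count)
        for value, count in counts.most_common(limit)
        if count >= mincount
    ]


# Hypothetical search results with a multi-valued "tags" field.
results = [
    {"tags": ["uk", "crime"]},
    {"tags": ["uk", "crime"]},
    {"tags": ["belgium", "uk"]},
]
print(facet_counts(results, "tags", mincount=2))
```

Each returned value can be offered to the user as a link that narrows the current result set.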
Scaling to a million pages ...
- talk to the Guardian (Content API)
Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks
Decouple read/write
Separate processes - many readers, single write pipeline. Beware multiple writers!
Remember standard DB practice: write to master, read from slave.
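The single write pipeline can be pictured with a queue feeding one writer. The sketch below is an in-process analogy using threads and a dictionary, not a Solr deployment (where readers and the writer would be separate processes or servers); `writer_loop`, `read` and the sample documents are hypothetical.

```python
import queue
import threading


def writer_loop(pending, index, lock):
    """The only code path that mutates the index."""
    while True:
        doc = pending.get()
        if doc is None:  # sentinel value: shut the writer down
            break
        with lock:
            index[doc["identifier"]] = doc


def read(index, lock, identifier):
    """Readers never write; any number of them may call this."""
    with lock:
        return index.get(identifier)


pending = queue.Queue()
index, lock = {}, threading.Lock()
writer = threading.Thread(target=writer_loop, args=(pending, index, lock))
writer.start()

# Producers only enqueue; they never touch the index themselves.
for n in range(3):
    pending.put({"identifier": f"doc-{n}", "title": f"Title {n}"})
pending.put(None)
writer.join()

print(read(index, lock, "doc-1"))
```

Funnelling every change through one queue gives a single place to batch, order and retry writes.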
Index
Add documents
Commit
Optimize
Warm up facet cache
Fast Index
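The add, commit and optimize steps correspond to small XML messages posted to Solr's update handler (conventionally `/solr/update`). The sketch below only builds those messages with the standard library and sends nothing; `add_message` is a hypothetical helper and the sample document is made up.

```python
import xml.etree.ElementTree as ET


def add_message(docs):
    """Build an <add> message; list values become repeated fields."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, (list, tuple)) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


COMMIT = "<commit/>"      # makes added documents visible to searches
OPTIMIZE = "<optimize/>"  # merges index segments; comparatively expensive

message = add_message([{"identifier": "book-1", "title": "Solr"}])
print(message)
```

A common pattern is to post many documents per add, commit once per batch, and optimize only occasionally.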
"UK crime: Betting, gaming and lotteries (year ending 5th April)"
Betting
Tokenizer
bet
Analyzer (Porter stemmer)
Belgium, Unemployment rate by gender, Total (BE,T)
BE,T
Tokenizer (whitespace)
Tokenizer (character filter)
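A toy version of the two examples above, in plain Python rather than Solr analyzers. The assumption here is that the point of the second example is keeping a code such as "BE,T" searchable as one token. `toy_stem` is a crude suffix rule standing in for the Porter stemmer, not an implementation of it, and all function names are invented.

```python
import re


def word_tokenize(text):
    """Split on anything that is not a word character."""
    return re.findall(r"\w+", text.lower())


def whitespace_tokenize(text, strip_chars="(),"):
    """Split on whitespace only, trimming brackets and commas at the edges."""
    tokens = []
    for raw in text.split():
        token = raw.strip(strip_chars).lower()
        if token:
            tokens.append(token)
    return tokens


def toy_stem(token):
    """Crude stand-in for a stemmer: drop "ing" and undouble a final letter."""
    if token.endswith("ing") and len(token) > 5:
        stem = token[:-3]
        if len(stem) >= 2 and stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    return token


title = "Belgium, Unemployment rate by gender, Total (BE,T)"
print(word_tokenize(title))
print(whitespace_tokenize(title))
print(toy_stem("betting"))
```

With the word-based tokenizer the code falls apart into "be" and "t"; splitting on whitespace and trimming only the edges keeps "be,t" intact.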
Understand Solr schemas - build one for your data.
how do you want to query?
how do you want to show results?
Understand Solr architecture - build around your data-flow.
how/when do you want to read/write?
what shape/characteristics does your corpus have?
In the small
In the large