Scaling search to a million pages with Solr, Django and Python

Toby [email protected] @tow21

1,079,446!!!

Data store

Big Bad Web

Django

Key-Value Store

Filesystem, Berkeley DB

MySQL

unstructured

structured

Foreign Key (RDBMS)

SQLite, MySQL, Postgres, Oracle, ...

related content through JOINs

over structured data

Search Engines

Solr (Lucene), Xapian, (Whoosh)

Denormalized, Inverted Index

over unstructured/semi-structured data
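As a rough illustration of what an inverted index is, here is a minimal sketch in plain Python. It is a toy, not how Lucene stores its index: the function names and the three sample documents are made up for the example, and tokenizing is just a lowercase whitespace split.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercased term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample corpus.
documents = {
    1: "Solr 1.4 Enterprise Search Server",
    2: "Scaling search with Solr and Django",
    3: "Unemployment rate in Belgium",
}
index = build_inverted_index(documents)
matches = search(index, "solr search")
```

The lookup goes from term to documents rather than from document to terms, which is why a search engine can answer a text query without scanning every row.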

Other routes to full-text search

http://www.postgresql.org/docs/8.4/static/textsearch.html

http://code.google.com/p/djangosearch/

http://www.sphinxsearch.com/

Solr: HTTP interface to Lucene

Lucene written by Doug Cutting (Hadoop), first release 2001.

Solr in-house CNET project, open-sourced in 2006

Solr + Lucene merged in March 2010

Solr 1.4, Lucene 3.0 released November 2009

Next version - 1.5/3.1/4.0 - not for production use yet.

Solr: Index, composed of Documents

RDBMS: Table, composed of Rows

ALL DOCUMENTS HAVE THE SAME STRUCTURE

• Optional columns
• Denormalized data

[Diagram: relational model]

Person: First name, Last name

Magazine: Title, ISSN, Publication Frequency

Book: Title, ISBN

Links to Person: Author (FK Person), Editor (FK Person), Contributor (M2M Person)

[Diagram: the same data as a single Solr Document]

Fields: Identifier, Entity type, Title, Associated name, Pub. Frequency, Default Search

Field options: uniqueKey, required, multiValued, default

copyField: Title, Associated Name → Default Search

There is no update, only overwrite!!!

[Diagram: overwriting a document]

Before: Book, Identifier, Pub. Freq., "Solar Enterprise Search Server", David Smiley, Eric Pugh

After: Book, Identifier, Pub. Freq., "Solr 1.4 Enterprise Search Server", David Smiley, Eric Pugh

Solr can't overwrite without a uniqueKey
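The overwrite rule can be modelled in a few lines of Python. This is a toy sketch of the semantics only, not Solr's storage; the `ToyIndex` class, the field names and the `book-1` identifier are invented for illustration.

```python
class ToyIndex:
    """Toy model of add-as-overwrite semantics; not how Solr stores documents."""

    def __init__(self, unique_key=None):
        self.unique_key = unique_key
        self.docs = []

    def add(self, doc):
        if self.unique_key is not None:
            key = doc[self.unique_key]
            # Whole-document replace: drop any existing document with this key.
            self.docs = [d for d in self.docs if d[self.unique_key] != key]
        self.docs.append(dict(doc))


first = {
    "identifier": "book-1",  # hypothetical identifier
    "title": "Solar Enterprise Search Server",
    "associated_name": ["David Smiley", "Eric Pugh"],
}
# The correction only carries the title; nothing is merged from the old document.
correction = {"identifier": "book-1", "title": "Solr 1.4 Enterprise Search Server"}

keyed = ToyIndex(unique_key="identifier")
keyed.add(first)
keyed.add(correction)

unkeyed = ToyIndex()
unkeyed.add(first)
unkeyed.add(correction)
```

In the keyed index the second add replaces the first, and the `associated_name` field is gone because it was not resent. In the unkeyed index both documents sit side by side, typo included.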

<field name="title" type="text" indexed="true" stored="true" required="true" multiValued="false"/>

Schema design

What do you want to search on?

What do you want to do with results?

╳query

text, int, long, float, double, date

Solr

Ingest (HTTP): <xml>, csv

Output (HTTP): <xml>, {json}, exec. python

Query: URL-escaped Lucene query syntax (yuck)

GET http://localhost:8983/solr/select/?q=searchterm

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+NOT+is_mapreduce%3Atrue%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29
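To see where the escaping comes from, here is a minimal sketch that builds a similar request URL by hand with the Python standard library. The parameter names are the ones in the URLs above; the query string is a shortened, made-up variant, and nothing is actually sent to a server.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "http://localhost:8983/solr/select/"

# Lucene query syntax goes in unescaped; urlencode handles the escaping.
params = [
    ("q", 'tags:"united kingdom" AND NOT is_mapreduce:true'),
    ("fq", "private:false"),
    ("facet", "true"),
    ("facet.field", "tags"),
    ("f.tags.facet.limit", 20),
    ("f.tags.facet.mincount", 1),
    ("start", 0),
    ("rows", 20),
]
url = BASE_URL + "?" + urlencode(params)

# Round-trip check: decoding the query string gives back the original query.
decoded = parse_qs(urlsplit(url).query)
```

Every colon, quote and space in the query has to be escaped, and the caller still has to get the Lucene syntax itself right, which is the motivation for a higher-level query API.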

Need ORM equivalent (OIM?)

Haystack: http://haystacksearch.org/
(cleaves close to Django, not schema-driven)

Sunburnt:
http://timetric.com/about/opensource/#sunburnt
http://github.com/tow/sunburnt

GET http://localhost:8983/solr/current/select/?fq=private%3Afalse&rows=20&facet.field=tags&f.tags.facet.limit=20&f.tags.facet.mincount=1&facet=true&start=0&q=%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22%29+OR+%28%28tags%3A%22ons%3Adataseries-fullid%3DYBUKQA%22+AND+tags%3A%22united+kingdom%22+AND+is_index%3Atrue%5E100%29

solr.query(tags="ons:dataseries-fullid=YBUKQA")\
    .query(tags="united kingdom")\
    .filter(private=False)\
    .boost_relevancy(100, is_index=True)\
    .facet_by("tags", mincount=1, limit=20)\
    .paginate(rows=20)

Faceting, MoreLikeThis, Highlighting, Pagination, Sorting

http://wiki.apache.org/solr/FrontPage

http://packtpub.com/solr-1-4-enterprise-search-server

Scaling to a million pages ...

- talk to the Guardian (Content API)

Decouple read/write
Re-indexing/optimizing strategies
FieldType/Analyzer/Tokenizer tweaks

Decouple read/write

Separate processes - many readers, single write pipeline. Beware multiple writers!

Remember standard DB practice: write to master, read from slave.
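One way to get a single write pipeline is to have every producer put documents on a queue and let exactly one worker drain it. The sketch below shows that shape with the standard library only. `FakeIndex` is a hypothetical in-memory stand-in for a real Solr connection, and the batch size of 100 is an arbitrary assumption.

```python
import queue
import threading


class FakeIndex:
    """In-memory stand-in for a Solr connection; records adds and commits."""

    def __init__(self):
        self.pending = []
        self.committed = []
        self.commits = 0

    def add(self, doc):
        self.pending.append(doc)

    def commit(self):
        self.committed.extend(self.pending)
        self.pending = []
        self.commits += 1


STOP = object()  # sentinel telling the writer to finish


def writer_loop(index, work_queue, batch_size=100):
    """The only code path that writes: drain the queue, commit once per batch."""
    in_batch = 0
    while True:
        doc = work_queue.get()
        if doc is STOP:
            break
        index.add(doc)
        in_batch += 1
        if in_batch >= batch_size:
            index.commit()
            in_batch = 0
    if in_batch:
        index.commit()  # flush the final partial batch


def run_pipeline(docs_per_producer, producers=4, batch_size=100):
    index = FakeIndex()
    work_queue = queue.Queue()
    writer = threading.Thread(target=writer_loop, args=(index, work_queue, batch_size))
    writer.start()

    def produce(worker_id):
        # Producers never touch the index, only the queue.
        for n in range(docs_per_producer):
            work_queue.put({"identifier": f"{worker_id}-{n}"})

    threads = [threading.Thread(target=produce, args=(i,)) for i in range(producers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    work_queue.put(STOP)
    writer.join()
    return index


index = run_pipeline(docs_per_producer=250)
```

Commits happen per batch rather than per document, and because only the writer thread calls `add` and `commit`, the producers cannot interleave writes.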

[Diagram: index lifecycle. Add documents, Commit, Optimize (fast index), Warm up facet cache.]

"UK crime: Betting, gaming and lotteries (year ending 5th April)"

Betting → Tokenizer → Analyzer (Porter stemmer) → bet
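The effect of tokenizing and then stemming can be sketched in plain Python. The `crude_stem` function below is a deliberately rough stand-in that only strips "-ing" and undoubles a trailing consonant; it is not the Porter algorithm and will disagree with a real stemmer on many words.

```python
import re

TITLE = "UK crime: Betting, gaming and lotteries (year ending 5th April)"


def tokenize(text):
    """Lowercase, then split on anything that is not a letter or digit."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text.lower()) if t]


def crude_stem(token):
    """Very rough suffix stripping; a stand-in for a real Porter stemmer."""
    if token.endswith("ing") and len(token) > 5:
        token = token[:-3]
        # Undouble a trailing consonant: "bett" -> "bet".
        if len(token) > 2 and token[-1] == token[-2] and token[-1] not in "aeiou":
            token = token[:-1]
    return token


indexed_terms = [crude_stem(t) for t in tokenize(TITLE)]
query_term = crude_stem("bet")
```

Because the same analysis is applied at index time and at query time, a search for "bet" and a title containing "Betting" end up as the same term.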

Belgium, Unemployment rate by gender, Total (BE,T)

BE,T: Tokenizer (whitespace), Tokenizer (character filter)
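One reading of this slide is that a code such as "BE,T" must survive as a single token. The sketch below contrasts a split on every non-alphanumeric character with a whitespace split preceded by a character filter that removes parentheses. Both functions are illustrative, and trimming the trailing commas is an extra assumption of this sketch, not something the slide states.

```python
import re

TITLE = "Belgium, Unemployment rate by gender, Total (BE,T)"


def word_tokens(text):
    """Split on every non-alphanumeric character; breaks the code apart."""
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def whitespace_tokens(text, strip_chars="()"):
    """Remove strip_chars first (character filter), then split on whitespace."""
    filtered = text.translate(str.maketrans("", "", strip_chars))
    # Trim commas at token edges only, so the comma inside "BE,T" is kept.
    return [t.strip(",") for t in filtered.split() if t.strip(",")]


split_everything = word_tokens(TITLE)
keep_codes = whitespace_tokens(TITLE)
```

With the first tokenizer a query for "BE,T" has nothing to match, since the index holds only "BE" and "T". With the second the code is indexed whole.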

http://timetric.com/series/KYdGLiJ9T5m-7RjGwjGB3w/

http://timetric.com/index/unemployment-rate-belgium-eurostat/

In the small

Understand Solr schemas - build one for your data.
- how do you want to query?
- how do you want to show results?

In the large

Understand Solr architecture - build around your data-flow.
- how/when do you want to read/write?
- what shape/characteristics does your corpus have?

Thanks for listening!

questions welcome ...

[email protected] @tow21
