Empowering EPrints Search with Xapian
Sébastien François, EPrints Lead Developer
EPrints Developer Powwow, ULCC
Review of EPrints Internal Search
Indexing
Searching
Extras
TO-DO’s
Using & contributing
Demo(s)
Summary
EPrints “Internal” Search - Overview
[Class diagram: Search, Field, DataSet, MetaField, Condition, List; multiplicities 1 and 1..n]
match = “EX” queries the main & auxiliary dataset tables
match = “IN” queries the __rindex dataset table
ordering is done via the __ordervalues_$langid dataset table
EPrints “Internal” Search – Overview (2)
Simple search is not scalable
Lots of derived data in the DB (backup?)
No relevance matching -> good matches do not surface
No advanced features: suggestions, facets, boolean op’s etc.
Home-brewed: hard to maintain the code, hard to extend
Difficult to debug…
EPrints “Internal” Search – Downsides
Introduced in 3.3
Only integrated with the simple search
Little flexibility in controlling what is indexed
Advanced features “not really” enabled
Searches every field (“text_index” not respected)
But the idea is good & worth building upon
EPrints Xapian Search
Attempts to re-use EPrints’ default configuration:
◦ datasets’ field definitions (+ “text_index”)
◦ fields defined in the simple search (un-prefixed terms)
But needs its own bits to define:
◦ default indexing methods (by MetaField type)
◦ facet-able indexes
◦ order-able indexes
May be used to declare derived indexes – examples:
◦ “open_access”: to filter references from open full-text documents
◦ “year”: to filter by year of publication (rather than by date)
◦ “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
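A minimal sketch of how two of the derived indexes above could be computed. It is written in Python for illustration only and is not the EPrints (Perl) code; the record layout (a dict with a "date" string and a "documents" list carrying a "security" value) and the function name derive_indexes are assumptions.

```python
def derive_indexes(record):
    """Return derived index values for one record (a plain dict here).

    Hypothetical record layout, assumed for this sketch:
      {"date": "2013-05-17", "documents": [{"security": "public"}, ...]}
    """
    derived = {}

    # "year": keep only the year part of a date such as "2013-05-17" or "2013"
    pub_date = record.get("date")
    if pub_date:
        derived["year"] = pub_date[:4]

    # "open_access": true if at least one attached document is publicly visible
    docs = record.get("documents", [])
    derived["open_access"] = any(d.get("security") == "public" for d in docs)

    return derived
```

The point of a derived index is that the value is computed once at indexing time, so filtering by year or by open access at search time needs no extra lookup.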
Indexing
Indexing - Classes
[Class diagram: Xapian::Index, IndexMethod, Config, OrderMethod, XapianDB; labels: “Fulltext”, “Name”, etc. and “Alpha.”, “Name”, etc.]
Indexes are prefixed by “_”, e.g. “_title”, so we can sanitise the user query; otherwise users could run prefixed searches (and search fields they are not necessarily allowed to search)
Z notation: indicates a stemmed value or index: Z_title, Zhappi (internal Xapian convention)
Script available to re-process the Xapian indexes (similar to “epadmin reindex” but does not re-index EPrints’ internal search indexes)
Reserved indexes:
◦ _id: keep the internal id of the data-obj (/id/eprint/123)
◦ _dataset: the dataset the record belongs to (‘eprint’, ‘user’…)
◦ _configuration_md5: keeps an MD5 of the conf. the item was indexed against (useful?)
◦ _index_timestamp: when the item was last indexed
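A rough sketch of the query sanitisation idea: because internal indexes carry a “_” prefix, a user-typed prefix can be mapped to an internal index only when it is on an allow-list, and dropped otherwise. This is Python for illustration, not the plugin's Perl code; ALLOWED_FIELDS, sanitise_query and the “field:term” syntax are assumptions.

```python
import re

# Hypothetical allow-list: fields that users may search with a prefix
ALLOWED_FIELDS = {"title", "creators_name"}

# A token of the form "prefix:term"
PREFIXED = re.compile(r"^(\w+):(.+)$")


def sanitise_query(query):
    """Rewrite 'field:term' tokens and drop prefixes the user may not use."""
    out = []
    for token in query.split():
        m = PREFIXED.match(token)
        if not m:
            out.append(token)  # plain, un-prefixed term
            continue
        field, term = m.group(1), m.group(2)
        if field in ALLOWED_FIELDS:
            out.append(f"_{field}:{term}")  # map to the internal "_" index
        else:
            out.append(term)  # unknown or internal prefix: keep the term only
    return " ".join(out)
```

With this allow-list, a user typing “_dataset:user” cannot reach the reserved index directly: the prefix is stripped and only the term “user” is searched.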
Indexing – Extra information
Again, attempts to re-use EPrints’ configuration:
◦ simple search (mostly for ordering methods)
◦ advanced/staff search: which fields to use (prefixed terms)
Extra bits can be configured such as which facets can be used on each search (simple, advanced, …)
Only indexed stuff can be searched
◦ you cannot use a facet which has not been generated
◦ you need to re-index your data if you change the simple search def.
◦ same if you add new order-able fields
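One possible use of the reserved _configuration_md5 index is to spot items that were indexed against an older configuration and therefore need re-indexing. The sketch below is Python for illustration and an assumption about how such a check could work, not a description of the actual plugin; the config layout and function names are hypothetical.

```python
import hashlib
import json


def configuration_md5(config):
    """MD5 of a canonical (sorted-key) JSON dump of the indexing config."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def needs_reindex(stored_md5, current_config):
    """True if the item was indexed against a different configuration."""
    return stored_md5 != configuration_md5(current_config)
```

Sorting the keys before hashing means that merely re-ordering the configuration does not trigger a re-index; changing the simple search fields or the order-able fields does.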
Searching
Abstracted by Plugin::Search (original implementation)
Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object
Plugin::Search::Internal wraps an EPrints::Search object (a hack), so Plugin::Search::Xapian must emulate this behaviour
Searching (2)
Searching – Classes & Op. Stack
[Diagram: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Plugin::Search::Xapian, Xapian DB, Xapian::Facets]
May be used in a script
Exports & feeds work
Can be serialised/de-serialised (including facets) so should work for Saved Searches (to test)
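To illustrate what serialising a search “including facets” involves, here is a small round-trip sketch. It is Python for illustration; the JSON-plus-URL-quoting format and the function names are assumptions and not the format EPrints actually uses.

```python
import json
from urllib.parse import quote, unquote


def serialise_search(query, facets, order):
    """Pack the search state into a single URL-safe string."""
    state = {"q": query, "facets": facets, "order": order}
    return quote(json.dumps(state, sort_keys=True))


def deserialise_search(blob):
    """Restore (query, facets, order) from a serialised string."""
    state = json.loads(unquote(blob))
    return state["q"], state["facets"], state["order"]
```

A saved search only works if the restored state reproduces the same query, the same selected facets and the same ordering, which is what the round trip checks.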
Searching – Extra information
“Related Items”
Jiadi has developed a Bootstrap-based Pagination module:
◦ more sexy
◦ supports alternative “views” of the search results
Extras
Range searching: possible in Xapian but not yet implemented (e.g. 1..10)
Some refactoring:
◦ Xapian::Index -> Xapian::Indexer
◦ Plugin::Search::Xapianv2 -> Plugin::Search::Xapian (and replace EPrints’ default Xapian implementation)
Test with real life data (done to a certain extent...)
Load & scalability testing (+ number of slots etc.)
Multi-lang considerations (and related IndexMethod)
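For the range searching item above, the “1..10” syntax could be recognised along these lines. This is only a Python sketch of the parsing step, with hypothetical names; it says nothing about how the plugin will eventually hand ranges to Xapian.

```python
import re

# "low..high" with non-negative integers on both sides
RANGE_RE = re.compile(r"^(\d+)\.\.(\d+)$")


def parse_range(token):
    """Return (low, high) for a 'low..high' token, or None if it is not a valid range."""
    m = RANGE_RE.match(token)
    if not m:
        return None
    low, high = int(m.group(1)), int(m.group(2))
    if low > high:
        return None  # reject inverted ranges rather than guess the intent
    return low, high
```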
TO-DO’s
Page displaying how a data-obj has been indexed
◦ prefixes
◦ terms
◦ facets & order-able fields
Status page (cf. “Admin > Status”):
◦ DB size
◦ number of Documents
◦ indexed datasets (and how)
Weighting: supported (via conf.) but un-tested in real life
TO-DO’s – Would be nice
Xapian is more of a user search
The internal search is still required to:
◦ get records from the Database ($dataset->search())
◦ this affects screens such as “Manage Deposits”, the “Review” etc. which cannot wait for items to be indexed (direct DB calls)
◦ may be needed to apply ACL’s (if some items cannot be searched): safer to use the (MySQL) DB as authority
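The “DB as authority” point can be pictured as a post-filter: Xapian supplies ranked ids, and the database decides which of them the user may see. A minimal Python sketch under that assumption follows; both inputs are plain lists of ids here and the function name is hypothetical.

```python
def filter_hits(xapian_hit_ids, visible_ids_from_db):
    """Keep only the hits the database says are visible, preserving Xapian's ranking."""
    visible = set(visible_ids_from_db)
    return [item_id for item_id in xapian_hit_ids if item_id in visible]
```

The ranking order comes from Xapian; the database only removes items, so a stale or incomplete Xapian index cannot expose a record the database would hide.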
Internal Search vs Xapian Search
Plugin::Search::Xapian may be set to debug mode: shows processing and query building
Xapian comes with an analysis tool, “delve” to:
◦ view the content of the Xapian DB or some selected Documents
◦ see if a term exists in the DB (and in which Documents)
◦ other info (term frequency etc.)
Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues
Debugging Xapian
Not quite at release stage, but it is currently isolated so it shouldn’t break your IR
All the code is on GitHub:
https://github.com/eprints/xapianv2
Using & Contributing
http://puffin.ecs.soton.ac.uk/cgi/xapian
Simple search / facets / export / order
Simple search with boolean op’s, suggestion
Advanced search / facets / export / order
Related items
http://vmdev1.eprints.org/cgi/xapian (more data + cached citations)
http://vmdev1.eprints.org/cgi/xapian_status
Demos
Let’s have a play?
Code overview?
Doc?
Q&A & what’s next