Empowering EPrints Search with Xapian Sébastien François, EPrints Lead Developer EPrints Developer Powwow, ULCC


Post on 24-Dec-2015


Page 1:

Empowering EPrints Search with Xapian

Sébastien François, EPrints Lead Developer

EPrints Developer Powwow, ULCC

Page 2:

Review of EPrints Internal Search

Indexing

Searching

Extras

TO-DO’s

Using & contributing

Demo(s)

Summary

Page 3:

EPrints “Internal” Search - Overview

[Class diagram: Search, Field, DataSet, MetaField, Condition and List, linked by 1 and 1..n associations.]

Page 4:

EPrints “Internal” Search – Overview (2)

match = “EX” queries the main & auxiliary dataset tables

match = “IN” queries the __rindex dataset table

ordering is done via the __ordervalues_$langid dataset table
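The routing described on this slide can be sketched as follows. This is only an illustration, not EPrints code: the function names are hypothetical, and the table names assume the usual pattern of the dataset id followed by the suffix quoted on the slide.

```python
def lookup_table(dataset: str, match: str) -> str:
    """Return the table a condition would start from (illustrative only)."""
    if match == "EX":
        # Exact match: read the main dataset table
        # (multiple-valued fields live in auxiliary tables).
        return dataset
    if match == "IN":
        # Index match: read the reverse index table.
        return f"{dataset}__rindex"
    raise ValueError(f"unsupported match type: {match!r}")


def order_table(dataset: str, langid: str) -> str:
    """Return the per-language table assumed to hold the ordering values."""
    return f"{dataset}__ordervalues_{langid}"


print(lookup_table("eprint", "IN"))
print(order_table("eprint", "en"))
```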

Page 5:

EPrints “Internal” Search – Downsides

Simple search is not scalable

Lots of derived data in the DB (backup?)

No relevance matching -> good matches do not surface

No advanced features: suggestions, facets, boolean operators, etc.

Home-brewed: hard to maintain the code, hard to extend

Difficult to debug…

Page 6:

EPrints Xapian Search

Introduced in 3.3

Only integrated with the simple search

Little flexibility in controlling what is indexed

Advanced features “not really” enabled

Searches every field (“text_index” not respected)

But the idea is good & worth building upon

Page 7:

Indexing

Attempts to re-use EPrints’ default configuration:

◦ datasets’ field definitions (+ “text_index”)

◦ fields defined in the simple search (un-prefixed terms)

But needs its own bits to define:

◦ default indexing methods (by MetaField type)

◦ facet-able indexes

◦ order-able indexes

May be used to declare derived indexes – examples:

◦ “open_access”: to filter references from open full-text documents

◦ “year”: to filter by year of publication (rather than by date)

◦ “image_orientation”: if you had an archive of images, you could extract the orientation via EXIF
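As an illustration of a derived index such as “year”, here is a minimal sketch. It assumes dates are stored as strings of the form YYYY, YYYY-MM or YYYY-MM-DD; the function name is hypothetical and this is not the xapianv2 implementation.

```python
from typing import Optional


def derive_year(date_value: Optional[str]) -> Optional[str]:
    """Extract the publication year from a date string, or None if unusable."""
    if not date_value:
        return None
    # Everything before the first "-" is taken to be the year part.
    head = date_value.split("-", 1)[0]
    if len(head) == 4 and head.isdigit():
        return head
    return None


print(derive_year("2015-12-24"))
```

The value returned would be what gets indexed (and offered as a facet) instead of the full date.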

Page 8:

Indexing - Classes

[Class diagram: Xapian::Index, Config, IndexMethod, OrderMethod and XapianDB; method variants shown: Fulltext, Name, etc. and Alpha., Name, etc.]

Page 9:

Indexing – Extra information

Indexes are prefixed by “_”, e.g. “_title”, so we can sanitise the user query; otherwise users could do prefixed searches (and search fields that are not necessarily allowed)

Z notation: indicates a stemmed value or index: Z_title, Zhappi (internal Xapian convention)

Script available to re-process the Xapian indexes (similar to “epadmin reindex” but doesn’t re-index the EPrints internal search)

Reserved indexes:

◦ _id: keeps the internal id of the data-obj (/id/eprint/123)

◦ _dataset: the dataset to which the record belongs (‘eprint’, ‘user’…)

◦ _configuration_md5: keeps an MD5 of the conf. the item was indexed against (useful?)

◦ _index_timestamp: when the item was last indexed
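One way the “_” prefix can help sanitise a user query is sketched below. This is an assumption about the general approach, not the actual plugin code: the function name, the `field:term` token syntax and the choice to drop disallowed prefixes (rather than reject the query) are all illustrative.

```python
import re

# A token of the form field:rest, where field is letters/underscores.
PREFIX_RE = re.compile(r"^([A-Za-z_]+):(.+)$")


def sanitise_query(query: str, allowed_fields: set) -> str:
    """Map allowed 'field:' prefixes to '_field:' and strip all other prefixes."""
    out = []
    for token in query.split():
        m = PREFIX_RE.match(token)
        if m is None:
            out.append(token)  # plain, un-prefixed term
            continue
        field, rest = m.group(1), m.group(2)
        if field in allowed_fields:
            out.append(f"_{field}:{rest}")  # internal index name
        else:
            out.append(rest)  # prefix dropped: searched as a plain term
    return " ".join(out)


print(sanitise_query("title:xapian _id:123 search", {"title"}))
```

Because the real index names start with “_” and only allowed fields are rewritten to that form, a user typing `_id:123` directly never reaches the reserved index in this sketch.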

Page 10:

Searching

Again, attempts to re-use EPrints’ configuration:

◦ simple search (mostly for ordering methods)

◦ advanced/staff search: which fields to use (prefixed terms)

Extra bits can be configured, such as which facets can be used on each search (simple, advanced, …)

Only indexed stuff can be searched

◦ you cannot use a facet which has not been generated

◦ you need to re-index your data if you change the simple search def.

◦ same if you add new order-able fields
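The “only indexed stuff can be searched” rule can be pictured as a check of the search configuration against what was generated at indexing time. The sketch below is hypothetical (function and facet names are made up) and only shows the idea.

```python
def split_facets(requested: list, indexed: set) -> tuple:
    """Split requested facets into (usable, missing) given the indexed ones."""
    usable = [f for f in requested if f in indexed]
    missing = [f for f in requested if f not in indexed]
    return usable, missing


# Facets generated when the data was last indexed (assumed values).
indexed_facets = {"year", "open_access"}

usable, missing = split_facets(["year", "image_orientation"], indexed_facets)
print(usable, missing)
```

Any facet reported as missing would require the index configuration to be updated and the data re-indexed before it can be used.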

Page 11:

Searching (2)

Abstracted by Plugin::Search (original implementation)

Tricky to make it work with EPrints’ UI because it expects an EPrints::Search object

Plugin::Search::Internal is a wrapped EPrints::Search object (hack) so Plugin::Search::Xapian must emulate this behaviour

Page 12:

Searching – Classes & Op. Stack

[Diagram of classes and operation stack: /cgi/xapian, Search::XapianSearch, Paginate::Facets, Plugin::Search::Xapian, Xapian::Facets, Xapian DB.]

Page 13:

Searching – Extra information

May be used in a script

Exports & feeds work

Can be serialised/de-serialised (including facets) so should work for Saved Searches (to test)
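To show what serialising a search “including facets” involves, here is a small round-trip sketch. The JSON layout and function names are assumptions for illustration; the format actually used by the plugin is not described in these slides.

```python
import json


def serialise_search(query: str, facets: dict, order: str = "") -> str:
    """Pack the query, the selected facet values and the ordering into a string."""
    state = {
        "q": query,
        # Sort facet values so the same search always serialises identically.
        "facets": {name: sorted(values) for name, values in facets.items()},
        "order": order,
    }
    return json.dumps(state, sort_keys=True)


def deserialise_search(blob: str) -> tuple:
    """Rebuild (query, facets, order) from a serialised search."""
    state = json.loads(blob)
    return state["q"], state["facets"], state["order"]


blob = serialise_search("xapian", {"year": ["2015", "2014"]}, order="-date")
print(deserialise_search(blob))
```

A saved search would then only need to store the serialised string and re-run the rebuilt search later.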

Page 14:

Extras

“Related Items”

Jiadi has developed a Bootstrap-based Pagination module:

◦ more sexy

◦ supports alternative “views” of the search results

Page 15:

TO-DO’s

Range searching: possible in Xapian but not yet implemented (e.g. 1..10)

Some refactoring:

◦ Xapian::Index -> Xapian::Indexer

◦ Plugin::Search::Xapianv2 -> Plugin::Search::Xapian (and replace the default EPrints’ Xapian implementation)

Test with real life data (done to a certain extent...)

Load & scalability testing (+ number of slots etc.)

Multi-lang considerations (and related IndexMethod)
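For the range syntax mentioned above (e.g. 1..10), a front-end parser could look like the sketch below. It is hypothetical and independent of Xapian; accepting open-ended forms such as “..10” or “5..” is an assumption that goes beyond the example on the slide.

```python
from typing import Optional, Tuple


def parse_range(text: str) -> Optional[Tuple[Optional[int], Optional[int]]]:
    """Parse 'low..high' into (low, high); return None if it is not a valid range."""
    if ".." not in text:
        return None
    low_text, _, high_text = text.partition("..")
    try:
        low = int(low_text) if low_text else None
        high = int(high_text) if high_text else None
    except ValueError:
        return None
    if low is None and high is None:
        return None  # a bare ".." carries no bounds
    if low is not None and high is not None and low > high:
        return None  # reversed bounds are rejected
    return low, high


print(parse_range("1..10"))
```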

Page 16:

TO-DO’s – Would be nice

Page displaying how a data-obj has been indexed

◦ prefixes

◦ terms

◦ facets & order-able fields

Status page (cf. “Admin > Status”):

◦ DB size

◦ number of Documents

◦ indexed datasets (and how)

Weighting: supported (via conf.) but un-tested in real life

Page 17:

Internal Search vs Xapian Search

Xapian is more of a user search

The internal search is still required to:

◦ get records from the Database ($dataset->search())

◦ this affects screens such as “Manage Deposits”, the “Review” etc. which cannot wait for items to be indexed (direct DB calls)

◦ may be needed to apply ACL’s (if some items cannot be searched): safer to use the (MySQL) DB as authority
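The point about using the database as the authority for ACLs can be sketched as a post-filter on the ids returned by Xapian. Everything here is illustrative: the function name is hypothetical, and `can_view` stands in for whatever database-backed permission check would really be used.

```python
from typing import Callable, Iterable, List


def filter_visible(hit_ids: Iterable[int], can_view: Callable[[int], bool]) -> List[int]:
    """Keep only the hits the database says the user may see, preserving rank order."""
    return [item_id for item_id in hit_ids if can_view(item_id)]


# Stand-in for a DB lookup: ids currently visible to this user (assumed data).
visible_in_db = {101, 103}

ranked_hits = [103, 102, 101]  # ids as ranked by the search index
print(filter_visible(ranked_hits, lambda i: i in visible_in_db))
```

The index may lag behind the database, so in this sketch the final decision on visibility is never taken from the index.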

Page 18:

Debugging Xapian

Plugin::Search::Xapian may be set to debug mode: shows processing and query building

Xapian comes with an analysis tool, “delve”, to:

◦ view the content of the Xapian DB or some selected Documents

◦ see if a term exists in the DB (and in which Documents)

◦ other info (term frequency etc.)

Knowing what Xapian is searching and how a data-obj is indexed is key to debugging most search-related issues
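The two delve checks listed above can be wrapped in small helpers that build the command line. This sketch assumes delve's `-t` (documents indexed by a term), `-r` (terms indexing a record) and `-1` (one entry per line) options; the database path is a placeholder, and newer Xapian releases install the tool as `xapian-delve`, hence the `binary` parameter.

```python
def delve_term_command(db_path: str, term: str, binary: str = "delve") -> list:
    """Command listing the documents indexed by `term`, one per line."""
    return [binary, "-1", "-t", term, db_path]


def delve_record_command(db_path: str, recno: int, binary: str = "delve") -> list:
    """Command listing the terms that index document number `recno`."""
    return [binary, "-1", "-r", str(recno), db_path]


# Placeholder path: point this at the real Xapian database directory.
print(delve_term_command("/path/to/xapian/db", "Zhappi"))
print(delve_record_command("/path/to/xapian/db", 123))
```

The returned lists can be passed to `subprocess.run` once the path and binary name match the local installation.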

Page 19:

Using & Contributing

Not quite at release stage, but it is currently isolated so shouldn’t break your IR

All the code is on GitHub:

https://github.com/eprints/xapianv2

Page 20:

Demos

http://puffin.ecs.soton.ac.uk/cgi/xapian

Simple search / facets / export / order

Simple search with boolean op’s, suggestion

Advanced search / facets / export / order

Related items

http://vmdev1.eprints.org/cgi/xapian (more data + cached citations)

http://vmdev1.eprints.org/cgi/xapian_status

Page 21:

Q&A & what’s next

Let’s have a play?

Code overview?

Doc?