BioSolr: building a better search for bioinformatics
Tom Winch & Matt Pearce
21st April 2015
[email protected]/blog
+44 (0) 8700 118334
Twitter: @FlaxSearch
The European Bioinformatics Institute
Part of the European Molecular Biology Laboratory
Based on the Wellcome Genome Campus in Hinxton, Cambridge
Maintains the world’s most comprehensive range of freely available and up-to-date molecular databases, serving millions of researchers – indexing over 1 billion items
BioSolr project involves two teams from EMBL-EBI:
– Protein Data Bank in Europe (PDBe)
– Samples, Phenotypes and Ontologies (SPOt)
The genesis of BioSolr
Grant Ingersoll visits the Wellcome Campus in July '13
Around 90 people attend
Show of hands indicates 75% using Lucene/Solr
Sameer Velankar of EMBL-EBI identifies grant funding
Flax and EMBL-EBI apply successfully to the BBSRC
BioSolr
One-year BBSRC-funded project from September 2014
“to significantly advance the state of the art with regard to indexing and querying biomedical data with freely available open source software”
Outputs:
– Workshops
– Papers & presentations
– Software (Open source of course!)
– Documentation
Inputs: from the PDBe & SPOt teams
BioSolr
Tom Winch
– Working on site with Sameer Velankar & the PDBe team
– facet.contains & XJoin
Matt Pearce
– Working on site with Tony Burdett & the SPOt team
– Indexing ontologies
BioSolr & PDBe - Introduction
Protein Data Bank in Europe (PDBe)
facet.contains – autosuggest
https://issues.apache.org/jira/browse/SOLR-1387
In Solr 5.1
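As a rough sketch of how facet.contains can drive autosuggest: the request below asks Solr for the facet values of one field that contain the text typed so far. The host, the core name (pdbe) and the field name (molecule_name) are assumptions made up for illustration. facet.contains is the parameter from SOLR-1387, and facet.contains.ignoreCase is its case-insensitive companion.

```python
from urllib.parse import urlencode


def autosuggest_url(base_url, field, typed, limit=10):
    """Build a Solr select URL that returns facet values containing `typed`."""
    params = [
        ("q", "*:*"),                            # match everything; only facets matter
        ("rows", "0"),                           # no documents needed in the response
        ("wt", "json"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.contains", typed),               # keep only values containing the text
        ("facet.contains.ignoreCase", "true"),   # case-insensitive matching
        ("facet.limit", str(limit)),
        ("facet.mincount", "1"),
    ]
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
url = autosuggest_url("http://localhost:8983/solr/pdbe", "molecule_name", "kinase")
print(url)
```

The suggestions would then be read from the facet counts in the response, one entry per matching field value.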
DNA sequence similarity
BioSolr & PDBe – XJoin concepts
The problem - sequences come from a live source
Joining with data from an external source
Custom Solr code
BioSolr & PDBe – Solr classes
XJoinResultsFactory, XJoinResults
XJoinSearchComponent
XJoinQParserPlugin
XJoinValueSourceParser
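The join idea can be illustrated outside Solr. The following is only a conceptual sketch in Python, not the Java API of the classes listed above: a stand-in for a live external source (here a hypothetical sequence-similarity search) returns ids with scores, and those ids are used both to restrict the Solr hits and to attach the external value to each hit. Every function and field name is invented for illustration.

```python
def external_similarity_search(sequence):
    """Stand-in for a live external source; returns {join_id: score}."""
    # Hard-coded results so the sketch runs without any external service.
    return {"1abc": 0.97, "2xyz": 0.81}


def join_with_external(solr_docs, external_results, join_field="pdb_id"):
    """Keep only docs whose join field appears in the external results,
    attach the external score, and order by that score."""
    joined = []
    for doc in solr_docs:
        key = doc.get(join_field)
        if key in external_results:
            merged = dict(doc)                      # do not mutate the input doc
            merged["external_score"] = external_results[key]
            joined.append(merged)
    joined.sort(key=lambda d: d["external_score"], reverse=True)
    return joined


solr_docs = [
    {"pdb_id": "2xyz", "title": "Kinase domain"},
    {"pdb_id": "3foo", "title": "Unrelated entry"},
    {"pdb_id": "1abc", "title": "Lysozyme"},
]
hits = join_with_external(solr_docs, external_similarity_search("MKTAYIAKQR"))
for hit in hits:
    print(hit["pdb_id"], hit["external_score"])
```

In this sketch the filtering and the ordering both come from the external source, which is roughly the role the query parser and the value source would play inside Solr.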
BioSolr & PDBe – What next?
Solr contrib – SOLR-7341
https://issues.apache.org/jira/browse/SOLR-7341
Joining from multiple external sources
Federated search
Washington, N. & Lewis, S. (2008) Ontologies: Scientific Data Sharing Made Easy. Nature Education 1(3):5
BioSolr & SPOt – Indexing Ontologies
Indexing Ontologies - the problem
You have a collection of documents annotated with ontology references.
You want to search both the documents and the associated ontology data.
This may include associated nodes – “has location”, “is part of”, etc.
Faceting by ontology reference would be nice!
Approach 1
– Keep the data separate
[Diagram: Documents -> Indexer -> documents index; Ontology -> Indexer -> ontology index]
Approach 1 - steps
Index the documents, with the node annotations, but no further detail.
Index the ontology in its own core.
Search the documents, then cross-match against the ontology.
BUT: requires multiple calls, and doesn't allow searching both cores at the same time.
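The steps above can be sketched with two in-memory stand-ins for the two cores. This is a minimal illustration, not the project's code: the data, the field names (annotation_uris, label) and the example.org URIs are all made up, and each helper stands in for one call to Solr.

```python
DOCS = [
    {"id": "d1", "title": "Study of blood pressure",
     "annotation_uris": ["http://example.org/onto/0001"]},
    {"id": "d2", "title": "Liver enzyme survey",
     "annotation_uris": ["http://example.org/onto/0002"]},
]
ONTOLOGY = {
    "http://example.org/onto/0001": {"label": "hypertension"},
    "http://example.org/onto/0002": {"label": "hepatic function"},
}


def search_documents(doc_index, text):
    """Call 1: search the document core (here, a substring match on title)."""
    return [d for d in doc_index if text.lower() in d["title"].lower()]


def lookup_ontology(ontology_index, uris):
    """Call 2: fetch the ontology nodes referenced by the matching documents."""
    return {u: ontology_index[u] for u in uris if u in ontology_index}


def search_with_cross_match(doc_index, ontology_index, text):
    docs = search_documents(doc_index, text)
    uris = {u for d in docs for u in d["annotation_uris"]}
    nodes = lookup_ontology(ontology_index, uris)
    return [
        {"id": d["id"], "title": d["title"],
         "annotations": [nodes[u]["label"] for u in d["annotation_uris"] if u in nodes]}
        for d in docs
    ]


print(search_with_cross_match(DOCS, ONTOLOGY, "blood"))
print(search_with_cross_match(DOCS, ONTOLOGY, "hypertension"))
```

The second search returns nothing, because the term only exists in the ontology core. That is the limitation noted above: the two cores cannot be searched together.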
Approach 2
• Add some ontology data to your documents.
[Diagram: Documents + Ontology -> Indexer -> documents index]
Approach 2 – step 1
Index node references, plus their labels and synonyms.
Easier to include the ontology references in your search.
Can boost some fields over others.
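A minimal sketch of this step, assuming the ontology is available as a simple lookup at index time. The field names (annotation_uris, annotation_labels, annotation_synonyms), the URI and the boost values are hypothetical; defType=edismax and qf are standard Solr query parameters.

```python
def enrich(doc, ontology):
    """Copy the labels and synonyms of each referenced node into the document."""
    enriched = dict(doc)
    labels, synonyms = [], []
    for uri in doc.get("annotation_uris", []):
        node = ontology.get(uri)
        if node is None:
            continue                      # unknown reference: index the doc as-is
        labels.append(node["label"])
        synonyms.extend(node.get("synonyms", []))
    enriched["annotation_labels"] = labels
    enriched["annotation_synonyms"] = synonyms
    return enriched


def boosted_query_params(text):
    """Search title and ontology fields together, weighting some fields higher."""
    return {
        "q": text,
        "defType": "edismax",
        "qf": "title^3 annotation_labels^2 annotation_synonyms",
    }


ontology = {
    "http://example.org/onto/0001": {
        "label": "hypertension",
        "synonyms": ["high blood pressure"],
    },
}
doc = {"id": "d1", "title": "Cohort study",
       "annotation_uris": ["http://example.org/onto/0001"]}
print(enrich(doc, ontology))
print(boosted_query_params("high blood pressure"))
```

With the labels and synonyms stored on the document, a single query can match either the document text or its ontology annotations.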
Approach 2 – step 2
Expand the ontology data being stored.
Include single (or multi)-level parent and child nodes, with labels.
Use dynamic fields to store additional relationships.
Dynamic fields allow searches across specific relation types.
BUT: requires some additional Solr look-ups to be fully dynamic.
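One way the dynamic-field idea could look at index time is sketched below. The naming scheme (relation name plus a _rel_labels suffix, which a schema would match with a dynamic field pattern such as *_rel_labels) is an assumption for illustration, as are the node structure and URIs.

```python
import re


def relation_field(relation_name, suffix="_rel_labels"):
    """Turn a relation name like 'has location' into a dynamic field name."""
    slug = re.sub(r"[^a-z0-9]+", "_", relation_name.lower()).strip("_")
    return slug + suffix


def add_relations(doc, node, ontology):
    """Store the labels of related nodes, one dynamic field per relation type."""
    enriched = dict(doc)
    for relation, target_uris in node.get("relations", {}).items():
        labels = [ontology[u]["label"] for u in target_uris if u in ontology]
        if labels:
            enriched[relation_field(relation)] = labels
    return enriched


ontology = {
    "http://example.org/onto/0010": {
        "label": "myocardial infarction",
        "relations": {"has location": ["http://example.org/onto/0020"],
                      "is part of": ["http://example.org/onto/0030"]},
    },
    "http://example.org/onto/0020": {"label": "heart"},
    "http://example.org/onto/0030": {"label": "cardiovascular disease"},
}
node = ontology["http://example.org/onto/0010"]
print(add_relations({"id": "d1"}, node, ontology))
```

A search restricted to one relation type would then target a single field, for example has_location_rel_labels:heart.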
Approach 2 – search screen
Approach 2 – search results
Approach 3
Search the ontology, and cross-match with documents.
Allow SPARQL queries over the ontology index.
SPARQL is a semantic query language for RDF data.
Adding Apache Jena
To allow SPARQL queries, we use Apache Jena to provide TDB-querying.
Jena uses Solr to search label fields.
Uses its own Triple Store for other fields.
Need to include reference URI in returned fields.
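For illustration, a label search of this kind might be written with the text:query property function from Jena's text-search extension, as in the sketch below. Whether the project's queries take exactly this shape is an assumption; the variable names and the default limit are made up. The query selects the node URI so that it is available for cross-matching afterwards.

```python
def label_search_sparql(term, limit=10):
    """Build a SPARQL query that full-text searches rdfs:label and returns node URIs."""
    # Escape backslashes and single quotes so the term stays inside the literal.
    escaped = term.replace("\\", "\\\\").replace("'", "\\'")
    return (
        "PREFIX text: <http://jena.apache.org/text#>\n"
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?uri ?label WHERE {\n"
        f"  ?uri text:query (rdfs:label '{escaped}' {int(limit)}) ;\n"
        "       rdfs:label ?label .\n"
        "}"
    )


print(label_search_sparql("heart", limit=5))
```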
Integrating Jena results
Returned Jena data needs to be cross-matched against document index.
Use a filter query to choose the matching documents.
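A small sketch of building such a filter query from the URIs returned by Jena. The field name annotation_uri and the example URIs are hypothetical; the result is meant to be passed as Solr's fq parameter.

```python
def quote_value(value):
    """Wrap a value in double quotes, escaping backslashes and quotes inside it."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def uri_filter_query(field, uris):
    """Return an fq string matching documents annotated with any of the URIs,
    or None when there are no URIs (the caller can then skip the search)."""
    unique = sorted(set(uris))
    if not unique:
        return None
    return field + ":(" + " OR ".join(quote_value(u) for u in unique) + ")"


jena_uris = [
    "http://example.org/onto/0002",
    "http://example.org/onto/0001",
    "http://example.org/onto/0001",   # duplicates are collapsed
]
print(uri_filter_query("annotation_uri", jena_uris))
```

For large result sets the filter string grows with the number of URIs, so in practice the list would probably need to be capped or batched.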
Summary so far
We can search documents and ontology data with a single call to Solr.
We can dynamically search over additional related ontology nodes.
We can use SPARQL to search.
We can facet on individual ontology annotations, but we still can't present the facets in a tree.
https://github.com/flaxsearch/BioSolr/tree/master/spot
The ultimate goal
A generic ontology indexer using Solr.
Multiple ontologies stored in the same index.
Unique integer keys for each node, allowing cross-matching from document indexes.
Optional customisation, allowing for additional lookups or data manipulation.
BioSolr conclusions
Final workshop at EMBL-EBI in September
https://github.com/flaxsearch/BioSolr
Investigating funding to continue the project
– We have some ideas around federated Solr search...