a collections searching center using lucene – solr
Post on 04-Feb-2016
A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
Collections.si.edu
wangch@si.edu
Background Information
The Smithsonian Institution is a public institution whose mission is the increase and diffusion of knowledge.
19 museums and 9 research institutes
136 million collection objects
12 major museum collection information systems (with 30 databases)
Hundreds of other databases
Issues we faced
Users want information now!
Google Effect and the user’s mentality: “if it is not online, it does not exist.”
Users want immediate access to digital documents.
Separate databases are confusing to the public.
We must act now!
Smithsonian’s Collection Searching Center Overview
a discovery center for information with a single searching point
faceted searching and content-sensitive navigation
positive and negative browse & select options
relevancy ranking of search results
automatic stemming for word matching
Smithsonian’s Cross Searching Catalog Overview (continued)
integrated searching of data from multiple types of databases
scalability for large data sets
a metadata center which interacts with other online applications
Project Team and Resources
Andrew Gunther – Software development and implementation
Jim Felley – Data conversion and implementation
George Bowman – Database management and security configuration
Randy Arnold – Project support
Ching-hsien Wang – Program Manager
Since August 2007, we have integrated data from 12 major databases with 2 million records.
Starting from Multiple databases
Transform into a single Search Center
Cross Searching Demo – simple opening screen
Demo – search result screen
Demo – search history
Process Flow Diagram
[Diagram: Horizon databases and museum, archives, and library digital data → data extract and transformation → XML documents → Solr (Lucene index) → output data in XML, JSON, or Python → applications: Cross Searching Catalog, online exhibition, virtual museum in Second Life, education interface, open access applications]
XML Data Transformation
[Diagram: source databases (Art Inventory, Photo Archives, Archives, Exhibition Catalogs, Research Bibliographies, Smithsonian History, Library, Airplane Directory) → a trigger on each database writes to the Solr_Index_Pending DB table → a Perl program converts records based on BIB# → XML documents]
Automated Process
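The automated process in the diagram can be sketched roughly as follows. This is an illustration in Python, not the Perl program the project used. The table name `solr_index_pending`, its single `bib_id` column, and the `fetch_record` helper are hypothetical stand-ins, and SQLite replaces the real database only so the sketch is runnable.

```python
import sqlite3
from xml.etree import ElementTree as ET


def fetch_record(bib_id):
    """Hypothetical stand-in for reading one catalog record by its BIB#."""
    return {"record_ID": f"demo_{bib_id}", "title": f"Record {bib_id}"}


def record_to_xml(record):
    """Serialize one record as a <doc> element with one child per field."""
    doc = ET.Element("doc")
    for field, value in record.items():
        ET.SubElement(doc, field).text = value
    return ET.tostring(doc, encoding="unicode")


def drain_pending(conn):
    """Convert every queued BIB# to XML, then remove it from the queue."""
    rows = conn.execute(
        "SELECT bib_id FROM solr_index_pending ORDER BY bib_id"
    ).fetchall()
    documents = []
    for (bib_id,) in rows:
        documents.append(record_to_xml(fetch_record(bib_id)))
        conn.execute("DELETE FROM solr_index_pending WHERE bib_id = ?", (bib_id,))
    conn.commit()
    return documents


# In the system described here, a database trigger adds a row whenever a
# record changes; for this demonstration the rows are inserted by hand.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE solr_index_pending (bib_id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO solr_index_pending VALUES (?)", [(7985,), (104765,)])
documents = drain_pending(conn)
```

Queueing only the record identifier keeps the trigger cheap; the heavier conversion work happens later, outside the catalog transaction.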
Define an Index Metadata Model
Free text data fields used for keyword searching & display:
Record Link, Title/Object-name, Identifier, Physical Description, Gallery Label, Notes, Publisher, Object Type, Taxonomic Name, Language, Topic, Place, Date, Name, Culture, Set Name, Data Source, Credit Line, Online Media Group
Facet data fields used for browsing and limiting:
Record ID, Object Type, Language, Topic, Place, Date, Name, Culture, Data Source, Online Media Type, Rights for Online Media File, Related Record, Usage Flag
Taxon-Kingdom, Taxon-Phylum, Taxon-Division, Taxon-Class, Taxon-Order, Taxon-Family, Taxon-Sub-Family, Scientific_name, Common name
Geo-Age-Era, Geo-Age-System, Geo-Age-Series, Geo-Age-Stage, Strat-Group, Strat-Formation, Strat-Member
Getting help from Solr
Task-specific handlers:
Request handler
Response handler
Update handler
The schema.xml file defines the fields to be indexed, displayed, and searchable.
The solrconfig.xml file defines cache size, faceted field type, and request handler customization.
Solrconfig.xml: example facet field definition
    <str name="facet.field">object_type</str>
    <str name="facet.field">language</str>
    <str name="facet.field">topic</str>
    <str name="facet.field">place</str>
    <str name="facet.field">date</str>
    <str name="facet.field">name</str>
    <str name="facet.field">culture</str>
    <str name="facet.field">online_media_type</str>
    <str name="facet.field">set_name</str>
    <str name="facet.field">data_source</str>
    <str name="facet.field">tax_kingdom</str>
    <str name="facet.field">tax_phylum</str>
    <str name="facet.field">tax_division</str>
    <str name="facet.field">tax_class</str>
    <str name="facet.field">tax_order</str>
    <str name="facet.field">tax_family</str>
    <str name="facet.field">tax_sub-family</str>
    <str name="facet.field">common_name</str>
    <str name="facet.field">scientific_name</str>
    <str name="facet.field">freetext</str>
    <str name="facet.field">text</str>
  </lst>
</requestHandler>
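A client can also request facets per query using Solr's standard `q`, `fq`, `facet`, `facet.field`, and `wt` parameters. The sketch below builds such a request URL, including one positive and one negative filter of the kind the browse & select options imply. The endpoint URL and the `build_facet_query` helper are assumptions made for illustration; the slides do not show how the catalog actually issues its queries.

```python
from urllib.parse import urlencode

# Hypothetical endpoint; the real host and path are not given in the slides.
SOLR_SELECT_URL = "http://localhost:8983/solr/select"


def build_facet_query(keywords, include=None, exclude=None, facet_fields=(), wt="json"):
    """Build a Solr select URL with faceting and positive/negative filters."""
    params = [("q", keywords), ("wt", wt), ("facet", "true")]
    for field in facet_fields:
        params.append(("facet.field", field))
    # Positive selection: keep only documents carrying this facet term.
    for field, term in (include or []):
        params.append(("fq", f'{field}:"{term}"'))
    # Negative selection: drop documents carrying this facet term.
    for field, term in (exclude or []):
        params.append(("fq", f'-{field}:"{term}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query(
    "Johannesburg",
    include=[("object_type", "Photographs")],
    exclude=[("place", "Illinois")],
    facet_fields=["object_type", "topic", "place", "date"],
)
```

Changing `wt` to `xml` or `python` selects the other output formats named in the process flow.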
Data Example (abbreviated) – a Library Book
<doc boost="1">
  <descriptiveNonRepeating>
    <record_ID>siris_sil_905285</record_ID>
    <unit_code>SIL</unit_code>
    <data_source>Smithsonian Institution Libraries</data_source>
    <title_sort>STORY OF WEST POINT: 18021943 THE WEST POINT TRADITION IN AMERICAN LIFE</title_sort>
    <title label="Title">Story of West Point: 1802-1943; the West Point tradition in American life</title>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Smithsonian Institution Libraries</freetext>
    <freetext category="objectType" label="Type">Books</freetext>
    <freetext category="date" label="Date">1943</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <object_type>Books</object_type>
    <date>1943</date>
  </indexedStructured>
</doc>
Data Example (abbreviated) – a Photograph
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_arc_104765</record_ID>
    <unit_code>EEPA</unit_code>
    <data_source>Eliot Elisofon Photographic Archives</data_source>
    <title_sort>AERIAL VIEW OF DOWNTOWN JOHANNESBURG SOUTH AFRICA SLIDE</title_sort>
    <title label="Title">Aerial view of downtown Johannesburg, South Africa, [slide]</title>
    <online_media mediaCount="1">
      <media thumbnail="http://sirismm.si.edu/eepa/eepthb/eepa_05859thb.jpg" type="Images">http://sirismm.si.edu/eepa/eep/eepa_05859.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Eliot Elisofon Photographic Archives</freetext>
    <freetext category="identifier" label="Local number">EEPA EECL 15973</freetext>
    <freetext label="photographer" category="name">Elisofon, Eliot</freetext>
    <freetext category="physicalDescription" label="Physical description">slide : col</freetext>
    <freetext category="notes" label="Summary">This photograph was taken when Eliot Elisofon was on assignment for Life magazine and traveled to Africa from August 18, 1959 to December 20, 1959</freetext>
    <freetext category="objectType" label="Type">Photographs</freetext>
    <freetext category="topic" label="Topic">Mod. architecture/cityscape</freetext>
    <freetext category="place" label="Place">South Africa</freetext>
    <freetext category="date" label="Date">1959</freetext>
    <freetext category="setName" label="See more items in">Eliot Elisofon Field photographs 1942-1972</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <name>Elisofon, Eliot</name>
    <object_type>Color slides</object_type>
    <object_type>Photographs</object_type>
    <object_type>Archival materials</object_type>
    <topic>Mod. architecture/cityscape</topic>
    <topic>Cultural landscapes</topic>
    <topic>Aerial photography</topic>
    <place>Africa</place>
    <place>South Africa</place>
    <date>1959</date>
    <online_media_type>Images</online_media_type>
  </indexedStructured>
</doc>
Data Example (abbreviated) – a Sculpture
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_ari_7985</record_ID>
    <unit_code>ARI</unit_code>
    <data_source>Art Inventories</data_source>
    <title_sort>DREXEL MONUMENT SCULPTURE</title_sort>
    <title label="Title">The Drexel Monument, (sculpture)</title>
    <record_link>http://siris-artinventories.si.edu/ipac20/ipac.jsp?&profile=all&source=~!siartinventories&uri=full=3100001~!7985~!0#focus</record_link>
    <online_media mediaCount="7">
      <media thumbnail="http://sirismm.si.edu/saam/scan3thb/S75004286_1bthb.jpg" type="Images">http://americanart.si.edu/images/1966/1966.47.36_1b.jpg</media>
    </online_media>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="dataSource" label="Data Source">Art Inventories</freetext>
    <freetext category="identifier" label="Control number">IAS 75004286</freetext>
    <freetext label="sculptor" category="name">Manger, Heinrich b. 1833</freetext>
    <freetext label="founder" category="name">Chas. F. Heaton</freetext>
    <freetext category="title" label="title">Francis M. Drexel Monument, (sculpture)</freetext>
    <freetext category="physicalDescription" label="Physical description">metal: bronze Sculpture: bronze; Base: granite; Fountain basin: concrete</freetext>
    <freetext category="notes" label="Description">Index of American Sculpture, University of Delaware, 1985</freetext>
    <freetext category="objectType" label="Type">Sculptures-Fountain</freetext>
    <freetext category="name" label="Subject">Drexel, Francis M</freetext>
    <freetext category="place" label="Place">Illinois</freetext>
    <freetext category="date" label="Date">1881. Cast 1882. Dedicated 1883</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <name>Manger, Heinrich</name>
    <name>Chas. F. Heaton</name>
    <object_type>Sculptures</object_type>
    <topic>Portrait male</topic>
    <name>Drexel, Francis M</name>
    <place>Illinois</place>
    <date>1880s</date>
    <online_media_type>Images</online_media_type>
  </indexedStructured>
</doc>
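The data examples suggest that each record has three sections: identifying fields, labeled free text for display and keyword search, and structured terms for faceting. The sketch below separates those sections with Python's standard `xml.etree.ElementTree`. The sample is a cut-down copy of the photograph record, and the `split_record` helper is a name introduced here for illustration, not part of the project's code.

```python
from collections import defaultdict
from xml.etree import ElementTree as ET

# Cut-down copy of the photograph record shown above.
SAMPLE_DOC = """
<doc boost="6.4">
  <descriptiveNonRepeating>
    <record_ID>siris_arc_104765</record_ID>
    <data_source>Eliot Elisofon Photographic Archives</data_source>
  </descriptiveNonRepeating>
  <descriptiveOptional>
    <freetext category="place" label="Place">South Africa</freetext>
    <freetext category="date" label="Date">1959</freetext>
  </descriptiveOptional>
  <indexedStructured>
    <object_type>Photographs</object_type>
    <place>Africa</place>
    <place>South Africa</place>
    <date>1959</date>
  </indexedStructured>
</doc>
"""


def split_record(xml_text):
    """Return (record id, labeled display values, facet terms by field)."""
    doc = ET.fromstring(xml_text.strip())
    record_id = doc.findtext("descriptiveNonRepeating/record_ID")
    # Free text keeps its human-readable label for display.
    display = [
        (el.get("label"), el.text)
        for el in doc.iterfind("descriptiveOptional/freetext")
    ]
    # Structured terms may repeat, so collect a list per field name.
    facets = defaultdict(list)
    for el in doc.find("indexedStructured"):
        facets[el.tag].append(el.text)
    return record_id, display, dict(facets)


record_id, display, facets = split_record(SAMPLE_DOC)
```

Note that the place facet carries both the broad term (Africa) and the specific one (South Africa), while the display text shows only the specific place.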
A system is only as good as the data that is in it.
Data mapping for multiple databases (truncated)
Faceted Categories
Determine the most useful facets; more is not better.
Number of unique facets will affect system response time.
Smithsonian has 4.6 million unique terms. Among them:
864,000 names
126,000 topics
47,000 places
139 dates (down from 40,000 before cleanup)
1,000 types (down from 2,000 before cleanup)
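To illustrate how cleanup can shrink the number of unique terms, the sketch below collapses free-form date strings into decade labels and counts the result. Bucketing by decade is an assumption made for this example, prompted by the sculpture record whose date is indexed as 1880s; the slides do not spell out the actual cleanup rules. The `normalize_date` helper and the sample values are likewise hypothetical.

```python
import re
from collections import Counter


def normalize_date(raw):
    """Map a free-form date to a decade label such as '1880s', or None."""
    match = re.search(r"\b(\d{3})\d\b", raw)  # first four-digit year
    return f"{match.group(1)}0s" if match else None


raw_dates = [
    "1881",
    "1881. Cast 1882. Dedicated 1883",
    "ca. 1885",
    "1943",
    "1959",
]
unique_before = len(set(raw_dates))
cleaned = Counter(normalize_date(value) for value in raw_dates)
unique_after = len(cleaned)
```

Here five distinct raw strings reduce to three facet terms, and the facet counts become more useful because related records share a term.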
Build the facet terms
650 $a Art $z Africa, North $v Periodicals.
<topic> Art </topic><place> Africa, North </place><object_type> Periodicals </object_type>
Build the facet terms
655 $a Photographs $y 1840-1860.
<type> Photographs </type><date> 1840s </date><date> 1850s </date><date> 1860s </date>
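The two mappings above can be sketched as a small conversion routine. The subfield-to-facet table is inferred only from these two examples, and mapping 655 $a to `object_type` (the indexed field name used in the data examples) is an assumption. The function names are hypothetical, and the decade expansion simply follows the 1840-1860 example.

```python
import re

# Inferred from the two slide examples; a real mapping would cover more
# MARC tags and subfields.
SUBFIELD_TO_FACET = {
    ("650", "a"): "topic",
    ("650", "z"): "place",
    ("650", "v"): "object_type",
    ("655", "a"): "object_type",
    ("655", "y"): "date",
}


def expand_decades(value):
    """Turn '1840-1860' into ['1840s', '1850s', '1860s']."""
    years = [int(year) for year in re.findall(r"\d{4}", value)]
    if not years:
        return []
    first, last = years[0] // 10 * 10, years[-1] // 10 * 10
    return [f"{decade}s" for decade in range(first, last + 10, 10)]


def marc_to_facets(field):
    """Parse a line such as '650 $a Art $z Africa, North $v Periodicals.'"""
    tag, _, rest = field.partition(" ")
    facets = []
    for code, raw in re.findall(r"\$(\w)\s*([^$]*)", rest):
        value = raw.strip().rstrip(".").strip()
        facet = SUBFIELD_TO_FACET.get((tag, code))
        if facet is None or not value:
            continue  # unmapped subfield or empty value
        if facet == "date":
            facets.extend((facet, decade) for decade in expand_decades(value))
        else:
            facets.append((facet, value))
    return facets


subject_facets = marc_to_facets("650 $a Art $z Africa, North $v Periodicals.")
genre_facets = marc_to_facets("655 $a Photographs $y 1840-1860.")
```

Splitting a precoordinated heading this way lets each part feed its own facet, so a user can limit by place or type independently of topic.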
Challenges
Adapting LCSH and AAT terms in a whole new way
Still seeking a good way to use See and See Also reference data
Reducing data inconsistency in our records for better-quality facet terms
Character conversion challenges with MARC-8, Unicode, and UTF-8
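One aspect of the character conversion challenge can be shown with Python's standard `unicodedata` module: the same name can exist in Unicode as different code point sequences, which would split one facet term into two unless the text is normalized. This sketch does not decode MARC-8, which needs a dedicated converter. The sample name and the choice of NFC normalization are assumptions made for illustration, not the project's documented approach.

```python
import unicodedata


def normalize_term(term):
    """Return the NFC form so equivalent spellings compare equal."""
    return unicodedata.normalize("NFC", term)


# The same name written two ways:
decomposed = "Dvor\u030ca\u0301k"  # base letters followed by combining marks
precomposed = "Dvo\u0159\u00e1k"   # single code points for the accented letters

same_before = decomposed == precomposed
same_after = normalize_term(decomposed) == normalize_term(precomposed)
utf8_bytes = normalize_term(decomposed).encode("utf-8")
```

Normalizing every term before it is indexed, and every query term before it is matched, keeps the two spellings in a single facet bucket.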
Future plans
Continue to add data from more digital library databases and museum collection databases.
Working on National History museum and American Indian museum.
Complete the implementation of the capability to interact with external applications
Plan to support “American Art and Artist” application
Add new functionality such as my-list, list-sharing, social tagging.
Support more visual displays such as Google map and time slider
A Collections Searching Center Using Lucene – Solr
Ching-hsien Wang
Smithsonian Institution
www.siris.si.edu
wangch@si.edu