Searching for Search Solutions

Harvard IT Summit

June 23, 2011

Randy Stern | randy_stern@harvard.edu | HUL/OIS

David Heitmeyer | david_heitmeyer@harvard.edu | HUIT

Searching the Web

Searching a Site

Searching a Collection

Searching Geospatially

Search at Harvard – Web

Search at Harvard – Web

Search at Harvard – Collections

• People

• Courses

• Grants

• Libraries

• …many other things…

Search at Harvard – Libraries

Search at Harvard – Federated

Search Models

• “To oversimplify, there's the Google model and the faceted navigation model.” – Morville & Callender in Search Patterns

• Keyword (“Google”)

– Keyword search against an index

• Advanced Search

– Searching or selecting specific fields

• Faceted Search (“Guided Navigation”)

– Integrated search and browse

– Keyword search

– Browse by category metadata

– “No dead ends”
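As a rough illustration of the faceted model above (not code from any Harvard system), the following Python sketch combines a keyword filter with facet counts over category metadata. The records and field names are made up for the example. Because the counts are computed from the current hits, only values that still lead to results are offered, which is one way to read "no dead ends".

```python
from collections import Counter

# Hypothetical records; field names are illustrative only.
RECORDS = [
    {"title": "Mona Lisa", "creator": "Leonardo", "genre": "painting"},
    {"title": "Bronze Horse", "creator": "Leonardo", "genre": "sculpture"},
    {"title": "The Last Supper", "creator": "Leonardo", "genre": "painting"},
    {"title": "David", "creator": "Michelangelo", "genre": "sculpture"},
]


def search(records, keyword=None, filters=None):
    """Return records matching a keyword and every selected facet value."""
    filters = filters or {}
    hits = []
    for rec in records:
        text = " ".join(rec.values()).lower()
        if keyword and keyword.lower() not in text:
            continue
        if any(rec.get(field) != value for field, value in filters.items()):
            continue
        hits.append(rec)
    return hits


def facet_counts(hits, field):
    """Count how many hits fall under each value of a facet field."""
    return Counter(rec[field] for rec in hits)


# Keyword search, then facet counts to guide further narrowing.
hits = search(RECORDS, keyword="leonardo")
genre_facets = facet_counts(hits, "genre")

# Selecting a facet value re-runs the search with an added filter.
narrowed = search(RECORDS, keyword="leonardo", filters={"genre": "painting"})
```

A real engine would compute these counts from its index rather than by scanning records, but the interaction pattern (search, show counts, narrow) is the same.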

Advanced Search

Advanced Search

Faceted Search

Search Technologies – Summary

Technology | Products | Examples at Harvard
Web Search | Google, Yahoo, Bing | everywhere
Site Search | Google Search Appliance, Nutch, Sphinx, Elasticsearch | www.harvard.edu
Relational Database | Oracle, MySQL, PostgreSQL | PeopleSoft, Aleph, DRS, HOLLIS Classic
XML Database | Tamino, eXist | VIA, OASIS, Virtual Collections
Spatially enabled | ArcSDE, PostGIS | Harvard Geospatial Library, WorldMap
Archived web search | NutchWAX/Lucene | Library Web Archiving Service
Full text and faceted search | Apache Solr/Lucene, Endeca, Autonomy, MS FAST | Library Full Text Search Service, HOLLIS, iSites, Course Catalog
Federated search | Ex Libris MetaLib | Library Cross Search

Apache Lucene

• Open source from Apache

• High-performance, full-featured text search engine library written entirely in Java

• Text-based inverted index

• Documents of name/value pairs

• Stemming and tokenizers for various applications and languages

• Query syntax – and/or/not/near

• Highlighter

• **FAST**
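To make the "inverted index" and boolean query bullets concrete, here is a toy Python sketch. It is not how Lucene is implemented (Lucene is a Java library with far more sophisticated analysis, scoring, and storage); the tokenizer, documents, and function names are assumptions for illustration only.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase and split on non-alphanumeric characters (a toy tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


# Hypothetical documents keyed by id.
DOCS = {
    1: "Mona Lisa, a painting by Leonardo",
    2: "Bronze Horse, a sculpture by Leonardo",
    3: "David, a sculpture by Michelangelo",
}
index = build_index(DOCS)

# Boolean queries become set operations on the per-term document sets.
both = index["leonardo"] & index["sculpture"]      # leonardo AND sculpture
either = index["painting"] | index["sculpture"]    # painting OR sculpture
excluded = index["sculpture"] - index["leonardo"]  # sculpture NOT leonardo
```

Looking up a term is a dictionary access rather than a scan of every document, which is the main reason index-based search stays fast as collections grow.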

Apache Solr

• “Solr is the popular, blazing fast open source enterprise search platform from Apache”

• A REST Web Service on top of Lucene for indexing and querying

– XML and JSON output

• Caching for faster response

• Faceting

• Web management interface

• XML schema configuration files

• “did you mean?” and “more like this” support

• Scalable server model

• Very active development community

http://lucene.apache.org/solr/

Apache Solr/Lucene Ecology

[Diagram labels: Library catalogs; Enterprise databases; Web archives; Nutch, NutchWAX; Text; Fielded data; Lucene; Solr; "Highly scalable with Hadoop cluster"]

Solr Indexing

• Indexing: HTTP POST to http://mysolrserver/solr/update

<add>
  <doc>
    <field name="id">13579</field>
    <field name="title">Mona Lisa</field>
    <field name="creator">Leonardo DaVinci</field>
    <field name="year">1519</field>
    <field name="genre">painting</field>
  </doc>
</add>
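A minimal Python sketch of how a client might build this message and prepare the POST, using only the standard library. The host name `mysolrserver` is the placeholder from the slide, the helper names are hypothetical, and the request is constructed but not sent, since sending needs a live server.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict of field name/value pairs as an <add> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="utf-8")


def build_update_request(base_url, doc):
    """Prepare (but do not send) the HTTP POST to the update URL."""
    return urllib.request.Request(
        base_url + "/update",
        data=build_add_xml(doc),
        headers={"Content-Type": "text/xml; charset=utf-8"},
        method="POST",
    )


doc = {
    "id": "13579",
    "title": "Mona Lisa",
    "creator": "Leonardo DaVinci",
    "year": 1519,
    "genre": "painting",
}
request = build_update_request("http://mysolrserver/solr", doc)
# To send it against a real server: urllib.request.urlopen(request)
```

Building the XML with ElementTree rather than string concatenation means characters such as `&` or `<` in field values are escaped automatically.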

Solr Searching

http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre

<response>
  <result numFound="43" start="0">
    <doc>
      <str name="title">Mona Lisa</str>
      <str name="genre">painting</str>
    </doc>
    <doc>
      <str name="title">Bronze Horse</str>
      <str name="genre">sculpture</str>
    </doc>
  </result>
</response>
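The query URL and the XML response above can be handled from Python roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, the sample response is the one from the slide, and only `<str>` fields are parsed (a real response can contain other value types).

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode


def build_select_url(base_url, query, start=0, rows=10, fields=None):
    """Build a select URL with paging and an optional field list."""
    params = {"q": query, "start": start, "rows": rows}
    if fields:
        params["fl"] = ",".join(fields)
    # safe="," keeps the comma in fl readable, as on the slide.
    return base_url + "/select?" + urlencode(params, safe=",")


SAMPLE_RESPONSE = """<response>
  <result numFound="43" start="0">
    <doc><str name="title">Mona Lisa</str><str name="genre">painting</str></doc>
    <doc><str name="title">Bronze Horse</str><str name="genre">sculpture</str></doc>
  </result>
</response>"""


def parse_results(xml_text):
    """Return (total hit count, list of docs as dicts) from a response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [
        {field.get("name"): field.text for field in doc.findall("str")}
        for doc in result.findall("doc")
    ]
    return int(result.get("numFound")), docs


url = build_select_url(
    "http://mysolrserver/solr", "Davinci", rows=2, fields=["title", "genre"]
)
total, docs = parse_results(SAMPLE_RESPONSE)
```

Note that `numFound` (43) is the total number of matches, while `rows=2` limits how many documents come back in this page of results.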

Solr Searching

http://mysolrserver/solr/select?q=Davinci&start=0&rows=2&fl=title,genre&wt=json

{"response": {
  "numFound": 43,
  "start": 0,
  "docs": [
    {"title": "Mona Lisa", "genre": "painting"},
    {"title": "Bronze Horse", "genre": "sculpture"}
  ]
}}
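With `wt=json` the response maps directly onto native data structures, which tends to be simpler for scripting. A small Python sketch, using the slide's sample response as a literal string (the function name is hypothetical):

```python
import json

SAMPLE = """
{"response": {
  "numFound": 43,
  "start": 0,
  "docs": [
    {"title": "Mona Lisa", "genre": "painting"},
    {"title": "Bronze Horse", "genre": "sculpture"}
  ]
}}
"""


def summarize(json_text):
    """Return total hits, the titles on this page, and whether more remain."""
    response = json.loads(json_text)["response"]
    docs = response["docs"]
    titles = [doc["title"] for doc in docs]
    has_more = response["start"] + len(docs) < response["numFound"]
    return response["numFound"], titles, has_more


total, titles, has_more = summarize(SAMPLE)
```

A client would fetch the next page by repeating the query with `start` advanced by the number of rows already received.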

Use of Solr Exploding

• Whitehouse.gov, FCC.gov, Comcast / xfinity, AT&T Interactive, AOL (Yellow Pages, Music, NFL Sports, Recipes), Sears, Ticketmaster, Digg, Netflix, Zappos.com, and many more

• Open source library catalogs

– Blacklight (Ruby), VuFind (PHP)

• Open source digital repositories

– Fedora, DSpace

• Support available from Lucid Imagination (Solr creators)

Source: http://wiki.apache.org/solr/PublicServers

Harvard University Course Catalog

coursecatalog.harvard.edu

Solr & Course Catalog

• 9,000+ courses from 13 schools/programs

• 15 MB index size

– fields are indexed and stored

• Search + Faceted Navigation

– School, calendar period, term, department, day, time, cross-registration status, credit level

• Updated daily

– REST interface: HTTP POST of XML files

• XSLT/XPath 2 processing of XML data from Solr
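A faceted query of this kind can be expressed with Solr's `facet`, `facet.field`, and `fq` request parameters. The Python sketch below only builds the URL. The field names (`school`, `department`), the filter value, and the host are assumptions for illustration; the Course Catalog's actual schema is not shown in these slides.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_facet_url(base_url, query, facet_fields, filters=None):
    """Build a select URL that requests facet counts and applies filters."""
    params = [("q", query), ("facet", "true")]
    # One facet.field parameter per field whose value counts are wanted.
    params += [("facet.field", field) for field in facet_fields]
    # Each selected facet value becomes a filter query (fq).
    for field, value in (filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return base_url + "/select?" + urlencode(params)


url = build_facet_url(
    "http://mysolrserver/solr",          # placeholder host, as on the slides
    "history",
    facet_fields=["school", "department"],  # hypothetical field names
    filters={"school": "Law School"},       # hypothetical facet selection
)

# Decode the query string again to inspect what would be sent.
sent = parse_qs(urlsplit(url).query)
```

Passing the parameters as a list of pairs, rather than a dict, is what allows `facet.field` to be repeated.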

Course Catalog – Searching and Facets

[Screenshot labels: Search Terms; Facets: School, Semester, Department, Credit Level, Day of Week, Cross Registration, Term within School, Time of Day Offered]

Course Catalog

• Access to data for other applications

• OpenSearch browser plugins

iSites

• 5,500 course websites each year

• 20,000 websites

• 16,000 students

• 8 student portals

• 33,000 users on a peak day

Search within iSites

Solr & iSites

• 4.5 million items

– File, topic, forum, image, page, html, sign-up event, video, audio, site, link, wiki, announcement, podcast

– Crawlers use database and file system

• MS Office, PDF, Audio (metadata), OpenDocument, RTF, Text, HTML, XML

• 35 GB index size

• Updated hourly

– Master and slave

• Search Tool - Permissions

Search – New Ways of Navigating

Harvard Library Full Text Search Service

Harvard Library Full Text Search Service

Full Text Search Service

• Uses Lucene directly

• Full text index of OCR page text for digitized books and other page-turned objects

• Relevance ranked searching

• Hits in context

• ~81,000 objects so far, 7.2 million pages

• Index size 8.5GB
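"Hits in context" means showing each match with a little surrounding page text. The Python sketch below illustrates the idea with a plain regular-expression scan. It is not Lucene's Highlighter and not the service's code; the function name, sample text, and bracket markers are assumptions.

```python
import re


def hits_in_context(page_text, term, width=30):
    """Return a snippet for each occurrence of term, with nearby text."""
    snippets = []
    pattern = re.escape(term)
    for match in re.finditer(pattern, page_text, flags=re.IGNORECASE):
        start = max(0, match.start() - width)
        end = min(len(page_text), match.end() + width)
        snippet = (
            page_text[start:match.start()]
            + "[" + match.group(0) + "]"      # mark the hit itself
            + page_text[match.end():end]
        )
        snippets.append(snippet.strip())
    return snippets


# Hypothetical OCR text for one page.
page = (
    "The library holds early printed books. "
    "Many books were digitized with OCR."
)
snippets = hits_in_context(page, "books", width=15)
for snippet in snippets:
    print("..." + snippet + "...")
```

Producing snippets requires access to the original page text at query time, which is why storing full text alongside the index increases its size.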

Harvard Library Web Archiving Service

Harvard Library Web Archiving Service

Web Archiving Service

• Lucene plus NutchWAX full text index of harvested web pages and harvested resources

• Indexing HTML, PDFs, Word docs, PPTs, etc. and collection metadata

• Currently a “small” web archive

– 265 web sites

– 13M web pages

– 100M web resources, 1TB of archived web data

• Index size 170GB and growing

– 80-90% of index size is full text required for “hit in context” search results

• 3-5 sec search result times on ordinary dual core Linux box
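The sizing figures above imply some simple arithmetic, worked here in Python. The only assumption added is that 1 TB is treated as 1000 GB.

```python
INDEX_GB = 170      # index size from the slide
ARCHIVE_GB = 1000   # 1 TB of archived web data, assuming 1 TB = 1000 GB

# Integer arithmetic keeps the results exact.
full_text_low_gb = INDEX_GB * 80 // 100         # 80% of the index
full_text_high_gb = INDEX_GB * 90 // 100        # 90% of the index
index_share_pct = INDEX_GB * 100 // ARCHIVE_GB  # index relative to the archive

print(f"Stored full text: {full_text_low_gb}-{full_text_high_gb} GB of {INDEX_GB} GB")
print(f"Index is about {index_share_pct}% of the archived data size")
```

In other words, most of the index is stored text kept for "hit in context" display, and only a small remainder is the search structure itself.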

DRS 2 Web Administrator

Facets to come!!

DRS 2 Web Administrator

• Solr for digital object management searching

– Digital preservation objects have many fields that may be important for collection management or preservation planning

– Faceted browse – by user tags, content type, owners, etc.

– Full text searching for descriptions and process info

• Easy to configure, update, and use (HTTP and simple URLs) 

• Indexing metadata plus full text embedded in object descriptors, rather than the content of files themselves

• Scoped at release:

– 152 fields

– 30 million records, index size of 60GB

– master/slave configuration

Email Archiving Service

Email Archiving Service

• Why Solr for email object management?

– relevance ranking

– Facets

– full text searching of both email body and header fields 

• Indexing email header fields, rights and collection metadata, plus full text from emails
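As a sketch of what "indexing header fields plus full text" could look like, the Python below parses a raw message with the standard `email` module and flattens it into a document of name/value pairs. The field names, sample message, and document id are hypothetical, and only single-part plain-text messages are handled; rights and collection metadata would come from elsewhere.

```python
from email import message_from_string

# Hypothetical single-part message.
RAW = """From: alice@example.org
To: bob@example.org
Subject: Budget meeting
Date: Thu, 23 Jun 2011 09:00:00 -0400

Please review the attached budget before Friday.
"""


def email_to_doc(raw_message, doc_id):
    """Flatten headers and body into a dict suitable for indexing."""
    msg = message_from_string(raw_message)
    return {
        "id": doc_id,
        "from": msg["From"],
        "to": msg["To"],
        "subject": msg["Subject"],
        "date": msg["Date"],
        "body": msg.get_payload(),  # a string for non-multipart messages
    }


doc = email_to_doc(RAW, "msg-0001")
```

Keeping headers as separate fields is what allows them to be used as facets (for example, by sender), while the body supports full text search.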

Searching for Search Solutions

• Integrating multiple forms of data (text, images, audio, maps, etc.) into single searchable indexes

• Aggregating Indexes

– Google, Google Books, Google Scholar

– Licensed cloud services for articles, books, media, everything

– Library Cloud

– DPLA

• Semantic Web

– Linked Data, RDF, HTML 5’s Microdata, Microformats

• Mobile (Localized)

• Specialized search vs. general search – there’s an app for that

Thank You

Randy Stern | randy_stern@harvard.edu | HUL

David Heitmeyer | david_heitmeyer@harvard.edu | HUIT