National Archives and Records Administration

National Archives Catalog (The Catalog)

NARA Catalog Web Sites Data Model Design – Catalog Perspective –

Status-Final

Version 1.4

July 24, 2015

National Archives & Records Administration

NARA Catalog Web Sites Data Model Design

Avi Rappoport, Madhu Koneni, Kristen Martin, Terri Hobbs, Paul Nelson

Version 1.43

Contract Number GS-35F-0541U

Order Number NAMA-13-F-0120

July 24, 2015

NARA Catalog Web Sites DMD

Contents

1 Overview
1.1 NARA Web Site Content
1.2 What is a DMD?
1.3 Document Conventions

2 Web Sites and Content Extraction
2.1 Web Sites and Sub-Sites
2.2 Standard Metadata
2.3 Metadata Parsing for the Archives.gov Site
2.4 Metadata Parsing for NARA Blogs
2.4.1 Parsing Date for AOTUS Blog
2.4.2 Metadata Parsing for other National Archives Blogs
2.5 Presidential Libraries
2.6 Processing Non-HTML content
2.6.1 Text Extraction
2.6.2 Metadata Mapping for Non-HTML Content

3 File Processing

4 Mapping to Index Fields
4.1 Field Metadata Mappings
4.2 Mapping to Keywords Relevancy Model

5 Search Fields

6 Search Results Presentation
6.1 Search Results (aka “Brief Results” Display)

7 Content Details Presentation

Version Control

| Version | Date | Reviewer | Summary Description |
|---|---|---|---|
| 0.1 | 2013-10-25 | Paul Nelson and Madhu Koneni | Initial Outline |
| 0.2 | 2013-12-30 | Rhea Mandavilli | Updates to Sections 1, 6.1.1, 7 |
| 0.3 | 2014-01-02 | Archana Ballur Nagaraj | Updates to Section 2, 3, 4, 5, 6.1, 6.2 (ongoing) |
| 0.4 | 2014-01-03 | Avi Rappoport | Adding blogs, presidential libraries, updates to all sections |
| 0.5 | 2014-02-07 | Madhu Koneni | Feedback from NARA on 02/04/2010 addressed |
| 0.6 | 2014-02-22 | Elizabeth Hobbs | Reformatted |
| 0.7 | 2014-02-23 | Elizabeth Hobbs | Removed websites not being considered. Updated index mappings and search results. |
| 1.0 | 2014-02-24 | Paul Nelson | Top to bottom review and cleanup. |
| 1.1 | 2014-03-19 | Lisong Liu | Updated based on NARA review and feedback in DCRF of 3/11/14 |
| 1.2 | 2014-11-14 | Kristy Martin | Removed “Confidential to Search Technologies” text from the footer. |
| 1.3 | 2015-06-11 | Jose Hernandez, Kristy Martin | Added the remaining sites specified in DE-60; section updated: 2.1. Changed branding for system name throughout document. |
| 1.4 | 2015-07-24 | Kristy Martin | Updated content based on “DRCF_NAC R1P2 combined 1b_Increment 2 Design_IQS_Consolidated_7_21_15V1 (1).docx” |

1 Overview

This is the Data Model Design (DMD) for the NARA Web Sites data source, which includes archives.gov (and sub-sites such as blogs) and the presidential library web sites.

This document provides detailed documentation of all fields that come from these web sites and how they will be processed through the National Archives Catalog system. The DMD identifies and defines the following for the Web Sites data source:

Metadata elements parsing

Mapping of the metadata elements to the Search Engine Index fields

How the search results are formatted (brief results)

URLs indexed into the Catalog from web sites are unusual (when compared to other Catalog data sources) in that they do not participate in many aspects of the Catalog:

Web pages cannot be annotated (tags, comments, translations, or transcriptions)

There is no “content detail” for web pages.

o Instead the user is simply taken directly to the web page.

There are no custom advanced search fields for web pages.

Therefore, these sections will not be required in this DMD.

1.1 NARA Web Site Content

The main archives.gov web site is extremely valuable for introductory and reference material about the many resources of the National Archives, from detailed databases to exhibitions. These pages answer common questions of the general public and beginning searchers. In addition, blogs within the archives.gov domain also provide useful information and insights into current issues and activities of the National Archives and Records Administration.

Presidential Libraries are archives for preserving and making accessible papers, records, and other historical materials of U.S. Presidents and their administration for public study and discussion without regard for political considerations or affiliations. Presidential Libraries are important sources for historians, researchers and anyone studying our presidents and history.

Web pages indexed from these resources are presented as part of search results, allowing users easy access into this information.

1.2 What is a DMD?

The purpose of a Data Model Design document (DMD) is to document and describe all relevant data fields from a data source necessary to support all Catalog functionality. This metadata includes all data fields and their structure (nesting, type, number of values, etc.).

The DMD further describes how metadata values are transformed and stored within the Catalog. This careful accounting of data processing is required to gain a complete understanding of how every field is handled through the Catalog system.

Finally, the DMD also describes how metadata values are presented to the user, in the brief results, on the content detail page (aka the “full results”), from API calls, and in various metadata downloads.

The following diagram shows how metadata is processed and mapped for the Web Sites DMD:

[Diagram: Web Crawler → Extract Content (section 2) → Map to Search Engine Index (section 4) → Search Engine Index → Application Server → Map to Brief Results (section 6) → Brief Results]

Note that the major section numbers are maintained across all DMDs. So even though sections 3 and 5 are not required in this DMD, empty place-holders will remain so that numbering is consistent when comparing the Web Sites DMD with other DMDs for other data sources.

Specifically, this DMD includes the following:

Extracting metadata and text content from web pages.

o This includes key metadata fields (title, web area) as well as text content.

Index Representation (section 4)

o How the web site metadata fields are represented in the search engine indexes

Brief results presentation (section 6)

o Identifies how index fields are mapped to show the brief results for web pages.

1.3 Document Conventions

Since there are many different metadata fields for different purposes and from different systems, field mappings will be used throughout this document to clearly identify the originating source for every field, as follows:

| Abbrev | Description |
|---|---|
| WEB | This abbreviation will be used for any metadata from a web page. For example: WEB/title will represent the <title></title> field extracted from the web page. |
| I | Fields from the search engine index. For example, “I/title” will represent the title as it is stored in the search engine index for the record. |

2 Web Sites and Content Extraction

This section covers the web sites crawled by the Catalog web crawler and the metadata extraction and content processing/parsing required for each site.

2.1 Web Sites and Sub-Sites

The web sites and sub-sites to be crawled are listed below.

Root URL Web Area

Archives.Gov Web Site

http://www.archives.gov National Archives: Archives.gov

Major Areas within Archives.gov Web Site

http://www.archives.gov/about National Archives: About

http://www.archives.gov/calendar/ National Archives: Calendar

http://www.archives.gov/dc-metro/events/ National Archives: Calendar

http://www.archives.gov/contact/ National Archives: Contact

http://www.archives.gov/contracts/ National Archives: Doing Business with NARA

http://www.archives.gov/espanol/ National Archives: En Español

http://www.archives.gov/eeo/ National Archives: Equal Employment Opportunity Program

http://www.archives.gov/faqs/ National Archives: FAQs

http://www.archives.gov/fed-employees/ National Archives: Federal Employees

http://www.archives.gov/frc/ National Archives: Federal Records Centers

http://www.archives.gov/federal-register/ National Archives: Federal Register

http://www.archives.gov/forms/ National Archives: Forms

http://www.archives.gov/research/genealogy/ National Archives: Genealogy

http://www.archives.gov/grants/ National Archives: Grants

http://www.archives.gov/careers/ National Archives: Jobs, Internships & Volunteering

http://www.archives.gov/legislative/ National Archives: Legislative Branch

http://www.archives.gov/locations/ National Archives: Locations

http://www.archives.gov/congress/ National Archives: Members of Congress

http://www.archives.gov/research/military/ National Archives: Military Records

http://www.archives.gov/oig/ National Archives: Office of the Inspector General (OIG)

http://www.archives.gov/exhibits/ National Archives: Online Exhibits

http://www.archives.gov/open/ National Archives: Open Government at the National Archives

http://www.archives.gov/preservation/ National Archives: Preservation

http://www.archives.gov/presidential-libraries/ National Archives: Presidential Libraries

http://www.archives.gov/press/ National Archives: Press/Journalists

http://www.archives.gov/publications/prologue/ National Archives: Prologue Magazine

http://www.archives.gov/publications/ National Archives: Publications

http://www.archives.gov/records-mgmt/ National Archives: Records Management

http://www.archives.gov/research/ National Archives: Research

http://www.archives.gov/shop/ National Archives: Shop

http://www.archives.gov/education/ National Archives: Teacher’s Resources

http://www.archives.gov/veterans/ National Archives: Veterans

http://www.archives.gov/nae/ The National Archives Museum

http://www.911commission.gov National Commission on Terrorist Attacks upon the United States

http://www.9-11commission.gov National Commission on Terrorist Attacks upon the United States

http://www.fcic.gov Financial Crisis Inquiry Commission

http://www.federalregister.gov Federal Register

http://www.nara.gov National Archives

http://www.ofr.gov The Office of the Federal Register

http://www.ourdocuments.gov The Our Documents Initiative

http://www.presidentialtimeline.org The Presidential Timeline

Blogs

http://blogs.archives.gov National Archives: Blogs

http://blogs.archives.gov/nhprc National Archives Blog: Annotation / NHPRC

http://blog.archives.gov/aotus National Archives Blog: AOTUS Blog

http://blogs.archives.gov/carter-chronicle/ National Archives Blog: The Carter Chronicle

http://blogs.archives.gov/education/ National Archives Blog: Education Updates

http://blogs.archives.gov/foiablog/ National Archives Blog: FOIA Ombudsman

http://blogs.archives.gov/Innovation National Archives Blog: Inside Innovation

http://blogs.archives.gov/mediamatters/ National Archives Blog: Media Matters

http://blogs.archives.gov/online-public-access/ National Archives Blog: NARAtions

http://blogs.archives.gov/ndc/ National Archives Blog: National Declassification Center

http://blogs.archives.gov/prologue/ National Archives Blog: Prologue: Pieces of History

http://blogs.archives.gov/records-express/ National Archives Blog: Records Express

http://blogs.archives.gov/blackhistoryblog/ National Archives Blog: Rediscovering Black History Blog

http://blogs.archives.gov/TextMessage/ National Archives Blog: The Text Message

http://blogs.archives.gov/transformingclassification/ National Archives Blog: Transforming Classification

http://blogs.archives.gov/hoover-blackboard/ National Archives Blog: The Hoover Blackboard

Presidential Libraries

http://www.hoover.archives.gov Herbert Hoover Presidential Library & Museum

http://www.fdrlibrary.marist.edu Franklin D. Roosevelt Presidential Library and Museum

http://www.trumanlibrary.org Harry S. Truman Library & Museum

http://www.eisenhower.archives.gov Dwight D. Eisenhower Library, Museum, and Boyhood Home

http://www.jfklibrary.org John F. Kennedy Presidential Library and Museum

http://www.lbjlibrary.org LBJ Presidential Library

http://www.nixonlibrary.gov Nixon Presidential Library & Museum

http://www.fordlibrarymuseum.gov Gerald R. Ford Presidential Library & Museum

http://www.jimmycarterlibrary.gov Jimmy Carter Library & Museum

http://www.reagan.utexas.edu Ronald Reagan Presidential Library & Museum

http://bush41library.tamu.edu George Bush Presidential Library and Museum

http://www.clintonlibrary.gov/ William J Clinton Presidential Library & Museum

http://www.georgewbushlibrary.smu.edu George W. Bush Presidential Library and Museum

Web Sites Not Included:

The following web sites are not included because they are not presidential libraries and are not contained within the “archives.gov” web domain:

The Federal Register Blog: http://www.federalregister.gov/blog/

Our Archives Wiki: http://www.ourarchives.wikispaces.net/

2.2 Standard Metadata

The Catalog crawls the www.archives.gov website and the associated blogs and presidential libraries listed above, starting with the site root or specified start page. A set of metadata is extracted from each page. Then each page, plus its metadata, is indexed into the Catalog.

Note:

The “WEB/” prefix is used to identify all metadata fields which come from the Web crawler or are extracted from the HTML page. See section 1.3 above for more details.

The metadata to collect from each web page is listed below:

| Field | Type | From | Description |
|---|---|---|---|
| WEB/type | string | Mapping Table | “archivesWeb” or “presidentialWeb” |
| WEB/url | string | Crawler | URL of the web page. |
| WEB/mime | string | Text extraction or HTML page | Either “text/html” for all web pages, or the mime type as determined by text extraction. |
| WEB/title | string | HTML Page | This is the title of the webpage or the blog post. This metadata will be displayed to the user. If no title is found, the web area may be used instead, or for a blog, the blog title (title tag of blog web page). |
| WEB/area | string | URL Parsing and Mapping Table | The area of the web site to which a web page belongs. For example, with presidential libraries, this is the title of the library web site. For a blog, it is the title of the blog (AOTUS). Also, some sub-areas within archives.gov (such as “National Archives: Teacher’s Resources”) will be separately identified. |
| WEB/areaUrl | string | Mapping Table | The URL for the web area, such as the home page for the library. |
| WEB/date | string | URL or HTML Page | The date of the web page, if identifiable. |
| WEB/content | string | HTML Page | Content of the web page. Note that some web pages may have a descriptive tag to identify content. If this does not exist, it will be all of the non-tag text. |

Note that not every web site or web page will have all the above metadata. At a minimum the following are needed:

WEB/type

WEB/title

WEB/area

WEB/areaUrl

WEB/content
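The minimum-field rule above can be checked mechanically before a page is indexed. The following is a minimal sketch, assuming the extracted metadata is held in a plain dictionary keyed by field name without the “WEB/” prefix; the function name and the dictionary layout are illustrative assumptions, not part of the Catalog design.

```python
# Field names come from the minimum list above; the dict layout is an assumption.
REQUIRED_FIELDS = ("type", "title", "area", "areaUrl", "content")


def missing_required_fields(web: dict) -> list:
    """Return the required WEB/ fields that are absent or empty, in list order."""
    return [name for name in REQUIRED_FIELDS if not web.get(name)]


page = {
    "type": "archivesWeb",
    "title": "Teachers' Resources",
    "area": "National Archives: Teacher's Resources",
    "areaUrl": "http://www.archives.gov/education/",
    "content": "Resources for Teachers",
}
print(missing_required_fields(page))  # an empty list means the page qualifies
```

WEB/url, WEB/mime and WEB/date are deliberately left out of the check because the text above does not list them as required.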

2.3 Metadata Parsing for the Archives.gov Site

This section provides information on how the standard metadata may be obtained, followed by some examples. For the title, different tags may be appropriate if the main <title> tag is either missing or not informative for a subsection of archives.gov.

Field: Tags / Mapping / Parsing Instructions

WEB/type: “archivesWeb”

WEB/url: URL of web page (from crawler).

WEB/mime: “text/html”

WEB/title: The title will be extracted from any of the following tags:
<title>
<meta name="description">
<h1>
URL file name
The tags are in priority order. For example, if no title can be extracted from <title>, it will be extracted from <meta name="description">.

WEB/area: This will come from the “Web Area” column of the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler. For example, the URL “http://www.archives.gov/oig/investigations.html” matches “http://www.archives.gov/oig/” from the table in section 2.1. Therefore, the “WEB/area” field will be set to “National Archives: Office of the Inspector General (OIG)”.

WEB/areaUrl: This will come from the “Root URL” column of the same row, that is, the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/date: From the <meta name="date"> tag. Parsed as YYYY-MM-DD.

WEB/content:
If both <!-- startindex --> and <!-- stopindex --> exist: index all text content between the two tags.
If only <!-- startindex --> exists: index all text content from startindex to the end of the web page.
Otherwise: index all text content on the web page.
Notes:
1. Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
2. Ignore all content found between <script> and </script>.
3. Do not index the content of <!-- XML comments -->.
See below for an example.

Example of startindex & stopindex:

This example is from the top level landing page at http://www.archives.gov/education.

<head>
  <title>Teachers' Resources</title>
  <meta name="description" content="Resources for Teachers" />
  <meta name="date" content="2013-08-21" />
  ...
<body>
<div id="container">
  <div id="header">
  ...
  <div id="container2">
  <div id="content">
  <!-- startindex -->
  .
  . Content to index occurs in here
  .
  <!--stopindex-->
  <div class="menu connect">
  ...
</body>
</html>
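The WEB/content rules for archives.gov (startindex/stopindex markers, no scripts, no comments, no tags) can be sketched as follows. This is an illustration under stated assumptions rather than the Catalog's implementation: it uses regular expressions instead of a real HTML parser, accepts the markers with or without inner spaces (both spellings appear in the example above), and does not decode HTML entities. The function name is hypothetical.

```python
import re

# Marker comments, tolerant of optional spaces: <!-- startindex --> or <!--stopindex-->
START = re.compile(r"<!--\s*startindex\s*-->", re.IGNORECASE)
STOP = re.compile(r"<!--\s*stopindex\s*-->", re.IGNORECASE)
SCRIPT = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def extract_content(html: str) -> str:
    """Return the indexable text of a page following the WEB/content rules."""
    start = START.search(html)
    if start:
        rest = html[start.end():]
        stop = STOP.search(rest)
        # Both markers: text between them. Only startindex: text to end of page.
        region = rest[:stop.start()] if stop else rest
    else:
        # No startindex marker: the whole page is the candidate region.
        region = html
    region = SCRIPT.sub(" ", region)   # note 2: drop <script> blocks
    region = COMMENT.sub(" ", region)  # note 3: drop comment content
    region = TAG.sub(" ", region)      # note 1: drop tags and their attributes
    return " ".join(region.split())    # collapse whitespace


sample = (
    '<html><body><div id="header">Menu</div><!-- startindex -->'
    "<p>Hello <b>world</b></p><script>var x = 1;</script><!-- hidden -->"
    "<!--stopindex--><div>Footer</div></body></html>"
)
print(extract_content(sample))
```

A production crawler would more likely rely on an HTML parser, since regular expressions can misread malformed markup; the sketch only makes the order of the rules concrete.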

2.4 Metadata Parsing for NARA Blogs

The majority of the NARA blogs have a consistent format. The exception is the AOTUS blog. The following table shows how to obtain the standard metadata.

Field: Tags / Mapping / Parsing Instructions

WEB/type: “archivesWeb”

WEB/url: URL of web page (from crawler).

WEB/mime: “text/html”

WEB/title: For all blogs, the <title> tag will supply the title. However, all blog titles contain both the title of the blog and the title of the article. For example:
AOTUS: Collector in Chief | Calling All Walt Whitman Fans
To improve accuracy, we will remove the title of the blog from the above, leaving only “Calling All Walt Whitman Fans”. The title of the blog itself will be captured in the WEB/area field below.
For the AOTUS blog, extract the blog article title from the text after the “|” pipe character.
For all other blogs, extract the blog article title from the text after the “&raquo;” character.
If the delimiter specified above does not occur in the title, take the entire <title> content as the title.

WEB/area: This will come from the “Web Area” column of the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/areaUrl: This will come from the “Root URL” column of the same row, that is, the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/date: Parse the date from the text of the blog:
AOTUS: <p class="meta">Written on February 14, 2014 |
Other blogs: <span class="gray">on February 3, 2014</span>
See the examples in sections 2.4.1 and 2.4.2.

WEB/content: There appears to be no standard for marking the content area of blogs. Therefore, the entire text content of the HTML page will be indexed as the content.
Notes:
1. Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
2. Ignore all content found between <script> and </script>.
3. Do not index the content of <!-- XML comments -->.

2.4.1 Parsing Date for AOTUS Blog

Dates for this blog can be identified with <p class="meta"> as shown below.

<h2><a href="..">Happy Valentine’s Day</a></h2>
<p class="meta">Written on February 14, 2014 | <a href="…">
<p><a href="..."><p>This 1918 valentine refers to the World War I effort…</div>-->

2.4.2 Metadata Parsing for other National Archives Blogs

Dates for other blogs occur after the <span class="gray"> tag.

<div class="post"><div><img alt="...">
<h2><a href="..." title="National Declassification Center Completes Quality Assurance of Backlog Final">National Declassification Center Completes Quality Assurance of Backlog Final</a></h2>
<div class="small">by <a href="...">Nancy Soderberg</a> <span class="gray">on February 3, 2014</span></div>
</div></div></br /><br />
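The blog title and date rules in section 2.4 can be sketched as below. The sketch rests on several assumptions that are not stated in this document: the raw <title> text may still contain the “&raquo;” entity (so it is decoded first), the title is split at the first occurrence of the delimiter, and the parsed date is emitted as YYYY-MM-DD to match the archives.gov date format. Function names are hypothetical.

```python
import html
import re
from datetime import datetime


def blog_article_title(raw_title: str, is_aotus: bool) -> str:
    """Strip the blog name from a <title> value, keeping the article title."""
    text = html.unescape(raw_title)              # turns "&raquo;" into the raquo character
    delimiter = "|" if is_aotus else "\u00bb"    # "\u00bb" is the decoded &raquo;
    _, found, after = text.partition(delimiter)  # split at the first delimiter
    return after.strip() if found else text.strip()


# AOTUS pattern first, then the pattern used by the other blogs.
DATE_PATTERNS = (
    re.compile(r'<p class="meta">\s*Written on ([A-Z][a-z]+ \d{1,2}, \d{4})'),
    re.compile(r'<span class="gray">\s*on ([A-Z][a-z]+ \d{1,2}, \d{4})'),
)


def blog_date(page_html: str):
    """Return the post date as YYYY-MM-DD, or None when no date is found."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(page_html)
        if match:
            try:
                # %B expects English month names under the default C locale.
                parsed = datetime.strptime(match.group(1), "%B %d, %Y")
            except ValueError:
                continue
            return parsed.date().isoformat()
    return None


print(blog_article_title("AOTUS: Collector in Chief | Calling All Walt Whitman Fans", True))
print(blog_date('<span class="gray">on February 3, 2014</span>'))
```

Returning None for a missing date leaves the caller free to store a blank WEB/date, which is how the other sections treat an absent date.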

2.5 Presidential Libraries

The page format of these libraries is unfortunately not consistent. The following strategy provides a set of tags to check to obtain the metadata.

In some cases, the library web pages’ title tag is just the official name of the library and is the same for the various pages. In this case, if possible, a different tag should be used to obtain the title of the specific page.

Field: Tags / Mapping / Parsing Instructions

WEB/type: “presidentialWeb”

WEB/url: URL of web page (from crawler).

WEB/mime: “text/html”

WEB/title: The title will be extracted from any of the following tags:
<title>
<h1>
<h2>
<meta name="description">
<meta name="keywords">
URL file name
The tags are in priority order. For example, if no title can be extracted from <title>, it will be extracted from <h1>.

WEB/area: This will come from the “Web Area” column of the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler. For example, the URL “http://www.trumanlibrary.org/whistlestop/wedding.htm” matches “http://www.trumanlibrary.org/” from the table in section 2.1. Therefore, the “WEB/area” field will be set to “Harry S. Truman Library & Museum”.

WEB/areaUrl: This will come from the “Root URL” column of the same row, that is, the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/date: From the <meta name="date"> tag. Parsed as YYYY-MM-DD. If the tag does not exist, then leave the date blank.

WEB/content:
If both <!-- startindex --> and <!-- stopindex --> exist: index all text content between the two tags.
If only <!-- startindex --> exists: index all text content from startindex to the end of the web page.
Otherwise: index all text content on the web page.
Notes:
1. Most presidential libraries do not use this convention.
2. Only index text content. Do not index HTML tags, HTML attributes, or HTML attribute values.
3. Ignore all content found between <script> and </script>.
4. Do not index the content of <!-- XML comments -->.
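The prefix rule used for WEB/area throughout section 2 amounts to a longest-prefix match against the table in section 2.1. A minimal sketch follows, using a handful of rows copied from that table. Two points are assumptions made for illustration: WEB/areaUrl is taken to be the matching Root URL, and matching is a plain case-sensitive string prefix test. The function name is hypothetical.

```python
# A few rows from the section 2.1 table: Root URL -> Web Area.
WEB_AREAS = {
    "http://www.archives.gov": "National Archives: Archives.gov",
    "http://www.archives.gov/oig/": "National Archives: Office of the Inspector General (OIG)",
    "http://www.archives.gov/research/": "National Archives: Research",
    "http://www.archives.gov/research/military/": "National Archives: Military Records",
    "http://www.trumanlibrary.org": "Harry S. Truman Library & Museum",
}


def lookup_area(url: str):
    """Return (WEB/area, WEB/areaUrl) for the longest matching root, or None."""
    matches = [root for root in WEB_AREAS if url.startswith(root)]
    if not matches:
        return None
    root = max(matches, key=len)  # the longest matching prefix wins
    return WEB_AREAS[root], root


print(lookup_area("http://www.archives.gov/oig/investigations.html"))
print(lookup_area("http://www.trumanlibrary.org/whistlestop/wedding.htm"))
```

Choosing the longest prefix is what lets a page under /research/military/ resolve to “National Archives: Military Records” rather than the broader Research or Archives.gov areas.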

2.6 Processing Non-HTML content

All non-HTML files will be processed as follows:

All media files (video, images, audio) – Skipped

Document files (PDF, MS-Word, etc.) – Will be processed through text extraction

o See below

2.6.1 Text Extraction

All document files (PDF, MS-Word, etc.) will be run through text extraction to extract metadata and text content.

The tool for text extraction will likely be Apache Tika, depending on the results of the Analysis of Alternatives currently in progress. See https://tika.apache.org/ for more information.

2.6.2 Metadata Mapping for Non-HTML Content

The output of Apache Tika will be mapped to the WEB/ fields as follows:

Field: Tags / Mapping / Parsing Instructions

WEB/type: If the URL contains “archives.gov” this will be “archivesWeb”. Otherwise it will be “presidentialWeb”.

WEB/url: URL of web page (from crawler).

WEB/mime: From <meta name="Content-Type" content=". . ."/>

WEB/title: The title will be extracted from any of the following tags:
<title>
<h1>
File name
The tags are in priority order. For example, if no title can be extracted from <title>, it will be extracted from <h1>.

WEB/area: This will come from the “Web Area” column of the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/areaUrl: This will come from the “Root URL” column of the same row, that is, the longest URL in the table in section 2.1 above that matches the prefix of the WEB/url field from the crawler.

WEB/date: The date will come from any of the following tags:
<meta name="Last-Modified" content=". . ."/>
<meta name="date" content=". . ."/>
<meta name="Creation-Date" content=". . ."/>
The tags are listed in priority order. The date will be found in the @content attribute. The date will be in ISO 8601 format. If no date exists, then leave the date blank.

WEB/content: All text content produced by Apache Tika (minus XHTML tags and XML comments) will be indexed as the text content.
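The priority rules for non-HTML documents can be sketched as a small mapping function. The sketch does not call Apache Tika: the `meta` dictionary is a hypothetical stand-in for the name/content pairs (plus the <title> and <h1> text) already read from the extractor's output, and the function name and sample URL path are likewise illustrative.

```python
import posixpath
from urllib.parse import urlparse

# Date tags in the priority order given in the table above.
DATE_KEYS = ("Last-Modified", "date", "Creation-Date")


def map_document(url: str, meta: dict) -> dict:
    """Map extractor output for a non-HTML document to WEB/ fields."""
    file_name = posixpath.basename(urlparse(url).path)
    # Title priority: <title>, then <h1>, then the file name.
    title = meta.get("title") or meta.get("h1") or file_name
    # First non-empty date in priority order; blank when none exists.
    date = next((meta[key] for key in DATE_KEYS if meta.get(key)), "")
    return {
        "type": "archivesWeb" if "archives.gov" in url else "presidentialWeb",
        "url": url,
        "mime": meta.get("Content-Type", ""),
        "title": title,
        "date": date,
    }


doc = map_document(
    "http://www.archives.gov/files/report.pdf",  # hypothetical path
    {
        "Content-Type": "application/pdf",
        "date": "2014-01-02T00:00:00Z",
        "Creation-Date": "2013-05-01T00:00:00Z",
    },
)
print(doc["title"], doc["date"])
```

WEB/area, WEB/areaUrl and WEB/content are omitted here because they follow the same prefix and text rules described for HTML pages.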

3 File Processing

Beyond the metadata parsing described in section 2, there is no file processing required for the Web Sites data source.

This section is retained to ensure that major section numbers are consistent across all Data Model Design (DMD) documents.

4 Mapping to Index Fields

The following table maps the metadata to the appropriate index field. The metadata elements are not given here because in some cases they differ based on the particular web site they originate from. See the mappings of document tags to metadata given in section 2.

4.1 Field Metadata Mappings

| Index Field | WEB Metadata Name | Purpose |
|---|---|---|
| I/url | WEB/url | results |
| I/source | “web” | search |
| I/type | WEB/type | search |
| I/oldScope | If WEB/type == “archivesWeb”: set to “archives.gov”. If WEB/type == “presidentialWeb”: set to “presidential”. | search, facet |
| I/iconType | WEB/mime (normalized) | results |
| I/fileFormat | WEB/mime (normalized) | results |
| I/originalMimeType | WEB/mime | results |
| I/tabType | “all,online,web” | results |
| I/materialsType | “web” | results |
| I/title | WEB/title | results, search |
| I/titleSort | WEB/title with articles and prepositions from the start removed. | sorting |
| I/allTitles | WEB/title | results, search |
| I/webArea | WEB/area | results |
| I/webAreaUrl | WEB/areaUrl | results |
| I/content | WEB/content | results, search, teasers |
| I/teaser | The first 500 characters of WEB/content | results |
| I/titleDate | WEB/date | results |
| I/dateRangeFacet | Choose the appropriate date range facet which covers the WEB/date value. Leave blank if WEB/date is blank. | facets |
| I/productionDate | WEB/date | results, sorting |
| I/productionDateQualifier | “YMD” | results |

Refining by location should only return archival records, not web page results.
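Most of the field mappings in section 4.1 are direct copies or constants, so they can be sketched as a single function. The list of leading words stripped for I/titleSort is an assumption, since this document does not enumerate the articles and prepositions; the mime-normalized fields and I/dateRangeFacet are left out because their rules are not specified here. Function names and the dictionary layout are hypothetical.

```python
# Assumed stop list for I/titleSort; the document does not enumerate it.
LEADING_WORDS = {"a", "an", "the", "of", "in", "on", "to", "for", "at", "by"}

OLD_SCOPE = {"archivesWeb": "archives.gov", "presidentialWeb": "presidential"}


def title_sort(title: str) -> str:
    """Drop leading articles and prepositions; keep the title if nothing is left."""
    words = title.split()
    while words and words[0].lower() in LEADING_WORDS:
        words.pop(0)
    return " ".join(words) or title


def to_index(web: dict) -> dict:
    """Map WEB/ metadata (keys without the prefix) to I/ index fields."""
    date = web.get("date", "")
    return {
        "url": web["url"],
        "source": "web",
        "type": web["type"],
        "oldScope": OLD_SCOPE[web["type"]],
        "originalMimeType": web.get("mime", ""),
        "tabType": "all,online,web",
        "materialsType": "web",
        "title": web["title"],
        "titleSort": title_sort(web["title"]),
        "allTitles": web["title"],
        "webArea": web["area"],
        "webAreaUrl": web["areaUrl"],
        "content": web["content"],
        "teaser": web["content"][:500],  # first 500 characters
        "titleDate": date,
        "productionDate": date,
        "productionDateQualifier": "YMD",
    }


print(title_sort("The Text Message"))
```

Falling back to the original title when every word is stripped avoids an empty sort key, which is a design choice of this sketch rather than a stated requirement.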

4.2 Mapping to Keywords Relevancy Model

Web page metadata will be mapped to the Catalog relevancy model as follows. Note that all of the fields specified (grank1, grank2, grank3, and content) will be searched by all “q=” parameters.

When multiple WEB/ fields are mapped to the same relevancy field, all of their content will be concatenated together into the same field and searched together.

| WEB/ field | Relevancy Field |
|---|---|
| WEB/title | I/grank2 |
| WEB/url | I/grank3 |
| WEB/area | I/content |
| WEB/content | I/content |
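The concatenation rule for relevancy fields can be sketched as follows. The separator (a single space) and the dictionary layout are assumptions; the document only states that fields mapped to the same relevancy field are concatenated and searched together. The web area is read from the “area” key, matching the WEB/area field defined in section 2.2.

```python
# (WEB/ field, relevancy field) pairs in table order.
RELEVANCY_MAP = (
    ("title", "grank2"),
    ("url", "grank3"),
    ("area", "content"),
    ("content", "content"),
)


def relevancy_fields(web: dict) -> dict:
    """Concatenate WEB/ values that share a relevancy field, in table order."""
    collected = {}
    for web_field, target in RELEVANCY_MAP:
        value = web.get(web_field)
        if value:  # skip missing or empty fields
            collected.setdefault(target, []).append(value)
    return {target: " ".join(values) for target, values in collected.items()}


fields = relevancy_fields({
    "title": "Teachers' Resources",
    "url": "http://www.archives.gov/education",
    "area": "National Archives: Teacher's Resources",
    "content": "Resources for Teachers",
})
print(fields["content"])
```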

5 Search Fields

There are no special fields for advanced search over web sites.

6 Search Results Presentation

This section details the search results presentation for web site results.

For simple keyword searches, the top three web site matches will be returned. They are returned as a group in the search results list, under “National Archives website pages”.

6.1 Search Results (aka “Brief Results” Display)

Only the top 3 website results are shown, grouped under the heading “National Archives website pages”. The globe icon is used.

This table describes the 3 lines that make up each web site result within the group.

| Line | Solr Fields & Pattern |
|---|---|
| 1 | {I/title} LINK: I/url |
| 2 | {I/webArea} LINK: I/webAreaUrl |
| 3 | {Highlighted teaser from the search engine, or I/teaser} [Note: Display first 200 characters as snippet in the result] |

Notes:

1. “National Archives website pages” links to a tab with all the web sites search results.

2. The globe icon is a link to the same tab.
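The three-line brief result and the top-3 grouping can be sketched as below. The data structures are illustrative assumptions: index documents are plain dictionaries keyed by I/ field name without the prefix, the highlighted teaser is treated as plain text so that it can be cut at 200 characters, and the function names are hypothetical.

```python
GROUP_HEADING = "National Archives website pages"


def brief_result(doc: dict, highlighted: str = None) -> list:
    """Build the 3 display lines for one web page result."""
    # Line 3 prefers the search engine's highlighted teaser over I/teaser.
    snippet = highlighted or doc.get("teaser", "")
    return [
        {"text": doc["title"], "link": doc["url"]},
        {"text": doc["webArea"], "link": doc["webAreaUrl"]},
        {"text": snippet[:200], "link": None},  # first 200 characters only
    ]


def web_group(results: list) -> dict:
    """Group the top three web page results under the shared heading."""
    return {
        "heading": GROUP_HEADING,
        "items": [brief_result(doc) for doc in results[:3]],
    }


example = {
    "title": "Teachers' Resources",
    "url": "http://www.archives.gov/education",
    "webArea": "National Archives: Teacher's Resources",
    "webAreaUrl": "http://www.archives.gov/education/",
    "teaser": "Resources for Teachers",
}
print(web_group([example])["items"][0][0]["text"])
```

If the highlighted teaser carries markup, truncation would need to respect tag boundaries, which this sketch does not attempt.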

7 Content Details Presentation

The links in the search results go directly to the external website and therefore there is no content detail page for Archives.gov within the Catalog.
