GPO FDsys – Search Engine Configuration: Data Analysis and Parsing (October 2, 2015)


April 19, 2023

GPO FDsys – Search Engine Configuration

Data Analysis and Parsing


Agenda:

• Data Management Definition

• Parsing

• fdsys.xml


Data Management Definition (DMD)

• Purpose of the Data Management Definition (DMD)
  – Define collection-specific metadata elements
  – Specify roles for the granules, if applicable
  – Collection-specific schema definition for FDsys.xsd
  – Define mappings of metadata elements for Documentum and FAST
  – Define mappings to metadata standards
• One DMD for each collection
• PMO & dev team collaborative effort for CDM documentation development
• Is both a document and a process


The DMD Defines how Data Flows Through FDsys

[Figure: Business Process Overview. Diagram labels: Submission (Ingest Process, Congressional Submission Workflow (folder), Congressional Submission Workflow (interactive), Migration Application, Bulk Submission Process, Submission Process); Preservation (Archival Processing Workflow, Archival Updating Workflow); Access (Public User Access & Delivery Application, Authorized User Access & Delivery Application); Processing (Package Updating Workflow, Access Processing Workflow, Publishing Process, ILS Integration Application).]

Questions annotated on the diagram:

• What renditions are available?
• How will metadata be extracted and merged?
• What manual edits may be required?
• How are PDF files processed?
• How will the HTML rendition be created?
• How will parser data and input files be validated?
• What’s on the search form?
• How will the content and metadata be indexed?
• What are the navigators?
• How will the MODS be created?
• How are search results formatted?
• What do content URLs look like?


DMD – Table of Contents

1. General Description

2. fdsys.xml Schema Elements

3. Renditions, Plant Processing and Interactions

4. Parser Definition – Extraction patterns and algorithms

5. Content Management

6. Content Publishing and Index

7. Search and Browse
   • Search results, navigators, and collection browsing

8. Content Delivery
   • URLs, content-detail, Front page, actions

9. mods.xml mappings

Metadata Flow Diagram

[Figure: Metadata Flow Diagram. Labeled elements: Submission, Original Content, Parse, fdsys.xml, Documentum (validate, cleanup, normalize, and extend metadata and renditions), publish, ACP Cache (fdsys.xml, content file(s)), .xslt transforms, FAST XML (index.xml, mods.xml), Index Push, FAST Notification, FAST indexes, Index, Search, map fields [per collection], map navigators [per collection], normalize data and map to FQL, MODS XML, PREMIS. Outputs sketched: Search Results, Content Detail, Package TOC, Search Form, Collection Browsing.]

[The following slide repeats the same diagram, annotated with these labels: parsing rules, CMS metadata mapping, fdsys.xml structure, search index field mapping, MODS mapping, content-detail mapping, search-form mapping, search results mapping, browse algorithm.]


Federal Register Granules

• Each article is a granule

• Each Part is a single granule

• There are no higher-level granules

• Sections are not preserved as independent granules


Federal Register Example Metadata

[Figure: sample Federal Register document with these metadata elements labeled: agencies, title, action, summary, dates, contact, FR Doc Number, Billing Code.]


Content Files

[Figure: Content Files. Columns: Input Files and Renditions. Labels: SGML, locator, CDTP, text, pdf-submitted, pdf (public). Processing steps shown: extract granules (text), extract granules (PDF), OCR embedded images, and create “FrontMatter”, “ReaderAids”, and “Issue” PDF files.]


Content Files – Creating the HTML Rendition

[Figure: creating the HTML rendition. Labels: text, pdf-submitted, images, longdesc text, html, html (public). Processing steps shown: add HTML headers and header metadata; add URL and e-mail links; extract images as JPEG; OCR images; embed image tags.]


Extracting Metadata

[Figure: extracting metadata. Labels: SGML content, CDTP, SGML TOC, parse (on each input), overwrite, add, Merged Metadata, (TOC headings).]

• Metadata is merged based on the FR Doc Number
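
The merge on FR Doc Number can be pictured with a small sketch. The field names (`frDocNumber`, `title`, `tocHeading`) and the rule for which source wins are assumptions made for illustration; the slide only shows "overwrite" and "add" arrows into the merged metadata.

```python
def merge_by_fr_doc_number(content_records, toc_records):
    """Merge two lists of metadata dicts keyed on the FR Doc Number.

    Assumption (not stated on the slide): values parsed from the SGML
    content overwrite TOC values for the same field, and fields found
    only in the TOC are added to the merged record.
    """
    merged = {rec["frDocNumber"]: dict(rec) for rec in toc_records}
    for rec in content_records:
        merged.setdefault(rec["frDocNumber"], {}).update(rec)
    return merged


content = [{"frDocNumber": "E6-1423", "title": "Full title from content"}]
toc = [{"frDocNumber": "E6-1423", "title": "Short title", "tocHeading": "Notices"}]
result = merge_by_fr_doc_number(content, toc)
print(result["E6-1423"])
```

Keying both inputs on the same identifier keeps the merge independent of the order in which the two parses finish.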


Search Results

73 FR 22020 - Title I-Improving the Academic Achievement of the Disadvantaged [PDF 123 KB] Federal Register. Proposed Rules. Notice of proposed rulemaking. RIN 0324-AJ10. Wednesday, April 23, 2008. ...The Secretary proposes to amend the regulations governing programs administered under Part A of Title I of the Elementary and Secondary Education Act of 1965, as amended (ESEA)... More Information...

Fields labeled in the sample result: volume, firstpage, title, collection, section, action (first 20 chars), rin, publishdate, teaser, link to content-detail


FR Navigators

• Section

• Agency

• CFRs
  – Hierarchical
    + 15 CFR
      - Part 12
      - Part 13
      - Part 14
    + 16 CFR
      - Part 412
      - Part 413


Collection Browsing

[Figure: collection browsing screen with these navigators labeled: daynav, monthnav, yearnav, agencynav.]


Advanced Search Form


Package-Level URLs

• Package Content Detail
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/content-detail.html
• Package Metadata Standards
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/mods.xml
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/premis.xml
• Package Table of Contents
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/toc.html
• Today’s Table of Contents
  – http://www.gpo.gov/fdsys/html/FR/todays_toc.html


Granule-Level URLs

• HTML and PDF Files
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/html/E6-1423.html
  – http://www.gpo.gov/fdsys/pkg/FR-2006-01-01/pdf/E6-1423.pdf
• Granule Content Detail
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/content-detail.html
• Granule Metadata Standards
  – http://www.gpo.gov/fdsys/granule/FR-2006-01-01/E6-1423/mods.xml
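
The package-level and granule-level URL patterns listed on these two slides are regular enough to express as small helper functions. The function names below are hypothetical; only the URL shapes come from the slides.

```python
BASE = "http://www.gpo.gov/fdsys"


def package_url(package_id, resource):
    """Package-level resource, e.g. content-detail.html, mods.xml, toc.html."""
    return "{}/pkg/{}/{}".format(BASE, package_id, resource)


def granule_file_url(package_id, granule_id, fmt):
    """Granule rendition file; fmt is 'html' or 'pdf' in the slide examples."""
    return "{}/pkg/{}/{}/{}.{}".format(BASE, package_id, fmt, granule_id, fmt)


def granule_url(package_id, granule_id, resource):
    """Granule-level resource, e.g. content-detail.html or mods.xml."""
    return "{}/granule/{}/{}/{}".format(BASE, package_id, granule_id, resource)


print(package_url("FR-2006-01-01", "mods.xml"))
print(granule_file_url("FR-2006-01-01", "E6-1423", "pdf"))
print(granule_url("FR-2006-01-01", "E6-1423", "content-detail.html"))
```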


Content Detail

Sample UI


Parsing


Parsing Overview

• Runs regular expressions to extract metadata
  Regular Expression: (Public Law|Pub. L.|PL|P. L.) (1[0-9][0-9])-([0-9]+)
  Example: Pub. L. 109-130
  Produces: <law congress="109" number="130"/>
• Written in Java
• Called from Documentum when a package needs to be parsed
• Produces an instance of fdsys.xml
  – Parsing has an internal XML format (called the “raw” XML) which is transformed to produce the fdsys.xml
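
A minimal sketch of the regular-expression step, written in Python for brevity even though the parser itself is described as Java. The pattern is adapted from the slide with the periods escaped, so that "Pub. L." matches a literal period rather than any character; the function name is an assumption.

```python
import re

# Pattern adapted from the slide; dots are escaped here (an adjustment,
# not the slide's exact text) so they match only a literal period.
PUBLIC_LAW = re.compile(
    r"(Public Law|Pub\. L\.|PL|P\. L\.) (1[0-9][0-9])-([0-9]+)"
)


def extract_laws(text):
    """Return one <law .../> element string per public-law citation found."""
    return [
        '<law congress="{}" number="{}"/>'.format(m.group(2), m.group(3))
        for m in PUBLIC_LAW.finditer(text)
    ]


print(extract_laws("as amended by Pub. L. 109-130"))
```

Note that `1[0-9][0-9]` restricts matches to the 100th through 199th Congresses, so earlier citations are deliberately not picked up by this pattern.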


Parser Foundation Classes

[Figure: class diagram. Foundation classes: PContainer, PParser, PPackage, PRendition, PFile, PGranule. Derived classes shown: USCODEParser, USCODEPackage, USCODERendition, USCODEFile, USCODEGranule, FRRendition, FRFile.]

• Foundation classes handle 95% of parsing needs
• Derived classes handle all special cases


PContainer

• Takes patterns and produces elements
• Holds XML at each level of the parsing process

[Figure: a PPattern, for example "(Public Law|Pub. L.|P. L.) (1[0-9][0-9])-([0-9]+)", is used by the PContainer and produces an XML fragment, for example <publicLaw><congressNum>109</congressNum><lawNum>123</lawNum></publicLaw>, which is stored in the container's XML DOM.]
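
The pattern-to-element relationship can be sketched as follows. `Pattern` and `Container` are hypothetical stand-ins for PPattern and PContainer, written in Python rather than the parser's Java, and the `granule` root element is invented for the example.

```python
import re
import xml.etree.ElementTree as ET


class Pattern:
    """Hypothetical stand-in for PPattern: a regex plus the element names it fills."""

    def __init__(self, regex, element, children):
        self.regex = re.compile(regex)
        self.element = element
        self.children = children  # one child element name per captured group

    def produce(self, text):
        fragments = []
        for match in self.regex.finditer(text):
            node = ET.Element(self.element)
            for name, value in zip(self.children, match.groups()):
                ET.SubElement(node, name).text = value
            fragments.append(node)
        return fragments


class Container:
    """Hypothetical stand-in for PContainer: applies patterns, stores fragments in a DOM."""

    def __init__(self, root_name):
        self.dom = ET.Element(root_name)

    def apply(self, pattern, text):
        self.dom.extend(pattern.produce(text))


public_law = Pattern(
    r"(?:Public Law|Pub\. L\.|P\. L\.) (1[0-9][0-9])-([0-9]+)",
    "publicLaw",
    ["congressNum", "lawNum"],
)
granule = Container("granule")
granule.apply(public_law, "See Pub. L. 109-123 for details.")
print(ET.tostring(granule.dom, encoding="unicode"))
```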


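Parser Foundation Classes

[Figure: PParser, PPackage, PRendition, PFile, and PGranule, each built on PContainer and each holding an XML DOM. Instances shown: one PRendition, two PFile, two PGranule, each with its own XML. Operations labeled between levels: append, append, priority merge. Final step: XSLT producing xml.]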


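Parsing XML Documents

[Figure: PParser, PPackage, PRendition, and PFile, each built on PContainer and each holding an XML DOM. Instances shown: one PRendition and two PFile, each with its own XML. Operations labeled between levels: append, priority merge. Inputs and outputs labeled: bills.xml, XSLT, XSLT, fdsys.xml.]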


Other Parsing Considerations

• Heuristics testing is integrated into the parsing
  – PEHelper: Checks for heuristics and adds “quality=” attributes
• Output can be automatically Schema-Validated
  – Schema-Validation is run on all fdsys.xml formats produced by the parser
• Parser Validation Tool
  – Used by GPO to validate that parsers meet the 90% Service Level Agreement for accuracy
  – Randomly selects 100 documents or granules
  – Displays metadata & original text for manual review
  – Produces Validation Report
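
The sampling and pass/fail check behind the Parser Validation Tool might look like the sketch below. How accuracy is actually scored is not stated on the slide; this assumes one correct/incorrect judgment per reviewed item, and both function names are hypothetical.

```python
import random


def select_sample(item_ids, sample_size=100, seed=None):
    """Randomly pick up to sample_size distinct documents or granules for review."""
    rng = random.Random(seed)
    ids = list(item_ids)
    return rng.sample(ids, min(sample_size, len(ids)))


def meets_sla(review_results, threshold=0.90):
    """review_results: booleans, True when the reviewer judged the metadata correct."""
    results = list(review_results)
    if not results:
        return False
    return sum(results) / len(results) >= threshold


sample = select_sample(range(1000), seed=1)
print(len(sample), meets_sla([True] * 90 + [False] * 10))
```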


fdsys.xml

June 13, 2008

GPO FDsys – Data Model

FDsys.xml Purpose

• Internal container of metadata related to package

• Is a detailed representation/model of the data structure across all of FDsys

• Reduces duplication of data across metadata formats

• Reduces number of required transformations

• Can be transformed into standard schemas including:
  – METS
  – MODS
  – PREMIS

FDsys.xml General Structure

Header

Content

Metadata


FDsys Publish and Search


Publish and Search

Agenda:

• FDsys Publish

• Search Engine Configuration

• Search Engine Application Services


FDsys Publish


High-Level SW Components

Submission Component
- Submission Workflows
- WebTop Submission User interfaces
- Content Parsers
- Migration Tool

Ingest Component

Content Processing
- Processing Workflows
- WebTop User Interfaces
- Package Management
- ILS Integration

Archive Preservation
- Archival Workflows
- WebTop User Interfaces
- Preservation Process

Access Component
- Full-Fledged Search Application
- Full Text Search Engine
- Public Content Access and Delivery

Infrastructure Component
- COTS-based LDAP Integration


Content Publishing - Overview

• Communicates from Documentum to Access
• From: Documentum
  – Extract fdsys.xml & premis.xml
  – Extract renditions and content files
  – Uses Documentum native DFC calls
• To: ACP Cache
  – Stores metadata and content files
• To: FAST ESP Search Engine
  – Converts fdsys.xml to FAST.xml -> to indexer
    • Includes the mods.xml (indexed into ESP)
    • ESP pulls in content files automatically
  – Uses FAST ESP content_api & search_api calls


Component Interfaces

[Figure: component interfaces. Labels: Package Updating Workflow, Access Processing Workflow, Content Publishing, FAST, ACP Cache, CMS Access, Web Application. The slide is marked "UPDATE THIS".]


Component Interfaces

[Figure: component interfaces. Labels: Content Management System, Content Publishing, FAST, ACP Cache. Connections labeled: pull, push, HTTP Commands.]


Major Architectural Decisions

• Pull from Documentum, not Push
  – Maintenance of Access Subsystem databases becomes the responsibility of the Access Subsystem
  – Data is pulled from Documentum only as needed
    • Avoids overflow/queuing problems
  – Allows multiple access systems to be fielded
• Search for Deletes in FAST
  – Packages can contain many granules
  – When updating the FAST indexes, use search to find the list of all nested granules in the indexes
  – Guaranteed to avoid any “orphan” granule problems
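
The "search for deletes" idea can be illustrated with a toy in-memory index. This is a sketch of the technique only; `InMemoryIndex` and `republish` are invented for the example and do not represent the FAST API.

```python
class InMemoryIndex:
    """Toy stand-in for a search index; not the FAST API."""

    def __init__(self):
        self.docs = {}  # doc_id -> field dict, including its owning "package"

    def search_by_package(self, package_id):
        return [doc_id for doc_id, doc in self.docs.items()
                if doc["package"] == package_id]

    def delete(self, doc_id):
        self.docs.pop(doc_id, None)

    def add(self, doc_id, fields):
        self.docs[doc_id] = fields


def republish(index, package_id, granules):
    """Replace everything indexed for a package with its current granules."""
    # Search first, so granules that no longer exist in the package are
    # found and removed instead of lingering as orphans.
    for doc_id in index.search_by_package(package_id):
        index.delete(doc_id)
    for granule_id, fields in granules.items():
        index.add(granule_id, dict(fields, package=package_id))


idx = InMemoryIndex()
idx.add("A", {"package": "pkg1"})
idx.add("B", {"package": "pkg1"})
idx.add("Z", {"package": "pkg2"})
republish(idx, "pkg1", {"A": {"title": "a"}, "C": {"title": "c"}})
print(sorted(idx.docs))
```

Because the list of documents to delete comes from the index itself, the publisher does not need to remember which granules it pushed previously.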


ACP Cache Directory Structure

Proposed ACP Cache Directory (limits entries per directory to 256):

/ACP/hh/hh/hh/pkgXXXXXXXXXX/<package-contents>

hh/hh/hh: Hexadecimal representation of the lower 24 bits of the MD5 hash of the package ID
pkgXXXXXXXXXX: Package ID
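
A sketch of how such a path could be derived. It assumes that "lower 24 bits" means the last three bytes of the MD5 digest and that the package ID is used verbatim as the final directory name; neither detail is spelled out on the slide, and the sample package ID is made up.

```python
import hashlib


def acp_cache_dir(package_id, root="/ACP"):
    """Build /ACP/hh/hh/hh/<package_id> from the MD5 hash of the package ID."""
    digest = hashlib.md5(package_id.encode("utf-8")).digest()
    low24 = digest[-3:]  # assumption: "lower 24 bits" = last three digest bytes
    levels = ["{:02x}".format(byte) for byte in low24]
    return "/".join([root] + levels + [package_id])


print(acp_cache_dir("pkg0000000001"))
```

Each `hh` level is one byte, so no hashed directory level can hold more than 256 subdirectories, which matches the limit stated on the slide.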


Metadata Flow Diagram

[Figure: Metadata Flow Diagram, publish-side variant. Labeled elements: Original Content, Parse, fdsys.xml, Documentum (Validate Values & Normalize), publish, ACP Cache (fdsys.xml, granule file(s)), .xslt transforms, FAST XML (index.xml, search.xml), Index Push, FAST Notification, FAST indexes, Index, Search, map fields [per collection], map navigators [per collection], normalize data and map to FQL, MODS XML, PREMIS. Outputs sketched: Search Results, Content Detail, Package TOC, Search Form, Collection Browsing.]


Implementation Detail

[Figure: implementation detail. Labels: Documentum; Documentum APIs; fdsys.xml; FDsys Publish (servlet wrapper, processing requests via URL); ACP Cache: fdsys.xml and content files for each package; individual granule files; FAST Search for Deletes; FAST Content Processing; FAST Search Engine.]


Search Engine Configuration Design


Component Interfaces

[Figure: component interfaces. Labels: Package Updating Workflow, Access Processing Workflow, Content Publishing, FAST, ACP Cache, CMS Access, Web Application. The slide is marked "Update This".]


FAST System – Hardware & Network

[Figure: five "index & search" nodes, five "search" nodes, a "publish & admin" node, document processors, and the Web Application.]

FAST System – Indexing Flow

[Figure: the same five "index & search" nodes and five "search" nodes, with the "publish & admin" node and document processors.]

FAST System – Search

[Figure: the same five "index & search" nodes and five "search" nodes, with five QR servers and the Web Application.]


Search Engine Sizing: Columns

• Total Number of Documents
  – Estimated 10 million records
    • Each granule = 1 Search Engine document
  – Allow 2x expansion for estimation errors and growth
  – Estimated 20 million records
• Sizing Recommendation:
  – FAST recommends: 5 million records per column
    • For public facing web sites
  – 5 columns: to account for the large number of navigators
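
The column arithmetic on this slide, worked through as a short calculation. The variable names are illustrative.

```python
import math

estimated_records = 10_000_000   # one search-engine document per granule
growth_factor = 2                # 2x allowance for estimation error and growth
records_per_column = 5_000_000   # FAST guidance quoted on the slide

planned_records = estimated_records * growth_factor
minimum_columns = math.ceil(planned_records / records_per_column)
print(planned_records, minimum_columns)
```

The record count alone gives a minimum of 4 columns; the slide's recommendation of 5 adds one more to account for the large number of navigators.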


Search Engine Sizing: Disk

• Year 2006 FR – Index Sizing Test

  Documents   Text     Fixml    Index    Total FAST
  31,500      500 MB   230 MB   604 MB   834 MB

• Scale to 20 million documents
  – Fixml: ~150 GB
  – Index: ~420 GB
• Total index space required:
  – 150 GB + (420 GB)*2 = 1 TB
  – Add 50% for estimation error, total = 1.5 TB
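
The disk estimate can be reproduced as follows. The first part applies the slide's own formula; the second part is a plain linear scaling of the 31,500-document test, included only as a cross-check and assuming 1 GB = 1000 MB.

```python
# Slide formula: Fixml + 2 x Index, then a 50% margin.
fixml_gb = 150
index_gb = 420
base_gb = fixml_gb + 2 * index_gb   # 990 GB, rounded to 1 TB on the slide
with_margin_gb = base_gb * 1.5      # 1485 GB, rounded to 1.5 TB on the slide


def scale_gb(test_mb, test_docs=31_500, target_docs=20_000_000):
    """Linearly scale a test measurement (in MB) to the target corpus, in GB."""
    return test_mb * target_docs / test_docs / 1000


print(base_gb, with_margin_gb)
print(round(scale_gb(230)), round(scale_gb(604)))
```

Linear scaling gives roughly 146 GB of Fixml and 383 GB of index, so the slide's ~420 GB index figure is somewhat above a straight extrapolation; the slide does not say what accounts for the difference.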


Search Engine Sizing: QPS

• Queries per second – Estimated from GPO Access
  – 0.8 QPS (across the whole day)
  – Estimated peak: 2.4 QPS (1/2 of queries in 4 hours)
• Estimated Peak QPS for FDsys:
  – Factor for improved search interface: 3x
  – Factor for growth: 2x
  – Estimated: 2.4 x 2 x 3 = ~15 QPS
  – Correlates with other websites known to ST
• Each row: 20-30 QPS
  – Therefore: 1 row for query performance
• Recommend: 2 rows
  – 2nd row for redundancy, failover, and maintenance
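
The peak-load estimate follows directly from the numbers on the slide. The sketch below uses the low end of the 20-30 QPS per-row range as the conservative case; variable names are illustrative.

```python
import math

average_qps = 0.8                                   # GPO Access, whole-day average
queries_per_day = average_qps * 24 * 3600           # about 69,120 queries
peak_qps_gpo_access = (queries_per_day / 2) / (4 * 3600)  # half the queries in 4 hours

interface_factor = 3   # improved search interface
growth_factor = 2
peak_qps_fdsys = peak_qps_gpo_access * interface_factor * growth_factor

qps_per_row = 20       # low end of the 20-30 QPS range
rows_for_load = math.ceil(peak_qps_fdsys / qps_per_row)
recommended_rows = rows_for_load + 1   # second row for redundancy and maintenance

print(round(peak_qps_gpo_access, 2), round(peak_qps_fdsys, 1), recommended_rows)
```

The exact product is 14.4 QPS, which the slide rounds up to about 15.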


Metadata Flow Diagram

[Figure: Metadata Flow Diagram. Labeled elements: Submission, Original Content, Parse, fdsys.xml, Documentum (validate, cleanup, normalize, and extend metadata and renditions), publish, ACP Cache (fdsys.xml, content file(s)), .xslt transforms, FAST XML (index.xml, mods.xml), Index Push, FAST Notification, FAST indexes, Index, Search, map fields [per collection], map navigators [per collection], normalize data and map to FQL, MODS XML, PREMIS. Outputs sketched: Search Results, Content Detail, Package TOC, Search Form, Collection Browsing. Annotation on this slide: search index field mapping.]


Mapping to the Index Profile

[Figure: mapping to the index profile. Inputs: fdsys.xml and ACP Cache content files. Transforms: index.xslt (metadata for search results, metadata for simple search, metadata for navigation), mods.xslt, FAST Extractors, merge. Index profile fields: results bundle, xml scope field, grank1-6, body, publishdate, title, collection-specific fields, standard navigators.]


Collections and Codes

• Three types of collections:
  – Processing Collection = “collectionCode” or “processingCode”
    • Parsing, submission, processing, workflow
    • Chooses which index.xslt and search.xslt to apply
  – FAST Index Collection
    • One for each processing collection
    • Allows easily deleting all documents in a collection
  – “Access Collection” = “accode”
    • Re-group documents into collections for public users
    • 98% the same as the “Processing Collection”
      – Reports in the Congressional Record, FR Unified Agenda
    • Mapping is done in index.xslt


Simple Maintenance #1

• New Collections
  – Just add the collection (admin GUI)
  – Start feeding data
• Add new fields
  – Add the field to the index profile
  – Reload profile with a hot-update
• Backups
  – Turn off feeding
  – Wait for documents in process to finish up
  – Make index backups


Simple Maintenance #2

• Archiving Log Files
  – Simple file copy, can happen any time
• Correct Field Mapping Errors
  – Remove all documents in the FAST ESP collection
  – Re-index collection from ACP Cache
    • Get list of packages in the collection from Documentum
    • Does not require re-exporting (or re-publishing) the packages from CMS
• Reorganize Access Collections
  – Remove all documents from affected collections
  – Re-index affected collections from ACP Cache


Extensive Maintenance

• Examples:
  – New FAST Version
  – New FDsys Version
  – Complex index-profile changes (remove fields, major restructuring)
  – Re-organizing collections or field mapping while maintaining searches on the old snap-shot
• Process:
  – Servers to “stand-alone” mode
  – Make changes
  – Restore normal server operations


Monitoring

• FAST standard monitoring tool (“Clarity”)

• Monitors query and indexing performance

• Built-in alerting mechanism


Backups

• data_fixml
  – Holds processed copy of the indexes
  – Can be used to reconstruct the indexes in about 4 hours (will need to be benchmarked)
• data_index
  – The complete indexes actually used for searching
• Configuration backup
• When restoring a backup:
  – Will need to re-push all content updates which occurred since the last backup


Disaster Recovery Scenarios

• Servers crash
  – FAST restarts them automatically
• Hanging server processes
  – Shut it down manually and restart it
• Incremental indexing overloads the system
  – Should not happen in FDsys
  – Can “slow down” incremental indexing until situation is corrected
• Severe incremental indexing problems
  – Revert to periodic batch index updates


Search Services API


Component Interfaces

[Figure: component interfaces. Labels: FAST, FAST indexes, ACP Cache, Access Search Web App, Access Search API, Content Delivery Web App. Functions shown: search results, collection browsing, browsing PDF, content-detail.]


Responsibilities: Search Services vs Web Application

Search Services API

• All communication to FAST
  – All FAST API calls
  – All FAST parameters
• All FAST FQL
  – User query strings and parameters to FAST FQL
• Raw data values
  – Allowed values, navigator values, search results field values, etc.
• Choosing Navigators

Web Application

• Choosing which fields when
  – Advanced Search Form
  – Search Results
• User-interface oriented field data
  – Display names, help text, display widgets
• Display value translation
  – translating from raw data values to/from user-friendly values


Responsibilities: Search Services vs Web Application – Browse Trees

Search Services API

• Browse tree computation
  – Selecting nodes to return
  – Returning an ordered list of nodes
  – Caching search results
• Embargo Dates

Web Application

• Browse tree presentation
  – The definition of the levels
  – How many levels to display when
  – Presentation of tree
  – Translating raw data values to user friendly values

• Content Detail Pages


Component Interfaces

[Figure: component interfaces. Labels: Search Web Application; Search Services API (Parsing, Processing & Caching); FAST Search API; FAST Search Engine. Connections labeled: HTTP, Java method calls. Configuration files (XML): Master and Collection Specific (FR, CongBills, CFR, CR).]


Configuration File Contents

• Fields
  – for the Advanced Search Form
  – for field: searches
  – Allowed values
    • Fixed enumerated list
    • Enumerated list built from navigator
    • Numeric or Date Range
• Navigators per collection
• FAST ESP Search Engine Connection Info
• Templates to reformat data for display


Query Parsing: Syntax

• atom atom
  – defaults to AND
• atom and atom
• atom or atom
• atom before/# atom
• atom near/# atom
• atom adj atom
• atom
  – Atoms are space separated lists of characters, double-quoted strings, or parenthetical expressions
• not atom
• +atom
• -atom
• field:atom
• range(#,#)
• range(<date>,<date>)
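
A sketch of how a query string could be split into atoms and operators, with the default AND filled in between adjacent atoms. This is a simplified illustration of the syntax listed above, not the actual Search Services parser; treating `not` as an operator token for the purpose of AND insertion is an assumption.

```python
import re

OPERATOR = re.compile(r"^(and|or|not|adj|before/\d+|near/\d+)$", re.IGNORECASE)


def tokenize(query):
    """Split a query into words, double-quoted strings, and parenthetical groups."""
    tokens, i, n = [], 0, len(query)
    while i < n:
        ch = query[i]
        if ch.isspace():
            i += 1
        elif ch == '"':
            end = query.find('"', i + 1)
            if end == -1:          # tolerate a missing closing quote
                end = n - 1
            tokens.append(query[i:end + 1])
            i = end + 1
        elif ch == "(":
            depth, j = 0, i
            while j < n:
                if query[j] == "(":
                    depth += 1
                elif query[j] == ")":
                    depth -= 1
                    if depth == 0:
                        break
                j += 1
            tokens.append(query[i:j + 1])
            i = j + 1
        else:
            j = i
            while j < n and not query[j].isspace():
                j += 1
            tokens.append(query[i:j])
            i = j
    return tokens


def with_default_and(tokens):
    """Insert 'and' between adjacent atoms that have no operator between them."""
    out = []
    for tok in tokens:
        if out and not OPERATOR.match(out[-1]) and not OPERATOR.match(tok):
            out.append("and")
        out.append(tok)
    return out


print(with_default_and(tokenize('ways "and" means')))
```

A quoted `"and"` is kept as an atom rather than read as an operator, which is why the example query still receives default ANDs on both sides of it.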


Query Parsing: Examples

• hearing
• "congressional hearing"
• congressional adj hearing
• congressional hearing
• ways and means
• "ways and means"
• ways "and" means
• congressional or congress
• congress and (report or meeting or notice)
• congnum:range(103,110)
• congressional not report
• congressional -report
• +cardin congressional committee
• congressional not (committee report)
• congressional not (committee or meeting)
• representative near/10 cardin
• representative before/10 cardin


Derived Hierarchy: Example

• 110th Congress
  – House Bills
    • H.R. 1-200
    • H.R. 201-400
      – H.R. 201
      – H.R. 202
      – H.R. 203
        • Engrossed in House
        • Introduced in House
          . Condemning the persecution of labor rights advocates in Iran [PDF] [Text]
        • Referred in Senate
      – H.R. 204
    • H.R. 401-600
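
The range nodes in this example ("H.R. 1-200", "H.R. 201-400") are derived from the bill numbers rather than stored in the data. A sketch of that derivation, assuming a fixed bucket size of 200 as in the example; the function names are hypothetical.

```python
def bill_range_label(bill_type, number, bucket_size=200):
    """Return the derived range node that a bill number falls under."""
    start = ((number - 1) // bucket_size) * bucket_size + 1
    return "{} {}-{}".format(bill_type, start, start + bucket_size - 1)


def group_bills(bill_type, numbers, bucket_size=200):
    """Group bill numbers under their derived range labels, in ascending order."""
    groups = {}
    for number in sorted(numbers):
        label = bill_range_label(bill_type, number, bucket_size)
        groups.setdefault(label, []).append(number)
    return groups


print(group_bills("H.R.", [204, 201, 203, 202, 7]))
```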


Specified Hierarchy: Example


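Parser Foundation Classes

[Figure: PParser, PPackage, PFile, and PGranule, each built on PContainer and each holding an XML DOM. Instances shown: two PFile and two PGranule, each with its own XML. Operations labeled between levels: append, append. Final step: XSLT producing xml.]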
