Searching the United States Code with Solr/Lucene - by Ronald Matamoros
DESCRIPTION
See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011
TRANSCRIPT
Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies, 5/25/2011
Searching the United States Code
Who are we:
• Paul Nelson, Chief Architect
• Ronald Matamoros, Lead Engineer
Our Mission: Replace Personal Librarian Search
• A 20-Year-Old Search Engine!
Key Challenges
• How to index this massive, complex, 85-year-old document?
• How to replicate 20-Year-Old search features?
Government Documents are Fun!
Search Technologies
The largest independent provider of enterprise search expertise and services
• 80 full-time dedicated search engine experts
• 200+ customers
• Technology Neutral (yeah, we know Sphinx too)
• Offices All Over: DC, NY, CA, MD, OH, UK, CR…
A Quick Civics Lesson…
The United States Code
• The general & permanent laws of the U.S. Government – All in one place
• 51 titles: Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
• First Version: 1926
The Office of the Law Revision Counsel (OLRC)
• 20 lawyers who author the U.S. Code
• They report to the Speaker of the House of Representatives
Bonus Question: Which Title is the largest?
Major Challenges
1. Document Parsing
• A 50 Volume Table Of Contents!
2. Query Parsing
• Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3. Searching & Highlighting Fields
• Some fields are embedded in the document
• These fields must be highlighted in context
Part The First: Document Processing
Document Processing / Indexing
[Pipeline diagram. Stages shown: USC Title, Parse & Granularize, Repository, Construct XHTML, Store, Xform & Index, Solr, Embed Refs]
Field Type 1: Extracted to Index
<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">§1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., §1 (Jan. 28, 1915, ch. 20, §1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107–296 effective on the date of transfer of …
Callouts: Page Numbers, Title, Heading, Source Credit
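The header comments in the sample carry key:value metadata that can be lifted straight into index fields. A minimal sketch of that extraction, assuming the comment layout shown in the sample; the function name and the regular expressions are illustrative, not the presenters' code:

```python
import re

# Matches one XHTML comment and captures its body.
COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
# Splits "k1:v1 k2:v2" at whitespace that is followed by another "key:".
PAIR_SPLIT = re.compile(r"\s+(?=\w+:)")


def extract_metadata(xhtml: str) -> dict:
    """Collect key:value pairs from header comments, skipping field markers."""
    fields = {}
    for body in COMMENT.findall(xhtml):
        for pair in PAIR_SPLIT.split(body):
            key, sep, value = pair.partition(":")
            if not sep or key.startswith("field-"):
                continue  # not a key:value pair, or a field-start/field-end marker
            fields[key] = value
    # expcite packs the heading path with "!@!" separators (as in the sample).
    if "expcite" in fields:
        fields["expcite"] = fields["expcite"].split("!@!")
    return fields


sample = (
    "<!-- documentid:14_1 usckey:140000000000100000000000000000000 "
    "currentthrough:20080108 documentPDFPage:3 -->"
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->"
    "<!-- itemsortkey:140AAAD -->"
    "<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD -->"
    "<!-- field-start:head -->"
)
metadata = extract_metadata(sample)
```

Values such as itempath contain spaces, which is why the split only happens in front of another `key:` token rather than at every space.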
Document Processing / Indexing
[Diagram: Title 14 granularized into ch. 1, ch. 2, ch. 3, …; pt. A, pt. B, pt. C, …; sec. 1, sec. 2, sec. 3, …]
Field Type 2: Embedded Refs
[Same XHTML sample as in Field Type 1, with callouts: Public Law, Other USC Refs, Statute at Large]
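Embedded references such as Public Law and Statutes at Large citations can be located with pattern matching before they are indexed. A rough sketch, assuming only the two citation shapes visible in the sample text; the patterns and names are hypothetical and how the actual Embed Refs stage works is not stated in the slides:

```python
import re

# Citation shapes taken from the sample text; real citation formats vary more.
REF_PATTERNS = {
    "public_law": re.compile(r"Pub\. L\. (\d+)[–-](\d+)"),
    "statute_at_large": re.compile(r"(\d+) Stat\. (\d+)"),
}


def find_refs(text: str) -> list:
    """Return (kind, matched text, start offset) for each citation found."""
    refs = []
    for kind, pattern in REF_PATTERNS.items():
        for match in pattern.finditer(text):
            refs.append((kind, match.group(0), match.start()))
    return sorted(refs, key=lambda ref: ref[2])  # document order


credit = "(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94–546, §1(1),"
refs = find_refs(credit)
```

Keeping the start offset makes it possible to highlight each reference in context later.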
Document Processing / Indexing
Store layout:
/US-Code
  /2010
    /title2
      /USC-title2-section1532.htm
      /USC-title2-node3-rule5.htm
Part The Second: Token Processing
Token Processing 1: XHTML Tag Tokenizer
<!-- field-start:amendment-note --><h4 class="note-head">Amendments</h4><p class="note-body">2002—Pub. L. 107–296 substituted “Department of …<!-- field-end:amendment-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
Substituted
Department
of
<!-- field-end:amendment-note -->
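The tokenization shown above (one token per comment, per tag, and per word) can be approximated with a single regular expression. This is a sketch only; the production tokenizer is a Lucene component and its details are not given in the slides:

```python
import re

# One token per XHTML comment, per tag, or per run of word characters.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|\w+", re.S)


def tokenize(xhtml: str) -> list:
    """Split XHTML into comment, tag, and word tokens, in document order."""
    return TOKEN.findall(xhtml)


sample = (
    '<!-- field-start:amendment-note --><h4 class="note-head">Amendments</h4>'
    '<p class="note-body">2002—Pub. L. 107–296 substituted “Department of …'
    "<!-- field-end:amendment-note -->"
)
tokens = tokenize(sample)
```

The comment alternative is listed first so that `<!-- ... -->` is never split by the generic tag pattern.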
Field Type 3: Marked Within Doc
[Same XHTML sample as in Field Type 1; the field-start / field-end comments mark each field within the document]
Token Processing 2: Mark Start and End Tags
Before                                 After
<!-- field-start:amendment-note -->    S/amendment
<h4 class="note-head">                 <h4 class="note-head">
Amendments                             Amendments
</h4>                                  </h4>
<p class="note-body">                  <p class="note-body">
2002                                   2002
Pub                                    Pub
L                                      L
107                                    107
296                                    296
Substituted                            Substituted
Department                             Department
of                                     of
<!-- field-end:amendment-note -->      E/amendment
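Marking the field boundaries amounts to rewriting the field-start and field-end comment tokens into S/ and E/ tokens. A small sketch, assuming the "-note" suffix is dropped as the slide's "amendment-note" to "S/amendment" example suggests; the function name is hypothetical:

```python
import re

FIELD_MARKER = re.compile(r"<!--\s*field-(start|end):([\w-]+)\s*-->")


def mark_fields(tokens):
    """Replace field-start/field-end comment tokens with S/ and E/ tokens."""
    marked = []
    for tok in tokens:
        match = FIELD_MARKER.fullmatch(tok)
        if match is None:
            marked.append(tok)  # ordinary word or tag token
            continue
        prefix = "S/" if match.group(1) == "start" else "E/"
        name = match.group(2)
        # Assumption: a trailing "-note" is not part of the field name.
        if name.endswith("-note"):
            name = name[: -len("-note")]
        marked.append(prefix + name)
    return marked
```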
Token Processing 3: Remove XHTML Tags
Before                    After
S/amendment               S/amendment
<h4 class="note-head">
Amendments                Amendments
</h4>
<p class="note-body">
2002                      2002
Pub                       Pub
L                         L
107                       107
296                       296
Substituted               Substituted
Department                Department
of                        of
E/amendment               E/amendment
Token Processing 4: Tag Original Case & Lower Case
Before         After
S/amendment    S/amendment
Amendments     O/Amendments L/amendments
2002           O/2002 L/2002
Pub            O/Pub L/pub
L              O/L L/l
107            O/107 L/107
296            O/296 L/296
Substituted    O/Substituted L/substituted
Department     O/Department L/department
of             O/of L/of
E/amendment    E/amendment
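One way to picture the O/ and L/ tagging is as two tokens stacked at the same position, so that exact-case and lower-case searches hit the same spot in the text. A sketch under that assumption (the slides do not state how positions are assigned); names are illustrative:

```python
def tag_case(tokens):
    """Return (position, token) pairs: O/ keeps original case, L/ is lower-cased."""
    out = []
    for pos, tok in enumerate(tokens):
        if tok.startswith(("S/", "E/")):
            out.append((pos, tok))  # field boundary markers pass through unchanged
        else:
            out.append((pos, "O/" + tok))
            out.append((pos, "L/" + tok.lower()))
    return out
```

For example, `tag_case(["S/amendment", "Pub", "E/amendment"])` yields the marker tokens untouched and both `O/Pub` and `L/pub` at position 1.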
Token Processing 5: Lemmatize
Uses a dictionary-based lemmatizer based on GCIDE and WordNet
Before                         After
S/amendment                    S/amendment
O/Amendments L/amendments      O/Amendments L/amendments amendment
O/2002 L/2002                  O/2002 L/2002 2002
O/Pub L/pub                    O/Pub L/pub pub
O/L L/l                        O/L L/l l
O/107 L/107                    O/107 L/107 107
O/296 L/296                    O/296 L/296 296
O/Substituted L/substituted    O/Substituted L/substituted substitute
O/Department L/department      O/Department L/department department
O/of L/of                      O/of L/of of
E/amendment                    E/amendment
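The lemmatization step adds a bare lemma next to each O/ and L/ pair. A toy sketch with a three-word stand-in dictionary; the real lemmatizer is described as built on GCIDE and WordNet, and the (position, token) representation is an assumption:

```python
# Tiny stand-in dictionary; the slides cite GCIDE and WordNet as the real sources.
LEMMAS = {"amendments": "amendment", "substituted": "substitute", "records": "record"}


def add_lemmas(tagged):
    """After each L/ token, add the bare lemma at the same position."""
    out = []
    for pos, tok in tagged:
        out.append((pos, tok))
        if tok.startswith("L/"):
            word = tok[2:]
            out.append((pos, LEMMAS.get(word, word)))  # unknown words map to themselves
    return out
```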
Part The Third: Query Processing
Query Processing
[Pipeline diagram. Stages shown: Query String, parse, mark exact:, mark phrases, lemmatize, query template, build lucene query, search (not all stages shown)]
Communicates via generic QNode Class
• Simpler to manipulate than Lucene operators
Can produce FAST FQL as well
• (cue the derisive catcalls)
But most importantly:
• It is a Query Processing Pipeline
• Mix and match query processing modules
Query Processing
[Pipeline diagram. Stages shown: Query String, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]
Query: exact:FOIA “top secret” amendment:RECORDS
After parse:
and
  exact:
    |FOIA|
  phrase
    |top| |secret|
  amendment:
    |RECORDS|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After mark original:
and
  O/FOIA
  phrase
    |top| |secret|
  amendment:
    |RECORDS|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After mark lowercase:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |records|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After lemmatize:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |record|
Query Processing
Query: exact:FOIA “top secret” amendment:RECORDS
After query template:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  between
    S/amendment
    E/amendment
    |record|
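The walk-through for exact:FOIA “top secret” amendment:RECORDS can be mimicked with a small tree-rewriting pipeline. This is a sketch only: QNode is named in the slides, but its fields, the rewrite helper, and the stage functions below are assumptions made for illustration, and the parse stage is skipped by building the parsed tree by hand.

```python
from dataclasses import dataclass, field
from typing import List

LEMMAS = {"records": "record"}  # stand-in for the dictionary-based lemmatizer


@dataclass
class QNode:
    op: str                                            # "and", "phrase", "term", "exact:", ...
    text: str = ""                                     # token text for "term" nodes
    children: List["QNode"] = field(default_factory=list)
    args: List[str] = field(default_factory=list)      # extra operator arguments


def rewrite(node, stage, inside=()):
    """Rebuild the tree bottom-up, giving each stage the node and its ancestor ops."""
    kids = [rewrite(child, stage, inside + (node.op,)) for child in node.children]
    return stage(QNode(node.op, node.text, kids, list(node.args)), inside)


def mark_original(node, inside):
    if node.op == "exact:":                            # exact:X becomes the term O/X
        return QNode("term", "O/" + node.children[0].text)
    return node


def mark_lowercase(node, inside):
    if node.op == "term" and not node.text.startswith("O/"):
        lowered = node.text.lower()
        # Phrase terms are pinned to the L/ form; other terms stay bare for lemmatizing.
        node.text = "L/" + lowered if "phrase" in inside else lowered
    return node


def lemmatize(node, inside):
    if node.op == "term" and "/" not in node.text:     # only bare terms
        node.text = LEMMAS.get(node.text, node.text)
    return node


def apply_template(node, inside):
    if node.op.endswith(":"):                          # field:X becomes between(S/field, E/field, X)
        name = node.op[:-1]
        return QNode("between", children=node.children, args=["S/" + name, "E/" + name])
    return node


def render(node):
    if node.op == "term":
        return node.text
    parts = node.args + [render(child) for child in node.children]
    return node.op + "(" + ", ".join(parts) + ")"


query = QNode("and", children=[
    QNode("exact:", children=[QNode("term", "FOIA")]),
    QNode("phrase", children=[QNode("term", "top"), QNode("term", "secret")]),
    QNode("amendment:", children=[QNode("term", "RECORDS")]),
])
for stage in (mark_original, mark_lowercase, lemmatize, apply_template):
    query = rewrite(query, stage)
result = render(query)
```

Because each stage is just a function over nodes, stages can be reordered, dropped, or swapped, which is the mix-and-match property the slides emphasize.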
The between() Operator
between(start-tag, end-tag, pos-clause, neg-clause)
• start-tag: Starting tag, e.g. “S/amendment”
• end-tag: Ending tag, e.g. “E/amendment”
• pos-clause: words which must occur between start and end (Note: Requires a nested ScanAnd() operator)
• neg-clause: words which must not occur between start and end
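The semantics of between() can be illustrated over a plain token list. This is a simplified reading of the operator, not its Lucene implementation: it treats pos-clause as "all of these words", ignores the nested ScanAnd() requirement, and the function signature is hypothetical.

```python
def between(tokens, start_tag, end_tag, pos_words, neg_words=()):
    """True if some start_tag..end_tag span holds every word in pos_words
    and none of neg_words. tokens is the document's token list in order."""
    span = None  # words seen since the most recent unmatched start_tag
    for tok in tokens:
        if tok == start_tag:
            span = set()
        elif tok == end_tag:
            if span is not None and set(pos_words) <= span and not (set(neg_words) & span):
                return True
            span = None
        elif span is not None:
            span.add(tok)
    return False


doc = [
    "S/statute", "coast", "guard", "E/statute",
    "S/amendment", "record", "substitute", "E/amendment",
]
```

With this document, `between(doc, "S/amendment", "E/amendment", ["record"])` matches, while searching the same field for "coast" does not, because that word only occurs inside the statute span.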
Part the Fourth: Hierarchical Navigation
Hierarchies: Requirements
Any number of levels
• Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
Levels vary across titles
• Title 1: 3 levels
• Title 26: 8 levels
Multiple views:
• Children
• Ancestors
• Ancestor’s Siblings
Multiple search scopes:
• Only children, all descendants, everything
Hierarchies: Ancestor-Siblings
US-Code
• Title 1
• Title 2
    Chapter 1
    Chapter 2
      – Part 1
      – Part 2
          • Section 2.1
          • Section 2.2
      – Part 3
      – Part 4
    Chapter 3
    Chapter 4
• Title 3
Hierarchies: Fields
ancestors – for searching
• USC, USC-title2, USC-title2-chapter25, USC-title2-chapter25-subchapter2
encodedAncestors – for display only
• Where the node exists within the hierarchy
• id;heading;subjectTitle//id;heading;subjectTitle//...
• USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
parentId – ID of the parent node
• USC-title2-chapter25-subchapter2
treesort – Hierarchical sort field, e.g. “13/000/0/00882”
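The three ancestry fields can all be derived from one list describing the path down to a node's parent. A minimal sketch, assuming each path entry is an (id, heading, subjectTitle) tuple; the function name is hypothetical, and the example path is shortened to the two entries spelled out in the slide:

```python
def hierarchy_fields(path):
    """path: (id, heading, subjectTitle) tuples from the top down to the node's parent."""
    ids = [node_id for node_id, _heading, _subject in path]
    return {
        "ancestors": ids,                                               # searchable IDs
        "encodedAncestors": "//".join(";".join(node) for node in path),  # display only
        "parentId": ids[-1] if ids else None,
    }


path = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
fields = hierarchy_fields(path)
```

Packing heading and subject title into encodedAncestors lets a result page render the breadcrumb trail without a second lookup per ancestor.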
Hierarchies: Tree Sort
Sorting In Print Order
• Front Matter, Titles, Tables, etc.
• Everything padded to fixed-length
Example: 01/011/1/02032
• 01 = USC Title
• 011 = Title 11
• 1 = An Appendix
• 02032 = Sequence # in file
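Fixed-length padding is what makes a plain string sort match print order. A short sketch of building such a key, assuming the four components and widths of the example; the parameter names are invented for illustration:

```python
def treesort_key(doc_type: int, title: int, appendix: int, sequence: int) -> str:
    """Zero-pad every component so string order equals numeric order."""
    return f"{doc_type:02d}/{title:03d}/{appendix:d}/{sequence:05d}"


keys = [
    treesort_key(1, 11, 1, 2032),
    treesort_key(1, 2, 0, 15),
    treesort_key(1, 11, 0, 40),
]
in_print_order = sorted(keys)
```

Without padding, title 11 would sort ahead of title 2 as a string; with it, `sorted()` needs no custom comparator.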
Hierarchies: Sample Searches
Assuming Node = “USC-title2-chapter25”
Search Children
• parentId:USC-title2-chapter25
Search All Descendants
• ancestors:USC-title2-chapter25
Ancestor Siblings
• (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
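The three sample searches are simple enough to generate from a node ID and its ancestor chain. A sketch that only builds the query strings shown above; the helper names are hypothetical, and any escaping a given Solr field type might need is left out:

```python
def children_query(node_id: str) -> str:
    return f"parentId:{node_id}"


def descendants_query(node_id: str) -> str:
    return f"ancestors:{node_id}"


def ancestor_siblings_query(chain) -> str:
    """chain: IDs from the root down to and including the current node."""
    return "(" + " OR ".join(f"parentId:{node_id}" for node_id in chain) + ")"


chain = ["USC", "USC-title2", "USC-title2-chapter25"]
siblings = ancestor_siblings_query(chain)
```

The ancestor-siblings query returns the children of every node on the chain, which is exactly the set needed to draw the expanded navigation tree.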
Contact
• Paul Nelson
• Ronald Matamoros
• Search Technologies: http://searchtechnologies.com