Searching the United States Code with Solr/Lucene

Paul Nelson / Ronald Matamoros, Search Technologies, 5/25/2011


DESCRIPTION

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

TRANSCRIPT

Page 1: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Searching The United States Code with Solr/Lucene

Paul Nelson / Ronald Matamoros, Search Technologies
[email protected], [email protected]
5/25/2011

Page 2: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Searching the United States Code

Who are we:
• Paul Nelson, Chief Architect
• Ronald Matamoros, Lead Engineer

Our Mission: Replace Personal Librarian Search
• A 20-Year-Old Search Engine!

Key Challenges
• How to index this massive, complex, 85-year-old document?
• How to replicate 20-Year-Old search features?

Government Documents are Fun!

Page 3: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Search Technologies

The largest independent provider of enterprise search expertise and services

80 full-time dedicated search engine experts
200+ customers
Technology Neutral
• (yeah, we know Sphinx too)
Offices All Over
• DC, NY, CA, MD, OH, UK, CR…

Page 4: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

A Quick Civics Lesson…

The United States Code
• The general & permanent laws of the U.S. Government – All in one place
• 51 titles
  – Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
• First Version: 1926

The Office of the Law Revision Counsel (OLRC)
• 20 lawyers who author the U.S. Code
• They report to the Speaker of the House of Representatives

Bonus Question: Which Title is the largest?

Page 5: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Major Challenges

1. Document Parsing
• A 50 Volume Table Of Contents!

2. Query Parsing
• Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)

3. Searching & Highlighting Fields
• Some fields are embedded in the document
• These fields must be highlighted in context

Page 6: Searching The United States Code with Solr/Lucene - By Ronald Matamoros


screenshot

Page 7: Searching The United States Code with Solr/Lucene - By Ronald Matamoros


screenshot

Page 8: Searching The United States Code with Solr/Lucene - By Ronald Matamoros


screenshot

Page 9: Searching The United States Code with Solr/Lucene - By Ronald Matamoros


Page 10: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Part The First: Document Processing


Page 11: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Document Processing / Indexing

[Diagram: document processing pipeline. Boxes shown: USC Title, Parse & Granularize, Repository, Construct XHTML, Embed Refs, Store, Xform & Index, Solr]

Page 12: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Field Type 1: Extracted to Index

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …

Callouts: Page Numbers, Title, Heading, Source Credit
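A minimal sketch of how fields of this first type might be pulled out of the XHTML before indexing. The sample string is an abbreviated stand-in for the slide's document, and the function names and the rule for splitting several key:value pairs inside one comment are assumptions made for illustration, not the authors' parser.

```python
import html
import re

# Abbreviated sample modeled on the slide's XHTML (text shortened for illustration).
SAMPLE = (
    "<!-- documentid:14_1 currentthrough:20080108 documentPDFPage:3 -->"
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->"
    "<!-- field-start:head -->"
    '<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>'
    "<!-- field-end:head -->"
    "<!-- field-start:sourcecredit -->"
    '<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496)</p>'
    "<!-- field-end:sourcecredit -->"
)

COMMENT = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
FIELD = re.compile(
    r"<!--\s*field-start:([\w-]+)\s*-->(.*?)<!--\s*field-end:\1\s*-->", re.S
)
TAG = re.compile(r"<[^>]+>")


def extract_metadata(xhtml):
    """Collect key:value pairs from the header comments."""
    meta = {}
    for body in COMMENT.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue  # field markers are handled by extract_fields
        # Assumption: a new pair starts at whitespace followed by "word:".
        for pair in re.split(r"\s+(?=\w+:)", body):
            key, _, value = pair.partition(":")
            meta[key] = value
    return meta


def extract_fields(xhtml):
    """Return the plain text found between each field-start/field-end pair."""
    return {
        name: html.unescape(TAG.sub("", text)).strip()
        for name, text in FIELD.findall(xhtml)
    }


metadata = extract_metadata(SAMPLE)
fields = extract_fields(SAMPLE)
print(metadata["documentPDFPage"], "|", fields["head"])
```

Each returned dictionary entry could then be mapped to a separate index field such as page number, heading, or source credit.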

Page 13: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Document Processing / Indexing

[Diagram: sample hierarchy. Title 14; ch. 1, ch. 2, ch. 3; pt. A, pt. B, pt. C; sec. 1, sec. 2, sec. 3]

[Diagram: document processing pipeline. Boxes shown: USC Title, Parse & Granularize, Repository, Construct XHTML, Embed Refs, Store, Xform & Index, Solr]

Page 14: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Field Type 2: Embedded Refs

Same XHTML sample as on Page 12.

Callouts: Public Law, Other USC Refs, Statute at Large

Page 15: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Document Processing / Indexing

[Diagram: document processing pipeline. Boxes shown: USC Title, Parse & Granularize, Repository, Construct XHTML, Embed Refs, Store, Xform & Index, Solr]

Page 16: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Document Processing / Indexing

[Diagram: document processing pipeline. Boxes shown: USC Title, Parse & Granularize, Repository, Construct XHTML, Embed Refs, Store, Xform & Index, Solr]

Repository layout:
/US-Code
  /2010
    /title2
      /USC-title2-section1532.htm
      /USC-title2-node3-rule5.htm

Page 17: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Part The Second: Token Processing


Page 18: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Token Processing 1: XHTML Tag Tokenizer

Input:
<!-- field-start:amendment-note --><h4 class="note-head">Amendments</h4><p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …<!-- field-end:amendment-note -->

Tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
substituted
Department
of
<!-- field-end:amendment-note -->
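The tokenizer step can be approximated with one regular expression that emits each comment and each tag as a single token and every run of word characters as a word token. This is a sketch under the assumption that character entities such as &mdash; act only as separators; the names are hypothetical and this is not the Lucene tokenizer the authors built.

```python
import re

# Order matters: comments first, then tags, then entities, then words.
TOKEN = re.compile(r"<!--.*?-->|<[^>]+>|&#?\w+;|\w+", re.S)


def tokenize(xhtml):
    """Split XHTML into comment tokens, tag tokens, and word tokens."""
    tokens = []
    for match in TOKEN.finditer(xhtml):
        text = match.group(0)
        if text.startswith("&"):
            continue  # entities only separate words in this sketch
        tokens.append(text)
    return tokens


SAMPLE = (
    "<!-- field-start:amendment-note -->"
    '<h4 class="note-head">Amendments</h4>'
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    "&ldquo;Department of"
    "<!-- field-end:amendment-note -->"
)

tokens = tokenize(SAMPLE)
for token in tokens:
    print(token)
```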

Page 19: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Field Type 3: Marked Within Doc

Same XHTML sample as on Page 12. The field-start and field-end comments mark where each field begins and ends within the document.

Page 20: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Token Processing 2: Mark Start and End Tags

Before                                 After
<!-- field-start:amendment-note -->    S/amendment
<h4 class="note-head">                 <h4 class="note-head">
Amendments                             Amendments
</h4>                                  </h4>
<p class="note-body">                  <p class="note-body">
2002                                   2002
Pub                                    Pub
L                                      L
107                                    107
296                                    296
substituted                            substituted
Department                             Department
of                                     of
<!-- field-end:amendment-note -->      E/amendment

Page 21: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Token Processing 3: Remove XHTML Tags

Before                    After
S/amendment               S/amendment
<h4 class="note-head">
Amendments                Amendments
</h4>
<p class="note-body">
2002                      2002
Pub                       Pub
L                         L
107                       107
296                       296
substituted               substituted
Department                Department
of                        of
E/amendment               E/amendment

Page 22: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Token Processing 4: Tag Original Case & Lower Case

Before          After
S/amendment     S/amendment
Amendments      O/Amendments L/amendments
2002            O/2002 L/2002
Pub             O/Pub L/pub
L               O/L L/l
107             O/107 L/107
296             O/296 L/296
substituted     O/substituted L/substituted
Department      O/Department L/department
of              O/of L/of
E/amendment     E/amendment
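Steps 2 through 4 can be sketched as three small list transformations. The function names are hypothetical, and dropping the "-note" suffix when building S/ and E/ markers is inferred from the slide's "amendment-note" to "S/amendment" example rather than stated by the authors. Each output position is a list of terms that would share one token position in the index.

```python
import re

# Matches the field marker comments; the "-note" suffix is dropped (assumption).
FIELD_MARK = re.compile(r"<!--\s*field-(start|end):([\w-]+?)(?:-note)?\s*-->")

TOKENS = [
    "<!-- field-start:amendment-note -->", '<h4 class="note-head">', "Amendments",
    "</h4>", '<p class="note-body">', "2002", "Pub", "L", "107", "296",
    "substituted", "Department", "of", "<!-- field-end:amendment-note -->",
]


def mark_fields(tokens):
    """Step 2: turn field-start/field-end comments into S/ and E/ markers."""
    out = []
    for token in tokens:
        match = FIELD_MARK.fullmatch(token)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            out.append(prefix + match.group(2))
        else:
            out.append(token)
    return out


def strip_tags(tokens):
    """Step 3: drop remaining XHTML tags and comments."""
    return [token for token in tokens if not token.startswith("<")]


def tag_case(tokens):
    """Step 4: emit original-case and lower-case variants at each position."""
    positions = []
    for token in tokens:
        if token.startswith(("S/", "E/")):
            positions.append([token])
        else:
            positions.append(["O/" + token, "L/" + token.lower()])
    return positions


positions = tag_case(strip_tags(mark_fields(TOKENS)))
for terms in positions:
    print(" ".join(terms))
```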

Page 23: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Token Processing 5: Lemmatize

Uses a dictionary-based lemmatizer built on GCIDE and WordNet

Before                         After
S/amendment                    S/amendment
O/Amendments L/amendments      O/Amendments L/amendments amendment
O/2002 L/2002                  O/2002 L/2002 2002
O/Pub L/pub                    O/Pub L/pub pub
O/L L/l                        O/L L/l l
O/107 L/107                    O/107 L/107 107
O/296 L/296                    O/296 L/296 296
O/substituted L/substituted    O/substituted L/substituted substitute
O/Department L/department      O/Department L/department department
O/of L/of                      O/of L/of of
E/amendment                    E/amendment
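A sketch of the lemmatization step, assuming the lemma is looked up from the lower-case variant and added without a prefix at the same position, as the slide's output suggests. The two-entry dictionary is a toy stand-in for the GCIDE and WordNet data, and the function name is hypothetical.

```python
# Toy lemma dictionary standing in for GCIDE/WordNet lookups (illustration only).
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def lemmatize(position):
    """Append the unprefixed lemma of each L/ term to the same position."""
    out = list(position)
    for term in position:
        if term.startswith("L/"):
            word = term[2:]
            # Words missing from the dictionary are kept unchanged.
            out.append(LEMMAS.get(word, word))
    return out


sample = [["S/amendment"], ["O/Amendments", "L/amendments"], ["O/of", "L/of"]]
lemmatized = [lemmatize(position) for position in sample]
for terms in lemmatized:
    print(" ".join(terms))
```

With three variants per position, an exact-case query can match the O/ term, a case-insensitive exact query the L/ term, and an ordinary query the lemma.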

Page 24: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Part The Third: Query Processing


Page 25: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark phrases, mark exact:, lemmatize, query template, build lucene query, search]

Communicates via generic QNode Class
• Simpler to manipulate than Lucene operators

Can produce FAST FQL as well
• (cue the derisive catcalls)

But most importantly:
• It is a Query Processing Pipeline

Mix and match query processing modules

(not all stages shown)
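The pipeline idea can be sketched as a list of functions that each take a query tree and return a new one. The QNode fields, the stage functions, and the rule that only phrase terms receive the L/ prefix are assumptions chosen to mirror the trees shown for the deck's example query exact:FOIA “top secret” amendment:RECORDS; the authors' actual class is not shown in the slides.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class QNode:
    op: str                       # "and", "phrase", "term", "exact", or a field name
    text: str = ""                # only used by "term" nodes
    children: List["QNode"] = field(default_factory=list)


Stage = Callable[[QNode], QNode]


def mark_original(node: QNode) -> QNode:
    """Rewrite exact:X into a term on the original-case token O/X."""
    if node.op == "exact":
        return QNode("term", "O/" + node.children[0].text)
    return QNode(node.op, node.text, [mark_original(c) for c in node.children])


def mark_lowercase(node: QNode, in_phrase: bool = False) -> QNode:
    """Lower-case remaining terms; phrase terms get the L/ prefix."""
    if node.op == "term":
        if node.text.startswith("O/"):
            return node
        low = node.text.lower()
        return QNode("term", "L/" + low if in_phrase else low)
    inside = in_phrase or node.op == "phrase"
    return QNode(node.op, node.text,
                 [mark_lowercase(c, inside) for c in node.children])


def run_pipeline(root: QNode, stages: List[Stage]) -> QNode:
    """Apply each stage in order; stages can be mixed and matched."""
    for stage in stages:
        root = stage(root)
    return root


parsed = QNode("and", children=[
    QNode("exact", children=[QNode("term", "FOIA")]),
    QNode("phrase", children=[QNode("term", "top"), QNode("term", "secret")]),
    QNode("amendment", children=[QNode("term", "RECORDS")]),
])
result = run_pipeline(parsed, [mark_original, mark_lowercase])
print(result)
```

Because every stage has the same signature, a stage can be added, removed, or reordered without touching the others.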

Page 26: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

Query: exact:FOIA “top secret” amendment:RECORDS

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]

Query tree:
and
  exact:
    |FOIA|
  phrase
    |top| |secret|
  amendment:
    |RECORDS|

Page 27: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

Query: exact:FOIA “top secret” amendment:RECORDS

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]

Query tree:
and
  O/FOIA
  phrase
    |top| |secret|
  amendment:
    |RECORDS|

Page 28: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

Query: exact:FOIA “top secret” amendment:RECORDS

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]

Query tree:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |records|

Page 29: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

Query: exact:FOIA “top secret” amendment:RECORDS

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]

Query tree:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  amendment:
    |record|

Page 30: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Query Processing

Query: exact:FOIA “top secret” amendment:RECORDS

[Diagram: query processing pipeline. Stages shown: QueryString, parse, mark original, mark lowercase, lemmatize, query template, build lucene query, search]

Query tree:
and
  O/FOIA
  phrase
    |L/top| |L/secret|
  between
    S/amendment
    E/amendment
    |record|
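The last tree change, where the amendment: field node becomes a between node, might be expressed as a template stage like the one below. The FIELD_TAGS table, the QNode shape, and the render helper are hypothetical names used only to illustrate the rewrite.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class QNode:
    op: str
    text: str = ""
    children: List["QNode"] = field(default_factory=list)


# Hypothetical template table: field name -> (start marker, end marker).
FIELD_TAGS = {"amendment": ("S/amendment", "E/amendment")}


def apply_templates(node):
    """Rewrite field nodes into between(start, end, clauses...)."""
    kids = [apply_templates(child) for child in node.children]
    if node.op in FIELD_TAGS:
        start, end = FIELD_TAGS[node.op]
        markers = [QNode("term", start), QNode("term", end)]
        return QNode("between", children=markers + kids)
    return QNode(node.op, node.text, kids)


def render(node):
    """Print a tree in a compact functional notation for inspection."""
    if node.op == "term":
        return node.text
    return node.op + "(" + ", ".join(render(c) for c in node.children) + ")"


tree = QNode("and", children=[
    QNode("term", "O/FOIA"),
    QNode("phrase", children=[QNode("term", "L/top"), QNode("term", "L/secret")]),
    QNode("amendment", children=[QNode("term", "record")]),
])
rewritten = render(apply_templates(tree))
print(rewritten)
```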

Page 31: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

The between() Operator

between(start-tag, end-tag, pos-clause, neg-clause)

start-tag: Starting tag, e.g. “S/amendment”
end-tag: Ending tag, e.g. “E/amendment”
pos-clause: words which must occur between start and end
• Note: Requires a nested ScanAnd() operator
neg-clause: words which must not occur between start and end
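The semantics of between() can be illustrated in plain Python over a list of token positions. This sketch only shows what the operator is described as matching; it is not the Lucene query class the authors wrote, it does not model the nested ScanAnd() operator, and the parameter names are assumptions.

```python
def between(positions, start_tag, end_tag, must=(), must_not=()):
    """Return (start, end) index pairs of marked spans satisfying the clauses.

    positions: list of term lists, one list per token position.
    """
    hits = []
    start = None
    for i, terms in enumerate(positions):
        if start_tag in terms:
            start = i
        elif end_tag in terms and start is not None:
            inside = set()
            for span_terms in positions[start + 1:i]:
                inside.update(span_terms)
            has_all = all(word in inside for word in must)
            has_banned = any(word in inside for word in must_not)
            if has_all and not has_banned:
                hits.append((start, i))
            start = None
    return hits


doc = [
    ["S/amendment"],
    ["O/Amendments", "L/amendments", "amendment"],
    ["O/records", "L/records", "record"],
    ["E/amendment"],
    ["S/statute"],
    ["O/record", "L/record", "record"],
    ["E/statute"],
]

print(between(doc, "S/amendment", "E/amendment", must=["record"]))
print(between(doc, "S/statute", "E/statute", must=["record"]))
```

Because the lookup is bounded by the S/ and E/ markers, a word that appears only in another field of the same document does not produce a match.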

Page 32: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Part the Fourth: Hierarchical Navigation


Page 33: Searching The United States Code with Solr/Lucene - By Ronald Matamoros


screenshot

Page 34: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Hierarchies: Requirements

Any number of levels
• Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section

Levels vary across titles
• Title 1: 3 levels
• Title 26: 8 levels

Multiple views:
• Children
• Ancestors
• Ancestor’s Siblings

Multiple search scopes:
• Only children, all descendants, everything

Page 35: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Hierarchies: Ancestor-Siblings

US-Code
• Title 1
• Title 2
    Chapter 1
    Chapter 2
        – Part 1
        – Part 2
            • Section 2.1
            • Section 2.2
        – Part 3
        – Part 4
    Chapter 3
    Chapter 4
• Title 3

Page 36: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Hierarchies: Fields

ancestors
• Searching
  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2

encodedAncestors – for display only
• Where the node exists within the hierarchy
  id;heading;subjectTitle//id;heading;subjectTitle//...
  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform

parentId – ID of the parent node
  USC-title2-chapter25-subchapter2

treesort – Hierarchical sort field, e.g. “13/000/0/00882”
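A sketch of how the three hierarchy fields might be derived from a node's chain of ancestors. The function name and the input shape, a list of (id, heading, subjectTitle) tuples ordered root first, are assumptions; only the two ancestors spelled out on the slide are used as sample data.

```python
def hierarchy_fields(ancestor_chain):
    """Build ancestors, encodedAncestors, and parentId for one node.

    ancestor_chain: list of (id, heading, subjectTitle), root first, parent last.
    """
    return {
        # Multi-valued field used for searching all descendants of a node.
        "ancestors": [node_id for node_id, _, _ in ancestor_chain],
        # Display-only string: entries joined by "//", parts joined by ";".
        "encodedAncestors": "//".join(";".join(entry) for entry in ancestor_chain),
        # The last entry in the chain is the direct parent.
        "parentId": ancestor_chain[-1][0] if ancestor_chain else None,
    }


chain = [
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
]
doc_fields = hierarchy_fields(chain)
print(doc_fields["parentId"])
print(doc_fields["encodedAncestors"])
```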

Page 37: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Hierarchies: Tree Sort

Sorting In Print Order
• Front Matter, Titles, Tables, etc.
• Everything padded to fixed-length

Example: 01/011/1/02032
• 01 = USC Title
• 011 = Title 11
• 1 = An Appendix
• 02032 = Sequence # in file
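A minimal sketch of building such a key, assuming the four components are zero-padded to widths 2, 3, 1, and 5 as in the slide's example. The function and parameter names are hypothetical. Fixed-width padding is what lets a plain string sort reproduce print order.

```python
def treesort_key(section_code, title, is_appendix, sequence):
    """Build a fixed-width sort key such as 01/011/1/02032."""
    return f"{section_code:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"


keys = [
    treesort_key(1, 11, True, 2032),   # Title 11 appendix
    treesort_key(1, 11, False, 17),    # Title 11 main text
    treesort_key(1, 2, False, 5),      # Title 2
]
ordered = sorted(keys)
print(ordered)
```

Without the padding, "11" would sort before "2" as a string, which is why every component has a fixed length.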

Page 38: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Hierarchies: Sample Searches

Assuming Node = “USC-title2-chapter25”

Search Children
• parentId:USC-title2-chapter25

Search All Descendants
• ancestors:USC-title2-chapter25

Ancestor Siblings
• (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
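The three sample searches can be generated from a node ID with small helpers like these. The function names are hypothetical, and deriving the ancestor IDs by splitting on hyphens is an assumption based on the ID pattern in the slides; a real client could instead read them from the node's ancestors field.

```python
def children_query(node_id):
    """Match only the direct children of a node."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """Match every node below the given node."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id):
    """Match the children of the node and of each of its ancestors."""
    parts = node_id.split("-")
    # Assumption: each hyphen-delimited prefix of the ID is an ancestor ID.
    prefixes = ["-".join(parts[:i]) for i in range(1, len(parts) + 1)]
    return "(" + " OR ".join(f"parentId:{p}" for p in prefixes) + ")"


node = "USC-title2-chapter25"
print(children_query(node))
print(descendants_query(node))
print(ancestor_siblings_query(node))
```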

Page 39: Searching The United States Code with Solr/Lucene - By Ronald Matamoros

Contact

Paul Nelson
• [email protected]

Ronald Matamoros
• [email protected]

Search Technologies
• http://searchtechnologies.com