Searching The United States Code with Solr/Lucene
Paul Nelson / Ronald Matamoros, Search Technologies
[email protected], [email protected]
5/25/2011


DESCRIPTION

What are the challenges in searching an 85-year-old document? The United States Code was published by the United States Congress in 1926 as a single bound volume containing all of the general and permanent laws of the United States Government. It has been updated every year since and has grown into a 30-volume set of some 40,000 pages divided into 50 titles.


Page 2: Searching The United States Code with Solr/Lucene

Searching the United States Code

§  Who are we:
   •  Paul Nelson, Chief Architect
   •  Ronald Matamoros, Lead Engineer
§  Our Mission: Replace Personal Librarian Search
   •  A 20-Year-Old Search Engine!
§  Key Challenges
   •  How to index this massive, complex, 85-year-old document?
   •  How to replicate 20-Year-Old search features?
§  Government Documents are Fun!

Page 3: Searching The United States Code with Solr/Lucene

Search Technologies
§  The largest independent provider of enterprise search expertise and services
§  80 full-time dedicated search engine experts
§  200+ customers
§  Technology Neutral
   •  (yeah, we know Sphinx too)
§  Offices All Over
   •  DC, NY, CA, MD, OH, UK, CR…

Page 4: Searching The United States Code with Solr/Lucene

A Quick Civics Lesson…
§  The United States Code
   •  The general & permanent laws of the U.S. Government – All in one place
   •  51 titles
      §  Agriculture, Armed Forces, Conservation, The President, Food and Drugs, Postal Service, Public Health…
   •  First Version: 1926
§  The Office of the Law Revision Counsel (OLRC)
   •  20 lawyers who author the U.S. Code
   •  They report to the Speaker of the House of Representatives
§  Bonus Question: Which Title is the largest?

Page 5: Searching The United States Code with Solr/Lucene

Major Challenges
1.  Document Parsing
   •  A 50 Volume Table Of Contents!
2.  Query Parsing
   •  Custom Features (exact case, exact suffix, proximity, query templates, lemmatization, lots of fields…)
3.  Searching & Highlighting Fields
   •  Some fields are embedded in the document
   •  These fields must be highlighted in context

Page 6: Searching The United States Code with Solr/Lucene

screenshot

Page 7: Searching The United States Code with Solr/Lucene

screenshot

Page 8: Searching The United States Code with Solr/Lucene

screenshot

Page 9: Searching The United States Code with Solr/Lucene

Page 10: Searching The United States Code with Solr/Lucene

Part The First: Document Processing


Page 11: Searching The United States Code with Solr/Lucene

Document Processing / Indexing

[Diagram: indexing pipeline. Labels: USC Title, Parse & Granularize, Construct XHTML, Embed Refs, Store, Repository, Xform & Index, Solr]

Page 12: Searching The United States Code with Solr/Lucene

Field Type 1: Extracted to Index

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …

Callouts: Page Numbers, Title Heading, Source Credit
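
The "extracted to index" fields above live in XHTML comments, so they can be pulled out before indexing. The following is a minimal Python sketch of that idea, not the presenters' parser: the function names `extract_metadata` and `extract_fields` are hypothetical, and the comment syntax is assumed to be exactly what the sample shows (`key:value` pairs, and `field-start:` / `field-end:` markers).

```python
import html
import re

# Lookahead makes matches overlap, so nested field blocks are all found.
FIELD_RE = re.compile(
    r"(?=<!-- field-start:([\w-]+) -->(.*?)<!-- field-end:\1 -->)", re.S
)
COMMENT_RE = re.compile(r"<!--\s*(.*?)\s*-->", re.S)
KEY_RE = re.compile(r"(?:^|\s)(\w+):")


def extract_metadata(xhtml):
    """Collect key:value pairs from comments that are not field markers."""
    meta = {}
    for body in COMMENT_RE.findall(xhtml):
        if body.startswith(("field-start:", "field-end:")):
            continue
        parts = KEY_RE.split(body)  # ['', key1, value1, key2, value2, ...]
        meta.update(zip(parts[1::2], parts[2::2]))
    return meta


def extract_fields(xhtml):
    """Map each closed field block to its text with tags and entities removed."""
    fields = {}
    for name, body in FIELD_RE.findall(xhtml):
        text = html.unescape(re.sub(r"<[^>]+>", " ", body))
        fields[name] = " ".join(text.split())
    return fields


# Shortened copy of the slide's sample (only the complete "head" field is kept).
SAMPLE = (
    "<!-- documentid:14_1 usckey:140000000000100000000000000000000 "
    "currentthrough:20080108 documentPDFPage:3 -->\n"
    "<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->\n"
    "<!-- itemsortkey:140AAAD -->\n"
    '<!-- field-start:head --><h3 class="section-head">&sect;1. Establishment '
    "of Coast Guard</h3> <!-- field-end:head -->"
)

meta = extract_metadata(SAMPLE)
fields = extract_fields(SAMPLE)
```

Here `meta` holds values such as the PDF page number and item path, and `fields` holds the heading text, both ready to be sent to the index as separate fields.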

Page 13: Searching The United States Code with Solr/Lucene

Document Processing / Indexing

[Diagram: granularization of a title into a tree]

Title 14
   ch. 1   ch. 2   ch. 3
      pt. A   pt. B   pt. C
         sec. 1   sec. 2   sec. 3   …

[Diagram: indexing pipeline. Labels: USC Title, Parse & Granularize, Construct XHTML, Embed Refs, Store, Repository, Xform & Index, Solr]

Page 14: Searching The United States Code with Solr/Lucene

Field Type 2: Embedded Refs

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …

Callouts: Public Law (three places), Other USC Refs, Statute at Large

Page 15: Searching The United States Code with Solr/Lucene

Document Processing / Indexing

[Diagram: indexing pipeline. Labels: USC Title, Parse & Granularize, Construct XHTML, Embed Refs, Store, Repository, Xform & Index, Solr]

Page 16: Searching The United States Code with Solr/Lucene

Document Processing / Indexing

[Diagram: indexing pipeline. Labels: USC Title, Parse & Granularize, Construct XHTML, Embed Refs, Store, Repository, Xform & Index, Solr]

Repository layout:
§  /US-Code
   §  /2010
      §  /title2
         §  /USC-title2-section1532.htm
         §  /USC-title2-node3-rule5.htm

Page 17: Searching The United States Code with Solr/Lucene

Part The Second: Token Processing


Page 18: Searching The United States Code with Solr/Lucene

Token Processing 1: XHTML Tag Tokenizer

Input:
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->

Output tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
substituted
Department
of
<!-- field-end:amendment-note -->
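
A tokenizer with this behavior can be approximated with one regular expression. This is a hedged Python sketch rather than the Lucene tokenizer used in the project: `tokenize` is a hypothetical name, and the rule that character entities such as `&mdash;` act only as separators is an assumption read off the slide's output.

```python
import re

# Tags and comments become single tokens; entities are skipped; words are kept.
TOKEN_RE = re.compile(
    r"(?P<tag><!--.*?-->|<[^>]+>)|(?P<entity>&#?\w+;)|(?P<word>\w+)", re.S
)


def tokenize(xhtml):
    """Split XHTML into tag tokens and word tokens, in document order."""
    tokens = []
    for match in TOKEN_RE.finditer(xhtml):
        if match.lastgroup == "entity":
            continue  # entities separate words but are not indexed
        tokens.append(match.group())
    return tokens


SAMPLE = (
    '<!-- field-start:amendment-note --> <h4 class="note-head">Amendments</h4> '
    '<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted '
    "&ldquo;Department of \u2026 <!-- field-end:amendment-note -->"
)

tokens = tokenize(SAMPLE)
```

Keeping tags in the token stream at this stage is what lets the later stages turn the field comments into start and end markers.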

Page 19: Searching The United States Code with Solr/Lucene

Field Type 3: Marked Within Doc

<!-- documentid:14_1 usckey:140000000000100000000000000000000 currentthrough:20080108 documentPDFPage:3 -->
<!-- itempath:/140/PART I/CHAPTER 1/Sec. 1 -->
<!-- itemsortkey:140AAAD -->
<!-- expcite:TITLE 14-COAST GUARD!@!PART I-REGULAR COAST GUARD!@!CHAPTER 1-ESTABLISHMENT AND DUTIES!@!Sec. 1 -->
<!-- field-start:head -->
<h3 class="section-head">&sect;1. Establishment of Coast Guard</h3>
<!-- field-end:head -->
<!-- field-start:statute -->
<p class="statutory-body">The Coast Guard as established January 28, 1915, shall be a military …
<!-- field-end:statute -->
<!-- field-start:sourcecredit -->
<p class="source-credit">(Aug. 4, 1949, ch. 393, 63 Stat. 496; Pub. L. 94&ndash;546, &sect;1(1),…
<!-- field-end:sourcecredit -->
<!-- field-start:notes -->
<!-- field-start:historicalandrevision-note -->
<h4 class="note-head">Historical and Revision Notes</h4>
<p class="note-body">Based on title 14, U.S.C., 1946 ed., &sect;1 (Jan. 28, 1915, ch. 20, &sect;1…
<!-- field-end:historicalandrevision-note -->
<!-- field-start:amendment-note -->
<h4 class="note-head">Amendments</h4>
<p class="note-body">2002&mdash;Pub. L. 107&ndash;296 substituted &ldquo;Department of …
<!-- field-end:amendment-note -->
<!-- field-start:effectivedate-amendment-note -->
<h4 class="note-head">Effective Date of 2002 Amendment</h4>
<p class="note-body">Amendment by Pub. L. 107&ndash;296 effective on the date of transfer of …

Page 20: Searching The United States Code with Solr/Lucene

Token Processing 2: Mark Start and End Tags

Input tokens:
<!-- field-start:amendment-note -->
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
substituted
Department
of
<!-- field-end:amendment-note -->

Output tokens:
S/amendment
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
substituted
Department
of
E/amendment

Page 21: Searching The United States Code with Solr/Lucene

Token Processing 3: Remove XHTML Tags

Input tokens:
S/amendment
<h4 class="note-head">
Amendments
</h4>
<p class="note-body">
2002
Pub
L
107
296
substituted
Department
of
E/amendment

Output tokens:
S/amendment
Amendments
2002
Pub
L
107
296
substituted
Department
of
E/amendment
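
Stages 2 and 3 can be sketched as two small filters over the token list. This Python version is illustrative only: the function names are hypothetical, and dropping the `-note` suffix when forming `S/amendment` is an assumption inferred from the single example on the slides.

```python
import re

# Matches a field marker comment; the optional "-note" suffix is dropped
# because the slides map "amendment-note" to "S/amendment" (assumed rule).
FIELD_MARK_RE = re.compile(r"<!--\s*field-(start|end):([\w-]+?)(?:-note)?\s*-->")


def mark_field_tags(tokens):
    """Stage 2: replace field marker comments with S/<field> and E/<field>."""
    out = []
    for tok in tokens:
        match = FIELD_MARK_RE.fullmatch(tok)
        if match:
            prefix = "S/" if match.group(1) == "start" else "E/"
            out.append(prefix + match.group(2))
        else:
            out.append(tok)
    return out


def remove_xhtml_tags(tokens):
    """Stage 3: drop every remaining tag or comment token."""
    return [tok for tok in tokens if not tok.startswith("<")]


tokens = [
    "<!-- field-start:amendment-note -->",
    '<h4 class="note-head">', "Amendments", "</h4>",
    '<p class="note-body">', "2002", "Pub", "L", "107", "296",
    "substituted", "Department", "of",
    "<!-- field-end:amendment-note -->",
]

marked = mark_field_tags(tokens)
stripped = remove_xhtml_tags(marked)
```

Running the marker stage first matters: once the field comments are rewritten as `S/` and `E/` tokens they no longer look like tags, so the stripping stage leaves them in place.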

Page 22: Searching The United States Code with Solr/Lucene

Token Processing 4: Tag Original Case & Lower Case

Input tokens:
S/amendment
Amendments
2002
Pub
L
107
296
substituted
Department
of
E/amendment

Output tokens (each line is one token position):
S/amendment
O/Amendments L/amendments
O/2002 L/2002
O/Pub L/pub
O/L L/l
O/107 L/107
O/296 L/296
O/substituted L/substituted
O/Department L/department
O/of L/of
E/amendment
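
The slide shows the `O/` and `L/` forms side by side, which suggests they share one token position. The Python sketch below models that with a position increment of 0 for the second form, the same idea Lucene expresses through position increments; `tag_case` and the `(term, increment)` tuple layout are assumptions made for illustration.

```python
def tag_case(tokens):
    """Emit an original-case and a lower-case term for every word token.

    Returns (term, position_increment) pairs. An increment of 0 stacks a
    term on the same position as the previous one.
    """
    out = []
    for tok in tokens:
        if tok.startswith(("S/", "E/")):
            out.append((tok, 1))  # field markers pass through unchanged
        else:
            out.append(("O/" + tok, 1))
            out.append(("L/" + tok.lower(), 0))
    return out


stream = tag_case(["S/amendment", "Amendments", "of", "E/amendment"])
```

With both forms at one position, an exact-case query can target the `O/` term and a case-insensitive query the `L/` term without disturbing phrase or proximity distances.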

Page 23: Searching The United States Code with Solr/Lucene

Token Processing 5: Lemmatize

Uses a dictionary-based lemmatizer built on GCIDE and WordNet.

Input tokens:
S/amendment
O/Amendments L/amendments
O/2002 L/2002
O/Pub L/pub
O/L L/l
O/107 L/107
O/296 L/296
O/substituted L/substituted
O/Department L/department
O/of L/of
E/amendment

Output tokens (each line is one token position):
S/amendment
O/Amendments L/amendments amendment
O/2002 L/2002 2002
O/Pub L/pub pub
O/L L/l l
O/107 L/107 107
O/296 L/296 296
O/substituted L/substituted substitute
O/Department L/department department
O/of L/of of
E/amendment
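
The lemmatizing stage adds a third, unprefixed term at each position. The sketch below stands in a two-entry dictionary for the GCIDE/WordNet data, which is not reproduced here; `LEMMAS` and `lemmatize` are hypothetical names, and falling back to the lower-case word when no lemma is known is an assumption consistent with entries such as `2002` on the slide.

```python
# Toy stand-in for a dictionary-based lemmatizer (assumed data, not GCIDE/WordNet).
LEMMAS = {"amendments": "amendment", "substituted": "substitute"}


def lemmatize(stream):
    """Add the lemma of each L/ term at the same position (increment 0)."""
    out = []
    for term, increment in stream:
        out.append((term, increment))
        if term.startswith("L/"):
            word = term[2:]
            out.append((LEMMAS.get(word, word), 0))
    return out


stream = [
    ("S/amendment", 1),
    ("O/Amendments", 1), ("L/amendments", 0),
    ("O/2002", 1), ("L/2002", 0),
    ("E/amendment", 1),
]
lemmatized = lemmatize(stream)
```

Because the lemma is stored without a prefix, a default (non-exact) query term can be lemmatized the same way at query time and matched directly.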

Page 24: Searching The United States Code with Solr/Lucene

Part The Third: Query Processing


Page 25: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark exact: → mark phrases → lemmatize → query template → build lucene query → search (not all stages shown)]

§  Communicates via generic QNode Class
   •  Simpler to manipulate than Lucene operators
§  Can produce FAST FQL as well
   •  (cue the derisive catcalls)
§  But most importantly:
   •  It is a Query Processing Pipeline
      §  Mix and match query processing modules

Page 26: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   exact:
      |FOIA|
   phrase
      |top| |secret|
   amendment:
      |RECORDS|

Page 27: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |top| |secret|
   amendment:
      |RECORDS|

Page 28: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   amendment:
      |records|

Page 29: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   amendment:
      |record|

Page 30: Searching The United States Code with Solr/Lucene

Query Processing

[Diagram: Query String → parse → mark original → mark lowercase → lemmatize → query template → build lucene query → search]

Query string: exact:FOIA “top secret” amendment:RECORDS

and
   O/FOIA
   phrase
      |L/top| |L/secret|
   between
      S/amendment
      E/amendment
      |record|
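
The stage-by-stage rewriting of the query tree shown on these slides can be sketched as a chain of small tree transforms. This Python version is a reading of the slides, not the presenters' QNode class: the `QNode` fields, the stage functions, the one-entry `LEMMAS` dictionary, and the rule that phrase words get an `L/` prefix while other words are lemmatized are all assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class QNode:
    op: str                      # "and", "phrase", "field", "between" or "term"
    value: str = ""              # term text or field name
    children: list = field(default_factory=list)


LEMMAS = {"records": "record"}   # toy stand-in for the real lemmatizer


def rewrite(node, stage, parent=None):
    """Apply one stage to a node, then recursively to its children."""
    node = stage(node, parent)
    node.children = [rewrite(child, stage, node) for child in node.children]
    return node


def mark_original(node, parent):
    if node.op == "field" and node.value == "exact":
        return QNode("term", "O/" + node.children[0].value)
    return node


def mark_lowercase(node, parent):
    if node.op == "term" and "/" not in node.value:
        low = node.value.lower()
        in_phrase = parent is not None and parent.op == "phrase"
        return QNode("term", ("L/" + low) if in_phrase else low)
    return node


def lemmatize(node, parent):
    if node.op == "term" and "/" not in node.value:
        return QNode("term", LEMMAS.get(node.value, node.value))
    return node


def apply_template(node, parent):
    if node.op == "field":
        tags = [QNode("term", "S/" + node.value), QNode("term", "E/" + node.value)]
        return QNode("between", children=tags + node.children)
    return node


def render(node):
    if node.op == "term":
        return node.value
    label = node.op + (":" + node.value if node.value else "")
    return f"{label}({', '.join(render(child) for child in node.children)})"


# Parsed form of: exact:FOIA "top secret" amendment:RECORDS
tree = QNode("and", children=[
    QNode("field", "exact", [QNode("term", "FOIA")]),
    QNode("phrase", children=[QNode("term", "top"), QNode("term", "secret")]),
    QNode("field", "amendment", [QNode("term", "RECORDS")]),
])

for stage in (mark_original, mark_lowercase, lemmatize, apply_template):
    tree = rewrite(tree, stage)

result = render(tree)
```

Because every stage has the same signature, stages can be reordered, dropped, or added per application, which is the "mix and match" property the slides emphasize.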

Page 31: Searching The United States Code with Solr/Lucene

The between() Operator
§  between(start-tag, end-tag, pos-clause, neg-clause)
   §  start-tag → Starting tag, e.g. “S/amendment”
   §  end-tag → Ending tag, e.g. “E/amendment”
   §  pos-clause → words which must occur between start and end
      •  Note: Requires a nested ScanAnd() operator
   §  neg-clause → words which must not occur between start and end
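
The semantics of between() can be illustrated on a plain token list. The Python function below is an in-memory model only, not the custom Lucene operator: it assumes spans do not nest for a given tag pair and treats the positive clause as "all of these words", in line with the ScanAnd() note.

```python
def between(tokens, start_tag, end_tag, pos, neg=()):
    """Return True if some start..end span holds every pos word and no neg word."""
    required, forbidden = set(pos), set(neg)
    inside = None  # words seen since the last start tag, or None when outside
    for tok in tokens:
        if tok == start_tag:
            inside = set()
        elif tok == end_tag and inside is not None:
            if required <= inside and not (forbidden & inside):
                return True
            inside = None
        elif inside is not None:
            inside.add(tok)
    return False


doc = [
    "S/statute", "coast", "guard", "E/statute",
    "S/amendment", "amendment", "2002", "record", "E/amendment",
]

hit = between(doc, "S/amendment", "E/amendment", ["record"])
miss_wrong_field = between(doc, "S/amendment", "E/amendment", ["coast"])
miss_negated = between(doc, "S/amendment", "E/amendment", ["record"], ["2002"])
```

Since the field boundaries are ordinary tokens in the main text, a match found this way sits at known positions in the document, which is what makes highlighting the field hit in context possible.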

Page 32: Searching The United States Code with Solr/Lucene

Part the Fourth: Hierarchical Navigation


Page 33: Searching The United States Code with Solr/Lucene

screenshot

Page 34: Searching The United States Code with Solr/Lucene

Hierarchies: Requirements
§  Any number of levels
   §  Title, Sub-Title, Chapter, Sub-Chapter, Part, Sub-Part, Section
§  Levels vary across titles
   §  Title 1: 3 levels
   §  Title 26: 8 levels
§  Multiple views:
   §  Children
   §  Ancestors
   §  Ancestor’s Siblings
§  Multiple search scopes:
   §  Only children, all descendants, everything

Page 35: Searching The United States Code with Solr/Lucene

Hierarchies: Ancestor-Siblings
§  US-Code
   •  Title 1
   •  Title 2
      §  Chapter 1
      §  Chapter 2
         –  Part 1
         –  Part 2
            •  Section 2.1
            •  Section 2.2
         –  Part 3
         –  Part 4
      §  Chapter 3
      §  Chapter 4
   •  Title 3

Page 36: Searching The United States Code with Solr/Lucene

Hierarchies: Fields
§  ancestors
   •  Searching
      §  USC USC-title2 USC-title2-chapter25 USC-title2-chapter25-subchapter2
§  encodedAncestors – for display only
   •  Where the node exists within the hierarchy
      §  id;heading;subjectTitle//id;heading;subjectTitle//...
      §  USC-title2-chapter25;Chapter 25;Unfunded Mandates Reform//USC-title2-chapter25-subchapter2;Subchapter II;Regulatory Accountability and Reform
§  parentId – ID of the parent node
   §  USC-title2-chapter25-subchapter2
§  treesort – Hierarchical sort field, e.g. “13/000/0/00882”
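
The three ancestor-related fields can all be derived from one list of ancestor nodes. The following Python sketch assumes each ancestor is available as an `(id, heading, subjectTitle)` tuple ordered from the top of the tree down; `hierarchy_fields` is a hypothetical helper, and only the two ancestors whose headings appear on the slide are used.

```python
def hierarchy_fields(ancestor_nodes):
    """Build the ancestors, encodedAncestors and parentId field values."""
    ids = [node[0] for node in ancestor_nodes]
    return {
        "ancestors": " ".join(ids),
        "encodedAncestors": "//".join(";".join(node) for node in ancestor_nodes),
        "parentId": ids[-1],
    }


doc_fields = hierarchy_fields([
    ("USC-title2-chapter25", "Chapter 25", "Unfunded Mandates Reform"),
    ("USC-title2-chapter25-subchapter2", "Subchapter II",
     "Regulatory Accountability and Reform"),
])
```

Storing `ancestors` as a space-separated list of IDs lets a single-term query match any level, while `encodedAncestors` carries the display strings so the breadcrumb can be rendered without extra lookups.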

Page 37: Searching The United States Code with Solr/Lucene

Hierarchies: Tree Sort
§  Sorting In Print Order
   •  Front Matter → Titles → Tables → etc.
   •  Everything padded to fixed-length

Example: 01/011/1/02032
   01 = USC Title
   011 = Title 11
   1 = An Appendix
   02032 = Sequence # in file
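
Fixed-width padding is what makes a plain string sort reproduce print order. A minimal Python sketch of the key construction follows; the function and parameter names are hypothetical, and the component widths (2, 3, 1, 5 digits) are read off the two example keys in the slides.

```python
def treesort_key(category, title, is_appendix, sequence):
    """Build a zero-padded sort key such as 01/011/1/02032."""
    return f"{category:02d}/{title:03d}/{int(is_appendix)}/{sequence:05d}"


example = treesort_key(1, 11, True, 2032)

# Plain string sorting now orders by title, then appendix flag, then sequence.
ordered = sorted([
    treesort_key(1, 11, False, 5),
    treesort_key(1, 2, False, 900),
    treesort_key(1, 11, True, 1),
])
```

Without the padding, Title 11 would sort before Title 2 as a string, so every component is padded to its maximum width.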

Page 38: Searching The United States Code with Solr/Lucene

Hierarchies: Sample Searches
§  Assuming Node = “USC-title2-chapter25”
§  Search Children
   •  parentId:USC-title2-chapter25
§  Search All Descendants
   •  ancestors:USC-title2-chapter25
§  Ancestor Siblings
   •  (parentId:USC OR parentId:USC-title2 OR parentId:USC-title2-chapter25)
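
The three sample searches follow directly from the `parentId` and `ancestors` fields. The Python helpers below only assemble the query strings shown above; their names are hypothetical, and the node's own ancestors are assumed to be available as the space-separated value of its `ancestors` field.

```python
def children_query(node_id):
    """Direct children: documents whose parent is this node."""
    return f"parentId:{node_id}"


def descendants_query(node_id):
    """All descendants: documents that list this node among their ancestors."""
    return f"ancestors:{node_id}"


def ancestor_siblings_query(node_id, ancestors_field):
    """Children of every ancestor and of the node itself, for a tree view."""
    ids = ancestors_field.split() + [node_id]
    return "(" + " OR ".join(f"parentId:{i}" for i in ids) + ")"


node = "USC-title2-chapter25"
tree_view = ancestor_siblings_query(node, "USC USC-title2")
```

Sorting the results of the ancestor-siblings query on `treesort` would then return the expanded tree in print order with a single request.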

Page 39: Searching The United States Code with Solr/Lucene

Contact
§  Paul Nelson
   •  [email protected]
§  Ronald Matamoros
   •  [email protected]
§  Search Technologies
   •  http://searchtechnologies.com