Build Your Own World Class Directory Search From Alpha to Omega

Copyright © President & Fellows of Harvard College

Build Your Own World-Class Directory Search From Α to Ω

Ravi Mynampaty

About Ravi

A hustler making a living by pretending to know more about Enterprise Search than he actually does...

“I can live on a good compliment two weeks with nothing else to eat...”

@RaviMynampaty

Why the heck should I listen to Ravi?

Agenda

Why?

What’s the fuss about?

What features?

What data?

How?

Where are we going?

Search Index

Model & Structure

Raw data

Prototype UI

My goal… (how many iterations?)

Icebreaker

Why?

What’s the big deal?

Records vs. Documents

Person: family_name, first_name, phone, email, ...

vs.

Document: Title: ..., Description: ..., Content: ...

Nicknames

Predictable

• Elizabeth: Beth, Bess, Betty, Liz

• Richard: Rich, Dick

• David: Dave

Simple Substrings

Srinivas → “Srini”

Mohammad → “Mo”

Somewhat Predictable

Yakub → “Jacob”

Yusuf → “Joseph”

Xian → “Sean”

Unpredictable

Hanuman → “Hank”

Madhav → “Mike”

Babu → “Bob”

Wongsu → “Richard”

Herman → “Dutch”
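A minimal sketch of query-time nickname expansion, using only names from the lists above. The `NICKNAMES` table and the `expand_name` function are hypothetical illustrations, not part of Solr or Fusion. A real directory would need a much larger, curated table, especially for the unpredictable cases.

```python
# Hypothetical nickname table built from the examples on the slides.
NICKNAMES = {
    "elizabeth": ["beth", "bess", "betty", "liz"],
    "richard": ["rich", "dick"],
    "david": ["dave"],
    "srinivas": ["srini"],
    "hanuman": ["hank"],
    "herman": ["dutch"],
}

# Reverse index so a nickname also leads back to its formal name(s).
FORMAL = {}
for formal, nicks in NICKNAMES.items():
    for nick in nicks:
        FORMAL.setdefault(nick, []).append(formal)


def expand_name(term):
    """Return the lowercased term plus every formal name or nickname linked to it."""
    key = term.lower()
    return [key] + NICKNAMES.get(key, []) + FORMAL.get(key, [])


print(expand_name("Liz"))
print(expand_name("Richard"))
```

A search for "Liz" can then be sent as a search for any of the returned variants, so the user does not need to know which form is stored in the directory.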

Abbreviations & Acronyms

Department Names

Information Technology

ITG

Info Tech.

HBS IT

Job Titles

CEO

PM

VP

...

Educational Degrees

PhD

JD

...

ALM → “Master of Liberal Arts”

(magistri in artibus liberalibus studiorum prolatorum)
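One way to handle abbreviations is a synonyms file. The sketch below writes lines in the two forms Solr's synonym files accept: a comma-separated list of equivalents, and an explicit `=>` mapping. The helper names `equivalent` and `one_way` are hypothetical, and whether multi-word entries match as intended depends on the analyzer configuration, which the slides do not specify.

```python
def equivalent(*terms):
    """All terms are treated as interchangeable."""
    return ", ".join(terms)


def one_way(source, target):
    """The source term is mapped to the target term."""
    return f"{source} => {target}"


lines = [
    equivalent("Information Technology", "ITG", "Info Tech.", "HBS IT"),
    one_way("ALM", "Master of Liberal Arts"),
]

print("\n".join(lines))
```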

Substrings

Experiment

I’d like you to meet...

Garoppolo

What was that guy’s name?

What did you search for?

One more...

Roethlisberger

What was that guy’s name?

What did you search for?

My prediction...

G…. & R...

Jimmy Garoppolo

Ben Roethlisberger

Exercise!

Wishlist: How should search work?

Search mechanisms

What should be searchable?

How should users be able to search?

Query interface: What should be supported?

Results interface: What should be displayed?

etc.

Wishlist discussion

Features

Search by:

Name, Department, Email, Job title, Phone number

Nicknames, Aliases

Substrings

Scoped search, Sort options

Faceting/filtering options

Spelling suggestions, Autocomplete, Devices, Voice search

Hands-on!

Install Sublime Text

https://www.sublimetext.com/3

Let’s create some data

Solr XML

<add>
  <doc>
    <field name="id">1813-05-05</field>
    <field name="LastName">Kierkegaard</field>
    <field name="FirstName">Søren</field>
  </doc>
  <doc>
    <field name="id">1966-12-14</field>
    <field name="LastName">Thorning-Schmidt</field>
    <field name="FirstName">Helle</field>
  </doc>
</add>

That was just for practice

Our Dataset: Members of US Congress

Need to create XML for 400+ people records
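Typing 400+ records by hand is not realistic, so the XML is better generated. A minimal sketch, assuming the records are already available as Python dictionaries (the `build_solr_xml` function and the `people` list are illustrative, and the two records are the practice ones from the earlier slide):

```python
import xml.etree.ElementTree as ET


def build_solr_xml(records):
    """Build an <add> element with one <doc> per record and return it as text."""
    add = ET.Element("add")
    for record in records:
        doc = ET.SubElement(add, "doc")
        for name, value in record.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


people = [
    {"id": "1813-05-05", "LastName": "Kierkegaard", "FirstName": "Søren"},
    {"id": "1966-12-14", "LastName": "Thorning-Schmidt", "FirstName": "Helle"},
]

xml_text = build_solr_xml(people)
print(xml_text)
```

Using an XML library rather than string concatenation means characters such as `&` or `<` in a name are escaped automatically.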

Install JDK 1.8

http://tinyurl.com/ie17java

(set JAVA_HOME env variable)

java -version

javac -version

echo $JAVA_HOME // *nix

echo %JAVA_HOME% // windows

Install Fusion

https://lucidworks.com/

Download + Unzip

Run it!

Open cmd prompt

cd ...\fusion-3.0.0\fusion\3.0.0\bin

Run it: “fusion.cmd start”

C:\..\Desktop\fusion-3.0.0\fusion\3.0.0\bin>fusion.cmd start
Starting zookeeper..
Successfully started zookeeper on port 9983 (process ID 144
Starting solr..............
Successfully started solr on port 8983 (process ID 19564)
Starting api............................
Successfully started api on port 8765 (process ID 12568)
Starting connectors..........................
Successfully started connectors on port 8984 (process ID 18
Starting ui.............
Successfully started ui on port 8764 (process ID 14096)

Admin UI: http://localhost:8764/

1. Create password

Follow along with me:

1. Quickstart

2. Create a new collection (call it “Test1”)

3. Select a dataset: “Revolution Session Data”

4. Try some searches

5. Add faceted search

Break

1st Matrix

1st Matrix: matrix1.xml

<doc>
  <field name="PersonId">Gabbard, Tulsi</field>
  <field name="LastName">Gabbard</field>
  <field name="FirstName">Tulsi</field>
  <field name="State">Hawaii</field>
  <field name="District">2nd District</field>
  <field name="Room">1433 LHOB</field>
  <field name="Phone">202-225-4906</field>
  <field name="Party">Democratic</field>
  <field name="Committee">Armed Services</field>
  <field name="Email">Tulsi.Gabbard@mail.house.gov</field>
</doc>

Matrix XML

http://tinyurl.com/ie17matrix

Create Solr collection for US Congress

http://localhost:8764/

Devops → New → Collection Name “house” → Save Collection

Configure Fields

Create Datasource → Add → Filesystem → SolrXML → Datasource ID

Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix1.xml

Start Crawl → (Wait for finish) → Job History → (Observe success/fail)

Let’s search!

Query Workbench

Format Results → Documents

- One Primary Field (which one do you think?)

- One Secondary

- One Other

_s vs. _t

String (_s):

Preserved entirely: no tokenizing, case preserved

Text (_t):

Tokenized, stopwords removed, lowercased
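The difference is easiest to see with a toy analyzer. The sketch below is only a simulation of the two behaviors, not Solr's actual analysis chain: the stopword set is a small illustrative assumption, and the function names are hypothetical.

```python
STOPWORDS = {"a", "an", "and", "of", "the"}  # small illustrative set (assumption)


def analyze_string(value):
    """String-style field: keep the whole value as one unchanged token."""
    return [value]


def analyze_text(value):
    """Text-style field: split on whitespace, lowercase, drop stopwords."""
    tokens = [t.lower() for t in value.split()]
    return [t for t in tokens if t not in STOPWORDS]


def matches(query, indexed_tokens, analyzer):
    """A query matches when all of its analyzed tokens are in the index."""
    return all(t in indexed_tokens for t in analyzer(query))


value = "Armed Services"
print(matches("armed", analyze_string(value), analyze_string))
print(matches("armed", analyze_text(value), analyze_text))
print(matches("Armed Services", analyze_string(value), analyze_string))
```

With the string-style field, only the exact value "Armed Services" matches. With the text-style field, a single lowercase word is enough.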

Query Workbench

Try some searches: are they generally working?

Sort: what sort makes sense for people search?
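One plausible answer to the sort question is directory order: last name, then first name, ignoring case. A minimal sketch using names that appear elsewhere in this deck (the `directory_order` function is hypothetical):

```python
people = [
    {"LastName": "Thorning-Schmidt", "FirstName": "Helle"},
    {"LastName": "Gabbard", "FirstName": "Tulsi"},
    {"LastName": "Kierkegaard", "FirstName": "Søren"},
]


def directory_order(person):
    """Sort key: last name first, then first name, both case-insensitive."""
    return (person["LastName"].casefold(), person["FirstName"].casefold())


ordered = sorted(people, key=directory_order)
print([p["LastName"] for p in ordered])
```

Relevance ranking matters less for people search than for document search, because most queries are attempts to find one known person.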

Scoped Search

<field_name>:<search_string>

e.g.,

State_s:Hawaii

Booleans

AND, "+", OR, NOT and "-"

e.g.,

LastName_s:Smith AND Party_s:Republican
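The same scoped and Boolean queries can be sent outside the Query Workbench. The sketch below only builds the request URL. It assumes the queries go straight to Solr's standard `select` handler on port 8983 (the Solr port shown in the startup output) and that the collection is the "house" collection created earlier. In Fusion, queries would normally go through a query pipeline instead, so treat the URL shape as an assumption. `select_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def select_url(collection, query, base="http://localhost:8983/solr"):
    """Build a URL for Solr's select handler with a URL-encoded query."""
    params = urlencode({"q": query, "wt": "json"})
    return f"{base}/{collection}/select?{params}"


print(select_url("house", "State_s:Hawaii"))
print(select_url("house", "LastName_s:Smith AND Party_s:Republican"))
```

The encoding step matters: the colon and the spaces around `AND` must be escaped before the query goes into a URL.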

Fuzzy Matching

e.g.,

Castor~

Castor~0.8
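Fuzzy matching is based on edit distance. The sketch below implements plain Levenshtein distance to show the idea. It is not Lucene's implementation, which also counts a transposition of two adjacent letters as a single edit. Recent Lucene/Solr versions read the number after `~` as a maximum edit count, while a fractional value such as 0.8 comes from the older similarity-threshold syntax, so check the behavior of the Solr version bundled with your Fusion install. The candidate surnames are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def fuzzy_match(query, candidates, max_edits=2):
    """Return the candidates within max_edits of the query, ignoring case."""
    q = query.lower()
    return [c for c in candidates if edit_distance(q, c.lower()) <= max_edits]


print(fuzzy_match("Castor", ["Castro", "Carter", "Gabbard"]))
```

Note that a loose threshold pulls in "Carter" as well as "Castro", which is the usual trade-off between recall and precision in fuzzy search.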

Synonyms

No one uses scoped / Boolean :-(

Trick them!
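The slide does not spell out the trick. One possible reading is that the system applies scoping on the user's behalf: if a plain search term is a known value of some field, rewrite it into a scoped clause behind the scenes. The sketch below illustrates that reading only. The `KNOWN_VALUES` table and the `rewrite` function are hypothetical, and the field names follow the `_s` examples above.

```python
# Hypothetical table of known field values; in practice it could be
# loaded from the index (for example from facet values).
KNOWN_VALUES = {
    "State_s": {"hawaii"},
    "Party_s": {"democratic", "republican"},
}


def rewrite(query):
    """Turn recognized bare terms into scoped clauses and AND everything together."""
    clauses = []
    for term in query.split():
        for field, values in KNOWN_VALUES.items():
            if term.lower() in values:
                clauses.append(f"{field}:{term.capitalize()}")
                break
        else:
            # No field recognized this term, so leave it as a plain term.
            clauses.append(term)
    return " AND ".join(clauses)


print(rewrite("smith republican"))
```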

Facets

What facets make sense for people search? Add some.
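A facet is a count of how many matching records carry each value of a field. The sketch below computes such counts in plain Python to show what the search engine returns alongside the results. The three records are made-up placeholders, not real members of Congress, and `facet_counts` is a hypothetical helper.

```python
from collections import Counter

# Illustrative placeholder records (not real data).
records = [
    {"State": "Hawaii", "Party": "Democratic"},
    {"State": "Texas", "Party": "Republican"},
    {"State": "Texas", "Party": "Democratic"},
]


def facet_counts(records, field):
    """Count how many records have each value of the given field."""
    return Counter(r[field] for r in records if field in r)


print(facet_counts(records, "Party").most_common())
print(facet_counts(records, "State").most_common())
```

Fields with a small, fixed set of values (party, state, committee) tend to make useful facets. Fields that are unique per person (email, phone) do not.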

2nd Matrix

2nd Matrix: matrix2.xml

<doc>
  <field name="PersonId">Gabbard, Tulsi</field>
  <field name="LastInitial">G</field>
  <field name="LastName">Gabbard</field>
  <field name="FirstName">Tulsi</field>
  <field name="Nickname">POTUS2024</field>
  <field name="State">Hawaii</field>
  <field name="District">2nd District</field>
  <field name="Room">1433 LHOB</field>
  <field name="Phone">202-225-4906</field>
  <field name="Party">Democratic</field>
  <field name="Committee">Armed Services</field>
  <field name="Email">Tulsi.Gabbard@mail.house.gov</field>
</doc>

Recrawl Solr collection for US Congress

http://localhost:8764/

Devops → Collection Name → Datasource → Clear Datasource

Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix2.xml

Start Crawl → (Wait for finish) → Job History → (Observe success/fail)

Search using Query Workbench

Substrings

Did they work?

3rd Matrix

3rd Matrix: matrix3.xml

Gabbard

- Gabbard

- gabbar

- gabba

- gabb

- gab

- ga

- g

Substrings via N-grams
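The list of shrinking prefixes shown for "Gabbard" is what is usually called edge n-grams. A minimal sketch of generating them (the `edge_ngrams` and `prefix_match` functions are hypothetical). The slides appear to put the grams into the data file itself. Solr can also produce them at index time with its edge n-gram filter, and the slides do not say which approach matrix3.xml relies on.

```python
def edge_ngrams(value, min_len=1):
    """Return every prefix of the lowercased value, longest first."""
    text = value.lower()
    return [text[:n] for n in range(len(text), min_len - 1, -1)]


def prefix_match(query, value):
    """A query matches when it equals one of the value's edge n-grams."""
    return query.lower() in edge_ngrams(value)


print(edge_ngrams("Gabbard"))
print(prefix_match("Gabb", "Gabbard"))
print(prefix_match("bard", "Gabbard"))
```

Edge n-grams cover "starts with" searches only. Matching "bard" inside "Gabbard" would need n-grams taken from every position, at the cost of a larger index.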

Recrawl Solr collection for US Congress

http://localhost:8764/

Devops → Collection Name → Datasource → Clear Datasource

Path: set path to XML file on disk: C:\cygwin64\home\rmynampaty\house\matrix3.xml

Start Crawl → (Wait for finish) → Job History → (Observe success/fail)

Search using Query Workbench

Where are we?

Search Index

Model & Structure

Raw data

Prototype UI

Next steps for you

Search Index

Model & Structure

End-user UI

Raw data

Prototype UI

Thank you!

Questions?

searchguy@hbs.edu

@RaviMynampaty

linkedin.com/in/mynampaty

facebook.com/ravi.mynampaty
