Enhancing Discovery with Solr and Mahout
Grant Ingersoll, Chief Scientist, Lucid Imagination

Enhance discovery Solr and Mahout


DESCRIPTION

Los Angeles/OC Apache Lucene/Solr User Group meeting, held at Shopzilla in LA on January 19th, 2012.


Page 1: Enhance discovery Solr and Mahout

CONFIDENTIAL

Thinking Lucene Think Lucid

Grant Ingersoll, Chief Scientist, Lucid Imagination

Enhancing Discovery with Solr and Mahout

Page 2: Enhance discovery Solr and Mahout

CONFIDENTIAL | Copyright Lucid Imagination

Evolution

Documents • Models • Feature Selection

User Interaction • Clicks • Ratings/Reviews

• Learning to Rank

• Social Graph

Queries • Phrases • NLP

Content Relationships • Page Rank, etc. • Organization

Page 3: Enhance discovery Solr and Mahout


Minding the Intersection

Search

Discovery Analytics

Page 4: Enhance discovery Solr and Mahout

Topics

• Background
  – Apache Mahout
  – Apache Solr and Lucene
• Recommendations with Mahout
  – Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion

Page 5: Enhance discovery Solr and Mahout

Apache Lucene in a Nutshell

• http://lucene.apache.org/java
• Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
  – Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet

Page 6: Enhance discovery Solr and Mahout

Apache Solr in a Nutshell

• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
  – Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices

Page 7: Enhance discovery Solr and Mahout

Apache Mahout in a Nutshell

• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  – http://mahout.apache.org
• The Three C’s:
  – Collaborative Filtering (recommenders)
  – Clustering
  – Classification
• Others:
  – Frequent Item Mining
  – Primitive collections
  – Math stuff

http://dictionary.reference.com/browse/mahout

Page 8: Enhance discovery Solr and Mahout

Recommendations with Mahout

Page 9: Enhance discovery Solr and Mahout

Recommenders

• Collaborative Filtering (CF)
  – Provide recommendations solely based on preferences expressed between users and items
  – “People who watched this also watched that”
• Content-based Recommendations (CBR)
  – Provide recommendations based on the attributes of the items and user profile
  – ‘Modern Family’ is a sitcom, Bob likes sitcoms
    • => Suggest Modern Family to Bob
• Mahout is geared towards CF, can be extended to do CBR
  – Classification can also be used for CBR
• Aside: search engines can also solve these problems
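The content-based case from this slide (Modern Family is a sitcom, Bob likes sitcoms) can be sketched in a few lines of Python. This is only an illustration of attribute matching, not Mahout code: the `ITEM_GENRES` and `USER_LIKES` tables, the "horror" label for Dracula, and the function name are all hypothetical.

```python
# Hypothetical item attributes and user profile, for illustration only.
ITEM_GENRES = {
    "Modern Family": {"sitcom"},
    "Dracula": {"horror"},
}
USER_LIKES = {"Bob": {"sitcom"}}


def content_based_suggestions(user, item_genres, user_likes):
    """Return items that share at least one attribute with the user's profile."""
    liked = user_likes.get(user, set())
    return sorted(item for item, genres in item_genres.items() if genres & liked)


print(content_based_suggestions("Bob", ITEM_GENRES, USER_LIKES))
```

Collaborative filtering, by contrast, would ignore the genre labels entirely and look only at which users interacted with which items.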

Page 10: Enhance discovery Solr and Mahout

To Rate or Not?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -

• In many instances, users don’t provide actual ratings
  – Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
  – Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?
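One way to act on the "Boolean preferences" point is to drop the rating values and keep only the fact that a user interacted with an item. The sketch below does that for the Bob and Mary table and compares the two users with the Tanimoto (Jaccard) coefficient. The choice of Tanimoto is an assumption made for illustration; the slide does not name a similarity measure.

```python
# Ratings taken from the Bob/Mary table on the slide.
RATINGS = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def to_boolean(ratings):
    """Discard rating values; keep only the set of items each user touched."""
    return {user: set(items) for user, items in ratings.items()}


def tanimoto(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


prefs = to_boolean(RATINGS)
similarity = tanimoto(prefs["Bob"], prefs["Mary"])
candidates = prefs["Mary"] - prefs["Bob"]
print(similarity, candidates)
```

With Boolean preferences Bob and Mary overlap on two of three items, so Frankenstein becomes a candidate for Bob even though their numeric ratings disagree.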

Page 11: Enhance discovery Solr and Mahout

Collaborative Filtering with Mahout

• Extensive framework for collaborative filtering
• Recommenders
  – User based
  – Item based
  – Slope One
• Online and Offline support
  – Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
User n    0.8      0.7          0.1

Recommendations for User X

Page 12: Enhance discovery Solr and Mahout

User Similarity

[Diagram: Users 1-4 and Items 1-4]

What should we recommend for User 1?

Page 13: Enhance discovery Solr and Mahout

Item Similarity

[Diagram: Users 1-4 and Items 1-4]

What should we recommend for User 1?

Page 14: Enhance discovery Solr and Mahout

Slope One

• Intuition: There is a linear relationship between rated items
  – Y = mX + b where m = 1
• Solve for b upfront based on existing ratings: b = (Y - X)
  – Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up front computation and memory

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
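The worked example above can be reproduced with a small Slope One sketch: precompute the average difference for every item pair, then predict by adding that difference to a rating the user already gave. This is a minimal illustration, not Mahout's implementation; the function names and the weighting by co-rating count are assumptions.

```python
from collections import defaultdict


def average_differences(ratings):
    """For each ordered item pair (i, j), average r_i - r_j over users who rated both."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    totals[(i, j)] += r_i - r_j
                    counts[(i, j)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, counts


def predict(user_ratings, target, diffs, counts):
    """Average (known rating + difference), weighted by how many users co-rated the pair."""
    numerator = 0.0
    denominator = 0
    for j, r_j in user_ratings.items():
        pair = (target, j)
        if pair in diffs:
            numerator += (r_j + diffs[pair]) * counts[pair]
            denominator += counts[pair]
    return numerator / denominator if denominator else None


# The two users from the slide: A rated both items, B rated only Item 2.
ratings = {"A": {"Item 1": 3.5, "Item 2": 2.0}, "B": {"Item 2": 3.0}}
diffs, counts = average_differences(ratings)
prediction = predict(ratings["B"], "Item 1", diffs, counts)
print(prediction)  # 4.5
```

The nested loop in `average_differences` is the up-front computation the slide mentions: it stores one value per item pair, which is why memory grows with the square of the number of items.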

Page 15: Enhance discovery Solr and Mahout

Online and Offline Recommendations

• Online
  – Predates Hadoop
  – Designed to run on a single node
    • Matrix size of ~100M interactions
  – API for integrating with your application
• Offline
  – Hadoop based
  – Designed to run on large cluster
  – Several approaches:
    • RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob

Page 16: Enhance discovery Solr and Mahout

RecommenderJob

• Essentially does matrix multiplication using distributed techniques
• $MAHOUT_HOME/bin/examples/asf-email-examples.sh

        101   102   103   104   105       User A       Recs
101     7     2     0     1     3         3.0          30
102     2     8     3     5     2         0            37
103     0     3     3     6     4     X   4.0      =   38
104     1     5     6     4     7         3.0          53
105     3     2     4     7     9         2.0          64
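The multiplication on this slide can be checked on a single machine with plain Python. The sketch below multiplies the item-item matrix by User A's preference vector and then keeps only items the user has not rated. Reading the matrix as an item co-occurrence matrix and treating a 0 preference as "not yet rated" are assumptions; RecommenderJob itself distributes this work over Hadoop.

```python
ITEMS = [101, 102, 103, 104, 105]

# Item-item matrix from the slide (assumed to hold co-occurrence counts).
ITEM_MATRIX = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preferences, in the same item order; 0 is assumed to mean "no rating".
USER_A = [3.0, 0.0, 4.0, 3.0, 2.0]


def score_items(matrix, prefs):
    """Matrix-vector product: one score per item row."""
    return [sum(m * p for m, p in zip(row, prefs)) for row in matrix]


def recommend(items, prefs, scores):
    """Return unrated items, highest score first."""
    unrated = [(score, item) for item, pref, score in zip(items, prefs, scores) if pref == 0]
    return [item for score, item in sorted(unrated, reverse=True)]


scores = score_items(ITEM_MATRIX, USER_A)
print(scores)                            # [30.0, 37.0, 38.0, 53.0, 64.0]
print(recommend(ITEMS, USER_A, scores))  # [102]
```

Items 103, 104 and 105 score higher than 102, but User A has already rated them, so 102 is the only item left to recommend.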

Page 17: Enhance discovery Solr and Mahout

Discovery with Solr

Page 18: Enhance discovery Solr and Mahout

Discovery with Solr

• Goals:
  – Guide users to results without having to guess at keywords
  – Encourage serendipity
  – Never show empty results
• Out of the Box:
  – Faceting
  – Spell Checking
  – More Like This
  – Clustering (Carrot2)
• Extend
  – Clustering (with Mahout)
  – Frequent Item Mining (with Mahout)

Page 19: Enhance discovery Solr and Mahout

Clustering

• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
  – Pluggable
  – Default implementation uses Carrot2
• Mahout has Hadoop based large scale clustering
  – K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.

Page 20: Enhance discovery Solr and Mahout

Discovery In Action

• Pre-reqs:
  – Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  – svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  – cd solr-trunk/solr/
  – ant example
  – cd example
  – java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2:
  – cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true

Page 21: Enhance discovery Solr and Mahout

Solr + Mahout

Page 22: Enhance discovery Solr and Mahout

Basics

• Most Mahout tasks are offline
• Solr provides many touch points for integration:
  – ClusteringEngine
    • Clustering results
  – SearchComponent
    • Suggestions: related searches, clusters, MLT, spellchecking
  – UpdateProcessor
    • Classification of documents
  – FunctionQuery

Page 23: Enhance discovery Solr and Mahout

Example: Frequent Itemset Mining

• Discover frequently co-occurring items
• Use Case: Related Searches from Solr Logs
• Hadoop and sequential versions
  – Parallel FP Growth
• Input:
  – <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  – Comma, pipe also allowed as delimiters
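To make "frequently co-occurring items" concrete, here is a brute-force sketch that reads lines in the tab-plus-space input format described above and counts every itemset up to a given size. It is deliberately naive and is not FP-Growth, which avoids enumerating all combinations. The sample lines are invented, and only the space delimiter is handled.

```python
from collections import Counter
from itertools import combinations


def parse_transaction(line):
    """Parse '<optional id>TAB<token> <token> ...' into a set of tokens."""
    tokens_part = line.rstrip("\n").split("\t")[-1]
    return set(tokens_part.split())


def frequent_itemsets(lines, min_support=2, max_size=2):
    """Count all itemsets of size 1..max_size and keep those seen min_support times or more."""
    counts = Counter()
    for line in lines:
        items = sorted(parse_transaction(line))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_support}


# Hypothetical query transactions; the last line omits the optional document id.
SAMPLE = [
    "1\tsolr faceting",
    "2\tsolr faceting tutorial",
    "3\tmahout clustering",
    "solr mahout",
]
result = frequent_itemsets(SAMPLE)
print(result)
```

On this sample the only pair that reaches the minimum support of 2 is ("faceting", "solr"), which is the kind of result that could back a related-searches suggestion.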

Page 24: Enhance discovery Solr and Mahout

FIM on Solr Query Logs

• Goal:
  – Extract user queries from Solr logs
  – Feed into FIM to generate Related Keyword Searches
• Context:
  – Solr Query logs
  – bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  – bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  – bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
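The regular expression passed to `regexconverter` can be tried out in isolation. The sketch below applies the same pattern with Python's `re` module to a few invented request URLs and URL-decodes the captured query. It only mimics the extraction step: the sample lines are hypothetical, and treating URL decoding as the job of the `url` transformer is an assumption, not something the slide states.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as on the slide: text after "?q=" or "&q=", up to the next "&" or end of line.
QUERY_PATTERN = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_queries(lines):
    """Return the decoded q= parameter of each line, skipping lines with no or empty query."""
    queries = []
    for line in lines:
        match = QUERY_PATTERN.search(line.strip())
        if match and match.group(0):
            queries.append(unquote_plus(match.group(0)))
    return queries


# Hypothetical request URLs standing in for Solr log lines.
SAMPLE_LINES = [
    "/solr/select?q=chris+hostetter&rows=10",
    "/solr/select?rows=10&q=faceted%20search",
    "/solr/browse?q=&debugQuery=true",
    "/solr/admin/ping",
]
queries = extract_queries(SAMPLE_LINES)
print(queries)
```

Each extracted query, split into space-separated tokens, is one transaction for the frequent itemset step.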

Page 25: Enhance discovery Solr and Mahout

Output

• Key: Chris: Value: ([Chris, Hostetter],870), ([Chris],870), ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18), ([Search, Faceted, Chris, Hostetter, Webcast, Power],18), ([Search, Faceted, Chris, Hostetter],18), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12), ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12), ([Solr, new, Chris, Hostetter, webcast, along],12), ([Solr, new, Chris, Hostetter, webcast],12), ([Solr, new, Chris, Hostetter],12)

Page 26: Enhance discovery Solr and Mahout

Resources

• http://lucene.apache.org
• http://mahout.apache.org
• http://manning.com/owen
• http://manning.com/ingersoll
• http://[email protected]
• grant@[email protected]
• @gsingers

Page 27: Enhance discovery Solr and Mahout

Appendix

Page 28: Enhance discovery Solr and Mahout

Mahout Overview

Math Vectors/Matrices/SVD

Recommenders Clustering Classification Freq. Pattern Mining

Genetic

Utilities/Integration Lucene/Vectorizer

Collections (primitives)

Apache Hadoop

Applications

Examples

See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms