Leveraging the Power of SOLR with SPARK

Johannes Weigend, QAware GmbH, Germany

Apache Big Data Europe, September 2015

Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany

Welcome

• Johannes Weigend
- CTO QAware GmbH
- Software architect / developer
- 25 years of experience
- Custom enterprise solutions (Java, JS, …)
- Lecturer for UI development at the University of Applied Science in Rosenheim
- Focus on performance and scalability
- SOLR user since 2011

Brute Force Data Analysis

Dataflow: Read -> Filter -> Map (in parallel on each node), then Reduce

• Data is not indexed
• foreach() -> minutes / hours
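The brute-force dataflow can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code from the talk: the log lines and the `ERROR` filter are hypothetical, and a real job would spread the stages across cluster nodes instead of running them in one process.

```python
from functools import reduce

# Hypothetical log lines standing in for files that are read without any index.
LOG_LINES = [
    "2015-09-28 10:00:01 INFO  login user=a",
    "2015-09-28 10:00:02 ERROR timeout user=b",
    "2015-09-28 10:00:03 ERROR timeout user=a",
    "2015-09-28 10:00:04 INFO  logout user=a",
]

def brute_force_count(lines, needle):
    """Touch every line: filter, map each match to 1, reduce to a total."""
    matching = filter(lambda line: needle in line, lines)  # Filter
    ones = map(lambda _line: 1, matching)                   # Map
    return reduce(lambda a, b: a + b, ones, 0)              # Reduce

error_count = brute_force_count(LOG_LINES, "ERROR")
```

Every query pays for a full pass over the data, which is why the runtime grows with the data volume rather than with the number of hits.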

Search Based Data Analysis

Dataflow: Filter + Search -> Map (in parallel on each node), then Reduce

• Indexed data (there’s no free lunch)
• foreach() -> seconds / minutes
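The search-based idea can be illustrated with a tiny inverted index in plain Python. This is a sketch under stated assumptions: the documents, the whitespace tokenizer and the function names are hypothetical, and SOLR's Lucene index is far more elaborate. It only shows where the "no free lunch" cost sits: the index is built once up front, and afterwards a query touches only the matching documents.

```python
from collections import defaultdict

# Hypothetical documents keyed by id.
DOCS = {
    1: "INFO login user=a",
    2: "ERROR timeout user=b",
    3: "ERROR timeout user=a",
    4: "INFO logout user=a",
}

def build_index(docs):
    """Pay the indexing cost once: map each term to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, *terms):
    """Intersect the posting sets so only matching documents reach map/reduce."""
    postings = [index.get(term.lower(), set()) for term in terms]
    return set.intersection(*postings) if postings else set()

INDEX = build_index(DOCS)
hits = search(INDEX, "ERROR", "user=a")
```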

Agenda

• SOLR Cloud (Demo)
• SPARK cluster (Demo)
• Importing data into SOLR with SPARK (Demo)
• Analysis with SOLR and SPARK (Demo)

Apache SOLR

• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Single-valued and multi-valued fields (similar to arrays)
• With or without a schema
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full text search capabilities
• Aggregation functions (aka facets)
• Stable -> V 5.3


SOLR@QAware

• AIR
- Aftersales Information Research
• ZEBRA
- Part explosion for complex products
• EKG
- Software Electro Cardiogram
• QAsearch
- Enterprise search across all repositories including history

Apache SOLR for Big Data Analysis?

• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?

Demo: SOLR Cloud

• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
- Terms (count by term)
- Ranges (count in range)
- Functions (avg, sum, …)
- Sub-Facets (pivot)

Counting as Term Facet
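A term facet counts documents per distinct field value. The sketch below builds a request body in the shape used by SOLR's JSON Facet API; it is not the query shown in the demo. The field name `level`, the facet label `by_term` and the helper name are hypothetical, and the body would typically be sent to a collection's `/query` handler.

```python
import json

def term_facet_request(field, limit=10):
    """Build a JSON Facet API body that counts documents per distinct value of `field`."""
    return {
        "query": "*:*",
        "limit": 0,  # return no documents, only the facet buckets
        "facet": {
            "by_term": {"type": "terms", "field": field, "limit": limit},
        },
    }

# "level" is a hypothetical field, e.g. the log level of an indexed log line.
body = json.dumps(term_facet_request("level"), indent=2)
```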

Statistics as Function Facet
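Function facets compute statistics over the matching documents instead of counting buckets. A minimal sketch of such a request body, assuming a hypothetical numeric field `duration_ms`; the labels on the left (`avg_value`, …) are arbitrary names chosen for this example.

```python
import json

def stats_facet_request(field, query="*:*"):
    """Build a JSON Facet API body with aggregation functions over `field`."""
    return {
        "query": query,
        "limit": 0,  # statistics only, no documents
        "facet": {
            "avg_value": f"avg({field})",
            "sum_value": f"sum({field})",
            "min_value": f"min({field})",
            "max_value": f"max({field})",
        },
    }

# "duration_ms" is a hypothetical numeric field.
stats_body = json.dumps(stats_facet_request("duration_ms"), indent=2)
```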

Pivots as Sub Facets
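A pivot can be expressed by nesting one facet inside another: every bucket of the outer facet gets its own inner buckets and statistics. The sketch below shows the nesting only; the field names (`host`, `level`, `duration_ms`) and labels are hypothetical.

```python
def pivot_facet_request(outer_field, inner_field, metric_field):
    """Build a JSON Facet API body with a terms facet nested in a terms facet."""
    return {
        "query": "*:*",
        "limit": 0,
        "facet": {
            "by_outer": {
                "type": "terms",
                "field": outer_field,
                # Sub-facets are evaluated once per bucket of the outer facet.
                "facet": {
                    "avg_metric": f"avg({metric_field})",
                    "by_inner": {"type": "terms", "field": inner_field},
                },
            }
        },
    }

# Hypothetical example: per host, count log levels and average a duration field.
pivot = pivot_facet_request("host", "level", "duration_ms")
```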

careerbuilder.com

Banana


What’s Missing?

• Client-side processing of SOLR results does not scale
• No built-in M/R support
• Where to store really big data?
- Images
- Videos
- Binaries / large text documents
• No interfaces to R / ML

Apache SPARK

• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
- Text files (accessible from all nodes)
- Hadoop File System (HDFS)
- Databases (JDBC)
- SOLR!


Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf

Combining Spark with SOLR

• Use Cases
- Distributed ETL: importing data into SOLR Cloud
  - Our use case: importing N logfiles into SOLR
- Distributed processing: data analysis
  - Statistics on binary data
  - Map/Reduce

Four Ways to Import Data into SOLR

1. Using built-in functions
- post script
- Data import handler
- Admin UI
2. Writing custom parallel code using the SolrJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark

Demo: Import Logfiles with Spark

• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks' spark-solr library
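The per-worker part of such an import job can be sketched in plain Python. This is not the demo code and does not use the spark-solr API: the log line format, the field names and the helper functions are all hypothetical. It only illustrates the two steps each worker would perform for its share of the files: turning lines into flat documents and grouping them into batches before sending them to SOLR.

```python
import re
from itertools import islice

# Hypothetical log format: "<date> <time> <LEVEL> <message>"
LINE_PATTERN = re.compile(r"^(\S+) (\S+) (\w+)\s+(.*)$")

def to_document(path, line_no, line):
    """Turn one log line into a flat document (dict of fields), or None if it does not parse."""
    match = LINE_PATTERN.match(line)
    if match is None:
        return None
    date, time, level, message = match.groups()
    return {
        "id": f"{path}:{line_no}",  # unique key built from file name and line number
        "timestamp": f"{date}T{time}Z",
        "level": level,
        "message": message,
    }

def batches(docs, size):
    """Group documents so that each update request carries at most `size` documents."""
    iterator = iter(docs)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk

def import_partition(path, lines, batch_size=2):
    """What a single worker would do for one logfile: parse, drop bad lines, batch."""
    docs = (to_document(path, n, line) for n, line in enumerate(lines, start=1))
    return list(batches((d for d in docs if d is not None), batch_size))
```

In a Spark job the same function would run once per partition, so the files are parsed and sent in parallel.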


Demo: Distributed Analysis with Spark

• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
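The duration calculation follows a map / reduce-by-key pattern, sketched here in plain Python. This is not the demo job: the event tuples, the action ids and the assumption that a duration is the span between the first and last event of an action are all hypothetical.

```python
from datetime import datetime

# Hypothetical events: (action id, event type, ISO timestamp)
EVENTS = [
    ("order-1", "start", "2015-09-28T10:00:00"),
    ("order-2", "start", "2015-09-28T10:00:05"),
    ("order-1", "end", "2015-09-28T10:00:03"),
    ("order-2", "end", "2015-09-28T10:00:12"),
]

def action_durations(events):
    """Return the duration in seconds per action id."""
    # Map: key every event by its action id and parse the timestamp.
    keyed = ((action, datetime.fromisoformat(ts)) for action, _kind, ts in events)
    # Reduce by key: keep the earliest and latest timestamp seen per action.
    bounds = {}
    for action, ts in keyed:
        first, last = bounds.get(action, (ts, ts))
        bounds[action] = (min(first, ts), max(last, ts))
    return {action: (last - first).total_seconds()
            for action, (first, last) in bounds.items()}

durations = action_durations(EVENTS)
```

Because the per-key reduction only needs min and max, partial results from different workers can be merged in any order.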


SolrRDD - The Spark Abstraction to Process SOLR Results

https://github.com/LucidWorks/spark-solr

SPARK Supports Parallel SQL

DataFrame API

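Demo cluster (diagram)

• Node hardware: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70$
• Master node: SPARK Master, SPARK Worker, SOLR 5.3 (shard #0), ZooKeeper, NFS
• Worker node: SPARK Worker, SOLR 5.3 (shard #4)
• Cluster total: 40 cores, 10 GB RAM, 320 GB eMMC disk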

Summary

Any Questions?
