Leveraging the Power of SOLR with SPARK

Johannes Weigend, QAware GmbH, Germany

Apache Big Data Europe

September 2015

Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany

Welcome

• Johannes Weigend
  - CTO, QAware GmbH
  - Software architect / developer
  - 25 years of experience
  - Custom enterprise solutions (Java, JS, …)
  - Lecturer for UI development at the University of Applied Science in Rosenheim
  - Focus on performance and scalability
  - SOLR user since 2011


Brute Force Data Analysis

[Diagram: three parallel pipelines, each Read → Filter → Map, feeding one Reduce]

• Dataflow over data that is not indexed
• foreach() -> minutes / hours


Search Based Data Analysis

[Diagram: a Filter, then three parallel Search → Map pipelines with further Filter steps in the dataflow, feeding one Reduce]

• Dataflow over indexed data (there’s no free lunch)
• foreach() -> seconds / minutes
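
The difference between the two approaches can be illustrated with a minimal sketch in plain Python. It is not SOLR or SPARK code; the records and field names are hypothetical. The brute-force variant reads and filters every record, while the search-based variant pays once to build an index and then answers by lookup.

```python
from collections import defaultdict

# Hypothetical sample records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "level": "ERROR", "msg": "timeout"},
    {"id": 2, "level": "INFO", "msg": "started"},
    {"id": 3, "level": "ERROR", "msg": "disk full"},
    {"id": 4, "level": "WARN", "msg": "slow query"},
]


def brute_force(records, level):
    """Read every record and filter it: cost grows with the total data size."""
    return [r["id"] for r in records if r["level"] == level]


def build_index(records, field):
    """Pay once up front (no free lunch): map each field value to matching ids."""
    index = defaultdict(list)
    for r in records:
        index[r[field]].append(r["id"])
    return index


def search(index, value):
    """Look the value up directly: cost grows with the number of hits only."""
    return index.get(value, [])


LEVEL_INDEX = build_index(RECORDS, "level")
print(brute_force(RECORDS, "ERROR"))
print(search(LEVEL_INDEX, "ERROR"))
```

Both calls return the same ids; only the amount of data touched per query differs.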


Agenda

1. SOLR Cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)


• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Simple and multiple fields (similar to arrays)
• Schema and schema less
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full text search capabilities
• Aggregation functions (aka facets)
• Stable (current version: 5.3)
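
As a rough illustration of the document model and the query language, the sketch below shows one document as a set of fields, including a multi-valued one, and builds a URL for the standard /select handler with a Lucene query string. The host, the collection name "logs" and all field names are assumptions made for this example.

```python
from urllib.parse import urlencode

# A SOLR document is a flat collection of named fields.
# All field names and values here are hypothetical.
doc = {
    "id": "log-0001",
    "level": "ERROR",
    "timestamp": "2015-09-28T10:15:30Z",
    "duration_ms": 1250,
    "tags": ["checkout", "payment"],  # a multi-valued field, similar to an array
}


def select_url(base_url, collection, query, rows=10):
    """Build a URL for the /select handler using a Lucene query string."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"{base_url}/{collection}/select?{params}"


url = select_url(
    "http://localhost:8983/solr",  # assumed local SOLR instance
    "logs",                        # hypothetical collection name
    "level:ERROR AND tags:payment",
)
print(url)
```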


SOLR@QAware

• AIR: Aftersales Information Research
• ZEBRA: Part explosion for complex products
• EKG: Software Electro Cardiogram
• QAsearch: Enterprise search across all repositories including history


Apache SOLR for BigData Analysis?

• Text Search Engine?
• Aggregations?
• Slice and Dice?
• Pivots?


Demo: SOLR Cloud

• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
  - Terms (count by term)
  - Ranges (count in range)
  - Functions (avg, sum, …)
  - Sub-Facets (pivot)


Counting as Term Facet
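
A terms facet counts documents per distinct value of a field. The sketch below is not the query from the demo: it shows what such a request body could look like in SOLR's JSON Facet API, plus a local stand-in that computes the same kind of counts in plain Python. The field name "level" and the sample documents are hypothetical.

```python
import json
from collections import Counter

# JSON Facet API request body: count documents per distinct value of a field.
# "level" is a hypothetical field name.
term_facet_request = {
    "query": "*:*",
    "limit": 0,  # return only the facet counts, no documents
    "facet": {
        "by_level": {"type": "terms", "field": "level"},
    },
}


def count_by_term(docs, field):
    """Local stand-in for what the terms facet computes on the server side."""
    return Counter(d[field] for d in docs if field in d)


docs = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "ERROR"}]
print(json.dumps(term_facet_request))
print(count_by_term(docs, "level"))
```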


Statistics as Function Facet
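
Function facets compute aggregates such as avg and sum over the matching documents. As a hedged sketch rather than the demo's own query, the request body below uses the JSON Facet API's function syntax on a hypothetical numeric field "duration_ms", and a small helper reproduces the arithmetic locally.

```python
import json

# JSON Facet API request body with two aggregation functions.
# "duration_ms" and "level" are hypothetical field names.
function_facet_request = {
    "query": "level:ERROR",
    "limit": 0,
    "facet": {
        "avg_duration": "avg(duration_ms)",
        "total_duration": "sum(duration_ms)",
    },
}


def stats(docs, field):
    """Local stand-in: sum and average of one numeric field."""
    values = [d[field] for d in docs if field in d]
    total = sum(values)
    return {"sum": total, "avg": total / len(values) if values else None}


sample = [{"duration_ms": 100}, {"duration_ms": 300}]
print(json.dumps(function_facet_request))
print(stats(sample, "duration_ms"))
```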


Pivots as Sub Facets
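
Nesting one facet inside another yields a pivot: for each bucket of the outer facet, the inner facet is evaluated again. The following sketch assumes hypothetical fields "host" and "level"; the request body follows the JSON Facet API's nesting convention, and the helper shows the equivalent two-level count in plain Python.

```python
from collections import Counter, defaultdict

# Nested JSON Facet API request: per host, count documents per level.
# "host" and "level" are hypothetical field names.
sub_facet_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "by_host": {
            "type": "terms",
            "field": "host",
            "facet": {
                "by_level": {"type": "terms", "field": "level"},
            },
        },
    },
}


def pivot(docs, outer, inner):
    """Local stand-in: counts of the inner field within each outer bucket."""
    table = defaultdict(Counter)
    for d in docs:
        table[d[outer]][d[inner]] += 1
    return {key: dict(counts) for key, counts in table.items()}


sample = [
    {"host": "web1", "level": "ERROR"},
    {"host": "web1", "level": "INFO"},
    {"host": "web2", "level": "ERROR"},
    {"host": "web1", "level": "ERROR"},
]
print(pivot(sample, "host", "level"))
```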


careerbuilder.com


Banana


What’s Missing?

• Client-side processing of SOLR results does not scale
• No built-in M/R support
• Where to store really big data?
  - Images
  - Videos
  - Binaries / large text documents
• No interfaces to R / ML


• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
  - Textfiles (accessible from all nodes)
  - Hadoop File System (HDFS)
  - Databases (JDBC)
  - SOLR!

Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
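
The map/reduce idea can be sketched without a cluster. The example below is plain Python, not the SPARK API: each inner list stands for one partition of a distributed dataset, the map step runs independently per partition, and the reduce step merges the partial results. The log lines are made up for illustration.

```python
from functools import reduce

# Each inner list stands for one partition of a distributed dataset.
partitions = [
    ["ERROR timeout", "INFO started"],
    ["ERROR disk full", "WARN slow query"],
    ["INFO stopped"],
]


def map_partition(lines):
    """Map step: runs independently per partition, emits (level, 1) pairs."""
    return [(line.split(" ", 1)[0], 1) for line in lines]


def merge(acc, pair):
    """Reduce step: sums the counts per key."""
    key, n = pair
    acc[key] = acc.get(key, 0) + n
    return acc


mapped = [pair for part in partitions for pair in map_partition(part)]
counts = reduce(merge, mapped, {})
print(counts)
```

In a real cluster the map step would run on the workers that hold each partition, and only the small per-key results would travel over the network.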


Combining Spark with SOLR

• Use cases
  - Distributed ETL: importing data into SOLR Cloud
    - Our use case: importing N logfiles into SOLR
  - Distributed processing: data analysis
    - Statistics on binary data
    - Map/Reduce


Four Ways to Import Data into SOLR

1. Using built-in functions
   - post script
   - Dataimport handler
   - Admin-UI
2. Writing custom parallel code using the SOLRJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark
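
The second option, custom parallel import code, can be sketched as follows. This is Python using only the standard library, not SOLRJ: documents are split into batches and the batches are sent concurrently as JSON arrays to an update handler. The update URL in the usage note, the batch size and the function names are assumptions for illustration; the sender is injectable so the sketch runs without a SOLR instance.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def batches(docs, size):
    """Split the document list into consecutive batches of at most `size`."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]


def post_batch(update_url, batch):
    """POST one batch as a JSON array to a SOLR update handler."""
    request = urllib.request.Request(
        update_url,
        data=json.dumps(batch).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status


def import_parallel(docs, send, batch_size=2, workers=4):
    """Send all batches concurrently; `send` is injectable so it can be faked."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, batches(docs, batch_size)))


# Dry run with a fake sender (len) that just reports each batch size.
print(import_parallel([{"id": str(i)} for i in range(5)], send=len))
```

Against a running instance, the sender could be `lambda b: post_batch("http://localhost:8983/solr/logs/update?commit=true", b)`, where the host and the collection name "logs" are placeholders.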


Demo: Import Logfiles with Spark

• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks’ spark-solr library
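
The per-file work of such an import job is essentially a map from log lines to documents. The sketch below shows that step only, in plain Python and without the spark-solr library. The log format, the field names and the id scheme are hypothetical; the demo's actual format is not given in the slides.

```python
import re

# Hypothetical log format: "<ISO timestamp> <LEVEL> <message>"
LOG_LINE = re.compile(r"^(?P<timestamp>\S+)\s+(?P<level>[A-Z]+)\s+(?P<message>.*)$")


def to_document(path, line_no, line):
    """Turn one log line into a document dict, or None if it does not parse."""
    match = LOG_LINE.match(line.strip())
    if match is None:
        return None
    doc = match.groupdict()
    doc["id"] = f"{path}:{line_no}"  # assumed id scheme: file path plus line number
    return doc


def parse_file(path, lines):
    """Map step for one logfile: keep only the lines that parse."""
    docs = (to_document(path, n, line) for n, line in enumerate(lines, start=1))
    return [d for d in docs if d is not None]


sample_lines = ["2015-09-28T10:15:30Z ERROR payment failed\n", "garbage"]
print(parse_file("app.log", sample_lines))
```

In a distributed job, each worker would apply this mapping to the files assigned to it and send the resulting documents to SOLR in batches.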


Demo: Distributed Analysis with Spark

• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
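
One way to compute the duration of business actions is to group log events by an action id and take the span between the first and the last timestamp. The sketch below does this in plain Python rather than in a Spark job. It assumes every event carries a hypothetical "action_id" and an ISO "timestamp" field; the slides do not specify the real schema.

```python
from datetime import datetime

# Hypothetical events: each log entry carries the id of its business action.
events = [
    {"action_id": "order-1", "timestamp": "2015-09-28T10:00:00"},
    {"action_id": "order-2", "timestamp": "2015-09-28T10:00:05"},
    {"action_id": "order-1", "timestamp": "2015-09-28T10:00:03"},
    {"action_id": "order-2", "timestamp": "2015-09-28T10:00:06"},
]


def durations(events):
    """Seconds between the first and last event of each action."""
    spans = {}
    for e in events:
        t = datetime.fromisoformat(e["timestamp"])
        first, last = spans.get(e["action_id"], (t, t))
        spans[e["action_id"]] = (min(first, t), max(last, t))
    return {k: (last - first).total_seconds() for k, (first, last) in spans.items()}


print(durations(events))
```

Because min and max are associative, the same grouping can be expressed as a reduce by key, which is what makes it suitable for distributed execution.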


SolrRDD - The Spark Abstraction to Process SOLR Results

https://github.com/LucidWorks/spark-solr


SPARK Supports Parallel SQL


Dataframe API


[Diagram: demo cluster]

• Each node: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, 70 $
• Master node: SPARK Master, SPARK Worker, SOLR 5.3 (shard #0), ZooKeeper, NFS
• Four further nodes: SPARK Worker and SOLR 5.3 each (shards #1 to #4)
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk


Summary


Any Questions?
