Leveraging the Power of SOLR with SPARK

Johannes Weigend, QAware GmbH, Germany

Apache Big Data Europe

September 2015

Apache Big Data Europe | Leveraging the power of SOLR with SPARK | Johannes Weigend | QAware GmbH Germany

Welcome

• Johannes Weigend
  - CTO QAware GmbH
  - Software architect / developer
  - 25 years of experience
  - Custom enterprise solutions (Java, JS, …)
  - Lecturer for UI development at the University of Applied Sciences in Rosenheim
  - Focus on performance and scalability
  - SOLR user since 2011


Brute Force Data Analysis

• Data is not indexed
• Dataflow: three parallel Read -> Filter -> Map pipelines feeding one Reduce
• foreach() -> minutes / hours


Search Based Data Analysis

• Data is indexed (there’s no free lunch)
• Dataflow: three parallel Search (with Filter) -> Map pipelines feeding one Reduce
• foreach() -> seconds / minutes
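The difference between the two approaches can be sketched in plain Python. This is only an illustration of the idea, not SOLR or SPARK code: the records, the field names and the helper functions are all hypothetical.

```python
from collections import defaultdict

# Hypothetical log records; the field names are illustrative only.
records = [
    {"id": 1, "level": "ERROR", "duration_ms": 120},
    {"id": 2, "level": "INFO", "duration_ms": 15},
    {"id": 3, "level": "ERROR", "duration_ms": 300},
    {"id": 4, "level": "WARN", "duration_ms": 42},
]

def brute_force_total(records, level):
    """Read every record, then filter, map and reduce."""
    matching = (r for r in records if r["level"] == level)   # filter
    durations = (r["duration_ms"] for r in matching)         # map
    return sum(durations)                                    # reduce

def build_index(records):
    """Pay the indexing cost once, up front (the 'no free lunch' part)."""
    index = defaultdict(list)
    for position, record in enumerate(records):
        index[record["level"]].append(position)
    return index

def search_based_total(records, index, level):
    """Look up the hits in the index, then map and reduce only those."""
    hits = index.get(level, [])                              # search
    return sum(records[pos]["duration_ms"] for pos in hits)  # map + reduce

index = build_index(records)
print(brute_force_total(records, "ERROR"))
print(search_based_total(records, index, "ERROR"))
```

Both functions return the same total. The brute-force variant touches every record on each query, while the search-based variant only touches the records the index points to.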


Agenda

1. SOLR cloud (demo)
2. SPARK cluster (demo)
3. Importing data into SOLR with SPARK (demo)
4. Analysis with SOLR and SPARK (demo)


• Horizontally scalable, distributed NoSQL (index) database
• Document oriented
• A document is a collection of fields (string, number, date, …)
• Single-valued and multi-valued fields (similar to arrays)
• Schema and schemaless modes
• Powerful query language (Lucene)
• Distributed data in shards
• Replication
• Powerful full-text search capabilities
• Aggregation functions (aka facets)
• Stable: version 5.3
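As a rough illustration of the document model, a single document can be thought of as a flat set of typed fields. The sketch below builds one in Python and serializes it as a JSON array, which is one of the formats an update request can carry. The field names and values are assumptions chosen for this example.

```python
import json

# Hypothetical log-event document; all field names are assumptions.
document = {
    "id": "log-0001",                     # string field
    "duration_ms": 120,                   # number field
    "timestamp": "2015-09-28T10:15:00Z",  # date field (ISO 8601, UTC)
    "tags": ["import", "batch"],          # multi-valued field, similar to an array
}

# A JSON array of documents, e.g. usable as the body of an update request.
payload = json.dumps([document], indent=2)
print(payload)
```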


SOLR@QAware

• AIR
  - Aftersales Information Research
• ZEBRA
  - Part explosion for complex products
• EKG
  - Software Electro Cardiogram
• QAsearch
  - Enterprise search across all repositories including history


Apache SOLR for BigData Analysis?

• Text search engine?
• Aggregations?
• Slice and dice?
• Pivots?


Demo: SOLR Cloud

• Installing and configuring SOLR Cloud
• Searching, sorting and filtering
• Facets
  - Terms (count by term)
  - Ranges (count in range)
  - Functions (avg, sum, …)
  - Sub-facets (pivot)
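Searching, sorting and filtering map onto the common query parameters q, fq and sort of the select handler. The sketch below only builds such a request URL. The host, the collection name `logs` and the field names are assumptions, not values from the demo.

```python
from urllib.parse import urlencode

# Assumed base URL and collection name; adjust to the actual cluster.
base_url = "http://localhost:8983/solr/logs/select"

params = [
    ("q", "message:timeout"),    # search: query on a (hypothetical) text field
    ("fq", "level:ERROR"),       # filter: restrict the hits without affecting scoring
    ("sort", "timestamp desc"),  # sort: newest first
    ("rows", "10"),              # number of documents to return
    ("wt", "json"),              # response format
]

query_url = base_url + "?" + urlencode(params)
print(query_url)
```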


Counting as Term Facet
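The screenshot for this slide is not preserved. One way to express a term count is the JSON Facet API available in SOLR 5.x, and whether the demo used exactly this form is an assumption. The facet name `by_level` and the field `level` are hypothetical. The second half of the sketch shows, on local sample data, what a terms facet computes.

```python
import json
from collections import Counter

# Request body for a terms facet (JSON Facet API); names are hypothetical.
terms_facet_request = {
    "query": "*:*",
    "limit": 0,  # only the counts are of interest, not the documents
    "facet": {
        "by_level": {"type": "terms", "field": "level", "limit": 10},
    },
}
print(json.dumps(terms_facet_request, indent=2))

# Local equivalent on sample data: count documents per distinct term.
sample_docs = [{"level": "ERROR"}, {"level": "INFO"}, {"level": "ERROR"}]
counts = Counter(doc["level"] for doc in sample_docs)
print(counts.most_common())
```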


Statistics as Function Facet
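The original screenshot is missing here as well. As a sketch, assuming the JSON Facet API and a hypothetical numeric field `duration_ms`, statistics can be requested as function facets, where each entry maps a result name to an aggregation function.

```python
import json

# Request body with function facets; field and result names are hypothetical.
stats_facet_request = {
    "query": "*:*",
    "limit": 0,  # return aggregates only
    "facet": {
        "avg_duration": "avg(duration_ms)",
        "total_duration": "sum(duration_ms)",
        "max_duration": "max(duration_ms)",
    },
}
print(json.dumps(stats_facet_request, indent=2))
```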


Pivots as Sub Facets
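A pivot can be sketched as a facet nested inside another facet. The example assumes the JSON Facet API and the hypothetical fields `host`, `level` and `duration_ms`: for every host bucket it asks for a count per level and for the average duration.

```python
import json

# Request body with a nested (sub) facet; all names are hypothetical.
pivot_request = {
    "query": "*:*",
    "limit": 0,
    "facet": {
        "by_host": {
            "type": "terms",
            "field": "host",
            "facet": {  # evaluated once per host bucket
                "by_level": {"type": "terms", "field": "level"},
                "avg_duration": "avg(duration_ms)",
            },
        }
    },
}
print(json.dumps(pivot_request, indent=2))
```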


careerbuilder.com


Banana


What’s Missing?

• Client-side processing of SOLR results does not scale
• No built-in M/R support
• Where to store really big data?
  - Images
  - Videos
  - Binaries / large text documents
• No interfaces to R / ML


• Distributed job execution engine
• Map/Reduce framework
• Scala based (runs on the JVM)
• Java/Scala/Python APIs
• Processes data from various data sources
  - Text files (accessible from all nodes)
  - Hadoop File System (HDFS)
  - Databases (JDBC)
  - SOLR!

Must Read: https://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf
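The map/reduce style of processing can be illustrated without a cluster. The sketch below uses only the Python standard library, not the SPARK API, and the log lines are made up: a map step emits (key, 1) pairs, a reduce step merges them into counts per key.

```python
from functools import reduce

# Hypothetical log lines standing in for a distributed data set.
lines = [
    "INFO start import",
    "ERROR import failed",
    "INFO retry import",
    "ERROR import failed again",
]

# map: emit a (key, 1) pair per line, keyed by the log level
pairs = map(lambda line: (line.split()[0], 1), lines)

# reduce: merge the pairs into one dictionary of counts per key
def merge(counts, pair):
    key, value = pair
    counts[key] = counts.get(key, 0) + value
    return counts

level_counts = reduce(merge, pairs, {})
print(level_counts)
```

In a distributed engine the map step runs on each partition in parallel and the partial results are merged afterwards. Here everything runs in one process.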


Combining Spark with SOLR

• Use cases
  - Distributed ETL: importing data into SOLR Cloud
  - Our use case: importing N logfiles into SOLR
  - Distributed processing: data analysis
  - Statistics on binary data
  - Map/Reduce


Four Ways to Import Data into SOLR

1. Using built-in functions
  - post script
  - Data import handler
  - Admin UI
2. Writing custom parallel code using the SolrJ API
3. Using and customizing Apache Nutch (Hadoop!)
4. Using and customizing Apache Spark


Demo: Import Logfiles with Spark

• Writing a Spark job which imports a bunch of logfiles in one directory
• Using Lucidworks' spark-solr library
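The per-line work of such an import job can be sketched independently of Spark and of the spark-solr library. The following standard-library example assumes a log line layout of `<timestamp> <LEVEL> <message>`, which is not taken from the talk, and uses two hypothetical helpers: `parse_line` turns a line into a document and `batches` groups documents for bulk indexing.

```python
import re
from itertools import islice

# Assumed log line layout: "<ISO timestamp> <LEVEL> <message>".
LOG_PATTERN = re.compile(r"^(\S+) (\w+) (.*)$")

def parse_line(filename, line_number, line):
    """Turn one log line into a document (dict), or None if it does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    timestamp, level, message = match.groups()
    return {
        "id": f"{filename}:{line_number}",  # unique per file and line
        "timestamp": timestamp,
        "level": level,
        "message": message,
    }

def batches(documents, size):
    """Group documents into lists of at most `size` items for bulk indexing."""
    iterator = iter(documents)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch

sample = [
    "2015-09-28T10:15:00Z INFO import started",
    "2015-09-28T10:15:02Z ERROR import failed",
    "",
    "2015-09-28T10:15:05Z INFO import retried",
]
parsed = (parse_line("app.log", n, line) for n, line in enumerate(sample, start=1))
docs = [doc for doc in parsed if doc is not None]
doc_batches = list(batches(docs, 2))
print(len(docs), [len(b) for b in doc_batches])
```

In a distributed job, each worker would apply the parsing step to its share of the files and send its batches to the index.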


Demo: Distributed Analysis with Spark

• Write a Spark job which calculates the duration of business actions
• Use Spark to access SOLR via SQL / JDBC
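The duration calculation can be sketched on a small in-memory sample. This is plain Python rather than a Spark job, and it assumes that each log event carries an action id and a timestamp, which the slides do not spell out. The duration of an action is taken as the time between its first and last event.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical events: (action id, ISO timestamp). The layout is an assumption.
events = [
    ("order-1", "2015-09-28T10:00:00"),
    ("order-2", "2015-09-28T10:00:05"),
    ("order-1", "2015-09-28T10:00:03"),
    ("order-2", "2015-09-28T10:00:12"),
    ("order-1", "2015-09-28T10:00:09"),
]

# map: parse each event into (action id, datetime)
parsed = [(action, datetime.fromisoformat(ts)) for action, ts in events]

# group the timestamps by action id
by_action = defaultdict(list)
for action, ts in parsed:
    by_action[action].append(ts)

# reduce: duration = last timestamp minus first timestamp, in seconds
durations = {
    action: (max(stamps) - min(stamps)).total_seconds()
    for action, stamps in by_action.items()
}
print(durations)
```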


SolrRDD - The Spark Abstraction to Process SOLR Results

https://github.com/LucidWorks/spark-solr


SPARK Supports Parallel SQL


DataFrame API


Cluster diagram:

• Node type: Odroid XU4, 2 GB RAM, 64 GB eMMC disk, Ubuntu Linux, $70
• Master node: SPARK Master, SPARK Worker, SOLR 5.3 shard #0, ZooKeeper, NFS
• Four further nodes: SPARK Worker and SOLR 5.3 shard #1 to #4, one shard per node
• Total: 40 cores, 10 GB RAM, 320 GB eMMC disk


Summary


Any Questions?
