Introduction to SolrCloud


A brief introduction to Solr and where it fits in your application, followed by an introduction to SolrCloud.

Varun Thacker

Apache Solr has a huge install base and tremendous momentum: it is the most widely used search solution on the planet.

• 8M+ total downloads

Solr is both established & growing

• 250,000+ monthly downloads

Solr has tens of thousands of applications in production.

You use Solr every day.

2500+ open Solr jobs.

Activity Summary

30-Day Summary: Aug 18 - Sep 17, 2014

• 128 Commits • 18 Contributors

via https://www.openhub.net/p/solr

12-Month Summary: Sep 17, 2013 - Sep 17, 2014

• 1351 Commits • 29 Contributors

Solr scalability is unmatched.

• 10TB+ Index Size • 10 Billion+ Documents • 100 Million+ Daily Requests


What is Solr?

• A system built to search text

• A specialized type of database management system

• A platform to build search applications on

• Customizable, open source software

Where does Solr fit?

What is SolrCloud?

A subset of optional features in Solr that enable and simplify horizontal scaling of a search index using sharding and replication.

Goals: scalability, performance, high availability, simplicity, and elasticity.

Terminology

• ZooKeeper: Distributed coordination service that provides centralized configuration, cluster state management, and leader election

• Node: JVM process bound to a specific port on a machine

• Collection: Search index distributed across multiple nodes, all sharing the same configuration

• Shard: Logical slice of a collection; each shard has a name, hash range, leader and replication factor. Documents are assigned to one and only one shard per collection using a hash-based document routing strategy

• Replica: A copy of a shard in a collection

• Overseer: A special node that executes cluster administration commands and writes updated state to ZooKeeper. If the Overseer node fails, a new one is elected automatically.

Collection == Distributed Index

• A collection is a distributed index defined by:

• named configuration stored in ZooKeeper

• number of shards: documents are distributed across N partitions of the index

• document routing strategy: how documents get assigned to shards

• replication factor: how many copies of each document in the collection

• Collections API:

• curl "http://localhost:8983/solr/admin/collections?action=CREATE&name=punemeetup&replicationFactor=2&numShards=2&collection.configName=myconf"
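As a quick check (a sketch, assuming Solr 4.8+ where the CLUSTERSTATUS action is available, and reusing the collection name created above), the Collections API can also report the shard and replica layout that CREATE produced:

curl "http://localhost:8983/solr/admin/collections?action=CLUSTERSTATUS&collection=punemeetup&wt=json"

With numShards=2 and replicationFactor=2, the response should list two shards, each with two replicas spread across the live nodes.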

DEMO

Document Routing

• Each shard covers a hash-range

• Default: Hash ID into 32-bit integer, map to range

• leads to (roughly) balanced shards

• Custom-hashing

• Tri-level: app!user!doc

• Implicit: no hash-range set for shards
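As an illustration of the default compositeId routing (a sketch; the field name and document IDs are hypothetical, the collection name carries over from the earlier CREATE example), prefixing the document ID with a shard key and a "!" separator co-locates related documents on the same shard:

curl "http://localhost:8983/solr/punemeetup/update?commit=true" -H "Content-Type: application/json" -d '[{"id": "user1!doc1", "title_s": "first doc for user1"}, {"id": "user1!doc2", "title_s": "second doc for user1"}]'

Both documents hash on the "user1" prefix, so they land on the same shard and can later be targeted together with the _route_=user1! query parameter.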

Replication

• Why replicate?

• High-availability

• Load balancing

• How does it work in SolrCloud?

• Near-real-time, not master-slave

• Leader forwards to replicas in parallel, waits for response

• Error handling during indexing is tricky

Distributed Indexing

• Get cluster state from ZK

• Route document directly to leader (hash on doc ID)

• Persist document on durable storage (tlog)

• Forward to healthy replicas

• Acknowledge write success to the client
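The routing step means the client does not have to know which node hosts the leader. For example (a sketch; the node port, document ID, and field name are assumptions), an update sent to any node is forwarded to the correct shard leader, written to that leader's transaction log, replicated, and only then acknowledged:

curl "http://localhost:7574/solr/punemeetup/update?commit=true" -H "Content-Type: application/json" -d '[{"id": "doc42", "title_s": "routed to its shard leader automatically"}]'

A successful response here means the document was accepted by the leader and its healthy replicas, not just by the node the request happened to hit.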

Shard Leader

• Additional responsibilities during indexing only! Not a master node

• Leader is a replica (handles queries)

• Accepts update requests for the shard

• Increments the _version_ on the new or updated doc

• Sends updates (in parallel) to all replicas
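Since the leader assigns a new _version_ to every accepted update, the current version of a document can be inspected with the real-time get handler (a sketch; it assumes the default example configuration, which registers /get, and reuses the hypothetical document from the earlier sketch):

curl "http://localhost:8983/solr/punemeetup/get?id=doc42&wt=json"

The response includes the stored fields plus the _version_ value, which is the basis for the optimistic locking feature mentioned later.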

Distributed Queries

• Query client can be ZK aware or just query via a load balancer

• Client can send query to any node in the cluster

• Controller node distributes the query to a replica of each shard to identify documents matching the query

• Controller node merges and sorts the per-shard results, then issues a second query to fetch all fields for the final page of results
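For instance (a sketch; the node address and sort field are assumptions, the collection name carries over from earlier examples), the same query can be sent to any node and the two-phase fan-out happens transparently; adding shards.info=true exposes which shards were consulted:

curl "http://localhost:8983/solr/punemeetup/select?q=*:*&sort=id+asc&rows=10&shards.info=true&wt=json"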

Scalability / Stability Highlights

• All nodes in the cluster perform indexing and execute queries; no master node

• Distributed indexing: No SPoF, high throughput via direct updates to leaders, automated failover to new leader

• Distributed queries: Add replicas to scale-out qps; parallelize complex query computations; fault-tolerance

• Indexing / queries continue so long as there is 1 healthy replica per shard

ZooKeeper

• Is a very good thing ... clusters are a zoo!

• Centralized configuration management

• Cluster state management

• Leader election (shard leader and overseer)

• Overseer distributed work queue

• Live Nodes

• Ephemeral znodes used to signal a server is gone

• Needs 3 nodes for quorum in production
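As a concrete example (a sketch; hostnames are assumptions and the bin/solr script is available from Solr 4.10 onward), a SolrCloud node joins an existing three-node ZooKeeper ensemble by listing all three hosts:

bin/solr start -c -z zk1:2181,zk2:2181,zk3:2181 -p 8983

Listing every ensemble member lets the Solr node stay connected as long as a quorum (two of the three ZooKeeper servers) is alive.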

ZooKeeper: State Management

• Keep track of live nodes in the /live_nodes znode

• ephemeral nodes

• ZooKeeper client timeout

• Collection metadata and replica state in /clusterstate.json

• Every core has watchers for /live_nodes and /clusterstate.json

• Leader election

• ZooKeeper sequence number on ephemeral znodes
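To see this state directly (a sketch; the ZooKeeper host is an assumption and the paths assume the single clusterstate.json layout described above), ZooKeeper's own CLI can be used:

zkCli.sh -server zk1:2181
# then, inside the CLI session:
ls /live_nodes
get /clusterstate.json

Each entry under /live_nodes is an ephemeral znode that disappears when the owning node's ZooKeeper session expires, which is how the rest of the cluster learns that a server is gone.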

Other Features / Highlights

• Near-Real-Time Search: Documents are visible within a second or so after being indexed

• Partial Document Update: Just update the fields you need to change on existing documents (see the sketch after this list)

• Optimistic Locking: Ensure updates are applied to the correct version of a document (also covered in the sketch below)

• Transaction log: Better recoverability; peer-sync between nodes after hiccups

• HTTPS

• Use HDFS for storing indexes

• Use MapReduce for building index (SOLR-1301)
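A minimal sketch of the two features flagged above (the field name, document ID, and version value are assumptions): an atomic update changes a single field of an existing document, and supplying the document's current _version_ turns it into an optimistic-locking write:

curl "http://localhost:8983/solr/punemeetup/update?commit=true" -H "Content-Type: application/json" -d '[{"id": "doc42", "title_s": {"set": "updated title"}, "_version_": 1480912345678901248}]'

If the supplied _version_ no longer matches the stored one, Solr rejects the update with a version conflict, so the client can re-read the document and retry.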

Solr on YARN

• Run the SolrClient application:

• Allocate container to run SolrMaster

• SolrMaster requests containers to run SolrCloud nodes

• Solr containers allocated across cluster

• SolrCloud node connects to ZooKeeper

More Information on Solr on YARN

• https://lucidworks.com/blog/solr-yarn/

• https://github.com/LucidWorks/yarn-proto

• https://issues.apache.org/jira/browse/SOLR-6743

Our users are also pushing the limits

https://twitter.com/bretthoerner/status/476830302430437376

Up, up and away!

https://twitter.com/bretthoerner/status/476838275106091008

Connect @

https://twitter.com/varunthacker

http://in.linkedin.com/in/varunthacker

varun.thacker@lucidworks.com
