Addressing scalability challenges in peer-to-peer search

Addressing scalability challenges in peer-to-peer search

PhD seminar 4 Feb, 2014

Harisankar H, PhD scholar,

DOS lab, Dept. of CSE Advisor: Prof. D. Janakiram

http://harisankarh.wordpress.com

Outline

• Issues with centralized search

– Can peer-to-peer search help?

• Scalability challenges in peer-to-peer search

• Proposed architectural extensions

– Two-layered architecture for peer-to-peer concept search

– Cloud-assisted approach to handle query spikes

Centralized search scenario

• Scenario – Search engines crawl available content, index and

maintain it in data centers

– User queries directed to data centers, processed internally and results sent back

– Centrally managed by single company

Datacenters

Content End users

Some issues with centralized search

– Privacy concerns • All user queries accessible from a single location

– Centralized control • Individual companies decide what to(not to) index, rank

etc.

– Transparency • Complete details of ranking, pre-processing etc. not

made available publicly

• Concerns of censorship and doctoring of results

Some issues with centralized search contd..

• Uses mostly syntactic search techniques

– Based on word or multi-word phrases

– Low quality of results due to ambiguity of natural language

• Issues with centralized semantic search

– Difficult to capture long tail of niche interests of users • Requires rich human generated knowledge bases in numerous

niche areas

Peer-to-peer search approach

• Edge nodes in the internet participate in providing and using the search service

• Search as a collaborative service

• Crawling, indexing and search distributed across the peers

How could peer-to-peer search help?

• Each user query can be sent to a different peer among millions – Obtaining query logs in a single location difficult – Reduced privacy concerns

• Distributed control across numerous peers – Avoids centralized control

• Search application available with all peers – Better transparency in ranking etc.

• Background knowledge of peers can be utilized for effective semantic search – Can help improve quality of results

• Led to lot of academic research in the area as well as real world p2p search engines*

* e.g., faroo.com, yacy.net; YacyPi Kickstarter project

Realizing peer-to-peer search

• Distribution of search index – Term partitioning

• Responsibility of individual terms assigned to different peers – E.g., peer1 is currently

responsible for term “computer”

• Term-to-peer mapping achieved through a structured overlay(e.g., DHT)

Image src: http://wwarodomfr.blogspot.in/2008/09/chord-routing.html

Scalability challenges in peer-to-peer search

• Peers share only idle resources

• Peers join/leave autonomously

• Limited individual resources

• leads to – Peer bandwidth bottleneck during query processing

• Particularly queries involving multiple terms(index transfer between multiple peers)

– Instability during query spikes

• Knowledge management issues at large scale – Difficult to have consensus at large scale – Need wide understanding and have to meet requirements of

large diverse group

No SLA

Two-layered architecture for peer-to-peer concept search*

• Peers organized as communities based on common interest

• Each community maintains its own background knowledge to use in semantic search – Maintained in a distributed manner

• A global layer with aggregate information to facilitate search across communities

• Background knowledge bases extend from minimal universally accepted knowledge in upper layer

• Search, indexing and knowledge management proceeds independently in each community

*joint work with Prof. Fausto Guinchiglia and Uladzimir, Univ. of Trento

Two-layered architecture for peer-to-peer concept search

Community-1

BK-1 doc index -1

Community-3

BK-3 doc index -3

Community-2

BK-2 doc index -2

UK Comm: index

GLOBAL

Two-layered architecture

• Global layer

– retrieves relevant communities for query based on universal knowledge

• Community layer

– retrieves relevant documents for query based on background knowledge of community

Overcoming the shortcomings of single-layered approaches

• Search can be scoped only to the relevant communities for a query – Results in less bandwidth-related issues

• Two layers make knowledge management scalable and interoperable – Niche interests supported at community-level background

knowledge bases – Minimal universal knowledge for interoperability

• Search within community based on community’s background knowledge – Focused interest of community helps in better term-to-

concept mapping

Two-layered approach

• Index partitioning

– Uses partition-by-term • Posting list for each term stored in different peers

– Uses Distributed Hash Table(DHT) to realize dynamic term-to-peer mapping • O(logN) hops for each lookup

• Overlay network

– Communities and global layer maintained using two-layered overlay • Based on our earlier work on computational grids*

– O(logN) hops for lookup even with two-layers

*M.V. Reddy, A.V. Srinivas, T. Gopinath, and D. Janakiram, “Vishwa: A reconfigurable P2P middleware for Grid Computations,” in ICPP'06


• Community management – Similar to public communities in flickr, facebook

etc.

• Search within community – Uses Concept Search* as underlying semantic

search scheme • Extends syntactic search with available knowledge to

realize semantic search

• Falls back to syntactic search when no knowledge is available

*Fausto Giunchiglia, Uladzimir Kharkevich, Ilya Zaihrayeu, “Concept search”, ESWC 2009


• Knowledge representation – Term -> concept mapping

– Concept hierarchy • Concept relations expressed as subsumption relations

• Concepts in documents/queries extracted – by analyzing words and natural language phrases

– Nounphrases translated into conjunctions of atomic concepts (complex concepts) • Small-1Λdog-2

– Documents/queries represented as enumerated sequences of complex concepts • Eg: 1:small-1Λdog-2 2:big-1Λanimal-3


• Relevance model – Documents having more specific concepts than query

concepts considered relevant • Eg: poodle-1 relevant when searching for dog-2

– Ranking done by extending tf-idf relevance model • Incorporates term-concept and concept-concept similarities also

• Distributed knowledge maintenance – Each atomic concept indexed on DHT with id – Node responsible for each atomic concept id also stores

ids of • All immediate more specific atomic concepts • All concepts in the path to root of the atomic concept


• Document indexing and search – Concepts mapped to peer using DHT – Query routed to peers responsible for the query concepts

and related concepts – Results from multiple peers combined to give final results

• Global search – The popularity(document frequency) of each concept

indexed in upper layer – Tf-idf extended with universal knowledge to search for

communities – Combined score of doc = (score of community)*(score of

doc within community)

Experiments • Single layer syntactic vs semantic: TREC ad-hoc,TREC8 (

simulated with 10,000 peers)

– Wordnet as knowledge base

• Single vs 2 layer

– 18 communities (doc: categories in dMoz*)

• 18*1000 = 18,000 peers simulated

– UK = domain-independent concepts and relations from wordnet

– BK = UK + wordnet domains + YAGO

– BK mapped to communities

– Queries selected as directory path to a specific subdirectory

– Standard result: documents in that subdirectory

*http://www.dmoz.org/

Experiments

• Tools – GATE(NLP), Lucene(search library), PeerSim(peer-to-

peer system simulator)

• Performance metrics – Quality

• Precision @10, precision @20 • Mean average precision, MAP

– Network bandwidth • Average number of postings transferred

– Response time • s-postings, s-hops

Results (1 layer syntactic vs semantic)

• Quality improved

• But, cost also increased

Results (1 layer vs 2 layer)

• Quality improved

• Cost decreased

– 94% decrease in posting transfer for opt. case

Two-layered approach results

• Proposed approach gives better quality and performance over single-layered approaches – Performance can further improved using

optimizations like early termination

• But, issue of query spikes remain

Query spikes in peer-to-peer search

• Query spikes can lead to instability

– Replication/caching insufficient due to high document creation rate*

rate of queries related to “Bin laden” increased by 10,000 times within one hour in Google on May 1, 2011 after Operation Geronimo.

Some background • Term-partitioned search

– Term/popular query responsibility assigned to individual peers • Updates and queries are sent to peer responsible which process them

– Term -> peer mapping done using a Distributed Hash Table(DHTs)

top-k result list of q

Cloud-assisted p2p search(CAPS)

• Offload responsibilities of spiking queries to public cloud

Issues in realizing CAPS

• Maintaining full index copy in cloud is very expensive – Storage alone will cost more than 5 million dollars per

month*

• Approach: transfer only relevant index portion to cloud – Need to be performed fast considering effect on user

experience(result quality, response time)

• Effect on the desirable properties of peer-to-peer search – Privacy, transparency, decentralized control etc.

CAPS components

• Switching decision maker

– Decide when to switch

– Simple e.g., “switch when query rate increases by X% within last Y seconds”

• Switching implementor

– Switching algorithm to seamlessly transfer index partition

– Dynamic creation of cloud instances

CAPS Switching algorithm

• Ensures that result quality is not affected • Controlled bandwidth usage at peer

Addressing additional concerns

• Transparency – Index resides both among peers and cloud

• Centralized control – Query can switched back to peers or other clouds

• Privacy – Only spiking queries(less revealing) are forwarded to

cloud

• Cost – Cloud used only transiently for spiking queries

• Cloud payment model – Anonymous keyword-based advertising model*

CAPS Evaluation

• Experimental setup – Target system consists of millions of peers

– Implemented the relevant components in a realistic network • Responsible peer, preceding peers, cloud instance

• Datasets – Real datasets on query/corresponding

updates(rates) not publicly available

– Used synthetic queries and updates with expected query/update rates/ratio

Experimental setup

• 6 heterogeneous workstations with 4-6 cores, 8-16GB RAM used

Experiments

• Two sets of experiments

1. Demonstrate effect of query spike with and without cloud-assistance

2. Effect of switching on user experience

• Response time and result quality

• Switching time

Results-1

With cloud assistance

Without cloud assistance

Results-2(effect of switching on user experience)

• Result freshness

• Response time

Switching time

Conclusions

• Peer-to-peer search has many advantages by design compared to centralized search

• But, peer-to-peer search approaches have scalability issues

• Two-layered approach to peer-to-peer search can improve efficiency and result quality of peer-to-peer search

• Offloading queries to cloud can be an effective method to handle query spikes – Desirable properties of p2p systems not lost

Publications

• Janakiram Dharanipragada and Harisankar Haridas, “Stabilizing peer-to-peer systems using public cloud: A case study of peer-to-peer search”, In the The 11th International Symposium on Parallel and Distributed Computing(ISPDC 2012), held at Munich, Germany.

• Janakiram Dharanipragada, Fausto Giunchiglia, Harisankar Haridas and Uladzimir Kharkevich, “Two-layered architecture for peer-to-peer concept search”, In the 4th International Semantic Search Workshop located at the 20th Int. World Wide Web Conference(WWW 2011), 2011), held at Hyderabad, India.

• Harisankar Haridas, Sriram Kailasam, Prateek Dhawalia, Prateek Shrivastava, Santosh Kumar and Janakiram Dharanipragada, “V-cloud: A Peer-to-peer Video Storage-Compute Cloud”, In the 21st International ACM Symposium on High-Performance Parallel and Distributed Computing(HPDC 2012), held at Delft, The Netherlands[Poster].

THANK YOU

Questions/Suggestions

harisankarh[ at ]gmail.com

mailto:[email protected]

mailto:[email protected]

Technology

Addressing scalability challenges in peer-to-peer search