Hashed Samples Selectivity Estimators for Set Similarity Selection Queries

Set Similarity: An Application

•Find similar strings
  • Decompose strings into 3-grams.
  • Represent strings as sets of 3-grams.
  • Compare strings by comparing their respective 3-gram sets.

•“Nick Koudas”: { ‘Nic’, ‘ick’, …, ‘das’ }

•“Nick Arkoudas”: { ‘Nic’, …, ‘das’ }

•We can use TF/IDF similarity (or other metrics) to evaluate set similarity.
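The decomposition above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and Jaccard similarity is used as a simple stand-in for the "other metrics" mentioned on the slide, since TF/IDF needs collection-wide statistics.

```python
from __future__ import annotations


def three_grams(s: str) -> set[str]:
    """Return the set of overlapping 3-character substrings of s."""
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity |a ∩ b| / |a ∪ b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


a = three_grams("Nick Koudas")
b = three_grams("Nick Arkoudas")
shared = sorted(a & b)          # 3-grams the two names have in common
similarity = jaccard(a, b)
print(shared, round(similarity, 3))
```

The comparison is case sensitive, so ‘Kou’ and ‘kou’ count as different 3-grams here; a real matcher would likely normalize case first.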

Indexes for Set Similarity Evaluation

•Current approaches use inverted lists
  • Compute the IDF of set elements (e.g., 3-grams).
  • Create one inverted list per set element, consisting of one entry per database set containing that element (e.g., one entry per string containing the respective 3-gram).

•Use various algorithms and sorting/compression schemes for fast merging of inverted lists.
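A small sketch of such an index, under stated assumptions: `build_inverted_index` is a hypothetical helper, the database is a dictionary from set id to element set, and the IDF variant `log(1 + N / df)` is one common choice, not necessarily the one used by the authors.

```python
import math
from collections import defaultdict


def build_inverted_index(db):
    """db maps set id -> set of elements; returns (inverted lists, idf weights)."""
    lists = defaultdict(list)
    for set_id in sorted(db):               # sorted ids keep every list ordered
        for element in db[set_id]:
            lists[element].append(set_id)
    n = len(db)
    # Assumed IDF variant: log(1 + N / df), where df is the list length.
    idf = {e: math.log(1 + n / len(ids)) for e, ids in lists.items()}
    return dict(lists), idf


db = {1: {"Nic", "ick", "das"}, 2: {"Nic", "das"}, 3: {"ick"}}
lists, idf = build_inverted_index(db)
print(lists["Nic"], round(idf["Nic"], 3))
```

Keeping each list sorted by set id is what makes the merge-based evaluation algorithms mentioned above applicable.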

Motivation

•Set similarity queries are very important
  • String matching.
  • Data cleaning.
  • Set-valued attributes in ORDBMS.

•A variety of set similarity operators have been proposed (for join, selection queries)

•Selectivity estimation is important for query optimization

The Problem

•Let I be a predefined set similarity measure.

•Let D be a collection of sets.

•Given a query set q and a threshold τ, a set similarity selection query returns the answer set A = { s ∈ D : I(q, s) > τ }.

• A set similarity selectivity estimation query estimates the size of A.
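For reference, the exact query can be written as a brute-force scan. This is only a sketch of the definition: `select` is a hypothetical name and Jaccard stands in for the predefined measure I.

```python
def jaccard(a, b):
    """Stand-in for the similarity measure I."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select(D, q, tau, sim=jaccard):
    """Return the ids of all sets s in D with sim(q, s) > tau."""
    return {sid for sid, s in D.items() if sim(q, s) > tau}


D = {1: {"a", "b", "c"}, 2: {"a", "b"}, 3: {"x", "y"}}
answer = select(D, q={"a", "b", "c"}, tau=0.5)
print(answer, len(answer))   # the answer set A and the quantity to be estimated, |A|
```

A selectivity estimator aims to approximate `len(answer)` without running this scan over all of D.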

Naïve Solutions

Random Sampling

•Maintain a sample S of sets s ∈ D.
•Size of answer:
    |A| = |A_s| · |D| / |S|,
  where A_s = { s ∈ S : I(q, s) > τ }.
•Drawbacks:
  • Query independent.
  • Large variance.
  • Needs to store complete sets in the sample.
  • Cannot handle updates.
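The scale-up formula can be illustrated as follows. This is a toy sketch with hypothetical names and a synthetic collection, again using Jaccard in place of I.

```python
import random


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def estimate_by_random_sample(D, q, tau, sample_size, rng, sim=jaccard):
    """Estimate |A| as |A_s| * |D| / |S| from a uniform sample S of D."""
    sample_ids = rng.sample(sorted(D), sample_size)
    hits = sum(1 for sid in sample_ids if sim(q, D[sid]) > tau)   # |A_s|
    return hits * len(D) / sample_size


# Toy collection: every fourth set matches the query exactly.
D = {i: ({"a", "b"} if i % 4 == 0 else {"c"}) for i in range(1000)}
q = {"a", "b"}
full = estimate_by_random_sample(D, q, 0.5, len(D), random.Random(0))
small = estimate_by_random_sample(D, q, 0.5, 50, random.Random(0))
print(full, small)
```

With a sample of 50 sets the estimate can only move in steps of 20, which gives a feel for the variance drawback listed above. Note also that the sample has to hold the complete sets `D[sid]`.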

One Sample Per List

•Use the existing inverted index
  • Compute one sample per inverted list.
  • Compute independent estimates per list, for the lists corresponding to the query set elements only (query specific).
  • Report median, max, average…
•Drawbacks
  • Ignores correlations between lists.
  • Needs to store complete sets in inverted lists.
  • Will not be better than simple random sampling.
  • Cannot handle updates.

Sample Union

•Compute the sample union of the samples corresponding to the query set elements.

•Drawbacks
  • Results in a biased sample.
    – There are duplicate elements in those lists.
    – Even if we eliminate duplicates, we still need to compute the distinct set size of the sample union (for scaling up).
    – This is more expensive than answering the set similarity query exactly to begin with.
  • Needs to store complete sets in inverted lists.
  • Cannot handle updates.

Dynamically Computed Samples

•Given the query
  • Use reservoir sampling to compute a sample union from the inverted lists on the fly.
•Drawbacks:
  • Produces a biased sample
    – Skips part of the input.
    – Duplicates.
  • Needs to store complete sets in the inverted lists.
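To make the duplicate problem concrete, here is a sketch of basic reservoir sampling (Algorithm R) run over the concatenation of two inverted lists. It is an illustration only: the slide's variant skips part of the input, which this simple version does not do, and the list contents are made up.

```python
import random
from itertools import chain


def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)        # uniform over 0..i, both ends inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


lists = {"Nic": [1, 2, 3], "ick": [1, 3, 4]}
stream = list(chain.from_iterable(lists.values()))   # [1, 2, 3, 1, 3, 4]
everything = reservoir_sample(stream, k=6, rng=random.Random(0))
sample = reservoir_sample(stream, k=3, rng=random.Random(0))
print(sorted(everything), sample)
```

The sample is uniform over list entries, not over distinct set ids: ids 1 and 3 occur in both lists and are therefore twice as likely to be picked as ids 2 and 4.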

Hashed Samples

Hashed Sample

•An a priori computed sample that
  • Builds uniform samples from arbitrary combinations of inverted lists.
  • Does not need to store complete sets in the sample (only set ids).
  • Leverages partial weight information contained in the lists.
  • Eliminates the need to store distinct value estimation synopses.
  • Provides unbiased estimates.
  • Handles updates gracefully.

Construction

•We cannot draw independent samples per list
  • Draw samples deterministically
    – In order to leverage the partial weights contained in the lists for computing I(q, s) efficiently.
    – Guarantee that if a set id is sampled from one list, it is also sampled from every other list that contains it.
  • Guarantee that the union of list samples is a uniform random sample.
  • We impose a random permutation on the domain of set ids
    – Use hashing and sample a consistent subset.

Construction 2

•Randomly choose a hash function h from a family of universal hash functions.

•Assume that h hashes into [1, 100].

•The values h(s1), h(s2), … appear i.i.d. (empirically).

•Choose a value x and, from every list, sample the sets s with h(s) < x.
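The construction can be sketched as below. Several things are assumptions made for illustration: a keyed BLAKE2b digest stands in for a randomly chosen universal hash function (the key plays the role of the random choice), the hash is mapped to the continuous range [0, 100) so that the threshold x yields an x% sample, and the helper names are hypothetical.

```python
import hashlib


def h(set_id, seed):
    """Hash a set id to [0, 100); the seed plays the role of picking h at random."""
    digest = hashlib.blake2b(str(set_id).encode(), key=seed, digest_size=6).digest()
    return 100 * int.from_bytes(digest, "big") / 2**48


def hashed_sample(lists, x, seed):
    """From every inverted list keep the set ids s with h(s) < x."""
    return {e: [s for s in ids if h(s, seed) < x] for e, ids in lists.items()}


lists = {"Nic": list(range(0, 2000)), "ick": list(range(1000, 3000))}
sample = hashed_sample(lists, x=10, seed=b"demo")
print(len(sample["Nic"]), len(sample["ick"]))
```

Because the decision depends only on the set id, an id that survives in one list survives in every list that contains it, which is the consistency property the previous slide asks for.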

Hashed Sample Properties

• We get an x% sample per list on average.
• We get an overall x% sample.
• The union of the samples of any set of inverted lists is an x% sample of the respective lists.
• Let q = {q1, q2, …, qn}:
    |A| = |A_s| · |q1 ∪ … ∪ qn|_d / |qs1 ∪ … ∪ qsn|_d,
  where |·|_d is the number of distinct set ids and qsi is the sampled list of qi.
• Computing |A_s| is simple
  • Run any exact evaluation algorithm on the sampled lists!

• Performance improvement with respect to exact evaluation is directly proportional to the size of the sample.

• We still need the distinct number of set ids in the lists of q to scale up the results.
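The estimator above might look as follows in code. This is a sketch under explicit assumptions: the hash is the same keyed BLAKE2b stand-in mapped to [0, 100), the similarity is a simplified one (fraction of query elements a set contains) chosen because it can be computed from the lists alone, and the distinct size of the full union is computed exactly here for clarity.

```python
import hashlib
from collections import Counter


def h(set_id, seed=b"demo"):
    digest = hashlib.blake2b(str(set_id).encode(), key=seed, digest_size=6).digest()
    return 100 * int.from_bytes(digest, "big") / 2**48


def overlap_counts(lists, q):
    """Merge the lists of q: set id -> number of query elements the set contains."""
    counts = Counter()
    for element in q:
        counts.update(lists.get(element, []))
    return counts


def estimate_answer_size(lists, q, tau, x):
    """Scale |A_s|, computed on the x% hashed sample, up to the full list union."""
    sampled = {e: [s for s in lists.get(e, []) if h(s) < x] for e in q}
    counts = overlap_counts(sampled, q)
    # Stand-in similarity: fraction of query elements contained in the set.
    a_s = sum(1 for c in counts.values() if c / len(q) > tau)
    sample_union = len(counts)                      # |qs1 ∪ … ∪ qsn|_d
    full_union = len(overlap_counts(lists, q))      # |q1 ∪ … ∪ qn|_d, exact here
    return a_s * full_union / sample_union if sample_union else 0.0


lists = {"a": list(range(1000)), "b": list(range(0, 1000, 2)), "c": list(range(0, 1000, 4))}
q = ["a", "b", "c"]
exact = estimate_answer_size(lists, q, tau=0.5, x=100)   # x = 100 keeps every id
approx = estimate_answer_size(lists, q, tau=0.5, x=10)
print(exact, approx)
```

Computing `full_union` exactly requires a pass over the full lists, which defeats the purpose in practice; that is why a cheap estimate of the distinct union size is needed.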

The K-Minimum Values Synopsis

•Estimating the distinct size of arbitrary list unions:
  • The sampled lists themselves can be used as a KMV synopsis, by construction.

• The r-th smallest hash value h_r of a set of elements gives an unbiased estimator of the number of distinct elements in the set:
    – |S|_d ≈ |P| (r – 1) / h_r

• Given that the sample lists contain all elements s.t. h(s) ≤ x, we can deduce the rank of h_m = x.
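A minimal sketch of the KMV estimate, assuming that |P| denotes the size of the hash range. Here the hash is normalized to [0, 1), so |P| = 1 and the estimator reduces to (r − 1) / h_r. The keyed BLAKE2b hash and the function names are illustrative choices, not taken from the slides.

```python
import hashlib


def h(set_id, seed=b"demo"):
    """Hash a set id to [0, 1), i.e. a hash range normalized so that |P| = 1."""
    digest = hashlib.blake2b(str(set_id).encode(), key=seed, digest_size=6).digest()
    return int.from_bytes(digest, "big") / 2**48


def kmv_estimate(ids, r):
    """Estimate the number of distinct ids as (r - 1) / h_r, for r >= 2."""
    hashes = sorted({h(s) for s in ids})     # duplicates collapse to one hash
    if len(hashes) < r:
        return float(len(hashes))            # fewer than r distinct values: exact
    return (r - 1) / hashes[r - 1]


ids = list(range(10_000))
estimate = kmv_estimate(ids + ids[:5_000], r=1_000)   # duplicates do not matter
print(round(estimate))
```

In the hashed-sample setting the full sort is unnecessary: the sampled union already holds the smallest hash values, so r can be taken as the number of distinct sampled ids.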

Experimental Evaluation

•IMDB, DBLP, YellowPages

•Decompose strings into 3-grams and build inverted index for TF/IDF similarity.

•Build list samples: 1%, 5%, 10%.

•Draw queries from the data:
  • 100 queries per workload.
  • Each workload contains queries of a preset selectivity.

•Evaluate estimation accuracy and runtime.

Storing Sets VS. Storing Set Ids

Reservoir Sampling: Accuracy

Reservoir Sampling: Cost

Hashed Sampling: Accuracy

Hashed Sampling: Cost

Hashed Sampling: Threshold

Hashed Sampling: Answer Size

Hashed Sampling: KMV Accuracy
