Redis Indices (#RedisTLV)


Redis Indices
127.0.0.1:6379> CREATE INDEX _email ON user:*->email
@itamarhaber / #RedisTLV / 22/9/2014

A Little About Myself
A Redis Geek and Chief Developers Advocate at .com

I write at http://redislabs.com/blog and edit the Redis Watch newsletter at http://redislabs.com/redis-watch-archive

Motivation
- Redis is a Key-Value datastore -> fetching by (primary) key is (always) fast
- Searching for keys is expensive - SCAN (or, god forbid, the "evil" KEYS command)
- Searching for values in keys requires a full (hash) table scan & sending the data to the client for processing

https://twitter.com/antirez/status/507082534513963009

antirez is Right
- Redis is a "database SDK"
- Indices imply some kind of schema (and there's none in Redis)
- Redis wasn't made for indexing...

But despite the Creator's humble opinion, sometimes you still need a fast way to search :)

So What is an Index?
"A database index is a data structure that improves the speed of data retrieval operations"
Wikipedia, 2014

Space-Time Tradeoff

What Can be Indexed?
Data:  Key -> Value
Index: Value -> Key

- Values can be numbers or strings
- Can be derived from "opaque" values: JSONs, data structures (e.g. Hash), functions, ...

Index Operations Checklist
1. Create index from existing data
2. Update the index on
   a. Addition of new values
   b. Updates of existing values
   c. Deletion of keys (and also RENAME/MIGRATE)
3. Drop the index
4. If needed do index housekeeping
5. Access keys using the index

A Simple Example: Reverse Lookup
Assume the following database, where every user has a single unique email address:

HMSET user:1 id "1" email "[email protected]"

How would you go about efficiently fetching the user's ID given an email address?

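Reverse Lookup (Pseudo) Recipe
    import redis

    # Assumes a local Redis server; decode_responses gives str instead of bytes
    r = redis.Redis(decode_responses=True)

    def idxEmailAdd(email, id):  # 2.a
        # SETNX only succeeds when the index entry doesn't exist yet
        if not r.setnx("_email:" + email, id):
            raise Exception("INDEX_EXISTS")

    def idxEmailCreate():  # 1
        # Walk all user keys with SCAN and index each one
        for u in r.scan_iter("user:*"):
            id, email = r.hmget(u, "id", "email")
            idxEmailAdd(email, id)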

Reverse Lookup Recipe, more admin
    def idxEmailDel(email):  # 2.c
        r.delete("_email:" + email)

    def idxEmailUpdate(old, new, id):  # 2.b
        idxEmailDel(old)
        idxEmailAdd(new, id)

    def idxEmailDrop(): ...  # similar to Create

Reverse Lookup Recipe, integration
    def addUser(json):
        ...
        idxEmailAdd(email, id)
        ...

    def updateUser(json):
        ...

Reverse Lookup Recipe, usage
    def getUser(id):
        return r.hgetall("user:" + id)

    def getUserByEmail(email):  # 5
        return getUser(r.get("_email:" + email))

TA-DA!

Reverse Lookup Recipe, Analysis
Asymptotic computational complexity:
- Creating the index: O(N), N is no. of values
- Adding a new value to the index: O(1)
- Deleting a value from the index: O(1)
- Updating a value: O(1) + O(1) = O(1)
- Deleting the index: O(N), N is no. of values
What about memory? Every key in Redis takes up some extra space...

Hash Index
    _email = { "[email protected]": 1, "[email protected]": 2 ... }

- Small lookups (e.g. countries) -> a single key
- Big lookups -> partitioned into "buckets" (e.g. by email address hash value)

More info: http://redis.io/topics/memory-optimization
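The bucketing idea can be sketched roughly as follows. This is only an illustration, not the talk's implementation: the helper names, the use of CRC32 and the bucket count of 1024 are assumptions, and `r` is assumed to be a redis-py client created with `decode_responses=True`.

```python
import zlib

BUCKETS = 1024  # assumed bucket count; pick it so every Hash stays small


def email_bucket(email: str, buckets: int = BUCKETS) -> str:
    """Return the name of the Hash key (bucket) that indexes this email."""
    digest = zlib.crc32(email.encode("utf-8"))
    return "_email:%d" % (digest % buckets)


def idx_email_add(r, email: str, user_id: str) -> None:
    # HSETNX refuses to overwrite, which preserves the 1:1 assumption.
    if not r.hsetnx(email_bucket(email), email, user_id):
        raise Exception("INDEX_EXISTS")


def idx_email_get(r, email: str):
    # Returns the indexed user id, or None if the email isn't indexed.
    return r.hget(email_bucket(email), email)
```

Every lookup still costs a single O(1) command, while the number of top-level keys is capped at the bucket count.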

Always Remember That You Are Absolutely Unique (Just Like Everyone Else)

Uniqueness
The lookup recipe makes the assumption that every user has a single email address and that it's unique (i.e. 1:1 relationship).

What happens if several keys (users) have the same indexed value (email)?

Non-Uniqueness with Lists
Use lists instead of using Redis' strings/hashes. To add:
    r.lpush("_email:" + email, id)  # 2.a

Simple. What about accessing the list for writes or reads? Naturally, getting all of the list's members is O(N) but...

What?!? WTF do you mean O(N)?!?
Because a Redis List is essentially a linked list, traversing it requires up to N operations (LINDEX, LRANGE). That means that updates & deletes are O(N)

Conclusion: suitable when N (i.e. number of duplicate index entries) is smallish (e.g. < 10)

OT: A Tip for Traversing Lists
Lists don't have LSCAN, but with RPOPLPUSH you can easily do a circular list pattern and go over all the members in O(N) w/o copying the entire list.

More at: http://redis.io/commands/rpoplpush
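A minimal sketch of the circular list pattern, assuming `r` is a redis-py client; the function name is made up for illustration. It also assumes nobody else pushes to or pops from the list while it is being walked.

```python
def walk_circular(r, key):
    """Yield every member of the List at `key` once, rotating it in place."""
    for _ in range(r.llen(key)):
        # RPOPLPUSH with the same source and destination moves the tail
        # element to the head and returns it, so after a full pass the
        # list is back in its original order.
        yield r.rpoplpush(key, key)
```

Each step is O(1) and only one member travels over the wire at a time.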

Back to Non-Uniqueness - Hashes
Use Hashes to store multiple index values:
    r.hset("_email:" + email, id, "")  # 2.a (the field's value is unused)

Great - still O(1). How about deleting?
    r.hdel("_email:" + email, id)  # 2.c

Another O(1).

Non-Uniqueness, Sets Variant
    r.sadd("_email:" + email, id)  # 2.a

Great - still O(1). How about deleting?
    r.srem("_email:" + email, id)  # 2.c

Another O(1).

List vs. Hash vs. Set for NUIVs*
* Non-Unique Index Value
- Memory: List ~= Set ~= Hash (N < 100)
- Performance: List < Set, Hash
- Unlike a List's elements, Set members and Hash fields are:
  - Unique - meaning you can't index the same key more than once (makes sense).
  - Unordered - a non-issue for this type of index.
  - SCANable
Forget Lists, use Sets or Hashes.

Forget Hashes, Sets are Better
Because of the Set operations: SUNION, SDIFF, SINTER

Endless possibilities, including matchmaking:
    SINTER _interest:devops _hair:blond _gender:...
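A Set-per-value index and an intersection query might look like the sketch below. The function names and the `_<attribute>:<value>` key layout are assumptions modelled on the key names in the example; `r` is assumed to be a redis-py client.

```python
def idx_add(r, attr, value, key_id):
    # One Set per indexed value, e.g. _interest:devops -> {ids}; O(1)
    r.sadd("_%s:%s" % (attr, value), key_id)


def idx_del(r, attr, value, key_id):
    # Removing a single id from the index is O(1) as well
    r.srem("_%s:%s" % (attr, value), key_id)


def find_all(r, **criteria):
    """Return the ids that match every attr=value pair (SINTER)."""
    keys = ["_%s:%s" % (a, v) for a, v in sorted(criteria.items())]
    return r.sinter(keys)
```

For instance, `find_all(r, interest="devops", hair="blond")` intersects the two Sets on the server and returns only the matching ids.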

[This Slide has No Title]
NULL means no value and Redis is all about values.

When needed, arbitrarily decide on a value for NULLs (e.g. "") and handle it appropriately in code.

Index Cardinality (~= unique values)
- High cardinality/no duplicates -> use a Hash
- Some duplicates -> use Hash and "pointers" to Sets
    _email = { "[email protected]": 1, "[email protected]": "*" ...}
    _email:[email protected] = { 2, 3 }
- Low cardinality is, however, another story...
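The Hash-with-pointers layout could be maintained along these lines. This is a sketch under assumptions: the function names are invented, `r` is a redis-py client created with `decode_responses=True`, and the read-then-write sequence is not atomic (a real implementation would probably wrap it in a transaction or a script).

```python
POINTER = "*"  # marker stored in the Hash meaning "look in the Set instead"


def idx_add(r, email, user_id):
    current = r.hget("_email", email)
    if current is None:
        r.hset("_email", email, user_id)        # first (unique) value
    elif current == POINTER:
        r.sadd("_email:" + email, user_id)      # already known duplicate
    elif current != user_id:
        # Second value: move both ids to a Set, leave a pointer behind
        r.sadd("_email:" + email, current, user_id)
        r.hset("_email", email, POINTER)


def idx_get(r, email):
    """Return the set of user ids indexed under this email."""
    current = r.hget("_email", email)
    if current is None:
        return set()
    if current == POINTER:
        return set(r.smembers("_email:" + email))
    return {current}
```

Unique values cost one Hash field each, and only the duplicated ones pay for an extra key.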

Low Cardinality
When an indexed attribute has a small number of possible values (e.g. Boolean, gender...):
- If distribution of values is 50:50, consider not indexing it at all
- If distribution is heavily unbalanced (5:95), index only the smaller subsets, full scan rest
- Use a bitmap index if possible

Bitmap Index
Assumption: key names are ordered
How: a Bitset where a bit's position maps to a key and the bit's value is the indexed value:

    _isLoggedIn = /100/
    first bit -> dfucbitz is online
    second bit -> foo isn't logged in

Bitmap Index, cont.
More than 2 values? Use n Bitsets, where n is the number of possible indexed values, e.g.:
    _isFromTerah = /100.../
    _isFromEarth = /010.../

Bonus: BITOP AND / OR / XOR / NOT
    BITOP NOT _ET _isFromEarth
    BITOP AND onlineET _isLoggedIn _ET
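To see what these commands compute, here is a pure-Python model of the bitsets, with no Redis server involved. The helper names are made up; the bit ordering mimics SETBIT/GETBIT, where offset 0 is the most significant bit of the first byte.

```python
def setbit(bits: bytearray, offset: int, value: int) -> None:
    """Model of SETBIT: grow the string as needed, then set or clear a bit."""
    byte, bit = divmod(offset, 8)
    if byte >= len(bits):
        bits.extend(b"\x00" * (byte + 1 - len(bits)))
    mask = 0x80 >> bit
    if value:
        bits[byte] |= mask
    else:
        bits[byte] &= ~mask & 0xFF


def getbit(bits: bytearray, offset: int) -> int:
    """Model of GETBIT: bits beyond the end of the string read as 0."""
    byte, bit = divmod(offset, 8)
    if byte >= len(bits):
        return 0
    return (bits[byte] >> (7 - bit)) & 1


def bitop_not(a: bytearray) -> bytearray:
    # Flips every bit of every byte, including unused padding bits.
    return bytearray(x ^ 0xFF for x in a)


def bitop_and(a: bytearray, b: bytearray) -> bytearray:
    n = max(len(a), len(b))  # the shorter operand is zero-padded
    return bytearray(x & y for x, y in zip(a.ljust(n, b"\x00"), b.ljust(n, b"\x00")))


is_logged_in = bytearray()
is_from_earth = bytearray()
setbit(is_logged_in, 0, 1)    # /100/ : first user is online
setbit(is_from_earth, 1, 1)   # /010/ : second user is from Earth

et = bitop_not(is_from_earth)           # BITOP NOT _ET _isFromEarth
online_et = bitop_and(is_logged_in, et) # BITOP AND onlineET _isLoggedIn _ET
```

Note that NOT also sets the padding bits past the last real user, so positions that map to no key should be ignored or masked out.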

Interlude: Redis Indices Save Space
Consider the following: in a relational database you need "x2" space: for the indexed data (stored in a table) and for the index itself.

With most Redis indices, you don't have to store the indexed data -> space saved :)

Numerical Ranges with Sorted Sets
Numerical values, including timestamps (epoch), are trivially indexed with a Sorted Set:
    ZADD _yearOfBirth 1972 "1" 1961 "2"...
    ZADD _lastLogin 1411245569 "1"

Use ZRANGEBYSCORE and ZREVRANGEBYSCORE for range queries

Ordered "Composite" Numerical Indices
Use Sorted Sets scores that are constructed by the sort (range) order. Store two values in one score using the integer and fractional parts:

    user:1 = { "id": "1", "weightKg": "82", "heightCm": "218", ... }
    score = weightKg + ( heightCm / 1000 )

"Composite" Numerical Indices, cont.
For more "complex" sorts (up to 53 bits of precision), you can construct the score like so:

    user:1 = { "id": "1", "weightKg": "82", "heightCm": "218", "IQ": "100", ... }
    score = weightKg * 1000000 + heightCm * 1000 + IQ

Adapted from: http://www.dr-josiah.com/2013/10/multi-column-sql-like-sorting-in-redis.html
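A small sketch of packing and unpacking such a composite score. The function names and the bounds checks are assumptions added for illustration; the three-digit limit on the lower fields follows from the multipliers in the formula, and the 2**53 limit from the precision of a double.

```python
def pack_score(weight_kg: int, height_cm: int, iq: int) -> int:
    """Combine three columns into one score that sorts by weight, height, IQ."""
    if not (0 <= height_cm < 1000 and 0 <= iq < 1000):
        raise ValueError("height_cm and iq must each fit in three digits")
    score = weight_kg * 1_000_000 + height_cm * 1_000 + iq
    if not (0 <= score < 2 ** 53):
        raise ValueError("score would not be exactly representable as a double")
    return score


def unpack_score(score) -> tuple:
    """Recover (weight_kg, height_cm, iq) from a packed score."""
    rest, iq = divmod(int(score), 1000)
    weight_kg, height_cm = divmod(rest, 1000)
    return weight_kg, height_cm, iq
```

With the values from the example, `pack_score(82, 218, 100)` gives 82218100, and sorting by that score is the same as sorting by the (weight, height, IQ) tuple.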

Full Text Search (Almost) (v2.8.9+)
ZRANGEBYLEX on Sorted Set members that have the same score is handy for suffix wildcard searches, e.g. dfuc*, a la autocomplete: http://autocomplete.redis.io/

Tip: by storing the reversed string (gnirts) you can also do prefix wildcard searches, e.g. *terah.net, just as easily.
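The lexicographic range idea, including the reversed-string trick, can be modelled in plain Python with a sorted list standing in for a Sorted Set whose members all share one score. The function names and sample members are hypothetical.

```python
import bisect


def lex_prefix(sorted_members, prefix):
    """Return the members starting with `prefix` (models a ZRANGEBYLEX range)."""
    # All strings sharing a prefix are contiguous in lexicographic order,
    # starting at the first member that is >= the prefix itself.
    start = bisect.bisect_left(sorted_members, prefix)
    matches = []
    for member in sorted_members[start:]:
        if not member.startswith(prefix):
            break
        matches.append(member)
    return matches


def lex_suffix(sorted_reversed_members, suffix):
    """Search an index of reversed strings for members ending with `suffix`."""
    hits = lex_prefix(sorted_reversed_members, suffix[::-1])
    return [h[::-1] for h in hits]


names = sorted(["dfucbitz", "dfuchsia", "foo", "bar"])
emails = ["dfucbitz@terah.net", "foo@example.com", "bar@terah.net"]
reversed_emails = sorted(e[::-1] for e in emails)
```

Here `lex_prefix(names, "dfuc")` plays the role of the dfuc* query, and `lex_suffix(reversed_emails, "terah.net")` the role of *terah.net.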

Another Nice Thing With Sorted Sets
By combining the use of two of these, it is possible to map ranges to keys (or just data). For example, what is 5?

    ZADD min 1 "low" 4 "medium" 7 "high"
    ZADD max 3 "low" 6 "medium" 9 "high"

    ZREVRANGEBYSCORE min 5 -inf LIMIT 0 1
    ZRANGEBYSCORE max 5 +inf LIMIT 0 1

Both queries return "medium", so 5 is medium.
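The two-Sorted-Set range lookup boils down to finding the range with the largest lower bound that is not above the value and the range with the smallest upper bound that is not below it, then checking that they agree. A pure-Python model, with sorted lists standing in for the Sorted Sets and a hypothetical `classify` helper:

```python
import bisect

# (score, member) pairs, kept sorted by score like a Sorted Set
MIN = [(1, "low"), (4, "medium"), (7, "high")]
MAX = [(3, "low"), (6, "medium"), (9, "high")]


def classify(value, mins=MIN, maxs=MAX):
    """Return the range label containing `value`, or None if it is in a gap."""
    min_scores = [score for score, _ in mins]
    max_scores = [score for score, _ in maxs]
    i = bisect.bisect_right(min_scores, value) - 1  # largest min <= value
    j = bisect.bisect_left(max_scores, value)       # smallest max >= value
    if i < 0 or j >= len(maxs):
        return None  # below the first range or above the last one
    below, above = mins[i][1], maxs[j][1]
    return below if below == above else None
```

Comparing the two answers is what catches values that fall between ranges, such as 3.5 in this data.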

What is

Binary Trees
Everybody knows that binary trees are really useful for searching and other stuff.
You can store a binary tree as an array in a Sorted Set:
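The original illustration is not available here, so the following is only one plausible reading: the usual array layout, where the node at index i has its children at 2i+1 and 2i+2, with the array index used as the Sorted Set score. A dict of index -> value stands in for the Sorted Set, and both function names are made up.

```python
def bst_to_array(sorted_values):
    """Lay out a balanced binary search tree in array form: {index: value}."""
    tree = {}

    def build(lo, hi, i):
        if lo >= hi:
            return
        mid = (lo + hi) // 2
        tree[i] = sorted_values[mid]   # with Redis: ZADD tree i value
        build(lo, mid, 2 * i + 1)      # left child
        build(mid + 1, hi, 2 * i + 2)  # right child

    build(0, len(sorted_values), 0)
    return tree


def bst_contains(tree, value):
    """Walk from the root (index 0) down to a leaf, one node per step."""
    i = 0
    while i in tree:                   # with Redis: ZRANGEBYSCORE tree i i
        if value == tree[i]:
            return True
        i = 2 * i + 1 if value < tree[i] else 2 * i + 2
    return False
```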

(Happy 80th Birthday!)

Why stop at binary trees? BTrees!
@thinkingfish from Twitter explained that they took the BSD implementation of BTrees and welded it into Redis (open source rulez!). This allows them to do efficient (speed-wise, not memory) key and range lookups.

http://highscalability.com/blog/2014/9/8/how-twitter-uses-redis-to-scale-105tb-ram-39mm-qps-10000-ins.html

Index Atomicity & Consistency
In a relational database the index is (hopefully) always in sync with the data.

You can strive for that in Redis, but:
- Your code will be much more complex
- Performance will suffer
- There will be bugs/edge cases/extreme uses

The Opposite of Atomicity & Consistency
On the other extreme, you could consider implementing indexing with a:
- Periodic process (lazy indexing)
- Producer/Consumer pattern (i.e. queue)
- Keyspace notifications

You won't have any guarantees, but you'll be offloading the index creation from the app.
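As one possible shape for the producer/consumer option, the app could push index changes onto a List and a separate worker could apply them later. Everything named here (the queue key, the JSON message format, the function names) is an assumption for illustration; `r` is assumed to be a redis-py client.

```python
import json

QUEUE = "_idx:email:queue"  # hypothetical name for the work queue List


def enqueue_change(r, op, email, user_id=None):
    """Producer side: the app only records what changed (cheap, O(1))."""
    r.lpush(QUEUE, json.dumps({"op": op, "email": email, "id": user_id}))


def drain_queue(r):
    """Consumer side: apply pending changes in FIFO order, return the count."""
    applied = 0
    while True:
        raw = r.rpop(QUEUE)
        if raw is None:
            return applied
        change = json.loads(raw)
        if change["op"] == "add":
            r.set("_email:" + change["email"], change["id"])
        elif change["op"] == "del":
            r.delete("_email:" + change["email"])
        applied += 1
```

Until the worker catches up the index lags behind the data, which is exactly the missing guarantee mentioned above.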

Indices, Lua & Clustering
Server-side scripting is an obvious consideration for implementing a lot (if not all) of the indexing logic. But ...

in a cluster setup, a script runs on a single shard and can only access the keys there -> no guarantee that a key and an index are on the same shard.

Don't Think Copy-Paste!
For even more "inspiration" you can review the source code of popular ORM libraries for Redis, for example:
https://github.com/josiahcarlson/rom
https://github.com/yohanboniface/redis-limpyd
