Real-time Cassandra

Richard [email protected]

@richardalow

Outline

• What is real-time?

• How do databases implement real-time queries?

• Why is Cassandra ideal for real-time applications?

• Writing real-time applications with Cassandra

What is real-time?

“Of or relating to a system in which input data is processed within milliseconds” dictionary.com

“Occurring immediately” webopedia

“Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” wikipedia

“...the most important requirement of a real-time system is predictability and not performance” wikipedia

“...a time frame that is very brief, appearing to be immediate.” wisegeek.com

Real-time queries

• ‘Give me X’

• ‘How many Y?’

• ‘What is the top K?’

• ‘How many distinct Z from P?’

Real-time definition

• Definition: a query is processed in real-time if the time to get the answer is at most a constant times the sum of the transfer time and the round-trip time

$t_{\text{response}} \le C\,(t_{\text{transfer}} + t_{\text{ping}})$

Real-time definition

• The more you ask for, the longer it takes

• For small queries, the request is dominated by round-trip time

• No query can complete in less time than it takes to receive the response

Real-time definition

• Users on faster networks expect a faster response

• What we mean by real-time is getting faster

Implications

• What does this mean for the database?

• Use a Google Analytics example

• Simple query: ‘How many page views have there been from France in the last 24 hours?’

Requirement

• Response is one number

• With overhead, say ~1KB

• Ping time 1ms

• 10Mbit connection => 1KB in ~1ms

• 2ms total
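The 2ms figure above can be reproduced with a few lines of arithmetic. This is a minimal sketch, assuming 1KB means 1000 bytes and taking the constant C as 1; the function name `response_budget_ms` is made up for illustration.

```python
def response_budget_ms(size_bytes, bandwidth_bits_per_s, ping_ms, c=1.0):
    """Upper bound on response time, C * (t_transfer + t_ping), in milliseconds."""
    transfer_ms = size_bytes * 8 / bandwidth_bits_per_s * 1000
    return c * (transfer_ms + ping_ms)


# 1KB response (assumed 1000 bytes), 10Mbit/s link, 1ms ping, C = 1 (assumed)
budget = response_budget_ms(size_bytes=1000,
                            bandwidth_bits_per_s=10_000_000,
                            ping_ms=1.0)
print(f"{budget:.1f} ms")
```

The transfer takes about 0.8ms, so the total comes to roughly 1.8ms, which the slide rounds to 2ms.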

Solution 1

• grep '\.fr' /var/log/apache2/*.log

• Suppose you have 1M hits an hour => 7GB of logs a day

• Single disk would take 70s

• Need a beefy server to do this

• Needs to grow as your audience grows
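The numbers in Solution 1 can be checked with a rough calculation. This sketch assumes a sequential read rate of 100MB/s, which is what the 70s figure implies; the function and variable names are illustrative only.

```python
def scan_seconds(log_bytes, sequential_bytes_per_s):
    """Time to read the whole log once from a single disk."""
    return log_bytes / sequential_bytes_per_s


HITS_PER_DAY = 1_000_000 * 24          # 1M hits an hour
LOG_BYTES_PER_DAY = 7 * 10**9          # 7GB of logs a day
DISK_BYTES_PER_S = 100 * 10**6         # assumed: 100MB/s sequential read

bytes_per_hit = LOG_BYTES_PER_DAY / HITS_PER_DAY
scan = scan_seconds(LOG_BYTES_PER_DAY, DISK_BYTES_PER_S)
scan_after_growth = scan_seconds(2 * LOG_BYTES_PER_DAY, DISK_BYTES_PER_S)

print(f"{bytes_per_hit:.0f} bytes per hit, {scan:.0f}s per scan")
```

Doubling the traffic doubles the scan time, which is why the server has to grow with the audience.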

Solution 2

• Maintain a counter for each country

• Increment the counter on each hit

• On query just read the counter

• Maybe it is on disk - 5ms seek

• No need to scale speed with traffic

Implications

• Real-time queries can only read about as much data as they send to the requester

• Need to precompute answers

• Store data in a query-centric rather than data-centric view

Age of data

• A real-time query will often need to query new data

• But not necessarily

• Could run a batch process to pre-compute answers

Solutions

• How do you make sure you don’t read any more than you have to?

• Denormalisation

• Organisation of data

• Counters

• Hard drive performance constraints:

• Sequential IO at 100s of MB/s

• Seek at 100 IO/s

• Avoid random IO

• Effective block size 1MB
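The 1MB effective block size follows from the two disk figures above: a seek costs as much time as reading about 1MB sequentially. The sketch below assumes 100MB/s (the low end of "100s of MB/s") and compares reading small items that are scattered against items stored next to each other. The item count and size are hypothetical.

```python
SEQ_BYTES_PER_S = 100 * 10**6   # assumed: 100MB/s sequential IO
SEEKS_PER_S = 100               # 100 IO/s, i.e. 10ms per seek


def effective_block_bytes(seq=SEQ_BYTES_PER_S, seeks=SEEKS_PER_S):
    """Bytes that could be read sequentially in the time one seek takes."""
    return seq / seeks


def read_seconds(n_items, item_bytes, contiguous):
    """Seek time plus transfer time for n_items of item_bytes each."""
    n_seeks = 1 if contiguous else n_items
    return n_seeks / SEEKS_PER_S + n_items * item_bytes / SEQ_BYTES_PER_S


block = effective_block_bytes()
scattered = read_seconds(100, 1000, contiguous=False)
together = read_seconds(100, 1000, contiguous=True)
print(f"block={block:.0f}B scattered={scattered:.3f}s together={together:.3f}s")
```

Under these assumptions the scattered read is dominated by seeks (about a second), while the contiguous read costs little more than a single seek.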

Denormalisation

• Store items accessed at similar times near to each other

• Involves copying

• Copying isn’t bad

• Storage costs <$100 per TB

Organisation of data

• If you read 100 items off disk, ensure they are next to each other

• Saves reading extra data around them and index lookups

Fast range queries

• Get me all keys in the range E to I

[Figure: sorted keys A F H I M X, with the range [E, I] covering F, H and I]
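When keys are kept in sorted order, a range query reduces to two binary searches and one contiguous slice. This is a minimal in-memory sketch of that idea using Python's `bisect` module; `range_query` is an illustrative name, not part of any database API.

```python
from bisect import bisect_left, bisect_right


def range_query(sorted_keys, low, high):
    """Return all keys k with low <= k <= high from a sorted list."""
    start = bisect_left(sorted_keys, low)    # first position with key >= low
    end = bisect_right(sorted_keys, high)    # first position with key > high
    return sorted_keys[start:end]


keys = ["A", "F", "H", "I", "M", "X"]
result = range_query(keys, "E", "I")
print(result)  # ['F', 'H', 'I']
```

The result is one contiguous run, which on disk corresponds to a single sequential read.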

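Fast range queries

• What happens when you insert?

[Figure: inserting Q and G into the keys A F H I M X, and the effect on the range query [E, I]: keys kept in one sorted run vs keys spread over separate blocks (A F, H I M, X, G, Q)]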

Counters

• For queries that simply count, increment the counter

• Implement inc, dec, get

• Store multiple counts e.g. week, day, hour
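The counter interface above can be sketched in memory as follows. The class name, key layout and bucket formats are assumptions for illustration and do not describe how Cassandra stores counters. Each increment updates one counter per granularity, so a query reads exactly one value.

```python
from collections import defaultdict
from datetime import datetime

# One bucket label per granularity (illustrative formats).
BUCKETS = {
    "hour": lambda t: t.strftime("%Y-%m-%dT%H"),
    "day": lambda t: t.strftime("%Y-%m-%d"),
    "week": lambda t: "{}-W{:02d}".format(*t.isocalendar()[:2]),
}


class BucketedCounter:
    """Counters with inc, dec and get, kept per hour, day and week."""

    def __init__(self):
        self._counts = defaultdict(int)

    def inc(self, name, when, amount=1):
        for granularity, bucket in BUCKETS.items():
            self._counts[(name, granularity, bucket(when))] += amount

    def dec(self, name, when, amount=1):
        self.inc(name, when, -amount)

    def get(self, name, granularity, when):
        key = (name, granularity, BUCKETS[granularity](when))
        return self._counts.get(key, 0)


counter = BucketedCounter()
counter.inc("views:fr", datetime(2013, 3, 4, 10, 15))
counter.inc("views:fr", datetime(2013, 3, 4, 11, 5))
counter.inc("views:fr", datetime(2013, 3, 5, 9, 0))
counter.dec("views:fr", datetime(2013, 3, 5, 9, 0))
```

The cost of the extra counters is paid on each write; reads stay constant regardless of how many events were counted.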

Cassandra and real-time

• Write optimised

• Fast merging

• Distributed counters

Write optimised

• All writes are sequential on disk

• Each write is written multiple times during compactions

Fast merging

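[Figure: separate blocks A F, H I M, X, Q, G on one side; a sorted run A F H I M X plus the new keys Q G on the other]

How do we get from this to this?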

Fast merging

• Write out new ordered SSTable

• When big enough, merge with existing

[Figure: the existing ordered SSTable A F H I M X next to newly written keys (Q G, later Z Q K G F B), merged into one ordered SSTable]

How fast?
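The merge step described above can be sketched with Python's `heapq.merge`, which combines already-sorted inputs in a single sequential pass. This is a simplified model that handles keys only; a real SSTable merge also has to reconcile values and timestamps, which is not shown. The name `merge_sstables` is illustrative.

```python
import heapq


def merge_sstables(*runs):
    """Merge sorted runs of keys into one sorted run, dropping duplicate keys."""
    merged = []
    for key in heapq.merge(*runs):
        if not merged or merged[-1] != key:
            merged.append(key)
    return merged


existing = ["A", "F", "H", "I", "M", "X"]
new_run = sorted(["Q", "G"])          # new keys are written out in order first
merged = merge_sstables(existing, new_run)
print(merged)  # ['A', 'F', 'G', 'H', 'I', 'M', 'Q', 'X']
```

Because every input is already ordered, the merge reads each run once from start to end and writes the output once, so the work is linear in the total number of keys and all of the IO is sequential.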

Distributed counters

• Distributed, fault tolerant replicated counters

• No need for distributed locks

• Super fast

Other requirements

What else do we need?

• Real-time analytics

• High cost if service is down => need high availability

• High value in getting a quick response => need low latency, with data geographically close

Cassandra and HA

• No SPOF

• Choose point on consistency and availability curve

• Tuneable consistency

• Replication

• Multi data-centre support
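The consistency and availability trade-off above can be made concrete with the usual quorum arithmetic: with N replicas, a read that contacts R replicas is guaranteed to overlap a write acknowledged by W replicas when R + W > N. The helper names below are illustrative and are not a Cassandra API.

```python
def quorum(replication_factor):
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def reads_see_latest_write(n, w, r):
    """True if every read set of r replicas overlaps every write set of w replicas."""
    return r + w > n


N = 3
q = quorum(N)

strong = reads_see_latest_write(N, w=q, r=q)   # quorum writes and quorum reads
fast = reads_see_latest_write(N, w=1, r=1)     # single-replica writes and reads
print(q, strong, fast)
```

Lower levels need fewer replicas to respond, which favours availability and latency; quorum levels favour consistency.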

Cassandra and low latency

• Can configure caches

• Can parallelise reads

• Multi-DC support enables world-wide replication

• Can choose lower consistency to avoid round-trips to other DCs

Writing real-time apps with Cassandra

Real-time apps

• Need to write code using a client library

• Design data-model

• If queries change, code changes

Acunu Analytics

• Provides simple RESTful interface to Cassandra counters

• Push processing into ingest phase

[Figure: event -> AA -> counter updates -> Cassandra]

Acunu Analytics

• Event template, e.g.,

    select : ["COUNT", "AVG(loadTime)"],
    type : {
      time : [TIME(HOUR; MIN; SEC), ?, 0],
      page : PATH(/),
      loadTime : [LONG, 0, 0]
    }

• Specifies “blow-up” strategy according to supported queries

• Need to know basics of query in advance, but not whole thing
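One way to picture a "blow-up" strategy is that each incoming event updates one counter per time bucket and per path prefix, so later queries read a single precomputed value. The sketch below is a hypothetical illustration of that idea only. It is not Acunu Analytics code, and the key layout, function names and bucket formats are all assumptions.

```python
from collections import defaultdict
from datetime import datetime

# Assumed time buckets, loosely mirroring TIME(HOUR; MIN; SEC) in the template.
TIME_FORMATS = ["%Y-%m-%dT%H", "%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S"]


def path_prefixes(path):
    """'/blog/a' -> ['/', '/blog', '/blog/a'] (assumed reading of PATH(/))."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def blow_up(event):
    """All (time bucket, path prefix) counter keys touched by one event."""
    return [(event["time"].strftime(fmt), prefix)
            for fmt in TIME_FORMATS
            for prefix in path_prefixes(event["page"])]


counts = defaultdict(int)
load_time_sums = defaultdict(int)


def ingest(event):
    """Do the work on ingest: update COUNT and the running sum for AVG."""
    for key in blow_up(event):
        counts[key] += 1
        load_time_sums[key] += event["loadTime"]


def avg_load_time(key):
    return load_time_sums[key] / counts[key]


ingest({"time": datetime(2013, 3, 4, 10, 15, 30), "page": "/blog/a", "loadTime": 100})
ingest({"time": datetime(2013, 3, 4, 10, 20, 0), "page": "/blog/b", "loadTime": 300})
hour_key = ("2013-03-04T10", "/blog")
```

In this sketch, the count and average load time for everything under /blog in a given hour are each a single lookup, at the cost of nine counter updates per event.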

Features

• Simple, real-time, incremental analytics

• work done on ingest

• sum, count, distinct, avg, stddev, min-max etc

• time + hierarchy bucketing

• efficient ‘group’ semantics

• works with Apache Cassandra

Summary

• Formalised what real-time means

• Deduced how data must be stored

• Explored how Cassandra has these properties

• Discussed how Acunu Analytics helps when writing real-time apps
