acunu
Talk given at Denormalised London, 2012-09-20. Discussion of what a real-time system needs to do and why Cassandra is a good fit.
Outline
• What is real-time?
• How do databases implement real-time queries?
• Why is Cassandra ideal for real-time applications?
• Writing real-time applications with Cassandra
What is real-time?
“Of or relating to a system in which input data is processed within milliseconds” dictionary.com
“Occurring immediately” webopedia
“Often real-time response times are understood to be in the order of milliseconds and sometimes microseconds” wikipedia
“...the most important requirement of a real-time system is predictability and not performance” wikipedia
“...a time frame that is very brief, appearing to be immediate.” wisegeek.com
Real-time queries
• ‘Give me X’
• ‘How many Y?’
• ‘What is the top K?’
• ‘How many distinct Z from P?’
Real-time definition
• Definition: a query is processed in real-time if the time to get the answer is at most a constant times the transfer time plus the round-trip time
$t_{\text{response}} \le C\,(t_{\text{transfer}} + t_{\text{ping}})$
Real-time definition
• The more you ask for, the longer it takes
• For small queries, request dominated by round trip time
• No query can take less time than the time to receive it
Real-time definition
• Users on faster networks expect a faster response
• What we mean by real-time is getting faster
Implications
• What does this mean for the database?
• Use Google Analytics example
• Simple query: ‘How many page views have there been from France in the last 24 hours?’
Requirement
• Response is one number
• With overhead, say ~1KB
• Ping time 1ms
• 10Mbit connection => 1KB in ~1ms
• 2ms total
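The budget above can be checked with a few lines of arithmetic. This is a minimal sketch of the definition t_response <= C (t_transfer + t_ping); the function names are illustrative, 1KB is taken as 1000 bytes, and C = 1 is an assumption made only for the example.

```python
def transfer_time_s(payload_bytes: int, bandwidth_bits_per_s: float) -> float:
    """Time to push the payload through the link, ignoring protocol overhead."""
    return payload_bytes * 8 / bandwidth_bits_per_s


def response_budget_s(payload_bytes: int, bandwidth_bits_per_s: float,
                      ping_s: float, c: float = 1.0) -> float:
    """Upper bound on response time: C * (t_transfer + t_ping)."""
    return c * (transfer_time_s(payload_bytes, bandwidth_bits_per_s) + ping_s)


# Numbers from the slide: ~1KB response, 10Mbit connection, 1ms ping.
budget = response_budget_s(1_000, 10e6, 0.001)
print(f"{budget * 1000:.1f} ms")  # 1.8 ms, which the slide rounds to 2ms
```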
Solution 1
• grep *.fr /var/log/apache2/*.log
• Suppose you have 1M hits an hour => ~7GB of logs a day
• A single disk would take ~70s to scan them
• Need a beefy server to do this
• Needs to grow as your audience grows
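A rough check of the scan cost, as a sketch: the 100MB/s sequential read rate is an assumption (the talk later quotes sequential IO at 100s of MB/s), and the variable names are illustrative.

```python
HITS_PER_DAY = 1_000_000 * 24        # 1M hits an hour
LOG_BYTES_PER_DAY = 7e9              # ~7GB of logs a day
SEQ_READ_BYTES_PER_S = 100e6         # assumed single-disk sequential rate

bytes_per_hit = LOG_BYTES_PER_DAY / HITS_PER_DAY
scan_seconds = LOG_BYTES_PER_DAY / SEQ_READ_BYTES_PER_S

print(f"{bytes_per_hit:.0f} bytes per log line")   # roughly 292
print(f"{scan_seconds:.0f} s to scan one day")     # 70
```

The scan time is proportional to traffic, which is why this approach needs hardware that grows with the audience.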
Solution 2
• Maintain a counter for each country
• Increment the counter on each hit
• On query just read the counter
• Maybe it is on disk - 5ms seek
• No need to scale speed with traffic
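A minimal in-memory sketch of the counter approach. The class and method names are hypothetical, and it deliberately ignores the "last 24 hours" window, which needs time-bucketed counters.

```python
from collections import Counter


class CountryHitCounter:
    """Keeps one running total per country; reads never touch the raw log."""

    def __init__(self) -> None:
        self._hits = Counter()

    def record_hit(self, country: str) -> None:
        # Work is done once, at ingest time.
        self._hits[country] += 1

    def page_views(self, country: str) -> int:
        # Query cost is a single lookup, independent of traffic volume.
        return self._hits[country]


hits = CountryHitCounter()
for country in ["fr", "uk", "fr", "fr"]:
    hits.record_hit(country)
print(hits.page_views("fr"))  # 3
```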
Implications
• Real-time queries can only read about as much data as they send to the requester
• Need to precompute answers
• Store data in a query-centric rather than data-centric view
Age of data
• A real-time query will often need to query new data
• But not necessarily
• Could run a batch process to precompute answers
Solutions
• How do you make sure you don’t read any more than you have to?
• Denormalisation
• Organisation of data
• Counters
• Hard drive performance constraints:
• Sequential IO at 100s MB/s
• Seek at 100 IO/s
• Avoid random IO
• Effective block size 1MB
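The "effective block size" follows from the two disk figures: in the time of one seek the disk could have streamed about 1MB. The sketch below works that out and compares 100 scattered reads with 100 adjacent ones; the 100MB/s rate and the 1KB item size are assumptions for illustration.

```python
SEQ_BYTES_PER_S = 100e6   # assumed sequential throughput
SEEKS_PER_S = 100         # random IOs per second

# Data that could have been streamed in the time one seek takes.
effective_block_bytes = SEQ_BYTES_PER_S / SEEKS_PER_S


def scattered_read_s(n_items: int) -> float:
    """Each item costs one seek; transfer time is negligible by comparison."""
    return n_items / SEEKS_PER_S


def adjacent_read_s(n_items: int, item_bytes: int) -> float:
    """One seek, then a single sequential transfer of all items."""
    return 1 / SEEKS_PER_S + n_items * item_bytes / SEQ_BYTES_PER_S


print(effective_block_bytes)          # 1000000.0
print(scattered_read_s(100))          # 1.0
print(adjacent_read_s(100, 1_000))    # about 0.011
```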
Denormalisation
• Store items accessed at similar times near to each other
• Involves copying
• Copying isn’t bad
• Storage costs <$100 per TB
Organisation of data
• If you read 100 items off disk, ensure they are next to each other
• Saves reading extra data around them and index lookups
Fast range queries
• Get me all keys in the range E to I
[Figure: sorted keys A F H I M X, with the range [E, I] marked]
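When keys are kept in sorted order, a range query is two binary searches followed by one contiguous read. A minimal sketch using the keys from the slide (the function name is illustrative):

```python
from bisect import bisect_left, bisect_right


def range_query(sorted_keys: list, lo: str, hi: str) -> list:
    """Return all keys k with lo <= k <= hi from an already sorted list."""
    start = bisect_left(sorted_keys, lo)    # first position with key >= lo
    end = bisect_right(sorted_keys, hi)     # first position with key > hi
    return sorted_keys[start:end]           # one contiguous slice


keys = ["A", "F", "H", "I", "M", "X"]
print(range_query(keys, "E", "I"))  # ['F', 'H', 'I']
```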
Fast range queries
• What happens when you insert?
[Figure: the sorted keys A F H I M X with new keys Q and G inserted, comparing two layouts for answering the range [E, I]]
Counters
• For queries that simply count, increment the counter
• Implement inc, dec, get
• Store multiple counts e.g. week, day, hour
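A sketch of the inc/dec/get interface with one count per week, day and hour. The class, the bucket formats and the in-memory dictionary are all illustrative assumptions; a real deployment would keep these counts in the database.

```python
from datetime import datetime

# One bucket label per granularity; ISO week numbering is used for "week".
BUCKETS = {
    "hour": lambda t: t.strftime("%Y-%m-%dT%H"),
    "day": lambda t: t.strftime("%Y-%m-%d"),
    "week": lambda t: "{}-W{:02d}".format(*t.isocalendar()[:2]),
}


class BucketedCounter:
    def __init__(self) -> None:
        self._counts = {}

    def inc(self, name: str, when: datetime, amount: int = 1) -> None:
        # Each event updates every granularity, so reads stay a single lookup.
        for granularity, label in BUCKETS.items():
            key = (name, granularity, label(when))
            self._counts[key] = self._counts.get(key, 0) + amount

    def dec(self, name: str, when: datetime, amount: int = 1) -> None:
        self.inc(name, when, -amount)

    def get(self, name: str, granularity: str, when: datetime) -> int:
        return self._counts.get((name, granularity, BUCKETS[granularity](when)), 0)


views = BucketedCounter()
views.inc("fr", datetime(2012, 9, 20, 10, 15))
views.inc("fr", datetime(2012, 9, 20, 11, 0))
views.inc("fr", datetime(2012, 9, 21, 9, 0))
print(views.get("fr", "day", datetime(2012, 9, 20, 23, 59)))  # 2
```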
Cassandra and real-time
• Write optimised
• Fast merging
• Distributed counters
Write optimised
• All writes are sequential on disk
• Each write is written multiple times during compactions
Fast merging
[Figure: two on-disk layouts of the keys A F H I M X and the newly inserted keys Q and G]
• How do we get from one layout to the other?
Fast merging
• Write out new ordered SSTable
• When big enough, merge with existing
[Figure: new keys (Q G, later Z Q K G F B) are written out as a sorted SSTable and merged with the existing sorted SSTable A F H I M X]
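The two steps can be sketched as follows: sort the buffered writes into a new run, then merge it with the existing run in a single sequential pass. This is a toy model, not Cassandra's compaction code; `flush` and `merge_sstables` are hypothetical names, and an "SSTable" is just a sorted list of (key, value) pairs.

```python
import heapq


def flush(memtable: dict) -> list:
    """Write buffered updates out as a new run sorted by key."""
    return sorted(memtable.items())


def merge_sstables(older: list, newer: list) -> list:
    """Merge two sorted runs in one pass; on duplicate keys the newer value wins."""
    merged = []
    # heapq.merge is stable, so for equal keys the entry from `older` arrives first.
    for key, value in heapq.merge(older, newer, key=lambda kv: kv[0]):
        if merged and merged[-1][0] == key:
            merged[-1] = (key, value)   # overwrite with the newer entry
        else:
            merged.append((key, value))
    return merged


existing = flush({k: "old" for k in "AFHIMX"})
incoming = flush({k: "new" for k in "ZQKGFB"})
result = merge_sstables(existing, incoming)
print("".join(k for k, _ in result))  # ABFGHIKMQXZ
```

Both inputs are read front to back and the output is written front to back, so the merge uses only sequential IO.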
How fast?
Distributed counters
• Distributed, fault tolerant replicated counters
• No need for distributed locks
• Super fast
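One well-known way to get replicated counters without distributed locks is to give every replica its own sub-count that only it updates, and to merge replicas by taking the per-replica maximum (a PN-counter). The sketch below shows that technique under stated assumptions; it is not a description of Cassandra's internal counter implementation, and all names are hypothetical.

```python
class ShardedCounter:
    """Each replica only ever updates its own shard, so no locking is needed."""

    def __init__(self, replica_id: str) -> None:
        self.replica_id = replica_id
        self.incs = {}   # replica id -> total increments made by that replica
        self.decs = {}   # replica id -> total decrements made by that replica

    def inc(self, n: int = 1) -> None:
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + n

    def dec(self, n: int = 1) -> None:
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + n

    def merge(self, other: "ShardedCounter") -> None:
        # Shard totals only grow, so the larger value is always the newer one.
        for mine, theirs in ((self.incs, other.incs), (self.decs, other.decs)):
            for rid, total in theirs.items():
                mine[rid] = max(mine.get(rid, 0), total)

    def get(self) -> int:
        return sum(self.incs.values()) - sum(self.decs.values())


a, b = ShardedCounter("a"), ShardedCounter("b")
a.inc(3)
b.inc(2)
b.dec(1)
a.merge(b)
b.merge(a)
print(a.get(), b.get())  # 4 4
```

Merging is idempotent and order-independent, so replicas can exchange state in any order and still converge.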
Other requirements
What else do we need?
• Real-time analytics: high value in getting a quick response
• High cost if the service is down => need high availability
• High value in getting a quick response => need low latency => need data geographically close
Cassandra and HA
• No SPOF
• Choose point on consistency and availability curve
• Tuneable consistency
• Replication
• Multi data-centre support
Cassandra and low latency
• Can configure caches
• Can parallelise reads
• Multi-DC support enables world-wide replication
• Can choose lower consistency to avoid round-trips to other DCs
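The consistency trade-off can be illustrated with the usual replica-overlap rule: with N replicas, a read of R replicas is guaranteed to see the latest acknowledged write to W replicas when R + W > N. This is a sketch of that general rule only; the function names are illustrative and nothing here calls a Cassandra client.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def read_sees_latest_write(replication_factor: int, write_replicas: int,
                           read_replicas: int) -> bool:
    """True when every read set must overlap every write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                                          # 2
print(read_sees_latest_write(rf, quorum(rf), quorum(rf)))  # True
print(read_sees_latest_write(rf, 1, 1))                    # False: faster, may be stale
```

Reading and writing a single replica avoids waiting on distant nodes, at the cost of possibly stale answers.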
Writing real-time apps with Cassandra
Real-time apps
• Need to write code using a client library
• Design data-model
• If queries change, code changes
Acunu Analytics
• Provides simple RESTful interface to Cassandra counters
• Push processing into ingest phase
[Figure: event -> AA (Acunu Analytics) -> counter updates -> Cassandra]
Acunu Analytics
• Event template, e.g.,
select : ["COUNT", "AVG(loadTime)"],
type : { time : [TIME(HOUR; MIN; SEC), ?, 0], page : PATH(/), loadTime : [LONG, 0, 0] }
• Specifies “blow-up” strategy according to supported queries
• Need to know basics of query in advance, but not whole thing
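To make the "blow-up" idea concrete, here is a hypothetical sketch of how one event could fan out into counter updates at ingest time: one update per (time bucket, path prefix) pair, each holding a count and a running sum so that COUNT and AVG(loadTime) are single lookups. It is not the Acunu Analytics API; every function name and the key layout are assumptions.

```python
from datetime import datetime
from itertools import product


def time_buckets(when: datetime) -> list:
    """Hour, minute and second buckets, mirroring TIME(HOUR; MIN; SEC)."""
    return [when.strftime(fmt) for fmt in
            ("%Y-%m-%dT%H", "%Y-%m-%dT%H:%M", "%Y-%m-%dT%H:%M:%S")]


def path_prefixes(path: str) -> list:
    """'/blog/post1' -> ['/', '/blog', '/blog/post1'], mirroring PATH(/)."""
    parts = [p for p in path.split("/") if p]
    return ["/"] + ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def blow_up(event: dict) -> list:
    """All counter keys that a single event must update."""
    return list(product(time_buckets(event["time"]), path_prefixes(event["page"])))


def ingest(counters: dict, event: dict) -> None:
    for key in blow_up(event):
        count, total = counters.get(key, (0, 0))
        counters[key] = (count + 1, total + event["loadTime"])


def avg_load_time(counters: dict, key: tuple) -> float:
    count, total = counters[key]
    return total / count


counters = {}
ingest(counters, {"time": datetime(2012, 9, 20, 10, 15, 30), "page": "/blog/post1", "loadTime": 100})
ingest(counters, {"time": datetime(2012, 9, 20, 10, 45, 0), "page": "/blog/post2", "loadTime": 300})
print(avg_load_time(counters, ("2012-09-20T10", "/blog")))  # 200.0
```

Each event costs more writes, but any query that matches a precomputed bucket reads one counter.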
Features
• Simple, real-time, incremental analytics
• work done on ingest
• sum, count, distinct, avg, stddev, min-max etc
• time + hierarchy bucketing
• efficient ‘group’ semantics
• works with Apache Cassandra
Summary
• Formalised what real-time means
• Deduced how data must be stored
• Explored how Cassandra has these properties
• Discussed how Acunu Analytics helps when writing real-time apps