Presented at Athens Cassandra Users Group meetup http://www.meetup.com/Athens-Cassandra-Users/events/177040142/
C* @ Icon Platforms
Vassilis Bekiaris (@karbonized1)
Software Architect
Presentation outline
• Meet Cassandra
• CQL - Data modeling basics
• Counters & Time-series use case: Polls
Meet Cassandra
History
• Started at Facebook
• Historically builds on
• Dynamo for distribution: consistent hashing, eventual consistency
• BigTable for disk storage model
Amazon’s Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Google’s BigTable: http://research.google.com/archive/bigtable.html
Cassandra is
• A distributed database written in Java
• Scalable
• Masterless, no single point of failure
• Tunable consistency
• Network topology aware
Cassandra Data Model
• Original “Map of Maps” schema
• row key ➞ Map<ColumnName, Value>
• Now (in CQL):
• Keyspace = Database
• ColumnFamily = Table
• Row = Partition
• Column = Cell
• Data types
• strings, booleans, integers, decimals
• collections: list, set, map
• not indexable, not individually query-able
• counters
• custom types
Cassandra Replication Factor & Consistency Levels
• CAP Theorem:
• Consistency
• Availability
• Tolerance in the face of network partitions
Original article: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf Review 12 years later: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed Fun with distributed systems under partitions: http://aphyr.com/tags/jepsen
Cassandra Replication Factor & Consistency Levels
• RF: designated per keyspace
• CL:
• Writes: ANY, ONE, QUORUM, ALL
• Reads: ONE, QUORUM, ALL
• Consistent reads & writes are achieved when CL(W) + CL(R) > RF
• QUORUM = ⌊RF/2⌋ + 1
• Additional QUORUM variants:
• LOCAL_QUORUM: quorum of replica nodes within same DC
• EACH_QUORUM: quorum of replica nodes from all DCs
Cassandra parameters calculator: http://www.ecyrd.com/cassandracalculator/
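As a concrete check of the quorum rule above (a sketch; the keyspace and data center names are illustrative): with RF = 3, QUORUM is ⌊3/2⌋ + 1 = 2, so QUORUM writes plus QUORUM reads give 2 + 2 > 3 and every read overlaps the latest write on at least one replica.

```
-- Replication factor is set per keyspace
CREATE KEYSPACE polls
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- In cqlsh, the consistency level is set per session:
CONSISTENCY QUORUM;  -- with RF=3, 2 replicas must ack; 2 + 2 > 3 => consistent
```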
Masterless design
• All nodes in the cluster are equal
• Gossip protocol among servers
• Adding / removing nodes is easy
• Clients are cluster-aware
Traditional replicated relational database systems focus on the problem of guaranteeing strong consistency to replicated data. Although strong consistency provides the application writer a convenient programming model, these systems are limited in scalability and availability [7]. These systems are not capable of handling network partitions because they typically provide strong consistency guarantees.
3.3 Discussion Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an “always writeable” data store where no updates are rejected due to failures or concurrent writes. This is a crucial requirement for many Amazon applications. Second, as noted earlier, Dynamo is built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Third, applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported by traditional databases). Fourth, Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes (which is the typical design adopted by several distributed hash table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby increasing the latency at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.
4. SYSTEM ARCHITECTURE The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. Describing the details of each of the solutions is not possible, so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling.
Table 1 presents a summary of the list of techniques Dynamo uses and their respective advantages.
4.1 System Interface Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context. The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.
Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies a MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
4.2 Partitioning Algorithm One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.
[Figure 2: Partitioning and replication of keys in Dynamo ring. Nodes B, C and D store keys in range (A,B), including key K.]
Table 1: Summary of techniques used in Dynamo and their advantages.

| Problem | Technique | Advantage |
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available |
| Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |
Image from “Dynamo: Amazon’s Highly Available Key-value Store”
Write path
• Storage is log-structured; updates do not overwrite, deletes do not remove
• Commit log: sequential disk access
• Memtables: in-memory data structure (partially off-heap since 2.1b2)
• Memtables are flushed to SSTable on disk
• Compaction: merge SSTables, remove tombstones
Read path
• For each SSTable that may contain a partition key:
• Bloom filters: estimate probability of locating partition data per SSTable
• Locate offset in SSTable
• Sequential read in SSTable (if query involves several columns)
• A partition’s columns are merged from several SSTables / memtable, as column updates never overwrite data
CQL - Data Modeling Basics
CQL
• Cassandra Query Language
• Client API for Cassandra
• CQL3 available since Cassandra 1.2
• Familiar syntax
• Easy to use
• Drivers available for Java, Python, C# and more
Creating a table
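The code from this slide is not in the transcript; a minimal sketch of the table being discussed, with the column set inferred from the later INSERT example (the name columns are assumptions):

```
CREATE TABLE users (
    username text PRIMARY KEY,  -- single-column primary key = partition key
    email text,
    firstname text,             -- illustrative extra columns
    lastname text
);
```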
Creating a table - what happened?
• A new table was created
• It looks familiar!
• We defined the username as the primary key, therefore we are able to identify a row and query quickly by username
• Primary keys can be composite; the first part of the primary key is the partition key and determines the primary node for the partition
Composite Primary Key
• Partition Key + Clustering Column(s)
• Partition key (not ordered)
• Clustering key (ordered)
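A sketch of a composite primary key in CQL (column types and the attribute/value columns are assumptions): the first component of the PRIMARY KEY is the partition key, the rest are clustering columns.

```
CREATE TABLE user_attributes (
    username text,   -- partition key: determines the node that owns the partition
    attribute text,  -- clustering column: rows are ordered by it within the partition
    value text,
    PRIMARY KEY (username, attribute)
);
```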
Composite Primary Key - Partition Layout
• Partition “johndoe”: cells stored as ordered (clustering key : value) pairs
• Partition “anna”: cells stored as ordered (clustering key : value) pairs
Insert/Update
• INSERT & UPDATE are functionally equivalent
• New in Cassandra 2.0: Support for lightweight transactions (compare-and-set)
• e.g. INSERT INTO users (username, email) VALUES ('tony', '[email protected]') IF NOT EXISTS;
• Based on Paxos consensus protocol
Paxos Made Live: An Engineering Perspective: http://research.google.com/archive/paxos_made_live.pdf
Select query
• SELECT * FROM user_attributes;
• Selecting across several partitions can be slow
• Default LIMIT 10,000
• Can filter results with WHERE clauses on partition key, partition key & clustering columns or indexed columns
• EQ & IN operators allowed for partition keys
• EQ, <, > … operators allowed for clustering columns
Select query - Ordering
• Partition keys are not ordered
• … but clustering columns are ordered
• Default ordering is mandated by clustering columns
• ORDER BY can be specified on clustering columns at query time; default order can be set WITH CLUSTERING ORDER on table creation
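The two ordering mechanisms above can be sketched as follows (table and column names are illustrative):

```
-- Default order is set once, at table creation:
CREATE TABLE events (
    sensor_id text,
    ts timestamp,
    reading double,
    PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);  -- newest first by default

-- ORDER BY at query time may reverse it, on clustering columns only:
SELECT * FROM events WHERE sensor_id = 's1' ORDER BY ts ASC;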
Secondary Indexes
• Secondary indexes allow queries using EQ or IN operators on columns other than the partition key
• Internally implemented as hidden tables
• “Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index.”
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
Secondary Indexes
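The example from this slide is not in the transcript; a minimal sketch (the `country` column and index name are hypothetical):

```
-- Internally backed by a hidden table mapping indexed value -> rows
CREATE INDEX users_by_country ON users (country);

-- Now EQ queries on the indexed column are allowed:
SELECT * FROM users WHERE country = 'GR';
```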
Query Performance
• Single-partition queries are fast!
• Queries for ranges on clustering columns are fast!
• Queries for multiple partitions are slow
• Use secondary indexes with caution
Counter columns
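The counter example from this slide is not in the transcript; a minimal sketch (table and column names are illustrative). Counter columns live in dedicated tables and can only be changed by increments or decrements, never set directly:

```
CREATE TABLE page_views (
    page text PRIMARY KEY,
    views counter
);

-- Increment; the read-modify-write is handled server-side:
UPDATE page_views SET views = views + 1 WHERE page = '/home';
```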
Tracing CQL requests
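The tracing example from this slide is not in the transcript; in cqlsh, tracing is toggled per session, roughly like this (the query itself is illustrative):

```
TRACING ON;
SELECT * FROM users WHERE username = 'tony';
-- cqlsh prints each internal step (coordinator, replicas, merges) with timings
TRACING OFF;
```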
Setting TTL
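The TTL example from this slide is not in the transcript; a minimal sketch (the email value is a placeholder):

```
-- Expire the inserted cells after 86400 seconds (one day):
INSERT INTO users (username, email)
VALUES ('tony', 'tony@example.com') USING TTL 86400;

-- Inspect the remaining lifetime of a cell:
SELECT TTL(email) FROM users WHERE username = 'tony';
```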
Counters and Time Series use case: Polls
Use cases
Data access patterns
• View poll ➞ Get poll name & sorted list of answers by poll id
• User votes ➞ Insert answer with user id, poll id, answer id, timestamp
• View result ➞ Retrieve counts per poll & answer
Poll & answers
• POLL: POLL_ID, TEXT
• POLL_ANSWER: POLL_ID, ANSWER_ID, SORT_ORDER
• ANSWER: ANSWER_ID, TEXT
Poll & answers
• Need 3 queries to display a poll
• 2 by PK EQ
• 1 for multiple rows by PK IN
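The three-table model above can be sketched in CQL (types are assumptions, not from the slides):

```
CREATE TABLE poll (
    poll_id uuid PRIMARY KEY,
    text text
);

CREATE TABLE poll_answer (
    poll_id uuid,
    answer_id uuid,
    sort_order int,
    PRIMARY KEY (poll_id, answer_id)
);

CREATE TABLE answer (
    answer_id uuid PRIMARY KEY,
    text text
);
```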
Poll & answers revisited
• POLL: POLL_ID, TEXT
• POLL_ANSWER: POLL_ID, SORT_ORDER, ANSWER_ID, ANSWER_TEXT
Poll & answers revisited
• Need 2 queries to display a poll
• both by PK EQ
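The revisited model denormalizes the answer text into POLL_ANSWER and clusters by sort order; a CQL sketch (types assumed):

```
CREATE TABLE poll (
    poll_id uuid PRIMARY KEY,
    text text
);

CREATE TABLE poll_answer (
    poll_id uuid,
    sort_order int,
    answer_id uuid,
    answer_text text,                  -- denormalized: no ANSWER table lookup
    PRIMARY KEY (poll_id, sort_order)  -- answers come back sorted per poll
);
```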
Poll & answers re-revisited
• POLL: POLL_ID, POLL_TEXT (STATIC), SORT_ORDER, ANSWER_ID, ANSWER_TEXT
(Requires Cassandra 2.0.6+)
Poll & answers re-revisited
• One table to rule them all
• One query by PK EQ
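The single-table model relies on a static column, stored once per partition rather than once per row; a CQL sketch (types assumed):

```
-- Static columns require Cassandra 2.0.6+
CREATE TABLE poll (
    poll_id uuid,
    poll_text text STATIC,  -- one copy per partition, shared by all answer rows
    sort_order int,
    answer_id uuid,
    answer_text text,
    PRIMARY KEY (poll_id, sort_order)
);
```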
Votes
• Record user’s votes in a timeline
• Count of votes per answer
Votes
• VOTE: POLL_ID, VOTED_ON, USER_ID, ANSWER_ID
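A CQL sketch of the vote timeline (the timeuuid type for VOTED_ON is an assumption; it avoids collisions between same-millisecond votes):

```
CREATE TABLE vote (
    poll_id uuid,
    voted_on timeuuid,   -- clustering column: votes ordered by time per poll
    user_id uuid,
    answer_id uuid,
    PRIMARY KEY (poll_id, voted_on)
);
```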
Time buckets
• If you have tons of votes to record, you may want to split your partitions into buckets, e.g. per day
Time buckets - Partition layout
• Partition (poll_id:1, day:20140401): (user_id:21 ➞ answer_id:4), (user_id:22 ➞ answer_id:1)
• Partition (poll_id:1, day:20140402): (user_id:27 ➞ answer_id:2), (user_id:29 ➞ answer_id:3)
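Bucketing is expressed with a composite partition key, sketched below (the text type for the day bucket is an assumption):

```
CREATE TABLE vote (
    poll_id uuid,
    day text,            -- bucket, e.g. '20140401'
    voted_on timeuuid,
    user_id uuid,
    answer_id uuid,
    PRIMARY KEY ((poll_id, day), voted_on)  -- one partition per poll per day
);
```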
Counting votes
• Count per poll_id & answer_id
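A counter-table sketch for the per-answer tally (names and types are illustrative):

```
CREATE TABLE vote_count (
    poll_id uuid,
    answer_id uuid,
    votes counter,
    PRIMARY KEY (poll_id, answer_id)
);

-- Incremented on each vote (bind values via a prepared statement):
UPDATE vote_count SET votes = votes + 1
  WHERE poll_id = ? AND answer_id = ?;
```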
Links
• http://cassandra.apache.org
• http://planetcassandra.org/ Cassandra binary distributions, use cases, webinars
• http://www.datastax.com/docs Excellent documentation for all things Cassandra (and DSE)
• http://www.slideshare.net/patrickmcfadin/cassandra-20-and-timeseries Cassandra 2.0 new features & time series modeling
Thank you!