Presented at Athens Cassandra Users Group meetup http://www.meetup.com/Athens-Cassandra-Users/events/177040142/
C* @ Icon Platforms
Vassilis Bekiaris (@karbonized1)
Software Architect
Presentation outline
• Meet Cassandra
• CQL - Data modeling basics
• Counters & Time-series use case: Polls
Meet Cassandra
History
• Started at Facebook
• Historically builds on
• Dynamo for distribution: consistent hashing, eventual consistency
• BigTable for disk storage model
Amazon’s Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html Google’s BigTable: http://research.google.com/archive/bigtable.html
Cassandra is
• A distributed database written in Java
• Scalable
• Masterless, no single point of failure
• Tunable consistency
• Network topology aware
Cassandra Data Model
• Original “Map of Maps” schema
• row key ➞ Map<ColumnName, Value>
• Now (in CQL):
• Keyspace = Database
• ColumnFamily = Table
• Row = Partition
• Column = Cell
• Data types
• strings, booleans, integers, decimals
• collections: list, set, map
• not indexable, not individually query-able
• counters
• custom types
Cassandra Replication Factor & Consistency Levels
• CAP Theorem:
• Consistency
• Availability
• Tolerance in the face of network partitions
Original article: http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf Review 12 years later: http://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed Fun with distributed systems under partitions: http://aphyr.com/tags/jepsen
Cassandra Replication Factor & Consistency Levels
• RF: designated per keyspace
• CL:
• Writes: ANY, ONE, QUORUM, ALL
• Reads: ONE, QUORUM, ALL
• Consistent reads & writes are achieved when CL(W) + CL(R) > RF
• QUORUM = ⌊RF/2⌋ + 1
• Additional QUORUM variants:
• LOCAL_QUORUM: quorum of replica nodes within same DC
• EACH_QUORUM: quorum of replica nodes from all DCs
Cassandra parameters calculator: http://www.ecyrd.com/cassandracalculator/
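As a concrete check of the quorum rule above (a sketch; the keyspace and data center names are illustrative): with RF = 3, QUORUM is ⌊3/2⌋ + 1 = 2, so QUORUM writes plus QUORUM reads give 2 + 2 > 3 and every read overlaps the latest write on at least one replica.

```
-- Replication factor is set per keyspace
CREATE KEYSPACE polls
  WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 3};

-- In cqlsh, the consistency level is set per session:
CONSISTENCY QUORUM;  -- with RF=3, 2 replicas must ack; 2 + 2 > 3 => consistent
```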
Masterless design
• All nodes in the cluster are equal
• Gossip protocol among servers
• Adding / removing nodes is easy
• Clients are cluster-aware
Traditional replicated relational database systems focus on the problem of guaranteeing strong consistency to replicated data. Although strong consistency provides the application writer a convenient programming model, these systems are limited in scalability and availability [7]. These systems are not capable of handling network partitions because they typically provide strong consistency guarantees.
3.3 Discussion Dynamo differs from the aforementioned decentralized storage systems in terms of its target requirements. First, Dynamo is targeted mainly at applications that need an “always writeable” data store where no updates are rejected due to failures or concurrent writes. This is a crucial requirement for many Amazon applications. Second, as noted earlier, Dynamo is built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted. Third, applications that use Dynamo do not require support for hierarchical namespaces (a norm in many file systems) or complex relational schema (supported by traditional databases). Fourth, Dynamo is built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds. To meet these stringent latency requirements, it was imperative for us to avoid routing requests through multiple nodes (which is the typical design adopted by several distributed hash table systems such as Chord and Pastry). This is because multi-hop routing increases variability in response times, thereby increasing the latency at higher percentiles. Dynamo can be characterized as a zero-hop DHT, where each node maintains enough routing information locally to route a request to the appropriate node directly.
4. SYSTEM ARCHITECTURE The architecture of a storage system that needs to operate in a production setting is complex. In addition to the actual data persistence component, the system needs to have scalable and robust solutions for load balancing, membership and failure detection, failure recovery, replica synchronization, overload handling, state transfer, concurrency and job scheduling, request marshalling, request routing, system monitoring and alarming, and configuration management. Describing the details of each of the solutions is not possible, so this paper focuses on the core distributed systems techniques used in Dynamo: partitioning, replication, versioning, membership, failure handling and scaling.
Table 1 presents a summary of the list of techniques Dynamo uses and their respective advantages.
4.1 System Interface Dynamo stores objects associated with a key through a simple interface; it exposes two operations: get() and put(). The get(key) operation locates the object replicas associated with the key in the storage system and returns a single object or a list of objects with conflicting versions along with a context. The put(key, context, object) operation determines where the replicas of the object should be placed based on the associated key, and writes the replicas to disk. The context encodes system metadata about the object that is opaque to the caller and includes information such as the version of the object. The context information is stored along with the object so that the system can verify the validity of the context object supplied in the put request.
Dynamo treats both the key and the object supplied by the caller as an opaque array of bytes. It applies a MD5 hash on the key to generate a 128-bit identifier, which is used to determine the storage nodes that are responsible for serving the key.
4.2 Partitioning Algorithm One of the key design requirements for Dynamo is that it must scale incrementally. This requires a mechanism to dynamically partition the data over the set of nodes (i.e., storage hosts) in the system. Dynamo’s partitioning scheme relies on consistent hashing to distribute the load across multiple storage hosts. In consistent hashing [10], the output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value). Each node in the system is assigned a random value within this space which represents its “position” on the ring. Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.
[Figure 2: Partitioning and replication of keys in Dynamo ring. Nodes B, C and D store keys in range (A,B), including key K.]
Table 1: Summary of techniques used in Dynamo and their advantages.

| Problem | Technique | Advantage |
| Partitioning | Consistent hashing | Incremental scalability |
| High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates |
| Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available |
| Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background |
| Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information |
Image from “Dynamo: Amazon’s Highly Available Key-value Store”
Write path
• Storage is log-structured; updates do not overwrite, deletes do not remove
• Commit log: sequential disk access
• Memtables: in-memory data structure (partially off-heap since 2.1b2)
• Memtables are flushed to SSTable on disk
• Compaction: merge SSTables, remove tombstones
Read path
• For each SSTable that may contain a partition key:
• Bloom filters: estimate probability of locating partition data per SSTable
• Locate offset in SSTable
• Sequential read in SSTable (if query involves several columns)
• A partition’s columns are merged from several SSTables / memtable, as column updates never overwrite data
CQL - Data Modeling Basics
CQL
• Cassandra Query Language
• Client API for Cassandra
• CQL3 available since Cassandra 1.2
• Familiar syntax
• Easy to use
• Drivers available for Java, Python, C# and more
Creating a table
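The code from this slide is not in the transcript; a minimal sketch of the table being discussed, with the column set inferred from the later INSERT example (the name columns are assumptions):

```
CREATE TABLE users (
    username text PRIMARY KEY,  -- single-column primary key = partition key
    email text,
    firstname text,             -- illustrative extra columns
    lastname text
);
```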
Creating a table - what happened?
• A new table was created
• It looks familiar!
• We defined the username as the primary key, therefore we are able to identify a row and query quickly by username
• Primary keys can be composite; the first part of the primary key is the partition key and determines the primary node for the partition
Composite Primary Key
• Partition Key + Clustering Column(s)
• Partition key (not ordered)
• Clustering key (ordered)
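A sketch of a composite primary key in CQL (column types and the attribute/value columns are assumptions): the first component of the PRIMARY KEY is the partition key, the rest are clustering columns.

```
CREATE TABLE user_attributes (
    username text,   -- partition key: determines the node that owns the partition
    attribute text,  -- clustering column: rows are ordered by it within the partition
    value text,
    PRIMARY KEY (username, attribute)
);
```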
Composite Primary Key - Partition Layout
• Partition “johndoe”: cells stored as ordered (clustering key : value) pairs
• Partition “anna”: cells stored as ordered (clustering key : value) pairs
Insert/Update
• INSERT & UPDATE are functionally equivalent
• New in Cassandra 2.0: Support for lightweight transactions (compare-and-set)
• e.g. INSERT INTO users (username, email) VALUES ('tony', '[email protected]') IF NOT EXISTS;
• Based on Paxos consensus protocol
Paxos Made Live: An Engineering Perspective: http://research.google.com/archive/paxos_made_live.pdf
Select query
• SELECT * FROM user_attributes;
• Selecting across several partitions can be slow
• Default LIMIT 10,000
• Can filter results with WHERE clauses on partition key, partition key & clustering columns or indexed columns
• EQ & IN operators allowed for partition keys
• EQ, <, > … operators allowed for clustering columns
Select query - Ordering
• Partition keys are not ordered
• … but clustering columns are ordered
• Default ordering is mandated by clustering columns
• ORDER BY can be specified on clustering columns at query time; default order can be set WITH CLUSTERING ORDER on table creation
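The two ordering mechanisms above can be sketched as follows (table and column names are illustrative):

```
-- Default order is set once, at table creation:
CREATE TABLE events (
    sensor_id text,
    ts timestamp,
    reading double,
    PRIMARY KEY (sensor_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);  -- newest first by default

-- ORDER BY at query time may reverse it, on clustering columns only:
SELECT * FROM events WHERE sensor_id = 's1' ORDER BY ts ASC;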
Secondary Indexes
• Secondary indexes allow queries using EQ or IN operators on columns other than the partition key
• Internally implemented as hidden tables
• “Cassandra's built-in indexes are best on a table having many rows that contain the indexed value. The more unique values that exist in a particular column, the more overhead you will have, on average, to query and maintain the index.”
http://www.datastax.com/documentation/cql/3.0/cql/ddl/ddl_when_use_index_c.html
Secondary Indexes
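The example from this slide is not in the transcript; a minimal sketch (the `country` column and index name are hypothetical):

```
-- Internally backed by a hidden table mapping indexed value -> rows
CREATE INDEX users_by_country ON users (country);

-- Now EQ queries on the indexed column are allowed:
SELECT * FROM users WHERE country = 'GR';
```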
Query Performance
• Single-partition queries are fast!
• Queries for ranges on clustering columns are fast!
• Queries for multiple partitions are slow
• Use secondary indexes with caution
Counter columns
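The counter example from this slide is not in the transcript; a minimal sketch (table and column names are illustrative). Counter columns live in dedicated tables and can only be changed by increments or decrements, never set directly:

```
CREATE TABLE page_views (
    page text PRIMARY KEY,
    views counter
);

-- Increment; the read-modify-write is handled server-side:
UPDATE page_views SET views = views + 1 WHERE page = '/home';
```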
Tracing CQL requests
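The tracing example from this slide is not in the transcript; in cqlsh, tracing is toggled per session, roughly like this (the query itself is illustrative):

```
TRACING ON;
SELECT * FROM users WHERE username = 'tony';
-- cqlsh prints each internal step (coordinator, replicas, merges) with timings
TRACING OFF;
```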
Setting TTL
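The TTL example from this slide is not in the transcript; a minimal sketch (the email value is a placeholder):

```
-- Expire the inserted cells after 86400 seconds (one day):
INSERT INTO users (username, email)
VALUES ('tony', 'tony@example.com') USING TTL 86400;

-- Inspect the remaining lifetime of a cell:
SELECT TTL(email) FROM users WHERE username = 'tony';
```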
Counters and Time Series use case: Polls
Use cases
Data access patterns
• View poll ➞ Get poll name & sorted list of answers by poll id
• User votes ➞ Insert answer with user id, poll id, answer id, timestamp
• View result ➞ Retrieve counts per poll & answer
Poll & answers
• POLL: POLL_ID, TEXT
• POLL_ANSWER: POLL_ID, ANSWER_ID, SORT_ORDER
• ANSWER: ANSWER_ID, TEXT
Poll & answers
• Need 3 queries to display a poll
• 2 by PK EQ
• 1 for multiple rows by PK IN
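The three-table model above can be sketched in CQL (types are assumptions, not from the slides):

```
CREATE TABLE poll (
    poll_id uuid PRIMARY KEY,
    text text
);

CREATE TABLE poll_answer (
    poll_id uuid,
    answer_id uuid,
    sort_order int,
    PRIMARY KEY (poll_id, answer_id)
);

CREATE TABLE answer (
    answer_id uuid PRIMARY KEY,
    text text
);
```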
Poll & answers revisited
• POLL: POLL_ID, TEXT
• POLL_ANSWER: POLL_ID, SORT_ORDER, ANSWER_ID, ANSWER_TEXT
Poll & answers revisited
• Need 2 queries to display a poll
• both by PK EQ
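The revisited model denormalizes the answer text into POLL_ANSWER and clusters by sort order; a CQL sketch (types assumed):

```
CREATE TABLE poll (
    poll_id uuid PRIMARY KEY,
    text text
);

CREATE TABLE poll_answer (
    poll_id uuid,
    sort_order int,
    answer_id uuid,
    answer_text text,                  -- denormalized: no ANSWER table lookup
    PRIMARY KEY (poll_id, sort_order)  -- answers come back sorted per poll
);
```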
Poll & answers re-revisited
• POLL: POLL_ID, POLL_TEXT (STATIC), SORT_ORDER, ANSWER_ID, ANSWER_TEXT
(Requires Cassandra 2.0.6+)
Poll & answers re-revisited
• One table to rule them all
• One query by PK EQ
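The single-table model relies on a static column, stored once per partition rather than once per row; a CQL sketch (types assumed):

```
-- Static columns require Cassandra 2.0.6+
CREATE TABLE poll (
    poll_id uuid,
    poll_text text STATIC,  -- one copy per partition, shared by all answer rows
    sort_order int,
    answer_id uuid,
    answer_text text,
    PRIMARY KEY (poll_id, sort_order)
);
```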
Votes
• Record user’s votes in a timeline
• Count of votes per answer
Votes
• VOTE: POLL_ID, VOTED_ON, USER_ID, ANSWER_ID
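A CQL sketch of the vote timeline (the timeuuid type for VOTED_ON is an assumption; it avoids collisions between same-millisecond votes):

```
CREATE TABLE vote (
    poll_id uuid,
    voted_on timeuuid,   -- clustering column: votes ordered by time per poll
    user_id uuid,
    answer_id uuid,
    PRIMARY KEY (poll_id, voted_on)
);
```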
Time buckets
• If you have tons of votes to record, you may want to split your partitions into buckets, e.g. per day
Time buckets - Partition layout
• Partition (poll_id:1, day:20140401): (user_id:21 ➞ answer_id:4), (user_id:22 ➞ answer_id:1)
• Partition (poll_id:1, day:20140402): (user_id:27 ➞ answer_id:2), (user_id:29 ➞ answer_id:3)
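Bucketing is expressed with a composite partition key, sketched below (the text type for the day bucket is an assumption):

```
CREATE TABLE vote (
    poll_id uuid,
    day text,            -- bucket, e.g. '20140401'
    voted_on timeuuid,
    user_id uuid,
    answer_id uuid,
    PRIMARY KEY ((poll_id, day), voted_on)  -- one partition per poll per day
);
```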
Counting votes
• Count per poll_id & answer_id
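A counter-table sketch for the per-answer tally (names and types are illustrative):

```
CREATE TABLE vote_count (
    poll_id uuid,
    answer_id uuid,
    votes counter,
    PRIMARY KEY (poll_id, answer_id)
);

-- Incremented on each vote (bind values via a prepared statement):
UPDATE vote_count SET votes = votes + 1
  WHERE poll_id = ? AND answer_id = ?;
```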
Links
• http://cassandra.apache.org
• http://planetcassandra.org/ Cassandra binary distributions, use cases, webinars
• http://www.datastax.com/docs Excellent documentation for all things Cassandra (and DSE)
• http://www.slideshare.net/patrickmcfadin/cassandra-20-and-timeseries Cassandra 2.0 new features & time series modeling
Thank you!