Cassandra vs HBase
Similarities and differences in the architectural approaches
Cassandra Moscow, April 2013 meetup
Foundation papers
● The Google File System; Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
● Bigtable: A Distributed Storage System for Structured Data; Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
● Dynamo: Amazon’s Highly Available Key-value Store; Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
Log-structured merge tree layout
Log-structured merge tree
● Writes are aggregated in memory and then flushed to disk in one batch (see the sketch below)
  ○ The memtable is effectively a write-behind cache
  ○ A write-ahead log (on-disk commit log) is used to protect the in-memory data from node failures
● In-memory entries are asynchronously persisted as a single segment (file) of records sorted by key
  ○ The segments are asynchronously merged together in order to keep their number around log(number of records)
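A minimal sketch of this write path, using hypothetical class names and in-memory maps in place of real on-disk SSTables, commit logs, and compaction:

```java
import java.util.*;

/** Toy LSM tree: a sorted in-memory memtable flushed to immutable
 *  sorted segments; reads check the memtable first, then segments
 *  from newest to oldest. Illustrative sketch only. */
class ToyLsmTree {
    private static final int MEMTABLE_LIMIT = 4;   // flush threshold
    private NavigableMap<String, String> memtable = new TreeMap<>();
    private final Deque<NavigableMap<String, String>> segments = new ArrayDeque<>();

    void put(String key, String value) {
        // A real store appends to the write-ahead log here before
        // touching the memtable, so a crash cannot lose the write.
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (NavigableMap<String, String> seg : segments) { // newest first
            v = seg.get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {
        // The memtable is already key-sorted, so a segment is written
        // as one large sequential batch -- the property that suits HDFS.
        segments.addFirst(memtable);
        memtable = new TreeMap<>();
        if (segments.size() > 2) merge();           // toy compaction
    }

    private void merge() {
        // Merge all segments into one; newer entries win.
        NavigableMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it = segments.descendingIterator(); it.hasNext(); )
            merged.putAll(it.next());               // oldest first, newer overwrite
        segments.clear();
        segments.addFirst(merged);
    }
}
```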
Why LSM tree is good for HBase
● The LSM tree suits HDFS well
  ○ An LSM tree writes data in large sequential batches
  ○ SSTables are immutable, which matches HDFS's write-once file model
LSM tree problems
● Relatively slow reads
  ○ The requested key can be in any segment, so all of them may have to be checked
    ■ Key cache (Cassandra)
    ■ Bloom filters can be used to skip some of the files (see the sketch below)
      ● They are prone to false positives, which only cost an extra read; a negative answer is always correct
● Early versions of HDFS had no support for an append operation
  ○ Append is required for the write-ahead log
  ○ hflush in HDFS 0.21 allows flushing written data without closing the file
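A minimal Bloom filter sketch illustrating why a segment can be skipped safely (hypothetical class, simplified hashing):

```java
import java.util.BitSet;

/** Toy Bloom filter: k hash probes into a bit array.
 *  "Might contain" can be wrong (false positive), costing an
 *  unnecessary segment read; "does not contain" is always right,
 *  which is what makes skipping a segment safe. */
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++)
            bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(probe(key, i)))
                return false;  // definitely absent: skip this segment
        return true;           // possibly present: must read the segment
    }

    private int probe(String key, int i) {
        // Derive k probe positions from two base hashes (a common trick).
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }
}
```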
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
Shared nothing architecture
● Each node processes requests for its own shard of data
● It is always known which node is responsible for a particular key
Cassandra entry distribution
Cassandra distributed storage
● Consistent hashing (a node ring) is used to distribute a column family across the cluster nodes
● A node is responsible for storing the range of keys whose hashes are not greater than its own number (token); a lookup sketch follows
  ○ Node tokens are set explicitly in config
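A sketch of the token-ring lookup, assuming MD5 key hashing as in the default random partitioner (hypothetical class; the real ring also handles replicas and vnodes):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring: each node owns the range of key hashes up to its
 *  token; a lookup finds the first token at or above the key's hash,
 *  wrapping around the ring if necessary. */
class ToyTokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    String nodeFor(String key) throws Exception {
        BigInteger hash = md5(key);
        // First node whose token >= hash; wrap to the lowest token if none.
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(hash);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static BigInteger md5(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);  // non-negative 128-bit hash
    }
}
```

With virtual nodes (next slide), the same structure simply holds many tokens per physical node.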
Virtual nodes
● Virtual nodes are available in Cassandra from v1.2
● No need for manual token assignment
  ○ They make data distribute evenly across the physical nodes
  ○ It is simpler to control what proportion of the data is stored on a particular node
Cassandra partition strategies
● Random partitioner
  ○ The node is determined by the MD5 hash of the key
● Byte-ordered partitioner
  ○ The node is determined by a number constructed from the first bytes of the key
  ○ Allows range queries
  ○ Prone to uneven data distribution
Cassandra secondary indexes placement
HBase region distribution
HBase distributed storage
● Region meta table (a simplified lookup is sketched after this list)
  ○ A region is a contiguous range of keys
  ○ The root table stores the regions of the meta table itself
  ○ The master tries to distribute regions evenly across RegionServers
    ■ Regions can be moved between region servers in order to achieve a better distribution
      ● Since the actual data is in HDFS, no data is moved during the process
● Secondary attribute queries
  ○ DIY indexes: coprocessors
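Because regions are contiguous key ranges, locating a row reduces to a floor lookup over sorted region start keys; a toy stand-in for what the META table provides (hypothetical class names):

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy region locator: a sorted map from region start key to hosting
 *  server resolves any row key with a single floor lookup. */
class ToyRegionLocator {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionServer) {
        // The first region of a table conventionally starts at the
        // empty key, so every row key has a floor entry.
        regionsByStartKey.put(startKey, regionServer);
    }

    String serverFor(String rowKey) {
        // The hosting region is the one with the greatest start key <= rowKey.
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e != null ? e.getValue() : null;
    }
}
```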
Region splits/merges
● Initially only one region is allocated for a table
● Region sizes can become uneven
● Online region splitting
  ○ Data is not copied; the new regions' files just hold references to the data in the old region's files
● Region merging is still unstable
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
HBase cluster nodes
HDFS and CAP theorem
● CP
  ○ HDFS replicates data synchronously on write
  ○ A DataNode is considered dead if it is not visible to the NameNode
    ■ Lost block replicas will be restored automatically on live nodes
  ○ A DataNode stops serving requests if the NameNode is lost
HDFS block replication in cluster
HDFS block replication
● HDFS tends to store one copy of a block on the same server as the client
  ○ if there is a DataNode on that server
● HDFS rack awareness (sketched below)
  ○ one copy on the client's server
  ○ one on the same rack
  ○ one on a different rack
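A toy sketch of the placement rule as described on this slide (client-local copy, same-rack copy, off-rack copy); the real HDFS policy is more involved and has varied across versions:

```java
import java.util.*;

/** Toy rack-aware block placement: first replica on the client's
 *  own DataNode (if any), second on the same rack, third on a
 *  different rack. Illustration only. */
class ToyBlockPlacement {
    private final Map<String, String> rackOf;  // node name -> rack name

    ToyBlockPlacement(Map<String, String> rackOf) {
        this.rackOf = rackOf;
    }

    List<String> choose(String clientNode) {
        List<String> replicas = new ArrayList<>();
        if (rackOf.containsKey(clientNode)) replicas.add(clientNode);
        String clientRack = rackOf.get(clientNode);
        for (String node : rackOf.keySet()) {   // second copy: same rack
            if (replicas.size() >= 2) break;
            if (!replicas.contains(node) && Objects.equals(rackOf.get(node), clientRack))
                replicas.add(node);
        }
        for (String node : rackOf.keySet()) {   // third copy: different rack
            if (replicas.size() >= 3) break;
            if (!Objects.equals(rackOf.get(node), clientRack))
                replicas.add(node);
        }
        return replicas;
    }
}
```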
HDFS disadvantages (if used as storage for HBase)
● HDFS requires an additional request to the NameNode in order to find a DataNode storing the required block
● In some cases data has to be transferred from a DataNode to the RegionServer over the network on reads
  ○ HBase does not take the locations of region file blocks into account when it assigns regions to RegionServers
HBase inter-cluster replication
● Master-slave inter-cluster replication (asynchronous, based on shipping write-ahead log edits)
● Requests to a region server are replicated to the slave HBase cluster
Cassandra and CAP theorem
● AP
  ○ Gossip-style failure detection
    ■ A failed node is still in the ring
      ● A new replica for its data range will be assigned only if the failed node is manually removed
  ○ Async writes
    ■ A node will replicate the write to the appropriate nodes but return to the client immediately
● Can also be "eventually" consistent
  ○ Quorum writes
    ■ Block until a certain number of writes is acknowledged (see the quorum sketch below)
    ■ But there is no distributed commit protocol
  ○ Quorum reads
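The standard quorum arithmetic behind these settings, as a small self-contained sketch: with N replicas, a write acknowledged by W nodes and a read that consults R nodes must overlap in at least one replica whenever R + W > N.

```java
/** Quorum overlap check: a read quorum and a write quorum intersect
 *  in at least one replica iff R + W > N, so such a read always sees
 *  the latest acknowledged write. */
class QuorumCheck {
    static boolean readsSeeLatestWrite(int n, int w, int r) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;              // replication factor
        int quorum = n / 2 + 1; // = 2 for n = 3
        // QUORUM reads + QUORUM writes overlap:
        System.out.println(readsSeeLatestWrite(n, quorum, quorum)); // true
        // ONE write + ONE read may miss each other:
        System.out.println(readsSeeLatestWrite(n, 1, 1));           // false
    }
}
```

Note that overlap guarantees visibility of the latest write, but, as the next slide shows, the absence of a distributed commit protocol still allows "failed" writes to survive on some replicas.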
Lack of distributed commit protocol issue
1. The client writes to all replicas
2. The write fails on one of the replicas
3. The write operation is reported as failed
4. Yet all of the replicas except one have persisted the "failed" write
Inconsistent write repair measures
● Read repairs
  ○ Differences in results are detected when reading from multiple replicas (sketched below)
● Hinted handoff
  ○ A failed write is remembered and retried later by the coordinator node
● Anti-entropy
  ○ Manually started replica reconciliation
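A minimal read-repair sketch, assuming timestamped replica responses and hypothetical types; the newest value wins and is pushed back to any replica that returned something older:

```java
import java.util.*;

/** Toy read repair: compare timestamped replica responses, return
 *  the newest value, and record a repair write for stale replicas. */
class ToyReadRepair {
    record Response(String replica, String value, long timestamp) {}

    static String readWithRepair(List<Response> responses,
                                 Map<String, String> repairLog) {
        Response newest = Collections.max(responses,
                Comparator.comparingLong(Response::timestamp));
        for (Response r : responses) {
            if (r.timestamp() < newest.timestamp()) {
                // Push the winning value back to the stale replica
                // (here just recorded in a map for illustration).
                repairLog.put(r.replica(), newest.value());
            }
        }
        return newest.value();
    }
}
```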
Cassandra simple replication
Cassandra network topology based replication
Cassandra replica placement strategies
● Simple
  ○ The closest neighbor down the ring is selected as a replica
● Network topology based (sketched below)
  ○ Additional replicas are placed by walking the ring clockwise until a node in a different rack is found
    ■ If no such node exists, additional replicas are placed on different nodes in the same rack
  ○ Server → DC:Rack mappings are set explicitly in config
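A sketch of the topology-based walk described above, using hypothetical names and a plain list of nodes in token order; distinct racks are preferred, with a same-rack fallback:

```java
import java.util.*;

/** Toy network-topology placement: starting from the primary
 *  replica's ring position, walk clockwise taking nodes in racks
 *  not yet used; if no new rack can be found, fall back to any
 *  remaining distinct node. */
class ToyTopologyPlacement {
    static List<String> pickReplicas(List<String> ringOrder,     // nodes in token order
                                     Map<String, String> rackOf, // node -> rack
                                     int primaryIndex,
                                     int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        String primary = ringOrder.get(primaryIndex);
        replicas.add(primary);
        usedRacks.add(rackOf.get(primary));

        // First pass: walk clockwise, taking only nodes in unused racks.
        for (int i = 1; i < ringOrder.size() && replicas.size() < replicationFactor; i++) {
            String node = ringOrder.get((primaryIndex + i) % ringOrder.size());
            if (usedRacks.add(rackOf.get(node))) replicas.add(node);
        }
        // Second pass: no more new racks, so accept same-rack nodes.
        for (int i = 1; i < ringOrder.size() && replicas.size() < replicationFactor; i++) {
            String node = ringOrder.get((primaryIndex + i) % ringOrder.size());
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}
```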
Links
● File appends in HDFS: http://blog.cloudera.com/blog/2009/07/file-appends-in-hdfs/
● HBase file locality in HDFS: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
● HBase Coprocessors: https://blogs.apache.org/hbase/entry/coprocessor_introduction
● HBase Region Splitting: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/