Cassandra vs HBase
Similarities and differences in the architectural approaches
Cassandra Moscow, April 2013 meetup
Foundation papers
● The Google File System; Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung
● Bigtable: A Distributed Storage System for Structured Data; Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
● Dynamo: Amazon’s Highly Available Key-value Store; Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
Log-structured merge tree layout
Log-structured merge tree
● Writes are aggregated in memory and then flushed to disk in one batch (see the sketch below)
  ○ The memtable is effectively a write-behind cache
  ○ A write-ahead log (on-disk commit log) is used to protect the in-memory data from node failures
● In-memory entries are asynchronously persisted as a single segment (file) of records sorted by key
  ○ The segments are asynchronously merged together in order to keep their number around log(number of records)
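A minimal sketch of this write path, using hypothetical class names and in-memory maps in place of real on-disk SSTables, commit logs, and compaction:

```java
import java.util.*;

/** Toy LSM tree: a sorted in-memory memtable flushed to immutable
 *  sorted segments; reads check the memtable first, then segments
 *  from newest to oldest. Illustrative sketch only. */
class ToyLsmTree {
    private static final int MEMTABLE_LIMIT = 4;   // flush threshold
    private NavigableMap<String, String> memtable = new TreeMap<>();
    private final Deque<NavigableMap<String, String>> segments = new ArrayDeque<>();

    void put(String key, String value) {
        // A real store appends to the write-ahead log here before
        // touching the memtable, so a crash cannot lose the write.
        memtable.put(key, value);
        if (memtable.size() >= MEMTABLE_LIMIT) flush();
    }

    String get(String key) {
        String v = memtable.get(key);
        if (v != null) return v;
        for (NavigableMap<String, String> seg : segments) { // newest first
            v = seg.get(key);
            if (v != null) return v;
        }
        return null;
    }

    private void flush() {
        // The memtable is already key-sorted, so a segment is written
        // as one large sequential batch -- the property that suits HDFS.
        segments.addFirst(memtable);
        memtable = new TreeMap<>();
        if (segments.size() > 2) merge();           // toy compaction
    }

    private void merge() {
        // Merge all segments into one; newer entries win.
        NavigableMap<String, String> merged = new TreeMap<>();
        for (Iterator<NavigableMap<String, String>> it = segments.descendingIterator(); it.hasNext(); )
            merged.putAll(it.next());               // oldest first, newer overwrite
        segments.clear();
        segments.addFirst(merged);
    }
}
```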
Why LSM tree is good for HBase
● The LSM tree suits HDFS well
  ○ An LSM tree writes data in large sequential batches
  ○ SSTables are immutable, which matches HDFS's write-once file model
LSM tree problems
● Relatively slow reads
  ○ The requested key can be in any segment, so all of them may have to be checked
    ■ Key cache (Cassandra)
    ■ Bloom filters can be used to skip some of the files (see the sketch below)
      ● They are prone to false positives, which only cost an extra read; a negative answer is always correct
● Early versions of HDFS had no support for an append operation
  ○ Append is required for the write-ahead log
  ○ hflush in HDFS 0.21 allows flushing written data without closing the file
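A minimal Bloom filter sketch illustrating why a segment can be skipped safely (hypothetical class, simplified hashing):

```java
import java.util.BitSet;

/** Toy Bloom filter: k hash probes into a bit array.
 *  "Might contain" can be wrong (false positive), costing an
 *  unnecessary segment read; "does not contain" is always right,
 *  which is what makes skipping a segment safe. */
class ToyBloomFilter {
    private final BitSet bits;
    private final int size;
    private final int hashes;

    ToyBloomFilter(int size, int hashes) {
        this.bits = new BitSet(size);
        this.size = size;
        this.hashes = hashes;
    }

    void add(String key) {
        for (int i = 0; i < hashes; i++)
            bits.set(probe(key, i));
    }

    boolean mightContain(String key) {
        for (int i = 0; i < hashes; i++)
            if (!bits.get(probe(key, i)))
                return false;  // definitely absent: skip this segment
        return true;           // possibly present: must read the segment
    }

    private int probe(String key, int i) {
        // Derive k probe positions from two base hashes (a common trick).
        int h1 = key.hashCode();
        int h2 = Integer.rotateLeft(h1, 16) ^ 0x9e3779b9;
        return Math.floorMod(h1 + i * h2, size);
    }
}
```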
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
Shared nothing architecture
● Each node processes requests for its own shard of data
● It is always known which node is responsible for a particular key
Cassandra entry distribution
Cassandra distributed storage
● Consistent hashing (a node ring) is used to distribute a column family across the cluster nodes
● A node is responsible for storing the range of keys whose hashes are not greater than its own number (token); a lookup sketch follows
  ○ Node tokens are set explicitly in config
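A sketch of the token-ring lookup, assuming MD5 key hashing as in the default random partitioner (hypothetical class; the real ring also handles replicas and vnodes):

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Map;
import java.util.TreeMap;

/** Toy token ring: each node owns the range of key hashes up to its
 *  token; a lookup finds the first token at or above the key's hash,
 *  wrapping around the ring if necessary. */
class ToyTokenRing {
    private final TreeMap<BigInteger, String> ring = new TreeMap<>();

    void addNode(BigInteger token, String node) {
        ring.put(token, node);
    }

    String nodeFor(String key) throws Exception {
        BigInteger hash = md5(key);
        // First node whose token >= hash; wrap to the lowest token if none.
        Map.Entry<BigInteger, String> entry = ring.ceilingEntry(hash);
        return (entry != null ? entry : ring.firstEntry()).getValue();
    }

    private static BigInteger md5(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return new BigInteger(1, digest);  // non-negative 128-bit hash
    }
}
```

With virtual nodes (next slide), the same structure simply holds many tokens per physical node.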
Virtual nodes
● Virtual nodes are available in Cassandra from v1.2
● No need for manual token assignment
  ○ They make data distribute evenly across the physical nodes
  ○ It is simpler to control what proportion of the data is stored on a particular node
Cassandra partition strategies
● Random partitioner
  ○ The node is determined by the MD5 hash of the key
● Byte-ordered partitioner
  ○ The node is determined by a number constructed from the first bytes of the key
  ○ Allows range queries
  ○ Prone to uneven data distribution
Cassandra secondary indexes placement
HBase region distribution
HBase distributed storage
● Region meta table (a simplified lookup is sketched after this list)
  ○ A region is a contiguous range of keys
  ○ The root table stores the regions of the meta table itself
  ○ The master tries to distribute regions evenly across RegionServers
    ■ Regions can be moved between region servers in order to achieve a better distribution
      ● Since the actual data is in HDFS, no data is moved during the process
● Secondary attribute queries
  ○ DIY indexes: coprocessors
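Because regions are contiguous key ranges, locating a row reduces to a floor lookup over sorted region start keys; a toy stand-in for what the META table provides (hypothetical class names):

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy region locator: a sorted map from region start key to hosting
 *  server resolves any row key with a single floor lookup. */
class ToyRegionLocator {
    private final TreeMap<String, String> regionsByStartKey = new TreeMap<>();

    void addRegion(String startKey, String regionServer) {
        // The first region of a table conventionally starts at the
        // empty key, so every row key has a floor entry.
        regionsByStartKey.put(startKey, regionServer);
    }

    String serverFor(String rowKey) {
        // The hosting region is the one with the greatest start key <= rowKey.
        Map.Entry<String, String> e = regionsByStartKey.floorEntry(rowKey);
        return e != null ? e.getValue() : null;
    }
}
```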
Region splits/merges
● Initially only one region is allocated for a table
● Region sizes can become uneven
● Online region splitting
  ○ Data is not copied; the new regions' files just hold references to the data in the old region's files
● Region merging is still unstable
Agenda
● Storage: LSM trees
● Data distribution in cluster
● Fault-tolerance
HBase cluster nodes
HDFS and CAP theorem
● CP
  ○ HDFS replicates data synchronously on write
  ○ A DataNode is considered dead if it is not visible to the NameNode
    ■ Lost block replicas will be restored automatically on live nodes
  ○ A DataNode stops serving requests if the NameNode is lost
HDFS block replication in cluster
HDFS block replication
● HDFS tends to store one copy of a block on the same server as the client
  ○ if there is a DataNode on that server
● HDFS rack awareness (sketched below)
  ○ one copy on the client's server
  ○ one on the same rack
  ○ one on a different rack
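A toy sketch of the placement rule as described on this slide (client-local copy, same-rack copy, off-rack copy); the real HDFS policy is more involved and has varied across versions:

```java
import java.util.*;

/** Toy rack-aware block placement: first replica on the client's
 *  own DataNode (if any), second on the same rack, third on a
 *  different rack. Illustration only. */
class ToyBlockPlacement {
    private final Map<String, String> rackOf;  // node name -> rack name

    ToyBlockPlacement(Map<String, String> rackOf) {
        this.rackOf = rackOf;
    }

    List<String> choose(String clientNode) {
        List<String> replicas = new ArrayList<>();
        if (rackOf.containsKey(clientNode)) replicas.add(clientNode);
        String clientRack = rackOf.get(clientNode);
        for (String node : rackOf.keySet()) {   // second copy: same rack
            if (replicas.size() >= 2) break;
            if (!replicas.contains(node) && Objects.equals(rackOf.get(node), clientRack))
                replicas.add(node);
        }
        for (String node : rackOf.keySet()) {   // third copy: different rack
            if (replicas.size() >= 3) break;
            if (!Objects.equals(rackOf.get(node), clientRack))
                replicas.add(node);
        }
        return replicas;
    }
}
```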
HDFS disadvantages (if used as storage for HBase)
● HDFS requires an additional request to the NameNode in order to find a DataNode storing the required block
● In some cases data has to be transferred from a DataNode to the RegionServer over the network on reads
  ○ HBase does not take the locations of region file blocks into account when it assigns regions to RegionServers
HBase inter-cluster replication
● Master-slave inter-cluster replication (asynchronous, based on shipping write-ahead log edits)
● Requests to a region server are replicated to the slave HBase cluster
Cassandra and CAP theorem
● AP
  ○ Gossip-style failure detection
    ■ A failed node is still in the ring
      ● A new replica for its data range will be assigned only if the failed node is manually removed
  ○ Async writes
    ■ A node will replicate the write to the appropriate nodes but return to the client immediately
● Can also be "eventually" consistent
  ○ Quorum writes
    ■ Block until a certain number of writes is acknowledged (see the quorum sketch below)
    ■ But there is no distributed commit protocol
  ○ Quorum reads
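The standard quorum arithmetic behind these settings, as a small self-contained sketch: with N replicas, a write acknowledged by W nodes and a read that consults R nodes must overlap in at least one replica whenever R + W > N.

```java
/** Quorum overlap check: a read quorum and a write quorum intersect
 *  in at least one replica iff R + W > N, so such a read always sees
 *  the latest acknowledged write. */
class QuorumCheck {
    static boolean readsSeeLatestWrite(int n, int w, int r) {
        return r + w > n;
    }

    public static void main(String[] args) {
        int n = 3;              // replication factor
        int quorum = n / 2 + 1; // = 2 for n = 3
        // QUORUM reads + QUORUM writes overlap:
        System.out.println(readsSeeLatestWrite(n, quorum, quorum)); // true
        // ONE write + ONE read may miss each other:
        System.out.println(readsSeeLatestWrite(n, 1, 1));           // false
    }
}
```

Note that overlap guarantees visibility of the latest write, but, as the next slide shows, the absence of a distributed commit protocol still allows "failed" writes to survive on some replicas.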
Lack of distributed commit protocol issue
1. The client writes to all replicas
2. The write fails on one of the replicas
3. The write operation is reported as failed
4. Yet all of the replicas except one have persisted the "failed" write
Inconsistent write repair measures
● Read repairs
  ○ Differences in results are detected when reading from multiple replicas (sketched below)
● Hinted handoff
  ○ A failed write is remembered and retried later by the coordinator node
● Anti-entropy
  ○ Manually started replica reconciliation
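A minimal read-repair sketch, assuming timestamped replica responses and hypothetical types; the newest value wins and is pushed back to any replica that returned something older:

```java
import java.util.*;

/** Toy read repair: compare timestamped replica responses, return
 *  the newest value, and record a repair write for stale replicas. */
class ToyReadRepair {
    record Response(String replica, String value, long timestamp) {}

    static String readWithRepair(List<Response> responses,
                                 Map<String, String> repairLog) {
        Response newest = Collections.max(responses,
                Comparator.comparingLong(Response::timestamp));
        for (Response r : responses) {
            if (r.timestamp() < newest.timestamp()) {
                // Push the winning value back to the stale replica
                // (here just recorded in a map for illustration).
                repairLog.put(r.replica(), newest.value());
            }
        }
        return newest.value();
    }
}
```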
Cassandra simple replication
Cassandra network topology based replication
Cassandra replica placement strategies
● Simple
  ○ The closest neighbor down the ring is selected as a replica
● Network topology based (sketched below)
  ○ Additional replicas are placed by walking the ring clockwise until a node in a different rack is found
    ■ If no such node exists, additional replicas are placed on different nodes in the same rack
  ○ Server → DC:Rack mappings are set explicitly in config
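A sketch of the topology-based walk described above, using hypothetical names and a plain list of nodes in token order; distinct racks are preferred, with a same-rack fallback:

```java
import java.util.*;

/** Toy network-topology placement: starting from the primary
 *  replica's ring position, walk clockwise taking nodes in racks
 *  not yet used; if no new rack can be found, fall back to any
 *  remaining distinct node. */
class ToyTopologyPlacement {
    static List<String> pickReplicas(List<String> ringOrder,     // nodes in token order
                                     Map<String, String> rackOf, // node -> rack
                                     int primaryIndex,
                                     int replicationFactor) {
        List<String> replicas = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        String primary = ringOrder.get(primaryIndex);
        replicas.add(primary);
        usedRacks.add(rackOf.get(primary));

        // First pass: walk clockwise, taking only nodes in unused racks.
        for (int i = 1; i < ringOrder.size() && replicas.size() < replicationFactor; i++) {
            String node = ringOrder.get((primaryIndex + i) % ringOrder.size());
            if (usedRacks.add(rackOf.get(node))) replicas.add(node);
        }
        // Second pass: no more new racks, so accept same-rack nodes.
        for (int i = 1; i < ringOrder.size() && replicas.size() < replicationFactor; i++) {
            String node = ringOrder.get((primaryIndex + i) % ringOrder.size());
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}
```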
Links
● File appends in HDFS: http://blog.cloudera.com/blog/2009/07/file-appends-in-hdfs/
● HBase file locality in HDFS: http://www.larsgeorge.com/2010/05/hbase-file-locality-in-hdfs.html
● HBase Coprocessors: https://blogs.apache.org/hbase/entry/coprocessor_introduction
● HBase Region Splitting: http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/