View
1.496
Download
2
Category
Preview:
Citation preview
©2014 DataStax
@AlTobey Open Source Mechanic | Datastax Apache Cassandra のオープンソースエバンジェリスト
Managing Cassandra : SCALE 12x
!1
Obsessed with infrastructure my whole life.See distributed systems everywhere.
Five years of Cassandra
0 1 2 3 4 5
0.1 0.3 0.6 0.7 1.0 1.2...
2.0
DSE
Jul-08
Why Cassandra?
Or any non-relational?
アルトビー
Leo Rufus
My boysI did not have permission from my wife to use her image ;)
!Less Tolerant
For the record. Can you spell HTTP?
More and more services on line.!And they must be available
!日本で億のインターネットユーザー理由Cassandraを選ぶのか
100,000,000 internet users in Japanacross the world … B2B, video conferencing, shopping, entertainment
!Traditional solutions…
!…may not be a good fit
Client-server
All your wildest !dreams will come !
true.
Client - server database architecture.Obsolete.The original “funnel-shaped” architecture
3-tier
Client - client - server database architecture.Still suitable for small applications.e.g. LAMP, RoR
3-tier + caching
master slaveslave
cache
more complexcache coherency is a hard problemcascading failures are commonNext: Out: the funnel. In: The ring.
Webscale
outer ring: clients (cell phones, etc.)middle ring: application serversinside ring: Cassandra servers!Serving millions of clients with mere hundreds or thousands of nodes requires a different approach to applications!
When it Rains
scale out … but at a cost
A new plan
Dynamo Paper(2007)•How do we build a data store that is: • Reliable
• Performant
• “Always On”
•Nothing new and shiny
!!Evolutionary. Real. Computer Science
Also the basis for Riak and Voldemort
BigTable(2006)
• Richer data model
• 1 key. Lots of values
• Fast sequential access
• 38 Papers cited
Cassandra(2008)
• Distributed features of Dynamo
• Data Model and storage from BigTable
• February 17, 2010 it graduated to a top-level Apache project
Cassandra - More than one server
• All nodes participate in a cluster
• Shared nothing
• Add or remove as needed
•More capacity? Add a server
!18
!19
Cassandra HBase Redis MySQL
THRO
UG
HPU
T O
PS/S
EC)
VLDB benchmark (RWS)
Cassandra - Fully Replicated
• Client writes local
• Data syncs across WAN
• Replication per Data Center
!20
Read-Modify-WriteUPDATE Employees SET Rank=4, Promoted=2014-‐01-‐24 WHERE EmployeeID=1337;
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********3Promoted****null
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********4Promoted****2014501524
This might be what it looks like from SQL / CQL, but …!
Read-Modify-WriteUPDATE Employees SET Rank=4, Promoted=2014-‐01-‐24 WHERE EmployeeID=1337;
TNSTAAFL 無償の昼食なんてものはありません
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********4Promoted****2014501524
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********3Promoted****null
RDBMS
TNSTAAFL …If you’re lucky, the cell is in cache.Otherwise, it’s a disk access to read, another to write.
Eventual ConsistencyUPDATE Employees SET Rank=4, Promoted=2014-‐01-‐24 WHERE EmployeeID=1337;
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********4Promoted****2014501524
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********3Promoted****null
Coordinator
Explain distributed RMWMore complicated.Will talk about how it’s abstracted in CQL later.
Eventual ConsistencyUPDATE Employees SET Rank=4, Promoted=2014-‐01-‐24 WHERE EmployeeID=1337;
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********4Promoted****2014501524
EmployeeID**1337Name********アルトビーStartDate***2013510501Rank********3Promoted****null
Coordinator
read
write
Memory replication on write, depending on RF, usually RF=3.Reads AND writes remain available through partitions.Hinted handoff.
©2014 DataStax
memTable
Avoiding read-modify-write
!25#CASSANDRA
Albert 6 Wednesday 0
Evan Tuesday 0 Wednesday 0
Frank Tuesday 3 Wednesday 3
Kelvin Tuesday 0 Wednesday 0
cassandra13_drinks column family
Krzysztof Tuesday 0 Wednesday 0
Phillip Tuesday 12 Wednesday 0
Tuesday
⁍ CF to track how much I expect my team at Ooyala to drink ⁍ Row keys are names ⁍ Column keys are days ⁍ Values are a count of drinks
©2014 DataStax
memTable
Avoiding read-modify-write
!26#CASSANDRA
Al Tuesday 2 Wednesday 0
Phillip Tuesday 0 Wednesday 1
cassandra13_drinks column family
ssTable
Albert 6 Wednesday 0
Evan Tuesday 0 Wednesday 0
Frank Tuesday 3 Wednesday 3
Kelvin Tuesday 0 Wednesday 0
Krzysztof Tuesday 0 Wednesday 0
Phillip Tuesday 12 Wednesday 0
Tuesday
⁍ Next day, after after a flush ⁍ I’m speaking so I decided to drink less ⁍ Phillip informs me that he has quit drinking
©2014 DataStax
memTable
Avoiding read-modify-write
!27#CASSANDRA
Albert Tuesday 22 Wednesday 0
cassandra13_drinks column family
ssTableAlbert Tuesday 2 Wednesday 0
Phillip Tuesday 0 Wednesday 1
ssTable
Albert 6 Wednesday 0
Evan Tuesday 0 Wednesday 0
Frank Tuesday 3 Wednesday 3
Kelvin Tuesday 0 Wednesday 0
Krzysztof Tuesday 0 Wednesday 0
Phillip Tuesday 12 Wednesday 0
Tuesday
⁍ I’m drinking with all you people so I decide to add 20 ⁍ read 2, add 20, write 22
©2014 DataStax
Avoiding read-modify-write
!28#CASSANDRA
cassandra13_drinks column family
ssTable
Albert Tuesday 22 Wednesday 0
Evan Tuesday 0 Wednesday 0
Frank Tuesday 3 Wednesday 3
Kelvin Tuesday 0 Wednesday 0
Krzysztof Tuesday 0 Wednesday 0
Phillip Tuesday 0 Wednesday 1
⁍ After compaction & conflict resolution ⁍ Overwriting the same value is just fine! Works really well for some patterns such as time-series data ⁍ Separate read/write streams handy for debugging, but not a big deal
OverwritingCREATE TABLE host_lookup ( name varchar, id uuid, PRIMARY KEY(name) ); !INSERT INTO host_uuid (name,id) VALUES (“www.tobert.org”, “463b03ec-fcc1-4428-bac8-80ccee1c2f77”); !INSERT INTO host_uuid (name,id) VALUES (“tobert.org”, “463b03ec-fcc1-4428-bac8-80ccee1c2f77”); !INSERT INTO host_uuid (name,id) VALUES (“www.tobert.org”, “463b03ec-fcc1-4428-bac8-80ccee1c2f77”); !SELECT id FROM host_lookup WHERE name=“tobert.org”;
Beware of expensive compactionBest for: small indexes, lookup tablesCompaction handles RMW at storage level in the background.Under heavy writes, clock synchronization is very important to avoid timestamp collisions. In practice, this isn’t a problem very often and even when it goes wrong, not much harm done.
Key/ValueCREATE TABLE keyval ( key VARCHAR, value blob, PRIMARY KEY(key) ); !INSERT INTO keyval (key,value) VALUES (?, ?); !SELECT value FROM keyval WHERE key=?;
e.g. memcachedDon’t do this.But it works when you really need it.
Journaling / Logging / Time-seriesCREATE TABLE tsdb ( time_bucket timestamp, time timestamp, value blob, PRIMARY KEY(time_bucket, time) ); !INSERT INTO tsdb (time_bucket, time, value) VALUES ( “2014-10-24”, -- 1-day bucket (UTC) “2014-10-24T12:12:12Z”, -- ALWAYS USE UTC ‘{“foo”: “bar”}’ );
Oversimplified, use normalization over blobs whenever possible.ALWAYS USE UTC :)
Journaling / Logging / Time-series
{"“2014(01(24”"=>"{""""“2014(01(24T12:12:12Z”"=>"{""""""""‘{“foo”:"“bar”}’""""}}
2014(01(24 2014(01(24T12:12:12Z{“key”:"“value”}
2014(01(25 2014(01(25T13:13:13Z{“key”:"“value”}
2014(01(24T21:21:21Z{“key”:"“value”}
Oversimplified, use normalization over blobs whenever possible.ALWAYS USE UTC :)
Cassandra CollectionsCREATE TABLE posts ( id uuid, body varchar, created timestamp, authors set<varchar>, tags set<varchar>, PRIMARY KEY(id) ); !INSERT INTO posts (id,body,created,authors,tags) VALUES ( ea4aba7d-9344-4d08-8ca5-873aa1214068, ‘アルトビーの犬はばかね’, ‘now', [‘アルトビー’, ’ィオートビー’], [‘dog’, ‘silly’, ’犬’, ‘ばか’] );
quick story about 犬ばかねsets & maps are CRDTs, safe to modify
Cassandra CollectionsCREATE TABLE metrics ( bucket timestamp, time timestamp, value blob, labels map<varchar,varchar>, PRIMARY KEY(bucket) );
sets & maps are CRDTs, safe to modify
Lightweight Transactions• Cassandra 2.0 and on support LWT based on PAXOS
• PAXOS is a distributed consensus protocol
• Given a constraint, Cassandra ensures correct ordering
Lightweight TransactionsUPDATE users SET username=‘tobert’ WHERE id=68021e8a-‐9eb0-‐436c-‐8cdd-‐aac629788383 IF username=‘renice’; !INSERT INTO users (id, username) VALUES (68021e8a-‐9eb0-‐436c-‐8cdd-‐aac629788383, ‘renice’) IF NOT EXISTS; !!
Client error on conflict.
Brendan Gregg’s Tools Chart
dstat -lrvn 10
iostat -x 1
htop
Datastax Opscenter
©2014 DataStax
Config Changes: Apache 1.0 ➜ DSE 3.0
!43
⁍ Schema: compaction_strategy = LCS⁍ Schema: bloom_filter_fp_chance = 0.1⁍ Schema: sstable_size_in_mb = 256⁍ Schema: compression_options = Snappy⁍ YAML: compaction_throughput_mb_per_sec: 0
#CASSANDRA13
⁍ LCS is a huge improvement in operations life (no more major compactions) ⁍ Bloom filters were tipping over a 24GiB heap ⁍ With lots of data per node, sstable sizes in LCS must be MUCH bigger ⁍ > 100,000 open files slows everything down, especially startup ⁍ 256mb v.s. 5mb is 50x reduction in file count ⁍ default has been fixed as of C* 2.0 ⁍ Compaction can’t keep up: even huge rates don’t work, must be disabled ⁍ try to adjust heap, etc. so you’re flushing at nearly full memtables to reduce compaction needs ⁍ backreference RMW? ⁍ might be fixed in >= 1.2
©2014 DataStax
!44
nodetool ring
10.10.10.10 Analytics rack1 Up Normal 47.73 MB 1.72% 101204669472175663702469172037896580098
10.10.10.10 Analytics rack1 Up Normal 63.94 MB 0.86% 102671403812352122596707855690619718940
10.10.10.10 Analytics rack1 Up Normal 85.73 MB 0.86% 104138138152528581490946539343342857782
10.10.10.10 Analytics rack1 Up Normal 47.87 MB 0.86% 105604872492705040385185222996065996624
10.10.10.10 Analytics rack1 Up Normal 39.73 MB 0.86% 107071606832881499279423906648789135466
10.10.10.10 Analytics rack1 Up Normal 40.74 MB 1.75% 110042394566257506011458285920000334950
10.10.10.10 Analytics rack1 Up Normal 40.08 MB 2.20% 113781420866907675791616368030579466301
10.10.10.10 Analytics rack1 Up Normal 56.19 MB 3.45% 119650151395618797017962053073524524487
10.10.10.10 Analytics rack1 Up Normal 214.88 MB 11.62% 139424886777089715561324792149872061049
10.10.10.10 Analytics rack1 Up Normal 214.29 MB 2.45% 143588210871399618110700028431440799305
10.10.10.10 Analytics rack1 Up Normal 158.49 MB 1.76% 146577368624928021690175250344904436129
10.10.10.10 Analytics rack1 Up Normal 40.3 MB 0.92% 148140168357822348318107048925037023042
⁍ hotspots
©2014 DataStax
!45#CASSANDRA
nodetool cfstatsKeyspace: gostress
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Column Family: stressful
SSTable count: 1
Space used (live): 32981239
Space used (total): 32981239
Number of Keys (estimate): 128
Memtable Columns Count: 0
Memtable Data Size: 0
Memtable Switch Count: 0
Read Count: 0
Read Latency: NaN ms.
Write Count: 0
Write Latency: NaN ms.
Pending Tasks: 0
Bloom Filter False Positives: 0
Bloom Filter False Ratio: 0.00000
Bloom Filter Space Used: 336
Compacted row minimum size: 7007507
Compacted row maximum size: 8409007
Compacted row mean size: 8409007
Could be using a lot of heap
Controllable by sstable_size_in_mb
⁍ bloom filters ⁍ sstable_size_in_mb
©2014 DataStax
!46#CASSANDRA
nodetool proxyhistogramsOffset Read Latency Write Latency Range Latency
35 0 20 0
42 0 61 0
50 0 82 0
60 0 440 0
72 0 3416 0
86 0 17910 0
103 0 48675 0
124 1 97423 0
149 0 153109 0
179 2 186205 0
215 5 139022 0
258 134 44058 0
310 2656 60660 0
372 34698 742684 0
446 469515 7359351 0
535 3920391 31030588 0
642 9852708 33070248 0
770 4487796 9719615 0
924 651959 984889 0
⁍ units are microseconds ⁍ can give you a good idea of how much latency coordinator hops are costing you
©2014 DataStax
!47#CASSANDRA
nodetool compactionstats
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 9819749801 16922291634 58.03%
Compaction hastur counter_archive 12141850720 16147440484 75.19%
Compaction hastur mark_archive 647389841 1475432590 43.88%
Active compaction remaining time : n/a
al@node ~ $ nodetool compactionstats
pending tasks: 3
compaction type keyspace column family bytes compacted bytes total progress
Compaction hastur gauge_archive 10239806890 16922291634 60.51%
Compaction hastur counter_archive 12544404397 16147440484 77.69%
Compaction hastur mark_archive 1107897093 1475432590 75.09%
Active compaction remaining time : n/a
⁍
©2014 DataStax
!48#CASSANDRA
⁍ cassandra-stress⁍ YCSB⁍ Production⁍ Terasort (DSE)⁍ Homegrown
Stress Testing Tools
⁍ we mostly focus on cassandra-stress for burn-in of new clusters ⁍ can quickly figure out the right setting for -Xmn ⁍ Terasort is interesting for comparing DSE to Cloudera/Hortonworks/etc. (it’s fast!) ⁍ Consider writing custom benchmarks for your application patterns ⁍ sometimes it’s faster to write one than figure out how to make a generic tool do what you want
©2014 DataStax
!49#CASSANDRA
kernel.pid_max = 999999 fs.file-max = 1048576 vm.max_map_count = 1048576 net.core.rmem_max = 16777216 net.core.wmem_max = 16777216 net.ipv4.tcp_rmem = 4096 65536 16777216 net.ipv4.tcp_wmem = 4096 65536 16777216 vm.swappiness = 1
/etc/sysctl.conf
⁍ pid_max doesn’t fix anything, I just like it and have never had a problem with it ⁍ These are my starting point settings for nearly every system/application. ⁍ Generally safe for production. ⁍ vm.dirty*ratio can go big for fake fast writes, generally safe for Cassandra, but beware you’re more likely to see FS/file corruption on power loss ⁍ but you will get latency spikes if you hit dirty_ratio (percentage of RAM), so don’t tune too low
©2014 DataStax
!50#CASSANDRA
ra=$((2**14))# 16k ss=$(blockdev --getss /dev/sda) blockdev --setra $(($ra / $ss)) /dev/sda !echo 128 > /sys/block/sda/queue/nr_requests echo deadline > /sys/block/sda/queue/scheduler echo 16384 > /sys/block/md7/md/stripe_cache_size
/etc/rc.local
⁍ Lower readahead is better for latency on seeky workloads ⁍ More readahead will artificially increase your IOPS by reading a bunch of stuff you might not need! ⁍ nr_requests = number of IO structs the kernel will keep in flight, don’t go crazy ⁍ Deadline is best for raw throughput ⁍ CFQ supports cgroup priorities and is occasionally better for latency on SATA drives ⁍ Default stripe cache is 128. The increase seems to help MD RAID5 a lot. ⁍ Don’t forget to set readahead separately for MD RAID devices
Ending discussion notes• 2 socket, ECC memory • 16GiB minimum, prefer 32-64GiB, over 128GiB and Linux will need serious tuning
• SSD where possible, Samsung 840 Pro is a good choice, any Intel is fine
•NO SAN/NAS, 20ms latency tops • if you MUST (and please, don’t) dedicate spindles to C* nodes, use separate network
• Avoid disk configurations targeted at Hadoop, disks are too slow
• http://www.datastax.com/documentation/cassandra/2.0/pdf/cassandra20.pdf • read the sections on Repair, Tombstones & Snapshots
Recommended