Horizontally Scalable Storage Solutions
When scaling up a data store, there are three main parameters which need to grow together:
- High-throughput, low-latency read operations.
- High-throughput write operations. Higher latency can be accepted for writes than for reads, since writes are less common and users have different expectations of them, but reasonable latency is still important.
- High availability: little to no downtime in the event of a system failure and, most importantly, no loss of data. Availability must be high whether the database needs to scale or not, but keeping the same service level becomes more complex as the system grows.
Review of different storage solutions
Relational database storage systems
Since the 1970s, relational storage has dominated the data storage scene; most of the world still depends on MySQL to store information for wikis, content management systems, and even largely static websites.
Relational database advantages
- Easy to set up a small installation; anyone can type "apt-get install mysql".
- Fast and flexible query language: selections with greater-than, less-than, and equality comparisons, text search using "like", group by, order by, compound queries, and powerful joins and unions are all available.
- The standard has a lot of momentum; many people understand how to use plain old SQL where they would be baffled by newer concepts in data storage, and there are plentiful libraries written for interoperating with SQL-based databases.
- Similar to the last point, users have a lot more patience with relational database quirks and problems than they would with a newer, less understood system; relational databases have succeeded in lowering user expectations.
Relational database disadvantages
- Performance tuning is complex; there are professional consultants whose only job is to performance-tune databases.
- Rigid database structure: it is more difficult to add and remove columns and tables than with other offerings, and table structure design requires some thought about performance. Not the best choice if you are looking for a "big HashMap".
How relational databases scale with increased read traffic
When a relational database cannot keep up with the read load, the master/slave replication pattern is implemented. While this pattern improves write performance only a limited amount, by allowing the master to focus on writing alone, it improves read performance as a function of the number of slave nodes.
http://dev.mysql.com/doc/refman/5.0/en/replication.html
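As an illustration of how an application might consume such a topology (this is not part of MySQL itself; the hostnames and the round-robin policy are assumptions), a minimal Java sketch of read/write splitting:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical read/write splitter: writes always go to the master,
// reads are round-robined over the slaves. Hostnames are invented.
public class ReplicatedDataSource {
    private final String masterUrl = "jdbc:mysql://db-master/wiki";
    private final List<String> slaveUrls = Arrays.asList(
        "jdbc:mysql://db-slave1/wiki", "jdbc:mysql://db-slave2/wiki");
    private final AtomicInteger next = new AtomicInteger();

    public Connection forWrite() throws SQLException {
        return DriverManager.getConnection(masterUrl);
    }

    public Connection forRead() throws SQLException {
        // Each added slave linearly increases aggregate read capacity.
        int i = Math.abs(next.getAndIncrement()) % slaveUrls.size();
        return DriverManager.getConnection(slaveUrls.get(i));
    }
}

Note that replication lag means a read from a slave may briefly return stale data; applications that read their own writes immediately must route those reads to the master.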
How relational databases scale with increased write traffic
Relational databases can expand by using a practice called "sharding": partitioning the tables over multiple machines so that a given write operation need only be sent to one node in a group. This does, however, increase the cost of many types of read operations. For example, if the person table is partitioned by last name and you want to select all rows where the value in the age field is between 18 and 20, that query must necessarily be executed by every partition of the table. This situation can be mitigated by combining sharding with master/slave replication.
http://blog.solutionset.com/wpmu/2009/04/08/scaling-mysql-in-the-web-environment/
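A minimal sketch of the routing problem described above; the shard names and boundaries are invented for illustration:

import java.util.Arrays;
import java.util.List;

// Hypothetical router for a person table partitioned by last name.
public class ShardRouter {
    private final List<String> shards =
        Arrays.asList("shard-a-f", "shard-g-m", "shard-n-z");

    // A write keyed on last name touches exactly one shard.
    public String shardFor(String lastName) {
        char c = Character.toUpperCase(lastName.charAt(0));
        if (c <= 'F') return shards.get(0);
        if (c <= 'M') return shards.get(1);
        return shards.get(2);
    }

    // A query on a non-key field (e.g. age between 18 and 20) cannot
    // be routed: it must be scattered to every shard and the results
    // gathered, so its cost grows with the number of shards.
    public List<String> shardsForAgeQuery() {
        return shards;
    }
}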
How relational databases prevent downtime in the event of a failure
Since they were originally designed to run on a single machine, relational databases have no intrinsic protection against node failures. They can be set up in "multi-master" configurations which are able to recover quickly from a master node failing, but they must choose two of: strong consistency, low write latency, and failover without downtime.
http://mysqlha.blogspot.com/2010/04/consistency-across-wan.html
Discussion points
- Users want the option of the powerful, fast queries that SQL provides.
- It would be nice to be able to bypass the inflexibility and manual optimization associated with creating relational tables.
- I see no compelling evidence that it is impossible to make relational databases scale using a combination of master/slave, sharding, and multi-master setups.
- Relational database performance is massively complex to begin with. Combining master/slave and sharding might be manageable in a large enterprise with database engineers perpetually watching over it, but building a turn-key solution looks like an unreasonably complicated task.
- If cloud providers can offer "pretend MySQL" and it scales, then it will likely become the standard since it offers familiarity to the user. Amazon seems to think they can achieve this: http://aws.amazon.com/running_databases/
Further resources
http://scale-out-blog.blogspot.com/2011/03/parallel-replication-using-shards-is.html
http://www.mysqlperformanceblog.com/2009/08/06/why-you-dont-want-to-shard/
http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php
Cassandra
Cassandra is a storage engine which shares properties of Amazon's Dynamo and Google's BigTable. Cassandra is "eventually consistent", a term that creates confusion; in this case "eventual" means "within a few milliseconds", not "in a few minutes".
http://wiki.apache.org/cassandra/Operations#Consistency
Advantages of Cassandra
- Sharding is handled transparently. All you need to do is set up a confederation of Cassandra nodes and they will manage sharding of the database on their own.
- Everyone eventually gets a copy of everything; even in the worst disaster, where only one node is left functional, that node will still be able to serve requests.
- No static schema: if you add a row to a table and find that the row really could use a couple more columns, just add them.
- By adding "column families" you can create views of the data which stand in for queries and, since they are precomputed, will be very fast. Work is shifted from the read cycle to the write cycle, so read latency is better than with relational systems (see the sketch after this list). http://www.datastax.com/docs/0.7/data_model/index
- There is no single point of failure because every node has equal roles and responsibilities. This also contributes to easier maintenance, i.e. no need for multi-master failover schemes.
- Written in Java (advantage for XWiki integration)
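The "column families as precomputed views" idea above can be reduced to a toy in-memory model: every write also updates a structure keyed the way reads will ask for it, so the read side never scans. The class and field names below are invented for illustration; real column families are persistent, not HashMaps:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy model: a base "column family" plus a second one that acts as a
// precomputed view, maintained at write time instead of query time.
public class UsersByTown {
    private final Map<String, String> userToTown = new HashMap<String, String>();
    private final Map<String, Set<String>> townToUsers = new HashMap<String, Set<String>>();

    // The work happens on the write cycle: both structures are updated.
    public void put(String user, String town) {
        userToTown.put(user, town);
        Set<String> users = townToUsers.get(town);
        if (users == null) {
            users = new HashSet<String>();
            townToUsers.put(town, users);
        }
        users.add(user);
    }

    // The read cycle is a single lookup instead of a query.
    public Set<String> usersIn(String town) {
        Set<String> users = townToUsers.get(town);
        return users == null ? new HashSet<String>() : users;
    }
}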
Disadvantages of Cassandra
- It is not SQL. The DataNucleus Cassandra plugin provides some basic querying such as equality, greater-than, and less-than, but beyond that all queries are handled in memory, which can be anything from a minor inconvenience to a performance nightmare.
- It has some issues with very small installations: "It scales up nicely, but doesn't scale down nearly as well. At only three nodes we weren't able to take advantage of most of Cassandra's safeguards for ending up in this situation" http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html
- When it becomes overloaded, bringing up additional nodes causes even more load while the new nodes integrate.
- Large objects (larger than available memory space) cannot be stored although chunking is a possibility. https://issues.apache.org/jira/browse/CASSANDRA-265
- Backing up data is hard. See the discussion points below.
NOTE: Some of these problems are minor bugs which can and will be worked out as time progresses.
How Cassandra scales with increased read traffic
Cassandra internally implements replication, except that the node coordinating a write acts as the master: it posts the data to each of the N nodes whose node ids are nearest to the given record's key (N is the replication factor, i.e. the number of nodes required to keep copies of any given record). On read, the reader requests the record from R of the nodes responsible for the key, where R is the number of replicas which must come back the same before the reader is satisfied. For guaranteed consistency, the user can choose ConsistencyLevel.ALL, which does not return until each of the N nodes responsible for holding that record has responded. Since that level of pedantry is rarely needed, ConsistencyLevel.ONE is commonly used, wherein the first response is accepted. Cassandra still polls all responsible nodes in the background to make sure everyone has the latest version of the data (in case there was a network glitch while the update was taking place) and, if anyone has an out-of-date version, updates them. There are other ConsistencyLevels between ALL and ONE which provide different levels of assuredness; see: http://wiki.apache.org/cassandra/API#ConsistencyLevel
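The interplay of N, R, and W (the number of replicas that must acknowledge a write) reduces to the classic Dynamo quorum rule: a read is guaranteed to observe the latest write when R + W > N. A minimal sketch of that arithmetic; the class and method names are mine, not Cassandra's:

// Dynamo-style quorum arithmetic: N replicas hold each record, a write
// waits for W acknowledgements, a read waits for R answers. If
// R + W > N, every read set overlaps every write set, so at least one
// replica in the read set has the newest version.
public class QuorumCheck {
    public static boolean readsSeeLatestWrite(int n, int r, int w) {
        return r + w > n;
    }

    public static void main(String[] args) {
        System.out.println(readsSeeLatestWrite(3, 1, 1)); // false: fast, but stale reads possible
        System.out.println(readsSeeLatestWrite(3, 2, 2)); // true: quorum reads and writes
        System.out.println(readsSeeLatestWrite(3, 1, 3)); // true: ALL on writes, ONE on reads
    }
}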
When read load increases, the administrator can add more nodes, which will bootstrap into the network and take over half of the entries handled by the node with the most entries. Read queries will then be able to hit the new node instead of the old one.
How Cassandra scales with increased write traffic
Cassandra processes writes in a way similar to Postgres. It immediately appends each write to a log and fsyncs that to disk so that a power failure will not cause data loss; it then holds its writes in memory for a configurable period of time before writing them to disk. As with Postgres, it is recommended to put the write log on one physical disk and the data store on another, because the write log disk can keep its heads aligned with the end of the log while the data store disk is seeking information for reads.
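A heavily simplified sketch of that write path, assuming a single log file and an unbounded in-memory table; real Cassandra batches fsyncs and periodically flushes the in-memory table to sorted files on the data disk:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Commit-log-plus-memtable write path, reduced to its two essential
// steps. The log disk only ever appends, so it stays sequential while
// the data disk is free to seek for reads.
public class WriteAheadStore {
    private final FileOutputStream log;
    private final Map<String, String> memtable = new ConcurrentHashMap<String, String>();

    public WriteAheadStore(File logFile) throws IOException {
        this.log = new FileOutputStream(logFile, true); // append only
    }

    public void put(String key, String value) throws IOException {
        // 1. Durability first: append and force to disk, so a power
        //    failure after this point cannot lose the write.
        log.write((key + "=" + value + "\n").getBytes("UTF-8"));
        log.getFD().sync();
        // 2. Then serve it from memory until the periodic flush.
        memtable.put(key, value);
    }

    public String get(String key) {
        return memtable.get(key); // a real store would also check disk
    }
}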
When scaling is necessary, it scales the same way as it does for reading: the administrator simply adds more nodes, they bootstrap into the network, and they take over a group of keys.
How Cassandra prevents downtime in the event of a failure
If a node fails, the administrator may simply bring up another node and bootstrap it. Even if a node fails only temporarily, the recommended practice is to wipe its storage space and re-bootstrap it. This is to prevent its stale data from being used to respond to get requests. TODO: Understand why the new node is required to have a different IP address.
http://wiki.apache.org/cassandra/Operations#Handling_failure
Discussion points
- Cassandra's "eventual consistency" usually means within a few milliseconds, which is unlikely to be noticed by most applications. If there is a network outage or a downed node, then Cassandra will (unless instructed otherwise) write to whichever nodes it can and call it done. Compare this with IRC: if you say something during a net-split, only your side of the network will ever hear it, and if modes are changed (e.g. someone is given channel operator status) the network will become desynchronized and may never recover. In contrast, a split which makes two Cassandra nodes unable to connect has minimal impact, since the gossip protocol will route the information another way. If, however, a node loses full connectivity, then it will not get the updates until it rejoins and will have old versions of data. If the requester asks for ConsistencyLevel.ONE and the formerly downed node responds first, then the requester may get old data (though the data will be repaired in the background after the read operation). http://dfeatherston.com/cassandra-adf-uiuc-su10.pdf
- Cassandra supports rolling restarts, so anything requiring a restart of the network (e.g. configuration changes) does not necessitate any downtime.
- Backing up Cassandra data is not nearly as easy as one might imagine. There is currently no way to add a "write only node" which hears all write requests but no read requests, thus building a full clone of the data store from a machine on the end of a comparatively slow internet connection. Even backing up a single node using the snapshot command is insufficient, since any given node will not have a view of all data unless ConsistencyLevel.ALL is chosen for all writes. Periodically backing up each node is sufficient though highly wasteful of resources, as is periodically polling for each entry in the table to see if it has changed. http://wiki.apache.org/cassandra/Operations#Backing_up_data
Further resources
- From the discussion which led up to reddit's adoption of Cassandra to replace memcachedb as their persistent cache
http://www.reddit.com/r/programming/comments/b81v1/i_was_hoping_we_could_get_a_good_technical/c0lg3yn
- An explanation of Cassandra's method of choosing which nodes shall be responsible for a given piece of data
http://weblogs.java.net/blog/2007/11/27/consistent-hashing
- A very good explanation of "strong consistency", "weak consistency" and "eventual consistency"
http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
- The plugin for DataNucleus Cassandra interoperability.
https://github.com/tnine/Datanucleus-Cassandra-Plugin
HBase
HBase is an open version of Google's BigTable engine. Unfortunately, it is a heterogeneous system with single points of failure, reminiscent of the sharding and master/slave systems for relational databases. It has benefits which are seen by very large installations such as Bing or Facebook, but it will be more difficult to administer and has a smaller support community than Cassandra. As such, I don't plan to research this option too extensively.
- A comparison of the BigTable model used by Google App Engine and Apache HBase with the Cassandra model.
http://ria101.wordpress.com/2010/02/24/hbase-vs-cassandra-why-we-moved/
- The Google BigTable paper.
http://labs.google.com/papers/bigtable-osdi06.pdf
Google App Engine (BigTable)
BigTable is essentially the same system as HBase, except that all administrative detail is handled by the Google people. This solution has two strikes against it:
1. It locks the user into Google App Engine. It is similar to HBase, but not close enough for DataNucleus to use a single plugin for both; the DataNucleus people complain that App Engine provides some, but not all, JPA/JDO support, and working around that limitation is difficult.
2. Google seems to have lost interest in maintaining the App Engine, as their third-party plugin for DataNucleus has not been updated to be compatible with recent versions of DataNucleus; if we are to use DataNucleus with App Engine, we will have to contribute to their plugin ourselves.
I think that, given the lax attitude from the Google people, we should hold off on implementing anything on the App Engine and stick with a more cloud-oriented provider such as Amazon or Rackspace.
http://datanucleus.blogspot.com/2010/01/gaej-and-jdojpa.html
MongoDB
Mongo can be thought of as "relational done right". It has many of the queries which everyone knows from relational databases, yet it lacks any hard schema: if you need to add a new property to a given entry, you can do that. It is known as a document database, although Mongo "documents" are more analogous to Java objects than to actual documents. They are internally expressed as a binary form of JSON, and Mongo can query deep inside of a JSON structure. Although it implements its own query language, its queries are written in JSON and it supports many of the same functions familiar to users of relational databases. As it is written in C++, there is little chance of it being bundled with XWiki; this review is here because Mongo appears to have a good shot at replacing the ubiquitous MySQL.
Advantages of Mongo
- Replica sets (master/slave) and sharding are designed in. All one has to do is flip a switch and it's there. This is far from the situation with MySQL, where replication is handled by add-ons and sharding requires manual intervention. http://www.mongodb.org/display/DOCS/Sharding+Introduction
- Many of the functions familiar to SQL users are available (albeit by different names); see http://www.datanucleus.org/products/accessplatform_3_0/datastore_features.html for a list of different functions and the stores which support them.
- De-normalization of data allows for higher performance. For example, rather than using an address table and joining it with a student table each time you want a student and his address, with Mongo you would embed the address in the student document like a blob, but you could still query within that blob to find all students who live in a given town (see the sketch after this list).
- Reasonably easy to set up a small installation: just download, untar, and start mongod. Similar to MySQL.
- Provides atomic functions such as upsert (insert or update) and incrementing a number in place.
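A sketch of the embedded-address query and the atomic operations mentioned above, using the classic Mongo Java driver (2.x-era API); the database, collection, and field names are invented:

import com.mongodb.BasicDBObject;
import com.mongodb.DBCollection;
import com.mongodb.DBCursor;
import com.mongodb.Mongo;

// Query inside an embedded document and use atomic update operators.
public class MongoSketch {
    public static void main(String[] args) throws Exception {
        DBCollection students = new Mongo("localhost", 27017)
            .getDB("school").getCollection("students");

        // The address lives inside the student document, yet we can
        // still query on one of its fields with dot notation.
        DBCursor inTown = students.find(
            new BasicDBObject("address.town", "Springfield"));
        while (inTown.hasNext()) {
            System.out.println(inTown.next());
        }

        // Atomic upsert-and-increment: create the document if missing,
        // otherwise bump the counter, in one server-side operation.
        students.update(
            new BasicDBObject("name", "Alice"),
            new BasicDBObject("$inc", new BasicDBObject("logins", 1)),
            true /* upsert */, false /* multi */);
    }
}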
Disadvantages of Mongo
- Lack of consistency correction.
- Nodes do not all have equal roles, as they do with Cassandra.
- Mongo memory-maps all of its db files, so it is practically useless on a 32-bit system due to address space constraints.
How Mongo scales with increased read traffic
http://www.mongodb.org/display/DOCS/Replica+Set+Tutorial
How Mongo scales with increased write traffic
http://www.mongodb.org/display/DOCS/Sharding+Introduction
How Mongo prevents downtime in the event of a failure
Because of Mongo's lack of consistency correction, running replicas in different data centers is all but impossible: any temporary network outage will lead to de-synchronization, and Mongo provides no way to re-synchronize as Cassandra does.
General Discussion
Pass the buck as far as you can
The higher the level at which a problem can be solved, the better the system will perform.
Given:
1. Higher levels can make assumptions about themselves that lower levels cannot make about them.
2. Being able to make assumptions about how the system will be used allows it to be optimized to run faster.
Thus the higher layers can solve problems which the lower layers cannot.
A perfect example of this trade-off is emulation vs paravirtualization.
An operating system has certain requirements, such as being able to access reserved memory locations. It is not that an operating system cannot function without these being satisfied, but since every hardware computer is the same in these regards, an operating system running as a guest obviously cannot use the same memory locations because they are already occupied.
The school of emulation says: we must provide exactly the same interface as a real computer, no matter the cost. In this case the cost is translating the memory locations every time the guest operating system or any guest program does a memory lookup or store, which is prohibitively expensive; full emulation is therefore used only for research and experimentation.
The school of paravirtualization says: let's "pass the buck" to the guest operating system and make some modifications to it so that it can solve the problem. The Xen hypervisor implements this, allowing multiple operating systems to run at once so long as they are all using a specially modified version of the Linux kernel. The modification allows the guest to use different, special memory locations and to be aware that it does not have the entire address space to itself. OpenVZ passes the buck even further by modifying all of the guest operating systems so that they share a single Linux kernel (the host's). These solutions perform well and are used in industry.
Malicious nodes
It may seem silly in a cloud setting to consider a node potentially malicious, but if we are to store sensitive data in the cloud, we must prove that the cloud itself is safe. The cost of building this protection into the software is high but static, whereas the cost of proving that each and every node is not malicious scales with the size of the cloud, as each datacenter's security practices must be evaluated and certified.
In the long term, I predict that the software will end up dealing with the potential of malicious nodes, but unfortunately none of the storage engines reviewed take that as a design consideration.