Database Capabilities in a High-Volume Environment
Traditionally, the term "database" as used by technical professionals implied "relational, row-based, transactional." These implied extra characteristics do us a disservice when speaking of databases in a high-volume environment where there must be several types of databases deployed, some of which aren't relational, some of which aren't row-based, and most of which aren't transactional.
What is a Database?
When we initially started our alpha tests of Cassandra in our production environment some months ago, there was a lot of shouting from database administrators (DBAs) the world over. Some of it was clearly based on some misunderstandings about what a database is. Yes, I just said many DBAs don't know what a database is. That's okay. There's a lot of history here, and things have been changing rapidly in the last 10 years, so it'll take some time before we all get up-to-speed.
In low-volume installations, the subtleties surrounding database capabilities do not matter and can be ignored. A relational row-based transactional database is suitable for all data storage needs in those installations. However, DBAs in high-volume environments must be concerned with the underlying capabilities each Database Management System (DBMS) provides because capabilities come at a cost. The DBMS with the least capabilities for a particular function must be chosen.
What is a database? A database is merely a system for storing data such that retrieval is efficient. Therefore, a database administrator is responsible for picking the best database for the job. If you find yourself requiring words like "transactional" or "normalised data" when speaking of the fundamental capabilities of a database, you are missing the subtleties. Such capabilities are useful in some situations, but are often not required. Having capabilities where they are not required is a waste of resources which becomes painfully apparent in high-volume environments.
One capability databases need to provide in a high-volume environment is write load distribution across nodes in a cluster. This is a capability not traditionally provided by the DBMS, and engineers in high-volume environments have rolled their own systems to do it hundreds of times. DBMSs with this capability have only recently started to appear in the Open Source world.
DBAs also commonly need to do analytics processing, which is more well-understood. The purpose of this essay/post is to talk about data (write) distribution, since it is less understood by the typical DBA. I will only incidentally speak of analytics processing. It suffices to say that these needs are more well-understood and catered to by DBMSs today.
When a database system's growth reaches the phase where it must be distributed, the person or process submitting the query loses some luxuries that existed when there was only one relational row-based database serving the data. The most often talked about is data consistency, but there are other losses as well, which I shall cover in the next section.
Here are a few acronyms I'll be using:
- OLTP - Online Transaction Processing. Traditional RDBMSs do this.
- OLAP - Online Analytics Processing. Specialised for aggregate queries.
- ACID - Includes support for transactions.
- BASE - Includes support for distributed data and eventual consistency.
- RDBMS - Relational database. Commonly also ACID and row-based.
What is the Penalty for Distributing a Database?
In distributed databases, data must follow the CAP theorem, proposed by Eric Brewer (but subsequently formally proven by MIT researchers). There is a good writeup of the theorem and its history. Simply, a data item in the distributed database system can be two (and only two) of:
- (C) Always Consistent (atomic operations across nodes)
- (A) Always Available (reads always return a usually-correct value)
- (P) Always tolerant of network Partitions (nodes unable to speak to one-another)
Business owners want A and P, because those translate to making the most money. Notably, consumers (or customers) also want A and P, because humans are quite tolerant of inconsistencies. The alternative is confusing failures from the services they're accessing. Humans are less tolerant of confusion and failure.
Thus, when deciding which of the three to drop, as an industry we nearly universally give consistency the short straw and wave it adieu. That said, we don't drop consistency entirely. We implement "eventual consistency" and put off settling up inconsistencies until critical moments (like at the payment stage of an online store, for example).
One not-commonly-discussed loss related to distributed databases is loss of query flexibility. The RDBMS with its structured query language (SQL) provides JOIN and ORDER BY. These are two powerful concepts that when aided by indexes allow the end user to get data out of the DBMS in surprising, novel, and sometimes vastly efficient ways.
When a database is distributed across nodes, the end user loses JOIN and ORDER BY. It doesn't matter if you use a full relational database as the engine on each node of your distributed database. You still lose the capability to do these things. This is why many distributed databases like Cassandra do not have JOIN or ORDER BY capabilities. The capabilities can't be used anyway, so why include them? It is also why implementing a sharding strategy on top of a relational/transactional/SQL database such that rows are partitioned across nodes is inefficient. You cannot use JOINs, range scans, or transaction support, so the overhead of having those capabilities in your database translate to inefficient use of computing resources.
What Database Types are There?
There are many "new" database types that have been popularised in the past ten years based on the needs of DBAs in high-volume environments. Here are as many as I can think of, with some examples of each type that you can use in your own database system. I also include the RDBMS for completeness. Full disclosure: I do not completely understand each instance of each type of database (for example, CouchDB, Katta/Lucene, or Voldemort). Please feel free to elaborate/correct in comments on the Digg story.
Document-Oriented Database - Also known as the schema-less database. Stores semi-structured data in a distributed environment. In some, rows are stored physically in order of column-families, so they're still stored row-at-a-time but certain sets of columns are like a "primary key" or "clustered index." Storing data in that way allows range-scans queries. Writes are distributed across nodes, datafiles are largely immutable after writes, and query optimisers are largely eliminated.
- HBase - Built on top of Hadoop filesystem. Modelled very closely after Google's Bigtable.
- Hypertable - Typically built on top of Hadoop. Modelled after Google's Bigtable.
- Cassandra - Nodes store data on their own filesystem and data is distributed between nodes via P2P gossip.
- CouchDB - ACID-compliant and peer-based distributed updates.
Fulltext Database - Stores unstructured text data optimised for retrieval. Bad at storing structured data.
- Solr/Lucene - A bit slow in the face of updates/commits/optimise jobs. Replication is a bit weird and seems a somewhat hackish add-on.
- Zoie/Lucene - Specifically addresses the shortcomings of Solr.
- Katta/Lucene - Apparently much more lean than Solr. "Lucene in the cloud." You use Hadoop (or something like it) to create the indexes that you then import into the Katta/Lucene structure.
- Tokyo Dystopia - Implemented on top of Tokyo Cabinet. Very lean, lacking (by design) several high-level features of other fulltext engines.
Key/Value Store - Can't do range scans or data ordering, but given the complete lack of an optimiser, lookups are extremely fast.
- Voldemort - Based on Amazon's Dynamo.
- MySQL Cluster (NDB) - Uses a central lookup database and partitions data across nodes.
- Memcached - A volatile cache store. Data you put here is extremely fast to retrieve, but you must assume it'll disappear at any moment.
Relational Transactional Row-Oriented Database - In a distributed environment, ALL writes go to ALL nodes, hence this is only good for scaling reads. Trivial distributed implementations leave JOIN/ORDER BY capabilities intact, but most eliminate or drastically reduce them. Automatic indexing, data ordering, and other data-access simplifying algorithms are employed at the cost of overall speed reduction.
- MySQL - One of the most advanced database systems for fulltext, distributed, and OLAP database problems.
- PostgreSQL - More sophisticated transactional support, less sophisticated distributed, fulltext, or OLAP support.
- Tokyo Tyrant - Built on top of Tokyo Cabinet, a fast local hash- or b-tree-index datastore. There are a set of products (Cabinet, Tyrant, Dystopia) that together form a full solution for fulltext, distributed, and relational database storage.
- Myriad others like Oracle, Sybase, IBM, Microsoft - Often orders of magnitude more expensive, but you can tell your legal department you have someone to blame.
Column-Oriented Database - Data is laid out column-at-a-time, allowing orders of magnitude speed increases for OLAP queries even over MySQL's best row-based OLAP solutions like MyISAM. Doesn't shard well if at all. Suffers orders of magnitude greater slowness when you try to update/insert/delete data that already exists in the database. Therefore, data is entered via bulk loads.
- LucidDB - Open source.
- Vertica, Sybase IQ, Oracle, myriad others - Closed source, expensive, often more mature.
What Were the Bad Old Days Like?
Traditionally, the RDBMS was a row-based, relational, transactional ("ACID") database. The acronym RDBMS entirely fails to imply transactional or row-based, which are often key components of the database management system. Perhaps because RBOLTPRDBMS didn't roll off the tongue effortlessly? RDBMSs have been used for almost everything until quite recently. Here are some implementation details of how:
- Key/Value Stores - The primary key of the table is the key, the other columns are the (structured) value. This worked fairly well, especially after MySQL introduced MyISAM which eliminated a lot of the relational overhead of the RDBMS. As long as you treated the MyISAM table as append-only (no UPDATEs or DELETEs or INSERTs before the end of the table), it was faster than other RDBMSs and still allowed JOINs and indexing. Specialised key/value stores are faster for single-key lookups.
- Full Text Indexing - The data could be broken out in a somewhat intelligent way, and regular indexes could be used to find prefixes of words, helping to locate the document. This worked better after MySQL introduced fulltext indexing inside the MyISAM table type.
- OLAP - You can structure your data in a way where many queries are faster by, for example, eliminating transactional overhead and doing bulk loads, or by indexing every column on a table. However, column-store databases have produced far better advances in speed for OLAP. This is one area where traditionally DBAs have well understood the need for a non-relational or non-row-based database.
- Distributed Datastore - You could partition your data by table, putting some subset of tables on one master/slaves cluster, and another subset of tables on another master/slaves cluster. Thus, writes have been partitioned and reads are scaled by adding slaves to a particular cluster. Asynchronous replication ensured Availability and Partition Tolerance (with eventual Consistency). JOIN and ORDER BY are partially retained for the tables that coexist on a particular cluster. This only works temporarily (and is how Digg has been distributing MySQL for the past few years).
Thanks for reading, and Digg on!