Tuesday, May 14, 2013

Using non-transactional cache backends in Drupal

Using a backend other than DrupalDatabaseCache can cause cache consistency issues (read: lead to a corrupted cache), especially during high load.

For example, when using memcache as a cache backend, memcache knows nothing about the database transactions currently in progress. This leads to premature cache invalidation if cache_clear_all() is used inside a transaction. In fact, every cache backend other than DrupalDatabaseCache suffers from this.
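To make that concrete, here's a minimal sketch of the problematic pattern (the function name, cache bin and cid are made up for illustration):

function example_node_retitle($nid, $title) {
  // Everything below runs inside one database transaction.
  $txn = db_transaction();

  db_update('node')
    ->fields(array('title' => $title))
    ->condition('nid', $nid)
    ->execute();

  // The cache entry is cleared right away, but the UPDATE above is still
  // uncommitted and therefore invisible to other requests.
  cache_clear_all('node:' . $nid, 'cache_example');

  // If a concurrent request runs cache_get()/cache_set() for the same cid
  // right now, it reads the *old* row from the database and writes it back
  // into the cache -- and that stale entry survives the commit below.

  // The transaction commits when $txn goes out of scope.
}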

Consider the following:

  1. Start transaction
  2. Invoke hook_node_update()
  3. Modules do their stuff, e.g. field cache is invalidated, etc.
  4. Node is saved
  5. EntityCache is cleared
  6. Commit transaction

Whatever data is saved during the entire node_save() operation isn't committed to the database (and therefore isn't visible to concurrent requests) until step 6.

If concurrent requests hit the node in question between #3 and #6, the path cache, field cache, and god knows what else will be repopulated with old/wrong data. Only a cache clear, or a new node save without any concurrent requests, will fix this.

The same goes for EntityCache. There may not be much of a time window between #5 and #6, but it's there. If concurrent requests manage to populate the entity cache in that window, old/wrong data will be used. If you've ever been unable to save a node because "it has already been altered", you may have been a victim of this effect.

The Cache Consistent module contains a cache wrapper that addresses this issue by buffering cache operations until the transaction is committed. Hooking into the transaction requires a core patch, which is bundled with the module.
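The core of the idea can be sketched as a wrapper around any DrupalCacheInterface backend. This is only an illustration of the buffering concept, not the module's actual code; the class name and the flush() call are made up, and the flush-on-commit hook is the part that needs the core patch:

class ExampleBufferedCache implements DrupalCacheInterface {
  protected $backend;
  protected $bufferedClears = array();

  function __construct($bin) {
    // Wrap whatever backend is actually configured for this bin,
    // e.g. the memcache module's MemCacheDrupal.
    $this->backend = new MemCacheDrupal($bin);
  }

  function get($cid) {
    return $this->backend->get($cid);
  }

  function getMultiple(&$cids) {
    return $this->backend->getMultiple($cids);
  }

  function set($cid, $data, $expire = CACHE_PERMANENT) {
    $this->backend->set($cid, $data, $expire);
  }

  function clear($cid = NULL, $wildcard = FALSE) {
    if (Database::getConnection()->inTransaction()) {
      // Inside a transaction: buffer the clear instead of executing it.
      $this->bufferedClears[] = array($cid, $wildcard);
      return;
    }
    $this->backend->clear($cid, $wildcard);
  }

  function isEmpty() {
    return $this->backend->isEmpty();
  }

  // Called once the transaction has committed (this hook is what the bundled
  // core patch provides).
  function flush() {
    foreach ($this->bufferedClears as $clear) {
      $this->backend->clear($clear[0], $clear[1]);
    }
    $this->bufferedClears = array();
  }
}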

Drupal's variable storage and locking


When using InnoDB with isolation level REPEATABLE-READ, locking can become an issue. The Rules module in particular has a reputation for overusing/abusing the variable storage. Besides the fact that every variable change clears the entire monolithic cache entry for the variables, it can also cause deadlocks under high concurrent load when variables are deleted.

Drupal uses db_merge() to update a variable, which starts a transaction and then issues a SELECT ... FOR UPDATE.

The problem with a transactional SELECT ... FOR UPDATE is that if zero rows are found, a following INSERT can block other INSERTs. In principle there's nothing wrong with this; however, in the particular case of the variable_*() functions, this kind of locking can be greatly mitigated.
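In rough terms, what db_merge() ends up doing when variable_set() runs looks something like this (a simplified sketch, not the literal core implementation):

$name = 'example_variable';
$value = 42;

$txn = db_transaction();
$found = db_query('SELECT 1 FROM {variable} WHERE name = :name FOR UPDATE',
  array(':name' => $name))->fetchField();
if ($found) {
  db_query('UPDATE {variable} SET value = :value WHERE name = :name',
    array(':name' => $name, ':value' => serialize($value)));
}
else {
  // Under REPEATABLE-READ, the SELECT ... FOR UPDATE on a missing row takes a
  // gap/next-key lock, so concurrent INSERTs into that gap block each other
  // and can deadlock.
  db_query('INSERT INTO {variable} (name, value) VALUES (:name, :value)',
    array(':name' => $name, ':value' => serialize($value)));
}
// The transaction commits when $txn is released.
unset($txn);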

The problem arises with the use of variable_del(), which removes the row from the variable table. After that, the system is vulnerable to the zero-rows-found locking issue.

My proposal is that, instead of deleting the row, we just set its value to NULL. This prevents the gap lock (or is it a next-key lock?) at the expense of keeping more rows in the variable table. Since the entire variable table is stored in the cache anyway, the practical impact of that can be eliminated by not storing NULL values in the variable cache.


Quick core-hack for the variable_del() function:

function variable_del($name) {
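  // Keeping the row (with a NULL value) avoids the gap lock; variable_get()
  // still falls back to its default, because isset() is FALSE for NULL values.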
  variable_set($name, NULL);
}
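
To complete the idea, the NULL "tombstones" could be filtered out when the variable cache is rebuilt. A minimal sketch of that, applied roughly where variable_initialize() builds the cached array:

  $variables = array_map('unserialize', db_query('SELECT name, value FROM {variable}')->fetchAllKeyed());
  // Drop deleted (NULL) variables so they never take up space in the cache.
  $variables = array_filter($variables, function ($value) {
    return $value !== NULL;
  });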


As a side note, changing the isolation level to READ-COMMITTED will also fix this problem.
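
For Drupal 7 on MySQL, one way to do that per connection is the init_commands connection option in settings.php (assuming a Drupal 7 release that supports it; alternatively, set transaction-isolation = READ-COMMITTED globally in my.cnf). Treat this as a sketch to adapt to your setup:

$databases['default']['default']['init_commands'] = array(
  'isolation' => "SET SESSION tx_isolation='READ-COMMITTED'",
);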

MySQL and update statements


I recently had to write an SQL query that performed a 2x2 matrix multiplication where each matrix element was represented by a column (e.g. columns a, b, c and d).

I then found out that there's a bit of atomicity lacking in MySQL here. Depending on how you view it, the behavior can seem intuitive; however, it can cause problems if you assume the opposite, and/or when other databases (e.g. Postgres) behave differently.

The easiest example to consider is swapping the values of two columns.

INSERT INTO mytable (id, col1, col2) VALUES(1, 500, 0);
UPDATE mytable SET col1 = col2, col2 = col1 WHERE id = 1;

Try to guess: What is the value of col2 after this operation?

The answer: It depends on which database you use.

My instincts told me that col2 would contain 500 after the operation. That is true for Postgres, but not true for MySQL. In MySQL col2 would contain 0.

I could imagine that MySQL does this for performance reasons, i.e. updating the columns in place or something, but that's just me guessing.

In any case, there are several ways to work around this. At first, I introduced an extra buffer column (tmp) to hold the old column value, e.g.:

UPDATE mytable SET tmp = col1, col1 = col2, col2 = tmp WHERE id = 1;

However, since this workaround is only needed for MySQL anyway, I'm thinking of using a @variable instead, as that may be more efficient.
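
A hypothetical version of the swap using a MySQL user variable (this relies on MySQL generally evaluating single-table UPDATE assignments left to right, so verify it on your version before relying on it):

UPDATE mytable SET col1 = (@tmp := col1), col1 = col2, col2 = @tmp WHERE id = 1;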



Drupal and temporary tables

Every once in a while, you find yourself in need of a temporary table. Fortunately, Drupal provides a very easy-to-use function: db_query_temporary(). Unfortunately, it's not very flexible.

There are 2 main things wrong with it:

  1. When using MySQL, the engine is hardcoded to MEMORY. While this may suit most purposes, it's certainly not flexible enough should you need a temporary table containing a data set larger than whatever the DBA decided to allow for MEMORY tables.
  2. Indices. Adding an index to a temporary table can be very useful when aggregating data, but db_query_temporary() gives you no way to define one.

I've of course had the need both to create large temporary tables and to add indices to them, and found that neither is possible out-of-the-box in Drupal.

Regarding the indices, the first thing that comes to mind is simply adding the index via db_add_primary_key() after creating the temporary table with db_query_temporary(). This works for the MySQL backend, but not for the others. The reason is that the index operations (db_add_primary_key(), db_add_index(), etc.) check whether the table exists before doing anything. That may sound like a good thing, but they check by examining information_schema, where temporary tables do not appear. It works for MySQL because its table-exists check is implemented (for performance reasons) as a SELECT against the table in question: if the query doesn't fail, the table exists.

So, for MySQL, we are currently able to add indices to a temporary table, but the table is still limited in size by the MEMORY storage engine.

I, of course, want this to work for all backends. I think I'll post a feature request on drupal.org about this at some time, once I get around to getting my Drupal 8 installation up and running again.

The feature proposal is simple: add a db_create_temporary_table() function that acts just like db_create_table(), except that it creates a temporary table.

Until then, here's a workaround (tested with MySQL, Postgres and SQLite):


/**
 * Creates a new temporary table from a Drupal table definition.
 *
 * @param $table
 *   A Schema API table definition array.
 *
 * @return
 *   The name of the temporary table.
 */
function db_create_temporary_table($table) {
  static $index = 0;

  $connection = Database::getConnection();
  $schema = $connection->schema();
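  // createTableSql() is protected on the DatabaseSchema classes, so reflection
  // is used to call it directly.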
  $method = new ReflectionMethod($schema, 'createTableSql');
  $method->setAccessible(TRUE);

  $tablename = "_db_temporary_" . $index++;
  $statements = $method->invoke($schema, $tablename, $table);
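  // The first statement is the CREATE TABLE itself; turn it into a temporary table.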
  $statements[0] = preg_replace('/^CREATE TABLE /', 'CREATE TEMPORARY TABLE ', $statements[0]);
  foreach ($statements as $statement) {
    $connection->query($statement);
  }
  return $tablename;
}
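
Hypothetical usage, with a table definition that gets both an index and a non-MEMORY engine (the table and field names are made up, and the 'mysql_engine' key is only honored by the MySQL driver):

$definition = array(
  'fields' => array(
    'nid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
    'score' => array('type' => 'float', 'not null' => TRUE, 'default' => 0),
  ),
  'primary key' => array('nid'),
  'indexes' => array(
    'score' => array('score'),
  ),
  'mysql_engine' => 'InnoDB',
);
$temp_table = db_create_temporary_table($definition);

// The returned name is used like any other table name.
db_query('INSERT INTO {' . $temp_table . '} (nid, score) SELECT nid, 0 FROM {node}');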