Posts Tagged ‘recovery’

BlitzDB Crash Safety and Auto Recovery

July 22nd, 2010

Crash safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so-called NoSQL world, compromise crash safety in return for higher QPS. The argument there is that the availability of the overall system should be achieved through replication, since a database server can’t be rescued if the physical disk breaks. I happen to agree with this philosophy, but I am also aware that this isn’t the right answer for everyone. So, what will I do with BlitzDB?

Several relational database hackers have pointed out that BlitzDB isn’t any safer than MyISAM since it doesn’t guarantee crash safety. This is currently true, but I plan on making BlitzDB much safer than MyISAM by providing the following features.

  1. Auto Recovery Routine (startup option)
  2. Tokyo Cabinet’s Transaction API (table-specific option)

The second feature above would actually guarantee that BlitzDB is crash safe (especially when combined with auto recovery), but I won’t go into depth in this post since the topic deserves a blog post of its own. Let me just state that this feature will be provided in a form like this:

CREATE TABLE t1 (
  a int PRIMARY KEY,
  b varchar(256)
) ENGINE = BLITZDB, CRASH_SAFE;

From here on, I’ll cover how I plan on hacking auto recovery in BlitzDB.

Auto Recovery Challenges

As I blogged a while back, recovering Tokyo Cabinet is relatively simple. However, this is not a sufficient solution in BlitzDB since the data file (the hash database that actually holds the rows) and the index file(s) are independent of each other. That is, the data file and the index file(s) are very likely to be inconsistent after a crash. So, how can we hack on this? Pretty simple.

Indexes aren’t Important at Recovery Phase

Because BlitzDB logically separates the data file and its indexes, index files aren’t that important. If a server crash occurs, BlitzDB can delete the index file(s) and recompute them from the data file. Needless to say, this process involves a lot of random access and computation, but it would not dominate the overall running time of the system since it’s a one-time cost. This approach has one flaw, however: the index files can’t be recomputed if the data file is broken or unrecoverable.
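To make the idea concrete, here is a minimal C sketch of recomputing an index purely from the rows. This is not BlitzDB or Tokyo Cabinet code: the `row` and `index_entry` types and the `rebuild_index` function are hypothetical names made up for illustration, and an in-memory array stands in for the data file.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row layout: a primary key plus one indexed column. */
typedef struct {
  int pk;
  char b[32];
} row;

/* Hypothetical index entry: the indexed value and the primary key it maps to. */
typedef struct {
  char key[32];
  int pk;
} index_entry;

/* Order entries by indexed value, then by primary key to break ties. */
static int cmp_entry(const void *x, const void *y) {
  const index_entry *a = x;
  const index_entry *b = y;
  int c = strcmp(a->key, b->key);
  if (c != 0) return c;
  return (a->pk > b->pk) - (a->pk < b->pk);
}

/*
 * Ignore whatever the old index contained and recompute it from the rows.
 * Returns a malloc'd array of nrows entries sorted by key, or NULL if the
 * allocation fails. The caller is responsible for freeing it.
 */
index_entry *rebuild_index(const row *rows, size_t nrows) {
  assert(rows != NULL || nrows == 0);
  index_entry *idx = malloc(nrows ? nrows * sizeof(*idx) : 1);
  if (idx == NULL) return NULL;
  for (size_t i = 0; i < nrows; i++) {
    strncpy(idx[i].key, rows[i].b, sizeof(idx[i].key) - 1);
    idx[i].key[sizeof(idx[i].key) - 1] = '\0';
    idx[i].pk = rows[i].pk;
  }
  qsort(idx, nrows, sizeof(*idx), cmp_entry);
  return idx;
}
```

The point of the sketch is that the index is derived data: as long as every row can still be read, the index can be thrown away and rebuilt with one scan and one sort.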

Therefore to guarantee crash safety, BlitzDB must ensure that the data file is unbreakable. This is precisely where Tokyo Cabinet’s Transaction API comes in. I’m planning on using it to protect the data file from breaking. If the data file is protected, the table can be rescued. Simple!

So, that’s what I have in mind for making BlitzDB a safer engine. Unfortunately I can’t start hacking on it immediately since I have several bugs to fix first. Nevertheless, I’m looking forward to hacking on it. This challenge should be quite fun to tackle.

Toru Maesaka drizzle, oss

How to Recover a Tokyo Cabinet Database

January 8th, 2010

Recently Mark Callaghan asked me whether BlitzDB is crash safe, since he was aware that Tokyo Cabinet isn’t crash safe (unless used with transactions). In defense of Tokyo Cabinet and Tyrant, I should mention that this is intentional. The idea is to trade durability for higher throughput. The author’s philosophy is that data availability should be secured by replication. This makes sense since the design of TC and TT is influenced by mixi’s high traffic (we need single instances to handle over 10k requests per sec).

So with that said, let’s move on to the main topic. The honest answer is that BlitzDB is not crash safe either (transaction support is still a long way off). If the admin is lucky, she will be able to repair the table(s) using the REPAIR TABLE syntax. BlitzDB’s crash safety strategy is the same as Tokyo Tyrant’s: you should use replication. The question is, how do you repair a broken Tokyo Cabinet file?

The answer is pretty simple and it’s documented in the Japanese TC documentation. Unfortunately it’s not present in the English documentation, so allow me to go through it with demo code in this post. There are two ways to attempt to recover a Tokyo Cabinet database:

  1. By using the Tokyo Cabinet API.
  2. By using Tokyo Cabinet’s command line tool.

Let’s first go through how to confirm that your database is broken. I’ll also cover how to interpret the errors.

How to confirm that your Database is broken

Simply use the command line tools installed with Tokyo Cabinet. Look at the “additional flags” line in the output of “tchmgr inform” or “tcbmgr inform”, depending on your database type. If it says “fatal”, then your file is really broken. If it says “open”, it means that your application died or exited without closing the database. A file in the “open” state is still usable, but your most recent records are most likely unavailable. This is because TC connects the hash chain after it has confirmed that a write operation was successful. If your application died before the record was chained, then that record is not accessible in the database.
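The same decision can be made in code once you have the flags byte in hand. The sketch below is a hedged illustration only: `FLAG_OPEN`, `FLAG_FATAL`, `db_state`, `classify_flags` and `advice` are hypothetical names, and the bit values are assumptions. As far as I know the library exposes the real values through tchdbflags() and constants such as HDBFOPEN and HDBFFATAL, but please verify that against your installed tchdb.h before relying on it.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed bit layout, for illustration only. In real code, use the flag
   constants from Tokyo Cabinet's own headers instead of these. */
enum { FLAG_OPEN = 1 << 0, FLAG_FATAL = 1 << 1 };

typedef enum { DB_CLEAN, DB_NOT_CLOSED, DB_BROKEN } db_state;

/* "fatal" wins over "open": a broken file needs repair no matter how
   the last session ended. */
db_state classify_flags(uint8_t flags) {
  if (flags & FLAG_FATAL) return DB_BROKEN;
  if (flags & FLAG_OPEN) return DB_NOT_CLOSED;
  return DB_CLEAN;
}

/* Map each state to a short hint for the admin. */
const char *advice(db_state s) {
  switch (s) {
    case DB_CLEAN:      return "closed cleanly, nothing to do";
    case DB_NOT_CLOSED: return "not closed, usable but recent records may be missing";
    case DB_BROKEN:     return "fatal, attempt a repair with optimize";
  }
  assert(0 && "unknown db_state");
  return "";
}
```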

Furthermore, the records that weren’t sync’d by the kernel won’t be present on power failure. If the disaster was a process failure, then the written data will hopefully be in the kernel’s write buffer so you won’t lose that data. For pedantic people, TC provides a way to sync the database from your application. Whether to call this function (and how often) is up to your application’s policy.

Using the Tokyo Cabinet API

(1) Open the database file without the lock option. Meaning, supply HDBONOLCK or BDBONOLCK to the open function of the appropriate database type (TCHDB or TCBDB).

/* This is for TCHDB */
TCHDB *hdb = tchdbnew();
 
if (!tchdbopen(hdb, "/path/to/broken_file", HDBONOLCK | HDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}
 
/* This is for TCBDB */
TCBDB *btree = tcbdbnew();
 
if (!tcbdbopen(btree, "/path/to/broken_file", BDBONOLCK | BDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}

(2) Run tchdboptimize() or tcbdboptimize() depending on the database type. You might wonder what you should give as the parameters for the optimize function. Conveniently, TC stores the tuning parameters of the database when you first open it, so you can just provide -1 for every argument but the final one. The final argument is an unsigned integer (uint8_t), so what you want to provide for it instead is UINT8_MAX.

/* This is for TCHDB */
if (!tchdboptimize(hdb, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This hash database can't be rescued. */
}
 
/* This is for TCBDB */
if (!tcbdboptimize(btree, -1, -1, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This b+tree database can't be rescued. */
}

If you’re lucky, the above would repair the database that is associated with TC’s database object.

Using TC’s command line tool

This approach is geared more towards database admins, since I’m sure the last thing they want to do is write their own program to get their work done. Laziness is good.

TC provides utility programs called tchmgr (for a hash database) and tcbmgr (for a B+tree database), which allow you to run optimize on a database file. So if you wanted to repair a TC hash database, you would do the following:

$ tchmgr optimize -nl /path/to/broken_file

and the following for the B+Tree Database:

$ tcbmgr optimize -nl /path/to/broken_file

For those that are interested, the “-nl” option means “No Lock” which is required to repair a database file.

Well, I guess this sums up this blog post. I hope this post will help you administer Tokyo Tyrant and/or your Tokyo Cabinet based application!

Toru Maesaka oss