Archive

Posts Tagged ‘storage’

Tokyo Cabinet Tip: Protected Database Iteration

May 13th, 2009

Tokyo Cabinet (TC) provides iteration functionality for both it’s persistent and non-persistent data structures. For example, if you wanted to iterate through TC’s hash database, you can use the tchdbiternext() function. This is really straight forward to use such that:

void *key;
int key_len;
 
if (tchdbiterinit(tc_database_handle) != true) {
  /* failed to initialize iterator */
}
 
while ((key = tchdbiternext(tc_database_handle, &key_len)) != NULL) {
  /* work with the fetched key and key_len */
}

will iterate through the entire hash database that “tc_database_handle” object is responsible for. This can be handy if you need to loop through your database for some arbitrary reason.

However, there is a consequence in using this function in a concurrent environment with a use-case where the order of records _really_ matter. This is because even though TC is a thread-safe library, the iteration functions aren’t thread-safe in a way that we expect.

For example, if a write operation occurs while the application iterates over the database, you will end up iterating over a database that is in a changed state. This will not make the cursor go crazy and crash your application since TC handles this internally but you still end up iterating over a database that is in a state that you did not initially intend on looping through.

Solution to this is to simply block write operations to the database while your application iterates through. For example, you could use pthread’s rw_lock to allow other threads to read while you iterate but block writes until you finish iterating.

I was planning on doing this for a table scanner in the storage engine that I’m currently working on but turns out TC has an undocumented function that will take care of this internally. I’ve talked to Mikio about this function and apparently it is intentional that he hasn’t documented it on his specification page. He has no plans on throwing it out so you do not have to worry about it to magically disappear one day. For more information, you can take a look at his header file (tchdb.h for hash database).

Explanation and Simple Example

The function is called tchdbforeach() which will atomically iterate through your database from beginning to the end by supplying each key/value pair to the callback function that you provide. The signature of the callback is the following:

bool callback(const void *kbuf, int ksiz, const void *vbuf,
              int vsiz, void *op);

where the fifth argument, “void *op” is an opaque pointer to the data that you can pass to the callback. Here is a simple example that will increment a counter integer on each iteration using this function:

/* Do whatever you like with the provided key/value pair in here */
bool callback(const void *kbuf, int ksiz, const void *vbuf,
              int vsiz, void *op) {
  if (op == NULL)
    return false;
 
  *((int *)op) += 1;
 
  return true;
}
 
int main(void) {
  int niter = 0;
 
  ...
 
  if (!tchdbforeach(tc_database_handle, callback, &niter)) {
    fprintf(stderr, "failed to iterate the database\n");
    return EXIT_FAILURE;
  }
 
  printf("iterated %d times\n", niter);
 
  ...
 
  return EXIT_SUCCESS:
}

If all goes well, the counter variable will be set to the number of records in the database. This function is slightly more complex than using tchdbiternext() but you are guaranteed to iterate atomically which is pretty important for a table scanner.

I hope this function can help you too.

Toru Maesaka knowledge, oss , ,

Journal of Storage Engine Development on Drizzle

May 12th, 2009

I’ve decided to start a series of blog entries on not-so-obvious findings that I’ve found while working on my new project. By archiving the findings, I’m hoping that I can help those that are looking into developing a storage engine for the MySQL family in the future.

Accumulating these mini-knowledge would also be useful for me since I can refer back to it when I forget something. Also, once I write enough entries I’m planning on summarizing them and making it available on the Drizzle Wiki. If MySQL is interested in updating the engine documentation, I would be more than happy to help there too.

So to begin with, I’ll describe something trivial that I stumbled across while trying to catch an error on duplicate primary key insertion to the data table.

Background

In brief, the database kernel does not care if the INSERT query contains a duplicate primary key for a given table or not. It is the storage engine’s job to tell the kernel that the request was invalid due to key collision. If a storage engine fails to do this, the kernel will acknowledge that the query was successful (given that no other errors were thrown) and will keep doing what it needs to do.

Mechanics

Data insertion is handled inside the write_row() function that your engine must implement. The return value of this function is an integer that represents the status of the work it had done. After looking through the possible error statuses in “drizzled/base.h”, I immediately found this:

#define HA_ERR_FOUND_DUPP_KEY 121 /* Dupplicate key on write */

I also looked through MyISAM and InnoDB to confirm that this was indeed the correct error status to return on duplicate primary key. Here is the snippet of my row insertion at the time:

/* TC's tchdbputkeep will not insert a row to the table if there
   was a collision */
if (tchdbputkeep(data_table, primary_key, primary_key_length, buf,
                 table->s->reclength) == false) {
  my_errno = HA_ERR_GENERIC;
 
  /* check for primary key collision */
  if (tchdbecode(data_table) == TCEKEEP)
    my_errno = HA_ERR_FOUND_DUPP_KEY;
 
  return my_errno;
}

On first glimpse, this seems right but the error I was getting from the command line prompt always differed with MyISAM and InnoDB despite returning the same error status. Specifically, this is what I was getting:

ERROR 1022 (23000): Can't write; duplicate key in table 't1'

whereas I was getting this error on other engines:

ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'

At this stage I couldn’t make sense of what I was doing wrong but it turned out that the solution was pretty simple.

Solution

After talking to Stewart Smith about my issue in #drizzle @ freenode, it turned out I am supposed to keep track of which key the duplication was found in write_row() and inform it to the kernel via the info() function.

You can do this by setting the errkey integer variable to the key number that is used internally by the kernel. So, obtaining the internal primary key number with this call in write_row():

share->errkey = table->s->primary_key;

and adding the following code to info():

if (flag & HA_STATUS_ERRKEY) {
  errkey = share->errkey;
}

happily fixed the issue I was experiencing. Yay.

I guess reading the section on info() in the document gives a hint that this is where you supply the key number on key-error but frankly, this is really easy to forget and miss since the importance isn’t so emphasized.

Anyhow, thats all I have to say in the first of this series and hopefully I’ll write something more interesting in the upcoming entries. Until then, happy hacking ;)

Toru Maesaka drizzle, knowledge, oss , , ,

Drizzle’s Storage Engine subsystem

April 28th, 2009

Today I was doing some work on writing a small storage engine for Drizzle that we’re hoping will perform good enough to replace the MEMORY/HEAP engine inherited from MySQL. For those that are interested, this engine is going to be based on Tokyo Cabinet’s on-memory b+tree database.

I was a little worried that this task is going to be heavy since I’ve never hacked on a relational database engine before (other than tracing through InnoDB) but it turns out it’s not so bad. I first looked up the engine interface for MySQL then looked at the engines in the Drizzle tree. The interface itself looks very similar but what’s really cool about Drizzle is how the storage engine is registered to the core server with the new plugin system. Here’s a snippet of how this looks with the Blackhole engine:

blackhole_engine= new BlackholeEngine(engine_name);
registry.add(blackhole_engine);

and the de-initialization:

registry.remove(blackhole_engine);
delete blackhole_engine;

If you’re familiar with how a Drizzle plugin is written, you will notice that this is exactly the same; You implement the interface and boom, you register it with the PluginRegistry object. So a storage engine is treated equally just like any other plugin system.

Needless to say, the storage engine declaration is exactly the same as what you would do with any Drizzle plugin:

drizzle_declare_plugin(blackhole)
{
  "BLACKHOLE",
  "1.0",
  "MySQL AB",
  "/dev/null storage engine (anything you write to it disappears)",
  PLUGIN_LICENSE_GPL,
  blackhole_init, /* Plugin Init */
  blackhole_fini, /* Plugin Deinit */
  NULL,           /* status variables */
  NULL,           /* system variables */
  NULL            /* config options */
}
drizzle_declare_plugin_end;

This is nice! a storage engine is just a plugin after all and Drizzle treats it as it ought to be. I will have some more time to hack on this after my memcached webinar so hopefully I’ll have more details posted about the progress on this work soon.

Toru Maesaka drizzle, oss , , ,

Understanding how auto increment works with InnoDB

April 14th, 2009

Lately I’ve been having lots of fun going through Drizzle and InnoDB‘s sourcecode to get a grasp of how auto increment is processed internally. I think I now have a fairly good grasp of what’s going on so I’m writing this entry as a note for myself. I’m also hoping that this will be helpful to those that are interested in this topic too.

So in MySQL and Drizzle, the storage engine (in this case InnoDB) is responsible for computing the auto increment value. Here’s an abbreviated execution path for a simple INSERT statement to a table with an auto increment column:

mysql_parse() -> mysql_execute_command() -> mysql_insert() ->
write_record() -> handler::ha_write_row() -> ha_innobase::write_row() ->
handler::update_auto_increment() -> ha_innobase::get_auto_increment()

As you may have noticed, “handler” is an abstract class so you can apply this execution path to other engines too. For example, if you look inside MyISAM there is a get_auto_increment() in there too.

InnoDB’s internal counter

InnoDB keeps track of it’s own auto increment count (an unsigned 64bit integer) called “autoinc” inside a structure that represents a database table (dict_table_struct). As you would expect, this is what InnoDB uses to compute the auto increment value that the server will use for further processing.

Diving in a little deeper

As you can see in the execution path, handler::update_auto_increment() function asks InnoDB’s get_auto_increment() function for an incremented value. Looking slightly deeper, you will see that this function:

innobase_get_autoinc(uint64_t *value);

is called inside ha_innobase::get_auto_increment() which will attempt to obtain the autoinc lock for that table (look at ha_innobase::innobase_lock_autoinc()), then given that the lock was successfully obtained, innobase_get_autoinc() will set the provided uint64_t variable with the next auto increment value:

 *value = dict_table_autoinc_read(prebuilt->table);

If you look inside dict_table_autoinc_read() you will notice that it returns table->autoinc which is InnoDB’s internal auto increment counter described in the previous section. We don’t increment table->autoinc before returning it since this value is going to be updated after it is returned. In other words, the next auto increment value is precomputed. Needless to say, this critical region is still protected by the autoinc mutex so you don’t have to worry about updated values to clash.

After obtaining the value from InnoDB

Once we return from ha_innobase::get_auto_increment(), the core server can now store the returned 64bit integer as binary to the next_number_field object for further processing.

If you wish to see the value of this data, you can print it as a primitive type (uint64_t in this case) by obtaining it with the val_int() function. So if you were going through the code with gdb, performing this at an appropriate time (e.g. after the autoinc value is stored) will print the value that will be used to update the auto increment column of your table:

(gdb) print table->next_number_field->val_int()

Finally at the end of handler::update_auto_increment(), the next auto increment value is computed so that the storage engine won’t have to do any work in case the request is a multi row statement (note that this is different to InnoDB’s autoinc precomputation). We then return back to ha_innobase::write_row(), where the row is actually inserted into the table with row_insert_for_mysql().

The rest of the flow is what you would expect: log what just happened, cleanup the mess caused in generating the autoinc value, and respond to the client.

Final words

So, this is all I have to cover in this entry. Admittedly my explanation is really abbreviated and perhaps a little over simplified but I figured this is enough to get back on the sourcecode after I forget my findings from the past couple of days in the future (my memory sucks).

I haven’t covered the auto increment internals for UPDATE statements but covering this would make this entry longer than I want it to be so I will save it for another day. Nevertheless, this entry will hopefully help curious individuals like myself to stare at Drizzle and InnoDB’s source code and enjoy an hour or two there.

Oh yea, and as a warning, do note that this entry could be outdated by the time you read it!

Toru Maesaka drizzle, knowledge, oss , , ,