Archive for the ‘knowledge’ Category

Notes on HEAP/MyISAM Index Key Handling on WRITE

January 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s source code in Drizzle.

Here are my brief notes on how index keys are generated in HEAP and MyISAM. I dug through this code because I’ve started preparing for decent index support in BlitzDB. I also wrote this down to assist my biological memory for later grepping (I have a terrible memory for names). I’m only going to cover key generation on write in this post; otherwise it would be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines, HEAP has a structure for keeping the key definition (parts, type, logic and so on). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned at table creation or when the table is opened. The assigned function depends on the data structure of the index and can be one of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), one of the following functions is used, regardless of the data structure.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()
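As a rough illustration of this kind of dispatch (this is not HEAP’s actual code, and every name below is hypothetical), a key definition can carry a function pointer that is chosen once when the table is opened and then called on every write:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified key definition; NOT the real HP_KEYDEF. */
typedef struct key_def {
  size_t fixed_length;  /* length used when the key is fixed-width */
  bool has_var_part;    /* true if any key part is variable-length */
  /* chosen once at open time, then called whenever a length is needed */
  size_t (*get_key_length)(const struct key_def *def, const char *key);
} key_def;

/* Fixed-width keys: the length is known up front. */
size_t fixed_key_length(const key_def *def, const char *key) {
  (void)key;
  return def->fixed_length;
}

/* Variable-width keys: the length is derived from the key itself
   (here, for simplicity, a NUL-terminated string). */
size_t var_key_length(const key_def *def, const char *key) {
  (void)def;
  return strlen(key);
}

/* Pick the implementation that matches the key's characteristics. */
void key_def_open(key_def *def) {
  assert(def != NULL);
  def->get_key_length = def->has_var_part ? var_key_length : fixed_key_length;
}
```

The caller never needs to know which variant it is talking to; it just calls def->get_key_length(def, key).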

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect these are similar to the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By “appropriately” I mean that it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the maximum possible size of it. The last thing copied to the buffer is the address of the chunk where the record lives.
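To make the idea concrete, here is a minimal sketch of that kind of key packing. It is not hp_rb_make_key() itself: the key_seg structure, the 1-byte length prefix for variable length fields and the make_key() name are all assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical key segment: describes one column that is part of the key. */
typedef struct key_seg {
  size_t offset;      /* where the field starts inside the row */
  size_t max_length;  /* declared width of the field's data */
  int is_varlen;      /* non-zero: field starts with a 1-byte length */
} key_seg;

/* Packs the key for `row` into `out` and returns the number of bytes
   written. `out` must have room for every segment plus sizeof(void *). */
size_t make_key(unsigned char *out, const key_seg *segs, size_t nsegs,
                const unsigned char *row, const void *row_addr) {
  unsigned char *pos = out;
  assert(out != NULL && segs != NULL && row != NULL);

  for (size_t i = 0; i < nsegs; i++) {
    const unsigned char *field = row + segs[i].offset;
    if (segs[i].is_varlen) {
      /* copy the length byte plus only the bytes actually in use */
      size_t used = field[0];
      assert(used <= segs[i].max_length);
      memcpy(pos, field, 1 + used);
      pos += 1 + used;
    } else {
      /* fixed-width field: copy it whole */
      memcpy(pos, field, segs[i].max_length);
      pos += segs[i].max_length;
    }
  }

  /* finish with the address of the record that this key points to */
  memcpy(pos, &row_addr, sizeof(row_addr));
  pos += sizeof(row_addr);
  return (size_t)(pos - out);
}
```

With a 4-byte integer followed by a 10-byte variable length field that currently holds 2 bytes, the packed key is 4 + 1 + 2 bytes plus the size of a pointer, rather than the full declared width.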

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definitions called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than in HEAP’s:

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert(), which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it loops through the key segments and packs the relevant fields according to the characteristics of the data type. The output buffer belongs to MyISAM’s handler (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.

Toru Maesaka drizzle, knowledge, oss

Tips on Drizzle Development and Valgrind

December 1st, 2009

In brief, valgrind is a framework of awesome tools that does an amazing job at detecting memory errors. It will catch silly (often unexpected) mistakes and memory leaks that you’ve made in your code. IMHO, it’s a must have tool for open source hackers that work with Linux. If you develop a plugin or a storage engine for Drizzle/MySQL, you often end up wanting to test your program for memory errors. Actually, it’s not a “want”, it’s a MUST.

Conveniently by supplying a simple startup option, Drizzle and MySQL’s test runner will run the daemon process on valgrind’s virtual machine. I’m not sure about MySQL since I’ve never developed anything for it but at least with Drizzle you can run a test case independently by supplying the desired test name to the test runner.

 $ ./dtr your_test_file_name --valgrind

So, with BlitzDB this is what I do to isolate the test runner to only run my tests:

 $ ./dtr blitzdb.test --valgrind

Very simple.

The minor complication here is that the test runner does not print the valgrind report to the console; it writes the output to a file instead. So where is this file? It’s written to the daemon’s error log, which is located in the source tree:

$ less drizzle_src/tests/var/log/master.err
CURRENT_TEST: main.blitzdb
==24563== Memcheck, a memory error detector
==24563== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
...

Here’s another tip. If you ever wondered where the files that were generated in the test (like table and index files) are stored, they are stored inside the source tree as well. Here’s an example on my machine:

$ ll drizzle_src/tests/var/master-data/
total 20528
-rw-rw---- 1 tmaesaka tmaesaka 10485760 2009-12-01 22:06 ibdata1
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile0
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile1
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 mysql
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 test

So, with all that in mind, happy hacking :)

Toru Maesaka drizzle, knowledge, oss

TC Concurrency Model and BlitzDB Part 1

November 7th, 2009

Recently I started rewriting BlitzDB because, after getting a better understanding of the Drizzle Storage API and Tokyo Cabinet internals, I’ve come to realize the mistakes I’ve made. Admittedly, calling it a rewrite is an exaggeration because I’ll be reusing most of the components, but in a more C++ way.

One decision I made is that BlitzDB will only support a BTREE index, via TC’s B+Tree API, in its first release. Ignoring BlitzDB for now, several people I’ve talked to about key/value data structures ask why I love B+Tree so much when it’s faster to work with a hash table. Please don’t take it the wrong way: O(1) operations are beautiful and I love hash tables, but assuming that every key/value structure should be one is not. Everything has its ups and downs and the hash table/map is no exception. In this blog entry, I will describe why B+Tree is good for index scanning.

Why a B+Tree Index

A search algorithm of O(1) like hashing is clearly faster than O(log n) unless there’s something fishy about the implementation or the dataset is too small for the time complexity to matter. However, this is only true for looking up and fetching the value. For those that are only interested in fetching a particular value, that’s probably the best you can ask for. However things are different if you look into things beyond lookups like fetching or scanning through a range of keys.

To do this with a typical hash table, either your data structure must be able to provide a list of stored keys OR your application must do some housekeeping and save a list of relevant keys elsewhere for future use. Your application would then need to compute the subset of keys that you’re interested in and fetch them with a loop. Algorithmically speaking, each fetch operation is O(1) but what’s expensive here is that you end up doing a lot of random access. This is obviously going to kill your performance, especially when you need to chew through a heavy workload (though this _could_ change when SSD becomes standard).

B+Tree, on the other hand, is fantastic for this use-case. The actual data is stored in the leaf nodes, and these are usually linked together so that you don’t need to re-traverse the tree to get the next greater key (if you run out of relevant records in the current leaf, you move on to the neighboring leaf node). The pages are typically laid out next to each other on disk, which means sequential access. Another bonus is that, most of the time, you can keep all of the internal nodes in memory; they are small and inexpensive to hold but very effective for searching.
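Here is a small sketch of why linked leaves make range scans cheap. It is a toy model and not how Tokyo Cabinet lays out its pages: the leaf structure, LEAF_CAPACITY and range_scan() are names invented for this example, and the initial root-to-leaf search is assumed to have happened already.

```c
#include <assert.h>
#include <stddef.h>

#define LEAF_CAPACITY 4

/* Hypothetical leaf node: sorted keys plus a link to the right sibling. */
typedef struct leaf {
  int keys[LEAF_CAPACITY];
  size_t nkeys;
  struct leaf *next;
} leaf;

/* Collects every key in [lo, hi] into `out` (at most `cap` of them),
   starting from the leaf that a normal tree search for `lo` returned.
   Returns the number of keys collected. */
size_t range_scan(const leaf *start, int lo, int hi, int *out, size_t cap) {
  size_t found = 0;
  assert(out != NULL || cap == 0);

  for (const leaf *node = start; node != NULL; node = node->next) {
    for (size_t i = 0; i < node->nkeys; i++) {
      if (node->keys[i] < lo)
        continue;                 /* not in range yet */
      if (node->keys[i] > hi || found == cap)
        return found;             /* past the range, or out of room */
      out[found++] = node->keys[i];
    }
    /* ran out of keys in this leaf: follow the sibling link,
       no need to go back up through the tree */
  }
  return found;
}
```

The only tree traversal is the one that finds the starting leaf; everything after that is a walk along the sibling links.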

My solution? Combine the two data structures and take advantage of their different characteristics. In BlitzDB, the actual rows are stored in TC’s Hash Database and the B+Tree index stores the keys that lead to each row. Strictly speaking, that makes it a secondary (non-clustered) index rather than a clustered one, since the rows themselves live outside the tree.

What I’ve mentioned so far is all theoretical and I haven’t provided any benchmark results, but all I’m trying to say is that it’s all about access patterns and use-cases. My current interest is in index scans, hence the decision. However, if enough people ask me for a HASH index, I can write that functionality relatively easily later on :)

Next Stop

I would love to keep writing but it is currently past 3am in Japan and I’m dozing off here. Apologies for not covering Tokyo Cabinet’s concurrency model in depth; I will cover it, and how it impacts BlitzDB’s design, in my next post of the series.

Toru Maesaka drizzle, knowledge

Writing records with duplicate keys to Tokyo Cabinet

September 9th, 2009

Lately I’ve been noticing that people are visiting my blog to find ways to write multiple records with the same key to a Tokyo Cabinet (TC) database.

Well, the answer depends on which data structure you choose to construct a TC database. If you’re interested in TC’s hash database then you’re out of luck but TC’s B+Tree database will allow you to write duplicate keys. If you just want the answer, here’s a compilable source of how to do it. For those that are interested in how it works, keep on reading :)

So here’s how it’s done. You write the record(s) using TC’s tcbdbputdup() function so that upon key collision, TC will write the record next to the existing one. The following snippet will write three records to Tokyo Cabinet using an identical key.

const char *key = "key";
const char *r1 = "record 1";
const char *r2 = "record 2";
const char *r3 = "record 3";
 
/* store three different records with the same key.
   note that "database_handle" is a TCBDB object. */
if (!tcbdbputdup(database_handle, key, 3, r1, strlen(r1)) ||
    !tcbdbputdup(database_handle, key, 3, r2, strlen(r2)) ||
    !tcbdbputdup(database_handle, key, 3, r3, strlen(r3))) {
  fprintf(stderr, "failed to store data\n");
 
  if (!tcbdbclose(database_handle)) {
    fprintf(stderr, "failed to close the database\n");
    return 1;
  }   
  tcbdbdel(database_handle);
  return 1;
}

Something to watch out for here is that, because you’ve allowed duplication, running the above code multiple times will keep appending the same records to the database.

The next question is, how do we retrieve _only_ the records that correspond to the key that we just inserted? Simple! Just traverse the tree from the first occurrence of the key and keep retrieving the data as we go until we hit a different key.

First thing that must be done is to create a cursor and move it to the first occurrence of the key.

BDBCUR *cursor;
 
if ((cursor = tcbdbcurnew(database_handle)) == NULL) {
  /* FAIL. do the right thing for your application */ 
}
 
/* move the cursor to the first occurrence of the key */
if (!tcbdbcurjump(cursor, key, strlen(key))) {
  /* FAIL. do the right thing for your application */ 
}

Now we’re ready to traverse the tree. Remember that we’re only interested in a certain key so we only want to traverse the tree until we hit a different key. The following code snippet will do exactly that and print the discovered record as it traverses the tree. So in our case it would print, “record 1”, “record 2” and “record 3”.

char *fetched_key;
char *fetched_value;

/* traverse the tree. terminates if the entire tree is
   traversed _OR_ if it hits a different key */
while ((fetched_key = tcbdbcurkey2(cursor)) != NULL) {
  /* different key so break out of the loop */
  if (strcmp(key, fetched_key) != 0) {
    free(fetched_key);
    break;
  }

  /* tcbdbcurkey2() returns an allocated string, so release it */
  free(fetched_key);

  fetched_value = tcbdbcurval2(cursor);

  if (fetched_value) {
    fprintf(stdout, "fetched: %s\n", fetched_value);
    free(fetched_value);
  }

  /* stop once the cursor walks off the end of the tree */
  if (!tcbdbcurnext(cursor))
    break;
}

/* release the cursor once we are done with it */
tcbdbcurdel(cursor);

The above tree traversal requires one additional lookup to terminate (if the entire tree isn’t traversed) but the chances are that the records are stored in the same page so this additional operation is cheap.

Alternatively, TC provides a function called tcbdbget4() which returns an allocated list of the records that correspond to the key you provide. If you decide to take this approach, you should consider whether the memory allocation cost and list construction overhead are acceptable for your application.

Toru Maesaka knowledge, oss

Storage Engine Dev Journal #3 : Supporting variable width tables

June 16th, 2009

Something I’ve added to BlitzDB recently that was pretty high on my todo list is support for variable width tables. So what is a variable width table? It is a table that contains columns that can vary in size, namely BLOB and TEXT types.

Going back to the basics, when a new row is to be written, a storage engine is given a pointer to the row data in MySQL format that it must somehow store for later lookup/retrieval. By “somehow” I mean that the storage engine is given the freedom to do whatever it likes with the row.

Writing a row for a fixed length table (a table with columns that are always the same size) is dead easy. A storage engine can choose not to tamper with the row and simply write or copy the data to its storage mechanism. This is because the storage engine is given a row that contains all the data. Rows for variable width tables, however, are treated differently since things aren’t as simple (it’s variable!).

The difference is that columns for BLOB and TEXT types are represented by two parts inside a MySQL/Drizzle row:

  • length of the data
  • pointer to the actual data

This is simple to understand since we need to know the size of the data to copy it.

Minor Complication

The minor complication, as you would expect, is that you can’t directly write the provided row to your engine like you can with fixed length tables. The data that you want to copy/write exists elsewhere (hence the pointer), so directly writing the row is meaningless (the data would have disappeared by your next access to that row). You need to make sure that the actual data for BLOB/TEXT column(s) is arranged appropriately in your engine’s row buffer and written out to its storage mechanism.
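To illustrate the problem in isolation, here is a self-contained sketch that does not use any Drizzle/MySQL types. The mem_row layout and the pack_row()/unpack_row() names are made up for this example; the point is only that the bytes behind the pointer have to be copied into the buffer, not the pointer itself.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory row: a fixed column plus a BLOB/TEXT style
   column that is represented as (length, pointer). */
typedef struct mem_row {
  int32_t id;
  uint32_t blob_len;
  const char *blob;  /* points at memory the row does not own */
} mem_row;

/* Pack: copy the id, the length, then the bytes the pointer refers to.
   Returns the number of bytes written to `buf`. */
size_t pack_row(unsigned char *buf, const mem_row *r) {
  unsigned char *pos = buf;
  assert(buf != NULL && r != NULL);
  memcpy(pos, &r->id, sizeof(r->id));
  pos += sizeof(r->id);
  memcpy(pos, &r->blob_len, sizeof(r->blob_len));
  pos += sizeof(r->blob_len);
  memcpy(pos, r->blob, r->blob_len);
  pos += r->blob_len;
  return (size_t)(pos - buf);
}

/* Unpack: rebuild the (length, pointer) pair. The pointer now refers to
   the bytes inside `buf`, so `buf` must outlive the unpacked row. */
void unpack_row(mem_row *r, const unsigned char *buf) {
  const unsigned char *pos = buf;
  assert(buf != NULL && r != NULL);
  memcpy(&r->id, pos, sizeof(r->id));
  pos += sizeof(r->id);
  memcpy(&r->blob_len, pos, sizeof(r->blob_len));
  pos += sizeof(r->blob_len);
  r->blob = (const char *)pos;
}
```

Had pack_row() copied the pointer instead of the bytes, the stored row would be useless as soon as the original memory was reused.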

This process is commonly referred to as row packing (converting to your engine format) and unpacking (converting back to MySQL format). So how is this done? It’s actually pretty simple!

The solution is actually simple

As much as it sounds like a bother to support variable length rows, it’s actually not that bad. First you need to understand what a MySQL row looks like internally.

A MySQL row begins with a bitset that represents which fields are NULL. The length of this data obviously depends on the number of NULLable columns you have but this is easy to handle with Drizzle since we’re given all the relevant information by the TableShare object (same goes for MySQL from a different object).
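As a simplified picture of such a bitset, assume one bit per NULLable column with the lowest bit first. The real layout in MySQL/Drizzle has more details than this, and the helper names below are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Bytes needed to hold one bit per NULLable column. */
size_t null_bytes_for(size_t nullable_columns) {
  return (nullable_columns + 7) / 8;
}

/* True if NULLable column number `n` is flagged in the leading bitset. */
bool column_is_null(const unsigned char *row, size_t n) {
  assert(row != NULL);
  return ((row[n / 8] >> (n % 8)) & 1u) != 0;
}

/* Flags NULLable column number `n` as NULL. */
void set_column_null(unsigned char *row, size_t n) {
  assert(row != NULL);
  row[n / 8] |= (unsigned char)(1u << (n % 8));
}
```

In a real engine you would not compute the size yourself; as mentioned above, the number of leading bytes is provided to you.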

After this data comes the actual column data in the order that appears in your CREATE TABLE statement. What you need to do to get packing working with this row is the not-so-obvious part that you really need an example to look at. Fortunately Tweeting about this attracted Brian’s attention which helped me move forward.

Loop the fields!

So, let’s take row insertion to a variable width table as an example. Imagine this table:

CREATE TABLE t1 (
  id int PRIMARY KEY NOT NULL,
  description text,
  arbitrary_data blob
) engine=your_engine;

and let’s imagine that we need to process this query:

INSERT INTO t1 VALUES (1, "hello world", "blobbbbb");

Now, the storage engine needs to “pack” the data for each column into its buffer in the write_row() function. Conveniently, Drizzle/MySQL provides a pack() function for its column types (fields) that will do the data packing for you. That is, you do not have to inspect the provided row for pointers to the actual data and do the packing/copying yourself.

How? Well, the table object (which is visible from your engine) conveniently holds a list of fields in the appropriate order. The actual pack() function is a member of these fields so you just need to call it as you loop over the list:

/* make sure row_buffer has enough memory */
unsigned char *pos = row_buffer;
 
/* copy NULL bits, "table->s" is the TableShare object */
memcpy(pos, row, table->s->null_bytes);
pos += table->s->null_bytes;
 
/* "row" is the MySQL formatted row given by the core */
for (Field **field = table->field; *field; field++) {
  if (!((*field)->is_null()))
    pos = (*field)->pack(pos, row + (*field)->offset(row));
}

The above code snippet will populate “row_buffer” with the actual data that you want to write to your storage mechanism. You do not have to advance the “pos” pointer yourself because pack() returns a pointer to the end of what it wrote into the buffer (think Pascal Strings). This is also why we work through a separate pos pointer: row_buffer itself keeps pointing at the start of the buffer.

For the opposite situation (when retrieving a row), an unpack() function is provided for each field so you just need to take advantage of it like we did with the pack() snippet above.

Little bit more on fields

The actual pack() function that gets called depends on the type of column since the Field class is an abstract base class for the subclasses that actually represent column types inside Drizzle/MySQL. If you want to know what a pack() function looks like for a BLOB type, grep for “Field_blob” in the source tree and there will be a pack() member function for it.

The code layout for the field subsystem in MySQL is rather difficult to comprehend since everything is crammed into the “sql/field.cc” and “sql/field.h” files (at least as of 5.4). So, if you want to get a good grasp of how things are architected, you should take a look at Drizzle. Field subclasses are located individually in the “drizzled/field/” directory and the base class is located in “drizzled/field.h”.

So, that’s about it! Hopefully this information will help other engine developers when they come across a need to support variable width tables :)

Toru Maesaka drizzle, knowledge, oss

Storage Engine Dev Journal #2 : Command Line Options

May 22nd, 2009

If you’re working on developing a Drizzle plugin, you may come across situations where you want to accept user options for it at server startup. For example, if you design your plugin to create files for activity logging, you may want to allow the DBA to specify where to write those files out.

In my case, I decided to provide a command line option to BlitzDB for row based query caching. This option is intended for special use-cases where the read/write ratio is 9:1. For those that are interested, row caching is disabled by default because it creates overhead in the engine for read-through logic and cache invalidation _unless_ read requests are significantly higher than update requests.

There are situations where BlitzDB’s row cache can be helpful but this is beyond the scope of this entry so I will save it for another day :)

Adding startup options to your plugin

Drizzle allows you to add command line options to your plugin without editing the server code. But before you start hacking away, there are a few not-so-obvious things that you need to understand.

So, let us first look at the data types that your plugin can accept:

  • DRIZZLE_SYSVAR_BOOL
  • DRIZZLE_SYSVAR_STR
  • DRIZZLE_SYSVAR_INT
  • DRIZZLE_SYSVAR_UINT
  • DRIZZLE_SYSVAR_LONG
  • DRIZZLE_SYSVAR_ULONG
  • DRIZZLE_SYSVAR_LONGLONG
  • DRIZZLE_SYSVAR_ULONGLONG
  • DRIZZLE_SYSVAR_ENUM

As you can see, there is a wide range of types that you can choose from. What you should choose depends on what you want to use the value for.

Pick your data type

So let’s take my row cache option as an example. Caching over 4 billion rows in one physical server is very unlikely and since we’re not interested in negative numbers, we’re going to pick:

  • DRIZZLE_SYSVAR_UINT

whose value we can store as a uint32_t in the plugin.

Declare that your plugin accepts options

Every plugin must declare itself as a plugin which looks like this for BlitzDB:

drizzle_declare_plugin(blitz) {
  "BLITZ",
  "0.3",
  "Toru Maesaka",
  "Non-transactional General Purpose Engine",
  PLUGIN_LICENSE_GPL,
  blitz_init,             /*  Plugin Init      */
  blitz_deinit,           /*  Plugin Deinit    */
  NULL,                   /*  status variables */
  blitz_system_variables, /*  system variables */
  NULL                    /*  config options   */
}
drizzle_declare_plugin_end;

Here, we’re interested in the second-to-last argument, which is called blitz_system_variables in the above example. Feel free to call this whatever you like for your plugin.

So what exactly is blitz_system_variables? It’s a NULL-terminated array of system variables that your plugin accepts. This is what it looks like for BlitzDB:

static struct st_mysql_sys_var *blitz_system_variables[] = { 
  DRIZZLE_SYSVAR(row_cache),
  NULL
};

As you can see, BlitzDB only supports one option at the moment so there is only one entry called row_cache.

Define your options

You must define every option that you’ve added to the system variable array. We decided to use DRIZZLE_SYSVAR_UINT earlier and called it row_cache so it is defined like this:

static DRIZZLE_SYSVAR_UINT (
  row_cache, /* option name */
  blitz_row_cache_size, /* variable to set the value to */
  PLUGIN_VAR_READONLY, /* mode */
  N_("Enable row caching for BlitzDB tables."),
  NULL,       /*  check func    */
  NULL,       /*  update func   */
  0,          /*  default value */
  0,          /*  minimum value */
  UINT32_MAX, /*  maximum value */
  0           /*  block size    */
);

The comments pretty much explain what the arguments are but for more details, you should take a look at the macros in drizzled/plugin.h. You could also look at what other plugins do by grepping for the system variable type that you’re interested in.

Test your new startup option

If all goes well you should be able to compile Drizzle and check whether command line options are visible from the plugin. An option takes the following form:

--<name_of_plugin>-<option_name>

So, in the row cache example, row cache can be enabled like this:

/usr/local/sbin/drizzled --blitz-row_cache=10000

Also note that you can replace the underscore with a hyphen:

/usr/local/sbin/drizzled --blitz-row-cache=10000
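The naming rule is simple enough to express in a few lines. The helper below is purely illustrative (build_option_flag() is a made-up name and this is not how Drizzle parses its options); it just shows how the plugin name, the option name and the underscore-to-hyphen substitution fit together.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Builds "--<plugin>-<option>" into `out`, writing underscores in the
   option name as hyphens. Returns 0 on success, -1 if `out` is too small. */
int build_option_flag(char *out, size_t cap, const char *plugin,
                      const char *option) {
  assert(out != NULL && plugin != NULL && option != NULL);

  int n = snprintf(out, cap, "--%s-%s", plugin, option);
  if (n < 0 || (size_t)n >= cap)
    return -1;

  /* skip "--", the plugin name and the separating hyphen */
  for (char *p = out + 2 + strlen(plugin) + 1; *p != '\0'; p++) {
    if (*p == '_')
      *p = '-';
  }
  return 0;
}
```

Calling it with "blitz" and "row_cache" yields the hyphenated form shown above.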

That’s it! It should be relatively easy to add more options once you successfully get your first one done.

Toru Maesaka drizzle, knowledge, oss

Tokyo Cabinet Tip: Protected Database Iteration

May 13th, 2009

Tokyo Cabinet (TC) provides iteration functionality for both its persistent and non-persistent data structures. For example, if you wanted to iterate through TC’s hash database, you can use the tchdbiternext() function. This is really straightforward to use such that:

void *key;
int key_len;
 
if (tchdbiterinit(tc_database_handle) != true) {
  /* failed to initialize iterator */
}
 
while ((key = tchdbiternext(tc_database_handle, &key_len)) != NULL) {
  /* work with the fetched key and key_len */

  /* the key is allocated by TC, so release it when done */
  free(key);
}

will iterate through the entire hash database that “tc_database_handle” object is responsible for. This can be handy if you need to loop through your database for some arbitrary reason.

However, there is a catch when using this function in a concurrent environment where the order of records _really_ matters. Even though TC is a thread-safe library, the iteration functions aren’t thread-safe in the way that we might expect.

For example, if a write operation occurs while the application iterates over the database, you will end up iterating over a database that is in a changed state. This will not make the cursor go crazy and crash your application since TC handles this internally but you still end up iterating over a database that is in a state that you did not initially intend on looping through.

The solution is to simply block write operations to the database while your application iterates through it. For example, you could use a pthread read-write lock (pthread_rwlock_t) to allow other threads to read while you iterate but block writes until you finish iterating.

I was planning on doing this for a table scanner in the storage engine that I’m currently working on, but it turns out TC has an undocumented function that will take care of this internally. I’ve talked to Mikio about this function and apparently it is intentional that he hasn’t documented it on his specification page. He has no plans to throw it out so you do not have to worry about it magically disappearing one day. For more information, you can take a look at his header file (tchdb.h for hash database).

Explanation and Simple Example

The function is called tchdbforeach() which will atomically iterate through your database from beginning to the end by supplying each key/value pair to the callback function that you provide. The signature of the callback is the following:

bool callback(const void *kbuf, int ksiz, const void *vbuf,
              int vsiz, void *op);

where the fifth argument, “void *op” is an opaque pointer to the data that you can pass to the callback. Here is a simple example that will increment a counter integer on each iteration using this function:

/* Do whatever you like with the provided key/value pair in here */
bool callback(const void *kbuf, int ksiz, const void *vbuf,
              int vsiz, void *op) {
  if (op == NULL)
    return false;
 
  *((int *)op) += 1;
 
  return true;
}
 
int main(void) {
  int niter = 0;
 
  ...
 
  if (!tchdbforeach(tc_database_handle, callback, &niter)) {
    fprintf(stderr, "failed to iterate the database\n");
    return EXIT_FAILURE;
  }
 
  printf("iterated %d times\n", niter);
 
  ...
 
  return EXIT_SUCCESS;
}

If all goes well, the counter variable will be set to the number of records in the database. This function is slightly more complex than using tchdbiternext() but you are guaranteed to iterate atomically which is pretty important for a table scanner.

I hope this function can help you too.

Toru Maesaka knowledge, oss

Journal of Storage Engine Development on Drizzle

May 12th, 2009

I’ve decided to start a series of blog entries on not-so-obvious findings that I’ve found while working on my new project. By archiving the findings, I’m hoping that I can help those that are looking into developing a storage engine for the MySQL family in the future.

Accumulating these small bits of knowledge would also be useful for me since I can refer back to them when I forget something. Also, once I write enough entries I’m planning on summarizing them and making the summary available on the Drizzle Wiki. If MySQL is interested in updating the engine documentation, I would be more than happy to help there too.

So to begin with, I’ll describe something trivial that I stumbled across while trying to catch an error on duplicate primary key insertion to the data table.

Background

In brief, the database kernel does not care if the INSERT query contains a duplicate primary key for a given table or not. It is the storage engine’s job to tell the kernel that the request was invalid due to key collision. If a storage engine fails to do this, the kernel will acknowledge that the query was successful (given that no other errors were thrown) and will keep doing what it needs to do.

Mechanics

Data insertion is handled inside the write_row() function that your engine must implement. The return value of this function is an integer that represents the status of the work it had done. After looking through the possible error statuses in “drizzled/base.h”, I immediately found this:

#define HA_ERR_FOUND_DUPP_KEY 121 /* Dupplicate key on write */

I also looked through MyISAM and InnoDB to confirm that this was indeed the correct error status to return on duplicate primary key. Here is the snippet of my row insertion at the time:

/* TC's tchdbputkeep will not insert a row to the table if there
   was a collision */
if (tchdbputkeep(data_table, primary_key, primary_key_length, buf,
                 table->s->reclength) == false) {
  my_errno = HA_ERR_GENERIC;
 
  /* check for primary key collision */
  if (tchdbecode(data_table) == TCEKEEP)
    my_errno = HA_ERR_FOUND_DUPP_KEY;
 
  return my_errno;
}

At first glance this seems right, but the error I was getting from the command line prompt always differed from MyISAM’s and InnoDB’s despite returning the same error status. Specifically, this is what I was getting:

ERROR 1022 (23000): Can't write; duplicate key in table 't1'

whereas I was getting this error on other engines:

ERROR 1062 (23000): Duplicate entry '1' for key 'PRIMARY'

At this stage I couldn’t make sense of what I was doing wrong but it turned out that the solution was pretty simple.

Solution

After talking to Stewart Smith about my issue in #drizzle @ freenode, it turned out that I am supposed to keep track of which key the duplicate was found on in write_row() and report it to the kernel via the info() function.

You can do this by setting the errkey integer variable to the key number that is used internally by the kernel. So, obtaining the internal primary key number with this call in write_row():

share->errkey = table->s->primary_key;

and adding the following code to info():

if (flag & HA_STATUS_ERRKEY) {
  errkey = share->errkey;
}

happily fixed the issue I was experiencing. Yay.

I guess reading the section on info() in the document gives a hint that this is where you supply the key number on key-error but frankly, this is really easy to forget and miss since the importance isn’t so emphasized.
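To show the whole round trip in one place, here is a toy model of the handshake. Nothing in it is Drizzle’s real API: the engine structure, engine_write_row(), engine_info() and the STATUS_ERRKEY flag value are invented for illustration, and only the number 121 is taken from the header quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

#define ERR_DUP_KEY 121     /* same number as HA_ERR_FOUND_DUPP_KEY */
#define STATUS_ERRKEY 0x1u  /* arbitrary flag value for this sketch */
#define MAX_ROWS 8

/* Hypothetical engine state. */
typedef struct engine {
  int primary_keys[MAX_ROWS];
  int nrows;
  unsigned primary_key_no;  /* index number of the primary key */
  unsigned errkey;          /* set by write_row, reported by info */
} engine;

/* Returns 0 on success, ERR_DUP_KEY on collision, -1 if the table is full. */
int engine_write_row(engine *e, int pk) {
  assert(e != NULL);
  for (int i = 0; i < e->nrows; i++) {
    if (e->primary_keys[i] == pk) {
      e->errkey = e->primary_key_no;  /* remember which key collided */
      return ERR_DUP_KEY;
    }
  }
  if (e->nrows == MAX_ROWS)
    return -1;
  e->primary_keys[e->nrows++] = pk;
  return 0;
}

/* The "kernel" calls this after a failure to learn which key was involved. */
void engine_info(const engine *e, unsigned flag, unsigned *errkey_out) {
  assert(e != NULL && errkey_out != NULL);
  if (flag & STATUS_ERRKEY)
    *errkey_out = e->errkey;
}
```

Returning the error status alone is only half of the conversation; without the second call the kernel has no way of knowing which key to name in its error message.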

Anyhow, that’s all I have to say in the first entry of this series and hopefully I’ll write something more interesting in the upcoming entries. Until then, happy hacking ;)

Toru Maesaka drizzle, knowledge, oss