Archive

Posts Tagged ‘database’

DATE type under the hood in Drizzle/MySQL

March 1st, 2010

Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.

  • Comparison Function has a defect.
  • Key Generator has a defect.

The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle’s native “field packer” (except for VARCHAR) inherited from MySQL. This would mean that Drizzle’s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people’s code. So, I decided to look into the comparison function which was completely written by me. Turned out that’s where the bug was.

Comparison Function

Allow me to quickly clarify what I mean by “comparison function” in this context. TC’s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involves traversing.

bool tcbdbsetcmpfunc(TCBDB *bdb, TCCMP cmp, void *cmpop);

What BlitzDB’s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in blitzcmp.cc (blitz_keycmp_cb).

DATE under the hood

After inspecting the “type number” with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I’ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn’t take this type into account in the comparator and it also showed how silly I am since the answer was always there.

Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)

P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data on memory. Respecting alignment in this environment should make some difference. Although, I can only say this after benchmarking it of course.

Toru Maesaka drizzle, oss , ,

Iterating Tokyo Cabinet in Parallel

October 21st, 2009

Iterating a Tokyo Cabinet database (both B+Tree and Hash Table) is fairly easy as I’ve described it in the past. However, things are different if you want to allow multiple threads to iterate the database individually. This is because iterating TC in a “standard” way limits you to obtain only one iterator per database object.

The Problem

Iterating in a standard way means no matter how hard you try, only one thread can iterate the database at a time. This obviously kills concurrent performance and it is something you want to avoid at all costs. Why do I care? well, this is a pain for developing a relational storage engine because it means that the table scanner can’t be run in parallel. In terms of MySQL/Drizzle internals, threads will have to wait until the scanning thread exits rnd_end(). Not good at all.

The Solution

Fortunately Mikio (the author of TC) was aware of this issue from the beginning and has provided a series of hidden functions that can solve this problem. He actually has other hidden functionalities in TC that is only documented in the header file. He has no plans on deleting those and mentioned that they are only for experts that can actually be bothered reading the header file.

The function we’re interested in is a key-based iterator called tchdbgetnext() and I will introduce my favorite of the series, tchdbgetnext3(). The idea behind tchdbgetnext() is — Given a particular key, TC will return a key/value pair of the next record. The database can then be iterated by continuously throwing the returned key at tchdbgetnext(). The first record in the database can be obtained by providing a NULL key.

You’ll see why Mikio decided to hide this function in the following example with tchdbgetnext3():

const char *fetched_data;
char *current_key = NULL;
char *last_key = NULL;
 
int fetched_data_len = 0;
int current_key_len = 0;
int last_key_len = 0;
 
while (true) {
  last_key = current_key;
  last_key_len = current_key_len;
 
  current_key = tchdbgetnext3(tokyo_hashdb_handle, last_key,
                              last_key_len, &current_key_len,
                              &fetched_data, &fetched_data_len);
 
  /* This will free the value as well. Explained in the blog entry.*/
  free(last_key);
 
  /* The entire database has been iterated */
  if (current_key == NULL)
    break;
}

You might be wondering why the “last_key” pointer is needed above. The answer is, tchdbgetnext3() returns an allocated pointer to the next key so continuously using the same pointer will result in a memory leak. To avoid this leak, “last_key” is used to remember where “current_key” had pointed to before it was re-pointed by tchdbgetnext3(). Confused? This takes a bit of thinking to understand hence it is only documented in the TC header file.

Another tricky but nice thing about tchdbgetnext3() is that it only calls malloc(3) once for each key/value pair. Usually malloc(3) is called individually for both key/value but with tchdbgetnext3(), TC allocates a buffer with enough space to accommodate both key/value and copies them next to each other. So, the pointers for key and value that tchdbgetnext3() sets on success is actually on the same buffer. This is why I only call free(3) on the key pointer in the above example. It frees the entire buffer which includes the value region.

Again, this takes a little bit of thinking to understand but it can cutdown malloc(3) call by half which can mean a lot to some people.

Toru Maesaka oss ,