DATE type under the hood in Drizzle/MySQL
Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.
- Comparison Function has a defect.
- Key Generator has a defect.
The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle’s native “field packer” (except for VARCHAR) inherited from MySQL. This would mean that Drizzle’s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people’s code. So, I decided to look into the comparison function which was completely written by me. Turned out that’s where the bug was.
Comparison Function
Allow me to quickly clarify what I mean by “comparison function” in this context. TC’s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involves traversing.
bool tcbdbsetcmpfunc(TCBDB *bdb, TCCMP cmp, void *cmpop);
What BlitzDB’s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in blitzcmp.cc (blitz_keycmp_cb).
DATE under the hood
After inspecting the “type number” with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I’ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn’t take this type into account in the comparator and it also showed how silly I am since the answer was always there.
Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)
P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data on memory. Respecting alignment in this environment should make some difference. Although, I can only say this after benchmarking it of course.
