Archive

Archive for the ‘oss’ Category

DATE type under the hood in Drizzle/MySQL

March 1st, 2010

Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.

  • Comparison Function has a defect.
  • Key Generator has a defect.

The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle’s native “field packer” (except for VARCHAR) inherited from MySQL. This would mean that Drizzle’s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people’s code. So, I decided to look into the comparison function which was completely written by me. Turned out that’s where the bug was.

Comparison Function

Allow me to quickly clarify what I mean by “comparison function” in this context. TC’s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involves traversing.

bool tcbdbsetcmpfunc(TCBDB *bdb, TCCMP cmp, void *cmpop);

What BlitzDB’s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in blitzcmp.cc (blitz_keycmp_cb).

DATE under the hood

After inspecting the “type number” with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I’ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn’t take this type into account in the comparator and it also showed how silly I am since the answer was always there.

Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)

P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data on memory. Respecting alignment in this environment should make some difference. Although, I can only say this after benchmarking it of course.

Toru Maesaka drizzle, oss , ,

Speaking at the MySQL Conference 2010

February 18th, 2010

I’m a little behind in announcing this but I’m going to be speaking at O’Reilly’s MySQL Conference this year. My presentation is a three hour tutorial titled, Drizzle Storage Engine Development. Practical Example with BlitzDB. Three hours is a long time but I assure you that there will be a break.

This session isn’t solely about going through Drizzle’s Storage Engine API. Various performance topics like B+Tree structure, memory handling and concurrency control will be covered. I will also go through BlitzDB’s design concept and it’s internal stuff. So, needless to say I’ll talk a lot about Tokyo Cabinet and it’s internals as well.

Hopefully those that come along will walk out of the tutorial standing far ahead of the start line. It will help you get started on reading the implementation of other storage engines in the MySQL ecosystem (MyISAM, InnoDB, PBXT, Federated and so forth). Better yet you will start writing one.

Looking forward to seeing you there :)

Toru Maesaka drizzle, event, oss, travel , , ,

Progress on BlitzDB’s Index Component

February 18th, 2010

I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time spent on BlitzDB for the last couple of weeks have been studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls off the Indexing API.

The actual coding wasn’t too bad thanks to Tokyo Cabinet’s awesome B+Tree API. I’ve been busier adding new tests and fixing silly bugs as they arise. I also implemented the Primary Key optimization that I blogged about a while back. As a result of all this, the following goodness has been added to BlitzDB’s Trunk.

  • Index Lookup
  • Forward Index Scan
  • Reverse Index Scan

This means that BlitzDB is now equipped with both a Table Scanner and an Index Scanner which are two essential components for a general purpose storage engine. As much as I’d like to work on optimizing the code and adding features (like recovery), I’m going to take a break and spend the rest of the month working on testing and debugging. There’s no point in adding features if the base has notable flaws in it.

Challenges Encountered

Writing the Index Scanner itself is easy. The most difficult thing that slowed me down was developing the comparison function for index keys. The end result was a simple piece of code but I had to study various things before I could start writing any code.

  • How to respect collation
  • How keys are represented internally
  • How types are represented internally
  • How to write a custom comparison function for Tokyo Cabinet
  • … and so on

I’ve also started using Evernote to jot down my spontaneous ideas on optimizing BlitzDB. I’ve made these notes public and they will most likely be updated while I’m commuting on the train.

There are much more that I’d like to write about like how I intend on developing the table recovery routine without simply using TC’s recovery mechanism but I shall restrain myself for another day.

Toru Maesaka drizzle, oss , ,

Notes on HEAP/MyISAM Index Key Handling on WRITE

January 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s sourcecode in Drizzle.

Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I’ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I’m only going to cover key generation on write in this post. Otherwise this post is going to be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines HEAP has a structure for keeping Key definition (parts, type, logic and etc). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned to at table creation or when the table is opened. The assigned function depends on the data structure of the index and it can be either of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), either of the following functions are used for both data structures.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect it is similar the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By meaning “appropriately” it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The final byte that is copied to the buffer is the address of the chunk where the record lives.

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than HEAP.

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert() which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields accordingly to the characteristic of the data type. The output buffer belongs to MyISAM’s hander (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.

Toru Maesaka drizzle, knowledge, oss , ,

Further thoughts on BlitzDB’s Index Handling

January 15th, 2010

I’ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I’ve been getting with BlitzDB’s index design. I’m actually so frustrated with myself at the moment that I want to hit my head against a wall or something.

So, I’m writing this entry to clear up my mind. Heh, my blog is slowly becoming BlitzDB’s design document draft. This should hopefully be good though since by blogging it, people can tell me whether I’m moving towards a stupid direction or not.

Collation Importance

When writing database software that is intended for International use, it is important to handle textual data by respecting collation order. It is arguable that most people are only interested in English lexicographic ordering but unfortunately the world is not so standard.

Internal Primary Key Handling

I want to motivate people to actively define a PRIMARY KEY with BlitzDB. I plan to make this attractive by providing the best performance when PK is defined. In December 2009, my answer to this was to write the PK value as the key for the data dictionary (where actual rows are stored in BlitzDB). This allows BlitzDB to do a direct lookup on the data dictionary for PK based lookup, instead of consulting the B+Tree index. I’m still fond of this lookup optimization approach but it introduces problems too.

Problem 1. Consider the following textual keys: “key” and “KEY”. They obviously have different binary representations but in certain cases they can be logically equivalent. Because the data dictionary is a hash database, this is a problem. The solution that instantly pops up is to normalize the key before writing or reading it. This however, causes a problem in cases where the two keys are inequivalent. Perhaps Drizzle/MySQL provides an internal normalization function that respects this. I still need to study this area of the storage subsystem.

Problem 2. Directly writing a PK to the data dictionary means fast lookup but because of the data structure, it’s not possible to fetch the next “logical” key, meaning I can’t implement index scanning on PK as it is. Quick solution for users is to create an index on the PK column (this would create a separate B+Tree for it) but this is not so friendly because it requires the user to have prior knowledge of all this. So, my plan is to provide the best of both worlds. I’ll elaborate on how I’m planning on tackling this problem next.

Current Primary Key Read/Write Behavior

In general, keys of BlitzDB’s data dictionary is a unique 8 byte integer. The idea is that BlitzDB writes this unique ID along with the key to the B+Tree Index so that it can later identify that row. The difference with PK is that, if a PK is present in a table, BlitzDB will not generate an internal unique ID and use PK for the data dictionary’s key instead. BlitzDB won’t create a B+Tree index for PK at the time I wrote this blog entry.

Next Step

Create a B+Tree index for PK anyway. BlitzDB will still use the PK value as the key for data dictionary if it exists. For PK based lookup requests, BlitzDB will look directly at the data dictionary and for PK based requests that involve index scanning, BlitzDB will look at the B+Tree index.

This approach can consume more space when textual data is used for keys but I think it’s worth it. At the same time, you can save space if you use use types that are smaller than 8 bytes for PK. For example, using a 4 byte integer would reduce BlitzDB’s key space by 50%.

Hmm, I think my mind has cleared a little.

Toru Maesaka drizzle, oss ,

Drizzle, BlitzDB and HTON_STATS_RECORDS_IS_EXACT

January 13th, 2010

Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So it’s a great thing in general.

Something Broke

After I enabled HTON_STATS_RECORDS_IS_EXACT, I noticed that issuing SELECT statement on a table with 1 row would no longer return a resultset. Weird indeed! after investigating with GDB, I noticed that rnd_next() is only called once instead of twice on a table with 1 row (second time is to find EOF) when HTON_STATS_RECORDS_IS_EXACT is enabled. This makes sense because the kernel knows that there is only 1 row and therefore it doesn’t need to keep scanning for EOF. However, this made me scratch my head since this shouldn’t break BlitzDB’s table scanner.

Remedy

Logically, I was confident that BlitzDB’s table scanner was functioning properly so I decided to look at what was going on beyond the engine API. Turns out that join_read_system() in sql_select.cc looks at the table->status value and decides that it’s an error if 0 isn’t assigned to it. What’d you know? I realized that I wasn’t assigning anything to the status variable. It’s more that I didn’t know that I was meant to update an internal structure. You’d think that engine developers aren’t meant to touch those. It’s not mentioned in the Engine Documentation at MySQL Forge either. Nevertheless, the important thing is that it works now. Oh and SELECT COUNT is fast now too.

Eye Opener

This experience among other occasions where I had to read the kernel’s source made me think that it would be nice to provide an intensive up to date documentation on how to develop storage engines for Drizzle in the future (when the API becomes stable). Needless to say, this would be co-ordinated within the Drizzle community. I’m not a license person but it should hopefully be provided with a freely available license too.

Toru Maesaka drizzle, oss ,

How to Recover a Tokyo Cabinet Database

January 8th, 2010

Recently Mark Callaghan had asked me whether BlitzDB is crash safe since he was aware that Tokyo Cabinet isn’t crash safe (unless used with transactions). For Tokyo Cabinet and Tyrant’s defense, I should mention that this is intentional. The idea is to reduce durability in return for higher throughput. The author’s philosophy is that data availability should be secured by replication. This makes sense since the design of TC and TT are influenced by mixi’s high traffic (we need single instances to handle over 10k requests per sec).

So with that said, let’s move on to the main topic. The honest answer is that BlitzDB is not crash safe either (transaction support is still a long way to go). If the admin is lucky, she would be able to repair the table(s) using the REPAIR TABLE syntax. BlitzDB’s crash safety strategy is the same as Tokyo Tyrant – You should use replication. The question is, how do you repair a broken Tokyo Cabinet file?

The answer is pretty simple and it’s documented in the Japanese TC documentation. Unfortunately it’s not not present in the English documentation. So allow me to go through it with demo code in this post. There are two ways to attempt to recover a Tokyo Cabinet database:

  1. By using the Tokyo Cabinet API.
  2. By using Tokyo Cabinet’s command line tool.

Let’s first go through how to confirm that your database is broken. I’ve also covered how to comprehend the errors.

How to confirm that your Database is broken

Simply use the command line tools installed with Tokyo Cabinet. Look at the “additional flags” line on the output of “tchmgr inform” or “tcbmgr inform” depending on your database type. If it says, “fetal” then your file is really broken. If it says “open”, it means that your application died or exited without closing the database. A file in the “open” state is still usable but your most recent records are most likely unavailable. This is because TC connects the hash chain after it has confirmed that a write operation was successful. If your application died before the record is chained, then it’s not accessible in the database.

Furthermore, the records that weren’t sync’d by the kernel won’t be present on power failure. If the disaster was a process failure, then the written data will hopefully be in the kernel’s write buffer so you won’t lose that data. For pedantic people, TC provides a way to sync the database from your application. Whether to call this function (and how often) is up to your application’s policy.

Using the Tokyo Cabinet API

(1) Open the database file without the lock option. Meaning, supply HDBONOLCK or BDBONOLCK to the open function of the appropriate database type (TCHDB or TCBDB).

/* This is for TCHDB */
TCHDB *hdb = tchdbnew();
 
if (!tchdbopen(hdb, "/path/to/broken_file", HDBONOLCK | HDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}
 
/* This is for TCBDB */
TCBDB *btree = tcbdbnew();
 
if (!tcbdbopen(btree, "/path/to/broken_file", BDBONOLCK | BDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}

(2) Run tchdboptmize() or tcbdboptimize() depending on the database type. You might wonder what you should give as the parameter for the optimize function. Conveniently, TC stores the tuning parameters of the database when you first opened it so you can just provide -1 as an argument _but_ the final one. This is because the final argument is an unsigned integer (uint8_t). What you want to provide instead is UINT8_MAX for this.

/* This if for TCHDB */
if (!tchdboptimize(hdb, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This hash database can't be rescued. */
}
 
/* This if for TCBDB */
if (!tcbdboptimize(btree, -1, -1, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This b+tree database can't be rescued. */
}

If you’re lucky, the above would repair the database that is associated with TC’s database object.

Using TC’s command line tool

This approach is more towards database admins since I’m sure the last thing they want to do is write their own program to get their work done. Lazyness is good.

TC provides a utility program called tchmgr (for a hash database) and tcbmgr (for a b+tree database) which allows you to run optimize on a database file. So if you wanted to repair a TC hash database, you would do the following:

$ tchmgr optimize -nl /path/to/broken_file

and the following for the B+Tree Database:

$ tcbmgr optimize -nl /path/to/broken_file

For those that are interested, the “-nl” option means “No Lock” which is required to repair a database file.

Well, I guess this sums up this blog post. I hope this post will help you administrate Tokyo Tyrant and/or your Tokyo Cabinet based application!

Toru Maesaka oss ,

Congrats to Kyoto Cabinet’s Alpha Release

January 1st, 2010

Writing this blog entry to congratulate Mikio for releasing Kyoto Cabinet (alpha release), which is positioned as a successor project of Tokyo Cabinet.

From development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. ASFAIK, C++03 was adopted so that KC can run on broader platforms than POSIX oriented systems (So, theoretically it can build on Windows). From project perspective, KC is licensed under GPLv3 whereas TC is licensed under LGPL.

For my projects, I currently have no interest in moving to KC from TC since I’m convinced that TC is better suited for my target platforms (UNIX/Linux). Another reason (VERY personal reason) is that I like C99 more than C++. Although… I don’t have the right to say this since I’m far from proficient at C++. For more details on the project differences, you should take a look at KC’s project page that Mikio had prepared:

For those that are interested in KC’s performance, I’ll do some benchmarks against TC when I get back from my end of year vacation. Hope everyone is enjoying New Years!

Toru Maesaka oss ,