Archive

Archive for the ‘oss’ Category

Progress on BlitzDB’s Index Component

February 18th, 2010

I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time spent on BlitzDB for the last couple of weeks have been studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls off the Indexing API.

The actual coding wasn’t too bad thanks to Tokyo Cabinet’s awesome B+Tree API. I’ve been busier adding new tests and fixing silly bugs as they arise. I also implemented the Primary Key optimization that I blogged about a while back. As a result of all this, the following goodness has been added to BlitzDB’s Trunk.

  • Index Lookup
  • Forward Index Scan
  • Reverse Index Scan

This means that BlitzDB is now equipped with both a Table Scanner and an Index Scanner which are two essential components for a general purpose storage engine. As much as I’d like to work on optimizing the code and adding features (like recovery), I’m going to take a break and spend the rest of the month working on testing and debugging. There’s no point in adding features if the base has notable flaws in it.

Challenges Encountered

Writing the Index Scanner itself is easy. The most difficult thing that slowed me down was developing the comparison function for index keys. The end result was a simple piece of code but I had to study various things before I could start writing any code.

  • How to respect collation
  • How keys are represented internally
  • How types are represented internally
  • How to write a custom comparison function for Tokyo Cabinet
  • … and so on

I’ve also started using Evernote to jot down my spontaneous ideas on optimizing BlitzDB. I’ve made these notes public and they will most likely be updated while I’m commuting on the train.

There are much more that I’d like to write about like how I intend on developing the table recovery routine without simply using TC’s recovery mechanism but I shall restrain myself for another day.

Toru Maesaka drizzle, oss , ,

Notes on HEAP/MyISAM Index Key Handling on WRITE

January 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s sourcecode in Drizzle.

Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I’ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I’m only going to cover key generation on write in this post. Otherwise this post is going to be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines HEAP has a structure for keeping Key definition (parts, type, logic and etc). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned to at table creation or when the table is opened. The assigned function depends on the data structure of the index and it can be either of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), either of the following functions are used for both data structures.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect it is similar the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By meaning “appropriately” it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The final byte that is copied to the buffer is the address of the chunk where the record lives.

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than HEAP.

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert() which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields accordingly to the characteristic of the data type. The output buffer belongs to MyISAM’s hander (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.

Toru Maesaka drizzle, knowledge, oss , ,

Further thoughts on BlitzDB’s Index Handling

January 15th, 2010

I’ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I’ve been getting with BlitzDB’s index design. I’m actually so frustrated with myself at the moment that I want to hit my head against a wall or something.

So, I’m writing this entry to clear up my mind. Heh, my blog is slowly becoming BlitzDB’s design document draft. This should hopefully be good though since by blogging it, people can tell me whether I’m moving towards a stupid direction or not.

Collation Importance

When writing database software that is intended for International use, it is important to handle textual data by respecting collation order. It is arguable that most people are only interested in English lexicographic ordering but unfortunately the world is not so standard.

Internal Primary Key Handling

I want to motivate people to actively define a PRIMARY KEY with BlitzDB. I plan to make this attractive by providing the best performance when PK is defined. In December 2009, my answer to this was to write the PK value as the key for the data dictionary (where actual rows are stored in BlitzDB). This allows BlitzDB to do a direct lookup on the data dictionary for PK based lookup, instead of consulting the B+Tree index. I’m still fond of this lookup optimization approach but it introduces problems too.

Problem 1. Consider the following textual keys: “key” and “KEY”. They obviously have different binary representations but in certain cases they can be logically equivalent. Because the data dictionary is a hash database, this is a problem. The solution that instantly pops up is to normalize the key before writing or reading it. This however, causes a problem in cases where the two keys are inequivalent. Perhaps Drizzle/MySQL provides an internal normalization function that respects this. I still need to study this area of the storage subsystem.

Problem 2. Directly writing a PK to the data dictionary means fast lookup but because of the data structure, it’s not possible to fetch the next “logical” key, meaning I can’t implement index scanning on PK as it is. Quick solution for users is to create an index on the PK column (this would create a separate B+Tree for it) but this is not so friendly because it requires the user to have prior knowledge of all this. So, my plan is to provide the best of both worlds. I’ll elaborate on how I’m planning on tackling this problem next.

Current Primary Key Read/Write Behavior

In general, keys of BlitzDB’s data dictionary is a unique 8 byte integer. The idea is that BlitzDB writes this unique ID along with the key to the B+Tree Index so that it can later identify that row. The difference with PK is that, if a PK is present in a table, BlitzDB will not generate an internal unique ID and use PK for the data dictionary’s key instead. BlitzDB won’t create a B+Tree index for PK at the time I wrote this blog entry.

Next Step

Create a B+Tree index for PK anyway. BlitzDB will still use the PK value as the key for data dictionary if it exists. For PK based lookup requests, BlitzDB will look directly at the data dictionary and for PK based requests that involve index scanning, BlitzDB will look at the B+Tree index.

This approach can consume more space when textual data is used for keys but I think it’s worth it. At the same time, you can save space if you use use types that are smaller than 8 bytes for PK. For example, using a 4 byte integer would reduce BlitzDB’s key space by 50%.

Hmm, I think my mind has cleared a little.

Toru Maesaka drizzle, oss ,

Drizzle, BlitzDB and HTON_STATS_RECORDS_IS_EXACT

January 13th, 2010

Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So it’s a great thing in general.

Something Broke

After I enabled HTON_STATS_RECORDS_IS_EXACT, I noticed that issuing SELECT statement on a table with 1 row would no longer return a resultset. Weird indeed! after investigating with GDB, I noticed that rnd_next() is only called once instead of twice on a table with 1 row (second time is to find EOF) when HTON_STATS_RECORDS_IS_EXACT is enabled. This makes sense because the kernel knows that there is only 1 row and therefore it doesn’t need to keep scanning for EOF. However, this made me scratch my head since this shouldn’t break BlitzDB’s table scanner.

Remedy

Logically, I was confident that BlitzDB’s table scanner was functioning properly so I decided to look at what was going on beyond the engine API. Turns out that join_read_system() in sql_select.cc looks at the table->status value and decides that it’s an error if 0 isn’t assigned to it. What’d you know? I realized that I wasn’t assigning anything to the status variable. It’s more that I didn’t know that I was meant to update an internal structure. You’d think that engine developers aren’t meant to touch those. It’s not mentioned in the Engine Documentation at MySQL Forge either. Nevertheless, the important thing is that it works now. Oh and SELECT COUNT is fast now too.

Eye Opener

This experience among other occasions where I had to read the kernel’s source made me think that it would be nice to provide an intensive up to date documentation on how to develop storage engines for Drizzle in the future (when the API becomes stable). Needless to say, this would be co-ordinated within the Drizzle community. I’m not a license person but it should hopefully be provided with a freely available license too.

Toru Maesaka drizzle, oss ,

How to Recover a Tokyo Cabinet Database

January 8th, 2010

Recently Mark Callaghan had asked me whether BlitzDB is crash safe since he was aware that Tokyo Cabinet isn’t crash safe (unless used with transactions). For Tokyo Cabinet and Tyrant’s defense, I should mention that this is intentional. The idea is to reduce durability in return for higher throughput. The author’s philosophy is that data availability should be secured by replication. This makes sense since the design of TC and TT are influenced by mixi’s high traffic (we need single instances to handle over 10k requests per sec).

So with that said, let’s move on to the main topic. The honest answer is that BlitzDB is not crash safe either (transaction support is still a long way to go). If the admin is lucky, she would be able to repair the table(s) using the REPAIR TABLE syntax. BlitzDB’s crash safety strategy is the same as Tokyo Tyrant – You should use replication. The question is, how do you repair a broken Tokyo Cabinet file?

The answer is pretty simple and it’s documented in the Japanese TC documentation. Unfortunately it’s not not present in the English documentation. So allow me to go through it with demo code in this post. There are two ways to attempt to recover a Tokyo Cabinet database:

  1. By using the Tokyo Cabinet API.
  2. By using Tokyo Cabinet’s command line tool.

Let’s first go through how to confirm that your database is broken. I’ve also covered how to comprehend the errors.

How to confirm that your Database is broken

Simply use the command line tools installed with Tokyo Cabinet. Look at the “additional flags” line on the output of “tchmgr inform” or “tcbmgr inform” depending on your database type. If it says, “fetal” then your file is really broken. If it says “open”, it means that your application died or exited without closing the database. A file in the “open” state is still usable but your most recent records are most likely unavailable. This is because TC connects the hash chain after it has confirmed that a write operation was successful. If your application died before the record is chained, then it’s not accessible in the database.

Furthermore, the records that weren’t sync’d by the kernel won’t be present on power failure. If the disaster was a process failure, then the written data will hopefully be in the kernel’s write buffer so you won’t lose that data. For pedantic people, TC provides a way to sync the database from your application. Whether to call this function (and how often) is up to your application’s policy.

Using the Tokyo Cabinet API

(1) Open the database file without the lock option. Meaning, supply HDBONOLCK or BDBONOLCK to the open function of the appropriate database type (TCHDB or TCBDB).

/* This is for TCHDB */
TCHDB *hdb = tchdbnew();
 
if (!tchdbopen(hdb, "/path/to/broken_file", HDBONOLCK | HDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}
 
/* This is for TCBDB */
TCBDB *btree = tcbdbnew();
 
if (!tcbdbopen(btree, "/path/to/broken_file", BDBONOLCK | BDBOWRITER)) {
  /* Failed to open. Do the appropriate thing. */
}

(2) Run tchdboptmize() or tcbdboptimize() depending on the database type. You might wonder what you should give as the parameter for the optimize function. Conveniently, TC stores the tuning parameters of the database when you first opened it so you can just provide -1 as an argument _but_ the final one. This is because the final argument is an unsigned integer (uint8_t). What you want to provide instead is UINT8_MAX for this.

/* This if for TCHDB */
if (!tchdboptimize(hdb, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This hash database can't be rescued. */
}
 
/* This if for TCBDB */
if (!tcbdboptimize(btree, -1, -1, -1, -1, -1, UINT8_MAX)) {
  /* We're out of luck. This b+tree database can't be rescued. */
}

If you’re lucky, the above would repair the database that is associated with TC’s database object.

Using TC’s command line tool

This approach is more towards database admins since I’m sure the last thing they want to do is write their own program to get their work done. Lazyness is good.

TC provides a utility program called tchmgr (for a hash database) and tcbmgr (for a b+tree database) which allows you to run optimize on a database file. So if you wanted to repair a TC hash database, you would do the following:

$ tchmgr optimize -nl /path/to/broken_file

and the following for the B+Tree Database:

$ tcbmgr optimize -nl /path/to/broken_file

For those that are interested, the “-nl” option means “No Lock” which is required to repair a database file.

Well, I guess this sums up this blog post. I hope this post will help you administrate Tokyo Tyrant and/or your Tokyo Cabinet based application!

Toru Maesaka oss ,

Congrats to Kyoto Cabinet’s Alpha Release

January 1st, 2010

Writing this blog entry to congratulate Mikio for releasing Kyoto Cabinet (alpha release), which is positioned as a successor project of Tokyo Cabinet.

From development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. ASFAIK, C++03 was adopted so that KC can run on broader platforms than POSIX oriented systems (So, theoretically it can build on Windows). From project perspective, KC is licensed under GPLv3 whereas TC is licensed under LGPL.

For my projects, I currently have no interest in moving to KC from TC since I’m convinced that TC is better suited for my target platforms (UNIX/Linux). Another reason (VERY personal reason) is that I like C99 more than C++. Although… I don’t have the right to say this since I’m far from proficient at C++. For more details on the project differences, you should take a look at KC’s project page that Mikio had prepared:

For those that are interested in KC’s performance, I’ll do some benchmarks against TC when I get back from my end of year vacation. Hope everyone is enjoying New Years!

Toru Maesaka oss ,

End of Year Progress on BlitzDB

December 24th, 2009

FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling

My open source friends might have noticed that I’ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn’t look like I can reach that goal. However, looking at the brightside I got a lot done in the past few weeks so allow me to “journal” it in this blog post.

Agony of Knowing

The more I understood Drizzle’s storage mechanism and Tokyo Cabinet’s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB’s codebase. I was using pthread’s rwlock for concurrency control but I decided to design and write BlitzDB’s own lock mechanism to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you’d hope won’t be executed that often (people should use indexes!) but needless to say, it’s an important component of a relational storage engine so I’ve put in a lot of effort there.

Rewriting the Table Scanner

In the process of rewriting the table scanner, Jay Pipes’ gave me a fantastic advise on using Drizzle’s internal atomic type (drizzled::atomics). He gave me this advise because he noticed that my atomic ID generator was securing atomicity with pthread’s mutex. It is debatable that this mutex was only enabled for only few CPU instructions but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler’s builtin function can gain you 3x throughput.

Hacking on Index Support

I’ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB’s index is a dense clustered b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for “needle in a haystack” like queries only. An example of a needle in a haystack like query is:

SELECT * FROM TABLE WHERE primary_key_column = whatever;

Saying that, I don’t like to force people to do things the way I like so I plan on providing best of both worlds by supporting both data structures for PKs in Version-2:

CREATE TABLE t1 (id int, PRIMARY KEY(id) USING btree) ENGINE=blitzdb;
CREATE TABLE t1 (id int, PRIMARY KEY(id) USING hash) ENGINE=blitzdb;

BlitzDB’s default configuration will use PK as a “direct” data dictionary index. If you wish to do range queries on PK, the solution is to create a index on the PK column.

Primary Key lookup Performance

So, how does my implementation perform? Here’s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:

CREATE TABLE t1 (id int PRIMARY KEY, a int, b int) ENGINE=blitzdb;

and the query looks like this:

SELECT * FROM t1 WHERE id = random_number_under_one_million;

The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2x4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah… this sucks so I can’t claim that this benchmark is 100% creditable.

Here is the result I obtained from skyload. Please only view it as a guideline to BlitzDB’s lookup performance. I’ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.

[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1

As you can see, “needle in a haystack” queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of 0.058 milliseconds to process a query.

Future Plans

Admittedly, primary key support isn’t completely done so I’ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I’ll ask the Drizzle Community to consider merging BlitzDB into Drizzle’s trunk. This is my goal for BlitzDB at the moment.

I also happen to own blitzdb.com so I’m planning on putting user documentation (including tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won’t happen until I get Version-1 released. All I can say about the release schedule at the moment is, “before the MySQL conference in april”.

So, that’s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don’t trip on ice :)

Toru Maesaka drizzle, oss , ,

Storage Engine Tests in Drizzle. Organized!

December 13th, 2009

Good news to storage engine developers. In Drizzle, you can now place your engine specific test files (.test and .result) in your engine’s directory. Here’s an example in BlitzDB:

First, let’s look inside BlitzDB’s directory.

$ ls -l blitzdb/
Total 60
-rw-r--r-- 1 maesaka maesaka   649 2009-12-13 20:51 AUTHORS
-rw-r--r-- 1 maesaka maesaka  5878 2009-12-13 20:51 blitzdata.cc
-rw-r--r-- 1 maesaka maesaka  3347 2009-12-13 20:51 blitzlock.cc
-rw-r--r-- 1 maesaka maesaka 18146 2009-12-13 20:51 ha_blitz.cc
-rw-r--r-- 1 maesaka maesaka  8360 2009-12-13 20:51 ha_blitz.h
-rw-r--r-- 1 maesaka maesaka   289 2009-12-13 20:51 plugin.ac
-rw-r--r-- 1 maesaka maesaka   261 2009-12-13 23:51 plugin.ini
drwxr-xr-x 4 maesaka maesaka  4096 2009-12-13 23:51 tests

Notice the final line? that’s where the tests are kept. So, let’s look inside it.

$ ls -l blitzdb/tests/
Total 8
drwxr-xr-x 2 maesaka maesaka 4096 2009-12-13 23:51 r
drwxr-xr-x 2 maesaka maesaka 4096 2009-12-13 23:51 t

As you can see, there are two directories. By now, storage engine developers would have caught on to what’s going on. The r/ directory is where the .result files are kept and t/ is where the .test files are kept. This is exactly the same layout as what we’re used to working on (“src/tests/t/” and “src/tests/r/”).

$ ls -l blitzdb/tests/t/
Total 8
-rw-r--r-- 1 maesaka maesaka   21 2009-12-13 23:51 blitzdb-master.opt
-rw-r--r-- 1 maesaka maesaka 1964 2009-12-13 23:51 blitzdb.test

The .opt file is used to make sure that the server is started with your storage engine enabled. You simply write the startup option inside the .opt file. Here’s what mine looks like at the moment (there’s only a single line in it).

$ less blitzdb/tests/t/blitzdb-master.opt
--plugin_add=blitzdb
blitzdb/tests/t/blitzdb-master.opt (END)

Next step is actually running it. You simply specify your engine name with the --suite option to dtr and you’re done! Unfortunately the symlink permission for dtr seems broken on my repository so I’ll directly call test-run.pl in this example.

$ ./test-run.pl --suite=blitzdb
Logging: ./test-run.pl --suite=blitzdb
MySQL Version 2009.12.1245
Use of uninitialized value in scalar assignment at ./test-run.pl line 1416.
Using MTR_BUILD_THREAD      = -69.4
Using MASTER_MYPORT         = 9306
Using MASTER_MYPORT1        = 9307
Using SLAVE_MYPORT          = 9308
Using SLAVE_MYPORT1         = 9309
Using SLAVE_MYPORT2         = 9310
Using MC_PORT               = 9316
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
=======================================================
DEFAULT STORAGE ENGINE: innodb
TEST                           RESULT         TIME (ms)
-------------------------------------------------------
 
blitzdb.blitzdb                [ pass ]             63
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 0.063 of 2 seconds executing testcases

That’s it! I really like this change since it makes sense for engine-specific tests to belong inside the storage engine’s directory. It makes conceptual sense and it’s a good step towards differentiating the database kernel and the storage engine, which Monty Taylor is actively hacking on. Hopefully he’ll blog more about these changes soon.

Toru Maesaka drizzle, oss ,