Archive

Posts Tagged ‘drizzle’

DATE type under the hood in Drizzle/MySQL

March 1st, 2010

Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.

  • Comparison Function has a defect.
  • Key Generator has a defect.

The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle’s native “field packer” (except for VARCHAR) inherited from MySQL. This would mean that Drizzle’s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people’s code. So, I decided to look into the comparison function which was completely written by me. Turned out that’s where the bug was.

Comparison Function

Allow me to quickly clarify what I mean by “comparison function” in this context. TC’s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involves traversing.

bool tcbdbsetcmpfunc(TCBDB *bdb, TCCMP cmp, void *cmpop);

What BlitzDB’s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in blitzcmp.cc (blitz_keycmp_cb).

DATE under the hood

After inspecting the “type number” with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I’ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn’t take this type into account in the comparator and it also showed how silly I am since the answer was always there.

Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)

P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data on memory. Respecting alignment in this environment should make some difference. Although, I can only say this after benchmarking it of course.

Toru Maesaka drizzle, oss , ,

Speaking at the MySQL Conference 2010

February 18th, 2010

I’m a little behind in announcing this but I’m going to be speaking at O’Reilly’s MySQL Conference this year. My presentation is a three hour tutorial titled, Drizzle Storage Engine Development. Practical Example with BlitzDB. Three hours is a long time but I assure you that there will be a break.

This session isn’t solely about going through Drizzle’s Storage Engine API. Various performance topics like B+Tree structure, memory handling and concurrency control will be covered. I will also go through BlitzDB’s design concept and it’s internal stuff. So, needless to say I’ll talk a lot about Tokyo Cabinet and it’s internals as well.

Hopefully those that come along will walk out of the tutorial standing far ahead of the start line. It will help you get started on reading the implementation of other storage engines in the MySQL ecosystem (MyISAM, InnoDB, PBXT, Federated and so forth). Better yet you will start writing one.

Looking forward to seeing you there :)

Toru Maesaka drizzle, event, oss, travel , , ,

Progress on BlitzDB’s Index Component

February 18th, 2010

I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time spent on BlitzDB for the last couple of weeks have been studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls off the Indexing API.

The actual coding wasn’t too bad thanks to Tokyo Cabinet’s awesome B+Tree API. I’ve been busier adding new tests and fixing silly bugs as they arise. I also implemented the Primary Key optimization that I blogged about a while back. As a result of all this, the following goodness has been added to BlitzDB’s Trunk.

  • Index Lookup
  • Forward Index Scan
  • Reverse Index Scan

This means that BlitzDB is now equipped with both a Table Scanner and an Index Scanner which are two essential components for a general purpose storage engine. As much as I’d like to work on optimizing the code and adding features (like recovery), I’m going to take a break and spend the rest of the month working on testing and debugging. There’s no point in adding features if the base has notable flaws in it.

Challenges Encountered

Writing the Index Scanner itself is easy. The most difficult thing that slowed me down was developing the comparison function for index keys. The end result was a simple piece of code but I had to study various things before I could start writing any code.

  • How to respect collation
  • How keys are represented internally
  • How types are represented internally
  • How to write a custom comparison function for Tokyo Cabinet
  • … and so on

I’ve also started using Evernote to jot down my spontaneous ideas on optimizing BlitzDB. I’ve made these notes public and they will most likely be updated while I’m commuting on the train.

There are much more that I’d like to write about like how I intend on developing the table recovery routine without simply using TC’s recovery mechanism but I shall restrain myself for another day.

Toru Maesaka drizzle, oss , ,

Notes on HEAP/MyISAM Index Key Handling on WRITE

January 26th, 2010

Disclaimer: This post is based on HEAP/MyISAM’s sourcecode in Drizzle.

Here are my brief notes on investigating how index keys are generated in HEAP and MyISAM. I lurked through these because I’ve started preparing for decent index support in BlitzDB. I also wrote this to assist my biological memory for later grepping (I have terrible memory for names). I’m only going to cover key generation on write in this post. Otherwise this post is going to be massive.

HEAP Engine

The index structure of HEAP can be either BTREE or HASH (in MySQL doc terms). Like other engines HEAP has a structure for keeping Key definition (parts, type, logic and etc). This structure is called HP_KEYDEF and it contains function pointers for write, delete, and getting the length of the key. These function pointers are assigned to at table creation or when the table is opened. The assigned function depends on the data structure of the index and it can be either of the following:

BTREE

  • hp_rb_write_key()
  • hp_rb_delete_key()

HASH

  • hp_write_key()
  • hp_delete_key()

As for get_key_length(), either of the following functions are used for both data structures.

  • hp_rb_var_key_length()
  • hp_rb_null_key_length()
  • hp_rb_key_length()

When writing a row to the tree, HEAP writes to the index using a key generated by hp_rb_make_key(). Note that it does not use this for the hash index. The generated key is populated inside ‘recbuffer’ in HEAP’s handler object (HP_INFO structure).

From my understanding, it loops through the key segments (I suspect it is similar the internal KEY_PART_INFO structure) and appropriately copies each key field value to the output buffer. By meaning “appropriately” it respects the characteristics of the data type when packing the buffer. For example, for a variable length field, it will only copy the actual data and not the max possible size of it. The final byte that is copied to the buffer is the address of the chunk where the record lives.

MyISAM Engine

The upper layer of key handling in MyISAM looks somewhat similar to HEAP so you can really tell that it was written by the same people. Things are nicely wrapped together by the MYISAM_SHARE structure so it’s relatively easy to follow. BlitzDB has a class called BlitzShare for the same purpose (This is based off Archive Engine’s ArchiveShare class).

Like HEAP, MyISAM has a structure for individual key definition called MI_KEYDEF (it’s defined in myisam.h). There are more function pointers in this structure than HEAP.

  • bin_search()
  • get_key()
  • pack_key()
  • store_key()
  • ck_insert()
  • ck_delete()

In Drizzle, _mi_ck_write() is assigned to ck_insert() which is the entry point to writing a MyISAM index. The key that MyISAM uses to write to the index is generated by _mi_make_key(). Like HEAP, it will loop through the key segments and pack the relevant fields accordingly to the characteristic of the data type. The output buffer belongs to MyISAM’s hander (lastkey2).

From Here

I’ve actually written a naive key generator for BlitzDB already based on Drizzle/MySQL’s internal KEY_PART_INFO array. It seems to be working on EXACT MATCH but I still need to implement an index scanner which looks much harder to pull off than a table scanner. What I’m really worried about is supporting composite indexes (namely reading/searching on it) but hopefully I’ll understand how this area of the storage system works soon.

Toru Maesaka drizzle, knowledge, oss , ,

End of Year Progress on BlitzDB

December 24th, 2009

FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling

My open source friends might have noticed that I’ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn’t look like I can reach that goal. However, looking at the brightside I got a lot done in the past few weeks so allow me to “journal” it in this blog post.

Agony of Knowing

The more I understood Drizzle’s storage mechanism and Tokyo Cabinet’s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB’s codebase. I was using pthread’s rwlock for concurrency control but I decided to design and write BlitzDB’s own lock mechanism to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you’d hope won’t be executed that often (people should use indexes!) but needless to say, it’s an important component of a relational storage engine so I’ve put in a lot of effort there.

Rewriting the Table Scanner

In the process of rewriting the table scanner, Jay Pipes’ gave me a fantastic advise on using Drizzle’s internal atomic type (drizzled::atomics). He gave me this advise because he noticed that my atomic ID generator was securing atomicity with pthread’s mutex. It is debatable that this mutex was only enabled for only few CPU instructions but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler’s builtin function can gain you 3x throughput.

Hacking on Index Support

I’ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB’s index is a dense clustered b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for “needle in a haystack” like queries only. An example of a needle in a haystack like query is:

SELECT * FROM TABLE WHERE primary_key_column = whatever;

Saying that, I don’t like to force people to do things the way I like so I plan on providing best of both worlds by supporting both data structures for PKs in Version-2:

CREATE TABLE t1 (id int, PRIMARY KEY(id) USING btree) ENGINE=blitzdb;
CREATE TABLE t1 (id int, PRIMARY KEY(id) USING hash) ENGINE=blitzdb;

BlitzDB’s default configuration will use PK as a “direct” data dictionary index. If you wish to do range queries on PK, the solution is to create a index on the PK column.

Primary Key lookup Performance

So, how does my implementation perform? Here’s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:

CREATE TABLE t1 (id int PRIMARY KEY, a int, b int) ENGINE=blitzdb;

and the query looks like this:

SELECT * FROM t1 WHERE id = random_number_under_one_million;

The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2×4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah… this sucks so I can’t claim that this benchmark is 100% creditable.

Here is the result I obtained from skyload. Please only view it as a guideline to BlitzDB’s lookup performance. I’ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.

[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1

As you can see, “needle in a haystack” queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of 0.058 milliseconds to process a query.

Future Plans

Admittedly, primary key support isn’t completely done so I’ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I’ll ask the Drizzle Community to consider merging BlitzDB into Drizzle’s trunk. This is my goal for BlitzDB at the moment.

I also happen to own blitzdb.com so I’m planning on putting user documentation (including tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won’t happen until I get Version-1 released. All I can say about the release schedule at the moment is, “before the MySQL conference in april”.

So, that’s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don’t trip on ice :)

Toru Maesaka drizzle, oss , ,

Tips on Drizzle Development and Valgrind

December 1st, 2009

In brief, valgrind is a framework of awesome tools that does an amazing job at detecting memory errors. It will catch silly (often unexpected) mistakes and memory leaks that you’ve made in your code. IMHO, it’s a must have tool for open source hackers that work with Linux. If you develop a plugin or a storage engine for Drizzle/MySQL, you often end up wanting to test your program for memory errors. Actually, it’s not a “want”, it’s a MUST.

Conveniently by supplying a simple startup option, Drizzle and MySQL’s test runner will run the daemon process on valgrind’s virtual machine. I’m not sure about MySQL since I’ve never developed anything for it but at least with Drizzle you can run a test case independently by supplying the desired test name to the test runner.

 $ ./dtr your_test_file_name --valgrind

So, with BlitzDB this is what I do to isolate the test runner to only run my tests:

 $ ./dtr blitzdb.test --valgrind

Very simple.

The minor complication here is that the test runner will not output the valgrind report to the console and instead it writes the output to a file. So where is this file? the answer is, it’s written to the daemon’s error log which is located in the source tree:

$ less drizzle_src/tests/var/log/master.err
CURRENT_TEST: main.blitzdb
==24563== Memcheck, a memory error detector
==24563== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
...

Here’s another tip. If you ever wondered where the files that were generated in the test (like table and index files) are stored, they are stored inside the source tree as well. Here’s an example on my machine:

$ ll drizzle_src/tests/var/master-data/
total 20528
-rw-rw---- 1 tmaesaka tmaesaka 10485760 2009-12-01 22:06 ibdata1
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile0
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile1
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 mysql
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 test

So, with all that in mind, happy hacking :)

Toru Maesaka drizzle, knowledge, oss ,

Drizzle Storage Engine Alias!

November 6th, 2009

Admittedly I’m a lazy guy. Due to this nature I’m a little behind on the updates made to the Drizzle Storage Engine API. From lurking through the source code in the Drizzle trunk, I’ve noticed these changes.

  • Engines can now have an alias
  • handler class is replaced by the Cursor class
  • Engine handler is now a subclass of Cursor
  • Table definitions are handled/stored by the storage engine
  • doCreateTable(), doDropTable(), doGetTableNames(), doGetTableDefinition()

Currently I’m trying to catch up with the updated Drizzle Storage API and take this opportunity to rewrite most of BlitzDB. The reason is that the more I understand TC internal, the more mistakes I realized that I’ve made. I’ll blog more about this soon. Instead, I’m going to introduce something small but nice today.

A while back I poked folks like Monty Taylor and Stewart Smith that it would be cool if engines could have an alias. I mentioned this because InnoDB was allowed to use both “innodb” and “innobase” in the system whereas other engines could only have one name. Another reason I was interested in this issue was that I couldn’t understand how InnoDB could use two names since there was no way to do this in the old interface. Turns out InnoDB was treated specially in the core, which is obviously not desirable in a microkernel philosophy.

In the current Drizzle Storage Engine API, there is a function called addAlias(). By calling this function inside the storage engine constructor, you can allow your engine to have multiple aliases. For experiment purposes I wrote this to BlitzDB:

BlitzEngine(const string &name_arg)
  : drizzled::plugin::StorageEngine(name_arg, HTON_CAN_RECREATE) {
  table_definition_ext = drizzled::plugin::DEFAULT_DEFINITION_FILE_EXT;
  addAlias("BLITZ");
  addAlias("TCDB");
}

Here, I added aliases BLITZ and TCDB. TCDB is my way of showing respect to Mikio and Tokyo Cabinet. So given the above, we should now be able to create tables with three names.

drizzle> CREATE TABLE t1 (foo int) engine=blitzdb;
Query OK, 0 rows affected (0 sec)
 
drizzle> CREATE TABLE t2 (foo int) engine=blitz;
Query OK, 0 rows affected (0 sec)
 
drizzle> CREATE TABLE t3 (foo int) engine=tcdb;
Query OK, 0 rows affected (0.01 sec)

Success! Yep, this is something so trivial that I think most people wouldn’t care about but I was happy to see this update in the trunk :)

Toru Maesaka drizzle , ,

Notes on changes made to the Drizzle Storage Subsystem

July 9th, 2009

Yesterday I merged the BlitzDB tree with Drizzle’s trunk for the first time in a long time (yeah…) and discovered some interesting changes made to the storage subsystem while I was away.

Previously all functions that caused an action to the storage engine was a member of the handler class but various things like table creation and transaction related functions have now moved to the StorageEngine class. These changes are somewhat drastic but makes good sense for Drizzle to grow further since it makes the subsystem easier to understand and frees Drizzle from the interface design that was strongly affected by MyISAM. For those that are interested, the StorageEngine class is located in “drizzled/plugin/storage_engine.h”.

For me it was pretty easy to update BlitzDB to work with the new subsystem since I don’t have anything special in the engine that required me to use my brain. I only had to move bas_ext(), table creation and rename functions over to the StorageEngine class and adjust it to the new interface:

int createTableImpl(Session *session, const char *table_name, 
                    Table *table_arg, HA_CREATE_INFO *ha_create_info); 
 
int renameTableImpl(Session *session, const char *from, const char *to);

For a real example, I recommend comparing the old InnobaseEngine class declaration with the updated one. As for where this redesign is going, this is the answer I got on the Drizzle channel from Stewart who did the actual work for all this.

stewart: tmaesaka: the basic idea is that handler becomes a cursor. the StorageEngine is for actions on the engine.
stewart: tmaesaka: and handler is a cursor on a table.

Something to keep in mind if you’re thinking about creating or porting a storage engine to Drizzle :)

Toru Maesaka drizzle, oss , ,