Archive

Archive for the ‘drizzle’ Category

BlitzLock and RWLOCK Comparison

November 24th, 2009

As pointed out by Jay Pipes, I thought it would be nice to test and publish how BlitzLock performs against what I was originally intending on using for BlitzDB (pthread’s rwlock). So, I asked my colleagues in the operations group at Mixi to test BlitzLock on nice hardware that I don’t have access to. They kindly accepted and ran the BlitzLock sandbox on a 16 core machine running Fedora.

If you haven’t read my previous entry on BlitzLock and why I started writing it, you should. This entry won’t make sense otherwise.

Disclaimer

Before I step any further, please remember that I’m not trying to say BlitzLock is better than pthread’s rwlock. My interest is to write a lock mechanism that is optimized for Tokyo Cabinet (TC). What I wanted to gain from this test was to see if BlitzLock has enough potential for me to keep working on it.

Method

There were three kinds of workloads: “Read Oriented”, “Write Oriented” and “Neutral”. Read Oriented test has a 70% probability that each thread will call a read routine, whereas Write Oriented is the opposite where there is a 70% chance that the table state will be changed. In the Neutral test, both read and update calls have an even chance of being called. The seed value for the random number generator was identical for all tests.

Each worker sleeps for 10 milliseconds in the critical section and another 10 milliseconds right after it releases the lock. This was done to help cause context switching. Each test ran for 60 seconds.

You can obtain the standalone BlitzLock sandbox from here. I’ll upload a test friendly version that can accept startup options soon (I _really_ need to tidy it up).

Results

Below is a result from a load emulation where there was significantly more read calls than updates.

BlitzLock Benchmark (1)

As seen above, BlitzLock is nicely scaling the workload without exhausting update threads. This is important since one of the concerns involved in the current implementation of BlitzLock is starvation (covered later). I think the read/write ratio above is the sort of ratio that is typically seen in the web industry and something I’m mostly concerned with. So how about a write intensive application? Next graph is a result of when there is significantly more update operations than read.

BlitzLock Benchmark (2)

As seen above, BlitzLock is nicely scaling update tasks without neglecting readers. Compared to the first graph, we’re seeing an opposite result between update and scanner threads which is expected due to the nature of BlitzLock. This is exactly what I was hoping to gain. Next graph is a result from when there is an even chance of read and update operations to occur.

BlitzLock Benchmark (3)

As seen above, the throughput evens out for both read and update operations. I was expecting pthread’s rwlock to show noticeably lower update throughput than read (since it’s a single writer lock) but it turned out to even out. I’m not quite sure how I should interpret this but I guess the writer’s lock had a greater priority than reader’s lock in the environment that the test was run in. Nevertheless, this “even out” characteristic is something I’d like to welcome.

From Here and Weaknesses

I’m convinced to keep working on BlitzLock and use it as the default locking mechanism for BlitzDB. Ideally I should code BlitzDB to be able to switch between various locking mechanisms. This would make my life much easier when someone decides to write a locking mechanism that is better than BlitzLock for my use-case.

Thanks to Paul McCullagh‘s feedback, I’ve come to realize that BlitzLock was broadcasting more often than it needs to. Functionally it still works but I should be able to save CPU usage by applying Paul’s feedback (thanks Paul!). There is also the potential lock starvation problem (when certain types of threads hog the lock) that I need to further investigate. If it’s going to cause noticeable issues, I’ll have to add a condition to BlitzLock saying “certain number of threads can obtain a certain lock at once”.

There is still another minor scheduling logic that I need to throw into BlitzLock but once I get that done (along with testing), I can integrate BlitzLock into BlitzDB and see how it performs (I can then hack on indexing!).

Yep, there’s still quite a bit to do but I’m having fun :)

Toru Maesaka drizzle, oss , , , ,

BlitzDB and Tokyo Cabinet Concurrency Model

November 19th, 2009

Yesterday I sat in front of a whiteboard for few hours with Mikio, the author of Tokyo Cabinet discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I’m going to note it on this entry. But before I step any further, allow me to go over Tokyo Cabinet’s concurrency model.

Tokyo Cabinet’s Concurrency Model

Tokyo Cabinet is often quoted as “single writer, multi reader” but this is not quite true. At the time of this blog entry, this statement holds true for TC’s B+Tree database but TC’s hash database can actually allow multiple writers to update and/or delete records concurrently.

If you look at the entry point of tchdbput(), you will notice that it is actually obtaining a reader’s lock (in terms of rwlock). TCHDB then hashes the provided key and obtains the bucket index number where the record of interest belongs to. Given the bucket/block to work on, TC then looks at the 8 most significant bits of the hash value and attempts to obtain a granular update lock from slots of 256 mutexes (2 ^ 8 = 256). So, things are still concurrent at this stage though there are some chances of collision that would block a thread.

If a record already exists, TC will go on and happily update that block but if the record is new (as in the key doesn’t exist), TC will lock the tail block of the database and write the new record there. So, only writing a new record is treated as a single writer and the rest can be processed concurrently. This is why I said it’s not quite true.

BlitzDB’s Concurrency Model

Taken the above into mind, this is what BlitzDB’s concurrency model looks like:

  1. SELECT queries can run concurrently.
  2. SELECT queries are blocked when UPDATE and/or REPLACE queries are being processed.
  3. UPDATE, REPLACE, DELETE queries can run concurrently.
  4. INSERT is never disrupted by BlitzDB and scheduled by TC.

In an ideal world, I would allow Drizzle’s worker threads to _directly_ interact with TC and let TC handle thread synchronization. This would make my life fantastically easy but unfortunately life isn’t so easy.

For example, if a record is deleted while BlitzDB’s table scan is occurring, the table scanner will stop scanning at the position where the deleted key existed. I would not have this problem if I used TC’s native iterator but my table scan implementation uses TC’s hidden API that won’t babysit me in this regard. In return I can gain maximum concurrent read throughput from TC which was a tradeoff I happily accepted.

So, there are several little gotchas like this which forces me to implement concurrency control in BlitzDB. Here’s how I’m planning on doing it (with demo code!).

Implementation (with demo code)

In the past I’ve gone through several experimental stages with BlitzDB where I used pthread’s rwlock to control concurrency. Short answer to the result is, “IT WORKS!”. However it was not taking full advantage of TC’s concurrency model.

For example I did not want to protect UPDATE queries with a writer’s lock since it would block other UPDATE/DELETE queries. So why not protect it with a reader’s lock? The issue here is that any query that can change the state of the table cannot be processed while a scanner is running (which btw is protected by a reader’s lock). Furthermore, a non-index based update/delete means that the scanner _is_ running so there’s a problem there too.

What I need is a scheduler that can allow multiple INSERT/UPDATE/REPLACE/DELETE queries to run when the scanner is not running. On the other hand the scheduler must allow multiple scanners to run when an UPDATE/REPLACE/DELETE queries aren’t being processed _BUT_ let INSERT queries come through to TC.

Implementing the above is probably possible by using multiple mutexes but it would bring complexity to the codebase and possible deadlocks that can be difficult to debug. So we decided to learn from pthread’s rwlock implementation and write an original lock mechanism similar to rwlock but something that allows us to write our own rules for scheduling.

Here’s my first attempt at a standalone sandbox of the model:

You can compile and run it to get a grasp of how threads are coordinated:

$ wget http://torum.net/code/cc/blitzlock.cc
$ g++ -Wall -pedantic blitzlock.cc -lpthread && ./a.out

If you got the program running and wondering what the output means, think of the “updater” as a thread that performs either UPDATE, REPLACE or DELETE.

There are much more that I’d love to go on about but I think I’ve bloated this entry enough so I will save my urge for another day :)

Toru Maesaka drizzle, oss , , , , ,

TC Concurrency Model and BlitzDB Part 1

November 7th, 2009

Recently I started rewriting BlitzDB because I’ve come to realize the mistakes I’ve made from getting a better understanding of the Drizzle Storage API and Tokyo Cabinet internals. Admittedly a rewrite is an exaggeration because I’ll be reusing most of the components but more in a C++ way.

One decision I decided to make is that BlitzDB will only support a BTREE index via TC’s B+Tree API in it’s first release. Ignoring BlitzDB for now, several people I’ve talked to about key/value data structures often ask why I love B+Tree so much when it’s faster to work with a hash table. Please don’t take it wrong, O(1) operations are beautiful and I love hash tables but stereotyping key/value structures to it is not. Everything has it’s ups and downs and hash table/map is not an exception. In this blog entry, I will describe why B+Tree is good for index scanning.

Why a B+Tree Index

A search algorithm of O(1) like hashing is clearly faster than O(log n) unless there’s something fishy about the implementation or the dataset is too small for the time complexity to matter. However, this is only true for looking up and fetching the value. For those that are only interested in fetching a particular value, that’s probably the best you can ask for. However things are different if you look into things beyond lookups like fetching or scanning through a range of keys.

To do this with a typical hash table, either your data structure must be able to provide a list of stored keys OR your application must do some housekeeping and save a list of relevant keys elsewhere for future use. Your application would then need to compute the subset of keys that you’re interested in and fetch them with a loop. Algorithmically speaking, each fetch operation is O(1) but what’s expensive here is that you end up doing a lot of random access. This is obviously going to kill your performance, especially when you need to chew through a heavy workload (though this _could_ change when SSD becomes standard).

B+Tree on the other hand is fantastic for this use-case. The actual data are stored at the leaf node and they are usually logically linked so that you don’t need to re-traverse the tree to get the next greater key (if you run out of relevant pages in the node, you move on to the neighbor leaf node). The pages are aligned on disk, which means sequential access. Another bonus is that most of the time, you can keep the entire internal nodes on memory which is small and inexpensive but effective for searching.

Solution to this? well, mine is to implement a combination play of the two data structures and take advantage of the different characteristics. In BlitzDB, the actual rows are stored in TC’s Hash Database and the index will store keys to the row. So, a clustered index.

What I’ve mentioned so far is all theoretical without providing any benchmark results but all I’m trying to say is, it’s all about access patterns and use-cases. My current interest is in index scan and therefore the decision. However, if there is enough people that asks me for a HASH index, I can write that functionality relatively easily later on :)

Next Stop

I would love to keep writing but it is currently past 3am in Japan and I’m dozing out here. Apologies for not covering Tokyo Cabinet’s in-depth concurrency model but I will cover it in my next post of the series and how this impacts BlitzDB’s design.

Toru Maesaka drizzle, knowledge , , ,

Drizzle Storage Engine Alias!

November 6th, 2009

Admittedly I’m a lazy guy. Due to this nature I’m a little behind on the updates made to the Drizzle Storage Engine API. From lurking through the source code in the Drizzle trunk, I’ve noticed these changes.

  • Engines can now have an alias
  • handler class is replaced by the Cursor class
  • Engine handler is now a subclass of Cursor
  • Table definitions are handled/stored by the storage engine
  • doCreateTable(), doDropTable(), doGetTableNames(), doGetTableDefinition()

Currently I’m trying to catch up with the updated Drizzle Storage API and take this opportunity to rewrite most of BlitzDB. The reason is that the more I understand TC internal, the more mistakes I realized that I’ve made. I’ll blog more about this soon. Instead, I’m going to introduce something small but nice today.

A while back I poked folks like Monty Taylor and Stewart Smith that it would be cool if engines could have an alias. I mentioned this because InnoDB was allowed to use both “innodb” and “innobase” in the system whereas other engines could only have one name. Another reason I was interested in this issue was that I couldn’t understand how InnoDB could use two names since there was no way to do this in the old interface. Turns out InnoDB was treated specially in the core, which is obviously not desirable in a microkernel philosophy.

In the current Drizzle Storage Engine API, there is a function called addAlias(). By calling this function inside the storage engine constructor, you can allow your engine to have multiple aliases. For experiment purposes I wrote this to BlitzDB:

BlitzEngine(const string &name_arg)
  : drizzled::plugin::StorageEngine(name_arg, HTON_CAN_RECREATE) {
  table_definition_ext = drizzled::plugin::DEFAULT_DEFINITION_FILE_EXT;
  addAlias("BLITZ");
  addAlias("TCDB");
}

Here, I added aliases BLITZ and TCDB. TCDB is my way of showing respect to Mikio and Tokyo Cabinet. So given the above, we should now be able to create tables with three names.

drizzle> CREATE TABLE t1 (foo int) engine=blitzdb;
Query OK, 0 rows affected (0 sec)
 
drizzle> CREATE TABLE t2 (foo int) engine=blitz;
Query OK, 0 rows affected (0 sec)
 
drizzle> CREATE TABLE t3 (foo int) engine=tcdb;
Query OK, 0 rows affected (0.01 sec)

Success! Yep, this is something so trivial that I think most people wouldn’t care about but I was happy to see this update in the trunk :)

Toru Maesaka drizzle , ,

BlitzDB and Keyless Tables

October 9th, 2009

Previously you couldn’t create a table without defining a primary key with BlitzDB. This actually sounds like a nice constraint since you should always define a primary key. However, I went ahead and made this possible since one of the reasons that I’m developing BlitzDB is to get a better understanding of how the MySQL/Drizzle storage subsystem works. So, implementing a hidden key-generator and using it internally was something I wanted to do for sometime.

Previously this is what you got if you tried to create a table without a primary key:

drizzle> create table t1 (col1 int, col2 int, col3 text) engine=blitz;
ERROR 1173 (42000): This table type requires a primary key

Now:

drizzle> create table t1 (col1 int, col2 int, col3 text) engine=blitz;
Query OK, 0 rows affected (0 sec)

Inserting rows work as you would expect:

drizzle> insert into t1 values (1, 1, "first row");
Query OK, 1 row affected (0 sec)
 
drizzle> insert into t1 values (1, 2, "second row");
Query OK, 1 row affected (0 sec)
 
drizzle> insert into t1 values (1, 3, "third row");
Query OK, 1 row affected (0 sec)
 
drizzle> insert into t1 values (2, 1, "fourth row");
Query OK, 1 row affected (0 sec)
 
drizzle> insert into t1 values (2, 2, "fifth row");
Query OK, 1 row affected (0 sec)
 
drizzle> insert into t1 values (2, 3, "sixth row");
Query OK, 1 row affected (0 sec)

Selecting rows works fine although since there isn’t a key column in this table, every operation would require a full table scan which is not sexy:

drizzle> select * from t1;
+------+------+------------+
| col1 | col2 | col3       |
+------+------+------------+
|    1 |    1 | first row  | 
|    1 |    2 | second row | 
|    1 |    3 | third row  | 
|    2 |    1 | fourth row | 
|    2 |    2 | fifth row  | 
|    2 |    3 | sixth row  | 
+------+------+------------+
 
drizzle> select * from t1 where col1 = 1;
+------+------+------------+
| col1 | col2 | col3       |
+------+------+------------+
|    1 |    1 | first row  | 
|    1 |    2 | second row | 
|    1 |    3 | third row  | 
+------+------+------------+
3 rows in set (0 sec)
 
drizzle> select * from t1 where col2 = 2;
+------+------+------------+
| col1 | col2 | col3       |
+------+------+------------+
|    1 |    2 | second row | 
|    2 |    2 | fifth row  | 
+------+------+------------+
2 rows in set (0 sec)

How the internal works

BlitzDB does what most people would assume. It atomically generates a sequential unsigned 64bit integer then if necessary, converts it to big endian (network byte order). It then uses that value as a key to store the row into TC. The auto-generated key is made sure to be big-endian because I want BlitzDB tables to work on all platforms. That is, admins should be able to copy the “data files” over to another server and happily keep using the database. Keys are converted and _always_ used as little-endian inside BlitzDB.

Next Step

There’s still some bits and pieces on update related code that I need to work on but in general things are looking good. When I get those tasks done, I can then start working on supporting secondary index which I have cool ideas for.

Toru Maesaka drizzle, oss ,

BlitzDB Primary Key Based Insertion Performance

July 17th, 2009

Like most things, I think storage engine development is about divide and conquer. The first sub-problem that I’m tackling with BlitzDB is squeezing as much juice out as possible from Tokyo Cabinet to achieve fast write performance. This by the way happens to be the primary reason that I wrote skyload.

Writing skyload turned out to be worthwhile since it helped me find several critical bugs in the engine that only occurred under concurrent insertion load. Thanks to Kazuho Oku for helping me through the issues that I was facing.

I think I’ve now reached a stage where I can share how well BlitzDB can perform insertion from concurrent connections. But before moving ahead, I’d like to emphasize that for a real guideline, I believe that performance comparison should be done by an unbiased third party. So please don’t take the results in this post as the “truth”. Heh, I did write both the storage engine and the load emulator after all :)

So, with the above in mind, here’s a skyload result on inserting one-hundred-thousand rows under different concurrency levels with BlitzDB and MyISAM (both engines under default configuration).

Skyload Result - MyISAM and BlitzDB

Figures presented above are calculated from an average of 5 runs per each concurrency level. Admittedly, an average from 5 runs is not sufficient to claim credibility of my result since the figures can easily be affected by the dirty buffer flush between the kernel and the filesystem (ext3 in this particular benchmark). For this, I plan on extending skyload to run multiple runs of an identical test and compute the median and average.

For those that are interested, this is what the test table looks like:

CREATE TABLE t1 (
    id int PRIMARY KEY,
    col1 int,
    col2 double,
    col3 varchar(255)
) ENGINE=blitz;

I didn’t bother benchmarking anything beyond 32 connections since I ran both the client and the server on the same quad core machine (there’s no point). This is probably why you can see a nice curve up to four concurrent connections with BlitzDB in the graph. Yet another reason why you should not believe everything I’ve provided in this entry.

BlitzDB needs you

BlitzDB is still very early in it’s making and I still have insane amount of work to do. For example, BlitzDB currently requires you to supply a primary key on your table. I plan on removing this limitation by generating a “fake” primary key internally but I still haven’t got around to it at this point.

Support for multiple indexes is not done yet despite having all the necessary components to achieve it. I could do all this on my own but I prefer not to. I’m totally open for ideas and contributors. If you’re interested in this storage engine project, please don’t hesitate to ping me (dev @ this domain) or the Drizzle community. More eyeballs the merrier :)

Toru Maesaka drizzle, oss , ,

Notes on changes made to the Drizzle Storage Subsystem

July 9th, 2009

Yesterday I merged the BlitzDB tree with Drizzle‘s trunk for the first time in a long time (yeah…) and discovered some interesting changes made to the storage subsystem while I was away.

Previously all functions that caused an action to the storage engine was a member of the handler class but various things like table creation and transaction related functions have now moved to the StorageEngine class. These changes are somewhat drastic but makes good sense for Drizzle to grow further since it makes the subsystem easier to understand and frees Drizzle from the interface design that was strongly affected by MyISAM. For those that are interested, the StorageEngine class is located in “drizzled/plugin/storage_engine.h”.

For me it was pretty easy to update BlitzDB to work with the new subsystem since I don’t have anything special in the engine that required me to use my brain. I only had to move bas_ext(), table creation and rename functions over to the StorageEngine class and adjust it to the new interface:

int createTableImpl(Session *session, const char *table_name, 
                    Table *table_arg, HA_CREATE_INFO *ha_create_info); 
 
int renameTableImpl(Session *session, const char *from, const char *to);

For a real example, I recommend comparing the old InnobaseEngine class declaration with the updated one. As for where this redesign is going, this is the answer I got on the Drizzle channel from Stewart who did the actual work for all this.

stewart: tmaesaka: the basic idea is that handler becomes a cursor. the StorageEngine is for actions on the engine.
stewart: tmaesaka: and handler is a cursor on a table.

Something to keep in mind if you’re thinking about creating or porting a storage engine to Drizzle :)

Toru Maesaka drizzle, oss , ,

Introducing skyload: a libdrizzle based load emulator

July 7th, 2009

Today, I would like to introduce “skyload“, a small project that I’ve been working on for the last couple of weeks. In brief, skyload is a libdrizzle based load emulation tool that is capable of running concurrent load tests against database instances that can speak Drizzle (and/or) the MySQL protocol.

Something I’d like to emphasize here is that, skyload is not a replacement for mysqlslap or drizzleslap since it only provides a subset of what they can do. As I’ve stated on the project description, skyload is designed to do a good job at this subset of tasks by giving you more control over how you emulate the load in an intuitive way. For instructions on installing skyload and quickly getting up to speed, take a look at the following URL:

As you will see, the first release only contains bare minimum specifications (only INSERT load emulation). The next step I want to take is to discuss features that other storage engine developers would actually find useful. This is because I started writing skyload for primarily myself and other storage engine developers (more on this next).

Original Intentions

I originally began writing skyload for BlitzDB development since I wanted to see the concurrent insertion performance of Tokyo Cabinet based row storage mechanism that I wrote. I first tried benchmarking the write performance with drizzleslap but it turned out that drizzleslap’s original code (inherited from MySQL 6.0) is rather buggy and segfaulted quite easily (I’m planning on contributing a fix for this).

So I gave up on drizzleslap for the time being and started looking at the sysbench port for Drizzle that Monty Taylor has been working on:

Sysbench for Drizzle is a lovely piece of software but it couldn’t quite provide what I was looking for (concurrent insertion benchmark). After having a quick conversation with Monty about my requirements on the Drizzle IRC channel, I decided to write a libdrizzle based benchmark tool that can be used for both Drizzle and MySQL.

Future Plans

I don’t want to reinvent existing software that works (or those that can be fixed). The project positioning that I’m hoping for skyload is a good mix between (mysql|drizzle)slap and sysbench. Hopefully it will be useful to folks that works on Drizzle and MySQL related projects.

I’m totally open for ideas, patches, and contributors. If this project had caught your attention, please don’t hesitate to ping me or the Drizzle community :)

I haven’t setup a mailing list since I don’t see the need for it yet so if you’d like to share your thoughts I think either the Drizzle mailing list or IRC (#drizzle @ irc.freenode.net) is the quickest way for me to get back to you.

Happy Hacking!

Toru Maesaka drizzle, oss , , , , ,