Archive

Author Archive

Congrats to Kyoto Cabinet’s Alpha Release

January 1st, 2010

Writing this blog entry to congratulate Mikio for releasing Kyoto Cabinet (alpha release), which is positioned as a successor project of Tokyo Cabinet.

From development perspective, the big difference is that Kyoto Cabinet is implemented in pure C++03 whereas TC is pure C99. ASFAIK, C++03 was adopted so that KC can run on broader platforms than POSIX oriented systems (So, theoretically it can build on Windows). From project perspective, KC is licensed under GPLv3 whereas TC is licensed under LGPL.

For my projects, I currently have no interest in moving to KC from TC since I’m convinced that TC is better suited for my target platforms (UNIX/Linux). Another reason (VERY personal reason) is that I like C99 more than C++. Although… I don’t have the right to say this since I’m far from proficient at C++. For more details on the project differences, you should take a look at KC’s project page that Mikio had prepared:

For those that are interested in KC’s performance, I’ll do some benchmarks against TC when I get back from my end of year vacation. Hope everyone is enjoying New Years!

Toru Maesaka oss ,

Last Minute Christmas Shopping

December 24th, 2009

Last minute Christmas Shopping for myself. 23″ Full HD monitor with HDMI input (Mitsubishi RDT231WLM). I’ve been looking for a reasonable new monitor with HDMI for sometime so I’m pretty happy with this purchase. This is going to be used for PS3 and watching movies.

Last Minute Purchase

Merry Christmas.

Toru Maesaka random ,

End of Year Progress on BlitzDB

December 24th, 2009

FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling

My open source friends might have noticed that I’ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn’t look like I can reach that goal. However, looking at the brightside I got a lot done in the past few weeks so allow me to “journal” it in this blog post.

Agony of Knowing

The more I understood Drizzle’s storage mechanism and Tokyo Cabinet’s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB’s codebase. I was using pthread’s rwlock for concurrency control but I decided to design and write BlitzDB’s own lock mechanism to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you’d hope won’t be executed that often (people should use indexes!) but needless to say, it’s an important component of a relational storage engine so I’ve put in a lot of effort there.

Rewriting the Table Scanner

In the process of rewriting the table scanner, Jay Pipes’ gave me a fantastic advise on using Drizzle’s internal atomic type (drizzled::atomics). He gave me this advise because he noticed that my atomic ID generator was securing atomicity with pthread’s mutex. It is debatable that this mutex was only enabled for only few CPU instructions but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler’s builtin function can gain you 3x throughput.

Hacking on Index Support

I’ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB’s index is a dense clustered b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for “needle in a haystack” like queries only. An example of a needle in a haystack like query is:

SELECT * FROM TABLE WHERE primary_key_column = whatever;

Saying that, I don’t like to force people to do things the way I like so I plan on providing best of both worlds by supporting both data structures for PKs in Version-2:

CREATE TABLE t1 (id int, PRIMARY KEY(id) USING btree) ENGINE=blitzdb;
CREATE TABLE t1 (id int, PRIMARY KEY(id) USING hash) ENGINE=blitzdb;

BlitzDB’s default configuration will use PK as a “direct” data dictionary index. If you wish to do range queries on PK, the solution is to create a index on the PK column.

Primary Key lookup Performance

So, how does my implementation perform? Here’s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:

CREATE TABLE t1 (id int PRIMARY KEY, a int, b int) ENGINE=blitzdb;

and the query looks like this:

SELECT * FROM t1 WHERE id = random_number_under_one_million;

The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2x4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah… this sucks so I can’t claim that this benchmark is 100% creditable.

Here is the result I obtained from skyload. Please only view it as a guideline to BlitzDB’s lookup performance. I’ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.

[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1

As you can see, “needle in a haystack” queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of 0.058 milliseconds to process a query.

Future Plans

Admittedly, primary key support isn’t completely done so I’ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I’ll ask the Drizzle Community to consider merging BlitzDB into Drizzle’s trunk. This is my goal for BlitzDB at the moment.

I also happen to own blitzdb.com so I’m planning on putting user documentation (including tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won’t happen until I get Version-1 released. All I can say about the release schedule at the moment is, “before the MySQL conference in april”.

So, that’s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don’t trip on ice :)

Toru Maesaka drizzle, oss , ,

Storage Engine Tests in Drizzle. Organized!

December 13th, 2009

Good news to storage engine developers. In Drizzle, you can now place your engine specific test files (.test and .result) in your engine’s directory. Here’s an example in BlitzDB:

First, let’s look inside BlitzDB’s directory.

$ ls -l blitzdb/
Total 60
-rw-r--r-- 1 maesaka maesaka   649 2009-12-13 20:51 AUTHORS
-rw-r--r-- 1 maesaka maesaka  5878 2009-12-13 20:51 blitzdata.cc
-rw-r--r-- 1 maesaka maesaka  3347 2009-12-13 20:51 blitzlock.cc
-rw-r--r-- 1 maesaka maesaka 18146 2009-12-13 20:51 ha_blitz.cc
-rw-r--r-- 1 maesaka maesaka  8360 2009-12-13 20:51 ha_blitz.h
-rw-r--r-- 1 maesaka maesaka   289 2009-12-13 20:51 plugin.ac
-rw-r--r-- 1 maesaka maesaka   261 2009-12-13 23:51 plugin.ini
drwxr-xr-x 4 maesaka maesaka  4096 2009-12-13 23:51 tests

Notice the final line? that’s where the tests are kept. So, let’s look inside it.

$ ls -l blitzdb/tests/
Total 8
drwxr-xr-x 2 maesaka maesaka 4096 2009-12-13 23:51 r
drwxr-xr-x 2 maesaka maesaka 4096 2009-12-13 23:51 t

As you can see, there are two directories. By now, storage engine developers would have caught on to what’s going on. The r/ directory is where the .result files are kept and t/ is where the .test files are kept. This is exactly the same layout as what we’re used to working on (“src/tests/t/” and “src/tests/r/”).

$ ls -l blitzdb/tests/t/
Total 8
-rw-r--r-- 1 maesaka maesaka   21 2009-12-13 23:51 blitzdb-master.opt
-rw-r--r-- 1 maesaka maesaka 1964 2009-12-13 23:51 blitzdb.test

The .opt file is used to make sure that the server is started with your storage engine enabled. You simply write the startup option inside the .opt file. Here’s what mine looks like at the moment (there’s only a single line in it).

$ less blitzdb/tests/t/blitzdb-master.opt
--plugin_add=blitzdb
blitzdb/tests/t/blitzdb-master.opt (END)

Next step is actually running it. You simply specify your engine name with the --suite option to dtr and you’re done! Unfortunately the symlink permission for dtr seems broken on my repository so I’ll directly call test-run.pl in this example.

$ ./test-run.pl --suite=blitzdb
Logging: ./test-run.pl --suite=blitzdb
MySQL Version 2009.12.1245
Use of uninitialized value in scalar assignment at ./test-run.pl line 1416.
Using MTR_BUILD_THREAD      = -69.4
Using MASTER_MYPORT         = 9306
Using MASTER_MYPORT1        = 9307
Using SLAVE_MYPORT          = 9308
Using SLAVE_MYPORT1         = 9309
Using SLAVE_MYPORT2         = 9310
Using MC_PORT               = 9316
Killing Possible Leftover Processes
Removing Stale Files
Creating Directories
=======================================================
DEFAULT STORAGE ENGINE: innodb
TEST                           RESULT         TIME (ms)
-------------------------------------------------------
 
blitzdb.blitzdb                [ pass ]             63
-------------------------------------------------------
Stopping All Servers
All 1 tests were successful.
The servers were restarted 1 times
Spent 0.063 of 2 seconds executing testcases

That’s it! I really like this change since it makes sense for engine-specific tests to belong inside the storage engine’s directory. It makes conceptual sense and it’s a good step towards differentiating the database kernel and the storage engine, which Monty Taylor is actively hacking on. Hopefully he’ll blog more about these changes soon.

Toru Maesaka drizzle, oss ,

Tips on Drizzle Development and Valgrind

December 1st, 2009

In brief, valgrind is a framework of awesome tools that does an amazing job at detecting memory errors. It will catch silly (often unexpected) mistakes and memory leaks that you’ve made in your code. IMHO, it’s a must have tool for open source hackers that work with Linux. If you develop a plugin or a storage engine for Drizzle/MySQL, you often end up wanting to test your program for memory errors. Actually, it’s not a “want”, it’s a MUST.

Conveniently by supplying a simple startup option, Drizzle and MySQL’s test runner will run the daemon process on valgrind’s virtual machine. I’m not sure about MySQL since I’ve never developed anything for it but at least with Drizzle you can run a test case independently by supplying the desired test name to the test runner.

 $ ./dtr your_test_file_name --valgrind

So, with BlitzDB this is what I do to isolate the test runner to only run my tests:

 $ ./dtr blitzdb.test --valgrind

Very simple.

The minor complication here is that the test runner will not output the valgrind report to the console and instead it writes the output to a file. So where is this file? the answer is, it’s written to the daemon’s error log which is located in the source tree:

$ less drizzle_src/tests/var/log/master.err
CURRENT_TEST: main.blitzdb
==24563== Memcheck, a memory error detector
==24563== Copyright (C) 2002-2009, and GNU GPL'd, by Julian Seward et al.
...

Here’s another tip. If you ever wondered where the files that were generated in the test (like table and index files) are stored, they are stored inside the source tree as well. Here’s an example on my machine:

$ ll drizzle_src/tests/var/master-data/
total 20528
-rw-rw---- 1 tmaesaka tmaesaka 10485760 2009-12-01 22:06 ibdata1
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile0
-rw-rw---- 1 tmaesaka tmaesaka  5242880 2009-12-01 22:06 ib_logfile1
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 mysql
drwxr-xr-x 2 tmaesaka tmaesaka     4096 2009-12-01 22:06 test

So, with all that in mind, happy hacking :)

Toru Maesaka drizzle, knowledge, oss ,

Drizzle Storage Engine Dev: Determining Query Type

November 25th, 2009

Determining what kind of SQL query is requested at the handler level is pretty important for BlitzDB since the strategy is to obtain the most suitable lock for a given request. Unfortunately there is no intuitive way to get this information. So, I took a peek into InnoDB’s sourcecode and found my solution (open source saves the day as usual).

Solution

In Drizzle, there is a function called session_sql_command(Session *session) which returns an integer that corresponds to one of the command type constants (which are accessible from the engine). Ideally I would like to call this function from anywhere in the engine but since it requires a session object as an argument, I could only call it from store_lock().

My solution was to add a variable in the handler class and assign the appropriate value to it from store_lock(). This turned out to be okay since store_lock() is called before any other API functions but the concern here is that store_lock() is planned for removal in the future.

Now I can do things like:

ha_blitz::rnd_init(bool drizzled_will_scan) {
  if (sql_command_type == SQLCOM_UPDATE)
    /* get the most suitable lock type for this task */
  else if (sql_command_type == SQLCOM_SELECT)
    /* get the most suitable lock type for this task */
  ...
}

Personal Request to Drizzle

Although I would like to see store_lock() disappear from the storage engine API, I would like storage engines (technically worker threads) to have ability to gather meta information on the query before any real work is done.

My request is for store_lock() to become something along the line of gather_information() where it gives the handler (or worker threads) a chance to gather information about the query. Needless to say, drizzled must call this function before any other API calls are made.

Toru Maesaka drizzle, oss ,

BlitzLock and RWLOCK Comparison

November 24th, 2009

As pointed out by Jay Pipes, I thought it would be nice to test and publish how BlitzLock performs against what I was originally intending on using for BlitzDB (pthread’s rwlock). So, I asked my colleagues in the operations group at Mixi to test BlitzLock on nice hardware that I don’t have access to. They kindly accepted and ran the BlitzLock sandbox on a 16 core machine running Fedora.

If you haven’t read my previous entry on BlitzLock and why I started writing it, you should. This entry won’t make sense otherwise.

Disclaimer

Before I step any further, please remember that I’m not trying to say BlitzLock is better than pthread’s rwlock. My interest is to write a lock mechanism that is optimized for Tokyo Cabinet (TC). What I wanted to gain from this test was to see if BlitzLock has enough potential for me to keep working on it.

Method

There were three kinds of workloads: “Read Oriented”, “Write Oriented” and “Neutral”. Read Oriented test has a 70% probability that each thread will call a read routine, whereas Write Oriented is the opposite where there is a 70% chance that the table state will be changed. In the Neutral test, both read and update calls have an even chance of being called. The seed value for the random number generator was identical for all tests.

Each worker sleeps for 10 milliseconds in the critical section and another 10 milliseconds right after it releases the lock. This was done to help cause context switching. Each test ran for 60 seconds.

You can obtain the standalone BlitzLock sandbox from here. I’ll upload a test friendly version that can accept startup options soon (I _really_ need to tidy it up).

Results

Below is a result from a load emulation where there was significantly more read calls than updates.

BlitzLock Benchmark (1)

As seen above, BlitzLock is nicely scaling the workload without exhausting update threads. This is important since one of the concerns involved in the current implementation of BlitzLock is starvation (covered later). I think the read/write ratio above is the sort of ratio that is typically seen in the web industry and something I’m mostly concerned with. So how about a write intensive application? Next graph is a result of when there is significantly more update operations than read.

BlitzLock Benchmark (2)

As seen above, BlitzLock is nicely scaling update tasks without neglecting readers. Compared to the first graph, we’re seeing an opposite result between update and scanner threads which is expected due to the nature of BlitzLock. This is exactly what I was hoping to gain. Next graph is a result from when there is an even chance of read and update operations to occur.

BlitzLock Benchmark (3)

As seen above, the throughput evens out for both read and update operations. I was expecting pthread’s rwlock to show noticeably lower update throughput than read (since it’s a single writer lock) but it turned out to even out. I’m not quite sure how I should interpret this but I guess the writer’s lock had a greater priority than reader’s lock in the environment that the test was run in. Nevertheless, this “even out” characteristic is something I’d like to welcome.

From Here and Weaknesses

I’m convinced to keep working on BlitzLock and use it as the default locking mechanism for BlitzDB. Ideally I should code BlitzDB to be able to switch between various locking mechanisms. This would make my life much easier when someone decides to write a locking mechanism that is better than BlitzLock for my use-case.

Thanks to Paul McCullagh‘s feedback, I’ve come to realize that BlitzLock was broadcasting more often than it needs to. Functionally it still works but I should be able to save CPU usage by applying Paul’s feedback (thanks Paul!). There is also the potential lock starvation problem (when certain types of threads hog the lock) that I need to further investigate. If it’s going to cause noticeable issues, I’ll have to add a condition to BlitzLock saying “certain number of threads can obtain a certain lock at once”.

There is still another minor scheduling logic that I need to throw into BlitzLock but once I get that done (along with testing), I can integrate BlitzLock into BlitzDB and see how it performs (I can then hack on indexing!).

Yep, there’s still quite a bit to do but I’m having fun :)

Toru Maesaka drizzle, oss , , , ,

BlitzDB and Tokyo Cabinet Concurrency Model

November 19th, 2009

Yesterday I sat in front of a whiteboard for few hours with Mikio, the author of Tokyo Cabinet discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I’m going to note it on this entry. But before I step any further, allow me to go over Tokyo Cabinet’s concurrency model.

Tokyo Cabinet’s Concurrency Model

Tokyo Cabinet is often quoted as “single writer, multi reader” but this is not quite true. At the time of this blog entry, this statement holds true for TC’s B+Tree database but TC’s hash database can actually allow multiple writers to update and/or delete records concurrently.

If you look at the entry point of tchdbput(), you will notice that it is actually obtaining a reader’s lock (in terms of rwlock). TCHDB then hashes the provided key and obtains the bucket index number where the record of interest belongs to. Given the bucket/block to work on, TC then looks at the 8 most significant bits of the hash value and attempts to obtain a granular update lock from slots of 256 mutexes (2 ^ 8 = 256). So, things are still concurrent at this stage though there are some chances of collision that would block a thread.

If a record already exists, TC will go on and happily update that block but if the record is new (as in the key doesn’t exist), TC will lock the tail block of the database and write the new record there. So, only writing a new record is treated as a single writer and the rest can be processed concurrently. This is why I said it’s not quite true.

BlitzDB’s Concurrency Model

Taken the above into mind, this is what BlitzDB’s concurrency model looks like:

  1. SELECT queries can run concurrently.
  2. SELECT queries are blocked when UPDATE and/or REPLACE queries are being processed.
  3. UPDATE, REPLACE, DELETE queries can run concurrently.
  4. INSERT is never disrupted by BlitzDB and scheduled by TC.

In an ideal world, I would allow Drizzle’s worker threads to _directly_ interact with TC and let TC handle thread synchronization. This would make my life fantastically easy but unfortunately life isn’t so easy.

For example, if a record is deleted while BlitzDB’s table scan is occurring, the table scanner will stop scanning at the position where the deleted key existed. I would not have this problem if I used TC’s native iterator but my table scan implementation uses TC’s hidden API that won’t babysit me in this regard. In return I can gain maximum concurrent read throughput from TC which was a tradeoff I happily accepted.

So, there are several little gotchas like this which forces me to implement concurrency control in BlitzDB. Here’s how I’m planning on doing it (with demo code!).

Implementation (with demo code)

In the past I’ve gone through several experimental stages with BlitzDB where I used pthread’s rwlock to control concurrency. Short answer to the result is, “IT WORKS!”. However it was not taking full advantage of TC’s concurrency model.

For example I did not want to protect UPDATE queries with a writer’s lock since it would block other UPDATE/DELETE queries. So why not protect it with a reader’s lock? The issue here is that any query that can change the state of the table cannot be processed while a scanner is running (which btw is protected by a reader’s lock). Furthermore, a non-index based update/delete means that the scanner _is_ running so there’s a problem there too.

What I need is a scheduler that can allow multiple INSERT/UPDATE/REPLACE/DELETE queries to run when the scanner is not running. On the other hand the scheduler must allow multiple scanners to run when an UPDATE/REPLACE/DELETE queries aren’t being processed _BUT_ let INSERT queries come through to TC.

Implementing the above is probably possible by using multiple mutexes but it would bring complexity to the codebase and possible deadlocks that can be difficult to debug. So we decided to learn from pthread’s rwlock implementation and write an original lock mechanism similar to rwlock but something that allows us to write our own rules for scheduling.

Here’s my first attempt at a standalone sandbox of the model:

You can compile and run it to get a grasp of how threads are coordinated:

$ wget http://torum.net/code/cc/blitzlock.cc
$ g++ -Wall -pedantic blitzlock.cc -lpthread && ./a.out

If you got the program running and wondering what the output means, think of the “updater” as a thread that performs either UPDATE, REPLACE or DELETE.

There are much more that I’d love to go on about but I think I’ve bloated this entry enough so I will save my urge for another day :)

Toru Maesaka drizzle, oss , , , , ,