BlitzDB and Tokyo Cabinet Concurrency Model
Yesterday I sat in front of a whiteboard for few hours with Mikio, the author of Tokyo Cabinet discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I’m going to note it on this entry. But before I step any further, allow me to go over Tokyo Cabinet’s concurrency model.
Tokyo Cabinet’s Concurrency Model
Tokyo Cabinet is often quoted as “single writer, multi reader” but this is not quite true. At the time of this blog entry, this statement holds true for TC’s B+Tree database but TC’s hash database can actually allow multiple writers to update and/or delete records concurrently.
If you look at the entry point of tchdbput(), you will notice that it is actually obtaining a reader’s lock (in terms of rwlock). TCHDB then hashes the provided key and obtains the bucket index number where the record of interest belongs to. Given the bucket/block to work on, TC then looks at the 8 most significant bits of the hash value and attempts to obtain a granular update lock from slots of 256 mutexes (2 ^ 8 = 256). So, things are still concurrent at this stage though there are some chances of collision that would block a thread.
If a record already exists, TC will go on and happily update that block but if the record is new (as in the key doesn’t exist), TC will lock the tail block of the database and write the new record there. So, only writing a new record is treated as a single writer and the rest can be processed concurrently. This is why I said it’s not quite true.
BlitzDB’s Concurrency Model
Taken the above into mind, this is what BlitzDB’s concurrency model looks like:
- SELECT queries can run concurrently.
- SELECT queries are blocked when UPDATE and/or REPLACE queries are being processed.
- UPDATE, REPLACE, DELETE queries can run concurrently.
- INSERT is never disrupted by BlitzDB and scheduled by TC.
In an ideal world, I would allow Drizzle’s worker threads to _directly_ interact with TC and let TC handle thread synchronization. This would make my life fantastically easy but unfortunately life isn’t so easy.
For example, if a record is deleted while BlitzDB’s table scan is occurring, the table scanner will stop scanning at the position where the deleted key existed. I would not have this problem if I used TC’s native iterator but my table scan implementation uses TC’s hidden API that won’t babysit me in this regard. In return I can gain maximum concurrent read throughput from TC which was a tradeoff I happily accepted.
So, there are several little gotchas like this which forces me to implement concurrency control in BlitzDB. Here’s how I’m planning on doing it (with demo code!).
Implementation (with demo code)
In the past I’ve gone through several experimental stages with BlitzDB where I used pthread’s rwlock to control concurrency. Short answer to the result is, “IT WORKS!”. However it was not taking full advantage of TC’s concurrency model.
For example I did not want to protect UPDATE queries with a writer’s lock since it would block other UPDATE/DELETE queries. So why not protect it with a reader’s lock? The issue here is that any query that can change the state of the table cannot be processed while a scanner is running (which btw is protected by a reader’s lock). Furthermore, a non-index based update/delete means that the scanner _is_ running so there’s a problem there too.
What I need is a scheduler that can allow multiple INSERT/UPDATE/REPLACE/DELETE queries to run when the scanner is not running. On the other hand the scheduler must allow multiple scanners to run when an UPDATE/REPLACE/DELETE queries aren’t being processed _BUT_ let INSERT queries come through to TC.
Implementing the above is probably possible by using multiple mutexes but it would bring complexity to the codebase and possible deadlocks that can be difficult to debug. So we decided to learn from pthread’s rwlock implementation and write an original lock mechanism similar to rwlock but something that allows us to write our own rules for scheduling.
Here’s my first attempt at a standalone sandbox of the model:
You can compile and run it to get a grasp of how threads are coordinated:
$ wget http://torum.net/code/cc/blitzlock.cc $ g++ -Wall -pedantic blitzlock.cc -lpthread && ./a.out
If you got the program running and wondering what the output means, think of the “updater” as a thread that performs either UPDATE, REPLACE or DELETE.
There are much more that I’d love to go on about but I think I’ve bloated this entry enough so I will save my urge for another day :)

I haven’t looked at the code you mentioned, but you said: “If you look at the entry point of tchdbput(), you will notice that it is actually obtaining a reader’s lock (in terms of rwlock). TCHDB then hashes the provided key…”
Why is it taking *any* lock before doing this and the other work you mentioned? Again I don’t know the code, but in general locks should be taken as late as possible. MySQL has had a lot of trouble from taking locks before it needs to :-(
Hi Baron!
So tchdbput() is the interface that the application calls and the first argument of the function is the TCHDB object that contains the state information of the database file. If a different thread is doing something critical that needs total atomicity, it would have obtained a writer’s lock which would make a call to tchdbput() block until the writer’s lock is released.
I’m not sure if this is the answer you expected but TC has a rwlock called mmtx (no idea what it stands for) which protects the entry of TC’s API functions. I think peeking at tchdb.h (first struct in tchdb.h) would give you a good picture of what kind of mutexes are involved in TC’s hash database (it’s well commented).
>> MySQL has had a lot of trouble from taking locks before it needs to :(
Wish I knew more about it… too much things to study and too little time :)
Hi Toru!
Nice stuff. Would be nice to see if your homegrown implementation is more scalable than a standard rwlock. Can you post graphs of your findings in this regard?
Also, one little suggestion:
Change the BlitzLocK::init() into a constructor and BlitzLock::destroy() into the destructor. Since you are already paying the price of the constructor and destructor, you might as well move the init/destroy code into the constructor/destructor. This allows RAII principles to dominate as well, which means user of the BlitzLock class can’t make the mistake of forgetting to call init or destroy.
Cheers!
Jay
Heya Jay,
Graphs are always good. Writing that standalone is as far as I’ve gotten so it’s not integrated into the engine yet. When I get all the pieces together, I’ll definitely compare it against the native library and publish the results.
Thanks very much for the C++ tip!
I’m definitely not a C++ ninja but I thought adding code into a constructor/deconstructor wasn’t good practice since failure inside a constructor can be pretty yucky to take care of… Please correct me if I’m wrong but my understanding was that the constructor should only be used for allocation and basic initialization only, and bring out anything that involves assignments and further allocation into a different function like init().
This is something I’d love to know more about :)
Hi Toru,
An interesting problem :)
If I correctly understand what the BlitzLock is doing, then there are two possible problems with the code:
- pthread_cond_broadcast(&condition) should only be called, when scanner_count == 0 in scan_end() (or updater_count == 0).
For example:
pthread_mutex_lock(&mutex);
scanner_count–;
assert(scanner_count >= 0);
if (scanner_count == 0)
pthread_cond_broadcast(&condition);
pthread_mutex_unlock(&mutex);
- The other potential problem is so-called “live lock”:
This would happen, for example, if you had a constant flow of scanners. An updater would never get a chance because scanner_count would never go to zero. Possible solution is to have a limit to the number of scanners or updaters before you switch “modes”.
Best regards,
Paul