Archive

Posts Tagged ‘blitzdb’

DATE type under the hood in Drizzle/MySQL

March 1st, 2010

Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.

  • Comparison Function has a defect.
  • Key Generator has a defect.

The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle’s native “field packer” (except for VARCHAR) inherited from MySQL. This would mean that Drizzle’s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people’s code. So, I decided to look into the comparison function which was completely written by me. Turned out that’s where the bug was.

Comparison Function

Allow me to quickly clarify what I mean by “comparison function” in this context. TC’s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involves traversing.

bool tcbdbsetcmpfunc(TCBDB *bdb, TCCMP cmp, void *cmpop);

What BlitzDB’s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in blitzcmp.cc (blitz_keycmp_cb).

DATE under the hood

After inspecting the “type number” with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I’ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn’t take this type into account in the comparator and it also showed how silly I am since the answer was always there.

Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)

P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data on memory. Respecting alignment in this environment should make some difference. Although, I can only say this after benchmarking it of course.

Toru Maesaka drizzle, oss , ,

Progress on BlitzDB’s Index Component

February 18th, 2010

I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time spent on BlitzDB for the last couple of weeks have been studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls off the Indexing API.

The actual coding wasn’t too bad thanks to Tokyo Cabinet’s awesome B+Tree API. I’ve been busier adding new tests and fixing silly bugs as they arise. I also implemented the Primary Key optimization that I blogged about a while back. As a result of all this, the following goodness has been added to BlitzDB’s Trunk.

  • Index Lookup
  • Forward Index Scan
  • Reverse Index Scan

This means that BlitzDB is now equipped with both a Table Scanner and an Index Scanner which are two essential components for a general purpose storage engine. As much as I’d like to work on optimizing the code and adding features (like recovery), I’m going to take a break and spend the rest of the month working on testing and debugging. There’s no point in adding features if the base has notable flaws in it.

Challenges Encountered

Writing the Index Scanner itself is easy. The most difficult thing that slowed me down was developing the comparison function for index keys. The end result was a simple piece of code but I had to study various things before I could start writing any code.

  • How to respect collation
  • How keys are represented internally
  • How types are represented internally
  • How to write a custom comparison function for Tokyo Cabinet
  • … and so on

I’ve also started using Evernote to jot down my spontaneous ideas on optimizing BlitzDB. I’ve made these notes public and they will most likely be updated while I’m commuting on the train.

There are much more that I’d like to write about like how I intend on developing the table recovery routine without simply using TC’s recovery mechanism but I shall restrain myself for another day.

Toru Maesaka drizzle, oss , ,

Further thoughts on BlitzDB’s Index Handling

January 15th, 2010

I’ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I’ve been getting with BlitzDB’s index design. I’m actually so frustrated with myself at the moment that I want to hit my head against a wall or something.

So, I’m writing this entry to clear up my mind. Heh, my blog is slowly becoming BlitzDB’s design document draft. This should hopefully be good though since by blogging it, people can tell me whether I’m moving towards a stupid direction or not.

Collation Importance

When writing database software that is intended for International use, it is important to handle textual data by respecting collation order. It is arguable that most people are only interested in English lexicographic ordering but unfortunately the world is not so standard.

Internal Primary Key Handling

I want to motivate people to actively define a PRIMARY KEY with BlitzDB. I plan to make this attractive by providing the best performance when PK is defined. In December 2009, my answer to this was to write the PK value as the key for the data dictionary (where actual rows are stored in BlitzDB). This allows BlitzDB to do a direct lookup on the data dictionary for PK based lookup, instead of consulting the B+Tree index. I’m still fond of this lookup optimization approach but it introduces problems too.

Problem 1. Consider the following textual keys: “key” and “KEY”. They obviously have different binary representations but in certain cases they can be logically equivalent. Because the data dictionary is a hash database, this is a problem. The solution that instantly pops up is to normalize the key before writing or reading it. This however, causes a problem in cases where the two keys are inequivalent. Perhaps Drizzle/MySQL provides an internal normalization function that respects this. I still need to study this area of the storage subsystem.

Problem 2. Directly writing a PK to the data dictionary means fast lookup but because of the data structure, it’s not possible to fetch the next “logical” key, meaning I can’t implement index scanning on PK as it is. Quick solution for users is to create an index on the PK column (this would create a separate B+Tree for it) but this is not so friendly because it requires the user to have prior knowledge of all this. So, my plan is to provide the best of both worlds. I’ll elaborate on how I’m planning on tackling this problem next.

Current Primary Key Read/Write Behavior

In general, keys of BlitzDB’s data dictionary is a unique 8 byte integer. The idea is that BlitzDB writes this unique ID along with the key to the B+Tree Index so that it can later identify that row. The difference with PK is that, if a PK is present in a table, BlitzDB will not generate an internal unique ID and use PK for the data dictionary’s key instead. BlitzDB won’t create a B+Tree index for PK at the time I wrote this blog entry.

Next Step

Create a B+Tree index for PK anyway. BlitzDB will still use the PK value as the key for data dictionary if it exists. For PK based lookup requests, BlitzDB will look directly at the data dictionary and for PK based requests that involve index scanning, BlitzDB will look at the B+Tree index.

This approach can consume more space when textual data is used for keys but I think it’s worth it. At the same time, you can save space if you use use types that are smaller than 8 bytes for PK. For example, using a 4 byte integer would reduce BlitzDB’s key space by 50%.

Hmm, I think my mind has cleared a little.

Toru Maesaka drizzle, oss ,

Drizzle, BlitzDB and HTON_STATS_RECORDS_IS_EXACT

January 13th, 2010

Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So it’s a great thing in general.

Something Broke

After I enabled HTON_STATS_RECORDS_IS_EXACT, I noticed that issuing SELECT statement on a table with 1 row would no longer return a resultset. Weird indeed! after investigating with GDB, I noticed that rnd_next() is only called once instead of twice on a table with 1 row (second time is to find EOF) when HTON_STATS_RECORDS_IS_EXACT is enabled. This makes sense because the kernel knows that there is only 1 row and therefore it doesn’t need to keep scanning for EOF. However, this made me scratch my head since this shouldn’t break BlitzDB’s table scanner.

Remedy

Logically, I was confident that BlitzDB’s table scanner was functioning properly so I decided to look at what was going on beyond the engine API. Turns out that join_read_system() in sql_select.cc looks at the table->status value and decides that it’s an error if 0 isn’t assigned to it. What’d you know? I realized that I wasn’t assigning anything to the status variable. It’s more that I didn’t know that I was meant to update an internal structure. You’d think that engine developers aren’t meant to touch those. It’s not mentioned in the Engine Documentation at MySQL Forge either. Nevertheless, the important thing is that it works now. Oh and SELECT COUNT is fast now too.

Eye Opener

This experience among other occasions where I had to read the kernel’s source made me think that it would be nice to provide an intensive up to date documentation on how to develop storage engines for Drizzle in the future (when the API becomes stable). Needless to say, this would be co-ordinated within the Drizzle community. I’m not a license person but it should hopefully be provided with a freely available license too.

Toru Maesaka drizzle, oss ,

End of Year Progress on BlitzDB

December 24th, 2009

FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling

My open source friends might have noticed that I’ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn’t look like I can reach that goal. However, looking at the brightside I got a lot done in the past few weeks so allow me to “journal” it in this blog post.

Agony of Knowing

The more I understood Drizzle’s storage mechanism and Tokyo Cabinet’s internals, the more I disliked what I previously had. This led me to spending quite a bit of time rewriting BlitzDB’s codebase. I was using pthread’s rwlock for concurrency control but I decided to design and write BlitzDB’s own lock mechanism to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code which is something you’d hope won’t be executed that often (people should use indexes!) but needless to say, it’s an important component of a relational storage engine so I’ve put in a lot of effort there.

Rewriting the Table Scanner

In the process of rewriting the table scanner, Jay Pipes’ gave me a fantastic advise on using Drizzle’s internal atomic type (drizzled::atomics). He gave me this advise because he noticed that my atomic ID generator was securing atomicity with pthread’s mutex. It is debatable that this mutex was only enabled for only few CPU instructions but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler’s builtin function can gain you 3x throughput.

Hacking on Index Support

I’ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB’s index is a dense clustered b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for “needle in a haystack” like queries only. An example of a needle in a haystack like query is:

SELECT * FROM TABLE WHERE primary_key_column = whatever;

Saying that, I don’t like to force people to do things the way I like so I plan on providing best of both worlds by supporting both data structures for PKs in Version-2:

CREATE TABLE t1 (id int, PRIMARY KEY(id) USING btree) ENGINE=blitzdb;
CREATE TABLE t1 (id int, PRIMARY KEY(id) USING hash) ENGINE=blitzdb;

BlitzDB’s default configuration will use PK as a “direct” data dictionary index. If you wish to do range queries on PK, the solution is to create a index on the PK column.

Primary Key lookup Performance

So, how does my implementation perform? Here’s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:

CREATE TABLE t1 (id int PRIMARY KEY, a int, b int) ENGINE=blitzdb;

and the query looks like this:

SELECT * FROM t1 WHERE id = random_number_under_one_million;

The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2×4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah… this sucks so I can’t claim that this benchmark is 100% creditable.

Here is the result I obtained from skyload. Please only view it as a guideline to BlitzDB’s lookup performance. I’ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.

[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
 
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1

As you can see, “needle in a haystack” queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of 0.058 milliseconds to process a query.

Future Plans

Admittedly, primary key support isn’t completely done so I’ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I’ll ask the Drizzle Community to consider merging BlitzDB into Drizzle’s trunk. This is my goal for BlitzDB at the moment.

I also happen to own blitzdb.com so I’m planning on putting user documentation (including tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won’t happen until I get Version-1 released. All I can say about the release schedule at the moment is, “before the MySQL conference in april”.

So, that’s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don’t trip on ice :)

Toru Maesaka drizzle, oss , ,

BlitzLock and RWLOCK Comparison

November 24th, 2009

As pointed out by Jay Pipes, I thought it would be nice to test and publish how BlitzLock performs against what I was originally intending on using for BlitzDB (pthread’s rwlock). So, I asked my colleagues in the operations group at Mixi to test BlitzLock on nice hardware that I don’t have access to. They kindly accepted and ran the BlitzLock sandbox on a 16 core machine running Fedora.

If you haven’t read my previous entry on BlitzLock and why I started writing it, you should. This entry won’t make sense otherwise.

Disclaimer

Before I step any further, please remember that I’m not trying to say BlitzLock is better than pthread’s rwlock. My interest is to write a lock mechanism that is optimized for Tokyo Cabinet (TC). What I wanted to gain from this test was to see if BlitzLock has enough potential for me to keep working on it.

Method

There were three kinds of workloads: “Read Oriented”, “Write Oriented” and “Neutral”. Read Oriented test has a 70% probability that each thread will call a read routine, whereas Write Oriented is the opposite where there is a 70% chance that the table state will be changed. In the Neutral test, both read and update calls have an even chance of being called. The seed value for the random number generator was identical for all tests.

Each worker sleeps for 10 milliseconds in the critical section and another 10 milliseconds right after it releases the lock. This was done to help cause context switching. Each test ran for 60 seconds.

You can obtain the standalone BlitzLock sandbox from here. I’ll upload a test friendly version that can accept startup options soon (I _really_ need to tidy it up).

Results

Below is a result from a load emulation where there was significantly more read calls than updates.

BlitzLock Benchmark (1)

As seen above, BlitzLock is nicely scaling the workload without exhausting update threads. This is important since one of the concerns involved in the current implementation of BlitzLock is starvation (covered later). I think the read/write ratio above is the sort of ratio that is typically seen in the web industry and something I’m mostly concerned with. So how about a write intensive application? Next graph is a result of when there is significantly more update operations than read.

BlitzLock Benchmark (2)

As seen above, BlitzLock is nicely scaling update tasks without neglecting readers. Compared to the first graph, we’re seeing an opposite result between update and scanner threads which is expected due to the nature of BlitzLock. This is exactly what I was hoping to gain. Next graph is a result from when there is an even chance of read and update operations to occur.

BlitzLock Benchmark (3)

As seen above, the throughput evens out for both read and update operations. I was expecting pthread’s rwlock to show noticeably lower update throughput than read (since it’s a single writer lock) but it turned out to even out. I’m not quite sure how I should interpret this but I guess the writer’s lock had a greater priority than reader’s lock in the environment that the test was run in. Nevertheless, this “even out” characteristic is something I’d like to welcome.

From Here and Weaknesses

I’m convinced to keep working on BlitzLock and use it as the default locking mechanism for BlitzDB. Ideally I should code BlitzDB to be able to switch between various locking mechanisms. This would make my life much easier when someone decides to write a locking mechanism that is better than BlitzLock for my use-case.

Thanks to Paul McCullagh’s feedback, I’ve come to realize that BlitzLock was broadcasting more often than it needs to. Functionally it still works but I should be able to save CPU usage by applying Paul’s feedback (thanks Paul!). There is also the potential lock starvation problem (when certain types of threads hog the lock) that I need to further investigate. If it’s going to cause noticeable issues, I’ll have to add a condition to BlitzLock saying “certain number of threads can obtain a certain lock at once”.

There is still another minor scheduling logic that I need to throw into BlitzLock but once I get that done (along with testing), I can integrate BlitzLock into BlitzDB and see how it performs (I can then hack on indexing!).

Yep, there’s still quite a bit to do but I’m having fun :)

Toru Maesaka drizzle, oss , , , ,

BlitzDB and Tokyo Cabinet Concurrency Model

November 19th, 2009

Yesterday I sat in front of a whiteboard for few hours with Mikio, the author of Tokyo Cabinet discussing/debating what the optimal concurrency model would be for BlitzDB. I think we came to a pretty good conclusion so I’m going to note it on this entry. But before I step any further, allow me to go over Tokyo Cabinet’s concurrency model.

Tokyo Cabinet’s Concurrency Model

Tokyo Cabinet is often quoted as “single writer, multi reader” but this is not quite true. At the time of this blog entry, this statement holds true for TC’s B+Tree database but TC’s hash database can actually allow multiple writers to update and/or delete records concurrently.

If you look at the entry point of tchdbput(), you will notice that it is actually obtaining a reader’s lock (in terms of rwlock). TCHDB then hashes the provided key and obtains the bucket index number where the record of interest belongs to. Given the bucket/block to work on, TC then looks at the 8 most significant bits of the hash value and attempts to obtain a granular update lock from slots of 256 mutexes (2 ^ 8 = 256). So, things are still concurrent at this stage though there are some chances of collision that would block a thread.

If a record already exists, TC will go on and happily update that block but if the record is new (as in the key doesn’t exist), TC will lock the tail block of the database and write the new record there. So, only writing a new record is treated as a single writer and the rest can be processed concurrently. This is why I said it’s not quite true.

BlitzDB’s Concurrency Model

Taken the above into mind, this is what BlitzDB’s concurrency model looks like:

  1. SELECT queries can run concurrently.
  2. SELECT queries are blocked when UPDATE and/or REPLACE queries are being processed.
  3. UPDATE, REPLACE, DELETE queries can run concurrently.
  4. INSERT is never disrupted by BlitzDB and scheduled by TC.

In an ideal world, I would allow Drizzle’s worker threads to _directly_ interact with TC and let TC handle thread synchronization. This would make my life fantastically easy but unfortunately life isn’t so easy.

For example, if a record is deleted while BlitzDB’s table scan is occurring, the table scanner will stop scanning at the position where the deleted key existed. I would not have this problem if I used TC’s native iterator but my table scan implementation uses TC’s hidden API that won’t babysit me in this regard. In return I can gain maximum concurrent read throughput from TC which was a tradeoff I happily accepted.

So, there are several little gotchas like this which forces me to implement concurrency control in BlitzDB. Here’s how I’m planning on doing it (with demo code!).

Implementation (with demo code)

In the past I’ve gone through several experimental stages with BlitzDB where I used pthread’s rwlock to control concurrency. Short answer to the result is, “IT WORKS!”. However it was not taking full advantage of TC’s concurrency model.

For example I did not want to protect UPDATE queries with a writer’s lock since it would block other UPDATE/DELETE queries. So why not protect it with a reader’s lock? The issue here is that any query that can change the state of the table cannot be processed while a scanner is running (which btw is protected by a reader’s lock). Furthermore, a non-index based update/delete means that the scanner _is_ running so there’s a problem there too.

What I need is a scheduler that can allow multiple INSERT/UPDATE/REPLACE/DELETE queries to run when the scanner is not running. On the other hand the scheduler must allow multiple scanners to run when an UPDATE/REPLACE/DELETE queries aren’t being processed _BUT_ let INSERT queries come through to TC.

Implementing the above is probably possible by using multiple mutexes but it would bring complexity to the codebase and possible deadlocks that can be difficult to debug. So we decided to learn from pthread’s rwlock implementation and write an original lock mechanism similar to rwlock but something that allows us to write our own rules for scheduling.

Here’s my first attempt at a standalone sandbox of the model:

You can compile and run it to get a grasp of how threads are coordinated:

$ wget http://torum.net/code/cc/blitzlock.cc
$ g++ -Wall -pedantic blitzlock.cc -lpthread && ./a.out

If you got the program running and wondering what the output means, think of the “updater” as a thread that performs either UPDATE, REPLACE or DELETE.

There are much more that I’d love to go on about but I think I’ve bloated this entry enough so I will save my urge for another day :)

Toru Maesaka drizzle, oss , , , , ,

TC Concurrency Model and BlitzDB Part 1

November 7th, 2009

Recently I started rewriting BlitzDB because I’ve come to realize the mistakes I’ve made from getting a better understanding of the Drizzle Storage API and Tokyo Cabinet internals. Admittedly a rewrite is an exaggeration because I’ll be reusing most of the components but more in a C++ way.

One decision I decided to make is that BlitzDB will only support a BTREE index via TC’s B+Tree API in it’s first release. Ignoring BlitzDB for now, several people I’ve talked to about key/value data structures often ask why I love B+Tree so much when it’s faster to work with a hash table. Please don’t take it wrong, O(1) operations are beautiful and I love hash tables but stereotyping key/value structures to it is not. Everything has it’s ups and downs and hash table/map is not an exception. In this blog entry, I will describe why B+Tree is good for index scanning.

Why a B+Tree Index

A search algorithm of O(1) like hashing is clearly faster than O(log n) unless there’s something fishy about the implementation or the dataset is too small for the time complexity to matter. However, this is only true for looking up and fetching the value. For those that are only interested in fetching a particular value, that’s probably the best you can ask for. However things are different if you look into things beyond lookups like fetching or scanning through a range of keys.

To do this with a typical hash table, either your data structure must be able to provide a list of stored keys OR your application must do some housekeeping and save a list of relevant keys elsewhere for future use. Your application would then need to compute the subset of keys that you’re interested in and fetch them with a loop. Algorithmically speaking, each fetch operation is O(1) but what’s expensive here is that you end up doing a lot of random access. This is obviously going to kill your performance, especially when you need to chew through a heavy workload (though this _could_ change when SSD becomes standard).

B+Tree on the other hand is fantastic for this use-case. The actual data are stored at the leaf node and they are usually logically linked so that you don’t need to re-traverse the tree to get the next greater key (if you run out of relevant pages in the node, you move on to the neighbor leaf node). The pages are aligned on disk, which means sequential access. Another bonus is that most of the time, you can keep the entire internal nodes on memory which is small and inexpensive but effective for searching.

Solution to this? well, mine is to implement a combination play of the two data structures and take advantage of the different characteristics. In BlitzDB, the actual rows are stored in TC’s Hash Database and the index will store keys to the row. So, a clustered index.

What I’ve mentioned so far is all theoretical without providing any benchmark results but all I’m trying to say is, it’s all about access patterns and use-cases. My current interest is in index scan and therefore the decision. However, if there is enough people that asks me for a HASH index, I can write that functionality relatively easily later on :)

Next Stop

I would love to keep writing but it is currently past 3am in Japan and I’m dozing out here. Apologies for not covering Tokyo Cabinet’s in-depth concurrency model but I will cover it in my next post of the series and how this impacts BlitzDB’s design.

Toru Maesaka drizzle, knowledge , , ,