<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Toru Maesaka &#187; blitzdb</title>
	<atom:link href="http://torum.net/tag/blitzdb/feed/" rel="self" type="application/rss+xml" />
	<link>http://torum.net</link>
	<description>Hackaholic and a Web Addict based in Tokyo</description>
	<lastBuildDate>Tue, 28 Feb 2012 10:52:29 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.1.2</generator>
		<item>
		<title>BlitzDB Crash Safety and Auto Recovery</title>
		<link>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/</link>
		<comments>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/#comments</comments>
		<pubDate>Thu, 22 Jul 2010 09:43:14 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[hacking]]></category>
		<category><![CDATA[recovery]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2369</guid>
		<description><![CDATA[Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so-called NoSQL world, compromise crash safety in return for higher QPS. The argument there is that the availability of the overall system should be [...]]]></description>
			<content:encoded><![CDATA[<p>Crash Safety is a big deal in the database league. Lack of durability can lead to all sorts of terrible things upon a catastrophic event. Many projects, especially in the so-called NoSQL world, compromise crash safety in return for higher QPS. The argument there is that the availability of the overall system should be accomplished by replication since a database server can&#8217;t be rescued if the physical disk breaks. I happen to agree with this philosophy but I am also aware that this isn&#8217;t the right answer for everyone. So, what will I do with BlitzDB?</p>
<p>Several relational database hackers have pointed out that BlitzDB isn&#8217;t any safer than MyISAM since it doesn&#8217;t guarantee crash safety. This is currently true but I plan on making BlitzDB much safer than MyISAM by providing the following features.</p>
<ol>
<li>Auto Recovery Routine (startup option)</li>
<li>Tokyo Cabinet&#8217;s Transaction API (table-specific option)</li>
</ol>
<p>The second feature above would actually make BlitzDB crash safe (especially combined with auto recovery) but I won&#8217;t go into depth in this post since this topic deserves a blog post of its own. Let me just state that this feature will be provided in a form like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>
  a int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">,</span>
  b varchar<span style="color: #66cc66;">&#40;</span><span style="color: #cc66cc;">256</span><span style="color: #66cc66;">&#41;</span>
<span style="color: #66cc66;">&#41;</span> ENGINE <span style="color: #66cc66;">=</span> BLITZDB<span style="color: #66cc66;">,</span> CRASH_SAFE;</pre></div></div>

<p>From here on, I&#8217;ll cover how I plan on hacking auto recovery in BlitzDB.</p>
<h3>Auto Recovery Challenges</h3>
<p>As I blogged a while back, <a href="http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/">recovering Tokyo Cabinet</a> is relatively simple. However, this is not a sufficient solution in BlitzDB since the data file (hash database that actually holds the rows) and the index file(s) are independent of each other. That is, the data file and the index file(s) are very likely to be inconsistent after a crash. So, how can we hack on this? Pretty simple.</p>
<h3>Indexes aren&#8217;t Important at Recovery Phase</h3>
<p>Because BlitzDB logically separates the data file from its indexes, the index files aren&#8217;t that important. If the server crashes, BlitzDB can delete the index file(s) and recompute them from the data file. Needless to say, this process involves a lot of random access and computation, but it is a one-time cost paid at recovery, so it would not dominate the system&#8217;s overall running time. This approach has one flaw, however: the index files can&#8217;t be recomputed if the data file is broken or unrecoverable.</p>
<p>Therefore, to guarantee crash safety, BlitzDB must ensure that the data file is unbreakable. This is precisely where Tokyo Cabinet&#8217;s Transaction API comes in. I&#8217;m planning on using it to protect the data file from breaking. If the data file is protected, the table can be rescued. Simple!</p>
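<p>To make the rebuild step concrete, here is a minimal sketch of recomputing an index from the rows alone. It is not BlitzDB&#8217;s actual code: <code>row_t</code>, <code>index_entry_t</code> and <code>rebuild_index</code> are hypothetical names, and plain in-memory arrays stand in for the Tokyo Cabinet data file and B+Tree index.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical in-memory stand-ins, not BlitzDB's real structures. */
typedef struct {
  uint64_t row_id;  /* key of the row in the data file */
  int32_t indexed;  /* value of the indexed column */
} row_t;

typedef struct {
  int32_t key;      /* indexed column value */
  uint64_t row_id;  /* points back at the row in the data file */
} index_entry_t;

/* Order entries by key, then by row id so duplicates stay deterministic. */
static int entry_cmp(const void *a, const void *b) {
  const index_entry_t *x = a;
  const index_entry_t *y = b;
  if (x->key != y->key) return (x->key > y->key) - (x->key < y->key);
  return (x->row_id > y->row_id) - (x->row_id < y->row_id);
}

/* Throw away whatever the old index held and recompute it from the rows.
 * Returns the number of entries written, or 0 if `out` is too small. */
size_t rebuild_index(const row_t *rows, size_t nrows,
                     index_entry_t *out, size_t cap) {
  assert(rows != NULL && out != NULL);
  if (cap < nrows) return 0;
  for (size_t i = 0; i < nrows; i++) {
    out[i].key = rows[i].indexed;
    out[i].row_id = rows[i].row_id;
  }
  qsort(out, nrows, sizeof(out[0]), entry_cmp);
  return nrows;
}
```

<p>The point of the sketch is that the index is derived entirely from the rows, so only the data file has to survive the crash.</p>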
<p>So, that&#8217;s what I have in mind for making BlitzDB a safer engine. Unfortunately I can&#8217;t start hacking on it immediately since I have several bugs to fix first. Nevertheless I&#8217;m looking forward to hacking on it. This challenge should be quite fun to tackle.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/07/blitzdb-crash-safety-and-auto-recovery/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>BlitzDB is now in Drizzle&#8217;s Trunk Repository</title>
		<link>http://torum.net/2010/06/blitzdb-drizzle-merge/</link>
		<comments>http://torum.net/2010/06/blitzdb-drizzle-merge/#comments</comments>
		<pubDate>Mon, 21 Jun 2010 11:20:45 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2362</guid>
		<description><![CDATA[Happy to announce that BlitzDB has been merged with Drizzle&#8217;s Trunk. As much as I&#8217;m excited, it&#8217;s time to come back to reality. This merge is merely a beginning. There is much more work that needs to be done to BlitzDB such as ensuring stability by adding more tests, finding bugs, and eliminating them. I&#8217;m [...]]]></description>
			<content:encoded><![CDATA[<p>Happy to announce that BlitzDB has <a href="http://bazaar.launchpad.net/~drizzle-developers/drizzle/development/revision/1626">been merged</a> with Drizzle&#8217;s Trunk.</p>
<p>As much as I&#8217;m excited, it&#8217;s time to come back to reality. This merge is merely a beginning. There is much more work that needs to be done to BlitzDB such as ensuring stability by adding more tests, finding bugs, and eliminating them. I&#8217;m hoping that the likelihood of bugs being found will increase due to this merge. Admittedly, I want to hack on fancy (yet important) things like auto recovery but I&#8217;m going to resist doing this until I&#8217;m truly satisfied with the quality of BlitzDB. My plan is to have BlitzDB rock solid by Drizzle&#8217;s Beta release.</p>
<p>The review process to get BlitzDB into Drizzle was straightforward and smooth. This is mostly due to the fact that the community was very supportive about testing. Folks like Stewart Smith and Patrick Crews from Rackspace pointed out several bugs that I would not have found myself. I&#8217;m certainly lucky to have a supportive professional QA engineer (looking at you Patrick) to test out and punish BlitzDB.</p>
<p>All I&#8217;ll be doing on BlitzDB for the next couple of weeks is debugging and refactoring to improve readability. What I need more of at the moment is test cases on JOINs that are likely to be used in practice. If you have a good test case, I would greatly appreciate it!</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/06/blitzdb-drizzle-merge/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>BlitzDB Concurrent Testing and Write Performance</title>
		<link>http://torum.net/2010/05/blitzdb-concurrency-testing/</link>
		<comments>http://torum.net/2010/05/blitzdb-concurrency-testing/#comments</comments>
		<pubDate>Wed, 12 May 2010 06:42:05 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[performance]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2353</guid>
		<description><![CDATA[Last month at the MySQL Conference, several people asked me about the status of BlitzDB. Specifically, they were interested in when I&#8217;ll release BlitzDB. Fair enough &#8211; I&#8217;ve been working on this project long enough for people to start questioning this. The answer is, BlitzDB is done in terms of implementing the design. [...]]]></description>
			<content:encoded><![CDATA[<p>Last month at the MySQL Conference, several people asked me about the status of BlitzDB. Specifically, they were interested in when I&#8217;ll release BlitzDB. Fair enough &#8211; I&#8217;ve been working on this project long enough for people to start questioning this.</p>
<p>The answer is, BlitzDB is done in terms of implementing the design. Right now it&#8217;s about finding bugs, fixing them and testing BlitzDB&#8217;s stability under concurrent load. Thanks to the motivation boost I gained at the conference, I&#8217;ve now fixed the bugs that were slowing me down and I&#8217;m gradually adding more tests into BlitzDB&#8217;s test suite. I consider BlitzDB&#8217;s initial release to be the day it gets merged into Drizzle&#8217;s trunk. This is almost ready as BlitzDB seems to be building fine on Drizzle&#8217;s Build Farm infrastructure. However, I won&#8217;t move to the next step until I&#8217;m satisfied with BlitzDB&#8217;s stability.</p>
<p>Yesterday I spent some time doing some concurrency testing on BlitzDB&#8217;s INSERT code with skyload. Needless to say, concurrency testing is also a convenient way to look at the performance of a particular component. So, I decided to publish my findings from this test. First, here is the background of the test.</p>
<h3>Purpose of the Test</h3>
<ul>
<li>Test BlitzDB&#8217;s slot-lock mechanism.</li>
<li>Confirm that BlitzDB will not crash under concurrent INSERT workload.</li>
<li>Confirm that key insertion to the index is working as expected.</li>
<li>Confirm that writes to multiple indexes work as expected.</li>
<li>Observe the write-performance impact of adding an index.</li>
</ul>
<p>Two commodity boxes were used: one dedicated to the client and the other dedicated to the server (Drizzle + BlitzDB). Both boxes have the same spec: Intel Quad Xeon E5345 (2×4MB L2 cache), 8GB Memory, 500GB SATA II, gigabit NIC. The boxes were connected by a gigabit switch. The file system on the server was ext3.</p>
<p>By default, a BlitzDB table is optimized for up to 1 million rows. Therefore each test run inserted 1 million rows into a table, with a different concurrency level per run. The table used in this test contains only three integer columns. Tests were performed with up to three indexes. The Linux kernel&#8217;s dirty buffers were flushed before each test run. Tests were run until the performance curve flattened.</p>
<h3>Result</h3>
<p align="center"><a href="http://www.flickr.com/photos/tmaesaka/4598572902/" title="BlitzDB Table Insertion - Multi Index by tmaesaka, on Flickr"><img src="http://farm2.static.flickr.com/1324/4598572902_c1e45d7ac5.jpg" width="500" height="294" alt="BlitzDB Table Insertion - Multi Index" /></a></p>
<p>As seen above, scalability from 1 thread to 4 threads showed an ideal curve. This is expected since the server is a 4 core box. From 4 threads, performance showed some improvement up to 12 threads. From there on, concurrency greatly exceeds the number of physical cores so we can&#8217;t observe decent performance growth. The highest insert QPS gained in this test was <strong>just over 86,000 QPS</strong>. With more cores on the server and more clients, I suspect BlitzDB can hit over 100k QPS.</p>
<p>Although this graph looks good at first sight, I&#8217;m not happy with it. The performance penalty for adding multiple indexes should be greater than what&#8217;s observed in this result. This is because TC&#8217;s B+Tree is internally protected by a single lock on writes. I suspect that the performance penalty is not observed in this graph because I didn&#8217;t give BlitzDB enough load to make TC work hard. This implies that a bottleneck could exist elsewhere (Network, Drizzle or BlitzDB&#8217;s handler level code).</p>
<p>However, I&#8217;m glad that BlitzDB stood stable on this concurrency test which was what I wanted to test in the first place. Admittedly I need to mix several types of queries to properly test BlitzDB&#8217;s stability. I plan on doing this next with sysbench and hopefully <a href="https://launchpad.net/randgen">RQG</a>.</p>
<p>Once this is done, I&#8217;ll submit a merge proposal to the Drizzle Project :)</p>
<h3>Future Development Plans</h3>
<ul>
<li>Find bugs, Fix bugs, Repeat.</li>
<li>Write an inbuilt auto recovery routine.</li>
<li>Eventually add a crash safe option to BlitzDB.</li>
</ul>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/05/blitzdb-concurrency-testing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Testing BlitzDB on Drizzle&#8217;s Build Farm</title>
		<link>http://torum.net/2010/05/blitzdb-on-build-farm/</link>
		<comments>http://torum.net/2010/05/blitzdb-on-build-farm/#comments</comments>
		<pubDate>Thu, 06 May 2010 10:37:36 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[hudson]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2349</guid>
		<description><![CDATA[One of many important things that the Drizzle project takes seriously is for the project source code to successfully build on all our target platforms AND pass tests on them. This is not really specific to Drizzle as most open source projects would have the same policy. For example we do the same thing in memcached [...]]]></description>
			<content:encoded><![CDATA[<p>One of many important things that the Drizzle project takes seriously is for the project source code to successfully build on all our target platforms AND pass tests on them. This is not really specific to Drizzle as most open source projects would have the same policy. For example we do the same thing in memcached thanks to Dustin Sallings&#8217; buildbot kungfu. </p>
<p>Yesterday, Monty Taylor gave me access to Drizzle&#8217;s Build Farm Infrastructure so that I could test BlitzDB on various Linux distributions and FreeBSD. Unfortunately most build machines didn&#8217;t have Tokyo Cabinet installed so I could only test builds on Ubuntu and Debian. Fortunately the build went fine on those platforms though this was predictable since Ubuntu is my primary development platform. What was disturbing was getting test errors on my index test suite. I guess it&#8217;s time to put my thinking cap on and see what the problem is there.</p>
<p>This is a big leap towards getting BlitzDB in Drizzle&#8217;s trunk which I&#8217;m steadily working towards. I also want to benchmark BlitzDB in its current state with <a href="http://sysbench.sourceforge.net/">sysbench</a>&#8217;s OLTP tests. This is still low in my priority queue but hopefully I&#8217;ll do it in the next couple of months.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/05/blitzdb-on-build-farm/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>DATE type under the hood in Drizzle/MySQL</title>
		<link>http://torum.net/2010/03/date-type-and-drizzle/</link>
		<comments>http://torum.net/2010/03/date-type-and-drizzle/#comments</comments>
		<pubDate>Mon, 01 Mar 2010 07:24:39 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[database]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2338</guid>
		<description><![CDATA[Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this. Comparison Function has a defect. Key Generator has a defect. The latter suspect was [...]]]></description>
			<content:encoded><![CDATA[<p>Learned something new from my own bug in BlitzDB today. The problem was that writing a DATE column index would always return a duplicate key error (regardless of what I feed it). There are two suspicious candidates that can cause this.</p>
<ul>
<li>Comparison Function has a defect.</li>
<li>Key Generator has a defect.</li>
</ul>
<p>The latter suspect was going to be tricky if it was true since BlitzDB currently uses Drizzle&#8217;s native &#8220;field packer&#8221; (except for VARCHAR) inherited from MySQL. This would mean that Drizzle&#8217;s field system has a bug in it which was somewhat difficult to believe. Furthermore, you should always blame yourself before you start suspecting other people&#8217;s code. So, I decided to look into the comparison function which was completely written by me. Turned out that&#8217;s where the bug was.</p>
<h3>Comparison Function</h3>
<p>Allow me to quickly clarify what I mean by &#8220;comparison function&#8221; in this context. TC&#8217;s B+Tree API has an interface that allows you to provide your own comparison function for all operations that involve traversal.</p>

<div class="wp_syntax"><div class="code"><pre class="c" style="font-family:monospace;">bool tcbdbsetcmpfunc<span style="color: #009900;">&#40;</span>TCBDB <span style="color: #339933;">*</span>bdb<span style="color: #339933;">,</span> TCCMP cmp<span style="color: #339933;">,</span> <span style="color: #993333;">void</span> <span style="color: #339933;">*</span>cmpop<span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>What BlitzDB&#8217;s comparison function callback does is, it looks at the data type of the values to be compared and performs appropriate processing on the values then compares them. You can also look at it as a long switch statement. For those that are interested, this code is in <a href="http://bazaar.launchpad.net/%7Etmaesaka/blitzdb/trunk/annotate/head%3A/plugin/blitzdb/blitzcmp.cc">blitzcmp.cc</a> (blitz_keycmp_cb).</p>
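<p>For a feel of what such a callback looks like, here is a minimal sketch of a type-dispatching comparator. It takes the same five parameters as Tokyo Cabinet&#8217;s <code>TCCMP</code> callback type, but everything else is an assumption for illustration: <code>key_type_t</code>, <code>keycmp_sketch</code> and the two type tags are hypothetical, and the real blitz_keycmp_cb switches on Drizzle&#8217;s key types instead.</p>

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type tags for illustration only. */
typedef enum { KEY_INT32, KEY_BYTES } key_type_t;

/* Three-way compare that cannot overflow the way (a - b) can. */
static int cmp3(int64_t a, int64_t b) { return (a > b) - (a < b); }

/* Same parameter list as a TCCMP callback. The opaque `op` pointer
 * carries the key's type tag in this sketch. */
int keycmp_sketch(const char *aptr, int asiz,
                  const char *bptr, int bsiz, void *op) {
  assert(op != NULL);
  key_type_t type = *(const key_type_t *)op;
  switch (type) {
  case KEY_INT32: {
    int32_t a, b;
    assert(asiz == (int)sizeof(a) && bsiz == (int)sizeof(b));
    memcpy(&a, aptr, sizeof(a)); /* memcpy avoids unaligned reads */
    memcpy(&b, bptr, sizeof(b));
    return cmp3(a, b);
  }
  case KEY_BYTES:
  default: {
    /* Plain byte order; the shorter key sorts first on a tie. */
    int n = asiz < bsiz ? asiz : bsiz;
    int rv = memcmp(aptr, bptr, (size_t)n);
    return rv != 0 ? rv : cmp3(asiz, bsiz);
  }
  }
}
```

<p>A function of this shape is what gets registered through the <code>cmp</code> argument of tcbdbsetcmpfunc, with the type information passed along as <code>cmpop</code>.</p>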
<h3>DATE under the hood</h3>
<p>After inspecting the &#8220;type number&#8221; with GDB and looking at the corresponding ha_base_keytype enum, it turns out that the DATE type is internally represented as an unsigned 3 byte integer (HA_KEYTYPE_UINT24). This was pleasant to discover since I&#8217;ve been wondering what a 3 byte integer is still used for in Drizzle. The problem I had was that I didn&#8217;t take this type into account in the comparator and it also showed how silly I am since the answer was always there.</p>
<ul>
<li><a href="http://dev.mysql.com/doc/refman/5.1/en/storage-requirements.html">10.5. Data Type Storage Requirements</a></li>
</ul>
<p>Now, the question is should it be kept this way? Respect alignment or reduce total I/O and space by keeping it this way? This should hopefully be a fun discussion to have in the Drizzle community :)</p>
<p>P.S. My two cents is that it should respect alignment since folks that seek performance should have most of their data in memory. Respecting alignment in this environment should make some difference. Of course, I can only say this for sure after benchmarking it.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/03/date-type-and-drizzle/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Progress on BlitzDB&#8217;s Index Component</title>
		<link>http://torum.net/2010/02/progress-on-blitzdb-index-component/</link>
		<comments>http://torum.net/2010/02/progress-on-blitzdb-index-component/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 03:55:07 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[index]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2335</guid>
		<description><![CDATA[I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time on BlitzDB for the last couple of weeks has been spent studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls [...]]]></description>
			<content:encoded><![CDATA[<p>I recently gained some decent momentum on developing the indexing component of BlitzDB. Most of my time on BlitzDB for the last couple of weeks has been spent studying the indexing API and digging into how other engines have implemented it. I even referred back to MySQL 4.x to see how the BDB engine pulls off the Indexing API.</p>
<p>The actual coding wasn&#8217;t too bad thanks to Tokyo Cabinet&#8217;s awesome B+Tree API. I&#8217;ve been busier adding new tests and fixing silly bugs as they arise. I also implemented the <a href="http://torum.net/2010/01/further-thoughts-on-blitzdb-index/">Primary Key optimization</a> that I blogged about a while back. As a result of all this, the following goodness has been added to <a href="http://bazaar.launchpad.net/~tmaesaka/blitzdb/trunk/changes">BlitzDB&#8217;s Trunk</a>.</p>
<ul>
<li>Index Lookup</li>
<li>Forward Index Scan</li>
<li>Reverse Index Scan</li>
</ul>
<p>This means that BlitzDB is now equipped with both a Table Scanner and an Index Scanner which are two essential components for a general purpose storage engine. As much as I&#8217;d like to work on optimizing the code and adding features (like recovery), I&#8217;m going to take a break and spend the rest of the month working on testing and debugging. There&#8217;s no point in adding features if the base has notable flaws in it.</p>
<h3>Challenges Encountered</h3>
<p>Writing the Index Scanner itself is easy. The most difficult thing that slowed me down was developing the comparison function for index keys. The end result was a simple piece of code but I had to study various things before I could start writing any code.</p>
<ul>
<li>How to respect collation</li>
<li>How keys are represented internally</li>
<li>How types are represented internally</li>
<li>How to write a custom comparison function for Tokyo Cabinet</li>
<li>&#8230; and so on</li>
</ul>
<p>I&#8217;ve also started using <a href="http://www.evernote.com">Evernote</a> to jot down my spontaneous ideas on optimizing BlitzDB. I&#8217;ve made these notes public and they will most likely be updated while I&#8217;m commuting on the train.</p>
<ul>
<li><a href="http://www.evernote.com/pub/tmaesaka/blitzdb">http://www.evernote.com/pub/tmaesaka/blitzdb/</a></li>
</ul>
<p>There is much more that I&#8217;d like to write about, like how I intend to develop the table recovery routine without simply using <a href="http://torum.net/2010/01/how-to-recover-a-tokyo-cabinet-database-file/">TC&#8217;s recovery mechanism</a>, but I shall save that for another day.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/02/progress-on-blitzdb-index-component/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Further thoughts on BlitzDB&#8217;s Index Handling</title>
		<link>http://torum.net/2010/01/further-thoughts-on-blitzdb-index/</link>
		<comments>http://torum.net/2010/01/further-thoughts-on-blitzdb-index/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 09:05:30 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[index]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2328</guid>
		<description><![CDATA[I&#8217;ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I&#8217;ve been getting with BlitzDB&#8217;s index design. I&#8217;m actually so frustrated with myself at the moment that I want to hit my head against a wall or something. So, [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been thinking quite a bit about collation handling in BlitzDB for the last couple of days. The more I think about it, the more stuck I&#8217;ve been getting with BlitzDB&#8217;s index design. I&#8217;m actually so frustrated with myself at the moment that I want to hit my head against a wall or something.</p>
<p>So, I&#8217;m writing this entry to clear up my mind. Heh, my blog is slowly becoming BlitzDB&#8217;s design document draft. This should hopefully be good though since by blogging it, people can tell me whether I&#8217;m moving in a stupid direction or not.</p>
<h3>Collation Importance</h3>
<p>When writing database software that is intended for International use, it is important to handle textual data by respecting collation order. It is arguable that most people are only interested in English lexicographic ordering but unfortunately the world is not so standard.</p>
<h3>Internal Primary Key Handling</h3>
<p>I want to motivate people to actively define a PRIMARY KEY with BlitzDB. I plan to make this attractive by providing the best performance when PK is defined. In <a href="http://torum.net/2009/12/end-of-year-progress-on-blitzdb/">December 2009</a>, my answer to this was to write the PK value as the key for the data dictionary (where actual rows are stored in BlitzDB). This allows BlitzDB to do a direct lookup on the data dictionary for PK based lookup, instead of consulting the B+Tree index. I&#8217;m still fond of this lookup optimization approach but it introduces problems too. </p>
<p><strong>Problem 1.</strong> Consider the following textual keys: &#8220;key&#8221; and &#8220;KEY&#8221;. They obviously have different binary representations but under certain collations (a case-insensitive one, for example) they can be logically equivalent. Because the data dictionary is a hash database that matches keys byte for byte, this is a problem. The solution that instantly pops up is to normalize the key before writing or reading it. This, however, causes a problem in cases where the two keys are not equivalent. Perhaps Drizzle/MySQL provides an internal normalization function that respects this. I still need to study this area of the storage subsystem.</p>
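<p>As a rough illustration of the normalization idea, the sketch below folds a key to lowercase before it would be handed to the hash database. It is deliberately naive: ASCII case folding stands in for a real collation-aware routine, and <code>normalize_key_ascii</code> is a hypothetical helper, not something Drizzle or BlitzDB provides.</p>

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Fold an ASCII key to lowercase so that "key" and "KEY" map to the
 * same bytes. A real engine would ask the column's collation instead.
 * Returns the number of bytes written, or 0 if `dst` is too small. */
size_t normalize_key_ascii(const char *src, size_t len,
                           char *dst, size_t cap) {
  assert(src != NULL && dst != NULL);
  if (cap < len) return 0;
  for (size_t i = 0; i < len; i++)
    dst[i] = (char)tolower((unsigned char)src[i]);
  return len;
}
```

<p>Applying this unconditionally is exactly what breaks case-sensitive collations, where the two keys must stay distinct, so the choice of normalizer would have to follow the column&#8217;s collation.</p>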
<p><strong>Problem 2.</strong> Directly writing a PK to the data dictionary means fast lookup but because of the data structure, it&#8217;s not possible to fetch the next &#8220;logical&#8221; key, meaning I can&#8217;t implement index scanning on PK as it is. Quick solution for users is to create an index on the PK column (this would create a separate B+Tree for it) but this is not so friendly because it requires the user to have prior knowledge of all this. So, my plan is to provide the best of both worlds. I&#8217;ll elaborate on how I&#8217;m planning on tackling this problem next.</p>
<h3>Current Primary Key Read/Write Behavior</h3>
<p>In general, each key in BlitzDB&#8217;s data dictionary is a unique 8 byte integer. The idea is that BlitzDB writes this unique ID along with the key to the B+Tree Index so that it can later identify that row. The difference with PK is that, if a PK is present in a table, BlitzDB does not generate an internal unique ID and uses the PK as the data dictionary&#8217;s key instead. At the time of writing this blog entry, BlitzDB doesn&#8217;t create a B+Tree index for the PK.</p>
<h3>Next Step</h3>
<p>Create a B+Tree index for PK anyway. BlitzDB will still use the PK value as the key for data dictionary if it exists. For PK based lookup requests, BlitzDB will look directly at the data dictionary and for PK based requests that involve index scanning, BlitzDB will look at the B+Tree index.</p>
<p>This approach can consume more space when textual data is used for keys but I think it&#8217;s worth it. At the same time, you can <strong>save space</strong> if you use types that are smaller than 8 bytes for PK. For example, using a 4 byte integer would reduce BlitzDB&#8217;s key space by 50%.</p>
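<p>The two read paths can be sketched with a toy table: a small open-addressing hash array stands in for the data dictionary and a sorted array stands in for the B+Tree on the PK. All names and the fixed capacity are hypothetical, and nothing here reflects BlitzDB&#8217;s real layout.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSLOTS 8 /* toy capacity, must be a power of two */

typedef struct { int used; int32_t pk; const char *row; } slot_t;

typedef struct {
  slot_t slots[NSLOTS];   /* stands in for the hash data dictionary */
  int32_t sorted[NSLOTS]; /* stands in for the B+Tree on the PK */
  size_t count;
} table_t;

static size_t slot_of(int32_t pk) { return (uint32_t)pk & (NSLOTS - 1); }

/* Write the row under its PK and also record the PK in the ordered
 * index. Returns 0 on success, -1 on a duplicate key or a full table. */
int table_insert(table_t *t, int32_t pk, const char *row) {
  assert(t != NULL);
  if (t->count == NSLOTS) return -1;
  size_t i = slot_of(pk);
  while (t->slots[i].used) {
    if (t->slots[i].pk == pk) return -1;
    i = (i + 1) & (NSLOTS - 1);
  }
  t->slots[i] = (slot_t){1, pk, row};
  size_t j = t->count++;
  while (j > 0 && t->sorted[j - 1] > pk) {
    t->sorted[j] = t->sorted[j - 1];
    j--;
  }
  t->sorted[j] = pk;
  return 0;
}

/* Point lookup: goes straight to the hash structure, no index involved. */
const char *table_get(const table_t *t, int32_t pk) {
  size_t i = slot_of(pk);
  for (size_t n = 0; n < NSLOTS && t->slots[i].used; n++) {
    if (t->slots[i].pk == pk) return t->slots[i].row;
    i = (i + 1) & (NSLOTS - 1);
  }
  return NULL;
}

/* Scan step: the next PK after `pk` can only come from the ordered
 * index. Returns 0 and sets *next, or -1 when there is no larger key. */
int table_next_pk(const table_t *t, int32_t pk, int32_t *next) {
  for (size_t i = 0; i < t->count; i++) {
    if (t->sorted[i] > pk) { *next = t->sorted[i]; return 0; }
  }
  return -1;
}
```

<p>The hash side answers &#8220;give me this key&#8221; without touching the index, while &#8220;give me the key after this one&#8221; is only answerable from the ordered side, which is the reason for keeping both.</p>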
<p>Hmm, I think my mind has cleared a little.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/01/further-thoughts-on-blitzdb-index/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Drizzle, BlitzDB and HTON_STATS_RECORDS_IS_EXACT</title>
		<link>http://torum.net/2010/01/blitzdb-and-record-counting/</link>
		<comments>http://torum.net/2010/01/blitzdb-and-record-counting/#comments</comments>
		<pubDate>Wed, 13 Jan 2010 13:23:34 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[engine]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2326</guid>
		<description><![CDATA[Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So [...]]]></description>
			<content:encoded><![CDATA[<p>Recently I enabled HTON_STATS_RECORDS_IS_EXACT in BlitzDB to let the optimizer know that BlitzDB can instantaneously return the number of rows in a specified table. As a result, the Drizzle kernel can directly call the Cursor::info() function to get the row count. To users, it means that SELECT COUNT statements can be executed in O(1). So it&#8217;s a great thing in general.</p>
<h3>Something Broke</h3>
<p>After I enabled HTON_STATS_RECORDS_IS_EXACT, I noticed that issuing a SELECT statement on a table with 1 row would no longer return a resultset. Weird indeed! After investigating with GDB, I noticed that rnd_next() is only called once instead of twice on a table with 1 row (second time is to find EOF) when HTON_STATS_RECORDS_IS_EXACT is enabled. This makes sense because the kernel knows that there is only 1 row and therefore it doesn&#8217;t need to keep scanning for EOF. However, this made me scratch my head since this shouldn&#8217;t break BlitzDB&#8217;s table scanner.</p>
<h3>Remedy</h3>
<p>Logically, I was confident that BlitzDB&#8217;s table scanner was functioning properly so I decided to look at what was going on beyond the engine API. Turns out that join_read_system() in sql_select.cc looks at the table->status value and decides that it&#8217;s an error if 0 isn&#8217;t assigned to it. What do you know? I realized that I wasn&#8217;t assigning anything to the status variable. It&#8217;s more that I didn&#8217;t know that I was meant to update an internal structure. You&#8217;d think that engine developers aren&#8217;t meant to touch those. It&#8217;s not mentioned in the <a href="http://forge.mysql.com/wiki/MySQL_Internals_Custom_Engine">Engine Documentation</a> at MySQL Forge either. Nevertheless, the important thing is that it works now. Oh and SELECT COUNT is fast now too. </p>
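<p>The shape of the fix can be shown with a toy scanner. This is only a sketch of the contract described above: the struct names, <code>scan_next</code> and both constants are hypothetical stand-ins, not Drizzle&#8217;s Cursor API or its real error codes.</p>

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_END_OF_FILE 137 /* hypothetical "no more rows" code */
#define SKETCH_NOT_FOUND 2     /* hypothetical non-zero status value */

/* Stand-in for the kernel-side table object that owns `status`. */
typedef struct { int status; } table_state_t;

typedef struct {
  const int *rows;
  size_t nrows;
  size_t pos;
  table_state_t *table;
} scanner_t;

/* Return the next row. The caller inspects table->status afterwards,
 * so it has to be updated on every call, not just on failure. */
int scan_next(scanner_t *s, int *out) {
  assert(s != NULL && s->table != NULL && out != NULL);
  if (s->pos >= s->nrows) {
    s->table->status = SKETCH_NOT_FOUND;
    return SKETCH_END_OF_FILE;
  }
  *out = s->rows[s->pos++];
  s->table->status = 0; /* the assignment that was missing */
  return 0;
}
```

<p>With the success path leaving <code>status</code> untouched, a caller that reads exactly one row and then checks the field would see a stale non-zero value, which matches the symptom of the single-row SELECT returning nothing.</p>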
<h3>Eye Opener</h3>
<p>This experience, among other occasions where I had to read the kernel&#8217;s source, made me think that it would be nice to provide comprehensive, up-to-date documentation on how to develop storage engines for Drizzle in the future (when the API becomes stable). Needless to say, this would be co-ordinated within the Drizzle community. I&#8217;m not a license person but it should hopefully be published under a free license too.</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2010/01/blitzdb-and-record-counting/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>End of Year Progress on BlitzDB</title>
		<link>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/</link>
		<comments>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/#comments</comments>
		<pubDate>Thu, 24 Dec 2009 09:47:06 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[storage]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2318</guid>
		<description><![CDATA[FURTHER UPDATE: Further thoughts on BlitzDB’s Index Handling My open source friends might have noticed that I&#8217;ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn&#8217;t look like I can reach that goal. However, looking at the brightside [...]]]></description>
			<content:encoded><![CDATA[<p><strong>FURTHER UPDATE:</strong> <a href="http://torum.net/2010/01/further-thoughts-on-blitzdb-index/">Further thoughts on BlitzDB’s Index Handling</a></p>
<p>My open source friends might have noticed that I&#8217;ve been working quite a bit on BlitzDB lately. To tell the truth, I had a hidden goal to get Version-1 done by Christmas. Unfortunately it doesn&#8217;t look like I can reach that goal. However, looking on the bright side, I got a lot done in the past few weeks so allow me to &#8220;journal&#8221; it in this blog post.</p>
<h3>Agony of Knowing</h3>
<p>The more I understood Drizzle&#8217;s storage mechanism and Tokyo Cabinet&#8217;s internals, the more I disliked what I previously had. This led me to spend quite a bit of time rewriting BlitzDB&#8217;s codebase. I was using pthread&#8217;s rwlock for concurrency control but I decided to design and write <a href="http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/">BlitzDB&#8217;s own lock mechanism</a> to get the best out of TC (in terms of concurrency). I also rewrote the entire table scan code, which is something you&#8217;d hope won&#8217;t be executed that often (people should use indexes!), but needless to say, it&#8217;s an important component of a relational storage engine so I&#8217;ve put in a lot of effort there.</p>
<h3>Rewriting the Table Scanner</h3>
<p>In the process of rewriting the table scanner, Jay Pipes gave me some fantastic advice on using Drizzle&#8217;s internal atomic type (drizzled::atomics). He gave me this advice because he noticed that my atomic ID generator was securing atomicity with pthread&#8217;s mutex. One could argue that the mutex was only held for a few CPU instructions, but the philosophy of using the most efficient method on the platform where BlitzDB is to be run was appealing enough for me to use drizzled::atomics. Mikio did some experiments on this and found that in a competitive/congested environment, using the compiler&#8217;s builtin function can gain you <a href="http://1978th.net/tech/promenade.cgi?id=68">3x throughput</a>.</p>
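<p>For readers who want to see the difference, here is a rough sketch of the two approaches. It uses the standard std::atomic rather than drizzled::atomics, so treat it as an illustration of the idea and not as BlitzDB&#8217;s code. The class names and the hammer() helper are made up for this example.</p>

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <thread>
#include <vector>

// Mutex-guarded generator: every caller takes the lock for a
// handful of instructions.
class MutexIdGenerator {
public:
  std::uint64_t next() {
    std::lock_guard<std::mutex> guard(mutex_);
    return ++counter_;
  }

private:
  std::mutex mutex_;
  std::uint64_t counter_ = 0;
};

// Lock-free generator: one atomic read-modify-write per ID.
class AtomicIdGenerator {
public:
  std::uint64_t next() {
    // fetch_add returns the previous value, so add 1 to start at 1
    // like the mutex version does.
    return counter_.fetch_add(1, std::memory_order_relaxed) + 1;
  }

private:
  std::atomic<std::uint64_t> counter_{0};
};

// Hypothetical helper: calls next() from several threads and returns
// how many IDs those threads consumed in total.
template <typename Generator>
std::uint64_t hammer(Generator &gen, int threads, int ids_per_thread) {
  assert(threads > 0 && ids_per_thread > 0);
  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&gen, ids_per_thread] {
      for (int i = 0; i < ids_per_thread; ++i) gen.next();
    });
  }
  for (auto &worker : pool) worker.join();
  return gen.next() - 1;  // one extra ID is drawn here to read the count
}
```

<p>Both generators hand out every ID exactly once. The difference only shows up under contention, where the mutex version makes threads queue up for the lock.</p>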
<h3>Hacking on Index Support</h3>
<p>I&#8217;ve finally started hacking on index support and I just finished supporting basic operations on a primary key. By design, BlitzDB&#8217;s index is a dense <a href="http://en.wikipedia.org/wiki/Index_(database)#Clustered">clustered</a> b+tree but in the first release I am going to limit PK to only be a HASH index. This is because I want BlitzDB to treat all PKs as direct keys inside the data dictionary (hash database where the actual rows are stored). So in other words, I want people to use PK for &#8220;needle in a haystack&#8221; like queries only. An example of a needle in a haystack like query is:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> <span style="color: #993333; font-weight: bold;">TABLE</span> <span style="color: #993333; font-weight: bold;">WHERE</span> primary_key_column <span style="color: #66cc66;">=</span> whatever;</pre></div></div>

<p>Saying that, I don&#8217;t like to force people to do things the way I like so I plan on providing the best of both worlds by supporting both data structures for PKs in Version-2:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int<span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">USING</span> btree<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;
<span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int<span style="color: #66cc66;">,</span> <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">&#40;</span>id<span style="color: #66cc66;">&#41;</span> <span style="color: #993333; font-weight: bold;">USING</span> hash<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;</pre></div></div>

<p>BlitzDB&#8217;s default configuration will use PK as a &#8220;direct&#8221; data dictionary index. If you wish to do range queries on PK, the solution is to create an index on the PK column.</p>
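<p>The trade-off between the two structures can be sketched with standard containers. This is only an analogy: std::unordered_map stands in for the hash-based data dictionary and std::set (a balanced tree, not a b+tree) stands in for an ordered index on the PK column. HashDictionary and OrderedKeyIndex are hypothetical names, not BlitzDB classes.</p>

```cpp
#include <cassert>
#include <set>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using Row = std::string;

// Stand-in for the data dictionary: the primary key is the hash key,
// so "WHERE pk = ?" is a single direct lookup.
class HashDictionary {
public:
  void put(int pk, Row row) { rows_[pk] = std::move(row); }

  // Returns nullptr when no row has this primary key.
  const Row *get(int pk) const {
    auto it = rows_.find(pk);
    return it == rows_.end() ? nullptr : &it->second;
  }

private:
  std::unordered_map<int, Row> rows_;
};

// Stand-in for an ordered index on the PK column: keys are kept
// sorted, which is what makes range queries possible.
class OrderedKeyIndex {
public:
  void add(int pk) { keys_.insert(pk); }

  // Returns every key in the inclusive range [lo, hi], in order.
  std::vector<int> range(int lo, int hi) const {
    assert(lo <= hi);
    return std::vector<int>(keys_.lower_bound(lo), keys_.upper_bound(hi));
  }

private:
  std::set<int> keys_;
};
```

<p>A hash gives no ordering between keys, so a range query over it would have to visit every row. That is why the ordered structure is the one to reach for when you need ranges.</p>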
<h3>Primary Key lookup Performance</h3>
<p>So, how does my implementation perform? Here&#8217;s a quick benchmark with a test-run that randomly fetches 100 thousand rows from a BlitzDB table with 1 million rows. This is the table I used:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">CREATE</span> <span style="color: #993333; font-weight: bold;">TABLE</span> t1 <span style="color: #66cc66;">&#40;</span>id int <span style="color: #993333; font-weight: bold;">PRIMARY</span> <span style="color: #993333; font-weight: bold;">KEY</span><span style="color: #66cc66;">,</span> a int<span style="color: #66cc66;">,</span> b int<span style="color: #66cc66;">&#41;</span> ENGINE<span style="color: #66cc66;">=</span>blitzdb;</pre></div></div>

<p>and the query looks like this:</p>

<div class="wp_syntax"><div class="code"><pre class="sql" style="font-family:monospace;"><span style="color: #993333; font-weight: bold;">SELECT</span> <span style="color: #66cc66;">*</span> <span style="color: #993333; font-weight: bold;">FROM</span> t1 <span style="color: #993333; font-weight: bold;">WHERE</span> id <span style="color: #66cc66;">=</span> random_number_under_one_million;</pre></div></div>

<p>The hardware I used is the following commodity server: Intel Quad Xeon E5345 (2x4MB L2 cache), 8GB Memory, 500GB SATA II. Unfortunately I could not prepare a standalone client server today so both the server and the test program were run on the same machine. Yeah&#8230; this sucks so I can&#8217;t claim that this benchmark is 100% credible.</p>
<p>Here is the result I obtained from <a href="http://code.google.com/p/skyload">skyload</a>. Please only view it as a guideline to BlitzDB&#8217;s lookup performance. I&#8217;ll do a proper benchmark with the Drizzle Community and publish it after I get Version-1 released.</p>

<div class="wp_syntax"><div class="code"><pre class="null" style="font-family:monospace;">[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 1
  Task Completion Time   : 5.88856 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
&nbsp;
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 2
  Task Completion Time   : 6.94474 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1
&nbsp;
[ READ LOAD EMULATION RESULT ]
  SQL File               : 100k_select.sql
  Concurrent Connections : 4
  Task Completion Time   : 7.04455 secs
  Number of Queries:     : 100000
  Number of Test Runs:   : 1</pre></div></div>

<p>As you can see, &#8220;needle in a haystack&#8221; queries can be executed pretty efficiently in BlitzDB. Looking at the first result, we can observe that it took an average of roughly 0.059 milliseconds to process a query.</p>
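<p>The per-query figure is just the completion time divided by the number of queries. Here is that arithmetic as a tiny C++ sketch, using the numbers from the single-connection run above. The function names are made up for this example.</p>

```cpp
#include <cassert>

// Average time spent per query, in milliseconds.
double average_latency_ms(double total_seconds, long queries) {
  assert(queries > 0);
  return total_seconds / static_cast<double>(queries) * 1000.0;
}

// Throughput in queries per second.
double queries_per_second(double total_seconds, long queries) {
  assert(total_seconds > 0.0);
  return static_cast<double>(queries) / total_seconds;
}
```

<p>For the first run, average_latency_ms(5.88856, 100000) comes to about 0.059 ms, which is close to 17,000 queries per second on a single connection.</p>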
<h3>Future Plans</h3>
<p>Admittedly, primary key support isn&#8217;t completely done so I&#8217;ll continue working on it. After that, I will start hacking on b+tree indexes and write more tests as I go. Once I support at least two indexes, I&#8217;ll ask the Drizzle Community to consider merging BlitzDB into Drizzle&#8217;s trunk. This is my goal for BlitzDB at the moment.</p>
<p>I also happen to own blitzdb.com so I&#8217;m planning on putting user documentation (including a tutorial) and architectural notes there. This is currently not so high on my TODO list so I suspect it won&#8217;t happen until I get Version-1 released. All I can say about the release schedule at the moment is, &#8220;before the MySQL conference in April&#8221;.</p>
<p>So, that&#8217;s all I have to summarize for now. Thanks for reading this far. Merry Christmas and have a Happy New Year. Don&#8217;t trip on ice :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/12/end-of-year-progress-on-blitzdb/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>BlitzLock and RWLOCK Comparison</title>
		<link>http://torum.net/2009/11/blitzlock-and-rwlock-comparison/</link>
		<comments>http://torum.net/2009/11/blitzlock-and-rwlock-comparison/#comments</comments>
		<pubDate>Tue, 24 Nov 2009 12:47:11 +0000</pubDate>
		<dc:creator>Toru Maesaka</dc:creator>
				<category><![CDATA[drizzle]]></category>
		<category><![CDATA[oss]]></category>
		<category><![CDATA[blitzdb]]></category>
		<category><![CDATA[concurrency]]></category>
		<category><![CDATA[locking]]></category>
		<category><![CDATA[parallelism]]></category>
		<category><![CDATA[pthread]]></category>

		<guid isPermaLink="false">http://torum.net/?p=2308</guid>
		<description><![CDATA[As pointed out by Jay Pipes, I thought it would be nice to test and publish how BlitzLock performs against what I was originally intending on using for BlitzDB (pthread&#8217;s rwlock). So, I asked my colleagues in the operations group at Mixi to test BlitzLock on nice hardware that I don&#8217;t have access to. They [...]]]></description>
			<content:encoded><![CDATA[<p>As suggested by <a href="http://www.jpipes.com/">Jay Pipes</a>, I thought it would be nice to test and publish how BlitzLock performs against what I was originally intending on using for BlitzDB (pthread&#8217;s rwlock). So, I asked my colleagues in the operations group at Mixi to test BlitzLock on nice hardware that I don&#8217;t have access to. They kindly accepted and ran the BlitzLock sandbox on a 16-core machine running Fedora.</p>
<p>If you haven&#8217;t read my <a href="http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/">previous entry on BlitzLock</a> and why I started writing it, you should. This entry won&#8217;t make sense otherwise.</p>
<h3>Disclaimer</h3>
<p>Before I go any further, please remember that I&#8217;m not trying to say BlitzLock is better than pthread&#8217;s rwlock. My interest is to write a lock mechanism that is optimized for Tokyo Cabinet (TC). What I wanted from this test was to see whether BlitzLock has enough potential for me to keep working on it.</p>
<h3>Method</h3>
<p>There were three kinds of workloads: &#8220;Read Oriented&#8221;, &#8220;Write Oriented&#8221; and &#8220;Neutral&#8221;. The Read Oriented test has a 70% probability that each thread will call a read routine, whereas Write Oriented is the opposite, where there is a 70% chance that the table state will be changed. In the Neutral test, read and update calls have an even chance of being called. The seed value for the random number generator was identical for all tests.</p>
<p>Each worker sleeps for 10 milliseconds in the critical section and another 10 milliseconds right after it releases the lock. This was done to help cause context switching. Each test ran for 60 seconds.</p>
<p>You can obtain the standalone BlitzLock sandbox <a href="http://torum.net/code/cc/blitzlock.cc">from here</a>. I&#8217;ll upload a test friendly version that can accept startup options soon (I _really_ need to tidy it up).</p>
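<p>For a feel of what such a workload looks like, here is a small sketch of the method described above, run against pthread&#8217;s rwlock. This is not the sandbox linked above. The run_workload() function, its parameters and the per-thread seeding scheme are assumptions made for this example.</p>

```cpp
#include <pthread.h>

#include <atomic>
#include <cassert>
#include <chrono>
#include <random>
#include <thread>
#include <vector>

struct WorkloadResult {
  long reads = 0;
  long updates = 0;
};

// Runs `threads` workers against one pthread rwlock until `duration`
// has passed. read_percent = 70 models "Read Oriented", 30 models
// "Write Oriented" and 50 models "Neutral".
WorkloadResult run_workload(int read_percent, int threads,
                            std::chrono::milliseconds duration,
                            unsigned seed) {
  assert(read_percent >= 0 && read_percent <= 100 && threads > 0);
  pthread_rwlock_t lock;
  pthread_rwlock_init(&lock, nullptr);
  std::atomic<long> reads{0}, updates{0};
  const auto deadline = std::chrono::steady_clock::now() + duration;
  const auto nap = std::chrono::milliseconds(10);

  std::vector<std::thread> pool;
  for (int t = 0; t < threads; ++t) {
    pool.emplace_back([&, t] {
      // Assumption: each thread derives its stream from the shared
      // seed so that a run can be repeated with the same choices.
      std::mt19937 rng(seed + t);
      std::uniform_int_distribution<int> percent(0, 99);
      while (std::chrono::steady_clock::now() < deadline) {
        const bool is_read = percent(rng) < read_percent;
        if (is_read) {
          pthread_rwlock_rdlock(&lock);  // shared with other readers
        } else {
          pthread_rwlock_wrlock(&lock);  // exclusive
        }
        std::this_thread::sleep_for(nap);  // 10 ms inside the critical section
        (is_read ? reads : updates).fetch_add(1);
        pthread_rwlock_unlock(&lock);
        std::this_thread::sleep_for(nap);  // 10 ms after releasing the lock
      }
    });
  }
  for (auto &worker : pool) worker.join();
  pthread_rwlock_destroy(&lock);

  WorkloadResult result;
  result.reads = reads.load();
  result.updates = updates.load();
  return result;
}
```

<p>Calling run_workload(70, 16, std::chrono::seconds(60), seed) would roughly correspond to the Read Oriented test on the 16 core machine. The counts depend entirely on the scheduler and the hardware, so they are not comparable to the graphs in this post.</p>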
<h3>Results</h3>
<p>Below is the result from a load emulation where there were significantly more read calls than updates.</p>
<p align="center"><a href="http://www.flickr.com/photos/tmaesaka/4130661036/" title="BlitzLock Benchmark (1) by tmaesaka, on Flickr"><img src="http://farm3.static.flickr.com/2729/4130661036_c0c5965bfb.jpg" width="500" height="306" alt="BlitzLock Benchmark (1)" /></a></p>
<p>As seen above, BlitzLock is nicely scaling the workload without exhausting update threads. This is important since one of the concerns involved in the current implementation of BlitzLock is starvation (covered later). I think the read/write ratio above is the sort of ratio that is typically seen in the web industry and something I&#8217;m mostly concerned with. So how about a write intensive application? The next graph shows the result when there are significantly more update operations than reads.</p>
<p align="center"><a href="http://www.flickr.com/photos/tmaesaka/4130695000/" title="BlitzLock Benchmark (2) by tmaesaka, on Flickr"><img src="http://farm3.static.flickr.com/2605/4130695000_20303186f3.jpg" width="500" height="302" alt="BlitzLock Benchmark (2)" /></a></p>
<p>As seen above, BlitzLock is nicely scaling update tasks without neglecting readers. Compared to the first graph, we&#8217;re seeing an opposite result between update and scanner threads which is expected due to the <a href="http://torum.net/2009/11/blitzdb-and-tc-concurrency-model/">nature of BlitzLock</a>. This is exactly what I was hoping to see. The next graph shows the result when read and update operations have an even chance of occurring.</p>
<p align="center"><a href="http://www.flickr.com/photos/tmaesaka/4130023819/" title="BlitzLock Benchmark (3) by tmaesaka, on Flickr"><img src="http://farm3.static.flickr.com/2551/4130023819_19f6e8eba8.jpg" width="500" height="303" alt="BlitzLock Benchmark (3)" /></a></p>
<p>As seen above, the throughput evens out for both read and update operations. I was expecting pthread&#8217;s rwlock to show noticeably lower update throughput than read (since it&#8217;s a single writer lock) but it turned out to even out. I&#8217;m not quite sure how I should interpret this but I guess the writer&#8217;s lock had a greater priority than reader&#8217;s lock in the environment that the test was run in. Nevertheless, this &#8220;even out&#8221; characteristic is something I&#8217;d like to welcome.</p>
<h3>From Here and Weaknesses</h3>
<p>I&#8217;m convinced that I should keep working on BlitzLock and use it as the default locking mechanism for BlitzDB. Ideally I should code BlitzDB to be able to switch between various locking mechanisms. This would make my life much easier when someone decides to write a locking mechanism that is better than BlitzLock for my use-case.</p>
<p>Thanks to <a href="http://pbxt.blogspot.com/">Paul McCullagh</a>&#8217;s feedback, I&#8217;ve come to realize that BlitzLock was broadcasting more often than it needs to. Functionally it still works but I should be able to save CPU usage by applying Paul&#8217;s feedback (thanks Paul!). There is also the potential lock starvation problem (when certain types of threads hog the lock) that I need to further investigate. If it&#8217;s going to cause noticeable issues, I&#8217;ll have to add a condition to BlitzLock saying &#8220;certain number of threads can obtain a certain lock at once&#8221;.</p>
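<p>One possible shape for such a condition is a gate that admits at most N holders at a time. The sketch below is not BlitzLock code and not a decided design. BoundedGate is a hypothetical class that only shows the idea. It wakes a single waiter per release instead of broadcasting, since each release frees exactly one slot.</p>

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical admission gate: at most `limit` threads may be
// inside at any one time.
class BoundedGate {
public:
  explicit BoundedGate(int limit) : limit_(limit) { assert(limit > 0); }

  // Blocks until a slot is free, then takes it.
  void enter() {
    std::unique_lock<std::mutex> guard(mutex_);
    slot_free_.wait(guard, [this] { return active_ < limit_; });
    ++active_;
  }

  // Non-blocking variant: takes a slot only if one is free right now.
  bool try_enter() {
    std::lock_guard<std::mutex> guard(mutex_);
    if (active_ >= limit_) return false;
    ++active_;
    return true;
  }

  // Gives the slot back and wakes one waiter, if any.
  void leave() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      assert(active_ > 0);
      --active_;
    }
    slot_free_.notify_one();
  }

  // Number of threads currently inside the gate.
  int active() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return active_;
  }

private:
  mutable std::mutex mutex_;
  std::condition_variable slot_free_;
  const int limit_;
  int active_ = 0;
};
```

<p>A cap like this bounds how many threads of one kind can hold the lock together. It does not by itself guarantee fairness between kinds, so the starvation question would still need testing.</p>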
<p>There is still another minor piece of scheduling logic that I need to throw into BlitzLock but once I get that done (along with testing), I can integrate BlitzLock into BlitzDB and see how it performs (I can then hack on indexing!).</p>
<p>Yep, there&#8217;s still quite a bit to do but I&#8217;m having fun :)</p>
]]></content:encoded>
			<wfw:commentRss>http://torum.net/2009/11/blitzlock-and-rwlock-comparison/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

